VDJML XML Schema

VDJML is a file format for storing the results of VDJ analysis of immune receptor (IR) sequence reads.

The VDJ analysis involves aligning each input sequence to V, D, and J germline gene segments, and, based on the alignments, identifying various regions of interest and other properties of the input sequence. Currently, every VDJ analysis software application produces output in a different format.

The purpose of VDJML is to provide a common format for different VDJ analysis applications and to facilitate downstream processing of the results in an application-agnostic manner.

The VDJ analysis is performed using either "raw" input sequence reads (e.g., as obtained from a sequencing instrument), "consensus" read sequences, or sequences obtained from other sources. In this document, the input sequences used for the analysis are referred to as read sequences or reads.

VDJML stores information about the software packages and about the germline sequence databases that were used for producing the analysis results. For each read sequence, VDJML stores the information about the aligned gene segments. For combinations of the aligned segments, VDJML stores regions of interest, e.g., FR1, CDR3.

VDJML was designed to have a relatively narrow scope. In general, data that already have commonly accepted file formats should be stored in separate files rather than in VDJML. Examples of information that should not be stored in VDJML:

read sequences and quality scores
germline sequences and annotations
sequencing metadata, e.g., sequencing instrument, pre-processing parameters
sample metadata, e.g., organism, tissue

All elements and attributes defined by this version of the VDJML schema have a namespace http://vdjserver.org/vdjml/xsd/1/ (prefix vdj:). The top element of the schema is vdj:vdjml.

The schema uses primitive datatypes defined by the W3C XML Schema. The primitive datatypes are defined in the namespace http://www.w3.org/2001/XMLSchema, prefix xs:, e.g., xs:decimal.

A VDJML file consists of two parts. General information about the analysis appears under the element vdj:meta and the result for each sequence read appears in a sequence of vdj:read elements under the vdj:read_results element.

The schema also allows some user-defined elements and attributes, which may appear under vdj:meta and vdj:read elements. User-defined elements and attributes should use namespaces other than vdj.
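The two-part layout described above can be sketched with Python's standard xml.etree.ElementTree module. This is only an illustration, not a schema-valid document: it assumes that vdj:meta and vdj:read_results are direct children of vdj:vdjml and that attributes are written without a prefix, and the helper names (q, make_skeleton) and read identifiers are made up for the example.

```python
import xml.etree.ElementTree as ET

VDJ_NS = "http://vdjserver.org/vdjml/xsd/1/"
# Ask ElementTree to serialize the namespace with the "vdj" prefix.
ET.register_namespace("vdj", VDJ_NS)


def q(name):
    """Return the namespace-qualified tag for a VDJML element name."""
    return f"{{{VDJ_NS}}}{name}"


def make_skeleton(read_ids):
    """Build the outer structure only: vdjml -> meta, read_results -> read.

    Assumption: attributes such as version and read_id are unprefixed.
    """
    root = ET.Element(q("vdjml"), {"version": "1.0"})
    ET.SubElement(root, q("meta"))
    results = ET.SubElement(root, q("read_results"))
    for read_id in read_ids:
        ET.SubElement(results, q("read"), {"read_id": read_id})
    return root


# Hypothetical read identifiers, for illustration only.
doc = make_skeleton(["read_1", "read_2"])
xml_text = ET.tostring(doc, encoding="unicode")
```

A real file would also need the content of vdj:meta and of each vdj:read, as described in the sections that follow.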

vdjml

Top element of a VDJML document.

| attribute | type | required | description |
|---|---|---|---|
| version | xs:decimal | ✓ | version of the VDJML format, e.g., 1.0 |

meta

Contains general information common to all analysis results. May also contain user-defined attributes and elements, which should use namespaces other than vdj.

generator

Describes the software that wrote the VDJML file

| attribute | type | required | description |
|---|---|---|---|
| name | xs:string | ✓ | The name of the VDJML generator, e.g., libVDJML |
| version | xs:string | ✓ | Version of the generator software, e.g., 1.42.0 |
| time_gmt | xs:dateTime | ✓ | Date and time of writing the file, GMT, in xs:dateTime format, e.g., 2014-07-24T14:47:24 |
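A value in the format shown for time_gmt can be produced with Python's datetime module. This is a minimal sketch; the function name format_time_gmt is made up for the example, and it assumes the caller passes a timezone-aware datetime (or nothing, in which case the current time is used).

```python
from datetime import datetime, timedelta, timezone


def format_time_gmt(moment=None):
    """Format a timezone-aware datetime as GMT, e.g. 2014-07-24T14:47:24."""
    if moment is None:
        moment = datetime.now(timezone.utc)
    # Convert to UTC first so that local offsets are not written to the file.
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")


# The example value from the attribute description.
example = format_time_gmt(datetime(2014, 7, 24, 14, 47, 24, tzinfo=timezone.utc))

# The same instant expressed with a -05:00 offset gives the same string.
shifted = format_time_gmt(
    datetime(2014, 7, 24, 9, 47, 24, tzinfo=timezone(timedelta(hours=-5)))
)
```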

aligner

Information about the aligner (VDJ analysis software), a program that generated all or some of the results in the VDJML document by either performing the VDJ analysis de novo (e.g., IgBlast or VQuest) or by building upon the results of another aligner.

| attribute | type | required | description |
|---|---|---|---|
| aligner_id | xs:positiveInteger | ✓ | A unique identifier for the IR analysis software |
| name | xs:string | ✓ | The name of the aligner |
| version | xs:string | | The version of the aligner |
| run_id | xs:integer | | A unique identifier for the analysis run |
| uri | xs:anyURI | | A URL from which the aligner or information about it can be obtained |

parameters

Parameters used for running the VDJ analysis.

germline_db

Information about a germline database used for the analysis.

| attribute | type | required | description |
|---|---|---|---|
| gl_db_id | xs:positiveInteger | ✓ | A unique identifier for the germline database |
| name | xs:string | ✓ | The database name |
| species | xs:string | ✓ | The germline database species |
| version | xs:string | ✓ | Germline database version |
| uri | xs:anyURI | | The URL where the database or information about it can be obtained |

read_results

Sequence of alignment results for individual sequence reads. It is recommended that the sequence read results appear in the same order as in the FASTA/Q files from which the alignment results were obtained.

read

Stores the analysis results of a single sequence read. May also contain user-defined elements and attributes. The user-defined elements and attributes should be in namespaces other than vdj.

| attribute | type | required | description |
|---|---|---|---|
| read_id | xs:string | ✓ | Unique identifier of the read sequence; should be same as the sequence ID from the original FASTA/Q file |
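The recommendation that read results follow the order of the original FASTA/Q file can be checked by streaming over the vdj:read elements. The sketch below uses ElementTree's iterparse so that large files need not be held in memory. The function names and sample data are made up for the example, and it assumes that the FASTA sequence ID is the first whitespace-delimited token of each header line.

```python
import io
import xml.etree.ElementTree as ET

VDJ_NS = "http://vdjserver.org/vdjml/xsd/1/"


def vdjml_read_ids(stream):
    """Yield read_id values in document order from a binary VDJML stream."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == f"{{{VDJ_NS}}}read":
            yield elem.get("read_id")
            elem.clear()  # drop the children of reads already processed


def fasta_ids(lines):
    """Yield sequence IDs from FASTA header lines (assumed: first token)."""
    for line in lines:
        if line.startswith(">"):
            yield line[1:].split()[0]


# Hypothetical sample data, for illustration only.
sample_vdjml = b"""<vdj:vdjml xmlns:vdj="http://vdjserver.org/vdjml/xsd/1/" version="1.0">
  <vdj:meta/>
  <vdj:read_results>
    <vdj:read read_id="seq_A"/>
    <vdj:read read_id="seq_B"/>
  </vdj:read_results>
</vdj:vdjml>"""
sample_fasta = [">seq_A sample read", "ACGT", ">seq_B", "TTGA"]

ids_in_vdjml = list(vdjml_read_ids(io.BytesIO(sample_vdjml)))
same_order = ids_in_vdjml == list(fasta_ids(sample_fasta))
```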

alignment

Immune receptor sequence alignment results for the sequence read.

segment_match

Information about germline gene segment(s) aligned to the read sequence

| attribute | type | required | description |
|---|---|---|---|
| segment_match_id | xs:positiveInteger | ✓ | A unique identifier for segment match to be used in other parts of the VDJML document |
| read_pos0 | xs:nonNegativeInteger | ✓ | Zero-based starting position of the match in the read sequence |
| read_len | xs:nonNegativeInteger | ✓ | Number of nucleotides in the read sequence for the matching region |
| gl_len | xs:nonNegativeInteger | ✓ | length of aligned germline segment |
| identity | vdj:Percent | | Percent of nucleotide sequence identity, e.g., 90%. |
| score | xs:integer | | Alignment score, as defined by the aligner software |
| insertions | xs:nonNegativeInteger | | number of nucleotide insertions in the read sequence aligned to the germline sequence |
| deletions | xs:nonNegativeInteger | | number of nucleotide deletions from the read sequence aligned to the germline sequence |
| substitutions | xs:nonNegativeInteger | | number of nucleotide substitutions |
| stop_codon | xs:boolean | | true if the stop codon is present |
| mutated_invariant | xs:boolean | | true if a codon for a conserved amino acid is mutated |
| inverted | xs:boolean | | true if the read sequence is a reverse-complement to germline gene segments |
| out_frame_indel | xs:boolean | | true if indel mutation resulted in a frame shift |
| out_frame_vdj | xs:boolean | | true if V(D)J recombination occurred out of frame |
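Because read_pos0 is zero-based and read_len is a length, the matched part of a read maps directly onto a Python slice. The sketch below shows this and the conversion to a one-based inclusive interval, which some tools report instead. The function names and the sample sequence are made up for the example.

```python
def matched_read_span(read_seq, read_pos0, read_len):
    """Return the part of the read covered by a segment match."""
    end = read_pos0 + read_len
    if read_pos0 < 0 or read_len < 0 or end > len(read_seq):
        raise ValueError("match lies outside the read sequence")
    return read_seq[read_pos0:end]


def one_based_interval(read_pos0, read_len):
    """Convert (read_pos0, read_len) to a one-based inclusive (first, last)."""
    return read_pos0 + 1, read_pos0 + read_len


# Hypothetical read; the match starts at index 2 and covers 4 nucleotides.
span = matched_read_span("AACCGGTT", 2, 4)
interval = one_based_interval(2, 4)
```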

btop

BTOP (BLAST traceback operations) string. Integer indicates the number of aligned nucleotides. A pair of letters indicates a nucleotide mismatch between read/germline sequences. Dash ('-') indicates a deletion. Example: 5AC-G35.
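The BTOP description above can be turned into a small tokenizer. This is a sketch under stated assumptions rather than a reference parser: it assumes that the first character of each pair belongs to the read and the second to the germline sequence (following the "read/germline" order in the description), and that a dash on either side marks a gap in that sequence. The function names are made up for the example.

```python
import re

# Either a run of digits (aligned nucleotides) or a two-character pair.
_TOKEN = re.compile(r"(\d+)|([A-Za-z-])([A-Za-z-])")


def parse_btop(btop):
    """Split a BTOP string into ("match", n) and (kind, read_nt, gl_nt) tuples."""
    ops = []
    pos = 0
    while pos < len(btop):
        m = _TOKEN.match(btop, pos)
        if m is None:
            raise ValueError(f"cannot parse BTOP at offset {pos}: {btop!r}")
        if m.group(1) is not None:
            ops.append(("match", int(m.group(1))))
        else:
            read_nt, gl_nt = m.group(2), m.group(3)
            kind = "gap" if "-" in (read_nt, gl_nt) else "mismatch"
            ops.append((kind, read_nt, gl_nt))
        pos = m.end()
    return ops


def aligned_lengths(ops):
    """Return (read_len, gl_len) covered by the alignment (assumed pair order)."""
    read_len = gl_len = 0
    for op in ops:
        if op[0] == "match":
            read_len += op[1]
            gl_len += op[1]
        else:
            read_len += op[1] != "-"  # a dash consumes no read nucleotide
            gl_len += op[2] != "-"    # a dash consumes no germline nucleotide
    return read_len, gl_len


ops = parse_btop("5AC-G35")
lengths = aligned_lengths(ops)
```

For the example string, the tokens are 5 aligned nucleotides, one A/C mismatch, one pair with a dash on the read side, and 35 aligned nucleotides.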

gl_seg_match

Information about aligned germline segment

| attribute | type | required | description |
|---|---|---|---|
| gl_seg_match_id | xs:positiveInteger | ✓ | A unique identifier for germline segment match to be used in other parts of the VDJML document |
| num_system | xs:string | ✓ | Numbering system name. |
| name | xs:string | ✓ | Name of the germline segment |
| gl_db_id | xs:positiveInteger | ✓ | A unique identifier referring to the germline database in which the segment was found |
| aligner_id | xs:positiveInteger | ✓ | A unique identifier referring to the aligner software that produced the alignment |
| type | vdj:Segment_type | ✓ | The type of the segment (V, D, or J) |
| gl_pos0 | xs:nonNegativeInteger | ✓ | Zero-based start of the alignment position in the germline sequence. |
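The gl_db_id and aligner_id attributes refer back to the germline_db and aligner entries, so a reader of the file typically builds lookup tables first. The sketch below illustrates this with ElementTree. The nesting in the sample document (aligner and germline_db under meta, gl_seg_match under segment_match under alignment) is an assumption, and every attribute value other than the aligner name IgBlast is a placeholder invented for the example.

```python
import xml.etree.ElementTree as ET

NS = {"vdj": "http://vdjserver.org/vdjml/xsd/1/"}

# Hypothetical document; element nesting and values are assumptions.
SAMPLE = """<vdj:vdjml xmlns:vdj="http://vdjserver.org/vdjml/xsd/1/" version="1.0">
  <vdj:meta>
    <vdj:aligner aligner_id="1" name="IgBlast"/>
    <vdj:germline_db gl_db_id="1" name="example_db" species="Homo sapiens" version="0"/>
  </vdj:meta>
  <vdj:read_results>
    <vdj:read read_id="seq_A">
      <vdj:alignment>
        <vdj:segment_match segment_match_id="1" read_pos0="0" read_len="290" gl_len="290">
          <vdj:gl_seg_match gl_seg_match_id="1" num_system="example" name="example_V_segment"
                            gl_db_id="1" aligner_id="1" type="V" gl_pos0="0"/>
        </vdj:segment_match>
      </vdj:alignment>
    </vdj:read>
  </vdj:read_results>
</vdj:vdjml>"""

root = ET.fromstring(SAMPLE)


def index_by(tree_root, element, id_attr):
    """Map the identifier attribute of each matching element to the element."""
    return {e.get(id_attr): e for e in tree_root.iterfind(f".//vdj:{element}", NS)}


aligners = index_by(root, "aligner", "aligner_id")
germline_dbs = index_by(root, "germline_db", "gl_db_id")


def describe(gl_seg_match):
    """Resolve the two references of a gl_seg_match into a readable summary."""
    aligner = aligners[gl_seg_match.get("aligner_id")]
    db = germline_dbs[gl_seg_match.get("gl_db_id")]
    return (
        f"{gl_seg_match.get('type')} segment {gl_seg_match.get('name')} "
        f"from {db.get('name')} ({db.get('species')}), "
        f"aligned by {aligner.get('name')}"
    )


summaries = [describe(m) for m in root.iterfind(".//vdj:gl_seg_match", NS)]
```

The search paths start with `.//`, so the lookups do not depend on the exact nesting assumed in the sample.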

aa_substitution

Information about an amino acid codon substitution

| attribute | type | required | description |
|---|---|---|---|
| read_pos0 | xs:nonNegativeInteger | ✓ | Zero-based position of the first nucleotide of the codon in the read sequence. |
| read_aa | vdj:Aminoacid | ✓ | amino acid encoded by the read sequence |
| gl_aa | vdj:Aminoacid | ✓ | amino acid encoded by the germline sequence |

combination

Annotations of the read sequence based on certain alignments with germline gene segments.

| attribute | type | required | description |
|---|---|---|---|
| segments | | ✓ | List of identifiers for the segment matches (segment_match_id-s) that serve as a basis for the annotations listed in this element. |
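Resolving the segments attribute back to the segment matches it names can be sketched as follows. The type of the attribute is not given above, so the code assumes a whitespace-separated list of segment_match_id values. It also assumes that segment_match and combination are siblings under the same alignment element. The function names and sample values are made up for the example.

```python
import xml.etree.ElementTree as ET

NS = {"vdj": "http://vdjserver.org/vdjml/xsd/1/"}

# Hypothetical fragment; nesting and list syntax are assumptions.
SAMPLE = """<vdj:alignment xmlns:vdj="http://vdjserver.org/vdjml/xsd/1/">
  <vdj:segment_match segment_match_id="1" read_pos0="0" read_len="290" gl_len="290"/>
  <vdj:segment_match segment_match_id="2" read_pos0="300" read_len="45" gl_len="45"/>
  <vdj:combination segments="1 2"/>
</vdj:alignment>"""


def parse_segments(value):
    """Split the segments attribute into integer segment_match_id values."""
    return [int(token) for token in value.split()]


def resolve_segments(alignment, combination):
    """Return the segment_match elements that a combination refers to."""
    by_id = {
        int(sm.get("segment_match_id")): sm
        for sm in alignment.iterfind("vdj:segment_match", NS)
    }
    return [by_id[i] for i in parse_segments(combination.get("segments"))]


alignment = ET.fromstring(SAMPLE)
combination = alignment.find("vdj:combination", NS)
resolved = resolve_segments(alignment, combination)
start_positions = [sm.get("read_pos0") for sm in resolved]
```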

region

Region of an immune receptor gene, e.g., an Ig domain or a segment junction. Note: the corresponding positions of the germline sequences are not stored because they may differ depending on the individual germline segment matches.

| attribute | type | required | description |
|---|---|---|---|
| name | xs:string | ✓ | Name of the region as provided by the alignment software, e.g., FR1 |
| num_system | xs:string | | Numbering system name. Attribute is not required because for some regions, e.g., for VD-junctions, numbering system is not applicable. |
| aligner_id | xs:positiveInteger | ✓ | Unique identifier (aligner_id) of the IR analysis software that produced the region annotation |
| read_pos0 | xs:nonNegativeInteger | ✓ | Zero-based index of the first read nucleotide that is mapped to the region. In case of an Ig domain, the corresponding position for each germline segment match may be determined based on segment match positions and BTOPs. |
| read_len | xs:nonNegativeInteger | ✓ | Length of the read sequence that corresponds to the region. |
| identity | vdj:Percent | | Percent of nucleotide sequence identity, e.g., 90%. |
| score | xs:integer | | Alignment score, as defined by the aligner software |
| insertions | xs:nonNegativeInteger | | number of nucleotide insertions in the read sequence aligned to the germline sequence |
| deletions | xs:nonNegativeInteger | | number of nucleotide deletions from the read sequence aligned to the germline sequence |
| substitutions | xs:nonNegativeInteger | | number of nucleotide substitutions |
| stop_codon | xs:boolean | | true if the stop codon is present |
| mutated_invariant | xs:boolean | | true if a codon for a conserved amino acid is mutated |
| inverted | xs:boolean | | true if the read sequence is a reverse-complement to germline gene segments |
| out_frame_indel | xs:boolean | | true if indel mutation resulted in a frame shift |
| out_frame_vdj | xs:boolean | | true if V(D)J recombination occurred out of frame |