VDJML XML Schema

VDJML is a file format for storing the results of VDJ analysis of immune receptor (IR) sequence reads.

The VDJ analysis involves aligning each input sequence to V, D, and J germline gene segments, and, based on the alignments, identifying various regions of interest and other properties of the input sequence. Currently, every VDJ analysis software application produces output in a different format.

The purpose of VDJML is to provide a common format for different VDJ analysis applications and to facilitate downstream processing of the results in an application-agnostic manner.

The VDJ analysis is performed using either "raw" input sequence reads (e.g., as obtained from a sequencing instrument), "consensus" read sequences, or sequences obtained from other sources. In this document, the input sequences used for the analysis are referred to as read sequences or reads.

VDJML stores information about the software packages and about the germline sequence databases that were used for producing the analysis results. For each read sequence, VDJML stores the information about the aligned gene segments. For combinations of the aligned segments, VDJML stores regions of interest, e.g., FR1, CDR3.

VDJML was designed to have a relatively narrow scope. In general, data that already have commonly accepted file formats should be stored in separate files and not in VDJML. Examples of information that should not be stored in VDJML:

All elements and attributes defined by this version of the VDJML schema have a namespace http://vdjserver.org/vdjml/xsd/1/ (prefix vdj:). The top element of the schema is vdj:vdjml.

The schema uses primitive datatypes defined by the W3C XML Schema. The primitive datatypes are defined in the namespace http://www.w3.org/2001/XMLSchema, prefix xs:, e.g., xs:decimal.

A VDJML file consists of two parts. General information about the analysis appears under the element vdj:meta and the result for each sequence read appears in a sequence of vdj:read elements under the vdj:read_results element.
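The two-part layout described above can be sketched in Python with the standard xml.etree.ElementTree module. The namespace URI and the element names come from this document; the sample attribute values, the unprefixed spelling of the attributes, and the helper name split_document are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Namespace URI and element names are taken from the text above.
VDJ_NS = "http://vdjserver.org/vdjml/xsd/1/"
NS = {"vdj": VDJ_NS}

# Made-up sample document; attributes are assumed to be written unprefixed.
SAMPLE = """\
<vdj:vdjml xmlns:vdj="http://vdjserver.org/vdjml/xsd/1/" version="1.0">
  <vdj:meta/>
  <vdj:read_results>
    <vdj:read read_id="read_1"/>
    <vdj:read read_id="read_2"/>
  </vdj:read_results>
</vdj:vdjml>
"""

def split_document(xml_text):
    """Return (meta element, list of read elements) from a VDJML string."""
    root = ET.fromstring(xml_text)
    if root.tag != f"{{{VDJ_NS}}}vdjml":
        raise ValueError(f"unexpected top element: {root.tag}")
    meta = root.find("vdj:meta", NS)
    reads = root.findall("vdj:read_results/vdj:read", NS)
    return meta, reads

meta, reads = split_document(SAMPLE)
read_ids = [r.get("read_id") for r in reads]
print(read_ids)  # ['read_1', 'read_2']
```

For large result files, a streaming parser such as ET.iterparse may be a better fit than loading the whole tree.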

The schema also allows some user-defined elements and attributes, which may appear under vdj:meta and vdj:read elements. User-defined elements and attributes should have namespaces other than vdj.
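One way a reader of the file might separate user-defined elements from schema-defined ones is by namespace, as in the sketch below. The namespace http://example.org/lab, the lab:sample element, and both helper functions are hypothetical and are not part of VDJML.

```python
import xml.etree.ElementTree as ET

VDJ_NS = "http://vdjserver.org/vdjml/xsd/1/"

# The "lab" namespace and its "sample" element are hypothetical examples
# of user-defined content.
SAMPLE_META = """\
<vdj:meta xmlns:vdj="http://vdjserver.org/vdjml/xsd/1/"
          xmlns:lab="http://example.org/lab">
  <vdj:generator name="libVDJML" version="1.42.0" time_gmt="2014-07-24T14:47:24"/>
  <lab:sample donor="D01"/>
</vdj:meta>
"""

def namespace_of(tag):
    """Return the namespace URI of an ElementTree tag, or '' if it has none."""
    if tag.startswith("{"):
        return tag[1:].split("}", 1)[0]
    return ""

def user_defined_children(element):
    """Return child elements whose namespace is not the VDJML namespace."""
    return [child for child in element if namespace_of(child.tag) != VDJ_NS]

meta = ET.fromstring(SAMPLE_META)
extras = user_defined_children(meta)
print([e.tag for e in extras])  # ['{http://example.org/lab}sample']
```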

vdjml
Top element of a VDJML document.
| attribute | type | required | description |
|---|---|---|---|
| version | xs:decimal | | version of the VDJML format, e.g., 1.0 |
meta
Contains general information common to all analysis results. May also contain user-defined attributes and elements, which should use namespaces other than vdj.
generator
Describes the software that wrote the VDJML file
| attribute | type | required | description |
|---|---|---|---|
| name | xs:string | | The name of the VDJML generator, e.g., libVDJML |
| version | xs:string | | Version of the generator software, e.g., 1.42.0 |
| time_gmt | xs:dateTime | | Date and time of writing the file, GMT, in xs:dateTime format, e.g., 2014-07-24T14:47:24 |
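A reader might turn a time_gmt value into a timezone-aware timestamp as sketched below. The function name parse_time_gmt is an assumption. The sketch treats a value without an offset as GMT, as the attribute description implies. Note that datetime.fromisoformat accepts a trailing "Z" only in Python 3.11 and later.

```python
from datetime import datetime, timezone

def parse_time_gmt(value):
    """Parse a time_gmt value into a timezone-aware UTC datetime."""
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # No offset given: the attribute is documented as GMT.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)

written = parse_time_gmt("2014-07-24T14:47:24")
print(written.isoformat())  # 2014-07-24T14:47:24+00:00
```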
aligner
Information about the aligner (VDJ analysis software), a program that generated all or some of the results in the VDJML document by either performing the VDJ analysis de novo (e.g., IgBlast or VQuest) or by building upon the results of another aligner.
| attribute | type | required | description |
|---|---|---|---|
| aligner_id | xs:positiveInteger | | A unique identifier for the IR analysis software |
| name | xs:string | | The name of the aligner |
| version | xs:string | | The version of the aligner |
| run_id | xs:integer | | A unique identifier for the analysis run |
| uri | xs:anyURI | | A URL from which the aligner or information about it can be obtained |
parameters
Parameters used for running the VDJ analysis.
germline_db
Information about a germline database used for the analysis.
| attribute | type | required | description |
|---|---|---|---|
| gl_db_id | xs:positiveInteger | | A unique identifier for the germline database |
| name | xs:string | | The database name |
| species | xs:string | | The germline database species |
| version | xs:string | | Germline database version |
| uri | xs:anyURI | | The URL where the database or information about it can be obtained |
read_results
Sequence of alignment results for individual sequence reads. It is recommended that the sequence read results appear in the same order as in the FASTA/Q files from which the alignment results were obtained.
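The ordering recommendation can be checked with a short script such as the following sketch. It assumes that the sequence ID is the first whitespace-delimited token of each FASTA header line; both function names and the sample data are made up for illustration.

```python
def fasta_ids(fasta_text):
    """Return sequence IDs from FASTA text, in file order.

    Assumption: the ID is the first whitespace-delimited token after '>'.
    """
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            ids.append(tokens[0] if tokens else "")
    return ids

def first_order_mismatch(read_ids, source_ids):
    """Return the first index at which the two ID lists differ, or None."""
    for i, (a, b) in enumerate(zip(read_ids, source_ids)):
        if a != b:
            return i
    if len(read_ids) != len(source_ids):
        # One list is a strict prefix of the other.
        return min(len(read_ids), len(source_ids))
    return None

FASTA = ">read_1 sample A\nACGT\n>read_2\nGGCC\n>read_3\nTTAA\n"
source_ids = fasta_ids(FASTA)
print(first_order_mismatch(["read_1", "read_2", "read_3"], source_ids))  # None
print(first_order_mismatch(["read_1", "read_3", "read_2"], source_ids))  # 1
```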
read
Stores the analysis results of a single sequence read. May also contain user-defined elements and attributes. The user-defined elements and attributes should be in namespaces other than vdj.
| attribute | type | required | description |
|---|---|---|---|
| read_id | xs:string | | Unique identifier of the read sequence; should be the same as the sequence ID from the original FASTA/Q file |
alignment
Immune receptor sequence alignment results for the sequence read.
segment_match
Information about germline gene segment(s) aligned to the read sequence
| attribute | type | required | description |
|---|---|---|---|
| segment_match_id | xs:positiveInteger | | A unique identifier for segment match to be used in other parts of the VDJML document |
| read_pos0 | xs:nonNegativeInteger | | Zero-based starting position of the match in the read sequence |
| read_len | xs:nonNegativeInteger | | Number of nucleotides in the read sequence for the matching region |
| gl_len | xs:nonNegativeInteger | | length of aligned germline segment |
| identity | vdj:Percent | | Percent of nucleotide sequence identity, e.g., 90%. |
| score | xs:integer | | Alignment score, as defined by the aligner software |
| insertions | xs:nonNegativeInteger | | number of nucleotide insertions in the read sequence aligned to the germline sequence |
| deletions | xs:nonNegativeInteger | | number of nucleotide deletions from the read sequence aligned to the germline sequence |
| substitutions | xs:nonNegativeInteger | | number of nucleotide substitutions |
| stop_codon | xs:boolean | | true if the stop codon is present |
| mutated_invariant | xs:boolean | | true if a codon for a conserved amino acid is mutated |
| inverted | xs:boolean | | true if the read sequence is a reverse-complement to germline gene segments |
| out_frame_indel | xs:boolean | | true if indel mutation resulted in a frame shift |
| out_frame_vdj | xs:boolean | | true if V(D)J recombination occurred out of frame |
btop
BTOP (BLAST traceback operations) string. Integer indicates the number of aligned nucleotides. A pair of letters indicates a nucleotide mismatch between read/germline sequences. Dash ('-') indicates a deletion. Example: 5AC-G35.
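The BTOP description above can be illustrated with a small parser. This is a minimal sketch, not the parser used by any VDJML library: it assumes that each letter pair is ordered read character first, germline character second (following the "read/germline" wording above), and that a dash marks a gap on the corresponding side. The function names are hypothetical.

```python
import re

# A token is either a run of digits (number of matching positions) or a
# pair of characters (read character followed by germline character).
_BTOP_TOKEN = re.compile(r"(\d+)|([A-Za-z-]{2})")

def parse_btop(btop):
    """Split a BTOP string into ('match', n) and ('diff', read, gl) operations."""
    ops = []
    pos = 0
    while pos < len(btop):
        m = _BTOP_TOKEN.match(btop, pos)
        if m is None:
            raise ValueError(f"cannot parse BTOP at offset {pos}: {btop!r}")
        if m.group(1) is not None:
            ops.append(("match", int(m.group(1))))
        else:
            read_char, gl_char = m.group(2)
            ops.append(("diff", read_char, gl_char))
        pos = m.end()
    return ops

def aligned_lengths(ops):
    """Return (read nucleotides, germline nucleotides) covered by the operations."""
    read_len = gl_len = 0
    for op in ops:
        if op[0] == "match":
            read_len += op[1]
            gl_len += op[1]
        else:
            _, read_char, gl_char = op
            # A dash consumes no nucleotide on its side of the alignment.
            read_len += 1 if read_char != "-" else 0
            gl_len += 1 if gl_char != "-" else 0
    return read_len, gl_len

ops = parse_btop("5AC-G35")
print(ops)  # [('match', 5), ('diff', 'A', 'C'), ('diff', '-', 'G'), ('match', 35)]
print(aligned_lengths(ops))  # (41, 42)
```

Under these assumptions, the example string covers 41 read nucleotides and 42 germline nucleotides, because the "-G" pair consumes a germline position only.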
gl_seg_match
Information about aligned germline segment
| attribute | type | required | description |
|---|---|---|---|
| gl_seg_match_id | xs:positiveInteger | | A unique identifier for germline segment match to be used in other parts of the VDJML document |
| num_system | xs:string | | Numbering system name. |
| name | xs:string | | Name of the germline segment |
| gl_db_id | xs:positiveInteger | | A unique identifier referring to the germline database in which the segment was found |
| aligner_id | xs:positiveInteger | | A unique identifier referring to the aligner software that produced the alignment |
| type | vdj:Segment_type | | The type of the segment (V, D, or J) |
| gl_pos0 | xs:nonNegativeInteger | | Zero-based start of the alignment position in the germline sequence. |
aa_substitution
Information about an amino acid codon substitution
| attribute | type | required | description |
|---|---|---|---|
| read_pos0 | xs:nonNegativeInteger | | Zero-based position of the first nucleotide of the codon in the read sequence. |
| read_aa | vdj:Aminoacid | | amino acid encoded by the read sequence |
| gl_aa | vdj:Aminoacid | | amino acid encoded by the germline sequence |
combination
Annotations of the read sequence based on certain alignments with germline gene segments.
| attribute | type | required | description |
|---|---|---|---|
| segments | | | List of identifiers for the segment matches (segment_match_id-s) that serve as a basis for the annotations listed in this element. |
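Resolving the segments list back to the segment matches it refers to might look like the sketch below. Two things are assumed here and are not stated in this document: that segment_match and combination elements are siblings under alignment, and that the segments attribute is a whitespace-separated list of segment_match_id values. The sample values and the function name segments_of are made up.

```python
import xml.etree.ElementTree as ET

NS = {"vdj": "http://vdjserver.org/vdjml/xsd/1/"}

# Assumed layout: segment_match and combination are siblings under
# alignment; "segments" is a whitespace-separated list of IDs.
SAMPLE = """\
<vdj:alignment xmlns:vdj="http://vdjserver.org/vdjml/xsd/1/">
  <vdj:segment_match segment_match_id="1" read_pos0="0" read_len="290"/>
  <vdj:segment_match segment_match_id="2" read_pos0="295" read_len="12"/>
  <vdj:segment_match segment_match_id="3" read_pos0="310" read_len="48"/>
  <vdj:combination segments="1 3"/>
</vdj:alignment>
"""

def segments_of(alignment, combination):
    """Return the segment_match elements referenced by a combination."""
    by_id = {
        match.get("segment_match_id"): match
        for match in alignment.findall("vdj:segment_match", NS)
    }
    ids = combination.get("segments", "").split()
    missing = [i for i in ids if i not in by_id]
    if missing:
        raise KeyError(f"unknown segment_match_id(s): {missing}")
    return [by_id[i] for i in ids]

alignment = ET.fromstring(SAMPLE)
combination = alignment.find("vdj:combination", NS)
matched = segments_of(alignment, combination)
print([m.get("read_pos0") for m in matched])  # ['0', '310']
```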
region
Region of an immune receptor gene, e.g., an Ig domain or a segment junction. Note: the corresponding positions of the germline sequences are not stored because they may differ depending on the individual germline segment matches.
| attribute | type | required | description |
|---|---|---|---|
| name | xs:string | | Name of the region as provided by the alignment software, e.g., FR1 |
| num_system | xs:string | | Numbering system name. Attribute is not required because for some regions, e.g., for VD-junctions, numbering system is not applicable. |
| aligner_id | xs:positiveInteger | | Unique identifier (aligner_id) of the IR analysis software that produced the region annotation |
| read_pos0 | xs:nonNegativeInteger | | Zero-based index of the first read nucleotide that is mapped to the region. In case of an Ig domain, the corresponding position for each germline segment match may be determined based on segment match positions and BTOPs. |
| read_len | xs:nonNegativeInteger | | Length of the read sequence that corresponds to the region. |
| identity | vdj:Percent | | Percent of nucleotide sequence identity, e.g., 90%. |
| score | xs:integer | | Alignment score, as defined by the aligner software |
| insertions | xs:nonNegativeInteger | | number of nucleotide insertions in the read sequence aligned to the germline sequence |
| deletions | xs:nonNegativeInteger | | number of nucleotide deletions from the read sequence aligned to the germline sequence |
| substitutions | xs:nonNegativeInteger | | number of nucleotide substitutions |
| stop_codon | xs:boolean | | true if the stop codon is present |
| mutated_invariant | xs:boolean | | true if a codon for a conserved amino acid is mutated |
| inverted | xs:boolean | | true if the read sequence is a reverse-complement to germline gene segments |
| out_frame_indel | xs:boolean | | true if indel mutation resulted in a frame shift |
| out_frame_vdj | xs:boolean | | true if V(D)J recombination occurred out of frame |
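Because read_pos0 is zero-based and read_len is a length, a region maps onto a half-open slice of the read sequence. The sketch below assumes the read sequence itself is obtained separately, e.g., from the original FASTA/Q file; the function name and the sample sequence are made up for illustration.

```python
def region_sequence(read_seq, read_pos0, read_len):
    """Return the part of read_seq covered by a region.

    The region spans the half-open interval [read_pos0, read_pos0 + read_len).
    """
    if read_pos0 < 0 or read_len < 0:
        raise ValueError("read_pos0 and read_len must be non-negative")
    end = read_pos0 + read_len
    if end > len(read_seq):
        raise ValueError("region extends past the end of the read")
    return read_seq[read_pos0:end]

read_seq = "ACGTACGTAC"
print(region_sequence(read_seq, 2, 4))  # GTAC
```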