The vdj_pipe configuration consists of general options, sequencing data inputs, and the processing steps. The pipeline reads the input files record-by-record and applies the processing steps sequentially to each of the records. A sequencing data record may include a sequence, its description, and the quality score in case of single-read data, or two sequences and two quality scores for paired-read data.
The processing steps operate by reading and setting the values of a set of variables. For a single-read pipeline, the following values are provided by the input system:
ID | Type | Description |
read_id | string | sequencing read ID; no white space characters |
description | string | sequencing read description string; includes ID; no \n \r characters |
sequence | string | nucleotide sequence; no white space characters |
quality | unsigned vector | quality scores; same size as sequence |
trim | int, int | sequence range marked for further processing; [0, sequence length] initially |
is_reverse | boolean | true if read is marked as reverse |
seq_file_path | Path_id | input sequence file ID |
qual_file_path | Path_id | input quality file ID; same as seq_file_path for FASTQ files |
- Parameters without names should not be shown to user
- Relative input paths are resolved relative to the global base_path_input parameter (if set)
- Relative output paths are resolved relative to the global base_path_output parameter (if set)
- Absolute paths (e.g., on Unix starting from “/”) are interpreted as absolute
- Output file paths with archive extensions (i.e., gz, bz2, z) are compressed automatically
- Compression of input files is determined by their “magic bytes”
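The path and compression rules above can be sketched in Python. This is only an illustration and not vdj_pipe's own code: the helper names `resolve_path`, `wants_compression`, and `detect_compression` are hypothetical, and the magic-byte signatures are the standard ones for gzip and bzip2 rather than anything stated in this document.

```python
from pathlib import PurePosixPath
from typing import Optional

# Archive extensions listed in the notes above.
ARCHIVE_EXTENSIONS = {".gz", ".bz2", ".z"}


def resolve_path(path: str, base_path: Optional[str] = None) -> str:
    """Resolve a relative path against a base path; keep absolute paths as-is."""
    p = PurePosixPath(path)
    if p.is_absolute() or not base_path:
        return str(p)
    return str(PurePosixPath(base_path) / p)


def wants_compression(output_path: str) -> bool:
    """True if the output path ends with one of the archive extensions."""
    return PurePosixPath(output_path).suffix.lower() in ARCHIVE_EXTENSIONS


def detect_compression(first_bytes: bytes) -> Optional[str]:
    """Guess input compression from leading magic bytes (standard signatures)."""
    if first_bytes.startswith(b"\x1f\x8b"):
        return "gzip"
    if first_bytes.startswith(b"BZh"):
        return "bzip2"
    return None
```

For instance, `resolve_path("reads/a.fastq", "/data")` gives `/data/reads/a.fastq`, while an absolute path is returned unchanged.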
Some processing steps accept parameters of variable_string type. A variable_string is a JSON string that may include variable names enclosed in curly brackets {}. The variables should be defined by one of the previous steps. The value of a variable_string is generated for each sequencing data record by replacing the enclosed variable names with the variable values converted to strings. The processing steps may also accept an unset_value parameter that defines a string used as a default variable value. If the unset_value parameter is not defined and the value of a variable is not set, the value of the variable_string is also not set. variable_strings can be used, e.g., for writing sequences to different directories and files depending on some variables.
variable_string_path is a variable_string used as an output path. The output to variable_string_path cannot be compressed. variable_string_path should not include an archive extension.
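The substitution rule for variable_string values can be sketched as follows. The function name `expand_variable_string` and the variable names `MID` and `sample` are hypothetical, and the sketch returns `None` to stand for "value not set"; the actual implementation may differ.

```python
import re
from typing import Mapping, Optional

# A variable name enclosed in curly brackets, e.g. {MID}.
_VARIABLE = re.compile(r"\{([^{}]+)\}")


def expand_variable_string(template: str,
                           values: Mapping[str, object],
                           unset_value: Optional[str] = None) -> Optional[str]:
    """Substitute {name} placeholders with per-record values.

    Returns None if some variable is unset and no unset_value is given.
    """
    missing = False

    def substitute(match: "re.Match[str]") -> str:
        nonlocal missing
        value = values.get(match.group(1))
        if value is None:
            if unset_value is None:
                missing = True
                return ""
            return unset_value
        return str(value)

    result = _VARIABLE.sub(substitute, template)
    return None if missing else result
```

With values `{"MID": "MID1", "sample": "s1"}`, the template `out/{MID}/{sample}.fastq` expands to `out/MID1/s1.fastq`; with no values and `unset_value="unknown"`, `out/{MID}.fastq` expands to `out/unknown.fastq`.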
Name | Character | Comment |
Adenine | A, a | |
Cytosine | C, c | |
Guanine | G, g | |
Thymine | T, t | |
Any | N, n | A, C, G, T |
Uracil | U, u | |
Purine | R, r | A, G |
Pyrimidine | Y, y | C, T, U |
Ketone | K, k | G, T, U |
Amine | M, m | A, C |
Strong | S, s | C, G |
Weak | W, w | A, T, U |
not_A | B, b | C, G, T |
not_C | D, d | A, G, T |
not_G | H, h | A, C, T |
not_T | V, v | A, C, G |
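The nucleotide table above maps directly onto a lookup dictionary. The sketch below is illustrative only; `IUPAC`, `code_matches`, and `is_ambiguous` are hypothetical names, and the base sets are copied from the Comment column of the table.

```python
# Nucleotide codes and the bases they stand for, as listed in the table above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "N": "ACGT",
    "R": "AG", "Y": "CTU", "K": "GTU", "M": "AC",
    "S": "CG", "W": "ATU",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def code_matches(code: str, base: str) -> bool:
    """True if a (possibly ambiguous) code covers a concrete base; case-insensitive."""
    return base.upper() in IUPAC[code.upper()]


def is_ambiguous(code: str) -> bool:
    """True if the code stands for more than one base."""
    return len(IUPAC[code.upper()]) > 1
```

For example, `code_matches("R", "g")` is true because purine covers A and G, while `code_matches("Y", "A")` is false.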
Id | Type | Default | Description |
base_path_input | directory path string | current directory | Path prefix for input files. Relative path is taken relatively to the current directory. The directory should exist. |
base_path_output | directory path string | ||
config_output_path | file path | ||
csv_file_delimiter | char | \t | comma (,) may be another option |
external_MIDs | bool | false | |
max_file_reads | unsigned | inf | |
max_reads | unsigned | inf | |
paired_reads | bool | false | |
plots_list_path | file path | ||
quality_scores | bool | true | |
summary_output_path | file path | ||
input | array of input objects | required | see Input options |
steps | array of step objects | required | see Processing steps |
Each input object describes a sequencing data file or multiple files processed together, for example, a *.fastq file, a pair of *.fasta and *.qual files, a pair of forward/reverse *.fastq files, or two pairs of forward/reverse *.fasta and *.qual files.
In addition, each input object may contain any number of user-defined name-value pairs. During processing, these are associated with every sequencing data record and can be accessed by the processing steps as variables of the same name. All input objects should have the same set of user-defined names. It is recommended that the value types for
Id | Type | Default | Description |
sequence | file path | | path to *.fasta or *.fastq file; use for single reads |
quality | file path | | path to *.qual file; sequence should also be specified |
is_reverse | bool | false | true indicates that the sequences are in reverse direction |
forward_seq | file path | | path to *.fasta or *.fastq file with sequences in forward direction; use for single or paired reads; for single reads, this is equivalent to sequence keyword or to sequence and is_reverse: false |
reverse_seq | file path | | path to *.fasta or *.fastq file with sequences in reverse direction; use for single or paired reads; for single reads, this is equivalent to sequence and is_reverse: true |
forward_qual | file path | | path to *.qual file with quality scores in forward direction; use for single or paired reads; for single reads, this is equivalent to quality keyword or to quality and is_reverse: false; forward_seq should be also specified |
reverse_qual | file path | ||
forward_mid | file path | ||
reverse_mid | file path |
Single reads
"input": [
{ "forward_seq": "file1.fastq" },
{ "reverse_seq": "file2.fastq" },
{ "sequence": "file3.fastq" },
{ "sequence": "file4.fastq", "is_reverse": true },
{ "sequence": "file5.fasta", "quality": "file5.qual" }
]
Single reads with a user variable
"input": [
{ "forward_seq": "file1.fastq" },
{ "reverse_seq": "file2.fastq" }
]
Paired reads
"input": [
{ "forward_seq": "file1.fastq", "reverse_seq": "file2.fastq" },
{ "forward_seq": "file3.fastq", "reverse_seq": "file4.fastq" }
]
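The rule that all input objects should share the same set of user-defined names can be checked before a configuration is written. The sketch below builds an `input` array in Python and serializes it to JSON; the user variable name `sample`, its values, and the helper names are hypothetical, and the reserved key list is taken from the input options table.

```python
import json

# Keys with a predefined meaning in an input object (from the input options table).
RESERVED_KEYS = {
    "sequence", "quality", "is_reverse",
    "forward_seq", "reverse_seq", "forward_qual", "reverse_qual",
    "forward_mid", "reverse_mid",
}


def user_variable_names(input_object):
    """Names in an input object that are not predefined options."""
    return frozenset(input_object) - RESERVED_KEYS


def check_user_variables(inputs):
    """Return True if every input object defines the same user variables."""
    name_sets = {user_variable_names(obj) for obj in inputs}
    return len(name_sets) <= 1


# Two single-read inputs, each tagged with a hypothetical "sample" variable.
inputs = [
    {"forward_seq": "file1.fastq", "sample": "donor_A"},
    {"reverse_seq": "file2.fastq", "sample": "donor_B"},
]
config_fragment = json.dumps({"input": inputs}, indent=2)
```

Adding a third input object without `sample` would make `check_user_variables` return false.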
Base composition statistics
Id | Tags | Description |
composition_stats | stats | Calculate relative abundance of ambiguous base calls and each of the four nucleotides. |
Parameters
Id | Name | Type | Default | Description |
out_prefix | | string | “” | Path prefix for output files. |
out_path_composition | | string | out_prefix + composition.csv | Output path for composition summary file; if this option is used, out_prefix is ignored for this file. |
out_path_gc_hist | | string | out_prefix + gc_hist.csv | Output path for GC histogram file; if this option is used, out_prefix is ignored for this file. |
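As a rough illustration of what a composition statistic involves, the sketch below computes the relative abundance of A, C, G, T and of all other characters, plus a per-read GC percentage. It is not the step's implementation: the function names are hypothetical, and treating every character other than A, C, G, T as an ambiguous base call is an assumption.

```python
from collections import Counter
from typing import Dict, Iterable


def base_composition(sequences: Iterable[str]) -> Dict[str, float]:
    """Fractions of A, C, G, T and of all remaining (ambiguous) characters."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq.upper())
    total = sum(counts.values())
    if total == 0:
        return {}
    unambiguous = sum(counts[b] for b in "ACGT")
    result = {b: counts[b] / total for b in "ACGT"}
    result["ambiguous"] = (total - unambiguous) / total
    return result


def gc_percent(seq: str) -> float:
    """Percentage of G and C characters in one read."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)
```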
Read quality statistics
Id | Tags | Description |
quality_stats | stats | Generate quality score distributions for each position, and generate histograms for both read length and average read quality. |
Parameters
Id | Name | Type | Default | Description |
out_prefix | | string | “” | Path prefix for output files. |
out_path_hm | | string | out_prefix + heat_map.csv | Output path for quality heat map; if this option is used, out_prefix is ignored for this file. |
out_path_stats | | string | out_prefix + qstats.csv | Output path for quality statistics summary; if this option is used, out_prefix is ignored for this file. |
out_path_mq_hist | | string | out_prefix + mean_q_hist.csv | Output path for read quality histogram; if this option is used, out_prefix is ignored for this file. |
out_path_len_hist | | string | out_prefix + len_hist.csv | Output path for read length histogram; if this option is used, out_prefix is ignored for this file. |
Filter steps determine whether a sequence read is suitable for further analysis based on some condition, e.g., quality score, length, or composition.
Common parameters for all filter steps:
Id | Name | Type | Default | Description |
passed_name | | string | optional | name of boolean variable defined by the step indicating whether the current read passed the filter |
Nucleotide filter
Id | Tags | Description |
character_filter | filter | Discard reads with non-allowed characters. |
Parameters
Id | Name | Type | Default | Description |
chars | allowed nucleotides | string | ACGTacgt | The list of allowed characters may be expanded to contain any of the following case-sensitive characters: ACGTURYKMSWBDHVNacgturykmswbdhvn |
Length filter
Id | Tags | Description |
length_filter | filter | Discard reads with a length outside of the range [min, max]. |
Parameters
Id | Name | Type | Default | Description |
min | minimal length | unsigned | 0 | Minimal accepted read length. |
max | maximal length | unsigned | <large number> | Maximal accepted length. |
trim | trim | bool | false | trim reads longer than max at the 3’ end |
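The interaction of min, max, and trim can be sketched as below. The function name and return shape are hypothetical, and the assumption that a trimmed read then passes the filter is an interpretation of the table rather than a documented fact.

```python
from typing import Optional, Tuple


def length_filter(length: int,
                  min_len: int = 0,
                  max_len: Optional[int] = None,
                  trim: bool = False) -> Tuple[bool, int]:
    """Return (passed, resulting_length) for a read of the given length."""
    if length < min_len:
        return False, length
    if max_len is not None and length > max_len:
        if trim:
            # Keep the first max_len bases, i.e. cut at the 3' end.
            return True, max_len
        return False, length
    return True, length
```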
Homopolymer filter
Id | Tags | Description |
homopolymer_filter | filter | discard reads that contain a homopolymer longer than a specified length |
Parameters
Id | Name | Type | Default | Description |
max_length | max length | unsigned | <required> | maximum homopolymer length |
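A homopolymer check reduces to finding the longest run of identical characters. The sketch below uses hypothetical function names and compares bases case-insensitively, which is an assumption.

```python
from itertools import groupby


def longest_homopolymer(sequence: str) -> int:
    """Length of the longest run of one repeated character (0 for an empty read)."""
    return max((len(list(run)) for _, run in groupby(sequence.upper())), default=0)


def passes_homopolymer_filter(sequence: str, max_length: int) -> bool:
    """Discard (return False) if some homopolymer is longer than max_length."""
    return longest_homopolymer(sequence) <= max_length
```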
Minimal quality filter
Id | Tags | Description |
min_quality_filter | filter | discard reads that contain a quality score lower than a specified minimum |
Parameters
Id | Name | Type | Default | Description |
min_quality | minimal quality | unsigned | <required> |
Minimal average quality filter
Id | Tags | Description |
average_quality_filter | filter | discard reads with average quality score lower than a specified minimum |
Parameters
Id | Name | Type | Default | Description |
min_quality | minimal quality | float | <required> |
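The two whole-read quality filters differ only in the statistic they compare against min_quality. A minimal sketch, with hypothetical function names and with the handling of empty reads chosen arbitrarily:

```python
from typing import Sequence


def passes_min_quality(quality: Sequence[int], min_quality: int) -> bool:
    """Pass only if no quality score is lower than min_quality."""
    return all(q >= min_quality for q in quality)


def passes_average_quality(quality: Sequence[int], min_quality: float) -> bool:
    """Pass only if the mean quality score is not lower than min_quality."""
    if not quality:
        # Assumption: a read without scores cannot satisfy the average filter.
        return False
    return sum(quality) / len(quality) >= min_quality
```

A read with scores 30 and 10 fails the minimal quality filter at a threshold of 20 but passes the average quality filter, since its mean is exactly 20.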
Minimal quality window filter
Id | Tags | Description |
min_quality_window_filter | filter, trimming | Find the longest subsequence with every quality score greater than a specified minimum. Discard reads where the found subsequence is shorter than a specified length. |
Parameters
Id | Name | Type | Default | Description |
min_quality | minimal quality | unsigned | <required> | minimal quality score |
min_length | minimal length | unsigned | <required> | minimal interval length |
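Finding the longest stretch in which every score exceeds a threshold takes a single pass over the quality vector. The sketch below returns a half-open interval and uses hypothetical function names; it follows the "greater than" wording of the description, and whether the real step uses a strict comparison is not confirmed here.

```python
from typing import Sequence, Tuple


def longest_quality_run(quality: Sequence[int], min_quality: int) -> Tuple[int, int]:
    """Half-open interval [start, end) of the longest run with every score > min_quality."""
    best = (0, 0)
    start = None
    for i, q in enumerate(quality):
        if q > min_quality:
            if start is None:
                start = i
            if i + 1 - start > best[1] - best[0]:
                best = (start, i + 1)
        else:
            start = None
    return best


def passes_min_quality_window(quality: Sequence[int], min_quality: int,
                              min_length: int) -> bool:
    """Discard (return False) if the longest run is shorter than min_length."""
    start, end = longest_quality_run(quality, min_quality)
    return end - start >= min_length
```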
Average quality window filter
Id | Tags | Description |
average_quality_window_filter | filter, trimming | Find the longest subsequence with average quality score greater than a specified minimum. Discard reads where the found subsequence is shorter than a specified length. |
Parameters
Id | Name | Type | Default | Description |
min_quality | minimal quality | float | <required> | minimal accepted average quality score |
window_length | window length | unsigned | <required> | length for computing averages |
min_length | minimal length | unsigned | window_length | minimal interval length (should be <= window_length) |
Ambiguous nucleotide window filter
Id | Tags | Description |
ambiguous_window_filter | filter, trimming | Find the longest subsequence with no more than a specified number of ambiguous base calls. Discard reads where the found subsequence is shorter than a specified length. |
Parameters
Id | Name | Type | Default | Description |
min_length | minimal length | unsigned | 0 | minimal accepted read interval length; if zero, require that the whole read contains at most max_ambiguous ambiguous nucleotides |
max_ambiguous | maximal number of ambiguous nucleotides | unsigned | 0 |
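The longest interval with at most max_ambiguous ambiguous base calls can be found with a two-pointer sliding window. This is a sketch of that general technique, not necessarily how the step is implemented; the function name is hypothetical, and counting every character other than A, C, G, T as ambiguous is an assumption.

```python
from typing import Tuple


def longest_low_ambiguity_window(sequence: str, max_ambiguous: int = 0) -> Tuple[int, int]:
    """Half-open interval [start, end) of the longest window with
    at most max_ambiguous characters outside A, C, G, T."""
    best = (0, 0)
    left = 0
    ambiguous = 0
    for right, base in enumerate(sequence):
        if base.upper() not in "ACGT":
            ambiguous += 1
        # Shrink the window from the left until it is valid again.
        while ambiguous > max_ambiguous:
            if sequence[left].upper() not in "ACGT":
                ambiguous -= 1
            left += 1
        if right + 1 - left > best[1] - best[0]:
            best = (left, right + 1)
    return best
```

For the read `ACNNGTACGTNAC`, the longest window without ambiguous bases is positions 4 to 10 (`GTACGT`); allowing one ambiguous base extends it to positions 4 to 13.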
Histogram
Id | Tags | Description |
histogram | stats | Build a histogram of value occurrences |
Parameters
Id | Name | Type | Default | Description |
name | | string or array of strings | <required> | The name of the variable to build a histogram from; the available names are dependent on the configuration used. |
out_path | | string | variable name(s) | output path for histogram CSV file; by default, variable names delimited by “_”, followed by “.csv” |
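The default output path and the counting itself can be sketched as follows. The helper names are hypothetical, per-read variables are modelled as plain dictionaries, and the variable names `MID` and `sample` are examples only.

```python
from collections import Counter
from typing import Iterable, List, Mapping, Sequence, Union

Names = Union[str, Sequence[str]]


def _as_list(names: Names) -> List[str]:
    """Accept either a single name or a list of names."""
    return [names] if isinstance(names, str) else list(names)


def default_histogram_path(names: Names) -> str:
    """Variable names joined by "_" and followed by ".csv"."""
    return "_".join(_as_list(names)) + ".csv"


def build_histogram(records: Iterable[Mapping[str, object]], names: Names) -> Counter:
    """Count occurrences of each combination of values of the named variables."""
    keys = _as_list(names)
    return Counter(tuple(record.get(key) for key in keys) for record in records)
```

With names `["MID", "sample"]`, the default path in this sketch is `MID_sample.csv`.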
Match sequence element
Id | Tags | Description |
match | demultiplexing, alignment, filter, trimming | find one or more sequence features, optionally positioned relative to each other, by strict or fuzzy matching or by alignment |
Parameters
Id | Name | Type | Default | Description |
reverse | reverse-complement | boolean | false | search reverse-complemented sequences for reads marked as reverse |
trimmed | trimmed | boolean | false | search within trimming marks set by previous steps |
elements | match elements | array of Element objects | <required> | define sequence features to search for |
combinations | element combinations | array of Combination objects | <empty> | define feature combinations |
Element object
Id | Name | Type | Default | Description |
start/end | element start/end | Position object | <empty> | define beginning or end of element |
length | | unsigned | | length of match element; required if the element is defined by position only (no sequences provided); if sequences are provided and the length is greater than the sequence length, scanning will be done to find the best position |
seq_file | sequence file | string | none | path to fasta/fastq file for matching sequences |
csv_file | CSV file | csv_element object | none | CSV file description |
sequence | matching sequences | string or array of strings | none | matching sequence or array of sequences |
required | is required | boolean | false | filter out read if element is not found |
min_score | minimal alignment score | int | 0 | minimal alignment score between the read sequence element and a matching sequence |
max_mismatches | maximum number of mismatches | unsigned | | either min_score or max_mismatches should be used |
allow_gaps | allow gaps | boolean | false | perform gapped alignment |
min_match_length | minimal contiguous match | unsigned | 0 | minimal contiguous matching length (used for gapped alignment) |
value_name | value name | string | none | unique name that will be used by other processing steps to refer to the matching sequence |
score_name | match score name | string | none | unique name that will be used by other processing steps to refer to the match score |
identity_name | match identity name | string | none | unique name that will be used by other processing steps to refer to the match identity |
cut_lower | cut before element | Cut object | none | set lower sequence truncation point |
cut_upper | cut after element | Cut object | none | set upper sequence truncation point |
csv_element object
Id | Name | Type | Default | Description |
path | | string | <required> | file location |
sequences_column | column name or index | string or int | <required> | name or index of the column containing sequences |
Position object
Id | Name | Type | Default | Description |
pos | position | int | 0 | position of element start or end |
before/after | before/after | string | none | name of matching element relative to which the position is defined |
Cut object
Id | Name | Type | Default | Description |
before/after | before/after | int | 0 | set sequence truncation point relative to the element’s beginning or end |
Combination object
Id | Name | Type | Default | Description |
value_name | value name | string | none | unique name that will be used by other processing steps to refer to the matching combination |
csv_file | | csv_combination object | <required> | define combinations of matched elements, e.g., bar code pairs |
csv_combination object
Id | Name | Type | Default | Description |
path | | path string | <required> | file location |
values_column | | array of string: string or string: int pairs | <required> | pairs of either element name - column name or element name - column index; column names and column indices should not appear in the same csv_combination object |
names_column | | string or int | optional | name or index of the column defining the names of the combinations; the type should be same as in values_column |
skip_header | | bool | false | if true and column names are not used, skip first line of CSV file |
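To make the max_mismatches option of an Element object more concrete, the sketch below scans a read for the position where a pattern fits with the fewest mismatches. It is a simplified, ungapped illustration with hypothetical function names; it does not model alignment scores, ambiguity codes, element positions, or the scanning strategy vdj_pipe actually uses.

```python
from typing import Optional, Tuple


def count_mismatches(a: str, b: str) -> int:
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_best_position(read: str, pattern: str,
                       max_mismatches: int) -> Optional[Tuple[int, int]]:
    """Return (position, mismatches) of the best ungapped fit, or None
    if no position has at most max_mismatches mismatches."""
    best = None
    for pos in range(len(read) - len(pattern) + 1):
        mismatches = count_mismatches(read[pos:pos + len(pattern)], pattern)
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return best
```

In a real configuration, an element that is not found and is marked as required would cause the read to be filtered out.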
Match external molecular identifier
Id | Tags | Description |
eMID_map | demultiplexing | identify which of the given short sequences best matches external molecular identifier located in read description line |
Parameters
Id | Name | Type | Default | Description |
value_name | value name | string | eMID | name of the value |
fasta_path | | string | <required> | file path with sequences to match eMID |
pairs_path | | string | “” | CSV file path with sequence name pairs |
Write sequences
Id | Tags | Description |
write_sequence | output | write each read in FASTA or FASTQ format, possibly into multiple files |
Parameters
Id | Name | Type | Default | Description |
out_path | ??? | variable_string_path | <required> | path for output file(s) |
unset_value | ??? | string | <discard> | use this string if a value specified by the out_path is undefined; by default, such reads are discarded |
trimmed | trimmed | bool | true | whether trimming defined by previous steps should be applied |
reverse_complemented | reverse complemented | bool | true | whether to reverse-complement sequences marked as reverse |
skip_empty | | bool | true | if true, skip the sequences that were filtered out |
Write values
Id | Tags | Description |
write_value | output | write specified values for each read in CSV format, possibly into multiple files |
Parameters
Id | Name | Type | Default | Description |
names | value names | string array | <required> | list of value names to write |
out_path | ??? | variable_string_path | <required> | path for output file(s) |
unset_value | ??? | string | <discard> | use this string if a value specified by the out_path is undefined; by default, such reads are discarded |
Find unique sequences
Id | Tags | Description |
find_shared | | Identify unique sequences and sequences common between different groups of reads |
Parameters
Id | Name | Type | Default | Description |
min_length | minimal length | unsigned | required with ignore_ends or fraction_match | minimal matching length to consider sequences identical |
ignore_ends | ignore end length | unsigned | optional, cannot be used with fraction_match | maximum length of mismatch at sequence ends, requires min_length |
fraction_match | fraction of sequence length to match | float (0,1] | optional, cannot be used with ignore_ends | fraction of sequence length to match, requires min_length |
min_duplicates | minimal number of copies | unsigned | 0 | output sequences with at least this many duplicates |
trimmed | | bool | true | operate on trimmed sequences |
reverse | | bool | true | reverse-complement sequences labelled as reverse |
out_summary | | path string | "sharing_summary.csv" | output path for summary statistics |
out_redundancy_histogram | | path string | <optional> | output path for redundancy histogram |
out_unique | | variable_string_path | <optional> | |
out_duplicates | | variable_string_path | <optional> | |
out_group_unique | | variable_string_path | <optional> | |
out_group_duplicates | | variable_string_path | <optional> | |
unset_value | | | | use this string if a value specified by the out_path is undefined; by default, such reads are discarded |
Note: the variable strings in out_unique, out_duplicates, out_group_unique, and out_group_duplicates parameters should contain the same set of variables.
Merge paired reads
Id | Tags | Description |
merge_paired | paired_reads | merge forward and reverse reads. This step should only be available when processing paired reads |
Parameters
Id | Name | Type | Default | Description |
min_score | minimal score | unsigned | 0 | merge reads if overlap has this minimal score |
<wrapper step>
Id | Tags | Description |
apply | paired_reads | apply specified step to forward, reverse, or merged sequence reads; use this behind the scenes to process paired reads |
Parameters
Id | Name | Type | Default | Description |
to | | token-string or array of token-strings | <required> | a single string token or an array of string tokens; allowed tokens: forward, reverse, merged |
step | | single read processing step object | <required> | apply step to each of the specified directions |