vdj_pipe pipeline

The vdj_pipe configuration consists of general options, sequencing data inputs, and the processing steps. The pipeline reads the input files record-by-record and applies the processing steps sequentially to each of the records. A sequencing data record may include a sequence, its description, and the quality score in case of single-read data, or two sequences and two quality scores for paired-read data.

The processing steps operate by reading and setting values to a set of variables. For a single-read pipeline, the following values are provided by the input system:

ID	Type	Description
read_id	string	sequencing read ID; no white space characters
description	string	sequencing read description string; includes ID; no \n \r characters
sequence	string	nucleotide sequence; no white space characters
quality	unsigned vector	quality scores; same size as sequence
trim	int, int	sequence range marked for further processing; [0, sequence length] initially
is_reverse	boolean	true if read is marked as reverse
seq_file_path	Path_id	input sequence file ID
qual_file_path	Path_id	input quality file ID; same as seq_file_path for FASTQ files

Notes:

- Parameters without names should not be shown to the user

- Relative input paths are resolved relative to the global base_path_input parameter (if set)

- Relative output paths are resolved relative to the global base_path_output parameter (if set)

- Absolute paths (e.g., on Unix, paths starting with “/”) are used as is

- Output files whose paths end in an archive extension (i.e., gz, bz2, z) are compressed automatically

- Compression of input files is determined by their “magic bytes”
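
The path and compression rules above can be sketched as follows. This is only an illustration under stated assumptions, not the vdj_pipe implementation: the function names are hypothetical, Unix-style paths are assumed, and only the widely documented gzip and bzip2 signatures are checked because the notes do not say which formats the pipeline recognizes.

```python
from pathlib import PurePosixPath

# Archive extensions listed in the notes above.
ARCHIVE_EXTENSIONS = {".gz", ".bz2", ".z"}


def resolve_path(path, base_path=None):
    """Resolve a relative path against base_path (if set); keep absolute paths."""
    p = PurePosixPath(path)
    if p.is_absolute() or base_path is None:
        return str(p)
    return str(PurePosixPath(base_path) / p)


def output_is_compressed(path):
    """An output file is compressed when its path ends in an archive extension."""
    return PurePosixPath(path).suffix in ARCHIVE_EXTENSIONS


def detect_input_compression(first_bytes):
    """Guess input compression from leading "magic bytes" (assumed formats)."""
    if first_bytes.startswith(b"\x1f\x8b"):  # gzip signature
        return "gzip"
    if first_bytes.startswith(b"BZh"):  # bzip2 signature
        return "bzip2"
    return None
```

Note that input compression is decided from file content, so an input file name does not need an archive extension.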

Some processing steps accept parameters of variable_string type. A variable_string is a JSON string that may include variable names enclosed in curly brackets {}. The variables should be defined by one of the previous steps. The value of a variable_string is generated for each sequencing data record by replacing each enclosed variable name with the value of that variable converted to a string. Such processing steps may also accept an unset_value parameter that defines a string used as the default variable value. If the unset_value parameter is not defined and the value of a variable is not set, the value of the variable_string is also not set. A variable_string can be used, e.g., to write sequences to different directories and files depending on the values of some variables.

variable_string_path is a variable_string used as an output path. The output to variable_string_path cannot be compressed. variable_string_path should not include an archive extension.
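
A minimal sketch of how a variable_string could be expanded for one record, assuming the variables are held in a dictionary. The function name and the variable name MID are hypothetical and used only for illustration; the behavior follows the description above, with None standing for "not set".

```python
import re


def expand_variable_string(template, values, unset_value=None):
    """Replace each {name} with str(values[name]).

    If a variable is not set, use unset_value; if unset_value is not
    given either, the whole result is "not set" (returned as None).
    """
    missing = False

    def substitute(match):
        nonlocal missing
        value = values.get(match.group(1))
        if value is None:
            if unset_value is None:
                missing = True
                return ""
            return unset_value
        return str(value)

    result = re.sub(r"\{([^{}]+)\}", substitute, template)
    return None if missing else result


# Hypothetical variable "MID" set by an earlier step.
print(expand_variable_string("out/{MID}/reads.fastq", {"MID": "mid1"}))
```

With an empty dictionary and no unset_value, the call returns None, which corresponds to a record whose output path is not set.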

Supported nucleotide characters

Name	Character	Comment
Adenine	A, a
Cytosine	C, c
Guanine	G, g
Thymine	T, t
Any	N, n	A, C, G, T
Uracil	U, u
Purine	R, r	A, G
Pyrimidine	Y, y	C, T, U
Ketone	K, k	G, T, U
Amine	M, m	A, C
Strong	S, s	C, G
Weak	W, w	A, T, U
not_A	B, b	C, G, T
not_C	D, d	A, G, T
not_G	H, h	A, C, T
not_T	V, v	A, C, G
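
The table above can be written as a lookup from each character to the bases it stands for. This is a sketch for illustration only: the names are hypothetical, and treating a character as "ambiguous" when it stands for more than one base is an assumption, not a definition taken from vdj_pipe.

```python
# Upper-case character -> bases it may represent, copied from the table above.
NUCLEOTIDE_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "N": "ACGT",
    "R": "AG", "Y": "CTU", "K": "GTU", "M": "AC",
    "S": "CG", "W": "ATU",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def is_supported(sequence):
    """True if every character (either case) appears in the table."""
    return all(ch.upper() in NUCLEOTIDE_CODES for ch in sequence)


def is_ambiguous(ch):
    """Assumption: a code is ambiguous if it stands for several bases."""
    return len(NUCLEOTIDE_CODES[ch.upper()]) > 1
```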

Global options

Id	Type	Default	Description
base_path_input	directory path string	current directory	Path prefix for input files. A relative path is resolved relative to the current directory. The directory should exist.
base_path_output	directory path string
config_output_path	file path
csv_file_delimiter	char	\t	delimiter for CSV files; a comma (,) may be another option
external_MIDs	bool	false
max_file_reads	unsigned	inf
max_reads	unsigned	inf
paired_reads	bool	false
plots_list_path	file path
quality_scores	bool	true
summary_output_path	file path
input	array of input objects	required	see Input options
steps	array of step objects	required	see Processing steps
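
As a rough sketch, the defaults in the table above could be applied to a parsed configuration like this. The helper name is hypothetical, "." stands in for the current directory, and math.inf stands in for "inf"; options without a listed default are left out.

```python
import math

# Defaults copied from the table above.
GLOBAL_DEFAULTS = {
    "base_path_input": ".",  # current directory
    "csv_file_delimiter": "\t",
    "external_MIDs": False,
    "max_file_reads": math.inf,
    "max_reads": math.inf,
    "paired_reads": False,
    "quality_scores": True,
}
REQUIRED_OPTIONS = ("input", "steps")


def with_global_defaults(config):
    """Check required options and fill in defaults for missing ones."""
    missing = [key for key in REQUIRED_OPTIONS if key not in config]
    if missing:
        raise ValueError("missing required option(s): " + ", ".join(missing))
    return {**GLOBAL_DEFAULTS, **config}
```

Values given in the configuration take precedence over the defaults because they are merged last.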

Input object

Each input object describes a sequencing data file or multiple files processed together, for example: a *.fastq file; a pair of *.fasta and *.qual files; a pair of forward/reverse *.fastq files; or two pairs of forward/reverse *.fasta and *.qual files.

In addition, each input object may contain any number of user-defined name-value pairs. During the processing, those will be associated with every sequencing data record and could be accessed by the processing steps as a variable of the same name. All input objects should have the same set of user-defined names. It is recommended that the value types for

Input object keywords

Id	Type	Default	Description
sequence	file path		path to .fasta or .fastq file; use for single reads
quality	file path		path to *.qual file; sequence should also be specified
is_reverse	bool	false	true indicates that the sequences are in reverse direction
forward_seq	file path		path to .fasta or .fastq file with sequences in forward direction; use for single or paired reads; for single reads, this is equivalent to sequence keyword or to sequence and is_reverse: false
reverse_seq	file path		path to .fasta or .fastq file with sequences in reverse direction; use for single or paired reads; for single reads, this is equivalent to sequence and is_reverse: true
forward_qual	file path		path to *.qual file with quality scores in forward direction; use for single or paired reads; for single reads, this is equivalent to quality keyword or to quality and is_reverse: false; forward_seq should also be specified
reverse_qual	file path
forward_mid	file path
reverse_mid	file path

Input options examples

Single reads

"input": [
    { "forward_seq": "file1.fastq" },
    { "reverse_seq": "file2.fastq" },
    { "sequence": "file3.fastq" },
    { "sequence": "file4.fastq", "is_reverse": true },
    { "sequence": "file5.fasta", "quality": "file5.qual" }
]

Single reads with a user variable

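"input": [
    { "forward_seq": "file1.fastq", "sample": "sample1" },
    { "reverse_seq": "file2.fastq", "sample": "sample2" }
]

Here sample is an illustrative user-defined name, not a vdj_pipe keyword; processing steps can access its value as the variable sample.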

Paired reads

"input": [
    { "forward_seq": "file1.fastq", "reverse_seq": "file2.fastq" },
    { "forward_seq": "file3.fastq", "reverse_seq": "file4.fastq" }
]
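
The keyword equivalences described in the table (sequence with is_reverse versus forward_seq and reverse_seq) can be sketched as a small normalization function. The function names are hypothetical, and the sketch ignores the quality and MID keywords.

```python
def normalize_input(obj):
    """Return (forward_seq, reverse_seq) paths for one input object.

    "sequence" is treated as forward_seq, or as reverse_seq when
    is_reverse is true, following the keyword table.
    """
    forward = obj.get("forward_seq")
    reverse = obj.get("reverse_seq")
    if "sequence" in obj:
        if obj.get("is_reverse", False):
            reverse = obj["sequence"]
        else:
            forward = obj["sequence"]
    return forward, reverse


def is_paired(obj):
    """An input object describes paired reads when both directions are given."""
    forward, reverse = normalize_input(obj)
    return forward is not None and reverse is not None
```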

Processing steps

Base composition statistics

Id	Tags	Description
composition_stats	stats	Calculate relative abundance of ambiguous base calls and each of the four nucleotides.

Parameters

Id	Name	Type	Default	Description
out_prefix		string	“”	Path prefix for output files.
out_path_composition		string	out_prefix + composition.csv	Output path for composition summary file; if this option is used, out_prefix is ignored for this file.
out_path_gc_hist		string	out_prefix + gc_hist.csv	Output path for GC histogram file; if this option is used, out_prefix is ignored for this file.

Read quality statistics

Id	Tags	Description
quality_stats	stats	Generate quality score distributions for each position, and generate histograms for both read length and average read quality.

Parameters

Id	Name	Type	Default	Description
out_prefix		string	“”	Path prefix for output files.
out_path_hm		string	out_prefix + heat_map.csv	Output path for quality heat map; if this option is used, out_prefix is ignored for this file.
out_path_stats		string	out_prefix + qstats.csv	Output path for quality statistics summary; if this option is used, out_prefix is ignored for this file.
out_path_mq_hist		string	out_prefix + mean_q_hist.csv	Output path for read quality histogram; if this option is used, out_prefix is ignored for this file.
out_path_len_hist		string	out_prefix + len_hist.csv	Output path for read length histogram; if this option is used, out_prefix is ignored for this file.

Filter steps

Filter steps determine whether a sequence read is suitable for further analysis based on some condition, e.g., quality score, length, or composition.

Common parameters for all filter steps:

Id	Name	Type	Default	Description
passed_name		string	optional	name of boolean variable defined by the step indicating whether the current read passed the filter

Nucleotide filter

Id	Tags	Description
character_filter	filter	Discard reads with non-allowed characters.

Parameters

Id	Name	Type	Default	Description
chars	allowed nucleotides	string	ACGTacgt	The list of allowed characters may be expanded to contain any of the following case-sensitive characters: ACGTURYKMSWBDHVNacgturykmswbdhvn

Length filter

Id	Tags	Description
length_filter	filter	Discard reads with a length outside of the range [min, max].

Parameters

Id	Name	Type	Default	Description
min	minimal length	unsigned	0	Minimal accepted read length.
max	maximal length	unsigned	<large number>	Maximal accepted length.
trim	trim	bool	false	if true, reads longer than max are trimmed at the 3’ end
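
A minimal sketch of the length filter logic, assuming that with trim enabled an over-long read is cut to max and kept. The function name is hypothetical, and the real step presumably updates the trim range of the record rather than cutting the string.

```python
def length_filter(sequence, min_len=0, max_len=None, trim=False):
    """Return the accepted (possibly trimmed) sequence, or None if discarded."""
    if len(sequence) < min_len:
        return None
    if max_len is not None and len(sequence) > max_len:
        # Keep the first max_len bases (3' end trimmed) or discard the read.
        return sequence[:max_len] if trim else None
    return sequence
```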

Homopolymer filter

Id	Tags	Description
homopolymer_filter	filter	discard reads that contain a homopolymer longer than a specified length

Parameters

Id	Name	Type	Default	Description
max_length	max length	unsigned	<required>	maximum homopolymer length
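
The homopolymer check can be sketched with a run-length scan. The function names are hypothetical, and comparing bases case-insensitively is an assumption.

```python
from itertools import groupby


def longest_homopolymer(sequence):
    """Length of the longest run of one repeated base (0 for an empty read)."""
    runs = (sum(1 for _ in run) for _, run in groupby(sequence.upper()))
    return max(runs, default=0)


def passes_homopolymer_filter(sequence, max_length):
    """Keep the read unless it contains a homopolymer longer than max_length."""
    return longest_homopolymer(sequence) <= max_length
```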

Minimal quality filter

Id	Tags	Description
min_quality_filter	filter	discard reads that contain a quality score lower than a specified minimum

Parameters

Id	Name	Type	Default	Description
min_quality	minimal quality	unsigned	<required>

Minimal average quality filter

Id	Tags	Description
average_quality_filter	filter	discard reads with average quality score lower than a specified minimum

Parameters

Id	Name	Type	Default	Description
min_quality	minimal quality	float	<required>
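
The two whole-read quality filters above differ only in what is compared with min_quality: every individual score, or the mean score. A sketch with hypothetical function names follows; rejecting an empty read in the average filter is an assumption.

```python
def passes_min_quality(quality, min_quality):
    """Keep the read unless some score is lower than min_quality."""
    return all(q >= min_quality for q in quality)


def passes_average_quality(quality, min_quality):
    """Keep the read unless its mean score is lower than min_quality."""
    if not quality:
        return False  # assumption: an empty read has no average and is discarded
    return sum(quality) / len(quality) >= min_quality
```

A read with scores 30 and 10 fails the minimal quality filter at min_quality 20 but passes the average quality filter, since its mean is exactly 20.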

Minimal quality window filter

Id	Tags	Description
min_quality_window_filter	filter, trimming	Find the longest subsequence with every quality score greater than a specified minimum. Discard reads where the found subsequence is shorter than a specified length.

Parameters

Id	Name	Type	Default	Description
min_quality	minimal quality	unsigned	<required>	minimal quality score
min_length	minimal length	unsigned	<required>	minimal interval length
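
One way to find the longest stretch in which every score is greater than min_quality is a single pass that tracks the start of the current run. This is a sketch with hypothetical names; intervals are half-open (start, end), and the first of several equally long runs wins.

```python
def longest_quality_window(quality, min_quality):
    """Return (start, end) of the longest run with every score > min_quality."""
    best_start, best_end = 0, 0
    run_start = 0
    for i, q in enumerate(quality):
        if q > min_quality:
            if i + 1 - run_start > best_end - best_start:
                best_start, best_end = run_start, i + 1
        else:
            run_start = i + 1  # the run is broken; restart after this position
    return best_start, best_end


def min_quality_window_filter(quality, min_quality, min_length):
    """Return the window to keep, or None if it is shorter than min_length."""
    start, end = longest_quality_window(quality, min_quality)
    return (start, end) if end - start >= min_length else None
```

The returned interval corresponds to the trim range that later steps would operate on.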

Average quality window filter

Id	Tags	Description
average_quality_window_filter	filter, trimming	Find the longest subsequence with average quality score greater than a specified minimum. Discard reads where the found subsequence is shorter than a specified length.

Parameters

Id	Name	Type	Default	Description
min_quality	minimal quality	float	<required>	minimal accepted average quality score
window_length	window length	unsigned	<required>	length for computing averages
min_length	minimal length	unsigned	window_length	minimal interval length (should be <= window_length)

Ambiguous nucleotide window filter

Id	Tags	Description
ambiguous_window_filter	filter, trimming	Find the longest subsequence with no more than a specified number of ambiguous base calls. Discard reads where the found subsequence is shorter than a specified length.

Parameters

Id	Name	Type	Default	Description
min_length	minimal length	unsigned	0	minimal accepted read interval length; if zero, require that the whole read contains at most max_ambiguous ambiguous nucleotides
max_ambiguous	maximal number of ambiguous nucleotides	unsigned	0
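
The longest subsequence with at most max_ambiguous ambiguous bases can be found with a sliding window. In this sketch the names are hypothetical, intervals are half-open (start, end), and any character other than A, C, G, T, or U (either case) is assumed to count as ambiguous.

```python
def longest_unambiguous_window(sequence, max_ambiguous=0, unambiguous="ACGTUacgtu"):
    """Return (start, end) of the longest window with <= max_ambiguous ambiguous bases."""
    best = (0, 0)
    start = 0
    count = 0  # ambiguous bases inside the current window
    for end, ch in enumerate(sequence, start=1):
        if ch not in unambiguous:
            count += 1
        while count > max_ambiguous:  # shrink from the left until valid again
            if sequence[start] not in unambiguous:
                count -= 1
            start += 1
        if end - start > best[1] - best[0]:
            best = (start, end)
    return best


def ambiguous_window_filter(sequence, min_length=0, max_ambiguous=0):
    """Return the window to keep, or None if the read is discarded."""
    start, end = longest_unambiguous_window(sequence, max_ambiguous)
    if min_length == 0:
        # Zero means the whole read must satisfy the limit.
        return (start, end) if end - start == len(sequence) else None
    return (start, end) if end - start >= min_length else None
```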

Histogram

Id	Tags	Description
histogram	stats	Build a histogram of value occurrences

Parameters

Id	Name	Type	Default	Description
name		string or array of strings	<required>	The name of the variable to build a histogram from; the available names are dependent on the configuration used.
out_path		string	variable name(s)	output path for histogram CSV file; by default, variable names delimited by “_”, followed by “.csv”
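
A sketch of the histogram step, assuming each record is available as a dictionary of variable values. The function names and the variable names MID and primer are hypothetical; the default file name follows the rule given for out_path.

```python
from collections import Counter


def default_histogram_path(names):
    """Variable names joined by "_", followed by ".csv"."""
    if isinstance(names, str):
        names = [names]
    return "_".join(names) + ".csv"


def build_histogram(records, names):
    """Count how often each combination of values occurs."""
    if isinstance(names, str):
        names = [names]
    return Counter(tuple(record.get(name) for name in names) for record in records)
```

For example, default_histogram_path(["MID", "primer"]) gives "MID_primer.csv".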

Match sequence element

Id	Tags	Description
match	demultiplexing, alignment, filter, trimming	find one or more sequence features, positioned relative to each other, by strict or fuzzy matching or by alignment

Parameters

Id	Name	Type	Default	Description
reverse	reverse-complement	boolean	false	search reverse-complemented sequences for reads marked as reverse
trimmed	trimmed	boolean	false	search within trimming marks set by previous steps
elements	match elements	array of Element objects	<required>	define sequence features to search for
combinations	element combinations	array of Combination objects	<empty>	define feature combinations

Element object

Id	Name	Type	Default	Description
start/end	element start/end	Position object	<empty>	define beginning or end of element
length		unsigned		length of match element; required if the element is defined by position only (no sequences provided); if sequences are provided and the length is greater than the sequence length, the read is scanned to find the best position
seq_file	sequence file	string	none	path to fasta/fastq file for matching sequences
csv_file	CSV file	csv_element object	none	CSV file description
sequence	matching sequences	string or array of strings	none	matching sequence or array of sequences
required	is required	boolean	false	filter out read if element is not found
min_score	minimal alignment score	int	0	minimal alignment score between the read sequence element and a matching sequence
max_mismatches	maximum number of mismatches	unsigned		either min_score or max_mismatches should be used
allow_gaps	allow gaps	boolean	false	perform gapped alignment
min_match_length	minimal contiguous match	unsigned	0	minimal contiguous matching length (used for gapped alignment)
value_name	value name	string	none	unique name that will be used by other processing steps to refer to the matching sequence
score_name	match score name	string	none	unique name that will be used by other processing steps to refer to the match score
identity_name	match identity name	string	none	unique name that will be used by other processing steps to refer to the match identity
cut_lower	cut before element	Cut object	none	set lower sequence truncation point
cut_upper	cut after element	Cut object	none	set upper sequence truncation point

csv_element object

Id	Name	Type	Default	Description
path		string	<required>	file location
sequences_column	column name or index	string or int	<required>	name or index of the column containing sequences

Position object

Id	Name	Type	Default	Description
pos	position	int	0	position of element start or end
before/after	before/after	string	none	name of matching element relative to which the position is defined

Cut object

Id	Name	Type	Default	Description
before/after	before/after	int	0	set sequence truncation point relative to the element’s beginning or end

Combination object

Id	Name	Type	Default	Description
value_name	value name	string	none	unique name that will be used by other processing steps to refer to the matching combination
csv_file		csv_combination object	<required>	define combinations of matched elements, e.g., bar code pairs

csv_combination object

Id	Name	Type	Default	Description
path		path string	<required>	file location
values_column		array of string: string or string: int pairs	<required>	pairs of either element name - column name or element name - column index; column names and column indices should not appear in the same csv_combination object
names_column		string or int	optional	name or index of the column defining the names of the combinations; the type should be same as in values_column
skip_header		bool	false	if true and column names are not used, skip first line of CSV file

Match external molecular identifier

Id	Tags	Description
eMID_map	demultiplexing	identify which of the given short sequences best matches the external molecular identifier located in the read description line

Parameters

Id	Name	Type	Default	Description
value_name	value name	string	eMID	name of the value
fasta_path		string	<required>	file path with sequences to match eMID
pairs_path		string	“”	CSV file path with sequence name pairs

Write sequences

Id	Tags	Description
write_sequence	output	write each read in FASTA or FASTQ format, possibly into multiple files

Parameters

Id	Name	Type	Default	Description
out_path	???	variable_string_path	<required>	path for output file(s)
unset_value	???	string	<discard>	use this string if a value specified by the out_path is undefined; by default, such reads are discarded
trimmed	trimmed	bool	true	whether trimming defined by previous steps should be applied
reverse_complemented	reverse complemented	bool	true	whether to reverse-complement sequences marked as reverse
skip_empty		bool	true	if true, skip the sequences that were filtered out

Write values

Id	Tags	Description
write_value	output	write specified values for each read in CSV format, possibly into multiple files

Parameters

Id	Name	Type	Default	Description
names	value names	string array	<required>	list of value names to write
out_path	???	variable_string_path	<required>	path for output file(s)
unset_value	???	string	<discard>	use this string if a value specified by the out_path is undefined; by default, such reads are discarded

Find unique sequences

Id	Tags	Description
find_shared		Identify unique sequences and sequences common between different groups of reads

Parameters

Id	Name	Type	Default	Description
min_length	minimal length	unsigned	required with ignore_ends or fraction_match	minimal matching length to consider sequences identical
ignore_ends	ignore end length	unsigned	optional, cannot be used with fraction_match	maximum length of mismatch at sequence ends, requires min_length
fraction_match	fraction of sequence length to match	float (0,1]	optional, cannot be used with ignore_ends	fraction of sequence length to match, requires min_length
min_duplicates	minimal number of copies	unsigned	0	output sequences with at least this many duplicates
trimmed		bool	true	operate on trimmed sequences
reverse		bool	true	reverse-complement sequences labelled as reverse
out_summary		path string	"sharing_summary.csv"	output path for summary statistics
out_redundancy_histogram		path string	<optional>	output path for redundancy histogram
out_unique		variable_string_path	<optional>
out_duplicates		variable_string_path	<optional>
out_group_unique		variable_string_path	<optional>
out_group_duplicates		variable_string_path	<optional>
unset_value				use this string if a variable used in one of the out_* paths is undefined; by default, such reads are discarded

Note: the variable strings in out_unique, out_duplicates, out_group_unique, and out_group_duplicates parameters should contain the same set of variables.

Merge paired reads

Id	Tags	Description
merge_paired	paired_reads	merge forward and reverse reads; this step should only be available for processing paired reads

Parameters

Id	Name	Type	Default	Description
min_score	minimal score	unsigned	0	merge reads if overlap has this minimal score

Apply step

Id	Tags	Description
apply	paired_reads	apply specified step to forward, reverse, or merged sequence reads; use this behind the scenes to process paired reads

Parameters

Id	Name	Type	Default	Description
to		token-string or array of token-strings	<required>	a single token or an array of tokens; allowed tokens: forward, reverse, merged
step		single read processing step object	<required>	apply step to each of specified directions