vdj_pipe pipeline

The vdj_pipe configuration consists of general options, sequencing data inputs, and the processing steps.  The pipeline reads the input files record-by-record and applies the processing steps sequentially to each of the records.  A sequencing data record may include a sequence, its description, and the quality score in case of single-read data, or two sequences and two quality scores for paired-read data.

The processing steps operate by reading and setting the values of a set of variables.  For a single-read pipeline, the following values are provided by the input system:

| ID | Type | Description |
|---|---|---|
| read_id | string | sequencing read ID; no white space characters |
| description | string | sequencing read description string; includes ID; no \n \r characters |
| sequence | string | nucleotide sequence; no white space characters |
| quality | unsigned vector | quality scores; same size as sequence |
| trim | int, int | sequence range marked for further processing; [0, sequence length] initially |
| is_reverse | boolean | true if read is marked as reverse |
| seq_file_path | Path_id | input sequence file ID |
| qual_file_path | Path_id | input quality file ID; same as seq_file_path for FASTQ files |

Notes:

- Parameters without names should not be shown to the user

- Relative input paths are resolved relative to the global base_path_input parameter (if set)

- Relative output paths are resolved relative to the global base_path_output parameter (if set)

- Absolute paths (e.g., on Unix, paths starting with “/”) are used as given

- Output files whose paths have an archive extension (i.e., gz, bz2, z) are compressed automatically

- Compression of input files is determined by their “magic bytes”
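
As a rough illustration of the last note, a compressed input can be recognized from its first bytes rather than from its file name. The sketch below is not vdj_pipe code: the function name detect_compression is hypothetical, and only the well-known gzip and bzip2 signatures are checked.

```python
import bz2
import gzip

# Leading bytes ("magic bytes") of two common compressed formats.
MAGIC = {
    b"\x1f\x8b": "gzip",
    b"BZh": "bzip2",
}


def detect_compression(head: bytes) -> str:
    """Return the compression name for the leading bytes, or 'none'."""
    for magic, name in MAGIC.items():
        if head.startswith(magic):
            return name
    return "none"


# Usage: in practice, pass the first few bytes read from the input file.
print(detect_compression(gzip.compress(b">read1\nACGT\n")))
print(detect_compression(bz2.compress(b">read1\nACGT\n")))
print(detect_compression(b">read1\nACGT\n"))
```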

Some processing steps accept parameters of variable_string type.  A variable_string is a JSON string that may include variable names enclosed in curly brackets {}.  The variables should be defined by one of the previous steps.  The value of a variable_string is generated for each sequencing data record by replacing the enclosed variable names with the variable values converted to strings.  The processing steps may also accept an unset_value parameter that defines a string used as a default variable value.  If the unset_value parameter is not defined and the value of a variable is not set, the value of the variable_string is also not set.  Variable strings can be used, e.g., for writing sequences to different directories and files depending on some variables.

variable_string_path is a variable_string used as an output path. The output to variable_string_path cannot be compressed.  variable_string_path should not include an archive extension.
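
The substitution rules for variable_string can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not the vdj_pipe implementation; the function name expand and the variable names MID and sample are hypothetical.

```python
import re

# A placeholder is a variable name enclosed in curly brackets.
_PLACEHOLDER = re.compile(r"\{([^{}]+)\}")


def expand(template, values, unset_value=None):
    """Expand {name} placeholders using the values of one record.

    Returns None (value not set) if any referenced variable is unset
    and no unset_value default was supplied.
    """
    missing = False

    def substitute(match):
        nonlocal missing
        value = values.get(match.group(1))
        if value is None:
            if unset_value is None:
                missing = True
                return ""
            return unset_value
        return str(value)

    result = _PLACEHOLDER.sub(substitute, template)
    return None if missing else result


# Usage: one output file per (hypothetical) MID and sample value.
print(expand("out/{MID}/{sample}.fasta", {"MID": "MID1", "sample": 7}))
print(expand("out/{MID}.fasta", {}, unset_value="unknown"))
```

Returning None for an unset variable mirrors the rule that the variable_string itself is then not set, which lets a calling step decide to discard the record.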

Supported nucleotide characters

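| Name | Character | Comment |
|---|---|---|
| Adenine | A, a | |
| Cytosine | C, c | |
| Guanine | G, g | |
| Thymine | T, t | |
| Any | N, n | A, C, G, T |
| Uracil | U, u | |
| Purine | R, r | A, G |
| Pyrimidine | Y, y | C, T, U |
| Ketone | K, k | G, T, U |
| Amine | M, m | A, C |
| Strong | S, s | C, G |
| Weak | W, w | A, T, U |
| not_A | B, b | C, G, T |
| not_C | D, d | A, G, T |
| not_G | H, h | A, C, T |
| not_T | V, v | A, C, G |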
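
For scripts that pre- or post-process vdj_pipe data, the table above can be written as a lookup from each code to the bases it stands for. The dictionary below copies the Comment column; the helper names is_ambiguous and matches are hypothetical and are not part of vdj_pipe.

```python
# Code -> bases it may represent, as listed in the table above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "N": "ACGT",
    "R": "AG", "Y": "CTU", "K": "GTU", "M": "AC",
    "S": "CG", "W": "ATU",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def is_ambiguous(code):
    """True if the code stands for more than one base."""
    return len(IUPAC[code.upper()]) > 1


def matches(code, base):
    """True if the (possibly ambiguous) code can represent the base."""
    return base.upper() in IUPAC[code.upper()]


print(matches("r", "G"), matches("Y", "A"), is_ambiguous("N"))
```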

Global options

| Id | Type | Default | Description |
|---|---|---|---|
| base_path_input | directory path string | current directory | Path prefix for input files.  A relative path is resolved relative to the current directory.  The directory should exist. |
| base_path_output | directory path string | | |
| config_output_path | file path | | |
| csv_file_delimiter | char | \t | a comma (,) may be another option |
| external_MIDs | bool | false | |
| max_file_reads | unsigned | inf | |
| max_reads | unsigned | inf | |
| paired_reads | bool | false | |
| plots_list_path | file path | | |
| quality_scores | bool | true | |
| summary_output_path | file path | | |
| input | array of input objects | required | see Input options |
| steps | array of step objects | required | see Processing steps |

Input object

Each input object describes a sequencing data file or multiple files processed together.  Examples are a *.fastq file, a pair of *.fasta and *.qual files, a pair of forward/reverse *.fastq files, or two pairs of forward/reverse *.fasta and *.qual files.

In addition, each input object may contain any number of user-defined name-value pairs.  During the processing, those will be associated with every sequencing data record and could be accessed by the processing steps as a variable of the same name.  All input objects should have the same set of user-defined names. It is recommended that the value types for

Input object keywords

| Id | Type | Default | Description |
|---|---|---|---|
| sequence | file path | | path to *.fasta or *.fastq file; use for single reads |
| quality | file path | | path to *.qual file; sequence should also be specified |
| is_reverse | bool | false | true indicates that the sequences are in reverse direction |
| forward_seq | file path | | path to *.fasta or *.fastq file with sequences in forward direction; use for single or paired reads; for single reads, this is equivalent to sequence keyword or to sequence and is_reverse: false |
| reverse_seq | file path | | path to *.fasta or *.fastq file with sequences in reverse direction; use for single or paired reads; for single reads, this is equivalent to sequence and is_reverse: true |
| forward_qual | file path | | path to *.qual file with quality scores in forward direction; use for single or paired reads; for single reads, this is equivalent to quality keyword or to quality and is_reverse: false; forward_seq should be also specified |
| reverse_qual | file path | | |
| forward_mid | file path | | |
| reverse_mid | file path | | |

Input options examples

Single reads

"input": [
  { "forward_seq": "file1.fastq" },
  { "reverse_seq": "file2.fastq" },
  { "sequence": "file3.fastq" },
  { "sequence": "file4.fastq", "is_reverse": true },
  { "sequence": "file5.fasta", "quality": "file5.qual" }
]
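
The single-read entries above use several spellings that the keyword table describes as equivalent. A small sketch, assuming only those documented equivalences, shows how they reduce to the sequence/is_reverse form; the function name normalize_single is hypothetical and vdj_pipe itself may handle this differently.

```python
def normalize_single(entry):
    """Rewrite a single-read input object to the sequence/is_reverse form.

    forward_seq  -> sequence with is_reverse false
    reverse_seq  -> sequence with is_reverse true
    forward_qual -> quality
    Any other keys (e.g. user-defined name-value pairs) are kept as is.
    """
    if "forward_seq" in entry and "reverse_seq" in entry:
        raise ValueError("paired-read input object, not a single read")
    out = dict(entry)
    if "forward_seq" in out:
        out["sequence"] = out.pop("forward_seq")
        out["is_reverse"] = False
    elif "reverse_seq" in out:
        out["sequence"] = out.pop("reverse_seq")
        out["is_reverse"] = True
    else:
        out.setdefault("is_reverse", False)  # documented default
    if "forward_qual" in out:
        out["quality"] = out.pop("forward_qual")
    return out


print(normalize_single({"reverse_seq": "file2.fastq"}))
```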

Single reads with a user variable

"input": [
  { "forward_seq": "file1.fastq" },
  { "reverse_seq": "file2.fastq" }
]

Paired reads

"input": [
  { "forward_seq": "file1.fastq", "reverse_seq": "file2.fastq" },
  { "forward_seq": "file3.fastq", "reverse_seq": "file4.fastq" }
]

Processing steps

Base composition statistics

| Id | Tags | Description |
|---|---|---|
| composition_stats | stats | Calculate relative abundance of ambiguous base calls and each of the four nucleotides. |

Parameters

| Id | Name | Type | Default | Description |
|---|---|---|---|---|
| out_prefix | | string | “” | Path prefix for output files. |
| out_path_composition | | string | out_prefix + composition.csv | Output path for composition summary file; if this option is used, out_prefix is ignored for this file. |
| out_path_gc_hist | | string | out_prefix + gc_hist.csv | Output path for GC histogram file; if this option is used, out_prefix is ignored for this file. |
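
To make the two outputs concrete, the quantities involved can be computed as follows. This is a minimal sketch, not the code behind composition_stats: the function names are hypothetical, every character other than A, C, G, and T is counted as ambiguous, and the actual file layout and histogram binning used by vdj_pipe are not reproduced here.

```python
from collections import Counter


def composition(sequences):
    """Relative abundance of A, C, G, T and of all other characters."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq.upper())
    total = sum(counts.values())
    if total == 0:
        return {}
    result = {base: counts[base] / total for base in "ACGT"}
    known = sum(counts[base] for base in "ACGT")
    result["ambiguous"] = (total - known) / total
    return result


def gc_percent(seq):
    """GC content of one read in percent (0.0 for an empty read)."""
    s = seq.upper()
    return 100.0 * (s.count("G") + s.count("C")) / len(s) if s else 0.0


print(composition(["ACGT", "NNGG"]))
print(gc_percent("ACGT"))
```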

Read quality statistics

| Id | Tags | Description |
|---|---|---|
| quality_stats | stats | Generate quality score distributions for each position, and generate histograms for both read length and average read quality. |

Parameters

| Id | Name | Type | Default | Description |
|---|---|---|---|---|
| out_prefix | | string | “” | Path prefix for output files. |
| out_path_hm | | string | out_prefix + heat_map.csv | Output path for quality heat map; if this option is used, ‘out_prefix’ is ignored for this file. |
| out_path_stats | | string | out_prefix + qstats.csv | Output path for quality statistics summary; if this option is used, ‘out_prefix’ is ignored for this file. |
| out_path_mq_hist | | string | out_prefix + mean_q_hist.csv | Output path for read quality histogram; if this option is used, ‘out_prefix’ is ignored for this file. |
| out_path_len_hist | | string | out_prefix + len_hist.csv | Output path for read length histogram; if this option is used, ‘out_prefix’ is ignored for this file. |

Filter steps

Filter steps determine whether a sequence read is suitable for further analysis based on some condition, e.g., quality score, length, or composition.

Common parameters for all filter steps:

| Id | Name | Type | Default | Description |
|---|---|---|---|---|
| passed_name | | string | optional | name of boolean variable defined by the step indicating whether the current read passed the filter |

Nucleotide filter

| Id | Tags | Description |
|---|---|---|
| character_filter | filter | Discard reads with non-allowed characters. |

Parameters

| Id | Name | Type | Default | Description |
|---|---|---|---|---|
| chars | allowed nucleotides | string | ACGTacgt | The list of allowed characters may be expanded to contain any of the following case-sensitive characters: ACGTURYKMSWBDHVNacgturykmswbdhvn |
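
The check performed by this filter amounts to a case-sensitive membership test. A minimal sketch, with a hypothetical function name and the documented default for chars:

```python
def passes_character_filter(sequence, chars="ACGTacgt"):
    """True if every character of the read is in the allowed set.

    The comparison is case-sensitive, so 'n' is rejected unless it is
    listed explicitly in chars.
    """
    allowed = set(chars)
    return all(ch in allowed for ch in sequence)


print(passes_character_filter("ACGTacgt"))             # default set
print(passes_character_filter("ACNT"))                 # N not allowed by default
print(passes_character_filter("ACNT", chars="ACGTN"))  # N allowed explicitly
```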

Length filter

| Id | Tags | Description |
|---|---|---|
| length_filter | filter | Discard reads with a length outside of the range [min, max]. |

Parameters

| Id | Name | Type | Default | Description |
|---|---|---|---|---|
| min | minimal length | unsigned | 0 | Minimal accepted read length. |
| max | maximal length | unsigned | <large number> | Maximal accepted length. |
| trim | trim | bool | false | Trim the 3’ end of reads longer than max. |
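
The interaction between max and trim can be sketched as follows. The sketch assumes that the filter works on the half-open trim range of the read and that trimming keeps the first max bases; both points are assumptions for illustration, and the function name is hypothetical.

```python
def length_filter(trim, min_len=0, max_len=None, do_trim=False):
    """Apply a length filter to a (start, end) half-open range.

    Returns the (possibly shortened) range, or None if the read is
    discarded. max_len=None stands for "no upper limit".
    """
    start, end = trim
    length = end - start
    if length < min_len:
        return None
    if max_len is not None and length > max_len:
        if not do_trim:
            return None
        end = start + max_len  # cut the 3' end down to max_len bases
    return (start, end)


print(length_filter((0, 100), min_len=50, max_len=80))                # discarded
print(length_filter((0, 100), min_len=50, max_len=80, do_trim=True))  # trimmed
```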

Homopolymer filter

| Id | Tags | Description |
|---|---|---|
| homopolymer_filter | filter | discard reads that contain a homopolymer longer than a specified length |

Parameters

| Id | Name | Type | Default | Description |
|---|---|---|---|---|
| max_length | max length | unsigned | <required> | maximum homopolymer length |
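
A homopolymer is a run of one repeated base, so the test reduces to finding the longest such run. A minimal sketch with hypothetical function names; treating upper and lower case as the same base is an assumption:

```python
from itertools import groupby


def longest_homopolymer(sequence):
    """Length of the longest run of a single repeated base (0 if empty)."""
    runs = (sum(1 for _ in run) for _, run in groupby(sequence.upper()))
    return max(runs, default=0)


def passes_homopolymer_filter(sequence, max_length):
    """True if no homopolymer in the read is longer than max_length."""
    return longest_homopolymer(sequence) <= max_length


print(longest_homopolymer("ACGTTTTTAC"))
print(passes_homopolymer_filter("ACGTTTTTAC", 4))
```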

Minimal quality filter

| Id | Tags | Description |
|---|---|---|
| min_quality_filter | filter | discard reads that contain a quality score lower than a specified minimum |

Parameters

| Id | Name | Type | Default | Description |
|---|---|---|---|---|
| min_quality | minimal quality | unsigned | <required> | |

Minimal average quality filter

| Id | Tags | Description |
|---|---|---|
| average_quality_filter | filter | discard reads with average quality score lower than a specified minimum |

Parameters

| Id | Name | Type | Default | Description |
|---|---|---|---|---|
| min_quality | minimal quality | float | <required> | |

Minimal quality window filter

| Id | Tags | Description |
|---|---|---|
| min_quality_window_filter | filter, trimming | Find the longest subsequence with every quality score greater than a specified minimum. Discard reads where the found subsequence is shorter than a specified length. |

Parameters

| Id | Name | Type | Default | Description |
|---|---|---|---|---|
| min_quality | minimal quality | unsigned | <required> | minimal quality score |
| min_length | minimal length | unsigned | <required> | minimal interval length |
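
The search for the longest acceptable interval can be done in a single pass over the quality scores. The sketch below is an illustration only: the function names are hypothetical, and it assumes that a score equal to min_quality is accepted, which the description above does not state explicitly.

```python
def longest_quality_run(quality, min_quality):
    """Return (start, end) of the longest run with all scores >= min_quality.

    The range is half-open; (0, 0) means no position qualifies.
    """
    best_start, best_end = 0, 0
    run_start = 0
    for i, q in enumerate(quality):
        if q < min_quality:
            run_start = i + 1  # run is broken; restart after position i
        elif i + 1 - run_start > best_end - best_start:
            best_start, best_end = run_start, i + 1
    return best_start, best_end


def min_quality_window_filter(quality, min_quality, min_length):
    """Return the new trim range, or None if the read is discarded."""
    start, end = longest_quality_run(quality, min_quality)
    return (start, end) if end - start >= min_length else None


scores = [30, 30, 10, 35, 36, 37, 5, 40]
print(min_quality_window_filter(scores, min_quality=20, min_length=3))
```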

Average quality window filter

| Id | Tags | Description |
|---|---|---|
| average_quality_window_filter | filter, trimming | Find the longest subsequence with average quality score greater than a specified minimum. Discard reads where the found subsequence is shorter than a specified length. |

Parameters

| Id | Name | Type | Default | Description |
|---|---|---|---|---|
| min_quality | minimal quality | float | <required> | minimal accepted average quality score |
| window_length | window length | unsigned | <required> | length for computing averages |
| min_length | minimal length | unsigned | window_length | minimal interval length (should be <= window_length) |

Ambiguous nucleotide window filter

| Id | Tags | Description |
|---|---|---|
| ambiguous_window_filter | filter, trimming | Find the longest subsequence with no more than a specified number of ambiguous base calls.  Discard reads where the found subsequence is shorter than a specified length. |

Parameters

| Id | Name | Type | Default | Description |
|---|---|---|---|---|
| min_length | minimal length | unsigned | 0 | minimal accepted read interval length; if zero, require that the whole read contains at most max_ambiguous ambiguous nucleotides |
| max_ambiguous | maximal number of ambiguous nucleotides | unsigned | 0 | |
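
One way to find such an interval is a sliding window that grows to the right and shrinks from the left whenever it holds too many ambiguous bases. The sketch below follows the parameter descriptions above but is not the vdj_pipe implementation; the function names are hypothetical, and any character other than A, C, G, or T is treated as ambiguous.

```python
def longest_unambiguous_window(sequence, max_ambiguous=0):
    """Longest half-open range with at most max_ambiguous non-ACGT bases."""
    best = (0, 0)
    left = 0
    ambiguous = 0
    for right, ch in enumerate(sequence):
        if ch.upper() not in "ACGT":
            ambiguous += 1
        while ambiguous > max_ambiguous:  # shrink until the window is valid
            if sequence[left].upper() not in "ACGT":
                ambiguous -= 1
            left += 1
        if right + 1 - left > best[1] - best[0]:
            best = (left, right + 1)
    return best


def ambiguous_window_filter(sequence, min_length=0, max_ambiguous=0):
    """Return the new trim range, or None if the read is discarded."""
    start, end = longest_unambiguous_window(sequence, max_ambiguous)
    if min_length == 0:
        # Special case: the whole read must satisfy the limit.
        return (start, end) if end - start == len(sequence) else None
    return (start, end) if end - start >= min_length else None


print(ambiguous_window_filter("ACNNACGTNAC", min_length=4))
```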

Histogram

| Id | Tags | Description |
|---|---|---|
| histogram | stats | Build a histogram of value occurrences |

Parameters

| Id | Name | Type | Default | Description |
|---|---|---|---|---|
| name | | string or array of strings | <required> | The name of the variable to build a histogram from; the available names are dependent on the configuration used. |
| out_path | | string | variable name(s) | output path for histogram CSV file; by default, variable names delimited by “_”, followed by “.csv” |
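
The default output name and the counting itself can be sketched as below. The function names and the record layout (one dictionary of variables per read) are assumptions made for illustration; the variable name MID is hypothetical.

```python
from collections import Counter


def as_list(names):
    """Accept either a single name or a list of names."""
    return [names] if isinstance(names, str) else list(names)


def default_out_path(names):
    """Variable names joined by '_' and followed by '.csv'."""
    return "_".join(as_list(names)) + ".csv"


def build_histogram(records, names):
    """Count how often each combination of variable values occurs."""
    keys = as_list(names)
    return Counter(tuple(rec.get(k) for k in keys) for rec in records)


records = [
    {"MID": "MID1", "is_reverse": False},
    {"MID": "MID1", "is_reverse": True},
    {"MID": "MID2", "is_reverse": False},
]
print(default_out_path(["MID", "is_reverse"]))
print(build_histogram(records, "MID"))
```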

Match sequence element

| Id | Tags | Description |
|---|---|---|
| match | demultiplexing, alignment, filter, trimming | find one or many sequence features in relationship to each other positionally, by strict or fuzzy matching, or by alignment |

Parameters

| Id | Name | Type | Default | Description |
|---|---|---|---|---|
| reverse | reverse-complement | boolean | false | search reverse-complemented sequences for reads marked as reverse |
| trimmed | trimmed | boolean | false | search within trimming marks set by previous steps |
| elements | match elements | array of Element objects | <required> | define sequence features to search for |
| combinations | element combinations | array of Combination objects | <empty> | define feature combinations |

Element object

| Id | Name | Type | Default | Description |
|---|---|---|---|---|
| start/end | element start/end | Position object | <empty> | define beginning or end of element |
| length | | unsigned | | length of match element; required if the element is defined by position only (no sequences provided); if sequences are provided and the length is greater than the sequence length, scanning will be done to find the best position |
| seq_file | sequence file | string | none | path to fasta/fastq file for matching sequences |
| csv_file | CSV file | csv_element object | none | CSV file description |
| sequence | matching sequences | string or array of strings | none | matching sequence or array of sequences |
| required | is required | boolean | false | filter out read if element is not found |
| min_score | minimal alignment score | int | 0 | minimal alignment score between the read sequence element and a matching sequence |
| max_mismatches | maximum number of mismatches | unsigned | | either min_score or max_mismatches should be used |
| allow_gaps | allow gaps | boolean | false | perform gapped alignment |
| min_match_length | minimal contiguous match | unsigned | 0 | minimal contiguous matching length (used for gapped alignment) |
| value_name | value name | string | none | unique name that will be used by other processing steps to refer to the matching sequence |
| score_name | match score name | string | none | unique name that will be used by other processing steps to refer to the match score |
| identity_name | match identity name | string | none | unique name that will be used by other processing steps to refer to the match identity |
| cut_lower | cut before element | Cut object | none | set lower sequence truncation point |
| cut_upper | cut after element | Cut object | none | set upper sequence truncation point |
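
The scanning behavior mentioned for length, combined with max_mismatches, can be pictured as an ungapped search for the best placement of each candidate inside a region of the read. The sketch below is a simplified stand-in, not the matching algorithm used by vdj_pipe: the function name best_match is hypothetical, mismatches are counted by plain character comparison, and ambiguity codes, alignment scores, and gaps are ignored.

```python
def best_match(read, candidates, start, length, max_mismatches):
    """Scan read[start:start+length] for the best ungapped placement.

    Returns (candidate, position, mismatches) for the placement with the
    fewest mismatches, or None if none is within max_mismatches.
    """
    best = None
    region_end = min(start + length, len(read))
    for cand in candidates:
        for pos in range(start, region_end - len(cand) + 1):
            window = read[pos:pos + len(cand)]
            mismatches = sum(a != b for a, b in zip(window, cand))
            if mismatches <= max_mismatches and (best is None or mismatches < best[2]):
                best = (cand, pos, mismatches)
    return best


read = "TTACGTACGGA"
print(best_match(read, ["ACGAACGG"], start=0, length=11, max_mismatches=1))
```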

csv_element object

| Id | Name | Type | Default | Description |
|---|---|---|---|---|
| path | | string | <required> | file location |
| sequences_column | column name or index | string or int | <required> | name or index of the column containing sequences |

Position object

| Id | Name | Type | Default | Description |
|---|---|---|---|---|
| pos | position | int | 0 | position of element start or end |
| before/after | before/after | string | none | name of matching element relative to which the position is defined |

Cut object

| Id | Name | Type | Default | Description |
|---|---|---|---|---|
| before/after | before/after | int | 0 | set sequence truncation point relative to the element’s beginning or end |

Combination object

| Id | Name | Type | Default | Description |
|---|---|---|---|---|
| value_name | value name | string | none | unique name that will be used by other processing steps to refer to the matching combination |
| csv_file | | csv_combination object | <required> | define combinations of matched elements, e.g., bar code pairs |

csv_combination object

| Id | Name | Type | Default | Description |
|---|---|---|---|---|
| path | | path string | <required> | file location |
| values_column | | array of string: string or string: int pairs | <required> | pairs of either element name - column name or element name - column index; column names and column indices should not appear in the same csv_combination object |
| names_column | | string or int | optional | name or index of the column defining the names of the combinations; the type should be same as in values_column |
| skip_header | | bool | false | if true and column names are not used, skip first line of CSV file |

Match external molecular identifier

| Id | Tags | Description |
|---|---|---|
| eMID_map | demultiplexing | identify which of the given short sequences best matches external molecular identifier located in read description line |

Parameters

| Id | Name | Type | Default | Description |
|---|---|---|---|---|
| value_name | value name | string | eMID | name of the value |
| fasta_path | | string | <required> | file path with sequences to match eMID |
| pairs_path | | string | “” | CSV file path with sequence name pairs |

Write sequences

| Id | Tags | Description |
|---|---|---|
| write_sequence | output | write each read in FASTA or FASTQ format, possibly into multiple files |

Parameters

| Id | Name | Type | Default | Description |
|---|---|---|---|---|
| out_path | ??? | variable_string_path | <required> | path for output file(s) |
| unset_value | ??? | string | <discard> | use this string if a value specified by the out_path is undefined; by default, such reads are discarded |
| trimmed | trimmed | bool | true | whether trimming defined by previous steps should be applied |
| reverse_complemented | reverse complemented | bool | true | whether to reverse-complement sequences marked as reverse |
| skip_empty | | bool | true | if true, skip the sequences that were filtered out |

Write values

| Id | Tags | Description |
|---|---|---|
| write_value | output | write specified values for each read in CSV format, possibly into multiple files |

Parameters

| Id | Name | Type | Default | Description |
|---|---|---|---|---|
| names | value names | string array | <required> | list of value names to write |
| out_path | ??? | variable_string_path | <required> | path for output file(s) |
| unset_value | ??? | string | <discard> | use this string if a value specified by the out_path is undefined; by default, such reads are discarded |

Find unique sequences

| Id | Tags | Description |
|---|---|---|
| find_shared | | Identify unique sequences and sequences common between different groups of reads |

Parameters

| Id | Name | Type | Default | Description |
|---|---|---|---|---|
| min_length | minimal length | unsigned | required with ignore_ends or fraction_match | minimal matching length to consider sequences identical |
| ignore_ends | ignore end length | unsigned | optional, cannot be used with fraction_match | maximum length of mismatch at sequence ends, requires min_length |
| fraction_match | fraction of sequence length to match | float (0,1] | optional, cannot be used with ignore_ends | fraction of sequence length to match, requires min_length |
| min_duplicates | minimal number of copies | unsigned | 0 | output sequences with at least this many duplicates |
| trimmed | | bool | true | operate on trimmed sequences |
| reverse | | bool | true | reverse-complement sequences labelled as reverse |
| out_summary | | path string | "sharing_summary.csv" | output path for summary statistics |
| out_redundancy_histogram | | path string | <optional> | output path for redundancy histogram |
| out_unique | | variable_string_path | <optional> | |
| out_duplicates | | variable_string_path | <optional> | |
| out_group_unique | | variable_string_path | <optional> | |
| out_group_duplicates | | variable_string_path | <optional> | |
| unset_value | | | | use this string if a value specified by the out_path is undefined; by default, such reads are discarded |

Note: the variable strings in out_unique, out_duplicates, out_group_unique, and out_group_duplicates parameters should contain the same set of variables.

Merge paired reads

| Id | Tags | Description |
|---|---|---|
| merge_paired | paired_reads | merge forward and reverse reads. This step should only be available for processing paired reads |

Parameters

| Id | Name | Type | Default | Description |
|---|---|---|---|---|
| min_score | minimal score | unsigned | 0 | merge reads if overlap has this minimal score |

<wrapper step>

| Id | Tags | Description |
|---|---|---|
| apply | paired_reads | apply specified step to forward, reverse, or merged sequence reads; use this behind the scenes to process paired reads |

Parameters

| Id | Name | Type | Default | Description |
|---|---|---|---|---|
| to | | token-string or array of token-strings | <required> | either a single or an array of string tokens: forward reverse merged |
| step | | single read processing step object | <required> | apply step to each of specified directions |