The assemble command builds a set of clones using alignments
obtained with the align command in order to extract
specific gene regions (e.g. CDR3). The syntax of
assemble is the following:
mixcr assemble [options] alignments.vdjca output.clns
The following flowchart shows the pipeline of
This pipeline consists of the following steps:
The assembler sequentially processes records (aligned reads) from the input
.vdjca file produced by align. In the first step, the assembler tries to extract gene feature sequences (called the clonal sequence) from aligned reads, as specified by
CDR3 by default); the clonotypes are assembled with respect to the clonal sequence. If an aligned read does not contain the clonal sequence (e.g. the
CDR3 region), it is dropped.
If the clonal sequence contains at least one nucleotide with low quality (less than the
badQualityThreshold parameter value), the record is deferred for further processing by the mapping procedure. If the fraction of low-quality nucleotides in a deferred record is greater than the
maxBadPointsPercent parameter value, the record is dropped. Records whose clonal sequence contains only good-quality nucleotides are used to build core clonotypes by grouping records with identical clonal sequences (e.g. CDR3). The sequence quality of the resulting core clonotype is equal to the total of the qualities of the assembled reads. Each core clonotype has two main properties: the clonal sequence and
count, the number of records aggregated by this clonotype.
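The initial assembling step described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not MiXCR's implementation: the function name build_core_clonotypes, the record layout (a clonal sequence plus a list of per-nucleotide quality scores) and the threshold values used in the example are hypothetical.

```python
def build_core_clonotypes(records, bad_quality_threshold, max_bad_points_percent):
    """Group good-quality records into core clonotypes (illustrative sketch).

    records: iterable of (clonal_sequence, qualities) pairs, where qualities
    is a list of per-nucleotide quality scores (assumed record layout).
    """
    cores = {}     # clonal sequence -> {"count": int, "quality": [int, ...]}
    deferred = []  # records kept for the later mapping step
    dropped = 0    # records with too many low-quality nucleotides
    for seq, quals in records:
        bad = sum(1 for q in quals if q < bad_quality_threshold)
        if bad == 0:
            # Only good nucleotides: aggregate by exact clonal sequence.
            core = cores.setdefault(seq, {"count": 0, "quality": [0] * len(seq)})
            core["count"] += 1
            # Clonotype quality is the per-position total of read qualities.
            core["quality"] = [a + b for a, b in zip(core["quality"], quals)]
        elif bad / len(quals) > max_bad_points_percent:
            dropped += 1
        else:
            deferred.append((seq, quals))
    return cores, deferred, dropped


# Illustrative input; the thresholds here are not MiXCR defaults.
records = [
    ("CASS", [30, 30, 30, 30]),
    ("CASS", [25, 25, 25, 25]),
    ("CASS", [30, 5, 30, 30]),   # one bad nucleotide: deferred
    ("CAAA", [5, 5, 5, 30]),     # mostly bad nucleotides: dropped
]
cores, deferred, dropped = build_core_clonotypes(records, 20, 0.5)
```

In this sketch the two clean reads form one core clonotype with count 2, the read with a single low-quality position is kept for mapping, and the last read is discarded.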
After the core clonotypes are built, MiXCR runs a mapping procedure that processes the records deferred in the previous step. Mapping aims to rescue quantitative information from low-quality reads. Each deferred record is mapped onto the already assembled clonotypes: if there is a fuzzy match, the record is aggregated by the corresponding clonotype; if several clonotypes match, a single one is chosen randomly with weights equal to the clonotype counts. If no matches are found, the record is dropped.
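The weighted choice among matching clonotypes can be illustrated with a short Python sketch. It assumes, for simplicity, that a "fuzzy match" means an equal-length sequence with at most max_mismatches differing positions; MiXCR's actual matching criteria may differ, and the names map_deferred_record and max_mismatches are hypothetical.

```python
import random


def mismatches(a, b):
    """Number of differing positions, or None when the lengths differ."""
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def map_deferred_record(seq, clonotype_counts, max_mismatches=1, rng=random):
    """Pick a clonotype for a deferred record, or None if nothing matches.

    clonotype_counts: dict of clonal sequence -> positive count.
    The fuzzy-match rule used here is a simplifying assumption.
    """
    candidates = []
    for clone_seq in clonotype_counts:
        d = mismatches(seq, clone_seq)
        if d is not None and d <= max_mismatches:
            candidates.append(clone_seq)
    if not candidates:
        return None  # no match: the record is dropped
    # Several matches: choose one at random, weighted by clonotype count.
    weights = [clonotype_counts[c] for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


counts = {"CASSL": 10, "CATTT": 3}
chosen = map_deferred_record("CASSA", counts)  # only "CASSL" is within one mismatch
```

Passing a seeded random.Random instance as rng makes the choice reproducible, which is convenient when testing.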
After clonotypes are assembled by the initial assembler and the mapper, MiXCR proceeds to clustering. The clustering algorithm tries to find fuzzy matches between clonotypes and organizes matched clonotypes into a hierarchical tree (cluster), where each child layer is highly similar to its parent but has a significantly smaller
count. Thus, clonotypes with small counts are attached to highly similar “parent” clonotypes with significantly greater counts. A typical cluster looks as follows:
After all clusters are built, only their heads are considered as final clones. The maximal depth of a cluster, the fuzzy matching criteria, the relative counts of parents and children, and other parameters can be customized using the
clusteringStrategy parameters described below.
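The idea of attaching small clonotypes to similar, much larger parents can be sketched as a simple greedy procedure. This is a hedged illustration rather than the algorithm MiXCR uses: the function cluster_clonotypes, the fixed max_ratio between child and parent counts, and the mismatch-based fuzzy match are all assumptions made for the example.

```python
def mismatches(a, b):
    """Number of differing positions, or None when the lengths differ."""
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def cluster_clonotypes(counts, max_ratio=0.1, max_depth=2, max_mismatches=1):
    """Greedy sketch: attach each clonotype to the largest suitable parent.

    counts: dict of clonal sequence -> count.
    max_depth: maximum number of layers below the head (assumed meaning).
    Returns (heads, parent), where parent maps child -> parent sequence.
    """
    parent = {}
    depth = {}
    # Process from the most abundant clonotype to the least abundant one.
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    for i, seq in enumerate(ordered):
        depth[seq] = 0
        for cand in ordered[:i]:
            d = mismatches(seq, cand)
            if (depth[cand] < max_depth
                    and counts[seq] < counts[cand] * max_ratio
                    and d is not None and d <= max_mismatches):
                parent[seq] = cand
                depth[seq] = depth[cand] + 1
                break
    heads = [s for s in ordered if s not in parent]
    return heads, parent


counts = {"CASSLG": 1000, "CASSLA": 20, "CASSAA": 1, "CTTTTT": 500}
heads, parent = cluster_clonotypes(counts)
```

Here CASSLA attaches to CASSLG, CASSAA attaches to CASSLA in the second layer, and only the two heads (CASSLG and CTTTTT) would remain as final clones.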
The final step is to align clonal sequences to the reference V, D, J and C genes. Since the
assemblingFeatures are different from those used in
align, it is necessary to rebuild alignments for clonal sequences. These alignments are built by a more accurate aligner (since all hits are known in advance); thus, better alignments are built for each clonal sequence.
The result is written to the binary output file (
.clns) with comprehensive information about clones.
Command line parameters
The command line options of
assemble are the following:
||Print help message.|
||Report file name. If this option is not specified, no report file will be produced. See below for a detailed description of report fields.|
||number of available CPU cores||Number of processing threads.|
||Specify a file that will store information about the particular reads aggregated by each clone (mapping readId -> cloneId).|
||Overrides default value of assembler
All parameters are optional.
MiXCR uses a wide range of parameters that control assembler behaviour.
There are some global parameters and parameters organized in groups for
each stage of assembling:
cloneFactoryParameters. Each group of parameters may contain further
subgroups of parameters etc. In order to override some parameter value
one can use
-O followed by fully qualified parameter name and
parameter value (e.g.
One of the key MiXCR features is the ability to assemble clonotypes by
the sequence of a custom gene region (e.g.
target clonal sequence can even be disjoint. This region can be
assemblingFeatures parameter, as in the following
mixcr assemble -OassemblingFeatures="[V5UTR+L1+L2+FR1,FR3+CDR3]" alignments.vdjca output.clns
assemblingFeatures must cover
Other global parameters are:
||Minimal length of clonal sequence|
||Minimal value of sequencing quality score: nucleotides with lower quality are considered “bad”. If a sequencing read contains at least one “bad” nucleotide within the target gene region, it is deferred at the initial assembling stage for further processing by the mapper.|
||Maximal allowed fraction of “bad” points in sequence: if sequence contains more than
||Algorithm used for aggregation of total clonal sequence quality during assembling
of sequencing reads. Possible values:
||Minimal allowed quality of each nucleotide of assembled clone. If at least one
nucleotide in the assembled clone has quality lower than
||Aggregate cluster counts when assembling final clones: if
One can override these parameters in the following way:
mixcr assemble -ObadQualityThreshold=10 alignments.vdjca output.clns
In order to prevent mapping of low-quality reads (that is, to filter them out), one should set
maxBadPointsPercent to zero:
mixcr assemble -OmaxBadPointsPercent=0 alignments.vdjca output.clns
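When several overrides are combined, it can be convenient to build the command line programmatically. The following Python helper is a hypothetical convenience wrapper, not part of MiXCR; it only assembles an argument list using the -O and --report options mentioned in this document and does not run anything.

```python
import shlex


def assemble_command(alignments, output, overrides=None, report=None):
    """Build an argument list for `mixcr assemble` (hypothetical helper).

    overrides: dict of fully qualified parameter name -> value, each
    rendered as -Oname=value, as in the examples above.
    """
    cmd = ["mixcr", "assemble"]
    if report is not None:
        cmd += ["--report", report]
    for name, value in (overrides or {}).items():
        cmd.append(f"-O{name}={value}")
    cmd += [alignments, output]
    return cmd


cmd = assemble_command(
    "alignments.vdjca",
    "output.clns",
    overrides={"badQualityThreshold": 10, "maxBadPointsPercent": 0},
)
# A shell-ready string can be produced for logging or copy-pasting.
command_line = shlex.join(cmd)
```

The resulting list can be passed to subprocess.run on a machine where mixcr is installed; keeping it as a list avoids shell quoting problems with values such as bracketed feature lists.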
Separation of clones with the same CDR3 (clonal sequence) but different V/J/C genes
Since v1.8, MiXCR can separate clones with equal clonal sequences but different V, J and C genes (e.g. to distinguish clones with different IG isotypes).
To make the analysis more robust to sequencing errors, there is an additional clustering step that shrinks the artificial diversity generated by this separation mechanism.
The following criteria are used in this pre-clustering step: more abundant clone (
smaller clone (
clone2.count < clone1.count * maximalPreClusteringRatio (
denotes the number of reads in the corresponding clone) and
clone2 contains the top V/J/C gene from
its corresponding gene list.
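One possible reading of these criteria is sketched below in Python. The sketch assumes that both clones share the same clonal sequence, that the top gene of the more abundant clone (clone1) must appear in the corresponding gene list of the smaller clone (clone2), and that every gene type listed for clone1 is checked; the function name can_absorb, the dictionary layout and the gene names are illustrative only.

```python
def can_absorb(clone1, clone2, maximal_pre_clustering_ratio):
    """Return True if clone1 may absorb clone2 under the assumed criteria.

    Each clone is a dict with "count" (number of reads) and "genes", a dict
    mapping gene type ("V", "J", "C") to a list of hits, best hit first.
    """
    # Count criterion, written as in the text above.
    if not clone2["count"] < clone1["count"] * maximal_pre_clustering_ratio:
        return False
    # Gene criterion (assumed reading): the top gene of clone1 must be
    # present in the corresponding gene list of clone2.
    for gene_type, hits1 in clone1["genes"].items():
        if not hits1:
            continue
        if hits1[0] not in clone2["genes"].get(gene_type, []):
            return False
    return True


# Illustrative clones; gene names and the ratio are example values.
big = {"count": 100, "genes": {"V": ["IGHV1"], "J": ["IGHJ4"], "C": ["IGHG1"]}}
small = {"count": 5, "genes": {"V": ["IGHV3", "IGHV1"], "J": ["IGHJ4"], "C": ["IGHG2", "IGHG1"]}}
other = {"count": 5, "genes": {"V": ["IGHV1"], "J": ["IGHJ4"], "C": ["IGHA1"]}}

merge_small = can_absorb(big, small, 0.1)   # counts and gene lists are compatible
merge_other = can_absorb(big, other, 0.1)   # the C gene list lacks IGHG1
```

With separation by C gene enabled, a rule of this kind would keep a genuinely different isotype (the second small clone) as its own clone while merging ambiguous low-count variants.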
The following parameters control separation behaviour and pre-clustering:
||See conditions for clustering above for more information.|
For example, to separate IG clones by isotype, use the following option:
mixcr assemble -OseparateByC=true alignments.vdjca output.clns
Parameters that control the clustering procedure are placed in the
cloneClusteringParameters parameter group, which determines the rules for the frequency-based correction of PCR and sequencing errors:
||Maximum number of cluster layers (not including head).|
||Maximum allowed number of mutations in N regions
(non-template nucleotides in VD, DJ or VJ junctions): if
two fuzzy matched clonal sequences will contain more than
||Parameters that control fuzzy match criteria between
clones in adjacent layers. Available predefined values:
||Probability of a single nucleotide mutation in clonal
sequence which has non-hypermutation origin (i.e. PCR or
sequencing error). This parameter controls relative counts
between two clones in adjacent layers: a smaller clone can
be attached to a larger one if its count is smaller than
count of parent multiplied by
One can override these parameters in the following way:
mixcr assemble -OcloneClusteringParameters.searchParameters=oneMismatchOrIndel alignments.vdjca output.clns
In order to turn off clustering, use the following parameter:
mixcr assemble -OcloneClusteringParameters=null alignments.vdjca output.clns
A summary of the assemble procedure can be exported with the
--report option. The report is appended to the end of the file if it already exists, so the same file name can be used in several analysis runs.
The report contains the following lines:
|Final clonotype count||Number of clonotypes after all error correction steps|
|Average number of reads per clonotype|
|Reads used in clonotypes, percent of total||Sum of all clonotype abundances. Percent is calculated from the initial number of reads processed on the
|Reads used in clonotypes before clustering, percent of total||The same as above, but before clustering step. If
|Number of reads used as a core, percent of used||Number of reads with clonal sequence (e.g. CDR3) having all positions quality scores above
|Mapped low quality reads, percent of used||Number of reads mapped during low quality reads mapping. See above for details. Percent of “Reads used in clonotypes”.|
|Reads clustered in PCR error correction, percent of used||Number of reads in clonotypes that were clustered during clustering step.|
|Reads pre-clustered due to the similar VJC-lists, percent of used||Reads in clonotypes with the same clonal sequence, that were merged into more reliable clonotypes during clonotype splitting by V/J/C genes. This value will be zero if all
|Reads dropped due to the lack of a clone sequence||Reads where MiXCR failed to extract clonal sequence. Each read should fully cover clonal sequence (specified by
|Reads dropped due to low quality||Reads having too many positions with low quality score. Percent is calculated from the initial number of reads processed on the
|Reads dropped due to failed mapping||Reads with at least one low quality score position in the clonal sequence, that were not mapped to any clonotype during mapping step. Percent is calculated from the initial number of reads processed on the
|Reads dropped with low quality clones||Number of reads in clonotypes having at least one position with aggregated quality score less than
|Clonotypes eliminated by PCR error correction||Number of clonotypes eliminated on the clustering step|
|Clonotypes dropped as low quality||Number of clonotypes having at least one position with aggregated quality score less than
|Clonotypes pre-clustered due to the similar VJC-lists||Number of clonotypes with the same clonal sequence, that were merged into more reliable clonotypes during clonotype splitting by V/J/C genes. This value will be zero if all