Importing gene segment sequences

Tip

The mixcr importFromIMGT command is the simplest way to import reference segment sequences from IMGT (see the documentation below).

Automated import of reference sequences from IMGT

To simplify the import of IMGT reference sequences, we developed an interactive bash script that automatically downloads and imports all available reference sequences for a selected species.

The script can be invoked using the mixcr importFromIMGT command, or run directly from the root folder of the MiXCR distribution zip file (importFromIMGT.sh).

The script has the following dependencies:

  • wget
  • pup (see its installation instructions)
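
Because the script relies on both tools, it can help to confirm that they are on the PATH before starting. The following is a minimal sketch and not part of MiXCR; the helper name missing_tools is hypothetical:

```shell
# Print the name of every listed command that is not found on PATH.
# (missing_tools is a hypothetical helper, not shipped with MiXCR.)
missing_tools() {
    for mt_tool in "$@"; do
        command -v "$mt_tool" >/dev/null 2>&1 || printf '%s\n' "$mt_tool"
    done
}

# Check the two dependencies listed above.
mt_missing=$(missing_tools wget pup)
if [ -n "$mt_missing" ]; then
    echo "Missing dependencies:" $mt_missing >&2
else
    echo "wget and pup found; ready to run: mixcr importFromIMGT"
fi
```

If anything is reported as missing, install it first and then start the import.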

To use the script, execute it from any folder where you have write access:

mixcr importFromIMGT

or execute it directly:

/path/to/unzipped/mixcr/importFromIMGT.sh

It will ask you to accept the copyright rules of the IMGT website, to select a species and to provide its common names. The script will then automatically download all required files from the IMGT website and import them into a local loci library.

During execution the script creates a log file for each type of imported segment. See below for an example log file.

After import, the reference sequences can be used as follows:

mixcr align --library local -s macaca ....

Import of V, D and J gene sequences from a file

If you need to analyse data from a species that is not covered by the MiXCR built-in reference library of V, D and J genes, or you just want to use an alternative reference library, you can convert specially formatted fasta files to the MiXCR loci-library format using the importSegments action.

Here is an example command:

mixcr importSegments -p imgt -v human_TRBV.fasta -j human_TRBJ.fasta \
      -d human_TRBD.fasta -l TRB -s 9606:hs -r report.txt

This command takes IMGT-formatted fasta files (like those that can be downloaded on this page) and imports them into a local loci library file (stored in ~/.mixcr/local.ll).
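
The same command can be repeated for other loci. The sketch below only prints the commands it would run. It assumes, hypothetically, that the fasta files follow the naming of the example above (<prefix>_<LOCUS>V.fasta and so on), that one report file per locus is wanted, and that -d can simply be left out for loci that have no D genes. The helper name import_cmd is not part of MiXCR.

```shell
# Print an importSegments command line for one locus.
# Arguments: file prefix, locus, species string (taxonID:name1:...).
# Assumption: files are named <prefix>_<LOCUS>V.fasta, ...J.fasta, ...D.fasta.
import_cmd() {
    ic_prefix=$1; ic_locus=$2; ic_species=$3
    ic_cmd="mixcr importSegments -p imgt"
    ic_cmd="$ic_cmd -v ${ic_prefix}_${ic_locus}V.fasta -j ${ic_prefix}_${ic_locus}J.fasta"
    # Only TRB, TRD and IGH contain D genes; omitting -d for the other
    # loci is an assumption of this sketch.
    case $ic_locus in
        TRB|TRD|IGH) ic_cmd="$ic_cmd -d ${ic_prefix}_${ic_locus}D.fasta" ;;
    esac
    printf '%s\n' "$ic_cmd -l $ic_locus -s $ic_species -r report_${ic_locus}.txt"
}

# Print one command per supported locus for review.
for lc in TRA TRB TRG TRD IGH IGL IGK; do
    import_cmd human "$lc" 9606:hs
done
```

Printing the commands first makes it easy to review them before running any of them.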

Command line parameters

Here is the list of command line parameters for the importSegments action:

-p {params}, --parameters {params}
    Select the import parameters. The parameters determine how fasta headers are parsed and how information about anchor points is extracted (e.g. using specific positions in sequences with IMGT gaps, or searching for specific patterns in the gene sequence).
    Currently, the only possible value is imgt.

-v {file}
    Specify a fasta-formatted file with sequences of V genes.

-d {file}
    Specify a fasta-formatted file with sequences of D genes.

-j {file}
    Specify a fasta-formatted file with sequences of J genes.

-l {locus}, --locus {locus}
    Determines which immunological locus is being imported.
    Possible values: TRA, TRB, TRG, TRD, IGH, IGL, IGK.

-s {taxonID:commName1:..}, --species {...}
    Specify the NCBI Taxonomy ID (e.g. 9606 for human) and a list of common species names for the organism to be imported.
    Example: 9606:hs:hsa:human:homsap

-r {reportFile}, --report {reportFile}
    Specify the report file. The report contains a comprehensive error and warning log of the import procedure, amino acid and nucleotide alignments of the allelic variants imported from the files, and information on the inferred positions of anchor points for all imported genes (see below).

-f
    Force overwriting of already existing locus records in the output file.
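
The value passed to -s is a colon-separated string, so it can be assembled and sanity-checked before running the import. The following is a small sketch; the helper name species_arg is hypothetical, and the rule that at least one common name must be given is an assumption of this example rather than documented MiXCR behaviour.

```shell
# Build the value for -s from a taxonomy ID and one or more common names.
# (species_arg is a hypothetical helper, not part of MiXCR.)
species_arg() {
    sa_taxon=$1
    [ "$#" -gt 0 ] && shift
    # The NCBI Taxonomy ID must be a non-empty string of digits.
    case $sa_taxon in
        ''|*[!0-9]*) echo "taxonomy ID must be numeric: $sa_taxon" >&2; return 1 ;;
    esac
    # Assumption of this sketch: require at least one common name.
    [ "$#" -gt 0 ] || { echo "at least one common name is required" >&2; return 1; }
    sa_result=$sa_taxon
    for sa_name in "$@"; do
        sa_result="$sa_result:$sa_name"
    done
    printf '%s\n' "$sa_result"
}

species_arg 9606 hs hsa human homsap   # prints 9606:hs:hsa:human:homsap
```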

Report file

It is very important to check the import results manually, because the process involves several empirical steps, such as searching for anchor points using patterns in the sequence. MiXCR produces a comprehensive report file containing the errors and warnings raised during import, together with well-formatted nucleotide and amino acid alignments of the allelic variants of V, D and J genes, marked up with anchor points, so that any mistakes can be detected easily.

Here is an example report file record:


TRBV4-1
=======
                    <FR1                                                                      FR1><C
 TRBV4-1*01 [F]   0 GACACTGAAGTTACCCAGACACCAAAACACCTGGTCATGGGAATGACAAATAAGAAGTCTTTGAAATGTGAACAACATAT 79
 TRBV4-1*02 [F]   0                                                                               .. 1

                    DR1     CDR1><FR2                                           FR2><CDR2        CDR
 TRBV4-1*01 [F]  80 GGGGCACAGGGCTATGTATTGGTACAAGCAGAAAGCTAAGAAGCCACCGGAGCTCATGTTTGTCTACAGCTATGAGAAAC 159
 TRBV4-1*02 [F]   2 ............A................................................................... 81

                    2><FR3
 TRBV4-1*01 [F] 160 TCTCTATAAATGAAAGTGTGCCAAGTCGCTTCTCACCTGAATGCCCCAACAGCTCTCTCTTAAACCTTCACCTACACGCC 239
 TRBV4-1*02 [F]  82 ................................................................................ 161

                                              FR3><CDR3          V>
 TRBV4-1*01 [F] 240 CTGCAGCCAGAAGACTCAGCCCTGTATCTCTGCGCCAGCAGCCAAGA 286
 TRBV4-1*02 [F] 162 ..............................................- 207


 **********

                   <FR1                  FR1>CDR1><FR2         FR2><CDR2><FR3
 TRBV4-1*01 [F]  0 DTEVTQTPKHLVMGMTNKKSLKCEQHMGHRAMYWYKQKAKKPPELMFVYSYEKLSINESVPSRFSPECPNSSLLNLHLHA 79
 TRBV4-1*02 [F]  0                           ...................................................... 53

                         FR3><CDR3
 TRBV4-1*01 [F] 80 LQPEDSALYLCASSQ_ 95
 TRBV4-1*02 [F] 54 ................ 69
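
When a report covers many genes, a short checklist of the records to inspect can be useful. The sketch below assumes that every record looks like the example above: the gene name on its own line, underlined by a line of '=' characters, and allele names (such as TRBV4-1*01) as the first field of each alignment row. The helper names report_genes and report_alleles are hypothetical.

```shell
# Print the gene name of every record in a report file.
# Assumption: the name is on its own line, followed by a line of only '='.
report_genes() {
    awk '/^=+$/ && prev != "" { print prev } { prev = $0 }' "$1"
}

# Print every distinct allele name found in the alignment rows.
# Assumption: the allele name is the first field and ends in *<digits>.
report_alleles() {
    awk '$1 ~ /\*[0-9]+$/ { print $1 }' "$1" | sort -u
}

# Example usage, run only if a report file is present in this folder.
if [ -f report.txt ]; then
    report_genes report.txt
    report_alleles report.txt
fi
```

This lists what was imported but does not replace reading the alignments and anchor point markup themselves.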