MIGEC ANALYSIS REPORT

Working directory: /Users/mikesh/Programming/repseq-tutorial/part1

Created: 09 August, 2015

Step I: Checkout

This section summarizes the efficiency of sample barcode matching and unique molecular identifier (UMI) extraction.

Command line:

## migec-1.2.1b.jar Checkout -ou barcodes.txt trb_R1.fastq.gz trb_R2.fastq.gz checkout/

Fig.1 Total extraction efficiency

Overall extraction efficiency, showing the success rate of finding:

master: the main barcode, containing the sample-specific sequence
slave: the auxiliary barcode, used to protect from contamination; only checked if master is detected

and the rate of read overlapping (overlapped), if enabled.
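The master-then-slave check described above can be sketched as follows. This is a minimal illustration, not MIGEC's implementation: it assumes exact substring matching, and the function name `classify_pair`, the toy reads and the barcode sequences are all hypothetical.

```python
from collections import Counter

def classify_pair(read1, read2, master, slave=None):
    """Classify one read pair. The slave barcode is only checked
    once the master barcode has been found (exact match assumed)."""
    if master not in read1:
        return "master_missing"
    if slave is not None and slave not in read2:
        return "slave_missing"
    return "ok"

# Hypothetical read pairs and barcodes, for illustration only.
pairs = [
    ("ACGTTTGA", "GGCATT"),  # both barcodes present
    ("TTTTTTTT", "GGCATT"),  # master absent, slave never checked
    ("ACGTAAAA", "TTTTTT"),  # master present, slave absent
]
counts = Counter(classify_pair(r1, r2, "ACGT", "GGCA") for r1, r2 in pairs)
efficiency = counts["ok"] / len(pairs)
print(counts, efficiency)
```

Dividing the "ok" count by the total number of read pairs gives an extraction efficiency of the kind summarized in Fig.1.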

Fig.2 Extraction efficiency by sample

Distribution of reads recovered by sample

Step II: Histogram

This section contains data on sequencing depth based on counting the number of reads tagged with the same UMI, i.e. reads that belong to the same Molecular Identifier Group (MIG).
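As a rough sketch of the counting described above, reads can be grouped by UMI and the size of each group taken as the MIG size. The UMI strings are made up, and the helper name `mig_sizes_by_umi` is hypothetical rather than part of MIGEC.

```python
from collections import Counter

def mig_sizes_by_umi(read_umis):
    """Map each UMI to the number of reads tagged with it (its MIG size)."""
    return Counter(read_umis)

# One UMI per read; toy values for illustration only.
read_umis = ["AAAC", "AAAC", "AAAC", "GGTT", "GGTT", "CATG"]
sizes = mig_sizes_by_umi(read_umis)
print(sizes)
```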

Command line:

## migec-1.2.1b.jar Histogram checkout/ histogram/

Fig.3 UMI coverage plot

Below is the plot of MIG size distribution that was used to estimate MIG size threshold for assembly (shown by dashed lines).

Note that the distribution is itself weighted by the number of reads to highlight the mean coverage value and to reflect the percent of reads that will be retained for a given MIG size threshold.
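The weighting can be illustrated with a small sketch: each MIG contributes its read count rather than 1 to its size bin, so summing the bins at or above a threshold gives the fraction of reads retained. The function names and the toy MIG sizes are assumptions, and treating "retained" as size >= threshold is also an assumption.

```python
from collections import Counter

def read_weighted_histogram(mig_sizes):
    """For each MIG size, sum the reads held by MIGs of that size."""
    hist = Counter()
    for size in mig_sizes:
        hist[size] += size  # weight by reads, not by number of MIGs
    return dict(sorted(hist.items()))

def retained_fraction_by_threshold(weighted_hist):
    """Fraction of all reads in MIGs with size >= each candidate threshold."""
    total = sum(weighted_hist.values())
    return {
        t: sum(reads for size, reads in weighted_hist.items() if size >= t) / total
        for t in weighted_hist
    }

mig_sizes = [1, 1, 1, 1, 2, 5, 8, 10]  # toy data
weighted = read_weighted_histogram(mig_sizes)
retained = retained_fraction_by_threshold(weighted)
print(weighted)
print(retained)
```

In this toy data, half of the MIGs have size 1, yet they hold only 4 of 29 reads, which is why the weighted distribution peaks at larger sizes.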

Step III: Assemble

This section contains information on the assembly of MIG consensuses from raw reads.

Command line:

## migec-1.2.1b.jar AssembleBatch --force-overseq 5 --force-collision-filter --default-mask 0:1 checkout/ histogram/ assemble/

Fig.4 Number of assembled MIGs

Below is a plot showing the total number of assembled MIGs per sample. The number of MIGs should be interpreted as the total number of starting molecules that have been successfully recovered.

Fig.5 Frequency of productive UMI tags

Next comes the plot showing the fraction of UMIs that have resulted in assembled consensuses.

This value is typically low because it counts UMIs that were filtered out for not passing the MIG size threshold. Such UMIs mostly arise from sequencing errors, which generate a large amount of artificial UMI diversity.

Fig.6 Fraction of assembled reads

The fraction of reads contained within assembled MIGs, relative to the total number of reads. This number should be high for a high-quality experiment.
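The contrast between Fig.5 and Fig.6 can be seen in a small worked example. The sketch below assumes a MIG is assembled when its size is at least the threshold and ignores reads dropped during consensus assembly. The name `assembly_fractions` and the numbers are hypothetical.

```python
def assembly_fractions(mig_sizes, threshold):
    """Return (fraction of UMIs kept, fraction of reads kept)
    when only MIGs with size >= threshold are assembled."""
    kept = [s for s in mig_sizes if s >= threshold]
    umi_fraction = len(kept) / len(mig_sizes) if mig_sizes else 0.0
    total_reads = sum(mig_sizes)
    read_fraction = sum(kept) / total_reads if total_reads else 0.0
    return umi_fraction, read_fraction

# Six single-read UMIs (likely errors) and three well-covered MIGs.
toy_sizes = [1, 1, 1, 1, 1, 1, 6, 8, 10]
umi_fraction, read_fraction = assembly_fractions(toy_sizes, threshold=5)
print(umi_fraction, read_fraction)
```

Here only a third of the UMIs are productive, yet 80% of the reads end up in assembled MIGs.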

Fig.7 Fraction of reads not fitting MIG consensus

The fraction of reads that were dropped during consensus assembly. Reads are dropped if they differ substantially from the reads that form the core of the consensus sequence. High values indicate low sequencing quality and/or the presence of library preparation artifacts.
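One simple way to picture read dropping is a majority-vote consensus followed by removal of reads that disagree with it too often. This is a simplified stand-in, not MIGEC's actual algorithm: it assumes non-empty, equal-length, pre-aligned reads, and the 20% mismatch cutoff is an arbitrary illustrative value.

```python
from collections import Counter

def majority_consensus(reads):
    """Most common base at each position of equal-length reads."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*reads))

def assemble_with_dropping(reads, max_mismatch_fraction=0.2):
    """Build a draft consensus, drop reads that differ from it too much,
    then rebuild the consensus from the remaining reads."""
    draft = majority_consensus(reads)
    kept = [
        r for r in reads
        if sum(a != b for a, b in zip(r, draft)) / len(draft) <= max_mismatch_fraction
    ]
    consensus = majority_consensus(kept) if kept else draft
    return consensus, len(reads) - len(kept)

toy_reads = ["ACGTACGT", "ACGTACGT", "ACGAACGT", "TTTTTTTT"]  # toy MIG
consensus, dropped = assemble_with_dropping(toy_reads)
print(consensus, dropped / len(toy_reads))
```

The read with a single mismatch is kept and corrected by the vote, while the unrelated read is dropped, giving a dropped-read fraction of 0.25 for this toy MIG.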

Step IV: CdrBlast

This section contains the results of the V(D)J segment mapping and CDR3 extraction algorithm, run on both raw and assembled reads.

Command line:

## migec-1.2.1b.jar CdrBlastBatch -R TRB checkout/ assemble/ cdrblast/

Fig.8 Number of CDR3-containing MIGs

The plot below shows the total number of MIGs that contain a good-quality CDR3 region in the consensus sequence.

Fig.9 Number of CDR3-containing reads

Total number of raw reads that contain a good-quality CDR3 region.

Fig.10 CDR3 extraction rate

Mapping rate, the fraction of reads/MIGs that contain a CDR3 region

Panels show assembled (asm) and unprocessed (raw) data. Values are given in number of molecules (mig, assembled samples only) and the corresponding read count (read)

Fig.11 Low-quality CDR3 filtering rate

Good-quality CDR3 sequence rate, the fraction of CDR3-containing reads/MIGs that pass quality filter

Note that while raw data is filtered based on Phred quality score, the consensus quality score (CQS, the ratio of the major variant) is used for assembled data.
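The two quality measures can be compared with a short sketch. The Phred conversion is the standard definition, P = 10^(-Q/10). The CQS function follows the description above (share of the most common base at each position) and assumes equal-length, pre-aligned reads. Both function names are hypothetical.

```python
from collections import Counter

def phred_to_error_probability(q):
    """Standard Phred definition: error probability for quality score q."""
    return 10 ** (-q / 10)

def consensus_quality(reads):
    """Per-position ratio of the major variant among reads of one MIG."""
    return [Counter(col).most_common(1)[0][1] / len(col) for col in zip(*reads)]

toy_mig = ["ACGT", "ACGT", "ACGA", "ACGT"]  # toy reads from one MIG
cqs = consensus_quality(toy_mig)
print(cqs)
print(phred_to_error_probability(30))
```

In the toy MIG the last position has a CQS of 0.75, since three of the four reads agree on the base there.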

Step V: CdrBlastFilter

This section contains the results of the hot-spot error filtering stage, non-functional clonotype filtering (if enabled), and final statistics.

Command line:

## migec-1.2.1b.jar FilterCdrBlastResultsBatch cdrblast/ cdrfinal/

Fig.12 Number of clonotypes found

Below is the plot of sample diversity, i.e. the number of clonotypes in a given sample

Fig.13 Final number of TCR/Ig molecules

Total number of molecules (MIGs) in final clonotype tables

Fig.14 Rate of hot-spot and singleton error filtering

Rate of hot-spot and singleton error filtering, in terms of clonotypes (clone panel) and MIGs (mig panel). Hot-spot errors are identified by checking whether a given variant is frequently corrected during MIG consensus assembly.

As clonotypes represented by a single MIG (singletons) carry too little information for MiGEC-style error filtering, a simple frequency-based filter is used for them.

Fig.15 Fraction of non-coding clonotypes

Rate of non-coding CDR3 sequences, in terms of clonotypes (clone panel) and MIGs (mig panel).
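A common working definition of a non-coding CDR3, assumed here rather than taken from this report, is a nucleotide sequence that is out of frame (length not divisible by 3) or contains a stop codon. The sketch below applies that definition to uppercase nucleotide strings. The function name and the example sequences are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def is_non_coding(cdr3_nt):
    """True if the CDR3 is out of frame or contains an in-frame stop codon."""
    if len(cdr3_nt) % 3 != 0:
        return True
    codons = (cdr3_nt[i:i + 3] for i in range(0, len(cdr3_nt), 3))
    return any(codon in STOP_CODONS for codon in codons)

# Toy clonotypes, for illustration only.
toy_cdr3s = [
    "TGTGCCAGCAGTTTC",  # in frame, no stop codon
    "TGTGCCAGCAGTTT",   # out of frame
    "TGTGCCTAAAGTTTC",  # in frame, contains TAA
]
flags = [is_non_coding(s) for s in toy_cdr3s]
non_coding_rate = sum(flags) / len(flags)
print(flags, non_coding_rate)
```

Computed over clonotypes this corresponds to the clone panel. Weighting each clonotype by its MIG count would give the mig panel.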