Working directory: /Users/mikesh/Programming/repseq-tutorial/part1
Created: 09 August, 2015
This section summarizes the efficiency of sample barcode matching and unique molecular identifier (UMI) extraction.
Command line:
## migec-1.2.1b.jar Checkout -ou barcodes.txt trb_R1.fastq.gz trb_R2.fastq.gz checkout/
Overall extraction efficiency, showing the success rate of finding the
master, the main barcode containing the sample-specific sequence
slave, the auxiliary barcode, used to protect from contamination and checked only if the master is detected
and the rate of read overlapping (overlapped), if enabled.
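The rates described above can be sketched as simple fractions of the input read count. This is a minimal illustration, not MiGEC code: the function name, the argument names, and the counts are hypothetical, and it assumes every rate uses the total number of input reads as its denominator.

```python
def extraction_rates(total_reads, master_found, slave_found, overlapped):
    """Return barcode extraction rates as fractions of all input reads.

    All names and the choice of denominator are assumptions made for
    illustration; they are not taken from the MiGEC source.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    # The slave barcode is only checked when the master barcode is detected,
    # so it can never be found in more reads than the master.
    if slave_found > master_found:
        raise ValueError("slave_found cannot exceed master_found")
    return {
        "master": master_found / total_reads,
        "master+slave": slave_found / total_reads,
        "overlapped": overlapped / total_reads,
    }


# Hypothetical counts for a single run
rates = extraction_rates(
    total_reads=1000, master_found=900, slave_found=850, overlapped=400
)
```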
Distribution of reads recovered by sample
This section contains data on sequencing depth based on counting the number of reads tagged with the same UMI, i.e. reads that belong to the same Molecular Identifier Group (MIG).
Command line:
## migec-1.2.1b.jar Histogram checkout/ histogram/
Below is the plot of the MIG size distribution that was used to estimate the MIG size threshold for assembly (shown by dashed lines).
Note that the distribution is itself weighted by the number of reads to highlight the mean coverage value and to reflect the percent of reads that will be retained for a given MIG size threshold.
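To illustrate the weighting, the sketch below builds a read-weighted MIG size histogram and computes the fraction of reads retained at a given threshold. The MIG sizes are hypothetical, the function names are not part of MiGEC, and it assumes that MIGs with a size at or above the threshold are kept.

```python
from collections import Counter


def read_weighted_histogram(mig_sizes):
    """Map each MIG size to the fraction of all reads found in MIGs of that size."""
    counts = Counter(mig_sizes)      # MIG size -> number of MIGs of that size
    total_reads = sum(mig_sizes)     # each MIG contributes `size` reads
    return {size: size * n / total_reads for size, n in sorted(counts.items())}


def fraction_reads_retained(mig_sizes, threshold):
    """Fraction of reads in MIGs whose size is >= threshold (assumed inclusive)."""
    total_reads = sum(mig_sizes)
    kept_reads = sum(size for size in mig_sizes if size >= threshold)
    return kept_reads / total_reads


# Hypothetical sample: three singleton MIGs and three larger ones (20 reads total)
sizes = [1, 1, 1, 2, 5, 10]
histogram = read_weighted_histogram(sizes)
retained = fraction_reads_retained(sizes, threshold=5)
```

In this example half of the MIGs are singletons, yet they hold only 15% of the reads, which is why the weighted distribution emphasizes the larger MIGs.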
This section contains information on the assembly of MIG consensuses from raw reads.
Command line:
## migec-1.2.1b.jar AssembleBatch --force-overseq 5 --force-collision-filter --default-mask 0:1 checkout/ histogram/ assemble/
Below is a plot showing the total number of assembled MIGs per sample. The number of MIGs should be interpreted as the total number of starting molecules that have been successfully recovered.
The fraction of reads contained within assembled MIGs, relative to the total number of reads. This number should be high for a high-quality experiment.
The fraction of reads that were dropped during consensus assembly. Reads are dropped if they differ substantially from the reads that form the core of the consensus sequence. High values indicate low sequencing quality and/or the presence of library preparation artifacts.
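The idea of dropping reads that diverge from the consensus core can be sketched as follows. This is a simplified stand-in, not the MiGEC assembly algorithm: it assumes equal-length, pre-aligned reads, builds a per-position majority consensus, and uses a hypothetical `max_mismatches` cutoff.

```python
from collections import Counter


def majority_consensus(reads):
    """Per-position most common base across equal-length, pre-aligned reads."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))


def split_by_mismatches(reads, max_mismatches=1):
    """Split reads into kept/dropped by mismatch count against the consensus.

    `max_mismatches` is a hypothetical cutoff chosen for illustration.
    """
    consensus = majority_consensus(reads)
    kept, dropped = [], []
    for read in reads:
        mismatches = sum(a != b for a, b in zip(read, consensus))
        (kept if mismatches <= max_mismatches else dropped).append(read)
    return consensus, kept, dropped


# Hypothetical MIG: three identical reads, one with a single error, one artifact
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGAAC", "TTTTTT"]
consensus, kept, dropped = split_by_mismatches(reads)
dropped_fraction = len(dropped) / len(reads)
```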
This section contains the results of the V(D)J segment mapping and CDR3 extraction algorithm, run on both raw and assembled reads.
Command line:
## migec-1.2.1b.jar CdrBlastBatch -R TRB checkout/ assemble/ cdrblast/
The plot below shows the total number of MIGs that contain a good-quality CDR3 region in the consensus sequence
Total number of reads that contain a good-quality CDR3 region in raw reads
Mapping rate, the fraction of reads/MIGs that contain a CDR3 region
Panels show assembled (asm) and unprocessed (raw) data. Values are given in number of molecules (mig, assembled samples only) and the corresponding read count (read)
Good-quality CDR3 sequence rate, the fraction of CDR3-containing reads/MIGs that pass quality filter
Note that while raw data is filtered based on Phred quality score, the consensus quality score (CQS, the ratio of the major variant) is used for assembled data
This section contains the results of the hot-spot error filtering stage, non-functional clonotype filtering (if enabled), and final statistics.
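As an illustration of the consensus quality score, the sketch below computes, for each position, the fraction of reads in a MIG that carry the most frequent base. It assumes equal-length, pre-aligned reads; the function name and the example reads are hypothetical, and how MiGEC scales or encodes the score is not shown here.

```python
from collections import Counter


def consensus_with_cqs(reads):
    """Return the majority consensus and the per-position major-variant ratio."""
    consensus = []
    cqs = []
    for column in zip(*reads):
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base)
        cqs.append(count / len(column))  # share of reads supporting the major base
    return "".join(consensus), cqs


# Hypothetical MIG of four reads with one disagreement at the last position
mig_reads = ["ACGT", "ACGT", "ACGA", "ACGT"]
mig_consensus, mig_cqs = consensus_with_cqs(mig_reads)
```

Positions where all reads agree get a ratio of 1.0, while the last position here gets 0.75 because one of four reads disagrees.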
Command line:
## migec-1.2.1b.jar FilterCdrBlastResultsBatch cdrblast/ cdrfinal/
Below is the plot of sample diversity, i.e. the number of clonotypes in a given sample
Total number of molecules (MIGs) in final clonotype tables
Rate of hot-spot and singleton error filtering, in terms of clonotypes (clone panel) and MIGs (mig panel). Hot-spot errors are identified by checking whether a given variant is frequently corrected during MIG consensus assembly.
As clonotypes represented by a single MIG (singletons) carry insufficient information to apply MiGEC-style error filtering, simple frequency-based filtering is used for them.
Rate of non-coding CDR3 sequences, in terms of clonotypes (clone panel) and MIGs (mig panel)
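The two ways of reporting this rate can be sketched as below. The definition of non-coding used here (a CDR3 nucleotide sequence whose length is not a multiple of three, or that contains an in-frame stop codon) is an assumption made for illustration and may differ from the exact criterion MiGEC applies; the clonotype table is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_noncoding(cdr3_nt):
    """Assumed definition: out-of-frame length or an in-frame stop codon."""
    if len(cdr3_nt) % 3 != 0:
        return True
    codons = (cdr3_nt[i:i + 3] for i in range(0, len(cdr3_nt), 3))
    return any(codon in STOP_CODONS for codon in codons)


def noncoding_rates(clonotypes):
    """Return (rate by clonotype, rate by MIG) for (cdr3_nt, mig_count) pairs."""
    total_clones = len(clonotypes)
    total_migs = sum(migs for _, migs in clonotypes)
    bad = [(seq, migs) for seq, migs in clonotypes if is_noncoding(seq)]
    return len(bad) / total_clones, sum(migs for _, migs in bad) / total_migs


# Hypothetical clonotype table: (CDR3 nucleotide sequence, number of MIGs)
table = [("TGTGCCAGC", 8), ("TGTGCCAG", 1), ("TGTTAGAGC", 1)]
clone_rate, mig_rate = noncoding_rates(table)
```

Here two of three clonotypes are non-coding, but they account for only two of ten MIGs, which shows why the clone and mig panels can differ substantially.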