
Best practices for trimming adapters when variant calling outside of longranger

Posted By: chapmanb, on Jun 5, 2017 at 8:57 AM

I'm variant calling with 10x input data using bcbio (https://github.com/chapmanb/bcbio-nextgen), which does not yet integrate any of the special magic in the longranger workflow. We're hoping to use these more standard runs as a stepping stone toward better integrating and handling 10x data in bcbio.

 

When running validations on your NA24385 public data using a pretty standard bwa and GATK/FreeBayes pipeline, I'm seeing high false positive rates for GATK-based algorithms (GATK3.7, GATK4 and Sentieon's Haplotyper):

 

https://github.com/bcbio/bcbio_validations/tree/master/gatk4#na24385-10x-data-on-grch37

 

In contrast, FreeBayes doesn't show the false positive inflation and is similar to the NA12878 validations that use non-10x data.

 

Are there recommended best practices to trim adapters, or otherwise pre-process, to help resolve these false positives? I'd welcome any tips/tricks for how best to handle this. Thanks much.


Re: Best practices for trimming adapters when variant calling outside of longranger

Posted By: patrick-10x, on Jun 7, 2017 at 8:42 AM

Hi Brad, 

 

Thanks for the question! The most powerful filtering comes from the phasing-based analysis: if the support for the alternate allele doesn't segregate cleanly onto one of the haplotypes constructed from the neighboring variants, it's likely to be a false positive. Our PHASE_SNPINDELS stage does this analysis simultaneously with large-scale phasing. For best results with 10x data, we strongly encourage using Long Ranger, which is also a really good way to get phased variants.

 

We use the following filters to remove the most obvious false positives generated by GATK and to a lesser extent FreeBayes:

 

QUAL_FILTER = '(%QUAL <= 15 || (AF[0] > 0.5 && %QUAL < 50))'
ALLELE_FRACTION_FILTER = '(AO[0] < 2 || AO[0]/(AO[0] + RO) < 0.15)'

 

bcftools filter -O v --soft-filter <name> -e <filter_spec> -m '+' <input vcf>
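
For example, here's how the two expressions slot into that template (a hypothetical invocation; the soft-filter names and file names are placeholders, and the two filters are applied as two passes piped together):

bcftools filter -O v --soft-filter 10X_QUAL_FILTER \
    -e '%QUAL <= 15 || (AF[0] > 0.5 && %QUAL < 50)' -m '+' input.vcf \
  | bcftools filter -O v --soft-filter 10X_ALLELE_FRACTION_FILTER \
    -e 'AO[0] < 2 || AO[0]/(AO[0] + RO) < 0.15' -m '+' - \
  > soft_filtered.vcf

Note that AO/RO are FreeBayes-style annotations; for GATK output you'd substitute the equivalent AD-based fields.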

 

You can see the details in the Python code in the Long Ranger tarball here:

longranger-2.1.3/longranger-cs/2.1.3/mro/stages/snpindels/populate_info/__init__.py:167

longranger-2.1.3/longranger-cs/2.1.3/tenkit/lib/python/tenkit/constants.py:138

 

In terms of trimming, we recommend trimming the first 16+7bp of R1, and the first 1bp of R2. R1 contains the 16bp 10x barcode + 7bp of low accuracy sequence from an N-mer oligo.  The first bp of R2 empirically has about a 5x higher mismatch rate.  Given the stats you're showing, I don't expect the trimming to have a huge influence -- my guess is that you'll get the biggest win from filtering poor variants.
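
For concreteness, a minimal sketch of that trimming with cutadapt (my tool choice here, not a 10x requirement; any trimmer that removes a fixed number of 5' bases works, and the file names are placeholders):

# Drop the 16bp barcode + 7bp N-mer from the start of R1, and the first base of R2
cutadapt -u 23 -U 1 \
    -o trimmed_R1.fastq.gz -p trimmed_R2.fastq.gz \
    reads_R1.fastq.gz reads_R2.fastq.gz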

 

Cheers,

Pat

Re: Best practices for trimming adapters when variant calling outside of longranger

Posted By: chapmanb, on Jun 19, 2017 at 8:10 AM

Pat;

Thank you so much for the detailed answer. This gave me a lot of great directions to go in, and a combination of a low-frequency filter and trimming got us to specificity and sensitivity similar to FreeBayes. I first tried just a low-frequency filter similar to yours, but using QD plus read position metrics, and it helped some:

 

https://github.com/bcbio/bcbio_validations/tree/master/gatk4#low-frequency-allele-filter
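
In bcftools syntax, the filter was shaped roughly like this (a hypothetical rendering with illustrative thresholds, not the exact bcbio expression; the real one is behind the link above):

bcftools filter -O v --soft-filter LowFreqGATK -m '+' \
    -e 'AF[0] < 0.5 && (INFO/QD < 2.0 || INFO/ReadPosRankSum < -8.0)' \
    input.vcf > soft_filtered.vcf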

 

I also tried trimming alone, which helped some as well, but it was really the combination of the two that did the trick:

 

https://github.com/bcbio/bcbio_validations/tree/master/gatk4#10x-adapter-trimming--low-frequency-all...

 

Thank you again for the help pointing us in the right direction. I'm excited to have this working and am looking forward to exploring more 10x data and starting to incorporate Long Ranger.

Re: Best practices for trimming adapters when variant calling outside of longranger

Posted By: cypridina, on Oct 8, 2018 at 5:29 PM

I checked out some of my own 10X libraries and it seems like the data should actually be trimmed a bit more than the above responses suggest.

 

To figure this out I made a sequence logo from 1,000,000 reads from my 10X fastqs, then rescaled the bits (y-axis) so that the minimum and maximum bitscores across all positions map to 0 and 1. First I plotted all 150 bp.
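
For anyone who wants to reproduce this kind of check without a logo tool, here is a rough awk sketch of the underlying per-position calculation (hypothetical file names; this prints raw information content in bits rather than the 0-1 rescaled values I plotted):

# Count base frequencies per position over the first 1,000,000 reads
# (4,000,000 fastq lines) and print: position, information content (bits).
# Ns are ignored in the totals.
zcat reads_R1.fastq.gz | head -n 4000000 | awk '
NR % 4 == 2 {
    if (length($0) > maxlen) maxlen = length($0)
    for (i = 1; i <= length($0); i++)
        count[i, substr($0, i, 1)]++
}
END {
    split("A C G T", bases, " ")
    for (i = 1; i <= maxlen; i++) {
        total = 0; H = 0
        for (b in bases) total += count[i, bases[b]]
        if (total == 0) continue
        for (b in bases) {
            p = count[i, bases[b]] / total
            if (p > 0) H -= p * log(p) / log(2)
        }
        # Information content = 2 bits (max for DNA) minus observed entropy
        print i, 2 - H
    }
}'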

 

[Figures: sequence_entropy_output_R2.png, sequence_entropy_output_r1_to150.png]

Yep - the R2 reads look like they will be OK with just the first 5bp or so trimmed. (The GC bias is a little concerning though!)

 

The R1 reads definitely have something going on well into the 20-30bp position though. Zoomed in:

 

[Figure: sequence_entropy_output_r1_to50.png]

As you can see here, there is sequence bias up to about the 28th base of the reads. These bases are apparently low quality according to the post above, but it looks like it would be good to trim them anyway. In the future I will probably just trim off the first 30bp of R1 if I need to do so.
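
If I end up doing that, a one-liner along the lines of the earlier cutadapt sketch (again my tool choice, with placeholder file names) would be:

cutadapt -u 30 -U 5 -o trimmed_R1.fastq.gz -p trimmed_R2.fastq.gz reads_R1.fastq.gz reads_R2.fastq.gz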