Unexpected Peak in the frequency of uniquely occurring UMI

Posted By: avi_sri, on Sep 13, 2017 at 7:30 AM

Hi,

 

While working on the dataset provided here, I was trying to get the distribution of the frequency of the UMI in a gene. Specifically, I ran STAR with the following command:

 

{starBin} --runThreadN {num_thread} --genomeDir {star_index_path} \
    --readFilesIn {reads_file_path} --outFileNamePrefix {output_path} \
    --outSAMtype BAM SortedByCoordinate --readFilesCommand zcat \
    --outFilterMultimapNmax 1

I used the generated BAM file to build, for each gene (g1) in the reference (concatenated human+mouse), a set of UMIs (say s1 <- {u1, u2}), where u1 and u2 are the UMIs whose reads map to gene g1, and f1 and f2 are the number of times each of those UMIs occurs in gene g1.

I plotted the histogram of these frequencies over all genes in the reference, which looks as follows: [Image: Screen Shot 2017-09-10 at 21.06.26.png]

Here the x-axis is the frequency of a UMI in a gene i.e. f1,f2 (defined above) and the y-axis is the frequency of the frequencies (i.e. it's a histogram).
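
A minimal sketch of the counting described above, assuming the BAM file has already been reduced to one (gene, UMI) pair per uniquely mapped read. The function name `umi_frequency_histogram` and the toy records are hypothetical, and the cell barcode is ignored, as in the description above:

```python
from collections import Counter

def umi_frequency_histogram(records):
    """records: iterable of (gene, umi) pairs, one per uniquely mapped read."""
    # Frequency f of each UMI within each gene (number of supporting reads).
    reads_per_umi = Counter(records)
    # Frequency of the frequencies: how many (gene, UMI) pairs occur k times.
    return Counter(reads_per_umi.values())

# Toy input (hypothetical): u1 occurs 3 times in g1, u2 once in g1, u1 once in g2.
example_reads = [
    ("g1", "u1"), ("g1", "u1"), ("g1", "u1"),
    ("g1", "u2"),
    ("g2", "u1"),
]
histogram = umi_frequency_histogram(example_reads)
print(sorted(histogram.items()))  # [(1, 2), (3, 1)]
```

The entry for k = 1 is the "singly occurring UMI" bin that the peak refers to.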

You can see a peak at the singly occurring UMIs, which means that a huge number of UMIs occur only once in a given gene. This puzzles me for the following reason:

1.) If I am right, then according to UMI theory, each UMI (prior to deduplication) should occur at least as many times as the number of PCR cycles, but what we observe here is very different.

IMO the above plot would make sense if either of the below is true:

2.) The publicly released reads are processed reads (PCR duplicates removed with some heuristic), which would explain the peak at singly occurring UMIs, or

3.) The reads are raw reads, but sequencing was done on only a subsample of the original biological material (after the full PCR run).

 

I'd really appreciate any help on understanding this.

thanks,

Avi Srivastava

 

Re: Unexpected Peak in the frequency of uniquely occurring UMI

Posted By: avi_sri, on Sep 13, 2017 at 10:43 AM

@shauna-10x @paulr @jens-10x Any comments?

Re: Unexpected Peak in the frequency of uniquely occurring UMI

Posted By: paulr, on Sep 13, 2017 at 11:26 AM

Hi Avi,

 

The library was not sequenced to saturation, which is why you see so many single-read UMIs. If you look at the web summary, there is a metric called "Sequencing Saturation" which is essentially a function of the UMI duplication rate. It's only 25% for this library, indicating that we haven't come close to seeing every UMI in the library yet. In our experience, due to the high RNA content in these cell lines, you typically need to sequence to 1,000,000 reads per cell to approach saturation.
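
To make the link between saturation and UMI duplication concrete, here is a rough sketch. It assumes saturation is computed as 1 minus the ratio of unique UMIs to total reads; that formula, the function name `sequencing_saturation`, and the toy counts are assumptions for illustration, not taken from the web summary itself:

```python
from collections import Counter

def sequencing_saturation(reads_per_umi):
    """reads_per_umi: mapping from a UMI key to its number of reads.

    Assumed definition: 1 - (unique UMIs / total reads).
    """
    total_reads = sum(reads_per_umi.values())
    if total_reads == 0:
        raise ValueError("no reads supplied")
    unique_umis = len(reads_per_umi)
    return 1.0 - unique_umis / total_reads

# Toy library (hypothetical): 4 reads covering 3 distinct UMIs.
toy_counts = Counter({"umi_a": 2, "umi_b": 1, "umi_c": 1})
print(sequencing_saturation(toy_counts))  # 0.25
```

Under this definition, a library in which most UMIs are seen only once has low saturation, which is consistent with a peak at single-read UMIs.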

Re: Unexpected Peak in the frequency of uniquely occurring UMI

Posted By: avi_sri, on Sep 14, 2017 at 6:47 AM

Hi @paulr,

 

Thanks for the quick reply.

Interesting. Now I am curious how one decides how much to sequence, i.e. why stop at 25%? Is it a user-specified parameter (for financial or other reasons), or is it an automated procedure based on empirical evidence from the UMIs?

Basically, I am interested in estimating gene abundances across cells, but the unavailability of the majority of the reads, even if they are PCR duplicates, could bias the posterior abundances.

 

Thanks again,

Avi

Re: Unexpected Peak in the frequency of uniquely occurring UMI

Posted By: paulr, on Sep 22, 2017 at 8:08 PM

Apologies for the delayed response.

 

> Interesting. Now I am curious how one decides how much to sequence, i.e. why stop at 25%? Is it a user-specified parameter (for financial or other reasons), or is it an automated procedure based on empirical evidence from the UMIs?

 

It is up to the user and is usually specified as reads per cell. Some considerations are the cell count, the RNA content of the cells, and your goals: are you simply profiling cell type abundances and highly expressed markers, or do you want to go deep and detect low-to-moderately expressed genes? We recommend that new users start at 50k reads per cell and adjust accordingly going forward.
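
As a rough illustration of turning a reads-per-cell target into a sequencing request, the sketch below multiplies the target by the expected cell count. The function name `total_reads_needed` and the cell count of 1,000 are hypothetical; only the 50k reads-per-cell starting point comes from the reply above:

```python
def total_reads_needed(n_cells, reads_per_cell=50_000):
    """Total reads to request for a target depth per cell."""
    if n_cells < 0 or reads_per_cell < 0:
        raise ValueError("counts must be non-negative")
    return n_cells * reads_per_cell

# Hypothetical experiment with 1,000 cells at the suggested starting depth.
print(total_reads_needed(1_000))  # 50000000
```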