UMI Counts: raw or normalized?

Posted By: nsag, on Apr 14, 2018 at 4:47 PM

Hi all,


From the "HDF5 Gene-Barcode Matrix Format" page:


Column      Type    Description
barcodes    string  Barcode sequences and their corresponding gem groups (e.g. AAACGGGCAGCTCGAC-1)
data        uint32  Nonzero UMI counts in column-major order
gene_names  string  Gene symbols (e.g. Xkr4)
genes       string  Ensembl gene IDs (e.g. ENSMUSG00000051951)
indices     uint32  Row index of corresponding element in data
indptr      uint32  Index into data / indices of the start of each column
shape       uint64  Tuple of (n_rows, n_columns)
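
To make the layout in the table concrete, here is a minimal pure-Python sketch of how data, indices and indptr combine into a gene-by-barcode matrix. The arrays are made-up toy values, not taken from any real file, and the helper name csc_to_dense is hypothetical. It assumes the usual compressed-sparse-column convention in which indptr has n_columns + 1 entries, so that column j occupies data[indptr[j]:indptr[j + 1]].

```python
# Toy stand-ins for the datasets listed in the table (values are invented).
shape = (3, 2)        # (n_rows, n_columns) = (genes, barcodes)
data = [5, 1, 2]      # nonzero UMI counts, stored column by column
indices = [0, 2, 1]   # row (gene) index of each entry in data
indptr = [0, 2, 3]    # assumed CSC convention: n_columns + 1 offsets


def csc_to_dense(data, indices, indptr, shape):
    """Expand column-major sparse arrays into a dense list of rows."""
    n_rows, n_cols = shape
    dense = [[0] * n_cols for _ in range(n_rows)]
    for col in range(n_cols):
        # Entries belonging to this column (barcode).
        for k in range(indptr[col], indptr[col + 1]):
            dense[indices[k]][col] = data[k]
    return dense


dense = csc_to_dense(data, indices, indptr, shape)

# Total UMIs per barcode: sum each column over all genes.
umis_per_barcode = [sum(row[c] for row in dense) for c in range(shape[1])]

print(dense)
print(umis_per_barcode)
```

With real data, the same arrays would be read from the .h5 file rather than typed in by hand, and a sparse-matrix library would normally be used instead of a dense list of lists.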


I've used cellranger mat2csv to convert a filtered_gene_bc_matrices_h5.h5 file to CSV format for inspection, and I'm unsure whether the values are raw or normalized UMI counts. The table above suggests that they are raw counts, but when comparing small and large datasets containing the same cells, I've noticed that the values are much lower in the larger files, which suggests that they are somehow normalized.


Could somebody shed light on how the counts are computed, whether they are normalized, and if so, how they are normalized (by total reads, size of dataset, etc.)?


Thank you, and any response is greatly appreciated!

1 Reply

Re: UMI Counts: raw or normalized?

Posted By: Leah, on Apr 25, 2018 at 10:31 AM



The UMI counts reported in the gene-barcode matrix files (MEX and HDF5 format) are not normalized.


Each element of the matrix is the number of UMIs associated with a gene and barcode.
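
One quick sanity check that follows from this: raw UMI counts must be non-negative whole numbers, so any fractional value in the exported CSV would point to some scaling step. The sketch below is only an illustration; the function name looks_like_raw_counts and the sample values are hypothetical. Note that the check is one-directional: all-integer values are consistent with raw counts but do not prove that no other processing took place.

```python
def looks_like_raw_counts(values):
    """Return True if every value is a non-negative whole number.

    Accepts numbers or numeric strings (e.g. cells read from a CSV file).
    """
    for v in values:
        x = float(v)
        if x < 0 or not x.is_integer():
            return False
    return True


# Invented example values, not taken from any real matrix.
raw_like = [0, 3, 1, 12]
scaled = [0.0, 1.5, 0.25, 6.0]
from_csv = ["0", "7", "2"]

print(looks_like_raw_counts(raw_like))
print(looks_like_raw_counts(scaled))
print(looks_like_raw_counts(from_csv))
```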


For more information on the matrix files, please see:


If you have additional questions, please feel free to also contact