UMI Counts: raw or normalized?

Posted By: nsag, on Apr 14, 2018 at 4:47 PM

Hi all,


From the "HDF5 Gene-Barcode Matrix Format" page:


Column      Type    Description
barcodes    string  Barcode sequences and their corresponding gem groups (e.g. AAACGGGCAGCTCGAC-1)
data        uint32  Nonzero UMI counts in column-major order
gene_names  string  Gene symbols (e.g. Xkr4)
genes       string  Ensembl gene IDs (e.g. ENSMUSG00000051951)
indices     uint32  Row index of corresponding element in data
indptr      uint32  Index into data / indices of the start of each column
shape       uint64  Tuple of (n_rows, n_columns)
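
For reference, here is how I understand the data, indices, indptr and shape fields to fit together. This is only a sketch in plain Python with invented toy values, not the actual Cell Ranger code. It assumes the usual compressed sparse column convention, in which indptr has n_columns + 1 entries so that the last entry marks the end of the final column. The helper name csc_to_dense is my own.

```python
# Toy matrix: 3 genes (rows) x 2 barcodes (columns). All values are invented.
data = [5, 1, 2]      # nonzero UMI counts, stored column by column
indices = [0, 2, 1]   # row (gene) index of each entry in data
indptr = [0, 2, 3]    # column j occupies data[indptr[j]:indptr[j + 1]] (assumed CSC convention)
shape = (3, 2)        # (n_rows, n_columns)


def csc_to_dense(data, indices, indptr, shape):
    """Expand the sparse column-major fields into a dense list of rows."""
    n_rows, n_cols = shape
    dense = [[0] * n_cols for _ in range(n_rows)]
    for col in range(n_cols):
        # Walk the slice of data/indices that belongs to this column.
        for k in range(indptr[col], indptr[col + 1]):
            dense[indices[k]][col] = data[k]
    return dense


dense = csc_to_dense(data, indices, indptr, shape)
for row in dense:
    print(row)
```

If this reading is right, each column is one barcode and each row is one gene, so the entries of data are per-gene, per-barcode counts.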


I've used cellranger mat2csv to convert a filtered_gene_bc_matrices_h5.h5 file to CSV format for inspection, and I'm unsure whether the values are raw or normalized UMI counts. The table above suggests that they are raw counts. However, when I compare a small and a large dataset containing the same cells, the values are much lower in the larger file, which suggests that they are normalized in some way.
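
One sanity check that could be run on the CSV values is sketched below. The idea is that raw UMI counts should all be non-negative whole numbers, while most normalization schemes would typically leave fractional values, and that the per-barcode totals of the same cell can be compared between the two files. The matrices and function names here are made up for illustration, and the whole-number test is only a heuristic, since a rounded or downsampled matrix would also pass it.

```python
def looks_like_raw_counts(values):
    """Heuristic: True if every value is a non-negative whole number."""
    return all(v >= 0 and float(v).is_integer() for v in values)


def per_barcode_totals(matrix):
    """Sum each column of a genes x barcodes matrix given as a list of rows."""
    return [sum(column) for column in zip(*matrix)]


# Invented examples: one integer-valued matrix and one scaled version of it.
integer_matrix = [[5, 0], [0, 2], [1, 0]]
scaled_matrix = [[0.83, 0.0], [0.0, 1.0], [0.17, 0.0]]

for name, matrix in [("integer", integer_matrix), ("scaled", scaled_matrix)]:
    flat = [value for row in matrix for value in row]
    print(name, looks_like_raw_counts(flat), per_barcode_totals(matrix))
```

If the values in both files pass the whole-number check but the totals for the same barcode still differ, the difference would have to come from something other than a simple rescaling.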


Could somebody shed light on how the counts are computed, whether they are normalized, and if so, how (by total reads, by size of dataset, etc.)?


Thank you, and any response is greatly appreciated!