UMI Counts: raw or normalized?
From the "HDF5 Gene-Barcode Matrix Format" page:
|Column||Type||Description|
|barcodes||string||Barcode sequences and their corresponding gem groups (e.g. AAACGGGCAGCTCGAC-1)|
|data||uint32||Nonzero UMI counts in column-major order|
|gene_names||string||Gene symbols (e.g. Xkr4)|
|genes||string||Ensembl gene IDs (e.g. ENSMUSG00000051951)|
|indices||uint32||Row index of corresponding element in data|
|indptr||uint32||Index into data / indices of the start of each column|
|shape||uint64||Tuple of (n_rows, n_columns)|
I've used cellranger mat2csv to convert a filtered_gene_bc_matrices_h5.h5 file to CSV for inspection, and I'm unsure whether the values are raw or normalized UMI counts. The table above suggests that they are raw counts, but when comparing small and large datasets containing the same cells, I've noticed that the values are much lower in the larger files, which suggests that they are normalized in some way.
Could somebody shed light on how the counts are computed, whether they are normalized, and if so, how they are normalized (by total reads, size of dataset, etc.)?
Thank you, and any response is greatly appreciated!
Re: UMI Counts: raw or normalized?
The UMI counts reported in the gene-barcode matrix files (MEX and HDF5 format) are not normalized.
Each element of the matrix is the number of UMIs associated with a gene and barcode.
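If it helps to see how the data, indices, indptr and shape fields fit together, here is a minimal sketch in plain Python. The arrays are made-up toy values, not taken from any real file, and the function names are hypothetical rather than part of Cell Ranger. It assumes indptr holds one more entry than there are columns, as in the usual compressed column layout.

```python
# Toy arrays laid out like the HDF5 fields (all values are invented).
shape = (3, 4)             # (n_rows, n_columns) = (genes, barcodes)
data = [2, 1, 5, 3, 1]     # nonzero UMI counts in column-major order
indices = [0, 2, 1, 0, 2]  # row (gene) index of each element in data
indptr = [0, 2, 3, 3, 5]   # start of each column in data / indices


def csc_to_dense(data, indices, indptr, shape):
    """Expand the compressed column arrays into a genes x barcodes table."""
    n_rows, n_cols = shape
    dense = [[0] * n_cols for _ in range(n_rows)]
    for col in range(n_cols):
        # Entries indptr[col] .. indptr[col + 1] - 1 belong to this barcode.
        for k in range(indptr[col], indptr[col + 1]):
            dense[indices[k]][col] = data[k]
    return dense


def umis_per_barcode(data, indptr):
    """Total UMI count for each barcode (column)."""
    return [sum(data[indptr[c]:indptr[c + 1]]) for c in range(len(indptr) - 1)]


dense = csc_to_dense(data, indices, indptr, shape)
totals = umis_per_barcode(data, indptr)

for row in dense:
    print(row)
print("UMIs per barcode:", totals)
```

Each stored value is a whole-number count for one gene and one barcode. If the same barcode shows different totals in two files, comparing its per-barcode totals this way is one simple starting point for seeing where the files differ.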
For more information on the matrix files, please see:
If you have additional questions, please feel free to also contact email@example.com