SIMLR - A novel scRNA-seq analysis method
Bo Wang, Stanford University
Bo is a PhD student in computer science. His research interests include machine learning, computational biology, computer vision, and data analysis. He is currently part of the Serafim Batzoglou lab at Stanford and is working on single-cell RNA-seq analysis, cancer subtyping, 3D long-range interaction modeling, and gene function prediction. These projects involve a wide range of machine learning techniques: deep learning, network analysis, factor analysis, and convex optimization. Prior to Stanford, he received his master’s degree in numerical analysis at the University of Toronto.
A new Nature Methods publication, “Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning”, introduces a novel approach to single-cell data analysis called SIMLR (single-cell interpretation via multi-kernel learning) that learns appropriate cell-to-cell similarity metrics for dimension reduction and clustering of single-cell RNA-seq (scRNA-seq) data. We recently had a chance to speak with Bo Wang, lead author on the paper, about the current challenges and future potential of single-cell analysis.
10x: Your background is in computer science and numerical analysis. What got you interested in applying these fields to biological data analysis?
BW: I did a rotation with Serafim Batzoglou in the first year of my PhD. During that rotation, I realized that my skills in computer science and numerical analysis could actually help solve many important biological questions, which could potentially make a big difference in the world.
10x: In the new publication “Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning”, you introduce a new approach to single cell data analysis called SIMLR (single-cell interpretation via multi-kernel learning). What inspired you to focus on scRNA-seq data analysis? What opportunities did you see to improve scRNA-seq data analysis?
BW: I worked in a children’s hospital in Canada for two years. During that time, I was exposed to many cancer-related problems in which heterogeneity plays an essential role. Single-cell analysis is an invaluable tool for dissecting cell heterogeneity and promises to yield new insights into a variety of important applications, including cancer analysis. Further, current studies in single-cell analysis have inherited methods designed for bulk RNA-seq data, without consideration for the unique problems that come with single-cell analysis. Therefore, I feel the need to design specific tools for single-cell RNA-seq data analysis.
10x: How is SIMLR different compared to current single-cell analysis methods? What are the advantages of SIMLR over traditional RNA-seq data analysis methods?
BW: The key difference between SIMLR and current single-cell analysis methods is that, instead of using traditional metrics such as Euclidean distance or correlation, SIMLR learns a discriminative cell-to-cell similarity. This similarity enables efficient downstream applications in single-cell analysis, such as clustering, visualization and differential gene analysis. In addition, the scalability of SIMLR makes it easy to handle hundreds of thousands of cells in minutes.
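To make the idea of combining several kernels into one cell-to-cell similarity more concrete, here is a minimal sketch in plain Python. It is not the SIMLR implementation: SIMLR learns the kernel weights and the similarity matrix jointly, whereas this illustration simply averages a few Gaussian kernels with fixed, uniform weights. The function names, the toy expression values, and the bandwidths are all hypothetical and chosen only for illustration.

```python
import math

def pairwise_sq_dists(cells):
    """Squared Euclidean distance between every pair of expression vectors."""
    n = len(cells)
    return [[sum((a - b) ** 2 for a, b in zip(cells[i], cells[j]))
             for j in range(n)] for i in range(n)]

def gaussian_kernel(sq_dists, sigma):
    """One Gaussian kernel; sigma controls how quickly similarity decays."""
    return [[math.exp(-d / (2.0 * sigma ** 2)) for d in row]
            for row in sq_dists]

def combine_kernels(kernels, weights):
    """Weighted sum of kernels. Assumption: weights are supplied by the caller,
    not learned as they would be in a multi-kernel learning method."""
    n = len(kernels[0])
    return [[sum(w * k[i][j] for w, k in zip(weights, kernels))
             for j in range(n)] for i in range(n)]

# Toy data: 4 cells x 3 genes (made-up values) forming two obvious groups.
cells = [
    [1.0, 0.0, 0.0],
    [1.1, 0.1, 0.0],
    [0.0, 5.0, 5.0],
    [0.1, 5.2, 4.9],
]

sq = pairwise_sq_dists(cells)
sigmas = [0.5, 1.0, 2.0]                        # hypothetical bandwidths
kernels = [gaussian_kernel(sq, s) for s in sigmas]
weights = [1.0 / len(kernels)] * len(kernels)   # uniform placeholder weights
similarity = combine_kernels(kernels, weights)

for row in similarity:
    print(["%.3f" % value for value in row])
```

In this toy example, cells from the same group end up with a similarity close to 1 and cells from different groups with a similarity close to 0. A learned similarity would additionally adjust the weights so that kernels that separate the cell populations well contribute more.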
10x: How can SIMLR be applied to different types of single-cell studies (e.g. studies of highly heterogeneous cell populations or studies focusing on cell differentiation)? What types of single cell studies can benefit most from SIMLR?
BW: SIMLR can be applied to highly heterogeneous cell populations. I think SIMLR will be particularly useful for single-cell studies involving multiple levels of heterogeneity and lots of cell types with large variances in the number of cells for each cell type.
10x: Did you uncover any interesting or surprising discoveries in your research as you were applying SIMLR to single-cell datasets?
BW: When applying SIMLR to PBMC single-cell RNA-seq data from 10x Genomics, we found that SIMLR was able to easily identify natural killer (NK) cells, while traditional methods such as t-SNE and PCA usually mixed up NK cells with CD8+ T cells. Further, SIMLR was able to identify a rare megakaryocyte population of 12 cells in the same data with simple k-means. It also pinned down most of the marker genes for both major and rare cell types in the PBMC data.
10x: There’s a lot of excitement and promise around single-cell analysis. In your opinion, what does the future of single cell data analysis look like? What improvements are still needed?
BW: In my opinion, the future of single-cell data analysis will focus on providing comprehensive single-cell multi-omics data analysis for more than just RNA-seq data. Computational tools for single-cell analysis should be able to integrate multiple types of measurements of a single cell (e.g., RNA-seq, DNA methylation or ATAC-seq) and construct a global view of the heterogeneity and dynamics of genome-wide transcriptomic and epigenetic states.
10x: In your experience, how important are open-source software and datasets for bioinformatics tool development?
BW: I think open-source software and datasets will greatly improve the quality and efficiency of many popular bioinformatics tools. Further, they will promote a highly collaborative environment in the single-cell community.
For more information about the publication “Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning”, please see the blog post “A novel scRNA-seq analysis method uses machine learning”.