I had tweeted about this analysis a few weeks back and it generated some interest, so I figured I would put together a slim version of it here with a little more commentary than the original notebook.
To start, my motivation for exploring this data wasn't the data itself, but rather, I had just come across the Python package Scanpy and needed to try it out. The package contains straight-forward functions for end-to-end analysis of scRNA-seq data, surpassing the capabilities of its R counterparts. I absolutely love the ecosystem of R tools, but inevitably, I'll always end up having to bounce expression matrices between different data types for Scater, Seurat, Monocle, etc. This isn't a big deal, but there is something appealing about having everything wrapped in a single package.
Scanpy is also just really good at what it does, with computational speed-ups of 4-16x comparable tools, and a much higher memory efficiency, enabling analysis of >1 million cells in a reasonable amount of time. As datasets get bigger, this is something we're going to have to value.
My one problem though, is that before touching Scanpy, I've somehow avoided having to write a line of Python code in my life. Luckily, the API design is remarkably user friendly. Functions are organized into groups (preprocessing, machine learnings/statistics, and plotting) that make it easy to string them together into a standard analysis pipeline. Also, the data type (AnnData) is very similar to those in R (SingleCellExperiment, ExpressionSet, etc). So the package has been great for taking my first baby steps into the world of Python. Hopefully I can pick up on things relatively quickly!
Anyway, let's get into the analysis. The data is part of the Tabula Muris dataset, comprising ~50,000 cells from 12 mouse tissues. The data comes as separate files for each tissue and replicate. I merged these together and dumped it into a .h5ad file that I'll start with here. If you want to see my stupidly-inefficient code for merging it, see the original analysis notebook here. Data can be downloaded here.
import numpy as np import scanpy.api as sc sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3) sc.settings.set_figure_params(dpi=80) # low dpi (dots per inch) yields small inline figures sc.logging.print_version_and_date() # we will soon provide an update with more recent dependencies sc.logging.print_versions_dependencies_numerics()
Running Scanpy 0.4.2+2.g884c5f8 on 2018-02-03 22:13. Dependencies: numpy==1.14.0 scipy==1.0.0 pandas==0.22.0 scikit-learn==0.19.1 statsmodels==0.8.0 python-igraph==0.7.1 louvain==0.6.1
Loading the data
adata = sc.read("/Users/davidcook/Projects/fun_analysis/tabula_muris/write/mouse.atlas.h5ad")
The preprocessing steps will be quite standard for scRNA-seq. First, we'll look at the distribution of UMI/cell, Genes/cell, and Mitochondrial proportions/cell.
#This function actually writes metadata for genes/cell and number of cells expressing each gene, which is needed for the plotting function #You can be conservative with removing cells here, and then tighten the thresholds after viewing the data sc.pp.filter_genes(adata, min_cells=5) sc.pp.filter_cells(adata, min_genes=250)
Now calculate the percentage of mitochrondrial reads per cell
#Comments below are from Alex Wolf's usage notebook--keeping them for reference mito_genes = [name for name in adata.var_names if name.startswith('MT-')] # for each cell compute fraction of counts in mito genes vs. all genes # the ".A1" is only necessary, as X is sparse - it transform to a dense array after summing adata.obs['percent_mito'] = np.sum( adata[:, mito_genes].X, axis=1).A1 / np.sum(adata.X, axis=1).A1 # add the total counts per cell as observations-annotation to adata adata.obs['n_counts'] = np.sum(adata.X, axis=1).A1
And visualizing the distributions
axs = sc.pl.violin(adata, ['n_genes', 'n_counts', 'percent_mito'], jitter=0.4, multi_panel=True)
Single-cell data is typically skewed like this. Droplets containing more than one cell certainly contribute to the skew (more RNA per droplet > more cDNA made > more reads/UMIs per droplet). It's not uncommon to trim off some off the cells at both ends of this distributions to remove possible doublets and poor-quality cells (bottom end of distribution).
It looks like either mitochondrial genes were removed prior to hosting the data online, or a reference that lacked them was used. Let's just assume they won't affect things too much.
ax = sc.pl.scatter(adata, x='n_counts', y='n_genes')
Let's get rid of cells with >60k UMIs (possible doubles) and those with less than 1k (poor lysis or bad quality cell)
adata = adata[adata.obs['n_counts'] < 60000, :] adata = adata[adata.obs['n_counts'] > 1000, :]
I'm quite sure this normalization function will just scale counts so that all cells have the same total number of counts. Other normalization methods exist that are a bit more elegant, but this works fine in most cases.
sc.pp.normalize_per_cell(adata, counts_per_cell_after=1e4) filter_result = sc.pp.filter_genes_dispersion( adata.X, min_mean=0.0125, max_mean=5, min_disp=0.5) sc.pl.filter_genes_dispersion(filter_result)
normalizing by total count per cell finished (0:00:02.771): normalized adata.X and added 'n_counts', counts per cell before normalization (adata.obs) filter highly variable genes by dispersion and mean (0:00:04.965) --> set `n_top_genes` to simply select top-scoring genes instead
For most downstream analyses it makes sense to only use a subset of highly-variable genes. We'll filter the data to only include those.
adata = adata[:, filter_result.gene_subset]
It's great that Scanpy has a function to regress out the effect of confounding variables from the data. It's very common for the total number of UMIs per cell (and relatedly, genes/cell) to contribute to structure in the data. You'll often see it correlate with the first or second principal component. To get around this, we can simply regress their effect out.
This can also be valuable for removing the effect of cell cycle if it's messing with the analysis. Unfortunately, Scanpy currently doesn't have a function for cell cycle classification. I find that Seurat does a great job at this, and for other projects, I've moved data into R, performed classification, and then brought the classifications back here to be regressed out.
--> after `sc.pp.regress_out`, consider rescaling the adata using `sc.pp.scale`
Exploring the data
Now that all the preprocessing is done, we can do some exploratory analysis. I don't expect PCA to show anything super useful here because of the complexity of the data, but it's the first visualization I always throw up just to see if anything stands out.
ax = sc.pl.pca_scatter(adata, color=['Tissue'], right_margin=0.5)
tSNE seems to be everyone's favourite visualization method for scRNA-seq data right now, and while I have gripes with its over-use and frequent over-interpretation, it can be nice for visualization of different populations of cells present when different populations exist. On that note, I should probably just include this link here: How to use tSNE effectively.
Pretty cool! You can start to see the various cell types that may exist forming distinct groups. Notice that some colours make up multiple groups of cells in the plot (eg. Trachea). This could represent the different cell types that make up the tissues.
We'll do some unsupervised clustering (Louvain modularity optimization) to identify different populations.
So the 12 tissues may be able to be broken down into 48 distinct cell types. A much deeper analysis would be needed to actually understand what cell types these clusters represent and their relations to eachother. Check out the manuscript though--they did a good job going through all that.
To finish off, I wanted to explore a graph-based visualization of this data. Like I mentioned above, interpreting tSNE plots can be...well, impossible. But it's hard to argue with it's complex compositions within data, which I guess is why the field has grabbed onto them so tightly. Graph-based methods, however, may be a useful alternative. I'm still learning about them and any strings that may be attached to their use, but it seems like they should retain cell-cell relationships much better than tSNE.
Scanpy has implemented multiple force-directed graph methods. Here, I'll use the Fruchterman Reingold method