This is a user-friendly manual for
biologists who are interested in performing quantitative trajectory
inference with the Treefit package. If you are not familiar with the
basic workflow of Treefit, then it would be a good idea to look at the
introductory tutorial (vignette("treefit")
) before using
it.
Single-cell technologies are expected to help us discover a novel type of cells and revolutionize our understanding of the process of cell differentiation, and trajectory inference is one of the key computational challenges in single-cell transcriptomics. In recent years, many software packages have been developed and widely used to extract an underlying “tree” trajectory from single-cell RNA-seq data.
However, trajectory inference often suffers from the uncertainty due to the heterogeneity of individual cells and the high levels of technical noise in single-cell experiments. Hence, it was a fatal problem that there was no method to quantitatively assess the reliability of the estimated trajectory and to distinguish distinct cell types in an objective manner so far. Of course, many handy software packages are available for visualizing data and helping perform exploratory data analysis; however, we must stress that different visualization techniques often result in completely different pictures and the interpretation can be quite subjective and thus it is hard to reach a scientific truth solely by this approach. In order to facilitate scientific discoveries from single-cell gene expression data, we need a quantitative methodology.
This is why we developed Treefit - the first software for performing quantitative trajectory inference. Treefit can be used in conjunction with existing popular software packages such as Seurat and dynverse. In this tutorial, we demonstrate how to use Treefit together with Seurat by analyzing the single-cell RNA-seq dataset in Trapnell el al., 2014, Nature Biotechnology.
As the Treefit package is meant to be used with gene expression data (either raw counts or normalized values) that have been already quality controlled, it is a good idea to use Seurat or other popular toolkits designed for the quality control of single-cell gene expression data (e.g., filtering cells and normalizing the data) before the Treefit analysis.
We analyze the dataset provided in Trapnell el al., 2014, Nature Biotechnology. This dataset was originally acquired for the purpose of studying the process of differentiation of human myoblasts that can be reasonably explained by the linear model (i.e., non-branching, chain-like structure). However, as the authors discuss in detail in the original article, their trajectory inference software based on a minimum spanning tree (MST) helped detect contamination with another cell lineage which was not related to myogenesis and so revealed that the bifurcation tree model (i.e., Y-shaped tree structure) fits the data better than the linear model.
To use this dataset, we download and read the dataset provided in dynverse as follows (see https://zenodo.org/record/1443566 for details).
From now, we demonstrate how to preprocess the above data with
Seurat. In order to handle data in Seurat, we first need to create a
Seurat
object. When creating a Seurat
object
from dynverse data, we must be aware that the roles of the rows and
columns are reversed and therefore we need to transpose the matrix. In
dynverse data, the rows and columns correspond to cells and genes,
respectively, and vice versa in Seurat
objects.
Here we assume that we wish to perform the following preprocessing.
This can be done by setting the designated parameters as follows.
trapnell <- Seurat::CreateSeuratObject(counts=t(trapnell.dynverse$count),
min.cells=3,
min.features=200)
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
## Warning: Data is of class matrix. Coercing to dgCMatrix.
The “Feature names cannot have underscores …” warning message can be ignored. It’s caused by difference between dynverse and Seurat. Dynverse uses underscores for gene names but Seurat doesn’t allow it. Seurat replaces underscores with dashes automatically with the warning message.
Although pictures may not always describe reality, we usually visualize data before starting detailed analysis. Here we create a scatter plot by principal component analysis (PCA). Below are a few lines of code to do this and the result is shown in Figure 1.
trapnell <- Seurat::FindVariableFeatures(trapnell, verbose=FALSE)
trapnell <- Seurat::ScaleData(trapnell, verbose=FALSE)
## Warning: No layers found matching search pattern provided
## Error in `ScaleData()`:
## ! No layer matching pattern 'data' found. Please run NormalizeData and retry
## Warning: No layers found matching search pattern provided
## Error in `PrepDR5()`:
## ! No layer matching pattern 'scale.data' not found. Please run ScaleData and retry
## Error in `object[[reduction]]`:
## ! 'pca' not found in this Seurat object
##
It is hard to reach the truth from the PCA result shown in Figure 1.
Unlike the tree-like toy data analyzed in the previous tutorial
(vignette("treefit")
), we cannot easily recognize the
underlying tree structure for various reasons (e.g., high level
of noise, contamination with different types of cells, and the
inaccuracy of two-dimensional visualization of high dimensional
data).
Figure 2b in Trapnell el al., 2014, Nature Biotechnology shows that this dataset contains three distinct types of cells (i.e., proliferating cells, differentiating myoblasts, and contaminating interstitial mesenchymal cells), which led the authors to adopt a Y-shaped tree model (i.e., a star tree with three arms) to explain the data. However, it is difficult to quantify the reliability of this tree model and to confidently predict that the best-fit tree consists of three principal paths.
Let us use Treefit in order to estimate the tree-likeness of the
preprocessed myoblast data (which contain 290 cells) and to predict the
number of principal paths in the best-fit tree. The workflow is the same
as the toy data analysis described in the previous tutorial
(vignette("treefit")
).
## Error in `Seurat::GetAssayData()`:
## ! `assay` must be one of "RNA", not "counts".
## Error: object 'trapnell.fit' not found
## Error: object 'trapnell.fit' not found
As explained in the previous tutorial
(vignette("treefit")
), the two Grassmann distances
$max_cca_distance
and $rms_cca_distance
play
an essential role in the Treefit analysis. The goal of the first
analysis using $max_cca_distance
is to measure the
goodness-of-fit between data and data-derived tree trajectories, and
this helps biologists make a mathematical and statistical evidence-based
quantitative discussion. The other analysis with
$rms_cca_distance
aims to predict the number of principal
paths in the best-fit tree trajectory, and the result of this analysis
can be helpful to discover novel cell types or make sure whether or not
there are contaminating cell types.
If we compare the mean values of $max_cca_distance
shown
in the left panel of Figure 2 with the previous results
(vignette("treefit")
), we can see that the tree-likeness of
the myoblast dataset is more like the noisy tree-like data with the
fatness
0.8
rather than the tree-like data
with the fatness
0.1
. This estimate is
reasonable in light of the fact that the myoblast dataset in Trapnell el al.,
2014, Nature Biotechnology contains some contaminating cell types
that are not related to myogenesis.
Turning to the right panel of Figure 2, we see that
$rms_cca_distance
attains the local minimum at p=2.
Therefore, we can deduce that the best-fit tree trajectory for the
myoblast data consists of p+1=3 principal paths.
$n_principal_paths_candidates[1]
also tells us the number
principal paths is 3. Recalling the discussion in Trapnell el al.,
2014, Nature Biotechnology (i.e., the cells in the dataset
can be classified into the following three groups: proliferating cells;
differentiating myoblasts; and contaminating interstitial mesenchymal
cells), we thus see that Treefit has correctly predicted the number of
distinct principal paths forming the best-fit tree trajectory for this
dataset.