suppressPackageStartupMessages({
library( Seurat )
library( tidyverse ) })
ReadMtx(
"~/tmp/ifnagrko/ifnagrko_raw_counts.mtx.gz",
"~/tmp/ifnagrko/ifnagrko_obs.csv",
"~/tmp/ifnagrko/ifnagrko_var.csv",
feature.column=2, skip.cell=1, skip.feature=1,
cell.sep=",", feature.sep=",",
mtx.transpose=TRUE ) -> count_matrixSingle-cell RNA-Seq: Pseudotime
Lecture Notes, Bioinformatics Basic Course
MSc studies in Molecular Biotechnology, Summer term 2024, University of Heidelberg
2024-05-03
Overview on the data
For today, we use example data from the lab of A. Martin-Villalba (DKFZ), who study adult neurogenesis. The data set comprises cells taken from the subventricular zone (SVZ) of the murine brain, a site where neuronal stem cells (NSCs) are known to reside. These mainly give rise to neuroblasts which then migrate to the olfactory bulb where they replace olfactory interneurons. However, the NSCs are know to also produce other cell types, and the neuroblasts may find use not only in the olfactory bulb.
For this data set, cells have been taken from the SVZ from several mice. FACS was used to select cells expressing GLAST on their surface, a marker for astrocytes and NSCs, which, however, persists a while and hence can still be found also on the neuroblasts.
Loading the data
The count matrix is given in three files (available on Moodle), one file with the counts (in MatrixMarket format), one with the cell barcodes and one with the gene names. This is like the output of the 10X CellRanger software. However, while we could use Seurat’s Read10X function on Thursday to read the PBMC example data set (that was made directly with CellRange), today’s files are formatted slightly differently. Hence, we use the mroe flexible ReadMtx function where we can specify the deviations in data format:
Here is the top left corner of count matrix:
count_matrix[1:5,1:5]5 x 5 sparse Matrix of class "dgCMatrix"
AAACCCAGTCTCCTGT-WT_20mo_2021_001 AAACGAAAGGTACTGG-WT_20mo_2021_001
Xkr4 . .
Rp1 . .
Sox17 . .
Gm37323 . .
Mrpl15 . .
AAACGAACATGGCTGC-WT_20mo_2021_001 AAACGAAGTAGCTGAG-WT_20mo_2021_001
Xkr4 . .
Rp1 . .
Sox17 . .
Gm37323 . .
Mrpl15 . .
AAACGAATCACCCTCA-WT_20mo_2021_001
Xkr4 .
Rp1 .
Sox17 .
Gm37323 .
Mrpl15 1
Overall, we have around 18k cells:
dim( count_matrix )[1] 20830 18302
Standard Seurat analysis
We now run a standard Seurat analysis, using the same workflow as we did yesterday, when we worked through Seurat’s “PBMC Guided Tutorial”. However, this time, we just concatenate the functions and run through them in one go:
CreateSeuratObject( count_matrix ) %>%
NormalizeData() %>%
FindVariableFeatures() %>%
ScaleData() %>%
RunPCA() %>%
RunUMAP(dims=1:15) %>%
FindNeighbors() %>%
FindClusters() -> seuWarning: Feature names cannot have underscores ('_'), replacing with dashes
('-')
Normalizing layer: counts
Finding variable features for layer counts
Centering and scaling data matrix
PC_ 1
Positive: Apoe, Aldoc, Sparcl1, Sdc4, Ptn, Cmtm5, Glul, Gpr37l1, Fxyd1, Atp1b2
S100a1, Slc4a4, Slc1a3, Prxl2a, F3, Itm2b, Mt1, Rgcc, Prdx6, Sfxn5
Sat1, Scrg1, Dbi, Hes5, Luzp2, Plaat3, Pla2g7, Sash1, Plpp3, Sparc
Negative: Tubb5, Sox11, Tubb3, Stmn1, Jpt1, Hmgb3, Ptma, Sox4, Dlx2, Cd24a
Igfbpl1, Dlx6os1, Map1b, Stmn2, Abracl, Tmsb4x, Lmnb1, Cdca7, Ccnd2, Elavl4
Cdk4, Dcx, Arx, Uchl1, EYFP, Celf4, Dlx5, Nrxn3, H1fx, Hmgn2
PC_ 2
Positive: Rorb, Cldn10, Clu, Mt3, Ntsr2, Mfge8, S1pr1, Id4, Slc1a2, Acsl6
Plpp3, Sox9, Ddah1, Bcan, Cxcl14, Btbd17, Mlc1, Cspg5, Fjx1, Aqp4
Ntm, Acsl3, Gabrb1, Tspan7, Lsamp, Chst2, Mt2, Lhx2, Slc39a12, Glud1
Negative: Ctss, C1qc, Laptm5, Csf1r, Trem2, C1qa, Cx3cr1, C1qb, Tyrobp, Ly86
Fcer1g, Siglech, Selplg, Fcrls, Tmem119, Fcgr3, Apbb1ip, Unc93b1, Cd53, Lpcat2
Spi1, Pld4, Olfml3, Irf8, Ctsh, Aif1, Cd300c2, Fyb, Otulinl, Mylip
PC_ 3
Positive: Atp1a3, Camk2b, Snhg11, Syt1, Nrip3, Kcnj4, Scg2, Snap25, Dnm1, Pcp4
Icam5, Ndrg4, Eef1a2, Eno2, Ano3, Ryr2, Arpp21, Ptk2b, Gng4, Kcna4
Penk, Slc4a10, Snca, Gad1, Rprml, Grin2a, C1qtnf4, Shisa8, Camk2a, Kcnb2
Negative: Hmgb2, Top2a, Pbk, Birc5, Mki67, Cdk1, Cdca8, Spc24, Cenpf, Spc25
Prc1, Rrm2, Mdk, Nusap1, Tpx2, Cdca3, Knl1, Ckap2l, Esco2, Aurkb
Cenpm, Ccna2, Bub1, Cks2, Kif11, Hist1h3c, Hist1h1b, Hmmr, Pclaf, Fbxo5
PC_ 4
Positive: C1qc, C1qa, Ctss, Trem2, Csf1r, C1qb, Cx3cr1, Laptm5, Fcer1g, Tyrobp
Ly86, Siglech, Selplg, Fcrls, Fcgr3, Hexb, Spi1, Cd53, Itgb5, Pld4
Ptgs1, Cd300c2, Aif1, Irf8, Fyb, Itgam, Cyth4, Ltc4s, Otulinl, Cd37
Negative: Frzb, Apod, Npy, Plp1, Vtn, Foxd3, Wnt6, Nr2f2, Edil3, Sox10
Gsn, Matn4, Fbln2, Aspa, Aqp1, Igf1, Plat, Lpar1, Igfbp4, Erbb3
Fabp7, Plppr4, Ptgds, Col23a1, Alx3, Hey2, Cd59a, Fam3c, Scd1, Mybpc1
PC_ 5
Positive: Top2a, Pbk, Birc5, Mki67, Spc25, Cdk1, Prc1, Nusap1, Spc24, Esco2
Tpx2, Knl1, Aurkb, Cenpf, Cdca8, Ckap2l, Kif11, Cdca3, Hist1h3c, Hmmr
Ccna2, Bub1, Incenp, Hist1h2af, Ndc80, Cit, Fbxo5, Kif4, Sgo1, Kif22
Negative: Stmn2, Igfbpl1, Cd24a, Nrep, Sox4, Map1b, Stmn4, Tubb3, Shtn1, Dlx6os1
Dcx, Ly6h, Sox11, Jpt1, Mpped2, Stmn1, Plxna4, Pbx3, Elavl4, Uchl1
Runx1t1, Cald1, Foxp2, Dlx2, Gad2, Celf4, Pfn2, Dlx5, Sp8, Tubb5
Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session
22:15:46 UMAP embedding parameters a = 0.9922 b = 1.112
22:15:46 Read 18302 rows and found 15 numeric columns
22:15:46 Using Annoy for neighbor search, n_neighbors = 30
22:15:46 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
22:15:49 Writing NN index file to temp file /tmp/Rtmp9EdHMj/file37283b5d146d
22:15:49 Searching Annoy index using 1 thread, search_k = 3000
22:15:59 Annoy recall = 100%
22:15:59 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
22:16:00 Initializing from normalized Laplacian + noise (using RSpectra)
22:16:05 Commencing optimization for 200 epochs, with 766518 positive edges
22:16:15 Optimization finished
Computing nearest neighbor graph
Computing SNN
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 18302
Number of edges: 562149
Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8880
Number of communities: 23
Elapsed time: 7 seconds
Here is the UMAP, with the clusters and their index numbers:
UMAPPlot( seu, label=TRUE ) + coord_equal()Here is a colouring by expression of Slc1a3, the gene coding for GLAST (the FACS marker mentioned above):
FeaturePlot( seu, c( "Slc1a3", "Mki67" ) ) + coord_equal()We see in the UMAP an elongated structure: this is the SVZ’s neurogenic lineage. At its start (at the bottom) we see the NSCs (that express Slc1a3). There (in clusters 0, 11 and 16) the NSCs are quiescent. When they get activated, their state changes to what is seen in clusters 4 and 8. Then, they enter the cell cycle (actually visible as cycle or ring in the UMAP; clusters 7, 17 and 12; marked with the cell-cycle marker Mki67) and are referred to as transit amplifying progenitors (TAPs). Afterwards they become neuroblasts (cluster 3, 1, 2, 15).
If we look at papers from the neurogenesis field to learn about markers for these stages (quiest NSCs, activated NSCs, TAPs, neuroblasts), we can check teh above assignment. To dave time, we skip over this and I hope you take my word for the assigment just mentioned.
Subsetting to the lineage
Let us remove all the outlying cluster that are not part of the trajectory from NSCs to neuroblasts.
The following clusters are part of the lineage (i.e., the main trajectory formed by the connected clusters):
lineage_clusters <- c( 16, 0, 11, 4, 8, 10, 12, 7, 17, 13, 3, 1, 2, 15 )We find the cluster assignment of each cell in the Seurat object’s slot seurat_clusters:
head( seu$seurat_clusters )AAACCCAGTCTCCTGT-WT_20mo_2021_001 AAACGAAAGGTACTGG-WT_20mo_2021_001
6 3
AAACGAACATGGCTGC-WT_20mo_2021_001 AAACGAAGTAGCTGAG-WT_20mo_2021_001
0 4
AAACGAATCACCCTCA-WT_20mo_2021_001 AAACGCTAGGGCAGGA-WT_20mo_2021_001
2 0
Levels: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
With the %in% operator, we can see which cells are in the lineage clusters:
head( seu$seurat_clusters %in% lineage_clusters )[1] FALSE TRUE TRUE TRUE TRUE TRUE
Using this, we subset the object
seu[ , seu$seurat_clusters %in% lineage_clusters ] -> seu_lng
seu_lngAn object of class Seurat
20830 features across 14779 samples within 1 assay
Active assay: RNA (20830 features, 2000 variable features)
3 layers present: counts, data, scale.data
2 dimensional reductions calculated: pca, umap
Now, we have only 14k cells. The UMAP now has only the core lineage:
UMAPPlot( seu_lng ) + coord_equal()Slingshot
We now use the Slingshot package to assign pseudotime values to the cells in the lineage.
suppressPackageStartupMessages(
library( slingshot) )As a first step, we ask Slingshot to find chains of connected clusters, which it calls “lineages”:
getLineages(
Embeddings( seu_lng, "pca" )[,1:15],
seu_lng$seurat_clusters ) -> ptoThe Slingshot function getLineage gets two parameters from us: First, for each cell its first 15 principal-component scores. In Seurat, the Embedding function extracts PCA, UMAP or similar coordinates from a Seurat object, as shown in the preceding code chunk. The second parameter is the vector of cluster assignments. The result is stored in Slingshot’s data object class, which it calls “PseudotimeOrdering”.
ptoclass: PseudotimeOrdering
dim: 14779 4
metadata(3): lineages mst slingParams
pathStats(2): pseudotime weights
cellnames(14779): AAACGAAAGGTACTGG-WT_20mo_2021_001
AAACGAACATGGCTGC-WT_20mo_2021_001 ...
TTTGTTGAGCCGTTAT-WT_4mo_2021_001 TTTGTTGTCTGAGCAT-WT_4mo_2021_001
cellData names(2): reducedDim clusterLabels
pathnames(4): Lineage1 Lineage2 Lineage3 Lineage4
pathData names(0):
There are a number of function in Slingshot to extract the information stored in such an object. For example, we can look at the maximum spanning tree (MST):
plot( slingMST(pto) )Here, each vertex is a cluster. The paths connecting them are the trajectories that Slingshot will examine. There are four “lineages” that tarverse the cluster along the graph just shown:
slingLineages(pto)$Lineage1
[1] "16" "11" "0" "4" "8" "12" "13" "3" "1" "2"
$Lineage2
[1] "16" "11" "0" "4" "8" "12" "13" "3" "1" "15"
$Lineage3
[1] "16" "11" "0" "4" "8" "12" "17" "7"
$Lineage4
[1] "16" "11" "0" "4" "8" "12" "13" "10"
We will later concentrate on lineage 1, which is the longest one, going all through the trajectory. Lineage 2 is a variant, with a slight change at the end, lineage 3 and 4 enter the cycle. Unfortunately, Slingshot cannot deal with cycles, which is why those lineages don’t close.
Now, we as ask Slingshot to assign to the cells in each lineage coordinate values, called “pseudotimes”, which increase smoothly all along the lienage, from the beginning to the end. This step takes a few minutes.
getCurves(pto) -> ptoAs that step takes a while, we save the result, so that we can restart from here, if needed, without having to repeat it.
save( seu, pto, file="ifnagrko_seu_pto.rda" )To restart from here, we load the data with
load( "ifnagrko_seu_pto.rda" )The result of the getCurves call is a matrix, with one column for each lineage, that assigns each cell a pseudotime value. For cells that don’t appear in a lineage, the pseudotime value is NA.
slingPseudotime(pto) %>% head() Lineage1 Lineage2 Lineage3 Lineage4
AAACGAAAGGTACTGG-WT_20mo_2021_001 52.133082 52.233944 NA NA
AAACGAACATGGCTGC-WT_20mo_2021_001 11.671661 11.670258 11.681115 11.670240
AAACGAAGTAGCTGAG-WT_20mo_2021_001 19.878485 19.867406 19.427261 19.865931
AAACGAATCACCCTCA-WT_20mo_2021_001 56.833040 NA NA NA
AAACGCTAGGGCAGGA-WT_20mo_2021_001 9.562365 9.561429 9.562658 9.561693
AAACGCTTCAGTCTTT-WT_20mo_2021_001 12.358920 12.355300 12.330698 12.334660
Plotting the pseudotime
TO visualize the pseudotime, we assemble a table from the UMAP coordinates and the pseudotime values.
First, we get the matrix of UMAP coordinates from the Seurat object and convert it to a tibble.
Embeddings( seu_lng, "umap") %>% as_tibble( rownames="barcode" ) %>% head()# A tibble: 6 × 3
barcode umap_1 umap_2
<chr> <dbl> <dbl>
1 AAACGAAAGGTACTGG-WT_20mo_2021_001 -0.468 7.39
2 AAACGAACATGGCTGC-WT_20mo_2021_001 -3.07 -6.99
3 AAACGAAGTAGCTGAG-WT_20mo_2021_001 -0.523 -4.94
4 AAACGAATCACCCTCA-WT_20mo_2021_001 -2.39 12.1
5 AAACGCTAGGGCAGGA-WT_20mo_2021_001 -2.65 -11.2
6 AAACGCTTCAGTCTTT-WT_20mo_2021_001 -1.09 -9.02
The, we do the same for the pseudotime matrix
slingPseudotime(pto) %>% as_tibble( rownames="barcode" ) %>% head()# A tibble: 6 × 5
barcode Lineage1 Lineage2 Lineage3 Lineage4
<chr> <dbl> <dbl> <dbl> <dbl>
1 AAACGAAAGGTACTGG-WT_20mo_2021_001 52.1 52.2 NA NA
2 AAACGAACATGGCTGC-WT_20mo_2021_001 11.7 11.7 11.7 11.7
3 AAACGAAGTAGCTGAG-WT_20mo_2021_001 19.9 19.9 19.4 19.9
4 AAACGAATCACCCTCA-WT_20mo_2021_001 56.8 NA NA NA
5 AAACGCTAGGGCAGGA-WT_20mo_2021_001 9.56 9.56 9.56 9.56
6 AAACGCTTCAGTCTTT-WT_20mo_2021_001 12.4 12.4 12.3 12.3
and merge the two tables with a table join:
left_join(
Embeddings( seu_lng, "umap") %>% as_tibble( rownames="barcode" ),
slingPseudotime(pto) %>% as_tibble( rownames="barcode" ) ) -> umap_ptJoining with `by = join_by(barcode)`
head( umap_pt )# A tibble: 6 × 7
barcode umap_1 umap_2 Lineage1 Lineage2 Lineage3 Lineage4
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AAACGAAAGGTACTGG-WT_20mo_20… -0.468 7.39 52.1 52.2 NA NA
2 AAACGAACATGGCTGC-WT_20mo_20… -3.07 -6.99 11.7 11.7 11.7 11.7
3 AAACGAAGTAGCTGAG-WT_20mo_20… -0.523 -4.94 19.9 19.9 19.4 19.9
4 AAACGAATCACCCTCA-WT_20mo_20… -2.39 12.1 56.8 NA NA NA
5 AAACGCTAGGGCAGGA-WT_20mo_20… -2.65 -11.2 9.56 9.56 9.56 9.56
6 AAACGCTTCAGTCTTT-WT_20mo_20… -1.09 -9.02 12.4 12.4 12.3 12.3
This table is availble on Moodle as file umap_pt.csv
Now, we can plot lineage 1:
umap_pt %>%
ggplot( aes( x=umap_1, y=umap_2, col=Lineage1 ) ) +
geom_point( size=.3 ) + coord_equal() +
scale_color_viridis_c() + ggtitle( "Lineage 1" )Here are the other three lineages:
umap_pt %>%
ggplot( aes( x=umap_1, y=umap_2, col=Lineage2 ) ) +
geom_point( size=.3 ) + coord_equal() +
scale_color_viridis_c() + ggtitle( "Lineage 2" )umap_pt %>%
ggplot( aes( x=umap_1, y=umap_2, col=Lineage3 ) ) +
geom_point( size=.3 ) + coord_equal() +
scale_color_viridis_c() + ggtitle( "Lineage 3" )umap_pt %>%
ggplot( aes( x=umap_1, y=umap_2, col=Lineage4 ) ) +
geom_point( size=.3 ) + coord_equal() +
scale_color_viridis_c() + ggtitle( "Lineage 4" )Plotting a gene along pseudotime
Here are the counts for one gene, namely Slc1a3:
seu[["RNA"]]$counts["Slc1a3",] %>% enframe( "barcode", "count" ) %>% head()# A tibble: 6 × 2
barcode count
<chr> <dbl>
1 AAACCCAGTCTCCTGT-WT_20mo_2021_001 11
2 AAACGAAAGGTACTGG-WT_20mo_2021_001 0
3 AAACGAACATGGCTGC-WT_20mo_2021_001 11
4 AAACGAAGTAGCTGAG-WT_20mo_2021_001 24
5 AAACGAATCACCCTCA-WT_20mo_2021_001 0
6 AAACGCTAGGGCAGGA-WT_20mo_2021_001 24
Here, seu[["RNA"]]$counts extracts the count matrix (the one we started with at the very beginning) from the Seurat object. We take the row labelled "Slc1a3" from that matrix. Using square bracket indexing, we get the row as named vector. Then, enframe turns this vector into a data frame with two columns, with the names (cell barcodes) and the values (read counts for gene Slc1a3).
In the same manner, we can get the column sums, i.e., the read count totals per cell:
colSums( seu[["RNA"]]$counts ) %>% enframe( "barcode", "colSum" ) %>% head()# A tibble: 6 × 2
barcode colSum
<chr> <dbl>
1 AAACCCAGTCTCCTGT-WT_20mo_2021_001 8934
2 AAACGAAAGGTACTGG-WT_20mo_2021_001 7672
3 AAACGAACATGGCTGC-WT_20mo_2021_001 6889
4 AAACGAAGTAGCTGAG-WT_20mo_2021_001 5131
5 AAACGAATCACCCTCA-WT_20mo_2021_001 5988
6 AAACGCTAGGGCAGGA-WT_20mo_2021_001 9866
We can now append these two small tables to our umap_pt table using table joins:
umap_pt %>%
left_join( colSums( seu[["RNA"]]$counts ) %>% enframe( "barcode", "colSum" ) ) %>%
left_join( seu[["RNA"]]$counts["Slc1a3",] %>% enframe( "barcode", "count" ) ) %>%
head() Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
# A tibble: 6 × 9
barcode umap_1 umap_2 Lineage1 Lineage2 Lineage3 Lineage4 colSum count
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AAACGAAAGGTACT… -0.468 7.39 52.1 52.2 NA NA 7672 0
2 AAACGAACATGGCT… -3.07 -6.99 11.7 11.7 11.7 11.7 6889 11
3 AAACGAAGTAGCTG… -0.523 -4.94 19.9 19.9 19.4 19.9 5131 24
4 AAACGAATCACCCT… -2.39 12.1 56.8 NA NA NA 5988 0
5 AAACGCTAGGGCAG… -2.65 -11.2 9.56 9.56 9.56 9.56 9866 24
6 AAACGCTTCAGTCT… -1.09 -9.02 12.4 12.4 12.3 12.3 6100 31
This allows us to make the following plot
umap_pt %>%
left_join( colSums( seu[["RNA"]]$counts ) %>% enframe( "barcode", "colSum" ) ) %>%
left_join( seu[["RNA"]]$counts["Slc1a3",] %>% enframe( "barcode", "count" ) ) %>%
ggplot +
geom_point(
aes(
x = Lineage1,
y = log10( count/colSum * 1e4 + 1 )
), size=.3 ) Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Warning: Removed 1875 rows containing missing values or values outside the scale range
(`geom_point()`).
Here, we see how the expression strength of the gene changes along the pseudotime trajectory. Note that we used our usual log normalization for the y axis
Binnig pseudotime
This looks quite noisy. Let’s average over several cells to smoothen this out.
To this end, we put the cells into “bins” of similar pseudotime. As the pseudotime for lineage 1 runs from 0 to 65, we divide it by 6.5 and then round down to the next lower integer (function floor) in order to get 10 bins numbered from 0 to 9:
umap_pt %>%
left_join( colSums( seu[["RNA"]]$counts ) %>% enframe( "barcode", "colSum" ) ) %>%
left_join( seu[["RNA"]]$counts["Slc1a3",] %>% enframe( "barcode", "count" ) ) %>%
mutate( pt_bin = floor( Lineage1 / 6.5 ) ) %>%
head()Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
# A tibble: 6 × 10
barcode umap_1 umap_2 Lineage1 Lineage2 Lineage3 Lineage4 colSum count pt_bin
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AAACGAA… -0.468 7.39 52.1 52.2 NA NA 7672 0 8
2 AAACGAA… -3.07 -6.99 11.7 11.7 11.7 11.7 6889 11 1
3 AAACGAA… -0.523 -4.94 19.9 19.9 19.4 19.9 5131 24 3
4 AAACGAA… -2.39 12.1 56.8 NA NA NA 5988 0 8
5 AAACGCT… -2.65 -11.2 9.56 9.56 9.56 9.56 9866 24 1
6 AAACGCT… -1.09 -9.02 12.4 12.4 12.3 12.3 6100 31 1
With group_by( pt_bin ), we now group the table rows into 11 groups, one for each value of pt_bin (values 0 to 9, and NA for cells outside the lineage). Then, using summarize, we add up the counts and the colsums for all the cells in a group (bin):
umap_pt %>%
left_join( colSums( seu[["RNA"]]$counts ) %>% enframe( "barcode", "colSum" ) ) %>%
left_join( seu[["RNA"]]$counts["Slc1a3",] %>% enframe( "barcode", "count" ) ) %>%
mutate( pt_bin = floor( Lineage1 / 6.5 ) ) %>%
group_by( pt_bin ) %>%
summarize( sum(count), sum(colSum) )Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
# A tibble: 11 × 3
pt_bin `sum(count)` `sum(colSum)`
<dbl> <dbl> <dbl>
1 0 16074 3658514
2 1 94525 25285896
3 2 20585 6442558
4 3 14436 5898141
5 4 5448 2908013
6 5 2532 2593208
7 6 593 2587333
8 7 301 9609219
9 8 751 23770581
10 9 23 252731
11 NA 2900 21051163
We can also divide the count sum by the colSum sum, to get the fraction of all the reads from cells of a given pseudotime bin that fall onto gene Slc1a3:
umap_pt %>%
left_join( colSums( seu[["RNA"]]$counts ) %>% enframe( "barcode", "colSum" ) ) %>%
left_join( seu[["RNA"]]$counts["Slc1a3",] %>% enframe( "barcode", "count" ) ) %>%
mutate( pt_bin = floor( Lineage1 / 6.5 ) ) %>%
group_by( pt_bin ) %>%
summarize( fraction = sum(count) / sum(colSum) )Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
# A tibble: 11 × 2
pt_bin fraction
<dbl> <dbl>
1 0 0.00439
2 1 0.00374
3 2 0.00320
4 3 0.00245
5 4 0.00187
6 5 0.000976
7 6 0.000229
8 7 0.0000313
9 8 0.0000316
10 9 0.0000910
11 NA 0.000138
Defining a function
We will soon want to perform this operation for several genes. Therefore, we define a fucntion:
get_binned_fraction <- function( gene ) {
umap_pt %>%
left_join( colSums( seu[["RNA"]]$counts ) %>% enframe( "barcode", "colSum" ) ) %>%
left_join( seu[["RNA"]]$counts[gene,] %>% enframe( "barcode", "count" ) ) %>%
mutate( pt_bin = floor( Lineage1 / 6.5 ) ) %>%
group_by( pt_bin ) %>%
summarize( fraction = sum(count) / sum(colSum) )
}Note how the body of this function (the part in curly braces) is exactly the code we just used, with one change: the explicitly stated gene name "Slc1a3" has been replaced with the placeholder gene. This is called a function argument, and it is listed at the function head, behind the word function. We now have a new R function that we can call with an arbitrary gene name:
get_binned_fraction( "Slc1a3" )Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
# A tibble: 11 × 2
pt_bin fraction
<dbl> <dbl>
1 0 0.00439
2 1 0.00374
3 2 0.00320
4 3 0.00245
5 4 0.00187
6 5 0.000976
7 6 0.000229
8 7 0.0000313
9 8 0.0000316
10 9 0.0000910
11 NA 0.000138
Now we can easily make a plot for this gene:
get_binned_fraction( "Slc1a3" ) %>%
ggplot() +
geom_line( aes( x=pt_bin, y=fraction ) ) +
scale_y_log10()Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).
or for any other gene:
get_binned_fraction( "Aqp4" ) %>%
ggplot() +
geom_line( aes( x=pt_bin, y=fraction ) ) +
scale_y_log10()Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).
A plotting loop
Here is a list of potentially ionteresting genes:
interesting_genes <- c( "Aqp4", "Slc1a3", "Egfr", "Mki67",
"Dcx", "Prokr2", "Rbfox3", "Gria1", "Calb2" )With a for loop, we can ask R to execute a given code for each of these genes:
for( gene in interesting_genes ) {
print(
get_binned_fraction( gene ) %>%
ggplot() +
geom_line( aes( x=pt_bin, y=fraction ) ) +
scale_y_log10() + ggtitle( gene )
)
}Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Warning in scale_y_log10(): log-10 transformation introduced infinite values.
Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Warning in scale_y_log10(): log-10 transformation introduced infinite values.
Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Warning in scale_y_log10(): log-10 transformation introduced infinite values.
Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).
(The reason why we needed to enclose the plotting command in a “print” statement is quite technical; we skip over this.)
A heatmap
We can also use the tidyverse function map_dfr to call our get_binned_fraction function for each of the genes. We will get a dataframe with the 10 values for the 10 pseudotime bins for each gene, and map_dfr will concatenate all these tables:
map_dfr( set_names(interesting_genes), get_binned_fraction, .id="gene" ) -> long_bin_tableJoining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
head( long_bin_table, 15 )# A tibble: 15 × 3
gene pt_bin fraction
<chr> <dbl> <dbl>
1 Aqp4 0 0.00115
2 Aqp4 1 0.000830
3 Aqp4 2 0.000517
4 Aqp4 3 0.000112
5 Aqp4 4 0.0000677
6 Aqp4 5 0.0000378
7 Aqp4 6 0.0000143
8 Aqp4 7 0.00000281
9 Aqp4 8 0.00000341
10 Aqp4 9 0.00000791
11 Aqp4 NA 0.00000461
12 Slc1a3 0 0.00439
13 Slc1a3 1 0.00374
14 Slc1a3 2 0.00320
15 Slc1a3 3 0.00245
We do not have time to explain the intricacies of map_dfr, and just use this table to make a heatmap. For this, we use geom_tile:
long_bin_table %>%
ggplot +
geom_tile( aes( x=pt_bin, y=gene, fill=fraction ) )Warning: Removed 9 rows containing missing values or values outside the scale range
(`geom_tile()`).
It might be helpful if each row where normalized such that it uses the full range of the colour scale. To this end, we divide the faction values for each gene by the respective maximum value:
long_bin_table %>%
group_by( gene ) %>%
mutate( nfrac = fraction / max(fraction) ) %>%
head(15)# A tibble: 15 × 4
# Groups: gene [2]
gene pt_bin fraction nfrac
<chr> <dbl> <dbl> <dbl>
1 Aqp4 0 0.00115 1
2 Aqp4 1 0.000830 0.724
3 Aqp4 2 0.000517 0.451
4 Aqp4 3 0.000112 0.0972
5 Aqp4 4 0.0000677 0.0590
6 Aqp4 5 0.0000378 0.0329
7 Aqp4 6 0.0000143 0.0125
8 Aqp4 7 0.00000281 0.00245
9 Aqp4 8 0.00000341 0.00297
10 Aqp4 9 0.00000791 0.00690
11 Aqp4 NA 0.00000461 0.00402
12 Slc1a3 0 0.00439 1
13 Slc1a3 1 0.00374 0.851
14 Slc1a3 2 0.00320 0.727
15 Slc1a3 3 0.00245 0.557
Here is our improved heatmap:
long_bin_table %>%
group_by( gene ) %>%
mutate( nfrac = fraction / max(fraction) ) %>%
ggplot +
geom_tile( aes( x=pt_bin, y=gene, fill=nfrac ) )Warning: Removed 9 rows containing missing values or values outside the scale range
(`geom_tile()`).
We can now see for each in which pseudotime bin it takes its maximum value.
Finally, we might try another colour scale
long_bin_table %>%
group_by( gene ) %>%
mutate( nfrac = fraction / max(fraction) ) %>%
ggplot +
geom_tile( aes( x=pt_bin, y=gene, fill=nfrac ) ) +
scale_fill_viridis_c()Warning: Removed 9 rows containing missing values or values outside the scale range
(`geom_tile()`).
Whethere this one is nicer might be a matter of taste.
If we now did this heatmap for many mroe genes, and maybe a bit finer bins, we can get a goos overview of the order in which genes get activated when cells progress along the differentiation trajectory from stem cells to neurons. This can help us gain insights into how this process is regulated.