Single-cell RNA-Seq: Pseudotime

Author

Simon Anders

Lecture Notes, Bioinformatics Basic Course

MSc studies in Molecular Biotechnology, Summer term 2024, University of Heidelberg

2024-05-03

Overview on the data

For today, we use example data from the lab of A. Martin-Villalba (DKFZ), who study adult neurogenesis. The data set comprises cells taken from the subventricular zone (SVZ) of the murine brain, a site where neuronal stem cells (NSCs) are known to reside. These mainly give rise to neuroblasts which then migrate to the olfactory bulb where they replace olfactory interneurons. However, the NSCs are know to also produce other cell types, and the neuroblasts may find use not only in the olfactory bulb.

For this data set, cells have been taken from the SVZ from several mice. FACS was used to select cells expressing GLAST on their surface, a marker for astrocytes and NSCs, which, however, persists a while and hence can still be found also on the neuroblasts.

Loading the data

The count matrix is given in three files (available on Moodle), one file with the counts (in MatrixMarket format), one with the cell barcodes and one with the gene names. This is like the output of the 10X CellRanger software. However, while we could use Seurat’s Read10X function on Thursday to read the PBMC example data set (that was made directly with CellRange), today’s files are formatted slightly differently. Hence, we use the mroe flexible ReadMtx function where we can specify the deviations in data format:

suppressPackageStartupMessages({
  library( Seurat )
  library( tidyverse ) })

ReadMtx( 
  "~/tmp/ifnagrko/ifnagrko_raw_counts.mtx.gz",
  "~/tmp/ifnagrko/ifnagrko_obs.csv",
  "~/tmp/ifnagrko/ifnagrko_var.csv", 
  feature.column=2, skip.cell=1, skip.feature=1, 
  cell.sep=",", feature.sep=",",
  mtx.transpose=TRUE ) -> count_matrix

Here is the top left corner of count matrix:

count_matrix[1:5,1:5]
5 x 5 sparse Matrix of class "dgCMatrix"
        AAACCCAGTCTCCTGT-WT_20mo_2021_001 AAACGAAAGGTACTGG-WT_20mo_2021_001
Xkr4                                    .                                 .
Rp1                                     .                                 .
Sox17                                   .                                 .
Gm37323                                 .                                 .
Mrpl15                                  .                                 .
        AAACGAACATGGCTGC-WT_20mo_2021_001 AAACGAAGTAGCTGAG-WT_20mo_2021_001
Xkr4                                    .                                 .
Rp1                                     .                                 .
Sox17                                   .                                 .
Gm37323                                 .                                 .
Mrpl15                                  .                                 .
        AAACGAATCACCCTCA-WT_20mo_2021_001
Xkr4                                    .
Rp1                                     .
Sox17                                   .
Gm37323                                 .
Mrpl15                                  1

Overall, we have around 18k cells:

dim( count_matrix )
[1] 20830 18302

Standard Seurat analysis

We now run a standard Seurat analysis, using the same workflow as we did yesterday, when we worked through Seurat’s “PBMC Guided Tutorial”. However, this time, we just concatenate the functions and run through them in one go:

CreateSeuratObject( count_matrix ) %>%
NormalizeData() %>%
FindVariableFeatures() %>%
ScaleData() %>%
RunPCA() %>%
RunUMAP(dims=1:15) %>%
FindNeighbors() %>% 
FindClusters() -> seu
Warning: Feature names cannot have underscores ('_'), replacing with dashes
('-')
Normalizing layer: counts
Finding variable features for layer counts
Centering and scaling data matrix
PC_ 1 
Positive:  Apoe, Aldoc, Sparcl1, Sdc4, Ptn, Cmtm5, Glul, Gpr37l1, Fxyd1, Atp1b2 
       S100a1, Slc4a4, Slc1a3, Prxl2a, F3, Itm2b, Mt1, Rgcc, Prdx6, Sfxn5 
       Sat1, Scrg1, Dbi, Hes5, Luzp2, Plaat3, Pla2g7, Sash1, Plpp3, Sparc 
Negative:  Tubb5, Sox11, Tubb3, Stmn1, Jpt1, Hmgb3, Ptma, Sox4, Dlx2, Cd24a 
       Igfbpl1, Dlx6os1, Map1b, Stmn2, Abracl, Tmsb4x, Lmnb1, Cdca7, Ccnd2, Elavl4 
       Cdk4, Dcx, Arx, Uchl1, EYFP, Celf4, Dlx5, Nrxn3, H1fx, Hmgn2 
PC_ 2 
Positive:  Rorb, Cldn10, Clu, Mt3, Ntsr2, Mfge8, S1pr1, Id4, Slc1a2, Acsl6 
       Plpp3, Sox9, Ddah1, Bcan, Cxcl14, Btbd17, Mlc1, Cspg5, Fjx1, Aqp4 
       Ntm, Acsl3, Gabrb1, Tspan7, Lsamp, Chst2, Mt2, Lhx2, Slc39a12, Glud1 
Negative:  Ctss, C1qc, Laptm5, Csf1r, Trem2, C1qa, Cx3cr1, C1qb, Tyrobp, Ly86 
       Fcer1g, Siglech, Selplg, Fcrls, Tmem119, Fcgr3, Apbb1ip, Unc93b1, Cd53, Lpcat2 
       Spi1, Pld4, Olfml3, Irf8, Ctsh, Aif1, Cd300c2, Fyb, Otulinl, Mylip 
PC_ 3 
Positive:  Atp1a3, Camk2b, Snhg11, Syt1, Nrip3, Kcnj4, Scg2, Snap25, Dnm1, Pcp4 
       Icam5, Ndrg4, Eef1a2, Eno2, Ano3, Ryr2, Arpp21, Ptk2b, Gng4, Kcna4 
       Penk, Slc4a10, Snca, Gad1, Rprml, Grin2a, C1qtnf4, Shisa8, Camk2a, Kcnb2 
Negative:  Hmgb2, Top2a, Pbk, Birc5, Mki67, Cdk1, Cdca8, Spc24, Cenpf, Spc25 
       Prc1, Rrm2, Mdk, Nusap1, Tpx2, Cdca3, Knl1, Ckap2l, Esco2, Aurkb 
       Cenpm, Ccna2, Bub1, Cks2, Kif11, Hist1h3c, Hist1h1b, Hmmr, Pclaf, Fbxo5 
PC_ 4 
Positive:  C1qc, C1qa, Ctss, Trem2, Csf1r, C1qb, Cx3cr1, Laptm5, Fcer1g, Tyrobp 
       Ly86, Siglech, Selplg, Fcrls, Fcgr3, Hexb, Spi1, Cd53, Itgb5, Pld4 
       Ptgs1, Cd300c2, Aif1, Irf8, Fyb, Itgam, Cyth4, Ltc4s, Otulinl, Cd37 
Negative:  Frzb, Apod, Npy, Plp1, Vtn, Foxd3, Wnt6, Nr2f2, Edil3, Sox10 
       Gsn, Matn4, Fbln2, Aspa, Aqp1, Igf1, Plat, Lpar1, Igfbp4, Erbb3 
       Fabp7, Plppr4, Ptgds, Col23a1, Alx3, Hey2, Cd59a, Fam3c, Scd1, Mybpc1 
PC_ 5 
Positive:  Top2a, Pbk, Birc5, Mki67, Spc25, Cdk1, Prc1, Nusap1, Spc24, Esco2 
       Tpx2, Knl1, Aurkb, Cenpf, Cdca8, Ckap2l, Kif11, Cdca3, Hist1h3c, Hmmr 
       Ccna2, Bub1, Incenp, Hist1h2af, Ndc80, Cit, Fbxo5, Kif4, Sgo1, Kif22 
Negative:  Stmn2, Igfbpl1, Cd24a, Nrep, Sox4, Map1b, Stmn4, Tubb3, Shtn1, Dlx6os1 
       Dcx, Ly6h, Sox11, Jpt1, Mpped2, Stmn1, Plxna4, Pbx3, Elavl4, Uchl1 
       Runx1t1, Cald1, Foxp2, Dlx2, Gad2, Celf4, Pfn2, Dlx5, Sp8, Tubb5 
Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session
22:15:46 UMAP embedding parameters a = 0.9922 b = 1.112
22:15:46 Read 18302 rows and found 15 numeric columns
22:15:46 Using Annoy for neighbor search, n_neighbors = 30
22:15:46 Building Annoy index with metric = cosine, n_trees = 50
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
22:15:49 Writing NN index file to temp file /tmp/Rtmp9EdHMj/file37283b5d146d
22:15:49 Searching Annoy index using 1 thread, search_k = 3000
22:15:59 Annoy recall = 100%
22:15:59 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
22:16:00 Initializing from normalized Laplacian + noise (using RSpectra)
22:16:05 Commencing optimization for 200 epochs, with 766518 positive edges
22:16:15 Optimization finished
Computing nearest neighbor graph
Computing SNN
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 18302
Number of edges: 562149

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8880
Number of communities: 23
Elapsed time: 7 seconds

Here is the UMAP, with the clusters and their index numbers:

UMAPPlot( seu, label=TRUE ) + coord_equal()

Here is a colouring by expression of Slc1a3, the gene coding for GLAST (the FACS marker mentioned above):

FeaturePlot( seu, c( "Slc1a3", "Mki67" ) ) + coord_equal()

We see in the UMAP an elongated structure: this is the SVZ’s neurogenic lineage. At its start (at the bottom) we see the NSCs (that express Slc1a3). There (in clusters 0, 11 and 16) the NSCs are quiescent. When they get activated, their state changes to what is seen in clusters 4 and 8. Then, they enter the cell cycle (actually visible as cycle or ring in the UMAP; clusters 7, 17 and 12; marked with the cell-cycle marker Mki67) and are referred to as transit amplifying progenitors (TAPs). Afterwards they become neuroblasts (cluster 3, 1, 2, 15).

If we look at papers from the neurogenesis field to learn about markers for these stages (quiest NSCs, activated NSCs, TAPs, neuroblasts), we can check teh above assignment. To dave time, we skip over this and I hope you take my word for the assigment just mentioned.

Subsetting to the lineage

Let us remove all the outlying cluster that are not part of the trajectory from NSCs to neuroblasts.

The following clusters are part of the lineage (i.e., the main trajectory formed by the connected clusters):

lineage_clusters <- c( 16, 0, 11, 4, 8, 10, 12, 7, 17, 13, 3, 1, 2, 15 )

We find the cluster assignment of each cell in the Seurat object’s slot seurat_clusters:

head( seu$seurat_clusters )
AAACCCAGTCTCCTGT-WT_20mo_2021_001 AAACGAAAGGTACTGG-WT_20mo_2021_001 
                                6                                 3 
AAACGAACATGGCTGC-WT_20mo_2021_001 AAACGAAGTAGCTGAG-WT_20mo_2021_001 
                                0                                 4 
AAACGAATCACCCTCA-WT_20mo_2021_001 AAACGCTAGGGCAGGA-WT_20mo_2021_001 
                                2                                 0 
Levels: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22

With the %in% operator, we can see which cells are in the lineage clusters:

head( seu$seurat_clusters %in% lineage_clusters )
[1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

Using this, we subset the object

seu[ , seu$seurat_clusters %in% lineage_clusters ] -> seu_lng

seu_lng
An object of class Seurat 
20830 features across 14779 samples within 1 assay 
Active assay: RNA (20830 features, 2000 variable features)
 3 layers present: counts, data, scale.data
 2 dimensional reductions calculated: pca, umap

Now, we have only 14k cells. The UMAP now has only the core lineage:

UMAPPlot( seu_lng ) + coord_equal()

Slingshot

We now use the Slingshot package to assign pseudotime values to the cells in the lineage.

suppressPackageStartupMessages(
  library( slingshot) )

As a first step, we ask Slingshot to find chains of connected clusters, which it calls “lineages”:

getLineages( 
  Embeddings( seu_lng, "pca" )[,1:15],
  seu_lng$seurat_clusters ) -> pto

The Slingshot function getLineage gets two parameters from us: First, for each cell its first 15 principal-component scores. In Seurat, the Embedding function extracts PCA, UMAP or similar coordinates from a Seurat object, as shown in the preceding code chunk. The second parameter is the vector of cluster assignments. The result is stored in Slingshot’s data object class, which it calls “PseudotimeOrdering”.

pto
class: PseudotimeOrdering 
dim: 14779 4 
metadata(3): lineages mst slingParams
pathStats(2): pseudotime weights
cellnames(14779): AAACGAAAGGTACTGG-WT_20mo_2021_001
  AAACGAACATGGCTGC-WT_20mo_2021_001 ...
  TTTGTTGAGCCGTTAT-WT_4mo_2021_001 TTTGTTGTCTGAGCAT-WT_4mo_2021_001
cellData names(2): reducedDim clusterLabels
pathnames(4): Lineage1 Lineage2 Lineage3 Lineage4
pathData names(0):

There are a number of function in Slingshot to extract the information stored in such an object. For example, we can look at the maximum spanning tree (MST):

plot( slingMST(pto) )

Here, each vertex is a cluster. The paths connecting them are the trajectories that Slingshot will examine. There are four “lineages” that tarverse the cluster along the graph just shown:

slingLineages(pto)
$Lineage1
 [1] "16" "11" "0"  "4"  "8"  "12" "13" "3"  "1"  "2" 

$Lineage2
 [1] "16" "11" "0"  "4"  "8"  "12" "13" "3"  "1"  "15"

$Lineage3
[1] "16" "11" "0"  "4"  "8"  "12" "17" "7" 

$Lineage4
[1] "16" "11" "0"  "4"  "8"  "12" "13" "10"

We will later concentrate on lineage 1, which is the longest one, going all through the trajectory. Lineage 2 is a variant, with a slight change at the end, lineage 3 and 4 enter the cycle. Unfortunately, Slingshot cannot deal with cycles, which is why those lineages don’t close.

Now, we as ask Slingshot to assign to the cells in each lineage coordinate values, called “pseudotimes”, which increase smoothly all along the lienage, from the beginning to the end. This step takes a few minutes.

getCurves(pto) -> pto

As that step takes a while, we save the result, so that we can restart from here, if needed, without having to repeat it.

save( seu, pto, file="ifnagrko_seu_pto.rda" )

To restart from here, we load the data with

load( "ifnagrko_seu_pto.rda" )

The result of the getCurves call is a matrix, with one column for each lineage, that assigns each cell a pseudotime value. For cells that don’t appear in a lineage, the pseudotime value is NA.

slingPseudotime(pto) %>% head()
                                   Lineage1  Lineage2  Lineage3  Lineage4
AAACGAAAGGTACTGG-WT_20mo_2021_001 52.133082 52.233944        NA        NA
AAACGAACATGGCTGC-WT_20mo_2021_001 11.671661 11.670258 11.681115 11.670240
AAACGAAGTAGCTGAG-WT_20mo_2021_001 19.878485 19.867406 19.427261 19.865931
AAACGAATCACCCTCA-WT_20mo_2021_001 56.833040        NA        NA        NA
AAACGCTAGGGCAGGA-WT_20mo_2021_001  9.562365  9.561429  9.562658  9.561693
AAACGCTTCAGTCTTT-WT_20mo_2021_001 12.358920 12.355300 12.330698 12.334660

Plotting the pseudotime

TO visualize the pseudotime, we assemble a table from the UMAP coordinates and the pseudotime values.

First, we get the matrix of UMAP coordinates from the Seurat object and convert it to a tibble.

Embeddings( seu_lng, "umap") %>% as_tibble( rownames="barcode" ) %>% head()
# A tibble: 6 × 3
  barcode                           umap_1 umap_2
  <chr>                              <dbl>  <dbl>
1 AAACGAAAGGTACTGG-WT_20mo_2021_001 -0.468   7.39
2 AAACGAACATGGCTGC-WT_20mo_2021_001 -3.07   -6.99
3 AAACGAAGTAGCTGAG-WT_20mo_2021_001 -0.523  -4.94
4 AAACGAATCACCCTCA-WT_20mo_2021_001 -2.39   12.1 
5 AAACGCTAGGGCAGGA-WT_20mo_2021_001 -2.65  -11.2 
6 AAACGCTTCAGTCTTT-WT_20mo_2021_001 -1.09   -9.02

The, we do the same for the pseudotime matrix

slingPseudotime(pto) %>% as_tibble( rownames="barcode" ) %>% head()
# A tibble: 6 × 5
  barcode                           Lineage1 Lineage2 Lineage3 Lineage4
  <chr>                                <dbl>    <dbl>    <dbl>    <dbl>
1 AAACGAAAGGTACTGG-WT_20mo_2021_001    52.1     52.2     NA       NA   
2 AAACGAACATGGCTGC-WT_20mo_2021_001    11.7     11.7     11.7     11.7 
3 AAACGAAGTAGCTGAG-WT_20mo_2021_001    19.9     19.9     19.4     19.9 
4 AAACGAATCACCCTCA-WT_20mo_2021_001    56.8     NA       NA       NA   
5 AAACGCTAGGGCAGGA-WT_20mo_2021_001     9.56     9.56     9.56     9.56
6 AAACGCTTCAGTCTTT-WT_20mo_2021_001    12.4     12.4     12.3     12.3 

and merge the two tables with a table join:

left_join(
  Embeddings( seu_lng, "umap") %>% as_tibble( rownames="barcode" ),
  slingPseudotime(pto) %>% as_tibble( rownames="barcode" ) ) -> umap_pt
Joining with `by = join_by(barcode)`
head( umap_pt )
# A tibble: 6 × 7
  barcode                      umap_1 umap_2 Lineage1 Lineage2 Lineage3 Lineage4
  <chr>                         <dbl>  <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
1 AAACGAAAGGTACTGG-WT_20mo_20… -0.468   7.39    52.1     52.2     NA       NA   
2 AAACGAACATGGCTGC-WT_20mo_20… -3.07   -6.99    11.7     11.7     11.7     11.7 
3 AAACGAAGTAGCTGAG-WT_20mo_20… -0.523  -4.94    19.9     19.9     19.4     19.9 
4 AAACGAATCACCCTCA-WT_20mo_20… -2.39   12.1     56.8     NA       NA       NA   
5 AAACGCTAGGGCAGGA-WT_20mo_20… -2.65  -11.2      9.56     9.56     9.56     9.56
6 AAACGCTTCAGTCTTT-WT_20mo_20… -1.09   -9.02    12.4     12.4     12.3     12.3 

This table is availble on Moodle as file umap_pt.csv

Now, we can plot lineage 1:

umap_pt %>%
ggplot( aes( x=umap_1, y=umap_2, col=Lineage1 ) ) +
  geom_point( size=.3 ) + coord_equal() +
  scale_color_viridis_c() + ggtitle( "Lineage 1" )

Here are the other three lineages:

umap_pt %>%
ggplot( aes( x=umap_1, y=umap_2, col=Lineage2 ) ) +
  geom_point( size=.3 ) + coord_equal() +
  scale_color_viridis_c() + ggtitle( "Lineage 2" )

umap_pt %>%
ggplot( aes( x=umap_1, y=umap_2, col=Lineage3 ) ) +
  geom_point( size=.3 ) + coord_equal() +
  scale_color_viridis_c() + ggtitle( "Lineage 3" )

umap_pt %>%
ggplot( aes( x=umap_1, y=umap_2, col=Lineage4 ) ) +
  geom_point( size=.3 ) + coord_equal() +
  scale_color_viridis_c() + ggtitle( "Lineage 4" )

Plotting a gene along pseudotime

Here are the counts for one gene, namely Slc1a3:

seu[["RNA"]]$counts["Slc1a3",] %>% enframe( "barcode", "count" ) %>% head()
# A tibble: 6 × 2
  barcode                           count
  <chr>                             <dbl>
1 AAACCCAGTCTCCTGT-WT_20mo_2021_001    11
2 AAACGAAAGGTACTGG-WT_20mo_2021_001     0
3 AAACGAACATGGCTGC-WT_20mo_2021_001    11
4 AAACGAAGTAGCTGAG-WT_20mo_2021_001    24
5 AAACGAATCACCCTCA-WT_20mo_2021_001     0
6 AAACGCTAGGGCAGGA-WT_20mo_2021_001    24

Here, seu[["RNA"]]$counts extracts the count matrix (the one we started with at the very beginning) from the Seurat object. We take the row labelled "Slc1a3" from that matrix. Using square bracket indexing, we get the row as named vector. Then, enframe turns this vector into a data frame with two columns, with the names (cell barcodes) and the values (read counts for gene Slc1a3).

In the same manner, we can get the column sums, i.e., the read count totals per cell:

colSums( seu[["RNA"]]$counts ) %>% enframe( "barcode", "colSum" ) %>% head()
# A tibble: 6 × 2
  barcode                           colSum
  <chr>                              <dbl>
1 AAACCCAGTCTCCTGT-WT_20mo_2021_001   8934
2 AAACGAAAGGTACTGG-WT_20mo_2021_001   7672
3 AAACGAACATGGCTGC-WT_20mo_2021_001   6889
4 AAACGAAGTAGCTGAG-WT_20mo_2021_001   5131
5 AAACGAATCACCCTCA-WT_20mo_2021_001   5988
6 AAACGCTAGGGCAGGA-WT_20mo_2021_001   9866

We can now append these two small tables to our umap_pt table using table joins:

umap_pt %>%
left_join( colSums( seu[["RNA"]]$counts ) %>% enframe( "barcode", "colSum" ) ) %>%
left_join( seu[["RNA"]]$counts["Slc1a3",] %>% enframe( "barcode", "count" ) ) %>%
head()  
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
# A tibble: 6 × 9
  barcode         umap_1 umap_2 Lineage1 Lineage2 Lineage3 Lineage4 colSum count
  <chr>            <dbl>  <dbl>    <dbl>    <dbl>    <dbl>    <dbl>  <dbl> <dbl>
1 AAACGAAAGGTACT… -0.468   7.39    52.1     52.2     NA       NA      7672     0
2 AAACGAACATGGCT… -3.07   -6.99    11.7     11.7     11.7     11.7    6889    11
3 AAACGAAGTAGCTG… -0.523  -4.94    19.9     19.9     19.4     19.9    5131    24
4 AAACGAATCACCCT… -2.39   12.1     56.8     NA       NA       NA      5988     0
5 AAACGCTAGGGCAG… -2.65  -11.2      9.56     9.56     9.56     9.56   9866    24
6 AAACGCTTCAGTCT… -1.09   -9.02    12.4     12.4     12.3     12.3    6100    31

This allows us to make the following plot

umap_pt %>%
left_join( colSums( seu[["RNA"]]$counts ) %>% enframe( "barcode", "colSum" ) ) %>%
left_join( seu[["RNA"]]$counts["Slc1a3",] %>% enframe( "barcode", "count" ) ) %>%
ggplot +
  geom_point( 
    aes(
       x = Lineage1,
       y = log10( count/colSum * 1e4 + 1 ) 
    ), size=.3 ) 
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Warning: Removed 1875 rows containing missing values or values outside the scale range
(`geom_point()`).

Here, we see how the expression strength of the gene changes along the pseudotime trajectory. Note that we used our usual log normalization for the y axis

Binnig pseudotime

This looks quite noisy. Let’s average over several cells to smoothen this out.

To this end, we put the cells into “bins” of similar pseudotime. As the pseudotime for lineage 1 runs from 0 to 65, we divide it by 6.5 and then round down to the next lower integer (function floor) in order to get 10 bins numbered from 0 to 9:

umap_pt %>%
left_join( colSums( seu[["RNA"]]$counts ) %>% enframe( "barcode", "colSum" ) ) %>%
left_join( seu[["RNA"]]$counts["Slc1a3",] %>% enframe( "barcode", "count" ) ) %>%
mutate( pt_bin = floor( Lineage1 / 6.5 ) ) %>%
head()
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
# A tibble: 6 × 10
  barcode  umap_1 umap_2 Lineage1 Lineage2 Lineage3 Lineage4 colSum count pt_bin
  <chr>     <dbl>  <dbl>    <dbl>    <dbl>    <dbl>    <dbl>  <dbl> <dbl>  <dbl>
1 AAACGAA… -0.468   7.39    52.1     52.2     NA       NA      7672     0      8
2 AAACGAA… -3.07   -6.99    11.7     11.7     11.7     11.7    6889    11      1
3 AAACGAA… -0.523  -4.94    19.9     19.9     19.4     19.9    5131    24      3
4 AAACGAA… -2.39   12.1     56.8     NA       NA       NA      5988     0      8
5 AAACGCT… -2.65  -11.2      9.56     9.56     9.56     9.56   9866    24      1
6 AAACGCT… -1.09   -9.02    12.4     12.4     12.3     12.3    6100    31      1

With group_by( pt_bin ), we now group the table rows into 11 groups, one for each value of pt_bin (values 0 to 9, and NA for cells outside the lineage). Then, using summarize, we add up the counts and the colsums for all the cells in a group (bin):

umap_pt %>%
left_join( colSums( seu[["RNA"]]$counts ) %>% enframe( "barcode", "colSum" ) ) %>%
left_join( seu[["RNA"]]$counts["Slc1a3",] %>% enframe( "barcode", "count" ) ) %>%
mutate( pt_bin = floor( Lineage1 / 6.5 ) ) %>%
group_by( pt_bin ) %>%
summarize( sum(count), sum(colSum) )
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
# A tibble: 11 × 3
   pt_bin `sum(count)` `sum(colSum)`
    <dbl>        <dbl>         <dbl>
 1      0        16074       3658514
 2      1        94525      25285896
 3      2        20585       6442558
 4      3        14436       5898141
 5      4         5448       2908013
 6      5         2532       2593208
 7      6          593       2587333
 8      7          301       9609219
 9      8          751      23770581
10      9           23        252731
11     NA         2900      21051163

We can also divide the count sum by the colSum sum, to get the fraction of all the reads from cells of a given pseudotime bin that fall onto gene Slc1a3:

umap_pt %>%
left_join( colSums( seu[["RNA"]]$counts ) %>% enframe( "barcode", "colSum" ) ) %>%
left_join( seu[["RNA"]]$counts["Slc1a3",] %>% enframe( "barcode", "count" ) ) %>%
mutate( pt_bin = floor( Lineage1 / 6.5 ) ) %>%
group_by( pt_bin ) %>%
summarize( fraction = sum(count) / sum(colSum) )
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
# A tibble: 11 × 2
   pt_bin  fraction
    <dbl>     <dbl>
 1      0 0.00439  
 2      1 0.00374  
 3      2 0.00320  
 4      3 0.00245  
 5      4 0.00187  
 6      5 0.000976 
 7      6 0.000229 
 8      7 0.0000313
 9      8 0.0000316
10      9 0.0000910
11     NA 0.000138 

Defining a function

We will soon want to perform this operation for several genes. Therefore, we define a fucntion:

get_binned_fraction <- function( gene ) {
  umap_pt %>%
  left_join( colSums( seu[["RNA"]]$counts ) %>% enframe( "barcode", "colSum" ) ) %>%
  left_join( seu[["RNA"]]$counts[gene,] %>% enframe( "barcode", "count" ) ) %>%
  mutate( pt_bin = floor( Lineage1 / 6.5 ) ) %>%
  group_by( pt_bin ) %>%
  summarize( fraction = sum(count) / sum(colSum) )
}

Note how the body of this function (the part in curly braces) is exactly the code we just used, with one change: the explicitly stated gene name "Slc1a3" has been replaced with the placeholder gene. This is called a function argument, and it is listed at the function head, behind the word function. We now have a new R function that we can call with an arbitrary gene name:

get_binned_fraction( "Slc1a3" )
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
# A tibble: 11 × 2
   pt_bin  fraction
    <dbl>     <dbl>
 1      0 0.00439  
 2      1 0.00374  
 3      2 0.00320  
 4      3 0.00245  
 5      4 0.00187  
 6      5 0.000976 
 7      6 0.000229 
 8      7 0.0000313
 9      8 0.0000316
10      9 0.0000910
11     NA 0.000138 

Now we can easily make a plot for this gene:

get_binned_fraction( "Slc1a3" ) %>%
ggplot() +
  geom_line( aes( x=pt_bin, y=fraction ) ) +
  scale_y_log10()
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).

or for any other gene:

get_binned_fraction( "Aqp4" ) %>%
ggplot() +
  geom_line( aes( x=pt_bin, y=fraction ) ) +
  scale_y_log10()
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).

A plotting loop

Here is a list of potentially ionteresting genes:

interesting_genes <- c( "Aqp4", "Slc1a3", "Egfr", "Mki67", 
    "Dcx", "Prokr2", "Rbfox3", "Gria1", "Calb2" )

With a for loop, we can ask R to execute a given code for each of these genes:

for( gene in interesting_genes ) {
  print( 
    get_binned_fraction( gene ) %>%
    ggplot() +
      geom_line( aes( x=pt_bin, y=fraction ) ) +
      scale_y_log10() + ggtitle( gene ) 
  )
}
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`

Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`

Warning in scale_y_log10(): log-10 transformation introduced infinite values.
Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`

Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`

Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`

Warning in scale_y_log10(): log-10 transformation introduced infinite values.
Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`

Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`

Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`

Warning in scale_y_log10(): log-10 transformation introduced infinite values.
Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).

(The reason why we needed to enclose the plotting command in a “print” statement is quite technical; we skip over this.)

A heatmap

We can also use the tidyverse function map_dfr to call our get_binned_fraction function for each of the genes. We will get a dataframe with the 10 values for the 10 pseudotime bins for each gene, and map_dfr will concatenate all these tables:

map_dfr( set_names(interesting_genes), get_binned_fraction, .id="gene" ) -> long_bin_table
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
Joining with `by = join_by(barcode)`
head( long_bin_table, 15 )
# A tibble: 15 × 3
   gene   pt_bin   fraction
   <chr>   <dbl>      <dbl>
 1 Aqp4        0 0.00115   
 2 Aqp4        1 0.000830  
 3 Aqp4        2 0.000517  
 4 Aqp4        3 0.000112  
 5 Aqp4        4 0.0000677 
 6 Aqp4        5 0.0000378 
 7 Aqp4        6 0.0000143 
 8 Aqp4        7 0.00000281
 9 Aqp4        8 0.00000341
10 Aqp4        9 0.00000791
11 Aqp4       NA 0.00000461
12 Slc1a3      0 0.00439   
13 Slc1a3      1 0.00374   
14 Slc1a3      2 0.00320   
15 Slc1a3      3 0.00245   

We do not have time to explain the intricacies of map_dfr, and just use this table to make a heatmap. For this, we use geom_tile:

long_bin_table %>%
ggplot + 
  geom_tile( aes( x=pt_bin, y=gene, fill=fraction ) )
Warning: Removed 9 rows containing missing values or values outside the scale range
(`geom_tile()`).

It might be helpful if each row where normalized such that it uses the full range of the colour scale. To this end, we divide the faction values for each gene by the respective maximum value:

long_bin_table %>%
group_by( gene ) %>%
mutate( nfrac = fraction / max(fraction) ) %>%
head(15)
# A tibble: 15 × 4
# Groups:   gene [2]
   gene   pt_bin   fraction   nfrac
   <chr>   <dbl>      <dbl>   <dbl>
 1 Aqp4        0 0.00115    1      
 2 Aqp4        1 0.000830   0.724  
 3 Aqp4        2 0.000517   0.451  
 4 Aqp4        3 0.000112   0.0972 
 5 Aqp4        4 0.0000677  0.0590 
 6 Aqp4        5 0.0000378  0.0329 
 7 Aqp4        6 0.0000143  0.0125 
 8 Aqp4        7 0.00000281 0.00245
 9 Aqp4        8 0.00000341 0.00297
10 Aqp4        9 0.00000791 0.00690
11 Aqp4       NA 0.00000461 0.00402
12 Slc1a3      0 0.00439    1      
13 Slc1a3      1 0.00374    0.851  
14 Slc1a3      2 0.00320    0.727  
15 Slc1a3      3 0.00245    0.557  

Here is our improved heatmap:

long_bin_table %>%
group_by( gene ) %>%
mutate( nfrac = fraction / max(fraction) ) %>%
ggplot + 
  geom_tile( aes( x=pt_bin, y=gene, fill=nfrac ) )
Warning: Removed 9 rows containing missing values or values outside the scale range
(`geom_tile()`).

We can now see for each in which pseudotime bin it takes its maximum value.

Finally, we might try another colour scale

long_bin_table %>%
group_by( gene ) %>%
mutate( nfrac = fraction / max(fraction) ) %>%
ggplot + 
  geom_tile( aes( x=pt_bin, y=gene, fill=nfrac ) ) +
  scale_fill_viridis_c()
Warning: Removed 9 rows containing missing values or values outside the scale range
(`geom_tile()`).

Whethere this one is nicer might be a matter of taste.

If we now did this heatmap for many mroe genes, and maybe a bit finer bins, we can get a goos overview of the order in which genes get activated when cells progress along the differentiation trajectory from stem cells to neurons. This can help us gain insights into how this process is regulated.