Welcome to the blog

Posts

My thoughts and ideas

TCR/BCR Repertoire Analysis | Griffith Lab

RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

TCR/BCR Repertoire Analysis

In this module, we will use our scRNA seurat object to explore the immune receptor T and B cells diversity. To recognize both antigens and tumor neoantigens, T and B cells can generate diverse receptor sequences through somatic V(D)J recombination. We will be using scRepertoire to explore the receptor data. This R package provides several convenient processing and visualization functions that are easy to understand and use. We will then add the clonal information for both B and T cells back onto our Seurat object to be used in further analysis.

#BiocManager::install("scRepertoire")

library("scRepertoire")
library("dplyr")
library("Seurat")

Exploring TCRs

We first read in the filtered_contig_annotations.csv output from the 10x Genomics Cell Ranger for all samples. The cellranger vdj pipeline provides amino acid and nucleotide sequences for framework and complementarity determining regions (CDRs). The filtered_contig_annotations.csv file contains high-level annotations of each high-confidence contig from cell-associated barcodes.

Loading in data

Rep1_ICB_t <- read.csv("data/single_cell_rna/clonotypes_t_posit/Rep1_ICB-t-filtered_contig_annotations.csv")
Rep1_ICBdT_t <- read.csv("data/single_cell_rna/clonotypes_t_posit/Rep1_ICBdT-t-filtered_contig_annotations.csv")
Rep3_ICB_t <- read.csv("data/single_cell_rna/clonotypes_t_posit/Rep3_ICB-t-filtered_contig_annotations.csv")
Rep3_ICBdT_t <- read.csv("data/single_cell_rna/clonotypes_t_posit/Rep3_ICBdT-t-filtered_contig_annotations.csv")
Rep5_ICB_t <- read.csv("data/single_cell_rna/clonotypes_t_posit/Rep5_ICB-t-filtered_contig_annotations.csv")
Rep5_ICBdT_t <- read.csv("data/single_cell_rna/clonotypes_t_posit/Rep5_ICBdT-t-filtered_contig_annotations.csv")

# create a list of all samples' TCR data 
TCR.contigs <- list(Rep1_ICB_t, Rep1_ICBdT_t, Rep3_ICB_t, Rep3_ICBdT_t, Rep5_ICB_t, Rep5_ICBdT_t)

Combining contigs into clones

Use scRepertoire’s combineTCR function to create an object with all samples’ TCR data. Additional filtering parameters are set to FALSE. removeNA will remove any chain with NA values, removeMulti will remove barcodes with greater than 2 chains, and filterMulti will allow for the selection of the 2 corresponding chains with the highest expression for a single barcode.


combined.TCR <- combineTCR(TCR.contigs, 
                           samples = c("Rep1_ICB", "Rep1_ICBdT", "Rep3_ICB", 
                                       "Rep3_ICBdT", "Rep5_ICB", "Rep5_ICBdT"),
                           removeNA = FALSE, 
                           removeMulti = FALSE, 
                           filterMulti = FALSE)

# make sure the object looks correct
head(combined.TCR[[1]])

                       barcode   sample                                          TCR1                      cdr3_aa1
1  Rep1_ICB_AAACCTGAGCGGATCA-1 Rep1_ICB                           TRAV3-3.TRAJ37.TRAC                 CAVMTGNTGKLIF
3  Rep1_ICB_AAACCTGCATACTCTT-1 Rep1_ICB                      TRAV4-4-DV10.TRAJ52.TRAC                CAASTGANTGKLTF
5  Rep1_ICB_AAACCTGCATCATCCC-1 Rep1_ICB                       TRAV6-7-DV9.TRAJ31.TRAC                 CALSAYSNNRIFF
7  Rep1_ICB_AAACCTGGTTCGGGCT-1 Rep1_ICB                                          <NA>                          <NA>
8  Rep1_ICB_AAACCTGTCAGTCCCT-1 Rep1_ICB TRAV14D-3-DV8.TRAJ22.TRAC;TRAV4-3.TRAJ34.TRAC CAASASSGSWQLIF;CAAESSSNTNKVVF
11 Rep1_ICB_AAAGATGAGCTTCGCG-1 Rep1_ICB                           TRAV9N-3.TRAJ4.TRAC                CAVSRGGSFNKLTF
                                                                                cdr3_nt1                         TCR2        cdr3_aa2
1                                                TGCGCAGTCATGACAGGCAATACCGGAAAACTCATCTTT   TRBV14.TRBD1.TRBJ1-1.TRBC1  CASSLGQGGTEVFF
3                                             TGTGCTGCTTCCACTGGAGCTAACACTGGAAAGCTCACGTTT    TRBV3.TRBD1.TRBJ1-1.TRBC1  CASSSGTGDTEVFF
5                                                TGTGCTCTGAGTGCCTATAGCAATAACAGAATCTTCTTT      TRBV31.NA.TRBJ2-1.TRBC2    CAWSALHAEQFF
7                                                                                   <NA>   TRBV20.TRBD1.TRBJ1-2.TRBC1   CGAIQGANSDYTF
8  TGTGCAGCAAGTGCATCTTCTGGCAGCTGGCAACTCATCTTT;TGTGCTGCTGAGTCATCTTCCAATACCAACAAAGTCGTCTTT   TRBV16.TRBD1.TRBJ1-4.TRBC1 CASSHSTGGNERLFF
11                                            TGTGCTGTGAGCCGGGGGGGTAGCTTCAATAAGTTGACCTTT TRBV13-1.TRBD1.TRBJ1-1.TRBC1  CASSRQGEGTEVFF
                                        cdr3_nt2                                                                   CTgene
1     TGTGCCAGCAGTCTGGGCCAGGGAGGCACAGAAGTCTTCTTT                           TRAV3-3.TRAJ37.TRAC_TRBV14.TRBD1.TRBJ1-1.TRBC1
3     TGTGCCAGCAGCTCCGGGACAGGGGACACAGAAGTCTTCTTT                       TRAV4-4-DV10.TRAJ52.TRAC_TRBV3.TRBD1.TRBJ1-1.TRBC1
5           TGTGCCTGGAGTGCCCTCCATGCTGAGCAGTTCTTC                          TRAV6-7-DV9.TRAJ31.TRAC_TRBV31.NA.TRBJ2-1.TRBC2
7        TGTGGTGCTATACAGGGGGCAAACTCCGACTACACCTTC                                            NA_TRBV20.TRBD1.TRBJ1-2.TRBC1
8  TGTGCAAGCAGCCACTCGACAGGGGGCAACGAAAGATTATTTTTC TRAV14D-3-DV8.TRAJ22.TRAC;TRAV4-3.TRAJ34.TRAC_TRBV16.TRBD1.TRBJ1-4.TRBC1
11    TGTGCCAGCAGTCGACAGGGGGAGGGCACAGAAGTCTTCTTT                         TRAV9N-3.TRAJ4.TRAC_TRBV13-1.TRBD1.TRBJ1-1.TRBC1
                                                                                                                                  CTnt
1                                                   TGCGCAGTCATGACAGGCAATACCGGAAAACTCATCTTT_TGTGCCAGCAGTCTGGGCCAGGGAGGCACAGAAGTCTTCTTT
3                                                TGTGCTGCTTCCACTGGAGCTAACACTGGAAAGCTCACGTTT_TGTGCCAGCAGCTCCGGGACAGGGGACACAGAAGTCTTCTTT
5                                                         TGTGCTCTGAGTGCCTATAGCAATAACAGAATCTTCTTT_TGTGCCTGGAGTGCCCTCCATGCTGAGCAGTTCTTC
7                                                                                           NA_TGTGGTGCTATACAGGGGGCAAACTCCGACTACACCTTC
8  TGTGCAGCAAGTGCATCTTCTGGCAGCTGGCAACTCATCTTT;TGTGCTGCTGAGTCATCTTCCAATACCAACAAAGTCGTCTTT_TGTGCAAGCAGCCACTCGACAGGGGGCAACGAAAGATTATTTTTC
11                                               TGTGCTGTGAGCCGGGGGGGTAGCTTCAATAAGTTGACCTTT_TGTGCCAGCAGTCGACAGGGGGAGGGCACAGAAGTCTTCTTT
                                            CTaa
1                   CAVMTGNTGKLIF_CASSLGQGGTEVFF
3                  CAASTGANTGKLTF_CASSSGTGDTEVFF
5                     CALSAYSNNRIFF_CAWSALHAEQFF
7                               NA_CGAIQGANSDYTF
8  CAASASSGSWQLIF;CAAESSSNTNKVVF_CASSHSTGGNERLFF
11                 CAVSRGGSFNKLTF_CASSRQGEGTEVFF

Visualize the Number of Clones

scRepertoire defines clones as TCRs/BCRs with shared/trackable complementarity-determining region 3 (CDR3) sequences. You can define clones using the amino acid sequence (aa), nucleotide (nt), or the V(D)JC genes (genes). The latter genes would be a more permissive definition of “clones”, as multiple amino acid or nucleotide sequences can result from the same gene combination. You can also use a combination of the V(D)JC and nucleotide sequence (strict). scRepertoire also allows for the users to select both or individual chains to examine.

First, we will visualize the number of clones per sample. There are many ways to count clones and, as discussed above, scReportoire gives us several options. Here are some sample commands to view the counts, try playing around with changing the cloneCall parameter to gene, nt, aa, or strict.

Let’s convince ourselves that the scReportoire counts the correct number of clones.

# view the total number of unique clones
# Change whether the clone is called by nucleotides, amino acids, or 

clonalQuant(combined.TCR, 
            cloneCall="nt", 
            chain = "both", 
            scale = FALSE) +
clonalQuant(combined.TCR, 
              cloneCall="aa", 
              chain = "both", 
              scale = FALSE) +
clonalQuant(combined.TCR, 
              cloneCall="gene", 
              chain = "both", 
              scale = FALSE) +
clonalQuant(combined.TCR, 
            cloneCall="strict", # gene and nucleotide 
            chain = "both", 
            scale = FALSE)
# view the relative percent of unique clones scaled by the total size of the clonal repertoire
clonalQuant(combined.TCR, 
            cloneCall="strict", 
            chain = "both", 
            scale = TRUE)

# When we change clone call to "gene", we lower the strictness of what we call a clone
# requiring only VDJC gene
# strict requires a VDJC gene + CDR3 nucleotide
clonalQuant(combined.TCR, 
            cloneCall="gene", 
            chain = "both", 
            scale = TRUE)

We can also examine the relative distribution of clones by abundance. Using clonalAbundance() will produce a line graph with the total number of clones by the number of instances within the sample or run.

# line graph with a total number of clones by the number of instances within the sample or run
# The relative distribution of clones by abundance
clonalAbundance(combined.TCR, 
                cloneCall = "aa", 
                scale = FALSE)


# the length distribution of the CDR3 sequences by calling 
# clone call must be CDR3 nucleotide (nt) OR CDR3 amino acid (aa)
clonalLength(combined.TCR, 
             cloneCall="aa", 
             chain = "both")

Then we can produce alluvial plots that compare the clone between samples. We don’t expect to see much similarity because each sample is from a different mouse.

# We can also look at clones between samples 
clonalCompare(combined.TCR, 
              top.clones = 25, 
              samples = c("Rep3_ICB", "Rep3_ICBdT"), 
              cloneCall="aa", 
              graph = "alluvial")

clonalCompare(combined.TCR, 
              top.clones = 25, 
              samples = c("Rep3_ICB", "Rep3_ICBdT"), 
              cloneCall="gene", 
              graph = "alluvial")

clonalCompare(combined.TCR, 
              top.clones = 25, 
              samples = c("Rep3_ICB", "Rep3_ICBdT"), 
              cloneCall="nt", 
              graph = "alluvial") + NoLegend()

clonalCompare(combined.TCR, 
              top.clones = 10, 
              samples = c("Rep3_ICB", "Rep3_ICBdT"), 
              cloneCall="strict", 
              graph = "alluvial")

clonalCompare(combined.TCR, 
              top.clones = 25, 
              samples = c("Rep1_ICB", "Rep1_ICBdT", "Rep3_ICB", "Rep3_ICBdT", "Rep5_ICB", "Rep5_ICBdT"), 
              cloneCall="aa", 
              graph = "alluvial") + NoLegend()

If we wanted to export the results of our comparison, we could do something like this:

clonotype_table <- clonalCompare(combined.TCR, 
                                 top.clones = 50, 
                                 samples = c("Rep1_ICB", "Rep1_ICBdT", "Rep3_ICB", "Rep3_ICBdT", "Rep5_ICB", "Rep5_ICBdT"), 
                                 cloneCall="aa", 
                                 graph = "alluvial", exportTable = TRUE)

Exploring BCR

We can run almost the exact same commands with our BCR data. Let’s read in our files and create a BCR object.


Rep1_ICB_b <- read.csv("data/single_cell_rna/clonotypes_b_posit/Rep1_ICB-b-filtered_contig_annotations.csv")
Rep1_ICBdT_b <- read.csv("data/single_cell_rna/clonotypes_b_posit/Rep1_ICBdT-b-filtered_contig_annotations.csv")
Rep3_ICB_b <- read.csv("data/single_cell_rna/clonotypes_b_posit/Rep3_ICB-b-filtered_contig_annotations.csv")
Rep3_ICBdT_b <- read.csv("data/single_cell_rna/clonotypes_b_posit/Rep3_ICBdT-b-filtered_contig_annotations.csv")
Rep5_ICB_b <- read.csv("data/single_cell_rna/clonotypes_b_posit/Rep5_ICB-b-filtered_contig_annotations.csv")
Rep5_ICBdT_b <- read.csv("data/single_cell_rna/clonotypes_b_posit/Rep5_ICBdT-b-filtered_contig_annotations.csv")

Unlike combineTCR, combineBCR produces a column called CTstrict of an index of nucleotide sequence and the corresponding V gene. This index automatically calculates the Levenshtein distance between sequences with the same V gene and will index sequences using a normalized Levenshtein distance with the same ID. After which, clone clusters are called using the components function. Clones that are clustered across multiple sequences will then be labeled with “Cluster” in the CTstrict header. We use the threshold parameter to set the normalized edit distance to consider, where the higher the number the more similarity of sequence will be used for clustering.

BCR.contigs <- list(Rep1_ICB_b, Rep1_ICBdT_b, Rep3_ICB_b, Rep3_ICBdT_b, Rep5_ICB_b, Rep5_ICBdT_b)

combined.BCR <- combineBCR(BCR.contigs, 
                           samples = c("Rep1_ICB", "Rep1_ICBdT", "Rep3_ICB", 
                                       "Rep3_ICBdT", "Rep5_ICB", "Rep5_ICBdT"), 
                           threshold = 0.85)

head(combined.BCR[[1]])

Visualize the Number of Clones

Let’s experiment again with visualizing the number of clones.


# view the total number of unique clones
clonalQuant(combined.BCR, 
            cloneCall="nt", 
            chain = "both", 
            scale = FALSE) +
  clonalQuant(combined.BCR, 
              cloneCall="aa", 
              chain = "both", 
              scale = FALSE) +
  clonalQuant(combined.BCR, 
              cloneCall="gene", 
              chain = "both", 
              scale = FALSE) +
  clonalQuant(combined.BCR, 
              cloneCall="strict", 
              chain = "both", 
              scale = FALSE)


# view the relative percent of unique clones scaled by the total size of the clonal repertoire
clonalQuant(combined.BCR, 
            cloneCall="strict", 
            chain = "both", 
            scale = TRUE)

# line graph with a total number of clones by the number of instances within the sample or run
# The relative distribution of clones by abundance
clonalAbundance(combined.BCR, 
                cloneCall = "gene", 
                scale = FALSE)

Next, looking at the alluvial plots for BCRs, we see a surprising amount of similar clones between Rep3, why do we think that is?

clonalCompare(combined.BCR, 
              top.clones = 25, 
              samples = c("Rep1_ICB", "Rep1_ICBdT", "Rep3_ICB", "Rep3_ICBdT", "Rep5_ICB", "Rep5_ICBdT"), 
              cloneCall="aa", 
              graph = "alluvial")

clonalCompare(combined.BCR, 
              top.clones = 10, 
              samples = c("Rep3_ICB", "Rep3_ICBdT"), 
              cloneCall="aa", 
              graph = "alluvial")

We believe that there is something wrong with how the clones are being counted. If you play around with the threshold you see that there is no change in the number of clones called.

When we look at the counts of shared BCRs between both chains vs just the alpha chain, we see that it seems that when we consider both chains it counts the NAs as matching. IS this what we really want?

clonalCompare(combined.BCR, 
              top.clones = 10, 
              samples = c("Rep3_ICB", "Rep3_ICBdT"), 
              cloneCall="aa", 
              chain = "IGH",
              graph = "alluvial") +
clonalCompare(combined.BCR, 
              top.clones = 10, 
              samples = c("Rep3_ICB", "Rep3_ICBdT"), 
              cloneCall="aa", 
              chain = "both",
              graph = "alluvial")

clonalCompare(combined.BCR, 
              top.clones = 25, 
              samples = c("Rep1_ICB", "Rep1_ICBdT", "Rep3_ICB", "Rep3_ICBdT", "Rep5_ICB", "Rep5_ICBdT"), 
              cloneCall="aa",
              chain = "IGH",
              graph = "alluvial") + NoLegend()

Adding the BCR and TCR Data to your Seurat object

We will now add the BCR and TCR information which has been handled by scRepertoire so far to our Seurat object.

Read in your Seurat object


rep135 <- readRDS("outdir_single_cell_rna/preprocessed_object.rds")

Use the combineExpression function to add the TCR data to your Seurat object. The columns CTgene, CTnt, CTaa, CTstrict, clonalProportion, clonalFrequency, and cloneSize data will be added to your Seurat object’s metadata. Notice you can also decide the bins for grouping based on proportion or frequency.

Here we group by frequency:

rep135 <- combineExpression(combined.TCR, rep135, 
                         cloneCall="aa", proportion = FALSE,
                         group.by = "sample",
                         cloneSize=c(Single=1, Small=5, Medium=20, Large=100, Hyperexpanded=500))

colnames(rep135@meta.data)

Since we want to add both BCR and TCR information we have to rename the columns to make it clear since scRepertoire uses the same column names for BCR and TCR data. If there are duplicate barcodes (if a cell has both Ig and TCR), the immune receptor information will not be added.

columns_to_modify <- c("CTgene", "CTnt", "CTaa", "CTstrict", "clonalProportion", "clonalFrequency", "cloneSize")
names(rep135@meta.data)[names(rep135@meta.data) %in% columns_to_modify] <- paste0(columns_to_modify, "_TCR")

colnames(rep135@meta.data) # make sure the column names are changed 

We can now plot the cells with TCRs and compare that to our cell type annotations. Indeed, we see a majority of TCRs in the same clusters which are called T cells!

DimPlot(rep135, group.by = c("cloneSize_TCR", "immgen_singler_main"))

Let’s repeat the same steps with the BCR data.

rep135 <- combineExpression(combined.BCR, rep135, 
                              cloneCall="aa", proportion = FALSE,
                              group.by = "sample",
                              cloneSize=c(Single=1, Small=5, Medium=20, Large=100, Hyperexpanded=500))

names(rep135@meta.data)[names(rep135@meta.data) %in% columns_to_modify] <- paste0(columns_to_modify, "_BCR")
colnames(rep135@meta.data) # make sure the column names are changed 

Now let’s visualize the TCR, BCR, and cell type annotations with a UMAP.

DimPlot(rep135, group.by = c("cloneSize_TCR", "cloneSize_BCR", "immgen_singler_main"))

BCR vs TCR vs celltype

Exercise: Create a custom bar graph with BCR and TCR information

So now we have this data, but how do we use it? How could we use the information our Seurat object already holds to analysis our BCR/TCR information further?

Say we want to create a bar plot that shows the number of clones by each cell type. So we want the x-axis to be the number of clones and the y-axis to be counts of BCR or TCRs.

Let’s start by deciding what information we need to make this graph.

colnames(rep135@meta.data)

Create a dataframe that contains just that information, mainly for ease. Here we are going to grab the cell type labels and the amino acid sequences for the CDR3s.

rep135_TCR_clones <- rep135@meta.data[, c("CTaa_TCR", "immgen_singler_main")]

head(rep135_TCR_clones)

The dataframe should look something like this, where we have some cells with a TCR and their cell type label.

                                                   CTaa_TCR immgen_singler_main
Rep1_ICBdT_AAACCTGAGCCAACAG-1  CALGAVSAGNKLTF_CASRGGAYAEQFF                 NKT
Rep1_ICBdT_AAACCTGAGCCTTGAT-1                          <NA>             B cells
Rep1_ICBdT_AAACCTGAGTACCGGA-1                          <NA>         Fibroblasts
Rep1_ICBdT_AAACCTGCACGGCCAT-1                          <NA>            NK cells
Rep1_ICBdT_AAACCTGCACGGTAAG-1 CATDGGTGSNRLTF_CASSYGQGDSDYTF             T cells
Rep1_ICBdT_AAACCTGCATGCCACG-1                          <NA>         Fibroblasts

First, we should use groupby to group our data into the categories that we want to plot.

rep135_TCR_clones %>%
  group_by(immgen_singler_main)
# A tibble: 23,185 × 2
# Groups:   immgen_singler_main [18]
   CTaa_TCR                       immgen_singler_main
   <chr>                          <chr>              
 1 CALGAVSAGNKLTF_CASRGGAYAEQFF   NKT                
 2 NA                             B cells            
 3 NA                             Fibroblasts        
 4 NA                             NK cells           
 5 CATDGGTGSNRLTF_CASSYGQGDSDYTF  T cells            
 6 NA                             Fibroblasts        
 7 CAARLGMSNYNVLYF_CASSQTGGDERLFF T cells            
 8 NA                             Neutrophils        
 9 NA                             Fibroblasts        
10 CALGAVSAGNKLTF_CASRGGAYAEQFF   NKT                
# ℹ 23,175 more rows
# ℹ Use `print(n = ...)` to see more rows

Then we want to use the summarise function to create a count of how many TCRs we see:

rep135_TCR_clones %>%
    group_by(immgen_singler_main) %>%
    summarise(Count = n())
# A tibble: 18 × 2
   immgen_singler_main Count
   <chr>               <int>
 1 B cells              3253
 2 B cells, pro            3
 3 Basophils              37
 4 DC                    295
 5 Endothelial cells      71
 6 Epithelial cells     1238
 7 Fibroblasts           589
 8 ILC                   763
 9 Macrophages           459
10 Mast cells             11
11 Monocytes             633
12 NK cells              565
13 NKT                  2249
14 Neutrophils            92
15 Stem cells              2
16 Stromal cells          18
17 T cells             12714
18 Tgd                   193

This looks good! Let’s plot it with a barplot:

TCR_celltypes_summary <- rep135_TCR_clones %>%
    group_by(immgen_singler_main) %>%
    summarise(Count = n())

ggplot(TCR_celltypes_summary, aes(x = immgen_singler_main, y = Count)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(x = "Cell Type", y = "Number of TCR Clones", title = "Number of Clones by Cell Type") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # adjust the x-axis labels so that they are not overlapping each other

Hmmmmmmmmm, this looks a little suspicious. There are a lot of B cells with TCRs… maybe our summarise function was incorrect?

Let’s check the counts when we sum just our cell types column.

table(rep135@meta.data$immgen_singler_main)

Unfortunately, those are the same numbers as our summary dataframe above. We want to be counting how many TCRs are seen per cell type not how many of each cell type we have. What do you think could be the problem?

          B cells      B cells, pro         Basophils                DC Endothelial cells  Epithelial cells       Fibroblasts               ILC       Macrophages 
             3253                 3                37               295                71              1238               589               763               459 
       Mast cells         Monocytes       Neutrophils          NK cells               NKT        Stem cells     Stromal cells           T cells               Tgd 
               11               633                92               565              2249                 2                18             12714               193 
# A tibble: 23,185 × 2
# Groups:   immgen_singler_main [18]
   CTaa_TCR                       immgen_singler_main
   <chr>                          <chr>              
 1 CALGAVSAGNKLTF_CASRGGAYAEQFF   NKT                
 2 NA                             B cells            
 3 NA                             Fibroblasts        
 4 NA                             NK cells           
 5 CATDGGTGSNRLTF_CASSYGQGDSDYTF  T cells            
 6 NA                             Fibroblasts        
 7 CAARLGMSNYNVLYF_CASSQTGGDERLFF T cells            
 8 NA                             Neutrophils        
 9 NA                             Fibroblasts        
10 CALGAVSAGNKLTF_CASRGGAYAEQFF   NKT                
# ℹ 23,175 more rows
# ℹ Use `print(n = ...)` to see more rows

It seems that those pesky NAs are being counted during our summarise command which we don’t want. If a cell has no TCR it should not be counted. Throwing in a na.omit() should do the trick!

rep135_TCR_clones %>%
    group_by(immgen_singler_main) %>% na.omit() 
# A tibble: 12,597 × 2
# Groups:   immgen_singler_main [15]
   CTaa_TCR                                  immgen_singler_main
   <chr>                                     <chr>              
 1 CALGAVSAGNKLTF_CASRGGAYAEQFF              NKT                
 2 CATDGGTGSNRLTF_CASSYGQGDSDYTF             T cells            
 3 CAARLGMSNYNVLYF_CASSQTGGDERLFF            T cells            
 4 CALGAVSAGNKLTF_CASRGGAYAEQFF              NKT                
 5 CAARLGMSNYNVLYF_CASSQTGGDERLFF            DC                 
 6 CAMREGSNNRIFF;CALSGANNNNAPRF_CASSYRGFDYTF T cells            
 7 CAAHSNYQLIW_CASSPGTGGYEQYF                NKT                
 8 CAVKNNRIFF_CASGDARGVEQYF                  ILC                
 9 CAASEGGNYKPTF_CASSRDRYAEQFF               NKT                
10 CARTNTGYQNFYF_CASSPHNSPLYF                Monocytes          
# ℹ 12,587 more rows
# ℹ Use `print(n = ...)` to see more rows

That looks much better. So all together we have a command that looks like this to get the data into the correct format for the barplot.

# Count occurrences of each cell type
cell_type_counts_TCR <- rep135_TCR_clones %>%
  group_by(immgen_singler_main) %>% na.omit() %>%
  summarise(Count = n())

Finally, we get to plot something!

# Plot
ggplot(cell_type_counts_TCR, aes(x = immgen_singler_main, y = Count)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(x = "Cell Type", y = "Number of TCR Clones", title = "Number of Clones by Cell Type") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

And it looks great!

Now we can do the same thing with the BCR data.

rep135_BCR_clones <- rep135@meta.data[, c("CTaa_BCR", "immgen_singler_main")]

# Count occurrences of each cell type
cell_type_counts_BCR <- rep135_BCR_clones %>%
  group_by(immgen_singler_main) %>% na.omit() %>%
  summarise(Count = n())

# Plot
ggplot(cell_type_counts_BCR, aes(x = immgen_singler_main, y = Count)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(x = "Cell Type", y = "Number of BCR Clones", title = "Number of Clones by Cell Type") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Bonus Exercise: Stacked bar plot of cell types by counts grouped by TCR clones

What if we wanted to see not just how many clones there are, but also the number of unique clones, and how many cells we see each unique clone in? Let’s create a stacked barplot that shows the the number of clones for each cell type, but grouped by unique clones.

We can start by using the table function again.

table(rep135_TCR_clones$CTaa_TCR)
                                  CAAAGNTEGADRLTF_CASSWTANTEVFF                                       CAAAGSNNRIFF_CASRISAETLYF 
                                                              1                                                               1 
                                    CAAAGTGGYKVVF_CASSEDRGDTQYF                                                 CAAAGYAQGLTF_NA 
                                                              1                                                               1 
                                   CAAAGYSNNRLTL_CASSQDRNQDTQYF                                   CAAAISGSFNKLTF_CTCSPDNSQNTLYF 
                                                              1                                                               1 
                               CAAAKLPGTGSNRLTF_CASSEGTGGYAEQFF                                   CAAALMNYNQGKLIF_CASSTPTGQAPLF 
                                                              8                                                               1 
                                   CAAALSNYNVLYF_CASSRTDANTEVFF                                   CAAAMDYANKMIF_CASSSTHNANTEVFF 
                                                              1                                                               1

We see that most clones are only seen once, but if we look carefully there are some clones which are seen in mutiple cells. So we are on the right track. But we want these counts grouped by cell type. Can we run table with two parameters? (You might think this is silly but I was genuinly couldn’t remember when I tried this the first time)

# Count occurrences of each CTaa_TCR
ctaa_immgen_table <- table(rep135_TCR_clones$CTaa_TCR, rep135_TCR_clones$immgen_singler_main)

head(ctaa_immgen_table)
                                                                     Mast cells Monocytes Neutrophils NK cells NKT Stem cells Stromal cells T cells Tgd
  CAAADTEGADRLTF_CASSRTGGVEQYF                                                0         0           0        0   0          0             0       1   0
  CAAAESNYQLIW_CASSLASQYEQYF                                                  0         0           0        0   0          0             0       1   0
  CAAAFNSGGSNAKLTF_CASSQDGGAYEQYF                                             0         0           0        0   0          0             0       1   0
  CAAAGGYGNEKITF_CASSPRDWGVNQDTQYF                                            0         0           0        0   0          0             0       0   0
  CAAAGNTEGADRLTF_CASSWTANTEVFF                                               0         0           0        0   0          0             0       1   0
  CAAAGSNNRIFF_CASRISAETLYF                                                   0         0           0        0   0          0             0       1   0
  CAAAGTGGYKVVF_CASSEDRGDTQYF                                                 0         0           0        0   0          0             0       1   0
  CAAAGYAQGLTF_NA                                                             0         0           0        0   0          0             0       1   0
  CAAAGYSNNRLTL_CASSQDRNQDTQYF                                                0         0           0        0   0          0             0       1   0
  CAAAISGSFNKLTF_CTCSPDNSQNTLYF                                               0         0           0        0   1          0             0       0   0
  CAAAKLPGTGSNRLTF_CASSEGTGGYAEQFF                                            0         0           0        0   0          0             0       8   0

Well no way, that is exactly the summary we want. We still have a problem where we have a little too much information to be graphed in a way that is useful. How about we get rid of any row where there is only 1 clone seen.

# Filter out rows with all counts less than or equal to 1
ctaa_immgen_table_filtered <- ctaa_immgen_table[rowSums(ctaa_immgen_table > 1) > 0, ]
                                                                  Mast cells Monocytes Neutrophils NK cells NKT Stem cells Stromal cells T cells Tgd
  CAAAKLPGTGSNRLTF_CASSEGTGGYAEQFF                                         0         0           0        0   0          0             0       8   0
  CAAAYNQGKLIF_CASSLTGWGEQFF                                               0         0           0        0   0          0             0       3   0
  CAAEAADSGTYQRF_CASSLGQGSYEQYF                                            0         0           0        0   0          0             0       2   0
  CAAEAKGSALGRLHF_CASSDASGGAHEQYF                                          0         0           0        0   0          0             0       3   0
  CAAEENSNNRIFF_CASSLNWGYAEQFF                                             0         0           0        0   1          0             0       2   0
  CAAGANTNKVVF_CASKQGWQNTLYF;CASSDRGAHEQYF                                 0         0           0        0   4          0             0       4   0
  CAAGGSNAKLTF_CASSPRLGGGAETLYF                                            0         0           0        0   0          0             0       3   0
  CAAGWTGSKLSF_CASRYRENTLYF;CASRRTGNSPLYF                                  0         0           0        0   2          0             0       0   0
  CAAHSNYQLIW_CASSPGTGGYEQYF                                               0         0           0        0   6          0             0       0   0

This is a little bit more manageable. Let’s make this a dataframe and clean it up to get ready to plot.

# Convert the filtered contingency table to a dataframe
ctaa_immgen_df <- as.data.frame(ctaa_immgen_table_filtered)

# Rename columns
colnames(ctaa_immgen_df) <- c("CTaa_TCR", "immgen_singler_main", "Count")
# Plot stacked bar plot
ggplot(ctaa_immgen_df, aes(x = immgen_singler_main, y = Count, fill = CTaa_TCR)) +
  geom_bar(stat = "identity") +
  labs(x = "immgen_singler_main", y = "Count", title = "Stacked Bar Plot of CTaa_TCR by immgen_singler_main") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Hmmm that legend makes the graph a little hard to see.

ggplot(ctaa_immgen_df, aes(x = immgen_singler_main, y = Count, fill = CTaa_TCR)) +
  geom_bar(stat = "identity") +
  labs(x = "immgen_singler_main", y = "Count", title = "Stacked Bar Plot of CTaa_TCR by immgen_singler_main") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")

Now that looks better. Except there are so many clones that the bar looks more like a gradient then a stacked bar plot… Maybe we beed to be a little more strict about what we show.

# Plot stacked bar plot -- what if we limit to only cells with a clone seen at least 5 times
ggplot(ctaa_immgen_df[ctaa_immgen_df$Count > 5, ], aes(x = immgen_singler_main, y = Count, fill = CTaa_TCR)) +
  geom_bar(stat = "identity") +
  labs(x = "immgen_singler_main", y = "Count", title = "Stacked Bar Plot of CTaa_TCR by immgen_singler_main") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")  # Remove the legend for CTaa_TCR

This gets rid a majority of cell types and only slightly clears up the different clones seen. Maybe we can try putting a border around the boxes:

ggplot(ctaa_immgen_df, aes(x = immgen_singler_main, y = Count, fill = CTaa_TCR)) +
  geom_bar(stat = "identity", color = "black", linewidth = 0.05) +  # Add border with black color and linewidth 0.5
  labs(x = "immgen_singler_main", y = "Count", title = "Stacked Bar Plot of CTaa_TCR by immgen_singler_main") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")  # Hide the legend

Data wrangling and visualization take a lot of finagling to get just right, especially as your dataset grows and your information becomes more complex. It usually takes some time to figure out how to visualize your data and then playing with that visualization to get it just right.

Resources

scRepertoire

CellRanger VDJ Annotations

Differentiation/Trajectory analysis | Griffith Lab

RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

Differentiation/Trajectory analysis

Trajectory Analysis Using CytoTRACE

Single-cell sequencing gives us a snapshot of what a population of cells is doing. This means we should see many different cells in many different phases of a cell’s lifecycle. We use trajectory analysis to place our cells on a continuous ‘timeline’ based on expression data. The timeline does not have to mean that the cells are ordered from oldest to youngest (although many analysis uses trajectory to quantify developmental time). Generally, tools will create this timeline by finding paths through cellular space that minimize the transcriptional changes between neighboring cells. So for every cell, an algorithm asks the question: what cell or cells is/are most similar to the current cell we are looking at? Unlike clustering, which aims to group cells by what type they are, trajectory analysis aims to order the continuous changes associated with the cell process.

The metric we use for assigning positions is called pseudotime. Pseudotime is an abstract unit of progress through a dynamic process. When we base our trajectory analysis on the transcriptomic profile of a cell, less mature cells are assigned smaller pseudotimes, and more mature cells are assigned larger pseudotimes.

Performing trajectory analysis on epithelial cells

Earlier we confirmed that our epithelial cell population corresponded to our tumor population. Then, through differential expression analysis, we saw that the epithelial cells form two distinct clusters that we identified as luminal and basal cells. We can further confirm this conclusion by using trajectory analysis to assign pseudotime values to the epithelial cells. We expect see that basal cells are less differentiated than luminal cells.

epcam Clusters

Subsetting epithelial cells

First, let’s load our libraries and our preprocessed object.

library(Seurat)
library(CytoTRACE)
library(monocle3)
library(ggplot2)
library(readr)


rep135 <- readRDS('outdir_single_cell_rna/preprocessed_object.rds')

Let’s see what clusters our epithelial cells are located in.

DimPlot(rep135, group.by = 'immgen_singler_main', label = TRUE) + 
  DimPlot(rep135, group.by = 'seurat_clusters_res0.8', label = TRUE) 

We can specifically highlight these cells for further clarity.

Epithelial_cells = rep135$immgen_singler_main =="Epithelial cells"
highlighted_cells <- WhichCells(rep135, expression = immgen_singler_main =="Epithelial cells")
DimPlot(rep135, reduction = 'umap', group.by = 'orig.ident', cells.highlight = highlighted_cells)

We already know that these two clusters can be separated into basal and luminal cells. We can see the distinction between these two types using markers. For example lets plot basal cell markers:

FeaturePlot(object = rep135, features = c("Cd44", "Krt14", "Krt5", "Krt16", "Krt6a"))

To more comprehensively see where the basal and luminal cells are, we can create a cell type score by averaging the expression of a group of genes and ploting the score.

cell_type_Basal_marker_gene_list <- list(c("Cd44", "Krt14", "Krt5", "Krt16", "Krt6a"))
rep135 <- AddModuleScore(object = rep135, features = cell_type_Basal_marker_gene_list, name = "cell_type_Basal_score") 
FeaturePlot(object = rep135, features = "cell_type_Basal_score1")

cell_type_Luminal_marker_gene_list <- list(c("Cd24a", "Erbb2", "Erbb3", "Foxa1", "Gata3", "Gpx2", "Krt18", "Krt19", "Krt7", "Krt8", "Upk1a"))
rep135 <- AddModuleScore(object = rep135, features = cell_type_Luminal_marker_gene_list, name = "cell_type_Luminal_score")
FeaturePlot(object = rep135, features = "cell_type_Luminal_score1")

For ease and clarity, let’s subset our Seurat object to just the epithelial cell clusters.

### Subsetting dataset epithelial
rep135 <- SetIdent(rep135, value = 'seurat_clusters_res0.8')
rep135_epithelial <- subset(rep135, idents = c('9', '12')) # 1750

#confirm that we have subset the object as expected visually using a UMAP
DimPlot(rep135, group.by = 'seurat_clusters_res0.8', label = TRUE) + 
  DimPlot(rep135_epithelial, group.by = 'seurat_clusters_res0.8', label = TRUE)

#confirm that we have subset the object as expected by looking at the individual cell counts
table(rep135$seurat_clusters_res0.8)
table(rep135_epithelial$seurat_clusters_res0.8)

Run CytoTRACE

Now let’s run CytoTRACE. CytoTRACE (Cellular (Cyto) Trajectory Reconstruction Analysis using gene counts and Expression) is a computational method that predicts the differentiation state of cells from single-cell RNA-sequencing data. CytoTRACE uses gene count signatures (GCS), or the correlation between gene count and gene expression levels to capture differentiation states.

First, we have to export the counts.

rep135_mtx <- GetAssayData(rep135_epithelial, layer = "counts")
write.csv(rep135_mtx, "outdir/rep135_epithelial_counts.csv")

Then export the file from posit. In the file window select rep135_epithelial_counts.csv. Then go to More -> Export… and click Download.

Now we go to https://cytotrace.stanford.edu/. We will navigate to the Run CytoTRACE tab on the left menu bar and upload our downloaded csv in the Upload gene expression table. We will not worry about uploading any other files as of now but if we had a larger dataset we could provide cell type and batch information for our cells.

When uploaded click Run CytoTRACE. This may take a few minutes.

Uploading to Cytotrace

We can then spend some time exploring CytoTRACE scores. For CytoTRACE, warmer colors mean less differentiation, and cooler colors mean more differentiated. We can use the Gene radio button to plot the expression of different marker genes for Basal and Luminal cells.

Plotting Krt5 compared to CytoTRACE scores

Plotting Cd24a compared to CytoTRACE scores

Once we have verified that the CytoTRACE scores are assigned in a way that corresponds with our biological knowledge we can click the Download CytoTRACE results button at the top left of the page.

Import CytoTRACE scores

We now want to add our CytoTRACE scores to our Seurat object. So we navigate to the Upload button and select our exported CytoTRACE scores from where the file was downloaded. We then read it into R and add the scores to our Seurat object as another column in the meta.data.


cytotrace_scores <- read.table("outdir/CytoTRACE_results.txt", sep="\t") # read in the cytotrace scores

rownames(cytotrace_scores) <- sub("\\.", "-", rownames(cytotrace_scores)) # the barcodes export with a `.` instead of a '-' at the end of the barcode so we have to remedy that before joining the cytotrace scores unto our seurat object

# Add CytoTRACE scores matching on the cell barcodes
rep135_epithelial <- AddMetaData(rep135_epithelial, cytotrace_scores %>% select("CytoTRACE"))

Now we can plot our basal cell markers, our luminal cell markers, and the CytoTRACE scores together to compare. Since it is a little unintuitive that less differentiated scores are closer to 1 we will also create a differentiation_score which will be an inverse of our CytoTRACE scores so that smaller scores mean less differentiated and larger scores mean more differentiated.

rep135_epithelial[['differentiation_scores']] <- 1 - rep135_epithelial[['CytoTRACE']] # Let's also reverse out CytoTRACE scores so that high means more differentiated and low means less differentiated

FeaturePlot(object = rep135_epithelial, features = c("cell_type_Basal_score1", "cell_type_Luminal_score1", "cytotrace_scores", "differentiation_scores"))

Subsetting to just luminal cells

The most important part of trajectory analysis is to make sure you have some biological reasoning to back up the pseudotime values. The best practice for pseudotime means using it to support a biological pattern that has already been observed by some other method.

Let’s separate our luminal cells from the basal cells and perform trajectory analysis.

## Subset to just luminal cells
DimPlot(rep135_epithelial) # cluster 10 is our luminal cells
rep135_luminal <- subset(rep135_epithelial, idents = c('9')) # 863 cells

We can also use the CytoTRACE R package to calculate our CytoTRACE scores.

# Creating a dataframe to pass to CytoTRACE
rep135_luminal_expression <- data.frame(GetAssayData(object = rep135_luminal, layer = "data"))

rep135_luminal_cytotrace_scores <- CytoTRACE(rep135_luminal_expression, ncores = 1)

rep135_luminal_cytotrace_transposed <- as.data.frame(rep135_luminal_cytotrace_scores$CytoTRACE) %>% rename("cytotrace_scores" = "rep135_luminal_cytotrace_scores$CytoTRACE")
head(rep135_luminal_cytotrace_transposed)

# fix the barcode formatting 
rownames(rep135_luminal_cytotrace_transposed) <- sub("\\.", "-", rownames(rep135_luminal_cytotrace_transposed))
rownames(rep135_luminal_cytotrace_transposed)

We then can add those CytoTRACE scores to our luminal cell object and visualize them.

rep135_luminal <- AddMetaData(rep135_luminal, rep135_luminal_cytotrace_transposed)

rep135_luminal[['differentiation_scores_luminal']] <- 1 - rep135_luminal[['CytoTRACE']]
rep135_luminal[['differentiation_scores_epithelial']] <- 1 - rep135_luminal[['differentiation_scores']]

# compare all epithelial cells CytoTRACE scores to the luminal-only CytoTRACE
(FeaturePlot(object = rep135_luminal, features = c("differentiation_scores_epithelial")) +
    ggtitle("Epithelial Cells CytoTRACE Scores")) +
  (FeaturePlot(object = rep135_luminal, features = c("differentiation_scores_luminal")) +
      ggtitle("Luminal Cells CytoTRACE Scores"))

CytoTRACE will force all given cells onto the same scale meaning that there has to be cells at both the low and high ends of differentiation. The CytoTRACE scores could be a reflection of cell cycling genes. We can check by using a feature plot to compare the S-phase genes, G2/M-phase genes, and differentiation scores.

# View the S-phase genes, G2/M-phase genes, and the Phase to see if that explains the differentiation score
FeaturePlot(object = rep135_luminal, features = c("S.Score", "G2M.Score")) + 
   DimPlot(rep135_luminal, group.by = "Phase")
 
FeaturePlot(object = rep135_luminal, features = c("S.Score", "G2M.Score", "differentiation_scores"))

We still don’t see a clear pattern, which illustrates the challenge and the danger of pseudotime.

Exercise: Create a subset of the Seurat object. You could explore the differences between the T-Cells populations, stem cells vs epithelial cells, or choose your own subset. Then run CytoTRACE on the subsetted dataset either using the webtool or the R package. Do the pseudotime scores make sense? Are there biological factors that support the CytoTRACE calculated trajectory?

Note: CytoTRACE will crash in the posit environment if you give it too many cells, so if there are several cell populations that you want to compare you can use the subset function to downsample your cell types. Make sure your Idents are set to the category you would like to subset too!

merged_subset <- subset(x = merged, downsample = 100)

Comparing CytoTRACE to Monocle3

Now that we have convinced ourselves that we somewhat trust the results of CytoTRACE, we can try the algorithms on the entire dataset and compare it to another trajectory analysis method. Another popular method used is Monocle3. Monocle3 is an analysis toolkit for scRNA and has many of the functions that Seurat has. We can use Monocle3’s trajectory algorithm but since it uses its own unique data structure, we will have to convert our subsetted object to a cell data set object. Luckily, there are tools that make that conversion relatively easy.

You can also refer to the full Monocle3 trajectory tutorial.

Before we start just running our data through the algorithm and seeing what we get, we should consider what we expect to get. Let’s remember what cell types we have in our dataset and where they are on our UMAP. What pseudotime scores do you expect to be assigned to the clusters?

DimPlot(rep135, group.by = 'immgen_singler_main', label = TRUE)

Overview of haematopoiesis

Haematopoiesis and red blood cells

Running Monocle3

Let’s now run Monocle3, again we have to convert our Seurat object to a Monocle ‘Cell Data Set’. We will use a package made for this specific purpose.

rep135_cds <- SeuratWrappers::as.cell_data_set(rep135)

Then we run the Monocle function cluster_cells. This function will redo unsupervised clusters and calculate partitions which are groups of cells that Monocle3 puts on separate trajectories. We don’t need the cells to be clustered since we already did that in Seurat but Monocle requires that partitions be calculated for its trajectory functions.

rep135_cds <- cluster_cells(rep135_cds)

Use the Monolce3 plotting functions to visualize partitions. Then we will execute the function learn_graph which will build the trajectory. We will set the use_partition parameter to FALSE so that we learn a trajectory across all clusters. Later you can come back and try setting it to TRUE and see what happens!

plot_cells(rep135_cds, show_trajectory_graph = FALSE, color_cells_by = "partition")

# Monocle will create a trajectory for each partition, but we want all our clusters
# to be on the same trajectory so we will set `use_partition` to FALSE when 
# we learn_graph

rep135_cds <- learn_graph(rep135_cds, use_partition = FALSE) # graph learned across all partitions

Monocle3 requires you to choose a starting point or root for the calculated trajectories. Running the function order_cells will open a pop-up window where you can interactively choose where you want your roots to be.

rep135_cds <- order_cells(rep135_cds)
# Pick a root or multiple roots

Plot the pseudotime:

plot_cells(rep135_cds, color_cells_by = "pseudotime", label_branch_points = FALSE, label_leaves =  FALSE, cell_size = 1)

When we choose a root around where the stem cells are located we see that the fibroblast and epithelial cells end with the higher pseudotime scores.

Monocle Pseudotime with Stem Cells as root

But if we choose are roots to be stem cells, fibroblasts, and epithelial cells, we see that monocle changes the pseudotime orderings accordingly.

Monocle Pseudotime with multiple roots

Loading in the CytoTRACE scores

We need a significant amount of computational power to run CytoTRACE on all cells so we have run CytoTRACE on a computing cluster and have saved the results to be loaded in and added to our seurat object. Of course, we have to make sure the data is formatted correctly.

# Cytotrace for all cells

# read in the cytotrace scores
rep135_cytotrace <- read.table("outdir_single_cell_rna/rep135_cytotrace_scores.tsv", sep="\t") 
head(rep135_cytotrace)

# rename the column that stores the scores to be more clearly accessed
rep135_cytotrace_transposed <- t(rep135_cytotrace)
colnames(rep135_cytotrace_transposed) <- "CytoTRACE"
head(rep135_cytotrace_transposed)

# fix the barcode formatting, our seurat 
rownames(rep135_cytotrace_transposed) <- sub("\\.", "-", rownames(rep135_cytotrace_transposed))
rownames(rep135_cytotrace_transposed)

Now we can add the scores to our seurat object and also create an inverse CytoTRACE score for clarity. So our differentiation score will be 0 for least differentiated (smallest pseudotime) and 1 being most differentiated (biggest pseudotime).

# Add CytoTRACE scores matching on the cell barcodes
rep135 <- AddMetaData(rep135, rep135_cytotrace_transposed)
rep135[["differentiation_score"]] <- 1 - rep135[["cytotrace_scores"]]

Finally, let’s compare the pseudotime values to our cell types. Do we get the results we want to get? Does Monocle and CytoTRACE agree with each other? What happens if you choose different roots for the Monocle pseudotime?

plot_cells(rep135_cds, color_cells_by = "pseudotime", label_branch_points = FALSE, label_leaves =  FALSE, cell_size = 1) + 
  (FeaturePlot(rep135, features = 'differentiation_score') + scale_color_viridis(option = 'magma', discrete = FALSE)) +
DimPlot(rep135, group.by = 'immgen_singler_main', label = TRUE)

Monocle pseudotime compared with CytoTRACE pseudtime

Exercise: Using Trajectory to Analyze Monocyte Differentiation

For a final exercise, we can apply the same steps as above to analyze another group of cells: macrophages and monocytes.

highlight = rep135$immgen_singler_main =="Macrophages"
highlighted_cells <- WhichCells(rep135, expression = immgen_singler_main =="Macrophages")
# Plot the UMAP
DimPlot(rep135, reduction = 'umap', group.by = 'orig.ident', cells.highlight = highlighted_cells)


highlight = rep135$immgen_singler_main =="Monocytes"
highlighted_cells <- WhichCells(rep135, expression = immgen_singler_main =="Monocytes")
# Plot the UMAP
DimPlot(rep135, reduction = 'umap', group.by = 'orig.ident', cells.highlight = highlighted_cells)


# grab all cells that are macrophages and monocytes
Idents(rep135) <- "immgen_singler_main" 
rep135_macro_mono_cells <- subset(rep135, idents = c("Macrophages", "Monocytes"), invert = FALSE) # 1092

DimPlot(rep135_macro_mono_cells, group.by = 'seurat_clusters_res0.8', label = TRUE) + 
  DimPlot(rep135_macro_mono_cells, group.by = 'immgen_singler_main', label = TRUE) 

# grab all cells that are macrophages and monocytes, we can subset by clusters 6 and 14 which seem to contain 
Idents(rep135) <- "seurat_clusters_res0.8" 
rep135_macro_mono_cells <- subset(rep135, idents = c(6, 14), invert = FALSE) # 1350
DimPlot(rep135_macro_mono_cells, group.by = 'seurat_clusters', label = TRUE) + 
  DimPlot(rep135_macro_mono_cells, group.by = 'immgen_singler_main', label = TRUE) 

DimPlot(rep135_macro_mono_cells, group.by = 'immgen_singler_fine', label = TRUE)
# create a data frame with the counts from our subsetted obect
rep135_macro_mono_cells_expression <- data.frame(GetAssayData(rep135_macro_mono_cells, layer = "data")) 

# pass that dataframe to the CytoTRACE function
rep135_macro_mono_cells_cytotrace_scores <- CytoTRACE(rep135_macro_mono_cells_expression, ncores = 4)

# Create a dataframe out of the CytoTRACE scores
rep135_macro_mono_cells_cytotrace_scores_df <- as.data.frame(rep135_macro_mono_cells_cytotrace_scores$CytoTRACE)

# Make the rownames of the cytotraace scores function the cell barcodes and rename the CytoTRACE scores column approproately
rownames(rep135_macro_mono_cells_cytotrace_scores_df) <- sub("\\.", "-", rownames(rep135_macro_mono_cells_cytotrace_scores_df))

Now we will incorporate our CytoTRACE scores into our Seurat object.

# Add CytoTRACE scores matching on the cell barcodes
rep135_macro_mono_cells <- AddMetaData(rep135_macro_mono_cells, rep135_macro_mono_cells_cytotrace_scores_df)

CytoTRACE assignes the least differentiated cells a score of 1 and the most differentiated cells a score of 0, which is sometimes not inutive. So lets create a inverse CytoTRACE score which we will call our differentiation score.

rep135_macro_mono_cells[['differentiation_scores']] <- 1 - rep135_macro_mono_cells[['CytoTRACE']]

# Plot the results
DimPlot(rep135_macro_mono_cells, group.by = 'seurat_clusters', label = TRUE) + 
  DimPlot(rep135_macro_mono_cells, group.by = 'immgen_singler_main', label = TRUE) +
  FeaturePlot(rep135_macro_mono_cells, features = 'differentiation_scores')

Running Monocole

Let’s compare our CytoTRACE scores to Monocle3’s trajectory calculations.

cds <- SeuratWrappers::as.cell_data_set(rep135_macro_mono_cells)

cds <- cluster_cells(cds)

# View our clusters
plot_cells(cds, show_trajectory_graph = FALSE, color_cells_by = "partition")

cds <- learn_graph(cds, use_partition = FALSE) # graph learned across all partitions

cds <- order_cells(cds) # choose your root(s)
plot_cells(cds, color_cells_by = "pseudotime", label_branch_points = FALSE, label_leaves =  FALSE, cell_size = 1)
plot_cells(cds, color_cells_by = "pseudotime", label_branch_points = FALSE, label_leaves =  FALSE, cell_size = 1) + 
  (FeaturePlot(rep135_macro_mono_cells, features = 'CytoTRACE') + scale_color_viridis(option = 'magma', discrete = FALSE)) +
  (FeaturePlot(rep135_macro_mono_cells, features = 'differentiation_score') + scale_color_viridis(option = 'magma', discrete = FALSE)) +
  DimPlot(rep135_macro_mono_cells, group.by = 'immgen_singler_main', label = TRUE)

Macrophages differentiation

Selecting a root in Monocle3

Monocle3 proposes a method to choose a root more precisely in its trajecotry tutorial. It requires you to do the preprocessing steps specific to Monocle3.

rep135_cds <- SeuratWrappers::as.cell_data_set(rep135)

# Monocle preprocessing steps
rep135_cds <- preprocess_cds(rep135_cds, num_dim = 50, preprocess_method = 'PCA')
# rep135_cds <- align_cds(rep135_cds, alignment_group = "orig.ident") # removes batch effect by fitting its a linear model to the cells
rep135_cds <- reduce_dimension(rep135_cds, preprocess_method = 'PCA', reduction_method="UMAP") # calculate UMAPs

rep135_cds <- cluster_cells(rep135_cds)

plot_cells(rep135_cds, show_trajectory_graph = FALSE, color_cells_by = "immgen_singler_main")

# monocle will create a trajectory for each partition, but we want all our clusters
# to be on the same trajectory so we will set `use_partition` to FALSE when 
# we learn_graph

rep135_cds <- learn_graph(rep135_cds, use_partition = FALSE) # graph learned across all partitions

# rep135_cds <- order_cells(rep135_cds)
# Pick a root or multiple roots -- stem cell? or stem cell, fibroblasts, epithelial cells

# a helper function to identify the root principal points:
cell_ids <- which(colData(rep135_cds)[, "seurat_clusters_res0.8"] == 5)

closest_vertex <-
  rep135_cds@principal_graph_aux[["UMAP"]]$pr_graph_cell_proj_closest_vertex

closest_vertex <- as.matrix(closest_vertex[colnames(rep135_cds), ])

root_pr_nodes <-
    igraph::V(principal_graph(rep135_cds)[["UMAP"]])$name[as.numeric(names
                                                              (which.max(table(closest_vertex[cell_ids,]))))]
root_pr_nodes

rep135_cds <- order_cells(rep135_cds, root_pr_nodes=root_pr_nodes)

plot_cells(rep135_cds, color_cells_by = "pseudotime", label_branch_points = FALSE, label_leaves =  FALSE, cell_size = 1)

Further Resources

R Tutorial

Python Tutorial

Sanbomics

Does Monocole Use Clusters to Calculate Pseudotime

Using_Monocle_For_Pseudotime_Trajectory