Welcome to the blog

Posts

My thoughts and ideas

Differential expression analysis | Griffith Lab

RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

Differential expression analysis


In this section we will use the previously generated Seurat object that has gone through the various preprocessing steps, clustering, and celltyping, and use it for gene expression and differential expression analyses. We will carry out two differential expression analyses in this module. Firstly, since we know that the tumor cells should be epithelial cells, we will begin by trying to identify epithelial cells in our data using expression of Epcam as a marker. Subsequently, we will carry out a DE analysis within the Epcam-positive population(s). Secondly, we will compare the T cell populations of mice treated with ICB therapy against mice with their T cells depleted that underwent ICB therapy to determine differences in T cell phenotypes.


Read-in the saved seurat object from the previous step if it is not already loaded in your current R session.

library(Seurat)
library(dplyr)
library(EnhancedVolcano)
library(presto)
merged <- readRDS('outdir_single_cell_rna/preprocessed_object.rds')

#alternatively we have a seurat object from the last step saved here if you need it.
merged <- readRDS('/cloud/project/data/single_cell_rna/backup_files/preprocessed_object.rds')

Gene expression analysis for epithelial cells

Use Seurat’s FeaturePlot function to color each cell by its Epcam expression on a UMAP. FeaturePlot requires at least 2 arguments- the seurat object, and the ‘feature’ you want to plot (where a ‘feature’ can be a gene, PC scores, any of the metadata columns, etc.). To customize the FeaturePlot, please refer to Seurat’s documentation here

FeaturePlot(merged, features = 'Epcam')

While there are some Epcam positive cells scattered on the UMAP, there appear to be 2 distinct groups of cells at the top of the UMAP that we may be able to pull apart as potentially being malignant cells.

There are a few different ways to go about identifying what those clusters are. We can start by trying to use the DimPlot plotting function from before along with the FeaturePlot function. Separating plots by the + symbol allows us to plot multiple plots side-by-side.

DimPlot(merged, group.by = 'seurat_clusters_res0.8', label = TRUE) + 
FeaturePlot(merged, features = 'Epcam') + 
DimPlot(merged, group.by = 'immgen_singler_main', label = TRUE)

While the plots generated by the above commands make it pretty clear that the clusters of interest are clusters 9 and 12, sometimes it is trickier to determine which cluster we are interested in solely from the UMAP as the clusters may be overlapping. In this case, a violin plot VlnPlot may be more helpful. Similar to FeaturePlot, VlnPlot also takes the Seurat object and features as input. It also requires a group.by argument that determines the x-axis groupings of the cells.
To learn more about customizing a Violin plot, please refer to the Seurat documentation

VlnPlot(merged, group.by = 'seurat_clusters_res0.8', features = 'Epcam')

Great! Looks like we can confirm that clusters 9 and 12 have the highest expression of Epcam. However, it is interesting that they are split into 2 clusters. This is a good place to use differential expression analysis to determine how these clusters differ from each other.

Differential expression for epithelial cells

We can begin by restricting the Seurat object to the cells we are interested in. We will do so using Seurat’s subset function, that allows us to create an object that is filtered to any values of interest in the metadata column. However, we first need to set the default identity of the Seurat object to the metadata column we want to use for the ‘subset’, and we can do so using the SetIdent function. After subsetting the object, we can plot the original object and subsetted object side-by-side to ensure the subsetting happened as expected. We can also count the number of cells of each type to confirm the same.

#set ident to seurat clusters metadata column and subset object to Epcam positive clusters
merged <- SetIdent(merged, value = 'seurat_clusters_res0.8')
merged_epithelial <- subset(merged, idents = c('9', '12'))

#confirm that we have subset the object as expected visually using a UMAP
DimPlot(merged, group.by = 'seurat_clusters_res0.8', label = TRUE) + 
DimPlot(merged_epithelial, group.by = 'seurat_clusters_res0.8', label = TRUE)

#confirm that we have subset the object as expected by looking at the individual cell counts
table(merged$seurat_clusters_res0.8)
table(merged_epithelial$seurat_clusters_res0.8)

Now we will use Seurat’s FindMarkers function to carry out a differential expression analysis between both groups. FindMarkers also requires that we use SetIdent to change the default ‘Ident’ to the metadata column we want to use for our comparison. More information about FindMarkers is available here.

Note that here we use FindMarkers to compare clusters 9 and 12. The default syntax of FindMarkers requires that we provide each group of cells as ident.1 and ident.2. The output of FindMarkers is a table with each gene that is differentially expressed and its corresponding log2FC. The direction of the log2FC is of ident.1 with respect to ident.2. Therefore, genes upregulated in ident.1 have positive log2FC, while those downregulated in ident.1 have negative log2FC. Here, we also provide a min.pct=0.25 argument so that we only test genes that are expressed in 25% of cells in either of the ident.1 or ident.2 groups. This can help reduce false positives as the genes must be expressed in a greater proportion of the cells compared to the default value of 1%. We also specify the logfc.threshold=0.1 parameter, which ensures our results only include genes that have a fold change of less than -0.1 or more than 0.1. Increasing the min.pct and logfc.threshold parameters can also result in the function running faster as they reduce the number of genes being tested.

#carry out DE analysis between both groups
merged_epithelial <- SetIdent(merged_epithelial, value = "seurat_clusters_res0.8")
epithelial_de <- FindMarkers(merged_epithelial, ident.1 = "9", ident.2 = "12", min.pct=0.25, logfc.threshold=0.1) #how cluster 9 changes wrt cluster 12

On opening epithelial_de in your RStudio session, you’ll see that it is a dataframe with the genes as rownames, and the following columns- p_val, avg_log2FC, pct.1, pct.2, p_val_adj. The p-values are dependent on the test used while running FindMarkers, and the adjusted p-value is based on the bonferroni correction test. pct.1 and pct.2 are the percentages of cells where the gene is detected in the ident.1 and ident.2 groups respectively.

Next we can subset this dataframe to only include DE genes that have a significant p-value, and then further subset the ‘significant DE genes only’ dataframe to the top 20 genes with the highest absolute log2FC. Looking at the absolute log2FC allows us to capture both, upregulated and downregulated genes.

#restrict differentially expressed genes to those with an adjusted p-value less than 0.001 
epithelial_de_sig <- epithelial_de[epithelial_de$p_val_adj < 0.001,] 

#get the top 20 genes by fold change
epithelial_de_sig %>%
  top_n(n = 20, wt = abs(avg_log2FC)) -> epithelial_de_sig_top20

epithelial_de_sig_top20 is a dataframe that is restricted to the top20 most differentially expressed genes by log2FC.

There are a few different ways we can visualize the differentially expressed genes. We’ll start with the Violin and Feature plots from before. We can also visualize DEs using a DotPlot that allows us to capture both the average expression of a gene and the % of cells expressing it. In addition to these in-built Seurat functions, we can also generate a volcano plot using the EnhancedVolcano package. For the volcano plot, we can use the unfiltered DE results as the function colors and labels genes based on parameters (pCutoff, FCcutoff) we specify.

#get list of top 20 DE genes for ease
epithelial_de_sig_top20_genes <- rownames(epithelial_de_sig_top20)

#plot all 20 genes in violin plots
VlnPlot(merged_epithelial, features = epithelial_de_sig_top20_genes, 
  group.by = 'seurat_clusters_res0.8', ncol = 5, pt.size = 0)

#plot all 20 genes in UMAP plots
FeaturePlot(merged_epithelial, features = epithelial_de_sig_top20_genes, ncol = 5)

#plot all 20 genes in a DotPlot
DotPlot(merged_epithelial, features = epithelial_de_sig_top20_genes, 
  group.by = 'seurat_clusters_res0.8') + RotatedAxis()

#plot all differentially expressed genes in a volcano plot
EnhancedVolcano(epithelial_de,
  lab = rownames(epithelial_de),
  x = 'avg_log2FC',
  y = 'p_val_adj',
  title = 'Cluster9 wrt Cluster 12',
  pCutoff = 0.05,
  FCcutoff = 0.5,
  pointSize = 3.0,
  labSize = 5.0,
  colAlpha = 0.3)

To find out how we can figure out what these genes mean, stay tuned! The next module on pathway analysis will help shed some light on that. For now, let’s create a TSV file containing our DE results for use later on. We will need to rerun FindMarkers with slightly different parameters for this- we will change the logfc.threshold parameter to 0, as one of the pathway analysis tools requires all genes to be included in the analysis (more on that later).

#rerun FindMarkers
epithelial_de_gsea <- FindMarkers(merged_epithelial, ident.1 = "9", 
  ident.2 = "12", min.pct=0.25, logfc.threshold=0)
#save this table as a TSV file
write.table(x = epithelial_de_gsea, 
  file = 'outdir_single_cell_rna/epithelial_de_gsea.tsv', sep='\t')

Differential expression for T cells

For the T cell focused analysis, we will ask how T cells from mice treated with ICB compare against T cells from mice with (some of) their T cells depleted treated with ICB (ICBdT). We will start by subsetting our merged object to only have T cells, by combining the various T cell annotations from celltyping section. We’ll start by seeing all the possible celltypes we have, and picking the ones that are related to T cells. Next, we will SetIdent to the celltype metadata column, and subset to the celltypes that correspond to T cells. Finally, we’ll doublecheck that the subsetting happened as we expected it to.

#check all the annotated celltypes
unique(merged$immgen_singler_main)

#pick the ones that are related to T cells
t_celltypes_names <- c('T cells', 'NKT', 'Tgd')
merged <- SetIdent(merged, value = 'immgen_singler_main')
merged_tcells <- subset(merged, idents = t_celltypes_names)

#confirm that we have subset the object as expected visually using a UMAP
DimPlot(merged, group.by = 'immgen_singler_main', label = TRUE) + 
DimPlot(merged_tcells, group.by = 'immgen_singler_main', label = TRUE)

#confirm that we have subset the object as expected by looking at the individual cell counts
table(merged$immgen_singler_main)
table(merged_tcells$immgen_singler_main)

Now we want to compare T cells from mice treated with ICB vs ICBdT. First, we need to distinguish the ICB and ICBdT cells from each other. Start by clicking on the object in RStudio and expand meta.data to get a snapshot of the columns and the what kind of data they hold.

Looks like orig.ident has information about the condition and replicates, but for the purposes of this DE analysis, we want a meta.data column that combines the replicates of each condition. So, we want to combine the replicates of each condition together into a single category.

#we'll start by checking the possible names each replicate has.
unique(merged_tcells$orig.ident)

#there are 6 possible values, 3 replicates for the ICB treatment condition, and 3 for the ICBdT condition
#so we can combine "Rep1_ICB", "Rep3_ICB", "Rep5_ICB" to ICB, and "Rep1_ICBdT", "Rep3_ICBdT", "Rep5_ICBdT" to ICBdT. 
#first initialize a metadata column for experimental_condition
merged_tcells@meta.data$experimental_condition <- NA

#Now we can take all cells that are in each replicate-condition, 
#and assign them to the appropriate condition
merged_tcells@meta.data$experimental_condition[merged_tcells@meta.data$orig.ident %in% c("Rep1_ICB", "Rep3_ICB", "Rep5_ICB")] <- "ICB"
merged_tcells@meta.data$experimental_condition[merged_tcells@meta.data$orig.ident %in% c("Rep1_ICBdT", "Rep3_ICBdT", "Rep5_ICBdT")] <- "ICBdT"

#double check that the new column we generated makes sense 
#(each replicate should correspond to its experimental condition)
table(merged_tcells@meta.data$orig.ident, merged_tcells@meta.data$experimental_condition)

With the experimental conditions now defined, we can compare the T cells from both groups. We’ll start by using FindMarkers using similar parameters to last time, and see how ICBdT changes with respect to ICB. Next, restrict the dataframe to significant genes only, and then look at the top 5 most upregulated and downregulated DE genes by log2FC.

#carry out DE analysis between both groups
merged_tcells <- SetIdent(merged_tcells, value = "experimental_condition")
tcells_de <- FindMarkers(merged_tcells, ident.1 = "ICBdT", ident.2 = "ICB", min.pct=0.25)

#restrict differentially expressed genes to those with an adjusted p-value less than 0.001 
tcells_de_sig <- tcells_de[tcells_de$p_val_adj < 0.001,]

#find the top 5 most downregulated genes
tcells_de_sig %>%
  top_n(n = 5, wt = -avg_log2FC)
#find the top 5 most upregulated genes
tcells_de_sig %>%
  top_n(n = 5, wt = avg_log2FC)

The most downregulated gene in the ICBdT condition based on foldchange is Cd4. That makes sense, because the monoclonal antibody (GK1.5) used for T cell depletion specifically targets CD4 T cells! You can read more about the antibody here.

Interestingly, for the list of genes that are upregulated, we see Cd8b1 show up. It could be interesting to see if the CD8 T cells’ phenotype changes based on the treatment condition. So now, let’s subset the object to CD8 T cells only, find DE genes to see how ICBdT CD8 T cells change compared to ICB CD8 T cells, and visualize these similar to before.

#subset object to CD8 T cells. Since we already showed how to subset cells using the clusters earlier, 
#this time we'll subset to CD8 T cells by selecting for cells with high 
#expression of Cd8 genes and low expression of Cd4 genes
merged_cd8tcells <- subset(merged_tcells, subset= Cd8b1 > 1 & Cd8a > 1 & Cd4 < 0.1)

#carry out DE analysis between both groups
merged_cd8tcells <- SetIdent(merged_cd8tcells, value = "experimental_condition")
cd8tcells_de <- FindMarkers(merged_cd8tcells, ident.1 = "ICBdT", ident.2 = "ICB", min.pct=0.25) #how ICBdT changes wrt ICB

#restrict differentially expressed genes to those with an adjusted p-value less than 0.001 
cd8tcells_de_sig <- cd8tcells_de[cd8tcells_de$p_val_adj < 0.001,]

#get the top 20 genes by fold change
cd8tcells_de_sig %>%
  top_n(n = 20, wt = abs(avg_log2FC)) -> cd8tcells_de_sig_top20

#get list of top 20 DE genes for ease
cd8tcells_de_sig_top20_genes <- rownames(cd8tcells_de_sig_top20)


#plot all 20 genes in violin plots
VlnPlot(merged_cd8tcells, features = cd8tcells_de_sig_top20_genes, group.by = 'experimental_condition', ncol = 5, pt.size = 0)

#plot all 20 genes in UMAP plots
FeaturePlot(merged_cd8tcells, features = cd8tcells_de_sig_top20_genes, ncol = 5)

#plot all 20 genes in a DotPlot
DotPlot(merged_cd8tcells, features = cd8tcells_de_sig_top20_genes, group.by = 'experimental_condition') + RotatedAxis()

#plot all differentially expressed genes in a volcano plot
EnhancedVolcano(cd8tcells_de,
  lab = rownames(cd8tcells_de),
  x = 'avg_log2FC',
  y = 'p_val_adj',
  title = 'ICBdT wrt ICB',
  pCutoff = 0.05,
  FCcutoff = 0.5,
  pointSize = 3.0,
  labSize = 5.0,
  colAlpha = 0.3)

At this point, you can either start doing literature searches for some of these genes and maybe find that antigen-specific memory CD8 T cells have higher expression of Bcl2 and Cdk8.

But we can also use gene set and pathway analyses to try and determine what processes the cells may be involved in.

Cell Type Annotation | Griffith Lab

RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

Cell Type Annotation

Cell Type Annotation

Why do we care about assigning cell type annotations to single cell RNAseq data?

Annotating cells in single-cell gene expression data is an active area of research. Defining a cell type can be challenging because of two fundamental reasons. One, gene expression levels are not discrete and mostly on a continuum; and two, differences in gene expression do not always translate to differences in cellular function (Pasquini et al.). The field is growing and rapidly changing due to the advent of different kinds of annotation tools, and the creation of multiple scRNA-seq databases (Wang et al.).

One way to assign cell type annotation is through web resources. They are relatively easy to use and do not require advanced level scripting or programming skills. Examples include Azimuth, Tabula Sapiens, and SciBet. Researchers sometimes try both methods with multiple tools when performing the cell type annotation. In addition, it is also recommended to validate your annotation by experiments, statistical analysis, or consulting subject matter experts.

Methods for assigning cell types usually fall into one of two categories:

Assigning cell annotations to our data

Loading libraries

First, we need to load the relevant libraries.

library(SingleR)
library(celldex)
library(Seurat)
library(cowplot)

merged <- readRDS("outdir_single_cell_rna/rep135_clustered.rds")

Alternatively, we can download a copy of this rds object and then load it in.

url <- "http://genomedata.org/cri-workshop/rep135_clustered.rds"
file_name <- "rep135_clustered.rds"
file_path <- "outdir_single_cell_rna/"
download.file(url, paste(file_path, file_name, sep = ""), mode = "wb")
merged <- readRDS('outdir_single_cell_rna/rep135_clustered.rds')

Specifying which reference dataset to use

The following function provides normalized expression values of 830 microarray samples generated by ImmGen from pure populations of murine immune cells. The samples were processed and normalized as described in Aran, Looney and Liu et al. (2019), i.e., CEL files from the Gene Expression Omnibus (GEO; GSE15907 and GSE37448), were downloaded, processed, and normalized using the robust multi-array average (RMA) procedure on probe-level data. This dataset consists of 20 broad cell types (“label.main”) and 253 finely resolved cell subtypes (“label.fine”). The subtypes have also been mapped to the Cell Ontology (“label.ont”, if cell.ont is not “none”), which can be used for further programmatic queries.

#cell typing with single R
#load singleR immgen reference
ref_immgen <- celldex::ImmGenData()

Calling the ImmGenData() function returns a SummarizedExperiment object containing a matrix of log-expression values with sample-level labels.

ref_immgen

#result of calling the above ref_immgen function
class: SummarizedExperiment 
dim: 22134 830 
metadata(0):
assays(1): logcounts
rownames(22134): Zglp1 Vmn2r65 ... Tiparp Kdm1a
rowData names(0):
colnames(830):
  GSM1136119_EA07068_260297_MOGENE-1_0-ST-V1_MF.11C-11B+.LU_1.CEL
  GSM1136120_EA07068_260298_MOGENE-1_0-ST-V1_MF.11C-11B+.LU_2.CEL
  ...
  GSM920654_EA07068_201214_MOGENE-1_0-ST-V1_TGD.VG4+24ALO.E17.TH_1.CEL
  GSM920655_EA07068_201215_MOGENE-1_0-ST-V1_TGD.VG4+24ALO.E17.TH_2.CEL
colData names(3): label.main label.fine label.ont

Now let’s see what each of the labels look like:

head(ref_immgen$label.main, n=10)

From the main labels, we can see that we get general cell types such as Macrophages and Monocytes.

[1] "Macrophages" "Macrophages" "Macrophages"
[4] "Macrophages" "Macrophages" "Macrophages"
[7] "Monocytes"   "Monocytes"   "Monocytes"  
[10] "Monocytes" 
head(ref_immgen$label.fine, n=10)

From the fine labels, we can see that we start to subtype the more general cell types we saw above. So rather than seeing 6 labels for Macrophages we now see specific Macrophage types such as Macrophages (MF.11C-11B+).

[1] "Macrophages (MF.11C-11B+)" "Macrophages (MF.11C-11B+)"
[3] "Macrophages (MF.11C-11B+)" "Macrophages (MF.ALV)"     
[5] "Macrophages (MF.ALV)"      "Macrophages (MF.ALV)"     
[7] "Monocytes (MO.6+I-)"       "Monocytes (MO.6+2+)"      
[9] "Monocytes (MO.6+2+)"       "Monocytes (MO.6+2+)" 
head(ref_immgen$label.ont, n=10)

From the ont labels, we can see that we start to subtypes are now mapped to Cell Ontology IDs.

[1] "CL:0000235" "CL:0000235" "CL:0000235" "CL:0000583" "CL:0000583"
[6] "CL:0000583" "CL:0000576" "CL:0000576" "CL:0000576" "CL:0000576"

Applying the immgen cell reference to our data

#generate predictions for our seurat object
predictions_main = SingleR(test = GetAssayData(merged), 
                      ref = ref_immgen,
                      labels = ref_immgen$label.main)

predictions_fine = SingleR(test = GetAssayData(merged), 
                           ref = ref_immgen,
                           labels = ref_immgen$label.fine)

What do these objects look like?

head(predictions_main)

You should see something like the below, where each row is a barcode with columns for scores, labels (assigned cell type), delta, and pruned.labels.

DataFrame with 6 rows and 4 columns
                                                         scores
                                                       <matrix>
Rep1_ICBdT_AAACCTGAGCCAACAG-1 0.4156037:0.4067582:0.2845856:...
Rep1_ICBdT_AAACCTGAGCCTTGAT-1 0.4551058:0.3195934:0.2282272:...
Rep1_ICBdT_AAACCTGAGTACCGGA-1 0.0717647:0.0621878:0.0710026:...
Rep1_ICBdT_AAACCTGCACGGCCAT-1 0.2774994:0.2569566:0.2483387:...
Rep1_ICBdT_AAACCTGCACGGTAAG-1 0.3486259:0.3135662:0.3145100:...
Rep1_ICBdT_AAACCTGCATGCCACG-1 0.0399733:0.0229926:0.0669236:...
                                   labels delta.next
                              <character>  <numeric>
Rep1_ICBdT_AAACCTGAGCCAACAG-1         NKT  0.0124615
Rep1_ICBdT_AAACCTGAGCCTTGAT-1     B cells  0.1355124
Rep1_ICBdT_AAACCTGAGTACCGGA-1 Fibroblasts  0.1981683
Rep1_ICBdT_AAACCTGCACGGCCAT-1    NK cells  0.0577608
Rep1_ICBdT_AAACCTGCACGGTAAG-1     T cells  0.1038542
Rep1_ICBdT_AAACCTGCATGCCACG-1 Fibroblasts  0.2443470
                              pruned.labels
                                <character>
Rep1_ICBdT_AAACCTGAGCCAACAG-1           NKT
Rep1_ICBdT_AAACCTGAGCCTTGAT-1       B cells
Rep1_ICBdT_AAACCTGAGTACCGGA-1   Fibroblasts
Rep1_ICBdT_AAACCTGCACGGCCAT-1      NK cells
Rep1_ICBdT_AAACCTGCACGGTAAG-1       T cells
Rep1_ICBdT_AAACCTGCATGCCACG-1   Fibroblasts

Okay…but what do these columns actually tell us?

The scores column contains a matrix for each barcode that corresponds to to how confident SingleR is in assigning each cell type to the barcode for that row. The labels column is the most confident assignment singleR has for that particular barcode. The delta column contains the “delta” value for each cell, which is the gap, or the difference between the score for the assigned label and the median score across all labels. If the delta is small, this indicates that the cell matches all labels with the same confidence, so the assigned label is not very meaningful. SingleR can discard cells with low delta values caused by (i) ambiguous assignments with closely related reference labels and (ii) incorrect assignments that match poorly to all reference labels – so in the pruned.labels column you will find “cleaner” or more reliable labels.

How many cells have low confidence labels?

unique(predictions_main$pruned.labels)
 [1] "NKT"               "B cells"          
 [3] "Fibroblasts"       "NK cells"         
 [5] "T cells"           "Neutrophils"      
 [7] "DC"                "Monocytes"        
 [9] "ILC"               "Epithelial cells" 
[11] "Macrophages"       "Basophils"        
[13] "Tgd"               "Mast cells"       
[15] "Endothelial cells" NA                 
[17] "Stem cells"        "Stromal cells"    
[19] "B cells, pro"   
table(predictions_main$pruned.labels)
          B cells      B cells, pro         Basophils 
             3219                 3                33 
               DC Endothelial cells  Epithelial cells 
              295                67              1238 
      Fibroblasts               ILC       Macrophages 
              577               752               454 
       Mast cells         Monocytes       Neutrophils 
               11               617                92 
         NK cells               NKT        Stem cells 
              562              2241                 2 
    Stromal cells           T cells               Tgd 
               18             12631               189 

table(predictions_main$labels)
          B cells      B cells, pro         Basophils 
             3253                 3                37 
               DC Endothelial cells  Epithelial cells 
              295                71              1238 
      Fibroblasts               ILC       Macrophages 
              589               763               459 
       Mast cells         Monocytes       Neutrophils 
               11               633                92 
         NK cells               NKT        Stem cells 
              565              2249                 2 
    Stromal cells           T cells               Tgd 
               18             12714               193 
summary(is.na(predictions_main$pruned.labels))
   Mode   FALSE    TRUE 
logical   23001     184 

Are there more or less pruned labels for the fine labels?

summary(is.na(predictions_fine$pruned.labels))
   Mode   FALSE    TRUE 
logical   23006     179 

Now that we understand what the singleR dataframe looks like and what the data contains, let’s begin to visualize the data.

plotDeltaDistribution(predictions_main, ncol = 4, dots.on.top = FALSE)

plotScoreHeatmap(predictions_main)

Immgen Delta Dist

Immgen Main heatmap

Rather than only working with the singleR dataframe, we can add the labels to our Seurat data object as a metadata field, so let’s add the cell type labels to our seurat object.

#add main labels to object
merged[['immgen_singler_main']] = rep('NA', ncol(merged))
merged$immgen_singler_main[rownames(predictions_main)] = predictions_main$labels

#add fine labels to object
merged[['immgen_singler_fine']] = rep('NA', ncol(merged))
merged$immgen_singler_fine[rownames(predictions_fine)] = predictions_fine$labels

Let’s visualize what the cell typing looks like within our data now

How do our samples differ in their relative cell composition?

#visualizing the relative proportion of cell types across our samples
library(viridis)
library(ggplot2)
ggplot(merged[[]], aes(x = orig.ident, fill = immgen_singler_main)) + geom_bar(position = "fill") + scale_fill_viridis(discrete = TRUE)

Immgen Stacked Bar

We can also flip the samples and cell labels and view this data as below.

ggplot(merged[[]], aes(x = immgen_singler_main, fill = orig.ident)) + geom_bar(position = "fill") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + scale_fill_viridis(discrete = TRUE)

ggplot(merged[[]], aes(x = immgen_singler_fine, fill = orig.ident)) + geom_bar(position = "fill") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + scale_fill_viridis(discrete = TRUE) 

Immgen Main Stacked Bar

Immgen Fine Stacked Bar

How do our cell type annotations map to our clusters we defined previously?

#plotting cell types on our umaps
DimPlot(merged, group.by = c("immgen_singler_main"))

DimPlot(merged, group.by = c("immgen_singler_fine")) + NoLegend()

Immgen Main UMAP

Immgen Fine UMAP

How do our cell annotations differ if we use a different reference set?

We’ve selected the

This dataset was contributed by the Benayoun Lab that identified, downloaded and processed data sets on GEO that corresponded to sorted cell types Benayoun et al., 2019.

The dataset contains 358 mouse RNA-seq samples annotated to 18 main cell types (“label.main”). These are split further into 28 subtypes (“label.fine”). The subtypes have also been mapped to the Cell Ontology as with the ImmGen reference.

ref_mouserna <- celldex::MouseRNAseqData()

If we look at this reference, we can see that it also has the main, fine, and ont labels that we saw with the ImmGen reference.

ref_mouserna

class: SummarizedExperiment 
dim: 21214 358 
metadata(0):
assays(1): logcounts
rownames(21214): Xkr4 Rp1 ... LOC100039574 LOC100039753
rowData names(0):
colnames(358): ERR525589Aligned ERR525592Aligned ... SRR1044043Aligned
  SRR1044044Aligned
colData names(3): label.main label.fine label.ont
predictions_mouse_main = SingleR(test = GetAssayData(merged), 
                      ref = ref_mouserna,
                      labels = ref_mouserna$label.main)

predictions_mouse_fine = SingleR(test = GetAssayData(merged), 
                           ref = ref_mouserna,
                           labels = ref_mouserna$label.fine)
merged[['mouserna_singler_main']] = rep('NA', ncol(merged))
merged$mouserna_singler_main[rownames(predictions_mouse_main)] = predictions_mouse_main$labels

#add fine labels to object
merged[['mouserna_singler_fine']] = rep('NA', ncol(merged))
merged$mouserna_singler_fine[rownames(predictions_mouse_fine)] = predictions_mouse_fine$labels

Let’s visualize how the labeling of our cells differs between the main labels from ImmGen and MouseRNA.

table(predictions_main$labels)
table(predictions_mouse_main$labels)
          B cells      B cells, pro         Basophils                DC Endothelial cells 
             3253                 3                37               295                71 
 Epithelial cells       Fibroblasts               ILC       Macrophages        Mast cells 
             1238               589               763               459                11 
        Monocytes       Neutrophils          NK cells               NKT        Stem cells 
              633                92               565              2249                 2 
    Stromal cells           T cells               Tgd 
               18             12714               193 


       Adipocytes           B cells   Dendritic cells Endothelial cells       Fibroblasts 
                4              3325                94               245               979 
     Granulocytes       Hepatocytes       Macrophages         Monocytes          NK cells 
              116               652               185              1013              1341 
          T cells 
            15231 
p1 <- DimPlot(merged, group.by = c("immgen_singler_main")) + scale_colour_viridis(option = 'turbo', discrete = TRUE)
p2 <- DimPlot(merged, group.by = c("mouserna_singler_main")) + scale_colour_viridis(option = 'turbo', discrete = TRUE)
p <- plot_grid(p1, p2, ncol = 2)
p

Comparison UMAP

Note on reference annotation datasets

As one might expect, the choice of reference can have a major impact on the annotation results. It’s essential to choose a reference dataset encompassing a broader spectrum of labels than those expected in our test dataset. Trust in the appropriateness of labels assigned by the original authors to reference samples is often a leap of faith, and it’s unsurprising that certain references outperform others due to differences in sample preparation quality. Ideally, we favor a reference generated using a technology or protocol similar to our test dataset, although this consideration is typically not an issue when using SingleR() for annotating well-defined cell types.

Users are advised to read the relevant vignette for more details about the available references as well as some recommendations on which to use. (As an aside, the ImmGen dataset and other references were originally supplied along with SingleR itself but have since been migrated to the separate celldex package for more general use throughout Bioconductor.) Of course, as we shall see in the next Chapter, it is entirely possible to supply your own reference datasets instead; all we need are log-expression values and a set of labels for the cells or samples

Understanding Clustering using Celltyping

We know that the UMAP shape changes when we use different numbers of PCs but why does the UMAP shape change? Lets try creating a UMAP with only 5 PCs.

merged_5PC <- RunUMAP(merged, dims = 1:5)

DimPlot(merged_5PC, label = TRUE, group.by = 'immgen_singler_main')

DimHeatmap(merged, dims = 1:5, balanced = TRUE, cells = 500)

We see that the cells form much more general clusters. For example, the immune cells are just in one big blob. When we increase our PCs we increase the amount of information that can be used to tease apart more specfic cell types. The distinct B cell cluster we saw earlier was only possible because we provided enough genetic expression information in the PCs we chose.

Seurat FindVariableFeatures

Saving our data

Let’s save our seurat objects to use in future sections.

saveRDS(merged, file = "outdir_single_cell_rna/preprocessed_object.rds")