Cancer cell identification


Often when we’re analyzing cancer samples using scRNAseq, we need to distinguish tumor cells from healthy cells. Usually, we know what celltype we expect the tumor cells to be (epithelial cells for bladder cancer, B cells for a B cell lymphoma, etc.), but we may want to separate tumor cells from normal cells of the same celltype for various reasons, such as DE analyses. Also, since we are drawing conclusions about gene expression from these comparisons, we may want to identify tumor cells using methods that are orthogonal to the genes the tumor cells express. To that end, two common methods are identifying either mutations or copy number alterations in the scRNAseq data.

In the case of mutations, one will typically carry out mutation calling from DNA sequencing data (ideally from tumor-normal paired samples) and then ‘look’ for those mutations in the scRNAseq BAM files using a tool like VarTrix. Because of the sparsity of single cell data and its end bias (5’ or 3’ depending on the kit), it is unlikely that this approach will recover every tumor cell; however, we can be fairly confident in the tumor cells it does identify.

For copy number alterations (CNAs), there are various tools that try to detect CNAs in tumor cells, like CONICSmat and InferCNV. They rely on a similar principle: using the counts matrix to identify regions of the genome that collectively show higher or lower expression. If, say, 100+ genes in the same region have higher (or lower) expression, that is more likely due to a copy number gain (or loss) than to all of those genes being individually upregulated (or downregulated).
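To make that principle concrete, here is a minimal sketch (not part of any specific tool) assuming a normalized gene-by-cell expression matrix expr, a vector chr2_genes of gene symbols located on chromosome 2, and a vector cell_labels marking putative tumor and normal cells (all hypothetical names, used only for illustration):

#per-cell average expression over all genes in the region of interest
chr2_avg <- colMeans(expr[rownames(expr) %in% chr2_genes, ])

#cells carrying a chromosome 2 gain should show a shifted distribution relative to normal cells
boxplot(chr2_avg ~ cell_labels, ylab = "Mean chr2 expression per cell")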


Finding tumor cells based on copy number data

If you are working in a tumor sample where you expect to find CNAs, looking for copy number alterations in the scRNAseq data can be one way to identify tumor cells. In our case, whole genome sequencing was done on the cell line used for the mouse models, so we have some confidently determined CNAs we expect to find in the scRNAseq data-

[Figure: CNV scatterplot from whole genome sequencing of the cell line]

As we can see above, we expect to find gains on chromosomes 2 and 11 and a loss on chromosome 12.

We will use CONICSmat to identify cells with CNAs in our scRNAseq data. While there is more information available in their GitHub tutorial and paper, briefly, CONICSmat fits a two-component Gaussian mixture model for each chromosomal region and compares the Bayesian Information Criterion (BIC) of a 1-component model (all cells are the same and there’s no CNA) to that of a 2-component model (some cells have altered copy number compared to others). Here, a better fit is defined by a lower BIC score. Note that since we are only using the gene by cell counts matrix, the average expression of genes in a chromosomal region is used as a proxy for a CNA. The key to most single-cell CNA based tools is that we need both tumor cells and non-tumor cells in our analysis, as a copy number gain or loss in the tumor cells can only be measured relative to healthy cells. While CONICSmat can be run on all cells in the sample together, for our purposes we will subset the object to Epithelial cells and B cells so that it runs more efficiently. CONICSmat is a somewhat computationally intensive tool, so we will start by clearing our workspace using the broom icon in the top right pane, and then clicking the drop-down menu with the pie chart beside it and selecting Free unused memory.
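To make the model comparison above more concrete, here is a small self-contained sketch of the idea (an illustration of the principle, not CONICSmat’s actual implementation; the values are simulated and the mclust package is used for the mixture fit):

library(mclust)

#'region_avg' stands in for the per-cell average expression of one chromosomal region
#(simulated values, purely for illustration)
set.seed(42)
region_avg <- c(rnorm(300, mean = 0,   sd = 0.15),  #cells without a CNA
                rnorm(150, mean = 0.4, sd = 0.15))  #cells with a putative gain

#1-component model: a single normal fit by maximum likelihood (2 parameters: mean, sd)
loglik_1 <- sum(dnorm(region_avg, mean(region_avg), sd(region_avg), log = TRUE))
bic_1 <- -2 * loglik_1 + 2 * log(length(region_avg))

#2-component Gaussian mixture fit with mclust; mclust reports BIC on a larger-is-better
#scale, so flip the sign to match the lower-is-better convention used for bic_1
fit_2 <- Mclust(region_avg, G = 2, modelNames = "E")
bic_2 <- -fit_2$bic

bic_1 - bic_2  #a large positive BIC difference favours the 2-component (CNA) model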

#make sure you have cleared your workspace and 'freed unused space' otherwise you may run into issues later on!

#start by loading in the libraries and the preprocessed Seurat object
library("Seurat") 
library("ggplot2")
library("cowplot")
library("dplyr") 
library("Matrix")
library("hdf5r")
library("CONICSmat")
library("showtext")

merged <- readRDS('outdir_single_cell_rna/preprocessed_object.rds')

#subset Seurat object to only include epithelial cells and B cells
merged <- SetIdent(merged, value = 'immgen_singler_main')
merged_subset <- subset(merged, idents=c('B cells', 'Epithelial cells'))

#as always double check that the number of cells match our expectations
table(merged$immgen_singler_main)
table(merged_subset$immgen_singler_main)

Now we can look into running CONICSmat! It is run in a few steps and requires us to look at some outputs along the way and define some thresholds. The main inputs CONICSmat needs are a Seurat counts matrix, a regions file specifying genomic regions, and a gene_pos file that has the genomic coordinates of all the genes. After we run the plotAll function, we will get a PDF file with results from the BIC test for every chromosomal region. We will look through this file to determine an appropriate BIC difference threshold that will help capture true CNA events. The BIC difference is (BIC 1 component score) - (BIC 2 component score), and a greater difference suggests a higher chance of a ‘real’ CNA event. Once we have a confident set of CNAs, we will use that information to cluster the cells in a heatmap, where the heatmap is colored by z-scores calculated from the normalized average expression values.

#Normalize the Seurat counts matrix using CONICSmat's normMat function
conicsmat_expr <- CONICSmat::normMat(as.matrix(Seurat::GetAssayData(merged_subset, assay = 'RNA', layer = 'counts')))

#Get chromosomal positions of genes in the expression matrix
gene_pos=getGenePositions(rownames(conicsmat_expr), ensembl_version = "https://oct2022.archive.ensembl.org/", species = "mouse")

#Get coordinates for each region of interest (in our case we'll just use chromosome coordinates)
regions=read.table("/cloud/project/data/single_cell_rna/reference_files/chromosome_full_positions_mm10.txt",sep="\t",header = T, row.names = 1)


#Filter out uninformative genes, i.e. genes that are expressed in fewer than 5 cells (the default given by CONICSmat)
conicsmat_expr=filterMatrix(conicsmat_expr,gene_pos[,"mgi_symbol"],minCells=5)

#Calculate normalization factor for each cell- this centers the gene expression in each cell around the mean. 
#(Need this because the more genes that are expressed in a cell, the fewer reads are 'available' per gene)
normFactor=calcNormFactors(conicsmat_expr)

#Fit the 2 component Gaussian mixture model for each region to determine if the region has a CNA. 
## This step outputs a PDF with a page for each region in the regions file that we can look through to determine which regions are likely to have a CNA.
## It also outputs a BIC_LR.txt file that summarizes the BIC scores and adj p-values for each region
l=plotAll(conicsmat_expr,normFactor,regions,gene_pos,"outdir_single_cell_rna/conic_plotall")

Let’s look through the conic_plotall_CNVs.pdf file. Looking through the results, we can see clear bimodal distributions for chromosomes 2, 11, and 12, and a correspondingly lower BIC score for the 2 component model in those cases. However, we also see a low BIC score for cases like chromosome 5 even though we don’t see a bimodal distribution, and this may contribute to noisy results when we apply a BIC threshold. Looking at these barplots, a BIC difference threshold of around 800 might help filter out the noise, so let’s compare the results with the default threshold (200) and with 800.

#the BIC difference threshold will be applied using the text file generated by the previous step. 
#Let's read that in and look at it in the RStudio window
lrbic <- read.table("outdir_single_cell_rna/conic_plotall_BIC_LR.txt",sep="\t",header=T,row.names=1,check.names=F)
lrbic

#filter candidate regions using default threshold 200
candRegions_200 <- rownames(lrbic)[which(lrbic[,"BIC difference"]>200 & lrbic[,"LRT adj. p-val"]<0.01)]

#plot a histogram and heatmap where we try to split the cells into 2 clusters (ideally tumor and non tumor cells)
plotHistogram(l[,candRegions_200],conicsmat_expr,clusters=2,zscoreThreshold=4)
#note that this command outputs both a plot, and some barcodes with numbers. 
#The barcodes with numbers are basically the heatmap cluster IDs that each barcode is assigned to. But we can ignore that for now.
#did that split up the cells with the gain/losses well? Remember that based on the WGS data, we're expecting gains in chr2 and chr11, and a loss in chr12
#If not, try it with a different number of clusters (4, 8, 12 etc)

#Now filter candidates using threshold 800
candRegions_800 <- rownames(lrbic)[which(lrbic[,"BIC difference"]>800 & lrbic[,"LRT adj. p-val"]<0.01)]

#plot a histogram and heatmap where we try to split the cells into 2 clusters (ideally tumor and non tumor cells)
plotHistogram(l[,candRegions_800],conicsmat_expr,clusters=2,zscoreThreshold=4)
#how does that look? Maybe try increasing the clusters a little to get it to split up nicely.

#Once you have a cluster split you like, save the barcode-cluster ID list to a variable for use later. 
#Also save the plot as a PDF. (Note if the save doesn't work, 
#you may need to run dev.off() a few times until you get a message saying null device 1)
pdf("outdir_single_cell_rna/conic_plot_histogram_cand_regions_3clusters.pdf", width=5, height=5)
hi <- plotHistogram(l[,candRegions_800],conicsmat_expr,clusters=3,zscoreThreshold=4)
dev.off()
hi
#Also note the cluster with putative malignant cells from the heatmap (note that the cluster IDs are not always in order, refer to the text in grey on the right to identify the clusters)

#now convert the hi named vector to a dataframe
hi_df <- data.frame(hi) 
#look at hi_df in RStudio

#add a new column for tumor cell status to the barcodes based on the cluster you've determined to be the tumor cells
hi_df$tumor_cell_classification <- ifelse(hi_df$hi=='2', 'cnv-tumor cell', 'cnv-not tumor cell')
#look at hi_df dataframe in RStudio again

#double-check our classifications worked as expected
table(hi_df$hi, hi_df$tumor_cell_classification)

#let's also see where all the cells cluster on the UMAP. 
#for this we can use the DimPlot function and provide it with a list of cell barcodes that we want to color separately
cnv_tumor_bc <- rownames(hi_df[hi_df$tumor_cell_classification == 'cnv-tumor cell',])
DimPlot(merged, cells.highlight = rownames(hi_df[hi_df$tumor_cell_classification == 'cnv-tumor cell',])) + #break down the argument given to cells.highlight
  DimPlot(merged, group.by = 'immgen_singler_main') 

#save this list of barcodes to their own TSV file for later
write.table(x = cnv_tumor_bc, file = 'outdir_single_cell_rna/cnv_tumor_bc_list.tsv', row.names = FALSE, col.names = 'cnv_tumor_barcodes', quote=FALSE)

Looks like most of our tumor cells based on the CNV classification are in the epithelial cells cluster, as we’d expect! A key point about CONICSmat here is that the tool itself doesn’t distinguish a tumor cell from a non-tumor cell. In this case we knew from the whole genome data that the tumor cells should have gains in chromosomes 2 and 11 and a loss in chromosome 12. But if we did not have that information, we could overlay our celltypes on the heatmap and determine which clusters carry the CNAs, since the CNAs should primarily arise in the tumor’s celltype.

# using the celltypes argument we can add celltype annotations to the heatmap. 
plotHistogram(l[,candRegions_800],conicsmat_expr,clusters=3,zscoreThreshold=4, celltypes = merged_subset$immgen_singler_main)
#note that we got the cell barcodes from merged_subset as that 
#was the object used to generate the initial conicsmat_expr matrix.

We will look into adding this information to the Seurat object after we have the SNV tumor calls as well.

Finding tumor cells based on mutation data

We will use VarTrix to identify cells with mutations in our scRNAseq data. However, VarTrix is a tool that is run at the commandline level, so we will not run it in this workshop; instead, we will go over the command used to run it and how the outputs were processed, and then plot the resulting mutation calls in RStudio. As input, VarTrix takes the scRNAseq BAM and barcodes.tsv files output from CellRanger, along with a VCF file containing the variants and a reference genome fasta file. It can be run in 3 --scoring-method modes; here we chose coverage. This mode generates 2 output matrices, the ref-matrix and out-matrix, both in MatrixMarket format. The former summarizes the number of REF reads (reads with the wild-type allele) observed for every cell barcode and variant, while the latter summarizes the same for ALT reads (reads with the mutant allele). Even though we will not be running it in this workshop, the vartrix command below was run for each replicate separately.

vartrix --vcf [input VCF file (based on cell line, is common for all replicates)] \
 --bam [cellranger bam file (unique to each replicate)] \
 --fasta [mm10 mouse reference genome] \
 --cell-barcodes [barcodes.tsv (unique to each replicate)] \
 --out-matrix [name of output ALT reads matrix] \
 --ref-matrix [name of output REF reads matrix] \
 --out-variants [name of output summarizing variants] -s coverage

The first few lines of one of the output matrices are shown below (both matrices are formatted the same way). The header row (the first non-comment line) summarizes the matrix: the first value is the number of variants, the second value is the number of cells, and the third is the number of data rows that follow. The data itself starts from the 4th row, where the first column identifies the variant number (or ID), the second column identifies the cell barcode, and the third column indicates the number of reads (REF or ALT depending on the matrix) for that variant-

%%MatrixMarket matrix coordinate real general
% written by sprs
10427 4016 287647
3 359 0
3 890 0
3 1018 0
3 1068 0
3 1964 0
3 2048 0
... 

The processing of these matrices to identify variant-containing cells depends on the data. In the manuscript, the mutation calls were noisy, so the authors required a cell to have at least 2 variants with greater than 20X total coverage, at least 5 ALT reads, and 10% VAF (variant allele fraction). However, if one has a confident set of tumor specific variants, the criteria to classify tumor specific cells can be less stringent. Here we will try two different approaches to filtering variants- first we will use the same list of variants the authors did and gradually increase the number of ALT reads we require to classify a cell as a tumor cell. Second, we will filter the original set of variants to a ‘higher confidence’ set of variants and classify tumor cells based on them. This section is also computationally intensive, so we will clear our workspace using the broom icon in the top right pane, and click the drop-down menu with the pie chart beside it and select Free unused memory.
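For reference, here is a sketch (not run in this workshop) of what that manuscript-style filter could look like. It assumes alt_mat and ref_mat are the VarTrix --out-matrix and --ref-matrix outputs for one sample, already read in with readMM(), converted with as.matrix(), and given the cell barcodes as column names (hypothetical object names):

#sketch of the manuscript-style filter: call a cell a tumor cell if at least 2 variants
#in that cell have >20X total coverage, >=5 ALT reads, and >=10% VAF
total_cov <- alt_mat + ref_mat                      #total coverage per variant per cell
vaf       <- alt_mat / pmax(total_cov, 1)           #VAF; pmax avoids division by zero
passing   <- (total_cov > 20) & (alt_mat >= 5) & (vaf >= 0.10)
tumor_barcodes <- colnames(alt_mat)[colSums(passing) >= 2]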

#Load libraries and packages if they are not loaded already
library(Seurat)
library(dplyr)
library(Matrix)
library(ggplot2)

#read in seurat object
merged <- readRDS(file='data/single_cell_rna/backup_files/preprocessed_object.rds')

#read in the variants file
variants_file <- read.csv('/cloud/project/data/single_cell_rna/cancer_cell_id/mcb6c-exome-somatic.variants.annotated.clean.filtered_10K.tsv', sep='\t')

#read in a sample's output VarTrix matrix and barcodes 
##to read a sparse VarTrix matrix into R for processing, we'll use the readMM function, convert that to a matrix, and then convert the matrix to a dataframe.
rep1_icb_df <- as.data.frame(as.matrix(readMM('/cloud/project/data/single_cell_rna/cancer_cell_id/Rep1_ICB_alt_counts.mtx')))
##read in sample barcodes file
rep1_icb_bc <- read.table('/cloud/project/data/single_cell_rna/cancer_cell_id/Rep1_ICB_barcodes.tsv', sep = '\t', header = FALSE, col.names = 'barcodes')

Let’s spend some time looking at the variants_file, rep1_icb_bc and rep1_icb_df files. The variants_file was an output of our variant calling pipeline that uses 4 variant callers and includes some general variant filtering along with information like allele depth, allele fraction, etc. about each variant. The rep1_icb_bc file is a simple file containing all the barcodes in that sample. Lastly, rep1_icb_df is the matrix from earlier processed into a dataframe. You might be wondering how you are supposed to know that this is how the matrix should be processed- usually most tools have (or at least should have) vignettes or brief tutorials available on how to process their data. Luckily for us, VarTrix does have all this documented in their GitHub documentation. So, let’s go through their website and see how they want us to process their data. It looks like after we read in the matrix, they want us to assign the barcodes as column names and the variants as row names. After that we can look into adding the data to our Seurat object. We have the barcodes in the rep1_icb_bc file, but what about the variants? We have the variants_file, but it has chromosome and position as separate columns, so we will make a column where they are combined together, and then add those values to the matrix file.

#make a concatenated column of chromosome and variants
variants_file$CHROM_POS <- paste0(variants_file$CHROM, '_', variants_file$POS)

#assign barcodes to columns and variants to rows
colnames(rep1_icb_df) <- rep1_icb_bc$barcodes
row.names(rep1_icb_df) <- variants_file$CHROM_POS

#Hmm we got an error complaining about duplicate row names... that makes sense, since dataframes use the rownames to index a row, 
#2 rows having the same index is a problem because it'll be an ambiguous index value. 
#But why do we have 2 occurrences of the same variant and does it happen multiple times? Luckily it told us one of the values where it ran into this error `chr9_53503887`. 
#So let's look for it in the variant file. Interesting, we have 2 variants coming from the same position, it's difficult to say what's happening from the variant information alone, so we'll check what might be happening on IGV. 
#For now, at least we know where the error is coming from. Let's check if there are any other similar occurrences.

#use the data.frame() and table() functions to get a dataframe summarizing the number of times each CHROM_POS value occurs
troubleshoot_df <- data.frame(table(variants_file$CHROM_POS))
#sort or order the dataframe in descending order of occurrences for ease of viewing
troubleshoot_df <- troubleshoot_df[order(troubleshoot_df$Freq, decreasing = TRUE),]
#look at the top 5 rows
head(troubleshoot_df)
#looks like it only happens once!

#We can get around this error by using the make.unique() function in R. 
#This function goes through a column and for any values that occur more than once, it appends `_` and a number to the value.
variants_file$CHROM_POS <- make.unique(as.character(variants_file$CHROM_POS), sep = "_")
#navigate through the dataframe and see what changed.

#Okay now let's try to assign the variants to the rows again
row.names(rep1_icb_df) <- variants_file$CHROM_POS
#Success! 

While it’s difficult to say what exactly happened in the case of the variant at chr9_53503887 without loading this sample’s BAM file into IGV, based on the reference genome, one possibility is that because the variant is in a soft-masked/repetitive region, the alignment algorithms were not able to align the reads well, and so we have some instances where it was called as a 4-base deletion and others where it was called as a 2-base deletion.

[Figure: the chr9_53503887 anomaly]

Let’s look at this dataframe more carefully now. We have a dataframe with cell barcodes as columns and variants as rows, and each cell or value in this dataframe is the number of ALT reads (aka reads containing a mutation). Our overall objective is to find tumor cells, and we want to do that by establishing some threshold on the number of ALT reads needed for a cell to be a tumor cell. We can do that for every sample separately or make a superset dataframe and do all the filtering together. The second approach might be better because eventually our goal is to combine cells from all the samples with the Seurat object. In order to make a superset dataframe, first we need to do the above formatting for each VarTrix output. Then, we can combine the results horizontally (column-wise), since each sample contributes its own cell barcodes as new columns. The key to combining dataframes horizontally is that the rownames (or index) need to be the same and in the same order, and that is the case here because we ran VarTrix on all samples using the same set of variants.
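As a toy illustration of that horizontal combination (made-up values, just to show the mechanics):

#two dataframes whose rows are in the same order (same variant rownames) can simply be cbind-ed
toy_a <- data.frame(cellA = c(0, 2), cellB = c(1, 0), row.names = c('chr1_100', 'chr2_200'))
toy_b <- data.frame(cellC = c(0, 0), cellD = c(3, 1), row.names = c('chr1_100', 'chr2_200'))
cbind(toy_a, toy_b)  #4 cell columns, same 2 variant rows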

#first let's make a vector of samples
sample_names <- c('Rep1_ICB', 'Rep1_ICBdT', 'Rep3_ICB', 'Rep3_ICBdT', 'Rep5_ICB', 'Rep5_ICBdT')

# Initialize an empty dataframe with the rows predefined as variant_names
all_samples_vartrix_df <- data.frame(matrix(nrow = length(variants_file$CHROM_POS), ncol = 0))
rownames(all_samples_vartrix_df) <- variants_file$CHROM_POS

#Now loop through each sample in sample_names and format them using the code from before
for (sample_rep in sample_names) {
  #read in and process the sparse matrix
  rep_df <- data.frame(as.matrix(readMM(paste0('/cloud/project/data/single_cell_rna/cancer_cell_id/', sample_rep, '_alt_counts.mtx'))))
  
  #get each sample's barcodes
  rep_bc <- read.table(paste0('/cloud/project/data/single_cell_rna/cancer_cell_id/', sample_rep, '_barcodes.tsv'), 
    sep = '\t', header = FALSE, col.names = 'barcodes')
  
  #set column names and row names
  colnames(rep_df) <- rep_bc$barcodes
  row.names(rep_df) <- variants_file$CHROM_POS
  
  #combine dataframes
  all_samples_vartrix_df <- cbind(all_samples_vartrix_df, rep_df)
}

#let's double check that the number of columns makes sense. 
#the number of cell barcodes in each of our replicates should be the same as the number of columns in all_samples_vartrix_df 
#number of columns in dataframe
ncol(all_samples_vartrix_df) #28585
#number of rows in each sample's dataframe # 4016+3927+6390+5576+2828+5848
#have a variable that we'll keep adding the number of rows to
tracking_sum = 0
#loop through each sample and count the number of rows in that dataframe. 
#Add that value to tracking_sum, and also print the number of rows each time. 
for (sample_rep in sample_names) {
  rep_bc <- read.table(paste0('/cloud/project/data/single_cell_rna/cancer_cell_id/', sample_rep, '_barcodes.tsv'), sep = '\t', header = FALSE, col.names = 'barcodes')
  print(nrow(rep_bc))
  tracking_sum <- tracking_sum + nrow(rep_bc)
}
#finally, confirm the total matches the number of columns in the combined dataframe
tracking_sum == ncol(all_samples_vartrix_df) #should be TRUE

Now we have all the cells in one dataframe, with information about the number of ALT reads for each variant. Next, we can get to classifying some tumor cells. For this, we’ll first want to flip (or transpose) the dataframe so that the cell barcodes are on the rows and the variants are on the columns. The reason for this will become clearer soon, but basically we need to know the total number of ALT reads across all variants in each cell, and the way dataframe operations work, it is much easier to do that by adding a column that holds the sum of all the values in each row. To transpose the dataframe, we can use a handy R function called t(), and then we can take the row sums using rowSums(). Once we have that, we can start subsetting the dataframe and making UMAP plots that highlight cells depending on how many ALT reads we want to require for a cell to be considered a tumor cell.

#transpose the dataframe
all_samples_vartrix_df <- as.data.frame(t(all_samples_vartrix_df))
#The t() function technically outputs a matrix and that needs to be converted to a dataframe

#this is a large matrix and ends up being computationally heavy so run the following command 
gc() #this is the same as 'free unused R memory'

#now let's add a column that is the sum of the entire row so that we can see the number of ALT reads per cell
all_samples_vartrix_df$num_alt_reads <- rowSums(all_samples_vartrix_df)

#Now let's highlight cells in Seurat with at least 1 ALT read
## first get a dataframe where everything has at least 1 ALT read using that column we added
all_samples_vartrix_df_1_ALT_read <- all_samples_vartrix_df[all_samples_vartrix_df$num_alt_reads >= 1,]
DimPlot(merged, cells.highlight = rownames(all_samples_vartrix_df_1_ALT_read)) + 
  ggtitle('UMAP highlighting cells with at least 1 ALT read')
#that's way too many 'tumor' cells especially since we know a lot of the cells are not even epithelial cells.

#Let's try this again but increase our requirement to 10 ALT reads. 
#Also, we don't need to make a new dataframe every time, we can load the cells.highlight directly from the all_samples_vartrix_df table
DimPlot(merged, cells.highlight = rownames(all_samples_vartrix_df[all_samples_vartrix_df$num_alt_reads >= 10,])) + 
  ggtitle('UMAP highlighting cells with at least 10 ALT reads')

#That seems to have helped, but we're still getting a lot more cells than we'd expect 
#considering that the epithelial cells are largely in clusters 9 and 12
#let's increase the number of ALT reads up to 20
DimPlot(merged, cells.highlight = rownames(all_samples_vartrix_df[all_samples_vartrix_df$num_alt_reads >= 20,])) + 
  ggtitle('UMAP highlighting cells with at least 20 ALT reads')

We tried to get fairly strict with how we are defining a tumor cell, by requiring at least 20 reads carrying a mutation for a cell to be a tumor cell, but it seems like we’re still getting some pretty noisy calls, since many of the cells we identify as tumor cells are unlikely to be epithelial cells. This could suggest that our original variant list was not filtered enough and still contains germline variants. We can filter this variant file further by removing variants that are likely to be germline variants using the following filtering strategy.

#before we move forward let's delete some files and free up some space
rm(rep1_icb_bc, rep1_icb_df, troubleshoot_df, all_samples_vartrix_df_1_ALT_read)
gc()

#filter variants file to high confidence variants
variants_file_high_conf <- variants_file[variants_file$set == 'mutect-varscan-strelka' & 
                           variants_file$NORMAL.DP > 50 & 
                           variants_file$TUMOR.DP > 50 &
                           variants_file$NORMAL.AF < 0.01 & 
                           variants_file$TUMOR.AF > 0.25 &
                           variants_file$Consequence == 'missense_variant', ]
 

What are these filters doing?

  • The set column only keeps variants that were called by all 3 mutation callers (mutect, varscan, and strelka).
  • NORMAL.DP and TUMOR.DP require at least 50 reads covering the variant position in both the normal and tumor samples.
  • The NORMAL.AF filter requires that if a variant does occur in the normal sample, it is at a VAF of less than 1%.
  • The TUMOR.AF filter requires that a variant occurs in the tumor sample at a VAF of at least 25%.
  • The Consequence filter restricts us to missense mutations.

Let’s use this high confidence variants file to restrict our superset variants matrix to this list of variants, and see if this works better for classifying tumor cells.

#get a list of high confidence variants
high_conf_variants <- variants_file_high_conf$CHROM_POS

#the list of variants corresponds to the columns of our dataframe, 
#so we can use the list of high_conf_variants as a list of columns to keep in our dataframe
all_samples_vartrix_df_high_conf <- all_samples_vartrix_df[high_conf_variants]

#now let's recalculate num_alt_reads based on these ~2K high confidence variants
all_samples_vartrix_df_high_conf$num_alt_reads <- rowSums(all_samples_vartrix_df_high_conf)

#let's plot potential tumor cells based on the high confidence variants using the same thresholds as before. 
#At least 1 ALT read for a tumor cell
DimPlot(merged, cells.highlight = rownames(all_samples_vartrix_df_high_conf[all_samples_vartrix_df_high_conf$num_alt_reads >= 1,])) + 
  ggtitle('UMAP highlighting cells with at least 1 ALT read (filtered VCF)')
#At least 10 ALT reads for a tumor cell
DimPlot(merged, cells.highlight = rownames(all_samples_vartrix_df_high_conf[all_samples_vartrix_df_high_conf$num_alt_reads >= 10,])) + 
  ggtitle('UMAP highlighting cells with at least 10 ALT reads (filtered VCF)')
#At least 20 ALT reads for a tumor cell
DimPlot(merged, cells.highlight = rownames(all_samples_vartrix_df_high_conf[all_samples_vartrix_df_high_conf$num_alt_reads >= 20,])) + 
  ggtitle('UMAP highlighting cells with at least 20 ALT reads (filtered VCF)')

#How does this compare against our previous plot?
tumor_cells_10Kvars_dimplot <- DimPlot(merged, cells.highlight = rownames(all_samples_vartrix_df[all_samples_vartrix_df$num_alt_reads >= 20,])) + 
ggtitle('Tumor cells with at least 20 ALT reads - 10K variants')

tumor_cells_2Kvars_dimplot <- DimPlot(merged, cells.highlight = rownames(all_samples_vartrix_df_high_conf[all_samples_vartrix_df_high_conf$num_alt_reads >= 20,])) + 
ggtitle('Tumor cells with at least 20 ALT reads - 2K variants')

#view the two plots side by side
tumor_cells_10Kvars_dimplot + tumor_cells_2Kvars_dimplot

#That looks way better! Based on this, I think we can be pretty happy about finding tumor cells from the high confidence variants 
all_samples_vartrix_df_high_conf_20_ALT_read <- all_samples_vartrix_df_high_conf[all_samples_vartrix_df_high_conf$num_alt_reads >= 20,]
snv_tumor_bc <- rownames(all_samples_vartrix_df_high_conf_20_ALT_read)
write.table(x = snv_tumor_bc, file = 'outdir_single_cell_rna/snv_tumor_bc_list.tsv', row.names = FALSE, col.names = 'snv_tumor_barcodes', quote=FALSE)

Lastly, let’s look into adding the SNV and CNV based tumor cell classifications to our Seurat object’s metadata. To add metadata to a Seurat object, we need a dataframe that has all the barcodes in the Seurat object as rownames, and column(s) that we want to add to the metadata with value(s) corresponding to each barcode. Recall that our tumor cell classification lists for both SNVs and CNVs include only the tumor cell barcodes, so we will also need a list of all barcodes in the Seurat object.

#read in our CNV and SNV tumor cell classification files
snv_bc_df <- read.csv('data/single_cell_rna/backup_files/snv_tumor_bc_list.tsv', sep='\t')
cnv_bc_df <- read.csv('data/single_cell_rna/backup_files/cnv_tumor_bc_list.tsv', sep='\t')
#get lists of these barcodes
snv_bc_list <- snv_bc_df$snv_tumor_barcodes
cnv_bc_list <- cnv_bc_df$cnv_tumor_barcodes

#read in our Seurat object
merged <- readRDS('outdir_single_cell_rna/preprocessed_object.rds')

#get all barcodes from our Seurat object
all_bc <- rownames(merged@meta.data)
#convert them into a dataframe
all_bc_df <- data.frame(all_bc)
#add 2 columns that will hold the SNV tumor cell classifications (snv_class) and CNV tumor cell classifications (cnv_class), prefilled with the value 'unclassified' 
all_bc_df$snv_class <- 'unclassified'
all_bc_df$cnv_class <- 'unclassified'

#for barcodes in each of our SNV and CNV lists, 
#we'll change the value in the corresponding columns to 'snv_tumor' and 'cnv_tumor'
all_bc_df$snv_class[all_bc_df$all_bc %in% snv_bc_list] <- 'snv_tumor'
all_bc_df$cnv_class[all_bc_df$all_bc %in% cnv_bc_list] <- 'cnv_tumor'

#now let's add these classifications to Seurat metadata
#first we need to make the barcodes column as the rownames or index and delete the extra column. 
rownames(all_bc_df) <- all_bc_df$all_bc
#open the all_bc_df and see what happened. We don't need the extra barcodes column, so let's get rid of that. 
all_bc_df$all_bc <- NULL
#finally, let's add this to the Seurat metadata using Seurat's AddMetaData function
merged <- AddMetaData(merged, all_bc_df)
#see what columns were added to the metadata column
colnames(merged@meta.data)

#Now let's plot the CNV and SNV classifications on our UMAP
DimPlot(merged, group.by = c('snv_class', 'cnv_class', 'immgen_singler_main'))

#let's also take a quick look at how much the snv and cnv classifications agree with each other
table(merged$cnv_class, merged$snv_class)

Voila! Looks like both of our independent classifications of tumor cells agree to quite a large extent. Each approach also independently added cells to our tumor cell identification, and both were largely restricted to the epithelial cells.


Gene set enrichment and pathway analysis


After carrying out differential expression analysis, and getting a list of interesting genes, a common next step is enrichment or pathway analyses. Broadly, enrichment analyses can be divided into two types- overrepresentation analysis and gene set enrichment analysis (GSEA).

Overrepresentation analysis takes a list of significantly DE genes and determines whether these genes are over-represented in a given pathway or geneset. It is primarily useful if we have a set of genes that are highly differentially expressed and we want to determine what process(es) they may be involved in. Mathematically, it calculates a p-value using the hypergeometric distribution to determine if a gene set (from a database) is significantly over-represented among our DE genes. A couple of key points about overrepresentation analysis: first, we get to determine the list of genes that is used as input, so we can set p-value and log2FC thresholds that in turn determine the gene list. Second, since overrepresentation analysis does not use the fold change values (only a list of genes), it is not directional. So if an overrepresentation analysis reports a pathway or geneset as significantly enriched, we do not get any information about whether the genes in our list activate or suppress that pathway- we can only conclude that our genes are involved in that pathway in some way.
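Since the hypergeometric test is the core of overrepresentation analysis, a tiny worked example may help; the numbers below are made up purely for illustration (a background of 20,000 genes, a pathway of 200 genes, 500 DE genes, 30 of which fall in the pathway):

#toy hypergeometric test: probability of seeing 30 or more pathway genes among our DE genes by chance
N <- 20000  #background genes
K <- 200    #genes in the pathway
n <- 500    #DE genes in our list
k <- 30     #DE genes that are also in the pathway
phyper(k - 1, K, N - K, n, lower.tail = FALSE)  #P(X >= k), the overrepresentation p-value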

GSEA addresses the second point above because it uses a list of genes and their corresponding fold change values as inputs to the analysis. Another difference between GSEA and overrepresentation analysis is that in GSEA, we use all the genes as input without applying any filters based on log2FC or p-values. GSEA is useful for detecting incremental changes at the gene expression level that may come together to have an impact on a specific pathway. GSEA orders the genes based on their log2FC values and computes an ‘enrichment score’ (ES) for each gene set, which measures the degree to which that set of genes is over-represented at the top or bottom of the ranked gene list.
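To make the enrichment score less abstract, here is a minimal sketch of the standard weighted running-sum calculation for a single gene set; ranked_lfc is assumed to be a named vector of log2FC values sorted in decreasing order (similar to the gene_list we build later) and geneset a character vector of gene symbols, both hypothetical inputs for illustration (clusterProfiler handles all of this internally):

enrichment_score <- function(ranked_lfc, geneset) {
  hits <- names(ranked_lfc) %in% geneset
  #step up by |log2FC| (normalized) at genes in the set, step down evenly at all other genes
  step_up   <- abs(ranked_lfc) * hits / sum(abs(ranked_lfc[hits]))
  step_down <- (!hits) / sum(!hits)
  running   <- cumsum(step_up - step_down)
  running[which.max(abs(running))]  #ES = the running sum's maximum deviation from zero
}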

Another crucial part of any enrichment analysis is the choice of databases. The main pitfall to avoid is choosing too many or overly broad databases, as this can produce many spurious results. Therefore, when possible, it is better to choose reference databases based on their biological relevance.


There are various tools available for enrichment analysis; here we chose to use a tool called clusterProfiler. It allows us to perform both overrepresentation and GSEA analyses, is widely used in the field, and has quite a few helpful tutorials/resources available.

We will also use the web tool Enrichr for some of our analysis.

We will start by investigating the Epcam positive clusters we identified in the Differential Expression section. Let’s load in the R libraries we will need and read in the DE file we generated previously. Recall that we generated this file using the FindMarkers function in Seurat, and had ident.1 as cluster 9 and ident.2 as cluster 12. Therefore, we are looking at cluster 9 with respect to cluster 12, that is, positive log2FC values correspond to genes upregulated in cluster 9 or downregulated in cluster 12 and vice versa for negative log2FC values.

#load R libraries
library("Seurat")
library("ggplot2")
library("cowplot")
library("dplyr")
library("clusterProfiler")
library("org.Mm.eg.db")
library("msigdbr")
library("DOSE")
library("stringr")
library("enrichplot")

#read in the epithelial DE file
de_gsea_df <- read.csv('outdir_single_cell_rna/epithelial_de_gsea.tsv', sep = '\t')

head(de_gsea_df)
#open this file in Rstudio and get a sense for the distribution of foldchange values and see if their p values are significant
#alternatively try making a histogram of log2FC values using ggplot and the geom_histogram() function. 
ggplot(de_gsea_df, aes(avg_log2FC)) + geom_histogram()
#You can also go one step further and impose a p-value cutoff (say 0.01) and plot the distribution. 

Overrepresentation analysis

You may notice that we have quite a few genes with fairly large fold change values- while fold change values do not impact the overrepresentation analysis itself, they can inform the thresholds we use for picking the genes. Since many genes have fold changes above +2 or below -2, we can use that as our cutoff. We will also impose an adjusted p-value cutoff of 0.01. Thus, for the overrepresentation analysis, we will begin by filtering de_gsea_df on log2FC and p-value, and then get the list of genes for our analysis.

#filter de_gsea_df by subsetting it to only include genes that are significantly DE (pval<0.01) and their absolute log2FC is > 2.
#The abs(de_gsea_df$avg_log2FC) ensures that we keep both the up and downregulated genes
overrep_df <- de_gsea_df[de_gsea_df$p_val_adj < 0.01 & abs(de_gsea_df$avg_log2FC) > 2,] 
overrep_gene_list <- rownames(overrep_df)

Next, we will set up our reference. By default, clusterProfiler allows us to use the msigdb reference. While that works, here we will show how you can download a mouse specific celltype signature reference geneset from msigdb and use that for your analysis. We will use the M8 geneset from the msigdb mouse collections. We clicked on the Gene Symbols link on the right to download the dataset and uploaded that to your workspace. These files are in a gmt (gene matrix transposed) format and can be read in using the read.gmt function from clusterProfiler. Once we have the reference data loaded, we will use the enricher function in the clusterProfiler library for the overrepresentation analysis. The inputs to the function include the DE gene list, the reference database, the statistical method for p-value adjustment, and finally a p-value cutoff threshold. This generates an overrepresentation R object that can be input into visualization functions like barplot() and dotplot() to make some typical pathway analysis figures. We can also use the webtool Enrichr for a quick analysis against multiple databases. For this part, we will save the genelist we’re using for the overrepresentation analysis to a TSV file.

#read in the tabula muris gmt file
msigdb_m8 <- read.gmt('/cloud/project/data/single_cell_rna/reference_files/m8.all.v2023.2.Mm.symbols.gmt')
#click on the dataframe in RStudio to see how it's formatted- we have 2 columns, the first with the genesets, and the other with the genes that are in each geneset.
#try to determine how many different pathways are in this database
overrep_msigdb_m8 <- enricher(gene = overrep_gene_list, TERM2GENE = msigdb_m8, pAdjustMethod = "BH", pvalueCutoff = 0.05)

#visualize data using the barplot and dotplot functions
barplot(overrep_msigdb_m8, showCategory = 10)
dotplot(overrep_msigdb_m8, showCategory = 10)

#save overrep_gene_list to a tsv file (overrep_gene_list is our list of genes and 
#file is the name we want the file to have when it's saved. 
#The remaining arguments are optional- row.names=FALSE stops R from adding numbers (effectively an S.No column), 
#col.names gives our single column TSV a column name, 
#and quote=FALSE ensures the genes don't have quotes around them which is the default way R saves string values to a TSV)
write.table(x = overrep_gene_list, file = 'outdir_single_cell_rna/epithelial_overrep_gene_list.tsv', row.names = FALSE, col.names = 'overrep_genes', quote=FALSE)

For the Enrichr webtool based analysis, we’ll open that TSV file in our RStudio session, copy the genes, and paste them directly into the textbox on the right. The webtool should load multiple barplots with different enriched pathways. Feel free to click around and explore here. To compare these results against the ones we generated in R, navigate to the Cell Types tab on the top and look for Tabula Muris.

An important component of a ‘good’ overrepresentation analysis is using one’s expertise about the biology in conjunction with the pathways identified to generate hypotheses. It is unlikely that every pathway in the plots above is meaningful; however, knowledge of bladder cancer (for this dataset) tells us that basal and luminal bladder cancers share similar expression profiles with basal and luminal breast cancers (reference). So, the overrepresentation analysis showing genesets like ‘Tabula Muris senis mammary gland basal cell ageing’ and ‘Tabula muris senis mammary gland luminal epithelial cell of mammary gland ageing’ could suggest that the difference between unsupervised clusters 9 and 12 comes from basal and luminal cells. To investigate this further, we can compile a list of basal and luminal markers from the literature, generate a combined score for those genes using Seurat’s AddModuleScore function, and determine if the clusters split up as basal and luminal. For now we’ll use the same markers defined in this dataset’s original manuscript.

#define lists of marker genes
basal_markers <- c('Cd44', 'Krt14', 'Krt5', 'Krt16', 'Krt6a')
luminal_markers <- c('Cd24a', 'Erbb2', 'Erbb3', 'Foxa1', 'Gata3', 'Gpx2', 'Krt18', 'Krt19', 'Krt7', 'Krt8', 'Upk1a')

#read in the seurat object if it isn't loaded in your R session
merged <- readRDS('outdir_single_cell_rna/preprocessed_object.rds')

#use AddModuleScore to calculate a single score that summarizes the gene expression for each list of markers
merged <- AddModuleScore(merged, features=list(basal_markers), name='basal_markers_score')
merged <- AddModuleScore(merged, features=list(luminal_markers), name='luminal_markers_score')

#visualize these scores using FeaturePlot and VlnPlots
FeaturePlot(merged, features=c('basal_markers_score1', 'luminal_markers_score1'))
VlnPlot(merged, features=c('basal_markers_score1', 'luminal_markers_score1'), group.by = 'seurat_clusters_res0.8', pt.size=0)

Interesting! This analysis could lead us to conclude that cluster 12 is composed of basal epithelial cells, while cluster 9 is composed of luminal epithelial cells. Next, let’s see if we can use GSEA to determine whether certain biological processes are distinct between these clusters.

GSEA Analysis

For GSEA, we need to start by creating a named vector where the values are the log fold change values and the names are the gene names. Recall that GSEA relies on identifying incremental gene expression changes (not just those that are statistically significant), so we will use our original unfiltered dataframe to get these values. This will be used as input to the gseGO function in the clusterProfiler library, which uses gene ontology for GSEA. The other parameters for the function include OrgDb = org.Mm.eg.db, the organism database from which all the pathways’ genesets will be determined; ont = "ALL", which specifies the subontologies, with possible options being BP (Biological Process), MF (Molecular Function), CC (Cellular Compartment), or ALL; keyType = "SYMBOL", which tells gseGO that the genes in our named vector are gene symbols as opposed to Entrez IDs or Ensembl IDs; and pAdjustMethod = "BH" and pvalueCutoff = 0.05, which specify the p-value adjustment method and the corresponding cutoff.

#read in the epithelial de df we generated previously
de_gsea_df <- read.csv('outdir_single_cell_rna/epithelial_de_gsea.tsv', sep = '\t')
#if you can't find the file in your session, we have uploaded a version for you
#de_gsea_df <- read.csv('/cloud/project/data/single_cell_rna/backup_files/epithelial_de_gsea.tsv', sep = '\t')

#get all the foldchange values from the original dataframe
gene_list <- de_gsea_df$avg_log2FC
#set names for this vector to gene names
names(gene_list) <- rownames(de_gsea_df)
#sort list in descending order of log2FC values as that is required by the gseGO function
gene_list = sort(gene_list, decreasing = TRUE)

#now we can run the gseGO function
gse <- gseGO(geneList=gene_list, 
             OrgDb = org.Mm.eg.db,
             ont ="ALL",              
             keyType = "SYMBOL", 
             pAdjustMethod = "BH",             
             pvalueCutoff = 0.05)
#explore the gse object by opening it in RStudio. 
#It basically has a record of all the parameters and inputs used for the function, 
#along with a results dataframe.
#we can pull this result dataframe out to view it in more detail
gse_result <- gse@result

Looking at the dataframe, you might notice that there are around 900 rows, and similar to the overrepresentation analysis, it is unlikely that all of them are truly meaningful. You can skim through all these results to determine which ones might be biologically meaningful, but in the interest of time, here we will subset the gse object to any pathways that have the word epithelial in them, and use those for plotting. In order to subset the object, we will determine the indices of the rows which have these values using R’s which and grepl functions, and then subset the results dataframe in the gse object to those rows. All of these genesets end up having negative enrichment scores (or are downregulated in our putative luminal cell cluster), so we will add one index that has a positive enrichment score to aid in plotting. For plotting, we will use the dotplot, cnetplot, and heatplot functions.

#start by grabbing the indices we'll need to subset the `gse` object
#epithelial indices from gse object using which and grepl
epithelial_indices <- which(grepl("epithelial", gse@result$Description))
#index for the most positive enrichment score using which.max()
positive_index <- which.max(gse@result$enrichmentScore)
#concatenate these indices to get the list of indices that will be used to subset the gsea object
subset_indices <- c(positive_index,epithelial_indices)

#now create a new gse_epithelial object and subset the gse object to these indices
gse_epithelial <- gse
gse_epithelial@result <- gse_epithelial@result[subset_indices,]

#plot!
#dotplot - splitting by 'sign' and facet_grid together allow us to separate activated and suppressed pathways
dotplot(gse_epithelial, showCategory=20, split=".sign") + facet_grid(.~.sign) 

#heatplot - allows us to see the genes that are being considered 
#for each of the pathways/genesets and their corresponding fold change
heatplot(gse_epithelial, foldChange=gene_list)

#cnetplot - allows us to see the genes along with the various 
#pathways/genesets and how they related to each other
cnetplot(gse_epithelial, foldChange=gene_list)

Based on these results, we could conclude that cluster 9 (putative luminal cells) has lower expression of quite a few pathways related to epithelial cell proliferation compared to cluster 12 (putative basal cells).

As an additional exercise, let’s try to do an overrepresentation and/or GSEA analysis for the DE analysis we did on CD8 T cells. We did not have you save the DE results file yesterday, so you can download the DE file from the link below first, and then use that for your analysis.

download.file(url = 'http://genomedata.org/cri-workshop/reference_files/cd8tcells_de_gsea.tsv',
              destfile = 'outdir_single_cell_rna/cd8tcells_de_gsea.tsv')

Hints: