Differentiation/Trajectory analysis

« Cancer cell identification Course TCR/BCR Repertoire Analysis »

Trajectory Analysis Using CytoTRACE

Single-cell sequencing gives us a snapshot of what a population of cells is doing. This means we should see many different cells in many different phases of a cell’s lifecycle. We use trajectory analysis to place our cells on a continuous ‘timeline’ based on expression data. The timeline does not have to mean that the cells are ordered from oldest to youngest (although many analysis uses trajectory to quantify developmental time). Generally, tools will create this timeline by finding paths through cellular space that minimize the transcriptional changes between neighboring cells. So for every cell, an algorithm asks the question: what cell or cells is/are most similar to the current cell we are looking at? Unlike clustering, which aims to group cells by what type they are, trajectory analysis aims to order the continuous changes associated with the cell processes.

The metric we use for assigning positions is called pseudotime. Pseudotime is an abstract unit of progress through a dynamic process. When we base our trajectory analysis on the transcriptomic profile of a cell, less mature cells are assigned smaller pseudotimes, and more mature cells are assigned larger pseudotimes.

Performing trajectory analysis on epithelial cells

Earlier we confirmed that our epithelial cell population corresponded to our tumor population. Then, through differential expression analysis, we saw that the epithelial cells form two distinct clusters that we identified as luminal and basal cells. We can further confirm this conclusion by using trajectory analysis to assign pseudotime values to the epithelial cells. We expect see that basal cells are less differentiated than luminal cells.

epcam Clusters

Subsetting epithelial cells

First, let’s load our libraries and our preprocessed object.

library(Seurat)
library(CytoTRACE)
library(monocle3)
library(ggplot2)


merged <- readRDS('outdir_single_cell_rna/preprocessed_object.rds')

Let’s see what clusters our epithelial cells are located in.

DimPlot(merged, group.by = 'immgen_singler_main', label = TRUE) + 
  DimPlot(merged, group.by = 'seurat_clusters_res0.8', label = TRUE) 

We can specifically highlight these cells for further clarity.

Epithelial_cells = merged$immgen_singler_main =="Epithelial cells"
highlighted_cells <- WhichCells(merged, expression = immgen_singler_main =="Epithelial cells")
DimPlot(merged, reduction = 'umap', group.by = 'orig.ident', cells.highlight = highlighted_cells)

We already know that these two clusters can be separated into basal and luminal cells. We can see the distinction between these two types using markers. For example lets plot basal cell markers:

FeaturePlot(object = merged, features = c("Cd44", "Krt14", "Krt5", "Krt16", "Krt6a"))

To more comprehensively see where the basal and luminal cells are, we can create a cell type score by averaging the expression of a group of genes and ploting the score.

cell_type_Basal_marker_gene_list <- list(c("Cd44", "Krt14", "Krt5", "Krt16", "Krt6a"))
merged <- AddModuleScore(object = merged, features = cell_type_Basal_marker_gene_list, name = "cell_type_Basal_score") 
FeaturePlot(object = merged, features = "cell_type_Basal_score1")

cell_type_Luminal_marker_gene_list <- list(c("Cd24a", "Erbb2", "Erbb3", "Foxa1", "Gata3", "Gpx2", "Krt18", "Krt19", "Krt7", "Krt8", "Upk1a"))
merged <- AddModuleScore(object = merged, features = cell_type_Luminal_marker_gene_list, name = "cell_type_Luminal_score")
FeaturePlot(object = merged, features = "cell_type_Luminal_score1")

For ease and clarity, let’s subset our Seurat object to just the epithelial cell clusters.

### Subsetting dataset epithelial
merged <- SetIdent(merged, value = 'seurat_clusters_res0.8')
merged_epithelial <- subset(merged, idents = c('9', '12')) # 1750

#confirm that we have subset the object as expected visually using a UMAP
DimPlot(merged, group.by = 'seurat_clusters_res0.8', label = TRUE) + 
  DimPlot(merged_epithelial, group.by = 'seurat_clusters_res0.8', label = TRUE)

#confirm that we have subset the object as expected by looking at the individual cell counts
table(merged$seurat_clusters_res0.8)
table(merged_epithelial$seurat_clusters_res0.8)

Run CytoTRACE

Now let’s run CytoTRACE. CytoTRACE (Cellular (Cyto) Trajectory Reconstruction Analysis using gene counts and Expression) is a computational method that predicts the differentiation state of cells from single-cell RNA-sequencing data. CytoTRACE uses gene count signatures (GCS), or the correlation between gene count and gene expression levels to capture differentiation states.

# create a dataframe of expression values to pass to Cytotrce
merged_epithelial_expression <- data.frame(GetAssayData(object = merged_epithelial, layer = "data"))

merged_epithelial_cytotrace_scores <- CytoTRACE(merged_epithelial_expression, ncores = 1)

We then have to get the Cytotrace data into the correct format.

merged_epithelial_cytotrace_transposed <- as.data.frame(merged_epithelial_cytotrace_scores$CytoTRACE) 
names(merged_epithelial_cytotrace_transposed) <- "cytotrace_scores"

head(merged_epithelial_cytotrace_transposed)

# fix the barcode formatting 
rownames(merged_epithelial_cytotrace_transposed) <- sub("\\.", "-", rownames(merged_epithelial_cytotrace_transposed))
rownames(merged_epithelial_cytotrace_transposed)

Add the data to the object

merged_epithelial <- AddMetaData(merged_epithelial,  merged_epithelial_cytotrace_transposed, 'cytotrace_scores')

merged_epithelial[['differentiation_scores']] <- 1 - merged_epithelial[['cytotrace_scores']] # Let's also reverse out CytoTRACE scores so that high means more differentiated and low means less differentiated

FeaturePlot(object = merged_epithelial, features = c("cell_type_Basal_score1", "cell_type_Luminal_score1", "cytotrace_scores", "differentiation_scores"))

Cytotrace scores of epitheial cells

Subsetting to just luminal cells

The most important part of trajectory analysis is to make sure you have some biological reasoning to back up the pseudotime values. The best practice for pseudotime means using it to support a biological pattern that has already been observed by some other method.

Let’s separate our luminal cells from the basal cells and perform trajectory analysis.

## Subset to just luminal cells
DimPlot(merged_epithelial) # cluster 10 is our luminal cells
merged_luminal <- subset(merged_epithelial, idents = c('9')) # 863 cells

We can also use the CytoTRACE R package to calculate our CytoTRACE scores.

# Creating a dataframe to pass to CytoTRACE
merged_luminal_expression <- data.frame(GetAssayData(object = merged_luminal, layer = "data"))

merged_luminal_cytotrace_scores <- CytoTRACE(merged_luminal_expression, ncores = 1)

merged_luminal_cytotrace_transposed <- as.data.frame(merged_luminal_cytotrace_scores$CytoTRACE) 
names(merged_luminal_cytotrace_transposed) <- "cytotrace_scores"

head(merged_luminal_cytotrace_transposed)

# fix the barcode formatting 
rownames(merged_luminal_cytotrace_transposed) <- sub("\\.", "-", rownames(merged_luminal_cytotrace_transposed))
rownames(merged_luminal_cytotrace_transposed)

We then can add those CytoTRACE scores to our luminal cell object and visualize them.

merged_luminal <- AddMetaData(merged_luminal, merged_luminal_cytotrace_transposed)

merged_luminal[['cytotrace_scores_luminal']] <- 1 - merged_luminal[['cytotrace_scores']]
merged_luminal[['differentiation_scores_luminal']] <- 1 - merged_luminal[['differentiation_scores']]

# compare all epithelial cells CytoTRACE scores to the luminal-only CytoTRACE
(FeaturePlot(object = merged_luminal, features = c("cytotrace_scores_luminal")) +
    ggtitle("Luminal Cells CytoTRACE Scores")) +
  (FeaturePlot(object = merged_luminal, features = c("differentiation_scores_luminal")) +
      ggtitle("Luminal Cells Differentiation Scores"))

Cytotrace scores of luminal cells

CytoTRACE will force all given cells onto the same scale meaning that there has to be cells at both the low and high ends of differentiation. The CytoTRACE scores could be a reflection of cell cycling genes. We can check by using a feature plot to compare the S-phase genes, G2/M-phase genes, and differentiation scores.

# View the S-phase genes, G2/M-phase genes, and the Phase to see if that explains the differentiation score
FeaturePlot(object = merged_luminal, features = c("S.Score", "G2M.Score")) + 
   DimPlot(merged_luminal, group.by = "Phase")
 
FeaturePlot(object = merged_luminal, features = c("S.Score", "G2M.Score", "differentiation_scores"))

We still don’t see a clear pattern, which illustrates the challenge and the danger of pseudotime.

Exercise: Create a subset of the Seurat object. You could explore the differences between the T-Cells populations, stem cells vs epithelial cells, or choose your own subset. Then run CytoTRACE on the subsetted dataset either using the webtool or the R package. Do the pseudotime scores make sense? Are there biological factors that support the CytoTRACE calculated trajectory?

Note: CytoTRACE will crash in the posit environment if you give it too many cells, so if there are several cell populations that you want to compare you can use the subset function to downsample your cell types. Make sure your Idents are set to the category you would like to subset too!

merged_subset <- subset(x = merged, downsample = 100)

Comparing CytoTRACE to Monocle3

Now that we have convinced ourselves that we somewhat trust the results of CytoTRACE, we can try the algorithms on the entire dataset and compare it to another trajectory analysis method. Another popular method used is Monocle3. Monocle3 is an analysis toolkit for scRNA and has many of the functions that Seurat has. We can use Monocle3’s trajectory algorithm but since it uses its own unique data structure, we will have to convert our subsetted object to a cell data set object. Luckily, there are tools that make that conversion relatively easy.

You can also refer to the full Monocle3 trajectory tutorial.

Before we start just running our data through the algorithm and seeing what we get, we should consider what we expect to get. Let’s remember what cell types we have in our dataset and where they are on our UMAP. What pseudotime scores do you expect to be assigned to the clusters?

DimPlot(merged, group.by = 'immgen_singler_main', label = TRUE)

Overview of haematopoiesis

Haematopoiesis and red blood cells

Running Monocle3

Let’s now run Monocle3, again we have to convert our Seurat object to a Monocle ‘Cell Data Set’. We will use a package made for this specific purpose.

merged_cds <- SeuratWrappers::as.cell_data_set(merged)

Then we run the Monocle function cluster_cells. This function will redo unsupervised clusters and calculate partitions which are groups of cells that Monocle3 puts on separate trajectories. We don’t need the cells to be clustered since we already did that in Seurat but Monocle requires that partitions be calculated for its trajectory functions.

merged_cds <- cluster_cells(merged_cds)

Use the Monolce3 plotting functions to visualize partitions. Then we will execute the function learn_graph which will build the trajectory. We will set the use_partition parameter to FALSE so that we learn a trajectory across all clusters. Later you can come back and try setting it to TRUE and see what happens!

plot_cells(merged_cds, show_trajectory_graph = FALSE, color_cells_by = "partition")

# Monocle will create a trajectory for each partition, but we want all our clusters
# to be on the same trajectory so we will set `use_partition` to FALSE when 
# we learn_graph

merged_cds <- learn_graph(merged_cds, use_partition = FALSE) # graph learned across all partitions

Monocle3 requires you to choose a starting point or root for the calculated trajectories. Running the function order_cells will open a pop-up window where you can interactively choose where you want your roots to be. Note that you have to allow pop-ups for this function because it opens a window to choose the roots in, your browser might have this disabled.

merged_cds <- order_cells(merged_cds)
# Pick a root or multiple roots

Plot the pseudotime:

plot_cells(merged_cds, color_cells_by = "pseudotime", label_branch_points = FALSE, label_leaves =  FALSE, cell_size = 1)

When we choose a root around where the stem cells are located we see that the fibroblast and epithelial cells end with the higher pseudotime scores.

Monocle Pseudotime with Stem Cells as root

But if we choose are roots to be stem cells, fibroblasts, and epithelial cells, we see that monocle changes the pseudotime orderings accordingly.

Monocle Pseudotime with multiple roots

Loading in the CytoTRACE scores

We need a significant amount of computational power to run CytoTRACE on all cells so we have run CytoTRACE on a computing cluster and have saved the results to be loaded in and added to our seurat object. Of course, we have to make sure the data is formatted correctly.

# Cytotrace for all cells

# read in the cytotrace scores
merged_cytotrace <- read.table("data/single_cell_rna/reference_files/merged_cytotrace_scores.tsv", sep="\t") 
head(merged_cytotrace)


merged_cytotrace_transposed <- t(merged_cytotrace)
colnames(merged_cytotrace_transposed) <- "CytoTRACE"
head(merged_cytotrace_transposed)

rownames(merged_cytotrace_transposed) <- sub("\\.", "-", rownames(merged_cytotrace_transposed))
rownames(merged_cytotrace_transposed)

Now we can add the scores to our seurat object and also create an inverse CytoTRACE score for clarity. So our differentiation score will be 0 for least differentiated (smallest pseudotime) and 1 being most differentiated (biggest pseudotime).

# Add CytoTRACE scores matching on the cell barcodes
merged <- AddMetaData(merged, merged_cytotrace_transposed)
merged[["differentiation_score"]] <- 1 - merged[["CytoTRACE"]]

Finally, let’s compare the pseudotime values to our cell types. Do we get the results we want to get? Does Monocle and CytoTRACE agree with each other? What happens if you choose different roots for the Monocle pseudotime?

plot_cells(merged_cds, color_cells_by = "pseudotime", label_branch_points = FALSE, label_leaves =  FALSE, cell_size = 1) + 
  (FeaturePlot(merged, features = 'differentiation_score')) +
DimPlot(merged, group.by = 'immgen_singler_main', label = TRUE)

Monocle pseudotime compared with CytoTRACE pseudtime

Exercise: Using Trajectory to Analyze Monocyte Differentiation

For a final exercise, we can apply the same steps as above to analyze another group of cells: macrophages and monocytes.

highlight = merged$immgen_singler_main =="Macrophages"
highlighted_cells <- WhichCells(merged, expression = immgen_singler_main =="Macrophages")
# Plot the UMAP
DimPlot(merged, reduction = 'umap', group.by = 'orig.ident', cells.highlight = highlighted_cells)


highlight = merged$immgen_singler_main =="Monocytes"
highlighted_cells <- WhichCells(merged, expression = immgen_singler_main =="Monocytes")
# Plot the UMAP
DimPlot(merged, reduction = 'umap', group.by = 'orig.ident', cells.highlight = highlighted_cells)


# grab all cells that are macrophages and monocytes
Idents(merged) <- "immgen_singler_main" 
merged_macro_mono_cells <- subset(merged, idents = c("Macrophages", "Monocytes"), invert = FALSE) # 1092

DimPlot(merged_macro_mono_cells, group.by = 'seurat_clusters_res0.8', label = TRUE) + 
  DimPlot(merged_macro_mono_cells, group.by = 'immgen_singler_main', label = TRUE) 

# grab all cells that are macrophages and monocytes, we can subset by clusters 6 and 14 which seem to contain 
Idents(merged) <- "seurat_clusters_res0.8" 
merged_macro_mono_cells <- subset(merged, idents = c(6, 14), invert = FALSE) # 1350

DimPlot(merged_macro_mono_cells, group.by = 'seurat_clusters_res0.8', label = TRUE) + 
  DimPlot(merged_macro_mono_cells, group.by = 'immgen_singler_main', label = TRUE) 

DimPlot(merged_macro_mono_cells, group.by = 'immgen_singler_fine', label = TRUE)

# create a data frame with the counts from our subsetted obect
merged_macro_mono_cells_expression <- data.frame(GetAssayData(merged_macro_mono_cells, layer = "data")) 

# pass that dataframe to the CytoTRACE function
merged_macro_mono_cells_cytotrace_scores <- CytoTRACE(merged_macro_mono_cells_expression, ncores = 4)

# Create a dataframe out of the CytoTRACE scores
merged_macro_mono_cells_cytotrace_scores_df <- as.data.frame(merged_macro_mono_cells_cytotrace_scores$CytoTRACE)

# Make the rownames of the cytotraace scores function the cell barcodes and rename the CytoTRACE scores column approproately
rownames(merged_macro_mono_cells_cytotrace_scores_df) <- sub("\\.", "-", rownames(merged_macro_mono_cells_cytotrace_scores_df))

Now we will incorporate our CytoTRACE scores into our Seurat object.

# Add CytoTRACE scores matching on the cell barcodes
merged_macro_mono_cells <- AddMetaData(merged_macro_mono_cells, merged_macro_mono_cells_cytotrace_scores_df)

CytoTRACE assignes the least differentiated cells a score of 1 and the most differentiated cells a score of 0, which is sometimes not inutive. So lets create a inverse CytoTRACE score which we will call our differentiation score.

merged_macro_mono_cells[['differentiation_scores']] <- 1 - merged_macro_mono_cells[['CytoTRACE']]

# Plot the results
DimPlot(merged_macro_mono_cells, group.by = 'seurat_clusters_res0.8', label = TRUE) + 
  DimPlot(merged_macro_mono_cells, group.by = 'immgen_singler_main', label = TRUE) +
  FeaturePlot(merged_macro_mono_cells, features = 'differentiation_scores')

Running Monocole

Let’s compare our CytoTRACE scores to Monocle3’s trajectory calculations.

cds <- SeuratWrappers::as.cell_data_set(merged_macro_mono_cells)

cds <- cluster_cells(cds)

# View our clusters
plot_cells(cds, show_trajectory_graph = FALSE, color_cells_by = "partition")

cds <- learn_graph(cds, use_partition = FALSE) # graph learned across all partitions

cds <- order_cells(cds) # choose your root(s)

plot_cells(cds, color_cells_by = "pseudotime", label_branch_points = FALSE, label_leaves =  FALSE, cell_size = 1)

plot_cells(cds, color_cells_by = "pseudotime", label_branch_points = FALSE, label_leaves =  FALSE, cell_size = 1) + 
  (FeaturePlot(merged_macro_mono_cells, features = 'CytoTRACE') + scale_color_viridis(option = 'magma', discrete = FALSE)) +
  (FeaturePlot(merged_macro_mono_cells, features = 'differentiation_score') + scale_color_viridis(option = 'magma', discrete = FALSE)) +
  DimPlot(merged_macro_mono_cells, group.by = 'immgen_singler_main', label = TRUE)

Macrophages differentiation

Selecting a root in Monocle3

Monocle3 proposes a method to choose a root more precisely in its trajecotry tutorial. It requires you to do the preprocessing steps specific to Monocle3.

merged_cds <- SeuratWrappers::as.cell_data_set(merged)

# Monocle preprocessing steps
merged_cds <- preprocess_cds(merged_cds, num_dim = 50, method = 'PCA')
# merged_cds <- align_cds(merged_cds, alignment_group = "orig.ident") # removes batch effect by fitting its a linear model to the cells
merged_cds <- reduce_dimension(merged_cds, preprocess_method = 'PCA', reduction_method="UMAP") # calculate UMAPs

merged_cds <- cluster_cells(merged_cds)

plot_cells(merged_cds, show_trajectory_graph = FALSE, color_cells_by = "immgen_singler_main")

# monocle will create a trajectory for each partition, but we want all our clusters
# to be on the same trajectory so we will set `use_partition` to FALSE when 
# we learn_graph

merged_cds <- learn_graph(merged_cds, use_partition = FALSE) # graph learned across all partitions

# merged_cds <- order_cells(merged_cds)
# Pick a root or multiple roots -- stem cell? or stem cell, fibroblasts, epithelial cells

# a helper function to identify the root principal points:
cell_ids <- which(colData(merged_cds)[, "seurat_clusters_res0.8"] == 5)

closest_vertex <-
  merged_cds@principal_graph_aux[["UMAP"]]$pr_graph_cell_proj_closest_vertex

closest_vertex <- as.matrix(closest_vertex[colnames(merged_cds), ])

root_pr_nodes <-
    igraph::V(principal_graph(merged_cds)[["UMAP"]])$name[as.numeric(names
                                                              (which.max(table(closest_vertex[cell_ids,]))))]
root_pr_nodes

merged_cds <- order_cells(merged_cds, root_pr_nodes=root_pr_nodes)

plot_cells(merged_cds, color_cells_by = "pseudotime", label_branch_points = FALSE, label_leaves =  FALSE, cell_size = 1)

Running Cytotrace Online

First, we have to export the counts.

merged_mtx <- GetAssayData(merged_epithelial, layer = "counts")
write.csv(merged_mtx, "outdir/merged_epithelial_counts.csv")

Then export the file from posit. In the file window select merged_epithelial_counts.csv. Then go to More -> Export… and click Download.

Now we go to https://cytotrace.stanford.edu/. We will navigate to the Run CytoTRACE tab on the left menu bar and upload our downloaded csv in the Upload gene expression table. We will not worry about uploading any other files as of now but if we had a larger dataset we could provide cell type and batch information for our cells.

When uploaded click Run CytoTRACE. This may take a few minutes.

Uploading to Cytotrace

We can then spend some time exploring CytoTRACE scores. For CytoTRACE, warmer colors mean less differentiation, and cooler colors mean more differentiated. We can use the Gene radio button to plot the expression of different marker genes for Basal and Luminal cells.

Plotting Krt5 compared to CytoTRACE scores

Plotting Cd24a compared to CytoTRACE scores

Once we have verified that the CytoTRACE scores are assigned in a way that corresponds with our biological knowledge we can click the Download CytoTRACE results button at the top left of the page.

Import CytoTRACE scores

We now want to add our CytoTRACE scores to our Seurat object. So we navigate to the Upload button and select our exported CytoTRACE scores from where the file was downloaded. We then read it into R and add the scores to our Seurat object as another column in the meta.data.

cytotrace_scores <- read.table("outdir/CytoTRACE_results.txt", sep="\t") # read in the cytotrace scores

rownames(cytotrace_scores) <- sub("\\.", "-", rownames(cytotrace_scores)) # the barcodes export with a `.` instead of a '-' at the end of the barcode so we have to remedy that before joining the cytotrace scores unto our seurat object

# Add CytoTRACE scores matching on the cell barcodes
merged_epithelial <- AddMetaData(merged_epithelial, cytotrace_scores %>% select("CytoTRACE"))

Now we can plot our basal cell markers, our luminal cell markers, and the CytoTRACE scores together to compare. Since it is a little unintuitive that less differentiated scores are closer to 1 we will also create a differentiation_score which will be an inverse of our CytoTRACE scores so that smaller scores mean less differentiated and larger scores mean more differentiated.

merged_epithelial[['differentiation_scores']] <- 1 - merged_epithelial[['CytoTRACE']] # Let's also reverse out CytoTRACE scores so that high means more differentiated and low means less differentiated

FeaturePlot(object = merged_epithelial, features = c("cell_type_Basal_score1", "cell_type_Luminal_score1", "cytotrace_scores", "differentiation_scores"))

Further Resources

R Tutorial

Python Tutorial

Sanbomics

Does Monocole Use Clusters to Calculate Pseudotime

Using_Monocle_For_Pseudotime_Trajectory

« Cancer cell identification Course TCR/BCR Repertoire Analysis »

Cancer cell identification

« Gene set enrichment and pathway analysis Course Differentiation/Trajectory analysis »

Often when we’re analyzing cancer samples using scRNAseq, we need to identify tumor cells from healthy cells. Usually, we know what kind of celltype we expect the tumor cells to be (epithelial cells for bladder cancer, B cells for a B cell lymphoma, etc.), but we may want to distinguish tumor cells from normal cells of the same celltype for various reasons such as DE analyses. Also, since we are drawing conclusions related to gene expression from these comparisons, we may want to identify tumor cells using methods that are orthogonal to the genes expressed by the tumor cells. To that end, two common methods used are identifying either mutations or copy number alterations in scRNAseq data.

In the case of mutations, one will typically carry out mutation calling from DNA sequencing data (ideally from tumor-normal paired samples) and then ‘look’ for those mutations in the scRNAseq data’s BAM files using a tool like VarTrix. Because of the sparsity of single cell data, and its end bias (5’ or 3’ depending on the kit), it is unlikely that we can get every tumor cell from this approach, however it is likely that we can be pretty confident in the tumor cells we do identify.

For copy number alterations (CNAs), there are various tools that try to detect CNAs in tumor cells like CONICSmat and InferCNV. They rely on a similar principle of using the counts matrix to identify regions of the genome that collectively have higher or lower expression and that if (say) 100+ genes in the same region have higher (or lower) expression it is likely that that is due to a copy number gain (or loss) as opposed to their being upregulated (or downregulated).

Finding tumor cells based on Copy number data

If you are working in a tumor sample where you expect to find CNAs, looking for copy number alterations in the scRNAseq data can be one way to identify tumor cells. In our case, whole genome sequencing was done on the cell line used for the mouse models, so we have some confidently determined CNAs we expect to find in the scRNAseq data-

CNV_WGS_scatterplot

As we can see above, we expect to find gains in chromosome 2 and 11 and a loss in chromosome 12.

We will use CONICSmat to identify cells with CNAs in our scRNAseq data. While there is more information available on their GitHub tutorial and paper, briefly, CONICSmat fits a two-component Gaussian mixture model for each chromosomal region, and uses a Bayesian Information Criterion (BIC) statistical test to ask if a 1-component model (all cells are the same and there’s no CNA) fits better than a 2-component model (some cells have altered copy number compared to others). Here, a better fit is defined by a lower BIC score. Note that since we are only using the gene by cell counts matrix, average expression of genes in that chromosomal region is used as a proxy for a CNA. The key to most single-cell CNA based tools is that we need both tumor cells and non-tumor cells in our analysis as a copy number gain or loss in the tumor cells can only be measured relative to healthy cells. While CONICSmat can be run on all cells in the sample together, for our purposes, we will subset the object to Epithelial cells and B cells so that it can run more efficiently. CONICSmat is a somewhat computationally intensive tool, so we will start by clearing our workspace using the broom icon on the top right pane, and also click on the drop-down menu with a piechart beside it and select Free unused memory.

#make sure you have cleared your workspace and 'freed unused space' otherwise you may run into issues later on!

#start by loading in the libraries and the preprocessed Seurat object
library("Seurat") 
library("ggplot2")
library("cowplot")
library("dplyr") 
library("Matrix")
library("hdf5r")
library("CONICSmat")
library("showtext")

merged <- readRDS('outdir_single_cell_rna/preprocessed_object.rds')

#subset Seurat object to only include epithelial cells and B cells
merged <- SetIdent(merged, value = 'immgen_singler_main')
merged_subset <- subset(merged, idents=c('B cells', 'Epithelial cells'))

#as always double check that the number of cells match our expectations
table(merged$immgen_singler_main)
table(merged_subset$immgen_singler_main)

Now we can look into running CONICSmat! It is run in a few steps and requires us to look at some outputs along the way and define some thresholds. Tha main inputs CONICSmat needs are a Seurat counts matrix, a regions file specifying genomic regions, and a gene_pos file that has the genomic coordinates of all the genes. After we run the plotAll function, we will get a PDF file with results from the BIC test for every chromosomal region. We will look through this file to determine an appropriate BIC difference threshold that will help capture true CNA events. The BIC difference is (BIC 1 component score) - (BIC 2 component score), and a greater difference suggests a higher chance of a ‘real’ CNA event. Once we have a confident set of CNAs, we will use that information to cluster the cells in a heatmap wherein the heatmap is colored by z-scores calculated from the normalized average expression values.

#Normalize Seurat counts matrix using CONICsmat normMat function
conicsmat_expr <- CONICSmat::normMat(as.matrix(Seurat::GetAssayData(merged_subset, assay = 'RNA', layer = 'counts')))

#Get chromosomal positions of genes in the expression matrix
gene_pos=getGenePositions(rownames(conicsmat_expr), ensembl_version = "https://oct2022.archive.ensembl.org/", species = "mouse")

#Get coordinates for each region of interest (in our case we'll just use chromosome coordinates)
regions=read.table("/cloud/project/data/single_cell_rna/reference_files/chromosome_full_positions_mm10.txt",sep="\t",header = T, row.names = 1)


#Filter out uninformative genes aka genes that are expressed in less than 5 cells (that was the default given by CONICSmat)
conicsmat_expr=filterMatrix(conicsmat_expr,gene_pos[,"mgi_symbol"],minCells=5)

#Calculate normalization factor for each cell- this centers the gene expression in each cell around the mean. 
#(Need this because the more genes that are expressed in a cell, the less reads are 'available' per gene)
normFactor=calcNormFactors(conicsmat_expr)

#Fit the 2 component Gaussian mixture model for each region to determine if the region has a CNA. 
## This step outputs a PDF with a page for each region in the regions file that we can look through to determine which regions are likely to have a CNA.
## It also outputs a BIC_LR.txt file that summarizes the BIC scores and adj p-values for each region
l=plotAll(conicsmat_expr,normFactor,regions,gene_pos,"outdir_single_cell_rna/conic_plotall")

Let’s look through the conic_plotall_CNVs.pdf file. Looking through the results, we can see clear bimodal distributions for chromosomes 2, 11, and 12, and we see a corresponding lower BIC score for the 2 component model in these cases. However, we also see a low BIC score for cases like chromosome 5 even though we don’t see a bimodal distribution, and this may contribute to noisy results when we apply a BIC threshold. Looking at these barplots, a BIC threshold of around 800 might help filter out the noise, so let’s see the results with the default threshold (200) and 800.

#the BIC difference threshold will be applied using the text file that generated by the previous step. 
#Let's read that in and look at it in the RStudio window
lrbic <- read.table("outdir_single_cell_rna/conic_plotall_BIC_LR.txt",sep="\t",header=T,row.names=1,check.names=F)
lrbic

#filter candidate regions using default threshold 200
candRegions_200 <- rownames(lrbic)[which(lrbic[,"BIC difference"]>200 & lrbic[,"LRT adj. p-val"]<0.01)]

#plot a histogram and heatmap where we try to split it up into 2 cluster (ideally tumor and non tumor cells)
plotHistogram(l[,candRegions_200],conicsmat_expr,clusters=2,zscoreThreshold=4)
#note that this command outputs both a plot, and some barcodes with numbers. 
#The barcode with numbers are basically the heatmap cluster ID that each barcode is assigned to. But we can ignore that for now.
#did that split up the cells with the gain/losses well? Remember that based on the WGS data, we're expecting gains in chr2 and chr11, and a loss in chr12
#If not, try it with a different numbers of clusters (4, 8, 12 etc)

#Now filter candidates using threshold 800
candRegions_800 <- rownames(lrbic)[which(lrbic[,"BIC difference"]>800 & lrbic[,"LRT adj. p-val"]<0.01)]

#plot a histogram and heatmap where we try to split it up into 2 cluster (ideally tumor and non tumor cells)
plotHistogram(l[,candRegions_800],conicsmat_expr,clusters=2,zscoreThreshold=4)
dev.off() #(after viewing)
#how does that look? Maybe try increasing the clusters a little to get it to split up nicely.

#Once you have a cluster split you like, save the barcode-cluster ID list to a variable for use later. 
#Also save the plot as a PDF. (Note if the save doesn't work, 
#you may need to run dev.off() a few times until you get a message saying null device 1)
pdf("outdir_single_cell_rna/conic_plot_histogram_cand_regions_3clusters.pdf", width=5, height=5)
hi <- plotHistogram(l[,candRegions_800],conicsmat_expr,clusters=3,zscoreThreshold=4)
dev.off()
hi
#Also note the cluster with putative malignant cells from the heatmap (note that the cluster IDs are not always in order, refer to the text in grey on the right to identify the clusters)

#now convert the hi named vector to a dataframe
hi_df <- data.frame(hi) 
#look at hi_df in RStudio

#add a new column for tumor cell status to the barcodes based on the cluster you've determined to be the tumor cells
hi_df$tumor_cell_classification <- ifelse(hi_df$hi=='2', 'cnv-tumor cell', 'cnv-not tumor cell')
#look at hi_df dataframe in RStudio again

#double-check our classifications worked as expected
table(hi_df$hi, hi_df$tumor_cell_classification)

#let's also see where all the cells cluster on the UMAP. 
#for this we can use the DimPlot function and provide it with a list of cell barcodes that we want to color separately
cnv_tumor_bc <- rownames(hi_df[hi_df$tumor_cell_classification == 'cnv-tumor cell',])
DimPlot(merged, cells.highlight = rownames(hi_df[hi_df$tumor_cell_classification == 'cnv-tumor cell',])) + #breakdown the argument given to cells.highlight
  DimPlot(merged, group.by = 'immgen_singler_main') 

#save this list of barcodes to their own TSV file for later
write.table(x = cnv_tumor_bc, file = 'outdir_single_cell_rna/cnv_tumor_bc_list.tsv', row.names = FALSE, col.names = 'cnv_tumor_barcodes', quote=FALSE)

Conicsmat tumor cells dimplot

Looks like most of our tumor cells based on the CNV classification are in the epithelial cells cluster as we’d expect! A key point about CONICSmat here is that, the tool itself doesn’t determine a tumor cell from a non-tumor cell. In this case we knew from the whole genome data that the tumor cells should have gains in chromosomes 2 and 11 and a loss in chromosome 12. But if we did not have that information, we can overlay our celltypes on the heatmap and determine CNAs as the CNAs should primarily arise in the tumor’s celltype.

# using the celltypes argument we can add celltypes to the Seurat object. 
plotHistogram(l[,candRegions_800],conicsmat_expr,clusters=3,zscoreThreshold=4, celltypes = merged_subset$immgen_singler_main)
#note that we got the cell barcodes from merged_subset as that 
#was the object used to generate the initial conicsmat_expr matrix.
dev.off() #(after viewing)

Conicsmat histogram with celltypes

We will look into adding this information to the Seurat object after we have the SNV tumor calls as well.

Finding tumor cells based on mutation data

We will use VarTrix to identify cells with mutations in our scRNAseq data. However, VarTrix is a tool that is run at the commandline level, so we will not run it in this workshop, instead we will go over the command to run it and how the outputs were processed here, and then plot the resulting mutation calls in RStudio. As input, VarTrix takes the scRNAseq BAM and barcodes.tsv files output from CellRanger, along with a VCF file containing the variants, and a reference genome fasta file. It can be run in 3 --scoring-method modes, here we chose coverage. This mode generates 2 output matrices- the ref-matrix and out-matrix that are in a MatrixMarket format. The former summarizes the number of REF reads (reads with wild-type allele) observed for every cell barcode and variant, while the latter summarizes the same for ALT reads (reads with mutant allele). Even though we will not be running it in this workshop, the vartrix command below was run for each replicate separately.

vartrix --vcf [input VCF file (based on cell line, is common for all replicates)] \
 --bam [cellranger bam file (unique to each replicate)] \
 --fasta [mm10 mouse reference genome] \
 --cell-barcodes [barcodes.tsv (unique to each replicate)] \
 --out-matrix [name of output ALT reads matrix] \
 --ref-matrix [name of output REF reads matrix] \
 --out-variants [name of output summarizing variants] -s coverage

The first few lines of one of the output matrices are shown below (both matrices are formatted the same way), where the first row summarizes the data so that the first value is the number of variants, second value is the number of cells, and the 3rd number is the number of rows in this matrix. The data itself starts from the 4th row, where the first column identifies the variant number (or ID), the second column identifies the cell barcode, and the third column indicates the number of reads (REF or ALT depending on the matrix) for that variant-

%%MatrixMarket matrix coordinate real general
% written by sprs
4016 287647
359 0
890 0
1018 0
1068 0
1964 0
2048 0
... 

The processing of these matrices to identify variant containing cells depends on the data. In the manuscript, the mutation calls were noisy, so the authors required a cell to have at least 2 variants with greater than 20X total coverage, with at least 5 ALT reads, and 10% VAF (variant allele fraction). However, if one has a confident set of tumor specific variants, the criteria to classify tumor specific cells can be less stringent. Here we will try two different approaches to filtering variants- first we will use the same list of variants the authors did and gradually increase the number of ALT reads we require to classify a cell as a tumor cell. Second, we will filter the original set of variants to a ‘higher confidence’ set of variants and classify tumor cells based on them. This section is also computationally intensive, so we will clear our workspace using the broom icon on the top right pane, and also click on the drop-down menu with a piechart beside it and select Free unused memory.

#Load libraries and packages if they are not loaded already
library(Seurat)
library(dplyr)
library(Matrix)
library(ggplot2)

#read in seurat object
merged <- readRDS(file='data/single_cell_rna/backup_files/preprocessed_object.rds')

#read in the variants file
variants_file <- read.csv('/cloud/project/data/single_cell_rna/cancer_cell_id/mcb6c-exome-somatic.variants.annotated.clean.filtered_10K.tsv', sep='\t')

#read in a sample's output VarTrix matrix and barcodes 
##to read a sparse VarTrix matrix into R for processing, we'll use the readMM function, convert that to a matrix, and then convert the matrix to a dataframe.
rep1_icb_df <- as.data.frame(as.matrix(readMM('/cloud/project/data/single_cell_rna/cancer_cell_id/Rep1_ICB_alt_counts.mtx')))
##read in sample barcodes file
rep1_icb_bc <- read.table('/cloud/project/data/single_cell_rna/cancer_cell_id/Rep1_ICB_barcodes.tsv', sep = '\t', header = FALSE, col.names = 'barcodes')

Let’s spend some time looking at the variants_file, rep1_icb_bc and rep1_icb_df files. The variants_file was an output of our variant calling pipeline that uses 4 variant callers and includes some general variants filtering along with information like allele depth, allele fraction, etc. about each variant. The rep1_icb_bc file is a simple file containing all the barcodes in that sample. Lastly, the rep1_icb_df file is the matrix from earlier processed into a dataframe. You might be wondering how you are supposed to know how the matrix should be processed- usually most tools have (or at least should have) vignettes or brief tutorials available on how to process their data. Luckily for us, VarTrix does have all this documented in their GitHub documentation. So, let’s go through their website and see how they want us to process their data. Looks like after we read in the matrix, they want us to assign the barcodes as column names, and variants as row names. And after that we can look into adding that to our Seurat object. We have the barcodes in the rep1_icb_bc file, but what about the variants? We have the variant_file but it has chromosome and position separately, so we will make a column where they are combined together, and then add them to the matrix file.

#make a concatenated column of chromosome and variants
variants_file$CHROM_POS <- paste0(variants_file$CHROM, '_', variants_file$POS)

#assign barcodes to columns and variants to rows
colnames(rep1_icb_df) <- rep1_icb_bc$barcodes
row.names(rep1_icb_df) <- variants_file$CHROM_POS

#Hmm we got an error complaining about duplicate row names... that makes sense, since dataframes use the rownames to index a row, 
#2 rows having the same index is a problem because it'll be an ambiguous index value. 
#But why do we have 2 occurrences of the same variant and does it happen multiple times? Luckily it told us one of the values where it ran into this error `chr9_53503887`. 
#So let's look for it in the variant file. Interesting, we have 2 variants coming from the same position, it's difficult to say what's happening from the variant information alone, so we'll check what might be happening on IGV. 
#For now, at least we know where the error is coming from. Let's check if there are any other similar occurrences.

#use the data.frame() and table() functions to get a dataframe summarizing the number of times each CHROM_POS value occurs
troubleshoot_df <- data.frame(table(variants_file$CHROM_POS))
#sort or order the dataframe in descending order of occurrences for ease of viewing
troubleshoot_df <- troubleshoot_df[order(troubleshoot_df$Freq, decreasing = TRUE),]
#look at the top 5 rows
head(troubleshoot_df)
#looks like it only happens once!

#We can get around this error by using the make.unique() function in R. 
#This function goes through a column and for any values that occur more than once, it appends `_` and a number to the value.
variants_file$CHROM_POS <- make.unique(as.character(variants_file$CHROM_POS), sep = "_")
#navigate through the dataframe and see what changed.

#Okay now let's try  assign the variants to a row again
row.names(rep1_icb_df) <- variants_file$CHROM_POS
#Success! 

While it’s difficult to say what exactly happened in the case of the variant at chr9_53503887 without loading this sample’s BAM file into IGV, based on the reference genome, one possibility is that due to the variant being in a soft-masked/ repetitive region, the alignment algorithms were not able to align the reads well and so we have some instances where it was called as 4 base deletion, and others where it was called as a 2 base deletion.

chr9_53503887_anomaly

Let’s look at this dataframe more carefully now. So we have a dataframe with cell barcodes as columns and variants as rows. And each cell or value in this dataframe is the number of ALT reads (aka reads containing a mutation). Our overall objective is to find tumor cells, and we want to do that by establishing some sort of a threshold of number of ALT reads needed for a cell to be a tumor cell. Now we can do that for every sample separately or make a superset dataframe and do all the filtering together. The second approach might be better because eventually our goal is to combine cells from all the samples with the Seurat object. In order to make a superset dataframe, first we need to do the above formatting for each vartrix output. Then, we can combine the results horizontally because we need to add the cell barcodes. The key to combining dataframes horizontally is that the rownames (or index) need to be the same, and that is the case here because we ran Vartrix for all samples using the same set of variants.

#first let's make a vector of samples
sample_names <- c('Rep1_ICB', 'Rep1_ICBdT', 'Rep3_ICB', 'Rep3_ICBdT', 'Rep5_ICB', 'Rep5_ICBdT')

# Initialize an empty dataframe with the rows predefined as variant_names
all_samples_vartrix_df <- data.frame(matrix(nrow = length(variants_file$CHROM_POS), ncol = 0))
rownames(all_samples_vartrix_df) <- variants_file$CHROM_POS

#Now loop through each sample in sample_names and format them using the code from before
for (sample_rep in sample_names) {
  #read in and processing a sparse matrix
  rep_df <- data.frame(as.matrix(readMM(paste0('/cloud/project/data/single_cell_rna/cancer_cell_id/', sample_rep, '_alt_counts.mtx'))))
  
  #get each sample's barcodes
  rep_bc <- read.table(paste0('/cloud/project/data/single_cell_rna/cancer_cell_id/', sample_rep, '_barcodes.tsv'), 
    sep = '\t', header = FALSE, col.names = 'barcodes')
  
  #set column names and row names
  colnames(rep_df) <- rep_bc$barcodes
  row.names(rep_df) <- variants_file$CHROM_POS
  
  #combine dataframes
  all_samples_vartrix_df <- cbind(all_samples_vartrix_df, rep_df)
}

#let's double check that the number of columns make sense. 
#the number of cell barcodes in each of our replicates should be the same as the number of columns in all_samples_vartrix_df 
#number of columns in dataframe
ncol(all_samples_vartrix_df) #28585
#number of rows in each sample's dataframe # 4016+3927+6390+5576+2828+5848
#have a variable that we'll keep adding the number of rows to
tracking_sum = 0
#loop through each sample and count the number of rows in that dataframe. 
#Add that value to the tracking_sum value, and also print the row every time. 
for (sample_rep in sample_names) {
  rep_bc <- read.table(paste0('/cloud/project/data/single_cell_rna/cancer_cell_id/', sample_rep, '_barcodes.tsv'), sep = '\t', header = FALSE, col.names = 'barcodes')
  print(nrow(rep_bc))
  tracking_sum <- tracking_sum + nrow(rep_bc)
}

#print the tracking sum and compare to the value we got from ncol
print(tracking_sum)

Now we have all the cells in one dataframe, with information about the number of ALT reads for each variant. Next, we can get to classifying some tumor cells. For this, we’ll first want to flip (or transpose) the matrix so that the cell barcodes are on the rows and variants are on the columns. The reason for this will become clearer soon, but basically we need to know the number of total ALT reads across all variants in each cell, and the way dataframe operations work, it is much easier to do that by adding a column and take the sum off all the values in the row. To transpose the dataframe, we can use a handy R function called t(). And then we can take the sum of values in the dataframe using rowSums(). Once we have that, we can start subsetting the dataframe and making UMAP plots that highlight cells depending on how many ALT reads we want to require for a cell to be considered a tumor cell.

#transpose the dataframe
all_samples_vartrix_df <- as.data.frame(t(all_samples_vartrix_df))
#The t() function technically outputs a matrix and that needs to be converted to a dataframe

#this is a large matrix and ends up being computationally heavy so run the following command 
gc() #this is the same as 'free unused R memory'

#now let's add a column that is the sum of the entire row so that we can see the number of ALT reads per cell
all_samples_vartrix_df$num_alt_reads <- rowSums(all_samples_vartrix_df)

#Now let's highlight cells in Seurat with at least 1 ALT read
## first get a dataframe where everything has at least 1 ALT read using that column we added
all_samples_vartrix_df_1_ALT_read <- all_samples_vartrix_df[all_samples_vartrix_df$num_alt_reads >= 1,]
DimPlot(merged, cells.highlight = rownames(all_samples_vartrix_df_1_ALT_read)) + 
  ggtitle('UHighlighting cells with \nat least 1 ALT read')
#that's way too many 'tumor' cells especially since we know a lot of the cells are not even epithelial cells.

#Let's try this again but increase our requirement to 10 ALT reads. 
#Also, we don't need to make a new dataframe every time, we can load the cells.highlight directly from the all_sample_vartrix_df table
DimPlot(merged, cells.highlight = rownames(all_samples_vartrix_df[all_samples_vartrix_df$num_alt_reads >= 10,])) + 
  ggtitle('Highlighting cells with \nat least 10 ALT reads')

#That seems to have helped, but we're still getting a lot more cells than we'd expect 
#considering that the epithelial cells are largely in clusters 9 and 12
#let's increase the number of ALT reads up to 20
DimPlot(merged, cells.highlight = rownames(all_samples_vartrix_df[all_samples_vartrix_df$num_alt_reads >= 20,])) + 
  ggtitle('Highlighting cells with \nat least 20 ALT reads')

Vartrix unfiltered vcf tumor cell

In spite of being fairly strict with how we are defining a tumor cell by requiring at least 20 reads carrying a mutation for a cell to be a tumor cell, we are still getting some pretty noisy calls as we are identifying many cells that are unlikely to be epithelial cells as tumor cells. This could suggest that our original variant list was not filtered enough and still has germline variants. We can filter this variant file further by removing variants that are likely to be germline variants using the following filtering strategy.

# Before we move forward let's delete some files and free up some space
rm(rep1_icb_bc, rep1_icb_df, troubleshoot_df, all_samples_vartrix_df_1_ALT_read)
gc()

# Filter variants file to high confidence variants
variants_file_high_conf <- variants_file[variants_file$set == 'mutect-varscan-strelka' & 
                           variants_file$NORMAL.DP > 50 & 
                           variants_file$TUMOR.DP > 50 &
                           variants_file$NORMAL.AF < 0.01 & 
                           variants_file$TUMOR.AF > 0.25 &
                           variants_file$Consequence == 'missense_variant', ]
 

What are these filters doing?

The set column only keeps variants that were called by all 3 mutation callers (mutect, varscan, and strelka).
(NORMAL.DP) and (TUMOR.DP) require that there at least 50 reads of support of the variant position in both the tumor and normal samples.
NORMAL.AF filter requires that if a variant does occur in the normal sample, it is at a VAF of less than 1%.
TUMOR.AF filter requires that a variant occurs in the tumor sample at at least 25% VAF.
CONSEQUENCE dictates that we only look at missense mutations.

Let’s use this high confidence variants file, restrict our superset variants matrix to this list of variants. And see if this works better to classify tumor cells.

#get a list of high confidence variants
high_conf_variants <- variants_file_high_conf$CHROM_POS

#the list of variants corresponds to the columns of our dataframe, 
#so we can use the list of high_conf_variants as a list of columns to keep in our dataframe
all_samples_vartrix_df_high_conf <- all_samples_vartrix_df[high_conf_variants]

#now let's recalculate num_alt_reads based on these ~2K high confidence variants
all_samples_vartrix_df_high_conf$num_alt_reads <- rowSums(all_samples_vartrix_df_high_conf)

#let's plot potential tumor cells based on the high confidence variants using the same thresholds as before. 
#At least 1 ALT read for a tumor cell
DimPlot(merged, cells.highlight = rownames(all_samples_vartrix_df_high_conf[all_samples_vartrix_df_high_conf$num_alt_reads >= 1,])) + 
  ggtitle('Highlighting cells with \nat least 1 ALT read (filtered VCF)')
#At least 10 ALT reads for a tumor cell
DimPlot(merged, cells.highlight = rownames(all_samples_vartrix_df_high_conf[all_samples_vartrix_df_high_conf$num_alt_reads >= 10,])) + 
  ggtitle('Highlighting cells with \nat least 10 ALT reads (filtered VCF)')
#At least 20 ALT reads for a tumor cell
DimPlot(merged, cells.highlight = rownames(all_samples_vartrix_df_high_conf[all_samples_vartrix_df_high_conf$num_alt_reads >= 20,])) + 
  ggtitle('Highlighting cells with \nat least 20 ALT reads (filtered VCF)')

#How does this compare against our previous plot?
tumor_cells_10Kvars_dimplot <- DimPlot(merged, cells.highlight = rownames(all_samples_vartrix_df[all_samples_vartrix_df$num_alt_reads >= 20,])) + 
  ggtitle('Tumor cells with at least 20 ALT reads \n- 10K variants')

tumor_cells_2Kvars_dimplot <- DimPlot(merged, cells.highlight = rownames(all_samples_vartrix_df_high_conf[all_samples_vartrix_df_high_conf$num_alt_reads >= 20,])) + 
  ggtitle('Tumor cells with at least 20 ALT reads \n- 2K variants')

## Plot them side by side
tumor_cells_10Kvars_dimplot + tumor_cells_2Kvars_dimplot


#That looks way better! Based on this, I think we can be pretty happy about finding tumor cells from the high confidence variants 
all_samples_vartrix_df_high_conf_20_ALT_read <- all_samples_vartrix_df_high_conf[all_samples_vartrix_df_high_conf$num_alt_reads >= 20,]
snv_tumor_bc <- rownames(all_samples_vartrix_df_high_conf_20_ALT_read)
write.table(x = snv_tumor_bc, file = 'outdir_single_cell_rna/snv_tumor_bc_list.tsv', row.names = FALSE, col.names = 'snv_tumor_barcodes', quote=FALSE)

Vartrix filtered vcf tumor cell

Lastly, let’s look into adding the SNV and CNV based tumor cell classifications to our Seurat object’s metadata. To add metadata to a Seurat object, we need a dataframe where we have all the barcodes in the Seurat object as rownames, and column(s) that we want to add to the metadata with value(s) corresponding to each barcode. Recall that our tumor cell classification lists for both SNVs and CNVs include only the tumor cell barcodes. So we will need a list of all barcodes in the Seurat object.

#read in our CNV and SNV tumor cell classification files
snv_bc_df <- read.csv('outdir_single_cell_rna/snv_tumor_bc_list.tsv', sep='\t')
cnv_bc_df <- read.csv('outdir_single_cell_rna/cnv_tumor_bc_list.tsv', sep='\t')

# If R crashed before you got to these steps, we have backip
# snv_bc_df <- read.csv('data/single_cell_rna/backup_files/snv_tumor_bc_list.tsv', sep='\t')
# cnv_bc_df <- read.csv('data/single_cell_rna/backup_files/cnv_tumor_bc_list.tsv', sep='\t')

#get lists of these barcodes
snv_bc_list <- snv_bc_df$snv_tumor_barcodes
cnv_bc_list <- cnv_bc_df$cnv_tumor_barcodes

#read in our Seurat object (if it's not already in your environment)
merged <- readRDS('outdir_single_cell_rna/preprocessed_object.rds')

#get all barcodes from our Seurat object
all_bc <- rownames(merged@meta.data)
#convert them into a dataframe
all_bc_df <- data.frame(all_bc)
#add 2 columns that will hold the SNV tumor cell classifications (snv_class) and #CNV tumor cell classifications (cnv_class) prefilled with the value 'unclassified' 
all_bc_df$snv_class <- 'unclassified'
all_bc_df$cnv_class <- 'unclassified'

#for barcodes in each of our SNV and CNV lists, 
#we'll change the value in the corresponding columns to 'snv_tumor' and 'cnv_tumor'
all_bc_df$snv_class[all_bc_df$all_bc %in% snv_bc_list] <- 'snv_tumor'
all_bc_df$cnv_class[all_bc_df$all_bc %in% cnv_bc_list] <- 'cnv_tumor'

#now let's add these classifications to Seurat metadata
#first we need to make the barcodes column as the rownames or index and delete the extra column. 
rownames(all_bc_df) <- all_bc_df$all_bc
#open the all_bc_df and see what happened. We don't need the extra barcodes column, so let's get rid of that. 
all_bc_df$all_bc <- NULL
#finally, let's add this to the Seurat metadata using Seurat's AddMetaData function
merged <- AddMetaData(merged, all_bc_df)
#see what columns were added to the metadata column
colnames(merged@meta.data)

#Now let's plot the CNV and SNV classifications on our UMAP
DimPlot(merged, group.by = c('snv_class', 'cnv_class', 'immgen_singler_main'))

#let's also take a quick look at how much the snv and cnv classifications agree with each other
table(merged$cnv_class, merged$snv_class)

CNV SNV tumor cells

> table(merged$cnv_class, merged$snv_class)
               snv_tumor unclassified
  cnv_tumor          822           12
  unclassified       495        21856

Voila! Looks like both of our independent classifications of tumor cells agree to quite a large extent. They also both independently added to our tumor cell identification, and were restricted to the epithelial cells.