Welcome to the blog

Posts

My thoughts and ideas

Team Assignment - Alignment | Griffith Lab

RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

Team Assignment - Alignment

The goal of the following team assignment is for students to gain hands-on experience by working on recently published RNA-seq data and apply the concepts they have learned up to RNA alignment. To complete this assignment, students will need to review commands we performed in earlier sections.

Background on Dataset used

In this assignment, we will be using subsets of the GSE136366 dataset (Roczniak-Ferguson A, Ferguson SM. Pleiotropic requirements for human TDP-43 in the regulation of cell and organelle homeostasis. Life Sci Alliance 2019 Oct;2(5). PMID: 31527135). This dataset consists of 6 RNA sequencing files of human cells that either express or lack the TDP-43 protein.

Experimental Details

  • The libraries are prepared as paired end.
  • The samples are sequenced on an Illumina HiSeq 2500.
  • Each read is 63 bp long
  • The data are RF/fr-firststrand stranded (dUTP)
  • The source dataset is located here: GSE136366
  • 3 samples are from TDP-43 Knockout HeLa cells and 3 samples wherein a wildtype TDP-43 transgene was re-expressed.
  • For this exercise we will be using different subsets of the reads:
    • Team A: chr11
    • Team B: chr12
    • Team C: chr17
    • Team D: chr19
    • Team E: chr6
  • The files are named based on their SRR id’s, and obey the following key:
    • SRR10045016 = KO sample 1
    • SRR10045017 = KO sample 2
    • SRR10045018 = KO sample 3
    • SRR10045019 = Rescue sample 1
    • SRR10045020 = Rescue sample 2
    • SRR10045021 = Rescue sample 3

Obtaining the dataset & reference files

Goals:

  • Obtain the files necessary for data processing
  • Review reference and annotation file formats
  • Review sequence FASTQ format

As mentioned previously, we have subsetted the 6 RNA-seq samples into 5 different chromosome regions. Each team can download their corresponding dataset using the following commands.

cd $RNA_HOME/
mkdir -p team_exercise/untrimmed
cd team_exercise/untrimmed

### TEAM A
wget -c http://genomedata.org/seq-tec-workshop/read_data/rna_alignment-de_exercise/dataset_A/dataset.tar.gz
tar -xzvf dataset.tar.gz

### TEAM B
wget -c http://genomedata.org/seq-tec-workshop/read_data/rna_alignment-de_exercise/dataset_B/dataset.tar.gz
tar -xzvf dataset.tar.gz

### TEAM C
wget -c http://genomedata.org/seq-tec-workshop/read_data/rna_alignment-de_exercise/dataset_C/dataset.tar.gz
tar -xzvf dataset.tar.gz

### TEAM D
wget -c http://genomedata.org/seq-tec-workshop/read_data/rna_alignment-de_exercise/dataset_D/dataset.tar.gz
tar -xzvf dataset.tar.gz

### TEAM E
wget -c http://genomedata.org/seq-tec-workshop/read_data/rna_alignment-de_exercise/dataset_E/dataset.tar.gz
tar -xzvf dataset.tar.gz

Additionally, teams will need to create a separate directory and download the corresponding reference files needed for RNA alignment & further expression analysis. Don’t forget to modify the below commands to use your team’s chromosome.

mkdir -p $RNA_HOME/team_exercise/references
cd $RNA_HOME/team_exercise/references

## Adapter trimming
wget -c http://genomedata.org/seq-tec-workshop/references/RNA/illumina_multiplex.fa

## Reference fasta corresponding to your team's assigned chromosome (e.g. chr6)
wget -c http://genomedata.org/seq-tec-workshop/references/RNA/chrXX.fa

## Obtain annotated reference gtf file corresponding to your team's assigned chromosome (e.g. chr6)
wget -c http://genomedata.org/seq-tec-workshop/references/RNA/chrXX_Homo_sapiens.GRCh38.95.gtf

Data Preprocessing (QC & Trimming)

Goals:

  • Perform adapter trimming on your data and also pre-trim 5 bases from end (right) of reads
  • Perform QC on your data with fastqc and multiqc before and after trimming your data

Q1. What is the average percentage of reads that are trimmed?

Q2. How do you expect the sequence length distribution to look prior to and after trimming? Is your answer confirmed by the multiqc report results?

Q3. Are there any metrics where the sample(s) failed?

Alignment Exercise

Goals:

  • Create HISAT2 index files for your chromosome
  • Review HISAT2 alignment options
  • Perform alignments
  • Obtain an alignment summary
  • Sort and convert your alignments into compressed BAM format

A useful option to add to the end of your commands is 2>, which redirects the stderr from any command into a specific file. This can be used to redirect your stderr into a summary file, and can be used as follows: my_alignment_command 2> alignment_metrics.txt. The advantage of this is being able to view the alignment metrics later on.

Q4. What were the percentages of reads that aligned to the reference for each sample?

Q5. By compressing your sam format to bam, approximately how much space is saved (fold change in size)?

Post-alignment QC & IGV Visualization

Goals:

  • Perform post-alignment QC analysis using fastqc and multiqc
  • Merge bam files (one for each condition) for easier visualization in IGV
  • Index the bam files
  • Explore the alignments using IGV

Q6. How does the information from your post-alignment QC report differ from pre-alignment QC?

Q7. IGV: Can you identify certain exons that have significantly more/less coverage in one of your KO/RESCUE samples compared to the other? What is happening here?

Q8. IGV: Can you identify regions where the RNAseq reads are mapping to unexpected regions? What do you think is the reason for this phenomenon?

Q9. IGV: Can you identify a gene region that has RNA sequencing support for multiple isoforms?

Presenting Your Results

At the end of this team exercise, groups will present findings from their QC reports and IGV analysis to the class for specific questions listed below.

Team A: Present IGV findings regarding question 9.

Team B: Present multiqc report on pre- and post-alignment qc files (question 6).

Team C: Present IGV findings regarding question 7.

Team D: Present IGV findings regarding question 8.

Team E: Present multiqc report on pre- and post-trimming qc files (Data Preprocessing section).

Alignment QC | Griffith Lab

RNA-seq Bioinformatics

Introduction to bioinformatics for RNA sequence analysis

Alignment QC


RNA-seq_Flowchart3


Alignment QC mini lecture

If you would like a refresher on alignment QC, we have made a mini lecture briefly covering the topic.

Use samtools and FastQC to evaluate the alignments

Use samtools view to see the format of a SAM/BAM alignment file

cd $RNA_ALIGN_DIR
samtools view -H UHR.bam
samtools view UHR.bam | head
samtools view UHR.bam | head | column -t | less -S

Try filtering the BAM file to require or exclude certain flags. This can be done with samtools view -f -F options

-f INT required flag -F INT filtering flag

“Samtools flags explained”

Try requiring that alignments are ‘paired’ and ‘mapped in a proper pair’ (=3). Also filter out alignments that are ‘unmapped’, the ‘mate is unmapped’, and ‘not primary alignment’ (=268)

samtools view -f 3 -F 268 UHR.bam | head | column -t | less -S

Now require that the alignments be only for ‘PCR or optical duplicate’. How many reads meet this criteria? Why?

samtools view -f 1024 UHR.bam | head

Use samtools flagstat to get a basic summary of an alignment. What percent of reads are mapped? Is this realistic? Why?

cd $RNA_ALIGN_DIR
mkdir flagstat

samtools flagstat HBR_Rep1.bam > flagstat/HBR_Rep1.bam.flagstat
samtools flagstat HBR_Rep2.bam > flagstat/HBR_Rep2.bam.flagstat
samtools flagstat HBR_Rep3.bam > flagstat/HBR_Rep3.bam.flagstat
samtools flagstat UHR_Rep1.bam > flagstat/UHR_Rep1.bam.flagstat
samtools flagstat UHR_Rep2.bam > flagstat/UHR_Rep2.bam.flagstat
samtools flagstat UHR_Rep3.bam > flagstat/UHR_Rep3.bam.flagstat

# Note that we could have created and run a samtools flagstat command for all files ending in *Rep*.bam using the following construct:
# find *Rep*.bam -exec echo samtools flagstat {} \> flagstat/{}.flagstat \; | sh

# View an example
cat flagstat/UHR_Rep1.bam.flagstat

Details of the SAM/BAM format can be found here: http://samtools.sourceforge.net/SAM1.pdf

Using FastQC

You can use FastQC to perform basic QC of your BAM file (See Pre-alignment QC). This will give you output very similar to when you ran FastQC on your fastq files.

cd $RNA_ALIGN_DIR
fastqc UHR_Rep1.bam UHR_Rep2.bam UHR_Rep3.bam HBR_Rep1.bam HBR_Rep2.bam HBR_Rep3.bam
mkdir fastqc
mv *fastqc.html fastqc/
mv *fastqc.zip fastqc/

Using Picard

You can use Picard to generate RNA-seq specific quality metrics and figures

# Generating the necessary input files for picard CollectRnaSeqMetrics
cd $RNA_HOME/refs

# Create a .dict file for our reference
java -jar $PICARD CreateSequenceDictionary -R chr22_with_ERCC92.fa -O chr22_with_ERCC92.dict

# Create a bed file of the location of ribosomal sequences in our reference (first extract from the gtf then convert to bed)
# Note that here we pull all the "rrna" transcripts from the GTF. This is a good strategy for the whole transcriptome ...
# ... but on chr22 there is very little "rrna" content, leading to 0 coverage for all samples, so we are also adding a single protein coding ribosomal gene "RRP7A" (normally we would not do this)
grep --color=none -i -P "rrna|rrp7a" chr22_with_ERCC92.gtf > ref_ribosome.gtf
gff2bed < ref_ribosome.gtf > ref_ribosome.bed

# Create interval list file for the location of ribosomal sequences in our reference
java -jar $PICARD BedToIntervalList -I ref_ribosome.bed -O ref_ribosome.interval_list -SD chr22_with_ERCC92.dict

# Create a genePred file for our reference transcriptome
gtfToGenePred -genePredExt chr22_with_ERCC92.gtf chr22_with_ERCC92.ref_flat.txt

# reformat this genePred file
cat chr22_with_ERCC92.ref_flat.txt | awk '{print $12"\t"$0}' | cut -d$'\t' -f1-11 > tmp.txt
mv tmp.txt chr22_with_ERCC92.ref_flat.txt

cd $RNA_HOME/alignments/hisat2/
mkdir picard
find *Rep*.bam -exec echo java -jar $PICARD CollectRnaSeqMetrics I={} O=picard/{}.RNA_Metrics REF_FLAT=$RNA_HOME/refs/chr22_with_ERCC92.ref_flat.txt STRAND=SECOND_READ_TRANSCRIPTION_STRAND RIBOSOMAL_INTERVALS=$RNA_HOME/refs/ref_ribosome.interval_list \; | sh

RSeQC [optional]

Background: RSeQC is a tool that can be used to generate QC reports for RNA-seq. For more information, please check: RSeQC Tool Homepage

Files needed:

cd $RNA_HOME/refs/

# Convert Gtf to genePred
gtfToGenePred chr22_with_ERCC92.gtf chr22_with_ERCC92.genePred

# Convert genPred to bed12
genePredToBed chr22_with_ERCC92.genePred chr22_with_ERCC92.bed12

cd $RNA_ALIGN_DIR
mkdir rseqc
geneBody_coverage.py -i UHR_Rep1.bam,UHR_Rep2.bam,UHR_Rep3.bam -r $RNA_HOME/refs/chr22_with_ERCC92.bed12 -o rseqc/UHR
geneBody_coverage.py -i HBR_Rep1.bam,HBR_Rep2.bam,HBR_Rep3.bam -r $RNA_HOME/refs/chr22_with_ERCC92.bed12 -o rseqc/HBR

find *Rep*.bam -exec echo inner_distance.py -i {} -r $RNA_HOME/refs/chr22_with_ERCC92.bed12 -o rseqc/{} \; | sh

find *Rep*.bam -exec echo junction_annotation.py -i {} -r $RNA_HOME/refs/chr22_with_ERCC92.bed12 -o rseqc/{} \; | sh

find *Rep*.bam -exec echo junction_saturation.py -i {} -r $RNA_HOME/refs/chr22_with_ERCC92.bed12 -o rseqc/{} \; | sh

find *Rep*.bam -exec echo read_distribution.py  -i {} -r $RNA_HOME/refs/chr22_with_ERCC92.bed12 \> rseqc/{}.read_dist.txt \; | sh

find *Rep*.bam -exec echo RNA_fragment_size.py -i {} -r $RNA_HOME/refs/chr22_with_ERCC92.bed12 \> rseqc/{}.frag_size.txt \; | sh

find *Rep*.bam -exec echo bam_stat.py -i {} \> {}.bam_stat.txt \; | sh

rm -f log.txt

MultiQC

We will now use multiQC to compile a QC report from all the QC tools above.

cd $RNA_ALIGN_DIR
multiqc ./

MultiQC screenshot

MultiQC

View a pre-generated MultiQC report for full bam files

View a multiQC on QC reports from non-downsampled bam files:

mkdir $RNA_ALIGN_DIR/example_QC
cd $RNA_ALIGN_DIR/example_QC
wget http://genomedata.org/rnaseq-tutorial/multiqc_report.html

Below is a brief description of each of the samples included in the multiQC report.

Name Sample type
Sample 1 Brain metastasis
Sample 2 Melanoma xenograft
Sample 3 Melanoma cell line
Sample 4 Melanoma
Sample 5 Small Cell Lung Cancer FFPE
Sample 6 Brain metastasis