RNA-seq Data Analysis: part 1

The purpose of this exercise is to introduce tools for analyzing differential gene expression in RNA-seq data. You will analyze RNA-seq data from human reference (a mix of tissues) and brain tissue to identify genes for which expression is enriched in the brain. Due to time constraints, the data we will analyze is a subset (chr22) of a larger RNA-seq dataset.

We will use a suite of tools called the Tuxedo pipeline. For an additional tutorial, see the following paper: Trapnell et al. 2012, Nature Protocols. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks.

Note: this pipeline is no longer updated and has been superseded by a more efficient and accurate pipeline built from a similar suite of tools: HISAT, StringTie, and Ballgown. We will nevertheless use the original Tuxedo pipeline because far more support is currently available for it and it has fewer bugs to contend with.


1. Assess data quality using FastQC.
2. Quality filter datasets using Trimmomatic.
3. Align the RNA-seq reads to the human genome using TopHat2.
4. Assemble transcripts based on RNA-seq data using cufflinks and cuffmerge.
5. Compare expression differences using cuffdiff.
6. Visualize data using the genome viewing software IGV.
7. Plot data with R and cummeRbund.

Quality control and filtering

1. Open Cyberduck and connect to the Montgomery lab server over SFTP.

2. Download the brain_data folder onto your desktop.

There are 12 RNA-seq files: paired-end data (two files per library) for 3 replicates from each of 2 sample sets (brain and ref). Examine a few lines of one of the files using zmore or zless. What information is contained in each line?

  • brain_rep1_1.fastq.gz
  • brain_rep1_2.fastq.gz
  • brain_rep2_1.fastq.gz
  • brain_rep2_2.fastq.gz
  • brain_rep3_1.fastq.gz
  • brain_rep3_2.fastq.gz
  • ref_rep1_1.fastq.gz
  • ref_rep1_2.fastq.gz
  • ref_rep2_1.fastq.gz
  • ref_rep2_2.fastq.gz
  • ref_rep3_1.fastq.gz
  • ref_rep3_2.fastq.gz
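As a complement to paging through a file with zless, the following is a minimal sketch of two helper functions for inspecting a gzipped fastq file from the terminal. The function names are hypothetical, and the read count assumes each record spans exactly four lines, which is the usual layout for Illumina fastq files.

```shell
# Print the first record (4 lines) of a gzipped fastq file.
first_record() {
    gzip -dc "$1" | head -n 4
}

# Count the reads in a gzipped fastq file.
# Assumption: every record occupies exactly 4 lines.
count_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Usage:
#   first_record brain_rep1_1.fastq.gz
#   count_reads brain_rep1_1.fastq.gz
```

Recording the read count of each input file here makes it easier to work out later how many reads survive filtering.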

3. Assess the quality of the data using FastQC:

In FastQC:


See FastQC tutorial for additional details:

4. Trim adapter sequences and quality filter the RNA-seq data (fastq files) using Trimmomatic:

Run Trimmomatic once for each paired-end library, passing both fastq files of the pair in a single command (you will run trimmomatic 6 times in total).

$ trimmomatic PE -phred33 'input_fastq_1' 'input_fastq_2' 'output_fastq_paired_1' 'output_fastq_unpaired_1' 'output_fastq_paired_2' 'output_fastq_unpaired_2' ILLUMINACLIP:/usr/share/trimmomatic/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
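Because the same command is repeated for all six libraries, it can help to generate the commands in a loop rather than typing each one. The sketch below only prints the commands so they can be checked before running. The output file names (the .paired and .unpaired suffixes) and the function name trim_cmd are hypothetical choices for illustration, not names required by Trimmomatic.

```shell
ADAPTERS=/usr/share/trimmomatic/adapters/TruSeq3-PE.fa

# Print the Trimmomatic command for one library, e.g. brain_rep1.
# Output file names are hypothetical; choose any naming you like.
trim_cmd() {
    lib=$1
    echo trimmomatic PE -phred33 \
        "${lib}_1.fastq.gz" "${lib}_2.fastq.gz" \
        "${lib}_1.paired.fastq.gz" "${lib}_1.unpaired.fastq.gz" \
        "${lib}_2.paired.fastq.gz" "${lib}_2.unpaired.fastq.gz" \
        "ILLUMINACLIP:${ADAPTERS}:2:30:10" \
        LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
}

# Print all six commands, one per library.
for sample in brain ref; do
    for rep in 1 2 3; do
        trim_cmd "${sample}_rep${rep}"
    done
done
```

Once the printed commands look right, they can be copied into the terminal one at a time, or the loop output can be piped to sh to run them in sequence.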

See the Trimmomatic manual for a detailed description of options:

Submit an answer to the following question on Canvas:
What proportion of the reads were retained?
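In paired-end mode Trimmomatic prints a one-line summary that begins with "Input Read Pairs:" followed by the number of pairs in which both reads survived. The sketch below computes the percentage of pairs retained from that line. It assumes the summary has been saved to a log file (the log file name in the usage comment is hypothetical) and that the line follows the usual "Input Read Pairs: N Both Surviving: M (...)" layout.

```shell
# Read Trimmomatic's paired-end summary on stdin and print the
# percentage of read pairs in which both mates survived.
# Assumption: the line looks like
#   Input Read Pairs: N Both Surviving: M (..%) Forward Only Surviving: ...
both_surviving_pct() {
    awk '/^Input Read Pairs:/ {
        # field 4 = input pairs, field 7 = pairs with both mates surviving
        printf "%.2f\n", 100 * $7 / $4
    }'
}

# Usage (hypothetical log file holding the Trimmomatic summary):
#   both_surviving_pct < brain_rep1.trim.log
```

Reads that survive only as forward or reverse singletons are reported separately on the same line, so decide whether your answer should count them as retained.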

5. Assess the quality of one of the datasets after quality filtering using FastQC:

In FastQC:


6. Create a Bowtie 2 index for the human chromosome 22 sequence:

$ bowtie2-build 'sequence.fa' 'prefix'

The chr22 sequence is in the brain_data folder: hg38_chr22.fa.

For the bowtie prefix, use chr22.

7. Move the 6 bowtie index files and the genome sequence file (hg38_chr22.fa) to a new folder called bowtie_chr22.
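With the file name and prefix given above, the index is built with bowtie2-build hg38_chr22.fa chr22, which writes six files ending in .bt2. The following sketch gathers those files and the genome sequence into one folder. The function name collect_index is hypothetical, and it assumes the index files are in the current working directory.

```shell
# Move the Bowtie 2 index files for a prefix, plus the genome
# FASTA file, into a destination folder (created if missing).
collect_index() {
    prefix=$1
    fasta=$2
    dest=$3
    mkdir -p "$dest"
    mv "$prefix".*.bt2 "$fasta" "$dest"/
}

# Usage, after running: bowtie2-build hg38_chr22.fa chr22
#   collect_index chr22 hg38_chr22.fa bowtie_chr22
```

Listing the folder afterwards with ls bowtie_chr22 should show seven files: the six index files and hg38_chr22.fa.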

See the bowtie manual for additional details:

8. Align sequences from each of the libraries to the human genome using TopHat2 (you will run tophat 6 times in total):

$ tophat -p 8 -G 'path_to_genome_annotations.gtf' -o 'output_folder' 'path_to_bowtie_index_for_reference_genome/prefix' 'fastq_file_paired_1_1','fastq_file_paired_1_2','fastq_file_unpaired_1_1','fastq_file_unpaired_1_2'
  • NOTE: The fastq file names are separated by commas only, with no spaces, so that TopHat receives them as a single comma-separated list of input files.
  • The directory containing the fastq files should be the current working directory.
  • Name the output folders as follows: ref1, ref2, ref3, brain1, brain2, brain3
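As with trimming, the six TopHat commands differ only in the library name, so they can be generated in a loop. The sketch below prints each command for review. Several names in it are assumptions: the trimmed file names (.paired and .unpaired suffixes) are hypothetical, the annotation path is the same placeholder used in the command above, and the index is assumed to sit in a bowtie_chr22 folder inside the current working directory.

```shell
GTF=path_to_genome_annotations.gtf   # placeholder, as in the command above
INDEX=bowtie_chr22/chr22             # assumed index folder and prefix

# Print the TopHat command for one library, e.g. "ref 1" -> output folder ref1.
# Trimmed fastq file names are hypothetical.
tophat_cmd() {
    sample=$1
    rep=$2
    lib="${sample}_rep${rep}"
    reads="${lib}_1.paired.fastq.gz,${lib}_2.paired.fastq.gz"
    reads="${reads},${lib}_1.unpaired.fastq.gz,${lib}_2.unpaired.fastq.gz"
    echo tophat -p 8 -G "$GTF" -o "${sample}${rep}" "$INDEX" "$reads"
}

# Print all six commands, one per library.
for sample in ref brain; do
    for rep in 1 2 3; do
        tophat_cmd "$sample" "$rep"
    done
done
```

Building the comma-separated list in a variable avoids accidentally introducing a space between file names.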

See the TopHat manual for additional details:

9. Determine what proportion of the reads from each library were aligned:

Use the UNIX more command to open each TopHat summary file in the terminal. The TopHat summary files are named align_summary.txt and are located in the output folders specified in step 8.

$ more ./ref1/align_summary.txt
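To compare all six libraries at once, the mapping rate can be pulled out of each summary file. This sketch assumes each align_summary.txt contains a line ending in "overall read mapping rate." with the percentage as its first field, which is how TopHat2 usually reports it. The function name mapping_rates is hypothetical.

```shell
# Print "<folder><TAB><rate>" for each TopHat output folder given.
# Folders without an align_summary.txt are skipped.
mapping_rates() {
    for dir in "$@"; do
        summary="$dir/align_summary.txt"
        [ -f "$summary" ] || continue
        # Assumption: the rate is the first field on the matching line.
        rate=$(awk '/overall read mapping rate/ { print $1 }' "$summary")
        printf '%s\t%s\n' "$dir" "$rate"
    done
}

# Usage:
#   mapping_rates ref1 ref2 ref3 brain1 brain2 brain3
```

It is still worth opening at least one summary file in full, since it also reports the input read count and the number of reads with multiple alignments.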

Submit an answer to the following question on Canvas:
What proportion of reads in each library aligned?

assignments/ex13.1510775651.txt.gz · Last modified: 2017/11/15 12:54 by dokuroot