User Tools

Site Tools


wiki:basicpipeline

Writing a basic RNA analysis pipeline

Next, let's initiate our first script. For today, we're going to walk through the basic steps of our analysis pipeline:

  • Trimmomatic
  • FASTQC
  • hisat2 alignment
  • featureCounts

And, actually, we're just going to skip Trimmomatic because it takes a really long time and only remove 1% of the reads in our dataset we're using today.


Initiate a script file

Navigate into the 02_scripts directory and initiate a script file:

$ pwd
$ ls  # should see 02_scripts or navigate to the proper directory
$ cd 02_scripts
$ nano simple_pipeline.sh

Copy the text below into your simple_pipeline.sh script and then save and close it:

#!/usr/bin/env bash

#SBATCH --job-name=pipeline_demo 
#SBATCH --nodes=1
#SBATCH --ntasks=6      # modify this number to reflect how many cores you want to use (up to 24)
#SBATCH --partition=shas-testing
#SBATCH --qos=testing     # modify this to reflect which queue you want to use. Options are 'normal' and 'testing'
#SBATCH --time=0:29:00   # modify this to reflect how long to let the job go. 
#SBATCH --output=log_pipeline_demo_%j.txt

Take the opportunity now to see if you can edit your script remotely:

  • Try switching to cyberduck or sFTP Client or some other method of editing your files remotely.
  • For more information, look up Syncing at the bottom of the page for advice on Cyberduck.

FTP Client settings
Connect Using: SFTP
Connect to: login.rc.colorado.edu
Login: <eID>@colostate.edu # In which <eID> is replaced with your actual eID
Password: Try using password,push
Port: 22

Using your remote editing, add some pseudocode to your script:

#!/usr/bin/env bash

#SBATCH --job-name=pipeline_demo 
#SBATCH --nodes=1
#SBATCH --ntasks=6      # modify this number to reflect how many cores you want to use (up to 24)
#SBATCH --partition=shas-testing
#SBATCH --qos=testing     # modify this to reflect which queue you want to use. Options are 'normal' and 'testing'
#SBATCH --time=0:29:00   # modify this to reflect how long to let the job go. 
#SBATCH --output=log_pipeline_demo_%j.txt

# Source the bashrc link to install software

# Quality control with FASTQC

# Alignment to reference genome with hisat2

# Tabulation of read counts per gene with featureCounts

:!: Notebook: Update your notebook to state that you are working on this script in this directory. Make note of your overall goal.

:!: Exercise: Try to fill in the proper code for the first sample.

The input fastq files for the first sample are:

../01_input/SRR3567551_1.fastq
../01_input/SRR3567551_2.fastq

The sample will be called sample01

Instructions and hints for each section are below:


:!: EXERCISE: Write a line of code to source your activate.bashrc file.

Add the line that we just discussed that will make all the software usable within the script.


Perform Quality Control with FASTQC

FASTQC Usage:

fastqc [options] <inputfile>
 
Options:
  -t <numberofthreads>
  -o <outputdir>

full FASTQC command line:

$ fastqc -t <numberofthreads> -o <outputdir> <inputfile>

For more help:

$ fastqc --help

:!: Quick tip: Write one FASTQC command for each of the paired-end files. So for a paired-end sample, like sample01, we will have two FASTQC commands, one for ../01_input/SRR3567551_1.fastq and one for ../01_input/SRR3567551_2.fastq

:!: EXERCISE: Write a pair of FASTQC commands for sample01 into your simple_pipeline.sh

:!: Quick Tip: Test your code every few changes. Test using $ sbatch simple_pipeline.sh

:?: Question: What is the FASTQC output? Where is it?

:!: ANSWER: Here are the lines of code: CLICK HERE FOR simple_pipeline.sh


Align reads to a reference genome using hisat2

hisat2 Usage:

hisat2 [options] -x <hisatpath> -1 <file_1.fastq> -2 <file_2.fastq> -S <outputfilename.sam> 
 
Options:
  --summary-file <summaryfilename.txt>   # Print out a summary
  --no-unal                              # Suppress SAM records for reads that failed to align
  -p <numbers of threads>

hisat2 Input:

  • .fastq files. For our purposes, these are the same .fastq files that are the input for FASTQC. If you use trimmomatic, they can be the output of the trimmomatic run.
  • hisat path: this is the location of the hisat2 indexes (remember we made these and they are in the format sc3_1.ht2, sc3_2.ht2, etc.

The hisat path: The hisat path is written: /path/to/hisat/files/hisat-prefix

So if your files are sc3_1.ht2, sc3_2.ht2, sc3_3.ht2, etc. And if they are located in the directory: /usr/john/hisat/, then your command will look like this…

hisat2 -x /usr/john/hisat/sc3

hisat2 Ouput:

:!: Exercise: Try it! Try writing a hisat2 command line in your script using the input files ../01_input/SRR3567551_1.fastq and ../01_input/SRR3567551_2.fastq and using the hisat2 indexes you made last class.

:!: Quick Tip: Comment out the FASTQC lines of code so they don't run again. You'll need to keep the source activate.bashrc line of code, however.

:!: More hisat2 help:

Manual pages:

$ hisat2

Manual website:https://ccb.jhu.edu/software/hisat2/manual.shtml

:?: Question: What is the hisat2 output? Where is it?

:!: ANSWER: Here are the lines of code: CLICK HERE FOR simple_pipeline.sh

Tabulation Using featureCounts

wiki/basicpipeline.txt · Last modified: 2018/11/27 05:29 by erin