User Tools

Site Tools


pre-processing_quality_control

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
pre-processing_quality_control [2018/11/07 21:11]
david [File part of the command]
pre-processing_quality_control [2018/11/08 10:46] (current)
david [Running the programs through an sbatch script]
Line 3: Line 3:
  
 This lesson is a practical run-through of the first steps in RNA-seq processing within a high-performance computing environment. This lesson is a practical run-through of the first steps in RNA-seq processing within a high-performance computing environment.
 +
 +==== Papers ====
 +  * [[ https://​link.springer.com/​content/​pdf/​10.1007%2F978-1-4939-2291-8_8.pdf |  Quality control and Phred scores ]] (Assigned reading on the schedule)
 +  * [[ https://​www.ncbi.nlm.nih.gov/​pubmed/​24695404 | Trimmomatic reference ]]
 +  * [[https://​www.bioinformatics.babraham.ac.uk/​publications.html | FastQC publications ]]
  
 ==== Helpful References ==== ==== Helpful References ====
-  ​* Quality control and Phred scores [[ https://​link.springer.com/​content/​pdf/​10.1007%2F978-1-4939-2291-8_8.pdf | Assigned reading on the schedule]] +  * Trimmomatic [[http://​www.usadellab.org/​cms/?​page=trimmomatic ​| manual ​]] 
-  ​* Trimmomatic [[http://​www.usadellab.org/​cms/?​page=trimmomatic]] +  * FastQC [[https://​www.bioinformatics.babraham.ac.uk/​projects/​fastqc/ ​| manual]]
-  * FastQC [[https://​www.bioinformatics.babraham.ac.uk/​projects/​fastqc/​]]+
   * Sbatch command reference [[https://​slurm.schedmd.com/​sbatch.html]]   * Sbatch command reference [[https://​slurm.schedmd.com/​sbatch.html]]
   * Summit specs [[https://​www.colorado.edu/​rc/​resources/​summit/​specifications]]   * Summit specs [[https://​www.colorado.edu/​rc/​resources/​summit/​specifications]]
   * More complex Slurm jobs using dependencies [[https://​hpc.nih.gov/​docs/​job_dependencies.html]]   * More complex Slurm jobs using dependencies [[https://​hpc.nih.gov/​docs/​job_dependencies.html]]
 +
 +
 +
 +==== Not covered in this lesson ====
 +  * Downloading,​ installing software
 +  * Downloading data from SRA [[https://​www.ncbi.nlm.nih.gov/​sra/​SRX4314530[accn] | example ]]
 +
 +==== Additional Points ====
 +  * Erin has posted a homework for this lecture on the schedule
 +  * Code repositories cannot contain large datasets - They are symbolic links
 +  * The mechanism for using installed software may differ between my lecture and Erin'​s,​ but you probably won't see the difference
  
 ===== Summit: the HPC environment ===== ===== Summit: the HPC environment =====
Line 62: Line 77:
 <​code>​ <​code>​
 $ git pull $ git pull
 +remote: Enumerating objects: 12, done.
 +remote: Counting objects: 100% (12/12), done.
 +remote: Compressing objects: 100% (4/4), done.
 +remote: Total 8 (delta 6), reused 6 (delta 4), pack-reused 0
 +Unpacking objects: 100% (8/8), done.
 +From github.com:​meekrob/​summit-rna-seq-setup
 +   ​51d037e..e05225d ​ master ​    -> origin/​master
 +Updating 51d037e..e05225d
 +Fast-forward
 + ​03_scripts/​number_of_reads.sbatch |  3 ++-
 + ​03_scripts/​trimmomatic.sbatch ​    | 19 +++++++++++--------
 + 2 files changed, 13 insertions(+),​ 9 deletions(-)
 </​code>​ </​code>​
  
-<fs large>​**3)** Rerun setup</​fs>​ +Check that everything is OK.
- +
-Just to make sure everything is still working...+
  
 <​code>​ <​code>​
Line 72: Line 97:
 </​code>​ </​code>​
  
-<fs large>**4)** Raise your hand if there are failures</fs>+<code> 
 +$ ls 
 +01_input/ ​   
 +02_output/  
 +03_scripts/​  
 +04_logs/​  
 +Makefile  
 +README.txt  
 +Trimmomatic-0.36/​  
 +activate.bashrc  
 +bin/ 
 +</​code>​ 
 + 
 +===== A great "​get"​ from git ===== 
 + 
 +One of the greatest benefits of tracking changes is that you can see what you've changed. 
 + 
 +Let's say that, with an editor, I change line four of 03_scripts/​number_of_reads.sbatch to use a 10 minute time limit instead of 1 minute. 
 +<​code>​ 
 +$ nano 03_scripts/​number_of_reads.sbatch 
 +</​code>​ 
 + 
 +Line four now reads: 
 +<​code>​ 
 +#SBATCH --time=0:​10:​00 
 +</​code>​ 
 + 
 +I can use git to tell me what's different:​ 
 + 
 +<​code>​ 
 +$ git diff 03_scripts/​number_of_reads.sbatch  
 +diff --git a/​03_scripts/​number_of_reads.sbatch b/​03_scripts/​number_of_reads.sbatch 
 +index 0634563..3106155 100644 
 +--- a/​03_scripts/​number_of_reads.sbatch 
 ++++ b/​03_scripts/​number_of_reads.sbatch 
 +@@ -1,7 +1,7 @@ 
 + #​!/​usr/​bin/​env bash 
 + #​SBATCH --nodes=1 ​ # access with $SLURM_NNODES in the script 
 + #​SBATCH --ntasks=1 ​ # access with $SLURM_NTASKS in the script 
 +-#SBATCH --time=0:​01:​00 
 ++#SBATCH --time=0:​10:​00 
 + #​SBATCH --qos=testing # change to "​normal"​ when done testing 
 + #​SBATCH --partition=shas-testing # remove "​-testing"​ when done testing 
 + #​SBATCH --output=numreads-%j.out 
 +</​code>​ 
 + 
 +What if I screwed up? 
 + 
 +<​code>​ 
 +$ git checkout 03_scripts/​number_of_reads.sbatch 
 +</​code>​ 
 + 
 +Line 4 goes back to the original! 
 +<​code>​ 
 +#SBATCH --time=0:​01:​00 
 +</code> 
  
-<fs medium>​If anyone'​s job submission failed, then we are still waiting for support staff to update those accounts. :-x m(</​fs>​ 
  
-Please work with a neighbor it doesn'​t work for you. I can take you through this tutorial during office hours once your accounts are working. 
  
 ===== Directory Structure ===== ===== Directory Structure =====
Line 114: Line 193:
     * Have to keep track of subdirectories     * Have to keep track of subdirectories
     * use them consistently     * use them consistently
 +    * some coding overhead
 ===== Input Data ===== ===== Input Data =====
  
Line 177: Line 257:
 So let's look at one of ours: So let's look at one of ours:
 <​code>​ <​code>​
-$ head -4 01_input/SRR3567552_1.fastq ​+$ head -4 01_input/SRR3567551_1.fastq ​
 @SRR3567552.1 HISEQ:​222:​C3RTWACXX:​1:​1101:​1411:​2064 length=100 @SRR3567552.1 HISEQ:​222:​C3RTWACXX:​1:​1101:​1411:​2064 length=100
 GTGCTTGTGGACTGCTTGGTGGGGCTTGCTCTGCTAGGCGGACTACTTGCGTGCCTTGTTGTAGACGGCCTTGGTAGGTCTCTTGTAGACCGTCGCTTGC GTGCTTGTGGACTGCTTGGTGGGGCTTGCTCTGCTAGGCGGACTACTTGCGTGCCTTGTTGTAGACGGCCTTGGTAGGTCTCTTGTAGACCGTCGCTTGC
Line 189: Line 269:
  
 <​code>​ <​code>​
-$ ls -lh 01_input/SRR3567552_1.fastq ​+$ ls -lh 01_input/SRR3567551_1.fastq ​
 -rw-r--r-- 1 erinnish@colostate.edu erinnishgrp@colostate.edu 6.6G Oct 16 15:14 01_input/​SRR3567552_1.fastq -rw-r--r-- 1 erinnish@colostate.edu erinnishgrp@colostate.edu 6.6G Oct 16 15:14 01_input/​SRR3567552_1.fastq
 </​code>​ </​code>​
Line 207: Line 287:
  
 <​code>​ <​code>​
-$ sbatch 03_scripts/​number_of_reads.sbatch 01_input/SRR3567552_1.fastq+$ sbatch 03_scripts/​number_of_reads.sbatch 01_input/SRR3567551_1.fastq
 </​code>​ </​code>​
  
Line 248: Line 328:
 Summit was still recovering from maintenance yesterday, so my job just sat there, but I expect a file to be created in the directory called ''​numreads-1388630.out''​ containing the line: Summit was still recovering from maintenance yesterday, so my job just sat there, but I expect a file to be created in the directory called ''​numreads-1388630.out''​ containing the line:
 <​code>​ <​code>​
-01_input/SRR3567552_1.fastq has 20654219 reads.+01_input/SRR3567551_1.fastq has 20654219 reads.
 </​code>​ </​code>​
  
Line 301: Line 381:
 </​code>​ </​code>​
  
-<fs large>​Getting the command for Trimmomatic to work **is //not// easy.**</​fs>​+==== Hardest boss of RNA-seq Analysis? ==== 
 + 
 +Hardest boss of <​del>​The Internet</​del>​ RNA-seq Analysis AS THE FIRST STEP??? 
 + 
 +{{ :​final_boss.jpg?​200 |}} 
 + 
 +<fs large>​Getting the command for Trimmomatic to work **is //not// easy.** ​So don't be discouraged. Most programs are not this hard to set up.</fs> 
 + 
 +Here is the QUICK START example command [[http://​www.usadellab.org/​cms/?​page=trimmomatic | given in the docs]]!!!:​ 
 + 
 +<​code>​ 
 +java -jar trimmomatic-0.35.jar PE -phred33 input_forward.fq.gz input_reverse.fq.gz  
 +output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz  
 +output_reverse_unpaired.fq.gz ILLUMINACLIP:​TruSeq3-PE.fa:​2:​30:​10 LEADING:3 TRAILING:3 SLIDINGWINDOW:​4:​15 MINLEN:36 
 +</​code>​ 
 + 
 +<fs large>​Let'​s chip away at this monster. </​fs>​ 
 + 
 +=== File parameters === 
 + 
 +  * The Paired-end PE mode requires naming two input fastq files,  
 +  * and then naming a "​Trimmed"​ and "​Unpaired"​ output file for each of them.  
 +  * That's six file parameters!  
 +  * Two existing, and four to be created by Trimmomatic. 
  
-The Paired-end PE mode requires naming two input fastq files, and then naming a "​Trimmed"​ and "​Unpaired"​ output file for each of them. That's six file parameters! Two existing, and four to be created by Trimmomatic.+=== Trimmer ===
  
 Looking at the documentation,​ here is an example argument for the trimmer. Looking at the documentation,​ here is an example argument for the trimmer.
Line 334: Line 438:
 The specific values of these settings depend on the platform and the adapters that were used. However, depending on the success of the alignment (in a future lecture), these example setting may be sufficient. The specific values of these settings depend on the platform and the adapters that were used. However, depending on the success of the alignment (in a future lecture), these example setting may be sufficient.
  
-=== Creating a wrapper script ===+=== Approach: ​Creating a wrapper script ===
  
 We need some techniques for handling these complex arguments in a reliable way. We need some techniques for handling these complex arguments in a reliable way.
Line 370: Line 474:
 in2=${infile/​_1.fastq/​_2.fastq} in2=${infile/​_1.fastq/​_2.fastq}
  
-make all the file output names, ​strip the input directory with '​basename'​ +# strip the input directory with '​basename'​ 
-out1paired=$(basename ${in1/​.fastq/​.trimmed.fastq}   # makes: SRR1234567_1.trimmed.fastq +basename1=$(basename ​$in1) 
-out1unpaired=$(basename ​${in1/​.fastq/​.unpaired.fastq}# makes: SRR1234567_1.unpaired.fastq +basename2=$(basename $in2) 
-out2paired=$(basename ​${in2/​.fastq/​.trimmed.fastq}   # makes: SRR1234567_2.trimmed.fastq +# make all the file output names 
-out2unpaired=$(basename ​${in2/​.fastq/​.unpaired.fastq}# makes: SRR1234567_2.unpaired.fastq+out1paired=${basename1/​.fastq/​.trimmed.fastq} ​   # makes: SRR1234567_1.trimmed.fastq 
 +out1unpaired=${basename1/​.fastq/​.unpaired.fastq} # makes: SRR1234567_1.unpaired.fastq 
 +out2paired=${basename2/​.fastq/​.trimmed.fastq} ​   # makes: SRR1234567_2.trimmed.fastq 
 +out2unpaired=${basename2/​.fastq/​.unpaired.fastq} # makes: SRR1234567_2.unpaired.fastq
 </​code>​ </​code>​
  
Line 389: Line 496:
 In the command, I'll let the value of ''​$trim''​ be substituted in the command. In the command, I'll let the value of ''​$trim''​ be substituted in the command.
  
-Final form (Approach 1 - a lot of typing):+Final form (Approach 1 - a lot of typing, this only works for SRR3567551_1):
 <​code>​ <​code>​
 java -jar Trimmomatic-0.36/​trimmomatic-0.36.jar PE -threads $SLURM_NTASKS -phred33 01_input/​SRR3567551_1.fastq 01_input/​SRR3567551_2.fastq 02_output/​02_trimmomatic/​SRR3567551_1.trimmed.fastq 02_output/​02_trimmomatic/​SRR3567551_1.unpaired.fastq 02_output/​02_trimmomatic/​SRR3567551_2.trimmed.fastq 02_output/​02_trimmomatic/​SRR3567551_2.unpaired.fastq $trim java -jar Trimmomatic-0.36/​trimmomatic-0.36.jar PE -threads $SLURM_NTASKS -phred33 01_input/​SRR3567551_1.fastq 01_input/​SRR3567551_2.fastq 02_output/​02_trimmomatic/​SRR3567551_1.trimmed.fastq 02_output/​02_trimmomatic/​SRR3567551_1.unpaired.fastq 02_output/​02_trimmomatic/​SRR3567551_2.trimmed.fastq 02_output/​02_trimmomatic/​SRR3567551_2.unpaired.fastq $trim
 </​code>​ </​code>​
  
-Final form (Approach 2 - a lot of scripting):+Final form (Approach 2 - a lot of scripting, but it works for any pair of runs):
 <​code>​ <​code>​
 java -jar Trimmomatic-0.36/​trimmomatic-0.36.jar PE -threads $SLURM_NTASKS -phred33 $in1 $in2 $outdir/​$out1paired $outdir/​$out1unpaired ​ $outdir/​$out2paired $outdir/​$out2unpaired $trim java -jar Trimmomatic-0.36/​trimmomatic-0.36.jar PE -threads $SLURM_NTASKS -phred33 $in1 $in2 $outdir/​$out1paired $outdir/​$out1unpaired ​ $outdir/​$out2paired $outdir/​$out2unpaired $trim
pre-processing_quality_control.1541650318.txt.gz · Last modified: 2018/11/07 21:11 by david