User Tools

Site Tools


pre-processing_quality_control

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
pre-processing_quality_control [2018/11/07 22:00]
david [File part of the command]
pre-processing_quality_control [2018/11/08 10:46] (current)
david [Running the programs through an sbatch script]
Line 3: Line 3:
  
 This lesson is a practical run-through of the first steps in RNA-seq processing within a high-performance computing environment. This lesson is a practical run-through of the first steps in RNA-seq processing within a high-performance computing environment.
 +
 +==== Papers ====
 +  * [[ https://​link.springer.com/​content/​pdf/​10.1007%2F978-1-4939-2291-8_8.pdf |  Quality control and Phred scores ]] (Assigned reading on the schedule)
 +  * [[ https://​www.ncbi.nlm.nih.gov/​pubmed/​24695404 | Trimmomatic reference ]]
 +  * [[https://​www.bioinformatics.babraham.ac.uk/​publications.html | FastQC publications ]]
  
 ==== Helpful References ==== ==== Helpful References ====
-  ​* Quality control and Phred scores [[ https://​link.springer.com/​content/​pdf/​10.1007%2F978-1-4939-2291-8_8.pdf | Assigned reading on the schedule]] +  * Trimmomatic [[http://​www.usadellab.org/​cms/?​page=trimmomatic ​| manual ​]] 
-  ​* Trimmomatic [[http://​www.usadellab.org/​cms/?​page=trimmomatic]] +  * FastQC [[https://​www.bioinformatics.babraham.ac.uk/​projects/​fastqc/ ​| manual]]
-  * FastQC [[https://​www.bioinformatics.babraham.ac.uk/​projects/​fastqc/​]]+
   * Sbatch command reference [[https://​slurm.schedmd.com/​sbatch.html]]   * Sbatch command reference [[https://​slurm.schedmd.com/​sbatch.html]]
   * Summit specs [[https://​www.colorado.edu/​rc/​resources/​summit/​specifications]]   * Summit specs [[https://​www.colorado.edu/​rc/​resources/​summit/​specifications]]
   * More complex Slurm jobs using dependencies [[https://​hpc.nih.gov/​docs/​job_dependencies.html]]   * More complex Slurm jobs using dependencies [[https://​hpc.nih.gov/​docs/​job_dependencies.html]]
 +
 +
  
 ==== Not covered in this lesson ==== ==== Not covered in this lesson ====
   * Downloading,​ installing software   * Downloading,​ installing software
-  * Starting ​from a published source, downloading ​and formatting data (such as raw fastq).+  * Downloading data from SRA [[https://​www.ncbi.nlm.nih.gov/​sra/​SRX4314530[accn] | example ]] 
 + 
 +==== Additional Points ==== 
 +  * Erin has posted ​homework for this lecture on the schedule 
 +  * Code repositories cannot contain large datasets - They are symbolic links 
 +  * The mechanism for using installed software may differ between my lecture ​and Erin'​s,​ but you probably won't see the difference 
 ===== Summit: the HPC environment ===== ===== Summit: the HPC environment =====
  
Line 65: Line 77:
 <​code>​ <​code>​
 $ git pull $ git pull
 +remote: Enumerating objects: 12, done.
 +remote: Counting objects: 100% (12/12), done.
 +remote: Compressing objects: 100% (4/4), done.
 +remote: Total 8 (delta 6), reused 6 (delta 4), pack-reused 0
 +Unpacking objects: 100% (8/8), done.
 +From github.com:​meekrob/​summit-rna-seq-setup
 +   ​51d037e..e05225d ​ master ​    -> origin/​master
 +Updating 51d037e..e05225d
 +Fast-forward
 + ​03_scripts/​number_of_reads.sbatch |  3 ++-
 + ​03_scripts/​trimmomatic.sbatch ​    | 19 +++++++++++--------
 + 2 files changed, 13 insertions(+),​ 9 deletions(-)
 </​code>​ </​code>​
  
-<fs large>​**3)** Rerun setup</​fs>​ +Check that everything is OK.
- +
-Just to make sure everything is still working...+
  
 <​code>​ <​code>​
Line 75: Line 97:
 </​code>​ </​code>​
  
-<fs large>**4)** Raise your hand if there are failures</fs>+<code> 
 +$ ls 
 +01_input/ ​   
 +02_output/  
 +03_scripts/​  
 +04_logs/​  
 +Makefile  
 +README.txt  
 +Trimmomatic-0.36/​  
 +activate.bashrc  
 +bin/ 
 +</​code>​ 
 + 
 +===== A great "​get"​ from git ===== 
 + 
 +One of the greatest benefits of tracking changes is that you can see what you've changed. 
 + 
 +Let's say that, with an editor, I change line four of 03_scripts/​number_of_reads.sbatch to use a 10 minute time limit instead of 1 minute. 
 +<​code>​ 
 +$ nano 03_scripts/​number_of_reads.sbatch 
 +</​code>​ 
 + 
 +Line four now reads: 
 +<​code>​ 
 +#SBATCH --time=0:​10:​00 
 +</​code>​ 
 + 
 +I can use git to tell me what's different:​ 
 + 
 +<​code>​ 
 +$ git diff 03_scripts/​number_of_reads.sbatch  
 +diff --git a/​03_scripts/​number_of_reads.sbatch b/​03_scripts/​number_of_reads.sbatch 
 +index 0634563..3106155 100644 
 +--- a/​03_scripts/​number_of_reads.sbatch 
 ++++ b/​03_scripts/​number_of_reads.sbatch 
 +@@ -1,7 +1,7 @@ 
 + #​!/​usr/​bin/​env bash 
 + #​SBATCH --nodes=1 ​ # access with $SLURM_NNODES in the script 
 + #​SBATCH --ntasks=1 ​ # access with $SLURM_NTASKS in the script 
 +-#SBATCH --time=0:​01:​00 
 ++#SBATCH --time=0:​10:​00 
 + #​SBATCH --qos=testing # change to "​normal"​ when done testing 
 + #​SBATCH --partition=shas-testing # remove "​-testing"​ when done testing 
 + #​SBATCH --output=numreads-%j.out 
 +</​code>​ 
 + 
 +What if I screwed up? 
 + 
 +<​code>​ 
 +$ git checkout 03_scripts/​number_of_reads.sbatch 
 +</​code>​ 
 + 
 +Line 4 goes back to the original! 
 +<​code>​ 
 +#SBATCH --time=0:​01:​00 
 +</code> 
  
-<fs medium>​If anyone'​s job submission failed, then we are still waiting for support staff to update those accounts. :-x m(</​fs>​ 
  
-Please work with a neighbor it doesn'​t work for you. I can take you through this tutorial during office hours once your accounts are working. 
  
 ===== Directory Structure ===== ===== Directory Structure =====
Line 117: Line 193:
     * Have to keep track of subdirectories     * Have to keep track of subdirectories
     * use them consistently     * use them consistently
 +    * some coding overhead
 ===== Input Data ===== ===== Input Data =====
  
Line 180: Line 257:
 So let's look at one of ours: So let's look at one of ours:
 <​code>​ <​code>​
-$ head -4 01_input/SRR3567552_1.fastq ​+$ head -4 01_input/SRR3567551_1.fastq ​
 @SRR3567552.1 HISEQ:​222:​C3RTWACXX:​1:​1101:​1411:​2064 length=100 @SRR3567552.1 HISEQ:​222:​C3RTWACXX:​1:​1101:​1411:​2064 length=100
 GTGCTTGTGGACTGCTTGGTGGGGCTTGCTCTGCTAGGCGGACTACTTGCGTGCCTTGTTGTAGACGGCCTTGGTAGGTCTCTTGTAGACCGTCGCTTGC GTGCTTGTGGACTGCTTGGTGGGGCTTGCTCTGCTAGGCGGACTACTTGCGTGCCTTGTTGTAGACGGCCTTGGTAGGTCTCTTGTAGACCGTCGCTTGC
Line 192: Line 269:
  
 <​code>​ <​code>​
-$ ls -lh 01_input/SRR3567552_1.fastq ​+$ ls -lh 01_input/SRR3567551_1.fastq ​
 -rw-r--r-- 1 erinnish@colostate.edu erinnishgrp@colostate.edu 6.6G Oct 16 15:14 01_input/​SRR3567552_1.fastq -rw-r--r-- 1 erinnish@colostate.edu erinnishgrp@colostate.edu 6.6G Oct 16 15:14 01_input/​SRR3567552_1.fastq
 </​code>​ </​code>​
Line 210: Line 287:
  
 <​code>​ <​code>​
-$ sbatch 03_scripts/​number_of_reads.sbatch 01_input/SRR3567552_1.fastq+$ sbatch 03_scripts/​number_of_reads.sbatch 01_input/SRR3567551_1.fastq
 </​code>​ </​code>​
  
Line 251: Line 328:
 Summit was still recovering from maintenance yesterday, so my job just sat there, but I expect a file to be created in the directory called ''​numreads-1388630.out''​ containing the line: Summit was still recovering from maintenance yesterday, so my job just sat there, but I expect a file to be created in the directory called ''​numreads-1388630.out''​ containing the line:
 <​code>​ <​code>​
-01_input/SRR3567552_1.fastq has 20654219 reads.+01_input/SRR3567551_1.fastq has 20654219 reads.
 </​code>​ </​code>​
  
Line 419: Line 496:
 In the command, I'll let the value of ''​$trim''​ be substituted in the command. In the command, I'll let the value of ''​$trim''​ be substituted in the command.
  
-Final form (Approach 1 - a lot of typing):+Final form (Approach 1 - a lot of typing, this only works for SRR3567551_1):
 <​code>​ <​code>​
 java -jar Trimmomatic-0.36/​trimmomatic-0.36.jar PE -threads $SLURM_NTASKS -phred33 01_input/​SRR3567551_1.fastq 01_input/​SRR3567551_2.fastq 02_output/​02_trimmomatic/​SRR3567551_1.trimmed.fastq 02_output/​02_trimmomatic/​SRR3567551_1.unpaired.fastq 02_output/​02_trimmomatic/​SRR3567551_2.trimmed.fastq 02_output/​02_trimmomatic/​SRR3567551_2.unpaired.fastq $trim java -jar Trimmomatic-0.36/​trimmomatic-0.36.jar PE -threads $SLURM_NTASKS -phred33 01_input/​SRR3567551_1.fastq 01_input/​SRR3567551_2.fastq 02_output/​02_trimmomatic/​SRR3567551_1.trimmed.fastq 02_output/​02_trimmomatic/​SRR3567551_1.unpaired.fastq 02_output/​02_trimmomatic/​SRR3567551_2.trimmed.fastq 02_output/​02_trimmomatic/​SRR3567551_2.unpaired.fastq $trim
 </​code>​ </​code>​
  
-Final form (Approach 2 - a lot of scripting):+Final form (Approach 2 - a lot of scripting, but it works for any pair of runs):
 <​code>​ <​code>​
 java -jar Trimmomatic-0.36/​trimmomatic-0.36.jar PE -threads $SLURM_NTASKS -phred33 $in1 $in2 $outdir/​$out1paired $outdir/​$out1unpaired ​ $outdir/​$out2paired $outdir/​$out2unpaired $trim java -jar Trimmomatic-0.36/​trimmomatic-0.36.jar PE -threads $SLURM_NTASKS -phred33 $in1 $in2 $outdir/​$out1paired $outdir/​$out1unpaired ​ $outdir/​$out2paired $outdir/​$out2unpaired $trim
pre-processing_quality_control.1541653225.txt.gz · Last modified: 2018/11/07 22:00 by david