pre-processing_quality_control

Differences

This shows you the differences between two versions of the page.

pre-processing_quality_control [2018/11/07 22:10]
david [Not covered in this lesson]
pre-processing_quality_control [2018/11/08 10:46] (current)
david [Running the programs through an sbatch script]
Line 3: Line 3:
  
 This lesson is a practical run-through of the first steps in RNA-seq processing within a high-performance computing environment. This lesson is a practical run-through of the first steps in RNA-seq processing within a high-performance computing environment.
 +
 +==== Papers ====
 +  * [[ https://link.springer.com/content/pdf/10.1007%2F978-1-4939-2291-8_8.pdf |  Quality control and Phred scores ]] (Assigned reading on the schedule)
 +  * [[ https://www.ncbi.nlm.nih.gov/pubmed/24695404 | Trimmomatic reference ]]
 +  * [[https://www.bioinformatics.babraham.ac.uk/publications.html | FastQC publications ]]
  
 ==== Helpful References ==== ==== Helpful References ====
-  * Quality control and Phred scores [[ https://link.springer.com/content/pdf/10.1007%2F978-1-4939-2291-8_8.pdf | Assigned reading on the schedule]] +  * Trimmomatic [[http://www.usadellab.org/cms/?page=trimmomatic | manual ]] 
-  * Trimmomatic [[http://www.usadellab.org/cms/?page=trimmomatic]] +  * FastQC [[https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ | manual]]
-  * FastQC [[https://www.bioinformatics.babraham.ac.uk/projects/fastqc/]]+
   * Sbatch command reference [[https://slurm.schedmd.com/sbatch.html]]   * Sbatch command reference [[https://slurm.schedmd.com/sbatch.html]]
   * Summit specs [[https://www.colorado.edu/rc/resources/summit/specifications]]   * Summit specs [[https://www.colorado.edu/rc/resources/summit/specifications]]
   * More complex Slurm jobs using dependencies [[https://hpc.nih.gov/docs/job_dependencies.html]]   * More complex Slurm jobs using dependencies [[https://hpc.nih.gov/docs/job_dependencies.html]]
 +
 +
  
 ==== Not covered in this lesson ==== ==== Not covered in this lesson ====
   * Downloading, installing software   * Downloading, installing software
   * Downloading data from SRA [[https://www.ncbi.nlm.nih.gov/sra/SRX4314530[accn] | example ]]   * Downloading data from SRA [[https://www.ncbi.nlm.nih.gov/sra/SRX4314530[accn] | example ]]
 +
 +==== Additional Points ====
 +  * Erin has posted a homework assignment for this lecture on the schedule
 +  * Code repositories cannot contain large datasets, so the data files in the repository are symbolic links
 +  * The mechanism for using installed software may differ between my lecture and Erin's, but you probably won't see the difference
 +
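To make the point about symbolic links concrete, here is a minimal sketch of a small link standing in for a large data file. Every path and file name in it is hypothetical (a temporary directory and a made-up one-read FASTQ file); it is not the actual layout of the course data on Summit.

```shell
#!/usr/bin/env bash
# All paths here are hypothetical stand-ins, created in a temporary directory.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/shared_data" "$demo_dir/repo/01_input"

# Stand-in for a large dataset that lives outside the repository.
printf '@read1\nACGT\n+\nIIII\n' > "$demo_dir/shared_data/sample_1.fastq"

# The repository directory holds only a link that points at the real file.
ln -s "$demo_dir/shared_data/sample_1.fastq" "$demo_dir/repo/01_input/sample_1.fastq"

# ls -l displays the entry as "name -> target"; readlink prints just the target.
ls -l "$demo_dir/repo/01_input"
link_target=$(readlink "$demo_dir/repo/01_input/sample_1.fastq")
echo "$link_target"
```

Programs that read the link see the contents of the file it points to, so commands such as ''head'' behave the same either way.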
 ===== Summit: the HPC environment ===== ===== Summit: the HPC environment =====
  
Line 65: Line 77:
 <​code>​ <​code>​
 $ git pull $ git pull
 +remote: Enumerating objects: 12, done.
 +remote: Counting objects: 100% (12/12), done.
 +remote: Compressing objects: 100% (4/4), done.
 +remote: Total 8 (delta 6), reused 6 (delta 4), pack-reused 0
 +Unpacking objects: 100% (8/8), done.
 +From github.com:meekrob/summit-rna-seq-setup
 +   51d037e..e05225d  master     -> origin/master
 +Updating 51d037e..e05225d
 +Fast-forward
 + 03_scripts/number_of_reads.sbatch |  3 ++-
 + 03_scripts/trimmomatic.sbatch     | 19 +++++++++++--------
 + 2 files changed, 13 insertions(+), 9 deletions(-)
 </​code>​ </​code>​
 +
 +Check that everything is OK.
 +
 +<​code>​
 +$ make setup
 +</​code>​
 +
 +<​code>​
 +$ ls
 +01_input/ ​  
 +02_output/ ​
 +03_scripts/​
 +04_logs/
 +Makefile
 +README.txt
 +Trimmomatic-0.36/​
 +activate.bashrc
 +bin/
 +</​code>​
 +
 +===== A great "​get"​ from git =====
 +
 +One of the greatest benefits of tracking changes is that you can see what you've changed.
 +
 +Let's say that, with an editor, I change line four of 03_scripts/number_of_reads.sbatch to use a 10-minute time limit instead of a 1-minute one.
 +<code>
 +$ nano 03_scripts/number_of_reads.sbatch
 +</code>
 +
 +Line four now reads:
 +<code>
 +#SBATCH --time=0:10:00
 +</code>
 +
 +I can use git to tell me what's different:
 +
 +<code>
 +$ git diff 03_scripts/number_of_reads.sbatch 
 +diff --git a/03_scripts/number_of_reads.sbatch b/03_scripts/number_of_reads.sbatch
 +index 0634563..3106155 100644
 +--- a/03_scripts/number_of_reads.sbatch
 ++++ b/03_scripts/number_of_reads.sbatch
 +@@ -1,7 +1,7 @@
 + #!/usr/bin/env bash
 + #SBATCH --nodes=1  # access with $SLURM_NNODES in the script
 + #SBATCH --ntasks=1  # access with $SLURM_NTASKS in the script
 +-#SBATCH --time=0:01:00
 ++#SBATCH --time=0:10:00
 + #SBATCH --qos=testing # change to "normal" when done testing
 + #SBATCH --partition=shas-testing # remove "-testing" when done testing
 + #SBATCH --output=numreads-%j.out
 +</code>
 +
 +What if I screwed up? I can discard the uncommitted edit and restore the last committed version:
 +
 +<code>
 +$ git checkout 03_scripts/number_of_reads.sbatch
 +</code>
 +
 +Line 4 goes back to the original!
 +<code>
 +#SBATCH --time=0:01:00
 +</code>
 +
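If you want to practice the edit, diff, and restore cycle without touching the course repository, the following is a minimal sketch that does so in a throwaway repository. The file name ''demo.sbatch'', the placeholder git identity, and the temporary directory are all assumptions made for illustration; only ''git diff'' and ''git checkout'' correspond to the commands shown in this lesson.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a throwaway repository so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"
git init --quiet
git config user.email "student@example.com"   # placeholder identity for the demo
git config user.name "Student"

# Commit a stand-in script (hypothetical name) with a one-minute time limit.
printf '#!/usr/bin/env bash\n#SBATCH --time=0:01:00\n' > demo.sbatch
git add demo.sbatch
git commit --quiet -m "Add demo script"

# Simulate the edit made in the editor: raise the limit to ten minutes.
printf '#!/usr/bin/env bash\n#SBATCH --time=0:10:00\n' > demo.sbatch

# git diff marks the old line with - and the new line with +.
changes=$(git diff demo.sbatch)
echo "$changes"

# git checkout discards the uncommitted edit and restores the committed file.
git checkout -- demo.sbatch
cat demo.sbatch
```

The ''--'' before the file name tells git that what follows is a path rather than a branch name, which avoids ambiguity if a branch happens to share the name.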
  
  
Line 169: Line 257:
 So let's look at one of ours: So let's look at one of ours:
 <​code>​ <​code>​
-$ head -4 01_input/SRR3567552_1.fastq +$ head -4 01_input/SRR3567551_1.fastq 
 @SRR3567552.1 HISEQ:222:C3RTWACXX:1:1101:1411:2064 length=100 @SRR3567552.1 HISEQ:222:C3RTWACXX:1:1101:1411:2064 length=100
 GTGCTTGTGGACTGCTTGGTGGGGCTTGCTCTGCTAGGCGGACTACTTGCGTGCCTTGTTGTAGACGGCCTTGGTAGGTCTCTTGTAGACCGTCGCTTGC GTGCTTGTGGACTGCTTGGTGGGGCTTGCTCTGCTAGGCGGACTACTTGCGTGCCTTGTTGTAGACGGCCTTGGTAGGTCTCTTGTAGACCGTCGCTTGC
Line 181: Line 269:
  
 <​code>​ <​code>​
-$ ls -lh 01_input/SRR3567552_1.fastq +$ ls -lh 01_input/SRR3567551_1.fastq 
 -rw-r--r-- 1 erinnish@colostate.edu erinnishgrp@colostate.edu 6.6G Oct 16 15:14 01_input/SRR3567552_1.fastq -rw-r--r-- 1 erinnish@colostate.edu erinnishgrp@colostate.edu 6.6G Oct 16 15:14 01_input/SRR3567552_1.fastq
 </​code>​ </​code>​
Line 199: Line 287:
  
 <​code>​ <​code>​
-$ sbatch 03_scripts/number_of_reads.sbatch 01_input/SRR3567552_1.fastq +$ sbatch 03_scripts/number_of_reads.sbatch 01_input/SRR3567551_1.fastq
 </​code>​ </​code>​
  
Line 240: Line 328:
 Summit was still recovering from maintenance yesterday, so my job just sat there, but I expect a file called ''numreads-1388630.out'' to be created in the directory, containing the line: Summit was still recovering from maintenance yesterday, so my job just sat there, but I expect a file called ''numreads-1388630.out'' to be created in the directory, containing the line:
 <code> <code>
-01_input/SRR3567552_1.fastq has 20654219 reads. +01_input/SRR3567551_1.fastq has 20654219 reads.
 </​code>​ </​code>​
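For intuition about where a number like that comes from, here is a minimal sketch of counting reads in a FASTQ file. The function ''count_reads'' and the tiny two-read demo file are hypothetical, and this is not the contents of 03_scripts/number_of_reads.sbatch, which may work differently. It assumes an uncompressed FASTQ file with exactly four lines per read, as in the ''head -4'' output shown earlier.

```shell
#!/usr/bin/env bash
# count_reads is a hypothetical helper, not the course's number_of_reads.sbatch.
# Assumption: the FASTQ file is uncompressed and uses exactly four lines per read.
count_reads() {
    local fastq=$1
    local lines
    lines=$(wc -l < "$fastq")            # total number of lines in the file
    echo "$fastq has $((lines / 4)) reads."
}

# Tiny made-up file with two reads, for illustration only.
demo_fastq=$(mktemp)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > "$demo_fastq"
count_reads "$demo_fastq"
```

Counting lines and dividing by four is safer than counting lines that begin with ''@'', because ''@'' is also a valid quality-score character and can appear at the start of a quality line.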
  
pre-processing_quality_control.1541653859.txt.gz · Last modified: 2018/11/07 22:10 by david