Pre-processing and Quality Control Tutorial on Summit

This lesson is a practical run-through of the first steps in RNA-seq processing within a high-performance computing environment.

Helpful References

Not covered in this lesson

  • Downloading, installing software
  • Starting from a published source, downloading and formatting data (such as raw fastq).

Summit: the HPC environment

Don't have an account on Summit? Request one at https://rcamp.rc.colorado.edu/accounts/account-request/create/verify/csu

You can follow this tutorial once your account is set up.

For those with summit access

By now, you should all be able to log onto the Summit supercomputer from the command line.

$ ssh myname@colostate.edu@login.rc.colorado.edu

Notice that it is colostate in myname@colostate.edu, but colorado in login.rc.colorado.edu.

Password:

At the Password: prompt, type your CSU eID password followed by “,push”. For example, if your password is 'cranberries', you must type 'cranberries,push'.

You will then get a request on your phone's Duo Mobile app. Approve it to complete the login.

Next, log in directly to a compile node:

$ ssh scompile

Setting up

After you have:

  • logged onto summit
  • logged onto scompile using ssh scompile

you should be in your home directory on a compile node.

1) Change to the directory we made last time:

$ cd /scratch/summit/$USER
$ cd DSCI512_RNAseq
$ cd PROJ01_testsummit
$ cd summit-rna-seq-setup
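(If you prefer, you can jump straight there in one step by joining those directory names into a single path:)

$ cd /scratch/summit/$USER/DSCI512_RNAseq/PROJ01_testsummit/summit-rna-seq-setup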

2) Update the repository

When you're working on a code repository with someone, best practice is to update as soon as you start working on it to incorporate changes.

$ git pull

3) Rerun setup

Just to make sure everything is still working…

$ make setup

4) Raise your hand if there are failures

If anyone's job submission failed, we are still waiting for support staff to update those accounts.

Please work with a neighbor if it doesn't work for you. I can take you through this tutorial during office hours once your account is working.

Directory Structure

In the name of best practices, we're going to use subdirectories to keep things organized. It might not seem necessary all the time, and it takes some discipline to get used to, but it pays off in the end.

summit-rna-seq-setup/ (where we type commands)
├── 01_input -> link to Erin's input directory
├── 02_output -> link to your scratch directory
├── 03_scripts
│   ├── fastqc.sbatch
│   ├── number_of_reads.sbatch
│   ├── template.sbatch
│   ├── trimmomatic_array.sbatch
│   └── trimmomatic.sbatch
├── 04_logs
├── activate.bashrc
├── bin
│   ├── bedtools -> link to bedtools installation
│   ├── FastQC -> link to FastQC installation
│   ├── hisat2 -> link to hisat2 installation
│   └── subread -> link to subread installation
├── Makefile
├── README.txt
└── Trimmomatic-0.36
    ├── adapters
    ├── LICENSE
    └── trimmomatic-0.36.jar
  • Benefits:
    • data stays organized,
    • the workflow has a logical progression
  • Drawbacks:
    • you have to keep track of the subdirectories
    • and use them consistently

Input Data

The setup script has linked data in 01_input. Let's look at it:

$ ls 01_input/
AceticAcid_experiment_metadata.txt 
EO070_S2_R1_001_test.fastq.gz 
EO070_S2_R2_001_test.fastq.gz
metadata_mouse.txt 
short_AceticAcidMetadata.txt
SRR3567551_1.fastq SRR3567551_2.fastq SRR3567552_1.fastq SRR3567552_2.fastq
SRR3567554_1.fastq SRR3567554_2.fastq SRR3567555_1.fastq SRR3567555_2.fastq
SRR3567637_1.fastq SRR3567637_2.fastq SRR3567638_1.fastq SRR3567638_2.fastq
SRR3567639_1.fastq SRR3567639_2.fastq SRR3567657_1.fastq SRR3567657_2.fastq
SRR3567674_1.fastq SRR3567674_2.fastq SRR3567676_1.fastq SRR3567676_2.fastq
SRR3567677_1.fastq SRR3567677_2.fastq SRR3567679_1.fastq SRR3567679_2.fastq
test.txt

This directory contains sequence reads from an RNA-seq experiment testing different growth conditions on yeast.

You can see a more specific description in 01_input/AceticAcid_experiment_metadata.txt:

$ more 01_input/AceticAcid_experiment_metadata.txt 
SRR3567551_1.fastq.gz	SRR3567551_2.fastq.gz	sample02	CK45-1	untreated	45min	1
SRR3567552_1.fastq.gz	SRR3567552_2.fastq.gz	sample03	CK45-2	untreated	45min	2
SRR3567554_1.fastq.gz	SRR3567554_2.fastq.gz	sample04	Ac45-1	aceticAcidTreated	45min	1
SRR3567555_1.fastq.gz	SRR3567555_2.fastq.gz	sample05	Ac45-2	aceticAcidTreated	45min	2
SRR3567637_1.fastq.gz	SRR3567637_2.fastq.gz	sample07	CK120-1	untreated	120min	1
SRR3567638_1.fastq.gz	SRR3567638_2.fastq.gz	sample08	CK120-2	untreated	120min	2
SRR3567639_1.fastq.gz	SRR3567639_2.fastq.gz	sample10	Ac120-1	aceticAcidTreated	120min	1
SRR3567657_1.fastq.gz	SRR3567657_2.fastq.gz	sample11	Ac120-2	aceticAcidTreated	120min	2
SRR3567674_1.fastq.gz	SRR3567674_2.fastq.gz	sample13	CK200-1	untreated	200min	1
SRR3567676_1.fastq.gz	SRR3567676_2.fastq.gz	sample14	CK200-2	untreated	200min	2
SRR3567677_1.fastq.gz	SRR3567677_2.fastq.gz	sample16	Ac200-1	aceticAcidTreated	200min	1
SRR3567679_1.fastq.gz	SRR3567679_2.fastq.gz	sample18	Ac200-2	aceticAcidTreated	200min	2

Notice that all of the read files end in “.fastq”, and that each sample has both a _1 file and a _2 file (the two reads of each pair).

FASTQ Format

(verbatim from Wikipedia)

A FASTQ file normally uses four lines per sequence.

  1. Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
  2. Line 2 is the raw sequence letters.
  3. Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
  4. Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.

A FASTQ file containing a single sequence might look like this:

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

So let's look at one of ours:

$ head -4 01_input/SRR3567552_1.fastq 
@SRR3567552.1 HISEQ:222:C3RTWACXX:1:1101:1411:2064 length=100
GTGCTTGTGGACTGCTTGGTGGGGCTTGCTCTGCTAGGCGGACTACTTGCGTGCCTTGTTGTAGACGGCCTTGGTAGGTCTCTTGTAGACCGTCGCTTGC
+SRR3567552.1 HISEQ:222:C3RTWACXX:1:1101:1411:2064 length=100
CBCFFFFFHHHHHJJJJJJCGIJIJJJJJJJJJJJIJJJJJGIIHHHHHHFFDDDEEDDDDDDDEDDDBDBDDDCCDD@CDDDDDDCDDDDDBBDDDDD

This shows a single read: its sequence, additional information in the header, and a symbol representing the quality assigned at each position.

It's pretty big:

$ ls -lh 01_input/SRR3567552_1.fastq 
-rw-r--r-- 1 erinnish@colostate.edu erinnishgrp@colostate.edu 6.6G Oct 16 15:14 01_input/SRR3567552_1.fastq

This tells us the number of bytes in the file, but not how many sequences there are.

What Linux command can we use to tell us how many reads there are in the file?

This is a job for SLURM!!

Our first job submission

We are going to take advantage of wc -l, a utility that counts the lines in a file, to tell us how many sequences there are.

I have written a simple script for that purpose. Let's try it:

$ sbatch 03_scripts/number_of_reads.sbatch 01_input/SRR3567552_1.fastq

$ sacct -j <job number>
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- --------
1388630      number_of+ shas-test+ csu-gener+          1    PENDING      0:0

Running the programs through an sbatch script

Let's look at the script we just ran:

$ more 03_scripts/number_of_reads.sbatch
#!/usr/bin/env bash
#SBATCH --nodes=1  # access with $SLURM_NNODES in the script
#SBATCH --ntasks=1  # access with $SLURM_NTASKS in the script
#SBATCH --time=0:01:00
#SBATCH --qos=testing # change to "normal" when done testing
#SBATCH --partition=shas-testing # remove "-testing" when done testing
#SBATCH --output=numreads-%j.out

# 4 lines per read in FASTQ format
numlines=$(wc -l $1 | cut -f1 -d ' ')
numreads=$(( numlines / 4 ))
echo "$1 has $numreads reads."

Summit was still recovering from maintenance yesterday, so my job just sat in the queue. Once it runs, I expect a file called numreads-1388630.out to be created in the directory, containing the line:

01_input/SRR3567552_1.fastq has 20654219 reads.

Common SBATCH header directives go at the top (under the #!/usr/bin/env bash shebang):

example syntax          description
--nodes=1               Number of nodes used by your job.
--ntasks=1              Number of tasks or threads used by your job. Accessible in your script via $SLURM_NTASKS.
--time=4:00:00          Job time limit, in H:mm:ss.
--partition=shas        shas = Haswell nodes; shas-testing = Haswell testing nodes. See the full list.
--qos=normal            Quality of service. For us: normal or testing. See the full list.
--output=slurm-%j.out   The name format of the output file. %j = the job number assigned by SLURM.
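Putting those directives together, a minimal header for a short test job might look like this (a sketch; the time, partition, and qos values here are placeholders to adjust for real runs):

#!/usr/bin/env bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=0:05:00                # short placeholder time limit
#SBATCH --partition=shas-testing      # switch to shas for real runs
#SBATCH --qos=testing                 # switch to normal for real runs
#SBATCH --output=slurm-%j.out

# ...your commands go here...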

Checking on your jobs

Here are some things I tried while waiting for my job to run:

$ sacct -j 1388630
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
1388630      number_of+ shas-test+ csu-gener+          1    PENDING      0:0 

$ squeue -u $USER --start
  JOBID PARTITION     NAME     USER ST          START_TIME  NODES SCHEDNODES           NODELIST(REASON)
1388630 shas-test number_o dcking@c PD 2018-11-08T00:49:00      1 shas0108            (Resources)

Running Trimmomatic

Java is a powerful language that is portable across systems.

But Java programs can be tricky to run from the command line: you have to pass the program (a .jar file) as an argument to the java runtime.

1) Load the java runtime into your environment

$ module load jdk

2) Run Trimmomatic by passing the java archive (.jar file) to java:

$ java -jar Trimmomatic-0.36/trimmomatic-0.36.jar
Usage: 
       PE [-version] [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-quiet] [-validatePairs] [-basein <inputBase> | <inputFile1> <inputFile2>] [-baseout <outputBase> | <outputFile1P> <outputFile1U> <outputFile2P> <outputFile2U>] <trimmer1>...
   or: 
       SE [-version] [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-quiet] <inputFile> <outputFile> <trimmer1>...
   or: 
       -version

Hardest boss of RNA-seq Analysis?

Running Trimmomatic may be the hardest boss of RNA-seq analysis, and it comes as the very first step!

Getting the command for Trimmomatic to work is not easy. So don't be discouraged. Most programs are not this hard to set up.

Here is the Quick Start example command given in the docs:

java -jar trimmomatic-0.35.jar PE -phred33 input_forward.fq.gz input_reverse.fq.gz 
output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz 
output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

Let's chip away at this monster.

File parameters

  • Paired-end (PE) mode requires naming two input fastq files,
  • and then naming a “trimmed” and an “unpaired” output file for each of them.
  • That's six file parameters!
  • Two already existing, and four to be created by Trimmomatic.

Trimmer

Looking at the documentation, here is an example argument for the trimmer:

ILLUMINACLIP:TruSeq3-PE.fa:2:30:10

We have to modify the file path for our directory structure:

ILLUMINACLIP:Trimmomatic-0.36/adapters/TruSeq3-PE.fa:2:30:10

This refers to an adapter file included with the Trimmomatic software. To see all of them, do:

$ ls Trimmomatic-0.36/adapters/
NexteraPE-PE.fa  TruSeq2-PE.fa  TruSeq2-SE.fa  TruSeq3-PE-2.fa  TruSeq3-PE.fa  TruSeq3-SE.fa

The numbers following the filename specify further parameters.

Clipping parameters

The documentation gives the general form of this argument:

ILLUMINACLIP:<fastaWithAdaptersEtc>:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold>
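Reading our modified example against that template:

  • <fastaWithAdaptersEtc> = Trimmomatic-0.36/adapters/TruSeq3-PE.fa
  • <seed mismatches> = 2
  • <palindrome clip threshold> = 30
  • <simple clip threshold> = 10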

See the documentation about Trimmomatic for the meaning of the individual arguments.

The specific values of these settings depend on the sequencing platform and the adapters that were used. However, these example settings may well be sufficient; we will judge by the success of the alignment (covered in a future lecture).

Approach: Creating a wrapper script

We need some techniques for handling these complex arguments in a reliable way.

  • Take notes
  • Save parameters in a script

Each run of trimmomatic needs to:

  1. Take a single file argument ending in _1.fastq, and assume that its mate pair, ending in _2.fastq, exists
  2. For each input file, name all of the output files (trimmed and unpaired)
  3. Respect the 01_input and 02_output/02_trimmomatic directory structure
  4. Pass the complex parameters argument as a single string

For points 1-3, here are all of the filenames and directories:

           input file            output file (trimmed)         output file (unpaired)
_1 files   SRR3567551_1.fastq    SRR3567551_1.trimmed.fastq    SRR3567551_1.unpaired.fastq
_2 files   SRR3567551_2.fastq    SRR3567551_2.trimmed.fastq    SRR3567551_2.unpaired.fastq
directory  01_input/             02_output/02_trimmomatic/     02_output/02_trimmomatic/

File part of the command

Approach 1: write out the whole thing manually…

01_input/SRR3567551_1.fastq 01_input/SRR3567551_2.fastq 02_output/02_trimmomatic/SRR3567551_1.trimmed.fastq 02_output/02_trimmomatic/SRR3567551_1.unpaired.fastq 02_output/02_trimmomatic/SRR3567551_2.trimmed.fastq 02_output/02_trimmomatic/SRR3567551_2.unpaired.fastq

Approach 2: script the filename variations:

infile=$1 # like 01_input/SRR1234567_1.fastq
outdir=02_output/02_trimmomatic

in1=$infile
# make the mate pair filename by replacing _1 with _2
in2=${infile/_1.fastq/_2.fastq}

# strip the input directory with 'basename'
basename1=$(basename $in1)
basename2=$(basename $in2)
# make all the file output names
out1paired=${basename1/.fastq/.trimmed.fastq}    # makes: SRR1234567_1.trimmed.fastq
out1unpaired=${basename1/.fastq/.unpaired.fastq} # makes: SRR1234567_1.unpaired.fastq
out2paired=${basename2/.fastq/.trimmed.fastq}    # makes: SRR1234567_2.trimmed.fastq
out2unpaired=${basename2/.fastq/.unpaired.fastq} # makes: SRR1234567_2.unpaired.fastq

Approach 3: somewhere in between. Find a way to simplify Approach 1 using scripting; it doesn't need to be as complex as my version.
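One possible middle ground (just a sketch, not the script used in this course) is to hard-code the sample ID and build the six paths from it:

# hypothetical simplified version: edit 'sample' for each run
sample=SRR3567551
indir=01_input
outdir=02_output/02_trimmomatic

in1=$indir/${sample}_1.fastq
in2=$indir/${sample}_2.fastq
out1paired=$outdir/${sample}_1.trimmed.fastq
out1unpaired=$outdir/${sample}_1.unpaired.fastq
out2paired=$outdir/${sample}_2.trimmed.fastq
out2unpaired=$outdir/${sample}_2.unpaired.fastq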

For point 4, I have put the suggested values into a single variable called trim:

# This large chunk of params comes from the website demo, but...
# notice the Trimmomatic-0.36/adapters/TruSeq3-PE.fa path given to ILLUMINACLIP
trim="ILLUMINACLIP:Trimmomatic-0.36/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36"

In the final command, the value of $trim will be substituted in.

Final form (Approach 1 - a lot of typing):

java -jar Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads $SLURM_NTASKS -phred33 01_input/SRR3567551_1.fastq 01_input/SRR3567551_2.fastq 02_output/02_trimmomatic/SRR3567551_1.trimmed.fastq 02_output/02_trimmomatic/SRR3567551_1.unpaired.fastq 02_output/02_trimmomatic/SRR3567551_2.trimmed.fastq 02_output/02_trimmomatic/SRR3567551_2.unpaired.fastq $trim

Final form (Approach 2 - a lot of scripting):

java -jar Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads $SLURM_NTASKS -phred33 $in1 $in2 $outdir/$out1paired $outdir/$out1unpaired  $outdir/$out2paired $outdir/$out2unpaired $trim
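Putting all of the pieces together, a complete wrapper would look roughly like the sketch below. The SBATCH header values (ntasks, time, output name) are my assumptions; the actual 03_scripts/trimmomatic.sbatch may differ in those details.

#!/usr/bin/env bash
#SBATCH --nodes=1
#SBATCH --ntasks=6                    # assumed; more tasks = faster trimming
#SBATCH --time=0:10:00                # assumed time limit
#SBATCH --partition=shas
#SBATCH --qos=normal
#SBATCH --output=trimmomatic-%j.out

module load jdk

infile=$1                             # like 01_input/SRR3567551_1.fastq
outdir=02_output/02_trimmomatic

in1=$infile
in2=${infile/_1.fastq/_2.fastq}       # mate-pair filename

basename1=$(basename $in1)
basename2=$(basename $in2)
out1paired=${basename1/.fastq/.trimmed.fastq}
out1unpaired=${basename1/.fastq/.unpaired.fastq}
out2paired=${basename2/.fastq/.trimmed.fastq}
out2unpaired=${basename2/.fastq/.unpaired.fastq}

trim="ILLUMINACLIP:Trimmomatic-0.36/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36"

java -jar Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads $SLURM_NTASKS -phred33 \
    $in1 $in2 \
    $outdir/$out1paired $outdir/$out1unpaired \
    $outdir/$out2paired $outdir/$out2unpaired \
    $trim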

To run my sbatch version, do:

$ sbatch 03_scripts/trimmomatic.sbatch 01_input/SRR3567551_1.fastq 

Let's see what it did.

$ wc -l 02_output/02_trimmomatic/SRR3567551_?.trimmed.fastq
   69507988 02_output/02_trimmomatic/SRR3567551_1.trimmed.fastq
   69507988 02_output/02_trimmomatic/SRR3567551_2.trimmed.fastq

The starting number of sequences was larger:

$ wc -l 01_input/SRR3567551_?.fastq
   75949800 01_input/SRR3567551_1.fastq
   75949800 01_input/SRR3567551_2.fastq

A log file was generated (trimmomatic-1380117.out). It says:

Input Read Pairs: 18987450 Both Surviving: 17376997 (91.52%) Forward Only Surviving: 1529287 (8.05%) Reverse Only Surviving: 51126 (0.27%) Dropped: 30040 (0.16%)

These counts match the log: each read is 4 lines, so 18,987,450 input pairs × 4 = 75,949,800 lines per input file, and 17,376,997 surviving pairs × 4 = 69,507,988 lines per trimmed file.

Performance

Trimmomatic is very parallelizable and benefits from increasing --ntasks in the #SBATCH header section.

I ran the same job with several values of --ntasks:

ntasks    time (minutes:seconds)
2         14:52
4         1:12
6         0:33
8         0:29
10        0:27

Notice, however, the diminishing returns above 6 tasks.

Array jobs

Lastly, the HPC environment can be used to run multiple jobs at the same time. SLURM has a facility to run multiple jobs with a single script, using a special variable SLURM_ARRAY_TASK_ID to change the behavior of each script.

We will use SLURM_ARRAY_TASK_ID to specify which input file to use, in place of putting it on the command line.

In our 01_input directory there are 24 fastq files: 12 _1 files and their 12 _2 mates. Here are the 12 _1 files:

SRR3567551_1.fastq SRR3567552_1.fastq SRR3567554_1.fastq SRR3567555_1.fastq SRR3567637_1.fastq SRR3567638_1.fastq SRR3567639_1.fastq SRR3567657_1.fastq SRR3567674_1.fastq SRR3567676_1.fastq SRR3567677_1.fastq SRR3567679_1.fastq

Sometimes the numbering in file names can be used directly as the array index, but the SRR IDs above are not consecutive.

Instead, let's put them in a bash array, which we can index with consecutive numbers:

files=(SRR3567551_1.fastq SRR3567552_1.fastq SRR3567554_1.fastq SRR3567555_1.fastq SRR3567637_1.fastq SRR3567638_1.fastq SRR3567639_1.fastq SRR3567657_1.fastq SRR3567674_1.fastq SRR3567676_1.fastq SRR3567677_1.fastq SRR3567679_1.fastq)

We can now access any of the file names using

${files[$i]}

…for i from 0 to 11, inclusive.
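For example, the first and last entries are:

$ echo ${files[0]}
SRR3567551_1.fastq
$ echo ${files[11]}
SRR3567679_1.fastq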

The SLURM array syntax provides a convenient mechanism to do this. Here is the last SBATCH header line in 03_scripts/trimmomatic_array.sbatch:

#SBATCH --array=0-11
# this script will be invoked 12 times, 
# each with a different value assigned to 
# SLURM_ARRAY_TASK_ID 
# within the range 0-11 (inclusive).

One additional change makes our previous script work with the new files array:

infile=01_input/${files[$SLURM_ARRAY_TASK_ID]}

This substitutes for the individual filename being specified on the command line.
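Taken together, the relevant parts of the array script look roughly like this (a sketch; see 03_scripts/trimmomatic_array.sbatch for the real thing):

#SBATCH --array=0-11

files=(SRR3567551_1.fastq SRR3567552_1.fastq SRR3567554_1.fastq SRR3567555_1.fastq \
       SRR3567637_1.fastq SRR3567638_1.fastq SRR3567639_1.fastq SRR3567657_1.fastq \
       SRR3567674_1.fastq SRR3567676_1.fastq SRR3567677_1.fastq SRR3567679_1.fastq)

# pick this task's input file instead of reading it from the command line
infile=01_input/${files[$SLURM_ARRAY_TASK_ID]}

# ...the rest is the same as trimmomatic.sbatch...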

$ sbatch 03_scripts/trimmomatic_array.sbatch

All twelve jobs are queued:

$ squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    1380162_[0-11]      shas trimmoma dcking@c PD       0:00      1 (None)

Moments later (depending on how busy the queue is), all of the jobs had finished. I checked with sa, an alias for sacct described in the usage tip below:

$ sa -j 1380162
       JobID    JobName  AllocCPUS      State ExitCode    Elapsed  Timelimit              Submit               Start                 End 
------------ ---------- ---------- ---------- -------- ---------- ---------- ------------------- ------------------- ------------------- 
1380162_11   trimmomat+          6  COMPLETED      0:0   00:00:43   00:04:00 2018-11-05T17:57:50 2018-11-05T17:59:12 2018-11-05T17:59:55 
1380162_0    trimmomat+          6  COMPLETED      0:0   00:00:32   00:04:00 2018-11-05T17:57:50 2018-11-05T17:58:09 2018-11-05T17:58:41 
1380162_1    trimmomat+          6  COMPLETED      0:0   00:00:50   00:04:00 2018-11-05T17:57:50 2018-11-05T17:58:09 2018-11-05T17:58:59 
1380162_2    trimmomat+          6  COMPLETED      0:0   00:00:43   00:04:00 2018-11-05T17:57:50 2018-11-05T17:58:09 2018-11-05T17:58:52 
1380162_3    trimmomat+          6  COMPLETED      0:0   00:00:41   00:04:00 2018-11-05T17:57:50 2018-11-05T17:58:09 2018-11-05T17:58:50 
1380162_4    trimmomat+          6  COMPLETED      0:0   00:00:40   00:04:00 2018-11-05T17:57:50 2018-11-05T17:58:09 2018-11-05T17:58:49 
1380162_5    trimmomat+          6  COMPLETED      0:0   00:01:06   00:04:00 2018-11-05T17:57:50 2018-11-05T17:58:41 2018-11-05T17:59:47 
1380162_6    trimmomat+          6  COMPLETED      0:0   00:00:32   00:04:00 2018-11-05T17:57:50 2018-11-05T17:58:41 2018-11-05T17:59:13 
1380162_7    trimmomat+          6  COMPLETED      0:0   00:00:44   00:04:00 2018-11-05T17:57:50 2018-11-05T17:58:41 2018-11-05T17:59:25 
1380162_8    trimmomat+          6  COMPLETED      0:0   00:00:55   00:04:00 2018-11-05T17:57:50 2018-11-05T17:58:41 2018-11-05T17:59:36 
1380162_9    trimmomat+          6  COMPLETED      0:0   00:00:41   00:04:00 2018-11-05T17:57:50 2018-11-05T17:58:41 2018-11-05T17:59:22 
1380162_10   trimmomat+          6  COMPLETED      0:0   00:00:28   00:04:00 2018-11-05T17:57:50 2018-11-05T17:59:12 2018-11-05T17:59:40 

According to this output, the longest any job took was 1 minute, 6 seconds (with 6 cores).

Five jobs started 19 seconds after I submitted (better than the expected wait time), five more started 32 seconds after that, and the last two started 31 seconds after those.

All of the data was processed within 2 minutes and 5 seconds of submission! It took longer to write this section!

Let's run FastQC!

Our next job will be FastQC, since it is an easier command to run.

Look at 03_scripts/fastqc.sbatch. Here is the “work” section of the script:

# set up your environment, PATH variable, modules
source activate.bashrc
 
# run the program here
fastq=$1 # pass the fastq file as the 1st argument
outdir=$2 # pass the output directory as the 2nd argument
# fastqc will crash with usage message if the above two are not set
fastqc -t $SLURM_NTASKS -d $SLURM_SCRATCH -o $outdir $fastq

The key points before the command:

  1. It sources activate.bashrc
  2. It reads two arguments from the command line: the input fastq file, and the output directory (we will supply our linked output directory, which points to our scratch space).

Key points inside the command:

  1. It uses $SLURM_NTASKS in conjunction with -t.
  2. It uses $SLURM_SCRATCH to set a location for temporary files. This isn't always necessary, but take advantage of it when you can: some programs automatically use /tmp or your home directory, which can push you over your quota.

You can run fastqc -h or look at examples online to see what other options are available.
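For reference, the whole script is roughly the sketch below; the header values (ntasks, time, partition) are my guesses based on the job output shown later in this section, and the real 03_scripts/fastqc.sbatch may differ.

#!/usr/bin/env bash
#SBATCH --nodes=1
#SBATCH --ntasks=2                    # assumed; matches AllocCPUS in the sacct output below
#SBATCH --time=0:10:00                # assumed; matches the Timelimit shown below
#SBATCH --partition=shas
#SBATCH --qos=normal
#SBATCH --output=slurm-%j.out

# set up your environment, PATH variable, modules
source activate.bashrc

# run the program
fastq=$1    # the fastq file, passed as the 1st argument
outdir=$2   # the output directory, passed as the 2nd argument
fastqc -t $SLURM_NTASKS -d $SLURM_SCRATCH -o $outdir $fastq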

Here is how you run this script on SRR3567551_1.fastq:

$ sbatch 03_scripts/fastqc.sbatch 01_input/SRR3567551_1.fastq 02_output/01_fastqc

Notice that we included the directories in our arguments:

  • 01_input/SRR3567551_1.fastq becomes $1
  • 02_output/01_fastqc becomes $2

Checking the Job Status

The command above submitted the job to the Slurm job manager. To see it in the queue, do:

$ squeue -u $USER
    JOBID PARTITION     NAME     USER ST  TIME  NODES NODELIST(REASON)
  1378164 shas fastqc.s dcking@c PD       0:00      1 (Resources)

This shows my job has been queued and is waiting for the requested resources to become available.

Sometime later (while I was writing this document), the job finished, creating slurm-1378164.out, the log produced by the program. Its contents are:

[/var/spool/slurmd/job1378164/slurm_script] fastqc.sbatch 01_input/SRR3567551_1.fastq 02_output/01_fastqc
Mon Nov  5 11:45:04 MST 2018
Started analysis of SRR3567551_1.fastq
Approx 5% complete for SRR3567551_1.fastq
Approx 10% complete for SRR3567551_1.fastq
Approx 15% complete for SRR3567551_1.fastq
Approx 20% complete for SRR3567551_1.fastq
Approx 25% complete for SRR3567551_1.fastq
Approx 30% complete for SRR3567551_1.fastq
Approx 35% complete for SRR3567551_1.fastq
Approx 40% complete for SRR3567551_1.fastq
Approx 45% complete for SRR3567551_1.fastq
Approx 50% complete for SRR3567551_1.fastq
Approx 55% complete for SRR3567551_1.fastq
Approx 60% complete for SRR3567551_1.fastq
Approx 65% complete for SRR3567551_1.fastq
Approx 70% complete for SRR3567551_1.fastq
Approx 75% complete for SRR3567551_1.fastq
Approx 80% complete for SRR3567551_1.fastq
Approx 85% complete for SRR3567551_1.fastq
Approx 90% complete for SRR3567551_1.fastq
Approx 95% complete for SRR3567551_1.fastq
Analysis complete for SRR3567551_1.fastq

Usage tip


I have set up some helpful aliases to use sacct and squeue more efficiently. They are:

$ alias sa='sacct -X --format JobID,JobName,AllocCPUS,State,ExitCode,Elapsed,TimeLimit,Submit,Start,End'
$ alias sq='squeue -u $USER'

If you want to learn more about making aliases and setting them in your ~/.bashrc, read this section in the Advanced BASH scripting guide.
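A quick way to make them permanent (one approach; any editor works) is to append the two alias lines to the end of your ~/.bashrc and reload it:

$ nano ~/.bashrc       # paste the two alias lines at the end, save, exit
$ source ~/.bashrc     # reload so the aliases take effect in the current session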

The --format argument to sacct is particularly useful: you can fine-tune the information you want about your job. I have included Elapsed in order to see how long the job took:

$ sa -j 1378164
       JobID    JobName  AllocCPUS      State ExitCode    Elapsed  Timelimit              Submit               Start                 End 
------------ ---------- ---------- ---------- -------- ---------- ---------- ------------------- ------------------- ------------------- 
1378164      fastqc.sb+          2  COMPLETED      0:0   00:01:51   00:10:00 2018-11-05T11:44:58 2018-11-05T11:45:04 2018-11-05T11:46:55 

You can see that the Elapsed time was 1:51. Pretty fast!!

See the full list of options with:

$ man sacct

Notice also that AllocCPUS matches the --ntasks we requested in the header of the sbatch script and used in the command.

You can increase --ntasks to make certain jobs go faster (not all programs have a -t option), but realize that requesting more resources will increase your wait time in the queue.

FastQC Output

Inside 02_output/01_fastqc/ is a zip file named SRR3567551_1_fastqc.zip. It contains HTML and graphical content that we can't view on Summit, but there is also a text file called summary.txt. Mine says:

PASS    Basic Statistics        SRR3567551_1.fastq
PASS    Per base sequence quality       SRR3567551_1.fastq
WARN    Per tile sequence quality       SRR3567551_1.fastq
PASS    Per sequence quality scores     SRR3567551_1.fastq
WARN    Per base sequence content       SRR3567551_1.fastq
PASS    Per sequence GC content SRR3567551_1.fastq
PASS    Per base N content      SRR3567551_1.fastq
PASS    Sequence Length Distribution    SRR3567551_1.fastq
FAIL    Sequence Duplication Levels     SRR3567551_1.fastq
WARN    Overrepresented sequences       SRR3567551_1.fastq
PASS    Adapter Content SRR3567551_1.fastq
FAIL    Kmer Content    SRR3567551_1.fastq
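If you want to read summary.txt without leaving Summit, and assuming unzip is available on the compile node, you can unpack the zip in place:

$ cd 02_output/01_fastqc/
$ unzip SRR3567551_1_fastqc.zip
$ more SRR3567551_1_fastqc/summary.txt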

The visual output is more informative. I downloaded the zip file to my own computer, unzipped it, and double-clicked fastqc_report.html to view the graphs it creates.

These results are very good, though a few modules are flagged “FAIL”.

Although FastQC has “FAILED” those modules, we don't yet know how negatively that will affect the alignment (which is our ultimate measure of success). Mainly, we see that the sequence read quality is consistently very high, so we will move on to Trimmomatic to filter adapters and unmatched read pairs.

Wrapping up Quality Control and Preprocessing

This tutorial has taken you through the quality control and trimming steps of RNA-seq. The output from Trimmomatic serves as the input to the aligners, which will be covered in a later class.
