Last time, we wrote out each line of an RNA-seq analysis pipeline by hand. This works fine if you have a few samples and if you make no errors. As projects get bigger, the task becomes more cumbersome.
To guard against errors and to streamline large projects, we need an automation strategy. This is the heart of pipeline building.
The key is to use the information in the metadata file to drive a loop that runs the same series of commands on every sample.
Let's take a look at the metadata file for our yeast demo project:
SRR3567551_1.fastq SRR3567551_2.fastq sample01 CK45-1 untreated 45min 1
SRR3567552_1.fastq SRR3567552_2.fastq sample02 CK45-2 untreated 45min 2
SRR3567554_1.fastq SRR3567554_2.fastq sample03 Ac45-1 aceticAcidTreated 45min 1
SRR3567555_1.fastq SRR3567555_2.fastq sample04 Ac45-2 aceticAcidTreated 45min 2
SRR3567674_1.fastq SRR3567674_2.fastq sample09 CK200-1 untreated 200min 1
SRR3567676_1.fastq SRR3567676_2.fastq sample10 CK200-2 untreated 200min 2
SRR3567677_1.fastq SRR3567677_2.fastq sample11 Ac200-1 aceticAcidTreated 200min 1
SRR3567679_1.fastq SRR3567679_2.fastq sample12 Ac200-2 aceticAcidTreated 200min 2
We are going to use loop control to step through this file one row at a time, where each row describes one sample, and pass the fields of that row through the same series of stereotyped commands.
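To make the idea concrete, here is a minimal sketch of such a loop, assuming the metadata file is whitespace-delimited with one sample per row, as shown above. The file name `metadata_demo.txt` and the variable names (`fastq1`, `label`, `timepoint`, and so on) are assumptions chosen for illustration; the file itself has no header. The loop only prints what it reads; a real pipeline would run its commands in the loop body instead.

```shell
# Scratch copy of the first two metadata rows (file name is hypothetical).
workdir=$(mktemp -d)
metadata="$workdir/metadata_demo.txt"
cat > "$metadata" <<'EOF'
SRR3567551_1.fastq SRR3567551_2.fastq sample01 CK45-1 untreated 45min 1
SRR3567552_1.fastq SRR3567552_2.fastq sample02 CK45-2 untreated 45min 2
EOF

n_rows=0
# read splits each row on whitespace and assigns one field per variable.
# The variable names are assumptions; they are not defined by the file.
while read -r fastq1 fastq2 sample label treatment timepoint replicate; do
    n_rows=$((n_rows + 1))
    # A real pipeline would run its commands here; this sketch only reports.
    echo "$sample ($label): $fastq1 + $fastq2, $treatment, $timepoint, rep $replicate"
done < "$metadata"

echo "Processed $n_rows samples"
```

Because the file is fed to the loop with a redirect rather than a pipe, the counter `n_rows` is still available after the loop ends.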
Make a new project directory:
# Log into summit
$ ssh -l <eID>@colostate.edu login.rc.colorado.edu
# Switch to scompile
$ ssh scompile
# If you want to, make your alias to scheck here
$ alias scheck='squeue -u $USER'
Navigate to the space where you want to put your directory. I'm putting mine in
Make a new directory & navigate into it:
$ mkdir PROJ06_yeastDemo2
$ cd PROJ06_yeastDemo2
OK, we just started a new project, and we want to use the same input data that we used in the project PROJ04_yeastDemo. One option would be to copy the input files from that project into this one. That would work, but it would waste disk space.
A better option is to point the PROJ06_yeastDemo2/01_input directory to the PROJ04_yeastDemo/01_input directory. We can do this with a soft link, also known as a symbolic link or a shortcut.
Soft link usage
$ ln -s /path/to/original /path/to/link
# So if you are located within the directory where you want the link to exist, you can shorten it to...
$ ln -s /path/to/original .
Let's try it. Navigate to your PROJ06_yeastDemo2 directory and create a soft link to serve as your input sub-directory.
# Navigate to your project directory (skip the cd if you are already in it):
$ cd PROJ06_yeastDemo2
$ ls
# Create the soft link:
$ ln -s ../PROJ04_yeastDemo/01_input .
# Check what you got using two different ls commands:
$ ls
$ ls -alh
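If you would like to see how a soft link behaves before touching your real projects, the following sketch repeats the same steps in a throwaway directory. The names `PROJ_A`, `PROJ_B`, and `reads.fastq` are hypothetical stand-ins, not part of the demo project.

```shell
# Scratch area standing in for the real project space (all names hypothetical).
scratch=$(mktemp -d)
mkdir -p "$scratch/PROJ_A/01_input" "$scratch/PROJ_B"
echo "demo" > "$scratch/PROJ_A/01_input/reads.fastq"

# Create the link from inside the new project, as in the steps above.
cd "$scratch/PROJ_B"
ln -s ../PROJ_A/01_input .

# readlink prints the stored target; ls -l marks the link with an arrow.
link_target=$(readlink 01_input)
echo "01_input -> $link_target"
ls -l

# Files in the original directory are visible through the link.
linked_file=$(cat 01_input/reads.fastq)
echo "$linked_file"
```

Note that a relative target such as `../PROJ_A/01_input` is resolved from the location of the link itself, so the link keeps working only while the two project directories stay side by side. Using an absolute path for the original avoids that constraint.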
Create the other sub-directories
$ mkdir 02_scripts
$ mkdir 03_output
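As an optional variation, both sub-directories can be created with a single command. The sketch below runs in a scratch directory (a stand-in for PROJ06_yeastDemo2) and uses the `-p` flag, which makes `mkdir` safe to repeat if a directory already exists.

```shell
# Scratch directory standing in for PROJ06_yeastDemo2.
project=$(mktemp -d)
cd "$project"

# -p makes the command safe to repeat: existing directories are left alone.
mkdir -p 02_scripts 03_output
mkdir -p 02_scripts 03_output

ls
```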