Automating RNA-seq pipelines

Last time, we wrote out each line of an RNA-seq analysis pipeline by hand. This works fine if you have a few samples and if you make no errors. As projects get bigger, the task becomes more cumbersome.

To guard against errors and to streamline large projects, we need an automation strategy. This is the heart of pipeline building.

The key is to let the metadata file drive a loop: each row describes one sample, and the same series of commands is run once per row using the values in that row's fields.

Let's take a look at the metadata file for our C. elegans demo project:

$ more ~/DSCI512_RNAseq/PROJ03_GomezOrte1/metadata_gomezOrte.txt
SRR5832182_1.fastq      SRR5832182_2.fastq      EG01    01_Ecoli_15_1   Ecoli   15      1
SRR5832183_1.fastq      SRR5832183_2.fastq      EG02    02_Ecoli_15_2   Ecoli   15      2
SRR5832184_1.fastq      SRR5832184_2.fastq      EG03    03_Ecoli_15_3   Ecoli   15      3
SRR5832185_1.fastq      SRR5832185_2.fastq      EG04    04_Ecoli_20_1   Ecoli   20      1
SRR5832186_1.fastq      SRR5832186_2.fastq      EG05    05_Ecoli_20_2   Ecoli   20      2
SRR5832187_1.fastq      SRR5832187_2.fastq      EG06    06_Ecoli_20_3   Ecoli   20      3
SRR5832188_1.fastq      SRR5832188_2.fastq      EG07    07_Ecoli_25_2   Ecoli   25      1
SRR5832189_1.fastq      SRR5832189_2.fastq      EG08    08_Ecoli_25_1   Ecoli   25      2
SRR5832190_1.fastq      SRR5832190_2.fastq      EG09    09_Ecoli_25_3   Ecoli   25      3
SRR5832191_1.fastq      SRR5832191_2.fastq      EG10    10_Bsubtilis_15_1       Bsubtilis       15      1
SRR5832192_1.fastq      SRR5832192_2.fastq      EG11    11_Bsubtilis_15_2       Bsubtilis       15      2
SRR5832193_1.fastq      SRR5832193_2.fastq      EG12    12_Bsubtilis_15_3       Bsubtilis       15      3
SRR5832194_1.fastq      SRR5832194_2.fastq      EG13    13_Bsubtilis_20_1       Bsubtilis       20      1
SRR5832195_1.fastq      SRR5832195_2.fastq      EG14    14_Bsubtilis_20_2       Bsubtilis       20      2
SRR5832196_1.fastq      SRR5832196_2.fastq      EG15    15_Bsubtilis_20_3       Bsubtilis       20      3
SRR5832197_1.fastq      SRR5832197_2.fastq      EG16    16_Bsubtilis_25_1       Bsubtilis       25      1
SRR5832198_1.fastq      SRR5832198_2.fastq      EG17    17_Bsubtilis_25_2       Bsubtilis       25      2
SRR5832199_1.fastq      SRR5832199_2.fastq      EG18    18_Bsubtilis_25_3       Bsubtilis       25      3

We will use loop control to step through the file one row at a time, read the value in each column, and pass each sample through the same series of stereotyped commands.
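As a minimal sketch of that idea, the loop below reads a metadata file row by row and splits each row into named variables. The two-row temporary file is a hypothetical stand-in for metadata_gomezOrte.txt, the variable names (fq1, fq2, short_id, sample_name, diet, temp, rep) are guesses at what the columns mean, and the echo line is a placeholder for whatever real pipeline commands would run on each sample.

```shell
# Hypothetical two-row stand-in for metadata_gomezOrte.txt (whitespace-delimited).
metadata=$(mktemp)
printf 'SRR5832182_1.fastq\tSRR5832182_2.fastq\tEG01\t01_Ecoli_15_1\tEcoli\t15\t1\n' > "$metadata"
printf 'SRR5832183_1.fastq\tSRR5832183_2.fastq\tEG02\t02_Ecoli_15_2\tEcoli\t15\t2\n' >> "$metadata"

n_samples=0
last_sample=""

# read splits each line on whitespace and assigns one field per variable.
# The variable names are illustrative labels, not names used by the course.
while read -r fq1 fq2 short_id sample_name diet temp rep; do
    # Skip blank lines.
    [ -z "$fq1" ] && continue

    # Placeholder for the real per-sample commands (trimming, alignment, ...).
    echo "${short_id} ${sample_name}: ${fq1} + ${fq2} (diet=${diet}, temp=${temp}, rep=${rep})"

    n_samples=$((n_samples + 1))
    last_sample=$sample_name
done < "$metadata"

echo "Looped over ${n_samples} samples"
rm -f "$metadata"
```

Because every command inside the loop is built from the row's own fields, adding a sample to the project only requires adding a row to the metadata file.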

Let's explore pipeline automation

Make a new project directory:

Make a new directory & navigate into it:

$ pwd # You should be in DSCI512_RNAseq
$ mkdir PROJ04_GomezOrte2
$ cd PROJ04_GomezOrte2

OK, we just started a new project, and we want to use the same input data that we used in PROJ03_GomezOrte1. One option would be to copy the input files from that project into this one. That would work, but it would waste disk space by storing every file twice.

A better option is to point the PROJ04_GomezOrte2/01_input directory at the PROJ03_GomezOrte1/01_input directory. We can do this using a soft link, also known as a symbolic link (symlink) or a shortcut.

Soft link usage

$ ln -s /path/to/original /path/to/link
# So if you are located within the directory where you want the link to exist, you can shorten it to...
$ ln -s /path/to/original .

Let's try it. Navigate to your PROJ04_GomezOrte2 directory and create a soft link to serve as your input sub-directory.

# Navigate to your project directory (skip this if you are already there):
$ cd ~/DSCI512_RNAseq/PROJ04_GomezOrte2
$ ls
# create the soft link:
$ ln -s ../PROJ03_GomezOrte1/01_input .
# check what you got using two different ls commands:
$ ls
$ ls -alh
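A relative soft link stores the path exactly as typed, and that path is resolved from the directory that holds the link, so it helps to confirm that the link really points somewhere. The sketch below rebuilds the two project directories inside a throwaway temporary directory (the example.fastq file is a dummy, not course data), creates the same link, and checks it with readlink and the shell's -L and -d tests.

```shell
# Work in a throwaway directory so no real project files are touched.
start_dir=$PWD
workdir=$(mktemp -d)
cd "$workdir"

# Mimic the two project directories; the fastq file is a dummy placeholder.
mkdir -p PROJ03_GomezOrte1/01_input PROJ04_GomezOrte2
echo "dummy reads" > PROJ03_GomezOrte1/01_input/example.fastq

# Create the link from inside the new project, as in the steps above.
cd PROJ04_GomezOrte2
ln -s ../PROJ03_GomezOrte1/01_input .

# readlink prints the path stored in the link, exactly as it was typed.
link_target=$(readlink 01_input)
echo "01_input -> ${link_target}"

# -L is true for a symbolic link; -d follows the link and is true only
# if the target exists and is a directory.
if [ -L 01_input ] && [ -d 01_input ]; then
    link_status="ok"
else
    link_status="broken"
fi
echo "link status: ${link_status}"

# Files are readable through the link without having been copied.
linked_file=$(cat 01_input/example.fastq)

# Clean up the throwaway directory.
cd "$start_dir"
rm -rf "$workdir"
```

If the link shows up as broken, the usual cause is a relative path that was written from the wrong starting directory.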

Create the other sub-directories

$ mkdir 02_scripts
$ mkdir 03_output
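Once these steps move into a script, it can be convenient to make them safe to rerun. The sketch below is one possible variant, run in a temporary directory for illustration: mkdir -p creates a directory only if it is missing and does not fail if it already exists, so the loop can be repeated without errors.

```shell
# Throwaway directory standing in for PROJ04_GomezOrte2.
start_dir=$PWD
projdir=$(mktemp -d)
cd "$projdir"

# Run the loop twice to show that a second pass causes no error.
for pass in 1 2; do
    for subdir in 02_scripts 03_output; do
        mkdir -p "$subdir"
    done
done

# Count the sub-directories that now exist.
n_dirs=0
for subdir in 02_scripts 03_output; do
    [ -d "$subdir" ] && n_dirs=$((n_dirs + 1))
done
echo "Created ${n_dirs} sub-directories"

# Clean up the throwaway directory.
cd "$start_dir"
rm -rf "$projdir"
```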

Automating pipelines 2

wiki/automation.txt · Last modified: 2019/12/04 22:47 by erin