User Tools

Site Tools


2018pipeline

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
2018pipeline [2018/09/10 10:55]
david [Running the command]
2018pipeline [2018/09/11 15:44] (current)
david [Pipeline script]
Line 1: Line 1:
 +~~NOTOC~~
 +
 ====== Example Bioinformatics Pipeline ====== ====== Example Bioinformatics Pipeline ======
  
Line 203: Line 205:
 </​code>​ </​code>​
  
-:!: Now we have our starting data.+:!: Now we have our starting data. But let's give it a shorter name. 
 +<code bash> 
 +$ mv c_elegans.PRJNA13758.WS266.canonical_geneset.gtf wormbase.gtf 
 +</​code>​
  
 ====== Map out the Task ====== ====== Map out the Task ======
Line 360: Line 365:
 <code bash> <code bash>
 $ nano fix_chr_names.sbatch $ nano fix_chr_names.sbatch
-<​code>​+</code>
  
 <code bash> <code bash>
Line 428: Line 433:
  
 <code bash> <code bash>
-./faFrag ​../c_elegans11/​chrIII.fa 7555617 7555660 out.fa -mixed+faFrag c_elegans11/​chrIII.fa 7555617 7555660 out.fa -mixed
 grep -f gene_ids.lst chr_wormbase.gtf ​ | grep five_prime_utr grep -f gene_ids.lst chr_wormbase.gtf ​ | grep five_prime_utr
 grep -f gene_ids.lst chr_wormbase.gtf ​ | grep five_prime_utr | cut -f 9 | cut -f 4 -d ' ' | tr -d [\"\;] grep -f gene_ids.lst chr_wormbase.gtf ​ | grep five_prime_utr | cut -f 9 | cut -f 4 -d ' ' | tr -d [\"\;]
Line 442: Line 447:
  
 ===== Pipeline script ===== ===== Pipeline script =====
 +Let's anticipate sending our output to its own directory.
 +
 <code bash> <code bash>
-nano pipeline.sbatch+mkdir output
 </​code>​ </​code>​
 +
 +We're still in ''​data'',​ but we'll use ''​output''​ in the script.
 +
 <code bash> <code bash>
 +$ pwd
 +/​projects/​dcking@colostate.edu/​data
 +$ nano pipeline.sbatch
 +</​code>​
 +
 +<code bash pipeline_correct.sbatch>​
 #​!/​usr/​bin/​env bash #​!/​usr/​bin/​env bash
 #SBATCH --nodes=1 #SBATCH --nodes=1
Line 452: Line 468:
 #SBATCH --qos=normal #SBATCH --qos=normal
 #SBATCH --output=pipeline-latest.out #SBATCH --output=pipeline-latest.out
 + 
 # pipeline.sbatch - Perform grep on a GFF file, then run faFrag to extract DNA for  # pipeline.sbatch - Perform grep on a GFF file, then run faFrag to extract DNA for 
 # each line returned by grep. # each line returned by grep.
  
 +annotations=chr_wormbase.gff
  
 +if [ ! -e "​$annotations"​ ]
 +then
 +    echo "Need to supply an annotation file" >&2
 +    exit 1
 +fi 
 + 
 # Specify Gene list file  # Specify Gene list file 
 gene_lst=$1 gene_lst=$1
 +echo "using gene_lst=$gene_lst"​
 # #
 + 
 # Perform grep -f on Gene list file to get annotation lines from the GFF # Perform grep -f on Gene list file to get annotation lines from the GFF
-grep -f $gene_lst | grep five_prime_utr > tmpfile.gff+grep -f $gene_lst ​$annotations ​| grep five_prime_utr > tmpfile.gff 
 +echo "​temporary file contains $(wc -l tmpfile.gff) lines"
 # #
 + 
 ## Loop over grep'd file ## Loop over grep'd file
 ## for each GFF line ## for each GFF line
Line 470: Line 495:
 do do
    ## isolate column to specify chromosome ##    ## isolate column to specify chromosome ##
-   ​chrom=$(echo "​line"​ | cut -f 1)  +   ​chrom=$(echo "$line" | cut -f 1)  
-   ​+ 
    ## isolate column to specify chrom_start ##    ## isolate column to specify chrom_start ##
-   ​chrom_start=$(echo "​line"​ | cut -f 4)   +   ​chrom_start=$(echo "$line" | cut -f 4)   
-   ​+ 
    ## isolate column to specify chrom_end ##    ## isolate column to specify chrom_end ##
-   ​chrom_end=$(echo "​line"​ | cut -f 5)     +   ​chrom_end=$(echo "$line" | cut -f 5)     
-   ​+ 
    ## --- fill in this section --- ##    ## --- fill in this section --- ##
-   ​attributes=$(echo "​line"​ | cut -f 9)    +   ​attributes=$(echo "$line" | cut -f 9)    
-   ​+ 
    ## get the TRANSCRIPT ID out... There may be multiple transcripts for a given gene    ## get the TRANSCRIPT ID out... There may be multiple transcripts for a given gene
    ​gene_id=$(echo $attributes | cut -f 4 -d ' ')    ​gene_id=$(echo $attributes | cut -f 4 -d ' ')
-   + 
    ## trim the quotes and semicolon    ## trim the quotes and semicolon
-   ​gene_id_trimmed=$(echo $gene_id | tr -d [\"\;] # have to use backslashes for some characters +   ​gene_id_trimmed=$(echo $gene_id | tr -d [\"\;]# have to use backslashes for some characters 
-   ​+ 
    ​in_fasta="​c_elegans11/​$chrom.fa"​    ​in_fasta="​c_elegans11/​$chrom.fa"​
    ​out_fasta="​$gene_id_trimmed.fa"​    ​out_fasta="​$gene_id_trimmed.fa"​
    ###########​ run the command! #############​    ###########​ run the command! #############​
-   ​faFrag $in_fasta $chrom_start ​$$chrom_end $out_fasta +   echo "​Running faFrag on $gene_id_trimmed"​ 
-   ​+   faFrag $in_fasta $chrom_start $chrom_end $out_fasta ​||  echo ">>>>>>>>>>>​There was a problem with: '​$line'<<<<<<<<<<<<<<<"​ 
 + 
 done < tmpfile.gff done < tmpfile.gff
 + 
 # can now delete tmpfile.gff # can now delete tmpfile.gff
 rm tmpfile.gff rm tmpfile.gff
Line 532: Line 558:
 output/​F56H9.5.2.fa:​24:​ttaaatgcttcaaatttaaatcaattaaatc output/​F56H9.5.2.fa:​24:​ttaaatgcttcaaatttaaatcaattaaatc
 </​code>​ </​code>​
 +
 +Let's explore this further with [[2018grep_motifs|Using regular expressions to find sequence motifs]]
  
2018pipeline.1536598538.txt.gz · Last modified: 2018/09/10 10:55 by david