This is an old revision of the document!
Most bioinformatics tasks share a common set of challenges:
Today we are going to implement a solution for a task that features these problems.
One might do genomic sequence analysis in order to study
A common way to store gene annotations is with GFF (General Feature Format) files. There different versions of this format, including GTF (General Transfer Format), and the current version is 3.0. Despite the different versions, you may expect the following tab-delimited format:
|1||seqname||For genomic sequences, this is the chromosome name or number.|
|2||source||data source or program that generated the feature|
|3||type||such as UTR or CDS (see http://www.sequenceontology.org/so_wiki/index.php/Category:SO:SOFA)|
|4||start||Start position of the feature, with sequence numbering starting at 1.|
|5||end||End position of the feature, with sequence numbering starting at 1.|
|6||score||A floating point value.|
|7||strand||defined as + (forward) or - (reverse).|
|8||phase||One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on.|
|9||attributes||A semicolon-separated list of tag-value pairs, providing additional information about each feature.|
Let's review how we might isolate a given column.
How will we approach this problem? What information do we need in order to continue?
Say know that there are public gene annotations for the C. elegans genome. There is also available genomic sequence. We want to work on the sequences pertaining to certain genes.
Let's prepare our computing environment for the task at hand.
Let's prepare a space on Summit for this task. We'll need space to :
We'll use our Summit “project” space for this. You can access your project space thus:
$ cd /projects/$USER $ pwd
You are now in your project space, which has 250Gb of space that isn't deleted.
Let's prepare a directory for the input data. The name
data seems appropriate, but you may wish to be more specific.
$ mkdir data $ cd data
Now we must go and get the data.
For C. elegans, we can access the publically-available sequence at https://genome.ucsc.edu/. The genome browser at the University of California, Santa Cruz is one of the leaders in genome annotation storage and analysis. They have many genomes, but your organism of interest may be maintained by a different project.
Normally, you would go to Downloads ⇒ Genome Data ⇒ Nematodes ⇒ C. elegans, but it doesn't have the current data freeze (ce11)… but I know it is here: http://hgdownload.soe.ucsc.edu/goldenPath/ce11/bigZips/
After clicking on this link, you will see a long description of the files contained within. DNA sequence is usually in FASTA format. Look for the following file:
It's only 30 Mb, but you shouldn't click on this file since we want to download it from Summit and have it there. They list different methods for downloading the file from the command line.
rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/ce11/bigZips/chromFa.tar.gz .
Make sure you copy the entire code block above, since it goes beyond the window.
wget --timestamping 'ftp://hgdownload.cse.ucsc.edu/goldenPath/ce11/bigZips/chromFa.tar.gz' -O chromFa.tar.gz
Make sure you copy the entire code block above, since it goes beyond the window.
The program wget is equivalent to using your internet browser, The program rsync has more capabilities (the files don't have to be on the WWW, for example).
Either of the above commands will fetch the sequence in a Gzip-compressed, TAR file. This is a common format on the internet. TAR format is a handy way to package multiple files. Gzip is a standard compression method. To uncompress and extract the different chromosomes, you do:
tar -zxvf chromFa.tar.gz
This extracted seven files, one for each chromosome I-V (in roman numerals) and chromosome X (chrX) and the mitochondrial chromosome (chrM).
Take a look at one of the files:
$ more chrI.fa
How big is the C. elegans genome?
$ wc *.fa 301450 301450 15373889 chrI.fa 305590 305590 15585017 chrII.fa 275678 275678 14059486 chrIII.fa 349878 349878 17843713 chrIV.fa 277 277 14076 chrM.fa 418485 418485 21342670 chrV.fa 354380 354380 18073327 chrX.fa 2005738 2005738 102292178 total
The program wc (word count) gives the number of lines, words, and characters in a file. Words are separated by whitespace, so the sequence lines only have one “word” per line. Hence, the first two columns of output are the same.
The third column, however, is useful.
Each file contains the characters of the header, e.g.
>chrI, and the newline character (one per line). But the rest is the sequence count. This, the size of the C. elegans genome is roughly the value in the 3rd column total. It is on the order of 102 million basepairs.
We now have the C. elegans genome sequence in FASTA format. Let's be a little more organized and put it in a directory.
$ mkdir c_elegans11 $ mv -v *.fa c_elegans11 ‘chrI.fa’ -> ‘c_elegans11/chrI.fa’ ‘chrII.fa’ -> ‘c_elegans11/chrII.fa’ ‘chrIII.fa’ -> ‘c_elegans11/chrIII.fa’ ‘chrIV.fa’ -> ‘c_elegans11/chrIV.fa’ ‘chrM.fa’ -> ‘c_elegans11/chrM.fa’ ‘chrV.fa’ -> ‘c_elegans11/chrV.fa’ ‘chrX.fa’ -> ‘c_elegans11/chrX.fa’
11 to c_elegans so that I, or others using my work, know which version of the assembly I used.
I'm going to put the compressed file in there too, just to clean thing up a little. You could delete, but I like to keep things around in case I need to retrace my steps later.
$ mv chromFa.tar.gz c_elegans11 $ ls c_elegans11/ chrI.fa chrII.fa chrIII.fa chrIV.fa chrM.fa chromFa.tar.gz chrV.fa chrX.fa
We will be using annotations from Wormbase. There are many options for this, but for our purposes we will use a specific file.
Download it with the following command.
Although it's a good idea to keep data in compressed form to save space. We will need to access it without compression.
To decompress the file:
$ gunzip c_elegans.PRJNA13758.WS266.canonical_geneset.gtf.gz
If you do
ls -lh you will see that it is now 167Mb. It was 7mb before!
So what have we downloaded?
$ head c_elegans.PRJNA13758.WS266.canonical_geneset.gtf #!genebuild-version WS266 MtDNA WormBase gene 1 55 . + . gene_id "WBGene00014450"; gene_source "WormBase"; gene_biotype "tRNA"; MtDNA WormBase transcript 1 55 . + . gene_id "WBGene00014450"; transcript_id "MTCE.1"; gene_source "WormBase"; gene_biotype "tRNA"; transcript_source "WormBase"; transcript_biotype "tRNA"; MtDNA WormBase exon 1 55 . + . gene_id "WBGene00014450"; transcript_id "MTCE.1"; exon_number "1"; gene_source "WormBase"; gene_biotype "tRNA"; transcript_source "WormBase"; transcript_biotype "tRNA"; exon_id "MTCE.1.e1"; MtDNA WormBase gene 58 111 . + . gene_id "WBGene00014451"; gene_source "WormBase"; gene_biotype "tRNA"; MtDNA WormBase transcript 58 111 . + . gene_id "WBGene00014451"; transcript_id "MTCE.2"; gene_source "WormBase"; gene_biotype "tRNA"; transcript_source "WormBase"; transcript_biotype "tRNA"; MtDNA WormBase exon 58 111 . + . gene_id "WBGene00014451"; transcript_id "MTCE.2"; exon_number "1"; gene_source "WormBase"; gene_biotype "tRNA"; transcript_source "WormBase"; transcript_biotype "tRNA"; exon_id "MTCE.2.e1"; MtDNA WormBase gene 113 549 . + . gene_id "WBGene00010957"; gene_source "WormBase"; gene_biotype "protein_coding"; MtDNA WormBase transcript 113 549 . + . gene_id "WBGene00010957"; transcript_id "MTCE.3.1"; gene_source "WormBase"; gene_biotype "protein_coding"; transcript_source "WormBase"; transcript_biotype "protein_coding"; MtDNA WormBase exon 113 549 . + . gene_id "WBGene00010957"; transcript_id "MTCE.3.1"; exon_number "1"; gene_source "WormBase"; gene_biotype "protein_coding"; transcript_source "WormBase"; transcript_biotype "protein_coding"; exon_id "MTCE.3.1.e1";
Now we have our starting data.
Now we need to strategize how to go from all annotations + whole genome sequence to only the desired annotations and their genomic sequences.
Using Google, I find some advice on Biostars. That advice includes, but is not limited to:
We'll use UCSC User Apps for this exercise:
Bedtools is easier to use, but must be compiled. We'll cover it later.
Google UCSC UserApps. I get “UCSC Genome Browser Downloads” as the second hit. It is leads you to http://hgdownload.soe.ucsc.edu/admin/exe/.
Now we have a choice of builds. We know summit is on linux, so let's try
This page has the full list of programs and how to download them. And below that, it has a description of each one.
We're going to use
faFrag. You can just download it by clicking the link, but we need it on summit. Like we did with the chromosome sequence:
rsync -aP rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/faFrag ./
Let's check to see if it really works on Summit:
$ chmod 755 faFrag $ ./faFrag faFrag - Extract a piece of DNA from a .fa file. usage: faFrag in.fa start end out.fa options: -mixed - preserve mixed-case in FASTA file
This should have run, giving an usage message. If so, we're going to add it to our directory of installed programs
If the program ran, put it in your bin directory
$ mv faFrag ~/bin $ faFrag faFrag - Extract a piece of DNA from a .fa file. usage: faFrag in.fa start end out.fa options: -mixed - preserve mixed-case in FASTA file
Notice how we didn't need the
./ prefix to the program? By moving it to
~/bin, we can run it from whereever we are.
Thus, we have installed
faFrag. Now we have the software to extract the specific subsequence out of a FASTA file, given chromosome positions (that are specified in our annotation file).
Let's begin coding our first script.
The “canonical geneset” gtf file and the FASTA files we downloaded have a key discrepancy: the chromosome names. Anything from UCSC uses the convention of abbreviating the chromosome as “chr” with the number following it. Example:
chrI for chromosome I (roman numeral one). Our gtf file doesn't follow that convention.
We could change our sequences, or our GTF file. For this exercise, we'll change the GTF file.
A handy command for this is
sed, which stands for stream editor. We're going to use sed to make a new gene annotation file with the chromosome name changed to the “chr” convention.
Sed takes a short, one-line instruction as its argument, and applies that instruction to either to a file argument or standard input <stdin>.
$ sed 's/pattern/replacement/' filename
The parts of this instruction are separated by forward slashes (
The replacement portion always ends with a forward slash, but can then be followed by further options (examples below).
Sed can take its input from a pipe instead a file, as in:
$ cmd | 's/pattern/replacement/'
The pattern can be an exact string. Say we want to replace
chrM, the argument to sed would be:
$ sed 's/MtDNA/chrM/' inputfile.gtf > outputfile.gtf
The pattern can include special characters in order to refine the selection. Common ones are:
$ sed 's/^MtDNA/chrM/' inputfile.gtf > outputfile.gtf
^ means that the following text
MtDNA must be at the beginning of the line for it to match. That's perfect for us, since it that's the text we need to change.
$ sed 's/MtDNA$/chrM/' inputfile.gtf > outputfile.gtf
The dollar sign
$ in this context means that the preceding text
MtDNA must be at the end of the line for it to match. That's not the case for us, but may be useful in a different setting.
I execute the following command:
$ cut -f1 wormbase.gtf | sort -u #!genebuild-version WS266 I II III IV MtDNA V X
Note: The command
sort -u is equivalent to
sort | uniq.
The output tells us there is a comment, plus five roman numerals (I-V), plus X (not the roman numeral for 10), and MtDNA.
That seems like a long string of pipes! But we can cover all of the I,I,II,III,IV conversions with a single statement:
sed 's/^I/chrI/'. That leaves V, X, and MtDNA.
#!/usr/bin/env bash grep -v '#' wormbase.gtf | sed 's/^I/chrI/' | sed 's/^V/chrV/' | sed 's/^X/chrX/' | sed 's/^MtDNA/chrM/' > chr_wormbase.gtf
Note: none of these patterns accidentally match each other. If we needed to change III to chr3, however, we couldn't just use
^I, because it would match all of I, II, III, and IV. We would have to specify
^II individually before we could run
Let's put the above code in a new script.
$ nano fix_chr_names.sbatch <code> <code bash> #!/usr/bin/env bash #SBATCH --nodes=1 #SBATCH --ntasks=1 #SBATCH --time=0:01:00 #SBATCH --qos=normal grep -v '#' wormbase.gtf | sed 's/^I/chrI/' | sed 's/^V/chrV/' | sed 's/^X/chrX/' | sed 's/^MtDNA/chrM/' > chr_wormbase.gtf
I've added some directives for SLURM. We only want
We submit the job this way:
$ sbatch fix_chr_names.sbatch Submitted batch job 1199849
This is not the same as running it with
bash. It has submitted the job to the job scheduler. And is now placed in the queue. You can check on its status with:
$ squeue -u $USER
Also, look for the new file:
chr_wormbase.gtf in your directory. I list my directories oldest-newest for this purpose. Newer files will be at the bottom:
$ ls -ltrh
Is your file there yet?
Once it is, we can move on to build the extraction code.
Let's choose some genes to look at. Oddly, there aren't common names or abbreviations in the file we downloaded, so we have to use the web interface to get the right gene identifiers.
At https://www.wormbase.org, we can use the search box for this function. If you have a gene that you study and think it might have an ortholog in C. elegans, try typing it in here. In general, developmental genes are highly conserved. Try searching for HOX.
My first hit was ceh-13. What we need is the WormBase ID.
To make sure you're getting results from C. elegans, click the link under species on the left.
I'm going to use these:
|Gene name/abbrev||WormBase ID|
Save the wormbase IDs to a file called gene_ids.lst. Here's mine:
$ cat gene_ids.lst WBGene00000437 WBGene00003011 WBGene00001250
./faFrag ../c_elegans11/chrIII.fa 7555617 7555660 out.fa -mixed grep -f gene_ids.lst chr_wormbase.gtf | grep five_prime_utr grep -f gene_ids.lst chr_wormbase.gtf | grep five_prime_utr | cut -f 9 | cut -f 4 -d ' ' | tr -d [\"\;]
while read line; do chrom=$(echo $line | cut -f1 -d ' '); st=$(echo $line | cut -f4 -d ' '); echo "$chrom $st"; done < <(grep -f gene_ids.lst chr_wormbase.gtf)
$ nano pipeline.sbatch
#!/usr/bin/env bash #SBATCH --nodes=1 #SBATCH --ntasks=1 #SBATCH --time=1:00:00 #SBATCH --qos=normal #SBATCH --output=pipeline-latest.out # pipeline.sbatch - Perform grep on a GFF file, then run faFrag to extract DNA for # each line returned by grep. # Specify Gene list file gene_lst=$1 # # Perform grep -f on Gene list file to get annotation lines from the GFF grep -f $gene_lst | grep five_prime_utr > tmpfile.gff # ## Loop over grep'd file ## for each GFF line while read line do ## isolate column to specify chromosome ## chrom=$(echo "line" | cut -f 1) ## isolate column to specify chrom_start ## chrom_start=$(echo "line" | cut -f 4) ## isolate column to specify chrom_end ## chrom_end=$(echo "line" | cut -f 5) ## --- fill in this section --- ## attributes=$(echo "line" | cut -f 9) ## get the TRANSCRIPT ID out... There may be multiple transcripts for a given gene gene_id=$(echo $attributes | cut -f 4 -d ' ') ## trim the quotes and semicolon gene_id_trimmed=$(echo $gene_id | tr -d [\"\;] # have to use backslashes for some characters in_fasta="c_elegans11/$chrom.fa" out_fasta="$gene_id_trimmed.fa" ########### run the command! ############# faFrag $in_fasta $chrom_start $$chrom_end $out_fasta done < tmpfile.gff # can now delete tmpfile.gff rm tmpfile.gff
We can now run the command with
sbatch, passing in the argument of our gene list. It will be
$1 inside the script.
$ sbatch gene_ids.lst Submitted job XXXXXXX
You can watch the progress with:
$ squeue -u $USER $ ls -ltrh $ sacct -j XXXXXXX
The last one looks specifically at the specific job, but you have to paste-in the long ID returned by sbatch.
If everything is working, the script should be writing a fasta file for each transcript ID.
Let's move on to