- Summit test
Please download and install PuTTy:
Today, we will be learning about genomic file formats. But we should probably take a step back and learn about general file formats. Different file formats are typically identified by specific file extensions, suffixes to their names that inform programs of their type. Two general types of file formats are:
How do you we know something is a text file?
lessto view it
How do we know something is a binary file?
lessit looks like a bunch of alien writing
Several standardized types of text files have been developed to handle biological data and genome data. You may already be familiar with some Common Examples of Biological File Types.
In dealing with genomic information, almost all the files are text files.
We will focus on two types: FASTA and GTF/GFF.
FASTA files store DNA, RNA, or amino acid sequence information. The file extension is
$ more favorite3UTRs.fa >sequence1_description AGCTAGCATCGACTAGCTACGATCGATCGATCACGAGCTACGACGTAGGCATGGGGGCTTACGATGCTACGGCGGAGCTACGGCGACTGCGATCTACGGCGATCGACGGACGGACG>TCAGGCGACGATCTATCATCTATCGAGCGAGCTACTTACTCTTCTCTATCTACTTATCCCCTTCTTAGGGGTTGATTAGTCTAGCTGGTACGATCGAGCGATCTAGAGCGATCGAC>GAGCTGACGGACGTACTTACTATCGTAGCGACTACTTC >sequence2_description CTCTAGCATCGACTAGCTACGATCGATCGATCACGAGCTACGACGTAGGCATGGGGGCTTACGATGCTACCCCGGAGCTACGGCGACTGCGATCTACGGCGATCGACGGACGGACG>TCAGGCGACGATCTATCATCTATCAAGCGAGCTACTTACTCTTCTCTATCTACTTATCCCCTTCTTAGGGGTTGATTAGTCTAGCTGGTACGATCTTTCTAGCGAGCGATCTAGAG>CGATCGACGAGCTGACGGACGTACTTACTATCGTAGCGACTACTTC
$ more favorite3UTRs.fa >sequence1_description AGCTAGCATCGACTAGCTACGATCGATCGATCACGAGCTACGACGTAGGCATGGGGGCTTAC GATGCTACGGCGGAGCTACGGCGACTGCGATCTACGGCGATCGACGGACGGACG>TCAGGCG ACGATCTATCATCTATCGAGCGAGCTACTTACTCTTCTCTATCTACTTATCCCCTTCTTAGG GGTTGATTAGTCTAGCTGGTACGATCGAGCGATCTAGAGCGATCGAC>GAGCTGACGGACGT ACTTACTATCGTAGCGACTACTTC >sequence2_description CTCTAGCATCGACTAGCTACGATCGATCGATCACGAGCTACGACGTAGGCATGGGGGCTTAC GATGCTACCCCGGAGCTACGGCGACTGCGATCTACGGCGATCGACGGACGGACG>TCAGGCG ACGATCTATCATCTATCAAGCGAGCTACTTACTCTTCTCTATCTACTTATCCCCTTCTTAGG GGTTGATTAGTCTAGCTGGTACGATCTTTCTAGCGAGCGATCTAGAG>CGATCGACGAGCTG ACGGACGTACTTACTATCGTAGCGACTACTTC
ANOTHER EXAMPLE: a genome
$ more S288C_reference_sequence_R64-1-1_20110203.fsa >ref|NC_001133| [org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic] [chromosome=I] CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACA CATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTT ACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCATTCAACCATACCACTCCGAAC CACCATCCATCCCTCTACTTACTACCACTCACCCACCGTTACCCTCCAATTACCCATATC CAACCCACTGCCACTTACCCTACCATTACCCTACCATCCACCATGACCTACTCACCATAC TGTTCTTCTACCCACCATATTGAAACGCTAACAAATGATCGTAAATAACACACACGTGCT TACCCTACCACTTTATACCACCACCACATGCCATACTCACCCTCACTTGTATACTGATTT TACGTACGCACACGGATGCTACAGTATATACCATCTCAAACTTACCCTACTCTCAGATTC CACTTCACTCCATGGCCCATCTCTCACTGAATCAGTACCAAATGCACTCACATCATTATG
GTF (Gene Transfer Files) or GFF (General Feature Files) store annotation information about a particular sequence.
gtf. These share much of the same formatting with the exception of the the way the group information is formatted.
|1||seqname||The name of the sequence. Must be a chromosome or scaffold|
|2||source||The program that generated this feature|
|3||feature||The name of this type of feature. Some examples of standard feature types are “CDS”, “start codon”, “stop codon”, and “exon”.|
|4||start||The starting position of the feature in the sequence. The first base is numbered 1.|
|5||end||The ending position of the feature (inclusive).|
|6||score||A score between 0 and 1000. If the track line useScore attribute is set to 1 for this annotation data set, the score value will determine the level of gray in which this feature is displayed (higher numbers = darker gray). If there is no score value, enter “:.”:.|
|7||strand||Valid entries include “+”, “-”, or “.” (for don't know/don't care).|
|8||frame||If the feature is a coding exon, frame should be a number between 0-2 that represents the reading frame of the first base. If the feature is not a coding exon, the value should be “.”.|
|9||group||All lines with the same group are linked together into a single item.|
$ more saccharomyces_cerevisiae_R64-1-1_20110208.gff chrI SGD chromosome 1 230218 . . . ID=chrI;dbxref=NCBI:NC_001133;Name=chrI chrI SGD repeat_region 1 62 . - . ID=TEL01L-TR;Name=TEL01L-TR;Note=Terminal%20stretch%20of%20telomeric%20repeats%20on%20the%20left%20arm%20of%20Chromosome%20I;dbxref=SGD:S000028864 chrI SGD telomere 1 801 . - . ID=TEL01L;Name=TEL01L;Note=Telomeric%20region%20on%20the%20left%20arm%20of%20Chromosome%20I%3B%20composed%20of%20an%20X%20element%20core%20sequence%2C%20X%20element%20combinatorial%20repeats%2C%20and%20a%20short%20terminal%20stretch%20of%20telomeric%20repeats;dbxref=SGD:S000028862 chrI SGD repeat_region 63 336 . - . ID=TEL01L-XR;Name=TEL01L-XR;Note=Telomeric%20X%20element%20combinatorial%20Repeat%20region%20on%20the%20left%20arm%20of%20Chromosome%20I%3B%20contains%20repeats%20of%20the%20D%2C%20C%2C%20B%20and%20A%20types%2C%20as%20well%20as%20Tbf1p%20binding%20sites%3B%20formerly%20called%20SubTelomeric%20Repeats;dbxref=SGD:S000028866 chrI SGD gene 335 649 . + . ID=YAL069W;Name=YAL069W;Ontology_term=GO:0003674,GO:0005575,GO:0008150;Note=Dubious%20open%20reading%20frame%20unlikely%20to%20encode%20a%20protein%2C%20based%20on%20available%20experimental%20and%20comparative%20sequence%20data;dbxref=SGD:S000002143;orf_classification=Dubious chrI SGD CDS 335 649 . + 0 Parent=YAL069W;Name=YAL069W;Ontology_term=GO:0003674,GO:0005575,GO:0008150;Note=Dubious%20open%20reading%20frame%20unlikely%20to%20encode%20a%20protein%2C%20based%20on%20available%20experimental%20and%20comparative%20sequence%20data;dbxref=SGD:S000002143;orf_classification=Dubious
Figure: An example of a genome sequence (.fa) and annotation file (.gtf) that are rendered as browsable pictures on the UCSC Genome Browser
Quick Tip There are many other standardized file formats in genomics. Much of learning the ropes of genomics is getting to know these standardized formats. UCSC File Formats
Quick Tip Where do file extensions come from? For flat, text files, you put them there!
Quick Tip It is good practice to always save files with the proper file extensions! All files should have extensions!!
Quick Tip: Some standardized file formats will have lines at the beginning that start with
#. These are comments. They typically contain information about how the information was generated.
Quick Tip: Check what type of file you have using file.
$ file file.txt