Before we get started

- Quiz
- Questions
- PuTTY
- Summit test

Requirements for all users

Update: PuTTY for Windows Users

Please download and install PuTTY:
PuTTY


FILE FORMATS

File formats in general

Today we will learn about genomic file formats, but first let's take a step back and look at file formats in general. A file's format is typically identified by its file extension, a suffix on the file name that tells programs (and people) what type of file it is. Two general types of file formats are:

  • text files: files that contain only text information.
  • binary files: files that contain more complex information that can be interpreted as formatting, images, application-specific objects, as well as text. Examples: .docx, .xlsx, .jpg, .pdf, and .m4p

How do we know something is a text file?

  • It has a file extension like .txt, .csv, .fa, or .gb
  • You can view it with more or less
  • You can view and edit it with a text editor

How do we know something is a binary file?

  • Has a specific file extension associated with a specific program
  • It was produced in a specific program
  • When you view it with more or less, it looks like a bunch of alien writing
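The checks above can also be done from the command line. Here is a minimal sketch, assuming that "binary" means "contains bytes that grep does not consider text" (for example NUL bytes). The function name classify and the two demo files are made up for illustration.

```shell
# Make two small demo files in a temporary directory (hypothetical names).
workdir=$(mktemp -d)
printf 'ACGT\n' > "$workdir/seq.txt"        # plain text
printf 'AC\000GT\n' > "$workdir/seq.bin"    # contains a NUL byte

# classify FILE: print "text" or "binary".
# grep -I treats binary files as if they contain no matches,
# and the empty pattern '' matches every line of a text file.
classify() {
    if grep -Iq '' "$1"; then
        echo text
    else
        echo binary
    fi
}

classify "$workdir/seq.txt"
classify "$workdir/seq.bin"
```

Note that an empty file has no lines to match, so this sketch would report it as binary.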

:!: Quick Tip If you cannot see file extensions on your computer, take a moment to make them visible:
macOS Show file extensions
Windows Show file extensions


Genomics File Formats

Several standardized types of text files have been developed to handle biological data and genome data. You may already be familiar with some Common Examples of Biological File Types.

In dealing with genomic information, almost all the files are text files.

We will focus on two types: FASTA and GTF/GFF.

FASTA

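FASTA files store DNA, RNA, or amino acid sequence information. The file extension is typically .fa, .fasta, or .fsa. Each record starts with a header line that begins with > and names the sequence, followed by the sequence itself on one or more lines.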

$ more favorite3UTRs.fa
>sequence1_description
AGCTAGCATCGACTAGCTACGATCGATCGATCACGAGCTACGACGTAGGCATGGGGGCTTACGATGCTACGGCGGAGCTACGGCGACTGCGATCTACGGCGATCGACGGACGGACGTCAGGCGACGATCTATCATCTATCGAGCGAGCTACTTACTCTTCTCTATCTACTTATCCCCTTCTTAGGGGTTGATTAGTCTAGCTGGTACGATCGAGCGATCTAGAGCGATCGACGAGCTGACGGACGTACTTACTATCGTAGCGACTACTTC

>sequence2_description
CTCTAGCATCGACTAGCTACGATCGATCGATCACGAGCTACGACGTAGGCATGGGGGCTTACGATGCTACCCCGGAGCTACGGCGACTGCGATCTACGGCGATCGACGGACGGACGTCAGGCGACGATCTATCATCTATCAAGCGAGCTACTTACTCTTCTCTATCTACTTATCCCCTTCTTAGGGGTTGATTAGTCTAGCTGGTACGATCTTTCTAGCGAGCGATCTAGAGCGATCGACGAGCTGACGGACGTACTTACTATCGTAGCGACTACTTC

OR, with the same sequences wrapped across several lines:

$ more favorite3UTRs.fa
>sequence1_description
AGCTAGCATCGACTAGCTACGATCGATCGATCACGAGCTACGACGTAGGCATGGGGGCTTAC
GATGCTACGGCGGAGCTACGGCGACTGCGATCTACGGCGATCGACGGACGGACGTCAGGCG
ACGATCTATCATCTATCGAGCGAGCTACTTACTCTTCTCTATCTACTTATCCCCTTCTTAGG
GGTTGATTAGTCTAGCTGGTACGATCGAGCGATCTAGAGCGATCGACGAGCTGACGGACGT
ACTTACTATCGTAGCGACTACTTC

>sequence2_description
CTCTAGCATCGACTAGCTACGATCGATCGATCACGAGCTACGACGTAGGCATGGGGGCTTAC
GATGCTACCCCGGAGCTACGGCGACTGCGATCTACGGCGATCGACGGACGGACGTCAGGCG
ACGATCTATCATCTATCAAGCGAGCTACTTACTCTTCTCTATCTACTTATCCCCTTCTTAGG
GGTTGATTAGTCTAGCTGGTACGATCTTTCTAGCGAGCGATCTAGAGCGATCGACGAGCTG
ACGGACGTACTTACTATCGTAGCGACTACTTC
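
Both layouts above describe the same records, which you can check by counting sequences and measuring their lengths. Here is a minimal sketch, assuming a tiny made-up file called demo.fa and a hypothetical helper named fasta_lengths. It is one way to do it with grep and awk, not the only one.

```shell
# Build a small demo FASTA file: one record on a single line, one wrapped.
workdir=$(mktemp -d)
cat > "$workdir/demo.fa" <<'EOF'
>one_line
ACGTACGTAC
>wrapped
ACGTA
CGTAC
EOF

# Count records: every record has exactly one header line starting with ">".
grep -c '^>' "$workdir/demo.fa"

# fasta_lengths FILE: print "name<TAB>length" for each record,
# adding up sequence lines so that wrapping does not matter.
fasta_lengths() {
    awk '
        /^>/ {
            if (name != "") print name "\t" len   # finish the previous record
            name = substr($0, 2)                  # header text without ">"
            len = 0
            next
        }
        { len += length($0) }                     # sequence line (blank lines add 0)
        END { if (name != "") print name "\t" len }
    ' "$1"
}

fasta_lengths "$workdir/demo.fa"
```

Both demo records come out with the same length, because line wrapping is not part of the sequence.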

ANOTHER EXAMPLE: a genome

$ more S288C_reference_sequence_R64-1-1_20110203.fsa 
>ref|NC_001133| [org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic] [chromosome=I]
CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACA
CATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTT
ACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCATTCAACCATACCACTCCGAAC
CACCATCCATCCCTCTACTTACTACCACTCACCCACCGTTACCCTCCAATTACCCATATC
CAACCCACTGCCACTTACCCTACCATTACCCTACCATCCACCATGACCTACTCACCATAC
TGTTCTTCTACCCACCATATTGAAACGCTAACAAATGATCGTAAATAACACACACGTGCT
TACCCTACCACTTTATACCACCACCACATGCCATACTCACCCTCACTTGTATACTGATTT
TACGTACGCACACGGATGCTACAGTATATACCATCTCAAACTTACCCTACTCTCAGATTC
CACTTCACTCCATGGCCCATCTCTCACTGAATCAGTACCAAATGCACTCACATCATTATG
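
A genome FASTA file is usually far too long to page through, so it helps to look at the header lines only. Here is a minimal sketch, assuming headers that carry a [chromosome=...] tag like the one above. The file mini_genome.fsa, its second record, and the helper list_chromosomes are made up for illustration.

```shell
# Build a tiny stand-in for a genome file. The first header and sequence line
# are copied from the example above; the second record is a placeholder.
workdir=$(mktemp -d)
cat > "$workdir/mini_genome.fsa" <<'EOF'
>ref|NC_001133| [org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic] [chromosome=I]
CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACA
>ref|PLACEHOLDER| [org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic] [chromosome=II]
ACGTACGTACGT
EOF

# Show only the header lines.
grep '^>' "$workdir/mini_genome.fsa"

# list_chromosomes FILE: pull the value of the [chromosome=...] tag
# out of each header line, one value per line.
list_chromosomes() {
    grep '^>' "$1" | sed -n 's/.*\[chromosome=\([^]]*\)].*/\1/p'
}

list_chromosomes "$workdir/mini_genome.fsa"
```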

GTF/GFF

GTF (Gene Transfer Format) and GFF (General Feature Format) files store annotation information about a particular sequence.

  • Features can be genes, exons, introns, primer binding sites, origins of replication, or anything else you can think of.
  • GTF/GFF files are made of tab-separated columns. Each row is a “feature”.
  • GTF/GFF files can be paired with a FASTA file to give a user the complete information about a sequence and all its features.
  • GTF/GFF files can be uploaded to genome browsers, where the features are displayed graphically.
  • File extensions can be .gff, .gff2, .gff3, or .gtf. These formats share much of the same layout, except for the way the group information (column 9) is formatted.
  • The tab-separated columns of information are:
| # | Feature | Description |
| 1 | seqname | The name of the sequence. Must be a chromosome or scaffold. |
| 2 | source | The program that generated this feature. |
| 3 | feature | The name of this type of feature. Some examples of standard feature types are “CDS”, “start_codon”, “stop_codon”, and “exon”. |
| 4 | start | The starting position of the feature in the sequence. The first base is numbered 1. |
| 5 | end | The ending position of the feature (inclusive). |
| 6 | score | A score between 0 and 1000. If the track line useScore attribute is set to 1 for this annotation data set, the score value will determine the level of gray in which this feature is displayed (higher numbers = darker gray). If there is no score value, enter “.”. |
| 7 | strand | Valid entries include “+”, “-”, or “.” (for don't know/don't care). |
| 8 | frame | If the feature is a coding exon, frame should be a number between 0-2 that represents the reading frame of the first base. If the feature is not a coding exon, the value should be “.”. |
| 9 | group | All lines with the same group are linked together into a single item. |
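
Because start is 1-based and end is inclusive, the length of a feature is end - start + 1. Here is a minimal sketch of pulling columns 3, 4, and 5 out of a tab-separated annotation file and computing that length. The file demo.gff, its three rows, and the helper feature_lengths are made up for illustration.

```shell
# Build a tiny demo annotation file. printf with \t guarantees real tabs.
workdir=$(mktemp -d)
printf 'chrDemo\tdemo\tgene\t11\t20\t.\t+\t.\tID=geneA\n'     > "$workdir/demo.gff"
printf 'chrDemo\tdemo\tCDS\t11\t20\t.\t+\t0\tParent=geneA\n' >> "$workdir/demo.gff"
printf 'chrDemo\tdemo\tgene\t50\t50\t.\t-\t.\tID=geneB\n'    >> "$workdir/demo.gff"

# feature_lengths FILE: print feature type, start, end, and length.
# Length is end - start + 1 because both coordinates are included.
feature_lengths() {
    awk -F '\t' -v OFS='\t' '
        /^#/    { next }                          # skip comment lines
        NF >= 9 { print $3, $4, $5, $5 - $4 + 1 } # only full 9-column rows
    ' "$1"
}

feature_lengths "$workdir/demo.gff"
```

A feature that starts and ends on the same base (50 to 50 here) has length 1, not 0.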

EXAMPLE:

$ more saccharomyces_cerevisiae_R64-1-1_20110208.gff 
chrI    SGD     chromosome      1       230218  .       .       .       ID=chrI;dbxref=NCBI:NC_001133;Name=chrI
chrI    SGD     repeat_region   1       62      .       -       .       ID=TEL01L-TR;Name=TEL01L-TR;Note=Terminal%20stretch%20of%20telomeric%20repeats%20on%20the%20left%20arm%20of%20Chromosome%20I;dbxref=SGD:S000028864
chrI    SGD     telomere        1       801     .       -       .       ID=TEL01L;Name=TEL01L;Note=Telomeric%20region%20on%20the%20left%20arm%20of%20Chromosome%20I%3B%20composed%20of%20an%20X%20element%20core%20sequence%2C%20X%20element%20combinatorial%20repeats%2C%20and%20a%20short%20terminal%20stretch%20of%20telomeric%20repeats;dbxref=SGD:S000028862
chrI    SGD     repeat_region   63      336     .       -       .       ID=TEL01L-XR;Name=TEL01L-XR;Note=Telomeric%20X%20element%20combinatorial%20Repeat%20region%20on%20the%20left%20arm%20of%20Chromosome%20I%3B%20contains%20repeats%20of%20the%20D%2C%20C%2C%20B%20and%20A%20types%2C%20as%20well%20as%20Tbf1p%20binding%20sites%3B%20formerly%20called%20SubTelomeric%20Repeats;dbxref=SGD:S000028866
chrI    SGD     gene    335     649     .       +       .       ID=YAL069W;Name=YAL069W;Ontology_term=GO:0003674,GO:0005575,GO:0008150;Note=Dubious%20open%20reading%20frame%20unlikely%20to%20encode%20a%20protein%2C%20based%20on%20available%20experimental%20and%20comparative%20sequence%20data;dbxref=SGD:S000002143;orf_classification=Dubious
chrI    SGD     CDS     335     649     .       +       0       Parent=YAL069W;Name=YAL069W;Ontology_term=GO:0003674,GO:0005575,GO:0008150;Note=Dubious%20open%20reading%20frame%20unlikely%20to%20encode%20a%20protein%2C%20based%20on%20available%20experimental%20and%20comparative%20sequence%20data;dbxref=SGD:S000002143;orf_classification=Dubious
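
Two common first questions about an annotation file are "what kinds of features are in here?" and "what are the genes called?". Here is a minimal sketch of answering both with awk. The rows in mini.gff are taken from the example above with the ninth column trimmed for readability. The file name and the helpers count_features and gene_ids are made up for illustration, and the ID lookup assumes GFF3-style key=value pairs separated by semicolons.

```shell
# Build a small annotation file from the rows above (attributes shortened).
workdir=$(mktemp -d)
{
    printf 'chrI\tSGD\tchromosome\t1\t230218\t.\t.\t.\tID=chrI;Name=chrI\n'
    printf 'chrI\tSGD\trepeat_region\t1\t62\t.\t-\t.\tID=TEL01L-TR;Name=TEL01L-TR\n'
    printf 'chrI\tSGD\trepeat_region\t63\t336\t.\t-\t.\tID=TEL01L-XR;Name=TEL01L-XR\n'
    printf 'chrI\tSGD\tgene\t335\t649\t.\t+\t.\tID=YAL069W;Name=YAL069W\n'
    printf 'chrI\tSGD\tCDS\t335\t649\t.\t+\t0\tParent=YAL069W;Name=YAL069W\n'
} > "$workdir/mini.gff"

# count_features FILE: print "feature<TAB>count" for each feature type (column 3).
count_features() {
    awk -F '\t' '!/^#/ { n[$3]++ } END { for (f in n) print f "\t" n[f] }' "$1" | sort
}

# gene_ids FILE: print the ID attribute of every row whose feature type is "gene".
gene_ids() {
    awk -F '\t' '$3 == "gene" {
        n = split($9, attrs, ";")                 # break column 9 into key=value pairs
        for (i = 1; i <= n; i++)
            if (attrs[i] ~ /^ID=/) print substr(attrs[i], 4)
    }' "$1"
}

count_features "$workdir/mini.gff"
gene_ids "$workdir/mini.gff"
```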

Figure: An example of a genome sequence (.fa) and annotation file (.gtf) that are rendered as browsable pictures on the UCSC Genome Browser http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=Erin%20Osborne&hgS_otherUserSessionName=180616_ce11_class

Other File Formats

:!: Quick Tip There are many other standardized file formats in genomics. Much of learning the ropes of genomics is getting to know these standardized formats. UCSC File Formats


:!: Quick Tip Where do file extensions come from? For flat text files, you put them there!

:!: Quick Tip It is good practice to always save files with the proper file extensions! All files should have extensions!!

:!: Quick Tip: Some standardized file formats will have lines at the beginning that start with #. These are comments. They typically contain information about how the file was generated.
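
Comment lines are useful to read but get in the way when you want to process the data rows. Here is a minimal sketch of separating the two with grep. The file with_header.gff, its second comment line, and the helpers show_comments and strip_comments are made up for illustration. The first line, ##gff-version 3, is the standard opening line of a GFF3 file.

```shell
# Build a demo file with two comment lines followed by one data row.
workdir=$(mktemp -d)
cat > "$workdir/with_header.gff" <<'EOF'
##gff-version 3
# hypothetical note about how this file was generated
EOF
printf 'chrDemo\tdemo\tgene\t1\t9\t.\t+\t.\tID=geneA\n' >> "$workdir/with_header.gff"

# show_comments FILE: print only the lines that start with "#".
show_comments() {
    grep '^#' "$1"
}

# strip_comments FILE: print everything except the lines that start with "#".
strip_comments() {
    grep -v '^#' "$1"
}

show_comments "$workdir/with_header.gff"
strip_comments "$workdir/with_header.gff"
```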

:!: Quick Tip: Check what type of file you have using the file command.

Usage: file
file <file.txt>

$ file file.txt

Continue to Genomic Resources

wiki/2018datasets.txt · Last modified: 2018/08/28 09:03 by erin