wiki:curtain_bin_vs_txt

Differences

This shows you the differences between two versions of the page.

wiki:curtain_bin_vs_txt [2018/07/27 02:01]
david [What the text viewer displays (ASCII character set)]
wiki:curtain_bin_vs_txt [2018/07/27 16:36] (current)
david [Some other binary types]
Line 56:
  in my imagination it is!
 </code>
 +
 +Can you tell what the binary string for "​Yorick"​ is?
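
One way to check an answer is to let the shell print the bytes. This is a minimal sketch, assuming the standard //od// utility is available. It prints each byte in hexadecimal, and each hex digit corresponds to four binary digits, so the conversion to a binary string is left to the reader. The variable name is only illustrative.

```shell
# Print the bytes of the string "Yorick" in hexadecimal, one byte per column.
# printf is used instead of echo so that no trailing newline byte is added.
yorick_hex=$(printf 'Yorick' | od -An -tx1)
echo "$yorick_hex"
```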
 +
 +The point of this lesson is that a file format is defined, in practice, by how the program that uses the file recognizes it and, beyond that, by how a human recognizes it.
 +
 +==== Metadata and the file extension ====
 +Metadata is a small portion of a data set (such as a document) that describes the data as a whole. On Linux, binary (non-text) files typically begin with a header, a form of metadata that lets a program recognize the format.
 +The //**file**// command reads this header and tries to identify the file type.
 +
 +<code bash>
 +$ file bedtools-2.25.0.tar.gz ​
 +bedtools-2.25.0.tar.gz:​ gzip compressed data, from Unix, last modified: Wed Sep  2 22:42:14 2015
 +</​code>​
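
The header that //file// looks for can also be inspected by hand. The sketch below assumes //gzip//, //head//, and //od// are installed. It compresses a short string and prints the first two bytes of the result, which for gzip data are the identifying bytes 1f 8b. The variable name is only illustrative.

```shell
# Compress a small piece of text and keep only the first two bytes of the output.
# These two bytes are the gzip "magic number" that identifies the format.
gzip_magic=$(printf 'hello\n' | gzip -c | head -c 2 | od -An -tx1)
echo "$gzip_magic"
```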
 +
 +One way to keep track of a file's format is the file extension.
 +
 +The extension lets a user or a program quickly check whether a file is in the expected format. Ultimately, however, the data stays the same regardless of the extension.
 +
 +<​code>​
 +$ mv bedtools-2.25.0.tar.gz theExtDOESNT.matter
 +$ file theExtDOESNT.matter ​
 +theExtDOESNT.matter:​ gzip compressed data, from Unix, last modified: Wed Sep  2 22:42:14 2015
 +$ gunzip theExtDOESNT.matter ​
 +gzip: theExtDOESNT.matter:​ unknown suffix -- ignored
 +$ mv theExtDOESNT.matter bedtools-2.25.0.tar.gz
 +$ gunzip -v bedtools-2.25.0.tar.gz ​
 +bedtools-2.25.0.tar.gz:​ 63.2% -- replaced with bedtools-2.25.0.tar
 +</​code>​
 +
 +So the file extension //does// matter for programs that rely on it, such as //gunzip//. The data itself, however, is unchanged.
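
To see that only the name check got in the way, the data can be fed to //gzip// on standard input, where no file name is involved. This is a sketch under stated assumptions: the file names and the //.odd// extension are made up for the illustration, and everything happens in a temporary directory.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
printf 'same data\n' > "$workdir/note.txt"
gzip "$workdir/note.txt"                       # creates note.txt.gz
mv "$workdir/note.txt.gz" "$workdir/note.odd"  # hypothetical, unrecognized extension

# Reading from standard input means gzip never sees the file name,
# so the extension cannot influence the result.
recovered=$(gzip -dc < "$workdir/note.odd")
echo "$recovered"
rm -r "$workdir"
```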
 +
 +==== Some other binary types ====
 +
 +=== Executable ===
 +<​code>​
 +$ file /bin/ls
 +/bin/ls: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=6129e7403942b90574b8c28439d128ff5515efeb,​ stripped
 +</​code>​
 +
 +The output shows that the //ls// command is itself a program, stored on disk as a binary executable in the ELF format.
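
The ELF header can be read directly as well. The sketch below assumes a Linux system where ///bin/ls// is an ELF binary, as in the output above. It prints the first four bytes, which are 7f followed by the letters E, L, F.

```shell
# Read the first four bytes of the ls executable and print them in hexadecimal.
# On a typical Linux system these are the ELF magic bytes: 7f 45 4c 46.
elf_magic=$(head -c 4 /bin/ls | od -An -tx1)
echo "$elf_magic"
```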
 +
 +=== Binary Alignment Map ===
 +<​code>​
 +$ file AR122.bam
 +AR122.bam: gzip compressed data, extra field
 +</​code>​
 +
 +
 +BAM ([[https://www.biorxiv.org/content/early/2015/05/29/020024|a binary format based on the samtools Sequence Alignment Map]]) is a genomics data file format. The //file// command recognizes it as gzip compressed data, because BAM is stored in BGZF, a gzip-compatible block compression format, but it does not identify the more specific data type inside.
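
A program that does know the format looks one layer deeper: after decompression, a BAM file begins with the four bytes "BAM" followed by the byte value 1. The following is a minimal sketch of that check. The helper name //is_bam// is hypothetical, and a real tool such as samtools validates far more than these four bytes.

```shell
# Return success if the file, once decompressed, starts with the BAM magic bytes "BAM\1".
# is_bam is a hypothetical helper name used only for this illustration.
is_bam() {
    # Decompress from standard input and keep only the first four bytes.
    magic=$(gzip -dc < "$1" 2>/dev/null | head -c 4 | od -An -tx1)
    # 42 41 4d 01 is "B", "A", "M", and the byte value 1 in hexadecimal.
    [ "$(echo $magic)" = "42 41 4d 01" ]
}
```

For example, `is_bam AR122.bam && echo "looks like BAM"` would print the message only when the magic bytes match.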
 +
 +==== Text-based Genomics filetypes ====
 +
 +8-)
 +
 +As with binary data, text data can have a more specific format. The extensions //.txt, .sh, .bash, .c, .gff, .fasta, .py// are all common file extensions that a genomics user sees on a Linux computer.
 +
 +It is up to the genomicist, and the programs he or she uses, to produce and validate files in the given formats.
 +
 +Examples of text formats in genomics.
 +
 +^ Filetype ^ Description ^ Extension ^ Format definition ^
 +| Fasta | DNA/RNA/Protein sequence with header | .fasta, .fa | [[http://genetics.bwh.harvard.edu/pph/FASTA.html|link]] |
 +| Fastq | DNA/RNA sequence with header + quality information | .fastq, .fq | [[http://maq.sourceforge.net/fastq.shtml|link]] |
 +| GFF   | General Feature Format | .gff | [[https://genome.ucsc.edu/FAQ/FAQformat.html#format3|link]] |
 +| BED   | Browser Extensible Data | .bed | [[https://genome.ucsc.edu/FAQ/FAQformat.html#format1|link]] |
 +| SAM   | Sequence Alignment Map | .sam | [[http://www.htslib.org/doc/sam.html|link]] |
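
Because these formats are plain text, a first-pass check can look at the text itself rather than the extension. The sketch below guesses between Fasta and Fastq from the first character of the file. The helper name //guess_seq_format// is hypothetical, and the test is deliberately crude: a SAM header line also begins with "@", so a real validator has to examine much more of the file.

```shell
# Guess a sequence text format from the first character of the file.
# guess_seq_format is a hypothetical helper; real validators check far more than this.
guess_seq_format() {
    first_char=$(head -c 1 "$1")
    case "$first_char" in
        '>') echo "fasta" ;;    # Fasta header lines start with ">"
        '@') echo "fastq" ;;    # Fastq records start with "@"
        *)   echo "unknown" ;;
    esac
}
```

Calling `guess_seq_format reads.fq` on a Fastq file (the file name here is made up) would print "fastq" whatever the extension says.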
  
  
  
wiki/curtain_bin_vs_txt.1532678512.txt.gz · Last modified: 2018/07/27 02:01 by david