wiki:curtain_bin_vs_txt

Differences

This shows you the differences between two versions of the page.

wiki:curtain_bin_vs_txt [2018/07/26 16:40]
david
wiki:curtain_bin_vs_txt [2018/07/27 16:35]
david [Text-based Genomics filetypes]
Line 1: Line 1:
 +
 +
 ====== What is text format versus binary? ====== ====== What is text format versus binary? ======
 **Isn'​t EVERYTHING binary?** **Isn'​t EVERYTHING binary?**
Line 5: Line 7:
  
  
-====== How is text binary? ====== +=== Even text is binary ===
- +
  
-Well, programs that read text really read binary, but know to display it as text. +All things in a modern computer are binary. Each program that reads binary data must know what to do with it. That's true for text viewers, like Microsoft Word, nano, and even the terminal itself.
  
-Let's say that I wanted to know what binary numbers represent ​the text ''​Hey Mom!'' ​+Take the following excerpt from a famous play:
  
-That representation is called a //character encoding//, and we are using [[https://en.wikipedia.org/wiki/ASCII|ASCII]]. +==== What the text viewer sees (binary) ====
 +<​code>​ 
 +01000001 01101100 01100001 01110011 00100001 00100000 01110000 01101111 01101111 01110010 
 +00100000 01011001 01101111 01110010 01101001 01100011 01101011 00101110 00100000 01001001 
 +00100000 01101011 01101110 01100101 01110111 00100000 01101000 01101001 01101101 00101100 
 +00100000 01001000 01101111 01110010 01100001 01110100 01101001 01101111 00111011 00100000 
 +01100001 00100000 01100110 01100101 01101100 01101100 01101111 01110111 00100000 01101111 
 +01100110 00100000 01101001 01101110 01100110 01101001 01101110 01101001 01110100 01100101 
 +00100000 01101010 01100101 01110011 01110100 00101100 00100000 01101111 01100110 00100000 
 +01101101 01101111 01110011 01110100 00100000 01100101 01111000 01100011 01100101 01101100 
 +01101100 01100101 01101110 01110100 00100000 01100110 01100001 01101110 01100011 01111001 
 +00111011 00100000 01101000 01100101 00100000 01101000 01100001 01110100 01101000 00100000 
 +01100010 01101111 01110010 01101110 01100101 00100000 01101101 01100101 00100000 01101111 
 +01101110 00100000 01101000 01101001 01110011 00100000 01100010 01100001 01100011 01101011 
 +00100000 01100001 00100000 01110100 01101000 01101111 01110101 01110011 01100001 01101110 
 +01100100 00100000 01110100 01101001 01101101 01100101 01110011 00111011 00100000 01100001 
 +01101110 01100100 00100000 01101110 01101111 01110111 00101100 00100000 01101000 01101111 
 +01110111 00100000 01100001 01100010 01101000 01101111 01110010 01110010 01100101 01100100 
 +00100000 01101001 01101110 00100000 01101101 01111001 00100000 01101001 01101101 01100001 
 +01100111 01101001 01101110 01100001 01110100 01101001 01101111 01101110 00100000 01101001 
 +01110100 00100000 
 +</code>
  
-From the table on that page, I know that 72, 101 and 121 respectively encode the letters ''H'', ''e'', ''y''.
  
-We can use python to verify that: +==== What the text viewer sees (decimal) ==== 
-<​code ​bash+<​code>​ 
-$ python -c 'print chr(72), chr(101), chr(121)' + 65 108  97 115  33  32 112 111 111 114  32  89 111 114 105  99 107  46  32  73  32 107 110 
-H e y+101 119  32 104 105 109  44  32  ​72 111 114  97 116 105 111  59  32  97  32 102 101 108 108 
 +111 119  32 111 102  32 105 110 102 105 110 105 116 101  32 106 101 115 116  44  32 111 102 
 + 32 109 111 115 116  32 101 120  99 101 108 108 101 110 116  32 102  97 110  99 121  ​59 ​ 32 
 +104 101  32 104  97 116 104  32  98 111 114 110 101  32 109 101  32 111 110  32 104 105 115 
 + ​32 ​ 98  97  99 107  32  97  32 116 104 111 117 115  97 110 100  32 116 105 109 101 115  59 
 + ​32 ​ 97 110 100  32 110 111 119  44  32 104 111 119  32  97  98 104 111 114 114 101 100  32 
 +105 110  32 109 121  32 105 109  97 103 105 110  97 116 105 111 110  32 105 116  32
 </​code>​ </​code>​
  
-But how is this binary? Numbers like 72 are decimals, that is, they use a base 10 system. The base 2 representation (the binary string) can also be gotten with python, this time using string formatting. +==== What the text viewer displays (ASCII character set) ====
-<code bash> +
-$ python -c 'print "​{0:​b}"​.format(72)'​ +<​code>​ 
-1001000 +Alas! poor Yorick. I knew him, Horatio; a fellow of infinite jest, of most
 +excellent fancy; he hath borne me on his back a thousand times; and now, how
 +abhorred in my imagination it is!
 </​code>​ </​code>​
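The three views above are the same bytes read three ways. A minimal Python 3 sketch of that round trip follows; the helper names `bits_to_text` and `text_to_bits` are made up for this illustration and are not part of the original page.

```python
# Sketch: convert between space-separated 8-bit binary strings and ASCII text.
# The function names are hypothetical, chosen only for this example.

def bits_to_text(bits: str) -> str:
    """Turn '01000001 01101100 ...' into the characters those bytes encode."""
    return bytes(int(chunk, 2) for chunk in bits.split()).decode("ascii")


def text_to_bits(text: str) -> str:
    """Turn ASCII text into space-separated 8-bit binary strings."""
    return " ".join(format(byte, "08b") for byte in text.encode("ascii"))


# The first five bytes of the excerpt above.
print(bits_to_text("01000001 01101100 01100001 01110011 00100001"))

# The decimal view and the binary view of a short string.
print(list("Hey".encode("ascii")))
print(text_to_bits("Hey"))
```

Calling `text_to_bits("Yorick")` is one way to check your answer to the question that follows.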
  
 +Can you tell what the binary string for "​Yorick"​ is?
  
 +The point of this lesson is that what makes a file format is how the program that uses the file interprets its bytes and, beyond that, how a human recognizes the result.
  
-The string of 1's and 0's isn't very useful to humans. But humans have other utilities to work with binary. One of them is "hexdump", let's use hexdump to decode that character string ''Hey Mom!'' +==== Metadata and the file extension ====
 +Metadata is a small portion of a set of data (like a document) that describes the data as a whole. On Linux, (non-text) binary files rely upon a form of metadata, a header, for programs to recognize them.
 +The //**file**// command reads that header and tries to recognize the file type.
  
 <code bash> <code bash>
-echo -ne 'Hey Mom!' | hexdump -C +$ file bedtools-2.25.0.tar.gz
-00000000 ​ 48 65 79 20 4d 6f 6d 21                           |Hey Mom!| +bedtools-2.25.0.tar.gz:​ gzip compressed data, from Unix, last modified: Wed Sep  2 22:42:14 2015
-00000008+
 </​code>​ </​code>​
-^ output ^ '​splain ^ 
-|''​00000000''​ | starting position | 
-|''​48 65 79 20 4d 6f 6d 21''​ | hexadecimal numbers corresponding to ''​H e y (sp) M o m !''​ | 
-|''​%%|%%Hey Mom!%%|%%''​| the ASCII encoding| 
-|''​00000008''​ | end position| 
  
-It's easier to read binary numbers as hexadecimal (base 16), so that's how they are displayed by hexdump by default. +One way to keep track of things is with the file extension.
  
-Here'​s ​a way to show the decimals with the hexdump command:+It is a way for the user or a program ​to check to see if the file is the right format. Ultimately, however, the data is still the same regardless of the file extension.
  
-<code bash> +<code>
-echo -ne 'Hey Mom!' | hexdump --e '4/1 "%d " " =|"' -e '4/1 "%_p|" "\n"' +$ mv bedtools-2.25.0.tar.gz theExtDOESNT.matter
-72 101 121 32 =|H|e|y| | +$ file theExtDOESNT.matter  
-77 111 109 33 =|M|o|m|!|+theExtDOESNT.matter:​ gzip compressed data, from Unix, last modified: Wed Sep  2 22:42:14 2015 
 +$ gunzip theExtDOESNT.matter  
 +gzip: theExtDOESNT.matter:​ unknown suffix ​-- ignored 
 +$ mv theExtDOESNT.matter bedtools-2.25.0.tar.gz 
 +$ gunzip -v bedtools-2.25.0.tar.gz ​ 
 +bedtools-2.25.0.tar.gz:​ 63.2% -- replaced with bedtools-2.25.0.tar
 </​code>​ </​code>​
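A rough Python 3 sketch of the same idea: it ignores the file name and looks only at the first two bytes, which for gzip data are always `1f 8b` (RFC 1952). The function name `looks_like_gzip` and the file name `notes.gz` are assumptions made for this example; the real //file// command checks far more than this.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def looks_like_gzip(path: str) -> bool:
    """Report whether the file starts with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file names, used only for this demonstration.
    odd_name = os.path.join(tmp, "theExtDOESNT.matter")
    misleading_name = os.path.join(tmp, "notes.gz")

    with gzip.open(odd_name, "wb") as handle:      # real gzip data, odd extension
        handle.write(b"Alas! poor Yorick.")
    with open(misleading_name, "wb") as handle:    # plain text, misleading extension
        handle.write(b"Alas! poor Yorick.")

    odd_name_is_gzip = looks_like_gzip(odd_name)
    misleading_name_is_gzip = looks_like_gzip(misleading_name)

print(odd_name_is_gzip)         # the content decides, not the name
print(misleading_name_is_gzip)
```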
  
-Now we know how to write "Mom!" with the numbers 77, 111, 109, and 33. Let's use the python built-in function "chr" again, this time to write the full string. +So the file extension //does// matter, at least for programs that make use of it. But the data is unchanged.
  
-<code bash> +==== Some other binary types ====
-python -c 'print chr(72), chr(101), chr(121), chr(32)' +
-H e y +<code>
-$ python -c 'print chr(77), chr(111), chr(109), chr(33)' +$ file /bin/ls
-M o m ! +/bin/ls: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=6129e7403942b90574b8c28439d128ff5515efeb, stripped
-$ python -c 'print chr(72), chr(101), chr(121), chr(32), chr(77), chr(111), chr(109), chr(33)' +
-H e y   M o m !+
 </​code>​ </​code>​
  
-TA-DA!+You see that the command //ls// is actually a computer program that is a binary executable.  
 + 
 + 
 +<​code>​ 
 +$ file AR122.bam 
 +AR122.bam: gzip compressed data, extra field 
 +</​code>​ 
 + 
 + 
 +BAM, a genomics data file format ([[https://www.biorxiv.org/content/early/2015/05/29/020024|binary format based on samtools Sequence Alignment Map]]), is recognized as gzip compressed data, but the //file// command doesn't identify the full data type.
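A program that understands BAM has to look one layer deeper than the gzip header. According to the SAM/BAM specification, the decompressed data starts with the four bytes `BAM\1`. The sketch below checks for them in Python 3. The function name `looks_like_bam` is hypothetical, and the test file is a plain gzip stand-in written by the script, not a real BGZF-compressed BAM file.

```python
import gzip
import os
import tempfile

BAM_MAGIC = b"BAM\x01"  # first four bytes of decompressed BAM data (SAM/BAM spec)


def looks_like_bam(path: str) -> bool:
    """Decompress the start of the file and compare it with the BAM magic."""
    try:
        with gzip.open(path, "rb") as handle:
            return handle.read(4) == BAM_MAGIC
    except (OSError, EOFError):
        # Not gzip data, or a truncated stream.
        return False


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in files for the demonstration; neither is a real alignment file.
    fake_bam = os.path.join(tmp, "standin.bam")
    plain_gz = os.path.join(tmp, "text.gz")

    with gzip.open(fake_bam, "wb") as handle:
        handle.write(BAM_MAGIC + b"placeholder payload")
    with gzip.open(plain_gz, "wb") as handle:
        handle.write(b"Alas! poor Yorick.")

    fake_bam_result = looks_like_bam(fake_bam)
    plain_gz_result = looks_like_bam(plain_gz)

print(fake_bam_result)
print(plain_gz_result)
```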
 + 
 +==== Text-based Genomics filetypes ==== 
 + 
 +8-) 
 + 
 +As with binary data, text data can have a more specific format. The extensions //.txt, .sh, .bash, .c, .gff, .fasta, .py// are all common file extensions that a genomics user sees on a Linux computer.
 + 
 +It is up to the genomicist and the programs he or she uses to produce/​validate files with the given formats. 
 + 
 +Examples of text formats in genomics. 
 + 
 +^ Filetype ^ description ^ extension ^ format definition ^ 
 +| Fasta | DNA/​RNA/​Protein sequence with header | .fasta, .fa | [[http://​genetics.bwh.harvard.edu/​pph/​FASTA.html|link]] | 
 +| Fastq | DNA/RNA sequence with header + quality information | .fastq, .fq | [[http://maq.sourceforge.net/fastq.shtml|link]] |
 +| GFF   | General Feature Format | .gff | [[https://genome.ucsc.edu/FAQ/FAQformat.html#format3|link]] |
 +| BED   | Browser Extensible Data | .bed | [[https://​genome.ucsc.edu/​FAQ/​FAQformat.html#​format1|link]] | 
 +| SAM   | Sequence Alignment Map | .sam | [[http://​www.htslib.org/​doc/​sam.html|link]] | 
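Because these formats are plain text, validating one is mostly a matter of checking line structure. As an illustration, here is a minimal Python 3 sketch of a FASTA reader. It assumes the simplest reading of the format (a `>` header line followed by one or more sequence lines) and the name `parse_fasta` is made up for this example. It is not a replacement for an established parser.

```python
from typing import Dict, Optional


def parse_fasta(text: str) -> Dict[str, str]:
    """Map each header (without the leading '>') to its joined sequence lines."""
    records: Dict[str, str] = {}
    header: Optional[str] = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            if header in records:
                raise ValueError(f"duplicate header: {header}")
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data before the first '>' header")
        else:
            records[header] += line
    return records


# Hypothetical two-record input, written inline so the example is self-contained.
example = ">seq1 hypothetical record\nACGT\nTTGA\n>seq2\nGGCC\n"
records = parse_fasta(example)
print(records)
```

The `ValueError` cases show the kind of check a format-aware program adds on top of "it is text".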
 + 
 + 
wiki/curtain_bin_vs_txt.txt · Last modified: 2018/07/27 16:36 by david