wiki:curtain_bin_vs_txt

Differences

This shows you the differences between two versions of the page.

wiki:curtain_bin_vs_txt [2018/07/14 00:54]
david
wiki:curtain_bin_vs_txt [2018/07/27 16:35]
david [Text-based Genomics filetypes]
Line 1: Line 1:
-How is text binary? Well, programs that read text really read binary, but know to display it as text.  
  
-Let's say that I wanted to know what binary numbers represent the text "Hey Mom!" That representation is called a character encoding and we are using [[https://​en.wikipedia.org/​wiki/​ASCII|ASCII]]. From the table on that page, I know that 72, 101 and 121 respectively encode the letters "​H",​ "​e",​ "​y"​. ​ 
  
-We can use python to verify that: +====== What is text format versus binary? ======
-<code bash> +**Isn't EVERYTHING binary?**
-$ python -c 'print chr(72), chr(101), chr(121)' +
-H e y +The answer is in //character encodings//. Everything on a computer is binary, but programs that read text know to map the numbers stored in a text file to specific characters.
 + 
 + 
 +=== Even text is binary === 
 + 
 +All things in a modern computer are binary. Each program that reads binary data must know what to do with it. That's true for text viewers, like Microsoft Word, nano, and even the terminal itself.
 + 
 +Take the following excerpt from a famous play: 
 + 
 +==== What the text viewer sees (binary) ====
 +<​code>​ 
 +01000001 01101100 01100001 01110011 00100001 00100000 01110000 01101111 01101111 01110010 
 +00100000 01011001 01101111 01110010 01101001 01100011 01101011 00101110 00100000 01001001 
 +00100000 01101011 01101110 01100101 01110111 00100000 01101000 01101001 01101101 00101100 
 +00100000 01001000 01101111 01110010 01100001 01110100 01101001 01101111 00111011 00100000 
 +01100001 00100000 01100110 01100101 01101100 01101100 01101111 01110111 00100000 01101111 
 +01100110 00100000 01101001 01101110 01100110 01101001 01101110 01101001 01110100 01100101 
 +00100000 01101010 01100101 01110011 01110100 00101100 00100000 01101111 01100110 00100000 
 +01101101 01101111 01110011 01110100 00100000 01100101 01111000 01100011 01100101 01101100 
 +01101100 01100101 01101110 01110100 00100000 01100110 01100001 01101110 01100011 01111001 
 +00111011 00100000 01101000 01100101 00100000 01101000 01100001 01110100 01101000 00100000 
 +01100010 01101111 01110010 01101110 01100101 00100000 01101101 01100101 00100000 01101111 
 +01101110 00100000 01101000 01101001 01110011 00100000 01100010 01100001 01100011 01101011 
 +00100000 01100001 00100000 01110100 01101000 01101111 01110101 01110011 01100001 01101110 
 +01100100 00100000 01110100 01101001 01101101 01100101 01110011 00111011 00100000 01100001 
 +01101110 01100100 00100000 01101110 01101111 01110111 00101100 00100000 01101000 01101111 
 +01110111 00100000 01100001 01100010 01101000 01101111 01110010 01110010 01100101 01100100 
 +00100000 01101001 01101110 00100000 01101101 01111001 00100000 01101001 01101101 01100001 
 +01100111 01101001 01101110 01100001 01110100 01101001 01101111 01101110 00100000 01101001 
 +01110100 00100000 01101001 01110011 00100001
 </​code>​ </​code>​
  
-But how is this binary? Numbers like 72 are decimal, that is, they use a base 10 system. The base 2 representation (the binary string) can also be obtained with python, this time using string formatting. +
-<code bash> +==== What the text viewer sees (decimal) ====
-$ python -c 'print "{0:b}".format(72)' +<code>
-1001000 + 65 108  97 115  33  32 112 111 111 114  32  89 111 114 105  99 107  46  32  73  32 107 110
 +101 119  32 104 105 109  44  32  ​72 111 114  97 116 105 111  59  32  97  32 102 101 108 108 
 +111 119  32 111 102  32 105 110 102 105 110 105 116 101  32 106 101 115 116  44  32 111 102 
 + 32 109 111 115 116  32 101 120  99 101 108 108 101 110 116  32 102  97 110  99 121  59  32 
 +104 101  32 104  97 116 104  32  98 111 114 110 101  32 109 101  32 111 110  32 104 105 115 
 + ​32 ​ 98  97  99 107  32  97  32 116 104 111 117 115  97 110 100  32 116 105 109 101 115  59 
 + ​32 ​ 97 110 100  32 110 111 119  44  32 104 111 119  32  97  98 104 111 114 114 101 100  32 
 +105 110  32 109 121  32 105 109  97 103 105 110  97 116 105 111 110  32 105 116  32 105 115  33
 </​code>​ </​code>​
  
 +==== What the text viewer displays (ASCII character set) ====
  
 +<​code>​
 +Alas! poor Yorick. I knew him, Horatio; a fellow of infinite jest, of most excel
 +lent fancy; he hath borne me on his back a thousand times; and now, how abhorred
 + in my imagination it is!
 +</​code>​
  
-The string of 1's and 0's isn't very useful to humans. But humans have other utilities to work with binary. One of them is "hexdump"; let's use hexdump to decode that character string "Hey Mom!" +Can you tell what the binary string for "Yorick" is?
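
One way to check an answer to that question is to let a short script do the encoding. The following is a minimal Python 3 sketch, assuming the text is pure ASCII; the helper names ''to_binary'', ''to_decimal'' and ''from_binary'' are made up for this illustration and are not part of any library.

```python
def to_binary(text):
    """Return each ASCII character of text as an 8-bit binary string."""
    return " ".join(format(byte, "08b") for byte in text.encode("ascii"))


def to_decimal(text):
    """Return each ASCII character of text as a decimal number."""
    return " ".join(str(byte) for byte in text.encode("ascii"))


def from_binary(bits):
    """Turn space-separated 8-bit binary strings back into ASCII text."""
    return bytes(int(chunk, 2) for chunk in bits.split()).decode("ascii")


print(to_binary("Yorick"))
print(to_decimal("Yorick"))
print(from_binary("01000001 01101100 01100001 01110011 00100001"))
```

The first two calls give the binary and decimal views of the same six characters, and the last call decodes the first five bytes of the excerpt back into text.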
 + 
 +The point of this lesson is that, in a way, what makes a file format is how the program that uses it recognizes it and, beyond that, how the human recognizes it.
 + 
 +==== Metadata and the file extension ==== 
 +Metadata is a small portion of a set of data (like a document) that describes the data as a whole. On Linux, (non-text) binary files rely on a form of metadata, a header, for programs to recognize them.
 +The //**file**// command tries to recognize this header.
  
 <code bash> <code bash>
-$ echo -ne 'Hey Mom!' | hexdump -C +$ file bedtools-2.25.0.tar.gz
-00000000 ​ 48 65 79 20 4d 6f 6d 21                           |Hey Mom!| +bedtools-2.25.0.tar.gz:​ gzip compressed data, from Unix, last modified: Wed Sep  2 22:42:14 2015
-00000008+
 </​code>​ </​code>​
  
-00000000 - starting position+One way to keep track of things is with the file extension. ​
  
-48 65 79 20 4d 6f 6d 21 - hexadecimal numbers corresponding to H, e, y, space, M, o, m, ! +It is a way for the user or a program to check whether the file is in the right format. Ultimately, however, the data is still the same regardless of the file extension.
  
- |Hey Mom!| the ASCII encoding+<​code>​ 
 +$ mv bedtools-2.25.0.tar.gz theExtDOESNT.matter 
 +$ file theExtDOESNT.matter  
 +theExtDOESNT.matter:​ gzip compressed data, from Unix, last modified: Wed Sep  2 22:42:14 2015 
 +$ gunzip theExtDOESNT.matter  
 +gzip: theExtDOESNT.matter:​ unknown suffix -- ignored 
 +$ mv theExtDOESNT.matter bedtools-2.25.0.tar.gz 
 +$ gunzip -v bedtools-2.25.0.tar.gz  
 +bedtools-2.25.0.tar.gz:​ 63.2% -- replaced with bedtools-2.25.0.tar 
 +</​code>​
  
-00000008 - end position +So the file extension //does// matter, for programs that make use of it. But the data is unchanged.
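
The same idea can be sketched in Python 3: look at the first bytes of the file instead of its name. This is only an illustration, assuming the well-known two-byte gzip signature (hex ''1f 8b''); the function name ''looks_like_gzip'' and the file name ''notes.gz'' are hypothetical, and the real //file// command checks far more than two bytes.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_like_gzip(path):
    """Return True if the file at path starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


with tempfile.TemporaryDirectory() as workdir:
    # Real gzip data stored under a misleading name.
    disguised = os.path.join(workdir, "theExtDOESNT.matter")
    with gzip.open(disguised, "wb") as handle:
        handle.write(b"Hey Mom!")

    # Plain text stored under a .gz name (hypothetical file).
    impostor = os.path.join(workdir, "notes.gz")
    with open(impostor, "w") as handle:
        handle.write("Hey Mom!")

    disguised_is_gzip = looks_like_gzip(disguised)
    impostor_is_gzip = looks_like_gzip(impostor)

    # Python's gzip module reads the data without consulting the extension.
    with gzip.open(disguised, "rb") as handle:
        payload = handle.read()

print(disguised_is_gzip, impostor_is_gzip, payload)
```

Here the check depends only on the contents: the misnamed gzip file is recognized, and the text file with a ''.gz'' name is not.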
  
-It's easier to read binary ​numbers as hexadecimal (base 16), so that's how they are displayed by hexdump by default.+==== Some other binary ​types ====
  
-Here's a way to show the decimals with the hexdump command: +<​code>​ 
- +file /bin/ls 
-<code bash> +/bin/ls: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=6129e7403942b90574b8c28439d128ff5515efeb, stripped
-$ echo -ne 'Hey Mom!' | hexdump -v -e '4/1 "%d " " =|"' -e '4/1 "%_p|" "\n"' +
-72 101 121 32 =|H|e|y| | +
-77 111 109 33 =|M|o|m|!|+
 </​code>​ </​code>​
  
-Now we know how to write "​Mom!"​ with the numbers 77, 111, 109, and 33. Let's use the python built-in function "​chr"​ again, this time to write the full string.+You see that the command //ls// is actually a computer program that is a binary executable
  
-<​code ​bash+ 
-python -c 'print chr(72), chr(101), chr(121), chr(32)'​ +<​code>​ 
-H e y +file AR122.bam 
-$ python -c 'print chr(77), chr(111), chr(109), chr(33)' +AR122.bam: gzip compressed data, extra field
-M o m ! +
-$ python -c 'print chr(72), chr(101), chr(121), chr(32), chr(77), chr(111), chr(109), chr(33)'​ +
-H e y   M o m !+
 </​code>​ </​code>​
  
-TA-DA!+ 
 +BAM, a genomics data file format ([[https://www.biorxiv.org/content/early/2015/05/29/020024|binary format based on samtools Sequence Alignment Map]]), is recognized as gzip compressed data, but the //file// command doesn't identify the full data type.
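
A program that understands BAM has to look one layer deeper than the gzip header. The following Python 3 sketch assumes that decompressed BAM data begins with the four bytes ''BAM\1''; the function name ''looks_like_bam'' is hypothetical, and the test file is a small gzip stand-in written for this example, not a real alignment file.

```python
import gzip
import os
import tempfile

BAM_MAGIC = b"BAM\x01"  # first four bytes of decompressed BAM data


def looks_like_bam(path):
    """Return True if the gzip-decompressed data starts with the BAM magic."""
    try:
        with gzip.open(path, "rb") as handle:
            return handle.read(4) == BAM_MAGIC
    except (OSError, EOFError):
        # Not gzip data at all, or a truncated stream.
        return False


with tempfile.TemporaryDirectory() as workdir:
    # Stand-in file: gzip data whose payload begins with the BAM magic.
    stand_in = os.path.join(workdir, "stand_in.bam")
    with open(stand_in, "wb") as handle:
        handle.write(gzip.compress(BAM_MAGIC + b"not real alignment data"))

    # Ordinary gzip data with an unrelated payload.
    plain_gzip = os.path.join(workdir, "plain.gz")
    with open(plain_gzip, "wb") as handle:
        handle.write(gzip.compress(b"Hey Mom!"))

    stand_in_is_bam = looks_like_bam(stand_in)
    plain_is_bam = looks_like_bam(plain_gzip)

print(stand_in_is_bam, plain_is_bam)
```

Both files are gzip compressed data, but only the first has a payload that starts like BAM.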
 + 
 +==== Text-based Genomics filetypes ==== 
 + 
 +8-) 
 + 
 +As with binary data, text data can have a more specific format. The extensions //.txt, .sh, .bash, .c, .gff, .fasta, .py// are all common file extensions that a genomics user sees on a Linux computer.
 + 
 +It is up to the genomicist and the programs he or she uses to produce/​validate files with the given formats. 
 + 
 +Examples of text formats in genomics. 
 + 
 +^ Filetype ^ description ^ extension ^ format definition ^ 
 +| Fasta | DNA/​RNA/​Protein sequence with header | .fasta, .fa | [[http://​genetics.bwh.harvard.edu/​pph/​FASTA.html|link]] | 
 +| Fastq | DNA/​RNA/​Protein sequence with header + quality information | .fastq, .fq | [[http://​maq.sourceforge.net/​fastq.shtml|link]] | 
 +| GFF   | General Feature Format | .gff | [[https://genome.ucsc.edu/FAQ/FAQformat.html#format3|link]] |
 +| BED   | Browser Extensible Data | .bed | [[https://​genome.ucsc.edu/​FAQ/​FAQformat.html#​format1|link]] | 
 +| SAM   | Sequence Alignment Map | .sam | [[http://​www.htslib.org/​doc/​sam.html|link]] | 
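
As an illustration of what validating a text format can involve, here is a minimal Python 3 sketch for Fasta. It assumes only the basic layout, a header line starting with ''>'' followed by one or more sequence lines; the function name ''parse_fasta'' and the example records are made up, and a real validator would also check which sequence characters are allowed.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from Fasta-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = ">seq1 hypothetical record\nACGT\nTTGA\n>seq2\nGGCC\n"
records = parse_fasta(example)
print(records)
```

Text that does not begin with a header line raises a ''ValueError'', which is the kind of check the file extension alone cannot provide.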
 + 
 + 
wiki/curtain_bin_vs_txt.txt · Last modified: 2018/07/27 16:36 by david