wiki:curtain_bin_vs_txt

Differences

This shows you the differences between two versions of the page.

wiki:curtain_bin_vs_txt [2018/07/14 00:34]
david
wiki:curtain_bin_vs_txt [2018/07/27 16:36] (current)
david [Some other binary types]
Line 1: Line 1:
-How is text binary? Well, programs that read text really read binary, but know to display it as text.  
  
-Let's say that I wanted to know what binary numbers represent the text "Hey Mom!" That representation is called a character encoding and we are using [[https://​en.wikipedia.org/​wiki/​ASCII|ASCII]]. From the table on that page, I know that 72, 101 and 121 respectively encode the letters "​H",​ "​e",​ "​y"​. ​ 
  
-We can use python to verify that:
-<code bash>
-$ python -c 'print chr(72), chr(101), chr(121)'
-H e y
 +====== What is text format versus binary? ======
 +**Isn't EVERYTHING binary?**
 +
 +The answer is in //character encodings//. Everything on a computer is binary, but programs that read text know how to map the numbers stored in a text file to characters.
 + 
 + 
 +=== Even text is binary === 
 + 
 +All things in a modern computer are binary. Each program that reads binary data must know what to do with it. That's true for text viewers, like Microsoft Word, nano, and even the terminal itself. 
 + 
 +Take the following excerpt from a famous play: 
 + 
 +==== What the text viewer sees (binary) ====
 +<​code>​ 
 +01000001 01101100 01100001 01110011 00100001 00100000 01110000 01101111 01101111 01110010 
 +00100000 01011001 01101111 01110010 01101001 01100011 01101011 00101110 00100000 01001001 
 +00100000 01101011 01101110 01100101 01110111 00100000 01101000 01101001 01101101 00101100 
 +00100000 01001000 01101111 01110010 01100001 01110100 01101001 01101111 00111011 00100000 
 +01100001 00100000 01100110 01100101 01101100 01101100 01101111 01110111 00100000 01101111 
 +01100110 00100000 01101001 01101110 01100110 01101001 01101110 01101001 01110100 01100101 
 +00100000 01101010 01100101 01110011 01110100 00101100 00100000 01101111 01100110 00100000 
 +01101101 01101111 01110011 01110100 00100000 01100101 01111000 01100011 01100101 01101100 
 +01101100 01100101 01101110 01110100 00100000 01100110 01100001 01101110 01100011 01111001 
 +00111011 00100000 01101000 01100101 00100000 01101000 01100001 01110100 01101000 00100000 
 +01100010 01101111 01110010 01101110 01100101 00100000 01101101 01100101 00100000 01101111 
 +01101110 00100000 01101000 01101001 01110011 00100000 01100010 01100001 01100011 01101011 
 +00100000 01100001 00100000 01110100 01101000 01101111 01110101 01110011 01100001 01101110 
 +01100100 00100000 01110100 01101001 01101101 01100101 01110011 00111011 00100000 01100001 
 +01101110 01100100 00100000 01101110 01101111 01110111 00101100 00100000 01101000 01101111 
 +01110111 00100000 01100001 01100010 01101000 01101111 01110010 01110010 01100101 01100100 
 +00100000 01101001 01101110 00100000 01101101 01111001 00100000 01101001 01101101 01100001 
 +01100111 01101001 01101110 01100001 01110100 01101001 01101111 01101110 00100000 01101001 
 +01110100 00100000 01101001 01110011 00100001
 </code>
  
-But how is this binary? Numbers like 72 are decimal, that is, they use a base 10 system. The base 2 representation (the binary string) can also be obtained with python, this time using string formatting.
-<code bash>
-$ python -c 'print "{0:b}".format(72)'
-1001000
 +
 +==== What the text viewer sees (decimal) ====
 +<code>
 + 65 108  97 115  33  32 112 111 111 114  32  89 111 114 105  99 107  46  32  73  32 107 110
 +101 119  32 104 105 109  44  32  ​72 111 114  97 116 105 111  59  32  97  32 102 101 108 108 
 +111 119  32 111 102  32 105 110 102 105 110 105 116 101  32 106 101 115 116  44  32 111 102 
 + 32 109 111 115 116  32 101 120  99 101 108 108 101 110 116  32 102  97 110  99 121  59  32 
 +104 101  32 104  97 116 104  32  98 111 114 110 101  32 109 101  32 111 110  32 104 105 115 
 + ​32 ​ 98  97  99 107  32  97  32 116 104 111 117 115  97 110 100  32 116 105 109 101 115  59 
 + ​32 ​ 97 110 100  32 110 111 119  44  32 104 111 119  32  97  98 104 111 114 114 101 100  32 
 +105 110  32 109 121  32 105 109  97 103 105 110  97 116 105 111 110  32 105 116  32 105 115  33
 </code>
  
 +==== What the text viewer displays (ASCII character set) ====
  
-To show you that it's all numbers, I used the hexdump command. It's hard to use, so don't worry about the parameters. Just know that it shows four numbers, followed by "=", followed by the characters they encode.
 +<code>
 +Alas! poor Yorick. I knew him, Horatio; a fellow of infinite jest, of most excel 
 +lent fancy; he hath borne me on his back a thousand times; and now, how abhorred 
 + in my imagination it is! 
 +</​code>​ 
 + 
 +Can you tell what the binary string for "​Yorick"​ is? 
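One way to check an answer is with a short Python 3 sketch. It is only an illustration: the helper names ''to_decimal'', ''to_binary'' and ''from_binary'' are made up for this example, and it assumes the text is plain ASCII, as in the excerpt above.

```python
def to_decimal(text):
    """Return the ASCII code of each character as a list of integers."""
    return list(text.encode("ascii"))


def to_binary(text):
    """Return each ASCII code as an 8-bit binary string."""
    return [format(code, "08b") for code in text.encode("ascii")]


def from_binary(bits):
    """Decode a sequence of 8-bit binary strings back into text."""
    return bytes(int(b, 2) for b in bits).decode("ascii")


if __name__ == "__main__":
    word = "Alas!"
    print(to_decimal(word))                # the decimal view
    print(" ".join(to_binary(word)))       # the binary view
    print(from_binary(to_binary(word)))    # back to the characters
```

Swapping in any other word from the excerpt gives its decimal and binary forms, which can be compared against the two listings above.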
 + 
 +The point of this lesson is that, in a way, what makes a file format is how the program that uses the file interprets it and, beyond that, how a human recognizes it. 
 + 
 +==== Metadata and the file extension ==== 
 +Metadata is a small portion of a set of data (like a document) that describes the data as a whole. On Linux, (non-text) binary files rely upon a form of metadata, a header, for a program to recognize them.
 +The //**file**// command works this way: it reads the file and tries to recognize the header.
  
 <code bash>
-$ echo -ne 'Hey Mom!' | hexdump -v -e '4/1 "%d " " =|"' -e '4/1 "%_p|" "\n"'
-72 101 121 32 =|H|e|y| |
-77 111 109 33 =|M|o|m|!|
 +$ file bedtools-2.25.0.tar.gz
 +bedtools-2.25.0.tar.gz: gzip compressed data, from Unix, last modified: Wed Sep  2 22:42:14 2015
 </code>
  
-I can't find a utility that shows you the binary string, because it's actually kind of useless. But you can show any number as its binary representation with python string formatting.
 +One way to keep track of things is with the file extension.
  
 +It is a way for the user or a program to check to see if the file is the right format. Ultimately, however, the data is still the same regardless of the file extension.
  
-The capital letter 'H' holds the 72nd place in the ASCII character chart. Here is its binary string.
-<code bash>
-$ python -c 'print "{0:b}".format(72)'
-1001000
 +<code>
 +$ mv bedtools-2.25.0.tar.gz theExtDOESNT.matter
 +$ file theExtDOESNT.matter
 +theExtDOESNT.matter: gzip compressed data, from Unix, last modified: Wed Sep  2 22:42:14 2015
 +$ gunzip theExtDOESNT.matter ​ 
 +gzip: theExtDOESNT.matter:​ unknown suffix -- ignored 
 +$ mv theExtDOESNT.matter bedtools-2.25.0.tar.gz 
 +$ gunzip -v bedtools-2.25.0.tar.gz  
 +bedtools-2.25.0.tar.gz:​ 63.2% -- replaced with bedtools-2.25.0.tar
 </code>
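The same idea can be sketched in Python 3. This is a deliberately minimal illustration, not how //file// is implemented: it checks only the two "magic" bytes that begin every gzip stream (0x1f 0x8b), whereas //file// consults a much larger database of signatures. The names ''looks_like_gzip'' and ''demo'' and the file name ''example.txt.gz'' are made up for the example.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_like_gzip(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def demo():
    """Write a small gzip file, rename it, and check it under both names."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "example.txt.gz")
        renamed = os.path.join(tmp, "theExtDOESNT.matter")
        with gzip.open(original, "wb") as handle:
            handle.write(b"Hey Mom!")
        before = looks_like_gzip(original)
        os.rename(original, renamed)  # changes the name only, not the bytes
        after = looks_like_gzip(renamed)
        return before, after


if __name__ == "__main__":
    print(demo())
```

The check looks only at the contents, so renaming the file does not change its result.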
  
-To go the other direction, i.e. show the ASCII character at a given index:
-<code bash>
-$ python -c 'print chr(72), chr(101), chr(121)'
-H e y
 +So the file extension //does// matter, for programs that make use of it. But the data is unchanged.
 +
 +==== Some other binary types ====
 +
 +=== Executable ===
 +<​code>​ 
 +$ file /bin/ls
 +/bin/ls: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=6129e7403942b90574b8c28439d128ff5515efeb, stripped
 </code>
 +
 +You see that the command //ls// is actually a computer program that is a binary executable. ​
 +
 +=== Binary Alignment Map ===
 +<​code>​
 +$ file AR122.bam
 +AR122.bam: gzip compressed data, extra field
 +</​code>​
 +
 +
 +BAM, a genomics data file format ([[https://www.biorxiv.org/content/early/2015/05/29/020024|binary format based on samtools Sequence Alignment Map]]), is recognized as gzip compressed data, but the //file// command doesn't identify the full data type.
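The "extra field" in that output is a hint that a more specific check is possible. The following Python 3 sketch goes beyond what this page states: it assumes the BGZF layout described in the SAM/BAM specification, where each gzip block carries an extra subfield tagged "BC". The function name ''is_bgzf'' is hypothetical, and the check does not prove that a file is a valid BAM, only that its first block looks like BGZF.

```python
import struct


def is_bgzf(header):
    """Return True if `header` starts a gzip block carrying a 'BC' extra subfield.

    Assumes the gzip header layout of RFC 1952 and the BGZF convention
    from the SAM/BAM specification; pass at least the first few dozen bytes.
    """
    if len(header) < 12:
        return False
    if header[0] != 0x1F or header[1] != 0x8B or header[2] != 8:
        return False  # not a gzip/deflate stream
    if not header[3] & 0x04:
        return False  # FEXTRA flag not set, so there is no extra field
    (xlen,) = struct.unpack_from("<H", header, 10)  # length of the extra field
    extra = header[12:12 + xlen]
    pos = 0
    # Walk the subfields: two ID bytes, a 2-byte length, then the payload.
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if (si1, si2) == (ord("B"), ord("C")):
            return True
        pos += 4 + slen
    return False
```

A typical use would be to read the start of a file in binary mode, for example ''is_bgzf(open("AR122.bam", "rb").read(64))'', and treat a True result as "probably BGZF", which is the container that BAM uses.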
 +
 +==== Text-based Genomics filetypes ====
 +
 +8-)
 +
 +As with binary data, text data can have a more specific format. The extensions //.txt, .sh, .bash, .c, .gff, .fasta, .py// are all common ones that a genomics user sees on a Linux computer. 
 +
 +It is up to the genomicist and the programs he or she uses to produce/​validate files with the given formats.
 +
 +Examples of text formats in genomics.
 +
 +^ Filetype ^ description ^ extension ^ format definition ^
 +| Fasta | DNA/​RNA/​Protein sequence with header | .fasta, .fa | [[http://​genetics.bwh.harvard.edu/​pph/​FASTA.html|link]] |
 +| Fastq | DNA/​RNA/​Protein sequence with header + quality information | .fastq, .fq | [[http://​maq.sourceforge.net/​fastq.shtml|link]] |
 +| GFF   | General Feature Format | .gff | [[https://genome.ucsc.edu/FAQ/FAQformat.html#format3|link]] | 
 +| BED   | Browser Extensible Data | .bed | [[https://​genome.ucsc.edu/​FAQ/​FAQformat.html#​format1|link]] |
 +| SAM   | Sequence Alignment Map | .sam | [[http://​www.htslib.org/​doc/​sam.html|link]] |
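As an illustration of what "validating" a text format can mean, here is a minimal Python 3 sketch of a FASTA reader. It assumes only the basic convention that a record starts with a ">" header line followed by one or more sequence lines; the function name ''parse_fasta'' is made up for the example, and real tools handle many more cases (comments, alphabet checks, very large files).

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping header -> sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:]
            if header in records:
                raise ValueError("duplicate header: " + header)
            records[header] = []
        elif header is None:
            # Sequence data with no header means the text is not valid FASTA.
            raise ValueError("sequence data before the first '>' header")
        else:
            records[header].append(line)
    # Join the wrapped sequence lines of each record into one string.
    return {name: "".join(parts) for name, parts in records.items()}


if __name__ == "__main__":
    example = ">seq1 demo\nACGT\nTTGA\n>seq2\nGG\n"
    print(parse_fasta(example))
```

The file extension plays no part here: the parser accepts or rejects the text purely on its contents.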
 +
 +
  
wiki/curtain_bin_vs_txt.1531550065.txt.gz · Last modified: 2018/07/14 00:34 by david