wiki:curtain_bin_vs_txt

Differences

This shows you the differences between two versions of the page.

wiki:curtain_bin_vs_txt [2018/07/14 00:34]
david
wiki:curtain_bin_vs_txt [2018/07/27 16:35]
david [Text-based Genomics filetypes]
Line 1:
-How is text binary? Well, programs that read text really read binary, but know to display it as text.
 
-Let's say that I wanted to know what binary numbers represent the text "Hey Mom!" That representation is called a character encoding, and we are using [[https://en.wikipedia.org/wiki/ASCII|ASCII]]. From the table on that page, I know that 72, 101 and 121 respectively encode the letters "H", "e", "y".
 
-We can use python to verify that:
-<code bash>
-$ python3 -c 'print(chr(72), chr(101), chr(121))'
-H e y
 +====== What is text format versus binary? ======
 +**Isn't EVERYTHING binary?**
 +
 +The answer is in //character encodings//. Everything on a computer is binary, but programs that read text know to map the numbers stored in a text file to certain characters.
 + 
 + 
 +=== Even text is binary === 
 + 
 +All things in a modern computer are binary. Each program that reads binary data must know what to do with it. That's true for text viewers, like Microsoft Word, nano, and even the terminal itself.
 + 
 +Take the following excerpt from a famous play: 
 + 
 +==== What the text viewer sees (binary) ====
 +<code>
 +01000001 01101100 01100001 01110011 00100001 00100000 01110000 01101111 01101111 01110010 
 +00100000 01011001 01101111 01110010 01101001 01100011 01101011 00101110 00100000 01001001 
 +00100000 01101011 01101110 01100101 01110111 00100000 01101000 01101001 01101101 00101100 
 +00100000 01001000 01101111 01110010 01100001 01110100 01101001 01101111 00111011 00100000 
 +01100001 00100000 01100110 01100101 01101100 01101100 01101111 01110111 00100000 01101111 
 +01100110 00100000 01101001 01101110 01100110 01101001 01101110 01101001 01110100 01100101 
 +00100000 01101010 01100101 01110011 01110100 00101100 00100000 01101111 01100110 00100000 
 +01101101 01101111 01110011 01110100 00100000 01100101 01111000 01100011 01100101 01101100 
 +01101100 01100101 01101110 01110100 00100000 01100110 01100001 01101110 01100011 01111001 
 +00111011 00100000 01101000 01100101 00100000 01101000 01100001 01110100 01101000 00100000 
 +01100010 01101111 01110010 01101110 01100101 00100000 01101101 01100101 00100000 01101111 
 +01101110 00100000 01101000 01101001 01110011 00100000 01100010 01100001 01100011 01101011 
 +00100000 01100001 00100000 01110100 01101000 01101111 01110101 01110011 01100001 01101110 
 +01100100 00100000 01110100 01101001 01101101 01100101 01110011 00111011 00100000 01100001 
 +01101110 01100100 00100000 01101110 01101111 01110111 00101100 00100000 01101000 01101111 
 +01110111 00100000 01100001 01100010 01101000 01101111 01110010 01110010 01100101 01100100 
 +00100000 01101001 01101110 00100000 01101101 01111001 00100000 01101001 01101101 01100001 
 +01100111 01101001 01101110 01100001 01110100 01101001 01101111 01101110 00100000 01101001 
 +01110100 00100000 01101001 01110011 00100001
 </code>
 
-But how is this binary? Numbers like 72 are decimal, that is, they use a base 10 system. The base 2 representation (the binary string) can also be obtained with python, this time using string formatting.
-<code bash>
-$ python3 -c 'print("{0:b}".format(72))'
-1001000
 +
 +==== What the text viewer sees (decimal) ====
 +<code>
 + 65 108  97 115  33  32 112 111 111 114  32  89 111 114 105  99 107  46  32  73  32 107 110
 +101 119  32 104 105 109  44  32  72 111 114  97 116 105 111  59  32  97  32 102 101 108 108
 +111 119  32 111 102  32 105 110 102 105 110 105 116 101  32 106 101 115 116  44  32 111 102 
 + 32 109 111 115 116  32 101 120  99 101 108 108 101 110 116  32 102  97 110  99 121  59  32 
 +104 101  32 104  97 116 104  32  98 111 114 110 101  32 109 101  32 111 110  32 104 105 115 
 + 32  98  97  99 107  32  97  32 116 104 111 117 115  97 110 100  32 116 105 109 101 115  59
 + 32  97 110 100  32 110 111 119  44  32 104 111 119  32  97  98 104 111 114 114 101 100  32
 +105 110  32 109 121  32 105 109  97 103 105 110  97 116 105 111 110  32 105 116  32 105 115
 + 33
 </code>
  
 +==== What the text viewer displays (ASCII character set) ====
  
-To show you that it's all numbers, I used the hexdump command. It's hard to use, so don't worry about the parameters. Just know that it shows four numbers, followed by "=", followed by the characters they encode.
 +<code>
 +Alas! poor Yorick. I knew him, Horatio; a fellow of infinite jest, of most excel 
 +lent fancy; he hath borne me on his back a thousand times; and now, how abhorred 
 + in my imagination it is! 
 +</code>
 +
 +Can you tell what the binary string for "Yorick" is?
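One way to check an answer to that question is to let a short script do the mapping. The sketch below assumes the text is plain ASCII, as in the example above; the function names `to_codes`, `to_binary` and `from_binary` are made up for this illustration.

```python
def to_codes(text: str) -> list:
    """Return the ASCII code (a decimal number) of every character."""
    return list(text.encode("ascii"))


def to_binary(text: str) -> str:
    """Return each character as an 8-bit binary string, space-separated."""
    return " ".join(f"{code:08b}" for code in to_codes(text))


def from_binary(bits: str) -> str:
    """Go the other way: turn space-separated 8-bit strings back into text."""
    return bytes(int(chunk, 2) for chunk in bits.split()).decode("ascii")


print(to_codes("Yorick"))   # [89, 111, 114, 105, 99, 107]
print(to_binary("Yorick"))  # 01011001 01101111 01110010 01101001 01100011 01101011
```

The decimal and binary values printed for "Yorick" are the same ones that appear in the two blocks above, starting at the twelfth number.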
 + 
 +The point of this lesson is that what makes a file format is how the program that uses the file recognizes it and, beyond that, how a human recognizes it.
 +
 +==== Metadata and the file extension ====
 +Metadata is a small portion of a set of data (like a document) that describes the data as a whole. On Linux, (non-text) binary files rely on a form of metadata, a header, so that programs can recognize them.
 +The //**file**// command inspects a file and tries to recognize its header.
  
 <code bash>
-$ echo -ne 'Hey Mom!' | hexdump -v -e '4/1 "%d " " =|"' -e '4/1 "%_p|" "\n"'
-72 101 121 32 =|H|e|y| |
-77 111 109 33 =|M|o|m|!|
 +$ file bedtools-2.25.0.tar.gz
 +bedtools-2.25.0.tar.gz: gzip compressed data, from Unix, last modified: Wed Sep  2 22:42:14 2015
 </code>
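Header recognition of this kind can be approximated in a few lines. This is only a rough sketch of the idea, not how the //file// command is actually implemented: it checks two well-known leading byte signatures (gzip and ELF) and otherwise falls back to asking whether the bytes decode as ASCII. The names `MAGIC` and `sniff` are made up for this example.

```python
import gzip

# A few well-known leading byte signatures ("magic numbers").
MAGIC = {
    b"\x1f\x8b": "gzip compressed data",
    b"\x7fELF": "ELF executable or library",
}


def sniff(data: bytes) -> str:
    """Guess a data type from the first bytes, falling back to a text check."""
    for magic, label in MAGIC.items():
        if data.startswith(magic):
            return label
    try:
        data.decode("ascii")
        return "ASCII text"
    except UnicodeDecodeError:
        return "data"


print(sniff(gzip.compress(b"Hey Mom!")))  # gzip compressed data
print(sniff(b"Hey Mom!"))                 # ASCII text
```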
  
-I can't find a utility that shows you the binary string, because it's actually kind of useless. But you can show any number in its binary representation with python string formatting.
 
-The capital letter 'H' has code 72 in the ASCII character chart. Here is its binary string.
-<code bash>
-$ python3 -c 'print("{0:b}".format(72))'
-1001000
 +One way to keep track of things is with the file extension.
 +
 +It is a way for the user or a program to check whether the file is in the right format. Ultimately, however, the data is still the same regardless of the file extension.
 +
 +<code>
 +$ mv bedtools-2.25.0.tar.gz theExtDOESNT.matter
 +$ file theExtDOESNT.matter
 +theExtDOESNT.matter: gzip compressed data, from Unix, last modified: Wed Sep  2 22:42:14 2015
 +$ gunzip theExtDOESNT.matter
 +gzip: theExtDOESNT.matter: unknown suffix -- ignored
 +$ mv theExtDOESNT.matter bedtools-2.25.0.tar.gz
 +$ gunzip -v bedtools-2.25.0.tar.gz
 +bedtools-2.25.0.tar.gz: 63.2% -- replaced with bedtools-2.25.0.tar
 </code>
 
-To go the other direction, i.e. show the ASCII character at a given index:
 +So the file extension //does// matter, for programs that make use of it. But the data is unchanged.
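The same experiment can be repeated from Python as a small sketch. It assumes nothing beyond the standard library; the file names are placeholders, and the payload is a short string rather than the bedtools archive. Python's `gzip` module reads the header rather than the suffix, so it opens the renamed file without complaint.

```python
import gzip
import os
import tempfile

payload = b"Hey Mom!"

with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "note.txt.gz")         # conventional name
    odd = os.path.join(tmp, "theExtDOESNT.matter")  # same bytes, odd name

    with gzip.open(good, "wb") as handle:
        handle.write(payload)
    os.rename(good, odd)  # only the name changes, not the data

    # The first two bytes are still the gzip signature.
    with open(odd, "rb") as handle:
        starts_with_gzip_magic = handle.read(2) == b"\x1f\x8b"

    # Decompression still works, because the reader looks at the content.
    with gzip.open(odd, "rb") as handle:
        recovered = handle.read()

print(starts_with_gzip_magic, recovered == payload)  # True True
```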
-<code bash>
-$ python3 -c 'print(chr(72), chr(101), chr(121))'
-H e y
 +
 +==== Some other binary types ====
 +
 +<code>
 +$ file /bin/ls
 +/bin/ls: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=6129e7403942b90574b8c28439d128ff5515efeb, stripped
 </code>
 +
 +You see that the command //ls// is actually a computer program: a binary executable.
 +
 +
 +<code>
 +$ file AR122.bam
 +AR122.bam: gzip compressed data, extra field
 +</code>
 +
 +
 +A genomics data file format, BAM ([[https://www.biorxiv.org/content/early/2015/05/29/020024|binary format based on samtools Sequence Alignment Map]]), is recognized as gzip compressed data, but the //file// command doesn't know the full data type.
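The "extra field" that //file// reports is a hint a program could follow up on. As far as the SAM/BAM specification describes it, BAM files are stored in BGZF, a gzip variant whose header sets the "extra field" flag and carries a subfield tagged with the letters B and C. The sketch below tests for that pattern; `looks_like_bgzf` is a made-up name, and the header it is run on is built by hand for illustration, not read from a real BAM file.

```python
import struct


def looks_like_bgzf(header: bytes) -> bool:
    """Check whether a gzip header carries the BGZF 'BC' extra subfield."""
    if len(header) < 12 or header[:2] != b"\x1f\x8b":
        return False                      # not gzip at all
    flags = header[3]
    if not flags & 0x04:                  # FEXTRA bit: no "extra field"
        return False
    (xlen,) = struct.unpack("<H", header[10:12])
    extra = header[12:12 + xlen]
    pos = 0
    while pos + 4 <= len(extra):          # walk the extra subfields
        si1, si2, slen = struct.unpack("<BBH", extra[pos:pos + 4])
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return True
        pos += 4 + slen
    return False


# Hand-built 18-byte header (assumption: values chosen only for illustration).
fake_bgzf = (
    struct.pack("<BBBBIBBH", 0x1F, 0x8B, 8, 4, 0, 0, 0xFF, 6)
    + struct.pack("<BBHH", ord("B"), ord("C"), 2, 27)
)
print(looks_like_bgzf(fake_bgzf))  # True
```

A check like this only says that the container is BGZF; telling a BAM file apart from other BGZF files would require decompressing the first block and looking at its contents.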
 +
 +==== Text-based Genomics filetypes ====
 +
 +8-)
 +
 +As with binary data, text data can have a more specific format. The extensions //.txt, .sh, .bash, .c, .gff, .fasta, .py// are all common file extensions that a genomics user sees on a Linux computer.
 +
 +It is up to the genomicist and the programs he or she uses to produce and validate files in the given formats.
 +
 +Examples of text formats in genomics:
 +
 +^ Filetype ^ Description ^ Extension ^ Format definition ^
 +| Fasta | DNA/RNA/Protein sequence with header | .fasta, .fa | [[http://genetics.bwh.harvard.edu/pph/FASTA.html|link]] |
 +| Fastq | DNA/RNA/Protein sequence with header + quality information | .fastq, .fq | [[http://maq.sourceforge.net/fastq.shtml|link]] |
 +| GFF   | General Feature Format | .gff | [[https://genome.ucsc.edu/FAQ/FAQformat.html#format3|link]] |
 +| BED   | Browser Extensible Data | .bed | [[https://genome.ucsc.edu/FAQ/FAQformat.html#format1|link]] |
 +| SAM   | Sequence Alignment Map | .sam | [[http://www.htslib.org/doc/sam.html|link]] |
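Because these formats are plain text, a program recognizes them by their layout rather than by a binary header. As a minimal sketch, here is a reader for the simplest case in the table, Fasta, assuming the common convention that a header line starts with ">" and the sequence follows on one or more lines. `parse_fasta` and the example records are hypothetical; a real project would more likely rely on an established library.

```python
def parse_fasta(text: str) -> dict:
    """Map each record's header (without the '>') to its joined sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                       # skip blank lines
        if line.startswith(">"):
            name = line[1:]                # start a new record
            records[name] = ""
        elif name is None:
            raise ValueError("sequence data before the first '>' header")
        else:
            records[name] += line          # sequences may span several lines
    return records


example = ">seq1 hypothetical example\nACGT\nTTGA\n>seq2\nGGCC\n"
print(parse_fasta(example))
# {'seq1 hypothetical example': 'ACGTTTGA', 'seq2': 'GGCC'}
```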
 +
 +
  
wiki/curtain_bin_vs_txt.txt · Last modified: 2018/07/27 16:36 by david