wiki:curtain_bin_vs_txt

Differences

This shows you the differences between two versions of the page.

wiki:curtain_bin_vs_txt [2018/07/27 02:00]
david [What the text viewer sees]
wiki:curtain_bin_vs_txt [2018/07/27 16:36] (current)
david [Some other binary types]
Line 1: Line 1:
 +
 +
 ====== What is text format versus binary? ====== ====== What is text format versus binary? ======
 **Isn'​t EVERYTHING binary?** **Isn'​t EVERYTHING binary?**
Line 11: Line 13:
 Take the following excerpt from a famous play: Take the following excerpt from a famous play:
  
-__NO_TOC__ 
 ====What the text viewer sees (binary) ==== ====What the text viewer sees (binary) ====
 <​code>​ <​code>​
Line 56: Line 57:
 </​code>​ </​code>​
  
 +Can you tell what the binary string for "​Yorick"​ is?
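One way to check an answer is a short Python sketch. It assumes the text is plain ASCII and that each character is shown as one zero-padded 8-bit group; the helper name ''to_binary_string'' is made up for this illustration.

```python
def to_binary_string(text):
    """Return each ASCII character of text as an 8-bit binary group."""
    # encode() turns the string into bytes; format(..., "08b") renders
    # each byte in base 2, padded on the left with zeros to 8 digits.
    return " ".join(format(byte, "08b") for byte in text.encode("ascii"))


print(to_binary_string("Yorick"))
# 01011001 01101111 01110010 01101001 01100011 01101011
```

The viewer's own output may group or wrap the bits differently, so compare the digits rather than the spacing.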
 +
 +The point of this lesson is that, in practice, a file format is defined by how the program that uses it recognizes it and, beyond that, by how a human recognizes it.
 +
 +==== Metadata and the file extension ====
 +Metadata is a small part of a set of data (like a document) that describes the data as a whole. On Linux, many (non-text) binary files begin with a form of metadata, a header, that lets a program recognize them.
 +The //**file**// command reads the start of a file and tries to recognize that header.
  
-But how is this binary? Numbers like 72 are decimals, that is, they use a base 10 system. The base 2 representation (the binary string) can also be gotten with python, this time using string formatting. ​ 
 <code bash> <code bash>
-python ​-c 'print "{0:b}".format(72)'​ +file bedtools-2.25.0.tar.gz ​ 
-1001000+bedtools-2.25.0.tar.gz:​ gzip compressed data, from Unix, last modified: Wed Sep  2 22:42:14 2015
 </​code>​ </​code>​
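The header check can be imitated in a few lines of Python. This is only a rough sketch of the idea, not how the //file// command is implemented: it tests a single signature, the two bytes ''1f 8b'' that begin a gzip file. The function name, the file names, and the sample text are invented for the example.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip file


def looks_like_gzip(path):
    """Return True if the file at path starts with the gzip signature."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


with tempfile.TemporaryDirectory() as workdir:
    # A real gzip file with no extension at all.
    compressed = os.path.join(workdir, "no_extension_here")
    with gzip.open(compressed, "wb") as handle:
        handle.write(b"sample text")
    gzip_detected = looks_like_gzip(compressed)

    # A plain text file that merely claims to be gzip by its name.
    plain = os.path.join(workdir, "notes.gz")
    with open(plain, "wb") as handle:
        handle.write(b"plain text")
    plain_detected = looks_like_gzip(plain)

print(gzip_detected, plain_detected)  # True False
```

The check looks only at the content, never at the name, which is why the misleading ''.gz'' suffix does not fool it.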
  
 +One way to keep track of things is with the file extension. ​
  
 +It is a way for the user or a program to check to see if the file is the right format. Ultimately, however, the data is still the same regardless of the file extension.
  
-The string of 1's and 0's isn't very useful to humans. But humans have other utilities to work with binary. One of them is "hexdump"; let's use hexdump to decode that character string ''Hey Mom!'' +<code> 
- +$ mv bedtools-2.25.0.tar.gz theExtDOESNT.matter 
-<code bash> +$ file theExtDOESNT.matter  
-echo -ne 'Hey Mom!' | hexdump -C +theExtDOESNT.matter: gzip compressed data, from Unix, last modified: Wed Sep  2 22:42:14 2015 
-00000000 ​ 48 65 79 20 4d 6f 6d 21                           |Hey Mom!| +$ gunzip theExtDOESNT.matter ​ 
-00000008 +gzip: theExtDOESNT.matter: unknown suffix -- ignored
 +$ mv theExtDOESNT.matter bedtools-2.25.0.tar.gz 
 +$ gunzip ​-v bedtools-2.25.0.tar.gz ​ 
 +bedtools-2.25.0.tar.gz:​  63.2% -- replaced with bedtools-2.25.0.tar
 </​code>​ </​code>​
-^ output ^ '​splain ^ 
-|''​00000000''​ | starting position | 
-|''​48 65 79 20 4d 6f 6d 21''​ | hexadecimal numbers corresponding to ''​H e y (sp) M o m !''​ | 
-|''​%%|%%Hey Mom!%%|%%''​| the ASCII encoding| 
-|''​00000008''​ | end position| 
  
-It's easier to read binary numbers as hexadecimal (base 16), so that's how they are displayed by hexdump by default. +So the file extension //does// matter, but only to programs that make use of it. The data itself is unchanged.
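That the data is unchanged by a rename can be checked by hashing the file before and after. The sketch below uses a small placeholder file rather than the real bedtools archive, and the helper ''sha256_of'' is a name chosen for this example.

```python
import hashlib
import os
import tempfile


def sha256_of(path):
    """Return the SHA-256 hex digest of the file's bytes."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()


with tempfile.TemporaryDirectory() as workdir:
    original = os.path.join(workdir, "example.tar.gz")
    renamed = os.path.join(workdir, "theExtDOESNT.matter")

    # Placeholder content; any bytes would do for this demonstration.
    with open(original, "wb") as handle:
        handle.write(b"\x1f\x8b placeholder bytes")

    before = sha256_of(original)
    os.rename(original, renamed)  # same effect as mv: only the name changes
    after = sha256_of(renamed)

print(before == after)  # True
```

Identical digests mean the bytes are identical, so whatever a tool objects to after the rename is the name, not the content.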
  
-Here's a way to show the decimals with the hexdump command: +==== Some other binary types ====
  
-<code bash> +=== Executable === 
-echo -ne 'Hey Mom!' | hexdump -v -e '4/1 "%d " " =|"' -e '4/1 "%_p|" "\n"' +<code> 
-72 101 121 32 =|H|e|y| | +file /bin/ls 
-77 111 109 33 =|M|o|m|!| +/bin/ls: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=6129e7403942b90574b8c28439d128ff5515efeb, stripped
 </​code>​ </​code>​
  
-Now we know how to write "Mom!" with the numbers 77, 111, 109, and 33. Let's use the python built-in function "chr" again, this time to write the full string. +You can see that the //ls// command is itself a computer program, stored as a binary executable.
  
-<code bash> +=== Binary Alignment Map === 
-python -c 'print chr(72), chr(101), chr(121), chr(32)'​ +<​code>​ 
-H e y +file AR122.bam 
-$ python -c 'print chr(77), chr(111), chr(109), chr(33)' +AR122.bam: gzip compressed data, extra field
-M o m ! +
-$ python -c 'print chr(72), chr(101), chr(121), chr(32), chr(77), chr(111), chr(109), chr(33)'​ +
-H e y   M o m !+
 </​code>​ </​code>​
  
-TA-DA!+ 
 +BAM, a genomics file format ([[https://www.biorxiv.org/content/early/2015/05/29/020024|a binary format based on the samtools Sequence Alignment Map]]), is recognized as gzip compressed data with an extra field, but the //file// command does not identify the full data type. 
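The "extra field" in that output is a clue that a program can follow further. As far as the SAM/BAM specification describes it, BAM files are compressed with BGZF, a gzip variant whose header sets the extra-field flag and carries a subfield tagged ''BC''. The sketch below checks for that layout in a header; the function name is invented, the test bytes are built by hand rather than read from a real BAM file, and the field offsets should be verified against the specification before relying on them.

```python
import struct


def is_bgzf_header(header):
    """Return True if header is a gzip header with a BGZF 'BC' subfield."""
    # Fixed part: magic bytes 1f 8b, then compression method 8 (deflate).
    if len(header) < 12 or header[:3] != b"\x1f\x8b\x08":
        return False
    # Bit 2 of the flag byte (FEXTRA) says an extra field follows.
    if not header[3] & 0x04:
        return False
    # Bytes 10-11 hold the little-endian length of the extra field.
    (xlen,) = struct.unpack("<H", header[10:12])
    extra = header[12:12 + xlen]
    # Walk the subfields: 2-byte id, 2-byte length, then the payload.
    pos = 0
    while pos + 4 <= len(extra):
        subfield_id = extra[pos:pos + 2]
        (slen,) = struct.unpack("<H", extra[pos + 2:pos + 4])
        if subfield_id == b"BC" and slen == 2:
            return True
        pos += 4 + slen
    return False


# Hand-built headers for illustration only.
bgzf_like = (b"\x1f\x8b\x08\x04" + b"\x00" * 4 + b"\x00\xff"
             + struct.pack("<H", 6) + b"BC" + struct.pack("<HH", 2, 27))
plain_gzip = b"\x1f\x8b\x08\x00" + b"\x00" * 4 + b"\x00\xff" + b"\x00\x00"

print(is_bgzf_header(bgzf_like), is_bgzf_header(plain_gzip))  # True False
```

A positive result only says the container looks like BGZF; confirming that the file is BAM would also require decompressing the first block and checking its contents.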
 + 
 +==== Text-based Genomics filetypes ==== 
 + 
 +8-) 
 + 
 +As with binary data, text data can have a more specific format. The extensions //.txt, .sh, .bash, .c, .gff, .fasta, .py// are all common ones that a genomics user sees on a Linux computer.  
 + 
 +It is up to the genomicist and the programs he or she uses to produce and validate files in the given formats. 
 + 
 +Examples of text formats in genomics. 
 + 
 +^ Filetype ^ Description ^ Extension ^ Format definition ^ 
 +| Fasta | DNA/​RNA/​Protein sequence with header | .fasta, .fa | [[http://​genetics.bwh.harvard.edu/​pph/​FASTA.html|link]] | 
 +| Fastq | DNA/​RNA/​Protein sequence with header + quality information | .fastq, .fq | [[http://​maq.sourceforge.net/​fastq.shtml|link]] | 
 +| GFF   | General Feature Format | .gff | [[https://genome.ucsc.edu/FAQ/FAQformat.html#format3|link]] |  
 +| BED   | Browser Extensible Data | .bed | [[https://​genome.ucsc.edu/​FAQ/​FAQformat.html#​format1|link]] | 
 +| SAM   | Sequence Alignment Map | .sam | [[http://​www.htslib.org/​doc/​sam.html|link]] | 
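
Because these formats are plain text, a short script can read them. The following is a minimal sketch of a FASTA reader, assuming the common layout of a header line beginning with ''>'' followed by one or more sequence lines. The function name and the sample records are invented, and a real project would more likely use an established library.

```python
def parse_fasta(lines):
    """Return a dict mapping each record name to its joined sequence."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The record name is the first word after the '>' marker.
            words = line[1:].split()
            name = words[0] if words else ""
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before any '>' header")
        else:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


example = """>seq1 a made-up record
ACGT
ACGT
>seq2
GGCC
""".splitlines()

print(parse_fasta(example))  # {'seq1': 'ACGTACGT', 'seq2': 'GGCC'}
```

The parser accepts any text after the header line, so it checks the layout but does not validate that the sequence letters are legal for DNA, RNA, or protein.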
 + 
 + 
wiki/curtain_bin_vs_txt.1532678411.txt.gz · Last modified: 2018/07/27 02:00 by david