This shows you the differences between two versions of the page.

wiki:curtain_bin_vs_txt [2018/07/27 02:00]
wiki:curtain_bin_vs_txt [2018/07/27 16:36] (current)
david [Some other binary types]
Line 57: Line 57:
 </code>
 +Can you tell what the binary string for "Yorick" is?
 +The point of this lesson is that a file format is defined, in practice, by how the program that uses it recognizes it and, beyond that, by how a human recognizes it.
 +==== Metadata and the file extension ====
 +Metadata is a small part of a data set (such as a document) that describes the data as a whole. On Linux, binary (non-text) files typically begin with a form of metadata, a header, that lets a program recognize them.
 +The //**file**// command reads this header and tries to identify the file type from it.
-But how is this binary? Numbers like 72 are decimal, that is, they use the base 10 system. The base 2 representation (the binary string) can also be obtained with Python, this time using string formatting.
 <code bash>
-python -c 'print("{0:b}".format(72))'
-1001000
 +file bedtools-2.25.0.tar.gz
 +bedtools-2.25.0.tar.gz: gzip compressed data, from Unix, last modified: Wed Sep  2 22:42:14 2015
 </code>
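The header check that //file// performs can be approximated in a few lines of Python. This is only a minimal sketch, not how //file// is implemented: the table ''MAGIC_NUMBERS'' and the functions ''sniff_header'' and ''sniff_file'' are hypothetical names, and only two well-known signatures (gzip and ELF) are listed.

```python
# Hypothetical signature table: only two well-known magic numbers are listed.
MAGIC_NUMBERS = {
    b"\x1f\x8b": "gzip compressed data",
    b"\x7fELF": "ELF binary",
}


def sniff_header(first_bytes):
    """Return a coarse type guess based on the leading bytes of a file."""
    for magic, description in MAGIC_NUMBERS.items():
        if first_bytes.startswith(magic):
            return description
    return "unknown (possibly text)"


def sniff_file(path):
    """Read only the first few bytes of a file and classify them."""
    with open(path, "rb") as handle:
        return sniff_header(handle.read(8))
```

Calling ''sniff_file'' on a gzip archive should report gzip compressed data whatever the file is called, because only the leading bytes are inspected.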
 +One way to keep track of file types is with the file extension.
 +The extension lets a user or a program check whether a file appears to be in the expected format. Ultimately, however, the data is the same regardless of the file extension.
-The string of 1's and 0's isn't very useful to humans. But humans have other utilities to work with binary. One of them is "hexdump"; let's use hexdump to decode that character string ''Hey Mom!''
-
-<code bash>
-echo -ne 'Hey Mom!' | hexdump -C
-00000000  48 65 79 20 4d 6f 6d 21                           |Hey Mom!|
-00000008
 +<code>
 +$ mv bedtools-2.25.0.tar.gz theExtDOESNT.matter
 +$ file theExtDOESNT.matter
 +theExtDOESNT.matter: gzip compressed data, from Unix, last modified: Wed Sep  2 22:42:14 2015
 +$ gunzip theExtDOESNT.matter
 +gzip: theExtDOESNT.matter: unknown suffix -- ignored
 +$ mv theExtDOESNT.matter bedtools-2.25.0.tar.gz
 +$ gunzip -v bedtools-2.25.0.tar.gz
 +bedtools-2.25.0.tar.gz:  63.2% -- replaced with bedtools-2.25.0.tar
 </code>
-^ output ^ explanation ^
-|''00000000'' | starting position |
-|''48 65 79 20 4d 6f 6d 21'' | hexadecimal numbers corresponding to ''H e y (sp) M o m !'' |
-|''%%|%%Hey Mom!%%|%%''| the ASCII encoding|
-|''00000008'' | end position|
-It's easier to read binary numbers as hexadecimal (base 16), so that's how they are displayed by hexdump by default.
 +So the file extension //does// matter, but only for programs that make use of it. The data itself is unchanged.
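The same point can be checked from Python: the bytes of a gzip stream are identical whatever the file is called. The sketch below is illustrative only. The function name ''roundtrip_with_odd_extension'' is hypothetical, and the file name simply echoes the renamed archive in the shell session above.

```python
import gzip
import os
import tempfile


def roundtrip_with_odd_extension(payload):
    """Compress payload, store it under a misleading name, and read it back."""
    with tempfile.TemporaryDirectory() as workdir:
        # Hypothetical file name; the extension carries no gzip hint at all.
        path = os.path.join(workdir, "theExtDOESNT.matter")
        with open(path, "wb") as handle:
            handle.write(gzip.compress(payload))
        with open(path, "rb") as handle:
            raw = handle.read()
    # The gzip magic number is still at the start of the data.
    starts_with_gzip_magic = raw[:2] == b"\x1f\x8b"
    return starts_with_gzip_magic, gzip.decompress(raw)


is_gzip, restored = roundtrip_with_odd_extension(b"Hey Mom!")
print(is_gzip, restored)
```

Python's ''gzip'' module, unlike the ''gunzip'' command, does not look at the suffix, so the data decompresses without renaming the file.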
-Here's a way to show the decimals with the hexdump command:
-<code bash>
-echo -ne 'Hey Mom!' | hexdump -v -e '4/1 "%d " " =|"' -e '4/1 "%_p|" "\n"'
-72 101 121 32 =|H|e|y| |
-77 111 109 33 =|M|o|m|!|
 +==== Some other binary types ====
 +===== Executable =====
 +<code>
 +file /bin/ls
 +/bin/ls: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=6129e7403942b90574b8c28439d128ff5515efeb, stripped
 </code>
-Now we know how to write "Mom!" with the numbers 77, 111, 109, and 33. Let's use the Python built-in function "chr" again, this time to write the full string.
 +As you can see, the //ls// command is itself a computer program stored as a binary executable.
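Several of the fields that //file// reports for an executable come straight from the first bytes of the ELF header. The sketch below decodes three of them (word size, byte order, and machine type) under stated assumptions: ''describe_elf'' is a hypothetical helper, only a few machine codes are listed, and the rest of the header (linking, BuildID, and so on) is ignored.

```python
import struct

ELF_CLASSES = {1: "32-bit", 2: "64-bit"}
ELF_BYTE_ORDERS = {1: "LSB", 2: "MSB"}
# Only a few machine codes are listed; others are reported numerically.
ELF_MACHINES = {3: "Intel 80386", 62: "x86-64", 183: "ARM aarch64"}


def describe_elf(header):
    """Summarize the first 20 bytes of an ELF file, or return None."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        return None
    elf_class = ELF_CLASSES.get(header[4], "unknown class")
    byte_order = ELF_BYTE_ORDERS.get(header[5], "unknown byte order")
    # The machine field is a 16-bit integer at offset 18, stored in the
    # byte order that the header itself declares.
    endian = ">" if header[5] == 2 else "<"
    (machine_code,) = struct.unpack_from(endian + "H", header, 18)
    machine = ELF_MACHINES.get(machine_code, "machine %d" % machine_code)
    return "ELF %s %s, %s" % (elf_class, byte_order, machine)
```

On a Linux system, passing the first 20 bytes of ''/bin/ls'' (opened in binary mode) to ''describe_elf'' should give a summary that agrees with the start of the //file// output.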
-<code bash>
-$ python -c 'print(chr(72), chr(101), chr(121), chr(32))'
-H e y
-$ python -c 'print(chr(77), chr(111), chr(109), chr(33))'
-M o m !
-$ python -c 'print(chr(72), chr(101), chr(121), chr(32), chr(77), chr(111), chr(109), chr(33))'
-H e y   M o m !
 +===== Binary Alignment Map =====
 +<code>
 +file AR122.bam
 +AR122.bam: gzip compressed data, extra field
 </code>
 +BAM, a genomics data file format ([[https://content/early/2015/05/29/020024|binary format based on samtools Sequence Alignment Map]]), is recognized as gzip compressed data, but the //file// command does not identify the full data type.
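The "extra field" that //file// mentions is a clue to the more specific type. BAM files are compressed with BGZF, a gzip-compatible block format that records each block's size in a gzip extra subfield labelled ''BC''. The sketch below checks for that subfield. It is a rough illustration rather than a BAM validator, and the function names ''gzip_extra_subfields'' and ''looks_like_bgzf'' are hypothetical.

```python
import struct


def gzip_extra_subfields(data):
    """Return {subfield_id: payload} from the extra field of a gzip header."""
    if len(data) < 12 or data[:2] != b"\x1f\x8b":
        return {}
    flags = data[3]
    if not flags & 0x04:  # FEXTRA bit not set: there is no extra field
        return {}
    # XLEN (little-endian, 2 bytes) follows the fixed 10-byte gzip header.
    (xlen,) = struct.unpack_from("<H", data, 10)
    extra = data[12:12 + xlen]
    subfields = {}
    offset = 0
    while offset + 4 <= len(extra):
        subfield_id = extra[offset:offset + 2]
        (length,) = struct.unpack_from("<H", extra, offset + 2)
        subfields[subfield_id] = extra[offset + 4:offset + 4 + length]
        offset += 4 + length
    return subfields


def looks_like_bgzf(data):
    """True if the gzip header carries the 'BC' subfield used by BGZF."""
    return b"BC" in gzip_extra_subfields(data)
```

A positive result only shows that the file is BGZF-compressed. Other formats (bgzip-compressed VCF, for example) use the same container, so a full check would also have to decompress the first block and look at its contents.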
 +==== Text-based Genomics filetypes ==== 
 +As with binary data, text data can have a more specific format. The extensions //.txt, .sh, .bash, .c, .gff, .fasta, .py// are all common file extensions that a genomics user sees on a Linux computer.
 +It is up to the genomicist, and the programs he or she uses, to produce and validate files in the given formats.
 +Examples of text formats in genomics:
 +^ Filetype ^ description ^ extension ^ format definition ^ 
 +| Fasta | DNA/​RNA/​Protein sequence with header | .fasta, .fa | [[http://​​pph/​FASTA.html|link]] | 
 +| Fastq | DNA/RNA sequence with header + quality information | .fastq, .fq | [[http://fastq.shtml|link]] |
 +| GFF   | General Feature Format | .gff | [[https://FAQ/FAQformat.html#format3|link]] |
 +| BED   | Browser Extensible Data | .bed | [[https://​​FAQ/​FAQformat.html#​format1|link]] | 
 +| SAM   | Sequence Alignment Map | .sam | [[http://​​doc/​sam.html|link]] | 
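Because nothing in a text file enforces its format, a program has to check the structure itself. As a minimal sketch of what such a check might look like for the simplest format in the table, the hypothetical function ''parse_fasta'' below reads FASTA text into (header, sequence) pairs and rejects sequence lines that appear before any ''>'' header. It does not validate the sequence alphabet.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines are skipped
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

Sequence lines that belong to the same record are joined, so a sequence wrapped over several lines comes back as one string.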
wiki/curtain_bin_vs_txt.1532678438.txt.gz · Last modified: 2018/07/27 02:00 by david