What is text format versus binary?

Isn't EVERYTHING binary?

The answer lies in character encodings. Everything on a computer is stored as binary numbers, but programs that read text map the numbers in a text file to characters.

Even text is binary

Everything in a modern computer is stored as binary, and each program that reads binary data must know how to interpret it. That is true even for programs that display text, such as Microsoft Word, nano, and the terminal itself.

Take the following excerpt from a famous play:

What the text viewer sees (binary)

01000001 01101100 01100001 01110011 00100001 00100000 01110000 01101111 01101111 01110010
00100000 01011001 01101111 01110010 01101001 01100011 01101011 00101110 00100000 01001001
00100000 01101011 01101110 01100101 01110111 00100000 01101000 01101001 01101101 00101100
00100000 01001000 01101111 01110010 01100001 01110100 01101001 01101111 00111011 00100000
01100001 00100000 01100110 01100101 01101100 01101100 01101111 01110111 00100000 01101111
01100110 00100000 01101001 01101110 01100110 01101001 01101110 01101001 01110100 01100101
00100000 01101010 01100101 01110011 01110100 00101100 00100000 01101111 01100110 00100000
01101101 01101111 01110011 01110100 00100000 01100101 01111000 01100011 01100101 01101100
01101100 01100101 01101110 01110100 00100000 01100110 01100001 01101110 01100011 01111001
00111011 00100000 01101000 01100101 00100000 01101000 01100001 01110100 01101000 00100000
01100010 01101111 01110010 01101110 01100101 00100000 01101101 01100101 00100000 01101111
01101110 00100000 01101000 01101001 01110011 00100000 01100010 01100001 01100011 01101011
00100000 01100001 00100000 01110100 01101000 01101111 01110101 01110011 01100001 01101110
01100100 00100000 01110100 01101001 01101101 01100101 01110011 00111011 00100000 01100001
01101110 01100100 00100000 01101110 01101111 01110111 00101100 00100000 01101000 01101111
01110111 00100000 01100001 01100010 01101000 01101111 01110010 01110010 01100101 01100100
00100000 01101001 01101110 00100000 01101101 01111001 00100000 01101001 01101101 01100001
01100111 01101001 01101110 01100001 01110100 01101001 01101111 01101110 00100000 01101001
01110100 00100000 01101001 01110011 00100001

What the text viewer sees (decimal)

 65 108  97 115  33  32 112 111 111 114  32  89 111 114 105  99 107  46  32  73  32 107 110
101 119  32 104 105 109  44  32  72 111 114  97 116 105 111  59  32  97  32 102 101 108 108
111 119  32 111 102  32 105 110 102 105 110 105 116 101  32 106 101 115 116  44  32 111 102
 32 109 111 115 116  32 101 120  99 101 108 108 101 110 116  32 102  97 110  99 121  59  32
104 101  32 104  97 116 104  32  98 111 114 110 101  32 109 101  32 111 110  32 104 105 115
 32  98  97  99 107  32  97  32 116 104 111 117 115  97 110 100  32 116 105 109 101 115  59
 32  97 110 100  32 110 111 119  44  32 104 111 119  32  97  98 104 111 114 114 101 100  32
105 110  32 109 121  32 105 109  97 103 105 110  97 116 105 111 110  32 105 116  32 105 115
 33

What the text viewer displays (ASCII character set)

Alas! poor Yorick. I knew him, Horatio; a fellow of infinite jest, of most
excellent fancy; he hath borne me on his back a thousand times; and now, how
abhorred in my imagination it is!

Can you tell what the binary string for “Yorick” is?
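
One way to check your answer is to let a short script do the lookup. The following is a minimal Python sketch, assuming the text is plain ASCII; the helper names to_decimal, to_binary, and from_binary are made up for this illustration.

```python
def to_decimal(text: str) -> list[int]:
    """Return the ASCII code of each character in text."""
    return list(text.encode("ascii"))


def to_binary(text: str) -> str:
    """Return each ASCII code as an 8-bit group, separated by spaces."""
    return " ".join(f"{code:08b}" for code in to_decimal(text))


def from_binary(bits: str) -> str:
    """Turn space-separated 8-bit groups back into ASCII text."""
    return bytes(int(group, 2) for group in bits.split()).decode("ascii")


print(to_decimal("Yorick"))
print(to_binary("Yorick"))
# The first five groups of the binary listing decode back to text.
print(from_binary("01000001 01101100 01100001 01110011 00100001"))
```

Running it prints the decimal codes and the binary string for "Yorick", then decodes the first five bytes of the excerpt back into characters.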

The point of this lesson is that a file format is defined by how the program that uses the file interprets its bytes and, beyond that, by how a human recognizes it.

Metadata and the file extension

Metadata is a small part of a data set (such as a document) that describes the data as a whole. On Linux, non-text (binary) files are usually recognized through one form of metadata: a header at the start of the file. The file command reads that header and tries to match it against the formats it knows.

$ file bedtools-2.25.0.tar.gz 
bedtools-2.25.0.tar.gz: gzip compressed data, from Unix, last modified: Wed Sep  2 22:42:14 2015
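
The idea of recognizing a header can be sketched in a few lines of Python. This is only an illustration, not how the file command is implemented (file consults a much larger database of signatures). The table MAGIC and the function guess_type are hypothetical names; the two signatures are the well-known leading bytes of gzip and ELF files.

```python
import gzip
import os
import tempfile

# Leading bytes ("magic numbers") for two formats shown on this page.
MAGIC = {
    b"\x1f\x8b": "gzip compressed data",
    b"\x7fELF": "ELF executable or library",
}


def guess_type(path: str) -> str:
    """Guess a file type from its first bytes, ignoring the file name."""
    with open(path, "rb") as handle:
        start = handle.read(4)
    for magic, description in MAGIC.items():
        if start.startswith(magic):
            return description
    return "unknown"


# Demo on a throwaway gzip file whose name gives no hint of its format.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "no_extension_here")
    with open(demo_path, "wb") as handle:
        handle.write(gzip.compress(b"ACGT\n"))
    demo_result = guess_type(demo_path)

print(demo_result)
```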

One way to keep track of file types is the file extension.

It lets a user or a program check at a glance whether a file is in the expected format. Ultimately, however, the data stays the same regardless of the file extension.

$ mv bedtools-2.25.0.tar.gz theExtDOESNT.matter
$ file theExtDOESNT.matter 
theExtDOESNT.matter: gzip compressed data, from Unix, last modified: Wed Sep  2 22:42:14 2015
$ gunzip theExtDOESNT.matter 
gzip: theExtDOESNT.matter: unknown suffix -- ignored
$ mv theExtDOESNT.matter bedtools-2.25.0.tar.gz
$ gunzip -v bedtools-2.25.0.tar.gz 
bedtools-2.25.0.tar.gz:	 63.2% -- replaced with bedtools-2.25.0.tar

So the file extension does matter, but only to programs that make use of it, such as gunzip. The data itself is unchanged.
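
The same experiment can be repeated with a program that does not look at the suffix at all. The sketch below uses Python's standard gzip module, which reads the bytes rather than the name; the file names and the one-line payload are invented for the example.

```python
import gzip
import os
import tempfile

payload = b"chr1\t100\t200\n"

with tempfile.TemporaryDirectory() as tmp:
    good_name = os.path.join(tmp, "regions.bed.gz")
    odd_name = os.path.join(tmp, "theExtDOESNT.matter")

    # Write gzip data under a conventional name, then rename the file.
    with gzip.open(good_name, "wb") as handle:
        handle.write(payload)
    os.rename(good_name, odd_name)

    # gzip.open inspects the content, not the suffix, so this still works.
    with gzip.open(odd_name, "rb") as handle:
        recovered = handle.read()

print(recovered == payload)
```

Renaming changed only the directory entry; the compressed bytes, and therefore the decompressed data, are identical.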

Some other binary types

$ file /bin/ls
/bin/ls: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=6129e7403942b90574b8c28439d128ff5515efeb, stripped

As you can see, the ls command is itself a program, stored as a binary executable in the ELF format.

$ file AR122.bam
AR122.bam: gzip compressed data, extra field

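BAM, a genomics file format (the binary counterpart of the Sequence Alignment Map format used by samtools), is recognized as gzip compressed data because BAM files are compressed with BGZF, a gzip-compatible block format. The file command shown here does not identify the full data type.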
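
The "extra field" in that output is where BGZF stores its own marker. The following Python sketch checks for it, assuming the layout given in the SAM/BAM specification: a gzip header with the FEXTRA flag set and a subfield tagged "BC". The function name has_bgzf_header is hypothetical, and the test data is the 28-byte empty block that the specification defines as the end-of-file marker, not a real BAM file.

```python
import gzip

# The 28-byte empty BGZF block that terminates a BAM file (from the SAM/BAM spec).
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 "
    "02 00 1b 00 03 00 00 00 00 00 00 00 00 00"
)


def has_bgzf_header(data: bytes) -> bool:
    """Return True if data starts with a gzip header carrying a 'BC' subfield."""
    if len(data) < 18 or data[:2] != b"\x1f\x8b":
        return False
    flags = data[3]
    if not flags & 0x04:  # FEXTRA bit: the "extra field" that file reports
        return False
    xlen = int.from_bytes(data[10:12], "little")
    extra = data[12:12 + xlen]
    # Walk the subfields: 2-byte ID, 2-byte length, then the payload.
    pos = 0
    while pos + 4 <= len(extra):
        subfield_id = extra[pos:pos + 2]
        subfield_len = int.from_bytes(extra[pos + 2:pos + 4], "little")
        if subfield_id == b"BC" and subfield_len == 2:
            return True
        pos += 4 + subfield_len
    return False


print(has_bgzf_header(BGZF_EOF))
print(has_bgzf_header(gzip.compress(b"ACGT")))
```

An ordinary gzip stream has no extra field, so only the BGZF block passes the check.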

Text-based Genomics filetypes

As with binary data, text data can have a more specific format. The extensions .txt, .sh, .bash, .c, .gff, .fasta, and .py are all common on a Linux computer used for genomics.

It is up to the genomicist and the programs he or she uses to produce and validate files in the given formats.
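
As an illustration of what validating a text format can mean, here is a minimal Python sketch of a FASTA check. It assumes the common convention that each record starts with a ">" header line followed by one or more sequence lines. The function parse_fasta and its rules (no empty or duplicate headers) are choices made for this example, not a formal definition of the format.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into {header: sequence}; raise ValueError if malformed."""
    records: dict[str, str] = {}
    header = None
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            if not header:
                raise ValueError(f"line {line_number}: empty header")
            if header in records:
                raise ValueError(f"line {line_number}: duplicate header {header!r}")
            records[header] = ""
        elif header is None:
            raise ValueError(f"line {line_number}: sequence before any '>' header")
        else:
            # Sequence lines of one record are joined into a single string.
            records[header] += line
    return records


example = ">seq1 demo\nACGT\nACGT\n>seq2\nTTGA\n"
print(parse_fasta(example))
```

A file that starts with sequence data instead of a header raises ValueError, which is the kind of check the file extension alone cannot provide.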

Examples of text formats in genomics.

Filetype  Description                                         Extension       Format definition
Fasta     DNA/RNA/protein sequence with header                .fasta, .fa     link
Fastq     DNA/RNA sequence with header + quality information  .fastq, .fq     link
GFF       General Feature Format                              .gff            link
BED       Browser Extensible Data                             .bed            link
SAM       Sequence Alignment Map                              .sam            link

Last modified: 2018/07/27 16:35 by david