# Computational biology at CSU

## DSCI 512: RNA-seq

# What is text format versus binary?

Isn't EVERYTHING binary?

The answer is in character encodings. Everything on a computer is binary, but programs that read text know how to map the numbers stored in a text file to characters.

#### Even text is binary

All things in a modern computer are binary. Each program that reads binary data must know what to do with it. That's true for programs that display text, like Microsoft Word, nano, and even the terminal itself.
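As a minimal sketch of this idea in Python (the variable names are illustrative and not part of the course material), the same four bytes can be read as text or as a number, depending on what the reading program decides to do with them:

```python
# Four bytes, written here as hexadecimal values.
raw = bytes([0x41, 0x6C, 0x61, 0x73])

# A text viewer maps each byte to a character using an encoding such as ASCII.
as_text = raw.decode("ascii")

# A different program might treat the same bytes as one big-endian integer.
as_number = int.from_bytes(raw, byteorder="big")

print(as_text)    # Alas
print(as_number)  # 1097621875
```

Neither reading is more "correct" than the other; the bytes only become text because the program chooses to interpret them that way.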

Take the following excerpt from a famous play:

### What the text viewer sees (binary)

01000001 01101100 01100001 01110011 00100001 00100000 01110000 01101111 01101111 01110010
00100000 01011001 01101111 01110010 01101001 01100011 01101011 00101110 00100000 01001001
00100000 01101011 01101110 01100101 01110111 00100000 01101000 01101001 01101101 00101100
00100000 01001000 01101111 01110010 01100001 01110100 01101001 01101111 00111011 00100000
01100001 00100000 01100110 01100101 01101100 01101100 01101111 01110111 00100000 01101111
01100110 00100000 01101001 01101110 01100110 01101001 01101110 01101001 01110100 01100101
00100000 01101010 01100101 01110011 01110100 00101100 00100000 01101111 01100110 00100000
01101101 01101111 01110011 01110100 00100000 01100101 01111000 01100011 01100101 01101100
01101100 01100101 01101110 01110100 00100000 01100110 01100001 01101110 01100011 01111001
00111011 00100000 01101000 01100101 00100000 01101000 01100001 01110100 01101000 00100000
01100010 01101111 01110010 01101110 01100101 00100000 01101101 01100101 00100000 01101111
01101110 00100000 01101000 01101001 01110011 00100000 01100010 01100001 01100011 01101011
00100000 01100001 00100000 01110100 01101000 01101111 01110101 01110011 01100001 01101110
01100100 00100000 01110100 01101001 01101101 01100101 01110011 00111011 00100000 01100001
01101110 01100100 00100000 01101110 01101111 01110111 00101100 00100000 01101000 01101111
01110111 00100000 01100001 01100010 01101000 01101111 01110010 01110010 01100101 01100100
00100000 01101001 01101110 00100000 01101101 01111001 00100000 01101001 01101101 01100001
01100111 01101001 01101110 01100001 01110100 01101001 01101111 01101110 00100000 01101001
01110100 00100000 01101001 01110011 00100001

### What the text viewer sees (decimal)

65 108  97 115  33  32 112 111 111 114  32  89 111 114 105  99 107  46  32  73  32 107 110
101 119  32 104 105 109  44  32  72 111 114  97 116 105 111  59  32  97  32 102 101 108 108
111 119  32 111 102  32 105 110 102 105 110 105 116 101  32 106 101 115 116  44  32 111 102
32 109 111 115 116  32 101 120  99 101 108 108 101 110 116  32 102  97 110  99 121  59  32
104 101  32 104  97 116 104  32  98 111 114 110 101  32 109 101  32 111 110  32 104 105 115
32  98  97  99 107  32  97  32 116 104 111 117 115  97 110 100  32 116 105 109 101 115  59
32  97 110 100  32 110 111 119  44  32 104 111 119  32  97  98 104 111 114 114 101 100  32
105 110  32 109 121  32 105 109  97 103 105 110  97 116 105 111 110  32 105 116  32 105 115
33

### What the text viewer displays (ASCII character set)

Alas! poor Yorick. I knew him, Horatio; a fellow of infinite jest, of most
excellent fancy; he hath borne me on his back a thousand times; and now, how
abhorred in my imagination it is!
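The three views above can be reproduced with a short Python sketch. It takes the first ten bytes of the binary listing, converts each 8-bit group to a decimal number, and then maps each number to its ASCII character. The function name `decode_binary_text` is made up for this illustration.

```python
def decode_binary_text(bits: str) -> tuple[list[int], str]:
    """Convert space-separated 8-bit groups to decimal values and ASCII text."""
    # int(group, 2) reads a string of 0s and 1s as a base-2 number.
    values = [int(group, 2) for group in bits.split()]
    # bytes(values) packs the numbers into raw bytes; decode maps them to characters.
    text = bytes(values).decode("ascii")
    return values, text


first_bytes = (
    "01000001 01101100 01100001 01110011 00100001 "
    "00100000 01110000 01101111 01101111 01110010"
)
values, text = decode_binary_text(first_bytes)
print(values)  # [65, 108, 97, 115, 33, 32, 112, 111, 111, 114]
print(text)    # Alas! poor
```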

Can you tell what the binary string for “Yorick” is?
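One way to check an answer is to go in the other direction, from characters to bits. This sketch uses a hypothetical helper, `to_binary`, and assumes the word is plain ASCII:

```python
def to_binary(word: str) -> str:
    """Return the 8-bit binary code of each ASCII character, separated by spaces."""
    # encode() turns the characters into bytes; format(..., "08b") pads each to 8 bits.
    return " ".join(format(byte, "08b") for byte in word.encode("ascii"))


print(to_binary("Yorick"))
```

The printed groups can be compared against the second line of the binary listing.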

The point of this lesson is that, in a way, a file format is defined by how the program that uses the file interprets its bytes, and beyond that, by how the human reading it recognizes it.

One way to keep track of formats is the file extension. .txt, .sh, .bash, .c, .gff, .fasta, and .py are all common file extensions that a genomics user sees on a Linux computer, and they are all text files. The only thing that distinguishes them by name is the extension. Labelling a file with the wrong extension leads to confusion for the user down the line.
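Because an extension is only part of the name, a program that wants to know whether a file is text has to look at the bytes. The following is a rough heuristic sketch, not a standard tool: the helper name `looks_like_text`, the file names, and the 1024-byte sample size are assumptions made for this example. It treats a file as text if the sample contains no NUL byte and decodes as UTF-8 (which includes plain ASCII).

```python
import codecs
import os
import tempfile


def looks_like_text(path: str, sample_size: int = 1024) -> bool:
    """Heuristic: True if the first bytes of the file decode cleanly as UTF-8."""
    with open(path, "rb") as handle:
        sample = handle.read(sample_size)
    if b"\x00" in sample:
        return False
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates a multi-byte character cut off at the sample boundary.
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True


with tempfile.TemporaryDirectory() as folder:
    fasta_path = os.path.join(folder, "reads.dat")      # text content, unhelpful extension
    blob_path = os.path.join(folder, "mislabeled.txt")  # binary content, .txt extension
    with open(fasta_path, "w", encoding="ascii") as handle:
        handle.write(">seq1\nACGTACGT\n")
    with open(blob_path, "wb") as handle:
        handle.write(bytes([0x1F, 0x8B, 0x08, 0x00, 0xFF]))
    fasta_is_text = looks_like_text(fasta_path)
    blob_is_text = looks_like_text(blob_path)

print(fasta_is_text)  # True
print(blob_is_text)   # False
```

The result depends only on the contents, so renaming either file would not change it.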

Examples of text formats in genomics.

Non-text binary files can be data or programs. Data files usually have a file extension, such as .bam or .gz, whereas binary programs usually have no extension (on Linux) but have the executable permission bit set.
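Both clues can be checked from Python. The sketch below is illustrative only: the helper names and file names are made up, the executable test looks only at the owner's execute bit, and the gzip test relies on the fact that gzip files begin with the two bytes 0x1f 0x8b. A small shell script stands in for a program here, since the permission bit works the same way for any executable file.

```python
import gzip
import os
import stat
import tempfile


def is_executable(path: str) -> bool:
    """True if the owner's execute permission bit is set on the file."""
    return bool(os.stat(path).st_mode & stat.S_IXUSR)


def is_gzip(path: str) -> bool:
    """True if the file starts with the gzip magic number 0x1f 0x8b."""
    with open(path, "rb") as handle:
        return handle.read(2) == b"\x1f\x8b"


with tempfile.TemporaryDirectory() as folder:
    # A compressed data file: binary content, named with an extension.
    data_path = os.path.join(folder, "reads.fasta.gz")
    with gzip.open(data_path, "wt", encoding="ascii") as handle:
        handle.write(">seq1\nACGT\n")
    os.chmod(data_path, 0o644)  # readable and writable, not executable

    # A stand-in program: no extension, executable bit set.
    script_path = os.path.join(folder, "run_me")
    with open(script_path, "w", encoding="ascii") as handle:
        handle.write("#!/bin/sh\necho hello\n")
    os.chmod(script_path, 0o755)  # same effect as chmod +x on a new file

    data_is_gzip = is_gzip(data_path)
    data_is_exec = is_executable(data_path)
    script_is_exec = is_executable(script_path)

print(data_is_gzip, data_is_exec, script_is_exec)  # True False True
```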