The following commands will help you to extract information from files.
cat – concatenate. concatenate files together
grep – regular expressions. search for a specific pattern within a file
cut – cut. pull out a specific column (or any other delimited information) from a file
Group Exercise: Let's make some simple files to play with. Copy the following text into a file and name it chr_sizes.txt (the two columns are separated by a tab).
# An example file containing four Saccharomyces cerevisiae chromosomes and their lengths. sacCer3.
chrI	230218
chrII	813184
chrIII	316620
chrM	85779
The cat command reads one or more files and prints their contents to the screen. The output can also be redirected to a file, and in this way we can join files together.
cat <file.txt> …
The … means you can keep adding more and more files. Typically, we join two different files together, but for the purpose of example, let's duplicate the contents of our file using cat.
$ cat chr_sizes.txt chr_sizes.txt
$ cat chr_sizes.txt chr_sizes.txt > double_sizes.txt
$ more double_sizes.txt
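To check that the redirect did what we expect, one rough sketch is to rebuild the example file in a scratch directory and compare line counts. The scratch directory (made with mktemp) and the variable names are assumptions for illustration only. The file contents are the example file from this lesson.

```shell
# Work in a throwaway directory so nothing in your own folders is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the example file; printf writes a real tab wherever \t appears.
printf '# An example file containing four Saccharomyces cerevisiae chromosomes and their lengths. sacCer3.\n' > chr_sizes.txt
printf 'chrI\t230218\nchrII\t813184\nchrIII\t316620\nchrM\t85779\n' >> chr_sizes.txt

# Join the file to itself and save the result instead of printing it.
cat chr_sizes.txt chr_sizes.txt > double_sizes.txt

# Count the lines in each file; the joined file should have twice as many.
single_lines=$(wc -l < chr_sizes.txt)
double_lines=$(wc -l < double_sizes.txt)
echo "chr_sizes.txt: $single_lines lines"
echo "double_sizes.txt: $double_lines lines"
```

Note that > overwrites double_sizes.txt if it already exists, which is why the sketch runs in its own directory.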
A regular expression, often shortened to regex, describes a sequence of characters for which you want to search. Regular expressions are very powerful, and the expressions themselves can quickly become complex, with lots of wildcards and wiggle room for variations on the searched pattern. For this lesson, we'll focus on simple letter and number combinations, so we can think of grep here as simple pattern searching and matching.
grep [options] <pattern> <file> …
Let's say we want to know how long the mitochondrial genome is in yeast:
$ grep 'chrM' chr_sizes.txt
Individual Exercises: Try executing the following to get a sense of what grep does and does not do. To learn more about these options, read the grep man page.
$ grep -n 'chrM' chr_sizes.txt
$ grep -v 'chrM' chr_sizes.txt
$ grep -v '#' chr_sizes.txt
$ grep 'chr' chr_sizes.txt
$ grep '^chr' chr_sizes.txt
$ grep 'chrII' chr_sizes.txt
$ grep -w 'chrII' chr_sizes.txt
Common pitfall: Did you notice how searching for chr gave you both the chromosomes listed in columns as well as the word chromosome in the header? Also, searching for chrII returned both chrII and chrIII, because chrII is contained within chrIII. This is something to look out for with grep. We'll cover more advanced ways to restrict your regular expressions in later lessons.
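The pitfall can be made concrete by counting matches instead of reading them by eye. The sketch below uses grep's -c option (print the number of matching lines), which is not otherwise covered in this lesson. The scratch directory and variable names are assumptions for illustration.

```shell
# Rebuild the example file in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '# An example file containing four Saccharomyces cerevisiae chromosomes and their lengths. sacCer3.\n' > chr_sizes.txt
printf 'chrI\t230218\nchrII\t813184\nchrIII\t316620\nchrM\t85779\n' >> chr_sizes.txt

# -c prints how many lines match rather than the lines themselves.
any_chr=$(grep -c 'chr' chr_sizes.txt)          # the header matches too ("chromosomes")
anchored_chr=$(grep -c '^chr' chr_sizes.txt)    # ^ requires chr at the start of the line
loose_chrII=$(grep -c 'chrII' chr_sizes.txt)    # chrII is also a substring of chrIII
word_chrII=$(grep -c -w 'chrII' chr_sizes.txt)  # -w requires a whole-word match

echo "chr: $any_chr, ^chr: $anchored_chr, chrII: $loose_chrII, -w chrII: $word_chrII"
```

On the example file this should report 5, 4, 2, and 1 matching lines, in that order: the unanchored patterns pick up the header and chrIII, while ^ and -w narrow the match.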
Quick tip: As long as you use quotes around your search pattern, you can include a space in it.
cut is a command that can be used for slicing and dicing information out of delimited files. We'll just use the most basic feature of cut which, by default, pulls out specific columns from tab-delimited files. There are ways to change this so that it splits on other delimiters, but today we'll just stick with the default operation.
column extraction usage
cut -f <number> <file.txt>
Group Exercise: Let's try to just extract out some columns using cut.
By default, cut works by splitting each line of a file into tab-delimited columns. The tab is called the delimiting character, and each column is called a field.
$ grep -v "#" chr_sizes.txt > chr_sizes_table.txt
$ cut -f 1 chr_sizes_table.txt
$ cut -f 2 chr_sizes_table.txt
$ cut -f 1,2 chr_sizes_table.txt
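One reason to strip the header with grep -v before cutting: cut prints any line that contains no delimiter unchanged, so the comment line would otherwise land in the extracted column. The sketch below compares the two cases. The output file names with_header.txt and lengths.txt, the scratch directory, and the use of head -n 1 to look at the first line are assumptions for illustration.

```shell
# Rebuild the example file in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '# An example file containing four Saccharomyces cerevisiae chromosomes and their lengths. sacCer3.\n' > chr_sizes.txt
printf 'chrI\t230218\nchrII\t813184\nchrIII\t316620\nchrM\t85779\n' >> chr_sizes.txt

# Cutting the raw file: the header has no tab, so cut passes it through whole.
cut -f 2 chr_sizes.txt > with_header.txt

# Removing the comment line first gives a clean column of lengths.
grep -v '#' chr_sizes.txt > chr_sizes_table.txt
cut -f 2 chr_sizes_table.txt > lengths.txt

# Look at the first line of each result.
first_raw=$(head -n 1 with_header.txt)
first_clean=$(head -n 1 lengths.txt)
echo "first line without filtering: $first_raw"
echo "first line after grep -v:     $first_clean"
```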
Common pitfall: The cut utility counts like so: 1, 2, 3, 4. However, not all computing languages start on 1. Many start on 0 and count like so: 0, 1, 2, 3. It is a good idea to double check your language by testing it every time.
Common pitfall: By default, cut splits on tabs and separates its output fields with tabs as well. To change the delimiter, use the -d option:
$ cut -d "," -f 2 file.txt    # set the delimiting character to a comma, then cut out the second field
$ cut -d " " -f 2 file.txt    # set the delimiting character to a space, then cut out the second field
$ cut -d $'\t' -f 2 file.txt  # set the delimiting character to a tab (the default), then cut out the second field; $'\t' is bash syntax for a literal tab, whereas "\t" is a backslash followed by t
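A small sketch of -d in action follows. The files sizes.csv and one_row.txt are hypothetical, made up here only so there is something to cut. Storing a tab in a variable with printf is one way to pass a literal tab to -d without relying on shell-specific quoting.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A hypothetical comma-separated version of two rows of the chromosome table.
printf 'chrI,230218\nchrII,813184\n' > sizes.csv

# -d sets the delimiter; -f still selects the field.
second_field=$(cut -d ',' -f 2 sizes.csv)
echo "$second_field"

# Store a literal tab in a variable, then hand it to -d explicitly.
tab=$(printf '\t')
printf 'chrM\t85779\n' > one_row.txt
explicit_tab=$(cut -d "$tab" -f 2 one_row.txt)
echo "$explicit_tab"
```

Because tab is already the default delimiter, the last cut gives the same result as plain cut -f 2 one_row.txt; spelling it out is only useful for confirming what the default is.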
cat command to concatenate ALL the chromosomes together into the file
# A tester gff file.
# For testing pipes.
chrV	sacCer3_ensGene	CDS	574807	575379	0.000000	-	0	gene_id "YER190C-A"; transcript_id "YER190C-A";
chrII	sacCer3_ensGene	CDS	805038	805256	0.000000	-	0	gene_id "YBR298C-A"; transcript_id "YBR298C-A";
chrV	sacCer3_ensGene	start_codon	575377	575379	0.000000	-	.	gene_id "YER190C-A"; transcript_id "YER190C-A";
chrII	sacCer3_ensGene	start_codon	805254	805256	0.000000	-	.	gene_id "YBR298C-A"; transcript_id "YBR298C-A";
chrII	sacCer3_ensGene	exon	805035	805256	0.000000	-	.	gene_id "YBR298C-A"; transcript_id "YBR298C-A";
chrIII	sacCer3_ensGene	exon	309070	310155	0.000000	+	.	gene_id "YCR105W"; transcript_id "YCR105W";
CHRII	sacCer3_ensGene	start_codon	805351	805353	0.000000	+	.	gene_id "YBR299W"; transcript_id "YBR299W";
CHRIII	sacCer3_ensGene	start_codon	310958	310960	0.000000	+	.	gene_id "YCR106W"; transcript_id "YCR106W";
chrV	sacCer3_ensGene	exon	574804	575379	0.000000	-	.	gene_id "YER190C-A"; transcript_id "YER190C-A";
chrV	sacCer3_ensGene	stop_codon	575680	575682	0.000000	-	.	gene_id "YER190C-B"; transcript_id "YER190C-B";
mini.gff that are on chrII into a new file called
chromosome<tab>start<tab>end<tab>strand. Make a .bed file out of