
WORKING WITH FILES II

The following commands will help you to extract information from files.

cat – concatenate. Join files together and print the result
grep – global regular expression print. Search for a specific pattern within a file
cut – cut. Pull out a specific column (or any other delimited information) from a file

Let's make a file

:!: Group Exercise: Let's make some simple files to play with. Copy the following text into a file and name it chr_sizes.txt.

# An example file containing four Saccharomyces cerevisiae chromosomes and their lengths. sacCer3.
chrI	230218
chrII	813184
chrIII	316620
chrM	85779

Concatenating files with cat

The cat command reads one or more files and prints their contents to the screen, one file after another. The output can also be redirected to a file, and in this way we can join files together.

concatenate usage:
cat <file.txt> …

Typically, we join two different files together. For the purpose of this example, let's duplicate the contents of our file using cat. The … in the usage line means you can keep adding more and more files.

$ cat chr_sizes.txt chr_sizes.txt
$ cat chr_sizes.txt chr_sizes.txt > double_sizes.txt
$ more double_sizes.txt
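
As a quick check that the redirect really joined the files, here is a minimal sketch. It rebuilds chr_sizes.txt in a scratch directory (the mktemp directory is an assumption, used only so the example does not overwrite your own files), concatenates the file with itself, and counts the lines in each file with wc -l.

```shell
# Work in a throwaway directory so nothing in your own folder is overwritten.
workdir=$(mktemp -d)
cd "$workdir"

# Recreate the example file; printf turns each \t into a real tab.
printf '# An example file containing four Saccharomyces cerevisiae chromosomes and their lengths. sacCer3.\n' > chr_sizes.txt
printf 'chrI\t230218\nchrII\t813184\nchrIII\t316620\nchrM\t85779\n' >> chr_sizes.txt

# Join the file to itself and save the result.
cat chr_sizes.txt chr_sizes.txt > double_sizes.txt

# Count the lines in each file; the doubled file should have twice as many.
single=$(wc -l < chr_sizes.txt)
double=$(wc -l < double_sizes.txt)
echo "chr_sizes.txt has $single lines, double_sizes.txt has $double lines"
```

The original file has 5 lines (one comment plus four chromosomes), so the doubled file should report 10.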

Searching for patterns using grep

In computing, a regular expression describes a sequence of characters for which you want to search. The term is often shortened to regex. Regular expressions are very powerful, and the expressions themselves can quickly become complex, with wildcards and other constructs that allow for variations on the searched pattern. For this lesson, we'll focus on simple letter and number combinations, so we can think of it as simple pattern searching and matching.

grep usage
grep [options] <pattern> <file> …

  • There are many options for grep.
  • Typically, the pattern given to search is enclosed in quotes.
  • grep can search multiple files
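
To illustrate the last point, here is a small sketch that searches two files at once. The file names sizes_a.txt and sizes_b.txt are made up for this example. When grep is given more than one file, it prefixes each matching line with the name of the file it came from.

```shell
# Work in a throwaway directory so nothing in your own folder is overwritten.
workdir=$(mktemp -d)
cd "$workdir"

# Two small hypothetical files, each with a tab-separated name and length.
printf 'chrI\t230218\nchrM\t85779\n' > sizes_a.txt
printf 'chrII\t813184\nchrM\t85779\n' > sizes_b.txt

# Search both files for the same pattern in one command.
matches=$(grep 'chrM' sizes_a.txt sizes_b.txt)
echo "$matches"
```

Each of the two output lines starts with its file name followed by a colon, for example sizes_a.txt:chrM.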

Let's say we want to know how long the mitochondrial genome is in yeast:

$ grep 'chrM' chr_sizes.txt

:!: Individual Exercises: Try executing the following to get a sense of what grep does and does not do. To learn more about these options, read the grep man page.

$ grep -n 'chrM' chr_sizes.txt
$ grep -v 'chrM' chr_sizes.txt
$ grep -v '#' chr_sizes.txt
$ grep 'chr' chr_sizes.txt
$ grep '^chr' chr_sizes.txt 
$ grep 'chrII' chr_sizes.txt
$ grep -w 'chrII' chr_sizes.txt

:!: Common pitfall: Did you notice how searching for chr gave you both the chromosomes listed in columns as well as the word chromosome in the header? Also, chrII returned both chrII and chrIII. This is something to look out for with grep. We'll cover more advanced ways to restrict your regular expressions in later lessons.
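
One way to see the difference between these searches is to count the matching lines instead of printing them. The sketch below uses grep's -c option, which is not covered in this lesson, to report how many lines each pattern matches. The scratch directory is an assumption so the example does not touch your own files.

```shell
# Work in a throwaway directory and recreate the example file.
workdir=$(mktemp -d)
cd "$workdir"
printf '# An example file containing four Saccharomyces cerevisiae chromosomes and their lengths. sacCer3.\n' > chr_sizes.txt
printf 'chrI\t230218\nchrII\t813184\nchrIII\t316620\nchrM\t85779\n' >> chr_sizes.txt

# -c prints the number of matching lines rather than the lines themselves.
loose=$(grep -c 'chr' chr_sizes.txt)        # matches "chromosomes" in the comment too
anchored=$(grep -c '^chr' chr_sizes.txt)    # only lines that start with chr
partial=$(grep -c 'chrII' chr_sizes.txt)    # chrII and chrIII both contain chrII
whole=$(grep -c -w 'chrII' chr_sizes.txt)   # -w requires chrII to be a whole word

echo "chr: $loose, ^chr: $anchored, chrII: $partial, -w chrII: $whole"
```

This prints chr: 5, ^chr: 4, chrII: 2, -w chrII: 1, which shows how the ^ anchor and the -w option each narrow the match.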

;-) Quick tip: As long as you use quotes around your search pattern, you can include a space in it.
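
A short sketch of the tip above, searching for a two-word pattern in the comment line of chr_sizes.txt. The file is recreated in a scratch directory (an assumption, to keep the example self-contained).

```shell
# Work in a throwaway directory and recreate the example file.
workdir=$(mktemp -d)
cd "$workdir"
printf '# An example file containing four Saccharomyces cerevisiae chromosomes and their lengths. sacCer3.\n' > chr_sizes.txt
printf 'chrI\t230218\nchrII\t813184\nchrIII\t316620\nchrM\t85779\n' >> chr_sizes.txt

# The quotes keep "example file" together as a single pattern.
hit=$(grep 'example file' chr_sizes.txt)
echo "$hit"

# Without the quotes, grep would treat "example" as the pattern and
# "file" as the name of another file to search:
#   grep example file chr_sizes.txt
```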


Extracting columns with cut

cut is a command for slicing and dicing information out of delimited files. We'll just use its most basic feature, which pulls out specific columns from a file and, by default, expects those columns to be separated by tabs. There are ways to make it split on other delimiters, but today we'll mostly stick with the default operation.

column extraction usage
cut -f <number> <file.txt>

:!: Group Exercise: Let's try to extract some columns using cut. By default, cut splits each line of a file into tab-delimited columns. The “tab” is called the delimiting character, and each column is called a field.

  • Let's make a test file that has two fields by removing the first line from chr_sizes.txt.
  • Next, we can extract the first column or second column using cut:
$ grep -v "#" chr_sizes.txt > chr_sizes_table.txt
$ cut -f 1 chr_sizes_table.txt
$ cut -f 2 chr_sizes_table.txt
$ cut -f 1,2 chr_sizes_table.txt
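
For reference, here is a self-contained sketch of the same steps that you can run to compare against your own output. It recreates chr_sizes.txt in a scratch directory (an assumption, so your own files are left alone) and stores each extracted column in a variable before printing it.

```shell
# Work in a throwaway directory and recreate the example file.
workdir=$(mktemp -d)
cd "$workdir"
printf '# An example file containing four Saccharomyces cerevisiae chromosomes and their lengths. sacCer3.\n' > chr_sizes.txt
printf 'chrI\t230218\nchrII\t813184\nchrIII\t316620\nchrM\t85779\n' >> chr_sizes.txt

# Drop the comment line so every remaining line has exactly two fields.
grep -v '#' chr_sizes.txt > chr_sizes_table.txt

# Field 1 holds the chromosome names, field 2 holds the lengths.
names=$(cut -f 1 chr_sizes_table.txt)
lengths=$(cut -f 2 chr_sizes_table.txt)

echo "$names"
echo "$lengths"
```

The first echo lists chrI through chrM, one per line, and the second lists the four lengths in the same order.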

:!: Common pitfall: The cut utility counts like so: 1, 2, 3, 4. However, not all computing languages start on 1. Many start on 0 and count like so: 0, 1, 2, 3. It is a good idea to double check your language by testing it every time.

:!: Common pitfall: cut splits on tabs by default and separates the fields in its output with that same delimiter. To split on a different character, use the -d option:

$ cut -d "," -f 2 file.txt #set the delimiting character to a comma, and then cut out the second field.
$ cut -d " " -f 2 file.txt #set the delimiting character to a space, then cut out the second field.
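$ cut -d $'\t' -f 2 file.txt #set the delimiting character to a tab, then cut out the second field. (default) In bash, write the tab as $'\t'; a plain "\t" is two characters, and cut accepts only a single-character delimiter.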
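
Here is a minimal sketch of the -d option on a comma-separated file. The file chr_sizes.csv and its third column are invented for this example. Note that when more than one field is selected, the output fields are separated by the same comma that was used to split the input.

```shell
# Work in a throwaway directory so nothing in your own folder is overwritten.
workdir=$(mktemp -d)
cd "$workdir"

# A hypothetical comma-separated table: name, length, genome type.
printf 'chrI,230218,nuclear\nchrM,85779,mitochondrial\n' > chr_sizes.csv

# Split on commas and take the second field.
second=$(cut -d ',' -f 2 chr_sizes.csv)

# Take the first and third fields; the output keeps the comma delimiter.
pair=$(cut -d ',' -f 1,3 chr_sizes.csv)

echo "$second"
echo "$pair"
```

The first echo prints the two lengths, and the second prints chrI,nuclear and chrM,mitochondrial.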

:!: Exercises:

  • cat practice: Go to the directory where you downloaded the individual chromosomes of the yeast genome. Use a cat command to concatenate ALL the chromosomes together into the file sacCer3_genome.fasta.
  • Let's make a test file. Copy and paste the text below into a file called mini.gff
# A tester gff file.                                
# For testing pipes.                                
chrV	sacCer3_ensGene	CDS	574807	575379	0.000000	-	0	gene_id "YER190C-A"; transcript_id "YER190C-A";
chrII	sacCer3_ensGene	CDS	805038	805256	0.000000	-	0	gene_id "YBR298C-A"; transcript_id "YBR298C-A";
chrV	sacCer3_ensGene	start_codon	575377	575379	0.000000	-	.	gene_id "YER190C-A"; transcript_id "YER190C-A";
chrII	sacCer3_ensGene	start_codon	805254	805256	0.000000	-	.	gene_id "YBR298C-A"; transcript_id "YBR298C-A";
chrII	sacCer3_ensGene	exon	805035	805256	0.000000	-	.	gene_id "YBR298C-A"; transcript_id "YBR298C-A";
chrIII	sacCer3_ensGene	exon	309070	310155	0.000000	+	.	gene_id "YCR105W"; transcript_id "YCR105W";
CHRII	sacCer3_ensGene	start_codon	805351	805353	0.000000	+	.	gene_id "YBR299W"; transcript_id "YBR299W";
CHRIII	sacCer3_ensGene	start_codon	310958	310960	0.000000	+	.	gene_id "YCR106W"; transcript_id "YCR106W";
chrV	sacCer3_ensGene	exon	574804	575379	0.000000	-	.	gene_id "YER190C-A"; transcript_id "YER190C-A";
chrV	sacCer3_ensGene	stop_codon	575680	575682	0.000000	-	.	gene_id "YER190C-B"; transcript_id "YER190C-B";
  • grep practice: Save all the features in mini.gff that are on chrII into a new file called chrII_entries.gff
  • cut practice: A .bed file is another standardized, tab-delimited file format. For this exercise, use the simplified layout chromosome<tab>start<tab>end<tab>strand. Make a .bed file out of mini.gff using cut.
  • bonus points:… make sure there are no comment lines in your new .bed file.

Pipes

wiki/2018grep.txt · Last modified: 2018/08/28 11:50 by erin