Grep Exercise


grep

grep is a command line tool for searching for patterns within a file and returning lines containing the pattern.

$ grep "pattern" file
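As a minimal sketch of the command above, the following builds a small FASTA-style file and searches it. The file name example.fa and its contents are made up for illustration.

```shell
# Build a tiny FASTA-style file to search (made-up example data).
printf '%s\n' \
  '>seq1 example' \
  'TGAGGTAGTAGGTTGTATAGTT' \
  '>seq2 example' \
  'ACGTACGTACGT' > example.fa

# Print every line that contains the pattern GGTAG.
grep "GGTAG" example.fa
```

Only the first sequence line contains GGTAG, so that is the only line grep prints.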

tr

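tr (translate or transliterate) is a command-line tool for making single-character substitutions. It reads from standard input and does not accept a file name as an argument, so the input file must be redirected with <.

$ tr U T <input_file >output_file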
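A short sketch of a U-to-T conversion with tr. The file names rna.txt and dna.txt are placeholders, and the sequence is example data. Because tr reads standard input, the input file is supplied with < rather than as an argument.

```shell
# Made-up RNA sequence line for illustration.
printf 'UGAGGUAGUAGGUUGUAUAGUU\n' > rna.txt

# Replace every U with T; input comes from rna.txt via <, output goes to dna.txt.
tr U T < rna.txt > dna.txt

cat dna.txt
```

Note that tr changes every U it sees. If it is run on a whole FASTA file, any uppercase U in the header lines is converted as well.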

sed

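sed is a command-line tool for making more complex substitutions. It supports pattern matching with regular expressions. Do not put quotation marks around the patterns inside the expression, because sed would treat them as literal characters to match.

$ sed 's/old_pattern/new_pattern/g' input_file >output_file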
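One possible way to use sed for a substitution that should apply only to some lines. In this sketch the address /^>/! restricts the substitution to lines that do not start with >, so a FASTA header is left unchanged. The file names and the header text are hypothetical.

```shell
# Made-up FASTA record: one header line and one RNA sequence line.
printf '%s\n' \
  '>toy-let-7 TOY0001 Unnamed species' \
  'UGAGGUAGUAGGUUGUAUAGUU' > rna.fa

# Replace every U with T, but only on lines that do not start with ">",
# so the capital U in the header is left untouched.
sed '/^>/!s/U/T/g' rna.fa > dna.fa

cat dna.fa
```

The header keeps its capital U in "Unnamed", while the sequence line becomes TGAGGTAGTAGGTTGTATAGTT.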

Regular Expressions

Regular expressions are character combinations that represent a particular pattern.

Expression   Meaning
\t           tab
\n           new line
\s           any whitespace character (space, tab, new line, etc.)
\d           any digit 0-9
\w           any alphanumeric character or underscore
.            wildcard: any single character except a new line
*            match the preceding character or pattern zero or more times (greedy: matches as much as possible; use *? to match as little as possible)
+            match the preceding character or pattern one or more times
\b           word boundary
^            matches the beginning of a line or string
$            matches the end of a line or string
[0-9]        a character class: any digit
[a-z]        a character class: any lowercase letter
|            matches either the expression on the left or the one on the right, e.g. DNA|RNA
\\n          the first backslash escapes the second, so this matches a literal backslash followed by n rather than a new line

Note: support varies by tool. With grep, + and | need the -E option, and \d, \t, and *? are Perl-style syntax that GNU grep only accepts with -P.
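A few of the expressions above, tried with grep on a small made-up file (notes.txt is a placeholder name). This is only a sketch. It uses -E where the expression needs extended regular expressions.

```shell
# Four made-up lines to search.
printf '%s\n' \
  'DNA sample 1' \
  'RNA sample 22' \
  'protein sample' \
  'sample of DNA' > notes.txt

# Alternation (needs -E): lines containing DNA or RNA.
grep -E 'DNA|RNA' notes.txt

# ^ anchors the match to the start of the line: only lines beginning with DNA.
grep '^DNA' notes.txt

# [0-9]+ followed by $ (needs -E for +): lines that end in one or more digits.
grep -E '[0-9]+$' notes.txt
```

The first search prints three lines, the second prints only "DNA sample 1", and the third prints the two lines that end in a number.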

grep options

Common grep options   Purpose
--                    marks the end of options; anything after it is treated as a pattern or file name
-v                    report lines that do not match the pattern
-f FILE               read a list of patterns from FILE, one per line
-A N                  report N additional lines after each match
-B N                  report N additional lines before each match
-F                    treat the pattern as a fixed string rather than a regular expression
-i                    ignore case
-w                    match whole words only
-r                    search files in subdirectories recursively
-c                    report the number of matching lines instead of the lines themselves
-n                    show the line number of each match
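The options can be tried out on a small file. The sketch below uses a made-up two-record FASTA file (toy.fa, with invented IDs) to show -c, -v, -i, -n and -B.

```shell
# Made-up FASTA file with two records.
printf '%s\n' \
  '>toy-let-7 TOY0001' \
  'TGAGGTAGTAGGTTGTATAGTT' \
  '>toy-mir-1 TOY0002' \
  'TGGAATGTAAAGAAGTATGTAT' > toy.fa

# -c: count the lines that match (here, the header lines).
grep -c '^>' toy.fa

# -v: print the lines that do NOT match (here, the sequence lines).
grep -v '^>' toy.fa

# -i and -n: case-insensitive match, prefixed with the line number.
grep -i -n 'LET-7' toy.fa

# -B 1: print each matching line plus the one line before it.
grep -B 1 'GAGGTAG' toy.fa
```

When -A or -B is used and the matches are not adjacent, grep separates the groups of output lines with a line containing --.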


Exercise

miRNAs are grouped into families based on their seed sequences (positions 2-8 of the miRNA). let-7 was the first miRNA discovered in humans but it has homologs across animals. The goals of this exercise are as follows:
1) Identify all of the let-7 family members across species for which miRNAs are available in miRBase.
2) Identify all let-7 miRNAs that are identical across the entire 22 nt sequence.

1. Download miRNA sequences from miRBase:

website: http://www.mirbase.org
file: mature.fa

2. Convert RNA (U) sequences in the file to DNA (T) using sed or tr.

3. Extract all perfectly conserved let-7 miRNAs from the file using grep and the C. elegans let-7 sequence as the reference sequence (TGAGGTAGTAGGTTGTATAGTT). Include \b at each end of the pattern so that only sequences that start and end at exactly these nucleotides match. Use -B 1 to report each matching sequence line together with the previous line, which contains the sequence ID.
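The effect of \b can be checked on a toy file before running the search on the real data. In the sketch below, toy_dna.fa and its IDs are invented. The second record is one nucleotide longer than the reference, so it should not be reported.

```shell
# Made-up DNA-converted FASTA file: an exact match, a longer variant, and an unrelated sequence.
printf '%s\n' \
  '>toy-let-7 TOY0001' \
  'TGAGGTAGTAGGTTGTATAGTT' \
  '>toy-let-7-long TOY0003' \
  'TGAGGTAGTAGGTTGTATAGTTA' \
  '>toy-mir-1 TOY0002' \
  'TGGAATGTAAAGAAGTATGTAT' > toy_dna.fa

# \b on both sides requires the sequence to start and end exactly at these nucleotides;
# -B 1 adds the header line that precedes each matching sequence.
grep -B 1 '\bTGAGGTAGTAGGTTGTATAGTT\b' toy_dna.fa
```

Only the first record is printed. Without the trailing \b, the longer variant would match as well.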


Writing Assignment

Write a micro paper describing the results you obtain using grep to search for miRNAs and let-7 across humans and several common model organisms.

Use grep to identify how many miRNAs there are in Homo sapiens (humans), Mus musculus (mice), Danio rerio (zebrafish), Drosophila melanogaster (flies) and Caenorhabditis elegans (worms). Identify how many let-7 family miRNAs (those sharing the seed sequence .GAGGTAG) each species has. Plot your results in whatever form you like (number of sequences vs. species, with one plot for total miRNAs and one for let-7 miRNAs) using Excel or other software. Write a one-paragraph summary of your results. Your summary should include a brief description of miRNAs, the relevance of let-7, a summary of your results, and the methods you used to obtain them. Your summary should reference a figure containing the plots and a figure legend. Combine the summary and the figure with its legend into a single one-page file, save it as a PDF and submit it on Canvas.

grep hints:

1. You can pipe (|) the results of one grep search to another grep search.

2. You can include the line after the pattern match in the output of grep using -A 1 (use -B 1 for the line before the match).

3. You can pipe the results of a grep search to wc -l to get the number of lines matching the pattern.
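The hints can be combined as in the sketch below. It assumes that each header line starts with > and contains the species name, and that each sequence occupies the single line after its header. The file toy_species.fa and its IDs are invented, so the counts say nothing about the real data.

```shell
# Made-up FASTA file: two human records and one mouse record.
printf '%s\n' \
  '>toy-let-7a TOY0001 Homo sapiens let-7a' \
  'TGAGGTAGTAGGTTGTATAGTT' \
  '>toy-mir-1 TOY0002 Homo sapiens mir-1' \
  'TGGAATGTAAAGAAGTATGTAT' \
  '>toy-let-7 TOY0003 Mus musculus let-7' \
  'TGAGGTAGTAGGTTGTATAGTT' > toy_species.fa

# Count the header lines that mention the species.
grep 'Homo sapiens' toy_species.fa | wc -l

# Pipe one grep into another: keep each human header plus its sequence line (-A 1),
# then count the lines whose positions 2-8 are the let-7 seed.
grep -A 1 'Homo sapiens' toy_species.fa | grep '^.GAGGTAG' | wc -l
```

On this toy file the first command reports 2 human records and the second reports 1 human let-7 family member.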

assignments/wr1.txt · Last modified: 2018/11/15 11:42 by dokuroot