assignments:wr1

Differences

This shows you the differences between two versions of the page.

assignments:wr1 [2018/08/06 17:30]
dokuroot
assignments:wr1 [2018/11/15 11:42] (current)
dokuroot
Line 1: Line 1:
-====== Writing Assignment ======
 + 
 +~~NOTOC~~  
 + 
 +====== Grep Exercise ======
  
 ---- ----
  
-  * 
-  * 
-  * 
 +====grep==== 
 + 
 +''grep'' is a command line tool for searching for patterns within a file and returning the lines that contain the pattern. 
 +  $ grep "pattern" file 
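
As a minimal sketch of the usage line above, the following creates a small file and searches it. The file name ''sample.txt'' and its contents are made up for illustration.

```shell
# Create a small sample file (hypothetical contents, for illustration only).
printf 'ACGT\nTTGA\nGGACGT\n' > sample.txt

# Print every line of sample.txt that contains the pattern ACGT.
grep "ACGT" sample.txt
```

The lines ''ACGT'' and ''GGACGT'' are printed; ''TTGA'' is skipped because it does not contain the pattern.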
 + 
 +====tr==== 
 +''tr'' (translate or transliterate) is a command line tool for making single-character substitutions in a string or file. It reads from standard input, so the input file is supplied with a redirect. 
 +  $ tr U T <input_file >output_file 
 +   
 +====sed==== 
 +''sed'' is a command line tool for making more complex substitutions. Its patterns can be regular expressions. 
 + 
 +  $ sed 's/old_pattern/new_pattern/g' input_file >output_file 
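
For a single-character substitution the two tools are interchangeable. The sketch below runs both on the same input and compares the results; the file names ''rna.txt'', ''dna_tr.txt'' and ''dna_sed.txt'' are placeholders chosen for this example.

```shell
# Hypothetical input: one RNA sequence per line.
printf 'UGAGGUAGUAGGUUGUAUAGUU\n' > rna.txt

# tr reads standard input, so the file is supplied with a redirect.
tr U T < rna.txt > dna_tr.txt

# sed takes the file name as an argument; the g flag replaces every U on a line.
sed 's/U/T/g' rna.txt > dna_sed.txt

# cmp exits with status 0 when the two files are identical; then show the result.
cmp dna_tr.txt dna_sed.txt && cat dna_tr.txt
```

''tr'' is the simpler choice when each character maps to exactly one other character; ''sed'' becomes necessary once the replacement depends on a pattern.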
 + 
 +====Regular Expressions==== 
 + 
 +Regular expressions are character combinations that represent a particular pattern. 
 + 
 +===Common regular expressions=== 
 +^ Expression ^ Meaning ^ 
 +|%%\t%% | tab |  
 +|%%\n%% | new line | 
 +|%%\s%% | any white space (new line, space, etc.) | 
 +|%%\d%% | any digit 0-9 | 
 +|%%\w%% | any alphanumeric character or underscore | 
 +|%%.%% | a wildcard: any single character except a new line | 
 +|%%*%% | match the preceding character or pattern zero or more times (greedy: matches as much as possible; in Perl-compatible regular expressions, e.g. ''grep -P'', use *? to match as little as possible) | 
 +|%%+%% | match the preceding character or pattern one or more times | 
 +|%%\b%% | word boundary | 
 +|%%^%% | matches beginning of line or string | 
 +|%%$%% | matches end of line or string | 
 +|%%[0-9]%% | a character class, any digit | 
 +|%%[a-z]%% | a character class, any lowercase letter | 
 +|%%|%% | alternation: match the expression on the left or the one on the right, e.g. DNA%%|%%RNA | 
 +|%%\\n%% | the first backslash escapes the second, so the pattern matches a literal backslash followed by n rather than a new line. Useful if you really want to search for something like \n. | 
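
A few of the expressions in the table can be tried directly with ''grep''. This is a sketch on a made-up file (''words.txt''). Note that not every expression works in every ''grep'' mode: in GNU grep, ''+'' and ''|'' need ''-E'', and ''\d'' is only understood with ''-P'', so ''[0-9]'' is used here instead.

```shell
# Hypothetical test file; the second line contains a tab.
printf 'DNA 12\nRNA\tx\nprotein 7\ndna9\n' > words.txt

# ^ anchors the match to the start of the line; | (alternation) needs -E.
grep -E '^(DNA|RNA)' words.txt

# [0-9]+ matches one or more digits; $ anchors the match to the end of the line.
grep -E '[0-9]+$' words.txt

# \b marks a word boundary, so DNA must stand alone as a word.
grep '\bDNA\b' words.txt
```

The first search prints the ''DNA'' and ''RNA'' lines, the second prints the three lines that end in a digit, and the third prints only ''DNA 12'' because ''dna9'' is lowercase.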
 + 
 + 
 +===grep options=== 
 +^ Common grep options ​ ^ Purpose ^ 
 +| %%--%% | end of options; everything after it is treated as a pattern or file name | 
 +| -v | report lines that do not match the pattern | 
 +| -f FILE | search against a list of patterns read from FILE, one per line | 
 +| -A N | report N additional lines after each pattern match | 
 +| -B N | report N additional lines before each pattern match | 
 +| -F | interpret the pattern as a fixed string, so regex characters are matched literally | 
 +| -i | ignore case | 
 +| -w | search whole words only | 
 +| -r | search files in subdirectories | 
 +| -c | return the number of lines that match the pattern | 
 +| -n | show the line number of each pattern match | 
 +\\ 
 + 
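
Several of these options are easiest to understand side by side. The sketch below uses a made-up FASTA-style file (''opts.fa''); the sequence names are placeholders.

```shell
# Hypothetical FASTA-style file: header lines start with >, sequences follow.
printf '>seq1\nACGT\n>seq2\nacgt\n>seq3\nTTTT\n' > opts.fa

# -i ignores case and -c counts matching lines: ACGT and acgt both match.
grep -i -c 'ACGT' opts.fa

# -n prefixes each matching line with its line number.
grep -n 'TTTT' opts.fa

# -v inverts the match; here it keeps only lines that do not start with >.
grep -v '^>' opts.fa

# -B 1 also prints the line before each match, here the header of the sequence.
grep -B 1 'TTTT' opts.fa
```

Note that ''-c'' counts lines, not individual occurrences: a line containing the pattern twice still counts once.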
 + 
 +===== Exercise ===== 
 + 
 +miRNAs are grouped into families based on their seed sequences (positions 2-8 of the miRNA). let-7 was the first miRNA discovered in humans, but it has homologs across animals. The goals of this exercise are as follows:\\ 
 +1) Identify all of the let-7 family members across species for which miRNAs are available in miRBase. \\ 
 +2) Identify all let-7 miRNAs that are identical across the entire 22 nt sequence.\\ 
 + 
 +**1.** Download miRNA sequences from miRBase: \\ 
 + 
 +website: ''​http://​www.mirbase.org''​ \\ 
 +file: ''​mature.fa''​ \\ 
 + 
 +**2.** Convert RNA (U) sequences in the file to DNA (T) using ''​sed''​ or ''​tr''​. \\ 
 + 
 +**3.** Extract all perfectly conserved let-7 miRNAs from the file using ''grep'' and the C. elegans let-7 sequence as the reference sequence (TGAGGTAGTAGGTTGTATAGTT). Include ''\b'' at both ends of the sequence to restrict your search to sequences that start and end at the same nucleotides as the reference. Use the ''-B 1'' option to report each matching sequence line together with the previous line, which contains the sequence ID.\\ 
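
Steps 2 and 3 can be rehearsed on a toy file before running them on ''mature.fa''. This is only a sketch: the file ''toy_mature.fa'' and its three records are invented for illustration and are not miRBase entries. One design choice is an assumption on my part: the ''sed'' command converts only sequence lines, so that any uppercase U in a header line is left alone.

```shell
# Toy stand-in for mature.fa; the records below are invented for illustration.
printf '>toy-let-7 example one\nUGAGGUAGUAGGUUGUAUAGUU\n' > toy_mature.fa
printf '>toy-long example two\nUGAGGUAGUAGGUUGUAUAGUUA\n' >> toy_mature.fa
printf '>toy-other example three\nUCCCUGAGACCUCAAGUGUGA\n' >> toy_mature.fa

# Replace U with T on sequence lines only; lines starting with > are skipped.
sed '/^>/!s/U/T/g' toy_mature.fa > toy_mature_dna.fa

# \b on both sides rejects the longer second sequence; -B 1 adds the header line.
grep -B 1 '\bTGAGGTAGTAGGTTGTATAGTT\b' toy_mature_dna.fa
```

Only the first record is reported. The second sequence contains the reference but carries one extra nucleotide at its end, so the closing ''\b'' does not match.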
 +\\ 
 +\\ 
 +==== Writing Assignment ==== 
 +Write a micro paper describing the results you obtain using grep to search for miRNAs and let-7 across humans and several common model organisms. 
 +\\ 
 +\\ 
 +Use ''grep'' to identify how many miRNAs are in Homo sapiens (humans), Mus musculus (mice), Danio rerio (zebrafish), Drosophila melanogaster (flies) and Caenorhabditis elegans (worms). Identify how many let-7 family miRNAs (those sharing the seed sequence **.GAGGTAG**) each species has. Plot your results in whatever form you like (sequences vs. species, one plot with total miRNAs and one with let-7 miRNAs) using Excel or other software. Write a one-paragraph summary of your results. Your summary should include a brief description of miRNAs, the relevance of let-7, a summary of your results, and the methods you used to obtain them. Your summary should reference a figure containing the plots and a figure legend. Combine the summary and the figure with its legend into a single one-page file, save it as a PDF and submit it on Canvas. 
 +\\ 
 +\\ 
 +**grep hints:​**\\ 
 +\\ 
 +1. You can pipe (''​|''​) the results of one grep search to another grep search. 
 + 
 +2. You can include the line above the pattern match in the output of grep using ''-B 1'' (use ''-A 1'' for the line below). 
 + 
 +3. You can pipe the results of a grep search to ''​wc -l''​ to get the number of lines matching the pattern.
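
The hints can be combined into a counting pipeline. The sketch below works on a toy file (''toy_species.fa'') whose records are invented for illustration; it assumes that each header starts with a three-letter species prefix such as ''hsa'' or ''cel'', and that the sequences have already been converted to DNA.

```shell
# Toy file; IDs and the selection of sequences are invented for illustration.
printf '>hsa-toy-1\nTGAGGTAGTAGGTTGTATAGTT\n' > toy_species.fa
printf '>hsa-toy-2\nTAGCAGCACGTAAATATTGGCG\n' >> toy_species.fa
printf '>cel-toy-1\nTGAGGTAGTAGGTTGTATAGTT\n' >> toy_species.fa
printf '>hsa-toy-3\nTGAGGTAGGAGGTTGTATAGTT\n' >> toy_species.fa

# Count the records of one species: match its headers, then count the lines.
grep '^>hsa' toy_species.fa | wc -l

# Count that species' let-7 family members: -A 1 pulls in the sequence line
# after each header, and the second grep keeps sequences starting with the seed.
grep -A 1 '^>hsa' toy_species.fa | grep -c '^.GAGGTAG'
```

Here ''-A 1'' is used rather than ''-B 1'' because the first search matches the header and the sequence is the line after it. GNU grep prints a ''%%--%%'' separator between non-adjacent groups of context lines; the second search ignores those lines because they do not match the seed pattern.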
assignments/wr1.txt · Last modified: 2018/11/15 11:42 by dokuroot