wiki:2016substitutions [2018/07/05 14:56] (current)
david created from old site
~~NOTOC~~
  -Regular expressions.
  -Substitutions using translate/transliterate (**tr**).
  -More fun with **grep**.
Now that we've learned how to acquire data and do some very basic tasks with that data, we'll discuss ways of manipulating data using substitutions and regular expressions.
====1. Regular expressions====
Regular expressions are character combinations that represent a particular pattern.
===Regular Expressions===
^ Expression ^ Meaning ^
|%%\t%% | tab |
|%%\n%% | new line |
|%%\s%% | any whitespace character (space, tab, new line, etc.) |
|%%\d%% | any digit 0-9 |
|%%\w%% | any alphanumeric character or underscore |
|%%.%% | a wildcard character (any single character except a new line) |
|%%*%% | match the preceding character or pattern zero or more times (greedy: matches as much as possible; use %%*?%% to match as little as possible) |
|%%+%% | match the preceding character or pattern one or more times |
|%%\b%% | word boundary |
|%%^%% | matches beginning of line or string |
|%%$%% | matches end of line or string |
|%%[0-9]%% | a character class, any digit |
|%%[a-z]%% | a character class, any lowercase letter |
|%%|%% | alternation: match the pattern on the left or the pattern on the right, e.g. DNA%%|%%RNA (use egrep) |
|%%\\n%% | the first backslash escapes the second, so this matches a literal backslash followed by n rather than a new line; useful when you really want to search for the text %%\n%% |
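A few of the expressions in the table can be tried out with **grep**. The snippet below is a minimal sketch: the four input lines and the variable names (''tmp'', ''starts'', ''digits'', ''wild'') are made up for illustration, and ''grep -E'' is used as the equivalent of egrep.

```shell
# Toy input: four short lines written to a temporary file (made-up data).
tmp=$(mktemp)
printf 'DNA sample 1\nRNA sample 22\nprotein\nsample_DNA\n' > "$tmp"

# Alternation (|) needs extended regular expressions: grep -E, i.e. egrep.
# ^ anchors the match to the beginning of the line.
starts=$(grep -E -c '^(DNA|RNA)' "$tmp")

# [0-9]+ matches one or more digits; $ anchors the match to the end of the line.
digits=$(grep -E -c '[0-9]+$' "$tmp")

# . is a wildcard and * repeats it, so 'D.*1' matches a D followed,
# anywhere later on the same line, by a 1.
wild=$(grep -c 'D.*1' "$tmp")

# Print the three line counts.
echo "$starts $digits $wild"
rm -f "$tmp"
```

The ''-c'' option makes grep print the number of matching lines instead of the lines themselves; drop it to see which lines matched.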
====2. Substitutions using translate/transliterate====
Substitutions refer to changing one pattern to another in a file, filename, etc. \\
There are several tools for doing substitutions, the simplest of which is **tr**.
**tr** deletes characters or replaces one set of characters with another set, position by position. It reads from standard input, so a file has to be redirected or piped into it.
  $ tr 'ACTG' 'TGAC' < input_file
In the above example, every A is replaced with T, C with G, T with A, and G with C.
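The replace and delete modes of **tr** can be sketched on short strings instead of a file. The input strings and variable names below are illustrative assumptions.

```shell
# tr reads standard input, so feed it with a pipe (or with < file).
# Each character of the first set is replaced by the character at the
# same position in the second set: A->T, C->G, T->A, G->C.
comp=$(printf 'AACTG\n' | tr 'ACTG' 'TGAC')

# -d deletes every listed character instead of replacing it.
no_n=$(printf 'ACNTGN\n' | tr -d 'N')

# Ranges work as sets too: convert lowercase letters to uppercase.
upper=$(printf 'actg\n' | tr 'a-z' 'A-Z')

echo "$comp $no_n $upper"
```

Note that **tr** works on single characters only; it cannot replace one multi-character pattern with another.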
====★ Interactive Exercise: substitutions and regular expressions====
Follow along with the instructor on your own computer.
**1.** Obtain all miRNA sequences in FASTA format from the miRBase download page.  The file is called **mature.fa**. \\ \\
**2.** Extract all worm (C. elegans), fly (D. melanogaster), and human (H. sapiens) miRNAs, preserving the FASTA format (egrep is needed when matching one pattern or another). \\ \\
**3.** Use **tr** to convert all RNA sequences to DNA sequences. \\ \\
**4.** Determine how many miRNAs start with each possible 5' nucleotide (A, C, G, T). What is the most common 5' nt? \\ \\
**5.** Determine how many miRNAs are 20, 21, 22, 23, and 24 nt long.  What is the most common miRNA length? \\ \\
**6.** Extract all potential let-7 miRNA family members using **grep**, preserving the FASTA format.  miRNA families are determined by their seed sequence, positions 2-8 of the mature miRNA.  The seed sequence of let-7 is: GAGGTAG. \\ \\
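One possible way to approach steps 2 to 6 is sketched below. It is not the instructor's solution. It runs on a tiny made-up stand-in for **mature.fa** and assumes that each record is one header line followed by one sequence line, and that species are identified by the header prefixes ''cel'', ''dme'', and ''hsa''. The record names, sequences, and variable names are all hypothetical.

```shell
# Toy stand-in for mature.fa (made-up records, not real miRBase data).
fa=$(mktemp); sub=$(mktemp); dna=$(mktemp)
printf '%s\n' \
  '>cel-toy-1 Caenorhabditis elegans' 'UGAGGUAGUAGGUUGUAUAGUU' \
  '>dme-toy-2 Drosophila melanogaster' 'AAUCACAGCCAGCUUUGAUGAGC' \
  '>hsa-toy-3 Homo sapiens' 'UGAGGUAGUAGGUUGUGUGGUU' \
  '>mmu-toy-4 Mus musculus' 'CAAAGUGCUUACAGUGCAGGUAG' > "$fa"

# Step 2: match worm, fly, or human headers and print the sequence line
# after each one (-A 1). grep prints "--" between non-adjacent groups of
# matches, so those separator lines are filtered out.
grep -E -A 1 '^>(cel|dme|hsa)-' "$fa" | grep -v '^--$' > "$sub"
n_records=$(grep -c '^>' "$sub")

# Step 3: RNA to DNA. tr changes every U, including any U in a header line.
tr 'U' 'T' < "$sub" > "$dna"

# Step 4: tally the first character of every sequence (non-header) line.
grep -v '^>' "$dna" | cut -c 1 | sort | uniq -c
five_T=$(grep -c '^T' "$dna")

# Step 5: count sequences of one exact length; {22} repeats the preceding
# character class exactly 22 times. Repeat with 20, 21, 23, and 24.
len22=$(grep -c -E '^[ACGT]{22}$' "$dna")

# Step 6: the seed occupies positions 2-8, so allow exactly one nucleotide
# before it. -B 1 also prints the header line before each matching sequence.
let7=$(grep -B 1 -E '^[ACGT]GAGGTAG' "$dna" | grep -v '^--$')
n_let7=$(printf '%s\n' "$let7" | grep -c '^>')

echo "$n_records $five_T $len22 $n_let7"
rm -f "$fa" "$sub" "$dna"
```

On the real file, replace ''"$fa"'' with **mature.fa** and check the header format first, since the species prefixes and the one-line-per-sequence layout are assumptions here.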
wiki/2016substitutions.txt · Last modified: 2018/07/05 14:56 by david