# Computational biology at CSU


wiki:2016substitutions [2018/07/05 14:56] (current) david created from old site

~~NOTOC~~
=====SUBSTITUTIONS AND REGULAR EXPRESSIONS=====
----
====OUTLINE====
  - Regular expressions.
  - Substitutions using translate/transliterate (**tr**).
  - More fun with **grep**.

Now that we've learned how to acquire data and do some very basic tasks with that data, we'll discuss ways of manipulating data using substitutions and regular expressions.

====1. Regular expressions====

Regular expressions are character combinations that represent a particular pattern.

===Regular Expressions===
^ Expression ^ Meaning ^
|%%\t%% | tab |
|%%\n%% | new line |
|%%\s%% | any whitespace character (space, tab, new line, etc.) |
|%%\d%% | any digit 0-9 |
|%%\w%% | any alphanumeric character or underscore |
|%%.%% | wildcard: any single character except a new line |
|%%*%% | match the preceding character or pattern zero or more times (greedy: matches as much as possible; where lazy matching is supported, use %%*?%% to match as little as possible) |
|%%+%% | match the preceding character or pattern one or more times |
|%%\b%% | word boundary |
|%%^%% | matches beginning of line or string |
|%%$%% | matches end of line or string |
|%%[0-9]%% | a character class, any digit |
|%%[a-z]%% | a character class, any lowercase letter |
|%%|%% | matches either the pattern on the left or the pattern on the right, e.g. DNA%%|%%RNA (use egrep) |
|%%\\n%% | the first backslash escapes the second, which is useful when you want to search for the literal text \n rather than a new line |

====2. Substitutions using translate/transliterate====
Substitutions refer to changing one pattern to another in a file, filename, etc. \\

There are several tools for doing substitutions, the simplest of which is **tr**.
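Before moving on to **tr**, a few of the expressions from the table in section 1 can be tried with grep. This is only a minimal sketch: the FASTA-style records and the variable names (`sample`, `headers`, `nucleic`, `dna_only`, `pos2`) are made up for illustration and are not miRBase data, and `grep -E` is used as the equivalent of egrep.

```shell
# Toy FASTA-style records (made up for illustration, not real miRBase entries).
sample=$(printf '%s\n' \
  '>toy-1 DNA example' \
  'ACGTACGT' \
  '>toy-2 RNA example' \
  'UGAGGUAGUAGG' \
  '>toy-3 protein example' \
  'MKVLA')

# ^ anchors the match to the start of the line: keep only the header lines.
headers=$(printf '%s\n' "$sample" | grep '^>')

# | (alternation) needs extended regular expressions: grep -E, same as egrep.
nucleic=$(printf '%s\n' "$sample" | grep -E 'DNA|RNA')

# A character class with + and both anchors: lines made up only of A, C, G, T.
dna_only=$(printf '%s\n' "$sample" | grep -E '^[ACGT]+$')

# . matches any single character, so ^.GAGG finds GAGG starting at position 2.
pos2=$(printf '%s\n' "$sample" | grep '^.GAGG')

printf '%s\n' "$headers" "$nucleic" "$dna_only" "$pos2"
```

The same patterns should work on a real file by replacing the `printf ... |` part with the file name as grep's last argument.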

**tr** deletes or replaces one set of characters in its input with another set of characters. It reads from standard input, so the file is redirected into it:
  $ tr 'ACTG' 'TGAC' < input_file

In the above example, each A is replaced with T, each C with G, each T with A, and each G with C.

====★ Interactive Exercise: substitutions and regular expressions====
Follow along with the instructor on your own computer.

**1.** Obtain all miRNA sequences in FASTA format from the miRBase download page. The file is called **mature.fa**. \\ \\
**2.** Extract all worm (C. elegans), fly (D. melanogaster), and human (H. sapiens) miRNAs, preserving FASTA format (egrep is needed to match one pattern or another). \\ \\
**3.** Use **tr** to convert all RNA sequences to DNA sequences. \\ \\
**4.** Determine how many miRNAs start with each possible 5' nucleotide (A, C, G, T). What is the most common 5' nt? \\ \\
**5.** Determine how many miRNAs are 20, 21, 22, 23, and 24 nt long. What is the most common miRNA length? \\ \\
**6.** Extract all potential let-7 miRNA family members using **grep**, preserving the FASTA format. miRNA families are determined by their seed sequence, positions 2-8 of the mature miRNA. The seed sequence of let-7 is: GAGGTAG. \\ \\
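The character-by-character mapping that **tr** performs (section 2) can be checked on toy input before running it on a real file. The sketch below is illustrative only: the sequences and variable names (`dna`, `rna`, `complement`, `as_dna`, `no_n`) are made up, and the input is piped in because tr reads standard input.

```shell
# Toy sequences (made up for illustration).
dna='ACTGGA'
rna='UGAGGUAGU'

# Each character in the first set is replaced by the character at the same
# position in the second set: A->T, C->G, T->A, G->C.
complement=$(printf '%s\n' "$dna" | tr 'ACTG' 'TGAC')

# Single-character substitution: RNA to DNA by replacing U with T.
as_dna=$(printf '%s\n' "$rna" | tr 'U' 'T')

# -d deletes every character in the set instead of replacing it.
no_n=$(printf '%s\n' 'ACNNGT' | tr -d 'N')

printf '%s\n' "$complement" "$as_dna" "$no_n"
```

Note that tr does not distinguish header lines from sequence lines, so when it is applied to a whole FASTA file any matching character in a `>` header line is changed as well.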