2018grep_motifs

Differences

This shows you the differences between two versions of the page.

2018grep_motifs [2018/09/10 11:10]
david
2018grep_motifs [2018/09/10 11:31]
david [Variable length patterns]
Line 24: Line 24:
 ^   Y   |  p**Y**rimidine  |  [CT]  |
  
 +By convention, these codes are written in uppercase, but our sequence output is in lowercase. To use these patterns, we must either type them in lowercase or use the ''-i'' flag for "case **i**nsensitive".
 +
 +  grep -i pattern file.fa
 +
 +For example, we can get the same output from our previous command this way:
 +
 +Grep for AATGY (AATG followed by C or T):
 +
 +<code bash>
 +$ grep -bi AATG[CT] output/*
 +output/F56H9.5.1.fa:24:catccatttatactattgcaccgaatattgggttaatgtcggtgtttgaa
 +output/F56H9.5.1.fa:75:tatattttggttacagtttaaatgcttcaaatttaaatcaattaaatc
 +output/F56H9.5.2.fa:24:ttaaatgcttcaaatttaaatcaattaaatc
 +</code>
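The effect of ''-i'' on a character-set pattern can be checked on a small throwaway file. This is only a sketch: the three sequences below are made up for illustration and are not taken from the files in ''output/''.

```shell
# Hypothetical test input (not from the original FASTA files)
tmp=$(mktemp)
printf 'ccgaatgcttt\nAATGTAAA\nttaatgaccc\n' > "$tmp"

# Without -i only the uppercase line matches; -c counts matching lines
upper_only=$(grep -c 'AATG[CT]' "$tmp")
# With -i the lowercase "aatgc" line matches too; "aatga" still fails [CT]
any_case=$(grep -ci 'AATG[CT]' "$tmp")

echo "case-sensitive: $upper_only, case-insensitive: $any_case"
rm -f "$tmp"
```

The third line never matches because the base after ''aatg'' is an ''a'', which is not in the set ''[CT]''.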
 +
 +====== Variable length patterns ======
 +
 +Say we want to match any number of T's in AATG, such as AATTG, AATTTTTG, etc.
 +
 +To match any number of repetitions of a character (zero or more), add an asterisk (*) after it.
 +
 +<code bash>
 +$ grep -bi AAT*G output/*
 +output/F56H9.5.1.fa:24:catccatttatactattgcaccgaatattgggttaatgtcggtgtttgaa
 +output/F56H9.5.1.fa:75:tatattttggttacagtttaaatgcttcaaatttaaatcaattaaatc
 +output/F56H9.5.2.fa:24:ttaaatgcttcaaatttaaatcaattaaatc
 +output/R13A5.5.1.fa:24:ttactaatttttgttatcttatcaaacaaatatattttccagc
 +</code>
 +
 +This added a match from R13A5.5: ''​aatttttg''​
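One detail worth checking is that the asterisk also allows zero repetitions, so ''AAT*G'' matches ''AAG'' as well. The sketch below uses made-up sequences (an assumption for illustration, not data from the tutorial files) to show the difference between "zero or more" and "one or more" T's.

```shell
# Hypothetical sequences with zero, one, two and five T's, plus one non-match
seqs=$(printf 'AAG\nAATG\nAATTG\nAATTTTTG\nAACG\n')

# T* allows zero T's, so AAG matches as well
zero_or_more=$(printf '%s\n' "$seqs" | grep -c 'AAT*G')
# Writing the T once before T* requires at least one T
one_or_more=$(printf '%s\n' "$seqs" | grep -c 'AATT*G')

echo "AAT*G: $zero_or_more lines, AATT*G: $one_or_more lines"
```

If a motif must contain at least one T, writing the T once before the starred T (''AATT*G'') is one way to say so with basic patterns.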
 +
 +Of course, we may not want that many. It has 5 t's, so let's set an upper bound of 4. To do that, we must change our approach: a length range can be specified after any symbol or character set using curly braces.
 +<code>
 +{n,m} - example {2,4}
 +{n,}  - example {2,}
 +{,m}  - example {,4}
 +</code>
 +
 +  - The first example matches a repetition of 2-4 of the preceding symbol or character set.
 +  - The second example matches a repetition of AT LEAST 2 of the preceding symbol or character set.
 +  - The third example matches a repetition of AT MOST 4 of the preceding symbol or character set.
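The three forms can be compared side by side on a few made-up sequences (hypothetical input, chosen to have 1, 2, 4 and 5 T's). The sketch uses ''grep -E'', the extended-regex mode that the ''egrep'' command introduced below stands for. It also writes the "at most" case as ''{0,4}'', which should behave the same as ''{,4}''; the form without a lower bound is a GNU extension, so the explicit zero is the safer spelling on other systems.

```shell
# Hypothetical input: 1, 2, 4 and 5 T's between AA and G
seqs=$(printf 'AATG\nAATTG\nAATTTTG\nAATTTTTG\n')

# grep -c counts the lines that match each pattern
two_to_four=$(printf '%s\n' "$seqs" | grep -Ec 'AAT{2,4}G')
at_least_two=$(printf '%s\n' "$seqs" | grep -Ec 'AAT{2,}G')
at_most_four=$(printf '%s\n' "$seqs" | grep -Ec 'AAT{0,4}G')

echo "{2,4}: $two_to_four  {2,}: $at_least_two  {0,4}: $at_most_four"
```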
 +
 +Searching with such a pattern requires two changes:
 +
 +  * egrep - "extended" regular expressions (the same as ''grep -E''; newer GNU grep releases mark ''egrep'' as obsolescent)
 +  * quote the pattern - Curly braces have a special meaning in BASH, so we use single quotes ('''''​) to prevent the shell from interpreting them.
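To see why the quotes matter, it can help to look at what bash does with the unquoted pattern before any command receives it. A small sketch, assuming ''bash'' is installed and on the PATH; ''echo'' stands in for the search command so that only the shell's behaviour is visible.

```shell
# Unquoted: bash brace expansion rewrites the pattern into two words
unquoted=$(bash -c 'echo AAT{,4}G')
# Quoted: the braces reach the command unchanged
quoted=$(bash -c "echo 'AAT{,4}G'")

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
```

Without quotes, the search command would receive ''AATG'' as the pattern and ''AAT4G'' as an extra file name, which is not what we intended.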
 +
 +<code bash>
 +$ egrep -bi 'AAT{,4}G' output/*
 +output/F56H9.5.1.fa:24:catccatttatactattgcaccgaatattgggttaatgtcggtgtttgaa
 +output/F56H9.5.1.fa:75:tatattttggttacagtttaaatgcttcaaatttaaatcaattaaatc
 +output/F56H9.5.2.fa:24:ttaaatgcttcaaatttaaatcaattaaatc
 +</code>
 +
 +We lost our R13A5.5.1.fa match. We can restore it by increasing the upper bound to 5 or higher.
 +<code bash>
 +$ egrep -bi 'AAT{,5}G' output/*
 +output/F56H9.5.1.fa:24:catccatttatactattgcaccgaatattgggttaatgtcggtgtttgaa
 +output/F56H9.5.1.fa:75:tatattttggttacagtttaaatgcttcaaatttaaatcaattaaatc
 +output/F56H9.5.2.fa:24:ttaaatgcttcaaatttaaatcaattaaatc
 +output/R13A5.5.1.fa:24:ttactaatttttgttatcttatcaaacaaatatattttccagc
 +</code>
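When whole lines are printed, it is not obvious which part of a sequence matched. As a sketch, the ''-o'' flag prints only the matched substrings. The two sequences below are copied from the output above into a temporary file (the temporary file is an assumption for illustration), and the upper bound is written as ''{0,4}'' and ''{0,5}'' with an explicit zero.

```shell
# Two sequence lines copied from the grep output above
tmp=$(mktemp)
printf '%s\n' \
  'tatattttggttacagtttaaatgcttcaaatttaaatcaattaaatc' \
  'ttactaatttttgttatcttatcaaacaaatatattttccagc' > "$tmp"

# -o prints each matched substring on its own line instead of the whole line
up_to_4=$(grep -Eio 'AAT{0,4}G' "$tmp")
up_to_5=$(grep -Eio 'AAT{0,5}G' "$tmp")

echo "upper bound 4:"; echo "$up_to_4"
echo "upper bound 5:"; echo "$up_to_5"
rm -f "$tmp"
```

With an upper bound of 4 only ''aatg'' is reported; raising it to 5 adds ''aatttttg'' from the R13A5.5.1 sequence.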
2018grep_motifs.txt · Last modified: 2018/09/10 11:37 by david