2018grep_motifs


Using regular expressions to find sequence motifs

Sequence motifs can contain degenerate symbols, meaning they match more than one option at a given position.

Example:

WGATAR - [A or T] G A T A [A or G]

This is very close to the regular expression you would use to search a sequence.

grep '[AT]GATA[AG]' sequence.fa

The brackets in regular expression syntax define a character set: that position matches any one of the characters inside the brackets. Every match for the above pattern therefore has the same length: 6 characters.
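As a quick check of the fixed-length behaviour, the sketch below runs the same pattern over three made-up sequences (they are not from the course data). It assumes a grep that supports -o, as GNU and BSD grep do, to print only the matched text.

```shell
# Three hypothetical sequences; only the first and third contain a WGATAR site.
seqs='TTAGATAAGC
CCGGATACTT
AATGATAGGA'

# -o prints just the matched part of each line instead of the whole line.
matches=$(printf '%s\n' "$seqs" | grep -o '[AT]GATA[AG]')
printf '%s\n' "$matches"
```

Every string it reports is exactly 6 characters long, one character per position of the motif.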

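The degenerate symbols come from the IUPAC standard. The table lists N and the symbols that stand for two bases; IUPAC also defines B, D, H and V, which each stand for three bases.

Symbol  Stands for    Character set
K       Keto group    [GT]
M       aMino group   [AC]
N       aNy           [ACGT]
R       puRine        [AG]
S       Strong        [CG]
W       Weak          [AT]
Y       pYrimidine    [CT]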
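Because each symbol maps to one character set, the translation from motif to pattern can be automated. The following is a minimal sketch, not a standard tool: the function name iupac_to_regex is made up for this page, and it assumes an uppercase motif that uses only the symbols in the table.

```shell
# iupac_to_regex is a hypothetical helper name, not a standard tool.
# It assumes an uppercase motif that uses only the symbols in the table above.
iupac_to_regex() {
  # The replacement text contains only A, C, G, T and brackets,
  # so a later substitution never rewrites an earlier one.
  printf '%s\n' "$1" | sed \
    -e 's/K/[GT]/g' \
    -e 's/M/[AC]/g' \
    -e 's/N/[ACGT]/g' \
    -e 's/R/[AG]/g' \
    -e 's/S/[CG]/g' \
    -e 's/W/[AT]/g' \
    -e 's/Y/[CT]/g'
}

iupac_to_regex WGATAR
iupac_to_regex AATGY
```

The output can be passed to grep, for example: grep -i "$(iupac_to_regex WGATAR)" sequence.fa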

It is convention to write these in uppercase…

…but our sequence output was in lowercase, so to use these patterns, we must either type them in lowercase or use the -i flag for “case insensitive”.

grep -i pattern file.fa
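To see the effect of -i in isolation, this small sketch counts matching lines in one made-up lowercase sequence, first without and then with the flag. The sequence is hypothetical, and -c (count matching lines) is used only to keep the output short.

```shell
seq='ttaaatgcttc'   # hypothetical lowercase sequence containing aatgc

# Without -i the uppercase pattern finds nothing; grep then exits with
# status 1, so "|| true" keeps the script going.
without_i=$(printf '%s\n' "$seq" | grep -c 'AATG[CT]' || true)

# With -i the same pattern matches regardless of case.
with_i=$(printf '%s\n' "$seq" | grep -ci 'AATG[CT]')

echo "without -i: $without_i, with -i: $with_i"
```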

For example, we can get the same output from our previous command this way:

Grep for AATGY

$ grep -bi 'AATG[CT]' output/*
output/F56H9.5.1.fa:24:catccatttatactattgcaccgaatattgggttaatgtcggtgtttgaa
output/F56H9.5.1.fa:75:tatattttggttacagtttaaatgcttcaaatttaaatcaattaaatc
output/F56H9.5.2.fa:24:ttaaatgcttcaaatttaaatcaattaaatc

Variable length patterns

Say we want to match any number of T's in AATG, such as AATTG, AATTTTTG, etc.

To match any number of repetitions of a character, add an asterisk (*) after it. The asterisk means “zero or more”, so AAT*G also matches AAG.

$ grep -bi 'AAT*G' output/*
output/F56H9.5.1.fa:24:catccatttatactattgcaccgaatattgggttaatgtcggtgtttgaa
output/F56H9.5.1.fa:75:tatattttggttacagtttaaatgcttcaaatttaaatcaattaaatc
output/F56H9.5.2.fa:24:ttaaatgcttcaaatttaaatcaattaaatc
output/R13A5.5.1.fa:24:ttactaatttttgttatcttatcaaacaaatatattttccagc

This added a match from R13A5.5: aatttttg
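The asterisk also accepts zero repetitions, which is easy to overlook. As a rough illustration, the sketch below feeds five made-up strings to the same pattern and counts how many match.

```shell
# Hypothetical test strings with 0, 1, 2 and 5 T's between AA and G,
# plus AACG, which has a C in that position instead.
count=$(printf '%s\n' AAG AATG AATTG AATTTTTG AACG | grep -c 'AAT*G')
echo "$count"
```

Only AACG is rejected. AAG is accepted because T* is satisfied by no T at all.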

Of course, we may not want that many. The new match has 5 t's, so let's set an upper bound of 4. The asterisk cannot express a bound, so we need a different operator: a length range can be specified after any symbol or character set using curly braces.

{n,m} - example {2,4}
{n,}  - example {2,}
{,m}  - example {,4}
  1. The first example matches a repetition of 2-4 of the preceding symbol or character set.
  2. The second example matches a repetition of AT LEAST 2 of the preceding symbol or character set.
  3. The third example matches a repetition of AT MOST 4 of the preceding symbol or character set.
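The three range forms can be compared side by side on a few made-up strings. This sketch assumes grep -E for extended regular expressions, and it writes the "at most 4" case as {0,4} because the {,m} shorthand is a GNU extension.

```shell
# Hypothetical strings with 0, 1, 2, 4 and 5 T's between AA and G.
strings='AAG
AATG
AATTG
AATTTTG
AATTTTTG'

at_most_4=$(printf '%s\n' "$strings" | grep -cE 'AAT{0,4}G')    # 0 to 4 T's
at_least_2=$(printf '%s\n' "$strings" | grep -cE 'AAT{2,}G')    # 2 or more T's
two_to_four=$(printf '%s\n' "$strings" | grep -cE 'AAT{2,4}G')  # 2 to 4 T's

echo "at most 4: $at_most_4, at least 2: $at_least_2, 2 to 4: $two_to_four"
```

The string with 5 T's is the only one excluded by the upper bound of 4.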

Curly braces require two changes to the way we run the search:

  • egrep - use “extended” regular expressions (the same as grep -E), in which curly braces act as repetition operators
  • quote the pattern - Curly braces have a special meaning in BASH, so we use single quotes (') to prevent the shell from interpreting them.
$ egrep -bi 'AAT{,4}G' output/*
output/F56H9.5.1.fa:24:catccatttatactattgcaccgaatattgggttaatgtcggtgtttgaa
output/F56H9.5.1.fa:75:tatattttggttacagtttaaatgcttcaaatttaaatcaattaaatc
output/F56H9.5.2.fa:24:ttaaatgcttcaaatttaaatcaattaaatc

We lost our R13A5.5.1.fa match. We can restore it by increasing the upper bound to 5 or higher.

$ egrep -bi 'AAT{,5}G' output/*
output/F56H9.5.1.fa:24:catccatttatactattgcaccgaatattgggttaatgtcggtgtttgaa
output/F56H9.5.1.fa:75:tatattttggttacagtttaaatgcttcaaatttaaatcaattaaatc
output/F56H9.5.2.fa:24:ttaaatgcttcaaatttaaatcaattaaatc
output/R13A5.5.1.fa:24:ttactaatttttgttatcttatcaaacaaatatattttccagc

Here is the full set of repetition operators in regular expressions.

$ man grep

…Scrolling way down…

   Repetition
       A regular expression may be followed by one of several repetition operators:
       ?      The preceding item is optional and matched at most once.
       *      The preceding item will be matched zero or more times.
       +      The preceding item will be matched one or more times.
       {n}    The preceding item is matched exactly n times.
       {n,}   The preceding item is matched n or more times.
       {,m}   The preceding item is matched at most m times.  This is a GNU extension.
       {n,m}  The preceding item is matched at least n times, but not more than m times.
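The operators not used earlier on this page (?, + and {n}) behave the same way. As an illustrative sketch on made-up strings, again assuming grep -E:

```shell
# Hypothetical strings with 0, 1 and 2 T's between AA and G.
strings='AAG
AATG
AATTG'

optional=$(printf '%s\n' "$strings" | grep -cE 'AAT?G')     # zero or one T
one_plus=$(printf '%s\n' "$strings" | grep -cE 'AAT+G')     # one or more T's
exactly_2=$(printf '%s\n' "$strings" | grep -cE 'AAT{2}G')  # exactly two T's

echo "?: $optional, +: $one_plus, {2}: $exactly_2"
```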
2018grep_motifs.1536600832.txt.gz · Last modified: 2018/09/10 11:33 by david