2018grep_motifs

Using regular expressions to find sequence motifs

Sequence motifs can contain degenerate symbols, meaning they match more than one option at a given position.

Example:

WGATAR - [A or T] G A T A [A or G]

This is very close to the regular expression you would use to search a sequence.

grep '[AT]GATA[AG]' sequence.fa

The brackets in regular expression syntax define a character set: a single position in the match can be any one of the characters inside the brackets. Every match for the above example therefore has the same length: 6 characters.

The degenerate symbols are from the IUPAC standard:

Symbol  Stands for    Character set
K       Keto group    [GT]
M       aMino group   [AC]
N       aNy           [ACGT]
R       puRine        [AG]
S       Strong        [CG]
W       Weak          [AT]
Y       pYrimidine    [CT]
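
The table can be applied mechanically. As a minimal sketch, the hypothetical helper below (iupac_to_regex is our own name, not a grep feature) uses sed to expand each symbol into its character set. It assumes the motif is typed in uppercase and uses only the symbols listed above.

```shell
# Hypothetical helper: expand IUPAC symbols into grep character sets.
# Assumes an uppercase motif containing only A, C, G, T and the symbols in the table.
iupac_to_regex() {
  printf '%s\n' "$1" | sed \
    -e 's/K/[GT]/g' \
    -e 's/M/[AC]/g' \
    -e 's/N/[ACGT]/g' \
    -e 's/R/[AG]/g' \
    -e 's/S/[CG]/g' \
    -e 's/W/[AT]/g' \
    -e 's/Y/[CT]/g'
}

iupac_to_regex WGATAR    # prints [AT]GATA[AG]
iupac_to_regex AATGY     # prints AATG[CT]
```

The result can be passed straight to grep, for example: grep "$(iupac_to_regex WGATAR)" sequence.fa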

It is convention to write these in uppercase…

…but our sequence output was in lowercase, so to use these patterns, we must either type them in lowercase or use the -i flag for “case insensitive”.

grep -i pattern file.fa

For example, we can get the same output from our previous command this way:

Grep for AATGY

$ grep -bi 'AATG[CT]' output/*
output/F56H9.5.1.fa:24:catccatttatactattgcaccgaatattgggttaatgtcggtgtttgaa
output/F56H9.5.1.fa:75:tatattttggttacagtttaaatgcttcaaatttaaatcaattaaatc
output/F56H9.5.2.fa:24:ttaaatgcttcaaatttaaatcaattaaatc
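
To see what -i changes, here is a small sketch that runs the same pattern with and without it on one of the lowercase sequence lines from the output above. The -c flag makes grep print the number of matching lines instead of the lines themselves.

```shell
# A lowercase sequence line copied from the output above.
seq='ttaaatgcttcaaatttaaatcaattaaatc'

# Without -i the uppercase pattern matches nothing, so the count is 0.
# grep exits with a non-zero status when nothing matches, hence "|| true".
printf '%s\n' "$seq" | grep -c 'AATG[CT]' || true

# With -i the same pattern matches the lowercase sequence, so the count is 1.
printf '%s\n' "$seq" | grep -ci 'AATG[CT]'
```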

Variable length patterns

Say we want to match any number of T's in AATG, such as AATTG, AATTTTTG, etc.

To match any number of repetitions of a character, follow it with an asterisk (*). "Any number" includes zero, so AAT*G also matches AAG.

$ grep -bi 'AAT*G' output/*
output/F56H9.5.1.fa:24:catccatttatactattgcaccgaatattgggttaatgtcggtgtttgaa
output/F56H9.5.1.fa:75:tatattttggttacagtttaaatgcttcaaatttaaatcaattaaatc
output/F56H9.5.2.fa:24:ttaaatgcttcaaatttaaatcaattaaatc
output/R13A5.5.1.fa:24:ttactaatttttgttatcttatcaaacaaatatattttccagc

This added a match from R13A5.5: aatttttg
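
When a pattern has a variable length, it helps to see exactly which text matched. A minimal sketch, assuming a grep that supports the -o flag (GNU and BSD grep both do), run on three made-up test strings rather than our real files:

```shell
# Three made-up test strings with zero, one, and five t's between "aa" and "g".
# -o prints only the matched text, one match per line.
printf 'aag\naatg\naatttttg\n' | grep -oi 'AAT*G'
# prints:
#   aag
#   aatg
#   aatttttg
```

All three strings match, including aag, because the asterisk allows zero t's.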

Of course, we may not want that many. That match has 5 t's, so let's set an upper bound of 4. The asterisk cannot express a bound; instead, a length range can be specified after any symbol or character set using curly braces.

{n,m} - example {2,4}
{n,}  - example {2,}
{,m}  - example {,4}
  1. The first example matches a repetition of 2-4 of the preceding symbol or character set.
  2. The second example matches a repetition of AT LEAST 2 of the preceding symbol or character set.
  3. The third example matches a repetition of AT MOST 4 of the preceding symbol or character set.

Two changes are needed to search with such a pattern:

  • egrep - use egrep (equivalent to grep -E), which supports “extended” regular expressions
  • quote the pattern - Curly braces have a special meaning in Bash, so we use single quotes (') to prevent the shell from interpreting them.
$ egrep -bi 'AAT{,4}G' output/*
output/F56H9.5.1.fa:24:catccatttatactattgcaccgaatattgggttaatgtcggtgtttgaa
output/F56H9.5.1.fa:75:tatattttggttacagtttaaatgcttcaaatttaaatcaattaaatc
output/F56H9.5.2.fa:24:ttaaatgcttcaaatttaaatcaattaaatc

We lost our R13A5.5.1.fa match. We can restore it by increasing the upper bound to 5 or higher.

$ egrep -bi 'AAT{,5}G' output/*
output/F56H9.5.1.fa:24:catccatttatactattgcaccgaatattgggttaatgtcggtgtttgaa
output/F56H9.5.1.fa:75:tatattttggttacagtttaaatgcttcaaatttaaatcaattaaatc
output/F56H9.5.2.fa:24:ttaaatgcttcaaatttaaatcaattaaatc
output/R13A5.5.1.fa:24:ttactaatttttgttatcttatcaaacaaatatattttccagc
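
The three range forms can be compared side by side on a few made-up test strings. This is only a sketch: it uses grep -E, which behaves like egrep, and writes the upper-bound-only form as {0,4}, which should behave the same as {,4} without relying on the GNU extension.

```shell
# Made-up test strings with 0, 1, 4, and 5 t's between "aa" and "g".
strings='aag
aatg
aattttg
aatttttg'

# {0,4}: at most 4 t's. Zero is allowed, so "aag" matches too.
printf '%s\n' "$strings" | grep -Ei 'AAT{0,4}G'    # aag, aatg, aattttg

# {1,4}: between 1 and 4 t's.
printf '%s\n' "$strings" | grep -Ei 'AAT{1,4}G'    # aatg, aattttg

# {2,}: at least 2 t's.
printf '%s\n' "$strings" | grep -Ei 'AAT{2,}G'     # aattttg, aatttttg
```

If zero t's should not count as a match, an explicit lower bound such as {1,4} is the safer choice.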

Here is the full set of repetition operators, as listed in the grep manual page.

$ man grep

…Scrolling way down…

   Repetition
       A regular expression may be followed by one of several repetition operators:
       ?      The preceding item is optional and matched at most once.
       *      The preceding item will be matched zero or more times.
       +      The preceding item will be matched one or more times.
       {n}    The preceding item is matched exactly n times.
       {n,}   The preceding item is matched n or more times.
       {,m}   The preceding item is matched at most m times.  This is a GNU extension.
       {n,m}  The preceding item is matched at least n times, but not more than m times.
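
The operators we have not tried yet can be checked the same way. A short sketch on made-up test strings, using grep -E so that ?, + and {n} are treated as operators:

```shell
# Made-up test strings with 0, 1, and 2 t's between "aa" and "g".
strings='aag
aatg
aattg'

# ? : the t is optional (zero or one).
printf '%s\n' "$strings" | grep -E 'aat?g'      # aag, aatg

# + : one or more t's.
printf '%s\n' "$strings" | grep -E 'aat+g'      # aatg, aattg

# {2} : exactly two t's.
printf '%s\n' "$strings" | grep -E 'aat{2}g'    # aattg
```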

Regular expressions are a general concept, but different commands may have different limitations on what they support. There are more operators, patterns, and capabilities of regular expressions, but like most things we've encountered, YOU MUST TEST each command to make sure it works as expected.
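
One way to do that testing is to try a pattern on short strings where we already know the answer before running it on real data. The helper below is a hypothetical sketch (matches is our own name); it assumes extended regular expressions and case-insensitive matching, as used above.

```shell
# Hypothetical helper: report whether a pattern matches a test string.
#   $1 = extended regular expression
#   $2 = test string
matches() {
  # -q suppresses output; only the exit status is used.
  if printf '%s\n' "$2" | grep -Eqi -e "$1"; then
    echo "match:    $2"
  else
    echo "no match: $2"
  fi
}

matches 'AAT{1,4}G' aattttg     # prints "match:    aattttg"
matches 'AAT{1,4}G' aatttttg    # prints "no match: aatttttg"
```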

2018grep_motifs.txt · Last modified: 2018/09/10 11:37 by david