2018grep_motifs


Using regular expressions to find sequence motifs

Sequence motifs can contain degenerate symbols, meaning they match more than one option at a given position.

Example:

WGATAR - [A or T] G A T A [A or G]

This is very close to the regular expression you would use to search a sequence.

grep '[AT]GATA[AG]' sequence.fa

The brackets in regular expression syntax define a character set: that position in the match can be any single character listed inside the brackets. Every match for the example above has the same length: 6 characters.
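
As a small check of the fixed match length, the sketch below runs the same pattern over a made-up sequence (not from the course data) and prints only the matched text. It assumes a grep that supports the -o flag, which GNU and BSD grep do.

```shell
# Made-up uppercase sequence, for illustration only.
seq='CCAGATAGTTTGATAACC'

# -o prints each match on its own line instead of the whole input line.
matches=$(printf '%s\n' "$seq" | grep -o '[AT]GATA[AG]')
printf '%s\n' "$matches"
# prints:
# AGATAG
# TGATAA
```

Both matches are exactly 6 characters long, because each bracketed set consumes one position.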

The degenerate symbols are from the IUPAC standard:

Symbol  Stands for   Character set
K       Keto group   [GT]
M       aMino group  [AC]
N       aNy          [ACGT]
R       puRine       [AG]
S       Strong       [CG]
W       Weak         [AT]
Y       pYrimidine   [CT]
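
The table can also be applied mechanically. The following is a minimal sketch, not part of the original exercise: iupac_to_regex is a hypothetical helper name, and it only handles the seven symbols listed above. It works because the replacement text contains only A, C, G, T and brackets, so one substitution never interferes with a later one.

```shell
# Hypothetical helper: expand IUPAC degenerate symbols into character sets.
# Only the seven symbols from the table above are handled; the motif is
# assumed to be uppercase.
iupac_to_regex() {
    printf '%s\n' "$1" | sed \
        -e 's/K/[GT]/g' \
        -e 's/M/[AC]/g' \
        -e 's/N/[ACGT]/g' \
        -e 's/R/[AG]/g' \
        -e 's/S/[CG]/g' \
        -e 's/W/[AT]/g' \
        -e 's/Y/[CT]/g'
}

iupac_to_regex WGATAR   # prints [AT]GATA[AG]
iupac_to_regex AATGY    # prints AATG[CT]
```

The result can be passed straight to grep, for example: grep "$(iupac_to_regex WGATAR)" sequence.fa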

By convention these symbols are written in uppercase, but our sequence output was in lowercase. To use these patterns, we must either type them in lowercase or use the -i flag for “case insensitive”.

grep -i pattern file.fa

For example, we can get the same output from our previous command this way:

Grep for AATGY

$ grep -bi 'AATG[CT]' output/*
output/F56H9.5.1.fa:24:catccatttatactattgcaccgaatattgggttaatgtcggtgtttgaa
output/F56H9.5.1.fa:75:tatattttggttacagtttaaatgcttcaaatttaaatcaattaaatc
output/F56H9.5.2.fa:24:ttaaatgcttcaaatttaaatcaattaaatc
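
The effect of -i can be checked on a small example. This sketch uses a made-up lowercase sequence (not from the output directory) and counts matching lines with and without the flag; the variable names are only for illustration.

```shell
# Made-up lowercase sequence, for illustration only.
seq='ttagataagc'

# grep -c prints the number of matching lines. It exits non-zero when
# nothing matches, so "|| true" keeps the script going in that case.
upper_hits=$(printf '%s\n' "$seq" | grep -c '[AT]GATA[AG]' || true)
insens_hits=$(printf '%s\n' "$seq" | grep -ci '[AT]GATA[AG]' || true)

echo "case-sensitive: $upper_hits, case-insensitive: $insens_hits"
# prints: case-sensitive: 0, case-insensitive: 1
```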

Variable length patterns

Say we want to match any number of T's in AATG, such as AATTG, AATTTTTG, etc.

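To match any number of repeats of a character, add an asterisk (*) after it. The asterisk means "zero or more", so AAT*G also matches AAG, which has no T at all.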

$ grep -bi 'AAT*G' output/*
output/F56H9.5.1.fa:24:catccatttatactattgcaccgaatattgggttaatgtcggtgtttgaa
output/F56H9.5.1.fa:75:tatattttggttacagtttaaatgcttcaaatttaaatcaattaaatc
output/F56H9.5.2.fa:24:ttaaatgcttcaaatttaaatcaattaaatc
output/R13A5.5.1.fa:24:ttactaatttttgttatcttatcaaacaaatatattttccagc
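
Because the asterisk also allows zero repeats, it can help to compare it with a pattern that requires at least one T. The sketch below uses three made-up sequences, one per line. Writing the character twice (TT*) is one portable way to say "one or more" in basic regular expressions; with grep -E the same thing can be written as T+.

```shell
# Three made-up sequences, one per line, for illustration only.
seqs='aag
aatg
aattttg'

# T* allows zero or more T, so all three lines match.
star_hits=$(printf '%s\n' "$seqs" | grep -ci 'AAT*G' || true)

# TT* requires at least one T, so "aag" no longer matches.
plus_hits=$(printf '%s\n' "$seqs" | grep -ci 'AATT*G' || true)

# Same requirement in extended regular expression syntax.
ere_hits=$(printf '%s\n' "$seqs" | grep -ciE 'AAT+G' || true)

echo "T*: $star_hits  TT*: $plus_hits  T+: $ere_hits"
# prints: T*: 3  TT*: 2  T+: 2
```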
2018grep_motifs.1536599892.txt.gz · Last modified: 2018/09/10 11:18 by david