Using regular expressions to find sequence motifs

Sequence motifs can contain degenerate symbols: a single symbol that matches more than one nucleotide at a given position.

Example:

WGATAR - [A or T] G A T A [A or G]

This is very close to the regular expression you would use to search a sequence.

grep '[AT]GATA[AG]' sequence.fa

Square brackets in regular expression syntax define a character set. A set matches exactly one character, which can be any of the characters listed inside the brackets. Every match for the example above is therefore the same length: 6 characters.
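To see the fixed match length in practice, here is a minimal sketch. It assumes a grep that supports the -o option (GNU and BSD grep both do), and the demo sequence is made up for illustration; it is not from the course data.

```shell
# Hypothetical demo file; the sequence is invented for this example.
tmp=$(mktemp)
printf '>demo\nCCTGATAAGGAGATAGTT\n' > "$tmp"

# -o prints only the matched text, one match per line,
# instead of the whole line that contains the match.
matches=$(grep -o '[AT]GATA[AG]' "$tmp")
printf '%s\n' "$matches"

rm -f "$tmp"
```

Each line printed is one match, and each is exactly 6 characters long because every bracketed set consumes a single character.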

The degenerate symbols are from the IUPAC standard. The two-base symbols and N are:

Symbol  Stands for   Character set
K       Keto group   [GT]
M       aMino group  [AC]
N       aNy          [ACGT]
R       puRine       [AG]
S       Strong       [CG]
W       Weak         [AT]
Y       pYrimidine   [CT]
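The table can also be applied mechanically. The sketch below defines a hypothetical helper, iupac_to_regex (the name is ours, not a standard tool), that uses sed to expand each symbol into its character set. It assumes the motif is typed in uppercase and only handles the symbols listed in the table above.

```shell
# Hypothetical helper: expand IUPAC symbols into bracketed character sets.
# Only K, M, N, R, S, W and Y are handled; A, C, G and T pass through unchanged.
# The replacements only introduce A, C, G, T and brackets, so applying them
# one after another cannot re-expand an earlier result.
iupac_to_regex() {
  printf '%s\n' "$1" | sed \
    -e 's/K/[GT]/g' \
    -e 's/M/[AC]/g' \
    -e 's/N/[ACGT]/g' \
    -e 's/R/[AG]/g' \
    -e 's/S/[CG]/g' \
    -e 's/W/[AT]/g' \
    -e 's/Y/[CT]/g'
}

iupac_to_regex WGATAR   # prints [AT]GATA[AG]
```

The result can be passed straight to grep, for example: grep "$(iupac_to_regex WGATAR)" sequence.fa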

By convention these symbols are written in uppercase, but our sequence output was in lowercase. To use these patterns, we must either type them in lowercase or use the -i flag for “case insensitive”.

grep -i pattern file.fa

For example, we can get the same output as our previous command this way (the -b flag additionally prints the byte offset of each matching line within its file):

$ grep -bi AATG output/*
output/F56H9.5.1.fa:24:catccatttatactattgcaccgaatattgggttaatgtcggtgtttgaa
output/F56H9.5.1.fa:75:tatattttggttacagtttaaatgcttcaaatttaaatcaattaaatc
output/F56H9.5.2.fa:24:ttaaatgcttcaaatttaaatcaattaaatc
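The effect of -i can be checked on a small file. This is a sketch on made-up data, not the course output: it counts matching lines with -c for an uppercase pattern, the same pattern with -i, and the pattern typed in lowercase.

```shell
# Hypothetical demo file with a lowercase sequence, invented for this example.
tmp=$(mktemp)
printf '>demo\nccttaatgtcgg\n' > "$tmp"

# -c prints the number of matching lines. grep exits non-zero when nothing
# matches, so "|| true" keeps the script going under "set -e".
upper=$(grep -c 'AATG' "$tmp" || true)    # uppercase pattern, lowercase data
insens=$(grep -ci 'AATG' "$tmp" || true)  # -i ignores case
lower=$(grep -c 'aatg' "$tmp" || true)    # pattern typed in lowercase

printf 'uppercase: %s, -i: %s, lowercase: %s\n' "$upper" "$insens" "$lower"
rm -f "$tmp"
```

The uppercase pattern without -i finds nothing in lowercase data, while the other two forms find the same line.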
2018grep_motifs.1536599602.txt.gz · Last modified: 2018/09/10 11:13 by david