SUBSTITUTIONS AND REGULAR EXPRESSIONS 🙈

Previous page: Sed, grep and awk 🙊

Now that we've learned how to acquire data and do some very basic tasks with that data, we'll discuss ways of manipulating data using substitutions and regular expressions.

Regular expressions

Regular expressions are character combinations that represent a particular pattern.

Most-used expressions

Expression Meaning
\t tab
\n new line
\s any white space (new line, space, etc.)
\d any digit 0-9
\w any alphanumeric character or underscore
. a wildcard character (pretty much anything but a new line)
* match the preceding character or pattern zero or more times (greedy and will match as much as possible; use *? to match as little as possible)
+ match the preceding character or pattern one or more times
\b word boundary
^ matches beginning of line or string
$ matches end of line or string
[0-9] a character class, any digit
[a-z] a character class, any lowercase letter
| match something on the left or something on the right, e.g. DNA|RNA (needs grep -E or egrep)
\\n the first backslash escapes the second, useful if you really want to search for the literal text \n and not a new line.

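*Be aware: You have to use egrep or grep -E for some of this to work. For historical reasons, grep doesn't support all of this functionality by default: it uses “Basic Regular Expressions” (BREs). You have to turn on “Extended Regular Expressions” (EREs) with -E (or egrep). Note that lowercase -e is a different option: it only marks the next argument as a pattern. Also, at least in GNU grep, \d and *? are Perl-style syntax and need grep -P rather than -E.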
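
To see the BRE/ERE difference for yourself, here is a small sketch. The three sample lines are made up for illustration and are not part of the course files.

```shell
# Three made-up lines to search (not from the course files).
sample=$(printf 'DNA polymerase\nRNA polymerase\nprotein kinase\n')

# BRE (plain grep): the | is taken literally, so nothing matches.
# "|| true" keeps a script going, because grep exits with status 1 on no match.
bre_hits=$(printf '%s\n' "$sample" | grep -c 'DNA|RNA' || true)

# ERE (grep -E): the | means "either side", so two lines match.
ere_hits=$(printf '%s\n' "$sample" | grep -E -c 'DNA|RNA')

echo "BRE matches: $bre_hits, ERE matches: $ere_hits"
```

The -c option makes grep print the number of matching lines instead of the lines themselves, which is handy for quick checks like this one.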

Substitutions using regular expressions in sed

We've looked at some examples using sed in the past. Let's build on that experience by using regular expressions.

Exercise: You already worked with a file called mini.gff in PIPES. I've modified it for this example, so copy and paste the text below into a file by that name. Don't forget to copy everything: you have to scroll right in the window as you are selecting with the mouse.

# A tester gff file.                                
# For testing pipes.                                
chrV	sacCer3_ensGene	CDS	574807	575379	0.000000	-	0	gene_id "YER190C-A"; transcript_id "YER190C-A";
chrII	sacCer3_ensGene	CDS	805038	805256	0.000000	-	0	gene_id "YBR298C-A"; transcript_id "YBR298C-A"; CHR
chrV	sacCer3_ensGene	start_codon	575377	575379	0.000000	-	.	gene_id "YER190C-A"; transcript_id "YER190C-A";
chrII	sacCer3_ensGene	start_codon	805254	805256	0.000000	-	.	gene_id "YBR298C-A"; transcript_id "YBR298C-A";
chrII	sacCer3_ensGene	exon	805035	805256	0.000000	-	.	gene_id "YBR298C-A"; transcript_id "YBR298C-A";
chrIII	sacCer3_ensGene	exon	309070	310155	0.000000	+	.	gene_id "YCR105W"; transcript_id "YCR105W";
CHRII	sacCer3_ensGene	start_codon	805351	805353	0.000000	+	.	gene_id "YBR299W"; transcript_id "YBR299W"; CHR
CHRIII	sacCer3_ensGene	start_codon	310958	310960	0.000000	+	.	gene_id "YCR106W"; transcript_id "YCR106W";
chrV	sacCer3_ensGene	exon	574804	575379	0.000000	-	.	gene_id "YER190C-A"; transcript_id "YER190C-A";
chrV	sacCer3_ensGene	stop_codon	575680	575682	0.000000	-	.	gene_id "YER190C-B"; transcript_id "YER190C-B";

Step 1. Remove comments.

Let's filter out the comment lines:

$ grep -v '#' < mini.gff > mini2.gff
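
The pattern '#' removes any line that contains a # anywhere. That is fine for mini.gff, but as a hedged sketch, here is how anchoring the pattern with ^ would protect data lines that happen to contain a #. The file demo.gff and its contents are made up for this illustration.

```shell
work=$(mktemp -d)
# Made-up data: one comment line, and one data line with a '#' inside a field.
printf '# header comment\nchrI\tnote#1\nchrII\tplain\n' > "$work/demo.gff"

# Unanchored: drops every line containing '#', including the data line.
unanchored=$(grep -c -v '#' "$work/demo.gff")
# Anchored: drops only lines that start with '#'.
anchored=$(grep -c -v '^#' "$work/demo.gff")

echo "kept without anchor: $unanchored, kept with anchor: $anchored"
rm -r "$work"
```

With -c and -v together, grep counts the lines that do not match, that is, the lines that would be kept.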

Using diff

Let's use the diff command to see what changed between the two files.

$ diff mini.gff mini2.gff
1,2d0
< # A tester gff file.                                
< # For testing pipes.                                

The output says 1,2d0, which means that lines 1 and 2 of the first file are deleted relative to the second. The 0 after the d is the line number in the second file where the deletion applies. Line 0 means the top of the file.

It then shows you the deleted lines, prefixed with a left arrow (less-than sign) to say they are present in the first argument file.

Now reverse the order of the files that you give diff, so that mini2.gff is the first argument, and mini.gff is the second.

$ diff mini2.gff mini.gff
0a1,2
> # A tester gff file.                                
> # For testing pipes.   

This time, it says 0a1,2: the second argument has two additional lines, and it shows you those lines with the right-pointing arrow. This means those lines are present in the second argument file.
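
If you want to play with diff on something smaller, the following sketch builds two tiny throwaway files (the names before.txt and after.txt are made up) and runs diff in both directions.

```shell
work=$(mktemp -d)
printf '# comment\nline one\nline two\n' > "$work/before.txt"
grep -v '^#' "$work/before.txt" > "$work/after.txt"

# diff exits with status 1 when the files differ,
# so "|| true" keeps a script going in that case.
forward=$(diff "$work/before.txt" "$work/after.txt" || true)
reverse=$(diff "$work/after.txt" "$work/before.txt" || true)

printf 'forward:\n%s\nreverse:\n%s\n' "$forward" "$reverse"
rm -r "$work"
```

The forward direction should report a deletion (d) and the reverse direction an addition (a), just like the mini.gff example but with a single line.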

Let's continue on with the exercise, using mini2.gff as the input file.

Step 2. Replace the capital 'CHR's

Our task at hand is to make this file compatible with other programs that expect GFF format. However, there are some issues in the first column. Our chromosomes need to have Arabic numerals (1, 2, 3, etc.) instead of Roman numerals. Also, the abbreviation for chromosome (chr) must be in lowercase.

Let's deal with the uppercase 'CHR's first.

We'll want to use the substitute command 's' of sed to make the change, and save the result to mini3.gff. Note the closing slash after the replacement; sed reports an error without it.

$ sed 's/CHR/chr/' mini2.gff > mini3.gff

Now let's look at the changes using diff again.

$ diff mini2.gff mini3.gff
2c2
< chrII	sacCer3_ensGene	CDS	805038	805256	0.000000	-	0	gene_id "YBR298C-A"; transcript_id "YBR298C-A"; CHR
---
> chrII	sacCer3_ensGene	CDS	805038	805256	0.000000	-	0	gene_id "YBR298C-A"; transcript_id "YBR298C-A"; chr
7,8c7,8
< CHRII	sacCer3_ensGene	start_codon	805351	805353	0.000000	+	.	gene_id "YBR299W"; transcript_id "YBR299W"; CHR
< CHRIII	sacCer3_ensGene	start_codon	310958	310960	0.000000	+	.	gene_id "YCR106W"; transcript_id "YCR106W";
---
> chrII	sacCer3_ensGene	start_codon	805351	805353	0.000000	+	.	gene_id "YBR299W"; transcript_id "YBR299W"; CHR
> chrIII	sacCer3_ensGene	start_codon	310958	310960	0.000000	+	.	gene_id "YCR106W"; transcript_id "YCR106W";

The first bit: 2c2 tells me line 2 in the first file was changed to line 2 in the second file. It then shows the contents of the left (mini2.gff) versus the right (mini3.gff) files.

The second bit: 7,8c7,8 says that lines 7 and 8 in the first file were changed to lines 7 and 8 in the second file.

The lines are too long to show on this wiki without scrolling, but you should see in your output that the ends of the lines are different.

The changes at the start of lines 7 and 8 are what we expect, but I put a CHR on the end of line 2 to show you that there has been an unintended edit.

We want to make sure that only the “CHR” at the beginning of a line is changed. Look at the table above to see which expression we might use to accomplish that using mini2.gff. Save the output in mini4.gff.

Using diff to verify, you should see only lines 7 and 8 changed.

$ diff mini2.gff mini4.gff
7,8c7,8
< CHRII	sacCer3_ensGene	start_codon	805351	805353	0.000000	+	.	gene_id "YBR299W"; transcript_id "YBR299W"; CHR
< CHRIII	sacCer3_ensGene	start_codon	310958	310960	0.000000	+	.	gene_id "YCR106W"; transcript_id "YCR106W";
---
> chrII	sacCer3_ensGene	start_codon	805351	805353	0.000000	+	.	gene_id "YBR299W"; transcript_id "YBR299W"; CHR
> chrIII	sacCer3_ensGene	start_codon	310958	310960	0.000000	+	.	gene_id "YCR106W"; transcript_id "YCR106W";

The trick is that you have to specify a pattern that is specific for your target that you want to change, otherwise you'll end up having unintended consequences, and that's just embarrassing.
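
In case you get stuck, here is one possible way to do it, shown as a sketch on two made-up lines in the style of mini2.gff (the fields are shortened so they fit on the page).

```shell
# Two made-up lines in the style of mini2.gff, each with a trailing CHR.
demo=$(printf 'chrII\tCDS\tCHR\nCHRIII\tstart_codon\tCHR\n')

# ^ anchors the match to the start of the line,
# so a CHR anywhere else on the line is left alone.
fixed=$(printf '%s\n' "$demo" | sed 's/^CHR/chr/')

printf '%s\n' "$fixed"
```

Only the second line changes, and only at its beginning; both trailing CHRs survive.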

Step 3. Replace the Roman numerals.

Try the following on the command line, but don't save it as mini5.gff yet.

$ sed 's/^chrI/chr1/' mini4.gff

I only want to make one change (I → 1), but to specify it uniquely I use ^chrI, which says: only change the “chrI” at the beginning of the line. Since “chr” is part of the pattern, I have to put it back in the replacement.

The next problem is that, besides the appropriate targets, I've also changed ^chrII and ^chrIII into chr1I and chr1II, which is total garbage.

How can we isolate the right things to change?
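
One possible answer, as a sketch: the first column ends with a tab, so including that tab in the pattern pins down the whole numeral. The three toy lines and the TAB variable below are my own for illustration; they are not taken from mini4.gff.

```shell
# A literal tab character, kept in a variable so it is visible in the commands.
TAB=$(printf '\t')
demo=$(printf 'chrI\texon\nchrII\texon\nchrIII\texon\n')

# The naive pattern also hits the start of chrII and chrIII.
naive=$(printf '%s\n' "$demo" | sed 's/^chrI/chr1/')

# With the tab included, chrI can no longer match chrII or chrIII.
# Like "chr", the tab is part of the pattern, so it goes back in the replacement.
converted=$(printf '%s\n' "$demo" |
  sed -e "s/^chrI${TAB}/chr1${TAB}/" \
      -e "s/^chrII${TAB}/chr2${TAB}/" \
      -e "s/^chrIII${TAB}/chr3${TAB}/")

printf 'naive:\n%s\nconverted:\n%s\n' "$naive" "$converted"
```

Each -e adds one more substitution to the same sed call. The script uses double quotes so that the shell fills in the TAB variable before sed sees the command.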

Virtual restriction digest

Save the following file:

> plasmid_fake
CCTGTATCAGGACAATGATACATTTTTGGGTAGTGCTATCATCTGGAGAT
GCTTATCATATCAGTTGATCTTTTACACTTCCTTGTTCGGTTGCATCTAG
TTGCATTTTCCAGAATCTGTTTTGCAAGGCAAAGCCCACTAACATGCAGT
GACTAAAACAAAACAAGCAAAGCCAATAAATCAAACAAAAATGTTTGGGG
AAAAGCTACGAACAACTCTCATCGCAGAGCCTGTAACTGGTCAAAAATAG
AATTTTGCTCCCATATCGCAGGTGTCTTCTTATCGCCTACGCTGTAGCCT
GATCACCAATGATCCTCACTTTTGCTATGGCCTTATCACCCATCCAGTTC
CTTATCACTGTGCTCACCAACGTCCGCGTTCCCTATCATCAGCTCCTTTT
ATCACTACAGCGTTGTACCTGATAAGCTTTTAATGTTGCTCGCAGGCTGT
CCAACTTGCTTATCTCTAGAGCTTATTTTGCTCTCGAGACCCCGACTGCT
CTTTCTGTTTGTGATGCTATCTCCCAAGACCCTTATCGTCCTTTTCAGGC
GTCGCTGCGTCTAAACCATATTGCTTGCGTTTTAGGCACCCGCTCTGTCC
AAAGTTATAAGCCATACCCTTATCAACAGCAACGCTCTATCACCATTATC
GCACGGGCTCTTTATCACCAACACTTTTTATTGTTTTAACTCTCCAATCA
CCCACAAAACCTTATCAGGTTGTCTTCTATTGTCCTCTTCTCTTTTATCA
CCATGTTACAGTACTACTTGCCATATCAGCTGGCAATTTACCGCCCTCAA
ACCTTATCAGCTCGTCCCCTAATTTTATCAATTACTTCTTCTATCAGTAG
ACTCTTCCTACTCCGCAATCATTTTCGAAAAAGGAAGACCATTAAATCGA
GTTGTCAATTTTCTCTTCATAAAATTGAGATTATCATCCACGAATATGGC
AAATAAAAAAGTTATTATGAAACGATAGAAGGTGAAATATCTTTGTCACT

Save it as fake.fasta. Now let's use sed to make a restriction digest.

$ grep -v '^>' fake.fasta | sed 's/TCGTCC/\n/g'
CCTGTATCAGGACAATGATACATTTTTGGGTAGTGCTATCATCTGGAGAT
GCTTATCATATCAGTTGATCTTTTACACTTCCTTGTTCGGTTGCATCTAG
TTGCATTTTCCAGAATCTGTTTTGCAAGGCAAAGCCCACTAACATGCAGT
GACTAAAACAAAACAAGCAAAGCCAATAAATCAAACAAAAATGTTTGGGG
AAAAGCTACGAACAACTCTCATCGCAGAGCCTGTAACTGGTCAAAAATAG
AATTTTGCTCCCATATCGCAGGTGTCTTCTTATCGCCTACGCTGTAGCCT
GATCACCAATGATCCTCACTTTTGCTATGGCCTTATCACCCATCCAGTTC
CTTATCACTGTGCTCACCAACGTCCGCGTTCCCTATCATCAGCTCCTTTT
ATCACTACAGCGTTGTACCTGATAAGCTTTTAATGTTGCTCGCAGGCTGT
CCAACTTGCTTATCTCTAGAGCTTATTTTGCTCTCGAGACCCCGACTGCT
CTTTCTGTTTGTGATGCTATCTCCCAAGACCCTTA
TTTTCAGGC
GTCGCTGCGTCTAAACCATATTGCTTGCGTTTTAGGCACCCGCTCTGTCC
AAAGTTATAAGCCATACCCTTATCAACAGCAACGCTCTATCACCATTATC
GCACGGGCTCTTTATCACCAACACTTTTTATTGTTTTAACTCTCCAATCA
CCCACAAAACCTTATCAGGTTGTCTTCTATTGTCCTCTTCTCTTTTATCA
CCATGTTACAGTACTACTTGCCATATCAGCTGGCAATTTACCGCCCTCAA
ACCTTATCAGC
CCTAATTTTATCAATTACTTCTTCTATCAGTAG
ACTCTTCCTACTCCGCAATCATTTTCGAAAAAGGAAGACCATTAAATCGA
GTTGTCAATTTTCTCTTCATAAAATTGAGATTATCATCCACGAATATGGC
AAATAAAAAAGTTATTATGAAACGATAGAAGGTGAAATATCTTTGTCACT

How about a more meaningful example, where the cut leaves a sticky end instead of deleting the target sequence?

$ grep -v '^>' fake.fasta | sed 's/GCTCTATC/GC\nTCTATC/g'
CCTGTATCAGGACAATGATACATTTTTGGGTAGTGCTATCATCTGGAGAT
GCTTATCATATCAGTTGATCTTTTACACTTCCTTGTTCGGTTGCATCTAG
TTGCATTTTCCAGAATCTGTTTTGCAAGGCAAAGCCCACTAACATGCAGT
GACTAAAACAAAACAAGCAAAGCCAATAAATCAAACAAAAATGTTTGGGG
AAAAGCTACGAACAACTCTCATCGCAGAGCCTGTAACTGGTCAAAAATAG
AATTTTGCTCCCATATCGCAGGTGTCTTCTTATCGCCTACGCTGTAGCCT
GATCACCAATGATCCTCACTTTTGCTATGGCCTTATCACCCATCCAGTTC
CTTATCACTGTGCTCACCAACGTCCGCGTTCCCTATCATCAGCTCCTTTT
ATCACTACAGCGTTGTACCTGATAAGCTTTTAATGTTGCTCGCAGGCTGT
CCAACTTGCTTATCTCTAGAGCTTATTTTGCTCTCGAGACCCCGACTGCT
CTTTCTGTTTGTGATGCTATCTCCCAAGACCCTTATCGTCCTTTTCAGGC
GTCGCTGCGTCTAAACCATATTGCTTGCGTTTTAGGCACCCGCTCTGTCC
AAAGTTATAAGCCATACCCTTATCAACAGCAACGC
TCTATCACCATTATC
GCACGGGCTCTTTATCACCAACACTTTTTATTGTTTTAACTCTCCAATCA
CCCACAAAACCTTATCAGGTTGTCTTCTATTGTCCTCTTCTCTTTTATCA
CCATGTTACAGTACTACTTGCCATATCAGCTGGCAATTTACCGCCCTCAA
ACCTTATCAGCTCGTCCCCTAATTTTATCAATTACTTCTTCTATCAGTAG
ACTCTTCCTACTCCGCAATCATTTTCGAAAAAGGAAGACCATTAAATCGA
GTTGTCAATTTTCTCTTCATAAAATTGAGATTATCATCCACGAATATGGC
AAATAAAAAAGTTATTATGAAACGATAGAAGGTGAAATATCTTTGTCACT

Notice that I've put the pattern text back in the replacement, but added a newline (\n) inside the sequence after the first GC.
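
One caveat worth a sketch: sed works line by line, so a recognition site that is split across two lines of the FASTA file would be missed. Joining the sequence lines first with tr is one way around that. The toy sequence below is made up so that the site straddles a line break, and the \n in the replacement assumes GNU sed, as in the examples above.

```shell
work=$(mktemp -d)
# Made-up toy sequence: the site GCTCTATC is split across two lines.
printf '>toy\nAAAAGCTC\nTATCTTTT\n' > "$work/toy.fasta"

# Line by line, sed never sees the whole site, so nothing is cut.
per_line=$(grep -v '^>' "$work/toy.fasta" | sed 's/GCTCTATC/GC\nTCTATC/g')

# Deleting the newlines first (tr -d) lets the site match.
joined=$(grep -v '^>' "$work/toy.fasta" | tr -d '\n' | sed 's/GCTCTATC/GC\nTCTATC/g')

printf 'per line:\n%s\njoined:\n%s\n' "$per_line" "$joined"
rm -r "$work"
```

After joining, every line of output is one fragment, which also makes the fragments easier to count or measure.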

wiki/2018substitutions.txt · Last modified: 2018/08/31 10:46 by david