Sed (stream editor) is a useful tool for making substitutions. It has other functions too, but we'll focus on a common application of sed in bioinformatics: modifying file formats with substitutions.
Sed takes a short, one-line instruction as its argument and applies that instruction either to a file argument or to standard input (stdin).
$ sed 's/pattern/replacement/' filename
The parts of this instruction are separated by forward slashes (/): the s (substitute) command, the pattern to search for, and the replacement text.
The replacement portion always ends with a forward slash, but can then be followed by further options (examples below).
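As one illustration of such an option, the g (global) flag is probably the most commonly used. This is a minimal sketch on a made-up input string; the variable names are only for illustration.

```shell
# Without a flag, only the first match on each line is replaced.
first_only=$(echo "II II II" | sed 's/II/chr2/')

# With the g (global) flag, every match on the line is replaced.
all_matches=$(echo "II II II" | sed 's/II/chr2/g')

echo "$first_only"    # chr2 II II
echo "$all_matches"   # chr2 chr2 chr2
```

Leaving out g is often what you want for column-based formats, where only the first occurrence on the line (the chromosome column) should change.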
Sed can take its input from a pipe instead of a file, as in:
$ cmd | sed 's/pattern/replacement/'
Here is part of a WormBase gene annotation in GFF (tabs removed), piped into sed in order to change the chromosome name into a BED-compatible format.
$ echo "IV WormBase stop_codon 13382143 13382145" | sed 's/IV/chr4/'
chr4 WormBase stop_codon 13382143 13382145
You could run the same command on a whole GFF format file:
$ sed 's/IV/chr4/' wormbase.gff > step1.gff
This one statement doesn't make it BED (UCSC) format, however, so I save it as a GFF file. The next step is to translate all of the roman numeral chromosomes into the “chr” format. I don't really know the full contents of the file, but I know that the chromosome is always the first column.
To see which chromosome names are present, I execute the following command:
$ cut -f1 wormbase.gff | sort -u
#!genebuild-version WS263
I
II
III
IV
MtDNA
V
X
Note: The command sort -u is equivalent to sort | uniq.
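A quick way to convince yourself of this equivalence is to run both forms on the same input and compare. The sketch below uses a small made-up list of chromosome names rather than the real GFF file.

```shell
# A made-up list of chromosome names with duplicates.
chroms=$(printf 'IV\nI\nIV\nX\nI\n')

# Sort and de-duplicate in one step...
with_flag=$(printf '%s\n' "$chroms" | sort -u)

# ...or sort first, then drop adjacent duplicates with uniq.
with_uniq=$(printf '%s\n' "$chroms" | sort | uniq)

if [ "$with_flag" = "$with_uniq" ]; then
    echo "same output"
fi
```

Note that uniq only removes adjacent duplicates, which is why the sort has to come first.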
The output tells us there is a comment, plus five roman numerals (I-V), plus X (not the roman numeral for 10), and MtDNA.
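Translating each of these names takes one substitution apiece, which seems like a long string of pipes!
The order of the substitutions matters, however, because we need to avoid the following: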
$ echo "IV WormBase stop_codon 13382143 13382145" | sed 's/I/chr1/'
chr1V WormBase stop_codon 13382143 13382145
Since “IV” is matched by the pattern “I”, it makes sense to transform it first.
$ grep -v '^#' wormbase.gff \
    | sed 's/IV/chr4/' \
    | sed 's/V/chr5/' \
    | sed 's/III/chr3/' \
    | sed 's/II/chr2/' \
    | sed 's/I/chr1/' \
    | sed 's/X/chrX/' \
    | sed 's/MtDNA/chrM/' > wormbase_chrs.gff
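One caveat with unanchored patterns is that a pattern such as I or X can also match text later in the line (in an attribute column, for instance) when the first column does not match. The following is a hedged sketch of an alternative, not the pipeline used above: it anchors each pattern to the start of the line with ^ and requires a tab right after the name, so only the first column can change and the order of the expressions no longer matters. The function name rename_chroms is hypothetical, and the sketch assumes a tab-delimited GFF file.

```shell
# Hypothetical helper: reads GFF on stdin, writes the renamed GFF to stdout.
rename_chroms() {
    # A literal tab, kept in a variable so the patterns stay portable
    # (not every sed understands \t).
    tab=$(printf '\t')

    # Drop comment lines, then rename the first column only.
    # ^ anchors the match to the start of the line, and the trailing tab
    # ensures the whole column matches (so ^I cannot match IV or III).
    grep -v '^#' | sed \
        -e "s/^IV${tab}/chr4${tab}/" \
        -e "s/^III${tab}/chr3${tab}/" \
        -e "s/^II${tab}/chr2${tab}/" \
        -e "s/^I${tab}/chr1${tab}/" \
        -e "s/^V${tab}/chr5${tab}/" \
        -e "s/^X${tab}/chrX${tab}/" \
        -e "s/^MtDNA${tab}/chrM${tab}/"
}

# Usage (file names as in the text above):
#   rename_chroms < wormbase.gff > wormbase_chrs.gff
```

Passing several -e expressions to one sed call also avoids starting a separate sed process for every chromosome name.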