Sed

Sed (stream editor) is a useful tool for making substitutions. It has other functions too, but we'll focus on a common application of sed in bioinformatics: modifying file formats with substitutions.

USAGE

Sed takes a short, one-line instruction as its argument and applies that instruction to either a file argument or standard input (stdin).

Instruction s - substitute

$ sed 's/pattern/replacement/' filename

The parts of this instruction are separated by forward slashes (/).

  • s - substitute
  • pattern - an exact string, or a regular expression
  • replacement - an exact string that is substituted for the text matched by pattern

The replacement portion always ends with a forward slash, but can then be followed by further options, such as g to replace every match on a line rather than only the first.
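
A minimal sketch of one such option, the g (global) flag, which replaces every match on a line instead of only the first. The input text and the variable names are made up for illustration.

```shell
# Without a flag, only the first match on the line is replaced.
first_only=$(echo "II II II" | sed 's/II/chr2/')
echo "$first_only"     # chr2 II II

# With g after the final slash, every match on the line is replaced.
all_matches=$(echo "II II II" | sed 's/II/chr2/g')
echo "$all_matches"    # chr2 chr2 chr2
```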

Sed can take its input from a pipe instead of a file, as in:

$ cmd | sed 's/pattern/replacement/'

Examples

Modifying file format

Here is part of a WormBase gene annotation in GFF (tabs removed), piped into sed in order to change the chromosome name into a BED-compatible format.

$ echo "IV WormBase stop_codon 13382143 13382145" | sed 's/IV/chr4/'
chr4 WormBase stop_codon 13382143 13382145

You could run the same command on a whole GFF format file:

$ sed 's/IV/chr4/' wormbase.gff > step1.gff

This one statement doesn't make it BED (UCSC) format, however, so I save it as a GFF file. Next, I want to translate all of the Roman numeral chromosomes into the “chr” format. I don't really know the full contents of the file, but I know that the chromosome is always the first column.

To list the distinct values in that column, I execute the following command:

$ cut -f1 wormbase.gff | sort -u
#!genebuild-version WS263
I
II
III
IV
MtDNA
V
X

Note: The command sort -u is equivalent to sort | uniq.
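
A quick way to convince yourself of that equivalence, sketched on a made-up list of chromosome names (the variable names are only for illustration):

```shell
# A made-up list of chromosome names with duplicates, one per line.
chroms=$(printf 'IV\nI\nIV\nX\nI\n')

# Sort and drop duplicates in one step.
with_flag=$(echo "$chroms" | sort -u)

# Sort first, then let uniq drop adjacent duplicates.
with_pipe=$(echo "$chroms" | sort | uniq)

echo "$with_flag"
```

Both variables should hold the same three lines: I, IV, and X.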

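The output tells us there is a comment line, plus five Roman numerals (I-V), plus X (the X chromosome, not the Roman numeral for 10), and MtDNA.

Converting all of these takes one substitution per chromosome name, which makes for a long string of pipes. The order of the substitutions matters, because we need to avoid the following: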

$ echo "IV WormBase stop_codon 13382143 13382145" | sed 's/I/chr1/'
chr1V WormBase stop_codon 13382143 13382145

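Since “IV” is matched by the pattern “I”, it makes sense to transform it first; the same goes for “III” and “II”. Each pattern below starts with ^, which anchors the match to the start of the line, so that only the chromosome column is changed and not a stray “I”, “V”, or “X” later in the line.

$ grep -v '^#' wormbase.gff \
| sed 's/^IV/chr4/' \
| sed 's/^V/chr5/' \
| sed 's/^III/chr3/' \
| sed 's/^II/chr2/' \
| sed 's/^I/chr1/' \
| sed 's/^X/chrX/' \
| sed 's/^MtDNA/chrM/' > wormbase_chrs.gff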
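
As a rough sketch, the same chain can also be written as a single sed invocation, with each instruction passed through the standard -e option, and then checked by listing the first column again. The sample lines are made up and stand in for wormbase.gff. The ^ anchors (match only at the start of a line) and the instruction for chromosome V are assumptions added here so that every name from the column listing is covered.

```shell
# Made-up, tab-separated sample lines standing in for wormbase.gff.
sample=$(printf '#!genebuild-version WS263\nI\tWormBase\tgene\t1\t10\nIV\tWormBase\tgene\t1\t10\nV\tWormBase\tgene\t1\t10\nMtDNA\tWormBase\tgene\t1\t10\n')

# Drop comment lines, then apply the instructions in order to each line.
# Longer numerals come before shorter ones so "I" never clobbers "IV".
converted=$(echo "$sample" \
  | grep -v '^#' \
  | sed -e 's/^IV/chr4/' \
        -e 's/^V/chr5/' \
        -e 's/^III/chr3/' \
        -e 's/^II/chr2/' \
        -e 's/^I/chr1/' \
        -e 's/^X/chrX/' \
        -e 's/^MtDNA/chrM/')

# List the distinct first-column values to check the result.
chrom_names=$(echo "$converted" | cut -f1 | sort -u)
echo "$chrom_names"
```

For this sample the check should list chr1, chr4, chr5, and chrM, with no Roman numerals left over.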
wiki/2018sed.txt · Last modified: 2018/09/07 16:33 by david