Now that we know what piping is, we can discover some new functionalities of Linux. Let's learn how to pipe the following commands:
sort - sort lines in a file
uniq - find unique (or duplicated) lines in a pre-sorted file
tee - copy what it reads on stdin to stdout and to one or more files
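Exercise: You probably already have the file called mini.gff. If you don't, copy and paste the text below into a file by that name. GFF columns are separated by tabs, and cut -f relies on them later, so check that the tabs survive the paste.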
# A tester gff file.
# For testing pipes.
chrV	sacCer3_ensGene	CDS	574807	575379	0.000000	-	0	gene_id "YER190C-A"; transcript_id "YER190C-A";
chrII	sacCer3_ensGene	CDS	805038	805256	0.000000	-	0	gene_id "YBR298C-A"; transcript_id "YBR298C-A";
chrV	sacCer3_ensGene	start_codon	575377	575379	0.000000	-	.	gene_id "YER190C-A"; transcript_id "YER190C-A";
chrII	sacCer3_ensGene	start_codon	805254	805256	0.000000	-	.	gene_id "YBR298C-A"; transcript_id "YBR298C-A";
chrII	sacCer3_ensGene	exon	805035	805256	0.000000	-	.	gene_id "YBR298C-A"; transcript_id "YBR298C-A";
chrIII	sacCer3_ensGene	exon	309070	310155	0.000000	+	.	gene_id "YCR105W"; transcript_id "YCR105W";
CHRII	sacCer3_ensGene	start_codon	805351	805353	0.000000	+	.	gene_id "YBR299W"; transcript_id "YBR299W";
CHRIII	sacCer3_ensGene	start_codon	310958	310960	0.000000	+	.	gene_id "YCR106W"; transcript_id "YCR106W";
chrV	sacCer3_ensGene	exon	574804	575379	0.000000	-	.	gene_id "YER190C-A"; transcript_id "YER190C-A";
chrV	sacCer3_ensGene	stop_codon	575680	575682	0.000000	-	.	gene_id "YER190C-B"; transcript_id "YER190C-B";
We can use sort to sort a file's lines into a new order…
sort usage:
sort [options] <file.txt> …
Exercise: Sort the mini.gff file:
$ sort mini.gff
Exercise: Read the sort man pages to figure out how you would…
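As a rough illustration of the kind of options the man page describes, the sketch below builds a small throwaway file and sorts it a few different ways. The file name demo_sort.txt and its contents are made up for this example; -k, -n, and -r are standard sort options.

```shell
# Build a small two-column file (hypothetical demo data).
printf 'chrV 575379\nchrII 805256\nchrIII 310155\n' > demo_sort.txt

# Default: compare whole lines as text.
sort demo_sort.txt

# -k2,2 restricts the sort key to column 2; n compares it as a number.
sort -k2,2n demo_sort.txt

# Adding r reverses the order, so the largest number comes first.
sort -k2,2nr demo_sort.txt
```

The same flags should work on mini.gff, for example to order records by their start coordinate in column 4.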
We can identify unique (or duplicated) lines in a pre-sorted file using the command uniq.
uniq usage:
uniq [options] <sortedFile.txt>
Because uniq only compares adjacent lines, its input needs to be sorted first. We have two options. We can do the process in two steps:
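A sketch of what those two steps might look like, with mini.sorted.gff as a hypothetical name for the intermediate file. The first line only creates stand-in data when mini.gff is missing, so the sketch can run on its own.

```shell
# Stand-in data so this sketch runs on its own; skipped if mini.gff exists.
[ -f mini.gff ] || printf 'chrV\texon\nchrII\tCDS\nchrV\texon\n' > mini.gff

# Step 1: write the sorted lines to an intermediate file (hypothetical name).
sort mini.gff > mini.sorted.gff

# Step 2: run uniq on the sorted file.
uniq mini.sorted.gff
```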
OR, we can use the pipe operator to chain the two commands together:
$ sort mini.gff | uniq
That wasn't very interesting. What if we do this just for the first column…
$ cut -f 1 mini.gff
$ cut -f 1 mini.gff | sort
$ cut -f 1 mini.gff | sort | uniq
Quick tip: To find the duplicated lines, use -d as an option for uniq.
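To see why the sorting step matters and what -d reports, here is a minimal sketch on made-up data. The file demo_chroms.txt is hypothetical; -d and -c are standard uniq options.

```shell
# Hypothetical demo data: one chromosome name per line, unsorted.
printf 'chrV\nchrII\nchrV\nchrIII\nchrII\nchrV\n' > demo_chroms.txt

# Without sorting, uniq removes nothing here: no duplicates are adjacent.
uniq demo_chroms.txt

# After sorting, -d prints one copy of each line that occurs more than once.
sort demo_chroms.txt | uniq -d

# -c prefixes every distinct line with the number of times it occurs.
sort demo_chroms.txt | uniq -c
```

With this data, the -d pipeline prints chrII and chrV, and -c reports counts of 2, 1, and 3 for chrII, chrIII, and chrV.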
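Common pitfall: Pipes are fun, but pipes can be problematic with large files. A pipe streams data and does not itself cap how much can pass through, but a command such as sort has to read all of its input before it can print anything, so a long pipeline can use a lot of memory or scratch disk space, and every rerun repeats every step. In these cases, creating a temp file (sometimes written as file.tmp) is preferable, because the intermediate result can be inspected and reused.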
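One way to work with an intermediate file is sketched below. It assumes the mktemp utility is available (it is on most Linux systems); the demo_* file names and the data are hypothetical.

```shell
# Hypothetical demo data standing in for a large file.
printf 'b\na\nb\nc\na\n' > demo_big.txt

# Create a uniquely named temp file and remember its path.
tmp=$(mktemp)

# Do the expensive sort once and keep the result on disk.
sort demo_big.txt > "$tmp"

# Reuse the sorted file for two different questions without re-sorting.
uniq "$tmp" > demo_unique.txt
uniq -d "$tmp" > demo_duplicated.txt

# Remove the temp file when finished.
rm -f "$tmp"
```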
In an earlier class, we learned how to redirect STDOUT to a file and STDIN from a file. If we want to direct STDOUT to both a file and the screen, we can use the tee command. tee is used with the pipe operator.
tee usage:
command | tee <filename.txt>
Exercise: Try to send output from a command to both the screen and a file.
$ wc mini.gff | tee wc_output.txt
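Because tee passes its input along to stdout, it can also sit in the middle of a pipeline. The sketch below shows that, plus the standard -a option for appending instead of overwriting. All demo_* file names are hypothetical.

```shell
# Hypothetical demo data.
printf 'chrV\nchrII\nchrV\n' > demo_tee.txt

# tee saves a copy of the sorted lines and still passes them on to uniq.
sort demo_tee.txt | tee demo_sorted.txt | uniq -c

# By default tee overwrites its file; -a appends to it instead.
echo 'first run' | tee demo_log.txt
echo 'second run' | tee -a demo_log.txt
```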
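Quick tip: tee only sees stdout by default, because only stdout travels through the pipe. To capture stderr as well, add 2>&1 before the pipe. It sends stderr (file descriptor 2) to the same place as stdout (file descriptor 1), so both streams reach tee. In the command below, skdjfldj is a file that does not exist, so wc prints an error to stderr: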
$ wc mini.gff skdjfldj 2>&1 | tee wc_stdoutstderr.txt
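The effect of 2>&1 is easier to see with a command whose output is predictable. In this sketch a brace group stands in for wc: it writes one line to stdout and one to stderr. The demo_* file names are hypothetical.

```shell
# Without 2>&1, only stdout goes through the pipe, so the file gets one line.
# The stderr line still appears on the screen but never reaches tee.
{ echo 'to stdout'; echo 'to stderr' >&2; } | tee demo_stdout_only.txt

# 2>&1 points stderr (2) at wherever stdout (1) is going, which is the pipe,
# so tee now receives and saves both lines.
{ echo 'to stdout'; echo 'to stderr' >&2; } 2>&1 | tee demo_both.txt
```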
Exercise: Can you write a series of pipes that will output a list of unique gene_id entries from mini.gff?
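If you get stuck, here is one possible approach among several, offered as a sketch and not as the intended answer. It assumes your grep supports the -o option (GNU and BSD grep do). The first line only creates stand-in data when mini.gff is missing.

```shell
# Stand-in data so this sketch runs on its own; skipped if mini.gff exists.
[ -f mini.gff ] || printf '%s\n' \
  'chrV CDS gene_id "YER190C-A"; transcript_id "YER190C-A";' \
  'chrII exon gene_id "YBR298C-A"; transcript_id "YBR298C-A";' \
  'chrV exon gene_id "YER190C-A"; transcript_id "YER190C-A";' > mini.gff

# grep -o prints only the matching part of each line: gene_id "<name>".
# cut splits that on the double quote and keeps the name between the quotes.
# sort groups identical names together so uniq can collapse them.
grep -o 'gene_id "[^"]*"' mini.gff | cut -d '"' -f 2 | sort | uniq
```

Matching on the gene_id text itself avoids depending on which column the attributes are in or on the tabs having survived a copy and paste.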