Now that we know what piping is, we can discover some new functionalities of Linux. Let's learn how to pipe the following commands:
sort - sort lines in a file
uniq - find unique (or duplicated) lines in a pre-sorted file
tee - send a command's output to multiple locations (for example, the screen and a file)
Exercise: You probably already have the file called mini.gff. If you don't, copy and paste the text below into a file by that name.
# A tester gff file.
# For testing pipes.
chrV	sacCer3_ensGene	CDS	574807	575379	0.000000	-	0	gene_id "YER190C-A"; transcript_id "YER190C-A";
chrII	sacCer3_ensGene	CDS	805038	805256	0.000000	-	0	gene_id "YBR298C-A"; transcript_id "YBR298C-A";
chrV	sacCer3_ensGene	start_codon	575377	575379	0.000000	-	.	gene_id "YER190C-A"; transcript_id "YER190C-A";
chrII	sacCer3_ensGene	start_codon	805254	805256	0.000000	-	.	gene_id "YBR298C-A"; transcript_id "YBR298C-A";
chrII	sacCer3_ensGene	exon	805035	805256	0.000000	-	.	gene_id "YBR298C-A"; transcript_id "YBR298C-A";
chrIII	sacCer3_ensGene	exon	309070	310155	0.000000	+	.	gene_id "YCR105W"; transcript_id "YCR105W";
CHRII	sacCer3_ensGene	start_codon	805351	805353	0.000000	+	.	gene_id "YBR299W"; transcript_id "YBR299W";
CHRIII	sacCer3_ensGene	start_codon	310958	310960	0.000000	+	.	gene_id "YCR106W"; transcript_id "YCR106W";
chrV	sacCer3_ensGene	exon	574804	575379	0.000000	-	.	gene_id "YER190C-A"; transcript_id "YER190C-A";
chrV	sacCer3_ensGene	stop_codon	575680	575682	0.000000	-	.	gene_id "YER190C-B"; transcript_id "YER190C-B";
We can use sort to sort a file's lines into a new order…
sort [options] <file.txt> …
Exercise: Sort the mini.gff file:
$ sort mini.gff
Exercise: Read the sort man pages to figure out how you would…
We can identify unique (or duplicated) lines in a pre-sorted file using the command uniq.
uniq [options] <sortedFile.txt>
To operate on a presorted file, we have two options. We can do the process in two steps:
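A minimal sketch of the two-step approach is shown here. The file names demo.gff and demo.sorted.gff are illustrative assumptions, and the printf line only builds a tiny stand-in file so the example runs on its own; with the real data you would use mini.gff instead.

```shell
# Build a tiny tab-separated stand-in file (hypothetical demo data, not the full mini.gff).
printf 'chrV\tCDS\nchrII\tCDS\nchrV\tCDS\nchrII\texon\n' > demo.gff

# Step 1: sort the file and save the result under a new (arbitrary) name.
sort demo.gff > demo.sorted.gff

# Step 2: run uniq on the pre-sorted file to collapse adjacent duplicate lines.
uniq demo.sorted.gff
```

The sorted copy stays on disk after step 2, which is handy if you want to inspect it but means one more file to clean up.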
OR, we can use the pipe operator to chain the two commands together:
$ sort mini.gff | uniq
That wasn't very interesting. What if we do this just for the first column…
$ cut -f 1 mini.gff
$ cut -f 1 mini.gff | sort
$ cut -f 1 mini.gff | sort | uniq
Quick tip: To find the duplicated lines, use -d as an option for uniq.
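As a small illustration of the -d option, the sketch below uses a hypothetical file, chroms.txt, holding values like those printed by cut -f 1 mini.gff. The -c option, which counts occurrences, is included as a related option you may find useful; neither the file name nor the data comes from the lesson itself.

```shell
# Hypothetical first-column values, similar to what `cut -f 1 mini.gff` prints.
printf 'chrV\nchrII\nchrV\nchrIII\nCHRII\n' > chroms.txt

# -d prints one copy of each line that occurs more than once.
sort chroms.txt | uniq -d

# -c prefixes every distinct line with the number of times it occurs.
sort chroms.txt | uniq -c
```

Note that uniq compares lines case-sensitively by default, so CHRII and chrII are treated as different lines here.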
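Common pitfall: Pipes are fun, but long pipelines can be problematic with large files. A pipe does not keep intermediate results, so if one step fails you have to rerun everything from the start. Commands such as sort also have to read all of their input before they can produce any output, which can be slow or memory-hungry depending on your computer or cluster. In these cases, writing an intermediate result to a temp file (sometimes written as file.tmp) is preferable.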
In an earlier class, we learned how to redirect STDOUT to a file and STDIN from a file. If we want to direct STDOUT to both a file and the screen, we can use the tee command. tee is used with the pipe operator:
command | tee <filename.txt>
Exercise: Try to send output from a command to both the screen and a file.
$ wc mini.gff | tee wc_output.txt
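tee does not have to be the last command in a pipeline. The sketch below, which assumes a hypothetical input file chroms.txt and output name chroms.sorted.txt, places tee in the middle so that the sorted lines are saved to a file and also passed on to uniq.

```shell
# Hypothetical demo input; with the lesson's data you could start from `cut -f 1 mini.gff`.
printf 'chrV\nchrII\nchrV\n' > chroms.txt

# tee writes the sorted lines to chroms.sorted.txt AND passes them on to uniq.
sort chroms.txt | tee chroms.sorted.txt | uniq
```

This is one way to keep an intermediate result without breaking the pipeline into separate steps.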
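Quick tip: tee only receives what comes through the pipe, which is normally just stdout. If you want to capture both stdout and stderr, add 2>&1 before the pipe. It redirects stderr (file descriptor 2) to wherever stdout (file descriptor 1) is going, so both streams flow through the pipe to tee: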
$ wc mini.gff skdjfldj 2>&1 | tee wc_stdoutstderr.txt
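To see the effect of 2>&1, it can help to run the same command with and without it and compare the saved files. The sketch below assumes a hypothetical input file demo.txt and output names out_only.txt and out_and_err.txt; skdjfldj is deliberately a file that does not exist.

```shell
# Small stand-in input file (hypothetical demo data).
printf 'a\nb\n' > demo.txt

# Without 2>&1: the error about the missing file goes straight to the screen,
# so only the normal wc output is saved by tee.
wc demo.txt skdjfldj | tee out_only.txt

# With 2>&1: stderr is merged into stdout before the pipe,
# so tee saves the error message as well.
wc demo.txt skdjfldj 2>&1 | tee out_and_err.txt
```

Looking at the two output files afterwards should show the error message only in out_and_err.txt.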
Exercise: Can you write a series of pipes that will output a list of unique transcript_id entries from mini.gff?
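If you get stuck, here is one possible approach, sketched on a hypothetical three-record stand-in file called demo.gff rather than on mini.gff itself. It assumes the columns are tab-separated and that transcript_id is the second semicolon-separated item in column 9; grep -v is used to drop the comment lines. Other solutions are equally valid.

```shell
# Hypothetical stand-in for mini.gff: one comment line plus three tab-separated records.
printf '# comment\n' > demo.gff
printf 'chrV\tsrc\tCDS\t1\t9\t0\t-\t0\tgene_id "A"; transcript_id "A";\n' >> demo.gff
printf 'chrV\tsrc\texon\t1\t9\t0\t-\t.\tgene_id "A"; transcript_id "A";\n' >> demo.gff
printf 'chrII\tsrc\tCDS\t1\t9\t0\t-\t0\tgene_id "B"; transcript_id "B";\n' >> demo.gff

# Drop comment lines, keep column 9, keep the transcript_id item, then sort and de-duplicate.
grep -v '^#' demo.gff | cut -f 9 | cut -d ';' -f 2 | sort | uniq
```

To try it on the lesson's data, replace demo.gff with mini.gff in the last command.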