MORE PIPES

Now that we know what piping is, we can explore some more Linux commands that work well in pipes. Let's learn how to pipe the following commands:

sort - sort lines in a file
uniq - find unique (or duplicated) lines in a pre-sorted file; it only compares adjacent lines
tee - copy stdin to stdout and, at the same time, to one or more files

:!: Exercise: You probably already have the file called mini.gff. If you don't, copy and paste the text below into a file by that name.

# A tester gff file.                                
# For testing pipes.                                
chrV	sacCer3_ensGene	CDS	574807	575379	0.000000	-	0	gene_id "YER190C-A"; transcript_id "YER190C-A";
chrII	sacCer3_ensGene	CDS	805038	805256	0.000000	-	0	gene_id "YBR298C-A"; transcript_id "YBR298C-A";
chrV	sacCer3_ensGene	start_codon	575377	575379	0.000000	-	.	gene_id "YER190C-A"; transcript_id "YER190C-A";
chrII	sacCer3_ensGene	start_codon	805254	805256	0.000000	-	.	gene_id "YBR298C-A"; transcript_id "YBR298C-A";
chrII	sacCer3_ensGene	exon	805035	805256	0.000000	-	.	gene_id "YBR298C-A"; transcript_id "YBR298C-A";
chrIII	sacCer3_ensGene	exon	309070	310155	0.000000	+	.	gene_id "YCR105W"; transcript_id "YCR105W";
CHRII	sacCer3_ensGene	start_codon	805351	805353	0.000000	+	.	gene_id "YBR299W"; transcript_id "YBR299W";
CHRIII	sacCer3_ensGene	start_codon	310958	310960	0.000000	+	.	gene_id "YCR106W"; transcript_id "YCR106W";
chrV	sacCer3_ensGene	exon	574804	575379	0.000000	-	.	gene_id "YER190C-A"; transcript_id "YER190C-A";
chrV	sacCer3_ensGene	stop_codon	575680	575682	0.000000	-	.	gene_id "YER190C-B"; transcript_id "YER190C-B";

Sorting files by line using sort

We can use sort to print a file's lines in a new order (alphabetical by default). The file itself is not changed.

sort usage:
sort [options] <file.txt> …

:!: Exercise: Sort the mini.gff file:

$ sort mini.gff
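
The order that sort produces depends on the locale settings of your shell, so your output for mini.gff may not match someone else's. The sketch below uses three made-up lines rather than mini.gff and forces the byte-order "C" locale (an assumption made here only so the result is predictable). In that locale, capital letters sort before lower case ones.

```shell
# Three sample lines (not taken from mini.gff), sorted in the "C" locale.
# LC_ALL=C applies to this one sort command only.
sorted=$(printf 'chrV\nCHRII\nchrII\n' | LC_ALL=C sort)

echo "$sorted"
# prints:
# CHRII
# chrII
# chrV
```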

:!: Exercise: Read the sort man page (man sort) to figure out how you would…

  • sort in reverse order
  • sort the capital and lower case letters together
  • sort in numerical order

Then try some of these options on mini.gff.

Find unique lines using uniq

We can identify unique (or duplicated) lines in a pre-sorted file using the command uniq.

uniq usage:
uniq [options] <sortedFile.txt>

To operate on a pre-sorted file, we have two options. We can do the process in two steps:

  1. sort file.txt > sortedFile.txt
  2. uniq sortedFile.txt

OR, we can use the pipe operator to chain the two commands together:

$ sort mini.gff | uniq

That wasn't very interesting, because every line in mini.gff is already unique. What if we do this just for the first column (the chromosome name)…

$ cut -f 1 mini.gff
$ cut -f 1 mini.gff | sort
$ cut -f 1 mini.gff | sort | uniq
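
Why sort first? uniq only compares each line with the line directly before it, so duplicates that are not next to each other are missed. A small sketch of the difference, using a made-up three-line input in place of mini.gff:

```shell
# Made-up input: "chrV" appears twice, but not on adjacent lines.
# Without sort, uniq finds no adjacent duplicates and keeps all three lines.
unsorted_result=$(printf 'chrV\nchrII\nchrV\n' | uniq)

# With sort, the two "chrV" lines end up next to each other and collapse.
sorted_result=$(printf 'chrV\nchrII\nchrV\n' | sort | uniq)

echo "without sort:"; echo "$unsorted_result"
echo "with sort:";    echo "$sorted_result"
```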

;-) Quick tip: To find the duplicated lines, use -d as an option for uniq.
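
A short sketch of -d, again on made-up input that is already sorted. It also shows -c, a standard uniq option not covered above, which prefixes each line with the number of times it occurred. The exact spacing in front of the counts can vary between systems.

```shell
# Made-up, already sorted input: "chrII" occurs twice, "chrV" once.
dupes=$(printf 'chrII\nchrII\nchrV\n' | uniq -d)    # only the lines that repeat
counts=$(printf 'chrII\nchrII\nchrV\n' | uniq -c)   # every line, with its count

echo "$dupes"    # prints: chrII
echo "$counts"   # prints a count followed by each line
```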

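:!: Common pitfall: Pipes are fun, but pipes can be problematic with large files. The pipe itself streams data and does not cap the total amount, but a command such as sort must read all of its input before it can print anything. Depending on your computer or cluster, memory or scratch space can become the limit, and if a late step fails the whole chain has to be re-run. In these cases, creating a temp file (sometimes written as file.tmp) for the intermediate result is preferable.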
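
A minimal sketch of the temp-file approach. It assumes the mktemp command is available (it is on most Linux systems) and uses a made-up three-line input in place of a large file.

```shell
# Create an empty scratch file with a unique name.
tmp=$(mktemp)

# Step 1: save the sorted data to the temp file instead of piping it onward.
printf 'chrV\nchrII\nchrV\n' | sort > "$tmp"

# Step 2: run the next command on the temp file.
result=$(uniq "$tmp")
echo "$result"

# Remove the temp file once it is no longer needed.
rm -f "$tmp"
```

If step 2 fails or needs a different option, only step 2 has to be repeated, as long as the temp file has not been removed yet.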


Redirect to multiple locations using tee

In an earlier class, we learned how to redirect STDOUT to a file and STDIN from a file. If we want to direct STDOUT to both a file and the screen, we can use the tee command. tee reads from STDIN, so it is used after the pipe operator.

tee usage:
command | tee <filename.txt>

:!: Exercise: Try to send output from a command to both the screen and a file.

$ wc mini.gff | tee wc_output.txt
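
tee does not have to be the last command in a pipeline. Placed in the middle, it saves an intermediate result while the data keeps flowing to the next command. A sketch with a made-up input and a scratch file from mktemp (both are assumptions for illustration):

```shell
# Scratch file to hold the intermediate (sorted) output.
sorted_file=$(mktemp)

# tee writes the sorted lines to the file and also passes them on to uniq.
final=$(printf 'chrV\nchrII\nchrV\n' | sort | tee "$sorted_file" | uniq)
saved=$(cat "$sorted_file")

echo "$final"   # the de-duplicated result: chrII, chrV
echo "$saved"   # the full sorted list that tee saved: chrII, chrV, chrV

rm -f "$sorted_file"
```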

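;-) Quick tip: On its own, tee only receives stdout, because a pipe carries only stdout. To capture stderr as well, add 2>&1 before the pipe. It sends stderr (file descriptor 2) to wherever stdout (file descriptor 1) is going, which here is the pipe, so tee receives both streams: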

$ wc mini.gff skdjfldj 2>&1 | tee wc_stdoutstderr.txt
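
To see the effect of 2>&1, compare what tee saves with and without it. The sketch below runs wc on a file name that is assumed not to exist (a hypothetical name chosen for this demo), so wc produces only an error message.

```shell
# Hypothetical file name that should not exist, so wc prints only an error.
missing=no_such_file_for_this_demo.txt
log_without=$(mktemp)
log_with=$(mktemp)

# Without 2>&1: the error goes straight to the screen and never reaches tee.
wc "$missing" | tee "$log_without"

# With 2>&1: stderr joins stdout in the pipe, so tee saves the error too.
wc "$missing" 2>&1 | tee "$log_with"

without=$(cat "$log_without")   # empty
with=$(cat "$log_with")         # contains the error message from wc
rm -f "$log_without" "$log_with"
```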

:!: Exercise: Can you write a series of pipes that will output a list of unique transcript_id entries from mini.gff?

Assignment 3

2018pipes2.txt · Last modified: 2018/08/30 09:28 by erin