2018pipes2


MORE PIPES

Now that we know what piping is, we can discover some new functionalities of Linux. Let's learn how to pipe the following commands:

sort - sort lines in a file
uniq - find unique (or duplicated) lines in a pre-sorted file
tee - copy its input to stdout and to one or more files at the same time

:!: Exercise: You probably already have the file called mini.gff. If you don't, copy and paste the text below into a file by that name.

# A tester gff file.                                
# For testing pipes.                                
chrV	sacCer3_ensGene	CDS	574807	575379	0.000000	-	0	gene_id "YER190C-A"; transcript_id "YER190C-A";
chrII	sacCer3_ensGene	CDS	805038	805256	0.000000	-	0	gene_id "YBR298C-A"; transcript_id "YBR298C-A";
chrV	sacCer3_ensGene	start_codon	575377	575379	0.000000	-	.	gene_id "YER190C-A"; transcript_id "YER190C-A";
chrII	sacCer3_ensGene	start_codon	805254	805256	0.000000	-	.	gene_id "YBR298C-A"; transcript_id "YBR298C-A";
chrII	sacCer3_ensGene	exon	805035	805256	0.000000	-	.	gene_id "YBR298C-A"; transcript_id "YBR298C-A";
chrIII	sacCer3_ensGene	exon	309070	310155	0.000000	+	.	gene_id "YCR105W"; transcript_id "YCR105W";
CHRII	sacCer3_ensGene	start_codon	805351	805353	0.000000	+	.	gene_id "YBR299W"; transcript_id "YBR299W";
CHRIII	sacCer3_ensGene	start_codon	310958	310960	0.000000	+	.	gene_id "YCR106W"; transcript_id "YCR106W";
chrV	sacCer3_ensGene	exon	574804	575379	0.000000	-	.	gene_id "YER190C-A"; transcript_id "YER190C-A";
chrV	sacCer3_ensGene	stop_codon	575680	575682	0.000000	-	.	gene_id "YER190C-B"; transcript_id "YER190C-B";

Sorting files by line using sort

We can use sort to sort a file's lines into a new order…

sort usage:
sort [options] <file.txt> …

:!: Exercise: Sort the mini.gff file:

$ sort mini.gff

:!: Exercise: Read the sort man page to figure out how you would…

  • sort in reverse order
  • sort capital and lowercase letters together
  • sort in numerical order

Then try some of these options on mini.gff.
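
As a minimal sketch of the three options asked about above: the file demo_sort.txt and its contents are made up for illustration (they are not mini.gff), and the flags shown are only one possible answer, so check them against the man page on your own system.

```shell
# Build a tiny two-column, tab-separated demo file (hypothetical name and data).
printf 'chrV\t575379\nCHRII\t805353\nchrII\t805256\nchrIII\t31015\n' > demo_sort.txt

# -r: reverse the sort order.
sort -r demo_sort.txt

# -f: ignore case, so chrII and CHRII end up next to each other.
sort -f demo_sort.txt

# -k2,2n: sort numerically, using only the second field as the key.
sort -k2,2n demo_sort.txt
```

The exact order of the first two commands can depend on your locale settings, so don't worry if mixed-case lines come out in a slightly different order on another machine.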

Find unique lines using uniq

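We can identify unique (or duplicated) lines in a pre-sorted file using the command uniq. The file needs to be sorted first because uniq only compares each line with the line directly before it.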

uniq usage:
uniq [options] <sortedFile.txt>

To operate on a presorted file, we have two options. We can do the process in two steps:

  1. sort file.txt > sortedFile.txt
  2. uniq sortedFile.txt

OR, we can use the pipe operator to chain the two commands together:

$ sort mini.gff | uniq
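
To see why the sort step matters, here is a small sketch. The file demo_uniq.txt is a made-up example (not part of the class files) with a repeated line that is not adjacent to its twin.

```shell
# Tiny demo file (hypothetical name): chrV appears twice, but not on adjacent lines.
printf 'chrV\nchrII\nchrV\n' > demo_uniq.txt

# uniq alone only collapses adjacent repeats, so all three lines survive.
uniq demo_uniq.txt | wc -l

# Sorting first puts the two chrV lines next to each other, so two lines remain.
sort demo_uniq.txt | uniq | wc -l
```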

That wasn't very interesting, because every line in mini.gff is already unique. What if we do this just for the first column…

$ cut -f 1 mini.gff
$ cut -f 1 mini.gff | sort
$ cut -f 1 mini.gff | sort | uniq

;-) Quick tip: To find the duplicated lines, use -d as an option for uniq.
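
A short sketch of the -d option on a made-up file (demo_chrom.txt is a hypothetical name). The -c option, which is not covered above, is added here as a related extra: it prefixes each line with the number of times it occurred.

```shell
# Tiny demo file (hypothetical name): chrV occurs twice, the others once.
printf 'chrV\nchrII\nchrV\nchrIII\n' > demo_chrom.txt

# -d: print one copy of each line that occurs more than once.
sort demo_chrom.txt | uniq -d

# -c: print every distinct line, prefixed by how many times it occurred.
sort demo_chrom.txt | uniq -c
```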

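:!: Common pitfall: Pipes are fun, but long pipelines can be problematic with large files. A pipe itself streams data and does not cap the total amount passed through, but commands such as sort must read all of their input before they can write anything, so the memory and scratch disk space on your computer or cluster can become the limit, and a failure in one step means rerunning the whole chain. In these cases, saving intermediate results to a temp file (sometimes written as file.tmp) is preferable.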
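
The temp-file approach mentioned above might look like the sketch below. The file names demo_big.txt, demo_sorted.tmp and demo_big_unique.txt are placeholders chosen for this example; a real large file would take the place of the three-line demo.

```shell
# Stand-in for a large input file (hypothetical name and contents).
printf 'b\na\nb\n' > demo_big.txt

# Step 1: save the sorted lines to a temp file instead of piping them.
sort demo_big.txt > demo_sorted.tmp

# Step 2: run uniq on the temp file and keep the final result.
uniq demo_sorted.tmp > demo_big_unique.txt

# Step 3: remove the temp file once the result has been checked.
rm -f demo_sorted.tmp
```

If step 2 fails, the sorted temp file is still there, so only that step needs to be repeated.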


Redirect to multiple locations using tee

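In an earlier class, we learned how to redirect STDIN and STDOUT using files. If we want to direct STDOUT to both a file and the screen, we can use the tee command. tee is used after the pipe operator (|): it reads what is piped into it, writes a copy to the file, and also passes it on to STDOUT.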

tee usage:
command | tee <filename.txt>

:!: Exercise: Try to send output from a command to both the screen and a file.

$ wc mini.gff | tee wc_output.txt
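
Because tee passes its input along, it can also sit in the middle of a chain of pipes. The sketch below uses made-up file names (demo_tee.txt, demo_sorted_copy.txt, demo_log.txt) and also shows the -a option, which appends to the file instead of overwriting it.

```shell
# Tiny demo file (hypothetical name).
printf 'chrV\nchrII\nchrV\n' > demo_tee.txt

# tee saves a copy of the sorted lines and also passes them on to uniq.
sort demo_tee.txt | tee demo_sorted_copy.txt | uniq

# Without -a, tee overwrites the file; with -a, it adds to the end.
echo "first run" | tee demo_log.txt
echo "second run" | tee -a demo_log.txt
```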

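;-) Quick tip: tee only sees what arrives through the pipe, which is the stdout of the previous command. If you want to capture stderr as well, add 2>&1 before the pipe. This sends file descriptor 2 (stderr) to the same place as file descriptor 1 (stdout), which here is the pipe, so both streams reach tee. In the command below, skdjfldj is a file that does not exist, so wc prints an error message for it: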

$ wc mini.gff skdjfldj 2>&1 | tee wc_stdoutstderr.txt
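
To check what 2>&1 changes, here is a small sketch. The function both_streams and the output file names are invented for this example; the function simply writes one line to stdout and one line to stderr.

```shell
# Hypothetical helper: one line goes to stdout, one line goes to stderr.
both_streams() {
    echo "to stdout"
    echo "to stderr" >&2
}

# Only stdout travels through the pipe, so the file gets one line.
# The stderr line still shows on the screen, but tee never sees it.
both_streams | tee demo_stdout_only.txt

# 2>&1 points stderr at the pipe as well, so the file gets both lines.
both_streams 2>&1 | tee demo_both.txt
```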

:!: Exercise: Can you write a series of pipes that will output a list of unique gene_id entries from mini.gff?

Assignment 3

2018pipes2.1535642898.txt.gz · Last modified: 2018/08/30 09:28 by erin