Where do Genomic Datasets come from?

The main resources for Genomic Datasets are the following repositories:

:!: Best Practices: Every time you download something… write it down!

:!: Best Practices: Try to download genomes with commands that can be written down, not with clicking.
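One possible way to follow both best practices at once is to keep a plain-text log of every download command. The sketch below is only an illustration: the helper name log_download and the file name download_notes.txt are made up, and the address is the same kind of placeholder used elsewhere on this page. The helper records the command; it does not run it.

```shell
# log_download is a hypothetical helper; download_notes.txt is an arbitrary file name.
# It appends today's date, the source, and the exact command to a notes file.
# It only records the command, it does not execute it.
log_download() {
    source_url=$1
    shift
    {
        echo "date:    $(date +%Y-%m-%d)"
        echo "source:  $source_url"
        echo "command: $*"
        echo
    } >> download_notes.txt
}

# Example: record an rsync download (placeholder address, not a real site).
log_download "rsync://address/to/dir/" rsync -avzP "rsync://address/to/dir/" .
cat download_notes.txt
```

Because the log is appended to rather than overwritten, the same notes file can collect every download made in a project directory.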

Downloading data from repositories

Downloading used to be straightforward, but it is now more complicated: different Linux installations come with different programs for connecting to online data repositories and downloading data. You'll need to find out which of these options work on your system. When we go over software installation later in this course, you can add some of these programs to make your life easier.
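A quick way to see which of the programs covered on this page are present on your system is to ask the shell where each one lives. This is a minimal sketch; check_tools is a made-up helper name, not a standard command.

```shell
# Report which download programs are installed.
# check_tools is a hypothetical helper name used only for this example.
check_tools() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found at $(command -v "$tool")"
        else
            echo "$tool: not installed"
        fi
    done
}

check_tools ftp sftp wget curl rsync scp
```

Any program reported as "not installed" can be skipped for now in favor of one that is present.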

ftp - File Transfer Program

ftp <ftp://address/to/ftp/site>
-no longer available on many up-to-date systems because it sends data unencrypted
-requires that the host is configured as an ftp site
-use commands cd, ls, get, bye

sftp - Secure File Transfer Program

sftp <user@address.of.sftp.site>
-requires that the host is configured as an sftp site
-use commands cd, ls, get, bye

wget - World wide web GET

wget <http://address/to/file/file.txt>
-not installed by default on macOS; install it with a package manager such as Homebrew

curl - Client URL

curl [options] <sourcefile.txt>
curl [--remote-name-all] <http://address/to/file/file.txt>
-downloads only the files whose addresses you list; it cannot fetch a whole directory recursively

rsync - Remote file and directory SYNChronization

rsync [options] <source> <target>
rsync [-avzP] <rsync://address/to/dir/> <.>
-a archive, copies directories recursively and preserves time stamps and permissions
-v verbose, outputs comments to the screen
-z zip, compress data during transfer between computers
-P progress, print information on progress to the screen and keep partially transferred files

scp - Secure CoPy

scp <sourcefile> <target>
scp <user@host:/address/to/file/file.txt> <.>

:!: Group Exercise: Obtain the Saccharomyces cerevisiae genome from the UCSC Genome Browser.

  • Navigate to Downloads > Genome Data > Other Genomes > S. cerevisiae
  • Let's take some time to navigate around the sacCer3 links
  • Open a terminal and navigate to a place where you want to download the genome.
  • Make a new directory that will house the genome.
$ mkdir sacCer3
  • Navigate into that directory:
$ cd sacCer3
  • We want to obtain the Dataset by Chromosome.
  • Download the contents of this directory using the command:
$ rsync -avzP rsync:// .
  • Use the commands you learned last week to explore what you have obtained (ls, more, less, head, tail, wc, etc).

Didn't work? Try this one…

$ wget --timestamping '*'

Still didn't work? OK, let's just cut to the chase…

:!: Partner work: With a partner, answer the following…

  • Which of your newly downloaded files are binary files?
  • Which are text files?
  • Which command would you use to see how big (how many kb/mb) each file is?
  • Which command would you use to see how many lines are in the text files?

Checking sums

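Sometimes data is corrupted during transfer, and the risk is greater for large files such as genome files. To make sure our data arrived intact, we can use checksums: short codes computed from a file's contents that change if even a single byte of the file changes.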

:!: Group Exercise: Check sums

  • Check the sums of all the .fa.gz files using this command:
$ md5 *.fa.gz
  • Read the checksums recorded by the UCSC Genome Browser like so:
$ more md5sum.txt
  • Do the numbers and letters match?
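Comparing long codes by eye is error-prone, so it can help to let the computer do the comparison. The sketch below shows the idea on a throwaway file. It assumes the GNU md5sum program is installed (on macOS the equivalent program is md5, which takes different options); the names demo.fa, demo.md5, and check are made up for this example.

```shell
# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Make a tiny stand-in for a downloaded file and record its checksum.
printf '>chrDemo\nACGTACGT\n' > demo.fa
md5sum demo.fa > demo.md5

# check is a hypothetical helper: "md5sum -c" recomputes the checksum of
# every file listed in demo.md5 and compares it with the recorded value.
check() {
    if md5sum -c demo.md5 >/dev/null 2>&1; then
        echo "match"
    else
        echo "MISMATCH"
    fi
}

before=$(check)

# Simulate corruption by changing a single base, then check again.
printf '>chrDemo\nACGTACGA\n' > demo.fa
after=$(check)

echo "before corruption: $before"
echo "after corruption:  $after"
```

If a downloaded checksum file lists one checksum and file name per line in the same layout that md5sum writes, running md5sum -c on that file should check every listed file in one step.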

Unzipping files

You may have noticed that the .fa.gz files are not text files but binary files, so you cannot read them with more or less. This is because they have been compressed with a utility called gzip, which makes them smaller for transfer.

gzip usage
To compress:
gzip <filename.txt>

To un-compress:
gunzip <filename.txt.gz>

Let's unzip the fasta files.

$ gunzip *.fa.gz
$ ls -alh

:!: Quick Tip: There are many other utilities that can be used for compressing files. Some examples are bzip2, zip, and xz.
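You do not always have to uncompress a gzip file to look inside it. The following is a small sketch on a made-up file (demo.fa is not one of the downloaded files): gzip -t tests whether a compressed file is intact, and gunzip -c writes the uncompressed text to the screen while leaving the .gz file in place.

```shell
# Work in a scratch directory with a tiny made-up fasta file.
workdir=$(mktemp -d)
cd "$workdir"
printf '>chrDemo\nACGTACGT\nTTGGCCAA\n' > demo.fa

gzip demo.fa                  # replaces demo.fa with demo.fa.gz
ls

# -t tests the compressed file without writing any output file.
gzip -t demo.fa.gz && echo "demo.fa.gz is a valid gzip file"

# -c sends the uncompressed text to the screen and keeps demo.fa.gz as it is.
gunzip -c demo.fa.gz | head -n 2

gunzip demo.fa.gz             # replaces demo.fa.gz with demo.fa
ls
```

Piping gunzip -c into head, wc, or less is handy for large genome files when disk space is limited.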

:!: Partner Exercises:

  1. Using a command with a wildcard, execute word count on all fasta files to see how many lines, words, and bytes are in each file.
  2. Have the md5 sums of the files changed?
  3. Print out the first 100 lines of chrI.fa to the screen.
  4. It is good practice to take notes every time you download files from online resources. Create a text file for your notes and record (1) today's date, (2) where you obtained the files, (3) how you downloaded them, (4) any other comments.
  5. Download the sacCer3 genome annotation here: sacCer3_annotation.gtf.gz.
  6. Check that your downloaded file matches this md5 sum: MD5 (sacCer3_annotation.gtf.gz) = a22aa4e1be44c7135b7ae121d01640a5
  7. Unzip the .gtf file
  8. Print out the first 10 lines of the .gtf file to the screen.
  9. Print out the last 10 lines of the .gtf file to the screen.
  10. Count how many lines are in the .gtf file.

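:!: Common pitfalls: There are several different commands for computing checksums, and which ones are installed depends on your system (for example, macOS provides md5, while most Linux distributions provide md5sum). You'll just need to test them out: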

$ md5
$ sum
$ md5sum
$ cksum
$ sha1sum
  • Note: these programs use different algorithms and produce codes of different lengths and formats, so make sure the one you use matches the sums you have been given.
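To see how the outputs differ, you can run every checksum program that happens to be installed on the same small file. This is only a sketch: empty.txt is a made-up file name, and programs that are missing on your system are simply skipped.

```shell
# Work in a scratch directory with an empty test file.
workdir=$(mktemp -d)
cd "$workdir"
: > empty.txt

# Run each checksum program only if it is installed on this system.
for tool in md5 md5sum sha1sum cksum sum; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "== $tool"
        "$tool" empty.txt
    fi
done
```

An MD5 sum is 32 letters and numbers long and a SHA-1 sum is 40, so the length of the code you have been given is a good hint as to which program produced it.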


tar compression
tar -zcvf <tarball.tgz> <directory_to_compress>
-z - gzip it
-c - create a new archive
-v - verbose
-f - file, the next argument is the name of the archive file

tar extraction
tar -zxvf <tarball.tgz>
-z - it's gzipped
-x - eXtract it
-v - verbose
-f - file, the next argument is the name of the archive file
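The round trip can be tried safely on a made-up directory. In this sketch the names demo_genome, demo_genome.tgz, and restored are invented for illustration, and two options not listed above are used: -t, which lists the contents of a tarball without extracting it, and -C, which extracts into a different directory.

```shell
# Work in a scratch directory with a small made-up data directory.
workdir=$(mktemp -d)
cd "$workdir"
mkdir demo_genome
printf '>chrDemo\nACGT\n' > demo_genome/chrDemo.fa
echo "downloaded for practice" > demo_genome/notes.txt

# Create a gzipped tarball from the directory.
tar -zcvf demo_genome.tgz demo_genome

# -t lists what is inside the tarball without extracting anything.
tar -ztvf demo_genome.tgz

# Extract into a separate directory (-C) so the original is left alone.
mkdir restored
tar -zxvf demo_genome.tgz -C restored
ls restored/demo_genome
```

Listing a tarball with -t before extracting it is a cheap way to see whether it will unpack into its own directory or scatter files into the current one.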


wiki/sources.txt · Last modified: 2018/08/29 21:11 by erin