Assemble the E. coli genome using high-throughput sequencing data.
Document the exercise in your electronic lab notebook. The notebook entry should have the following format:
# General description of experiment.
# Description of step 1.
# Description of step 2.
We will use Velvet to assemble the E. coli genome.
1. Download high-throughput sequencing data from an E. coli genome sequencing experiment. The data is on the Montgomery lab server: '/Users/genomics/Documents/e_coli'. Use Cyberduck or FileZilla to transfer the 'e_coli' directory to your computer (place it in the 'Documents' directory).
2. Open a terminal window.
3. Change into the 'velvet' directory within the 'e_coli' directory you downloaded.
4. View the contents of the directory using 'ls'. There are two files corresponding to the forward and reverse reads from a single paired-end Illumina run.
5. The '.gz' extension indicates that the files are compressed (GNU zipped). Decompress each file using 'gzip -d file_name' (the '-d' option decompresses a file).
6. Inspect the fastq files using 'more'. This was a paired-end sequencing run: the reads from the two ends of each fragment are stored in two separate files. What information is contained in each line of the files? Record the answer to this question in your lab notebook.
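7. Identify how many reads are in the library using 'wc -l file_name' (the '-l' option tells 'wc' to count lines only). Because each read spans more than one line, divide the line count by the number of lines per read. How many reads are there in total between the two fastq files? Record this information in your lab notebook.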
8. Copy one of the sequences from the fastq file and use it to calculate the length of the reads with 'echo -n sequence | wc -c'. What do the '-n' option of 'echo' and the '-c' option of 'wc' do, respectively? The read length is identical throughout the library.
9. Create a hash table of the reads using 'velveth' with a k-mer length of 35. Direct the output to a folder called 'velvet_output'. Specify that the file type is 'fastq' and that the data is short paired-end reads in separate files. Notice how this information is passed via options to the 'velveth' command:
$ velveth velvet_output 35 -fastq -shortPaired -separate fastq_file1.fastq fastq_file2.fastq
10. Use 'velvetg' to build a graph and assemble the reads into contigs. Specify an expected coverage of 21 (determined prior to this exercise based on the size of the E. coli genome and the number and length of reads). Selecting the '-read_trkg yes' and '-amos_file yes' options will generate a file with contig information that can be viewed with the graphical viewing software 'Tablet'.
$ velvetg velvet_output -exp_cov 21 -read_trkg yes -amos_file yes
11. In the terminal, change into the 'velvet_output' directory using 'cd'. Examine the 'contigs.fa' file using 'more'. This file contains all of the assembled contigs.
12. Plot the data in 'Tablet' using 'open velvet_asm.afg' or by double-clicking the file. Examine some of the contigs. How many contigs were produced, and what were their size range and average size? Record this information in your lab notebook.
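Steps 7 and 8 can be tried on a small made-up file before running them on the real data. The following is a minimal sketch, assuming the standard four-line FASTQ record layout; the file name 'toy_reads.fastq' and its two reads are invented for illustration and are not part of the 'e_coli' data.

```shell
# Build a tiny two-read FASTQ file in a temporary directory.
# (toy_reads.fastq and its contents are hypothetical, for illustration only.)
tmpdir=$(mktemp -d)
printf '@read1\nACGTACGTAC\n+\nIIIIIIIIII\n@read2\nTTGCATGCAA\n+\nIIIIIIIIII\n' \
    > "$tmpdir/toy_reads.fastq"

# wc -l counts lines; with four lines per record, reads = lines / 4.
line_count=$(wc -l < "$tmpdir/toy_reads.fastq")
read_count=$(( line_count / 4 ))
echo "reads: $read_count"            # prints "reads: 2"

# Read length: take line 2 (the first sequence), strip the trailing newline
# (the same job echo -n does in step 8), and count the remaining characters.
read_length=$(( $(sed -n '2p' "$tmpdir/toy_reads.fastq" | tr -d '\n' | wc -c) ))
echo "read length: $read_length"     # prints "read length: 10"

rm -r "$tmpdir"
```

Running the same two calculations on each of the real fastq files, and adding the two read counts, gives the totals asked for in steps 7 and 8.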
Because of the numerous gaps in our sequencing data, the assembly will produce multiple contigs. How many were produced and how many would a perfect assembly yield?
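The contig count and size range can also be obtained from the command line rather than read off in Tablet. The following is a sketch on an invented three-contig FASTA file ('toy_contigs.fa' is hypothetical); pointing the same commands at 'contigs.fa' should give the corresponding numbers for the real assembly, assuming the file holds at least one contig.

```shell
# Toy FASTA with three contigs; the second sequence is wrapped over two lines.
# (toy_contigs.fa and its contents are hypothetical, for illustration only.)
tmpdir=$(mktemp -d)
printf '>contig_a\nACGTACGT\n>contig_b\nACGTAC\nGTAC\n>contig_c\nACGT\n' \
    > "$tmpdir/toy_contigs.fa"

# Every FASTA record starts with a '>' header, so counting headers counts contigs.
contig_count=$(grep -c '^>' "$tmpdir/toy_contigs.fa")
echo "contigs: $contig_count"        # prints "contigs: 3"

# Add up the sequence characters of each record, then report min, max and mean.
stats=$(awk '
    /^>/ { if (n > 0) lens[n] = len; n++; len = 0; next }   # new record starts
         { len += length($0) }                              # sequence line
    END  {
        lens[n] = len                                       # store last record
        min = lens[1]; max = lens[1]; total = 0
        for (i = 1; i <= n; i++) {
            if (lens[i] < min) min = lens[i]
            if (lens[i] > max) max = lens[i]
            total += lens[i]
        }
        printf "%d %d %.1f\n", min, max, total / n
    }' "$tmpdir/toy_contigs.fa")
echo "min max mean: $stats"          # prints "min max mean: 4 10 7.3"

rm -r "$tmpdir"
```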
1. What information is contained in each line of the files? For example, every first line contains ..., every second line contains ..., etc.
2. What is the length of each read and how many reads are there in total between the two fastq files we used in the analysis? How many sequenced fragments does this correspond to?
3. What is the estimated coverage?
4. How many contigs were produced in the analysis?
5. What was the size range of the contigs (min, max, avg)?
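For question 3, coverage is commonly estimated as the total number of sequenced bases divided by the genome size. The sketch below shows only the arithmetic: the read count and read length are placeholders to be replaced with the values found in steps 7 and 8, and the genome size is an approximate figure for E. coli K-12 (about 4.6 million base pairs), so the printed number is not the answer for this data set.

```shell
# Placeholder inputs (hypothetical); substitute the values from your own analysis.
read_count=500000        # total reads across both fastq files
read_length=100          # bases per read
genome_size=4600000      # approximate E. coli K-12 genome size in base pairs

# coverage = (number of reads * read length) / genome size
# awk is used because plain shell arithmetic is integer-only.
coverage=$(awk -v n="$read_count" -v l="$read_length" -v g="$genome_size" \
    'BEGIN { printf "%.1f\n", n * l / g }')
echo "estimated coverage: ${coverage}x"   # prints "estimated coverage: 10.9x"
```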