Assignment 3

Due Date & Grading

  • Due date: September 8, 2020, 10am
  • Compile your answers in a text document
  • Upload your text document to CANVAS
  • Homework assignments account for 70% of your final grade.

:!: HINT: If the question asks for a command, write the full command as you would write it on the command line.

:!: HINT: You don't need to include the question in your write-up, just the answer.


EDIT 9/7/20: Question 5B should be: When you execute wc ce11_CDS.bed, what is the result? If you already turned it in the other way, that's not a problem.

Question 1

  • Let's download the C. elegans genome. Create a directory called celegans and navigate into it.
  • Download the ce11 version of the C. elegans genome from the UCSC Genome Browser with one of the following commands:
  • :!: Hint: This is the 11th version of the C. elegans genome, so ce11 is ce + one + one, NOT ce + little L + little L.
  • :!: Hint: don't forget the last period in that rsync command. It tells rsync to use your current directory as the target directory.
$ rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/ce11/chromosomes/ .
OR
$ wget --timestamping 'ftp://hgdownload.cse.ucsc.edu/goldenPath/ce11/chromosomes/*'
  • Not working? Click here for more help
  • A. What is the md5sum you obtain for the file chrI.fa.gz?
  • B. What command line would decompress all the .fa.gz files (in one command)? Execute the command.
  • C. Now what is the md5sum for the expanded file chrI.fa?
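The hint about the trailing period can be tried offline before running the real download. The sketch below is only an illustration: it uses cp instead of rsync so that no network is needed, and demo_src, demo_dst, and example.txt are made-up names. The point is the same in both tools: a lone "." as the target means the current working directory.

```shell
# Toy example: demo_src, demo_dst and example.txt are hypothetical names.
mkdir -p demo_src demo_dst
printf 'placeholder\n' > demo_src/example.txt

# Move into the destination directory, then copy with "." as the target.
# "." stands for the current working directory, which is demo_dst here.
cd demo_dst
cp ../demo_src/example.txt .

# The copied file is now listed in the current directory.
ls

# Return to the starting directory.
cd ..
```

If the period were left off, cp would complain about a missing destination, because no target directory was given.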

Question 2

If you did Question 1 correctly, you should have a directory containing individual fasta files for each C. elegans chromosome. Now, let's merge these individual chromosome files into one large genome fasta file.

  • A. What command line would concatenate all the fasta files into a genome fasta file called celegans_genome.fa? Execute the command.
  • B. Let's double check that the file celegans_genome.fa contains seven concatenated chromosomes. What (piped) set of commands would you execute to check that the file is composed of seven chromosomes?

Question 3

  • Use the file you made in Question 2 (above), celegans_genome.fa. Assume that the file blerg.jpg doesn't exist. What will be saved in the file output.txt when you execute each of the following commands?
A  $ wc celegans_genome.fa blerg.jpg > output.txt
B  $ wc celegans_genome.fa blerg.jpg 2> output.txt
C  $ wc celegans_genome.fa blerg.jpg &> output.txt

Question 4

  • Say your directory has the following contents:
$ ls -1
README.txt
celegans_genome.fa
chrI.fa
chrII.fa
chrIII.fa
chrIV.fa
chrM.fa
chrV.fa
chrX.fa
md5sum.txt
  • Explain what each step of the following piped command chain does:

#This one is for MAC people:

$ md5 *.fa | tail -n 7 | cut -d ' ' -f 4 

#This one is for WINDOWS people:

$ md5sum-lite *.fa | tail -n 7 | cut -d ' ' -f 1

Question 5

Let's make a bed file. Bed files are long lists of genome features in which each row in the file corresponds to a genomic region. The first column of each row lists the chromosome, the second column lists the start site, and the third column lists the stop site. The columns are tab delimited.
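To make the three-column layout concrete, here is a minimal sketch that builds a tiny bed file by hand and checks that its columns really are tab delimited. The file name toy.bed and all coordinates are made up for illustration; they are not taken from the GTF file used in this question.

```shell
# Toy example only: toy.bed and its coordinates are invented for illustration.
# printf turns \t into a real tab character, so the columns are tab delimited.
printf 'chrI\t100\t200\n'  > toy.bed
printf 'chrI\t350\t500\n' >> toy.bed
printf 'chrX\t20\t90\n'   >> toy.bed

# cut splits on tabs by default, so -f 1 prints the chromosome column.
cut -f 1 toy.bed

# awk with a tab field separator prints the number of columns in each row;
# sort -u collapses the repeats, leaving one value if every row matches.
awk -F '\t' '{ print NF }' toy.bed | sort -u
```

If the last command prints anything other than a single 3, some row is not split into exactly three tab-separated columns.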

Download a C. elegans gtf file using the following command:

$ wget 'http://129.82.125.224:34/Pangea-Web/onishlab/dsci510/ce11_annotation_ensembl_to_ucsc.gtf.gz'

OR

Just download it here: ce11_annotation_ensembl_to_ucsc.gtf.gz

Create a .bed file called ce11_CDS.bed for the genome locations OF JUST THE CODING SEQUENCES (listed as CDS in column 3 of the GTF file). Your .bed file should look like this if you peek into it using head:

$ head ce11_CDS.bed
chrV	1480	1579
chrV	1691	1782
chrV	2851	3036
chrV	5690	5966
chrV	6024	6508
chrV	7651	7818
chrV	7433	7609
chrV	7158	7384
chrV	6939	7110
chrV	7651	7818
  • A. What piped command line did you use to generate ce11_CDS.bed?
  • B. When you execute wc ce11_CDS.bed, what is the result?

Fun Stuff

What's wrong with wc?

  • Use wc to save a file containing word count information for one file.
  • Try to use cut to parse the word count information (in your saved file) into lines, words, and characters. It doesn't work.
  • Open the file in your text editor. Can you figure out why you couldn't parse columns using cut?
assignments/2020assignment3.txt · Last modified: 2021/06/01 15:06 (external edit)