BUILDING HISAT2 GENOME INDEXES

References

Downloading the genome & building indexes

Because our project involves the yeast genome, we will now download the yeast genome and build HISAT2 indexes from it. The index files make it possible to use the yeast genome as the reference genome in a HISAT2 alignment. Building a reference index is a very common step in any next-generation sequencing alignment protocol.

To get started, we'll need to download some information on the yeast genome from a trusted repository.

To build the indexes, we'll need two things as input:

  • The genome as FASTA files, one for each chromosome
  • A name that we choose for our genome. (Since we're using Saccharomyces cerevisiae, version 3, we'll call it sc3.)

Because it is good practice to download all the genome information at the same time and from the same place, we will also download some additional S. cerevisiae genome files.


Starting our computational notebook

Before we begin, let's start our computational notebook.

  • Open a page in your notebook
  • Write the date
  • Give your project a name
  • Write a few sentences on what your current goal is.

Explore the yeast genome


Let's download the genome!

:!: EXERCISE: Download the yeast genome

  • Under your directory for this class, ~/DSCI512_RNAseq, you should see a directory called PROJ01_testsummit.
  • Make a new directory next to it called PROJ02_yeastGenome
  • Inside PROJ02_yeastGenome, make a new sub-directory called by_chrom
$ pwd   # should be ~/DSCI512_RNAseq
$ ls    # should be PROJ01_testsummit
$ mkdir PROJ02_yeastGenome
$ cd PROJ02_yeastGenome
$ mkdir by_chrom
$ cd by_chrom
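
The same directory setup can also be done in one step with mkdir -p, which creates any missing parent directories and does not complain if a directory already exists. This is only a minimal sketch: the CLASS_DIR variable is a hypothetical convenience that is not part of the course setup, and it defaults to ~/DSCI512_RNAseq.

```shell
# CLASS_DIR is a hypothetical variable; it defaults to the class directory used above.
class_dir="${CLASS_DIR:-$HOME/DSCI512_RNAseq}"

# -p creates PROJ02_yeastGenome and by_chrom together and is safe to re-run.
mkdir -p "$class_dir/PROJ02_yeastGenome/by_chrom"
cd "$class_dir/PROJ02_yeastGenome/by_chrom"
pwd   # should end in PROJ02_yeastGenome/by_chrom
```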

In this directory, let's download all the chromosomes of the yeast genome.

:!: Quick tip: Don't forget the period at the end of the rsync command line. It tells rsync to copy the files into the current directory.

# Make sure you're in the directory ~/DSCI512_RNAseq/PROJ02_yeastGenome/by_chrom
$ pwd # should be ~/DSCI512_RNAseq/PROJ02_yeastGenome/by_chrom
$ ls
$ rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/chromosomes/ .
$ ls -alh
 
# Check the md5 sums:
$ more md5sum.txt                  # This displays the md5sum file that UCSC included in the directory
$ md5sum *.fa.gz                   # This computes the md5sums of the files you downloaded
$ md5sum *.fa.gz > 181115_sums.txt # This saves the md5sums of the files you downloaded in a file
$ diff md5sum.txt  181115_sums.txt # This compares the md5sums you generated to the ones UCSC gave you.
                                   # If everything matches, diff prints nothing.
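
One caveat with the diff approach: diff compares the two files line by line, so it also reports differences if the provided md5sum.txt lists the files in a different order than your shell does. An alternative sketch uses md5sum -c (GNU coreutils), which reads a checksum file and re-checks every file it names, regardless of order. In by_chrom, before unzipping, you would simply run md5sum -c md5sum.txt. The block below demonstrates the behavior on a made-up file (chrDemo.fa.gz is hypothetical) in a temporary directory so it can be tried anywhere.

```shell
# Work in a throwaway directory so nothing in by_chrom is touched.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# chrDemo.fa.gz is a made-up stand-in for a downloaded chromosome file.
printf '>chrDemo\nACGTACGT\n' | gzip > chrDemo.fa.gz

# Stand-in for the md5sum.txt that comes with the download.
md5sum chrDemo.fa.gz > md5sum.txt

# Prints one OK or FAILED line per listed file; the exit status is 0 only if all match.
md5sum -c md5sum.txt
check_status=$?
```

A non-zero exit status means at least one file failed the check and should be downloaded again.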

:!: Notebook time: OK, now record in your computational notebook what you did. You can use the history command to review what you typed. Make sure you include the rsync command with its URL and mention that you checked the md5sums.


Unzip your files

Next, we need to unzip the .fa.gz files:

$ gunzip *.fa.gz

Let's build the indexes!

HISAT2 Manual

First let's write a script to build the indexes. We'll call it buildYeastIndexes.sh

# Start a script called buildYeastIndexes.sh
$ nano buildYeastIndexes.sh

Copy and paste this code into the program file to initiate it:

#!/usr/bin/env bash
 
#SBATCH --job-name=execute_hisat2-build
#SBATCH --nodes=1
#SBATCH --ntasks=6 # modify this number to reflect how many cores you want to use (up to 24)
#SBATCH --partition=shas-testing
#SBATCH --qos=testing     # modify this to reflect which queue you want to use. Options are 'normal' and 'testing'
#SBATCH --time=0:29:00   # modify this to reflect how long to let the job go. This indicates 29 minutes.
#SBATCH --output=log_hisat2-build_%J.txt

Now add some pseudocode, written as comments, about what we want to do:

#!/usr/bin/env bash
 
#SBATCH --job-name=execute_hisat2-build
#SBATCH --nodes=1
#SBATCH --ntasks=6 # modify this number to reflect how many cores you want to use (up to 24)
#SBATCH --partition=shas-testing
#SBATCH --qos=testing     # modify this to reflect which queue you want to use. Options are 'normal' and 'testing'
#SBATCH --time=0:29:00   # modify this to reflect how long to let the job go. This indicates 29 minutes.
#SBATCH --output=log_hisat2-build_%J.txt
 
# Load software:
 
# Build hisat2 indexes for S. cerevisiae:

Now let's add the command to load the software. Tailor it to your own <eID>:

#!/usr/bin/env bash
 
#SBATCH --job-name=execute_hisat2-build
#SBATCH --nodes=1
#SBATCH --ntasks=6 # modify this number to reflect how many cores you want to use (up to 24)
#SBATCH --partition=shas-testing
#SBATCH --qos=testing     # modify this to reflect which queue you want to use. Options are 'normal' and 'testing'
#SBATCH --time=0:29:00   # modify this to reflect how long to let the job go. This indicates 29 minutes.
#SBATCH --output=log_hisat2-build_%J.txt
 
# Load software:
source /scratch/summit/<eID>@colostate.edu/activate.bashrc
 
# Build hisat2 indexes for S. cerevisiae:

Now we need to build some indexes:

If we look in the HISAT2 manual under the entry for The hisat2-build indexer, we'll find the proper usage for building genome indexes:

USAGE:
hisat2-build [options] <reference_in> <ht2_base>

options: 
-p <num>             Number of threads to use. The number in this option MUST match the number in the line #SBATCH --ntasks=<num>

<reference_in>       This is a list of comma-separated .fa file names of all the chromosomes in your genome.
<ht2_base>           This is the name of your new genome, used as the base name of the index files. It'll just be sc3
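
One way to keep -p in step with --ntasks is to read the count from the environment rather than typing it twice. Slurm sets the SLURM_NTASKS variable inside a job when --ntasks is given. The sketch below is an assumption about how you might use it, not part of the course script: thread_count is a hypothetical helper that falls back to 6 when it is run outside a Slurm job.

```shell
# thread_count is a hypothetical helper: it prints the value of SLURM_NTASKS,
# or 6 if that variable is unset or empty (for example, outside a Slurm job).
thread_count() {
  echo "${SLURM_NTASKS:-6}"
}

threads=$(thread_count)

# Dry run: print the command line instead of running hisat2-build.
echo "hisat2-build -p $threads <reference_in> sc3"
```

Inside buildYeastIndexes.sh, the same idea would mean writing -p "$SLURM_NTASKS" in place of -p 6.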

OK, it might be easiest to write this line of code in a text editor and then copy-and-paste it over to your executable shell script.

So, first, it'll be just…

hisat2-build -p 6

Getting the list of chromosomes may be a little cumbersome, but this is my favorite trick:

$ ls -1 *.fa | sed -z 's/\n/,/g' > chr_list.txt

You can copy your list of chromosomes from the output file chr_list.txt and paste it into your scratch line of code. Make sure to remove the trailing comma, and make sure all the chromosomes are there.
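
If removing the trailing comma by hand feels error-prone, paste can join the names instead: with -s it merges all input lines into one, and -d, puts a comma between them, so no comma is left at the end. The sketch below runs in a temporary directory with three empty placeholder files so it can be tried anywhere; in by_chrom you would run only the printf and paste line.

```shell
# Throwaway directory with empty placeholder files (not real chromosomes).
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch chrI.fa chrII.fa chrM.fa

# printf prints one file name per line; paste -s joins the lines, -d, separates them with commas.
chr_list=$(printf '%s\n' *.fa | paste -sd, -)
echo "$chr_list" > chr_list.txt
cat chr_list.txt   # chrI.fa,chrII.fa,chrM.fa
```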

Then, just add the last element:

sc3

So, your final code in buildYeastIndexes.sh should look like this with your eID substituted:

#!/usr/bin/env bash
 
#SBATCH --job-name=execute_hisat2-build
#SBATCH --nodes=1
#SBATCH --ntasks=6 # modify this number to reflect how many cores you want to use (up to 24)
#SBATCH --partition=shas-testing
#SBATCH --qos=testing     # modify this to reflect which queue you want to use. Options are 'normal' and 'testing'
#SBATCH --time=0:29:00    # modify this to reflect how long to let the job go. This indicates 29 minutes.
#SBATCH --output=log_hisat2-build_%J.txt   #This will spit out a log file
 
# Load software:
source /scratch/summit/<eID>@colostate.edu/activate.bashrc
 
# Build hisat2 indexes:
hisat2-build -p 6 chrI.fa,chrII.fa,chrIII.fa,chrIV.fa,chrIX.fa,chrM.fa,chrV.fa,chrVI.fa,chrVII.fa,chrVIII.fa,chrX.fa,chrXI.fa,chrXII.fa,chrXIII.fa,chrXIV.fa,chrXV.fa,chrXVI.fa sc3

Run it!!!

$ sbatch buildYeastIndexes.sh
 
# To check on it:
$ squeue -u $USER

:?: Did it work??? Check it! You should have a log file and eight new files with names like sc3.1.ht2, sc3.2.ht2, and so on up to sc3.8.ht2.
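
A quick way to answer that question from the command line is to test for each expected file. This is a minimal sketch: check_index is a hypothetical helper, and it assumes the index consists of the eight files <base>.1.ht2 through <base>.8.ht2. The demo at the end only creates placeholder files in a temporary directory so the block can be run anywhere; in your own directory you would call check_index sc3 directly.

```shell
# check_index BASE: report any missing or empty BASE.N.ht2 file (N = 1..8).
# Returns 0 if all eight files are present and non-empty, 1 otherwise.
check_index() {
  base=$1
  missing=0
  for i in 1 2 3 4 5 6 7 8; do
    if [ ! -s "$base.$i.ht2" ]; then
      echo "missing or empty: $base.$i.ht2"
      missing=1
    fi
  done
  return "$missing"
}

# Demo with placeholder files (not real indexes) in a temporary directory.
demo_dir=$(mktemp -d)
for i in 1 2 3 4 5 6 7 8; do echo placeholder > "$demo_dir/sc3.$i.ht2"; done
check_index "$demo_dir/sc3" && echo "all eight index files found"
```

A file that exists but is empty is reported too, since that usually means the build was interrupted.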


:!: Notebook time!: Write down what you did. Some other information you'll want to include in your notebook:

Check your build

$ hisat2-inspect -s sc3

Check your version

$ hisat2 --version

Download other genome files

wiki/hisat2build.txt · Last modified: 2018/11/15 11:37 by erin