Bioinformatics software is largely geared toward Linux. The Summit supercomputer provides us with a high-performance computing (HPC) environment that is well suited to handling large data sets and heavy computational workloads. There are some caveats to using a more complex system, however:
The first two challenges are addressed by gaining proficiency in a command-line environment, then building on that knowledge to request resources for your pipeline with special commands, and learning to write scripts that put those skills to work.
The third challenge is the most complicated because it requires a software development background and a lot of frustrating experience. Luckily, there are now tools that do this for us. This is where Conda comes in.
Conda is so named because it is written in the programming/scripting language Python, and many tools like to maintain the snake theme (although Python itself was named after Monty Python).
Although we are learning conda here for the Linux environment specific to Summit, it is also available on macOS and Windows, where it can be used through a graphical interface called Anaconda Navigator.
Forms of conda:
Initially, conda just provides some basic software, namely Python and associated programs.
As we use it, conda installs new software, such as open-source bioinformatics programs, and also figures out what other software is required to run them (these are called dependencies). Dependencies can include software written in languages other than Python, such as C and Perl.
Conda creates environments that you can switch between to control software versions and isolate conflicting setups from each other. It is also great for reproducible research because it provides an instant way for others to construct the same environment that you ran your code in.
Official info on conda: https://docs.conda.io/projects/conda/en/latest/.
We need to be on either a compute node (go to https://jupyter.rc.colorado.edu/hub/login in a web browser) or a compile node (ssh login.rc.colorado.edu -l youreid@colostate.edu, followed by ssh scompile).
$ pwd # you should be in /home/eID@colostate.edu
$ ls -alh # you may see a file called .condarc; if you do, follow the next section to make sure it has the right info.
$ nano .condarc
# then copy and paste the following in, changing "youreid" to your eid. You will NOT use youreid@colostate.edu here.
pkgs_dirs:
  - /projects/.colostate.edu/youreid/.conda_pkgs
envs_dirs:
  - /projects/.colostate.edu/youreid/software/anaconda/envs
# Exit out of nano using
# CTRL + X
# Type Y
# Return
$ more .condarc # do this to check your .condarc file
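If you would rather not paste into an editor, the same kind of file can be written with a here-document. This is only a sketch: `PROJECT_DIR` is a hypothetical placeholder for your own project directory, and the block writes to an example file (`condarc.example`) rather than your real `.condarc`, so you can inspect the result before copying it into place.

```shell
# Hypothetical placeholder: set PROJECT_DIR to your own project directory first.
PROJECT_DIR="${PROJECT_DIR:-/path/to/your/project/dir}"
OUT_FILE="${OUT_FILE:-condarc.example}"

# Write the two settings (package cache and environment location) as YAML
cat > "$OUT_FILE" <<EOF
pkgs_dirs:
  - ${PROJECT_DIR}/.conda_pkgs
envs_dirs:
  - ${PROJECT_DIR}/software/anaconda/envs
EOF

# Look at the result before copying it to ~/.condarc
cat "$OUT_FILE"
```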
This gives you access to the pre-installed conda (adds it to our PATH).
$ source /curc/sw/anaconda3/latest
Now that conda is in our PATH, we can run it like other programs.
$ conda
This gave us a list of available commands.
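To confirm that the shell can actually find conda, a quick check like the following may help. `command -v` is a standard shell builtin; the function name `check_conda` and the messages are just illustrative.

```shell
# Report whether conda can be found, and which version it is.
check_conda() {
    if command -v conda >/dev/null 2>&1; then
        echo "conda found"
        conda --version
    else
        echo "conda not found; run the source command above first"
    fi
}

check_conda
```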
Our first command will be `conda init` to make conda available in future sessions. `--dry-run` will tell us what files and directories will change. This is just to show you that `init` prepares files and directories, and where.
$ conda init --dry-run
Now run it without the `--dry-run`.
$ conda init
This changed your .bashrc to make conda available, and to install software in your directories rather than on the system.
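If you are curious what was added, `conda init` normally wraps its changes in a block marked with "conda initialize" comments, so you can search for those markers. A small sketch, assuming your startup file is `~/.bashrc`; the function name `show_conda_block` is hypothetical.

```shell
# Print the line numbers of the markers around the block that conda init manages.
show_conda_block() {
    # $1 = startup file to inspect (defaults to ~/.bashrc)
    file="${1:-$HOME/.bashrc}"
    if [ -f "$file" ] && grep -q "conda initialize" "$file"; then
        grep -n "conda initialize" "$file"
    else
        echo "no conda initialize block found in $file"
    fi
}

show_conda_block
```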
To make it apply to the current session, reload your .bashrc (typically with `source ~/.bashrc`), or log out and back in.
We can list all the virtual conda environments we can currently load:
$ conda env list
base        *  /curc/sw/anaconda3/2019.03
globus         /curc/sw/anaconda3/2019.03/envs/globus
idp            /curc/sw/anaconda3/2019.03/envs/idp
jupyterhub     /curc/sw/anaconda3/2019.03/envs/jupyterhub
py3.8          /projects/erinnish@colostate.edu/software/anaconda/envs/py3.8
The output shows us the default environments that the personnel at CU Boulder have kindly set up for us to use. The one we are currently using is marked by an asterisk. Note also that (base) shows up before your prompt, which is another indication that conda is active and working. The py3.8 environment is a custom environment we started in the DSCI510 class.
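Besides the asterisk and the prompt prefix, conda also records the active environment in shell variables. A short sketch using `CONDA_DEFAULT_ENV` and `CONDA_PREFIX`, which conda sets while an environment is active; the function name `show_active_env` is hypothetical.

```shell
# Print the name and location of the active conda environment, if any.
show_active_env() {
    echo "active environment: ${CONDA_DEFAULT_ENV:-none}"
    echo "installed at: ${CONDA_PREFIX:-not set}"
}

show_active_env
```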
Exercise: Let's build our own custom environment
We want to build a custom virtual environment for this class. To do so…
$ hostname # Ensure first that you're on an scompile node. It should say shas0136 or shas0137
$ conda create -n dsci512 python==3.8
$ conda env list
You should now see that a new virtual environment called dsci512 has appeared.
To navigate into your new environment, do this…
$ conda activate dsci512
$ conda env list # This shows you which environments are available and selected
$ conda list # This shows the software currently installed in your active environment
Yay! You should have your environment dsci512 installed and activated.
Exercise: Let's install software. For this class, we will need the software packages fastp, bwa, hisat2, bedtools, and samtools.
$ conda config --add channels conda-forge # If you get a warning, that's ok
$ conda install -c bioconda fastp bwa hisat2 bedtools samtools
y # type y when conda asks whether to proceed
$ fastp
$ bwa
$ hisat2
$ bedtools
$ samtools
$ conda list
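Rather than running each program by hand, a short loop can report which of the tools the shell can find. This is a sketch, not part of the class materials; the function name `check_tools` is hypothetical.

```shell
# Report, for each named program, whether it is on the PATH and where.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "ok: $tool -> $(command -v "$tool")"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    echo "$missing tool(s) missing"
}

check_tools fastp bwa hisat2 bedtools samtools
```

If a tool is reported missing, check that the dsci512 environment is active before reinstalling.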
$ cp /projects/dcking@colostate.edu/troubleshooting/fix_CSU.bash .
$ bash fix_CSU.bash
$ hisat2
-h/--help print this usage message
(ERR): hisat2-align exited with value 1
Yay! Your conda environment is successfully installed and activated.
OK, next time I start up, what do I need to do?
$ ssh scompile
(base) [tstark@colostate.edu@shas0136 ~]$
$ source /curc/sw/anaconda3/latest
$ conda activate dsci512
$ bwa
$ conda install -c bioconda <software_name_here>
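The start-up steps above can be wrapped in a small shell function so they are harder to forget. This is a sketch under assumptions: the function name `start_env` is hypothetical, and the setup path and environment name are simply the ones used earlier in these notes.

```shell
# Load conda from a setup file and activate an environment.
start_env() {
    # $1 = conda setup file to source, $2 = environment name
    if [ -f "$1" ]; then
        . "$1"                  # same effect as "source" in bash
        conda activate "$2"
    else
        echo "$1 not found; are you on a compile or compute node?" >&2
        return 1
    fi
}

# Example, using the path and environment from these notes:
#   start_env /curc/sw/anaconda3/latest dsci512
```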
rc-help@colorado.edu can assist you.