CONDA for CSU users on SUMMIT

Software Installation on SUMMIT

Bioinformatics software is largely geared toward Linux. The SUMMIT supercomputer system provides us with a high-performance computing (HPC) environment that is well suited to large data sizes and heavy computational workloads. There are some caveats to using a more complex system, however:

  1. Developing your pipeline must be done in a Jupyter notebook, a specially allocated shell, or a script.
  2. Scripting is required for scaling up your pipeline.
  3. Open source software requires special configuration that causes conflicts on a shared system (person A's configuration for program X might be incompatible with person B's program Y).

The first two challenges are addressed by gaining proficiency in a command-line environment, then learning the special commands that request resources for your pipeline, and finally learning to write scripts that put those skills to work.

The third challenge is the most complicated because it requires a software development background and a lot of frustrating experience. Luckily, there are now tools out there to do this for us. This is where Conda comes in.

(Removed the Summit description because it will be covered elsewhere)

What is "conda"?

Conda is so-named because it is written in the programming/scripting language Python, and many tools like to maintain the snake theme (although Python was named after Monty Python).

Although we are learning conda here for the Linux environment specific to Summit, it is also available on macOS and Windows, where it can be used through a graphical interface called Anaconda-navigator.

Forms of conda:

  1. Miniconda : a free minimal installer for conda. It is a small, bootstrap version of Anaconda that we will build on as needed.
  2. Anaconda : more than we need. It comes with a lot of pre-installed (750+) libraries and software. We will start from the base (Miniconda) and add only what we need.
  3. Anaconda-navigator : a graphical alternative to the command line (not available for Summit, though).
  4. Specialty - Summit will have a slightly different way to provide access to a pre-installed version.

What does "conda" actually do?

Initially, conda just provides some basic software, namely Python and associated programs.

As we use it, conda installs new software, such as open-source bioinformatics programs, and also figures out what other software is required to run them (these are called dependencies). Dependencies can include software written in languages other than Python, such as C, Perl, etc.

Conda creates environments that you can switch between to control software versions and isolate conflicting setups from each other. It is also great for reproducible research because it provides an instant way for others to construct the same environment that you ran your code in.
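As a rough illustration of the reproducibility point, an environment can be described in a small YAML file that someone else can rebuild from. The sketch below writes such a file by hand. The file name `environment.yml` is only a common convention, and the environment name and package list are borrowed from the exercises later on this page, so treat them as placeholders. `conda env create -f` and `conda env export` are standard conda commands, but they are left as comments so the snippet runs even where conda is not yet set up.

```shell
# Write a minimal environment description (contents are illustrative).
cat > environment.yml <<'EOF'
name: dsci512
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.8
  - fastp
  - bwa
  - hisat2
  - bedtools
  - samtools
EOF

# Show what was written.
cat environment.yml

# Someone else could then rebuild the environment with:
#   conda env create -f environment.yml
# An existing, active environment can be captured with:
#   conda env export > environment.yml
```

Sharing this one file alongside your scripts is usually enough for a collaborator to recreate the same set of tools.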

Official info on conda:

1. Log on to Summit

We either need to be on a compute node (goto in a web browser), or a compile node (ssh -l, followed by ssh scompile).

2. Create a new config file (or check a pre-existing one)

$ pwd           # you should be in /home/
$ ls -alh       # you may see a file called .condarc
  • you may already see a .condarc; if you do, follow the next section to make sure it has the right info.
  • if you don't have one, you will add the lines below.
$ nano .condarc
# then copy and paste the following in, changing "youreid" to your eid. You will NOT use here.
  - /projects/
  - /projects/
# Exit out of nano using
# CTRL + X
# Type Y (to save your changes)
# Return
$ more .condarc     # do this to check your .condarc file
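The check in this step can also be wrapped in a small helper. This is a minimal sketch: `show_condarc` is a hypothetical function name, not part of conda. The commented `conda config --show` line uses the standard conda settings `pkgs_dirs` and `envs_dirs`, which is an assumption about which settings this file is meant to control.

```shell
# Report whether a conda config file exists and print it if so.
# Usage: show_condarc [path]   (defaults to ~/.condarc)
show_condarc() {
    condarc="${1:-$HOME/.condarc}"
    if [ -f "$condarc" ]; then
        echo "Found $condarc:"
        cat "$condarc"
    else
        echo "No config file at $condarc yet"
        return 1
    fi
}

show_condarc || echo "Create it with nano as described above."

# Once conda is on your PATH, it can report the directories it will use
# (assumed setting names, see the note above):
#   conda config --show pkgs_dirs envs_dirs
```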

3. Initialize conda

This gives you access to the pre-installed conda (adds it to our PATH).

$ source /curc/sw/anaconda3/latest

Now that conda is in our PATH, we can run it like other programs.

$ conda

This gave us a list of available commands.

Our first command will be `conda init` to make conda available in future sessions. `--dry-run` will tell us what files and directories will change. This is just to show you that `init` prepares files and directories, and where.

$ conda init --dry-run

Now run it without the `--dry-run`.

$ conda init 

This changed your .bashrc to make conda available, and to install software in your own directories rather than system-wide.

To make it apply to the current session,

  • in the terminal, log out of scompile (exit), and ssh scompile again.
  • in jupyterhub, start another terminal.
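After starting the new session, a quick way to confirm that conda is visible is to ask the shell where the command lives. The sketch below uses only the standard `command -v` builtin and `conda --version`. The variable name `conda_status` is made up for illustration.

```shell
# Check whether the conda command is visible in this session.
if command -v conda >/dev/null 2>&1; then
    conda_status="conda is available: $(conda --version)"
else
    conda_status="conda not found; try: source /curc/sw/anaconda3/latest"
fi
echo "$conda_status"
```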

4. Next steps

We can list all the virtual conda environments we can currently load:

$ conda env list
base                   * /curc/sw/anaconda3/2019.03
globus                   /curc/sw/anaconda3/2019.03/envs/globus
idp                      /curc/sw/anaconda3/2019.03/envs/idp
jupyterhub               /curc/sw/anaconda3/2019.03/envs/jupyterhub
py3.8                    /projects/

The output shows us the default environments that the personnel at CU Boulder have kindly set up for us to use. The one we are currently using is marked by an asterisk. Note also that (base) shows up before your prompt, which is another indication that conda is active and working. Further note that the py3.8 environment is a custom environment we started in the DSCI510 class.

:!: Exercise: Let's build our own custom environment

We want to build a custom virtual environment for this class. To do so…

$ hostname       # Ensure first that you're on an scompile node. It should say shas136 or shas137
$ conda create -n dsci512 python==3.8
$ conda env list

You should now see that a new virtual environment called dsci512 has appeared.

To navigate into your new environment, do this…

$ conda activate dsci512
$ conda env list # This shows you which environments are available and selected
$ conda list  # This shows the software currently installed in your active environment

:-) - Yay! you should have your environment dsci512 installed and activated.
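Besides the asterisk in `conda env list`, the active environment is also exposed in the `CONDA_DEFAULT_ENV` variable, which `conda activate` sets. The sketch below reads it. `check_env` is a hypothetical helper name, and it only reports the state; it does not activate anything.

```shell
# Report whether the wanted environment is the active one.
# Usage: check_env [name]   (defaults to dsci512)
check_env() {
    wanted="${1:-dsci512}"
    active="${CONDA_DEFAULT_ENV:-none}"   # "none" when conda is not active
    if [ "$active" = "$wanted" ]; then
        echo "$wanted is active"
    else
        echo "active environment is '$active'; run: conda activate $wanted"
        return 1
    fi
}

check_env dsci512 || true
```

A check like this can be handy at the top of a script, so it stops early if the wrong environment is active.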

:!: Exercise: Let's install software. For this class, we will need the software packages: fastp, bwa, hisat2, bedtools, and samtools

  • First, let's make sure we have access to the conda-forge repository of software (online):
$ conda config --add channels conda-forge
# If you get a warning, that's ok
  • Next, go ahead and install the software we need:
$ conda install -c bioconda fastp bwa hisat2 bedtools samtools
  • You will be prompted whether you want to install the dependency packages. Type y
  • Go ahead and see whether the software you requested was installed successfully. If installed successfully, you should see the usage descriptions. If they weren't installed successfully, you will get an error message.
$ fastp
$ bwa
$ hisat2
$ bedtools
$ samtools
$ conda list
  • Likely, you were able to install everything with the exception of hisat2. There is a coding error in hisat2 that doesn't like the '@' symbol in your user name. So that's a bit of a bug. David wrote some code to work around this.
$ cp /projects/ .
$ bash fix_CSU.bash
$ hisat2
  • If you see the error message below, that's OK. As long as you see the rest of the usage info, it should work.
  -h/--help          print this usage message
(ERR): hisat2-align exited with value 1

:!: Yay! You have your conda environment successfully installed and activated
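Rather than typing each program name in turn, the check from the exercise above can be done in one loop. This is a minimal sketch that only tests whether each command is on your PATH; it does not run the tools, so it will not show their usage descriptions. The variable name `missing` is made up for illustration.

```shell
# Report which of the class tools are on the PATH of the current session.
missing=0
for tool in fastp bwa hisat2 bedtools samtools; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: $(command -v "$tool")"
    else
        echo "$tool: not found"
        missing=$((missing + 1))
    fi
done
echo "$missing of 5 tools missing"
```

If a tool is reported as not found, first make sure the dsci512 environment is active before re-installing anything.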

3. conda cheat sheet

:?: OK, next time I start up, what do I need to do?

  • Next time you log into SUMMIT, first check whether conda has started on its own.
$ ssh scompile
 (base) [ ~]$ 
  • If you don't see that '(base)' tag, initiate conda…
$ source /curc/sw/anaconda3/latest
  • If you do see the '(base)' tag, go ahead and activate your preferred environment:
$ conda activate dsci512
$ bwa
  • Now you are ready to start working with any of your already-installed programs. You won't need to re-install bwa, fastp, etc. They should just work.
  • If you want to install something new at this point into dsci512, just run this one line of code:
$ conda install -c bioconda <software_name_here>
  • There are over 7,000 packages ready to install. To search the available packages, see bioconda.
  • If you run into problems using conda or installing software, can assist you.
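The start-up steps in this cheat sheet could be collected into one helper, for example in your .bashrc. This is a sketch under assumptions: `start_env` is a hypothetical function name, and it assumes that sourcing the setup script from this page is enough to make `conda activate` work in the current shell. If `conda activate` complains that the shell is not initialized, run `conda init` as in step 3 first.

```shell
# Hypothetical helper: make conda available if needed, then activate an environment.
# Usage: start_env [name]   (defaults to dsci512)
start_env() {
    env_name="${1:-dsci512}"
    setup_script="/curc/sw/anaconda3/latest"   # path used on this page

    # Load conda only if the command is not already available.
    if ! command -v conda >/dev/null 2>&1; then
        if [ -f "$setup_script" ]; then
            . "$setup_script"                  # "." is the same as "source"
        else
            echo "cannot find $setup_script; are you on Summit?"
            return 1
        fi
    fi
    conda activate "$env_name"
}

# Typical use after logging in:
#   start_env dsci512
```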

4. References for today

wiki/conda_for_csu_users.txt · Last modified: 2021/06/01 15:06 (external edit)