Bioinformatics Frameworks Part 2: BioConda and BioContainers

Introduction

One of the biggest changes within the field of bioinformatics is the rise of the framework. Frameworks are the professionalization of analysis pipelines, constructed with software development and data science in mind. In this series of blog posts I discuss a few that have crossed my path, and try to explain the essence of their design philosophy. In part 2, I deviate a bit from the concept of frameworks but still remain on-topic: I will focus on BioConda and BioContainers, two tools for hassle-free application management.

Background

Bioinformatics is a field in constant motion. Unlike most computer science fields, every few years we change the way we look at our data. For example, in a span of ten years we went from control-relative mRNA expression measured from microarrays to RNA-Seq based FPKM values (Fragments Per Kilobase of transcript per Million mapped fragments), and from there to TPM (Transcripts Per Million) values as estimated by tools such as RSEM (RNA-Seq by Expectation Maximization). While all these metrics capture the same phenomenon, namely gene expression, they are incompatible and cannot be used interchangeably.
The same applies to data from different parameters or versions of certain tools. If a version of a tool becomes obsolete, it often becomes unsupported and unavailable from its maintainer. If you want to do validations, you have to keep track of all these tool versions and test how they influence your results.
After that you have the system administrator aspect: some software packages are notorious for their dependency on other toolkits. If you update one of the other tools, who is to say that the package will still work?
And don’t get me started on different normalizations of the same data type or changes within the workflow; these variations are practically endless.
Considering all these aspects, one of the key issues of managing a data analysis pipeline is keeping it functional and up to date, while still maintaining compatibility with previous versions of data and software. Some attempts have been made in the past, like using Red Hat’s Yum or Debian’s and Ubuntu’s APT for installing tools and their dependencies. But the bioinformatics packages in these repositories are often not well maintained, and only a limited set of versions is available to install within a given Linux release.

Bioconda

Enter Bioconda, a package repository built on Anaconda’s conda package manager. Strictly speaking, Bioconda is a channel: a separate collection of packages that conda can install from. Anaconda and conda used to focus on managing Python libraries and packages, but they have branched out, first to the R language and now to basically any tool you can think of. The packages within Bioconda are essentially complete products: each one declares the tool you are looking for together with all its dependencies, which conda installs along with it. Packages are also versioned and can be installed into separate conda environments that function independently of each other, so you can have isolated workflows with the same tools but with different versions.


Installing tools via Bioconda is easy: install conda and configure the Bioconda channel. After that, install a tool much as you would with apt-get:

conda install bwa

and presto, BWA is installed.
The backend of the Bioconda system is its GitHub repository. Here, people submit simple scripts called recipes, which detail how a certain tool should be built and installed. The Bioconda build system turns each recipe into a ready-made package and uploads it to the channel, so if you request a Bioconda install like the BWA example, conda looks for a matching package in the channel, downloads it together with its dependencies, and gets going. Versioning of the tools and dependencies is also done via GitHub; I will give an example of that later on. There are already about 3000 recipes on Bioconda, making it arguably the most complete bioinformatics repository, while also being one of the youngest!
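The channel setup and install described above can be sketched as a small script. This is a minimal sketch, assuming conda is already installed. The environment name bwa-env and the helper functions are hypothetical, and the channel order is the one I understand the Bioconda documentation to recommend, so check it against the current docs. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: configure the Bioconda channel and install BWA into its own environment.
# By default the commands are only printed; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

setup_bioconda() {
    # "--add" puts a channel at the top of the list, so the channel added
    # last (conda-forge) ends up with the highest priority.
    run conda config --add channels bioconda
    run conda config --add channels conda-forge
    run conda config --set channel_priority strict
}

install_tool() {
    # $1 = environment name (hypothetical), $2 = package spec, e.g. "bwa"
    run conda create --yes --name "$1" "$2"
}

setup_bioconda
install_tool bwa-env bwa
```

Installing into a named environment instead of the base installation keeps BWA and its dependencies apart from other tools; after conda activate bwa-env the bwa command is on the path.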

BioContainers

While Bioconda facilitates the management of tools and packages, executing them can still be a hassle. You can imagine that one workflow requires one version of a tool and the next workflow requires another version. Within a single environment, you have to remove and reinstall different versions on the fly, and you cannot run these workflows simultaneously.


Next to that is the system management problem. If you have many workflows or many users with different bioinformatics needs, then even with Bioconda you will at some point need a full-time system manager to keep the calculation server from crashing.
BioContainers solves that problem neatly. It is a community project that packages bioinformatics tools as Docker containers, and Docker is popular these days because of its convenience. Docker creates containers: local, isolated execution environments where you can install, run and change tools without affecting the overall operating system. A container is like a virtual machine without the machine emulation part. When the container has finished or the user is done, the container can be suspended or removed entirely. You can keep track of specific types of workflows or tool executions within a container (which you can back up, for example), or rebuild it from scratch with a simple Dockerfile script.
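To make the container life cycle described above concrete, here is a minimal sketch that starts a throwaway container from the BioContainers base image. It assumes Docker is installed and that the image provides /bin/bash and uses /data as its working directory. The helper functions are hypothetical, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: start a disposable container from the BioContainers base image.
# By default the commands are only printed; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

IMAGE="biocontainers/biocontainers:latest"

start_container() {
    # Download the image if it is not available locally yet.
    run docker pull "$IMAGE"
    # --rm deletes the container when the shell exits;
    # -v mounts the current directory at /data inside the container.
    run docker run --rm -it -v "$PWD:/data" "$IMAGE" /bin/bash
}

start_container
```

Anything installed inside this shell disappears together with the container, while files written to /data remain in the mounted directory on the host.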

In The Mix

When combining Bioconda and BioContainers, you see the greatness of this type of application management. I will give a personal example: I found that a specific version of BCFtools (a tool for manipulating variant call files in the VCF format and its binary counterpart BCF, used here to filter mutations) gives errors in a variant calling workflow. Normally, I would need to remove and reinstall BCFtools versions on my system until I found which version works best. Now, using Bioconda and BioContainers, I can debug my workflow using this Dockerfile:

# Base Image
FROM biocontainers/biocontainers:latest
 
################## METADATA ######################
 
LABEL base.image="biocontainers:latest"
LABEL version="1"
LABEL software="bwa-samtools-bcftools"
LABEL software.version="1.0"
LABEL description="combined toolset analysis"
LABEL website="https://"
LABEL documentation="https://"
LABEL license="https://"
LABEL tags="Genomics"
 
################## MAINTAINER ######################
 
# The MAINTAINER instruction is deprecated; a label carries the same information
LABEL maintainer="Sander Bervoets"
 
################## INSTALLATION ######################
 
# -y skips conda's confirmation prompt, which a build cannot answer
RUN conda install -y bwa
RUN conda install -y samtools
RUN conda install -y bcftools=1.5
 
WORKDIR /data/
 
CMD ["bwa"]

As you can see from the build script, I build an image from the base BioContainers image, and install BWA, SAMtools and BCFtools within it using (Bio)conda. To alter the version of BCFtools, I simply append “=1.5” to the package name in the BCFtools installation command, and only that version will be available to my currently executed workflow.
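Trying several BCFtools versions then comes down to building one image per version. The sketch below assumes the Dockerfile above is saved as Dockerfile in the current directory. The image name variant-debug and the version numbers are placeholders, so check which versions Bioconda actually offers. The script only reports what it would build unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: build one image per BCFtools version from the Dockerfile above.
# By default nothing is built; set DRY_RUN=0 to run the builds.
DRY_RUN="${DRY_RUN:-1}"

# Rewrite the pinned BCFtools version in a Dockerfile read from stdin.
pin_bcftools() {
    sed "s/bcftools=[0-9][0-9.]*/bcftools=$1/"
}

build_version() {
    version="$1"
    dockerfile="${2:-Dockerfile}"
    if [ "$DRY_RUN" = "1" ]; then
        echo "would build variant-debug:$version with bcftools=$version"
    else
        # "docker build -" reads the Dockerfile from stdin; no build context
        # is needed because this Dockerfile copies no local files.
        pin_bcftools "$version" < "$dockerfile" |
            docker build -t "variant-debug:$version" -
    fi
}

# Placeholder versions to compare.
for v in 1.4 1.5 1.6; do
    build_version "$v"
done
```

Each image can then be checked on its own, for example with docker run --rm variant-debug:1.6 bcftools --version, and the workflow can be rerun against every tag without touching the tools installed on the host.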

Conclusion

Using Bioconda and BioContainers, a bioinformatician only needs a decent system with a decent internet connection and a dataset. All the rest will be installed, managed and executed on the fly, without investing too much energy into management or risking dependency hell!
These tools fit perfectly within a bioinformatics framework environment, because they solve the problems of versioning, installing and running tools, giving the user more time to focus on the pipeline needed for their research. But it is important that your framework takes advantage of these tools.