
Bioinformatics Frameworks Part 3: Snakemake

Introduction

One of the biggest changes within the field of bioinformatics is the rise of the framework. Frameworks are the professionalization of analysis pipelines, constructed with software development and data science in mind. In this series of blog posts I discuss a few that have crossed my path and try to explain the essence of their design philosophy. In part three, I return to pipeline frameworks and focus on Snakemake, the modern Make alternative.

Background

In the first part of this series I described the previous state of computational frameworks for bioinformatics pipelines. A good start for making pipelines robust and reproducible was to work with Make. A downside of Make is that it was not designed with data analysis in mind; it is mostly used for compiling and installing software on POSIX systems. Another problem is that it is absolutely ancient: it is about 46 years old! Age by itself is not an issue, but when Make was introduced, very different programming paradigms were in place. Programming back then was done almost exclusively by computer scientists, and concerns like memory conservation were paramount. You notice this in how Make works: its manual is scarily long and even contains a section named "How to read this manual", which is a clear red flag when it comes to user-friendliness. When problems arise, you often get cryptic error messages, much vaguer than what we are used to now.
Make sort of clicks into Bash, meaning you can run basic Bash commands within Make when running data analyses. But it is by no means a direct connection; often you don't know where Make syntax or nomenclature ends and Bash begins. This is particularly annoying if you want to use the basic text manipulation tools that make Bash and POSIX systems so useful: you expect that applying a Bash command in a certain way will lead to the result you hope for, but instead it leads to unexpected Make behavior. And lastly, Make has no resource management built in, other than letting you specify how many concurrent jobs to execute.
All this, combined with the current shift toward more modern and user-friendly programming languages for this age of Big Data, has led several developers to look for a replacement for Make.

Snakemake

Enter Snakemake. Snakemake is developed by Johannes Köster and was first introduced in 2012. Köster's paper was published in Bioinformatics, which signals that the language was developed with bioinformatics data analysis in mind.
Snakemake takes all the good stuff from Make and transports it into the 21st century. Since the language is based on the rule-and-target structure, it is as robust and reproducible as Make. It clicks into the Bash shell but clearly retains its Python background and simplicity. It formulates the jobs your system has to run based on the directed acyclic graph (DAG) it generates from the rules and targets in the Snakemake file. A DAG is a representation of the relationships between the files and tools within your data analysis; it basically shows which steps your data has to go through from start to finish. The DAG can easily be converted to an image to create an overview of what your data analysis consists of. A very nice addition to any data analysis report!
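For example, assuming you have Graphviz installed, rendering the DAG of a workflow to an image is a one-liner on the command line:

# Emit the DAG in dot format and render it to SVG with Graphviz.
snakemake --dag | dot -Tsvg > dag.svg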

The code

Inside the Snakemake file (the Snakefile), you find a modified Python structure. The following code example produces the DAG shown in the previous section (and is itself based on the tutorial example from the Snakemake manual):

SAMPLES = ["A", "B"]

rule all:
    input:
        "calls/all.vcf"

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"

rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam"
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"

rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    shell:
        "samtools index {input}"

rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
    output:
        "calls/all.vcf"
    shell:
        "samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"

In this example, the Snakemake syntax is clear and visibly different from regular Python. In things like "rule all" you can see the Make influences. Furthermore, what input and output represent is obvious, and shell is a direct transfer to the Bash shell. But at any point you can deviate from this syntax and make things Pythonic! As the bcftools_call rule shows, you can use expand() to generate lists of sample files. A nice upgrade would be to switch the sample definition from hardcoded to a directory detection method:

from os.path import join
 
# Globals ---------------------------------------------------------------------
 
# Full path to a folder that holds all of your FASTQ files.
FASTQ_DIR = './fastq/'
 
# A Snakemake regular expression matching the forward mate FASTQ files.
SAMPLES, = glob_wildcards(join(FASTQ_DIR, '{sample,Samp[^/]+}.R1.fastq.gz'))
 
# Patterns for the 1st mate and the 2nd mate using the 'sample' wildcard.
PATTERN_R1 = '{sample}.R1.fastq.gz'
PATTERN_R2 = '{sample}.R2.fastq.gz'

In this example you import a Python library to handle filesystem paths and use Snakemake's glob_wildcards() with a regex-constrained wildcard. With this code you can match files in a predefined folder that follow a specific sample name pattern. By tapping into Python's vast ecosystem of packages you can go well beyond what Make can manage, and without the hassle.
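To illustrate how these globals would feed into an actual rule, here is a minimal sketch; the rule name and the bwa/samtools command line below are placeholders of my own, not part of the original example:

# Hypothetical mapping rule consuming the patterns defined above.
rule bwa_map:
    input:
        r1=join(FASTQ_DIR, PATTERN_R1),
        r2=join(FASTQ_DIR, PATTERN_R2)
    output:
        'mapped_reads/{sample}.bam'
    shell:
        'bwa mem data/genome.fa {input.r1} {input.r2} | samtools view -Sb - > {output}'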

Bonus Features

There are a lot of extra features in Snakemake that make the language even more interesting to adopt. The first example is the intuitive inclusion of YAML/JSON config files. You can use config files for environment specifications, such as defining tool paths, pipeline parameters, and result storage locations. You can also define samples within a config file and use them as input parameters, as shown in the next example:

configfile: "config.yaml"

rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=config["samples"])
    output:
        "calls/all.vcf"
    shell:
        "samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"

Command-line and tool logging is extensive and handled natively. You can use the logs and other results in the automatic report generation tool, which is awesome! You can write and include your own text, formatted in the reStructuredText markup language. You can import files into the report, make tables, and render images. All of this is parsed into a single monolithic HTML file, which is easily converted to PDF.
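For instance, after a completed run, newer Snakemake releases can produce such a self-contained HTML report with a single flag:

# Generate a self-contained HTML report of the finished workflow.
snakemake --report report.html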
And lastly, the feature that I wanted to put in the spotlight: Snakemake has integrated package management! As I mentioned in the previous post of this series, package and tool management is becoming the way forward for bioinformatics and data analysis in general. Snakemake has a great implementation of such a system, using Bioconda. If you specify --use-conda when invoking the Snakefile, and you have defined environment config files, Snakemake will sort everything out for you! Especially if you combine this with Python virtual environments for predefining workflows, installation and version management is an absolute breeze.
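As a sketch of how this looks in practice, a rule can point to a conda environment file via the conda directive (the file name envs/samtools.yaml and the pinned version below are my own assumptions):

# In the Snakefile: attach a conda environment to a rule.
rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    conda:
        "envs/samtools.yaml"
    shell:
        "samtools index {input}"

The referenced envs/samtools.yaml then describes the environment:

channels:
  - bioconda
  - conda-forge
dependencies:
  - samtools =1.9

Running snakemake --use-conda will then create and activate this environment automatically for that rule.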

Conclusion

If you are used to Make for either installing your software or running your data analyses, switching to Snakemake is a breath of fresh air. With all the advantages mentioned in this blog post, I don't know what other arguments you need to start using it right now!