nVidia CUDA Bioinformatics: BarraCUDA
Today, we will continue our journey into the current state of the art of bioinformatics tools that make use of nVidia’s CUDA API. I will discuss specific bioinformatics problems, the default tools used in the industry, and the possible Graphics Processing Unit (GPU) based alternatives. If you want to read more about the history and purpose of CUDA-based bioinformatics, and why I’m looking at it, please read this introduction blog post.
In this blog post I will touch upon one of the most pertinent Next Generation Sequencing (NGS) computational bottlenecks: short read alignment. What is it, why is it so computationally heavy, and what processing options are there?
As mentioned in the CUDA introduction, one of the biggest sources of biological data these days are the datasets generated by Next Generation Sequencing, namely the ones produced by Illumina. These sequencers scan the DNA of an organism (mostly humans) by reading short DNA sequences, nucleotide by nucleotide. At the end of a sequencing run, we have the exact readout of hundreds of millions of short sequences, randomly sampled from the analyzed genome. To find things like pathogenic mutations in your DNA, these millions of short reads have to be “aligned” to their original locations on a reference genome. Computationally, this is a massive undertaking; the human genome is 3 billion nucleotides long, and the sequence datasets can be terabytes of raw text data, all of which matches somewhere on that genome.
Heng Li and BWA
This computational problem had to be solved before NGS, and genomics in general, could become useful on a large scale. Several tools were created in the early years of NGS (around 2008), but a breakthrough approach was devised by bioinformatics scientist Heng Li. Li is one of my personal bioinformatics heroes; he was a decisive figure in the formative years of NGS bioinformatics. While working at BGI and being involved with the 1000 Genomes Project, he created many of the tools (BWA, Samtools, Bioawk, Seqtk) and conventions (the SAM/BAM file structures, MAPQ scoring) that still define our day-to-day workflows. One of his earlier tools, BWA ALN, applies the Burrows-Wheeler Transform and the Smith-Waterman algorithm to, respectively, transform text into a compressed, indexable form and to match text within a long string (or chromosome) by extending a short seed into a full word (or DNA sequence). The combination of these principles, in a massively parallel execution, proved that aligning many short DNA or RNA reads against a genome reference can be done in relatively little time and on regular computers, instead of the hulking beasts of servers or grids that were needed before that point.
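To make those two principles concrete, here is a toy Python sketch, not BWA’s actual implementation (which uses a compressed FM-index with sampled occurrence tables), of the Burrows-Wheeler Transform and the backward search it enables:

```python
from bisect import bisect_left

def bwt(text):
    """Naive Burrows-Wheeler Transform: sort every rotation of the
    '$'-terminated text and keep the last column."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rotations)

def fm_count(text, pattern):
    """Count exact occurrences of pattern via backward search on the
    BWT (the core of the FM-index). Real aligners precompute the
    occurrence counts; here they are recomputed for clarity."""
    last = bwt(text)
    first = "".join(sorted(last))
    # C[c] = number of characters in the text strictly smaller than c
    C = {c: bisect_left(first, c) for c in set(last)}
    lo, hi = 0, len(last)
    for c in reversed(pattern):          # match the pattern back to front
        if c not in C:
            return 0
        lo = C[c] + last[:lo].count(c)
        hi = C[c] + last[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo

print(bwt("BANANA$"))            # ANNB$AA
print(fm_count("BANANA$", "ANA"))  # 2
```

Seed-and-extend aligners run this backward search on a short seed of each read and then extend the candidate hits with Smith-Waterman to score the full alignment.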
Even though BWA was fast at the time, the amount of sequencing data per run has increased a thousandfold since then. By now, even with hundreds of CPU cores available, the alignment step in a typical genome diagnostics pipeline takes hours and creates bottlenecks.
When thinking about short read alignment, one can easily see that a GPU is better suited for this workload than a CPU: a GPU has many more cores (both shaders and clusters of them), and reads are short enough that batch sizes can be tweaked granularly to fit the smaller caches and memory found on modern graphics cards.
That must have gone through the mind of William Langdon, one of the creators and the main developer of the alignment tool BarraCUDA. Over the years he updated the tool several times; the newest version arrived in 2017. And thankfully so, otherwise the tool would probably not work any more. Langdon indeed showed clear speed increases in his papers, often eclipsing BWA ALN on CPUs several times over. Outside of his publications, however, there are only a few reports of BarraCUDA in the wild. So, let’s answer the question: is BarraCUDA a good alternative for DNA short read alignment tools?
Software and Hardware
To review the relative advantage of GPUs versus CPUs, I will compare the latest versions of BWA ALN (0.7.12-5) and BarraCUDA (0.7.0r107) on a human genome (HG38 Patch 12) dataset (not public data) of 71,735,202 single-end reads. The reads are 100 bp long, not trimmed or filtered, and treated as single-end because the aln step in both tools is single-end; the mates are only combined into paired-end alignments in the SAMPE step. This is about 6 GB of compressed FASTQ data, not a whole genome; I’m not going to compare SNPs or bitwise flags, as this is about alignment performance.
The hardware platform in this review is an Intel X79-based workstation with a 10-core, Hyper-Threading-enabled Intel Xeon E5-2680 v2, paired with 24 GB of DDR3 and a 1 TB SSD. The nVidia GPU is a GTX 1080 Ti, a top-end graphics card with 3584 shaders, 28 SMs, and 11 GB of very fast on-board GDDR5X RAM. I run Ubuntu 16.04 LTS as the OS, Qualimap 2.2.2 to collect the alignment results, and ‘time’ to time the process.
The total time the ALN step took for this FASTQ, from raw reads to a sai file, was 2812.868 seconds (46 minutes 53 seconds) on the Intel CPU, versus 1723.262 seconds (28 minutes 43 seconds) on the nVidia GPU with CUDA. That is a sizeable increase in speed: BarraCUDA finished in about 61% of the CPU time, a 1.63× speedup. On larger datasets, say a whole human genome, the speed boost is bigger still, about 1.77× (1.08 billion reads in 413 minutes versus 730 minutes).
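As a quick sanity check, the speedups follow directly from the measured wall-clock times:

```python
# Measured aln-step wall-clock times for the 71.7M-read dataset
cpu_seconds = 2812.868   # BWA ALN on the 10-core Xeon E5-2680 v2
gpu_seconds = 1723.262   # BarraCUDA on the GTX 1080 Ti

speedup = cpu_seconds / gpu_seconds          # ~1.63x
time_saved = 1 - gpu_seconds / cpu_seconds   # ~39% less wall-clock time

# Whole-human-genome run (1.08 billion reads), times in minutes
wgs_speedup = 730 / 413                      # ~1.77x

print(round(speedup, 2), round(time_saved, 2), round(wgs_speedup, 2))
```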
To show how well the alignment worked in general, after the BWA SAMPE step for both tools, I’ve included these two Qualimap tables:
As you can see, there is about a 0.5% discrepancy in the number of mapped (aligned) reads, with ALN mapping about 300,000 more reads than BarraCUDA. We will dive into the details in the Discussion section.
Alignment Quality and Mismatches
To show how the alignment looks in terms of mapping quality and mismatches, I’ve included these two Qualimap tables:
Again there is a difference.
By comparing the timing of the alignment step, we can clearly say that BarraCUDA is faster than a more recent version of BWA ALN. It is not the multiplication of raw alignment performance mentioned in Langdon’s paper, probably because that was a best-case scenario with reads of only 36 bp. Still, if you consider that the 28 SMs of the nVidia 1080 Ti are separate cores (versus 10 on the Intel 2680 v2) clocked much lower (1550 MHz versus 3100 MHz for the Intel), and that the Xeon has Hyper-Threading and extensions like SSE3 (which combined boost raw simple integer performance by 20-30%), the smaller performance gain makes more sense. Moreover, it must be said that the GTX 1080 Ti is not used to its fullest potential; its boost clock can reach nearly 2000 MHz and its TDP 225 watts in PC games, but when monitoring the GPU under BarraCUDA load, it only reaches ~1550 MHz and 140 watts at maximum. The Intel Xeon, by contrast, was running near maximum performance all the time. Probably the configuration and optimization of BarraCUDA were aimed at the GPUs of 2014 rather than those of 2018. Considering compatibility, I’m glad the 1080 Ti and BarraCUDA performed as they did; other CUDA alignment tools, like the nvbio-based implementation of Bowtie2, do not even work out of the box (https://github.com/NVlabs/nvbio/issues) on newer graphics cards and CUDA versions.
When comparing the 1080 Ti to other reported nVidia performance figures, it scores very well; on this GitHub page, the team of the Wright lab at the University of Toronto mentions that their nVidia Tesla K80 dual-GPU cards, with 26 combined SMs of the Kepler generation, reach about 28,000 reads per second at maximum, while my 1080 Ti reaches about 42,000 reads per second. Langdon mentions about 26,000 (single-end equivalent) reads per second in his paper. The 1080 Ti is the clear victor in all comparisons, even versus multi-GPU cards and without optimizations!
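The 1080 Ti throughput figure follows directly from the timing in the Results:

```python
reads = 71_735_202          # single-end reads in the test dataset
gpu_seconds = 1723.262      # BarraCUDA aln step on the GTX 1080 Ti

throughput = reads / gpu_seconds
print(round(throughput))    # ~41,600 reads per second, i.e. about 42,000
```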
As mentioned in the Results, the alignment properties of the resulting BAM files are not identical.
As shown in the Qualimap tables, BWA ALN maps more reads at the expense of lower MAPQ scores and more mismatches, while BarraCUDA has more insertions. Finding out whether these mismatches and insertions are actual SNPs and indels is beyond the scope of this comparison. On a whole-genome dataset with a regular level of coverage, though, these small differences are negligible, especially if you consider things like quality trimming of the raw data beforehand and variant calling heuristics after the alignment.
Still, these discrepancies surprised me quite a bit; I was under the impression that ALN had not changed since the introduction of MEM. The only explanation I have is that the code of BarraCUDA differs from that of BWA ALN, probably in the way the algorithms deal with insertions.
Even though the speed boost is not exactly mind-blowing, there are some notable advantages to using BarraCUDA. CPU sockets are limited in a system: for workstations, two sockets is often the maximum, and you pay a premium for anything beyond a quad-socket server. PCI-E slots, on the other hand, are found in droves in professional systems; these days, high-end motherboards with many PCI-E slots are not rare, especially if you count the slots that can be used with PCI-E risers. This makes using six GPUs for such an application far easier than, say, six CPUs. Moreover, you could combine the use of CPUs and GPUs; the SAMPE step is CPU-focused in both BWA and BarraCUDA. You could dedicate one GPU per FASTQ read file, pipe the output into SAMPE and from there directly into Samtools view, while dedicating multiple threads and gigabytes of RAM to Samtools sort. This would use nearly all the resources of a high-memory, multi-core Xeon and multi-GPU workstation, and it would boost speeds significantly, outperforming much bigger servers. The fancy term for this is heterogeneous computing.
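As a sketch of what such a heterogeneous setup could look like, the snippet below builds one per-sample shell pipeline string: GPU-bound alignment feeding CPU-bound pairing, conversion, and sorting. The file names are placeholders and the command lines follow the BWA-style usage described above; check them against the flags of your installed versions before running anything.

```python
def build_pipeline(ref, fq1, fq2, out_bam, threads=8):
    """Build one per-sample shell pipeline: GPU-bound alignment followed
    by CPU-bound SAM/BAM processing. File names are placeholders."""
    sai1, sai2 = fq1 + ".sai", fq2 + ".sai"
    steps = [
        f"barracuda aln {ref} {fq1} > {sai1}",  # GPU: align read 1
        f"barracuda aln {ref} {fq2} > {sai2}",  # GPU: align read 2
        # CPU: pair the single-end alignments, convert and sort in one stream
        f"barracuda sampe {ref} {sai1} {sai2} {fq1} {fq2}"
        f" | samtools view -b -"
        f" | samtools sort -@ {threads} -o {out_bam} -",
    ]
    return " && ".join(steps)

print(build_pipeline("hg38.fa", "R1.fq.gz", "R2.fq.gz", "sample.bam"))
```

Launching one such pipeline per GPU (for example via `subprocess.Popen`) keeps all GPUs busy with alignment while the CPU cores and RAM handle sampe, view, and sort.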
Another advantage of the tool is how Langdon designed it. He basically created BarraCUDA as a drop-in replacement for BWA ALN; it uses largely the same code base, the same manual, the same options, and the same execution style as ALN. This is awesome; comparing and implementing the tool becomes quite easy: just change the start of the command from “bwa aln” to “barracuda aln” and you are good to go. Conversion of existing workflows should be finished in mere minutes, after the installation of CUDA that is.
Still, there are some glaring issues. While ALN and BarraCUDA are more or less comparable, BWA MEM is the elephant in the room. It is a newer alignment tool, released in 2013, which Li designed around a different seed-matching strategy called super-maximal exact matches. MEM also retains more reads thanks to better whole-read extension/backtracking and paired-end mapping considerations. And the SAMPE step is gone, making the tool easier to run.
There has been another big shift in alignment land: since the creation of the most recent human references, HG19/HG38, there are now ALT regions: special regions that represent previously ignored or unknown genomic areas that are not universal across populations but can have a profound effect on an individual’s alignment metrics. MEM is ALT-aware: it will map to ALTs if there is enough indication to do so, but it treats those alignments differently (MAPQ scores and soft-clipping are adjusted). ALN and BarraCUDA are blind to these regions and are therefore used more often in conjunction with HG18 instead, which is not the best human reference.
All in all, MEM is a more modern aligner which is by now the default alignment tool for mapping DNA-Seq data, and has made its way into bioinformatics pipelines the world over, including Broad’s GATK4 Best Practices:
When I run BWA MEM on the same dataset, the results speak for themselves: 2-3% more mapped reads and much higher MAPQ values, which will have a profound effect on downstream analysis. Moreover, MEM runs faster than ALN (I did not time this) and skips the SAMPE step, so some of the hardware-based advantages BarraCUDA had going for it are mitigated.
This has become a pretty long post, but I wanted to give a complete overview of how and why BarraCUDA should or shouldn’t be used. I did not do a complete apples-to-apples comparison between version-matched builds of BWA ALN and BarraCUDA, nor did I deeply analyze the mapping errors. But that was beyond the scope of what I wanted to validate: is BarraCUDA a worthy replacement for the industry standard BWA?
I came to the conclusion that the higher speeds and hardware efficiencies of GPUs leveraged by BarraCUDA do not outweigh the improvements that BWA MEM has made over the years. Higher genome coverage from the same dataset and better mapping characteristics will always trump speed. Also, these days there are Spark versions of the whole GATK workflow, which speed up the whole process more than BarraCUDA’s faster alignment alone.
It’s too bad; I was hoping to start updating the pipelines that BioCentric uses, but now I can put my efforts into other GPU-based tools. Thank you for reading and stay tuned for the next CUDA post!