Bioinformatics Frameworks Part 5: Cromwell and WDL
Introduction
It has been a couple of months since the last addition to the Bioinformatics Frameworks series. Since then I’ve mostly been using Nextflow as my main workflow/pipeline language. Recently I visited the DTL Programmers Meeting, where I had an interesting discussion with members of the SASC bioinformatics development group at the LUMC. For them, bioinformatics pipelines are more and more about scalability and manageability, not about user-friendliness or readability. In their view, the core of any bioinformatics framework should be focused on execution, deployment and how well data/version management is implemented.
Departure from rule-based frameworks
The SASC developers made clear that rule-based languages like Make and Snakemake are great when executed on single systems, and make it intuitive to learn the mindset needed for running complex data analyses. But when dealing with expansive grids of compute nodes and diverse cloud ecosystems, just managing rules does not cut it: too much time is wasted on job execution and data-flow management. Moreover, every part of the analysis infrastructure (like Docker, Conda and GitHub) should be integrated natively instead of bolted onto a local language. They opted instead for Cromwell and WDL, a workflow framework maintained by the Broad Institute.
Enter WDL and Cromwell
The Broad Institute is no stranger to high-performance bioinformatics. Their best-known effort in this field is of course the Genome Analysis Toolkit (GATK), the de facto standard tool for human variant calling. The GATK package has seen many upgrades and expansions over the years. Because of the large amounts of DNA sequencing data the institute was processing soon after the introduction of NGS, Broad developed its own workflow language to facilitate high-throughput analysis on grid infrastructures. This language, named Queue, was the backbone of the GATK for several versions. By their own account, Queue wasn’t that user-friendly, so more recently Broad introduced WDL (Workflow Description Language) as a workflow/pipeline language.
WDL is a pipeline programming language constructed in such a way that you can formally define the commands, inputs and outputs of a computational task, like in Snakemake, but also the resources and environment required to perform said task. An example is shown here:
task ps {
  command {
    ps
  }
  runtime {
    docker_image: "ubuntu:xenial"
    cpu: "1"
    memory_gb: "4"
    queue: "research-hpc"
    resource: "rusage[gtmp=10, mem=4000]"
    job_group: '/myjobgroup/'
  }
  output {
    File procs = stdout()
  }
}

task cgrep {
  String pattern
  File in_file
  command {
    grep '${pattern}' ${in_file} | wc -l
  }
  runtime {
    docker_image: "ubuntu:xenial"
    cpu: "1"
    memory_gb: "4"
    queue: "research-hpc"
    resource: "rusage[gtmp=10, mem=4000]"
    job_group: '/myjobgroup/'
  }
  output {
    Int count = read_int(stdout())
  }
}

task wc {
  File in_file
  command {
    cat ${in_file} | wc -l
  }
  runtime {
    docker_image: "ubuntu:xenial"
    cpu: "1"
    memory_gb: "4"
    queue: "research-hpc"
    resource: "rusage[gtmp=10, mem=4000]"
    job_group: '/myjobgroup/'
  }
  output {
    Int count = read_int(stdout())
  }
}

workflow three_step {
  call ps
  call cgrep {
    input: in_file=ps.procs
  }
  call wc {
    input: in_file=ps.procs
  }
}
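In this three_step example, the ps task captures the list of running processes to a file, which both cgrep and wc then consume. Note that there are no explicit rules or dependency declarations: Cromwell derives the execution order purely from the input/output bindings in the call blocks.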
Personally, the markup reminds me of Perl, without the hocus pocus that is intrinsic to Perl. But it’s not as clear as a Pythonic language. The biggest advantage is that everything is pretty straightforward and formal, a bit clearer than Nextflow.
Cromwell, on the other hand, is the management system that makes the connection between WDL and the computational infrastructure, be it a single system or a cloud ecosystem. [Image: overview of how Cromwell connects WDL workflows to the underlying compute infrastructure]
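To make this a bit more concrete, here is a minimal sketch of how I understand running the three_step workflow with a standalone Cromwell jar; the jar and file names are placeholders of mine. The one unbound workflow input (the grep pattern for cgrep) is supplied through an inputs JSON using WDL’s fully qualified names:

{
  "three_step.cgrep.pattern": "bash"
}

java -jar cromwell-XY.jar run three_step.wdl --inputs three_step_inputs.json

In run mode Cromwell executes a single workflow and exits, which is fine for trying things out; for anything production-grade the documentation pushes you towards its server mode and REST API instead.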
So, that’s a quick introduction to WDL and Cromwell. It’s quite short, since there is not much to go on. I base my impressions for this blog series on tutorials and example code, but for WDL and Cromwell there isn’t that much available. Even Cromwell’s ReadTheDocs spends far more effort on configurations for all the APIs and services than on making it possible for me to try it out for myself. Google seems to endorse the system, but they only have a couple of examples on their GitHub, which of course only run on their cloud. Maybe I’m spoiled, but I need a bit more to go on before I invest a lot of time creating my own pipelines with a certain framework.
Furthermore, I cannot figure out whether Cromwell and WDL facilitate the usage of the Framework Best Practices. I’ll post the list here, strike through the items that are seemingly absent, and sketch the configuration side of the first item right after the list:
- Both local and scaled execution
- Docker or containerized commands
- ~~Fully reproducible~~
- Crash/failure robustness
- ~~Versioning for pipelines, software and files~~
- ~~Software package management~~
- Resource management
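As promised above, a sketch of the first item. As far as I can tell from the docs, local versus scaled execution in Cromwell is handled entirely through its HOCON configuration file: you declare backend providers and pick a default, so the same WDL can run on a laptop or be submitted to a grid scheduler. The SGE submit template below is adapted from the documentation’s examples, and exact keys may differ between releases:

include required(classpath("application"))

backend {
  # Provider used when a workflow does not request one explicitly
  default = "Local"
  providers {
    # A grid backend (SGE here); the submit string is the template
    # Cromwell fills in to hand each task to the scheduler
    SGE {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        submit = """
          qsub -V -b n -wd ${cwd} -N ${job_name} -o ${out} -e ${err} ${script}
        """
      }
    }
  }
}

If this works as advertised, moving a whole pipeline from a single machine to a cluster comes down to changing the default backend, without touching the WDL itself.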
Conclusion
While I like the formalized way of stating both commands and resources, I get the feeling Cromwell and WDL don’t have a big community behind them. Even though I agree with the SASC developers that functionality trumps user-friendliness, if a framework does not gain much traction outside its base of developers (even if that base is a big institute like the Broad), it won’t last very long. Secondly, I wonder if Cromwell+WDL can claim the same functionality that Snakemake and Nextflow have.