
Bioinformatics Frameworks Part 5: Cromwell and WDL

Introduction

It has been a couple of months since the last addition to the Bioinformatics Frameworks series. Since then I’ve mostly been using Nextflow as my main workflow/pipeline language. Recently I visited the DTL Programmers Meeting, where I had an interesting discussion with members of the SASC bioinformatics development group of the LUMC. For them, bioinformatics pipelines are more and more about scalability and manageability, not about user-friendliness or readability. In their view, the core of any bioinformatics framework should focus on execution, deployment and solid data/version management.

Departure from rule-based frameworks

The SASC developers made clear that rule-based languages like Make and Snakemake are great on single systems and offer an intuitive mindset for complex data analyses. But when dealing with expansive grids of compute nodes and diverse cloud ecosystems, just managing rules does not cut it: too much time is wasted on job execution and data-flow management. Moreover, every part of the analysis infrastructure (such as Docker, Conda and GitHub) should be integrated natively instead of bolted on as an extension of a local language. They opted instead for WDL and its execution engine Cromwell, both maintained by the Broad Institute.

Enter WDL and Cromwell

The Broad Institute is no stranger to high-performance bioinformatics. Its best-known effort in this field is of course the Genome Analysis Toolkit (GATK), the de facto standard tool for human variant calling, which has seen many upgrades and expansions over the years. Because of the large amounts of DNA sequencing data the institute was processing soon after the introduction of NGS, Broad developed its own workflow language to facilitate high-throughput analysis on grid infrastructures. This language, named Queue, was the backbone of the GATK for several versions. By their own account, Queue wasn’t that user-friendly, so more recently Broad introduced WDL (the Workflow Description Language) as its workflow/pipeline language.

WDL is a pipeline programming language constructed in such a way that you can formally define the commands, inputs and outputs of a computational task, as in Snakemake, but also the resources and environment required to perform that task. An example is shown here:

# Three tasks: ps lists running processes, cgrep counts lines matching a
# pattern, wc counts all lines. The workflow at the bottom wires them together.
task ps {
  command {
    ps
  }
 
  # Runtime attributes: docker_image, queue, resource and job_group are
  # backend-specific (LSF-style) keys mapped in the Cromwell backend
  # configuration, not core WDL.
  runtime {
          docker_image: "ubuntu:xenial"
          cpu: "1"
          memory_gb: "4"
          queue: "research-hpc"
          resource: "rusage[gtmp=10, mem=4000]"
          job_group: '/myjobgroup/'
  }
 
  output {
    File procs = stdout()
  }
}
 
task cgrep {
  String pattern
  File in_file
 
  command {
    grep '${pattern}' ${in_file} | wc -l
  }
 
  runtime {
          docker_image: "ubuntu:xenial"
          cpu: "1"
          memory_gb: "4"
          queue: "research-hpc"
          resource: "rusage[gtmp=10, mem=4000]"
          job_group: '/myjobgroup/'
  }
 
  output {
    Int count = read_int(stdout())
  }
 
}
 
task wc {
  File in_file
 
  command {
    cat ${in_file} | wc -l
  }
 
  runtime {
          docker_image: "ubuntu:xenial"
          cpu: "1"
          memory_gb: "4"
          queue: "research-hpc"
          resource: "rusage[gtmp=10, mem=4000]"
          job_group: '/myjobgroup/'
  }
 
  output {
    Int count = read_int(stdout())
  }
 
}
 
# The workflow feeds ps's captured stdout into both cgrep and wc.
workflow three_step {
  call ps
  call cgrep {
    input: in_file=ps.procs
  }
  call wc {
    input: in_file=ps.procs
  }
}

Personally, the markup reminds me of Perl, but without the hocus pocus that is intrinsic to Perl; it’s not as clear as a Pythonic language, though. The biggest advantage is that everything is pretty straightforward and formal, a bit clearer than Nextflow.

Cromwell, on the other hand, is the management system that makes the connection between WDL and the computational infrastructure, be it a single system or a cloud ecosystem.
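
To make that concrete, here is a minimal sketch of running the three_step workflow above in Cromwell’s single-workflow “run” mode. The inputs file name and the pattern value are my own assumptions for illustration, and you would swap cromwell.jar for whichever release you downloaded:

# Hypothetical inputs file; the grep pattern is the workflow's only unbound input.
cat > three_step_inputs.json << 'EOF'
{
  "three_step.cgrep.pattern": "java"
}
EOF

# "run" mode executes a single workflow on the local machine and prints
# the outputs (procs and both counts) as JSON when it finishes.
java -jar cromwell.jar run three_step.wdl --inputs three_step_inputs.json

For anything beyond trying it out, Cromwell is instead started in server mode and workflows are submitted over its REST API, which is where the grid and cloud backends come in.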

So, that’s a quick introduction to WDL and Cromwell. It’s quite short, since there is not much to go on. I base my experiences for this blog series on tutorials and example code, but for WDL and Cromwell there isn’t much available. Even Cromwell’s ReadTheDocs spends far more effort on configurations for all the APIs and services than on making it possible for me to try it out for myself. Google seems to endorse the system, but they only have a couple of examples on their GitHub, which of course only run on their cloud. Maybe I’m spoiled, but I need a bit more to go on before I invest a lot of time creating my own pipelines with a certain framework.
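
For a taste of what those configurations look like: Cromwell is configured with HOCON files layered over its built-in defaults. The sketch below uses the documented Local backend and its concurrent-job-limit setting to cap how many jobs run at once; the file name is my own choice.

# my.conf -- a minimal Cromwell override file.
# Pull in Cromwell's built-in defaults first, then override selectively.
include required(classpath("application"))

backend {
  # Run jobs on the machine Cromwell itself runs on.
  default = "Local"
  providers {
    Local {
      config {
        # Run at most two jobs concurrently.
        concurrent-job-limit = 2
      }
    }
  }
}

Starting Cromwell with java -Dconfig.file=my.conf -jar cromwell.jar picks up the override; the same mechanism is how you would wire in an LSF, SLURM or cloud backend.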

Furthermore, I cannot figure out whether Cromwell and WDL facilitate the usage of the Framework Best Practices. I’ll post the list here and strike through the items that are seemingly absent:

  • Both local and scaled execution
  • Docker or containerized commands
  • Fully reproducible
  • Crash/failure robustness
  • Versioning for pipelines, software and files
  • Software package management
  • Resource management

Conclusion

While I like the formalized way of stating both commands and resources, I get the feeling Cromwell and WDL don’t have a big community behind them. Even though I agree with the SASC developers that functionality trumps user-friendliness, if a framework does not gain much traction outside its base of developers (even when that base is a big institute like the Broad), it won’t last very long. Secondly, I wonder whether Cromwell+WDL can claim the same functionality that Snakemake and Nextflow have.