Bioinformatics Frameworks Part 6: Rabix
Introduction
Bioinformatics has arrived as a practical tool, and many organizations have recognized its potential to create value, both internally and in the marketplace. The first groups to roll out bioinformatics with this in mind were the high-tech biotech companies and research institutes, which use it in their research and product development infrastructure, often as an extension of or addition to the existing “wet lab”. More recently, dedicated bioinformatics companies have emerged that have no internal lab at all, but instead focus their efforts on offering Bioinformatics as a Service.
Full Production
As mentioned earlier in this blog series on bioinformatics frameworks, the professionalization of pipeline languages is in full swing. It is driven by the high-end companies and institutes that need bioinformatics to run in full production mode, comparable to other commercial and open source software systems.
A good example of this is Seven Bridges Genomics. Seven Bridges is a leading biomedical data company, specializing in software and data analytics to drive public and private healthcare research. In practical terms, Seven Bridges applies rigorous software development practices to bring bioinformatics tools and pipelines to a more abstracted level. Using a web-based graphical interface, users can design workflows and configure all the tools they need, all connected and managed in the cloud.
This way, the user doesn’t have to worry about making different tools interact with each other, keeping databases up to date, or buying hardware. Depending on the user’s requirements, executions can be optimized for speed or for cost, and availability and uptime are guaranteed.
To meet all these demands, one can imagine that the framework behind these efforts has to satisfy the most rigorous software development requirements. When Seven Bridges was conceived, no bioinformatics framework built with this in mind existed yet. So, as these things go, the developers at Seven Bridges created their own: Rabix, which they released to the public as a beta in 2017.
Enter Rabix
Rabix is itself one sublayer among the many layers that exist in modern software design. It builds upon Docker, a tool that has featured on this blog several times. Docker ‘rests’ on the kernel of the operating system, but containerizes all applications that run on it. For Seven Bridges, the next level up is the Common Workflow Language (CWL), which standardizes bioinformatics apps in both usage and data structures. CWL will be discussed in a later blog entry.
Rabix itself functions as a composer for the platform and as an executor for workflows on the cloud infrastructure. The language is essentially the go-between for the user-designed workflows, the CWL-defined tools and the (cloud-based) resources, a layering that Seven Bridges illustrates with a diagram in its documentation.
Rabix itself is very easy to set up and run, comparable to Snakemake and Nextflow. An example run can be performed on a blank machine using the following commands:
$ wget https://github.com/rabix/bunny/releases/download/v1.0.5/rabix-1.0.5.tar.gz && tar -xvf rabix-1.0.5.tar.gz
$ cd rabix-cli-1.0.5
$ ./rabix examples/dna2protein/dna2protein.cwl.json examples/dna2protein/inputs.json
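The second argument in the run command above, inputs.json, supplies the actual input values for the workflow. The example ships with its own inputs file; the snippet below is only a minimal sketch of my own showing what such a CWL job file typically looks like, where the file path and the output filename are assumptions on my part:

{
  "input_file": {
    "class": "File",
    "path": "examples/dna2protein/data/dna.txt"
  },
  "output_filename": "protein.txt"
}

Note that file inputs in CWL job files are given as objects with a class of File and a path, which is why input_file is not just a plain string.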
As you can see, Rabix does not use language-specific ‘main’ scripts like Snakemake or Nextflow do. Because all the intermediate tools are well defined in CWL and fully dockerized, everything is not only configured in standard JSON files, but also executed from them. The ‘main’ script in this case is dna2protein.cwl.json, which contains all the information needed for the workflow. Indeed, the first class defined in this JSON is “Workflow”:
{
  "class": "Workflow",
  "cwlVersion": "v1.0",
  "steps": {
    "Transcribe": {
      "run": "transcribe.cwl.json",
      "in": {
        "input_file": "input_file",
        "verbose": { "default": true }
      },
      "out": ["output_file_glob"]
    },
    "Translate": {
      "run": "translate.cwl.json",
      "in": {
        "input_file": "Transcribe/output_file_glob",
        "output_filename": "output_filename"
      },
      "out": ["output_protein"]
    }
  },
  "inputs": {
    "output_filename": {
      "description": "Optional output_filename string",
      "type": ["null", "string"]
    },
    "input_file": {
      "type": ["File"],
      "label": "input_file"
    }
  },
  "outputs": {
    "output_protein": {
      "type": "File",
      "label": "output_protein",
      "outputSource": "Translate/output_protein"
    }
  },
  "label": "dna2protein",
  "description": "A workflow that converts DNA sequences into peptides."
}
It’s clear that this workflow ‘script’ is very verbose and does not actually do anything yet: it sets up the inputs and outputs and refers to the next JSON files for execution. The next example is the actual ‘tool’ that is executed, a Python script that transcribes a DNA text file into mRNA:
{
  "class": "CommandLineTool",
  "cwlVersion": "v1.0",
  "label": "Transcribe",
  "description": "This project was created to demonstrate the use of argparse to create CLI tools in Python, wrap them using CWL v1.0, and running them.\n\nThis tool takes a TXT file with a DNA sequence and converts to an RNA sequence.",
  "requirements": [
    { "class": "InlineJavascriptRequirement" },
    {
      "class": "InitialWorkDirRequirement",
      "listing": [
        {
          "entry": "#!/usr/bin/env/python\nimport argparse\nimport re\nimport sys\n\ndef transcribe(args):\n\t# create a transcription map and use regex to translate\n\tmap = {\"A\":\"U\", \"T\":\"A\", \"C\":\"G\", \"G\":\"C\"}\n\tmap = dict((re.escape(k), v) for k, v in map.iteritems())\n\tpattern = re.compile(\"|\".join(map.keys()))\n\tDNA = args['dna'].read().strip()\n\tmRNA = pattern.sub(lambda m: map[re.escape(m.group(0))], DNA)\n\n\t# write a verbose output to stderr and just mRNA to sdtout \n\tif args['verbose']:\n\t\tsys.stderr.write(\"Your original DNA sequence: \" + DNA + \"\\n\")\n\t\tsys.stderr.write(\"Your translated mRNA sequence: \" + mRNA + \"\\n\")\n\tsys.stdout.write(mRNA + '\\n')\n\tsys.exit(0)\n\treturn mRNA\n\nif __name__ == \"__main__\":\n\t\"\"\" Parse the command line arguments \"\"\"\n\tparser = argparse.ArgumentParser()\n\tparser.add_argument(\"-d\", \"--dna\", type=argparse.FileType(\"r\"), default=sys.stdin)\n\tparser.add_argument(\"-v\", \"--verbose\", action=\"store_true\", default=False)\n\t# By setting args as var(...), it becomes a dict, so 'dna' is a key\n\t# Alternative use: args = parser.parse_args(), and 'dna' is an attr of args!\n\t# You must change how you call the args you parse based on this usage! \n\targs = vars(parser.parse_args())\n\n\t\"\"\" Run the desired methods \"\"\"\n\ttranscribe(args)",
          "entryname": "transcribe_argparse.py"
        }
      ]
    }
  ],
  "inputs": {
    "input_file": {
      "type": "File",
      "description": "Input file",
      "inputBinding": { "position": 3, "prefix": "-d" }
    },
    "verbose": {
      "type": ["null", "boolean"],
      "inputBinding": { "position": 4, "prefix": "--verbose", "separate": true }
    },
    "output_filename": {
      "type": ["null", "string"],
      "description": "Specify output filename"
    }
  },
  "outputs": {
    "output_file_glob": {
      "type": "File",
      "outputBinding": { "glob": "*.txt" }
    }
  },
  "hints": [
    {
      "class": "DockerRequirement",
      "dockerPull": "python:2-alpine"
    }
  ],
  "baseCommand": ["python", "transcribe_argparse.py"],
  "stdout": "${return inputs.output_filename || 'rna' + '.txt'}"
}
The tool script is actually embedded within the JSON file itself, which is a first for me! For more complex tools like BWA, a Docker image containing the tool would be pulled instead, but Rabix would still refer to it in the same verbose but clear manner.
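As an illustration, a dockerized tool definition could look roughly like the sketch below. This is purely my own minimal sketch, not the actual Seven Bridges wrapper: the image tag, the input bindings and the output name are assumptions, and a real BWA wrapper would also need the reference index files declared as secondaryFiles.

{
  "class": "CommandLineTool",
  "cwlVersion": "v1.0",
  "label": "BWA-MEM (sketch)",
  "description": "Illustrative sketch only: image tag, bindings and output name are assumptions, and reference index secondaryFiles are omitted.",
  "baseCommand": ["bwa", "mem"],
  "hints": [
    {
      "class": "DockerRequirement",
      "dockerPull": "biocontainers/bwa:v0.7.17_cv1"
    }
  ],
  "inputs": {
    "reference": {
      "type": "File",
      "inputBinding": { "position": 1 }
    },
    "reads": {
      "type": "File",
      "inputBinding": { "position": 2 }
    }
  },
  "outputs": {
    "alignments": {
      "type": "File",
      "outputBinding": { "glob": "aligned.sam" }
    }
  },
  "stdout": "aligned.sam"
}

The difference with the Transcribe example above is that the command is no longer embedded in the JSON; the baseCommand simply points at the bwa binary that is already present inside the Docker image.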
Conclusion
Rabix is a very interesting take on the workflow discussion. Instead of trying to build something specific or directly functional like Nextflow, it focuses more on having well-defined tools, commands, files and executions. This makes writing workflow scripts a bit harder, because of the verbosity and the greater emphasis on coding rigor that is required. An advantage, however, is that a graphical interface for workflow building can be layered on top, as shown here. All in all, the system can build upon each layer of abstraction, making things granular, modular and scalable. Another question is whether this approach suits bioinformaticians, though. I love being able to create functions in workflow files, and verbosity is not my cup of tea.
Therefore, Rabix performs well on my checklist. For the items marked with an asterisk it is unclear to me exactly how they work, but these functions are claimed and presumably work on Seven Bridges’ web-based service, to which I don’t have access right now.
- Both local and scaled execution
- Docker or containerized commands
- Fully reproducible*
- Crash/failure robustness*
- Versioning for pipelines, software and files*
- Software package management
- Resource management