Containers for bioinformatics: a hands-on workshop

Brian Skjerven and Marco De La Pierre, Pawsey Supercomputing Centre

In the past year, staff at the Pawsey Supercomputing Centre in Perth, Australia, have been investigating the deployment of containers on HPC resources to address several major issues researchers face when migrating workflows to HPC: complex software stacks and dependencies, cross-platform portability, reproducibility of results, and difficulties in collaboration. The resulting solution combines the Docker and Shifter container engines: Docker is used to build containers and to deploy them on cloud systems, while Shifter is used to deploy them on HPC clusters.

Researchers in bioinformatics stand to benefit greatly from the adoption of container technology. Their workflows typically involve a wide variety of software packages, many of which have numerous dependencies that can be difficult and time-consuming to install. This, combined with a desire to improve collaboration and make reproducibility of results more accessible, makes containers an ideal tool for bioinformaticians.

In order to introduce this community of researchers to the use of containers, a dedicated hands-on tutorial has been developed and made publicly available on GitHub at https://github.com/pawseysc/bio-workshop-18. The tutorial features lecture materials, examples and exercises, and can be run both during live workshop sessions and as a self-paced tutorial. The only requirement is a workstation or laptop with Docker installed.

The tutorial starts by introducing the key Docker commands required to handle and run containers, using both general (Linux) and domain-specific examples. The next step is porting a sample bioinformatics workflow into containers, which gives researchers a realistic, first-hand sense of the benefits of container adoption. Subsequent examples show how to port container runs to HPC systems with Shifter, and how to build containers from scratch for customised applications. The materials are a work in progress, with additional examples and improvements being added to the GitHub repo.
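
As a rough illustration of the first step described above, the following Python sketch assembles the two most basic Docker command lines, `docker pull` and `docker run`, as argument lists and prints them. It is not taken from the tutorial materials. The image name `example/bio-tool:1.0`, the tool name `bio-tool`, the input file and the host directory are hypothetical placeholders; only the `docker pull` and `docker run` subcommands and the `--rm`, `-v` and `-w` options are standard Docker CLI features.

```python
import shlex


def docker_pull_command(image):
    """Return the argument list that downloads an image from a registry."""
    return ["docker", "pull", image]


def docker_run_command(image, tool_args, host_dir, container_dir="/data"):
    """Return the argument list that runs a tool inside a container.

    --rm  removes the container once the command exits
    -v    bind-mounts host_dir into the container at container_dir
    -w    sets the working directory inside the container
    """
    return [
        "docker", "run",
        "--rm",
        "-v", f"{host_dir}:{container_dir}",
        "-w", container_dir,
        image,
        *tool_args,
    ]


# Hypothetical image, tool and paths, used for illustration only.
IMAGE = "example/bio-tool:1.0"

pull_cmd = docker_pull_command(IMAGE)
run_cmd = docker_run_command(
    IMAGE,
    ["bio-tool", "--input", "reads.fastq"],
    host_dir="/home/user/project",
)

# Print the commands so they can be inspected or pasted into a shell.
print(shlex.join(pull_cmd))
print(shlex.join(run_cmd))
```

Building the commands as lists keeps the sketch runnable on a machine without Docker. On a workstation with Docker installed, and with a real image name substituted, the same lists could be passed to `subprocess.run`.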

The tutorial was first used at a workshop during a Bioinformatics Symposium held at the Pawsey centre in September 2018, where it received positive feedback from both in-room and remote attendees. Further training events based on it are planned. Pawsey is also working with several bioinformatics groups to scale up their workflows, and much of this work is being done in containers.

Anyone interested in containers for bioinformatics is invited to clone the tutorial repository from GitHub, make use of it, and, in time, contribute to its development.
