RSECON22 Walkthrough: A FAIR Data Pipeline: provenance-driven data management for traceable scientific workflows


Modern scientific analyses depend critically on access to and use of data. Rapidly evolving data, such as data streams that change during a disease outbreak, are particularly challenging. The analytical software itself often changes too, adding further complexity, and keeping track of which output was generated by which specific analysis using which release of the data (its provenance) is all too often neglected as too time-consuming. In this walkthrough, we will demonstrate a lightweight Findable, Accessible, Interoperable and Reusable (FAIR) data pipeline that enables easy tracing of the provenance of results generated from scientific analyses, alongside management of the FAIR data and code themselves. The pipeline is platform-agnostic (tested on Linux, macOS and Windows), provides APIs in multiple programming languages (currently Python, R, Java, C++ and Julia), and can be run either with central servers managing data repositories or entirely locally or offline (e.g. on an isolated laptop or in an HPC environment). Although developed during the pandemic to trace the provenance of analyses for public policy, it allows easy management and annotation of any data as they are consumed by analyses, and traces the provenance of scientific outputs back to primary data. It provides a mechanism for fellow RSEs, scientists or the public to better assess scientific evidence by inspecting its provenance, while helping policy-makers to justify their decisions openly. We believe that this tool is of general value to the RSE community and offers a step forward in our ability to promote Open Science.

Sep 7, 2022 9:00 AM — 10:30 AM
Ryan J Field
Research Associate

My research interests include distributed robotics, mobile computing and programmable matter.