Towards Reproducibility in Scientific Workflows: An infrastructure-based approach

Abstract

It is commonly agreed that in-silico scientific experiments should be executable and repeatable processes. Most of the current approaches for computational experiment preservation and reproducibility have focused so far on two of the main components of the experiment, namely data and method.
In this paper we propose a new approach that addresses the third cornerstone of experimental reproducibility: the equipment. This work focuses on the set of software and hardware components that are involved in the execution of a scientific workflow. To demonstrate the feasibility of our proposal, we describe a use case scenario in the Text Analytics domain and the application of our approach to it. Starting from the original workflow, we document its execution environment by means of a set of semantic models and a catalogue of resources, and we generate an equivalent infrastructure for re-executing it.

Datasets

In this section we include the files containing the datasets of our experiment and a set of queries for retrieving information from them.

  • Workflow Requirements dataset: this RDF dataset contains the annotations about the requirements of the 4 Workflow Templates of our experiment. These Text Analytics workflows have been implemented in the Wings platform. AVAILABLE HERE
  • Software Stacks dataset: an RDF dataset containing the description of the Software Components necessary for executing the described workflows. AVAILABLE HERE
  • Scientific Virtual Appliance dataset: an RDF dataset containing the annotations of the 2 Scientific Virtual Appliances used in this experimentation. AVAILABLE HERE
  • Sample queries: a set of SPARQL queries that can be executed over the three above-mentioned datasets to obtain the results of this experiment. This file contains the queries used by the Infrastructure Specification Algorithm. AVAILABLE HERE

Among the many tools and systems available for querying the RDF datasets using SPARQL, we recommend Twinkle. Even though our annotations can be loaded in any SPARQL engine, Twinkle is a simple and cross-platform GUI for local queries. A demo of the Infrastructure Specification Algorithm can be obtained here, along with the necessary files for executing it.
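To illustrate the kind of lookup the sample queries perform across the three datasets, the following is a minimal sketch in plain Python. It is not the Infrastructure Specification Algorithm itself, and it does not use the actual vocabularies of the datasets: the identifiers `wf:ExampleWorkflow`, `sva:ApplianceOne`, `sva:ApplianceTwo`, `sw:ComponentA`, `sw:ComponentB` and the properties `ex:requires` and `ex:provides` are hypothetical placeholders. In practice the same question would be expressed as a SPARQL query and run in Twinkle or another engine.

```python
from typing import Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical triples standing in for the three RDF datasets.
# The real datasets use their own vocabularies and identifiers.
TRIPLES: List[Triple] = [
    ("wf:ExampleWorkflow", "ex:requires", "sw:ComponentA"),
    ("wf:ExampleWorkflow", "ex:requires", "sw:ComponentB"),
    ("sva:ApplianceOne", "ex:provides", "sw:ComponentA"),
    ("sva:ApplianceOne", "ex:provides", "sw:ComponentB"),
    ("sva:ApplianceTwo", "ex:provides", "sw:ComponentA"),
]


def match(
    triples: List[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> List[Triple]:
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t
        for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def required_components(triples: List[Triple], workflow: str) -> Set[str]:
    """Collect the software components a workflow is annotated as requiring."""
    return {obj for _, _, obj in match(triples, s=workflow, p="ex:requires")}


def candidate_appliances(triples: List[Triple], workflow: str) -> List[str]:
    """List the appliances that provide every component the workflow requires."""
    needed = required_components(triples, workflow)
    provided: Dict[str, Set[str]] = {}
    for subj, _, obj in match(triples, p="ex:provides"):
        provided.setdefault(subj, set()).add(obj)
    return sorted(a for a, comps in provided.items() if needed <= comps)


if __name__ == "__main__":
    print(candidate_appliances(TRIPLES, "wf:ExampleWorkflow"))
```

The subset test (`needed <= comps`) is the simplest possible matching rule; the published queries may apply additional criteria, such as versions or dependencies between components.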

About the authors

Idafen Santana-Perez (isantana@fi.upm.es) Ontology Engineering Group, UPM
María Pérez-Hernández (mperez@fi.upm.es) Ontology Engineering Group, UPM

Acknowledgements

The authors would like to thank the Wings project for their support and materials, and the FPU (Formación de Profesorado Universitario) grant program from the Spanish Science and Innovation Ministry (MICINN).