GEOTEC’s road to Open Science & Reproducibility

Since 2017, GEOTEC members have been interested in incorporating reproducible practices into our research activities. We often conduct computationally-oriented and data-intensive research, so coding and data analysis are core activities for us. However, enabling reproducibility is a long road, with many phases of trial and error, where we have learned a lot and have put in place more and more supporting actions and tools to improve reproducibility practices over the years.

It goes without saying that reproducibility practices require more than technical competence. Technical skill matters, of course, but promoting new habits within a team of researchers, and changing existing ones, is the most challenging part. It takes time, constant training, support and coaching for early-stage researchers to strengthen their ability to work in a reproducible manner. That ability can have an immediate impact on their next paper but, more importantly, it gives them a long-term vision of transparency, integrity, honesty and reproducibility that benefits their future academic and research careers.

GEOTEC has not yet established a consistent policy to improve reproducibility practices, as other groups have done. It is time to reflect on this, taking a recent scientific article as an example of how we at GEOTEC put open science and reproducible practices to work. This reflection also puts a wish in writing: beyond an exceptional case of good practice like the example below, we would like to extend this way of working to all members, current and future, so this post is a kind of initial commitment regarding the expected way of working at GEOTEC.

We here refer to the scientific paper authored by Miguel as part of his doctoral thesis.

Miguel Matey-Sanz, Alberto González-Pérez, Sven Casteleyn, Carlos Granell. Implementing and evaluating the Timed Up and Go test automation using smartphones and smartwatches. IEEE Journal of Biomedical and Health Informatics, 28(11): 6594-6605, 2024, ISSN 2168-2194.

I leave aside the scientific contribution of the article itself, to focus on the how, that is, the process of carrying out research based on reproducible and open practices. In what follows I summarise the steps that Miguel and his co-authors took to carry out the research activities and results reported in this paper.

We conduct research activities by applying reproducible research practices

Most of your time is spent conducting your research tasks and selecting and using the required tools and methods. We, for instance, write code or use supporting tools for data collection, data analysis and data visualization. We programmatically run models and machine learning algorithms to produce plots and graphs. We also produce maps as we explore lots of geospatial datasets. Our research code is often experimental and iterative, and calls for working with notebooks, in either Python or R, to interactively explore, transform, process, analyse and visualise datasets. We also build web applications and mobile apps, which follow more general-purpose software engineering practices.

So, when it comes to writing code, managing datasets, and so on, you need to follow basic recommendations and tips for making your resources more reproducible and easier to manage and document. There are excellent practical guides published elsewhere with general recommendations for promoting reproducibility, research data management and open science practices. I point here to the Reproducible Research Practices course, where you can find a good selection of recommendations to apply before, during and after data analysis.

We develop research code using version control

At GEOTEC, GitHub is the preferred choice. We have an organization account where teams, members (developers), and the repos associated with each team are publicly available. A team on GitHub corresponds to a research project or a line of research at GEOTEC, so the repositories within a team relate to the same line of research or to a series of related projects.

Regardless of the programming language used, the folder structure, or the purpose of the repository (thesis, article, presentation, etc.), some elements are common to all repositories:

A README file that describes what the repository is about, what elements (code, data) it contains, its license, how to reproduce the results, and any specific conditions of the computational environment, such as hardware devices or computation time demands, if required
A LICENSE file
A requirements.txt file or similar to declare external dependencies and versions
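
The list above can be turned into a quick self-check before sharing a repository. The following is only a minimal sketch, not a GEOTEC tool: the function name `missing_elements` and the accepted file-name variants are assumptions made for illustration, and the check looks only at the top level of the repository.

```python
from pathlib import Path

# Assumed file-name variants for each common element (hypothetical;
# adjust to the conventions of your own repository).
REQUIRED = {
    "README": ("README", "README.md", "README.txt", "README.rst"),
    "LICENSE": ("LICENSE", "LICENSE.md", "LICENSE.txt"),
    "dependencies": ("requirements.txt", "environment.yml", "Dockerfile", "compose.yml"),
}


def missing_elements(repo_dir):
    """Return the common elements that are not found at the top level of repo_dir."""
    root = Path(repo_dir)
    # Names of the regular files that sit directly in the repository root.
    present = {p.name for p in root.iterdir() if p.is_file()}
    return [
        element
        for element, candidates in REQUIRED.items()
        if not any(name in present for name in candidates)
    ]
```

Calling `missing_elements(".")` from the root of a repository returns an empty list when a README, a LICENSE and at least one dependency file are present.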

We also write papers using version control

If research code is under version control, so are research papers. For example, we make extensive use of the cloud-based Overleaf application for collaborative paper writing, which tracks every change or edit by the authors using version control tools.

We release/archive data/code at the time of submission

When the article is almost ready to be submitted, it is time to permanently archive the version of the software and data used, developed or reported in the article. In short, the idea is to create a static, read-only version of each repository and deposit it on Zenodo, so each relevant GitHub repository for an article must have an associated Zenodo record. Several websites explain how to release a GitHub repo as a Zenodo record, so that the repo content and license are automatically exported to the Zenodo record.

We add a reproducibility declaration to the paper

Another task to perform before submitting an article to a journal is to add a brief statement of where the data and/or code reported in the article are located.

“The collected datasets and code used for their processing, along with the machine learning models, the code for training them, and the data obtained from the experiments are available under permissive licences as a reproducible package (specifying required dependencies, software versions, and documentation) on Zenodo [57].”

[57] M. Matey-Sanz, “Reproducible Package for ‘Implementing and evaluating the Timed Up and Go test automation using smartphones and smartwatches’”, version v1.0.0-r1, Zenodo, Jun. 2024, [Available in]: https://doi.org/10.5281/zenodo.12570519.

The example above, taken from that paper, cites the research resources previously archived on Zenodo (ref. 57) and mentions the permissive license selected. Although the example does not include one, we also recommend describing the hardware and software resources required for the analytical workflow: the hardware configuration (CPU/GPU, memory size, special devices, etc.), the operating system, the runtime or computational demand (which can be omitted if negligible or very low), and the requirements configuration, i.e. a list of required libraries and versions. The latter can be omitted from the article if a requirements.txt, environment.yml, Dockerfile, compose.yml or similar file, together with a README file, is part of the associated code repository on GitHub that is connected to the corresponding Zenodo record. For example, the repository related to ref. 57 on GitHub includes both a requirements.txt file and a Dockerfile.
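
Part of the environment description recommended here (operating system, machine type, library versions) could be gathered with a short script rather than typed by hand. The sketch below is only an illustration using the Python standard library; the function name `environment_report` and the package names passed to it are assumptions, and details such as GPU model or memory size would still need to be added manually.

```python
import platform
from importlib import metadata


def environment_report(packages):
    """Collect basic facts about the machine and the installed package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the current environment.
            versions[name] = "not installed"
    return {
        "python": platform.python_version(),
        "operating_system": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }


# Example call with hypothetical package names:
# print(environment_report(["numpy", "pandas"]))
```

The resulting dictionary can be pasted into a README or saved next to the results, so that readers know under which conditions the analysis was run.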

We deposit a preprint at the time of submission

We can upload the preprint to arXiv, SSRN or EarthArXiv at the time of submission. This does not mean double the effort: the same file can be uploaded to the preprint server and sent to a journal for peer review.

We update connected resources once the paper is out

Once a paper is accepted and published, it is good practice to complete missing bits of information, such as the DOI and citation details, in the associated resources on GitHub and Zenodo. By doing so, we allow bi-directional interconnection between all resources. From the published paper, simply clicking on a DOI will take you to the Zenodo entry and from there to the corresponding GitHub repository. Conversely, you can navigate from the GitHub repository to the Zenodo entry and to the published paper or postprint version.
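
As a small illustration of this cross-linking, the lines to paste into a README or a Zenodo description can be generated from the identifiers themselves. This is a hedged sketch only: the function names are invented for the example, and the paper DOI and repository URL in the usage comment are placeholders, not the real ones. Only the Zenodo DOI is taken from ref. 57 above.

```python
def doi_url(doi):
    """Turn a bare DOI into a resolvable https://doi.org link."""
    return "https://doi.org/" + doi.strip()


def cross_reference_lines(paper_doi, zenodo_doi, repo_url):
    """Build the three links that connect paper, archive and repository."""
    return [
        "Published paper: " + doi_url(paper_doi),
        "Archived package (Zenodo): " + doi_url(zenodo_doi),
        "Source repository (GitHub): " + repo_url,
    ]


# Example call; the first and third arguments are placeholders:
# for line in cross_reference_lines(
#     "10.0000/placeholder-paper-doi",
#     "10.5281/zenodo.12570519",
#     "https://github.com/example-org/example-repo",
# ):
#     print(line)
```

Adding the same three lines to each resource is what makes the navigation work in both directions.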

We (the library, in fact) upload a postprint to the institutional repository

At UJI, the library requests from authors the preprint or postprint version of a non-open access article, depending on the publisher’s conditions. In the case of an open access article, the library deposits the open access version in the institutional repository. For example, this is the handle of the open access scientific article deposited in the UJI institutional repository.