Towards reproducible science in the digital humanities – Doctoral Training Unit “Digital History & Hermeneutics”

How to publish your data and code alongside your research with the help of Zenodo?

Following the principles of reproducible science has recently become an increasingly important part of publishing research.¹ The trend gained momentum as a result of the replication crisis of the 2010s, which began after it was found that most of the published results in medicine and the social sciences could not be reproduced.²

Because replicability in the humanities³ is generally more complicated⁴ than in other sciences, some have argued that it might not even be desirable at all.⁵ However, the principles of publishing reproducible results are certainly important for digital humanists who produce knowledge using digital tools and algorithms. For example, the statistical analysis of a dataset should be reproducible, and the ability to reproduce it will probably also be an essential component of the emerging digital hermeneutics in history. Luckily, most digital methods make results easy to reproduce, as a significant part of the work exists in silico, as media, data or code. Those digital assets come coupled with tools that have been developed to reproduce the workflow applied during research (content management systems, version control systems, digital asset repositories etc.).

To make the results of digital humanities projects reproducible, it is necessary to publish not only the results but also the digital assets (data, software). Publishing those along with your research paper is still not considered a requirement in humanities journals, but it is becoming more popular⁶ and has several benefits. By giving readers an insight into the methods used, it helps them to build directly on your work and it can greatly increase your scientific visibility. Although opening up in this way is sometimes accompanied by unease, it also invites constructive criticism of your work and thus helps to build better foundations for further research.

Let us move on to the goal of the current post: how to publish your data and code alongside your research?

Let’s assume that you have a dataset in Excel format that is used as a data source or a Jupyter Notebook file including statistical analysis and generation of diagrams used in your article.

You would like to make this material available for others to explore, use and cite. A traditional way of sharing has been to use private or institutional digital space (e.g. data published on a web page) and to refer to its URL in your main article. Unfortunately, there are no guarantees that the selected service will continue to exist, and the material is not citable as a scientific publication. Also, other researchers attempting to reproduce your results need access to exactly the same dataset or code that you used, but private servers offer no guarantee of this. To solve those problems, Zenodo was created within the OpenAIRE project: a domain-agnostic, free, open-access research repository that is hosted at the CERN Data Centre and funded by the EU.

What are the steps to publish your code / data in Zenodo?

1) Clarify publication conditions

The first step before publishing your code is to inform all stakeholders of your research and to clarify the conditions of the research data publication. The sensitivity of any data has to be considered. If your research involves personal data or is otherwise subject to the GDPR, this must be taken into account and the data must be anonymised.

Each domain has its exceptional cases. For example, in archaeology precise find locations are often not published because of treasure hunters who tend to read research papers with the sole purpose of looting sites for antiquities.

Authorship and copyright conditions of published digital assets must be determined. Are the authors of the assets the same as the authors of the related research paper? The manner in which collective work is going to be published has to be considered. It needs to be discussed with co-authors and representatives of your institution because they often hold the copyright to your research data. The publisher might also set the conditions for data publishing. Some publishers, for example, require the publication of all source code used in the publisher's data repository. In the humanities this is yet to happen.

2) Prepare your data / code

The next step, preparing your data and code, is case dependent but must be taken seriously. Before publishing your code, clean, comment and document it properly. Remove all personal test scripts and make sure the coding style meets the guidelines of the programming language used.

Data should have meaningful variable and classifier names, and there should be a description of the structure and of all elements. In a perfect world this would have been the case already during your work, but usually the world is not so perfect.
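One lightweight way to provide such a description is a small data dictionary shipped next to the dataset. The following is only a sketch under stated assumptions: it assumes the Excel sheet has been exported to CSV, and the function name, the column names (find_id, period, weight_g) and the descriptions are hypothetical examples, not part of any real dataset.

```python
import csv
import io


def build_data_dictionary(csv_text, descriptions):
    """Return one entry per column: variable name, description, example value.

    `descriptions` maps column names to human-readable explanations.
    Columns without an explanation are flagged so they are not forgotten.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    first_row = next(reader, {})  # used only to show an example value
    entries = []
    for name in reader.fieldnames or []:
        entries.append({
            "variable": name,
            "description": descriptions.get(name, "TODO: describe this variable"),
            "example": first_row.get(name, ""),
        })
    return entries


# Hypothetical example data, standing in for an exported spreadsheet.
sample = "find_id,period,weight_g\nA001,Iron Age,12.5\n"
dictionary = build_data_dictionary(
    sample,
    {"find_id": "Unique identifier of the find", "weight_g": "Weight in grams"},
)
for entry in dictionary:
    print(f'{entry["variable"]}: {entry["description"]} (e.g. {entry["example"]})')
```

The TODO marker makes undocumented variables visible before publication, which is the point of the exercise.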

And most importantly – if you build your code based on published theory, don’t shy away from scientific citations in your code and dataset comments.
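As an illustration of what such a citation can look like, here is a minimal sketch of a function whose docstring names the publication it is based on. The function and its use on word tokens are a made-up example; Shannon's paper is simply used as a well-known reference.

```python
import math
from collections import Counter


def shannon_entropy(tokens):
    """Shannon entropy (in bits) of the frequency distribution of `tokens`.

    Reference:
        Shannon, C. E. (1948). A mathematical theory of communication.
        Bell System Technical Journal, 27(3), 379-423.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # an empty sequence carries no information
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Two equally frequent tokens give exactly one bit of entropy.
print(shannon_entropy(["a", "b"]))
```

A reader who wants to check the implementation against the theory then knows exactly where to look.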

3) Pick a LICENCE

Licensing is a very important part of publishing your research materials as it defines how they can be used by others. Possible licenses range from commercial licenses requiring purchase of the assets to the Unlicense – allowing everybody to do whatever they want with your work without even having to mention you.

The optimal licensing strategy varies for different types of digital assets. For example, Creative Commons licenses are well suited for publishing media or data but unsuitable for code / software. For the latter, the GPL, LGPL or Apache licenses are better choices. In the case of code, a major factor is ensuring compatibility with the licences of all the components you have used. Some components might prescribe that you must use the same licence again for distributing your own code.

For selecting from Open Source licenses it is good to explore https://choosealicense.com/, a site dedicated exactly to this issue.

In most cases you would select a license text and put it into a LICENCE.txt file in your code/data folder. You would also need to mention the licence in your general project description.

4) Create a README file / description of your repository

Published code typically includes a README file describing its goals and usage. The typical file format is Markdown, and the file is created as README.md in the root directory. The file and its format are not mandatory, but similar content should be available in your published repository in the end, e.g. in Zenodo it could be copied to the description field. The README / description should contain the following sections.

General overview. This section should contain an overview of the goals of the code / data, general description of functionality and structure.

Install. This section is required for code and contains step-by-step guidelines for installing the code for potential users. It should also include information about the software requirements.

For data there should be a description of data structure, giving information about tables, fields, relations, possibly as a diagram.

Usage. This section provides a template of the code / data usage. For example, if we are publishing code for assessing the quality of statistical models, we should show a simple example here that includes instructions on how to generate or acquire an example dataset (e.g. by downloading it from the Internet).

License. This section is mandatory and should be simple: “Released under the licence [your selected licence]”.

Citing. This section should contain information on how to cite the data, e.g.:

Cite as
Author1, Author2 (2020). Supporting data and script for “Our article name” (Authors) [Data set]. http://doi.org/XX.XXXX/zenodo.XXXXXXX

The URL points to your digital object identifier (DOI), which is central to citing your data assets.

References. This section should end the text, as usual.
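The section list above can be turned into a starting skeleton with a few lines of Python. This is only a sketch: the function name, the project title and the chosen licence are hypothetical, and the DOI placeholder is deliberately left unfilled, exactly as in the citation template above.

```python
def build_readme(title, overview, licence, citation):
    """Return a Markdown README skeleton with the sections listed above."""
    sections = [
        ("General overview", overview),
        ("Install", "TODO: step-by-step installation and software requirements "
                    "(for data: description of tables, fields and relations)."),
        ("Usage", "TODO: a simple, complete usage example."),
        ("License", f"Released under the licence {licence}."),
        ("Citing", f"Cite as\n{citation}"),
        ("References", "TODO: list the literature your code or data builds on."),
    ]
    lines = [f"# {title}", ""]
    for heading, body in sections:
        lines += [f"## {heading}", "", body, ""]
    return "\n".join(lines)


# Hypothetical project details; the DOI stays a placeholder until it is reserved.
readme = build_readme(
    title="Supporting data and script for “Our article name”",
    overview="Dataset and notebook used to produce the figures of the article.",
    licence="Apache-2.0",
    citation="Author1, Author2 (2020). Supporting data and script for "
             "“Our article name” [Data set]. http://doi.org/XX.XXXX/zenodo.XXXXXXX",
)
print(readme)
```

Writing the result to README.md in the root directory (or pasting it into the Zenodo description field) gives you a checklist of sections that still need to be filled in.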

5) Upload your work to Zenodo

When the preparations are done, you only need to upload your work to Zenodo. Log in (https://zenodo.org/login/) or create a user account in Zenodo (https://zenodo.org/signup/). It is preferable to link your account to your existing ORCID iD and / or GitHub account if you have those. Next it is time to click the upload button.

Giving a detailed overview of the simple upload process is not the goal of the current post, as the process is mostly self-explanatory and has already been covered in several tutorials⁷. However, a few points are worth mentioning.

First, you will probably want to insert the DOI (digital object identifier) in the supplementary documents you are uploading alongside your digital assets. For this purpose there is a “Reserve DOI” button, which reserves the DOI in advance and registers it only when your work is published. The DOI URL should also be used in your published article when referring to the digital assets.

The form gives an opportunity to link your dataset to the main article through the related/alternative identifiers section. This can of course be done only when the article is submitted and has received a DOI.

Zenodo allows you to embargo the digital assets to protect them until the paper is accepted for publication. An embargo is typically set for 6 months and can be lifted at any time.
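If you prefer to script the upload instead of using the web form, the same three notes (reserved DOI, link to the main article, embargo) can be expressed as a metadata record. The sketch below only builds that record and does not contact Zenodo. The field names (upload_type, prereserve_doi, access_right, embargo_date, related_identifiers, isSupplementTo) are written from memory of Zenodo's deposit API documentation and should be treated as assumptions to verify there; the function name, the article DOI placeholder and the 182-day approximation of six months are my own choices.

```python
import json
from datetime import date, timedelta


def build_zenodo_metadata(title, description, creators,
                          article_doi=None, embargo_days=None, today=None):
    """Build a deposit metadata record (field names are assumptions, see above)."""
    today = today or date.today()
    metadata = {
        "upload_type": "dataset",
        "title": title,
        "description": description,
        "creators": [{"name": name} for name in creators],
        "prereserve_doi": True,   # ask for a DOI before publishing
        "access_right": "open",
    }
    if embargo_days is not None:
        # Keep the files closed until the embargo date has passed.
        metadata["access_right"] = "embargoed"
        metadata["embargo_date"] = (today + timedelta(days=embargo_days)).isoformat()
    if article_doi is not None:
        # Link the dataset to the main article once the article has a DOI.
        metadata["related_identifiers"] = [
            {"identifier": article_doi, "relation": "isSupplementTo"}
        ]
    return {"metadata": metadata}


record = build_zenodo_metadata(
    title="Supporting data and script for “Our article name”",
    description="Dataset and notebook used in the article.",
    creators=["Author1", "Author2"],
    article_doi="10.XXXX/placeholder",   # hypothetical, not a real DOI
    embargo_days=182,                    # roughly six months
    today=date(2020, 1, 1),
)
print(json.dumps(record, indent=2, ensure_ascii=False))
```

Keeping the record in a file next to your data also documents what was entered in the upload form.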

6) Cite the digital assets in your publication where you used them

Once you have your DOI, whether published or just reserved, you are ready to cite the digital assets in your manuscript(s). This is done by including the DOI URL in figure and table captions, by referring to it in supplementary materials or similar sections and, most importantly, by including the citation in the reference list of your manuscript.

Conclusion

After following those steps, your research has been published in a transparent way following the principles of reproducible science. As can be seen, publishing your data and source code takes quite some effort and time. At the same time, it will probably change your way of thinking about the “digital” part of your research. The benefits also include increased visibility and a closer connection to the digital humanities community through the use of technological tools, services and networks originally created for the software development industry.

¹ Reproducibility of Scientific Results, Stanford Encyclopedia of Philosophy. https://plato.stanford.edu/entries/scientific-reproducibility/
Munafò, M.R., Nosek, B.A., Bishop, D.V., Button, K.S., Chambers, C.D., Du Sert, N.P., Simonsohn, U., Wagenmakers, E.J., Ware, J.J. and Ioannidis, J.P., 2017. A manifesto for reproducible science. Nature Human Behaviour, 1(1), pp. 1-9. https://www.nature.com/articles/s41562-016-0021
https://ropensci.github.io/reproducibility-guide/sections/introduction/
² Pashler, H. and Wagenmakers, E., 2012. Editors’ Introduction to the Special Section on Replicability in Psychological Science: A Crisis of Confidence? Perspectives on Psychological Science, 7(6), pp. 528-530.
³ O’Sullivan, J., 2019. The humanities have a ‘reproducibility’ problem. Talking Humanities. https://talkinghumanities.blogs.sas.ac.uk/2019/07/09/the-humanities-have-a-reproducibility-problem/
⁴ Peels, R. and Bouter, L., 2018. The possibility and desirability of replication in the humanities. Palgrave Communications, 4(1), pp. 1-4. https://www.nature.com/articles/s41599-018-0149-x
⁵ de Rijcke, S., Holbrook, J.B. and Penders, B., 2019. The humanities do not need a replication drive. https://www.cwts.nl/blog?article=n-r2v2a4&title=the-humanities-do-not-need-a-replication-drive
⁶ Marwick, B., 2017. Computational reproducibility in archaeological research: basic principles and a case study of their implementation. Journal of Archaeological Method and Theory, 24(2), pp. 424-450. https://ro.uow.edu.au/smhpapers/4034/
⁷ General tutorials:
https://help.zenodo.org/
https://library.cfa.harvard.edu/data-archiving-and-sharing
https://genr.eu/wp/cite/
https://instruct-eric.eu/help/other/zenodo-upload-guidelines
https://www.openaire.eu/zenodo-guide
https://guides.lib.berkeley.edu/citeyourcode
For publishing your code directly as a snapshot from a GitHub repository see:
https://www.software.ac.uk/blog/2016-09-26-making-code-citable-zenodo-and-github
https://guides.github.com/activities/citable-code/
