How to make your research paper reproducible

Abstract

A reproducible paper is one in which every table and figure can be recreated by a reader or a reviewer from your data sets, using software on their own computer or using web-based tools. You may want to use a Jupyter notebook to write your next paper and make the entire document reproducible.

It is assumed that you have a basic knowledge of the Python programming language and are familiar with using the terminal, and (if you wish to try this from scratch on your own computer) that you are able to install pip and Docker. For the purposes of this tutorial, however, you only need a browser and a stable internet connection. The document is made interactive thanks to Binder. You can also optionally archive the reproducible version of your paper on Zenodo.

  1. To start with, choose to perform your analysis in one of the programming languages supported by Jupyter.
  2. Create and launch a new Jupyter notebook.
  3. Enter your text in a “text cell”. You can also use LaTeX for any equations you want to include.
  4. Import your data in a “code cell”. You can store the data in the same folder as your notebook, or fetch data that was uploaded to Zenodo.
  5. Instead of preparing your tables and figures separately and placing them in your paper, include the code for each in a “code cell”. Then, “run” each code cell to generate the table or figure.
  6. When your paper is ready, upload the Jupyter notebook to a code-sharing platform like GitHub. Optionally, upload the file to Zenodo.
  7. Follow the instructions for Binder to allow your paper to be recreated in the cloud.
  8. Share the Binder link to your reproducible paper!

As a bonus, using an experimental tool called “Jupyter Graffiti”, you can watch a recording of this document in action right within it. Just click here!

Introduction

Consider a piece of research, the output of which can be showcased by a single plot. For example, by analysing the spectrum of the "invariant" masses of two muons recorded by the Compact Muon Solenoid particle detector at the Large Hadron Collider, it is possible to detect the presence of a Z boson (with an invariant mass of ~91 GeV/c²), as seen in the following image.

Instead of simply including this image, you can provide the data that produced it, along with the code that was used in the process.

Preparing the reproducible document

You can install Jupyter on your computer and create a notebook for your paper locally. However, since your final document will reside online and be made interactive via Binder, you can work with the same environment provided by Binder by using repo2docker. You will need to install Docker before proceeding.

Once you have installed repo2docker, open a terminal and navigate to a new folder/directory where you would like to assemble the contents of your paper. Create a file called requirements.txt containing a list of the Python libraries to be used in your analysis. See the file used in this tutorial. Save the file and then run the following command:

repo2docker -E .

The -E flag stands for “editable” and tells repo2docker to allow any changes made in your browser to be written to the directory you are working in. The . means “current directory”.
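The requirements.txt file mentioned above can be as simple as one package name per line. The following is a minimal sketch, assuming the only libraries needed are the three imported later in this tutorial; the file actually used by the tutorial may list other packages or pin versions. The helper names are hypothetical.

```python
from pathlib import Path

# Assumption: these are the three libraries imported later in this tutorial.
# The tutorial's own requirements.txt may list others or pin versions.
PACKAGES = ["pandas", "numpy", "matplotlib"]


def requirements_text(packages):
    """Return the contents of a requirements.txt file, one package per line."""
    return "\n".join(packages) + "\n"


def write_requirements(directory, packages=PACKAGES):
    """Write requirements.txt into `directory` and return its path."""
    path = Path(directory) / "requirements.txt"
    path.write_text(requirements_text(packages))
    return path


# Show what the file would contain, without writing anything to disk.
print(requirements_text(PACKAGES), end="")
```

Calling write_requirements(".") from your working directory would create the file next to your notebook, where repo2docker looks for it.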

The first time you run this command, repo2docker will build a “Docker container” with all of the software needed to run a Jupyter notebook. This can take a while, but once the container has been built, subsequent launches will be faster.

Some time after you run the above command, you will be presented with a URL in the same terminal window that you can copy and paste into a new browser tab. The Jupyter interface will greet you.

Now, click on the “New” button on the right side, and select a Python3 notebook. In your new notebook, you can add individual “cells” for (a) plain text prose formatted as markdown or (b) code for performing your analysis.

Double-click on this text to enter the edit mode. You can make any changes in the text, if you wish, and exit the edit mode for a markdown cell by pressing Ctrl+Enter (or Command+Enter on macOS).

Setup

Next, start preparing your analysis by importing the libraries you installed in the requirements.txt file and setting any configuration options as needed. The next cell is a code cell, and you can run the code in it by selecting the cell and hitting Ctrl/Command+Enter or by clicking on the “Run” button in the menu above.

Go ahead, select the cell and click on “Run”.

In [ ]:
# pandas is for working with data structures and performing data analysis
import pandas as pd 

# numpy is for scientific computing
import numpy as np

# matplotlib is for plotting
import matplotlib.pyplot as plt

# Set the DPI for the images to make them larger than the default setting
plt.rcParams['figure.dpi'] = 300

The empty brackets in In [ ] next to the cell will now show the number 1, as In [1]. This shows that the code in the (input) cell has been executed and assigned a number. If you run the same cell again, the number will change to 2.

Import data

You will need to connect the document with the data used in your analysis. The data can reside in the same directory as your paper or a sub-directory of the main directory. However, this will increase the size of your working directory and will make it more problematic to share your paper with others.

Therefore, consider hosting the final dataset on an online data repository. In this tutorial, the data we will “analyse” are from the CMS collaboration at CERN, served from the CERN Open Data portal, in CSV format: https://opendata.web.cern.ch/record/545/. The next cell has the data-import step.

Now, select this cell and click on “Run” as before.

In [ ]:
data = pd.read_csv('https://opendata.cern.ch/record/545/files/Zmumu.csv')
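If readers are likely to re-run the notebook many times, one option is to download the remote file once and reuse the local copy afterwards. The sketch below is one way to do that with the standard library only; the helper name local_copy and the cache folder name "data" are assumptions made for illustration, not part of the original tutorial.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve

DATA_URL = "https://opendata.cern.ch/record/545/files/Zmumu.csv"


def local_copy(url, cache_dir="data"):
    """Return the path of a local copy of `url`, downloading it only if missing."""
    filename = Path(urlparse(url).path).name  # last part of the URL path
    target = Path(cache_dir) / filename
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        # Network access happens only when the file is not cached yet.
        urlretrieve(url, str(target))
    return target
```

In the notebook, the import step could then read data = pd.read_csv(local_copy(DATA_URL)), which behaves the same for the reader but avoids repeated downloads.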

You can also show the reader that the data have been imported correctly by showing the first five rows (the “head”) of the CSV. When you run the next cell, an output block will appear below the input cell, carrying the same number as the input cell.

In [ ]:
data.head()
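Besides looking at the first rows, you may want to check programmatically that the columns your analysis depends on are present. The following is a small sketch of such a check; the function name is hypothetical, and the example header is made up for illustration rather than copied from the CMS file.

```python
# Columns that the invariant-mass calculation in this tutorial reads.
REQUIRED_COLUMNS = ("pt1", "pt2", "eta1", "eta2", "phi1", "phi2")


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`."""
    available = set(columns)
    return [name for name in required if name not in available]


# Hypothetical header with one column left out, for illustration only.
example_header = ["pt1", "eta1", "phi1", "pt2", "eta2"]
print(missing_columns(example_header))
```

In the notebook you would pass data.columns instead of the example header; an empty list means every required column was found.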

Describing and performing the analysis

You can also use $\LaTeX$ in your markdown prose. Here, I will quote from the example provided by the CMS collaboration at CERN.

Let's use the following expression for the invariant mass $M$ in the calculation:

$M = \sqrt{2p_{t1}p_{t2}(\cosh(\eta_{1}-\eta_{2}) - \cos(\phi_{1}-\phi_{2}))}$

In the expression $p_t$ is the component of the momentum which is perpendicular to the beam axis, $\eta$ is the pseudorapidity (angle) and $\phi$ the azimuthal angle.
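To get a feel for the expression, it can help to evaluate it by hand for a single event. The sketch below uses only Python's math module and a hypothetical event (not taken from the CMS data): two muons with equal $p_t$, equal $\eta$, and azimuthal angles that differ by $\pi$. In that case $\cosh(0) - \cos(\pi) = 2$, so the expression reduces to $M = 2p_t$.

```python
import math


def invariant_mass(pt1, pt2, eta1, eta2, phi1, phi2):
    """Invariant mass of two muons, using the expression quoted above."""
    return math.sqrt(
        2 * pt1 * pt2 * (math.cosh(eta1 - eta2) - math.cos(phi1 - phi2))
    )


# Hypothetical event, for illustration only: equal transverse momenta
# (45.6 GeV), equal pseudorapidities, and opposite azimuthal angles.
mass = invariant_mass(pt1=45.6, pt2=45.6, eta1=0.5, eta2=0.5,
                      phi1=0.0, phi2=math.pi)
print(round(mass, 1))  # 2 * 45.6 = 91.2
```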

In the calculation below we will use the numpy module, which was imported as np in the first code cell. With numpy it is possible to use mathematical functions like sqrt and cosh by writing first the name of the module (np) and then the function, separated by a dot. So, for example, the square root is called by writing np.sqrt( ).

The labels pt1, pt2, eta1, eta2, phi1 and phi2 refer to the columns of the data. In the code, you have to say where the values are taken from. So if you want to get the column pt1, you have to write data.pt1 in the code.

Now we are ready to calculate the values of the invariant masses for the different events. Numpy will automatically calculate the values for all of the events when we give the calculation in the following form. So the equation given is calculated for all of the rows.
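The element-by-element behaviour described above can be seen on a handful of values. The sketch below uses three hypothetical events stored in plain numpy arrays (in the notebook, the columns of data play this role) and compares the single vectorised expression with an explicit loop over events.

```python
import numpy as np

# Hypothetical values for three events, for illustration only.
pt1 = np.array([45.6, 30.0, 20.0])
pt2 = np.array([45.6, 30.0, 25.0])
eta1 = np.array([0.5, 1.0, -0.3])
eta2 = np.array([0.5, -1.0, 0.4])
phi1 = np.array([0.0, 0.2, 1.5])
phi2 = np.array([np.pi, 0.2, -1.5])

# One expression, evaluated element by element: entry i uses entry i of every array.
masses = np.sqrt(2 * pt1 * pt2 * (np.cosh(eta1 - eta2) - np.cos(phi1 - phi2)))

# The same calculation written as an explicit loop over events.
looped = np.array([
    np.sqrt(2 * a * b * (np.cosh(c - d) - np.cos(e - f)))
    for a, b, c, d, e, f in zip(pt1, pt2, eta1, eta2, phi1, phi2)
])

print(masses.shape)                 # one mass per event
print(np.allclose(masses, looped))  # both approaches agree
```

The vectorised form is shorter and avoids the Python-level loop, which matters once the data contain many events.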

Once you have described the calculation, you can perform it in a code cell, for transparency. The next code cell calculates the “invariant” mass from the other data available. We can also display the head of the newly generated data.

In [ ]:
invariant_mass = np.sqrt(2*data.pt1*data.pt2*(np.cosh(data.eta1-data.eta2) - np.cos(data.phi1-data.phi2)))

In [ ]:
invariant_mass.head()

Displaying plots

Now that you have your data all ready for plotting, go ahead and write the code for doing so. The following code cell (which has comments to aid the reader) generates the same plot we saw earlier in the introduction.

In [ ]:
# Make a histogram using the invariant mass calculated above.
# N.B.: You can change the value for bins and re-run the same code.
plt.hist(invariant_mass, bins=500)

# Name the axes and give the plot a title.
plt.xlabel('Invariant mass [GeV]')
plt.ylabel('Number of events')
plt.title('The histogram of the invariant masses of two muons \n')

# Show the plot.
plt.show()

Et voilà ! Your reproducible paper is ready. If a reader wants to make a change to your plotting code (say, changing the number of bins), they can do so in the code cell above and re-run it. Go ahead, try changing the bins to a different number, like 50.
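When bins is given as a number, the histogram splits the range of the data into that many bins of equal width, so fewer bins means wider bins. The sketch below shows the arithmetic; the mass range used here is a hypothetical one chosen for illustration, since the real range depends on the data.

```python
def bin_width(lower, upper, bins):
    """Width of each bin when [lower, upper] is split into `bins` equal bins."""
    return (upper - lower) / bins


# Hypothetical mass range in GeV; the real range depends on the data.
for bins in (500, 50):
    print(bins, bin_width(60.0, 120.0, bins))
```

Wider bins smooth out statistical fluctuations but can also blur a narrow peak, which is why letting the reader change this value is useful.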

Before you share it, make sure the document is clean. If you want to share it with all of the code already executed, simply click on Kernel and Restart & Run All. This will ensure that all of the code cells are run once only, and in the right order. If you want to share the paper without all of the code executed, click on Kernel and Restart & Clear Output. This will allow the reader to run the code on a clean slate, using Binder.
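A notebook file is stored as JSON, so both states described above can also be checked or produced with a short script. The following is a sketch under that assumption, using a tiny hand-written notebook structure rather than a real file; the function names are hypothetical, and a real notebook with empty code cells would need a more lenient order check.

```python
import copy


def clear_outputs(notebook):
    """Return a copy of a notebook dict with code-cell outputs and counters removed."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


def executed_in_order(notebook):
    """True if the code cells carry execution counts 1, 2, 3, ... from top to bottom."""
    counts = [
        cell.get("execution_count")
        for cell in notebook.get("cells", [])
        if cell.get("cell_type") == "code"
    ]
    return counts == list(range(1, len(counts) + 1))


# Tiny hand-written notebook structure, for illustration only.
# The two code cells were run in the wrong order (2 before 1).
example = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["Some prose"]},
        {"cell_type": "code", "execution_count": 2, "metadata": {},
         "outputs": [], "source": ["import numpy as np"]},
        {"cell_type": "code", "execution_count": 1, "metadata": {},
         "outputs": [], "source": ["x = np.sqrt(4)"]},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

print(executed_in_order(example))
print(clear_outputs(example)["cells"][1]["execution_count"])
```

To apply this to a real file, you could load it with json.load and pass the resulting dictionary to these functions; the menu entries in the Jupyter interface remain the simpler route for a single paper.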

Sharing the paper

Since your paper is compatible with Binder, you are all set. If you are familiar with git and use services such as GitHub, upload your entire directory to GitHub as a new repository. For example, this paper resides on GitHub at https://github.com/RaoOfPhysics/reproducible-paper. You can get a “Badge” to put in your README file that links directly to the executable version of your paper, like the following button that links to this page:

Binder

Once you upload your paper to GitHub, go to https://mybinder.org and get a Binder link to your paper as well. Share this link with your readers.
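The link and badge that mybinder.org generates follow a regular pattern, so they can also be assembled by hand. The sketch below assumes the URL pattern used by mybinder.org for GitHub repositories and uses "HEAD" as the reference to build; both are assumptions, so it is worth comparing the result with what the website gives you. The function names are hypothetical.

```python
def binder_url(user, repo, ref="HEAD"):
    """Build a mybinder.org launch URL for a public GitHub repository."""
    # Assumption: "HEAD" selects the default branch; pass a branch, tag or commit otherwise.
    return f"https://mybinder.org/v2/gh/{user}/{repo}/{ref}"


def binder_badge(url):
    """Markdown for a Binder badge that links to `url`."""
    return f"[![Binder](https://mybinder.org/badge_logo.svg)]({url})"


url = binder_url("RaoOfPhysics", "reproducible-paper")
print(url)
print(binder_badge(url))
```

The second line printed is Markdown that can be pasted into a README file to show the badge.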

Optionally, or if you aren’t familiar with git, upload the entire contents of your working directory to Zenodo and then point Binder to the Zenodo record.

Conclusion

With a little preparation, you can make your next research paper more open, transparent and reproducible.

Acknowledgements

This tutorial relies on example code produced by members of the CMS Collaboration. The code samples can be found in the following notebooks:

Python libraries and tools used