This Jupyter Notebook presents the results of the survey "Survey on Understanding Experiments and Research Practices for Reproducibility", conducted in the context of the DFG CRC/TRR ReceptorLight. The notebook analyses the survey dataset, which offers insight into the reproducibility crisis and into how scientists think it can be tackled. The purpose of the online survey was to better understand what is needed to make scientific experiments reproducible. The dataset consists of 26 questions grouped into 6 sections. We analyse each question in the following cells.
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
# Read the processed survey dataset (CSV) from the current GitHub repository
df = pd.read_csv('ProcessedData_Survey_on_Understanding_Experiments_and_Research_Practices_for_Reproducibility.csv')
# Print the first 5 rows of the dataset
df.head()
Response ID | Date started | Date last action | [Yes, I have read the privacy policy and agree./ Ja, ich habe die Datenschutzerklärung gelesen und stimme zu.] | What is your current position? | What is your current position? [Other] | What is your primary area of study? | What is your primary area of study? [Other] | Do you think there is a reproducibility crisis in your field of research? | Do you think there is a reproducibility crisis in your field of research? [Other] | ... | Where do you save your experimental metadata like descriptions of experiment, methods, samples used? [Data Management Platforms] | Where do you save your experimental metadata like descriptions of experiment, methods, samples used? [Other] | Do you write scripts or program to perform data analysis at any stage in your experimental workflow? | Have you heard about the FAIR (Findable, Accessible, Interoperable, Reusable) principles? | Does your research follow the FAIR (Findable, Accessible, Interoperable, Reusable) principles? [Findable] | Does your research follow the FAIR (Findable, Accessible, Interoperable, Reusable) principles? [Accessible] | Does your research follow the FAIR (Findable, Accessible, Interoperable, Reusable) principles? [Interoperable] | Does your research follow the FAIR (Findable, Accessible, Interoperable, Reusable) principles? [Reusable] | Please feel free to provide comments regarding what you think is important to enable understandability and reproducibility of scientific experiments in your field of research. | Total time | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2 | 2019-01-24 08:37:24 | 2019-01-24 08:40:56 | Yes | PhD Student | NaN | Computer Science | NaN | Yes | NaN | ... | Primary Source | Primary Source | Yes | Yes | Often | Often | Sometimes | Often | NaN | 215.17 |
1 | 3 | 2019-01-24 10:10:30 | 2019-01-24 10:16:33 | Yes | PhD Student | NaN | Computer Science | NaN | Yes | NaN | ... | Secondary Source | NaN | Yes | Yes | Often | Often | Sometimes | Sometimes | NaN | 365.51 |
2 | 4 | 2019-01-24 10:20:13 | 2019-01-24 10:33:42 | Yes | PostDoc | NaN | Biology(other) | NaN | No | NaN | ... | NaN | NaN | No | Yes | Often | Rarely | Sometimes | Sometimes | NaN | 811.73 |
3 | 6 | 2019-01-24 18:34:03 | 2019-01-24 18:40:46 | Yes | Professor | NaN | Computer Science | NaN | Other | partly | ... | Other | Other | Yes | No | NaN | NaN | NaN | NaN | NaN | 405.30 |
4 | 7 | 2019-01-24 18:43:32 | 2019-01-24 18:56:51 | Yes | Professor | NaN | Computer Science | NaN | Yes | NaN | ... | NaN | NaN | Yes | Yes | Rarely | Sometimes | Rarely | Rarely | NaN | 802.05 |
5 rows × 94 columns
# The size of the dataframe: the total number of rows and columns in the dataset
df.shape
(101, 94)
# Set the configurations for the chart
fontsize = 14
chart_color = "#1565C0"
colors = ['#2196F3', '#90CAF9', '#E3F2FD']
# Function to draw a bar chart for the given column title
def draw_bar_chart(title, df=df):
    ax = df[title].value_counts().plot(kind="bar", figsize=(15,7), color=chart_color, title=title)
    plt.xticks(fontsize=fontsize)
    for spine in plt.gca().spines.values():
        spine.set_visible(False)
    plt.yticks([])
    # Percentages are relative to all rows, including participants who skipped the question
    total = len(df[title])
    # This loop adds the percentage annotation above each bar
    for p in ax.patches:
        percentage = '{:0.2f}%'.format(100 * float(p.get_height())/total)
        height = p.get_height()
        x, y = p.get_xy()
        ax.annotate(percentage, (x, y + height + 0.2), fontsize=fontsize)
# Function to draw a grouped bar chart for several columns, given the column names and a title
def draw_bar_chart_mul_col(column_array, title, df=df):
    df1 = pd.DataFrame(df[column_array])
    # Count the answers per column; the answer options become the x-axis categories
    ax = df1.apply(pd.Series.value_counts, dropna=True).plot(kind="bar", figsize=(15,9), zorder=2, width=0.8, title=title)
    ax.legend(loc='best')
    plt.xticks(fontsize=fontsize)
    # Hide all four spines so the chart is drawn without a frame
    for spine in ax.spines.values():
        spine.set_visible(False)
# Function to draw a pie chart for the given column title
def draw_pie_chart(title, df=df):
    # value_counts() drops missing answers, so the shares refer to given answers only
    counts = df[title].value_counts()
    fig, ax1 = plt.subplots(figsize = (7,7))
    ax1.pie(counts, autopct = '%0.2f%%', colors=colors, textprops={'fontsize': fontsize})
    ax1.legend(counts.index, loc = "upper right")
    plt.title(title, bbox={'facecolor':'1.0', 'pad':5})
    plt.tight_layout()
    plt.show()
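The bar chart helper divides by `len(df[title])`, which counts skipped answers, while the pie chart works on `value_counts()`, which drops them. The following minimal sketch, using a made-up toy series rather than the survey data, illustrates how the two denominators can give different percentages:

```python
import pandas as pd

# Toy answers (hypothetical); None stands for a participant who skipped the question.
answers = pd.Series(["Yes", "No", "Yes", None, "Yes"])

# Share of all rows: skipped answers stay in the denominator,
# as with total = len(df[title]) in draw_bar_chart.
share_of_all = answers.value_counts() / len(answers) * 100

# Share of given answers only: value_counts() ignores missing values,
# which is what a pie chart built from value_counts() shows.
share_of_answered = answers.value_counts(normalize=True) * 100

print(share_of_all.round(2).to_dict())
print(share_of_answered.round(2).to_dict())
```

When comparing a bar chart and a pie chart from this notebook, it may therefore help to check how many answers are missing in the respective column.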
draw_bar_chart('What is your current position?')
draw_bar_chart('What is your primary area of study?')
title = 'Do you think there is a reproducibility crisis in your field of research?'
draw_pie_chart(title)
# The opinion of participants who selected "Other"
df['Do you think there is a reproducibility crisis in your field of research? [Other]'].value_counts()
partly    2
In general, yes.    1
not a crisis, but there should be more attention given to it    1
paertially    1
i don't know    1
It rather goes with the territory: humans are unpredicatable    1
I would not say crisis, but it can certainly be imroved    1
dependend on the scientific field    1
maybe    1
Crisis is a bad word; better call it issue or room for improvement?    1
Name: Do you think there is a reproducibility crisis in your field of research? [Other], dtype: int64
df1=df.loc[:,'In your experience, what are the factors leading to poor reproducibility? [Lack of sufficient metadata regarding the experiment (e.g. culturing conditions, environmental conditions, software version)]':'In your experience, what are the factors leading to poor reproducibility? [Data privacy (e.g. Data sharing with third parties)]']
column_array=['Lack of sufficient metadata regarding the experiment', 'Lack of data that is publicly available for use', 'Lack of complete information in the Methods/Standard Operating Procedures/Protocols', 'Poor experimental design', 'Lack of resources like equipments/devices in your workplace', 'Lack of the information related to the settings used in original experiment', 'Difficulty in understanding laboratory notebook records', 'Pressure to publish', 'Lack of knowledge or training on reproducible research practices', 'Lack of time to follow reproducible research practices', 'Data privacy']
df1.columns = column_array
title='In your experience, what are the factors leading to poor reproducibility?'
draw_bar_chart_mul_col(column_array, title, df1)
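The grouped bar chart relies on `DataFrame.apply(pd.Series.value_counts)`, which counts the answers in each column and aligns the results on a shared index. A small sketch with a hypothetical two-column frame (the column names and values are illustrative only) shows the table that ends up being plotted:

```python
import pandas as pd

# Hypothetical answers for two sub-questions; None marks a skipped answer.
toy = pd.DataFrame({
    "Findable": ["Often", "Often", "Rarely", None],
    "Accessible": ["Often", "Sometimes", "Sometimes", "Rarely"],
})

# One value_counts() per column; answers that never occur in a column become NaN,
# so they are filled with 0 here to obtain integer counts.
counts = toy.apply(pd.Series.value_counts).fillna(0).astype(int)
print(counts)
```

Each row of `counts` becomes one group on the x-axis and each column one bar colour, which is why the answer options, not the sub-questions, appear as tick labels.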
# What are the other factors leading to poor reproducibility which are not listed in the options according to participants?
df['In your experience, what are the factors leading to poor reproducibility? [Other]'].value_counts()
lack of funding    1
the technic itself    1
ignorance of necessity of data management    1
Lack of mandatory pre-registration of study protocols, not following reporting guidelines, lack of collaboration, lack of automation    1
type of data that cannot be reproducible    1
intrinsic uncertainty    1
Basic midunderstandings of statistics    1
Patents, Copyright, closed access, etc    1
standardised format for article, preventing suffisiant details to be included    1
Lack of statistical understanding    1
Name: In your experience, what are the factors leading to poor reproducibility? [Other], dtype: int64
df1=df.loc[:,'How easy would it be for you to find all the experimental data related to your own project in order to reproduce the results at a later point in time (e.g. 6 months after the original experiment)? [Input Data]':'How easy would it be for you to find all the experimental data related to your own project in order to reproduce the results at a later point in time (e.g. 6 months after the original experiment)? [Results]']
column_array=['Input Data', 'Metadata about the methods', 'Metadata about the steps', 'Metadata about the experimental setup', 'Results']
df1.columns = column_array
title='How easy would it be for you to find all the experimental data related to your own project in order to reproduce the results at a later point in time (e.g. 6 months after the original experiment)?'
draw_bar_chart_mul_col(column_array, title, df1)
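The short column labels in this notebook are typed by hand for every question. As an alternative sketch, the label could be taken from the bracketed suffix of each column name. The helper `short_labels` below is hypothetical and not part of the original notebook:

```python
import re

def short_labels(columns):
    """Return the text inside the trailing [...] of each column name.

    Names without a bracketed suffix are returned unchanged.
    """
    labels = []
    for name in columns:
        match = re.search(r"\[([^\[\]]*)\]\s*$", name)
        labels.append(match.group(1) if match else name)
    return labels

# Column names taken from the survey dataset used in this notebook.
example_columns = [
    "What kind of data do you work primarily with? [Images]",
    "What kind of data do you work primarily with? [Tabular]",
    "Does your research follow the FAIR (Findable, Accessible, Interoperable, Reusable) principles? [Findable]",
    "Total time",
]
print(short_labels(example_columns))
```

Note that the extracted labels keep any parenthetical examples contained in the brackets, whereas some of the hand-written labels shorten them, so the two approaches are not always interchangeable.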
df1 = df.loc[:,'How easy would it be for a newcomer in your workplace to find all the experimental data related to your project/experiment without any/limited instructions from you? [Input Data]':'How easy would it be for a newcomer in your workplace to find all the experimental data related to your project/experiment without any/limited instructions from you? [Results]']
column_array = ['Newcomer Input Data', 'Newcomer Metadata about the methods', 'Newcomer Metadata about the steps', 'Newcomer Metadata about the experimental setup', 'Newcomer Results']
df1.columns = column_array
title='How easy would it be for a newcomer in your workplace to find all the experimental data related to your project/experiment without any/limited instructions from you?'
draw_bar_chart_mul_col(column_array, title, df1)
title = 'Have you ever been unable to reproduce published results of others?'
draw_pie_chart(title)
title = 'Has anybody contacted you that they have a problem in reproducing your published results?'
draw_pie_chart(title)
title = 'Do you repeat your experiments to verify the results?'
draw_pie_chart(title)
df1 = df.loc[:, 'What is your opinion on sharing experimental data? [Raw Data]':'What is your opinion on sharing experimental data? [Text Annotations]']
column_array = ['Raw Data', 'Processed Data', 'Negative Results', 'Measurements', 'Scripts/Code/Program', 'Image Annotations', 'Text Annotations']
df1.columns = column_array
title = 'What is your opinion on sharing experimental data?'
draw_bar_chart_mul_col(column_array, title, df1)
df1 = df.loc[:, 'What is your opinion on sharing metadata regarding experimental requirements? [Experiment Materials]':'What is your opinion on sharing metadata regarding experimental requirements? [Instruments/Devices Used]']
column_array = ['Experiment Materials', 'Instruments/Devices Used']
df1.columns = column_array
title='What is your opinion on sharing metadata regarding experimental requirements?'
draw_bar_chart_mul_col(column_array, title, df1)
df1 = df.loc[:, 'What is your opinion on sharing metadata regarding settings? [Instrument Settings]':'What is your opinion on sharing metadata regarding settings? [Publications used]']
column_array = ['Instrument Settings', 'Experiment Environment Conditions', 'Publications used']
df1.columns = column_array
title='What is your opinion on sharing metadata regarding settings?'
draw_bar_chart_mul_col(column_array, title, df1)
df1 = df.loc[:, 'What is your opinion on sharing metadata regarding time, duration, and the location of experiments? [Date]':'What is your opinion on sharing metadata regarding time, duration, and the location of experiments? [Location]']
column_array=['Date', 'Time', 'Duration', 'Location']
df1.columns = column_array
title='What is your opinion on sharing metadata regarding time, duration, and the location of experiments?'
draw_bar_chart_mul_col(column_array, title, df1)
df1 = df.loc[:, 'What is your opinion on sharing metadata regarding software used? [Software Parameters]':'What is your opinion on sharing metadata regarding software used? [Scripts/Code/Program]']
column_array=['Software Parameters', 'Software Version', 'Software License', 'Scripts/Code/Program']
df1.columns = column_array
title='What is your opinion on sharing metadata regarding software used?'
draw_bar_chart_mul_col(column_array, title, df1)
df1 = df.loc[:, 'What is your opinion on sharing metadata regarding all the steps and plans? [Laboratory Protocols]':'What is your opinion on sharing metadata regarding all the steps and plans? [Quality Control Methods]']
column_array=['Laboratory Protocols', 'Methods', 'Activities/Steps', 'Order of Activities/Steps', 'Validation Methods', 'Quality Control Methods']
df1.columns = column_array
title='What is your opinion on sharing metadata regarding all the steps and plans?'
draw_bar_chart_mul_col(column_array, title, df1)
df1 = df.loc[:, 'What is your opinion on sharing the intermediate and final results of each trial of your experiments? [Final Results]':'What is your opinion on sharing the intermediate and final results of each trial of your experiments? [Intermediate Results]']
column_array=['Final Results', 'Intermediate Results']
df1.columns = column_array
title='What is your opinion on sharing the intermediate and final results of each trial of your experiments?'
draw_bar_chart_mul_col(column_array, title, df1)
df['Please let us know what else should be shared when publishing experimental results.'].value_counts()
A permanent ID for data with correpsonding license    1
supplementary results or negative results which might not be important for the published story may also be shared.    1
Platforms should provide easy access    1
data owner (contact)\nproperty right\n    1
Metadata in a standardised format;\nLicense for data reuse\n    1
The minimum information standards of the respective domains are a good starting point. I am generally in favor of open notebook science which aims to be totally open about everything as soon as the data, planning, etc is done.    1
factors that negatively influence the outcome / working of an experiment    1
hidden parameters for data processing and reasoning for specific choices made for methods, steps, parameters    1
The current academic rewarding system is pushing people into coming up with a nice story which unfortunately is encouraging people to publish their results without properly validating, hiding their negative data, adjusting statistical tests in a way that shows a significant difference and so on. The whole system is broken and has to change. \n\nPre-registration of experimental plans, openly sharing lab notebooks, sharing all versions of the manuscripts along with reviewer's comments and answers to those comments, seperately publishing underlying datasets, codes and methods and therefore not forcing, polishing and hiding data to make a nice story but being open and transparent from the beginning and sharing all elements of research as individual items. To incentivize all these, promotion and hiring criteria should not only look for high impact journal publications but rather these type of efforts.\n\nResearchers typically spent the least effort to explain their materials and methods while that is one of the most important elements for research reproducibility. Dedicated methods repositories that archive not only the experimental procedures and parameters such as protocols.io but also videos of the procedures performed by the researchers would help enormously. \n    1
Protocols used in the study with versions/adjustments made    1
Ethics approval, the systematic review conducted before and alongside the study, the limitations, the contact people in project with long-term access.\nEverything should be shared in an structured findable, accessible, inter-operable, and reusable format (following FAIR guidelines).    1
Computational environment needs to be fully specified, including OS and any software dependencies    1
URL/DOI Links to data in curated repositories; data availability statement    1
Name: Please let us know what else should be shared when publishing experimental results., dtype: int64
df1 = df.loc[:, 'What kind of data do you work primarily with? [Images]':'What kind of data do you work primarily with? [Tabular]']
column_array=['Images', 'Multimedia files (Video, Audio)', 'Measurement Data', 'Graphs', 'Tabular']
df1.columns = column_array
title='What kind of data do you work primarily with?'
draw_bar_chart_mul_col(column_array, title, df1)
df['What kind of data do you work primarily with? [Other]'].value_counts()
Text    2
word files    1
Numbers, a lot of numbers    1
code, scripts, metadata    1
Text data    1
molecular data, notes/textfiles    1
code    1
Model output    1
csv, strucutured data    1
geo-data    1
Name: What kind of data do you work primarily with? [Other], dtype: int64
df1 = df.loc[:, 'Where do you store your experimental data files? [Personal Devices (eg. Computer)]':'Where do you store your experimental data files? [Data Management Platforms]']
column_array=['Personal Devices (eg. Computer)', 'Local Server provided at your workplace', 'Removable Storage Device (eg. USB, Harddisk, CD Drive)', 'Version Controlled Repositories (eg. Github, GitLab, Figshare, Zenodo etc.)', 'Data Management Platforms']
df1.columns = column_array
title='Where do you store your experimental data files?'
draw_bar_chart_mul_col(column_array, title, df1)
# Which are the other places where the participants store their experimental data files?
df['Where do you store your experimental data files? [Other]'].value_counts()
Cloud storage (eg. Google Drive, DropBox, etc.)    1
Cloud storage (dropbox)    1
Dropbox    1
institution's repository    1
as a data manager I don't have own ones    1
public databases like Metabolights    1
Name: Where do you store your experimental data files? [Other], dtype: int64
df1 = df.loc[:, 'Where do you save your experimental metadata like descriptions of experiment, methods, samples used? [Hand written Lab Notebooks]':'Where do you save your experimental metadata like descriptions of experiment, methods, samples used? [Data Management Platforms]']
column_array=['Hand written Lab Notebooks', 'Electronic Notebooks', 'Scientific Data Management Platforms']
df1.columns = column_array
title='Where do you save your experimental metadata like descriptions of experiment, methods, samples used?'
draw_bar_chart_mul_col(column_array, title, df1)
title = 'Do you write scripts or program to perform data analysis at any stage in your experimental workflow?'
draw_pie_chart(title)
title = 'Have you heard about the FAIR (Findable, Accessible, Interoperable, Reusable) principles?'
draw_pie_chart(title)
df1 = df.loc[:, 'Does your research follow the FAIR (Findable, Accessible, Interoperable, Reusable) principles? [Findable]':'Does your research follow the FAIR (Findable, Accessible, Interoperable, Reusable) principles? [Reusable]']
column_array=['Findable', 'Accessible', 'Interoperable', 'Reusable']
df1.columns = column_array
title='Does your research follow the FAIR principles?'
draw_bar_chart_mul_col(column_array, title, df1)
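Because the counts table is indexed by the answer text, the bars appear in whatever order that index happens to have rather than in the order of the answer scale. A possible sketch for enforcing a fixed order is shown below. The list `scale` is an assumption that contains only the three answers visible in the data preview and would need to be extended with the remaining options of the survey:

```python
import pandas as pd

# Assumed answer scale, lowest to highest; extend with the survey's other options.
scale = ["Rarely", "Sometimes", "Often"]

# Hypothetical answers for two of the FAIR sub-questions.
toy = pd.DataFrame({
    "Findable": ["Often", "Often", "Rarely"],
    "Reusable": ["Often", "Sometimes", "Rarely"],
})

# Count per column, then put the rows into the order of the scale.
# reindex() silently drops answers that are not listed in `scale`.
counts = toy.apply(pd.Series.value_counts).reindex(scale).fillna(0).astype(int)
print(counts)
```

Plotting `counts` with `plot(kind="bar")` would then show the groups from "Rarely" to "Often".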
df['Please feel free to provide comments regarding what you think is important to enable understandability and reproducibility of scientific experiments in your field of research.'].value_counts()
The bottleneck for experimental scientists is that FAIR data sharing comes on top of everything else they have to do to generate and analyse the data. They are usually not experts in data handling/storage. The platforms we share our data on are often made by IT experts that do not realize that 'their language' and expertise is not immediately clear to biologists. It thus costs a lot of extra time and energy for experimental biologists to share their data. Moreover, there is still a feeling among my lab scientists that it is unfair that they are forced to share data, but that the one taking the data and doing synthesis projects (yielding high IF papers) never have to go in the lab and do the hard work of getting the data. We discussed this very often. Getting credits for sharing data does not sufficiently resolve this issue for them.    1
Internationally accepted metadata schemata covering all disciplines\nControlled vocabularies covering most of the metadata fields\nAn agreement on file formats    1
It is really important (in the case of hand- written notes) the scientists explains all the abbreviations used in her/ his lab books. It is also very important to keep the same structure of storing the experimental details/ steps (dates, treatments, titles).    1
Data sharing among other lab members is important to understand the reproducibility of the experiments.    1
Follow community/domain conventions. Eg when scripting in a particular language, follow software engineering conventions of that particular language to package up code.    1
As a data manager I cannot really answer questions about the quality of my data, as I manage data of others and don't have own research data. I also cannot say where that data is saved as it always depends on the customer.    1
I think it should be a criteria for research funders to allocate funding for reproducibility of each research and make it a mandatory criteria.    1
Name: Please feel free to provide comments regarding what you think is important to enable understandability and reproducibility of scientific experiments in your field of research., dtype: int64
# yes_crisis are the participants who think there is a reproducibility crisis
yes_crisis = df[df['Do you think there is a reproducibility crisis in your field of research?'] == 'Yes']
# no_crisis are the participants who think there is no reproducibility crisis
no_crisis = df[df['Do you think there is a reproducibility crisis in your field of research?'] == 'No']
# other_crisis are the participants who selected 'Other'
other_crisis = df[df['Do you think there is a reproducibility crisis in your field of research?'] == 'Other']
title = 'What is your current position?'
draw_bar_chart(title, df=yes_crisis)
title = 'What is your current position?'
draw_bar_chart(title, df=no_crisis)
title = 'What is your primary area of study?'
draw_bar_chart(title, df=yes_crisis)
title = 'What is your primary area of study?'
draw_bar_chart(title, df=no_crisis)
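The subgroup charts above filter the data once per answer and plot each subset separately. As a complementary sketch, `pd.crosstab` can tabulate both questions in one step. The toy frame below reuses the first five rows shown in the data preview, so the numbers are illustrative and not survey results:

```python
import pandas as pd

position = "What is your current position?"
crisis = "Do you think there is a reproducibility crisis in your field of research?"

# First five rows of the data preview, used here only as a small example.
toy = pd.DataFrame({
    position: ["PhD Student", "PhD Student", "PostDoc", "Professor", "Professor"],
    crisis: ["Yes", "Yes", "No", "Other", "Yes"],
})

# Absolute counts: one row per position, one column per opinion.
table = pd.crosstab(toy[position], toy[crisis])

# Column-wise percentages: how each opinion group is composed by position.
shares = pd.crosstab(toy[position], toy[crisis], normalize="columns") * 100

print(table)
print(shares.round(2))
```

Applied to the full dataframe, `table.plot(kind="bar")` would put the "Yes", "No" and "Other" groups side by side in a single chart.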
df1 = yes_crisis.loc[:, 'Does your research follow the FAIR (Findable, Accessible, Interoperable, Reusable) principles? [Findable]':'Does your research follow the FAIR (Findable, Accessible, Interoperable, Reusable) principles? [Reusable]']
column_array=['Findable', 'Accessible', 'Interoperable', 'Reusable']
df1.columns = column_array
title='Does your research follow the FAIR principles?'
draw_bar_chart_mul_col(column_array, title, df=df1)
df1 = no_crisis.loc[:, 'Does your research follow the FAIR (Findable, Accessible, Interoperable, Reusable) principles? [Findable]':'Does your research follow the FAIR (Findable, Accessible, Interoperable, Reusable) principles? [Reusable]']
column_array=['Findable', 'Accessible', 'Interoperable', 'Reusable']
df1.columns = column_array
title='Does your research follow the FAIR principles?'
draw_bar_chart_mul_col(column_array, title, df=df1)
df1 = yes_crisis.loc[:,'How easy would it be for you to find all the experimental data related to your own project in order to reproduce the results at a later point in time (e.g. 6 months after the original experiment)? [Input Data]':'How easy would it be for you to find all the experimental data related to your own project in order to reproduce the results at a later point in time (e.g. 6 months after the original experiment)? [Results]']
column_array = ['Input Data', 'Metadata about the methods', 'Metadata about the steps', 'Metadata about the experimental setup', 'Results']
df1.columns = column_array
title = 'How easy would it be for you to find all the experimental data related to your own project in order to reproduce the results at a later point in time (e.g. 6 months after the original experiment)?'
draw_bar_chart_mul_col(column_array, title, df=df1)
df1 = yes_crisis.loc[:,'How easy would it be for a newcomer in your workplace to find all the experimental data related to your project/experiment without any/limited instructions from you? [Input Data]':'How easy would it be for a newcomer in your workplace to find all the experimental data related to your project/experiment without any/limited instructions from you? [Results]']
column_array = ['Newcomer Input Data', 'Newcomer Metadata about the methods', 'Newcomer Metadata about the steps', 'Newcomer Metadata about the experimental setup', 'Newcomer Results']
df1.columns = column_array
title = 'How easy would it be for a newcomer in your workplace to find all the experimental data related to your project/experiment without any/limited instructions from you?'
draw_bar_chart_mul_col(column_array, title, df=df1)
df1 = no_crisis.loc[:,'How easy would it be for you to find all the experimental data related to your own project in order to reproduce the results at a later point in time (e.g. 6 months after the original experiment)? [Input Data]':'How easy would it be for you to find all the experimental data related to your own project in order to reproduce the results at a later point in time (e.g. 6 months after the original experiment)? [Results]']
column_array = ['Input Data', 'Metadata about the methods', 'Metadata about the steps', 'Metadata about the experimental setup', 'Results']
df1.columns = column_array
title = 'How easy would it be for you to find all the experimental data related to your own project in order to reproduce the results at a later point in time (e.g. 6 months after the original experiment)?'
draw_bar_chart_mul_col(column_array, title, df=df1)
df1 = no_crisis.loc[:,'How easy would it be for a newcomer in your workplace to find all the experimental data related to your project/experiment without any/limited instructions from you? [Input Data]':'How easy would it be for a newcomer in your workplace to find all the experimental data related to your project/experiment without any/limited instructions from you? [Results]']
column_array=['Newcomer Input Data', 'Newcomer Metadata about the methods', 'Newcomer Metadata about the steps', 'Newcomer Metadata about the experimental setup', 'Newcomer Results']
df1.columns = column_array
title='How easy would it be for a newcomer in your workplace to find all the experimental data related to your project/experiment without any/limited instructions from you?'
draw_bar_chart_mul_col(column_array, title, df=df1)
title = 'Have you ever been unable to reproduce published results of others?'
draw_pie_chart(title, df=yes_crisis)
title = 'Have you ever been unable to reproduce published results of others?'
draw_pie_chart(title, df=no_crisis)