OpenSAFELY is a secure analytics platform for electronic health records in the NHS, created to deliver urgent research during the global COVID-19 emergency. It is now successfully delivering analyses across more than 58 million patients’ full pseudonymised primary care NHS records.
All our analytic software is open for security review, scientific review, and re-use. OpenSAFELY uses a new model for enhanced security and timely access to data: we don’t transport large volumes of potentially disclosive pseudonymised patient data outside of the secure environments managed by the electronic health record software company; instead, trusted analysts can run large scale computation across near real-time pseudonymised patient records inside the data centre of the electronic health records software company.
This document is intended as a short walkthrough of the OpenSAFELY platform. Please visit docs.opensafely.org for more comprehensive documentation, and opensafely.org for any other info.
The examples in this document are all available in the os-demo-research GitHub repository.
OpenSAFELY maintains extremely high standards for data privacy whilst ensuring complete computational and analytical transparency. As such, there are some technical pre-requisites that users must satisfy to use the platform. These include installing and using git and python, though typically only a narrow set of actions is required and guide-rails are provided. These details can be found in the documentation pages, which also describe an alternative to installing the requirements locally: the Gitpod service. It is recommended that you follow the Getting Started guide for an introduction to the technical requirements and basic workflow before working through this walkthrough, which assumes that the technical set-up has been completed.
For example, `make_chart.R` depends on `process_data.R`, which depends on the study dataset having been extracted. This reduces redundancy by only running scripts that need to be run. The [opensafely command-line interface](https://github.com/opensafely-core/opensafely-cli) is used to run actions defined in the project pipelines, as well as other useful tasks like importing codelists.

For researchers using the platform, the OpenSAFELY workflow for a single study can typically be broken down into the following steps:
These steps should always proceed with frequent git commits and code reviews where appropriate. Steps 2-5 can all be progressed on your local machine without accessing the real data.
It is possible to automatically test that the complete analytical pipeline defined in step 5 can be successfully executed on dummy data, using the `opensafely run` command. This pipeline is also automatically tested against dummy data every time a new version of the repository is saved ("pushed") to GitHub.
As well as your own Python, R or Stata scripts, other non-standard actions are available. For example, it's possible to run a matching routine that extracts a control population matched to the population defined in the study definition, without having to extract all candidate matches into a dataset first.
Let’s start with a simple example.
This example introduces study definitions, expectations and dummy data, and project pipelines. We’re going to use OpenSAFELY to find out how many patients are registered at a TPP practice within each STP (Sustainability and Transformation Partnership) on 1 January 2020.
Go to the research template repository and click **Use this template**. Follow the instructions in the Getting Started guide.
Your newly-created repo will contain a template study definition. Edit this to suit your needs. You can also view or clone the repo for these demo materials where the code is already written and tested.
In this example it's very simple. The entire file, called `study_definition_1_stppop.py`, looks like this:
```python
# LIBRARIES
# cohort extractor
from cohortextractor import StudyDefinition, patients

# set the index date
index_date = "2020-01-01"

# STUDY POPULATION
study = StudyDefinition(
    default_expectations={
        "date": {
            "earliest": index_date,
            "latest": "today",
        },  # date range for simulated dates
        "rate": "uniform",
        "incidence": 1,
    },
    # this line defines the study population
    population=patients.registered_as_of(index_date),
    # this line defines the stp variable we want to extract
    stp=patients.registered_practice_as_of(
        index_date,
        returning="stp_code",
        return_expectations={
            "incidence": 0.99,
            "category": {"ratios": {"STP1": 0.3, "STP2": 0.2, "STP3": 0.5}},
        },
    ),
)
```
Let’s break it down:
```python
from cohortextractor import StudyDefinition, patients
```

This imports `StudyDefinition` and the `patients` module from the OpenSAFELY `cohortextractor` library, which you will have previously installed.
```python
index_date = "2020-01-01"
```

This defines the registration date that we're interested in.
We then use the `StudyDefinition()` function to define the cohort population and the variables we want to extract.
```python
default_expectations={
    "date": {"earliest": index_date, "latest": "today"},  # date range for simulated dates
    "rate": "uniform",
    "incidence": 1,
},
```

This defines the default expectations, which are used to generate dummy data. It says that we expect date variables to be uniformly distributed between the index date and today's date (for this study we're not producing any dates). An overview of all the options available for dummy data can be found in the documentation.
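To make the expectations concrete, here is a minimal, hypothetical sketch of how a `"uniform"` date expectation could be turned into simulated dates. The function name `draw_dummy_date` and the fixed `latest` date are illustrative only; the real generator lives inside `cohortextractor` and handles `rate` and `incidence` more fully.

```python
import random
from datetime import date, timedelta

# Illustrative expectations: a fixed "latest" date is used here instead of
# "today" so the example is deterministic.
expectations = {
    "date": {"earliest": "2020-01-01", "latest": "2020-12-31"},
    "rate": "uniform",
    "incidence": 1,
}

def draw_dummy_date(spec, rng=random.Random(0)):
    earliest = date.fromisoformat(spec["date"]["earliest"])
    latest = date.fromisoformat(spec["date"]["latest"])
    span = (latest - earliest).days
    # "uniform" rate: every day in the range is equally likely
    return earliest + timedelta(days=rng.randrange(span + 1))

sample = [draw_dummy_date(expectations) for _ in range(5)]
```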
```python
population=patients.registered_as_of(index_date),
```

This says that we want to extract information only for patients who were registered at a practice on 1 January 2020. There will be one row for each of these patients in the extracted dataset. Note that `population` is a reserved variable name for `StudyDefinition` which specifies the study population; we don't have to do any additional filtering/subsetting on this variable.
```python
stp=patients.registered_practice_as_of(
    index_date,
    returning="stp_code",
    return_expectations={
        "incidence": 0.99,
        "category": {"ratios": {"STP1": 0.3, "STP2": 0.2, "STP3": 0.5}},
    },
),
```
This says we want to extract the STP for each patient (or, more strictly, the STP of each patient's practice). Here we also use the `return_expectations` argument, which specifies how the `stp` variable will be distributed in the dummy data. The `"incidence": 0.99` line says that we expect an STP code to be available for 99% of patients. The `"category": {"ratios": {"STP1": 0.3, "STP2": 0.2, "STP3": 0.5}}` line says that the dummy STP variable will take the values `STP1`, `STP2`, and `STP3`, randomly generated in the proportions 0.3, 0.2, and 0.5 respectively.
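The behaviour these expectations describe can be sketched in plain Python. This is an illustration only, not the `cohortextractor` implementation:

```python
import random
from collections import Counter

ratios = {"STP1": 0.3, "STP2": 0.2, "STP3": 0.5}
incidence = 0.99  # an STP code is present for roughly 99% of dummy patients

rng = random.Random(42)

def draw_dummy_stp():
    # with probability 1 - incidence, the value is missing entirely
    if rng.random() >= incidence:
        return None
    # otherwise pick a category weighted by the declared ratios
    return rng.choices(list(ratios), weights=list(ratios.values()), k=1)[0]

dummy_stps = [draw_dummy_stp() for _ in range(10_000)]
shares = Counter(dummy_stps)  # counts should roughly match the ratios
```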
This study definition uses two built-in variable functions from the OpenSAFELY `cohortextractor`'s `patients` module: `patients.registered_as_of()` and `patients.registered_practice_as_of()`. There are many more such functions, like `patients.age()`, `patients.with_these_clinical_events()`, and `patients.admitted_to_icu()`, which are listed in OpenSAFELY's documentation.
Note that more realistic STP dummy values are possible. For example, by creating a table of realistic STP values and their proportions and saving it as a python dictionary, we can import it into the study definition using `from dictionaries import dict_stp` and then use `"category": {"ratios": dict_stp}` in the `return_expectations` argument. See `study_definition_1_stppop_map.py` for an example of this.
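For illustration, such a dictionaries module might look like the sketch below. The STP codes and proportions here are invented; the practical requirement is that the ratios sum to 1:

```python
# Hypothetical dictionaries.py: the codes and proportions are made up
dict_stp = {
    "E54000005": 0.25,
    "E54000006": 0.35,
    "E54000007": 0.40,
}

# The ratios passed to return_expectations should sum to 1
total = sum(dict_stp.values())
assert abs(total - 1.0) < 1e-9

# It can then be used in a study definition as:
#   "category": {"ratios": dict_stp}
```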
Now that we’ve defined the study cohort in the study_definition_1_stppop.py
python script, we can generate a dummy dataset. Assuming you have the correct technical set-up on your local machine (or you are using Gitpod), this involves two steps:
project.yaml
file, andopensafely
command line module.The template repository provides a working project.yaml
file to get started, or you can use the project.yaml
available in the repo for these demo materials. In brief, the project.yaml
file specifies the execution order for data extracts, data analyses, and the outputs that will be released. This is best demonstrated with an example:
```yaml
version: '3.0'

expectations:
  population_size: 10000

actions:

  generate_cohort_stppop:
    run: cohortextractor:latest generate_cohort --study-definition study_definition_1_stppop --output-dir=output/cohorts
    outputs:
      highly_sensitive:
        cohort: output/cohorts/input_1_stppop.csv

  plot_stppop:
    run: r:latest analysis/1-plot-stppop.R
    needs: [generate_cohort_stppop]
    outputs:
      moderately_sensitive:
        figure1: output/plots/plot_stppop_bar.png
```
The file begins with some housekeeping that specifies which `project.yaml` format version to use and the size of the dummy dataset used for testing.
The main part of the yaml file is the `actions` section. In this example we have two actions:

- `generate_cohort_stppop` is the action to extract the dataset from the database. The `run` section gives the extraction command (`generate_cohort`), which study definition to use (`--study-definition study_definition_1_stppop`), and where to put the output (`--output-dir=output/cohorts`). The `outputs` section declares that the expected output is `highly_sensitive` and shouldn't be considered for release.
- `plot_stppop` is the action that runs the R script (defined below). The `run` section tells it which script to run. The `needs` section says that this action cannot be run without having first run the `generate_cohort_stppop` action. The `outputs` section declares that the expected output (a plot) is `moderately_sensitive` and should be considered for release after disclosivity checks.

To run the first action, open a command line terminal that can find the `opensafely` module, and run

```
opensafely run generate_cohort_stppop
```
This will create a file `input_1_stppop.csv` in the `/output/cohorts/` folder with 10000 rows. It will also create log files in the `metadata/` folder, for debugging. Security-wise, this log file is treated as `moderately_sensitive`.
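The `needs` entries in `project.yaml` form a dependency graph, and actions must execute in an order that respects it. A minimal sketch of how such an ordering can be derived (illustrative only; the real `opensafely` CLI also skips actions whose outputs are already up to date):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# each action maps to the actions it `needs`, as in the project.yaml above
actions = {
    "generate_cohort_stppop": [],
    "plot_stppop": ["generate_cohort_stppop"],
}

# a valid execution order: dependencies always come first
order = list(TopologicalSorter(actions).static_order())
```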
Note that the population argument is not encompassed in the dummy data, as explained in the documentation.
We now have a dummy dataset, so we can begin to develop and test our analysis code. For this example, we just want to import the dataset, count the number of registered patients in each STP, and output to a file. These steps are outlined in the code blocks below, and are available in the file `/analysis/1-plot-stppop.R`.
First we import and process the data:
```r
## import libraries
library('tidyverse')

## import data
df_input <- read_csv(
  here::here("output", "cohorts", "input_1_stppop.csv"),
  col_types = cols(
    patient_id = col_integer(),
    stp = col_character()
  )
)

df_stppop <- df_input %>% count(stp, name = 'registered')
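The same tabulation as the `count()` call above can be sketched in Python on a few invented rows, to show what this step produces:

```python
import csv
import io
from collections import Counter

# invented dummy extract, mirroring the shape of input_1_stppop.csv
dummy_csv = """patient_id,stp
1,STP1
2,STP3
3,STP1
4,STP2
5,STP3
"""

rows = csv.DictReader(io.StringIO(dummy_csv))
registered = Counter(row["stp"] for row in rows)  # patients per STP
```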
Then create a bar chart of counts:
```r
plot_stppop_bar <- df_stppop %>%
  mutate(
    name = forcats::fct_reorder(stp, registered, median, .desc = FALSE)
  ) %>%
  ggplot() +
  geom_col(aes(x = registered / 1000000, y = name, fill = registered), colour = 'black') +
  scale_fill_gradient(limits = c(0, NA), low = "white", high = "blue", guide = FALSE) +
  labs(
    title = "TPP-registered patients per STP",
    subtitle = "as at 1 January 2020",
    y = NULL,
    x = "Registered patients\n(million)",
    fill = NULL
  ) +
  theme_minimal() +
  theme(
    panel.grid.major.y = element_blank(),
    panel.grid.minor.y = element_blank(),
    plot.title.position = "plot",
    plot.caption.position = "plot"
  )

plot_stppop_bar
```
To save the chart to the output folder, add the following to the end of the file:
```r
# create directory where output will be saved
dir.create(here::here("output", "plots"))

ggsave(
  plot = plot_stppop_bar,
  filename = "plot_stppop_bar.png",
  path = here::here("output", "plots"),
  units = "cm",
  height = 15,
  width = 15
)
```
Note that if we want to put the output plot in a `plots` subfolder, we need to ensure it exists with `dir.create(here::here("output", "plots"))`. When the action is run with the `opensafely run` command, it will ignore any files that match expected outputs and any empty folders. An error similar to `Error in grid.newpage(): could not open file '/workspace/output/plots/plot_stppop_bar.png'` is likely due to a missing or ignored output folder.
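The same "create the folder before writing to it" pattern, sketched in Python for comparison. `os.makedirs` with `exist_ok=True` plays the role of R's `dir.create()`; the temporary directory stands in for the repository root:

```python
import os
import tempfile

workdir = tempfile.mkdtemp()  # stand-in for the repository root
plots_dir = os.path.join(workdir, "output", "plots")

# exist_ok=True makes this safe to run repeatedly
os.makedirs(plots_dir, exist_ok=True)

# placeholder for the saved figure
with open(os.path.join(plots_dir, "plot_stppop_bar.png"), "wb") as f:
    f.write(b"")
```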
To run this script, run the `plot_stppop` action:

```
opensafely run plot_stppop
```
Now that we have a working analysis script that outputs what we want, we can run the analysis for real via the OpenSAFELY Job Server.
To run one or more actions defined in the `project.yaml` against real data, go to the Job Server at jobs.opensafely.org. Here you can see (even without a login) all the ongoing projects within OpenSAFELY, and the specific jobs that have been run on the server. Note, however, that to run analyses inside the OpenSAFELY secure environment, you need to be an approved user.
Full details on submitting jobs can be found on the OpenSAFELY documentation pages. Following these instructions will create the required outputs on the server, which can be reviewed there and released via GitHub if they are deemed non-disclosive (there are detailed reviewing guidelines for approved researchers).
This is a manual step that must be carried out entirely on the server. Instructions for this process are available for individuals with appropriate permissions. Essentially, this step ensures that the outputs do not contain any potentially disclosive information due to small patient counts.
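The kind of small-number rule described above can be sketched as follows. The threshold and rounding base here are invented for illustration and are not OpenSAFELY policy:

```python
SUPPRESS_BELOW = 8  # example threshold, not OpenSAFELY policy
ROUND_TO = 5        # example rounding base

def redact(count):
    # counts below the threshold are suppressed entirely
    if count < SUPPRESS_BELOW:
        return None
    # remaining counts are rounded to blur exact values
    return ROUND_TO * round(count / ROUND_TO)

counts = {"STP1": 3, "STP2": 1204, "STP3": 57}
safe_counts = {k: redact(v) for k, v in counts.items()}
```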
Once results have been checked, they can be released via GitHub. This step must also occur on the server. The released outputs are put in the `/released-output` folder.
In this example, the bar chart looks like this after running on real data:
Here we're just linking to the released `.png` files in the repo (clicking the image takes you there). We could have also released the underlying STP count data and developed an analysis script directly on this dataset, provided it is non-disclosive. However, typically only the most high-level aggregated datasets are suitable for public release.
We could also create something more sophisticated like a map, and include full STP names instead of codes. This requires a shape file, and to run successfully locally it would also need matching dummy data. See the `study_definition_1_stppop_map.py` study definition and the `analysis/1-plot-stppop_map.R` R script to see how this would work.
The outputs look like this: