OpenSAFELY is a secure analytics platform for electronic health records in the NHS, created to deliver urgent research during the global COVID-19 emergency. It is now successfully delivering analyses across more than 58 million patients’ full pseudonymised primary care NHS records.
All our analytic software is open for security review, scientific review, and re-use. OpenSAFELY uses a new model for enhanced security and timely access to data: we don’t transport large volumes of potentially disclosive pseudonymised patient data outside of the secure environments managed by the electronic health record software company; instead, trusted analysts can run large scale computation across near real-time pseudonymised patient records inside the data centre of the electronic health records software company.
This document is intended as a short walkthrough of the OpenSAFELY platform. Please visit docs.opensafely.org for more comprehensive documentation, and opensafely.org for any other info.
The examples in this document are all available in the os-demo-research GitHub repository.
OpenSAFELY maintains extremely high standards for data privacy whilst ensuring complete computational and analytical transparency. As such, there are some technical prerequisites that users must satisfy to use the platform. These include installing and using Git and Python, though typically only a narrow set of actions is required, and guide-rails are provided. These details can be found in the documentation pages, which also describe an alternative to installing the requirements locally, using the Gitpod service. It is recommended that you follow the Getting Started guide for an introduction to the technical requirements and basic workflow before working through this walkthrough, which assumes that the technical set-up has been completed.
For example, make_chart.R depends on process_data.R, which depends on the study dataset having been extracted. This reduces redundancy by only running scripts that need to be run.

The opensafely command-line interface (https://github.com/opensafely-core/opensafely-cli) is used to run actions defined in the project pipeline, as well as other useful tasks like importing codelists.

For researchers using the platform, the OpenSAFELY workflow for a single study can typically be broken down into the following steps:
These steps should always proceed with frequent git commits and code reviews where appropriate. Steps 2-5 can all be progressed on your local machine without accessing the real data.
It is possible to automatically test that the complete analytical pipeline defined in step 5 can be successfully executed on dummy data, using the opensafely run command. This pipeline is also automatically tested against dummy data every time a new version of the repository is saved (“pushed”) to GitHub.
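The dependency-based ordering described above can be sketched in Python. This is an illustrative toy only, not the actual opensafely-cli implementation; the action names come from the make_chart.R / process_data.R example earlier.

```python
# Toy sketch of needs-based action ordering (NOT the real opensafely-cli code):
# each action lists the actions it depends on, and dependencies run first.
actions = {
    "generate_cohort": [],
    "process_data": ["generate_cohort"],
    "make_chart": ["process_data"],
}

def run_order(target, actions, order=None):
    """Depth-first resolution: schedule dependencies before the action itself."""
    order = [] if order is None else order
    for dep in actions[target]:
        run_order(dep, actions, order)
    if target not in order:
        order.append(target)
    return order

print(run_order("make_chart", actions))
# → ['generate_cohort', 'process_data', 'make_chart']
```

Because each action's outputs are cached, a real run would skip any action in this list whose inputs have not changed.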
As well as your own Python, R or Stata scripts, other non-standard actions are available. For example, it’s possible to run a matching routine that extracts a matched control population to the population defined in the study definition, without having to extract all candidate matches into a dataset first.
Let’s start with a simple example.
This example introduces study definitions, expectations and dummy data, and project pipelines. We’re going to use OpenSAFELY to find out how many patients are registered at a TPP practice within each STP (Sustainability and Transformation Partnership) on 1 January 2020.
Go to the research template repository and click Use this template. Follow the instructions in the Getting Started guide.
Your newly-created repo will contain a template study definition. Edit this to suit your needs. You can also view or clone the repo for these demo materials where the code is already written and tested.
In this example, it’s very simple — the entire file, called study_definition_1_stppop.py, looks like this:
# LIBRARIES
# cohort extractor
from cohortextractor import StudyDefinition, patients

# set the index date
index_date = "2020-01-01"

# STUDY POPULATION
study = StudyDefinition(
    default_expectations={
        "date": {
            "earliest": index_date,
            "latest": "today",
        },  # date range for simulated dates
        "rate": "uniform",
        "incidence": 1,
    },
    # This line defines the study population
    population=patients.registered_as_of(index_date),
    # this line defines the stp variable we want to extract
    stp=patients.registered_practice_as_of(
        index_date,
        returning="stp_code",
        return_expectations={
            "incidence": 0.99,
            "category": {"ratios": {"STP1": 0.3, "STP2": 0.2, "STP3": 0.5}},
        },
    ),
)
Let’s break it down:
from cohortextractor import StudyDefinition, patients
This imports the required functions from the OpenSAFELY cohortextractor library, which you will have previously installed.
index_date = "2020-01-01"
This defines the registration date that we’re interested in.
We then use the StudyDefinition() function to define the cohort population and the variables we want to extract.
default_expectations={
    "date": {"earliest": index_date, "latest": "today"},  # date range for simulated dates
    "rate": "uniform",
    "incidence": 1
},
This defines the default expectations, which are used to generate dummy data. It says that we expect date variables to be uniformly distributed between the index date and today’s date. For this study we’re not producing any dates. An overview of all the options available for dummy data can be found in the documentation.
population=patients.registered_as_of(index_date)
This says that we want to extract information only for patients who were registered at a practice on 1 January 2020. There will be one row for each of these patients in the extracted dataset. Note that population is a reserved variable name for StudyDefinition which specifies the study population — we don’t have to do any additional filtering/subsetting on this variable.
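For intuition, the extracted dataset has exactly this shape: one row per patient in the defined population, and one column per variable in the study definition. The two rows below are hypothetical, for illustration only.

```python
# Hypothetical sketch of the extracted dataset's shape: one row per
# registered patient, with patient_id plus each defined variable (here, stp).
extract = [
    {"patient_id": 1, "stp": "STP1"},
    {"patient_id": 2, "stp": "STP3"},
]

# Every row carries exactly the variables defined in the study definition
assert all(set(row) == {"patient_id", "stp"} for row in extract)
print(len(extract))  # → 2 (one row per patient in the population)
```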
stp=patients.registered_practice_as_of(
    index_date,
    returning="stp_code",
    return_expectations={
        "incidence": 0.99,
        "category": {"ratios": {"STP1": 0.3, "STP2": 0.2, "STP3": 0.5}},
    },
)
This says we want to extract the STP for each patient (or, more strictly, the STP of each patient’s practice). Here we also use the return_expectations argument, which specifies how the stp variable will be distributed in the dummy data. The "incidence": 0.99 line says that we expect an STP code to be available for 99% of patients. The "category": {"ratios": {"STP1": 0.3, "STP2": 0.2, "STP3": 0.5}} line says that the STP dummy variable will have values STP1, STP2, and STP3 that are randomly generated in proportions 0.3, 0.2, and 0.5 respectively.
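To see how category ratios translate into dummy values, here is a minimal sketch. This is illustrative only; cohortextractor’s own dummy-data generator is more sophisticated than a weighted random draw.

```python
import random

# Illustrative only: sample dummy STP values in the stated proportions
ratios = {"STP1": 0.3, "STP2": 0.2, "STP3": 0.5}

def dummy_stp_values(n, ratios, seed=0):
    # Weighted sampling with a fixed seed for reproducibility
    rng = random.Random(seed)
    return rng.choices(list(ratios), weights=ratios.values(), k=n)

values = dummy_stp_values(10000, ratios)
shares = {stp: values.count(stp) / len(values) for stp in ratios}
print(shares)  # approximately {'STP1': 0.3, 'STP2': 0.2, 'STP3': 0.5}
```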
This study definition uses two in-built variable functions in the OpenSAFELY cohortextractor’s patients module: patients.registered_as_of() and patients.registered_practice_as_of(). There are many more such functions, like patients.age(), patients.with_these_clinical_events(), and patients.admitted_to_icu(), which are listed in OpenSAFELY’s documentation.
Note that more realistic STP dummy values are possible. For example, by creating a table of realistic STP values and their proportions and saving it as a Python dictionary, we can import it into the study definition using from dictionaries import dict_stp and then use "category": {"ratios": dict_stp} in the return_expectations argument. See study_definition_1_stppop_map.py for an example of this.
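A dictionaries.py module along those lines might look like this. The STP codes and proportions below are made up for illustration; the real dict_stp in the demo repo holds actual values.

```python
# dictionaries.py -- hypothetical contents, for illustration only.
# Maps STP codes to the proportion of patients expected in each;
# the category ratios passed to return_expectations must sum to 1.
dict_stp = {
    "E54000005": 0.25,  # illustrative ONS-style STP codes
    "E54000006": 0.35,
    "E54000007": 0.40,
}

# Sanity check: ratios must sum to 1
assert abs(sum(dict_stp.values()) - 1.0) < 1e-9
```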
Now that we’ve defined the study cohort in the study_definition_1_stppop.py Python script, we can generate a dummy dataset. Assuming you have the correct technical set-up on your local machine (or you are using Gitpod), this involves two steps:

1. defining the actions in the project.yaml file, and
2. running the actions using the opensafely command line module.

The template repository provides a working project.yaml file to get started, or you can use the project.yaml available in the repo for these demo materials. In brief, the project.yaml file specifies the execution order for data extracts, data analyses, and the outputs that will be released. This is best demonstrated with an example:
version: '3.0'

expectations:
  population_size: 10000

actions:

  generate_cohort_stppop:
    run: cohortextractor:latest generate_cohort --study-definition study_definition_1_stppop --output-dir=output/cohorts
    outputs:
      highly_sensitive:
        cohort: output/cohorts/input_1_stppop.csv

  plot_stppop:
    run: r:latest analysis/1-plot-stppop.R
    needs: [generate_cohort_stppop]
    outputs:
      moderately_sensitive:
        figure1: output/plots/plot_stppop_bar.png
The file begins with some housekeeping that specifies which version to use and the size of the dummy dataset used for testing.
The main part of the yaml file is the actions section. In this example we have two actions:

- generate_cohort_stppop is the action to extract the dataset from the database. The run section gives the extraction command (generate_cohort), which study definition to use (--study-definition study_definition_1_stppop), and where to put the output (--output-dir=output/cohorts). The outputs section declares that the expected output is highly_sensitive and shouldn’t be considered for release.
- plot_stppop is the action that runs the R script (defined below). The run section tells it which script to run. The needs section says that this action cannot be run without having first run the generate_cohort_stppop action. The outputs section declares that the expected output (a plot) is moderately_sensitive and should be considered for release after disclosivity checks.

To run the first action, open a command line terminal that can find the opensafely module, and run:
opensafely run generate_cohort_stppop
This will create a file input_1_stppop.csv in the /output/cohorts/ folder with 10000 rows. It will also create log files in the metadata/ folder, for debugging. Security-wise, this log file is treated as moderately_sensitive.
Note that the population argument is not taken into account when generating the dummy data, as explained in the documentation.
We now have a dummy dataset, so we can begin to develop and test our analysis code. For this example, we just want to import the dataset, count the registered patients in each STP, and output the result to a file. These steps are outlined in the code blocks below, and are available in the file /analysis/1-plot-stppop.R.
First we import and process the data:
## import libraries
library('tidyverse')

## import data
df_input <- read_csv(
  here::here("output", "cohorts", "input_1_stppop.csv"),
  col_types = cols(
    patient_id = col_integer(),
    stp = col_character()
  )
)

df_stppop = df_input %>% count(stp, name='registered')
Then create a bar chart of counts:
plot_stppop_bar <- df_stppop %>%
  mutate(
    name = forcats::fct_reorder(stp, registered, median, .desc=FALSE)
  ) %>%
  ggplot() +
  geom_col(aes(x=registered/1000000, y=name, fill=registered), colour='black') +
  scale_fill_gradient(limits = c(0,NA), low="white", high="blue", guide=FALSE) +
  labs(
    title="TPP-registered patients per STP",
    subtitle= "as at 1 January 2020",
    y=NULL,
    x="Registered patients\n(million)",
    fill = NULL) +
  theme_minimal() +
  theme(
    panel.grid.major.y = element_blank(),
    panel.grid.minor.y = element_blank(),
    plot.title.position = "plot",
    plot.caption.position = "plot"
  )
plot_stppop_bar
To save the chart to the output folder, add the following to the end of the file:
# create directory where output will be saved
dir.create(here::here("output", "plots"))

ggsave(
  plot = plot_stppop_bar,
  filename = "plot_stppop_bar.png", path = here::here("output", "plots"),
  units = "cm",
  height = 15,
  width = 15
)
Note that if we want to put the output plot in a plots subfolder, we need to ensure it exists with dir.create(here::here("output", "plots")). When the action is run with the opensafely run command, it will ignore any files that match expected outputs and any empty folders. An error similar to Error in grid.newpage(): could not open file '/workspace/output/plots/plot_stppop_bar.png' is likely due to a missing or ignored output folder.
To run this script, run the plot_stppop action:
opensafely run plot_stppop
Now that we have a working analysis script that outputs what we want, we can run the analysis for real via the OpenSAFELY Job Server.
To run one or more actions defined in the project.yaml against real data, go to the Job Server at jobs.opensafely.org. Here you can see (even without a login) all the ongoing projects within OpenSAFELY, and the specific jobs that have been run on the server. Note however that to run analyses inside the OpenSAFELY secure environment, you need to be an approved user.
Full details on submitting jobs can be found on the OpenSAFELY documentation pages. Following these instructions will create the required outputs on the server, where they can be reviewed and then released via GitHub if they are deemed non-disclosive (there are detailed reviewing guidelines for approved researchers).
This is a manual step that must be carried out entirely on the server. Instructions for this process are available for individuals with appropriate permissions. Essentially, this step ensures that the outputs do not contain any potentially disclosive information due to small patient counts.
Once results have been checked, they can be released via GitHub. This step must also occur on the server. The released outputs are put in the /released-output folder.
In this example, the bar chart looks like this after running on real data:
Here we’re just linking to the released .png files in the repo (clicking the image takes you there). We could have also released the underlying STP count data and developed an analysis script directly on this dataset, provided it is non-disclosive. However, typically only the most high-level aggregated datasets are suitable for public release.
We could also create something more sophisticated, like a map, and include full STP names instead of codes. This requires a shape file and, to run successfully locally, would also need matching dummy data. See the study_definition_1_stppop_map.py study definition and the analysis/1-plot-stppop_map.R R script to see how this would work.
The outputs look like this:
This example introduces codelists. We’re going to use OpenSAFELY to look at the frequency of covid-related deaths compared with non-covid deaths between 1 January 2020 and 30 September 2020, and see how this differs by age and sex. For this, we’ll need death dates for anyone in the database who died during the period of interest, and we’ll need a way to identify whether these deaths were covid-related or not.
The study definition for this task is available at /analysis/study_definition_2_deaths.py, and is shown below.
# LIBRARIES
# cohort extractor
from cohortextractor import StudyDefinition, codelist_from_csv, patients

# CODELISTS
# All codelists are held within the codelists/ folder.
codes_ICD10_covid = codelist_from_csv(
    "codelists/opensafely-covid-identification.csv", system="icd10", column="icd10_code"
)

# STUDY POPULATION
# Defines both the study population and points to the important covariates
index_date = "2020-01-01"
end_date = "2020-09-30"

study = StudyDefinition(
    # Configure the expectations framework
    default_expectations={
        "date": {"earliest": index_date, "latest": end_date},
        "rate": "uniform",
    },
    index_date=index_date,
    # This line defines the study population
    population=patients.satisfying(
        """
        (sex = 'F' OR sex = 'M') AND
        (age >= 18 AND age < 120) AND
        (NOT died) AND
        (registered)
        """,
        registered=patients.registered_as_of(index_date),
        died=patients.died_from_any_cause(
            on_or_before=index_date,
            returning="binary_flag",
        ),
    ),
    age=patients.age_as_of(
        index_date,
        return_expectations={
            "int": {"distribution": "population_ages"},
            "incidence": 1,
        },
    ),
    sex=patients.sex(
        return_expectations={
            "category": {"ratios": {"M": 0.49, "F": 0.51}},
            "incidence": 1,
        }
    ),
    date_death=patients.died_from_any_cause(
        between=[index_date, end_date],
        returning="date_of_death",
        date_format="YYYY-MM-DD",
        return_expectations={
            "incidence": 0.2,
        },
    ),
    death_category=patients.categorised_as(
        {
            "covid-death": "died_covid",
            "non-covid-death": "(NOT died_covid) AND died_any",
            "alive": "DEFAULT",
        },
        died_covid=patients.with_these_codes_on_death_certificate(
            codes_ICD10_covid,
            returning="binary_flag",
            match_only_underlying_cause=False,
            between=[index_date, end_date],
        ),
        died_any=patients.died_from_any_cause(
            between=[index_date, end_date],
            returning="binary_flag",
        ),
        return_expectations={
            "category": {
                "ratios": {"alive": 0.8, "covid-death": 0.1, "non-covid-death": 0.1}
            },
            "incidence": 1,
        },
    ),
)
As before, we first import libraries and dictionaries. This time we also import the codelist for identifying covid-related deaths. This uses data from death certificates, which are coded using ICD-10 codes. The covid codes in this system are U071 and U072, and have been collected in a codelist at https://www.opencodelists.org/codelist/opensafely/covid-identification/2020-06-03/. To import the codelist, put the codelist path (opensafely/covid-identification/2020-06-03) in the codelists/codelists.txt file in the repo, then run the following command:
opensafely codelists update
This will create a .csv file for each codelist, which is imported into the study definition using:
codes_ICD10_covid = codelist_from_csv(
    "codelists/opensafely-covid-identification.csv",
    system="icd10",
    column="icd10_code"
)
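For illustration, the downloaded codelist file is just a small CSV whose header matches the column argument above. Here is a sketch of reading one with the standard library; the rows are the two ICD-10 covid codes mentioned in the text, and the description column is illustrative.

```python
import csv
import io

# Minimal stand-in for codelists/opensafely-covid-identification.csv
# (the description column is illustrative; only icd10_code is used above)
codelist_csv = (
    "icd10_code,description\n"
    "U071,COVID-19 virus identified\n"
    "U072,COVID-19 virus not identified\n"
)

reader = csv.DictReader(io.StringIO(codelist_csv))
codes = [row["icd10_code"] for row in reader]
print(codes)  # → ['U071', 'U072']
```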
Then, as before, we define the cohort population and the variables we want to extract within a study definition. Here we utilise satisfying() to define a population who meet a range of criteria.
...
population=patients.satisfying(
    """
    (sex = 'F' OR sex = 'M') AND
    (age >= 18 AND age < 120) AND
    (NOT died) AND
    (registered)
    """,
    registered=patients.registered_as_of(index_date),
    died=patients.died_from_any_cause(
        on_or_before=index_date,
        returning="binary_flag",
    ),
),
The first argument states the series of conditions patients must satisfy to be included. After this, the variables in the condition statement can be defined. Note that these variables can either be defined under this statement (as with registered and died) or elsewhere in the study definition.

Here, registered is used to select patients registered as of the index date, and died is used to select any patients that died on or before the index date. As for many in-built variable functions, patients.died_from_any_cause() has multiple options for how the variable should be returned, defined by returning. Here we are interested in whether the patient has died or not, so we use the option "binary_flag".
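The inclusion logic of satisfying() can be sketched in plain Python. This is illustrative only; in reality cohortextractor translates the expression into a database query.

```python
def in_study_population(sex, age, died, registered):
    # Mirrors the satisfying() expression in the study definition above
    return bool(
        sex in ("F", "M")
        and 18 <= age < 120
        and not died
        and registered
    )

print(in_study_population(sex="F", age=50, died=False, registered=True))  # → True
print(in_study_population(sex="M", age=50, died=True, registered=True))   # → False
```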
As before, we then define a cohortextractor action in the project.yaml file, and run it using opensafely run. The other variables extracted are sex, date_death, and death_category. These use some new variable functions and some more expectation definitions, whose details can be found in the Study Definition variable reference.
If you are defining a population satisfying a list of criteria, as here, you will likely wish to produce an inclusion/exclusion flowchart.
Once again, we use the dummy data to develop the analysis script, then create a project.yaml to run the entire study on the real data. The analysis script is available in the file /analysis/2-plot-deaths.R. This script is run by the job server via the project.yaml file, then the outputs are reviewed on the server and released via GitHub. The final graph looks like this:
Here you can see the familiar covid mortality bump during the first wave of the pandemic. There is also a bump in non-covid deaths, suggesting that identification of covid-related deaths may not be 100% sensitive, or that health services struggled during this period, or that people were not seeking the care they needed.
In our final example, we introduce the Measures framework. This enables the extraction of multiple study cohorts, each covering a different time period, and the calculation of a set of statistics for each period.
We’ll look at the frequency of cholesterol and INR (International Normalised Ratio, which measures how long it takes for blood to clot) measurements recorded in the Primary Care record, by practice and by STP.
The entire study definition is available at /analysis/study_definition_3_activity.py, and is shown below.
# LIBRARIES
# cohort extractor
from cohortextractor import Measure, StudyDefinition, codelist_from_csv, patients

# dictionary of STP codes (for dummy data)
from dictionaries import dict_stp

# CODELISTS
# All codelists are held within the codelists-local/ folder.
codes_cholesterol = codelist_from_csv(
    "codelists-local/cholesterol-measurement.csv", system="ctv3", column="id"
)
codes_inr = codelist_from_csv(
    "codelists-local/international-normalised-ratio-measurement.csv",
    system="ctv3",
    column="id",
)

# STUDY POPULATION
index_date = "2020-01-01"

study = StudyDefinition(
    # Configure the expectations framework
    default_expectations={
        "date": {"earliest": index_date, "latest": "today"},
        "rate": "uniform",
        "incidence": 1,
    },
    index_date=index_date,
    # This line defines the study population
    population=patients.satisfying(
        """
        (age >= 18 AND age < 120) AND
        (NOT died) AND
        (registered)
        """,
        died=patients.died_from_any_cause(
            on_or_before=index_date, returning="binary_flag"
        ),
        registered=patients.registered_as_of(index_date),
        age=patients.age_as_of(index_date),
    ),
    # geographic/administrative groups
    practice=patients.registered_practice_as_of(
        index_date,
        returning="pseudo_id",
        return_expectations={
            "int": {"distribution": "normal", "mean": 100, "stddev": 20}
        },
    ),
    stp=patients.registered_practice_as_of(
        index_date,
        returning="stp_code",
        return_expectations={
            "category": {"ratios": dict_stp},
        },
    ),
    cholesterol=patients.with_these_clinical_events(
        codes_cholesterol,
        returning="number_of_episodes",
        between=["index_date", "index_date + 1 month"],
        return_expectations={
            "int": {"distribution": "normal", "mean": 2, "stddev": 0.5}
        },
    ),
    inr=patients.with_these_clinical_events(
        codes_inr,
        returning="number_of_episodes",
        between=["index_date", "index_date + 1 month"],
        return_expectations={
            "int": {"distribution": "normal", "mean": 3, "stddev": 0.5}
        },
    ),
)

measures = [
    Measure(
        id="cholesterol_practice",
        numerator="cholesterol",
        denominator="population",
        group_by="practice",
    ),
    Measure(
        id="cholesterol_stp",
        numerator="cholesterol",
        denominator="population",
        group_by="stp",
    ),
    Measure(
        id="inr_practice",
        numerator="inr",
        denominator="population",
        group_by="practice",
    ),
    Measure(id="inr_stp", numerator="inr", denominator="population", group_by="stp"),
]
We’ll pick out some key parts:
...
population=patients.satisfying(
    """
    (age >= 18 AND age < 120) AND
    (NOT died) AND
    (registered)
    """,
    died=patients.died_from_any_cause(
        on_or_before=index_date,
        returning="binary_flag"
    ),
    registered=patients.registered_as_of(index_date),
    age=patients.age_as_of(index_date),
),
...
The population declaration says that for each period, we want the set of adult patients who are both alive and registered at the index date.
...
cholesterol=patients.with_these_clinical_events(
    codes_cholesterol,
    returning="number_of_episodes",
    between=["index_date", "index_date + 1 month"],
    return_expectations={
        "int": {"distribution": "normal", "mean": 2, "stddev": 0.5}
    },
),
inr=patients.with_these_clinical_events(
    codes_inr,
    returning="number_of_episodes",
    between=["index_date", "index_date + 1 month"],
    return_expectations={
        "int": {"distribution": "normal", "mean": 3, "stddev": 0.5}
    },
),
...
Then we want to extract the number of cholesterol or INR measurement “episodes” recorded during the month starting from the index date. The codes_cholesterol and codes_inr codelists are defined similarly to codes_ICD10_covid in example 2. Using expectations, we say the dummy values for these variables will be low-valued integers.
measures = [
    Measure(
        id="cholesterol_practice",
        numerator="cholesterol",
        denominator="population",
        group_by="practice"
    ),
    Measure(
        id="cholesterol_stp",
        numerator="cholesterol",
        denominator="population",
        group_by="stp"
    ),
    Measure(
        id="inr_practice",
        numerator="inr",
        denominator="population",
        group_by="practice"
    ),
    Measure(
        id="inr_stp",
        numerator="inr",
        denominator="population",
        group_by="stp"
    ),
]
Finally, we define the measures that we want to calculate. Here we want four measures, one for each combination of cholesterol/inr and practice/STP. For more details on Measures, see the documentation.
As before, we generate dummy data by defining a cohortextractor generate_cohort action in the project.yaml and then running that action using opensafely run. For measures, we include an additional --index-date-range option so that a new cohort is extracted for each date specified, as follows:
cohortextractor:latest generate_cohort --study-definition study_definition_3_activity --index-date-range "2020-01-01 to 2020-09-01 by month"
Here we go from 1 January 2020 to 1 September 2020 in monthly increments. These dates are passed to the index_date variable in the study definition. This will produce a set of input_3_activity_&lt;date&gt;.csv files, each with 10000 rows of dummy data.
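For intuition, the monthly index dates and extract file names implied by that option can be sketched as follows (assuming ISO-formatted dates in the file-name pattern described above):

```python
from datetime import date

# The nine monthly index dates from 2020-01-01 to 2020-09-01 inclusive
index_dates = [date(2020, month, 1) for month in range(1, 10)]

# One dummy extract per index date (ISO date in the file name is an assumption)
files = [f"input_3_activity_{d.isoformat()}.csv" for d in index_dates]
print(files[0], "...", files[-1])
# → input_3_activity_2020-01-01.csv ... input_3_activity_2020-09-01.csv
```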
An additional step is now needed to generate the measures. This is done as follows:
cohortextractor:latest generate_measures --study-definition study_definition_3_activity --output-dir=output/measures
which will produce a set of files of the form measure_&lt;id&gt;.csv.
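What generate_measures computes for each period can be sketched as a simple grouped ratio. This is an illustrative toy with made-up rows, not the real implementation:

```python
from collections import defaultdict

# Toy extract: one row per patient, as in the dummy datasets above
rows = [
    {"practice": 1, "cholesterol": 2},
    {"practice": 1, "cholesterol": 0},
    {"practice": 2, "cholesterol": 3},
]

# For each group, sum the numerator; denominator="population" means
# the denominator is simply the number of rows (patients) in the group
totals = defaultdict(lambda: {"numerator": 0, "denominator": 0})
for row in rows:
    group = totals[row["practice"]]
    group["numerator"] += row["cholesterol"]
    group["denominator"] += 1

measure = {p: g["numerator"] / g["denominator"] for p, g in totals.items()}
print(measure)  # → {1: 1.0, 2: 3.0}
```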
Now that we have the extracted outputs, we can develop our analysis script, which can be found at /analysis/3-plot-activity.R.
Let’s look at the number of measurements in each STP for cholesterol:
and for INR:
Clearly there is a substantial dip in activity for cholesterol which corresponds neatly to the first pandemic wave. This activity dip is similar across all STPs. However, for INR the decline is less pronounced.
We can also look at deciles of measurement activity for each GP practice for cholesterol:
and for INR:
For more information on using the platform, see the OpenSAFELY documentation pages.