Case-control studies
Background
Data extracts often use an "index date" to define variables such as age, or the presence of comorbidities, at the start of a study. This can either be a fixed date (e.g. "2020-02-01") or a date that varies between patients, also known as a dynamic date (e.g. the date on which a person was hospitalised). However, some studies use dynamic index dates for a "case" population and then have a matched "control" comparator group. Because the control group by definition doesn't have the event of interest, the convention is for each control to inherit the index date from their matched "case". The process for extracting data for such a matched case-control population in OpenSAFELY is described below.
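To make the inheritance convention concrete, here is a minimal sketch in plain Python. The patient IDs, dates, and the function name inherit_index_dates are made up for illustration; they are not part of OpenSAFELY.

```python
from datetime import date

# Hypothetical example data: each case has its own (dynamic) index date.
case_index_dates = {
    101: date(2020, 3, 14),
    102: date(2020, 4, 2),
}

# Hypothetical matching result: control patient_id -> matched case patient_id.
control_to_case = {
    201: 101,
    202: 101,
    203: 102,
}


def inherit_index_dates(control_to_case, case_index_dates):
    """Give each control the index date of the case it was matched to."""
    return {
        control_id: case_index_dates[case_id]
        for control_id, case_id in control_to_case.items()
    }


control_index_dates = inherit_index_dates(control_to_case, case_index_dates)
```

Controls 201 and 202 both take the index date of case 101, so their variables (such as age) are measured on the same date as that case's.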
There are five steps to undertaking a case-control study (or matched cohort study) with OpenSAFELY:
- Extract data for the cases
- Extract data for the potential controls
- Match the cases to the potential controls
- Extract data for the matched controls
- Analysis
To begin with, our project.yaml looks like this:
version: '4.0'
actions:
  # Extract data for the cases
  # Extract data for the potential controls
  # Match the cases to the potential controls
  # Extract data for the matched controls
  # Analysis
Extract data for the cases
In this step, we will construct a dataset definition to extract all the data we need for the cases: that is, the matching data, or the data we will use to match the cases to the potential controls; and the non-matching data, or the data we will use for analysis.
To avoid duplicating code to extract the matching data and the non-matching data in this and the following steps, we could use separate Python scripts to share common variables or parametrise the dataset definition.
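One way to share common variables might look like the sketch below. The module name variables_lib.py and the function add_matching_variables are hypothetical, and the choice of sex and age as matching variables is an assumption for illustration. The function only relies on the patients table offering sex and age_on, so the ehrQL imports are shown as comments and the sketch runs without ehrQL installed.

```python
# analysis/variables_lib.py (hypothetical module name)


def add_matching_variables(dataset, patients, index_date):
    """Add the matching variables to a dataset.

    `patients` is expected to be an ehrQL patients table, and `index_date`
    either a fixed date or a per-patient date series.
    """
    dataset.sex = patients.sex
    dataset.age = patients.age_on(index_date)
    return dataset


# In each dataset definition, the shared function might be called like this
# (left as comments; the import path is an assumption about project layout):
#
#   from ehrql import create_dataset
#   from ehrql.tables.core import patients
#   from variables_lib import add_matching_variables
#
#   dataset = create_dataset()
#   add_matching_variables(dataset, patients, index_date="2020-02-01")
```

Keeping the matching variables in one function means the cases and the potential controls are guaranteed to use the same definitions.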
As we will construct multiple dataset definitions in this and the following steps, we will name this dataset definition dataset_definition_cases.py.
When working with multiple dataset definitions, it is good practice to use the same suffix to name each corresponding output file.
Here, the dataset definition's suffix is cases, so the corresponding output file will be named dataset_cases.csv.
Our project.yaml now includes the following action:
# ...
actions:
  extract_cases:
    run: ehrql:v1 generate-dataset analysis/dataset_definition_cases.py --output output/dataset_cases.csv
    outputs:
      highly_sensitive:
        dataset: output/dataset_cases.csv
Extract data for the potential controls
The potential controls are the group of patients that are matched to the cases to give the matched controls. In this step, we will construct a second dataset definition to extract only the matching data for the potential controls.
We will name this dataset definition dataset_definition_potential_controls.py.
Our project.yaml now includes the following action:
# ...
actions:
  # ...
  extract_potential_controls:
    run: ehrql:v1 generate-dataset analysis/dataset_definition_potential_controls.py --output output/dataset_potential_controls.csv
    outputs:
      highly_sensitive:
        dataset: output/dataset_potential_controls.csv
Match the cases to the potential controls
In this step, we will use the OpenSAFELY matching library in a scripted action to match the cases to the potential controls.
We will name this scripted action match.py.
Whilst the OpenSAFELY matching library can output multiple files, we will use two: matching_report.txt and matched_matches.csv.
The former contains information about the matching process.
The latter contains the matched controls.
Our project.yaml now includes the following action:
# ...
actions:
  # ...
  matching:
    run: python:v1 python analysis/match.py
    needs: [extract_cases, extract_potential_controls]
    outputs:
      moderately_sensitive:
        matching_report: output/matching_report.txt
      highly_sensitive:
        matched_matches: output/matched_matches.csv
Alternatives to the OpenSAFELY matching library
Rather than using the OpenSAFELY matching library, you may wish to match the cases to the potential controls by implementing your own algorithm in a scripted action. If you do, please note that not all Python, R, or Stata packages are available to you; to find out which packages are, see the python-docker, r-docker, or stata-docker repositories. Alternatively, you can request new libraries.
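If you do implement your own algorithm, a minimal sketch of one possible approach is shown below. It is not the OpenSAFELY matching library and does not reproduce its behaviour. It greedily matches on exact sex and an age tolerance, without replacement, using only the standard library. The function name match_controls, its parameters, and the row layout (dicts with patient_id, age, sex, and index_date keys) are assumptions for illustration.

```python
import random


def match_controls(cases, potential_controls, matches_per_case=1,
                   age_tolerance=5, seed=0):
    """Greedily match each case to unused controls of the same sex and similar age.

    Both inputs are lists of dicts with patient_id, age and sex keys; cases
    also carry index_date. Returns one row per matched control.
    """
    rng = random.Random(seed)  # fixed seed so reruns give the same matches
    available = list(potential_controls)
    rng.shuffle(available)

    matched = []
    for case in cases:
        # Eligible controls: same sex, age within the tolerance.
        chosen = [
            control
            for control in available
            if control["sex"] == case["sex"]
            and abs(control["age"] - case["age"]) <= age_tolerance
        ][:matches_per_case]
        for control in chosen:
            available.remove(control)  # match without replacement
            matched.append(
                {
                    "patient_id": control["patient_id"],
                    "age": control["age"],
                    "sex": control["sex"],
                    # The control inherits the index date of its case.
                    "case_index_date": case["index_date"],
                }
            )
    return matched
```

A greedy pass like this is simple but order-dependent: cases processed earlier get first pick of the controls, so a case may end up with fewer than matches_per_case controls.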
Extract data for the matched controls
In this step, we will construct a third dataset definition to extract only the non-matching data for the matched controls.
We will name this dataset definition dataset_definition_controls.py.
We will use the @table_from_file feature in ehrQL to make a table called matched_patients from the matched_matches.csv file.
Once defined, matched_patients will behave like any other PatientFrame in ehrQL.
Suppose matched_matches.csv has the following columns: patient_id, age, sex and case_index_date. Here, age and sex were the matching data we extracted in the first step.
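Because everything that follows depends on these columns, it can help to check the file locally first. The sketch below is a hypothetical helper, not part of ehrQL; it assumes the four columns named above and that dates are written as YYYY-MM-DD. The example values are made up.

```python
import csv
import datetime
import io

EXPECTED_COLUMNS = {"patient_id", "age", "sex", "case_index_date"}


def check_matched_file(lines):
    """Check the header and value formats of a matched-controls CSV.

    `lines` is any iterable of text lines, such as an open file.
    Returns the number of data rows; raises ValueError on a problem.
    """
    reader = csv.DictReader(lines)
    missing = EXPECTED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    count = 0
    for row in reader:
        int(row["patient_id"])  # raises ValueError if not an integer
        int(row["age"])
        datetime.date.fromisoformat(row["case_index_date"])  # YYYY-MM-DD
        count += 1
    return count


# Example with an in-memory file and made-up values.
example = io.StringIO(
    "patient_id,age,sex,case_index_date\n"
    "201,52,female,2020-03-14\n"
)
row_count = check_matched_file(example)
```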
We can create matched_patients with:
import datetime
from ehrql.query_language import PatientFrame, Series, table_from_file
CONTROLS = "output/matched_matches.csv"
@table_from_file(CONTROLS)
class matched_patients(PatientFrame):
    age = Series(int)
    sex = Series(str)
    case_index_date = Series(datetime.date)
This allows us to only include the matched controls in our dataset.
from ehrql import create_dataset
dataset = create_dataset()
dataset.define_population(
    matched_patients.exists_for_patient()
)
Since the case_index_date column in matched_patients is now accessible, we can use it as the index date for the controls (who, by definition, don't have an index date of their own).
For example, we might want to see if our matched controls have had a codelist event on or after their case index date.
from ehrql import codelist_from_csv
from ehrql.tables.core import clinical_events
codelist = codelist_from_csv("codelists/codelist.csv")
events_in_codelist = clinical_events.where(
    clinical_events.snomedct_code.is_in(codelist)
)
dataset.has_event_in_codelist = events_in_codelist.where(
    events_in_codelist.date.is_on_or_after(matched_patients.case_index_date)
).exists_for_patient()
This example assumes that the codelist is stored at codelists/codelist.csv.
Putting it all together, our dataset_definition_controls.py looks like this:
import datetime
from ehrql import codelist_from_csv, create_dataset
from ehrql.query_language import PatientFrame, Series, table_from_file
from ehrql.tables.core import clinical_events
CONTROLS = "output/matched_matches.csv"
codelist = codelist_from_csv("codelists/codelist.csv")
@table_from_file(CONTROLS)
class matched_patients(PatientFrame):
    age = Series(int)
    sex = Series(str)
    case_index_date = Series(datetime.date)
dataset = create_dataset()
dataset.define_population(matched_patients.exists_for_patient())
events_in_codelist = clinical_events.where(
    clinical_events.snomedct_code.is_in(codelist)
)
dataset.has_event_in_codelist = events_in_codelist.where(
    events_in_codelist.date.is_on_or_after(matched_patients.case_index_date)
).exists_for_patient()
Our project.yaml now includes the following action:
# ...
actions:
  # ...
  extract_controls:
    run: ehrql:v1 generate-dataset analysis/dataset_definition_controls.py --output output/dataset_controls.csv.gz
    needs: [matching]
    outputs:
      highly_sensitive:
        dataset: output/dataset_controls.csv.gz
Analysis
Finally, we will construct one or more scripted or reusable actions to undertake our analysis.
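As a starting point for a scripted analysis action, the sketch below shows one way to stack the case and control extracts into a single table with a case indicator, using only the standard library. The function names and the is_case column are hypothetical; which columns the two files share depends on your own dataset definitions.

```python
import csv
import gzip


def read_rows(path):
    """Read a CSV file (optionally gzip-compressed) into a list of dicts."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", newline="") as f:
        return list(csv.DictReader(f))


def label_rows(case_rows, control_rows):
    """Stack case and control rows, adding a 0/1 is_case indicator."""
    labelled = [{**row, "is_case": 1} for row in case_rows]
    labelled += [{**row, "is_case": 0} for row in control_rows]
    return labelled


# In an analysis script, the two extracts from the earlier steps might be
# combined like this (left as comments because the files only exist once
# the earlier actions have run):
#
#   cases = read_rows("output/dataset_cases.csv")
#   controls = read_rows("output/dataset_controls.csv.gz")
#   combined = label_rows(cases, controls)
```

Note that csv.DictReader returns every value as a string, so numeric and date columns need converting before they are used in a model.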