Case-control studies

Warning

At present, only the OpenSAFELY TPP backend supports case-control studies.

There are five steps to undertaking a case-control study (or matched cohort study) with OpenSAFELY:

  1. Extract data for the cases
  2. Extract data for the potential controls
  3. Match the cases to the potential controls
  4. Extract data for the matched controls
  5. Analysis

To begin with, our project.yaml looks like this:

version: '3.0'

expectations:
  population_size: 1000

actions:
  # Extract data for the cases
  # Extract data for the potential controls
  # Match the cases to the potential controls
  # Extract data for the matched controls
  # Analysis

Extract data for the cases

In this step, we will construct a study definition to extract all the data we need for the cases. This comprises the matching data, which we will use to match the cases to the potential controls, and the non-matching data, which we will use for analysis. To avoid duplicating code across this and the following steps, we could define the variables that several study definitions share in a separate Python module and import them into each study definition.

As we will construct multiple study definitions in this and the following steps, we will name this study definition study_definition_cases.py. When working with multiple study definitions, the suffix of each study definition's file name is used to name its output file. Here, the suffix is cases, so the output file will be named input_cases.csv.
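The naming rule above can be sketched in a few lines of Python. This is only an illustration of the convention as described here, not code taken from cohortextractor; the helper name output_file_for is hypothetical, and study definitions without a suffix are deliberately left out of scope.

```python
from pathlib import Path

PREFIX = "study_definition_"


def output_file_for(study_definition: str) -> str:
    """Return the output file name for a suffixed study definition.

    Hypothetical helper: it only mirrors the naming convention described
    in the text (study_definition_<suffix>.py -> input_<suffix>.csv).
    """
    stem = Path(study_definition).stem  # drop any directory and the ".py"
    if not stem.startswith(PREFIX) or stem == PREFIX:
        raise ValueError(f"expected {PREFIX}<suffix>.py, got {study_definition!r}")
    suffix = stem[len(PREFIX):]
    return f"input_{suffix}.csv"
```

For example, output_file_for("study_definition_cases.py") gives input_cases.csv, which is the file name declared under outputs in the action below.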

Our project.yaml now includes the following action:

# ...
actions:
  extract_cases:
    run: cohortextractor:latest generate_cohort --study-definition study_definition_cases
    outputs:
      highly_sensitive:
        cohort: output/input_cases.csv

Extract data for the potential controls

The potential controls are the pool of patients from which the matched controls are selected by matching them to the cases. In this step, we will construct a second study definition to extract only the matching data for the potential controls. We will name this study definition study_definition_potential_controls.py.

Our project.yaml now includes the following action:

# ...
actions:
  # ...
  extract_potential_controls:
    run: cohortextractor:latest generate_cohort --study-definition study_definition_potential_controls
    outputs:
      highly_sensitive:
        cohort: output/input_potential_controls.csv

Match the cases to the potential controls

In this step, we will use the OpenSAFELY matching library in a scripted action to match the cases to the potential controls. We will name the script for this action match.py and save it in the analysis directory. Whilst the OpenSAFELY matching library can output multiple files, we will use two: matching_report.txt and matched_matches.csv. The former contains information about the matching process. The latter contains the matched controls.

Our project.yaml now includes the following action:

# ...
actions:
  # ...
  matching:
    run: python:latest python analysis/match.py
    needs: [extract_cases, extract_potential_controls]
    outputs:
      moderately_sensitive:
        matching_report: output/matching_report.txt
      highly_sensitive:
        matched_matches: output/matched_matches.csv

Alternatives to the OpenSAFELY matching library

Rather than using the OpenSAFELY matching library, you may wish to match the cases to the potential controls by implementing your own algorithm in a scripted action. If you do, please note that not all Python, R, or Stata packages are available to you; to find out which packages are, see the python-docker, r-docker, or stata-docker repositories. Alternatively, you can request new libraries.
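As a rough idea of what implementing your own algorithm might involve, here is a minimal sketch of greedy matching without replacement, using only the Python standard library. It is not the algorithm used by the OpenSAFELY matching library. The function name match_cases, the matching variables sex and age, the age tolerance, and the patient_id key are all assumptions made for illustration.

```python
import random


def match_cases(cases, potential_controls, matches_per_case=1, age_tolerance=5, seed=0):
    """Greedily match each case to controls of the same sex and a similar age.

    Illustrative only. Each control is used at most once (matching without
    replacement). Returns a list of (case_id, control_id) pairs.
    """
    rng = random.Random(seed)
    available = list(potential_controls)
    rng.shuffle(available)  # avoid always favouring controls listed first

    pairs = []
    for case in cases:
        chosen = []
        for control in available:
            if len(chosen) == matches_per_case:
                break
            same_sex = control["sex"] == case["sex"]
            similar_age = abs(control["age"] - case["age"]) <= age_tolerance
            if same_sex and similar_age:
                chosen.append(control)
        for control in chosen:
            available.remove(control)  # a matched control cannot be reused
            pairs.append((case["patient_id"], control["patient_id"]))
    return pairs
```

A script built around a function like this would read output/input_cases.csv and output/input_potential_controls.csv, and write the matched controls to a file declared under outputs in project.yaml. Greedy matching is simple but order-dependent: cases processed earlier can use up controls that later cases would have needed.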

Extract data for the matched controls

In this step, we will construct a third study definition to extract only the non-matching data for the matched controls. We will name this study definition study_definition_controls.py.

We will use the functions patients.which_exist_in_file and patients.with_value_from_file. We will use patients.which_exist_in_file to include only the matched controls in the population:

from cohortextractor import StudyDefinition, patients

CONTROLS = "output/matched_matches.csv"

study = StudyDefinition(
    index_date="2021-01-01",  # Ignored
    population=patients.which_exist_in_file(CONTROLS),
)

We will use patients.with_value_from_file to access the case index dates:

from cohortextractor import codelist_from_csv, StudyDefinition, patients

CONTROLS = "output/matched_matches.csv"
codelist = codelist_from_csv("codelists/codelist.csv")

study = StudyDefinition(
    index_date="2021-01-01",  # Ignored
    population=patients.which_exist_in_file(CONTROLS),
    case_index_date=patients.with_value_from_file(
        CONTROLS,
        returning="case_index_date",
        returning_type="date",
    ),
    has_event_in_codelist=patients.with_these_clinical_events(
        codelist,
        on_or_after="case_index_date",
    ),
)
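Conceptually, the two functions behave like a membership test and a lookup keyed on the patient identifier. The sketch below illustrates that idea with the standard library only. It is not how cohortextractor implements them, and the patient_id column name and the example rows are assumptions made for illustration.

```python
import csv
import io

# Made-up stand-in for output/matched_matches.csv; the real file has more columns.
EXAMPLE = """patient_id,case_index_date
101,2021-03-15
102,2021-04-02
"""


def load_matches(csv_text):
    """Return a {patient_id: case_index_date} mapping from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["patient_id"]: row["case_index_date"] for row in reader}


matches = load_matches(EXAMPLE)


def exists_in_file(patient_id):
    """Analogue of patients.which_exist_in_file: is the patient in the file?"""
    return patient_id in matches


def value_from_file(patient_id):
    """Analogue of patients.with_value_from_file: the patient's value, if any."""
    return matches.get(patient_id)
```

In the study definition above, the value returned for each matched control is then used as the reference date for has_event_in_codelist, so that each control is assessed relative to the index date of its case.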

Our project.yaml now includes the following action:

# ...
actions:
  # ...
  extract_controls:
    run: cohortextractor:latest generate_cohort --study-definition study_definition_controls
    needs: [matching]
    outputs:
      highly_sensitive:
        cohort: output/input_controls.csv

Analysis

Finally, we will construct one or more scripted or reusable actions to undertake our analysis.