The `cohortextractor` action

cohortextractor is an action provided by the OpenSAFELY framework. Every pipeline will start with this as its first action. It is used to convert the study definition into an actual analysis-ready dataset based on dummy or real data.

The dataset it creates can be either:

  • A dummy dataset used for developing and testing analysis code on the user's own machine. Users have control over the characteristics of each dummy variable, which are defined inside the study definition.
  • A real dataset created from the OpenSAFELY database, used for the analysis proper. Real datasets never leave the secure server; only summary data and other outputs derived from them can be released (after disclosivity checks).

cohortextractor also performs other tasks, such as creating Measures, and provides many of the functions that are needed within a study definition script.

Installing cohortextractor

In most cases you won't need to install cohortextractor directly. Instead, you create an action in your project.yaml file and run it via opensafely run <action>. This will use cohortextractor via a Docker image rather than a local installation.

If you do need to install it, open the Anaconda prompt and run the following command (or use another method to install the module if you know how):
pip install opensafely-cohort-extractor
To check this has installed successfully, run:
cohortextractor --version
If you need to install a new version, update with:
pip install --upgrade opensafely-cohort-extractor

Using cohortextractor in the project.yaml

Typically, you will be writing cohortextractor commands inside the project.yaml and executing them using the opensafely run command. The available cohortextractor commands and how they are declared in the project.yaml are detailed below.

generate_cohort

This is the command used to generate a dataset from the study definition.

A basic generate_cohort action looks like this:

generate_study_cohort:
  run: cohortextractor:latest generate_cohort --output-format=csv.gz
  outputs:
    highly_sensitive:
      data: output/input.csv.gz

This produces a gzip-compressed file, which reduces the storage space required in the backend. Omitting the --output-format=csv.gz argument produces an uncompressed CSV instead, which may be useful while testing with dummy data. To import the compressed file in Python, you can use pd.read_csv(<file>, compression='gzip').
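If pandas is not available, the compressed file can also be read with Python's standard library alone. The following is a minimal sketch rather than a recommended workflow: the helper name read_cohort_rows and the columns patient_id and age are hypothetical, and the script writes its own tiny stand-in file so that it runs without a real output/input.csv.gz.

```python
import csv
import gzip
import tempfile
from pathlib import Path


def read_cohort_rows(path):
    """Return the rows of a gzip-compressed CSV as a list of dicts."""
    # Mode "rt" decompresses and decodes to text in one step;
    # newline="" lets the csv module handle line endings itself.
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


# Build a tiny stand-in for output/input.csv.gz so the sketch runs anywhere.
# These columns are made up; real columns come from the study definition.
tmp_dir = Path(tempfile.mkdtemp())
example_path = tmp_dir / "input.csv.gz"
with gzip.open(example_path, mode="wt", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["patient_id", "age"])
    writer.writerow(["1", "42"])
    writer.writerow(["2", "67"])

rows = read_cohort_rows(example_path)
print(len(rows), rows[0]["age"])
```

Note that csv.DictReader returns every value as a string, so numeric columns need converting explicitly, whereas pd.read_csv infers types for you.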

The size of the dummy dataset is determined by the population_size option in the project.yaml.

Running the action with opensafely run generate_study_cohort will create the file output/input.csv.gz.

Running the action again will overwrite the existing dataset with a new one (you can confirm this from the file's modified date). Because the dummy values are randomly generated, the new dataset will contain different data from the previous one, even if study_definition.py has not changed. Beware that on Windows you cannot generate a new input.csv(.gz) while the existing one is open.

If you have multiple study definitions in the repo (e.g., study_definition_cohort1.py and study_definition_cohort2.py) then generate_cohort will by default create a dataset for each and name them to match (e.g., input_cohort1.csv.gz and input_cohort2.csv.gz). You can restrict the command to a single study definition using the --study-definition option, like this:

generate_study_cohort:
  run: cohortextractor:latest generate_cohort --study-definition study_definition_cohort2 --output-format=csv.gz
  outputs:
    highly_sensitive:
      data: output/input_cohort2.csv.gz

You can change the location of the output .csv.gz file using the --output-dir option. For example: run: cohortextractor:latest generate_cohort --output-dir output/cohorts
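The naming pattern described above can be sketched as a small Python helper, which may be handy when analysis scripts need to locate the right dataset. This is an illustration of the pattern only, not cohortextractor's own code: the function expected_output_path is hypothetical, and it assumes that --output-dir changes only the directory and leaves the file name unchanged.

```python
from pathlib import PurePosixPath


def expected_output_path(study_definition="study_definition",
                         output_dir="output",
                         output_format="csv.gz"):
    """Guess the dataset path for a study definition module name.

    Hypothetical helper: it mirrors the naming pattern described in the
    text and is not part of cohortextractor.
    """
    prefix = "study_definition"
    if not study_definition.startswith(prefix):
        raise ValueError(f"expected a name starting with {prefix!r}")
    # "study_definition_cohort2" -> "_cohort2"; plain "study_definition" -> ""
    suffix = study_definition[len(prefix):]
    return PurePosixPath(output_dir) / f"input{suffix}.{output_format}"


print(expected_output_path())
print(expected_output_path("study_definition_cohort2"))
print(expected_output_path("study_definition_cohort2", output_dir="output/cohorts"))
```

Whatever path results, the outputs entry of the action in project.yaml should point to the same file.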