The cohortextractor action
cohortextractor is an action provided by the OpenSAFELY framework. Every pipeline will start with this as its first action.
It is used to convert the study definition into an actual analysis-ready dataset based on dummy or real data.
The dataset it creates can be either:
- A dummy dataset used for developing and testing analysis code on the user's own machine. Users have control over the characteristics of each dummy variable, which are defined inside the study definition.
- A real dataset created from the OpenSAFELY database, used for the analysis proper. Real datasets never leave the secure server; only summary data and other outputs derived from them can be released (after disclosivity checks).
cohortextractor also performs some other tasks, like creating Measures, and provides many of the functions that are needed within a study definition script.
In most cases you won't need to install it. Instead, you create an action in your project.yaml file and run it via opensafely run <action>. This will use cohortextractor via a Docker image rather than a local installation.
If you do need to install it, follow these instructions. Go to the Anaconda prompt and run the following command (or use another method to install the module if you know how):
pip install opensafely-cohort-extractor
To upgrade an existing installation, run:
pip install --upgrade opensafely-cohort-extractor
cohortextractor in the
Typically, you will be writing cohortextractor commands inside the project.yaml and executing them using the opensafely run command. cohortextractor commands and how they are declared in the project.yaml are detailed below.
generate_cohort is the command used to generate a dataset from the study definition. A generate_cohort action looks like this:
generate_study_cohort:
  run: cohortextractor:latest generate_cohort --output-format=csv.gz
  outputs:
    highly_sensitive:
      data: output/input.csv.gz
      ### This file is produced in compressed format. If using Stata for your analysis you may need to change
      ### the --output-format option to 'csv' and the .gz file extension to produce an uncompressed CSV instead.
This produces a compressed (gzipped) file to reduce the storage space required in the backend. Changing the argument to --output-format=csv produces an uncompressed CSV instead, which may be useful while testing with dummy data or if using Stata to analyse this file.
To import the compressed file in Python, you can use
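As a minimal sketch of reading a compressed dataset in Python, the standard library alone is enough: gzip.open in text mode combined with csv.DictReader reads a .csv.gz file without first decompressing it to disk. The file written below is a stand-in with made-up columns (patient_id, age), not real extractor output.

```python
import csv
import gzip
import tempfile
from pathlib import Path


def read_gzipped_csv(path):
    """Return the rows of a gzip-compressed CSV as a list of dicts."""
    # "rt" opens the gzip stream in text mode so csv can read it directly.
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Write a tiny stand-in file; the column names are hypothetical.
demo_path = Path(tempfile.mkdtemp()) / "input.csv.gz"
with gzip.open(demo_path, "wt", newline="", encoding="utf-8") as handle:
    writer = csv.writer(handle)
    writer.writerow(["patient_id", "age"])
    writer.writerow(["1", "42"])
    writer.writerow(["2", "67"])

rows = read_gzipped_csv(demo_path)
print(len(rows), rows[0]["age"])
```

If pandas is installed, pandas.read_csv also accepts a .gz path and infers the compression from the file extension.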
The size of the dummy dataset is determined by the population_size option in the
Running the action with opensafely run generate_study_cohort will create the file
Running the action again will overwrite the existing dataset with a new one (check the file's modified date). The new dataset will contain different data from before, even if study_definition.py didn't change, because the values are randomly generated.
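The modified-date check can be scripted. This sketch assumes nothing about the extractor itself: it records a file's modification time before and after a rewrite. The path and the rewrite step are placeholders for the real dataset and for rerunning the action.

```python
import os
import tempfile
from pathlib import Path


def was_regenerated(path, previous_mtime_ns):
    """True if the file at `path` was modified after `previous_mtime_ns`."""
    return os.stat(path).st_mtime_ns > previous_mtime_ns


# Placeholder file standing in for the generated dataset.
dataset = Path(tempfile.mkdtemp()) / "input.csv"
dataset.write_text("patient_id,age\n1,42\n")
before = os.stat(dataset).st_mtime_ns

# Stand-in for rerunning the action: rewrite the file, then bump the
# timestamp explicitly so the demo does not depend on clock resolution.
dataset.write_text("patient_id,age\n1,57\n")
os.utime(dataset, ns=(before + 1_000_000_000, before + 1_000_000_000))

print(was_regenerated(dataset, before))
```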
Beware that on Windows, you can't have input.csv(.gz) open and generate a new one at the same time.
If you have multiple study definitions in the repo (e.g.,
generate_cohort will by default create a dataset for each and name them appropriately (e.g.,
You can restrict the command to a single study definition using the --study-definition option, like this:
generate_study_cohort:
  run: cohortextractor:latest generate_cohort --study-definition study_definition_cohort2 --output-format=csv.gz
  outputs:
    highly_sensitive:
      data: output/input_cohort2.csv.gz
      ### This file is produced in compressed format. If using Stata you may need to remove the --output-format option and
      ### the .gz file extension to produce an uncompressed CSV instead.
You can change the location of the output .csv.gz file using the --output-dir option, for example:
run: cohortextractor:latest generate_cohort --output-dir output/cohorts
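To see how the options in this section combine, here is a small sketch that assembles the value of a run: key from them. The helper build_run_line is hypothetical and is not part of cohortextractor or the opensafely tool; it only joins the flags already shown above (--study-definition, --output-format, --output-dir) into one string.

```python
def build_run_line(study_definition=None, output_format="csv.gz", output_dir=None):
    """Compose the value of a `run:` key from the options shown in this section."""
    parts = ["cohortextractor:latest", "generate_cohort"]
    if study_definition is not None:
        parts += ["--study-definition", study_definition]
    if output_format is not None:
        parts.append(f"--output-format={output_format}")
    if output_dir is not None:
        parts += ["--output-dir", output_dir]
    return " ".join(parts)


# Reproduces the two run lines used as examples in this section.
print(build_run_line(study_definition="study_definition_cohort2"))
print(build_run_line(output_format=None, output_dir="output/cohorts"))
```

The resulting string still has to be pasted into the action's run: key in project.yaml, and the outputs paths must be updated by hand to match.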