The project pipeline
This section covers how to develop, run, and test your code to ensure it will work end-to-end within the secure framework.
The ehrQL documentation describes how to make an action which generates dummy datasets based on the instructions defined in your dataset definition.
These dummy datasets are the basis for developing the analysis code that will eventually be passed to the server to run on real datasets.
The code can be written and run on your local machine using whatever development setup you prefer (e.g., developing R in RStudio).
However, it's important to ensure that this code will run successfully in OpenSAFELY's secure environment too, using the specific language and package versions that are installed there. To do this, you should use the project pipeline.
The project pipeline, defined entirely in a project.yaml file, is a system for executing your code as a series of actions. Each action is a discrete analytical step within the analysis, and may depend on previous actions.
The primary purpose of the pipeline is to specify the execution order for all your code, so that it can be run and tested automatically from start to finish, both locally against dummy data and in the secure environment against the live database, with an identical software configuration in each case. Arranging your code like this also has several other advantages:
- The pipeline knows if outputs for given actions already exist, and by default skips running them if so. This greatly speeds up the debugging cycle when testing against live data.
- In production, actions that can be executed in parallel are run in parallel automatically.
- Thinking about your analysis in terms of actions makes it more readable and therefore easier to review and test. For example, being explicit about what the inputs and outputs of each action are ensures you don't overwrite files by accident.
- The pipeline forces you to declare which outputs may be more or less disclosive.
The project pipeline is defined in a single file, project.yaml, which lives in the repository's root directory.
It is written using a configuration format called YAML, which uses indentation to indicate groupings of related variables.
A simple example of a project.yaml is as follows:
# Ignore this `expectations` block. It is required but not used, and will be removed in future versions.
run: ehrql:v0 generate-dataset analysis/dataset_definition.py --output output/dataset.csv.gz
run: stata-mp:latest analysis/model.do
This example declares the pipeline version and two actions, generate_dataset and run_model.
You only need to change version if you want to take advantage of features of newer versions of the pipeline framework.
The generate_dataset action will create the highly sensitive dataset.csv.gz file.
It will be dummy data when run locally, and will be based on real data from the OpenSAFELY database when run in the secure environment.
The run_model action will run a Stata script called model.do based on the dataset.csv.gz created by the previous action.
It will output two moderately sensitive files, including survival-plot.png, which can be checked and released if appropriate.
project.yaml requires a
In general, actions are composed as follows:
- Each action must be named using a valid YAML key (you won't go wrong with letters, numbers, and underscores) and must be unique.
- Each action must include a run key which includes an officially-supported command and a version (which at present is usually just
  - The ehrql command has the same options as described in the ehrQL reference.
  - stata-mp commands provide a locked-down execution environment that can take one or more inputs, which are passed to the code.
- Each action must include an outputs key with at least one output, classified as either highly_sensitive or moderately_sensitive:
  - highly_sensitive outputs are considered potentially highly disclosive, and are never intended for publishing outside the secure environment. This includes all data at the pseudonymised patient level. Outputs labelled highly_sensitive will not be visible to researchers.
  - moderately_sensitive outputs should never include patient-level data, only data that is considered non-disclosive. This includes aggregated patient-data outputs such as summary tables, summary statistics and the outputs from statistical models. For a full list, check the allowed file types subsection. The appropriate statistical disclosure controls should have been applied to these files. They are copied to the secure review area (otherwise known as Level 4).
- Outputs should be separated onto different lines, each with a unique 'key', but related outputs can be combined using a wildcard (*). Note that when using a wildcard, it is extremely important to ensure that no highly_sensitive outputs are included. E.g.:

      outputs:
        moderately_sensitive:
          table: output/summary_results.txt
          survival_figure: output/figures/survival-plot.png
          time_series_figures: output/figures/time_series_*.png
- Keys serve only as a human-readable description of the outputs, and are ignored when the job is run.
- Each action can include a needs key which specifies a list of actions (contained within square brackets and separated by commas) that are required for it to run successfully. When an action runs, the outputs of all its needs actions are copied to its working directory. needs actions can be defined anywhere in the project.yaml, but it's more readable if they are defined above the actions that depend on them.
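The composition rules above can be expressed as a small checker. The following is a minimal Python sketch, not part of the OpenSAFELY tooling: it assumes the actions have already been loaded into a plain dictionary (as a YAML parser would produce), the name pattern is a deliberately conservative assumption, and the output keys and the `output/survival-plot.png` path are hypothetical.

```python
import re

ALLOWED_LEVELS = {"highly_sensitive", "moderately_sensitive"}
# Conservative assumption: only letters, numbers and underscores in action names.
NAME_PATTERN = re.compile(r"^[A-Za-z0-9_]+$")


def check_actions(actions):
    """Return a list of human-readable problems found in an `actions` mapping."""
    problems = []
    for name, action in actions.items():
        if not NAME_PATTERN.match(name):
            problems.append(f"{name}: use letters, numbers and underscores in names")
        if not action.get("run"):
            problems.append(f"{name}: missing 'run' key")
        outputs = action.get("outputs") or {}
        unknown = set(outputs) - ALLOWED_LEVELS
        if unknown:
            problems.append(f"{name}: unknown output level(s) {sorted(unknown)}")
        if not any(outputs.get(level) for level in ALLOWED_LEVELS):
            problems.append(f"{name}: needs at least one output")
        for dependency in action.get("needs", []):
            if dependency not in actions:
                problems.append(f"{name}: needs undefined action '{dependency}'")
    return problems


# Hypothetical project, written as the dict a YAML parser would produce.
example_actions = {
    "generate_dataset": {
        "run": "ehrql:v0 generate-dataset analysis/dataset_definition.py --output output/dataset.csv.gz",
        "outputs": {"highly_sensitive": {"dataset": "output/dataset.csv.gz"}},
    },
    "run_model": {
        "run": "stata-mp:latest analysis/model.do",
        "needs": ["generate_dataset"],
        "outputs": {"moderately_sensitive": {"figure": "output/survival-plot.png"}},
    },
}

print(check_actions(example_actions))
```

Because dictionary keys are unique by construction, duplicate action names cannot be represented here; a real check would have to catch those while parsing the file.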
When writing and running your pipeline, note that:
All file paths must be declared relative to the repository's root directory. So for example use
File paths are case-sensitive as everything is run inside a Linux Docker container.
The location of each action's output is determined by the underlying code that the action invoked, not by the value of the outputs configuration. The purpose of outputs is to label the disclosivity of each output and indicate that it should be stored securely; any outputs not labelled will not be saved.
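To illustrate how declared output patterns decide which files are kept, here is a minimal Python sketch. It uses the standard library's `fnmatchcase` as a stand-in for the pipeline's own matching, which is an assumption: the real matching rules may differ (for example in how `*` treats directory separators). The file names other than those in the wildcard example are hypothetical.

```python
from fnmatch import fnmatchcase


def files_to_keep(created_files, output_patterns):
    """Split created files into those matching a declared pattern and the rest."""
    kept, discarded = [], []
    for path in created_files:
        # fnmatchcase compares case-sensitively on every platform.
        if any(fnmatchcase(path, pattern) for pattern in output_patterns):
            kept.append(path)
        else:
            discarded.append(path)
    return kept, discarded


patterns = [
    "output/summary_results.txt",
    "output/figures/survival-plot.png",
    "output/figures/time_series_*.png",
]
created = [
    "output/summary_results.txt",
    "output/figures/time_series_2020.png",
    "output/figures/Time_Series_2021.png",  # wrong case, so no pattern matches
    "output/scratch.csv",  # never declared under outputs
]

kept, discarded = files_to_keep(created, patterns)
print("kept:", kept)
print("discarded:", discarded)
```

The two discarded files show the two common mistakes: a path whose case differs from the declared pattern, and a file the script writes but the action never declares.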
Each action is run in its own isolated environment in a temporary working directory. This means that all the necessary libraries and data must be imported within the script for each action. For R users, this essentially means that R is restarted for each action.
If one or more dependencies of an action have not been run (i.e., their outputs do not exist) then these dependency actions will be run first. If a dependency has changed but has not been run (so the outputs are not up-to-date with the changes), then the dependency actions will not be run, and the dependent actions will be run using the out-of-date outputs.
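The dependency behaviour described above can be sketched in a few lines of Python. This is an illustration only, not the pipeline's implementation: it assumes the requested action itself always runs, that a dependency runs only when its outputs are missing (or when forced), and it does not detect circular dependencies. The `has_outputs` mapping is a hypothetical stand-in for checking whether output files exist.

```python
def actions_to_run(target, needs, has_outputs, force_dependencies=False):
    """Return the actions that would run, in order, when `target` is requested."""
    order = []

    def visit(action, is_target):
        # Dependencies are considered before the action that needs them.
        for dependency in needs.get(action, []):
            visit(dependency, False)
        must_run = is_target or force_dependencies or not has_outputs.get(action, False)
        if must_run and action not in order:
            order.append(action)

    visit(target, True)
    return order


needs = {"generate_dataset": [], "run_model": ["generate_dataset"]}

# Dataset already exists: only the requested action runs, even if the
# dataset definition has changed since the dataset was generated.
print(actions_to_run("run_model", needs, {"generate_dataset": True}))

# Dataset missing: the dependency runs first.
print(actions_to_run("run_model", needs, {}))

# Forcing dependencies reruns them even though their outputs exist.
print(actions_to_run("run_model", needs, {"generate_dataset": True}, force_dependencies=True))
```

Note that the sketch never looks at whether a dependency's code has changed, which mirrors the caveat above about out-of-date outputs being reused.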
The ordering of columns may not be consistent between the dummy data and the TPP/EMIS backend. You should avoid referring to index integer positions and instead use the index / column names. Using index / column names will be more robust to different versions of ehrQL and will also avoid problems caused by index integer positions changing as columns are added/removed.
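As a small illustration of why column names are more robust than integer positions, the Python sketch below reads the same hypothetical extract with its columns in two different orders. The column names (`patient_id`, `age`, `sex`) and values are invented for the example.

```python
import csv
import io


def read_ages(csv_text):
    """Read the `age` column by name, so column order does not matter."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [int(row["age"]) for row in reader]


# Two hypothetical extracts with the same columns in a different order.
dummy_data = "patient_id,age,sex\n1,34,F\n2,57,M\n"
backend_data = "patient_id,sex,age\n1,F,34\n2,M,57\n"

print(read_ages(dummy_data))
print(read_ages(backend_data))
```

Had the code read "the second column" instead, it would have returned ages for the first extract and sexes for the second.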
Running your code locally
Whilst you can develop and run code locally using your own installations of R, Stata or Python, it's important to check that these will also successfully run on the real data in an identical execution environment.
The opensafely run command will execute one or more actions according to your project pipeline.
To see its options, type opensafely run --help.
For opensafely run to work:
- You need to have both Python and Docker installed.
- The Docker daemon must be running on your machine:
- For Windows users using Docker Desktop, there should be a Docker icon in your system tray.
- For Mac users using Docker Desktop, there should be a Docker icon in the top status bar.
To run the first action in the example above, using dummy data, you can use:
opensafely run generate_dataset
This will generate the dataset.csv.gz file as explained in the ehrQL documentation.
To run the second action you can use:
opensafely run run_model
It will create the two files as specified in the action's outputs.
To force the dependencies to be run, you can use, for example, opensafely run run_model --force-run-dependencies, or -f for short.
This will ensure that both the run_model and generate_dataset actions are run, even if dataset.csv.gz already exists.
To run all actions, you can use a special run_all action which is created for you (no need to define it in your project.yaml):
opensafely run run_all
Each time an action is run, logging information about the run is saved.
If any of your actions fail, you may find clues there as to why.
The exact steps that occur when each job is run locally are:
- A new, empty temporary directory for the job is created
- Any files in the local repo that do not match the output patterns in the project.yaml are copied into the temporary folder
- Any output files from the job's dependencies are copied into the temporary folder
- The job is run
- All the files matching the specified output patterns are copied into the local repo
- The log files for the job are saved
- The temporary directory is deleted
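The steps above can be mimicked with a short Python simulation. This is a sketch under stated assumptions, not how opensafely is implemented: the real job runs inside a Docker container, whereas here the job is an ordinary Python callable, pattern matching uses `fnmatchcase` as a stand-in, and the log-saving step is omitted. All file names in the demonstration are hypothetical.

```python
import shutil
import tempfile
from fnmatch import fnmatchcase
from pathlib import Path


def matches_any(relative_path, patterns):
    return any(fnmatchcase(relative_path, pattern) for pattern in patterns)


def copy_into(source, destination_root, relative):
    target = destination_root / relative
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, target)


def run_job_locally(repo, all_output_patterns, job_output_patterns, dependency_outputs, job):
    """Simulate one local job; `job` is a callable given the working directory."""
    repo = Path(repo)
    # The temporary directory is created here and deleted when the block exits.
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        # Copy repo files that are not declared outputs of any action.
        for source in repo.rglob("*"):
            relative = source.relative_to(repo).as_posix()
            if source.is_file() and not matches_any(relative, all_output_patterns):
                copy_into(source, workdir, relative)
        # Copy the outputs of the job's dependencies.
        for relative in dependency_outputs:
            copy_into(repo / relative, workdir, relative)
        # Run the job itself.
        job(workdir)
        # Copy files matching this job's output patterns back into the repo.
        copied = []
        for produced in workdir.rglob("*"):
            relative = produced.relative_to(workdir).as_posix()
            if produced.is_file() and matches_any(relative, job_output_patterns):
                copy_into(produced, repo, relative)
                copied.append(relative)
        return sorted(copied)


# Demonstration with a throwaway repo and a stand-in "model" job.
with tempfile.TemporaryDirectory() as demo:
    repo = Path(demo)
    (repo / "analysis").mkdir()
    (repo / "analysis/model.txt").write_text("placeholder script")
    (repo / "output").mkdir()
    (repo / "output/dataset.csv").write_text("patient_id,age\n1,34\n2,57\n")

    def fake_model(workdir):
        rows = (workdir / "output/dataset.csv").read_text().splitlines()
        (workdir / "output/summary.txt").write_text(f"rows: {len(rows) - 1}")
        (workdir / "output/scratch.txt").write_text("never declared")

    saved = run_job_locally(
        repo,
        all_output_patterns=["output/dataset.csv", "output/summary.txt"],
        job_output_patterns=["output/summary.txt"],
        dependency_outputs=["output/dataset.csv"],
        job=fake_model,
    )
    summary = (repo / "output/summary.txt").read_text()
    scratch_saved = (repo / "output/scratch.txt").exists()

print(saved, summary, scratch_saved)
```

The undeclared `scratch.txt` file is written inside the temporary directory but never reaches the repo, which is the practical consequence of outputs that are not labelled not being saved.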
Running your code with GitHub actions
Every time you create a pull request to merge a development branch into the main remote branch, GitHub will automatically run a series of tests on the code, checking that your codelists are up to date and that run_all completes successfully.
Depending on your settings, you may receive email notifications about the results of these tests.
You can view the tests, including any errors or failures, by going to the pull request page on GitHub and clicking the
You can re-run these tests by clicking the re-run jobs button.
Running your code on the server
To run code for real in the production environment, use the jobs site.
After your project has been executed via the jobs site, its outputs will be stored on a secure server.
Users with permission to access Level 4 can view output files that are labelled as moderately sensitive; they can also view automatically created log files of the run for debugging purposes.
For security reasons, they will be in a different directory than if you had run locally. For the TPP backend, outputs labelled moderately_sensitive in the project.yaml will be saved in D:/Level4Files/workspaces/<NAME_OF_YOUR_WORKSPACE>. These outputs can be reviewed on the server and released if they are deemed non-disclosive.
Outputs labelled highly_sensitive are not visible.
If you have Level 3 access
No data should ever be published from the Level 3 server. Access is only for permitted users, for the purpose of debugging problems in the secure environment.
Highly sensitive outputs can be seen in E:/high_privacy/workspaces/<WORKSPACE_NAME>. This includes a directory called metadata, containing log files for each action.
Moderately sensitive outputs can be seen in