The most common kind of action is a scripted action.
Generally speaking, you can write whatever code you like, as long as it runs successfully on the server, and it is possible to test this locally.
However, note the following restrictions and guidance:
Write analyses in Python, R, or Stata. You can use more than one language in a single project if necessary. You can find more information about the available libraries below.
Do not write code that requires an internet connection to run. Any research objects (datasets, libraries, etc.) that are retrieved via the internet should be imported to the repo locally first. If this is not possible (for instance if the object size is too large to be transferred via GitHub) then get in touch.
Avoid code that consumes a lot of time or memory. The server is not an infinite resource. We can advise on code optimisation if run-times become problematic. A good strategy is to split your processing into separate project pipeline actions; the job runner can then choose to run them in parallel if sufficient resources are available.
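For example, two analysis actions that each depend only on the cohort extraction, and not on each other, can be scheduled in parallel by the job runner. A minimal `project.yaml` sketch (the action names and script paths here are hypothetical):

```yaml
version: "3.0"

expectations:
  population_size: 1000

actions:
  generate_cohort:
    run: cohortextractor:latest generate_cohort --output-format csv.gz
    outputs:
      highly_sensitive:
        cohort: output/input.csv.gz

  # These two actions both depend only on generate_cohort, so the
  # job runner may run them in parallel if resources allow.
  describe_cohort:
    run: python:latest python analysis/describe.py
    needs: [generate_cohort]
    outputs:
      moderately_sensitive:
        table: output/table1.csv

  fit_model:
    run: python:latest python analysis/model.py
    needs: [generate_cohort]
    outputs:
      moderately_sensitive:
        estimates: output/estimates.csv
```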
Write code that runs in the OpenSAFELY platform. Code will be run within a Linux-based Docker environment. In practice, this just means ensuring you use forward slashes (`/`) in file paths, rather than the back slashes (`\`) used on Windows.
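One easy way to stay portable is to build paths with `pathlib` rather than concatenating strings by hand. A small illustrative sketch:

```python
from pathlib import Path

# Build paths with pathlib so the same code works on Windows locally
# and in the Linux-based Docker environment on the server.
output_file = Path("output") / "input.csv.gz"

# as_posix() always renders forward slashes, whatever the host OS
print(output_file.as_posix())  # output/input.csv.gz
```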
Structure your code into discrete chunks, both within scripts, and by splitting into different pipeline actions. This helps with:
- parallelisation via the project pipeline
Reading and Writing Outputs
Scripted actions can read and write output files that are saved in the workspace. These generally fall into two categories:
* large files of `highly_sensitive` data for use by other actions
* `moderately_sensitive` outputs for review and release
`highly_sensitive` output files
It is important that the right file formats are used for large data files. The wrong formats can waste disk space, execution time, and server memory. The specific formats used vary with language ecosystem, but they should always be compressed.
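To see why this matters, note how well repetitive tabular data compresses. A quick stdlib-only illustration (the columns and values here are made up):

```python
import csv
import gzip
import io

# Build a small, repetitive CSV in memory (hypothetical columns/values)
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["patient_id", "age", "region"])
for i in range(10_000):
    writer.writerow([i, 50, "london"])

raw = buf.getvalue().encode()
compressed = gzip.compress(raw)

# Repetitive CSV data typically shrinks to a small fraction of its raw size
print(f"compressed size: {len(compressed) / len(raw):.0%} of original")
```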
The template sets up the cohortextractor command to produce a compressed CSV file. This is the current recommended output format: CSV files compress well, which both reduces storage requirements and improves job execution times on the backend.

If you need to view the raw CSV data locally, you can unzip it with `opensafely unzip input.csv.gz`.
```python
import pandas as pd

# read compressed CSV output from cohortextractor
df = pd.read_csv("output/input.csv.gz")

# write compressed feather file
df.to_feather("output/model.feather", compression="zstd")

# read feather file, decompressed automatically
df = pd.read_feather("output/input.feather")
```
```r
# read compressed CSV output from cohortextractor
df <- readr::read_csv("output/input.csv.gz")

# write a compressed feather file
arrow::write_feather(df, "output/model.feather", compression = "zstd")

# read a feather file, decompressed automatically
df <- arrow::read_feather("output/input.feather")
```
```stata
// Stata cannot handle compressed CSV files directly, so unzip first to a
// plain CSV file. The unzipped file will be discarded when the action finishes.
!gunzip output/input.csv.gz

// now import the uncompressed CSV with import delimited
import delimited using output/input.csv

// save in compressed .dta.gz format
gzsave output/model.dta.gz

// load a compressed .dta.gz file
gzload output/input.dta.gz
```
`moderately_sensitive` output files
These outputs are marked as `moderately_sensitive` in your `project.yaml`, and are available to view with Level 4 access. Outputs can be:
* aggregate summary data
* log files for debugging action code
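In `project.yaml`, each such output is declared under the action that produces it. A sketch (the action, script, and file names are hypothetical):

```yaml
actions:
  describe_cohort:
    run: python:latest python analysis/describe.py
    needs: [generate_cohort]
    outputs:
      moderately_sensitive:
        # aggregate summary data, reviewable at Level 4
        table: output/table1.csv
        # log file for debugging the action code
        log: output/describe.log
```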
Because Level 4 files need to be reviewed, various restrictions are placed on the sizes and formats of files that can be released.
File format restrictions
These are restricted so that reviewers can properly examine the outputs on the secure server.
File size restrictions
There is a maximum file size of 32 MB to:
- limit the amount of data that can be accessed via Level 4
- allow a thorough review of the outputs in a reasonable time
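Before requesting a release, it can help to check your workspace for over-limit files. A hypothetical helper (the directory name and the 32 MB limit follow the description above):

```python
from pathlib import Path

MAX_BYTES = 32 * 1024 * 1024  # 32 MB Level 4 file size limit

def oversized_outputs(output_dir="output"):
    """Return paths of files that exceed the Level 4 size limit."""
    return [
        path
        for path in Path(output_dir).rglob("*")
        if path.is_file() and path.stat().st_size > MAX_BYTES
    ]
```

Run this locally against your output directory; any paths it returns will need to be split or aggregated further before review.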
OpenSAFELY currently supports Stata, Python, and R for statistical analysis.
For security reasons, available libraries are restricted to those provided by the framework, though you can request additions.
The framework executes your scripts using Docker images which have been preloaded with a fixed set of libraries. These Docker images have yet to be optimised; if you have skills in creating Dockerfiles and would like to help, get in touch!
We currently package Stata version 16.1, with the safecount library installed; as new libraries are installed, they will appear in the stata-docker GitHub repository.
As Stata is a commercial product, a license key is needed to use it.
If you are a member of the `opensafely` GitHub organisation
- If you are using Windows, then the `opensafely` command line software will automatically use the OpenSAFELY Stata license.
- If you are using macOS:
    - Download and install GitHub's command-line tool (`gh`), then run `gh auth login --web`. Select the "HTTPS" option, and follow the instructions.
    - The `opensafely` command line software will now automatically use the OpenSAFELY Stata license.
All other external users
If you are not a member of the `opensafely` GitHub organisation, you must provide your own Stata/MP license. Unfortunately, other Stata flavours are not yet supported; let us know if this is a problem.
Locate your Stata license string as follows:

- Find a text file called `STATA.LIC` (on Windows) or `stata.lic` (macOS and Linux), which is usually at the top level of your Stata installation folder:
- On a Windows machine, it's usually somewhere like
- On Linux, somewhere like
- On macOS it's usually in
- Within that file, locate a license string of the format
- Set it as an environment variable using a method appropriate to your operating system. On Linux or macOS, you'd do it like this:

    ```shell
    export STATA_LICENSE='your license string'
    ```
- The `opensafely` command line software will now automatically use this Stata license.
The Docker image provided is Python 3.8, with this list of packages installed.
The R image provided is R 4.0, with this list of libraries installed.