What is dummy data?
Dummy data (in the context of OpenSAFELY) is data that mimics the output of a dataset produced by running an ehrQL generate-dataset action, in the absence of the real data.
The primary function of dummy data in ehrQL is to allow you to test that your ehrQL code and your downstream analysis code are working as expected. It allows you to test and develop your analyses without access to the real data.
When you run an ehrQL command or a project.yaml action locally (that is, in a Codespace or on your own computer, outside the secure environment where real patient data lives), ehrQL allows the action to run using simulated data: either data generated by ehrQL itself or data that you provide.
What is dummy data not?
Dummy data is not a faithful recreation of the real data.
In many cases it is "cleaner" than the real data, and will not necessarily simulate data quality issues such as missingness, or impossible dates or values resulting from inaccurate data entry. You should be prepared to handle such edge cases in the real data, even if you do not encounter them in dummy data.
In other cases, dummy data may not accurately represent trends that you might expect to see in the real data, for example, patterns of comorbidity. The lack of such trends in analyses performed on dummy data should not be assumed to reflect trends in the real data.
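As an illustration of handling the data quality issues described above, here is a minimal sketch of defensive checks in downstream Python analysis code. It is not part of ehrQL: the column names, the inline sample rows, the 120-year age limit and the helper functions `parse_date` and `clean_rows` are all hypothetical, chosen only to show the idea of checking for missing and impossible dates rather than assuming the input is clean.

```python
import csv
import io
from datetime import date

# Hypothetical extract: column names and values are illustrative assumptions,
# not the output of any real dataset definition.
RAW = """patient_id,date_of_birth
1,1980-05-01
2,1890-01-01
3,
4,2091-07-01
"""


def parse_date(value):
    """Return a date, or None when the value is missing or malformed."""
    try:
        return date.fromisoformat(value)
    except ValueError:
        return None


def clean_rows(text, today, max_age_years=120):
    """Split rows into usable and rejected, recording why each was rejected."""
    kept, rejected = [], []
    for row in csv.DictReader(io.StringIO(text)):
        dob = parse_date(row["date_of_birth"])
        if dob is None:
            rejected.append((row["patient_id"], "missing or malformed date"))
        elif dob > today:
            rejected.append((row["patient_id"], "date of birth in the future"))
        elif today.year - dob.year > max_age_years:
            rejected.append((row["patient_id"], "implausibly old"))
        else:
            kept.append(row)
    return kept, rejected


# A fixed reference date keeps the example reproducible.
kept, rejected = clean_rows(RAW, today=date(2024, 1, 1))
for patient_id, reason in rejected:
    print(f"patient {patient_id}: {reason}")
```

Checks like these may never trigger on dummy data, which is exactly why it can be worth writing them before the code runs against real data.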
Using dummy data in OpenSAFELY
There are three ways to use dummy data:
- Allow ehrQL to generate a dummy dataset from your dataset definition
- Provide your own dummy tables
- Provide your own dummy dataset
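To give a feel for the last option, here is a rough sketch of building a small dummy dataset by hand using only the Python standard library. Everything specific in it is an assumption made for illustration: the column names (`patient_id`, `date_of_birth`, `sex`), the value ranges, and the helper functions `make_dummy_dataset` and `to_csv` are hypothetical. A dummy dataset you provide would presumably need columns matching what your own dataset definition produces, so treat this only as a starting point.

```python
import csv
import io
import random
from datetime import date, timedelta

# Hypothetical columns; adjust to match your own dataset definition.
FIELDNAMES = ["patient_id", "date_of_birth", "sex"]


def make_dummy_dataset(n_patients, seed=0):
    """Build a list of made-up patient rows; a fixed seed keeps it reproducible."""
    rng = random.Random(seed)
    earliest = date(1940, 1, 1)
    rows = []
    for patient_id in range(1, n_patients + 1):
        # Pick a birth date within roughly 80 years of the earliest date.
        dob = earliest + timedelta(days=rng.randrange(80 * 365))
        rows.append(
            {
                "patient_id": patient_id,
                "date_of_birth": dob.isoformat(),
                "sex": rng.choice(["female", "male"]),
            }
        )
    return rows


def to_csv(rows):
    """Serialise the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = make_dummy_dataset(5)
csv_text = to_csv(rows)
print(csv_text)
```

Seeding the random generator means repeated runs produce the same rows, which makes downstream test failures easier to reproduce.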