Using the measures framework🔗

Introduction🔗

The measures framework is used to calculate quotients (i.e. a numerator divided by a denominator) and to see how these vary over time and when broken down by different groupings.

The numerators, denominators and groups are all defined using ehrQL queries, just like we would use with a dataset definition. But the way those definitions are used is a bit different.

To explain the concepts involved we'll start with a basic but complete example and walk through each of the elements.

Basic example🔗

Suppose we want to know what proportion of the patients prescribed atorvastatin tablets in a given month were prescribed 80mg tablets, with the results broken down by sex and calculated for each of the first six months of 2022. Here's how we'd do that using the measures framework:

Code🔗

from ehrql import INTERVAL, case, create_measures, months, when
from ehrql.tables.core import medications, patients

# Every measure definitions file must include this line
measures = create_measures()

# Disable disclosure control for demonstration purposes.
# Values will neither be suppressed nor rounded.
measures.configure_disclosure_control(enabled=False)

# Small codelist for demonstration purposes; the real list would be longer
atorvastatin_tablets = [
    "39733211000001101",
    "39695411000001103",
    "39733011000001106",
    "39733111000001107",
]

atorvastatin_80mg_tablets = [
    "39733211000001101",
]

# The use of the special INTERVAL placeholder below is the key part of
# any measure definition as it allows the definition to be evaluated
# over a range of different intervals, rather than a fixed pair of dates
rx_in_interval = medications.where(
    medications.date.is_during(INTERVAL)
)
atorvastatin_rx = rx_in_interval.where(
    medications.dmd_code.is_in(atorvastatin_tablets)
)
atorvastatin_80_rx = rx_in_interval.where(
    medications.dmd_code.is_in(atorvastatin_80mg_tablets)
)

has_recorded_sex = patients.sex.is_in(["male", "female"])

measures.define_measure(
    name="atorva_80",
    numerator=atorvastatin_80_rx.exists_for_patient(),
    denominator=atorvastatin_rx.exists_for_patient() & has_recorded_sex,
    group_by={
        "sex": patients.sex
    },
    intervals=months(6).starting_on("2022-01-01"),
)

Running the example🔗

You can save this file as measure_definition.py and then run the generate-measures command on it:

opensafely exec ehrql:v1 generate-measures measure_definition.py --output measures.csv

Results🔗

This should produce a file called measures.csv whose contents look something like this:

measure	interval_start	interval_end	ratio	numerator	denominator	sex
atorva_80	2022-01-01	2022-01-31	0	0	5	female
atorva_80	2022-01-01	2022-01-31	0.2	2	10	male
atorva_80	2022-02-01	2022-02-28	0	0	3	female
atorva_80	2022-02-01	2022-02-28	0.2	1	5	male
atorva_80	2022-03-01	2022-03-31	0.5	3	6	female
atorva_80	2022-03-01	2022-03-31	0.2	2	10	male
atorva_80	2022-04-01	2022-04-30	0.444	4	9	male
atorva_80	2022-04-01	2022-04-30	0.125	1	8	female
atorva_80	2022-05-01	2022-05-31	0.286	2	7	male
atorva_80	2022-05-01	2022-05-31	0.2	1	5	female
atorva_80	2022-06-01	2022-06-30	0.25	2	8	male
atorva_80	2022-06-01	2022-06-30	0.5	5	10	female

Here the measure column always has the same value (atorva_80) because we only have a single measure defined, but if we had defined multiple measures then this would tell us which measure each row relates to.

The interval_start and interval_end columns show the date range (inclusive) covered by each row. In this case there are 12 rows because each of the six months appears twice: once where sex=male and again where sex=female.

The denominator column gives the number of patients for each interval-sex combination which match our denominator definition. The numerator column gives the number of patients which also match the numerator definition. That is, a patient must match both the denominator and numerator definitions to be included in the numerator.

The ratio is simply numerator divided by denominator.

The sex column gives the value of the expression we supplied when defining group_by. Had we defined any other groupings (say age_band) then they would appear as additional columns here.

Core concepts🔗

Defining a measure🔗

A measure definition always starts by creating a measures collection object:

measures = create_measures()

Each individual measure is then defined by calling measures.define_measure():

measures.define_measure(
    name="atorva_80",
    numerator=atorvastatin_80_rx.exists_for_patient(),
    denominator=atorvastatin_rx.exists_for_patient() & has_recorded_sex,
    group_by={
        "sex": patients.sex
    },
    intervals=months(6).starting_on("2022-01-01"),
)

The name argument is just used so we can identify our measure in the output. It can be anything you like so long as it contains only alphanumeric and underscore characters and starts with a letter.

The denominator argument defines the condition that patients must match to be included in the denominator i.e. it is a boolean patient series. (It is also possible to supply an integer here, but we'll cover this later.)

The numerator arguments defines the additional condition that patients must also meet to be included in the numerator. (Again, this can also be an integer as we'll discuss later.)

The group_by argument defines how we would like our results broken down. It's optional – leaving it out means that we'll get a single row for each time interval – but most measures define at least one set of groups. It is supplied as a dictionary mapping group names to group definitions.

As we saw in the example above, each group name defined here ends up as column in the results. Each group definition should be a "categorical" patient series, that is: a patient series which takes only a fixed set of values.

The interval argument defines the time periods over which the measure will be calculated. This is given as a list of start date/end date pairs but, as typing these all out by hand would be laborious, we provide several convenience functions for generating such lists: years, months and weeks.

The `INTERVAL` placeholder🔗

The key difference between the ehrQL we'd write for a dataset definition and the ehrQL we write for a measure comes in this line here:

rx_in_interval = medications.where(
    medications.date.is_during(INTERVAL)
)

This filters the medications table to include just those prescribed during "the current interval" without specifying exactly what the interval is.

In a dataset definition we would need to reference a specific pair of dates here e.g.:

rx_in_interval = medications.where(
    medications.date.is_on_or_between("2022-01-01", "2022-01-31")
)

But the query we use in our measure definition needs to be evaluated over a range of date intervals, not just one. We can handle this by using the special INTERVAL value as a placeholder for the start and end dates of the intervals:

rx_in_interval = medications.where(
    medications.date.is_on_or_between(INTERVAL.start_date, INTERVAL.end_date)
)

However, it's a bit cumbersome to have to type is_on_or_between(), start_date and end_date every time so we provide is_during() as a convenient shorthand:

rx_in_interval = medications.where(
    medications.date.is_during(INTERVAL)
)

Disclosure control🔗

By default, numerators and denominators are subject to disclosure control. First, values less than or equal to seven are replaced with zero (suppressed); then, values are rounded to the nearest five.

We disabled disclosure control with this line here:

measures.configure_disclosure_control(enabled=False)

However, we should carefully consider whether we wish to disable disclosure control when running inside of the secure environment. If we don't, then we should remove that line.

Additional notes🔗

Multiple measures🔗

We're not restricted to just one measure per file. If we wanted to add another measure using the same numerator and denominator but this time broken down by age band, then we would start by defining an age band variable like this:

age = patients.age_on(INTERVAL.start_date)
age_band = case(
    when((age >= 0) & (age < 20)).then("0-19"),
    when((age >= 20) & (age < 40)).then("20-39"),
    when((age >= 40) & (age < 60)).then("40-59"),
    when((age >= 60) & (age < 80)).then("60-79"),
    when(age >= 80).then("80+"),
)

And then we can define a new measure exactly like the previous one but with different name and using age_band in our group_by argument:

measures.define_measure(
    name="atorva_80_by_age",
    numerator=atorvastatin_80_rx.exists_for_patient(),
    denominator=atorvastatin_rx.exists_for_patient() & has_recorded_sex,
    group_by={
        "age_band": age_band
    },
    intervals=months(6).starting_on("2022-01-01"),
)

Writing measure outputs to separate files🔗

By default, when you define multiple measures the output is combined together in a single file.

If you prefer each measure to be in its own individual file you can do this by specifying a directory name, rather than a file name, as the --output and adding the file type (e.g. :arrow or :csv) to the end of the directory name.

For example, if you change the earlier command from using --output measures.csv to:

opensafely exec ehrql:v1 \
  generate-measures measure_definition.py --output measures/:csv

Instead of a single combined file measures.csv file like this:

measure	interval_start	interval_end	ratio	numerator	denominator	sex	age_band
atorva_80	2022-01-01	2022-01-31	0.23	4	17	male
atorva_80	2022-01-01	2022-01-31	0.08	1	12	female
atorva_80_by_age	2022-01-01	2022-01-31	0.15	3	19		0-19
atorva_80_by_age	2022-01-01	2022-01-31	0	0	3		40-59

You should now have a measures directory with two files:

measures/atorva_80.csv

interval_start	interval_end	ratio	numerator	denominator	sex
2022-01-01	2022-01-31	0.23	4	17	male
2022-01-01	2022-01-31	0.08	1	12	female

measures/atorva_80_by_age.csv

interval_start	interval_end	ratio	numerator	denominator	age_band
2022-01-01	2022-01-31	0.15	3	19	0-19
2022-01-01	2022-01-31	0	0	3	40-59

Depending on your use case having multiple smaller files may be easier to work with.

Don't forget to update outputs: in project.yaml

If you switch to using multiple files don't forget to update the outputs: specification in your project.yaml file. For instance, if you have a specification like this for a single measure file:

outputs:
  moderately_sensitive:
    measures: output/measures_ckd.csv

It would need to change to this for multiple measure files:

outputs:
  moderately_sensitive:
    measures: output/measures_ckd/*.csv

Removing duplication🔗

As our second measure shares so much in common with our first it seems redundant to have to specify the numerator, denominator and intervals all over again. Furthermore, if we ever wanted to make changes to these definitions we'd need to make the change in multiple places to keep them consistent.

To avoid this we can use measures.define_defaults() to set values which we know are going to be common between all the measures in our file:

measures.define_defaults(
    numerator=atorvastatin_80_rx.exists_for_patient(),
    denominator=atorvastatin_rx.exists_for_patient() & has_recorded_sex,
    intervals=months(6).starting_on("2022-01-01"),
)

And then we can define our two measures using:

measures.define_measure(
    name="atorva_80",
    group_by={
        "sex": patients.sex
    },
)

measures.define_measure(
    name="atorva_80_by_age",
    group_by={
        "age_band": age_band
    },
)

Grouping by multiple features🔗

The above example defines two separate measures, one grouping by sex and the other by age band. But it is also possible to define a single measure that groups by sex and age band simultaneously:

measures.define_measure(
    name="atorva_80_by_age_and_sex",
    group_by={
        "sex": patients.sex,
        "age_band": age_band,
    },
)

This will produce a row of output for each possible combination of sex and age band: male,0-19 female,0-19, male,20-39, female,20-39 and so on for all ten combinations.

As we defined our measure to cover monthly intervals over a six month period, this means that this single measure will produce 60 rows – ten for each month.

Do not group by too many features at once

Be careful when grouping by multiple features because the number of rows in the measure output can grow very quickly (exponentially, in fact) with each new feature you add: it is determined by the number of possible values for each of the features multiplied together, and all multiplied again by the number of time periods covered by the measure. By grouping on too many features it is possible to produce measures which are so large they will never complete in practice.

Dummy data🔗

When run outside of the secure environment ehrQL will generate dummy data for measures just as it does with datasets. However, note that generating meaningful results for measures defined over a large time period and with many different groupings can require significantly more dummy data than would be needed for a dataset definition. The measures framework will make a crude attempt to guess how many dummy patients it should generate, but you may need to adjust this number using the measures.configure_dummy_data() method.

For more information about using dummy data with measures please see our how-to guide.