Skip to content

Variable reference

This section describes each available function for creating variables within a study definition.

For more information on the datasets contained within the OpenSAFELY database, see the Data sources section.

For more information on writing a study definition, go to the study definition section.

Primary Care Record🔗

These variables are derived from data held in the patients' primary care records.  

registered_as_of(reference_date, return_expectations=None) 🔗

All patients registered on the given date. Note this function passes arguments to registered_with_one_practice_between()

Parameters:

Name Description Default
reference_date

date of interest as a string with the format YYYY-MM-DD. Filters results to patients registered at a practice on the given date.

required
return_expectations

a dictionary containing an expectation definition defining an incidence between 0 and 1.

None

Returns:

Type Description
list

of integers of 1 or 0.

Examples:

This creates a variable "registered" with patient returning an integer of 1 if patient registered at date. Patients who are not registered return an integer of 0:

registered=patients.registered_as_of(
    "2020-03-01",
    return_expectations={"incidence": 0.98}
)

 

registered_with_one_practice_between(start_date, end_date, return_expectations=None) 🔗

All patients registered with the same practice through the given period

Parameters:

Name Description Default
start_date

start date of interest of period as a string with the format YYYY-MM-DD. Together with end date, this filters results to patients registered at a practice between two dates

required
end_date

end date of interest of period as a string with the format YYYY-MM-DD. Together with start date, this filters results to patients registered at a practice between two dates

required
return_expectations

a dictionary containing an expectation definition defining an incidence between 0 and 1.

None

Returns:

Type Description
list

of integers of 1 or 0.

Examples:

This creates a variable registered_one with patient returning an integer of 1 if patient registered at one practice between two dates. Patients who are not registered return an integer of 0.

registered_one=patients.registered_with_one_practice_between(
    start_date="2020-03-01",
    end_date="2020-06-01",
    return_expectations={"incidence": 0.90}
)

 

date_deregistered_from_all_supported_practices(on_or_before=None, on_or_after=None, between=None, date_format=None, return_expectations=None) 🔗

Returns the date (if any) on which the patient de-registered from all practices for which OpenSAFELY has data. Events which occur in primary care after this date will not be recorded in the platform (though there may be data from other sources e.g. SGSS, CPNS).

Parameters:

Name Description Default
on_or_before

date of interest as a string with the format YYYY-MM-DD. Filters results to on or before the given date.

None
on_or_after

date of interest as a string with the format YYYY-MM-DD. Filters results to on or after the given date.

None
between

two dates of interest as a list with each date as a string with the format YYYY-MM-DD. Filters results to between the two dates provided (inclusive). The two dates must be in chronological order.

None
date_format

a string detailing the format of the dates to be returned. It can be YYYY-MM-DD, YYYY-MM or YYYY and wherever possible the least disclosive data should be returned. i.e returning only year is less disclosive than a date with day, month and year.

None
return_expectations

a dictionary defining the incidence and distribution of expected value within the population in question.

None

Returns:

Type Description
list

of strings with a date format returned if patient had deregistered, otherwise empty

Examples:

A variable called dereg_date is created with returns a date of de-registration if patient has deregistered from a practice within the dataset within the specified time period.

dereg_date=patients.date_deregistered_from_all_supported_practices(
    on_or_after="2020-03-01",
    date_format="YYYY-MM",
    return_expectations={
        {"date": {"earliest": "2020-03-01"},
        "incidence": 0.05
    }
)

 

with_complete_history_between(start_date, end_date, return_expectations=None) 🔗

All patients for which we have a full set of records between the given dates

Parameters:

Name Description Default
start_date

start date of interest of period as a string with the format YYYY-MM-DD. Together with end date, this filters results to patients registered at a practice between two dates who have a complete history.

required
end_date

end date of interest of period as a string with the format YYYY-MM-DD. Together with start date, this filters results to patients registered at a practice between two dates who have a complete history.

required
return_expectations

a dictionary containing an expectation definition defining an incidence between 0 and 1.

None

Returns:

Type Description
list

of integers of 1 or 0

Examples:

This creates a variable has_consultation_history with patient returning an integer of 1 if patient registered at one practice between two dates and has a completed record. Patients who are not registered with a complete record return an integer of 0.

has_consultation_history=patients.with_complete_gp_consultation_history_between(
    start_date="2019-02-01",
    end_date="2020-01-31",
    return_expectations={"incidence": 0.9},
)

 

with_complete_gp_consultation_history_between(start_date, end_date, return_expectations=None) 🔗

The concept of a "consultation" in EHR systems does not map exactly to the GP-patient interaction we're interested in (see above) so there is some processing required on the part of the EHR vendor to produce the consultation record we need. This does not happen automatically as part of the GP2GP transfer, and therefore this query can be used to find just those patients for which the full history is available. This means finding patients who have been continuously registered with a single TPP-using practice throughout a time period.

Parameters:

Name Description Default
start_date

start date of interest as a string with the format YYYY-MM-DD

required
end_date

end date of interest as a string with the format YYYY-MM-DD

required
return_expectations

a dictionary defining the incidence and distribution of expected value within the population in question.

None

 

registered_practice_as_of(date, returning=None, return_expectations=None) 🔗

Return patients' practice address characteristics such as STP or MSOA

Parameters:

Name Description Default
date

date of interest as a string with the format YYYY-MM-DD. Filters results to the given date.

required
returning

string indicating value to be returned. Options are:

  • msoa: Middle Layer Super Output Area codes
  • nuts1_region_name: 9 English regions
  • stp_code: Sustainability Transformation Partnerships codes
  • pseudo_id: Pseudonymised GP practice identifier
  • rct__{trial_name}__{property_name}: Properties from a Cluster Randomised Controlled Trial (see below)
None
return_expectations

a dict defining the rate and the categories returned with ratios

None

Returns:

Type Description
list

of strings

Exceptions:

Type Description
ValueError

if unsupported returning argument is provided

Examples:

This creates a variable called region based on practice address of the patient:

region=patients.registered_practice_as_of(
    "2020-02-01",
    returning="nuts1_region_name",
    return_expectations={
        "rate": "universal",
        "category": {
            "ratios": {
                "North East": 0.1,
                "North West": 0.1,
                "Yorkshire and the Humber": 0.1,
                "East Midlands": 0.1,
                "West Midlands": 0.1,
                "East of England": 0.1,
                "London": 0.2,
                "South East": 0.2,
            },
        },
    },
)
Cluster RCTs🔗

Support is currently available for randomised controlled trials clustered at practice level (though we are also happy to add support for RCTs randomised at person-level).

A series of data files supplied by the trialists will be imported into OpenSAFELY; this will indicate which practices are enrolled, their assignment to an intervention group, and any other relevant practice properties or data gathered as part of the RCT outside of OpenSAFELY (e.g. number of GPs/nurses, number of practice visits made).

These RCT variables are only available for use by the researchers officially nominated by the responsible research group.

There is special syntax for accessing this data using the returning argument:

rct__{trial_name}__{property_name}

(Note the double underscores separating rct, trial name and property name.)

For example, for a trial called germdefence which has a property called deprivation_pctile, a variable can be created with:

practice_deprivation_pctile=patients.registered_practice_as_of(
    "2020-01-01",
    returning="rct__germdefence__deprivation_pctile",
    return_expectations={
        "rate": "universal",
        "category": {
            "ratios": {
                "1": 0.5,
                "2": 0.5
            },
        },
    },
)

The special property enrolled is a boolean indicating whether the practice was enrolled in the trial.

All other properties are returned as strings, exactly as supplied by the trialists. For the germdefence trial the available properties are:

trial_arm
av_rooms_per_house
deprivation_pctile
group_mean_behaviour_mean
group_mean_intention_mean
hand_behav_practice_mean
hand_intent_practice_mean
imd_decile
intcon
meanage
medianage
minority_ethnic_total
n_completers_hw_behav
n_completers_ri_behav
n_completers_ri_intent
n_engaged_pages_viewed_mean_mean
n_engaged_visits_mean
n_goalsetting_completers_per_practice
n_pages_viewed_mean
n_times_visited_mean
n_visits_practice
prop_engaged_visits
total_visit_time_mean

The data resulting from the study definition will be at patient level as usual and therefore practice variables will be repeated many times for each practice, and should be aggregated in a later analysis step.

 

address_as_of(date, returning=None, round_to_nearest=None, return_expectations=None) 🔗

Return patients' address characteristics such as IMD as of a particular date

Parameters:

Name Description Default
date

date of interest as a string with the format YYYY-MM-DD. Filters results to the given date.

required
returning

string indicating value to be returned. Options are:

  • index_of_multiple_deprivation
  • rural_urban_classification
  • msoa
None
round_to_nearest

an integer that represents how index_of_multiple_deprivation value are rounded. Only use when returning is index_of_multiple_deprivation

None
return_expectations

a dict defining the rate and the categories returned with ratios

None

Returns:

Type Description
list

of integers for rural_urban_classification and index_of_multiple_deprivation, strings for msoa

Exceptions:

Type Description
ValueError

if unsupported returning argument is provided

Examples:

This creates a variable called imd based on patient address.

imd=patients.address_as_of(
    "2020-02-29",
    returning="index_of_multiple_deprivation",
    round_to_nearest=100,
    return_expectations={
        "rate": "universal",
        "category": {"ratios": {"100": 0.1, "200": 0.2, "300": 0.7}},
    },
)

 

with_gp_consultations(on_or_before=None, on_or_after=None, between=None, find_first_match_in_period=None, find_last_match_in_period=None, returning='binary_flag', date_format=None, return_expectations=None) 🔗

These are GP-patient interactions, either in person or via phone/video call. The concept of a "consultation" in EHR systems is generally broader and might include things like updating a phone number with the receptionist.

Parameters:

Name Description Default
on_or_before

date of interest as a string with the format YYYY-MM-DD. Filters results to on or before the given date.

None
on_or_after

date of interest as a string with the format YYYY-MM-DD. Filters results to on or after the given date.

None
between

two dates of interest as a list with each date as a string with the format YYYY-MM-DD. Filters results to between the two dates provided (inclusive). The two dates must be in chronological order.

None
find_first_match_in_period

a boolean that indicates if the data returned is first event if there are multiple matches within the time period

None
find_last_match_in_period

a boolean that indicates if the data returned is last event if there are multiple matches within the time period

None
returning

string indicating value to be returned. Options are:

  • binary_flag: indicates if they have had the an event or not
  • date: indicates date of event and used with either find_first_match_in_period or find_last_match_in_period
  • number_of_matches_in_period: counts the events in the period
'binary_flag'
date_format

a string detailing the format of the dates to be returned. It can be YYYY-MM-DD, YYYY-MM or YYYY and wherever possible the least disclosive data should be returned. i.e returning only year is less disclosive than a date with day, month and year.

None
return_expectations

a dictionary defining the incidence and distribution of expected value within the population in question.

None

Returns:

Type Description
list

of integers of 1 or 0 if returning argument is set to binary_flag; list of strings with a date format returned if returning argument is set to date; a list of integers if returning argument is set to number_of_matches_in_period

Examples:

A variable called gp_count is created that counts number of GP consultation between two dates in 2019.

gp_count=patients.with_gp_consultations(
    between=["2019-01-01", "2020-12-31"],
    returning="number_of_matches_in_period",
    return_expectations={
        "int": {"distribution": "normal", "mean": 6, "stddev": 3},
        "incidence": 0.6,
    },
)

 

sex(return_expectations=None) 🔗

Returns the sex of the patient.

Parameters:

Name Description Default
return_expectations

a dict containing an expectation definition defining a rate and a ratio for sexes

None

Returns:

Type Description
list

"M" male, "F" female, "I" intersex, or "U" unknown.

Examples:

This creates a variable 'sex' with all patients returning a sex of either "M", "F" or ""

sex=patients.sex(
    return_expectations={
        "rate": "universal",
        "category": {"ratios": {"M": 0.49, "F": 0.51}},
    }
)

 

age_as_of(reference_date, return_expectations=None) 🔗

Returns the patient's age, in whole years, as at reference_date. Note that the patient's date of birth is rounded down to the first of the month, and age is derived from this rounded date.

Age can be negative if a patient's date of birth is after the reference_date.

Parameters:

Name Description Default
reference_date

date of interest as a string with the format YYYY-MM-DD

required
return_expectations

a dict defining an expectation definition that includes at least a rate and a distribution. If distribution is defined as "population_ages" it returns likely distribution based on known UK age bands in 2018 (see file: "uk_population_bands_2018.csv")

None

Returns:

Type Description
list

ages as integers

Examples:

This creates a variable "age" with all patient returning an age as an integer:

age=patients.age_as_of(
    "2020-02-01",
    return_expectations={
        "rate" : "universal",
        "int" : {"distribution" : "population_ages"}
    }
)

 

date_of_birth(date_format=None, return_expectations=None) 🔗

Returns date of birth as a string with format "YYYY-MM".

Parameters:

Name Description Default
date_format

a string detailing the format of the dates for date of birth to be returned. It can be "YYYY-MM" or "YYYY" and wherever possible the least disclosive data should be returned. i.e returning only year is less disclosive than a date with month and year.

None
return_expectations

a dictionary containing an expectation definition defining a rate and a distribution

None

Returns:

Type Description
list

dates as strings with "YYYY-MM" format

Exceptions:

Type Description
ValueError

if Date of Birth is attempted to be returned with a YYYY-MM-DD format.

Examples:

This creates a variable dob with all patient returning a year and month as a string:

dob=patients.date_of_birth(
    "YYYY-MM",
    return_expectations={
        "date": {"earliest": "1950-01-01", "latest": "today"},
        "rate": "uniform",
    }
)

 

most_recent_bmi(on_or_before=None, on_or_after=None, between=None, minimum_age_at_measurement=16, return_expectations=None, include_measurement_date=False, date_format=None, include_month=False, include_day=False) 🔗

Return patients' most recent BMI (in the defined period) either computed from weight and height measurements or, where they are not availble, from recorded BMI values. Measurements taken when a patient was below the minimum age are ignored. The height measurement can be taken before (but not after) the defined period as long as the patient was over the minimum age at the time.

The date of the measurement can be obtained using date_of("<bmi-column-name>"). If the BMI is computed from weight and height then we use the date of the weight measurement for this.

Parameters:

Name Description Default
on_or_before

date of interest as a string with the format YYYY-MM-DD. Filters results to measurements on or before the given date.

None
on_or_after

date of interest as a string with the format YYYY-MM-DD. Filters results to measurements on or after the given date.

None
between

two dates of interest as a list with each date as a string with the format YYYY-MM-DD. Filters results to measurements between the two dates provided (inclusive). The two dates must be in chronological order.

None
minimum_age_at_measurement

Measurements taken before this age will not count towards BMI calculations. It is an integer.

16
return_expectations

a dictionary defining the incidence and distribution of expected BMI within the population in question. This is a 3-item key-value dictionary of "date" and "float". "date" is dictionary itself and should contain the earliest and latest dates needed in the dummy data. float is a dictionary of distribution, mean, and stddev. These values determine the shape of the dummy data returned, and the float means a float will be returned rather than an integer. incidence must have a value and this is what percentage of dummy patients have a BMI. It needs to be a number between 0 and 1.

None
include_measurement_date

a boolean indicating if an extra column, named date_of_bmi, should be included in the output.

False
date_format

a string detailing the format of the dates to be returned. It can be YYYY-MM-DD, YYYY-MM or YYYY and wherever possible the least disclosive data should be returned. i.e returning only year is less disclosive than a date with day, month and year. Only used if include_measurement_date is True

None
include_month

a boolean indicating if month should be included in addition to year (deprecated: use date_format instead).

False
include_day

a boolean indicating if day should be included in addition to year and month (deprecated: use date_format instead).

False

Returns:

Type Description
float

most recent BMI

Examples:

This creates a variable "bmi" returning a float of the most recent bmi calculated from recorded height and weight, or from a recorded bmi record. Patient who do not have this information available do not return a value:

bmi=patients.most_recent_bmi(
    between=["2010-02-01", "2020-01-31"],
    minimum_age_at_measurement=18,
    include_measurement_date=True,
    date_format="YYYY-MM",
    return_expectations={
        "date": {"earliest": "2010-02-01", "latest": "2020-01-31"},
        "float": {"distribution": "normal", "mean": 28, "stddev": 8},
        "incidence": 0.80,
    }
)

 

mean_recorded_value(codelist, on_most_recent_day_of_measurement=None, return_expectations=None, on_or_before=None, on_or_after=None, between=None, include_measurement_date=False, date_format=None, include_month=False, include_day=False) 🔗

Return patients' mean recorded value of a numerical value as defined by a codelist on a particular day within the defined period. This is important as allows us to account for multiple measurements taken on one day.

The date of the measurement can be included by flagging with date format options.

Parameters:

Name Description Default
codelist

a codelist for requested value

required
on_most_recent_day_of_measurement

boolean flag for requesting measurements be on most recent date

None
return_expectations

a dictionary defining the incidence and distribution of expected value within the population in question. This is a 3-item key-value dictionary of "date" and "float". "date" is dictionary itself and should contain the earliest and latest dates needed in the dummy data. float is a dictionary of distribution, mean, and stddev. These values determine the shape of the dummy data returned, and the float means a float will be returned rather than an integer. incidence must have a value and this is what percentage of dummy patients have a value. It needs to be a number between 0 and 1.

None
on_or_before

date of interest as a string with the format YYYY-MM-DD. Filters results to measurements on or before the given date.

None
on_or_after

date of interest as a string with the format YYYY-MM-DD. Filters results to measurements on or after the given date.

None
between

two dates of interest as a list with each date as a string with the format YYYY-MM-DD. Filters results to measurements between the two dates provided (inclusive). The two dates must be in chronological order.

None
include_measurement_date

a boolean indicating if an extra column, named date_of_bmi, should be included in the output.

False
date_format

a string detailing the format of the dates to be returned. It can be YYYY-MM-DD, YYYY-MM or YYYY and wherever possible the least disclosive data should be returned. i.e returning only year is less disclosive than a date with day, month and year. Only used if include_measurement_date is True

None
include_month

a boolean indicating if day should be included in addition to year (deprecated: use date_format instead).

False
include_day

a boolean indicating if day should be included in addition to year and month (deprecated: use date_format instead).

False

Returns:

Type Description
float

mean of value

Examples:

This creates a variable bp_sys returning a float of the most recent systolic blood pressure from the record within the time period. In the event of repeated measurements on the same day, these are averaged. Patient who do not have this information available do not return a value:

bp_sys=patients.mean_recorded_value(
    systolic_blood_pressure_codes,
    on_most_recent_day_of_measurement=True,
    between=["2017-02-01", "2020-01-31"],
    include_measurement_date=True,
    date_format="YYYY-MM",
    return_expectations={
        "float": {"distribution": "normal", "mean": 80, "stddev": 10},
        "date": {"earliest": "2019-02-01", "latest": "2020-01-31"},
        "incidence": 0.95,
    },
)

 

with_these_medications(codelist, return_expectations=None, on_or_before=None, on_or_after=None, between=None, find_first_match_in_period=None, find_last_match_in_period=None, returning='binary_flag', include_date_of_match=False, date_format=None, ignore_days_where_these_clinical_codes_occur=None, episode_defined_as=None, return_binary_flag=None, return_number_of_matches_in_period=False, return_first_date_in_period=False, return_last_date_in_period=False, include_month=False, include_day=False) 🔗

Patients who have been prescribed at least one of this list of medications in the defined period

Parameters:

Name Description Default
codelist

a codelist for requested medication(s)

required
return_expectations

a dictionary defining the incidence and distribution of expected value within the population in question. If returning an integer (returning number_of_matches_in_period, number_of_episodes), this is a 2-item key-value dictionary of int and incidence. int is a dictionary of distribution, mean, and stddev. These values determine the shape of the dummy data returned, and the int means a int will be returned rather than a float. incidence must have a value and this is what percentage of dummy patients have a value. It needs to be a number between 0 and 1. If returning binary_flag this is a 1-item dictionary of incidence as described above. If returning either first_date_in_period or last_date_in_period, this is a 2-item dictionary of date and incidence. date is a dict of earliest and/or latest date possible.

None
on_or_before

date of interest as a string with the format YYYY-MM-DD. Filters results to on or before the given date.

None
on_or_after

date of interest as a string with the format YYYY-MM-DD. Filters results to on or after the given date.

None
between

two dates of interest as a list with each date as a string with the format YYYY-MM-DD. Filters results to between the two dates provided (inclusive). The two dates must be in chronological order.

None
returning

string indicating value to be returned. Options are:

  • binary_flag
  • date
  • number_of_matches_in_period
  • number_of_episodes
  • code
  • category
'binary_flag'
find_first_match_in_period

a boolean indicating if any returned date, code, category, or numeric value should be based on the first match in the period.

None
find_last_match_in_period

a boolean indicating if any returned date, code, category, or numeric value should be based on the last match in the period.

None
include_date_of_match

a boolean indicating if an extra column should be included in the output.

False
date_format

a string detailing the format of the dates to be returned. It can be YYYY-MM-DD, YYYY-MM or YYYY and wherever possible the least disclosive data should be returned. i.e returning only year is less disclosive than a date with day, month and year. Only used if include_date_of_match is True

None
ignore_days_where_these_clinical_codes_occur

a codelist that contains codes for medications to be ignored. if a medication is found on this day, the date is not matched even it matches a code in the main codelist

None
episode_defined_as

a string expression indicating how an episode should be defined

None
return_binary_flag

a bool indicatin if a binary flag should be returned (deprecated: use date_format instead)

None
return_number_of_matches_in_period

a boolean indicating if the number of matches in a period should be returned (deprecated: use date_format instead)

False
return_first_date_in_period

a boolean indicating if the first matches in a period should be returned (deprecated: use date_format instead)

False
return_last_date_in_period

a boolean indicating if the last matches in a period should be returned (deprecated: use date_format instead)

False
include_month

a boolean indicating if day should be included in addition to year (deprecated: use date_format instead).

False
include_day

a boolean indicating if day should be included in addition to year and month (deprecated: use date_format instead).

False

Returns:

Type Description
list

of integers of 1 or 0 if returning argument is set to binary_flag, number_of_episodes or number_of_matches_in_period; list of strings with a date format returned if returning argument is set to first_date_in_period or last_date_in_period.

Examples:

This creates a variable exacerbation_count returning an int of the number of episodes of oral steroids being prescribed within the time period where a prescription is counted as part of the same episode if it falls within 28 days of a previous prescription. Days where oral steroids are prescribed on the same day as a COPD review are also ignored as may not represent true exacerbations.

exacerbation_count=patients.with_these_medications(
    oral_steroid_med_codes,
    between=["2019-03-01", "2020-02-29"],
    ignore_days_where_these_clinical_codes_occur=copd_reviews,
    returning="number_of_episodes",
    episode_defined_as="series of events each <= 28 days apart",
    return_expectations={
        "int": {"distribution": "normal", "mean": 2, "stddev": 1},
        "incidence": 0.2,
    },
)

 

with_these_clinical_events(codelist, return_expectations=None, on_or_before=None, on_or_after=None, between=None, find_first_match_in_period=None, find_last_match_in_period=None, returning='binary_flag', include_date_of_match=False, date_format=None, ignore_missing_values=False, ignore_days_where_these_codes_occur=None, episode_defined_as=None, return_binary_flag=None, return_number_of_matches_in_period=False, return_first_date_in_period=False, return_last_date_in_period=False, include_month=False, include_day=False) 🔗

Patients who have had at least one of these clinical events in the defined period

Parameters:

Name Description Default
codelist

a codelist for requested event(s)

required
return_expectations

a dictionary defining the incidence and distribution of expected value within the population in question. If returning an integer (returning=number_of_matches_in_period or returning=number_of_episodes), this is a 2-item key-value dictionary of int and incidence. int is a dictionary of distribution, mean, and stddev. These values determine the shape of the dummy data returned, and the int means a int will be returned rather than a float. incidence must have a value and this is what percentage of dummy patients have a value. It needs to be a number between 0 and 1. If returning binary_flag this is a 1-item dictionary of incidence as described above. If returning either first_date_in_period or last_date_in_period, this is a 2-item dictionary of date and incidence. date is a dict of earliest and/or latest date possible.

None
on_or_before

date of interest as a string with the format YYYY-MM-DD. Filters results to on or before the given date.

None
on_or_after

date of interest as a string with the format YYYY-MM-DD. Filters results to on or after the given date.

None
between

two dates of interest as a list with each date as a string with the format YYYY-MM-DD. Filters results to between the two dates provided (inclusive). The two dates must be in chronological order.

None
returning

string indicating value to be returned. Options are:

  • binary_flag
  • date
  • number_of_matches_in_period
  • number_of_episodes
  • code
  • category
  • numeric_value
'binary_flag'
find_first_match_in_period

a boolean indicating if any returned date, code, category, or numeric value should be based on the first match in the period. If several matches compare equal, then their IDs are used to break the tie.

None
find_last_match_in_period

a boolean indicating if any returned date, code, category, or numeric value should be based on the last match in the period. This is the default behaviour. If several matches compare equal, then their IDs are used to break the tie.

None
include_date_of_match

a boolean indicating if an extra column should be included in the output.

False
date_format

a string detailing the format of the dates to be returned. It can be YYYY-MM-DD, YYYY-MM or YYYY and wherever possible the least disclosive data should be returned. i.e returning only year is less disclosive than a date with day, month and year. Only used if include_date_of_match is True

None
ignore_days_where_these_codes_occur

a codelist that contains codes for events to be ignored. if a events is found on this day, the date is not matched even it matches a code in the main codelist

None
episode_defined_as

a string expression indicating how an episode should be defined

None
ignore_missing_values

ignore events where the value is missing or zero. We are unable to distinguish between zeros and null values due to limitations in how the data is recorded in TPP.

False
return_binary_flag

a boolean indicating if the number of matches in a period should be returned (deprecated: use date_format instead),

None
return_number_of_matches_in_period

a boolean indicating if the number of matches in a period should be returned (deprecated: use date_format instead)

False
return_first_date_in_period

a boolean indicating if the first matches in a period should be returned (deprecated: use date_format instead)

False
return_last_date_in_period

a boolean indicating if the last matches in a period should be returned (deprecated: use date_format instead)

False
include_month

a boolean indicating if day should be included in addition to year (deprecated: use date_format instead).

False
include_day

a boolean indicating if day should be included in addition to year and month (deprecated: use date_format instead).

False

Returns:

Type Description
list

of integers of 1 or 0 if returning argument is set to binary_flag, number_of_episodes or number_of_matches_in_period; list of strings with a date format returned if returning argument is set to first_date_in_period or last_date_in_period. a list of strings with a category represented in an extra column in the codelist object category is returned.

Examples:

This creates a variable haem_cancer returning the first date of a diagnosis of haematology malignancy within the time period.

haem_cancer=patients.with_these_clinical_events(
    haem_cancer_codes,
    between=["2015-03-01", "2020-02-29"],
    returning="date",
    find_first_match_in_period=True,
    return_expectations={"date": {earliest; "2015-03-01", "latest": "2020-02-29"}},
)

 

with_death_recorded_in_primary_care(on_or_before=None, on_or_after=None, between=None, returning='binary_flag', date_format=None, return_expectations=None) 🔗

Identify patients with a date-of-death in their primary care record.

There is generally a lag between the death being recorded in ONS data and appearing in the primary care record, but the date itself is usually reliable when it appears. By contrast, cause of death is often not accurate in the primary care record so we don't make it available to query here.

Args:
!!! on_or_before "date of interest as a string with the format `YYYY-MM-DD`. Filters results"
    on or before the given date.
!!! on_or_after "date of interest as a string with the format `YYYY-MM-DD`. Filters results"
    on or after the given date.
!!! between "two dates of interest as a list with each date as a string with the format `YYYY-MM-DD`."
    Filters results to  between the two dates provided (inclusive).
    The two dates must be in chronological order.
returning: string indicating value to be returned. Options are:

    * `date_of_death`: Date of death
    * `binary_flag`: If they died or not

!!! date_format "a string detailing the format of the dates to be returned. It can be `YYYY-MM-DD`,"
    `YYYY-MM` or `YYYY` and wherever possible the least disclosive data should be returned. i.e returning
    only year is less disclosive than a date with day, month and year.
!!! return_expectations "a dictionary defining the incidence and distribution of expected value"
    within the population in question.

Returns:

Type Description
list

of integers of 1 or 0 if returning argument is set to binary_flag; list of strings with a date format returned if returning argument is set to date_of_death

Examples:

A variable called died_date_gp is created that returns the date of death for any patients have died in the GP dataset.

died_date_gp=patients.with_death_recorded_in_primary_care(
    on_or_after="2020-02-01",
    returning="date_of_death",
    return_expectations={
        "date": {"earliest" : "2020-02-01"},
        "rate" : "exponential_increase"
    },
),

 

care_home_status_as_of(date, categorised_as=None, return_expectations=None) 🔗

TPP have attempted to match patient addresses to care homes as stored in the CQC database. At its most simple this query returns a boolean indicating whether the patient's address (as of the supplied time) matched with a care home.

It is also possible return a more complex categorisation based on attributes of the care homes in the CQC database, which can be freely downloaded here: https://www.cqc.org.uk/about-us/transparency/using-cqc-data

At present the only imported fields are: LocationRequiresNursing LocationDoesNotRequireNursing

But we can ask for more fields to be imported if needed.

The categorised_as argument acts in effectively the same way as for the categorised_as function except that the only columns that can be referred to are those belonging to the care home table (i.e. the two nursing fields above) and the boolean IsPotentialCareHome

Parameters:

Name Description Default
date

date of interest as a string with the format YYYY-MM-DD. Filters results to the given date

required
categorised_as

a logic expression that applies an algorithm to specific variables to create categories

None
return_expectations

a dict defining the rate and the categories returned with ratios

None

Returns:

Type Description
list

of strings which each letter representing a category as defined by the algorithm

Examples:

This creates a variable called care_home_type which contains a 2 letter string which represents a type of care home environment. If the address is not valid, it defaults to an empty string.

care_home_type=patients.care_home_status_as_of(
    "2020-02-01",
    categorised_as={
        "PC":
        """
          IsPotentialCareHome
          AND LocationDoesNotRequireNursing='Y'
          AND LocationRequiresNursing='N'
        """,
        "PN":
        """
          IsPotentialCareHome
          AND LocationDoesNotRequireNursing='N'
          AND LocationRequiresNursing='Y'
        """,
        "PS": "IsPotentialCareHome",
        "PR": "NOT IsPotentialCareHome",
        "": "DEFAULT",
    },
    return_expectations={
        "rate": "universal",
        "category": {"ratios": {"PC": 0.05, "PN": 0.05, "PS": 0.05, "PR": 0.84, "": 0.01},},
    },
),

 

with_tpp_vaccination_record(target_disease_matches=None, product_name_matches=None, on_or_before=None, on_or_after=None, between=None, returning='binary_flag', date_format=None, find_first_match_in_period=None, find_last_match_in_period=None, return_expectations=None) 🔗

Identify patients with a vaccination record for a target disease within the TPP vaccination record

Vaccinations can be recorded via a Vaccination Record or via prescription of a vaccine i.e a product code.

Parameters:

Name Description Default
target_disease_matches

the target disease as a string

None
product_name_matches

the product name as a string

None
on_or_before

date of interest as a string with the format YYYY-MM-DD. Filters results to on or before the given date.

None
on_or_after

date of interest as a string with the format YYYY-MM-DD. Filters results to on or after the given date.

None
between

two dates of interest as a list with each date as a string with the format YYYY-MM-DD. Filters results to between the two dates provided (inclusive). The two dates must be in chronological order.

None
returning

string indicating value to be returned. Options are:

  • binary_flag: indicates if they have had the vaccination or not
  • date: date of vaccination
'binary_flag'
date_format

a string detailing the format of the dates to be returned. It can be YYYY-MM-DD, YYYY-MM or YYYY and wherever possible the least disclosive data should be returned. i.e returning only year is less disclosive than a date with day, month and year.

None
find_first_match_in_period

a boolean that indicates if the data returned is first indication of vaccination if there are multiple matches within the time period

None
find_last_match_in_period

a boolean that indicates if the data returned is last indication of vaccination if there are multiple matches within the time period

None
return_expectations

a dictionary defining the incidence and distribution of expected value within the population in question.

None

Returns:

Type Description
list

of integers of 1 or 0 if returning argument is set to binary_flag; list of strings with a date format returned if returning argument is set to date

Examples:

A variable called flu_vaccine is created that returns the date of vaccination for any patients in the GP dataset between 2 dates.

flu_vaccine=patients.with_tpp_vaccination_record(
    target_disease_matches="influenza",
    between=["2019-09-01", "2020-04-01"],
    returning="date",
    date_format="YYYY-MM",
    find_first_match_in_period=True,
    return_expectations={
        date": {"earliest": "2019-09-01", "latest": "2020-03-29"}
    }
),

 

household_as_of(reference_date, returning=None, return_expectations=None) 🔗

Return information about the household to which the patient belonged as of the reference date. This is inferred from address data using an algorithm developed by TPP (to be documented soon) so the results are not 100% reliable but are apparently pretty good.

Parameters:

Name Description Default
reference_date

date of interest as a string with the format YYYY-MM-DD. Filters results to a particular set date.

required
returning

string indicating value to be returned. Options are:

  • pseudo_id: An integer identifier for the household which has no meaning other than to identify individual members of the same household (0 if no household information available)
  • household_size: the number of individuals in the household (0 if no household information available)
  • is_prison: Boolean indicating whether household is a prison. See https://github.com/opensafely/cohort-extractor/issues/271#issuecomment-679069981 for details of how this is determined
  • has_members_in_other_ehr_systems: Boolean indicating whether some household members are registered with GPs using a different EHR system, meaning that our coverage of the household is incomplete.
  • percentage_of_members_with_data_in_this_backend: Integer giving the (estimated) percentage of household members where we have EHR data available in this backend (i.e. not in other systems as above).
  • msoa: Returns the MSOA (Middle Super Output Area) in which the household is situated
None
return_expectations

a dictionary defining the incidence and distribution of expected value within the population in question.

None

Returns:

Type Description
list

of integers if returning argument is set to pseudo_id, household_size or percentage_of_members_with_data_in_this_backend. a list of 1 or 0 is returning is set to is_prison or has_members_in_other_ehr_systems

Examples:

household_id=patients.household_as_of(
    "2020-02-01", returning="pseudo_id"
)

household_size=patients.household_as_of(
    "2020-02-01", returning="household_size"
),

 

with_these_decision_support_values(algorithm, on_or_before=None, on_or_after=None, between=None, find_first_match_in_period=None, find_last_match_in_period=None, returning='numeric_value', include_date_of_match=False, date_format=None, ignore_missing_values=False, return_expectations=None) 🔗

Returns values computed by the given decision support algorithm.

Parameters:

Name Description Default
algorithm

a string indicating the decision support algorithm. Currently, the only option is electronic_frailty_index for the electronic frailty index algorithm.

required
on_or_before

the date of interest as a string with the format YYYY-MM-DD. Filters matches to on or before the given date.

None
on_or_after

the date of interest as a string with the format YYYY-MM-DD. Filters matches to on or after the given date.

None
between

two dates of interest as a list with each date as a string with the format YYYY-MM-DD. Filters matches to between the two dates (inclusive). The two dates must be in chronological order.

None
find_first_match_in_period

a boolean indicating if values should be based on the first match in the period.

None
find_last_match_in_period

a boolean indicating if values should be based on the last match in the period. This is the default behaviour.

None
returning

a string indicating the values to return. The options are: * binary_flag * date * number_of_matches_in_period * numeric_value The default value.

'numeric_value'
include_date_of_match

a boolean indicating if an extra column containing the date of the match should be returned.

False
date_format

a string indicating the format of any dates included in the values. It can be YYYY-MM-DD, YYYY-MM, or YYYY. Wherever possible the least disclosive dates should be returned i.e returning dates with year and month is less disclosive than returning dates with year, month, and day. Only used if include_date_of_match=True.

None
ignore_missing_values

a boolean indicating if matches where the value is missing or zero should be ignored. We are unable to distinguish between null values (missing) and zeros due to limitations in how the data are recorded by TPP.

False
return_expectations

as described elsewhere.

None

 

ICNARC🔗

Warning

Data from ICNARC were last imported on 21-Jan-2021, with no further imports currently planned. Alternative data on ICU admission can be gleaned from SUS (i.e. returning=days_in_critical_care).

These variables are derived from the Intensive Care National Audit and Research Centre Case-Mix Programme (ICNARC-CMP), which collects information on ICU admissions across England. For more information, see the ICNARC data section.  

admitted_to_icu(on_or_after=None, on_or_before=None, between=None, find_first_match_in_period=None, find_last_match_in_period=None, returning='binary_flag', date_format=None, return_expectations=None, include_month=False, include_day=False) 🔗

Return information about being admitted to ICU.

Parameters:

Name Description Default
on_or_before

date of interest as a string with the format YYYY-MM-DD. Filters results on or before the given date.

None
on_or_after

date of interest as a string with the format YYYY-MM-DD. Filters results on or after the given date.

None
between

two dates of interest as a list with each date as a string with the format YYYY-MM-DD. Filters results between the two dates provided (inclusive). The two dates must be in chronological order.

None
find_first_match_in_period

a boolean that indicates if the data returned is first admission to icu if there are multiple admissions within the time period

None
find_last_match_in_period

a boolean that indicates if the data returned is last admission to icu if there are multiple admissions within the time period

None
returning

string indicating value to be returned. Options are:

  • binary_flag: Whether patient attended ICU
  • date_admitted: Date patient arrived in ICU
  • had_respiratory_support: Whether patient received any form of respiratory support
  • had_basic_respiratory_support: Whether patient received "basic" respiratory support
  • had_advanced_respiratory_support: Whether patient received "advanced" respiratory support

(Note that the terms "basic" and "advanced" are derived from the underlying ICNARC data.)

'binary_flag'
date_format

a string detailing the format of the dates to be returned. It can be YYYY-MM-DD, YYYY-MM or YYYY and wherever possible the least disclosive data should be returned. i.e returning only year is less disclosive than a date with day, month and year. Only used if returning is binary_flag

None
return_expectations

a dictionary defining the incidence and distribution of expected value within the population in question. This is a 2-item key-value dictionary of date and rate.

None
include_month

a boolean indicating if day should be included in addition to year (deprecated: use date_format instead).

False
include_day

a boolean indicating if day should be included in addition to year and month (deprecated: use date_format instead).

False

Returns:

Type Description
list

of integers of 1 or 0 if returning argument is set to binary_flag, had_respiratory_support, had_basic_respiratory_support or had_advanced_respiratory_support; list of strings with a date format returned if returning argument is set to date_admitted

Examples:

This returns two variables — one called icu_date_admitted and another had_resp_support:

has_resp_support=patients.admitted_to_icu(
    on_or_after="2020-02-01",
    find_first_match_in_period=True,
    returning="had_respiratory_support",
    return_expectations={
            "date": {"earliest" : "2020-02-01"},
            "rate" : "exponential_increase"
    },
),

icu_date_admitted=patients.admitted_to_icu(
    on_or_after="2020-02-01",
    find_first_match_in_period=True,
    returning="date_admitted",
    date_format="YYYY-MM-DD",
    return_expectations={
        "date": {"earliest" : "2020-02-01"},
        "rate" : "exponential_increase"
   },
),

 

SGSS🔗

These variables are derived from Second Generation Surveillance System (SGSS) data which captures routine laboratory surveillance data on infectious diseases across England. For more information, see the SGSS data section.  

with_test_result_in_sgss(pathogen=None, test_result='any', on_or_before=None, on_or_after=None, between=None, find_first_match_in_period=None, find_last_match_in_period=None, restrict_to_earliest_specimen_date=True, returning='binary_flag', date_format=None, return_expectations=None) 🔗

Finds COVID lab test results recorded in SGSS (Second Generation Surveillance System).

Please note that all dates used here are "specimen dates" (i.e. the date the specimen was taken), rather than the date the lab result was obtained.

It's important to note that data is supplied in two separate datasets: an "Earliest Specimen" dataset and an "All Tests" dataset.

Earliest Specimen Dataset🔗

Where a patient has multiple positive tests, SGSS groups these into "episodes" (referred to as "Organism-Patient-Illness-Episodes"). Each pathogen has a maximum episode duration (usually 2 weeks) and unless positive tests are separated by longer than this period they are assumed to be the same episode of illness. The specimen date recorded is the earliest positive specimen within the episode.

For SARS-CoV-2 the episode length has been set to infinity, meaning that once a patient has tested positive every positive test will be part of the same episode and record the same specimen date.

This means that using find_last_match_in_period is pointless when querying for positive results as only one date will ever be recorded and it will be the earliest.

Our original assumption, though the documentation didn't state either way, is that every negative result would be treated as unique. However this does not appear to be the case as though some patients do have multiple negative tests in this dataset, the number is far too small to be realistic.

Information about the SARS-CoV-2 episode length was via email from someone at the National Infection Service:

The COVID-19 episode length in SGSS was set to indefinite, so all
COVID-19 records from a single patient will be classified as one
episode. This may change, but is set as it is due to limited
information around re-infection and virus clearance.
All Tests Dataset🔗

This dataset is not subject to the same restriction as above and we expect each individual test result (postive or negative) to appear in this regardless of whether they are considered as within the same infection episode. In an ideal world we could use just this dataset, but there are some fields we need (e.g. SGTF) which are only supplied on the "earliest specimen" dataset.

S-Gene Target Failure🔗

Using the returning="s_gene_target_failure" option provides additional output from PCR tests results which can be used as a proxy for the presence of certain Variants of Concern.

Possible values are "", "0", "1", "9"

Definitions (from email from PHE)

1: Isolate with confirmed SGTF
Undetectable S gene; CT value (CH3) =0
Detectable ORF1ab gene; CT value (CH2) <=30 and >0
Detectable N gene; CT value (CH1) <=30 and >0

0: S gene detected
Detectable S gene (CH3>0)
Detectable y ORF1ab CT value (CH1) <=30 and >0
Detectable N gene CT value (CH2) <=30 and >0

9: Cannot be classified

Null are where the target is not S Gene. I think LFTs are currently
also coming across as 9 so will need to review those to null as well as
clearly this is a PCR only variable.
Case Category (type of test used)🔗

Using the returning="case_category" option (only available on positive, earliest specimen date results) reports whether the test was a Lateral Flow or PCR test. Possible values are:

"LFT_Only", "PCR_Only", "LFT_WithPCR"
Variant🔗

The returning="variant" option (only available in the "All Tests" data) returns details on specific SARS-CoV-2 variants detected. Possible values include, but are not limited to:

B.1.617.2
VOC-21JAN-02
VUI-21FEB-04
P.1
E484K
B.1.1.7+E484K
No VOC detected
Sequencing Failed
Undetermined
Undetermined + e484k

The returning="variant_detection_method" options returns possible values:

"Reflex Assay" and "Private Lab Sequencing"
Symptomatic🔗

The returning="symptomatic" option (only available in the "All Tests" data) returns details on whether patients are symptomatic of SARS-CoV-2 or not. This option is available regardless of the test result outcome.

Possible values are "", "Y", "N".

Number of Tests🔗

The returning="number_of_matches_in_period" option (only available in the "All Tests" data) returns a count of the number of tests a patient has had in the defined time period.

It is used with test_result which must be set as "positive", "negative" or "any". returning="number_of_matches_in_period" can therefore be used to return the number of positive, negative or all tests.

For more detail on SGSS in general see PHE_Laboratory_Reporting_Guidelines.pdf

Parameters:

Name Description Default
pathogen

pathogen we are interested in. Only SARS-CoV-2 results are included in our data extract so this will throw an error if the specified pathogen is anything other than "SARS-CoV-2".

None
test_result

must be one of "positive", "negative" or "any"

'any'
on_or_before

date of interest as a string with the format YYYY-MM-DD. Filters results to on or before the given date. The default value is None.

None
on_or_after

date of interest as a string with the format YYYY-MM-DD. Filters results to on or after the given date.

None
between

two dates of interest as a list with each date as a string with the format YYYY-MM-DD. Filters results to between the two dates provided. The two dates must be in chronological order.

None
find_first_match_in_period

a boolean that indicates if the data returned is first event if there are multiple matches within the time period

None
find_last_match_in_period

a boolean that indicates if the data returned is last event if there are multiple matches within the time period

None
restrict_to_earliest_specimen_date

a boolean indicating whether to use the "earliest specimen" or "all tests" dataset (see above). True by default, meaning that the "earliest specimen" dataset is used.

True
returning

string indicating value to be returned. Options are:

  • binary_flag: indicates if they have had the an event or not
  • date: indicates date of event and used with either find_first_match_in_period or find_last_match_in_period
  • s_gene_target_failure: returns the value of the SGTF field (see above)
  • case_category (see above)
  • variant (see above)
  • variant_detection_method (see above)
  • symptomatic (see above)
  • number_of_matches_in_period (see above)
'binary_flag'
date_format

a string detailing the format of the dates to be returned. It can be YYYY-MM-DD, YYYY-MM or YYYY and wherever possible the least disclosive data should be returned. i.e returning only year is less disclosive than a date with day, month and year.

None
return_expectations

a dictionary defining the incidence and distribution of expected value within the population in question.

None

Returns:

Type Description
list

of integers of 1 or 0 if returning argument is set to binary_flag; list of strings with a date format returned if returning argument is set to date;

Examples:

Two variables are created. One called first_tested_for_covid is the first date that a patient has a covid test never mind the result. The second called first_positive_test_date is the first date that a patient has a positive test result.

first_tested_for_covid=patients.with_test_result_in_sgss(
    pathogen="SARS-CoV-2",
    test_result="any",
    on_or_after="2020-02-01",
    find_first_match_in_period=True,
    returning="date",
    date_format="YYYY-MM-DD",
    return_expectations={
        "date": {"earliest" : "2020-02-01"},
        "rate" : "exponential_increase"
    },
),
first_positive_test_date=patients.with_test_result_in_sgss(
    pathogen="SARS-CoV-2",
    test_result="positive",
    on_or_after="2020-02-01",
    find_first_match_in_period=True,
    returning="date",
    date_format="YYYY-MM-DD",
    return_expectations={
        "date": {"earliest" : "2020-02-01"},
        "rate" : "exponential_increase"
    },
),

 

CPNS🔗

These variables are derived from the COVID-19 Patient Notification System (CPNS), which collects info on all in-hospital covid-related deaths. For more information, see the CPNS data section.  

with_death_recorded_in_cpns(on_or_before=None, on_or_after=None, between=None, returning='binary_flag', date_format=None, include_month=False, include_day=False, return_expectations=None) 🔗

Identify patients who with death registered in CPNS dataset

Parameters:

Name Description Default
on_or_before

date of interest as a string with the format YYYY-MM-DD. Filters results on or before the given date.

None
on_or_after

date of interest as a string with the format YYYY-MM-DD. Filters results on or after the given date.

None
between

two dates of interest as a list with each date as a string with the format YYYY-MM-DD. Filters results to between the two dates provided (inclusive). The two dates must be in chronological order.

None
returning

string indicating value to be returned. Options are:

  • date_of_death: Date of death
  • binary_flag: If they died or not
'binary_flag'
date_format

a string detailing the format of the dates to be returned. It can be YYYY-MM-DD, YYYY-MM or YYYY and wherever possible the least disclosive data should be returned. i.e returning only year is less disclosive than a date with day, month and year.

None
include_month

a boolean indicating if day should be included in addition to year (deprecated: use date_format instead).

False
include_day

a boolean indicating if day should be included in addition to year and month (deprecated: use date_format instead).

False
return_expectations

a dictionary defining the incidence and distribution of expected value within the population in question.

None

Returns:

Type Description
list

of integers of 1 or 0 if returning argument is set to binary_flag; list of strings with a date format returned if returning argument is set to date_of_death

Examples:

A variable called died_date_cpns is created that returns the date of death for any patients have died in the CPNS dataset.

died_date_cpns=patients.with_death_recorded_in_cpns(
    on_or_after="2020-02-01",
    returning="date_of_death",
    include_month=True,
    include_day=True,
    return_expectations={
        "date": {"earliest" : "2020-02-01"},
        "rate" : "exponential_increase"
    },
),

 

ONS deaths🔗

These variables are derived from the Death Registry data provided by the Office for National Statistics. For more information, see the ONS deaths section.  

died_from_any_cause(on_or_before=None, on_or_after=None, between=None, returning='binary_flag', date_format=None, include_month=False, include_day=False, return_expectations=None) 🔗

Identify patients who with ONS-registered deaths

Parameters:

Name Description Default
on_or_before

date of interest as a string with the format YYYY-MM-DD. Filters results on or before the given date.

None
on_or_after

date of interest as a string with the format YYYY-MM-DD. Filters results on or after the given date.

None
between

two dates of interest as a list with each date as a string with the format YYYY-MM-DD. Filters results to between the two dates provided (inclusive). The two dates must be in chronological order.

None
returning

string indicating value to be returned. Options are:

  • date_of_death: Date of death
  • binary_flag: If they died or not
'binary_flag'
date_format

a string detailing the format of the dates to be returned. It can be YYYY-MM-DD, YYYY-MM or YYYY and wherever possible the least disclosive data should be returned. i.e returning only year is less disclosive than a date with day, month and year.

None
include_month

a boolean indicating if day should be included in addition to year (deprecated: use date_format instead).

False
include_day

a boolean indicating if day should be included in addition to year and month (deprecated: use date_format instead).

False
return_expectations

a dictionary defining the incidence and distribution of expected value within the population in question.

None

Returns:

Type Description
list

of integers of 1 or 0 if returning argument is set to binary_flag; list of strings with a date format returned if returning argument is set to date_of_death

Examples:

A variable called died_any is created that returns the date of death for any patients that have died in the time period.

died_any=patients.died_from_any_cause(
    on_or_after="2020-02-01",
    returning="date_of_death",
    date_format="YYYY-MM-DD",
    return_expectations={
        "date": {"earliest" : "2020-02-01"},
        "rate" : "exponential_increase"
    },
)

 

with_these_codes_on_death_certificate(codelist, on_or_before=None, on_or_after=None, between=None, match_only_underlying_cause=False, returning='binary_flag', date_format=None, include_month=False, include_day=False, return_expectations=None) 🔗

Identify patients with ONS-registered death, where cause of death matches the supplied icd10 codelist

Parameters:

Name Description Default
codelist

a codelist for requested value

required
on_or_before

date of interest as a string with the format YYYY-MM-DD. Filters results on or before the given date.

None
on_or_after

date of interest as a string with the format YYYY-MM-DD. Filters results on or after the given date.

None
between

two dates of interest as a list with each date as a string with the format YYYY-MM-DD. Filters results between the two dates provided (inclusive). The two dates must be in chronological order.

None
match_only_underlying_cause

boolean for indicating if filters results to only specified cause of death.

False
returning

string indicating value to be returned. Options are:

  • date_of_death: Date of death
  • binary_flag: If they died or not
  • underlying_cause_of_death: The icd10 code corresponding to the underlying cause of death
'binary_flag'
date_format

a string detailing the format of the dates to be returned. It can be YYYY-MM-DD, YYYY-MM or YYYY and wherever possible the least disclosive data should be returned. i.e returning only year is less disclosive than a date with day, month and year.

None
include_month

a boolean indicating if day should be included in addition to year (deprecated: use date_format instead).

False
include_day

a boolean indicating if day should be included in addition to year and month (deprecated: use date_format instead).

False
return_expectations

a dictionary defining the incidence and distribution of expected value within the population in question.

None

Returns:

Type Description
list

of integers of 1 or 0 if returning argument is set to binary_flag or underlying_cause_of_death; list of strings with a date format returned if returning argument is set to date_of_death

Examples:

A variable called died_ons_covid_flag_any is created that returns the date of death for any patients that have covid on their death certificate even if that is the not the underlying cause of death.

died_ons_covid_flag_any=patients.with_these_codes_on_death_certificate(
    covid_codelist,
    on_or_after="2020-02-01",
    match_only_underlying_cause=False,
    return_expectations={
        "date": {"earliest" : "2020-02-01"},
        "rate" : "exponential_increase"
    },
)

 

High Cost Drugs🔗

(Documentation on the source of this data will be forthcoming later.)

with_high_cost_drugs(drug_name_matches=None, on_or_before=None, on_or_after=None, between=None, returning='binary_flag', date_format=None, find_first_match_in_period=None, find_last_match_in_period=None, return_expectations=None) 🔗

Returns data from the High Cost Drugs Dataset

More details available in this notebook: https://github.com/opensafely/highcostdrugs-research/blob/master/analysis/High%20Cost%20Drugs%20Dataset.ipynb

Parameters:

Name Description Default
drug_name_matches

a drug name as a string, or a list of such names, or a codelist containing such names. Results will be filtered to just rows matching any of the supplied names exactly. Note these are not standardised names, they are just the names however they come to us in the original data.

None
returning

string indicating value to be returned. Options are:

  • binary_flag: if the patient received any matching drugs
  • date: date drug received
'binary_flag'
on_or_before

as described elsewhere

None
on_or_after

as described elsewhere

None
between

as described elsewhere

None
find_first_match_in_period

as described elsewhere

None
find_last_match_in_period

as described elsewhere

None
date_format

only "YYYY" and "YYYY-MM" supported here as day level data not available

None
return_expectations

as described elsewhere

None

Examples:

The first month in which each patient received "ACME Drug" after March 2019:

covid_admission_date=patients.with_high_cost_drugs(
    drug_name_matches="ACME Drug",
    on_or_after="2019-03-01",
    find_first_match_in_period=True,
    returning="date",
    date_format="YYYY-MM",
    return_expectations={"date": {"earliest": "2019-03-01"}},
)

 

SUS🔗

These variables are derived from the Secondary Uses Services (SUS) data, and their underlying datasets:

admitted_to_hospital(on_or_before=None, on_or_after=None, between=None, returning='binary_flag', find_first_match_in_period=None, find_last_match_in_period=None, date_format=None, with_these_diagnoses=None, with_these_primary_diagnoses=None, with_these_procedures=None, with_admission_method=None, with_source_of_admission=None, with_discharge_destination=None, with_patient_classification=None, with_admission_treatment_function_code=None, with_administrative_category=None, return_expectations=None) 🔗

Return information about admission to hospital.

See https://github.com/opensafely/cohort-extractor/issues/186 for in-depth discussion and background.

Parameters:

Name Description Default
on_or_before

date of interest as a string with the format YYYY-MM-DD. Filters results to on or before the given date.

None
on_or_after

date of interest as a string with the format YYYY-MM-DD. Filters results to on or after the given date.

None
between

two dates of interest as a list with each date as a string with the format YYYY-MM-DD. Filters results to between the two dates provided (inclusive). The two dates must be in chronological order.

None
returning

string indicating value to be returned. Options are:

  • binary_flag: if they were admitted at all,
  • date_admitted: date patient admitted to hospital,
  • date_discharged: date patient discharged from hospital,
  • number_of_matches_in_period: number of times patient was admitted in time period specified,
  • primary_diagnosis: primary diagnosis code for admission,
  • admission_method: 2-digit code identifying method of admission: planned (booked/planned/waiting list), emergency (various types), transfer from another provider, or birth/maternity.
  • source_of_admission: 2-digit code identifying source of admission: most commonly = 19 usual place of residence. Also useful for identifying admissions from care homes ('54', '65', '85', '88'). Somewhat useful for identifying birth spells and admissions via transfer (but method_of_admission usually preferable)
  • discharge_destination: ,
  • patient_classification: single-digit numeric code:
    • 1 ordinary admission;
    • 2 day case;
    • 3/4 regular admissions (e.g. patient admitted weekly for chemotherapy or dialysis);
    • 5 mother and baby using delivery facilities only.
  • admission_treatment_function_code: specialty of patient admission (use with caution for emergency admissions),
  • days_in_critical_care: number of days in critical care during spell,
  • administrative_category: private vs NHS funded treatment,
  • duration_of_elective_wait: days on waiting list for planned procedures (use with caution).
'binary_flag'
find_first_match_in_period

a boolean that indicates if the data returned is first event if there are multiple matches within the time period

None
find_last_match_in_period

a boolean that indicates if the data returned is last event if there are multiple matches within the time period

None
date_format

a string detailing the format of the dates to be returned. It can be YYYY-MM-DD, YYYY-MM or YYYY and wherever possible the least disclosive data should be returned. i.e returning only year is less disclosive than a date with day, month and year.

None
with_these_diagnoses

icd10 codes to match against any diagnosis (note this uses prefix matching so a code like J12 will match J120, J121 etc.)

None
with_these_primary_diagnoses

icd10 codes to match against the primary diagnosis note this uses prefix matching so a code like J12 will match J120, J121 etc.)

None
with_these_procedures

opcs4 codes to match against the procedure

None
with_admission_method

string or list of strings to match against

None
with_source_of_admission

string or list of strings to match against

None
with_discharge_destination

string or list of strings to match against

None
with_patient_classification

string or list of strings to match against

None
with_admission_treatment_function_code

string or list of strings to match against

None
with_administrative_category

string or list of strings to match against

None
return_expectations

a dictionary defining the incidence and distribution of expected value within the population in question.

None

Returns:

Type Description
list

of integers of 1 or 0 if returning argument is set to binary_flag; of strings with a date format returned if returning argument is set to date_admitted or date_discharged; of integers if returning argument is set to number_of_matches_in_period, days_in_critical_care or duration_of_elective_wait; of strings with alphanumerical code format for ICD10 code if returning argument is set to primary_diagnosis; of 1-2-digit numeric or alphanumeric codes if returning argument is admission_method, source_of_admission, discharge_destination, patient_classification, or administrative_category; of 3-digit numeric specialty codes if returning argument is admission_treatment_function_code

Examples:

The day of each patient's first hospital admission for Covid19:

covid_admission_date=patients.admitted_to_hospital(
    returning= "date_admitted",
    with_these_diagnoses=covid_codelist,
    on_or_after="2020-02-01",
    find_first_match_in_period=True,
    date_format="YYYY-MM-DD",
    return_expectations={"date": {"earliest": "2020-03-01"}},
)

 

attended_emergency_care(on_or_before=None, on_or_after=None, between=None, returning='binary_flag', find_first_match_in_period=None, find_last_match_in_period=None, date_format=None, with_these_diagnoses=None, discharged_to=None, return_expectations=None) 🔗

Return information about attendance of A&E from the ECDS dataset. Please note that there is a limited number of diagnoses allowed within this dataset, and so will not match with the range of diagnoses allowed in other datasets such as the primary care record.

Parameters:

Name Description Default
on_or_before

date of interest as a string with the format YYYY-MM-DD. Filters results to on or before the given date.

None
on_or_after

date of interest as a string with the format YYYY-MM-DD. Filters results to on or after the given date.

None
between

two dates of interest as a list with each date as a string with the format YYYY-MM-DD. Filters results to between the two dates provided (inclusive). The two dates must be in chronological order.

None
returning

string indicating value to be returned. Options are:

  • binary_flag: Whether patient attended A&E
  • date_arrived: date patient arrived in A&E
  • number_of_matches_in_period: number of times patient attended A&E
  • discharge_destination: SNOMED CT code of discharge destination. This will be a member of refset 999003011000000105
'binary_flag'
find_first_match_in_period

a boolean that indicates if the data returned is first event if there are multiple matches within the time period

None
find_last_match_in_period

a boolean that indicates if the data returned is last event if there are multiple matches within the time period

None
date_format

a string detailing the format of the dates to be returned. It can be YYYY-MM-DD, YYYY-MM or YYYY and wherever possible the least disclosive data should be returned. i.e returning only year is less disclosive than a date with day, month and year.

None
with_these_diagnoses

a list of SNOMED CT codes

None
discharged_to

a list of members of refset 999003011000000105.

None
return_expectations

a dictionary defining the incidence and distribution of expected value within the population in question.

None

Returns:

Type Description
list

of integers of 1 or 0 if returning argument is set to binary_flag; list of strings with a date format returned if returning argument is set to date_arrived; of integers if returning argument is set to number_of_matches_in_period or discharge_destination (with SNOMED CT code as a numerical value)

Examples:

A variable called emergency_care is created with returns a date of first attendence in A&E if patient had attended emergency room during the time period.

emergency_care=patients.attended_emergency_care(
    on_or_after="2020-01-01",
    returning="date_arrived",
    date_format="YYYY-MM-DD",
    find_first_match_in_period=True,
    return_expectations={
        "date": {"earliest" : "2020-02-01"},
        "rate" : "exponential_increase"
    },
)

 

with_ethnicity_from_sus(returning=None, use_most_frequent_code=None, return_expectations=None) 🔗

Returns ethnicity data from the SUS Datasets

Parameters:

Name Description Default
returning

string indicating value to be returned. Options are:

  • code: don't group ethnicities at all, return the recorded code
  • group_6: group ethnicities into 6 groups
  • group_16: group ethnicities into 16 groups
None
use_most_frequent_code

when multiple codes are present, pick the most frequent one

None
return_expectations

a dictionary describing what dummy data should look like

None

Examples:

Patients with ethnicity, grouped to our 16 categories:

ethnicity_by_16_grouping=patients.with_ethnicity_from_sus(
    returning="group_16",
    use_most_frequent_code=True,
)

 

NHS England COVID-19 data store🔗

(Documentation on the source of this data will be forthcoming later.)

with_healthcare_worker_flag_on_covid_vaccine_record(returning='binary_flag', return_expectations=None) 🔗

Return whether patient was recorded as being a healthcare worker at the time they received a COVID-19 vaccination.

This data is from the NHS England COVID-19 data store, and reflects information collected at the point of vaccination where recipients are asked by vaccination staff whether they are in the category of health and care worker.

Parameters:

Name Description Default
returning

must be 'binary_flag', if supplied

'binary_flag'
return_expectations

as described elsewhere.

None

Utility functions🔗

These variables create new variable from existing variables. They do not extract any data directly.  

random_sample(percent=None, return_expectations=None) 🔗

Flags a random sample of approximately percent patients.

Parameters:

Name Description Default
percent

an integer between 1 and 100 for the percent of patients to include within the random sample

None
return_expectations

a dict containing an expectations definition defining at least an incidence

None

Returns:

Type Description
list

of integers of 1 or 0

Examples:

This creates a variable example, flagging approximately 10% of the population with the value 1:

example=patients.random_sample(percent=10, expectations={'incidence': 0.1})

 

categorised_as(category_definitions, return_expectations=None, **extra_columns) 🔗

Patients who had had 1 or more code from 1 or more codelists are categorised into groups according to a prescribed algorithm.

Parameters:

Name Description Default
category_definitions

a dict that defines the algorithm and the associated category The keys of the dict are strings representing categories from a defined by other arguments such as with_these_clinical_events. The values are expressions of logic using statements and AND/OR statements. A default argument should be provided if a particular patient cannot be categorised to the algorithm.

required
return_expectations

a dict that defined the ratios of each category. The keys are the category values as strings and the values are ratios as floats. The ratios should add up to 1.

None

Retyrns

list: of strings which each letter representing a category as defined by the algorithm

Examples:

This creates a variable of asthma status based on codes for asthma and categorising for recent steroid use.

current_asthma=patients.categorised_as(
    {
        "1": "DEFAULT",
        "2": "recent_asthma_code AND
              prednisolone_last_year = 0"
        "3": "recent_asthma_code AND prednisolone_last_year > 0"
    },
    recent_asthma_code=patients.with_these_clinical_events(
        asthma_codes, between=["2017-02-01", "2020-01-31"],
    ),
    prednisolone_last_year=patients.with_these_medications(
        pred_codes,
        between=["2019-02-01", "2020-01-31"],
        returning="number_of_matches_in_period",
    ),
    return_expectations={
        "category":{"ratios": {"0": 0.8, "1": 0.1, "2": 0.1}}
    },
)

 

satisfying(expression, return_expectations=None, **extra_columns) 🔗

Patients who meet the criteria for one or more expressions. Used as a way of combining groups or making subgroups based on certain characteristics.

Parameters:

Name Description Default
expression

a string in that links together 2 or more expressions into one statement. key variables for this expression can be defined under this statement or anywhere in study definition.

required
return_expectations

a dictionary defining the rate of expected value within the population in question

None

Returns:

Type Description
list

of integers, either 1 or 0

Examples:

This creates a study population where patients included have asthma and not copd:

population=patients.satisfying(
    """
    has_asthma AND NOT
    has_copd
    """,
    has_asthma=patients.with_these_clinical_events(
        asthma_codes, between=["2017-02-28", "2020-02-29"],
    ),
    has_copd=patients.with_these_clinical_events(
        copd_codes, between=["2017-02-28", "2020-02-29"],
    ),
)

 

date_of(source, date_format=None, include_month=False, include_day=False, return_expectations=None) 🔗

 

minimum_of(*column_names, **extra_columns) 🔗

Return the minimum value over the supplied columns e.g

min_value=patients.minimum_of("some_column", "another_column")

Note: this ignores "empty values" (i.e. the values used if there is no data for a particular column, such as 0.0 for numeric values or the empty string for dates). This ensures that the minimum of a column with a defined value and one with a missing value is equal to the defined value.

Additional columns can be defined within the function call which will be used in computing the minimum but won't themselves appear in the output:

min_value=patients.minimum_of( "some_column", another_colum=patients.with_these_medications(...) )

This function doesn't accept return_expectations but instead derives dummy values from the values of its source columns.

 

maximum_of(*column_names, **extra_columns) 🔗

Return the maximum value over the supplied columns e.g

max_value=patients.maximum_of("some_column", "another_column")

Additional columns can be defined within the function call which will be used in computing the maximum but won't themselves appear in the output:

max_value=patients.maximum_of( "some_column", another_colum=patients.with_these_medications(...) )

This function doesn't accept return_expectations but instead derives dummy values from the values of its source columns.