Skip to content

ehrQL tutorial: Operations on tables🔗

Danger

This page discusses the new OpenSAFELY Data Builder for accessing OpenSAFELY data sources.

Use OpenSAFELY cohort-extractor, unless you are specifically involved in the development or testing of Data Builder.

OpenSAFELY Data Builder and its documentation are still undergoing extensive development. We will announce when Data Builder is ready for general use on the Platform News page.

Example dataset definition 3a: Operations on tables🔗

By the end of this tutorial, you should be able to:

  • describe and generate some simple operations that can be performed with ehrQL tables.
  • describe the types of operations that can be carried out
  • explain the importance of data types.

Full Example🔗

In this section, we will be building up a more complex dataset definition. This dataset definition includes information about a patient's address and their hospitalisation record. This means that we are combining 3 different tables:

  • patients
  • patient_address
  • hospitalisations

In this example, instead of solely considering the year of birth for each patient, we look for specific details of the index of multiple deprivation (IMD) where patients live. Importantly we are restricting the population by IMD, rather than adding IMD as a column.

For brevity, the tables will not be displayed here but can be reviewed in the example-data/multiple2/ folder.

Dataset definition: 3a_multiple2_dataset_definition.py
3a_multiple2_dataset_definition.py
from databuilder.ehrql import Dataset
from databuilder.tables.examples.tutorial import (
    hospitalisations,
    patient_address,
    patients,
)

dataset = Dataset()

year_of_birth = patients.date_of_birth.year

patient_address_by_date = patient_address.sort_by(patient_address.date_end)
earliest_imd = (
    patient_address_by_date.first_for_patient().index_of_multiple_deprivation_rounded
)
latest_imd = (
    patient_address_by_date.last_for_patient().index_of_multiple_deprivation_rounded
)

imd_has_increased = latest_imd > earliest_imd
latest_imd_is_at_least_5000 = latest_imd >= 5000
population = (year_of_birth < 2000) & (imd_has_increased | latest_imd_is_at_least_5000)
dataset.define_population(population)

dataset.sex = patients.sex
dataset.was_hospitalised = hospitalisations.exists_for_patient()

The output of the query above should generate a table with sex and was_hospitalised as columns.

Output dataset: outputs/3a_multiple2_dataset_definition.csv
patient_id sex was_hospitalised
1 M T
4 M T
6 F T

Line by line explanation🔗

This dataset definition finds the patients whose data meet all of the following conditions:

  • born before the year 2000
  • and matching at least one of the following
    • with a most recent patient address in a location with an index of multiple deprivation greater than 5000
    • where the index of multiple deprivation has increased from the earliest address to the latest

For those patients, the output dataset shows:

  • patient sex
  • whether the patient has ever been hospitalised

Note that in this code, there has been the addition of parentheses to make the code easier to read.

Import statements🔗

As in previous tutorials, we are importing the tables that we wish to work with. In this case we need patients, patient_address and hospitalisations.

Different table types: patient-level and event-level🔗

This introduces a distinction between patient-level and event-level tables:

  • patient is a patient-level table where one row represents one patient
  • patient_address is event-level where each row is an event and a patient can have many events. In this case, someone might move house and have a new address. See the explainer for more information.

Address by date🔗

In this line we are querying the patient_address table and sorting by the data associated with address entries for each patient.

Note that similar to year_of_birth, we are creating a variable that is not being put into the dataset defintion as a column. By creating such variables, we can either restrict populations or use them as intermediate variables. In this case, we have created a variable of patient address sorted according to the latest date.

Earliest IMD🔗

Each address has an associated IMD. In this line, we are taking the previous variable of addresses sorted by latest date, and are further filtering by taking the first_for_patient(). This can be thought of as single-column tables. The .index_of_multiple_deprivation_rounded returns the raw IMD value rounded.

patient_id earliest_imd
1 1289
2 10984
3 4591
4 112
5 50
6 9889
7 7043

Latest IMD🔗

This is similar to finding the earliest IMD. Instead of taking the first IMD value sorted by date, we take the last.

Has IMD increased🔗

Now we are creating a variable called imd_has_increased. This takes the two previous variables of earliest_imd and latest_imd and compares them to see if latest is larger than earliest.

Comparison operators🔗

This introduces us to the comparison operators available in ehrQL. You might already be familiar with some, such as <, >, ==.

In this case, values of the two IMD columns are compared for each patient row.

This comparison can be thought of as a new single-column table, indicating whether the IMD has increased from earliest to latest date, represented by one of the Boolean values, True or False:

patient_id imd_has_increased
1 T
2 T
3 F
4 T
5 F
6 F
7 T

IMD over 5000🔗

We are creating the final variable we need for our population, finding people with IMD equal to or greater than 5000. Instead of comparing the values of two columns in each row, we compare the "latest" IMD to the integer 5000.

Integers in ehrQL are written as numbers without a decimal point. This again should return a True or False.

Representation of numbers in ehrQL🔗

This introduces us to the two different types of numbers in ehrQL.

  • int: to represent integers
  • float: to represent real numbers

ehrQL is currently strict when comparing numeric types: only integers can be compared to integers, and floating point numbers ("floats") to other floats.

Data Builder will give an error if you try to compare incompatible types in your dataset definition.

The Contracts reference tells you which data type each table column has. This tells you what kinds of values and columns you can directly compare with.

If you need to convert values in columns, as a temporary fix, then you can use: .as_int() and .as_float().

Constructing the population with logical operators🔗

We have now created all the variables that we need to construct our population. Remember we are aiming to get people born before 2000, whose IMDs have increased or whose latest IMD is greater than 5000.

Logical operators combine Boolean values together to give a single Boolean value. ehrQL has the following logical operators that you might already be familiar with:

  • & to represent AND
    • a & b is True when both a and b are True
  • | to represent OR
    • a | b is True when either or both of a and b are True
  • ÂŽ to represent NOT
    • ÂŽa is True when a is False.

With ehrQL, these, like the comparison operators, are applied per table row, resulting in a table as output.

In this tutorial dataset definition, we combine multiple logical expressions. The parentheses around each logical expression make the intent clearer. The parentheses also ensure the order of evaluation: each expression in parentheses is evaluated before combining them together.

The logical operators are used to combine the criteria for patients to include in the population, as mentioned above in the summary. In this dataset definition, we use:

  • | to specify that we want either an increased IMD, or an IMD greater than a specified value.
  • & to then specify we want to match the previous IMD criteria and certain values of year of birth.

The value of this variable is True or False as patients either meet the criteria or they do not.

Define population🔗

Now we take the population variable created above and pass this into define_population(). This restricts the entire population to those patients who have the value True in the variable population.

Adding sex column🔗

Finally, we add multiple columns to our dataset, as we have done in previous dataset definitions.

We add a sex column as previously in the tutorial.

Adding hospitalisation column🔗

We are interested in if a patient ever has been admitted to hospital. This is inferred by the presence of a row in the hospitalisations table. To check for the presence of a row, we can use the exists_for_patient() method on a table. This results in a Boolean column indicating whether any rows exist.

Your turn🔗

Run the dataset definition.

Question

  1. Is hospitalisations a patient-level or event-level table? Why?
  2. What do you think would if you compare the IMD values for patients to the floating point value 5000.0, instead of the integer value 5000? Modify the dataset definition to check if you are correct.
  3. Can you further restrict the population to those patients who have a postcode?
  4. Can you change a single line of this dataset definition so that the patient population selection is inverted? Specifically, all patients previously selected are now not selected, and all patients previously unselected are now selected.
  5. Can you change the population to include only patients born before 2000 and hospitalised? Ignore IMD for now.
  6. Can you add IMD value as a column?
  7. Can you change the population to find only hospitalised males? Ignore IMD and age.