ehrQL tutorial: Working with multiple tables🔗

Danger

This page discusses the new OpenSAFELY Data Builder for accessing OpenSAFELY data sources.

Use OpenSAFELY cohort-extractor, unless you are specifically involved in the development or testing of Data Builder.

OpenSAFELY Data Builder and its documentation are still undergoing extensive development. We will announce when Data Builder is ready for general use on the Platform News page.

Example dataset definition 2a: Working with multiple tables🔗

By the end of this tutorial, you should be able to:

write a dataset definition that access multiple tables
look up the details of data tables that you can access via Data Builder
run a simple query of event-level data

Full Example🔗

OpenSAFELY backends provide several different collections of related data on patients. As you might expect if you have worked with databases before, each collection is made available via Data Builder's tables.

For the purposes of this tutorial, each individual table is stored in a single CSV file, where the CSV filename indicates the table name. This is to simulate a real backend with multiple tables available.

In the previous definitions, we accessed just a single table. This dataset definition accesses multiple tables and also demonstrates some of the querying that ehrQL permits.

Dataset definition: 2a_multiple_dataset_definition.py

from databuilder.ehrql import Dataset
from databuilder.tables.examples.tutorial import patients, prescriptions

dataset = Dataset()

year_of_birth = patients.date_of_birth.year
dataset.define_population(year_of_birth >= 2000)

dataset.sex = patients.sex
dataset.most_recent_dmd_code = (
    prescriptions.sort_by(prescriptions.processing_date)
    .last_for_patient()
    .prescribed_dmd_code
)

As explained above, this definition needs two tables to query. Below shows the data available in the patient table and the prescriptions table. Our dataset definition will combine data from both tables to generate one dataset.

Data table: multiple/patients.csv

patient_id	date_of_birth	sex
1	1980-05-01	M
2	2005-10-01	F
3	1946-01-01	M
4	1920-11-01	M
5	2010-04-01	M
6	1999-12-01	F
7	2000-01-01	M

Data table: multiple/prescriptions.csv

patient_id	prescribed_dmd_code	processing_date
1	pr1	2022-05-01
1	pr2	2022-06-02
2	pr3	2021-05-06
3	pr1	2020-09-01
4	pr1	2021-05-12
5	pr2	2020-09-08
5	pr1	2021-09-08
6	pr4	2022-01-03
7	pr3	2018-01-06

When we run the dataset definition against these tables, we should get this result.

Output dataset: outputs/2a_multiple_dataset_definition.csv

patient_id	sex	most_recent_dmd_code
2	F	pr3
5	M	pr1
7	M	pr3

Line by line explanation🔗

Most of this dataset definition will be familiar from the previous examples. There are only two changes.

Import statements🔗

We import prescriptions as well as patients.

The prescriptions table differs from patients in that prescriptions is an event-level table, while patients is a patient table. We will cover the difference between these more later. For now, it is sufficient to understand that prescriptions may contain multiple entries per patient.

Query the prescription table🔗

The final line of dataset definition finds the most recently prescribed Dictionary of Medicines and Devices (DMD) code for a patient.

This is done by sorting the table by processing_date and picking the last entry for a patient. You can see the sort_by method described in the ehrQL reference.

We will see more on working with dates in a later tutorial.

Note

Columns in the output have the same order as they are added to the dataset in the dataset definition.

Your turn🔗

Run the dataset definition.

Question

Can you modify the dataset definition so that the output shows:

The earliest DMD code prescribed
Reorder the columns so DMD prescription comes first, followed by sex
The output dataset population only contains males born in or after 2000