ehrQL tutorial: Working with multiple tables🔗
Danger
This page discusses the new OpenSAFELY Data Builder for accessing OpenSAFELY data sources.
Use OpenSAFELY cohort-extractor, unless you are specifically involved in the development or testing of Data Builder.
OpenSAFELY Data Builder and its documentation are still undergoing extensive development. We will announce when Data Builder is ready for general use on the Platform News page.
Example dataset definition 2a: Working with multiple tables🔗
By the end of this tutorial, you should be able to:
- write a dataset definition that access multiple tables
- look up the details of data tables that you can access via Data Builder
- run a simple query of event-level data
Full Example🔗
OpenSAFELY backends provide several different collections of related data on patients. As you might expect if you have worked with databases before, each collection is made available via Data Builder's tables.
For the purposes of this tutorial, each individual table is stored in a single CSV file, where the CSV filename indicates the table name. This is to simulate a real backend with multiple tables available.
In the previous definitions, we accessed just a single table. This dataset definition accesses multiple tables and also demonstrates some of the querying that ehrQL permits.
Dataset definition: 2a_multiple_dataset_definition.py
from databuilder.ehrql import Dataset
from databuilder.tables.examples.tutorial import patients, prescriptions
dataset = Dataset()
year_of_birth = patients.date_of_birth.year
dataset.define_population(year_of_birth >= 2000)
dataset.sex = patients.sex
dataset.most_recent_dmd_code = (
prescriptions.sort_by(prescriptions.processing_date)
.last_for_patient()
.prescribed_dmd_code
)
As explained above, this definition needs two tables to query. Below shows the data available in the patient table and the prescriptions table. Our dataset definition will combine data from both tables to generate one dataset.
Data table: multiple/patients.csv
patient_id | date_of_birth | sex |
---|---|---|
1 | 1980-05-01 | M |
2 | 2005-10-01 | F |
3 | 1946-01-01 | M |
4 | 1920-11-01 | M |
5 | 2010-04-01 | M |
6 | 1999-12-01 | F |
7 | 2000-01-01 | M |
Data table: multiple/prescriptions.csv
patient_id | prescribed_dmd_code | processing_date |
---|---|---|
1 | pr1 | 2022-05-01 |
1 | pr2 | 2022-06-02 |
2 | pr3 | 2021-05-06 |
3 | pr1 | 2020-09-01 |
4 | pr1 | 2021-05-12 |
5 | pr2 | 2020-09-08 |
5 | pr1 | 2021-09-08 |
6 | pr4 | 2022-01-03 |
7 | pr3 | 2018-01-06 |
When we run the dataset definition against these tables, we should get this result.
Output dataset: outputs/2a_multiple_dataset_definition.csv
patient_id | sex | most_recent_dmd_code |
---|---|---|
2 | F | pr3 |
5 | M | pr1 |
7 | M | pr3 |
Line by line explanation🔗
Most of this dataset definition will be familiar from the previous examples. There are only two changes.
Import statements🔗
We import prescriptions
as well as patients
.
The prescriptions
table differs from patients
in that prescriptions
is an event-level table,
while patients
is a patient table.
We will cover the difference between these more later.
For now, it is sufficient to understand that prescriptions
may contain multiple entries per patient.
Query the prescription table🔗
The final line of dataset definition finds the most recently prescribed Dictionary of Medicines and Devices (DMD) code for a patient.
This is done by sorting the table by processing_date
and picking the last entry for a patient.
You can see the sort_by
method described in the ehrQL reference.
We will see more on working with dates in a later tutorial.
Note
Columns in the output have the same order as they are added to the dataset
in the dataset definition.
Your turn🔗
Run the dataset definition.
Question
Can you modify the dataset definition so that the output shows:
- The earliest DMD code prescribed
- Reorder the columns so DMD prescription comes first, followed by sex
- The output dataset population only contains
male
s born in or after 2000