ehrQL tutorial Part 1: Minimal dataset definition
By the end of this tutorial, you should be able:
- to write a very simple dataset definition
- to run that dataset definition with Data Builder
Full Example
For all of our examples in this series of tutorials, we will start by showing the full dataset definition and then explain it line by line.
In this first tutorial, we will start with a minimal dataset definition. This finds the patients whose year of birth is 2000 or later.
Dataset definition: 1a_minimal_dataset_definition.py
from databuilder.ehrql import Dataset
from databuilder.tables.examples.tutorial import patients
dataset = Dataset()
year_of_birth = patients.date_of_birth.year
dataset.define_population(year_of_birth >= 2000)
If we run this against the sample data provided (see below), it will pick out only patients who were born in 2000 or later.
Original Data: minimal/patients.csv
patient_id | date_of_birth | sex |
---|---|---|
1 | 1980-05-01 | M |
2 | 2005-10-01 | F |
3 | 1946-01-01 | M |
4 | 1920-11-01 | M |
5 | 2010-04-01 | M |
6 | 1999-12-01 | F |
7 | 2000-01-01 | M |
In this case, that is patients 2, 5 and 7.
Output dataset: outputs/1a_minimal_dataset_definition.csv
patient_id |
---|
2 |
5 |
7 |
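To see why those three rows are selected, the same filter can be mimicked in plain Python. This is only an illustrative sketch: it does not use Data Builder or ehrQL, and the names SAMPLE_PATIENTS and born_in_or_after are hypothetical, introduced here for the example.

```python
from datetime import date

# Rows copied from minimal/patients.csv (patient_id, date_of_birth, sex).
SAMPLE_PATIENTS = [
    (1, date(1980, 5, 1), "M"),
    (2, date(2005, 10, 1), "F"),
    (3, date(1946, 1, 1), "M"),
    (4, date(1920, 11, 1), "M"),
    (5, date(2010, 4, 1), "M"),
    (6, date(1999, 12, 1), "F"),
    (7, date(2000, 1, 1), "M"),
]


def born_in_or_after(rows, year):
    """Return the patient_ids whose year of birth is >= year (hypothetical helper)."""
    return [patient_id for patient_id, dob, _sex in rows if dob.year >= year]


selected = born_in_or_after(SAMPLE_PATIENTS, 2000)
print(selected)  # [2, 5, 7]
```

Patient 7, born on 2000-01-01, is included because the comparison is "greater than or equal to"; patient 6, born in December 1999, is not.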
Line by line explanation
Import statements
Lines of the format from… import… specify which of Data Builder's code and features to use in our dataset definition.
Here, we import two components of Data Builder:
- Dataset, as provided by the query language, to create a dataset
- the patients table, which is one of several data tables that ehrQL gives access to
Create a Dataset
A valid dataset definition must contain a dataset assigned to the name dataset. As in many other programming languages, we use = to assign a value to a variable name. In this case, we have assigned Dataset() to the variable dataset. This creates an empty dataset.
In subsequent steps, we specify the data from the available data tables that we wish to add to the dataset.
Find year of birth
Next we define a year of birth. date_of_birth is a column in the patients table, so we can assign it to a new variable. We only want to capture the year of birth, so we add .year to the end of the expression in this variable assignment.
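As a loose analogy, taking the year of a date of birth resembles reading the year attribute of a Python datetime.date. The sketch below is plain Python rather than ehrQL, and it handles a single value, whereas the dataset definition describes the year of birth for every patient at once.

```python
from datetime import date

# Date of birth of patient 2 in the sample data.
date_of_birth = date(2005, 10, 1)

# Keep only the year component, discarding month and day.
year_of_birth = date_of_birth.year
print(year_of_birth)  # 2005
```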
Define population
Finally we define the population of the dataset. We use the special method called define_population() and pass in the definition of the population. In this case, we use our previously created year_of_birth to say: if the year of birth is greater than or equal to 2000, include the patient in this dataset.
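One way to picture the population condition is as a true-or-false answer per patient. The following plain-Python sketch (an analogy, not how Data Builder is implemented) evaluates the comparison for the two patients closest to the cut-off in the sample data; the variable names are hypothetical.

```python
# Years of birth for patients 6 and 7, taken from the sample data.
years_of_birth = {6: 1999, 7: 2000}

# The comparison gives one True/False value per patient;
# only patients with True would be included in the population.
in_population = {pid: year >= 2000 for pid, year in years_of_birth.items()}
print(in_population)  # {6: False, 7: True}
```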
Your turn
Run the dataset definition by:
opensafely exec databuilder:v0 generate-dataset "1a_minimal_dataset_definition.py" --dummy-tables "example-data/minimal/" --output "outputs.csv"
or, if you are using project.yaml:
opensafely run extract_1a_minimal_population
Question
Can you modify the dataset definition so that the output shows:
- Patients that were born before 1980?
- Patients that were born between 1980 and 2000?