ehrQL explanation: Allowed and disallowed operations on series and frames

Danger

This page discusses the new OpenSAFELY Data Builder for accessing OpenSAFELY data sources.

Use OpenSAFELY cohort-extractor, unless you are specifically involved in the development or testing of Data Builder.

OpenSAFELY Data Builder and its documentation are still undergoing extensive development. We will announce when Data Builder is ready for general use on the Platform News page.

One row per patient and many rows per patient data

ehrQL works with two kinds of data.

We discuss this with an example in the ehrQL tutorial.

Briefly, ehrQL treats the data it works with as either:

  • at most, one row per patient
  • possibly many rows per patient

This applies both to the backend tables that are accessed and to the derived tables generated by ehrQL operations.

Todo

Could we term this "patient row multiplicity?"

This one row/many rows distinction is particularly important in two contexts:

  1. Creating a Dataset(). Every series included in the dataset must be at most one row per patient, because a dataset holds a single row per patient.
  2. Composing ehrQL series and frames that contain data. The operations permitted on series and frames depend on their rows per patient property.
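The first point can be illustrated with a minimal sketch in plain Python. This is a toy model, not the ehrQL API: `ToySeries`, `ToyDataset`, `add_column`, and the choice of `TypeError` are all hypothetical names and assumptions made for illustration, and the real Data Builder may report the problem differently.

```python
class ToySeries:
    """Hypothetical stand-in for a series; not the real ehrQL class."""

    def __init__(self, name, many_rows):
        self.name = name
        # True  -> possibly many rows per patient
        # False -> at most one row per patient
        self.many_rows = many_rows


class ToyDataset:
    """Hypothetical stand-in for a dataset with one row per patient."""

    def __init__(self):
        self.columns = {}

    def add_column(self, name, series):
        # A dataset has a single row per patient, so a series that may
        # have many rows per patient cannot become a column.
        if series.many_rows:
            raise TypeError(
                f"{series.name!r} may have many rows per patient; "
                "reduce it to one row per patient first"
            )
        self.columns[name] = series


dataset = ToyDataset()
dataset.add_column("age", ToySeries("age", many_rows=False))  # accepted

try:
    dataset.add_column("codes", ToySeries("codes", many_rows=True))
except TypeError as error:
    print(f"Rejected: {error}")
```

In this sketch the check happens when the column is added, so the mistake surfaces before any query could be sent anywhere.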

Data Builder checks the entire ehrQL query before submitting it to an OpenSAFELY backend. If the query contains a coding mistake, it is not submitted to the backend, and you will see an error instead.

Todo

Can we specify the error types?

Allowed and disallowed operations when composing series and frames

Warning

These rules may change in future to allow more operations.

Here, we will refer to "at most, one row per patient data" as "one row per patient data" and "possibly many rows per patient data" as "many rows per patient data".

This short section summarises which operations are allowed when composing series and frames.

  • One row per patient series and frames can always be combined with any other series or frame.
  • Many rows per patient data originating from different tables cannot be combined.
  • Many rows per patient data from a single underlying table can be combined, if and only if they have had the same sequence of filtering operations applied to them.
  • Many rows per patient frames can be combined with many rows per patient series, if and only if:
    • the frame and series are derived from the same underlying table
    • and the frame and series have related filtering operations applied:
      • either the frame and series have had the same sequence of filtering operations
      • or the frame has had the same filtering operations applied as the series, with the frame then having subsequent filtering operations applied.
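To make the summary above more concrete, here is a minimal sketch that encodes the listed rules as a plain Python function. It is a toy model and does not use the ehrQL API: the `Node` class, its fields, the `can_combine` function, and the table and filter names are all hypothetical. It follows the summary exactly as written, so any inaccuracy in the summary carries over to the sketch.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a series or frame; not a real ehrQL class."""

    kind: str                      # "series" or "frame"
    many_rows: bool                # False means at most one row per patient
    table: Optional[str] = None    # name of the underlying table
    filters: Tuple[str, ...] = ()  # filtering operations applied, in order


def can_combine(a: Node, b: Node) -> bool:
    # One row per patient data can be combined with anything.
    if not a.many_rows or not b.many_rows:
        return True
    # Many rows per patient data from different tables cannot be combined.
    if a.table != b.table:
        return False
    # Same table and the same sequence of filters: allowed.
    if a.filters == b.filters:
        return True
    # Frame with series: the frame may have further filters applied
    # after the filters it shares with the series.
    if {a.kind, b.kind} == {"frame", "series"}:
        frame, series = (a, b) if a.kind == "frame" else (b, a)
        return frame.filters[: len(series.filters)] == series.filters
    return False


# Hypothetical tables and filters, for illustration only.
age = Node("series", many_rows=False, table="patients")
all_codes = Node("series", many_rows=True, table="events")
recent_codes = Node("series", many_rows=True, table="events", filters=("recent",))
all_events = Node("frame", many_rows=True, table="events")
recent_events = Node("frame", many_rows=True, table="events", filters=("recent",))
doses = Node("series", many_rows=True, table="medications")

print(can_combine(age, doses))                   # one row per patient side
print(can_combine(recent_codes, doses))          # different tables
print(can_combine(recent_events, recent_codes))  # same table, same filters
print(can_combine(recent_events, all_codes))     # frame filtered further
print(can_combine(all_events, recent_codes))     # series filtered further
```

Representing the filter history as an ordered tuple makes the "same sequence" rule an equality check and the "frame then filtered further" rule a prefix check.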

Todo

Check that this summary is correct. We should also have some examples to make this concrete. Adding failing examples, and checking that the code gives an error, is a little more problematic. Ideally, we would have a Python framework for checking them.