OpenSAFELY Data Builder🔗

Danger

This page discusses the new OpenSAFELY Data Builder for accessing OpenSAFELY data sources.

Use OpenSAFELY cohort-extractor, unless you are specifically involved in the development or testing of Data Builder.

OpenSAFELY Data Builder and its documentation are still undergoing extensive development. We will announce when Data Builder is ready for general use on the Platform News page.

Danger

This content has not yet been reviewed by OpenSAFELY technical leads. This page is not a definitive statement about the status of Data Builder, cohort-extractor or any other part of OpenSAFELY.

It should be taken as potentially incorrect, until this notice is removed.

Data Builder constructs datasets for researchers🔗

Data Builder is a tool for constructing the datasets you use in research studies and analyses with OpenSAFELY.

With Data Builder:

  • Researchers can specify data they want to use in their research via a dataset definition.
  • Data providers can specify data they want to offer for research via the OpenSAFELY Contracts specification and implementation.

Features🔗

Readable dataset definitions🔗

A new query language, ehrQL, has been developed for Data Builder. Researchers can now use a dataset definition to specify the data to be extracted from OpenSAFELY.

ehrQL is designed to be easy to read, so that it is clear from a dataset definition how the dataset it defines is constructed.

Multiple backends🔗

Data Builder supports querying multiple different data backends, without researchers needing to concern themselves with the specific details of how each backend works. This means that a researcher only needs to write a dataset definition once, and can then use it to query different backends.
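The idea of writing one definition and running it against several backends can be illustrated with a minimal sketch. This is not Data Builder's API: `DatasetDefinition`, `Backend`, the column names and the sample records are all hypothetical, chosen only to show how a backend-independent definition might be translated to each provider's own schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


# Hypothetical stand-in for a dataset definition: it names logical columns
# only, and knows nothing about how any backend stores them.
@dataclass(frozen=True)
class DatasetDefinition:
    columns: Tuple[str, ...]
    min_age: int


# Hypothetical backend: maps logical column names to its own schema.
class Backend:
    def __init__(self, name: str, column_map: Dict[str, str], rows: List[dict]):
        self.name = name
        self.column_map = column_map  # logical name -> backend-specific name
        self.rows = rows              # records in the backend's own schema

    def extract(self, definition: DatasetDefinition) -> List[dict]:
        """Return rows matching the definition, keyed by logical column names."""
        age_key = self.column_map["age"]
        return [
            {col: row[self.column_map[col]] for col in definition.columns}
            for row in self.rows
            if row[age_key] >= definition.min_age
        ]


# The definition is written once...
definition = DatasetDefinition(columns=("patient_id", "age"), min_age=18)

# ...and run against two backends with different (made-up) schemas.
backend_a = Backend(
    "backend_a",
    {"patient_id": "Patient_ID", "age": "Age"},
    [{"Patient_ID": 1, "Age": 42}, {"Patient_ID": 2, "Age": 12}],
)
backend_b = Backend(
    "backend_b",
    {"patient_id": "pid", "age": "age_years"},
    [{"pid": 7, "age_years": 65}],
)

results = {b.name: b.extract(definition) for b in (backend_a, backend_b)}
```

In this sketch, the schema differences live entirely in each backend's `column_map`, so the definition itself never changes between backends.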

Researcher-provided dummy data🔗

Data Builder allows researchers to provide their own dummy data, against which they can develop their analytical code.
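As a rough illustration of this workflow, the sketch below develops a small analysis function against hand-written dummy data. The CSV layout, the column names (`patient_id`, `age`, `has_asthma`) and the helper functions are assumptions made for this example, not a format that Data Builder prescribes.

```python
import csv
import io
from statistics import mean

# Hypothetical dummy data written by hand; the columns are illustrative only.
DUMMY_CSV = """patient_id,age,has_asthma
1,34,1
2,57,0
3,71,1
"""


def load_rows(source):
    """Parse dummy CSV rows into typed dictionaries."""
    return [
        {
            "patient_id": int(record["patient_id"]),
            "age": int(record["age"]),
            "has_asthma": record["has_asthma"] == "1",
        }
        for record in csv.DictReader(source)
    ]


def mean_age_with_asthma(rows):
    """Analytical code under development: mean age of patients with asthma."""
    return mean(row["age"] for row in rows if row["has_asthma"])


rows = load_rows(io.StringIO(DUMMY_CSV))
summary = mean_age_with_asthma(rows)
```

Because the analysis only depends on the column names and types, the same function could later be run on a dataset with the same columns, without the researcher having seen any real patient data while writing it.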

Note

Work is in progress to add functionality for generating dummy data from the dataset definition.

Why Data Builder was created🔗

Researchers familiar with OpenSAFELY will naturally ask why we are writing software to replace cohort-extractor. Data Builder is intended to eventually replace cohort-extractor in new studies. More information about the differences between cohort-extractor and Data Builder is available if you are interested.

In OpenSAFELY's first two years, researchers used cohort-extractor and study definitions to successfully complete a number of research studies using multiple data sources and linked data.

Data Builder is a complete redesign and reimplementation of cohort-extractor aimed at making OpenSAFELY even easier to work with for researchers and data providers. Data Builder's design incorporates feedback from researchers' use of cohort-extractor.

Data Builder:

  • Provides more expressive ways for researchers to specify cohorts.
  • Integrates external data providers more easily by introducing OpenSAFELY Contracts to provide specifications for data.
  • Simplifies the implementation of new features across multiple different data backends.

For more information on how Data Builder and cohort-extractor compare, see the development plan for Data Builder.

Reading the Data Builder documentation🔗

Other documentation pages explain in more detail the concepts needed to write a dataset definition.

Data Builder is still in development🔗

Warning

There is considerable ongoing work on Data Builder's design and development. Data Builder is subject to frequent change, as indicated by its current v0 version.

We recommend that users continue to use the existing OpenSAFELY cohort-extractor for their research.
