
Resolving ehrQL errors

ehrQL error messages🔗

If an error is found in your dataset definition, ehrQL will stop running and give you an error message. ehrQL error messages are shown as a Python error report, known as a "traceback".

🗒 The error messages are from Python because ehrQL runs in Python.

These error messages can be confusing to read, but they also give you lots of information to use to debug and fix your dataset definition.

Example error message🔗

Let's look at an example of an error report:

Traceback (most recent call last):
  File "/workspace/analysis/dataset_definition.py", line 7, in <module>
    dataset._age = age
    ^^^^^^^^^^^^^
AttributeError: Variable names must start with a letter, and contain only alphanumeric characters and underscores (you defined a variable '_age')
  • The traceback tells you what code actually caused the error. The traceback shows both the filename and where the error occurred in the file.
  • There is an error message at the end. The error message shows what kind of error occurred — here, this is an AttributeError — followed by details of what the problem is.

How to use this page🔗

Structure of this page🔗

For each error, there is:

  1. a simple code example that causes the error
  2. the error details
  3. the simple code example modified to fix the error

Finding an error on this page🔗

If you are working with ehrQL, and encounter an error, this page may help you.

Because of the included code examples and errors, this is a long page.

Here are some tips on narrowing down the search:

Using the table of contents🔗

Skim the table of contents in the navigation bar on the right-hand side of this page to see if any of the general descriptions of errors apply to what you are trying to do.

Using your browser's "Find text in page" feature🔗

Use the "Find text in page" feature of your browser to search for parts of the error report. Let's look at the example given above again:

Traceback (most recent call last):
  File "/workspace/analysis/dataset_definition.py", line 7, in <module>
    dataset._age = age
    ^^^^^^^^^^^^^
AttributeError: Variable names must start with a letter, and contain only alphanumeric characters and underscores (you defined a variable '_age')

The first part of this traceback depends on the specific code that has been written here. It shows:

  • the name of the file — dataset_definition.py stored in the analysis directory
  • the line number in the file causing the error — line 7
  • the line of code causing the error

All of these will vary depending on the code being run. These are useful to point you to where your error is.

However, they are less useful to search for on this page, because this part of the error report varies with your code. The final error message is more constant. Searching this page for parts of that line, for example AttributeError or Variable names must start with a letter, may show you the relevant error.
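As a rough illustration of which part of a traceback is worth searching for, the sketch below splits the final line of the example report into the error type and the message. The helper name `searchable_part` is hypothetical and not part of ehrQL. It assumes the error message fits on a single final line of the form `ErrorType: message`, which will not hold for every error.

```python
TRACEBACK = """\
Traceback (most recent call last):
  File "/workspace/analysis/dataset_definition.py", line 7, in <module>
    dataset._age = age
    ^^^^^^^^^^^^^
AttributeError: Variable names must start with a letter, and contain only alphanumeric characters and underscores (you defined a variable '_age')
"""


def searchable_part(traceback_text):
    """Return (error_type, message) taken from the final line of a traceback."""
    # The final non-empty line holds the error type and the message.
    last_line = [line for line in traceback_text.splitlines() if line.strip()][-1]
    error_type, _, message = last_line.partition(": ")
    # Drop any trailing parenthesised detail, which often quotes your own code.
    message = message.split(" (")[0]
    return error_type, message


error_type, message = searchable_part(TRACEBACK)
print(error_type)
print(message)
```

Either of the two printed strings is a reasonable thing to paste into your browser's search box.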

⚠ This page covers many of the common ehrQL errors you may see, but is not an exhaustive list.

⚠ Notice that even the error message may contain references to the precise code. In this example: you defined a variable '_age'.

❓ Can you find the part of this page that does explain this error?

Examples currently use the TPP backend🔗

Python syntax errors🔗

These can occur because Python has its own syntactic rules that ehrQL code must also adhere to.

Code indentation error🔗

Python has particular rules about indentation. If a dataset definition contains indentation errors, the error message will tell you about them. For example, there is an indentation error in the following dataset definition.

Failing dataset definition ❌🔗

from ehrql import Dataset
from ehrql.tables.beta.tpp import patients

dataset = Dataset()
dataset.age = patients.age_on("2023-01-01")
 dataset.define_population(dataset.age > 16) # This line has incorrect indentation.

Run the dataset definition with:

opensafely exec ehrql:v0 generate-dataset analysis/dataset_definition.py

Error🔗

Failed to import 'analysis/dataset_definition.py':

  File "/workspace/analysis/dataset_definition.py", line 6
    dataset.define_population(dataset.age > 16)
IndentationError: unexpected indent

The error message tells us that there is an indentation error, and also the line that the error occurred on.
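Indentation and other syntax errors are detected before any ehrQL code runs, so they can be caught with Python alone. The sketch below is one possible check, not an ehrQL feature: it passes the failing definition above to the built-in `compile()`, which parses the source without importing or executing it. The variable names `SOURCE` and `result` are only for illustration.

```python
SOURCE = '''\
from ehrql import Dataset
from ehrql.tables.beta.tpp import patients

dataset = Dataset()
dataset.age = patients.age_on("2023-01-01")
 dataset.define_population(dataset.age > 16)
'''

try:
    # compile() only parses the source; nothing is imported or run.
    compile(SOURCE, "dataset_definition.py", "exec")
    result = "no syntax errors"
except SyntaxError as error:  # IndentationError is a subclass of SyntaxError
    result = f"{type(error).__name__} on line {error.lineno}: {error.msg}"

print(result)
```

A check like this only finds Python syntax problems. The ehrQL-specific errors described later on this page appear only when the dataset definition is actually run.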

Fixed dataset definition ✔🔗

from ehrql import Dataset
from ehrql.tables.beta.tpp import patients

dataset = Dataset()
dataset.age = patients.age_on("2023-01-01")
dataset.define_population(dataset.age > 16) # This line now has correct indentation.

Forbidden feature names🔗

Python has constraints on allowed variable names, which also apply to the names of dataset features. For example, a name — age! — with a non-alphanumeric character is invalid:

Failing dataset definition ❌🔗

from ehrql import Dataset
from ehrql.tables.beta.tpp import patients

dataset = Dataset()
dataset.age! = patients.age_on("2023-01-01") # age! is an invalid feature name.

Run the dataset definition with:

opensafely exec ehrql:v0 generate-dataset analysis/dataset_definition.py

Error🔗

Failed to import 'analysis/dataset_definition.py':

  File "/workspace/analysis/dataset_definition.py", line 5
    dataset.age! = patients.age_on("2023-01-01") # age! is an invalid feature name.
               ^
SyntaxError: invalid syntax

Fixed dataset definition ✔🔗

from ehrql import Dataset
from ehrql.tables.beta.tpp import patients

dataset = Dataset()
dataset.age = patients.age_on("2023-01-01") # We have changed the invalid feature name, "age!", to a valid one, "age".

Common ehrQL errors🔗

These errors are specific to ehrQL, rather than Python.

Forgetting to set a population🔗

Failing dataset definition ❌🔗

from ehrql import Dataset
from ehrql.tables.beta.tpp import patients

dataset = Dataset()
dataset.age = patients.age_on("2023-01-01")

Run the dataset definition with:

opensafely exec ehrql:v0 generate-dataset analysis/dataset_definition.py

Error🔗

A population has not been defined; define one with define_population()

Fixed dataset definition ✔🔗

from ehrql import Dataset
from ehrql.tables.beta.tpp import patients

dataset = Dataset()
dataset.age = patients.age_on("2023-01-01")
dataset.define_population(dataset.age > 16) # Here we have now defined a population for the dataset.

Invalid feature name: population is a reserved name🔗

ehrQL places a few constraints of its own on feature names. The name population is reserved, because the population must be set with define_population().

Failing dataset definition ❌🔗

from ehrql import Dataset
from ehrql.tables.beta.tpp import patients

dataset = Dataset()
dataset.population = patients.age_on("2023-01-01") > 16

Run the dataset definition with:

opensafely exec ehrql:v0 generate-dataset analysis/dataset_definition.py

Error🔗

Traceback (most recent call last):
  File "/workspace/analysis/dataset_definition.py", line 5, in <module>
    dataset.population = patients.age_on("2023-01-01") > 16
    ^^^^^^^^^^^^^^^^^^
AttributeError: Cannot set variable 'population'; use define_population() instead

Fixed dataset definition ✔🔗

Define the population with define_population():

from ehrql import Dataset
from ehrql.tables.beta.tpp import patients

dataset = Dataset()
dataset.define_population(patients.age_on("2023-01-01") > 16)

Or rename the feature, if it is required as a separate output:

from ehrql import Dataset
from ehrql.tables.beta.tpp import patients

dataset = Dataset()
dataset.over_16 = patients.age_on("2023-01-01") > 16

Invalid feature name: variables is a reserved name🔗

Failing dataset definition ❌🔗

from ehrql import Dataset
from ehrql.tables.beta.tpp import patients

dataset = Dataset()
dataset.variables = patients.age_on("2023-01-01") > 16
...

Run the dataset definition with:

opensafely exec ehrql:v0 generate-dataset analysis/dataset_definition.py

Error🔗

Traceback (most recent call last):
  File "/workspace/analysis/dataset_definition.py", line 5, in <module>
    dataset.variables = patients.age_on("2023-01-01") > 16
    ^^^^^^^^^^^^^^^^^
AttributeError: 'variables' is not an allowed variable name

Fixed dataset definition ✔🔗

Rename the feature to something other than variables.

from ehrql import Dataset
from ehrql.tables.beta.tpp import patients

dataset = Dataset()
dataset.age_greater_than_16 = patients.age_on("2023-01-01") > 16
...

Invalid feature name: feature names must not start with underscores🔗

Failing dataset definition ❌🔗

from ehrql import Dataset
from ehrql.tables.beta.tpp import patients

dataset = Dataset()
age = patients.age_on("2023-01-01")
dataset.define_population(age > 16)
dataset._age = age

Run the dataset definition with:

opensafely exec ehrql:v0 generate-dataset analysis/dataset_definition.py

Error🔗

Traceback (most recent call last):
  File "/workspace/analysis/dataset_definition.py", line 7, in <module>
    dataset._age = age
    ^^^^^^^^^^^^^
AttributeError: Variable names must start with a letter, and contain only alphanumeric characters and underscores (you defined a variable '_age')

Fixed dataset definition ✔🔗

from ehrql import Dataset
from ehrql.tables.beta.tpp import patients

dataset = Dataset()
age = patients.age_on("2023-01-01")
dataset.define_population(age > 16)
dataset.age = age # _age feature renamed to remove the leading underscore.
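The rule quoted in the error message can be approximated with a regular expression, which may be handy for checking a list of planned feature names in advance. This is a sketch based only on the wording of the message. The pattern and the `is_valid_feature_name` helper are assumptions, not ehrQL's actual implementation, and the check does not cover reserved names such as population or variables.

```python
import re

# Starts with a letter, then only letters, digits and underscores.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_valid_feature_name(name):
    """Return True if the name satisfies the rule described in the error message."""
    return NAME_PATTERN.fullmatch(name) is not None


for name in ["age", "age_2023", "_age", "2023_age", "age!"]:
    print(name, is_valid_feature_name(name))
```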

Re-defining a feature🔗

In the following dataset definition, dataset.age is first defined as age and then defined again as age1.

Failing dataset definition ❌🔗

from ehrql import Dataset
from ehrql.tables.beta.tpp import patients

dataset = Dataset()
age = patients.age_on("2000-01-01")
age1 = patients.age_on("2023-01-01")
dataset.define_population(age > 16)
dataset.age = age
dataset.age = age1

Run the dataset definition with:

opensafely exec ehrql:v0 generate-dataset analysis/dataset_definition.py

Error🔗

Traceback (most recent call last):
  File "/workspace/analysis/dataset_definition.py", line 9, in <module>
    dataset.age = age1
    ^^^^^^^^^^^
AttributeError: 'age' is already set and cannot be reassigned

Fixed dataset definition ✔🔗

from ehrql import Dataset
from ehrql.tables.beta.tpp import patients

dataset = Dataset()
age = patients.age_on("2000-01-01")
age1 = patients.age_on("2023-01-01")
dataset.define_population(age > 16)
dataset.age = age
dataset.age1 = age1 # The second age feature now has a unique name on the dataset

Undefined features🔗

All features used on a dataset must first be set on it. In the following dataset definition, age has been defined as a standalone variable, but it has never been assigned to the dataset, so dataset.age does not exist:

Failing dataset definition ❌🔗

from ehrql import Dataset
from ehrql.tables.beta.tpp import patients

dataset = Dataset()
age = patients.age_on("2000-01-01")
dataset.define_population(age > 16)
dataset.age

Run the dataset definition with:

opensafely exec ehrql:v0 generate-dataset analysis/dataset_definition.py

Error🔗

Traceback (most recent call last):
  File "/workspace/analysis/dataset_definition.py", line 7, in <module>
    dataset.age
AttributeError: Variable 'age' has not been defined

Fixed dataset definition ✔🔗

from ehrql import Dataset
from ehrql.tables.beta.tpp import patients

dataset = Dataset()
age = patients.age_on("2000-01-01")
dataset.define_population(age > 16)
dataset.age = age # dataset.age is now defined

Trying to set a feature that has more than one row per patient🔗

Failing dataset definition ❌🔗

from ehrql import Dataset
from ehrql.tables.beta.tpp import practice_registrations

dataset = Dataset()
dataset.registered_on = practice_registrations.start_date

The practice_registrations table contains multiple rows per patient.

Error🔗

Traceback (most recent call last):
  File "/workspace/analysis/dataset_definition.py", line 5, in <module>
    dataset.registered_on = practice_registrations.start_date
    ^^^^^^^^^^^^^^^^^^^^^
TypeError: Invalid variable 'registered_on'. Dataset variables must return one row per patient

Fixed dataset definition ✔🔗

To return the latest registered_on date, first sort the practice registrations table, find the last registration for each patient, and then get the start date.

from ehrql import Dataset
from ehrql.tables.beta.tpp import practice_registrations

dataset = Dataset()
latest_registration_per_patient = practice_registrations.sort_by(practice_registrations.start_date).last_for_patient()
dataset.registered_on = latest_registration_per_patient.start_date

Trying to set a feature to a row rather than a value🔗

In the following dataset definition, we have reduced the practice registrations to one row per patient, but we have not selected a value as the feature:

Failing dataset definition ❌🔗

from ehrql import Dataset
from ehrql.tables.beta.tpp import practice_registrations

dataset = Dataset()
dataset.registered_on = practice_registrations.sort_by(practice_registrations.start_date).last_for_patient()

Error🔗

Traceback (most recent call last):
  File "/workspace/analysis/dataset_definition.py", line 5, in <module>
    dataset.registered_on = practice_registrations.sort_by(practice_registrations.start_date).last_for_patient()
    ^^^^^^^^^^^^^^^^^^^^^
TypeError: Invalid variable 'registered_on'. Dataset variables must be values not whole rows

Fix the dataset definition by setting the feature to a single value, in this case, start_date.

Fixed dataset definition ✔🔗

from ehrql import Dataset
from ehrql.tables.beta.tpp import practice_registrations

dataset = Dataset()
latest_registration_per_patient = practice_registrations.sort_by(practice_registrations.start_date).last_for_patient()
dataset.registered_on = latest_registration_per_patient.start_date

Type errors in ehrQL expressions🔗

Many ehrQL comparisons require the elements being compared to be of the same type.

In the following dataset definition, age is an integer, but in the last line we try to define the population by comparing age to the string "10":

Failing dataset definition ❌🔗

from ehrql import Dataset
from ehrql.tables.beta.tpp import patients

dataset = Dataset()
age = patients.age_on("2023-01-01")
dataset.define_population(age >= "10")

Error🔗

Traceback (most recent call last):
  File "/workspace/analysis/dataset_definition.py", line 6, in <module>
    dataset.define_population(age >= "10")
                              ^^^^^^^^^^^
ehrql.query_model.nodes.TypeValidationError: GE.rhs requires 'ehrql.query_model.nodes.Series[int]' but got 'ehrql.query_model.nodes.Series[str]'

Fixed dataset definition ✔🔗

from ehrql import Dataset
from ehrql.tables.beta.tpp import patients

dataset = Dataset()
age = patients.age_on("2023-01-01")
dataset.define_population(age >= 10)  # age is now being compared to the integer 10

Invalid keywords "and", "or", "not"🔗

In normal Python, logical operations can be performed using the keywords and, or and not. In ehrQL these are prohibited and will raise an error.
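The reason lies in how Python evaluates these keywords: `and`, `or` and `not` first convert their operands to a plain True or False, whereas `&`, `|` and `~` are ordinary operators that a class can redefine to build a combined expression. The toy `Expr` class below is a hypothetical stand-in for an ehrQL series (it is not ehrQL code) that shows the difference.

```python
class Expr:
    """Toy expression that records operations instead of evaluating them."""

    def __init__(self, text):
        self.text = text

    def __and__(self, other):
        # Called for `a & b`; builds a new, combined expression.
        return Expr(f"({self.text} AND {other.text})")

    def __bool__(self):
        # Called for `a and b`, `a or b` and `not a`. The expression has no
        # single True/False value, so the conversion is refused.
        raise TypeError("cannot convert an expression to True or False")


over_16 = Expr("age >= 16")
under_80 = Expr("age <= 80")

combined = over_16 & under_80
print(combined.text)

try:
    over_16 and under_80
except TypeError as error:
    print("TypeError:", error)
```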

Failing dataset definition ❌🔗

from ehrql import Dataset
from ehrql.tables.beta.tpp import patients

dataset = Dataset()
age = patients.age_on("2023-01-01")
dataset.define_population((age >= 16) and (age <= 80))

Error🔗

Traceback (most recent call last):
  File "/workspace/analysis/dataset_definition.py", line 6, in <module>
    dataset.define_population((age >= 16) and (age <= 80))
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: The keywords 'and', 'or', and 'not' cannot be used with ehrQL, please use the operators '&', '|' and '~' instead.
(You will also see this error if you try use a chained comparison, such as 'a < b < c'.)

Fixed dataset definition ✔🔗

As described in the error message, use the operator & instead:

from ehrql import Dataset
from ehrql.tables.beta.tpp import patients

dataset = Dataset()
age = patients.age_on("2023-01-01")
dataset.define_population((age >= 16) & (age <= 80))

Chaining comparisons🔗

Chained comparisons are not allowed in ehrQL.
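Python treats a chained comparison such as `16 < age <= 80` as shorthand for `(16 < age) and (age <= 80)`, which is why it produces the same error as the `and` keyword. The sketch below uses the standard library `ast` module to show that the chain is parsed as a single comparison with two operators. It only inspects the syntax and does not use ehrQL.

```python
import ast

chained = ast.parse("16 < age <= 80", mode="eval").body
explicit = ast.parse("(16 < age) & (age <= 80)", mode="eval").body

# One Compare node holding two operators: Python links them with an implicit `and`.
print(type(chained).__name__, [type(op).__name__ for op in chained.ops])

# A binary `&` operation joining two separate comparisons.
print(type(explicit).__name__, type(explicit.op).__name__)
```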

Failing dataset definition ❌🔗

from ehrql import Dataset
from ehrql.tables.beta.tpp import patients

dataset = Dataset()
age = patients.age_on("2023-01-01")
dataset.define_population(16 < age <= 80)

Error🔗

Traceback (most recent call last):
  File "/workspace/analysis/dataset_definition.py", line 6, in <module>
    dataset.define_population(16 < age <= 80)
                              ^^^^^^^^^^^^^^
TypeError: The keywords 'and', 'or', and 'not' cannot be used with ehrQL, please use the operators '&', '|' and '~' instead.
(You will also see this error if you try use a chained comparison, such as 'a < b < c'.)

Fixed dataset definition ✔🔗

from ehrql import Dataset
from ehrql.tables.beta.tpp import patients

dataset = Dataset()
age = patients.age_on("2023-01-01")
dataset.define_population((age > 16) & (age <= 80))  # Each comparison is written separately and joined with &.

Trying to perform arithmetic operations with an integer column and a float constant🔗

In the following dataset, age is an integer. We cannot subtract a float from it.

Failing dataset definition ❌🔗

from ehrql import Dataset
from ehrql.tables.beta.tpp import patients

dataset = Dataset()
age = patients.age_on("2023-01-01")
dataset.age_minus_5 = age - 5.5

Error🔗

Traceback (most recent call last):
  File "/workspace/analysis/dataset_definition.py", line 6, in <module>
    dataset.age_minus_5 = age - 5.5
                          ~~~~^~~~~
ehrql.query_model.nodes.TypeValidationError: Subtract.rhs requires 'ehrql.query_model.nodes.Series[int]' but got 'ehrql.query_model.nodes.Series[float]'

Fixed dataset definition ✔🔗

from ehrql import Dataset
from ehrql.tables.beta.tpp import patients

dataset = Dataset()
age = patients.age_on("2023-01-01")
dataset.age_minus_5 = age - 5

Calculate a date difference without specifying return units🔗

Failing dataset definition ❌🔗

from ehrql import Dataset
from ehrql.tables.beta.tpp import patients

dataset = Dataset()
dataset.age_in_may = "2023-05-01" - patients.date_of_birth

Error🔗

Traceback (most recent call last):
  File "/workspace/analysis/dataset_definition.py", line 5, in <module>
    dataset.age_in_may = "2023-05-01" - patients.date_of_birth
    ^^^^^^^^^^^^^^^^^^
TypeError: Invalid variable 'age_in_may'. Dataset variables must be values not whole rows

To fix this error, specify the units of the date difference that you want in the feature:

Fixed dataset definition ✔🔗

from ehrql import Dataset
from ehrql.tables.beta.tpp import patients

dataset = Dataset()
dataset.age_in_may = ("2023-05-01" - patients.date_of_birth).years

Trying to subtract/add constants to dates🔗

Failing dataset definition ❌🔗

from ehrql import Dataset
from ehrql.tables.beta.tpp import patients

dataset = Dataset()
dataset.date_at_age_16 = patients.date_of_birth + 16

Error🔗

Traceback (most recent call last):
  File "/workspace/analysis/dataset_definition.py", line 5, in <module>
    dataset.date_at_age_16 = patients.date_of_birth + 16
                             ~~~~~~~~~~~~~~~~~~~~~~~^~~~
TypeError: unsupported operand type(s) for +: 'DatePatientSeries' and 'int'

ehrQL cannot add an integer to a date - it needs to know what sort of time unit we are adding (days, months, years).
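Plain Python dates behave in a similar way, which may make the rule easier to remember. The sketch below uses only the standard library `datetime` module, not ehrQL: adding a bare integer to a date fails, while adding a duration with an explicit unit works.

```python
from datetime import date, timedelta

date_of_birth = date(2007, 1, 1)

try:
    date_of_birth + 16  # 16 what? Days, months or years?
    outcome = "no error"
except TypeError as error:
    outcome = f"TypeError: {error}"
print(outcome)

# With an explicit unit, the addition is unambiguous.
sixteen_days_later = date_of_birth + timedelta(days=16)
print(sixteen_days_later)
```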

Fixed dataset definition ✔🔗

from ehrql import Dataset, years
from ehrql.tables.beta.tpp import patients

dataset = Dataset()
dataset.date_at_age_16 = patients.date_of_birth + years(16)

Incorrectly referencing a table column🔗

Failing dataset definition ❌🔗

from ehrql import Dataset
from ehrql.tables.beta.tpp import clinical_events

dataset = Dataset()
first_event = clinical_events.sort_by(date).first_for_patient()
dataset.event_date = first_event.date

Error🔗

Traceback (most recent call last):
  File "/workspace/analysis/dataset_definition.py", line 5, in <module>
    first_event = clinical_events.sort_by(date).first_for_patient()
                                          ^^^^
NameError: name 'date' is not defined

Fixed dataset definition ✔🔗

Columns can be specified as the table attribute:

from ehrql import Dataset
from ehrql.tables.beta.tpp import clinical_events

dataset = Dataset()
first_event = clinical_events.sort_by(clinical_events.date).first_for_patient()
dataset.event_date = first_event.date

They can also be specified as a name string:

from ehrql import Dataset
from ehrql.tables.beta.tpp import clinical_events

dataset = Dataset()
first_event = clinical_events.sort_by("date").first_for_patient()
dataset.event_date = first_event.date

Specifying a default for case which is a different type to the values🔗

In the following dataset definition, two age groups are defined as integers (1 and 2). A default value (for patients who don't fall into one of the categories) is defined as "unknown". This is an error: any default value given for a case statement must be of the same type as the other values (or None).

Failing dataset definition ❌🔗

from ehrql import Dataset, case, when
from ehrql.tables.beta.tpp import patients

dataset = Dataset()

age = patients.age_on("2023-01-01")
dataset.age_group = case(
   when(age < 10).then(1),
   when(age > 80).then(2),
   default="unknown",
)

Error🔗

Traceback (most recent call last):
  File "/workspace/analysis/dataset_definition.py", line 7, in <module>
    dataset.age_group = case(
                        ^^^^^
ehrql.query_model.nodes.TypeValidationError: Case.default requires 'ehrql.query_model.nodes.Series[int] | None' but got 'ehrql.query_model.nodes.Series[str]'

Fixed dataset definition ✔🔗

from ehrql import Dataset, case, when
from ehrql.tables.beta.tpp import patients

dataset = Dataset()

age = patients.age_on("2023-01-01")
dataset.age_group = case(
   when(age < 10).then(1),
   when(age > 80).then(2),
   default=0,
)

Using is_in without a container🔗

Failing dataset definition ❌🔗

from ehrql import Dataset
from ehrql.tables.beta.tpp import patients

dataset = Dataset()

age = patients.age_on("2023-01-01")
dataset.age_30 = age.is_in(30)

Error🔗

Traceback (most recent call last):
  File "/workspace/analysis/dataset_definition.py", line 7, in <module>
    dataset.age_30 = age.is_in(30)
                     ^^^^^^^^^^^^^
ehrql.query_model.nodes.TypeValidationError: In.rhs requires 'ehrql.query_model.nodes.Series[collections.abc.Set[int]]' but got 'ehrql.query_model.nodes.Series[int]'

🗒 This is also an error:

dataset.age_30_or_40 = age.is_in(30, 40)

Fixed dataset definition ✔🔗

Arguments passed to is_in must be wrapped in a Python container: a set, list or tuple. All of the following features defined with is_in are valid.

from ehrql import Dataset
from ehrql.tables.beta.tpp import patients

dataset = Dataset()

age = patients.age_on("2023-01-01")
dataset.age_30_list = age.is_in([30])
dataset.age_30_or_40_set = age.is_in({30, 40})
dataset.age_30_or_40_tuple = age.is_in((30, 40))