Resolving ehrQL errors
ehrQL error messages🔗
If an error is found in your dataset definition, ehrQL will stop running and give you an error message. ehrQL error messages are shown as a Python error report, known as a "traceback".
The error messages are from Python because ehrQL runs in Python.
These error messages can be confusing to read, but they also give you lots of information to use to debug and fix your dataset definition.
Example error message🔗
Let's look at an example of an error report:
Traceback (most recent call last):
File "/workspace/analysis/dataset_definition.py", line 7, in <module>
dataset._age = age
^^^^^^^^^^^^^
AttributeError: Variable names must start with a letter, and contain only alphanumeric characters and underscores (you defined a variable '_age')
- The traceback tells you what code actually caused the error. It shows both the filename and where in the file the error occurred.
- There is an error message at the end. The error message shows what kind of error occurred (here, an AttributeError), followed by details of what the problem is.
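Every Python traceback has the same overall shape, so it can help to produce a small one yourself. The following is a minimal sketch using only the Python standard library; the `Demo` class is a hypothetical stand-in for a dataset and is not ehrQL code.

```python
import traceback


class Demo:
    """Hypothetical stand-in for a dataset; not part of ehrQL."""

    def __setattr__(self, name, value):
        # Reject names with a leading underscore, to trigger an error on purpose.
        if name.startswith("_"):
            raise AttributeError(f"bad name {name!r}")
        super().__setattr__(name, value)


try:
    Demo()._age = 1
except AttributeError:
    # format_exc() returns the same text Python would print to the screen.
    report = traceback.format_exc()

lines = report.strip().splitlines()
first_line = lines[0]   # the "Traceback ..." header
last_line = lines[-1]   # the error type and the details of the problem

print(first_line)
print(last_line)
```

The lines in between the first and the last point to the file, line number and code that raised the error.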
How to use this page🔗
Structure of this page🔗
For each error, there is:
- a simple code example that causes the error
- the error details
- the simple code example modified to fix the error
Finding an error on this page🔗
If you are working with ehrQL, and encounter an error, this page may help you.
Because of the included code examples and errors, this is a long page.
Here are some tips on narrowing down the search.
Using the table of contents🔗
Skim the table of contents navigation bar on the right-hand side of this page to see if any of the general descriptions of errors apply to what you are trying to do.
Using your browser's "Find text in page" feature🔗
Use the "Find text in page" feature of your browser to search for parts of the error report. Let's look at the example given above again:
Traceback (most recent call last):
File "/workspace/analysis/dataset_definition.py", line 7, in <module>
dataset._age = age
^^^^^^^^^^^^^
AttributeError: Variable names must start with a letter, and contain only alphanumeric characters and underscores (you defined a variable '_age')
The first part of this traceback depends on the specific code that has been written here. It shows:
- the name of the file: dataset_definition.py, stored in the analysis directory
- the line number in the file causing the error: line 7
- the line of code causing the error
All of these will vary depending on the code being run. These are useful to point you to where your error is.
However, they are less useful to search for in the list provided here, because this part of the error report will vary.
What will stay more constant is the final error message.
Searching in this page for parts of that line, for example AttributeError or Variable names must start with a letter, may show you the relevant error.
This page covers many of the common ehrQL errors you may see, but is not an exhaustive list.
Notice that even the error message may contain references to the precise code.
In this example: you defined a variable '_age'.
Can you find the part of this page that does explain this error?
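If you want to pick out the searchable part of an error report programmatically, something like the following sketch may help. It uses only the standard library and works on the example report as plain text. The variable names are illustrative, and the assumption that the code-specific detail sits in trailing parentheses holds for this example but not necessarily for every error.

```python
import re

report = """Traceback (most recent call last):
  File "/workspace/analysis/dataset_definition.py", line 7, in <module>
    dataset._age = age
    ^^^^^^^^^^^^^
AttributeError: Variable names must start with a letter, and contain only alphanumeric characters and underscores (you defined a variable '_age')"""

# The final line holds the error type and the message.
last_line = report.strip().splitlines()[-1]
error_type, _, message = last_line.partition(": ")

# Drop the trailing parenthesised detail, which quotes your own code
# and so is unlikely to match anything on this page.
stable_part = re.sub(r"\s*\(.*\)$", "", message)

# The location varies per project, but tells you where to look in your file.
location = re.search(r'File "(?P<path>[^"]+)", line (?P<line>\d+)', report)

print(error_type)
print(stable_part)
print(location["path"], location["line"])
```

Searching this page for `error_type` or `stable_part` is more likely to find a match than searching for the file path or the line of code.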
Examples currently use the TPP backend🔗
Python syntax errors🔗
These can occur because Python has its own syntactic rules that ehrQL code must also adhere to.
Code indentation error🔗
Python has particular rules about indentation. If a dataset definition contains indentation errors, the error message will tell you about them. For example, there is an indentation error in the following dataset definition.
Failing dataset definition 🔗
from ehrql import Dataset
from ehrql.tables.beta.tpp import patients
dataset = Dataset()
dataset.age = patients.age_on("2023-01-01")
    dataset.define_population(dataset.age > 16) # This line has incorrect indentation.
Run the dataset definition with:
opensafely exec ehrql:v0 generate-dataset analysis/dataset_definition.py
Error🔗
Failed to import 'analysis/dataset_definition.py':
File "/workspace/analysis/dataset_definition.py", line 6
dataset.define_population(dataset.age > 16)
IndentationError: unexpected indent
The error message tells us that there is an indentation error, and also the line that the error occurred on.
Fixed dataset definition 🔗
from ehrql import Dataset
from ehrql.tables.beta.tpp import patients
dataset = Dataset()
dataset.age = patients.age_on("2023-01-01")
dataset.define_population(dataset.age > 16) # This line now has correct indentation.
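Indentation errors are raised by Python before any ehrQL code runs, so you can reproduce them without ehrQL at all. The sketch below uses Python's built-in `compile()` on two short source strings; the `check` helper and the sample strings are illustrative assumptions, not part of ehrQL or OpenSAFELY.

```python
# Two tiny programs: the second line of bad_source is unexpectedly indented.
bad_source = "x = 1\n    y = x + 1\n"
good_source = "x = 1\ny = x + 1\n"


def check(source, filename="dataset_definition.py"):
    """Return None if the source compiles, otherwise a short description."""
    try:
        compile(source, filename, "exec")
    except SyntaxError as err:  # IndentationError is a subclass of SyntaxError
        return f"{type(err).__name__}: {err.msg} (line {err.lineno})"
    return None


print(check(bad_source))
print(check(good_source))
```

As in the ehrQL error above, the description names the kind of error and the line on which it was found.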
Forbidden feature names🔗
Python has constraints on allowed variable names, which also apply to the names of dataset features.
For example, a name containing a non-alphanumeric character, such as age!, is invalid:
Failing dataset definition 🔗
from ehrql import Dataset
from ehrql.tables.beta.tpp import patients
dataset = Dataset()
dataset.age! = patients.age_on("2023-01-01") # age! is an invalid feature name.
Run the dataset definition with:
opensafely exec ehrql:v0 generate-dataset analysis/dataset_definition.py
Error🔗
Failed to import 'analysis/dataset_definition.py':
File "/workspace/analysis/dataset_definition.py", line 5
dataset.age! = patients.age_on("2023-01-01") # age! is an invalid feature name.
^
SyntaxError: invalid syntax
Fixed dataset definition 🔗
from ehrql import Dataset
from ehrql.tables.beta.tpp import patients
dataset = Dataset()
dataset.age = patients.age_on("2023-01-01") # We have changed the invalid feature name, "age!", to a valid one, "age".
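To check whether a name is acceptable to Python itself, the standard library offers `str.isidentifier()` and `keyword.iskeyword()`. The helper below is a small illustrative sketch (the function name is made up). It only covers Python's own rules, so a name can pass this check and still be rejected by ehrQL.

```python
import keyword


def is_python_name(name):
    """True if Python would accept `name` as an attribute or variable name."""
    return name.isidentifier() and not keyword.iskeyword(name)


for candidate in ["age", "age!", "class", "_age"]:
    print(candidate, is_python_name(candidate))
```

Note that `_age` is a valid Python name; it is ehrQL, not Python, that rejects a leading underscore.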
Common ehrQL errors🔗
These errors are specific to ehrQL, rather than Python.
Forgetting to set a population🔗
Failing dataset definition 🔗
from ehrql import Dataset
from ehrql.tables.beta.tpp import patients
dataset = Dataset()
dataset.age = patients.age_on("2023-01-01")
Run the dataset definition with:
opensafely exec ehrql:v0 generate-dataset analysis/dataset_definition.py
Error🔗
A population has not been defined; define one with define_population()
Fixed dataset definition 🔗
from ehrql import Dataset
from ehrql.tables.beta.tpp import patients
dataset = Dataset()
dataset.age = patients.age_on("2023-01-01")
dataset.define_population(dataset.age > 16) # Here we have now defined a population for the dataset.
Invalid feature name: population is a reserved name🔗
There are a few constraints on feature names in ehrQL.
Failing dataset definition 🔗
from ehrql import Dataset
from ehrql.tables.beta.tpp import patients
dataset = Dataset()
dataset.population = patients.age_on("2023-01-01") > 16
Run the dataset definition with:
opensafely exec ehrql:v0 generate-dataset analysis/dataset_definition.py
Error🔗
Traceback (most recent call last):
File "/workspace/analysis/dataset_definition.py", line 6, in <module>
dataset.population = patients.age_on("2023-01-01") > 16
^^^^^^^^^^^^^^^^^^
AttributeError: Cannot set variable 'population'; use define_population() instead
Fixed dataset definition 🔗
Define the population with the define_population syntax:
from ehrql import Dataset
from ehrql.tables.beta.tpp import patients
dataset = Dataset()
dataset.define_population(patients.age_on("2023-01-01") > 16)
Or rename the feature, if it is required as a separate output:
from ehrql import Dataset
from ehrql.tables.beta.tpp import patients
dataset = Dataset()
dataset.over_16 = patients.age_on("2023-01-01") > 16
Invalid feature name: variables is a reserved name🔗
Failing dataset definition 🔗
from ehrql import Dataset
from ehrql.tables.beta.tpp import patients
dataset = Dataset()
dataset.variables = patients.age_on("2023-01-01") > 16
...
Run the dataset definition with:
opensafely exec ehrql:v0 generate-dataset analysis/dataset_definition.py
Error🔗
Traceback (most recent call last):
File "/workspace/analysis/dataset_definition.py", line 5, in <module>
dataset.variables = patients.age_on("2023-01-01") > 16
^^^^^^^^^^^^^^^^^
AttributeError: 'variables' is not an allowed variable name
Fixed dataset definition 🔗
Rename the feature to something other than variables.
from ehrql import Dataset
from ehrql.tables.beta.tpp import patients
dataset = Dataset()
dataset.age_greater_than_16 = patients.age_on("2023-01-01") > 16
...
Invalid feature name: feature names must not start with underscores🔗
Failing dataset definition 🔗
from ehrql import Dataset
from ehrql.tables.beta.tpp import patients
dataset = Dataset()
age = patients.age_on("2023-01-01")
dataset.define_population(age > 16)
dataset._age = age
Run the dataset definition with:
opensafely exec ehrql:v0 generate-dataset analysis/dataset_definition.py
Error🔗
Traceback (most recent call last):
File "/workspace/analysis/dataset_definition.py", line 7, in <module>
dataset._age = age
^^^^^^^^^^^^^
AttributeError: Variable names must start with a letter, and contain only alphanumeric characters and underscores (you defined a variable '_age')
Fixed dataset definition 🔗
from ehrql import Dataset
from ehrql.tables.beta.tpp import patients
dataset = Dataset()
age = patients.age_on("2023-01-01")
dataset.define_population(age > 16)
dataset.age = age # _age feature renamed to remove the leading underscore.
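The feature-name rules shown so far can be approximated with a short pre-check. This is only a sketch based on the error messages quoted on this page, not ehrQL's actual validation code: the function name is hypothetical, "letter" is assumed to mean an ASCII letter, and the reserved list contains only the two names this page demonstrates.

```python
import re

# Names this page shows ehrQL rejecting as reserved (may not be exhaustive).
RESERVED = {"population", "variables"}

# Approximation of the rule quoted in the error message:
# start with a letter, then only letters, digits and underscores.
VALID_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def feature_name_problem(name):
    """Return a description of the problem, or None if the name looks fine."""
    if name in RESERVED:
        return "reserved name"
    if not VALID_NAME.fullmatch(name):
        return "must start with a letter and use only letters, digits and underscores"
    return None


for candidate in ["age", "_age", "age_1", "population"]:
    print(candidate, feature_name_problem(candidate))
```

If ehrQL reports a name error that this sketch does not predict, trust the ehrQL error message.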
Re-defining a feature🔗
In the following dataset definition, dataset.age is first defined as age and then defined again as age1.
Failing dataset definition 🔗
from ehrql import Dataset
from ehrql.tables.beta.tpp import patients
dataset = Dataset()
age = patients.age_on("2000-01-01")
age1 = patients.age_on("2023-01-01")
dataset.define_population(age > 16)
dataset.age = age
dataset.age = age1
Run the dataset definition with:
opensafely exec ehrql:v0 generate-dataset analysis/dataset_definition.py
Error🔗
Traceback (most recent call last):
File "/workspace/analysis/dataset_definition.py", line 9, in <module>
dataset.age = age1
^^^^^^^^^^^
AttributeError: 'age' is already set and cannot be reassigned
Fixed dataset definition 🔗
from ehrql import Dataset
from ehrql.tables.beta.tpp import patients
dataset = Dataset()
age = patients.age_on("2000-01-01")
age1 = patients.age_on("2023-01-01")
dataset.define_population(age > 16)
dataset.age = age
dataset.age1 = age1 # The second age feature now has a unique name on the dataset
Undefined features🔗
All features set on a dataset must be defined. In the following dataset definition, age has been defined as a standalone variable, but it has never been assigned to dataset.age:
Failing dataset definition 🔗
from ehrql import Dataset
from ehrql.tables.beta.tpp import patients
dataset = Dataset()
age = patients.age_on("2000-01-01")
dataset.define_population(age > 16)
dataset.age
Run the dataset definition with:
opensafely exec ehrql:v0 generate-dataset analysis/dataset_definition.py
Error🔗
Traceback (most recent call last):
File "/workspace/analysis/dataset_definition.py", line 8, in <module>
dataset.age
AttributeError: Variable 'age' has not been defined
Fixed dataset definition 🔗
from ehrql import Dataset
from ehrql.tables.beta.tpp import patients
dataset = Dataset()
age = patients.age_on("2000-01-01")
dataset.define_population(age > 16)
dataset.age = age # dataset.age is now defined
Trying to set a feature that has more than one row per patient🔗
Failing dataset definition 🔗
from ehrql import Dataset
from ehrql.tables.beta.tpp import practice_registrations
dataset = Dataset()
dataset.registered_on = practice_registrations.start_date
The practice_registrations table contains multiple rows per patient.
Error🔗
Traceback (most recent call last):
File "/workspace/analysis/dataset_definition.py", line 5, in <module>
dataset.registered_on = practice_registrations.start_date
^^^^^^^^^^^^^^^^^^^^^
TypeError: Invalid variable 'registered_on'. Dataset variables must return one row per patient
Fixed dataset definition 🔗
To return the latest registered_on date, first sort the practice registrations table, find the last registration for each patient, and then get the start date.
from ehrql import Dataset
from ehrql.tables.beta.tpp import practice_registrations
dataset = Dataset()
latest_registration_per_patient = practice_registrations.sort_by(practice_registrations.start_date).last_for_patient()
dataset.registered_on = latest_registration_per_patient.start_date
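The idea of "sort, take the last row for each patient, then pick one column" can be pictured with ordinary Python data. The following is a plain-Python analogy with made-up rows; it does not use ehrQL and says nothing about how ehrQL runs the query.

```python
from datetime import date

# Hypothetical registrations: two rows for patient 1, one row for patient 2.
registrations = [
    {"patient_id": 1, "start_date": date(2020, 6, 1)},
    {"patient_id": 2, "start_date": date(2015, 3, 1)},
    {"patient_id": 1, "start_date": date(2010, 1, 1)},
]

latest_start = {}
for row in sorted(registrations, key=lambda r: r["start_date"]):
    # Rows arrive in date order, so a later row overwrites an earlier one:
    # each patient ends up with the start date of their last registration.
    latest_start[row["patient_id"]] = row["start_date"]

print(latest_start)
```

The result holds exactly one value per patient, which is the shape a dataset feature needs.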
Trying to set a feature to a row rather than a value🔗
In the following dataset definition, we have reduced the practice registrations to one row per patient, but we have not selected a value as the feature:
Failing dataset definition 🔗
from ehrql import Dataset
from ehrql.tables.beta.tpp import practice_registrations
dataset = Dataset()
dataset.registered_on = practice_registrations.sort_by(practice_registrations.start_date).last_for_patient()
Error🔗
Traceback (most recent call last):
File "/workspace/analysis/dataset_definition.py", line 5, in <module>
dataset.registered_on = practice_registrations.sort_by(practice_registrations.start_date).last_for_patient()
^^^^^^^^^^^^^^^^^^^^^
TypeError: Invalid variable 'registered_on'. Dataset variables must be values not whole rows
Fix the dataset definition by setting the feature to a single value, in this case, start_date.
Fixed dataset definition 🔗
from ehrql import Dataset
from ehrql.tables.beta.tpp import practice_registrations
dataset = Dataset()
latest_registration_per_patient = practice_registrations.sort_by(practice_registrations.start_date).last_for_patient()
dataset.registered_on = latest_registration_per_patient.start_date
Type errors in ehrQL expressions🔗
Many ehrQL comparisons require the elements being compared to be of the same type.
In the following dataset definition, age is an integer, but in the last line we try to define the population by comparing age to the string "10".
Failing dataset definition 🔗
from ehrql import Dataset
from ehrql.tables.beta.tpp import patients
dataset = Dataset()
age = patients.age_on("2023-01-01")
dataset.define_population(age >= "10")
Error🔗
Traceback (most recent call last):
File "/workspace/analysis/dataset_definition.py", line 6, in <module>
dataset.define_population(age >= "10")
^^^^^^^^^^^
ehrql.query_model.nodes.TypeValidationError: GE.rhs requires 'ehrql.query_model.nodes.Series[int]' but got 'ehrql.query_model.nodes.Series[str]'
Fixed dataset definition 🔗
from ehrql import Dataset
from ehrql.tables.beta.tpp import patients
dataset = Dataset()
age = patients.age_on("2023-01-01")
dataset.define_population(age >= 10) # age is now being compared to the integer 10
Invalid keywords "and", "or", "not"🔗
In normal Python, logical operations can be performed using the keywords and, or and not. In ehrQL these are prohibited and will raise an error.
Failing dataset definition 🔗
from ehrql import Dataset
from ehrql.tables.beta.tpp import patients
dataset = Dataset()
age = patients.age_on("2023-01-01")
dataset.define_population((age >= 16) and (age <= 80))
Error🔗
Traceback (most recent call last):
File "/workspace/analysis/dataset_definition.py", line 6, in <module>
dataset.define_population((age >= 16) and (age <= 80))
^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: The keywords 'and', 'or', and 'not' cannot be used with ehrQL, please use the operators '&', '|' and '~' instead.
(You will also see this error if you try use a chained comparison, such as 'a < b < c'.)
Fixed dataset definition 🔗
As described in the error message, use the operator & instead:
from ehrql import Dataset
from ehrql.tables.beta.tpp import patients
dataset = Dataset()
age = patients.age_on("2023-01-01")
dataset.define_population((age >= 16) & (age <= 80))
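The reason for this restriction lies in Python itself: `and`, `or` and `not` ask an object for a plain True or False, whereas `&`, `|` and `~` call methods that an object can define for itself. The toy class below sketches that difference. `Condition` is a hypothetical stand-in written for this example and is not ehrQL's real implementation.

```python
class Condition:
    """Toy stand-in for a query expression; not ehrQL's real class."""

    def __init__(self, text):
        self.text = text

    def __and__(self, other):  # called for `a & b`
        return Condition(f"({self.text} AND {other.text})")

    def __or__(self, other):  # called for `a | b`
        return Condition(f"({self.text} OR {other.text})")

    def __invert__(self):  # called for `~a`
        return Condition(f"(NOT {self.text})")

    def __bool__(self):  # called for `a and b`, `a or b` and `not a`
        raise TypeError("a condition cannot be reduced to True or False")


over_16 = Condition("age >= 16")
under_80 = Condition("age <= 80")

# The operator builds a new, combined condition.
combined = over_16 & under_80
print(combined.text)

# The keyword forces a True/False answer, which the toy class refuses to give.
try:
    over_16 and under_80
    keyword_failed = False
except TypeError:
    keyword_failed = True
print(keyword_failed)
```

A condition about patients has no single True or False answer until it is evaluated against data, which is why only the operator form can work.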
Chaining comparisons🔗
Chained comparisons are not allowed in ehrQL.
Failing dataset definition 🔗
from ehrql import Dataset
from ehrql.tables.beta.tpp import patients
dataset = Dataset()
age = patients.age_on("2023-01-01")
dataset.define_population(16 < age <= 80)
Error🔗
Traceback (most recent call last):
File "/workspace/analysis/dataset_definition.py", line 6, in <module>
dataset.define_population(16 < age <= 80)
^^^^^^^^^^^^^^
TypeError: The keywords 'and', 'or', and 'not' cannot be used with ehrQL, please use the operators '&', '|' and '~' instead.
(You will also see this error if you try use a chained comparison, such as 'a < b < c'.)
Fixed dataset definition 🔗
from ehrql import Dataset
from ehrql.tables.beta.tpp import patients
dataset = Dataset()
age = patients.age_on("2023-01-01")
dataset.define_population((age >= 16) & (age <= 80))
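Python treats a chained comparison such as `16 < age <= 80` as two comparisons joined by `and`, which is why it produces the same error as the `and` keyword. The standard library `ast` module can show how Python parses an expression without running it. The `describe` helper below is an illustrative sketch, not an ehrQL tool.

```python
import ast


def describe(expression):
    """Say how Python parses `expression`, without evaluating it."""
    node = ast.parse(expression, mode="eval").body
    if isinstance(node, ast.Compare) and len(node.ops) > 1:
        return "chained comparison"
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.BitAnd):
        return "& of two sub-expressions"
    return type(node).__name__


print(describe("16 < age <= 80"))
print(describe("(age >= 16) & (age <= 80)"))
# Without parentheses, & binds more tightly than the comparisons,
# so this is parsed as: age >= (16 & age) <= 80
print(describe("age >= 16 & age <= 80"))
```

The last case shows why the parentheses around each comparison in the fixed dataset definition matter.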
Trying to perform arithmetic operations with an integer column and a float constant🔗
In the following dataset, age is an integer. We cannot subtract a float from it.
Failing dataset definition 🔗
from ehrql import Dataset
from ehrql.tables.beta.tpp import patients
dataset = Dataset()
age = patients.age_on("2023-01-01")
dataset.age_minus_5 = age - 5.5
Error🔗
Traceback (most recent call last):
File "/workspace/analysis/dataset_definition.py", line 6, in <module>
dataset.age_minus_5 = age - 5.5
~~~~^~~~~
ehrql.query_model.nodes.TypeValidationError: Subtract.rhs requires 'ehrql.query_model.nodes.Series[int]' but got 'ehrql.query_model.nodes.Series[float]'
Fixed dataset definition 🔗
from ehrql import Dataset
from ehrql.tables.beta.tpp import patients
dataset = Dataset()
age = patients.age_on("2023-01-01")
dataset.age_minus_5 = age - 5
Calculate a date difference without specifying return units🔗
Failing dataset definition 🔗
from ehrql import Dataset
from ehrql.tables.beta.tpp import patients
dataset = Dataset()
dataset.age_in_may = "2023-05-01" - patients.date_of_birth
Error🔗
Traceback (most recent call last):
File "/workspace/analysis/dataset_definition.py", line 5, in <module>
dataset.age_in_may = "2023-05-01" - patients.date_of_birth
^^^^^^^^^^^^^^^^^^
TypeError: Invalid variable 'age_in_may'. Dataset variables must be values not whole rows
To fix this error, specify the units of the date difference that you want in the feature:
Fixed dataset definition 🔗
from ehrql import Dataset
from ehrql.tables.beta.tpp import patients
dataset = Dataset()
dataset.age_in_may = ("2023-05-01" - patients.date_of_birth).years
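Python's own `datetime` module behaves in a loosely similar way, which may make the ehrQL behaviour feel less surprising: subtracting two dates gives a "difference" object rather than a number, and you have to ask it for a unit. This is an analogy only. Python's `timedelta` offers `.days` but has no `.years`, unlike the ehrQL date difference used above.

```python
from datetime import date, timedelta

# Subtracting two dates gives a timedelta, not a plain number.
difference = date(2023, 5, 1) - date(2000, 5, 1)
print(type(difference).__name__)

# To get a number, ask for a specific unit.
days_between = difference.days
print(isinstance(days_between, int))
```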
Trying to subtract/add constants to dates🔗
Failing dataset definition 🔗
from ehrql import Dataset
from ehrql.tables.beta.tpp import patients
dataset = Dataset()
dataset.date_at_age_16 = patients.date_of_birth + 16
Error🔗
Traceback (most recent call last):
File "/workspace/analysis/dataset_definition.py", line 5, in <module>
dataset.date_at_age_16 = patients.date_of_birth + 16
~~~~~~~~~~~~~~~~~~~~~~~^~~~
TypeError: unsupported operand type(s) for +: 'DatePatientSeries' and 'int'
ehrQL cannot add an integer to a date - it needs to know what sort of time unit we are adding (days, months, years).
Fixed dataset definition 🔗
from ehrql import Dataset, years
from ehrql.tables.beta.tpp import patients
dataset = Dataset()
dataset.date_at_age_16 = patients.date_of_birth + years(16)
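Plain Python dates have the same restriction, which can be a helpful way to remember it. The sketch below uses only the standard library `datetime` module as an analogy. Note that `timedelta` supports days and weeks but not years, so it is not a drop-in replacement for ehrQL's `years()`.

```python
from datetime import date, timedelta

birth = date(2000, 1, 1)

# Adding a bare integer fails: Python cannot tell whether 16 means days or years.
try:
    birth + 16
    int_addition_failed = False
except TypeError:
    int_addition_failed = True
print(int_addition_failed)

# Adding a duration with an explicit unit works.
sixteen_days_later = birth + timedelta(days=16)
print(sixteen_days_later)
```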
Incorrectly referencing a table column🔗
Failing dataset definition 🔗
from ehrql import Dataset
from ehrql.tables.beta.tpp import clinical_events
dataset = Dataset()
first_event = clinical_events.sort_by(date).first_for_patient()
dataset.event_date = first_event.date
Error🔗
Traceback (most recent call last):
File "/workspace/analysis/dataset_definition.py", line 5, in <module>
first_event = clinical_events.sort_by(date).first_for_patient()
^^^^
NameError: name 'date' is not defined
Fixed dataset definition 🔗
Columns can be specified as the table attribute:
from ehrql import Dataset
from ehrql.tables.beta.tpp import clinical_events
dataset = Dataset()
first_event = clinical_events.sort_by(clinical_events.date).first_for_patient()
dataset.event_date = first_event.date
They can also be specified as a name string:
from ehrql import Dataset
from ehrql.tables.beta.tpp import clinical_events
dataset = Dataset()
first_event = clinical_events.sort_by("date").first_for_patient()
dataset.event_date = first_event.date
Specifying a default for case which is a different type to the values🔗
In the following dataset definition, two age groups are defined as integers (1 and 2). A default value (for patients who don't fall into one of the categories) is defined as "unknown". This is an error - any default value given for a case statement must be of the same type as the other values (or None).
Failing dataset definition 🔗
from ehrql import Dataset, case, when
from ehrql.tables.beta.tpp import patients
dataset = Dataset()
age = patients.age_on("2023-01-01")
dataset.age_group = case(
when(age < 10).then(1),
when(age > 80).then(2),
default="unknown",
)
Error🔗
Traceback (most recent call last):
File "/workspace/analysis/dataset_definition.py", line 7, in <module>
dataset.age_group5 = case(
^^^^^
ehrql.query_model.nodes.TypeValidationError: Case.default requires 'ehrql.query_model.nodes.Series[int] | None' but got 'ehrql.query_model.nodes.Series[str]'
Fixed dataset definition 🔗
from ehrql import Dataset, case, when
from ehrql.tables.beta.tpp import patients
dataset = Dataset()
age = patients.age_on("2023-01-01")
dataset.age_group = case(
when(age < 10).then(1),
when(age > 80).then(2),
default=0,
)
Using is_in without a container🔗
Failing dataset definition 🔗
from ehrql import Dataset
from ehrql.tables.beta.tpp import patients
dataset = Dataset()
age = patients.age_on("2023-01-01")
dataset.age_30 = age.is_in(30)
Error🔗
Traceback (most recent call last):
File "/workspace/analysis/dataset_definition.py", line 7, in <module>
dataset.age_30 = age.is_in(30)
^^^^^^^^^^^^^
ehrql.query_model.nodes.TypeValidationError: In.rhs requires 'ehrql.query_model.nodes.Series[collections.abc.Set[int]]' but got 'ehrql.query_model.nodes.Series[int]'
This is also an error:
dataset.age_30_or_40 = age.is_in(30, 40)
Fixed dataset definition 🔗
Arguments passed to is_in must be wrapped in a Python container - a set, list or tuple.
All of the following features defined with is_in are valid.
from ehrql import Dataset
from ehrql.tables.beta.tpp import patients
dataset = Dataset()
age = patients.age_on("2023-01-01")
dataset.age_30_list = age.is_in([30])
dataset.age_30_or_40_set = age.is_in({30, 40})
dataset.age_30_or_40_tuple = age.is_in((30, 40))
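Python's own `in` operator has a comparable requirement, which may help as a memory aid. The `is_in` function below is a toy written for this example, not ehrQL's method. It shows that membership tests need a container on the right-hand side, even when there is only one allowed value.

```python
def is_in(value, allowed):
    """Toy membership test; not ehrQL's real is_in method."""
    return value in allowed


# Lists, sets and tuples are all containers, so these all work.
results = {
    "list": is_in(30, [30]),
    "set": is_in(30, {30, 40}),
    "tuple": is_in(35, (30, 40)),
}
print(results)

# A bare integer is not a container, so the membership test fails.
try:
    is_in(30, 30)
    bare_value_failed = False
except TypeError:
    bare_value_failed = True
print(bare_value_failed)
```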