AQL is one of the most clever things openEHR offers: a query language that allows users to access data they’re interested in, using the elements of openEHR reference model. Its primary author is Chunlan Ma, a real veteran of health IT, who has been a cornerstone of Ocean Informatics (Ocean Health Systems) for many years now. Heath Frankel is the other person from Ocean who made AQL possible.
It has a specification that explains its syntax, and how you can use it. Well, more or less. In case you have not seen it yet, Google is your friend.
As other openEHR vendors began to implement and market their platforms, AQL became a frequently used tool in both developing applications and analysing data. There is a lot to say about domain specific query languages, but I won’t digress, at least for the moment. I’d like to stick to some problems users (and even implementers) may find confusing and discuss the nature of the confusion.
The FROM block
This is where you define the scope of your query. The FROM clause is also what confuses users most. Why? Because it actually has two overlapping semantics, and users find it difficult to keep both of those in mind sometimes. As I said above, it may even be the case for implementers. The two semantics I’m talking about are
- Filtering
- Defining RM elements which can be used to fetch data
What makes AQL different than many other query languages out there is this overlapping semantics/use of FROM block. Consider this example:
SELECT ...
FROM
EHR E
CONTAINS COMPOSITION C
CONTAINS ACTION A
What we do in the FROM block is to define a constraint: we are interested in the case in which an EHR contains a COMPOSITION and that COMPOSITION contains an ACTION. Note that I left SELECT contents in pseudo code, i.e. ...
With a focus on the filtering semantics of AQL, this query fill filter out the EHRs that do not contain a COMPOSITION which in turn does not contain an ACTION.
This notion of filtering has no consideration for cardinality, that is, if there are more than one ACTION instances in a COMPOSITION, that has no relevance to our thinking, we’re focused on a boolean result: does the COMPOSITION sitting under an EHR satisfy our criteria or not? If there are two ACTIONs, our filter will still have positive value, i.e. a true
value for a boolean
returning operation. Maybe we can say that we care about cardinality, but within the boundaries of [0,1]. A cardinality of 1 in the actual data is all we care about; that’s what the filter needs to identify the data we want.
This is a semantics people are used to from many other languages, in fact, it is used inside AQL itself, to enforce various constraints (predicates as we’ll call them in a minute) for RM elements used in the query. Let’s modify the above example just a bit:
SELECT ...
FROM
EHR E [ehr_id/value='...']
CONTAINS COMPOSITION C
CONTAINS ACTION A
See the [ehr_id/value=…] predicate? That is the filtering semantics again: we want to filter out all the EHRs with an ehr_id value that is different than the one we provided. In this case, the cardinality does not even get into the picture: we know that an EHR instance has only one ehr_id field, so there is no possibility of this condition (constraint) being satisfied by more than one actual piece of data.
There is an even more important difference here though: the scope of the ehr_id/value
is limited to predicate for E (the alias for the EHR we use in the query). There is no way to refer to that path from EHR type’s root with a name, there is no way to refer to it from anywhere else in the query.
That’s not the case for C
or A
from the FROM block though. Those aliases can be used in SELECT and WHERE clauses, which is where the second semantics of FROM block comes in: it allows us to define aliases for the components of our filter, which then can be used as starting positions to describe paths of clinical data. In other words, the components of the filter, or structural predicate are visible to the rest of the query, i.e. they have a query level scope. The reason I used the term structural predicate is due to use of CONTAINS keyword: it defines a structural relationship for data: There is an ACTION within a COMPOSITION within an EHR.
When the aliases defined in the FROM block are used elsewhere, their cardinality may a different meaning. Let’s extend the sample query again:
SELECT A
FROM
EHR E [ehr_id/value='...']
CONTAINS COMPOSITION C
CONTAINS ACTION A
So we want to select ACTIONs under a COMPOSITION with this query. What would the user expect as a result set given the following pseudo xml form of data?
<ehr>
<composition>
<action/>
<action/>
</composition>
</ehr>
They’d expect to get back two actions, because there are two of those in the data. So the cardinality of actual data matters in a different way in this case, because it changes the result set contents. In other words:
For the above simplified content, the output of the filtering functionality of FROM does not change as long as we have at least one ACTION under a composition. We could have a million actions, the filter would still return true
The story is different for its RM elements as data sources semantics. Suddenly, the fact that the constraint defined in FROM can match actual data in more than one way matters, because C/A1 and C/A2 are the results the user would expect the result set to return both.
Trouble begins when the user is interested in the cardinality of ACTIONs, but now the actions themselves. Let’s say a user wants to get some value in a composition, based on the criteria that the composition contains an ACTION with archetype node id X:
SELECT C
FROM
EHR E [ehr_id/value='...']
CONTAINS COMPOSITION C
CONTAINS ACTION A [archetype_node_id='X]
and the data is the same:
<ehr>
<composition>
<action/>
<action/>
</composition>
</ehr>
The user is expecting a single composition, i.e. just one row in a result set. They’re trying to say, filter the data and give me all compositions that contain an action with archetype node id X.
In most implementations of AQL, they get back two rows, and the infamous cartesian product problem of AQL emerges from its cave.
Why did the user get back two rows? Because the implementation of AQL found out that there are two actions in data, and there are therefore two ways of satisfying the constraint defined in the FROM block, and returned a subset of the matches, i.e. [[C,A1], [C,A2] -> [C,C]. This fact cannot be observed from the query results though, because the SELECT clause does not include A in its scope. The user thinks the same composition is returned twice because they’re using the query with a filtering semantics in mind for the FROM clause.
If the query was written as:
SELECT C, A
FROM
EHR E [ehr_id/value='...']
CONTAINS COMPOSITION C
CONTAINS ACTION A [archetype_node_id='X]
The result set would be:
<result_set>
<row>
<cell>COMPOSITION_1</cell>
<cell>ACTION_1</cell>
</row>
<row>
<cell>COMPOSITION_1</cell>
<cell>ACTION_2</cell>
</row>
</result_set>
Looking at this, the user can probably see that composition is repeated because there are two different actions leading to two combinations of a composition and an action, as requested in the SELECT clause: SELECT C, A ...
The implementer’s dilemma
Why are implementers confusing the poor users? Not intentionally of course.
The problem here is, from most implementers’ perspective, the semantics of SELECT is implemented as a downstream operation on the outputs of FROM. Therefore, the implementer has to support the case in which any subset of the aliases defined in the FROM clause can be used as a column in the returned result set. In order to do that, the outputs of the FROM are usually based on the elements as data sources semantics, rather than the filtering semantics, because filtering semantics is a special case of data source semantics from a cardinality point of view. Just like selecting only C being a special case of selecting C,A.
I’m not saying this is what all implementations do, but this is a pretty natural way of thinking for an implementer, because it ensures consistency of behaviour. It gives sensible results and it is possible to explain why a particular result set is returned, as I’ve done above, as well as explaining why a user is getting unexpected results, as I’ve also done above. AQL’s design also encourages this way of thinking: you cannot use anything you did not define in the FROM clause from SELECT or WHERE, i.e. SELECT and WHERE depend on FROM’s contents, in other words the scope defined by it.
A common argument that arises in this situation is to make the implementation of AQL smarter. For example, when the user is selecting only C, since the reference model of openEHR gives the implementers the information that an ACTION will always be contained within a COMPOSITION, an implementation could figure out that it can apply a pure filter semantics to what is in the FROM clause, and return only one row per COMPOSITION instance, which would make the previously confused user happy.
I’ll take an opinionated view here and say that it is a bad idea. The reason I gave this example and made the point that I’m against it is because this approach can potentially apply to more cases during the use of AQL and it is a big, big design decision for the query language itself, even though this particular example is kept very simple.
Introducing context specific behaviour changes the unmodified/standard behaviour of query language in subtle ways, and it is extremely difficult, if not downright impossible to guarantee there won’t emerge a combination of data and queries that won’t confuse users in another context.
I’d personally choose a pair of inconvenience and a specific way to deal with it, over an implementation level, behind the scenes approach to figuring out what the user wants. In other words, query semantics should be expressed by the user, not guessed by the implementation. If there is an ambiguity in the query language, such as the FROM block serving as a filter and a data source description depending on the user’s intentions, then the language should remove that ambiguity, and guide the user to use the features that allows them to express their intentions.
There are even more reasons for implementers to prioritize consistent behaviour over case specific conveniences, required due to features of AQL I did not cover here. I won’t go in to those in this post, that’s the next one, but know that they exists.
Thanks Seref for taking the time to write on this well overdue topic. I just wanted to provide my original intension of the semantics of the FROM clause, which demonstrates your point about the well overdue need to better define the semantics of AQL. As you know, the history of the development of AQL was as an r&D project, with the emphasis on the D (lowercase r was intentionally representative) resulting in a paper that specified the language by example based on a resulting commercial implementation rather than specification based on mathematics. Due to commercial commitments of the authors, any subsequent efforts to formalise and extend the language further was not achievable until the openEHR SEC inherited it and took up this challenge.
Anyway, perhaps it is just a difference in “English semantics”, but I don’t believe the FROM clause has filtering semantics other than in one exception. This exception is your example of
FROM EHR E [ehr_id/value=’…’]
If I had my time again, and we actually did change our implementation for this very reason although retained it for backward compatibility, I would remove this kind of filtering predicate and require it to be specified in the WHERE clause. For example:
FROM EHR E
CONTAINS …
WHERE E/ehr_id/value=’…’
Regarding CONTAINS, I never considered this to be filtering and certainly never considered it to be some kind of boolean existence test or anything relevant to cardinality. I’ll comment separately about my rationale relating to the archetype ID predicate not being filtering semantics in your part 2. The FROM clause including CONTAINS was intended to be purely about the data source as is the case in the SQL, of which AQL is strongly influenced (intentionally).
CONTAINS was intended to be just like an SQL JOIN, except the target could be anywhere in the child hierarchy, not just the next level. Effectively (and intentionally), equivalent to // in XPath, the other influence of AQL.
Therefore, result sets should be formed based on the same SQL join principles, hence your cartesian cross product logic would apply. You called this a problem, but from my SQL technician perspective, this was just logical, although I can see why non-technical users would be confused by this as they would be if they used joins in SQL (as we technicians have also been caught out many a times).
Your final point about making an implicit simplification of a result set when the child join results in a cross product but not be returned in the selected result columns. I certainly agree with you that we need to remove these implicit rules and if there is a use case where you have a CONTAINS statement resulting in a cross product but not returned in the SELECTed columns, then this should be reflected in the results and not implicitly simplified. Perhaps there is a need for a DISTINCT operation for this purpose.
Thanks again Seref, I hope this original insight helps, even if it was somewhat naive. I will comment further in Part 2.