Archetype Query Language, The Confusing Bits

AQL is one of the most clever things open_EHR_ offers: a query language that allows users to access data they’re interested in, using the elements of openEHR reference model. Its primary author is Chunlan Ma, a real veteran of health IT, who has been a cornerstone of Ocean Informatics (Ocean Health Systems) for many years now. Heath Frankel is the other person from Ocean who made AQL possible.

It has a specification that explains its syntax, and how you can use it. Well, more or less. In case you have not seen it yet, Google is your friend.

As other openEHR vendors began to implement and market their platforms, AQL became a frequently used tool in both developing applications and analysing data. There is a lot to say about domain specific query languages, but I won’t digress, at least for the moment. I’d like to stick to some problems users (and even implementers) may find confusing and discuss the nature of the confusion.

The FROM block

This is where you define the scope of your query. The FROM clause is also what confuses users most. Why? Because it actually has two overlapping semantics, and users find it difficult to keep both of those in mind sometimes. As I said above, it may even be the case for implementers. The two semantics I’m talking about are

Filtering
Defining RM elements which can be used to fetch data

What makes AQL different than many other query languages out there is this overlapping semantics/use of FROM block. Consider this example:

1
2
3
4
5
SELECT ...
  FROM
    EHR E
    CONTAINS COMPOSITION C
    CONTAINS ACTION A

What we do in the FROM block is to define a constraint: we are interested in the case in which an EHR contains a COMPOSITION and that COMPOSITION contains an ACTION. Note that I left SELECT contents in pseudo code, i.e. ...

With a focus on the filtering semantics of AQL, this query fill filter out the EHRs that do not contain a COMPOSITION which in turn does not contain an ACTION.

This notion of filtering has no consideration for cardinality, that is, if there are more than one ACTION instances in a COMPOSITION, that has no relevance to our thinking, we’re focused on a boolean result: does the COMPOSITION sitting under an EHR satisfy our criteria or not? If there are two ACTIONs, our filter will still have positive value, i.e. a true value for a boolean returning operation. Maybe we can say that we care about cardinality, but within the boundaries of [0,1]. A cardinality of 1 in the actual data is all we care about; that’s what the filter needs to identify the data we want.

This is a semantics people are used to from many other languages, in fact, it is used inside AQL itself, to enforce various constraints (predicates as we’ll call them in a minute) for RM elements used in the query. Let’s modify the above example just a bit:

1
2
3
4
5
SELECT ...
  FROM
    EHR E [ehr_id/value='...']
    CONTAINS COMPOSITION C
    CONTAINS ACTION A

See the [ehr_id/value=…] predicate? That is the filtering semantics again: we want to filter out all the EHRs with an ehr_id value that is different than the one we provided. In this case, the cardinality does not even get into the picture: we know that an EHR instance has only one ehr_id field, so there is no possibility of this condition (constraint) being satisfied by more than one actual piece of data.

There is an even more important difference here though: the scope of the ehr_id/value is limited to predicate for E (the alias for the EHR we use in the query). There is no way to refer to that path from EHR type’s root with a name, there is no way to refer to it from anywhere else in the query.

That’s not the case for C or A from the FROM block though. Those aliases can be used in SELECT and WHERE clauses, which is where the second semantics of FROM block comes in: it allows us to define aliases for the components of our filter, which then can be used as starting positions to describe paths of clinical data. In other words, the components of the filter, or structural predicate are visible to the rest of the query, i.e. they have a query level scope. The reason I used the term structural predicate is due to use of CONTAINS keyword: it defines a structural relationship for data: There is an ACTION within a COMPOSITION within an EHR.

When the aliases defined in the FROM block are used elsewhere, their cardinality may a different meaning. Let’s extend the sample query again:

1
2
3
4
5
SELECT A
  FROM
    EHR E [ehr_id/value='...']
    CONTAINS COMPOSITION C
    CONTAINS ACTION A

So we want to select ACTIONs under a COMPOSITION with this query. What would the user expect as a result set given the following pseudo xml form of data?

1
2
3
4
5
6
<ehr>
  <composition>
    <action/>
    <action/>
  </composition>
</ehr>

They’d expect to get back two actions, because there are two of those in the data. So the cardinality of actual data matters in a different way in this case, because it changes the result set contents. In other words:

For the above simplified content, the output of the filtering functionality of FROM does not change as long as we have at least one ACTION under a composition. We could have a million actions, the filter would still return true

The story is different for its RM elements as data sources semantics. Suddenly, the fact that the constraint defined in FROM can match actual data in more than one way matters, because C/A1 and C/A2 are the results the user would expect the result set to return both.

Trouble begins when the user is interested in the cardinality of ACTIONs, but now the actions themselves. Let’s say a user wants to get some value in a composition, based on the criteria that the composition contains an ACTION with archetype node id X:

1
2
3
4
5
SELECT C
  FROM
    EHR E [ehr_id/value='...']
    CONTAINS COMPOSITION C
    CONTAINS ACTION A [archetype_node_id='X]

and the data is the same:

1
2
3
4
5
6
<ehr>
  <composition>
    <action/>
    <action/>
  </composition>
</ehr>

The user is expecting a single composition, i.e. just one row in a result set. They’re trying to say, filter the data and give me all compositions that contain an action with archetype node id X.

In most implementations of AQL, they get back two rows, and the infamous cartesian product problem of AQL emerges from its cave.

Why did the user get back two rows? Because the implementation of AQL found out that there are two actions in data, and there are therefore two ways of satisfying the constraint defined in the FROM block, and returned a subset of the matches, i.e. [[C,A1], [C,A2] -> [C,C]. This fact cannot be observed from the query results though, because the SELECT clause does not include A in its scope. The user thinks the same composition is returned twice because they’re using the query with a filtering semantics in mind for the FROM clause.

If the query was written as:

1
2
3
4
5
SELECT C, A
  FROM
    EHR E [ehr_id/value='...']
    CONTAINS COMPOSITION C
    CONTAINS ACTION A [archetype_node_id='X]

The result set would be:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
<result_set>
  <row>
    <cell>COMPOSITION_1</cell>
    <cell>ACTION_1</cell>
  </row>
  <row>
    <cell>COMPOSITION_1</cell>
    <cell>ACTION_2</cell>
  </row>
</result_set>

Looking at this, the user can probably see that composition is repeated because there are two different actions leading to two combinations of a composition and an action, as requested in the SELECT clause: SELECT C, A ...

The implementer’s dilemma

Why are implementers confusing the poor users? Not intentionally of course.

The problem here is, from most implementers’ perspective, the semantics of SELECT is implemented as a downstream operation on the outputs of FROM. Therefore, the implementer has to support the case in which any subset of the aliases defined in the FROM clause can be used as a column in the returned result set. In order to do that, the outputs of the FROM are usually based on the elements as data sources semantics, rather than the filtering semantics, because filtering semantics is a special case of data source semantics from a cardinality point of view. Just like selecting only C being a special case of selecting C,A.

I’m not saying this is what all implementations do, but this is a pretty natural way of thinking for an implementer, because it ensures consistency of behaviour. It gives sensible results and it is possible to explain why a particular result set is returned, as I’ve done above, as well as explaining why a user is getting unexpected results, as I’ve also done above. AQL’s design also encourages this way of thinking: you cannot use anything you did not define in the FROM clause from SELECT or WHERE, i.e. SELECT and WHERE depend on FROM’s contents, in other words the scope defined by it.

A common argument that arises in this situation is to make the implementation of AQL smarter. For example, when the user is selecting only C, since the reference model of openEHR gives the implementers the information that an ACTION will always be contained within a COMPOSITION, an implementation could figure out that it can apply a pure filter semantics to what is in the FROM clause, and return only one row per COMPOSITION instance, which would make the previously confused user happy.

I’ll take an opinionated view here and say that it is a bad idea. The reason I gave this example and made the point that I’m against it is because this approach can potentially apply to more cases during the use of AQL and it is a big, big design decision for the query language itself, even though this particular example is kept very simple.

Introducing context specific behaviour changes the unmodified/standard behaviour of query language in subtle ways, and it is extremely difficult, if not downright impossible to guarantee there won’t emerge a combination of data and queries that won’t confuse users in another context.

I’d personally choose a pair of inconvenience and a specific way to deal with it, over an implementation level, behind the scenes approach to figuring out what the user wants. In other words, query semantics should be expressed by the user, not guessed by the implementation. If there is an ambiguity in the query language, such as the FROM block serving as a filter and a data source description depending on the user’s intentions, then the language should remove that ambiguity, and guide the user to use the features that allows them to express their intentions.

There are even more reasons for implementers to prioritize consistent behaviour over case specific conveniences, required due to features of AQL I did not cover here. I won’t go in to those in this post, that’s the next one, but know that they exists.

The FROM block#

The implementer’s dilemma#

The FROM block

The implementer’s dilemma