A discussion about Archetype Query Language semantics

This is a copy/paste of a few responses I sent to a discussion in the openEHR lists. I’m copying them here because images in my responses and responses themselves are not properly archived anywhere yet.

If you want more, I wrote a PhD thesis on this topic that goes into a deeper discussion, but I suggest you read the following first.

Here is the whole exchange from openEHR mail lists, with all responses, including mine:

From: Bjørn Næss
Date: Mon, Apr 24, 2017 at 11:01 PM
Subject: AQL: Expected result with repeating structures
To: For openEHR implementation discussions

Hi

We have created a GIT repo with some issues from our experiences with AQL.

The repo was created to expose one issue with repeating items. You will find it here – and some feedback is welcome:

Folder with all files: https://github.com/bjornna/openehr-conformance/tree/master/aql/case1-permutations

README with the problem: https://github.com/bjornna/openehr-conformance/blob/master/aql/case1-permutations/index.adoc

The initial problem was this:

One simple example may be openEHR-EHR-OBSERVATION.body_weight.v1, which has an optional and repeating ‘any_event’. Below is a straightforward AQL on this archetype. We are asking for the origin from the HISTORY attribute, and then for each repeating event we want the weight magnitude and unit.

As you see from the AQL, there is an additional WHERE clause stating that we only want weights with a magnitude less than 45.

The question is: What do you expect as result from this query?

select
o/data[at0002]/origin/value as time,
o/data[at0002]/events[at0003]/data[at0001]/items[at0004]/value/magnitude as Weight,
o/data[at0002]/events[at0003]/data[at0001]/items[at0004]/value/units as Unit
from
composition c
contains
observation o[openEHR-EHR-OBSERVATION.body_weight.v1]
where
o/data[at0002]/events[at0003]/data[at0001]/items[at0004]/value/magnitude < 45
order by
o/data[at0002]/origin/value desc

Kind regards
Bjørn Næss
Product Manager
DIPS ASA

Mobile +47 93 43 29 10

From: Pablo Pazos
Date: Tue, Apr 25, 2017 at 5:54 AM
Subject: Re: AQL: Expected result with repeating structures
To: For openEHR implementation discussions

Trying to understand the problem, I modeled the database schema in my head.

origin is like a column from the container, and magnitude/units are columns for the contained node. So it is like having two tables, one for the history and one for the events with the data (there might be N tables in the middle, but this is a simplification). It is a one-to-many relationship, getting one field from the *one* side and two from the *many* side. For me the result would repeat the origin for every magnitude/units pair, and each result in the result set would be that triplet (o, m, u), repeating the o for every triplet, with m and u being values contained in the same event/cluster/element/datavalue: the parent paths down to the datavalue are the same, and the only difference is the reference to the datavalue attributes.
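The one-to-many reading described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `history` dictionary, its values and the `rows` helper are hypothetical and stand in for a HISTORY with repeating EVENTs; they are not taken from any openEHR implementation.

```python
# Hypothetical data: one HISTORY with an origin and several EVENTs.
history = {
    "origin": "2017-04-24T08:00:00",
    "events": [
        {"magnitude": 44.0, "units": "kg"},
        {"magnitude": 43.5, "units": "kg"},
        {"magnitude": 46.0, "units": "kg"},
    ],
}


def rows(history, max_magnitude=None):
    """Join the 'one' side (origin) to each item on the 'many' side (events)."""
    out = []
    for event in history["events"]:
        # Optional filter, mirroring a WHERE clause on the magnitude.
        if max_magnitude is not None and not event["magnitude"] < max_magnitude:
            continue
        # magnitude and units always come from the same event.
        out.append((history["origin"], event["magnitude"], event["units"]))
    return out


result = rows(history, max_magnitude=45)
for row in result:
    print(row)
```

Each row repeats the origin and carries a magnitude/units pair taken from a single event, so values from sibling events are never mixed.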

But that is what I think it should return. Maybe the semantics are different in AQL and the results should be the permutations of data between siblings, but I don’t see much sense in doing that. Also, something like a group by might be needed, but instead of using it for aggregations as in SQL, it would group data by container.

Sorry if this doesn’t make much sense, I might not understand the whole problem 🙂

From: Bjørn Næss
Date: Tue, Apr 25, 2017 at 9:36 PM
Subject: RE: AQL: Expected result with repeating structures
To: For openEHR implementation discussions

Thanks Pablo!

I think you have understood the problem. The real problem is how to interpret the clinical intention of the query, and then how to apply that on the openEHR RM to produce the correct output.

I have been thinking about adding different “implementations” of the problem. One “implementation” could be an ER (entity relationship) model with tables and references, which would then be applied to SQL. Another “implementation” would be XPath.

In this problem you need to define how to repeat:

· History.Origin by all child EVENTS

· Observation.Protocol joined by data from events.

It would be nice if you could contribute some ER models to show how this would work in your model. Put another way: what would such a query look like, and what would the output be from your system?

From: Pablo Pazos
Date: Wed, Apr 26, 2017 at 2:47 AM
Subject: Re: AQL: Expected result with repeating structures
To: For openEHR implementation discussions

For XML queries I would prefer to query and return a complete subdocument starting from the history, not just individual nodes. In that scenario, querying by multiple XPaths that point to individual nodes might return just a list of those nodes without any relationship between them. But if those nodes include some kind of parent instance id, a program might reconstruct the hierarchy from the list of results. The problem I see with this is that the database already has the hierarchy, returns a plain list, and the client then needs to reconstruct it. Why not just return the whole hierarchy and let the client process it?

I have more doubts than certainties. I know how I do things in the EHRServer, but not sure how this should work in general.

1. In the EHRServer, database is relational.
2. Queries for datavalues return whole datavalues, so there is no need of adding a projection for e.g. magnitude, units when querying a DV_QUANTITY.
3. Results can be grouped by composition, e.g. multiple instances of event will have a leaf element in the structure, the result will put all the ELEMENT.value for all the EVENTs of the same COMPOSITION instance together in the result.
4. Results can be grouped by path: all the results for e.g. systolic BP together, independently of the composition/event that contains them, but the result is annotated with the start time of the COMPOSITION that contains the datavalue. As I remember, you mentioned at the latest demo of the EHRServer something about making that more flexible, e.g. annotating the results with the HISTORY.origin or another time in the model. I think that will be very useful for queries like the one you mention in the first email (and I have it on my TODO list 🙂).

But this is not AQL (yet); it is just how I designed dynamic queries that can be clinically useful. IMO AQL is too generic and some behaviors are not 100% defined yet, but it is also powerful and expressive.

From: Seref Arikan
Date: Mon, May 1, 2017 at 11:53 AM
Subject: Re: AQL: Expected result with repeating structures
To: For openEHR implementation discussions

Hi Bjørn,

I’ll respond to your interpretation of AQL in the git repo in a minute. But before I do that, I’d like to thank you for asking the question and going to the trouble of putting this into a git repo, because your question points at an important issue we have at the moment: we don’t have a well-defined, formal semantics for AQL. This is my personal opinion and all disagreement is welcome.

AQL is a brilliant idea coined by Chunlan, Heath, Tom and Sam, but until recently it was no more than a few wiki pages; it was not even part of the standard. It is now sitting next to other specifications, but its documentation mostly focuses on the syntax. The semantics of queries are mostly implied, and vendors more or less arrive at the same conclusions intuitively. However, this is not good enough.

Personally I see AQL as the hidden treasure of openEHR: it has huge potential and it is the most likely candidate to implement lots of advanced use cases in an interoperable way. For that potential to be realised, AQL needs a clear description of its underlying semantics so that edge cases are clarified and vendors support the same behaviour. Otherwise, we’ll lose the ability to build portable solutions that can be moved from point A to B even though both points are openEHR implementations.

Having gotten that bit of complaint off my chest, I can comment on your actual question: I don’t think the queries you’ve provided in the git repo have any edge cases, but I disagree with what is written there based on my interpretation of AQL semantics. Since you’ve gone to the trouble of expressing your question in detail, I’ll try to return the courtesy. Everything I’m talking about below is in https://github.com/serefarikan/aql-discussion

Let’s start with the meaning of your queries. The aql queries you’ve given in the repo correspond to the following:
[Image: diagram of the query structure]

In the picture above, double lines that join nodes represents CONTAINS and single lines represent a full path build on parent-child relationships. Every node in these graphs is a variable, which may or may not be included in the results. These variables are placeholders and they can be filled by actual reference model instances in your data. Here is the instance you’ve provided in Json:

[Image: the JSON data instance]

So you’re basically asking the following question:
Given the structural constraints I’m providing in this query, and this particular instance of data, in how many ways can I populate the tabular representation which is my results?

In the git repo, your interpretation of the following query’s results is that it is a cartesian product of all variables:
select
o/data[at0001]/events[at0002]/data[at0003]/items[at0004]/items[at0005]/value/magnitude as Magn,
o/data[at0001]/events[at0002]/data[at0003]/items[at0004]/items[at0005]/value/units as Units,
o/protocol[at0007]/items[at0008]/value as Protocol
from
COMPOSITION c
contains
OBSERVATION o[openEHR-EHR-OBSERVATION.multiple_events_cluster.v0]

I think you’re missing the structural constraints you’re imposing in the select clause in this query. All the leaf nodes you’re defining in that clause are direct paths from the root “o” so the Magn and Units will always be under the same parent: you cannot have (1,b) or (2,c), these results would break the structural constraints you’ve defined in your query.

Since my whole point is that AQL has subjective interpretation, I’ve used other formalisms to demonstrate my interpretation. I wrote a simplified ontology with OWL and a simplified xml file to represent the actual data input. Here is a SPARQL query that corresponds to the aql query above, run against my toy openEHR ontology:

SELECT ?magn ?units ?Protocol
WHERE {
?comp a oe:Composition .
?comp oe:contains ?obs . ?obs a oe:Observation .
?obs oe:hasChild ?protocol . ?protocol oe:hasChild ?protoEl .
?protoEl oe:dvTextValue ?Protocol .
?obs oe:hasChild ?event . ?event a oe:Event .
?event oe:hasChild ?cluster . ?cluster a oe:Cluster .
?cluster oe:hasChild ?el . ?el a oe:Element .
?el oe:hasChild ?dvq . ?dvq a oe:DvQuantity .
?dvq oe:magnitude ?magn .
?dvq oe:units ?units
#FILTER (?magn < 3.0)
}

which gives the result:
[Image: SPARQL query results]

The Xquery version of the same query semantics:

for $obs in composition//observation
for $measr in $obs/event/cluster/item/measurement
let $m := ($measr/magnitude, $measr/units)
for $p in $obs/protocol/item_tree/element

return {($m, $p)}

produces the same results:

Magn  Units  Protocol
1     a      X
1     a      Y
2     b      X
2     b      Y
3     c      X
3     c      Y

As you can see, there is no (1,b) or (2,c) results here. The repeated rows are due to protocol/value having two values X and Y and that means you can fill in the protocol column in the first diagram I’ve pasted above in two ways, after you’ve put in magn and units.
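To make the difference between the two readings concrete, here is a minimal Python sketch. It uses the toy values from this example (1/a, 2/b, 3/c and protocol values X and Y); the flattened `observation` dictionary and both function names are assumptions made for illustration, not part of any AQL engine.

```python
from itertools import product

# Toy data from this thread: three (magnitude, units) pairs, each pair
# living in one ELEMENT, plus two protocol values.
observation = {
    "elements": [
        {"magnitude": 1, "units": "a"},
        {"magnitude": 2, "units": "b"},
        {"magnitude": 3, "units": "c"},
    ],
    "protocol": ["X", "Y"],
}


def cartesian(obs):
    """Treat every selected leaf as independent: 3 x 3 x 2 = 18 rows."""
    magns = [e["magnitude"] for e in obs["elements"]]
    units = [e["units"] for e in obs["elements"]]
    return list(product(magns, units, obs["protocol"]))


def structural(obs):
    """Bind magnitude and units to the same parent first: 3 x 2 = 6 rows."""
    return [
        (e["magnitude"], e["units"], p)
        for e in obs["elements"]
        for p in obs["protocol"]
    ]


print(len(cartesian(observation)), len(structural(observation)))
```

The cartesian reading contains rows such as (1, b, X), which pair a magnitude with the units of a different element; the structural reading only repeats rows because the protocol value can be filled in two ways.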

Based on the same interpretation, your second query would not return (1,a) (1,b).. as written in the readme in the git repo. It would instead get a nice list like this. First the SPARQL version of your AQL 2:

PREFIX owl:
PREFIX rdf:
PREFIX rdfs:
PREFIX oe:
SELECT ?magn ?units
WHERE {
?comp a oe:Composition .
?comp oe:contains ?obs . ?obs a oe:Observation .
?obs oe:contains ?event . ?event a oe:Event .
?event oe:hasChild ?cluster . ?cluster a oe:Cluster .
?cluster oe:hasChild ?el . ?el a oe:Element .
?el oe:hasChild ?dvq . ?dvq a oe:DvQuantity .
?dvq oe:magnitude ?magn .
?dvq oe:units ?units
#FILTER (?magn < 3.0)
}

Then the results:
[Image: SPARQL query results]

I’ve written the SPARQL and XQuery for the queries you’ve given in your repo; they’re in the git repository along with the toy ontology, the sample XML file, etc. that I’ve created. No need to copy and paste them all here; I think the examples above explain what I mean.

Please note that the point here is not that SPARQL, OWL, XQuery or something else is a good representation for AQL semantics. The point is that I cannot see one described in the openEHR specifications.

Currently, there is a lot of momentum in the openEHR space and people are very excited about modelling discussions and using Snomed CT etc, that’s all fine, but for whatever reason AQL is not getting the attention it needs from a specification point of view, or so I think. What I mean is:
If you get the models wrong, everybody is using the wrong models so representation is broken but interoperability is not. If you get the behaviour wrong, since everybody is not using the same implementation, interoperability is broken.

Comments and corrections are most welcome for all of the above.

Cheers
Seref

From: Bjørn Næss
Date: Wed, May 3, 2017 at 9:45 AM
Subject: RE: AQL: Expected result with repeating structures
To: For openEHR implementation discussions

Hi Seref

Thanks for your brilliant response on this topic. I totally agree with you: AQL is a treasure in openEHR. Without AQL eHealth will miss a lot of opportunities.

I have added some comments – mostly as new challenges below.

· Created a pull request to your repo (see below)

· Added two new examples based on openEHR-EHR-OBSERVATION.blood_pressure.v1 and openEHR-EHR-OBSERVATION.glasgow_coma_scale.v1. They are both described shortly below.

ELEMENT is atomic

From your XQUERY and SPARQL examples it looks like you have defined the queries in such a way that the openEHR ELEMENT/DATAVALUE is atomic. Put another way – the application should not divide attributes from these objects. I agree on this!

I have added a new example: #aql-11 with an AQL that queries for the DATAVALUE. This query returns the expected result. Note that the result from this query is the same as #aql-3 where the CONTAINS goes down to POINT_EVENT.

Then – how to interpret the original AQL to get the expected result? Below we have the AQL:

select
o/data[at0001]/events[at0002]/data[at0003]/items[at0004]/items[at0005]/value/magnitude as Magn,
o/data[at0001]/events[at0002]/data[at0003]/items[at0004]/items[at0005]/value/units as Units,
o/protocol[at0007]/items[at0008]/value as Protocol
from
COMPOSITION c
contains
OBSERVATION o[openEHR-EHR-OBSERVATION.multiple_events_cluster.v0]

Postulate 1: AQL providers MUST treat the paths in the SELECT clause like trees.
The trees should have paths which are equal and down to the depth of an ELEMENT. In effect this makes ELEMENT atomic (not dividable) in the resultset.
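One way to read this postulate is that SELECT paths sharing the same prefix down to the ELEMENT belong together and must be resolved against the same node. The Python sketch below groups the three paths of the query above in that way. The rule used to find the ELEMENT (everything before the first `value` segment) is an assumption that happens to fit this query; it is not a general AQL path parser.

```python
from collections import defaultdict

# The three SELECT paths from the query above.
select_paths = [
    "o/data[at0001]/events[at0002]/data[at0003]/items[at0004]/items[at0005]/value/magnitude",
    "o/data[at0001]/events[at0002]/data[at0003]/items[at0004]/items[at0005]/value/units",
    "o/protocol[at0007]/items[at0008]/value",
]


def element_path(path):
    """Return the part of the path before the first 'value' segment.

    Assumption for this sketch: the ELEMENT is the node directly above
    the first segment named 'value'.
    """
    segments = path.split("/")
    if "value" in segments:
        segments = segments[: segments.index("value")]
    return "/".join(segments)


def group_by_element(paths):
    """Group SELECT paths by the ELEMENT they descend from."""
    groups = defaultdict(list)
    for p in paths:
        groups[element_path(p)].append(p)
    return dict(groups)


groups = group_by_element(select_paths)
for element, paths in groups.items():
    print(element, len(paths))
```

Here the magnitude and units paths land in one group and the protocol path in another, which matches the two unique paths referred to in the postulate.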

(In the AQL above I have marked the two unique paths with orange and green colors).

XQuery for the blood pressure example

The Blood Pressure example is based on repeating EVENTS and multiple observations in the Composition. The example is here: https://github.com/bjornna/openehr-conformance/blob/master/aql/case1.1-permutation_bp/index.adoc.

I have created a pull request (https://github.com/serefarikan/aql-discussion/pull/1) and issues (https://github.com/serefarikan/aql-discussion/issues/2) to your AQL discussion repo for this.

SELECT
o/data[at0001]/origin/value as Origin,
o/data[at0001]/events[at0006]/time/value as EventTime,
o/data[at0001]/events[at0006]/data[at0003]/items[at0004]/value/magnitude as Systolic,
o/data[at0001]/events[at0006]/data[at0003]/items[at0005]/value/magnitude as Diastolic,
o/protocol[at0011]/items[at0013]/value/value as Cuff
FROM Composition c
CONTAINS OBSERVATION o[openEHR-EHR-OBSERVATION.blood_pressure.v1]

I think the blood pressure example as described will work correctly (give the expected resultset) given that the AQL server treats the paths as trees.

Glasgow Coma Scale example

The Glasgow Coma Scale is added because the Comment element is repeatable. This opens a new problem: How to handle repeating comments in the result set. Take a look at this example for my opinion on this: https://github.com/bjornna/openehr-conformance/blob/master/aql/case1.1-permutation_gcs/index.adoc

SELECT
o/data[at0001]/origin/value as Origin,
o/data[at0001]/events[at0002]/time/value as EventTime,
o/data[at0001]/events[at0002]/data[at0003]/items[at0009]/value/value as Eye,
o/data[at0001]/events[at0002]/data[at0003]/items[at0007]/value/value as Verbal,
o/data[at0001]/events[at0002]/data[at0003]/items[at0008]/value/value as Motor,
o/data[at0001]/events[at0002]/data[at0003]/items[at0026]/value/magnitude as Score,
o/data[at0001]/events[at0002]/data[at0003]/items[at0037]/value/value as Comment
FROM COMPOSITION c
CONTAINS OBSERVATION o[openEHR-EHR-OBSERVATION.glasgow_coma_scale.v1]

[Image: screenshot]

From: Seref Arikan
Date: Wed, May 3, 2017 at 9:55 AM
Subject: Re: AQL: Expected result with repeating structures
To: For openEHR implementation discussions

Hi Bjørn,

Just a quick reply regarding a potential misunderstanding: my simplifications in the ontology/XML/queries are there because I don’t have the time to fully implement the RM or query semantics. When I take shortcuts and omit elements, either in the content I create or in the queries, whether they use SPARQL, XQuery, etc., it is only because of time limitations. I’d suggest you don’t take these as anything else, i.e. I’m not suggesting changes to the granularity of the RM or AQL, though shortcuts for the latter would be a good topic for discussion 🙂 (but that would be syntax, not semantics, which is what I’m trying to point at).

I’m up to my eyeballs in code at the moment so I’ll have to look at the rest of your response a bit later.

All the best
Seref

From: Bjørn Næss
Date: Wed, May 3, 2017 at 10:06 AM
Subject: RE: AQL: Expected result with repeating structures
To: For openEHR implementation discussions

Seref – quick response to your quick response 🙂

I totally understand that you did “shortcuts”. That is perfectly fine and needed to communicate the essence of some examples. But those examples will never be “total openEHR applications” – that takes too much time.

This dialogue is needed. We need a way to define the semantic rules for interpreting AQL. Since AQL has borrowed syntax from other technologies like SQL, XPath, XQuery, etc., adding examples that use these technologies is a really nice contribution.

It would be interesting if other vendors ran my examples with the given Compositions and AQL queries to see what kind of results they get, and then agreed or disagreed with the expected results provided.

I expressed my postulate on granularity because, based on your response, I found it correct. I wrote it out to get some feedback on it. It could be added to the specification if others find it useful.

From: Seref Arikan
Date: Sat, May 6, 2017 at 1:40 PM
Subject: Re: AQL: Expected result with repeating structures
To: For openEHR implementation discussions

Hello Bjørn,

This is your first bp query, following the same notation I used in the initial response:

[Image: diagram of the first blood pressure query]

Which would correspond to the following Xquery code:
xquery version "3.0";
declare namespace xsi="http://www.w3.org/2001/XMLSchema-instance";
declare default element namespace "http://schemas.openehr.org/v1";
{
for $obs in composition//content[@xsi:type='OBSERVATION']
for $protocol in $obs/protocol[@archetype_node_id='at0011']
let $cuff := $protocol/items[@archetype_node_id='at0013']/value/value
let $obs_data := $obs/data[@archetype_node_id='at0001']
let $origin := $obs_data/origin/value
for $event in $obs_data/events[@archetype_node_id='at0006']
let $eventTime := $event/time/value
for $data in $event/data[@archetype_node_id='at0003']
for $systolicElement in $data/items[@archetype_node_id='at0004']
let $systolic := $systolicElement/value/magnitude
for $diastolicElement in $data/items[@archetype_node_id='at0005']
let $diastolic := $diastolicElement/value/magnitude
return
{
(element Origin {$origin/text()},
(element EventTime {$eventTime/text()}),
(element Systolic {$systolic/text()}),
(element Diastolic {$diastolic/text()}),
(element Cuff {$cuff/text()}))
}
}

Which gives the result the query is asking for. I am not sure I understand why you’re naming the first aql query for blood pressure as naive though. It looks like a reasonable query to me.

Your second query is the following:
[Image: diagram of the second blood pressure query]

Which corresponds to the following Xquery code:
xquery version "3.0";
declare namespace xsi="http://www.w3.org/2001/XMLSchema-instance";
declare default element namespace "http://schemas.openehr.org/v1";
{
for $obs in composition//content[@xsi:type='OBSERVATION']
for $protocol in $obs/protocol[@archetype_node_id='at0011']
let $cuff := $protocol/items[@archetype_node_id='at0013']/value/value
let $obs_data := $obs/data[@archetype_node_id='at0001']
let $origin := $obs_data/origin/value
(:THIS IS WHERE AQL_BP1 AND THIS FILE DIFFERS:)
for $event in $obs//*[@xsi:type='POINT_EVENT']
let $eventTime := $event/time/value
for $data in $event/data[@archetype_node_id='at0003']
for $systolicElement in $data/items[@archetype_node_id='at0004']
let $systolic := $systolicElement/value/magnitude
for $diastolicElement in $data/items[@archetype_node_id='at0005']
let $diastolic := $diastolicElement/value/magnitude
return
{
(element Origin {$origin/text()},
(element EventTime {$eventTime/text()}),
(element Systolic {$systolic/text()}),
(element Diastolic {$diastolic/text()}),
(element Cuff {$cuff/text()}))
}
}

The results for both queries are the same, which is:

Origin                     EventTime                  Systolic  Diastolic  Cuff
2017-05-02T20:00:00+02:00  2017-05-02T20:05:00+02:00  100       90         Adult thigh
2017-05-02T20:00:00+02:00  2017-05-02T20:10:00+02:00  101       91         Adult thigh
2017-05-02T20:15:00+02:00  2017-05-02T20:20:00+02:00  102       92         Large adult

I did not have time to write the data instance in OWL based on the xml you’ve provided, so I did not write SPARQL versions.

Your third example, regarding the Glasgow Coma Scale, produces the correct results, and you have a good point about displaying this to a clinician: it is almost certain that a clinician would have difficulty spotting the difference between the two rows, which is in the rightmost column. But this is not about AQL semantics; it is about how to format query results.

Going back to the difference between your two queries for blood pressure: the results are the same because, even though the structural constraints are different, given your data and, more importantly, the design of the RM, the results would be the same. Think about it: there is nothing of importance sitting between the observation and the event, so it does not matter whether your query includes the event via a CONTAINS constraint in the FROM clause or via a direct parent/child path in the SELECT clause.

Your use of POINT_EVENT raises the kind of issue I’m pointing at though, because that is not a structural constraint, it is a type constraint. So the interesting question is:

Should AQL implementations support polymorphic results?

That is, if I use EVENT in the FROM clause for a named node as in EVENT e should this resolve to both POINT_EVENT and INTERVAL_EVENT instances? What if the data has INTERVAL_EVENTS and the query, as you’ve done, is asking for POINT_EVENTs?
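The two possible answers can be illustrated with a small Python sketch. The three classes below only mimic the EVENT, POINT_EVENT and INTERVAL_EVENT inheritance from the RM, and the two matching functions are hypothetical names for the two interpretations; nothing here reflects how a particular AQL implementation resolves types.

```python
class Event:
    """Stand-in for the RM type EVENT."""


class PointEvent(Event):
    """Stand-in for POINT_EVENT."""


class IntervalEvent(Event):
    """Stand-in for INTERVAL_EVENT."""


# Hypothetical data containing both kinds of event.
events = [PointEvent(), IntervalEvent(), PointEvent()]


def match_polymorphic(items, cls):
    """'EVENT e' matches instances of EVENT and of all its subtypes."""
    return [i for i in items if isinstance(i, cls)]


def match_exact(items, cls):
    """'EVENT e' matches only instances whose type is exactly EVENT."""
    return [i for i in items if type(i) is cls]


print(len(match_polymorphic(events, Event)), len(match_exact(events, Event)))
print(len(match_polymorphic(events, PointEvent)))
```

Under the polymorphic reading a query on EVENT sees all three instances, under the exact reading it sees none, and a query on POINT_EVENT silently skips the interval event in either case.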

How about the interpretation of paths in the SELECT clause? Do we assume a logical OR here? That is, if a composition does not contain one of the paths in the SELECT clause, should the results include a row with that column set to null, or should that row be excluded altogether (if we assume a logical AND)?
Can you imagine what would happen if a large-scale data extraction for decision support/population query analysis were run on two different implementations that interpret the SELECT clause children differently?

Currently, almost all vendors that I know of assume a logical OR, so any data instance that satisfies at least one condition/existence in the SELECT clause is included in the results. But this is just the common sense of the implementers; there is nothing in the AQL specification about this.
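A minimal Python sketch of the two interpretations follows. The `candidates` list is hypothetical data in which `None` marks a SELECT path that did not resolve, and `select_or` and `select_and` are illustrative names rather than functions of any real implementation.

```python
# Hypothetical extracted values per data instance; None marks a SELECT
# path that did not resolve for that instance.
candidates = [
    {"Systolic": 100, "Diastolic": 90, "Cuff": "Adult thigh"},
    {"Systolic": 101, "Diastolic": 91, "Cuff": None},
    {"Systolic": None, "Diastolic": None, "Cuff": None},
]


def select_or(rows):
    """Keep a row if at least one selected path resolved; the rest stay null."""
    return [r for r in rows if any(v is not None for v in r.values())]


def select_and(rows):
    """Keep a row only if every selected path resolved."""
    return [r for r in rows if all(v is not None for v in r.values())]


print(len(select_or(candidates)), len(select_and(candidates)))
```

The same data yields two rows under the OR reading and one under the AND reading, which is the kind of divergence that would affect a large extraction run on two implementations.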

How about the semantics of the CONTAINS constraint in the FROM clause? Is it mandatory? Can we have an optional CONTAINS, or should any data that fails all CONTAINS constraints be excluded?

What is the right way to extend AQL with functions that can be allowed to take variables as parameters? This is more of a syntax issue but it requires an underlying semantics to be defined nonetheless.

You and I could exchange examples and implementations till we turn blue, but until we have an AQL spec that clarifies the points I’m trying to make above, we’re all blind men in a room with an elephant between us 🙂

All the best
Seref

Ps: I’ve accepted the pull request and put the above queries there.
