This is part 2 of a series of posts discussing the particulars and as the title says, confusing bits of AQL. Part 1 is here. I’m hoping to discuss what choices implementers of clinical data query language designers have and the implications of such choices.
AND operator takes the stage
So let’s talk a bit more about the FROM clause, and the query semantics it may express. Let’s take the previous simple AQL query from part 1 and extend it a bit. First, the simplest form I used:
SELECT ...
FROM
EHR E
CONTAINS COMPOSITION C
CONTAINS ACTION A
Let’s assume that instead of the above query, our user is interested in fetching data related to a scenario in which a clinician observes some clinical condition, then instructs something. Our user is therefore looking for data that’ll be in a COMPOSITION that contains an OBSERVATION and an INSTRUCTION:. Apologies if you’re a clinical modeller and I just butchered my way through openEHR modelling, you’ll have to live with it for the moment.
SELECT OBS AS obs_col, INS AS ins_col
FROM
EHR E [ehr_id/value = 'xyz']
CONTAINS COMPOSITION C
CONTAINS (OBSERVATION OBS AND INSTRUCTION INS)
The example above uses the logical AND operator to express the requirement (constraint) that the user is interested in COMPOSITION instances that contain both an OBSERVATION and an INSTRUCTION.
It’s all rainbows and unicorns when things go our way, so given this high level mock representation of actual data:
<ehr>
<composition>
<observation/>
<instruction/>
</composition>
</ehr>
The results from the query above will be:
obs_col | ins_col |
observation | instruction |
The above example does not require the user to think about the semantics of the AND operator in depth. The query and the result set are aligned with most users’ intuition. Allow me to confuse you a bit with a syntactically (almost) valid, but semantically unusual (useless) query.
SELECT OBS_1 AS obs_col1, OBS_2 AS obs_col2
FROM
EHR E [ehr_id/value = 'xyz']
CONTAINS COMPOSITION C
CONTAINS (OBSERVATION OBS_1 AND OBSERVATION OBS_2)
This query is selecting two OBSERVATIONs that are under a COMPOSITION. I could have written a query with a nice background story but that’ll require more columns and more complex sample data, so I’m choosing to focus on the tricky bit I want to explore. If it makes you feel good, I can tell you that what I’m oversimplifying here will emerge within some other complex query context, but the essence of the matter will be the same.
First, would this COMPOSITION fit the bill?
<ehr>
<composition>
<observation/>
<instruction/>
</composition>
</ehr>
No. Because we’re telling the query engine, via the the FROM clause that the COMPOSITION should have two observations. Or are we? Here comes the first intuitive assumption about the AND operator:
The user is most likely to assume OBS_1 and OBS_2 are aliases for distinct OBSERVATIONs, i.e. the different OBSERVATION instances contained under the same COMPOSITION.
Did you notice I kept calling AND a logical operator? As per the grammar rules of AQL, AND has two operands, so it is not too far fetched to say that it’s semantics is based on that of AND operator from Boolean algebra’s AND operator (conjunction).
This overlaps with the filtering semantics of FROM clause pretty well (see part 1): filter out all data that does not satisfy the following criteria:
composition contains observation AND the same composition contains another observation.
Let’s make things a bit better by providing an example that satisfies this criteria:
<ehr>
<composition>
<observation uid ="o1">
<observation uid = "o2"/>
</composition>
</ehr>
Good. Now the filtering semantics won’t filter this composition out. I bastardised the mock data format even more to distinguish between two observation instances. It’ll help below. Since the AND operator filters this data in, the user can get the results as:
obs_col1 | obs_col2 |
<observation uid=”o1“> | <observation uid=”o2“> |
Hold on a second. Who says this won’t be the result set? :
obs_col1 | obs_col2 |
<observation uid=”o2“> | <observation uid=”o1“> |
or this one:
obs_col1 | obs_col2 |
<observation uid=”o1“> | <observation uid=”o2“> |
<observation uid=”o2“> | <observation uid=”o1“> |
or … this one!
obs_col1 | obs_col2 |
<observation uid=”o1“> | <observation uid=”o2“> |
<observation uid=”o2“> | <observation uid=”o1“> |
<observation uid=”o1“> | <observation uid=”o1“> |
<observation uid=”o2“> | <observation uid=”o2“> |
You (and many users) would consider A to be an intuitive result set. After all, our query has two columns declared via the SELECT clause, and there are two observations, so one for each column make sense and that’s one row. All clear, nothing to talk about. Except, it is not so clear for the implementer.
Implementer’s dilemma: combinations or permutations for AND semantics?
From the implementer’s perspective, there are multiple options for the semantics of AND. The two I can think of are combinations and permutations. I’m sure someone else can come up with something else entirely. Let’s assume we’re going with combinations, which means we have two columns of data for which, actual instances of observations must exist. The combination interpretation means that the order does not matter, so we’re fine as long as we’re able to produce a set with two elements, given data and constraints that define those elements. Which is what result sets A and B both satisfy above.
The tricky thing in our query is, there is no distinction between OBSERVATIONs OBS_1 and OBS_2. Let me repeat the query for convenience:
SELECT OBS_1 AS obs_col1, OBS_2 AS obs_col2
FROM
EHR E [ehr_id/value = 'xyz']
CONTAINS COMPOSITION C
CONTAINS (OBSERVATION OBS_1 AND OBSERVATION OBS_2)
So an observation instance can be returned in either of the columns. If we adopt a combination interpretation for AND matches, then both A and B are valid result sets. Such an implementation should come with the warning that even running the same query twice may return the two observations in different order, because the underlying semantics is that of combinations.
This is not so unusual: SQL has no guarantee of ordering of results. Even though running SQL queries on a single table seems to return results in the order you inserted them, it is very, very roughly a possible result of adding to the end of a file being cheaper than inserting anywhere else. So if you want to order results, you have to explicit about it with SQL.
So the implementer can do the same, but they may be uncomfortable with this behaviour and may think that they’d rather make sure that if a match can be returned in both columns, then it should be returned for both columns, which is the result in Result set C, which the permutation interpretation for AND semantics, in which, different ordering of elements corresponds to different results, because in permutation, order matters.
For some reason, no one raised this issue with screenwriters in Hollywood, so we don’t hear characters saying “all right you son of a ^**$!^* , what’s the permutation to the safe?” Maybe someday, once the computers in movies stop bleeping every time text scrolls on the screen….
If the implementer takes things a bit further and implements permutation logic with repeated elements allowed, then that’s Result set D for you.
The point of the above examples and discussion is not to take you to your highschool days. I’m also not suggesting any of the interpretations above are better than the others (with great personal restraint…) What I’m suggesting is:
- Implementers can have different interpretations of the same language features. This’ll cause queries returning different results in different implementations(!)
- Even though an implementer’s choice for semantics may not be that of a user’s, it is better to have a well defined basis for the semantics, such as combinations or permutations, rather than cook something up under the counter. When the rules of your query processing are well defined, collaboration and discussion between all stakeholders takes place on solid ground. “Let’s all go with combinations semantics for the AND operator in the FROM clause” is better than “so what does your thing do when data is like this and query is like that…”
- We may have to find ways of making mathematically solid and powerful definitions of operations more user friendly and easily understandable. The award is worth the effort. Click here to read how such an effort created the billion dollar relational database market and Structured Query Language (SQL).
I’ll stop here for the moment. My departing question is: what about consistency of semantics across language elements? I’m hoping to write the next post on OR operator and what difficult choices await between user convenience and query language coherence and consistency.
Thanks again Seref for another important topic on AQL. As I indicated in my comment on Part 1, I never considered FROM…CONTAINS to be filtering but purely source of the data being queried. Again, I can understand why this might be thought of as filtering given the OBJECT specified in the CONTAINS would normally have an archetype_id predicate, although most of your examples excluded this (for simplicity).
I actually saw these OBJECT and predicates (e.g. OBSERVATION[openEHR-RM-OBSERVATION.blood_pressure.v1]) as one target object. In SQL terms, this OBJECT[archetype_id] would be equivalent to a table object in a JOIN statement, hence a data source, not a filter.
Now in terms of CONTAINS .. AND I see this like an INNER JOIN, although we have much more control over the combinations of these in AQL. So your simplified query with INSTRUCTION AND OBSERVATION should definitely return a single row.
However, your simplified OBSERVATION AND OBSERVATION scenario is a little contrived, as you alluded to. But based on my perspective above, we are effectively doing an INNER JOIN on the same table in SQL. What would the result be? Well as discussed in Part 1, it would certainly be result set D based on the cartesian cross product with a join on the same table. As indicated, it contrived because we certainly wouldn’t do this in reality without specifying some additional criteria on the JOIN or WHERE clause if we did happen to JOIN on the same table. Having said that, it still needs to be handled and clearly understood in the specification.
Better example is CONTAINS (OBSERVATION[a] OBSa AND OBSERVATION[b] OBSb) where a and b are different archetype_ids and in my SQL analogy, different tables but the expected result set is the same, are cartesian cross product of OBSERVATION[a] and OBSERVATION[b].
Again, thank you raising the issues and letting me put forward my original rationale, as naive as it may be. Hopefully it helps get the specification to be less ambiguous and easier to implement and use.
I look forward to your discussion on CONTAINS .. OR.
Just wanted to thank you via this comment Heath. Your account of the original thinking behind the Aql’s design is hugely helpful, especially given I was not with Ocean at the time. Your input in the previous post and this one confirms my understanding of your and Chunlan’s thinking.
Interestingly, my independent interpretation also took me to a relational view of the semantics, but that may be due to both of us having a strong background in that model of data representation and querying.
Just one thing I should clarify: my use of the word ‘filter’ may indeed cause some misunderstandings. I’m not able to find a better word for it at the moment, but I see the constraints expressed by CONTAINS clauses in the FROM clause describing a structure, a shape of data that must be satisfied for a composition, and therefore I see those constraints as a filter that can be applied to all existing data in a CDR, leaving out the compositions that do not satisfy the shape/structure criteria. This may or may not make sense, but hopefully it may clarify my view and use of the word. It is meant to be a higher level concept than JOINs (or individual CONTAINS statements), something that is represented by the combination of them, i.e. the FROM clause as a whole.