Archetype Query Language, The Confusing Bits, II

This is part 2 of a series of posts discussing the particulars and as the title says, confusing bits of AQL. Part 1 is here. I’m hoping to discuss what choices implementers of clinical data query language designers have and the implications of such choices.

AND operator takes the stage

So let’s talk a bit more about the FROM clause, and the query semantics it may express. Let’s take the previous simple AQL query from part 1 and extend it a bit. First, the simplest form I used:

1
2
3
4
5
SELECT ...
  FROM
    EHR E
    CONTAINS COMPOSITION C
    CONTAINS ACTION A

Let’s assume that instead of the above query, our user is interested in fetching data related to a scenario in which a clinician observes some clinical condition, then instructs something. Our user is therefore looking for data that’ll be in a COMPOSITION that contains an OBSERVATION and an INSTRUCTION:. Apologies if you’re a clinical modeller and I just butchered my way through openEHR modelling, you’ll have to live with it for the moment.

1
2
3
4
5
SELECT OBS AS obs_col, INS AS ins_col
  FROM
    EHR E [ehr_id/value = 'xyz']
    CONTAINS COMPOSITION C
    CONTAINS (OBSERVATION OBS AND INSTRUCTION INS)

The example above uses the logical AND operator to express the requirement (constraint) that the user is interested in COMPOSITION instances that contain both an OBSERVATION and an INSTRUCTION.

It’s all rainbows and unicorns when things go our way, so given this high level mock representation of actual data:

1
2
3
4
5
6
<ehr>
  <composition>
    <observation/>
    <instruction/>
  </composition>
</ehr>

The results from the query above will be:

obs_col	ins_col
observation	instruction

The above example does not require the user to think about the semantics of the AND operator in depth. The query and the result set are aligned with most users’ intuition. Allow me to confuse you a bit with a syntactically (almost) valid, but semantically unusual (useless) query.

1
2
3
4
5
SELECT OBS_1 AS obs_col1, OBS_2 AS obs_col2
  FROM
    EHR E [ehr_id/value = 'xyz']
    CONTAINS COMPOSITION C
    CONTAINS (OBSERVATION OBS_1 AND OBSERVATION OBS_2)

This query is selecting two OBSERVATIONs that are under a COMPOSITION. I could have written a query with a nice background story but that’ll require more columns and more complex sample data, so I’m choosing to focus on the tricky bit I want to explore. If it makes you feel good, I can tell you that what I’m oversimplifying here will emerge within some other complex query context, but the essence of the matter will be the same.

First, would this COMPOSITION fit the bill?

1
2
3
4
5
6
<ehr>
  <composition>
    <observation/>
    <instruction/>
  </composition>
</ehr>

No. Because we’re telling the query engine, via the the FROM clause that the COMPOSITION should have two observations. Or are we? Here comes the first intuitive assumption about the AND operator:

The user is most likely to assume OBS_1 and OBS_2 are aliases for distinct OBSERVATIONs, i.e. the different OBSERVATION instances contained under the same COMPOSITION.

Did you notice I kept calling AND a logical operator? As per the grammar rules of AQL, AND has two operands, so it is not too far fetched to say that it’s semantics is based on that of AND operator from Boolean algebra’s AND operator (conjunction).

This overlaps with the filtering semantics of FROM clause pretty well (see part 1): filter out all data that does not satisfy the following criteria:

composition contains observation AND the same composition contains another observation.

Let’s make things a bit better by providing an example that satisfies this criteria:

1
2
3
4
5
6
<ehr>
  <composition>
    <observation uid ="o1">
    <observation uid = "o2"/>
  </composition>
</ehr>

Good. Now the filtering semantics won’t filter this composition out. I bastardised the mock data format even more to distinguish between two observation instances. It’ll help below. Since the AND operator filters this data in, the user can get the results as:

obs_col1	obs_col2

Result set A

Hold on a second. Who says this won’t be the result set? :

obs_col1	obs_col2

Result set B

or this one:

obs_col1	obs_col2

Result set C

or … this one!

obs_col1	obs_col2

Result set D

You (and many users) would consider A to be an intuitive result set. After all, our query has two columns declared via the SELECT clause, and there are two observations, so one for each column make sense and that’s one row. All clear, nothing to talk about. Except, it is not so clear for the implementer.

Implementer’s dilemma: combinations or permutations for AND semantics?

From the implementer’s perspective, there are multiple options for the semantics of AND. The two I can think of are combinations and permutations. I’m sure someone else can come up with something else entirely. Let’s assume we’re going with combinations, which means we have two columns of data for which, actual instances of observations must exist. The combination interpretation means that the order does not matter, so we’re fine as long as we’re able to produce a set with two elements, given data and constraints that define those elements. Which is what result sets A and B both satisfy above.

The tricky thing in our query is, there is no distinction between OBSERVATIONs OBS_1 and OBS_2. Let me repeat the query for convenience:

1
2
3
4
5
SELECT OBS_1 AS obs_col1, OBS_2 AS obs_col2
  FROM
    EHR E [ehr_id/value = 'xyz']
    CONTAINS COMPOSITION C
    CONTAINS (OBSERVATION OBS_1 AND OBSERVATION OBS_2)

So an observation instance can be returned in either of the columns. If we adopt a combination interpretation for AND matches, then both A and B are valid result sets. Such an implementation should come with the warning that even running the same query twice may return the two observations in different order, because the underlying semantics is that of combinations.

This is not so unusual: SQL has no guarantee of ordering of results. Even though running SQL queries on a single table seems to return results in the order you inserted them, it is very, very roughly a possible result of adding to the end of a file being cheaper than inserting anywhere else. So if you want to order results, you have to explicit about it with SQL.

So the implementer can do the same, but they may be uncomfortable with this behaviour and may think that they’d rather make sure that if a match can be returned in both columns, then it should be returned for both columns, which is the result in Result set C, which the permutation interpretation for AND semantics, in which, different ordering of elements corresponds to different results, because in permutation, order matters.

For some reason, no one raised this issue with screenwriters in Hollywood, so we don’t hear characters saying “all right you son of a ^**$!^* , what’s the permutation to the safe?” Maybe someday, once the computers in movies stop bleeping every time text scrolls on the screen….

If the implementer takes things a bit further and implements permutation logic with repeated elements allowed, then that’s Result set D for you.

The point of the above examples and discussion is not to take you to your highschool days. I’m also not suggesting any of the interpretations above are better than the others (with great personal restraint…) What I’m suggesting is:

Implementers can have different interpretations of the same language features. This’ll cause queries returning different results in different implementations(!)
Even though an implementer’s choice for semantics may not be that of a user’s, it is better to have a well defined basis for the semantics, such as combinations or permutations, rather than cook something up under the counter. When the rules of your query processing are well defined, collaboration and discussion between all stakeholders takes place on solid ground. “Let’s all go with combinations semantics for the AND operator in the FROM clause” is better than “so what does your thing do when data is like this and query is like that…”
We may have to find ways of making mathematically solid and powerful definitions of operations more user friendly and easily understandable. The award is worth the effort. Click here to read how such an effort created the billion dollar relational database market and Structured Query Language (SQL).

I’ll stop here for the moment. My departing question is: what about consistency of semantics across language elements? I’m hoping to write the next post on OR operator and what difficult choices await between user convenience and query language coherence and consistency.

AND operator takes the stage#

Implementer’s dilemma: combinations or permutations for AND semantics?#

AND operator takes the stage

Implementer’s dilemma: combinations or permutations for AND semantics?