
· 10 min read
CG/SQL Team

Introduction and Context

The general idea here is that you might want to store composite data in a single column in the database. This is a common way to get a more generic schema: you have one or more blob columns in your tables that hold a lot of data that doesn't have to be indexed. You could store it in other ways, like a JSON blob or some such, but we'll be using blobs as the basis for storage here -- hence the name "blob storage".

How do I define one of these blobs?

In SQL/CQL, the main way you define structures, especially those that you want to maintain, is with tables. Hence we introduce this:

@attribute(cql:blob_storage)
create table news_info(
  who text,
  what text,
  when_ long  -- timestamp of some kind
);

The blob_storage attribute indicates that the thing we're about to define is not really going to be a materialized table. As a result, you will not be able to (e.g.) DROP the table or SELECT from it, and there will be no schema upgrade for it should you request one. However, the usual schema rules still apply, which helps you create compatible versions of this structure. For instance, new columns can be added only at the end, and only if they are nullable. Here we add source to the schema in a hypothetical "version 6". Note that schema versions move forward globally in the schema, not locally in one table; this implies there are versions 1-5 elsewhere, not shown.

@attribute(cql:blob_storage)
create table news_info(
  who text,
  what text,
  when_ long,  -- timestamp of some kind
  source text @create(6)
);

Additionally, since the storage is not backed by SQL with SQL's constraint system, default values and constraints are not allowed in a table marked with cql:blob_storage; it's just data. Similarly, triggers, views, and indices may not use the "table".
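To illustrate, here's a hedged sketch of declarations the compiler would reject under these rules (the error comments are paraphrases, not actual compiler output):

@attribute(cql:blob_storage)
create table bad_news_info(
  who text default 'nobody',     -- error: default values are not allowed
  what text,
  when_ long check (when_ > 0)   -- error: constraints are not allowed
);

-- indices, triggers, and views over the "table" are likewise rejected
create index news_index on news_info (who);  -- error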

Where do you keep your blob storage?

Naturally, blob storage goes in a blob field, but recall CQL has discriminated types so we could make something like this:

create table info(
  id long primary key,
  news_info blob<news_info>
);

From a SQL perspective news_info is just a blob. That means if you want to do a WHERE clause or something like that on the info, you're out of luck. Maybe you could write a user-defined function to crack the blob and so forth, but really that isn't the point: if you're using this feature then, by construction, you don't need to index on this data. It's simply not suitable for use at all if you need field-at-a-time access within SQL.

How do I make one of these blobs?

The natural place that CQL stores structures is in value cursors, so the most natural thing to do is to provide a variation of the SET statement that lets you load a blob from a cursor, like so:

create proc make_blob(like news_info, out result blob<news_info>)
begin
  declare c cursor like news_info;
  fetch c from arguments;
  set result from cursor c;
end;

This declares a cursor, loads it from argument values, and converts it to a blob. Of course all of the usual cursor building forms can be used to power your blob creation; you just do one serialization at the end. The above is assembling a blob from arguments, but you could equally make the blob from data:

create proc get_news_info(id_ long not null, out result blob<news_info>)
begin
  -- use our columns sugar syntax for getting just news_info columns from
  -- a table with potentially lots of stuff (or an error if it's missing columns)
  declare c cursor for
    select columns(like news_info) from some_source_of_info where id = id_;
  fetch c;
  set result from cursor c;
end;

There are many cursor fetch forms, including dummy data forms and other interesting bits of sugar. You can fetch a cursor from arguments, from other cursors, and even combinations. We want all of that to work for blobs as well without adding tons of new syntax and code generation. The obvious way to accomplish that is for cursors to be the source of blobs.

How do I unpack one of these blobs?

Again, the normal way that you work with records in CQL is by creating suitable cursors. These can be economically accessed on a field-by-field basis. What we need is a way to easily recreate a cursor from the blob so we can read the data values. This gives rise to this form:

let b := (select news_info from info where id = id_ if nothing null);
declare c cursor like b;
fetch c from b; -- note this can fail
-- now use c.who, c.what, etc.

Data loaded in a cursor is very economical to access on a field-by-field basis, and, since the deserialization of the blob happens all at once, that is also economical. Importantly, we cannot assume that the blob is well formed; it could be coming from anywhere. For secure-code reasons we must assume it is hostile. Hence the decoding validates the shape, internal lengths, and so forth.
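Since the fetch can fail on malformed data, you'd typically guard it. Here's a minimal sketch using CQL's TRY/CATCH forms (the recovery logic is up to you):

create proc crack_news_blob_safely(b blob<news_info>)
begin
  declare c cursor like news_info;
  begin try
    fetch c from b;  -- throws if the blob fails shape/length validation
  end try;
  begin catch
    -- the blob was malformed or of the wrong type; handle or rethrow
    throw;
  end catch;
  -- c.who, c.what, c.when_ are now safe to use
end;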

If we had instead started with something like this:

let b := (select news_info from info where id = id_ if nothing null);

Then we might like to write:

if b.who == 'U2' then ... end if;

However, this sort of thing would be very uneconomical. For one thing, the blob does not have fixed-offset fields: It is carrying all the serialized data for the string fields and so forth. Each "dot" operation would be costly and, furthermore, each "dot" operation could fail if the blob is badly formed. Having to deal with a b.who that might fail seems very bad indeed.

Once you have the cursor you can make new blobs with different combinations, slice the cursor fields using the LIKE operator, return the cursor with OUT, or OUT UNION, or pass the blob fields as arguments to functions using the FROM forms. Cursors already are super flexible in terms of what you can do with their contents.
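For instance, here's a hedged sketch of carving a smaller blob out of a bigger one; news_summary is a hypothetical narrower shape invented just for this example:

@attribute(cql:blob_storage)
create table news_summary(
  who text,
  what text
);

create proc summarize(data blob<news_info>, out result blob<news_summary>)
begin
  declare c cursor like news_info;
  fetch c from data;                 -- crack the full blob (can fail)
  declare s cursor like news_summary;
  fetch s from c(like news_summary); -- slice out just the matching fields
  set result from cursor s;          -- re-serialize the narrower shape
end;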

What is the representation of one of these blobs?

It's important that we allow the blobs to evolve over time, so each blob has to be self-describing. We also want to be able to throw an error if you use the wrong kind of blob when loading a cursor, so the blob has to contain the following:

  • the number of columns in the blob data type when it was stored
  • the type of each field is encoded as a single plain-text character
    • the types are bool, int, long, (double) real, (string) text, blob;
    • we use 'f' (flag) for bools, hence "fildsb"
    • these are encoded with one letter each, upper case meaning 'not null' so the storage might be "LFss"
    • the buffer begins with a null terminated string that serves as both the count and the types
  • Each nullable field may be present or null; 1 bit is used to store this fact. The bits are in an array of bytes that comes immediately after the type info (which implicitly tells us its size)
  • Boolean values are likewise encoded as bits within the same array, so the total number of bits stored is nullables plus booleans (nullable booleans use 2 bits)
  • If you are reading a newer version of a record from an older piece of data that is missing a column then the column is assumed to be NULL
  • Any columns you add after the initial version (using @create) must be nullable; this is normal for adding columns to existing schema
  • Integers and longs are stored in varint format after zigzag encoding
  • Text is stored inline in null terminated strings (embedded nulls are not allowed in CQL text)
  • Nested blobs are stored inline, with a length prefix encoded like any other int
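Putting those rules together, here's a hedged sketch of how the original three-column news_info might lay out on disk (assuming a set bit means "present"; the exact bit conventions are illustrative):

-- news_info("Joe", "stuff", 1234): all three fields nullable, all present
-- "ssl\0"               type info: string, string, long (lowercase = nullable)
-- 0b00000111            one byte of present/null bits for the 3 nullable fields
-- "Joe\0"               who: stored inline, null terminated
-- "stuff\0"             what: stored inline, null terminated
-- varint(zigzag(1234))  when_: zigzag encoded, then varint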

What about more than one row in a blob?

Well, this is a bit more advanced but in principle this could be done as well. To make it useful, we would want to make a new cursor type that can iterate over rows in a blob. The syntax leaves room for this, something like so:

declare c cursor for blob b;
loop fetch c
begin
  -- the usual stuff
end;

This cursor would be another variation; it would keep its current index into the blob to read data out of it. Such a blob would also have to include a count of rows as part of its storage.

However, that's future looking. There is no such support at present.

Conclusion

With a fairly modest amount of work, we now support structured storage natively and have pretty rich language constructs. We carefully chose language constructs that lead to economical serialization and deserialization patterns and a record format that is versioned well, without resorting to something super loose like JSON.

As with many other features, it's possible to replace the (de)serialization with code of your choice by supplying your own runtime methods. So, for instance, Thrift encoding is possible; though it is more flexible than is strictly necessary for the few SQL data types, it might be convenient.

Storage types that are going to be persisted in the database or go over a wire-protocol should be managed like schema with the usual validation rules. On the other hand, formats that will be used only transiently in memory can be changed at whim from version to version. As mentioned above, the design specifically considers cases where a new client discovers an old-format blob (with fewer columns) and, the reverse, cases where an old client receives a datagram from a new client with too many columns.

Appendix

A more complete example is included for reference.

@attribute(cql:blob_storage)
create table news_info(
  who text,
  what text,
  when_ long  -- timestamp of some kind
);

-- a place where the blob appears in storage
create table some_table(
  x integer,
  y integer,
  news_blob blob<news_info>
);

-- a procedure that creates the blob from loose args
create proc make_blob(like news_info, out result blob<news_info>)
begin
  declare c cursor like news_info;
  fetch c from arguments;
  set result from cursor c;
end;

-- a procedure that cracks the blob
create proc crack_blob(data blob<news_info>)
begin
  declare c cursor like news_info;
  fetch c from data;
  out c;
end;

-- a procedure that cracks the blob into loose args if needed
-- the OUT statement was created specifically to let you avoid this sort of mass OUT awfulness
create proc crack_blob_to_vars(
  data blob<news_info>,
  out who text,
  out what text,
  out when_ long)
begin
  declare c cursor like news_info;
  fetch c from data;
  set who := c.who;
  set what := c.what;
  set when_ := c.when_;
end;

-- this just defines a shape for the part we are keeping from the original structure
declare proc my_basic_columns() (
  x int,
  y int
);

-- this just defines a shape for the result we want
-- we're never actually defining this procedure
declare proc my_result_shape() (
  like my_basic_columns,
  like news_info
);

create proc select_and_crack(whatever_condition bool)
begin
  declare c cursor for select * from some_table where whatever_condition;
  loop fetch c
  begin
    -- crack the blob in c
    declare n cursor like news_info;
    fetch n from blob c.news_blob;

    -- assemble the result we want from the parts we have
    declare result cursor like my_result_shape;
    fetch result from values (from c like my_basic_columns, from n);

    -- emit one row
    out union result;
  end;
end;

· 3 min read
CG/SQL Team

This new feature is a pretty simple generalization of the FROM construct as applied to expression lists. Note this isn't the same as using FROM the usual way in a select statement. An example will clear this right up.

Suppose you wanted to create a procedure that inserts a row into a table. You could write this:

create table Shape_xy (x int, y int);

create proc insert_xy(like Shape_xy)
begin
  insert into Shape_xy from arguments;
end;

Here we're using from to introduce some shape of values. It can appear in a lot of places.

Suppose now I want to insert two of those shapes. I could write this slightly more complicated procedure:

create proc insert_two_xy(xy1 like Shape_xy, xy2 like Shape_xy)
begin
  call insert_xy(from xy1);
  call insert_xy(from xy2);
end;

This also composes with cursors, so maybe you need to get two xy values from diverse locations. You can mix and match.

create proc write_xy()
begin
  declare C cursor for select T.x, T.y from somewhere T;
  fetch C;
  declare D cursor for select T.x, T.y from somewhere_else T;
  fetch D;
  if C and D then
    -- strange combos for illustration only
    call insert_two_xy(from C, from D);
    call insert_two_xy(from D, 5, 3);
    call insert_two_xy(4, 2, from C);
    call insert_two_xy(4, from C, 8);
  end if;
end;

So, as you can see, we can start from data in one or more cursors and we can turn that data, plus other expressions, into arguments, composing them as we like. This gives you the ability to call procedures and functions using shapes from a mixed set of sources. None of this is new.

However, the other places where expression lists happen -- fetch, update cursor, and insert -- only allowed you to specify a single object as the input source, such as insert into Shape_xy from C.

With a little work, this is trivially generalized so that all value lists can use the from construct.

Here's a complete example showing all the new forms.

create table Shape_xy (x int, y int);
create table Shape_uv (u text, v text);
create table Shape_uvxy (like Shape_xy, like Shape_uv);

create proc ShapeTrix()
begin
  declare C cursor for select Shape_xy.*, '1' u, '2' v from Shape_xy;
  fetch C;

  -- This new form is equivalent to the old form:
  --   insert into Shape_xy from C(like Shape_xy)
  -- but the values(...) form generalizes, see below.
  insert into Shape_xy values(from C like Shape_xy);

  declare D cursor for select * from Shape_uv;
  fetch D;

  declare R cursor like Shape_uvxy;

  -- This form works just like the function call case
  -- that was previously supported (it uses the same code even).
  -- This form lets you load R from any combination of sources
  -- as long as you make a suitable row.
  fetch R from values (from C like Shape_xy, from D);

  -- Same thing is supported in update cursor:
  -- the x, y come from C and the u, v come from D.u, D.v.
  -- Note that C.u and C.v would not even be type compatible.
  update cursor R from values (from C like Shape_xy, from D);

  -- And in a select-values clause
  declare S cursor for
    with cte(l,m,n,o) as (values (from C like Shape_xy, from D))
    select * from cte;
  fetch S;
  insert into Shape_uvxy from S;
end;

As you can see, you can choose a subset of the from shape using like.

These combinations let you flexibly assemble rows of data for cursors, calls, and insertions, using any combination of data sources you might want, without resorting to listing every column by hand.

· 5 min read
CG/SQL Team

Following on the heels of shared fragments, we're introducing the same kind of thing for fragments that are expressions rather than tables. The syntax is as follows:

-- this isn't very exciting because regular max would do the job
@attribute(cql:shared_fragment)
create proc max_func(x integer, y integer)
begin
  select case when x >= y then x else y end;
end;

The above can be used in the context of a SQL statement like so:

select max_func(T1.column1, T1.column2) the_max from foo T1;

The consequence of the above is that the body of max_func is inlined into the generated SQL. However, like the other shared fragments, this is done in such a way that the text can be shared between instances, so you only pay for the cost of the text* in your program one time, no matter how many times you use it.

* You still pay for the cost of a pointer to the text.

In particular, for the above, the compiler will generate the following SQL:

select (
  select case when x >= y then x else y end
  from (select T1.column1 x, column2 y))

But each line will be its own string literal, so, more accurately, it will concatenate the following three strings:

"select (",                                      // string1
" select case when x >= y then x else y end", // string2
" from (select T1.column1 x, column2 y))" // string3

Importantly, string2 is fixed for any given fragment. The only thing that changes is string3, i.e., the arguments. The C compiler, and then the linker, will unify the string2 literal across all translation units so you only pay for the cost of that text one time. It also means that the text of the arguments appears exactly one time, no matter how complex they are. For these benefits, we pay the cost of the select wrapper on the arguments. This cost is frequently negative. Consider the following:

select max_func((select max(T.m) from T), (select max(U.m) from U))

A direct expansion of the above would result in something like this:

case when (select max(T.m) from T) >= (select max(U.m) from U)
  then (select max(T.m) from T)
  else (select max(U.m) from U)
end;

The above could be accomplished with a simple pre-processor macro, but the fragments code generates the following:

select (
  select case when x >= y then x else y end
  from (select (select max(T.m) from T) x, (select max(U.m) from U) y))

Expression fragments can nest, so you could write:

@attribute(cql:shared_fragment)
create proc max3_func(x integer, y integer, z integer)
begin
  select max_func(x, max_func(y, z));
end;

Again, this particular example is a waste because regular max would already do the job.

To give another example, common mappings from one kind of code to another using case/when can be written and shared this way:

-- this sort of thing happens all the time
@attribute(cql:shared_fragment)
create proc remap(x integer not null)
begin
select case x
when 1 then 1001
when 2 then 1057
when 3 then 2010
when 4 then 2011
else 9999
end;
end;

In the following:

select remap(T1.c), remap(T2.d), remap(T3.e) from C T1, D T2, E T3;

The text for remap will appear three times in the generated SQL query but only one time in your binary.

Restrictions:

  • the function must consist of exactly one simple select statement
    • no FROM, WHERE, HAVING, etc. -- the result is an expression
  • the select list must have exactly one value
    • Note: the expression can be a nested SELECT which could have all the usual SELECT elements
  • the usual shared fragment rules apply, e.g. no out-parameters, exactly one statement, etc.

FAQ:

Q: Why does the expression fragment have a select in it?

A: Expression fragments are only interesting in SQL contexts where normal procedure and function calls are not available. The select keyword makes it clear to the author and the compiler that the expression will be evaluated by SQLite and the rules for what is allowed to go in the expression are the SQLite rules.

Q: Why no FROM clause?

A: We're trying to produce an expression, not a table-value with one column. If you want a table-value with one column, the original shared fragments solution already does exactly that. This gives you a solution for sharing code in, say, the WHERE clause or the select list.

Q: Isn't this just the same as doing, say, #define max_func(x,y) case when (x) >= (y) then x else y end;?

A: Macros can give you a ton of flexibility, but they have many problems:

  • if the macro has an error, you see the error in the call site with really bad diagnostic info
  • the compiler doesn't know that the sharing is going on so it won't be able to share text between call sites
  • the arguments can be evaluated many times each which could be expensive, bloaty, or wrong
  • there is no type-checking of arguments to the macro so you may or may not get compilation errors after expansion
  • you have to deal with all the usual pre-processor hazards

In general, macros can be used for meta-programming (as in C and C++), but that doesn't mean it's a good idea.

· 4 min read
CG/SQL Team

One of the signature features of the CQL language is the ability to use the "LIKE" form to slice out columns that conform to a shape. This notion appears in many places in the language. For instance, if I have a table Foo, I can make a cursor for that shape like so:

declare C cursor like Foo;

Which says I want the columns of C to be like the columns of Foo.

If I have a cursor D that has the Foo columns, but maybe more and maybe in a different order, I can load C as follows:

fetch C from D(like Foo)

Which again saves me from having to list all (potentially dozens) of the Foo columns. This construct is in many places:

create proc P(like Foo)
begin
  insert into Foo from arguments;
end;

even

create proc P(f like Foo, b like Bar)
begin
  insert into Foo from f;
  insert into Bar from b;
end;

And other examples... This is discussed more fully in Chapter 5 of the Guide.

However, one of the few places that shapes are interesting but not supported was in the select list. And so, just a couple of days ago, we added the COLUMNS construct to the language which allows for a sugared syntax for extracting columns in bulk. It's kind of a generalization of the select T.* pattern but with CQL-style slicing and type-checking.

These forms are supported:

  • columns from a join table or tables

      -- same as A.*
      select columns(A) from ...;

      -- same as A.*, B.*
      select columns(A, B) from ...;

  • columns from a particular join table that match a shape

      -- the columns of A that match the shape Foo
      select columns(A like Foo) from ...;

      -- get the Foo shape from A and the Bar shape from B
      select columns(A like Foo, B like Bar) from ...;

  • columns from any join table that match a shape

      -- get the Foo shape from anywhere in the join
      select columns(like Foo) from ...;

      -- get the Foo and Bar shapes, from anywhere in the join
      select columns(like Foo, like Bar) from ...;

  • specific columns

      -- x and y columns plus the Foo shape
      select columns(T1.x, T2.y, like Foo) from ...;

  • distinct columns from the above (not distinct values!)

      -- removes duplicate column names
      -- e.g. there will be one copy of 'pk'
      select columns(distinct A, B) from A join B using(pk);

      -- if both Foo and Bar have an (e.g.) 'id' field you only get one copy
      select columns(distinct like Foo, like Bar) from ...;

      -- if a specific column is mentioned it is always included
      -- but later clauses that are not a specific column will avoid it
      -- if F or B has an x it won't appear again, just T.x
      select columns(distinct T.x, F like Foo, B like Bar) from F, B ...;

Of course this is all just sugar, so it all ends up being a column list with table qualifications -- but the syntax is very powerful. For instance, it's great for narrowing a wide table, or for fusing joins that share common keys:

-- just the Foo columns
select columns(like Foo) from Superset_Of_Foo_From_Many_Joins_Even;

-- only one copy of 'pk'
select columns(distinct A, B, C) from
  A join B using (pk) join C using (pk);

And of course you can define shapes however you like and then use them to slice off column chunks of your choice. There are many ways to build up shapes from other shapes. Probably the easiest is to declare procedures that return the shape you want and never actually create them. E.g.

declare proc shape1() (x integer, y real, z text);
declare proc shape2() (like shape1, u bool, v bool);

With this combination you can easily define common column shapes and slice them out of complex queries without having to type the column names over and over...
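For instance, here's a hedged sketch that slices shape2 out of a hypothetical wide join (big_table, other_table, and their columns are invented for illustration):

create proc get_shape2_rows()
begin
  declare C cursor for
    select columns(like shape2)
    from big_table B
    join other_table O using (pk);
  loop fetch C
  begin
    -- C has exactly the shape2 columns: x, y, z, u, v
    call printf("%d %f %s\n", C.x, C.y, C.z);
  end;
end;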

Note that the COLUMNS(...) form is not a general replacement for the select list. For instance, general expressions are not allowed inside of COLUMNS(...), but, where extraction of lots of columns is needed, or even re-ordering of columns, it's a very good option indeed and it composes well with the other select features.

This was the last significant area where shapes are useful but totally absent.

· 12 min read
CG/SQL Team

One of the biggest changes to CQL in 2021 was the addition of control flow analysis. Given an understanding of how execution can flow within a user's program, CQL can do things like infer when a nullable variable must contain a nonnull value and improve its type appropriately, or issue an error when a nonnull variable may be used before it has been initialized.

Improving Nullability

As of mid-2021, and with increasing sophistication throughout the remainder of the year, CQL has been able to infer that a variable of a nullable type must not be NULL within a portion of a user's program:

DECLARE PROC another_proc(t0 TEXT NOT NULL, t1 TEXT NOT NULL);

CREATE PROC some_proc(t0 TEXT, t1 TEXT)
BEGIN
  IF t0 IS NULL RETURN;
  -- `t0` must be nonnull here if we made it this far

  IF t1 IS NOT NULL THEN
    -- `t0` and `t1` are nonnull here
    CALL another_proc(t0, t1);
  ELSE
    -- `t0` is nonnull here
    CALL another_proc(t0, "default");
  END IF;
END;

The ability of the CQL compiler to infer non-nullability greatly reduces the need to use the functions ifnull_crash and ifnull_throw to coerce values to a nonnull type—functions that, if they are ever used incorrectly, usually result in programs misbehaving.

For a detailed description and many additional examples of what is possible—CQL can handle much more than what is shown above—see the user guide's section on nullability improvements.

Enforcing Initialization Before Use

In CQL, it is possible to declare a variable of a nonnull type without giving it a value. If the variable is of a non-reference type, it is assigned a default value of 0. If the variable is of a reference type (BLOB, OBJECT, or TEXT), however, it is simply set to NULL despite the nonnull type as no default value exists.

To help prevent accessing a reference variable of a nonnull type and getting back NULL, CQL recently began enforcing that such variables are initialized before use. The following code, therefore, now results in an error:

DECLARE t TEXT NOT NULL;
CALL requires_text_notnull(t); -- error!

Using the same engine for control flow analysis that is behind nullability improvements, CQL can improve a variable to be initialized:

DECLARE t TEXT NOT NULL;

IF some_condition THEN
  SET t := "some example text";
  -- `t` is initialized here
ELSE
  THROW;
END IF;
-- `t` must be initialized here if we made it this far

CALL requires_text_notnull(t); -- okay!

Thanks to CQL's ability to understand the control flow of users' programs, the above example works just fine.

CQL now also enforces that all procedures with OUT parameters of a nonnull reference type properly initialize said parameters before they return:

CREATE PROC some_proc(b BOOL NOT NULL, OUT t TEXT NOT NULL)
BEGIN
  IF b THEN
    SET t := another_proc(t);
    -- `t` is initialized here
  ELSE
    SET t := yet_another_proc(t);
    -- `t` is initialized here
  END IF;
  -- `t` must be initialized here because all possible
  -- branches initialized it, so `some_proc` is okay!
END;

As with nullability improvements, understanding the nuances of what will be considered initialized is easier if one has a sense for how control flow analysis works in the compiler.

Understanding Control Flow Analysis in CQL

To develop an intuition for how control flow analysis works in CQL, let's begin by taking a look at the following example:

DECLARE PROC p1(OUT t TEXT NOT NULL);
DECLARE PROC p2(i INTEGER NOT NULL, OUT t TEXT NOT NULL);

CREATE PROC p0(b BOOL, i INTEGER, OUT t TEXT NOT NULL)
BEGIN
  IF i IS NULL THEN
    IF b THEN
      CALL p1(t);
    ELSE
      SET t := "";
    END IF;
    RETURN;
  END IF;

  IF i == 0 THEN
    SET t := "";
  ELSE IF i > 0 THEN
    SET t := p2(i);
  ELSE
    THROW;
  END IF;
END;

There are a couple of things we must verify in order to ensure the code is type-safe:

  • With regard to the parameters of p0: Since t is an OUT parameter of type TEXT NOT NULL, p0 must always assign it a value before it returns. If it does not, a caller of p0 may end up with a variable of a NOT NULL type that actually contains NULL.

  • With regard to the calling of p2 in p0: Since p2 requires a first argument of type INTEGER NOT NULL, some sort of check must be performed to ensure that i is not NULL before p2(i) is executed.

If we carefully study p0, we can determine that both of the above conditions are satisfied. Making this determination, however, is not exactly trivial, and real-world code is often significantly more complicated than this—and it evolves over time. For these reasons, having a compiler that can make such determinations automatically is critical; most modern production compilers perform these sorts of checks.

The easiest way to understand how CQL does its job is to take the above example line-by-line. This is not exactly how CQL works under the hood, but it should provide an intuitive sense of how control flow analysis works in the compiler:

==> CREATE PROC p0(b BOOL, i INTEGER, OUT t TEXT NOT NULL)
    BEGIN
      ...
    END;

Right away, CQL can see that t is declared both OUT and TEXT NOT NULL and thus requires initialization before p0 returns. CQL can, therefore, add a fact about what it is analyzing to its previously empty set of facts:

  • t requires initialization.

We can then continue:

==> IF i IS NULL THEN
      ...
    END IF;

Here, the compiler notices that we're at an IF statement. In CQL, IF statements contain one or more branches, and the compiler considers every IF to be the start of a branch group. The same line also indicates the condition for the first branch: i IS NULL. CQL can update its set of facts:

  • t requires initialization.
  • In branch group:
    • In branch when i IS NULL:

It then proceeds to the next line:

    IF i IS NULL THEN
==>   IF b THEN
        CALL p1(t);
      ELSE
        SET t := "";
      END IF;
      RETURN;
    END IF;

Another branch group and branch:

  • t requires initialization.
  • In branch group:
    • In branch when i IS NULL:
      • In branch group:
        • In branch when b:

Continuing:

    IF i IS NULL THEN
      IF b THEN
==>     CALL p1(t);
      ELSE
        SET t := "";
      END IF;
      RETURN;
    END IF;

Since p1 takes an OUT argument of type TEXT NOT NULL, this call initializes t, and so CQL can update its set of facts once again:

  • t requires initialization.
  • In branch group:
    • In branch when i IS NULL:
      • In branch group:
        • In branch when b:
          • t is initialized.

Jumping ahead a couple of lines:

    IF i IS NULL THEN
      IF b THEN
        CALL p1(t);
      ELSE
==>     SET t := "";
      END IF;
      RETURN;
    END IF;

At this point, we're in another branch. We also have yet another fact to add because t is initialized here as well due to the SET:

  • t requires initialization.
  • In branch group:
    • In branch when i IS NULL:
      • In branch group:
        • In branch when b:
          • t is initialized.
        • In ELSE branch:
          • t is initialized.

Moving ahead one more line, things get a bit more interesting:

    IF i IS NULL THEN
      IF b THEN
        CALL p1(t);
      ELSE
        SET t := "";
==>   END IF;
      RETURN;
    END IF;

Here, we're at the end of an IF, and thus the end of a branch group. Whenever CQL reaches the end of a branch group, it merges the effects of all of its branches.

One very important thing to note here is that the current branch group has an ELSE branch, and so the set of branches covers all possible cases. That means if something is initialized in every branch within the branch group, we can consider it to be initialized after the branch group has ended: Initialization will always occur. This allows CQL to simplify its set of facts as follows as it leaves the branch group:

  • t requires initialization.
  • In branch group:
    • In branch when i IS NULL:
      • t is initialized.

Stepping forward one line again, we reach a RETURN:

    IF i IS NULL THEN
      ...
==>   RETURN;
    END IF;

We're now at a point where we can exit the procedure. CQL will, therefore, verify that if something requires initialization, it has been initialized. Since we have both the facts "t requires initialization" and "t is initialized", all is well!

The fact that the current branch returns early is added to the set of facts:

  • t requires initialization.
  • In branch group:
    • In branch when i IS NULL:
      • t is initialized.
      • Returns.

Moving ahead one more line, we reach the end of another branch and branch group, and again something interesting happens:

    ...
    IF i IS NULL THEN
      ...
==> END IF;

Upon ending the branch group, we know that the branch group has exactly one branch, that the branch is entered only when i IS NULL, and that the branch returns. What that tells CQL is that, if execution is going to continue after the branch group, its sole branch must not have been taken, and so CQL knows the opposite of its condition for entry will be true from this point onward:

  • t requires initialization.
  • i is not null.

The next IF is rather similar to what we've seen already in its structure, so we can jump ahead several lines to the next point of interest:

    IF i == 0 THEN
      SET t := "";
    ELSE IF i > 0 THEN
==>   SET t := p2(i);
    ELSE
      THROW;
    END IF;

Before we analyze the above-indicated line, we have the following set of facts:

  • t requires initialization.
  • i is not null.
  • In branch group:
    • In branch when i == 0:
      • t is initialized.
    • In branch when i > 0:

In the call p2(i), we know that i was declared to have type INTEGER and that p2 requires an INTEGER NOT NULL, but we also have the fact "i is not null". For this reason, we can consider p2(i) to be a valid call. We can also add the fact that t is initialized to our current set of facts:

  • ...
    • In branch when i > 0:
      • t is initialized.

NOTE: When it comes to code generation, it is not so simple as to say p2(i) is valid and proceed as usual. That's because p2 expects an argument of type INTEGER NOT NULL, but we merely have a value of type INTEGER that we happen to know cannot be null: INTEGER NOT NULL and INTEGER do not share the same underlying representation, and so we cannot pass the declared-nullable variable i directly to p2. To solve this problem, CQL rewrites the expression such that p2(i) becomes p2(cql_inferred_notnull(i)), where cql_inferred_notnull is an internal-only function that handles the nullable-to-nonnull representational conversion for us. This explains its presence in the following examples.

Jumping ahead again, we encounter a THROW:

    IF i == 0 THEN
      SET t := "";
    ELSE IF i > 0 THEN
      SET t := p2(cql_inferred_notnull(i));
    ELSE
==>   THROW;
    END IF;

The fact that the branch will throw is added to the current set of facts:

  • t requires initialization.
  • i is not null.
  • In branch group:
    • In branch when i == 0:
      • t is initialized.
    • In branch when i > 0:
      • t is initialized.
    • In ELSE branch:
      • Throws.

We then proceed to the end of the IF:

    IF i == 0 THEN
      SET t := "";
    ELSE IF i > 0 THEN
      SET t := p2(cql_inferred_notnull(i));
    ELSE
      THROW;
==> END IF;

Once again, CQL merges the effects of the branches in the branch group to finish the analysis of the IF. Since it can see that t was initialized in all branches except the one that throws, and since the branches cover all possible cases, the set of facts is simplified as follows given the knowledge that, if THROW was not encountered, t must have been initialized:

  • t requires initialization.
  • i is not null.
  • t is initialized.

Moving ahead one final time, we encounter the end of the procedure:

    CREATE PROC p0(b BOOL, i INTEGER, OUT t TEXT NOT NULL)
    BEGIN
      ...
==> END;

The only thing left to do at this point is to validate that anything requiring initialization has been initialized. Since we have both "t requires initialization" and "t is initialized", everything is in order.

Looking Ahead

As a recently generalized piece of functionality within the CQL compiler, control flow analysis will soon be used to enforce additional properties of users' programs. In particular, CQL will be able to ensure that cursors are always fetched before they're used and that cursors are always checked to have a row before their fields are accessed.

Hopefully you now understand the fundamentals of control flow analysis in CQL and the benefits it brings to your programs. Best wishes for 2022!

· 8 min read
CG/SQL Team

Shared fragments are a real game-changer for CQL.

Remember, these are designed to let you write part of a query and then substitute in parameters. So it's like a parameterized view in normal SQL terms. But actually it's more powerful than that; fragments also provide features that are more like Java generics. Let's do some examples.

Suppose we have a procedure which looks something like this:

CREATE PROC get_stuff(to_include_ text, to_exclude_ text)
BEGIN
  WITH
  to_exclude_recursive_query (tok, rest) AS (
    SELECT
      '',
      to_exclude_ || ','
    UNION ALL
    SELECT
      substr(rest, 1, instr(rest, ',') - 1),
      substr(rest, instr(rest, ',') + 1)
    FROM to_exclude_recursive_query
    WHERE rest <> ''
  ),
  to_exclude (id) AS (
    SELECT CAST(tok AS LONG)
    FROM to_exclude_recursive_query
    WHERE tok <> ''
  ),
  to_include_recursive_query (tok, rest) AS (
    SELECT
      '',
      to_include_ || ','
    UNION ALL
    SELECT
      substr(rest, 1, instr(rest, ',') - 1),
      substr(rest, instr(rest, ',') + 1)
    FROM to_include_recursive_query
    WHERE rest <> ''
  ),
  to_include (id) AS (
    SELECT CAST(tok AS LONG)
    FROM to_include_recursive_query
    WHERE tok <> ''
  )
  SELECT * from stuff S
  WHERE
    S.id in (select * from to_include) AND
    S.id not in (select * from to_exclude);
END;

With shared fragments you could write something like this:

@attribute(cql:shared_fragment)
CREATE PROC split_commas(str text)
BEGIN
  WITH splitter(tok, rest) AS (
    SELECT '', IFNULL(str || ',', '')
    UNION ALL
    SELECT
      substr(rest, 1, instr(rest, ',') - 1),
      substr(rest, instr(rest, ',') + 1)
    FROM splitter
    WHERE rest <> '')
  select tok from splitter where tok <> '';
END;

@attribute(cql:shared_fragment)
CREATE PROC ids_from_string(str text)
BEGIN
  WITH toks(tok) AS (CALL split_commas(str))
  SELECT CAST(tok AS LONG) AS id from toks;
END;

We now have a shared fragment called split_commas which can live anywhere, maybe in a standard include file. There are some immediate benefits:

  • the fragment is compiled on its own before usage so any errors are reported in the fragment
    • in contrast, with macros you get errors when you try to use the macro, and they are all charged to the line the macro appears on, so it's hopeless to figure out what's wrong
  • the text of the shared fragment will be the same, so it can be re-used in all locations; this can be a big binary size savings
    • in contrast, macros are pre-processed before CQL ever sees the text so it doesn't "know" it's the same code
  • fragments compose cleanly as we'll see; and they have typed arguments
  • fragments can be independently tested outside of the context in which they appear
    • make a test context and explore the fragment, no worries about it breaking on edge cases later

The first fragment, split_commas, does exactly what it sounds like: it takes a string argument and makes a list of the strings in it.

The second fragment uses the first to split a string and then it converts all the strings to long integers.

Now instead of the above we could write:

#include <stringsplit.sql> /* wherever you put the fragments */

CREATE PROC get_stuff(to_include_ text, to_exclude_ text)
BEGIN
  WITH
  to_include(id) AS (CALL ids_from_string(to_include_)),
  to_exclude(id) AS (CALL ids_from_string(to_exclude_))
  SELECT * from stuff S
  WHERE
    S.id in (select * from to_include) AND
    S.id not in (select * from to_exclude);
END;

And of course since ids_from_string lives in a shared location (stringsplit.sql), these fragments can be used all over your code and you'll only pay for the text one time. This gives you great flexibility, very much like parameterized views. You can have any number of these fragments, they will share code, they compose like crazy, and there is no schema cost!

Generics

A series of useful fragments for generating data would go a long way but there are other applications of fragments and you might want to operate on various data sources without hard coding them all. This is where the generic form of fragments comes in. Consider a case where you want to be able to filter stuff by say name and age. You could create this fragment:

@attribute(cql:shared_fragment)
CREATE PROC filter_stuff(
  pattern_ text not null,
  min_age_ integer not null,
  max_age_ integer not null)
BEGIN
  WITH
  source(*) LIKE stuff
  SELECT * from source S
  WHERE
    S.name LIKE pattern_ AND
    S.age BETWEEN min_age_ and max_age_;
END;

Now imagine that we had added the shared fragment annotation to get_stuff (just like the above). We could then write the following:

CREATE PROC the_right_stuff(
  to_include_ text,
  to_exclude_ text,
  pattern_ text not null,
  min_age_ integer not null,
  max_age_ integer not null)
BEGIN
  WITH
  get_stuff(*) AS (call get_stuff(to_include_, to_exclude_)),
  filter_stuff(*) AS (call filter_stuff(pattern_, min_age_, max_age_)
                      using get_stuff as source)
  SELECT * from filter_stuff S
  ORDER BY name
  LIMIT 5;
END;

Or with some sugar to forward arguments and assume the CTE name matches, more economically:

CREATE PROC the_right_stuff(
  to_include_ text,
  to_exclude_ text,
  pattern_ text not null,
  min_age_ integer not null,
  max_age_ integer not null)
BEGIN
  WITH
  (call get_stuff(*)),
  (call filter_stuff(*) using get_stuff as source)
  SELECT * from filter_stuff S
  ORDER BY name
  LIMIT 5;
END;

The arg syntax (*) simply indicates that the arg names in the caller should match the same names in the callee. In general, call foo(*) expands to call foo(from arguments like foo arguments); * is rather more economical than that.
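In other words, inside the_right_stuff the two lines below mean the same thing; the (*) form just saves the typing:

-- the sugared form
(call get_stuff(*)),

-- what it expands to
(call get_stuff(from arguments like get_stuff arguments)),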

In this example filter_stuff doesn't know where its data will be coming from, you bind its table parameter source to a compatible data source of your choice. For example, this would also be legal:

CREATE PROC almost_the_right_stuff(
  pattern_ text not null,
  min_age_ integer not null,
  max_age_ integer not null)
BEGIN
  WITH
  (call filter_stuff(*) using stuff as source)
  SELECT * from filter_stuff S
  ORDER BY name
  LIMIT 5;
END;

Conditionals

It's often desirable to have some options in the generated SQL without having to fork your entire query. Shared fragments address this as well with the conditional form. In this form the top level of the fragment is an IF statement and there are a number of alternatives. Here are some simple modifications to the above that illustrate some of the possibilities.

@attribute(cql:shared_fragment)
CREATE PROC filter_stuff(
  pattern_ text,
  min_age_ integer not null,
  max_age_ integer not null)
BEGIN
  IF pattern_ IS NOT NULL THEN
    WITH
    source(*) LIKE stuff
    SELECT * from source S
    WHERE
      S.name LIKE pattern_ AND
      S.age BETWEEN min_age_ and max_age_;
  ELSE
    WITH
    source(*) LIKE stuff
    SELECT * from source S
    WHERE
      S.age BETWEEN min_age_ and max_age_;
  END IF;
END;

In the above, if the input pattern is NULL then it is not considered; it won't be part of the generated SQL at all. Note that source (same name) appears in both branches and therefore must be the same type, as it will be fulfilled by one actual table parameter.

Now the above could have been achieved with something like this:

pattern_ IS NULL OR S.name LIKE pattern_

But that would have no useful selectivity. With the conditional form, in general you might be able to avoid joins and so forth entirely. Consider something like this hypothetical:

@attribute(cql:shared_fragment)
CREATE PROC filter_stuff(
  pattern_ text,
  min_age_ integer not null,
  max_age_ integer not null)
BEGIN
  IF pattern_ IS NOT NULL THEN
    WITH
    source(*) LIKE stuff
    SELECT DISTINCT S.* from source S
    INNER JOIN keywords K
    WHERE
      K.keyword LIKE pattern_ AND
      S.age BETWEEN min_age_ and max_age_;
  ELSE
    WITH
    source(*) LIKE stuff
    SELECT * from source S
    WHERE
      S.age BETWEEN min_age_ and max_age_;
  END IF;
END;

Here we save the DISTINCT and the JOIN if there is no pattern, which might be important. Of course there are probably better ways to match keywords, but this is just an illustration of what's possible.

There are numerous ways this flexibility can be used. Again, this is a simple example; a real schema transform would be more complex.

@attribute(cql:shared_fragment)
CREATE PROC get_stuff(
  to_include_ text,
  to_exclude_ text,
  schema_v2 bool not null)
BEGIN
  IF schema_v2 THEN
    WITH
    to_include(id) AS (CALL ids_from_string(to_include_)),
    to_exclude(id) AS (CALL ids_from_string(to_exclude_))
    SELECT * from stuff_2 S
    WHERE
      S.id in (select * from to_include) AND
      S.id not in (select * from to_exclude);
  ELSE
    WITH
    to_include(id) AS (CALL ids_from_string(to_include_)),
    to_exclude(id) AS (CALL ids_from_string(to_exclude_))
    SELECT * from stuff S
    WHERE
      S.id in (select * from to_include) AND
      S.id not in (select * from to_exclude);
  END IF;
END;

Validation

All of this requires a bunch of checking, at least this:

  • the LIKE forms can only appear in a shared fragment
  • the CALL forms must refer to shared fragments
  • the CALL args must be compatible
  • the number and type of the provided tables in USING must be correct
  • the shared fragment must be a single select statement or an IF statement with an ELSE
    • the statement lists of the IF/ELSE combo must all be single select statements
    • all the choices in the IF block must return the same shape (this is normal for procedures)
  • the shared fragment can't have any out arguments
  • the provided fragment arguments cannot themselves use the nested SELECT construct

I think this is a total game changer for SQL authoring and should go a long way to making it easier to get your work done on SQLite. A good base set of shared fragments as part of any suite of procedures seems like a good idea.

There are more details in the section on shared fragments in Chapter 14 of The Guide.

These features are in the current build as of today (12/14/2021).

Happy Holidays and stay safe.

· One min read
CG/SQL Team

We've long needed a way to see the most recent SQLite result code in the context of, say, a catch block (most other times you can assume SQLITE_OK was the last result code, otherwise control flow would have transferred elsewhere; sometimes SQLITE_ROW or SQLITE_DONE might be the current result code).

Soon we'll provide a sample header that declares the most common error codes in an enum, but for now you could do something like this:

-- pasted from sqlite.c
#define SQLITE_BUSY 5 /* The database file is locked */

-- this is a contrived example
create proc get_first_foo(out can_retry bool not null)
begin
  -- can_retry is set to 0 automatically, language semantics guarantee this

  begin try
    select foo from bar limit 1;
  end try;
  begin catch
    set can_retry := (@rc == SQLITE_BUSY);
    throw; -- rethrow the original error
  end catch;
end;
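Until that sample header ships, here's a hedged sketch of the enum approach using CQL's declare enum form (the constant values come from the standard SQLite headers):

declare enum sqlite_result integer (
  ok = 0,     -- SQLITE_OK
  busy = 5,   -- SQLITE_BUSY
  row = 100,  -- SQLITE_ROW
  done = 101  -- SQLITE_DONE
);

-- then, inside the catch block:
-- set can_retry := (@rc == sqlite_result.busy);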

· 2 min read
CG/SQL Team

The nested select statement is frequently misused; in particular, if you get no rows back from the expression, that's an error. So for instance:

set x_ := (select x from foo.x where id = y);

This will throw (with a peculiar error, SQLITE_DONE) if there is no such row.

Sometimes people try to fix this problem with a nullcheck:

set x_ := IFNULL((select x from foo.x where id = y), -1);

That doesn't help at all. It's not a null-value situation; there's no row at all.

set x_ := (select IFNULL(x, -1) from foo.x where id = y);

Is likewise unhelpful. To help with this situation we add two forms:

-- useful if foo.x is already known to be not null
set x_ := (select x from foo.x where id = y IF NOTHING -1);

-- useful if foo.x might be null
set x_ := (select x from foo.x where id = y IF NOTHING OR NULL -1);

Both of these deal with the case where there is no row. The second lets you have a simple default for both no row or null value. That form is equivalent to:

set x_ := (select IFNULL(x,-1) from foo.x where id = y IF NOTHING -1);

i.e. both problem cases are handled.

Of course the -1 here could be any valid expression, even a second (select...)
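Here's a sketch of that: the default value is itself a nested select (illustrative only):

-- fall back to the smallest x if the requested row is absent
set x_ := (select x from foo.x where id = y
           if nothing (select min(x) from foo.x));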

· One min read
CG/SQL Team

Important change in CQL semantics.

Previously, if you did an early return, or fell through the end, from a procedure that is supposed to return a result set but did not in fact provide one, you would get a fake SQLITE_ERROR. Now you get an empty result set for "free".

This interpretation seems much more natural and avoids a lot of really annoying stub selects to comply with the contract.

This also works for the out statement in the same fashion.

If you want to return an error, use throw. This is a lot more natural...

examples:

-- this gives you an empty result set if x <= 0
create proc maybe_return(x integer)
begin
  if x > 0 then
    select * from foo where foo.y > x;
  end if;
end;

-- so does this
create proc maybe_return(x integer)
begin
  if x <= 0 then
    return;
  end if;
  select * from foo where foo.y > x;
end;

-- so does this
create proc maybe_out(x integer)
begin
  if x > 0 then
    declare C cursor for select etc.
    out C;
  end if;
end;
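And if an error really is what you want, here's a minimal sketch of the throw alternative:

-- this gives an error (not an empty result set) if x <= 0
create proc must_return(x integer)
begin
  if x <= 0 then
    throw;
  end if;
  select * from foo where foo.y > x;
end;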

· 3 min read
CG/SQL Team

Further adding to the type calculus of the CQL language, we introduced the ability to encode the "kind" of primitive types. This can be used in a number of ways -- like "units" for natural things and like a "type" for synthetic keys and other such things. It's easier to illustrate by example.

declare job_id type long<job_id>;
declare person_id type long<person_id>;

declare j job_id;
declare p person_id;

set p := j; -- this is an error

With the above in place, other expressions like p == j would also produce errors as these long values are no longer type compatible. This is a great way to add enforcement to your schema and procedures. Likewise you can use these annotations to add "units" to your data types. e.g.

declare meters type real<meters>;
declare grams type real<grams>;

declare m meters;
declare g grams;

Variables of type grams (e.g. g) are not compatible with variables of type meters (e.g. m) even though both are real.

Likewise, attempting to insert grams into a column that is typed to meters will give errors. Of course SQLite doesn't know about any of this, so all the <> stuff is removed in the generated SQL. This is just about type enforcement at compile time.

Enumerations like:

declare enum surface integer (paper, canvas);
declare enum writer integer (pen, paper, brush);

enable this:

declare s surface;       -- s is now of type integer<surface>
declare w writer;        -- w is now of type integer<writer>
set s := surface.paper;  -- ok
set s := writer.pen;     -- error
set w := writer.brush;   -- ok
case when s == w then 1 else 0 end; -- error (w/s not comparable)
set w := s;              -- error again

additionally in DML/DDL:

create table draw_action(
  w writer,
  s surface
);

insert into draw_action values(w, s); -- ok
insert into draw_action values(s, w); -- error!

So the type kinds can be quite helpful when dealing with loose variables.

The notion of specific types was added to the language nearly two years ago to support the object type because there was a great desire to prevent object<dictionary> being assigned from object<list> but this "type kind", whether it's with units (e.g. "meters", "grams") or a type name (e.g. "job_id") adds a lot of high value type checking.

The kind can be added, stripped, or changed with a cast operation and the type system allows a constant or variable with no kind (e.g. "1") to mix and match with any kind so long as the base type is compatible as usual. So you get the most value by using the specific type consistently but you won't go insane adding test cases that use constants for instance.
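A quick sketch of that flexibility (the error comments paraphrase the kind of diagnostics you'd see):

declare m meters;
declare g grams;

set m := 1.5;                     -- ok: a constant has no kind and mixes freely
set g := cast(m as real<grams>);  -- ok: an explicit cast changes the kind
set g := m;                       -- error: meters is not compatible with grams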

As of this writing the expression kinds are checked for compatibility everywhere plus or minus bugs. There are extensive tests.