CQL has a lot of schema knowledge already and so it's well positioned to think about schema upgrades and versioning.
It seemed essential to be able to record changes to the schema over time so CQL got an understanding of versioning. This lets you do things like:
- ensure columns are only added where they should be
- generate compiler errors if you try to access columns that are deprecated
- move from one version to another tracking schema facets that have to be added
There are three basic flavors of annotation:

- `@create(version [, migration proc])`
- `@delete(version [, migration proc])`
- `@recreate`
They have various constraints:

- `@create` and `@delete` can only be applied to tables and columns
- `@recreate` can only be applied to tables (nothing else needs it anyway)
- `@recreate` cannot mix with `@create` or `@delete`
- `@recreate` can include a group name as in `@recreate(musketeers)`; if a group name is specified then all the tables in that group are recreated if any of them change
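As a sketch of how the annotations read in practice (table, column, and proc names here are illustrative, not from the source):

```sql
create table contacts(
  id integer not null,
  name text,
  -- this column appeared in version 2; CreateEmailProc is a migration proc
  email text @create(2, CreateEmailProc)
);

-- a cache table that can simply be rebuilt; part of the "musketeers" group
create table name_cache(
  id integer not null,
  name text
) @recreate(musketeers);
```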
Indices, Views, and Triggers are always "recreated" (just like tables can be) and so neither the `@recreate` nor the `@create` annotations are needed (or allowed). However, when an Index, View, or Trigger is retired it must be marked with `@delete` so that it isn't totally forgotten but can be deleted anywhere it might still exist. Note that when one of these items is deleted the definition is not used, as it will only be dropped anyway; the simplest creation of the object with the correct name will do the job as a tombstone.
For example, `create view used_to_be_fabulous as select 1 x @delete(12);` suffices to drop the `used_to_be_fabulous` view in version 12 no matter how complicated it used to be. Its `CREATE VIEW` will not be emitted into the upgrade procedure in any case. Similarly, trivial indices and triggers of the correct name can be used for the tombstone.
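Illustrative tombstones for an index and a trigger might look like this (the names are hypothetical); only the correct name matters, since the object will simply be dropped:

```sql
-- the simplest index with the right name works as a tombstone
create index old_index on some_table (id) @delete(12);

-- likewise a trivial trigger with the right name
create trigger old_trigger
  before delete on some_table
begin
  select 1;
end @delete(12);
```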
In addition, if there is some data migration that needs to happen at a particular schema version that isn't associated with any particular change in schema, you can run an ad hoc migrator at any time. The syntax for that is `@schema_ad_hoc_migration(version, migration proc);`. Ad hoc migrations are the last to run in any given schema version; they happen after table drop migrations.
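For instance (the proc name here is hypothetical):

```sql
-- MyAdHocMigration runs when upgrading through version 5,
-- after any table drop migrations in that version
@schema_ad_hoc_migration(5, MyAdHocMigration);
```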
@create declares that the annotated object first appeared in the indicated version, and at that time the migration proc needs to be executed to fill in default values, denormalize values, or whatever the case may be.
@delete declares that the annotated object disappeared in the indicated version, and at that time the migration proc needs to be executed to clean up the contents, or potentially move them elsewhere.
`@recreate` declares that the annotated object can be dropped and recreated when it changes; there is no need to preserve its contents during an upgrade. Such objects may be changed arbitrarily from version to version:

- no columns in a `@recreate` table may have `@create` or `@delete` (neither is needed anyway)
- therefore tables with `@recreate` never have deprecated columns (since `@delete` isn't allowed on their columns)
NOTE: all annotations are suppressed from generated SQL. SQLite never sees them.
NOTE: looking at the annotations it is possible to compute the logical schema at any version, especially the original schema -- it's what you get if you disregard all the `@delete` annotations entirely (don't delete) and then remove anything marked with `@create`.
Not all migrations are possible in a sensible fashion; therefore CQL enforces certain limitations:
- the "original" schema has no annotations, or only `@delete` annotations
- new tables may be added (with `@create`)
- tables may be deleted (with `@delete`)
- columns may be added to a table, but only at the end of the table
- added columns must be nullable or have a default value (otherwise all existing insert statements would break for sure)
- columns may not be renamed
- columns may be deleted, but this is only a logical delete -- SQLite has no primitive to remove columns; once deleted you may no longer refer to that column in queries
- deleted columns must be nullable or have a default value (otherwise all existing and future insert statements would break for sure, the column isn't really gone)
- views, indices, and triggers may be added (no annotation required) and removed (with `@delete`) just like tables
- views, indices, and triggers may be altered completely from version to version
- no normal code is allowed to refer to deleted columns, tables, etc.; this includes views, indices, and triggers
- schema migration stored procs see the schema as it existed at their annotation's version (so an older version); they are also forbidden from using views (see below)
- recreated objects (tables marked with `@recreate`, as well as views, triggers, and indices) have no change restrictions
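A small sketch of a legal column addition under these rules (the names are illustrative):

```sql
create table t(
  id integer not null,
  -- ok: added at the end of the table and nullable
  note text @create(2)
  -- a NOT NULL column with no default value could not be added here,
  -- since existing insert statements would break
);
```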
Moving from one schema version to another is done in an orderly fashion, with the migration proc taking these essential steps in this order:

- the `cql_schema_facets` table is created if needed; this records the current state of the schema
- the last known schema hash is read from the `cql_schema_facets` table (it's zero by default)
- if the overall schema hash code matches what is stored, processing stops; otherwise an upgrade ensues
- all known views are dropped (hence migration procs won't see them!)
- any index that needs to change is dropped (this includes items marked `@delete` or indices that are different than before)
  - change is detected by comparing the hash (CRC64) of the previous index definition with the current one
- all known triggers are dropped (hence they will not fire during migration!)
- the current schema version is extracted from `cql_schema_facets` (it's zero by default)
- if the current schema version is 0, then the original versions of all the tables are created
- if the current schema version is <= 1 then:
  - any tables that need to be created at schema version 1 are created as they exist at schema version 1
  - any columns that need to be created at schema version 1 are created as they exist at schema version 1
  - migration procedures for schema version 1 are run in this order:
    - create table migration
    - create column migration
    - delete trigger migration (these are super rare and supported for uniformity)
    - delete index migration (these are super rare and supported for uniformity)
    - delete view migration (these are super rare and supported for uniformity)
    - delete column migration
    - delete table migration
    - ad hoc migration
  - each proc is run exactly one time
  - any tables that need to be dropped at schema version 1 are dropped
  - the schema version is marked as 1 in `cql_schema_facets`
  - each sub-step in the above is recorded in `cql_schema_facets` as it happens so that it is not repeated (all that checking is not shown, for brevity)
- the above process is repeated for all schema versions up to the current version
- all tables that are marked with `@recreate` are re-created if necessary, i.e. if the checksum of the table definition has changed for any table (or group), then drop it and create the new version
- all indices that changed and were not marked with `@delete` are (re)created
- all views not marked with `@delete` are (re)created
- all triggers not marked with `@delete` are re-installed
- the current schema hash is written to the `cql_schema_facets` table
Here's an example of a schema directly from the test cases:
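The original sample isn't reproduced here, but a condensed sketch in the same spirit (names, versions, and migration procs are illustrative) looks like this:

```sql
create table foo(
  id integer primary key,
  rate long integer @delete(5),
  rate_2 long integer @delete(4, DeleteRate2Proc),
  id2 integer default 12345 @create(4, CreateId2Proc),
  name text @create(5),
  name_2 text @create(6)
);

-- a table that itself came and went
create table added_table(
  id integer not null,
  name1 text,
  name2 text @create(4)
) @create(3) @delete(5);
```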
This schema has a LOT of versioning... you can see tables and columns appearing in versions 2 through 6. There is a lot of error checking happening.
- things with no create annotation were present in the base schema
- only things with no delete annotation are visible to normal code
- created columns have to be at the end of their table (required by SQLite)
- they have to be in ascending schema version order (but you can add several columns in one version)
- there may or may not be a proc to run to populate data in that column when it's added or removed data when it's deleted
- proc names must be unique
- you can't delete a table or column in a version before it was created
- you can't delete a column in a table in a version before the table was created
- you can't create a column in a table in a version after the table was deleted
- there's probably more I forgot...
Sample Upgrade Script
With just those annotations you can automatically create the following upgrade script which is itself CQL (and hence has to be compiled). This code is totally readable!
I've split the script into logical pieces to explain what's going on.
Schema upgrade scripts need to see all the columns, even the ones that would be logically deleted in normal mode. This is so that things like `alter table add column` can refer to real columns and `drop table` can refer to a table that shouldn't even be visible. Remember, in CQL the declarations tell you the logical state of the universe and DDL mutations are expected to create that condition, so you should be dropping tables that are marked with `@delete`.
CQL stores the current state of the universe in this table.
The schema CRC is computed by hashing all the schema declarations in canonical form. That's everything in this next section.
Wherein all the necessary objects are declared...
The `sqlite_master` table is built-in but it has to be introduced to CQL so that we can query it. Like all the other loose DDL declarations here, there is no code generated for this. We are simply declaring tables. To create code you have to put the DDL in a proc. Normally DDL in procs also declares the table, but since we may need the original version of a table created and the final version declared, we have `@schema_upgrade_script` to help avoid name conflicts.
NOTE: all the tables are emitted including all the annotations. This lets us do the maximum validation when we compile this script.
These view declarations do very little. We only need the view names so we can legally drop the views. We create the views elsewhere.
Just like views, these declarations introduce the index names and nothing else.
We have only the one trigger, we declare it here.
This is where we will store everything we know about the current state of the schema. Below we define a few helper procs for reading and writing that table and for reading `sqlite_master`.
We will snapshot the facets table at the start of the run so that we can produce a summary of the changes at the end of the run. This table will hold that snapshot.
NOTE: the prefix "test" was specified when this file was built, so all the methods and tables begin with `test_`.

A helper procedure queries `sqlite_master` and returns true if a column matching the given name already exists, so that schema changes can be applied idempotently.
Here we actually create the `cql_schema_facets` table with DDL inside a proc. In a non-schema-upgrade script the above would give a name conflict.
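A sketch of such a proc (the exact generated name varies with the chosen prefix):

```sql
create procedure test_create_cql_schema_facets_if_needed()
begin
  -- records the state of each schema facet: a name plus a version/CRC value
  create table if not exists test_cql_schema_facets(
    facet text not null primary key,
    version long integer not null
  );
end;
```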
The `save_cql_schema_facets` procedure simply makes a snapshot of the current facets table. Later we use this snapshot to report the differences by joining these tables.
The two procs `cql_get_facet_version` and `cql_set_facet_version` do just what you would expect. Note the use of `try`/`catch` to return a default value if the select fails.
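A sketch of the reader (the generated code differs in detail, and the name reflects the "test" prefix):

```sql
create procedure test_cql_get_facet_version(
  _facet text not null,
  out _version long integer not null)
begin
  begin try
    set _version := (select version from test_cql_schema_facets
                     where facet = _facet limit 1);
  end try;
  begin catch
    set _version := 0;  -- default value if the select fails
  end catch;
end;
```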
There are two additional helper procedures that do essentially the same thing using a schema version index. These two methods exist only to avoid unnecessary repeated string literals in the output file which cause bloat.
As you can see, these procedures are effectively specializations of `cql_get_facet_version` and `cql_set_facet_version` where the facet name is computed from the integer.
Triggers require some special processing. There are so-called "legacy" triggers that crept into the system. These begin with `tr__` and they do not have proper tombstones; in fact some are from early versions of CQL before triggers were properly tracked. To fix any old databases that have these in them, we delete all triggers that start with `tr__`. Note we have to use the `GLOB` operator to do this, because `_` is a single-character wildcard in `LIKE`, which makes matching a literal underscore awkward.
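A sketch of that cleanup, assuming a dynamic-SQL helper like `cql_exec_internal` as in the generated upgraders:

```sql
-- find any legacy triggers; GLOB is used because '_' is a LIKE wildcard
declare C cursor for select name from sqlite_master
  where type = 'trigger' and name glob 'tr__*';
loop fetch C
begin
  -- drop each legacy trigger by name
  call cql_exec_internal(printf('DROP TRIGGER %s;', C.name));
end;
```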
The 'baseline' or 'v0' schema is unannotated (no `@create` or `@recreate`). The first real schema management procedures are for creating and dropping these tables.
The next section declares the migration procedures that were in the schema. These are expected to be defined elsewhere.
The code below will refer to these migration procedures. We emit a declaration so that we can use the names in context.
`USING TRANSACTION`, when applied to a proc declaration, simply means the proc will access the database and so it needs to be provided with a `sqlite3 *db` parameter.
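For example, a migration proc defined elsewhere is introduced like this (the proc name is illustrative):

```sql
-- the proc accesses the database, so its C signature gets a sqlite3 *db argument
declare proc CreateName1Proc() using transaction;
```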
View migration is done by dropping all views and putting all views back.
Note that `dead_view` was not created, but we did try to drop it if it existed.
Indices are processed similarly to views; however, we do not want to drop indices that are not changing, so we compute the CRC of each index definition. At the start of the script any indices that are condemned (e.g. `index_going_away`) are dropped, as well as any whose CRC has changed. At the end of migration, changed or new indices are (re)created using their current definitions.
Triggers are always dropped before migration begins and are re-instated quite late in the processing as we will see below.
Main Migration Script
The main script orchestrates everything. There are inline comments for all of it. The general order of events is:
- create schema facets table if needed
- check main schema crc, if it matches we're done here, otherwise continue...
These operations are then done, in order:
- drop all views
- drop condemned indices
- fetch the current schema version
- if version 0 then install the baseline schema (see below)
- for each schema version with changes do the following:
- create any tables that need to be created in this version
- add any columns that need to be added in this version
- run migration procs in this order:
- create table
- create column
- delete trigger
- delete view
- delete index
- delete column
- delete table
- drop any tables that need to be dropped in this version
- mark schema upgraded to the current version so far, proceed to the next version
- each partial step is also marked as completed so it can be skipped if the script is run again
- create all the views
- (re)create any indices that changed and are not dead
- set the schema CRC to the current CRC
That's it... the details are below.
This is the main function for upgrades, it checks only the master schema version. This function is separate so that the normal startup path doesn't have to have the code for the full upgrade case in it. This lets linker order files do a superior job (since full upgrade is the rare case).
We had no temporary tables in this schema, but if there were some they get added to the schema after the upgrade check.
A procedure like this one is generated:
This entry point can be used any time you need the temp tables. But normally it is automatically invoked.
That logic is emitted at the end of the test procedure.
Schema Regions are designed to let you declare your schema in logical regions whose dependencies are specified. It enforces the dependencies you specify creating errors if you attempt to break the declared rules. Schema regions allow you to generate upgrade scripts for parts of your schema that can compose and be guaranteed to remain self-consistent.
In many cases schema can be factored into logical and independent islands. This is desirable for a number of reasons:
- so that the schema can go into different databases
- so that the schema can be upgraded on a different schedule
- so that "not relevant" schema can be omitted from distributions
- so that parts of your schema that have no business knowing about each other can be prevented from taking dependencies on each other
These all have very real applications:
E.g. Your Application has an on-disk and an in-memory database
This creates basically three schema regions:
- on disk: which cannot refer to the in-memory at all
- in-memory: which cannot refer to the on-disk schema at all
- cross-db: which refers to both, also in memory (optional)
Your Application Needs To Upgrade Each of the Above
There must be a separate upgrade script for both the island databases and yet a different one for the "cross-db" database
Your Customer Doesn't Want The Kitchen Sink of Schema
If you're making a library with database support, your customers likely want to be able to create databases that have only features they want; you will want logical parts within your schema that can be separated for cleanliness and distribution.
Declaring Regions and Dependencies
Schema regions let you create logical groupings: you simply declare the regions you want and then start putting things into those regions. The regions form a directed acyclic graph -- just like C++ base classes. You create regions like this:
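A sketch of the declarations for the `root` and `extra` regions discussed below:

```sql
@declare_schema_region root;

-- extra depends on root; root cannot refer back into extra
@declare_schema_region extra using root;
```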
The above simply declares the regions; it doesn't put anything into them. In this case we now have a `root` region and an `extra` region. The `root` schema items will not be allowed to refer to anything in `extra`.
Without regions, you could also ensure that the above is true by putting all the `extra` items after the `root` items in the input file, but things can get more complicated than that in general, and the schema might also be in several files, which complicates using order as the tool. Relying on order is also error-prone: it is quite easy to put things in the wrong place (e.g. adding a new `root` item after the `extra` items). Making this a bit more complicated, we could have:
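For example (a sketch, extending the region declarations above):

```sql
@declare_schema_region feature1 using extra;
@declare_schema_region feature2 using extra;
@declare_schema_region everything using feature1, feature2;
```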
And now there are many paths to `root` from the `everything` region; that's ok, but it would certainly be tricky to guarantee all of that with ordering alone.
An illustrative example, using the regions defined above:
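A sketch of placing items into those regions (the table names are illustrative):

```sql
@begin_schema_region root;
create table main(id integer);
@end_schema_region;

@begin_schema_region extra;
-- extra may refer to root items such as main
create table details(
  id integer references main(id),
  details text
);
@end_schema_region;
```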
With the structure above specified, even if a new contribution to the `root` schema appears later, the rules enforce that this region cannot refer to anything other than things in `root`. This can be very important if schema is being included via `#include` and might get pulled into the compilation in various orders. A feature area might also have a named public region that other things can depend on (e.g. some views) and private regions (e.g. some tables, or whatever).
Schema regions do not provide additional namespaces; the names of objects must be unique across all regions. I.e. regions do not hide or scope entity names; rather, they create errors if inappropriate names are used.
Case 1: the second line will fail semantic validation because table `A` already exists.

Case 2: this fails for the same reason as case #1; table `A` already exists.

Case 3: again, this fails for the same reason as case #1; table `A` already exists in region `extra` -- you cannot define another table with the same name in another region.
Really, the visibility rules couldn't be anything other than the above; SQLite has no knowledge of regions at all, and so any exotic name resolution would just doom SQLite statements to fail when they finally run.
The "... LIKE <table>" Statement

The rules above are enforced for all constructs except where the syntactic sugar `... LIKE <table>` forms appear, which can happen in a variety of statements. This form doesn't create a dependence on the table (but does create a dependence on its shape). When CQL generates output, the `LIKE` construct is replaced with the actual names of the columns it refers to. But these are independent columns, so this is simply a typing-saver; the table (or view, cursor, etc.) reference will be gone.
These cases below will succeed.
Note: this exception may end up causing maintenance problems and so it might be revisited in the future.
Maintaining Schema in Pieces
When creating upgrade scripts using the `--rt schema_upgrade` flags, you can add region options `--include_regions a b c` and `--exclude_regions d e f` per the following:
- included regions must be valid region names, the base types are walked to compute all the regions that are "in"
- declarations are emitted in the upgrade for all of the "in" objects, "exclude" does not affect the declarations
- excluded regions must be valid region names and indicate parts of the schema that are upgraded elsewhere, perhaps with a separate CQL run, a different automatic upgrade, or even a manual mechanism
- upgrade code will be generated for all the included schema, but not for the excluded regions and their contents
Example: Referring to the regions above you might do something like this
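A sketch of the three commands (file names and proc names are illustrative; only the region flags come from the text above, and the other flags are the usual CQL code-generation options):

```bash
# upgrader for the shared regions
cql --in schema.sql --rt schema_upgrade --cg shared.cql \
    --global_proc upgrade_shared --include_regions extra

# feature1 upgrader; 'extra' is declared but upgraded elsewhere
cql --in schema.sql --rt schema_upgrade --cg feature1.cql \
    --global_proc upgrade_feature1 --include_regions feature1 --exclude_regions extra

# feature2 upgrader; likewise excludes the shared schema
cql --in schema.sql --rt schema_upgrade --cg feature2.cql \
    --global_proc upgrade_feature2 --include_regions feature2 --exclude_regions extra
```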
The first command generates the upgrader for all the shared schema, i.e. the `extra` region and anything it uses.
The second command declares all of `extra` so that the `feature1` things can refer to it; however, the upgrade code for these shared regions is not emitted. Only the upgrade for schema in `feature1` is emitted. `feature2` is completely absent. This will be ok because we know `feature1` cannot depend on `feature2`, and `extra` is assumed to be upgraded elsewhere (such as by the previous command).
The third command declares all of `extra` so that the `feature2` things can refer to it; however, the upgrade code for these shared regions is not emitted. Only the upgrade for schema in `feature2` is emitted. `feature1` is completely absent.
Note that in the above examples CQL is generating more CQL to be compiled again (a common pattern). The CQL upgrade scripts need to be compiled as usual to produce executable code. Thus the output of this form includes the schema declarations and executable DDL.
Schema Not In Any Region
For schema that is not in any region, you might imagine that it is in a special region `<none>` that depends on everything. So basically you can put anything there. Schema that is in any region can never refer to schema that is in `<none>`.

When upgrading, if any include regions are specified then `<none>` will not be emitted at all. If you want an upgrader for just `<none>`, this is possible with an assortment of exclusions. You can always create arbitrary grouping regions to make this easier; a region named `any` that uses all other regions would make this simple.

In general, best practice is that there is no schema in `<none>`, but since most SQL code has no regions, some sensible meaning has to be given to DDL before it gets region encodings.
Given the above, we note that some schema regions correspond to the way that we will deploy the schema. We want those bundles to be safe to deploy, but to do so we need a new notion -- a deployable region. To make this possible, CQL includes the following:

- You can declare a region as deployable using `@declare_deployable_region`
- CQL computes the covering of a deployable region: its transitive closure up to but not including any deployable regions it references
- No region is allowed to depend on a region that is within the interior of a different deployable region, but you can depend on the deployable region itself
Because of the above, each deployable region is in fact a well defined root for the regions it contains. The deployable region becomes the canonical way in which a bundle of regions (and their content) is deployed and any given schema item can be in only one deployable region.
Motivation and Examples
As we saw above, regions are logical groupings of tables/views/etc. such that if an entity is in some region `R`, then it is allowed to refer only to the things that `R` declared as dependencies `D1`, `D2`, etc., and their transitive closures. You can make as many logical regions as you like and you can make them as razor-thin as you like; they have no physical reality, but they let you make as many logical groups of things as you might want.
Additionally, when deploying schema, you generally need to do it in several pieces. E.g., if we have tables that go in an in-memory database, then defining a region that holds all the in-memory tables makes it easy to put all of those tables into a particular deployment script.
Now we come to the reason for deployable regions. From CQL's perspective, all regions are simply logical groups: some grouping that is meaningful to programmers but has no physical reality. This means you're free to reorganize tables etc. as you see fit into new or different regions when things should move. Only that's not quite true. The fact that we deploy our schema in certain ways means that while most logical moves are totally fine, if you were to move a table from, say, the main database region to the in-memory region, you would be causing a major problem. Some installations may already have the table in the main area, and there would be nothing left in the schema to tell CQL to drop the table from the main database. The best you can hope for is that the new location gets a copy of the table, the old location keeps its copy, and now there are name conflicts forever.
So, the crux of the problem is this. We want to let you move schema freely between logical regions however it makes sense to you but once you pick the region you are going to be deployed in, you can’t change that.
To do this, CQL needs to know that some of the regions are deployable regions and there have to be rules so that it all makes sense. Importantly, every region has to be contained in at most one deployable region.
Since the regions form a DAG, we must create an error if any region could ever roll up to two different deployable regions. The easiest way to describe this rule is "no peeking": the contents of a deployable region are "private"; they can refer to each other in any DAG shape, but outside of the deployable region you can only refer to its root. So you can still compose them, but each deployable region owns a well-defined covering. Note you can make as many fine-grained deployable regions as you want -- you don't actually have to deploy them separately -- but you get stronger rules about the sharing.
Here’s an example:
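A sketch of that arrangement in region declarations (all names are illustrative; each feature would in practice contain its own internal logical regions):

```sql
@declare_deployable_region core;

@declare_deployable_region feature1 using core;
@declare_deployable_region feature2 using core;
@declare_deployable_region feature3;  -- does not use core

-- the master deployments just group the features we want to ship together
@declare_schema_region master_deployment1 using feature1, feature2;
@declare_schema_region master_deployment2 using feature1, feature3;
```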
In the above:
- none of the logical regions for feature 1, 2, 3 are allowed to refer to logical regions in any other feature, any of them could refer to Core (but not directly to what is inside Core)
- within those regions you can make any set of groupings that makes sense and change them over time as you see fit
- any such regions are not allowed to move to a different Feature group (because those are deployment regions)
- the Master Deployment regions just group features in ways we'd like to deploy them; in this case there are two deployments: one that includes Features 1 & 2 and another that includes Features 1 & 3
- the deployable region boundaries are preventing Feature 1 regions from using Feature 2 regions in an ad hoc way (i.e. you can't cheat by taking a direct dependency on something inside a different feature), but both Features can use Core
- Feature 3 doesn’t use Core but Core will still be in Master Deployment 2 due to Feature 1
Note that the deployable regions for Features 1, 2, and 3 aren't actually deployed alone, but they add enforcement that keeps the features cleaner.
Because of how upgrades work, “Core” could have its own upgrader. Then when you create the upgrader for Master Deployment 1 and 2 you can specify “exclude Core” in which case those tables are assumed to be updated independently. You could create as many or as few independently upgrade-able things with this pattern. Because regions are not allowed to "peek" inside of a deployable region, you can reorganize your logical regions without breaking other parts of the schema.
The above constructs create a good basis for creating and composing regions, but a key missing aspect is the ability to hide internal details in the logical groups. This becomes increasingly important as your desire to modularize schema grows; you will want to have certain parts that can change without worrying about breaking others and without fear that there are foreign keys and so forth to them.
To accomplish this, CQL provides the ability to compose schema regions with the optional `private` keyword. In the following example there will be three regions, creatively named `r1`, `r2`, and `r3`. Region `r2` consumes `r1` privately, and therefore `r3` is not allowed to use things in `r1` even though it consumes `r2`. When creating an upgrade script for `r3` you will still need (and will get) all of `r1`, but from a visibility perspective `r3` can only directly depend on `r2`.

Note that `r2` is still allowed to use `r1` because your private regions are not private from yourself. So you may think it's easy to work around this privacy by simply declaring a direct dependency on `r1` wherever you need it.
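A sketch of those three regions, with `r2` consuming `r1` privately:

```sql
@declare_schema_region r1;
@declare_schema_region r2 using r1 private;
@declare_schema_region r3 using r2;  -- r3 cannot see r1
```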
That would seem to make it all moot. However, this is where deployable regions come in. Once you bundle your logical regions in a deployable region, there's no more peeking inside the deployable region. So we could strengthen the above to:
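The strengthened form might look like this, making `r2` a deployable region:

```sql
@declare_schema_region r1;
@declare_deployable_region r2 using r1 private;
@declare_schema_region r3 using r2;
```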
Once this is done, it becomes an error to try to make new regions that peek into `r2`; you have to take all of `r2` or none of it -- and you can't see the private parts. Of course you can do region wrapping at any level, so you can have as many subgroups as you like, whatever is useful. You can even add additional deployable regions that aren't actually deployed to get the "hardened" grouping at no cost.
So, in summary, to get true privacy first make whatever logical regions you like that are helpful. Put privacy where you need/want it. Import logical regions as much as you want in your own bundle of regions. Then wrap that bundle up in a deployable region (they nest) and then your private regions are safe from unwanted usage.