
· 10 min read
CG/SQL Team

Introduction and Context

The general idea here is that you might want to store composite data in a single column in the database. This is a common way to get a more generic schema: you have one or more blob columns in your tables that hold a lot of data that doesn't have to be indexed. You could store it in other ways, like a JSON blob or some such, but we'll be using blobs as the basis for storage here -- hence the name "blob storage".

How do I define one of these blobs?

In SQL/CQL, the main way you define structures, especially those that you want to maintain, is with tables. Hence we introduce this:

@attribute(cql:blob_storage)
create table news_info(
  who text,
  what text,
  when_ long  -- timestamp of some kind
);

The blob_storage attribute indicates that the thing we're about to define is not really going to be a materialized table. As a result, you will not be able to (e.g.) DROP the table or SELECT from it, and there will be no schema upgrade for it should you request one. However, the usual schema rules still apply, which helps you create compatible versions of this structure. For instance, new columns can be added only at the end, and only if they are nullable. Here we add source to the schema in a hypothetical "version 6". Note that schema versions move forward globally in the schema, not locally in one table; this implies there are versions 1-5 elsewhere, not shown.

@attribute(cql:blob_storage)
create table news_info(
  who text,
  what text,
  when_ long,  -- timestamp of some kind
  source text @create(6)
);

Additionally, since the storage is not backed by SQL with SQL's constraint system, default values and constraints are not allowed in a table marked with cql:blob_storage; it's just data. Similarly, triggers, views, and indices may not use the "table".
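To illustrate, here's a hedged sketch of declarations the compiler would reject under these rules (the error comments are paraphrases, not actual compiler output):

@attribute(cql:blob_storage)
create table bad_news_info(
  who text default 'nobody',     -- error: default values are not allowed
  what text,
  when_ long check (when_ > 0)   -- error: constraints are not allowed
);

-- indices, triggers, and views over the "table" are likewise rejected
create index news_index on news_info (who);  -- error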

Where do you keep your blob storage?

Naturally, blob storage goes in a blob field, but recall CQL has discriminated types so we could make something like this:

create table info(
  id long primary key,
  news_info blob<news_info>
);

From a SQL perspective news_info is just a blob. That means if you want to do a WHERE clause or something like that on the info, you're out of luck. Maybe you could write a user-defined function to crack the blob and so forth, but really that isn't the point: if you're using this feature then, by construction, you don't need to index on this data. It's simply not suitable for use at all if you need field-at-a-time access within SQL.

How do I make one of these blobs?

The natural place that CQL stores structures is in value cursors, so the most natural thing to do is to provide a variation of the SET statement that lets you load a blob from a cursor, like so:

create proc make_blob(like news_info, out result blob<news_info>)
begin
  declare c cursor like news_info;
  fetch c from arguments;
  set result from cursor c;
end;

This declares a cursor, loads it from argument values, and converts it to a blob. Of course all of the usual cursor building forms can be used to power your blob creation; you just do one serialization at the end. The above is assembling a blob from arguments, but you could equally make the blob from data:

create proc get_news_info(id_ long not null, out result blob<news_info>)
begin
  -- use our columns sugar syntax for getting just news_info columns from
  -- a table with potentially lots of stuff (or an error if it's missing columns)
  declare c cursor for
    select columns(like news_info) from some_source_of_info where id = id_;
  fetch c;
  set result from cursor c;
end;

There are many cursor fetch forms, including dummy data forms and other interesting bits of sugar. You can fetch a cursor from arguments, from other cursors, and even combinations. We want all of that to work for blobs as well without adding tons of new syntax and code generation. The obvious way to accomplish that is for cursors to be the source of blobs.

How do I unpack one of these blobs?

Again, the normal way that you work with records in CQL is by creating suitable cursors. These can be economically accessed on a field-by-field basis. What we need is a way to easily recreate a cursor from the blob so we can read the data values. This gives rise to this form:

let b := (select news_info from info where id = id_ if nothing null);
declare c cursor like b;
fetch c from b; -- note this can fail
-- now use c.who, c.what, etc.

Data loaded in a cursor is very economical to access on a field-by-field basis, and, since the deserialization of the blob happens all at once, that is also economical. Importantly, we cannot assume that the blob is well formed; it could be coming from anywhere. For secure-code reasons we must assume it is hostile. Hence the decoding validates the shape, internal lengths, and so forth.
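Since the fetch can fail on malformed data, you'd typically guard it. Here's a minimal sketch using CQL's TRY/CATCH forms (the recovery logic is up to you):

create proc crack_news_blob_safely(b blob<news_info>)
begin
  declare c cursor like news_info;
  begin try
    fetch c from b;  -- throws if the blob fails shape/length validation
  end try;
  begin catch
    -- the blob was malformed or of the wrong type; handle or rethrow
    throw;
  end catch;
  -- c.who, c.what, c.when_ are now safe to use
end;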

If we had instead started with something like this:

let b := (select news_info from info where id = id_ if nothing null);

Then we might like to write:

if b.who == 'U2' then ... end if;

However, this sort of thing would be very uneconomical. For one thing, the blob does not have fixed-offset fields: It is carrying all the serialized data for the string fields and so forth. Each "dot" operation would be costly and, furthermore, each "dot" operation could fail if the blob is badly formed. Having to deal with a b.who that might fail seems very bad indeed.

Once you have the cursor you can make new blobs with different combinations, slice the cursor fields using the LIKE operator, return the cursor with OUT, or OUT UNION, or pass the blob fields as arguments to functions using the FROM forms. Cursors already are super flexible in terms of what you can do with their contents.
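For instance, here's a hedged sketch of carving a smaller blob out of a bigger one; news_summary is a hypothetical narrower shape invented just for this example:

@attribute(cql:blob_storage)
create table news_summary(
  who text,
  what text
);

create proc summarize(data blob<news_info>, out result blob<news_summary>)
begin
  declare c cursor like news_info;
  fetch c from data;                 -- crack the full blob (can fail)
  declare s cursor like news_summary;
  fetch s from c(like news_summary); -- slice out just the matching fields
  set result from cursor s;          -- re-serialize the narrower shape
end;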

What is the representation of one of these blobs?

It's important that we allow the blobs to evolve over time, so each blob has to be self-describing. We also want to be able to throw an error if you use the wrong kind of blob when loading a cursor, so the blob has to contain the following:

  • the number of columns in the blob data type when it was stored
  • the type of each field is encoded as a single plain-text character
    • the types are bool, int, long, (double) real, (string) text, blob;
    • we use 'f' (flag) for bools, hence "fildsb"
    • these are encoded with one letter each, upper case meaning 'not null' so the storage might be "LFss"
    • the buffer begins with a null terminated string that serves as both the count and the types
  • Each nullable field may be present or null; 1 bit is used to store this fact. The bits are in an array of bytes that comes immediately after the type info (which implicitly tells us its size)
  • Boolean values are likewise encoded as bits within the same array, so the total number of bits stored is nullables plus booleans (nullable booleans use 2 bits)
  • If you are reading a newer version of a record from an older piece of data that is missing a column then the column is assumed to be NULL
  • Any columns you add after the initial version (using @create) must be nullable; this is normal for adding columns to existing schema
  • Integers and longs are stored in varint format after zigzag encoding
  • Text is stored inline in null terminated strings (embedded nulls are not allowed in CQL text)
  • Nested blobs are stored inline, with a length prefix encoded like any other int
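Putting those rules together, here's a hedged sketch of how the original three-column news_info might lay out on disk (assuming a set bit means "present"; the exact bit conventions are illustrative):

-- news_info("Joe", "stuff", 1234): all three fields nullable, all present
-- "ssl\0"               type info: string, string, long (lowercase = nullable)
-- 0b00000111            one byte of present/null bits for the 3 nullable fields
-- "Joe\0"               who: stored inline, null terminated
-- "stuff\0"             what: stored inline, null terminated
-- varint(zigzag(1234))  when_: zigzag encoded, then varint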

What about more than one row in a blob?

Well, this is a bit more advanced but in principle this could be done as well. To make it useful, we would want to make a new cursor type that can iterate over rows in a blob. The syntax leaves room for this, something like so:

declare c cursor for blob b;
loop fetch c
begin
  -- the usual stuff
end;

This cursor would be another variation; it would keep its current index into the blob to read data out of it. Such a blob would also have to include a count of rows as part of its storage.

However, that's future looking. There is no such support at present.

Conclusion

With a fairly modest amount of work, we now support structured storage natively and have pretty rich language constructs. We carefully chose language constructs that lead to economical serialization and deserialization patterns and a record format that is versioned well, without resorting to something super loose like JSON.

As with many other features, it's possible to replace the (de)serialization with code of your choice by supplying your own runtime methods. So, for instance, Thrift encoding is possible; though it is more flexible than is strictly necessary for the few SQL data types, it might be convenient.

Storage types that are going to be persisted in the database or go over a wire-protocol should be managed like schema with the usual validation rules. On the other hand, formats that will be used only transiently in memory can be changed at whim from version to version. As mentioned above, the design specifically considers cases where a new client discovers an old-format blob (with fewer columns) and, the reverse, cases where an old client receives a datagram from a new client with too many columns.

Appendix

A more complete example is included for reference.

@attribute(cql:blob_storage)
create table news_info(
  who text,
  what text,
  when_ long  -- timestamp of some kind
);

-- a place where the blob appears in storage
create table some_table(
  x integer,
  y integer,
  news_blob blob<news_info>
);

-- a procedure that creates the blob from loose args
create proc make_blob(like news_info, out result blob<news_info>)
begin
  declare c cursor like news_info;
  fetch c from arguments;
  set result from cursor c;
end;

-- a procedure that cracks the blob
create proc crack_blob(data blob<news_info>)
begin
  declare c cursor like news_info;
  fetch c from data;
  out c;
end;

-- a procedure that cracks the blob into loose args if needed
-- the OUT statement was created specifically to let you avoid this sort of mass OUT awfulness
create proc crack_blob_to_vars(
  data blob<news_info>,
  out who text,
  out what text,
  out when_ long)
begin
  declare c cursor like news_info;
  fetch c from data;
  set who := c.who;
  set what := c.what;
  set when_ := c.when_;
end;

-- this just defines a shape for the part we are keeping from the original structure
declare proc my_basic_columns() (
  x int,
  y int
);

-- this just defines a shape for the result we want
-- we're never actually defining this procedure
declare proc my_result_shape() (
  like my_basic_columns,
  like news_info
);

create proc select_and_crack(whatever_condition bool)
begin
  declare c cursor for select * from some_table where whatever_condition;
  loop fetch c
  begin
    -- crack the blob in c
    declare n cursor like news_info;
    fetch n from blob c.news_blob;

    -- assemble the result we want from the parts we have
    declare result cursor like my_result_shape;
    fetch result from values (from c like my_basic_columns, from n);

    -- emit one row
    out union result;
  end;
end;

· 3 min read
CG/SQL Team

This new feature is a pretty simple generalization of the FROM construct as applied to expression lists. Note this isn't the same as using FROM the usual way in a select statement. An example will clear this right up.

Suppose you wanted to create a procedure that inserts a row into a table. You could write this:

create table Shape_xy (x int, y int);

create proc insert_xy(like Shape_xy)
begin
  insert into Shape_xy from arguments;
end;

Here we're using from to introduce some shape of values. It can appear in a lot of places.

Suppose now I want to insert two of those shapes. I could write this slightly more complicated procedure:

create proc insert_two_xy(xy1 like Shape_xy, xy2 like Shape_xy)
begin
  call insert_xy(from xy1);
  call insert_xy(from xy2);
end;

This also composes with cursors, so maybe you need to get two xy values from diverse locations. You can mix and match.

create proc write_xy()
begin
  declare C cursor for select T.x, T.y from somewhere T;
  fetch C;
  declare D cursor for select T.x, T.y from somewhere_else T;
  fetch D;
  if C and D then
    -- strange combos for illustration only
    call insert_two_xy(from C, from D);
    call insert_two_xy(from D, 5, 3);
    call insert_two_xy(4, 2, from C);
    call insert_two_xy(4, from C, 8);
  end if;
end;

So, as you can see, we can start from data in one or more cursors and we can turn that data, plus other expressions, into arguments, composing them as we like. This gives you the ability to call procedures and functions using shapes from a mixed set of sources. None of this is new.

However, the other places where expression lists happen -- fetch, update cursor, and insert -- only allowed you to specify a single object as the input source, such as insert into Shape_xy from C.

With a little work, this is trivially generalized so that all value lists can use the from construct.

Here's a complete example showing all the new forms.

create table Shape_xy (x int, y int);
create table Shape_uv (u text, v text);
create table Shape_uvxy (like Shape_xy, like Shape_uv);

create proc ShapeTrix()
begin
  declare C cursor for select Shape_xy.*, '1' u, '2' v from Shape_xy;
  fetch C;

  -- This new form is equivalent to the old form:
  --   insert into Shape_xy from C(like Shape_xy)
  -- but the values(...) form generalizes, see below.
  insert into Shape_xy values(from C like Shape_xy);

  declare D cursor for select * from Shape_uv;
  fetch D;

  declare R cursor like Shape_uvxy;

  -- This form works just like the function call case
  -- that was previously supported (it uses the same code even).
  -- This form lets you load R from any combination of sources
  -- as long as you make a suitable row.
  fetch R from values (from C like Shape_xy, from D);

  -- Same thing is supported in update cursor:
  -- the x, y come from C and the u, v come from D.u, D.v.
  -- Note that C.u and C.v would not even be type compatible.
  update cursor R from values (from C like Shape_xy, from D);

  -- And in a select-values clause
  declare S cursor for
    with cte(l,m,n,o) as (values (from C like Shape_xy, from D))
    select * from cte;
  fetch S;
  insert into Shape_uvxy from S;
end;

As you can see, you can choose a subset of the from shape using like.

These combinations let you flexibly assemble rows of data for cursors, calls, and insertions, using any combination of data sources you might want, without resorting to listing every column by hand.

· 5 min read
CG/SQL Team

Following on the heels of shared fragments, we're introducing the same kind of thing for fragments that are expressions rather than tables. The syntax is as follows:

-- this isn't very exciting because regular max would do the job
@attribute(cql:shared_fragment)
create proc max_func(x integer, y integer)
begin
  select case when x >= y then x else y end;
end;

The above can be used in the context of a SQL statement like so:

select max_func(T1.column1, T1.column2) the_max from foo T1;

The consequence of the above is that the body of max_func is inlined into the generated SQL. However, like the other shared fragments, this is done in such a way that the text can be shared between instances, so you only pay for the cost of the text* in your program one time, no matter how many times you use it.

* You still pay for the cost of a pointer to the text.

In particular, for the above, the compiler will generate the following SQL:

select (
  select case when x >= y then x else y end
  from (select T1.column1 x, column2 y))

But each line will be its own string literal, so, more accurately, it will concatenate the following three strings:

"select (",                                      // string1
" select case when x >= y then x else y end", // string2
" from (select T1.column1 x, column2 y))" // string3

Importantly, string2 is fixed for any given fragment. The only thing that changes is string3, i.e., the arguments. The C compiler, and then the linker, will unify the string2 literal across all translation units so you only pay for the cost of that text one time. It also means that the text of the arguments appears exactly one time, no matter how complex they are. For these benefits, we pay the cost of the select wrapper on the arguments. This cost is frequently negative. Consider the following:

select max_func((select max(T.m) from T), (select max(U.m) from U))

A direct expansion of the above would result in something like this:

case when (select max(T.m) from T) >= (select max(U.m) from U)
  then (select max(T.m) from T)
  else (select max(U.m) from U)
end;

The above could be accomplished with a simple pre-processor macro, but the fragments code generates the following:

select (
  select case when x >= y then x else y end
  from (select (select max(T.m) from T) x, (select max(U.m) from U) y))

Expression fragments can nest, so you could write:

@attribute(cql:shared_fragment)
create proc max3_func(x integer, y integer, z integer)
begin
  select max_func(x, max_func(y, z));
end;

Again, this particular example is a waste because regular max would already do the job.

To give another example, common mappings from one kind of code to another using case/when can be written and shared this way:

-- this sort of thing happens all the time
@attribute(cql:shared_fragment)
create proc remap(x integer not null)
begin
select case x
when 1 then 1001
when 2 then 1057
when 3 then 2010
when 4 then 2011
else 9999
end;
end;

In the following:

select remap(T1.c), remap(T2.d), remap(T3.e) from C T1, D T2, E T3;

The text for remap will appear three times in the generated SQL query but only one time in your binary.

Restrictions:

  • the function must consist of exactly one simple select statement
    • no FROM, WHERE, HAVING, etc. -- the result is an expression
  • the select list must have exactly one value
    • Note: the expression can be a nested SELECT which could have all the usual SELECT elements
  • the usual shared fragment rules apply, e.g. no out-parameters, exactly one statement, etc.

FAQ:

Q: Why does the expression fragment have a select in it?

A: Expression fragments are only interesting in SQL contexts where normal procedure and function calls are not available. The select keyword makes it clear to the author and the compiler that the expression will be evaluated by SQLite and the rules for what is allowed to go in the expression are the SQLite rules.

Q: Why no FROM clause?

A: We're trying to produce an expression, not a table-value with one column. If you want a table-value with one column, the original shared fragments solution already does exactly that. This gives you a solution for sharing code in, say, the WHERE clause or the select list.

Q: Isn't this just the same as doing, say, #define max_func(x,y) case when (x) >= (y) then x else y end;?

A: Macros can give you a ton of flexibility, but they have many problems:

  • if the macro has an error, you see the error in the call site with really bad diagnostic info
  • the compiler doesn't know that the sharing is going on so it won't be able to share text between call sites
  • the arguments can be evaluated many times each which could be expensive, bloaty, or wrong
  • there is no type-checking of arguments to the macro so you may or may not get compilation errors after expansion
  • you have to deal with all the usual pre-processor hazards

In general, macros can be used for meta-programming (as in C and C++), but that doesn't mean it's a good idea.

· 4 min read
CG/SQL Team

One of the signature features of the CQL language is the ability to use the "LIKE" form to slice out columns that conform to a shape. This notion appears in many places in the language. For instance, if I have a table Foo, I can make a cursor for that shape like so:

declare C cursor like Foo;

Which says I want the columns of C to be like the columns of Foo.

If I have a cursor D that has the Foo columns, but maybe more and maybe in a different order, I can load C as follows:

fetch C from D(like Foo)

Which again saves me from having to list all (potentially dozens) of the Foo columns. This construct is in many places:

create proc P(like Foo)
begin
  insert into Foo from arguments;
end;

even

create proc P(f like Foo, b like Bar)
begin
  insert into Foo from f;
  insert into Bar from b;
end;

And other examples... This is discussed more fully in Chapter 5 of the Guide.

However, one of the few places that shapes are interesting but not supported was in the select list. And so, just a couple of days ago, we added the COLUMNS construct to the language which allows for a sugared syntax for extracting columns in bulk. It's kind of a generalization of the select T.* pattern but with CQL-style slicing and type-checking.

These forms are supported:

  • columns from a join table or tables

      -- same as A.*
      select columns(A) from ...;

      -- same as A.*, B.*
      select columns(A, B) from ...;

  • columns from a particular join table that match a shape

      -- the columns of A that match the shape Foo
      select columns(A like Foo) from ...;

      -- get the Foo shape from A and the Bar shape from B
      select columns(A like Foo, B like Bar) from ...;

  • columns from any join table that match a shape

      -- get the Foo shape from anywhere in the join
      select columns(like Foo) from ...;

      -- get the Foo and Bar shapes, from anywhere in the join
      select columns(like Foo, like Bar) from ...;

  • specific columns

      -- x and y columns plus the Foo shape
      select columns(T1.x, T2.y, like Foo) from ...;

  • distinct columns from the above (not distinct values!)

      -- removes duplicate column names
      -- e.g. there will be one copy of 'pk'
      select columns(distinct A, B) from A join B using(pk);

      -- if both Foo and Bar have an (e.g.) 'id' field you only get one copy
      select columns(distinct like Foo, like Bar) from ...;

      -- if a specific column is mentioned it is always included
      -- but later clauses that are not a specific column will avoid it
      -- if F or B has an x it won't appear again, just T.x
      select columns(distinct T.x, F like Foo, B like Bar) from F, B ...;

Of course this is all just sugar, so it all ends up being a column list with table qualifications -- but the syntax is very powerful. For instance, it's great for narrowing a wide table, or for fusing joins that share common keys:

-- just the Foo columns
select columns(like Foo) from Superset_Of_Foo_From_Many_Joins_Even;

-- only one copy of 'pk'
select columns(distinct A, B, C) from
  A join B using (pk) join C using (pk);

And of course you can define shapes however you like and then use them to slice off column chunks of your choice. There are many ways to build up shapes from other shapes. Probably the easiest is to declare procedures that return the shape you want and never actually create them. E.g.

declare proc shape1() (x integer, y real, z text);
declare proc shape2() (like shape1, u bool, v bool);

With this combination you can easily define common column shapes and slice them out of complex queries without having to type the column names over and over...
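For instance, here's a hedged sketch that slices shape2 out of a hypothetical wide join (big_table, other_table, and their columns are invented for illustration):

create proc get_shape2_rows()
begin
  declare C cursor for
    select columns(like shape2)
    from big_table B
    join other_table O using (pk);
  loop fetch C
  begin
    -- C has exactly the shape2 columns: x, y, z, u, v
    call printf("%d %f %s\n", C.x, C.y, C.z);
  end;
end;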

Note that the COLUMNS(...) form is not a general replacement for the select list. For instance, general expressions are not allowed inside of COLUMNS(...), but, where extraction of lots of columns is needed, or even re-ordering of columns, it's a very good option indeed and it composes well with the other select features.

This was the last significant area where shapes are useful but totally absent.

· 12 min read
CG/SQL Team

One of the biggest changes to CQL in 2021 was the addition of control flow analysis. Given an understanding of how execution can flow within a user's program, CQL can do things like infer when a nullable variable must contain a nonnull value and improve its type appropriately, or issue an error when a nonnull variable may be used before it has been initialized.

Improving Nullability

As of mid-2021, and with increasing sophistication throughout the remainder of the year, CQL has been able to infer that a variable of a nullable type must not be NULL within a portion of a user's program:

DECLARE PROC another_proc(t0 TEXT NOT NULL, t1 TEXT NOT NULL);

CREATE PROC some_proc(t0 TEXT, t1 TEXT)
BEGIN
  IF t0 IS NULL RETURN;
  -- `t0` must be nonnull here if we made it this far

  IF t1 IS NOT NULL THEN
    -- `t0` and `t1` are nonnull here
    CALL another_proc(t0, t1);
  ELSE
    -- `t0` is nonnull here
    CALL another_proc(t0, "default");
  END IF;
END;

The ability of the CQL compiler to infer non-nullability greatly reduces the need to use the functions ifnull_crash and ifnull_throw to coerce values to a nonnull type—functions that, if they are ever used incorrectly, usually result in programs misbehaving.

For a detailed description and many additional examples of what is possible—CQL can handle much more than what is shown above—see the user guide's section on nullability improvements.

Enforcing Initialization Before Use

In CQL, it is possible to declare a variable of a nonnull type without giving it a value. If the variable is of a non-reference type, it is assigned a default value of 0. If the variable is of a reference type (BLOB, OBJECT, or TEXT), however, it is simply set to NULL despite the nonnull type as no default value exists.

To help prevent accessing a reference variable of a nonnull type and getting back NULL, CQL recently began enforcing that such variables are initialized before use. The following code, therefore, now results in an error:

DECLARE t TEXT NOT NULL;
CALL requires_text_notnull(t); -- error!

Using the same engine for control flow analysis that is behind nullability improvements, CQL can improve a variable to be initialized:

DECLARE t TEXT NOT NULL;

IF some_condition THEN
  SET t := "some example text";
  -- `t` is initialized here
ELSE
  THROW;
END IF;
-- `t` must be initialized here if we made it this far

CALL requires_text_notnull(t); -- okay!

Thanks to CQL's ability to understand the control flow of users' programs, the above example works just fine.

CQL now also enforces that all procedures with OUT parameters of a nonnull reference type properly initialize said parameters before they return:

CREATE PROC some_proc(b BOOL NOT NULL, OUT t TEXT NOT NULL)
BEGIN
  IF b THEN
    SET t := another_proc(t);
    -- `t` is initialized here
  ELSE
    SET t := yet_another_proc(t);
    -- `t` is initialized here
  END IF;
  -- `t` must be initialized here because all possible
  -- branches initialized it, so `some_proc` is okay!
END;

As with nullability improvements, understanding the nuances of what will be considered initialized is easier if one has a sense for how control flow analysis works in the compiler.

Understanding Control Flow Analysis in CQL

To develop an intuition for how control flow analysis works in CQL, let's begin by taking a look at the following example:

DECLARE PROC p1(OUT t TEXT NOT NULL);
DECLARE PROC p2(i INTEGER NOT NULL, OUT t TEXT NOT NULL);

CREATE PROC p0(b BOOL, i INTEGER, OUT t TEXT NOT NULL)
BEGIN
  IF i IS NULL THEN
    IF b THEN
      CALL p1(t);
    ELSE
      SET t := "";
    END IF;
    RETURN;
  END IF;

  IF i == 0 THEN
    SET t := "";
  ELSE IF i > 0 THEN
    SET t := p2(i);
  ELSE
    THROW;
  END IF;
END;

There are a couple of things we must verify in order to ensure the code is type-safe:

  • With regard to the parameters of p0: Since t is an OUT parameter of type TEXT NOT NULL, p0 must always assign it a value before it returns. If it does not, a caller of p0 may end up with a variable of a NOT NULL type that actually contains NULL.

  • With regard to the calling of p2 in p0: Since p2 requires a first argument of type INTEGER NOT NULL, some sort of check must be performed to ensure that i is not NULL before p2(i) is executed.

If we carefully study p0, we can determine that both of the above conditions are satisfied. Making this determination, however, is not exactly trivial, and real-world code is often significantly more complicated than this—and it evolves over time. For these reasons, having a compiler that can make such determinations automatically is critical; most modern production compilers perform these sorts of checks.

The easiest way to understand how CQL does its job is to take the above example line-by-line. This is not exactly how CQL works under the hood, but it should provide an intuitive sense of how control flow analysis works in the compiler:

==> CREATE PROC p0(b BOOL, i INTEGER, OUT t TEXT NOT NULL)
    BEGIN
      ...
    END;

Right away, CQL can see that t is declared both OUT and TEXT NOT NULL and thus requires initialization before p0 returns. CQL can, therefore, add a fact about what it is analyzing to its previously empty set of facts:

  • t requires initialization.

We can then continue:

==> IF i IS NULL THEN
      ...
    END IF;

Here, the compiler notices that we're at an IF statement. In CQL, IF statements contain one or more branches, and the compiler considers every IF to be the start of a branch group. The same line also indicates the condition for the first branch: i IS NULL. CQL can update its set of facts:

  • t requires initialization.
  • In branch group:
    • In branch when i IS NULL:

It then proceeds to the next line:

    IF i IS NULL THEN
==>   IF b THEN
        CALL p1(t);
      ELSE
        SET t := "";
      END IF;
      RETURN;
    END IF;

Another branch group and branch:

  • t requires initialization.
  • In branch group:
    • In branch when i IS NULL:
      • In branch group:
        • In branch when b:

Continuing:

    IF i IS NULL THEN
      IF b THEN
==>     CALL p1(t);
      ELSE
        SET t := "";
      END IF;
      RETURN;
    END IF;

Since p1 takes an OUT argument of type TEXT NOT NULL, this call initializes t, and so CQL can update its set of facts once again:

  • t requires initialization.
  • In branch group:
    • In branch when i IS NULL:
      • In branch group:
        • In branch when b:
          • t is initialized.

Jumping ahead a couple of lines:

    IF i IS NULL THEN
      IF b THEN
        CALL p1(t);
      ELSE
==>     SET t := "";
      END IF;
      RETURN;
    END IF;

At this point, we're in another branch. We also have yet another fact to add because t is initialized here as well due to the SET:

  • t requires initialization.
  • In branch group:
    • In branch when i IS NULL:
      • In branch group:
        • In branch when b:
          • t is initialized.
        • In ELSE branch:
          • t is initialized.

Moving ahead one more line, things get a bit more interesting:

    IF i IS NULL THEN
      IF b THEN
        CALL p1(t);
      ELSE
        SET t := "";
==>   END IF;
      RETURN;
    END IF;

Here, we're at the end of an IF, and thus the end of a branch group. Whenever CQL reaches the end of a branch group, it merges the effects of all of its branches.

One very important thing to note here is that the current branch group has an ELSE branch, and so the set of branches covers all possible cases. That means if something is initialized in every branch within the branch group, we can consider it to be initialized after the branch group has ended: Initialization will always occur. This allows CQL to simplify its set of facts as follows as it leaves the branch group:

  • t requires initialization.
  • In branch group:
    • In branch when i IS NULL:
      • t is initialized.

Stepping forward one line again, we reach a RETURN:

    IF i IS NULL THEN
      ...
==>   RETURN;
    END IF;

We're now at a point where we can exit the procedure. CQL will, therefore, verify that if something requires initialization, it has been initialized. Since we have both the facts "t requires initialization" and "t is initialized", all is well!

The fact that the current branch returns early is added to the set of facts:

  • t requires initialization.
  • In branch group:
    • In branch when i IS NULL:
      • t is initialized.
      • Returns.

Moving ahead one more line, we reach the end of another branch and branch group, and again something interesting happens:

    ...
    IF i IS NULL THEN
      ...
==> END IF;

Upon ending the branch group, we know that the branch group has exactly one branch, that the branch is entered only when i IS NULL, and that the branch returns. What that tells CQL is that, if execution is going to continue after the branch group, its sole branch must not have been taken, and so CQL knows the opposite of its condition for entry will be true from this point onward:

  • t requires initialization.
  • i is not null.

The next IF is rather similar to what we've seen already in its structure, so we can jump ahead several lines to the next point of interest:

    IF i == 0 THEN
      SET t := "";
    ELSE IF i > 0 THEN
==>   SET t := p2(i);
    ELSE
      THROW;
    END IF;

Before we analyze the above-indicated line, we have the following set of facts:

  • t requires initialization.
  • i is not null.
  • In branch group:
    • In branch when i == 0:
      • t is initialized.
    • In branch when i > 0:

In the call p2(i), we know that i was declared to have type INTEGER and that p2 requires an INTEGER NOT NULL, but we also have the fact "i is not null". For this reason, we can consider p2(i) to be a valid call. We can also add the fact that t is initialized to our current set of facts:

  • ...
    • In branch when i > 0:
      • t is initialized.

NOTE: When it comes to code generation, it is not so simple as to say p2(i) is valid and proceed as usual. That's because p2 expects an argument of type INTEGER NOT NULL, but we merely have a value of type INTEGER that we happen to know cannot be null: INTEGER NOT NULL and INTEGER do not share the same underlying representation, and so we cannot pass the declared-nullable variable i directly to p2. To solve this problem, CQL rewrites the expression such that p2(i) becomes p2(cql_inferred_notnull(i)), where cql_inferred_notnull is an internal-only function that handles the nullable-to-nonnull representational conversion for us. This explains its presence in the following examples.

Jumping ahead again, we encounter a THROW:

    IF i == 0 THEN
      SET t := "";
    ELSE IF i > 0 THEN
      SET t := p2(cql_inferred_notnull(i));
    ELSE
==>   THROW;
    END IF;

The fact that the branch will throw is added to the current set of facts:

  • t requires initialization.
  • i is not null.
  • In branch group:
    • In branch when i == 0:
      • t is initialized.
    • In branch when i > 0:
      • t is initialized.
    • In ELSE branch:
      • Throws.

We then proceed to the end of the IF:

    IF i == 0 THEN
      SET t := "";
    ELSE IF i > 0 THEN
      SET t := p2(cql_inferred_notnull(i));
    ELSE
      THROW;
==> END IF;

Once again, CQL merges the effects of the branches in the branch group to finish the analysis of the IF. Since it can see that t was initialized in all branches except the one that throws, and since the branches cover all possible cases, the set of facts is simplified as follows given the knowledge that, if THROW was not encountered, t must have been initialized:

  • t requires initialization.
  • i is not null.
  • t is initialized.

Moving ahead one final time, we encounter the end of the procedure:

    CREATE PROC p0(b BOOL, i INTEGER, OUT t TEXT NOT NULL)
    BEGIN
      ...
==> END;

The only thing left to do at this point is to validate that anything requiring initialization has been initialized. Since we have both "t requires initialization" and "t is initialized", everything is in order.

Looking Ahead

As a recently generalized piece of functionality within the CQL compiler, control flow analysis will soon be used to enforce additional properties of users' programs. In particular, CQL will be able to ensure that cursors are always fetched before they're used and that cursors are always checked to have a row before their fields are accessed.

Hopefully you now understand the fundamentals of control flow analysis in CQL and the benefits it brings to your programs. Best wishes for 2022!

· 8 min read
CG/SQL Team

Shared fragments are a real game-changer for CQL.

Remember, these are designed to let you write part of a query and then substitute in parameters. So it's like a parameterized view in normal SQL terms. But actually it's more powerful than that; fragments also provide features that are more like Java generics. Let's do some examples.

Suppose we have a procedure which looks something like this:

CREATE PROC get_stuff(to_include_ text, to_exclude_ text)
BEGIN
  WITH
  to_exclude_recursive_query (tok, rest) AS (
    SELECT
      '',
      to_exclude_ || ','
    UNION ALL
    SELECT
      substr(rest, 1, instr(rest, ',') - 1),
      substr(rest, instr(rest, ',') + 1)
    FROM to_exclude_recursive_query
    WHERE rest <> ''
  ),
  to_exclude (id) AS (
    SELECT CAST(tok AS LONG)
    FROM to_exclude_recursive_query
    WHERE tok <> ''
  ),
  to_include_recursive_query (tok, rest) AS (
    SELECT
      '',
      to_include_ || ','
    UNION ALL
    SELECT
      substr(rest, 1, instr(rest, ',') - 1),
      substr(rest, instr(rest, ',') + 1)
    FROM to_include_recursive_query
    WHERE rest <> ''
  ),
  to_include (id) AS (
    SELECT CAST(tok AS LONG)
    FROM to_include_recursive_query
    WHERE tok <> ''
  )
  SELECT * from stuff S
  WHERE
    S.id in (select * from to_include) AND
    S.id not in (select * from to_exclude);
END;

With shared fragments you could write something like this:

@attribute(cql:shared_fragment)
CREATE PROC split_commas(str text)
BEGIN
  WITH splitter(tok, rest) AS (
    SELECT '', IFNULL(str || ',', '')
    UNION ALL
    SELECT
      substr(rest, 1, instr(rest, ',') - 1),
      substr(rest, instr(rest, ',') + 1)
    FROM splitter
    WHERE rest <> '')
  select tok from splitter where tok <> '';
END;

@attribute(cql:shared_fragment)
CREATE PROC ids_from_string(str text)
BEGIN
  WITH toks(tok) AS (CALL split_commas(str))
  SELECT CAST(tok AS LONG) AS id from toks;
END;

We now have a shared fragment called split_commas which can live anywhere, maybe in a standard include file. There are some immediate benefits:

  • the fragment is compiled on its own before usage so any errors are reported in the fragment
    • in contrast, with macros you get errors when you try to use the macro, and they are all charged to the line the macro appears on, so it's hopeless to figure out what's wrong
  • the text of the shared fragment will be the same, so it can be re-used in all locations; this can be a big binary size savings
    • in contrast, macros are pre-processed before CQL ever sees the text so it doesn't "know" it's the same code
  • fragments compose cleanly as we'll see; and they have typed arguments
  • fragments can be independently tested outside of the context in which they appear
    • make a test context and explore the fragment, no worries about it breaking on edge cases later

The first fragment, split_commas, does exactly what it sounds like: it takes a string argument and makes a list of the strings in it.

The second fragment uses the first to split a string and then it converts all the strings to long integers.

Now instead of the above we could write:

#include <stringsplit.sql> /* wherever you put the fragments */

CREATE PROC get_stuff(to_include_ text, to_exclude_ text)
BEGIN
  WITH
  to_include(id) AS (CALL ids_from_string(to_include_)),
  to_exclude(id) AS (CALL ids_from_string(to_exclude_))
  SELECT * from stuff S
  WHERE
    S.id in (select * from to_include) AND
    S.id not in (select * from to_exclude);
END;

And of course since ids_from_string lives in a shared location (stringsplit.sql), these fragments can be used all over your code and you'll only pay for the text one time. This gives you great flexibility, very much like parameterized views. You can have any number of these fragments, they will share code, they compose like crazy, and there is no schema cost!

Generics

A series of useful fragments for generating data would go a long way but there are other applications of fragments and you might want to operate on various data sources without hard coding them all. This is where the generic form of fragments comes in. Consider a case where you want to be able to filter stuff by say name and age. You could create this fragment:

@attribute(cql:shared_fragment)
CREATE PROC filter_stuff(
  pattern_ text not null,
  min_age_ integer not null,
  max_age_ integer not null)
BEGIN
  WITH
  source(*) LIKE stuff
  SELECT * from source S
  WHERE
    S.name LIKE pattern_ AND
    S.age BETWEEN min_age_ and max_age_;
END;

Now imagine that we had added the shared fragment annotation to get_stuff (just like the above). We could then write the following:

CREATE PROC the_right_stuff(
  to_include_ text,
  to_exclude_ text,
  pattern_ text not null,
  min_age_ integer not null,
  max_age_ integer not null)
BEGIN
  WITH
  get_stuff(*) AS (call get_stuff(to_include_, to_exclude_)),
  filter_stuff(*) AS (call filter_stuff(pattern_, min_age_, max_age_)
                      using get_stuff as source)
  SELECT * from filter_stuff S
  ORDER BY name
  LIMIT 5;
END;

Or with some sugar to forward arguments and assume the CTE name matches, more economically:

CREATE PROC the_right_stuff(
  to_include_ text,
  to_exclude_ text,
  pattern_ text not null,
  min_age_ integer not null,
  max_age_ integer not null)
BEGIN
  WITH
  (call get_stuff(*)),
  (call filter_stuff(*) using get_stuff as source)
  SELECT * from filter_stuff S
  ORDER BY name
  LIMIT 5;
END;

The arg syntax (*) simply indicates that the arg names in the caller should match the same names in the callee. In general, call foo(*) expands to call foo(from arguments like foo arguments); * is rather more economical than that.
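In other words, inside the_right_stuff the two lines below mean the same thing; the (*) form just saves the typing:

-- the sugared form
(call get_stuff(*)),

-- what it expands to
(call get_stuff(from arguments like get_stuff arguments)),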

In this example filter_stuff doesn't know where its data will be coming from, you bind its table parameter source to a compatible data source of your choice. For example, this would also be legal:

CREATE PROC almost_the_right_stuff(
  pattern_ text not null,
  min_age_ integer not null,
  max_age_ integer not null)
BEGIN
  WITH
  (call filter_stuff(*) using stuff as source)
  SELECT * from filter_stuff S
  ORDER BY name
  LIMIT 5;
END;

Conditionals

It's often desirable to have some options in the generated SQL without having to fork your entire query. Shared fragments address this as well with the conditional form. In this form the top level of the fragment is an IF statement and there are a number of alternatives. Here are some simple modifications to the above that illustrate some of the possibilities.

@attribute(cql:shared_fragment)
CREATE PROC filter_stuff(
  pattern_ text,
  min_age_ integer not null,
  max_age_ integer not null)
BEGIN
  IF pattern_ IS NOT NULL THEN
    WITH
    source(*) LIKE stuff
    SELECT * from source S
    WHERE
      S.name LIKE pattern_ AND
      S.age BETWEEN min_age_ and max_age_;
  ELSE
    WITH
    source(*) LIKE stuff
    SELECT * from source S
    WHERE
      S.age BETWEEN min_age_ and max_age_;
  END IF;
END;

In the above, if the input pattern is NULL then it is not considered; it won't be part of the generated SQL at all. Note that source (same name) appears in both branches and therefore must be the same type, as it will be fulfilled by one actual table parameter.

Now the above could have been achieved with something like this:

pattern_ IS NULL OR S.name LIKE pattern_

But that would have no useful selectivity. With the conditional form, in general you might be able to avoid joins and so forth entirely. Consider something like this hypothetical:

@attribute(cql:shared_fragment)
CREATE PROC filter_stuff(
  pattern_ text,
  min_age_ integer not null,
  max_age_ integer not null)
BEGIN
  IF pattern_ IS NOT NULL THEN
    WITH
    source(*) LIKE stuff
    SELECT DISTINCT S.* from source S
    INNER JOIN keywords K
    WHERE
      K.keyword LIKE pattern_ AND
      S.age BETWEEN min_age_ and max_age_;
  ELSE
    WITH
    source(*) LIKE stuff
    SELECT * from source S
    WHERE
      S.age BETWEEN min_age_ and max_age_;
  END IF;
END;

Here we save the DISTINCT and the JOIN if there is no pattern, which might be important. Of course there are probably better ways to match keywords, but this is just an illustration of what's possible.

There are numerous ways this flexibility can be used. Again, this is a simple example; a real schema transform would be more complex.

@attribute(cql:shared_fragment)
CREATE PROC get_stuff(
  to_include_ text,
  to_exclude_ text,
  schema_v2 bool not null)
BEGIN
  IF schema_v2 THEN
    WITH
    to_include(id) AS (CALL ids_from_string(to_include_)),
    to_exclude(id) AS (CALL ids_from_string(to_exclude_))
    SELECT * from stuff_2 S
    WHERE
      S.id in (select * from to_include) AND
      S.id not in (select * from to_exclude);
  ELSE
    WITH
    to_include(id) AS (CALL ids_from_string(to_include_)),
    to_exclude(id) AS (CALL ids_from_string(to_exclude_))
    SELECT * from stuff S
    WHERE
      S.id in (select * from to_include) AND
      S.id not in (select * from to_exclude);
  END IF;
END;

Validation

All of this requires a bunch of checking, at least this:

  • the LIKE forms can only appear in a shared fragment
  • the CALL forms must refer to shared fragments
  • the CALL args must be compatible
  • the number and type of the provided tables in USING must be correct
  • the shared fragment must be a single select statement or an IF statement with an ELSE
    • the statement lists of the IF/ELSE combo must all be single select statements
    • all the choices in the IF block must return the same shape (this is normal for procedures)
  • the shared fragment can't have any out arguments
  • the provided fragment arguments cannot themselves use the nested SELECT construct

I think this is a total game changer for SQL authoring and should go a long way to making it easier to get your work done on SQLite. A good base set of shared fragments as part of any suite of procedures seems like a good idea.

There are more details in the section on shared fragments in Chapter 14 of The Guide.

These features are in the current build as of today (12/14/2021).

Happy Holidays and stay safe.

· One min read
CG/SQL Team

We've long needed a way to see the most recent SQLite result code in the context of, say, a catch block (most other times you can assume SQLITE_OK was the last result code, otherwise control flow would have transferred elsewhere; sometimes SQLITE_ROW or SQLITE_DONE might be the current result code).

Soon we'll provide a sample header that declares the most common error codes in an enum, but for now you could do something like this:

-- pasted from sqlite.c
#define SQLITE_BUSY 5 /* The database file is locked */

-- this is a contrived example
create proc get_first_foo(out can_retry bool not null)
begin
  -- can_retry is set to 0 automatically, language semantics guarantee this

  begin try
    select foo from bar limit 1;
  end try;
  begin catch
    set can_retry := (@rc == SQLITE_BUSY);
    throw; -- rethrow the original error
  end catch;
end;
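Until that sample header ships, here's a hedged sketch of the enum approach using CQL's declare enum form (the constant values come from the standard SQLite headers):

declare enum sqlite_result integer (
  ok = 0,     -- SQLITE_OK
  busy = 5,   -- SQLITE_BUSY
  row = 100,  -- SQLITE_ROW
  done = 101  -- SQLITE_DONE
);

-- then, inside the catch block:
-- set can_retry := (@rc == sqlite_result.busy);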

· 2 min read
CG/SQL Team

The nested select statement is frequently misused; in particular, if you get no rows back from the expression, that's an error. So for instance:

set x_ := (select x from foo.x where id = y);

This will throw (with a peculiar error, SQLITE_DONE) if there is no such row.

Sometimes people try to fix this problem with a nullcheck:

set x_ := IFNULL((select x from foo.x where id = y), -1);

That doesn't help at all. It's not a null-value situation; there's no row at all.

set x_ := (select IFNULL(x, -1) from foo.x where id = y);

Is likewise unhelpful. To help with this situation we add two forms:

-- useful if foo.x is already known to be not null
set x_ := (select x from foo.x where id = y IF NOTHING -1);

-- useful if foo.x might be null
set x_ := (select x from foo.x where id = y IF NOTHING OR NULL -1);

Both of these deal with the case where there is no row. The second lets you have a simple default for both no row or null value. That form is equivalent to:

set x_ := (select IFNULL(x,-1) from foo.x where id = y IF NOTHING -1);

i.e. both problem cases are handled.

Of course the -1 here could be any valid expression, even a second (select...)
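Here's a sketch of that: the default value is itself a nested select (illustrative only):

-- fall back to the smallest x if the requested row is absent
set x_ := (select x from foo.x where id = y
           if nothing (select min(x) from foo.x));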

· One min read
CG/SQL Team

Important change in CQL semantics.

Previously, if you did an early return, or fell through the end, from a procedure that is supposed to return a result set but did not in fact provide one, you would get a fake SQLITE_ERROR. Now you get an empty result set for "free".

This interpretation seems much more natural and avoids a lot of really annoying stub selects to comply with the contract.

This also works for the out statement in the same fashion.

If you want to return an error, use throw. This is a lot more natural...

examples:

-- this gives you an empty result set if x <= 0
create proc maybe_return(x integer)
begin
  if x > 0 then
    select * from foo where foo.y > x;
  end if;
end;

-- so does this
create proc maybe_return(x integer)
begin
  if x <= 0 then
    return;
  end if;
  select * from foo where foo.y > x;
end;

-- so does this
create proc maybe_out(x integer)
begin
  if x > 0 then
    declare C cursor for select etc.
    out C;
  end if;
end;
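And if an error really is what you want, here's a minimal sketch of the throw alternative:

-- this gives an error (not an empty result set) if x <= 0
create proc must_return(x integer)
begin
  if x <= 0 then
    throw;
  end if;
  select * from foo where foo.y > x;
end;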

· 3 min read
CG/SQL Team

Further adding to the type calculus of the CQL language, we introduced the ability to encode the "kind" of primitive types. This can be used in a number of ways -- like "units" for natural things and like a "type" for synthetic keys and other such things. It's easier to illustrate by example.

declare job_id type long<job_id>;
declare person_id type long<person_id>;

declare j job_id;
declare p person_id;

set p := j; -- this is an error

With the above in place, other expressions like p == j would also produce errors as these long values are no longer type compatible. This is a great way to add enforcement to your schema and procedures. Likewise you can use these annotations to add "units" to your data types. e.g.

declare meters type real<meters>;
declare grams type real<grams>;

declare m meters;
declare g grams;

Variables of type grams (e.g. g) are not compatible with variables of type meters (e.g. m) even though both are real.

Likewise, attempting to insert grams into a column that is typed to meters will give errors. Of course SQLite doesn't know about any of this, so all the <> stuff is removed in the generated SQL. This is just about type enforcement at compile time.

Enumerations like:

declare enum surface integer (paper, canvas);
declare enum writer integer (pen, paper, brush);

enable this:

declare s surface;       -- s is now of type integer<surface>
declare w writer;        -- w is now of type integer<writer>
set s := surface.paper;  -- ok
set s := writer.pen;     -- error
set w := writer.brush;   -- ok
case when s == w then 1 else 0 end; -- error (w/s not comparable)
set w := s;              -- error again

additionally in DML/DDL:

create table draw_action(
  w writer,
  s surface
);

insert into draw_action values(w, s); -- ok
insert into draw_action values(s, w); -- error!

So the type kinds can be quite helpful when dealing with loose variables.

The notion of specific types was added to the language nearly two years ago to support the object type because there was a great desire to prevent object<dictionary> being assigned from object<list> but this "type kind", whether it's with units (e.g. "meters", "grams") or a type name (e.g. "job_id") adds a lot of high value type checking.

The kind can be added, stripped, or changed with a cast operation and the type system allows a constant or variable with no kind (e.g. "1") to mix and match with any kind so long as the base type is compatible as usual. So you get the most value by using the specific type consistently but you won't go insane adding test cases that use constants for instance.
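A quick sketch of that flexibility (the error comments paraphrase the kind of diagnostics you'd see):

declare m meters;
declare g grams;

set m := 1.5;                     -- ok: a constant has no kind and mixes freely
set g := cast(m as real<grams>);  -- ok: an explicit cast changes the kind
set g := m;                       -- error: meters is not compatible with grams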

As of this writing the expression kinds are checked for compatibility everywhere plus or minus bugs. There are extensive tests.