Part 2: Semantic Analysis

Preface

Part 2 continues with a discussion of the essentials of the semantic analysis pass of the CQL compiler. As in the previous sections, the goal here is not to go over every single rule but rather to give a sense of how semantic analysis happens in general -- the core strategies and implementation choices -- so that when reading the code you will have an idea of how smaller pieces fit into the whole. To accomplish this, various key data structures will be explained in detail and selected examples of their use are included.

Semantic Analysis

The overall goal of the semantic analysis pass is to verify that a correct program has been submitted to the compiler. The compiler does this by "decorating" the AST with semantic information. This information is mainly concerned with the "types" of the various things in the program. A key function of the semantic analyzer, the primary "weapon" in computing these types, if you will, is name resolution. The semantic analyzer decides what any given name means in any context and then uses that meaning, which is itself based on the AST constructs that came before, to compute types and then check those types for errors.

Broadly speaking, the errors that can be discovered are of these forms:

  • mentioned names do not exist
    • e.g. using a variable or table or column without declaring it
  • mentioned names are not unique, or are ambiguous
    • e.g. every view must have a unique name
    • e.g. table names need to be unique, or aliased when joining tables
  • operands are not compatible with each other or with the intended operation
    • e.g. you can't add a string to a real
    • e.g. you can't do the % operation on a real
    • e.g. the expression in a WHERE clause must result in a numeric
    • e.g. the first argument to printf must be a string literal
    • e.g. you can't assign a long value to an integer variable
    • e.g. you can't assign a possibly null result to a not-null variable
  • there are too many or too few operands for an operation
    • e.g. an INSERT statement must include sufficiently many columns and no extras
    • e.g. a function or procedure call must have the correct number of operands
  • an operation is happening in a context where it is not allowed
    • e.g. use of aggregate functions in the WHERE clause
    • e.g. use of unique SQLite functions outside of a SQL statement

There are several hundred possible errors, and no attempt will be made to cover them all here, but we will talk about how errors are created, recorded, and reported.

Decorated AST examples

Recalling the AST output from Part 1, this is what that same tree looks like with semantic information attached:

LET X := 1 + 3;

{let_stmt}: X: integer notnull variable
| {name X}: X: integer notnull variable
| {add}: integer notnull
| | {int 1}: integer notnull
| | {int 3}: integer notnull

And here's an example with some structure types:

SELECT 1 AS x, 3.2 AS y;

{select_stmt}: select: { x: integer notnull, y: real notnull }
| {select_core_list}: select: { x: integer notnull, y: real notnull }
| | {select_core}: select: { x: integer notnull, y: real notnull }
| | | {select_expr_list_con}: select: { x: integer notnull, y: real notnull }
| | | | {select_expr_list}: select: { x: integer notnull, y: real notnull }
| | | | | {select_expr}: x: integer notnull
| | | | | | {int 1}: integer notnull
| | | | | | {opt_as_alias}
| | | | | | | {name x}
| | | | | {select_expr_list}
| | | | | | {select_expr}: y: real notnull
| | | | | | | {dbl 3.2}: real notnull
| | | | | | | {opt_as_alias}
| | | | | | | | {name y}
| | | | {select_from_etc}: ok
| | | | | {select_where}
| | | | | | {select_groupby}
| | | | | | | {select_having}
| {select_orderby}
| | {select_limit}
| | | {select_offset}

These can be generated by adding --sem --ast to the CQL command line along with --in your_file.sql.
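
For example, assuming the compiler binary is named cql:

cql --in your_file.sql --sem --ast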

Keep these shapes in mind as we discuss the various sources of type information.

The Base Data Structures

First recall that every AST node has this field in it:

struct sem_node *_Nullable sem;

This is the pointer to the semantic information for that node. Semantic analysis happens immediately after parsing and before any of the code-generators run. Importantly, code generators never run if semantic analysis reported any errors. Before we get into the shape of the semantic node, we should start with the fundamental unit of type info sem_t which is usually stored in a variable called sem_type.

typedef uint64_t sem_t;

The low order bits of a sem_t encode the core type and indeed there is a helper function to extract the core type from a sem_t.

// Strips out all the flag bits and gives you the base/core type.
cql_noexport sem_t core_type_of(sem_t sem_type) {
  return sem_type & SEM_TYPE_CORE;
}

The core bits are as follows:

#define SEM_TYPE_NULL 0             // the subtree is a null literal (not just nullable)
#define SEM_TYPE_BOOL 1             // the subtree is a bool
#define SEM_TYPE_INTEGER 2          // the subtree is an integer
#define SEM_TYPE_LONG_INTEGER 3     // the subtree is a long integer
#define SEM_TYPE_REAL 4             // the subtree is a real
#define SEM_TYPE_TEXT 5             // the subtree is a text type
#define SEM_TYPE_BLOB 6             // the subtree is a blob type
#define SEM_TYPE_OBJECT 7           // the subtree is any object type
#define SEM_TYPE_STRUCT 8           // the subtree is a table/view
#define SEM_TYPE_JOIN 9             // the subtree is a join
#define SEM_TYPE_ERROR 10           // marks the subtree as having a problem
#define SEM_TYPE_OK 11              // sentinel for ok but no type info
#define SEM_TYPE_PENDING 12         // sentinel for type calculation in flight
#define SEM_TYPE_REGION 13          // the ast is a schema region
#define SEM_TYPE_CURSOR_FORMAL 14   // this is used for the cursor parameter type uniquely
#define SEM_TYPE_CORE 0xff          // bit mask for the core types

#define SEM_TYPE_MAX_UNITARY (SEM_TYPE_OBJECT+1) // the last unitary type

These break into a few categories:

  • NULL to OBJECT are the "unitary" types -- these are the types that a single simple variable can be
    • a column can be any of these except OBJECT or NULL
    • the NULL type comes only from the NULL literal which has no type
    • instances of, say, a TEXT column might have a NULL value but they are known to be TEXT
  • STRUCT indicates that the object has many fields, like a table, or a cursor
  • JOIN indicates that the object is the concatenation of many STRUCT types
    • e.g. T1 inner join T2 is a JOIN type with T1 and T2 being the parts
    • a JOIN could be flattened to STRUCT, but this is typically not done
    • the type of a SELECT statement will be a STRUCT representing the expressions that were selected
    • those expressions in turn used columns from the JOIN that was the FROM clause
  • ERROR indicates that the subtree had an error
    • the error will have been already reported
    • the error type generally cascades up the AST to the root
  • OK indicates that there is no type information but there was no problem
    • e.g. a correct IF statement will resolve to simply OK (no error)
  • PENDING is used sometimes while a type computation is in progress
    • this type doesn't appear in the AST, but has its own unique value so as to not conflict with any others
  • REGION is used to identify AST fragments that correspond to schema regions
    • see Chapter 10 of the Guide for more information on regions
  • CORE is the mask for the core parts, 0xf would do the job but for easy reading in the debugger we use 0xff
    • new core types are not added very often, adding a new one is usually a sign that you are doing something wrong

The core type can be modified by various flags. The flags, in principle, can be combined in any way but in practice many combinations make no sense. For instance, HAS_DEFAULT is for table columns and CREATE_FUNC is for function declarations. There is no one object that could require both of these.

The full list as of this writing is as follows:

#define SEM_TYPE_NOTNULL            _64(0x0100) // set if and only if null is not possible
#define SEM_TYPE_HAS_DEFAULT        _64(0x0200) // set for table columns with a default
#define SEM_TYPE_AUTOINCREMENT      _64(0x0400) // set for table columns with autoinc
#define SEM_TYPE_VARIABLE           _64(0x0800) // set for variables and parameters
#define SEM_TYPE_IN_PARAMETER       _64(0x1000) // set for in parameters (can mix with below)
#define SEM_TYPE_OUT_PARAMETER      _64(0x2000) // set for out parameters (can mix with above)
#define SEM_TYPE_DML_PROC           _64(0x4000) // set for stored procs that have DML/DDL
#define SEM_TYPE_HAS_SHAPE_STORAGE  _64(0x8000) // set for a cursor with simplified fetch syntax
#define SEM_TYPE_CREATE_FUNC        _64(0x10000) // set for a function that returns a created object +1 ref
#define SEM_TYPE_SELECT_FUNC        _64(0x20000) // set for a sqlite UDF function declaration
#define SEM_TYPE_DELETED            _64(0x40000) // set for columns that are not visible in the current schema version
#define SEM_TYPE_VALIDATED          _64(0x80000) // set if item has already been validated against previous schema
#define SEM_TYPE_USES_OUT           _64(0x100000) // set if proc has a one-row result using the OUT statement
#define SEM_TYPE_USES_OUT_UNION     _64(0x200000) // set if proc uses the OUT UNION form for multi-row results
#define SEM_TYPE_PK                 _64(0x400000) // set if column is a primary key
#define SEM_TYPE_FK                 _64(0x800000) // set if column is a foreign key
#define SEM_TYPE_UK                 _64(0x1000000) // set if column is a unique key
#define SEM_TYPE_VALUE_CURSOR       _64(0x2000000) // set only if SEM_TYPE_HAS_SHAPE_STORAGE is set and the cursor has no statement
#define SEM_TYPE_SENSITIVE          _64(0x4000000) // set if the object is privacy sensitive
#define SEM_TYPE_DEPLOYABLE         _64(0x8000000) // set if the object is a deployable region
#define SEM_TYPE_BOXED              _64(0x10000000) // set if a cursor's lifetime is managed by a box object
#define SEM_TYPE_HAS_CHECK          _64(0x20000000) // set for table column with a "check" clause
#define SEM_TYPE_HAS_COLLATE        _64(0x40000000) // set for table column with a "collate" clause
#define SEM_TYPE_INFERRED_NOTNULL   _64(0x80000000) // set if inferred to be not null (but was originally nullable)
#define SEM_TYPE_VIRTUAL            _64(0x100000000) // set if and only if this is a virtual table
#define SEM_TYPE_HIDDEN_COL         _64(0x200000000) // set if and only if hidden column on a virtual table
#define SEM_TYPE_TVF                _64(0x400000000) // set if and only if the table node is a table valued function
#define SEM_TYPE_IMPLICIT           _64(0x800000000) // set if and only if the variable was declared implicitly (via declare out)
#define SEM_TYPE_CALLS_OUT_UNION    _64(0x1000000000) // set if proc calls an out union proc

Note: _64(x) expands to either a trailing L or a trailing LL depending on the bitness of the compiler, whichever yields an int64_t.

Going over the meaning of all of the above is again beyond the scope of this document; some of the flags are very specialized and essentially the validation just requires a bit of storage in the tree to do its job, so that storage is provided with a flag. However, two flag bits are especially important and are computed almost everywhere sem_t is used. These are SEM_TYPE_NOTNULL and SEM_TYPE_SENSITIVE.

  • SEM_TYPE_NOTNULL indicates that the marked item is known to be NOT NULL, probably because it was declared as such, or directly derived from a not null item
    • Typically when two operands are combined both must be marked NOT NULL for the result to still be NOT NULL (there are exceptions like COALESCE)
    • Values that might be null cannot be assigned to targets that must not be null
  • SEM_TYPE_SENSITIVE indicates that the marked item is some kind of PII or other sensitive data.
    • Any time a sensitive operand is combined with another operand the resulting type is sensitive
    • There are very few ways to "get rid" of the sensitive bit -- it corresponds to the presence of @sensitive in the data type declaration
    • Values that are sensitive cannot be assigned to targets that are not marked sensitive
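
To make the bit manipulation concrete, here is a small sketch (illustrative, not code from sem.c) that composes and inspects a sem_t using the definitions above:

// a TEXT NOT NULL variable as the analyzer would encode it
sem_t sem_type = SEM_TYPE_TEXT | SEM_TYPE_NOTNULL | SEM_TYPE_VARIABLE;

// core_type_of strips the flag bits, leaving SEM_TYPE_TEXT
sem_t core = core_type_of(sem_type);

// flag tests are single bitwise "and" operations
bool_t not_null = !!(sem_type & SEM_TYPE_NOTNULL);    // true
bool_t sensitive = !!(sem_type & SEM_TYPE_SENSITIVE); // false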

The semantic node sem_node carries all the possible semantic info we might need, and the sem_type holds the flags above and tells us how to interpret the rest of the node. There are many fields -- we'll talk about some of the most important ones here to give you a sense of how things hang together.

Note that CSTR is simply an alias for const char *. CSTR is used extensively in the codebase for brevity.

typedef struct sem_node {
  sem_t sem_type;           // core type plus flags
  CSTR name;                // for named expressions in select columns, etc.
  CSTR kind;                // the Foo in object<Foo>, not a variable or column name
  CSTR error;               // error text for test output, not used otherwise
  struct sem_struct *sptr;  // encoded struct if any
  struct sem_join *jptr;    // encoded join if any
  int32_t create_version;   // create version if any (really only for tables and columns)
  int32_t delete_version;   // delete version if any (really only for tables and columns)
  bool_t recreate;          // for tables only, true if marked @recreate
  CSTR recreate_group_name; // for tables only, the name of the recreate group if they are in one
  CSTR region;              // the schema region, if applicable; null means unscoped (default)
  symtab *used_symbols;     // for select statements, we need to know which of the ids in the select list was used, if any
  list_item *index_list;    // for tables we need the list of indices that use this table (so we can recreate them together if needed)
  struct eval_node *value;  // for enum values we have to store the evaluated constant value of each member of the enum
} sem_node;

  • sem_type : already discussed above, this tells you how to interpret everything else
  • name : variables, columns, etc. have a canonical name -- when a name resolves (resolution is case-insensitive), the canonical name is stored here
    • typically later passes emit the canonical variable name everywhere
    • e.g. because FoO and fOO might both resolve to an object declared as foo, we always emit foo in codegen
  • kind : in CQL any type can be discriminated as in declare foo real<meters>, the kind here is meters
    • two expressions of the same core type (e.g. real) are incompatible if they have a kind and the kind does not match
    • e.g. if you have bar real<liters> then set foo := bar; this is an error even though both are real because foo above is real<meters>
  • sptr : if the item's core type is SEM_TYPE_STRUCT then this is populated (see below)
  • jptr : if the item's core type is SEM_TYPE_JOIN then this is populated (see below)

If the object is a structure type then this is simply an array of names, kinds, and semantic types. In fact the semantic types will all be unitary, possibly modified by NOT_NULL or SENSITIVE, but none of the other flags apply. A single sptr directly corresponds to the notion of a "shape" in the analyzer. Shapes come from anything that looks like a table, such as a cursor, or the result of a SELECT statement.

// for tables and views and the result of a select

typedef struct sem_struct {
  CSTR struct_name;   // struct name
  uint32_t count;     // count of fields
  CSTR *names;        // field names
  CSTR *kinds;        // the "kind" text of each column, if any; e.g. in integer<foo> the kind is foo
  sem_t *semtypes;    // typecode for each field
} sem_struct;
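
For example, the shape of the earlier SELECT 1 AS x, 3.2 AS y could be laid out like this (an illustrative sketch; the real shapes are built by helpers in sem.c, not static initializers):

static CSTR shape_names[] = { "x", "y" };
static CSTR shape_kinds[] = { NULL, NULL };  // neither column has a <kind>
static sem_t shape_types[] = {
  SEM_TYPE_INTEGER | SEM_TYPE_NOTNULL,
  SEM_TYPE_REAL | SEM_TYPE_NOTNULL,
};

static sem_struct select_shape = {
  .struct_name = "select",
  .count = 2,
  .names = shape_names,
  .kinds = shape_kinds,
  .semtypes = shape_types,
};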

If the object is a join type (such as the parts of the FROM clause) then the jptr field will be populated. This is nothing more than a named list of struct types.

// for the data type of (parts of) the FROM clause
// sometimes I refer to it as a "joinscope"

typedef struct sem_join {
  uint32_t count;             // count of table/views in the join
  CSTR *names;                // names of the table/view
  struct sem_struct **tables; // struct type of each table/view
} sem_join;
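
Continuing the sketch, a joinscope for T1 INNER JOIN T2 is just the two named shapes side by side (t1_shape and t2_shape stand in for sem_struct values built as above):

static CSTR join_names[] = { "T1", "T2" };
static sem_struct *join_tables[] = { &t1_shape, &t2_shape };

static sem_join t1_t2_join = {
  .count = 2,
  .names = join_names,
  .tables = join_tables,
};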

With these building blocks we can represent the type of anything in the CQL language.

Initiating Semantic Analysis

The semantic analysis pass runs much the same way as the AST emitter. In sem.c there is the essential function sem_main. It suffices to call sem_main on the root of the AST. That root node is expected to be a stmt_list node.

// This method loads up the global symbol tables in either empty state or
// with the appropriate tokens ready to go. Using our own symbol tables for
// dispatch saves us a lot of if/else string comparison verbosity.
cql_noexport void sem_main(ast_node *ast) {
  // restore all globals and statics we own
  sem_cleanup();
  eval_init();
  ...
}

As you can see, sem_main begins by resetting all the global state. You can of course do this yourself after calling sem_main (when you're done with the results).

sem_main sets a variety of useful and public global variables that describe the results of the analysis. The ones in sem.h are part of the contract and you should feel free to use them in a downstream code-generator. Other items are internal and should be avoided. The internal items are typically defined statically in sem.c. The essential outputs will be described in the last section of this part.

The cleanup has this structure:

// This method frees all the global state of the semantic analyzer
cql_noexport void sem_cleanup() {
  eval_cleanup();

  BYTEBUF_CLEANUP(deployable_validations);
  BYTEBUF_CLEANUP(recreate_annotations);
  BYTEBUF_CLEANUP(schema_annotations);

  SYMTAB_CLEANUP(funcs);
  SYMTAB_CLEANUP(globals);
  SYMTAB_CLEANUP(indices);
  SYMTAB_CLEANUP(locals);
  ...

  // these are getting zeroed so that leaksanitizer will not count those objects as reachable from a global root.

  all_ad_hoc_list = NULL;
  all_functions_list = NULL;
  ...

This basically deallocates everything and resets all the globals to NULL.

sem_main of course has to walk the AST and it does so in much the same way as we saw in gen_sql.c. There is a series of symbol tables whose key is an AST type and whose value is a function plus arguments to dispatch (effectively a lambda.) The semantic analyzer doesn't have to think about things like "should I emit parentheses?" so the signature of each type of lambda can be quite a bit simpler. We'll go over each kind with some examples.

First we have the non-SQL statements, these are basic flow control or other things that SQLite will never see directly.

  symtab *syms = non_sql_stmts;

  STMT_INIT(if_stmt);
  STMT_INIT(while_stmt);
  STMT_INIT(switch_stmt);
  STMT_INIT(leave_stmt);
  ...

Here STMT_INIT creates a binding between (e.g.) the AST type if_stmt and the function sem_if_stmt. This lets us dispatch any part of the AST to its handler directly.
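
The macro itself is just token-pasting plus a symbol table insertion; a plausible sketch (the real definition lives in sem.c) looks like this:

// binds the AST node type name (e.g. "if_stmt") to its analyzer (e.g. sem_if_stmt)
#define STMT_INIT(x) symtab_add(syms, #x, (void *)sem_ ## x)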

Next we have the SQL statements. These get analyzed in the same way as the others, and with functions that have the same signature; however, using one of these means the procedure that contains the statement must get a database connection in order to run. Use of the database will require the procedure's signature to change; this is recorded by setting the SEM_TYPE_DML_PROC flag bit on the procedure's AST node.

  syms = sql_stmts;

  STMT_INIT(create_table_stmt);
  STMT_INIT(drop_table_stmt);
  STMT_INIT(create_index_stmt);
  STMT_INIT(create_view_stmt);
  STMT_INIT(select_stmt);
  STMT_INIT(delete_stmt);
  STMT_INIT(update_stmt);
  STMT_INIT(insert_stmt);
  ...

Again STMT_INIT creates a binding between (e.g.) the AST type delete_stmt and the function sem_delete_stmt so we can dispatch to the handler.

Next we have expression types. These are set up with EXPR_INIT. Many of the operators require exactly the same kinds of verification, so in order to be able to share the code, the expression analysis functions get an extra argument for the operator in question. Typically the string of the operator is only needed to make a good quality error message with validation being otherwise identical. Here are some samples...

  EXPR_INIT(num, sem_expr_num, "NUM");
  EXPR_INIT(str, sem_expr_str, "STR");
  EXPR_INIT(blob, sem_expr_blob, "BLB");
  EXPR_INIT(null, sem_expr_null, "NULL");
  EXPR_INIT(dot, sem_expr_dot, "DOT");
  EXPR_INIT(const, sem_expr_const, "CONST");
  EXPR_INIT(mul, sem_binary_math, "*");
  EXPR_INIT(mod, sem_binary_integer_math, "%");
  EXPR_INIT(not, sem_unary_logical, "NOT");
  EXPR_INIT(is_true, sem_unary_is_true_or_false, "IS TRUE");
  EXPR_INIT(tilde, sem_unary_integer_math, "~");
  EXPR_INIT(uminus, sem_unary_math, "-");

Looking at the very first entry as an example, we see that EXPR_INIT creates a mapping between the AST type num and the analysis function sem_expr_num and that function will get the text "NUM" as an extra argument. As it happens sem_expr_num doesn't need the extra argument, but sem_binary_math certainly needs the "*" as that function handles a large number of binary operators.
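
Conceptually, EXPR_INIT has to record both the analysis function and the operator text so the dispatcher can pass that extra argument along. A sketch of how such a binding might look (the real record and macro are in sem.c):

typedef struct {
  void (*func)(ast_node *ast, CSTR str);  // the analysis function
  CSTR str;                               // operator text, mostly for error messages
} sem_expr_dispatch;

#define EXPR_INIT(x, func, str) \
  static sem_expr_dispatch expr_disp_ ## x = { func, str }; \
  symtab_add(exprs, #x, &expr_disp_ ## x);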

Let's quickly go over this list as these are the most important analyzers:

  • sem_expr_num : analyzes any numeric constant
  • sem_expr_str : analyzes any string literal or identifier
  • sem_expr_blob : analyzes any blob literal
  • sem_expr_null : analyzes the NULL literal (and nothing else)
  • sem_expr_dot : analyzes a compound name like T1.id
  • sem_expr_const : analyzes a const(...) expression, doing the constant evaluation
  • sem_binary_math : analyzes any normal binary math operator like '+', '-', '/' etc.
  • sem_binary_integer_math : analyzes any binary math operator where the operands must be integers like '%' or '|'
  • sem_unary_logical : analyzes any unary logical operator (the result is a bool) -- this is really only NOT
  • sem_unary_is_true_or_false : analyzes any of the IS TRUE, IS FALSE family of postfix unary operators
  • sem_unary_integer_math : analyzes any unary operator where the operand must be an integer -- this is really only ~
  • sem_unary_math : analyzes any math unary operator, presently only negation (but in the future unary + too)

The last group of normal associations are for builtin functions, like these:

  FUNC_INIT(changes);
  FUNC_INIT(printf);
  FUNC_INIT(strftime);
  FUNC_INIT(date);
  FUNC_INIT(time);

Each of these is dispatched when a function call is found in the tree. By way of example FUNC_INIT(changes) causes the changes function to map to sem_func_changes for validation.

There are a few other similar macros for more exotic cases but the general pattern should be clear now. With these in place it's very easy to traverse arbitrary statement lists and arbitrary expressions with sub expressions and have the correct function invoked without having large switch blocks all over.

Semantic Errors

Some of the following examples will show the handling of semantic errors more precisely but the theory is pretty simple. Each of the analyzers that has been registered is responsible for putting an appropriate sem_node into the AST it is invoked on. The caller will look to see if that sem_node is of type SEM_TYPE_ERROR using is_error(ast). If it is, the caller will mark its own AST as errant using record_error(ast) and this continues all the way up the tree. The net of this is that wherever you begin semantic analysis, you can know if there were any problems by checking for an error at the top of the tree you provided.

At the point of the initial error, the analyzer is expected to also call report_error providing a suitable message. This will be logged to stderr. In test mode it is also stored in the AST so that verification steps can confirm that errors were reported at exactly the right place.

If there are no errors, then a suitable sem_node is created for the resulting type or else, at minimum, record_ok(ast) is used to place the shared "OK" type on the node. The "OK" type indicates no type information, but no errors either. "OK" is helpful for statements that don't involve expressions like DROP TABLE Foo.
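
The resulting caller-side pattern shows up all over sem.c; in sketch form it is:

// analyze a child node, then check it and cascade the error if needed
sem_expr(ast->left);
if (is_error(ast->left)) {
  record_error(ast);  // this node is now errant too; the error bubbles up
  return;
}

// ... validation specific to this node ...

record_ok(ast);  // nothing to report, no type info needed either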

The Primitive Types

Perhaps the simplest analysis of all happens at the leaves of the AST. By way of example, here is the code for expression nodes of type num, the numeric literals.

// Expression type for numeric primitives
static void sem_expr_num(ast_node *ast, CSTR cstr) {
  Contract(is_ast_num(ast));
  EXTRACT_NUM_TYPE(num_type, ast);
  switch (num_type) {
    case NUM_BOOL:
      ast->sem = new_sem(SEM_TYPE_BOOL | SEM_TYPE_NOTNULL);
      break;

    case NUM_INT:
      ast->sem = new_sem(SEM_TYPE_INTEGER | SEM_TYPE_NOTNULL);
      break;

    case NUM_LONG:
      ast->sem = new_sem(SEM_TYPE_LONG_INTEGER | SEM_TYPE_NOTNULL);
      break;

    default:
      // this is all that's left
      Contract(num_type == NUM_REAL);
      ast->sem = new_sem(SEM_TYPE_REAL | SEM_TYPE_NOTNULL);
      break;
  }
}

As you can see the code simply looks at the AST node, confirming first that it is a num node. Then it extracts the num_type. Then ast->sem is set to a semantic node of the matching type adding in SEM_TYPE_NOTNULL because literals are never null.

The new_sem function is used to make an empty sem_node with the sem_type filled in as specified. Nothing can go wrong creating a literal so there are no failure modes.

It doesn't get much simpler unless maybe...

// Expression type for constant NULL
static void sem_expr_null(ast_node *ast, CSTR cstr) {
  Contract(is_ast_null(ast));
  // null literal
  ast->sem = new_sem(SEM_TYPE_NULL);
}

It's hard to get simpler than doing semantic analysis of the NULL literal. Its code should be clear with no further explanation needed.

Unary Operators

Let's dive into a simple case that does require some analysis -- the unary operators. There are comparatively few and there isn't much code required to handle them all.

Here's the code for the unary math operators:

// The only unary math operators are '-' and '~'
// Reference types are not allowed
static void sem_unary_math(ast_node *ast, CSTR op) {
  sem_t core_type, combined_flags;
  if (!sem_unary_prep(ast, &core_type, &combined_flags)) {
    return;
  }

  if (!sem_validate_numeric(ast, core_type, op)) {
    return;
  }

  // The result of unary math promotes to integer. Basically this converts
  // bool to integer. Long integer and Real stay as they are. Text is
  // already ruled out.
  sem_t sem_type_result = sem_combine_types(
    (SEM_TYPE_INTEGER | SEM_TYPE_NOTNULL),
    (core_type | combined_flags));

  ast->sem = new_sem(sem_type_result);
  ast->sem->kind = ast->left->sem->kind;

  // note ast->sem->name is NOT propagated because SQLite doesn't let you refer to
  // the column 'x' in 'select -x' -- the column name is actually '-x' which is useless
  // so we have no name once you apply unary math (unless you use 'as')
  // hence ast->sem->name = ast->left->sem->name is WRONG here and it is not missing on accident
}

Unary Prep

OK already we need to pause because there is a "prep" pattern here common to most of the shared operators that we should discuss. The prep step takes care of most of the normal error handling which is the same for all the unary operators and the same pattern happens in binary operators. Let's take a look at sem_unary_prep.

// The unary operators all have a similar prep to the binary.  We need
// to visit the left side (it's always the left node even if the operator goes on the right)
// if that's ok then we need the combined_flags and core type. There is only
// the one. Returns true if everything is ok.
static bool_t sem_unary_prep(ast_node *ast, sem_t *core_type, sem_t *combined_flags) {
  // op left | left op
  sem_expr(ast->left);

  if (is_error(ast->left)) {
    *core_type = SEM_TYPE_ERROR;
    *combined_flags = 0;
    record_error(ast);
    return false;
  }

  sem_node *sem = ast->left->sem;
  sem_t sem_type = sem->sem_type;

  *core_type = core_type_of(sem_type);
  *combined_flags = not_nullable_flag(sem_type) | sensitive_flag(sem_type);

  Invariant(is_unitary(*core_type));
  return true;
}

Reviewing the steps:

  • first we analyze the operand, it will be in ast->left
  • if that's an error, we just return the error code from the prep steps
  • now that it's not an error, we pull the core type out of the operand
  • then we pull the not nullable and sensitive flag bits out of the operand
  • finally, a boolean is returned for convenience -- true if everything is ok, false if there was an error

This is useful setup for all the unary operators, and as we'll see, the binary operators have a similar prep step.

Back to Unary Processing

Looking at the overall steps we see:

  • sem_unary_prep : verifies that the operand is not an error, and gets its core type and flag bits
  • sem_validate_numeric : verifies that the operand is a numeric type
    • recall these are the math unary operators, so the operand must be numeric
  • sem_combine_types : creates the smallest type that holds two compatible types
    • by combining with "integer not null" we ensure that the resulting type is at least as big as an integer
    • if the argument is of type long or real then it will be the bigger type and the resulting type will be long or real as appropriate
    • in short, bool is promoted to int, everything else stays the same
    • sem_combine_types also combines the nullability and sensitivity appropriately
  • a new sem_node of the combined type is created
    • the type "kind" of the operand is preserved (e.g. the meters in real<meters>)
    • any column alias or variable name is not preserved, the value is now anonymous
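
For instance, the combine step for -x yields these results for various operand types (illustrative):

// x: bool notnull  -> combine(integer notnull, bool notnull) = integer notnull
// x: integer       -> combine(integer notnull, integer)      = integer (nullable)
// x: long notnull  -> combine(integer notnull, long notnull) = long notnull
// x: real          -> combine(integer notnull, real)         = real (nullable)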

These primitives are designed to combine well. For instance, consider sem_unary_integer_math:

static void sem_unary_integer_math(ast_node *ast, CSTR op) {
  sem_unary_math(ast, op);
  sem_reject_real(ast, op);
}

The steps are:

  • sem_unary_math : do the sequence we just discussed
  • sem_reject_real : report/record an error if the result type is real otherwise do nothing

Note that in all cases the op string simply gets pushed down to the place where the errors happen. Let's take a quick look at one of the sources of errors in the above. Here's the numeric validator:

static bool_t sem_validate_numeric(ast_node *ast, sem_t core_type, CSTR op) {
  if (is_blob(core_type)) {
    report_error(ast->left, "CQL0045: blob operand not allowed in", op);
    record_error(ast);
    return false;
  }

  if (is_object(core_type)) {
    report_error(ast->left, "CQL0046: object operand not allowed in", op);
    record_error(ast);
    return false;
  }

  if (is_text(core_type)) {
    report_error(ast->left, "CQL0047: string operand not allowed in", op);
    record_error(ast);
    return false;
  }

  return true;
}

That function is pretty much dumb as rocks. The non-numeric types are blob, object, and text. There is a custom error for each type (it could have been shared but specific error messages seem to help users.) This code doesn't know its context, but all it needs is op to tell it what the numeric-only operator was and it can produce a nice error message. It leaves an error in the AST using record_error. Its caller can then simply return if anything goes wrong.

It's not hard to guess how sem_reject_real works:

// Some math operators like << >> & | % only make sense on integers
// This function does the extra checking to ensure they do not get real values
// as arguments. It's a post-pass after the normal math checks.
static void sem_reject_real(ast_node *ast, CSTR op) {
  if (!is_error(ast)) {
    sem_t core_type = core_type_of(ast->sem->sem_type);
    if (core_type == SEM_TYPE_REAL) {
      report_error(ast, "CQL0001: operands must be an integer type, not real", op);
      record_error(ast);
    }
  }
}

  • if the AST node isn't already an error, and the node is of type "real", report an error
  • it assumes the type is already known to be numeric
  • the pre-check for errors is to avoid double reporting; if something has already gone wrong, the core type will be SEM_TYPE_ERROR
    • no new error recording is needed in that case, as obviously an error was already recorded

Binary Operators

Binary Prep

With the knowledge we have so far, this code pretty much speaks for itself, but we'll walk through it.

// All the binary ops do the same preparation -- they evaluate the left and the
// right expression, then they check those for errors. Then they need
// the types of those expressions and the combined_flags of the result. This
// does exactly that for its various callers. Returns true if all is well.
static bool_t sem_binary_prep(ast_node *ast, sem_t *core_type_left, sem_t *core_type_right, sem_t *combined_flags) {
  EXTRACT_ANY_NOTNULL(left, ast->left);
  EXTRACT_ANY_NOTNULL(right, ast->right);

  // left op right
  sem_expr(left);
  sem_expr(right);

  if (is_error(left) || is_error(right)) {
    record_error(ast);
    *core_type_left = SEM_TYPE_ERROR;
    *core_type_right = SEM_TYPE_ERROR;
    *combined_flags = 0;
    return false;
  }

  *core_type_left = core_type_of(left->sem->sem_type);
  *core_type_right = core_type_of(right->sem->sem_type);
  *combined_flags = combine_flags(left->sem->sem_type, right->sem->sem_type);

  Invariant(is_unitary(*core_type_left));
  Invariant(is_unitary(*core_type_right));

  return true;
}

  • sem_expr : used to recursively walk the left and right nodes
  • is_error : checks if either side had errors, and, if so, simply propagates the error
  • extract the left and right core types
  • combine nullability and sensitivity flags

And that's it! These are the standard prep steps for all binary operators. With this done, the caller has the core types of the left and right operands plus combined flags on a silver platter and one check is needed to detect if anything went wrong.

Example: Is or Is Not

This analyzer is the simplest of all the binaries:

// IS and IS NOT are special in that they return a not null boolean.
static void sem_binary_is_or_is_not(ast_node *ast, CSTR op) {
  sem_t core_type_left, core_type_right, combined_flags;

  if (!sem_binary_prep(ast, &core_type_left, &core_type_right, &combined_flags)) {
    return;
  }

  if (!sem_verify_compat(ast, core_type_left, core_type_right, op)) {
    return;
  }

  // the result of is or is not is always a bool and never null
  ast->sem = new_sem(SEM_TYPE_BOOL | SEM_TYPE_NOTNULL | sensitive_flag(combined_flags));
}

  • sem_binary_prep : checks for errors in the left or right
  • sem_verify_compat : ensures that left and right operands are type compatible (discussed later)
  • the result is always of type bool not null

If either step goes wrong the error will naturally propagate.

Example: Binary Math

This is the general worker for binary math operations, the most common operations like '+', '-', '*' and so forth.

// For all math operations, we combine the types and yield the type that
// holds both using the helper. If any text, that's an error.
static void sem_binary_math(ast_node *ast, CSTR op) {
  sem_t core_type_left, core_type_right, combined_flags;
  if (!sem_binary_prep(ast, &core_type_left, &core_type_right, &combined_flags)) {
    return;
  }

  if (error_any_object(ast, core_type_left, core_type_right, op)) {
    return;
  }

  if (error_any_blob_types(ast, core_type_left, core_type_right, op)) {
    return;
  }

  if (error_any_text_types(ast, core_type_left, core_type_right, op)) {
    return;
  }

  sem_t core_type = sem_combine_types(core_type_left, core_type_right);

  CSTR kind = sem_combine_kinds(ast->right, ast->left->sem->kind);
  if (is_error(ast->right)) {
    record_error(ast);
    return;
  }

  ast->sem = new_sem(core_type | combined_flags);
  ast->sem->kind = kind;
}

Let's have a look at those steps:

  • sem_binary_prep : checks for errors on the left or right
  • error_any_object : reports an error if the left or right is of type object
  • error_any_blob_types : reports an error if the left or right is of type blob
  • error_any_text_types : reports an error if the left or right is of type text
  • sem_combine_types : computes the combined type, the smallest numeric type that holds both left and right
    • note the operands are now known to be numeric
    • the three type error checkers give nice tight errors about the left or right operand
  • sem_combine_kinds : tries to create a single type kind for both operands
    • if their kind is incompatible, records an error on the right
  • new_sem : creates a sem_node with the combined type, flags, and then the kind is set

At this point it might help to look at a few more of the base validators -- they are rather unremarkable.

Example Validator: error_any_object

// If either of the types is an object, then produce an error on the ast.
static bool_t error_any_object(ast_node *ast, sem_t core_type_left, sem_t core_type_right, CSTR op) {
  if (is_object(core_type_left)) {
    report_error(ast->left, "CQL0002: left operand cannot be an object in", op);
    record_error(ast);
    return true;
  }

  if (is_object(core_type_right)) {
    report_error(ast->right, "CQL0003: right operand cannot be an object in", op);
    record_error(ast);
    return true;
  }

  return false;
}

  • is_object : checks a sem_type against SEM_TYPE_OBJECT
    • if the left or right child is an object, an appropriate error is generated
  • there is no strong convention for returning true if ok, or true if error; it's pretty ad hoc
    • this doesn't seem to cause a lot of problems

Example Validator: sem_combine_kinds

// Here we check that type<Foo> only combines with type<Foo> or type.
// If there is a current object type, then the next item must match
// If there is no such type, then an object type that arrives becomes the required type
// if they ever don't match, record an error
static CSTR sem_combine_kinds_general(ast_node *ast, CSTR kleft, CSTR kright) {
  if (kright) {
    if (kleft) {
      if (strcmp(kleft, kright)) {
        CSTR errmsg = dup_printf("CQL0070: expressions of different kinds can't be mixed: '%s' vs. '%s'", kright, kleft);
        report_error(ast, errmsg, NULL);
        record_error(ast);
      }
    }
    return kright;
  }

  return kleft;
}

// helper to crack the ast nodes first and then call the normal comparisons
static CSTR sem_combine_kinds(ast_node *ast, CSTR kright) {
  CSTR kleft = ast->sem->kind;
  return sem_combine_kinds_general(ast, kleft, kright);
}

  • sem_combine_kinds : uses the worker sem_combine_kinds_general after extracting the kind from the left node
    • usually you already have one kind and you want to know if another kind is compatible, hence this helper
  • sem_combine_kinds_general : applies the general rules for "kind" strings:
    • NULL + NULL => NULL
    • NULL + x => x
    • x + NULL => x
    • x + x => x
    • x + y => error (if x != y)
  • this is one of the rare functions that creates a dynamic error message

Example Validator: is_numeric_compat

This helper is called frequently in the course of other semantic checks. It produces no errors; that's up to the caller. Often there is a numeric path and a non-numeric path, so this helper can't create the errors as it doesn't yet know if anything bad has happened. Most of the is_something functions are the same way.

cql_noexport bool_t is_numeric_compat(sem_t sem_type) {
  sem_type = core_type_of(sem_type);
  return sem_type >= SEM_TYPE_NULL && sem_type <= SEM_TYPE_REAL;
}

is_numeric_compat operates by checking the core type for the numeric range. Note that NULL is compatible with numerics because expressions like NULL + 2 have meaning in SQL. The type of that expression is nullable integer and the result is NULL.

Example Validator: sem_combine_types

// The second workhorse of semantic analysis, given two types that
// are previously known to be compatible, it returns the smallest type
// that holds both. If either is nullable, the result is nullable.
// Note: in the few cases where that isn't true, the normal algorithm for
// nullability result must be overridden (see coalesce, for instance).
static sem_t sem_combine_types(sem_t sem_type_1, sem_t sem_type_2) {
  ... too much code ... summary below
}

This beast is rather lengthy but unremarkable. It follows these rules:

  • text is only compatible with text
  • object is only compatible with object
  • blob is only compatible with blob
  • numerics are only compatible with other numerics and NULL
    • NULL promotes the other operand, whatever it is (might still be NULL)
    • bool promotes to integer if needed
    • integer promotes to long integer if needed
    • long integer promotes to real if needed
    • the combined type is the smallest numeric type that holds left and right according to the above rules

Some examples might be helpful:

  • 1 + 2L -> long
  • false + 3.1 -> real
  • 2L + 3.1 -> real
  • true + 2 -> integer
  • 'x' + 1 -> not compatible

Note that sem_combine_types assumes the types have already been checked for compatibility and will use Contract to enforce this. You should be using other helpers like is_numeric_compat and friends to ensure the types agree before computing the combined type. A list of values that must be compatible with each other (e.g. in needle IN (haystack)) can be checked using sem_verify_compat repeatedly.

Example Validator: sem_verify_assignment

The sem_verify_assignment function is used any time there is something like a logical assignment going on. There are two important cases:

  • SET x := y : an actual assignment
  • call foo(x) : the expression x must be "assignable" to the formal variable for the argument of foo

This is a lot like normal binary operator compatibility with one extra rule: the source expression must not be a bigger type than the target. e.g. you cannot assign a long to an integer, nor pass a long expression to a function that has an integer parameter.

// This verifies that the types are compatible and that it's ok to assign
// the expression to the variable. In practice that means:
// * the variable type core type and kind must be compatible with the expression core type and kind
// * the variable must be nullable if the expression is nullable
// * the variable must be sensitive if the assignment is sensitive
// * the variable type must be bigger than the expression type
// Here ast is used only to give a place to put any errors.
static bool_t sem_verify_assignment(ast_node *ast, sem_t sem_type_needed, sem_t sem_type_found, CSTR var_name) {
  if (!sem_verify_compat(ast, sem_type_needed, sem_type_found, var_name)) {
    return false;
  }

  if (!sem_verify_safeassign(ast, sem_type_needed, sem_type_found, var_name)) {
    return false;
  }

  if (is_nullable(sem_type_found) && is_not_nullable(sem_type_needed)) {
    report_error(ast, "CQL0013: cannot assign/copy possibly null expression to not null target", var_name);
    return false;
  }

  if (sensitive_flag(sem_type_found) && !sensitive_flag(sem_type_needed)) {
    report_error(ast, "CQL0014: cannot assign/copy sensitive expression to non-sensitive target", var_name);
    return false;
  }

  return true;
}

  • sem_verify_compat : checks for standard type compatibility between the left and the right
  • sem_verify_safeassign : checks that if the types are different the right operand is the smaller of the two
  • nullability checks ensure you aren't trying to assign a nullable value to a not null variable
  • sensitivity checks ensure you aren't trying to assign a sensitive value to a not sensitive variable
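
A sketch of a typical call site, roughly what SET x := y does (the names here are illustrative):

// the target's type is what's "needed", the expression's type is what was "found"
if (!sem_verify_assignment(expr, variable->sem->sem_type, expr->sem->sem_type, name)) {
  record_error(ast);
  return;
}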

Simple Statement Validation

With the expression building blocks, most of the usual kind of language statements become quite simple to check for correctness. It's probably easiest to illustrate this with an example. Let's look at validation for the WHILE statement:

// While semantic analysis is super simple.
// * the condition must be numeric
// * the statement list must be error-free
// * loop_depth is increased allowing the use of interior leave/continue
static void sem_while_stmt(ast_node *ast) {
  Contract(is_ast_while_stmt(ast));
  EXTRACT_ANY_NOTNULL(expr, ast->left);
  EXTRACT(stmt_list, ast->right);

  // WHILE [expr] BEGIN [stmt_list] END

  sem_numeric_expr(expr, ast, "WHILE", SEM_EXPR_CONTEXT_NONE);

  if (is_error(expr)) {
    record_error(ast);
    return;
  }

  if (stmt_list) {
    loop_depth++;

    sem_stmt_list(stmt_list);

    loop_depth--;

    if (is_error(stmt_list)) {
      record_error(ast);
      return;
    }
  }

  record_ok(ast);
}

  • EXTRACT* : pulls out the tree parts we need
  • sem_numeric_expr : verifies the loop expression is numeric
  • sem_stmt_list : recursively validates the body of the loop

Note: the WHILE statement is one of the loop constructs, which means that LEAVE and CONTINUE are legal inside it. The loop_depth global tracks the fact that we are in a loop so that analysis for LEAVE and CONTINUE can report errors if we are not.
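
For example, a LEAVE analyzer can be nearly trivial given loop_depth; this sketch captures the idea (the real version in sem.c has a different message and also accounts for switch statements):

static void sem_leave_stmt(ast_node *ast) {
  Contract(is_ast_leave_stmt(ast));

  // LEAVE is only meaningful inside of a loop (or switch)
  if (loop_depth == 0) {
    report_error(ast, "leave must appear inside of a loop", NULL);
    record_error(ast);
    return;
  }

  record_ok(ast);
}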

It's not hard to imagine that sem_stmt_list will basically walk the AST, pulling out statements and dispatching them using the STMT_INIT tables previously discussed. You might land right back in sem_while_stmt for a nested WHILE -- it's turtles all the way down.

If SEM_EXPR_CONTEXT_NONE is a mystery, don't worry, it's covered in the next section.

Expression Contexts

It turns out that in the SQL language some expression types are only valid in some parts of a SQL statement (e.g. aggregate functions can't appear in a LIMIT clause) and so there is always a context for any numeric expression. When a new root expression is being evaluated, it sets the expression context per the caller's specification.

The expression contexts are as follows:

#define SEM_EXPR_CONTEXT_NONE           0x0001
#define SEM_EXPR_CONTEXT_SELECT_LIST    0x0002
#define SEM_EXPR_CONTEXT_WHERE          0x0004
#define SEM_EXPR_CONTEXT_ON             0x0008
#define SEM_EXPR_CONTEXT_HAVING         0x0010
#define SEM_EXPR_CONTEXT_ORDER_BY       0x0020
#define SEM_EXPR_CONTEXT_GROUP_BY       0x0040
#define SEM_EXPR_CONTEXT_LIMIT          0x0080
#define SEM_EXPR_CONTEXT_OFFSET         0x0100
#define SEM_EXPR_CONTEXT_TABLE_FUNC     0x0200
#define SEM_EXPR_CONTEXT_WINDOW         0x0400
#define SEM_EXPR_CONTEXT_WINDOW_FILTER  0x0800
#define SEM_EXPR_CONTEXT_CONSTRAINT     0x1000

The idea here is simple: when calling a root expression, the analyzer provides the context value that has the bit that corresponds to the current context. For instance, if the expression being validated is in the WHERE clause, the code will provide SEM_EXPR_CONTEXT_WHERE. The inner validators check this context; in particular, anything that is only available in some contexts has a bit-mask that is the union of the context bits where it can be used. The validator can check those possibilities against the current context with one bitwise "and" operation. A zero result indicates that the operation is not valid in the current context.

This bitwise "and" is performed by one of these two helper macros which makes the usage a little clearer:

#define CURRENT_EXPR_CONTEXT_IS(x)      (!!(current_expr_context & (x)))
#define CURRENT_EXPR_CONTEXT_IS_NOT(x)  (!(current_expr_context & (x)))

Expression Context Example: Concat

The concatenation operator || is challenging to successfully emulate because it does many different kinds of numeric to string conversions automatically. Rather than perennially getting this wrong, we simply do not support this operator in a context where SQLite isn't going to be doing the concatenation. So typically users use "printf" instead to get formatting done outside of a SQL context. The check for invalid use of || is very simple and it happens, of course, in sem_concat.

  if (CURRENT_EXPR_CONTEXT_IS(SEM_EXPR_CONTEXT_NONE)) {
    report_error(ast, "CQL0241: CONCAT may only appear in the context of a SQL statement", NULL);
    record_error(ast);
    return;
  }

Expression Context Example: IN

A slightly more complex example happens processing the IN operator. This operator has two forms: the form with an expression list, which can be used anywhere, and the form with a SELECT statement. The latter form can only appear in some sections of SQL, and not at all in loose expressions. For instance, that form may not appear in the LIMIT or OFFSET sections of a SQLite statement.

We use this construct to do the validation:

  uint32_t valid = SEM_EXPR_CONTEXT_SELECT_LIST
                  |SEM_EXPR_CONTEXT_WHERE
                  |SEM_EXPR_CONTEXT_ON
                  |SEM_EXPR_CONTEXT_HAVING
                  |SEM_EXPR_CONTEXT_TABLE_FUNC;

  if (CURRENT_EXPR_CONTEXT_IS_NOT(valid)) {
    report_error( ast, "CQL0078: [not] in (select ...) is only allowed inside "
                       "of select lists, where, on, and having clauses", NULL);
    record_error(ast);
    return;
  }

If the reader is interested in a simple learning exercise, run down the purpose of SEM_EXPR_CONTEXT_TABLE_FUNC -- it's simple, but important, and it only has one use case so it's easy to find.

Name Resolution

We've gotten pretty far without talking about the elephant in the room: name resolution.

Like SQL, many statements in CQL have names in positions where the type of the name is completely unambiguous. For instance, nobody could be confused about what sort of symbol Foo is in DROP INDEX Foo.

This kind of name, with a clear category, is the easiest to resolve, and there are a lot of resolutions of this form. Let's do an example.

Example: Index Name Resolution

// This is the basic checking for the drop index statement
// * the index must exist (have been declared) in some version
// * it could be deleted now, that's ok, but the name has to be valid
static void sem_drop_index_stmt(ast_node *ast) {
  Contract(is_ast_drop_index_stmt(ast));
  EXTRACT_ANY_NOTNULL(name_ast, ast->right);
  EXTRACT_STRING(name, name_ast);

  ast_node *index_ast = find_usable_index(name, name_ast, "CQL0112: index in drop statement was not declared");
  if (!index_ast) {
    record_error(ast);
    return;
  }

  record_ok(ast);
}

Well, this is interesting. But what's going on with find_usable_index? What is usable? Why aren't we just looking up the index name in some name table? Let's have a look at the details of find_usable_index.

// returns the node only if it exists and is not restricted by the schema region.
static ast_node *find_usable_index(CSTR name, ast_node *err_target, CSTR msg) {
  ast_node *index_ast = find_index(name);
  if (!index_ast) {
    report_error(err_target, msg, name);
    return NULL;
  }

  if (!sem_validate_object_ast_in_current_region(name, index_ast, err_target, msg)) {
    return NULL;
  }

  return index_ast;
}

We haven't discussed schema regions yet but what you need to know about them for now is this:

  • any object can be in a region.
  • a region may depend on other regions

If an object is in a region, then it may only use schema parts that are in the same region, or the region's dependencies (transitively).

The point of this is that you might have a rather large schema and you probably don't want any piece of code to use just any piece of schema. You can use regions to ensure that the code for feature "X" doesn't try to use schema designed exclusively for feature "Y". That "X" code probably has no business even knowing of the existence of "Y" schema.

So now usable simply means this:

  • find_index can find the name in the symbol table for indices
  • the found index is accessible in the current region

If we had used an example that was looking up a table name, the same region considerations would apply, however, additionally tables can be deprecated with @delete so there would be additional checks to make sure we're talking about a live table and not a table's tombstone.

In short, these simple cases just require looking up the entity and verifying that it's accessible in the current context.

Flexible Name Resolution

The "hard case" for name resolution is where the name is occurring in an expression. Such a name can refer to all manner of things. It could be a global variable, a local variable, an argument, a table column, a field in a cursor, and others. The general name resolver goes through several phases looking for the name. Each phase can either report an affirmative success or error (in which case the search stops), or it may simply report that the name was not found but the search should continue.

We can demystify this a bit by looking at the most common way to get name resolution done.

// Resolves a (potentially qualified) identifier, writing semantic information
// into `ast` if successful, or reporting and recording an error for `ast` if
// not.
static void sem_resolve_id(ast_node *ast, CSTR name, CSTR scope) {
  Contract(is_id(ast) || is_ast_dot(ast));
  Contract(name);

  // We have no use for `type` and simply throw it away.
  sem_t *type = NULL;
  sem_resolve_id_with_type(ast, name, scope, &type);
}

The name resolver works on either a vanilla name (e.g. x) or a scoped name (e.g. T1.x). The name and scope are provided. The ast parameter is used only as a place to report errors; there is no further cracking of the AST needed to resolve the name. As you can see sem_resolve_id just calls the more general function sem_resolve_id_with_type and is used in the most common case where you don't need to be able to mutate the semantic type info for the identifier. That's the 99% case.

So let's move on to the "real" resolver.

// This function is responsible for resolving both unqualified identifiers (ids)
// and qualified identifiers (dots). It performs the following two roles:
//
// - If an optional `ast` is provided, it works the same way most semantic
//   analysis functions work: semantic information will be written into the
//   ast, errors will be reported to the user, and errors will be recorded in
//   the AST.
//
// - `*type_ptr` will be set to mutable type (`sem_t *`) in the current
//   environment if the identifier successfully resolves to a type. (There are,
//   unfortunately, a few exceptions in which a type will be successfully
//   resolved and yet `*type_ptr` will not be set. These include when a cursor is
//   in an expression position, when the expression is `rowid` (or similar), and
//   when the id resolves to an enum case. The reason no mutable type is
//   returned in these cases is that a new type is allocated as part of semantic
//   analysis, and there exists no single, stable type in the environment to
//   which a pointer could be returned. This is a limitation of this function,
//   albeit one that's currently not problematic.)
//
// Resolution is attempted in the order that the `sem_try_resolve_*` functions
// appear in the `resolver` array. Each takes the same arguments: An (optional)
// AST, a mandatory name, an optional scope, and mandatory type pointer. If the
// identifier provided to one of these resolvers is resolved successfully, *or*
// if the correct resolver was found but there was an error in the program,
// `SEM_RESOLVE_STOP` is returned and resolution is complete, successful or not.
// If a resolver is tried and it determines that it is not the correct resolver
// for the identifier in question, `SEM_RESOLVE_CONTINUE` is returned and the
// next resolver is tried.
//
// This function should not be called directly. If one is interested in
// performing semantic analysis, call `sem_resolve_id` (or, if within an
// expression, `sem_resolve_id_expr`.) Alternatively, if one wants to get a
// mutable type from the environment, call `find_mutable_type`.
static void sem_resolve_id_with_type(ast_node *ast, CSTR name, CSTR scope, sem_t **type_ptr) {
  Contract(name);
  Contract(type_ptr);

  *type_ptr = NULL;

  sem_resolve (*resolver[])(ast_node *ast, CSTR, CSTR, sem_t **) = {
    sem_try_resolve_arguments,
    sem_try_resolve_column,
    sem_try_resolve_rowid,
    sem_try_resolve_cursor_as_expression,
    sem_try_resolve_variable,
    sem_try_resolve_enum,
    sem_try_resolve_cursor_field,
    sem_try_resolve_arg_bundle,
  };

  for (uint32_t i = 0; i < sizeof(resolver) / sizeof(void *); i++) {
    if (resolver[i](ast, name, scope, type_ptr) == SEM_RESOLVE_STOP) {
      return;
    }
  }

  report_resolve_error(ast, "CQL0069: name not found", name);
  record_resolve_error(ast);
}

This function is well described in its own comments. We can easily see the "mini-resolvers" which attempt to find the name in order:

  • sem_try_resolve_arguments : an argument in the argument list
  • sem_try_resolve_column : a column name (possibly scoped)
  • sem_try_resolve_rowid : the virtual rowid column (possibly scoped)
  • sem_try_resolve_cursor_as_expression : use of a cursor as a boolean -- the bool is true if the cursor has data
  • sem_try_resolve_variable : local or global variables
  • sem_try_resolve_enum : the constant value of an enum (must be scoped)
  • sem_try_resolve_cursor_field : a field in a cursor (must be scoped)
  • sem_try_resolve_arg_bundle : a field in an argument bundle (must be scoped)

These all use this enum to communicate progress:

// All `sem_try_resolve_*` functions return either `SEM_RESOLVE_CONTINUE` to
// indicate that another resolver should be tried, or `SEM_RESOLVE_STOP` to
// indicate that the correct resolver was found. Continuing implies that no
// failure has (yet) occurred, but stopping implies neither success nor failure.
typedef enum {
  SEM_RESOLVE_CONTINUE = 0,
  SEM_RESOLVE_STOP = 1
} sem_resolve;

Each of these mini-resolvers will have a series of rules, for example sem_try_resolve_cursor_field is going to have to do something like this:

  • if there is no scope, it can't be a cursor field, return CONTINUE
  • if the scope is not the name of a cursor, return CONTINUE
  • if the name is a field in the cursor, return STOP with success
  • else, report that the name is not a valid member of the cursor, and return STOP with an error
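
In sketch form (the helpers find_cursor and find_field_type are illustrative stand-ins, not the real internals):

static sem_resolve sem_try_resolve_cursor_field(ast_node *ast, CSTR name, CSTR scope, sem_t **type_ptr) {
  // no scope (a plain `x`) means this can't be a cursor field
  if (!scope) {
    return SEM_RESOLVE_CONTINUE;
  }

  ast_node *cursor = find_cursor(scope);  // hypothetical lookup
  if (!cursor) {
    return SEM_RESOLVE_CONTINUE;  // scope isn't a cursor; some other resolver's job
  }

  sem_t *field_type = find_field_type(cursor, name);  // hypothetical lookup
  if (!field_type) {
    report_resolve_error(ast, "name not a valid cursor field", name);
    record_resolve_error(ast);
    return SEM_RESOLVE_STOP;  // right resolver, but the program has an error
  }

  // success: decorate the AST (when provided) and yield the mutable type
  ...
  return SEM_RESOLVE_STOP;
}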

All the mini-resolvers are similarly structured, generically:

  • if it's not my case, return CONTINUE
  • if it is my case return STOP (maybe with an error)

Some of the mini-resolvers have quite a few steps, but any one mini-resolver is only about a screenful of code and it does one job.

Flow Analysis

CQL implements a basic form of control flow analysis in "flow.c". The header "flow.h" exposes a small set of primitives used by "sem.c" during semantic analysis.

Flow analysis in CQL involves two important concepts: flow contexts and improvements. These are rather entangled concepts -- one is useless without the other -- and so the approach to describing them here will alternate between giving a bit of background on one and then the other, with a greater level of detail about the specific types of improvements being supplied later on.

A flow context is used, in essence, to create a boundary around a portion of a user's program. At the moment, there are four types of contexts.

The first type of context is called, rather boringly, a normal context. Normal contexts are used for portions of a user's code that may be entered conditionally. A good example of this is in SELECT expressions: When a WHERE clause is present, the expression list is only evaluated when the WHERE clause is true. If we look at sem_select_expr_list_con, we can get an idea of how this works in terms of flow contexts:

static void sem_select_expr_list_con(ast_node *ast) {
  ...
  // Analyze the FROM portion (if it exists).
  sem_select_from(select_from_etc);
  error = is_error(select_from_etc);

  // Push a flow context to contain improvements made via the WHERE clause that
  // will be in effect for the SELECT expression list.
  FLOW_PUSH_CONTEXT_NORMAL();

  if (!error) {
    ...
    sem_sensitive = sem_select_where_etc(select_from_etc);
    ...
    sem_set_improvements_for_true_condition(where_expr);
    ...
  }
  ...
  if (!error) {
    ...
    sem_select_expr_list(select_expr_list);
    ...
  }
  ...
  FLOW_POP_CONTEXT_NORMAL();
  ...
}

Though the code above is very much simplified, the steps are essentially as follows:

  1. Analyze the FROM clause.
  2. Push a new normal context.
  3. Analyze the WHERE clause.
  4. Set improvements given the WHERE clause (ultimately by calling flow_set_flag_for_type); we'll come back to this part shortly.
  5. Analyze the expression list with the improvements from the WHERE in effect.
  6. Pop the context, un-setting the improvements from the WHERE.

This, of course, only begins to make sense once one understands what we mean by improvements.

CQL, at the moment, supports two forms of improvements: nullability improvements and initialization improvements. Both of these will be discussed in more detail later, but the basic idea is that an improvement upgrades the type of some value within a particular flow context. For example, in the expression SELECT x + x FROM t WHERE x IS NOT NULL, we can reason that x + x can safely be given a nonnull type because of the WHERE clause. This is exactly what we do in sem_select_expr_list_con: We make a context to hold the improvements that may come from the WHERE, analyze the WHERE, set the appropriate improvements given the WHERE, analyze the expression list, and then pop the context to unset the improvements (as they must not affect any enclosing expressions).

In addition to normal contexts, there are also branch contexts and branch group contexts. These two context types are designed to work together for handling IF, CASE, IIF, SWITCH, et cetera.

Like normal contexts, branch contexts assume that they are entered when some condition is true. The difference is that branch contexts lie within a branch group context, and branch groups know that at most one branch of a given set of branches will be entered. A great example of this can be found in sem_if_stmt:

static void sem_if_stmt(ast_node *ast) {
  ...
  // Each branch gets its own flow context in `sem_cond_action` where its
  // condition is known to be true. We also create one more context for the
  // entire set of branches. In addition to grouping the branches together, this
  // outer context holds all of the negative improvements that result from the
  // knowledge that, if a given branch's statements are being evaluated, all
  // previous branches' conditions must have been false.
  FLOW_PUSH_CONTEXT_BRANCH_GROUP();

  // IF [cond_action]
  sem_cond_action(cond_action);
  ...
  if (elseif) {
    sem_elseif_list(elseif);
    ...
  }
  ...
  if (elsenode) {
    // ELSE [stmt_list]
    flow_set_context_branch_group_covers_all_cases(true);
    EXTRACT(stmt_list, elsenode->left);
    if (stmt_list) {
      FLOW_PUSH_CONTEXT_BRANCH();
      sem_stmt_list_in_current_flow_context(stmt_list);
      FLOW_POP_CONTEXT_BRANCH();
      ...
    } else {
      flow_context_branch_group_add_empty_branch();
    }
    record_ok(elsenode);
  }
  ...
  FLOW_POP_CONTEXT_BRANCH_GROUP();
  ...
}

It's instructive to look at sem_cond_action as well:

static void sem_cond_action(ast_node *ast) {
  ...
  // [expr] THEN stmt_list
  sem_expr(expr);
  ...
  if (stmt_list) {
    FLOW_PUSH_CONTEXT_BRANCH();
    // Add improvements for `stmt_list` where `expr` must be true.
    sem_set_improvements_for_true_condition(expr);
    sem_stmt_list_in_current_flow_context(stmt_list);
    FLOW_POP_CONTEXT_BRANCH();
    ...
  } else {
    flow_context_branch_group_add_empty_branch();
  }

  // If a later branch will be taken, `expr` must be false. Add its negative
  // improvements to the context created in `sem_if_stmt` so that all later
  // branches will be improved by the OR-linked spine of IS NULL checks in
  // `expr`.
  sem_set_improvements_for_false_condition(expr);
  ...
}

Putting all of this together, we can see that the basic steps for analyzing an IF statement are as follows:

  1. Push a new branch group context to hold all of the branch contexts that are to come.
  2. Analyze the condition in the IF condition THEN portion of the statement.
  3. Push a new branch context to hold the nullability improvements from the condition (e.g., in IF x IS NOT NULL THEN, we can improve x to have a nonnull type in the statement list after the THEN).
  4. Set the improvements.
  5. Analyze the statement list after the THEN.
  6. Pop the branch context.
  7. Set the negative improvements resulting from the knowledge that the condition must have been false if the previous branch wasn't entered (e.g., in IF y IS NULL THEN, we know that y must be nonnull from just after the end of the branch until the end of the current branch group).
  8. Repeat for the ELSE IF and ELSE branches (if any).
  9. Pop the branch group context.

What makes all of this work are the following:

  • When a branch context is popped, it resets all improvements such that they become exactly what they were before the branch was analyzed. This is done to reflect the fact that, because at most one branch will be entered, neither adding improvements (via flow_set_flag_for_type) nor removing existing improvements (via flow_unset_flag_for_type) in a branch should affect any of the other branches in the group.

  • When a branch group context is popped, it merges the effects of all of its branches. This is a key step that allows CQL to retain an improvement after a branch group is popped whenever the same improvement is made within every one of its branches and when the branches given cover all possible cases (which is indicated by the call to flow_set_context_branch_group_covers_all_cases in the code above).

The final type of context is called a jump context. Jump contexts are a maximally pessimistic form of context that assume every improvement that might be unset within them will be unset and that every improvement that might be set within them will not be set. Jump contexts are used to make semantic analysis safe in the possible presence of control flow statements like CONTINUE, LEAVE, and THROW, and so jump contexts are used for the analysis of statements like LOOP, WHILE, and TRY. Take the following line-numbered code as an example:

001  DECLARE x TEXT;
002  SET x := "foo";
003  WHILE some_condition
004  BEGIN
005    IF another_condition THEN
006      SET x := NULL;
007      IF yet_another_condition THEN
008        LEAVE;
009      END IF;
010      SET x := "bar";
011    ELSE
012      -- do nothing
013    END IF;
014  END;
015  CALL requires_text_notnull(x);

Here, even though the outer IF makes no net change to the nullability improvement to x from line 2 -- it unsets it on line 6, re-sets it on line 10, and the ELSE does nothing -- there is no guarantee that line 10 will ever be evaluated because we may jump straight from line 8 to line 15. As a result, it is necessary that x be un-improved after the WHILE loop; a normal context would not accomplish this, but a jump context does. See the comments within _flow_push_context_branch for additional discussion.

While jump contexts are necessary for the safety of improvements in the presence of loops, they are not sufficient: It's actually necessary to analyze loops twice. This is because execution of a loop might repeat, and so a statement that results in the unsetting of an improvement later in a loop must affect improvements earlier in that loop. For example:

DECLARE x INT;
SET x := 1;
WHILE some_condition
BEGIN
  -- okay on the first analysis pass, but not the second
  CALL requires_int_notnull(x);
  -- must negatively affect the call on the line above
  SET x := NULL;
END;

Semantic analysis keeps track of whether or not it is currently reanalyzing the statement list of a loop via the current_loop_analysis_state variable:

// The analysis of loops like LOOP and WHILE is done in two passes. First, we
// analyze the loop to conservatively figure out every improvement that the loop
// could possibly unset. After that, we reanalyze it with said improvements
// unset to ensure that everything is safe. See `sem_stmt_list_within_loop` for
// more information on why this is necessary.
typedef enum {
  LOOP_ANALYSIS_STATE_NONE,
  LOOP_ANALYSIS_STATE_ANALYZE,
  LOOP_ANALYSIS_STATE_REANALYZE
} loop_analysis_state;

...

// Keeps track of the current loop analysis state. If this is equal to
// `LOOP_ANALYSIS_STATE_ANALYZE`, we are analyzing with a non-final set of
// improvements. This is useful for two reasons:
//
// 1. Procedures that perform rewrites based on improvements (e.g.,
//    `sem_resolve_id_expr`) can use this to verify whether a rewrite is safe to
//    perform (`LOOP_ANALYSIS_STATE_NONE` or `LOOP_ANALYSIS_STATE_REANALYZE`) or
//    whether they should wait because they do not yet have definitive
//    information (`LOOP_ANALYSIS_STATE_ANALYZE`).
//
// 2. Analyses that would otherwise fail if called during reanalysis (e.g.,
//    `sem_verify_legal_variable_name`) can use this to check whether the
//    current state is `LOOP_ANALYSIS_STATE_REANALYZE` and adjust their
//    behaviors accordingly.
static loop_analysis_state current_loop_analysis_state = LOOP_ANALYSIS_STATE_NONE;

As indicated in the first comment above, the comments within sem_stmt_list_within_loop go into further detail.

At this point, we've only scratched the surface of control flow analysis in CQL. Fortunately, the files "flow.h" and "flow.c" are heavily commented and can be studied to deepen one's understanding.

Nullability Improvements​

Via a form of occurrence typing (also known as flow typing), CQL has the ability to determine that, due to a prior conditional check, a nullable variable or cursor field cannot be null within a particular context, and CQL will improve its type in that context.

Unlike most forms of semantic analysis performed by CQL, the analysis for nullability improvements, as is the case for all types of improvements, makes heavy use of the find_mutable_type function:

// Returns the *mutable* type (`sem_t *`) for a given (potentially qualified)
// identifier if one exists in the environment. See the documentation for
// `sem_resolve_id_with_type` for limitations.
static sem_t *find_mutable_type(CSTR name, CSTR scope);

This function allows us to look up the type of the original binding referred to by a particular name/scope pair. In essence, it provides access to the current type environment for whichever part of the program we are analyzing. It also allows us to mutate that environment by virtue of the fact that it returns a pointer to the type of the binding, not merely the type itself.

By using find_mutable_type to get a type pointer and toggling the SEM_TYPE_INFERRED_NOTNULL flag via flow_set_flag_for_type and flow_unset_flag_for_type, the procedures sem_set_notnull_improved and sem_unset_notnull_improved are able to record that a nullable identifier or cursor field is either temporarily nonnull or no longer nonnull respectively:

// Enables a nonnull improvement, if possible.
static void sem_set_notnull_improved(CSTR name, CSTR scope);

// This needs to be called for everything that is no longer safe to consider NOT
// NULL due to a mutation. It is fine to call this for something not currently
// subject to improvement, but it must only be called with a name/scope pair
// referring to something that has a mutable type (e.g., it must not be an unbound
// variable, a cursor used as an expression, an enum case, et cetera).
static void sem_unset_notnull_improved(CSTR name, CSTR scope);

Similarly, sem_is_notnull_improved uses find_mutable_type to check whether or not something is currently improved:

// Returns true if currently improved to be nonnull, else false.
static bool_t sem_is_notnull_improved(CSTR name, CSTR scope);

Why does nullability inference use this approach? The alternative would be to maintain some sort of set of currently improved identifiers and cursor fields and to check it whenever resolving an identifier or cursor field. The problem is that merely knowing that some identifier "x" is improved would not be sufficient: We'd have to know which "x". Is it the local variable "x"? Is it the column "x" of the table from which we're currently selecting? In essence, correctly maintaining an independent set of all currently active improvements would involve re-implementing all of the scoping rules of the language. By using find_mutable_type, we can simply piggyback on the existing name resolution logic and avoid all of these issues.
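
The pointer-into-the-environment idea is easy to see in miniature. In the standalone sketch below, the flag name mimics the real one, but the type alias, the bit position, and the lookup stub are all invented for illustration:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t sem_t;                         // a bitfield, as in the real code
#define SEM_TYPE_INFERRED_NOTNULL (1ULL << 10)  // placeholder bit position

static sem_t x_type = 0;  // stands in for the binding of some nullable `x`

// stands in for find_mutable_type: returns a pointer into the type environment
static sem_t *find_mutable_type_stub(const char *name, const char *scope) {
  (void)name; (void)scope;
  return &x_type;
}

int main(void) {
  sem_t *type = find_mutable_type_stub("x", NULL);

  *type |= SEM_TYPE_INFERRED_NOTNULL;   // set: `x` is improved to nonnull
  printf("improved: %d\n", !!(*type & SEM_TYPE_INFERRED_NOTNULL));  // 1

  *type &= ~SEM_TYPE_INFERRED_NOTNULL;  // unset: context popped or `x` mutated
  printf("improved: %d\n", !!(*type & SEM_TYPE_INFERRED_NOTNULL));  // 0
  return 0;
}

Because the improvement is recorded directly in the binding's type bits, any later name resolution that reaches the same binding sees the improved type with no extra bookkeeping.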

A nullability improvement is always created within a particular flow context. When an improvement is added via sem_set_notnull_improved, a record of that improvement is recorded in the current context. When that context ends, that same record is used to remove the improvement. It is also the case that sem_unset_notnull_improved may be used to remove an improvement before a context has ended due to a SET, FETCH, or call to a procedure or function with an OUT argument resulting in the improvement no longer being safe.

Improvements can be introduced into the current context via sem_set_notnull_improved directly (when a variable is SET to a value of a nonnull type), but more commonly they are introduced via one of the following two functions:

// Given a conditional expression `ast` possibly containing AND-linked
// subexpressions, set all of the applicable nullability and has-row
// improvements within the current flow context. Generally speaking, calls to
// this function should be bounded by a new flow context corresponding to the
// portion of the program for which the condition `ast` must be true.
static void sem_set_improvements_for_true_condition(ast_node *expr);

// Improvements for known-false conditions are dual to improvements for
// known-true conditions.
//
// For nullability, known-false conditions improve ids and dots verified to be
// NULL via `IS NULL` along the outermost spine of `OR` expressions, whereas
// known-true conditions improve ids and dots verified to be nonnull via `IS NOT
// NULL` along the outermost spine of `AND` expressions. For example, the
// following two statements introduce the same improvements:
//
//   IF a IS NOT NULL AND b IS NOT NULL THEN
//     -- `a` and `b` are improved here because we know the condition is true
//   END IF;
//
//   IF a IS NULL OR b IS NULL RETURN;
//   -- `a` and `b` are improved here because we know the condition is false
//   -- since we must not have returned if we got this far
//
// ...
static void sem_set_improvements_for_false_condition(ast_node *ast);

These functions introduce improvements by gathering up all of the IS NOT NULL checks (in the true case) or IS NULL checks (in the false case) and introducing improvements appropriately. The true version is used when we enter a context that will only be evaluated at runtime when some particular condition is true; the false version, conversely, is used when we enter a context that will only be evaluated at runtime when some particular condition is false:

IF some_condition THEN
  -- "true" improvements from `some_condition` are in
  -- effect here
ELSE IF another_condition THEN
  -- "false" improvements from `some_condition` and "true"
  -- improvements from `another_condition` are in effect
  -- here
ELSE
  -- "false" improvements from both `some_condition` and
  -- `another_condition` are in effect here
END IF;

Global variables in CQL require special treatment when it comes to nullability improvements. This is because any procedure call could potentially mutate any number of global variables, and so all currently improved globals must be un-improved at every such call. The following list keeps track of which global variables are currently improved:

typedef struct global_notnull_improvement_item {
  sem_t *type;
  struct global_notnull_improvement_item *next;
} global_notnull_improvement_item;

...

// Keeps track of all global variables that may currently be improved to be NOT
// NULL. We need this because we must un-improve all such variables after every
// procedure call (because we don't do interprocedural analysis and cannot know
// which globals may have been set to NULL).
static global_notnull_improvement_item *global_notnull_improvements;

The fact that we don't do interprocedural analysis (as the comment above indicates) is not a deficiency. Programmers should be able to reason locally about nullability improvements, and an analysis that depended upon the details of how other procedures were implemented would make that impossible.

So far, we have talked a lot about how improvements are set and unset, but we haven't talked about how the improvement actually happens in terms of code generation. Since CQL represents values of nullable and nonnull types differently (at least in the case of non-reference types), we cannot simply treat a value of a nullable type as though it were of a nonnull type: We need to actually change its representation.

The way this works is that, whenever we resolve a name/scope pair via sem_resolve_id_expr, we check whether the pair is currently improved via sem_is_notnull_improved. If it is, we call rewrite_nullable_to_unsafe_notnull to wrap the id or dot we're resolving in a call to the function cql_inferred_notnull (for which we generate code in cg_func_cql_inferred_notnull); for example, an improved x within an expression becomes cql_inferred_notnull(x):

// Wraps an id or dot in a call to cql_inferred_notnull.
cql_noexport void rewrite_nullable_to_unsafe_notnull(ast_node *_Nonnull ast);

// The `cql_inferred_notnull` function is not used by the programmer directly,
// but rather inserted via a rewrite during semantic analysis to coerce a value
// of a nullable type to be nonnull. The reason for this approach, as opposed to
// just changing the type directly, is that there are also representational
// differences between values of nullable and nonnull types; some conversion is
// required.
static void cg_func_cql_inferred_notnull(ast_node *call_ast, charbuf *is_null, charbuf *value);

As the comment for cg_func_cql_inferred_notnull indicates, programmers do not use cql_inferred_notnull directly: It is only inserted as a product of the above-mentioned rewrite. In fact, we explicitly disallow its use by programmers in the parser:

// We insert calls to `cql_inferred_notnull` as part of a rewrite so we expect
// to see it during semantic analysis, but it cannot be allowed to appear in a
// program. It would be unsafe if it could: It coerces a value from a nullable
// type to a nonnull type without any runtime check.
#define YY_ERROR_ON_CQL_INFERRED_NOTNULL(x) \
  EXTRACT_STRING(proc_name, x); \
  if (!strcmp(proc_name, "cql_inferred_notnull")) { \
    yyerror("Call to internal function is not allowed 'cql_inferred_notnull'"); \
  }

One subtle aspect of the rewrite is that the rewrite itself performs analysis to validate the product of the rewrite (as do many other rewrites). To avoid going into a loop -- rewriting, analyzing the result (which ultimately happens in sem_special_func_cql_inferred_notnull), rewriting again because the result contains a name that is improved, et cetera -- we keep track of whether or not we're currently analyzing a subexpression under a call to cql_inferred_notnull and avoid re-rewriting appropriately:

// This is true if we are analyzing a call to `cql_inferred_notnull`. This can
// happen for three reasons:
//
// * We just did a rewrite that produced a `cql_inferred_notnull` call and now
//   we're computing its type.
// * We're analyzing an expression that was already analyzed (e.g., in a CTE).
// * We're analyzing the output of a previous CQL run within which calls to
//   `cql_inferred_notnull` may occur.
//
// Regardless of the cause, if `is_analyzing_notnull_rewrite` is true, we do not
// want to rewrite again.
static bool_t is_analyzing_notnull_rewrite;

static void sem_special_func_cql_inferred_notnull(ast_node *ast, uint32_t arg_count, bool_t *is_aggregate) {
  ...
  // Since we're checking a call to `cql_inferred_notnull`, its arguments have
  // already been rewritten and we don't want to do it again. Setting
  // `is_analyzing_notnull_rewrite` prevents that.
  is_analyzing_notnull_rewrite = true;
  sem_arg_list(arg_list, IS_NOT_COUNT);
  is_analyzing_notnull_rewrite = false;
  ...
}

// Like `sem_resolve_id`, but specific to expression contexts (where nullability
// improvements are applicable).
static void sem_resolve_id_expr(ast_node *ast, CSTR name, CSTR scope) {
  ...
  if (is_analyzing_notnull_rewrite) {
    // If we're analyzing the product of a rewrite and we're already inside of a
    // call to `cql_inferred_notnull`, do not expand again; we'd loop forever.
    return;
  }
  ...
}

At this point, you should have a decent understanding of how nullability improvements function, both in terms of semantic analysis and in terms of code generation. The implementation is heavily commented, so reading the code and searching for calls to the core functions listed below should be sufficient to fill in any gaps:

bool_t sem_is_notnull_improved(CSTR name, CSTR scope);
void sem_set_notnull_improved(CSTR name, CSTR scope);
void sem_unset_notnull_improved(CSTR name, CSTR scope);
void sem_unset_global_notnull_improvements();
void sem_set_improvements_for_true_condition(ast_node *ast);
void sem_set_improvements_for_false_condition(ast_node *ast);
void sem_special_func_cql_inferred_notnull(ast_node *ast, uint32_t arg_count, bool_t *is_aggregate);
void rewrite_nullable_to_unsafe_notnull(ast_node *_Nonnull ast);

Initialization Improvements​

Compared to nullability improvements, initialization improvements are relatively simple.

The idea behind initialization improvements is that, if one declares a variable of a reference type (BLOB, OBJECT, or TEXT) that is also NOT NULL, it is not safe to use the variable until it has been given a value. For example:

DECLARE x TEXT NOT NULL;

IF some_condition THEN
  SET x := some_text_notnull_value;
  -- `x` is safe to use here
ELSE
  -- `x` is NOT safe to use here (it might be uninitialized)
END IF;

-- `x` is NOT safe to use here either (it might be uninitialized)

As with nullability improvements, initialization improvements rely heavily on flow contexts. The function sem_set_initialization_improved, similarly to sem_set_notnull_improved for nullability, is used to enable an initialization improvement. (There is nothing analogous to sem_unset_notnull_improved for initialization because nothing can ever be uninitialized once it has been given a value.)

Unlike nullability improvements, initialization improvements use two flags: SEM_TYPE_INIT_REQUIRED and SEM_TYPE_INIT_COMPLETE. Rather than assuming everything is uninitialized by default and requiring the presence of some SEM_TYPE_INITIALIZED flag before anything can be used, we explicitly tag things that are not initialized but need to be with SEM_TYPE_INIT_REQUIRED, and later tag them with SEM_TYPE_INIT_COMPLETE once they've been initialized. Doing it this way has two benefits:

  1. It reduces the amount of noise in the AST output significantly: Code like LET x := 10; can remain {let_stmt}: x: integer notnull variable in the AST without the extra noise of some initialized flag.

  2. More importantly, it means we only have to deal with initialization in a tiny portion of "sem.c". For example, we must handle it in sem_declare_vars_type to add the SEM_TYPE_INIT_REQUIRED flag and in sem_assign to add SEM_TYPE_INIT_COMPLETE, but sem_let_stmt can remain blissfully ignorant of initialization altogether.
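
Here is a standalone sketch of the two-flag scheme; the flag names mirror the real ones, but the bit positions and the is_safe_to_use helper are invented for illustration:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t sem_t;
#define SEM_TYPE_INIT_REQUIRED (1ULL << 20)  // placeholder bit positions
#define SEM_TYPE_INIT_COMPLETE (1ULL << 21)

static bool is_safe_to_use(sem_t type) {
  // only things tagged "init required" ever need the "init complete" tag
  return !(type & SEM_TYPE_INIT_REQUIRED) || (type & SEM_TYPE_INIT_COMPLETE);
}

int main(void) {
  sem_t x = SEM_TYPE_INIT_REQUIRED;   // DECLARE x TEXT NOT NULL;
  printf("%d\n", is_safe_to_use(x));  // 0: not yet initialized

  x |= SEM_TYPE_INIT_COMPLETE;        // SET x := some_text_notnull_value;
  printf("%d\n", is_safe_to_use(x));  // 1: safe now

  sem_t y = 0;                        // LET y := 10; -- no flags needed at all
  printf("%d\n", is_safe_to_use(y));  // 1: never required initialization
  return 0;
}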

There are only three places in which a variable may be initialized: sem_assign (as mentioned), sem_fetch_stmt (for the FETCH...INTO form), and sem_arg_for_out_param (as passing a variable to a procedure requiring an OUT argument of a NOT NULL type can initialize it).

Regarding sem_arg_for_out_param, we can only set initialization improvements when a variable is passed as an OUT argument because we require that all procedures initialize all of their OUT parameters of a nonnull reference type. This is handled in two places:

  1. In sem_param, we set the SEM_TYPE_INIT_REQUIRED flag when param_should_require_initialization is true.

  2. In sem_validate_current_proc_params_are_initialized, which is called both after analyzing the statement list of a procedure and for each return statement within the procedure, we ensure that SEM_TYPE_INIT_COMPLETE is present on all parameters that have SEM_TYPE_INIT_REQUIRED.

There is only one wrinkle in all of this: the cql:try_is_proc_body attribute. If cql:try_is_proc_body is present on a TRY statement, we call sem_validate_current_proc_params_are_initialized at the end of the TRY and not at the end of the procedure. The rationale for this is explained thoroughly in the comments for sem_find_ast_misc_attr_trycatch_is_proc_body_callback.

That's all there is to it: "flow.c" does most of the hard work for us.

Structure types and the notion of Shapes​

Earlier we discussed SEM_TYPE_STRUCT briefly. Recall the basic notion of the structure type:

// for tables and views and the result of a select

typedef struct sem_struct {
  CSTR struct_name;   // struct name
  uint32_t count;     // count of fields
  CSTR *names;        // field names
  CSTR *kinds;        // the "kind" text of each column, if any; e.g. in integer<foo>, foo is the kind
  sem_t *semtypes;    // typecode for each field
} sem_struct;

The structure is nothing more than an array of names, types and kinds with a count. But it creates the notion of what's usually called a "shape" in the codebase. Shapes can be used in a variety of ways as is described in Chapter 5 of the CQL Guide. But before we get into shapes, let's look at an example of how a structure type is created.

The code that follows is the tail end of sem_create_table_stmt. At this point the bulk of the analysis is done and the columns all have their types. We're about to build the struct type for the table. Let's see how that goes.

  // now create a struct type with the correct number of columns
  // the types have already been computed so all we have to do is
  // check for duplicates
  sem_struct *sptr = new_sem_struct(name, cols);

  symtab *columns = symtab_new();

  int32_t col = 0;
  for (ast_node *item = col_key_list; item; item = item->right) {
    Contract(is_ast_col_key_list(item));
    EXTRACT_ANY_NOTNULL(def, item->left);

    if (is_ast_col_def(def)) {
      Invariant(def->sem->name);
      Invariant(col <= cols);  // it's possible that the rest are deleted and we're at the end.

      // columns must be unique, including deleted columns
      if (!symtab_add(columns, def->sem->name, NULL)) {
        EXTRACT_NOTNULL(col_def_type_attrs, def->left);
        EXTRACT_NOTNULL(col_def_name_type, col_def_type_attrs->left);
        EXTRACT_ANY_NOTNULL(col_def_ast, col_def_name_type->left);

        report_error(col_def_ast, "CQL0142: duplicate column name", def->sem->name);
        record_error(ast);
        symtab_delete(columns);
        goto cleanup;
      }

      if (is_deleted(def->sem->sem_type)) {
        continue;
      }

      Invariant(col < cols);

      sptr->names[col] = def->sem->name;
      sptr->semtypes[col] = def->sem->sem_type;
      sptr->kinds[col] = def->sem->kind;
      col++;
    }
  }

  symtab_delete(columns);

  Invariant(col == cols);

  ast->sem = new_sem(SEM_TYPE_STRUCT);
  ast->sem->sptr = sptr;
  ast->sem->jptr = sem_join_from_sem_struct(sptr);
  ast->sem->region = current_region;

  • new_sem_struct : makes a struct to hold the result, we already have the count of columns and the table name
  • symtab_new : is going to give us a scratch symbol table so we can check for duplicate column names
  • we walk all the items in the table and use is_ast_col_def(def) to find the column definitions
  • Invariant(def->sem->name) : claims that we must have already computed the semantic info for the column and it has its name populated
    • this was done earlier
  • symtab_add(columns, def->sem->name, NULL) : adds a nil entry under the column name -- if this fails we have a duplicate column,
    • in which case we report errors and stop
  • is_deleted : tells us if the column was marked with @delete in which case it no longer counts as part of the table
  • if all this is good we set the names, kinds, and semtypes from the column definition's semantic info
  • symtab_delete : cleans up the temporary symbol table
  • new_sem : creates a sem_node of type SEM_TYPE_STRUCT which is filled in
    • sem_join_from_sem_struct will be discussed shortly, but it creates a jptr with one table in it

Structure types often come from the shape of a table, but other things can create a structure type. For instance, the columns of a view, or any select statement, are also described by a structure type and are therefore valid "shapes". The return type of a procedure usually comes from a SELECT statement, so the procedure too can be the source of a shape. The arguments of a procedure form a shape. The fields of a cursor form a shape. You can even have a named subset of the arguments of a procedure and use them like a shape. All of these things are described by structure types.

Shapes and the LIKE construct​

There are many cases where you want to be able to capture or re-use something with a known shape and you don't want to have to fully re-declare the thing. CQL uses the LIKE construct to do these sorts of things. This is more fully explained in Chapter 5 of the Guide, but for now let's look at two different cases that are of interest.

First, a cursor:

DECLARE C CURSOR LIKE Foo;  -- Foo is something with a shape

So, in the above, Foo could be a table, a view, a procedure with a result, another cursor, and so forth.

How might we do this? This is the business of sem_declare_cursor_like_name:

// Here we're going to make a new value cursor using the indicated name for the shape.
// The name has to be "likeable" meaning it refers to some named thing with a shape
// such as a table, a view, another cursor, or a procedure that returns a result set.
// These are the so called "value cursors" in that they have no underlying statement
// that they move through. You can just load them up with a row and pass them around.
static void sem_declare_cursor_like_name(ast_node *ast) {
  Contract(is_ast_declare_cursor_like_name(ast));
  EXTRACT_ANY_NOTNULL(new_cursor_ast, ast->left);
  EXTRACT_STRING(new_cursor_name, new_cursor_ast);
  EXTRACT_ANY_NOTNULL(shape_def, ast->right);

  // no duplicates allowed
  if (!sem_verify_legal_variable_name(ast, new_cursor_name)) {
    record_error(new_cursor_ast);
    record_error(ast);
    return;
  }

  // must be a valid shape
  ast_node *found_shape = sem_find_shape_def(shape_def, LIKEABLE_FOR_VALUES);
  if (!found_shape) {
    record_error(ast);
    return;
  }

  // good to go, make our cursor, with storage.
  shape_def->sem = found_shape->sem;
  new_cursor_ast->sem = new_sem(SEM_TYPE_STRUCT | SEM_TYPE_VARIABLE | SEM_TYPE_VALUE_CURSOR | SEM_TYPE_HAS_SHAPE_STORAGE);
  new_cursor_ast->sem->sptr = found_shape->sem->sptr;
  new_cursor_ast->sem->name = new_cursor_name;
  ast->sem = new_cursor_ast->sem;

  symtab_add(current_variables, new_cursor_name, new_cursor_ast);
}

  • EXTRACT : gets the pieces we need from the AST
  • sem_verify_legal_variable_name : makes sure the cursor name is unique and doesn't hide a table name
  • sem_find_shape_def : searches for something with a suitable name that has a shape
  • we populate the name node with the semantic type that we found
  • new_sem : makes a new sem_node for the cursor variable with SEM_TYPE_STRUCT
    • set the sptr field using the discovered shape

Note: shape_def->sem isn't actually used for anything, but it is helpful for debugging. If the AST is printed, it shows the original unmodified semantic type, which can be helpful.

Briefly, sem_find_shape_def does these steps:

  • if the right of the LIKE refers to procedure arguments (e.g. C LIKE Foo ARGUMENTS), get the args of the named procedure and use them as a shape
  • if the right is a local or global, and it's a cursor, use the shape of that cursor for the new cursor
  • if the right is the name of an argument bundle, use the shape of the bundle
    • e.g. in CREATE PROC Foo(p1 like Person, p2 like Person) p1 and p2 are the names of argument bundles shaped like Person
  • if the right is the name of a table or view, use that shape
  • if the right is the name of a procedure with a structure result, use that shape
  • if it's none of these, produce an error

This is the primary source of shape reuse. Let's look at how we might use that.

Suppose we want to write a procedure that inserts a row into the table Foo. We could certainly list the columns of Foo as arguments like this:

CREATE PROC InsertIntoFoo(id integer, t text, r real, b blob)
BEGIN
  INSERT INTO Foo(id, t, r, b) VALUES(id, t, r, b);
END;

But that approach gets a lot less exciting when there are lots of columns, and it becomes an increasing maintenance headache.

Compare that with the following:

CREATE PROC InsertIntoFoo(row LIKE Foo)
BEGIN
  INSERT INTO Foo FROM row;
END;

Those two versions of InsertIntoFoo compile into the same code. The semantic analyzer expands (row LIKE Foo) into (row_id integer, row_t text, row_r real, row_b blob) and then replaces FROM row with (row_id, row_t, row_r, row_b). In both cases it simply looks up the shape using sem_find_shape_def and then alters the AST to the canonical pattern. This kind of "shape sugar" is all over CQL and greatly increases maintainability while eliminating common errors. The most common operation is simply to expand a "shape" into a list of arguments or columns (with or without type names). SQLite doesn't know any of this shape magic, so by the time SQLite sees the code it has to look "normal" -- the shapes are all resolved.

Join Types​

The last of the type-building data structures is the join type. Recall that we have this shape:

// for the data type of (parts of) the FROM clause
// sometimes I refer to it as a "joinscope"

typedef struct sem_join {
  uint32_t count;              // count of table/views in the join
  CSTR *names;                 // names of the table/view
  struct sem_struct **tables;  // struct type of each table/view
} sem_join;

This is an array of named structure types, which is exactly what you get when you do something like this:

select * from T1 INNER JOIN T2;

The result has all of the columns of T1 and all of the columns of T2. They can be referred to with scoped names like T1.x which means "find the sptr corresponding to the name T1 then within that structure find the column named x". In general, when we join, we take a jptr on the left and concatenate it with a jptr on the right. For all this to work we have to start somewhere, usually single tables.

As we saw, when we make a table we use sem_join_from_sem_struct to make its initial jptr. Let's have a look at that now:

// Create a base join type from a single struct.
static sem_join *sem_join_from_sem_struct(sem_struct *sptr) {
  sem_join *jptr = new_sem_join(1);
  jptr->names[0] = sptr->struct_name;
  jptr->tables[0] = new_sem_struct_strip_table_flags(sptr);

  return jptr;
}

It doesn't get much simpler than the above; briefly, the steps are:

  • new_sem_join : gives us an empty sem_join with room for 1 table
  • we use the struct name for the name and the table's sptr for the shape
  • new_sem_struct_strip_table_flags : copies the table's sptr keeping only the essential flags
    • SEM_TYPE_HIDDEN_COL
    • SEM_TYPE_NOTNULL
    • SEM_TYPE_SENSITIVE

The other flags (e.g. SEM_TYPE_PK) have no value in doing type checking and were only needed to help validate the table itself. Those extra flags would be harmless, but they would also contaminate all of the debug output, so they are stripped. As a result, the type of columns as they appear in, say, SELECT statements is simpler than how they appear in a CREATE TABLE statement.

When we need to create a new join type, we "simply" make a new sem_join that is the concatenation of the left and right sides of the join operation -- with a few caveats (see the sketch after this list):

  • some join types change the nullability of columns like LEFT JOIN, so we have to handle that too
  • the names of the tables in the new concatenated joinscope have to be unambiguous so there is also error checking to do
  • but basically it's just a concatenation
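
Here is that sketch: a standalone model of the concatenation step only. The sem_join_concat helper is hypothetical, sem_struct is reduced to just a name, and none of the nullability or ambiguity handling from the list above is shown:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef const char *CSTR;

typedef struct sem_struct { CSTR struct_name; } sem_struct;  // reduced for the sketch

typedef struct sem_join {
  uint32_t count;              // count of tables/views in the join
  CSTR *names;                 // names of the tables/views
  struct sem_struct **tables;  // struct type of each table/view
} sem_join;

static sem_join *new_sem_join(uint32_t count) {
  sem_join *jptr = calloc(1, sizeof(sem_join));
  jptr->count = count;
  jptr->names = calloc(count, sizeof(CSTR));
  jptr->tables = calloc(count, sizeof(sem_struct *));
  return jptr;
}

// hypothetical helper: the combined joinscope is left's entries then right's
static sem_join *sem_join_concat(sem_join *left, sem_join *right) {
  sem_join *jptr = new_sem_join(left->count + right->count);
  memcpy(jptr->names, left->names, left->count * sizeof(CSTR));
  memcpy(jptr->names + left->count, right->names, right->count * sizeof(CSTR));
  memcpy(jptr->tables, left->tables, left->count * sizeof(sem_struct *));
  memcpy(jptr->tables + left->count, right->tables, right->count * sizeof(sem_struct *));
  return jptr;
}

int main(void) {
  sem_struct t1 = { "T1" }, t2 = { "T2" };
  sem_join *left = new_sem_join(1);
  left->names[0] = t1.struct_name; left->tables[0] = &t1;
  sem_join *right = new_sem_join(1);
  right->names[0] = t2.struct_name; right->tables[0] = &t2;

  sem_join *joined = sem_join_concat(left, right);
  for (uint32_t i = 0; i < joined->count; i++) {
    printf("%s\n", joined->names[i]);  // T1 then T2
  }
  return 0;
}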

Importantly, we call the thing a "joinscope" because it creates a namespace. When we are evaluating names inside of the FROM clause or even later in, say, a WHERE clause, the joinscope that we have created so far controls the table.column combinations that you can use in expressions. This changes again when there is a subquery, so the joinscopes can be pushed and popped as needed.

By way of example, you'll see these two patterns in the code:

  PUSH_JOIN(from_scope, select_from_etc->sem->jptr);
  error = sem_select_orderby(select_orderby);
  POP_JOIN();

  • PUSH_JOIN : use the jptr from the FROM clause to put things back in scope for the ORDER BY clause

  PUSH_JOIN_BLOCK();
  sem_numeric_expr(ast->left, ast, "LIMIT", SEM_EXPR_CONTEXT_LIMIT);
  POP_JOIN();

  • PUSH_JOIN_BLOCK : causes the name search to stop -- nothing deeper in the stack is searched (see the sketch below)
  • in this case we do not allow LIMIT expressions to see any joinscopes, as they may not use any columns
    • even if the LIMIT clause is appearing in a subquery it can't refer to columns in the parent query
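
Here is the sketch: a standalone model of a joinscope stack in which "block" entries stop the search, mimicking the PUSH_JOIN / PUSH_JOIN_BLOCK behavior described above. Every name and type below is invented for illustration:

#include <stdio.h>
#include <string.h>

#define MAX_SCOPES 8

typedef struct scope {
  const char *table_name;  // NULL marks a block: the search stops here
} scope;

static scope scope_stack[MAX_SCOPES];
static int scope_depth = 0;

static void push_join(const char *table_name) { scope_stack[scope_depth++].table_name = table_name; }
static void push_join_block(void) { scope_stack[scope_depth++].table_name = NULL; }
static void pop_join(void) { scope_depth--; }

// search from innermost scope outward, stopping at a block
static const char *find_table(const char *name) {
  for (int i = scope_depth - 1; i >= 0; i--) {
    if (!scope_stack[i].table_name) return NULL;  // block: nothing deeper is searched
    if (!strcmp(scope_stack[i].table_name, name)) return scope_stack[i].table_name;
  }
  return NULL;
}

int main(void) {
  push_join("T1");
  printf("%s\n", find_table("T1") ? "found" : "not found");  // found

  push_join_block();  // e.g. entering a LIMIT expression
  printf("%s\n", find_table("T1") ? "found" : "not found");  // not found

  pop_join();
  pop_join();
  return 0;
}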

Schema Regions​

We touched briefly on schema regions earlier in this section. The purpose and language for regions is described more fully in Chapter 10 of the Guide. In this section we'll deal with how they are implemented and what you should expect to see in the code.

When a region declaration is found this method is used:

// A schema region is a partitioning of the schema such that it
// only uses objects in the same partition or one of its declared
// dependencies. One schema region may be upgraded independently
// from any others (assuming they happen such that dependents are done first).
// Here we validate:
// * the region name is unique
// * the dependencies (if any) are unique and exist
static void sem_declare_schema_region_stmt(ast_node *ast) { ... }

The general rules are described in the comment, but effectively it accumulates the list of the declared region's dependencies. Sometimes these are called the antecedent regions. Since a region can only depend on regions that have already been declared, it's not possible to make any cycles. Regions are declared before you put anything into them.

Pieces of schema or procedures (or anything really) can go into a region by putting that code inside a begin/end pair for the named region. Like so:

@begin_schema_region your_region;
-- your stuff
@end_schema_region;

Now whatever happens to be in "your stuff" is:

  • limited to seeing only the things that your_region is allowed to see, and
  • contributes its contents to your_region thereby limiting how others will be able to use "your stuff"

To see how this happens, let's have a look at sem_begin_schema_region_stmt.

// Entering a schema region makes all the objects that follow part of that
// region. It also means that all the contained objects must refer to
// only pieces of schema that are in the same region or a dependent region.
// Here we validate that region we are entering is in fact a valid region
// and that there isn't already a schema region.
static void sem_begin_schema_region_stmt(ast_node *ast) {
  Contract(is_ast_begin_schema_region_stmt(ast));
  EXTRACT_STRING(name, ast->left);

  // @BEGIN_SCHEMA_REGION name

  if (!verify_schema_region_out_of_proc(ast)) {
    record_error(ast);
    return;
  }

  if (current_region) {
    report_error(ast, "CQL0246: schema regions do not nest; end the current region before starting a new one", NULL);
    record_error(ast);
    return;
  }

  ast_node *region = find_region(name);
  if (!region) {
    report_error(ast->left, "CQL0244: unknown schema region", name);
    record_error(ast);
    return;
  }

  // Get the canonical name of the region (case adjusted)
  Contract(is_region(region));
  EXTRACT_STRING(region_name, region->left);

  // we already know we are not in a region
  Invariant(!current_region_image);
  current_region_image = symtab_new();
  sem_accumulate_public_region_image(current_region_image, region_name);

  // this is the one and only text pointer value for this region
  current_region = region_name;
  record_ok(ast);
}

We see these basic steps:

  • EXTRACT : gets the region name
  • verify_schema_region_out_of_proc : makes sure we are out of any procedure (we have to be at the top level)
    • errors if in a procedure
  • current_region : is tested to make sure we are not already in a region (no nesting)
    • errors if already in a region
  • find_region : is used to find the region AST by name
    • errors if the region name isn't valid
  • EXTRACT : is used again to get the canonical name of the region
    • you could write @begin_schema_region YoUr_ReGION; but we want the canonical name your_region, as it was declared
  • symtab_new : creates a new symbol table current_region_image
  • sem_accumulate_public_region_image : populates current_region_image by recursively walking this region adding the names of all the regions we find along the way
    • note the regions form a DAG so we might find the same name twice; we can stop if we find a region that is already in the image symbol table
  • current_region : set it to the new current region

Now we're all set up.

  • We can use current_region to set the region in the sem_node of anything we encounter
  • We can use current_region_image to quickly see if we are allowed to use any given region
    • if it's in the symbol table we can use it
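
The accumulation step itself is just a DAG walk that stops early on names it has already seen. Here is a standalone sketch of that idea; the real code stores the names in a symtab, and everything below is invented for illustration:

#include <stdio.h>
#include <string.h>

#define MAX_DEPS 4
#define MAX_NAMES 16

typedef struct region {
  const char *name;
  const struct region *deps[MAX_DEPS];  // antecedent regions (a DAG, no cycles)
} region;

static const char *image[MAX_NAMES];  // stands in for the symbol table
static int image_count = 0;

static int image_contains(const char *name) {
  for (int i = 0; i < image_count; i++) {
    if (!strcmp(image[i], name)) return 1;
  }
  return 0;
}

static void accumulate_region_image(const region *r) {
  if (image_contains(r->name)) return;  // already visited via another path
  image[image_count++] = r->name;
  for (int i = 0; i < MAX_DEPS && r->deps[i]; i++) {
    accumulate_region_image(r->deps[i]);
  }
}

int main(void) {
  region base = { "base", { 0 } };
  region feature1 = { "feature1", { &base } };
  region feature2 = { "feature2", { &base } };
  region app = { "app", { &feature1, &feature2 } };  // diamond: base reached twice

  accumulate_region_image(&app);
  for (int i = 0; i < image_count; i++) {
    printf("%s\n", image[i]);  // app, feature1, base, feature2 -- base only once
  }
  return 0;
}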

Recall that at the end of sem_create_table_stmt we do this:

  ast->sem = new_sem(SEM_TYPE_STRUCT);
  ast->sem->sptr = sptr;
  ast->sem->jptr = sem_join_from_sem_struct(sptr);
  ast->sem->region = current_region;

That should make a lot more sense now.

When doing the symmetric check in sem_validate_object_ast_in_current_region we see this pattern:

// Validate whether or not an object is usable with a schema region. The object
// can only be a table, view, trigger or index.
static bool_t sem_validate_object_ast_in_current_region(
  CSTR name,
  ast_node *table_ast,
  ast_node *err_target,
  CSTR msg)
{
  // We're in a non-region therefore no validation needed because non-region stmt
  // can reference schema in any region.
  if (!current_region) {
    return true;
  }

  if (table_ast->sem && table_ast->sem->region) {
    // if we have a current region then the image is always computed!
    Invariant(current_region_image);
    if (!symtab_find(current_region_image, table_ast->sem->region)) {
      // The target region is not accessible from this region
      ...
      return false;
    }
  } else {
    // while in schema region '%s', accessing an object that isn't in a region is invalid
    ...
    return false;
  }

  return true;
}

I've elided some of the code here, but only the part that generates error messages. The essential logic is:

  • if we are not in a region we can access anything
  • if we're in a region then...
    • the thing we're trying to access must also be in a region, and
    • that region must be in current_region_image
    • otherwise, we can't access it

This is enough to do all the region validation we need.

Results of Semantic Analysis​

Semantic Analysis leaves a lot of global state ready for the remaining stages to harvest. If the state is defined in sem.h then it's ok to harvest. Here we'll highlight some of the most important things you can use in later passes. These are heavily used in the code generators.

cql_data_decl( struct list_item *all_tables_list );
cql_data_decl( struct list_item *all_functions_list );
cql_data_decl( struct list_item *all_views_list );
cql_data_decl( struct list_item *all_indices_list );
cql_data_decl( struct list_item *all_triggers_list );
cql_data_decl( struct list_item *all_regions_list );
cql_data_decl( struct list_item *all_ad_hoc_list );
cql_data_decl( struct list_item *all_select_functions_list );
cql_data_decl( struct list_item *all_enums_list );

These linked lists are authoritative; they let you easily enumerate all the objects of the specified type. For instance, if you wanted to do some validation of all indices, you could simply walk all_indices_list, as in the sketch below.
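
A minimal sketch, assuming the list_item node carries an ast pointer and a next link:

// a later pass might visit every index like this
for (list_item *item = all_indices_list; item; item = item->next) {
  ast_node *index_ast = item->ast;
  // ... validate this index, or emit code for it ...
}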

cql_noexport ast_node *find_proc(CSTR name);
cql_noexport ast_node *find_region(CSTR name);
cql_noexport ast_node *find_func(CSTR name);
cql_noexport ast_node *find_table_or_view_even_deleted(CSTR name);
cql_noexport ast_node *find_enum(CSTR name);

These functions give you access to the core name tables (which are still valid!) so that you can look up procedures, functions, tables, etc. by name.

Finally, information about all the schema annotations is invaluable for building schema upgraders. These two buffers hold dense arrays of annotation records as shown below.

cql_data_decl( bytebuf *schema_annotations );
cql_data_decl( bytebuf *recreate_annotations );

typedef struct recreate_annotation {
  CSTR target_name;          // the name of the target
  CSTR group_name;           // group name or "" if no group (not null, safe to sort)
  ast_node *target_ast;      // top level target (table, view, or index)
  ast_node *annotation_ast;  // the actual annotation
  int32_t ordinal;           // when sorting we want to use the original order (reversed actually) within a group
} recreate_annotation;

typedef struct schema_annotation {
  int32_t version;           // the version number (always > 0)
  ast_node *target_ast;      // top level target (table, view, or index)
  CSTR target_name;          // the name of the target
  uint32_t annotation_type;  // one of the codes below for the type of annotation
  ast_node *annotation_ast;  // the actual annotation
  int32_t column_ordinal;    // -1 if not a column
  ast_node *column_ast;      // a particular column if column annotation
} schema_annotation;

// Note: schema annotations are processed in the indicated order: the numbers matter!
#define SCHEMA_ANNOTATION_INVALID 0
#define SCHEMA_ANNOTATION_FIRST 1
#define SCHEMA_ANNOTATION_UNSUB 1
#define SCHEMA_ANNOTATION_CREATE_TABLE 2
#define SCHEMA_ANNOTATION_CREATE_COLUMN 3
#define SCHEMA_ANNOTATION_DELETE_TRIGGER 4
#define SCHEMA_ANNOTATION_DELETE_VIEW 5
#define SCHEMA_ANNOTATION_DELETE_INDEX 6
#define SCHEMA_ANNOTATION_DELETE_COLUMN 7
#define SCHEMA_ANNOTATION_DELETE_TABLE 8
#define SCHEMA_ANNOTATION_AD_HOC 9
#define SCHEMA_ANNOTATION_RESUB 10
#define SCHEMA_ANNOTATION_LAST 10
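
To consume these, a later pass can view each buffer as a dense array. A minimal sketch, assuming bytebuf exposes ptr (the raw bytes) and used (the byte count):

uint32_t count = schema_annotations->used / sizeof(schema_annotation);
schema_annotation *notes = (schema_annotation *)schema_annotations->ptr;
for (uint32_t i = 0; i < count; i++) {
  printf("%s: annotation type %u at version %d\n",
    notes[i].target_name, notes[i].annotation_type, notes[i].version);
}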

And of course, each "back end" is provided with the root of the AST so that it can also search and/or walk the AST in its own manner. We will see examples of this in later sections.

Recap​

At present, there are nearly 20,000 lines in sem.c, and it would no doubt take more than 20,000 lines of text to explain what they all do; that explanation would be less precise than the source code, and probably less readable. sem.c includes over 4,000 lines of comments, and probably should have more. While there is a lot of code there, it's very readable and I encourage you to do just that -- read it -- to get your answers.

The point of this part of the Internals Guide isn't to fully explain all 400+ error checks in about as many semantic error checking functions; it is to showcase the key concepts shared by all of them. Things like:

  • errors are reported largely in the AST and percolate up
  • expressions and statements have general purpose dispatch logic for continuing a statement walk
  • EXTRACT macros are used to keep the tree walk on track and correct in the face of changes
  • regions are used for visibility
  • versioning contributes to visibility
  • nullability and sensitivity are tracked throughout using type bits
  • type "kind" is managed by a simple string in the sem_node payload
  • the three main payloads are
    • sem_node for basic info, and
    • sem_struct or sem_join for the non-unitary types

This isn't everything but it should leave you well armed to begin your own exploration of sem.c.

Note: details on unsub/resub are forthcoming. This code is under development.