Part 2: Semantic Analysis

Preface

Part 2 continues with a discussion of the essentials of the semantic analysis pass of the CQL compiler. As in the previous sections, the goal here is not to go over every single rule but rather to give a sense of how semantic analysis happens in general -- the core strategies and implementation choices -- so that when reading the code you will have an idea how smaller pieces fit into the whole. To accomplish this, various key data structures will be explained in detail as well as selected examples of their use.

Semantic Analysis

The overall goal of the semantic analysis pass is to verify that a correct program has been submitted to the compiler. The compiler does this by "decorating" the AST with semantic information. This information is mainly concerned with the "types" of the various things in the program. A key function of the semantic analyzer, the primary "weapon" in computing these types if you will, is name resolution. The semantic analyzer decides what any given name means in any context and then uses that meaning, which is itself based on the AST constructs that came before, to compute types and then check those types for errors.

Broadly speaking the errors that can be discovered are of these forms:

  • mentioned names do not exist
    • e.g. using a variable or table or column without declaring it
  • mentioned names are not unique, or are ambiguous
    • e.g. every view must have a unique name
    • e.g. table names need to be unique, or aliased when joining tables
  • operands are not compatible with each other or with the intended operation
    • e.g. you can't add a string to a real
    • e.g. you can't do the % operation on a real
    • e.g. the expression in a WHERE clause must result in a numeric
    • e.g. the first argument to printf must be a string literal.
    • e.g. you can't assign a long value to an integer variable
    • e.g. you can't assign a possibly null result to a not-null variable
  • there are too many or two few operands for an operation
    • e.g. an INSERT statement must include sufficiently many columns and no extras
    • e.g. a function or procedure call must have the correct number of operands
  • an operation is happening in a context where it is not allowed
    • e.g. use of aggregate functions in the WHERE clause
    • e.g. use of unique SQLite functions outside of a SQL statement

There are several hundred possible errors, no attempt will be made to cover them all here but we will talk about how errors are created, recorded, and reported.

Decorated AST examples

Recalling the AST output from Part 1, this is what that same tree looks like with semantic information attached.

LET X := 1 + 3;
{let_stmt}: X: integer notnull variable
| {name X}: X: integer notnull variable
| {add}: integer notnull
| {int 1}: integer notnull
| {int 3}: integer notnull

And here's an example with some structure types

SELECT 1 AS x, 3.2 AS y;
{select_stmt}: select: { x: integer notnull, y: real notnull }
| {select_core_list}: select: { x: integer notnull, y: real notnull }
| | {select_core}: select: { x: integer notnull, y: real notnull }
| | {select_expr_list_con}: select: { x: integer notnull, y: real notnull }
| | {select_expr_list}: select: { x: integer notnull, y: real notnull }
| | | {select_expr}: x: integer notnull
| | | | {int 1}: integer notnull
| | | | {opt_as_alias}
| | | | {name x}
| | | {select_expr_list}
| | | {select_expr}: y: real notnull
| | | {dbl 3.2}: real notnull
| | | {opt_as_alias}
| | | {name y}
| | {select_from_etc}: ok
| | {select_where}
| | {select_groupby}
| | {select_having}
| {select_orderby}
| {select_limit}
| {select_offset}

These can be generated by adding --sem --print to the CQL command line along with --in your_file.sql.

Keep these shapes in mind as we discuss the various sources of type information.

The Base Data Structures

First recall that every AST node has this field in it:

struct sem_node *_Nullable sem;

This is the pointer to the semantic information for that node. Semantic analysis happens immediately after parsing and before any of the code-generators run. Importantly, code generators never run if semantic analysis reported any errors. Before we get into the shape of the semantic node, we should start with the fundamental unit of type info sem_t which is usually stored in a variable called sem_type.

typedef uint64_t sem_t;

The low order bits of a sem_t encode the core type and indeed there is an helper function to extract the core type from a sem_t.

// Strips out all the flag bits and gives you the base/core type.
cql_noexport sem_t core_type_of(sem_t sem_type) {
return sem_type & SEM_TYPE_CORE;
}

The core bits are as follows:

#define SEM_TYPE_NULL 0 // the subtree is a null literal (not just nullable)
#define SEM_TYPE_BOOL 1 // the subtree is a bool
#define SEM_TYPE_INTEGER 2 // the subtree is an integer
#define SEM_TYPE_LONG_INTEGER 3 // the subtree is a long integer
#define SEM_TYPE_REAL 4 // the subtree is a real
#define SEM_TYPE_TEXT 5 // the subtree is a text type
#define SEM_TYPE_BLOB 6 // the subtree is a blob type
#define SEM_TYPE_OBJECT 7 // the subtree is any object type
#define SEM_TYPE_STRUCT 8 // the subtree is a table/view
#define SEM_TYPE_JOIN 9 // the subtree is a join
#define SEM_TYPE_ERROR 10 // marks the subtree as having a problem
#define SEM_TYPE_OK 11 // sentinel for ok but no type info
#define SEM_TYPE_PENDING 12 // sentinel for type calculation in flight
#define SEM_TYPE_REGION 13 // the ast is a schema region
#define SEM_TYPE_CORE 0xff // bit mask for the core types
#define SEM_TYPE_MAX_UNITARY (SEM_TYPE_OBJECT+1) // the last unitary type

These break into a few categories:

  • NULL to OBJECT are the "unitary" types, these are the types that a single simple variable can be
    • a column can be any of these except OBJECT or NULL
    • the NULL type comes only from the NULL literal which has no type
    • instances of say a TEXT column might have a NULL value but they are known to be TEXT
  • STRUCT indicates that the object has many fields, like a table, or a cursor
  • JOIN indicates that the object is the concatenation of many STRUCT types
    • e.g. T1 inner join T2 is a JOIN type with T1 and T2 being the parts
    • a JOIN could be flattened to STRUCT, but this is typically not done
    • the type of a SELECT statement will be a STRUCT representing the expressions that were selected
    • those expressions in turn used columns from the JOIN that was the FROM clause
  • ERROR indicates that the subtree had an error
    • the error will have been already reported
    • the error type generally cascades up the AST to the root
  • OK indicates that there is no type information but there was no problem
    • e.g. a correct IF statement will resolve to simply OK (no error)
  • PENDING is used sometimes while a type computation is in progress
    • this type doesn't appear in the AST, but has its own unique value so as to not conflict with any others
  • REGION is used to identify AST fragments that correspond to schema regions
    • see Chapter 10 of the Guide for more information on regions
  • CORE is the mask for the core parts, 0xf would do the job but for easy reading in the debugger we use 0xff
    • new core types are not added very often, adding a new one is usually a sign that you are doing something wrong

The core type can be modified by various flags. The flags in principle can be combined in any way but in practice many combinations make no sense. For instance, HAS_DEFAULT is for table columns and CREATE_FUNC is for function declarations. There is no one object that could require both of these.

The full list as of this writing is as follows:

#define SEM_TYPE_NOTNULL _64(0x0100) // set if and only if null is not possible
#define SEM_TYPE_HAS_DEFAULT _64(0x0200) // set for table columns with a default
#define SEM_TYPE_AUTOINCREMENT _64(0x0400) // set for table columns with autoinc
#define SEM_TYPE_VARIABLE _64(0x0800) // set for variables and parameters
#define SEM_TYPE_IN_PARAMETER _64(0x1000) // set for in parameters (can mix with below)
#define SEM_TYPE_OUT_PARAMETER _64(0x2000) // set for out parameters (can mix with above)
#define SEM_TYPE_DML_PROC _64(0x4000) // set for stored procs that have DML/DDL
#define SEM_TYPE_HAS_SHAPE_STORAGE _64(0x8000) // set for a cursor with simplified fetch syntax
#define SEM_TYPE_CREATE_FUNC _64(0x10000) // set for a function that returns a created object +1 ref
#define SEM_TYPE_SELECT_FUNC _64(0x20000) // set for a sqlite UDF function declaration
#define SEM_TYPE_DELETED _64(0x40000) // set for columns that are not visible in the current schema version
#define SEM_TYPE_VALIDATED _64(0x80000) // set if item has already been validated against previous schema
#define SEM_TYPE_USES_OUT _64(0x100000) // set if proc has a one rowresult using the OUT statement
#define SEM_TYPE_USES_OUT_UNION _64(0x200000) // set if proc uses the OUT UNION form for multi row result
#define SEM_TYPE_PK _64(0x400000) // set if column is a primary key
#define SEM_TYPE_FK _64(0x800000) // set if column is a foreign key
#define SEM_TYPE_UK _64(0x1000000) // set if column is a unique key
#define SEM_TYPE_VALUE_CURSOR _64(0x2000000) // set only if SEM_TYPE_HAS_SHAPE_STORAGE is set and the cursor has no statement
#define SEM_TYPE_SENSITIVE _64(0x4000000) // set if the object is privacy sensitive
#define SEM_TYPE_DEPLOYABLE _64(0x8000000) // set if the object is a deployable region
#define SEM_TYPE_BOXED _64(0x10000000) // set if a cursor's lifetime is managed by a box object
#define SEM_TYPE_HAS_CHECK _64(0x20000000) // set for table column with a "check" clause
#define SEM_TYPE_HAS_COLLATE _64(0x40000000) // set for table column with a "collate" clause
#define SEM_TYPE_INFERRED_NOTNULL _64(0x80000000) // set if inferred to not be nonnull (but was originally nullable)
#define SEM_TYPE_VIRTUAL _64(0x100000000) // set if and only if this is a virtual table
#define SEM_TYPE_HIDDEN_COL _64(0x200000000) // set if and only if hidden column on a virtual table
#define SEM_TYPE_TVF _64(0x400000000) // set if and only table node is a table valued function
#define SEM_TYPE_IMPLICIT _64(0x800000000) // set if and only the variable was declare implicitly (via declare out)
#define SEM_TYPE_CALLS_OUT_UNION _64(0x1000000000) // set if proc calls an out union proc for

Note: _64(x) expands to either a trailing L or a trailing LL depending on the bitness of the compiler, whichever yields an int64_t.

Going over the meaning of all of the above is again beyond the scope of this document; some of the flags are very specialized and essentially the validation just requires a bit of storage in the tree to do its job so that storage is provided with a flag. However two flag bits are especially important and are computed almost everywhere sem_t is used. These are SEM_TYPE_NOTNULL and SEM_TYPE_SENSITIVE.

  • SEM_TYPE_NOTNULL indicates that the marked item is known to be NOT NULL, probably because it was declared as such, or directly derived from a not null item
    • Typically when two operands are combined both must be marked NOT NULL for the result to still be NOT NULL (there are exceptions like COALESCE)
    • Values that might be null cannot be assigned to targets that must not be null
  • SEM_TYPE_SENSITIVE indicates that the marked item is some kind of PII or other sensitive data.
    • Any time a sensitive operand is combined with another operand the resulting type is sensitive
    • There are very few ways to "get rid" of the sensitive bit -- it corresponds to the presence of @sensitive in the data type declaration.
    • Values that are sensitive cannot be assigned to targets that are not marked sensitive

The semantic node sem_node carries all the possible semantic info we might need, the sem_type holds the flags above and tells us how to interpret the rest of the node. There are many fields, we'll talk about some of the most important ones here to give you a sense of how things hang together.

Note that CSTR is simply an alias for const char *. CSTR is used extensively in the codebase for brevity.

typedef struct sem_node {
sem_t sem_type; // core type plus flags
CSTR name; // for named expressions in select columns etc.
CSTR kind; // the Foo in object<Foo>, not a variable or column name
CSTR error; // error text for test output, not used otherwise
struct sem_struct *sptr; // encoded struct if any
struct sem_join *jptr; // encoded join if any
int32_t create_version; // create version if any (really only for tables and columns)
int32_t delete_version; // create version if any (really only for tables and columns)
bool_t recreate; // for tables only, true if marked @recreate
CSTR recreate_group_name; // for tables only, the name of the recreate group if they are in one
CSTR region; // the schema region, if applicable, null means unscoped (default)
symtab *used_symbols; // for select statements, we need to know which of the ids in the select list was used if any
list_item *index_list; // for tables we need the list of indices that use this table (so we can recreate them together if needed)
struct eval_node *value; // for enum values we have to store the evaluated constant value of each member of the enum
} sem_node;
  • sem_type : already discussed above, this tells you how to interpret everything else
  • name : variables, columns, etc. have a canonical name, when a name case-insensitivity resolves, the canonical name is stored here
    • typically later passes emit the canonical variable name everywhere
    • e.g. FoO and fOO might both resolve to an object declared as foo, we always emit foo in codegen
  • kind : in CQL any type can be discriminated as in declare foo real<meters>, the kind here is meters
    • two expressions of the same core type (e.g. real) are incompatible if they have a kind and the kind does not match
    • e.g. if you have bar real<liters> then set foo := bar; is an error even though both are real because foo above is real<meters>
  • sptr : if the item's core type is SEM_TYPE_STRUCT then this is populated, see below
  • jptr : if the item's core type is SEM_TYPE_JOIN then this is populated, see below

If the object is a structure type then this is simply an array of names, kinds, and semantic types. In fact the semantic types will be all be unitary, possibly modified by NOT_NULL or SENSITIVE but none of the other flags apply. A single sptr directly corresponds to the notion of a "shape" in the analyzer. Shapes come from anything that looks like a table, such as a cursor, or the result of a SELECT statement.

// for tables and views and the result of a select
typedef struct sem_struct {
CSTR struct_name; // struct name
uint32_t count; // count of fields
CSTR *names; // field names
CSTR *kinds; // the "kind" text of each column, if any, e.g. integer<foo> foo is the kind
sem_t *semtypes; // typecode for each field
} sem_struct;

If the object is a join type (such as the parts of the FROM clause) then the jptr field will be populated. This is nothing more than a named list of struct types.

// for the data type of (parts of) the FROM clause
// sometimes I refer to as a "joinscope"
typedef struct sem_join {
uint32_t count; // count of table/views in the join
CSTR *names; // names of the table/view
struct sem_struct **tables; // struct type of each table/view
} sem_join;

With these building blocks we can represent the type of anything in the CQL language.

Initiating Semantic Analysis

The semantic analysis pass runs much the same way as the AST emitter. In sem.c there is the essential function sem_main. It suffices to call sem_main on the root of the AST. That root node is expected to be a stmt_list node.

// This method loads up the global symbol tables in either empty state or
// with the appropriate tokens ready to go. Using our own symbol tables for
// dispatch saves us a lot of if/else string comparison verbosity.
cql_noexport void sem_main(ast_node *ast) {
// restore all globals and statics we own
sem_cleanup();
eval_init();
...
}

As you can see, sem_main begins by resetting all the global state. You can of course do this yourself after calling sem_main (when you're done with the results).

sem_main sets a variety of useful and public global variables that describe the results of the analysis. The ones in sem.h are part of the contract and you should feel free to use them in a downstream code-generator. Other items are internal and should be avoided. The internal items are typically defined statically in sem.c. The essential outputs will be described in the last section of this part.

The cleanup has this structure:

// This method frees all the global state of the semantic analyzer
cql_noexport void sem_cleanup() {
eval_cleanup();
BYTEBUF_CLEANUP(deployable_validations);
BYTEBUF_CLEANUP(recreate_annotations);
BYTEBUF_CLEANUP(schema_annotations);
SYMTAB_CLEANUP(funcs);
SYMTAB_CLEANUP(globals);
SYMTAB_CLEANUP(indices);
SYMTAB_CLEANUP(locals);
...
// these are getting zeroed so that leaksanitizer will not count those objects as reachable from a global root.
all_ad_hoc_list = NULL;
all_functions_list = NULL;
...

This basically deallocates everything and resets all the globals to NULL.

sem_main of course has to walk the AST and it does so in much the same way as we saw in gen_sql.c. There is a series of symbol tables whose key is an AST type and whose value is a function plus arguments to dispatch (effectively a lambda). The semantic analyzer doesn't have to think about things like "should I emit parentheses" so the signature of each type of lambda can be quite a bit simpler. We'll go over each kind with some examples.

First we have the non-sql statements, these are basic flow control or other things that SQLite will never see directly.

symtab *syms = non_sql_stmts;
STMT_INIT(if_stmt);
STMT_INIT(while_stmt);
STMT_INIT(switch_stmt);
STMT_INIT(leave_stmt);
...

Here STMT_INIT creates a binding between (e.g.) the AST type if_stmt and the function sem_if_stmt. This lets us dispatch any part of the AST to its handler directly.

Next we have the SQL statements. These get analyzed in the same way as the others, and with functions that have the same signature, however, if you use one of these it means that procedure that contained this statement must get a database connection in order to run. Use of the database will require the procedure's signature to change; this is recorded by the setting the SEM_TYPE_DML_PROC flag bit to be set on the procedures AST node.

syms = sql_stmts;
STMT_INIT(create_table_stmt);
STMT_INIT(drop_table_stmt);
STMT_INIT(create_index_stmt);
STMT_INIT(create_view_stmt);
STMT_INIT(select_stmt);
STMT_INIT(delete_stmt);
STMT_INIT(update_stmt);
STMT_INIT(insert_stmt);
...

Again STMT_INIT creates a binding between (e.g.) the AST type delete_stmt and the function sem_delete_stmt so we can dispatch to the handler.

Next we have expression types, these are set up with EXPR_INIT. Many of the operators require exactly the same kinds of verification, so in order to be able to share the code, the expression analysis functions get an extra argument for the operator in question. Typically the string of the operator is only needed to make a good quality error message with validation being otherwise identical. Here are some samples...

EXPR_INIT(num, sem_expr_num, "NUM");
EXPR_INIT(str, sem_expr_str, "STR");
EXPR_INIT(blob, sem_expr_blob, "BLB");
EXPR_INIT(null, sem_expr_null, "NULL");
EXPR_INIT(dot, sem_expr_dot, "DOT");
EXPR_INIT(const, sem_expr_const, "CONST");
EXPR_INIT(mul, sem_binary_math, "*");
EXPR_INIT(mod, sem_binary_integer_math, "%");
EXPR_INIT(not, sem_unary_logical, "NOT");
EXPR_INIT(is_true, sem_unary_is_true_or_false, "IS TRUE");
EXPR_INIT(tilde, sem_unary_integer_math, "~");
EXPR_INIT(uminus, sem_unary_math, "-");

Looking at the very first entry as an example, we see that EXPR_INIT creates a mapping between the AST type num and the analysis function sem_expr_num and that function will get the text "NUM" as an extra argument. As it happens sem_expr_num doesn't need the extra argument, but sem_binary_math certainly needs the "*" as that function handles a large number of binary operators.

Let's quickly go over this list as these are the most important analyzers:

  • sem_expr_num : analyzes any numeric constant
  • sem_expr_str : analyzes any string literal or identifier
  • sem_expr_blob : analyzes any blob literal
  • sem_expr_null : analyzes the NULL literal (and nothing else)
  • sem_expr_dot : analyzes a compound name like T1.id
  • sem_expr_const : analyzes a const(...) expression, doing the constant evaluation
  • sem_binary_math : analyzes any normal binary math operator like '+', '-', '/' etc.
  • sem_binary_integer_math : analyzes any binary math operator where the operands must be integers like '%' or '|'
  • sem_unary_logical : analyzes any unary logical operator (the result is a bool), this is really only NOT
  • sem_unary_is_true_or_false : analyzes any of the IS TRUE, IS FALSE, family of postfix unary operators
  • sem_unary_integer_math : analyzes any unary operator where the operand must be an integer, this is really only ~
  • sem_unary_math : analyzes any any math unary operator, presently only negation (but in the future unary + too)

The last group of normal associations are for builtin functions, like these:

FUNC_INIT(changes);
FUNC_INIT(printf);
FUNC_INIT(strftime);
FUNC_INIT(date);
FUNC_INIT(time);

Each of these is dispatched when a function call is found in the tree. By way of example FUNC_INIT(changes) causes the changes function to map to sem_func_changes for validation.

There are a few other similar macros for more exotic cases but the general pattern should be clear now. With these in place it's very easy to traverse arbitrary statement lists and arbitrary expressions with sub expressions and have the correct function invoked without having large switch blocks all over.

Semantic Errors

Some of the following examples will show the handling of semantic errors more precisely but the theory is pretty simple. Each of the analyzers that has been registered is responsible for putting an appropriate sem_node into the AST it is invoked on. The caller will look to see if that sem_node is of type SEM_TYPE_ERROR using is_error(ast). If it is, the caller will mark its own AST as errant using record_error(ast) and this continues all the way up the tree. The net of this is that wherever you begin semantic analysis, you can know if there were any problems by checking for an error at the top of the tree you provided.

At the point of the initial error, the analyzer is expected to also call report_error providing a suitable message. This will be logged to stderr. In test mode it is also stored in the AST so that verification steps can confirm that errors were reported at exactly the right place.

If there are no errors, then a suitable sem_node is created for the resulting type or else, at minimum, record_ok(ast) is used to place the shared "OK" type on the node. The "OK" type indicates no type information, but no errors either. "OK" is helpful for statements that don't involve expressions like DROP TABLE Foo.

The Primitive Types

Perhaps the simplest analysis of all happens at the leaves of the AST. By way of example, here is the code for expression nodes of type num, the numeric literals.

// Expression type for numeric primitives
static void sem_expr_num(ast_node *ast, CSTR cstr) {
Contract(is_ast_num(ast));
EXTRACT_NUM_TYPE(num_type, ast);
switch (num_type) {
case NUM_BOOL:
ast->sem = new_sem(SEM_TYPE_BOOL | SEM_TYPE_NOTNULL);
break;
case NUM_INT:
ast->sem = new_sem(SEM_TYPE_INTEGER | SEM_TYPE_NOTNULL);
break;
case NUM_LONG:
ast->sem = new_sem(SEM_TYPE_LONG_INTEGER | SEM_TYPE_NOTNULL);
break;
default:
// this is all that's left
Contract(num_type == NUM_REAL);
ast->sem = new_sem(SEM_TYPE_REAL | SEM_TYPE_NOTNULL);
break;
}
}

As you can see the code simply looks at the AST node, confirming first that it is a num node. Then it extracts the num_type. Then ast->sem is set to a semantic node of the matching type adding in SEM_TYPE_NOTNULL because literals are never null.

The new_sem function is used to make an empty sem_node with the sem_type filled in as specified. Nothing can go wrong creating a literal so there are no failure modes.

It doesn't get much simpler unless maybe...

// Expression type for constant NULL
static void sem_expr_null(ast_node *ast, CSTR cstr) {
Contract(is_ast_null(ast));
// null literal
ast->sem = new_sem(SEM_TYPE_NULL);
}

It's hard to get simpler than doing semantic analysis of the NULL literal. Its code should be clear with no further explanation needed.

Unary Operators

Let's dive in to a simple case that does require some analysis -- the unary operators. There are comparatively few and there isn't much code required to handle them all.

Here's the code for the unary math operators:

// The only unary math operators are '-' and '~'
// Reference types are not allowed
static void sem_unary_math(ast_node *ast, CSTR op) {
sem_t core_type, combined_flags;
if (!sem_unary_prep(ast, &core_type, &combined_flags)) {
return;
}
if (!sem_validate_numeric(ast, core_type, op)) {
return;
}
// The result of unary math promotes to integer. Basically this converts
// bool to integer. Long integer and Real stay as they are. Text is
// already ruled out.
sem_t sem_type_result = sem_combine_types(
(SEM_TYPE_INTEGER | SEM_TYPE_NOTNULL),
(core_type | combined_flags));
ast->sem = new_sem(sem_type_result);
ast->sem->kind = ast->left->sem->kind;
// note ast->sem->name is NOT propagated because SQLite doesn't let you refer to
// the column 'x' in 'select -x' -- the column name is actually '-x' which is useless
// so we have no name once you apply unary math (unless you use 'as')
// hence ast->sem->name = ast->left->sem->name is WRONG here and it is not missing on accident
}

Unary Prep

OK already we need to pause, there is a "prep" pattern here common to most of the shared operators that we should discuss. The prep step takes care of most of the normal error handling which is the same for all the unary operators and same pattern happens in binary operators. Let's take a look at sem_unary_prep.

// The unary operators all have a similar prep to the binary. We need
// to visit the left side (it's always the left node even if the operator goes on the right)
// if that's ok then we need the combined_flags and core type. There is only
// the one. Returns true if everything is ok.
static bool_t sem_unary_prep(ast_node *ast, sem_t *core_type, sem_t *combined_flags) {
// op left | left op
sem_expr(ast->left);
if (is_error(ast->left)) {
*core_type = SEM_TYPE_ERROR;
*combined_flags = 0;
record_error(ast);
return false;
}
sem_node *sem = ast->left->sem;
sem_t sem_type = sem->sem_type;
*core_type = core_type_of(sem_type);
*combined_flags = not_nullable_flag(sem_type) | sensitive_flag(sem_type);
Invariant(is_unitary(*core_type));
return true;
}

Reviewing the steps:

  • first we analyze the operand, it will be in ast->left
  • if that's an error, we just return the error code from the prep steps
  • now that it's not an error, we pull the core type out of the operand
  • then we pull the not nullable and sensitive flag bits out of the operand
  • finally return a boolean indicating the presence of an error (or not) for convenience

This is useful setup for all the unary operators, and as we'll see, the binary operators have a similar prep step.

Back to Unary Processing

Looking at the overall steps we see:

  • sem_unary_prep : verifies that the operand is not an error, and gets its core type and flag bits
  • sem_validate_numeric : verifies that the operand is a numeric type
    • recall these are the math unary operators, so the operand must be numeric
  • sem_combine_types : creates the smallest type that holds two compatible types
    • by combining with "integer not null" we ensure that the resulting type is at least as big as an integer
    • if the argument is of type long or real then it will be the bigger type and the resulting type will be long or real as appropriate
    • in short, bool is promoted to int, everything else stays the same
    • sem_combine_types also combines the nullability and sensitivity appropriately
  • a new sem_node of the combined type is created
    • the type "kind" of the operand is preserved (e.g. the meters in real<meters>)
    • any column alias or variable name is not preserved, the value is now anonymous

These primitives are designed to combine well, for instance, consider sem_unary_integer_math

static void sem_unary_integer_math(ast_node *ast, CSTR op) {
sem_unary_math(ast, op);
sem_reject_real(ast, op);
}

The steps are:

  • sem_unary_math : do the sequence we just discussed
  • sem_reject_real : report/record an error if the result type is real otherwise do nothing

Note that in all cases the op string simply gets pushed down to the place where the errors happen. Let's take a quick look at one of the sources of errors in the above. Here's the numeric validator:

static bool_t sem_validate_numeric(ast_node *ast, sem_t core_type, CSTR op) {
if (is_blob(core_type)) {
report_error(ast->left, "CQL0045: blob operand not allowed in", op);
record_error(ast);
return false;
}
if (is_object(core_type)) {
report_error(ast->left, "CQL0046: object operand not allowed in", op);
record_error(ast);
return false;
}
if (is_text(core_type)) {
report_error(ast->left, "CQL0047: string operand not allowed in", op);
record_error(ast);
return false;
}
return true;
}

That function is pretty much dumb as rocks. The non-numeric types are blob, object, and text. There is a custom error for each type (it could have been shared but specific error messages seem to help users). This code doesn't know its context, but all it needs is op to tell it what the numeric-only operator was and it can produce a nice error message. It leaves an error in the AST using record_error, its caller can then simply return if anything goes wrong.

It's not hard to guess how sem_reject_real works:

// Some math operators like << >> & | % only make sense on integers
// This function does the extra checking to ensure they do not get real values
// as arguments. It's a post-pass after the normal math checks.
static void sem_reject_real(ast_node *ast, CSTR op) {
if (!is_error(ast)) {
sem_t core_type = core_type_of(ast->sem->sem_type);
if (core_type == SEM_TYPE_REAL) {
report_error(ast, "CQL0001: operands must be an integer type, not real", op);
record_error(ast);
}
}
}
  • if the AST node isn't already an error, and the node is of type "real", report an error
  • it assumes the type is already known to be numeric
  • the pre-check for errors is to avoid double reporting; if something has already gone wrong, the core type will be SEM_TYPE_ERROR
    • no new error recording is needed in that case, obviously an error was already recorded

Binary Operators

Binary Prep

With the knowledge we have so far, this code pretty much speaks for itself, but we'll walk through it.

// All the binary ops do the same preparation, they evaluate the left and the
// right expression, then they check those for errors. Then they need
// the types of those expressions and the combined_flags of the result. This
// does exactly that for its various callers. Returns true if all is well.
static bool_t sem_binary_prep(ast_node *ast, sem_t *core_type_left, sem_t *core_type_right, sem_t *combined_flags) {
EXTRACT_ANY_NOTNULL(left, ast->left);
EXTRACT_ANY_NOTNULL(right, ast->right);
// left op right
sem_expr(left);
sem_expr(right);
if (is_error(left) || is_error(right)) {
record_error(ast);
*core_type_left = SEM_TYPE_ERROR;
*core_type_right = SEM_TYPE_ERROR;
*combined_flags = 0;
return false;
}
*core_type_left = core_type_of(left->sem->sem_type);
*core_type_right = core_type_of(right->sem->sem_type);
*combined_flags = combine_flags(left->sem->sem_type, right->sem->sem_type);
Invariant(is_unitary(*core_type_left));
Invariant(is_unitary(*core_type_right));
return true;
}
  • sem_expr : used to recursively walk the left and right nodes
  • is_error : checks if either side had errors, if so, simply propagate the error
  • extract the left and right core types
  • combine nullability and sensitivity flags

And that's it! These are the standard prep steps for all binary operators. With this done, the caller has the core types of the left and right operands plus combined flags on a silver platter and one check is needed to detect if anything went wrong.

Example: Is or Is Not

This analyzer is the simplest of all the binaries

// IS and IS NOT are special in that they return a not null boolean.
static void sem_binary_is_or_is_not(ast_node *ast, CSTR op) {
sem_t core_type_left, core_type_right, combined_flags;
if (!sem_binary_prep(ast, &core_type_left, &core_type_right, &combined_flags)) {
return;
}
if (!sem_verify_compat(ast, core_type_left, core_type_right, op)) {
return;
}
// the result of is or is not is always a bool and never null
ast->sem = new_sem(SEM_TYPE_BOOL | SEM_TYPE_NOTNULL | sensitive_flag(combined_flags));
}
  • sem_binary_prep : checks for errors in the left or right
  • sem_verify_compat : ensures that left and right operands are type compatible (discussed later)
  • the result is always of type bool not null

If either step goes wrong the error will naturally propagate.

Example: Binary Math

This is the general worker for binary math operations, the most common operations like '+', '-', '*' and so forth.

// For all math operations, we combine the types and yield the type that
// holds both using the helper. If any text, that's an error.
static void sem_binary_math(ast_node *ast, CSTR op) {
sem_t core_type_left, core_type_right, combined_flags;
if (!sem_binary_prep(ast, &core_type_left, &core_type_right, &combined_flags)) {
return;
}
if (error_any_object(ast, core_type_left, core_type_right, op)) {
return;
}
if (error_any_blob_types(ast, core_type_left, core_type_right, op)) {
return;
}
if (error_any_text_types(ast, core_type_left, core_type_right, op)) {
return;
}
sem_t core_type = sem_combine_types(core_type_left, core_type_right);
CSTR kind = sem_combine_kinds(ast->right, ast->left->sem->kind);
if (is_error(ast->right)) {
record_error(ast);
return;
}
ast->sem = new_sem(core_type | combined_flags);
ast->sem->kind = kind;
}

Let's have a look at those steps:

  • sem_binary_prep : checks for errors on the left or right
  • error_any_object : reports an error if the left or right is of type object
  • error_any_blob_types : reports an error if the left or right is of type blob
  • error_any_text_types : reports an error if the left or right is of type text
  • sem_combine_type : computes the combined type, the smallest numeric type that holds both left and right
    • note the operands are now known to be numeric
    • the three type error checkers give nice tight errors about the left or right operand
  • sem_combine_kinds : tries to create a single type kind for both operands
    • if their kind is incompatible, records an error on the right
  • new_sem : creates a sem_node with the combined type, flags, and then the kind is set.

At this point it might help to look a few more of the base validators, they are very unremarkable.

Example Validator: error_any_object

// If either of the types is an object then produce an error on the ast.
static bool_t error_any_object(ast_node *ast, sem_t core_type_left, sem_t core_type_right, CSTR op) {
if (is_object(core_type_left)) {
report_error(ast->left, "CQL0002: left operand cannot be an object in", op);
record_error(ast);
return true;
}
if (is_object(core_type_right)) {
report_error(ast->right, "CQL0003: right operand cannot be an object in", op);
record_error(ast);
return true;
}
return false;
}
  • is_object : checks a sem_type against SEM_TYPE_OBJECT
    • if the left or right child is an object an appropriate error is generated
  • there is no strong convention about returning true if ok, or true if error, it's pretty ad hoc
    • this doesn't seem to cause a lot of problems

Example Validator: sem_combine_kinds

// Here we check that type<Foo> only combines with type<Foo> or type.
// If there is a current object type, then the next item must match
// If there is no such type, then an object type that arrives becomes the required type
// if they ever don't match record an error
static CSTR sem_combine_kinds_general(ast_node *ast, CSTR kleft, CSTR kright) {
if (kright) {
if (kleft) {
if (strcmp(kleft, kright)) {
CSTR errmsg = dup_printf("CQL0070: expressions of different kinds can't be mixed: '%s' vs. '%s'", kright, kleft);
report_error(ast, errmsg, NULL);
record_error(ast);
}
}
return kright;
}
return kleft;
}
// helper to crack the ast nodes first and then call the normal comparisons
static CSTR sem_combine_kinds(ast_node *ast, CSTR kright) {
CSTR kleft = ast->sem->kind;
return sem_combine_kinds_general(ast, kleft, kright);
}
  • sem_combine_kinds : uses the worker sem_combine_kinds_general after extracting the kind from the left node
    • usually you already have one kind and you want to know if another kind is compatible hence this helper
  • sem_combine_kinds_general : applies the general rules for "kind" strings:
    • NULL + NULL => NULL
    • NULL + x => x
    • x + NULL => x
    • x + x => x
    • x + y => error (if x != y)
  • this is one of the rare functions that creates a dynamic error message

Example Validator : is_numeric_compat

This helper is frequently called several times in the course of other semantic checks. This one produces no errors, that's up to the caller. Often there is a numeric path and a non-numeric path so this helper can't create the errors as it doesn't yet know if anything bad has happened. Most of the is_something functions are the same way.

cql_noexport bool_t is_numeric_compat(sem_t sem_type) {
sem_type = core_type_of(sem_type);
return sem_type >= SEM_TYPE_NULL && sem_type <= SEM_TYPE_REAL;
}

is_numeric_compat operates by checking the core type for the numeric range. Note that NULL is compatible with numerics because expressions like NULL + 2 have meaning in SQL. The type of that expression is nullable integer and the result is NULL.

Example Validator : sem_combine_types

// The second workhorse of semantic analysis, given two types that
// are previously known to be compatible, it returns the smallest type
// that holds both. If either is nullable the result is nullable.
// Note: in the few cases where that isn't true the normal algorithm for
// nullablity result must be overridden (see coalesce for instance).
static sem_t sem_combine_types(sem_t sem_type_1, sem_t sem_type_2) {
... too much code ... summary below
}

This beast is rather lengthy but unremarkable. It follows these rules:

  • text is only compatible with text
  • object is only compatible with object
  • blob is only compatible with blob
  • numerics are only compatible with other numerics and NULL
    • NULL promotes the other operand, whatever it is (might still be NULL)
    • bool promotes to integer if needed
    • integer promotes to long integer if needed
    • long integer promotes to real if needed
    • the combined type is the smallest numeric type that holds left and right according to the above

Some examples might be helpful:

  • 1 + 2L -> long
  • false + 3.1 -> real
  • 2L + 3.1 -> real
  • true + 2 -> integer
  • 'x' + 1 -> not compatible

Note that sem_combine_types assumes the types have already been checked for compatibility and will use Contract to enforce this. You should be using other helpers like is_numeric_compat and friends to ensure the types agree before computing the combined type. A list of values that must be compatible with each other (e.g. in needle IN (haystack)) can be checked using sem_verify_compat repeatedly.

Example Validator : sem_verify_assignment

The sem_verify_assignment function is used any time there is something like a logical assignment going on. There are two important cases:

  • SET x := y : an actual assignment
  • call foo(x) : the expression x must be "assignable" to the formal variable for the argument of foo

This is a lot like normal binary operator compatibility with one extra rule: the source expression must not be a bigger type than the target. e.g. you cannot assign a long to an integer, nor pass a long expression to a function that has an integer parameter.

// This verifies that the types are compatible and that it's ok to assign
// the expression to the variable. In practice that means:
// * the variable type core type and kind must be compatible with the expression core type and kind
// * the variable must be nullable if the expression is nullable
// * the variable must be sensitive if the assignment is sensitive
// * the variable type must be bigger than the expression type
// Here ast is used only to give a place to put any errors.
static bool_t sem_verify_assignment(ast_node *ast, sem_t sem_type_needed, sem_t sem_type_found, CSTR var_name) {
if (!sem_verify_compat(ast, sem_type_needed, sem_type_found, var_name)) {
return false;
}
if (!sem_verify_safeassign(ast, sem_type_needed, sem_type_found, var_name)) {
return false;
}
if (is_nullable(sem_type_found) && is_not_nullable(sem_type_needed)) {
report_error(ast, "CQL0013: cannot assign/copy possibly null expression to not null target", var_name);
return false;
}
if (sensitive_flag(sem_type_found) && !sensitive_flag(sem_type_needed)) {
report_error(ast, "CQL0014: cannot assign/copy sensitive expression to non-sensitive target", var_name);
return false;
}
return true;
}
  • sem_verify_compat : checks for standard type compatibility between the left and the right
  • sem_verify_safeassign : checks that if the types are different the right operand is the smaller
  • nullability checks ensure you aren't trying to assign a nullable value to a not null variable
  • sensitivity checks ensure you aren't trying to assign a sensitive value to a not sensitive variable

Simple Statement Validation

With the expression building blocks, most of the usual kind of language statements become quite simple to check for correctness. It's probably easiest to illustrate this with an example, let's look at validation for the WHILE statement.

// While semantic analysis is super simple.
// * the condition must be numeric
// * the statement list must be error-free
// * loop_depth is increased allowing the use of interior leave/continue
static void sem_while_stmt(ast_node *ast) {
Contract(is_ast_while_stmt(ast));
EXTRACT_ANY_NOTNULL(expr, ast->left);
EXTRACT(stmt_list, ast->right);
// WHILE [expr] BEGIN [stmt_list] END
sem_numeric_expr(expr, ast, "WHILE", SEM_EXPR_CONTEXT_NONE);
if (is_error(expr)) {
record_error(ast);
return;
}
if (stmt_list) {
loop_depth++;
sem_stmt_list(stmt_list);
loop_depth--;
if (is_error(stmt_list)) {
record_error(ast);
return;
}
}
record_ok(ast);
}
  • EXTRACT* : pulls out the tree parts we need
  • sem_numeric_expr : verifies the loop expression is numeric
  • sem_stmt_list : recursively validates the body of the loop

Note: the while expression is one of the loop constructs which means LEAVE and CONTINUE are legal inside it. The loop_depth global tracks the fact that we are in a loop so that analysis for LEAVE and CONTINUE can report errors if we are not.

It's not hard to imagine that sem_stmt_list will basically walk the AST, pulling out statements and dispatching them using the STMT_INIT tables previously discussed. You might land right back in sem_while_stmt for a nested WHILE -- it's turtles all the way down.

If SEM_EXPR_CONTEXT_NONE is a mystery, don't worry it's covered in the next section.

Expression Contexts

It turns out that in the SQL language some expression types are only valid in some parts of a SQL statement (e.g. aggregate functions can't appear in a LIMIT clause) and so there is always a context for any numeric expression. When a new root expression is being evaluated, it sets the expression context per the caller's specification.

The expression contexts are as follows:

#define SEM_EXPR_CONTEXT_NONE 0x0001
#define SEM_EXPR_CONTEXT_SELECT_LIST 0x0002
#define SEM_EXPR_CONTEXT_WHERE 0x0004
#define SEM_EXPR_CONTEXT_ON 0x0008
#define SEM_EXPR_CONTEXT_HAVING 0x0010
#define SEM_EXPR_CONTEXT_ORDER_BY 0x0020
#define SEM_EXPR_CONTEXT_GROUP_BY 0x0040
#define SEM_EXPR_CONTEXT_LIMIT 0x0080
#define SEM_EXPR_CONTEXT_OFFSET 0x0100
#define SEM_EXPR_CONTEXT_TABLE_FUNC 0x0200
#define SEM_EXPR_CONTEXT_WINDOW 0x0400
#define SEM_EXPR_CONTEXT_WINDOW_FILTER 0x0800
#define SEM_EXPR_CONTEXT_CONSTRAINT 0x1000

The idea here is simple, when calling a root expression, the analyzer provides the context value that has the bit that corresponds to the current context. For instance, the expression being validated in is the WHERE clause, the code will provide SEM_EXPR_CONTEXT_WHERE. The inner validators check this context, in particular anything that is only available in some contexts has a bit-mask of that is the union of the context bits where it can be used. The validator can check those possibilities against the current context with one bitwise "and" operation. A zero result indicates that the operation is not valid in the current context.

This bitwise "and" is performed by one of these two helper macros which makes the usage a little clearer:

#define CURRENT_EXPR_CONTEXT_IS(x) (!!(current_expr_context & (x)))
#define CURRENT_EXPR_CONTEXT_IS_NOT(x) (!(current_expr_context & (x)))

Expression Context Example : Concat

The concatenation operator || is challenging to successfully emulate because it does many different kinds of numeric to string conversions automatically. Rather than perennially getting this wrong, we simply do not support this operator in a context where SQLite isn't going to be doing the concatenation. So typically users use "printf" instead to get formatting done outside of a SQL context. The check for invalid use of || is very simple and it happens of course in sem_concat.

if (CURRENT_EXPR_CONTEXT_IS(SEM_EXPR_CONTEXT_NONE)) {
report_error(ast, "CQL0241: CONCAT may only appear in the context of a SQL statement", NULL);
record_error(ast);
return;
}

Expression Context Example : IN

A slightly more complex example happens processing the IN operator. This operator has two forms, the form with an expression list, which can be used anywhere, and the form with a SELECT statement. The latter form can only appear in some sections of SQL, and not at all in loose expressions. For instance, that form may not appear in the LIMIT or OFFSET sections of a SQLite statement.

We use this construct to do the validation:

uint32_t valid = SEM_EXPR_CONTEXT_SELECT_LIST
|SEM_EXPR_CONTEXT_WHERE
|SEM_EXPR_CONTEXT_ON
|SEM_EXPR_CONTEXT_HAVING
|SEM_EXPR_CONTEXT_TABLE_FUNC;
if (CURRENT_EXPR_CONTEXT_IS_NOT(valid)) {
report_error( ast, "CQL0078: [not] in (select ...) is only allowed inside "
"of select lists, where, on, and having clauses", NULL);
record_error(ast);
return;
}

If the reader is interested in a simple learning exercise, run down the purpose of SEM_EXPR_CONTEXT_TABLE_FUNC -- it's simple, but important, and it only has one use case so it's easy to find.

Name Resolution

We've gotten pretty far without talking about the elephant in the room: name resolution.

Like SQL, many statements in CQL have names in positions where the type of the name is completely unambiguous. For instance nobody could be confused what sort of symbol Foo is in DROP INDEX Foo.

This type, with a clear name category, are the easiest name resolutions, and there are a lot in this form. Let's do an example.

Example: Index Name Resolution

// This is the basic checking for the drop index statement
// * the index must exist (have been declared) in some version
// * it could be deleted now, that's ok, but the name has to be valid
static void sem_drop_index_stmt(ast_node *ast) {
Contract(is_ast_drop_index_stmt(ast));
EXTRACT_ANY_NOTNULL(name_ast, ast->right);
EXTRACT_STRING(name, name_ast);
ast_node *index_ast = find_usable_index(name, name_ast, "CQL0112: index in drop statement was not declared");
if (!index_ast) {
record_error(ast);
return;
}
record_ok(ast);
}

Well, this is interesting. But what's going on with find_usable_index? What is usable? Why aren't we just looking up the index name in some name table? Let's have a look at the details of find_usable_index.

// returns the node only if it exists and is not restricted by the schema region.
static ast_node *find_usable_index(CSTR name, ast_node *err_target, CSTR msg) {
ast_node *index_ast = find_index(name);
if (!index_ast) {
report_error(err_target, msg, name);
return NULL;
}
if (!sem_validate_object_ast_in_current_region(name, index_ast, err_target, msg)) {
return NULL;
}
return index_ast;
}

We haven't discussed schema regions yet but what you need to know about them for now is this:

  • any object can be in a region.
  • a region may depend on other regions

If an object is in a region, then it may only use schema parts that are in the same region, or the region's dependencies (transitively).

The point of this is that you might have a rather large schema and you probably don't want any piece of code to use any piece of schema. You can use regions to ensure that the code for feature "X" doesn't try to use schema designed exclusively for feature "Y". That "X" code probably has no business even knowing of the existence of "Y" schema.

So now usable simply means this:

  • find_index can find the name in the symbol table for indices
  • the found index is accessible in the current region

If we had used an example that was looking up a table name, the same region considerations would apply, however, additionally tables can be deprecated with @delete so there would be additional checks to make sure we're talking about a live table and not a table's tombstone.

In short, these simple cases just require looking up the entity and verifying that it's accessible in the current context.

Flexible Name Resolution

The "hard case" for name resolution is where the name is occurring in an expression. Such a name can refer to all manner of things. It could be a global variable, a local variable, an argument, a table column, a field in a cursor, and others. The general name resolver goes through several phases looking for the name. Each phase can either report an affirmative success or error (in which case the search stops), or it may simply report that the name was not found but the search should continue.

We can demystify this a bit by looking at the most common way to get name resolution done.

// Resolves a (potentially qualified) identifier, writing semantic information
// into `ast` if successful, or reporting and recording an error for `ast` if
// not.
static void sem_resolve_id(ast_node *ast, CSTR name, CSTR scope) {
Contract(is_id(ast) || is_ast_dot(ast));
Contract(name);
// We have no use for `type` and simply throw it away.
sem_t *type = NULL;
sem_resolve_id_with_type(ast, name, scope, &type);
}

The name resolver works on either a vanilla name (e.g. x) or a scoped name (e.g. T1.x). The name and scope are provided. The ast parameter is used only as a place to report errors, there is no further cracking of the AST needed to resolve the name. As you can see sem_resolve_id just calls the more general function sem_resolve_id_with_type and is used in the most common case where you don't need to be able to mutate the sematic type info for the identifier. That's the 99% case.

So let's move on to the "real" resolver.

// This function is responsible for resolving both unqualified identifiers (ids)
// and qualified identifiers (dots). It performs the following two roles:
//
// - If an optional `ast` is provided, it works the same way most semantic
// analysis functions work: semantic information will be written into into the
// ast, errors will be reported to the user, and errors will be recorded in
// the AST.
//
// - `*typr_ptr` will be set to mutable type (`sem_t *`) in the current
// environment if the identifier successfully resolves to a type. (There are,
// unfortunately, a few exceptions in which a type will be successfully
// resolved and yet `*typr_ptr` will not be set. These include when a cursor
// in an expression position, when the expression is `rowid` (or similar), and
// when the id resolves to an enum case. The reason no mutable type is
// returned in these cases is that a new type is allocated as part of semantic
// analysis, and there exists no single, stable type in the environment to
// which a pointer could be returned. This is a limitation of this function,
// albeit one that's currently not problematic.)
//
// Resolution is attempted in the order that the `sem_try_resolve_*` functions
// appear in the `resolver` array. Each takes the same arguments: An (optional)
// AST, a mandatory name, an optional scope, and mandatory type pointer. If the
// identifier provided to one of these resolvers is resolved successfully, *or*
// if the correct resolver was found but there was an error in the program,
// `SEM_RESOLVE_STOP` is returned and resolution is complete, successful or not.
// If a resolver is tried and it determines that it is not the correct resolver
// for the identifier in question, `SEM_RESOLVE_CONTINUE` is returned and the
// next resolver is tried.
//
// This function should not be called directly. If one is interested in
// performing semantic analysis, call `sem_resolve_id` (or, if within an
// expression, `sem_resolve_id_expr`). Alternatively, if one wants to get a
// mutable type from the environment, call `find_mutable_type`.
static void sem_resolve_id_with_type(ast_node *ast, CSTR name, CSTR scope, sem_t **type_ptr) {
Contract(name);
Contract(type_ptr);
*type_ptr = NULL;
sem_resolve (*resolver[])(ast_node *ast, CSTR, CSTR, sem_t **) = {
sem_try_resolve_arguments,
sem_try_resolve_column,
sem_try_resolve_rowid,
sem_try_resolve_cursor_as_expression,
sem_try_resolve_variable,
sem_try_resolve_enum,
sem_try_resolve_cursor_field,
sem_try_resolve_arg_bundle,
};
for (uint32_t i = 0; i < sizeof(resolver) / sizeof(void *); i++) {
if (resolver[i](ast, name, scope, type_ptr) == SEM_RESOLVE_STOP) {
return;
}
}
report_resolve_error(ast, "CQL0069: name not found", name);
record_resolve_error(ast);
}

This function is well described in its own comments. We can easily see the "mini-resolvers" which attempt to find the name in order:

  • sem_try_resolve_arguments : an argument in the argument list
  • sem_try_resolve_column : a column name (possibly scoped)
  • sem_try_resolve_rowid : the virtual rowid column (possibly scoped)
  • sem_try_resolve_cursor_as_expression : use of a cursor as a boolean, the bool is true if the cursor has data
  • sem_try_resolve_variable : local or global variables
  • sem_try_resolve_enum : the constant value of an enum (must be scoped)
  • sem_try_resolve_cursor_field : a field in a cursor (must be scoped)
  • sem_try_resolve_arg_bundle : a field in an argument bundle (must be scoped)

These all use this enum to communicate progress:

// All `sem_try_resolve_*` functions return either `SEM_RESOLVE_CONTINUE` to
// indicate that another resolver should be tried, or `SEM_RESOLVE_STOP` to
// indicate that the correct resolver was found. Continuing implies that no
// failure has (yet) occurred, but stopping implies neither success nor failure.
typedef enum {
SEM_RESOLVE_CONTINUE = 0,
SEM_RESOLVE_STOP = 1
} sem_resolve;

Each of these mini-resolvers will have a series of rules, for example sem_try_resolve_cursor_field is going to have to do something like this:

  • if there is no scope, it can't be a cursor field, return CONTINUE
  • if the scope is not the name of a cursor, return CONTINUE
  • if the name is a field in the cursor, return STOP with success
  • else, report that the name is not a valid member of the cursor, and return STOP with an error

All the mini-resolvers are similarly structured, generically:

  • if it's not my case, return CONTINUE
  • if it is my case return STOP (maybe with an error)

Some of the mini-resolvers have quite a few steps, but any one mini-resolver is only about a screenful of code and it does one job.

Structure types and the notion of Shapes

Earlier we discussed SEM_TYPE_STRUCT briefly. Recall the basic notion of the structure type:

// for tables and views and the result of a select
typedef struct sem_struct {
CSTR struct_name; // struct name
uint32_t count; // count of fields
CSTR *names; // field names
CSTR *kinds; // the "kind" text of each column, if any, e.g. integer<foo> foo is the kind
sem_t *semtypes; // typecode for each field
} sem_struct;

The structure is nothing more than an array of names, types and kinds with a count. But it creates the notion of what's usually called a "shape" in the codebase. Shapes can be used in a variety of ways as is described in Chapter 5 of the CQL Guide. But before we get into shapes, let's look at an example of how a structure type is created.

The code that follows is the back end of sem_create_table_stmt. At this point the bulk of the analysis is done and the columns all have their types. We're about to build the struct type for the table. Let's see how that goes.

// now create a struct type with the correct number of columns
// the types have already been computed so all we have to do is
// check for duplicates
sem_struct *sptr = new_sem_struct(name, cols);
symtab *columns = symtab_new();
int32_t col = 0;
for (ast_node *item = col_key_list; item; item = item->right) {
Contract(is_ast_col_key_list(item));
EXTRACT_ANY_NOTNULL(def, item->left);
if (is_ast_col_def(def)) {
Invariant(def->sem->name);
Invariant(col <= cols); // it's possible that the rest are deleted and we're at the end.
// columns must be unique, including deleted columns
if (!symtab_add(columns, def->sem->name, NULL)) {
EXTRACT_NOTNULL(col_def_type_attrs, def->left);
EXTRACT_NOTNULL(col_def_name_type, col_def_type_attrs->left);
EXTRACT_ANY_NOTNULL(col_def_ast, col_def_name_type->left);
report_error(col_def_ast, "CQL0142: duplicate column name", def->sem->name);
record_error(ast);
symtab_delete(columns);
goto cleanup;;
}
if (is_deleted(def->sem->sem_type)) {
continue;
}
Invariant(col < cols);
sptr->names[col] = def->sem->name;
sptr->semtypes[col] = def->sem->sem_type;
sptr->kinds[col] = def->sem->kind;
col++;
}
}
symtab_delete(columns);
Invariant(col == cols);
ast->sem = new_sem(SEM_TYPE_STRUCT);
ast->sem->sptr = sptr;
ast->sem->jptr = sem_join_from_sem_struct(sptr);
ast->sem->region = current_region;
  • new_sem_struct : makes a struct to hold the result, we already have the count of columns and the table name
  • symtab_new : is going to gives us a scratch symbol table so we can check for duplicate column names
  • we walk all the items in the table and use is_ast_col_def(def) to find the column definitions
  • Invariant(def->sem->name) : claims that we must have already computed the semantic info for the column and it has its name populated
    • this was done earlier
  • symtab_add(columns, def->sem->name, NULL) : adds a nil entry under the column name, if this fails we have a duplicate column
    • in which case we report errors and stop
  • is_deleted : tells us if the column was marked with @delete in which case it no longer counts as part of the table
  • if all this is good we set the names, kinds, and semtypes from the column definition's semnatic info
  • symtab_delete : cleans up the temporary symbol table
  • new_sem : creates a sem_node of type SEM_TYPE_STRUCT which is filled in
    • sem_join_from_sem_struct will be discussed shortly, but it creates a jptr with one table in it

Structure types often come from the shape of a table, but other things can create a structure type. For instance, the columns of a view, or any select statement, are also described by a structure type and are therefore valid "shapes". The return type of a procedure usually comes from a SELECT statement so the procedure too can be the source of a shape. The arguments of a procedure form a shape. The fields of a cursor form a shape. You can even have a named subset of the arguments of a procedure and use them like a shape. All of these things are described by structure types.

Shapes and the LIKE construct

There are many cases where you want to be able to capture or re-use something with a known shape and you don't want to have to fully re-declare the thing. CQL uses the LIKE construct to do these sorts of things. This is more fully explained in Chapter 5 of the Guide, but for now let's look at two different cases that are of interest.

First, a cursor:

DECLARE C CURSOR LIKE Foo; -- Foo something with a shape

So, in the above, Foo could be a table, a view, a procedure with a result, another cursor, and so forth.

How might we do this? This is the business of sem_declare_cursor_like_name

// Here we're going to make a new value cursor using the indicated name for the shape.
// The name has to be "likeable" meaning it refers to some named thing with a shape
// such as a table, a view, another cursor, or a procedure that returns a result set.
// These are the so called "value cursors" in that they have no underlying statement
// that they move through. You can just load them up with a row and pass them around.
static void sem_declare_cursor_like_name(ast_node *ast) {
Contract(is_ast_declare_cursor_like_name(ast));
EXTRACT_ANY_NOTNULL(new_cursor_ast, ast->left);
EXTRACT_STRING(new_cursor_name, new_cursor_ast);
EXTRACT_ANY_NOTNULL(like_ast, ast->right);
EXTRACT_ANY_NOTNULL(name_ast, like_ast->left);
EXTRACT_STRING(like_name, name_ast);
// no duplicates allowed
if (!sem_verify_legal_variable_name(ast, new_cursor_name)) {
record_error(new_cursor_ast);
record_error(ast);
return;
}
// must be a valid shape
ast_node *found_shape = sem_find_likeable_ast(like_ast, LIKEABLE_FOR_VALUES);
if (!found_shape) {
record_error(ast);
return;
}
// good to go, make our cursor, with storage.
name_ast->sem = like_ast->sem = found_shape->sem;
new_cursor_ast->sem = new_sem(SEM_TYPE_STRUCT | SEM_TYPE_VARIABLE | SEM_TYPE_VALUE_CURSOR | SEM_TYPE_HAS_SHAPE_STORAGE);
new_cursor_ast->sem->sptr = found_shape->sem->sptr;
new_cursor_ast->sem->name = new_cursor_name;
ast->sem = new_cursor_ast->sem;
symtab_add(current_variables, new_cursor_name, new_cursor_ast);
}
  • EXTRACT : gets the pieces we need from the AST
  • sem_verify_legal_variable_name : makes sure the cursor name is unique and doesn't hide a table name
  • sem_find_likeable_ast : searches for something with a suitable name that has a shape
  • we populate the name node with the semantic type that we found
  • new_sem : makes a new sem_node for the cursor variable with SEM_TYPE_STRUCT
    • set the sptr field using the discovered shape

Note: name_ast->sem isn't actually used for anything but it is helpful for debugging. If the AST is printed it shows the original unmodified semantic type which can be helpful.

Briefly sem_find_likeable_ast does these steps:

  • if the right of the LIKE refers to procedure arguments (e.g. C LIKE Foo ARGUMENTS), get the args of the named procedure and use them as a shape
  • if the right is a local or global, and its a cursor, use the shape of that cursor for the new cursor
  • if the right is the name of an argument bundle, use the shape of the bundle
    • e.g. in CREATE PROC Foo(p1 like Person, p2 like Person) p1 and p2 are the names of argument bundles shaped like Person
  • if the right is the name of a table or view, use that shape
  • if the right is the name of a procedure with a structure result, use that shape
  • if it's none of these, produce an error

This is the primary source of shape reuse. Let's look at how we might use that.

Suppose we want to write a procedure that inserts a row into the table Foo, we could certainly list the columns of Foo as arguments like this:

CREATE PROC InsertIntoFoo(id integer, t text, r real, b blob)
BEGIN
INSERT INTO Foo(id, t, r, b) VALUES(id, t, r, b);
END;

But that approach is going to get a lot less exciting when there are lots of columns and it will be increasingly a maintenance headache.

Compare that with the following:

CREATE PROC InsertIntoFoo(row LIKE Foo)
BEGIN
INSERT INTO Foo FROM row;
END;

Those two versions of InsertIntoFoo compile into the same code. The semantic analyzer expands the (row LIKE Foo) into (row_id integer, row_t text, row_r real, row_b blob) and then replaces FROM row with (row_id, row_t, row_r, row_b). In both case it simply looked up the shape using sem_find_likeable_ast and then altered the AST to the canonical pattern. This kind of "shape sugar" is all over CQL and greatly increases maintainability while eliminating common errors. The most common operation is simply to expland a "shape" into a list of arguments or columns (maybe with or without type names). SQLite doesn't know any of this shape magic so by the time SQLite sees the code it has to look "normal" -- the shapes are all resolved.

Join Types

The last of the type building data structure is the join type. Recall that we have this shape:

// for the data type of (parts of) the FROM clause
// sometimes I refer to as a "joinscope"
typedef struct sem_join {
uint32_t count; // count of table/views in the join
CSTR *names; // names of the table/view
struct sem_struct **tables; // struct type of each table/view
} sem_join;

This is an array of named structure types, which is exactly what you get when you do something like this:

select * from T1 INNER JOIN T2;

The result has all of the columns of T1 and all of the columns of T2. They can be referred to with scoped names like T1.x which means "find the sptr corresponding to the name T1 then within that structure find the column named x". In general, when we join, we take a jptr on the left and concatenate it with a jptr on the right. For all this to work we have to start somewhere, usually single tables.

As we saw when we make a table we use sem_join_from_sem_struct to make its initial jptr. Let's have a look at that now:

// Create a base join type from a single struct.
static sem_join *sem_join_from_sem_struct(sem_struct *sptr) {
sem_join *jptr = new_sem_join(1);
jptr->names[0] = sptr->struct_name;
jptr->tables[0] = new_sem_struct_strip_table_flags(sptr);
return jptr;
}

It doesn't get much simpler than the above, here are the steps briefly:

  • new_sem_join : gives us an empty sem_join with room for 1 table
  • we use the struct name for the name and the table's sptr for the shape
  • new_sem_struct_strip_table_flags : copies the table's sptr keeping only the essential flags
    • SEM_TYPE_HIDDEN_COL
    • SEM_FLAG_NOTNULL
    • SEM_FLAG_SENSITIVE

The other flags (e.g. SEM_TYPE_PK) have no value in doing type checking and were only needed to help validate the table itself. Those extra flags would be harmless but they would also contaminate all of the debug output, so they are stripped. As a result the type of columns as they appear in say SELECT statements is simpler than how they appear in a CREATE TABLE statement.

When we need to create a new join type we simply (*) make a new sem_join that is the concatenation of the left and right sides of the join operation.

  • some join types change the nullability of columns like LEFT JOIN, so we have to handle that too
  • the names of the tables in the new concatenated joinscope have to be unambiguous so there is also error checking to do
  • but basically it's just a concat...

Importantly, we call the thing a "joinscope" because it creates a namespace. When we are evaluating names inside of the FROM clause or even later in say a WHERE clause, the joinscope that we have created so far controls the table.column combinations you can use in expressions. This changes again when there is a subquery, so the joinscopes can be pushed and popped as needed.

By way of example, you'll see these two patterns in the code:

PUSH_JOIN(from_scope, select_from_etc->sem->jptr);
error = sem_select_orderby(select_orderby);
POP_JOIN();
  • PUSH_JOIN : use the jptr from the FROM clause to put things back in scope for the ORDER BY clause
PUSH_JOIN_BLOCK();
sem_numeric_expr(ast->left, ast, "LIMIT", SEM_EXPR_CONTEXT_LIMIT);
POP_JOIN();
  • PUSH_JOIN_BLOCK : causes the name search to stop, nothing deeper in the stack is searched
  • in this case we do not allow LIMIT expressions to see any joinscopes, they may not use any columns
    • even if the LIMIT clause is appearing in a subquery it can't refer to columns in the parent query.

Schema Regions

We touched briefly on schema regions earlier in this section. The purpose and language for regions is described more fully in Chapter 10 of the Guide. In this section we'll deal with how they are implemented and what you should expect to see in the code.

When a region declaration is found this method is used:

// A schema region is an partitioning of the schema such that it
// only uses objects in the same partition or one of its declared
// dependencies. One schema region may be upgraded independently
// from any others (assuming they happen such that dependents are done first).
// Here we validate:
// * the region name is unique
// * the dependencies (if any) are unique and exist
static void sem_declare_schema_region_stmt(ast_node *ast) { ... }

The general rules are described in the comment, but effectively it accumulates the list of the declared region's dependencies. Sometimes these are called the antecedent regions. Since a region can only depend on regions that have already been declared, it's not possible to make any cycles. Regions are declared before you put anything into them.

Pieces of schema or procedures (or anything really) can go into a region by putting that code inside a begin/end pair for the named region. Like so:

@begin_schema_region your_region;
-- your stuff
@end_schema_region;

Now whatever happens to be in "your stuff" is:

  • limited to seeing only the things that your_region is allowed to see, and
  • contributes its contents to your_region thereby limiting how others will be able to use "your stuff"

To see how this happens, let's have a look at sem_begin_schema_region_stmt.

// Entering a schema region makes all the objects that follow part of that
// region. It also means that all the contained objects must refer to
// only pieces of schema that are in the same region or a dependent region.
// Here we validate that region we are entering is in fact a valid region
// and that there isn't already a schema region.
static void sem_begin_schema_region_stmt(ast_node * ast) {
Contract(is_ast_begin_schema_region_stmt(ast));
EXTRACT_STRING(name, ast->left);
// @BEGIN_SCHEMA_REGION name
if (!verify_schema_region_out_of_proc(ast)) {
record_error(ast);
return;
}
if (current_region) {
report_error(ast, "CQL0246: schema regions do not nest; end the current region before starting a new one", NULL);
record_error(ast);
return;
}
ast_node *region = find_region(name);
if (!region) {
report_error(ast->left, "CQL0244: unknown schema region", name);
record_error(ast);
return;
}
// Get the canonical name of the region (case adjusted)
Contract(is_region(region));
EXTRACT_STRING(region_name, region->left);
// we already know we are not in a region
Invariant(!current_region_image);
current_region_image = symtab_new();
sem_accumulate_public_region_image(current_region_image, region_name);
// this is the one and only text pointer value for this region
current_region = region_name;
record_ok(ast);
}

We see these basic steps:

  • EXTRACT : gets the region name
  • verify_schema_region_out_of_proc : makes sure we are out of any procedure (we have to be at the top level)
    • errors if in a procedure
  • current_region : is tested to make sure we are not already in a region (no nesting)
    • errors if already in a region
  • find_region : is used to find the region AST by name
    • errors if the region name isn't valid
  • EXTRACT : is used again to get the canonical name of the region
    • you could write @begin_schema_region YoUr_ReGION; but we want the canonical name your_region, as it was declared
  • symtab_new : creates a new symbol table current_region_image
  • sem_accumulate_public_region_image : populates current_region_image by recursively walking this region adding the names of all the regions we find along the way
    • note the regions form a DAG so we might find the same name twice, we can stop if we find a region that is already in the image symbol table
  • current_region : set it to the now new current region

Now we're all set up.

  • We can use current_region to set the region in the sem_node of anything we encounter
  • We can use current_region_image to quickly see if we are allowed to use any given region
    • if it's in the symbol table we can use it

Recall that at the end of sem_create_table_stmt we do this:

ast->sem = new_sem(SEM_TYPE_STRUCT);
ast->sem->sptr = sptr;
ast->sem->jptr = sem_join_from_sem_struct(sptr);
ast->sem->region = current_region;

That should make a lot more sense now.

When doing the symmetric check in sem_validate_object_ast_in_current_region we see this pattern:

// Validate whether or not an object is usable with a schema region. The object
// can only be a table, view, trigger or index.
static bool_t sem_validate_object_ast_in_current_region(
CSTR name,
ast_node *table_ast,
ast_node *err_target,
CSTR msg)
{
// We're in a non-region therefore no validation needed because non-region stmt
// can reference schema in any region.
if (!current_region) {
return true;
}
if (table_ast->sem && table_ast->sem->region) {
// if we have a current region then the image is always computed!
Invariant(current_region_image);
if (!symtab_find(current_region_image, table_ast->sem->region)) {
// The target region is not accessible from this region
...
return false;
}
} else {
// while in schema region '%s', accessing an object that isn't in a region is invalid
...
return false;
}
return true;
}

I've elided some of the code here, but only the part that generates error messages. The essential logic is:

  • if we are not in a region we can access anything
  • if we're in a region then...
    • the thing we're trying to access must also be in a region, and
    • that region must be in current_region_image
    • otherwise, we can't access it

This is enough to do all the region validation we need.

Results of Semantic Analysis

Semantic Analysis leaves a lot of global state ready for the remaining stages to harvest. If the state is defined in sem.h then it's ok to harvest. Here we'll highlight some of the most important things you can use in later passes. These are heavily used in the code generators.

cql_data_decl( struct list_item *all_tables_list );
cql_data_decl( struct list_item *all_functions_list );
cql_data_decl( struct list_item *all_views_list );
cql_data_decl( struct list_item *all_indices_list );
cql_data_decl( struct list_item *all_triggers_list );
cql_data_decl( struct list_item *all_regions_list );
cql_data_decl( struct list_item *all_ad_hoc_list );
cql_data_decl( struct list_item *all_select_functions_list );
cql_data_decl( struct list_item *all_enums_list );

These linked lists are authoritiative, they let you easily enumerate all the objects of the specified type. For instance, if you wanted to do some validation of all indices you could simply walk all_indices_list.

cql_noexport ast_node *find_proc(CSTR name);
cql_noexport ast_node *find_region(CSTR name);
cql_noexport ast_node *find_func(CSTR name);
cql_noexport ast_node *find_table_or_view_even_deleted(CSTR name);
cql_noexport ast_node *find_enum(CSTR name);

These functions give you access to the core name tables (which are still valid!) so that you can look up procedures, functions, tables, etc. by name.

Finally, information about all the schema annotations is invaluable for building schema upgraders. These two buffers hold dense arrays of annotation records as shown below.

cql_data_decl( bytebuf *schema_annotations );
cql_data_decl( bytebuf *recreate_annotations );
typedef struct recreate_annotation {
CSTR target_name; // the name of the target
CSTR group_name; // group name or "" if no group (not null, safe to sort)
ast_node *target_ast; // top level target (table, view, or index)
ast_node *annotation_ast; // the actual annotation
int32_t ordinal; // when sorting we want to use the original order (reversed actually) within a group
} recreate_annotation;
typedef struct schema_annotation {
int32_t version; // the version number (always > 0)
ast_node *target_ast; // top level target (table, view, or index)
CSTR target_name; // the name of the target
uint32_t annotation_type; // one of the codes below for the type of annotation
ast_node *annotation_ast; // the actual annotation
int32_t column_ordinal; // -1 if not a column
ast_node *column_ast; // a particular column if column annotation
} schema_annotation;
// Note: schema annotations are processed in the indicated order: the numbers matter
#define SCHEMA_ANNOTATION_INVALID 0
#define SCHEMA_ANNOTATION_FIRST 1
#define SCHEMA_ANNOTATION_CREATE_TABLE 1
#define SCHEMA_ANNOTATION_CREATE_COLUMN 2
#define SCHEMA_ANNOTATION_DELETE_TRIGGER 3
#define SCHEMA_ANNOTATION_DELETE_VIEW 4
#define SCHEMA_ANNOTATION_DELETE_INDEX 5
#define SCHEMA_ANNOTATION_DELETE_COLUMN 6
#define SCHEMA_ANNOTATION_DELETE_TABLE 7
#define SCHEMA_ANNOTATION_AD_HOC 8
#define SCHEMA_ANNOTATION_LAST 8

And of course, each "back end" is provided with the root of the AST so that it can also search and/or walk the AST in its own manner. We will see examples of this in later sections.

Recap

At present, there are nearly 20000 lines in sem.c and it would no doubt take more than 20000 lines of text to explain what they all do, and that would be more imprecise than the source code, and probably less readable. sem.c includes over 4000 lines of comments, and probably should have more. While there is a lot of code there it's very readable and I encourage you to do so to get your answers.

The point of this part of the Internals Guide isn't to fully explain all 400+ error checks in about as many semantic error checking functions, it's to showcase the key concepts shared by all of them. Things like:

  • errors are reported largely in the AST and percolate up
  • expressions and statements have general purpose dispatch logic for continuing a statement walk
  • EXTRACT macros are used to keep the tree walk on track and correct in the face of changes
  • regions are used for visibility
  • versioning contributes to visibility
  • nullability and sensitivity are tracked throughout using type bits
  • type "kind" is managed by a simple string in the sem_node payload
  • the three main payloads are
    • sem_node for basic info, and
    • sem_struct or sem_join for the non-unitary types

This isn't everything but it should leave you well armed to begin your own exploration of sem.c.

Last updated on by Rico Mariani