Part 2: Semantic Analysis
Preface
Part 2 continues with a discussion of the essentials of the semantic analysis pass of the CQL compiler. As in the previous sections, the goal here is not to go over every single rule but rather to give a sense of how semantic analysis happens in general -- the core strategies and implementation choices -- so that when reading the code you will have an idea of how smaller pieces fit into the whole. To accomplish this, various key data structures will be explained in detail and selected examples of their use are included.
Semantic Analysis
The overall goal of the semantic analysis pass is to verify that a correct program has been submitted to the compiler. The compiler does this by "decorating" the AST with semantic information. This information is mainly concerned with the "types" of the various things in the program. A key function of the semantic analyzer, the primary "weapon" in computing these types, if you will, is name resolution. The semantic analyzer decides what any given name means in any context and then uses that meaning, which is itself based on the AST constructs that came before, to compute types and then check those types for errors.
Broadly speaking, the errors that can be discovered are of these forms:
- mentioned names do not exist
  - e.g. using a variable or table or column without declaring it
- mentioned names are not unique, or are ambiguous
  - e.g. every view must have a unique name
  - e.g. table names need to be unique, or aliased when joining tables
- operands are not compatible with each other or with the intended operation
  - e.g. you can't add a string to a real
  - e.g. you can't do the % operation on a real
  - e.g. the expression in a WHERE clause must result in a numeric
  - e.g. the first argument to printf must be a string literal
  - e.g. you can't assign a long value to an integer variable
  - e.g. you can't assign a possibly null result to a not-null variable
- there are too many or too few operands for an operation
  - e.g. an INSERT statement must include sufficiently many columns and no extras
  - e.g. a function or procedure call must have the correct number of operands
- an operation is happening in a context where it is not allowed
  - e.g. use of aggregate functions in the WHERE clause
  - e.g. use of unique SQLite functions outside of a SQL statement
There are several hundred possible errors, and no attempt will be made to cover them all here, but we will talk about how errors are created, recorded, and reported.
Decorated AST examples
Recalling the AST output from Part 1, this is what that same tree looks like with semantic information attached:
LET X := 1 + 3;
{let_stmt}: X: integer notnull variable
| {name X}: X: integer notnull variable
| {add}: integer notnull
| {int 1}: integer notnull
| {int 3}: integer notnull
And here's an example with some structure types:
SELECT 1 AS x, 3.2 AS y;
{select_stmt}: select: { x: integer notnull, y: real notnull }
| {select_core_list}: select: { x: integer notnull, y: real notnull }
| | {select_core}: select: { x: integer notnull, y: real notnull }
| | {select_expr_list_con}: select: { x: integer notnull, y: real notnull }
| | {select_expr_list}: select: { x: integer notnull, y: real notnull }
| | | {select_expr}: x: integer notnull
| | | | {int 1}: integer notnull
| | | | {opt_as_alias}
| | | | {name x}
| | | {select_expr_list}
| | | {select_expr}: y: real notnull
| | | {dbl 3.2}: real notnull
| | | {opt_as_alias}
| | | {name y}
| | {select_from_etc}: ok
| | {select_where}
| | {select_groupby}
| | {select_having}
| {select_orderby}
| {select_limit}
| {select_offset}
These can be generated by adding --sem --ast to the CQL command line along with --in your_file.sql.
Keep these shapes in mind as we discuss the various sources of type information.
The Base Data Structures
First recall that every AST node has this field in it:
struct sem_node *_Nullable sem;
This is the pointer to the semantic information for that node. Semantic analysis happens immediately after parsing and before any of the code-generators run. Importantly, code generators never run if semantic analysis reported any errors. Before we get into the shape of the semantic node, we should start with the fundamental unit of type info, sem_t, which is usually stored in a variable called sem_type.
typedef uint64_t sem_t;
The low order bits of a sem_t
encode the core type and indeed there is a helper function
to extract the core type from a sem_t
.
// Strips out all the flag bits and gives you the base/core type.
cql_noexport sem_t core_type_of(sem_t sem_type) {
return sem_type & SEM_TYPE_CORE;
}
The core bits are as follows:
#define SEM_TYPE_NULL 0 // the subtree is a null literal (not just nullable)
#define SEM_TYPE_BOOL 1 // the subtree is a bool
#define SEM_TYPE_INTEGER 2 // the subtree is an integer
#define SEM_TYPE_LONG_INTEGER 3 // the subtree is a long integer
#define SEM_TYPE_REAL 4 // the subtree is a real
#define SEM_TYPE_TEXT 5 // the subtree is a text type
#define SEM_TYPE_BLOB 6 // the subtree is a blob type
#define SEM_TYPE_OBJECT 7 // the subtree is any object type
#define SEM_TYPE_STRUCT 8 // the subtree is a table/view
#define SEM_TYPE_JOIN 9 // the subtree is a join
#define SEM_TYPE_ERROR 10 // marks the subtree as having a problem
#define SEM_TYPE_OK 11 // sentinel for ok but no type info
#define SEM_TYPE_PENDING 12 // sentinel for type calculation in flight
#define SEM_TYPE_REGION 13 // the ast is a schema region
#define SEM_TYPE_CURSOR_FORMAL 14 // this is used for the cursor parameter type uniquely
#define SEM_TYPE_CORE 0xff // bit mask for the core types
#define SEM_TYPE_MAX_UNITARY (SEM_TYPE_OBJECT+1) // the last unitary type
These break into a few categories:
- NULL to OBJECT are the "unitary" types -- these are the types that a single simple variable can be
  - a column can be any of these except OBJECT or NULL
  - the NULL type comes only from the NULL literal which has no type
  - instances of, say, a TEXT column might have a NULL value but they are known to be TEXT
- STRUCT indicates that the object has many fields, like a table, or a cursor
- JOIN indicates that the object is the concatenation of many STRUCT types
  - e.g. T1 inner join T2 is a JOIN type with T1 and T2 being the parts
  - a JOIN could be flattened to STRUCT, but this is typically not done
  - the type of a SELECT statement will be a STRUCT representing the expressions that were selected
  - those expressions in turn used columns from the JOIN that was the FROM clause
- ERROR indicates that the subtree had an error
  - the error will have been already reported
  - the error type generally cascades up the AST to the root
- OK indicates that there is no type information but there was no problem
  - e.g. a correct IF statement will resolve to simply OK (no error)
- PENDING is used sometimes while a type computation is in progress
  - this type doesn't appear in the AST, but has its own unique value so as to not conflict with any others
- REGION is used to identify AST fragments that correspond to schema regions
  - see Chapter 10 of the Guide for more information on regions
- CORE is the mask for the core parts, 0xf would do the job but for easy reading in the debugger we use 0xff
  - new core types are not added very often, adding a new one is usually a sign that you are doing something wrong
The core type can be modified by various flags. The flags, in principle, can be combined in any way but in practice many combinations make no sense.
For instance, HAS_DEFAULT is for table columns and CREATE_FUNC is for function declarations. There is no one object that could require both of these.
The full list as of this writing is as follows:
#define SEM_TYPE_NOTNULL _64(0x0100) // set if and only if null is not possible
#define SEM_TYPE_HAS_DEFAULT _64(0x0200) // set for table columns with a default
#define SEM_TYPE_AUTOINCREMENT _64(0x0400) // set for table columns with autoinc
#define SEM_TYPE_VARIABLE _64(0x0800) // set for variables and parameters
#define SEM_TYPE_IN_PARAMETER _64(0x1000) // set for in parameters (can mix with below)
#define SEM_TYPE_OUT_PARAMETER _64(0x2000) // set for out parameters (can mix with above)
#define SEM_TYPE_DML_PROC _64(0x4000) // set for stored procs that have DML/DDL
#define SEM_TYPE_HAS_SHAPE_STORAGE _64(0x8000) // set for a cursor with simplified fetch syntax
#define SEM_TYPE_CREATE_FUNC _64(0x10000) // set for a function that returns a created object +1 ref
#define SEM_TYPE_SELECT_FUNC _64(0x20000) // set for a sqlite UDF function declaration
#define SEM_TYPE_DELETED _64(0x40000) // set for columns that are not visible in the current schema version
#define SEM_TYPE_VALIDATED _64(0x80000) // set if item has already been validated against previous schema
#define SEM_TYPE_USES_OUT _64(0x100000) // set if proc has a one row result using the OUT statement
#define SEM_TYPE_USES_OUT_UNION _64(0x200000) // set if proc uses the OUT UNION form for multi row result
#define SEM_TYPE_PK _64(0x400000) // set if column is a primary key
#define SEM_TYPE_FK _64(0x800000) // set if column is a foreign key
#define SEM_TYPE_UK _64(0x1000000) // set if column is a unique key
#define SEM_TYPE_VALUE_CURSOR _64(0x2000000) // set only if SEM_TYPE_HAS_SHAPE_STORAGE is set and the cursor has no statement
#define SEM_TYPE_SENSITIVE _64(0x4000000) // set if the object is privacy sensitive
#define SEM_TYPE_DEPLOYABLE _64(0x8000000) // set if the object is a deployable region
#define SEM_TYPE_BOXED _64(0x10000000) // set if a cursor's lifetime is managed by a box object
#define SEM_TYPE_HAS_CHECK _64(0x20000000) // set for table column with a "check" clause
#define SEM_TYPE_HAS_COLLATE _64(0x40000000) // set for table column with a "collate" clause
#define SEM_TYPE_INFERRED_NOTNULL _64(0x80000000) // set if inferred to be not null (but was originally nullable)
#define SEM_TYPE_VIRTUAL _64(0x100000000) // set if and only if this is a virtual table
#define SEM_TYPE_HIDDEN_COL _64(0x200000000) // set if and only if hidden column on a virtual table
#define SEM_TYPE_TVF _64(0x400000000) // set if and only if table node is a table valued function
#define SEM_TYPE_IMPLICIT _64(0x800000000) // set if and only if the variable was declared implicitly (via declare out)
#define SEM_TYPE_CALLS_OUT_UNION _64(0x1000000000) // set if proc calls an out union proc for
Note: _64(x) expands to either a trailing L or a trailing LL depending on the bitness of the compiler, whichever yields an int64_t.
Going over the meaning of all of the above is again beyond the scope of this document; some of the flags are very specialized and essentially the validation just requires a bit of storage in the tree to do its job, so that storage is provided with a flag. However, two flag bits are especially important and are computed almost everywhere sem_t is used. These are SEM_TYPE_NOTNULL and SEM_TYPE_SENSITIVE.
- SEM_TYPE_NOTNULL indicates that the marked item is known to be NOT NULL, probably because it was declared as such, or directly derived from a not null item
  - typically when two operands are combined both must be marked NOT NULL for the result to still be NOT NULL (there are exceptions like COALESCE)
  - values that might be null cannot be assigned to targets that must not be null
- SEM_TYPE_SENSITIVE indicates that the marked item is some kind of PII or other sensitive data
  - any time a sensitive operand is combined with another operand the resulting type is sensitive
  - there are very few ways to "get rid" of the sensitive bit -- it corresponds to the presence of @sensitive in the data type declaration
  - values that are sensitive cannot be assigned to targets that are not marked sensitive
The semantic node sem_node carries all the possible semantic info we might need, and the sem_type holds the flags above and tells us how to interpret the rest of the node. There are many fields -- we'll talk about some of the most important ones here to give you a sense of how things hang together. Note that CSTR is simply an alias for const char *. CSTR is used extensively in the codebase for brevity.
typedef struct sem_node {
sem_t sem_type; // core type plus flags
CSTR name; // for named expressions in select columns, etc.
CSTR kind; // the Foo in object<Foo>, not a variable or column name
CSTR error; // error text for test output, not used otherwise
struct sem_struct *sptr; // encoded struct if any
struct sem_join *jptr; // encoded join if any
int32_t create_version; // create version if any (really only for tables and columns)
int32_t delete_version; // delete version if any (really only for tables and columns)
bool_t recreate; // for tables only, true if marked @recreate
CSTR recreate_group_name; // for tables only, the name of the recreate group if they are in one
CSTR region; // the schema region, if applicable; null means unscoped (default)
symtab *used_symbols; // for select statements, we need to know which of the ids in the select list was used, if any
list_item *index_list; // for tables we need the list of indices that use this table (so we can recreate them together if needed)
struct eval_node *value; // for enum values we have to store the evaluated constant value of each member of the enum
} sem_node;
- sem_type: already discussed above, this tells you how to interpret everything else
- name: variables, columns, etc. have a canonical name -- when a name resolves case-insensitively, the canonical name is stored here
  - typically later passes emit the canonical variable name everywhere
  - e.g. because FoO and fOO might both resolve to an object declared as foo, we always emit foo in codegen
- kind: in CQL any type can be discriminated, as in declare foo real<meters>; the kind here is meters
  - two expressions of the same core type (e.g. real) are incompatible if they have a kind and the kind does not match
  - e.g. if you have bar real<liters> then set foo := bar; is an error even though both are real, because foo above is real<meters>
- sptr: if the item's core type is SEM_TYPE_STRUCT then this is populated (see below)
- jptr: if the item's core type is SEM_TYPE_JOIN then this is populated (see below)
If the object is a structure type then this is simply an array of names, kinds, and semantic types. In fact the semantic types will all be unitary, possibly modified by NOT_NULL or SENSITIVE, but none of the other flags apply. A single sptr directly corresponds to the notion of a "shape" in the analyzer. Shapes come from anything that looks like a table, such as a cursor, or the result of a SELECT statement.
// for tables and views and the result of a select
typedef struct sem_struct {
CSTR struct_name; // struct name
uint32_t count; // count of fields
CSTR *names; // field names
CSTR *kinds; // the "kind" text of each column, if any, e.g. integer<foo> foo is the kind
sem_t *semtypes; // typecode for each field
} sem_struct;
If the object is a join type (such as the parts of the FROM clause) then the jptr field will be populated. This is nothing more than a named list of struct types.
// for the data type of (parts of) the FROM clause
// sometimes I refer to it as a "joinscope"
typedef struct sem_join {
uint32_t count; // count of table/views in the join
CSTR *names; // names of the table/view
struct sem_struct **tables; // struct type of each table/view
} sem_join;
With these building blocks we can represent the type of anything in the CQL language.
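To make this concrete, here is a sketch (not code from the compiler) of how the shape of the earlier SELECT 1 AS x, 3.2 AS y example could be built by hand with these structures, using calloc from <stdlib.h> for brevity:

// a hand-built shape for: SELECT 1 AS x, 3.2 AS y;
// the compiler uses its own allocators; calloc is used here just for illustration
static sem_struct *example_select_shape(void) {
  sem_struct *sptr = calloc(1, sizeof(sem_struct));
  sptr->struct_name = "select";
  sptr->count = 2;
  sptr->names = calloc(2, sizeof(CSTR));
  sptr->kinds = calloc(2, sizeof(CSTR));   // no kinds here, entries stay NULL
  sptr->semtypes = calloc(2, sizeof(sem_t));
  sptr->names[0] = "x";
  sptr->semtypes[0] = SEM_TYPE_INTEGER | SEM_TYPE_NOTNULL;
  sptr->names[1] = "y";
  sptr->semtypes[1] = SEM_TYPE_REAL | SEM_TYPE_NOTNULL;
  return sptr;
}

This is exactly the shape printed in the decorated AST at the top of this part: select: { x: integer notnull, y: real notnull }.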
Initiating Semantic Analysis
The semantic analysis pass runs much the same way as the AST emitter. In sem.c there is the essential function sem_main. It suffices to call sem_main on the root of the AST. That root node is expected to be a stmt_list node.
// This method loads up the global symbol tables in either empty state or
// with the appropriate tokens ready to go. Using our own symbol tables for
// dispatch saves us a lot of if/else string comparison verbosity.
cql_noexport void sem_main(ast_node *ast) {
// restore all globals and statics we own
sem_cleanup();
eval_init();
...
}
As you can see, sem_main begins by resetting all the global state. You can of course do this yourself after calling sem_main (when you're done with the results). sem_main sets a variety of useful and public global variables that describe the results of the analysis. The ones in sem.h are part of the contract and you should feel free to use them in a downstream code-generator. Other items are internal and should be avoided. The internal items are typically defined statically in sem.c. The essential outputs will be described in the last section of this part.
The cleanup has this structure:
// This method frees all the global state of the semantic analyzer
cql_noexport void sem_cleanup() {
eval_cleanup();
BYTEBUF_CLEANUP(deployable_validations);
BYTEBUF_CLEANUP(recreate_annotations);
BYTEBUF_CLEANUP(schema_annotations);
SYMTAB_CLEANUP(funcs);
SYMTAB_CLEANUP(globals);
SYMTAB_CLEANUP(indices);
SYMTAB_CLEANUP(locals);
...
// these are getting zeroed so that leaksanitizer will not count those objects as reachable from a global root.
all_ad_hoc_list = NULL;
all_functions_list = NULL;
...
This basically deallocates everything and resets all the globals to NULL.
sem_main of course has to walk the AST and it does so in much the same way as we saw in gen_sql.c. There is a series of symbol tables whose key is an AST type and whose value is a function plus arguments to dispatch (effectively a lambda.) The semantic analyzer doesn't have to think about things like "should I emit parentheses?" so the signature of each type of lambda can be quite a bit simpler. We'll go over each kind with some examples.
First we have the non-SQL statements: these are basic flow control or other things that SQLite will never see directly.
symtab *syms = non_sql_stmts;
STMT_INIT(if_stmt);
STMT_INIT(while_stmt);
STMT_INIT(switch_stmt);
STMT_INIT(leave_stmt);
...
Here STMT_INIT creates a binding between (e.g.) the AST type if_stmt and the function sem_if_stmt. This lets us dispatch any part of the AST to its handler directly.
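The macro itself is tiny; a plausible shape is the sketch below (the real definition lives in sem.c and may differ in detail, e.g. in how the name string is formed):

// a plausible expansion: bind the ast type name to its sem_ handler
// in the currently selected symbol table
#define STMT_INIT(x) symtab_add(syms, #x, (void *)sem_ ## x)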
Next we have the SQL statements. These get analyzed in the same way as the others, and with functions that have the same signature; however, if you use one of these it means that the procedure containing the statement must get a database connection in order to run. Use of the database will require the procedure's signature to change; this is recorded by setting the SEM_TYPE_DML_PROC flag bit on the procedure's AST node.
syms = sql_stmts;
STMT_INIT(create_table_stmt);
STMT_INIT(drop_table_stmt);
STMT_INIT(create_index_stmt);
STMT_INIT(create_view_stmt);
STMT_INIT(select_stmt);
STMT_INIT(delete_stmt);
STMT_INIT(update_stmt);
STMT_INIT(insert_stmt);
...
Again STMT_INIT creates a binding between (e.g.) the AST type delete_stmt and the function sem_delete_stmt so we can dispatch to the handler.
Next we have expression types. These are set up with EXPR_INIT. Many of the operators require exactly the same kinds of verification, so in order to be able to share the code, the expression analysis functions get an extra argument for the operator in question. Typically the string of the operator is only needed to make a good quality error message, with validation being otherwise identical. Here are some samples...
EXPR_INIT(num, sem_expr_num, "NUM");
EXPR_INIT(str, sem_expr_str, "STR");
EXPR_INIT(blob, sem_expr_blob, "BLB");
EXPR_INIT(null, sem_expr_null, "NULL");
EXPR_INIT(dot, sem_expr_dot, "DOT");
EXPR_INIT(const, sem_expr_const, "CONST");
EXPR_INIT(mul, sem_binary_math, "*");
EXPR_INIT(mod, sem_binary_integer_math, "%");
EXPR_INIT(not, sem_unary_logical, "NOT");
EXPR_INIT(is_true, sem_unary_is_true_or_false, "IS TRUE");
EXPR_INIT(tilde, sem_unary_integer_math, "~");
EXPR_INIT(uminus, sem_unary_math, "-");
Looking at the very first entry as an example, we see that EXPR_INIT creates a mapping between the AST type num and the analysis function sem_expr_num, and that function will get the text "NUM" as an extra argument. As it happens sem_expr_num doesn't need the extra argument, but sem_binary_math certainly needs the "*" as that function handles a large number of binary operators.
Let's quickly go over this list as these are the most important analyzers:
- sem_expr_num: analyzes any numeric constant
- sem_expr_str: analyzes any string literal or identifier
- sem_expr_blob: analyzes any blob literal
- sem_expr_null: analyzes the NULL literal (and nothing else)
- sem_expr_dot: analyzes a compound name like T1.id
- sem_expr_const: analyzes a const(...) expression, doing the constant evaluation
- sem_binary_math: analyzes any normal binary math operator like '+', '-', '/' etc.
- sem_binary_integer_math: analyzes any binary math operator where the operands must be integers like '%' or '|'
- sem_unary_logical: analyzes any unary logical operator (the result is a bool) -- this is really only NOT
- sem_unary_is_true_or_false: analyzes any of the IS TRUE, IS FALSE family of postfix unary operators
- sem_unary_integer_math: analyzes any unary operator where the operand must be an integer -- this is really only ~
- sem_unary_math: analyzes any math unary operator, presently only negation (but in the future unary + too)
The last group of normal associations are for builtin functions, like these:
FUNC_INIT(changes);
FUNC_INIT(printf);
FUNC_INIT(strftime);
FUNC_INIT(date);
FUNC_INIT(time);
Each of these is dispatched when a function call is found in the tree. By way of example, FUNC_INIT(changes) causes the changes function to map to sem_func_changes for validation.
There are a few other similar macros for more exotic cases but the general pattern should be clear now. With these in place it's very easy to traverse arbitrary statement lists and arbitrary expressions with sub-expressions and have the correct function invoked without having large switch blocks all over.
Semantic Errors
Some of the following examples will show the handling of semantic errors more precisely but the theory is pretty simple. Each of the analyzers that has been registered is responsible for putting an appropriate sem_node into the AST it is invoked on. The caller will look to see if that sem_node is of type SEM_TYPE_ERROR using is_error(ast). If it is, the caller will mark its own AST as errant using record_error(ast) and this continues all the way up the tree. The net of this is that wherever you begin semantic analysis, you can know if there were any problems by checking for an error at the top of the tree you provided.
At the point of the initial error, the analyzer is expected to also call report_error, providing a suitable message. This will be logged to stderr. In test mode it is also stored in the AST so that verification steps can confirm that errors were reported at exactly the right place.
If there are no errors, then a suitable sem_node is created for the resulting type or else, at minimum, record_ok(ast) is used to place the shared "OK" type on the node. The "OK" type indicates no type information, but no errors either. "OK" is helpful for statements that don't involve expressions, like DROP TABLE Foo.
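This propagation discipline shows up as the same few lines in nearly every analyzer; distilled, the pattern is (a sketch, not any one specific function):

// the canonical error propagation shape inside an analyzer (distilled sketch)
sem_expr(ast->left);          // analyze a child node
if (is_error(ast->left)) {    // did the child have a problem?
  record_error(ast);          // then this node is in error too
  return;                     // no further work; the error cascades upward
}
record_ok(ast);               // otherwise mark this node as ok (or set a real type)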
The Primitive Types
Perhaps the simplest analysis of all happens at the leaves of the AST. By way of example, here is the code for expression nodes of type num, the numeric literals.
// Expression type for numeric primitives
static void sem_expr_num(ast_node *ast, CSTR cstr) {
Contract(is_ast_num(ast));
EXTRACT_NUM_TYPE(num_type, ast);
switch (num_type) {
case NUM_BOOL:
ast->sem = new_sem(SEM_TYPE_BOOL | SEM_TYPE_NOTNULL);
break;
case NUM_INT:
ast->sem = new_sem(SEM_TYPE_INTEGER | SEM_TYPE_NOTNULL);
break;
case NUM_LONG:
ast->sem = new_sem(SEM_TYPE_LONG_INTEGER | SEM_TYPE_NOTNULL);
break;
default:
// this is all that's left
Contract(num_type == NUM_REAL);
ast->sem = new_sem(SEM_TYPE_REAL | SEM_TYPE_NOTNULL);
break;
}
}
As you can see the code simply looks at the AST node, confirming first that it is a num node. Then it extracts the num_type. Then ast->sem is set to a semantic node of the matching type, adding in SEM_TYPE_NOTNULL because literals are never null. The new_sem function is used to make an empty sem_node with the sem_type filled in as specified. Nothing can go wrong creating a literal so there are no failure modes. It doesn't get much simpler unless maybe...
// Expression type for constant NULL
static void sem_expr_null(ast_node *ast, CSTR cstr) {
Contract(is_ast_null(ast));
// null literal
ast->sem = new_sem(SEM_TYPE_NULL);
}
It's hard to get simpler than doing semantic analysis of the NULL literal. Its code should be clear with no further explanation needed.
Unary Operators
Let's dive into a simple case that does require some analysis -- the unary operators. There are comparatively few and there isn't much code required to handle them all.
Here's the code for the unary math operators:
// The only unary math operators are '-' and '~'
// Reference types are not allowed
static void sem_unary_math(ast_node *ast, CSTR op) {
sem_t core_type, combined_flags;
if (!sem_unary_prep(ast, &core_type, &combined_flags)) {
return;
}
if (!sem_validate_numeric(ast, core_type, op)) {
return;
}
// The result of unary math promotes to integer. Basically this converts
// bool to integer. Long integer and Real stay as they are. Text is
// already ruled out.
sem_t sem_type_result = sem_combine_types(
(SEM_TYPE_INTEGER | SEM_TYPE_NOTNULL),
(core_type | combined_flags));
ast->sem = new_sem(sem_type_result);
ast->sem->kind = ast->left->sem->kind;
// note ast->sem->name is NOT propagated because SQLite doesn't let you refer to
// the column 'x' in 'select -x' -- the column name is actually '-x' which is useless
// so we have no name once you apply unary math (unless you use 'as')
// hence ast->sem->name = ast->left->sem->name is WRONG here and it is not missing on accident
}
Unary Prep
OK, already we need to pause because there is a "prep" pattern here common to most of the shared operators that we should discuss. The prep step takes care of most of the normal error handling, which is the same for all the unary operators, and the same pattern happens in binary operators. Let's take a look at sem_unary_prep.
// The unary operators all have a similar prep to the binary. We need
// to visit the left side (it's always the left node even if the operator goes on the right)
// if that's ok then we need the combined_flags and core type. There is only
// the one. Returns true if everything is ok.
static bool_t sem_unary_prep(ast_node *ast, sem_t *core_type, sem_t *combined_flags) {
// op left | left op
sem_expr(ast->left);
if (is_error(ast->left)) {
*core_type = SEM_TYPE_ERROR;
*combined_flags = 0;
record_error(ast);
return false;
}
sem_node *sem = ast->left->sem;
sem_t sem_type = sem->sem_type;
*core_type = core_type_of(sem_type);
*combined_flags = not_nullable_flag(sem_type) | sensitive_flag(sem_type);
Invariant(is_unitary(*core_type));
return true;
}
Reviewing the steps:
- first we analyze the operand, it will be in ast->left
- if that's an error, we just return the error code from the prep steps
- now that it's not an error, we pull the core type out of the operand
- then we pull the not nullable and sensitive flag bits out of the operand
- finally return a boolean indicating the presence of an error (or not) for convenience
This is useful setup for all the unary operators, and as we'll see, the binary operators have a similar prep step.
Back to Unary Processing
Looking at the overall steps we see:
- sem_unary_prep: verifies that the operand is not an error, and gets its core type and flag bits
- sem_validate_numeric: verifies that the operand is a numeric type
  - recall these are the math unary operators, so the operand must be numeric
- sem_combine_types: creates the smallest type that holds two compatible types
  - by combining with "integer not null" we ensure that the resulting type is at least as big as an integer
  - if the argument is of type long or real then it will be the bigger type and the resulting type will be long or real as appropriate
  - in short, bool is promoted to int, everything else stays the same
  - sem_combine_types also combines the nullability and sensitivity appropriately
- a new sem_node of the combined type is created
  - the type "kind" of the operand is preserved (e.g. the meters in real<meters>)
  - any column alias or variable name is not preserved, the value is now anonymous
These primitives are designed to combine well; for instance, consider sem_unary_integer_math:
static void sem_unary_integer_math(ast_node *ast, CSTR op) {
sem_unary_math(ast, op);
sem_reject_real(ast, op);
}
The steps are:
- sem_unary_math: do the sequence we just discussed
- sem_reject_real: report/record an error if the result type is real, otherwise do nothing
Note that in all cases the op string simply gets pushed down to the place where the errors happen. Let's take a quick look at one of the sources of errors in the above. Here's the numeric validator:
static bool_t sem_validate_numeric(ast_node *ast, sem_t core_type, CSTR op) {
if (is_blob(core_type)) {
report_error(ast->left, "CQL0045: blob operand not allowed in", op);
record_error(ast);
return false;
}
if (is_object(core_type)) {
report_error(ast->left, "CQL0046: object operand not allowed in", op);
record_error(ast);
return false;
}
if (is_text(core_type)) {
report_error(ast->left, "CQL0047: string operand not allowed in", op);
record_error(ast);
return false;
}
return true;
}
That function is pretty much dumb as rocks. The non-numeric types are blob, object, and text. There is a custom error for each type (it could have been shared but specific error messages seem to help users.) This code doesn't know its context, but all it needs is op to tell it what the numeric-only operator was and it can produce a nice error message. It leaves an error in the AST using record_error. Its caller can then simply return if anything goes wrong.
It's not hard to guess how sem_reject_real works:
// Some math operators like << >> & | % only make sense on integers
// This function does the extra checking to ensure they do not get real values
// as arguments. It's a post-pass after the normal math checks.
static void sem_reject_real(ast_node *ast, CSTR op) {
if (!is_error(ast)) {
sem_t core_type = core_type_of(ast->sem->sem_type);
if (core_type == SEM_TYPE_REAL) {
report_error(ast, "CQL0001: operands must be an integer type, not real", op);
record_error(ast);
}
}
}
- if the AST node isn't already an error, and the node is of type "real", report an error
- it assumes the type is already known to be numeric
- the pre-check for errors is to avoid double reporting; if something has already gone wrong, the core type will be SEM_TYPE_ERROR
  - no new error recording is needed in that case, as obviously an error was already recorded
Binary Operators
Binary Prep
With the knowledge we have so far, this code pretty much speaks for itself, but we'll walk through it.
// All the binary ops do the same preparation -- they evaluate the left and the
// right expression, then they check those for errors. Then they need
// the types of those expressions and the combined_flags of the result. This
// does exactly that for its various callers. Returns true if all is well.
static bool_t sem_binary_prep(ast_node *ast, sem_t *core_type_left, sem_t *core_type_right, sem_t *combined_flags) {
EXTRACT_ANY_NOTNULL(left, ast->left);
EXTRACT_ANY_NOTNULL(right, ast->right);
// left op right
sem_expr(left);
sem_expr(right);
if (is_error(left) || is_error(right)) {
record_error(ast);
*core_type_left = SEM_TYPE_ERROR;
*core_type_right = SEM_TYPE_ERROR;
*combined_flags = 0;
return false;
}
*core_type_left = core_type_of(left->sem->sem_type);
*core_type_right = core_type_of(right->sem->sem_type);
*combined_flags = combine_flags(left->sem->sem_type, right->sem->sem_type);
Invariant(is_unitary(*core_type_left));
Invariant(is_unitary(*core_type_right));
return true;
}
- sem_expr: used to recursively walk the left and right nodes
- is_error: checks if either side had errors, and, if so, simply propagates the error
: checks if either side had errors, and, if so, simply propagates the error- extract the left and right core types
- combine nullability and sensitivity flags
And that's it! These are the standard prep steps for all binary operators. With this done, the caller has the core types of the left and right operands plus combined flags on a silver platter and one check is needed to detect if anything went wrong.
Example: Is or Is Not
This analyzer is the simplest of all the binaries:
// IS and IS NOT are special in that they return a not null boolean.
static void sem_binary_is_or_is_not(ast_node *ast, CSTR op) {
sem_t core_type_left, core_type_right, combined_flags;
if (!sem_binary_prep(ast, &core_type_left, &core_type_right, &combined_flags)) {
return;
}
if (!sem_verify_compat(ast, core_type_left, core_type_right, op)) {
return;
}
// the result of is or is not is always a bool and never null
ast->sem = new_sem(SEM_TYPE_BOOL | SEM_TYPE_NOTNULL | sensitive_flag(combined_flags));
}
- sem_binary_prep: checks for errors in the left or right
- sem_verify_compat: ensures that left and right operands are type compatible (discussed later)
- the result is always of type bool not null
If either step goes wrong the error will naturally propagate.
Example: Binary Math
This is the general worker for binary math operations, the most common operations like '+', '-', '*' and so forth.
// For all math operations, we combine the types and yield the type that
// holds both using the helper. If any text, that's an error.
static void sem_binary_math(ast_node *ast, CSTR op) {
sem_t core_type_left, core_type_right, combined_flags;
if (!sem_binary_prep(ast, &core_type_left, &core_type_right, &combined_flags)) {
return;
}
if (error_any_object(ast, core_type_left, core_type_right, op)) {
return;
}
if (error_any_blob_types(ast, core_type_left, core_type_right, op)) {
return;
}
if (error_any_text_types(ast, core_type_left, core_type_right, op)) {
return;
}
sem_t core_type = sem_combine_types(core_type_left, core_type_right);
CSTR kind = sem_combine_kinds(ast->right, ast->left->sem->kind);
if (is_error(ast->right)) {
record_error(ast);
return;
}
ast->sem = new_sem(core_type | combined_flags);
ast->sem->kind = kind;
}
Let's have a look at those steps:
- sem_binary_prep: checks for errors on the left or right
- error_any_object: reports an error if the left or right is of type object
- error_any_blob_types: reports an error if the left or right is of type blob
- error_any_text_types: reports an error if the left or right is of type text
- sem_combine_types: computes the combined type, the smallest numeric type that holds both left and right
  - note the operands are now known to be numeric
  - the three type error checkers give nice tight errors about the left or right operand
- sem_combine_kinds: tries to create a single type kind for both operands
  - if their kind is incompatible, records an error on the right
- new_sem: creates a sem_node with the combined type and flags, and then the kind is set
At this point it might help to look at a few more of the base validators -- they are rather unremarkable.
Example Validator: error_any_object
// If either of the types is an object, then produce an error on the ast.
static bool_t error_any_object(ast_node *ast, sem_t core_type_left, sem_t core_type_right, CSTR op) {
if (is_object(core_type_left)) {
report_error(ast->left, "CQL0002: left operand cannot be an object in", op);
record_error(ast);
return true;
}
if (is_object(core_type_right)) {
report_error(ast->right, "CQL0003: right operand cannot be an object in", op);
record_error(ast);
return true;
}
return false;
}
- is_object: checks a sem_type against SEM_TYPE_OBJECT
- if the left or right child is an object, an appropriate error is generated
- there is no strong convention for returning true if ok, or true if error; it's pretty ad hoc
  - this doesn't seem to cause a lot of problems
Example Validator: sem_combine_kinds
// Here we check that type<Foo> only combines with type<Foo> or type.
// If there is a current object type, then the next item must match
// If there is no such type, then an object type that arrives becomes the required type
// if they ever don't match, record an error
static CSTR sem_combine_kinds_general(ast_node *ast, CSTR kleft, CSTR kright) {
if (kright) {
if (kleft) {
if (strcmp(kleft, kright)) {
CSTR errmsg = dup_printf("CQL0070: expressions of different kinds can't be mixed: '%s' vs. '%s'", kright, kleft);
report_error(ast, errmsg, NULL);
record_error(ast);
}
}
return kright;
}
return kleft;
}
// helper to crack the ast nodes first and then call the normal comparisons
static CSTR sem_combine_kinds(ast_node *ast, CSTR kright) {
CSTR kleft = ast->sem->kind;
return sem_combine_kinds_general(ast, kleft, kright);
}
- sem_combine_kinds: uses the worker sem_combine_kinds_general after extracting the kind from the left node
  - usually you already have one kind and you want to know if another kind is compatible, hence this helper
- sem_combine_kinds_general: applies the general rules for "kind" strings:
  - NULL + NULL => NULL
  - NULL + x => x
  - x + NULL => x
  - x + x => x
  - x + y => error (if x != y)
  - this is one of the rare functions that creates a dynamic error message
Example Validator: is_numeric_compat
This helper is frequently called several times in the course of other semantic checks. This one produces no errors; that's up to the caller. Often there is a numeric path and a non-numeric path, so this helper can't create the errors as it doesn't yet know if anything bad has happened. Most of the is_something functions are the same way.
cql_noexport bool_t is_numeric_compat(sem_t sem_type) {
sem_type = core_type_of(sem_type);
return sem_type >= SEM_TYPE_NULL && sem_type <= SEM_TYPE_REAL;
}
is_numeric_compat operates by checking the core type for the numeric range. Note that NULL is compatible with numerics because expressions like NULL + 2 have meaning in SQL. The type of that expression is nullable integer and the result is NULL.
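A few sample evaluations, given the code above, make the behavior concrete:

// sample evaluations of is_numeric_compat given the definition above
Invariant(is_numeric_compat(SEM_TYPE_NULL));                     // NULL is numeric compatible
Invariant(is_numeric_compat(SEM_TYPE_REAL | SEM_TYPE_NOTNULL)); // flags are stripped first
Invariant(!is_numeric_compat(SEM_TYPE_TEXT));                   // text is not numeric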
Example Validator: sem_combine_types
// The second workhorse of semantic analysis, given two types that
// are previously known to be compatible, it returns the smallest type
// that holds both. If either is nullable, the result is nullable.
// Note: in the few cases where that isn't true, the normal algorithm for
// nullability result must be overridden (see coalesce, for instance).
static sem_t sem_combine_types(sem_t sem_type_1, sem_t sem_type_2) {
... too much code ... summary below
}
This beast is rather lengthy but unremarkable. It follows these rules:
- text is only compatible with text
- object is only compatible with object
- blob is only compatible with blob
- numerics are only compatible with other numerics and NULL
- NULL promotes the other operand, whatever it is (might still be NULL)
- bool promotes to integer if needed
- integer promotes to long integer if needed
- long integer promotes to real if needed
- the combined type is the smallest numeric type that holds left and right according to the above rules
Some examples might be helpful:
- 1 + 2L -> long
- false + 3.1 -> real
- 2L + 3.1 -> real
- true + 2 -> integer
- 'x' + 1 -> not compatible
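For the numeric cases, the heart of that logic amounts to taking the larger of the two core type codes; here is a sketch of just the numeric portion (the real function also handles the flags, the kinds, and the reference types):

// numeric promotion amounts to "take the bigger type": the core codes are
// ordered NULL < BOOL < INTEGER < LONG_INTEGER < REAL, so the smallest type
// that holds both operands is simply the larger code (numeric cases only)
static sem_t combine_numeric_core_types(sem_t sem_type_1, sem_t sem_type_2) {
  sem_t c1 = core_type_of(sem_type_1);
  sem_t c2 = core_type_of(sem_type_2);
  return c1 > c2 ? c1 : c2;
}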
Note that sem_combine_types assumes the types have already been checked for compatibility and will use Contract to enforce this. You should be using other helpers like is_numeric_compat and friends to ensure the types agree before computing the combined type. A list of values that must be compatible with each other (e.g. in needle IN (haystack)) can be checked using sem_verify_compat repeatedly.
Example Validator: sem_verify_assignment
The sem_verify_assignment function is used any time there is something like a logical assignment going on. There are two important cases:

- SET x := y: an actual assignment
- call foo(x): the expression x must be "assignable" to the formal variable for the argument of foo
This is a lot like normal binary operator compatibility, with one extra rule: the source expression must not be a bigger type than the target. E.g. you cannot assign a long to an integer, nor pass a long expression to a function that has an integer parameter.
// This verifies that the types are compatible and that it's ok to assign
// the expression to the variable. In practice that means:
// * the variable type core type and kind must be compatible with the expression core type and kind
// * the variable must be nullable if the expression is nullable
// * the variable must be sensitive if the assignment is sensitive
// * the variable type must be bigger than the expression type
// Here ast is used only to give a place to put any errors.
static bool_t sem_verify_assignment(ast_node *ast, sem_t sem_type_needed, sem_t sem_type_found, CSTR var_name) {
if (!sem_verify_compat(ast, sem_type_needed, sem_type_found, var_name)) {
return false;
}
if (!sem_verify_safeassign(ast, sem_type_needed, sem_type_found, var_name)) {
return false;
}
if (is_nullable(sem_type_found) && is_not_nullable(sem_type_needed)) {
report_error(ast, "CQL0013: cannot assign/copy possibly null expression to not null target", var_name);
return false;
}
if (sensitive_flag(sem_type_found) && !sensitive_flag(sem_type_needed)) {
report_error(ast, "CQL0014: cannot assign/copy sensitive expression to non-sensitive target", var_name);
return false;
}
return true;
}
- sem_verify_compat: checks for standard type compatibility between the left and the right
- sem_verify_safeassign: checks that if the types are different the right operand is the smaller of the two
: checks that if the types are different the right operand is the smaller of the two- nullability checks ensure you aren't trying to assign a nullable value to a not null variable
- sensitivity checks ensure you aren't trying to assign a sensitive value to a not sensitive variable
Simple Statement Validation
With the expression building blocks, most of the usual kinds of language statements become quite simple to check for correctness. It's probably easiest to illustrate this with an example. Let's look at validation for the WHILE statement:
// While semantic analysis is super simple.
// * the condition must be numeric
// * the statement list must be error-free
// * loop_depth is increased allowing the use of interior leave/continue
static void sem_while_stmt(ast_node *ast) {
Contract(is_ast_while_stmt(ast));
EXTRACT_ANY_NOTNULL(expr, ast->left);
EXTRACT(stmt_list, ast->right);
// WHILE [expr] BEGIN [stmt_list] END
sem_numeric_expr(expr, ast, "WHILE", SEM_EXPR_CONTEXT_NONE);
if (is_error(expr)) {
record_error(ast);
return;
}
if (stmt_list) {
loop_depth++;
sem_stmt_list(stmt_list);
loop_depth--;
if (is_error(stmt_list)) {
record_error(ast);
return;
}
}
record_ok(ast);
}
- EXTRACT*: pulls out the tree parts we need
- sem_numeric_expr: verifies the loop expression is numeric
- sem_stmt_list: recursively validates the body of the loop
Note: the WHILE statement is one of the loop constructs, which means that LEAVE and CONTINUE are legal inside it. The loop_depth global tracks the fact that we are in a loop so that the analysis for LEAVE and CONTINUE can report errors if we are not.
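To see how loop_depth is consumed, here is a distilled sketch of the kind of check the LEAVE analyzer performs (the error text here is illustrative, not the exact message the compiler emits):

// a distilled sketch of how loop_depth gates LEAVE; the error text is
// illustrative rather than the compiler's exact wording
static void sem_leave_stmt_sketch(ast_node *ast) {
  Contract(is_ast_leave_stmt(ast));
  if (loop_depth == 0) {
    report_error(ast, "leave must appear inside of a loop or switch", NULL);
    record_error(ast);
    return;
  }
  record_ok(ast);
}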
It's not hard to imagine that sem_stmt_list will basically walk the AST, pulling out statements and dispatching them using the STMT_INIT tables previously discussed. You might land right back in sem_while_stmt for a nested WHILE -- it's turtles all the way down.
If SEM_EXPR_CONTEXT_NONE is a mystery, don't worry, it's covered in the next section.
Expression Contexts
It turns out that in the SQL language some expression types are only valid in some parts of a SQL statement (e.g. aggregate functions can't appear in a LIMIT
clause) and so there is always a context for any numeric expression. When a new root expression is being evaluated, it sets the expression context per the caller's specification.
The expression contexts are as follows:
#define SEM_EXPR_CONTEXT_NONE 0x0001
#define SEM_EXPR_CONTEXT_SELECT_LIST 0x0002
#define SEM_EXPR_CONTEXT_WHERE 0x0004
#define SEM_EXPR_CONTEXT_ON 0x0008
#define SEM_EXPR_CONTEXT_HAVING 0x0010
#define SEM_EXPR_CONTEXT_ORDER_BY 0x0020
#define SEM_EXPR_CONTEXT_GROUP_BY 0x0040
#define SEM_EXPR_CONTEXT_LIMIT 0x0080
#define SEM_EXPR_CONTEXT_OFFSET 0x0100
#define SEM_EXPR_CONTEXT_TABLE_FUNC 0x0200
#define SEM_EXPR_CONTEXT_WINDOW 0x0400
#define SEM_EXPR_CONTEXT_WINDOW_FILTER 0x0800
#define SEM_EXPR_CONTEXT_CONSTRAINT 0x1000
The idea here is simple: when calling a root expression, the analyzer provides the context value that has the bit that corresponds to the current context. For instance, if the expression being validated is in the WHERE clause, the code will provide SEM_EXPR_CONTEXT_WHERE.
The inner validators check this context; in particular, anything that is only available in some contexts has a bit-mask that is the union of the context bits where it can be used. The validator can check those possibilities against the current context with one bitwise "and" operation. A zero result indicates that the operation is not valid in the current context.
This bitwise "and" is performed by one of these two helper macros which makes the usage a little clearer:
#define CURRENT_EXPR_CONTEXT_IS(x) (!!(current_expr_context & (x)))
#define CURRENT_EXPR_CONTEXT_IS_NOT(x) (!(current_expr_context & (x)))
Expression Context Example: Concat
The concatenation operator || is challenging to successfully emulate because it does many different kinds of numeric to string conversions automatically. Rather than perennially getting this wrong, we simply do not support this operator in a context where SQLite isn't going to be doing the concatenation. So typically users use "printf" instead to get formatting done outside of a SQL context. The check for invalid use of || is very simple and it happens, of course, in sem_concat.
if (CURRENT_EXPR_CONTEXT_IS(SEM_EXPR_CONTEXT_NONE)) {
report_error(ast, "CQL0241: CONCAT may only appear in the context of a SQL statement", NULL);
record_error(ast);
return;
}
Expression Context Example: IN
A slightly more complex example happens processing the IN operator. This operator has two forms: the form with an expression list, which can be used anywhere, and the form with a SELECT statement. The latter form can only appear in some sections of SQL, and not at all in loose expressions. For instance, that form may not appear in the LIMIT or OFFSET sections of a SQLite statement.
We use this construct to do the validation:
uint32_t valid = SEM_EXPR_CONTEXT_SELECT_LIST
|SEM_EXPR_CONTEXT_WHERE
|SEM_EXPR_CONTEXT_ON
|SEM_EXPR_CONTEXT_HAVING
|SEM_EXPR_CONTEXT_TABLE_FUNC;
if (CURRENT_EXPR_CONTEXT_IS_NOT(valid)) {
report_error( ast, "CQL0078: [not] in (select ...) is only allowed inside "
"of select lists, where, on, and having clauses", NULL);
record_error(ast);
return;
}
If the reader is interested in a simple learning exercise, run down the purpose of SEM_EXPR_CONTEXT_TABLE_FUNC -- it's simple, but important, and it only has one use case so it's easy to find.
Name Resolution
We've gotten pretty far without talking about the elephant in the room: name resolution.
Like SQL, many statements in CQL have names in positions where the type of the name is completely unambiguous. For instance, nobody could be confused about what sort of symbol Foo is in DROP INDEX Foo. Names with a clear category like this are the easiest to resolve, and there are a lot of resolutions in this form. Let's do an example.
Example: Index Name Resolution
// This is the basic checking for the drop index statement
// * the index must exist (have been declared) in some version
// * it could be deleted now, that's ok, but the name has to be valid
static void sem_drop_index_stmt(ast_node *ast) {
Contract(is_ast_drop_index_stmt(ast));
EXTRACT_ANY_NOTNULL(name_ast, ast->right);
EXTRACT_STRING(name, name_ast);
ast_node *index_ast = find_usable_index(name, name_ast, "CQL0112: index in drop statement was not declared");
if (!index_ast) {
record_error(ast);
return;
}
record_ok(ast);
}
Well, this is interesting. But what's going on with find_usable_index? What is usable? Why aren't we just looking up the index name in some name table? Let's have a look at the details of find_usable_index.
// returns the node only if it exists and is not restricted by the schema region.
static ast_node *find_usable_index(CSTR name, ast_node *err_target, CSTR msg) {
ast_node *index_ast = find_index(name);
if (!index_ast) {
report_error(err_target, msg, name);
return NULL;
}
if (!sem_validate_object_ast_in_current_region(name, index_ast, err_target, msg)) {
return NULL;
}
return index_ast;
}
We haven't discussed schema regions yet but what you need to know about them for now is this:
- any object can be in a region.
- a region may depend on other regions
If an object is in a region, then it may only use schema parts that are in the same region, or the region's dependencies (transitively).
The point of this is that you might have a rather large schema and you probably don't want any piece of code to use just any piece of schema. You can use regions to ensure that the code for feature "X" doesn't try to use schema designed exclusively for feature "Y". That "X" code probably has no business even knowing of the existence of "Y" schema.
So now usable simply means this:

- find_index can find the name in the symbol table for indices
- the found index is accessible in the current region
If we had used an example that was looking up a table name, the same region considerations would apply; however, additionally, tables can be deprecated with @delete, so there would be extra checks to make sure we're talking about a live table and not a table's tombstone.
In short, these simple cases just require looking up the entity and verifying that it's accessible in the current context.
Flexible Name Resolution
The "hard case" for name resolution is where the name is occurring in an expression. Such a name can refer to all manner of things. It could be a global variable, a local variable, an argument, a table column, a field in a cursor, and others. The general name resolver goes through several phases looking for the name. Each phase can either report an affirmative success or error (in which case the search stops), or it may simply report that the name was not found but the search should continue.
We can demystify this a bit by looking at the most common way to get name resolution done.
// Resolves a (potentially qualified) identifier, writing semantic information
// into `ast` if successful, or reporting and recording an error for `ast` if
// not.
static void sem_resolve_id(ast_node *ast, CSTR name, CSTR scope) {
Contract(is_id(ast) || is_ast_dot(ast));
Contract(name);
// We have no use for `type` and simply throw it away.
sem_t *type = NULL;
sem_resolve_id_with_type(ast, name, scope, &type);
}
The name resolver works on either a vanilla name (e.g. x) or a scoped name (e.g. T1.x). The name and scope are provided. The ast parameter is used only as a place to report errors; there is no further cracking of the AST needed to resolve the name. As you can see sem_resolve_id just calls the more general function sem_resolve_id_with_type and is used in the most common case where you don't need to be able to mutate the semantic type info for the identifier. That's the 99% case.
So let's move on to the "real" resolver.
// This function is responsible for resolving both unqualified identifiers (ids)
// and qualified identifiers (dots). It performs the following two roles:
//
// - If an optional `ast` is provided, it works the same way most semantic
// analysis functions work: semantic information will be written into the
// ast, errors will be reported to the user, and errors will be recorded in
// the AST.
//
// - `*type_ptr` will be set to mutable type (`sem_t *`) in the current
// environment if the identifier successfully resolves to a type. (There are,
// unfortunately, a few exceptions in which a type will be successfully
// resolved and yet `*type_ptr` will not be set. These include when a cursor is
// in an expression position, when the expression is `rowid` (or similar), and
// when the id resolves to an enum case. The reason no mutable type is
// returned in these cases is that a new type is allocated as part of semantic
// analysis, and there exists no single, stable type in the environment to
// which a pointer could be returned. This is a limitation of this function,
// albeit one that's currently not problematic.)
//
// Resolution is attempted in the order that the `sem_try_resolve_*` functions
// appear in the `resolver` array. Each takes the same arguments: An (optional)
// AST, a mandatory name, an optional scope, and mandatory type pointer. If the
// identifier provided to one of these resolvers is resolved successfully, *or*
// if the correct resolver was found but there was an error in the program,
// `SEM_RESOLVE_STOP` is returned and resolution is complete, successful or not.
// If a resolver is tried and it determines that it is not the correct resolver
// for the identifier in question, `SEM_RESOLVE_CONTINUE` is returned and the
// next resolver is tried.
//
// This function should not be called directly. If one is interested in
// performing semantic analysis, call `sem_resolve_id` (or, if within an
// expression, `sem_resolve_id_expr`.) Alternatively, if one wants to get a
// mutable type from the environment, call `find_mutable_type`.
static void sem_resolve_id_with_type(ast_node *ast, CSTR name, CSTR scope, sem_t **type_ptr) {
Contract(name);
Contract(type_ptr);
*type_ptr = NULL;
sem_resolve (*resolver[])(ast_node *ast, CSTR, CSTR, sem_t **) = {
sem_try_resolve_arguments,
sem_try_resolve_column,
sem_try_resolve_rowid,
sem_try_resolve_cursor_as_expression,
sem_try_resolve_variable,
sem_try_resolve_enum,
sem_try_resolve_cursor_field,
sem_try_resolve_arg_bundle,
};
for (uint32_t i = 0; i < sizeof(resolver) / sizeof(void *); i++) {
if (resolver[i](ast, name, scope, type_ptr) == SEM_RESOLVE_STOP) {
return;
}
}
report_resolve_error(ast, "CQL0069: name not found", name);
record_resolve_error(ast);
}
This function is well described in its own comments. We can easily see the "mini-resolvers" which attempt to find the name in order:
- sem_try_resolve_arguments: an argument in the argument list
- sem_try_resolve_column: a column name (possibly scoped)
- sem_try_resolve_rowid: the virtual rowid column (possibly scoped)
- sem_try_resolve_cursor_as_expression: use of a cursor as a boolean -- the bool is true if the cursor has data
- sem_try_resolve_variable: local or global variables
- sem_try_resolve_enum: the constant value of an enum (must be scoped)
- sem_try_resolve_cursor_field: a field in a cursor (must be scoped)
- sem_try_resolve_arg_bundle: a field in an argument bundle (must be scoped)
These all use this enum to communicate progress:
// All `sem_try_resolve_*` functions return either `SEM_RESOLVE_CONTINUE` to
// indicate that another resolver should be tried, or `SEM_RESOLVE_STOP` to
// indicate that the correct resolver was found. Continuing implies that no
// failure has (yet) occurred, but stopping implies neither success nor failure.
typedef enum {
SEM_RESOLVE_CONTINUE = 0,
SEM_RESOLVE_STOP = 1
} sem_resolve;
Each of these mini-resolvers will have a series of rules, for example sem_try_resolve_cursor_field is going to have to do something like this:

- if there is no scope, it can't be a cursor field, return CONTINUE
- if the scope is not the name of a cursor, return CONTINUE
- if the name is a field in the cursor, return STOP with success
- else, report that the name is not a valid member of the cursor, and return STOP with an error
All the mini-resolvers are similarly structured, generically:

- if it's not my case, return CONTINUE
- if it is my case, return STOP (maybe with an error)

Some of the mini-resolvers have quite a few steps, but any one mini-resolver is only about a screenful of code and it does one job.
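Putting those rules into the resolver signature from above gives a skeleton like this (a sketch: find_cursor and find_cursor_field are hypothetical stand-ins for the real lookup helpers in sem.c, and the message text is illustrative):

// a distilled mini-resolver skeleton; find_cursor and find_cursor_field are
// hypothetical stand-ins for the real lookups in sem.c
static sem_resolve sem_try_resolve_cursor_field_sketch(
  ast_node *ast, CSTR name, CSTR scope, sem_t **type_ptr)
{
  if (!scope) {
    return SEM_RESOLVE_CONTINUE;   // no scope: this can't be cursor.field
  }
  ast_node *cursor = find_cursor(scope);
  if (!cursor) {
    return SEM_RESOLVE_CONTINUE;   // scope isn't a cursor: not our case
  }
  sem_t *field_type = find_cursor_field(cursor, name);
  if (!field_type) {
    // our case, but the field doesn't exist: stop with an error
    report_resolve_error(ast, "field not found in cursor", name);
    record_resolve_error(ast);
    return SEM_RESOLVE_STOP;
  }
  *type_ptr = field_type;          // resolved: hand back the mutable type
  return SEM_RESOLVE_STOP;
}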
Flow Analysis
CQL implements a basic form of control flow analysis in "flow.c". The header "flow.h" exposes a small set of primitives used by "sem.c" during semantic analysis.
Flow analysis in CQL involves two important concepts: flow contexts and improvements. These are rather entangled concepts -- one is useless without the other -- and so the approach to describing them here will alternate between giving a bit of background on one and then the other, with a greater level of detail about the specific types of improvements being supplied later on.
A flow context is used, in essence, to create a boundary around a portion of a user's program. At the moment, there are four types of contexts.
The first type of context is called, rather boringly, a normal context. Normal contexts are used for portions of a user's code that may be entered conditionally. A good example of this is in SELECT expressions: when a WHERE clause is present, the expression list is only evaluated when the WHERE clause is true. If we look at sem_select_expr_list_con, we can get an idea of how this works in terms of flow contexts:
static void sem_select_expr_list_con(ast_node *ast) {
...
// Analyze the FROM portion (if it exists).
sem_select_from(select_from_etc);
error = is_error(select_from_etc);
// Push a flow context to contain improvements made via the WHERE clause that
// will be in effect for the SELECT expression list.
FLOW_PUSH_CONTEXT_NORMAL();
if (!error) {
...
sem_sensitive = sem_select_where_etc(select_from_etc);
...
sem_set_improvements_for_true_condition(where_expr);
...
}
...
if (!error) {
...
sem_select_expr_list(select_expr_list);
...
}
...
FLOW_POP_CONTEXT_NORMAL();
...
}
While very much simplified above, it can be seen that the steps are essentially as follows:
- Analyze the FROM clause.
- Push a new normal context.
- Analyze the WHERE clause.
- Set improvements given the WHERE clause (ultimately by calling flow_set_flag_for_type); we'll come back to this part shortly.
- Analyze the expression list with the improvements from the WHERE in effect.
- Pop the context, un-setting the improvements from the WHERE.
This, of course, only begins to make sense once one understands what we mean by improvements.
CQL, at the moment, supports two forms of improvements: nullability improvements and initialization improvements. Both of these will be discussed in more detail later, but the basic idea is that an improvement upgrades the type of some value within a particular flow context. For example, in the expression SELECT x + x FROM t WHERE x IS NOT NULL, we can reason that x + x can safely be given a nonnull type because of the WHERE clause. This is exactly what we do in sem_select_expr_list_con: we make a context to hold the improvements that may come from the WHERE, analyze the WHERE, set the appropriate improvements given the WHERE, analyze the expression list, and then pop the context to unset the improvements (as they must not affect any enclosing expressions).
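To make this concrete in CQL terms (using a hypothetical table t with a nullable column x), the WHERE clause changes the type of the result:

CREATE TABLE t(x INTEGER);

-- x is nullable in t, but the WHERE improvement is in effect for the
-- expression list, so x + x is typed integer notnull rather than integer
SELECT x + x FROM t WHERE x IS NOT NULL;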
In addition to normal contexts, there are also branch contexts and branch group contexts. These two context types are designed to work together for handling IF, CASE, IIF, SWITCH, et cetera.
Like normal contexts, branch contexts assume that they are entered when some condition is true. The difference is that branch contexts lie within a branch group context, and branch groups know that at most one branch of a given set of branches will be entered. A great example of this can be found in sem_if_stmt:
static void sem_if_stmt(ast_node *ast) {
...
// Each branch gets its own flow context in `sem_cond_action` where its
// condition is known to be true. We also create one more context for the
// entire set of branches. In addition to grouping the branches together, this
// outer context holds all of the negative improvements that result from the
// knowledge that, if a given branch's statements are being evaluated, all
// previous branches' conditions must have been false.
FLOW_PUSH_CONTEXT_BRANCH_GROUP();
// IF [cond_action]
sem_cond_action(cond_action);
...
if (elseif) {
sem_elseif_list(elseif);
...
}
...
if (elsenode) {
// ELSE [stmt_list]
flow_set_context_branch_group_covers_all_cases(true);
EXTRACT(stmt_list, elsenode->left);
if (stmt_list) {
FLOW_PUSH_CONTEXT_BRANCH();
sem_stmt_list_in_current_flow_context(stmt_list);
FLOW_POP_CONTEXT_BRANCH();
...
} else {
flow_context_branch_group_add_empty_branch();
}
record_ok(elsenode);
}
...
FLOW_POP_CONTEXT_BRANCH_GROUP();
...
}
It's instructive to look at sem_cond_action as well:
static void sem_cond_action(ast_node *ast) {
...
// [expr] THEN stmt_list
sem_expr(expr);
...
if (stmt_list) {
FLOW_PUSH_CONTEXT_BRANCH();
// Add improvements for `stmt_list` where `expr` must be true.
sem_set_improvements_for_true_condition(expr);
sem_stmt_list_in_current_flow_context(stmt_list);
FLOW_POP_CONTEXT_BRANCH();
...
} else {
flow_context_branch_group_add_empty_branch();
}
// If a later branch will be taken, `expr` must be false. Add its negative
// improvements to the context created in `sem_if_stmt` so that all later
// branches will be improved by the OR-linked spine of IS NULL checks in
// `expr`.
sem_set_improvements_for_false_condition(expr);
...
}
Putting all of this together, we can see that the basic steps for analyzing an IF statement are as follows:
- Push a new branch group context to hold all of the branch contexts that are to come.
- Analyze the condition in the IF condition THEN portion of the statement.
- Push a new branch context to hold the nullability improvements from the condition (e.g., in IF x IS NOT NULL THEN, we can improve x to have a nonnull type in the statement list after the THEN).
- Set the improvements.
- Analyze the statement list after the THEN.
- Pop the branch context.
- Set the negative improvements resulting from the knowledge that the condition must have been false if the previous branch wasn't entered (e.g., in IF y IS NULL THEN, we know that y must be nonnull from just after the end of the branch until the end of the current branch group).
- Repeat for the ELSE IF and ELSE branches (if any).
- Pop the branch group context.
What makes all of this work are the following:
- When a branch context is popped, it resets all improvements such that they become exactly what they were before the branch was analyzed. This reflects the fact that, because at most one branch will be entered, neither adding improvements (via flow_set_flag_for_type) nor removing existing improvements (via flow_unset_flag_for_type) in a branch should affect any of the other branches in the group.
- When a branch group context is popped, it merges the effects of all of its branches. This is the key step that allows CQL to retain an improvement after a branch group is popped whenever the same improvement is made within every one of its branches and the branches given cover all possible cases (which is indicated by the call to flow_set_context_branch_group_covers_all_cases in the code above).
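For example, here is a sketch (with a hypothetical condition variable) of an improvement that survives its branch group because it is made in every branch and the branches cover all cases:

DECLARE some_condition BOOL NOT NULL;
DECLARE x INTEGER;

IF some_condition THEN
  SET x := 1;  -- improves x in this branch
ELSE
  SET x := 2;  -- ... and in this branch as well
END IF;

-- every branch improved x and the branches cover all cases, so the merge
-- performed when the branch group is popped leaves x improved here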
The final type of context is called a jump context. Jump contexts are a maximally pessimistic form of context: they assume that every improvement that might be unset within them will be unset, and that every improvement that might be set within them will not be set. Jump contexts are used to make semantic analysis safe in the possible presence of control flow statements like CONTINUE, LEAVE, and THROW, and so they are used for the analysis of statements like LOOP, WHILE, and TRY. Take the following line-numbered code as an example:
001 DECLARE x TEXT;
002 SET x := "foo";
003 WHILE some_condition
004 BEGIN
005 IF another_condition THEN
006 SET x := NULL;
007 IF yet_another_condition THEN
008 LEAVE;
009 END IF;
010 SET x := "bar";
011 ELSE
012 -- do nothing
013 END IF;
014 END;
015 CALL requires_text_notnull(x);
Here, even though the outer IF makes no change overall to the nullability improvement to x from line 2 -- it unsets it on line 6 and then re-sets it on line 10, and the ELSE does nothing -- there is no guarantee that line 10 will ever be evaluated because we may jump straight from line 8 to line 15. As a result, it is necessary that x be un-improved after the WHILE loop; a normal context would not accomplish this, but a jump context does. See the comments within _flow_push_context_branch for additional discussion.
While jump contexts are necessary for the safety of improvements in the presence of loops, they are not sufficient: It's actually necessary to analyze loops twice. This is because execution of a loop might repeat, and so a statement that results in the unsetting of an improvement later in a loop must affect improvements earlier in that loop. For example:
DECLARE x INT;
SET x := 1;
WHILE some_condition
BEGIN
-- okay on the first analysis pass, but not the second
CALL requires_int_notnull(x);
-- must negatively affect the call on the line above
SET x := NULL;
END;
Semantic analysis keeps track of whether or not it is currently reanalyzing the statement list of a loop via the current_loop_analysis_state variable:
// The analysis of loops like LOOP and WHILE is done in two passes. First, we
// analyze the loop to conservatively figure out every improvement that the loop
// could possibly unset. After that, we reanalyze it with said improvements
// unset to ensure that everything is safe. See `sem_stmt_list_within_loop` for
// more information on why this is necessary.
typedef enum {
LOOP_ANALYSIS_STATE_NONE,
LOOP_ANALYSIS_STATE_ANALYZE,
LOOP_ANALYSIS_STATE_REANALYZE
} loop_analysis_state;
...
// Keeps track of the current loop analysis state. If this is equal to
// `LOOP_ANALYSIS_STATE_ANALYZE`, we are analyzing with a non-final set of
// improvements. This is useful for two reasons:
//
// 1. Procedures that perform rewrites based on improvements (e.g.,
// `sem_resolve_id_expr`) can use this to verify whether a rewrite is safe to
// perform (`LOOP_ANALYSIS_STATE_NONE` or `LOOP_ANALYSIS_STATE_REANALYZE`) or
// whether they should wait because they do not yet have definitive
// information (`LOOP_ANALYSIS_STATE_ANALYZE`).
//
// 2. Analyses that would otherwise fail if called during reanalysis (e.g.,
// `sem_verify_legal_variable_name`) can use this to check whether the
// current state is `LOOP_ANALYSIS_STATE_REANALYZE` and adjust their
// behaviors accordingly.
static loop_analysis_state current_loop_analysis_state = LOOP_ANALYSIS_STATE_NONE;
As indicated in the first comment above, the comments within sem_stmt_list_within_loop go into further detail.
At this point, we've only scratched the surface of control flow analysis in CQL. Fortunately, the files "flow.h" and "flow.c" are heavily commented and can be studied to deepen one's understanding.
Nullability Improvements
Via a form of occurrence typing (also known as flow typing), CQL has the ability to determine that, due to a prior conditional check, a nullable variable or cursor field cannot be null within a particular context, and CQL will improve its type in that context.
Unlike most forms of semantic analysis performed by CQL, the analysis for nullability improvements -- as is the case for all types of improvements -- makes heavy use of the find_mutable_type function:
// Returns the *mutable* type (`sem_t *`) for a given (potentially qualified)
// identifier if one exists in the environment. See the documentation for
// `sem_resolve_id_with_type` for limitations.
static sem_t *find_mutable_type(CSTR name, CSTR scope);
This function allows us to look up the type of the original binding referred to by a particular name/scope pair. In essence, it provides access to the current type environment for whichever part of the program we are analyzing. It also allows us to mutate that environment by virtue of the fact that it returns a pointer to the type of the binding, not merely the type itself.
By using find_mutable_type to get a type pointer and toggling the SEM_TYPE_INFERRED_NOTNULL flag via flow_set_flag_for_type and flow_unset_flag_for_type, the procedures sem_set_notnull_improved and sem_unset_notnull_improved are able to record that a nullable identifier or cursor field is either temporarily nonnull or no longer nonnull, respectively:
// Enables a nonnull improvement, if possible.
static void sem_set_notnull_improved(CSTR name, CSTR scope);
// This needs to be called for everything that is no longer safe to consider NOT
// NULL due to a mutation. It is fine to call this for something not currently
// subject to improvement, but it must only be called with a name/scope pair
// referring to something that has a mutable type (e.g., it must not be an unbound
// variable, a cursor used as an expression, an enum case, et cetera).
static void sem_unset_notnull_improved(CSTR name, CSTR scope);
Similarly, sem_is_notnull_improved uses find_mutable_type to check whether or not something is currently improved:
// Returns true if currently improved to be nonnull, else false.
static bool_t sem_is_notnull_improved(CSTR name, CSTR scope);
Why does nullability inference use this approach? Consider the alternative: maintaining some set of currently improved identifiers and cursor fields and checking it whenever resolving an identifier or cursor field. The problem is that merely knowing that some identifier "x" is improved would not be sufficient: we'd have to know which "x". Is it the local variable "x"? Is it the column "x" of the table from which we're currently selecting? In essence, correctly maintaining an independent set of all currently active improvements would involve re-implementing all of the scoping rules of the language. By using find_mutable_type, we can simply piggyback on the existing name resolution logic and avoid all of these issues.
A nullability improvement is always created within a particular flow context. When an improvement is added via sem_set_notnull_improved, a record of that improvement is kept in the current context. When that context ends, that same record is used to remove the improvement. An improvement may also be removed before its context ends via sem_unset_notnull_improved, when a SET, a FETCH, or a call to a procedure or function with an OUT argument makes the improvement no longer safe.
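For example (a sketch with a single nullable variable), a mutation unsets the improvement right away rather than at the end of the context:

DECLARE x INTEGER;

IF x IS NOT NULL THEN
  -- x is improved to integer notnull here
  SET x := NULL;
  -- the improvement was just unset: x is nullable again here, even
  -- though the enclosing flow context has not yet ended
END IF;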
Improvements can be introduced into the current context via sem_set_notnull_improved directly (when a variable is SET to a value of a nonnull type), but more commonly they are introduced via one of the following two functions:
// Given a conditional expression `ast` possibly containing AND-linked
// subexpressions, set all of the applicable nullability and has-row
// improvements within the current flow context. Generally speaking, calls to
// this function should be bounded by a new flow context corresponding to the
// portion of the program for which the condition `ast` must be true.
static void sem_set_improvements_for_true_condition(ast_node *expr);
// Improvements for known-false conditions are dual to improvements for
// known-true conditions.
//
// For nullability, known-false conditions improve ids and dots verified to be
// NULL via `IS NULL` along the outermost spine of `OR` expressions, whereas
// known-true conditions improve ids and dots verified to be nonnull via `IS NOT
// NULL` along the outermost spine of `AND` expressions. For example, the
// following two statements introduce the same improvements:
//
// IF a IS NOT NULL AND b IS NOT NULL THEN
// -- `a` and `b` are improved here because we know the condition is true
// END IF;
//
// IF a IS NULL OR b IS NULL RETURN;
// -- `a` and `b` are improved here because we know the condition is false
// -- since we must not have returned if we got this far
//
// ...
static void sem_set_improvements_for_false_condition(ast_node *ast);
These functions introduce improvements by gathering up all of the IS NOT NULL checks (in the true case) or IS NULL checks (in the false case) and introducing improvements appropriately. The true version is used when we enter a context that will only be evaluated at runtime when some particular condition is true; the false version, conversely, is used when we enter a context that will only be evaluated at runtime when some particular condition is false:
IF some_condition THEN
-- "true" improvements from `some_condition` are in
-- effect here
ELSE IF another_condition THEN
-- "false" improvements from `some_condition` and true
-- improvements from `another_condition` are in effect
-- here
ELSE
-- "false" improvements from both `some_condition` and
-- `another_condition` are in effect here
END IF;
Global variables in CQL require special treatment when it comes to nullability improvements. This is because any procedure call could potentially mutate any number of global variables, and so all currently improved globals must be un-improved at every such call. The following list keeps track of which global variables are currently improved:
typedef struct global_notnull_improvement_item {
sem_t *type;
struct global_notnull_improvement_item *next;
} global_notnull_improvement_item;
...
// Keeps track of all global variables that may currently be improved to be NOT
// NULL. We need this because we must un-improve all such variables after every
// procedure call (because we don't do interprocedural analysis and cannot know
// which globals may have been set to NULL).
static global_notnull_improvement_item *global_notnull_improvements;
The fact that we don't do interprocedural analysis (as the comment above indicates) is not a deficiency. Programmers should be able to reason locally about nullability improvements, and an analysis that depended upon the details of how other procedures were implemented would make that impossible.
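A sketch of the consequence (with a hypothetical global g and a declared external procedure):

DECLARE g TEXT;  -- a global: declared outside of any procedure
DECLARE PROC some_other_proc ();

CREATE PROC uses_global ()
BEGIN
  IF g IS NOT NULL THEN
    -- g is improved to text notnull here
    CALL some_other_proc();
    -- some_other_proc might have set g to NULL, so the improvement was
    -- unset at the call site: g is nullable again here
  END IF;
END;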
So far, we have talked a lot about how improvements are set and unset, but we haven't talked about how the improvement actually happens in terms of code generation. Since CQL represents values of nullable and nonnull types differently (at least in the case of non-reference types), we cannot simply treat a value of a nullable type as though it were of a nonnull type: We need to actually change its representation.
The way this works is that, whenever we resolve a name/scope pair via sem_resolve_id_expr, we check whether the pair is currently improved via sem_is_notnull_improved. If it is, we call rewrite_nullable_to_unsafe_notnull to wrap the id or dot we're resolving in a call to the function cql_inferred_notnull (for which we generate code in cg_func_cql_inferred_notnull):
// Wraps an id or dot in a call to cql_inferred_notnull.
cql_noexport void rewrite_nullable_to_unsafe_notnull(ast_node *_Nonnull ast);
// The `cql_inferred_notnull` function is not used by the programmer directly,
// but rather inserted via a rewrite during semantic analysis to coerce a value
// of a nullable type to be nonnull. The reason for this approach, as opposed to
// just changing the type directly, is that there are also representational
// differences between values of nullable and nonnull types; some conversion is
// required.
static void cg_func_cql_inferred_notnull(ast_node *call_ast, charbuf *is_null, charbuf *value);
As the comment for cg_func_cql_inferred_notnull indicates, programmers do not use cql_inferred_notnull directly: it is only inserted as a product of the above-mentioned rewrite. In fact, we explicitly disallow its use by programmers in the parser:
// We insert calls to `cql_inferred_notnull` as part of a rewrite so we expect
// to see it during semantic analysis, but it cannot be allowed to appear in a
// program. It would be unsafe if it could: It coerces a value from a nullable
// type to a nonnull type without any runtime check.
#define YY_ERROR_ON_CQL_INFERRED_NOTNULL(x) \
EXTRACT_STRING(proc_name, x); \
if (!strcmp(proc_name, "cql_inferred_notnull")) { \
yyerror("Call to internal function is not allowed 'cql_inferred_notnull'"); \
}
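Conceptually, the rewrite has this effect (a sketch of the transformation, not literal compiler output):

-- what the programmer writes:
SELECT x + x FROM t WHERE x IS NOT NULL;

-- what the expression list effectively becomes after the rewrite:
SELECT cql_inferred_notnull(x) + cql_inferred_notnull(x)
FROM t WHERE x IS NOT NULL;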
One subtle aspect of the rewrite is that the rewrite itself performs analysis to validate the product of the rewrite (as do many other rewrites). To avoid going into a loop of rewriting, analyzing the result (which ultimately happens in sem_special_func_cql_inferred_notnull), rewriting again because the result contains a name that is improved, et cetera, we keep track of whether or not we're currently analyzing a subexpression under a call to cql_inferred_notnull and avoid re-rewriting appropriately:
// This is true if we are analyzing a call to `cql_inferred_notnull`. This can
// happen for three reasons:
//
// * We just did a rewrite that produced a `cql_inferred_notnull` call and now
// we're computing its type.
// * We're analyzing an expression that was already analyzed (e.g., in a CTE).
// * We're analyzing the output of a previous CQL run within which calls to
// `cql_inferred_notnull` may occur.
//
// Regardless of the cause, if `is_analyzing_notnull_rewrite` is true, we do not
// want to rewrite again.
static bool_t is_analyzing_notnull_rewrite;
static void sem_special_func_cql_inferred_notnull(ast_node *ast, uint32_t arg_count, bool_t *is_aggregate) {
...
// Since we're checking a call to `cql_inferred_notnull`, its arguments have
// already been rewritten and we don't want to do it again. Setting
// `is_analyzing_notnull_rewrite` prevents that.
is_analyzing_notnull_rewrite = true;
sem_arg_list(arg_list, IS_NOT_COUNT);
is_analyzing_notnull_rewrite = false;
...
}
// Like `sem_resolve_id`, but specific to expression contexts (where nullability
// improvements are applicable).
static void sem_resolve_id_expr(ast_node *ast, CSTR name, CSTR scope) {
...
if (is_analyzing_notnull_rewrite) {
// If we're analyzing the product of a rewrite and we're already inside of a
// call to `cql_inferred_notnull`, do not expand again: we would rewrite
// forever.
return;
}
...
}
At this point, you should have a decent understanding of how nullability improvements function, both in terms of semantic analysis and in terms of code generation. The implementation is heavily commented, so reading the code and searching for calls to the core functions listed below should be sufficient to fill in any gaps:
bool_t sem_is_notnull_improved(CSTR name, CSTR scope);
void sem_set_notnull_improved(CSTR name, CSTR scope);
void sem_unset_notnull_improved(CSTR name, CSTR scope);
void sem_unset_global_notnull_improvements();
void sem_set_improvements_for_true_condition(ast_node *ast);
void sem_set_improvements_for_false_condition(ast_node *ast);
void sem_special_func_cql_inferred_notnull(ast_node *ast, uint32_t arg_count, bool_t *is_aggregate);
void rewrite_nullable_to_unsafe_notnull(ast_node *_Nonnull ast);
Initialization Improvements
Compared to nullability improvements, initialization improvements are relatively simple.
The idea behind initialization improvements is that, if one declares a variable of a reference type (BLOB, OBJECT, or TEXT) that is also NOT NULL, it is not safe to use the variable until it has been given a value. For example:
DECLARE x TEXT NOT NULL;
IF some_condition THEN
SET x := some_text_notnull_value;
-- `x` is safe to use here
ELSE
-- `x` is NOT safe to use here (it might be uninitialized)
END IF;
-- `x` is NOT safe to use here either (it might be uninitialized)
As with nullability improvements, initialization improvements rely heavily on flow contexts. The function sem_set_initialization_improved, similarly to sem_set_notnull_improved for nullability, is used to enable an initialization improvement. (There is nothing analogous to sem_unset_notnull_improved for initialization because nothing can ever become uninitialized once it has been given a value.)
Unlike nullability improvements, initialization improvements use two flags: SEM_TYPE_INIT_REQUIRED and SEM_TYPE_INIT_COMPLETE. Rather than assuming everything is uninitialized by default and requiring the presence of some SEM_TYPE_INITIALIZED flag before anything can be used, we explicitly tag things that are not initialized but need to be with SEM_TYPE_INIT_REQUIRED and later tag them with SEM_TYPE_INIT_COMPLETE once they've been initialized. Doing it this way has two benefits:
- It reduces the amount of noise in the AST output significantly: code like LET x := 10; can remain {let_stmt}: x: integer notnull variable in the AST without the extra noise of some initialized flag.
- More importantly, it means we only have to deal with initialization in a tiny portion of "sem.c". For example, we must handle it in sem_declare_vars_type to add the SEM_TYPE_INIT_REQUIRED flag and in sem_assign to add SEM_TYPE_INIT_COMPLETE, but sem_let_stmt can remain blissfully ignorant of initialization altogether.
There are only three places in which a variable may be initialized: sem_assign (as mentioned), sem_fetch_stmt (for the FETCH...INTO form), and sem_arg_for_out_param (as passing a variable to a procedure requiring an OUT argument of a NOT NULL type can initialize it).
Regarding sem_arg_for_out_param: we can only set initialization improvements when a variable is passed as an OUT argument because we require that all procedures initialize all of their OUT parameters of a nonnull reference type. This is handled in two places:
- In sem_param, we set the SEM_TYPE_INIT_REQUIRED flag when param_should_require_initialization is true.
- In sem_validate_current_proc_params_are_initialized, which is called both after analyzing the statement list of a procedure and for each return statement within the procedure, we ensure that SEM_TYPE_INIT_COMPLETE is present on all parameters that have SEM_TYPE_INIT_REQUIRED.
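Putting the pieces together, here is a sketch (with hypothetical procedures) of initialization flowing through an OUT argument:

CREATE PROC get_text (OUT t TEXT NOT NULL)
BEGIN
  -- OUT parameters of a nonnull reference type must be initialized
  -- before the procedure returns
  SET t := "value";
END;

CREATE PROC caller ()
BEGIN
  DECLARE x TEXT NOT NULL;  -- tagged SEM_TYPE_INIT_REQUIRED
  CALL get_text(x);         -- passing x as OUT adds SEM_TYPE_INIT_COMPLETE
  -- x is safe to use here
END;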
There is only one wrinkle in all of this: the cql:try_is_proc_body attribute. If cql:try_is_proc_body is present on a TRY statement, we call sem_validate_current_proc_params_are_initialized at the end of the TRY and not at the end of the procedure. The rationale for this is explained thoroughly in the comments for sem_find_ast_misc_attr_trycatch_is_proc_body_callback.
That's all there is to it: "flow.c" does most of the hard work for us.
Structure types and the notion of Shapes
Earlier we discussed SEM_TYPE_STRUCT briefly. Recall the basic notion of the structure type:
// for tables and views and the result of a select
typedef struct sem_struct {
CSTR struct_name; // struct name
uint32_t count; // count of fields
CSTR *names; // field names
CSTR *kinds; // the "kind" text of each column, if any; e.g. in integer<foo>, foo is the kind
sem_t *semtypes; // typecode for each field
} sem_struct;
The structure is nothing more than an array of names, types and kinds with a count. But it creates the notion of what's usually called a "shape" in the codebase. Shapes can be used in a variety of ways as is described in Chapter 5 of the CQL Guide. But before we get into shapes, let's look at an example of how a structure type is created.
The code that follows is the back end of sem_create_table_stmt. At this point the bulk of the analysis is done and the columns all have their types. We're about to build the struct type for the table. Let's see how that goes.
// now create a struct type with the correct number of columns
// the types have already been computed so all we have to do is
// check for duplicates
sem_struct *sptr = new_sem_struct(name, cols);
symtab *columns = symtab_new();
int32_t col = 0;
for (ast_node *item = col_key_list; item; item = item->right) {
Contract(is_ast_col_key_list(item));
EXTRACT_ANY_NOTNULL(def, item->left);
if (is_ast_col_def(def)) {
Invariant(def->sem->name);
Invariant(col <= cols); // it's possible that the rest are deleted and we're at the end.
// columns must be unique, including deleted columns
if (!symtab_add(columns, def->sem->name, NULL)) {
EXTRACT_NOTNULL(col_def_type_attrs, def->left);
EXTRACT_NOTNULL(col_def_name_type, col_def_type_attrs->left);
EXTRACT_ANY_NOTNULL(col_def_ast, col_def_name_type->left);
report_error(col_def_ast, "CQL0142: duplicate column name", def->sem->name);
record_error(ast);
symtab_delete(columns);
goto cleanup;
}
if (is_deleted(def->sem->sem_type)) {
continue;
}
Invariant(col < cols);
sptr->names[col] = def->sem->name;
sptr->semtypes[col] = def->sem->sem_type;
sptr->kinds[col] = def->sem->kind;
col++;
}
}
symtab_delete(columns);
Invariant(col == cols);
ast->sem = new_sem(SEM_TYPE_STRUCT);
ast->sem->sptr = sptr;
ast->sem->jptr = sem_join_from_sem_struct(sptr);
ast->sem->region = current_region;
- new_sem_struct: makes a struct to hold the result; we already have the count of columns and the table name
- symtab_new: gives us a scratch symbol table so we can check for duplicate column names
- we walk all the items in the table and use is_ast_col_def(def) to find the column definitions
- Invariant(def->sem->name): claims that we must have already computed the semantic info for the column and that its name is populated (this was done earlier)
- symtab_add(columns, def->sem->name, NULL): adds a nil entry under the column name -- if this fails we have a duplicate column, in which case we report errors and stop
- is_deleted: tells us if the column was marked with @delete, in which case it no longer counts as part of the table
- if all this is good we set the names, kinds, and semtypes from the column definition's semantic info
- symtab_delete: cleans up the temporary symbol table
- new_sem: creates a sem_node of type SEM_TYPE_STRUCT which is then filled in
- sem_join_from_sem_struct: will be discussed shortly, but it creates a jptr with one table in it
Structure types often come from the shape of a table, but other things can create a structure type. For instance, the columns of a view, or of any select statement, are also described by a structure type and are therefore valid "shapes". The return type of a procedure usually comes from a SELECT statement, so the procedure too can be the source of a shape. The arguments of a procedure form a shape. The fields of a cursor form a shape. You can even have a named subset of the arguments of a procedure and use them like a shape. All of these things are described by structure types.
Shapes and the LIKE construct
There are many cases where you want to be able to capture or re-use something with a known shape and you don't want to have to fully re-declare the thing. CQL uses the LIKE construct to do these sorts of things. This is more fully explained in Chapter 5 of the Guide, but for now let's look at two different cases that are of interest.
First, a cursor:
DECLARE C CURSOR LIKE Foo; -- Foo is something with a shape
So, in the above, Foo could be a table, a view, a procedure with a result, another cursor, and so forth.
How might we do this? This is the business of sem_declare_cursor_like_name:
// Here we're going to make a new value cursor using the indicated name for the shape.
// The name has to be "likeable" meaning it refers to some named thing with a shape
// such as a table, a view, another cursor, or a procedure that returns a result set.
// These are the so called "value cursors" in that they have no underlying statement
// that they move through. You can just load them up with a row and pass them around.
static void sem_declare_cursor_like_name(ast_node *ast) {
Contract(is_ast_declare_cursor_like_name(ast));
EXTRACT_ANY_NOTNULL(new_cursor_ast, ast->left);
EXTRACT_STRING(new_cursor_name, new_cursor_ast);
EXTRACT_ANY_NOTNULL(shape_def, ast->right);
// no duplicates allowed
if (!sem_verify_legal_variable_name(ast, new_cursor_name)) {
record_error(new_cursor_ast);
record_error(ast);
return;
}
// must be a valid shape
ast_node *found_shape = sem_find_shape_def(shape_def, LIKEABLE_FOR_VALUES);
if (!found_shape) {
record_error(ast);
return;
}
// good to go, make our cursor, with storage.
shape_def->sem = found_shape->sem;
new_cursor_ast->sem = new_sem(SEM_TYPE_STRUCT | SEM_TYPE_VARIABLE | SEM_TYPE_VALUE_CURSOR | SEM_TYPE_HAS_SHAPE_STORAGE);
new_cursor_ast->sem->sptr = found_shape->sem->sptr;
new_cursor_ast->sem->name = new_cursor_name;
ast->sem = new_cursor_ast->sem;
symtab_add(current_variables, new_cursor_name, new_cursor_ast);
}
- EXTRACT: gets the pieces we need from the AST
- sem_verify_legal_variable_name: makes sure the cursor name is unique and doesn't hide a table name
- sem_find_shape_def: searches for something with a suitable name that has a shape
- we populate the shape_def node with the semantic type that we found
- new_sem: makes a new sem_node for the cursor variable with SEM_TYPE_STRUCT
- set the sptr field using the discovered shape

Note: shape_def->sem isn't actually used for anything, but it is helpful for debugging. If the AST is printed it shows the original unmodified semantic type, which can be helpful.
Briefly, sem_find_shape_def does these steps:
- if the right of the LIKE refers to procedure arguments (e.g. C LIKE Foo ARGUMENTS), get the args of the named procedure and use them as a shape
- if the right is a local or global variable, and it's a cursor, use the shape of that cursor for the new cursor
- if the right is the name of an argument bundle, use the shape of the bundle
  - e.g. in CREATE PROC Foo(p1 like Person, p2 like Person), p1 and p2 are the names of argument bundles shaped like Person
- if the right is the name of a table or view, use that shape
- if the right is the name of a procedure with a structure result, use that shape
- if it's none of these, produce an error
This is the primary source of shape reuse. Let's look at how we might use that.
Suppose we want to write a procedure that inserts a row into the table Foo. We could certainly list the columns of Foo as arguments, like this:
CREATE PROC InsertIntoFoo(id integer, t text, r real, b blob)
BEGIN
INSERT INTO Foo(id, t, r, b) VALUES(id, t, r, b);
END;
But that approach is going to get a lot less exciting when there are lots of columns, and it will increasingly become a maintenance headache.
Compare that with the following:
CREATE PROC InsertIntoFoo(row LIKE Foo)
BEGIN
INSERT INTO Foo FROM row;
END;
Those two versions of InsertIntoFoo compile into the same code. The semantic analyzer expands (row LIKE Foo) into (row_id integer, row_t text, row_r real, row_b blob) and then replaces FROM row with (row_id, row_t, row_r, row_b). In both cases it simply looked up the shape using sem_find_shape_def and then altered the AST to the canonical pattern. This kind of "shape sugar" is all over CQL and greatly increases maintainability while eliminating common errors. The most common operation is simply to expand a "shape" into a list of arguments or columns (with or without type names). SQLite doesn't know any of this shape magic, so by the time SQLite sees the code it has to look "normal" -- the shapes are all resolved.
Join Types
The last of the type-building data structures is the join type. Recall that we have this shape:
// for the data type of (parts of) the FROM clause
// sometimes I refer to it as a "joinscope"
typedef struct sem_join {
uint32_t count; // count of table/views in the join
CSTR *names; // names of the table/view
struct sem_struct **tables; // struct type of each table/view
} sem_join;
This is an array of named structure types, which is exactly what you get when you do something like this:
select * from T1 INNER JOIN T2;
The result has all of the columns of T1 and all of the columns of T2. They can be referred to with scoped names like T1.x, which means "find the sptr corresponding to the name T1, then within that structure find the column named x". In general, when we join, we take a jptr on the left and concatenate it with a jptr on the right. For all this to work we have to start somewhere, usually single tables.
As we saw, when we make a table we use sem_join_from_sem_struct to make its initial jptr. Let's have a look at that now:
// Create a base join type from a single struct.
static sem_join *sem_join_from_sem_struct(sem_struct *sptr) {
sem_join *jptr = new_sem_join(1);
jptr->names[0] = sptr->struct_name;
jptr->tables[0] = new_sem_struct_strip_table_flags(sptr);
return jptr;
}
It doesn't get much simpler than the above, here are the steps briefly:
- new_sem_join: gives us an empty sem_join with room for 1 table
- we use the struct name for the name and the table's sptr for the shape
- new_sem_struct_strip_table_flags: copies the table's sptr keeping only the essential flags: SEM_TYPE_HIDDEN_COL, SEM_TYPE_NOTNULL, and SEM_TYPE_SENSITIVE

The other flags (e.g. SEM_TYPE_PK) have no value in doing type checking and were only needed to help validate the table itself. Those extra flags would be harmless, but they would also contaminate all of the debug output, so they are stripped. As a result, the type of columns as they appear in, say, SELECT statements is simpler than how they appear in a CREATE TABLE statement.

When we need to create a new join type we simply (*) make a new sem_join that is the concatenation of the left and right sides of the join operation.
- some join types change the nullability of columns, like LEFT JOIN, so we have to handle that too
- the names of the tables in the new concatenated joinscope have to be unambiguous, so there is also error checking to do
- but basically it's just a concatenation
Importantly, we call the thing a "joinscope" because it creates a namespace. When we are evaluating names inside of the FROM clause, or even later in, say, a WHERE clause, the joinscope that we have created so far controls the table.column combinations that you can use in expressions. This changes again when there is a subquery, so the joinscopes can be pushed and popped as needed.
By way of example, you'll see these two patterns in the code:
PUSH_JOIN(from_scope, select_from_etc->sem->jptr);
error = sem_select_orderby(select_orderby);
POP_JOIN();
- PUSH_JOIN: use the jptr from the FROM clause to put things back in scope for the ORDER BY clause
PUSH_JOIN_BLOCK();
sem_numeric_expr(ast->left, ast, "LIMIT", SEM_EXPR_CONTEXT_LIMIT);
POP_JOIN();
- PUSH_JOIN_BLOCK: causes the name search to stop -- nothing deeper in the stack is searched
- in this case we do not allow LIMIT expressions to see any joinscopes, as they may not use any columns
  - even if the LIMIT clause is appearing in a subquery it can't refer to columns in the parent query
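A sketch of what this blocks (with a hypothetical table t):

CREATE TABLE t(x INTEGER);

-- error: the LIMIT expression is analyzed with PUSH_JOIN_BLOCK in effect,
-- so no table.column names are visible to it
SELECT x FROM t LIMIT x;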
Schema Regions
We touched briefly on schema regions earlier in this section. The purpose and language for regions is described more fully in Chapter 10 of the Guide. In this section we'll deal with how they are implemented and what you should expect to see in the code.
When a region declaration is found this method is used:
// A schema region is a partitioning of the schema such that it
// only uses objects in the same partition or one of its declared
// dependencies. One schema region may be upgraded independently
// from any others (assuming they happen such that dependents are done first).
// Here we validate:
// * the region name is unique
// * the dependencies (if any) are unique and exist
static void sem_declare_schema_region_stmt(ast_node *ast) { ... }
The general rules are described in the comment, but effectively it accumulates the list of the declared region's dependencies. Sometimes these are called the antecedent regions. Since a region can only depend on regions that have already been declared, it's not possible to make any cycles. Regions are declared before you put anything into them.
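For example, dependencies are stated at declaration time (a sketch with hypothetical region names):

@declare_schema_region base;

-- feature may depend on base because base was declared first;
-- cycles are therefore impossible by construction
@declare_schema_region feature USING base;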
Pieces of schema or procedures (or anything really) can go into a region by putting that code inside a begin/end pair for the named region. Like so:
@begin_schema_region your_region;
-- your stuff
@end_schema_region;
Now whatever happens to be in "your stuff" is:
- limited to seeing only the things that your_region is allowed to see, and
- contributes its contents to your_region, thereby limiting how others will be able to use "your stuff"
To see how this happens, let's have a look at sem_begin_schema_region_stmt.
// Entering a schema region makes all the objects that follow part of that
// region. It also means that all the contained objects must refer to
// only pieces of schema that are in the same region or a dependent region.
// Here we validate that the region we are entering is in fact a valid region
// and that there isn't already a schema region.
static void sem_begin_schema_region_stmt(ast_node * ast) {
Contract(is_ast_begin_schema_region_stmt(ast));
EXTRACT_STRING(name, ast->left);
// @BEGIN_SCHEMA_REGION name
if (!verify_schema_region_out_of_proc(ast)) {
record_error(ast);
return;
}
if (current_region) {
report_error(ast, "CQL0246: schema regions do not nest; end the current region before starting a new one", NULL);
record_error(ast);
return;
}
ast_node *region = find_region(name);
if (!region) {
report_error(ast->left, "CQL0244: unknown schema region", name);
record_error(ast);
return;
}
// Get the canonical name of the region (case adjusted)
Contract(is_region(region));
EXTRACT_STRING(region_name, region->left);
// we already know we are not in a region
Invariant(!current_region_image);
current_region_image = symtab_new();
sem_accumulate_public_region_image(current_region_image, region_name);
// this is the one and only text pointer value for this region
current_region = region_name;
record_ok(ast);
}
We see these basic steps:
- EXTRACT: gets the region name
- verify_schema_region_out_of_proc: makes sure we are out of any procedure (we have to be at the top level); errors if in a procedure
- current_region: is tested to make sure we are not already in a region (no nesting); errors if already in a region
- find_region: is used to find the region AST by name; errors if the region name isn't valid
- EXTRACT: is used again to get the canonical name of the region
  - you could write @begin_schema_region YoUr_ReGION; but we want the canonical name your_region, as it was declared
- symtab_new: creates a new symbol table, current_region_image
- sem_accumulate_public_region_image: populates current_region_image by recursively walking this region, adding the names of all the regions we find along the way
  - note the regions form a DAG, so we might find the same name twice; we can stop if we find a region that is already in the image symbol table
- current_region: set it to the new current region
Now we're all set up.
- We can use current_region to set the region in the sem_node of anything we encounter
- We can use current_region_image to quickly see if we are allowed to use any given region
  - if it's in the symbol table we can use it
Recall that at the end of sem_create_table_stmt we do this:
ast->sem = new_sem(SEM_TYPE_STRUCT);
ast->sem->sptr = sptr;
ast->sem->jptr = sem_join_from_sem_struct(sptr);
ast->sem->region = current_region;
That should make a lot more sense now.
When doing the symmetric check in sem_validate_object_ast_in_current_region we see this pattern:
// Validate whether or not an object is usable with a schema region. The object
// can only be a table, view, trigger or index.
static bool_t sem_validate_object_ast_in_current_region(
CSTR name,
ast_node *table_ast,
ast_node *err_target,
CSTR msg)
{
// We're not in a region, so no validation is needed: a statement outside of
// any region can reference schema in any region.
if (!current_region) {
return true;
}
if (table_ast->sem && table_ast->sem->region) {
// if we have a current region then the image is always computed!
Invariant(current_region_image);
if (!symtab_find(current_region_image, table_ast->sem->region)) {
// The target region is not accessible from this region
...
return false;
}
} else {
// while in schema region '%s', accessing an object that isn't in a region is invalid
...
return false;
}
return true;
}
I've elided some of the code here, but only the part that generates error messages. The essential logic is:
- if we are not in a region we can access anything
- if we're in a region then...
  - the thing we're trying to access must also be in a region, and
  - that region must be in current_region_image
  - otherwise, we can't access it
This is enough to do all the region validation we need.
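A sketch of the error case (hypothetical regions r1 and r2, where r2 does not declare r1 as a dependency):

@declare_schema_region r1;
@declare_schema_region r2;

@begin_schema_region r1;
CREATE TABLE t1(id INTEGER);
@end_schema_region;

@begin_schema_region r2;
-- error: t1 is in r1, and r1 is not in r2's region image
CREATE VIEW v1 AS SELECT * FROM t1;
@end_schema_region;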
Results of Semantic Analysis
Semantic analysis leaves a lot of global state ready for the remaining stages to harvest. If the state is defined in sem.h then it's ok to harvest. Here we'll highlight some of the most important things you can use in later passes. These are heavily used in the code generators.
cql_data_decl( struct list_item *all_tables_list );
cql_data_decl( struct list_item *all_functions_list );
cql_data_decl( struct list_item *all_views_list );
cql_data_decl( struct list_item *all_indices_list );
cql_data_decl( struct list_item *all_triggers_list );
cql_data_decl( struct list_item *all_regions_list );
cql_data_decl( struct list_item *all_ad_hoc_list );
cql_data_decl( struct list_item *all_select_functions_list );
cql_data_decl( struct list_item *all_enums_list );
These linked lists are authoritative; they let you easily enumerate all the objects of the specified type. For instance, if you wanted to do some validation of all indices, you could simply walk all_indices_list.
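Such a walk might look like this (a sketch assuming, as the usage in the code generators suggests, that list_item pairs an ast pointer with a next pointer):

for (list_item *item = all_indices_list; item; item = item->next) {
  ast_node *index_ast = item->ast;
  // ... validate this index or emit code for it ...
}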
cql_noexport ast_node *find_proc(CSTR name);
cql_noexport ast_node *find_region(CSTR name);
cql_noexport ast_node *find_func(CSTR name);
cql_noexport ast_node *find_table_or_view_even_deleted(CSTR name);
cql_noexport ast_node *find_enum(CSTR name);
These functions give you access to the core name tables (which are still valid!) so that you can look up procedures, functions, tables, etc. by name.
Finally, information about all the schema annotations is invaluable for building schema upgraders. These two buffers hold dense arrays of annotation records as shown below.
cql_data_decl( bytebuf *schema_annotations );
cql_data_decl( bytebuf *recreate_annotations );
typedef struct recreate_annotation {
CSTR target_name; // the name of the target
CSTR group_name; // group name or "" if no group (not null, safe to sort)
ast_node *target_ast; // top level target (table, view, or index)
ast_node *annotation_ast; // the actual annotation
int32_t ordinal; // when sorting we want to use the original order (reversed actually) within a group
} recreate_annotation;
typedef struct schema_annotation {
int32_t version; // the version number (always > 0)
ast_node *target_ast; // top level target (table, view, or index)
CSTR target_name; // the name of the target
uint32_t annotation_type; // one of the codes below for the type of annotation
ast_node *annotation_ast; // the actual annotation
int32_t column_ordinal; // -1 if not a column
ast_node *column_ast; // a particular column if column annotation
} schema_annotation;
// Note: schema annotations are processed in the indicated order: the numbers matter!
#define SCHEMA_ANNOTATION_INVALID 0
#define SCHEMA_ANNOTATION_FIRST 1
#define SCHEMA_ANNOTATION_UNSUB 1
#define SCHEMA_ANNOTATION_CREATE_TABLE 2
#define SCHEMA_ANNOTATION_CREATE_COLUMN 3
#define SCHEMA_ANNOTATION_DELETE_TRIGGER 4
#define SCHEMA_ANNOTATION_DELETE_VIEW 5
#define SCHEMA_ANNOTATION_DELETE_INDEX 6
#define SCHEMA_ANNOTATION_DELETE_COLUMN 7
#define SCHEMA_ANNOTATION_DELETE_TABLE 8
#define SCHEMA_ANNOTATION_AD_HOC 9
#define SCHEMA_ANNOTATION_RESUB 10
#define SCHEMA_ANNOTATION_LAST 10
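For instance, a schema fragment like the following sketch is the kind of thing that produces these records:

CREATE TABLE foo(
  id INTEGER,
  name TEXT @create(2)  -- a SCHEMA_ANNOTATION_CREATE_COLUMN at version 2
) @create(1);           -- a SCHEMA_ANNOTATION_CREATE_TABLE at version 1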
And of course, each "back end" is provided with the root of the AST so that it can also search and/or walk the AST in its own manner. We will see examples of this in later sections.
Recap
At present, there are nearly 20000 lines in sem.c and it would no doubt take more than 20000 lines of text to explain what they all do; the result would be less precise than the source code, and probably less readable. sem.c includes over 4000 lines of comments, and probably should have more. While there is a lot of code there, it's very readable and I encourage you to do just that -- read it -- to get your answers.
The point of this part of the Internals Guide isn't to fully explain all 400+ error checks in about as many semantic error checking functions; it is to showcase the key concepts shared by all of them. Things like:
- errors are reported largely in the AST and percolate up
- expressions and statements have general purpose dispatch logic for continuing a statement walk
- EXTRACT macros are used to keep the tree walk on track and correct in the face of changes
- regions are used for visibility
- versioning contributes to visibility
- nullability and sensitivity are tracked throughout using type bits
- type "kind" is managed by a simple string in the sem_node payload
- the three main payloads are sem_node for basic info, and sem_struct or sem_join for the non-unitary types
This isn't everything, but it should leave you well armed to begin your own exploration of sem.c.
Note: details on unsub/resub are forthcoming. This code is under development.