### Creating the Bolt Compiler: Part 4

# An accessible introduction to type theory and implementing a type-checker

### June 15, 2020

### 10 min read

## What are types?

We’d discussed this briefly in the first part of the compiler series, however it’s worth revisiting this:

We implicitly classify words in a sentence as parts-of-speech (adjectives, nouns, verbs). Types work the **same way**, we classify program values based on the behaviour we’d like them to have. E.g. `int`

for numbers that can be multiplied together, `string`

for streams of characters that can be concatenated together.

The role of the type-checker is to prevent undesirable behaviour from happening - like concatenating `int`

s or multiplying `string`

s together - these operations make no sense so shouldn’t be allowed.

Most practical languages use types in one form or another. Some languages like Haskell are *statically* typed, meaning that the compiler checks the types at *compile-time*. You might argue, what about languages like JavaScript? There’s no compile-time type-checking there. These languages are *dynamically* typed, that is values are tagged with types and we check types at *runtime*, and throw a runtime error if types are mismatched. The runtime error acts as a very late safety check, stopping you just before you try to access a field of something that is `undefined`

. So even JavaScript has types under the hood.

This all means that at some point, your fancy new language will likely implement some form of type-checking. With Bolt, I’ve chosen to implement *static* type-checking, i.e. the type-checking is another stage of our compiler.

Right now this all feels a little hand-wavey. What we need to do is take our high-level goal of “preventing undesirable behaviour” and formalise this. We want a list of rules to check a program against, and then if it passes those, we know our program is safe.

Enter *type systems*.

## Type Systems

A type system consists of a set of rules, or *typing judgements*, that assigns a type $t$ to an expression $e$. We start with seemingly obvious facts, and build up from there.

$\vdash \mathbf{true} : bool$

$\vdash \mathbf{false} : bool$

$\vdash n : int$ for any natural number $n$.

The symbol $\vdash$ is just maths notion for “it follows that”.

### Typing environments

Thing is, looking at an expression on its own isn’t always enough for us to type-check an expression.

$\vdash x :\ ?$

What type does the variable $x$ have? Well it depends on what type it had when it was defined. Right, so these rules need to be updated to take that into account. To do this we’ll introduce a *typing environment*, that we’ll represent with a $\Gamma$. This typing environment $\Gamma$ is a function that maps our variables to our types.

$\cfrac{\Gamma(x) = t} {\Gamma \vdash x :\ t}$

$\Gamma(x) = t$ just says that we look up variable $x$ in $\Gamma$, and we find it has type $t$.

How do we construct this typing environment $\Gamma$? Hold your horses for just a second, there’s something we skimmed over in that last rule.

### Inference rules

We stacked rules in what looks like some kind of fraction. What’s that all about?

This odd “fraction” is a way of representing deductive reasoning (like Sherlock Holmes but for programming languages). If everything on the top holds, then the bottom also holds.

Here’s another rule. This says that if we know $e_1$ and $e_2$ are `int`

s, then if we add them together the result is also an `int`

. This “stack” is just a formal way of saying that - we call it an *inference* rule.

$\cfrac{\Gamma \vdash e_1 : int\ \ \ \ \ \ \Gamma \vdash e_2 : int } {\Gamma \vdash e_1 + e_2 :\ int}$

Using these *inference* rules, we can stack together our pieces of evidence about different parts of the program, and then be able to reason about the whole program. This is how we build up our type system - we *stack* our rules.

Let’s combine what we’ve learnt so far in a little expression: `x + 1`

where `x`

is an integer.

$\cfrac{ \cfrac{\cfrac{}{\Gamma(x) = int}} {\Gamma \vdash x :\ int} \ \ \ \ \ \ \cfrac{} {\Gamma \vdash 1 :\ int} } {\Gamma \vdash x + 1 :\ int}$

It’s convention here to represent the base cases (*axioms*) with a line above them.

$\cfrac{} {\Gamma \vdash 1 :\ int}$

Since they have nothing on the top, technically everything on the top is true. So the bottom **always** holds.

If you look back to our example and flip it upside down, it kind of looks like a tree. It also lays out our reasoning, so acts as a *proof* - you can just follow the steps. Why is `x+1`

an int? Well `x`

and `1`

are `int`

s? Why is `x`

an int? Because… You get the idea. So what we’ve constructed here is called a *proof tree*.

Okay, let’s look at another rule. What about an `if-else`

block?

if (something) {do_one_thing} else {do_something_else}

Well, the `something`

has to be a boolean expression, and regardless of which branch we execute we ought to return the same type. This is because we’re *statically* checking our types - we haven’t actually run our program so in general we don’t know which branch we’ll execute. JavaScript and Python would allow different types, since they’re doing their type-checking at runtime, so they know which branch was chosen.

$\cfrac{\Gamma \vdash e_1 : bool\ \ \ \ \ \ \Gamma \vdash e_2 : t \ \ \ \ \ \ \Gamma \vdash e_3 : t } {\Gamma \vdash \mathbf{if}\ e_1\ \{ e_2 \}\ \mathbf{else}\ \{ e_3 \} :\ t}$

### Typing the Overall Program

This post would end up being too long if I were to list all the rules, but you can work through each of the cases (they match the *grammar* we defined in the previous post). So let’s talk about how we’d type the overall program.

Initially $\Gamma = \{ \}$ - It’s empty as we haven’t defined any variables. We care if the program is *well-typed* (you can assign it a type) but we don’t care which type $t$ it evaluates to. So we essentially want a proof tree that shows:

$\{\}\ \vdash program :\ t$

How do we add variables to $\Gamma$? Through `let`

declarations, i.e. `let x = e`

$\cfrac{\Gamma \vdash e_1 : t_1 \ \ \ \ \ \Gamma, x: t_1 \vdash e_2 : t_2} {\Gamma \vdash \mathbf{let}\ x = e_1 ;\ e_2 : t_2 }$

Breaking this down, we get the type of the expression $e_1$ being assigned to $x$. We then extend the environment with the mapping from $x$ to $t_1$ (which we write as $\Gamma, x: t_1$) and we can now use this extended environment to type the next expression $e_2$.

### Type Checking vs Type Inference

One finer point: here we wrote `let x = e`

. We could have equally written this as `let x : t = e`

- where the programmer *annotates* `x`

with type `t`

. e.g. `let x : int = 1 + 2`

.

These lead to different typing algorithms - in the first case the compiler *infers* that `1+2`

has type `int`

, and in the second case, the compiler has to *check* that `1+2`

has type `int`

(since we specified the type `int`

of `x`

).

As you might imagine, type inference means that the programmer has to write fewer type annotations, but it is much more complex for the compiler - it’s like filling a Sudoku from scratch versus checking a Sudoku solution is valid. OCaml and Haskell are examples of languages with type inference baked in.

In practice, most statically-typed languages do require some type annotations, but can infer some types (e.g. the `auto`

keyword in C++). This is like completing a partially solved Sudoku puzzle and is much easier.

For Bolt, we’re going to infer types *within* a function or method definition, but require programmers to annotate the parameter and return types. This is a nice middle ground.

function int something(int x, bool y){let z = ... // z's type is inferred...}

Okay, enough theory, let’s get to implementing this type-checker!

## Implementing a Type-checker

### Just give me the code!

You’ll need to clone the Bolt repository:

git clone https://github.com/mukul-rathi/bolt

The Bolt repository `master`

branch contains a more complex type-checker, with support for inheritance, function/method overloading and generics. Each of these topics will get its own special post later in the series.

So instead, you’ll want to look at the `simple-compiler-tutorial`

release. You can do this as follows:

git checkout simple-compiler-tutorial

This contains a stripped-back version of Bolt from earlier in the development processs. (View this online)

The folder we care about is `src/frontend/typing`

.

### Types in Bolt

Bolt has four main types: `int`

, `bool`

, `void`

and user-defined classes. We represent these four options using a variant type `type_expr`

in OCaml:

type type_expr = TEInt | TEClass of Class_name.t | TEVoid | TEBool

### Annotating our AST with types

Let’s recap our Abstract Syntax Tree from the last part of the series:

type identifier =| Variable of Var_name.t| ObjField of Var_name.t * Field_name.ttype expr =| Integer of loc * int| Boolean of loc * bool| Identifier of loc * identifier| Constructor of loc * Class_name.t * constructor_arg list| Let of loc * type_expr option * Var_name.t * expr(** binds variable to expression (optional type annotation) *)| Assign of loc * identifier * expr| If of loc * expr * block_expr * block_expr(** If ___ then ___ else ___ *)...and block_expr = Block of loc * expr list

This AST annotates each expression with `loc`

- the line and position of the expression. In our type-checking phase, we’ll be checking the types of each of the possible expressions. We’ll want to store our results by *directly annotating* the AST, so the next compiler stage can view the types just by looking at the AST.

This AST gets the imaginative name `typed_ast`

:

type identifier =| Variable of type_expr * Var_name.t| ObjField of Class_name.t * Var_name.t * type_expr * Field_name.t(** class of the object, type of field *)type expr =| Integer of loc * int (** no need to annotate as obviously TEInt *)| Boolean of loc * bool (** no need to annotate as obviously TEBool *)| Identifier of loc * identifier (** Type info associated with identifier *)| Constructor of loc * type_expr * Class_name.t * constructor_arg list| Let of loc * type_expr * Var_name.t * expr| Assign of loc * type_expr * identifier * expr| If of loc * type_expr * expr * block_expr * block_expr(** the If-else type is that of the branch exprs *)...and block_expr =| Block of loc * type_expr * expr list (** type is of the final expr in block *)

We don’t annotate obvious types, like for `Integer`

and `Boolean`

, but we annotate the type of the overall expression for other expressions e.g. the type returned by an `if-else`

statement.

A good rule of thumb when annotating the AST is, what would the next stage need to be told about the program that it can’t guess from it being well-typed? For an `if-else`

statement, if it is well-typed then the if-condition expression is clearly of type `bool`

, but we’d need to be told the type of the branches.

### A note on OCaml syntax

In this tutorial we’ll be using OCaml syntax. If you’re unfamiliar with this, the main gist is that we’ll:

**a)** Pattern match each of the cases. Here we have a variable `x`

and we do different things based on each of its cases `A`

, `B`

match x with| A -> something| B -> something_else

**b)** Use a Result monad: in essence, this has two values: `Ok something`

and `Error`

. We sequence each of the operations using the `>>=`

operator - you don’t need to know anything about monads for this tutorial, just think of this as the same as normal expression sequencing with `;`

s, just that we’re using another operator to represent that earlier expressions might raise an error.

### The Type Environment

Recall that we used our type environment $\Gamma$ to look up the types of variables. We can store this as a list of *bindings* (variable, type) pairs.

`type_env.ml`

also contains a “environment” of helper functions that we’ll use in this type-checking phase. These are mostly uninteresting getter methods that you can look at in the repo.

type type_binding = Var_name.t * type_exprtype type_env = type_binding list(** A bunch of getter methods used in type-checking the core language *)val get_var_type : Var_name.t -> type_env -> loc -> type_expr Or_error.t...

It also includes a couple of functions that check we can assign to an identifier (it’s not `const`

or the special identifier `this`

) and that check we don’t have duplicate variable declarations (shadowing) in the same scope. These again are conceptually straightforward but are necessary just to cover edge cases.

For example:

let check_identifier_assignable class_defns id env loc =let open Result inmatch id with| Parsed_ast.Variable x ->if x = Var_name.of_string "this" thenError(Error.of_string(Fmt.str "%s Type error - Assigning expr to 'this'.@." (string_of_loc loc)))else Ok ()| Parsed_ast.ObjField (obj_name, field_name) ->get_obj_class_defn obj_name env class_defns loc>>= fun class_defn ->get_class_field field_name class_defn loc>>= fun (TField (modifier, _, _, _)) ->if modifier = MConst thenError(Error.of_string(Fmt.str "%s Type error - Assigning expr to a const field.@."(string_of_loc loc)))else Ok ()

### Typing Expressions

This. This is the *crux* of the implementation. So if you’ve been skimming the post, here’s where you should pay attention.

What do we need to type-check an expression?

We need the class definitions and function definitions, in case we need to query the types of fields and function/method type signatures. We also need the expression itself, and the typing environment we’re using to type-check it.

What do we return? The typed expression, along with its type (we return type separately to make recursive calls more straightforward). Or we return an error, if the expression is not well-typed.

Our function type signature captures this exactly:

val type_expr :Parsed_ast.class_defn list-> Parsed_ast.function_defn list-> Parsed_ast.expr-> type_env-> (Typed_ast.expr * type_expr) Or_error.t

In the second post in this series, I discussed why we’re using OCaml. This stage of the compiler is one where it really pays off.

For example to type an identifier, we pattern match based on whether it is a variable `x`

or an object field `x.f`

. If it is a variable then we get its type from the environment (we pass in `loc`

as line+position info for error messages). If it returns a `var_type`

without an error, then we return the type-annotated variable.

If it is an object field `x.f`

, then we need to look up the type of object `x`

in the `env`

and get its corresponding class definition. We can then look up the type of field `f`

in the class definition. Then we annotate the identifier with the two bits of type information we’ve just learnt: the class of the object, and the field type.

Our code does exactly that, with no boilerplate:

let type_identifier class_defns identifier env loc =let open Result inmatch identifier with| Parsed_ast.Variable var_name ->get_var_type var_name env loc>>| fun var_type -> (Typed_ast.Variable (var_type, var_name), var_type)| Parsed_ast.ObjField (var_name, field_name) ->get_obj_class_defn var_name env class_defns loc>>= fun (Parsed_ast.TClass (class_name, _, _, _) as class_defn) ->get_class_field field_name class_defn loc>>| fun (TField (_, field_type, _, _)) ->(Typed_ast.ObjField (class_name, var_name, field_type, field_name), field_type)

Right, so let’s look at expressions. This again is clean code that just reads like our typing judgements. We have `expr1 binop expr2`

where `expr1`

and `expr2`

are being combined using some binary operator e.g. `+`

.

Let’s remind ourselves of the rule:

$\cfrac{\Gamma \vdash e_1 : int\ \ \ \ \ \ \Gamma \vdash e_2 : int } {\Gamma \vdash e_1 + e_2 :\ int}$

What did this say? We type-check the subexpressions $e_1$ and $e_2$ first. If they’re both `int`

s then the overall expression is an `int`

.

Here’s the main conceptual jump: type-checking each of the expressions on the top of the judgements corresponds to *recursively* calling our type-checking function on the subexpressions. We then combine the results of these type-checking judgments using our rule.

So here, we type-check each of the subexpressions `expr1`

and `expr2`

(`type_with_defns`

just makes the recursive call shorter).

let rec type_expr class_defns function_defns (expr : Parsed_ast.expr) env =let open Result inlet type_with_defns = type_expr class_defns function_defns inlet type_block_with_defns = type_block_expr class_defns function_defns in match expr with| Parsed_ast.BinOp (loc, bin_op, expr1, expr2) -> (type_with_defns expr1 env>>= fun (typed_expr1, expr1_type) ->type_with_defns expr2 env>>= fun (typed_expr2, expr2_type) ->

Recursive calls, check.

Next, because our code deals with more binary operators than just `+`

we slightly generalise our typing judgement. Regardless of whether it is `&&`

`+`

or `>`

, the operands `expr1`

and `expr2`

have to have the same type. Then we check that this type is `int`

for the `+ * / %`

arithmetic operator cases.

If so, then all `Ok`

and we return `TEInt`

as the type of the operand.

if not (expr1_type = expr2_type) thenError ... (* can't have different types *)elselet type_mismatch_error expected_type actual_type = ...match bin_op with| BinOpPlus | BinOpMinus | BinOpMult | BinOpIntDiv | BinOpRem ->if expr1_type = TEInt thenOk (Typed_ast.BinOp (loc, TEInt, bin_op, typed_expr1, typed_expr2), TEInt)else type_mismatch_error TEInt expr1_type...

As another example of a binary operator, let’s look at `< <= >>> >=`

. These take in integers and return a bool, so we check that the operand `expr1`

has type `TEInt`

and we return `TEBool`

| BinOpLessThan | BinOpLessThanEq | BinOpGreaterThan | BinOpGreaterThanEq ->if expr1_type = TEInt thenOk (Typed_ast.BinOp (loc, TEBool, bin_op, typed_expr1, typed_expr2), TEBool)else type_mismatch_error TEInt expr1_type

Ok, still with me? This post is getting long, so let’s just look at `if-else`

statements and `let`

expressions, and then you can look at the rest of the code in the repo.

Again, let’s remind ourselves of our typing judgement for `if-else`

.

$\cfrac{\Gamma \vdash e_1 : bool\ \ \ \ \ \ \Gamma \vdash e_2 : t \ \ \ \ \ \ \Gamma \vdash e_3 : t } {\Gamma \vdash \mathbf{if}\ e_1\ \{ e_2 \}\ \mathbf{else}\ \{ e_3 \} :\ t}$

That’s three expressions on top of the inference rule.

What do these expressions correspond to? Say it with me, *recursive calls*!

| Parsed_ast.If (loc, cond_expr, then_expr, else_expr) -> (type_with_defns cond_expr env>>= fun (typed_cond_expr, cond_expr_type) ->type_block_with_defns then_expr env>>= fun (typed_then_expr, then_expr_type) ->type_block_with_defns else_expr env>>= fun (typed_else_expr, else_expr_type) ->

Now we need to check that the returned types are what we expected, i.e. the branches have the same type, and the condition expression has type `TEBool`

. If that’s the case, it’s all `Ok`

and we can return the type of the branch.

if not (then_expr_type = else_expr_type) thenError ...elsematch cond_expr_type with| TEBool ->Ok( Typed_ast.If(loc, then_expr_type, typed_cond_expr, typed_then_expr, typed_else_expr), then_expr_type )| _ ->Error ...

Right, final expression for this post, the `let`

expression, which looks like this:

$\cfrac{\Gamma \vdash e_1 : t_1 \ \ \ \ \ \Gamma, x: t_1 \vdash e_2 : t_2} {\Gamma \vdash \mathbf{let}\ x = e_1 ;\ e_2 : t_2 }$

Again, this requires two recursive calls, but note that the second typing judgement requires the type $t_1$ of the first - we need to pass in an extended type environment (after we type-checked $e_1$) to account for this.

We also have an additional requirement: we want our `let`

expressions to be *block-scoped*.

if {let x = ...// can access x here} else {// shouldn't be able to access x here}// or here

So in essence, we want to only update the environment

- if we have a
`let`

expression, - only for subsequent expressions in the block

We can encode this by *pattern-matching* on our block type-checking rule. If there are no subsequent expressions (i.e. we have no expressions (pattern-match on `[]`

) or just one expression (pattern-match on `[expr]`

) in the block), then we don’t need to update our environment.

type_block_expr class_defns function_defns (Parsed_ast.Block (loc, exprs)) env =...>>= fun () ->match exprs with| [] -> Ok (Typed_ast.Block (loc, TEVoid, []), TEVoid) (* empty block has type void *)| [expr] ->type_with_defns expr env>>| fun (typed_expr, expr_type) ->(Typed_ast.Block (loc, expr_type, [typed_expr]), expr_type)

Only if we have at least two expressions left in our block (pattern-match on `expr1 :: expr2 :: exprs`

), and the first of the two is a `let`

expression do we update the environment for the rest of the block. We use this updated environment in a recursive call on `expr2::exprs`

(the remaining expressions). We then combine the result of the typed blocks.

| expr1 :: expr2 :: exprs ->type_with_defns expr1 env>>= fun (typed_expr1, expr1_type) ->(let updated_env =match typed_expr1 with| Typed_ast.Let (_, _, var_name, _) -> (var_name, expr1_type) :: env| _ -> env intype_block_with_defns (Parsed_ast.Block (loc, expr2 :: exprs)) updated_env)>>| fun (Typed_ast.Block (_, _, typed_exprs), block_expr_type) ->(Typed_ast.Block (loc, block_expr_type, typed_expr1 :: typed_exprs), block_expr_type)

### Join me on this learning journey!

This summer I’m using my blog to teach the topics I’ve learnt this year. It’s a win-win - you get computer science tutorials and I get to share it with you!

### Typing Class and Function Definitions

We’ve broken the back of the type-checking now. Let’s just wrap up by checking class and function definitions.

We’ll skip over the tedium of checking that there are no duplicate class definitions or duplicate field definitions in a class etc. (this is in the repo). Likewise, checking that the types annotated in fields and method/function type signatures are valids is just a matter of checking there is a corresponding class definition.

Unlike our main function, for other functions, we don’t actually start with an empty environment when type-checking the function body as we *already* know the types of some variables - the parameters to the function!

let init_env_from_params params =List.map~f:(function TParam (type_expr, param_name, _, _) -> (param_name, type_expr))params...type_block_expr class_defns function_defns body_expr(init_env_from_params params)

We also need to check that the type of the result of the body is the return type (or if it is `void`

then we don’t care about the type returned).

>>= fun (typed_body_expr, body_return_type) ->(* We throw away returned expr if return type is void *)if return_type = TEVoid || body_return_type = return_type thenOk(Typed_ast.TFunction(func_name, maybe_borrowed_ret_ref, return_type, params, typed_body_expr))else Error ...

For class methods, we can initialise the environment and check the return type in the same way. But we know the type of another special variable `this`

- the class itself.

let init_env_from_method_params params class_name =let param_env =List.map~f:(function TParam (type_expr, param_name, _, _) -> (param_name, type_expr))params in(Var_name.of_string "this", TEClass class_name) :: param_env```

## Where does this fit in the Bolt pipeline?

We’re two stages into the Bolt compiler pipeline - seen in the `compile_program_ir`

function.

let compile_program_ir ?(should_pprint_past = false) ?(should_pprint_tast = false)?(should_pprint_dast = false) ?(should_pprint_drast = false)?(should_pprint_fir = false) ?(ignore_data_races = false) ?compile_out_file lexbuf =let open Result inparse_program lexbuf>>= maybe_pprint_ast should_pprint_past pprint_parsed_ast>>= type_program>>= maybe_pprint_ast should_pprint_tast pprint_typed_ast>>= ...

## Take Away: 3 Actionable Steps

If you’ve got this far, amazing job! The Bolt repo contains the full code listing with all the typing judgements for each case of the compiler.

Let’s recap what we’ve done so far:

- Define what properties in your expression you want to type-check. E.g. a condition expression has type
`bool`

. - Formalise this with a typing judgement. Use the
*inference*rule to reason about subexpressions. - Map the subexpressions in the inference rule to
*recursive*calls, then use the inference rule to combine their results.

Next up, we’ll talk about the dataflow analysis used to type-check our linear capabilities in Bolt. This is similar to how the Rust borrow checker uses “non-lexical lifetimes” to check borrowing.