About Me
Blog
moon indicating dark mode
sun indicating light mode

Creating the Bolt Compiler: Part 4

An accessible introduction to type theory and implementing a type-checker

June 15, 2020

10 min read


What are types?

We’d discussed this briefly in the first part of the compiler series, however it’s worth revisiting this:

We implicitly classify words in a sentence as parts-of-speech (adjectives, nouns, verbs). Types work the same way, we classify program values based on the behaviour we’d like them to have. E.g. int for numbers that can be multiplied together, string for streams of characters that can be concatenated together.

The role of the type-checker is to prevent undesirable behaviour from happening - like concatenating ints or multiplying strings together - these operations make no sense so shouldn’t be allowed.

Most practical languages use types in one form or another. Some languages like Haskell are statically typed, meaning that the compiler checks the types at compile-time. You might argue, what about languages like JavaScript? There’s no compile-time type-checking there. These languages are dynamically typed, that is values are tagged with types and we check types at runtime, and throw a runtime error if types are mismatched. The runtime error acts as a very late safety check, stopping you just before you try to access a field of something that is undefined. So even JavaScript has types under the hood.

This all means that at some point, your fancy new language will likely implement some form of type-checking. With Bolt, I’ve chosen to implement static type-checking, i.e. the type-checking is another stage of our compiler.

Right now this all feels a little hand-wavey. What we need to do is take our high-level goal of “preventing undesirable behaviour” and formalise this. We want a list of rules to check a program against, and then if it passes those, we know our program is safe.

Enter type systems.

Type Systems

A type system consists of a set of rules, or typing judgements, that assigns a type tt to an expression ee. We start with seemingly obvious facts, and build up from there.

true:bool\vdash \mathbf{true} : bool

false:bool\vdash \mathbf{false} : bool

n:int\vdash n : int for any natural number nn.

The symbol \vdash is just maths notion for “it follows that”.

Typing environments

Thing is, looking at an expression on its own isn’t always enough for us to type-check an expression.

x: ?\vdash x :\ ?

What type does the variable xx have? Well it depends on what type it had when it was defined. Right, so these rules need to be updated to take that into account. To do this we’ll introduce a typing environment, that we’ll represent with a Γ\Gamma. This typing environment Γ\Gamma is a function that maps our variables to our types.

Γ(x)=tΓx: t\cfrac{\Gamma(x) = t} {\Gamma \vdash x :\ t}

Γ(x)=t\Gamma(x) = t just says that we look up variable xx in Γ\Gamma, and we find it has type tt.

How do we construct this typing environment Γ\Gamma? Hold your horses for just a second, there’s something we skimmed over in that last rule.

Inference rules

We stacked rules in what looks like some kind of fraction. What’s that all about?

This odd “fraction” is a way of representing deductive reasoning (like Sherlock Holmes but for programming languages). If everything on the top holds, then the bottom also holds.

Here’s another rule. This says that if we know e1e_1 and e2e_2 are ints, then if we add them together the result is also an int. This “stack” is just a formal way of saying that - we call it an inference rule.

Γe1:int      Γe2:intΓe1+e2: int\cfrac{\Gamma \vdash e_1 : int\ \ \ \ \ \ \Gamma \vdash e_2 : int } {\Gamma \vdash e_1 + e_2 :\ int}

Using these inference rules, we can stack together our pieces of evidence about different parts of the program, and then be able to reason about the whole program. This is how we build up our type system - we stack our rules.

Let’s combine what we’ve learnt so far in a little expression: x + 1 where x is an integer.

Γ(x)=intΓx: int      Γ1: intΓx+1: int\cfrac{ \cfrac{\cfrac{}{\Gamma(x) = int}} {\Gamma \vdash x :\ int} \ \ \ \ \ \ \cfrac{} {\Gamma \vdash 1 :\ int} } {\Gamma \vdash x + 1 :\ int}

It’s convention here to represent the base cases (axioms) with a line above them.

Γ1: int\cfrac{} {\Gamma \vdash 1 :\ int}

Since they have nothing on the top, technically everything on the top is true. So the bottom always holds.

If you look back to our example and flip it upside down, it kind of looks like a tree. It also lays out our reasoning, so acts as a proof - you can just follow the steps. Why is x+1 an int? Well x and 1 are ints? Why is x an int? Because… You get the idea. So what we’ve constructed here is called a proof tree.

Okay, let’s look at another rule. What about an if-else block?

if (something) {
do_one_thing
} else {
do_something_else
}

Well, the something has to be a boolean expression, and regardless of which branch we execute we ought to return the same type. This is because we’re statically checking our types - we haven’t actually run our program so in general we don’t know which branch we’ll execute. JavaScript and Python would allow different types, since they’re doing their type-checking at runtime, so they know which branch was chosen.

Γe1:bool      Γe2:t      Γe3:tΓif e1 {e2} else {e3}: t\cfrac{\Gamma \vdash e_1 : bool\ \ \ \ \ \ \Gamma \vdash e_2 : t \ \ \ \ \ \ \Gamma \vdash e_3 : t } {\Gamma \vdash \mathbf{if}\ e_1\ \{ e_2 \}\ \mathbf{else}\ \{ e_3 \} :\ t}

Typing the Overall Program

This post would end up being too long if I were to list all the rules, but you can work through each of the cases (they match the grammar we defined in the previous post). So let’s talk about how we’d type the overall program.

Initially Γ={}\Gamma = \{ \} - It’s empty as we haven’t defined any variables. We care if the program is well-typed (you can assign it a type) but we don’t care which type tt it evaluates to. So we essentially want a proof tree that shows:

{} program: t\{\}\ \vdash program :\ t

How do we add variables to Γ\Gamma? Through let declarations, i.e. let x = e

Γe1:t1     Γ,x:t1e2:t2Γlet x=e1; e2:t2\cfrac{\Gamma \vdash e_1 : t_1 \ \ \ \ \ \Gamma, x: t_1 \vdash e_2 : t_2} {\Gamma \vdash \mathbf{let}\ x = e_1 ;\ e_2 : t_2 }

Breaking this down, we get the type of the expression e1e_1 being assigned to xx. We then extend the environment with the mapping from xx to t1t_1 (which we write as Γ,x:t1\Gamma, x: t_1) and we can now use this extended environment to type the next expression e2e_2.

Type Checking vs Type Inference

One finer point: here we wrote let x = e. We could have equally written this as let x : t = e - where the programmer annotates x with type t. e.g. let x : int = 1 + 2.

These lead to different typing algorithms - in the first case the compiler infers that 1+2 has type int, and in the second case, the compiler has to check that 1+2 has type int (since we specified the type int of x).

As you might imagine, type inference means that the programmer has to write fewer type annotations, but it is much more complex for the compiler - it’s like filling a Sudoku from scratch versus checking a Sudoku solution is valid. OCaml and Haskell are examples of languages with type inference baked in.

In practice, most statically-typed languages do require some type annotations, but can infer some types (e.g. the auto keyword in C++). This is like completing a partially solved Sudoku puzzle and is much easier.

For Bolt, we’re going to infer types within a function or method definition, but require programmers to annotate the parameter and return types. This is a nice middle ground.

example_function.bolt
function int something(int x, bool y){
let z = ... // z's type is inferred
...
}

Okay, enough theory, let’s get to implementing this type-checker!

Implementing a Type-checker

Just give me the code!

You’ll need to clone the Bolt repository:

git clone https://github.com/mukul-rathi/bolt

The Bolt repository master branch contains a more complex type-checker, with support for inheritance, function/method overloading and generics. Each of these topics will get its own special post later in the series.

So instead, you’ll want to look at the simple-compiler-tutorial release. You can do this as follows:

git checkout simple-compiler-tutorial

This contains a stripped-back version of Bolt from earlier in the development processs. (View this online)

The folder we care about is src/frontend/typing.

Types in Bolt

Bolt has four main types: int, bool, void and user-defined classes. We represent these four options using a variant type type_expr in OCaml:

ast_types.mli
type type_expr = TEInt | TEClass of Class_name.t | TEVoid | TEBool

Annotating our AST with types

Let’s recap our Abstract Syntax Tree from the last part of the series:

parsed_ast.mli
type identifier =
| Variable of Var_name.t
| ObjField of Var_name.t * Field_name.t
type expr =
| Integer of loc * int
| Boolean of loc * bool
| Identifier of loc * identifier
| Constructor of loc * Class_name.t * constructor_arg list
| Let of loc * type_expr option * Var_name.t * expr
(** binds variable to expression (optional type annotation) *)
| Assign of loc * identifier * expr
| If of loc * expr * block_expr * block_expr
(** If ___ then ___ else ___ *)
...
and block_expr = Block of loc * expr list

This AST annotates each expression with loc - the line and position of the expression. In our type-checking phase, we’ll be checking the types of each of the possible expressions. We’ll want to store our results by directly annotating the AST, so the next compiler stage can view the types just by looking at the AST.

This AST gets the imaginative name typed_ast:

typed_ast.mli
type identifier =
| Variable of type_expr * Var_name.t
| ObjField of Class_name.t * Var_name.t * type_expr * Field_name.t
(** class of the object, type of field *)
type expr =
| Integer of loc * int (** no need to annotate as obviously TEInt *)
| Boolean of loc * bool (** no need to annotate as obviously TEBool *)
| Identifier of loc * identifier (** Type info associated with identifier *)
| Constructor of loc * type_expr * Class_name.t * constructor_arg list
| Let of loc * type_expr * Var_name.t * expr
| Assign of loc * type_expr * identifier * expr
| If of loc * type_expr * expr * block_expr * block_expr
(** the If-else type is that of the branch exprs *)
...
and block_expr =
| Block of loc * type_expr * expr list (** type is of the final expr in block *)

We don’t annotate obvious types, like for Integer and Boolean, but we annotate the type of the overall expression for other expressions e.g. the type returned by an if-else statement.

A good rule of thumb when annotating the AST is, what would the next stage need to be told about the program that it can’t guess from it being well-typed? For an if-else statement, if it is well-typed then the if-condition expression is clearly of type bool, but we’d need to be told the type of the branches.

A note on OCaml syntax

In this tutorial we’ll be using OCaml syntax. If you’re unfamiliar with this, the main gist is that we’ll:

a) Pattern match each of the cases. Here we have a variable x and we do different things based on each of its cases A, B

match x with
| A -> something
| B -> something_else

b) Use a Result monad: in essence, this has two values: Ok something and Error. We sequence each of the operations using the >>= operator - you don’t need to know anything about monads for this tutorial, just think of this as the same as normal expression sequencing with ;s, just that we’re using another operator to represent that earlier expressions might raise an error.

The Type Environment

Recall that we used our type environment Γ\Gamma to look up the types of variables. We can store this as a list of bindings (variable, type) pairs.

type_env.ml also contains a “environment” of helper functions that we’ll use in this type-checking phase. These are mostly uninteresting getter methods that you can look at in the repo.

type_env.mli
type type_binding = Var_name.t * type_expr
type type_env = type_binding list
(** A bunch of getter methods used in type-checking the core language *)
val get_var_type : Var_name.t -> type_env -> loc -> type_expr Or_error.t
...

It also includes a couple of functions that check we can assign to an identifier (it’s not const or the special identifier this) and that check we don’t have duplicate variable declarations (shadowing) in the same scope. These again are conceptually straightforward but are necessary just to cover edge cases.

For example:

type_env.ml
let check_identifier_assignable class_defns id env loc =
let open Result in
match id with
| Parsed_ast.Variable x ->
if x = Var_name.of_string "this" then
Error
(Error.of_string
(Fmt.str "%s Type error - Assigning expr to 'this'.@." (string_of_loc loc)))
else Ok ()
| Parsed_ast.ObjField (obj_name, field_name) ->
get_obj_class_defn obj_name env class_defns loc
>>= fun class_defn ->
get_class_field field_name class_defn loc
>>= fun (TField (modifier, _, _, _)) ->
if modifier = MConst then
Error
(Error.of_string
(Fmt.str "%s Type error - Assigning expr to a const field.@."
(string_of_loc loc)))
else Ok ()

Typing Expressions

This. This is the crux of the implementation. So if you’ve been skimming the post, here’s where you should pay attention.

What do we need to type-check an expression?

We need the class definitions and function definitions, in case we need to query the types of fields and function/method type signatures. We also need the expression itself, and the typing environment we’re using to type-check it.

What do we return? The typed expression, along with its type (we return type separately to make recursive calls more straightforward). Or we return an error, if the expression is not well-typed.

Our function type signature captures this exactly:

type_expr.mli
val type_expr :
Parsed_ast.class_defn list
-> Parsed_ast.function_defn list
-> Parsed_ast.expr
-> type_env
-> (Typed_ast.expr * type_expr) Or_error.t

In the second post in this series, I discussed why we’re using OCaml. This stage of the compiler is one where it really pays off.

For example to type an identifier, we pattern match based on whether it is a variable x or an object field x.f. If it is a variable then we get its type from the environment (we pass in loc as line+position info for error messages). If it returns a var_type without an error, then we return the type-annotated variable.

If it is an object field x.f, then we need to look up the type of object x in the env and get its corresponding class definition. We can then look up the type of field f in the class definition. Then we annotate the identifier with the two bits of type information we’ve just learnt: the class of the object, and the field type.

Our code does exactly that, with no boilerplate:

type_expr.ml
let type_identifier class_defns identifier env loc =
let open Result in
match identifier with
| Parsed_ast.Variable var_name ->
get_var_type var_name env loc
>>| fun var_type -> (Typed_ast.Variable (var_type, var_name), var_type)
| Parsed_ast.ObjField (var_name, field_name) ->
get_obj_class_defn var_name env class_defns loc
>>= fun (Parsed_ast.TClass (class_name, _, _, _) as class_defn) ->
get_class_field field_name class_defn loc
>>| fun (TField (_, field_type, _, _)) ->
(Typed_ast.ObjField (class_name, var_name, field_type, field_name), field_type)

Right, so let’s look at expressions. This again is clean code that just reads like our typing judgements. We have expr1 binop expr2 where expr1 and expr2 are being combined using some binary operator e.g. +.

Let’s remind ourselves of the rule:

Γe1:int      Γe2:intΓe1+e2: int\cfrac{\Gamma \vdash e_1 : int\ \ \ \ \ \ \Gamma \vdash e_2 : int } {\Gamma \vdash e_1 + e_2 :\ int}

What did this say? We type-check the subexpressions e1e_1 and e2e_2 first. If they’re both ints then the overall expression is an int.

Here’s the main conceptual jump: type-checking each of the expressions on the top of the judgements corresponds to recursively calling our type-checking function on the subexpressions. We then combine the results of these type-checking judgments using our rule.

So here, we type-check each of the subexpressions expr1 and expr2 (type_with_defns just makes the recursive call shorter).

type_expr.ml
let rec type_expr class_defns function_defns (expr : Parsed_ast.expr) env =
let open Result in
let type_with_defns = type_expr class_defns function_defns in
let type_block_with_defns = type_block_expr class_defns function_defns in match expr with
| Parsed_ast.BinOp (loc, bin_op, expr1, expr2) -> (
type_with_defns expr1 env
>>= fun (typed_expr1, expr1_type) ->
type_with_defns expr2 env
>>= fun (typed_expr2, expr2_type) ->

Recursive calls, check.

Next, because our code deals with more binary operators than just + we slightly generalise our typing judgement. Regardless of whether it is && + or >, the operands expr1 and expr2 have to have the same type. Then we check that this type is int for the + * / % arithmetic operator cases.

If so, then all Ok and we return TEInt as the type of the operand.

type_expr.ml
if not (expr1_type = expr2_type) then
Error ... (* can't have different types *)
else
let type_mismatch_error expected_type actual_type = ...
match bin_op with
| BinOpPlus | BinOpMinus | BinOpMult | BinOpIntDiv | BinOpRem ->
if expr1_type = TEInt then
Ok (Typed_ast.BinOp (loc, TEInt, bin_op, typed_expr1, typed_expr2), TEInt)
else type_mismatch_error TEInt expr1_type
...

As another example of a binary operator, let’s look at < <= >>> >=. These take in integers and return a bool, so we check that the operand expr1 has type TEInt and we return TEBool

type_expr.ml
| BinOpLessThan | BinOpLessThanEq | BinOpGreaterThan | BinOpGreaterThanEq ->
if expr1_type = TEInt then
Ok (Typed_ast.BinOp (loc, TEBool, bin_op, typed_expr1, typed_expr2), TEBool)
else type_mismatch_error TEInt expr1_type

Ok, still with me? This post is getting long, so let’s just look at if-else statements and let expressions, and then you can look at the rest of the code in the repo.

Again, let’s remind ourselves of our typing judgement for if-else.

Γe1:bool      Γe2:t      Γe3:tΓif e1 {e2} else {e3}: t\cfrac{\Gamma \vdash e_1 : bool\ \ \ \ \ \ \Gamma \vdash e_2 : t \ \ \ \ \ \ \Gamma \vdash e_3 : t } {\Gamma \vdash \mathbf{if}\ e_1\ \{ e_2 \}\ \mathbf{else}\ \{ e_3 \} :\ t}

That’s three expressions on top of the inference rule.

What do these expressions correspond to? Say it with me, recursive calls!

type_expr.ml
| Parsed_ast.If (loc, cond_expr, then_expr, else_expr) -> (
type_with_defns cond_expr env
>>= fun (typed_cond_expr, cond_expr_type) ->
type_block_with_defns then_expr env
>>= fun (typed_then_expr, then_expr_type) ->
type_block_with_defns else_expr env
>>= fun (typed_else_expr, else_expr_type) ->

Now we need to check that the returned types are what we expected, i.e. the branches have the same type, and the condition expression has type TEBool. If that’s the case, it’s all Ok and we can return the type of the branch.

type_expr.ml
if not (then_expr_type = else_expr_type) then
Error ...
else
match cond_expr_type with
| TEBool ->
Ok
( Typed_ast.If
(loc, then_expr_type, typed_cond_expr, typed_then_expr, typed_else_expr)
, then_expr_type )
| _ ->
Error ...

Right, final expression for this post, the let expression, which looks like this:

Γe1:t1     Γ,x:t1e2:t2Γlet x=e1; e2:t2\cfrac{\Gamma \vdash e_1 : t_1 \ \ \ \ \ \Gamma, x: t_1 \vdash e_2 : t_2} {\Gamma \vdash \mathbf{let}\ x = e_1 ;\ e_2 : t_2 }

Again, this requires two recursive calls, but note that the second typing judgement requires the type t1t_1 of the first - we need to pass in an extended type environment (after we type-checked e1e_1) to account for this.

We also have an additional requirement: we want our let expressions to be block-scoped.

block_scoped.bolt
if {
let x = ...
// can access x here
} else {
// shouldn't be able to access x here
}
// or here

So in essence, we want to only update the environment

  1. if we have a let expression,
  2. only for subsequent expressions in the block

We can encode this by pattern-matching on our block type-checking rule. If there are no subsequent expressions (i.e. we have no expressions (pattern-match on []) or just one expression (pattern-match on [expr]) in the block), then we don’t need to update our environment.

type_expr.ml
type_block_expr class_defns function_defns (Parsed_ast.Block (loc, exprs)) env =
...
>>= fun () ->
match exprs with
| [] -> Ok (Typed_ast.Block (loc, TEVoid, []), TEVoid) (* empty block has type void *)
| [expr] ->
type_with_defns expr env
>>| fun (typed_expr, expr_type) ->
(Typed_ast.Block (loc, expr_type, [typed_expr]), expr_type)

Only if we have at least two expressions left in our block (pattern-match on expr1 :: expr2 :: exprs), and the first of the two is a let expression do we update the environment for the rest of the block. We use this updated environment in a recursive call on expr2::exprs (the remaining expressions). We then combine the result of the typed blocks.

type_expr.ml
| expr1 :: expr2 :: exprs ->
type_with_defns expr1 env
>>= fun (typed_expr1, expr1_type) ->
(let updated_env =
match typed_expr1 with
| Typed_ast.Let (_, _, var_name, _) -> (var_name, expr1_type) :: env
| _ -> env in
type_block_with_defns (Parsed_ast.Block (loc, expr2 :: exprs)) updated_env)
>>| fun (Typed_ast.Block (_, _, typed_exprs), block_expr_type) ->
(Typed_ast.Block (loc, block_expr_type, typed_expr1 :: typed_exprs), block_expr_type)

Join me on this learning journey!

This summer I’m using my blog to teach the topics I’ve learnt this year. It’s a win-win - you get computer science tutorials and I get to share it with you!

Typing Class and Function Definitions

We’ve broken the back of the type-checking now. Let’s just wrap up by checking class and function definitions.

We’ll skip over the tedium of checking that there are no duplicate class definitions or duplicate field definitions in a class etc. (this is in the repo). Likewise, checking that the types annotated in fields and method/function type signatures are valids is just a matter of checking there is a corresponding class definition.

Unlike our main function, for other functions, we don’t actually start with an empty environment when type-checking the function body as we already know the types of some variables - the parameters to the function!

type_functions.ml
let init_env_from_params params =
List.map
~f:(function TParam (type_expr, param_name, _, _) -> (param_name, type_expr))
params
...
type_block_expr class_defns function_defns body_expr(init_env_from_params params)

We also need to check that the type of the result of the body is the return type (or if it is void then we don’t care about the type returned).

type_functions.ml
>>= fun (typed_body_expr, body_return_type) ->
(* We throw away returned expr if return type is void *)
if return_type = TEVoid || body_return_type = return_type then
Ok
(Typed_ast.TFunction
(func_name, maybe_borrowed_ret_ref, return_type, params, typed_body_expr))
else Error ...

For class methods, we can initialise the environment and check the return type in the same way. But we know the type of another special variable this - the class itself.

type_classes.ml
let init_env_from_method_params params class_name =
let param_env =
List.map
~f:(function TParam (type_expr, param_name, _, _) -> (param_name, type_expr))
params in
(Var_name.of_string "this", TEClass class_name) :: param_env
```

Where does this fit in the Bolt pipeline?

We’re two stages into the Bolt compiler pipeline - seen in the compile_program_ir function.

compile_program_ir.ml
let compile_program_ir ?(should_pprint_past = false) ?(should_pprint_tast = false)
?(should_pprint_dast = false) ?(should_pprint_drast = false)
?(should_pprint_fir = false) ?(ignore_data_races = false) ?compile_out_file lexbuf =
let open Result in
parse_program lexbuf
>>= maybe_pprint_ast should_pprint_past pprint_parsed_ast
>>= type_program
>>= maybe_pprint_ast should_pprint_tast pprint_typed_ast
>>= ...

Take Away: 3 Actionable Steps

If you’ve got this far, amazing job! The Bolt repo contains the full code listing with all the typing judgements for each case of the compiler.

Let’s recap what we’ve done so far:

  1. Define what properties in your expression you want to type-check. E.g. a condition expression has type bool.
  2. Formalise this with a typing judgement. Use the inference rule to reason about subexpressions.
  3. Map the subexpressions in the inference rule to recursive calls, then use the inference rule to combine their results.

Next up, we’ll talk about the dataflow analysis used to type-check our linear capabilities in Bolt. This is similar to how the Rust borrow checker uses “non-lexical lifetimes” to check borrowing.

Share This On Twitter

If you liked this post, please consider sharing it with your network. If you have any questions, tweet away and I’ll answer :) I also tweet when new posts drop!

PS: I also share helpful tips and links as I'm learning - so you get them well before they make their way into a post!

© Mukul Rathi 2020