In part 1 and part 2 of this series, I introduced the project and we wrote some Roc code to load an input file and save the compiled result to a different file.

Note: Other articles in this series are collected here.

However, we are a long way from actually having that compiled result available! This article introduces the phases involved in writing a compiler and focuses on implementing the first phase, known as lexical analysis or tokenizing.

But first, let’s briefly cover what a compiler actually does.

Reminder: You are reading content that took a great deal of effort to craft, compose, and debug. If you appreciate this work, consider supporting me on Patreon or GitHub.

Dude, What Does a Compiler Do?

Real-world compilers are incredibly complex pieces of software. But the basics are fairly simple, and follow a linear process of steps, some of which are optional:

  1. Lexical Analysis or Tokenizing splits the input program (a string) into meaningful tokens. The result is just a flat, one-dimensional array of token objects. The relationships between tokens are not understood at this point.
  2. Syntax Analysis or Parsing takes that token stream as input, ensures it follows the appropriate grammar rules, and constructs an Abstract Syntax Tree (AST).
  3. Transformation converts the AST into a different AST. Whether you are transpiling (source code in one language to source code in a different language) or compiling (source code in one language to machine code or byte code), typically the input AST will be different from the output. Even though WAT is designed to translate directly to Wasm, the AST is actually drastically different, so we’ll need this step.
  4. Semantic Analysis performs program validation (such as type checking) to ensure that the Abstract Syntax Tree is well-formed (for some definition of well-formed). The Wasm specification actually defines exactly what kind of type checking a true WAT compiler should implement, but for the purposes of this series, I intend to skip it.
  5. Code Generation converts the abstract syntax tree to the desired output representation. When compiling, this would typically be a sequence of bytes in machine code or byte code. When transpiling it will be source code in a new target programming language.
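To make the data flow concrete, here is a rough sketch of how those phases might eventually compose in Roc. This is not code we’ll write today (it wouldn’t even compile yet): tokenize is the only function this article builds, and parse, transform, and generate are hypothetical names for phases covered in later articles.

# hypothetical sketch only: parse, transform, and generate don't exist yet,
# and semantic analysis is skipped entirely for this series
compile : Str -> List U8
compile = \input ->
    input
    |> tokenize     # Str -> List Token (this article)
    |> parse        # List Token -> Ast (hypothetical, later)
    |> transform    # Ast -> WasmAst (hypothetical, later)
    |> generate     # WasmAst -> List U8 (hypothetical, later)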

Each of these steps is both easier and harder than it sounds. The basic idea for each is easy to describe, but every different piece of syntax needs to be handled according to its own rules, so the total amount of work involved is not small. I’m too stubborn to shrink my input example, though, so we’re going to compile the whole Hello World example, which (as a reminder) looks like this in WAT syntax:

(module
    (import "wasi_snapshot_preview1" "fd_write" (func $fd_write (param i32 i32 i32 i32) (result i32)))

    (memory 1)
    (export "memory" (memory 0))
    (export "_start" (func $main))

    (data (offset (i32.const 8)) "hello world")

    (func $main
        (i32.const 0)
        (i32.const 8)
        (i32.store)

        (i32.const 4)
        (i32.const 11)
        (i32.store)

        (i32.const 1)
        (i32.const 0)
        (i32.const 1)
        (i32.const 20)
        (call $fd_write)
        drop
    )
)

Tokenizing

The first step, and the focus of this article, is tokenizing. Many compilers combine tokenizing and parsing into a single step, and indeed any sane person would probably start with the roc-parser package. However, for simplicity and education, we’ll do them separately.

Tokenizing is the process of turning the input string (which is effectively an array of characters) into an array of logical tokens. Some characters in our input string, such as a parenthesis, map to a token all by themselves, while others, such as the identifier “module”, are covered by a single token spanning several characters.

In fact, the simplest possible Wasm module is (module), so we can start with just these three tokens!

I want to limit our main.roc file to “code that has side effects”, so I’m going to start by creating a new Roc module (file) named Tokenizer.roc.

All Roc files need to start with a header that is designed to communicate the purpose of the file with respect to other modules. For example, our main file defined the file as an app, and indicated what platform the file needed to run on.

This Tokenizer.roc file will instead be a module. The header for a module just needs to list what definitions are intended to be exported from that module. Any definitions not included in that list are not exported.

Our primary tokenizer interface will be a function that takes the input string as an argument and returns an array of Tokens. We don’t know exactly what a Token is yet, but given that there will be different types of tokens, I know it will need to be some kind of tag union. For now, I’m just going to define the module header to indicate that we are exporting a tokenize function and a Token type. The header only needs to specify the names of the exported elements, not any types or definitions, so this is sufficient:

module [
    Token,
    tokenize,
]

To access these exports from the main.roc app module, we just need to add import Tokenizer to the imports.
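Assuming the same Roc version that accepts the module header above, that is a one-line statement near the top of main.roc (adjust to match however your main.roc from part 2 declares its imports):

# in main.roc, below the app header
import Tokenizer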

Now we can update our compile function (written in the previous article) in main.roc to call the tokenize function with our input:

compile : Str -> List U8
compile = \input ->
    tokens = Tokenizer.tokenize input
    dbg tokens

    Str.toUtf8 "TODO: Compile Input"

If we try to compile this now, of course we’ll get errors. Specifically, two “Missing Definition” errors. Our Tokenizer.roc file is exporting two definitions, but they don’t exist! Let’s start by adding dummy definitions for each of them:

module [
    Token,
    tokenize,
]

Token : []

tokenize : Str -> List Token
tokenize = \input ->
    dbg input

    []

Functions in Roc are always specified with Roc’s lambda syntax, which is \arguments -> body. Everything in Roc is an expression that returns a result, so there is no explicit return statement; the last expression in the function body is the “return” value of that function. The type annotation before the function is optional, but in a classic “help me help you” situation, it gives the compiler extra information to shout at us if our body or call site doesn’t match what we documented.
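As a throwaway illustration of the same shape (a hypothetical helper, not part of the compiler), the annotation documents the type, the lambda provides the body, and the final expression is what gets returned:

# hypothetical example; the annotation is optional but keeps us honest
double : U64 -> U64
double = \n ->
    n * 2

expect double 21 == 42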

Now it should compile. The input string and an empty token array are output from each dbg statement (one in each module) when we run roc dev -- main.wat.

Iterating Tokens

The next step is to actually iterate over the characters in input and extract meaningful tokens from it. There are two reasons this is harder than it sounds:

  • WAT syntax is written in UTF-8 and can technically support any Unicode character. Roc’s standard library doesn’t support Unicode characters out of the box, and we would need to rely on the roc/unicode package to handle this correctly. However, the delimiters in WAT syntax are all single-byte ASCII characters, so we can get away with just assuming anything that is not a delimiter is part of a UTF-8 name.
  • Roc is a purely functional programming language, which means it doesn’t have loops. Iteration is encapsulated in a function call. The typical “walk” functions deal with either one byte or one Unicode character at a time. However, tokenizing is traditionally implemented by consuming characters until some condition is met. Another option would be to use the List.walkUntil and List.walkFrom functions from the standard library to work on list indices, but I feel like that would be messier than looking at each character once and storing it if the current token isn’t complete yet.

The Str.walkUtf8 function acts like a reducer, accumulating state while it runs a custom function that evaluates each byte (not Unicode character) in the string. The function needs to accept the current state and the next byte as arguments, and return the new state.
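As a small, hypothetical illustration of that shape, here is Str.walkUtf8 counting the bytes in a string; the accumulator starts at 0 and the lambda returns the new accumulator for each byte it sees:

# count bytes by accumulating a running total; the byte itself is ignored
byteCount = Str.walkUtf8 "abc" 0 \count, _byte -> count + 1

expect byteCount == 3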

Records

Now is a good time to introduce records in Roc. Records are kind of like structs in other languages. They are strongly typed, but the types can be soundly inferred based on usage. So you can basically define what looks like an untyped JavaScript object, but Roc will infer the types for all the fields.
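As a quick, hypothetical example, this record never mentions a type, but Roc still infers one for every field from how the record is used:

# Roc infers a record type along the lines of { name : Str, age : Num * }
person = { name: "Ada", age: 36 }

expect person.name == "Ada"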

Our initial state will be a fairly simple record:

initialState = {
    currentToken: None,
    tokens: [],
}

This is a top-level definition in the module. If you are from the school of thought that globals are bad, you’re right, but things are a little different in functional programming. All values are immutable, which means that this def is more like a “const” than a global “variable”.

The currentToken field is the tag literal None. I pulled this value out of thin air. It isn’t a special Roc type like Python’s None or null or Option.None in various other languages. It’s a literal tag with no payload. The cool thing about tags is that they can act like variants; currentToken can take one of many tags, only one of which has been defined at this time: None.

I’ll add a type to this record later, but I wanted to illustrate that they aren’t mandatory.

The second piece we need is a function that takes the current state and the next byte and returns the next state. For starters, let’s just debug the next byte and return the currentState unmodified:

evaluateOneChar = \currentState, nextByte ->
    dbg nextByte

    currentState

Armed with these two definitions and the input string passed to tokenize, we can now implement the tokenize function that we roughed out earlier:

tokenize : Str -> List Token
tokenize = \input ->
    finalState = Str.walkUtf8
        input
        initialState
        evaluateOneChar

    finalState.tokens

Str is a builtin module so we don’t need to import it. The walkUtf8 function will call our evaluateOneChar function for each byte in the string. We can run our code on empty_module.wat (which contains (module)) and get the following debug output:

[Tokenizer.roc:23] nextByte = 40
[Tokenizer.roc:23] nextByte = 109
[Tokenizer.roc:23] nextByte = 111
[Tokenizer.roc:23] nextByte = 100
[Tokenizer.roc:23] nextByte = 117
[Tokenizer.roc:23] nextByte = 108
[Tokenizer.roc:23] nextByte = 101
[Tokenizer.roc:23] nextByte = 41
[Tokenizer.roc:23] nextByte = 10
[main.roc:14] tokens = []

The bytes are all U8 values, so they show up as integers, but we can see that our function has been called for every byte in the input string.

Evaluating Tokens

The S-expression syntax used in WAT means that we can always expect the first token to be an opening parenthesis. Let’s update evaluateOneChar to handle a single opening parenthesis. If it encounters any other character, it will continue to debug that character and return the current state unmodified. I’m not clear on whether we are better off using pattern matching or if expressions for this, but I’m going to start with pattern matching and see how it works.

evaluateOneChar = \currentState, nextByte ->
    when (currentState.currentToken, nextByte) is
        (None, '(') ->
            {
                currentToken: None,
                tokens: List.append currentState.tokens LParen,
            }

        _ ->
            dbg nextByte

            currentState

This when expression is matching on two values at the same time, by collecting them in a tuple. If the currentToken is None and the next character is an opening parenthesis, we just return a new state that appends an LParen tag to the list of tokens. We would have to handle an LParen differently if the currentToken were an identifier or string, so we need to match on both values.

Any non-matching values fall into the “wildcard” arm, denoted by _ which just debugs the nextByte and returns the current state unmodified.

Notice the single quotes in '('. These indicate that the value in question is a character and not a string, and its type is an integer. Note that it is not like characters in e.g. C, which only map to single bytes (integers no greater than 255). A character can be multiple bytes long for Unicode characters and can have code points higher than 127, the top of the traditional ASCII range.
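In other words, a single-quoted literal is just a number. This hypothetical expect (runnable with roc test) passes because '(' is code point 40:

# '(' is the integer 40 in ASCII and UTF-8
expect '(' == 40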

If we try to compile this now, Roc will complain that our tokenize function is returning a List [LParen] when a List [] was expected. This is because the tokenize function is declared to return a List Token, and our Token type is currently an empty tag union. You can fix this easily by changing the Token type as follows:

Token : [LParen]

We’ll need to add all our tags to this tag union type as we go. If we run the program now, notice that byte 40 (a ‘(’ character is byte 40 in ASCII and UTF-8) is no longer output by the dbg, and our tokens array now contains an LParen tag:

[Tokenizer.roc:31] nextByte = 109
[Tokenizer.roc:31] nextByte = 111
[Tokenizer.roc:31] nextByte = 100
[Tokenizer.roc:31] nextByte = 117
[Tokenizer.roc:31] nextByte = 108
[Tokenizer.roc:31] nextByte = 101
[Tokenizer.roc:31] nextByte = 41
[Tokenizer.roc:31] nextByte = 10
[main.roc:14] tokens = [LParen]

Parsing a Name

The next sequence we’ll need to parse is a name. I should really be trying to match the formal WAT grammar defined here, but for expediency, I’m going to define a name based on the subset of WAT that is needed to compile the “hello world” example I listed above.

Specifically, a name (for this abused subset of the grammar) is any sequence of lowercase ASCII letters, digits, underscores, or periods. This will match things like module, export, or i32.const, but not variables like $fd_write or strings like "_start".

Normally I’d use a regular expression for this, but given the uncertain status of the only Roc regex library, I’m going to create a top-level Set instead. A Set is more suitable for this than a List because checking whether it contains a given byte happens in constant time, as opposed to the linear scan that would be needed with a List. I actually decided I need two sets, because the first character of a name cannot be a digit or punctuation character. The definitions look like this:

validNameStartBytes = Set.fromList [
    'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h',
    'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p',
    'q', 'r', 's', 't', 'u', 'v', 'w', 'x',
    'y', 'z']
validNameBytes = validNameStartBytes |> Set.union (Set.fromList [
    '0', '1', '2', '3', '4', '5',
    '6', '7', '8', '9', '.', '_',
])

Note that the Set.union call is returning a new Set and not modifying the existing validNameStartBytes in place. Roc never modifies anything in place. So both sets contain the values we expect after these calls are made.
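If you want to convince yourself, a couple of hypothetical expects (runnable with roc test) show what ended up in each set:

# 'm' can start a name; '3' can appear inside a name but cannot start one
expect Set.contains validNameStartBytes 'm'
expect Set.contains validNameBytes '3'
expect Bool.not (Set.contains validNameStartBytes '3')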

Now we can add a pattern to our when expression to start a new Name current token when we encounter a letter while the currentToken is None:

(None, nameByte) if Set.contains validNameStartBytes nameByte ->
    {
        currentState &
        currentToken: Name [nameByte],
    }

This pattern match has a condition on it, so it only matches if whatever value is in the nameByte variable is also in the validNameStartBytes set. If it does match, we return a new record that is derived from the existing currentState record. That’s what the & does; it says “return a record that contains all the values of the existing currentState record but change the following fields to…”. Then we say we’re replacing the currentToken field in that record with a new Name tag. Unlike our previous tags (None and LParen), the Name tag gets a payload, which we set to a List of bytes (containing only one byte, for now).
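Here’s a tiny, standalone illustration of the & update syntax, using hypothetical values rather than real tokenizer state:

# copy the record, changing only currentToken; tokens is carried over untouched
before = { currentToken: None, tokens: [LParen] }
after = { before & currentToken: Name ['m'] }

expect after.tokens == [LParen]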

This code will compile and run, but it is only handling one character. Our debug output now looks like this:

[Tokenizer.roc:36] nextByte = 111
[Tokenizer.roc:36] nextByte = 100
[Tokenizer.roc:36] nextByte = 117
[Tokenizer.roc:36] nextByte = 108
[Tokenizer.roc:36] nextByte = 101
[Tokenizer.roc:36] nextByte = 41
[Tokenizer.roc:36] nextByte = 10
[main.roc:14] tokens = [LParen]

This is subtly different from our last output listing, as byte 109 (the ‘m’ in module) is no longer output. We aren’t actually doing anything with it yet, in terms of updating the output tokens, but we are handling it.

We can handle the remaining name characters similarly, except that at the time those bytes are read, the currentToken is now a Name rather than a None. So we need to match and destructure a Name token from the currentToken inside the when:

        (Name nameBytes, nameByte) if Set.contains validNameBytes nameByte ->
            { currentState &
                currentToken: Name (List.append nameBytes nameByte),
            }

The tuple we are matching on now matches a Name token and extracts the current nameBytes from it. It then appends the new nameByte to create a new copy of that list and sets that as the new currentToken.

Running it now, we see that only two bytes are unhandled: 41 and 10 (a closing parenthesis and a newline, for those who don’t have the ASCII table memorized or open in a separate browser tab):

[Tokenizer.roc:41] nextByte = 41
[Tokenizer.roc:41] nextByte = 10
[main.roc:14] tokens = [LParen]

But we’re still not adding the name to the array of tokens. There are a few characters that could end a name (notably whitespace), but I’m going to code to the (module) example for now and only match on a closing ):

        (Name nameBytes, ')') ->
            {
                currentToken: None,
                tokens: List.concat currentState.tokens [Name nameBytes, RParen],
            }

We need to append two tokens to the array in this case (the Name and the RParen for the closing parenthesis we just consumed), so we use List.concat to combine two lists instead of List.append.
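The difference between the two, as a pair of hypothetical expects:

# append adds a single element; concat joins two lists
expect List.append [1, 2] 3 == [1, 2, 3]
expect List.concat [1, 2] [3, 4] == [1, 2, 3, 4]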

Now that we are actually updating the tokens, tokenize will complain that we are not returning the right type. We need to update the Token type definition to include our new Name and RParen tags:

Token : [LParen, RParen, Name (List U8)]

Notice how we define the Name type to have a payload that is a List U8.

Now our output looks like this:

[Tokenizer.roc:47] nextByte = 10
[main.roc:14] tokens = [LParen, (Name [109, 111, 100, 117, 108, 101]), RParen]

We’re handling everything in a basic input and getting an array of tokens back! Aren’t we special? Let’s also eat that whitespace character. Start by adding a new top-level Set to hold various whitespace characters:

whitespaceBytes = Set.fromList [9u8, 10u8, 13u8, 32u8]

Rather than trying to figure out how to get the single quote syntax to pick up tabs and newlines, I just entered their ASCII/UTF-8 values directly. The u8 suffix says that these should be stored as bytes rather than letting Roc pick its wider default integer type.
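For reference, those four values are the tab, newline, carriage return, and space characters, which a couple of hypothetical expects (runnable with roc test) can confirm:

# 9 = tab, 10 = newline, 13 = carriage return, 32 = space
expect Set.contains whitespaceBytes 32 # space
expect Set.contains whitespaceBytes 10 # newline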

We only want to drop whitespace if the current token is None (otherwise we’d run into trouble later when we handle things like spaces inside quoted strings). Dropping it is as simple as returning the currentState without modification:

        (None, whitespaceByte) if Set.contains whitespaceBytes whitespaceByte ->
            currentState
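As a quick regression check on what we have so far, here’s a sketch of an expect we could drop at the bottom of Tokenizer.roc and run with roc test, assuming the module exactly as written above:

# "(module)" should tokenize to exactly three tokens;
# the Name payload is the UTF-8 bytes of "module"
expect tokenize "(module)" == [LParen, Name (Str.toUtf8 "module"), RParen]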

Now we are able to tokenize every character in the input string and get valid output. This is getting pretty long, so I’m going to pause here! In the next article, we’ll deal with error handling and hopefully finish tokenizing the larger hello world string.