Build a Wasm Compiler in Roc - Part 4
In earlier articles, I introduced the project, wrote some Roc code to load an input file, and started implementing a Tokenizer.
This part takes a bit of a detour with a refactor to support rudimentary error reporting.
Reminder: You are reading content that took a great deal of effort to craft, compose, and debug. If you appreciate this work, consider supporting me on Patreon or GitHub.
Handling Errors during Tokenizing
Before we start adding more tokens to tokenize the Hello World
module, I want
to beef up our error handling a bit. There are a few reasons I don’t like the
basic dbg nextByte
we are currently using in the wildcard arm:
- The wildcard prevents the compiler from informing us if we missed any cases at compile time. Exhaustive pattern matching is an amazing feature. We should leverage it!
- Other than whitespace, I haven’t memorized the ASCII table and I’m tired of looking up those numbers. We should map the UTF-8 byte back to a character if we can.
- We don’t communicate anything back to the calling function about the success or failure of the operation.
- It doesn’t give us any feedback as to where tokenizing broke down.
Let’s address each of these in order.
Eliminate the wildcard
Let’s start with simply removing the wildcard and seeing what kind of errors we get:
This when does not cover all the possibilities:
...
Other possibilities include:
( Name _, _ ) (note the lack of an if clause)
( None, _ ) (note the lack of an if clause)
I would have to crash if I saw one of those! Add branches for them!
So we are missing the case where we are processing an existing Name
token and
encounter something that is neither a name character nor an RParen
. We’re also
missing cases where the current token is None
and we received something that
wasn’t a LParen
, whitespace, or name byte.
Some of those cases are legitimate syntax that we neglected to handle. For
example, a Name
token can end when we encounter whitespace instead of an
RParen
. Let’s quickly add a pattern arm for that:
(Name nameBytes, whitespaceByte) if Set.contains whitespaceBytes whitespaceByte ->
{
currentToken: None,
tokens: List.append currentState.tokens (Name nameBytes),
}
If we encounter whitespace after a name, we end the current Name
token, add it
to the list of tokens, and discard the whitespace. However, we cannot
compile (module )
without adding another match arm for the close paren when
the currentToken is None
:
(None, ')') ->
{
currentToken: None,
tokens: List.append currentState.tokens RParen,
}
We still need to handle the other wildcard possibilities, though. Some of them will be oversights in our existing code, but others would represent legitimate compiler errors in the input WAT source code. For now, let’s just replace the wildcard with something that explicitly handles the cases we know we have missed (effectively communicating that we missed them on purpose).
(Name _, unexpectedByte) | (None, unexpectedByte) ->
byteAsUtf8 =
[unexpectedByte]
|> Str.fromUtf8
|> Result.onErr \_ -> Ok (Num.toStr unexpectedByte)
|> Result.withDefault ""
dbg "Unexpected character '$(byteAsUtf8)'"
currentState
This arm uses a | character to handle two completely different patterns. One is when our currentToken is a Name but we received a byte that isn’t valid in a Name, and the other is when the currentToken is None and we received a byte that isn’t expected after None. The arm also goes to the trouble of attempting to map the byte back to a UTF-8 string. This is an operation that could fail (e.g. if the byte is part of a multi-byte Unicode codepoint), so as a fallback we return the string form of the byte number and unwrap the result with a withDefault whose default should never be used.
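As an aside, the decode-with-fallback idea is easy to sketch outside of Roc. Here is a rough Python equivalent; the byte_to_display name is my own, not something from the article’s code:

```python
def byte_to_display(b: int) -> str:
    # Try to decode the single byte as UTF-8, mirroring Str.fromUtf8.
    # If that fails (e.g. the byte belongs to a multi-byte sequence),
    # fall back to the byte's numeric value as a string.
    try:
        return bytes([b]).decode("utf-8")
    except UnicodeDecodeError:
        return str(b)
```

With this, an ASCII byte like '!' decodes to itself, while a lone byte from a multi-byte sequence such as 0xC3 falls back to its numeric form.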
Now, if we add a new token type, we’ll get compile-time errors for any cases we neglect to purposefully handle.
Together, these changes address the first two of the problems I listed.
Communicating Errors
The dbg
statement is a horrible way to communicate with the end user, if for no other reason than that Roc strips it out of release binaries! Instead, we should
collect errors and return them from our tokenize
function.
Let’s start with collecting errors. We can add an errors
field to our
initialState
record, but before we do that, let’s define a couple new types.
The types aren’t necessary to the compiler, but our record is about to get
complicated enough that the developer (i.e. me) will appreciate having it
documented:
Token : [LParen, RParen, Name (List U8)]
CurrentToken : [None, Name (List U8)]
Error : {
message : Str,
}
TokenizerState : {
currentToken : CurrentToken,
tokens : List Token,
errors : List Error,
}
We already have the first type (Token), but I wanted to include it to set up the discussion of CurrentToken. These are both closed tag unions, which means
only those explicit tags can be provided. There is some overlap between them,
but they represent different things; only a certain subset of the types of
Tokens can ever be set as a CurrentToken
(e.g. LParen
tokens are added
directly to the tokens
array and don’t get set as current).
Further, we never return a None
Tag from tokenize
, but this is a valid Tag
in the TokenizerState
. So these need to be two different types. When we add
new types of tokens that can go in both places, we need to update both types.
This sucks, but the compiler will let us know if we get it wrong.
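If it helps to see the overlap another way, here is a loose Python analogue using plain tag-name sets (my own illustration; Roc’s closed tag unions have no direct Python equivalent):

```python
# Tags that may appear in the finished token list.
TOKEN_TAGS = {"LParen", "RParen", "Name"}

# Tags that are valid for the in-progress token. "None" here is the
# tokenizer's own "nothing in progress" tag, not Python's None.
CURRENT_TOKEN_TAGS = {"None", "Name"}

# Name appears in both unions, but each also has tags the other lacks,
# which is why they have to stay two separate types.
OVERLAP = TOKEN_TAGS & CURRENT_TOKEN_TAGS
```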
The Error
record type will be extended later to include some location information,
but for now I’m just giving it a single field.
TokenizerState
is a record type that we previously inferred from the
structure of initialState
, but now we’re making it explicit. I also added an
errors
field. We can add a type to initialState
to guarantee that it has the
right type:
initialState : TokenizerState
initialState = {
currentToken: None,
tokens: [],
errors: [],
}
These changes alone will introduce compiler errors because we are not handling
errors
properly in evaluateOneChar
. The compiler is already aware that the
signature of evaluateOneChar
needs to handle errors
and doesn’t yet. But to
be extra clear for the developer’s sake, I’ll add a type to evaluateOneChar
as
well:
evaluateOneChar : TokenizerState, U8 -> TokenizerState
evaluateOneChar = \currentState, nextByte ->
To satisfy the compiler, we need to make sure all our arms use the &
trick to
ensure the entire current state (including errors) is copied over. For example,
the two arms that match parentheses when currentToken
is None
need to
change from the following:
(None, '(') ->
{
currentToken: None,
tokens: List.append currentState.tokens LParen,
}
(None, ')') ->
{
currentToken: None,
tokens: List.append currentState.tokens RParen,
}
…to the following:
(None, '(') ->
{ currentState &
tokens: List.append currentState.tokens LParen,
}
(None, ')') ->
{ currentState &
tokens: List.append currentState.tokens RParen,
}
The arm that matches a Name
against whitespace needs to do this:
(Name nameBytes, whitespaceByte) if Set.contains whitespaceBytes whitespaceByte ->
{ currentState &
currentToken: None,
tokens: List.append currentState.tokens (Name nameBytes),
}
And the one that matches a right parenthesis after name becomes:
(Name nameBytes, ')') ->
{ currentState &
currentToken: None,
tokens: List.concat currentState.tokens [Name nameBytes, RParen],
}
At this point our code compiles, but we aren’t actually collecting errors until
we replace the dbg
arm with something that actually updates the errors:
(Name _, unexpectedByte) | (None, unexpectedByte) ->
byteAsUtf8 =
[unexpectedByte]
|> Str.fromUtf8
|> Result.onErr \_ -> Ok (Num.toStr unexpectedByte)
|> Result.withDefault ""
{ currentState &
errors: List.append currentState.errors { message: "Unexpected character '$(byteAsUtf8)'" },
}
This compiles and runs, but if I run it on a file that contains invalid code
(e.g. (module!)
) it won’t actually report anything.
Reporting errors from tokenizing
To ensure my errors
array is getting data, I tried tossing a dbg finalState.errors
in the tokenize
function before we return it.
The errors
are there, so we just need to figure out what to do with them!
All we need to do is return a Result
instead of a List Token
from the tokenize
function. Results are basically just a union of the Tags Ok
and Err
, with
generic payloads. We can inspect the errors
field and return a failure if
that is the right behaviour:
tokenize : Str -> Result (List Token) (List Error)
tokenize = \input ->
finalState = Str.walkUtf8
input
initialState
evaluateOneChar
when finalState.errors is
[] -> Ok finalState.tokens
errors -> Err errors
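For readers following along in another language, the overall shape of the tokenizer can be sketched in Python. This is my own simplified mirror of the logic, not the article’s code: names are limited to lowercase ASCII letters, tagged tuples stand in for Roc tags, and ('Ok', …)/('Err', …) pairs stand in for Result:

```python
def evaluate_one_char(state, byte):
    # state is (current_name_bytes_or_None, tokens, errors),
    # loosely mirroring the Roc TokenizerState record.
    current, tokens, errors = state
    whitespace = {ord(" "), ord("\t"), ord("\n"), ord("\r")}
    name_bytes = set(range(ord("a"), ord("z") + 1))  # simplified name set
    if current is None:
        if byte == ord("("):
            return (None, tokens + ["LParen"], errors)
        if byte == ord(")"):
            return (None, tokens + ["RParen"], errors)
        if byte in whitespace:
            return state
        if byte in name_bytes:
            return ([byte], tokens, errors)
    else:  # currently inside a Name token
        if byte in name_bytes:
            return (current + [byte], tokens, errors)
        if byte in whitespace:
            return (None, tokens + [("Name", bytes(current))], errors)
        if byte == ord(")"):
            return (None, tokens + [("Name", bytes(current)), "RParen"], errors)
    # Equivalent of the old wildcard arm: record an error, keep scanning.
    message = "Unexpected character '%s'" % chr(byte)
    return (current, tokens, errors + [message])

def tokenize(source):
    # Fold the state over every byte, then inspect the collected errors.
    state = (None, [], [])
    for byte in source.encode("utf-8"):
        state = evaluate_one_char(state, byte)
    _, tokens, errors = state
    return ("Err", errors) if errors else ("Ok", tokens)
```

Like the Roc version at this point in the article, the sketch does no end-of-input handling, so a name is only flushed by whitespace or a closing paren.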
Since we are now exposing it, we should add Error
to the module exports at the
top of the file:
module [
Token,
Error,
tokenize,
]
Going back to the main.roc
file for a while, our compile
function currently
just passes whatever it receives from tokenize
to a dbg
statement. We
should instead make it return a Result
so the main
function can format the
errors nicely. This function will probably encounter other types of errors
later, so let’s create a new Tag named TokenizeError:
compile : Str -> Result (List U8) [TokenizeError (List Tokenizer.Error)]
compile = \input ->
when Tokenizer.tokenize input is
Ok tokens ->
dbg tokens
Ok (Str.toUtf8 "TODO: Compile Input")
Err errors -> Err (TokenizeError errors)
Note: This decision turned out to be a mistake and a later part will refactor the parsing code to always return the same type of Error. I left the mistake in place, partially due to laziness aka lack of time, but also because I want folks to see that real development is full of dead ends and backtracking like this.
Finally, we need to update the main
function to format the errors and output
them to Stdout
, if there are any, and only write the output file if there is a
valid response. Let’s start with a (not very good) formatErrors
function,
which will live in main.roc
for now:
formatErrors = \errors ->
Str.joinWith (errors |> List.map \error -> error.message) "\n"
To use this, I have to break up the beautiful pipeline we made in main
so
that I can assign the response from compile
to a Result
and then match on
it:
main : Task.Task {} [Exit I32 Str]
main =
when Arg.list! {} is
[] | [_] -> Task.err (Exit 1 "No input filename provided")
[_, _, _, ..] -> Task.err (Exit 2 "Too many arguments")
[_, filename] ->
compileResult =
(
filename
|> File.readUtf8
|> Task.mapErr \error ->
when error is
FileReadErr _ NotFound ->
Exit 3 "$(filename) does not exist"
FileReadErr _ _ ->
Exit 99 "Error reading $(filename)"
FileReadUtf8Err _ _ ->
Exit 4 "Unable to read UTF8 in $(filename)"
)!
|> compile
when compileResult is
Ok compiledBytes -> writeWithWasmExtension compiledBytes filename
Err (TokenizeError errors) ->
errors
|> formatErrors
|> Stderr.line
|> Task.mapErr \_ -> Exit 99 "System is failing"
Now when I try to compile a main.wat file that contains (module!), I get a slightly informative error message: Unexpected character '!'. I wonder what line that error occurred on (if you guessed line 1, good for you).
Tracking Error Locations
Let’s go back to Tokenizer.roc
and see what we can do about tracking the
location of errors. This will require two phases: First we’ll need to keep
track of where we are in the file as we parse it. We can do this by adding
current line number and column to the state. But we’ll have to adjust it
with each iteration. Second, we need to report the current location when we
store errors, by adding line and column fields to the Error record.
First, I’ll create a new Position
type (don’t forget to export it) and add
it to the TokenizerState
type:
Position : {
row : U32,
column : U32,
}
TokenizerState : {
currentToken : CurrentToken,
currentPosition : Position,
tokens : List Token,
errors : List Error,
}
We need to initialize the initialState
with an appropriate position:
initialState : TokenizerState
initialState = {
currentToken: None,
currentPosition: {
row: 1,
column: 0,
},
tokens: [],
errors: [],
}
Since we are already propagating the current state in all our when
branches using currentState
, the code will still compile, but
the position isn’t moving.
We could try to add position tracking to the big when statement, but
I feel like it will be easier to do it in a separate operation. The position
tracking mechanism only cares about the nextByte
, so mixing it up with all the m × n conditions created by (currentToken, nextByte) would be a big hassle.
First, I’ll rename currentState
to previousState
in the function signature:
evaluateOneChar : TokenizerState, U8 -> TokenizerState
evaluateOneChar = \previousState, nextByte ->
This allows me to add an intermediate currentState
variable that will have
the correct position without having to edit our gigantic (but nowhere near what
it needs to be!) when
expression:
currentState =
if nextByte == '\n' then
{ previousState &
currentPosition: {
row: previousState.currentPosition.row + 1,
column: 0,
},
}
else
{ previousState &
currentPosition: {
row: previousState.currentPosition.row,
column: previousState.currentPosition.column + 1,
},
}
This is assuming Unix-style line endings, but could be easily changed to handle
\r\n
and \r
as well.
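For example, a position-advancing helper that also treats \r\n and bare \r as line endings might look like this Python sketch (the advance function and its previous_byte parameter are my own framing, not code from the article):

```python
def advance(position, byte, previous_byte):
    # position is (row, column); rows start at 1 and the column counts
    # the byte just consumed, matching the article's initial state.
    row, column = position
    if byte == ord("\n"):
        if previous_byte == ord("\r"):
            # Second half of "\r\n": the "\r" already started a new line.
            return (row, column)
        return (row + 1, 0)
    if byte == ord("\r"):
        # Treat a bare "\r" as a newline; "\r\n" is handled above.
        return (row + 1, 0)
    return (row, column + 1)
```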
I added a dbg (nextByte, currentState.currentPosition)
to double check it was
incrementing correctly.
In fact it wasn’t incrementing correctly. My first attempt had some off-by-one errors. So don’t kick yourself over off-by-one errors; even a quarter century of experience won’t be enough to end them!
Showing the error positions to the user
Now that we are tracking positions, we just need to get that information back
to the user when there is an error. Start by adding a position
to the Error
type:
Error : {
message : Str,
position : Position,
}
The compiler will tell us we need to update the error record that we append to currentState in the arm that collects errors:
(Name _, unexpectedByte) | (None, unexpectedByte) ->
byteAsUtf8 =
[unexpectedByte]
|> Str.fromUtf8
|> Result.onErr \_ -> Ok (Num.toStr unexpectedByte)
|> Result.withDefault ""
{ currentState &
errors: List.append currentState.errors {
message: "Unexpected character '$(byteAsUtf8)'",
position: currentState.currentPosition,
},
}
The error handling code we already wrote in main.roc
should automatically
propagate this new information right to the formatErrors
function, so we can
just update the List.map
call in there:
formatErrors = \errors ->
Str.joinWith
(
errors
|> List.map \error ->
row = error.position.row |> Num.toStr
column = error.position.column |> Num.toStr
"$(row):$(column) $(error.message)"
)
"\n"
The positions are integers, so we need to call Num.toStr
before passing them
to the format string.
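The same formatting logic, as a quick Python sketch of an equivalent helper (my own naming):

```python
def format_errors(errors):
    # One "row:column message" line per error, joined with newlines.
    return "\n".join(
        "%d:%d %s" % (e["position"]["row"], e["position"]["column"], e["message"])
        for e in errors
    )
```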
Now if I run our tokenizer against the following code, I get the appropriate position in the error:
❯ cat main.wat
(
module
!
)
❯ roc dev -- main.wat
3:3 Unexpected character '!'
Attaching positions to tokens
I can already tell we’re going to wish each token had a position attached to it so we can report similar information when we encounter errors during the parsing phase. I wasn’t sure I wanted to go to that level of complexity right now, but it turned out to be pretty easy.
Note: This decision causes a lot of pain later in the series, but no real compiler can get away with not reporting error locations, so I doubled down on it.
First, we need to change our two Token
types from a tag union to a record
that contains a position and tag union:
Token : {
position : Position,
token : [LParen, RParen, Name (List U8)],
}
CurrentToken : {
position : Position,
token : [None, Name (List U8)],
}
Now we just follow the Roc compiler errors to ensure we are returning the
position with every token. First, update initialState
:
initialState = {
currentToken: {
position: {
row: 1,
column: 0,
},
token: None,
},
currentPosition: {
row: 1,
column: 0,
},
tokens: [],
errors: [],
}
Note that currentPosition
is tracked separately from the position of the
currentToken
. This is because in a token that is more than one character long
(such as Name
), the currentToken.position
will point to the start of the
token while currentPosition
will point to whichever character we are
evaluating next. We could probably get away with just keeping the
currentToken.position
and calculating the currentPosition
based on the
length of the currentToken
, but storing the two values separately feels
safer.
Next, we need to update our big when
statement to match on the new token
field of the currentToken
instead of having it at the top level:
when (currentState.currentToken.token, nextByte) is
We also need to update all of our branch arms so that when they assign
to either currentToken
or tokens
, they are getting a record
instead of a tag. I won’t show all of these, but here are a few
examples:
The left and right parenthesis arms need to attach the current position to the token when they construct it and append it to the list:
(None, '(') ->
{ currentState &
tokens: List.append currentState.tokens {
position: currentState.currentPosition,
token: LParen,
},
}
The arm that starts a Name
token needs to set the position of currentToken
:
(None, nameByte) if Set.contains validNameStartBytes nameByte ->
{ currentState &
currentToken: {
position: currentState.currentPosition,
token: Name [nameByte],
},
}
But the arm that updates the Name
token should use the &
trick to copy
position from the currentToken
that was set in the above arm. Roc complained
at me when I tried to use currentState.currentToken &
, so I had to assign
it to another variable:
(Name nameBytes, nameByte) if Set.contains validNameBytes nameByte ->
currentToken = currentState.currentToken
{ currentState &
currentToken: { currentToken &
token: Name (List.append nameBytes nameByte),
},
}
The two arms that copy the currentToken
into the tokens
array need to
construct the record format:
(Name nameBytes, ')') ->
{ currentState &
currentToken: {
position: currentState.currentToken.position,
token: None,
},
tokens: List.concat currentState.tokens [
{
position: currentState.currentToken.position,
token: Name nameBytes,
},
{
position: currentState.currentPosition,
token: RParen,
},
],
}
Pay attention to the above example. I had to update three positions here. The
new None
currentToken
needs to have the current position. And we are adding
two tokens to the array at different positions. The first, which holds the
Name
, copies the position from the currentToken
and the second, which holds
the RParen
, copies the currentPosition
.
After making these changes, I can compile a valid wasm file that looks like this:
(
module
)
And the existing dbg
line in the compile
function spits out tokens that have been
annotated with their position:
[main.roc:16] tokens = [{position: {column: 1, row: 1}, token: LParen}, {position: {column: 3, row: 2}, token: (Name [109, 111, 100, 117, 108, 101])}, {position: {column: 1, row: 3}, token: RParen}]
This was an interesting refactor; the Roc compiler made it dead simple, much as Rust would have. But unlike Rust, it did the job extremely quickly. I dislike Rust because it’s so hard on developer velocity; Roc lets me create static binaries with the same level of certainty, but without waiting around for stuff and without fighting the borrow checker. I had never used Roc before this project, but other than a few obtuse compiler errors that I’m sure will be fixed, development in this language has been shockingly smooth.
Crikey, I thought we were going to get further today, but I think I better leave the remaining tokens (and testing!) for the next article.