Build a Wasm Compiler in Roc

In earlier articles, we implemented a tokenizer and parser to convert Wasm’s WAT syntax into an S-expression abstract syntax tree, and started to implement a transformer to convert that AST into one more suitable for generating Wasm bytecode.

This article continues where part 8 left off, as we try to expand the parser to something more than an empty WAT (module).

Reminder: You are reading content that took a great deal of effort to craft, compose, and debug. If you appreciate this work, consider supporting me on Patreon or GitHub.

Getting our bearings

As a refresher, the hello.wat file that I want to compile to Wasm bytecode looks like this:

(module
    (import "wasi_snapshot_preview1" "fd_write" (func $fd_write (param i32 i32 i32 i32) (result i32)))

    (memory 1)
    (export "memory" (memory 0))
    (export "_start" (func $main))

    (data (offset (i32.const 8)) "hello world")

    (func $main
        (i32.const 0)
        (i32.const 8)
        (i32.store)

        (i32.const 4)
        (i32.const 11)
        (i32.store)

        (i32.const 1)
        (i32.const 0)
        (i32.const 1)
        (i32.const 20)
        (call $fd_write)
        drop
    )
)

We have parsed this code into a bunch of S-Expressions, and we’ve constructed an empty module AST that currently looks like this:

Func : {}
Mem : {}
Data : {}
Import : {}
Export : {}

Module : {
    funcs : List (Common.Positionable Func),
    mems : List (Common.Positionable Mem),
    datas : List (Common.Positionable Data),
    imports : List (Common.Positionable Import),
    exports : List (Common.Positionable Export),
}

emptyModule = {
    funcs: [],
    mems: [],
    datas: [],
    imports: [],
    exports: [],
}

Our next task is to start populating those lists based on the children of the module S-expression in the input string.

We’ve already transformed the initial module expression in our transform function, but we are currently absolutely ignoring the children. Since we want to raise respectful and well-mannered AST nodes, ignoring them as children is probably not the wisest course of action.

I’m going to do this recursively, with a different function for each type of syntax we want to parse. That allows us to impose a bit more structure on the types than if we used a single recursive function that just returns a big ol' bag of nodes (like our parser does).

Each recursive function is called with the children from an S-Expression. We don’t pass the S-Expression’s name down because the function in question would not have been called unless we had already matched the name. Remember, children is a list of Parser.Expression objects, and the list can contain either Term or SExpression tags with their associated bodies.

Each of these recursive functions will return a Result where the Ok field depends on which function was called, and the Err field will always be a list of Common.Error objects. To make this clearer, let’s add a parameterized type in the type definitions at the top of the file for this kind of result:

TransformResult ok : Result ok (List Common.Error)

We can shorten the type of transform to use this new type:

transform : Parser.Node -> TransformResult (Common.Positionable Module)

We can also use parameterized types in function definitions. I am expecting/hoping/planning that all our recursive functions will have the same basic structure, so let’s define a type for that:

Transformer ok : List Parser.Expression -> TransformResult ok

A Transformer is a function that accepts a list of the parse children of that node and returns TransformResult where the ok type depends on the function being called.

Armed with these types, we can start to flesh out the boilerplate for a function that parses the children of a module S-Expression:

transformModule : Transformer Module
transformModule = \children ->
    List.walk children (Ok emptyModule) \currentResult, child ->
        currentResult

Now we can shorten the success arm of our transform function to call this function instead:

    when expression is
        { position, expression: SExpression { name: "module", children } } ->
            children
            |> transformModule
            |> Result.map \element -> { position, element }

The Transformer type is not aware of the position, but the transform function needs to return a positionable, so we throw a map on the end to modify a success result. Error results don’t need to be modified because they’ll carry their own position.

At this point, my existing tests are all passing because we are returning an empty result when there are no children, and the only success test is for an empty (module). Let’s add a new test for a memory child so we have more things to fix!

Parsing a memory

Memory is one of the simpler expressions, so let’s start with that. Technically, a valid WAT memory will always be of the form (memory num) or (memory num num), depending on whether there is an upper size limit on the memory. But I’m coding to a sample instead of to the WAT spec, so I’m going to pretend that only (memory num) is valid!

We can start by reflecting this on the Mem type that is currently defined as {}:

Mem : {
    min : U32,
}

Since Module already refers to this type, we don’t need to update it.

Now that the type exists, we should be able to set up an expectation for a simple module containing a single Mem:

expect
    # (module (memory 10))
    result = transform
        (
            SExpression {
                position: { column: 1, row: 1 },
                name: "module",
                state: Closed,
                children: [
                    SExpression {
                        position: { column: 8, row: 1 },
                        name: "memory",
                        state: Closed,
                        children: [
                            Term {
                                position: { column: 17, row: 1 },
                                term: Number 10,
                            },
                        ],
                    },
                ],
            }
        )

    result
    == Ok {
        position: { row: 1, column: 1 },
        element: {
            funcs: [],
            mems: [
                {
                    position: { row: 1, column: 10 },
                    element: { min: 10u32 },
                },
            ],
            datas: [],
            imports: [],
            exports: [],
        },
    }

Next, let’s create a new transformMemory function. Like transformModule, it is a Transformer function, but the ok value is a Mem this time:

transformMem : Transformer Mem
transformMem = \children ->
    #TODO

Because Mem has such a rigid structure, we don’t need to iterate over the children like we do in transformModule. We know it’s either a (memory Number number) or an error:

transformMem : Transformer Mem
transformMem = \children ->
    when children is
        [Term { position, term: Number number }] -> Ok { min: number }
        _ -> Err [{ position: { row: 0, column: 0 }, message: "Invalid memory" }]

Ideally, we wouldn’t be hardcoding an incorrect error position like that, though. I tried to change it to extract a position from the child nodes, but the compiler got confused by the recursive typechecking. I’m 90% certain this code is correct, but it wouldn’t compile:

transformMem : Transformer Mem
transformMem = \children ->
    when children is
        [Term { position, term: Number number }] -> Ok { min: number }
        [] -> Err [{ position: { row: 0, column: 0 }, message: "Invalid memory" }]
        [Term { position }, ..] -> Err [{ position, message: "Invalid memory" }]
        [SExpression { position }, ..] -> Err [{ position, message: "Invalid memory" }]

The error message is very long, but the gist is “I’m so confused about recursive types, help.” At this point, I’m also confused about recursive types in Roc, so I can’t help the poor compiler.

My solution for now is:

transformMem : Transformer Mem
transformMem = \children ->
    when children is
        [Term { position, term: Number number }] -> Ok { min: number }
        [] -> Err [{ position: { row: 0, column: 0 }, message: "Invalid memory" }]
        [Term { position }, ..] -> Err [{ position, message: "Invalid memory" }]
        [SExpression { position }, ..] -> Err [{ position, message: "Invalid memory" }]
        _ -> crash "Impossible match arm in transformMem" # shouldn't be necessary?

Once I get the success case working, I’ll try to write some expectations for the failure cases to make sure it isn’t actually possible to hit that match arm.

My success case expectation is still failing because I’m not actually calling transformMem anywhere. We will need to set up a pattern match in the List.walk in transformModule to call it in the appropriate case:

transformModule : Transformer Module
transformModule = \children ->
    List.walk children (Ok emptyModule) \currentResult, child ->
        when child is
            SExpression { position, name: "memory", children: expressions } ->
                mergeResult currentResult (transformMem expressions) \module, mem -> { module &
                        mems: List.append module.mems { position, element: mem },
                    }

            _ -> Err [{ position: { row: 0, column: 0 }, message: "Unexpected child" }]

Inside the List.walk callback, we are matching on child. If it is an S-Expression with a name of memory, we call the transformMem function on it. Then we call a sneaky function called mergeResult that I’ll show you shortly. This function accepts the currentResult, which is either an Ok Module or an Err (List Common.Error), and another result that has an arbitrary success type, but the same Err (List Common.Error) type. It returns a new Ok Module Result if both results are Ok, or a new list of errors if either or both of them returned an error.

The third argument is a function that acts something like a Result.map function, except it accepts two values. It is only called if both results are Ok, and it accepts the contents of those Ok results and returns a new value that mergeResult will wrap in another Ok.

In other words, the only way for mergeResult to return an Ok value is if both of its arguments are Ok.

So in this particular case, we are calling mergeResult with a Result who’s success value is a Module and a second Result containing a Mem. The mapping function returns a new Module where the Mem has been appended to the mems list.

You might find it instructive to try to define mergeResult yourself. I certainly did! If not, or if you get stuck, here’s the implementation:

mergeResult : TransformResult a, TransformResult b, (a, b -> a) -> TransformResult a
mergeResult = \originalResult, nextResult, mapper ->

    when (originalResult, nextResult) is
        (Ok original, Ok next) ->
            Ok (mapper original next)

        (Err errors, Ok _) ->
            Err errors

        (Ok _, Err errors) ->
            Err errors

        (Err originalErrors, Err nextErrors) ->
            Err (List.concat originalErrors nextErrors)

The secret sauce is that mergeResult has two parameterized types, which I have named a and b because I have an incredible imagination.

The when matches on a tuple, and if either arm is an Err result, we always return an Err result, possibly merging the list of errors if they are both Err results.

In the success case, we call the mapper function to generate a new a (whatever type a is) and return that as the Ok result.

In our specific callsite, the a type is Module and the b type is Mem, but this function can actually merge any kind of TransformResult. We’ll definitely be calling it again with a being a Module for other types of module children such as data and import. I suspect we’ll also be using it with a different a to combine results when a child contains nested expressions, but I’m not certain yet.

At this point my tests are passing again and I’m ready to write some expectations for invalid memory expressions that aren’t a simple (memory 1).

expect
    # (module (memory))
    result = transform
        (
            SExpression {
                position: { column: 1, row: 1 },
                name: "module",
                state: Closed,
                children: [
                    SExpression {
                        position: { column: 8, row: 1 },
                        name: "memory",
                        state: Closed,
                        children: [],
                    },
                ],
            }
        )

    result
    == Err [{ message: "Invalid memory", position: { column: 0, row: 0 } }]

expect
    # (module (memory "ten"))
    result = transform
        (
            SExpression {
                position: { column: 1, row: 1 },
                name: "module",
                state: Closed,
                children: [
                    SExpression {
                        position: { column: 8, row: 1 },
                        name: "memory",
                        state: Closed,
                        children: [
                            Term {
                                position: { column: 17, row: 1 },
                                term: String "10",
                            },
                        ],
                    },
                ],
            }
        )

    result
    == Err [{ message: "Invalid memory", position: { column: 17, row: 1 } }]

expect
    # (module (memory (something 10)))
    result = transform
        (
            SExpression {
                position: { column: 1, row: 1 },
                name: "module",
                state: Closed,
                children: [
                    SExpression {
                        position: { column: 8, row: 1 },
                        name: "memory",
                        state: Closed,
                        children: [
                            SExpression {
                                position: { column: 17, row: 1 },
                                name: "something",
                                state: Closed,
                                children: [
                                    Term {
                                        position: { column: 28, row: 1 },
                                        term: Number 10,
                                    },
                                ],
                            },
                        ],
                    },
                ],
            }
        )

    result
    == Err [{ message: "Invalid memory", position: { column: 17, row: 1 } }]

Note that the first case is not ideal because it doesn’t record the “real” position of the error. Since there are no children, the error is actually in the parent element. We can solve this by adding an arm to the when expression in transformModule for the “empty children memory” case:

            SExpression { position, name: "memory", children: [] } ->
                mergeError currentResult { position, message: "Invalid memory: min limit required" }

This must go before the expression that handles other lists of children since it catches the empty situation already. It also depends on a mergeError function that is kind of like a simplified version of the mergeResult function:

mergeError : TransformResult a, Common.Error -> TransformResult a
mergeError = \originalResult, error ->
    when originalResult is
        Ok _ ->
            errors = [error]
            Err errors

        Err errors -> Err (List.append errors error)

Now I need to update my (module (memory)) test case to address this better error and position:

    result
    == Err [{ message: "Invalid memory: min limit required", position: { column: 8, row: 1 } }]

Modeling Direct Exports

Exports in WAT represent entities that can be accessed by the host environment (e.g. the browser, or a Rust application). Exports are somewhat nuanced because you can export multiple types of entities including functions, tables, memories, and globals. Further, you can export them explicitly with the export S-Expression, or implicitly by adding an export line when you define one of those entities.

Our “Hello World” app contains two exports, and to make my life simpler they are both explicit. One exports the memory we just defined, and the other exports a “_start” function so the runtime can call it.

    (export "memory" (memory 0))
    (export "_start" (func $main))

In the “hello world” context, these S-expressions are children of the (module), immediately after the (memory 1) expression we already know how to parse. The name of the S-Expression is export, and the first child must be a string, which represents the name of the export that the caller can access. We are exporting one as “memory”, and the other as “_start”, in this case, presumably because wasmtime knows what to do with those names. The third child is an exportdesc. In WAT syntax, it can be a func, table, memory, or global token, but we’ll only be dealing with memory and func.

Even though the (memory 0) token has the exact same syntax as the (memory 1) we defined before, it is important to understand that the number means something different. At the top level, (memory 1) means “create a memory with at least 1 page worth of bytes available”. Inside an export expression, (memory 0) means, “export the memory object at index 0 in the mems array”.

Since we aren’t writing a Wasm runtime, we don’t actually care that much, but hopefully you now understand why I’m not reusing the existing Mem type and have instead created the following type aliases:

MemIndex : U32
FuncIndex: String

Note that a type definition like this just means that U32 and MemIndex can be used interchangeably. It doesn’t place any requirements on us to use MemIndex everywhere it is needed. If we wanted that level of control (especially useful in libraries that expose the mapped types), we might use Opaque Types, but I’m not interested in adding that level of complexity right now.

I should also mention that realistically, these should probably be a tag union, as indexes in many cases can either by an integer or string identifier. But our sample code doesn’t need that so I’m going to say: memories are always integer indexes, functions are always string identifiers.

While we’re fiddling around with types, let’s also extend Export to be a real record:

Export : {
    name : Str,
    type : [Mem MemIndex, Func FuncIndex],
}

As usual, I’m not coding to the full Wasm spec, though you will note that–by making type a tag union–I’ve left the door open for other structures of exports in the future.

Now we have the types in place to construct an expectation around memory exports:

expect
    # (module
    #    (export "memory" (memory 0))
    # )
    result = transform
        (
            SExpression {
                position: { column: 1, row: 1 },
                name: "module",
                state: Closed,
                children: [
                    SExpression {
                        position: { column: 4, row: 2 },
                        name: "export",
                        state: Closed,
                        children: [
                            Term {
                                position: { column: 15, row: 2 },
                                term: String "memory",
                            },
                            SExpression {
                                position: { column: 21, row: 2 },
                                name: "memory",
                                state: Closed,
                                children: [
                                    Term {
                                        position: { column: 30, row: 2 },
                                        term: Number 0,
                                    },
                                ],
                            },
                        ],
                    },
                ],
            }
        )

    result
    ==
    Ok {
        position: { row: 1, column: 1 },
        element: {
            funcs: [],
            mems: [],
            datas: [],
            imports: [],
            exports: [
                {
                    position: { row: 2, column: 4 },
                    element: {
                        name: "memory",
                        type: Mem 0,
                    },
                },
            ],
        },
    }

Note: wasm-tools parse will happily build this syntax to Wasm, but if you try to run it, validation will know that you don’t have any memories to export and it will refuse to run.

I won’t show the test case for the function export, but it is similar.

Now we need to write a new transformExport function that can parse this expression. It can be similar to transformMem with an extra layer of nesting, although a parser for the full Wasm export grammar would need to be far more nuanced.

transformExport : Transformer Export
transformExport = \children ->
    when children is
        [Term { term: String name }, SExpression { name: "memory", children: [Term { term: Number memIndex }] }] ->
            Ok {
                name,
                type: Mem memIndex,
            }

        [Term { term: String name }, SExpression { name: "func", children: [Term { term: Variable funcIndex }] }] ->
            Ok {
                name,
                type: Func funcIndex,
            }

        [Term { term: String _ }, SExpression { position, name: exportType }] ->
            Err [{ position, message: "Can't handle export type $(exportType)" }]

        _ -> Err [{ position: { row: 0, column: 0 }, message: "Unexpected Export format" }]

I handled the success cases and any arms that have the right format, but are trying to export a type that we don’t currently handle. Then I got lazy and just threw a catch-all on there for any other wild syntaxes. Note that this kind of “laziness” is (probably) exactly why the Roc compiler is both so good and so bad at reporting errors. Roc is written in Rust, which features pattern matching similar to what Roc has. When the devs have added the appropriate match arms to their compiler, we get nice errors and explanations. When they haven’t done it yet, we get a wildcard arm that says “TODO: Add context”. And when they completely forgot to add an arm, we get a Rust panic. Luckily, I don’t expect anyone to actually use this compiler, so I can cheat every once in a while!

To make the test pass, I added two more arms to the when expression in the transformModule function:

SExpression { position, name: "export", children: [] } ->
    mergeError currentResult { position, message: "Invalid export: name and index required" }

SExpression { position, name: "export", children: expressions } ->
    mergeResult currentResult (transformExport expressions) \module, export -> { module &
            exports: List.append module.exports { position, element: export },
        }

This is getting repetitive, so I won’t show the tests for the error cases. There is a wildcard arm so I know that if I got it wrong, it’s going to be an annoying error message instead of an outright crash.

I think I’ll leave it here for now. In the next part, we’ll be compiling import and data expressions. In the unlikely event that goes well, we might even be able to do a function!

Build a Wasm Compiler in Roc - Part 9

Getting our bearings

Parsing a memory

Modeling Direct Exports

Dusty Phillips