Build a Wasm Compiler in Roc - Part 9
In earlier articles, we implemented a tokenizer and parser to convert Wasm’s WAT syntax into an S-expression abstract syntax tree, and started to implement a transformer to convert that AST into one more suitable for generating Wasm bytecode.
This article continues where part 8 left off, as we try to expand the parser
to something more than an empty WAT (module)
.
Reminder: You are reading content that took a great deal of effort to craft, compose, and debug. If you appreciate this work, consider supporting me on Patreon or GitHub.
Getting our bearings
As a refresher, the hello.wat
file that I want to compile to Wasm bytecode
looks like this:
(module
(import "wasi_snapshot_preview1" "fd_write" (func $fd_write (param i32 i32 i32 i32) (result i32)))
(memory 1)
(export "memory" (memory 0))
(export "_start" (func $main))
(data (offset (i32.const 8)) "hello world")
(func $main
(i32.const 0)
(i32.const 8)
(i32.store)
(i32.const 4)
(i32.const 11)
(i32.store)
(i32.const 1)
(i32.const 0)
(i32.const 1)
(i32.const 20)
(call $fd_write)
drop
)
)
We have parsed this code into a bunch of S-Expressions, and we’ve constructed an empty module AST that currently looks like this:
Func : {}
Mem : {}
Data : {}
Import : {}
Export : {}
Module : {
funcs : List (Common.Positionable Func),
mems : List (Common.Positionable Mem),
datas : List (Common.Positionable Data),
imports : List (Common.Positionable Import),
exports : List (Common.Positionable Export),
}
emptyModule = {
funcs: [],
mems: [],
datas: [],
imports: [],
exports: [],
}
Our next task is to start populating those lists based on the children of the
module
S-expression in the input string.
We’ve already transformed the initial module
expression in our transform
function, but we are currently absolutely ignoring the children. Since we want
to raise respectful and well-mannered AST nodes, ignoring them as children is
probably not the wisest course of action.
I’m going to do this recursively, with a different function for each type of syntax we want to parse. That allows us to impose a bit more structure on the types than if we used a single recursive function that just returns a big ol' bag of nodes (like our parser does).
Each recursive function is called with the children
from an S-Expression. We
don’t pass the S-Expression’s name
down because the function in question
would not have been called unless we had already matched the name. Remember,
children
is a list of Parser.Expression
objects, and the list can contain
either Term
or SExpression
tags with their associated bodies.
Each of these recursive functions will return a Result
where the Ok
field
depends on which function was called, and the Err
field will always be a
list of Common.Error
objects. To make this clearer, let’s add a parameterized
type in the type definitions at the top of the file for this kind of result:
TransformResult ok : Result ok (List Common.Error)
We can shorten the type of transform
to use this new type:
transform : Parser.Node -> TransformResult (Common.Positionable Module)
We can also use parameterized types in function definitions. I am expecting/hoping/planning that all our recursive functions will have the same basic structure, so let’s define a type for that:
Transformer ok : List Parser.Expression -> TransformResult ok
A Transformer
is a function that accepts a list of the parse children of that
node and returns TransformResult
where the ok
type depends on the function
being called.
Armed with these types, we can start to flesh out the boilerplate for a function
that parses the children of a module
S-Expression:
transformModule : Transformer Module
transformModule = \children ->
List.walk children (Ok emptyModule) \currentResult, child ->
currentResult
Now we can shorten the success arm of our transform
function to call this
function instead:
when expression is
{ position, expression: SExpression { name: "module", children } } ->
children
|> transformModule
|> Result.map \element -> { position, element }
The Transformer
type is not aware of the position
, but the transform
function needs to return a positionable
, so we throw a map
on the end to
modify a success result. Error results don’t need to be modified because
they’ll carry their own position.
At this point, my existing tests are all passing because we are returning
an empty result when there are no children, and the only success test is for
an empty (module)
. Let’s add a new test for a memory
child so we have
more things to fix!
Parsing a memory
Memory is one of the simpler expressions, so let’s start with that.
Technically, a valid WAT memory will always be of the form (memory num)
or
(memory num num)
, depending on whether there is an upper size limit on the
memory. But I’m coding to a sample instead of to the WAT spec, so I’m going to
pretend that only (memory num)
is valid!
We can start by reflecting this on the Mem
type that is currently defined as
{}
:
Mem : {
min : U32,
}
Since Module
already refers to this type, we don’t need to update it.
Now that the type exists, we should be able to set up an expectation for
a simple module containing a single Mem
:
expect
# (module (memory 10))
result = transform
(
SExpression {
position: { column: 1, row: 1 },
name: "module",
state: Closed,
children: [
SExpression {
position: { column: 8, row: 1 },
name: "memory",
state: Closed,
children: [
Term {
position: { column: 17, row: 1 },
term: Number 10,
},
],
},
],
}
)
result
== Ok {
position: { row: 1, column: 1 },
element: {
funcs: [],
mems: [
{
position: { row: 1, column: 10 },
element: { min: 10u32 },
},
],
datas: [],
imports: [],
exports: [],
},
}
Next, let’s create a new transformMemory
function. Like transformModule
,
it is a Transformer
function, but the ok
value is a Mem
this time:
transformMem : Transformer Mem
transformMem = \children ->
#TODO
Because Mem
has such a rigid structure, we don’t need to iterate over the children
like we do in transformModule
. We know it’s either a (memory Number number)
or
an error:
transformMem : Transformer Mem
transformMem = \children ->
when children is
[Term { position, term: Number number }] -> Ok { min: number }
_ -> Err [{ position: { row: 0, column: 0 }, message: "Invalid memory" }]
Ideally, we wouldn’t be hardcoding an incorrect error position like that, though. I tried to change it to extract a position from the child nodes, but the compiler got confused by the recursive typechecking. I’m 90% certain this code is correct, but it wouldn’t compile:
transformMem : Transformer Mem
transformMem = \children ->
when children is
[Term { position, term: Number number }] -> Ok { min: number }
[] -> Err [{ position: { row: 0, column: 0 }, message: "Invalid memory" }]
[Term { position }, ..] -> Err [{ position, message: "Invalid memory" }]
[SExpression { position }, ..] -> Err [{ position, message: "Invalid memory" }]
The error message is very long, but the gist is “I’m so confused about recursive types, help.” At this point, I’m also confused about recursive types in Roc, so I can’t help the poor compiler.
My solution for now is:
transformMem : Transformer Mem
transformMem = \children ->
when children is
[Term { position, term: Number number }] -> Ok { min: number }
[] -> Err [{ position: { row: 0, column: 0 }, message: "Invalid memory" }]
[Term { position }, ..] -> Err [{ position, message: "Invalid memory" }]
[SExpression { position }, ..] -> Err [{ position, message: "Invalid memory" }]
_ -> crash "Impossible match arm in transformMem" # shouldn't be necessary?
Once I get the success case working, I’ll try to write some expectations for the failure cases to make sure it isn’t actually possible to hit that match arm.
My success case expectation is still failing because I’m not actually calling
transformMem
anywhere. We will need to set up a pattern match in the
List.walk
in transformModule
to call it in the appropriate case:
transformModule : Transformer Module
transformModule = \children ->
List.walk children (Ok emptyModule) \currentResult, child ->
when child is
SExpression { position, name: "memory", children: expressions } ->
mergeResult currentResult (transformMem expressions) \module, mem -> { module &
mems: List.append module.mems { position, element: mem },
}
_ -> Err [{ position: { row: 0, column: 0 }, message: "Unexpected child" }]
Inside the List.walk
callback, we are matching on child
. If it is an
S-Expression
with a name of memory
, we call the transformMem
function on
it. Then we call a sneaky function called mergeResult
that I’ll show you
shortly. This function accepts the currentResult
, which is either an Ok Module
or an Err (List Common.Error)
, and another result that has an
arbitrary success type, but the same Err (List Common.Error)
type. It returns
a new Ok Module
Result if both results are Ok
, or a new list of errors if
either or both of them returned an error.
The third argument is a function that acts something like a Result.map
function, except it accepts two values. It is only called if both results
are Ok
, and it accepts the contents of those Ok
results and returns a new
value that mergeResult
will wrap in another Ok
.
In other words, the only way for mergeResult
to return an Ok
value is if
both of its arguments are Ok
.
So in this particular case, we are calling mergeResult
with a Result
who’s
success value is a Module
and a second Result
containing a Mem
. The
mapping function returns a new Module
where the Mem
has been appended to
the mems
list.
You might find it instructive to try to define mergeResult
yourself. I certainly
did! If not, or if you get stuck, here’s the implementation:
mergeResult : TransformResult a, TransformResult b, (a, b -> a) -> TransformResult a
mergeResult = \originalResult, nextResult, mapper ->
when (originalResult, nextResult) is
(Ok original, Ok next) ->
Ok (mapper original next)
(Err errors, Ok _) ->
Err errors
(Ok _, Err errors) ->
Err errors
(Err originalErrors, Err nextErrors) ->
Err (List.concat originalErrors nextErrors)
The secret sauce is that mergeResult
has two parameterized types, which I
have named a
and b
because I have an incredible imagination.
The when
matches on a tuple, and if either arm is an Err
result, we always
return an Err
result, possibly merging the list of errors if they are both
Err
results.
In the success case, we call the mapper
function to generate a new a
(whatever type a
is) and return that as the Ok
result.
In our specific callsite, the a
type is Module
and the b
type is Mem
,
but this function can actually merge any kind of TransformResult
. We’ll
definitely be calling it again with a
being a Module
for other types of
module children such as data
and import
. I suspect we’ll also be using it
with a different a
to combine results when a child contains nested
expressions, but I’m not certain yet.
At this point my tests are passing again and I’m ready to write some
expectations for invalid memory
expressions that aren’t a simple (memory 1)
.
expect
# (module (memory))
result = transform
(
SExpression {
position: { column: 1, row: 1 },
name: "module",
state: Closed,
children: [
SExpression {
position: { column: 8, row: 1 },
name: "memory",
state: Closed,
children: [],
},
],
}
)
result
== Err [{ message: "Invalid memory", position: { column: 0, row: 0 } }]
expect
# (module (memory "ten"))
result = transform
(
SExpression {
position: { column: 1, row: 1 },
name: "module",
state: Closed,
children: [
SExpression {
position: { column: 8, row: 1 },
name: "memory",
state: Closed,
children: [
Term {
position: { column: 17, row: 1 },
term: String "10",
},
],
},
],
}
)
result
== Err [{ message: "Invalid memory", position: { column: 17, row: 1 } }]
expect
# (module (memory (something 10)))
result = transform
(
SExpression {
position: { column: 1, row: 1 },
name: "module",
state: Closed,
children: [
SExpression {
position: { column: 8, row: 1 },
name: "memory",
state: Closed,
children: [
SExpression {
position: { column: 17, row: 1 },
name: "something",
state: Closed,
children: [
Term {
position: { column: 28, row: 1 },
term: Number 10,
},
],
},
],
},
],
}
)
result
== Err [{ message: "Invalid memory", position: { column: 17, row: 1 } }]
Note that the first case is not ideal because it doesn’t record the “real”
position of the error. Since there are no children, the error is actually
in the parent element. We can solve this by adding an arm to the when
expression
in transformModule
for the “empty children memory” case:
SExpression { position, name: "memory", children: [] } ->
mergeError currentResult { position, message: "Invalid memory: min limit required" }
This must go before the expression that handles other lists of children since
it catches the empty situation already. It also depends on a mergeError
function
that is kind of like a simplified version of the mergeResult
function:
mergeError : TransformResult a, Common.Error -> TransformResult a
mergeError = \originalResult, error ->
when originalResult is
Ok _ ->
errors = [error]
Err errors
Err errors -> Err (List.append errors error)
Now I need to update my (module (memory))
test case to address this better error and position:
result
== Err [{ message: "Invalid memory: min limit required", position: { column: 8, row: 1 } }]
Modeling Direct Exports
Exports in WAT represent entities that can be accessed by the host environment
(e.g. the browser, or a Rust application). Exports are somewhat nuanced because
you can export multiple types of entities including functions, tables,
memories, and globals. Further, you can export them explicitly with the
export
S-Expression, or implicitly by adding an export
line when you define
one of those entities.
Our “Hello World” app contains two exports, and to make my life simpler they are both explicit. One exports the memory we just defined, and the other exports a “_start” function so the runtime can call it.
(export "memory" (memory 0))
(export "_start" (func $main))
In the “hello world” context, these S-expressions are children of the
(module)
, immediately after the (memory 1)
expression we already know how
to parse. The name of the S-Expression is export
, and the first child must be
a string, which represents the name of the export that the caller can access.
We are exporting one as “memory”, and the other as “_start”, in this case,
presumably because wasmtime knows what to do with those names. The third child
is an exportdesc
. In WAT syntax, it can be a func, table, memory, or global
token, but we’ll only be dealing with memory
and func
.
Even though the (memory 0)
token has the exact same syntax as the (memory 1)
we defined before, it is important to understand that the number means
something different. At the top level, (memory 1)
means “create a memory with
at least 1 page worth of bytes available”. Inside an export
expression,
(memory 0)
means, “export the memory
object at index 0 in the mems
array”.
Since we aren’t writing a Wasm runtime, we don’t actually care that much, but
hopefully you now understand why I’m not reusing the existing Mem
type and
have instead created the following type aliases:
MemIndex : U32
FuncIndex: String
Note that a type definition like this just means that U32
and MemIndex
can
be used interchangeably. It doesn’t place any requirements on us to use
MemIndex
everywhere it is needed. If we wanted that level of control
(especially useful in libraries that expose the mapped types), we might use
Opaque Types, but I’m not
interested in adding that level of complexity right now.
I should also mention that realistically, these should probably be a tag union, as indexes in many cases can either by an integer or string identifier. But our sample code doesn’t need that so I’m going to say: memories are always integer indexes, functions are always string identifiers.
While we’re fiddling around with types, let’s also extend Export
to be a real
record:
Export : {
name : Str,
type : [Mem MemIndex, Func FuncIndex],
}
As usual, I’m not coding to the full Wasm spec, though you will note that–by
making type
a tag union–I’ve left the door open for other structures of exports
in the future.
Now we have the types in place to construct an expectation around memory exports:
expect
# (module
# (export "memory" (memory 0))
# )
result = transform
(
SExpression {
position: { column: 1, row: 1 },
name: "module",
state: Closed,
children: [
SExpression {
position: { column: 4, row: 2 },
name: "export",
state: Closed,
children: [
Term {
position: { column: 15, row: 2 },
term: String "memory",
},
SExpression {
position: { column: 21, row: 2 },
name: "memory",
state: Closed,
children: [
Term {
position: { column: 30, row: 2 },
term: Number 0,
},
],
},
],
},
],
}
)
result
==
Ok {
position: { row: 1, column: 1 },
element: {
funcs: [],
mems: [],
datas: [],
imports: [],
exports: [
{
position: { row: 2, column: 4 },
element: {
name: "memory",
type: Mem 0,
},
},
],
},
}
Note:
wasm-tools parse
will happily build this syntax to Wasm, but if you try to run it, validation will know that you don’t have any memories to export and it will refuse to run.
I won’t show the test case for the function export, but it is similar.
Now we need to write a new transformExport
function that can parse this
expression. It can be similar to transformMem
with an extra layer of nesting,
although a parser for the full Wasm export grammar would need to be far more
nuanced.
transformExport : Transformer Export
transformExport = \children ->
when children is
[Term { term: String name }, SExpression { name: "memory", children: [Term { term: Number memIndex }] }] ->
Ok {
name,
type: Mem memIndex,
}
[Term { term: String name }, SExpression { name: "func", children: [Term { term: Variable funcIndex }] }] ->
Ok {
name,
type: Func funcIndex,
}
[Term { term: String _ }, SExpression { position, name: exportType }] ->
Err [{ position, message: "Can't handle export type $(exportType)" }]
_ -> Err [{ position: { row: 0, column: 0 }, message: "Unexpected Export format" }]
I handled the success cases and any arms that have the right format, but are trying to export a type that we don’t currently handle. Then I got lazy and just threw a catch-all on there for any other wild syntaxes. Note that this kind of “laziness” is (probably) exactly why the Roc compiler is both so good and so bad at reporting errors. Roc is written in Rust, which features pattern matching similar to what Roc has. When the devs have added the appropriate match arms to their compiler, we get nice errors and explanations. When they haven’t done it yet, we get a wildcard arm that says “TODO: Add context”. And when they completely forgot to add an arm, we get a Rust panic. Luckily, I don’t expect anyone to actually use this compiler, so I can cheat every once in a while!
To make the test pass, I added two more arms to the when
expression in the
transformModule
function:
SExpression { position, name: "export", children: [] } ->
mergeError currentResult { position, message: "Invalid export: name and index required" }
SExpression { position, name: "export", children: expressions } ->
mergeResult currentResult (transformExport expressions) \module, export -> { module &
exports: List.append module.exports { position, element: export },
}
This is getting repetitive, so I won’t show the tests for the error cases. There is a wildcard arm so I know that if I got it wrong, it’s going to be an annoying error message instead of an outright crash.
I think I’ll leave it here for now. In the next part, we’ll be compiling import and data expressions. In the unlikely event that goes well, we might even be able to do a function!