Build a Wasm Compiler in Roc - Part 10
In earlier articles, we implemented a tokenizer and parser to convert Wasm’s WAT syntax into an S-expression abstract syntax tree and started to implement a transformer to convert that AST into one more suitable for generating Wasm bytecode.
Now that the boilerplate is in place, this article will continue to implement recursive transformer functions, getting into some of the more complicated expressions in the “hello world” WAT example.
I should mention that the code in this article gets long and repetitive. I’m including all the code we’ll need for the hello world example, but if you feel like it’s getting a bit boring, feel free to skim bits of it! I now understand why most “introduction to compiler” articles focus on such a tiny amount of syntax! I feel committed now, though, so if you’re willing to bear with me, we’ll see this through to the end.
So far, we can parse (basic) modules, so long as they contain only memory
and
export
expressions. Let’s next add a data
expression. There is only one
data
expression in our “hello world” example, but it is undoubtedly the most
important data in the world (considering how many times it has been typed over
the years…):
(data (offset (i32.const 8)) "hello world")
The spec for data segments is fairly straightforward, though I’m obviously going to simplify it because, “Oh my god, this blog series has gotten out of hand.”
The offset
syntax is a bit odd. The children of that expression can be any
sequence of constant Wasm instructions, though a single i32.const
is most
common. The (i32.const 8)
instruction pushes the number 8
onto the
runtime’s stack. However, my understanding is that these instructions are
evaluated at compile time. In practice, the limitation to constant instructions
means you can do a bit of arithmetic in there if you need to, but not much
else. I’m going to try to make the data
accept more than one instruction, but
I’m not going to test it.
Another thing to note is that if there is only one instruction, the runtime is
supposed to support leaving the offset
keyword off. I’m not going to bother
with that. It’s just a bunch of extra match arms without very much education!
Transforming Instructions
Let’s start by adding a new type for an i32.const
instruction. In a real app,
we’d probably try to separate things on the .
because the instruction for
e.g. i32.const
and i64.const
probably have plenty of overlap. In the
example we are compiling, all instructions happen on i32
, so I’m just going
to treat it as a one name.
I32ConstInstruction : U32
Note that the value is a U32
instead of an I32
, even though the wasm value
is an I32
. (The difference is that I32
allows negative numbers, but the
maximum value is only half of the maximum of a U32
). Wasm uses the I32
type
for both signed and unsigned integers, and the interpretation depends on
context. We don’t have that context, and I’m hoping only the
runtime actually needs that context. In any event, our “Hello World” script
doesn’t do any negative integer calculation, so we can get away with ignoring
it.
Now we can add a transformI32ConstInstruction
:
transformI32Const : Transformer I32ConstInstruction
transformI32Const = \children ->
when children is
[Term { term: Number value }] -> Ok value
_ -> Err [{ position: { row: 0, column: 0 }, message: "Unexpected i32.const format. One number expected." }]
Once again, I’m cheating on the error case. I should really match on any other
term or S-expression so I can pull the position out of the first element like
we did for transformMem
. However, I found that caused a lot of Roc compiler
discomfort, so let’s just pretend we’ve done that, ok?
Unlike our previous transform*
functions, it is not legal to put a i32.const
instruction at the top level in a module
. So it doesn’t make sense to add a
match arm for i32.const
to the transformModule
function. Instead, let’s
flesh out a new function for transforming a list of instructions.
Instructions are the things that make an assembly-style language actually do things. In Wasm, they are usually doing things to the stack. The point is, there are a fair number of Wasm instructions, and each one has its own structure and prerequisites.
Can anyone smell a tagged union?
Let’s add a new Instruction
type that is, yes indeed, a tagged union:
Instruction : [I32Const (Common.Positionable I32ConstInstruction)]
While we’re at it, let’s update the existing (currently empty) Data
record
type to include an offset
as a list of instructions, as well as the bytes
that are supposed to be placed at that offset in the default memory:
Data : {
offset : List Instruction,
bytes : Str,
}
I made it so that the offset
could theoretically contain multiple
instructions. I only added one type of instruction, but if someone wanted to
support other const
expressions, they would be able to add the types to this
tag.
Now we can start roughing out a transformInstructions
function that accepts a list
of children and tries to build a list of Instruction
records out of it:
transformInstructions : Transformer (List Instruction)
transformInstructions = \children ->
List.walk
children
(Ok [])
\currentResult, child ->
when child is
Term { position } ->
mergeError
currentResult
{ position, message: "Expected instruction, got term" }
SExpression { position, name: "i32.const", children: expressionChildren } ->
mergeResult
currentResult
(transformI32Const expressionChildren)
\currentList, element ->
List.append currentList (I32Const { position, element })
SExpression { position, name } ->
mergeError
currentResult
{ position, message: "Unexpected instruction $(name)" }
I went to the trouble of showing how to non-lazily handle errors this time. If
we get a Term
or an SExpression
that is not an i32.const
, we throw an
error using the handy mergeError
function. If we get an i32.const
, we
extract it using the transformI32Const
function we defined earlier, and use
mergeResult
with a mapper function that appends the tag to a list.
When we add more instructions later, we’ll be able to add more match arms to this function.
Transforming the data section
Realistically, I could/should have added some expectations for the functions we just wrote, but the more Roc I write, the more confident I am that “if it compiles, it works” tends to hold true. Usually when I encounter languages that support this promise, they make me angry quickly because the “if it compiles” part is so slow to achieve. But Roc has been (so far) fast enough that I’m starting to rely on the compiler as rudimentary unit test coverage.
Probably not the best habit to get into, though, so let’s write an expectation
now before we write the transformData
function:
expect
# (module
# (data (offset (i32.const 8)) "hello world")
# )
result = transform
(
SExpression {
position: { column: 1, row: 1 },
name: "module",
state: Closed,
children: [
SExpression {
position: { column: 3, row: 2 },
name: "data",
state: Closed,
children: [
SExpression {
position: { column: 9, row: 2 },
name: "offset",
state: Closed,
children: [
SExpression {
position: { column: 17, row: 2 },
name: "i32.const",
state: Closed,
children: [
Term {
position: { column: 28, row: 2 },
term: Number 8,
},
],
},
],
},
Term {
position: { row: 1, column: 30 },
term: String "hello world",
},
],
},
],
}
)
result
==
Ok {
position: { row: 1, column: 1 },
element: {
funcs: [],
mems: [],
datas: [
{
position: { row: 2, column: 3 },
element: {
offset: [I32Const { position: { row: 2, column: 17 }, element: 8 }],
bytes: "hello world",
},
},
],
imports: [],
exports: [],
},
}
Ok, when that test passes (possibly with a couple tweaks, but I think I got it close), I will assert that data sections are “good enough”.
First, let’s define the transformData
function, which starts out fairly familiar:
transformData : Transformer Data
transformData = \children ->
when children is
[SExpression { position, name: "offset", children: offsetChildren }, Term { term: String string }] ->
transformInstructions offsetChildren
|> Result.map \instructions -> {
offset: instructions,
bytes: string,
}
_ -> Err [{ position: { row: 0, column: 0 }, message: "Unexpected data segment" }]
I was able to use the Result.map
function to translate a success condition
into something that has been wrapped in a position
. I went back to cheating on the
error conditions, so our error positions are going to be annoying.
Then I needed to do some light copy-pasting to add data
arms to the
transformModule
function:
SExpression { position, name: "data", children: [] } ->
mergeError currentResult { position, message: "Invalid data: offset and bytes required" }
SExpression { position, name: "data", children: expressions } ->
mergeResult currentResult (transformData expressions) \module, data -> { module &
datas: List.append module.datas { position, element: data },
}
The unit test passes, so let’s move on before we find something wrong with it!
Import expressions
Import expressions are rather complicated, and we’ll probably find it easiest
to build them from the inside out. Here’s the import
expression from our
“hello world” sample:
(import "wasi_snapshot_preview1" "fd_write" (func $fd_write (param i32 i32 i32 i32) (result i32)))
As usual, the WAT grammar has a few different structures for import
expressions, and you can import things other than func
from them,
but we’re just going to model one of them.
The overall structure is an expression named import
with three children. The
first two children are strings representing the namespace and name of the
import; these are used to communicate with the runtime that needs to hook up
the import machinery.
The third param is an expression whose structure depends on what you are importing. Since we are importing a func, it is followed by the identifier of that func and two more expressions representing the list of params and list of results.
Other than the complexity of the expressions, the key thing to be aware of with imports is that (when adding functions) we need to update the function types section of the module as well as the imports section.
Let’s start at the inside by parsing types. Conveniently, “hello world” only uses one type (i32), so we can keep it simple. For imaginary future extensibility, we can make it a tagged union with (for now) just one tag:
Type : [I32]
For our target sample, the only place we need types is in parameter and results lists, so we may as well parse them as a list:
transformType : Transformer (List (Common.Positionable Type))
transformType = \children ->
List.walk children (Ok []) \currentResult, child ->
when child is
Term { position, term: Name "i32" } ->
mergeResult
currentResult
(Ok { position, element: I32 })
List.append
Term { position, term: Name name } ->
mergeError
currentResult
{ position, message: "Unexpected type: $(name)" }
Term { position } ->
mergeError
currentResult
{ position, message: "Not a type" }
SExpression { position } ->
mergeError
currentResult
{ position, message: "Not a type" }
_ -> crash "Hit unexpected match arm in transformType" # Pretty sure this shouldn't be necessary
Pretty straightforward; I tried to give slightly interesting error messages if
it encounters anything other than an i32
. The crash arm feels like it
shouldn’t be necessary, but either I haven’t specified the other types
exhaustively, or the complier wasn’t able to confirm that they were exhaustive.
Just because we can, let’s write some quick tests for this:
expect
# (...i32, i32)
result = transformType [
Term { position: { row: 1, column: 5 }, term: Name "i32" },
Term { position: { row: 1, column: 9 }, term: Name "i32" },
]
result
==
Ok [
{ element: I32, position: { column: 5, row: 1 } },
{ element: I32, position: { column: 9, row: 1 } },
]
expect
# (...f32) isn't supported yet
result = transformType [
Term { position: { row: 1, column: 5 }, term: Name "f32" },
]
result == Err [{ position: { column: 5, row: 1 }, message: "Unexpected type: f32" }]
Nothing too unusual in there, so let’s next try to parse a function signature. When we get around to generating code, the function types will go in a separate section of the binary, and must include both import types and the types of functions defined in the module.
We have two functions in our hello world
syntax: one in the import statement,
and one defined in the module body. We’ll focus on the import statement for
now, as the signature of the one in the body is calculated differently.
The WAT syntax looks like this:
(func $identifier (param ...) (result ...))
As usual, the formal grammar includes variations on this syntax, but for my sanity, I’ll assume they all look like this.
Let’s add a new FuncType
top level definition:
FuncType : {
identifier : Str,
param : List (Common.Positionable Type),
result : List (Common.Positionable Type),
}
Now we can write a rudimentary trnasformFuncType
function:
transformFuncType : Transformer FuncType
transformFuncType = \children ->
when children is
[Term { position, term: Variable identifier }, SExpression { name: "param", children: paramChildren }, SExpression { name: "result", children: resultChildren }] ->
withParams = mergeResult
(Ok { identifier, param: [], result: [] })
(transformTypes paramChildren)
\current, param -> { current &
param,
}
mergeResult
withParams
(transformTypes resultChildren)
\current, result -> { current &
result,
}
_ -> Err [{ position: { row: 0, column: 0 }, message: "Inappropriate error handling in transformFuncType" }]
I say “rudimentary” because I’m not even pretending to handle errors this time! Let’s add a similarly uninspired test:
expect
# (... $identifier (param i32 i32) (result i32))
result = transformFuncType [
Term { position: { row: 1, column: 5 }, term: Variable "$identifier" },
SExpression {
position: { row: 1, column: 17 },
name: "param",
state: Closed,
children: [
Term { position: { row: 1, column: 24 }, term: Name "i32" },
Term { position: { row: 1, column: 28 }, term: Name "i32" },
],
},
SExpression {
position: { row: 1, column: 33 },
name: "result",
state: Closed,
children: [
Term { position: { row: 1, column: 41 }, term: Name "i32" },
],
},
]
result
== Ok {
identifier: "$identifier",
param: [
{ position: { row: 1, column: 24 }, element: I32 },
{ position: { row: 1, column: 28 }, element: I32 },
],
result: [
{ position: { row: 1, column: 41 }, element: I32 },
],
}
Now rinse and repeat (and I truly apologize for the monotony) with the
Import
type:
Import : {
namespace : Str,
name : Str,
definition : [Func Str],
}
The definition is a tagged union in case we want to add other types of imports
later. But notice that it does not include the function definition we just
went to so much trouble to parse! This is the type that we will be attaching to the
Module
type. We record the namespace and module in the imports
section, but
the funcType
needs to go in a new types
section:
Module : {
types : List (Common.Positionable FuncType),
funcs : List (Common.Positionable Func),
mems : List (Common.Positionable Mem),
datas : List (Common.Positionable Data),
imports : List (Common.Positionable Import),
exports : List (Common.Positionable Export),
}
We’ll also need to add types
to emptyModule
:
emptyModule = {
types: [],
funcs: [],
mems: [],
datas: [],
imports: [],
exports: [],
}
I had to update a couple module unit tests with types: []
as well.
So when we update the Module
after parsing an import, we need to update two
different fields on it.
Let’s add a TransformedImport
type to be the return value of our upcoming
transformImport
function; it will look more like you expected:
# The result of `transformImport` is split into `types` and `imports` section
# on the module
TransformedImport : {
namespace : Str,
name : Str,
definition : [Func FuncType],
}
Now we can write the half-assed transformImport
implementation:
transformImport : Transformer TransformedImport
transformImport = \children ->
when children is
[Term { term: String namespace }, Term { term: String name }, SExpression { name: "func", children: funcChildren }] ->
funcChildren
|> transformFuncType
|> Result.map \functionDefinition -> { namespace, name, definition: Func functionDefinition }
_ -> Err [{ position: { row: 0, column: 0 }, message: "Unexpected import encountered" }]
Now we have to update the transformModule
function to handle top-level import
expressions. This isn’t going to be a straight-up copy-pasting like the
previous when
arms because we need to extract TransformedImport
into
separate types
and imports
sections in the module:
SExpression { position, name: "import", children: [] } ->
mergeError currentResult { position, message: "Invalid import: namespace, name, and definition required" }
SExpression { position, name: "import", children: expressions } ->
mergeResult
currentResult
(transformImport expressions)
(\module, { name, namespace, definition: Func func } ->
{ module &
types: List.append module.types { position, element: func },
imports: List.append module.imports { position, element: { name, namespace, definition: Func func.identifier } },
}
)
I have to admit, this code was pretty tricky to craft. When you return the
wrong type from a when
arm, Roc just throws up its arms and says “something’s
wrong in this when
arm: here are the types I saw and what I expected.” But
manually analyzing those types to see where it got confused (especially if
recursive types are involved) can be pretty difficult.
My solution is often to hand the code to Claude.ai and say “what’s wrong?” But frankly, if the AI can figure it out, the complier should have been able to give me better feedback in the first place. I’m sure it’s coming!
The test for this is bloody huge to set up, but it’s not really any different from anything else we’ve defined:
expect
# (module
# (import
# "wasi_snapshot_preview1"
# "fd_write"
# (func $fd_write (param i32 i32 i32 i32) (result i32))
# )
# )
result = transform
(
SExpression {
position: { column: 1, row: 1 },
name: "module",
state: Closed,
children: [
SExpression {
position: { column: 3, row: 2 },
name: "import",
state: Closed,
children: [
Term {
position: { row: 3, column: 4 },
term: String "wasi_snapshot_preview1",
},
Term {
position: { row: 3, column: 4 },
term: String "fd_write",
},
SExpression {
position: { column: 4, row: 5 },
name: "func",
state: Closed,
children: [
Term {
position: { row: 5, column: 10 },
term: Variable "fd_write",
},
SExpression {
position: { column: 20, row: 5 },
name: "param",
state: Closed,
children: [
Term {
position: { column: 27, row: 5 },
term: Name "i32",
},
Term {
position: { column: 31, row: 5 },
term: Name "i32",
},
Term {
position: { column: 35, row: 5 },
term: Name "i32",
},
Term {
position: { column: 39, row: 5 },
term: Name "i32",
},
],
},
SExpression {
position: { column: 44, row: 5 },
name: "result",
state: Closed,
children: [
Term {
position: { column: 51, row: 5 },
term: Name "i32",
},
],
},
],
},
],
},
],
}
)
result
==
Ok {
position: { column: 1, row: 1 },
element: {
datas: [],
exports: [],
funcs: [],
imports: [
{
position: { column: 3, row: 2 },
element: {
definition: Func "fd_write",
name: "fd_write",
namespace: "wasi_snapshot_preview1",
},
},
],
mems: [],
types: [
{
position: { column: 3, row: 2 },
element: {
identifier: "fd_write",
param: [
{
element: I32,
position: { column: 27, row: 5 },
},
{
element: I32,
position: { column: 31, row: 5 },
},
{
element: I32,
position: { column: 35, row: 5 },
},
{
element: I32,
position: { column: 39, row: 5 },
},
],
result: [
{
element: I32,
position: { column: 51, row: 5 },
},
],
},
},
],
},
}
This article has been rather code-heavy and, if you’ve been following the series, rather new-information-sparse. I apologize if it was boring to read… To be honest, it was pretty boring to write! We have one more type to transform in the next article, and then we’ll be ready to move on to code generation!