Build a Wasm Compiler in Roc - Part 8
In earlier articles, we implemented a tokenizer for the Wasm text syntax (WAT) and started on a parser to convert those tokes to a S-expression AST.
In this part, we’ll start to create a transformer to convert that AST to a new one that better matches the WASM output we will be crafting. Don’t ask me how many parts that’s going to take!
Reminder: You are reading content that took a great deal of effort to craft, compose, and debug. If you appreciate this work, consider supporting me on Patreon or GitHub.
Transforming the AST
The transformation step is optional in some compilers, depending on how well the input maps to output. As one example, a “formatter” (which is actually a transpiler where the source and target language are the same) typically reads the AST from existing source codes, and outputs (potentially modified) source code from the same AST. Interpreters are sometimes also able to work directly from the AST of the input language.
In the case of WAT to Wasm, my assumption going in was that since the two are textual and binary descriptions of the exact same bytecode, the AST wouldn’t need transformation. I couldn’t have been more wrong! The S-expression syntax is its own language, and while it is easy to parse (the parser only took TWO posts!), it does not create an AST that resembles Wasm. It does, however, create an AST that is easy to transform into one that resembles Wasm.
What does a Wasm AST look like?
The Wasm Abstract Syntax Tree is so flat that it barely qualifies for the title of “tree”. There is a big fat module node at the root of the tree, and a bunch of flat lists of other kinds of nodes. Some of those other nodes have a more traditional tree-like structure, though.
This flatness is partially because Wasm is meant to be quickly and efficiently read by a computer and transformed directly to machine code. Machine code is just a flat list of instructions after all. There’s nothing “abstract” about it!
Looking at the Wasm specification for a module, there are potentially nine different sections in a module. However we’ll only need to model a subset of them (in bold) to make our “hello world” example work:
- types
- imports
- funcs
- mems
- globals
- exports
- start
- elems
- datas
Each section is represented as a flat list, and whenever an element in one
section needs to reference an element in another section, it uses the index of
that element in the other list. Variables, which start with a $
are
essentially just names for various indices.
The elements of each of these lists each have their own structure. For some of
them (e.g. functions with a bunch of instructions in them), that structure can
be pretty complicated, but others are fairly simple. For example, if you click
a tangled web of links in the spec, you’ll discover that the mems
section
contains a list of mem
types, and each mem
has a type that is a limit
,
and a limit is either one or two integers representing the minimum and maximum
size of the memory (in units of page size).
In WAT syntax, this means that (memory 1)
has a limit with a minimum size but
no maximum, and means “create a memory with at least one page of data in it”.
Typically, each module only has one memory at the implicit 0
index in the
list, but future versions of Wasm willsupport multiple memories.
After transformation, the AST for a memory would probably be a tg named
Memory
with a payload of type Limit
where a Limit
is a record that holds
a minimum and an optional maximum.
The chief difference between the two ASTs is that any one S-Expression can
contain any arbitrary collection of terms and nested S-Expressions, but a
memory
node MUST contain a limit. For example the following valid S-Expression
is not a valid Wasm node:
(memory (i32.const 8) "hello world")
(In fact this is a valid data
node that I changed the name to memory
). If
I pass this S-expression into wasm-tools parse
, it throws an error.
So the goal of this transformation step is to take our input S-expressions, confirm that they represent legitimate Wasm AST nodes, and build that AST. The AST we want to build is strongly typed and the child expressions in each S-expression must conform to that type.
Hopefully that’s enough background for us to get started and you won’t have to spend as much time puzzling out the formal specification as I did!
Boilerplate
Our parser is complete, so let’s create a new Transformer.roc
file with
a dummy function in it for transforming an AST:
module [
transform,
NodeTag,
WasmNode,
]
import Common
import Parser
NodeTag : [Module {}]
WasmNode : {
position : Common.Position,
node : NodeTag,
}
transform : Parser.Expression -> Result WasmNode (List Common.Error)
transform = \expressions -> Ok {
position: { row: 1, column: 1 },
node: Module ({}),
}
The types are kind of just a wild guess at what we might expect to return from
this function for now, and transform
is hard-coded with a quasi-legit guess
at the expected node.
Back in main.roc
, we’ll need to add an import for Transformer
and pipe in the
output of parse
in our compile
function:
compile : Str -> Result (List U8) (List Common.Error)
compile = \input ->
input
|> Tokenizer.tokenize
|> Result.try Parser.parse
|> Result.try Transformer.transform
|> Result.map \ast ->
# dbg expression
Str.toUtf8 "TODO: Compile Input"
This is identical to the previous version except for the one new |> Result.try Transformer.transform
line.
Let’s get a test going for an empty module:
expect
# (module)
result = transform {
position: { row: 1, column: 1 },
expression: SExpression {
name: "module",
children: [],
state: Closed,
},
}
result
== Ok {
position: { row: 1, column: 1 },
node: Module ({}),
}
This test passes right now because of the hard coding in transform
. Let’s add
a couple more tests for some potentially broken inputs:
expect
# (memory)
result = transform {
position: { row: 1, column: 1 },
expression: SExpression {
name: "memory",
children: [],
state: Closed,
},
}
result
== Err [
{
message: "Expected a 'module', but received a 'memory'",
position: { column: 1, row: 1 },
},
]
expect
# 42
# (technically, parser would never emit this)
result = transform {
position: { row: 1, column: 1 },
expression: Term (Number 42),
}
result
== Err [
{
message: "Expected a S-expression, but received a Term",
position: { column: 1, row: 1 },
},
]
These obviously fail with the hardcoded transform body, so let’s start actually
implementing that function. Throw a when
expression into that function:
transform : Parser.Expression -> Result WasmNode (List Common.Error)
transform = \expression ->
when expression is
{ position, expression: SExpression { name: "module", children } } ->
Ok {
position,
node: Module ({}),
}
{ position, expression: SExpression { name } } ->
Err [
{
position,
message: "Expected a 'module', but received a '$(name)'",
},
]
{ position, expression: Term _ } ->
Err [
{
position,
message: "Expected a S-expression, but received a Term",
},
]
If we receive a module with any children (any children at all), we return an Ok
,
but if the input is a non-module S-Expression or a Term, we return an appropriate
Err
.
Now we just need to figure out what to do with the children in the success case. But before we do that, we should probably think about the AST data structure.
Designing an AST
In a non-empty module, we can expect one or more child expressions inside the
module expression, and each child might represent things like data or a
memory or a function. Each of these will ultimately be collected in lists
for each “type” of element. So our Module
record needs to have several fields
with lists of other types as their input. Let’s start with some empty data types:
Func : {}
Mem : {}
Data : {}
Import : {}
Export : {}
Module : {
funcs : List Func,
mems : List Mem,
datas : List Data,
imports : List Import,
exports : List Export,
}
Unlike with S-Expressions, the set of children that each type can have is strictly defined. A Wasm module can have 10 types of children; our hello world will require around half of those.
A problem with this model is that we lose positions of all the AST nodes.
That’s why I originally defined WasmNode
to have a position. But that isn’t
looking like an appropriate design now because we don’t want just any Wasm
node to be a child of another Node. For example, it would be suboptimal if
Module.funcs
was a list of WasmNode
records, because we only want Func
nodes to be in there.
One solution would be to put a position
on every type, something like this:
Func : {
position: Common.Position,
... other func fields
}
Mem : {
position: Common.Position,
... other mem fields
}
The problem with this is “what if we forget”. This is a good time to introduce the concept of “type parameters”, (typically referred to as “generic types” in other languages).
Let’s try to add a new Positionable
type to our Common
library:
Positionable: {
position: Position
element: ????
}
The question is, what is the type of the element
field? It’s not a tag union
because for any one place we use a Positionable
, it can only be one type, but
we want that to be a different type in different places.
This is similar to how a List
works. Any one list can contain only one type
of element, but two different lists can each contain different types of
elements.
The trick is to add a word to the Positionable
definition to reflect “some
instance-specific type”, and then we use that word as a stand-in for the type
of the element
field:
Positionable elem : {
position : Position,
element : elem,
}
The “stand-in” can be any value (think of it like a variable, but for types), but it must be lowercase so Roc knows it needs to treat it like a type parameter instead of trying to find a real type somewhere.
Now when we want to use Positionable
, we have to specify the type we want it
to have. You’ve already seen this in places where we have specified a type as
List Str
. Different lists can contain different elements, but a List Str
can only contain strings.
We are already importing Common
in the Transformer
, so we can say that the
fields of Module
must contain things like List (Common.Positionable Func)
:
Func : {}
Mem : {}
Data : {}
Import : {}
Export : {}
Module : {
funcs : List (Common.Positionable Func),
mems : List (Common.Positionable Mem),
datas : List (Common.Positionable Data),
imports : List (Common.Positionable Import),
exports : List (Common.Positionable Export),
}
The parentheses are required so that Roc understands that the type parameter
Common.Positionable
is e.g. a Func
, and that the type parameter for List
is a Common.Positionable Func
. If we left it as List Common.Positionable Func
, Roc would assume we were passing two type parameters to List
, and it
would complain because List
only expects one type parameter.
Let’s also create an initial Module
that has all those fields set to an empty
list:
emptyModule = {
funcs: [],
mems: [],
datas: [],
imports: [],
exports: [],
}
It is relatively common in immutable languages to have an initial empty object like
this instead of a constructor or new
function. In fact, those five empty
lists are all pointing to the exact same empty list object! List.append
always returns a new list, and if we call List.append
on one of the fields
in emptyModule
, the result would be a new module.
Note that I’ve removed WasmNode
and NodeTag
now. I changed the transform
function signature to always return a Positionable Module
instead of the
WasmNode
we have deleted:
transform : Parser.Expression -> Result (Common.Positionable Module) (List Common.Error)
transform = \expression ->
when expression is
{ position, expression: SExpression { name: "module", children } } ->
Ok {
position,
element: emptyModule,
}
# error arms snipped
My code is still failing to compile because the expectation for the success
case is currently returning an object with a node
field instead of an element
field. We can reuse the emptyModule
here as well:
expect
# (module)
result = transform {
position: { row: 1, column: 1 },
expression: SExpression {
name: "module",
children: [],
state: Closed,
},
}
result
== Ok {
position: { row: 1, column: 1 },
element: emptyModule,
}
An exercise for the reader
I’m not going to do it right now, but now that Positionable
exists, there
are quite a few other types that we could replace with it:
Common.Error
could be aPositionable Str
.Parser.Expression
might be able to have theSExpression
andTerm
in its tag union changed toPositionable
.Tokenizer.Token
could become aPositionable TokenTag
Tokenizer.CurrentToken
could be aPositionable [None, Name (List U8), Number (List U8), String (List U8), Variable (List U8)]
Most of the work would be changing e.g. message
to element
(for errors) and
token
to element
(for tokens).
I can’t think of any off the top of my head, but it might be worth looking for
some reusable functions that can be used with Positionable
elements.
The next step is to actually do something with the children
in the when
expression in transform
. The compiler is kindly telling me this field is not
used in the branch, so I won’t forget that there is more work to be done.
I’m going to leave that for the next article, though. See you then!