Build a Wasm Compiler in Roc - Part 12
In earlier articles, we have implemented a tokenizer, parser, and transformer to convert the Web Assembly Text Format to an Abstract Syntax Tree that can hopefully easily compile to Wasm.
Truthfully, the next step should be validation. Validation is the process of statically analyzing the syntax tree to catch as many errors as possible. This is where things like type checking and borrow checking happen, for example.
The wasm spec has an in-depth description of what validation should look like for a conforming compiler. But we are not building a Wasm-conforming compiler. We’re barely even building a Wasm-looking compiler! The truth is, static analysis gets its own university class and textbook, distinct from the complier classes and textbooks, so I think we can be forgiven for skipping it.
The next step is code generation. I find it weird that most tutorials on compiler design are technically transpilers, not compilers at all! The one I was originally using as a reference transpiles Lisp syntax to Javascript, for example.
On the one hand, targetting a syntax that is probably familiar to the reader
has the benefit of allowing the author to explain less of it. But on the other
hand, compilers that target bytecode or machine code have to use a different
suite of tools to investigate what is going on. Those tutorials miss out on
some of the complexities. There’s no conceptual difference between generating
a human readable string and generating a list of bytes, but if you need to
debug the latter, you need to know about tools like xxd
!
So when I set out to write this series, I decided I wanted to build a real compiler. I have many regrets already, so if this turns out to be a mistake, well, at least we tried!
The boilerplate
As usual, the first thing we need to do is create a new function that will be
called from our existing compile
function. This time, however, our new
generate
function will eventually yield the real bytes that the compile
function needs to return.
Start with a new Generator.roc
file. Import Transformer
so we can access
the Module
type. Make sure Generator.roc
exposes a generate
function
that accepts a Module
and returns a List U8
.
By this point, I expect you have seen enough Roc code to implement the above on your own, but if you need a reference, this worked for me:
module [
generate,
]
import Transformer
import Common
generate : Common.Positionable Transformer.Module -> Result (List U8) (List Common.Error)
generate = \module ->
Ok (Str.toUtf8 "TODO: Compile Input")
Now we can simplify the compile
function over in main.roc
to an elegant pipeline:
compile : Str -> Result (List U8) (List Common.Error)
compile = \input ->
input
|> Tokenizer.tokenize
|> Result.try Parser.parse
|> Result.try Transformer.transform
|> Result.try Generator.generate
I love me a good pipeline!
Spitting out an empty module
Let’s start simple. Add an expectation to generate
that looks like this:
expect
# (module)
result = generate {
position: { row: 1, column: 1 },
element: {
types: [],
funcs: [],
mems: [],
datas: [],
imports: [],
exports: [],
},
}
result == Ok ([0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00])
Those eight bytes are included at the beginning of every valid wasm
module,
and are the smallest possible wasm
module. If you put (module)
in a file
called main.wat
you can use these commands to investigate it:
❯ wasm-tools parse main.wat > main.wasm
❯ wasm-tools dump main.wasm
0x0 | 00 61 73 6d | version 1 (Module)
| 01 00 00 00
❯ xxd main.wasm
00000000: 0061 736d 0100 0000 .asm....
The first four bytes (0x00 0x61 0x73 0x6d) are the “magic string” to identify a
wasm file, and they never change. It is common to represent bytes in
hexadecimal because one byte will always fit in two hexadecimal digits (the
highest byte value is 255
in decimal, but ff
in hex.
Indeed, that’s what xxd
and wasm-tools dump
both output. The hexadecimal
0x61
is the same as the integer 97
, which is the ascii and utf-8 syntax for
the letter a
. The hex digits 0x73
and 0x6d
map to the letters s
and
m
, as you can see in the right side on the xxd output.
The second set of four bytes is the wasm version, and will always be 0x01 0x00 0x00 0x00 (a little endian 1) until the wasm group determines that a breaking change needs to be shipped. They intend for evolutions of the syntax to always be backwards compatible, but you never know when it comes to future code!
So, at least until that breaking change occurs, all wasm modules start with these eight bytes.
We can make our compiler output a valid wasm module by hard-coding those bytes
in the generate
function:
generate : Common.Positionable Transformer.Module -> Result (List U8) (List Common.Error)
generate = \module ->
result = Ok ([0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00])
result
So now our test passes! The rest of the generate
function will be
concatenating different sequences of bytes to that string, depending on the
structure of our input module
.
Generating a memory
Let’s now look at the bytecode for the following wat
file:
(module
(memory 1)
)
If we run wasm-tools dump main.wat
, we’ll see the expected wasm output when
building this file:
❯ wasm-tools dump main.wat
0x0 | 00 61 73 6d | version 1 (Module)
| 01 00 00 00
0x8 | 05 03 | memory section
0xa | 01 | 1 count
0xb | 00 01 | [memory 0] MemoryType { memory64: false, shared: false, initial: 1, maximum: None }
These are the bytes we need to generate for a memory that has a lower limit of one page of memory. To understand why we need to look at the Binary format for a wasm module.
Wasm modules are comprised of sections. Each section has a unique id that
prefixes the section. For memory
sections, the unique id is 5
(which is the
same in hex and decimal), so that first byte is 05
.
The byte 03
is tricky. It reflects the total number of bytes inside the
memory section. In this case, there are 3 bytes (01
, 00
, and 01
).
However, it’s easy to imagine that some sections will be more than 255
bytes, so how is it that the 03
is encoded in only one byte?
All integers in Wasm are encoded in something called LEB128, which stands for “Little Endian Base 128.” This is a variable-length encoding for integers that is primarily used by Wasm and little else. This is unfortunate for us, because it means there is no Roc library for parsing such integers.
Think of LEB128 as “like UTF-8, but for integers.” In UTF-8, the original ASCII characters are encoded in one byte. Other characters take two, three, or four bytes, depending on the codepoint. The idea is that more common letters could take fewer bytes to transmit than more unusual ones. This hasn’t played out so well in practice, however, because emoji characters tend to take more bytes to transmit, but are the most common communication alphabet for some people!
I’m pretty confident our little sample app will only need integers that fit in a single byte, so I’m tempted to pretend LEB128 == U8. But the Wikipedia article on LEB128 indicates it won’t be too much trouble, so we’ll take a look at it shortly.
Those three bytes inside the memory section, 01, 00, and 01 are the encoding of a “vector” of memories. In Wasm, a vector is always represented as the number of elements in the vector followed by the contents of the vector.
The number of elements in the vector is another LEB128-encoded integer, which
in this case is the single byte 01. The memory itself can be encoded as either
the byte 0x00 followed by a single LEB128 encoded integer (representing a
memory with a minimum size but no maximum size), or the byte 0x01
followed by
two integers. Since our parser can’t even handle upper limits on memories, it’s
always going to be 0x00 followed by a number. In the example above, the number
is 01
to reflect the 1
in (memory 1)
.
So we need some helper functions to encode numbers and vectors before we can output a memory.
Encoding Integers with LEB128
In our sample program, all integers are unsigned non-negative u32 elements that need to be encoded in unsigned LEB128 syntax. We’ll only need one function, with this signature:
generateU32: U32 -> List U8
generateU32 = \number ->
The pseudocode and example code on the Wikipedia page for LEB128 uses a loop, but of course we can’t do that because Roc doesn’t have loops. Instead, I think we can make a tail recursive function that collects the bytes as we go.
So generateU32
can be as simple as:
generateU32 : U32 -> List U8
generateU32 = \number ->
generateU32Recurse [] number
Now we need to make generateU32Recurse
, which is essentially the “body” of a
loop. First I defined a few unit tests:
expect
result = generateU32 624485
result == [0xE5, 0x8E, 0x26]
expect
result = generateU32 3
result == [0x03]
expect
result = generateU32 0
result == [0x00]
expect
result = generateU32 123456
result == [0xc0, 0xc4, 0x7]
The first one is the example from the Wikipedia page, the second and third ensures single and zero bytes are encoded correctly (since that’s probably all we’ll need), and the last I just threw together and compared to the output of the Javascript code from the wikipedia page.
The generateU32Recurse
function needs to do approximately the same thing as the
loop body of this Javascript snippet (pulled directly from Wikipedia):
const encodeSignedLeb128FromInt32 = (value) => {
value |= 0;
const result = [];
while (true) {
const byte_ = value & 0x7f;
value >>= 7;
console.log(byte_, value);
if (
(value === 0 && (byte_ & 0x40) === 0) ||
(value === -1 && (byte_ & 0x40) !== 0)
) {
result.push(byte_);
return result;
}
result.push(byte_ | 0x80);
}
};
Our implementation will be a little more verbose (but perhaps also a bit more readable) due to the lack of bitwise operators in Roc:
generateU32Recurse : List U8, U32 -> List U8
generateU32Recurse = \currentItems, remainingBits ->
leastSignificantByte = Num.bitwiseAnd remainingBits 0x7f
nextBits = Num.shiftRightBy remainingBits 7
if (nextBits == 0) && ((Num.bitwiseAnd leastSignificantByte 0x40) == 0) then
List.append currentItems (Num.toU8 leastSignificantByte)
else
generateU32Recurse (List.append currentItems (Num.toU8 (Num.bitwiseOr leastSignificantByte 0x80))) nextBits
First we combine 0x7f
with the existing number to strip off all but the
bottom 7 bits. This effectively extracts the leastSignificantByte
, which is
IHMO a more descriptive name than byte_
, (even though both of them are
inaccurate because it’s a 7-bit sequence, not an 8-bit byte!). The second line
is basically the opposite, using a right shift operator to drop the lowest
seven bits and store the rest.
The conditional checks whether or not this is the “last” byte to process. It is
only the last byte if there are no more bits AND the current least significant
byte does not have a 1
in its 7th bit. If that condition matches, we
losslessly (because we know it’s only 7 bits-worth) convert the number to a U8
and append it to the list; that is our final result. This is like the “return”
case in the Javascript loop.
If the condition doesn’t match, we make the tail recursive call by setting the
high bit to 1
with a bitwise Or (in LSB128, this indicates that there are
more bytes to come) and calling generateU32Recurse
with the new list and
remaining bits.
The tests pass, so we seem to have accomplished our goal. We’ll be using this function a lot as we encode integers in other types… starting with a vector.
Encoding Vectors with a parameterized type function
The wasm spec says, simply, “Vectors are encoded with their length followed by the encoding of their element sequence.”
This is true no matter what elements the vector contains, so it would be convenient
if we could construct a generateVector
function that can work with a variety
of elements.
For example, a vector of four integers (that might seem familiar) would encode like this:
expect
result = generateVector [624485, 3, 0, 123456] generateU32
result == [0x04,
0xE5, 0x8E, 0x26,
0x03,
0x00,
0xc0, 0xc4, 0x7]
The first entry is the number of integers in the list, and the remaining four
entries are the generateU32
encodings for those integers.
You may find it instructive to write the generateVector
function yourself,
but if you get stuck, here is my working version:
generateVector : List a, (a -> List U8) -> List U8
generateVector = \items, encodeOne ->
result = items |> List.len |> Num.toU32 |> generateU32
items
|> List.walk result \current, next ->
List.concat current (encodeOne next)
First, we prime the result by piping the input list into something that gets
the length. We convert the result to a U32 (lengths default to U64, but our
sample code doesn’t make any lists that wouldn’t fit in e.g. a U8) and then
pipes that out to our generateU32
function. That means that our output vector
now has the bytes representing one integer (the length of the list) in it.
We use this as the “primer” for the List.walk
function. The “Loop body” just
calls the encodeOne
function (whatever it does) on each result and
concatenates the resulting bytes onto the list.
Encoding a Mem
Now we need a function that encodes a Transformer.Mem
object.
I didn’t actually export
Mem
fromTransformer
, so go ahead and do that first.
This is the expectation we need to fulfill for transforming a single Mem
:
expect
result = generateMem { min: 1 }
result == [0x00, 0x01]
The 0x00
is hardcoded because our system only knows how to handle memories
with a lower limit. The 0x01
is the integer encoding of whatever minimum
was recorded in the Mem
. The implementation is simplicity itself:
generateMem : Transformer.Mem -> List U8
generateMem = \mem ->
List.concat [0x00] (generateU32 mem.min)
I have to admit, this part is going much better than any of the previous
steps. Part of that is because I get to ignore the positions
I so carefully
passed forward, but part of it is also that our outputs have been getting
increasingly complicated. Tokenizer converted bytes to a flat list of tokens.
Parser converted that flat list to a tree with nested nodes. Transformer
converted that tree to an even more complicated tree. But now we get a simpler
transformation; the output of all our functions is just an array of bytes.
Can’t get simpler than bytes!
Finally Encoding the mems
Now we get to combine all those little functions to support a “mems” section. Let’s start with an expectation:
expect
# (module
# (memory 1)
# )
result = generate {
position: { row: 1, column: 1 },
element: {
types: [],
funcs: [],
mems: [{ position: { row: 2, column: 4 }, element: { min: 1 } }],
datas: [],
imports: [],
exports: [],
},
}
result == Ok ([
0x00, 0x61, 0x73, 0x6d,
0x01, 0x00, 0x00, 0x00,
0x05, 0x03,
0x01, 0x00, 0x01
])
Roc unhelpfully formats the bytes into either a single row or single column, but if I could format it the way I wanted to it would look like this. The first two rows (eight bytes) contain the usual header, and the third row is the header of a “mem” section. As I mentioned a long time ago, a “mem” section starts with the byte 0x05 to identify the section, followed by the number of bytes in that section and then the full encoding of a vector (with 1 element, in this case), including the encoding of each mem (0x00, 0x01 in this case).
The only trick is that-if there are no mems at all-we don’t want to output
anything for that section, not even the 0x05
.
We can solve this with a function that uses a when
statement to check if there
is any content. Don’t forget (I did!) that you need to count the number of
bytes in the section:
generateMems = \mems ->
when mems is
[] -> []
_ ->
memorySection = generateVector mems generateMem
[0x05]
|> List.concat (memorySection |> List.len |> Num.toU32 |> generateU32)
|> List.concat memorySection
I suspect we could make a generic version of this function as well, since all the sections are going to have this structure. Maybe I’ll explore that in the next part.
I thought I was almost done, but I forgot that our modules are full of all that
Positionable
noise. Yeah, I know we should be using those… someday. For
now, I added a helper function to Common
as follows (remember to export it):
extractElement : Positionable a -> a
extractElement = \positionable -> positionable.element
Armed with this, I can do some nifty pipelining to generate the appropriate bytes:
generate : Common.Positionable Transformer.Module -> Result (List U8) (List Common.Error)
generate = \modulePositioned ->
module = modulePositioned |> Common.extractElement
Ok ([0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00])
|> Result.map \bytes ->
List.concat
bytes
(
module.mems
|> List.map Common.extractElement
|> generateMems
)
First we extract the common module from modulePositioned
.
modulePositioned.element
would have been easier to type, but the fact that it
is a separate function comes in handy when we want to extract the mems from a
list of Positionable
s.
We first construct an Ok
result with the header. We pass it to Result.map
and
only concat the memory section if it’s not an error (which is guaranteed right
now, but this pipeline has some growing to do…). Inside the map, we
concatenate the bytes to the output of calling generateMems
on the
module.mems
, but only after discarding the Positionable
portion.
And that’s the basics of generation! We have a lot of more sections to encode in subsequent parts, but I fear they’re going to be rather monotonous. Most of what you need to know was covered right here!