Build a Wasm Compiler in Roc

In earlier articles, we have implemented a tokenizer, parser, and transformer to convert the Web Assembly Text Format to an Abstract Syntax Tree that can hopefully easily compile to Wasm.

Truthfully, the next step should be validation. Validation is the process of statically analyzing the syntax tree to catch as many errors as possible. This is where things like type checking and borrow checking happen, for example.

The wasm spec has an in-depth description of what validation should look like for a conforming compiler. But we are not building a Wasm-conforming compiler. We’re barely even building a Wasm-looking compiler! The truth is, static analysis gets its own university class and textbook, distinct from the complier classes and textbooks, so I think we can be forgiven for skipping it.

The next step is code generation. I find it weird that most tutorials on compiler design are technically transpilers, not compilers at all! The one I was originally using as a reference transpiles Lisp syntax to Javascript, for example.

On the one hand, targetting a syntax that is probably familiar to the reader has the benefit of allowing the author to explain less of it. But on the other hand, compilers that target bytecode or machine code have to use a different suite of tools to investigate what is going on. Those tutorials miss out on some of the complexities. There’s no conceptual difference between generating a human readable string and generating a list of bytes, but if you need to debug the latter, you need to know about tools like xxd!

So when I set out to write this series, I decided I wanted to build a real compiler. I have many regrets already, so if this turns out to be a mistake, well, at least we tried!

The boilerplate

As usual, the first thing we need to do is create a new function that will be called from our existing compile function. This time, however, our new generate function will eventually yield the real bytes that the compile function needs to return.

Start with a new Generator.roc file. Import Transformer so we can access the Module type. Make sure Generator.roc exposes a generate function that accepts a Module and returns a List U8.

By this point, I expect you have seen enough Roc code to implement the above on your own, but if you need a reference, this worked for me:

module [
    generate,
]

import Transformer
import Common

generate : Common.Positionable Transformer.Module -> Result (List U8) (List Common.Error)
generate = \module ->
    Ok (Str.toUtf8 "TODO: Compile Input")

Now we can simplify the compile function over in main.roc to an elegant pipeline:

compile : Str -> Result (List U8) (List Common.Error)
compile = \input ->
    input
    |> Tokenizer.tokenize
    |> Result.try Parser.parse
    |> Result.try Transformer.transform
    |> Result.try Generator.generate

I love me a good pipeline!

Spitting out an empty module

Let’s start simple. Add an expectation to generate that looks like this:

expect
    # (module)
    result = generate {
        position: { row: 1, column: 1 },
        element: {
            types: [],
            funcs: [],
            mems: [],
            datas: [],
            imports: [],
            exports: [],
        },
    }

    result == Ok ([0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00])

Those eight bytes are included at the beginning of every valid wasm module, and are the smallest possible wasm module. If you put (module) in a file called main.wat you can use these commands to investigate it:

❯ wasm-tools parse main.wat > main.wasm

❯ wasm-tools dump main.wasm
 0x0 | 00 61 73 6d | version 1 (Module)
     | 01 00 00 00

❯ xxd main.wasm
00000000: 0061 736d 0100 0000                      .asm....

The first four bytes (0x00 0x61 0x73 0x6d) are the “magic string” to identify a wasm file, and they never change. It is common to represent bytes in hexadecimal because one byte will always fit in two hexadecimal digits (the highest byte value is 255 in decimal, but ff in hex.

Indeed, that’s what xxd and wasm-tools dump both output. The hexadecimal 0x61 is the same as the integer 97, which is the ascii and utf-8 syntax for the letter a. The hex digits 0x73 and 0x6d map to the letters s and m, as you can see in the right side on the xxd output.

The second set of four bytes is the wasm version, and will always be 0x01 0x00 0x00 0x00 (a little endian 1) until the wasm group determines that a breaking change needs to be shipped. They intend for evolutions of the syntax to always be backwards compatible, but you never know when it comes to future code!

So, at least until that breaking change occurs, all wasm modules start with these eight bytes.

We can make our compiler output a valid wasm module by hard-coding those bytes in the generate function:

generate : Common.Positionable Transformer.Module -> Result (List U8) (List Common.Error)
generate = \module ->
    result = Ok ([0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00])

    result

So now our test passes! The rest of the generate function will be concatenating different sequences of bytes to that string, depending on the structure of our input module.

Generating a memory

Let’s now look at the bytecode for the following wat file:

(module
  (memory 1)
)

If we run wasm-tools dump main.wat, we’ll see the expected wasm output when building this file:

❯ wasm-tools dump main.wat
 0x0 | 00 61 73 6d | version 1 (Module)
     | 01 00 00 00
 0x8 | 05 03       | memory section
 0xa | 01          | 1 count
 0xb | 00 01       | [memory 0] MemoryType { memory64: false, shared: false, initial: 1, maximum: None }

These are the bytes we need to generate for a memory that has a lower limit of one page of memory. To understand why we need to look at the Binary format for a wasm module.

Wasm modules are comprised of sections. Each section has a unique id that prefixes the section. For memory sections, the unique id is 5 (which is the same in hex and decimal), so that first byte is 05.

The byte 03 is tricky. It reflects the total number of bytes inside the memory section. In this case, there are 3 bytes (01, 00, and 01). However, it’s easy to imagine that some sections will be more than 255 bytes, so how is it that the 03 is encoded in only one byte?

All integers in Wasm are encoded in something called LEB128, which stands for “Little Endian Base 128.” This is a variable-length encoding for integers that is primarily used by Wasm and little else. This is unfortunate for us, because it means there is no Roc library for parsing such integers.

Think of LEB128 as “like UTF-8, but for integers.” In UTF-8, the original ASCII characters are encoded in one byte. Other characters take two, three, or four bytes, depending on the codepoint. The idea is that more common letters could take fewer bytes to transmit than more unusual ones. This hasn’t played out so well in practice, however, because emoji characters tend to take more bytes to transmit, but are the most common communication alphabet for some people!

I’m pretty confident our little sample app will only need integers that fit in a single byte, so I’m tempted to pretend LEB128 == U8. But the Wikipedia article on LEB128 indicates it won’t be too much trouble, so we’ll take a look at it shortly.

Those three bytes inside the memory section, 01, 00, and 01 are the encoding of a “vector” of memories. In Wasm, a vector is always represented as the number of elements in the vector followed by the contents of the vector.

The number of elements in the vector is another LEB128-encoded integer, which in this case is the single byte 01. The memory itself can be encoded as either the byte 0x00 followed by a single LEB128 encoded integer (representing a memory with a minimum size but no maximum size), or the byte 0x01 followed by two integers. Since our parser can’t even handle upper limits on memories, it’s always going to be 0x00 followed by a number. In the example above, the number is 01 to reflect the 1 in (memory 1).

So we need some helper functions to encode numbers and vectors before we can output a memory.

Encoding Integers with LEB128

In our sample program, all integers are unsigned non-negative u32 elements that need to be encoded in unsigned LEB128 syntax. We’ll only need one function, with this signature:

generateU32: U32 -> List U8
generateU32 = \number ->

The pseudocode and example code on the Wikipedia page for LEB128 uses a loop, but of course we can’t do that because Roc doesn’t have loops. Instead, I think we can make a tail recursive function that collects the bytes as we go.

So generateU32 can be as simple as:

generateU32 : U32 -> List U8
generateU32 = \number ->
    generateU32Recurse [] number

Now we need to make generateU32Recurse, which is essentially the “body” of a loop. First I defined a few unit tests:

expect
    result = generateU32 624485
    result == [0xE5, 0x8E, 0x26]

expect
    result = generateU32 3
    result == [0x03]

expect
    result = generateU32 0
    result == [0x00]

expect
    result = generateU32 123456
    result == [0xc0, 0xc4, 0x7]

The first one is the example from the Wikipedia page, the second and third ensures single and zero bytes are encoded correctly (since that’s probably all we’ll need), and the last I just threw together and compared to the output of the Javascript code from the wikipedia page.

The generateU32Recurse function needs to do approximately the same thing as the loop body of this Javascript snippet (pulled directly from Wikipedia):

const encodeSignedLeb128FromInt32 = (value) => {
  value |= 0;
  const result = [];
  while (true) {
    const byte_ = value & 0x7f;
    value >>= 7;
    console.log(byte_, value);
    if (
      (value === 0 && (byte_ & 0x40) === 0) ||
      (value === -1 && (byte_ & 0x40) !== 0)
    ) {
      result.push(byte_);
      return result;
    }
    result.push(byte_ | 0x80);
  }
};

Our implementation will be a little more verbose (but perhaps also a bit more readable) due to the lack of bitwise operators in Roc:

generateU32Recurse : List U8, U32 -> List U8
generateU32Recurse = \currentItems, remainingBits ->
    leastSignificantByte = Num.bitwiseAnd remainingBits 0x7f
    nextBits = Num.shiftRightBy remainingBits 7

    if (nextBits == 0) && ((Num.bitwiseAnd leastSignificantByte 0x40) == 0) then
        List.append currentItems (Num.toU8 leastSignificantByte)
    else
        generateU32Recurse (List.append currentItems (Num.toU8 (Num.bitwiseOr leastSignificantByte 0x80))) nextBits

First we combine 0x7f with the existing number to strip off all but the bottom 7 bits. This effectively extracts the leastSignificantByte, which is IHMO a more descriptive name than byte_, (even though both of them are inaccurate because it’s a 7-bit sequence, not an 8-bit byte!). The second line is basically the opposite, using a right shift operator to drop the lowest seven bits and store the rest.

The conditional checks whether or not this is the “last” byte to process. It is only the last byte if there are no more bits AND the current least significant byte does not have a 1 in its 7th bit. If that condition matches, we losslessly (because we know it’s only 7 bits-worth) convert the number to a U8 and append it to the list; that is our final result. This is like the “return” case in the Javascript loop.

If the condition doesn’t match, we make the tail recursive call by setting the high bit to 1 with a bitwise Or (in LSB128, this indicates that there are more bytes to come) and calling generateU32Recurse with the new list and remaining bits.

The tests pass, so we seem to have accomplished our goal. We’ll be using this function a lot as we encode integers in other types… starting with a vector.

Encoding Vectors with a parameterized type function

The wasm spec says, simply, “Vectors are encoded with their length followed by the encoding of their element sequence.”

This is true no matter what elements the vector contains, so it would be convenient if we could construct a generateVector function that can work with a variety of elements.

For example, a vector of four integers (that might seem familiar) would encode like this:

expect
    result = generateVector [624485, 3, 0, 123456] generateU32
    result == [0x04,
        0xE5, 0x8E, 0x26,
        0x03,
        0x00,
        0xc0, 0xc4, 0x7]

The first entry is the number of integers in the list, and the remaining four entries are the generateU32 encodings for those integers.

You may find it instructive to write the generateVector function yourself, but if you get stuck, here is my working version:

generateVector : List a, (a -> List U8) -> List U8
generateVector = \items, encodeOne ->
    result = items |> List.len |> Num.toU32 |> generateU32
    items
    |> List.walk result \current, next ->
        List.concat current (encodeOne next)

First, we prime the result by piping the input list into something that gets the length. We convert the result to a U32 (lengths default to U64, but our sample code doesn’t make any lists that wouldn’t fit in e.g. a U8) and then pipes that out to our generateU32 function. That means that our output vector now has the bytes representing one integer (the length of the list) in it.

We use this as the “primer” for the List.walk function. The “Loop body” just calls the encodeOne function (whatever it does) on each result and concatenates the resulting bytes onto the list.

Encoding a Mem

Now we need a function that encodes a Transformer.Mem object.

I didn’t actually export Mem from Transformer, so go ahead and do that first.

This is the expectation we need to fulfill for transforming a single Mem:

expect
    result = generateMem { min: 1 }
    result == [0x00, 0x01]

The 0x00 is hardcoded because our system only knows how to handle memories with a lower limit. The 0x01 is the integer encoding of whatever minimum was recorded in the Mem. The implementation is simplicity itself:

generateMem : Transformer.Mem -> List U8
generateMem = \mem ->
    List.concat [0x00] (generateU32 mem.min)

I have to admit, this part is going much better than any of the previous steps. Part of that is because I get to ignore the positions I so carefully passed forward, but part of it is also that our outputs have been getting increasingly complicated. Tokenizer converted bytes to a flat list of tokens. Parser converted that flat list to a tree with nested nodes. Transformer converted that tree to an even more complicated tree. But now we get a simpler transformation; the output of all our functions is just an array of bytes. Can’t get simpler than bytes!

Finally Encoding the mems

Now we get to combine all those little functions to support a “mems” section. Let’s start with an expectation:

expect
    # (module
    #    (memory 1)
    # )
    result = generate {
        position: { row: 1, column: 1 },
        element: {
            types: [],
            funcs: [],
            mems: [{ position: { row: 2, column: 4 }, element: { min: 1 } }],
            datas: [],
            imports: [],
            exports: [],
        },
    }

    result == Ok ([
        0x00, 0x61, 0x73, 0x6d,
        0x01, 0x00, 0x00, 0x00,
        0x05, 0x03,
        0x01, 0x00, 0x01
        ])

Roc unhelpfully formats the bytes into either a single row or single column, but if I could format it the way I wanted to it would look like this. The first two rows (eight bytes) contain the usual header, and the third row is the header of a “mem” section. As I mentioned a long time ago, a “mem” section starts with the byte 0x05 to identify the section, followed by the number of bytes in that section and then the full encoding of a vector (with 1 element, in this case), including the encoding of each mem (0x00, 0x01 in this case).

The only trick is that-if there are no mems at all-we don’t want to output anything for that section, not even the 0x05.

We can solve this with a function that uses a when statement to check if there is any content. Don’t forget (I did!) that you need to count the number of bytes in the section:

generateMems = \mems ->
    when mems is
        [] -> []
        _ ->
            memorySection = generateVector mems generateMem

            [0x05]
            |> List.concat (memorySection |> List.len |> Num.toU32 |> generateU32)
            |> List.concat memorySection

I suspect we could make a generic version of this function as well, since all the sections are going to have this structure. Maybe I’ll explore that in the next part.

I thought I was almost done, but I forgot that our modules are full of all that Positionable noise. Yeah, I know we should be using those… someday. For now, I added a helper function to Common as follows (remember to export it):

extractElement : Positionable a -> a
extractElement = \positionable -> positionable.element

Armed with this, I can do some nifty pipelining to generate the appropriate bytes:

generate : Common.Positionable Transformer.Module -> Result (List U8) (List Common.Error)
generate = \modulePositioned ->
    module = modulePositioned |> Common.extractElement

    Ok ([0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00])
    |> Result.map \bytes ->
        List.concat
            bytes
            (
                module.mems
                |> List.map Common.extractElement
                |> generateMems
            )

First we extract the common module from modulePositioned. modulePositioned.element would have been easier to type, but the fact that it is a separate function comes in handy when we want to extract the mems from a list of Positionables.

We first construct an Ok result with the header. We pass it to Result.map and only concat the memory section if it’s not an error (which is guaranteed right now, but this pipeline has some growing to do…). Inside the map, we concatenate the bytes to the output of calling generateMems on the module.mems, but only after discarding the Positionable portion.

And that’s the basics of generation! We have a lot of more sections to encode in subsequent parts, but I fear they’re going to be rather monotonous. Most of what you need to know was covered right here!

Build a Wasm Compiler in Roc - Part 12