Build a Wasm Compiler in Roc

In earlier articles, we have implemented a tokenizer, parser, and transformer to convert WAT syntax to a Wasm AST and got started on the code generation.

This part will continue with code generation. We’ll start with the import section because it has a couple interesting complications that we need to take into account.

The Import Section

Consider this wat import statement:

(module
    (import "wasi_snapshot_preview1" "fd_write" (
        func $fd_write (param i32 i32 i32 i32) (result i32))
    )
)

If we look at the bytes in the binary representation of this section, it’s surprisingly long:

❯ wasm-tools dump main.wat
  0x0 | 00 61 73 6d | version 1 (Module)
      | 01 00 00 00
  0x8 | 01 09       | type section
  0xa | 01          | 1 count
--- rec group 0 (implicit) ---
  0xb | 60 04 7f 7f | [type 0] SubType { is_final: true, supertype_idx: None, composite_type: Func(FuncType { params: [I32, I32, I32, I32], results: [I32] }) }
      | 7f 7f 01 7f
 0x13 | 02 23       | import section
 0x15 | 01          | 1 count
 0x16 | 16 77 61 73 | import [func 0] Import { module: "wasi_snapshot_preview1", name: "fd_write", ty: Func(0) }
      | 69 5f 73 6e
      | 61 70 73 68
      | 6f 74 5f 70
      | 72 65 76 69
      | 65 77 31 08
      | 66 64 5f 77
      | 72 69 74 65
      | 00 00
 0x38 | 00 12       | custom section
 0x3a | 04 6e 61 6d | name: "name"
      | 65
 0x3f | 01 0b       | function name section
 0x41 | 01          | 1 count
 0x42 | 00 08 66 64 | Naming { index: 0, name: "fd_write" }
      | 5f 77 72 69
      | 74 65

The first eight bytes describe the wasm module as usual, and then we have 11 bytes describing the type section, which we already covered in Part 12.

Then we see the import section itself, which we’ll discuss shortly.

But after the import section, we see a custom section. Custom sections have the section id 00, but unlike other sections, they can go between any other section and there can be more than one of them. Custom sections have a name followed by any arbitrary bytes. Custom sections are completely ignored by the WAT runtime, but they can be used to provide additional information to tooling such as debuggers and LSPs.

The Wasm standard largely ignores custom sections, with the exception of a name section. If the custom section has the name “name”, then it is expected that the bytes in that section conform to the description in the Custom Section Appendix.

The name section is intended to attach names to identifiers such as functions. The functions are stored in a flat list and Wasm looks them up using integer indices into that list. However, if you want to e.g. run a debugger or convert Wasm code back to WAT code, it helps to know what the original human readable identifiers were for those names.

Since this is a for-fun project that nobody in their right mind will use for debugging, we’ll skip the custom name section, even though standard Wasm tools generate it.

That means we only need to generate the import section, which is quite long because it contains some strings:

 0x13 | 02 23       | import section
 0x15 | 01          | 1 count
 0x16 | 16 77 61 73 | import [func 0] Import { module: "wasi_snapshot_preview1", name: "fd_write", ty: Func(0) }
      | 69 5f 73 6e
      | 61 70 73 68
      | 6f 74 5f 70
      | 72 65 76 69
      | 65 77 31 08
      | 66 64 5f 77
      | 72 69 74 65
      | 00 00

Because those bytes are kind of opaque, it may help to see which of them translate directly to characters by using xxd:

00000000: 00 61 73 6d 01 00 00 00  .asm....
00000008: 01 09 01 60 04 7f 7f 7f  ...`....
00000010: 7f 01 7f 02 23 01 16 77  ....#..w
00000018: 61 73 69 5f 73 6e 61 70  asi_snap
00000020: 73 68 6f 74 5f 70 72 65  shot_pre
00000028: 76 69 65 77 31 08 66 64  view1.fd
00000030: 5f 77 72 69 74 65 00 00  _write..

The first byte is 0x02, which is the id for the import section. The second byte is 0x23, which is the hex interpretation of the decimal 35: there are 35 bytes in the import section.

Then we have a vector containing all the imports; the first byte in the vector is the integer 01 reflecting the length of the vectior. There is only one import in this example.

The 0x16 is the start of another vector and represents the fact that there are 22 bytes (hex 16 = decimal 22) in the “namespace” string, which is “wasi_snapshot_preview1”. The bytes for that string follow, ending with 0x31.

Next we have the byte 08, which is the length of the string, “fd_write”, followed by the bytes of that string, which end with 0x65.

Finally, we have two 00 bytes. The first of these is a flag that can be one of 00, 01, 02, or 03, depending on what kind of item we are importing. Since our sample app only supports importing functions, we will only need 00, for functions.

The final 00 is the LEB128 index in the types table of the function type that is being imported. This may cause us a bit of trouble because it is not self contained. Given just the module.imports, we don’t know what index the function type is in the table. We need module.types as well.

Let’s try to solve that first. Our concatSection function accepts a function that accepts an arbitrary object and returns a List U8. One option we have is to make the “arbitrary object” be something other than module.imports. For example, it could be a tuple of (module.imports, module.types). But that feels clunky. Let’s see if we can be more clever.

To solve this, we can create a function that accepts the module.types and converts it to a dictionary for easy lookup. The interesting part will be that the return value of module types will be a function (so it’s a function that returns a function). Specifically, it will be a function that is able to satisfy the generateOne contract when generating imports.

generateImportFactory : List (Common.Positionable Transformer.FuncType) -> (Transformer.Import -> List U8)
generateImportFactory = \funcTypes ->
    typeDict =
        funcTypes
        |> List.map Common.extractElement
        |> List.mapWithIndex \funcType, index -> (funcType.identifier, index)
        |> Dict.fromList

    \importDef ->
        when importDef.definition is
            Func identifier if Dict.contains typeDict identifier ->
                # TODO: Format import here
                []

            Func identifier ->
                crash "Unexpected identifier shouldn't get past AST generation"

            _ ->
                crash "Haven't implemented non func imports"

I called my function generateImportFactory, which I’m pretty sure is not an idiomatic functional language name, and it very much gives me Java flashbacks. The point is, this function accepts whatever the module’s types are and converts those types to a “name to index” dictionary lookup. Then it returns a function that needs to do the actual generateImport logic. For now it’s an empty list.

Now that it exists, hooking this factory up in generate is pretty easy:

generate : Common.Positionable Transformer.Module -> List U8
generate = \modulePositioned ->
    module = modulePositioned |> Common.extractElement

    [0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00]
    |> concatSection 0x01 module.types generateFuncType
    |> concatSection 0x02 module.imports (generateImportFactory module.types)
    |> concatSection 0x05 module.mems generateMem
    |> concatSection 0x0b module.datas generateData

We need to call the new function with module.types and pass the resulting function to concatSection.

Before we implement the actual import generation, let’s create a fragile and annoying to debug unit test:

expect
    # (module
    #     (import "wasi_snapshot_preview1" "fd_write" (
    #         func $fd_write (param i32 i32 i32 i32) (result i32))
    #     )
    # )
    result = generate {
        position: { row: 1, column: 1 },
        element: {
            types: [
                {
                    position: { row: 3, column: 48 },
                    element: {
                        identifier: "fd_write",
                        param: [
                            { element: I32, position: { row: 3, column: 30 } },
                            { element: I32, position: { row: 3, column: 34 } },
                            { element: I32, position: { row: 3, column: 38 } },
                            { element: I32, position: { row: 3, column: 42 } },
                        ],
                        result: [{ element: I32, position: { row: 3, column: 55 } }],
                    },
                },
            ],
            funcs: [],
            mems: [],
            datas: [],
            imports: [
                {
                    position: { row: 2, column: 4 },
                    element: {
                        namespace: "wasi_snapshot_preview1",
                        name: "fd_write",
                        definition: Func "fd_write",
                    },
                },
            ],
            exports: [],
        },
    }

    result
    == [
        0x00, 0x61, 0x73, 0x6d,
        0x01, 0x00, 0x00, 0x00,
        0x01, 0x09, # type section 1
        0x01, 0x60, 0x04, 0x7f, 0x7f,
        0x7f, 0x7f, 0x01, 0x7f,
        0x02, 0x23, # import section 2
        0x01, # One import
        0x16, # namespace length
        0x77, 0x61, 0x73, 0x69, 0x5f, 0x73,
        0x6e, 0x61, 0x70, 0x73, 0x68, 0x6f,
        0x74, 0x5f, 0x70, 0x72, 0x65, 0x76,
        0x69, 0x65, 0x77, 0x31,
        0x08, # name length
        0x66, 0x64, 0x5f, 0x77, 0x72, 0x69,
        0x74, 0x65,
        0x00, # Func type
        0x00, # function id looked up in the types table.
    ]

Note: This is even more annoying in a real Roc file where the auto formatter messes up the comments.

Now we just need to replace the TODO in generateImportFactory with a pretty little pipeline that concatenates vector representations of the namespace and name to the func identifier 0x00. Finally, we append the index of the type by looking it up in the typedDict we generated at the top of the outer function.

    \importDef ->
        when importDef.definition is
            Func identifier if Dict.contains typeDict identifier ->
                []
                |> List.concat (generateVector (Str.toUtf8 importDef.namespace) \b -> [b])
                |> List.concat (generateVector (Str.toUtf8 importDef.name) \b -> [b])
                |> List.append 0x00
                |> List.concat (generateU32 ((Dict.get typeDict identifier) |> Result.withDefault 999999 |> Num.toU32))

The test passes now, so I’m going to claim imports are working! Only two more sections to go! Unfortunately, I have just determined that I messed up on the Transformer a bit, so we’ll need to go back and solve that.

Fixing function types in transformer

When I wrote the transformer for functions, I neglected to add the function to the types table in the AST. This is partially because the WAT syntax I was using doesn’t mention the types, and the sample we are coding to is a function with zero parameters or return values.

I don’t want to take the time to handle creating functions with parameters, so I’m going to resort to a horrible hack.

Our transformFunc type currently only works on functions that don’t have parameters or responses. Here is the current implementation:

transformFunc : Transformer Func
transformFunc = \children ->
    when children is
        [Term { term: Variable identifier }, .. as funcInstructions] ->
            funcInstructions
            |> transformInstructions
            |> Result.map \instructions -> {
                identifier,
                instructions,
            }

        _ -> Err [{ position: { row: 0, column: 0 }, message: "Unexpected Instruction" }]

The single possible success arm assumes that everything that comes after the identifier in a function definition is an instruction. In reality, you can also pass param and result elements in a few different formats. For example, here’s a simple add function in WAT syntax:

(module
    (func $add (export "add") (param $a i32) (param $b i32) (result i32)
        (i32.add (local.get $a) (local.get $b))
    )
)

If we tried to plug this into our current transformFunc, it would call transformInstruction on a param element and transformInstruction would choke on it as an invalid instruction (which it is).

If we want to process param and result SExpressions, we’d need to use various List functions to extract the param and result ones. What I’m going to do instead is make this arm legitimately only handle functions that don’t have any param or result parameters.

To do this I’ll need a little helper function:

containsParamsOrResult : List Parser.Expression -> Bool
containsParamsOrResult = \children ->
    List.any children \child ->
        when child is
            SExpression { name: "param" } | SExpression { name: "result" } -> Bool.true
            _ -> Bool.false

This returns true if any of the elements in the list of expressions is an SExpression with a param or result name.

Then I can update the match arm in the transformFunc function to use this new function as a guard:

        [Term { term: Variable identifier }, .. as funcInstructions] if !(containsParamsOrResult funcInstructions) ->

Next, I need to make transformFunc return a tuple of Func and FuncType. Since we’re only handling arms where there are no params or results, we can set the FuncType to be one with the given identifier, but empty param and result lists:

transformFunc : Transformer (Func, FuncType)
transformFunc = \children ->
    when children is
        [Term { term: Variable identifier }, .. as funcInstructions] if !(containsParamsOrResult funcInstructions) ->
            funcInstructions
            |> transformInstructions
            |> Result.map \instructions -> (
                    {
                        identifier,
                        instructions,
                    },
                    {
                        identifier,
                        param: [],
                        result: [],
                    },
                )

        _ -> Err [{ position: { row: 0, column: 0 }, message: "Unexpected Instruction" }]

Then we need to modify the when arm in the transformModule function so that it knows how to handle the tuple we just returned:

SExpression { position, name: "func", children: expressions } ->
                mergeResult currentResult (transformFunc expressions) \module, (func, funcType) -> { module &
                        types: List.append module.types { position, element: funcType },
                        funcs: List.append module.funcs { position, element: func },
                    }

Notice the callback we pass to mergeResult; I use destructuring to convert the tuple back to func and funcType variables. Previously, this function only updated the funcs, List but now it updates types as well.

I have one expectation failing now because it doesn’t have anything in the types array when parsing functions. I’ll let you solve that one on your own.

Hopefully if/when we get around to implementing other functions, this change will make it easy to add functions with parameters or results.

This refactor should put our Transformer in a good position to actually implement functions, but I’m going to leave that for the next article.

Build a Wasm Compiler in Roc - Part 14

The Import Section

Fixing function types in transformer

Dusty Phillips