In earlier articles, we implemented a tokenizer for the Wasm text syntax (WAT) and started on a parser to convert those tokes to a S-expression AST.

In this part, we’ll start to create a transformer to convert that AST to a new one that better matches the WASM output we will be crafting. Don’t ask me how many parts that’s going to take!

Reminder: You are reading content that took a great deal of effort to craft, compose, and debug. If you appreciate this work, consider supporting me on Patreon or GitHub.

Transforming the AST

The transformation step is optional in some compilers, depending on how well the input maps to output. As one example, a “formatter” (which is actually a transpiler where the source and target language are the same) typically reads the AST from existing source codes, and outputs (potentially modified) source code from the same AST. Interpreters are sometimes also able to work directly from the AST of the input language.

In the case of WAT to Wasm, my assumption going in was that since the two are textual and binary descriptions of the exact same bytecode, the AST wouldn’t need transformation. I couldn’t have been more wrong! The S-expression syntax is its own language, and while it is easy to parse (the parser only took TWO posts!), it does not create an AST that resembles Wasm. It does, however, create an AST that is easy to transform into one that resembles Wasm.

What does a Wasm AST look like?

The Wasm Abstract Syntax Tree is so flat that it barely qualifies for the title of “tree”. There is a big fat module node at the root of the tree, and a bunch of flat lists of other kinds of nodes. Some of those other nodes have a more traditional tree-like structure, though.

This flatness is partially because Wasm is meant to be quickly and efficiently read by a computer and transformed directly to machine code. Machine code is just a flat list of instructions after all. There’s nothing “abstract” about it!

Looking at the Wasm specification for a module, there are potentially nine different sections in a module. However we’ll only need to model a subset of them (in bold) to make our “hello world” example work:

  • types
  • imports
  • funcs
  • mems
  • globals
  • exports
  • start
  • elems
  • datas

Each section is represented as a flat list, and whenever an element in one section needs to reference an element in another section, it uses the index of that element in the other list. Variables, which start with a $ are essentially just names for various indices.

The elements of each of these lists each have their own structure. For some of them (e.g. functions with a bunch of instructions in them), that structure can be pretty complicated, but others are fairly simple. For example, if you click a tangled web of links in the spec, you’ll discover that the mems section contains a list of mem types, and each mem has a type that is a limit, and a limit is either one or two integers representing the minimum and maximum size of the memory (in units of page size).

In WAT syntax, this means that (memory 1) has a limit with a minimum size but no maximum, and means “create a memory with at least one page of data in it”. Typically, each module only has one memory at the implicit 0 index in the list, but future versions of Wasm willsupport multiple memories.

After transformation, the AST for a memory would probably be a tag named Memory with a payload of type Limit where a Limit is a record that holds a minimum and an optional maximum.

The chief difference between the two ASTs is that any one S-Expression can contain any arbitrary collection of terms and nested S-Expressions, but a memory node MUST contain a limit. For example the following valid S-Expression is not a valid Wasm node:

    (memory (i32.const 8) "hello world")

(In fact this is a valid data node that I changed the name to memory). If I pass this S-expression into wasm-tools parse, it throws an error.

So the goal of this transformation step is to take our input S-expressions, confirm that they represent legitimate Wasm AST nodes, and build that AST. The AST we want to build is strongly typed and the child expressions in each S-expression must conform to that type.

Hopefully that’s enough background for us to get started and you won’t have to spend as much time puzzling out the formal specification as I did!

Boilerplate

Our parser is complete, so let’s create a new Transformer.roc file with a dummy function in it for transforming an AST:

module [
    transform,
    NodeTag,
    WasmNode,
]

import Common
import Parser

NodeTag : [Module {}]

WasmNode : {
    position : Common.Position,
    node : NodeTag,
}

transform : Parser.Expression -> Result WasmNode (List Common.Error)
transform = \expressions -> Ok {
        position: { row: 1, column: 1 },
        node: Module ({}),
    }

The types are kind of just a wild guess at what we might expect to return from this function for now, and transform is hard-coded with a quasi-legit guess at the expected node.

Back in main.roc, we’ll need to add an import for Transformer and pipe in the output of parse in our compile function:

compile : Str -> Result (List U8) (List Common.Error)
compile = \input ->
    input
    |> Tokenizer.tokenize
    |> Result.try Parser.parse
    |> Result.try Transformer.transform
    |> Result.map \ast ->
        # dbg expression

        Str.toUtf8 "TODO: Compile Input"

This is identical to the previous version except for the one new |> Result.try Transformer.transform line.

Let’s get a test going for an empty module:

expect
    # (module)
    result = transform {
        position: { row: 1, column: 1 },
        expression: SExpression {
            name: "module",
            children: [],
            state: Closed,
        },
    }

    result
    == Ok {
        position: { row: 1, column: 1 },
        node: Module ({}),
    }

This test passes right now because of the hard coding in transform. Let’s add a couple more tests for some potentially broken inputs:

expect
    # (memory)
    result = transform {
        position: { row: 1, column: 1 },
        expression: SExpression {
            name: "memory",
            children: [],
            state: Closed,
        },
    }

    result
    == Err [
        {
            message: "Expected a 'module', but received a 'memory'",
            position: { column: 1, row: 1 },
        },
    ]

expect
    # 42
    # (technically, parser would never emit this)
    result = transform {
        position: { row: 1, column: 1 },
        expression: Term (Number 42),
    }

    result
    == Err [
        {
            message: "Expected a S-expression, but received a Term",
            position: { column: 1, row: 1 },
        },
    ]

These obviously fail with the hardcoded transform body, so let’s start actually implementing that function. Throw a when expression into that function:

transform : Parser.Expression -> Result WasmNode (List Common.Error)
transform = \expression ->
    when expression is
        { position, expression: SExpression { name: "module", children } } ->
            Ok {
                position,
                node: Module ({}),
            }

        { position, expression: SExpression { name } } ->
            Err [
                {
                    position,
                    message: "Expected a 'module', but received a '$(name)'",
                },
            ]

        { position, expression: Term _ } ->
            Err [
                {
                    position,
                    message: "Expected a S-expression, but received a Term",
                },
            ]

If we receive a module with any children (any children at all), we return an Ok, but if the input is a non-module S-Expression or a Term, we return an appropriate Err.

Now we just need to figure out what to do with the children in the success case. But before we do that, we should probably think about the AST data structure.

Designing an AST

In a non-empty module, we can expect one or more child expressions inside the module expression, and each child might represent things like data or a memory or a function. Each of these will ultimately be collected in lists for each “type” of element. So our Module record needs to have several fields with lists of other types as their input. Let’s start with some empty data types:

Func : {}
Mem : {}
Data : {}
Import : {}
Export : {}

Module : {
    funcs : List Func,
    mems : List Mem,
    datas : List Data,
    imports : List Import,
    exports : List Export,
}

Unlike with S-Expressions, the set of children that each type can have is strictly defined. A Wasm module can have 10 types of children; our hello world will require around half of those.

A problem with this model is that we lose positions of all the AST nodes. That’s why I originally defined WasmNode to have a position. But that isn’t looking like an appropriate design now because we don’t want just any Wasm node to be a child of another Node. For example, it would be suboptimal if Module.funcs was a list of WasmNode records, because we only want Func nodes to be in there.

One solution would be to put a position on every type, something like this:

Func : {
   position: Common.Position,
   ... other func fields
}
Mem : {
   position: Common.Position,
   ... other mem fields
}

The problem with this is “what if we forget”. This is a good time to introduce the concept of “type parameters”, (typically referred to as “generic types” in other languages).

Let’s try to add a new Positionable type to our Common library:

Positionable: {
    position: Position
    element: ????
}

The question is, what is the type of the element field? It’s not a tagged union because for any one place we use a Positionable, it can only be one type, but we want that to be a different type in different places.

This is similar to how a List works. Any one list can contain only one type of element, but two different lists can each contain different types of elements.

The trick is to add a word to the Positionable definition to reflect “some instance-specific type”, and then we use that word as a stand-in for the type of the element field:

Positionable elem : {
    position : Position,
    element : elem,
}

The “stand-in” can be any value (think of it like a variable, but for types), but it must be lowercase so Roc knows it needs to treat it like a type parameter instead of trying to find a real type somewhere.

Now when we want to use Positionable, we have to specify the type we want it to have. You’ve already seen this in places where we have specified a type as List Str. Different lists can contain different elements, but a List Str can only contain strings.

We are already importing Common in the Transformer, so we can say that the fields of Module must contain things like List (Common.Positionable Func):

Func : {}
Mem : {}
Data : {}
Import : {}
Export : {}

Module : {
    funcs : List (Common.Positionable Func),
    mems : List (Common.Positionable Mem),
    datas : List (Common.Positionable Data),
    imports : List (Common.Positionable Import),
    exports : List (Common.Positionable Export),
}

The parentheses are required so that Roc understands that the type parameter Common.Positionable is e.g. a Func, and that the type parameter for List is a Common.Positionable Func. If we left it as List Common.Positionable Func, Roc would assume we were passing two type parameters to List, and it would complain because List only expects one type parameter.

Let’s also create an initial Module that has all those fields set to an empty list:

emptyModule = {
    funcs: [],
    mems: [],
    datas: [],
    imports: [],
    exports: [],
}

It is relatively common in immutable languages to have an initial empty object like this instead of a constructor or new function. In fact, those five empty lists are all pointing to the exact same empty list object! List.append always returns a new list, and if we call List.append on one of the fields in emptyModule, the result would be a new module.

Note that I’ve removed WasmNode and NodeTag now. I changed the transform function signature to always return a Positionable Module instead of the WasmNode we have deleted:

transform : Parser.Expression -> Result (Common.Positionable Module) (List Common.Error)
transform = \expression ->
    when expression is
        { position, expression: SExpression { name: "module", children } } ->
            Ok {
                position,
                element: emptyModule,
            }
        # error arms snipped

My code is still failing to compile because the expectation for the success case is currently returning an object with a node field instead of an element field. We can reuse the emptyModule here as well:

expect
    # (module)
    result = transform {
        position: { row: 1, column: 1 },
        expression: SExpression {
            name: "module",
            children: [],
            state: Closed,
        },
    }

    result
    == Ok {
        position: { row: 1, column: 1 },
        element: emptyModule,
    }

An exercise for the reader

I’m not going to do it right now, but now that Positionable exists, there are quite a few other types that we could replace with it:

  • Common.Error could be a Positionable Str.
  • Parser.Expression might be able to have the SExpression and Term in its tagged union changed to Positionable.
  • Tokenizer.Token could become a Positionable TokenTag
  • Tokenizer.CurrentToken could be a Positionable [None, Name (List U8), Number (List U8), String (List U8), Variable (List U8)]

Most of the work would be changing e.g. message to element (for errors) and token to element (for tokens).

I can’t think of any off the top of my head, but it might be worth looking for some reusable functions that can be used with Positionable elements.

The next step is to actually do something with the children in the when expression in transform. The compiler is kindly telling me this field is not used in the branch, so I won’t forget that there is more work to be done.

I’m going to leave that for the next article, though. See you then!