In earlier articles, I introduced this “WAT to Wasm compiler in Roc” project, wrote some Roc code to load an input file, and implemented a tokenizer for a “hello world” of Wat to Wasm compilation. It was… more work than I expected. Four blog posts more work, to be precise! I have no idea where it’s going to end.

But I do know what’s next! Parsing.

Reminder: You are reading content that took a great deal of effort to craft, compose, and debug. If you appreciate this work, consider supporting me on Patreon or GitHub.

Parsing (Syntax Analysis)

Syntax analysis, or parsing, is the process of turning our array of tokens into an Abstract Syntax Tree, or AST. An AST uses parent-child relationships between the tokens we just produced to construct something meaningful.

Since our input language (WAT) is designed to directly model the Wasm AST, you might expect that we will be building a Wasm AST directly. I know I was! But WAT is its own language based on S-expressions, and our initial parsing will involve creating an AST for S-expression syntax, rather than the Wasm AST those expressions model.

The Wikipedia article on S-expressions is pretty short, but as usual I’m only going to model those aspects of the syntax that are useful for the WAT “hello world” example:

  • All names, variables, integers, or strings will be represented as atoms.
  • Each set of parentheses in the input maps to a separate S-expression.
  • Each S-expression has a name, which is the first atom after the opening parenthesis.
  • If an S-expression’s first atom is not a name, it is an error.
  • Each S-expression contains a list of children: terms and nested S-expressions, which may be mixed (see the annotated fragment below).
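
To make those rules concrete, here’s a tiny annotated WAT fragment (my illustration, not the actual hello-world source):

(func $main (result i32))
;; atoms: func, $main, result, i32
;; S-expressions: one per pair of parentheses
;; names: func (the outer expression) and result (the nested one)
;; $main is a variable term, mixed in with the nested (result i32) expression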

Before we start trying to model this in Roc, let’s get some plumbing set up. Start by creating a new Parser.roc file. I hope you’re as relieved as I am to be leaving Tokenizer.roc behind us. We’ll need a module definition that exposes a parse function. I imagine we’ll be adding some types to the exports, but this is enough to get us started:

module [
    parse,
]

import Common

parse : List _ -> Result _ (List Common.Error)
parse = \_tokens ->
    Err [{ message: "I don't know", position: { row: 1, column: 1 } }]

The two instances of _ in the type annotation mean “I don’t care what this type is, infer it for now”. With this dummy implementation, we can update our compile function in main.roc to pipe the output of Tokenizer.tokenize into parse. As a refresher, here is how compile currently looks:

compile : Str -> Result (List U8) (List Common.Error)
compile = \input ->
    when Tokenizer.tokenize input is
        Ok tokens ->
            dbg tokens

            Ok (Str.toUtf8 "TODO: Compile Input")

        Err errors -> Err errors

We could put a nested when in the Ok there, but that kind of nested indentation is a code smell, especially in functional programming. Usually there are helper functions that allow you to treat it in a more functional manner, preferably in a pipeline. The Result builtin module has a few handy functions we can use to make this more functional. Have a look:

compile = \input ->
    input
    |> Tokenizer.tokenize
    |> Result.try Parser.parse
    |> Result.map \expression ->
        # dbg expression

        Str.toUtf8 "TODO: Compile Input"

There are quite a few things to note in this little snippet:

  • The input string is piped to tokenize and the result of that operation (which may be a successful list of tokens or a list of errors) is piped to Result.try.
  • Result.try calls a callback on the contents of the successful result. This has the effect of only calling Parser.parse if tokenization was successful. We use Result.try because the returned value will be a new Result. Since tokenize and parse (now) both return a List Common.Error when there is an error, the transformed Result will be either Ok <parsed output> or Err (List Common.Error).
  • Result.map is very similar to Result.try. The key difference is that the callback you pass to map must be guaranteed to succeed, and it returns the “converted” success value. Result.map will automatically wrap that converted value in Ok. Result.try, on the other hand, returns a new Result, which may be an Ok or an Err (there’s a toy sketch of the difference just after this list). In this case, we haven’t got the codegen or compile steps working yet, so the map just returns an arbitrary string.
  • If I uncomment dbg expression the compiler panics at the Rust level and then hangs before it even gets to executing our code. There seems to be a bug deep in the Roc compiler that I was not able to find a minimal repro for (To the Roc devs I know are reading this: I’m sorry I wasn’t more helpful!). This issue cost me an entire afternoon, and tarnished the good feelings I’ve been developing towards Roc! Hopefully I’ll be able to find a safer place to put the dbg statement, because I don’t know how I’ll be able to finish parsing if I can’t see what the hell it’s building!
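
Here’s a minimal sketch of that try/map difference, using a made-up halve function (not part of the compiler):

halve : U32 -> Result U32 [OddNumber]
halve = \n ->
    if Num.isEven n then Ok (n // 2) else Err OddNumber

# Result.try: the callback returns a Result, so it can introduce a failure
expect (Ok 8 |> Result.try halve) == Ok 4
expect (Ok 7 |> Result.try halve) == Err OddNumber

# Result.map: the callback always succeeds, and map wraps its return in Ok
expect (Ok 8 |> Result.try halve |> Result.map Num.toStr) == Ok "4"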

So now we have tokens arriving in the parse module… let’s do something with them!

Recursive Types

Reading back through my subset-of-S-Expression grammar description, it appears we’ll need two types, a Term, and an SExpression. The difficulty is that the SExpression can contain nested SExpressions.

Roc has only limited support for recursive types. We can’t do this, for example:

Term : [Number U32, String Str, Name Str, Variable Str]
# this won't work
SExpression : {
    name : Str,
    children : List [Term Term, SExpression SExpression],
}

Roc fails with an error that is simultaneously comprehensive and uninformative:

── CYCLIC ALIAS in main.roc ────────────────────────────────────────────────────

The SExpression alias is self-recursive in an invalid way:

8│  SExpression : {
    ^^^^^^^^^^^

Recursion in aliases is only allowed if recursion happens behind a
tagged union, at least one variant of which is not recursive.

The children list does place the SExpression behind a tag, but the recursive alias itself is a record rather than a tagged union. I can kind of understand this, but I was surprised that even this doesn’t work:

Term : [Number U32, String Str, Name Str, Variable Str]
# still won't work
SExpression : {
    name : Str,
    children : [Term Term, List [Term Term, SExpression SExpression]],
}

In this attempt, the children attribute can be either a single Term or a list of Term and SExpression objects. The duplication of Term Term in here tells me we’re doing something wrong, so maybe we want something like this:

Term : [Number U32, String Str, Name Str, Variable Str]
# nope!
SExpression : {
    name : Str,
    children : [Term Term, List [SExpression SExpression]],
}

That fails, too. sigh In fact, we need a top-level tag union for the recursive part of the definition, but even this doesn’t work:

Term : [Number U32, String Str, Name Str, Variable Str]
# nuh-uh
Expression : [Term Term, SExpression SExpression]
SExpression : {
    name : Str,
    children : List Expression,
}

However, this (semantically identical) version does work:

Term : [Number U32, String Str, Name Str, Variable Str]
# Squeeeee!
Expression : [
    Term Term,
    SExpression
        {
            name : Str,
            children : List Expression,
        },
]

You were starting to think I was never going to show you valid Roc code again, weren’t you?

This working example basically says that every expression is either a Term or an SExpression, and every SExpression has a name and a list of children that are other Expressions (i.e. either a Term or an SExpression). Which is (I think) sufficient for our needs.
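
To make the shape concrete, here’s the Expression value that a hypothetical input like (module (memory 1)) should eventually parse into (my example, not the hello-world source):

example : Expression
example = SExpression {
    name: "module",
    children: [
        SExpression {
            name: "memory",
            children: [Term (Number 1)],
        },
    ],
}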

We could build the parser with just this information, but we should probably add some information to each Expression to track its position. We have that information in our list of Token objects, so we may as well use it! Let’s turn the Term tag’s payload into a record and add a position to SExpression:

Term : [Number U32, String Str, Name Str, Variable Str]
Expression : [
    Term
        {
            position : Common.Position,
            term : Term,
        },
    SExpression
        {
            position : Common.Position,
            name : Str,
            children : List Expression,
        },
]

Note: I regret doing this. It made the rest of the series quite a bit more complicated. But identifying the locations of errors is the primary purpose of a compiler, so I stand by my decision.

While we’re at it, we can update the signature of the parse function to reflect that we now know what it’s supposed to be returning:

parse : List Tokenizer.Token -> Result Expression (List Common.Error)

Don’t forget to add Expression to the module exports in Parser.roc. We also need to import Tokenizer, since the signature now references Tokenizer.Token.
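
The top of Parser.roc now looks something like this (a sketch of the updated header):

module [
    parse,
    Expression,
]

import Common
import Tokenizer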

Tokens to Expressions

Recall our tokens list is currently a list of just a handful of token types:

TokenTag : [LParen, RParen, Name Str, Number U32, String Str, Variable Str]

The first two are used to delimit S-expressions, and the rest are terms. However, S-expressions can be nested, so we can’t do anything as simple as “Walk from the LParen to the first RParen”. Nor can we use the “currentToken in state” trick we used in tokenize, because there could be multiple “current” tokens, depending on how deeply nested we go.

Instead, we’ll have to use recursion, though to be performant, we’ll want to use “tail recursion” which is a fancy way of saying “the language doesn’t have efficient loops so you have to jump through hoops to write code that the language can compile to efficient loops.”

Less facetiously, it’s a fancy way of saying “you need to pass state from one recursive invocation to the next such that the recursive call is the last (hence “tail”) call in the function.”

This is a characteristic of most functional languages; loops are function calls instead of language keywords. The standard libraries for these languages generally have a ton of helpful functions that behave like the language keywords you may be used to in imperative languages.

If we do this, Roc will indeed be able to convert our function calls to loops when it compiles the code. Conversely, if a recursive function needs to wait for the recursive call to return before it can perform additional calculations, it isn’t tail recursion.
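
Here’s a minimal illustration of the difference (a toy example, not part of the parser):

# Tail-recursive: the recursive call is the very last thing the function
# does, so Roc can compile it into a loop. The running total is the
# "state we pass forward".
sumTail : List U64, U64 -> U64
sumTail = \numbers, total ->
    when numbers is
        [] -> total
        [first, .. as rest] -> sumTail rest (total + first)

# NOT tail-recursive: after the recursive call returns, we still have to
# add first, so every call needs its own stack frame.
sumNaive : List U64 -> U64
sumNaive = \numbers ->
    when numbers is
        [] -> 0
        [first, .. as rest] -> first + sumNaive rest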

The first step is to define the structure of the state we are passing forward. (If we are modeling an imperative loop, this represents the variables that are defined outside the loop and updated inside the loop):

ParseState : {
    openNodeStack : List Expression,
    errors : List Common.Error,
}

initialParseState : ParseState
initialParseState = {
    openNodeStack: [],
    errors: [],
}

The openNodeStack is kind of like the currentNode we used in the tokenizer. It’s a list instead of a single element to hold all the nested expressions that have been opened with an LParen but not yet closed with an RParen.
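
Concretely, here’s a sketch of how that stack evolves for a hypothetical nested input, ignoring positions (this anticipates behaviour we define below and later in the series):

input: (module (memory 1))

token(s) consumed    openNodeStack afterwards
(module              [module]
(memory              [module, memory]
1                    [module, memory(1)]    -- term appended to the top node
)                    [module(memory(1))]    -- top node popped into its parent's children
)                    [module(memory(1))]    -- a single root node stays put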

Next, we need to define the signature for the parseRecursive function, returning initialParseState as a placeholder for now:

parseRecursive : List Tokenizer.Token, ParseState -> ParseState
parseRecursive = \tokens, currentState ->
    initialParseState

Now with those types in place, we can implement the entirety of our parse function. This is going to be a ridiculously fun when statement to write because of all the destructuring required to determine if each result is valid:

parse : List Tokenizer.Token -> Result Expression (List Common.Error)
parse = \tokens ->
    when parseRecursive tokens initialParseState is
        { errors: [], openNodeStack: [SExpression { name, position, children }] } ->
            Ok (SExpression { name, position, children })

        { errors: [], openNodeStack: [Term { position }] } ->
            Err [
                {
                    position,
                    message: "Expected initial module expression, received a term"
                }
            ]

        { errors: [], openNodeStack: [] } ->
            Err [
                {
                    position: { row: 1, column: 1 },
                    message: "Missing module expression"
                }
            ]

        { errors: [], openNodeStack: [..] } ->
            Err [
                {
                    position: { row: 1, column: 1 },
                    message: "Too many expressions"
                }
            ]

        { errors, openNodeStack: [..] } -> Err errors

Here’s a breakdown:

  • The first match arm matches the success case: a single S-Expression.
  • The second arm fails if the recursive parse only returned a single Term.
  • The third arm fails if the recursive parse didn’t find anything at all.
  • The fourth arm fails for any other set of nodes (more than one expression).
  • The fifth arm fails if we received any errors from the recursive parse step.

Now that the outer function is working, let’s get the recursive function doing its job.

I don’t often do top-down design like this in imperative languages, but it seems to work really well with functional programming.

Implementing parseRecursive

Much like Tokenizer.evaluateOneChar, the parseRecursive function is largely just a big when statement. The difference is that it has to “drive itself”. Where evaluateOneChar was called repeatedly on our behalf by the Str.walkUtf8 function, parseRecursive will instead have to call itself. To guarantee that we are not violating the rules of tail recursion, I will avoid calling parseRecursive from the big when statement. Instead, we’ll assign the “next” set of tokens and state to variables and then match on that to make the recursive call:

parseRecursive : List Tokenizer.Token, ParseState -> ParseState
parseRecursive = \tokens, currentState ->

    (nextTokens, nextState) =
        when (tokens, currentState.openNodeStack) is
            _ -> (tokens |> List.dropFirst 1, currentState)

    when nextTokens is
        [] -> nextState
        _ -> parseRecursive nextTokens nextState

The first when expression is eventually going to be crazy long like the one in tokenize. For now, it matches everything and sets nextTokens to the same list with the first token dropped. This ensures that the program will terminate, even if it isn’t actually parsing anything.

The second when expression performs the recursive check. If there are no more tokens, it returns the nextState as the base case. Otherwise, it makes the recursive call with the new list of tokens and state.

I think now is a good time to switch back to test-driven development. Let’s write a test case for the simplest possible set of input tokens (ignoring error cases):

expect
    # (module)
    tokens = [
        { position: { row: 1, column: 1 }, token: LParen },
        { position: { row: 1, column: 2 }, token: Name "module" },
        { position: { row: 1, column: 8 }, token: RParen },
    ]

    result = parse tokens

    result
    == Ok (SExpression {
        name: "module",
        position: { row: 1, column: 1 },
        children: [],
    })

Tip: Rather than crafting them from scratch, we can copy the input tokens from the output of the expectations in Tokenizer.roc.

Let’s start adding match arms! The first thing we want to match is an LParen followed by a Name token. Roc helpfully allows us to match on multiple elements in a List, so this can be done in an elegantly declarative fashion:

([{ position, token: LParen }, { token: Name name }, .. as rest], currentNodeStack) ->
    (
        rest,
        { currentState &
            openNodeStack: List.append
                currentNodeStack
                (
                    SExpression {
                        position,
                        name,
                        children: [],
                    }
                ),
        },
    )

Remember, we are matching on a tuple. This match arm matches any situation where the first element in the tuple is a list of at least two tokens whose first two elements are an LParen and a Name. Tokens are records with both a position and a token. We capture the position of the LParen so we can reuse it in the expression.

The return value of this match arm uses the “rest” of the tokens from the initial match as the new tokens value. The new state uses the & trick to make sure we propagate any errors. For this particular case, all we need to do is open a new SExpression node with empty children, and add it to the stack of open nodes to be used in the next recursive call.
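
If the & syntax is unfamiliar: it’s Roc’s record-update expression, which copies a record while replacing only the named fields. A tiny sketch with made-up values (plain strings instead of real errors):

state = { openNodeStack: [], errors: ["oops"] }

# Copies state, keeping openNodeStack but appending to errors:
newState = { state & errors: List.append state.errors "another" }

expect newState.errors == ["oops", "another"]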

Adding this single match arm is enough to make our test case succeed, because the dropFirst behaviour in our wildcard arm is silently dropping the RParen. Let’s modify the wildcard arm with marginally better error handling:

            ([], _) ->
                (
                    [],
                    {
                        openNodeStack: [],
                        errors: [{ position: { row: 0, column: 0 }, message: "No tokens provided" }],
                    },
                )
            ([token, .. as rest], _) ->
                message = "Unexpected Token $(token |> Tokenizer.toStr)"
                (
                    rest,
                    { currentState &
                        errors: List.append currentState.errors { position: token.position, message },
                    },
                )

This relies on a new toStr function we need to add to the Tokenizer module:

toStr : Token -> Str
toStr = \token ->
    when token.token is
        LParen -> "("
        RParen -> ")"
        Name name -> "Name $(name)"
        Number number -> "Number $(number |> Num.toStr)"
        String string -> "String $(string)"
        Variable variable -> "Variable $(variable)"

Now roc test fails on our initial test case, as it should. According to dbg, the result is:

result = Err [{ message: "Unexpected Token )", position: { column: 8, row: 1 } }]

To fix this, we need to handle the RParen case in our still-young when statement. The behaviour of an RParen depends on the current stack of unclosed S-expressions:

  • If the current stack has only one node on it, we just leave it as is and drop the token. This is the case our current test would hit.
  • If the current stack has multiple nodes, representing nested S-Expressions, we remove the top-most node and append it to the children of the node below it.
  • If the current stack has no nodes, an RParen is an error.

Let’s solve the first case first:

            ([{ token: RParen }, .. as rest], [_singleRootNode]) ->
                (rest, currentState)

This change alone fixes our test case, so let’s add a new test case for an RParen arriving when the stack is empty:

expect
    # )
    tokens = [
        { position: { row: 1, column: 1 }, token: RParen },
    ]

    result = parse tokens

    result
    == Err [
        {
            position: { row: 1, column: 1 },
            message: "Unmatched ')'",
        },
    ]

To make this case pass, we need to match an RParen when the open-node stack is empty:

            ([{ position, token: RParen }, .. as rest], []) ->
                message = "Unmatched ')'"
                (
                    rest,
                    { currentState &
                        errors: List.append currentState.errors { position, message },
                    },
                )

However, I now realize that our single-state pattern is too naive. The solution is going to require yet another refactor that I think I’ll leave to the next article!