In earlier articles, we implemented a tokenizer for the Wasm text syntax (WAT). In part 6, we started building a parser. We ended that part on a bit of a down note when I realized we were in for yet another refactor. I’m in a better mood today and it’s looking like it won’t be so bad, after all!

Reminder: You are reading content that took a great deal of effort to craft, compose, and debug. If you appreciate this work, consider supporting me on Patreon or GitHub.

Note: The when patterns matched in this article have some ridiculously long patterns in them. As far as I can tell, Roc expects these patterns to be on a single line, so I haven’t added wrapping. I apologize for any horizontal scrolling, especially on mobile devices. I did left-align the clauses instead of including the full indentation that is used in my source code, which may also be confusing.

Adding open and closed states to S-expressions

The problem I hinted at at the end of part 6 is that the current implementation would successfully match an input such as (module)), when that should clearly be an error. To prove this, I wrote a test-case:

expect
    # (module))
    tokens = [
        { position: { row: 1, column: 1 }, token: LParen },
        { position: { row: 1, column: 2 }, token: Name "module" },
        { position: { row: 1, column: 8 }, token: RParen },
        { position: { row: 1, column: 9 }, token: RParen },
    ]

    result = parse tokens

    result
    == Err [
        {
            position: { row: 1, column: 9 },
            message: "Unmatched ')'",
        },
    ]

This expectation is currently failing, as the result is an Ok right now. The problem is that we are not reflecting whether or not an S-expression has previously been closed. Indeed, this test case also fails:

expect
    # (module
    tokens = [
        { position: { row: 1, column: 1 }, token: LParen },
        { position: { row: 1, column: 2 }, token: Name "module" },
    ]

    result = parse tokens

    result
    == Err [
        {
            position: { row: 1, column: 1 },
            message: "Incomplete expression, no closing ')'",
        },
    ]

It seems we need to store some extra state on an expression to reflect whether it is currently open or closed. That’s going to make our when clauses even beefier, but I don’t see any other solution.

An annoying refactor

This is going to require a lot of changes, but I think it will still be educational and hopefully not too confusing!

First, I add a state field to the SExpression record to reflect whether a given expression is Open or Closed:

Expression : [
    Term
        {
            position : Common.Position,
            term : Term,
        },
    SExpression
        {
            position : Common.Position,
            name : Str,
            children : List Expression,
            state : [Open, Closed],
        },
]

Really, I should be stripping this field out of the expressions that get returned from parse, since a successful result should have all expressions Closed, but I’m going to ignore that for now.

Now I just have to follow the compiler errors. Refactor with confidence, indeed!

The equality comparison in our one (currently) success expectation now needs to have a State: Closed added to it:

    result
    == Ok (SExpression {
        name: "module",
        position: { row: 1, column: 1 },
        children: [],
        state: Closed,
    })

Our LParen handler in the ever-growing when statement needs to explicitly to set the state of the new expression it creates to Open:

when (tokens, currentState.openNodeStack) is
    ([{ position, token: LParen }, { token: Name name }, .. as rest], currentNodeStack) ->
        (
            rest,
            { currentState &
                openNodeStack: List.append
                    currentNodeStack
                    (
                        SExpression {
                            position,
                            name,
                            children: [],
                            state: Open,
                        }
                    ),
            },
        )

That’s enough to make the compiler happy, and roc check is now passing. However, all three of our test cases are failing. We need to update the arm that closes an expression to also flip the state. It’s currently nice and simple:

# before
([{ token: RParen }, .. as rest], [_singleRootNode]) ->
    (rest, currentState)

We need to make two changes:

  • We need to confirm the single node inside the open tokens is an S-expression and that it is currently open. We can do that in the pattern.
  • The result needs to replace the single-element list in currentState with an identical single-element list, except the state is now closed.
# after
([{ token: RParen }, .. as rest], [SExpression { position, name, children, state: Open }]) ->
    (
        rest,
        { currentState &
            openNodeStack: [SExpression { position, name, children, state: Closed }],
        },
    )

Now our (module) test is passing again. The other two tests are failing, but we have progress: The (module)) test is no longer reporting a success; it’s just reporting the wrong failure (handled by the wildcard “Unexpected Token” arm).

To make this test have its own error, we’ll need to add a new arm similar to (and yet very different from) the one we just created:

([{ token: RParen, position }, .. as rest], [.., SExpression { state: Closed }]) ->
    message = "Unmatched ')'"
    (
        rest,
        { currentState &
            errors: List.append currentState.errors { position, message },
        },
    )

Note that I snuck a .., in at the front of the list in the pattern for the current node stack. This will fail for any duplicate ), not just when it is the only expression in the list. Note also that the position comes from the RParen token, so we are propagating the correct position in the error message.

The third test case, where (module succeeds even though it should fail, actually needs to improve the smaller when statement in the parse function. We haven’t looked at this one in a while, so here is its current state:

parse = \tokens ->
    when parseRecursive tokens initialParseState is
        { errors: [], openNodeStack: [SExpression { name, position, children }] } ->
            Ok (SExpression { name, position, children })

        { errors: [], openNodeStack: [Term { position }] } ->
            Err [
                {
                    position,
                    message: "Expected initial module expression, received a term"
                }
            ]

        { errors: [], openNodeStack: [] } ->
            Err [
                {
                    position: { row: 1, column: 1 },
                    message: "Missing module expression"
                }
            ]

        { errors: [], openNodeStack: [..] } ->
            Err [
                {
                    position: { row: 1, column: 1 },
                    message: "Too many expressions"
                }
            ]

        { errors, openNodeStack: [..] } -> Err errors

We need to update the first arm so that it is only successful if the nested expression is Closed, and add a new error state if it is Open:

{ errors: [], openNodeStack: [SExpression { name, position, children, state: Closed }] } ->
    Ok (SExpression { name, position, children, state: Closed })

{ errors: [], openNodeStack: [SExpression { position, state: Open }] } ->
    Err [
        {
            position,
            message: "Incomplete expression, no closing ')'",
        },
    ]

Now the tests I have so far are passing. We should probably add a couple more for the cases where a name is missing:

expect
    # ()
    tokens = [
        { position: { row: 1, column: 1 }, token: LParen },
        { position: { row: 1, column: 2 }, token: RParen },
    ]

    result = parse tokens

    result
    == Err [
        { position: { row: 1, column: 2 }, message: "Expected Name, found ')'", },
        { position: { row: 1, column: 2 }, message: "Unmatched ')'", }
    ]

expect
    # (10)
    tokens = [
        { position: { row: 1, column: 1 }, token: LParen },
        { position: { row: 1, column: 2 }, token: Number 10 },
        { position: { row: 1, column: 4 }, token: RParen },
    ]

    result = parse tokens

    result
    == Err [
        {message: "Expected Name, found 'Number 10'", position: {column: 2, row: 1}},
        {message: "Unexpected Token Number 10", position: {column: 2, row: 1}},
        {message: "Unmatched ')'", position: {column: 4, row: 1}}
    ]

All S-expressions in (our version of) the grammar must start with a name, and if they don’t have one it is an error. Note that my original versions of these tests only had one error in each list, but when I actually ran them they ended up like above. This suggests that my plan of continuing parsing after we encounter an incorrect node just leads to a gigantic avalanche of errors, even on subsequent good syntax. So that could be improved, but I’m not going to worry about it for now.

Both of these tests will pass after we add a new case to the when statement. This case must be placed after the existing check for a valid Name token, as it catches all non-name tokens that follow an LParen:

            ([{ token: LParen }, { token, position}, .. as rest], _) ->
                message = "Expected Name, found '$({token, position} |> Tokenizer.toStr)'"
                (
                    List.prepend rest {token, position},
                    { currentState &
                        errors: List.append currentState.errors {position, message}
                    },
                )

The second token gets added back to the state with List.prepend (the LParen stays off or we’d end up with infinite recursion) in the (probably vain) hope that it might be valid syntax for later parsing.

Let’s see what happens if we try to add a nested expression!

Nested Expression

Let’s add a new TDD expectation for the case where we have a nested s-expression that has a nested term. The WAT expression (module (memory 10)) should be suitable. I don’t feel like guessing what the outcome should be so I’ll start with an easy variation of the test:

expect
    # (module (memory 10))

    tokens = [
        { position: { row: 1, column: 1 }, token: LParen },
        { position: { row: 1, column: 2 }, token: Name "module" },
        { position: { row: 1, column: 9 }, token: LParen },
        { position: { row: 1, column: 10 }, token: Name "memory" },
        { position: { row: 1, column: 17 }, token: Number 10 },
        { position: { row: 1, column: 19 }, token: RParen },
        { position: { row: 1, column: 20 }, token: RParen },
    ]

    result = parse tokens

    dbg result

    result |> Result.isOk

Once we get it passing, I can inspect the output of that dbg and copy the content into the expectation if it is right.

To make this test pass, we will need two new match arms. The first one will need to match this case:

  • The next token is one of the terminal tokens.
  • The openNodeStack currently has an S-expression at its tail position.

If it matches, we need to return a new state where the S-expression on the openNodeStack has had a new child added to it. This match arm needs to support four different types of tokens, so I thought it would end up being a ginormous union. That didn’t work out, though, as we’ll soon see.

For starters, I’ll just try to match a Number, since that’s what the test does. I fear trying to generalize it will lead to compiler errors, but we’ll try our luck once we get there.

I’m finding the Roc compiler doesn’t give very good errors if I use an overly declarative syntax and get it wrong, so my first attempt splits the parts out into variables like this:

([{ position: tokenPosition, token: Number number }, .. as remainingTokens], [.. as headOfStack, { position, expression: SExpression { name, children, state: Open } }]) ->
    term = Term { position: tokenPosition, term: Number number }
    newChildren = List.append children term
    newTail = {position, name, children: newChildren, state: Open}
    nextNodeStack = List.append headOfStack newTail
    (
        remainingTokens,
        { currentState &
            openNodeStack: nextNodeStack
        },
    )

But that seems awfully verbose, so once it was working, I consolidated the new variables one by one until I ended up with a happy compiler on this:

([{ position: termPosition, token: Number number }, .. as remainingTokens], [.. as headOfStack, SExpression { position, name, children, state: Open }]) ->
    (
        remainingTokens,
        { currentState &
            openNodeStack: List.append
                headOfStack
                (
                    SExpression {
                        position,
                        name,
                        children: List.append children (Term { position: termPosition, term: Number number }),
                        state: Open,
                    }
                ),
        },
    )

To my dismay, I was not able to capture the token generically so I could reuse it as a term. The syntax token: (Number number) as term feels like it should work, but no variations on that incantation were successful. I tried making a function to construct the next state, but it required so many arguments that it wasn’t worth the overhead. So I just ended up copying the above code three times and changing the Number number to support String, Name, and Variable terms.

We still need one more arm to match the nested RParen. Our current “close an open S-expression” code only works when there is exactly one open expression in the stack. We need a new case for when there is more than one open S-expression on the stack. It needs to close the S-expression at the top of the stack and then append it to the children of the next one down.

This is going to be one hell of a match arm!

([{ token: RParen }, .. as remainingTokens], [.. as headOfStack, SExpression { position, name, children, state: Open }, SExpression { position: childPosition, name: childName, children: childChildren, state: Open }]) ->
    (
        remainingTokens,
        { currentState &
            openNodeStack: List.append
                headOfStack
                (
                    SExpression {
                        position,
                        name,
                        children: List.append
                            children
                            (
                                SExpression {
                                    position: childPosition,
                                    name: childName,
                                    children: childChildren,
                                    state: Closed,
                                }
                            ),
                        state: Open,
                    }
                ),
        },
    )

The first element of the tuple to be matched is an RParen. The second element is destructured as a list of tokens that has two or more tokens. As part of that destructuring, we pop the top two tokens from the stack, using pattern matching and destructuring to implicitly assign the values of those expressions to variables.

Constructing the returned expression requires several things:

  • Recreate the “top” node that we destructured from the stack and change it to a Closed SExpression.
  • Append that node to the children of the second-to-top node from the original stack. Reconstruct that node, but leave it open.
  • Append the resuting reconstructed node to the “head” of the original stack.

The effect is to create a new state as if we had popped the top node from the original stack and added it as a child to the new top node (aka old second to top node).

I had some trouble with this pattern; Roc has a bug (I think) in its pattern matching engine that was erroneously marking this as a redundant pattern that was covered by a previous case. I had to move it above the patterns that check for Term tokens for the compiler to be happy.

Concern about Roc test speeds

At this point, my tests are all passing again, but I am seeing some concerning behaviour:

❯ time roc test
0 failed and 23 passed in 878 ms.

________________________________________________________
Executed in  921.04 millis    fish           external
   usr time  895.82 millis    0.13 millis  895.69 millis
   sys time  294.54 millis    1.39 millis  293.15 millis

I am known to be rather militant about test times. For developer velocity to work the way I demand, an entire test suite needs to build and run in under 10 seconds. I’ve seen this with ReScript and Zora, so I know it’s possible. Go compiles and tests quickly as well, as does Inko, so it’s clearly not just a compiled language issue.

It’s impossible to guess whether these results will extend linearly, but if it does, we might hit my 10-second threshold after writing a paltry 300 tests. That would make Roc a non-starter for me, even though I am so far really enjoying the language. Luckily, Roc does claim “Fast” as one of its three core values, so hopefully this is not an inherent problem in the language design and it will improve over time.

In the end, after adding more tests, I found that adding new tests had minimal impact on the run time, so I am no longer as concerned about this.

Roc does have two CLI options we can use to try to make tests run faster. The --optimize option spends more time on compiling so that test runtime is faster. For me, this triples the total test time, but it’s possible that with a larger test suite the run times would dominate the compile times and it would be a net improvement. The --dev flag does the opposite: it compiles as quickly as possible, but tests may take longer to run. Right now, I’m finding that --dev is about the same time as the results without either option, but again, that might change with a larger test suite.

Our parser is finished (?)

This is a bit anticlimactic but after all that… it seems to be working. The subset of WAT/Lisp/S-expression syntax that we have modelled so far is enough for the entire “hello world” WAT example to parse without failing. I scanned through the resulting expression and it “seems right”.

All the SExpressions have a state of Closed, and all the Names seem to have been migrated. I don’t see any “Unexpected Token” errors, so I’m going to imagine the parser is good enough!

In the next article, we’ll get an introduction to Transformation.