Build a Wasm Compiler in Roc - Part 7
In earlier articles, we implemented a tokenizer for the Wasm text syntax (WAT). In part 6, we started building a parser. We ended that part on a bit of a down note when I realized we were in for yet another refactor. I’m in a better mood today and it’s looking like it won’t be so bad, after all!
Reminder: You are reading content that took a great deal of effort to craft, compose, and debug. If you appreciate this work, consider supporting me on Patreon or GitHub.
Note: The
when
patterns matched in this article have some ridiculously long patterns in them. As far as I can tell, Roc expects these patterns to be on a single line, so I haven’t added wrapping. I apologize for any horizontal scrolling, especially on mobile devices. I did left-align the clauses instead of including the full indentation that is used in my source code, which may also be confusing.
Adding open and closed states to S-expressions
The problem I hinted at at the end of part 6 is that the current implementation
would successfully match an input such as (module))
, when that should clearly
be an error. To prove this, I wrote a test-case:
expect
# (module))
tokens = [
{ position: { row: 1, column: 1 }, token: LParen },
{ position: { row: 1, column: 2 }, token: Name "module" },
{ position: { row: 1, column: 8 }, token: RParen },
{ position: { row: 1, column: 9 }, token: RParen },
]
result = parse tokens
result
== Err [
{
position: { row: 1, column: 9 },
message: "Unmatched ')'",
},
]
This expectation is currently failing, as the result is an Ok
right now. The
problem is that we are not reflecting whether or not an S-expression has
previously been closed. Indeed, this test case also fails:
expect
# (module
tokens = [
{ position: { row: 1, column: 1 }, token: LParen },
{ position: { row: 1, column: 2 }, token: Name "module" },
]
result = parse tokens
result
== Err [
{
position: { row: 1, column: 1 },
message: "Incomplete expression, no closing ')'",
},
]
It seems we need to store some extra state on an expression to reflect whether
it is currently open or closed. That’s going to make our when
clauses even
beefier, but I don’t see any other solution.
An annoying refactor
This is going to require a lot of changes, but I think it will still be educational and hopefully not too confusing!
First, I add a state
field to the SExpression
record to reflect whether
a given expression is Open
or Closed
:
Expression : [
Term
{
position : Common.Position,
term : Term,
},
SExpression
{
position : Common.Position,
name : Str,
children : List Expression,
state : [Open, Closed],
},
]
Really, I should be stripping this field out of the expressions that get
returned from parse
, since a successful result should have all expressions
Closed
, but I’m going to ignore that for now.
Now I just have to follow the compiler errors. Refactor with confidence, indeed!
The equality comparison in our one (currently) success expectation now needs to
have a State: Closed
added to it:
result
== Ok (SExpression {
name: "module",
position: { row: 1, column: 1 },
children: [],
state: Closed,
})
Our LParen
handler in the ever-growing when statement needs to explicitly to set
the state
of the new expression it creates to Open
:
when (tokens, currentState.openNodeStack) is
([{ position, token: LParen }, { token: Name name }, .. as rest], currentNodeStack) ->
(
rest,
{ currentState &
openNodeStack: List.append
currentNodeStack
(
SExpression {
position,
name,
children: [],
state: Open,
}
),
},
)
That’s enough to make the compiler happy, and roc check
is now passing. However,
all three of our test cases are failing. We need to update the arm that closes an
expression to also flip the state. It’s currently nice and simple:
# before
([{ token: RParen }, .. as rest], [_singleRootNode]) ->
(rest, currentState)
We need to make two changes:
- We need to confirm the single node inside the open tokens is an S-expression and that it is currently open. We can do that in the pattern.
- The result needs to replace the single-element list in
currentState
with an identical single-element list, except thestate
is now closed.
# after
([{ token: RParen }, .. as rest], [SExpression { position, name, children, state: Open }]) ->
(
rest,
{ currentState &
openNodeStack: [SExpression { position, name, children, state: Closed }],
},
)
Now our (module)
test is passing again. The other two tests are failing, but we
have progress: The (module))
test is no longer reporting a success; it’s just
reporting the wrong failure (handled by the wildcard “Unexpected Token” arm).
To make this test have its own error, we’ll need to add a new arm similar to (and yet very different from) the one we just created:
([{ token: RParen, position }, .. as rest], [.., SExpression { state: Closed }]) ->
message = "Unmatched ')'"
(
rest,
{ currentState &
errors: List.append currentState.errors { position, message },
},
)
Note that I snuck a ..,
in at the front of the list in the pattern for the
current node stack. This will fail for any duplicate )
, not just when it is
the only expression in the list. Note also that the position comes from the
RParen
token, so we are propagating the correct position in the error
message.
The third test case, where (module
succeeds even though it should fail,
actually needs to improve the smaller when
statement in the parse
function.
We haven’t looked at this one in a while, so here is its current state:
parse = \tokens ->
when parseRecursive tokens initialParseState is
{ errors: [], openNodeStack: [SExpression { name, position, children }] } ->
Ok (SExpression { name, position, children })
{ errors: [], openNodeStack: [Term { position }] } ->
Err [
{
position,
message: "Expected initial module expression, received a term"
}
]
{ errors: [], openNodeStack: [] } ->
Err [
{
position: { row: 1, column: 1 },
message: "Missing module expression"
}
]
{ errors: [], openNodeStack: [..] } ->
Err [
{
position: { row: 1, column: 1 },
message: "Too many expressions"
}
]
{ errors, openNodeStack: [..] } -> Err errors
We need to update the first arm so that it is only successful if the nested
expression is Closed
, and add a new error state if it is Open
:
{ errors: [], openNodeStack: [SExpression { name, position, children, state: Closed }] } ->
Ok (SExpression { name, position, children, state: Closed })
{ errors: [], openNodeStack: [SExpression { position, state: Open }] } ->
Err [
{
position,
message: "Incomplete expression, no closing ')'",
},
]
Now the tests I have so far are passing. We should probably add a couple more for the cases where a name is missing:
expect
# ()
tokens = [
{ position: { row: 1, column: 1 }, token: LParen },
{ position: { row: 1, column: 2 }, token: RParen },
]
result = parse tokens
result
== Err [
{ position: { row: 1, column: 2 }, message: "Expected Name, found ')'", },
{ position: { row: 1, column: 2 }, message: "Unmatched ')'", }
]
expect
# (10)
tokens = [
{ position: { row: 1, column: 1 }, token: LParen },
{ position: { row: 1, column: 2 }, token: Number 10 },
{ position: { row: 1, column: 4 }, token: RParen },
]
result = parse tokens
result
== Err [
{message: "Expected Name, found 'Number 10'", position: {column: 2, row: 1}},
{message: "Unexpected Token Number 10", position: {column: 2, row: 1}},
{message: "Unmatched ')'", position: {column: 4, row: 1}}
]
All S-expressions in (our version of) the grammar must start with a name, and if they don’t have one it is an error. Note that my original versions of these tests only had one error in each list, but when I actually ran them they ended up like above. This suggests that my plan of continuing parsing after we encounter an incorrect node just leads to a gigantic avalanche of errors, even on subsequent good syntax. So that could be improved, but I’m not going to worry about it for now.
Both of these tests will pass after we add a new case to the when
statement.
This case must be placed after the existing check for a valid Name
token, as it catches all non-name tokens that follow an LParen
:
([{ token: LParen }, { token, position}, .. as rest], _) ->
message = "Expected Name, found '$({token, position} |> Tokenizer.toStr)'"
(
List.prepend rest {token, position},
{ currentState &
errors: List.append currentState.errors {position, message}
},
)
The second token gets added back to the state with List.prepend
(the LParen
stays off or we’d end up with infinite recursion) in the (probably vain) hope
that it might be valid syntax for later parsing.
Let’s see what happens if we try to add a nested expression!
Nested Expression
Let’s add a new TDD expectation for the case where we have a nested
s-expression that has a nested term. The WAT expression (module (memory 10))
should be suitable. I don’t feel like guessing what the outcome should be so
I’ll start with an easy variation of the test:
expect
# (module (memory 10))
tokens = [
{ position: { row: 1, column: 1 }, token: LParen },
{ position: { row: 1, column: 2 }, token: Name "module" },
{ position: { row: 1, column: 9 }, token: LParen },
{ position: { row: 1, column: 10 }, token: Name "memory" },
{ position: { row: 1, column: 17 }, token: Number 10 },
{ position: { row: 1, column: 19 }, token: RParen },
{ position: { row: 1, column: 20 }, token: RParen },
]
result = parse tokens
dbg result
result |> Result.isOk
Once we get it passing, I can inspect the output of that dbg
and copy the
content into the expectation if it is right.
To make this test pass, we will need two new match arms. The first one will need to match this case:
- The next token is one of the terminal tokens.
- The
openNodeStack
currently has an S-expression at its tail position.
If it matches, we need to return a new state where the S-expression on the
openNodeStack
has had a new child added to it. This match arm needs to
support four different types of tokens, so I thought it would end up being a
ginormous union. That didn’t work out, though, as we’ll soon see.
For starters, I’ll just try to match a Number
, since that’s what the test
does. I fear trying to generalize it will lead to compiler errors, but we’ll
try our luck once we get there.
I’m finding the Roc compiler doesn’t give very good errors if I use an overly declarative syntax and get it wrong, so my first attempt splits the parts out into variables like this:
([{ position: tokenPosition, token: Number number }, .. as remainingTokens], [.. as headOfStack, { position, expression: SExpression { name, children, state: Open } }]) ->
term = Term { position: tokenPosition, term: Number number }
newChildren = List.append children term
newTail = {position, name, children: newChildren, state: Open}
nextNodeStack = List.append headOfStack newTail
(
remainingTokens,
{ currentState &
openNodeStack: nextNodeStack
},
)
But that seems awfully verbose, so once it was working, I consolidated the new variables one by one until I ended up with a happy compiler on this:
([{ position: termPosition, token: Number number }, .. as remainingTokens], [.. as headOfStack, SExpression { position, name, children, state: Open }]) ->
(
remainingTokens,
{ currentState &
openNodeStack: List.append
headOfStack
(
SExpression {
position,
name,
children: List.append children (Term { position: termPosition, term: Number number }),
state: Open,
}
),
},
)
To my dismay, I was not able to capture the token generically so I could reuse
it as a term. The syntax token: (Number number) as term
feels like it should
work, but no variations on that incantation were successful. I tried making a
function to construct the next state, but it required so many arguments that it
wasn’t worth the overhead. So I just ended up copying the above code three
times and changing the Number number
to support String
, Name
, and
Variable
terms.
We still need one more arm to match the nested RParen
. Our current “close an
open S-expression” code only works when there is exactly one open expression in
the stack. We need a new case for when there is more than one open S-expression
on the stack. It needs to close the S-expression at the top of the stack and then
append it to the children of the next one down.
This is going to be one hell of a match arm!
([{ token: RParen }, .. as remainingTokens], [.. as headOfStack, SExpression { position, name, children, state: Open }, SExpression { position: childPosition, name: childName, children: childChildren, state: Open }]) ->
(
remainingTokens,
{ currentState &
openNodeStack: List.append
headOfStack
(
SExpression {
position,
name,
children: List.append
children
(
SExpression {
position: childPosition,
name: childName,
children: childChildren,
state: Closed,
}
),
state: Open,
}
),
},
)
The first element of the tuple to be matched is an RParen
. The second element
is destructured as a list of tokens that has two or more tokens. As part of
that destructuring, we pop the top two tokens from the stack, using pattern
matching and destructuring to implicitly assign the values of those expressions
to variables.
Constructing the returned expression requires several things:
- Recreate the “top” node that we destructured from the stack and change it to
a
Closed
SExpression. - Append that node to the children of the second-to-top node from the original stack. Reconstruct that node, but leave it open.
- Append the resuting reconstructed node to the “head” of the original stack.
The effect is to create a new state as if we had popped the top node from the original stack and added it as a child to the new top node (aka old second to top node).
I had some trouble with this pattern; Roc has a bug (I think) in its pattern matching engine that was erroneously marking this as a redundant pattern that was covered by a previous case. I had to move it above the patterns that check for Term tokens for the compiler to be happy.
Concern about Roc test speeds
At this point, my tests are all passing again, but I am seeing some concerning behaviour:
❯ time roc test
0 failed and 23 passed in 878 ms.
________________________________________________________
Executed in 921.04 millis fish external
usr time 895.82 millis 0.13 millis 895.69 millis
sys time 294.54 millis 1.39 millis 293.15 millis
I am known to be rather militant about test times. For developer velocity to work the way I demand, an entire test suite needs to build and run in under 10 seconds. I’ve seen this with ReScript and Zora, so I know it’s possible. Go compiles and tests quickly as well, as does Inko, so it’s clearly not just a compiled language issue.
It’s impossible to guess whether these results will extend linearly, but if it does, we might hit my 10-second threshold after writing a paltry 300 tests. That would make Roc a non-starter for me, even though I am so far really enjoying the language. Luckily, Roc does claim “Fast” as one of its three core values, so hopefully this is not an inherent problem in the language design and it will improve over time.
In the end, after adding more tests, I found that adding new tests had minimal impact on the run time, so I am no longer as concerned about this.
Roc does have two CLI options we can use to try to make tests run faster. The
--optimize
option spends more time on compiling so that test runtime is
faster. For me, this triples the total test time, but it’s possible that with a
larger test suite the run times would dominate the compile times and it would
be a net improvement. The --dev
flag does the opposite: it compiles as
quickly as possible, but tests may take longer to run. Right now, I’m finding
that --dev
is about the same time as the results without either option, but
again, that might change with a larger test suite.
Our parser is finished (?)
This is a bit anticlimactic but after all that… it seems to be working. The subset of WAT/Lisp/S-expression syntax that we have modelled so far is enough for the entire “hello world” WAT example to parse without failing. I scanned through the resulting expression and it “seems right”.
All the SExpression
s have a state of Closed
, and all the Name
s seem to
have been migrated. I don’t see any “Unexpected Token” errors, so I’m going to
imagine the parser is good enough!
In the next article, we’ll get an introduction to Transformation.