# Build a Wasm Compiler in Roc - Part 6
In earlier articles, I introduced this “WAT to Wasm compiler in Roc” project, wrote some Roc code to load an input file, and implemented a tokenizer for a “hello world” of WAT-to-Wasm compilation. It was… more work than I expected. Four blog posts more work, to be precise! I have no idea where it’s going to end.
But I do know what’s next! Parsing.
Reminder: You are reading content that took a great deal of effort to craft, compose, and debug. If you appreciate this work, consider supporting me on Patreon or GitHub.
## Parsing (Syntax Analysis)
Syntax analysis, or parsing, is the process of turning our array of tokens into an Abstract Syntax Tree, or AST. An AST uses parent-child relationships between the tokens we just produced to construct something meaningful.
Since our input language (WAT) is designed to directly model the Wasm AST, you might expect that we will be building a Wasm AST directly. I know I was! But WAT is its own language based on S-expressions, and our initial parsing will involve creating an AST for S-expression syntax, rather than the Wasm AST those expressions model.
The Wikipedia article on S-expressions is pretty short, but as usual I’m only going to model those aspects of the syntax that are useful for the WAT “hello world” example:
- All names, variables, integers, or strings will be represented as atoms.
- Each set of parentheses in the input maps to a separate S-expression.
- Each S-expression has a name, which is the first atom after the opening parenthesis.
- If an S-expression’s first atom is not a name, it is an error.
- Each S-expression contains a list of terms and nested S-expressions, which may be mixed.
Before we start trying to model this in Roc, let’s get some plumbing set up.
Start by creating a new `Parser.roc` file. I hope you’re as relieved as I am to be leaving `Tokenizer.roc` behind us. We’ll need a module definition that exposes a `parse` function. I imagine we’ll be adding some types to the exports, but this is enough to get us started:
```roc
module [
    parse,
]

import Common

parse : List _ -> Result _ (List Common.Error)
parse = \_tokens ->
    Err [{ message: "I don't know", position: { row: 1, column: 1 } }]
```
The two instances of `_` in the type definition mean “I don’t care what this type is; infer it for now”. With this dummy implementation, we can update our `compile` function in `main.roc` to pipe the output of `Tokenizer.tokenize` into `parse`. As a refresher, here is how `compile` currently looks:
```roc
compile : Str -> Result (List U8) (List Common.Error)
compile = \input ->
    when Tokenizer.tokenize input is
        Ok tokens ->
            dbg tokens

            Ok (Str.toUtf8 "TODO: Compile Input")

        Err errors -> Err errors
```
We could put a nested `when` in the `Ok` there, but that kind of nested indentation is a code smell, especially in functional programming. Usually there are helper functions that allow you to treat it in a more functional manner, preferably in a pipeline. The `Result` builtin module has a few handy functions we can use to make this more functional. Have a look:
```roc
compile = \input ->
    input
    |> Tokenizer.tokenize
    |> Result.try Parser.parse
    |> Result.map \expression ->
        # dbg expression
        Str.toUtf8 "TODO: Compile Input"
```
There are quite a few things to note in this little snippet:

- The input string is piped to `tokenize`, and the result of that operation (which may be a successful list of tokens or a list of errors) is piped to `Result.try`. `Result.try` calls a callback on the contents of a successful result. This has the effect of only calling `Parser.parse` if tokenization was successful. We use `Result.try` because the returned value will be a new `Result`. Since `tokenize` and `parse` (now) both return `Common.Error` values when there is an error, the transformed `Result` will either be `Ok <parsed output>` or `Err (List Common.Error)`.
- `Result.map` is very similar to `Result.try`. The key difference is that the callback you pass to `map` must be guaranteed to succeed, and it returns the “converted” success value. `Result.map` will automatically wrap that converted value in an `Ok`. `Result.try`, on the other hand, returns a new `Result`, which may be an `Ok` or an `Err`. In this case, we haven’t got the codegen or compile steps working yet, so the map just returns an arbitrary string.
- If I uncomment `dbg expression`, the compiler panics at the Rust level and then hangs before it even gets to executing our code. There seems to be a bug deep in the Roc compiler that I was not able to find a minimal repro for (to the Roc devs I know are reading this: I’m sorry I wasn’t more helpful!). This issue cost me an entire afternoon and tarnished the good feelings I’ve been developing towards Roc! Hopefully I’ll be able to find a safer place to put the `dbg` statement, because I don’t know how I’ll be able to finish parsing if I can’t see what the hell it’s building!
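If the distinction between `Result.try` and `Result.map` still feels fuzzy, here are a few standalone `expect`s (toy examples of my own, not part of the compiler) that pin down the contract of each:

```roc
expect
    # Result.map: the callback always succeeds, and map rewraps
    # its return value in Ok
    (Ok 21 |> Result.map \n -> n * 2) == Ok 42

expect
    # Result.try: the callback itself returns a Result, so a
    # failure can propagate without any nesting
    result = Ok 21 |> Result.try \n -> if n > 0 then Ok (n * 2) else Err NotPositive
    result == Ok 42

expect
    # an Err short-circuits both functions; the callback never runs
    (Err NoTokens |> Result.map \n -> n * 2) == Err NoTokens
```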
So now we have tokens arriving in the parse module… let’s do something with them!
## Recursive Types
Reading back through my subset-of-S-expression grammar description, it appears we’ll need two types: a `Term` and an `SExpression`. The difficulty is that an `SExpression` can contain nested `SExpression`s.
Roc has only limited support for recursive types. We can’t do this, for example:
```roc
Term : [Number U32, String Str, Name Str, Variable Str]

# this won't work
SExpression : {
    name : Str,
    children : List [Term Term, SExpression SExpression],
}
```
Roc fails with an error that is simultaneously comprehensive and uninformative:
```plain
── CYCLIC ALIAS in main.roc ────────────────────────────────────────────────────

The SExpression alias is self-recursive in an invalid way:

8│  SExpression : {
    ^^^^^^^^^^^

Recursion in aliases is only allowed if recursion happens behind a
tagged union, at least one variant of which is not recursive.
```
The `children` list does place the `SExpression` behind a tag, but the recursion is not behind a tag. I can kind of understand this, but I was surprised that even this doesn’t work:
```roc
Term : [Number U32, String Str, Name Str, Variable Str]

# still won't work
SExpression : {
    name : Str,
    children : [Term Term, List [Term Term, SExpression SExpression]],
}
```
In this attempt, the `children` attribute can be either a single `Term` or a list of `Term` and `SExpression` objects. The duplication of `Term Term` in here tells me we’re doing something wrong, so maybe we want something like this:
```roc
Term : [Number U32, String Str, Name Str, Variable Str]

# nope!
SExpression : {
    name : Str,
    children : [Term Term, List [SExpression SExpression]],
}
```
That fails, too. *Sigh.* In fact, we need a top-level tag union for the recursive part of the definition, but even this doesn’t work:
```roc
Term : [Number U32, String Str, Name Str, Variable Str]

# nuh-uh
Expression : [Term Term, SExpression SExpression]

SExpression : {
    name : Str,
    children : List Expression,
}
```
However, this (semantically identical) version does work:
```roc
Term : [Number U32, String Str, Name Str, Variable Str]

# Squeeeee!
Expression : [
    Term Term,
    SExpression
        {
            name : Str,
            children : List Expression,
        },
]
```
You were starting to think I was never going to show you valid Roc code again, weren’t you?
This working example basically says that every expression is either a `Term` or an `SExpression`, and every `SExpression` has a `name` and a list of `children` that are other `Expression`s (i.e. either a `Term` or an `SExpression`). Which is (I think) sufficient for our needs.
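As a sanity check, we can hand-construct a small value of this type (a throwaway example of mine, not code we’ll keep):

```roc
# an Expression modeling the S-expression (module 42 (memory))
example : Expression
example = SExpression {
    name: "module",
    children: [
        Term (Number 42),
        SExpression { name: "memory", children: [] },
    ],
}
```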
We could build the parser with just this information, but we should probably add some information to each `Expression` to track its position. We have that information in our list of `Token` objects, so we may as well use it! Let’s turn the `Term` tag into a record payload and add a `position` to `SExpression`:
```roc
Term : [Number U32, String Str, Name Str, Variable Str]

Expression : [
    Term
        {
            position : Common.Position,
            term : Term,
        },
    SExpression
        {
            position : Common.Position,
            name : Str,
            children : List Expression,
        },
]
```
Note: I regret doing this. It made the rest of the series quite a bit more complicated. But identifying the locations of errors is the primary purpose of a compiler, so I stand by my decision.
While we’re at it, we can update the signature of the `parse` function to reflect that we now know what it’s supposed to be returning:

```roc
parse : List Tokenizer.Token -> Result Expression (List Common.Error)
```
Don’t forget to add `Expression` to the module exports in `Parser.roc`.
## Tokens to Expressions
Recall that our `tokens` list is currently a list of just a handful of token types:

```roc
TokenTag : [LParen, RParen, Name Str, Number U32, String Str, Variable Str]
```
The first two are used to delimit S-expressions, and the rest are terms. However, S-expressions can be nested, so we can’t do anything as simple as “walk from the `LParen` to the first `RParen`”. Nor can we use the “`currentToken` in state” trick we used in `tokenize`, because there could be multiple “current” tokens, depending on how deeply nested we go.
Instead, we’ll have to use recursion, though to be performant, we’ll want to use “tail recursion”, which is a fancy way of saying “the language doesn’t have efficient loops, so you have to jump through hoops to write code that the language can compile to efficient loops.”
Less facetiously, it’s a fancy way of saying “you need to pass state from one recursive invocation to the next such that the recursive call is the last (hence “tail”) call in the function.”
This is a characteristic of most functional languages; loops are function calls instead of language keywords. The standard libraries for these languages generally have a ton of helpful functions that behave like the language keywords you may be used to in imperative languages.
If we do this, Roc will indeed be able to convert our function calls to loops when it compiles the code. If the recursive function needs to wait for the recursive call to return before it can perform additional calculations, then it isn’t tail recursion.
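To make that concrete, here’s a pair of throwaway functions (my own illustration, not part of the compiler) that sum a list recursively; only the second is tail recursive:

```roc
# Not tail recursive: the addition happens *after* the recursive
# call returns, so each call needs its own stack frame.
sumSlow : List U64 -> U64
sumSlow = \numbers ->
    when numbers is
        [] -> 0
        [first, .. as rest] -> first + sumSlow rest

# Tail recursive: the running total travels forward as an argument,
# and the recursive call is the very last thing the function does,
# so Roc can compile it to a loop.
sumFast : List U64, U64 -> U64
sumFast = \numbers, total ->
    when numbers is
        [] -> total
        [first, .. as rest] -> sumFast rest (total + first)
```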
The first step is to define the structure of the state we are passing forward. (If we are modeling an imperative loop, this represents the variables that are defined outside the loop and updated inside the loop):
```roc
ParseState : {
    openNodeStack : List Expression,
    errors : List Common.Error,
}

initialParseState : ParseState
initialParseState = {
    openNodeStack: [],
    errors: [],
}
```
The `openNodeStack` is kind of like the `currentNode` we used in the tokenizer. It’s a list instead of a single element so it can hold all the nested expressions that have been opened with an `LParen` but not yet closed with an `RParen`.
Next, we need to define the signature of the `parseRecursive` function, returning the `initialParseState` as a placeholder for now:
```roc
parseRecursive : List Tokenizer.Token, ParseState -> ParseState
parseRecursive = \tokens, currentState ->
    initialParseState
```
Now, with those types in place, we can implement the entirety of our `parse` function. This is going to be a ridiculously fun `when` statement to write because of all the destructuring required to determine whether each result is valid:
```roc
parse : List Tokenizer.Token -> Result Expression (List Common.Error)
parse = \tokens ->
    when parseRecursive tokens initialParseState is
        { errors: [], openNodeStack: [SExpression { name, position, children }] } ->
            Ok (SExpression { name, position, children })

        { errors: [], openNodeStack: [Term { position }] } ->
            Err [
                {
                    position,
                    message: "Expected initial module expression, received a term",
                },
            ]

        { errors: [], openNodeStack: [] } ->
            Err [
                {
                    position: { row: 1, column: 1 },
                    message: "Missing module expression",
                },
            ]

        { errors: [], openNodeStack: [..] } ->
            Err [
                {
                    position: { row: 1, column: 1 },
                    message: "Too many expressions",
                },
            ]

        { errors, openNodeStack: [..] } -> Err errors
```
Here’s a breakdown:

- The first match arm matches the success case: a single S-expression.
- The second arm fails if the recursive parse only returned a single `Term`.
- The third arm fails if the recursive parse didn’t find anything at all.
- The fourth arm fails for any other set of nodes (more than one expression).
- The fifth arm fails if we received any errors from the recursive parse step.
Now that the outer function is working, let’s get the recursive function doing its job.
I don’t often do top-down design like this in imperative languages, but it seems to work really well with functional programming.
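Since the stub `parseRecursive` currently returns `initialParseState` unconditionally, every input should land in the third arm. A throwaway `expect` (my addition, not from the original code) can confirm the plumbing:

```roc
expect
    # the stub always yields an empty openNodeStack
    result = parse []
    result == Err [{ position: { row: 1, column: 1 }, message: "Missing module expression" }]
```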
## Implementing `parseRecursive`
Much like `Tokenizer.evaluateOneChar`, the `parseRecursive` function is largely just a big `when` statement. The difference is that it has to “drive itself”. Where `evaluateOneChar` was called repeatedly on our behalf by the `Str.walkUtf8` function, `parseRecursive` will instead have to call itself. To guarantee that we are not violating the rules of tail recursion, I will avoid calling `parseRecursive` from the big `when` statement. Instead, we’ll assign the “next” set of tokens and state to variables and then match on those to make the recursive call:
```roc
parseRecursive : List Tokenizer.Token, ParseState -> ParseState
parseRecursive = \tokens, currentState ->
    (nextTokens, nextState) =
        when (tokens, currentState.openNodeStack) is
            _ -> (tokens |> List.dropFirst 1, currentState)

    when nextTokens is
        [] -> nextState
        _ -> parseRecursive nextTokens nextState
```
The first `when` expression is eventually going to be crazy long, like the one in `tokenize`. For now, it matches everything and sets `nextTokens` to the same list with the first token dropped. This ensures that the program will terminate, even if it isn’t actually parsing anything.
The second `when` expression performs the recursive check. If there are no more tokens, it returns the `nextState` as the base case. Otherwise, it makes the recursive call with the new list of tokens and state.
I think now is a good time to switch back to test-driven development. Let’s write a test case for the simplest possible set of input tokens (ignoring error cases):
```roc
expect
    # (module)
    tokens = [
        { position: { row: 1, column: 1 }, token: LParen },
        { position: { row: 1, column: 2 }, token: Name "module" },
        { position: { row: 1, column: 8 }, token: RParen },
    ]
    result = parse tokens

    result
    == Ok (
        SExpression {
            name: "module",
            position: { row: 1, column: 1 },
            children: [],
        }
    )
```
Tip: Rather than craft them from scratch, we are able to copy the input tokens from the output of the expectations in `Tokenizer.roc`.
Let’s start adding match arms! The first thing we want to match is an `LParen` followed by a `Name` token. Roc helpfully allows us to match on multiple elements in a `List`, so this can be done in an elegantly declarative fashion:
```roc
([{ position, token: LParen }, { token: Name name }, .. as rest], currentNodeStack) ->
    (
        rest,
        { currentState &
            openNodeStack: List.append
                currentNodeStack
                (
                    SExpression {
                        position,
                        name,
                        children: [],
                    }
                ),
        },
    )
```
Remember, we are matching on a tuple. This match arm matches any situation where the first element in the tuple is a list with at least two tokens, and those two elements match `LParen` and `Name` tokens. `Token`s are records with both a position and a token. We capture the position of the `LParen` so we can reuse it in the expression.
The return value of this match arm uses the “rest” of the tokens from the initial match as the new `tokens` value. The new state uses the `&` trick to make sure we propagate any errors. For this particular case, all we need to do is open a new `SExpression` node with empty children and add it to the stack of open nodes, to be used in the next recursive call.
Adding this single match arm is enough to make our test case succeed, because the `dropFirst` behaviour in our wildcard arm is silently dropping the `RParen`. Let’s modify the wildcard arm with marginally better error handling:
```roc
([], _) ->
    (
        [],
        {
            openNodeStack: [],
            errors: [{ position: { row: 0, column: 0 }, message: "No tokens provided" }],
        },
    )

([token, .. as rest], _) ->
    message = "Unexpected Token $(token |> Tokenizer.toStr)"

    (
        rest,
        { currentState &
            errors: List.append currentState.errors { position: token.position, message },
        },
    )
```
This relies on a new `toStr` function we need to add to the `Tokenizer` module:
```roc
toStr : Token -> Str
toStr = \token ->
    when token.token is
        LParen -> "("
        RParen -> ")"
        Name name -> "Name $(name)"
        Number number -> "Number $(number |> Num.toStr)"
        String string -> "String $(string)"
        Variable variable -> "Variable $(variable)"
```
Now `roc test` on our initial test case fails appropriately. According to `dbg`, the result is:
```plain
result = Err [{ message: "Unexpected Token )", position: { column: 8, row: 1 } }]
```
To fix this, we need to handle the `RParen` case in our still-young `when` statement. The behaviour of an `RParen` depends on the current stack of unclosed S-expressions:
- If the current stack has only one node on it, we just leave it as is and drop the token. This is the case our current test would hit.
- If the current stack has multiple nodes, representing nested S-expressions, we remove the top-most node and append it to the children of the node below it.
- If the current stack has no nodes, an `RParen` is an error.
Let’s solve the first case first:
```roc
([{ token: RParen }, .. as rest], [_singleRootNode]) ->
    (rest, currentState)
```
This change alone fixes our test case, so let’s add a new test case for the empty node case:
```roc
expect
    # )
    tokens = [
        { position: { row: 1, column: 1 }, token: RParen },
    ]
    result = parse tokens

    result
    == Err [
        {
            position: { row: 1, column: 1 },
            message: "Unmatched ')'",
        },
    ]
```
To make this case pass, we need to match an `RParen` when the stack is empty:
```roc
([{ position, token: RParen }, .. as rest], []) ->
    message = "Unmatched ')'"

    (
        rest,
        { currentState &
            errors: List.append currentState.errors { position, message },
        },
    )
```
However, I now realize that our single-state pattern is too naive. The solution is going to require yet another refactor that I think I’ll leave to the next article!