Build a Wasm Compiler in Roc - Part 4
In earlier articles, I introduced the project, wrote some Roc code to load an input file, and started implementing a Tokenizer.
This part takes a bit of a detour with a refactor to support rudimentary error reporting.
Reminder: You are reading content that took a great deal of effort to craft, compose, and debug. If you appreciate this work, consider supporting me on Patreon or GitHub.
Handling Errors during Tokenizing
Before we start adding more tokens to tokenize the Hello World
module, I want
to beef up our error handling a bit. There are a few reasons I don’t like the
basic dbg nextByte
we are currently using in the wildcard arm:
- The wildcard prevents the compiler from informing us if we missed any cases at compile time. Exhaustive pattern matching is an amazing feature. We should leverage it!
- Other than whitespace, I haven’t memorized the ASCII table and I’m tired of looking up those numbers. We should map the UTF-8 byte back to a character if we can.
- We don’t communicate anything back to the calling function about the success or failure of the operation.
- It doesn’t give us any feedback as to where tokenizing broke down.
Let’s address each of these in order.
Eliminate the wildcard
Let’s start with simply removing the wildcard and seeing what kind of errors we get:
This when does not cover all the possibilities:
...
Other possibilities include:
( Name _, _ ) (note the lack of an if clause)
( None, _ ) (note the lack of an if clause)
I would have to crash if I saw one of those! Add branches for them!
So we are missing the case where we are processing an existing Name
token and
encounter something that is neither a name character nor an RParen
. We’re also
missing cases where the current token is None
and we received something that
wasn’t a LParen
, whitespace, or name byte.
Some of those cases are legitimate syntax that we neglected to handle. For
example, a Name
token can end when we encounter whitespace instead of an
RParen
. Let’s quickly add a pattern arm for that:
(Name nameBytes, whitespaceByte) if Set.contains whitespaceBytes whitespaceByte ->
{
currentToken: None,
tokens: List.append currentState.tokens (Name nameBytes),
}
If we encounter whitespace after a name, we end the current Name
token, add it
to the list of tokens, and discard the whitespace. However, we cannot
compile (module )
without adding another match arm for the close paren when
the currentToken is None
:
(None, ')') ->
{
currentToken: None,
tokens: List.append currentState.tokens RParen,
}
We still need to handle the other wildcard possibilities, though. Some of them will be oversights in our existing code, but others would represent legitimate compiler errors in the input WAT source code. For now, let’s just replace the wildcard with something that explicitly handles the cases we know we have missed (effectively communicating that we missed them on purpose).
(Name _, unexpectedByte) | (None, unexpectedByte) ->
byteAsUtf8 =
[unexpectedByte]
|> Str.fromUtf8
|> Result.onErr \_ -> Ok (Num.toStr unexpectedByte)
|> Result.withDefault ""
dbg "Unexpected character '$(byteAsUtf8)'"
currentState
This arm uses a | character to handle two completely different patterns. One is when our currentToken is a Name but we received a byte that isn’t valid in a Name, and the other is when the currentToken is None and we received a byte that isn’t expected after None. The arm also goes to the trouble of attempting to map the byte back to a UTF-8 string. This is an operation that could fail (e.g. if the byte is part of a multi-byte Unicode codepoint), so as a fallback we return the string form of the byte number and unwrap the result with a withDefault whose default should never be used.
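As an aside, the decode-with-fallback idea is easy to sketch outside of Roc. Here is a rough Python equivalent; the byte_to_display name is my own, not something from the article’s code:

```python
def byte_to_display(b: int) -> str:
    # Try to decode the single byte as UTF-8, mirroring Str.fromUtf8.
    # If that fails (e.g. the byte belongs to a multi-byte sequence),
    # fall back to the byte's numeric value as a string.
    try:
        return bytes([b]).decode("utf-8")
    except UnicodeDecodeError:
        return str(b)
```

With this, an ASCII byte like '!' decodes to itself, while a lone byte from a multi-byte sequence such as 0xC3 falls back to its numeric form.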
Now, if we add a new token type, we’ll get compile-time errors for any cases we neglect to purposefully handle.
Together, these changes address the first two of the problems I listed.
Communicating Errors
The dbg
statement is a horrible way to communicate with the end user, if for no other reason than that Roc strips it out of release binaries! Instead, we should
collect errors and return them from our tokenize
function.
Let’s start with collecting errors. We can add an errors
field to our
initialState
record, but before we do that, let’s define a couple new types.
The types aren’t necessary to the compiler, but our record is about to get
complicated enough that the developer (i.e. me) will appreciate having it
documented:
Token : [LParen, RParen, Name (List U8)]
CurrentToken : [None, Name (List U8)]
Error : {
message : Str,
}
TokenizerState : {
currentToken : CurrentToken,
tokens : List Token,
errors : List Error,
}
We already have the first type (Token), but I wanted to include it to set up the discussion of CurrentToken. These are both closed tag unions, which means
only those explicit tags can be provided. There is some overlap between them,
but they represent different things; only a certain subset of the types of
Tokens can ever be set as a CurrentToken
(e.g. LParen
tokens are added
directly to the tokens
array and don’t get set as current).
Further, we never return a None
Tag from tokenize
, but this is a valid Tag
in the TokenizerState
. So these need to be two different types. When we add
new types of tokens that can go in both places, we need to update both types.
This sucks, but the compiler will let us know if we get it wrong.
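If it helps to see the overlap another way, here is a loose Python analogue using plain tag-name sets (my own illustration; Roc’s closed tag unions have no direct Python equivalent):

```python
# Tags that may appear in the finished token list.
TOKEN_TAGS = {"LParen", "RParen", "Name"}

# Tags that are valid for the in-progress token. "None" here is the
# tokenizer's own "nothing in progress" tag, not Python's None.
CURRENT_TOKEN_TAGS = {"None", "Name"}

# Name appears in both unions, but each also has tags the other lacks,
# which is why they have to stay two separate types.
OVERLAP = TOKEN_TAGS & CURRENT_TOKEN_TAGS
```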
The Error
record type will be extended later to include some location information,
but for now I’m just giving it a single field.
TokenizerState
is a record type that we previously inferred from the
structure of initialState
, but now we’re making it explicit. I also added an
errors
field. We can add a type to initialState
to guarantee that it has the
right type:
initialState : TokenizerState
initialState = {
currentToken: None,
tokens: [],
errors: [],
}
These changes alone will introduce compiler errors because we are not handling
errors
properly in evaluateOneChar
. The compiler is already aware that the
signature of evaluateOneChar
needs to handle errors
and doesn’t yet. But to
be extra clear for the developer’s sake, I’ll add a type to evaluateOneChar
as
well:
evaluateOneChar : TokenizerState, U8 -> TokenizerState
evaluateOneChar = \currentState, nextByte ->
To satisfy the compiler, we need to make sure all our arms use the &
trick to
ensure the entire current state (including errors) is copied over. For example,
the two arms that match parentheses when currentToken
is None
need to
change from the following:
(None, '(') ->
{
currentToken: None,
tokens: List.append currentState.tokens LParen,
}
(None, ')') ->
{
currentToken: None,
tokens: List.append currentState.tokens RParen,
}
…to the following:
(None, '(') ->
{ currentState &
tokens: List.append currentState.tokens LParen,
}
(None, ')') ->
{ currentState &
tokens: List.append currentState.tokens RParen,
}
The arm that matches a Name
against whitespace needs to do this:
(Name nameBytes, whitespaceByte) if Set.contains whitespaceBytes whitespaceByte ->
{ currentState &
currentToken: None,
tokens: List.append currentState.tokens (Name nameBytes),
}
And the one that matches a right parenthesis after name becomes:
(Name nameBytes, ')') ->
{ currentState &
currentToken: None,
tokens: List.concat currentState.tokens [Name nameBytes, RParen],
}
At this point our code compiles, but we aren’t actually collecting errors until
we replace the dbg
arm with something that actually updates the errors:
(Name _, unexpectedByte) | (None, unexpectedByte) ->
byteAsUtf8 =
[unexpectedByte]
|> Str.fromUtf8
|> Result.onErr \_ -> Ok (Num.toStr unexpectedByte)
|> Result.withDefault ""
{ currentState &
errors: List.append currentState.errors { message: "Unexpected character '$(byteAsUtf8)'" },
}
This compiles and runs, but if I run it on a file that contains invalid code
(e.g. (module!)
) it won’t actually report anything.
Reporting errors from tokenizing
To ensure my errors
array is getting data, I tried tossing a dbg finalState.errors
in the tokenize
function before we return it.
The errors
are there, so we just need to figure out what to do with them!
All we need to do is return a Result
instead of a List Token
from the tokenize
function. Results are basically just a union of the Tags Ok
and Err
, with
generic payloads. We can inspect the errors
field and return a failure if
that is the right behaviour:
tokenize : Str -> Result (List Token) (List Error)
tokenize = \input ->
finalState = Str.walkUtf8
input
initialState
evaluateOneChar
when finalState.errors is
[] -> Ok finalState.tokens
errors -> Err errors
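For readers following along in another language, the overall shape of the tokenizer can be sketched in Python. This is my own simplified mirror of the logic, not the article’s code: names are limited to lowercase ASCII letters, tagged tuples stand in for Roc tags, and ('Ok', …)/('Err', …) pairs stand in for Result:

```python
def evaluate_one_char(state, byte):
    # state is (current_name_bytes_or_None, tokens, errors),
    # loosely mirroring the Roc TokenizerState record.
    current, tokens, errors = state
    whitespace = {ord(" "), ord("\t"), ord("\n"), ord("\r")}
    name_bytes = set(range(ord("a"), ord("z") + 1))  # simplified name set
    if current is None:
        if byte == ord("("):
            return (None, tokens + ["LParen"], errors)
        if byte == ord(")"):
            return (None, tokens + ["RParen"], errors)
        if byte in whitespace:
            return state
        if byte in name_bytes:
            return ([byte], tokens, errors)
    else:  # currently inside a Name token
        if byte in name_bytes:
            return (current + [byte], tokens, errors)
        if byte in whitespace:
            return (None, tokens + [("Name", bytes(current))], errors)
        if byte == ord(")"):
            return (None, tokens + [("Name", bytes(current)), "RParen"], errors)
    # Equivalent of the old wildcard arm: record an error, keep scanning.
    message = "Unexpected character '%s'" % chr(byte)
    return (current, tokens, errors + [message])

def tokenize(source):
    # Fold the state over every byte, then inspect the collected errors.
    state = (None, [], [])
    for byte in source.encode("utf-8"):
        state = evaluate_one_char(state, byte)
    _, tokens, errors = state
    return ("Err", errors) if errors else ("Ok", tokens)
```

Like the Roc version at this point in the article, the sketch does no end-of-input handling, so a name is only flushed by whitespace or a closing paren.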
Since we are now exposing it, we should add Error
to the module exports at the
top of the file:
module [
Token,
Error,
tokenize,
]
Going back to the main.roc
file for a while, our compile
function currently
just passes whatever it receives from tokenize
to a dbg
statement. We
should instead make it return a Result
so the main
function can format the
errors nicely. This function will probably encounter other types of errors
later, so let’s create a new Tag named TokenizeError:
compile : Str -> Result (List U8) [TokenizeError (List Tokenizer.Error)]
compile = \input ->
when Tokenizer.tokenize input is
Ok tokens ->
dbg tokens
Ok (Str.toUtf8 "TODO: Compile Input")
Err errors -> Err (TokenizeError errors)
Note: This decision turned out to be a mistake and a later part will refactor the parsing code to always return the same type of Error. I left the mistake in place, partially due to laziness aka lack of time, but also because I want folks to see that real development is full of dead ends and backtracking like this.
Finally, we need to update the main
function to format the errors and output
them to Stdout
, if there are any, and only write the output file if there is a
valid response. Let’s start with a (not very good) formatErrors
function,
which will live in main.roc
for now:
formatErrors = \errors ->
Str.joinWith (errors |> List.map \error -> error.message) "\n"
To use this, I have to break up the beautiful pipeline we made in main
so
that I can assign the response from compile
to a Result
and then match on
it:
main : Task.Task {} [Exit I32 Str]
main =
when Arg.list! {} is
[] | [_] -> Task.err (Exit 1 "No input filename provided")
[_, _, _, ..] -> Task.err (Exit 2 "Too many arguments")
[_, filename] ->
compileResult =
(
filename
|> File.readUtf8
|> Task.mapErr \error ->
when error is
FileReadErr _ NotFound ->
Exit 3 "$(filename) does not exist"
FileReadErr _ _ ->
Exit 99 "Error reading $(filename)"
FileReadUtf8Err _ _ ->
Exit 4 "Unable to read UTF8 in $(filename)"
)!
|> compile
when compileResult is
Ok compiledBytes -> writeWithWasmExtension compiledBytes filename
Err (TokenizeError errors) ->
errors
|> formatErrors
|> Stderr.line
|> Task.mapErr \_ -> Exit 99 "System is failing"
Now when I try to compile a main.wat file that contains (module!), I get a slightly informative error message: Unexpected character '!'. I wonder what line that error occurred on (if you guessed line 1, good for you).
Tracking Error Locations
Let’s go back to Tokenizer.roc
and see what we can do about tracking the
location of errors. This will require two phases: First we’ll need to keep
track of where we are in the file as we parse it. We can do this by adding
current line number and column to the state. But we’ll have to adjust it
with each iteration. Second, we need to report the current location when we
store errors, by adding line and column fields to the Error record.
First, I’ll create a new Position
type (don’t forget to export it) and add
it to the TokenizerState
type:
Position : {
row : U32,
column : U32,
}
TokenizerState : {
currentToken : CurrentToken,
currentPosition : Position,
tokens : List Token,
errors : List Error,
}
We need to initialize the initialState
with an appropriate position:
initialState : TokenizerState
initialState = {
currentToken: None,
currentPosition: {
row: 1,
column: 0,
},
tokens: [],
errors: [],
}
Since we are already propagating the current state in all our when
branches using currentState
, the code will still compile, but
the position isn’t moving.
We could try to add position tracking to the big when statement, but
I feel like it will be easier to do it in a separate operation. The position
tracking mechanism only cares about the nextByte
, so mixing it up with all the m × n conditions created by (currentToken, nextByte) would be a big hassle.
First, I’ll rename currentState
to previousState
in the function signature:
evaluateOneChar : TokenizerState, U8 -> TokenizerState
evaluateOneChar = \previousState, nextByte ->
This allows me to add an intermediate currentState
variable that will have
the correct position without having to edit our gigantic (but nowhere near what
it needs to be!) when
expression:
currentState =
if nextByte == '\n' then
{ previousState &
currentPosition: {
row: previousState.currentPosition.row + 1,
column: 0,
},
}
else
{ previousState &
currentPosition: {
row: previousState.currentPosition.row,
column: previousState.currentPosition.column + 1,
},
}
This is assuming Unix-style line endings, but could be easily changed to handle
\r\n
and \r
as well.
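For example, a position-advancing helper that also treats \r\n and bare \r as line endings might look like this Python sketch (the advance function and its previous_byte parameter are my own framing, not code from the article):

```python
def advance(position, byte, previous_byte):
    # position is (row, column); rows start at 1 and the column counts
    # the byte just consumed, matching the article's initial state.
    row, column = position
    if byte == ord("\n"):
        if previous_byte == ord("\r"):
            # Second half of "\r\n": the "\r" already started a new line.
            return (row, column)
        return (row + 1, 0)
    if byte == ord("\r"):
        # Treat a bare "\r" as a newline; "\r\n" is handled above.
        return (row + 1, 0)
    return (row, column + 1)
```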
I added a dbg (nextByte, currentState.currentPosition)
to double check it was
incrementing correctly.
In fact it wasn’t incrementing correctly. My first attempt had some off-by-one errors. So don’t kick yourself over off-by-one errors; even a quarter century of experience won’t be enough to end them!
Showing the error positions to the user
Now that we are tracking positions, we just need to get that information back
to the user when there is an error. Start by adding a position
to the Error
type:
Error : {
message : Str,
position : Position,
}
The compiler will tell us we need to update the error record that we append to currentState in the arm that collects errors:
(Name _, unexpectedByte) | (None, unexpectedByte) ->
byteAsUtf8 =
[unexpectedByte]
|> Str.fromUtf8
|> Result.onErr \_ -> Ok (Num.toStr unexpectedByte)
|> Result.withDefault ""
{ currentState &
errors: List.append currentState.errors {
message: "Unexpected character '$(byteAsUtf8)'",
position: currentState.currentPosition,
},
}
The error handling code we already wrote in main.roc
should automatically
propagate this new information right to the formatErrors
function, so we can
just update the List.map
call in there:
formatErrors = \errors ->
Str.joinWith
(
errors
|> List.map \error ->
row = error.position.row |> Num.toStr
column = error.position.column |> Num.toStr
"$(row):$(column) $(error.message)"
)
"\n"
The positions are integers, so we need to call Num.toStr
before passing them
to the format string.
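The same formatting logic, as a quick Python sketch of an equivalent helper (my own naming):

```python
def format_errors(errors):
    # One "row:column message" line per error, joined with newlines.
    return "\n".join(
        "%d:%d %s" % (e["position"]["row"], e["position"]["column"], e["message"])
        for e in errors
    )
```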
Now if I run our tokenizer against the following code, I get the appropriate position in the error:
❯ cat main.wat
(
module
!
)
❯ roc dev -- main.wat
3:3 Unexpected character '!'
Attaching positions to tokens
I can already tell we’re going to wish each token had a position attached to it so we can report similar information when we encounter errors during the parsing phase. I wasn’t sure I wanted to go to that level of complexity right now, but it turned out to be pretty easy.
Note: This decision causes a lot of pain later in the series, but no real compiler can get away with not reporting error locations, so I doubled down on it.
First, we need to change our two Token
types from a tag union to a record
that contains a position and tag union:
Token : {
position : Position,
token : [LParen, RParen, Name (List U8)],
}
CurrentToken : {
position : Position,
token : [None, Name (List U8)],
}
Now we just follow the Roc compiler errors to ensure we are returning the
position with every token. First, update initialState
:
initialState = {
currentToken: {
position: {
row: 1,
column: 0,
},
token: None,
},
currentPosition: {
row: 1,
column: 0,
},
tokens: [],
errors: [],
}
Note that currentPosition
is tracked separately from the position of the
currentToken
. This is because in a token that is more than one character long
(such as Name
), the currentToken.position
will point to the start of the
token while currentPosition
will point to whichever character we are
evaluating next. We could probably get away with just keeping the
currentToken.position
and calculating the currentPosition
based on the
length of the currentToken
, but storing the two values separately feels
safer.
Next, we need to update our big when
statement to match on the new token
field of the currentToken
instead of having it at the top level:
when (currentState.currentToken.token, nextByte) is
We also need to update all of our branch arms so that when they assign
to either currentToken
or tokens
, they are getting a record
instead of a tag. I won’t show all of these, but here are a few
examples:
The left and right parenthesis arms need to attach the current position to the token when they construct it and append it to the list:
(None, '(') ->
{ currentState &
tokens: List.append currentState.tokens {
position: currentState.currentPosition,
token: LParen,
},
}
The arm that starts a Name
token needs to set the position of currentToken
:
(None, nameByte) if Set.contains validNameStartBytes nameByte ->
{ currentState &
currentToken: {
position: currentState.currentPosition,
token: Name [nameByte],
},
}
But the arm that updates the Name
token should use the &
trick to copy
position from the currentToken
that was set in the above arm. Roc complained
at me when I tried to use currentState.currentToken &
, so I had to assign
it to another variable:
(Name nameBytes, nameByte) if Set.contains validNameBytes nameByte ->
currentToken = currentState.currentToken
{ currentState &
currentToken: { currentToken &
token: Name (List.append nameBytes nameByte),
},
}
The two arms that copy the currentToken
into the tokens
array need to
construct the record format:
(Name nameBytes, ')') ->
{ currentState &
currentToken: {
position: currentState.currentToken.position,
token: None,
},
tokens: List.concat currentState.tokens [
{
position: currentState.currentToken.position,
token: Name nameBytes,
},
{
position: currentState.currentPosition,
token: RParen,
},
],
}
Pay attention to the above example. I had to update three positions here. The
new None
currentToken
needs to have the current position. And we are adding
two tokens to the array at different positions. The first, which holds the
Name
, copies the position from the currentToken
and the second, which holds
the RParen
, copies the currentPosition
.
After making these changes, I can compile a valid wasm file that looks like this:
(
module
)
And the existing dbg
line in the compile
function spits out tokens that have been
annotated with their position:
[main.roc:16] tokens = [{position: {column: 1, row: 1}, token: LParen}, {position: {column: 3, row: 2}, token: (Name [109, 111, 100, 117, 108, 101])}, {position: {column: 1, row: 3}, token: RParen}]
This was an interesting refactor; the Roc compiler made it dead simple, much as Rust would have. But unlike Rust, it did the job extremely quickly. I dislike Rust because it’s so hard on developer velocity; Roc lets me create static binaries with the same level of certainty, but without waiting around for stuff and without fighting the borrow checker. I had never used Roc before this project, but other than a few obtuse compiler errors that I’m sure will be fixed, development in this language has been shockingly smooth.
Crikey, I thought we were going to get further today, but I think I better leave the remaining tokens (and testing!) for the next article.