Build a Wasm Compiler in Roc - Part 5
In earlier articles, I introduced this compiler project, wrote some Roc code to load an input file, and started implementing a tokenizer with error handling.
I think I need to admit that I have absolutely no clue how to estimate how long a blog article is going to be. I thought “build a compiler” would be one article. And then I thought “build a tokenizer” would be one article. sigh
I swear we’ll be done with tokenizing at the end of this post. But first we’ll take a detour to have a look at Roc’s very elegant built-in testing syntax.
Reminder: You are reading content that took a great deal of effort to craft, compose, and debug. If you appreciate this work, consider supporting me on Patreon or GitHub.
Testing In Roc
I thought it was time to add more types to the parser, but I decided I need some unit tests instead. Test expectations are built into Roc, and you can put them anywhere. It seems like tradition is to include them in the same file as the code they are testing, probably because that means you can access internal state as well as the public definition. My preference has always been to put my tests in separate folders, but that works better with highly dynamic languages than with compiled languages.
So far, I’ve been testing by running `roc dev -- main.wat` and tweaking `main.wat` to have contents that put the compiler in situations I want to test. But the `compile` function is purely functional with no IO or side effects, so I can so easily pass any string into it and assert on the output without worrying about other files.
For example, let’s confirm that the basics are still working after that big refactor:
expect
tokens = tokenize "(module)"
tokens
== Ok [
{ position: { row: 1, column: 1 }, token: LParen },
{ position: { row: 1, column: 2 }, token: Name ("module" |> Str.toUtf8) },
{ position: { row: 1, column: 8 }, token: RParen },
]
That’s it! The `expect` expression must return a Boolean value, and the test passes or fails based on that boolean. In Roc, equality checks are always structural, so we just manually construct the expected set of tags and assert on the output. If you aren’t sure what output you should be expecting, put a `dbg tokens` in there and inspect whatever you actually got. If it’s correct, copy it into the expectation!
Here’s another one:
expect
result = tokenize "(module!)"
result == Err [{ position: { row: 1, column: 8 }, message: "Unexpected character '!'" }]
What about multi-line?
expect
tokens = tokenize "(\n module\n)"
tokens
== Ok [
{ position: { row: 1, column: 1 }, token: LParen },
{ position: { row: 2, column: 3 }, token: Name ("module" |> Str.toUtf8) },
{ position: { row: 3, column: 1 }, token: RParen },
]
expect
result = tokenize "(!\n module!\n)"
result
== Err [
{ position: { row: 1, column: 2 }, message: "Unexpected character '!'" },
{ position: { row: 2, column: 9 }, message: "Unexpected character '!'" },
]
This is all very pretty in principle, but I should mention a few caveats:
- If you don’t separate the result and the comparison into their own variables, you don’t get any output to compare the actual and expected values (see the sketch after this list). This is planned to be fixed in Roc, but the issue has been open for a while.
- About half the time when I get my expectations wrong, instead of a test failure I get a Rust-level panic. This isn’t very friendly to debug! (Ironically, this seems more likely to happen when using `dbg`.)
- I miss being able to “name” tests so I can more easily identify which one went wrong. The output (when it doesn’t panic) includes line numbers, which should be good enough, but a name would be more friendly.
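To illustrate that first caveat, here’s the same kind of test written the less helpful way (a hypothetical example), with the comparison inlined:

expect tokenize "(module)" == Ok []

When this fails, Roc reports only that the expectation failed; you don’t see the actual and expected values side by side, which is exactly the output you want when debugging a tokenizer.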
One last thing: `expect` statements can be placed inline in functions, and they will be executed every time you run `roc dev`. This is a good way to assert that code is doing what you expect. It just outputs the failure and doesn’t crash the program if the expectation isn’t met. In production builds, the `expect` lines are stripped out altogether.
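For instance, an inline expect in a function body might look like this (a minimal sketch; `addOne` is a made-up function, not part of the compiler):

addOne : U32 -> U32
addOne = \n ->
    result = n + 1
    # Checked on every `roc dev` run; stripped from optimized builds.
    expect result > n
    result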
`roc test` doesn’t currently have a “watch” mode, so this is a good time to reach for entr (available in Homebrew and other package managers). I use `entr` a lot because many of the little-known languages I play around with don’t have built-in watching tools. I use the command `fd .roc | entr roc test` to rerun `roc test` every time I save a file. (If you don’t have `fd`, it’s just a variation of `find`, but I’m not going to guess what the correct incantation of `find` is. Ask an AI.)
More Tokens
Now that we know how to do testing, we can start using test driven development to add new types of tokens. Let’s tackle this valid WAT program next:
(
module
(memory 1)
)
This allocates one page (64 KiB) of heap memory in a linear buffer that the program can use later. It doesn’t currently use it, but it can. Running `roc dev -- main.wat` on this file indicates an error at line 3, column 13, because the tokenizer doesn’t know what to do with a `1` that isn’t part of a name. But we don’t need `roc dev -- main.wat` anymore. Let’s move this to an `expect` in `Tokenizer.roc`:
expect
result = tokenize
"""
(
module
(memory 1)
)
"""
dbg result
Result.isOk result
This test introduces Roc’s multi-line strings, which use the same triple-quote syntax as Python, except Roc also helpfully strips off common prefix indentation. For now, we just check whether the Result is `Ok`, which it isn’t. Roc fails the test and outputs the `result` variable for us to introspect.
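If I have the stripping semantics right, the amount of indentation removed is determined by the closing quotes, so relative indentation inside the string is preserved. A small assumed example:

expect
    s =
        """
        a
          b
        """
    s == "a\n  b"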
It looks like we need a new `Number` token that will behave similarly to `Name`, but for integers. I’m not going to code this to the WAT spec, so let’s pretend “numbers” are just plain non-negative integers. I’m ignoring negatives, floating point, hex, and other notations.
I don’t expect to hand-code the exact result correctly on my first attempt (see previous comment about off-by-one errors), but I’m going to set the test up like this, for starters:
expect
result = tokenize
"""
(
module
(memory 10)
)
"""
dbg result
result
== Ok [
{ position: { row: 1, column: 1 }, token: LParen },
{ position: { row: 2, column: 3 }, token: Name ("module" |> Str.toUtf8) },
{ position: { row: 3, column: 3 }, token: LParen },
{ position: { row: 3, column: 4 }, token: Name ("memory" |> Str.toUtf8) },
{ position: { row: 3, column: 8 }, token: Number 10 },
{ position: { row: 3, column: 10 }, token: RParen },
{ position: { row: 4, column: 1 }, token: RParen },
]
Now we just need to chase down test (and compiler) errors. The first error is upset because our test result has a `Number` in it, but it needs to match the `Token` type. I can solve this by adding `Number` to both `Token` and `CurrentToken`. But there’s a catch:
Token : {
position : Position,
token : [LParen, RParen, Name (List U8), Number U32],
}
CurrentToken : {
position : Position,
token : [None, Name (List U8), Number (List U8)],
}
The final token holds the number converted to an integer, but while we are collecting it, we’ll store the digits as a `List U8` because we don’t know if we have all of them yet.
Now it’s crashing on the big-ass `when` statement not covering the new token type. Before I address that, let’s split up the `validNameBytes` set so that we have a separate set for integer digits.
validDigitBytes =
Set.fromList [
'0',
'1',
'2',
'3',
'4',
'5',
'6',
'7',
'8',
'9',
]
validNameBytes =
validNameStartBytes
|> Set.union validDigitBytes
|> Set.union
(
Set.fromList [
'.',
'_',
]
)
From the tokenizer’s point of view, `Number`s are collected almost identically to `Name`s, except we need to look things up in this new set instead of `validNameBytes`. I don’t like the number of arms this is adding to our `when` expression, but when I tried to abstract out some common pieces, it just got messier. For now, I’m going to keep the `when` statement as is, but we may be looking at a refactor later.
(None, numberByte) if Set.contains validDigitBytes numberByte ->
{ currentState &
currentToken: {
position: currentState.currentPosition,
token: Number [numberByte],
},
}
If we currently don’t have a token and we encounter a digit, we need to start a number token and store it in `currentToken`.
Next we have the case where we need to add a digit to an existing number token:
(Number numberBytes, numberByte) if Set.contains validDigitBytes numberByte ->
currentToken = currentState.currentToken
{ currentState &
currentToken: { currentToken &
token: Number (List.append numberBytes numberByte),
},
}
We also need the two “end number” states, which arise if we encounter whitespace or an `RParen`. These won’t be a straight-up copy-paste of the `Name` cases because we also need to convert the list of number bytes to a `U32`.
Note: Now that I think of it, we should probably be doing something similar with the name bytes, converting them back to a `Str` before emitting the token. I don’t want to do that right now while I have this `Number` thing in the air, but I’ll make that change later.
(Number numberBytes, whitespaceByte) if Set.contains whitespaceBytes whitespaceByte ->
{ currentState &
currentToken: {
position: currentState.currentPosition,
token: None,
},
tokens: List.append currentState.tokens {
position: currentState.currentToken.position,
token: Number (numberBytes |> digitsToNumber),
},
}
(Number numberBytes, ')') ->
{ currentState &
currentToken: {
position: currentState.currentToken.position,
token: None,
},
tokens: List.concat currentState.tokens [
{
position: currentState.currentToken.position,
token: Number (numberBytes |> digitsToNumber),
},
{
position: currentState.currentPosition,
token: RParen,
},
],
}
Both of these cases rely on a `digitsToNumber` helper function that I haven’t defined yet. We have a sequence of bytes that we know are ASCII digit characters. The two ways I can think of to convert this to an integer are:
- Convert the bytes to a string and then the string to a number using standard library utilities (sketched after this list). Both of those operations return Results because they can fail, but we can ignore the error cases because we know the input bytes are legitimate digits.
- Convert each `U8` byte to a `U32` and multiply the accumulator by 10 for each place value. This has the potential for overflow, but I’m going to file that under “wait until a user reports it as a bug.” That’s only a good place to file things when you are working on projects that aren’t intended to have users.
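Here is roughly what that first option would look like (a sketch, not the code I ended up using; `digitsToNumberViaStr` is a hypothetical name):

digitsToNumberViaStr : List U8 -> U32
digitsToNumberViaStr = \digits ->
    digits
    |> Str.fromUtf8 # Err if the bytes aren't valid UTF-8
    |> Result.try Str.toU32 # Err if the string isn't a valid number
    |> Result.withDefault 0 # safe: the tokenizer only passes ASCII digits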
This is where I uncovered another problem with Roc’s youthful test framework. I thought I’d put this little function in a separate file so I could unit test it, but `roc test Util.roc` crashes with a Rust-level panic (unwrapping a `None` option). Apparently the presence of other broken files in the directory causes weird things to happen when trying to test unbroken files. When I moved my `Util.roc` to its own project folder, the test command ran fine. So I ended up doing that until I was confident the function was doing what I wanted, and then I just copied it into `Tokenizer.roc`.
digitsToNumber : List U8 -> U32
digitsToNumber = \digits ->
digits |> List.walk 0u32 \state, digit ->
(state * 10u32) + ((digit - 48) |> Num.toU32)
expect digitsToNumber [] == 0
expect digitsToNumber [48u8] == 0
expect digitsToNumber [49u8] == 1
expect digitsToNumber [49u8, 48u8] == 10
expect digitsToNumber [49u8, 52u8, 55u8] == 147
expect digitsToNumber [48u8, 48u8, 49u8, 52u8, 55u8] == 147
Roc doesn’t have loops, so we call the `walk` function with a reducer pattern. People sometimes struggle with reducers, but I find them fairly straightforward; each invocation of the function is equivalent to one iteration of the “loop” (looping constructs are essentially functions instead of statements). We initialize the state to `0u32`, and each iteration/invocation returns a new state by doing the following (a worked trace follows the list):
- multiplies the existing state by 10
- converts the next digit from its ASCII value to its integer value by subtracting 48 (because '0' is 48 in the ASCII table)
- explicitly converts the resulting `U8` to a `U32`; this is a lossless operation since `U32` is bigger than `U8`
- adds the converted digit to the previously 10x’d state
- returns the result as the new state to be passed to the next iteration
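For example, here’s a trace of `digitsToNumber [49u8, 52u8, 55u8]` (the ASCII bytes for "147"):

# state starts at 0u32
# step 1: (0 * 10) + (49 - 48) = 1
# step 2: (1 * 10) + (52 - 48) = 14
# step 3: (14 * 10) + (55 - 48) = 147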
That’s a surprising amount of work for effectively two lines of code. It even does the right thing with leading zeros. The input digits are unsigned (non-negative), so we don’t have to worry about the math when negatives are involved.
Now we just need to add `Number` to our “wildcard” case. The body can stay the same:
(Number _, unexpectedByte) | (Name _, unexpectedByte) | (None, unexpectedByte) ->
Note: if we wanted to, we could split these out into separate cases so we could give more informative “I was expecting a xxx or yyy but got a zzz” style of messages. I’m not going to bother with that. There are enough cases here!
The unit test I wrote earlier did indeed have a couple errors in the positions. I compared the real output of the failing test to what I had estimated and agreed that my numbers were off. The corrected test looks like this:
expect
result = tokenize
"""
(
module
(memory 10)
)
"""
dbg result
result
== Ok [
{ position: { row: 1, column: 1 }, token: LParen },
{ position: { row: 2, column: 3 }, token: Name ("module" |> Str.toUtf8) },
{ position: { row: 3, column: 3 }, token: LParen },
{ position: { row: 3, column: 4 }, token: Name ("memory" |> Str.toUtf8) },
{ position: { row: 3, column: 11 }, token: Number 10 },
{ position: { row: 3, column: 13 }, token: RParen },
{ position: { row: 4, column: 1 }, token: RParen },
]
Converting bytes to names
I mentioned earlier that I realized output tokens should have `Name`s as strings instead of lists of bytes. This is somewhat nuanced because converting bytes to strings can fail if they aren’t valid UTF-8. When that happens, we need to return an error instead of a token. This is complicated by the fact that our code will need to do the conversion in (at least) two match arms. My solution is a new top-level function that takes the current state and the `Name` bytes and returns the updated state.
The first step is to change the type of `Token` so that the `Name` tag hosts a `Str` as a payload (the `CurrentToken` continues to hold a list of `U8`s):
Token : {
position : Position,
token : [LParen, RParen, Name Str, Number U32],
}
Now I need a function that takes the current state and name bytes and returns the next state, updating either `errors` or `tokens`, depending on whether the name bytes can be converted to a string:
endString : TokenizerState, List U8 -> TokenizerState
endString = \currentState, nameBytes ->
when Str.fromUtf8 nameBytes is
Err _ ->
bytesDisplay = nameBytes |> List.map Num.toStr |> Str.joinWith ", "
{ currentState &
errors: List.append currentState.errors {
message: "Unexpected Utf8 byte in: $(bytesDisplay)",
position: currentState.currentPosition,
},
}
Ok string ->
{ currentState &
currentToken: {
position: currentState.currentToken.position,
token: None,
},
tokens: List.append currentState.tokens {
position: currentState.currentToken.position,
token: Name string,
},
}
There isn’t really anything new in this snippet, in spite of its length!
Now I need to update the two `when` arms that terminate a name in the main pattern match:
(Name nameBytes, whitespaceByte) if Set.contains whitespaceBytes whitespaceByte ->
endString currentState nameBytes
(Name nameBytes, ')') ->
withStrState = endString currentState nameBytes
{ withStrState &
tokens: List.append withStrState.tokens {
position: withStrState.currentPosition,
token: RParen,
},
}
The whitespace case becomes trivially simple: we just return the result of calling `endString` directly. The `)` case is slightly more complicated because we need to store the `endString` result as intermediate state and construct a new result that appends the `RParen` token.
The code is still failing to compile at this point because my tests have `Name (List U8)` tokens, but the result is now a `Name Str`. Fixing that is actually a simplification; we need to change the `expect` expressions that contain lines like this:
{ position: { row: 1, column: 2 }, token: Name ("module" |> Str.toUtf8) },
to
{ position: { row: 1, column: 2 }, token: Name "module" },
Now it compiles and the tests pass.
Technically, the tests didn’t pass for me the first time. There was a bug in one of my match arms that I fixed before showing it to you. I hate it when authors make it seem like coding always goes right. The whole point of using unit tests is to catch mistakes. If we didn’t make mistakes, we wouldn’t need unit tests! That said, I’m not going to force you to type all my mistakes all over again.
I realized I don’t have a test for the “number with whitespace” arm yet, so I added one:
expect
result = tokenize
"""
(
module
(memory 10 )
)
"""
result
== Ok [
{ position: { row: 1, column: 1 }, token: LParen },
{ position: { row: 2, column: 3 }, token: Name "module" },
{ position: { row: 3, column: 3 }, token: LParen },
{ position: { row: 3, column: 4 }, token: Name "memory" },
{ position: { row: 3, column: 11 }, token: Number 10 },
{ position: { row: 3, column: 14 }, token: RParen },
{ position: { row: 4, column: 1 }, token: RParen },
]
There are other edge cases, I’m sure, but I’m not going to worry about them right now. Let’s see what it takes to add a string token type!
A string token
Strings are normally tricky because you need to manage escapes, but for expediency, I’m going to assume no escapes, mostly because I want to move on from tokenizing without introducing yet another post!
I’m starting to think I shouldn’t have picked quite so complicated an example as “hello world in WAT”. Compilers may be simpler than folks think, but apparently they ARE big.
First, I’ll add the `String` token to both the `Token` and `CurrentToken` types:
Token : {
position : Position,
token : [LParen, RParen, Name Str, Number U32, String Str],
}
CurrentToken : {
position : Position,
token : [None, Name (List U8), Number (List U8), String (List U8)],
}
Next, I’ll make a simple test case, again acknowledging it might not be accurate:
expect
result = tokenize
"""
(module
(export "memory" (memory 0))
)
"""
result
== Ok [
{ position: { row: 1, column: 1 }, token: LParen },
{ position: { row: 1, column: 2 }, token: Name "module" },
{ position: { row: 2, column: 3 }, token: LParen },
{ position: { row: 2, column: 4 }, token: Name "export" },
{ position: { row: 2, column: 11 }, token: String "memory" },
{ position: { row: 2, column: 20 }, token: LParen },
{ position: { row: 2, column: 21 }, token: Name "memory" },
{ position: { row: 2, column: 28 }, token: Number 0 },
{ position: { row: 2, column: 29 }, token: RParen },
{ position: { row: 2, column: 30 }, token: RParen },
{ position: { row: 3, column: 1 }, token: RParen },
]
Notice that we are discarding the quotation marks that delimit the string and only storing the string contents in the emitted token.
We’ll need a match arm that matches on a `"` when the current token is `None` to start a string:
(None, '"') ->
{ currentState &
currentToken: {
position: currentState.currentPosition,
token: String [],
},
}
Strings are a bit different from what we’ve seen before because all bytes get appended to the string unless we encounter the ending `"` mark. To cover this, we’ll put the `"` handler before the “other character” handler:
(String stringBytes, '"') ->
crash "TODO: Handle string terminus"
(String stringBytes, stringByte) ->
currentToken = currentState.currentToken
{ currentState &
currentToken: { currentToken &
token: String (List.append stringBytes stringByte),
},
}
There is no error case when processing a string, as all (non-quote) bytes are validly appended to the List. There could be an error if those bytes are not valid UTF-8, but that is encountered later.
Replacing the `crash` arm above is going to require a bit of legwork. I need to convert the bytes I have to a string. I already have code to do that in the `endString` function we just wrote, and it’d be a shame to let that work go to waste. But `endString` returns tokens that are `Name` tags, and we need one that returns a `String` instead.
The solution is to add a third parameter to our `endString` function. It will be a function that accepts a `Str` as input and returns an appropriate tag. But in order to return an appropriate tag, we’ll need to define a new `TokenTag` type in our type definitions:
TokenTag : [LParen, RParen, Name Str, Number U32, String Str]
Token : {
position : Position,
token : TokenTag,
}
All I did was move the tag union out of `Token` into its own top-level definition.
The signature of `endString` will now take a callback function as its third argument:
endString : TokenizerState, List U8, (Str -> TokenTag) -> TokenizerState
endString = \currentState, nameBytes, makeToken ->
Then we can replace the `Name string` with `makeToken string` in the success case:
Ok string ->
{ currentState &
currentToken: {
position: currentState.currentToken.position,
token: None,
},
tokens: List.append currentState.tokens {
position: currentState.currentToken.position,
token: makeToken string,
},
}
Now we need to update the call sites for `endString`. We can use the fact that Roc treats a tag as a function that constructs the tag when it is passed as a callback. So the call sites are simply:
endString currentState nameBytes Name
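To see why that works: applying a tag to an argument constructs a value, so a bare tag can be passed anywhere a function is expected. A quick illustrative expect (not part of the compiler’s code):

expect List.map ["a", "b"] Name == [Name "a", Name "b"]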
Next, update the `crash` arm to call the new `endString` implementation:
(String stringBytes, '"') ->
endString currentState stringBytes String
Looking at the hello example, I think the only syntax we haven’t covered is referring to variable names, which start with a `$`.
Variable tokens are old hat
It’s possible we could just add `$` to the `validNameStartBytes` set and place variables in the `Name` token. But I feel like it deserves its own token. Shouldn’t take too long!
As usual, we first update the types:
TokenTag : [LParen, RParen, Name Str, Number U32, String Str, Variable Str]
Token : {
position : Position,
token : TokenTag,
}
CurrentToken : {
position : Position,
token : [None, Name (List U8), Number (List U8), String (List U8), Variable (List U8)],
}
Then we add a unit test:
expect
result = tokenize
"""
(module
(func $main)
)
"""
result
== Ok [
{ position: { row: 1, column: 1 }, token: LParen },
{ position: { row: 1, column: 2 }, token: Name "module" },
{ position: { row: 2, column: 3 }, token: LParen },
{ position: { row: 2, column: 4 }, token: Name "func" },
{ position: { row: 2, column: 9 }, token: Variable "main" },
{ position: { row: 2, column: 14 }, token: RParen },
{ position: { row: 3, column: 1 }, token: RParen },
]
I actually didn’t expect this example to be valid WAT syntax, but it seems to build and run with `wasm-tools` and `wasmtime`. Notice that I’ve stripped the `$` from the `Variable` token. As far as I know, all WAT variable identifiers have `$`, so its existence is implied by the fact that we are in a `Variable` token.
A variable can only be started if the `currentToken` is currently `None`:
(None, '$') ->
{ currentState &
currentToken: {
position: currentState.currentPosition,
token: Variable [],
},
}
There is enough symmetry between all the `None` cases in our match arms that I think it would be good to abstract them into a function (something like `endString`), passing a token constructor as an argument. I’ll leave that as an exercise for the reader, though one possible shape is sketched below!
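One possible shape for that helper (a sketch; `startToken` is a hypothetical name, not in the article’s code):

# Start collecting a new token from the None state.
startToken = \currentState, tokenTag ->
    { currentState &
        currentToken: {
            position: currentState.currentPosition,
            token: tokenTag,
        },
    }

The `None` arms would then collapse to one-liners such as `(None, '$') -> startToken currentState (Variable [])`.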
We’ll need the case to extend the variable with additional characters and two cases to end the variable. All three are virtually identical to the `Name` cases:
(Variable variableBytes, variableByte) if Set.contains validNameBytes variableByte ->
currentToken = currentState.currentToken
{ currentState &
currentToken: { currentToken &
token: Variable (List.append variableBytes variableByte),
},
}
(Variable variableBytes, whitespaceByte) if Set.contains whitespaceBytes whitespaceByte ->
endString currentState variableBytes Variable
(Variable variableBytes, ')') ->
withStrState = endString currentState variableBytes Variable
{ withStrState &
tokens: List.append withStrState.tokens {
position: withStrState.currentPosition,
token: RParen,
},
}
Finally, we need to add a `(Variable _, unexpectedByte) |` to the error-catching arm. I must be getting better at estimating where positions are in the unit test, because my test passed on the first try this time!
At this point, I was able to run `roc dev -- hello.wat` with no errors. The output is about 70 lines of tokens, and I didn’t look too closely to ensure they were all the correct tokens. But I’m feeling pretty confident in my unit tests, so I’m going to cap this section off with one more refactor.
Extracting Common Types
I started writing the parsing section and immediately got involved in some messiness around positions and errors. Currently, both of these types are stored in the `Tokenizer` module, but they are clearly going to be needed during parsing as well. I got it working with both types, but I found it was teaching bad practices!
So let’s create a new `Common.roc` file and move the `Position` and `Error` types into it:
module [Position, Error, formatErrors]
Position : {
row : U32,
column : U32,
}
Error : {
message : Str,
position : Position,
}
When removing these from `Tokenizer.roc`, don’t forget to also remove them from the exports list at the top of the module.
Now use the command `roc check` to see all the errors we need to fix. Start by importing `Common` at the top of `Tokenizer.roc`. Then replace the two instances of `Error` with `Common.Error` and `Position` with `Common.Position` in that file.
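After the import, the tokenizer’s types would read something like this (a sketch, assuming the `module`/`import` syntax used in `Common.roc` above):

import Common

Token : {
    position : Common.Position,
    token : TokenTag,
}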
You’ll also need to import `Common` in the main file. Now simplify the signature of the `compile` function so we aren’t wrapping the errors in a `TokenizeError` tag:
compile : Str -> Result (List U8) (List Common.Error)
You’ll also get to simplify the errors branch to just return the `Err` directly (don’t worry, we’ll be simplifying this further soon):
Err errors -> Err errors
Next, simplify the `Err` branch of the `compileResult` pattern in `main`:
Err errors ->
errors
|> formatErrors
|> Stderr.line
|> Task.mapErr \_ -> Exit 99 "System is failing"
Finally, for a little housekeeping, we can move the `formatErrors` function into `Common.roc` as well:
formatErrors : List Error -> Str
formatErrors = \errors ->
Str.joinWith
(
errors
|> List.map \error ->
row = error.position.row |> Num.toStr
column = error.position.column |> Num.toStr
"$(row):$(column) $(error.message)"
)
"\n"
The single call site in `main` will need to be updated to match:
|> Common.formatErrors
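As a quick sanity check of the error formatting, here’s a hypothetical expect (assuming the `Error` shape defined above):

expect
    errors = [{ message: "Unexpected character '!'", position: { row: 1, column: 2 } }]
    Common.formatErrors errors == "1:2 Unexpected character '!'"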
And now. NOW, we are finally done with tokenizing. I wonder how much fun parsing will be in the next article?