In earlier articles, I introduced this compiler project, wrote some Roc code to load an input file, and started implementing a tokenizer with error handling.

I think I need to admit that I have absolutely no clue how to estimate how long a blog article is going to be. I thought “build a compiler” would be one article. And then I thought “build a tokenizer” would be one article. sigh

I swear we’ll be done with tokenizing at the end of this post. But first we’ll take a detour to have a look at Roc’s very elegant built-in testing syntax.

Testing In Roc

I thought it was time to add more types to the parser, but I decided I needed some unit tests first. Test expectations are built into Roc, and you can put them anywhere. The tradition seems to be to include them in the same file as the code they are testing, probably because that gives tests access to internal state as well as the public definitions. My preference has always been to put my tests in separate folders, but that works better with highly dynamic languages than with compiled ones.

So far, I’ve been testing by running roc dev -- main.wat and tweaking main.wat to put the compiler in situations I want to test. But the compile function is purely functional, with no IO or side effects, so I can easily pass any string into it and assert on the output without worrying about other files.

For example, let’s confirm that the basics are still working after that big refactor:

expect
    tokens = tokenize "(module)"

    tokens
    == Ok [
        { position: { row: 1, column: 1 }, token: LParen },
        { position: { row: 1, column: 2 }, token: Name ("module" |> Str.toUtf8) },
        { position: { row: 1, column: 8 }, token: RParen },
    ]

That’s it! The expect expression must return a Boolean value, and the test passes or fails based on that value. In Roc, equality checks are always structural, so we just manually construct the expected list of tokens and compare it to the output. If you aren’t sure what output you should be expecting, put a dbg tokens in there and inspect whatever you actually got. If it’s correct, copy it into the expectation!

Here’s another one:

expect
    result = tokenize "(module!)"

    result == Err [{ position: { row: 1, column: 8 }, message: "Unexpected character '!'" }]

What about multi-line?

expect
    tokens = tokenize "(\n  module\n)"

    tokens
    == Ok [
        { position: { row: 1, column: 1 }, token: LParen },
        { position: { row: 2, column: 3 }, token: Name ("module" |> Str.toUtf8) },
        { position: { row: 3, column: 1 }, token: RParen },
    ]

expect
    result = tokenize "(!\n  module!\n)"

    result
    == Err [
        { position: { row: 1, column: 2 }, message: "Unexpected character '!'" },
        { position: { row: 2, column: 9 }, message: "Unexpected character '!'" },
    ]

This is all very pretty in principle, but I should mention a few caveats:

  • If you don’t separate the result and the comparison into their own variables, you don’t get any output comparing the actual and expected values (see the example after this list). A fix is planned in Roc, but the issue has been open for a while.
  • About half the time when I get my expectations wrong, instead of a test failure, I get a Rust-level panic. This isn’t very friendly to debug! (Ironically, this seems to be more likely to happen when using dbg.)
  • I miss being able to “name” tests so I can more easily identify which one went wrong. The output (when it doesn’t panic) includes line numbers, which should be good enough, but a name would be more friendly.
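
To illustrate that first caveat, here’s a minimal sketch (assuming the tokenizer handles a lone open paren): binding the result to a variable before comparing is what gets the actual value printed on failure. Inlining the whole comparison as one expression loses that output.

expect
    tokens = tokenize "("

    tokens == Ok [{ position: { row: 1, column: 1 }, token: LParen }]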

One last thing: expect statements can also be placed inline in functions, and they will be evaluated every time you run roc dev. This is a good way to assert that code is doing what you expect: a failing inline expect outputs the failure but doesn’t crash the program. In production builds, the expect lines are stripped out altogether.
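
For example, here’s a minimal sketch of an inline expect (advanceColumn is a hypothetical helper, not part of the compiler):

advanceColumn : Position -> Position
advanceColumn = \position ->
    next = { position & column: position.column + 1 }
    # Evaluated on every call under roc dev; stripped from optimized builds.
    expect next.column > position.column
    next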

roc test doesn’t currently have a “watch” mode, so this is a good time to reach for entr (available in Homebrew and other package managers). I use entr a lot, because many of the little-known languages I play around with don’t have built-in watching tools. I use the command fd .roc | entr roc test to rerun roc test every time I save a file. (If you don’t have fd, it’s just a variation of find, but I’m not going to guess at the correct find incantation. Ask an AI.)

More Tokens

Now that we know how to do testing, we can use test-driven development to add new types of tokens. Let’s tackle this valid WAT program next:

(
  module
    (memory 1)
)

This allocates one page of heap memory in a linear buffer that the program can use later. (It doesn’t currently use it, but it can.) Running roc dev -- main.wat on this file indicates an error at line 3, column 13, because the tokenizer doesn’t know what to do with a 1 that isn’t part of a name. But we don’t need roc dev -- main.wat anymore. Let’s move this to an expect in Tokenizer.roc:

expect
    result = tokenize
        """
        (
          module
          (memory 1)
        )
        """
    dbg result

    Result.isOk result

This test introduces Roc’s multi-line strings, which use the same triple-quote syntax as Python, except Roc also helpfully strips off the common leading indentation. For now, we just check whether the Result is Ok, which it isn’t. Roc fails the test, and the dbg line outputs the result for us to introspect.
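
As a quick standalone illustration of that indentation stripping, these two strings are equal:

expect
    """
    (
      module
    )
    """
    == "(\n  module\n)"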

It looks like we need a new Number token type that behaves similarly to Name, but for integers. I’m not going to code this to the WAT spec, so let’s pretend “numbers” are just plain non-negative integers. I’m ignoring negatives, floating point, hex, and other notations.

I don’t expect to hand-code the exact result correctly on my first attempt (see previous comment about off-by-one errors), but I’m going to set the test up like this, for starters:

expect
    result = tokenize
        """
        (
          module
          (memory 10)
        )
        """
    dbg result

    result
    == Ok [
        { position: { row: 1, column: 1 }, token: LParen },
        { position: { row: 2, column: 3 }, token: Name ("module" |> Str.toUtf8) },
        { position: { row: 3, column: 3 }, token: LParen },
        { position: { row: 3, column: 4 }, token: Name ("memory" |> Str.toUtf8) },
        { position: { row: 3, column: 8 }, token: Number 10 },
        { position: { row: 3, column: 10 }, token: RParen },
        { position: { row: 4, column: 1 }, token: RParen },
    ]

Now we just need to chase down test (and compiler) errors. The first error complains that our expected result has a Number in it, but Number doesn’t exist in the Token type. I can solve this by adding Number to both Token and CurrentToken. But there’s a catch:

Token : {
    position : Position,
    token : [LParen, RParen, Name (List U8), Number U32],
}
CurrentToken : {
    position : Position,
    token : [None, Name (List U8), Number (List U8)],
}

The final Token will hold the number converted to an integer, but while we are collecting the in-progress token, we’ll store the digits as a List U8, because we don’t know whether we have all of them yet.

Now it’s crashing because the big-ass when expression doesn’t cover the new token type. Before I address that, let’s split up the validNameBytes set so that we have a separate set for integer digits.

validDigitBytes =
    Set.fromList [
        '0',
        '1',
        '2',
        '3',
        '4',
        '5',
        '6',
        '7',
        '8',
        '9',
    ]

validNameBytes =
    validNameStartBytes
    |> Set.union validDigitBytes
    |> Set.union
        (
            Set.fromList [
                '.',
                '_',
            ]
        )

From the tokenizer’s point of view, Numbers are collected almost identically to Names, except we need to look bytes up in this new set instead of validNameBytes. I don’t like the number of arms this adds to our when expression, but when I tried to abstract out some common pieces, it just got messier. For now, I’m going to keep the when expression as is, though we may be looking at a refactor later.

        (None, numberByte) if Set.contains validDigitBytes numberByte ->
            { currentState &
                currentToken: {
                    position: currentState.currentPosition,
                    token: Number [numberByte],
                },
            }

If we currently don’t have a token and we encounter a digit, we need to start a number token and store it in currentToken.

Next we have the case where we need to add a digit to an existing number token:

        (Number numberBytes, numberByte) if Set.contains validDigitBytes numberByte ->
            currentToken = currentState.currentToken
            { currentState &
                currentToken: { currentToken &
                    token: Number (List.append numberBytes numberByte),
                },
            }

We also need the two “end number” arms, which arise when we encounter whitespace or an RParen. These won’t be a straight-up copy-paste of the Name arms, because we also need to convert the list of number bytes to a U32.

Note: Now that I think of it, we should probably be doing something similar with the name bytes, converting them back to a Str before emitting the token. I don’t want to do that right now while I have this Number thing in the air, but I’ll make that change later.

        (Number numberBytes, whitespaceByte) if Set.contains whitespaceBytes whitespaceByte ->
            { currentState &
                currentToken: {
                    position: currentState.currentPosition,
                    token: None,
                },
                tokens: List.append currentState.tokens {
                    position: currentState.currentToken.position,
                    token: Number (numberBytes |> digitsToNumber),
                },
            }

        (Number numberBytes, ')') ->
            { currentState &
                currentToken: {
                    position: currentState.currentToken.position,
                    token: None,
                },
                tokens: List.concat currentState.tokens [
                    {
                        position: currentState.currentToken.position,
                        token: Number (numberBytes |> digitsToNumber),
                    },
                    {
                        position: currentState.currentPosition,
                        token: RParen,
                    },
                ],
            }

Both of these cases rely on a digitsToNumber helper function that I haven’t defined yet. We have a sequence of bytes that we know are ASCII digit characters. The two ways I can think of to convert them to an integer are:

  • Convert the bytes to a string and then the string to a number using standard library utilities (see the sketch after this list). Both of those operations return Results because they can fail, but we can ignore the error cases because we know the input bytes are legitimate digits.
  • Convert each U8 byte to a U32 and multiply the accumulated value by 10 for each place value. This has the potential for overflow, but I’m going to file that under “wait until a user reports it as a bug.” That’s only a good place to file things when you are working on projects that aren’t intended to have users.
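
Here’s a rough sketch of that first approach, for comparison (digitsToNumberViaStr is just an illustrative name; we’ll implement the second approach below):

digitsToNumberViaStr : List U8 -> U32
digitsToNumberViaStr = \digits ->
    when Str.fromUtf8 digits is
        # Neither fallback can be hit when the input is all ASCII digits.
        Err _ -> 0
        Ok str -> Str.toU32 str |> Result.withDefault 0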

This is where I uncovered another problem with Roc’s youthful test framework. I thought I’d put this little function in a separate file so I could unit test it, but roc test Util.roc crashes with a Rust-level panic (unwrapping a None option). Apparently, the presence of other broken files in the directory causes weird things to happen when you test unbroken ones. When I moved my Util.roc to its own project folder, the test command ran fine. So I kept it there until I was confident the function was doing what I wanted, and then copied it into Tokenizer.roc.

digitsToNumber : List U8 -> U32
digitsToNumber = \digits ->
    digits |> List.walk 0u32 \state, digit ->
        (state * 10u32) + ((digit - 48) |> Num.toU32)

expect digitsToNumber [] == 0
expect digitsToNumber [48u8] == 0
expect digitsToNumber [49u8] == 1
expect digitsToNumber [49u8, 48u8] == 10
expect digitsToNumber [49u8, 52u8, 55u8] == 147
expect digitsToNumber [48u8, 48u8, 49u8, 52u8, 55u8] == 147

Roc doesn’t have loops, so we call the walk function with a reducer pattern. People sometimes struggle with reducers, but I find them fairly straightforward: each invocation of the function is equivalent to one iteration of the “loop” (looping keywords are essentially functions instead of statements). We initialize the state to 0u32, and each iteration/invocation returns a new state by doing the following:

  • multiplies the existing state by 10
  • converts the next digit from its ASCII value to its integer value by subtracting 48 (because '0' is 48 in the ASCII table)
  • explicitly converts the resulting U8 to a U32, a lossless operation since U32 is bigger than U8
  • adds that digit to the previously 10x’d state
  • returns the result as the new state to be passed to the next iteration
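
To make that concrete, here’s the walk traced over the bytes of "147":

# digitsToNumber [49, 52, 55]
# state = 0   ->  0 * 10 + (49 - 48) =   1
# state = 1   ->  1 * 10 + (52 - 48) =  14
# state = 14  -> 14 * 10 + (55 - 48) = 147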

That’s a surprising amount of work for effectively two lines of code. It even does the right thing with leading zeros. The input digits are unsigned (non-negative), so we don’t have to worry about the math when negatives are involved.

Now we just need to add Number to our “wildcard” case. The body can stay the same:

        (Number _, unexpectedByte) | (Name _, unexpectedByte) | (None, unexpectedByte) ->

Note: if we wanted to, we could split these out into separate cases so we could give more informative “I was expecting a xxx or yyy but got a zzz” style of messages. I’m not going to bother with that. There are enough cases here!

The unit test I wrote earlier did indeed have a couple of errors in the positions. I compared the actual output of the failing test to my estimates and agreed that my numbers were off. The corrected test looks like this:

expect
    result = tokenize
        """
        (
          module
          (memory 10)
        )
        """

    dbg result

    result
    == Ok [
        { position: { row: 1, column: 1 }, token: LParen },
        { position: { row: 2, column: 3 }, token: Name ("module" |> Str.toUtf8) },
        { position: { row: 3, column: 3 }, token: LParen },
        { position: { row: 3, column: 4 }, token: Name ("memory" |> Str.toUtf8) },
        { position: { row: 3, column: 11 }, token: Number 10 },
        { position: { row: 3, column: 13 }, token: RParen },
        { position: { row: 4, column: 1 }, token: RParen },
    ]

Converting bytes to names

I mentioned earlier that output tokens should carry Names as strings instead of lists of bytes. This is somewhat nuanced, because converting bytes to a string can fail if they aren’t valid UTF-8. When that happens, we need to return an error instead of a token. It’s further complicated by the fact that our code needs to do the conversion in (at least) two match arms. My solution is a new top-level function that takes the current state and the Name bytes and returns the updated state.

The first step is to change the type of Token so that the Name tag hosts a Str as a payload (the CurrentToken continues to be a list of U8s):

Token : {
    position : Position,
    token : [LParen, RParen, Name Str, Number U32],
}

Now I need a function that takes the current state and name bytes and returns the next state, updating either errors or tokens, depending on whether the name bytes can be converted to a string:

endString : TokenizerState, List U8 -> TokenizerState
endString = \currentState, nameBytes ->
    when Str.fromUtf8 nameBytes is
        Err _ ->
            bytesDisplay = nameBytes |> List.map Num.toStr |> Str.joinWith ", "
            { currentState &
                errors: List.append currentState.errors {
                    message: "Unexpected Utf8 byte in: $(bytesDisplay)",
                    position: currentState.currentPosition,
                },
            }

        Ok string ->
            { currentState &
                currentToken: {
                    position: currentState.currentToken.position,
                    token: None,
                },
                tokens: List.append currentState.tokens {
                    position: currentState.currentToken.position,
                    token: Name string,
                },
            }

There isn’t really anything new in this snippet, in spite of its length!

Now I need to update the two when arms that terminate a name in the main pattern match:

        (Name nameBytes, whitespaceByte) if Set.contains whitespaceBytes whitespaceByte ->
            endString currentState nameBytes

        (Name nameBytes, ')') ->
            withStrState = endString currentState nameBytes
            { withStrState &
                tokens: List.append withStrState.tokens {
                    position: withStrState.currentPosition,
                    token: RParen,
                },
            }

The whitespace case becomes trivially simple: we just return the result of calling endString directly. The ) case is slightly more complicated, because we need to store the endString result as intermediate state and build a new state that appends the RParen token.

The code is still failing to compile at this point because my tests have Name (List U8) tokens, but the result is now a Name Str. Fixing that is actually a simplification; we need to change the expect expressions that contain lines like this:

        { position: { row: 1, column: 2 }, token: Name ("module" |> Str.toUtf8) },

to

        { position: { row: 1, column: 2 }, token: Name "module" },

Now it compiles and the tests pass.

Technically, the tests didn’t pass for me the first time. There was a bug in one of my match arms that I fixed before showing it to you. I hate it when authors make it seem like coding always goes right. The whole point of using unit tests is to catch mistakes. If we didn’t make mistakes, we wouldn’t need unit tests! That said, I’m not going to force you to type all my mistakes all over again.

I realized I don’t have a test for the “number with whitespace” arm yet, so I added one:

expect
    result = tokenize
        """
        (
          module
          (memory 10 )
        )
        """

    result
    == Ok [
        { position: { row: 1, column: 1 }, token: LParen },
        { position: { row: 2, column: 3 }, token: Name "module" },
        { position: { row: 3, column: 3 }, token: LParen },
        { position: { row: 3, column: 4 }, token: Name "memory" },
        { position: { row: 3, column: 11 }, token: Number 10 },
        { position: { row: 3, column: 14 }, token: RParen },
        { position: { row: 4, column: 1 }, token: RParen },
    ]

There are other edge cases, I’m sure, but I’m not going to worry about them right now. Let’s see what it takes to add a string token type!

A string token

Strings are normally tricky because you need to manage escapes, but for expediency, I’m going to assume no escapes, mostly because I want to move on from tokenizing without introducing yet another post!

I’m starting to think I shouldn’t have picked quite so complicated an example as “hello world in WAT”. Compilers may be simpler than folks think, but apparently they ARE big.

First, I’ll add the String token to both the Token and CurrentToken types:

Token : {
    position : Position,
    token : [LParen, RParen, Name Str, Number U32, String Str],
}
CurrentToken : {
    position : Position,
    token : [None, Name (List U8), Number (List U8), String (List U8)],
}

Next, I’ll make a simple test case, again acknowledging it might not be accurate:

expect
    result = tokenize
        """
        (module
          (export "memory" (memory 0))
        )
        """

    result
    == Ok [
        { position: { row: 1, column: 1 }, token: LParen },
        { position: { row: 1, column: 2 }, token: Name "module" },
        { position: { row: 2, column: 3 }, token: LParen },
        { position: { row: 2, column: 4 }, token: Name "export" },
        { position: { row: 2, column: 11 }, token: String "memory" },
        { position: { row: 2, column: 20 }, token: LParen },
        { position: { row: 2, column: 21 }, token: Name "memory" },
        { position: { row: 2, column: 28 }, token: Number 0 },
        { position: { row: 2, column: 29 }, token: RParen },
        { position: { row: 2, column: 30 }, token: RParen },
        { position: { row: 3, column: 1 }, token: RParen },
    ]

Notice that we are discarding the quotation marks that delimit the string and only storing the string contents in the emitted token.

We’ll need a match arm that matches on a " when the current token is None to start a string:

        (None, '"') ->
            { currentState &
                currentToken: {
                    position: currentState.currentPosition,
                    token: String [],
                },
            }

Strings are a bit different from what we’ve seen before because all bytes get appended to the string unless we encounter the ending " mark. To cover this, we’ll put the " mark handler before the “other character” handler:

        (String stringBytes, '"') ->
            crash "TODO: Handle string terminus"

        (String stringBytes, stringByte) ->
            currentToken = currentState.currentToken
            { currentState &
                currentToken: { currentToken &
                    token: String (List.append stringBytes stringByte),
                },
            }

There is no error case while processing a string, as all (non-quote) bytes are appended to the List as-is. There could be an error if those bytes are not valid UTF-8, but we won’t encounter it until we convert the bytes at the end of the string.

Replacing the crash arm above is going to require a bit of legwork. I need to convert the bytes I have to a string. I already have code to do that in the endString function we just wrote and it’d be a shame to let that work go to waste. But endString returns tokens that are Name tags, and we need one that returns a String instead.

The solution is to add a third parameter to our endString function: a function that accepts a Str as input and returns an appropriate tag. But in order to return an appropriate tag, we’ll need to define a new TokenTag type in our type definitions:

TokenTag : [LParen, RParen, Name Str, Number U32, String Str]
Token : {
    position : Position,
    token : TokenTag,
}

All I did was move the tag union out of Token into its own top-level definition.

The signature of endString will now take a callback function as its third argument:

endString : TokenizerState, List U8, (Str -> TokenTag) -> TokenizerState
endString = \currentState, nameBytes, makeToken ->

Then we can replace the Name string with makeToken string in the success case:

        Ok string ->
            { currentState &
                currentToken: {
                    position: currentState.currentToken.position,
                    token: None,
                },
                tokens: List.append currentState.tokens {
                    position: currentState.currentToken.position,
                    token: makeToken string,
                },
            }

Now we need to update the call-sites for endString. We can use the fact that Roc treats a tag as a constructor function when it is passed as a callback. So the call-sites are simply:

            endString currentState nameBytes Name
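
Passing the bare tag is equivalent to wrapping it in a lambda ourselves:

# These two calls produce the same token:
endString currentState nameBytes Name
endString currentState nameBytes (\str -> Name str)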

Next, update the crash arm to call the new endString implementation:

        (String stringBytes, '"') ->
            endString currentState stringBytes String

Looking at the hello example, I think the only syntax we haven’t covered is referring to variable names, which start with a $.

Variable tokens are old hat

It’s possible we could just add $ to the validNameStartBytes set and fold variables into the Name token. But I feel like variables deserve their own token. Shouldn’t take too long!

As usual, we first update the types:

TokenTag : [LParen, RParen, Name Str, Number U32, String Str, Variable Str]
Token : {
    position : Position,
    token : TokenTag,
}
CurrentToken : {
    position : Position,
    token : [None, Name (List U8), Number (List U8), String (List U8), Variable (List U8)],
}

Then we add a unit test:

expect
    result = tokenize
        """
        (module
          (func $main)
        )
        """

    result
    == Ok [
        { position: { row: 1, column: 1 }, token: LParen },
        { position: { row: 1, column: 2 }, token: Name "module" },
        { position: { row: 2, column: 3 }, token: LParen },
        { position: { row: 2, column: 4 }, token: Name "func" },
        { position: { row: 2, column: 9 }, token: Variable "main" },
        { position: { row: 2, column: 14 }, token: RParen },
        { position: { row: 3, column: 1 }, token: RParen },
    ]

I actually didn’t expect this example to be valid WAT syntax, but it seems to build and run with wasm-tools and wasmtime. Notice that I’ve stripped the $ from the Variable token. As far as I know, all WAT variable identifiers start with $, so its existence is implied by the fact that we have a Variable token.

A variable can only be started if the currentToken is currently None:

        (None, '$') ->
            { currentState &
                currentToken: {
                    position: currentState.currentPosition,
                    token: Variable [],
                },
            }

There is enough symmetry between all the None cases in our match arms that I think it would be good to abstract them into a function (something like endString), passing a token constructor as an argument. I’ll mostly leave that as an exercise for the reader!
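
If you do take it on, a minimal sketch might look like this (startToken is a hypothetical name, not something that exists in my code):

# Collapses the symmetrical (None, ...) arms into a single helper.
startToken = \currentState, tokenTag ->
    { currentState &
        currentToken: {
            position: currentState.currentPosition,
            token: tokenTag,
        },
    }

The “start a token” arms would then become one-liners such as (None, '$') -> startToken currentState (Variable []).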

We’ll need the case to extend the variable with additional characters and two cases to end the variable. All three are virtually identical to Name:

        (Variable variableBytes, variableByte) if Set.contains validNameBytes variableByte ->
            currentToken = currentState.currentToken
            { currentState &
                currentToken: { currentToken &
                    token: Variable (List.append variableBytes variableByte),
                },
            }

        (Variable variableBytes, whitespaceByte) if Set.contains whitespaceBytes whitespaceByte ->
            endString currentState variableBytes Variable

        (Variable variableBytes, ')') ->
            withStrState = endString currentState variableBytes Variable
            { withStrState &
                tokens: List.append withStrState.tokens {
                    position: withStrState.currentPosition,
                    token: RParen,
                },
            }

Finally, we need to add a (Variable _, unexpectedByte) | to the error-catching arm (shown below). I must be getting better at estimating where positions land, because my unit test passed on the first try this time!
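
For reference, the catch-all arm should now read something like this (String doesn’t appear because the earlier (String _, stringByte) arm already matches every byte):

        (Variable _, unexpectedByte) | (Number _, unexpectedByte) | (Name _, unexpectedByte) | (None, unexpectedByte) ->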

At this point, I was able to run roc dev -- hello.wat with no errors. The output is about 70 lines of tokens, and I didn’t look too closely to ensure every one was correct. But I’m feeling pretty confident in my unit tests, so I’m going to cap this section off with one more refactor.

Extracting Common Types

I started writing the parsing section and immediately got tangled in some messiness around positions and errors. Currently, both of these types are stored in the Tokenizer module, but they are clearly going to be needed during parsing as well. I got it working by importing them from the tokenizer, but I felt it was teaching bad practices!

So let’s create a new Common.roc file and move the Position and Error types into it:

module [Position, Error, formatErrors]

Position : {
    row : U32,
    column : U32,
}
Error : {
    message : Str,
    position : Position,
}

When removing these from Tokenizer.roc, don’t forget to also remove them from the exports list at the top of the module.

Now use the command roc check to see all the errors we need to fix. Start by importing Common at the top of Tokenizer.roc. Then replace the two instances of Error with Common.Error and Position with Common.Position in that file.
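
The result looks something like this (a sketch, assuming the state record fields we’ve been using in this article):

import Common

TokenizerState : {
    tokens : List Token,
    errors : List Common.Error,
    currentToken : CurrentToken,
    currentPosition : Common.Position,
}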

You’ll also need to import Common in the main file. Now simplify the signature of the compile function so we aren’t wrapping the errors in a TokenizeError tag:

compile : Str -> Result (List U8) (List Common.Error)

You’ll also get to simplify the errors branch to just return the Err directly (don’t worry, we’ll be simplifying this further soon).

        Err errors -> Err errors

Next, simplify the Err branch in the compileResult pattern in main:

                Err errors ->
                    errors
                    |> formatErrors
                    |> Stderr.line
                    |> Task.mapErr \_ -> Exit 99 "System is failing"

Finally, for a little housekeeping, we can move the formatErrors function into Common.roc as well:

formatErrors : List Error -> Str
formatErrors = \errors ->
    Str.joinWith
        (
            errors
            |> List.map \error ->
                row = error.position.row |> Num.toStr
                column = error.position.column |> Num.toStr
                "$(row):$(column) $(error.message)"
        )
        "\n"

The single call-site in main will need to be updated to match:

                    |> Common.formatErrors

And now. NOW, we are finally done with tokenizing. I wonder how much fun parsing will be in the next article?