Build a Wasm Compiler in Roc - Part 3
In part 1 and part 2 of this series, I introduced the project and we wrote some Roc code to load an input file and save the compiled result to a different file.
Note: Other articles in this series are collected here.
However, we are a long way from actually having that compiled result available! This article introduces the phases involved in writing a compiler and focuses on implementing the first phase, known as lexical analysis or tokenizing.
But first, let’s briefly cover what a compiler actually does.
Reminder: You are reading content that took a great deal of effort to craft, compose, and debug. If you appreciate this work, consider supporting me on Patreon or GitHub.
Dude, What Does a Compiler do?
Real-world compilers are incredibly complex pieces of software. But the basics are fairly simple, and follow a linear process of steps, some of which are optional:
- Lexical Analysis or Tokenizing splits the input program (a string) into meaningful tokens. The result is just a one-dimensional array of token objects; the relationships between tokens are not understood at this point.
- Syntax Analysis or Parsing takes that token stream as input, ensures it follows the appropriate grammar rules, and constructs an Abstract Syntax Tree (AST).
- Transformation converts the AST into a different AST. Whether you are transpiling (source code in one language to source code in a different language) or compiling (source code in one language to machine code or byte code), the input AST will typically be different from the output AST. Even though WAT is designed to translate directly to Wasm, the two ASTs are actually drastically different, so we’ll need this step.
- Semantic Analysis performs program validation (such as type checking) to ensure that the Abstract Syntax Tree is well-formed (for some definition of well-formed). The Wasm specification actually defines exactly what kind of type checking a true WAT compiler should implement, but for the purposes of this series, I intend to skip it.
- Code Generation converts the abstract syntax tree to the desired output representation. When compiling, this would typically be a sequence of bytes in machine code or byte code. When transpiling it will be source code in a new target programming language.
Each of these steps is both easier and harder than it sounds. The basic idea for each is easy to describe, but every different piece of syntax needs to be handled according to its own rules, so the total amount of work involved is not small. I’m too stubborn to shrink my input example, though, so we’re going to compile the whole Hello World example, which (as a reminder) looks like this in WAT syntax:
(module
    (import "wasi_snapshot_preview1" "fd_write" (func $fd_write (param i32 i32 i32 i32) (result i32)))
    (memory 1)
    (export "memory" (memory 0))
    (export "_start" (func $main))
    (data (offset (i32.const 8)) "hello world")
    (func $main
        (i32.const 0)
        (i32.const 8)
        (i32.store)
        (i32.const 4)
        (i32.const 11)
        (i32.store)
        (i32.const 1)
        (i32.const 0)
        (i32.const 1)
        (i32.const 20)
        (call $fd_write)
        drop
    )
)
Tokenizing
The first step, and the focus of this article, is tokenizing. Many compilers combine tokenizing and parsing into a single step, and indeed any sane person would probably start with the roc-parser package. However, for simplicity and education, we’ll do them separately.
Tokenizing is the process of turning the input string (which is effectively an array of characters) into an array of logical tokens. Some characters, such as a parenthesis, will map to a token all by themselves, but others, such as the identifier “module”, will cover several characters with a single token.
In fact, the simplest possible Wasm module is (module), so we can start with just these three tokens!
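To make the goal concrete, here is roughly what we are aiming for by the end of this article, written as a top-level Roc expect. It assumes the tokenize function and Token type we are about to create, and it won’t pass until the tokenizer is finished:

expect tokenize "(module)" == [LParen, Name (Str.toUtf8 "module"), RParen]

The Name payload here is just the UTF-8 bytes of the word “module”; the exact representation is worked out below.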
I want to limit our main.roc file to “code that has side effects”, so I’m going to start by creating a new Roc module (file) named Tokenizer.roc.
All Roc files need to start with a header that is designed to communicate the purpose of the file with respect to other modules. For example, our main file defined the file as an app, and indicated what platform the file needed to run on.
This Tokenizer.roc file will instead be a module. The header for a module just needs to list what definitions are intended to be exported from that module. Any definitions not included in that list are not exported.
Our primary tokenizer interface will be a function that takes the input string as an argument and returns an array of Tokens. We don’t know exactly what a Token is yet, but given that there will be different types of tokens, I know it will need to be some kind of tag union. For now, I’m just going to define the module header to indicate we are exporting a tokenize function and a Token type. The header only needs to specify the names of the exported elements, not any types or definitions, so this is sufficient:
module [
    Token,
    tokenize,
]
To access these exports from the main.roc app module, we just need to add import Tokenizer to the imports.
Now we can update our compile function (written in the previous article) in main.roc to call the tokenize function with our input:
compile : Str -> List U8
compile = \input ->
    tokens = Tokenizer.tokenize input
    dbg tokens

    Str.toUtf8 "TODO: Compile Input"
If we try to compile this now, of course we’ll get errors. Specifically, two “Missing Definition” errors. Our Tokenizer.roc file is exporting two definitions, but they don’t exist! Let’s start by adding dummy definitions for each of them:
module [
    Token,
    tokenize,
]

Token : []

tokenize : Str -> List Token
tokenize = \input ->
    dbg input

    []
Functions in Roc are always specified with Roc’s lambda syntax, which is \arguments -> body. Everything in Roc is an expression that returns a result, so there is no explicit return statement. The last expression in the function is the “return” value of that function. The type definition before the function is optional, but in a classic “help me help you” situation, it gives the compiler extra information to shout at us if our body or call site doesn’t match what we documented.
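As a minimal, hypothetical illustration unrelated to the compiler (addOne is a made-up name), here is a lambda with its optional type annotation; the last expression is what gets returned:

# a standalone lambda assigned to a def, with an optional type annotation
addOne : U64 -> U64
addOne = \n -> n + 1

expect addOne 41 == 42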
With those dummy definitions in place, the project should compile. The input string and an empty token array are output from each dbg statement (one in each module) when we run roc dev -- main.wat.
Iterating Tokens
The next step is to actually iterate over the characters in input and extract meaningful tokens from it. There are two reasons this is harder than it sounds:
- WAT syntax is written in UTF-8 and can technically support any unicode character. Roc’s standard library doesn’t support unicode characters out of the box and we would need to rely on the roc/unicode package to handle this correctly. However, the delimiters in WAT syntax are all single-byte ASCII characters, so we can get away with just assuming anything that is not a delimiter is part of a UTF-8 name.
- Roc is a purely functional programming language, which means it doesn’t have loops. Iteration is encapsulated in a function call. The typical “walk” functions deal with either one byte or one unicode character at a time. However, tokenizing is traditionally implemented by checking multiple characters until a condition is met. Another option would be to use the List.walkUntil and List.walkFrom functions from the standard library to work on list indices, but I feel like that would be messier than looking at each character once and storing it if it isn’t complete.
The Str.walkUtf8 function acts like a reducer, accumulating state while it runs a custom function that evaluates each byte (not unicode character) in the string. That function needs to accept the current state and the next byte as arguments, and return the new state.
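Here is a small, hypothetical example of that reducer pattern (countSpaces is a made-up name, not part of the compiler). It threads a running count through Str.walkUtf8, one byte at a time:

# count how many bytes in a string are ASCII spaces (32)
countSpaces : Str -> U64
countSpaces = \str ->
    Str.walkUtf8 str 0 \count, byte ->
        if byte == ' ' then
            count + 1
        else
            count

expect countSpaces "a b c" == 2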
Records
Now is a good time to introduce records in Roc. Records are kind of like structs in other languages. They are strongly typed, but the types can be soundly inferred based on usage. So you can basically define what looks like an untyped JavaScript object, and Roc will infer the types for all the fields.
Our initial state will be a fairly simple record:
initialState = {
    currentToken: None,
    tokens: [],
}
This is a top-level definition in the module. If you are from the school of thought that globals are bad, you’re right, but things are a little different in functional programming. All values are immutable, which means that this def is more like a “const” than a global “variable”.
The currentToken is the tag literal None. I pulled this value out of thin air. It isn’t a special Roc type like Python’s None or null or Option.None in various other languages. It’s a literal tag with no payload. The cool thing about tags is that they can act like variants; currentToken can take one of many tags, only one of which has been defined at this time: None.
I’ll add a type to this record later, but I wanted to illustrate that they aren’t mandatory.
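If tags are new to you, here is a hypothetical tag union (Shape has nothing to do with the compiler) showing that each variant can carry a different payload, or none at all. It also uses a when expression, which we’ll look at properly in a moment:

# a hypothetical tag union: Dot carries no payload, the other tags do
Shape : [Dot, Square U64, Rect U64 U64]

area : Shape -> U64
area = \shape ->
    when shape is
        Dot -> 0
        Square side -> side * side
        Rect w h -> w * h

expect area (Rect 3 4) == 12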
The second piece we need is a function that takes the current state and the next byte and returns the next state. For starters, let’s just debug the next byte and return the currentState unmodified:
evaluateOneChar = \currentState, nextByte ->
    dbg nextByte

    currentState
Armed with these two definitions and the input string passed to tokenize, we can now implement the tokenize function that we roughed out earlier:
tokenize : Str -> List Token
tokenize = \input ->
    finalState = Str.walkUtf8
        input
        initialState
        evaluateOneChar

    finalState.tokens
Str is a builtin module so we don’t need to import it. The walkUtf8 function will call our evaluateOneChar function for each byte in the string.
We can run our code on empty_module.wat (which contains (module)) and get the following debug output:
[Tokenizer.roc:23] nextByte = 40
[Tokenizer.roc:23] nextByte = 109
[Tokenizer.roc:23] nextByte = 111
[Tokenizer.roc:23] nextByte = 100
[Tokenizer.roc:23] nextByte = 117
[Tokenizer.roc:23] nextByte = 108
[Tokenizer.roc:23] nextByte = 101
[Tokenizer.roc:23] nextByte = 41
[Tokenizer.roc:23] nextByte = 10
[main.roc:14] tokens = []
The bytes are all U8 values, so they show up as integers, but we can see that our function has been called for every byte in the input string.
Evaluating Tokens
The S-expression syntax used in WAT means that we can always expect the first token to be an opening parenthesis. Let’s update evaluateOneChar to parse a single parenthesis. If it encounters any other characters, it will continue to debug the character and return the current state unmodified. I’m not clear on whether we are better off using pattern matching or if expressions for this, but I’m going to start with pattern matching and see how it works.
evaluateOneChar = \currentState, nextByte ->
    when (currentState.currentToken, nextByte) is
        (None, '(') ->
            {
                currentToken: None,
                tokens: List.append currentState.tokens LParen,
            }

        _ ->
            dbg nextByte

            currentState
This when expression is matching on two values at the same time, by collecting them in a tuple. If the current token is None and the next character is an opening parenthesis, we just return a new state that appends an LParen tag to the list of tokens. We would have to handle an LParen differently if the currentToken was an identifier or string, so we need to match on both values. Any non-matching values fall into the “wildcard” arm, denoted by _, which just debugs the nextByte and returns the current state unmodified.
Notice the single quotes in '('. These indicate that the character in question is a character and not a string, and its type is an integer. Note that it is not like characters in e.g. C, which only map to single bytes (integers less than 256). A character can be multiple bytes long for unicode characters and have a codepoint higher than 127, the top of the traditional ASCII range.
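As an aside, we can sanity-check that claim with a couple of top-level expects (the values come straight from the ASCII table):

# character literals are just numbers: '(' is 40 and 'a' is 97 in ASCII/UTF-8
expect '(' == 40
expect 'a' == 97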
If we try to compile this now, Roc will complain that our tokenize function is returning a List [LParen], when a List [] was expected. This is because the tokenize function is defined to return a List Token, and our Token type is currently an empty tag union. You can fix this easily by changing the Token type as follows:
Token : [LParen]
We’ll need to add all our tags to this tag union type as we go. If we run the program now, notice that the 40 byte is no longer output as a dbg (a ‘(’ character is byte 40 in both ASCII and UTF-8) and our tokens array now contains an LParen tag:
[Tokenizer.roc:31] nextByte = 109
[Tokenizer.roc:31] nextByte = 111
[Tokenizer.roc:31] nextByte = 100
[Tokenizer.roc:31] nextByte = 117
[Tokenizer.roc:31] nextByte = 108
[Tokenizer.roc:31] nextByte = 101
[Tokenizer.roc:31] nextByte = 41
[Tokenizer.roc:31] nextByte = 10
[main.roc:14] tokens = [LParen]
Parsing a name
The next sequence we’ll need to parse is a name. I should really be trying to match the formal WAT grammar defined here, but for expediency, I’m going to define a name based on the subset of WAT that is needed to compile the “hello world” example I listed above.
Specifically, a name (for this abused subset of the grammar) is any sequence of lowercase ASCII letters, numbers, or the underscore or period characters. This will match things like module, export, or i32.const, but not variables like $fd_write or strings like "_start".
Normally I’d use a regular expression for this, but given the uncertain status of the only Roc regex library, I’m going to create a top-level Set instead. A Set is more suitable for this than a List because checking whether it contains a given byte happens in constant time, as opposed to the linear scan that would be needed if we used a List. I actually decided I need two sets, because the first character of a name cannot be a number or punctuation character. The definitions look like this:
validNameStartBytes = Set.fromList [
    'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h',
    'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p',
    'q', 'r', 's', 't', 'u', 'v', 'w', 'x',
    'y', 'z',
]

validNameBytes = validNameStartBytes |> Set.union (Set.fromList [
    '0', '1', '2', '3', '4', '5',
    '6', '7', '8', '9', '.', '_',
])
Note that the Set.union call is returning a new Set and not modifying the existing validNameStartBytes in place. Roc never modifies anything in place. So both sets contain the values we expect after these calls are made.
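As a small, hypothetical illustration of that immutability (vowels and withY are made-up names for this example):

vowels = Set.fromList ['a', 'e', 'i', 'o', 'u']
withY = Set.union vowels (Set.fromList ['y'])

# the union produced a brand new set; the original set is unchanged
expect Set.contains withY 'y' == Bool.true
expect Set.contains vowels 'y' == Bool.false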
Now we can add a pattern to our when expression to start a new Name current token when we encounter a letter and the current token is None:
(None, nameByte) if Set.contains validNameStartBytes nameByte ->
    { currentState &
        currentToken: Name [nameByte],
    }
This pattern match has a condition on it, so it only matches if the value in the nameByte variable is in the validNameStartBytes set. If it does match, we return a new record that is derived from the existing currentState record. That’s what the & does; it says “return a record that contains all the values of the existing currentState record but change the following fields to…”. Then we say we’re replacing the currentToken field in that record with a new Name tag. Unlike our previous tags (None and LParen), the Name tag gets a payload, which we set to a List of bytes (containing only one byte, for now).
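Here is the record-update syntax on its own, as a tiny hypothetical sketch (base and updated are made-up names):

# `&` builds a new record from an existing one, replacing only the listed fields
base = { currentToken: None, tokens: [LParen] }
updated = { base & currentToken: Name ['m'] }
# base is untouched: its currentToken is still None, and its tokens list is unchanged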
Back in the tokenizer, this new pattern will compile and run, but it only handles the first character of the name. Our debug output now looks like this:
[Tokenizer.roc:36] nextByte = 111
[Tokenizer.roc:36] nextByte = 100
[Tokenizer.roc:36] nextByte = 117
[Tokenizer.roc:36] nextByte = 108
[Tokenizer.roc:36] nextByte = 101
[Tokenizer.roc:36] nextByte = 41
[Tokenizer.roc:36] nextByte = 10
[main.roc:14] tokens = [LParen]
This is subtly different from our last output listing, as the 109 byte is no longer output. We aren’t actually doing anything with it yet, in terms of updating the output tokens, but we are handling it.
We can handle the remaining name characters similarly, except that at the time those bytes are read, the currentToken is now a Name rather than a None. So we need to match and destructure a Name token from the currentToken inside the when:
(Name nameBytes, nameByte) if Set.contains validNameBytes nameByte ->
    { currentState &
        currentToken: Name (List.append nameBytes nameByte),
    }
The tuple we are matching on now matches a Name token and extracts the current nameBytes from it. It then appends the new nameByte to create a new copy of that list and sets that as the new currentToken.
Running it now, we see that only two bytes are unhandled: a ) and a newline (41 and 10, for those who don’t have the ASCII table memorized or open in a separate browser tab):
[Tokenizer.roc:41] nextByte = 41
[Tokenizer.roc:41] nextByte = 10
[main.roc:14] tokens = [LParen]
But we’re still not adding the name to the array of tokens. There are a few characters that could end a name (notably whitespace), but I’m going to code to the (module) example for now and only match on a closing ):
(Name nameBytes, ')') ->
    {
        currentToken: None,
        tokens: List.concat currentState.tokens [Name nameBytes, RParen],
    }
We need to append two tokens to the array in this case (the Name and the ) we just consumed), so we use List.concat to combine two lists instead of List.append.
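For clarity, a quick illustration of the difference between the two:

# List.append adds a single element; List.concat joins two lists
expect List.append [1, 2] 3 == [1, 2, 3]
expect List.concat [1, 2] [3, 4] == [1, 2, 3, 4]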
Now that we are actually updating the tokens, the compiler will complain that tokenize is not returning the right type. We need to update the Token type definition to include our new Name and RParen tags:
Token : [LParen, RParen, Name (List U8)]
Notice how we define the Name tag to carry a payload that is a List U8. Now our output looks like this:
[Tokenizer.roc:47] nextByte = 10
[main.roc:14] tokens = [LParen, (Name [109, 111, 100, 117, 108, 101]), RParen]
We’re handling everything in a basic input and getting an array of tokens back! Aren’t we special? Let’s also eat that whitespace character. Start by adding a new top-level Set to hold the various whitespace characters:
# tab, newline, carriage return, and space
whitespaceBytes = Set.fromList [9u8, 10u8, 13u8, 32u8]
Rather than trying to figure out how to get the single quote syntax to pick up tabs and newlines, I just entered their ASCII/UTF-8 values directly. The u8 suffix says that these should be stored as bytes rather than the compiler’s default integer type.
We only want to drop whitespace if the current token is None (otherwise we’ll run into problems when we get to, for example, spaces inside quoted strings). Dropping it is as simple as returning the currentState without modification:
(None, whitespaceByte) if Set.contains whitespaceBytes whitespaceByte ->
    currentState
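For reference, here is the whole evaluateOneChar function with all of the patterns from this article collected in one place. The ordering of the branches here is my own; the guards keep them from overlapping:

evaluateOneChar = \currentState, nextByte ->
    when (currentState.currentToken, nextByte) is
        # an opening parenthesis is a complete token on its own
        (None, '(') ->
            {
                currentToken: None,
                tokens: List.append currentState.tokens LParen,
            }

        # a letter starts a new Name token
        (None, nameByte) if Set.contains validNameStartBytes nameByte ->
            { currentState &
                currentToken: Name [nameByte],
            }

        # later name characters get appended to the in-progress Name
        (Name nameBytes, nameByte) if Set.contains validNameBytes nameByte ->
            { currentState &
                currentToken: Name (List.append nameBytes nameByte),
            }

        # a closing parenthesis ends the Name and is itself a token
        (Name nameBytes, ')') ->
            {
                currentToken: None,
                tokens: List.concat currentState.tokens [Name nameBytes, RParen],
            }

        # whitespace between tokens is simply dropped
        (None, whitespaceByte) if Set.contains whitespaceBytes whitespaceByte ->
            currentState

        # anything else is not handled yet
        _ ->
            dbg nextByte

            currentState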
Now we are able to tokenize every character in the input string and get valid output. This is getting pretty long, so I’m going to pause here! In the next article, we’ll deal with error handling and hopefully finish tokenizing the larger hello world string.