In a previous blog post, we looked at a simple one-pass compiler written in C. Now, let’s look at how we can design an abstract syntax tree and work with it.
An abstract syntax tree (or an AST) is a tree-shaped representation of source code that is convenient for a compiler to operate. A compiler might represent an expression like 4 + 2 * 10 + 3 * (5 + 1)
using a tree structure like this:
In object-oriented languages, it is common to represent an AST using a hierarchy of data classes, one subclass per tree node. In functional languages, it is common to use a variant data type, in other words, a tagged union, with one variant per tree node. C does not have good support for either of those language features. However, we can manually implement a tagged union. Here’s how we can do that.
We define our AST type as a struct
holding two members,
enum
called tag
, enumerating each node type, andunion
called data
, consisting of data members for each tag type.typedef struct AST AST; // Forward reference
struct AST {
enum {
,
AST_NUMBER,
AST_ADD,
AST_MUL} tag;
union {
struct AST_NUMBER { int number; } AST_NUMBER;
struct AST_ADD { AST *left; AST *right; } AST_ADD;
struct AST_MUL { AST *left; AST *right; } AST_MUL;
} data;
};
We want data members to reference other AST nodes, and since C does not allow recursive type definitions, we make a forward reference by defining AST
as struct AST
. We can do this because the namespace of struct
s and the namespace of types are separate in C.
In this tagged union implementation, an identifier like AST_NUMBER
is mentioned three times:
enum
member,struct
, andC does not have a built-in concept of a tagged union, so there are many ways to implement it. You could name the enum
and the union
types, for example, and define them separately. Or, you could use an anonymous union
or struct
(standard since C11). You can decide not to wrap a single member into a struct
, like in the case of int number
. However, the design chosen here has some advantages that we will discuss further.
With AST
defined this way, our original example (4 + 2 * 10 + 3 * (5 + 1)
) has the following layout in memory.
To heap-allocate an AST node we can use a function like this:
*ast_new(AST ast) {
AST *ptr = malloc(sizeof(AST));
AST if (ptr) *ptr = ast;
return ptr;
}
Now, we can represent an expression 5 + 1
using the following code:
*term_ = ast_new((AST){
AST ,
AST_ADD{
.AST_ADD=(struct AST_ADD){
((AST){
ast_new,
AST_NUMBER{.AST_NUMBER=(struct AST_NUMBER){5}}
}),
((AST){
ast_new,
AST_NUMBER{.AST_NUMBER=(struct AST_NUMBER){1}}}
),
}
}
});
This is just C being C, isn’t it? There is no local type inference, so we must cast each struct
to the correct type. We also have to use designated initializers (like .AST_ADD=
) because that’s the only way to initialize an arbitrary union in a single expression.
To some extent, we even asked for it when we wrapped the harmless int number
into a struct AST_NUMBER
.
However, C being C, we can create a vararg macro that can reduce this boilerplate:
#define AST_NEW(tag, ...) \
ast_new((AST){tag, {.tag=(struct tag){__VA_ARGS__}}})
This macro ties together the matching identifiers from the enum
, the struct
, and the data members. With this, we can initialize 4 + 2 * 10 + 3 * (5 + 1)
as follows.
*term =
AST (AST_ADD,
AST_NEW(AST_NUMBER, 4),
AST_NEW(AST_ADD,
AST_NEW(AST_MUL,
AST_NEW(AST_NUMBER, 2),
AST_NEW(AST_NUMBER, 10),
AST_NEW),
(AST_MUL,
AST_NEW(AST_NUMBER, 3),
AST_NEW(AST_ADD,
AST_NEW(AST_NUMBER, 5),
AST_NEW(AST_NUMBER, 1),
AST_NEW),
),
),
);
We could even shorten AST_NEW(AST_ADD, …)
to AST(ADD, …)
with token concatenation (##
macro operator), but that is beyond my taste in C programming.
Now, how do we operate with this AST…
A common way to operate on a tagged union is to use something like a switch
statement, which switches over the tag value, where each case operates on the corresponding data value.
void ast_print(AST *ptr) {
= *ptr;
AST ast switch (ast.tag) {
case AST_NUMBER: {
struct AST_NUMBER data = ast.data.AST_NUMBER;
("%d", data.number);
printfreturn;
}
case AST_ADD: {
struct AST_ADD data = ast.data.AST_ADD;
("(");
printf(data.left);
ast_print(" + ");
printf(data.right);
ast_print(")");
printfreturn;
}
case AST_MUL: {
struct AST_MUL data = ast.data.AST_MUL;
("(");
printf(data.left);
ast_print(" * ");
printf(data.right);
ast_print(")");
printfreturn;
}
}
}
A modern compiler like gcc
or clang
will detect that you are using an enum
type in the switch statement and will warn if the switch is not exhaustive. Use the flag -Wswitch
to ensure that, and -Werror
to make it an error.
A common error is forgetting a break
or a return
at the end of each statement, thus triggering a fall-through that can lead to some hard-to-debug bugs. Use -Wimplicit-fallthrough
to require an explicit annotation if you want the fall-through behaviour and warn otherwise. The exact annotation syntax varies with other flags, but usually, __attribute__((fallthrough))
works.
With all these warnings enabled, the experience of using tagged unions in C is not that bad, huh? The one serious issue is that C will not stop you from accessing the wrong union member. For example, you can match on AST_NUMBER
, but then query for ast.data.AST_MUL.left
.
To somewhat counter that, I suggest the programming style where each case is immediately followed by binding the correct union member as data
, like this:
case AST_NUMBER: {
struct AST_NUMBER data = ast.data.AST_NUMBER;
("%d", data.number);
printfreturn;
}
Here, our matching naming pays off, I think. It makes it easy to code review that we are accessing the correct union data, and if we accidentally use the wrong union member down the line, it will scream at us something like ast.data.AST_ADD
.
We could extract this pattern into a CASE
macro, but this is also beyond me.
We’ve looked at how we can allocate AST nodes using ast_new
function and AST_NEW
macro. Let’s write deallocation function to match that. We will use a similar technique to our ast_print
function.
void ast_free(AST *ptr) {
= *ptr;
AST ast switch (ast.tag) {
case AST_NUMBER: {
struct AST_NUMBER data = ast.data.AST_NUMBER;
break;
}
case AST_ADD: {
struct AST_ADD data = ast.data.AST_ADD;
(data.left);
ast_free(data.right);
ast_freebreak;
}
case AST_MUL: {
struct AST_MUL data = ast.data.AST_MUL;
(data.left);
ast_free(data.right);
ast_freebreak;
}
}
(ptr);
free}
Let’s make a simple code generator that emits x86-64 Intel assembly. Similar to what we did for the one-pass compiler.
#define emitf printf
void ast_emit(AST *ptr) {
= *ptr;
AST ast switch (ast.tag) {
case AST_NUMBER: {
struct AST_NUMBER data = ast.data.AST_NUMBER;
(" mov rax, %d\n", data.number);
emitfreturn;
}
case AST_ADD: {
struct AST_ADD data = ast.data.AST_ADD;
(data.left);
ast_emit(" push rax\n");
emitf(data.right);
ast_emit(" pop rbx\n");
emitf(" add rax, rbx\n");
emitfreturn;
}
case AST_MUL: {
struct AST_MUL data = ast.data.AST_MUL;
(data.left);
ast_emit(" push rax\n");
emitf(data.right);
ast_emit(" pop rbx\n");
emitf(" mul rbx\n");
emitfreturn;
}
}
}
This code generator will be able to compile our expressions to snippets of assembly. However, to generate a complete program, we need an entry point. Let’s add that as a new node AST_MAIN
, as well as the corresponding functions:
struct AST {
enum {
,
AST_MAIN
…} tag;
union {
struct AST_MAIN { AST *body; } AST_MAIN;
…} data;
};
…
void ast_emit(AST *ptr) {
…case AST_MAIN: {
struct AST_MAIN data = ast.data.AST_MAIN;
(".global main\n");
emitf("main:\n"); // "_main" on macOS
emitf(data.body);
ast_emit(" ret\n");
emitf("\n");
emitfreturn;
}
…}
Now, if we pass our example AST (4 + 2 * 10 + 3 * (5 + 1)
) to ast_emit
, we will get the following assembly code:
.global main
main:
, 4
mov rax
push rax, 2
mov rax
push rax, 10
mov rax
pop rbx
mul rbx
push rax, 3
mov rax
push rax, 5
mov rax
push rax, 1
mov rax
pop rbx, rbx
add rax
pop rbx
mul rbx
pop rbx, rbx
add rax
pop rbx, rbx
add rax ret
Let’s use gcc test.s -masm=intel -o test.exe
as the command to assemble and link it, and we get the expected output:
$ ./test.exe
$ echo $?
42
4 + 2 * 10 + 3 * (5 + 1)
.Thank you for your attention. ☰
Subscribe to receive an occasional email from me about compilers, functional programming, or my book Compiling to Assembly from Scratch.
Unsubscribe at any time.
Are you still considering using C for writing a compiler? Consider again! Here’s the same example written in OCaml with complete type and memory safety:
type ast =
of ast
| Main of int
| Number of ast * ast
| Add of ast * ast
| Mul
let emitf x = Printf.kprintf print_endline x
let rec emit = function
| Main body ->".global main";
emitf "main:";
emitf
emit body;" ret"
emitf
| Number n ->" mov rax, %d" n
emitf
| Add (left, right) ->
emit left;" push rax";
emitf
emit right;" pop rbx";
emitf " add rax, rbx"
emitf
| Mul (left, right) ->
emit left;" push rax";
emitf
emit right;" pop rbx";
emitf " mul rbx" emitf
@misc{Keleshev:2022-1,
title="Abstract Syntax Tree: an Example in C",
author="Vladimir Keleshev",
year=2022,
howpublished=
"\url{https://keleshev.com/abstract-syntax-tree-an-example-in-c}",
}