Implementation

This page describes some salient aspects of how the Elle compiler is implemented. Details about the implementation of the correctess proof are deferred until Correctness.

In Elle Syntax, we described the syntax of the Elle language from the user’s perspective. However, internally, Elle uses a series of annotations to describe Elle programs at various stages of compilation. These can be found in elle/ElleSyntax.thy. This more elaborated representation takes the form of general datatype with extension points (type parameters) into which we can insert annotations later, along with the specific syntax extensions used at various stages of the compiler.

The Elle compiler proceeds in several phases, outlined in the remainder of this document. Note that links in this file are to the (frozen) ITP2019 branch of the repository, to ensure consistency of line numbers as master evolves. Though the line numbers may change, the general ideas should not.

Phase 1 - Generating Size Annotations

As a first step, the Elle compiler generates a pair of integer annotations for each node in the syntax tree given to the compiler as input. These annotations correspond to the range of bytes taken up by the code that will be generated from the syntax tree in the program buffer. These are calculated as one would expect: instructions are simply the length of the encoding of the instruction as bytecode, labels correspond to the lengths of EVM JUMPDEST instructions, and sequence nodes have lengths equal to the sum of the lengths of all their children.

The implementation of this compiler phase can be found here, in elle/ElleCompiler.thy.

After this phase (and throughout the Elle compiler thereafter), syntax trees will be proven to conform to the following predicates ll_valid_q and ll_validl_q (on elaborated Elle syntax trees and lists of elaborated Elle syntax trees, respectively) (the size-annotations are referred to as “quantitative annotations”, and abbreviated by “q” or “(x, x’)”, throughout). These predicates can be found here, in elle/ElleCorrect/Qvalid.thy.

Note that, other than for the s annotations on JUMP and JUMPI, none of the syntax tree node annotations have any impact on the size of the generated code corresponding to a node (thus, no effect on the inference rules for ll_validl_q). This makes sense, as they correspond purely to compile-time artifacts that are not present in the generated code.

Phase 2 - Finding Labels

In the second phase of compilation, we enforce the invariant that each Seq node has exactly 0 or 1 descended labels, as defined according to the ll3'_descend predicate, which can be found in elle/ElleCorrect/Valid3.thy, here.

For this phase of the compiler, we traverse the Elle syntax tree. At each Seq node, we scan the sub-tree for all descended Lab nodes with an index “pointing back up” at the Seq node we are currently considering. If we find no such nodes, we mark the Seq node with an annotation indicating there is no label (an empty list). Otherwise, we take the first Lab node we find (in a preorder traversal), annotate it as having been “consumed” by a Seq node, and store the path to that label at the Seq node. In order to guard against multiple labels “pointing up” at the same Seq scope, this compiler pass fails if it ever encounters a Lab node that has not been consumed in its top-level traversal of the tree (since such a node corresponds either to a nonexistent scope, or to a scope which already has a label corresponding to it).

The compiler pass is implemented in the function ll3_assign_label in elle/ElleCorrect/ElleCompiler.thy, here

For reasoning about this phase and subsequent phases, we use the aforementioned``ll_descend`` predicate, which relates two Elle syntax trees and one list of natural numbers. This predicate captures situations in which the syntax tree l2 can be found as a descendant of l1 by treating k as a path through the tree, selecting which child-node to choose at each step.

After the second phase of compilation (if successful), syntax trees will be proven to conform to the inductive predicate ll_valid3', which can be found here, also in elle/ElleCorrect/Valid3.thy. This predicate essentially corresponds to the intuition that each Sequence node has exactly zero or one labels referencing it, and that the locations of these labels are annotated on the sequence nodes.

Phase 3 - Resolving Jumps

Once we have located the unique corresponding label (or determined the nonexistence of such a label) for each sequence node in the second phase, we need to calculate target addresses for each jump node based on the locations of those labels.

This process involves, essentially, a fixed-point calculation over the Elle syntax tree, in order to ensure that sufficient space has been allocated to store the needed address at each Jump and JumpI node. This process is captured by the function process_jumps_loop, which can be found in elle/ElleCorrect/ElleCompiler.thy, here.

process_jumps_loop makes use of two auxiliary functions. The first is process_jumps, which captures one iteration of the size checks involved in the jump-resolution process. process_jumps returns one of three cases of result: either Success if all jumps have sufficient size to store their corresponding addresses, Fail if there is an un-recoverable error (such as invalid input) and Bump if a node in the tree has a jump-target size that needs to be incremented.

At each Seq node, process_jumps checks the node’s annotation to see if there is a corresponding label. If there isn’t one, process_jumps scans all descended Jump nodes to make sure there are no descended jumps that point to that Seq node (as such jumps would have no target), failing if it finds any. If there is a label according to the annotation, process_jumps looks up that label to find its address (failing if there isn’t one to be found), and then runs on that sequence node’s descendants (doing an in-order traversal) to find all jumps that point to the scope corresponding to this sequence node. If any are found, process_jumps checks the space allocated to that jump node against the space required to encode the address from the label that was looked up previously for the Seq node.

If there is enough space, process_jumps continues scanning the tree for other jumps corresponding to the same Seq node and performing the same check, ultimately returning Success if all of them have enough space. Otherwise, it returns the absolute location (as a childpath) of the first Jump node without enough space in the form of a Bump result.

(For Seq nodes descended from the root, process_jumps first performs these checks for jumps pointing up to the outer Seq node, then recursively performs the same checks on the descended Seq node.)

The second auxiliary function used by process_jumps_loop is inc_jump, which takes a path (corresponding to a Jump node) returned by process_jumps and increments its size, adjusting the size annotations of the rest of the tree in the process as appropriate.

To avoid a complicated termination argument for process_jumps_loop (functions in Isabelle need to be proven to terminate or they become very inconvenient to reason about), the execution of this function is “fuelled” (termination is justified by a decreasing natural-number argument, which is decremented once each time the loop is run - thus, once per time a Jump node’s size needs to be incremented). If this fuel parameter is 0, process_jumps_loop returns a failure (None) Otherwise, runs process_jumps on the root of the Elle syntax tree given as an argument. If process_jumps returns Success, process_jumps_loop returns the input syntax tree as nothing needs to be done. If process_jumps returns Failure, process_jumps_loop also fails (returns None). Otherwise, if process_jumps returns Bump, process_jumps_loop calls inc_jump on the child-path returned by process_jumps, and then calls process_jumps_loop on the same arguments (with fuel parameter decremented).

The correctness of process_jumps_loop is established by a series of validation passes that happen after it runs. However, process_jumps_loop is proven directly to produce valid_q results from valid_q inputs. Additionally, we define a function, get_process_jumps_fuel, which calculates a sufficient amount of fuel to ensure that process_jumps_loop terminates on its input (although this is not formally established with a proof).

By the end of running process_jumps_loop, we have a syntax tree that should obey the predicate ll4_validate_jump_targets. This predicate essentially makes sure that the indices of jump nodes (which point to the sequence node corresponding to the jump; i.e., to the scope the jump’s target is in) correspond to a scope whose label has an address matching the address stored at the jump node (which is the address that will ultimately be written out to bytecode).

The definition of ll4_validate_jump_targets can be found in elle/ElleCorrect/Valid4.thy, here.

The Big Picture

At this point, we have produced a syntax tree that is valid as an ll4 syntax tree, yet meets all of the predicates described above. In its final form, ll4 contains all the information needed to generate concrete EVM machine code, including concrete addresses. At this point, codegen' is used to emit a list of bytes corresponding to the output bytecode. (codegen can be found in elle/ElleCompiler.thy, here)

An additional validation step is used after this point to ensure that all jumps are encodable in EVM (that is, their addresses are at least 1 byte and not more than 32 bytes). Code for these extra validators can be found in elle/ElleCorrect/ElleAltSemantics.thy, here. Examples of how all these pieces may be put together into a single verified compilation pipeline can be found in elle/ElleCompilerVerified.thy (here)

In the next section, Correctness, we will sketch the process by which the generated EVM instructions are proven correct with respect to the input program, making use of the information contained in these intermediate predicates.