Implementation¶
This page describes some salient aspects of how the Elle compiler is implemented. Details about the implementation of the correctess proof are deferred until Correctness.
In Elle Syntax, we described the syntax of the Elle language from the user’s
perspective. However, internally, Elle uses a series of annotations to
describe Elle programs at various stages of compilation. These can be found
in elle/ElleSyntax.thy. This more elaborated representation takes the form
of general datatype
with extension points (type parameters) into which we can insert annotations later,
along with the specific syntax extensions
used at various stages of the compiler.
The Elle compiler proceeds in several phases, outlined in the remainder of this document. Note that links in this
file are to the (frozen) ITP2019 branch of the repository, to ensure consistency of line numbers as master evolves.
Though the line numbers may change, the general ideas should not.
Phase 1 - Generating Size Annotations¶
As a first step, the Elle compiler generates a pair of integer annotations for each node in the syntax tree given to the compiler as input. These annotations correspond to the range of bytes taken up by the code that will be generated from the syntax tree in the program buffer. These are calculated as one would expect: instructions are simply the length of the encoding of the instruction as bytecode, labels correspond to the lengths of EVM JUMPDEST instructions, and sequence nodes have lengths equal to the sum of the lengths of all their children.
The implementation of this compiler phase can be found
here, in elle/ElleCompiler.thy.
After this phase (and throughout the Elle compiler thereafter), syntax
trees will be proven to conform to the following predicates
ll_valid_q and ll_validl_q (on
elaborated Elle syntax trees and lists of elaborated
Elle syntax trees, respectively)
(the size-annotations are referred to as “quantitative annotations”,
and abbreviated by “q” or “(x, x’)”, throughout). These predicates
can be found here, in
elle/ElleCorrect/Qvalid.thy.
Note that, other than for the s annotations on JUMP and JUMPI,
none of the syntax tree node annotations have any impact on the size of the
generated code corresponding to a node (thus, no effect on the inference rules
for ll_validl_q). This makes sense, as they correspond purely to compile-time
artifacts that are not present in the generated code.
Phase 2 - Finding Labels¶
In the second phase of compilation, we enforce the invariant that
each Seq node has exactly 0 or 1 descended labels, as defined
according to the ll3'_descend predicate, which can be found in
elle/ElleCorrect/Valid3.thy, here.
For this phase of the compiler, we traverse the Elle syntax tree.
At each Seq node, we scan the sub-tree for all descended Lab
nodes with an index “pointing back up” at the Seq node we are
currently considering. If we find no such nodes, we mark the
Seq node with an annotation indicating there is no label
(an empty list). Otherwise, we take the first Lab node we
find (in a preorder traversal),
annotate it as having been “consumed” by a Seq node,
and store the path to that label at the Seq node. In order to
guard against multiple labels “pointing up” at the same Seq
scope, this compiler pass fails if it ever encounters a Lab node
that has not been consumed in its top-level traversal of the tree
(since such a node corresponds either to a nonexistent scope,
or to a scope which already has a label corresponding to it).
The compiler pass is implemented in the function ll3_assign_label in
elle/ElleCorrect/ElleCompiler.thy, here
For reasoning about this phase and subsequent phases,
we use the aforementioned``ll_descend`` predicate, which relates
two Elle syntax trees and one list of natural numbers. This
predicate captures situations in which the syntax tree
l2 can be found as a descendant of l1 by treating
k as a path through the tree, selecting which child-node
to choose at each step.
After the second phase of compilation (if successful),
syntax trees will be proven to conform to the
inductive predicate ll_valid3', which can be found
here, also in
elle/ElleCorrect/Valid3.thy. This predicate essentially corresponds to the
intuition that each Sequence node has exactly
zero or one labels referencing it, and that the locations
of these labels are annotated on the sequence nodes.
Phase 3 - Resolving Jumps¶
Once we have located the unique corresponding label (or determined the nonexistence of such a label) for each sequence node in the second phase, we need to calculate target addresses for each jump node based on the locations of those labels.
This process involves, essentially, a fixed-point calculation over the Elle syntax tree,
in order to ensure that sufficient space has been allocated to store the needed address
at each Jump and JumpI node.
This process is captured by the function
process_jumps_loop, which can be found in elle/ElleCorrect/ElleCompiler.thy,
here.
process_jumps_loop makes use of two auxiliary functions. The first is
process_jumps, which captures one iteration of the size checks involved in
the jump-resolution process. process_jumps returns one of three cases of
result: either Success if all jumps have sufficient size to store their corresponding
addresses, Fail if there is an un-recoverable error (such as invalid input)
and Bump if a node in the tree has a jump-target size that needs to be incremented.
At each Seq node, process_jumps checks the node’s annotation to see if there
is a corresponding label. If there isn’t one, process_jumps scans all descended
Jump nodes to make sure there are no descended jumps that point to that Seq node
(as such jumps would have no target), failing if it finds any.
If there is a label according to the annotation,
process_jumps looks up that label to find its address
(failing if there isn’t one to be found), and then
runs on that sequence node’s descendants (doing an in-order traversal)
to find all jumps that point to the scope corresponding
to this sequence node. If any are found, process_jumps checks the space allocated to that jump
node against the space required to encode the address from the label that was looked up previously
for the Seq node.
If there is enough space, process_jumps continues scanning the
tree for other jumps corresponding to the same Seq node and performing the same check,
ultimately returning Success if all of them have enough space. Otherwise, it returns
the absolute location (as a childpath)
of the first Jump node without enough space in the form of a
Bump result.
(For Seq nodes descended from the root, process_jumps first performs these checks for
jumps pointing up to the outer Seq node, then recursively performs the same checks on the
descended Seq node.)
The second auxiliary function used by process_jumps_loop is inc_jump, which takes
a path (corresponding to a Jump node) returned by process_jumps and increments its size,
adjusting the size annotations of the rest of the tree in the process as appropriate.
To avoid a complicated termination argument for
process_jumps_loop (functions in Isabelle need to be proven to terminate
or they become very inconvenient to reason about),
the execution of
this function is “fuelled” (termination is justified by a decreasing natural-number argument,
which is decremented once each time the loop is run - thus, once per time a Jump node’s
size needs to be incremented).
If this fuel parameter is 0, process_jumps_loop returns a failure (None)
Otherwise, runs process_jumps on the root
of the Elle syntax tree given as an argument. If process_jumps returns
Success, process_jumps_loop returns the input syntax tree as nothing
needs to be done. If process_jumps returns Failure,
process_jumps_loop
also fails (returns None). Otherwise, if process_jumps returns Bump,
process_jumps_loop calls inc_jump on the child-path returned by
process_jumps, and then calls process_jumps_loop on the same arguments
(with fuel parameter decremented).
The correctness of process_jumps_loop is established by a series of
validation passes that happen after it runs. However, process_jumps_loop
is proven directly to produce valid_q results from valid_q inputs.
Additionally, we define a function, get_process_jumps_fuel, which
calculates a sufficient amount of fuel to ensure that process_jumps_loop
terminates on its input (although this is not formally established with a proof).
By the end of running process_jumps_loop, we have a syntax tree that should obey
the predicate ll4_validate_jump_targets. This predicate essentially makes sure that
the indices of jump nodes (which point to the sequence node corresponding to the jump; i.e.,
to the scope the jump’s target is in) correspond to a scope whose label has an address
matching the address stored at the jump node (which is the address that will ultimately be
written out to bytecode).
The definition of ll4_validate_jump_targets can be found in
elle/ElleCorrect/Valid4.thy, here.
The Big Picture¶
At this point, we have produced a syntax tree that is valid as an
ll4 syntax tree, yet meets all of the predicates described above.
In its final form, ll4 contains all the information needed to generate
concrete EVM machine code, including concrete addresses. At this point,
codegen' is used to emit a list of bytes corresponding to the
output bytecode. (codegen can be found in elle/ElleCompiler.thy,
here)
An additional validation
step is used after this point to ensure that all jumps are encodable
in EVM (that is, their addresses are at least 1 byte and not more than 32
bytes). Code for these extra validators can be found
in elle/ElleCorrect/ElleAltSemantics.thy, here.
Examples of how all these pieces may be put together into a single verified
compilation pipeline can be found in elle/ElleCompilerVerified.thy (here)
In the next section, Correctness, we will sketch the process by which the generated EVM instructions are proven correct with respect to the input program, making use of the information contained in these intermediate predicates.