Contents
- Introduction
- Compiler Architecture
- Program Semantics
- Type Inference
- Intraprocedural Dataflow Analysis
- Pointer Analysis
- Information Flow
Introduction
Benefits of program analysis:
- Type checking, including classes, interfaces, traits
- Type inference (polymorphic, flow-sensitive)
- Pointer analysis (ownership, nullability)
- Dataflow optimizations in the assembly code
- Coverage-guided fuzzing
- JIT compilation
- Parallelization and differentiation
There are limits to program analysis. For example, the halting problem is undecidable, and Rice's theorem extends that to most other interesting properties.
Compiler Architecture
Abstract Syntax Tree (AST)
ASTs are the output of parsing a program.
Intermediate Representation (IR)
IRs are used for program analysis, such as type-checking, type-inference, and optimizations. It may also convert the code into a useful bytecode representation. There can be multiple layers of IRs.
One way to optimize code is through a control flow graph (CFG). A CFG is a directed graph of basic blocks. Basic blocks are defined by a series of statements and a final terminator.
L1:
t0 <- 2
t1 <- f(t0)
t2 <- t0 + t1
JMP(L2)
It might also be necessary to translate the language into a bytecode format recognizable by other runtime backends. For example, WASM.
Reference: Brown CSCI 1260 IRs and Register Allocation Notes, CMU 15-411 IR Notes,
Program Semantics
There are formal ways to describe the syntax, semantics, and type system of a programming language.
Syntax
The syntax of a simple language can be described using metavariables and grammars. Metavariables describe different categories of syntax. Grammars describe how to construct valid syntax in the language.
Here are some simple syntax rules:
Notice that some branches are like base cases, such as and , and other branches are recursive, like .
Semantics
The semantics of a language can be described using operational semantics.
Type Safety
An expression is well-typed if it can be assigned a type according to the typing rules of the language. Type safety means that a well-typed expression can't go wrong.
Type safety is guaranteed with two properties:
- Progress ensures that a program will either be a final value or can take another step of evaluation. If , then either or .
- Preservation ensures that a program will have the same type after taking a step of evaluation. If and , then .
Intraprocedural Dataflow Analysis
Dataflow Analysis
A dataflow analysis is used to get information about the the program at different nodes of the CFG. For example, we can figure out if there is:
- Division by 0
- Assignment to unused variables
- Dead code
- Live variables
Generally, dataflow analysis is performed on an intermediate representation (IR) of the program.
Rvalues are expressions that produce a value. Assignments generally have the form <MemLoc> = <Rvalue>
.
Monotonic Frameworks
Abstract values represent the information we have about a variable at a particular point in execution. means that we have no information and means that we have too much information.
Abstract values can be represented with a lattice structure. Lattices have a join function which produces the least upper bound for any two elements. A set is a join-semilattice if
- exists
- contains a greatest element
The domain is a map between program variables and abstract values . The transfer function describes how changes after executing an instruction.
The goal of the analysis is to compute, for each node , the domain before executing the instruction and after executing the instruction.
The system of equations can be solved using a fixed-point iteration algorithm, such as the worklist algorithm.
let sigma_bot = {x -> bot for x in variables}
let sigma_in = [sigma_bot, .., sigma_bot]
let sigma_out = [sigma_bot, .., sigma_bot]
let worklist = [1, .., n]
while let Some(i) = worklist.dequeue() {
sigma_in[i] = union(predecessors(i).map(|p| sigma_out[p]))
sigma_out[i] = f(p_i, sigma_in[i])
if sigma_out[i] changed {
worklist.enqueue(successors(i))
}
}
Pointer Analysis
To handle pointers, we need to extend our IR.
Information Flow
post-dominates (pdom) if appears on every path from to the exit node. In other words, after executing , we always execute .
is control-dependent (cdep) on if post-dominates a child of , but not . That is to say, post-dominates some, but not all children of . In other words, whether or not we execute depends on the outcome of .
for all edges y -> x:
for x in rpo(pdom tree):
// x is a child of y, and thus trivially pdoms a child of y
// We need to check whether there is another child of y
if x does not pdom y, then x cdep y
for each node z immediately pdom by x:
for each node y such that z cdep y:
if x does not pdom y, then x cdep y