Introduction
Benefits of program analysis:
- Type checking, including classes, interfaces, traits
- Type inference (polymorphic, flow-sensitive)
- Pointer analysis (ownership, nullability)
- Dataflow optimizations in the assembly code
- Coverage-guided fuzzing
- JIT compilation
- Parallelization and differentiation
There are limits to program analysis. For example, the halting problem is undecidable, and Rice's theorem extends that to most other interesting properties.
Compiler Architecture
Abstract Syntax Tree (AST)
ASTs are the output of parsing a program.
Intermediate Representation (IR)
IRs are used for program analysis, such as type-checking, type-inference, and optimizations. It may also convert the code into a useful bytecode representation. There can be multiple layers of IRs.
One way to optimize code is through a control flow graph (CFG). A CFG is a directed graph of basic blocks. Basic blocks are defined by a series of statements and a final terminator.
L1:
t0 <- 2
t1 <- f(t0)
t2 <- t0 + t1
JMP(L2)
It might also be necessary to translate the language into a bytecode format recognizable by other runtime backends. For example, WASM.
Reference: Brown CSCI 1260 IRs and Register Allocation Notes, CMU 15-411 IR Notes,
Program Semantics
There are formal ways to describe the syntax, semantics, and type system of a programming language.
Syntax
The syntax of a simple language can be described using metavariables and grammars. Metavariables describe different categories of syntax. Grammars describe how to construct valid syntax in the language.
Here are some simple syntax rules:
Notice that some branches are like base cases, such as and , and other branches are recursive, like .
Semantics
The semantics of a language can be described using operational semantics.
Type Safety
An expression is well-typed if it can be assigned a type according to the typing rules of the language. Type safety means that a well-typed expression can't go wrong.
Type safety is guaranteed with two properties:
- Progress ensures that a program will either be a final value or can take another step of evaluation. If , then either or .
- Preservation ensures that a program will have the same type after taking a step of evaluation. If and , then .
Intraprocedural Dataflow Analysis
A dataflow analysis is used to get information about the the program at different nodes of the CFG. For example, we can figure out if there is:
- Division by 0
- Assignment to unused variables
- Dead code
- Live variables
Generally, dataflow analysis is performed on an intermediate representation (IR) of the program.
Rvalues are expressions that produce a value. Bytecode instructions have the form <Place> = <Rvalue>.
Lattices
A lattice is a set of abstract values with a partial order . Lattices have a join function which produces the least upper bound for any two elements. A set is a join-semilattice if
- exists
- contains a greatest element
Monotonic Frameworks
There are 3 components to a dataflow analysis:
- is a lattice containing abstract values and a join function
- is a map from program values to abstract values
- is a transfer function that describes how the changes after executing an instruction
The goal of the analysis is to compute, for each node , the domain before executing the instruction and after executing the instruction.
The system of equations can be solved using a fixed-point iteration algorithm, such as the worklist algorithm.
let sigma_bot = {x -> bot for x in variables}
let sigma_in = [sigma_bot, .., sigma_bot]
let sigma_out = [sigma_bot, .., sigma_bot]
let worklist = [1, .., n]
while let Some(i) = worklist.dequeue() {
sigma_in[i] = union(predecessors(i).map(|p| sigma_out[p]))
sigma_out[i] = f(p_i, sigma_in[i])
if sigma_out[i] changed {
worklist.enqueue(successors(i))
}
}
Constant Analysis
The lattice is . The domain is .
The transfer function is defined as:
Constant analysis is used for constant propagation.
Live-Variable Analysis
The lattice is the powerset of all variables. The domain is .
The transfer function removes the left-hand side variable from the set and adds any variables used on the right-hand side.
Live-variable analysis is used for dead-code elimination.
Pointer Analysis
Consider the following code:
...
*x = 42
*y = 23
z = *x
It is not possible to determine the value of z unless we know whether x and y point toward the same allocation. This is where pointer analysis comes in.
The goal of pointer analysis is to map variables to sets of possible allocations they can point to.
To handle pointers, we need to extend our IR.
Escape Analysis
Information Flow
post-dominates (pdom) if appears on every path from to the exit node. In other words, after executing , we always execute .
is control-dependent (cdep) on if post-dominates a child of , but not . That is to say, post-dominates some, but not all children of . In other words, whether or not we execute depends on the outcome of .
for all edges y -> x:
for x in rpo(pdom tree):
// x is a child of y, and thus trivially pdoms a child of y
// We need to check whether there is another child of y
if x does not pdom y, then x cdep y
for each node z immediately pdom by x:
for each node y such that z cdep y:
if x does not pdom y, then x cdep y
Taint Analysis
The goal of taint analysis is to identify information flows from secure sources to insecure sinks.
Information can flow in 2 different ways:
- An explicit flow occurs when a variable is directly assigned a value from a tainted source
- An implicit flow occurs when the control flow of the program depends on a tainted source
#[secure] fn secure() -> string { "Secret" }
fn explicit_flow() {
let s = secure() in
println(s)
}
fn implicit_flow() {
let p = secure() in
if p == "Secret" {
println!("Done!")
}
}