Program Analysis Notes

Introduction

Benefits of program analysis:

  1. Type checking, including classes, interfaces, traits
  2. Type inference (polymorphic, flow-sensitive)
  3. Pointer analysis (ownership, nullability)
  4. Dataflow optimizations in the assembly code
  5. Coverage-guided fuzzing
  6. JIT compilation
  7. Parallelization and differentiation

There are limits to program analysis. For example, the halting problem is undecidable, and Rice's theorem extends that to most other interesting properties.

Compiler Architecture

Abstract Syntax Tree (AST)

ASTs are the output of parsing a program.

Intermediate Representation (IR)

IRs are used for program analysis, such as type-checking, type-inference, and optimizations. It may also convert the code into a useful bytecode representation. There can be multiple layers of IRs.

One way to optimize code is through a control flow graph (CFG). A CFG is a directed graph of basic blocks. Basic blocks are defined by a series of statements and a final terminator.

L1:
  t0 <- 2
  t1 <- f(t0)
  t2 <- t0 + t1
  JMP(L2) 

It might also be necessary to translate the language into a bytecode format recognizable by other runtime backends. For example, WASM.

Reference: Brown CSCI 1260 IRs and Register Allocation Notes, CMU 15-411 IR Notes,

Program Semantics

There are formal ways to describe the syntax, semantics, and type system of a programming language.

Syntax

The syntax of a simple language can be described using metavariables and grammars. Metavariables describe different categories of syntax. Grammars describe how to construct valid syntax in the language.

Here are some simple syntax rules:

Number nZString sBool b::=falsetrueVariable xConst c::=nsbExpr e::=ce1e2let x=e1 in e2xType τ::=intstringboolBinop ::=+^=\newcommand{\msf}[1]{\mathsf{#1}} \newcommand{\s}{\hspace{2em}} \newcommand{\efalse}{\msf{false}} \newcommand{\etrue}{\msf{true}} \newcommand{\tyint}{\msf{int}} \newcommand{\tystring}{\msf{string}} \newcommand{\tybool}{\msf{bool}} \newcommand{\concat}{\mathop{\textrm{\textasciicircum}}} \newcommand{\elet}[3]{\msf{let}~{#1} = {#2} ~ \msf{in} ~ {#3}} \begin{aligned} \msf{Number}~n &\in \mathbb{Z} \s \msf{String}~s \s \msf{Bool}~b ::= \efalse \mid \etrue \s \msf{Variable}~x \\ \msf{Const}~c &::= n \mid s \mid b \s \\ \msf{Expr}~e &::= c \mid e_1 \oplus e_2 \mid \elet{x}{e_1}{e_2} \mid x \\ \msf{Type}~\tau &::= \tyint \mid \tystring \mid \tybool \\ \msf{Binop}~\oplus &::= {+} \mid {\concat} \mid {=} \end{aligned}

Notice that some branches are like base cases, such as cc and xx, and other branches are recursive, like e1e2e_1 \oplus e_2.

Semantics

The semantics of a language can be described using operational semantics.

Type Safety

An expression is well-typed if it can be assigned a type according to the typing rules of the language. Type safety means that a well-typed expression can't go wrong.

Type safety is guaranteed with two properties:

  1. Progress ensures that a program will either be a final value or can take another step of evaluation. If e:τe : \tau, then either e vale ~\mathsf{val} or eee \hookrightarrow e'.
  2. Preservation ensures that a program will have the same type after taking a step of evaluation. If e:τe : \tau and eee \hookrightarrow e', then e:τe' : \tau.

Intraprocedural Dataflow Analysis

A dataflow analysis is used to get information about the the program at different nodes of the CFG. For example, we can figure out if there is:

  1. Division by 0
  2. Assignment to unused variables
  3. Dead code
  4. Live variables

Generally, dataflow analysis is performed on an intermediate representation (IR) of the program.

Program P::=IInstr I::=p=rvgoto iif p goto i1 else i2Place p::=xRvalue rv::=cpp1p2\newcommand{\msf}[1]{\mathsf{#1}} \newcommand{\s}{\hspace{2em}} \newcommand{\eif}[3]{\msf{if}~{#1}~\msf{goto}~{#2}~\msf{else}~{#3}} \begin{aligned} \msf{Program}~P &::= I^* \\ \msf{Instr}~I &::= p = rv \mid \msf{goto}~i \mid \eif{p}{i_1}{i_2} \\ \msf{Place}~p &::= x \\ \msf{Rvalue}~rv &::= c \mid p \mid p_1 \oplus p_2 \\ \end{aligned}

Rvalues are expressions that produce a value. Bytecode instructions have the form <Place> = <Rvalue>.

Lattices

A lattice is a set of abstract values with a partial order \sqsubseteq. Lattices have a join function \sqcup which produces the least upper bound for any two elements. A set L\mathcal{L} is a join-semilattice if

  1. x,yL,xy\forall x, y \in \mathcal{L}, x \sqcup y exists
  2. L\mathcal{L} contains a greatest element \top

Monotonic Frameworks

There are 3 components (L,σ,f)(\mathcal{L}, \sigma, f) to a dataflow analysis:

  1. L\mathcal{L} is a lattice containing abstract values and a join function
  2. σ:VarL\sigma : \mathsf{Var} \mapsto \mathcal{L} is a map from program values to abstract values
  3. f:Iσσf : I \to \sigma \to \sigma is a transfer function that describes how the σ\sigma changes after executing an instruction

The goal of the analysis is to compute, for each node nn, the domain σnin\sigma_n^{\text{in}} before executing the instruction and σnout\sigma_n^{\text{out}} after executing the instruction.

σiin=jpred(i)σjout\sigma^{\text{in}}_i = \sqcup_{j \in \text{pred}(i)} \sigma^{\text{out}}_j σiout=f(Ii,σiin)\sigma^{\text{out}}_i = f(I_i, \sigma^{\text{in}}_i)

The system of equations can be solved using a fixed-point iteration algorithm, such as the worklist algorithm.

let sigma_bot = {x -> bot for x in variables}
let sigma_in = [sigma_bot, .., sigma_bot]
let sigma_out = [sigma_bot, .., sigma_bot]
let worklist = [1, .., n]

while let Some(i) = worklist.dequeue() {
    sigma_in[i] = union(predecessors(i).map(|p| sigma_out[p]))
    sigma_out[i] = f(p_i, sigma_in[i])
    if sigma_out[i] changed {
      worklist.enqueue(successors(i))
    }
}

Constant Analysis

The lattice is L={,}Const\mathcal{L} = \{\top, \bot\} \cup \mathsf{Const}. The domain is σ:PlaceL\sigma : \mathsf{Place} \mapsto \mathcal{L} .

The transfer function is defined as:

  • f(x=c,σ)=σ[xc]f(x = c, \sigma) = \sigma[x \mapsto c]
  • f(x=y,σ)=σ[xσ[y]]f(x = y, \sigma) = \sigma[x \mapsto \sigma[y]]
  • f(x=yz,σ)=σ[xeval(σ(y)σ(z))]f(x = y \oplus z, \sigma) = \sigma[x \mapsto \text{eval}(\sigma(y) \oplus \sigma(z))]
  • f(goto i,σ)=σf(\mathsf{goto}~i, \sigma) = \sigma
  • f(if p goto i1 else i2)=σf(\mathsf{if}~{p}~\mathsf{goto}~{i_1}~\mathsf{else}~{i_2}) = \sigma

Constant analysis is used for constant propagation.

Live-Variable Analysis

The lattice is the powerset of all variables. The domain is σ:PlaceL\sigma : \mathsf{Place} \mapsto \mathcal{L}.

The transfer function removes the left-hand side variable from the set and adds any variables used on the right-hand side.

  • f(x=rv,σ)=(σ{x})uses(rv)f(x = rv, \sigma) = (\sigma \setminus \{x\}) \cup \mathsf{uses}(rv)
  • f(goto i,σ)=σf(\mathsf{goto}~i, \sigma) = \sigma
  • f(if p goto i1 else i2)=σuses(p)f(\mathsf{if}~{p}~\mathsf{goto}~{i_1}~\mathsf{else}~{i_2}) = \sigma \cup \mathsf{uses}(p)

Live-variable analysis is used for dead-code elimination.

Pointer Analysis

Consider the following code:

...
*x = 42
*y = 23
z = *x

It is not possible to determine the value of z unless we know whether x and y point toward the same allocation. This is where pointer analysis comes in.

The goal of pointer analysis is to map variables to sets of possible allocations they can point to.

To handle pointers, we need to extend our IR.

Instr I::=...alloc(p1,...,pn)Place p::=x(.i)Allocation APointer ptr::=xA.iPointsTo θ=ptrARvalue rv::=...allocA(p)MemLoc l::=xA.iA[_]\newcommand{\msf}[1]{\mathsf{#1}} \newcommand{\s}{\hspace{2em}} \newcommand{\eif}[3]{\msf{if}~{#1}~\msf{goto}~{#2}~\msf{else}~{#3}} \begin{aligned} \msf{Instr}~I &::= ... \mid \msf{alloc(p_1, ..., p_n)} \\ \msf{Place}~p &::= x(.i)^* \\ \msf{Allocation}~A \\ \msf{Pointer}~ptr &::= x \mid A.i \\ \msf{PointsTo}~\theta &= { ptr \to { A }} \\ \msf{Rvalue}~rv &::= ... \mid \msf{alloc}^A(p^*) \\ \msf{MemLoc}~l &::= x \mid \mathcal{A}.i \mid \mathcal{A}[\_] \\ \end{aligned}

Escape Analysis

Information Flow

xx post-dominates (pdom) yy if xx appears on every path from yy to the exit node. In other words, after executing yy, we always execute xx.

xx is control-dependent (cdep) on yy if xx post-dominates a child of yy, but not yy. That is to say, xx post-dominates some, but not all children of yy. In other words, whether or not we execute xx depends on the outcome of yy.

for all edges y -> x:
    for x in rpo(pdom tree):
        // x is a child of y, and thus trivially pdoms a child of y
        // We need to check whether there is another child of y
        if x does not pdom y, then x cdep y
    
    for each node z immediately pdom by x:
        for each node y such that z cdep y:
            if x does not pdom y, then x cdep y

Taint Analysis

The goal of taint analysis is to identify information flows from secure sources to insecure sinks.

Information can flow in 2 different ways:

  1. An explicit flow occurs when a variable is directly assigned a value from a tainted source
  2. An implicit flow occurs when the control flow of the program depends on a tainted source
#[secure] fn secure() -> string { "Secret" }

fn explicit_flow() {
  let s = secure() in 
  println(s)
}

fn implicit_flow() {
  let p = secure() in
  if p == "Secret" {
    println!("Done!")
  }
}