Topics in Programming Languages Notes

Contents

  1. Introduction
  2. Compiler Architecture
  3. Program Semantics
  4. Type Inference
  5. Intraprocedural Dataflow Analysis
  6. Pointer Analysis
  7. Information Flow

Introduction

Benefits of program analysis:

  1. Type checking, including classes, interfaces, traits
  2. Type inference (polymorphic, flow-sensitive)
  3. Pointer analysis (ownership, nullability)
  4. Dataflow optimizations in the assembly code
  5. Coverage-guided fuzzing
  6. JIT compilation
  7. Parallelization and differentiation

There are limits to program analysis. For example, the halting problem is undecidable, and Rice's theorem extends that to most other interesting properties.

Compiler Architecture

Abstract Syntax Tree (AST)

ASTs are the output of parsing a program.

Intermediate Representation (IR)

IRs are used for program analysis, such as type-checking, type-inference, and optimizations. It may also convert the code into a useful bytecode representation. There can be multiple layers of IRs.

One way to optimize code is through a control flow graph (CFG). A CFG is a directed graph of basic blocks. Basic blocks are defined by a series of statements and a final terminator.

L1:
  t0 <- 2
  t1 <- f(t0)
  t2 <- t0 + t1
  JMP(L2) 

It might also be necessary to translate the language into a bytecode format recognizable by other runtime backends. For example, WASM.

Reference: Brown CSCI 1260 IRs and Register Allocation Notes, CMU 15-411 IR Notes,

Program Semantics

There are formal ways to describe the syntax, semantics, and type system of a programming language.

Syntax

The syntax of a simple language can be described using metavariables and grammars. Metavariables describe different categories of syntax. Grammars describe how to construct valid syntax in the language.

Here are some simple syntax rules:

Number nZString sBool b::=falsetrueVariable xConst c::=nsbExpr e::=ce1e2let x=e1 in e2xType τ::=intstringboolBinop ::=+^=\newcommand{\msf}[1]{\mathsf{#1}} \newcommand{\s}{\hspace{2em}} \newcommand{\efalse}{\msf{false}} \newcommand{\etrue}{\msf{true}} \newcommand{\tyint}{\msf{int}} \newcommand{\tystring}{\msf{string}} \newcommand{\tybool}{\msf{bool}} \newcommand{\concat}{\mathop{\textrm{\textasciicircum}}} \newcommand{\elet}[3]{\msf{let}~{#1} = {#2} ~ \msf{in} ~ {#3}} \begin{aligned} \msf{Number}~n &\in \mathbb{Z} \s \msf{String}~s \s \msf{Bool}~b ::= \efalse \mid \etrue \s \msf{Variable}~x \\ \msf{Const}~c &::= n \mid s \mid b \s \\ \msf{Expr}~e &::= c \mid e_1 \oplus e_2 \mid \elet{x}{e_1}{e_2} \mid x \\ \msf{Type}~\tau &::= \tyint \mid \tystring \mid \tybool \\ \msf{Binop}~\oplus &::= {+} \mid {\concat} \mid {=} \end{aligned}

Notice that some branches are like base cases, such as cc and xx, and other branches are recursive, like e1e2e_1 \oplus e_2.

Semantics

The semantics of a language can be described using operational semantics.

Type Safety

An expression is well-typed if it can be assigned a type according to the typing rules of the language. Type safety means that a well-typed expression can't go wrong.

Type safety is guaranteed with two properties:

  1. Progress ensures that a program will either be a final value or can take another step of evaluation. If e:τe : \tau, then either e vale ~\mathsf{val} or eee \hookrightarrow e'.
  2. Preservation ensures that a program will have the same type after taking a step of evaluation. If e:τe : \tau and eee \hookrightarrow e', then e:τe' : \tau.

Intraprocedural Dataflow Analysis

Dataflow Analysis

A dataflow analysis is used to get information about the the program at different nodes of the CFG. For example, we can figure out if there is:

  1. Division by 0
  2. Assignment to unused variables
  3. Dead code
  4. Live variables

Generally, dataflow analysis is performed on an intermediate representation (IR) of the program.

Program P::=IInstr I::=p=rvgoto iif p goto i1 else i2Place p::=xMemLoc l::=xRvalue rv::=cpp1p2\newcommand{\msf}[1]{\mathsf{#1}} \newcommand{\s}{\hspace{2em}} \newcommand{\eif}[3]{\msf{if}~{#1}~\msf{goto}~{#2}~\msf{else}~{#3}} \begin{aligned} \msf{Program}~P &::= I^* \\ \msf{Instr}~I &::= p = rv \mid \msf{goto}~i \mid \eif{p}{i_1}{i_2} \\ \msf{Place}~p &::= x \\ \msf{MemLoc}~l &::= x \\ \msf{Rvalue}~rv &::= c \mid p \mid p_1 \oplus p_2 \\ \end{aligned}

Rvalues are expressions that produce a value. Assignments generally have the form <MemLoc> = <Rvalue>.

Monotonic Frameworks

Abstract values vv represent the information we have about a variable at a particular point in execution. \bot means that we have no information and \top means that we have too much information.

Abstract values can be represented with a lattice structure. Lattices have a join function \sqcup which produces the least upper bound for any two elements. A set L\mathcal{L} is a join-semilattice if

  1. x,yL,xy\forall x, y \in \mathcal{L}, x \sqcup y exists
  2. L\mathcal{L} contains a greatest element \top

The domain σ\sigma is a map between program variables xx and abstract values vv. The transfer function ff describes how σ\sigma changes after executing an instruction.

The goal of the analysis is to compute, for each node nn, the domain σnin\sigma_n^{\text{in}} before executing the instruction and σnout\sigma_n^{\text{out}} after executing the instruction.

σiin=jpred(i)σjout\sigma^{\text{in}}_i = \sqcup_{j \in \text{pred}(i)} \sigma^{\text{out}}_j σiout=f(Ii,σiin)\sigma^{\text{out}}_i = f(I_i, \sigma^{\text{in}}_i)

The system of equations can be solved using a fixed-point iteration algorithm, such as the worklist algorithm.

let sigma_bot = {x -> bot for x in variables}
let sigma_in = [sigma_bot, .., sigma_bot]
let sigma_out = [sigma_bot, .., sigma_bot]
let worklist = [1, .., n]

while let Some(i) = worklist.dequeue() {
    sigma_in[i] = union(predecessors(i).map(|p| sigma_out[p]))
    sigma_out[i] = f(p_i, sigma_in[i])
    if sigma_out[i] changed {
      worklist.enqueue(successors(i))
    }
}

Pointer Analysis

To handle pointers, we need to extend our IR.

MemLoc l::=xA.iA[_]Rvalue rv::=...allocA(p)Place\newcommand{\msf}[1]{\mathsf{#1}} \newcommand{\s}{\hspace{2em}} \newcommand{\eif}[3]{\msf{if}~{#1}~\msf{goto}~{#2}~\msf{else}~{#3}} \begin{aligned} \msf{MemLoc}~l &::= x \mid \mathcal{A}.i \mid \mathcal{A}[\_] \\ \msf{Rvalue}~rv &::= ... \mid \msf{alloc}^A(p^*) \\ \msf{Place} \end{aligned}

Information Flow

xx post-dominates (pdom) yy if xx appears on every path from yy to the exit node. In other words, after executing yy, we always execute xx.

xx is control-dependent (cdep) on yy if xx post-dominates a child of yy, but not yy. That is to say, xx post-dominates some, but not all children of yy. In other words, whether or not we execute xx depends on the outcome of yy.

for all edges y -> x:
    for x in rpo(pdom tree):
        // x is a child of y, and thus trivially pdoms a child of y
        // We need to check whether there is another child of y
        if x does not pdom y, then x cdep y
    
    for each node z immediately pdom by x:
        for each node y such that z cdep y:
            if x does not pdom y, then x cdep y