Topics in Programming Languages Notes

Introduction
Compiler Architecture
- Abstract Syntax Tree (AST)
- Intermediate Representation (IR)
Program Semantics
Type Inference
Intraprocedural Dataflow Analysis
- Dataflow Analysis
- Monotonic Frameworks
Pointer Analysis
Information Flow

Introduction

Benefits of program analysis:

Type checking, including classes, interfaces, traits
Type inference (polymorphic, flow-sensitive)
Pointer analysis (ownership, nullability)
Dataflow optimizations in the assembly code
Coverage-guided fuzzing
JIT compilation
Parallelization and differentiation

There are limits to program analysis. For example, the halting problem is undecidable, and Rice's theorem extends that to most other interesting properties.

Compiler Architecture

Abstract Syntax Tree (AST)

ASTs are the output of parsing a program.

Intermediate Representation (IR)

IRs are used for program analysis, such as type-checking, type-inference, and optimizations. It may also convert the code into a useful bytecode representation. There can be multiple layers of IRs.

One way to optimize code is through a control flow graph (CFG). A CFG is a directed graph of basic blocks. Basic blocks are defined by a series of statements and a final terminator.

L1:
  t0 <- 2
  t1 <- f(t0)
  t2 <- t0 + t1
  JMP(L2)

It might also be necessary to translate the language into a bytecode format recognizable by other runtime backends. For example, WASM.

Reference: Brown CSCI 1260 IRs and Register Allocation Notes, CMU 15-411 IR Notes,

Program Semantics

There are formal ways to describe the syntax, semantics, and type system of a programming language.

Syntax

The syntax of a simple language can be described using metavariables and grammars. Metavariables describe different categories of syntax. Grammars describe how to construct valid syntax in the language.

Here are some simple syntax rules:

$\newcommand{\msf}[1]{\mathsf{#1}} \newcommand{\s}{\hspace{2em}} \newcommand{\efalse}{\msf{false}} \newcommand{\etrue}{\msf{true}} \newcommand{\tyint}{\msf{int}} \newcommand{\tystring}{\msf{string}} \newcommand{\tybool}{\msf{bool}} \newcommand{\concat}{\mathop{\textrm{\textasciicircum}}} \newcommand{\elet}[3]{\msf{let}~{#1} = {#2} ~ \msf{in} ~ {#3}} \begin{aligned} \msf{Number}~n &\in \mathbb{Z} \s \msf{String}~s \s \msf{Bool}~b ::= \efalse \mid \etrue \s \msf{Variable}~x \\ \msf{Const}~c &::= n \mid s \mid b \s \\ \msf{Expr}~e &::= c \mid e_1 \oplus e_2 \mid \elet{x}{e_1}{e_2} \mid x \\ \msf{Type}~\tau &::= \tyint \mid \tystring \mid \tybool \\ \msf{Binop}~\oplus &::= {+} \mid {\concat} \mid {=} \end{aligned}$

Notice that some branches are like base cases, such as $c$ and $x$ , and other branches are recursive, like $e_1 \oplus e_2$ .

Semantics

The semantics of a language can be described using operational semantics.

Type Safety

An expression is well-typed if it can be assigned a type according to the typing rules of the language. Type safety means that a well-typed expression can't go wrong.

Type safety is guaranteed with two properties:

Progress ensures that a program will either be a final value or can take another step of evaluation. If $e : \tau$ , then either $e ~\mathsf{val}$ or $e \hookrightarrow e'$ .
Preservation ensures that a program will have the same type after taking a step of evaluation. If $e : \tau$ and $e \hookrightarrow e'$ , then $e' : \tau$ .

Intraprocedural Dataflow Analysis

Dataflow Analysis

A dataflow analysis is used to get information about the the program at different nodes of the CFG. For example, we can figure out if there is:

Division by 0
Assignment to unused variables
Dead code
Live variables

Generally, dataflow analysis is performed on an intermediate representation (IR) of the program.

$\newcommand{\msf}[1]{\mathsf{#1}} \newcommand{\s}{\hspace{2em}} \newcommand{\eif}[3]{\msf{if}~{#1}~\msf{goto}~{#2}~\msf{else}~{#3}} \begin{aligned} \msf{Program}~P &::= I^* \\ \msf{Instr}~I &::= p = rv \mid \msf{goto}~i \mid \eif{p}{i_1}{i_2} \\ \msf{Place}~p &::= x \\ \msf{MemLoc}~l &::= x \\ \msf{Rvalue}~rv &::= c \mid p \mid p_1 \oplus p_2 \\ \end{aligned}$

Rvalues are expressions that produce a value. Assignments generally have the form <MemLoc> = <Rvalue>.

Monotonic Frameworks

Abstract values $v$ represent the information we have about a variable at a particular point in execution. $\bot$ means that we have no information and $\top$ means that we have too much information.

Abstract values can be represented with a lattice structure. Lattices have a join function $\sqcup$ which produces the least upper bound for any two elements. A set $\mathcal{L}$ is a join-semilattice if

$\forall x, y \in \mathcal{L}, x \sqcup y$ exists
$\mathcal{L}$ contains a greatest element $\top$

The domain $\sigma$ is a map between program variables $x$ and abstract values $v$ . The transfer function $f$ describes how $\sigma$ changes after executing an instruction.

The goal of the analysis is to compute, for each node $n$ , the domain $\sigma_n^{\text{in}}$ before executing the instruction and $\sigma_n^{\text{out}}$ after executing the instruction.

\sigma^{\text{in}}_i = \sqcup_{j \in \text{pred}(i)} \sigma^{\text{out}}_j

\sigma^{\text{out}}_i = f(I_i, \sigma^{\text{in}}_i)

The system of equations can be solved using a fixed-point iteration algorithm, such as the worklist algorithm.

let sigma_bot = {x -> bot for x in variables}
let sigma_in = [sigma_bot, .., sigma_bot]
let sigma_out = [sigma_bot, .., sigma_bot]
let worklist = [1, .., n]

while let Some(i) = worklist.dequeue() {
    sigma_in[i] = union(predecessors(i).map(|p| sigma_out[p]))
    sigma_out[i] = f(p_i, sigma_in[i])
    if sigma_out[i] changed {
      worklist.enqueue(successors(i))
    }
}

Pointer Analysis

To handle pointers, we need to extend our IR.

$\newcommand{\msf}[1]{\mathsf{#1}} \newcommand{\s}{\hspace{2em}} \newcommand{\eif}[3]{\msf{if}~{#1}~\msf{goto}~{#2}~\msf{else}~{#3}} \begin{aligned} \msf{MemLoc}~l &::= x \mid \mathcal{A}.i \mid \mathcal{A}[\_] \\ \msf{Rvalue}~rv &::= ... \mid \msf{alloc}^A(p^*) \\ \msf{Place} \end{aligned}$

Information Flow

$x$ post-dominates (pdom) $y$ if $x$ appears on every path from $y$ to the exit node. In other words, after executing $y$ , we always execute $x$ .

$x$ is control-dependent (cdep) on $y$ if $x$ post-dominates a child of $y$ , but not $y$ . That is to say, $x$ post-dominates some, but not all children of $y$ . In other words, whether or not we execute $x$ depends on the outcome of $y$ .

for all edges y -> x:
    for x in rpo(pdom tree):
        // x is a child of y, and thus trivially pdoms a child of y
        // We need to check whether there is another child of y
        if x does not pdom y, then x cdep y
    
    for each node z immediately pdom by x:
        for each node y such that z cdep y:
            if x does not pdom y, then x cdep y