Data structures and algorithms notes

Contents

Data structures

  1. Abstract data types
  2. Arrays
  3. Linked lists
  4. Trees
  5. Heaps
  6. Graphs
  7. Union find
  8. Multisets
  9. Maps

Algorithms

  1. Backtracking
  2. Dynamic programming
  3. Bit math

Abstract data types

Abstract data types include lists, priority queues, trees, sets, multisets, maps, and graphs.

Abstract data typeImplementation
ListArrays, linked lists
Priority queueHeaps
TreesPointers, arrays
Setsset
MultisetsCounter
Mapsdict, defaultdict
GraphsAdjacency lists, adjacency matrices
Disjoint setUnion find

*The difference between lists and priority queues is that lists maintain order, while priority queues maintain priority. Generally lists implement get and insert and priority queues implement push and pop.

*A stack is a priority queue where the most recently added element has the highest priority (LIFO). A queue is a priority queue where the least recently added element has the highest priority (FIFO).

Arrays

Two pointer

Used to find valid subarrays.

Keep two pointers, l and r. Move l or r depending on some condition.

def f(arr):
    l, r = 0, len(arr) - 1
    while l < r:
        if l > r:
            l += 1
        else:
            r -= 1
    return l

Example problems include 3Sum, Container With Most Water and Trapping Rainwater.

Sliding window

Used to find valid subarrays.

To find the most [BLANK] of valid subarrays, keep two pointers l and r, the output value, and the curr value. Slide r until condition is broken, and then slide l until condition is not broken.

To find the total number of valid subarrays, fix the value of r. The number of valid windows ending at index r is equal to the size of the window.

There are n(n + 1)/2 subarrays, but a sliding window runs in O(n).

def f(arr):
    output, l, curr = 0, 0
    for r in range(len(arr)):
        curr += arr[r]
        while curr > k:
            curr -= arr[l]
            l += 1
        output = max(output, r - l + 1)
    return output

Prefix sum

Used to efficiently calculate the sum of a subarray.

def f(arr):
    prefix = [arr[0]]
    for i in range(1, len(arr)):
        prefix.append(prefix[-1] + arr[i])
    return prefix

Kadane's algorithm

Used to find the maximum sum of a contiguous subarray within an array.

At each step, either start a new subarray at n, or extend the previous subarray with curr_sum + n.

def f(arr):
    curr_sum = 0
    best_sum = float('-inf')
    for n in arr:
        curr_sum = max(n, curr_sum + n)
        best_sum = max(best_sum, curr_sum)
    return best_sum

Used to find an element in a sorted array.

Classic

Used for finding exact targets, when it is not necessary to compare to neighbors. The loop invariant is l <= r, and loop ends when the search space is empty.

def f(n, arr):
    l, r = 0, len(arr) - 1
    while l <= r:
        m = (l + r) // 2
        if arr[m] == n:
            return m
        elif arr[m] < n:
            l = m + 1
        else:
            r = m - 1
  return -1

Example problems include Search a 2D Matrix.

Single-candidate postprocessing

Used when the condition depends on a neighbor, such as finding finding the minimum or maximum value that satisfies a condition. The loop invariant is l < r, and the loop ends when exactly one element remains.

def f(n, arr):
    l, r = 0, len(arr) - 1
    while l < r:
        m = (l + r) // 2
        if g(arr[m]):
            r = mid
        else:
            l = mid + 1
    if g(arr[l]) == n:
        return l
    return -1

Double-candidate postprocessing

Used when the condition depends on two neighbors, such as finding a peak element in an array. The loop invariant is l + 1 < r, and the loop ends when exactly two elements remain.

def f(n, arr):
    l, r = 0, len(arr) - 1
    while l + 1 < r:
        m = (l + r) // 2
        if g(arr[m]):
            r = m
        else:
            l = m
    if g(arr(l)): 
        return l
    if g(arr(r)): 
        return r
    return -1

Merge sort

The intuition behind mergesort is that you recursively break the list into left and right sublists. At the base case, the list is length 0 or 1 and is considered sorted. After calling mergesort recursively on the left and the right side, the merge function merges the now sorted lists.

def merge_sort(arr): 
    if len(arr) <= 1:
        return arr
    
    left = []
    right = []

    for i in range(len(arr)):
        if i < len(arr) / 2:
            left.append(arr[i])
        else:
            right.append(arr[i])

    left = merge_sort(left)
    right = merge_sort(right)
    return merge(left, right)

def merge(left, right):
    output = []
    while left and right:
        if left[0] <= right[0]:
            output.append(left[0])
            left = left[1:]
        else:
            output.append(right[0])
            right = right[1:]
    if left:
        output.extend(left)
    if right:
        output.extend(right)

    return output

Bucket sort

The idea of bucket sort is to distribute the elements into a number of buckets. Each bucket is then sorted individually, either using a different sorting algorithm, or by recursively applying the bucket sort algorithm. Combined with the pigeonhole principle, it can be useful for problems like Maximum Gap.

Linked lists

Dummy node

Some problems can be simplified by using a dummy node at the start of the linked list. This avoids edge cases when the head needs to be removed or modified. Example problems include Remove Linked List Elements and Reverse Linked List).

Fast and slow pointers

Used to detect cycles in a linked list. Notice that the distance between slow and fast decreases by 1 every turn.

def hasCycle(head):
    slow, fast = head, head
    while fast and fast.next:
        slow = slow.next
        fast = fast.next.next
        if fast == slow:
            return True
    return False

At the point when slow and fast meet, we have A + B = A + B + C + B. We also know that slow is half the speed of fast, so 2(A + B) = A + B + C + B. Simplifying, we get A = C. This helps us find when the cycle starts.

Fast and Slow Pointers

Trees

Tree with pointers

Any arbitrary tree can be implemented with pointers.

class Node:
    def __init__(self, value):
        self.value = value
        self.children = []

Complete binary tree with arrays

A binary tree is a tree where nodes can have at most two children, referred to as the left and right child. An important type of binary tree are complete binary trees, which are trees that fill every level from left to right. Complete binary trees are used in binary heaps and segment trees. Complete binary trees are often implemented with arrays for efficiency.

Let the root node be index 0. For any node at index i:

  1. Left child at index 2 * i + 1
  2. Right child at index 2 * i + 2
  3. Parent at index floor((i - 1) / 2)

Tree traversal

There are multiple ways to traverse a tree. They can be categorized by whether they are breadth-first (BFS) or depth-first (DFS).

There are three variants of DFS: pre-order (node-left-right), post-order (left-right-node), and in-order (left-node-right).

DFS Tree Traversal

BFS needs a queue to keep track of nodes at the current level, so it is usually implemented iteratively.

# BFS, iterative
def traverse(root):
    curr = [root]
    while curr:
        nxt = []
        for node in curr:
            if node is not None:
                for child in node.children:
                    nxt.append(child)
        curr = nxt

DFS needs a stack, so it is usually implemented recursively to take advantage of the call stack. However, it can also be implemented iteratively.

# DFS, pre-order, recursive
def traverse(node):
    if node is None:
        return
    visit(node)
    traverse(node.left)
    traverse(node.right)
# DFS, pre-order, iterative
def traverse(root):
    if root is None:
        return
    stack = [root]
    while len(stack) > 0:
        node = stack.pop()
        if node is None:
            continue
        visit(node)
        # note that right is pushed first
        # so that left is processed first
        if node.right is not None:
            stack.push(node.right)
        if node.left is not None:
            stack.push(node.left)

Segment trees

A segment tree is used for range queries and udpates. It can achieve both in O(log n) time. This is better than simple loops, which take O(n) time for querying, and prefix sums, which take O(n) time for updates.

tree = [0] * (2 * n) # n is the len of the array

def build(arr):
    # arr is at the bottom of the tree 
    # from n to 2n-1
    for i in range(n):
        tree[i + n] = arr[i]
    # parents start at n-1 and before 0
    # the left child is at 2i and the right child is at 2i + 1
    for i in range(n - 1, 0, -1):
        tree[i] = tree[i << 1] + tree[i << 1 | 1]

# [l, r)
def query(l, r):
    output = 0
    l += n # index of l in tree
    r += n # index of r in tree
    while l < r:
        # if l is a right child node, take it 
        if l & 1:
            output += tree[l]
            l += 1
        # if r is a right child node, take the left child
        # because of exclusive or
        if r & 1:
            r -= 1
            output += tree[r]
        # get the next parent
        l >>= 1
        r >>= 1 
    return output

def update(i, val):
    i += n # index of val in tree
    tree[i] = val
    while i > 1:
        # parent is the sum of its children
        tree[i >> 1] = tree[i] + tree[i^1] 
        i >>= 1

Note that a segment tree can also be implemented with pointers.

References: [1].

Heaps

A heap is a tree-based data structure that satisfies the heap property, which states that a parent node should be less than (or greater than) its children nodes. Types of heaps include binary, skew, binomial, and fibonacci.

Binary heap

A binary heap is a heap implemented as a binary tree. It implements push and pop in O(log n) time. While not the fastest, it can be initialized by inserting every element into it.

push

  1. Add the element to the bottom left
  2. While the element is smaller than its parent, swap

pop

  1. Replace the root with the node at the bottom left
  2. While the new root is larger than either of its children, swap with the smaller child

Built-in heap

The python heap library heapq implements a min heap.

import heapq

arr = [3, 4, 1, 2]
heapq.heapify(arr)
heapq.heappop(arr)
heapq.heappush(5)
heapq.nsmallest(1, arr)
heapq.nlargest(1, arr)

Graphs

Topological Sort

You can only implement topological sort on DAGs. There no ordering if there is a cycle.

Kahn's algorithm

This is based on BFS.

  1. Keep track of free nodes that have no dependencies
  2. When removing the node, decrement the degree of neighboring nodes
output, free = [], []

while free:
    n = free.pop()
    output.append(n)

    for node m adjacent to n:
        decrement degree of m
        if degree of m == 0:
            free.append(m)

if len(output) != len(nodes):
    return False # graph has at least one cycle

Depth-first search

  1. Choose a node
  2. Do a DFS on the unvisited node
  3. On callback, add current node to the toplogical ordering
output, temp = [], []

def dfs(node n)
    if n in output:
        return True
    if n in temp:
        return False # graph has at least one cycle

    temp.append(n)

    for node m adjacent to n
        dfs(m)

    temp.remove(n)
    output.append(n)

for n in nodes:
    dfs(n)

References:

  1. Kahn's Algorithm
  2. Topological Sort with DFS

Union find

Used to keep track of disjoint sets.

union takes two elements and combines their sets. find takes an element and returns the representative element ("parent") in the set.

Basic union find

At first, every element is its own parent. union makes one element the parent of the other. find returns the topmost parent.

parent = list(range(n))

def find(i):
    if parent[i] == i:
        return i
    return find(parent[i])

def union(a, b):
    pa, pb = find(a), find(b)
    parent[pa] = pb

Path compression and rank-ordering

find can be optimized with path compression, which flattens the structure of the tree whenever find is called. This means that every node points directly to the topmost parent, speeding up future find operations.

union can be optimized with rank-ordering, which keeps track of the maximum height of the trees. When two trees are combined, the tree with a lower height is made a child of the tree with a higher height. This keeps the overall structure more balanced and prevents long chains of nodes, which would slow down find operations. Note that only when they are equal could the height increase.

parent = list(range(n))
rank = [0] * n

def find(i):
    if parent[i] == i:
        return i
    parent[i] = find(parent[i])
    return parent[i]

def union(a, b):
    pa, pb = find(a), find(b)
    if pa == pb:
        return
    if rank[pa] > rank[pb]:
        parent[pb] = pa
    elif rank[pa] < rank[pb]:
        parent[pa] = pb
    else:
        parent[pa] = pb
        rank[pb] += 1 

Multisets

Counter

Counter takes an iterator and returns the count of each element. It return a zero count for missing elements. Methods: items(), most_common([n])

Maps

Defaultdict

defaultdict takes a default_factory function and returns a dictionary that returns the default value for missing keys. eg. defaultdict(list) or defaultdict(lambda x: x).

Backtracking

Incrementally build candidates to the solution. Abandon a candidate ("backtrack") as soon as you determine that the candidate cannot lead to a valid solution.

Used for constraint satisfaction problems (CSP) with partial candidate solutions which can be checked relatively quickly.

def f():
    output = []  # All possible solutions
    partial = [] # Current partial solution

    def backtrack():
        nonlocal output, partial

        # Base case
        if is_solution(partial):   
            output.append(partial[:])
            return

        # Pruning
        if invalid(partial):
            return

        # Try every candidate
        for c in candidates:
            partial.append(c) # Add to the partial
            backtrack(i + 1)
            partial.pop()     # Remove from the partial

    backtrack()
    return output

Popular problems include Permutations and Combinations.

Dynamic programming

Might be right for problems that involve looking for the optimum value or the number of ways to do something. Most importantly, future decisions depend on earlier decisions.

Top-down memoization

memo = {}
def fib(n):
    if n == 0 or i == 1:
        return n
    if n not in memo:
        memo[n] = f(n - 1) + f(n - 2)
    return memo[n]

Bottom-up dynamic programming

Start with base cases and iteratively build up to the solution in order.

def fib(n):
    if n == 0:
        return 0
    prev, curr = 0, 1
    for i in range(2, n):
        prev, curr = curr, prev + curr
    return curr

Bit math

Use bin to convert an integer to its binary representation:

n = 97
b = bin(n) # '0b1100001'
b[2:]

Use int to convert a binary string to an integer:

bin('1100001', 2) # 97

Bit shifting:

1 << 3 # 01000

To check if the nth bit of x is set with 0-based indexing:

x & (1 << n)