Topics
- Multicore
- Background: Processors and cores, Cache coherence
- Papers: Multikernels, COST
- OS structure and abstractions
- Papers: Exokernels, Dune
- High-performance I/O
- Background: Linux network stack, User-level network stack
- Papers: IX, Arrakis
- High-performance I/O (cont.)
- Background: eBPF, Polling vs. interrupts
- Papers: XDP, XRP
- Remote direct memory access
- Background: RDMA protocol
- Papers: FaRM, PRISM
- Kernel bypass and OS abstractions
- Background: IOMMU, Zero-copy memory allocation
- Papers: Demikernel
- NIC Interfaces
- Background: Packetized interface
- Papers: Enso
- CPU Scheduling
- Background
- Papers: Shenango, Attack of the Killer Microseconds
- CPU Scheduling (cont.)
- CPU Scheduling & Interference
- Background:
- Papers: Caladan
- Signals vs. Interrupts
- Background:
- Papers: Aspen
Multicore
Processors and cores
A processor contains cores, a memory controller unit (MCU), a shared L3 cache, I/O interfaces (such as PCI Express), and interconnects. A core is a basic computational unit that maintains its own set of registers. Cache lines are the fixed-size blocks of data that are transferred between memory and the caches.

A system may have multiple processors, which may have multiple cores that support multiple hardware threads or contexts.
In a symmetric/shared memory multiprocessor (SMP), all processors can run any task and access any memory. The OS is responsible for scheduling threads across processors. Sometimes non-uniform memory access (NUMA) is used, which means that each processor has local RAM but can also access remote RAM attached to another processor.
In massively parallel processing (MPP), there are many processors, each with its own local memory, that communicate by message-passing. One example of this is the GPU.

Cache coherence
One big problem with multiprocessors and multicores is cache incoherence. When a memory location is modified, the cached copies of that memory location are no longer valid. It is necessary to have a cache coherence protocol to make sure that all copies are the same. The policy needs to guarantee:
- Write propagation: writes to a memory location must be seen by everyone
- Transaction serialization: reads and writes to a memory location must be seen in the same order by everyone
Nevertheless, cache coherence comes with its own set of problems:
- Cache ping-pong: when two threads on different processors repeatedly modify variables that reside on the same cache line, the cache line will be transferred back and forth between the processors, causing a lot of cache coherence traffic. This can be caused by false sharing or true sharing
- False sharing: when two threads on different processors modify variables that reside on the same cache line, the cache coherence protocol will force the cache line to be transferred between the processors, even though the threads are not actually sharing any data
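A toy model of cache-line ping-pong (the ownership rule and the addresses are simplifications invented for illustration, not real hardware): a core must "own" a line before writing it, and writing a line owned by another core forces a transfer. Compare two counters that share a line with two counters padded onto separate lines.

```python
# Toy coherence model: each core must own a cache line before writing
# it; writing a line owned by another core forces a line transfer.
LINE = 64  # bytes per cache line (assumed)

def count_transfers(addr_a, addr_b, writes=1000):
    owner = {}          # cache line -> core that currently owns it
    transfers = 0
    for _ in range(writes):
        # two cores alternate writes, each to its own variable
        for core, addr in ((0, addr_a), (1, addr_b)):
            line = addr // LINE
            if owner.get(line) != core:
                transfers += 1   # the line must move to this core
                owner[line] = core
    return transfers

shared = count_transfers(0, 8)    # both counters on line 0: ping-pong
padded = count_transfers(0, 64)   # counters padded to separate lines
print(shared, padded)             # 2000 vs. 2
```

Padding eliminates the transfers because each core keeps exclusive ownership of its own line after the first write.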
Multikernels
The Multikernel: A New OS Architecture for Scalable Multicore Systems. SOSP 2009.
Fish in a Barrel: An Insider’s Retrospective of the SOSP’09 Multikernel Paper.
- The authors wanted to address the problems of (1) scalability with rising core counts, where cache coherence protocols become a bottleneck, and (2) the diversity of hardware within a system and across systems, which makes it difficult to optimize the OS for all hardware types.
- The idea of a multikernel is to have a distributed system of per-core OS instances that communicate via message-passing rather than shared state. In retrospect, core counts didn't actually increase as dramatically as predicted.
- It's important to think about the benefits of shared state versus message-passing. Shared state is the traditional way of building operating systems, where all cores share the same memory space and communicate by reading and writing shared memory; it can be hard to reason about. Message-passing involves asynchronous programming, which can be complicated in its own right.
COST paper
Scalability! But at what COST?
The authors argue that parallelizable overheads in many distributed systems papers cause the scalability of those systems to be exaggerated. Instead, distributed systems papers should also report the COST (Configuration that Outperforms a Single Thread) metric in order to gauge the true performance benefits of the system.
OS structure and abstractions
Exokernels
Exokernel: An Operating System Architecture for Application-Level Resource Management. SIGOPS '95.
- The Exokernel paper addresses the problem of operating systems providing only general, fixed, high-level abstractions, such as virtual memory and interprocess communication, for applications to use. This makes it difficult for databases, garbage collectors, sandboxes, and web servers to optimize their performance.
- The Exokernel paper proposes a new operating system architecture which only handles the protection and multiplexing of resources. The authors envisioned that general applications would use specialized, user-level library operating systems built on primitives (secure bindings) provided by the operating system (Aegis) to optimize their work.
Dune
Dune: Safe User-level Access to Privileged CPU Features. OSDI '12.
- The Dune paper tackles the same problem as the exokernel paper. It proposes using the virtualization hardware in CPUs to give a process access to privileged hardware features. This allows applications to take advantage of hardware features without requiring changes to existing operating system abstractions.
- The prototype of this solution is Dune, which consists of a loadable kernel module on top of the OS (about 2.5k lines of code) and a user-level library, libDune.
- Dune is an example of how advancements in hardware can motivate new papers.
High-performance I/O
Metrics
Characteristics of modern datacenters:
- Requests are distributed across thousands of servers
- Requests are small
- There is variable load, both bursty (µs) and diurnal
Important metrics:
- Throughput (requests per second)
- Tail latency, since end-to-end response times are determined by the slowest individual response
- CPU utilization under variable load
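A small worked example of why tail latency matters under fan-out (the 1% figure is an assumption for illustration): a request that fans out to N servers is only as fast as the slowest response, so server-side tails compound.

```python
# If each server independently exceeds its 99th-percentile latency 1%
# of the time, the fraction of fan-out requests that hit at least one
# slow server grows quickly with the fan-out N.
def frac_slow(n_servers, p_slow=0.01):
    return 1 - (1 - p_slow) ** n_servers

for n in (1, 10, 100):
    print(n, round(frac_slow(n), 3))
# With 100 servers, roughly 63% of requests see some server's tail.
```

This is why datacenter systems optimize p99/p999 latency rather than the mean.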
Linux network stack
The Linux network stack is slow because it involves many memory copies, context switches, interrupts, and queueing delays.

User-level network stacks
To achieve higher performance, user-level network stacks allow applications to directly read and write to device driver queues, bypassing the kernel. This reduces the number of memory copies, context switches, and interrupts. Having a run-to-completion model also allows the application to avoid context switches and interrupts.
The downside is that applications need to roll their own network stack (e.g., TCP), and polling may waste cycles when packets aren't being received.

IX
IX: A Protected Dataplane Operating System for High Throughput and Low Latency. OSDI '14.
The paper addresses the problem of providing protection and isolation for kernel-bypass networking stacks.
The authors propose a dataplane operating system that separates the control plane, which handles coarse-grained resource allocation, from the dataplane, which runs the actual networking stack and the application logic. IX depends on virtualization hardware, which (1) allows the dataplane to use hardware resources directly and (2) runs the control plane, the dataplane, and the application at different protection levels.
Arrakis
Arrakis: The Operating System is the Control Plane. OSDI '14.
Similar solution to IX.
High-performance I/O (cont.)
eBPF
Extended Berkeley Packet Filter (eBPF) allows users to run special programs in the Linux kernel without changing kernel source code or loading kernel modules. The verifier ensures that a program is safe to run in the kernel by checking for memory safety and guaranteed termination (bounded loops, an instruction limit, no arbitrary jumps).
eBPF programs are attached to hook points, which can be system calls, kernel function entry points, tracepoints, or network events. When an eBPF program is loaded, the jump address of the hook point is modified to point to the eBPF program and then back. Note that eBPF is aimed at observability and tracing, not kernel bypass.
eBPF maps provide a mechanism for eBPF to store and share data within the kernel and communicate with user-space applications. The map can be a diversity of types, including hash tables, arrays, and ring buffers.
eBPF is a kernel feature: it works inside a virtual machine because the guest runs its own kernel, while containers share the host kernel, so loading eBPF programs from a container requires elevated privileges.
Polling vs. interrupts
When the NIC writes packets into pre-prepared packet buffers, the CPU must be notified that a packet has arrived. There are two ways to do this: polling and interrupts. With polling, the CPU repeatedly checks a particular memory address, wasting cycles when no packets arrive. With interrupts, the NIC raises an interrupt to notify the CPU when a packet arrives.
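A toy sketch of the two notification models (the ToyNIC class and its methods are invented for illustration, not a real driver API): polling burns cycles checking an empty queue, while an interrupt-style callback does work only when a packet arrives.

```python
import collections

class ToyNIC:
    """Simulated NIC with an RX queue and an optional interrupt handler."""
    def __init__(self):
        self.queue = collections.deque()
        self.handler = None          # interrupt-style callback
    def deliver(self, pkt):
        self.queue.append(pkt)
        if self.handler:
            self.handler(self.queue.popleft())  # "interrupt" fires now

def poll(nic, budget):
    """Spin up to `budget` iterations; count checks that find nothing."""
    wasted, got = 0, []
    for _ in range(budget):
        if nic.queue:
            got.append(nic.queue.popleft())
        else:
            wasted += 1              # a cycle burned with nothing to do
    return got, wasted

nic = ToyNIC()
nic.deliver("p1")
got, wasted = poll(nic, budget=5)
print(got, wasted)                   # ['p1'] and 4 wasted checks
```

Real systems often mix the two, e.g. Linux NAPI switches from interrupts to polling under load to avoid interrupt storms.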
XDP
The eXpress Data Path: Fast Programmable Packet Processing in the Operating System Kernel. CoNEXT '18.
Many applications (e.g., those using DPDK) achieve high-performance networking through kernel-bypass techniques that avoid the overhead of the operating system. Drawbacks of kernel bypass include losing kernel isolation and security, losing access to kernel network stack features (the routing table, the TCP stack), and needing to dedicate full CPU cores to packet processing.
The authors propose eBPF-based eXpress Data Path (XDP), which lets user-defined programs run in the Linux kernel at the NIC driver level. This achieves high performance while retaining kernel isolation, security, and access to the network stack.
XDP showed higher performance than baseline Linux but lower than DPDK for packet drop/forwarding benchmarks. The eBPF model was flexible enough to be used to implement layer-3 routing, inline DDoS protection, and layer-4 load balancing.
XRP
XRP: In-Kernel Storage Functions with eBPF. OSDI '22.
Similar to XDP, but for storage.
Remote direct memory access
RDMA protocol
RDMA (Remote Direct Memory Access) allows one machine to make remote memory reads or writes without server CPU involvement. This is different from DPDK, where the server needs the CPU to poll for incoming messages in the NIC queues (tx_burst, rx_burst).
The transport protocol is offloaded to the hardware itself. The transport protocol is responsible for ensuring that packets arrive "reliably." This means:
- Packets shouldn't be dropped
- Packets shouldn't be received out-of-order
The transport protocol includes ACKs, flow control (how many messages can be in flight at once), and congestion control. The hardware can be RDMA-enabled NICs or RDMA-enabled switches.
RoCE (RDMA over Converged Ethernet) relies on PFC (Priority Flow Control) and RDMA-enabled NICs. PFC allows switches to send PAUSE and RESUME frames to other switches, which ensures that no packets get dropped. The RDMA-enabled NICs can send NACKs to signal that packets arrived out of order.

iWARP is the full TCP protocol implemented on NICs. It is less common than RoCE.
RDMA primitives
- One-sided operations like READ and WRITE
- Two-sided operations like SEND and RECEIVE
- Limited support for compare-and-swap (CAS)
Problems with RDMA:
- Most data structures require multiple round trips to traverse
- Hard to ensure consistency
- Hard to implement transactions
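A sketch of why pointer-based data structures need multiple round trips over RDMA (the addresses and the remote-memory contents are made up for illustration): each one-sided READ fetches one fixed location, so a linked-list traversal costs one network round trip per hop.

```python
# "Remote memory" as a map from address to (value, next-pointer).
remote_mem = {0: ("A", 8), 8: ("B", 16), 16: ("C", None)}

def rdma_read(addr):
    """Stand-in for a one-sided READ: one network round trip."""
    return remote_mem[addr]

def traverse(head):
    round_trips, values, addr = 0, [], head
    while addr is not None:
        val, nxt = rdma_read(addr)   # must wait for this hop's result
        round_trips += 1
        values.append(val)
        addr = nxt                   # the next address is only known now
    return values, round_trips

print(traverse(0))                   # (['A', 'B', 'C'], 3)
```

Because each hop's address depends on the previous READ, the round trips cannot be pipelined; this is what motivates flatter layouts and primitives like PRISM's indirect reads.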
FaRM
No compromises: distributed transactions with consistency, availability, and performance. SOSP '15.
FaRM is a system to provide transactions using RDMA. Every object is associated with a version number and a lock. During a transaction, FaRM gathers the read set (all values needing to be read) and the write set (all values needing to be written). Then it:
- Performs one-sided reads of all values in the read set
- Make write updates locally
- Lock all objects in the write set
- Reread all objects in the read set and restart the transaction if any of them has changed
- Update write objects
- Unlock
Note that this protocol uses optimistic concurrency control. It requires multiple READs and RPCs, which means that the CPU is involved.
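The commit steps above can be sketched as follows (a simplified, single-threaded model with a local dict standing in for remote memory; RDMA details and partial-lock cleanup on lock failure are omitted):

```python
# Each object carries a value, a version number, and a lock bit.
store = {k: {"val": 0, "ver": 0, "locked": False} for k in ("x", "y")}

def read_set_versions(reads):
    # One-sided reads: record the version of each object we read.
    return {k: store[k]["ver"] for k in reads}

def commit(snapshot, writes):
    # Writes were buffered locally by the caller (in `writes`).
    for k in writes:                 # lock every object in the write set
        if store[k]["locked"]:
            return False
        store[k]["locked"] = True
    # Validate: reread versions of the read set; abort if any changed.
    if any(store[k]["ver"] != v for k, v in snapshot.items()):
        for k in writes:
            store[k]["locked"] = False
        return False                 # caller restarts the transaction
    # Install writes, bump versions, and unlock.
    for k, v in writes.items():
        store[k].update(val=v, ver=store[k]["ver"] + 1, locked=False)
    return True

snap = read_set_versions(["x"])
store["x"]["ver"] += 1               # simulate a concurrent write to x
print(commit(snap, {"y": 7}))        # False: read-set validation failed
snap = read_set_versions(["x"])
print(commit(snap, {"y": 7}))        # True on retry
```

The version check is what makes the concurrency control optimistic: conflicts are detected at commit time rather than prevented with read locks.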
PRISM
PRISM: Rethinking the RDMA Interface for Distributed Systems. SOSP '21.
PRISM primitives:
- Indirect reads and writes
- Allocate
- Enhanced CAS
- Operation chaining
In PRISM-TX, the memory layout is changed so that each key keeps the following metadata:
- PR: timestamp of most recent transaction that has read the key and has prepared to commit
- PW: timestamp of most recent transaction that needs to write the key and has prepared to commit
- C: the timestamp of the most recent committed transaction that wrote this key
To do a transaction:
- Perform one-sided reads of all values in the read set
- Make write updates locally
- Check for no concurrent writes by verifying that PW has not increased
- Update write objects
- Unlock
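A minimal sketch of the validation idea behind these timestamps (the key name and layout are invented for illustration): a reader records PW at read time and aborts if PW has advanced by commit time, which would indicate a concurrent prepared write.

```python
# Per-key metadata as described above: PR, PW, C timestamps plus value.
meta = {"k": {"PR": 0, "PW": 0, "C": 0, "val": "old"}}

def read(key):
    m = meta[key]
    return m["val"], m["PW"]         # value plus the PW observed at read time

def validate(key, pw_seen):
    # No concurrent prepared write iff PW has not advanced since our read.
    return meta[key]["PW"] == pw_seen

val, pw = read("k")
meta["k"]["PW"] += 1                 # a concurrent transaction prepares a write
print(validate("k", pw))             # False: this transaction must retry
```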
Kernel bypass and OS abstractions
IOMMU
The IOMMU (I/O memory management unit) translates the virtual addresses used by devices (such as the NIC) into physical memory addresses.
The range of memory the NIC is reading needs to be:
- Physically backed and pinned, since the NIC can't handle page-faults
- Registered when the NIC has IOMMU support

There are two ways to get physically backed, pinned memory:
- mlock() pins the memory you pass in
- The OS maintains huge pages (2 MB or 1 GB). The 2 MB pages are always pinned and physically backed. The NIC can register the huge pages with the IOMMU.
Zero-copy memory allocation
ZC send:
- App copies app memory into the network stack
- No further copies until the NIC fetches the packet from RAM
ZC receive:
- The NIC writes the packet into RAM
- No further copies until the app copies the packet from the network stack

Demikernel
The Demikernel Datapath OS Architecture for Microsecond-scale Datacenter Systems. SOSP '21.
Demikernel aims to provide zero-copy memory even between the app and the NIC.
Demikernel send:
- No copies at all, so the NIC must directly access app memory
- For the NIC to have access to app memory, we must either pin memory on demand or expect the app to use a special memory allocator that allocates pinned memory
Demikernel receive:
- Hand apps direct pointer to the receive buffer
To provide use-after-free protection, when memory is allocated, Demikernel returns a queue token and increments a reference count. The app can use the queue token to push the address to the NIC. When the NIC is done with the memory region, it decrements the reference count; when the app is done with the memory region, it also decrements the reference count. When the reference count reaches 0, the memory region can be freed.
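The reference-counting scheme above can be sketched as follows (the Buffer class and its method names are invented for illustration): both the app and the NIC hold a reference, and the buffer is freed only when the count drops to zero.

```python
class Buffer:
    """Toy reference-counted buffer for use-after-free protection."""
    def __init__(self):
        self.refcount = 0
        self.freed = False
    def incref(self):
        self.refcount += 1
    def decref(self):
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True        # safe: no one references it anymore

buf = Buffer()
buf.incref()                         # app holds the buffer
buf.incref()                         # queue token handed to the NIC
buf.decref()                         # app is done with the buffer...
print(buf.freed)                     # False: the NIC still holds it
buf.decref()                         # ...and now the NIC completes
print(buf.freed)                     # True: the region can be freed
```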
NIC Interfaces
Packetized interface
Traditionally, NICs and applications communicate through buffers that contain packet descriptors. Packet descriptors hold metadata, a flag bit, and a pointer to a fixed-sized packet buffer which holds the data.
In the TX buffer, the NIC reads from the head and the CPU writes to the tail.

In the RX buffer, the NIC writes to the head and the CPU reads from the tail.
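The TX/RX rings above can be sketched as a fixed-size circular buffer where the producer writes at the tail and the consumer reads at the head (a toy model; real descriptors carry metadata, a flag bit, and a pointer to a packet buffer). For TX the CPU produces and the NIC consumes; for RX the roles flip.

```python
class DescriptorRing:
    """Toy circular descriptor ring with head/tail indices."""
    def __init__(self, size):
        self.slots = [None] * size
        self.head = 0                # consumer position
        self.tail = 0                # producer position
        self.count = 0
    def produce(self, desc):
        if self.count == len(self.slots):
            return False             # ring full: producer must wait
        self.slots[self.tail] = desc
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1
        return True
    def consume(self):
        if self.count == 0:
            return None              # ring empty: nothing to do
        desc = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return desc

ring = DescriptorRing(2)
ring.produce("pkt0"); ring.produce("pkt1")
print(ring.produce("pkt2"))          # False: only 2 slots
print(ring.consume())                # 'pkt0': consumed in FIFO order
```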

Packet buffers make it efficient to do software multiplexing and demultiplexing, since only packet pointers need to be delivered. However, now that demultiplexing can be done with NIC offloads, this is no longer as relevant.
There are two problems with the packetized interface:
- Large received data is unnecessarily broken into buffers, causing CPU overhead and cache misses
- More metadata needs to be transferred, which can make PCIe the bottleneck
Enso
Enso: A Streaming Interface for NIC-Application Communication. OSDI '23.
Enso replaces per-packet buffers with contiguous streaming buffers. Zero-copy into these buffers allows better cache utilization, since it avoids dirtying scattered cache lines, and Enso showed that this can increase throughput. With DDIO, the NIC can place received data directly into the L3 cache.
CPU scheduling
CPU scheduling algorithms
Most Theoretical
- First come first serve (FCFS)
- Prioritization
- Shortest remaining processing time (SRPT)
These algorithms assume the scheduler knows how long tasks will take and that the algorithm's implementation has no overhead. They make sense at long timescales.
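A toy comparison of FCFS and SRPT on one core (all jobs arrive at time 0 and run to completion, matching the no-overhead assumption above): running short jobs first minimizes mean response time.

```python
# With all jobs arriving at time 0 and run to completion, SRPT reduces
# to shortest-job-first, and a job's response time is its completion time.
def mean_response(jobs, srpt):
    order = sorted(jobs) if srpt else list(jobs)
    t, total = 0, 0
    for j in order:
        t += j                       # job runs for j time units
        total += t                   # response time = completion time here
    return total / len(jobs)

jobs = [10, 1, 1, 1]                 # one long job ahead of three short ones
print(mean_response(jobs, srpt=False))  # FCFS: (10+11+12+13)/4 = 11.5
print(mean_response(jobs, srpt=True))   # SRPT: (1+2+3+13)/4 = 4.75
```

The long job delays every short job under FCFS; under SRPT only the long job waits, which is exactly the behavior latency-critical datacenter workloads want.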
Kernel-Bypass Scheduling Policies
- ZygOS
- Shenango
- Arachne
- Caladan
Very low overhead, which lets them deal effectively with microsecond-scale events. Very good performance, but they require application changes and not all policies are implemented.
Improve Linux Scheduling
- Syrup
- Ghost
- Snap
Some changes to the kernel are necessary. Worse performance than kernel bypass.
Most Practical
- Completely fair scheduling (CFS) [Linux]
- Earliest eligible virtual deadline first (EEVDF) [Linux]
Shenango
Key idea: quick core allocation, reassigning cores between applications at microsecond timescales.
Shenango consists of a per-application runtime and a centralized IOKernel.
IOKernel
- Algorithm for scheduling
- Core pinning
- Control plane - decides where packets go
- Data plane - moves packets
Attack of the Killer Microseconds
Hardware can mask the latency of nanosecond-scale events. e.g. Cache misses are around 100ns. During that time, hardware can do out-of-order execution or the CPU can switch to another hyper-thread.
Software can mask the latency of millisecond-scale events, e.g. disk I/Os, WAN network round trips, and low-end flash devices. During that time, the OS can simply switch to another thread.
Nothing can effectively mask the latency of microsecond-scale events. e.g. Datacenter round trip time and high-end flash devices.
CPU scheduling (cont.)
ghOSt
Problem:
- Custom scheduling policies are valuable but hard to build and deploy inside the kernel
Solution:
- Separate mechanism from policy
- The kernel keeps the mechanism (executing scheduling decisions)
- User space provides the policy (deciding what to run)
Syrup
Applications mark the signals they want to monitor and provide a map from signal to policy.
Caladan
Identified three sources of interference: hyperthread, memory bandwidth, and last-level cache (LLC).

Hyperthread interference
- Assume interference is always present when both siblings are active
- Monitor LC tasks, queuing delays, response-time delays
Memory bandwidth interference
- Poll DRAM to see how much bandwidth is used currently by all tasks
- To attribute to individual task, sample LLC misses
LLC interference
- Measure increased queueing delays caused by reductions in compute capacity
Signals vs. interrupts
Aspen
In Aspen-KB, a timer core is allocated to periodically preempt runtime cores. The network RX queue is frequently polled for incoming packets (requests). Requests are added to a "new" queue. A thread from the new queue runs for some quantum and then gets added to the preempted queue. We assume that the priority of new requests is greater than that of preempted ones.
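The Aspen-KB queueing discipline above can be sketched as follows (a toy model; quanta are one time unit and the task names are made up): new requests are always scheduled before preempted ones, and a task that exhausts its quantum moves to the preempted queue.

```python
import collections

new_q = collections.deque()          # incoming requests (high priority)
preempted_q = collections.deque()    # tasks that used up a quantum

def schedule():
    """Pick the next task: new requests before preempted ones."""
    if new_q:
        return new_q.popleft()
    if preempted_q:
        return preempted_q.popleft()
    return None

def run_quantum(task, remaining):
    """Run one quantum; requeue the task if it still has work left."""
    remaining -= 1
    if remaining > 0:
        preempted_q.append((task, remaining))

new_q.extend([("A", 2), ("B", 1)])   # A needs 2 quanta, B needs 1
order = []
while (t := schedule()) is not None:
    order.append(t[0])
    run_quantum(*t)
print(order)                         # ['A', 'B', 'A']: A is preempted so B runs first
```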
In Aspen-Go, a dedicated thread called sysmon preempts goroutines periodically. Each core has a queue of runnable goroutines, and there is also a global queue of preempted goroutines. Tasks on per-core queues are prioritized over the global queue. One problem is that sysmon does not have direct access to the network RX queue.