Topics
- Multicore
- Background: Processors and cores, Cache coherence
- Papers: Multikernels, COST
- OS structure and abstractions
- Papers: Exokernels, Dune
- High-performance I/O
- Background: Linux network stack, User-level network stack
- Papers: IX, Arrakis
- High-performance I/O (cont.)
- Background: eBPF, Polling vs. interrupts
- Papers: XDP, XRP
- Remote direct memory access
- Background: RDMA protocol
- Papers: FaRM, PRISM
- Kernel bypass and OS abstractions
- Background: IOMMU, Zero-copy memory allocation
- Papers: Demikernel
- NIC Interfaces
- Background: Packetized interface
- Papers: Enso
- CPU Scheduling
- Background
- Papers: Shenango, Attack of the Killer Microseconds
- CPU Scheduling (cont.)
- CPU Scheduling & Interference
- Background:
- Papers: Caladan
- Signals vs. Interrupts
- Background:
- Papers: Aspen
Multicore
Processors and cores
A processor contains cores, a memory controller unit (MCU), a shared L3 cache, I/O interfaces (such as PCI Express), and interconnects. A core is a basic computational unit that maintains its own set of registers. Cache lines are the fixed-size blocks of data that are transferred between memory and the caches.

A system may have multiple processors, which may have multiple cores that support multiple hardware threads or contexts.
In a symmetric/shared memory multiprocessor (SMP), all processors can run any task and access any memory. The OS is responsible for scheduling threads across processors. Sometimes non-uniform memory access (NUMA) is used, which means that each processor has local RAM but can also access remote RAM attached to another processor.
In massively parallel processing (MPP), there are many processors, each with its own local memory, that communicate by message-passing. One example of this is the GPU.

Cache coherence
One big problem with multiprocessors and multicores is cache incoherence. When a memory location is modified, the cached copies of that memory location are no longer valid. It is necessary to have a cache coherence protocol to make sure that all copies are the same. The policy needs to guarantee:
- Write propagation: writes to a memory location must be seen by everyone
- Transaction serialization: reads and writes to a memory location must be seen in the same order by everyone
Nevertheless, cache coherence comes with its own set of problems:
- Cache ping-pong: when two threads on different processors repeatedly modify variables that reside on the same cache line, the cache line will be transferred back and forth between the processors, causing a lot of cache coherence traffic. This can be caused by false sharing or true sharing
- False sharing: when two threads on different processors modify variables that reside on the same cache line, the cache coherence protocol will force the cache line to be transferred between the processors, even though the threads are not actually sharing any data
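A toy model of cache-line ping-pong (the ownership rule and the addresses are simplifications invented for illustration, not real hardware): a core must "own" a line before writing it, and writing a line owned by another core forces a transfer. Compare two counters that share a line with two counters padded onto separate lines.

```python
# Toy coherence model: each core must own a cache line before writing
# it; writing a line owned by another core forces a line transfer.
LINE = 64  # bytes per cache line (assumed)

def count_transfers(addr_a, addr_b, writes=1000):
    owner = {}          # cache line -> core that currently owns it
    transfers = 0
    for _ in range(writes):
        # two cores alternate writes, each to its own variable
        for core, addr in ((0, addr_a), (1, addr_b)):
            line = addr // LINE
            if owner.get(line) != core:
                transfers += 1   # the line must move to this core
                owner[line] = core
    return transfers

shared = count_transfers(0, 8)    # both counters on line 0: ping-pong
padded = count_transfers(0, 64)   # counters padded to separate lines
print(shared, padded)             # 2000 vs. 2
```

Padding eliminates the transfers because each core keeps exclusive ownership of its own line after the first write.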
Multikernels
The Multikernel: A New OS Architecture for Scalable Multicore Systems. SOSP 2009.
Fish in a Barrel: An Insider’s Retrospective of the SOSP’09 Multikernel Paper.
- The authors wanted to address the problems of (1) scalability with rising core counts, where cache coherence protocols become a bottleneck, and (2) the diversity of hardware within a system and across systems, which makes it difficult to optimize the OS for all hardware types.
- The idea of a multikernel is to have a distributed system of per-core OS instances that communicate via message-passing rather than shared state. In retrospect, core counts didn't actually increase as dramatically as predicted.
- It's important to think about the benefits of shared state versus message-passing. Shared state is the traditional way of building operating systems, where all cores share the same memory space and communicate by reading and writing shared memory; it can be hard to reason about. Message-passing involves asynchronous programming, which can be complicated in its own right.
COST paper
Scalability! But at what COST?
The authors argue that parallelizable overheads in many distributed systems papers cause the scalability of those systems to be exaggerated. Instead, distributed systems papers should also report the COST (Configuration that Outperforms a Single Thread) metric in order to gauge the true performance benefits of the system.
OS structure and abstractions
Exokernels
Exokernel: An Operating System Architecture for Application-Level Resource Management. SIGOPS '95.
- The Exokernel paper addresses the problem of operating systems providing only general, fixed, high-level abstractions, such as virtual memory and interprocess communication, for applications to use. This makes it difficult for databases, garbage collectors, sandboxes, and web servers to optimize their performance.
- The Exokernel paper proposes a new operating system architecture which only handles the protection and multiplexing of resources. The authors envisioned that general applications would use specialized, user-level library operating systems built on primitives (secure bindings) provided by the operating system (Aegis) to optimize their work.
Dune
Dune: Safe User-level Access to Privileged CPU Features. OSDI '12.
- The Dune paper tackles the same problem as the exokernel paper. It proposes using the virtualization hardware in CPUs to give a process access to privileged hardware features. This allows applications to take advantage of hardware features without requiring changes to existing operating system abstractions.
- The prototype of this solution is Dune, which consists of a loadable kernel module on top of the OS (about 2.5k lines of code) and a user-level library, libDune.
- Dune is an example of how advancements in hardware can motivate new papers.
High-performance I/O
Metrics
Characteristics of modern datacenters:
- Requests are distributed across thousands of servers
- Requests are small
- There is variable load, both bursty (µs) and diurnal
Important metrics:
- Throughput (requests per second)
- Tail latency, since end-to-end response times are determined by the slowest individual response
- CPU utilization under variable load
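A small worked example of why tail latency matters under fan-out (the 1% figure is an assumption for illustration): a request that fans out to N servers is only as fast as the slowest response, so server-side tails compound.

```python
# If each server independently exceeds its 99th-percentile latency 1%
# of the time, the fraction of fan-out requests that hit at least one
# slow server grows quickly with the fan-out N.
def frac_slow(n_servers, p_slow=0.01):
    return 1 - (1 - p_slow) ** n_servers

for n in (1, 10, 100):
    print(n, round(frac_slow(n), 3))
# With 100 servers, roughly 63% of requests see some server's tail.
```

This is why datacenter systems optimize p99/p999 latency rather than the mean.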
Linux network stack
The Linux network stack is slow because it involves many memory copies, context switches, interrupts, and queueing delays.

User-level network stacks
To achieve higher performance, user-level network stacks allow applications to directly read and write to device driver queues, bypassing the kernel. This reduces the number of memory copies, context switches, and interrupts. Having a run-to-completion model also allows the application to avoid context switches and interrupts.
The downside is that applications need to roll their own network stack (e.g., TCP), and polling may waste cycles when packets aren't being received.

IX
IX: A Protected Dataplane Operating System for High Throughput and Low Latency. OSDI '14.
The paper addresses the problem of providing protection and isolation for kernel-bypass networking stacks.
The authors propose a dataplane operating system that separates the control plane, which handles coarse-grained resource allocation, from the dataplane, which runs the actual networking stack and the application logic. IX depends on virtualization hardware, which (1) allows the dataplane to use hardware resources directly and (2) runs the control plane, the dataplane, and the application at different protection levels.
Arrakis
Arrakis: The Operating System is the Control Plane. OSDI '14.
Similar solution to IX.
High-performance I/O (cont.)
eBPF
Extended Berkeley Packet Filter (eBPF) allows users to run special programs in the Linux kernel without changing kernel source code or loading kernel modules. The verifier ensures that a program is safe to run in the kernel by checking for memory safety and guaranteed termination (bounded loops, an instruction limit, no arbitrary jumps).
eBPF programs are attached to hook points, which can be system calls, kernel function entry points, tracepoints, or network events. When an eBPF program is loaded, the jump address of the hook point is modified to point to the eBPF program and then back. Note that eBPF is aimed at observability and tracing, not kernel bypass.
eBPF maps provide a mechanism for eBPF to store and share data within the kernel and communicate with user-space applications. The map can be a diversity of types, including hash tables, arrays, and ring buffers.
eBPF is a kernel feature: it works inside a virtual machine because the guest runs its own kernel, while containers share the host kernel, so loading eBPF programs from a container requires elevated privileges.
Polling vs. interrupts
When the NIC writes packets into pre-prepared packet buffers, the CPU must be notified that a packet has arrived. There are two ways to do this: polling and interrupts. With polling, the CPU repeatedly checks a particular memory address, wasting cycles when no packets arrive. With interrupts, the NIC raises an interrupt to notify the CPU when a packet arrives.
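A toy sketch of the two notification models (the ToyNIC class and its methods are invented for illustration, not a real driver API): polling burns cycles checking an empty queue, while an interrupt-style callback does work only when a packet arrives.

```python
import collections

class ToyNIC:
    """Simulated NIC with an RX queue and an optional interrupt handler."""
    def __init__(self):
        self.queue = collections.deque()
        self.handler = None          # interrupt-style callback
    def deliver(self, pkt):
        self.queue.append(pkt)
        if self.handler:
            self.handler(self.queue.popleft())  # "interrupt" fires now

def poll(nic, budget):
    """Spin up to `budget` iterations; count checks that find nothing."""
    wasted, got = 0, []
    for _ in range(budget):
        if nic.queue:
            got.append(nic.queue.popleft())
        else:
            wasted += 1              # a cycle burned with nothing to do
    return got, wasted

nic = ToyNIC()
nic.deliver("p1")
got, wasted = poll(nic, budget=5)
print(got, wasted)                   # ['p1'] and 4 wasted checks
```

Real systems often mix the two, e.g. Linux NAPI switches from interrupts to polling under load to avoid interrupt storms.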
XDP
The eXpress Data Path: Fast Programmable Packet Processing in the Operating System Kernel. CoNEXT '18.
Many applications (e.g., those using DPDK) achieve high-performance networking through kernel-bypass techniques that avoid the overhead of the operating system. Drawbacks of kernel bypass include losing kernel isolation and security, losing access to kernel network stack features (the routing table, the TCP stack), and needing to dedicate full CPU cores to packet processing.
The authors propose eBPF-based eXpress Data Path (XDP), which lets user-defined programs run in the Linux kernel at the NIC driver level. This achieves high performance while retaining kernel isolation, security, and access to the network stack.
XDP showed higher performance than baseline Linux but lower than DPDK for packet drop/forwarding benchmarks. The eBPF model was flexible enough to be used to implement layer-3 routing, inline DDoS protection, and layer-4 load balancing.
XRP
XRP: In-Kernel Storage Functions with eBPF. OSDI '22.
Similar to XDP, but for storage.
Remote direct memory access
RDMA protocol
RDMA (Remote Direct Memory Access) allows one machine to make remote memory reads or writes without server CPU involvement. This is different from DPDK, where the server needs the CPU to poll for incoming messages in the NIC queues (tx_burst, rx_burst).
The transport protocol is offloaded to the hardware itself. The transport protocol is responsible for ensuring that packets arrive "reliably." This means:
- Packets shouldn't be dropped
- Packets shouldn't be received out-of-order
The transport protocol includes ACKs, flow control (how many messages can be in flight at once), and congestion control. The hardware can be RDMA-enabled NICs or RDMA-enabled switches.
RoCE (RDMA over Converged Ethernet) relies on PFC (Priority Flow Control) and RDMA-enabled NICs. PFC allows switches to send PAUSE and RESUME frames to other switches, which ensures that no packets get dropped. The RDMA-enabled NICs can send NACKs to signal that packets arrived out of order.

iWARP is the full TCP protocol implemented on NICs. It is less common than RoCE.
RDMA primitives
- One-sided operations like READ and WRITE
- Two-sided operations like SEND and RECEIVE
- Limited support for compare-and-swap (CAS)
Problems with RDMA:
- Most data structures require multiple round trips to traverse
- Hard to ensure consistency
- Hard to implement transactions
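A sketch of why pointer-based data structures need multiple round trips over RDMA (the addresses and the remote-memory contents are made up for illustration): each one-sided READ fetches one fixed location, so a linked-list traversal costs one network round trip per hop.

```python
# "Remote memory" as a map from address to (value, next-pointer).
remote_mem = {0: ("A", 8), 8: ("B", 16), 16: ("C", None)}

def rdma_read(addr):
    """Stand-in for a one-sided READ: one network round trip."""
    return remote_mem[addr]

def traverse(head):
    round_trips, values, addr = 0, [], head
    while addr is not None:
        val, nxt = rdma_read(addr)   # must wait for this hop's result
        round_trips += 1
        values.append(val)
        addr = nxt                   # the next address is only known now
    return values, round_trips

print(traverse(0))                   # (['A', 'B', 'C'], 3)
```

Because each hop's address depends on the previous READ, the round trips cannot be pipelined; this is what motivates flatter layouts and primitives like PRISM's indirect reads.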
FaRM
No compromises: distributed transactions with consistency, availability, and performance. SOSP '15.
FaRM is a system to provide transactions using RDMA. Every object is associated with a version number and a lock. During a transaction, FaRM gathers the read set (all values needing to be read) and the write set (all values needing to be written). Then it:
- Performs one-sided reads of all values in the read set
- Make write updates locally
- Lock all objects in the write set
- Reread all objects in the read set and restart the transaction if any of them has changed
- Update write objects
- Unlock
Note that this protocol uses optimistic concurrency control. It requires multiple READs and RPCs, which means that the CPU is involved.
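The commit steps above can be sketched as follows (a simplified, single-threaded model with a local dict standing in for remote memory; RDMA details and partial-lock cleanup on lock failure are omitted):

```python
# Each object carries a value, a version number, and a lock bit.
store = {k: {"val": 0, "ver": 0, "locked": False} for k in ("x", "y")}

def read_set_versions(reads):
    # One-sided reads: record the version of each object we read.
    return {k: store[k]["ver"] for k in reads}

def commit(snapshot, writes):
    # Writes were buffered locally by the caller (in `writes`).
    for k in writes:                 # lock every object in the write set
        if store[k]["locked"]:
            return False
        store[k]["locked"] = True
    # Validate: reread versions of the read set; abort if any changed.
    if any(store[k]["ver"] != v for k, v in snapshot.items()):
        for k in writes:
            store[k]["locked"] = False
        return False                 # caller restarts the transaction
    # Install writes, bump versions, and unlock.
    for k, v in writes.items():
        store[k].update(val=v, ver=store[k]["ver"] + 1, locked=False)
    return True

snap = read_set_versions(["x"])
store["x"]["ver"] += 1               # simulate a concurrent write to x
print(commit(snap, {"y": 7}))        # False: read-set validation failed
snap = read_set_versions(["x"])
print(commit(snap, {"y": 7}))        # True on retry
```

The version check is what makes the concurrency control optimistic: conflicts are detected at commit time rather than prevented with read locks.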
PRISM
PRISM: Rethinking the RDMA Interface for Distributed Systems. SOSP '21.
PRISM primitives:
- Indirect reads and writes
- Allocate
- Enhanced CAS
- Operation chaining
In PRISM-TX, the memory layout is changed so that each key keeps the following metadata:
- PR: timestamp of most recent transaction that has read the key and has prepared to commit
- PW: timestamp of most recent transaction that needs to write the key and has prepared to commit
- C: the timestamp of the most recent committed transaction that wrote this key
To do a transaction:
- Perform one-sided reads of all values in the read set
- Make write updates locally
- Check for no concurrent writes by verifying that PW has not increased
- Update write objects
- Unlock
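A minimal sketch of the validation idea behind these timestamps (the key name and layout are invented for illustration): a reader records PW at read time and aborts if PW has advanced by commit time, which would indicate a concurrent prepared write.

```python
# Per-key metadata as described above: PR, PW, C timestamps plus value.
meta = {"k": {"PR": 0, "PW": 0, "C": 0, "val": "old"}}

def read(key):
    m = meta[key]
    return m["val"], m["PW"]         # value plus the PW observed at read time

def validate(key, pw_seen):
    # No concurrent prepared write iff PW has not advanced since our read.
    return meta[key]["PW"] == pw_seen

val, pw = read("k")
meta["k"]["PW"] += 1                 # a concurrent transaction prepares a write
print(validate("k", pw))             # False: this transaction must retry
```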
Kernel bypass and OS abstractions
IOMMU
The IOMMU (I/O memory management unit) translates the virtual addresses used by devices (such as the NIC) into physical memory addresses.
The range of memory the NIC is reading needs to be:
- Physically backed and pinned, since the NIC can't handle page-faults
- Registered when the NIC has IOMMU support

There are two ways to get physically backed, pinned memory:
- mlock() pins the memory you pass in
- The OS maintains huge pages (2 MB or 1 GB). The 2 MB pages are always pinned and physically backed. The NIC can register the huge pages with the IOMMU.
Zero-copy memory allocation
ZC send:
- App copies app memory into the network stack
- No further copies until the NIC fetches the packet from RAM
ZC receive:
- The NIC writes the packet into RAM
- No further copies until the app copies the packet from the network stack

Demikernel
The Demikernel Datapath OS Architecture for Microsecond-scale Datacenter Systems. SOSP '21.
Demikernel aims to provide zero-copy memory even between the app and the NIC.
Demikernel send:
- No copies at all, so the NIC must directly access app memory
- For the NIC to have access to app memory, we must either pin memory on demand or expect the app to use a special memory allocator that allocates pinned memory
Demikernel receive:
- Hand apps direct pointer to the receive buffer
To provide use-after-free protection, when memory is allocated, Demikernel returns a queue token and increments a reference count. The app can use the queue token to push the address to the NIC. When the NIC is done with the memory region, it decrements the reference count; when the app is done with the memory region, it also decrements the reference count. When the reference count reaches 0, the memory region can be freed.
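The reference-counting scheme above can be sketched as follows (the Buffer class and its method names are invented for illustration): both the app and the NIC hold a reference, and the buffer is freed only when the count drops to zero.

```python
class Buffer:
    """Toy reference-counted buffer for use-after-free protection."""
    def __init__(self):
        self.refcount = 0
        self.freed = False
    def incref(self):
        self.refcount += 1
    def decref(self):
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True        # safe: no one references it anymore

buf = Buffer()
buf.incref()                         # app holds the buffer
buf.incref()                         # queue token handed to the NIC
buf.decref()                         # app is done with the buffer...
print(buf.freed)                     # False: the NIC still holds it
buf.decref()                         # ...and now the NIC completes
print(buf.freed)                     # True: the region can be freed
```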
NIC Interfaces
Packetized interface
Traditionally, NICs and applications communicate through buffers that contain packet descriptors. Packet descriptors hold metadata, a flag bit, and a pointer to a fixed-sized packet buffer which holds the data.
In the TX buffer, the NIC reads from the head and the CPU writes to the tail.

In the RX buffer, the NIC writes to the head and the CPU reads from the tail.
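The TX/RX rings above can be sketched as a fixed-size circular buffer where the producer writes at the tail and the consumer reads at the head (a toy model; real descriptors carry metadata, a flag bit, and a pointer to a packet buffer). For TX the CPU produces and the NIC consumes; for RX the roles flip.

```python
class DescriptorRing:
    """Toy circular descriptor ring with head/tail indices."""
    def __init__(self, size):
        self.slots = [None] * size
        self.head = 0                # consumer position
        self.tail = 0                # producer position
        self.count = 0
    def produce(self, desc):
        if self.count == len(self.slots):
            return False             # ring full: producer must wait
        self.slots[self.tail] = desc
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1
        return True
    def consume(self):
        if self.count == 0:
            return None              # ring empty: nothing to do
        desc = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return desc

ring = DescriptorRing(2)
ring.produce("pkt0"); ring.produce("pkt1")
print(ring.produce("pkt2"))          # False: only 2 slots
print(ring.consume())                # 'pkt0': consumed in FIFO order
```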

Packet buffers make it efficient to do software multiplexing and demultiplexing, since only packet pointers need to be delivered. However, now that demultiplexing can be done with NIC offloads, this is no longer as relevant.
There are two problems with the packetized interface:
- Large received data is unnecessarily broken into buffers, causing CPU overhead and cache misses
- More metadata needs to be transferred, which can make PCIe the bottleneck
Enso
Enso: A Streaming Interface for NIC-Application Communication. OSDI '23.
Enso replaces per-packet buffers with contiguous streaming buffers. Zero-copy into these buffers allows better cache utilization, since it avoids dirtying scattered cache lines, and Enso showed that this can increase throughput. With DDIO, the NIC can place received data directly into the L3 cache.
CPU scheduling
CPU scheduling algorithms
Most Theoretical
- First come first serve (FCFS)
- Prioritization
- Shortest remaining processing time (SRPT)
These algorithms assume the scheduler knows how long tasks will take and that the algorithm's implementation has no overhead. They make sense at long timescales.
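A toy comparison of FCFS and SRPT on one core (all jobs arrive at time 0 and run to completion, matching the no-overhead assumption above): running short jobs first minimizes mean response time.

```python
# With all jobs arriving at time 0 and run to completion, SRPT reduces
# to shortest-job-first, and a job's response time is its completion time.
def mean_response(jobs, srpt):
    order = sorted(jobs) if srpt else list(jobs)
    t, total = 0, 0
    for j in order:
        t += j                       # job runs for j time units
        total += t                   # response time = completion time here
    return total / len(jobs)

jobs = [10, 1, 1, 1]                 # one long job ahead of three short ones
print(mean_response(jobs, srpt=False))  # FCFS: (10+11+12+13)/4 = 11.5
print(mean_response(jobs, srpt=True))   # SRPT: (1+2+3+13)/4 = 4.75
```

The long job delays every short job under FCFS; under SRPT only the long job waits, which is exactly the behavior latency-critical datacenter workloads want.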
Kernel-Bypass Scheduling Policies
- ZygOS
- Shenango
- Arachne
- Caladan
Very low overhead, which lets them deal effectively with microsecond-scale events. Very good performance, but they require application changes and not all policies are implemented.
Improve Linux Scheduling
- Syrup
- Ghost
- Snap
Some changes to the kernel are necessary. Worse performance than kernel bypass.
Most Practical
- Completely fair scheduling (CFS) [Linux]
- Earliest eligible virtual deadline first (EEVDF) [Linux]
Shenango
Key idea: quick core allocation, reassigning cores between applications at microsecond timescales.
Shenango consists of a per-application runtime and a centralized IOKernel.
IOKernel
- Algorithm for scheduling
- Core pinning
- Control plane - decides where packets go
- Data plane - moves packets
Attack of the Killer Microseconds
Hardware can mask the latency of nanosecond-scale events. e.g. Cache misses are around 100ns. During that time, hardware can do out-of-order execution or the CPU can switch to another hyper-thread.
Software can mask the latency of millisecond-scale events, e.g. disk I/Os, WAN network round trips, and low-end flash devices. During that time, the OS can simply switch to another thread.
Nothing can effectively mask the latency of microsecond-scale events. e.g. Datacenter round trip time and high-end flash devices.
CPU scheduling (cont.)
ghOSt
Problem:
- Custom scheduling policies are valuable but hard to build and deploy inside the kernel
Solution:
- Separate mechanism from policy
- The kernel keeps the mechanism (executing scheduling decisions)
- User space provides the policy (deciding what to run)
Syrup
Applications mark the signals they want to monitor and provide a map from signal to policy.
Caladan
Identified three sources of interference: hyperthread, memory bandwidth, and last-level cache (LLC).

Hyperthread interference
- Assume interference is always present when both siblings are active
- Monitor LC tasks, queuing delays, response-time delays
Memory bandwidth interference
- Poll DRAM to see how much bandwidth is used currently by all tasks
- To attribute to individual task, sample LLC misses
LLC interference
- Measure increased queueing delays caused by reductions in compute capacity
Signals vs. interrupts
Aspen
In Aspen-KB, a timer core is allocated to periodically preempt runtime cores. The network RX queue is frequently polled for incoming packets (requests). Requests are added to a "new" queue. A thread from the new queue runs for some quantum and then gets added to the preempted queue. We assume that the priority of new requests is greater than that of preempted ones.
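The Aspen-KB queueing discipline above can be sketched as follows (a toy model; quanta are one time unit and the task names are made up): new requests are always scheduled before preempted ones, and a task that exhausts its quantum moves to the preempted queue.

```python
import collections

new_q = collections.deque()          # incoming requests (high priority)
preempted_q = collections.deque()    # tasks that used up a quantum

def schedule():
    """Pick the next task: new requests before preempted ones."""
    if new_q:
        return new_q.popleft()
    if preempted_q:
        return preempted_q.popleft()
    return None

def run_quantum(task, remaining):
    """Run one quantum; requeue the task if it still has work left."""
    remaining -= 1
    if remaining > 0:
        preempted_q.append((task, remaining))

new_q.extend([("A", 2), ("B", 1)])   # A needs 2 quanta, B needs 1
order = []
while (t := schedule()) is not None:
    order.append(t[0])
    run_quantum(*t)
print(order)                         # ['A', 'B', 'A']: A is preempted so B runs first
```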
In Aspen-Go, a dedicated thread called sysmon preempts goroutines periodically. Each core has a queue of runnable goroutines, and there is also a global queue of preempted goroutines. Tasks on per-core queues are prioritized over the global queue. One problem is that sysmon does not have direct access to the network RX queue.