Topics
- Multicore
  - Background: Processors and cores, Cache coherence
  - Papers: Multikernels, COST
- OS structure and abstractions
  - Papers: Exokernels, Dune
- High-performance I/O
  - Background: Linux network stack, User-level network stack
  - Papers: IX, Arrakis
- High-performance I/O (cont.)
  - Background: eBPF, Polling vs. interrupts
  - Papers: XDP, XRP
- Remote direct memory access
  - Background: RDMA protocol
  - Papers: FaRM, PRISM
- Kernel bypass and OS abstractions
  - Background: IOMMU, Zero-copy memory allocation
  - Papers: Demikernel
Multicore
Processors and cores
A processor contains cores, a memory controller unit (MCU), a shared L3 cache, I/O interfaces (such as PCI Express), and interconnects. Cache lines are the fixed-size blocks of data that are loaded into caches such as the shared L3. A core is a basic computational unit that maintains its own set of registers.
A system may have multiple processors, which may have multiple cores that support multiple hardware threads or contexts.
In a symmetric/shared memory multiprocessor (SMP), all processors can run any task and access any memory. The OS is responsible for scheduling threads across processors. Sometimes non-uniform memory access (NUMA) is used, which means that each processor has local RAM but can also access remote RAM attached to another processor.
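As a hedged illustration of NUMA locality, here is a minimal sketch using libnuma (assumptions: libnuma is installed and the machine has a node 0; link with -lnuma):

```c
// Sketch: allocate memory on a specific NUMA node with libnuma.
// Build: gcc numa_demo.c -lnuma
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    size_t size = 1 << 20;                /* 1 MiB */
    /* Memory local to node 0: fast for threads running on node 0,
       a slower remote access for threads on any other node. */
    void *local = numa_alloc_onnode(size, 0);
    if (!local) return 1;
    memset(local, 0, size);               /* touch pages so they are placed */
    numa_free(local, size);
    return 0;
}
```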
In massively parallel processing (MPP), there are many processors, each with their own local memory, that communicate by message-passing. GPUs are one example.
Cache coherence
One big problem with multiprocessors and multicores is cache incoherence: when a memory location is modified, cached copies of that location become stale. A cache coherence protocol is needed to keep all copies consistent. The protocol must guarantee:
- Write propagation: writes to a memory location must be seen by everyone
- Transaction serialization: reads and writes to a memory location must be seen in the same order by everyone
Nevertheless, cache coherence comes with its own set of problems:
- Cache ping-pong: when two threads on different processors repeatedly modify variables that reside on the same cache line, the cache line will be transferred back and forth between the processors, causing a lot of cache coherence traffic. This can be caused by false sharing or true sharing
- False sharing: when two threads on different processors modify variables that reside on the same cache line, the cache coherence protocol will force the cache line to be transferred between the processors, even though the threads are not actually sharing any data
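A minimal C sketch of false sharing (the counter names, iteration counts, and the 64-byte line size are assumptions; 64 bytes is the typical x86 cache line):

```c
// Sketch: false sharing. Two threads write to *different* variables,
// but because the variables sit on the same 64-byte cache line, every
// write invalidates the other core's copy (cache ping-pong).
// Build: gcc -O2 -pthread false_sharing.c
#include <pthread.h>
#include <stdio.h>

struct shared {
    long a;                       /* written by thread 1 */
    long b;                       /* same cache line as a: false sharing */
};

struct padded {
    long a;
    char pad[64 - sizeof(long)];  /* push b onto its own cache line */
    long b;
};

static struct shared s;

static void *bump_a(void *arg) {
    for (long i = 0; i < 100000000; i++) s.a++;
    return NULL;
}
static void *bump_b(void *arg) {
    for (long i = 0; i < 100000000; i++) s.b++;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a=%ld b=%ld\n", s.a, s.b);
    return 0;
}
```

Switching `struct shared` to the padded layout gives each counter its own cache line, which removes the coherence traffic and typically speeds up the loops noticeably.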
Multikernels
The Multikernel: A New OS Architecture for Scalable Multicore Systems. SOSP 2009.
Fish in a Barrel: An Insider’s Retrospective of the SOSP’09 Multikernel Paper.
- The authors wanted to address (1) scalability as core counts rise, since cache coherence protocols become a bottleneck, and (2) the diversity of hardware within a system and across systems, which makes it difficult to optimize the OS for all hardware types.
- The idea of a multikernel is to have a distributed system of OS instances, one per core, that communicate via message-passing rather than shared state. In hindsight, the problem with this idea was that core counts didn't actually increase as dramatically as predicted.
- It's important to weigh the benefits of shared state versus message-passing. Shared state is the traditional way of building operating systems: all cores share the same memory space and communicate by reading and writing shared memory, which can be hard to reason about. Message-passing involves asynchronous programming, which can be complicated in its own right.
COST
Scalability! But at what COST?
The authors argue that parallelizable overheads in many distributed systems papers exaggerate those systems' scalability: adding cores mostly scales away overhead that a single-threaded implementation never had. Distributed systems papers should therefore also report the COST (Configuration that Outperforms a Single Thread) metric to gauge the true performance benefit of a system.
OS structure and abstractions
Exokernels
Exokernel: An Operating System Architecture for Application-Level Resource Management. SOSP '95.
- The Exokernel paper addresses the problem that traditional operating systems provide only general, fixed, high-level abstractions such as virtual memory and interprocess communication. This makes it difficult for databases, garbage collectors, sandboxes, and web servers to optimize their performance.
- The Exokernel paper proposes a new operating system architecture which only handles the protection and multiplexing of resources. The authors envisioned that general applications would use specialized, user-level library operating systems built on primitives (secure bindings) provided by the operating system (Aegis) to optimize their work.
Dune
Dune: Safe User-level Access to Privileged CPU Features. OSDI '12.
- The Dune paper tackles the same problem as the Exokernel paper. The authors propose using the virtualization hardware in CPUs to give a process safe access to privileged hardware features (e.g., page tables and privilege modes). This lets applications take advantage of hardware features without changing existing operating system abstractions.
- The prototype of this solution is Dune, which consists of a loadable kernel module on top of the OS (about 2.5k lines of code) and a user-level library, libDune.
- Dune is an example of how advancements in hardware can motivate new papers.
High-performance I/O
Linux network stack
The Linux network stack is slow because it involves many memory copies, context switches, interrupts, and queueing delays.
User-level network stacks
To achieve higher performance, user-level network stacks allow applications to directly read and write to device driver queues, bypassing the kernel. This reduces the number of memory copies, context switches, and interrupts. Having a run-to-completion model also allows the application to avoid context switches and interrupts.
The downside is that applications need to roll their own network stack (e.g., TCP), and polling wastes cycles when no packets are arriving.
IX
IX: A Protected Dataplane Operating System for High Throughput and Low Latency. OSDI '14.
The paper addresses the problem of providing protection and isolation for kernel-bypass networking stacks.
The authors propose creating a dataplane operating system which separates the control plane, which handles coarse-grained resource allocation, from the dataplane, which runs the actual networking stack and the application logic. IX depends on virtualization hardware, which (1) allows the dataplane to use hardware resources directly and (2) runs the control plane, the dataplane, and the application at different protection levels.
Arrakis
Arrakis: The Operating System is the Control Plane. OSDI '14.
Similar solution to IX, but Arrakis relies on device-level virtualization (e.g., SR-IOV) to give each application its own virtual NIC instance, with the kernel serving only as the control plane.
High-performance I/O (cont.)
eBPF
Extended Berkeley Packet Filter (eBPF) allows users to run special programs in the Linux kernel without changing kernel source code or loading kernel modules. The verifier ensures that a program is safe to run in the kernel by checking memory safety and guaranteed termination (bounded loops, an instruction limit, no arbitrary jumps).
eBPF programs are attached to hook points, which can be system calls, kernel function entry points, tracepoints, or network events. When an eBPF program is loaded, the jump address at the hook point is patched to jump into the eBPF program and then back. Note that the goal of eBPF is observability and tracing, not kernel bypass.
eBPF maps provide a mechanism for eBPF programs to store and share data within the kernel and to communicate with user-space applications. Maps come in a variety of types, including hash tables, arrays, and ring buffers.
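As a hedged sketch, here is a small eBPF program in the libbpf BTF style that counts packets into a per-CPU array map; it attaches at the XDP hook covered below (the file name and build command are assumptions):

```c
// Sketch: an XDP-attached eBPF program counting packets in a per-CPU map.
// Build: clang -O2 -target bpf -c count.bpf.c -o count.bpf.o
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} pkt_count SEC(".maps");

SEC("xdp")
int count_packets(struct xdp_md *ctx)
{
    __u32 key = 0;
    __u64 *val = bpf_map_lookup_elem(&pkt_count, &key);
    if (val)             /* the verifier requires this NULL check */
        (*val)++;
    return XDP_PASS;     /* hand the packet to the normal stack */
}

char LICENSE[] SEC("license") = "GPL";
```

A user-space loader (or bpftool) can then read `pkt_count` through the map interface, which is exactly the kernel/user-space communication channel described above.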
Note that eBPF programs always execute inside a kernel: in a virtual machine they run on the guest kernel, and containers share the host kernel, so loading programs from a container requires elevated privileges on that shared kernel.
Polling vs. interrupts
The NIC writes arriving packets into pre-prepared packet buffers, and the CPU then needs to learn that a packet has arrived. There are two ways to do this: polling and interrupts. With polling, the CPU constantly monitors a particular memory address, wasting cycles while nothing arrives. With interrupts, the NIC raises an interrupt that the kernel handles when a packet arrives.
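A minimal sketch contrasting the two (the "NIC" here is just a thread writing a flag; all names are invented for illustration):

```c
// Sketch: polling. One thread stands in for the NIC (it "DMAs" a packet
// by setting a flag in shared memory); the main thread busy-polls that
// flag, burning CPU until data arrives. The interrupt-driven alternative
// would block (e.g., in read()/epoll_wait()) and let the kernel wake it.
// Build: gcc -O2 -pthread poll_demo.c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static atomic_int packet_ready = 0;

static void *fake_nic(void *arg) {
    usleep(1000);                       /* packet arrives "later" */
    atomic_store(&packet_ready, 1);     /* NIC writes the descriptor */
    return NULL;
}

int main(void) {
    pthread_t nic;
    pthread_create(&nic, NULL, fake_nic, NULL);
    while (!atomic_load(&packet_ready))
        ;                               /* busy-poll: wasted cycles */
    printf("got packet\n");
    pthread_join(nic, NULL);
    return 0;
}
```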
XDP
The eXpress Data Path: Fast Programmable Packet Processing in the Operating System Kernel. CoNEXT '18.
Many applications (e.g., those built on DPDK) achieve high-performance networking through kernel-bypass techniques that avoid operating system overhead. The drawbacks of kernel bypass are losing kernel isolation and security, losing access to kernel network stack features (routing tables, the TCP stack), and needing to dedicate full CPU cores to packet processing.
The authors propose eBPF-based eXpress Data Path (XDP), which lets user-defined programs run in the Linux kernel at the NIC driver level. This achieves high performance while retaining kernel isolation, security, and access to the network stack.
XDP showed higher performance than baseline Linux but lower than DPDK for packet drop/forwarding benchmarks. The eBPF model was flexible enough to be used to implement layer-3 routing, inline DDoS protection, and layer-4 load balancing.
XRP
XRP: In-Kernel Storage Functions with eBPF. OSDI '22.
Similar to XDP, but for storage: eBPF functions are hooked into the kernel at the NVMe driver layer, so multi-step operations (e.g., B-tree lookups) can resubmit I/O without repeatedly traversing the kernel storage stack.
Remote direct memory access
RDMA protocol
RDMA (Remote Direct Memory Access) allows one machine to make remote memory reads or writes without server CPU involvement. This is different from DPDK, where the server needs the CPU to poll for incoming messages in the NIC queues (tx_burst, rx_burst).
The transport protocol is offloaded to the hardware itself and is responsible for ensuring that packets arrive "reliably". This means:
- Packets shouldn't be dropped
- Packets shouldn't be received out-of-order
The transport protocol includes ACKs, flow control (how many messages can be in flight at once), and congestion control. The hardware can be RDMA-enabled NICs or RDMA-enabled switches.
RoCE (RDMA over Converged Ethernet) relies on PFC (Priority Flow Control) and RDMA-enabled NICs. PFC allows switches to send PAUSE and RESUME frames to neighboring switches, ensuring that no packets are dropped. The RDMA-enabled NICs send NACKs to signal that packets arrived out of order.
iWARP implements the full TCP protocol on the NIC. It is less common than RoCE.
RDMA primitives
- One-sided operations like READ and WRITE
- Two-sided operations like SEND and RECEIVE
- Limited support for atomic operations like compare-and-swap (CAS)
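As a hedged sketch, here is roughly what posting a one-sided READ looks like with libibverbs; it assumes a queue pair `qp` that is already connected, a registered local buffer `mr` (from ibv_reg_mr), and a `remote_addr`/`rkey` exchanged out of band (all setup code is omitted):

```c
// Sketch: post a one-sided RDMA READ with libibverbs. The remote CPU is
// not involved; the remote NIC serves the read directly from memory.
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int post_rdma_read(struct ibv_qp *qp, struct ibv_mr *mr,
                   uint64_t remote_addr, uint32_t rkey, size_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)mr->addr,   /* local buffer to fill */
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_READ;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* completion on our CQ */
    wr.wr.rdma.remote_addr = remote_addr;        /* peer's address */
    wr.wr.rdma.rkey        = rkey;               /* peer's memory key */
    return ibv_post_send(qp, &wr, &bad_wr);      /* 0 on success */
}
```

Completion is then detected by polling the completion queue (ibv_poll_cq); the remote CPU never runs.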
Problems with RDMA:
- Most data structures require multiple round trips to traverse
- Hard to ensure consistency
- Hard to implement transactions
FaRM
No compromises: distributed transactions with consistency, availability, and performance. SOSP '15.
FaRM is a system that provides distributed transactions over RDMA. Every object is associated with a version number and a lock. During a transaction, FaRM gathers the read set (all values to be read) and the write set (all values to be written). Then it:
- Performs one-sided reads of all values in the read set
- Makes write updates locally
- Locks all objects in the write set
- Rereads all objects in the read set and restarts the protocol if any of them has changed
- Updates the write objects
- Unlocks
Note that this protocol uses optimistic concurrency control. It requires multiple READs and RPCs, which means that the CPU is involved.
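A single-machine sketch of this version-and-lock scheme (all names are invented for illustration; real FaRM performs the reads, lock acquisitions, and validations over RDMA and RPCs):

```c
// Sketch: FaRM-style optimistic concurrency, simulated locally. Each
// object has {lock, version, value}; a transaction reads optimistically,
// locks its write set, re-validates read versions, then installs writes.
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct obj {
    atomic_bool lock;
    atomic_long version;
    long value;
};

/* Try to commit: write new_val into *w if *r's version is still rv. */
static bool commit(struct obj *r, long rv, struct obj *w, long new_val) {
    bool expected = false;
    /* Lock the write set (FaRM: a compare-and-swap on the object header). */
    if (!atomic_compare_exchange_strong(&w->lock, &expected, true))
        return false;                     /* conflict: retry transaction */
    /* Validate the read set (FaRM: re-read versions with one-sided READs). */
    if (atomic_load(&r->version) != rv) {
        atomic_store(&w->lock, false);
        return false;                     /* someone changed what we read */
    }
    w->value = new_val;                   /* install the write */
    atomic_fetch_add(&w->version, 1);     /* bump the version */
    atomic_store(&w->lock, false);        /* unlock */
    return true;
}

int main(void) {
    struct obj a = { .value = 1 }, b = { .value = 0 };
    long rv = atomic_load(&a.version);    /* optimistic read of a */
    long seen = a.value;
    printf("commit %s\n", commit(&a, rv, &b, seen + 1) ? "ok" : "retry");
    return 0;
}
```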
PRISM
PRISM: Rethinking the RDMA Interface for Distributed Systems. SOSP '21.
PRISM primitives:
- Indirect reads and writes
- Allocate
- Enhanced CAS
- Operation chaining
In PRISM-TX, the memory layout is changed so that each key keeps the following metadata:
- PR: timestamp of most recent transaction that has read the key and has prepared to commit
- PW: timestamp of most recent transaction that needs to write the key and has prepared to commit
- C: the timestamp of the most recent committed transaction that wrote this key
To do a transaction:
- Perform one-sided reads of all values in the read set
- Make write updates locally
- Check that there were no concurrent writes by verifying that PW has not increased
- Update write objects
- Unlock
Kernel bypass and OS abstractions
IOMMU
The IOMMU (I/O memory management unit) translates the virtual addresses used by devices such as the NIC into physical memory addresses, much as the MMU does for the CPU.
The range of memory the NIC accesses needs to be:
- Physically backed and pinned, since the NIC can't handle page faults
- Registered with the IOMMU when the NIC has IOMMU support
There are two ways to get physically backed, pinned memory:
- mlock() pins the memory you pass in
- The OS maintains huge pages (2 MB or 1 GB). The 2 MB pages are always pinned and physically backed, and the NIC can register them with the IOMMU.
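A minimal sketch of both options on Linux (assumptions: mlock may require raising RLIMIT_MEMLOCK, and MAP_HUGETLB needs huge pages reserved via /proc/sys/vm/nr_hugepages):

```c
// Sketch: two ways to get pinned, physically backed memory on Linux.
// Build: gcc pin_demo.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    /* Option 1: mlock() pins the ordinary pages passed to it. */
    size_t len = 4096;
    void *buf = aligned_alloc(4096, len);
    if (mlock(buf, len) != 0) perror("mlock");
    memset(buf, 0, len);          /* safe to hand to a DMA engine now */

    /* Option 2: a 2 MB huge page, physically backed and unswappable. */
    void *huge = mmap(NULL, 2 * 1024 * 1024, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (huge == MAP_FAILED) perror("mmap(MAP_HUGETLB)");

    munlock(buf, len);
    free(buf);
    if (huge != MAP_FAILED) munmap(huge, 2 * 1024 * 1024);
    return 0;
}
```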
Zero-copy memory allocation
ZC send:
- App copies app memory into the network stack
- No further copies until the NIC fetches the packet from RAM
ZC receive:
- The NIC writes the packet into RAM
- No further copies until the app copies the packet from the network stack
Demikernel
The Demikernel Datapath OS Architecture for Microsecond-scale Datacenter Systems. SOSP '21.
Demikernel aims to achieve zero-copy I/O even between the app and the NIC.
Demikernel send:
- No copies at all, so the NIC must directly access app memory
- For the NIC to have access to app memory, we must either pin memory on demand or have the app use a special memory allocator that hands out pinned memory
Demikernel receive:
- Hand the app a direct pointer to the receive buffer
To provide use-after-free protection, Demikernel returns a queue token when memory is allocated and increments a reference count. The app can use the queue token to push the address to the NIC. When the NIC is done with the memory region, it decrements the reference count; when the app is done with the memory region, it also decrements the reference count. When the reference count reaches 0, the memory region can be freed.
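A rough sketch of that reference-counting scheme (the names zc_buf, nic_complete, and app_done are invented for illustration):

```c
// Sketch: use-after-free protection via reference counts, as in the
// Demikernel description above. The buffer stays alive until both the
// app and the in-flight NIC I/O have dropped their references.
#include <stdatomic.h>
#include <stdlib.h>

struct zc_buf {
    atomic_int refcnt;
    char data[2048];
};

/* Allocate a buffer; count one reference for the app, one for the NIC. */
struct zc_buf *zc_alloc(void) {
    struct zc_buf *b = malloc(sizeof(*b));
    atomic_store(&b->refcnt, 2);      /* held by app + in-flight NIC I/O */
    return b;
}

static void zc_put(struct zc_buf *b) {
    if (atomic_fetch_sub(&b->refcnt, 1) == 1)
        free(b);                      /* last reference: safe to free */
}

void nic_complete(struct zc_buf *b) { zc_put(b); }  /* NIC finished DMA */
void app_done(struct zc_buf *b)     { zc_put(b); }  /* app done with data */

int main(void) {
    struct zc_buf *b = zc_alloc();    /* app pushes b's address to the NIC */
    nic_complete(b);                  /* completion path drops its reference */
    app_done(b);                      /* refcount hits 0: buffer is freed */
    return 0;
}
```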