Internet Architecture
The Internet is a BIG distributed system with a large dynamic range.
Design principles
- Federated design: no single entity controls the entire system
- Best effort: the network does not guarantee delivery, but it tries its best to deliver packets
- End-to-end principle: the network is kept as simple as possible; endpoints are responsible for reliability, security, etc.
The tradeoff is that the Internet is hard to manage, offers no performance guarantees, and can be slow relative to what its infrastructure allows.
Clark 88
The design philosophy of the DARPA internet protocols
TCP/IP was first proposed by the Defense Advanced Research Projects Agency (DARPA). Its main goal was to effectively multiplex across existing networks. Some other goals were:
- Survivability and fault-tolerance
- Supporting a variety of networks
- Distributed management of resources
- Cost-effectiveness
- Accountability
Some design decisions made were:
- Datagrams
- Packet switching instead of circuit switching
- Storing state at the endpoints instead of the network
Circuit switching involves setting up a dedicated path between the source and destination before data can be sent. By reserving resources along the path, the network can make performance guarantees. Ex. telephone networks.
Store-and-forward packet switching sends data in small packets that can take different paths to the destination. While this supports flexible topologies and benefits from statistical multiplexing, it does not allow for performance guarantees. Ex. the Internet.
E2E
End-to-end arguments in system design
The end-to-end argument is that only endpoints can provide certain functions correctly, and that implementing these functions in the network can be redundant and inefficient. Examples of these functions include:
- Reliable data transmission
- Acknowledgment of delivery
- Data security
- Duplicate message suppression
For example, consider the problem of reliable data transfer. A transfer may involve the applications on both hosts, the operating system, the disk, and the communication subsystem, any of which could be a point of failure. Thus, reliable data transfer can only be fully implemented at the application layer with an end-to-end check and retry, and implementing reliability in the network as well may be redundant and inefficient.
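A minimal sketch of end-to-end check and retry, where the sender's checksum is verified only at the final destination (the send/recv functions are hypothetical stand-ins for whatever transfer path sits in between):

```python
import hashlib

def transfer_with_retry(data: bytes, send, recv, max_retries: int = 3) -> bytes:
    """Sketch: application-level end-to-end check and retry."""
    digest = hashlib.sha256(data).hexdigest()
    for _ in range(max_retries):
        send(data, digest)
        received, received_digest = recv()
        # The end-to-end check: verify at the destination, regardless of what
        # guarantees the OS, disk, or network in between claim to provide.
        if hashlib.sha256(received).hexdigest() == received_digest == digest:
            return received
    raise IOError("end-to-end check failed after retries")
```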
Routing Schemes
Multicast is a routing scheme that sends data from one source to multiple destinations simultaneously. Packets are replicated only where paths to different destinations diverge, which conserves bandwidth.
Multihoming connects a host to multiple networks in order to increase reliability and performance.
Internet Infrastructure
B4 And After
Private Wide Area Networks (WANs) are used by large organizations to connect their offices and data centers. They often have a centralized control plane, which allows for more efficient traffic engineering and better performance than the broader Internet.
B4 is Google's private WAN. One of the main scalability issues in B4 was that increasing site count (1) complicated capacity planning, (2) slowed the TE algorithm, and (3) put pressure on switch forwarding tables.
One of the things Google did to solve that problem was add more hierarchy to the network topology. Each site now has multiple supernodes (leaf and spine architecture) connected in a full mesh.
Edge Caching as Differentiation
Edge caching leads to performance differences for end-users, similar to traffic differentiation. Furthermore, these differences do not explicitly come about as a result of service differentiation, but rather arise implicitly from the nature of shared caching.
CityMesh
Scalable Routing in a City-Scale Wi-Fi Network for Disaster Recovery
CityMesh uses static access points and mobile devices equipped with Wi-Fi to provide connectivity in cases where (1) the network is down but (2) the physical infrastructure is still intact.
It uses map data to determine the best path for routing packets between buildings, and it uses grid-based addressing to allow for scalable routing.
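A rough sketch of grid-based addressing with greedy forwarding (the cell size and coordinate scheme are invented for illustration, not CityMesh's actual parameters):

```python
CELL_METERS = 100  # hypothetical grid cell size

def grid_address(x_m: float, y_m: float) -> tuple[int, int]:
    """Map planar coordinates (meters) to a grid cell address."""
    return (int(x_m // CELL_METERS), int(y_m // CELL_METERS))

def next_hop_cell(current: tuple[int, int], dest: tuple[int, int]) -> tuple[int, int]:
    """Greedily step one cell toward the destination; no per-node routing
    state is needed beyond the grid address itself."""
    step = lambda c, d: c + (d > c) - (d < c)
    return (step(current[0], dest[0]), step(current[1], dest[1]))

print(grid_address(1250.0, 480.0))      # (12, 4)
print(next_hop_cell((12, 4), (15, 2)))  # (13, 3)
```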
Congestion Control
Dismantling a Religion
Flow rate fairness: dismantling a religion
Briscoe '08 argues that flow rate fairness is not a good measure of 'fairness'. For one, most flow rate fairness schemes can be taken advantage of by users who open multiple flows, and thus receive more bandwidth.
Instead, cost fairness, which considers the congestion caused by a user, is a better measure. "You get what you pay for."
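A toy calculation of the multi-flow exploit (numbers invented): under per-flow fairness, a user who opens nine flows gets nine times the bandwidth of a user who opens one.

```python
LINK_CAPACITY = 100.0  # Mbit/s, illustrative bottleneck
flows_per_user = {"alice": 1, "bob": 9}  # bob opens 9 parallel flows

total_flows = sum(flows_per_user.values())
per_flow_share = LINK_CAPACITY / total_flows  # "fair" per-flow allocation

for user, n in flows_per_user.items():
    print(f"{user}: {n * per_flow_share:.0f} Mbit/s")  # alice: 10, bob: 90
```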
Access
cISP
cISP: A Speed-of-Light Internet Service Provider
The best possible latency between two points on Earth is set by the speed of light in a vacuum, or c-latency. Most Internet fetches take 36-100x c-latency, largely due to protocol inefficiencies. Infrastructure alone accounts for around 3-4x: light in fiber travels at around two-thirds of c (a 1.5x penalty by itself), and fiber routes are often circuitous.
cISP is a service provider that uses microwave antennas for long-haul routing and uses fiber for the last mile. Microwave has a short range (~100 km) and limited bandwidth, but also a transmission speed essentially equal to c. However, it is very sensitive to weather and obstructions, and is currently only widely used in high-frequency trading (HFT).
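Back-of-the-envelope latency arithmetic for an illustrative 4,000 km path:

```python
C_KM_PER_MS = 299_792.458 / 1000   # speed of light: ~300 km per ms
FIBER_FRACTION_OF_C = 2 / 3        # light in fiber travels at ~2/3 c

distance_km = 4000                 # hypothetical great-circle distance
c_latency_ms = distance_km / C_KM_PER_MS
fiber_latency_ms = c_latency_ms / FIBER_FRACTION_OF_C

print(f"c-latency:      {c_latency_ms:.1f} ms")      # ~13.3 ms one-way
print(f"straight fiber: {fiber_latency_ms:.1f} ms")  # ~20.0 ms: already 1.5x
                                                     # before any circuitous routing
```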
Traffic Control
L4S
Low Latency, Low Loss, and Scalable Throughput (L4S) is an architecture for Internet congestion control. It uses Explicit Congestion Notification (ECN) to signal congestion early, before queues build up and packets are dropped.
Endpoints using L4S are given preferential treatment (a separate low-latency queue) in exchange for cooperating via improved, scalable CCAs. Remarkably, both L4S and non-L4S traffic see improved performance.
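A sketch contrasting a classic AIMD response with the DCTCP-style "scalable" response that L4S CCAs such as TCP Prague build on: shrink the window in proportion to the fraction of ECN-marked packets rather than halving on any signal. (Simplified; real stacks smooth the marking fraction with an EWMA.)

```python
def classic_response(cwnd: float, any_signal: bool) -> float:
    """AIMD: halve the window on any loss/mark, else grow additively."""
    return cwnd / 2 if any_signal else cwnd + 1

def scalable_response(cwnd: float, mark_fraction: float) -> float:
    """Scale the reduction to the congestion level: 5% of packets marked
    shrinks the window by only 2.5%, keeping throughput high and queues low."""
    return cwnd * (1 - mark_fraction / 2) if mark_fraction > 0 else cwnd + 1
```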
RCS
Principles for Internet Congestion Management
The Internet relies on host-based congestion control algorithms (CCAs) to prevent overloads. However, users have an incentive to deploy more aggressive CCAs to receive more bandwidth. To prevent this, the Internet informally requires all CCAs to be TCP-friendly (TCPF), which means
its arrival rate does not exceed the arrival rate of a conformant TCP connection in the same circumstances
There are multiple problems with TCPF:
- Difficult to enforce
- Limits CCAs' ability to achieve full efficiency
- In practice, non-TCPF CCAs like BBR are widely deployed
The authors' proposal is to have the network actively enable all reasonable CCAs to achieve the same bandwidth in the same static circumstances, or CCA independence (CCAI). They describe a Recursive Congestion Shares (RCS) framework which uses existing commercial agreements to determine packets' relative rights in a link.
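A minimal sketch of the recursive division idea (the customer tree and weights below are invented; RCS derives them from existing commercial agreements):

```python
def recursive_shares(node: dict, capacity: float, shares: dict) -> None:
    """Split a link's capacity among direct customers by weight, then
    recursively subdivide each customer's share among its own customers."""
    children = node.get("children", [])
    if not children:
        shares[node["name"]] = capacity
        return
    total = sum(c["weight"] for c in children)
    for child in children:
        recursive_shares(child, capacity * child["weight"] / total, shares)

tree = {"name": "link", "children": [
    {"name": "isp_a", "weight": 2, "children": [
        {"name": "host_1", "weight": 1}, {"name": "host_2", "weight": 1}]},
    {"name": "isp_b", "weight": 1}]}

shares = {}
recursive_shares(tree, 90.0, shares)
print(shares)  # {'host_1': 30.0, 'host_2': 30.0, 'isp_b': 30.0}
```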
Datacenters
ECMP leads to collisions but is easy to implement in hardware.
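A sketch of the hash-based path choice (hash function and flows are illustrative, not what any particular switch uses): every packet of a flow hashes to the same next hop, so there is no reordering, but two elephant flows can land on the same path and collide.

```python
import zlib

def ecmp_next_hop(five_tuple: tuple, next_hops: list[str]) -> str:
    """Pick among equal-cost next hops by hashing the flow's 5-tuple."""
    h = zlib.crc32(repr(five_tuple).encode())
    return next_hops[h % len(next_hops)]

paths = ["core1", "core2", "core3", "core4"]
flow_a = ("10.0.0.1", "10.0.1.1", 5001, 80, "tcp")
flow_b = ("10.0.0.2", "10.0.1.2", 5002, 80, "tcp")
print(ecmp_next_hop(flow_a, paths), ecmp_next_hop(flow_b, paths))
```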
Fat-Tree
A scalable, commodity data center network architecture
Let k be the number of ports on a switch. A fat-tree topology has (see the calculator sketch below):
- k pods, each with two layers of k/2 switches (edge and aggregation)
- (k/2)^2 core switches, each connected to one switch in each pod
- k^3/4 hosts
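These counts as a quick calculator (the formulas are the standard ones from the paper):

```python
def fat_tree_sizes(k: int) -> dict:
    """Sizes of a fat-tree built from k-port switches."""
    assert k % 2 == 0, "k must be even"
    return {
        "pods": k,
        "switches_per_pod": k,            # k/2 edge + k/2 aggregation
        "core_switches": (k // 2) ** 2,
        "total_switches": 5 * k * k // 4,
        "hosts": k ** 3 // 4,
    }

print(fat_tree_sizes(48))  # 48-port switches -> 27,648 hosts, 2,880 switches
```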
Benefits
- Don't need to buy more powerful switches for aggregation and core layers
- Fault-tolerance
- 1:1 oversubscription ratio
Issues
- No great solution for TOR redundancy
- Difficult to load-balance between the core and the aggregate switches
- Not amenable to incremental expansion. k is limited by the number of ports per switch. Hosts scale with k^3 and switches scale with k^2.
Jupiter Rising
Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google's Datacenter Network
Datacenter bandwidth demands have doubled every year for the past decade, and are expected to continue to do so. Jupiter Rising describes the evolution of Google's datacenter network over the past decade to support this growth.
Jellyfish
Jellyfish: Networking Data Centers Randomly
The idea behind Jellyfish is to use random graphs instead of structured topologies like hypercubes and fat-trees. Randomly connecting nodes allows for incremental expansion, shorter average path lengths, and better bandwidth. However, it also leads to more complex routing and load balancing.
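A sketch of the construction using networkx (the switch count and port count are illustrative):

```python
import networkx as nx

n, d = 100, 6  # 100 top-of-rack switches, 6 inter-switch ports each
G = nx.random_regular_graph(d, n, seed=0)

# d-regular random graphs are connected with high probability for d >= 3,
# and their short average paths are where the bandwidth benefits come from.
assert nx.is_connected(G)
print(nx.average_shortest_path_length(G))  # roughly 2.6 hops here
```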
Bundling and patch panels are two techniques for reducing cabling complexity in a datacenter network.
Management
SDN Talk
The Future of Networking, and the Past of Protocols
Road to SDN
The road to SDN: an intellectual history of programmable networks
Software-defined networking (SDN) separates the control plane (which decides how to handle traffic) from the data plane (which forwards traffic according to the control plane's instructions). It also consolidates the control plane so that a single piece of software can control multiple dataplane elements.
OpenFlow is an API that allows the control plane to configure packet-handling rules on the data plane.
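A toy model of the match-action abstraction (field names and actions are simplified; this is not the actual OpenFlow wire protocol):

```python
flow_table = []  # (priority, match, action), kept sorted by priority

def install_rule(priority: int, match: dict, action: str) -> None:
    """What the controller does over the OpenFlow channel, in spirit."""
    flow_table.append((priority, match, action))
    flow_table.sort(key=lambda rule: -rule[0])

def handle_packet(pkt: dict) -> str:
    """What the switch does: apply the highest-priority matching rule."""
    for _, match, action in flow_table:
        if all(pkt.get(k) == v for k, v in match.items()):
            return action
    return "send_to_controller"  # table miss: ask the control plane

install_rule(10, {"dst_ip": "10.0.0.2"}, "output:port2")
install_rule(20, {"dst_ip": "10.0.0.2", "tcp_dst": 22}, "drop")  # block SSH
print(handle_packet({"dst_ip": "10.0.0.2", "tcp_dst": 22}))  # drop
```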
Motivations for SDN
- Computer networks have many different components, including routers, switches, and middleboxes (firewalls, load balancers, NATs, intrusion detection systems). Network administrators have to use different closed and proprietary interfaces to configure each of these components.
- SDNs lower the barrier to innovation and experimentation.
History leading to SDN
- Active networks (mid-1990s to early 2000s)
- Control and data plane separation (early 2000s to mid-2000s)
- OpenFlow API and network operating systems (2007-2010)
The idea behind active networks was to allow users to inject code into the network to customize how packets are handled. There are two models:
- The capsule model, where the code is carried in the packet itself
- The programmable switch model, where the code is stored on the switch and packets can trigger the execution of the code
The separation of the control and data plane was motivated by demands for technology to manage routing within an ISP. Compared to active networking, this research focused more on
- Network administrators (rather than end-users)
- Programmability in the control plane (rather than the data plane)
- Control over the network (rather than individual devices)
Notably, control functionality was moved off of network equipment and into separate servers, which can store all of the routing state and compute all of the routing decisions.
dSDN
A Decentralized SDN Architecture for the WAN
In a typical SDN, the control plane is a centralized controller that runs a traffic engineering algorithm to compute paths (while accounting for capacity). However, running an SDN over a global WAN requires a large amount of "SDN control infrastructure" (controller sites and the management connectivity to reach every router). dSDN's proposal is to decentralize: each router runs its own local SDN control logic, removing the dependency on remote controllers.
Network Functions
APLOMB
Making middleboxes someone else's problem: network processing as a cloud service
Middleboxes sit at the edge of the network and intercept and modify traffic. Examples include firewalls, intrusion detection systems, proxies, load balancers, and WAN optimizers.
There are two main ideas in APLOMB:
- Move from hardware appliances to software appliances
- Take software appliances like middleboxes and run them in the cloud
The former was a very influential idea; the latter less so.
Click
Before Click, it was assumed that routers must be fixed-function devices. Click showed that a router can instead be composed from small, reusable packet-processing elements wired together into a graph.
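A Python sketch of the element-graph idea (Click itself is C++ elements wired together by a declarative configuration language; the elements below are invented miniatures):

```python
class Element:
    """Base element: pass the packet to the next element, if any."""
    def __init__(self):
        self.next = None

    def push(self, pkt):
        if self.next:
            self.next.push(pkt)

class Counter(Element):
    def __init__(self):
        super().__init__()
        self.count = 0

    def push(self, pkt):
        self.count += 1
        super().push(pkt)

class DecTTL(Element):
    def push(self, pkt):
        pkt["ttl"] -= 1
        if pkt["ttl"] > 0:       # drop expired packets
            super().push(pkt)

counter, dec = Counter(), DecTTL()
counter.next = dec               # wire: Counter -> DecTTL
counter.push({"ttl": 5, "dst": "10.0.0.1"})
print(counter.count)  # 1
```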
VFP
VFP: A Virtual Switch Platform for Host SDN in the Public Cloud
VFP is Microsoft's virtual switch platform used to implement network functions in software. Its features include:
- Multiple independent network controllers can program network applications
- Rules apply to stateful connections, not just packets
- Flow policy offloaded to programmable NICs (FPGAs)
Note: Google's philosophy in the 2010s was to run everything on the CPU (e.g. Andromeda).
RMT
Forwarding metamorphosis: fast programmable match-action processing in hardware for SDN
Switching chips are 100x faster at switching than CPUs and 10x faster than network processors. They take advantage of pipelining and parallelism.
Pipelining involves dividing the execution of an instruction into several steps, each of which could run in parallel with other steps. This allows for much higher throughput than a CPU, which typically executes instructions sequentially.
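The throughput arithmetic, with hypothetical stage counts:

```python
S, t_ns = 16, 1.0  # hypothetical: 16 pipeline stages, 1 ns per stage

sequential_rate = 1 / (S * t_ns)  # packets/ns if stages run one after another
pipelined_rate = 1 / t_ns         # packets/ns once the pipeline is full

print(pipelined_rate / sequential_rate)  # 16.0x: one packet completes per stage time
```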
Note: RMT was developed into a commercial product by Barefoot Networks, which was acquired by Intel in 2019. Their commercial product is the Tofino switches.
Hypergiants
Microservices
Microservices are an alternative to monolithic applications: the application is divided into multiple services that communicate over the network through remote procedure calls (RPCs).
Advantages of microservices:
- Independent development
- Independent deployment and scaling
- Fault isolation
Advantages of monoliths:
- Lower network and serialization overhead
- Tight coupling (no partial failures; easier to reason about end-to-end)
In addition to application logic, microservices also need to do:
- Authentication
- Tracing
- Service discovery
- Connection management
Because this code tends to be duplicated across microservices, it is often instead handled automatically by a service mesh.
There are three main styles of service mesh: (1) sidecar proxies, (2) remote proxies, and (3) libraries.
Advantages and disadvantages of sidecars:
- Low communication overhead (only between proxies)
- Don't need to change code
Advantages and disadvantages of remote proxies:
- Cross-region
- Need to change code
- High communication overhead (between service and proxy)
Advantages and disadvantages of libraries:
- Low communication overhead (only between services)
- Need to change code
Wireless
RFocus
RFocus: Beamforming Using Thousands of Passive Antennas
Problem: A transmitter can direct a more powerful signal beam to a receiver by beamforming with antenna arrays. However, there are physical limits on the number of antennas we can fit on a device.
The idea of RFocus is to increase effective antenna count by moving antennas into the environment. The RFocus surface consists of 3,200 passive elements, each of which is configured as either reflective or transparent. The receiver sends measurements of received signal strength to the RFocus controller, which then configures the surface.
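A sketch of a majority-vote style configuration search consistent with the paper's approach (measure_rss is a hypothetical stand-in for the receiver's reports):

```python
import random

N_ELEMENTS, N_PROBES = 3200, 1000

def optimize_surface(measure_rss) -> list[bool]:
    """Probe random on/off configurations, then keep each element in
    whichever state correlated with higher received signal strength."""
    on_rss = [[] for _ in range(N_ELEMENTS)]
    off_rss = [[] for _ in range(N_ELEMENTS)]
    for _ in range(N_PROBES):
        config = [random.random() < 0.5 for _ in range(N_ELEMENTS)]
        rss = measure_rss(config)  # receiver reports one scalar per probe
        for i, on in enumerate(config):
            (on_rss if on else off_rss)[i].append(rss)
    avg = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return [avg(on_rss[i]) > avg(off_rss[i]) for i in range(N_ELEMENTS)]
```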
Cellular Networks
Cellular networks today are divided into:
- Radio Access Network (RAN), which consists of radio towers that connect to user equipment (UE)
- Mobile core, which connects the RAN to the rest of the Internet
When a UE wants to connect to a mobile network, it performs a cell selection procedure. It first tunes to the appropriate frequency. Then it chooses a tower to connect to based on its home operator's Public Land Mobile Network (PLMN) identifier, stored in its Subscriber Identity Module (SIM) card, as well as signal strength.
Once it chooses a tower, it synchronizes with the broadcast control channel and attaches to the network, which authenticates the UE and sets up its connections. Later, when the UE moves around in the physical world, it may need to be handed over to an adjacent tower.
When a UE enters an area where its home operator does not have infrastructure, it attempts to roam by attaching onto an available tower. However, this attachment only succeeds if the visited operator has a roaming agreement with the UE's home operator.
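A sketch of selection with roaming (PLMN IDs, towers, and signal values are invented for illustration):

```python
HOME_PLMN = "310-260"
ROAMING_PARTNERS = {"310-410", "311-480"}  # operators with roaming agreements

towers = [
    {"id": "A", "plmn": "310-260", "signal_dbm": -95},
    {"id": "B", "plmn": "310-410", "signal_dbm": -70},
]

def select_cell(towers: list[dict]) -> dict | None:
    """Prefer the home operator's towers; otherwise roam onto a partner."""
    home = [t for t in towers if t["plmn"] == HOME_PLMN]
    if home:
        return max(home, key=lambda t: t["signal_dbm"])
    visited = [t for t in towers if t["plmn"] in ROAMING_PARTNERS]
    return max(visited, key=lambda t: t["signal_dbm"]) if visited else None

print(select_cell(towers)["id"])  # "A": home network wins despite weaker signal
```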
There are multiple generations of cellular networks, notably 4G (LTE) and 5G.
RinP
Problem: Cellular bandwidth demand continues to grow exponentially. Cellular operators have traditionally met this demand by deploying more base stations in an area (densification). Users have typically enjoyed data rates that double every two years, a scaling trend called Cooper's law. However, denser deployments will lead to increased interference, eventually leading to the end of Cooper's law. The problem remains: how can we continue scaling bandwidth in high-demand areas?
The authors of this paper propose to have users connect to operators in overlapping coverage areas in order to dynamically share capacity across operators, known as roaming-in-place (RinP).