I spent a year at Northwestern sending packets across the globe and watching what happened. This sounds like a weird hobby, but it's actually serious research. Understanding how the internet works at a physical and routing level turns out to be surprisingly useful when you're building distributed systems for production use.
Here's what I learned.
The Internet is a Bunch of Agreements
The first thing that surprised me is how informal the internet's structure really is. There's no master plan. No central authority. Just thousands of organizations that have agreed to connect to each other and pass traffic around.
These agreements are called "peering relationships." Some are formal contracts between large providers. Some are handshake deals. Some are implicit, based on physical proximity in data centers. The internet works because everyone generally cooperates, not because anyone's in charge.
This has real implications for distributed systems. When you deploy services across multiple cloud regions, you're relying on these peering relationships. If AWS and Azure decide to route traffic through different paths, your inter-region latency changes. And you have essentially no control over it.
Geography Still Matters
We like to think of the internet as this abstract cloud where everything is instantaneous. It's not. Light travels fast, but not infinitely fast. And the cables that carry internet traffic follow real geographic paths.
I ran measurements from 33 cloud instances across 18 countries. One thing became very clear: the path packets take often makes no sense geographically. Traffic from Australia to Indonesia might route through the US. Traffic within Europe might go through a submarine cable to the US and back.
Why? Because routing decisions are made based on business agreements, not geography. ISP A might not have a peering deal with ISP B, so traffic goes through ISP C even if that adds 50ms of latency.
The lesson for distributed systems: don't assume that "closer" means "faster." Test your actual latency between regions. You might be surprised.
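Here's a minimal sketch of what that testing can look like, assuming each region exposes some TCP endpoint you can reach (the hostnames below are placeholders, not real services): median TCP connect time is a rough stand-in for network RTT.

```python
import socket
import statistics
import time

# Placeholder endpoints: substitute whatever your own regions actually expose.
ENDPOINTS = {
    "us-east": ("service.us-east.example.com", 443),
    "eu-west": ("service.eu-west.example.com", 443),
    "ap-southeast": ("service.ap-southeast.example.com", 443),
}

def tcp_connect_ms(host, port, samples=5, timeout=5.0):
    """Median TCP connect time in ms, a rough proxy for network RTT."""
    times = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=timeout):
            pass
        times.append((time.perf_counter() - start) * 1000)
    return statistics.median(times)

if __name__ == "__main__":
    for region, (host, port) in ENDPOINTS.items():
        try:
            print(f"{region}: {tcp_connect_ms(host, port):.1f} ms")
        except OSError as exc:
            print(f"{region}: unreachable ({exc})")
```

Run it from each region toward the others and compare the numbers against what the map would suggest. The mismatches are where the interesting routing decisions live.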
Submarine Cables Are Everything
About 99% of intercontinental internet traffic goes through submarine cables. These are literal cables on the ocean floor. Some have been in service for decades; the newest fiber-optic systems carry terabits per second of capacity.
Submarine cables fail more often than you'd think. Ships drop anchors on them. Earthquakes damage them. Sometimes they just break. When a major cable goes down, traffic reroutes, latency spikes, and sometimes connections just fail.
I built a monitoring system to track submarine cable outages in real-time. The goal was to understand how the internet's physical infrastructure affects connectivity. What we found was that the internet is surprisingly resilient to individual failures, but major cable cuts do cause real problems.
For production systems, this means: if you're serving global users, understand the physical paths between your regions. Have redundancy. Monitor for connectivity issues at the network level, not just the application level.
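One lightweight network-level check, sketched below, is to record the forward path to each peer region with the system traceroute and flag when it changes; a sudden path change is often the first visible sign that traffic has been rerouted around a fault. This assumes a Linux host with the traceroute utility installed, and the target hostname is a placeholder. It's an illustration, not the monitoring system I actually built.

```python
import subprocess

def forward_path(host):
    """Run the system traceroute and return the list of hop addresses."""
    result = subprocess.run(
        ["traceroute", "-n", "-w", "2", host],
        capture_output=True, text=True, timeout=120,
    )
    hops = []
    for line in result.stdout.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) >= 2:
            hops.append(fields[1])  # hop address, or '*' if it timed out
    return hops

if __name__ == "__main__":
    target = "peer-region.example.com"  # placeholder
    baseline = forward_path(target)
    # ...later, from a cron job or polling loop:
    current = forward_path(target)
    if current != baseline:
        print("forward path changed:", " -> ".join(current))
```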
BGP is Both Amazing and Terrifying
BGP (Border Gateway Protocol) is the protocol that makes the internet work. It's how networks tell each other "I can reach this destination, send traffic to me." Routers at the borders between networks run BGP and exchange routing information with their neighbors.
BGP is amazing because it works at all. There are tens of thousands of independently operated networks, and somehow traffic finds its way from source to destination most of the time. It's a genuinely impressive feat of decentralized coordination.
BGP is terrifying because it's based almost entirely on trust. When a network announces "I can reach Google," other networks generally believe it. There's essentially no authentication built in (RPKI exists, but deployment is far from universal). This leads to incidents where someone misconfigures their router and accidentally claims to be the best path to half the internet. Traffic gets sucked into their network, overloading it and breaking connectivity for millions of users.
These are called "BGP hijacks" and they happen more often than you'd like. Sometimes they're accidents. Sometimes they're malicious. Either way, there's not much you can do about them from an application level.
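Part of why a bogus announcement wins is longest-prefix matching: routers forward toward the most specific prefix that covers a destination, so a misconfigured (or malicious) more-specific route beats the legitimate one. A toy illustration, using documentation prefixes and private-use AS numbers rather than anything real:

```python
import ipaddress

# Toy routing table. The prefixes are documentation space and the AS numbers
# are from the private-use range; this only illustrates longest-prefix match.
routes = {
    ipaddress.ip_network("203.0.113.0/24"): "AS64500 (legitimate origin)",
    ipaddress.ip_network("203.0.113.0/25"): "AS64512 (misconfigured origin)",
}

def best_route(destination):
    """Pick the matching prefix with the longest mask, the way routers do."""
    dest = ipaddress.ip_address(destination)
    matches = [prefix for prefix in routes if dest in prefix]
    winner = max(matches, key=lambda prefix: prefix.prefixlen)
    return f"{destination} -> {winner} via {routes[winner]}"

# The bogus but more-specific /25 wins, so traffic for this address gets
# pulled into the misconfigured network.
print(best_route("203.0.113.10"))
```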
The takeaway: the internet's routing layer is inherently unreliable. Build your applications to handle intermittent connectivity issues. Don't assume that just because DNS resolves and you get an IP address, packets will actually reach their destination.
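In practice that means checking reachability explicitly instead of stopping at name resolution. A small sketch (the hostname and port are placeholders):

```python
import socket

def check_endpoint(host, port=443, timeout=3.0):
    """Distinguish 'the name resolves' from 'packets actually get there'."""
    try:
        # Step 1: DNS resolution. Success here proves nothing about reachability.
        infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        return f"DNS failed: {exc}"

    addr = infos[0][4]
    try:
        # Step 2: actually push packets along the path with a TCP handshake.
        with socket.create_connection(addr[:2], timeout=timeout):
            return f"reachable at {addr[0]}"
    except OSError as exc:
        return f"resolved to {addr[0]} but unreachable: {exc}"

if __name__ == "__main__":
    print(check_endpoint("example.com"))  # placeholder host
```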
Measurement is Harder Than It Sounds
When you're running measurements at scale, you discover all kinds of problems:
- Time synchronization across 33 instances in 18 countries is hard
- Network conditions change constantly, so point-in-time measurements can be misleading
- Cloud providers sometimes route your traffic through unexpected paths
- Some destinations rate-limit or block measurement traffic
- Results can be skewed by local conditions (overloaded host, noisy neighbor, etc.)
I ended up processing a continuous stream of measurements from those 33 instances, every day. That's a lot of data, and a lot of ways for it to be wrong. The data pipeline I built had to handle missing data, outliers, time zone issues, and constantly changing network conditions.
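As one small example of the kind of cleaning that pipeline needed (a sketch, not the actual pipeline code), a median-absolute-deviation filter handles both dropped probes and the occasional absurd reading:

```python
import statistics

def clean_rtts(samples, k=3.0):
    """Drop missing probes, then drop readings far from the median.

    Uses median absolute deviation (MAD), which tolerates the occasional
    wild value much better than a mean/stddev cutoff does.
    """
    present = [s for s in samples if s is not None]
    if len(present) < 3:
        return present  # not enough data to judge outliers
    med = statistics.median(present)
    mad = statistics.median(abs(s - med) for s in present)
    if mad == 0:
        return present
    return [s for s in present if abs(s - med) / mad <= k]

# A dropped probe (None) and one retransmission-inflated RTT get filtered out:
print(clean_rtts([42.1, 43.0, None, 41.8, 880.0, 42.5]))
```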
This experience made me much better at building observability systems. When you've dealt with unreliable measurement data at internet scale, handling application metrics feels almost easy.
How This Applies to Production Systems
You might be wondering: why does any of this matter for normal software engineering? Most of us aren't building internet measurement platforms.
Here's the thing: if you're building distributed systems, you're relying on the internet. And the internet is way more complicated and unreliable than most engineers realize. Understanding how it actually works helps you build more resilient systems.
Specific lessons:
- Don't trust latency assumptions. Measure actual latency between your components. Network conditions change, and what was true last month might not be true today.
- Build for failure. Any network path can fail at any time. Have fallbacks. Use circuit breakers (a minimal sketch follows this list). Assume that remote services will be unavailable sometimes.
- Understand your dependencies. Know which cloud regions your services run in. Know how traffic flows between them. Know what happens when a particular path goes down.
- Monitor at multiple layers. Application metrics are important, but so are network metrics. High latency or packet loss at the network layer will manifest as weird application behavior.
- Geographic distribution helps. Running services in multiple regions isn't just about being closer to users. It's about having redundancy when network paths fail.
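To make the circuit-breaker bullet concrete, here is a minimal sketch, not tied to any particular framework: after repeated failures it stops calling the remote service for a cooldown period, then lets a single probe call through to see whether the path has recovered.

```python
import time

class CircuitBreaker:
    """Trip after repeated failures, then allow a single probe call after a
    cooldown to check whether the remote service (or the path to it) recovered."""

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: skipping remote call")
            self.opened_at = None  # half-open: let one probe call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

Real implementations add per-endpoint state, jittered cooldowns, and metrics, but the core idea is just this: stop hammering a path you already know is broken.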
What I'm Doing With This Now
I moved from research to industry after Northwestern, joining Expedia. The work is different but the lessons apply directly. When we're migrating millions of requests between services, understanding network behavior helps predict what might go wrong.
I also find myself thinking about scale differently. When you've seen how much effort goes into keeping the global internet running, you appreciate the complexity of any large distributed system. It's not magic. It's thousands of small decisions and optimizations and failure modes that people have figured out over time.
The internet is fragile and robust at the same time. Understanding how it actually works made me a better engineer. If you get the chance to dig into the network layer, take it. You'll learn things that make your application-level work make a lot more sense.