Early Tuesday morning, large portions of the web sputtered out for about an hour. The downed sites shared no obvious theme or geography; the outages were global, and they hit everything from Reddit to Spotify to The New York Times. (And yes, also WIRED.) In fact, the only thing they had in common was Fastly, a content-delivery network (CDN) provider whose predawn hiccup reverberated across the internet.
You may not have heard of Fastly, but you likely interact with it in some fashion every time you go online. Along with Cloudflare and Akamai, it’s one of the biggest CDN providers in the world. And while Fastly has been vague about what specific glitch caused Tuesday’s worldwide disruptions, the incident offers a stark reminder of how fragile and interconnected internet infrastructure can be, especially when so much of it hinges on a handful of companies that operate largely outside of public awareness.
To understand how a Fastly problem can quickly become everyone’s problem, it’s worth spending a minute on the role CDNs play in the internet ecosystem. While it’s tempting to think of the internet as amorphous—they even call it “the cloud”—the articles you read, the movies and songs you stream, the photos you post, they all live on physical servers. And while that content might be primarily hosted on a cloud provider, you still need a way to get it to people quickly and efficiently.
That’s where a CDN comes in. By operating servers around the globe, CDNs can whittle down the distance between your smartphone and the internet experience of your choice. Think of it as the internet’s equivalent of a relay man in baseball: Rather than try to throw the ball to home plate on their own, an outfielder will instead toss it to an infielder, who in turn fires it to the catcher. It’s faster and more efficient.
“It basically enables really high performance for content, whether that’s streaming video or a site or all the little images that pop up when you go to an ecommerce site,” says Angelique Medina, director of product marketing at the network monitoring firm ThousandEyes. “Serving it really close to the user takes away a lot of the load time, and it enables everyone to have a really great experience when they’re surfing the web.”
Take this article that you’re reading right now. Chances are you’re reading a copy of it held in the cache of what's known as a “point of presence,” a server somewhere in your region. A Fastly network map indicates that the company operates POPs in at least 58 cities around the world, including multiples in densely populated areas like Los Angeles, London, and Singapore. It lists their combined global capacity at a whopping 130 terabits per second.
Global users attempting to reach Reddit.com, served by Fastly's CDN service. Courtesy of ThousandEyes
And that’s not all! CDNs don’t just store content closer to the devices that crave it. They also help direct it across the internet. “It is like orchestrating traffic flow on a massive road system,” says Ramesh Sitaraman, a computer scientist at the University of Massachusetts at Amherst who helped create the first major CDN as a principal architect at Akamai. “If some link on the internet fails or gets congested, CDN algorithms quickly find an alternate route to the destination.”
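To make the rerouting idea concrete, here is a minimal sketch in Python. It is not how Akamai, Fastly, or any other CDN actually routes traffic; those systems are proprietary and far more elaborate. It simply uses Dijkstra's shortest-path algorithm as a stand-in, and the node names and latencies are made up for illustration.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra's algorithm; returns (cost, path), or (inf, []) if unreachable."""
    queue = [(0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, latency in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + latency, neighbor, path + [neighbor]))
    return float("inf"), []


def without_link(graph, a, b):
    """Return a copy of the graph with the a-b link removed in both directions."""
    copy = {node: dict(edges) for node, edges in graph.items()}
    copy.get(a, {}).pop(b, None)
    copy.get(b, {}).pop(a, None)
    return copy


# Hypothetical latencies in milliseconds between made-up network nodes.
network = {
    "origin": {"hub_east": 10, "hub_west": 25},
    "hub_east": {"origin": 10, "user": 5},
    "hub_west": {"origin": 25, "user": 8},
    "user": {"hub_east": 5, "hub_west": 8},
}

# With every link up, traffic takes the faster eastern route.
normal = shortest_path(network, "origin", "user")

# Simulate a failed link and look for an alternate route.
rerouted = shortest_path(without_link(network, "hub_east", "user"), "origin", "user")

print(normal)
print(rerouted)
```

When the eastern link fails, the search settles on the slower western route rather than giving up, which is the behavior Sitaraman describes, in miniature.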
So you can start to see how when a CDN goes down, it can take heaping portions of the internet along with it. Although that alone doesn’t quite explain how the impacts on Tuesday were so far-reaching, especially when there are so many redundancies built into these systems. Or at least, there should be.
Again, it’s not clear exactly what happened at Fastly. “We identified a service configuration that triggered disruptions across our POPs globally and have disabled that configuration,” a company spokesperson said in a statement. “Our global network is coming back online.”
“Service configuration” can mean any number of things; the only certainty is that whatever the root cause, it had wide-ranging effects. According to Fastly’s incident report page, every continent other than Antarctica felt the impact. Even after Fastly had fixed the underlying issue, it cautioned that users could still see a lower “cache hit ratio”—how often you can find the content you’re looking for already stored in a nearby server—and “increased origin load,” which refers to the process of going back to the source for items not in the cache. In other words, the cupboards are still fairly bare.
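A toy model shows why a freshly restored cache drags down the hit ratio and pushes work back to the origin. The Python sketch below is an illustration only: the `EdgeCache` class, the `fetch_from_origin` placeholder, and the request list are all hypothetical, and a real point of presence is far more sophisticated than a small least-recently-used cache.

```python
from collections import OrderedDict


def fetch_from_origin(url):
    # Placeholder for a real request to the site's own servers.
    return f"content of {url}"


class EdgeCache:
    """Toy least-recently-used cache standing in for one point of presence."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()
        self.hits = 0
        self.origin_fetches = 0

    def get(self, url):
        if url in self.store:
            self.store.move_to_end(url)  # mark as recently used
            self.hits += 1
            return self.store[url]
        # Cache miss: go back to the origin, which adds origin load.
        self.origin_fetches += 1
        body = fetch_from_origin(url)
        self.store[url] = body
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict the least recently used item
        return body


def replay(cache, urls):
    """Replay requests; return (hit_ratio, origin_fetches) for this pass only."""
    hits_before, fetches_before = cache.hits, cache.origin_fetches
    for url in urls:
        cache.get(url)
    hits = cache.hits - hits_before
    fetches = cache.origin_fetches - fetches_before
    total = hits + fetches
    return (hits / total if total else 0.0), fetches


requests = ["/home", "/story", "/home", "/home", "/story", "/logo.png"]

cache = EdgeCache(capacity=10)
cold = replay(cache, requests)  # empty cache, as after an outage
warm = replay(cache, requests)  # same traffic once the cache has refilled

print(cold)
print(warm)
```

On the first pass, half of the requests miss and have to be fetched from the origin. On the second pass, with the cupboards restocked, every request is served from the cache and the origin sees no traffic at all.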
That an outage occurred is surprising, given that CDNs are typically designed to weather these tempests. “In principle, there is massive redundancy,” says Sitaraman, speaking about CDNs generally. “If a server fails, other servers could take over the load. If an entire data center fails, the load can be moved to other data centers. If things worked perfectly, you could have many network outages, data center problems, and server failures; the CDN’s resiliency mechanisms would ensure that the users never see the degradation.”
When things do go wrong, Sitaraman says, it typically relates to a software bug or configuration error that gets pushed to multiple servers at once.
Even then, the sites and services that employ CDNs typically have their own redundancies in place. Or at least, they should. In fact, you could see hints of how diversified various services are in the speed of their response this morning, says Medina. It took Amazon about 20 minutes to get back up and running, because it could divert traffic to other CDN providers. Anyone who relied solely on Fastly, or who didn’t have automated systems in place to accommodate the disruption, had to wait it out.
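What an automated system of that kind might look like, at its very simplest, is sketched below in Python. This is an assumption-laden illustration, not a description of what Amazon or anyone else actually runs: the `CdnProvider` class, the `pick_provider` function, and the provider names are hypothetical, and the health checks are simulated rather than real network probes.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CdnProvider:
    name: str
    is_healthy: Callable[[], bool]  # hypothetical health probe


def pick_provider(providers: List[CdnProvider]) -> Optional[CdnProvider]:
    """Return the first provider, in priority order, whose health check passes."""
    for provider in providers:
        try:
            if provider.is_healthy():
                return provider
        except Exception:
            # A probe that errors out is treated the same as a failed check.
            continue
    return None


primary = CdnProvider("cdn-a", is_healthy=lambda: False)  # simulated outage
backup = CdnProvider("cdn-b", is_healthy=lambda: True)

# A site with two providers can divert traffic to the healthy one.
chosen = pick_provider([primary, backup])

# A site that relies on a single provider has nowhere to go.
only_primary = pick_provider([primary])

print(chosen.name if chosen else "no provider available")
print(only_primary.name if only_primary else "no provider available")
```

In practice the switch usually happens at the DNS or load-balancing layer and takes time to propagate, which helps explain why even well-prepared sites took minutes, not seconds, to recover.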
“The outage was the result of monoculture,” says Roland Dobbins, principal engineer of security firm Netscout. He suggests that every organization with a substantial online presence should have multiple CDN providers to avoid precisely this sort of situation.
Their options, though, are increasingly limited. Just as the cloud has largely been subsumed by Amazon, Google, and Microsoft, three CDN providers—Cloudflare, Akamai, and Fastly—dominate the flow of content online. “There’s a lot of concentration of usage within very few service providers,” Medina says. “Whenever any one of those three providers has an issue, typically it’s not something that lasts a very long time, but it has a major impact across the internet.”
That’s a big part, Medina says, of why these sorts of outages have been more frequent of late, and why they’ll only continue to get worse. Baseball needs a cutoff man; intersections need traffic cops. The fewer of those there are to rely on, the more connections get missed, and the bigger the crashes.
Additional reporting by Lily Hay Newman.