Cloudflare: What was behind the latest global internet outage?

3 Jul 2019

Miniature engineer figures trying to untangle a knot of multicoloured ethernet cables.

Image: © kirill_makarov/Stock.adobe.com

Cloudflare, the backbone of many of the web’s biggest sites, experienced a global outage that left many wondering what could have happened.

The fragility of the internet was exposed yesterday (2 July) when users across the world came across many websites displaying the error message ‘502 Bad Gateway’. Shortly after, social media was flooded with questions as to what caused such an outage across seemingly unconnected sites.

Cloudflare, a content delivery and DDoS protection provider, later said an error on its part was behind the massive outage. A quick look at the company’s systems status page showed that almost every major city in the world was affected in some way, including Dublin.

Twenty-three minutes after Cloudflare confirmed that it was experiencing issues, it announced that it had “implemented a fix”. Thirty-five minutes later, it revealed the cause of the outage.

“We saw a massive spike in CPU that caused primary and secondary systems to fall over,” a statement said. “We shut down the process that was causing the CPU spike. Service restored to normal within ~30 minutes.”

Soon after, it announced that normal operations had resumed. So what could have caused such a major outage so soon after another one that occurred on 24 June?

Testing processes were ‘insufficient in this case’

In a blogpost, Cloudflare CTO John Graham-Cumming revealed that the CPU spike was the result of a “bad software deploy that was rolled back”. He stressed that the outage was not the result of a DDoS attack.

“The cause of this outage was deployment of a single misconfigured rule within the Cloudflare Web Application Firewall (WAF) during a routine deployment of new Cloudflare WAF managed rules,” Graham-Cumming said.

“We make software deployments constantly across the network and have automated systems to run test suites, and a procedure for deploying progressively to prevent incidents. Unfortunately, these WAF rules were deployed globally in one go and caused today’s outage.”

He went on to admit that such an outage was “very painful” for customers and that the company’s testing processes were “insufficient in this case”.
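For readers unfamiliar with the idea, the kind of progressive deployment Graham-Cumming describes can be sketched roughly as follows. This is a minimal illustration only, not Cloudflare’s actual tooling: the function name, the stage fractions and the `deploy`, `is_healthy` and `rollback` callbacks are all hypothetical. The change goes to a small share of locations first, health is checked, and the rollout only widens if nothing has gone wrong.

```python
from typing import Callable, Sequence


def progressive_rollout(
    locations: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    stage_fractions: Sequence[float] = (0.01, 0.1, 0.5, 1.0),  # hypothetical stages
) -> bool:
    """Deploy in widening stages; undo everything if a health check fails."""
    deployed = []
    for fraction in stage_fractions:
        # Cumulative number of locations that should have the change after this stage.
        target = max(1, int(len(locations) * fraction))
        for location in locations[len(deployed):target]:
            deploy(location)
            deployed.append(location)
        # Check every location touched so far before widening the rollout.
        if not all(is_healthy(loc) for loc in deployed):
            for loc in reversed(deployed):
                rollback(loc)
            return False
    return True
```

In a scheme like this, a faulty rule would only reach the locations in the stage where it was first detected, rather than the whole network at once.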

This outage was different to the one that occurred on 24 June, which Cloudflare described as the internet having “a small heart attack”. It was revealed that network provider Verizon directed a significant portion of the internet’s traffic to a small company in the US state of Pennsylvania, resulting in a major information pile-up.