Cloudflare Outage Shouldn’t Prompt ‘Knee-Jerk Decisions’: Gartner
While the three-hour outage Tuesday points to the need for strengthening resilience within current providers, more dramatic moves are not worthwhile for short-duration outages, according to analysts at the research firm.
While the widely felt Cloudflare network services outage Tuesday points to the need for strengthening resilience within current cloud providers, introducing redundant services is unlikely to be worthwhile for short-lived outages, according to an analysis from research firm Gartner.
The three-hour outage Tuesday took numerous popular websites offline for many users, including OpenAI’s ChatGPT, X.com and e-commerce platform Shopify, and disrupted transportation services such as the New Jersey Transit system and Uber.
The global network outage was caused by a database permissions change and was not the result of a cyberattack, Cloudflare co-founder and CEO Matthew Prince disclosed in a post Tuesday.
In a Gartner analysis Tuesday, the research firm urged IT infrastructure and security leaders to “resist overreactions” to the outage.
It’s certainly prudent to explore ways to improve resilience within existing service providers, but “don’t make knee-jerk decisions to partition applications or providers,” the report from multiple Gartner analysts said.
“Reactive responses to a single outage, such as adding multicloud or redundant architectures, often introduce unnecessary complexity and cost without significantly improving resilience for short-duration incidents,” the analysts wrote.
Still, in the case of critical applications, architecting for fail-over between providers “may be possible”—though it will come with high costs and limitations on the services that can be consumed, the Gartner analysts wrote.
The best course of action is to “prioritize resilience, not redundancy everywhere,” the analysts said. “Apply diversification sparingly and only for critical systems where downtime has material business impact.”
Cloudflare Pledges Changes
While Cloudflare “initially wrongly suspected the symptoms we were seeing were caused by a hyper-scale DDoS attack,” the vendor was soon able to correctly identify the issue, Prince wrote in the post Tuesday.
The error stemmed from a database permissions change in the vendor’s ClickHouse cluster, which unexpectedly doubled the size of a configuration file used by its Bot Management service, he wrote. The oversized file then propagated throughout Cloudflare’s network.
Because the configuration file exceeded the size limit imposed by Cloudflare’s software, “that caused the software to fail,” Prince wrote.
The incident ended up being the “worst” outage since 2019 for Cloudflare, he noted.
Going forward, Cloudflare will implement improvements to keep these types of failures from recurring, Prince wrote. The planned improvements include hardening the ingestion of configuration files generated by the vendor itself and enabling a greater number of “kill switches” for features, he wrote.
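To make those two ideas concrete, the following is a minimal Python sketch of what validated ingestion and a per-feature kill switch could look like in general. It is not Cloudflare's code: the `BotModule` class, the `MAX_FEATURES` limit and the scoring logic are all hypothetical, chosen only to illustrate rejecting an oversized generated file while continuing to serve with the last known-good one.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

MAX_FEATURES = 200  # hypothetical hard limit on entries in a generated file


@dataclass
class BotModule:
    """Toy consumer of a generated configuration file (all names hypothetical)."""

    enabled: bool = True  # kill switch: set to False to bypass the module
    features: List[str] = field(default_factory=list)  # last known-good config

    def ingest(self, candidate: List[str]) -> bool:
        """Validate a newly generated file before adopting it.

        Returns False and keeps the previous configuration if the file is
        empty or exceeds the size limit, instead of raising and failing.
        """
        if not candidate or len(candidate) > MAX_FEATURES:
            return False
        self.features = list(candidate)
        return True

    def score(self, request_features: Set[str]) -> Optional[int]:
        """Count matching features, or return None when the module is disabled."""
        if not self.enabled:
            return None
        return sum(1 for name in self.features if name in request_features)


if __name__ == "__main__":
    module = BotModule()
    module.ingest(["ua_entropy", "tls_fingerprint"])

    # An oversized file is rejected; the earlier configuration stays active.
    oversized = ["feature_%d" % i for i in range(MAX_FEATURES * 2)]
    print(module.ingest(oversized))
    print(len(module.features))

    # Flipping the kill switch disables scoring without a redeploy.
    module.enabled = False
    print(module.score({"ua_entropy"}))
```

The design choice in this sketch is that a bad input degrades one feature rather than taking down the process that hosts it. Real systems would also need alerting when a file is rejected.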
Ultimately, “an outage like today is unacceptable,” he wrote. “On behalf of the entire team at Cloudflare, I would like to apologize for the pain we caused the Internet today.”