On the morning of November 18, 2025, Cloudflare – the content delivery network (CDN) and cybersecurity company that powers roughly a fifth of all web traffic – suffered a sudden, global service disruption. Major websites and apps such as ChatGPT, X (formerly Twitter), Spotify, Zoom, Canva, and many others went offline or showed error pages for several hours. Users around the world were greeted with Cloudflare’s “Internal server error (Error code 500)” page (see image below) instead of the expected content.
Thousands of reports flooded outage-tracking sites like Downdetector, which recorded over 11,000 incident reports at the peak of the outage. In short, this was a major global outage that disrupted sites and services for developers, businesses, and everyday users alike.
Cloudflare quickly reassured the public that the root cause was an internal software bug, not a DDoS attack or hack. Within a little over three hours, Cloudflare engineers had isolated the problem and restored most core traffic, with full recovery following later that day, but the incident laid bare how a single configuration error can ripple across the internet. In this deep-dive, we examine what happened, why it happened, and how Cloudflare resolved the issue, drawing on official postmortems and contemporary news reports for every key detail.
Outage Timeline and Impact
According to Cloudflare’s own incident report, the problems began at 11:20 UTC (6:20 a.m. ET) on November 18, 2025. At that time, Cloudflare’s network suddenly started failing to deliver core traffic. End users began encountering the classic “Internal server error” pages instead of website content. Monitoring dashboards showed a massive spike in HTTP 5xx errors served by Cloudflare’s global proxies, an indication of internal server failures, that peaked at over 30 million error requests per second before gradually subsiding.
Evidence from multiple sources pinpoints the sequence of events:
- ~11:20–11:30 UTC: The outage starts. Cloudflare’s network begins to fail in waves roughly every five minutes, alternating between partial recovery and failure. Monitoring services (like Cisco ThousandEyes) observed timeouts and 5xx errors on Cloudflare’s edge servers globally. Many users immediately reported downtime on social media and on trackers like Downdetector. Cloudflare’s status page, ironically hosted off-platform, went down as well, causing initial confusion. Engineers initially suspected some kind of DDoS or cyberattack, since the errors looked like a massive traffic surge.
- Late morning: Cloudflare’s teams investigate. The failure pattern was odd: every five minutes the network would recover briefly and then fail again. Eventually engineers discovered that an internal configuration file was causing the issue (see below). During this period, thousands more outage reports poured in. For example, Downdetector hit a peak of 11,145 reported Cloudflare issues around 14:19 GMT (as shown in outage trackers). By late morning, the outage was clearly broadening: sites like Canva, X, Grindr and ChatGPT were confirmed down or degraded.
- 13:05 UTC: Cloudflare rolls back a preliminary change. Cloudflare’s timeline notes that some internal rollbacks started around 13:05 and helped stabilize one aspect of the system. In particular, authentication (Access) systems were failing for users and a rollback was initiated. However, the core HTTP 5xx errors continued.
- 14:30 UTC: Root cause identified and partial fix applied. Cloudflare engineers identified that a recently updated configuration file (for their Bot Management system) was corrupt or oversized. They stopped the generation and propagation of the bad file and manually reverted to a known good version. After reloading the good file and restarting the core proxies, the volume of errors immediately dropped. By 14:30 UTC most core traffic was flowing again.
- 14:30–17:06 UTC: Cleanup and full recovery. With the main problem fixed, Cloudflare’s team still had to bring every system back up. Automated systems that had crashed were restarted, and the surge of returning traffic was managed. By 17:06 UTC, Cloudflare reported that all systems had returned to normal. The storm of 5xx errors had subsided to baseline levels. Cloudflare continued monitoring for late errors, but effectively the crisis was over.
In summary, core traffic was largely restored a little over three hours after the first failure (11:20 to 14:30 UTC), and full recovery took just under six hours (17:06 UTC). It was not caused by external attackers; the ultimate culprit was a misconfiguration in an internal system. But because Cloudflare sits in front of so many sites, the effect was global. The charts below illustrate the scale: they show error volumes skyrocketing once the faulty file hit the network, then dropping after the fix.
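The durations implied by the timeline can be checked with simple arithmetic. The short sketch below only uses the three timestamps quoted above (11:20, 14:30 and 17:06 UTC); the variable and function names are illustrative.

```python
from datetime import datetime, timedelta

# Timestamps (UTC) taken from the timeline above.
FMT = "%Y-%m-%d %H:%M"
first_failure = datetime.strptime("2025-11-18 11:20", FMT)
core_restored = datetime.strptime("2025-11-18 14:30", FMT)
full_recovery = datetime.strptime("2025-11-18 17:06", FMT)


def fmt_duration(delta: timedelta) -> str:
    """Render a timedelta as 'Xh YYm'."""
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}h {minutes % 60:02d}m"


time_to_core = core_restored - first_failure
time_to_full = full_recovery - first_failure

print("first failure -> core traffic restored:", fmt_duration(time_to_core))
print("first failure -> full recovery:        ", fmt_duration(time_to_full))
```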
Who and What Was Affected?
The scope of the outage was vast. Any website, API, or app that relied on Cloudflare’s CDN or DDoS protection could show errors. Some of the high-profile services that were disrupted included:
- Social media and AI platforms: Users of X (Twitter) and ChatGPT (OpenAI) reported they couldn’t reach these sites or got error pages. (Ironically, Downdetector itself was briefly down since it also used Cloudflare.)
- Media and entertainment: Spotify and other music-streaming apps experienced downtime for their U.S. user base.
- Business and productivity: Work-from-home tools such as Zoom, and content-creation sites like Canva, were inaccessible for many users.
- Cloudflare services: Even parts of Cloudflare’s own platform were hit. Its Turnstile authentication widget failed to load, the Workers KV key-value store showed elevated errors, and parts of the Cloudflare dashboard were effectively unusable (since login used Turnstile). Cloudflare’s own status page also briefly showed errors, apparently by coincidence.
- Generic websites: Any customer website behind Cloudflare’s CDN saw 5xx error pages (like the one shown above) instead of their normal content. Because Cloudflare proxies traffic, many otherwise unrelated sites all went down at once.
As the Guardian notes, this was “a key piece of the internet’s usually hidden infrastructure” going offline. A security expert quoted in the press called Cloudflare a “gatekeeper” of the internet, and noted that failures at such an infrastructure giant quickly become very visible. Downdetector’s peak of ~11,000 reports shows that large numbers of end users worldwide were suddenly unable to reach countless websites and online services.
Cloudflare’s stock price also dipped on the news – as Reuters observed, shares fell 2.3% that morning – reflecting investor concern about the reliability of Cloudflare’s global network.
Root Cause: The Bot Management Configuration Bug
The official postmortem from Cloudflare identifies the source of the failure as a bug in their Bot Management system’s configuration file. Here’s what happened in more detail:
Cloudflare’s Bot Management product uses a “feature file” – a configuration file listing characteristics (features) of known malicious bots and threats. This file is generated by querying Cloudflare’s ClickHouse database and is published to every node in Cloudflare’s network every few minutes, so that edge servers can update their bot-detection models. Normally this feature file has a predictable size and is parsed quickly by the core proxy software.
On November 18, a recent change to Cloudflare’s database system permissions triggered a problem: the automated query that builds the feature file started producing duplicate entries. In Cloudflare’s words, “It was triggered by a change to one of our database systems’ permissions which caused the database to output multiple entries into a ‘feature file’ used by our Bot Management system. That feature file, in turn, doubled in size.” In effect, the file grew much larger than expected (roughly double its usual size) because many bot features were listed twice.
Once the oversized file was published across the network, the core traffic routers (proxies) encountered it and crashed. The proxy software has a limit on feature file size, and this corrupted file exceeded that limit. In Cloudflare’s words, the “larger-than-expected feature file” caused their routing software to fail. This is why HTTP requests began failing with 5xx errors: the very process that inspects and routes incoming requests was crashing when it tried to load the oversized bot feature file.
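One way to picture a guard against this failure is a check that runs on the generated rows before the file is published. The sketch below is purely illustrative and is not Cloudflare's code: the limit `MAX_FEATURES`, the feature names, and the function `check_feature_rows` are all assumptions made for the example. It rejects a file that contains duplicate feature names or that exceeds the assumed limit.

```python
from collections import Counter

# Hypothetical limit: the article only says the proxy enforces *some* size limit.
MAX_FEATURES = 200


class FeatureFileRejected(ValueError):
    """Raised when a generated feature file should not be published."""


def check_feature_rows(rows, max_features=MAX_FEATURES):
    """Validate feature names before publishing; return them unchanged if OK."""
    counts = Counter(rows)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    if duplicates:
        raise FeatureFileRejected(
            f"{len(duplicates)} duplicated feature(s), e.g. {duplicates[0]!r}"
        )
    if len(rows) > max_features:
        raise FeatureFileRejected(
            f"{len(rows)} features exceeds limit of {max_features}"
        )
    return list(rows)


# Illustrative data: a normal file, and the same file with every row doubled.
good = [f"feature_{i}" for i in range(60)]
bad = good + good

print("good file accepted with", len(check_feature_rows(good)), "features")
try:
    check_feature_rows(bad)
except FeatureFileRejected as exc:
    print("bad file rejected:", exc)
```

Rejecting the file outright, rather than silently de-duplicating it, is a design choice in this sketch: duplicates signal that something upstream has changed, which is worth surfacing to an operator.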
To clarify the technical details: Cloudflare discovered that every five minutes, a new feature file was generated by a query on a ClickHouse database cluster. Because the cluster was being updated in phases, only some database shards returned the bad duplicate data at first. This meant that intermittently a “good” file (normal size) or a “bad” file (double size) would propagate to the network. That explains the strange recovery/relapse pattern early in the outage: sometimes a small file was rolled out and traffic briefly worked, and sometimes the big file hit and crashed things. Eventually the bad file took over completely and the system stayed down.
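The recover-then-fail pattern can be illustrated with a toy simulation. This is a simplified model, not a description of Cloudflare's actual rollout: it assumes each five-minute generation cycle lands on an already-updated part of the cluster with some probability, and that this probability rises as the phased update proceeds. The function name and the fractions used are made up for the example.

```python
import random


def simulate_cycles(updated_fraction_per_cycle, seed=0):
    """Return 'good' or 'bad' for each file-generation cycle.

    Each entry in `updated_fraction_per_cycle` is the assumed share of the
    database cluster that has already received the permissions change.
    """
    rng = random.Random(seed)
    results = []
    for fraction in updated_fraction_per_cycle:
        # The query hits an updated node with probability `fraction`.
        hit_updated = rng.random() < fraction
        results.append("bad" if hit_updated else "good")
    return results


# A phased rollout: the updated share grows from 0% to 100% over the cycles.
rollout = [0.0, 0.25, 0.5, 0.5, 0.75, 0.75, 1.0, 1.0]
for fraction, outcome in zip(rollout, simulate_cycles(rollout)):
    print(f"updated share {fraction:.0%}: {outcome} file published")
```

While the updated share is between 0% and 100%, the outcome flips between good and bad files; once the rollout is complete, every cycle produces a bad file, which matches the point where the system stayed down.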
Importantly, at no point was an external attacker involved. Cloudflare explicitly stated “there was no evidence that this was the result of an attack or caused by malicious activity”. The anomaly was purely an internal configuration glitch. As the Axios report summarized it, Cloudflare’s spokesperson said the outage was caused by a “configuration file that is automatically generated to manage threat traffic”, which “grew beyond an expected size of entries and triggered a crash” in their traffic-handling software. The Guardian similarly reported that after fixing the issue, Cloudflare said the root cause “was a configuration file that is automatically generated to manage threat traffic” which “grew beyond its expected size and triggered a crash in the software system that handles traffic”.
How the Bug Triggered the Crash
The key technical takeaway is that a fixed-size data structure assumption was violated. Cloudflare’s proxy code expected the Bot Management file to be within a certain size, and when it doubled, the software did not handle it gracefully. Cloudflare’s own blog post elaborates:
“A change in our underlying ClickHouse query behaviour … caused [the feature file] to have a large number of duplicate ‘feature’ rows. This changed the size of the previously fixed-size feature configuration file, causing the bots module to trigger an error.”
When the bots module errored out, the entire core proxy (codenamed “Frontline” or FL) began returning HTTP 5xx errors for all incoming requests that needed bot processing. This is why essentially all traffic through Cloudflare suddenly failed. Even Cloudflare’s own features like Workers KV and Access, which rely on the core proxy layer, saw errors.
Another nuance from Cloudflare’s detailed postmortem: they are in the process of migrating to a new proxy engine called FL2. Both the old and new proxies received the bad file, but the effects differed. Sites on FL2 saw outright 5xx failures, while sites on the legacy FL saw traffic continue but with bot scores defaulting to zero (meaning bots could slip through or be misidentified). Either way, customer service was impacted.
In practical terms, the extra entries in the file pushed its size beyond Cloudflare’s threshold. Cloudflare engineers had not anticipated that the ClickHouse query could suddenly change its output volume in this way. When the expanded file rolled out, the anomaly cascaded into a total network disruption. Cloudflare admitted this was a latent bug in their Bot Management system that only surfaced under these conditions.
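A common way to handle this class of problem gracefully is for the consumer to keep its last known good configuration when a new one fails validation, instead of crashing. The sketch below illustrates that general pattern under stated assumptions; it is not how Cloudflare's proxy is implemented, and `BotModule`, `parse_feature_file`, and the limit of four features are hypothetical.

```python
def parse_feature_file(text, max_features):
    """Parse one feature name per line; raise ValueError if over the limit."""
    names = [line.strip() for line in text.splitlines() if line.strip()]
    if len(names) > max_features:
        raise ValueError(f"{len(names)} features exceeds limit of {max_features}")
    return names


class BotModule:
    """Toy consumer that never replaces a working config with a broken one."""

    def __init__(self, initial_text, max_features=4):
        self.max_features = max_features
        self.features = parse_feature_file(initial_text, max_features)
        self.rejected_updates = 0

    def reload(self, text):
        """Try to load a new file; keep the current one if it is invalid."""
        try:
            new_features = parse_feature_file(text, self.max_features)
        except ValueError:
            self.rejected_updates += 1  # would also be logged and alerted on
            return False
        self.features = new_features
        return True


module = BotModule("a\nb\nc")
accepted = module.reload("a\nb\nc\na\nb\nc")  # doubled file, over the limit
print("update accepted:", accepted, "| active features:", module.features)
```

The trade-off is that the consumer keeps serving with stale data, so a rejected update needs to be visible to operators rather than silently ignored.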
Cloudflare’s Response and Mitigation
Once the cause was identified, Cloudflare’s engineering teams took swift action. According to the postmortem, by 14:30 UTC they stopped the faulty file’s distribution and reverted to a safe version. Specifically, they “stopped the generation and propagation of the bad feature file” and manually inserted a known good file into the distribution queue. They then forced a restart of the core proxies (Frontline) to ensure all nodes would load the corrected file.
That change had an immediate effect: traffic throughput began recovering. The chart of 5xx errors on Cloudflare’s status site shows a steep drop after 14:30; in the team’s description, errors “continued until the underlying issue was identified and resolved starting at 14:30”. In Cloudflare’s words, “core traffic was largely flowing as normal by 14:30 UTC” and “all systems were functioning as normal” by 17:06 UTC. So the fix was in place a little over three hours after the outage began, and the remaining services gradually came back online over the following two and a half hours.
Cloudflare kept customers and the public updated via status page posts (and social media). Initially they suspected a DDoS attack, but after identifying the bug they clarified that the cause was internal. They apologized to customers and emphasized that reliability was their priority. As Axios reported, Cloudflare said “We apologize to our customers and the Internet in general for letting you down today… We will learn from today’s incident and improve.” CEO Matthew Prince (in a quote captured by Tom’s Hardware) similarly said “earlier today we failed our customers… this was not an attack… Work is already underway to make sure it does not happen again”.
By late Tuesday (UTC), all affected services were back and Cloudflare had deployed monitoring to ensure no residual problems. Downdetector reports fell from the 11,000+ peak to just a few thousand as the outage wound down. By Wednesday, businesses and users saw their sites and apps come back to life.
Why This Matters
The Cloudflare outage of Nov 18 is more than a footnote in tech news; it’s a reminder of how fragile the modern internet can be. Cloudflare sits in front of so many sites and services, acting as their shield against attacks and as a global CDN. When Cloudflare goes down, it’s not just one website – it’s a wave of errors across many domains.
This incident, like a recent Amazon AWS outage, underscores that even infrastructure giants are vulnerable to internal bugs. As one expert commented, outages at a few big providers show “how few of these companies there are in the infrastructure of the internet, so that when one fails it becomes really obvious quickly.” For developers and businesses relying on Cloudflare, this was a wake-up call to have contingency plans. For instance, some sites use multiple CDN providers or maintain redundant hosting to survive such incidents.
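As a rough illustration of the multi-provider idea, the sketch below tries a list of endpoints in order and uses the first one that does not answer with a 5xx error. Everything here is an assumption made for the example: the hostnames are placeholders, and the `fetch` callable is injected so that the logic can be shown without making real network calls. In practice, failover is usually done at the DNS or load-balancer level rather than in application code.

```python
from typing import Callable, Sequence, Tuple

# A fetcher takes a URL and returns (HTTP status, body).
Fetch = Callable[[str], Tuple[int, bytes]]


def fetch_with_failover(urls: Sequence[str], fetch: Fetch) -> Tuple[str, bytes]:
    """Try each URL in order; return the first response that is not a 5xx."""
    failures = []
    for url in urls:
        try:
            status, body = fetch(url)
        except OSError as exc:  # connection refused, timeout, DNS failure
            failures.append(f"{url}: {exc}")
            continue
        if status < 500:
            return url, body
        failures.append(f"{url}: HTTP {status}")
    raise RuntimeError("all endpoints failed: " + "; ".join(failures))


def fake_fetch(url: str) -> Tuple[int, bytes]:
    """Stand-in fetcher: the 'primary' provider is down, the backup works."""
    if "primary" in url:
        return 500, b"Internal server error"
    return 200, b"ok"


# Placeholder hostnames, not real services.
URLS = [
    "https://primary-cdn.example.com/app.js",
    "https://backup-cdn.example.com/app.js",
]
served_by, body = fetch_with_failover(URLS, fake_fetch)
print("served by:", served_by)
```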
From a technical perspective, the root cause is a case study in configuration management and testing. A small change in database permissions – a routine maintenance task – led to cascading failures. It shows the importance of understanding how database queries scale, of testing how changes affect automated data pipelines, and of having alerts when generated data files exceed normal thresholds.
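The last point, alerting when a generated file drifts away from its normal size, can be sketched in a few lines. This is a minimal example under assumed numbers: the size history, the 50% tolerance, and the function name `size_alert` are all hypothetical, and a real pipeline would feed the result into its monitoring system.

```python
from statistics import median


def size_alert(history, new_size, tolerance=0.5):
    """Return True if `new_size` deviates from the median of `history`
    by more than `tolerance` (expressed as a fraction of the median)."""
    baseline = median(history)
    deviation = abs(new_size - baseline) / baseline
    return deviation > tolerance


# Illustrative sizes (arbitrary units) of recently generated files.
recent_sizes = [100, 102, 98, 101]

print("normal file triggers alert: ", size_alert(recent_sizes, 105))
print("doubled file triggers alert:", size_alert(recent_sizes, 200))
```

Using the median rather than the mean keeps the baseline stable even if one earlier file was itself an outlier.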
Moving forward, Cloudflare has committed to adding checks to prevent oversized feature files and improving monitoring of its Bot Management system. They will likely introduce more safeguards in their continuous integration and deployment processes. For their customers, the key lesson is to watch for similar issues and communicate with Cloudflare when big updates happen.
Key Takeaways
- Unprecedented Scope: The Cloudflare outage on Nov 18, 2025, knocked major websites offline for a little over three hours, with full recovery taking nearly six. Downdetector saw 11,000+ user outage reports at the peak, affecting services like ChatGPT, X, Spotify, Zoom, and many more.
- Not a Cyberattack: Despite initial suspicions, this was not a DDoS or hack. Cloudflare confirmed “no evidence” of malicious activity. The cause was internal: a routine configuration error in their Bot Management system’s data generation.
- Configuration File Bug: A database permissions change caused duplicate entries in the Bot Management “feature file,” doubling its size beyond expected limits. The oversized file crashed Cloudflare’s proxy software. This is why HTTP 5xx errors spiked.
- Rapid Fix: Cloudflare halted the bad file and reverted to a previous safe version. By 14:30 UTC they had core services back up, and by 17:06 UTC normal operations had resumed.
- Infrastructure Lesson: The incident highlights that even cloud/CDN providers can fail. Businesses should plan for third-party outages (multi-CDN, fallback modes). For engineers, it stresses the need for thorough testing of any change that affects automated system configurations.
- Public Response: Cloudflare publicly apologized and promised improvements. CEO Matthew Prince admitted they “failed our customers,” and noted that fixes would be prioritized to “earn [customers’] trust back”.
In the end, the Nov 18 outage was caused by a simple software bug in a “threat traffic” config file – but it had massive consequences. The event is now well-documented by Cloudflare’s own postmortem and by independent reporting. For developers and website operators, the best defense is awareness: keep communication lines open with your CDN provider, have backup plans, and learn from these deep post-incident reports to improve resilience.
