Cloudflare Outage November 2025: How Automated DNS Failover Kept Our Clients Online

It was 11:23 in the morning on 18 November 2025 when my phone started going off.

Not a single message. Not two. The notifications were stacking so fast the screen barely had time to settle between them. Slack. WhatsApp. Email. Then Slack again. I remember picking the phone up and thinking someone had done something catastrophic in production.

Nobody had. At least, not on our side.

Cloudflare had gone down.

What actually happened on 18 November 2025

Cloudflare is not just a CDN. It is the silent infrastructure layer underneath approximately 20% of the web. ChatGPT sits behind it. So do X, Spotify, Shopify, Canva, Dropbox, Coinbase, and Discord. When Cloudflare has a bad day, it is not one website that goes dark. It is a significant portion of the internet.

On that morning, a permission change in one of Cloudflare’s internal database systems caused the database to emit duplicate entries into a configuration file used by Cloudflare’s Bot Management system. The file grew past a size limit in the software that loads it, and that software began to fail. Requests started failing at scale. The error cascaded across Cloudflare’s global network. Millions of websites started returning 500 errors, blank pages, or just timing out entirely.

By 11:30 UTC, the reports were everywhere. ChatGPT was down. X was down. Spotify was showing errors. Users were flooding Downdetector. Tech Twitter was in a collective panic. The BBC was covering it. Cloudflare’s stock dropped sharply before markets had even fully processed what was happening.

The outage lasted nearly six hours for some services. Cloudflare lost approximately $1.8 billion in market capitalisation that day.

And our clients? The ones we manage Cloudflare for?

They were back online within four minutes of the incident starting. Most of their users never noticed anything at all.

Here is exactly how.

The quiet system running in the background

No client ever asked us to build this.

That is the part I want to be clear about. Nobody called us and said “what happens if Cloudflare itself goes down?” Nobody included it in a requirements document. Nobody raised it in an onboarding meeting. As far as our clients were concerned, they had hired us to manage Cloudflare, and they trusted us to figure out what that actually meant.

That trust is exactly why we built it.

About eighteen months before this outage, we were setting up Cloudflare for a new client. The usual stack - WAF rules, DDoS protection, rate limiting, proxied DNS. Everything configured, everything tested, everything working. And I sat back and thought about what we had actually built. We had made Cloudflare the single point of entry for everything. All traffic. All security. All edge logic. One vendor, no backup, and a client who was now completely dependent on that vendor staying up.

Nobody had asked us to do it differently. But that is the wrong standard. If you are waiting for your client to think of the question before you address the risk, you are not managing their infrastructure. You are just operating it.

So we built the failover before anyone asked for it. We added it to every deployment as a baseline, not as an option. We set low DNS TTLs, built the synthetic monitoring, wired up the DNS switching automation, and tested the bypass path before it was ever needed. Every client we onboarded after that got it automatically, without a conversation about it, because in our view it is simply part of what it means to manage Cloudflare properly.

The system is not complicated in concept. It is just disciplined about what it checks and what it does when something is wrong.

How the automation works

[Figure: Cloudflare automated failover flowchart - synthetic monitoring, error detection, DNS switch to origin, and automatic recovery]

Every few minutes, our monitoring runs synthetic checks against each client’s website. Not just a ping. Not just a TCP connection check. It loads actual page assets - HTML, CSS, key JavaScript files, images - the same way a real user’s browser would. It checks HTTP status codes, response times, and whether the content it receives matches what it expects. If a checkout page should return a 200 with a specific element present, that is what we check for.
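A minimal sketch of what one such check could look like, in Python. The class names, the three-second limit, and the marker string are illustrative assumptions, not our production monitoring code. The point is that a check compares status, timing, and content, and reports every reason it failed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Expectation:
    """What a healthy response looks like (all defaults are illustrative)."""
    status: int = 200
    max_seconds: float = 3.0
    must_contain: Optional[str] = None  # e.g. a marker for a checkout form


@dataclass(frozen=True)
class Observation:
    """What the probe actually saw for one URL."""
    status: int
    seconds: float
    body: str


def evaluate(obs: Observation, exp: Expectation) -> list[str]:
    """Return the failure reasons; an empty list means the check passed."""
    failures = []
    if obs.status != exp.status:
        failures.append(f"status {obs.status}, expected {exp.status}")
    if obs.seconds > exp.max_seconds:
        failures.append(f"took {obs.seconds:.2f}s, limit {exp.max_seconds:.2f}s")
    if exp.must_contain is not None and exp.must_contain not in obs.body:
        failures.append(f"missing marker {exp.must_contain!r}")
    return failures
```

Returning reasons rather than a bare pass or fail keeps the later steps simple: the status code that caused the failure is already in hand when the system starts asking whose fault it is.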

When those checks start failing, the system does not immediately panic. A single failed check could be a network hiccup, a slow response, anything. It runs the check again. If it fails again, it starts investigating.
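The "check again before reacting" rule can be sketched as a small counter. The class name and the default threshold of two are assumptions chosen to match the description above.

```python
class FailureStreak:
    """Counts consecutive failed checks for one monitored URL."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold
        self.count = 0

    def record(self, passed: bool) -> bool:
        """Record one result; return True when investigation should start."""
        # Any passing check resets the streak, so isolated blips never escalate.
        self.count = 0 if passed else self.count + 1
        return self.count >= self.threshold
```

With the default threshold, one failure returns False and a second consecutive failure returns True.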

Step one: Is this a Cloudflare error?

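Cloudflare returns a distinctive set of error codes from its own edge rather than passing them through from the origin server: 520 (web server returned an unknown error), 521 (web server is down), 522 (connection timed out), 524 (a timeout occurred), 526 (invalid SSL certificate). There are others. In normal operation most of these mean Cloudflare could not get a usable response from the origin, so on their own they do not prove Cloudflare is at fault; they only tell us the error was generated at the edge. The system looks specifically for these codes and for the HTML signature of Cloudflare’s own error pages - that distinctive Cloudflare-branded error page that anyone who was online on 18 November 2025 became very familiar with.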

If the errors match Cloudflare’s fingerprint, we move to step two.
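A rough sketch of that fingerprint test in Python. The `Server: cloudflare` and `CF-RAY` response headers are ones Cloudflare adds to proxied responses; the body markers are assumptions about what its error pages contain and should be verified against real pages before relying on them.

```python
from typing import Mapping

# Status codes listed in the text above; Cloudflare documents others as well.
CLOUDFLARE_EDGE_CODES = frozenset({520, 521, 522, 524, 526})

# Assumed markers for Cloudflare-branded error HTML (illustrative only).
ERROR_PAGE_MARKERS = ("cf-error-details", "cloudflare")


def looks_like_cloudflare_error(
    status: int, headers: Mapping[str, str], body: str
) -> bool:
    """True when an error response appears to be generated at Cloudflare's edge."""
    lowered = {k.lower(): v.lower() for k, v in headers.items()}
    via_cloudflare = lowered.get("server") == "cloudflare" or "cf-ray" in lowered
    if not via_cloudflare:
        return False
    if status in CLOUDFLARE_EDGE_CODES:
        return True
    # A generic 5xx only counts if the body is a Cloudflare-branded page,
    # otherwise it is more likely an origin error passed through the proxy.
    text = body.lower()
    return 500 <= status <= 599 and any(m in text for m in ERROR_PAGE_MARKERS)
```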

Step two: Confirm it is not our client’s fault

Before switching anything, the system queries the origin server directly - bypassing Cloudflare entirely - to confirm the origin is healthy. If the origin responds normally, we have confirmation: Cloudflare is the problem, not the site.
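One way to probe the origin without going through Cloudflare is to open the connection to the origin's IP address while still presenting the site's real hostname for TLS and the Host header. The sketch below uses only the Python standard library; the function names and the User-Agent string are hypothetical. It assumes the origin presents a publicly trusted certificate. An origin that uses a Cloudflare Origin CA certificate would fail the default verification and would need that CA loaded into the context.

```python
import socket
import ssl


def build_request(hostname: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.1 GET addressed to the site's real hostname."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {hostname}\r\n"
        "User-Agent: origin-health-probe\r\n"
        "Connection: close\r\n\r\n"
    ).encode("ascii")


def parse_status(raw: bytes) -> int:
    """Extract the status code from the first line of an HTTP response."""
    status_line = raw.split(b"\r\n", 1)[0].decode("ascii", "replace")
    parts = status_line.split(" ", 2)
    if len(parts) < 2 or not parts[0].startswith("HTTP/") or not parts[1].isdigit():
        raise ValueError(f"not an HTTP status line: {status_line!r}")
    return int(parts[1])


def origin_is_healthy(
    origin_ip: str, hostname: str, path: str = "/", timeout: float = 5.0
) -> bool:
    """Connect straight to the origin IP, using the hostname for SNI and Host."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((origin_ip, 443), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=hostname) as tls:
                tls.sendall(build_request(hostname, path))
                raw = tls.recv(4096)
        return parse_status(raw) == 200
    except (OSError, ValueError):
        # Connection, TLS, timeout, and malformed-response errors all count as unhealthy.
        return False
```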

Step three: Check the Cloudflare status page

The system queries the Cloudflare status API programmatically. If Cloudflare’s own status page is showing an active incident affecting the services our client relies on, that is the final confirmation. This step matters because it separates a Cloudflare-wide incident from a misconfiguration specific to the client’s account. The response tells us which kind of problem we are dealing with.
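A sketch of that status lookup. It assumes the status site exposes the standard Statuspage v2 summary document at the URL below, with an `incidents` list whose entries carry `status` and `impact` fields; check the endpoint and field names against the live API before depending on them. Which impact levels count as serious enough is also an assumption.

```python
import json
import urllib.request

# Assumed endpoint, based on the public Statuspage v2 API layout.
SUMMARY_URL = "https://www.cloudflarestatus.com/api/v2/summary.json"


def fetch_summary(url: str = SUMMARY_URL, timeout: float = 5.0) -> dict:
    """Download and decode the status summary document."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def has_active_incident(
    summary: dict, serious_impacts: tuple = ("major", "critical")
) -> bool:
    """True when the summary lists an unresolved incident with serious impact."""
    for incident in summary.get("incidents", []):
        if incident.get("status") in ("resolved", "postmortem"):
            continue  # already closed
        if incident.get("impact") in serious_impacts:
            return True
    return False
```

Filtering on impact matters because a status page for a network this size often lists minor or regional issues that should not, by themselves, justify moving traffic.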

Step four: Switch DNS to origin

If all three checks align - Cloudflare errors on the site, origin healthy, Cloudflare status page showing an active incident - the system initiates a DNS change. The client’s DNS records, managed through their registrar’s API, are updated to point directly to the origin server, bypassing Cloudflare entirely.

The TTL on these records is kept low deliberately - 60 seconds. That means resolvers that honour the TTL pick up the change within about a minute, which covers most users.
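The gate in step four can be sketched as a pure decision over the three pieces of evidence, followed by a single DNS call. `DnsProvider` and `set_a_record` are hypothetical names standing in for whatever the registrar's SDK actually provides; real APIs will differ.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Evidence:
    cloudflare_errors_seen: bool    # step one
    origin_healthy: bool            # step two
    cloudflare_incident_open: bool  # step three


class DnsProvider(Protocol):
    """Hypothetical interface; a real registrar client will look different."""

    def set_a_record(self, name: str, ip: str, ttl: int) -> None: ...


def should_fail_over(e: Evidence) -> bool:
    """All three signals must agree before any DNS record is touched."""
    return e.cloudflare_errors_seen and e.origin_healthy and e.cloudflare_incident_open


def fail_over(
    dns: DnsProvider, name: str, origin_ip: str, e: Evidence, ttl: int = 60
) -> bool:
    """Point the record at the origin if the evidence supports it."""
    if not should_fail_over(e):
        return False
    dns.set_a_record(name, origin_ip, ttl)
    return True
```

Keeping the decision separate from the DNS call means the gate can be unit-tested without touching a real zone.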

Step five: Monitor and restore

The system keeps polling the Cloudflare status page and the site itself. Once Cloudflare recovers and our synthetic checks pass through the Cloudflare proxy again, it initiates a rollback - DNS records go back to Cloudflare, and the client is back behind the full protection stack.
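The restore logic can be sketched as a two-state controller. Requiring several consecutive healthy observations before rolling back is an assumption added here to avoid flapping during a partial recovery; the names and the default of three are illustrative. How the probe reaches Cloudflare's edge while public DNS points at the origin is left out of the sketch.

```python
from enum import Enum


class Route(Enum):
    PROXIED = "proxied"    # DNS points at Cloudflare
    BYPASSED = "bypassed"  # DNS points at the origin


class RestoreController:
    """Decides when a bypassed site should go back behind Cloudflare."""

    def __init__(self, passes_required: int = 3) -> None:
        self.route = Route.BYPASSED
        self.passes_required = passes_required
        self.passes = 0

    def observe(self, incident_open: bool, proxied_check_passed: bool) -> Route:
        """Feed in one polling result; return the route that should be active."""
        if self.route is Route.PROXIED:
            return self.route
        if incident_open or not proxied_check_passed:
            self.passes = 0  # any bad signal restarts the count
        else:
            self.passes += 1
            if self.passes >= self.passes_required:
                self.route = Route.PROXIED
        return self.route
```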

The whole sequence from detection to DNS switch typically runs in three to four minutes.

What happened on 18 November 2025, in practice

At 11:21 UTC, the first synthetic check failure came in for one of our clients. A checkout page that should have returned a 200 returned a Cloudflare 520 instead.

The system ran the check again at 11:22. Another 520.

At 11:22:30, it hit the origin directly. The origin returned a clean 200.

At 11:22:45, it queried the Cloudflare status API. Active incident, confirmed.

At 11:23:10, the DNS records updated. Pointed to origin.

By 11:24, the synthetic checks were passing again. The client’s users were hitting the origin server directly. No Cloudflare. No errors. The site was working.
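The arithmetic behind that timeline is worth making explicit. The snippet below takes the first failure as 11:21:00 (the seconds are an assumption, since the log entry above gives only the minute) and adds the 60-second TTL as the upper bound for a resolver that honours it.

```python
from datetime import datetime

FMT = "%H:%M:%S"
first_failure = datetime.strptime("11:21:00", FMT)  # seconds assumed
dns_updated = datetime.strptime("11:23:10", FMT)
ttl_seconds = 60

# Time from the first failed check to the DNS change.
detection_to_switch = (dns_updated - first_failure).total_seconds()

# Upper bound for a resolver that cached the old record just before the change.
worst_case_recovery = detection_to_switch + ttl_seconds

print(detection_to_switch)  # 130.0
print(worst_case_recovery)  # 190.0
```

That is 130 seconds to the switch and at most 190 seconds, a little over three minutes, until a well-behaved resolver sees the new record.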

My phone was going off because I was watching the alerts land in real time and then watching them resolve, one client after another, in the space of minutes. Not hours. Minutes.

Meanwhile, for anyone not behind our setup - ChatGPT users, Spotify users, Shopify merchants - the outage dragged on until approximately 17:00 UTC. Nearly six hours of downtime.

What this means for how we build Cloudflare implementations

Our clients did not call us on the morning of 18 November 2025. There was nothing to call about. Their sites were running. The monitoring system had already handled it, silently, while the rest of the internet was working out what was happening.

That silence is what good managed services should sound like.

The failover layer is not something we offer as an add-on. It is not something clients can request if they think of it. It is not a premium tier. It is included in every Cloudflare managed services deployment we run, by default, because we decided that is what the baseline looks like. DNS TTLs are low. Origin servers are independently monitored. The bypass path is tested during onboarding, not discovered during a crisis.

The wider point is about what managed services actually means. Most providers interpret it as “we manage the tool.” We interpret it as “we are responsible for the outcome, including outcomes caused by the tool failing.” Those are very different standards. Cloudflare failing is not a force majeure event that absolves the partner of responsibility. It is a known, recurring risk that a competent partner plans for in advance.

If your current Cloudflare implementation partner has not had a conversation with you about what happens when Cloudflare goes down, they either have not thought about it or they did not think it was their problem. Either way, that is the answer you need.

The November 2025 outage was not the last one. There was another in December 2025 and another in February 2026. These are not anomalies. They are the statistical reality of running 20% of the internet through a single network. The infrastructure underneath fails sometimes. The question is whether your setup was built for that - before it happened, by someone who decided it was their job to anticipate it.

A note on the architecture trade-offs

Bypassing Cloudflare means you lose WAF protection, DDoS mitigation, and bot management for the duration of the outage. That is a deliberate trade-off. An unprotected site that is online is better than a protected site that is not. In our view, the periods when Cloudflare is down are also the periods when sophisticated attackers are less likely to strike, because so much of the internet is failing at once that a single exposed origin is hard to pick out, and attackers’ own tooling often relies on the same CDN infrastructure.

The bypass is also time-limited. As soon as Cloudflare recovers, the protection layer comes back automatically. It is not a permanent degradation - it is a temporary bypass with automatic restoration.

For clients with extremely high security requirements, we maintain a secondary WAF layer at the origin level as well, so even during a Cloudflare bypass the site is not completely naked. That is a more complex setup and not always necessary, but it is available.

If you are not behind this kind of setup, you should be

Your current Cloudflare partner probably never raised this with you. That is not unusual - most do not. But the November 2025 outage made the risk visible in a way that is hard to ignore now.

Our Cloudflare professional services team can audit your existing setup, assess whether your failover path is viable, and retrofit the architecture if it is not. We do this for clients who switch to us from other partners and for businesses that manage Cloudflare in-house without realising what is missing.

For businesses in Saudi Arabia under NCA or PDPL requirements, our Cloudflare managed services for Saudi Arabia builds the compliance layer and the resilience architecture together from the start.

There will be another Cloudflare outage. We do not know the date. We do know that the clients who stay online when it happens are the ones whose partner built for it months before it became a headline.

Talk to Tasrie IT Services about your Cloudflare setup →
