Techie yanked some cables and took Cloudflare’s dashboard and API down for four hours

Single point of failure, imprecise instructions and unlabelled cables are a bad, bad mix

Cloudflare has admitted that a four-and-a-bit-hour outage today was caused by a technician pulling out cables that should have been left in place, after techies were given unhelpfully imprecise instructions.

The incident started with some “planned maintenance at one of our core data centers” that saw techies told “to remove all the equipment in one of our cabinets.”

Cloudflare said the cabinet in question “contained old inactive equipment we were going to retire and had no active traffic or data on any of the servers in the cabinet.”

But there was more to this cabinet than met the eye:

“The cabinet also contained a patch panel (switchboard of cables) providing all external connectivity to other Cloudflare data centers. Over the space of three minutes, the technician decommissioning our unused hardware also disconnected the cables in this patch panel.”

It turned out that patch panel was a single point of failure for Cloudflare’s data centre. Or as Cloudflare has explained in its incident report: “Starting at 1531 UTC and lasting until 1952 UTC, the Cloudflare Dashboard and API were unavailable because of the disconnection of multiple, redundant fibre connections from one of our two core data centers.”

The company scrambled to sort things out, but that took time because cables weren’t clearly labelled. Coronavirus-caused off-site working didn’t help matters either.

The company has had the good grace to not throw its techies under a bus, writing that it needs a process change along these lines: “While sending our technicians instructions to retire hardware, we should call out clearly the cabling that should not be touched.”

At least customers were merely disrupted, rather than damaged, as all configuration data was preserved during the incident.

The company is nonetheless “very sorry for the disruption”, wrote Cloudflare CTO John Graham-Cumming. ®
