Single point of failure, imprecise instructions and unlabelled cables are a bad, bad mix
Cloudflare has admitted that a four-and-a-bit-hour outage today was caused by someone pulling out cables that should have been left in place, but which were yanked because techies were given unhelpfully imprecise instructions.
The incident started with some “planned maintenance at one of our core data centers” that saw techies told “to remove all the equipment in one of our cabinets.”
Cloudflare said the cabinet in question “contained old inactive equipment we were going to retire and had no active traffic or data on any of the servers in the cabinet.”
But there was more to this cabinet than met the eye:
“The cabinet also contained a patch panel (switchboard of cables) providing all external connectivity to other Cloudflare data centers. Over the space of three minutes, the technician decommissioning our unused hardware also disconnected the cables in this patch panel.”
It turned out that patch panel was a single point of failure for Cloudflare’s data centre. Or as Cloudflare has explained in its incident report: “Starting at 1531 UTC and lasting until 1952 UTC, the Cloudflare Dashboard and API were unavailable because of the disconnection of multiple, redundant fibre connections from one of our two core data centers.”
The company scrambled to sort things out, but that took time because the cables weren’t clearly labelled. Coronavirus-caused off-site working didn’t help matters either.
The company has had the good grace to not throw its techies under a bus, writing that it needs a process change along these lines: “While sending our technicians instructions to retire hardware, we should call out clearly the cabling that should not be touched.”
At least customers were merely disrupted, rather than damaged, as all configuration data was preserved during the incident.
The company is nonetheless “very sorry for the disruption”, wrote Cloudflare CTO John Graham-Cumming. ®