Facebook Releases More Details About the Cause of Yesterday’s Six-Hour Outage ⇥ engineering.fb.com
Santosh Janardhan, VP of engineering and infrastructure at Facebook:
This was the source of yesterday’s outage. During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally. Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool didn’t properly stop the command.
This change caused a complete disconnection of our server connections between our data centers and the internet. And that total loss of connection caused a second issue that made things worse.
These sorts of posts always go through legal and public relations teams, so it is hard to know how complete this accounting of yesterday’s outage is. But what is written here is pretty embarrassing for Facebook: not the outage itself, but that a routine maintenance misconfiguration took out a single point of failure, rendering the entire company’s infrastructure inaccessible. Whether this actually makes sense as presented is something best judged by networking professionals operating at Facebook’s scale.
That said, I think it is commendable that Facebook issued an explanation for its outage under a VP’s name. It could have had its communications team issue a typically pissy statement attributed only to the company. When Google services were down in December, it was similarly transparent. I wish this could be the standard rather than the exception. It builds confidence.
For comparison, as I write this, Apple’s System Status page shows a resolved outage in Apple Pay and Wallet. For over seven hours yesterday, “users were not able to add, suspend, or remove existing cards to Apple Pay”. The issue has simply been marked “Resolved”, with no further details. This explanation-free status update has been the standard for every iCloud-related outage, including serious incidents. It does not build confidence.