Issues at Facebook – Robert Hill


Yesterday’s Facebook outage caught a lot of people off guard and created a great deal of speculation. In the space of ~17 years, Facebook has gone from a narrow-scope social tool built in a dorm room to a tech giant with its tentacles in many aspects of everyday life. It has become an international commerce, communication and news tool, and its outage, almost unbelievably, disrupted the lives of many people, some of whom seemingly depend on the platform for nearly everything. Whole companies are run on it, marketplaces built, business transacted. Unfortunately, there are companies whose business halted completely during the outage, which also took down Facebook-owned services such as Instagram and WhatsApp.

Santosh Janardhan (Facebook’s VP of Infrastructure) posted a blog entry about the outage’s origin, along with an apology for the “inconvenience caused by today’s outage across our platform”. The post notes that the outage was caused by “changes on the backbone routers that coordinate network traffic between our datacenters”, basically a botched internal update or configuration change. This most likely surfaced as a Border Gateway Protocol (BGP) problem; BGP is the protocol networks use to tell one another which address ranges they can reach. The most famous example of a BGP incident, until yesterday, was in 2008, when the Pakistan Telecommunication Authority (PTA) decided to block YouTube traffic to and from the country. The block was implemented by having a Pakistani AS (autonomous system) announce a more specific route to part of YouTube’s address space, and that announcement leaked beyond the country’s borders. It propagated rapidly around the globe, so a large share of worldwide YouTube traffic was routed to Pakistan, overloading the systems there and effectively bringing YouTube down. YouTube’s servers themselves were never down; traffic was simply not being routed to them, so they were “down” in the sense of being inaccessible. China, Russia, and Iran have all had their own instances of such global traffic rerouting, but yesterday’s Facebook outage was orders of magnitude bigger.
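To illustrate how a single bad announcement can pull traffic away from its real destination, here is a minimal Python sketch of longest-prefix matching, the rule routers use to pick the most specific route that covers an address. It is a simplification and not a description of what happened inside Facebook: real BGP route selection weighs many more attributes, and the prefixes (taken from the reserved documentation range) and the next-hop labels are assumptions made up for the example.

```python
import ipaddress


def best_route(routes, destination):
    """Return the (prefix, next_hop) pair that most specifically covers destination.

    routes is a list of (ip_network, next_hop_label) tuples. Returns None
    when no prefix covers the destination, i.e. the traffic has nowhere to go.
    """
    addr = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in routes if addr in net]
    if not matches:
        return None
    # Longest-prefix match: the route with the largest prefix length wins.
    return max(matches, key=lambda entry: entry[0].prefixlen)


# Hypothetical routing table with only the legitimate announcement.
legit = [(ipaddress.ip_network("203.0.113.0/24"), "legitimate-origin")]

# The same table after a bogus, more specific announcement has propagated.
hijacked = legit + [(ipaddress.ip_network("203.0.113.128/25"), "bogus-origin")]

for table_name, table in (("before", legit), ("after", hijacked)):
    route = best_route(table, "203.0.113.200")
    print(table_name, route[0], route[1])
```

In this sketch the address 203.0.113.200 follows the legitimate route before the bogus announcement and the bogus one afterwards, even though the legitimate route is still in the table. Nothing on the destination servers has changed; only the directions to them have.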

The questions that should be on everyone’s mind concern the unfortunate timing and the extraordinary length of time it took to restore traffic. What are the dangers of a more insidious possibility, an actual BGP hack? Originally there were very few autonomous systems, but now there are estimated to be 80,000. While this has provided some built-in redundancy, it has also created some unintended vulnerabilities. What if a nation-state, or simply a disgruntled engineer, decided to put a virtual detour sign on the internet superhighway and reroute traffic into a black hole? A malicious attacker does not need to take down a well-protected server farm if they can simply prevent traffic from reaching its destination. What about the other sites we have come to rely on since the shift to working from home? Many people rely on Grubhub and Uber Eats for their meals, banking sites for their finances, and sites like Amazon for the necessities of daily life, all without leaving the “safety” of their homes. This could very well be the harbinger of a new set of threats and attacks against businesses and individuals alike, or it could simply be a Facebook engineer or contractor having a bad day. Having worked with DNS (the Domain Name System) for many years, I know firsthand how frustrating DNS entry issues can be, and how easy it is to fat-finger an IP address for a server or gateway, or to mess up a configuration setting.

Either way, yesterday’s events give business owners and leaders an opportunity to make sure that we are thinking about the risks to our companies. This is a call to action to be proactive and prepared: to get a handle on the assets we own, to understand where our systems and processes are vulnerable, to be risk aware, and to know what our alternatives are if one of our critical systems goes down.

Our mission is to help organizations identify risks, prioritize them as they apply to their business, and manage the remediation process. Reach out to me or one of my team if you’d like to have an executive conversation about how we have done that in our business at Cyturus and helped others to do the same.