Diversify Security - Learn from Facebook's Outage

As many people know, Facebook had a massive outage this week that affected all of their services.  The cause is straightforward and, in hindsight, a little silly.  To explain it fully, here is a quick breakdown of what happened:

  • Facebook had an issue during routine maintenance that affected their internal data centers and how they connect to the network.  This issue arose around 11:40 AM Eastern.
  • The smaller facilities responsible for answering outside requests, which host Facebook's DNS servers, then had a secondary failure.  These servers use Border Gateway Protocol (BGP) to advertise themselves to the internet, and they withdraw those advertisements when they cannot reach the data centers, since that is treated as a sign of an unhealthy connection.  With the internal network down, the check triggered everywhere and made Facebook's DNS servers unreachable from outside.
  • Facebook's communication applications, their internal messaging, and even their card reader software all run on Facebook's internal networks and servers, which were unreachable.  Effectively, the buildings couldn't be entered without physical keys.
  • This extended down to the servers themselves, where physical access via keycard was not possible.  There are a few theories about how they got into the server cages, but regardless, they managed to get everything operational around 6:30 PM Eastern.
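
The health-check behavior in the second bullet can be sketched in a few lines of Python.  This is a toy model, not Facebook's implementation: the `DnsNode` class, the node names, and the `backbone_reachable` flag are all assumptions made up for illustration.  It shows why a check that is sensible for one node becomes a total outage when every node fails it for the same reason.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DnsNode:
    """Hypothetical edge DNS node that advertises its route only while healthy."""
    name: str
    advertising: bool = True

    def health_check(self, backbone_reachable: bool) -> None:
        # Withdraw the route advertisement when the data centers are unreachable.
        self.advertising = backbone_reachable


def resolvable(nodes: List[DnsNode]) -> bool:
    """Outside users can resolve names only if at least one node is still advertised."""
    return any(node.advertising for node in nodes)


nodes = [DnsNode("edge-a"), DnsNode("edge-b"), DnsNode("edge-c")]

# Internal network outage: every node fails the same check at the same moment.
for node in nodes:
    node.health_check(backbone_reachable=False)

print(resolvable(nodes))  # False
```

Having three nodes looks like redundancy, but all three share one dependency, so they fail together.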


So, how can a security engineer look at this situation and draw conclusions?  The first thing to look at is how a system copes with failure.  If one system fails, is the failover on the same network, or could it be impacted by the initial failure?  This comes up all the time in security processes: if malware slips past your IDS / network protection solution, do you have a secondary scanner on endpoints to pick it up?  What does your response look like?

Thinking about all of these things is important when handling a security event or planning for a breach.  As I've written about before, good Security Architecture means having plans and processes that can deal with cascading failure.  If enough potential problems rest on a single point of failure, that failure can completely bring down a company (similar to Facebook's issues this week), and this is far more likely when security isn't diversified or layered.
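
One way to make the single-point-of-failure question concrete is to write down what each system depends on and check whether a primary and its fallback share anything.  The sketch below is a minimal illustration under assumed data: the service names and the `depends_on` map are hypothetical, loosely modeled on the badge reader example above.

```python
def transitive_deps(service, depends_on):
    """Return everything `service` relies on, directly or indirectly."""
    seen = set()
    stack = [service]
    while stack:
        current = stack.pop()
        for dep in depends_on.get(current, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def shared_dependencies(primary, fallback, depends_on):
    """Dependencies whose failure would take out both the primary and its fallback."""
    return transitive_deps(primary, depends_on) & transitive_deps(fallback, depends_on)


# Hypothetical dependency map; every name here is made up for illustration.
depends_on = {
    "badge-readers": ["internal-network"],
    "internal-chat": ["internal-dns"],
    "internal-dns": ["internal-network"],
    "physical-keys": [],
}

print(shared_dependencies("badge-readers", "internal-chat", depends_on))  # {'internal-network'}
print(shared_dependencies("badge-readers", "physical-keys", depends_on))  # set()
```

An empty result is what you want from a fallback: nothing that breaks the primary can also break it.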

Some considerations / goals of layered security:

  • Removing a single point of failure.
    • By layering security solutions, if one fails, the others still operate.  A single failed solution should not be enough to allow malware infections, breaches, or data exfiltration.
  • Having both human and automated response options.
    • In every case that requires a human response, make sure you have an employee or service available to cover that issue.  If you are concerned about threat events in the middle of the night, it is critically important to have an employee monitoring for, or prepared to respond to, those threats.  Further, how are you monitoring them?  Is the employee watching threat feeds on a night shift?  Are they on call for automated alerts?  Are they going to be awake during that period and actually respond?
    • For automated answers, make sure the automation has a fallback for when a human response is required.  It doesn't help that a system detected your entire network going out at 2 AM if the outage isn't resolved because no person is aware of it.  Losing money or having an outage with no engineers available to fix it, because nobody knew about it or had an assigned response, isn't an acceptable failure option.
  • Don't be afraid to use external trusted vendors.
    • Facebook's issue shows us that even the largest companies, when they rely solely on internal solutions, will have issues that can bring down their entire network.  Imagine a separate situation where Facebook had options from external sources that could get ahead of this problem sooner.  How much money would that have saved Facebook?  Would the issue have even been as widespread?
  • Make sure you have a plan for an outage, with clear indicators of who to contact and what actions they need to take.
    • Write up a plan for an outage and make sure your staff or fellow employees know what their duties are in that situation.  Maybe run a drill once a quarter so that your employees can respond to an outage as quickly as possible.  When security goes down, it can cause huge vulnerabilities within a network - and it should be treated just as critically as a network outage.
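
The human and automated response points above can be sketched as an escalation chain: try automation first, then the on-call employee, then an external vendor, and treat "nobody picked it up" as its own failure.  This is a minimal sketch, not a real paging integration.  The `Responder` class, the responder names, and the lambda stand-ins for acknowledgements are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Responder:
    """Hypothetical responder; `acknowledge` returns True if the alert was picked up."""
    name: str
    acknowledge: Callable[[str], bool]


def escalate(alert: str, chain: List[Responder]) -> Optional[str]:
    """Try each responder in order; return who acknowledged, or None if nobody did."""
    for responder in chain:
        if responder.acknowledge(alert):
            return responder.name
    return None


# Stand-in behaviors: automation cannot fix it, on-call misses the page,
# and the external vendor picks it up.
chain = [
    Responder("auto-remediation", lambda alert: False),
    Responder("on-call engineer", lambda alert: False),
    Responder("external SOC vendor", lambda alert: True),
]

owner = escalate("edge firewall offline", chain)
print(owner)  # external SOC vendor
```

If `escalate` ever returns `None`, that is the 2 AM outage nobody knows about, and the plan should say what happens next.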

What if you aren't sure how to proceed with the above?  Don't be afraid to reach out to security vendors and ask what their processes are when one of their systems goes down.  Most service providers have details they can provide that keep you apprised of what actions to take in case of an outage.  Make sure this information is included in your overall plan.  It helps to know who to call or what process to follow for a service you do not fully control internally.

And if those answers aren't clear, it is important to reach out to trusted security advisors or other security providers you work with in order to have a clear plan built for your business.  Plan for the best, but also plan for the worst. It is important to have both plans in place and resolve any potential issues before they become glaring weaknesses in your security posture.

Additional Sources:

Facebook's Write up

The Verge's Explanation