In mid-May, a massive global cyber attack affected major companies in more than 150 countries. Major victims included Britain’s National Health Service, with the attack “causing widespread disruptions and interrupting medical procedures across hospitals in England and Scotland.” Other major victims included Spain’s Telefonica, Deutsche Bahn (Germany’s national railway service), French carmaker Renault, and government agencies worldwide (including Russia’s Central Bank).
A great many of the companies and government agencies hit by this attack provide APIs. If your product uses information collected from APIs provided by the affected companies and agencies, then your product may also have been affected by this attack, even if your own servers were not.
In this post, I outline some principles for coping with API outages due to massive cyber attacks.
This is just a test…
The Emergency Broadcast System was a United States network which sought to quickly spread news nationwide in the event of “grave national crises” (like a nuclear attack). Messages were broadcast via radio and television, always beginning with the words: “This is a test.”
Attacks like the one that occurred in May must be considered as tests. Surely it was an attempt to make money immediately (using code stolen from the U.S. National Security Agency). But it was also an experiment: a test of how much money could be made by massively disrupting individual computers, of how great a disruption the method could achieve, and of what the global reaction to such a disruption would be.
The results are surely being analyzed, both by the perpetrators of the May incident, and by similarly minded individuals and groups around the globe. The May event was a stunning success in terms of its global impact and the publicity it generated. Perhaps it was only minimally successful financially. Still, realistically, we can only expect events like this to increase in frequency (more groups executing attacks), intensity (more servers pushing the attack software onto undefended computers), and cleverness (smarter software that can attack computers previously thought to be well-defended).
So, as the owner of an API-centric product, how can you prepare for these events, and how do you react when the inevitable happens?
Advance preparation for cyber attacks
Your product depends on information retrieved from external and internal APIs. For every API call, your platform must include a conditional path for the case where that API does not return the information your product needs to produce a complete result. When only one or two of the APIs your product uses are down, your software can simply substitute a message stating that no current information is available; or it can show the last available information from that API, noting the time of the last data update.
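As a rough illustration of that conditional path, here is a minimal Python sketch. The function name, the in-memory cache, and the status labels are all assumptions made for this example, not part of any particular platform; a real product would probably persist the last good data somewhere more durable.

```python
import time
from typing import Any, Callable, Dict, Tuple

# Hypothetical in-memory cache: API name -> (payload, time of last good update).
_last_good: Dict[str, Tuple[Any, float]] = {}


def fetch_with_fallback(name: str,
                        fetch: Callable[[], Any],
                        now: Callable[[], float] = time.time) -> Dict[str, Any]:
    """Call one API; on failure, fall back to the last good data or a notice."""
    try:
        data = fetch()
    except Exception:
        cached = _last_good.get(name)
        if cached is None:
            # Nothing was ever retrieved from this API.
            return {"status": "unavailable", "data": None, "as_of": None,
                    "message": "No current information is available."}
        payload, as_of = cached
        # Show the old data, tagged with the time of its last update.
        return {"status": "stale", "data": payload, "as_of": as_of,
                "message": "Showing the last available data."}
    timestamp = now()
    _last_good[name] = (data, timestamp)
    return {"status": "fresh", "data": data, "as_of": timestamp, "message": ""}
```

The caller passes in whatever function actually performs the request, so the same wrapper can sit in front of every API the product depends on.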
Your software should address this type of problem automatically. However, a cyber attack that takes down almost all internet resources over a large geographical region is another situation entirely. In that case, you may discover that the majority of the APIs upon which your product depends are down.
Your first-level protocol for handling individual unresponsive APIs may not be sufficient in this case. You will have to decide whether to call APIs that depend on data normally received from earlier calls when that data is missing. Will those APIs respond with a valid (though perhaps partial) result given an incomplete or old set of input data? Or will the response to missing, old, or invalid data be a failed or erroneous result? The last thing you want to do is present your customers with erroneous information.
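One conservative answer to that question is to skip any call whose inputs did not arrive, rather than risk an erroneous result. The sketch below assumes a hypothetical list of steps, each with a name, an optional list of earlier steps it needs, and a function to call; none of these names come from a real library.

```python
from typing import Any, Dict, List


def run_pipeline(steps: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Run API calls in order, skipping any call whose inputs are missing."""
    results: Dict[str, Dict[str, Any]] = {}
    for step in steps:
        needs = step.get("needs", [])
        # An input counts as missing unless the earlier call succeeded.
        missing = [n for n in needs
                   if results.get(n, {}).get("status") != "ok"]
        if missing:
            results[step["name"]] = {"status": "skipped", "data": None,
                                     "missing": missing}
            continue
        inputs = {n: results[n]["data"] for n in needs}
        try:
            data = step["call"](inputs)
            results[step["name"]] = {"status": "ok", "data": data,
                                     "missing": []}
        except Exception:
            results[step["name"]] = {"status": "failed", "data": None,
                                     "missing": []}
    return results
```

A product that knows a downstream API tolerates partial input could relax the check for that one step; the point is that the choice is made explicitly, per dependency, in advance.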
To avoid this, your product software must be constructed to address the possibility of any combination of API calls failing. For example, if your software calls 6 APIs, each call can either succeed or fail, so you need code that provides a legitimate response to 2 to the 6th power (i.e., 64) combinations of successful or unsuccessful calls to those 6 APIs.
In a massive cyber attack, potentially only one or two (or none) of those 6 APIs may be returning a response that includes data. If none of your input APIs is responding, you can present your latest available data, tagged with the time of its last update; state that no current information is available; or do both.
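Covering 64 combinations does not have to mean writing 64 branches. The sketch below, with six made-up API names, uses one function that handles any mix of successes and failures, then enumerates every combination so that each can be checked.

```python
from itertools import product
from typing import Dict, List

# Hypothetical names for the six APIs a product might call.
API_NAMES: List[str] = ["weather", "traffic", "news",
                        "stocks", "maps", "events"]


def build_response(succeeded: Dict[str, bool]) -> Dict[str, object]:
    """Produce a legitimate response for any mix of successes and failures."""
    available = [n for n in API_NAMES if succeeded[n]]
    missing = [n for n in API_NAMES if not succeeded[n]]
    if not available:
        notice = "No current information is available."
    elif missing:
        notice = "Some data is missing or out of date: " + ", ".join(missing)
    else:
        notice = ""
    return {"available": available, "missing": missing, "notice": notice}


# Each API either succeeds or fails, so there are 2**6 = 64 combinations.
ALL_OUTCOMES = [dict(zip(API_NAMES, flags))
                for flags in product([True, False], repeat=len(API_NAMES))]
```

Looping over `ALL_OUTCOMES` in a test suite is one way to confirm that no combination produces an empty or misleading result.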
If you want your product to be always up, and always accurate, you need to be able to accurately report what data is missing or old. This involves significant programming, but it will keep your customers confident that they can trust you to provide the latest valid information. That’s why they use your product, not your competitor’s.
The API Science platform enables you to create monitors that call an API from multiple locations around the globe. Why is this important? Because the closer you are located to the server that responds to your API request, the quicker the response is likely to be.
For example, the latest performance and uptime statistics for my monitors that call the World Bank’s Countries API from my Connecticut, US location to different API Science servers are as follows:
- Washington, DC: 50 msec average response time, 100% uptime
- Oregon: 311 msec average response time, 100% uptime
- Ireland: 387 msec average response time, 100% uptime
- Tokyo: 931 msec average response time, 75% uptime
These numbers are stark. Clearly, it’s best if my Eastern United States customers access my API from my Washington, DC data center. But customers on the US West Coast would likely receive better service if they connected with my Oregon data center; European customers would likely be better served by accessing my API via my Ireland data center; and Asian customers are likely best served by accessing my Tokyo data center.
These numbers suggest that if your API is intended for a global customer base, it’s critical for you to have data centers located around the globe. The distance between your servers and your customers matters; and the distance between your servers and your data provider servers also matters, since any delay in receiving a response from your data sources is perceived by your customers as slowness in your own product’s response.
Having multiple locations for your servers is a highly recommended method for increasing performance for your global customer base; but it’s also a method for providing fall-back support, should a localized cyber attack bring down access to some of your servers. In preparation for such an event, you could provide your customers with instructions on how they can access your product from your other global servers. The response time may be slower, but your API will not be down when it is accessed from server locations that have not been affected by the cyber attack.
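The fall-back idea can be sketched as trying server locations in order of measured response time and moving on when one fails. The latency figures below are the monitor averages quoted above; the function names and the injected `call_region` function are assumptions for illustration.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# Average response times (msec) from the monitor figures quoted above.
REGION_LATENCY_MS: Dict[str, float] = {
    "Washington, DC": 50,
    "Oregon": 311,
    "Ireland": 387,
    "Tokyo": 931,
}


def regions_by_latency(latency_ms: Dict[str, float]) -> List[str]:
    """Return region names ordered from fastest to slowest."""
    return sorted(latency_ms, key=lambda region: latency_ms[region])


def call_first_available(
        regions: List[str],
        call_region: Callable[[str], Any]) -> Optional[Tuple[str, Any]]:
    """Try each region in order; return (region, result) for the first success."""
    for region in regions:
        try:
            return region, call_region(region)
        except Exception:
            continue  # This location is unreachable; try the next one.
    return None  # Every location failed.
```

With these figures, a caller who cannot reach Washington, DC would fall back to Oregon: slower, but still up.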
Responding to active cyber attacks
So, let’s say you’ve done all of the above preparation for the possibility of a massive cyber attack that takes down most or all of your product. What can you do while that attack is active?
This is where your contacts platform becomes critically important. Your software engineers have already done what can be done using software. The cyber attack is disabling APIs that are critical for your product, and anyone who can still see your product is seeing your messages describing the outage.
But, it could be that your own site has been disabled by the cyber attack. What can you do?
Your customers have likely provided you with an email address. Some may have provided you with a cell phone number to which you could send a text message.
If your product is partially disabled, and you’ve done the necessary preparations, it can still run the code you’ve prepared. But, if your product is entirely down, it still may be possible for you to communicate with your customers via email or text messages.
In the event that you can provide your customers with only a minimal product, or none at all, your software can send registered users an email or text message alerting them that a major problem has occurred and that the duration of the outage is unknown.
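A minimal sketch of such an alert routine follows. The user records, the message wording, and the `send_email` and `send_sms` functions are hypothetical; in practice they would wrap whatever mail and messaging services you use, ideally ones that do not depend on the servers that are down.

```python
from typing import Callable, Dict, List

OUTAGE_MESSAGE = ("We are experiencing a major outage. "
                  "The duration of the outage is currently unknown.")


def notify_users(users: List[Dict[str, str]],
                 send_email: Callable[[str, str], None],
                 send_sms: Callable[[str, str], None],
                 message: str = OUTAGE_MESSAGE) -> Dict[str, int]:
    """Alert each registered user by every contact method they provided."""
    counts = {"email": 0, "sms": 0, "unreachable": 0}
    for user in users:
        reached = False
        if user.get("email"):
            send_email(user["email"], message)
            counts["email"] += 1
            reached = True
        if user.get("phone"):
            send_sms(user["phone"], message)
            counts["sms"] += 1
            reached = True
        if not reached:
            # No contact details on file for this user.
            counts["unreachable"] += 1
    return counts
```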
This is preferable to customers simply seeing a blank screen when they go to your app.
Again, location matters: your API may be down from one location, but up from other locations around the world. Your global servers can be programmed to regularly ping one another, so that each server knows which of your servers in other locations are up or down. Using this information, your Asian server, for example, could broadcast to your entire customer base by email or text message that your Washington, DC server may be down.
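The peer check might look something like the sketch below. Each peer is represented by a hypothetical `ping` function that returns true when the server answers; the notice wording is likewise only an example.

```python
from typing import Callable, Dict, Optional


def check_peers(peers: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Ping every peer server; an error or a false reply counts as down."""
    status: Dict[str, bool] = {}
    for name, ping in peers.items():
        try:
            status[name] = bool(ping())
        except Exception:
            status[name] = False
    return status


def outage_notice(reporter: str, status: Dict[str, bool]) -> Optional[str]:
    """Build a customer-facing notice, or None if all peers are up."""
    down = sorted(name for name, up in status.items() if not up)
    if not down:
        return None
    return ("Servers that may be down: " + "; ".join(down)
            + " (reported by " + reporter + ").")
```

A single failed ping may only mean a network problem between two of your own sites, which is why the notice says "may be down"; requiring several peers to agree before broadcasting would be a reasonable refinement.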
If you’ve provided a means for your customers to switch to calling different server locations when they query your API, they can apply a simple “if” statement in their code to respond to a localized outage. In the example above, your customers near Washington, DC could switch to calling your Asian server if they received the message that your Washington, DC site appears to be down.
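On the customer's side, that "if" statement could be as small as the following sketch. Both URLs are placeholders, and the set of regions reported down stands in for however the customer records the outage message they received.

```python
from typing import Set

# Hypothetical base URLs for two of the provider's server locations.
PRIMARY_URL = "https://dc.api.example.com"
FALLBACK_URL = "https://asia.api.example.com"


def choose_base_url(regions_reported_down: Set[str]) -> str:
    """Pick the server location to call, given the latest outage notices."""
    if "Washington, DC" in regions_reported_down:
        return FALLBACK_URL
    return PRIMARY_URL
```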
You can aid your customers even in the face of massive cyber attacks by making your software account for all potential input data stream interruptions; by having multiple server farms around the globe; and by communicating directly with your customers in the event of an attack, telling them you are aware of the attack, describing how you are coping with its impacts, and providing alternative approaches they might implement while the attack remains in progress.