You may have noticed that yesterday, February 28, 2017, we had some issues.
More specifically, Amazon Web Services (AWS), the cloud platform run by Amazon, had some trouble, and it impacted MerusCase and our users.
AWS is a collection of services provided by Amazon that includes file storage, which Amazon calls Simple Storage Service (S3). A lot of companies and sites on the Internet (including MerusCase) use S3 to store their files, documents, and other digital assets.
The craziness began shortly before 10AM PST. Documents would not open. Messages were not synchronizing. Even the bouncing MerusCase logo that appears when reloading the app was a blank grey ball!
Yesterday Amazon's main S3 data center in Virginia (us-east-1) went offline; Amazon reported this, two hours after all of you began blowing us up on various support channels, as 'we are seeing increased error rates'. So far Amazon has not provided any other details about what actually happened. Maybe they are hoping we will all forget the millions of dollars in lost productivity. Maybe they do not quite know (yet) what happened themselves.
What We Did
Shortly before our customers began experiencing slowdowns in MerusCase, we received notifications from our monitoring software that meruscase.com was unavailable.
A quick check of Amazon's status page indicated that everything was OK, but it very clearly wasn't. Further checks of individual Amazon services determined that S3 was not responding.
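For the technically curious, that further check is about as unglamorous as it sounds: ask the S3 endpoint for anything and see whether it answers. Here is a minimal sketch of that kind of probe; the endpoint URL and timeout are illustrative, not our actual monitoring configuration.

```python
# Minimal sketch of an unauthenticated "is S3 answering at all?" probe.
# Endpoint and timeout are illustrative, not our monitoring configuration.
import urllib.error
import urllib.request

S3_ENDPOINT = "https://s3.amazonaws.com/"  # the default (Virginia) S3 endpoint

def s3_is_responding(timeout=5):
    """Return True if the S3 endpoint answers an HTTP request at all."""
    try:
        urllib.request.urlopen(S3_ENDPOINT, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        # Any HTTP response, even an error code, means the service answered.
        return True
    except (urllib.error.URLError, OSError):
        # Timeouts, DNS failures, connection resets: S3 is not responding.
        return False

if __name__ == "__main__":
    print("S3 responding:", s3_is_responding())
```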
Because S3 is where MerusCase stores many of its assets, loading MerusCase began to slow down considerably. Your browser first loads basic directives, like what colors are in your MerusCase theme, the icon libraries, and various other common resources. These "block" the loading of additional items, like the magic that translates data into stuff on your screen. Additionally, for the patient folks among you, all email messages and uploaded documents are stored in S3, so attempts to retrieve mail, documents, or activities related to them also began to fail.
About that same time Twitter came alive (#awsoutage) with multiple reports from users that confirmed what we already knew.
Despite reports that it was only the Virginia data center that was affected, I personally began cycling through every other data center in the world. This began probably before most of you attempted your third browser refresh. We know how this stuff works, and part of the reason we run on the Amazon backbone is that we are supposed to be able to switch to different data centers in a matter of minutes with no compromise in data integrity.
São Paulo ignored me. Ohio maintained its status as a swing state. Singapore 404'd. Ireland timed out. Germany was out to lunch. Oregon declined to participate. And our home data center in San Jose, California was nowhere to be found.
Apparently there is a single point of failure in the S3 system that routes traffic through Virginia. It makes some sense: Virginia was the first data center in the Amazon cloud system, so if master controls exist, they would exist there. The problem was global.
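To give you a sense of what "cycling through" data centers looks like in practice, here is a rough sketch using the AWS SDK: point an S3 client at each regional endpoint and ask whether it answers. The bucket name is hypothetical, the region list is abbreviated, credentials are assumed to be configured, and this illustrates the approach rather than our actual failover tooling.

```python
# Rough sketch of probing S3 region by region. Bucket name is hypothetical,
# region list is abbreviated, and AWS credentials are assumed to be configured.
import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

REGIONS = ["us-east-1", "us-east-2", "us-west-1", "us-west-2",
           "sa-east-1", "eu-west-1", "eu-central-1", "ap-southeast-1"]

quick = Config(connect_timeout=5, read_timeout=5, retries={"max_attempts": 1})

for region in REGIONS:
    s3 = boto3.client("s3", region_name=region, config=quick)
    try:
        s3.head_bucket(Bucket="example-meruscase-documents")  # hypothetical bucket
        print(f"{region}: responding")
    except ClientError as err:
        # A 301/403/404 still means the regional endpoint answered us.
        print(f"{region}: responding ({err.response['Error']['Code']})")
    except BotoCoreError:
        # Timeouts, DNS failures, connection resets: the region is not answering.
        print(f"{region}: NOT responding")
```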
After unsuccessfully attempting to reach these other S3 data centers around the world, we turned our attention to disabling the features in MerusCase that relied on S3. We figured some access to MerusCase was better than no access at all.
To this end, we disabled the retrieval of assets like icons and logos, and disabled the ability to upload documents and display email messages, as both of those actions require access to S3.
This allowed users to continue using MerusCase, albeit in a limited fashion, while Amazon worked to bring their service back online. If you were able to log billable time yesterday despite not seeing any icons, you know this firsthand. Nice work.
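For those who like to peek behind the curtain, the emergency change amounted to a set of kill switches. Here is a simplified sketch of the pattern; the flag names and functions are hypothetical, not actual MerusCase code.

```python
# Simplified sketch of the "degrade gracefully while S3 is down" pattern.
# Flag names and functions are hypothetical, not actual MerusCase code.

FEATURE_FLAGS = {
    "s3_assets":    False,  # icons, logos, and theme files served from S3
    "s3_documents": False,  # uploading and opening stored documents
    "s3_email":     False,  # displaying synchronized email messages
}

def feature_enabled(name):
    # Default to "on" so an unknown flag never disables a healthy feature.
    return FEATURE_FLAGS.get(name, True)

def open_document(doc_id):
    """Return a stored document, or a friendly notice while S3 is unavailable."""
    if not feature_enabled("s3_documents"):
        return {"error": "Documents are temporarily unavailable while our "
                         "storage provider recovers. Your data is safe."}
    # Normal path (omitted): fetch the object from S3 and return it.
    ...
```

The important design choice is that every S3-dependent path fails soft behind its own switch, so the rest of the application keeps working.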
A very long 5 hours later, Amazon had resolved whatever issue had caused the disruption, and we were able to restore full functionality to MerusCase by rolling back the emergency changes we had made earlier in the day.
Overall, not the best day to be using the Internet, but we are grateful that our users weathered it with grace and patience.
What We Are Going To Do
For now, despite the very real recency bias, S3 is still the best in the business. The service has enjoyed very impressive uptime since 2008, when we began using it. There have been other problems in the Amazon cloud over time, but they have always been mitigated by "best practices": spreading data out across multiple geographies and automatically (or almost automatically) "failing over" to alternate data centers in the event of natural phenomena (like a hurricane) or human error (like someone unknowingly severing a major fiber optic cable while digging a trench in Nebraska).

We have weathered these events quite well in the past, including many that took down much larger companies employing very smart people, like Netflix, Instagram, and even government agencies like the SEC, not to mention Amazon themselves. Those were major outages that had no effect on MerusCase, other than making us scramble a bit behind the scenes. ("No effect", in this case, means you all could continue to get your work done uninterrupted; it does not mean that the folks keeping the lights on at Merus were relaxing.)
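To make "spreading data out across multiple geographies" a bit more concrete, S3 offers cross-region replication, which automatically copies new objects from a bucket in one region to a bucket in another. Here is a rough sketch of turning it on; the bucket names and IAM role are hypothetical, and it illustrates the technique rather than describing our exact setup.

```python
# Rough sketch of enabling S3 cross-region replication so that new objects
# written to a primary bucket are copied to a bucket in a second region.
# Bucket names and the IAM role ARN are hypothetical.
import boto3

s3 = boto3.client("s3")

# Replication requires versioning on both the source and destination buckets.
s3.put_bucket_versioning(
    Bucket="example-docs-primary",
    VersioningConfiguration={"Status": "Enabled"},
)

s3.put_bucket_replication(
    Bucket="example-docs-primary",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
        "Rules": [{
            "ID": "replicate-everything",
            "Prefix": "",  # an empty prefix matches every object in the bucket
            "Status": "Enabled",
            "Destination": {"Bucket": "arn:aws:s3:::example-docs-replica"},
        }],
    },
)
```

Replication gets the data into two places; pointing the application at the second place when the first one disappears is the part that has to be practiced.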
This was the first time a failure in the Amazon cloud infrastructure played out on a global scale. We count on the service to be available somewhere in the world for real-time access to both your documents and ours. While our commitment to the Amazon infrastructure has so far been without hesitation, we have always held the door open for exploring other available options. You count on us to manage your technology infrastructure and provide the best, most reliable tools available, and we take that duty quite seriously.
Thanks for being part of the MerusCase family.