Friday, April 29, 2011

ReplyManager and Amazon Web Services Outage Report

On Thursday morning, April 21, ReplyManager experienced an unprecedented outage of our infrastructure provider, Amazon Web Services (AWS). This outage affected us as well as companies such as Hootsuite, Wildfire, Reddit and Heroku. In short, AWS experienced problems with the network that caused some servers to freeze up, making them inaccessible to ReplyManager customers. On Friday morning, Amazon began to restore service and little by little, customers came back online. By Friday evening, all affected ReplyManager customers were functioning normally again. The good news in all of this was that there was absolutely no loss of data.

ReplyManager has always prided itself on its excellent up-time track record and has experienced few outages, certainly nothing on the scale of this event. We know how important it is for our customers to be able to manage their incoming email, and we appreciate the incredible patience and understanding they showed.

What Happened
Our technical support team put together a detailed account of the outage, including why it occurred and the steps we took to minimize potential problems. Their report is as follows:
On Thursday April 21, at around 4:00AM EST, the AWS data center that hosts part of our services started experiencing problems with the network drives we use to store our database files. The problem was caused by a configuration error that occurred during routine maintenance of the network drives performed by AWS. This triggered a cascading series of events that culminated in the affected accounts hosted in this data center becoming unavailable. To our disbelief and against all our expectations, the problem spilled over into the following day, Friday April 22. At 9:45AM on Friday, one of the affected servers came back online. By 2:45PM, all but one of the affected customers were back online. The last affected customer came back online by 9:30PM on Friday April 22.

Throughout this outage, we maintained communications with AWS technical support. At their recommendation, we did not take any corrective action on the database servers that were affected. The servers did eventually wake up and continue their work as if nothing had happened. During the outage, we had two options: 1) wait for the servers to come back up, or 2) restore information from the available backups.

By 4:00PM on the first day of the outage we were able to regain access to most of the backups that had been started at 12:05AM that same day. We staged copies of the backups in a different data center in case we needed them. The information contained in the backups was at least 4 hours old. Some of the backups were still in progress when the outage began and were not completed until the following day. This left us with some backups as old as 28 hours.

Our primary imperative has always been to keep our customers’ data safe and complete. Since AWS was working hard at solving the problem on Thursday, we made a conscious decision not to roll out the backups for the affected customers once we gained access to them, since that could have meant some data loss. On Friday we were prepared to offer that option to the remaining affected customers. However, with the restoration of the first database server on Friday morning, things started looking up, and our advice at that time was to wait for AWS to complete their work to resolve the situation.

ReplyManager has enjoyed a great track record with AWS for more than a year. However, technology can fail, and in this case some of the basic assumptions we made about failure points in our system were called into question.

In the coming weeks and months, we will be re-assessing how our infrastructure is deployed with AWS in order to improve our systems and give our customers more choices to meet their particular expectations.

You can read the full account of the incident provided by AWS at the link below:
http://aws.amazon.com/message/65648/

Communication Was Key
During this outage, the Service Notices Blog was linked directly from affected instances of ReplyManager for easy access. Our technical support team updated posts every few hours (sometimes more often) and even worked throughout the night to monitor the situation and plan for the possible rollout of backup systems. We spoke with many of you over the phone on Thursday and even made phone calls on Friday morning to those who did not already have service restored.

Again, we really appreciate the patience and level-headedness shown by our customers. You are truly a pleasure to work with and we look forward to continuing our relationship far into the future!