Tag Archives: datacenter

The Amazon Outage…And the Lessons Learned

At 1:49am EDT yesterday, Amazon’s cloud service (known as AWS) suffered a major and ongoing outage that took down many small and large web services around the world. NPR has a great, easy-to-understand explanation of what happened:

Major websites including Foursquare and Reddit crashed or suffered slowdowns Thursday after technical problems rattled Amazon.com’s widely used Web servers, frustrating millions of people who couldn’t access their favorite sites.

Read the rest of their article here: http://n.pr/fAMEoG. At the time of this writing, thousands of sites are still affected, as Amazon still has not resolved their issues. Interested clients can visit the Amazon status page directly to see a current status report of the Amazon issues.

While the Amazon outage did not affect any of our hosted websites, nor any of our dedicated or cloud server clients, it did take down our shared mail systems. As we employ third-party monitoring on all of our systems, our team was immediately alerted to the Amazon issues, even before Amazon itself acknowledged there was a problem. We immediately began working with Amazon to find the issues, and quickly discovered that this was not an issue isolated to our mail servers, but instead to wide swaths of the Internet. At that point our team had no choice but to wait on Amazon to provide the expected resolution.

When it became apparent that the problems at Amazon were getting worse rather than better, our team reviewed our options and decided to rebuild our mail systems from our last backup. We keep regular nightly backups of our mail systems in a different physical location with Amazon, and thus our backups were unaffected by the Amazon outage. By yesterday afternoon, we had restored all mail services. Because we were forced to rebuild our mail systems from a day-old backup, a few clients may have lost a small number of emails. This is unfortunate, and we do apologize for this inconvenience.

Some clients may wonder if there was anything that we could have done to prevent this Amazon outage from affecting our mail systems. While the exact cause is not yet known and an after-action report from Amazon is still not available (they are still trying to fix the problem), some facts are known, and on balance, the general consensus among Internet experts is that this Amazon outage was unprecedented and was nearly impossible to plan for:

This morning, multiple availability zones failed in the us-east region. AWS broke their promises on the failure scenarios for Availability Zones. It means that AWS have a common single point of failure (assuming it wasn’t a winning-the-lottery-while-being-hit-by-a-meteor-odds coincidence). The sites that are down were correctly designing to the ‘contract’; the problem is that AWS didn’t follow their own specifications.

You can read the rest of this informative article here: http://bit.ly/hPOjkH. However, just because this event was unprecedented and a complete failure of Amazon doesn’t mean that our team didn’t learn valuable lessons. Frequently, it is the largest of failures that teaches the best of lessons.

This epic event showed our team that our basic recovery plans were sound and did allow us to recover relatively quickly from this outage. In addition, we have used this event to make adjustments to our disaster recovery procedures to ensure that the next time such an event occurs we will have significantly less downtime.

  1. Our team will begin taking more frequent backups of our systems. This will allow us to recover from a more recent backup should such a failure ever occur again.
  2. We have changed the way we build our mail systems to allow for a quicker recovery in future events. In this case, it took our team 5 hours to rebuild the mail systems; with these changes already implemented, we expect that future rebuilds will take less than an hour.
  3. We have updated our disaster recovery documentation to reflect the lessons learned during this outage to ensure that our team has the latest and best procedures to follow for future events.
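
For the technically curious, the effect of the first change above can be sketched in a few lines of Python. This is an illustration only, not our actual backup tooling: the function names (`schedule`, `loss_window`), the dates, and the backup times are all hypothetical, chosen simply to show why a shorter backup interval shrinks the window of mail that can be lost when restoring from the most recent backup.

```python
from datetime import datetime, timedelta


def schedule(start, interval, count):
    """Build a list of backup times: `count` backups, `interval` apart."""
    return [start + i * interval for i in range(count)]


def loss_window(backup_times, failure_time):
    """Return how much recent data is lost when restoring from the
    newest backup taken at or before the failure."""
    usable = [t for t in backup_times if t <= failure_time]
    if not usable:
        raise ValueError("no backup predates the failure")
    return failure_time - max(usable)


# Hypothetical timeline: backups start at 03:00 on day one,
# and the failure hits at 01:49 on day two.
first_backup = datetime(2000, 1, 1, 3, 0)
failure = datetime(2000, 1, 2, 1, 49)

nightly = schedule(first_backup, timedelta(hours=24), 2)
four_hourly = schedule(first_backup, timedelta(hours=4), 12)

print("nightly backups lose up to:", loss_window(nightly, failure))
print("4-hourly backups lose up to:", loss_window(four_hourly, failure))
```

In this made-up timeline, the nightly schedule leaves almost a full day of mail unprotected, while the four-hourly schedule leaves under three hours. The same reasoning applies to any interval: the worst-case loss is bounded by the time between backups.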

While our team is very disappointed that our clients had to suffer through another outage, we are pleased that we were able to recover services relatively quickly. As we write this, over 16 hours since we recovered our services, major sites like reddit.com are still not fully recovered. As always, clients can watch our Trust Site for details on current uptime, and subscribe to our Twitter feed for regular updates on all problems.


Amazon Issues

Like thousands of companies around the world, we host many of our cloud services with Amazon. Currently, Amazon is experiencing crippling issues in their US-East datacenters, which are affecting thousands of websites and services around the globe. While all of our client websites remain up and unaffected, our client email systems are down because of Amazon’s issues. Amazon is aware of the problem, and is working to resolve it.


Amazon Still Down…And Counting…

For 12 hours now, Amazon’s cloud service (known as AWS or EC2) has been down, taking with it thousands of web services, including some of the most popular in the world. This outage has affected our services as well, resulting in the outage of our client email systems. Unfortunately, the Nerds On Site team has absolutely no control over this, and is in the same boat as thousands of other web teams around the world.

12 hours into this outage, Amazon is not able to provide an estimated time for a fix, saying:

“A number of people have asked us for an ETA on when we’ll be fully recovered. We deeply understand why this is important and promise to share this information as soon as we have an estimate that we believe is close to accurate. Our high-level ballpark right now is that the ETA is a few hours. We can assure you that all-hands are on deck to recover as quickly as possible. We will update the community as we have more information.”

Our team continues to monitor the situation, and we’ll keep you updated via this blog and our Twitter account: http://twitter.com/nerdshosting.


Downtime in 2010

2010 saw plenty of high-profile outages for social media sites like Twitter, Facebook and free hosting services, as well as at least four major outages for e-commerce services. There were also incidents that knocked government services offline for days.

Read a full accounting of 2010’s major outages: http://www.datacenterknowledge.com/archives/2010/12/22/2010-the-year-in-downtime/


Allstream cable cut in Toronto

Yesterday morning, at approximately 9am, workers for the City of Toronto accidentally cut a main Internet cable for Allstream. Crews are working to temporarily fix this cable, and in the meantime our systems, which are directly affected by this accident, have been switched to our redundant upstream providers.

While all our systems continue to work with our backup upstream providers, there are momentary (millisecond) outages caused by our BGP routers switching between providers to find the best connection while we wait for this main cable to be repaired. Our team continues to monitor the situation, and will provide updates if we lose any type of connectivity.


Data Center Outage Takes Down 24 State Agencies

A few days ago, a data center in Chesterfield, Virginia went down, taking 24 Virginia state agencies with it. Last fall, the state of Virginia made similar headlines when rolling IT outages brought to light the state’s failure to include network redundancies as part of its new 10-year IT outsourcing deal with Northrop Grumman.

Redundancy is an important aspect of any online presence, and Nerds On Site can help your business achieve full redundancy across all of its online operations. As a first step, every one of our hosting clients is protected by our geographically redundant DNS system.

Contact our team today to learn how we can ensure such a problem doesn’t happen to your business.

See additional coverage in the Roanoke Times, Virginia Business and Washington Post.


Hosted Exchange Services

Most small and mid-sized businesses are using email as a primary communication channel with customers, colleagues and suppliers. But many of these companies stop there, missing out on productivity-boosting features like shared calendars, contact information and files. By upgrading to the world’s most popular business messaging software, Microsoft Exchange Server 2007, you can significantly raise your team’s efficiency for a small monthly fee. Basically, Exchange is a computer server that stores your company’s email, calendars, address books and files centrally, so they are available 24×7 and can be shared among your team, if you wish. It is the messaging system of choice for most Fortune 500 corporations.

Nerds On Site offers hosted Exchange for our clients, which takes your email to the next level. Hosted Microsoft Exchange means that we provide all hardware and software, then run and maintain it for you in a tier-1 datacenter. You pay a low monthly fee, and can access your email using Outlook or Entourage on your desktop, Outlook Web Access (OWA) in any Web browser, or wirelessly from a device like the BlackBerry or Treo.

Our service operates out of a tier-1 datacenter, with state-of-the-art cooling, power supply, backup generator and fire suppression. For you, this means that your email is always running. Your Exchange service will run on clustered servers and an EMC storage area network, so if one server or disk drive stops working, your email keeps working. As well as offering the highest-quality IT infrastructure, we do nightly backups to protect your data.

We offer a complete range of add-on features. For instance, we offer hosted BlackBerry and GoodLink (for Treos and other devices), so you can instantly activate email service for your handheld. Rather than spending thousands of dollars on your own server, you can use our managed service for just a few dollars a month. You can also add archiving, from simple message journaling, which keeps an exact copy of all your messages, to compliant archiving, which fulfils regulatory requirements from the SEC, NASD, HIPAA and so on. Also available is fax via email, which gives you a local fax number and emails your faxes to you as PDFs.
