Tag Archives: accountability

A New Level of Transparency and Trust

Trust Site Screenshot


Nerds On Site has long maintained a Trust Site, powered by Pingdom, that provides real-time availability statistics for our shared hosting platform. We have now upgraded the Trust Site to include an email alert system. Once you subscribe, you will receive alerts about the status of our hosting platform as soon as a problem is detected. If our third-party monitoring systems (wormly.com & pingdom.com) detect an issue, they will alert you directly, which ensures that our team cannot doctor the alerts or hide problems! Once the initial alert has gone out, our team will keep you informed with updates on our progress in resolving the problem.

All of these alerts are also copied to our two Twitter feeds: @nerdshosting & @noslaerts. Follow us today!


Availability Reports for January, 2012

During January 2012, our shared hosting systems averaged 99.76% availability. This was a disappointing result after what we achieved for our clients in December, but it has only redoubled our efforts to do even better for you! Below is a breakdown of the statistics.

Check name            Uptime   Downtime    Outages  Response time
Linux Shared Hosting  99.64%   2h 40m 38s  6        1607 ms
SMTP (Inbound)        100.00%  0h 00m 00s  0        796 ms
SMTP (Outbound)       99.70%   2h 10m 06s  10       257 ms
IMAP Services         99.70%   2h 10m 02s  9        313 ms
POP Services          99.73%   1h 58m 00s  9        321 ms

Hosting Availability Report: December, 2011

Nerds On Site believes in openness and transparency, and this is especially important for our hosting availability statistics. We have long retained an outside company to monitor our services and to provide statistics directly to our clients and the public, so that these statistics are honest, fair and open. Pingdom is a widely trusted monitoring provider, and they watch all our systems closely, providing publicly available statistics at trust.nerdsisp.com.

December 2011 was a particularly good month for our hosting services, in large part because of our new relationship with SpamExperts that started in November. During November, our team integrated the SpamExperts filtering service into our mail systems, giving us a huge leap forward in spam filtering and mail reliability. This new reliability is reflected in our numbers for December, and we fully expect to see even better numbers in January.

Availability Statistics for December, 2011

IMAP: 99.99%
POP: 99.97%
SMTP (inbound): 100%
SMTP (outbound): 99.95%
Shared Web Hosting: 99.98%


Pingdom September Report

Overview: Average of all checks

Uptime  Outages  Response time
99.92%  17       513 ms

Checks with downtime

Check name            Uptime  Downtime    Outages  Response time
Linux Shared Hosting  99.86%  1h 00m 00s  4        1032 ms
SMTP Services         99.93%  0h 29m 58s  5        272 ms
IMAP Services         99.95%  0h 20m 01s  4        373 ms
POP Services          99.95%  0h 19m 59s  4        373 ms
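
For readers who want to check how the downtime column relates to the uptime column, here is a minimal Python sketch. It assumes the report covers the full 30 days of September and that uptime is simply the share of the month not spent in downtime. Pingdom's exact calculation is not described in the report, so the function names and the method itself are illustrative assumptions, not Pingdom's implementation.

```python
import re

# Assumption: the report covers the whole calendar month (30 days).
SECONDS_IN_SEPTEMBER = 30 * 24 * 3600


def parse_downtime(text: str) -> int:
    """Convert a string such as '1h 00m 00s' into a number of seconds."""
    match = re.fullmatch(r"(\d+)h (\d+)m (\d+)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognised downtime format: {text!r}")
    hours, minutes, seconds = (int(part) for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def uptime_percent(downtime_seconds: int, period_seconds: int) -> float:
    """Share of the period, in percent, that was not spent in downtime."""
    return 100.0 * (1 - downtime_seconds / period_seconds)


# Downtime values copied from the September table above.
checks = {
    "Linux Shared Hosting": "1h 00m 00s",
    "SMTP Services": "0h 29m 58s",
    "IMAP Services": "0h 20m 01s",
    "POP Services": "0h 19m 59s",
}

for name, downtime in checks.items():
    pct = uptime_percent(parse_downtime(downtime), SECONDS_IN_SEPTEMBER)
    print(f"{name}: {pct:.2f}%")
```

Under these assumptions the results, rounded to two decimals, agree with the uptime figures in the September table. Other months may differ by a hundredth of a percent if the monitored period or the rounding differs.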

Mail Issues Recap

Earlier today, our client mail services suffered an extreme slowdown that caused many clients to experience timeouts, effectively blocking their email access. Our entire team is very sorry that this happened, and we understand that you deserve a full explanation of what went wrong. Unfortunately, we are still waiting on Amazon for that full explanation.

Here is the short explanation: Like most computers and servers, our mail system depends on a storage device (like a hard drive), and this is provided to us by Amazon Web Services. Early this morning, the storage device Amazon provides us with started experiencing extreme slowdowns. This caused our mail server to slow down to the point of being unusable at certain times. Our team closely monitors all our online services, so we were notified of this issue immediately and had the service restored within 20 minutes. Once the service was restored, it took nearly an hour for our mail system to fully catch up on the mail backlog this outage generated.

While Amazon has confirmed the issue on their side, we are still awaiting their engineers' explanation of why this happened. We will update our blog with this explanation when and if we receive it from Amazon. As always when an issue arises, our team will learn from this event and find new ways to avoid such troubles.

Thank you for your patience during today’s outage, and thanks for trusting our team to work toward giving you the best experience possible!


Availability Statistics for June, 2011

The report from Pingdom.com is in, and our shared hosting server averaged 99.97% availability for the month of June. We always strive for 100% uptime, and even when we don’t get it, it’s important to publish our stats.

Check name            Uptime  Downtime    Outages  Response time
IMAP Services         99.97%  0h 15m 00s  2        370 ms
Linux Shared Hosting  99.97%  0h 15m 00s  1        651 ms
POP Services          99.97%  0h 14m 59s  2        380 ms
SMTP Services         99.98%  0h 10m 00s  2        264 ms

 


Monitoring Trust

Our team has completed the move of our Trust Site, which is now hosted wholly on the Pingdom servers. As a result, our status site will stay up and available for our clients to monitor even if our own hosting system fails completely. Visit it today: http://trust.nerdsisp.com. You can also follow us on Twitter to get all the latest news 24×7.


Re-Tooling the Trust Site

Our team is re-configuring our Trust Site to ensure that absolutely no downtime in the Nerds On Site systems can affect it. Our Trust Site will be down for about the next 24 hours while the new DNS settings propagate.

Once the new DNS settings propagate, our Trust Site will not be hosted on any Nerds On Site systems, but instead it will be wholly hosted on the Pingdom monitoring system.


The Amazon Outage…And the Lessons Learned

At 1:49am EDT yesterday, Amazon’s cloud platform (known as AWS) suffered a major and ongoing outage that took down many small and large web services around the world. NPR has a great, easily understood explanation of what happened:

Major websites including Foursquare and Reddit crashed or suffered slowdowns Thursday after technical problems rattled Amazon.com’s widely used Web servers, frustrating millions of people who couldn’t access their favorite sites.

Read the rest of their article here: http://n.pr/fAMEoG. At the time of this writing, thousands of sites are still affected, as Amazon still has not resolved their issues. Interested clients can visit the Amazon status page directly to see a current status report of the Amazon issues.

While the Amazon outage did not affect any of our hosted websites, nor any of our dedicated or cloud server clients, it did take down our shared mail systems. As we employ third-party monitoring on all of our systems, our team was immediately alerted to the Amazon issues, even before Amazon itself acknowledged there was a problem. We immediately began working with Amazon to find the issues, and quickly discovered that this was not an issue isolated to our mail servers, but instead to wide swaths of the Internet. At that point our team had no choice but to wait on Amazon to provide the expected resolution.

When it became apparent that the problems at Amazon were getting worse, not better, our team began reviewing our options and decided to rebuild our mail systems from our last backup. We keep regular nightly backups of our mail systems in a different physical location with Amazon, and thus our backups were unaffected by the Amazon outage. By yesterday afternoon, we had restored all mail services. Because we were forced to rebuild our mail systems from a day-old backup, a few clients may have lost a few emails. This is unfortunate, and we do apologize for this inconvenience.

Some clients may wonder if there was anything that we could have done to prevent this Amazon outage from affecting our mail systems. While the exact cause and after-action report from Amazon are still not available (they are still trying to fix the problem), some facts are known, and on balance, the general consensus among Internet experts is that this Amazon outage was unprecedented and was nearly impossible to plan for:

This morning, multiple availability zones failed in the us-east region. AWS broke their promises on the failure scenarios for Availability Zones. It means that AWS have a common single point of failure (assuming it wasn’t a winning-the-lottery-while-being-hit-by-a-meteor-odds coincidence). The sites that are down were correctly designing to the ‘contract’; the problem is that AWS didn’t follow their own specifications.

You can read the rest of this informative article here: http://bit.ly/hPOjkH. However, just because this event was unprecedented and a complete failure on Amazon’s part doesn’t mean that our team didn’t learn valuable lessons. Frequently, it is the largest failures that teach the best lessons.

This epic event showed our team that our basic recovery plans were sound and did allow us to recover relatively quickly from this outage. In addition, we have used this event to make adjustments to our disaster recovery procedures to ensure that the next time such an event occurs we will have significantly less downtime.

  1. Our team will begin taking more frequent backups of our systems. This will allow us to recover from a more recent backup should such a failure ever occur again.
  2. We have changed the way we build our mail systems to allow for a quicker recovery in future events. In this case, it took our team 5 hours to rebuild the mail systems; with these changes already implemented, we expect that future rebuilds will take less than an hour.
  3. We have updated our disaster recovery documentation to reflect the lessons learned during this outage to ensure that our team has the latest and best procedures to follow for future events.
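
To illustrate why more frequent backups reduce the amount of mail that can be lost, here is a minimal Python sketch of the arithmetic. The 1:49am failure time comes from the account above. The 02:00 backup time, the calendar dates, the four-hour interval and the function names are all hypothetical, chosen only for illustration, and do not describe our actual backup schedule.

```python
from datetime import datetime, timedelta


def last_backup_before(failure: datetime, first_backup: datetime,
                       interval: timedelta) -> datetime:
    """Return the most recent scheduled backup at or before the failure."""
    if failure < first_backup:
        raise ValueError("no backup exists before the failure")
    completed = (failure - first_backup) // interval  # whole intervals elapsed
    return first_backup + completed * interval


def data_loss_window(failure: datetime, first_backup: datetime,
                     interval: timedelta) -> timedelta:
    """Time between the last backup and the failure: mail at risk of loss."""
    return failure - last_backup_before(failure, first_backup, interval)


# Hypothetical schedule: backups start at 02:00 and repeat at a fixed interval.
first_backup = datetime(2011, 1, 1, 2, 0)
# Failure at 1:49am the next day (time of day taken from the post above).
failure = datetime(2011, 1, 2, 1, 49)

nightly = data_loss_window(failure, first_backup, timedelta(hours=24))
four_hourly = data_loss_window(failure, first_backup, timedelta(hours=4))

print(f"Nightly backups:     up to {nightly} of mail at risk")
print(f"Four-hourly backups: up to {four_hourly} of mail at risk")
```

With this made-up schedule, a failure just before the nightly backup puts almost a full day of mail at risk, while a four-hour interval caps the exposure at under four hours. The worst case is always bounded by the backup interval.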

While our team is very disappointed that our clients had to suffer through another outage, we are pleased that we were able to recover services relatively quickly. As we write this, over 16 hours after we recovered our services, major sites like reddit.com are still not fully recovered. As always, clients can watch our Trust Site for details on current uptime, and subscribe to our Twitter feed for regular updates on any problems.


Amazon Issues

Like thousands of companies around the world, we host many of our cloud services with Amazon. Currently, Amazon is experiencing crippling issues in their US-East datacenters, which are affecting thousands of websites and services around the globe. While all of our client websites remain up and unaffected, our client email systems are down because of Amazon’s issues. Amazon is aware of the problem and is working to resolve it.
