Mail Service Outage Round-Up
With our mail server issues now resolved, we wanted to post a quick summary of the outage and the resolution that we implemented to get all services running at optimal speeds. Once again, we apologize to our clients who use this free-tier email service for the inconvenience they experienced during the day-long outage and the nearly week-long performance slowdown that followed it.
On Wednesday evening, May 30, our team performed a relatively minor file system change to increase the overall storage available to our clients. Unfortunately, this 60-second operation ended with catastrophic results: the file system was severely corrupted, and the partition management software we were using failed to catch the problem and stop the operation. The end result was a complete failure of the file system, rendering all data inaccessible.
Our team then performed a complete file recovery operation, which took approximately 30 hours, and restored all mail server operations early Friday morning. However, once mail operations resumed, we observed extreme performance slowdowns, which our team initially attributed to a backlog of email and an unusually high connection rate from our users. These slowdowns caused frequent webmail timeouts, as well as regular POP, IMAP and SMTP timeouts.
On Wednesday, June 6, we discovered that the root cause of the performance slowdowns was actually corrupted emails on our system, with the corruption stemming from the underlying file system crash on May 30. Once this was determined, SmarterTools (our mail system software vendor) wrote a custom piece of software that found and deleted all corrupted mail items. These mail items were unrecoverable and unfixable, and thus needed to be deleted. (Many users would have seen these items as blank emails, or would have noticed ‘Loading….’ errors in their webmail.)
All corrupted emails were deleted during the evening of June 6, and since the morning of June 7 our mail systems have been running without issues. Again, we apologize to the clients affected by this, and encourage them to talk to their Primary Nerds about what email options might be available to their business.