May 26, 2014 by Vadim Katcherovski in Project Management 101 & Tools

Service interruption status update

Last week we had two service interruptions within 24 hours that affected some of our clients.
I’d like to explain what happened and how we plan to prevent such issues from happening again.

Quick Summary

There were two service interruptions: one caused by an internal server failure (affecting about 15% of North American clients) and one caused by a power outage (affecting our European and Asia-Pacific clients).

No data was lost.

Immediate solution: add redundancy to our hosting architecture to avoid similar interruptions.
Full-fledged solution (ETA 3 months): migrate the entire hosting infrastructure to Microsoft Azure Cloud (99.9% – 99.95% SLA).

What Happened – Detailed Report

The first issue started manifesting itself around 3:45pm ET. At first, a number of accounts experienced decreased performance, which then deteriorated into a complete inability to log in or perform any action with the account.

When the first problem report came in, our tech support team started investigating the root of the issue and had to escalate it to the development/engineering team.

After some debugging, we identified that one of our SQL database servers had consumed all available physical memory and was in a frozen state. Our fault tolerance system automatically switches accounts to other servers when any server goes down; in this case, however, the server was still operational, so the switchover process was not triggered. Once we identified and isolated the issue, the faulty server was rebooted, and by 4:53pm ET the system was up and running again.
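To illustrate the gap described above, here is a minimal sketch of a health check that treats a frozen but reachable server as failed. It is not our actual fault tolerance code: the `ProbeResult` fields, the `should_fail_over` function, and both thresholds are hypothetical names and values chosen only for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    """One health probe of a database server (hypothetical structure)."""
    reachable: bool                 # did the server accept a connection?
    query_seconds: Optional[float]  # time for a trivial query; None if it never returned
    free_memory_fraction: float     # free physical memory, from 0.0 to 1.0


def should_fail_over(probe: ProbeResult,
                     max_query_seconds: float = 5.0,
                     min_free_memory: float = 0.05) -> bool:
    """Return True when accounts should be switched to another server.

    The thresholds are illustrative assumptions, not tuned values.
    """
    # Classic case: the server is down and does not answer at all.
    if not probe.reachable:
        return True
    # Frozen case: the server accepts connections but cannot do useful work.
    if probe.query_seconds is None or probe.query_seconds > max_query_seconds:
        return True
    # Early warning: physical memory is nearly exhausted.
    return probe.free_memory_fraction < min_free_memory


# A server that is still "up" but has run out of memory and stopped answering.
frozen = ProbeResult(reachable=True, query_seconds=None, free_memory_fraction=0.0)
print(should_fail_over(frozen))  # True
```

A check that only asks whether the server is reachable would have returned False for the `frozen` probe, which matches the situation in which the switchover was not triggered.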

Later that day, around 10:55pm ET, our data center provider was forced to perform emergency electrical maintenance to correct power phase fluctuations that occurred during local evening storms and neighborhood power outages.

During similar storms and power outages in the past, our data center's backup power generators were used, so no Easy Projects servers experienced any interruptions. We have yet to receive a detailed report from our data center provider explaining what happened this time and why the backup diesel generators were not used. As a result, all of our systems were inaccessible until 3:30am ET.

What Happens Next

We take full responsibility for these failures, and we’re terribly sorry for the inconvenience and disruption these outages caused you and your clients.
These events have prompted us to accelerate our plans to migrate Easy Projects accounts from our dedicated datacenter to Azure – the cloud services managed and operated by Microsoft.

Azure offers one of the best SLAs in the industry: 99.9% availability.
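To put an availability percentage into perspective, the short calculation below converts it into the downtime it would still permit. This is plain arithmetic under an assumed 30-day month; the function name is hypothetical, and how availability is actually measured depends on the terms of the provider's SLA.

```python
def allowed_downtime_minutes(availability_percent: float, days: int = 30) -> float:
    """Downtime permitted over `days` days at the given availability level."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - availability_percent) / 100.0


for sla in (99.9, 99.95):
    print(f"{sla}% availability: {allowed_downtime_minutes(sla):.1f} minutes per 30 days")
# 99.9% availability: 43.2 minutes per 30 days
# 99.95% availability: 21.6 minutes per 30 days
```

For comparison, the second interruption described above lasted roughly four and a half hours (from about 10:55pm to 3:30am ET).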

With our accelerated migration plan in place, we expect to migrate all of our hosted accounts there by the end of August. Meanwhile, our team will keep working on adding redundancy to all elements of our systems to provide you with the best service possible.

As always, we thank all of you for your continued support of Easy Projects. Please feel free to reach out to us if you have any other questions regarding this outage, or Easy Projects in general.

Vadim Katcherovski
CEO

Vadim Katcherovski is the CEO of Logic Software Inc. based in Toronto, Canada. He has over 20 years of experience in the IT industry and has managed dozens of software development projects.
