Feedback on the interruption of service on Thursday 11/28

Written by Jerome Granados on Friday 29 November 2019

GoodBarber and WMaker services were disrupted on the 28th of November by an electrical incident. The incident affected the main power supply of a rack located in one of the datacenters that host part of our technical infrastructure.
The service was temporarily interrupted, then partially degraded while it was being restarted, but no data was lost.
The services have now been running normally for more than 24 hours. It is time for explanations, assessments and lessons learned, which we share with you in this note.

Yesterday morning, around 8:30 am Paris time, a power failure occurred on a rack hosting about twenty of our compute servers. The failure affected power supply equipment provided by our host, OVH, as part of one of our hosting contracts in a Paris datacenter.

We asked the OVH technician to intervene and restore power to the rack, which allowed us to bring all the impacted services (25% of our installation) back up within the hour. At 9:30 am, the services were working normally again.

At first, OVH thought that the incident was due to a problem with one of their UPS units that occurred at almost the same time.

The failure caused a service interruption, but no data was lost: we keep several persistent copies of the data in different places. The service should not, however, have stopped completely. It stopped because our session management service did not switch over properly to a machine in another rack. Had the switchover worked correctly, we would have avoided the downtime.
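To illustrate the kind of switchover that was expected here, the following is a minimal sketch, not our actual implementation. It assumes a priority-ordered list of session endpoints spread across racks and a health probe; the names `Endpoint`, `pick_session_endpoint` and `is_healthy` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Endpoint:
    """A hypothetical session-service instance and the rack it runs in."""
    name: str
    rack: str


def pick_session_endpoint(
    endpoints: Sequence[Endpoint],
    is_healthy: Callable[[Endpoint], bool],
) -> Optional[Endpoint]:
    """Return the first healthy endpoint in priority order, or None.

    `endpoints` is ordered from preferred to least preferred. If the
    preferred endpoint's rack loses power, the probe fails for it and
    the next endpoint (ideally in another rack) is selected instead.
    """
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None


# Example: the primary sits in the rack that lost power.
primary = Endpoint(name="sessions-1", rack="rack-A")
standby = Endpoint(name="sessions-2", rack="rack-B")


def rack_a_is_down(endpoint: Endpoint) -> bool:
    # Stand-in for a real health probe: everything in rack-A is unreachable.
    return endpoint.rack != "rack-A"


chosen = pick_session_endpoint([primary, standby], rack_a_is_down)
```

In this sketch the standby in the other rack is selected as soon as the probe reports the primary as unreachable. A real setup would also need to handle probe timeouts and replication of session state between the two machines.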

At 11:30 am, while we were moving some services to another rack, a second power problem occurred on the first rack. This second problem caused another outage, lasting 30 minutes. The OVH technician intervened once again, and by 12:00 pm all services were restored.

As a precaution, two members of the team went to the datacenter that houses the rack affected by the electrical problem. They spent 6 hours on site analyzing the status of all our equipment. All the equipment concerned is less than one year old. We use only HP, Cisco and APC equipment, which has proven its reliability.
As we did not find any anomalies on our equipment, we agreed with our host that they would preventively replace the electrical equipment of theirs that supplies our rack.
A joint intervention with our host took place between 7 pm and 9 pm to replace this equipment. This work may have caused very brief disruptions, but it did not interrupt the service.

Our team continues to monitor the service closely, and no instability has been detected for 24 hours.

This type of failure is one of the most complicated scenarios to manage. Our objective is for our architecture to tolerate this type of incident without service interruption. We will now reassess our systems so that the service stays up even if a power failure hits 50% of the architecture.
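As an illustration of what such a reassessment can look like, here is a minimal sketch that lists the services that would go down if a given rack lost power. The placement map, the rack names and the function `services_lost_if_rack_fails` are assumptions made up for this example; they do not describe our real inventory.

```python
from typing import Dict, List, Set


def services_lost_if_rack_fails(
    placement: Dict[str, Set[str]], failed_rack: str
) -> List[str]:
    """Return the services whose every replica lives in `failed_rack`.

    `placement` maps a service name to the set of racks hosting a
    replica of it. A service survives only if at least one replica
    sits in a rack other than the failed one.
    """
    return sorted(
        service
        for service, racks in placement.items()
        if racks <= {failed_rack}
    )


# Hypothetical placement: "sessions" has no replica outside rack-1.
placement = {
    "sessions": {"rack-1"},
    "web": {"rack-1", "rack-2"},
    "storage": {"rack-2", "rack-3"},
}

at_risk = services_lost_if_rack_fails(placement, "rack-1")
```

Running this check for every rack in turn gives a quick list of single points of failure: any service that appears in a result needs a replica in another rack.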