OVH Experiences Major Downtime
OVH's valuation surpassed $1 billion last year, and it appears the company is not stopping there. In under 20 years, founder and CEO Octave Klaba has built one of Europe's best-known and most respected cloud hosting companies. Mr Klaba is passionate about technology, and he uses that passion and expertise to keep growing OVH and to keep providing performance hosting solutions to customers globally. One thing is certain about Mr Klaba: he has intense drive and business acumen. Anyone who starts a hosting business at the age of 24 and turns it into a global behemoth can only be applauded.
OVH opened its first data centre in 2001, produced its first server in 2002, and had opened its first European subsidiaries in Spain and Poland by 2004. It has continued to grow at a record rate ever since. But is that growth affecting other areas, or causing OVH to miss some important aspects of cloud hosting?
Yesterday OVH experienced a major outage that affected thousands of servers, and many more websites, across Europe and beyond. The data centres in Roubaix and Strasbourg were hit the hardest, the latter especially. Others may have been affected too, but OneHost Cloud has no servers in them, so we cannot tell; as this post is about our own experience, we will stick to those two data centres.
Here is a brief summary of the issues as compiled by CEO Octave Klaba:
The SBG site is powered by a 20kV power line consisting of 2 cables each delivering 10MVA. The 2 cables work together, and are connected to the same source and on the same circuit breaker at ELD (Strasbourg Electricity Networks). This morning, one of the two cables was damaged and the circuit breaker cut power off to the datacenter.
The SBG site is designed to operate, without a time limit, on generators. For SBG1 and SBG4, we have set up a first backup system of 2 generators of 2MVA each, configured in N+1 and 20kV. For SBG2, we have set up 3 groups in N+1 configuration, 1.4MVA each. In the event of an external power failure, the high-voltage cells are automatically reconfigured by a motorized failover system. In less than 30 seconds, the SBG1, SBG2 and SBG4 datacenters can have power restored with 20kV. To make this switch-over without cutting power to the servers, we have Uninterruptible Power Supplies (UPS) in place that can maintain power for up to 8 minutes.
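As a rough check on the figures quoted above, here is a minimal Python sketch of the design margins. It assumes the usual reading of N+1 (one unit is a spare, so the usable capacity is that of the remaining units); the function names are ours, not OVH's, and the actual load at SBG is not given in the summary, so none is assumed.

```python
# Figures taken from the summary above.
UPS_RUNTIME_S = 8 * 60   # UPS can hold the load for up to 8 minutes
FAILOVER_S = 30          # generators should take over in under 30 seconds


def ups_margin_seconds(ups_runtime_s: int, failover_s: int) -> int:
    """Seconds of UPS runtime left over once the generators have taken over."""
    return ups_runtime_s - failover_s


def n_plus_one_capacity_mva(unit_mva: float, units: int) -> float:
    """Usable capacity in MVA if one unit is held as the spare (assumed N+1 meaning)."""
    return unit_mva * (units - 1)


if __name__ == "__main__":
    print("UPS margin (s):", ups_margin_seconds(UPS_RUNTIME_S, FAILOVER_S))
    print("SBG1/SBG4 usable MVA:", n_plus_one_capacity_mva(2.0, 2))
    print("SBG2 usable MVA:", n_plus_one_capacity_mva(1.4, 3))
```

On paper the UPS covers the switch-over many times over. That margin only helps if the command to start the generators is actually issued.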
This morning, the motorized failover system did not work as expected. The command to start the backup generators was not given by the NSM. It is an NSM (Normal-emergency motorised), provided by the supplier of the 20kV high-voltage cells. We are in contact with the manufacturer/supplier to understand the origin of this issue. However, this is a defect that should have been detected during periodic fault simulation tests on the external source. SBG’s latest test for backup recovery was at the end of May 2017. During this last test, we powered SBG only from the generators for 8 hours without any issues, and every month we test the backup generators with no load. And despite everything, this system was not enough to avoid today’s outage.
The Roubaix site is connected via 6 fibre optic cables to these 6 POPs: 2x RBX<>BRU, 2x RBX<>LDN, 2x RBX<>Paris (1x RBX<>TH2 and 1x RBX<>GSW). These 6 fibre optic cables are connected to a system of optical nodes which means each fibre optic cable can carry 80 x 100 Gbps.
For each 100 G connected to the routers, we use two optical paths which are in distinct geographic locations. If any fibre optic link is cut, the system reconfigures in 50ms and all the links stay UP.
To connect RBX to our POPs, we have 4.4Tbps capacity, 44x100G: 12x 100G to Paris, 8x100G to London, 2x100G to Brussels, 8x100G to Amsterdam, 10x100G to Frankfurt, 2x100G to the GRA DC and 2x100G to SBG DC.
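The capacity figures above are easy to sanity-check. This small Python sketch simply adds up the 100G links listed in the summary; the dictionary and function names are ours, chosen for illustration.

```python
# Number of 100G links from RBX to each destination, as listed in the summary.
LINKS_100G = {
    "Paris": 12,
    "London": 8,
    "Brussels": 2,
    "Amsterdam": 8,
    "Frankfurt": 10,
    "GRA": 2,
    "SBG": 2,
}

LINK_GBPS = 100
# Each fibre can carry 80 x 100 Gbps according to the summary.
FIBRE_MAX_GBPS = 80 * LINK_GBPS


def total_links(links: dict) -> int:
    """Total number of 100G links across all destinations."""
    return sum(links.values())


def total_gbps(links: dict) -> int:
    """Aggregate capacity in Gbps."""
    return total_links(links) * LINK_GBPS


if __name__ == "__main__":
    print("links:", total_links(LINKS_100G))
    print("capacity (Tbps):", total_gbps(LINKS_100G) / 1000)
    print("per-fibre ceiling (Gbps):", FIBRE_MAX_GBPS)
```

The per-destination counts do add up to the 44 links and 4.4Tbps quoted.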
At 8:01, all the 100G links, 44x 100G, were lost in one go. Given that we have a redundancy system in place, the root of the problem could not be the physical shutdown of 6 optical fibres simultaneously. We could not do a remote diagnostic of the chassis because the management interfaces were not working. We had to intervene directly in the routing rooms themselves, to sort out the chassis: disconnect the cables between the chassis and restart the system and finally do the diagnostics with the equipment manufacturer. Attempts to reboot the system took a long time because each chassis needs 10 to 12 minutes to boot. This is the main reason that the incident lasted such a long time.
Diagnostic: all the interface cards that we use, ncs2k-400g-lk9, ncs2k-200g-cklc, went into “standby” mode. This could have been due to a loss of configuration. We therefore recovered the backup and reset the configuration, which allowed the system to reconfigure all the interface cards. The 100Gs in the routers came back naturally and the RBX connection to the 6 POPs was restored at 10:34.
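To put the timings in perspective, here is a minimal Python sketch that works out the length of the RBX outage from the two times quoted (8:01 and 10:34) and the time consumed by a given number of chassis reboots at 10 to 12 minutes each. The summary does not say how many reboot attempts were made, so the number passed in below is purely hypothetical.

```python
from datetime import datetime


def outage_minutes(start: str, end: str, fmt: str = "%H:%M") -> int:
    """Minutes between two same-day clock times such as '8:01' and '10:34'."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def reboot_window_minutes(attempts: int, low: int = 10, high: int = 12) -> tuple:
    """Best and worst case minutes spent waiting for `attempts` sequential boots."""
    return (attempts * low, attempts * high)


if __name__ == "__main__":
    print("RBX outage (min):", outage_minutes("8:01", "10:34"))
    # Hypothetical: three sequential reboot attempts on one chassis.
    print("3 reboots (min):", reboot_window_minutes(3))
```

Even a handful of sequential reboots accounts for a large share of the outage, which fits the explanation given.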
There is clearly a software bug on the optical equipment. The database with the configuration is saved 3 times and copied to 2 monitoring cards. Despite all these security measures, the database disappeared. We will work with the OEM to find the source of the problem and help fix the bug. We do not doubt the equipment manufacturer, even if this type of bug is particularly critical. Uptime is a matter of design that must consider every eventuality, including when nothing else works. OVH must make sure to be even more paranoid than it already is in every system that it designs.
Downtime is never a good thing. OneHost Cloud is a great deal smaller than OVH and just 4 years old, so downtime like this hits our business much harder than it hits a multinational like OVH. We are thankful that our customers were very understanding, but this could have been far worse. Downtime is very difficult for a smaller cloud provider to come back from, because we host customers' websites, applications and a multitude of other services. Uptime, performance and customer service are things many hosting providers offer, but rarely all at the same time.
A perfect example: during yesterday's outage, customers were left to rely on social media for answers because OVH's phone lines were engaged (understandably so). Although OVH employs hundreds of staff, information and updates came from the CEO himself. That is a perfect example of customer service, and to have such a person telling his customers what is occurring is impressive indeed.
For non-French speakers, here is the translation:
“We have a power supply problem at SBG1/SBG4. The 2 EDF electrical feeds are down (!!) and the 2 generator chains have gone into fault (!). None of the 4 electrical feeds is powering the routing room any more. We are all working on the problem.”
It appears that they had a power issue, followed by a failure of the backup generators, in one of their datacenters, and a problem with the optical network equipment in another, which is the one that affected OneHost.
From the perspective of OneHost Cloud, we think OVH's support staff should take a look at just how it is done, as the support team at OVH could do with much improvement. Then again, I can understand the volume of support requests they receive, so a deterioration in support, however unacceptable, is inevitable. We must also thank Vincent (staff at OVH), who actively took to Twitter to notify customers, and also OneHost Cloud personally; from me personally, that is appreciated. OneHost Cloud has been an OVH customer for the entire time we have been in business, and we have never had any form of downtime before, so with one outage in four years we cannot complain. OVH has provided servers that suit our needs and the needs of our customers, and we can honestly say we are happy. However, if you want to start your own hosting business, we strongly suggest you steer clear until you grow. In our experience OVH is not a managed provider, and hence support is limited to hardware and networking. That suits us fine: I have been in the IT business for 20 years, so setting up our complex VMware and OpenStack environment was not beyond me, but others may need support that they will not get from OVH. That is not a criticism of OVH. They offer great performance servers and fast networking without the added cost of support, which is how they keep their prices down, and that suits our business model.
Some criticism, though. OVH has a status page where customers, or anyone else for that matter, can see the status of the various services OVH offers. However, this was also down, which made it completely pointless. I am sure they will learn from this and have it hosted elsewhere. OneHost Cloud would love to offer to host it, but all our servers are in OVH's datacenters, so that would be a pointless exercise. Whatever downtime and service interruption occurs, it is imperative that hosting providers learn from these mistakes, so that if it happens again the redundancy is in place to keep downtime to a minimum. That said, the downtime we experienced at RBX was just over 2 hours, which in itself is quite impressive, as we know that others were down for much longer.
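The point about the status page comes down to failure domains: a status page is only useful if it does not go down with the things it reports on. Below is a minimal Python sketch of that check. The `Endpoint` type, the function name and the "external" location label are all hypothetical, invented for this illustration; this is not how OVH describes or manages its infrastructure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    name: str
    datacenter: str  # hypothetical label for where the endpoint is hosted


def status_page_is_independent(status_page: Endpoint, services: list) -> bool:
    """True if the status page shares no datacenter with any monitored service."""
    monitored = {service.datacenter for service in services}
    return status_page.datacenter not in monitored


if __name__ == "__main__":
    services = [Endpoint("web hosting", "RBX"), Endpoint("dedicated servers", "SBG")]
    # Hosted alongside the services: it fails when they do.
    print(status_page_is_independent(Endpoint("status", "RBX"), services))
    # Hosted with a third party (hypothetical): it survives the outage.
    print(status_page_is_independent(Endpoint("status", "external"), services))
```

It is the same reason OneHost Cloud could not usefully host the page: our servers sit in the same datacenters.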
The way in which OVH handled the issues was fantastic, but it can always be improved. With a CEO like Octave at the helm, we are certain that they will have learnt from this, and that is all we could ask for.