In May this year, international headlines were ablaze with news of a British Airways (BA) data center outage resulting in the cancellation of over 400 flights, leaving 75,000 passengers stranded on a busy bank holiday weekend.
The incident was allegedly traced to a single engineer who disconnected and reconnected a power supply, causing a power surge that severely damaged critical IT equipment. BA’s technology infrastructure crashed, causing hundreds of flights to be cancelled over several days while the airline worked to restore the affected equipment.
The technician, part of a team operating at the Heathrow facility, was authorized to be on the premises but not to disconnect the power supply in question. The disconnection resulted in instant loss of power to the site because it bypassed the backup batteries and generators.
When power was reconnected in an uncontrolled manner, it caused catastrophic physical damage to British Airways’ servers.
The outage is estimated to have cost up to $112 million in refunds and compensation, not to mention the loss of productivity and future revenues.
A spokesperson for CBRE Group, the US property services company that manages the site, said the incident is still under investigation. BA has two data centers, Boadicea House and Comet House, designed to work together to provide uninterrupted service. But it appears that the incident resulted in both facilities going down. It is not yet understood why the backup site also failed.
The question remains as to how a lone engineer could cause such serious disruption, and why British Airways’ business continuity processes proved ineffective in this case. BA’s chief executive, Alex Cruz, is leading an inquiry into the outage.
This major incident caused by the hand of an employee certainly isn’t an isolated event. A 2016 report by the Ponemon Institute found human error to be the second most common reason for data center downtime, accounting for 22% of all incidents. It trails only Uninterruptible Power Supply (UPS) failure, the number one cause of data center downtime. In comparison, water, heat or air conditioning failures caused 11% of outages, 10% were weather-related and only 4% were attributable to IT equipment. Cybercrime was the fastest-growing cause of outages, rising from 2% in 2010 to a concerning 22% in 2016.
According to the Ponemon report, the average cost per outage was $740,357 (up from $690,204 in 2013), with business disruption, missed revenue and reduced productivity driving the financial losses. Interestingly, the proportion of failures attributed to human error remains unchanged since 2013, suggesting little progress has been made to improve what should be an avoidable cause of disruption.
Another study by the Uptime Institute suggests the problem may be even worse. Uptime’s analysis of data center outages found that over 70% were directly attributable to human error, with inadequate staff training identified as one of the biggest oversights in data center operations.
A separate study by Enlogic, a company that provides data center energy monitoring products, also found human error to be the most common cause of data center downtime. Simple mistakes such as entering a temperature setpoint in Fahrenheit instead of Celsius, or removing the wrong power cord, were among the greatest threats.
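Mistakes like the Fahrenheit/Celsius mix-up are exactly the kind of error a simple range check can catch before it takes effect. As a hypothetical sketch (the function name and thresholds here are illustrative, not from any vendor’s product), a cooling controller could reject any setpoint that falls outside a plausible Celsius envelope:

```python
# Hypothetical sanity check on a cooling setpoint. A value like 72,
# sensible in Fahrenheit, is far outside any plausible Celsius setpoint
# for a data hall, so the controller can flag the likely unit mix-up.

COLD_AISLE_RANGE_C = (18.0, 27.0)  # ASHRAE-recommended envelope for data halls

def validate_setpoint_celsius(value: float) -> float:
    """Return the setpoint if plausible in Celsius, else raise."""
    low, high = COLD_AISLE_RANGE_C
    if not low <= value <= high:
        raise ValueError(
            f"setpoint {value} degC is outside {low}-{high} degC; "
            "was the value entered in Fahrenheit?"
        )
    return value
```

A guard like this does not prevent every mistake, but it turns a silent misconfiguration into an immediate, visible error.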
In February this year, Amazon Web Services experienced a widespread outage when an employee, who was debugging a billing system, inadvertently shut down more servers than he intended. A simple typo initiated a cascade effect, progressively bringing down more connected systems. It resulted in a period of several hours where users could not stream online content or make purchases. To reduce the potential for future incidents, Amazon has now placed restrictions on how quickly employees can shut down servers.
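Amazon has not published the exact mechanism behind those restrictions, but the class of safeguard it described can be sketched as a capacity-removal check that refuses commands taking too many servers offline at once or dropping a fleet below a safe minimum. All names and thresholds below are illustrative assumptions, not Amazon’s implementation:

```python
# Illustrative capacity-removal safeguard: reject commands that would
# remove too many servers in one step, or leave the fleet below a
# minimum safe size. Thresholds are hypothetical.

MIN_FLEET_FRACTION = 0.8  # never drop below 80% of current capacity
MAX_BATCH = 10            # never remove more than 10 servers per command

def remove_capacity(fleet_size: int, requested: int) -> int:
    """Return the number of servers approved for removal, or raise."""
    if requested <= 0:
        raise ValueError("requested removal must be positive")
    if requested > MAX_BATCH:
        raise RuntimeError(
            f"refusing to remove {requested} servers at once (max {MAX_BATCH})"
        )
    if fleet_size - requested < fleet_size * MIN_FLEET_FRACTION:
        raise RuntimeError(
            "removal would take the fleet below its safe minimum capacity"
        )
    return requested

# A fat-fingered command on the scale of the AWS incident is rejected:
# remove_capacity(100, 90) raises RuntimeError instead of executing.
```

The point of such a check is that a typo now fails loudly at the command line rather than cascading through dependent systems.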
In another incident in May this year, Cloudflare experienced a service disruption when an engineer at one of its partner carriers, Telia, mistyped a network configuration command and shut down a transatlantic fiber cable. A similar Cloudflare issue arose last year when an improperly configured Telia router sent Cloudflare’s European traffic to Hong Kong.
In October 2016, Level 3 Communications suffered a voice network outage affecting its North American customers, caused once again by a configuration error. Investigations revealed that a technician failed to specify a phone number to which a configuration change should apply, meaning the incorrect routing was applied to all country code +1 calls.
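The Level 3 failure illustrates a common trap in configuration systems: an empty match field often means “match everything.” A pre-apply validation step can catch this class of error by refusing any change whose scope is missing. The sketch below is a hypothetical example of such a check, not Level 3’s tooling:

```python
# Hypothetical pre-apply check for a routing configuration change.
# The danger: an empty match list silently applies the rule to ALL
# numbers, which is how the Level 3 outage reportedly occurred.

def validate_routing_change(change: dict) -> None:
    """Reject a routing change that does not name an explicit scope."""
    numbers = change.get("match_numbers", [])
    if not numbers:
        raise ValueError(
            "routing change has no match_numbers: an empty scope would "
            "apply to every call; specify the numbers explicitly"
        )
    for n in numbers:
        if not n.startswith("+"):
            raise ValueError(f"number {n!r} is not in E.164 format")

# A properly scoped change passes validation:
validate_routing_change({"match_numbers": ["+14155550100"]})
```

Requiring an explicit scope converts a forgotten field from a network-wide outage into a rejected change request.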
Reports of employee-related outages continue to emerge. And while most are accidental, some have been malicious. Just last month, Dutch hosting provider Verelox was forced to restore its servers after a disgruntled ex-administrator deliberately wiped all customer data. The company offered customers compensation for the outage and asked all users to reset their server passwords.
These incidents highlight the need to prioritize regular comprehensive training for data center personnel, ensure procedures are carefully documented and enforce robust processes to mitigate the risk of human error. Data center operators can minimize downtime by ensuring only highly trained professionals are managing, monitoring and maintaining the power and IT infrastructure within each facility.
If you found this of interest, you may also enjoy reading about Switch’s new Tier 5 data center standard designed to compete with the Uptime Institute.