June 22, 2012

Twitter says outage wasn't hackers or Euro 2012, but a software fault




Two-hour failure blamed on 'cascading bug' that caused a worldwide lockout, explains senior engineer – while hacking claims are dismissed
Charles Arthur
guardian.co.uk, Friday 22 June 2012
It wasn't hackers, excitement over Euro 2012, or avatars using the Gif image format that knocked Twitter over so thoroughly on Thursday that it couldn't even display its famous fail whale, which usually indicates problems at the site.


Instead, the two-hour outage was due to a "cascading bug" in part of the infrastructure powering the social network connecting its 140 million users worldwide, the company's head of engineering Mazen Rawashdeh explains in a blog post.

The Guardian has confirmed separately with trusted senior sources inside Twitter that hackers had nothing to do with the outage – though that did not prevent some claiming to have caused the crash, which ruined the social network's uptime record.

The "fail whale" – indicating that Twitter's servers couldn't keep up with demand – used to be a common sight in the network's early years, but has become progressively rarer as the service has grown larger.

Rawashdeh notes that: "For the past six months, we've enjoyed our highest marks for site reliability and stability ever: at least 99.96% and often 99.99%. In simpler terms, this means that in an average 24-hour period, twitter.com has been stable and available to everyone for roughly 23 hours, 59 minutes and 40-ish seconds. Not today [Thursday] though."
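The quoted percentages can be checked with a little arithmetic. The sketch below is illustrative only (the function name is ours, not Twitter's) and converts an availability percentage into average downtime per 24-hour period. At 99.96% that is about 35 seconds a day and at 99.99% about 9 seconds, so the "23 hours, 59 minutes and 40-ish seconds" in the quote (roughly 20 seconds of downtime) sits between the two figures.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in a 24-hour period


def daily_downtime_seconds(availability_percent: float) -> float:
    """Average seconds of downtime per day at the given availability."""
    return SECONDS_PER_DAY * (1 - availability_percent / 100)


# The two availability figures quoted in the blog post.
for pct in (99.96, 99.99):
    print(f"{pct}% -> {daily_downtime_seconds(pct):.2f} s of downtime per day")
```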

The problem was that the bug's effects spread out from their initial location to affect other parts of the system – rather as a power outage in one part of a city can lead to overloads on supply in other parts, and cause a cascade of outages which eventually shuts down the network.
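The power-grid analogy can be made concrete with a toy model. This is a minimal sketch of a generic cascading failure, assuming equal-capacity nodes that share out the load of any node that fails. It says nothing about Twitter's actual infrastructure, and every name in it is hypothetical.

```python
def simulate_cascade(loads, capacity, first_failure):
    """Toy cascade: return the indices of failed nodes in the order they fail.

    When a node fails, its load is split evenly among the surviving nodes.
    Any survivor pushed above `capacity` is queued to fail in turn.
    """
    loads = list(loads)
    alive = set(range(len(loads)))
    failed = []
    pending = [first_failure]
    while pending:
        node = pending.pop(0)
        if node not in alive:
            continue
        alive.remove(node)
        failed.append(node)
        if not alive:
            break  # nothing left to absorb the load
        share = loads[node] / len(alive)
        loads[node] = 0
        for other in sorted(alive):
            loads[other] += share
            if loads[other] > capacity and other not in pending:
                pending.append(other)
    return failed


# Four nodes running near capacity: one failure takes down the rest.
print(simulate_cascade([80, 80, 80, 80], capacity=100, first_failure=0))
# Four nodes with plenty of headroom: the failure stays contained.
print(simulate_cascade([50, 50, 50, 50], capacity=100, first_failure=0))
```

In this model the difference between a contained fault and a total outage is simply how much spare capacity the surviving nodes have when the first one drops out.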

Rawashdeh explains: "At approximately 9am PDT [5pm BST on Thursday], we discovered that Twitter was inaccessible for all web users, and mobile clients were not showing new Tweets. We immediately began to investigate the issue and found that there was a cascading bug in one of our infrastructure components. This wasn't due to a hack or our new office or Euro 2012 or Gif avatars, as some have speculated.

"One of the characteristics of such a [cascading] bug is that it can have a significant impact on all users, worldwide, which was the case today. As soon as we discovered it, we took corrective actions, which included rolling back to a previous stable version of Twitter."

That took about an hour, but after a half-hour respite between 10:10 and 10:40 Pacific time the site dropped out again, with "full recovery" following at 11:08 PDT.

A hacking group called Ugnazi claimed in emails to several organisations – and later, with no apparent irony, in tweets – to have caused the outage.

The group has given no credible explanation of how it did so, though, and Twitter was quick to dismiss the claim.


