(Pre-scriptum: This is a developing story. This post was originally titled "Two days without Skype: the price of free". I changed the title to "Two days without S: Blame Microsoft" after adding the correction below: if Windows Update did trigger the outage, then the problem had nothing to do with the fact that Skype is free. Skype then clarified that "the update patches were not the cause of the disruption", so I changed the headline again, to "Blame Microsoft, or maybe not".)
Quick summary: For about two days, from Thursday to Saturday last week, Skype's free or cheap voice-over-Internet services were unavailable to the 200+ million people who have signed up to use them. The tech explanation, as given on the Skype blog,
is: "The disruption was initiated by a massive restart of our user’s
computers across the globe within a very short timeframe as they
re-booted after receiving a routine set of patches through Windows Update.
The abnormally high number of restarts affected Skype’s network
resources. This caused a flood of log-in requests, which, combined with
the lack of peer-to-peer network resources, prompted a chain reaction
that had a critical impact."
In other words, Skype sent out a software update, which -- apparently because of a four-year-old bug in the client software that had gone undetected so far -- prompted many computers to restart, temporarily depriving the peer-to-peer system of substantial network resources, which started the chain reaction.
CORRECTION 20 Aug: I got the previous sentence wrong in the original version of this post, because the original version of Skype's post did not include the words "through Windows Update", suggesting that the whole thing was triggered by an update of Skype's own software. Now, Skype blames Microsoft for the crash. The explanation is not totally convincing, though: earlier, in an interview with the New York Times,
Skype's engineers had said that a bug in all Skype clients since 2003,
which had gone undetected so far, could have started the disruption.
So, the latest (partial) version could be: Microsoft sent out a routine Windows software update (as it does on the second Tuesday of every month) which required the computers to restart, temporarily depriving the Skype peer-to-peer system of substantial network resources and creating a flood of re-log-in requests, which triggered the chain reaction, with the Skype bug possibly magnifying the problem -- which would explain why this didn't happen on other Windows Update Tuesdays. (Thanks to Amir for pointing out the mistake.)
UPDATE 22 Aug: Or it could be different, as a new Skype blog post says: there was a critical weakness in the Skype software that was just made worse by the Windows update.
The Skype system is now back to normal; when I used it this morning, it worked fine. But questions remain.
In August 2004, I interviewed Niklas Zennström, one of the two Skype founders. At the time, there were about 550,000 concurrent Skype users on average. I asked him about scaling to serve a larger crowd:
With this design, can you scale up to a vast number of users, or is there a point where you need to redesign your architecture?
We won't need to invest in infrastructure. What we will need to do at some point is to make some changes in the technology to be able to scale more. If we didn’t do anything, when we reach 10 million concurrent users (20 times more than now) we believe there will be problems. So before that happens we will have to spend some time to make some changes in the architecture. But there is no investment needed in hardware etc. Just in development.
The 10-million mark has been reached. So one can wonder whether the issue with Skype is larger than what their blog post says. (I don't have more info on this and am just wondering, but here is what's written on the Skype blog: "Skype’s peer-to-peer core was not properly tuned to cope with the load and core size changes".)
There are a couple of other things to consider:
- The whole Internet seems to be bursting at the seams. Bandwidth is abundant, but the growth of transferred data is phenomenal -- just think of video and music up-/downloads and streaming, and of VoIP calls and videoconferencing such as Skype's. Some are even wondering whether the network is reaching saturation and whether the upcoming online television services (Joost, BBC's iPlayer, etc.) are going to crash the Internet. A report by Cisco, quoted by iTnews, found that American video websites currently transmit more data per month than the entire amount of traffic sent over the Internet in 2000.
- Some have argued that the Skype outage is a sign of the unreliability of Internet telephony. But what actually surprises me is that the Skype outage didn't happen sooner -- that the system has worked so well for five years (the company was founded in 2002), which is a testament to the genius in the application. In the meantime, dozens of other VoIP providers have closed down. Traditional telecom services are by no means perfect, either (not to mention cell phones and Blackberries).
And let's not forget that Skype is a free service (so the inconvenience of two days with no service is actually the price of free).