
August 20, 2007

Two days without Skype: Blame Microsoft, or maybe not

(Pre-scriptum: This is a developing story. This post was originally titled "Two days without Skype: the price of free". I changed it to "Two days without Skype: Blame Microsoft" after adding the correction below: if Windows Update indeed triggered the outage, then the problem had nothing to do with the fact that Skype is free. Then Skype clarified that "the update patches were not the cause of the disruption", so I changed the headline again, to "Blame Microsoft, or maybe not".)

Quick summary: For about two days, Thursday to Saturday last week, Skype's free or cheap voice-over-Internet services were unavailable to the 200+ million people who have signed up to use them. The tech explanation, as given on the Skype blog, is: "The disruption was initiated by a massive restart of our user’s computers across the globe within a very short timeframe as they re-booted after receiving a routine set of patches through Windows Update. The abnormally high number of restarts affected Skype’s network resources. This caused a flood of log-in requests, which, combined with the lack of peer-to-peer network resources, prompted a chain reaction that had a critical impact." In other words, Skype sent out a software update, which -- apparently because of a four-year-old bug in the client software that had gone undetected so far -- prompted many computers to restart, temporarily depriving the peer-to-peer system of substantial network resources, which started the chain reaction.

CORRECTION 20 Aug: I got the previous sentence wrong in the original version of this post, because the original version of Skype's post did not include the words "through Windows Update", suggesting that the whole thing was triggered by an update of Skype's own software. Now Skype blames Microsoft for the crash. The explanation is not totally convincing, though: earlier, in an interview with the New York Times, Skype's engineers had said that a bug present in all Skype clients since 2003, which had gone undetected so far, could have started the disruption.

So, the latest (partial) version could be: Microsoft sent out a routine Windows software update (as it does every second Tuesday of the month) which required the computers to restart, temporarily depriving the Skype peer-to-peer system of substantial network resources and creating a flood of re-log-in requests, which triggered the chain reaction, with the Skype bug possibly magnifying the problem -- which would explain why this didn't happen on other Windows Update Tuesdays. (Thanks, Amir, for pointing out the mistake.)

UPDATE 22 Aug: Or it could be different, as a new Skype blog post says: there was a critical weakness in the Skype software that was just made worse by the Windows update.
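For readers who like to see the mechanics, here is a toy sketch of how a mass restart could turn into a chain reaction of the kind described above. It is only an illustration under invented assumptions, not Skype's actual algorithm: the function name, the ratio of clients to supernodes, the log-in capacity per supernode, the "twice the capacity" overload threshold and the 10% drop rate are all hypothetical numbers picked to make the feedback loop visible.

```python
def simulate_login_storm(total_clients, restarted,
                         clients_per_supernode=100,
                         logins_per_supernode=10,
                         steps=10):
    """Toy model of a log-in flood on a peer-to-peer network.

    All parameters are hypothetical. Returns the number of clients
    still waiting to log in after each time step.
    """
    online = total_clients - restarted   # peers still connected
    pending = restarted                  # peers trying to log back in
    backlog = []
    for _ in range(steps):
        # Assumption: log-in capacity is provided by the online peers
        # themselves, so fewer peers online means less capacity.
        supernodes = online // clients_per_supernode
        capacity = supernodes * logins_per_supernode
        if pending > 2 * capacity:
            # Assumed overload rule: the flood knocks 10% of the online
            # peers off the network, and they join the log-in queue.
            dropped = online // 10
            online -= dropped
            pending += dropped
        else:
            # Normal operation: admit as many peers as capacity allows.
            admitted = min(pending, capacity)
            online += admitted
            pending -= admitted
        backlog.append(pending)
    return backlog


if __name__ == "__main__":
    # A small restart wave is absorbed; a large one keeps growing.
    print(simulate_login_storm(100_000, 5_000, steps=5))
    print(simulate_login_storm(100_000, 30_000, steps=5))
```

In this sketch a small wave of restarts is absorbed within a step or two, while a large one shrinks the very capacity needed to recover, so the queue grows instead of draining. That is the general shape of the "flood of log-in requests" combined with "the lack of peer-to-peer network resources" in Skype's explanation, nothing more.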

The Skype system is now back to normal; when I used it this morning it worked fine. But questions remain.

In August 2004, I interviewed Niklas Zennström, one of the two Skype founders. At the time, there were about 550'000 concurrent Skype users on average. I asked him about scaling to serve a larger crowd:

With this design can you scale up to a vast number of users or is there a point where you need to redesign your architecture?
We won't need to invest in infrastructure. What we will need to do at some point is to make some changes in the technology to be able to scale more. If we didn’t do anything, when we reach 10 million concurrent users (20 times more than now) we believe there will be problems. So before that happens we will have to spend some time to make some changes in the architecture. But there is no investment needed in hardware etc. Just in development.

The 10-million mark has been reached. So one can wonder whether the issue with Skype is not larger than what their blog post says (I don't have more info on this, am just wondering, but here is what's written on the Skype blog: "Skype’s peer-to-peer core was not properly tuned to cope with the load and core size changes").

There are a couple of other things to consider:

  • The whole Internet seems to be bursting at its seams. Bandwidth is abundant, but the growth of transferred data is phenomenal -- just think of video and music up-/downloads and streaming, and of VoIP calls and videoconferencing such as Skype's. Some are even wondering whether the network is reaching saturation and the upcoming online television services (Joost, BBC's iPlayer, etc) are going to crash the Internet. A report by Cisco, quoted by iTnews, found that American video websites currently transmit more data per month than the entire amount of traffic sent over the internet in 2000.

  • Some have argued that the Skype outage is a sign of the unreliability of Internet telephony. But what surprises me is actually that the Skype outage didn't happen sooner -- that the system has worked so well for five years (the company was founded in 2002), which is a testament to the genius in the application. In the meantime, dozens of other VoIP providers have closed down. Traditional telecom services are by no means perfect, either (not to mention cell phones and Blackberries). And let's not forget that Skype is a free service, so the inconvenience of two days with no service is actually the price of free.


Comments

The sad thing is that this is not just the price of free. Service in all sectors of telecom has gone downhill.

I pay good money to Wanadoo (former France Telecom) but still find I need to reboot my router several times a month and pay to wait on hold when I have more serious problems.

Not sure what the solution is, but free services on the one hand and the death throes of former telecom monopolies on the other make for a bad spot to be a consumer!

Wonder if there are equivalent examples in other industries or situations hit by a paradigm shift in technology?

In the case of this recent outage at Skype, the fact that the app is free is irrelevant, because the outage was provoked by an external factor which, speaking of costs, is not free (MS Windows).
What is unacceptable here is the opposite: that a mass-market product for which customers are charged a premium (think of the total cost of ownership of your production Windows PC...) causes trouble for a third party's product.

What is remarkable here is the fact that Skype has been able to stay up and running all this time without any major problem for its customers - I mean, for its first 5 years of operation until last week's bug.

Skype's issues last week are instructive for the entire industry. On the one hand, Skype has done a remarkable thing in generating a large user base very quickly. There are, however, important concerns about their architecture and approach, and questions have quite rightly been raised about peer to peer networks.

In fact, not all Peer-to-Peer models are created equal. Skype uses a different type of Peer-to-Peer network than most companies, based on SuperNodes. A SuperNode Peer-to-Peer system is one in which you rely on your customers rather than your own servers to handle the majority of your traffic. SuperNodes are just normal computers which get promoted by the Skype software to serve as the traffic cops for their entire network. In theory this is a good idea, but it does have unique vulnerabilities. Skype, as a company, has no physical or programmatic control over the most vital piece of its product when the network destabilizes for any reason.

Another issue with SuperNode models concerns system recovery after a crash. A SuperNode-based network can only recover as fast as new SuperNodes can be identified. Skype's formal post on Monday about the cause of its crash essentially confirmed this point.

Skype's model also creates usage issues. A Skype user who installs Skype on a university or corporate network agrees in the End-User License Agreement to let Skype route calls through his or her PC (and by extension the organization's network). In many cases this is a violation of the terms of use the student/employee has agreed to with the university or corporate IT dept. It can cause legal and bandwidth issues.

Other companies such as SightSpeed use a standards-based Peer-to-Peer architecture built on SIP (the standard protocol, as opposed to Skype's proprietary protocol) that allows them to manage all the core functionality themselves. Telephony protocols such as SIP (which SightSpeed uses) were designed from the outset to be fault tolerant. Companies such as Microsoft, Cisco, Sprint/Nextel, Verizon, AT&T, Comcast, Time Warner and SightSpeed all ship standards-based SIP software and hardware.

Skype's proprietary SuperNode architecture is what is risky. Peer-to-peer CAN be done right.

Aron Rosenberg
CTO SightSpeed
http://www.sightspeed.com

@Aron: thanks for the detailed explanation. Now I understand the difference between traditional SIP and Skype's own protocol. Do you think Skype could switch to SIP easily? Any guess how much effort (technical or otherwise) it would take for them? I know you're a rival, but Sun Tzu said: know your enemy better than yourself ;-)
