There’s been a lot of discussion on Apple’s Facetime product, and some questions asked over whether Facetime uses the traditional phone network at all. The answer appears to be: No, not at all. There is no traditional call setup if you initiate a Facetime session outright.
At Jobs’ most recent keynote, he popped up a slide which rattled off a few open standards: SIP, ICE (and by extension, STUN and TURN), RTP, and H.264. Now, I’ve used the Skype video calling integrated into my N900, and it works well, but Skype remains a closed protocol. And while Skype borrows lots of ideas from at least ICE and RTP, it doesn’t necessarily play by the standards, and so it’s difficult for third parties to interoperate.
So, all due credit to Apple on this one: They’re using open standards, and using open standards implies that anybody can play. Maybe Apple will need to standardise some additional glue: perhaps they have some custom SIP stuff for Facetime that they need to publish, or perhaps they’re doing something neat with the vertical handover from the GSM call to the IP call and back. I haven’t seen any new drafts from Apple float past, so I don’t know what they want to standardise here.
That aside, anybody should be able to use SIP to initiate an RTP session with another iPhone. All you need is the user’s registered identity within SIP. Obviously your device must understand these protocols, and be able to encode/decode the data streams. But that’s not such a big ask: The standards have been around for a long time, and lots of hardware either already supports them, or is perfectly capable of doing so. And since all this works over IP, all you need is a connection to the Internet (and of course, this means that 3G networks are valid carriers for video calls, too, even though Facetime prefers Wifi).
But that was quite the acronym soup up there. What do any of those protocols actually do? How does this work if you don’t technically make a phone call, at least in the traditional sense?
To explain, let’s take Alice and Bob and go through step-by-step the sequence of events required to set up a call. When both Alice’s and Bob’s phones are running and connected to the network, they’ll register with a SIP proxy somewhere, presumably hosted by Apple. Let’s say Alice wants some Facetime with Bob:
1. Alice’s phone must first learn about its environment: This is where STUN and TURN come in. STUN is a simple protocol for a host to bounce packets off a STUN server (again, in this case presumably hosted by Apple) to learn whether it’s behind a NAT. TURN is an extension to STUN to reserve an auxiliary address on the STUN server which can be used to relay data between endpoints. So, Alice’s phone uses STUN to learn the IP address on the _public_ side of the NAT she’s behind, and uses TURN to request a relay address at the same time, as a worst-case, if-all-else-fails data path between the two phones.
2. Alice’s phone then uses SIP to send an INVITE to Bob’s phone. SIP is a standard protocol used the world over to set up and tear down calls over IP; it doesn’t handle data, it only handles the control operations and negotiation for the call. This SIP invitation contains the information Alice’s phone learned about its network using STUN and TURN, which Bob will use later. Bob’s phone will receive the INVITE from the SIP proxy, prompting it to also learn about its network environment. Once it’s done this, it’ll send a response (_200 OK_, which Alice’s phone later confirms with an _ACK_) carrying this information via the SIP proxy back to Alice.
3. So, now Alice and Bob both have information about the other’s network environment. At this point, each phone runs through the ICE algorithm. ICE systematically probes pairs of the addresses learned from Alice’s and Bob’s current networks to determine which pairs can be used to pass data packets between the phones. By this process, both phones can learn whether they’re located behind the same NAT. If they are, they can communicate directly, and this is always the best outcome. If they aren’t, they must try the public addresses discovered via STUN to determine whether those addresses will allow traffic to flow or, in the worst case, fall back to relaying data via the relay address allocated in Step 1. Usually though, ICE can figure out a combination of packets to punch holes through the NATs to allow data to flow directly between the phones without the need for the relay. The relay really is the worst case.
4. Assuming ICE finds a set of valid candidates, the phones finalise the ICE algorithm by choosing the pair which they will use, and coordinate to complete the interaction. Once complete, both phones have a clear path on which to send data using RTP. RTP is a packet format for handling real-time data over UDP.
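To make Steps 3 and 4 a little more concrete, here is a rough Python sketch of how ICE ranks candidate pairs before probing them. The priority formulas and the type preference values (126 for host, 100 for server-reflexive, 0 for relayed) are the ones recommended in the ICE specification; everything else is an assumption for illustration: the `Candidate` class, the `check_list` helper, the example addresses (taken from the documentation ranges), and the choice of Alice as the controlling agent because she is the caller. It says nothing about what Apple actually does, and it only orders the pairs; a real ICE agent also prunes the list and runs STUN connectivity checks on each pair.

```python
from dataclasses import dataclass
from itertools import product

# Recommended type preferences from the ICE spec: direct (host) addresses
# are preferred, then NAT-mapped (server-reflexive), then the TURN relay.
TYPE_PREFERENCE = {"host": 126, "srflx": 100, "relay": 0}


@dataclass(frozen=True)
class Candidate:
    kind: str            # "host", "srflx" or "relay"
    address: str
    port: int
    local_pref: int = 65535
    component: int = 1   # 1 = RTP

    @property
    def priority(self) -> int:
        # priority = 2^24 * type pref + 2^8 * local pref + (256 - component)
        return ((TYPE_PREFERENCE[self.kind] << 24)
                + (self.local_pref << 8)
                + (256 - self.component))


def pair_priority(controlling: int, controlled: int) -> int:
    # pair priority = 2^32 * min(G, D) + 2 * max(G, D) + (1 if G > D else 0)
    g, d = controlling, controlled
    return (min(g, d) << 32) + 2 * max(g, d) + (1 if g > d else 0)


def check_list(local, remote):
    """Return every (local, remote) pair, best candidate pair first."""
    return sorted(
        product(local, remote),
        key=lambda pair: pair_priority(pair[0].priority, pair[1].priority),
        reverse=True,
    )


# Hypothetical candidates gathered in Step 1 (Alice) and Step 2 (Bob).
alice = [
    Candidate("host", "192.168.1.10", 5000),
    Candidate("srflx", "203.0.113.7", 41000),
    Candidate("relay", "198.51.100.2", 50000),
]
bob = [
    Candidate("host", "10.0.0.5", 6000),
    Candidate("srflx", "203.0.113.99", 42000),
    Candidate("relay", "198.51.100.3", 50002),
]

for a, b in check_list(alice, bob):
    print(f"{a.kind:>5} {a.address:<15} <-> {b.kind:>5} {b.address}")
```

The ordering matches the story above: the host-to-host pair is tried first, and the relay-to-relay pair sits at the very bottom of the list.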
If all goes well, all these interactions happen within the exchange of a few tens of packets and hopefully not too much delay to the user. This is a reasonably complex sequence of events primarily because of the presence of those dastardly NAT boxes. In an ideal Internet, we’d all have clear paths to each other and not have to deal with this sort of negotiation, but that’s another argument for another day. From the human perspective, Alice initiates a call and within a few seconds, Bob picks up. End of story.
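To give a sense of how small these packets are, here is a minimal Python sketch of the STUN exchange from Step 1: building a Binding Request and decoding the XOR-MAPPED-ADDRESS attribute that a server sends back. The message type, magic cookie and attribute layout come from the STUN specification; the function names are my own, and the sketch deliberately does no networking and skips attribute parsing, retransmission and authentication.

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442      # fixed value defined by the STUN spec
BINDING_REQUEST = 0x0001       # message type: Binding Request
XOR_MAPPED_ADDRESS = 0x0020    # attribute type carrying the public address


def build_binding_request(transaction_id=None):
    """Build a 20-byte Binding Request with no attributes."""
    if transaction_id is None:
        transaction_id = os.urandom(12)  # 96-bit random transaction ID
    # Header: type (16 bits), body length (16 bits), magic cookie (32 bits)
    header = struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE)
    return header + transaction_id


def decode_xor_mapped_address(value):
    """Decode the value of an IPv4 XOR-MAPPED-ADDRESS attribute."""
    _reserved, family, xport, xaddr = struct.unpack("!BBHI", value[:8])
    if family != 0x01:
        raise ValueError("only IPv4 is handled in this sketch")
    # The port is XORed with the top 16 bits of the cookie,
    # the address with the whole cookie.
    port = xport ^ (MAGIC_COOKIE >> 16)
    address = socket.inet_ntoa(struct.pack("!I", xaddr ^ MAGIC_COOKIE))
    return address, port


request = build_binding_request()
print(len(request), "byte Binding Request:", request.hex())
```

The whole request fits in 20 bytes, and the useful part of the reply is another 8, so even a few rounds of this per phone barely register on the wire.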
What I don’t know about this story is how people are identified on the SIP backend. By their phone number? IMEI? Presumably more information will become available if the product takes off.
This is a really short summary of the sequence of events. If you’d like some pictures to go with that, I have some (slightly outdated) slides which run over the sequence of events that take place.