
Using XMPP with VoIP Protocols

March 20, 2012

As many know, I am a big advocate for enabling a plurality of devices and applications to be used together as part of a multimedia communication session. That is the whole idea behind the work the ITU is presently doing on H.325 (or AMS). AMS aims to be the next-generation multimedia communications protocol, succeeding the now aging SIP and H.323 protocols, both of which are now 16 years old. While work on AMS is still progressing, there are things we can do in the interim to make it easier to integrate some applications. Perhaps the most important pairing is text with voice/video.

XMPP is the international standard for instant messaging and presence. It is widely used within enterprises around the world and by services like Google Talk. Due to its design, it has the potential to be as ubiquitous as email is today. And like email, it fully allows for federation between different domains. With XMPP, it is just as easy to have an instant messaging (IM) session with a colleague as it is with anybody else around the world.

H.323 and SIP are the two leading voice and video communication standards in the market today. H.323 is still the most widely used protocol for videoconferencing, while SIP is primarily used as a voice “trunking” protocol between enterprises and service providers. In the core of the service provider networks, both H.323 and SIP are employed, with SIP perhaps now leading where the goal is a pure voice replacement.

It is becoming increasingly possible to use VoIP (voice or video) to place calls between colleagues and with other people around the world. Since VoIP generally means “voice” in my mind, I prefer to use a more generic term of IP Multimedia Communications (IPMC), of which voice, video, instant messaging, whiteboarding, etc. are all a part. So, I’ll use IPMC below, but you can think of that as “VoIP” if you prefer that term.

When I initiate an IPMC session, it usually offers only a single mode of communication. Quite often, it is just a voice or voice/video call (admittedly, that is two modalities) or instant messaging. Rarely do we have the ability to initiate one session (e.g., voice) and have the ability to use instant messaging with that, especially if the two applications are not a single unified application. For example, if I make a call using my IP phone, my IM client has no idea that I’m talking to somebody. Likewise, if I am carrying on a few instant messaging sessions, my IP phone is oblivious to this fact.

What we need is a means of better integrating voice/video applications with XMPP. There was some work that started in the IETF to do this, but I do not think that work progressed too far. Nonetheless, I think it is important work and I figured I would write up my thoughts here.

We have two problems we need to solve:

  • My voice/video phone (desk phone or soft client) needs to know when I have an instant messaging session active with somebody so that I can just press a button to launch a voice call, and it needs to know the voice contact information for the other person
  • My instant messaging client needs to know when my voice/video phone is in an active call with somebody, and it needs to know the XMPP JID (the user’s identity) for the person with whom I am having a conversation

From these two requirements, we can see there is a need to share addressing information and there is a need to convey some presence state between the phone and the instant messaging client.

One way to convey addressing information is to simply advertise it within the protocols themselves. For example, when I configure my voice application, I could tell it my XMPP address. Likewise, when I configure my XMPP application, I can tell it the URI for my voice/video application. That’s pretty simple. You can imagine in SIP, for example, that we might introduce a header like this:

Source Code


In fact, XMPP already defines the means through which addresses can be advertised for other applications.

A small addition like this to SIP and H.323 would allow me to call you, for example, and immediately know your XMPP address. One could likewise advertise one's H.323 or SIP URI via XMPP. If I have XMPP and voice/video integrated into a single application, that would be all I need to know in order to quickly launch a different mode of communication right from within my application.
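To make the idea of advertising an XMPP address inside a SIP message a little more concrete, here is a minimal Python sketch. The header name `X-XMPP-Address` is an assumption invented purely for this illustration; it is not defined by any SIP specification, and the helper names and the sample INVITE are likewise hypothetical.

```python
from typing import Optional

# Hypothetical header name, made up for this sketch; not part of any SIP standard.
XMPP_HEADER = "X-XMPP-Address"


def add_xmpp_header(sip_message: str, jid: str) -> str:
    """Append the hypothetical header to the end of the SIP header section."""
    head, sep, body = sip_message.partition("\r\n\r\n")
    if not sep:
        # No blank line found: treat the whole text as headers with an empty body.
        sep = "\r\n\r\n"
    return f"{head}\r\n{XMPP_HEADER}: xmpp:{jid}{sep}{body}"


def find_xmpp_address(sip_message: str) -> Optional[str]:
    """Return the advertised XMPP URI, or None if the header is absent."""
    head, _, _ = sip_message.partition("\r\n\r\n")
    for line in head.split("\r\n")[1:]:  # skip the request/status line
        name, colon, value = line.partition(":")
        if colon and name.strip().lower() == XMPP_HEADER.lower():
            return value.strip()
    return None


# Hypothetical, heavily trimmed INVITE used only to exercise the helpers.
INVITE = (
    "INVITE sip:bob@example.com SIP/2.0\r\n"
    "From: <sip:alice@example.com>;tag=1\r\n"
    "To: <sip:bob@example.com>\r\n"
    "Call-ID: abc123\r\n"
    "CSeq: 1 INVITE\r\n"
    "Content-Length: 0\r\n"
    "\r\n"
)

tagged = add_xmpp_header(INVITE, "alice@example.com")
advertised = find_xmpp_address(tagged)
```

A phone receiving `tagged` could hand `advertised` to an IM client, which is all the integrated application described above would need.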

Often, though, these applications are separate. So what we need is a means of allowing the voice/video application and XMPP application to convey their status information to each other. A very reasonable way to do that is to re-use XMPP. After all, XMPP was designed to be a presence protocol. It has the ability to learn and maintain state information related to various presentities (“presence entities”).

With the phone knowing about active IM sessions and the XMPP client knowing about active voice/video sessions, it becomes trivial to initiate new modes of communication with the touch of a button. If I call you using my phone, my IM client would know I am on a call with you. I could press a button on my IM client that corresponds to the active voice call and use instant messaging without ever having to manually enter an address.
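The state sharing described above can be sketched as a small in-memory model. This is only an illustration under assumptions: the class and method names are hypothetical, the two "modes" are simply labeled `"im"` and `"voice"`, and a real deployment would carry this state over XMPP presence rather than in a shared Python object.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass(frozen=True)
class Contact:
    jid: str        # XMPP address, e.g. "bob@example.com"
    voice_uri: str  # SIP or H.323 URI for the same person


def _empty_state() -> Dict[str, Set[str]]:
    return {"im": set(), "voice": set()}


@dataclass
class SessionState:
    """Hypothetical shared view of which sessions are active in each mode."""
    contacts: List[Contact]
    active: Dict[str, Set[str]] = field(default_factory=_empty_state)

    def start(self, mode: str, address: str) -> None:
        self.active[mode].add(address)

    def end(self, mode: str, address: str) -> None:
        self.active[mode].discard(address)

    def launch_targets(self, mode: str) -> List[str]:
        """Addresses `mode` could contact with one button press, because the
        same person already has a session active in the other mode."""
        other = "voice" if mode == "im" else "im"
        targets = []
        for c in self.contacts:
            here, there = (c.jid, c.voice_uri) if mode == "im" else (c.voice_uri, c.jid)
            if there in self.active[other] and here not in self.active[mode]:
                targets.append(here)
        return targets


state = SessionState([Contact("bob@example.com", "sip:bob@example.com")])
state.start("voice", "sip:bob@example.com")   # the phone reports an active call
im_buttons = state.launch_targets("im")       # what the IM client could offer
```

Here the IM client learns, from the phone's reported call alone, which JID to offer as a one-touch chat target.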

There are also ways for clients to learn addressing information for users automatically. For example, rather than tell my phone my JID, we can use technologies like WebFinger. Using WebFinger, it would be possible for my phone to issue a query to learn the other addressing information related to me. Further, it would be possible for the person I call to learn my other addresses (IM, voice, email, etc.).
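As a rough sketch of what such a lookup might involve, the Python below builds a WebFinger query URL and pulls addresses out of a response document. It assumes the query format and JSON response layout that were later published as RFC 7033. The link relation names, the function names, and the sample response are hypothetical and used only for illustration; no network request is made.

```python
import json
from typing import Dict
from urllib.parse import urlencode

# Hypothetical link relation names, chosen only for this sketch.
REL_XMPP = "http://example.org/rel/xmpp"
REL_VOICE = "http://example.org/rel/voice"


def webfinger_url(account: str) -> str:
    """Build the WebFinger query URL for an account such as 'alice@example.com'."""
    host = account.rsplit("@", 1)[1]
    query = urlencode({"resource": f"acct:{account}"})
    return f"https://{host}/.well-known/webfinger?{query}"


def addresses_from_jrd(jrd_text: str) -> Dict[str, str]:
    """Map each link relation in a WebFinger response to its target address."""
    doc = json.loads(jrd_text)
    return {
        link["rel"]: link["href"]
        for link in doc.get("links", [])
        if "rel" in link and "href" in link
    }


# Hypothetical response a server might return for alice@example.com.
SAMPLE_JRD = json.dumps({
    "subject": "acct:alice@example.com",
    "links": [
        {"rel": REL_XMPP, "href": "xmpp:alice@example.com"},
        {"rel": REL_VOICE, "href": "sip:alice@example.com"},
    ],
})

url = webfinger_url("alice@example.com")
addresses = addresses_from_jrd(SAMPLE_JRD)
```

A phone that knows only the account name could fetch `url` and then read the XMPP address from `addresses[REL_XMPP]`.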

It is also possible to map telephone numbers to WebFinger account URIs using ENUM. So, it would be possible to convey only the phone number and then discover all of the other addressing information related to a user.
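For readers unfamiliar with ENUM, the first step is mechanical: the digits of the phone number are reversed, separated by dots, and placed under `e164.arpa`, and a DNS NAPTR lookup on that name returns a rewrite rule. The sketch below shows that conversion and applies a simple rewrite rule. The function names and the sample NAPTR regexp value are assumptions for illustration, and the DNS query itself is left out.

```python
import re


def enum_domain(phone_number: str) -> str:
    """Convert an E.164 number such as '+1-919-555-0100' to its ENUM domain name."""
    digits = [ch for ch in phone_number if ch.isdigit()]
    return ".".join(reversed(digits)) + ".e164.arpa"


def apply_naptr_regexp(regexp_field: str, number: str) -> str:
    """Apply a NAPTR regexp field of the form '!pattern!replacement!' to a number.

    Simplified: assumes the delimiter does not appear inside the pattern or
    replacement, and ignores any trailing flags.
    """
    delim = regexp_field[0]
    _, pattern, replacement, _flags = regexp_field.split(delim)
    return re.sub(pattern, replacement, number, count=1)


# Hypothetical record value that a DNS query on the ENUM domain might return.
SAMPLE_REGEXP = "!^.*$!acct:alice@example.com!"

domain = enum_domain("+1-919-555-0100")
account_uri = apply_naptr_regexp(SAMPLE_REGEXP, "+19195550100")
```

With `account_uri` in hand, a client could continue with a WebFinger lookup to find the remaining addresses.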

WebFinger makes it very easy to discover information about another person, but I realize that some people might be concerned with privacy. Therefore, WebFinger should be considered as one option and not the only solution. Still, it is one option to make provisioning significantly simpler.

ENUM could also be used to map a phone number to an XMPP address only. However, since we would still need to have the ability to map from an XMPP address to a phone number, we need to either advertise addresses via the session protocols or use WebFinger. I’m open to other recommendations.
