How Does VoIP Work?

It is very easy to get into a discussion that is very technical and confusing to most readers. The purpose of this section will be to provide a very high-level overview of Voice over IP (VoIP) aimed at those who do not consider themselves experts in the subject and hopefully with enough clarity that it serves as a good introduction to most readers.

Many people have used a computer and a microphone to record a human voice or other sounds. The process involves sampling the sound that is heard by the computer at a very high rate (usually 8,000 times per second or more) and storing those "samples" in memory or in a file on the computer. Each sample of sound is just a very tiny bit of the person's voice or other sound recorded by the computer. The computer has the wherewithal to take all of those samples and play them, so that the listener can hear what was recorded.

VoIP is based on the same idea, but the difference is that the audio samples are not stored locally. Instead, they are sent over the IP network to another computer and played there.

Of course, there is much more required in order to make VoIP work. When recording the sound samples, the computer might compress those sounds so that they require less space and will record only a limited frequency range. There are a number of ways to compress audio, the algorithm for which is referred to as a "compressor/de-compressor", or simply CODEC. Many CODECs exist for a variety of applications (e.g., movies and sound recordings) and, for VoIP, the CODECs are optimized for compressing voice, which significantly reduce the bandwidth used compared to an uncompressed audio stream. Speech CODECs are optimized to improve spoken words at the expense of sounds outside the frequency range of human speech. Recorded music and other sounds do not generally sound very good when passed through a speech CODEC, but that is perfectly OK for the task at hand. There are, however, some "wideband" CODECs that do a pretty good job both with speech and music.

Once the sound is recorded by the computer and compressed into very small samples, the samples are collected together into larger chunks and placed into data packets for transmission over the IP network. This process is referred to packetization. Generally, a single IP packet will contain 10 or more milliseconds of audio, with 20 or 30 milliseconds being most common.

Vint Cerf, who is often called the Father of the Internet, once explained packets in a way that is very easy to understand. Paraphrasing his description, he suggested to think of a packet as a postcards sent via postal mail. A postcard contains just a limited amount of information. To deliver a very long message, one must send a lot of postcards. Of course, the post office might lose one or more postcards. One also has to assemble the received postcards in order, so some kind of mechanism must be used to properly order to postcards, such as placing a sequence number on the bottom right corner. One can think of data packets in an IP network as postcards.

Just like postcards sent via the postal system, some IP data packets get lost and the CODECs must compensate for lost packets by "filling in the gaps" with audio that is acceptable to the human ear. This process is referred to as packet-loss concealment (PLC). In some cases, packets are sent multiple times in order to overcome packet loss. This method is called, appropriately enough, redundancy. Another method to address packet loss, known as forward-error correction (FEC), is to include some information from previously transmitted packets in subsequent packets. By performing mathematical operations in a particular FEC scheme, it is possible to reconstruct a lost packet from information bits in neighboring packets.

Packets are also sometimes delayed, just as with the postcards sent through the post office. This is particularly problematic for VoIP systems, as delays in delivering a voice packet means the information is too old to play. Such old packets are simply discarded, just as if the packet was never received. This is acceptable, as the same PLC or FEC algorithms can work to provide good audio quality.

VoIP applications generally measure the packet delay and expect the delay to remain relatively constant, though delay can increase and decrease during the course of a conversation. Variation in delay (called jitter) nonetheless exists and must be addressed. Delay, itself, just means it takes longer for the recorded voice spoken by the first person to be heard by the user on the far end. In general, good networks have an end-to-end delay of less than 100ms, though delay up to 400ms is considered acceptable (especially when using satellite systems). Jitter can result in choppy voice or temporary glitches, so VoIP devices must implement jitter buffer algorithms to compensate for jitter. Essentially, this means that a certain number of packets are queued before play-out and the queue length may be increased or decreased over time to reduce the number of discarded, late-arriving packets or to reduce "mouth to ear" delay. During the course of a conversation, the VoIP application may need to increase or decrease the length of the jitter buffer as the end-to-end delay increases or decreases. Such jitter buffer management behavior is referred to as and "adaptive jitter buffer" algorithm.

Video works in much the same way as voice. Video information received through a camera is broken into small pieces, compressed with a CODEC, placed into small packets, and transmitted over the IP network. This is one reason why VoIP is promising as a new technology: adding video or other media is relatively simple. Of course, there are certain issues that must be considered that are unique to video (e.g., frame refresh and much higher bandwidth requirements), but the basic principles of VoIP equally apply to video telephony.

Of course there is much more to VoIP than just sending the audio/video packets over the Internet. There must also be an agreed protocol for how computers find each other and how information is exchanged in order to allow packets to ultimately flow between the communicating devices. There must also be an agreed format (called payload format) for the contents of the media packets. We will describe some of the popular VoIP protocols in the next section.

Through this section, we have focused on computers that communicate with each other. However, VoIP is certainly not limited to desktop or laptop computers. VoIP is implemented in a variety of hardware devices, including IP phones, analog terminal adapters (ATAs), and gateways. There are also VoIP applications that run on smartphones and tablets. In short, a large number of devices can enable VoIP communication, some of which allow one to use traditional telephone devices to interface with the IP networks: one does not have to throw out existing equipment to migrate to VoIP.