How Many Audio Whats Per What? (Q76)

The integer value in the AudioCapability.x field is the maximum number of frames per packet, not necessarily the number of milliseconds or bytes per packet. G.723.1 and G.729, for instance, are frame-based codecs, whereas G.711 is a sample-based codec. Each G.723.1 frame represents 30 ms of audio and each G.729 frame represents 10 ms of audio. For sample-based codecs, a "frame" is considered to be eight samples. Since G.711 generates 8,000 samples per second and a frame consists of eight samples, each G.711 "frame" happens to represent 1 ms of audio. Therefore, if an endpoint (EP) can accept no more than, for example, 60 ms of audio per packet, it would encode the value 60 for G.711, 2 for G.723.1, and 6 for G.729. An implementation may have other constraints on packet size, but the limit is usually set by the size of its incoming audio stores and, if present, by constraints imposed by a hardware codec. Many implementations have no internal constraint on packet size at all, but they must still indicate some value in their terminal capability set (termCapSet), such as a maximum of 200 frames per packet (fpp) for G.711 and 4 fpp for G.723.1.
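The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the names FRAME_MS and max_frames_per_packet are hypothetical and do not come from any H.323 stack, and the frame durations are the ones stated in the paragraph above.

```python
# Frame durations in milliseconds, as given in the text above.
# For sample-based G.711, a "frame" is 8 samples at 8,000 samples/s = 1 ms.
FRAME_MS = {
    "G.711": 1,
    "G.723.1": 30,
    "G.729": 10,
}


def max_frames_per_packet(codec: str, max_audio_ms: int) -> int:
    """Return the largest whole number of frames that fits in max_audio_ms.

    Hypothetical helper for illustration; not part of any real API.
    """
    frame_ms = FRAME_MS[codec]
    # Only whole frames count, so round down.
    frames = max_audio_ms // frame_ms
    if frames < 1:
        raise ValueError(
            f"{max_audio_ms} ms cannot hold a single {codec} frame ({frame_ms} ms)"
        )
    return frames


if __name__ == "__main__":
    # An endpoint that accepts at most 60 ms of audio per packet.
    for codec in FRAME_MS:
        print(codec, max_frames_per_packet(codec, 60))
```

For a 60 ms limit this gives 60 for G.711, 2 for G.723.1, and 6 for G.729, matching the values in the text. The division rounds down because a partial frame cannot be advertised.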

Although the paragraph above speaks of a “packet”, the max-audio-frames field actually expresses the maximum number of frames per unit, and the unit depends on the multiplexer in use. The multiplexer is, in turn, determined by the umbrella Recommendation. For the H.222.1 multiplex, the unit is the size of the available STD buffer in units of 256 octets; for H.225.0, it is a packet; for H.223, it is an AL-SDU.
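The multiplex-to-unit mapping described above can be written as a simple lookup. The table contents are taken directly from the paragraph; the names UNIT_BY_MULTIPLEX and frame_limit_unit are hypothetical, chosen only for this sketch.

```python
# Unit that the max-audio-frames value is counted against, per multiplex,
# as listed in the text above.
UNIT_BY_MULTIPLEX = {
    "H.222.1": "available STD buffer, in units of 256 octets",
    "H.225.0": "packet",
    "H.223": "AL-SDU",
}


def frame_limit_unit(multiplex: str) -> str:
    """Return the unit the frame limit applies to (hypothetical helper)."""
    try:
        return UNIT_BY_MULTIPLEX[multiplex]
    except KeyError:
        raise ValueError(f"no unit listed for multiplex {multiplex!r}") from None


if __name__ == "__main__":
    for name in UNIT_BY_MULTIPLEX:
        print(f"{name}: max frames per {frame_limit_unit(name)}")
```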