Variable Length Integer Encoding
August 29, 2023
In recent years, I’ve seen several methods for variable-length encoding of integers for transmission over a network. The most recent one I encountered was defined in section 16 of the QUIC protocol specification. In that document, 64-bit integers are encoded as 1, 2, 4, or 8 octets based on the value of the integer. Another format is specified in Appendix B of the CBOR specification, which specifies how to encode any one of the typical integer sizes commonly used on modern computers today. Using a rigid encoding like these offers a means of quickly serializing data, but at a cost of increasing the number of octets required. In the case of CBOR, it perhaps isn’t accurate to refer to it as a variable-length encoding, but rather an encoding for each possible signed and unsigned integer type. However, the effect is more-or-less similar, though larger integers are encoded with less space efficiency.
When considering space-efficiency, another encoding approach is to encode integers so that the most-significant bit (MSb) of each serialized octet indicates whether this is the final octet or whether there is another octet to consume. The following illustrates that idea:
10000011 11111111 01111111
^ ^ ^--- 0 == final octet
Here, if a 1 is present as the MSb, it means the next octet is a part of the integer. If the MSb is a 0, it indicates this octet is the last octet in the serialized sequence of octets. Following this method, a value between 0 and 127 can be serialized into a single octet, while incrementally larger integers consume additional octets. This method can be used for both unsigned and signed integers, where the signed integers are stored in twos-complement format.
Serializing and deserializing data can be efficient. Below, I present functions that will perform those operations on a buffer given an unsigned or signed 64-bit integer. This code uses C++ constexpr functions, though these could easily be transformed into C macros if one prefers those.
// Space-efficient encoding of variable length integers
// Copyright (C) 2023
// Paul E. Jones <paulej@packetizer.com>
#include <cstdint>
#include <cstddef>
#include <span>
// Find the most significant bit position for unsigned integers
constexpr std::size_t FindMSb(std::uint64_t v)
{
std::size_t p = 0;
if (v >= std::uint64_t(1) << 32) p += 32, v >>= 32;
if (v >= std::uint64_t(1) << 16) p += 16, v >>= 16;
if (v >= std::uint64_t(1) << 8) p += 8, v >>= 8;
if (v >= std::uint64_t(1) << 4) p += 4, v >>= 4;
if (v >= std::uint64_t(1) << 2) p += 2, v >>= 2;
if (v >= std::uint64_t(1) << 1) p += 1, v >>= 1;
return p;
}
// Find the most significant bit position for signed integers
constexpr std::size_t FindMSb(std::int64_t v)
{
return ((v >= 0) ? FindMSb(static_cast<std::uint64_t>(v)) :
FindMSb(static_cast<std::uint64_t>(~v)));
}
// Return number of octets required to encode given std::uint64_t value
constexpr std::size_t VarUintSize(std::uint64_t value)
{
return FindMSb(value) / 7 + 1;
}
// Return number of octets required to encode given std::int64_t value
constexpr std::size_t VarIntSize(std::int64_t value)
{
return (FindMSb(value) + 1) / 7 + 1;
}
// Serialize the unsigned integer into the given buffer, returning 0 if the
// buffer is too short to hold the serialized integer
std::size_t Serialize(std::span<std::uint8_t> buffer, std::uint64_t value)
{
// Determine space requirements for the variable-width integer
const std::size_t octets_required = VarUintSize(value);
// Ensure the buffer is of sufficient length
if (buffer.size() < octets_required) return 0;
// Write octets from right to left (reverse order)
for (std::size_t i = octets_required; i > 0; i--)
{
// Get the group of 7 bits
std::uint8_t octet = value & 0x7f;
// Shift the data bits vector by 7 bits
value >>= 7;
// If this is not the last octet, set the MSb to 1
if (i != octets_required) octet |= 0x80;
// Write the value into the buffer
buffer[i - 1] = octet;
}
return octets_required;
}
// Deserialize the unsigned integer from the given buffer, returning number of
// octets deserialized or zero if there was an error
std::size_t Deserialize(const std::span<std::uint8_t> buffer,
std::uint64_t &value)
{
std::uint8_t octet{0x80};
std::size_t total_octets{0};
// Initialize the integer value
value = 0;
// Read octets until we find the last one having a 0 MSb
while (octet & 0x80)
{
// A 64-bits value should never require more than 10 octets
if (++total_octets == 11) return 0;
// Ensure we do not read beyond the buffer
if (total_octets > buffer.size()) return 0;
// Get the target octet
octet = buffer[total_octets - 1];
// Add these bits to the returned value
value = (value << 7) | (octet & 0x7f);
}
// If the total length is 10 octets, initial octet must be 0x81
if ((total_octets == 10) && (buffer[0] != 0x81)) return 0;
return total_octets;
}
// Serialize the signed integer into the given buffer, returning 0 if the
// buffer is too short to hold the serialized integer
std::size_t Serialize(std::span<std::uint8_t> buffer, std::int64_t value)
{
// Determine space requirements for the variable-width integer
std::size_t octets_required = VarIntSize(value);
// Ensure there is sufficient space in the buffer
if (octets_required > buffer.size()) return 0;
// Write octets from right to left (reverse order)
for (std::size_t i = octets_required; i > 0; i--)
{
// Get the group of 7 bits
std::uint8_t octet = value & 0x7f;
// Shift the data bits vector by 7 bits
value >>= 7;
// If this is not the last octet, set the MSb to 1
if (i != octets_required) octet |= 0x80;
// Write the value into the buffer
buffer[i - 1] = octet;
}
return octets_required;
}
// Deserialize the signed integer from the given buffer, returning number of
// octets deserialized or zero if there was an error
std::size_t Deserialize(const std::span<std::uint8_t> buffer,
std::int64_t &value)
{
std::uint8_t octet{0x80};
std::size_t total_octets{0};
// Ensure we do not read beyond the buffer
if (buffer.empty()) return 0;
// Determine the sign of the number by inspecting the leading sign bit
value = (buffer[0] & 0x40) ? -1 : 0;
// Read octets until we find the last one having a 0 MSb
while (octet & 0x80)
{
// A 64-bits value should never require more than 10 octets
if (++total_octets == 11) return 0;
// Ensure we do not read beyond the buffer
if ((total_octets) > buffer.size()) return 0;
// Get the target octet
octet = buffer[total_octets - 1];
// Add these bits to the returned value
value = (value << 7) | (octet & 0x7f);
}
// If the total length is 10 octets, ensure the initial octet is one
// of the only two valid values
if ((total_octets == 10) && (buffer[0] != 0x80) && (buffer[0] != 0xff))
{
return 0;
}
return total_octets;
}