Variable Length Integer Encoding

August 29, 2023

In recent years, I’ve seen several methods for variable-length encoding of integers for transmission over a network. The most recent one I encountered was defined in section 16 of the QUIC protocol specification. In that document, 64-bit integers are encoded as 1, 2, 4, or 8 octets based on the value of the integer. Another format is specified in Appendix B of the CBOR specification, which specifies how to encode any one of the typical integer sizes commonly used on modern computers today. Using a rigid encoding like these offers a means of quickly serializing data, but at a cost of increasing the number of octets required. In the case of CBOR, it perhaps isn’t accurate to refer to it as a variable-length encoding, but rather an encoding for each possible signed and unsigned integer type. However, the effect is more-or-less similar, though larger integers are encoded with less space efficiency.

When considering space-efficiency, another encoding approach is to encode integers so that the MSb of each serialized octet indicates whether this is the final octet or whether there is another octet to consume. The following illustrates that idea:

Source Code

10000011 11111111 01111111
^        ^        ^--- 0 == final octet

Here, if a 1 is present as the MSb, it means the next octet is a part of the integer. If the MSb is a 0, it indicates this octet is the last octet in the serialized sequence of octets. Following this method, a value between 0 and 127 can be serialized into a single octet, while incrementally larger integers consume additional octets. This method can be used for both unsigned and signed integers, where the signed integers are stored in twos-complement format.

Serializing and deserializing data can be efficient. Below, I present functions that will perform those operations on a buffer given an unsigned or signed 64-bit integer. This code uses C++ constexpr functions, though these could easily be transformed into C macros if one prefers those.

Source Code

// Space-efficient encoding of variable length integers
// Copyright (C) 2023
// Paul E. Jones <paulej@packetizer.com>

#include <cstdint>
#include <cstddef>
#include <span>

// Find the most significant bit position for unsigned integers
constexpr std::size_t FindMSb(std::uint64_t v)
{
    std::size_t p = 0;

    if (v >= std::uint64_t(1) << 32) p += 32, v >>= 32;
    if (v >= std::uint64_t(1) << 16) p += 16, v >>= 16;
    if (v >= std::uint64_t(1) <<  8) p +=  8, v >>=  8;
    if (v >= std::uint64_t(1) <<  4) p +=  4, v >>=  4;
    if (v >= std::uint64_t(1) <<  2) p +=  2, v >>=  2;
    if (v >= std::uint64_t(1) <<  1) p +=  1, v >>=  1;

    return p;
}

// Find the most significant bit position for signed integers
constexpr std::size_t FindMSb(std::int64_t v)
{
    return ((v >= 0) ? FindMSb(static_cast<std::uint64_t>(v)) :
                       FindMSb(static_cast<std::uint64_t>(~v)));
}

// Return number of octets required to encode given std::uint64_t value
constexpr std::size_t VarUintSize(std::uint64_t value)
{
    return FindMSb(value) / 7 + 1;
}

// Return number of octets required to encode given std::int64_t value
constexpr std::size_t VarIntSize(std::int64_t value)
{
    return (FindMSb(value) + 1) / 7 + 1;
}

// Serialize the unsigned integer into the given buffer, returning 0 if the
// buffer is too short to hold the serialized integer
std::size_t Serialize(std::span<std::uint8_t> buffer, std::uint64_t value)
{
    // Determine space requirements for the variable-width integer
    const std::size_t octets_required = VarUintSize(value);

    // Ensure the buffer is of sufficient length
    if (buffer.size() < octets_required) return 0;

    // Write octets from right to left (reverse order)
    for (std::size_t i = octets_required; i > 0; i--)
    {
        // Get the group of 7 bits
        std::uint8_t octet = value & 0x7f;

        // Shift the data bits vector by 7 bits
        value >>= 7;

        // If this is not the last octet, set the MSb to 1
        if (i != octets_required) octet |= 0x80;

        // Write the value into the buffer
        buffer[i - 1] = octet;
    }

    return octets_required;
}

// Deserialize the unsigned integer from the given buffer, returning number of
// octets deserialized or zero if there was an error
std::size_t Deserialize(const std::span<std::uint8_t> buffer,
                        std::uint64_t &value)
{
    std::uint8_t octet{0x80};
    std::size_t total_octets{0};

    // Initialize the integer value
    value = 0;

    // Read octets until we find the last one having a 0 MSb
    while (octet & 0x80)
    {
        // A 64-bits value should never require more than 10 octets
        if (++total_octets == 11) return 0;

        // Ensure we do not read beyond the buffer
        if (total_octets > buffer.size()) return 0;

        // Get the target octet
        octet = buffer[total_octets - 1];

        // Add these bits to the returned value
        value = (value << 7) | (octet & 0x7f);
    }

    // If the total length is 10 octets, initial octet must be 0x81
    if ((total_octets == 10) && (buffer[0] != 0x81)) return 0;

    return total_octets;
}

// Serialize the signed integer into the given buffer, returning 0 if the
// buffer is too short to hold the serialized integer
std::size_t Serialize(std::span<std::uint8_t> buffer, std::int64_t value)
{
    // Determine space requirements for the variable-width integer
    std::size_t octets_required = VarIntSize(value);

    // Ensure there is sufficient space in the buffer
    if (octets_required > buffer.size()) return 0;

    // Write octets from right to left (reverse order)
    for (std::size_t i = octets_required; i > 0; i--)
    {
        // Get the group of 7 bits
        std::uint8_t octet = value & 0x7f;

        // Shift the data bits vector by 7 bits
        value >>= 7;

        // If this is not the last octet, set the MSb to 1
        if (i != octets_required) octet |= 0x80;

        // Write the value into the buffer
        buffer[i - 1] = octet;
    }

    return octets_required;
}

// Deserialize the signed integer from the given buffer, returning number of
// octets deserialized or zero if there was an error
std::size_t Deserialize(const std::span<std::uint8_t> buffer,
                        std::int64_t &value)
{
    std::uint8_t octet{0x80};
    std::size_t total_octets{0};

    // Ensure we do not read beyond the buffer
    if (buffer.empty()) return 0;

    // Determine the sign of the number by inspecting the leading sign bit
    value = (buffer[0] & 0x40) ? -1 : 0;

    // Read octets until we find the last one having a 0 MSb
    while (octet & 0x80)
    {
        // A 64-bits value should never require more than 10 octets
        if (++total_octets == 11) return 0;

        // Ensure we do not read beyond the buffer
        if ((total_octets) > buffer.size()) return 0;

        // Get the target octet
        octet = buffer[total_octets - 1];

        // Add these bits to the returned value
        value = (value << 7) | (octet & 0x7f);
    }

    // If the total length is 10 octets, ensure the initial octet is one
    // of the only two valid values
    if ((total_octets == 10) && (buffer[0] != 0x80) && (buffer[0] != 0xff))
    {
        return 0;
    }

    return total_octets;
}

Click here to view the main blog page.