Packetizer Logo
 

Variable Length Integer Encoding

August 29, 2023

In recent years, I’ve seen several methods for variable-length encoding of integers for transmission over a network. The most recent one I encountered was defined in section 16 of the QUIC protocol specification. In that document, 64-bit integers are encoded as 1, 2, 4, or 8 octets based on the value of the integer. Another format is specified in Appendix B of the CBOR specification, which specifies how to encode any one of the typical integer sizes commonly used on modern computers today. Using a rigid encoding like these offers a means of quickly serializing data, but at a cost of increasing the number of octets required. In the case of CBOR, it perhaps isn’t accurate to refer to it as a variable-length encoding, but rather an encoding for each possible signed and unsigned integer type. However, the effect is more-or-less similar, though larger integers are encoded with less space efficiency.

When considering space-efficiency, another encoding approach is to encode integers so that the most-significant bit (MSb) of each serialized octet indicates whether this is the final octet or whether there is another octet to consume. The following illustrates that idea:

Source Code

10000011 11111111 01111111
^        ^        ^--- 0 == final octet

Here, if a 1 is present as the MSb, it means the next octet is a part of the integer. If the MSb is a 0, it indicates this octet is the last octet in the serialized sequence of octets. Following this method, a value between 0 and 127 can be serialized into a single octet, while incrementally larger integers consume additional octets. This method can be used for both unsigned and signed integers, where the signed integers are stored in twos-complement format.

Serializing and deserializing data can be efficient. Below, I present functions that will perform those operations on a buffer given an unsigned or signed 64-bit integer. This code uses C++ constexpr functions, though these could easily be transformed into C macros if one prefers those.

Source Code

// Space-efficient encoding of variable length integers
// Copyright (C) 2023
// Paul E. Jones <paulej@packetizer.com>

#include <cstdint>
#include <cstddef>
#include <span>

// Find the most significant bit position for unsigned integers
constexpr std::size_t FindMSb(std::uint64_t v)
{
    std::size_t p = 0;

    if (v >= std::uint64_t(1) << 32) p += 32, v >>= 32;
    if (v >= std::uint64_t(1) << 16) p += 16, v >>= 16;
    if (v >= std::uint64_t(1) <<  8) p +=  8, v >>=  8;
    if (v >= std::uint64_t(1) <<  4) p +=  4, v >>=  4;
    if (v >= std::uint64_t(1) <<  2) p +=  2, v >>=  2;
    if (v >= std::uint64_t(1) <<  1) p +=  1, v >>=  1;

    return p;
}

// Find the most significant bit position for signed integers
constexpr std::size_t FindMSb(std::int64_t v)
{
    return ((v >= 0) ? FindMSb(static_cast<std::uint64_t>(v)) :
                       FindMSb(static_cast<std::uint64_t>(~v)));
}

// Return number of octets required to encode given std::uint64_t value
constexpr std::size_t VarUintSize(std::uint64_t value)
{
    return FindMSb(value) / 7 + 1;
}

// Return number of octets required to encode given std::int64_t value
constexpr std::size_t VarIntSize(std::int64_t value)
{
    return (FindMSb(value) + 1) / 7 + 1;
}

// Serialize the unsigned integer into the given buffer, returning 0 if the
// buffer is too short to hold the serialized integer
std::size_t Serialize(std::span<std::uint8_t> buffer, std::uint64_t value)
{
    // Determine space requirements for the variable-width integer
    const std::size_t octets_required = VarUintSize(value);

    // Ensure the buffer is of sufficient length
    if (buffer.size() < octets_required) return 0;

    // Write octets from right to left (reverse order)
    for (std::size_t i = octets_required; i > 0; i--)
    {
        // Get the group of 7 bits
        std::uint8_t octet = value & 0x7f;

        // Shift the data bits vector by 7 bits
        value >>= 7;

        // If this is not the last octet, set the MSb to 1
        if (i != octets_required) octet |= 0x80;

        // Write the value into the buffer
        buffer[i - 1] = octet;
    }

    return octets_required;
}

// Deserialize the unsigned integer from the given buffer, returning number of
// octets deserialized or zero if there was an error
std::size_t Deserialize(const std::span<std::uint8_t> buffer,
                        std::uint64_t &value)
{
    std::uint8_t octet{0x80};
    std::size_t total_octets{0};

    // Initialize the integer value
    value = 0;

    // Read octets until we find the last one having a 0 MSb
    while (octet & 0x80)
    {
        // A 64-bits value should never require more than 10 octets
        if (++total_octets == 11) return 0;

        // Ensure we do not read beyond the buffer
        if (total_octets > buffer.size()) return 0;

        // Get the target octet
        octet = buffer[total_octets - 1];

        // Add these bits to the returned value
        value = (value << 7) | (octet & 0x7f);
    }

    // If the total length is 10 octets, initial octet must be 0x81
    if ((total_octets == 10) && (buffer[0] != 0x81)) return 0;

    return total_octets;
}

// Serialize the signed integer into the given buffer, returning 0 if the
// buffer is too short to hold the serialized integer
std::size_t Serialize(std::span<std::uint8_t> buffer, std::int64_t value)
{
    // Determine space requirements for the variable-width integer
    std::size_t octets_required = VarIntSize(value);

    // Ensure there is sufficient space in the buffer
    if (octets_required > buffer.size()) return 0;

    // Write octets from right to left (reverse order)
    for (std::size_t i = octets_required; i > 0; i--)
    {
        // Get the group of 7 bits
        std::uint8_t octet = value & 0x7f;

        // Shift the data bits vector by 7 bits
        value >>= 7;

        // If this is not the last octet, set the MSb to 1
        if (i != octets_required) octet |= 0x80;

        // Write the value into the buffer
        buffer[i - 1] = octet;
    }

    return octets_required;
}

// Deserialize the signed integer from the given buffer, returning number of
// octets deserialized or zero if there was an error
std::size_t Deserialize(const std::span<std::uint8_t> buffer,
                        std::int64_t &value)
{
    std::uint8_t octet{0x80};
    std::size_t total_octets{0};

    // Ensure we do not read beyond the buffer
    if (buffer.empty()) return 0;

    // Determine the sign of the number by inspecting the leading sign bit
    value = (buffer[0] & 0x40) ? -1 : 0;

    // Read octets until we find the last one having a 0 MSb
    while (octet & 0x80)
    {
        // A 64-bits value should never require more than 10 octets
        if (++total_octets == 11) return 0;

        // Ensure we do not read beyond the buffer
        if ((total_octets) > buffer.size()) return 0;

        // Get the target octet
        octet = buffer[total_octets - 1];

        // Add these bits to the returned value
        value = (value << 7) | (octet & 0x7f);
    }

    // If the total length is 10 octets, ensure the initial octet is one
    // of the only two valid values
    if ((total_octets == 10) && (buffer[0] != 0x80) && (buffer[0] != 0xff))
    {
        return 0;
    }

    return total_octets;
}

Click here to view the main blog page.