Arithmetic coding in Python
- 19. September 2024 (modified 3. October 2024)
- #algorithms
Arithmetic coding is a lossless data compression algorithm. It encodes a sequence of symbols into a sequence of bits. After reading the 1987 paper “Arithmetic coding for data compression” I decided to implement the algorithm in clean, modern Python.
In this article I’ll demonstrate how to use my Python implementation, how well arithmetic coding works when the symbol frequencies are skewed, and how to encode “Crime and Punishment.” I’ll also compare arithmetic coding with the better-known Huffman coding. The implementation is available at: github.com/tommyod/arithmetic-coding
A simple example
Here we create a message to encode and put an End Of Message (EOM) symbol "<EOM>" at the end of it. We also need to set up a table of symbol frequencies.
Arithmetic coding exploits frequent symbols by spending fewer bits on them, though unlike Huffman coding it does not assign a fixed bit sequence to each symbol.
>>> import collections
>>> from arithmetic_coding import ArithmeticEncoder
>>> message = ["A", "B", "A", "A", "A", "<EOM>"]
>>> frequencies = collections.Counter(message)
>>> encoder = ArithmeticEncoder(frequencies=frequencies, bits=12)
>>> bits = list(encoder.encode(message))
>>> bits
[1, 1, 0, 0, 0, 0, 1, 1]
We can decode these bits back to symbols and we get the original message back.
>>> list(encoder.decode(bits))
['A', 'B', 'A', 'A', 'A', '<EOM>']
This example encodes the message to 8 bits. A naive approach would be to assign a unique 2-bit sequence to each symbol, e.g., {"A": 0b00, "B": 0b01, "<EOM>": 0b10}. Since the message contains 6 symbols, we would need 6 * 2 = 12 bits with this naive approach.
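As a quick check (an illustrative sketch, not part of the library), we can spell out this naive scheme in code, representing each 2-bit code as a list of bits:
>>> naive_codes = {"A": [0, 0], "B": [0, 1], "<EOM>": [1, 0]}
>>> naive_bits = [bit for symbol in message for bit in naive_codes[symbol]]
>>> len(naive_bits)
12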
Infrequent symbols
To see how the encoder exploits the frequency of symbols, consider this example. We initialize ArithmeticEncoder(bits=32) because we need more internal bits in the encoder's buffer to represent low-frequency symbols.
>>> message = ["A"] + ["B"] * 1_000_000 + ["<EOM>"]
>>> frequencies = collections.Counter(message)
>>> encoder = ArithmeticEncoder(frequencies=frequencies, bits=32)
>>> bits = list(encoder.encode(message))
>>> len(bits)
43
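As a rough sanity check (assuming the symbols are independent), we can compute the Shannon entropy of this particular message; it comes out just below 43 bits:
>>> import math
>>> total = len(message)
>>> entropy = -sum(count * math.log2(count / total)
...                for count in frequencies.values())
>>> round(entropy, 2)
42.75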
Amazingly, a message with a million symbols is encoded to only 43 bits, right at the theoretical limit. Having seen how efficient arithmetic coding is with skewed frequencies, let’s apply it to a real-world example.
Encoding “Crime and Punishment”
Let’s download the English translation of “Crime and Punishment” by Fyodor Dostoevsky.
>>> import requests
>>> import string
>>> url = "https://www.gutenberg.org/cache/epub/2554/pg2554.txt"
>>> response = requests.get(url)
>>> symbols = (s for s in response.text if s in set(string.printable))
>>> message = list(symbols) + ["<EOM>"]
>>> len(message)
1164281
For this example, we’ll use the actual frequencies from the text, though in practice one might use a more general frequency table for the English language.
>>> frequencies = collections.Counter(message)
>>> len(frequencies)
83
>>> frequencies["e"], frequencies["q"]
(104735, 766)
Naive encoding
The Python code above shows that there are 83 distinct symbols in the message (82 printable characters plus "<EOM>").
Since \(2^7 = 128\), a naive approach would be to use seven bits for each symbol, regardless of its frequency.
We would then encode "A" to 0b0000000, "B" to 0b0000001, "C" to 0b0000010, and so forth. This approach would encode the book using 1164281 * 7 = 8149967 bits.
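The same number can be computed from the objects we already have (a quick check, nothing more):
>>> import math
>>> bits_per_symbol = math.ceil(math.log2(len(frequencies)))  # 83 symbols fit in 7 bits
>>> bits_per_symbol * len(message)
8149967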
Arithmetic encoding
Instead of using the naive approach, the arithmetic encoder exploits the fact that symbols like "e" are much more frequent than symbols like "q". In around three seconds we’re able to encode all 1.16 million symbols in the book:
>>> encoder = ArithmeticEncoder(frequencies=frequencies, bits=32)
>>> bits = list(encoder.encode(message))
>>> len(bits)
5259420
The result has around 5.26 million bits, and the bits/symbol ratio is 4.52.
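The bits-per-symbol figure is simply the ratio of the two lengths:
>>> round(len(bits) / len(message), 2)
4.52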
Shannon entropy as a lower bound on bits
The Shannon entropy, which is a lower bound on how many bits we need on average to encode a message of this length with these frequencies, is

\[
-\sum_i f_i \log_2 p_i ,
\]

where \(f_i\) is the frequency (count) of symbol \(i\) and \(p_i = f_i / \sum_j f_j\) is the probability of symbol \(i\). The bound only applies if we assume that each symbol is independent of the others; if we model higher-order statistics, higher compression ratios might be possible.
Computing this value for “Crime and Punishment”, we find that the best we can hope for is
>>> import math
>>> total = sum(frequencies.values())
>>> probs = {s: c / total for (s, c) in frequencies.items()}
>>> shannon_bound = -sum(frequencies[s] * math.log2(probs[s])
... for s in frequencies.keys())
>>> shannon_bound
5259418.898940693
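Per symbol, the bound works out to roughly the same 4.52 bits that the arithmetic coder achieved above:
>>> round(shannon_bound / len(message), 2)
4.52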
Huffman encoding
We also use the dahuffman Python library to produce a Huffman encoding, which maps each symbol to a unique code value of variable length.
Notice how common symbols like "e" are mapped to short code values, while rare symbols like "B" correspond to long code values.
>>> from dahuffman import HuffmanCodec
>>> codec = HuffmanCodec.from_frequencies(frequencies)
>>> codec.print_code_table()
Bits Code Value Symbol
3 000 0 'e'
4 0010 2 'h'
4 0011 3 'i'
5 01000 8 'u'
9 010010000 144 'R'
10 0100100010 290 'Y'
10 0100100011 291 'B'
...
>>> symbol_to_bitsize = {symbol: bitsize for (symbol, (bitsize, _))
...                      in codec.get_code_table().items()}
>>> sum(symbol_to_bitsize[symbol] for symbol in message)
5298110
Huffman encodes the book to around 5.3 million bits.
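The small gap to the entropy bound arises because Huffman must assign a whole number of bits to every symbol, while the ideal code length \(-\log_2 p_i\) is usually fractional; arithmetic coding effectively spends fractional bits per symbol. As an illustration, using the probabilities computed in the previous section:
>>> import math
>>> round(-math.log2(probs["e"]), 2)  # ideal code length for 'e'; Huffman uses 3 bits
3.47
>>> round(-math.log2(probs["q"]), 2)  # ideal code length for 'q'
10.57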
Arithmetic coder with an adaptive model
If we do not want to assume known frequencies, we can use a simple adaptive model. The frequency count for each symbol is initialized to one. As the encoder processes symbols, it increments the counts to update the probability model. The decoder mimics this process exactly, but in reverse. We still need to know beforehand all the distinct symbols that the encoder and decoder expect to see (unless we iterate over the data twice).
>>> symbols = list(set(message))
>>> encoder = ArithmeticEncoder(symbols, bits=32)
>>> bits = list(encoder.encode(message))
>>> len(bits)
5260166
>>> decoded = list(encoder.decode(bits))
>>> decoded == message
True
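To make the "decoder mimics the encoder" point concrete, here is a minimal sketch of the adaptive bookkeeping (illustrative only, not the library's internals): both sides start from all-ones counts and apply the same update after every symbol, so their probability models never drift apart.
>>> counts = {s: 1 for s in symbols}           # every symbol starts with a count of one
>>> def adaptive_probability(counts, symbol):
...     """Model probability of `symbol` under the current counts, then update."""
...     prob = counts[symbol] / sum(counts.values())
...     counts[symbol] += 1                    # identical update on encoder and decoder
...     return prob
>>> p = adaptive_probability(counts, message[0])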
A comparison table
Comparing all the methods we’ve mentioned, we get the following table.
| Compression Method | Bits Used |
|---|---|
| Naive encoding (7 bits/symbol) | 8,149,967 |
| Huffman encoding | 5,298,110 |
| Arithmetic encoding (adaptive) | 5,260,166 |
| Arithmetic encoding (fixed) | 5,259,420 |
| Shannon entropy bound | 5,259,418.9 |
Naively encoding each symbol using a fixed bit length is wasteful, since it does not account for the fact that some symbols are much more frequent than others. Huffman encoding improves greatly upon this, but arithmetic encoding is even better and is able to get within two bits of the Shannon bound.
Summary
Arithmetic coding maps a sequence of symbols into a sequence of bits. Unlike Huffman coding, there is no one-to-one correspondence between a single symbol and a single bit sequence. Both methods exploit the fact that some symbols are typically more common than others.
With arithmetic coding there are two types of probability models: fixed and adaptive. A fixed model uses a static frequency count, computed either by looping over the data in advance or by observing a larger corpus of text. An adaptive model starts out with equal probabilities for all symbols and adapts to the specific message as it iterates through the symbols. Regardless of which model is used, the logic of the decoder must exactly match the logic of the encoder.