
# Arithmetic coding in Python

- 19. September 2024 (modified 3. October 2024)
- #algorithms

Arithmetic coding is a lossless data compression algorithm. It encodes a sequence of symbols into a sequence of bits. After reading the 1987 paper “Arithmetic coding for data compression” I decided to implement the algorithm in clean, modern Python.

In this article I’ll demonstrate how to use my Python implementation, how well arithmetic coding works when the symbol frequencies are skewed, and how to encode “Crime and Punishment.” I’ll also compare arithmetic coding with the more well-known Huffman coding. The implementation is available at: github.com/tommyod/arithmetic-coding

## A simple example

Here we create a `message` to encode and put an End Of Message (EOM) symbol `"<EOM>"` at the end of it.
We also need to set up a table of symbol frequencies.
Arithmetic coding exploits frequent symbols by assigning fewer bits to them, though unlike Huffman coding it does not use a fixed bit sequence per symbol.

```
>>> import collections
>>> from arithmetic_coding import ArithmeticEncoder
>>> message = ["A", "B", "A", "A", "A", "<EOM>"]
>>> frequencies = collections.Counter(message)
>>> encoder = ArithmeticEncoder(frequencies=frequencies, bits=12)
>>> bits = list(encoder.encode(message))
>>> bits
[1, 1, 0, 0, 0, 0, 1, 1]
```

We can decode these bits and recover the original message.

```
>>> list(encoder.decode(bits))
['A', 'B', 'A', 'A', 'A', '<EOM>']
```

This example encodes the message to 8 bits.
A naive approach would be to assign a unique 2-bit sequence to each symbol, e.g., `{"A": 0b00, "B": 0b01, "<EOM>": 0b10}`.
Since the message contains 6 symbols, we would need `6 * 2 = 12` bits with this naive approach.
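To see the mechanism behind those 8 bits, here is a toy sketch of the core idea, using exact arithmetic with `fractions.Fraction` rather than the finite-precision buffer my implementation actually uses: each symbol narrows the current interval in proportion to its probability, and the width of the final interval determines how many bits are needed.

```python
import math
from fractions import Fraction

# Cumulative probability ranges: "A" occurs 4/6 of the time, "B" and "<EOM>" 1/6 each.
cum = {
    "A": (Fraction(0, 6), Fraction(4, 6)),
    "B": (Fraction(4, 6), Fraction(5, 6)),
    "<EOM>": (Fraction(5, 6), Fraction(6, 6)),
}

low, high = Fraction(0), Fraction(1)
for symbol in ["A", "B", "A", "A", "A", "<EOM>"]:
    width = high - low
    c_low, c_high = cum[symbol]
    # Narrow the interval to the sub-interval assigned to this symbol
    low, high = low + width * c_low, low + width * c_high

# The final width is the product of the symbol probabilities, so roughly
# -log2(width) bits suffice to identify a number inside the interval.
print(math.ceil(-math.log2(high - low)))  # prints 8
```

The final width is \((4/6)^4 (1/6)^2 = 256/46656\), and \(\lceil -\log_2(256/46656) \rceil = 8\), matching the 8 bits the encoder produced.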

## Infrequent symbols

To see how the encoder exploits the frequency of symbols, consider this example.
We initialize `ArithmeticEncoder(bits=32)` because the encoder needs more internal bits in its buffer to represent low-frequency symbols.

```
>>> message = ["A"] + ["B"] * 1_000_000 + ["<EOM>"]
>>> frequencies = collections.Counter(message)
>>> encoder = ArithmeticEncoder(frequencies=frequencies, bits=32)
>>> bits = list(encoder.encode(message))
>>> len(bits)
43
```

Amazingly, a message with a million symbols is encoded to only 43 bits! Having seen how efficient arithmetic coding is with skewed frequencies, let’s apply it to a real-world example.
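The 43 bits are no accident: a quick sketch of the Shannon bound (defined below) for this skewed distribution shows the result is essentially optimal.

```python
import math

# Symbol counts for the million-symbol message above.
counts = {"A": 1, "B": 1_000_000, "<EOM>": 1}
total = sum(counts.values())

# Shannon bound: sum over symbols of count * log2(total / count).
bound = sum(c * math.log2(total / c) for c in counts.values())
print(bound)  # ≈ 42.7 bits
```

The bound is just under 43 bits, so the encoder is within a fraction of a bit of the theoretical minimum.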

## Encoding “Crime and Punishment”

Let’s download the English translation of “Crime and Punishment” by Fyodor Dostoevsky.

```
>>> import requests
>>> import string
>>> url = "https://www.gutenberg.org/cache/epub/2554/pg2554.txt"
>>> response = requests.get(url)
>>> symbols = (s for s in response.text if s in set(string.printable))
>>> message = list(symbols) + ["<EOM>"]
>>> len(message)
1164281
```

For this example, we’ll use the actual frequencies from the text, though in practice one might use a more general frequency table for the English language.

```
>>> frequencies = collections.Counter(message)
>>> len(frequencies)
83
>>> frequencies["e"], frequencies["q"]
(104735, 766)
```

### Naive encoding

The Python code above shows that there are 83 distinct printable symbols in the book.
Since \(2^7 = 128\), a naive approach would be to use seven bits for each symbol, regardless of its frequency.
We would then encode `"A"` to `0b0000000`, `"B"` to `0b0000001`, `"C"` to `0b0000010`, and so forth.
This approach would encode the book using `1164281 * 7 = 8149967` bits.
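A quick sanity check on those figures, restating the numbers from the text:

```python
import math

n_distinct = 83       # distinct printable symbols in the book
n_symbols = 1164281   # message length, including <EOM>

# Smallest fixed bit width that can distinguish 83 symbols.
bits_per_symbol = math.ceil(math.log2(n_distinct))
print(bits_per_symbol)                 # 7
print(n_symbols * bits_per_symbol)     # 8149967
```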

### Arithmetic encoding

Instead of using the naive approach, the arithmetic encoder exploits that symbols like `"e"` are much more frequent than symbols like `"q"`.
In around three seconds we’re able to encode all the 1.16 million symbols in the book:

```
>>> encoder = ArithmeticEncoder(frequencies=frequencies, bits=32)
>>> bits = list(encoder.encode(message))
>>> len(bits)
5259420
```

The result has around 5.26 million bits, and the bits/symbol ratio is 4.52.

### Shannon entropy as a lower bound on bits

The Shannon entropy, which is a lower bound on how many bits we need on average to encode a message of this length with these frequencies, is

\[
H = -\sum_i f_i \log_2 p_i,
\]

where \(f_i\) is the frequency (count) of symbol \(i\) and \(p_i = f_i / \sum_j f_j\) is the probability of symbol \(i\). The bound only applies if we assume that each symbol is independent of the others; if we model higher-order statistics, higher compression ratios might be possible.

Computing this value for “Crime and Punishment”, we find that the best we can hope for is

```
>>> import math
>>> total = sum(frequencies.values())
>>> probs = {s: c / total for (s, c) in frequencies.items()}
>>> shannon_bound = -sum(frequencies[s] * math.log2(probs[s])
... for s in frequencies.keys())
>>> shannon_bound
5259418.898940693
```
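Dividing the bound by the message length gives the entropy per symbol, which matches the roughly 4.52 bits per symbol the encoder achieved (restating the numbers computed above):

```python
shannon_bound = 5259418.898940693  # Shannon bound computed above
n_symbols = 1164281                # message length, including <EOM>

print(shannon_bound / n_symbols)   # ≈ 4.52 bits per symbol
```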

### Huffman encoding

We also use the dahuffman Python library to produce a Huffman encoding, which maps each symbol to a unique code value of variable length.
Notice how common symbols like `"e"` are mapped to short code values, while rare symbols like `"B"` correspond to long code values.

```
>>> from dahuffman import HuffmanCodec
>>> codec = HuffmanCodec.from_frequencies(frequencies)
>>> codec.print_code_table()
Bits Code        Value Symbol
   3 000             0 'e'
   4 0010            2 'h'
   4 0011            3 'i'
   5 01000           8 'u'
   9 010010000     144 'R'
  10 0100100010    290 'Y'
  10 0100100011    291 'B'
...
>>> symbol_to_bitsize = {code:bitsize for (code, (bitsize, _))
... in codec.get_code_table().items()}
>>> sum(symbol_to_bitsize[symbol] for symbol in message)
5298110
```

Huffman encodes the book to around 5.3 million bits.

### Arithmetic coder with an adaptive model

If we do not want to assume known frequencies, we can use a simple adaptive model. The frequency count for each symbol is initialized to one. As the encoder processes symbols, it increments the counts to update the probability model. The decoder mimics this process exactly, but in reverse. We still need to know beforehand all the distinct symbols that the encoder and decoder expect to see (unless we iterate over the data twice).

```
>>> symbols = list(set(message))
>>> encoder = ArithmeticEncoder(symbols, bits=32)
>>> bits = list(encoder.encode(message))
>>> len(bits)
5260166
>>> decoded = list(encoder.decode(bits))
>>> decoded == message
True
```
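The key invariant is that encoder and decoder update their counts in lockstep. Here is a hypothetical sketch of such an adaptive model (not the library’s internal API) that makes the symmetry explicit:

```python
from collections import Counter

class AdaptiveModel:
    """Toy adaptive frequency model: every count starts at one."""

    def __init__(self, symbols):
        self.counts = Counter({s: 1 for s in symbols})

    def probability(self, symbol):
        return self.counts[symbol] / sum(self.counts.values())

    def update(self, symbol):
        self.counts[symbol] += 1  # adapt after each symbol

# Encoder and decoder each hold their own model. Because both apply the
# same updates in the same order, they always agree on the probabilities.
enc_model = AdaptiveModel(["A", "B", "<EOM>"])
dec_model = AdaptiveModel(["A", "B", "<EOM>"])

for symbol in ["A", "B", "A", "A", "A", "<EOM>"]:
    p_enc = enc_model.probability(symbol)  # encoder uses p, then updates
    enc_model.update(symbol)
    p_dec = dec_model.probability(symbol)  # decoder does the same after decoding
    dec_model.update(symbol)
    assert p_enc == p_dec
```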

### A comparison table

Comparing all the methods we’ve mentioned, we get the following table.

| Compression Method | Bits Used |
|---|---|
| Naive encoding (7 bits/symbol) | 8,149,967 |
| Huffman encoding | 5,298,110 |
| Arithmetic encoding (adaptive) | 5,260,166 |
| Arithmetic encoding (fixed) | 5,259,420 |
| Shannon entropy bound | 5,259,418.9 |

Naively encoding each symbol using a fixed bit length is wasteful, since it does not account for the fact that some symbols are much more frequent than others. Huffman encoding improves greatly upon this, but arithmetic encoding is even better and is able to get within two bits of the Shannon bound.

## Summary

Arithmetic coding maps a sequence of symbols into a sequence of bits. Unlike Huffman coding, there is no one-to-one correspondence between a single symbol and a single bit sequence. Both methods exploit the fact that some symbols are typically more common than others.

With arithmetic coding there are two types of probability models: fixed and adaptive. A fixed model uses a static frequency count, computed by either looping over the data in advance or by observing a larger corpus of text. An adaptive model starts out with equal probabilities for all symbols, and adapts to the specific message as it iterates through the symbols. Regardless of which model is used, the logic of the decoder must exactly match the logic of the encoder.