In [12]:

```
import pandas as pd
import math
```

Compression

Fun with compressing names, which may or may not interest you as a reader :) It does connect to data engineering and data science:

- Column storage with very similar data; check out Roaring Bitmaps
- An old idea for classifying music is to compress a song together with a catalogue of songs from a genre; the genre whose catalogue yields the smallest compressed size is the best match
- How well data compresses tells you something about the intrinsic complexity/richness of the data. Raw dataset size is a very basic measure; compression (and compression-like ideas) tells you more
- Encoder/decoder architectures work somewhat like lossy compression

A follow-up article will cover compressing time series.

Read in the 1990 US census name data:

In [13]:

```
def parse_perc(s: str) -> float:
    return float(s.strip('%')) / 100

def parse_count(s: str) -> int:
    return int(s.replace(',', ''))

df = pd.read_csv("names.csv", sep=";",
                 converters={
                     'Frequency (%) M': parse_perc,
                     'Frequency (%) F': parse_perc,
                     'Count M': parse_count,
                     'Count F': parse_count
                 })
```
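The converters can be sanity-checked on hypothetical cell values (the percentage and count below are made up, not taken from the census file):

```python
def parse_perc(s: str) -> float:
    # "50%" -> 0.5
    return float(s.strip('%')) / 100

def parse_count(s: str) -> int:
    # "1,234,567" -> 1234567
    return int(s.replace(',', ''))

print(parse_perc("50%"))         # 0.5
print(parse_count("1,234,567"))  # 1234567
```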

In [14]:

```
df.tail()
```

Out[14]:

| | Rank | Name M | Frequency (%) M | Count M | Name F | Frequency (%) F | Count F |
|---|---|---|---|---|---|---|---|
| 995 | 996 | BURL | 0.0001 | 16235 | ROBERT | 0.0001 | 25976 |
| 996 | 997 | WALKER | 0.0001 | 16235 | ISABELLA | 0.0001 | 25976 |
| 997 | 998 | TYREE | 0.0001 | 16235 | HERMINIA | 0.0001 | 25976 |
| 998 | 999 | JEFFEREY | 0.0001 | 16235 | TERRA | 0.0001 | 25976 |
| 999 | 1000 | AHMED | 0.0001 | 16235 | CELINA | 0.0001 | 25976 |

`Jefferey` would result in `[je, ef, ff, fe, er, re, ey]`. The letter `e` is the start of three 2-grams: `ef`, `er` and `ey`. By summing up the occurrences of each 2-gram we can find the most likely character to follow a given letter.
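The 2-gram construction can be sketched without pandas; here unweighted, on the single name `JEFFEREY`:

```python
def two_grams(s: str):
    # pairs of adjacent characters
    return list(zip(s[:-1], s[1:]))

grams = two_grams("JEFFEREY")
print(grams)
# 'E' is the start of three 2-grams: EF, ER and EY
starting_with_e = [g for g in grams if g[0] == 'E']
print(starting_with_e)
```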

Below we select the top 4:

In [15]:

```
# Construct 2-grams
top = 4

def two_grams(s: str):
    return list(zip(s[:-1], s[1:]))

# Construct a flattened list of weighted 2-grams and keep the top 4 per letter
weighted_two_grams = pd.DataFrame(
    df.apply(
        lambda row: [(a, b, row['Frequency (%) M']) for a, b in two_grams(row['Name M'])],
        axis=1
    ).apply(pd.Series).stack().to_list(),
    columns=['first', 'second', 'freq']
).groupby(by=['first', 'second']).sum() \
 .sort_values(by='freq', ascending=False) \
 .groupby(by='first').head(top) \
 .sort_index()
weighted_two_grams.head()
```

Out[15]:

| first | second | freq |
|---|---|---|
| A | L | 0.0549 |
| A | M | 0.0725 |
| A | N | 0.1032 |
| A | R | 0.1251 |
| B | E | 0.0482 |
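The pandas pipeline above boils down to this plain-Python aggregation (the two names and their frequencies here are made-up stand-ins for the census data):

```python
from collections import Counter, defaultdict

# Hypothetical name/frequency pairs standing in for the census rows
names = [("ROBERT", 0.031), ("RONALD", 0.021)]

# Sum frequencies per 2-gram
weights = Counter()
for name, freq in names:
    for a, b in zip(name[:-1], name[1:]):
        weights[(a, b)] += freq

# Keep the top-4 successors per first letter, ranked by summed weight
by_first = defaultdict(list)
for (a, b), w in weights.items():
    by_first[a].append((w, b))
top4 = {a: [b for _, b in sorted(ws, reverse=True)[:4]] for a, ws in by_first.items()}
print(top4['R'])  # ['O', 'T']
```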

To simplify, we assume we only have 26 letters. To encode a name we go through it letter by letter and output a (binary) code for each one. We distinguish two types of codes:

- The number of the letter in the alphabet (1-26)
- The position of the letter in the top 4 successors of the letter *preceding* it

Let's define some functions that do this for us!

In [16]:

```
def encode_char(c, prev):
    if prev is not None and c in weighted_two_grams.loc[(prev,)].index:
        return (0, weighted_two_grams.loc[(prev,)].index.get_loc(c))
    else:
        return (1, ord(c) - 65)

def encode(name):
    encoded = []
    prev = None
    for c in name:
        code = encode_char(c, prev)
        prev = c
        encoded.append(code)
    return encoded

def decode_char(code, prev):
    t, nr = code
    if prev is not None and t == 0:
        return weighted_two_grams.loc[(prev,)].iloc[nr].name
    else:
        return chr(nr + 65)

def decode(codes):
    decoded = []
    prev = None
    for code in codes:
        prev = decode_char(code, prev)
        decoded.append(prev)
    return ''.join(decoded)
```

As an example, let's see how `ROBERT` is encoded and decoded:

In [17]:

```
encode("ROBERT")
```

Out[17]:

[(1, 17), (0, 2), (0, 0), (0, 0), (0, 2), (0, 3)]

In [18]:

```
decode(encode("ROBERT"))
```

Out[18]:

'ROBERT'
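A self-contained variant of the same scheme, using a hypothetical hard-coded top-4 table instead of the census-derived `weighted_two_grams`, shows the roundtrip property:

```python
# Hypothetical top-4 successor table (made up for illustration,
# not the census-derived one)
TOP4 = {
    'R': ['A', 'E', 'I', 'O'],
    'O': ['N', 'R', 'B', 'L'],
    'B': ['E', 'R', 'O', 'I'],
    'E': ['R', 'N', 'L', 'S'],
}

def encode_char(c, prev):
    if prev in TOP4 and c in TOP4[prev]:
        return (0, TOP4[prev].index(c))   # position in predecessor's top 4
    return (1, ord(c) - 65)               # raw letter number

def decode_char(code, prev):
    t, nr = code
    if t == 0:
        return TOP4[prev][nr]
    return chr(nr + 65)

def encode(name):
    out, prev = [], None
    for c in name:
        out.append(encode_char(c, prev))
        prev = c
    return out

def decode(codes):
    out, prev = [], None
    for code in codes:
        prev = decode_char(code, prev)
        out.append(prev)
    return ''.join(out)

print(decode(encode("ROBERT")))  # ROBERT
```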

We could encode the two types of codes with a `0` followed by 2 bits or a `1` followed by 5 bits. The two bits give the position in the top 4, and the five bits give the position in the 26-letter alphabet (2^5 = 32). There might be some room to use even fewer bits with some trickery.
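Spelled out for `ROBERT`, using the codes from Out[17] above: one type-1 code at 1 + 5 = 6 bits plus five type-0 codes at 1 + 2 = 3 bits each gives 21 bits:

```python
# Codes for ROBERT, taken from Out[17] above
codes = [(1, 17), (0, 2), (0, 0), (0, 0), (0, 2), (0, 3)]

# Type-0 code: 1 flag bit + 2 payload bits; type-1 code: 1 flag bit + 5 payload bits
bits = sum((1 + 2) if t == 0 else (1 + 5) for t, _ in codes)
print(bits)  # 21
```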

The 2-gram lookup table requires 26 * 4 * 5 = 520 bits: sequentially store the top 4 successors for A...Z at 5 bits per letter; if a letter has fewer than four successors, pad with zeros.

Suppose we encounter the top 1000 names in proportion to their frequencies; then on average we need 21.92 bits per name:

In [19]:

```
def nr_bits(codes):
    # 1 flag bit plus log2(top) = 2 bits for a top-4 code, or 5 bits for a raw letter
    return sum([1 + (math.log2(top) if t == 0 else 5) for t, nr in codes])
```

In [20]:

```
df['enc'] = df['Name M'].map(lambda x: nr_bits(encode(x)))
(df['enc'] * df['Frequency (%) M']).sum()
```

Out[20]:

21.919200000000004

`ROBERT` would be encoded as follows:

In [30]:

```
def to_bin(encoding):
    # Zero-pad so type-0 codes carry 2 payload bits and type-1 codes carry 5,
    # matching the fixed-width scheme described above
    return ''.join([str(code) + format(nr, '02b' if code == 0 else '05b')
                    for code, nr in encoding])
```

In [31]:

```
to_bin(encode('ROBERT'))
```

Out[31]:

'110001010000000010011'

If the top 1000 first names really cover 90%, we can alternatively store those names in a fixed list. A name is then encoded as follows:

- A single bit: 0 if the name is not in the list, 1 if it is
- If in the list: 10 bits for a number < 1000 (2^10 = 1024)
- If not in the list: 5 bits per letter
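A sketch of this list scheme, with a tiny stand-in list instead of the real top 1000 (the three names below are just placeholders):

```python
top_names = ["JAMES", "JOHN", "ROBERT"]  # stand-in for the top-1000 list

def encode_name(name):
    if name in top_names:
        # 1-bit flag + 10-bit index into the list
        return '1' + format(top_names.index(name), '010b')
    # 1-bit flag + 5 bits per letter
    return '0' + ''.join(format(ord(c) - 65, '05b') for c in name)

print(encode_name("ROBERT"))  # 11 bits: flag + index 2
print(encode_name("AHMED"))   # 26 bits: flag + 5 letters at 5 bits each
```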

The list itself takes 5677 * 5 = 28385 bits to store (5677 letters across the 1000 names, at 5 bits each). On average, encoding a name then takes about 10 bits.

The break-even point lies at encoding roughly 2300 names from the top 1000: the list costs 28385 - 520 extra table bits, while each encoded name saves roughly 21.92 - 10 ≈ 12 bits.

In [33]:

```
(28385-520)/12
```

Out[33]:

2322.0833333333335

In [284]:

```
df.sort_values(by='Frequency (%) M', ascending=False).head(200)['Frequency (%) M'].sum()
```

Out[284]:

0.7261

Whether a shorter list, for example the top 200 names covering 72.6%, gives a better trade-off is left as an exercise for the reader.