#!/usr/bin/env python
# coding: utf-8
# # Short study of the Lempel-Ziv complexity
#
# This short [Jupyter notebook](https://www.Jupyter.org/) aims at defining and explaining the [Lempel-Ziv complexity](https://en.wikipedia.org/wiki/Lempel-Ziv_complexity).
#
# [I](http://perso.crans.org/besson/) will give examples, and benchmarks of different implementations.
#
# - **Reference:** Abraham Lempel and Jacob Ziv, *« On the Complexity of Finite Sequences »*, IEEE Trans. on Information Theory, January 1976, p. 75–81, vol. 22, n°1.
# ----
# ## Short definition
# The Lempel-Ziv complexity is defined as the number of different substrings (phrases) encountered as the stream is viewed from beginning to end.
#
# As an example:
#
# ```python
# >>> s = '1001111011000010'
# >>> lempel_ziv_complexity(s) # 1 / 0 / 01 / 11 / 10 / 110 / 00 / 010
# 8
# ```
#
# Marking the different substrings, this sequence $s$ has complexity $\mathrm{Lempel}$-$\mathrm{Ziv}(s) = 8$ because $s = 1001111011000010 = 1 / 0 / 01 / 11 / 10 / 110 / 00 / 010$.
#
# - See the page https://en.wikipedia.org/wiki/Lempel-Ziv_complexity for more details.
# Other examples:
#
# ```python
# >>> lempel_ziv_complexity('1010101010101010') # 1, 0, 10, 101, 01, 010, 1010
# 7
# >>> lempel_ziv_complexity('1001111011000010000010') # 1, 0, 01, 11, 10, 110, 00, 010, 000
# 9
# >>> lempel_ziv_complexity('100111101100001000001010') # 1, 0, 01, 11, 10, 110, 00, 010, 000, 0101
# 10
# ```
# ----
# ## Python implementation
# In[112]:
def lempel_ziv_complexity(sequence):
    """Lempel-Ziv complexity for a binary sequence, in simple Python code."""
    sub_strings = set()
    n = len(sequence)
    ind = 0
    inc = 1
    # This while loop runs at most n times (ind + inc increases at each iteration).
    while True:
        if ind + inc > n:
            break
        # Slicing out the candidate phrase takes O(inc) time...
        sub_str = sequence[ind : ind + inc]
        # ... and hashing it for the set lookup takes O(inc) time as well.
        # inc can grow as large as about n / |S| (with |S| the final number of phrases),
        # so one iteration can cost O(n / |S|):
        # ==> if n / |S| is bounded, the total cost is about O(n log n),
        # ==> but if n / |S| = O(n), the total cost can be O(n^2).
        if sub_str in sub_strings:
            inc += 1
        else:
            sub_strings.add(sub_str)
            ind += inc
            inc = 1
    return len(sub_strings)
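# To visualize which substrings are counted, here is a small illustrative helper (`lempel_ziv_phrases`, not part of the reference algorithm) that follows the exact same parsing loop, but records the phrases in order:

```python
def lempel_ziv_phrases(sequence):
    """List of the phrases counted by the Lempel-Ziv parsing, in order of discovery.

    Illustrative helper: same loop as lempel_ziv_complexity, but keeping the phrases.
    """
    phrases = []   # phrases, in order of discovery
    seen = set()   # same set as in lempel_ziv_complexity
    ind, inc = 0, 1
    while ind + inc <= len(sequence):
        sub_str = sequence[ind : ind + inc]
        if sub_str in seen:
            inc += 1   # extend the candidate phrase by one symbol
        else:
            seen.add(sub_str)
            phrases.append(sub_str)
            ind += inc   # jump past the newly discovered phrase
            inc = 1
    return phrases

print(lempel_ziv_phrases('1001111011000010'))
# ['1', '0', '01', '11', '10', '110', '00', '010']
```

# The length of this list is exactly the value returned by `lempel_ziv_complexity`.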
# ----
# ## Tests (1/2)
# In[4]:
s = '1001111011000010'
lempel_ziv_complexity(s) # 1 / 0 / 01 / 11 / 10 / 110 / 00 / 010
# In[5]:
get_ipython().run_line_magic('timeit', 'lempel_ziv_complexity(s)')
# In[6]:
lempel_ziv_complexity('1010101010101010') # 1, 0, 10, 101, 01, 010, 1010
# In[7]:
lempel_ziv_complexity('1001111011000010000010') # 1, 0, 01, 11, 10, 110, 00, 010, 000
# In[8]:
lempel_ziv_complexity('100111101100001000001010') # 1, 0, 01, 11, 10, 110, 00, 010, 000, 0101
# In[9]:
get_ipython().run_line_magic('timeit', "lempel_ziv_complexity('100111101100001000001010')")
# In[11]:
import random
def random_string(size, alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
    return "".join(random.choices(alphabet, k=size))

def random_binary_sequence(size):
    return random_string(size, alphabet="01")
# In[13]:
random_string(100)
random_binary_sequence(100)
# In[43]:
for (r, name) in zip(
    [random_string, random_binary_sequence],
    ["random strings in A..Z", "random binary sequences"]
):
    print("\nFor {}...".format(name))
    for n in [10, 100, 1000, 10000, 100000]:
        print("  of sizes {}, Lempel-Ziv complexity runs in:".format(n))
        get_ipython().run_line_magic('timeit', 'lempel_ziv_complexity(r(n))')
# We can start to see that the running time of this function seems to grow linearly with the size of the sequence.
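# The same observation can be made with the standard `timeit` module instead of the `%timeit` magic: for a linear-time function, doubling the size should roughly double the running time. A minimal self-contained sketch (the sizes and repetition count are arbitrary choices):

```python
import random
import timeit

def lempel_ziv_complexity(sequence):
    """Copy of the naive implementation above, so this cell is self-contained."""
    sub_strings = set()
    ind, inc = 0, 1
    while ind + inc <= len(sequence):
        sub_str = sequence[ind : ind + inc]
        if sub_str in sub_strings:
            inc += 1
        else:
            sub_strings.add(sub_str)
            ind += inc
            inc = 1
    return len(sub_strings)

def random_binary_sequence(size):
    return "".join(random.choices("01", k=size))

# Time the function on sequences of doubling sizes.
times = []
for n in [1000, 2000, 4000, 8000]:
    s = random_binary_sequence(n)
    t = timeit.timeit(lambda: lempel_ziv_complexity(s), number=20)
    times.append(t)
    print("n = {:4d}: {:.5f} s for 20 runs".format(n, t))
```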
# ----
# ## Cython implementation
# As [this blog post](https://jakevdp.github.io/blog/2013/06/15/numba-vs-cython-take-2/) explains, we can easily try to use [Cython](http://Cython.org/) in a notebook cell.
#
# > See [the Cython documentation](http://docs.cython.org/en/latest/src/quickstart/build.html#using-the-jupyter-notebook) for more information.
# In[16]:
get_ipython().run_line_magic('load_ext', 'cython')
# In[45]:
get_ipython().run_cell_magic('cython', '', 'import cython\n\nctypedef unsigned int DTYPE_t\n\n@cython.boundscheck(False)  # turn off bounds-checking for the entire function: quicker, but less safe\ndef lempel_ziv_complexity_cython(str sequence not None):\n    """Lempel-Ziv complexity for a string, in simple Cython code (C extension)."""\n    cdef set sub_strings = set()\n    cdef str sub_str = ""\n    cdef DTYPE_t n = len(sequence)\n    cdef DTYPE_t ind = 0\n    cdef DTYPE_t inc = 1\n    while True:\n        if ind + inc > n:\n            break\n        sub_str = sequence[ind : ind + inc]\n        if sub_str in sub_strings:\n            inc += 1\n        else:\n            sub_strings.add(sub_str)\n            ind += inc\n            inc = 1\n    return len(sub_strings)\n')
# Let's try it!
# In[37]:
s = '1001111011000010'
lempel_ziv_complexity_cython(s) # 1 / 0 / 01 / 11 / 10 / 110 / 00 / 010
# In[29]:
get_ipython().run_line_magic('timeit', 'lempel_ziv_complexity(s)')
get_ipython().run_line_magic('timeit', 'lempel_ziv_complexity_cython(s)')
# In[30]:
lempel_ziv_complexity_cython('1010101010101010') # 1, 0, 10, 101, 01, 010, 1010
# In[31]:
lempel_ziv_complexity_cython('1001111011000010000010') # 1, 0, 01, 11, 10, 110, 00, 010, 000
# In[32]:
lempel_ziv_complexity_cython('100111101100001000001010') # 1, 0, 01, 11, 10, 110, 00, 010, 000, 0101
# Now for a speed test:
# In[46]:
for (r, name) in zip(
    [random_string, random_binary_sequence],
    ["random strings in A..Z", "random binary sequences"]
):
    print("\nFor {}...".format(name))
    for n in [10, 100, 1000, 10000, 100000]:
        print("  of sizes {}, Lempel-Ziv complexity in Cython runs in:".format(n))
        get_ipython().run_line_magic('timeit', 'lempel_ziv_complexity_cython(r(n))')
# > $\implies$ Yay! It is indeed faster, but only about $\times 2$ faster...
# ----
# ## Numba implementation
# As [this blog post](https://jakevdp.github.io/blog/2013/06/15/numba-vs-cython-take-2/) explains, we can also try to use [Numba](http://Numba.PyData.org/) in a notebook cell.
# In[48]:
from numba import jit
# In[79]:
@jit
def lempel_ziv_complexity_numba(sequence: str) -> int:
    """Lempel-Ziv complexity for a sequence, in Python code using numba.jit() for automatic speedup (hopefully)."""
    sub_strings = set()
    n: int = len(sequence)
    ind: int = 0
    inc: int = 1
    while True:
        if ind + inc > n:
            break
        sub_str: str = sequence[ind : ind + inc]
        if sub_str in sub_strings:
            inc += 1
        else:
            sub_strings.add(sub_str)
            ind += inc
            inc = 1
    return len(sub_strings)
# Let's try it!
# In[80]:
s = '1001111011000010'
lempel_ziv_complexity_numba(s) # 1 / 0 / 01 / 11 / 10 / 110 / 00 / 010
# In[53]:
get_ipython().run_line_magic('timeit', 'lempel_ziv_complexity_numba(s)')
# In[54]:
lempel_ziv_complexity_numba('1010101010101010') # 1, 0, 10, 101, 01, 010, 1010
# In[55]:
lempel_ziv_complexity_numba('1001111011000010000010') # 1, 0, 01, 11, 10, 110, 00, 010, 000
# In[57]:
lempel_ziv_complexity_numba('100111101100001000001010') # 1, 0, 01, 11, 10, 110, 00, 010, 000, 0101
# In[58]:
get_ipython().run_line_magic('timeit', "lempel_ziv_complexity_numba('100111101100001000001010')")
# > $\implies$ Well... It doesn't seem much faster than the naive Python code.
# > We used the most appropriate data structure (a string is probably the smallest; numpy arrays would probably be faster), and the signature could be specified when calling [`@numba.jit`](http://numba.pydata.org/numba-doc/latest/user/jit.html).
# > But even these tricks didn't help much.
#
# > I tested, and without specifying the signature, the fastest approach is to use strings, compared to lists or numpy arrays.
# > Note that a [`@jit`](http://numba.pydata.org/numba-doc/latest/user/jit.html)-powered function is compiled at runtime when it is first called, so the types used for the *first* call determine the signature of the compiled function.
# ----
# ## Tests (2/2)
#
# To test more robustly, let us generate some (uniformly) random binary sequences.
# In[59]:
from numpy.random import binomial
def bernoulli(p, size=1):
    """One or more samples from a Bernoulli distribution of parameter p."""
    return binomial(1, p, size)
# In[60]:
bernoulli(0.5, 20)
# That's probably not optimal, but we can generate a string with:
# In[61]:
''.join(str(i) for i in bernoulli(0.5, 20))
# In[62]:
def random_binary_sequence(n, p=0.5):
    """Random binary sequence of size n, each bit being 1 with probability p."""
    return ''.join(str(i) for i in bernoulli(p, n))
# In[63]:
random_binary_sequence(50)
random_binary_sequence(50, p=0.1)
random_binary_sequence(50, p=0.25)
random_binary_sequence(50, p=0.5)
random_binary_sequence(50, p=0.75)
random_binary_sequence(50, p=0.9)
# And so, with this function we can check that the three implementations (naive, Cython-powered, Numba-powered) always give the same result.
# In[64]:
def tests_3_functions(n, p=0.5, debug=True):
    s = random_binary_sequence(n, p=p)
    c1 = lempel_ziv_complexity(s)
    if debug:
        print("Sequence s = {} ==> complexity C = {}".format(s, c1))
    c2 = lempel_ziv_complexity_cython(s)
    c3 = lempel_ziv_complexity_numba(s)
    assert c1 == c2 == c3, "Error: the sequence {} gave different values of the Lempel-Ziv complexity from the 3 functions ({}, {}, {})...".format(s, c1, c2, c3)
    return c1
# In[65]:
tests_3_functions(5)
# In[66]:
tests_3_functions(20)
# In[67]:
tests_3_functions(50)
# In[68]:
tests_3_functions(500)
# In[69]:
tests_3_functions(5000)
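# Beyond consistency between implementations, we can sanity-check the *value* of the complexity itself: the 1976 paper normalizes $c(n)$ by $n / \log_2(n)$, and for random binary sequences this ratio is expected to be of order $1$. A quick empirical sketch (the sizes are arbitrary choices), with a copy of the naive implementation so that the cell is self-contained:

```python
import math
import random

def lempel_ziv_complexity(sequence):
    """Copy of the naive implementation above, so this cell is self-contained."""
    sub_strings = set()
    ind, inc = 0, 1
    while ind + inc <= len(sequence):
        sub_str = sequence[ind : ind + inc]
        if sub_str in sub_strings:
            inc += 1
        else:
            sub_strings.add(sub_str)
            ind += inc
            inc = 1
    return len(sub_strings)

# For a random binary sequence, c(n) should be of order n / log2(n).
ratios = []
for n in [1000, 10000, 100000]:
    s = "".join(random.choice("01") for _ in range(n))
    c = lempel_ziv_complexity(s)
    ratio = c / (n / math.log2(n))
    ratios.append(ratio)
    print("n = {:6d}: c(n) = {:5d}, c(n) / (n / log2(n)) = {:.3f}".format(n, c, ratio))
```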
# ----
# ## Benchmarks
#
# On two examples of strings (binary sequences), we can compare our three implementations.
# In[70]:
get_ipython().run_line_magic('timeit', "lempel_ziv_complexity('100111101100001000001010')")
get_ipython().run_line_magic('timeit', "lempel_ziv_complexity_cython('100111101100001000001010')")
get_ipython().run_line_magic('timeit', "lempel_ziv_complexity_numba('100111101100001000001010')")
# In[71]:
get_ipython().run_line_magic('timeit', "lempel_ziv_complexity('10011110110000100000101000100100101010010111111011001111111110101001010110101010')")
get_ipython().run_line_magic('timeit', "lempel_ziv_complexity_cython('10011110110000100000101000100100101010010111111011001111111110101001010110101010')")
get_ipython().run_line_magic('timeit', "lempel_ziv_complexity_numba('10011110110000100000101000100100101010010111111011001111111110101001010110101010')")
# Let us check the time used by all three functions, for longer and longer sequences:
# In[72]:
get_ipython().run_line_magic('timeit', 'tests_3_functions(10, debug=False)')
get_ipython().run_line_magic('timeit', 'tests_3_functions(20, debug=False)')
get_ipython().run_line_magic('timeit', 'tests_3_functions(40, debug=False)')
get_ipython().run_line_magic('timeit', 'tests_3_functions(80, debug=False)')
get_ipython().run_line_magic('timeit', 'tests_3_functions(160, debug=False)')
get_ipython().run_line_magic('timeit', 'tests_3_functions(320, debug=False)')
# In[73]:
def test_cython(n):
    s = random_binary_sequence(n)
    c = lempel_ziv_complexity_cython(s)
    return c
# In[74]:
get_ipython().run_line_magic('timeit', 'test_cython(10)')
get_ipython().run_line_magic('timeit', 'test_cython(20)')
get_ipython().run_line_magic('timeit', 'test_cython(40)')
get_ipython().run_line_magic('timeit', 'test_cython(80)')
get_ipython().run_line_magic('timeit', 'test_cython(160)')
get_ipython().run_line_magic('timeit', 'test_cython(320)')
# In[75]:
get_ipython().run_line_magic('timeit', 'test_cython(640)')
get_ipython().run_line_magic('timeit', 'test_cython(1280)')
get_ipython().run_line_magic('timeit', 'test_cython(2560)')
get_ipython().run_line_magic('timeit', 'test_cython(5120)')
# In[76]:
get_ipython().run_line_magic('timeit', 'test_cython(10240)')
get_ipython().run_line_magic('timeit', 'test_cython(20480)')
# ----
# ## Complexity?
# $\implies$ The function `lempel_ziv_complexity_cython` seems to be indeed (almost) linear in $n$, the length of the binary sequence $S$.
#
# But let us check more precisely, as it could also have a complexity of $\mathcal{O}(n \log n)$.
# In[95]:
import matplotlib.pyplot as plt
import seaborn as sns
get_ipython().run_line_magic('matplotlib', 'inline')
sns.set(context="notebook", style="darkgrid", palette="hls", font="sans-serif", font_scale=1.4)
# In[96]:
import numpy as np
import timeit
# In[109]:
sizes = np.array(np.trunc(np.logspace(1, 6, 30)), dtype=int)
times = np.array([
    timeit.timeit(
        stmt="lempel_ziv_complexity_cython(random_string({}))".format(n),
        globals=globals(),
        number=10,
    )
    for n in sizes
])
# In[110]:
plt.figure(figsize=(15, 10))
plt.plot(sizes, times, 'o-')
plt.xlabel("Length $n$ of the binary sequence $S$")
plt.ylabel(r"Time in $\mathrm{s}$ (for 10 runs)")
plt.title("Time complexity of Lempel-Ziv complexity")
plt.show()
# In[111]:
plt.figure(figsize=(15, 10))
plt.loglog(sizes, times, 'o-')
plt.xlabel("Length $n$ of the binary sequence $S$")
plt.ylabel(r"Time in $\mathrm{s}$ (for 10 runs)")
plt.title("Time complexity of Lempel-Ziv complexity, loglog scale")
plt.show()
# The plot is a straight line of slope $\simeq 1$ in $\log\log$ scale, so the algorithm indeed seems to have an (almost) linear time complexity.
#
# To sum up: for a sequence $S$ of length $n$, it takes $\mathcal{O}(n)$ basic operations to compute its Lempel-Ziv complexity $\mathrm{Lempel}$-$\mathrm{Ziv}(S)$.
# ----
# ## Conclusion
#
# - The Lempel-Ziv complexity is not too hard to implement, and it indeed measures a certain complexity of a binary sequence, capturing its regularity and reproducibility.
#
# - Using [Cython](http://Cython.org/) was quite useful, giving a $\simeq \times 100$ speed-up over our naive pure-Python implementation!
#
# - The algorithm is not easy to analyze: we have a trivial $\mathcal{O}(n^2)$ bound, but experiments suggest it is more likely $\mathcal{O}(n \log n)$ in the worst case, and $\mathcal{O}(n)$ in practice for "not too complicated" sequences (or on average, for random sequences).
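# To probe these different regimes empirically, we can compare a maximally regular sequence (all zeros, where the candidate length `inc` grows large, so every slice and hash is expensive) against a random sequence of the same length. A minimal sketch with the naive implementation (the size and repetition count are arbitrary choices):

```python
import random
import timeit

def lempel_ziv_complexity(sequence):
    """Copy of the naive pure-Python implementation, so this cell is self-contained."""
    sub_strings = set()
    ind, inc = 0, 1
    while ind + inc <= len(sequence):
        sub_str = sequence[ind : ind + inc]
        if sub_str in sub_strings:
            inc += 1
        else:
            sub_strings.add(sub_str)
            ind += inc
            inc = 1
    return len(sub_strings)

n = 100000
results = {}
for name, s in [("all zeros", "0" * n),
                ("random", "".join(random.choice("01") for _ in range(n)))]:
    c = lempel_ziv_complexity(s)
    t = timeit.timeit(lambda: lempel_ziv_complexity(s), number=1)
    results[name] = (c, t)
    print("{:>9}: complexity = {:5d}, one run took {:.4f} s".format(name, c, t))
```

# The regular sequence has far fewer phrases, but each of them is long, which is exactly the regime where the per-iteration cost stops being constant.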
# ----
# ## (Experimental) [Julia](http://julialang.org) implementation
#
# I want to (quickly) try to see if I can use [Julia](http://julialang.org) to write a faster version of this function.
# See [issue #1](https://github.com/Naereen/Lempel-Ziv_Complexity/issues/1).
# In[118]:
get_ipython().run_cell_magic('time', '', '%%script julia\n\n"""Lempel-Ziv complexity for a sequence, in simple Julia code."""\nfunction lempel_ziv_complexity(sequence)\n    sub_strings = Set()\n    n = length(sequence)\n    ind = 1\n    inc = 1\n    while true\n        if ind + inc - 1 > n\n            break\n        end\n        sub_str = sequence[ind : ind + inc - 1]\n        if sub_str in sub_strings\n            inc += 1\n        else\n            push!(sub_strings, sub_str)\n            ind += inc\n            inc = 1\n        end\n    end\n    return length(sub_strings)\nend\n\ns = "1001111011000010"\nprintln(lempel_ziv_complexity(s))  # 1 / 0 / 01 / 11 / 10 / 110 / 00 / 010 ==> 8\n\nM = 1000;\nN = 10000;\nfor _ in 1:M\n    s = join(rand(0:1, N));\n    lempel_ziv_complexity(s);\nend\n')
# And to compare it fairly, let us use [Pypy](http://pypy.org) for comparison.
# In[122]:
get_ipython().run_cell_magic('time', '', '%%pypy\n\ndef lempel_ziv_complexity(sequence):\n    """Lempel-Ziv complexity for a binary sequence, in simple Python code."""\n    sub_strings = set()\n    n = len(sequence)\n    ind = 0\n    inc = 1\n    while True:\n        if ind + inc > n:\n            break\n        sub_str = sequence[ind : ind + inc]\n        if sub_str in sub_strings:\n            inc += 1\n        else:\n            sub_strings.add(sub_str)\n            ind += inc\n            inc = 1\n    return len(sub_strings)\n\ns = "1001111011000010"\nlempel_ziv_complexity(s)  # 1 / 0 / 01 / 11 / 10 / 110 / 00 / 010\n\nfrom random import random\n\nM = 1000\nN = 10000\nfor _ in range(M):\n    s = \'\'.join(str(int(random() < 0.5)) for _ in range(N))\n    lempel_ziv_complexity(s)\n')
# So we can check that, on these 1000 random trials on strings of size 10000, the naive Julia version is slower than the naive Python version (executed with Pypy for speedup).
# ----
# ## Ending notes
# > Thanks for reading!
# > My implementation is [now open-source and available on GitHub](https://github.com/Naereen/Lempel-Ziv_Complexity), at https://github.com/Naereen/Lempel-Ziv_Complexity.
#
# > It will be available from PyPI very soon; see https://pypi.python.org/pypi/lempel_ziv_complexity.
#
# > See [this repo on GitHub](https://github.com/Naereen/notebooks/) for more notebooks, or [on nbviewer.jupyter.org](https://nbviewer.jupyter.org/github/Naereen/notebooks/).
#
# > That's it for this demo! See you, folks!