#!/usr/bin/env python
# coding: utf-8
# # Short study of the Lempel-Ziv complexity
#
# This short [Jupyter notebook](https://www.Jupyter.org/) aims at defining and explaining the [Lempel-Ziv complexity](https://en.wikipedia.org/wiki/Lempel-Ziv_complexity).
#
# [I](http://perso.crans.org/besson/) will give examples, and benchmarks of different implementations.
#
# - **Reference:** Abraham Lempel and Jacob Ziv, *« On the Complexity of Finite Sequences »*, IEEE Trans. on Information Theory, January 1976, p. 75–81, vol. 22, n°1.
# ----
# ## Short definition
# The Lempel-Ziv complexity is defined as the number of different substrings (phrases) encountered as the stream is viewed from beginning to end.
#
# As an example:
#
# ```python
# >>> s = '1001111011000010'
# >>> lempel_ziv_complexity(s) # 1 / 0 / 01 / 11 / 10 / 110 / 00 / 010
# 8
# ```
#
# Marking the different substrings, this sequence $s$ has complexity $\mathrm{Lempel}$-$\mathrm{Ziv}(s) = 8$ because $s = 1001111011000010 = 1 / 0 / 01 / 11 / 10 / 110 / 00 / 010$.
#
# - See the page https://en.wikipedia.org/wiki/Lempel-Ziv_complexity for more details.
# Other examples:
#
# ```python
# >>> lempel_ziv_complexity('1010101010101010') # 1, 0, 10, 101, 01, 010, 1010
# 7
# >>> lempel_ziv_complexity('1001111011000010000010') # 1, 0, 01, 11, 10, 110, 00, 010, 000
# 9
# >>> lempel_ziv_complexity('100111101100001000001010') # 1, 0, 01, 11, 10, 110, 00, 010, 000, 0101
# 10
# ```
# ----
# ## Python implementation
# In[112]:
def lempel_ziv_complexity(sequence):
    """Lempel-Ziv complexity for a binary sequence, in simple Python code."""
    sub_strings = set()
    n = len(sequence)
    ind = 0
    inc = 1
    # This while loop runs at most n times (ind + inc increases at each iteration).
    while True:
        if ind + inc > n:
            break
        # Slicing out the candidate phrase takes O(inc) time...
        sub_str = sequence[ind : ind + inc]
        # ... and hashing it for the set lookup takes O(inc) time as well.
        # inc can grow as large as about n / |S| (with |S| the final number of phrases),
        # so one iteration can cost O(n / |S|):
        # ==> if n / |S| is bounded, the total cost is about O(n log n),
        # ==> but if n / |S| = O(n), the total cost can be O(n^2).
        if sub_str in sub_strings:
            inc += 1
        else:
            sub_strings.add(sub_str)
            ind += inc
            inc = 1
    return len(sub_strings)
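# To visualize which substrings are counted, here is a small illustrative helper (`lempel_ziv_phrases`, not part of the reference algorithm) that follows the exact same parsing loop, but records the phrases in order:

```python
def lempel_ziv_phrases(sequence):
    """List of the phrases counted by the Lempel-Ziv parsing, in order of discovery.

    Illustrative helper: same loop as lempel_ziv_complexity, but keeping the phrases.
    """
    phrases = []   # phrases, in order of discovery
    seen = set()   # same set as in lempel_ziv_complexity
    ind, inc = 0, 1
    while ind + inc <= len(sequence):
        sub_str = sequence[ind : ind + inc]
        if sub_str in seen:
            inc += 1   # extend the candidate phrase by one symbol
        else:
            seen.add(sub_str)
            phrases.append(sub_str)
            ind += inc   # jump past the newly discovered phrase
            inc = 1
    return phrases

print(lempel_ziv_phrases('1001111011000010'))
# ['1', '0', '01', '11', '10', '110', '00', '010']
```

# The length of this list is exactly the value returned by `lempel_ziv_complexity`.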
# ----
# ## Tests (1/2)
# In[4]:
s = '1001111011000010'
lempel_ziv_complexity(s) # 1 / 0 / 01 / 11 / 10 / 110 / 00 / 010
# In[5]:
get_ipython().run_line_magic('timeit', 'lempel_ziv_complexity(s)')
# In[6]:
lempel_ziv_complexity('1010101010101010') # 1, 0, 10, 101, 01, 010, 1010
# In[7]:
lempel_ziv_complexity('1001111011000010000010') # 1, 0, 01, 11, 10, 110, 00, 010, 000
# In[8]:
lempel_ziv_complexity('100111101100001000001010') # 1, 0, 01, 11, 10, 110, 00, 010, 000, 0101
# In[9]:
get_ipython().run_line_magic('timeit', "lempel_ziv_complexity('100111101100001000001010')")
# In[11]:
import random
def random_string(size, alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
    return "".join(random.choices(alphabet, k=size))

def random_binary_sequence(size):
    return random_string(size, alphabet="01")
# In[13]:
random_string(100)
random_binary_sequence(100)
# In[43]:
for (r, name) in zip(
    [random_string, random_binary_sequence],
    ["random strings in A..Z", "random binary sequences"]
):
    print("\nFor {}...".format(name))
    for n in [10, 100, 1000, 10000, 100000]:
        print("  of sizes {}, Lempel-Ziv complexity runs in:".format(n))
        get_ipython().run_line_magic('timeit', 'lempel_ziv_complexity(r(n))')
# We can start to see that the running time of this function seems to grow linearly with the size of the sequence.
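# The same observation can be made with the standard `timeit` module instead of the `%timeit` magic: for a linear-time function, doubling the size should roughly double the running time. A minimal self-contained sketch (the sizes and repetition count are arbitrary choices):

```python
import random
import timeit

def lempel_ziv_complexity(sequence):
    """Copy of the naive implementation above, so this cell is self-contained."""
    sub_strings = set()
    ind, inc = 0, 1
    while ind + inc <= len(sequence):
        sub_str = sequence[ind : ind + inc]
        if sub_str in sub_strings:
            inc += 1
        else:
            sub_strings.add(sub_str)
            ind += inc
            inc = 1
    return len(sub_strings)

def random_binary_sequence(size):
    return "".join(random.choices("01", k=size))

# Time the function on sequences of doubling sizes.
times = []
for n in [1000, 2000, 4000, 8000]:
    s = random_binary_sequence(n)
    t = timeit.timeit(lambda: lempel_ziv_complexity(s), number=20)
    times.append(t)
    print("n = {:4d}: {:.5f} s for 20 runs".format(n, t))
```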
# ----
# ## Cython implementation
# As [this blog post](https://jakevdp.github.io/blog/2013/06/15/numba-vs-cython-take-2/) explains, we can easily try to use [Cython](http://Cython.org/) in a notebook cell.
#
# > See [the Cython documentation](http://docs.cython.org/en/latest/src/quickstart/build.html#using-the-jupyter-notebook) for more information.
# In[16]:
get_ipython().run_line_magic('load_ext', 'cython')
# In[45]:
get_ipython().run_cell_magic('cython', '', 'import cython\n\nctypedef unsigned int DTYPE_t\n\n@cython.boundscheck(False)  # turn off bounds-checking for the entire function: quicker, but less safe\ndef lempel_ziv_complexity_cython(str sequence not None):\n    """Lempel-Ziv complexity for a string, in simple Cython code (C extension)."""\n    cdef set sub_strings = set()\n    cdef str sub_str = ""\n    cdef DTYPE_t n = len(sequence)\n    cdef DTYPE_t ind = 0\n    cdef DTYPE_t inc = 1\n    while True:\n        if ind + inc > n:\n            break\n        sub_str = sequence[ind : ind + inc]\n        if sub_str in sub_strings:\n            inc += 1\n        else:\n            sub_strings.add(sub_str)\n            ind += inc\n            inc = 1\n    return len(sub_strings)\n')
# Let's try it!
# In[37]:
s = '1001111011000010'
lempel_ziv_complexity_cython(s) # 1 / 0 / 01 / 11 / 10 / 110 / 00 / 010
# In[29]:
get_ipython().run_line_magic('timeit', 'lempel_ziv_complexity(s)')
get_ipython().run_line_magic('timeit', 'lempel_ziv_complexity_cython(s)')
# In[30]:
lempel_ziv_complexity_cython('1010101010101010') # 1, 0, 10, 101, 01, 010, 1010
# In[31]:
lempel_ziv_complexity_cython('1001111011000010000010') # 1, 0, 01, 11, 10, 110, 00, 010, 000
# In[32]:
lempel_ziv_complexity_cython('100111101100001000001010') # 1, 0, 01, 11, 10, 110, 00, 010, 000, 0101
# Now for a speed test:
# In[46]:
for (r, name) in zip(
    [random_string, random_binary_sequence],
    ["random strings in A..Z", "random binary sequences"]
):
    print("\nFor {}...".format(name))
    for n in [10, 100, 1000, 10000, 100000]:
        print("  of sizes {}, Lempel-Ziv complexity in Cython runs in:".format(n))
        get_ipython().run_line_magic('timeit', 'lempel_ziv_complexity_cython(r(n))')
# > $\implies$ Yay! It is indeed faster, but only about $\times 2$ faster...
# ----
# ## Numba implementation
# As [this blog post](https://jakevdp.github.io/blog/2013/06/15/numba-vs-cython-take-2/) explains, we can also try to use [Numba](http://Numba.PyData.org/) in a notebook cell.
# In[48]:
from numba import jit
# In[79]:
@jit
def lempel_ziv_complexity_numba(sequence: str) -> int:
    """Lempel-Ziv complexity for a sequence, in Python code using numba.jit() for automatic speedup (hopefully)."""
    sub_strings = set()
    n: int = len(sequence)
    ind: int = 0
    inc: int = 1
    while True:
        if ind + inc > n:
            break
        sub_str: str = sequence[ind : ind + inc]
        if sub_str in sub_strings:
            inc += 1
        else:
            sub_strings.add(sub_str)
            ind += inc
            inc = 1
    return len(sub_strings)
# Let's try it!
# In[80]:
s = '1001111011000010'
lempel_ziv_complexity_numba(s) # 1 / 0 / 01 / 11 / 10 / 110 / 00 / 010
# In[53]:
get_ipython().run_line_magic('timeit', 'lempel_ziv_complexity_numba(s)')
# In[54]:
lempel_ziv_complexity_numba('1010101010101010') # 1, 0, 10, 101, 01, 010, 1010
# In[55]:
lempel_ziv_complexity_numba('1001111011000010000010') # 1, 0, 01, 11, 10, 110, 00, 010, 000
# In[57]:
lempel_ziv_complexity_numba('100111101100001000001010') # 1, 0, 01, 11, 10, 110, 00, 010, 000, 0101
# In[58]:
get_ipython().run_line_magic('timeit', "lempel_ziv_complexity_numba('100111101100001000001010')")
# > $\implies$ Well... It doesn't seem much faster than the naive Python code.
# > We used the most appropriate data structure (a string is probably the smallest; numpy arrays would probably be faster), and the signature could be specified when calling [`@numba.jit`](http://numba.pydata.org/numba-doc/latest/user/jit.html).
# > But even these tricks didn't help much.
#
# > I tested, and without specifying the signature, the fastest approach is to use strings, compared to lists or numpy arrays.
# > Note that a [`@jit`](http://numba.pydata.org/numba-doc/latest/user/jit.html)-powered function is compiled at runtime when it is first called, so the types used for the *first* call determine the signature of the compiled function.
# ----
# ## Tests (2/2)
#
# To test more robustly, let us generate some (uniformly) random binary sequences.
# In[59]:
from numpy.random import binomial
def bernoulli(p, size=1):
    """One or more samples from a Bernoulli distribution of parameter p."""
    return binomial(1, p, size)
# In[60]:
bernoulli(0.5, 20)
# That's probably not optimal, but we can generate a string with:
# In[61]:
''.join(str(i) for i in bernoulli(0.5, 20))
# In[62]:
def random_binary_sequence(n, p=0.5):
    """Random binary sequence of size n, each bit being 1 with probability p."""
    return ''.join(str(i) for i in bernoulli(p, n))
# In[63]:
random_binary_sequence(50)
random_binary_sequence(50, p=0.1)
random_binary_sequence(50, p=0.25)
random_binary_sequence(50, p=0.5)
random_binary_sequence(50, p=0.75)
random_binary_sequence(50, p=0.9)
# And so, with this function we can check that the three implementations (naive, Cython-powered, Numba-powered) always give the same result.
# In[64]:
def tests_3_functions(n, p=0.5, debug=True):
    s = random_binary_sequence(n, p=p)
    c1 = lempel_ziv_complexity(s)
    if debug:
        print("Sequence s = {} ==> complexity C = {}".format(s, c1))
    c2 = lempel_ziv_complexity_cython(s)
    c3 = lempel_ziv_complexity_numba(s)
    assert c1 == c2 == c3, "Error: the sequence {} gave different values of the Lempel-Ziv complexity from the 3 functions ({}, {}, {})...".format(s, c1, c2, c3)
    return c1
# In[65]:
tests_3_functions(5)
# In[66]:
tests_3_functions(20)
# In[67]:
tests_3_functions(50)
# In[68]:
tests_3_functions(500)
# In[69]:
tests_3_functions(5000)
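# Beyond consistency between implementations, we can sanity-check the *value* of the complexity itself: the 1976 paper normalizes $c(n)$ by $n / \log_2(n)$, and for random binary sequences this ratio is expected to be of order $1$. A quick empirical sketch (the sizes are arbitrary choices), with a copy of the naive implementation so that the cell is self-contained:

```python
import math
import random

def lempel_ziv_complexity(sequence):
    """Copy of the naive implementation above, so this cell is self-contained."""
    sub_strings = set()
    ind, inc = 0, 1
    while ind + inc <= len(sequence):
        sub_str = sequence[ind : ind + inc]
        if sub_str in sub_strings:
            inc += 1
        else:
            sub_strings.add(sub_str)
            ind += inc
            inc = 1
    return len(sub_strings)

# For a random binary sequence, c(n) should be of order n / log2(n).
ratios = []
for n in [1000, 10000, 100000]:
    s = "".join(random.choice("01") for _ in range(n))
    c = lempel_ziv_complexity(s)
    ratio = c / (n / math.log2(n))
    ratios.append(ratio)
    print("n = {:6d}: c(n) = {:5d}, c(n) / (n / log2(n)) = {:.3f}".format(n, c, ratio))
```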
# ----
# ## Benchmarks
#
# On two examples of strings (binary sequences), we can compare our three implementations.
# In[70]:
get_ipython().run_line_magic('timeit', "lempel_ziv_complexity('100111101100001000001010')")
get_ipython().run_line_magic('timeit', "lempel_ziv_complexity_cython('100111101100001000001010')")
get_ipython().run_line_magic('timeit', "lempel_ziv_complexity_numba('100111101100001000001010')")
# In[71]:
get_ipython().run_line_magic('timeit', "lempel_ziv_complexity('10011110110000100000101000100100101010010111111011001111111110101001010110101010')")
get_ipython().run_line_magic('timeit', "lempel_ziv_complexity_cython('10011110110000100000101000100100101010010111111011001111111110101001010110101010')")
get_ipython().run_line_magic('timeit', "lempel_ziv_complexity_numba('10011110110000100000101000100100101010010111111011001111111110101001010110101010')")
# Let us check the time used by all three functions, for longer and longer sequences:
# In[72]:
get_ipython().run_line_magic('timeit', 'tests_3_functions(10, debug=False)')
get_ipython().run_line_magic('timeit', 'tests_3_functions(20, debug=False)')
get_ipython().run_line_magic('timeit', 'tests_3_functions(40, debug=False)')
get_ipython().run_line_magic('timeit', 'tests_3_functions(80, debug=False)')
get_ipython().run_line_magic('timeit', 'tests_3_functions(160, debug=False)')
get_ipython().run_line_magic('timeit', 'tests_3_functions(320, debug=False)')
# In[73]:
def test_cython(n):
    s = random_binary_sequence(n)
    c = lempel_ziv_complexity_cython(s)
    return c
# In[74]:
get_ipython().run_line_magic('timeit', 'test_cython(10)')
get_ipython().run_line_magic('timeit', 'test_cython(20)')
get_ipython().run_line_magic('timeit', 'test_cython(40)')
get_ipython().run_line_magic('timeit', 'test_cython(80)')
get_ipython().run_line_magic('timeit', 'test_cython(160)')
get_ipython().run_line_magic('timeit', 'test_cython(320)')
# In[75]:
get_ipython().run_line_magic('timeit', 'test_cython(640)')
get_ipython().run_line_magic('timeit', 'test_cython(1280)')
get_ipython().run_line_magic('timeit', 'test_cython(2560)')
get_ipython().run_line_magic('timeit', 'test_cython(5120)')
# In[76]:
get_ipython().run_line_magic('timeit', 'test_cython(10240)')
get_ipython().run_line_magic('timeit', 'test_cython(20480)')
# ----
# ## Complexity?
# $\implies$ The function `lempel_ziv_complexity_cython` seems to be indeed (almost) linear in $n$, the length of the binary sequence $S$.
#
# But let us check more precisely, as it could also have a complexity of $\mathcal{O}(n \log n)$.
# In[95]:
import matplotlib.pyplot as plt
import seaborn as sns
get_ipython().run_line_magic('matplotlib', 'inline')
sns.set(context="notebook", style="darkgrid", palette="hls", font="sans-serif", font_scale=1.4)
# In[96]:
import numpy as np
import timeit
# In[109]:
sizes = np.array(np.trunc(np.logspace(1, 6, 30)), dtype=int)
times = np.array([
    timeit.timeit(
        stmt="lempel_ziv_complexity_cython(random_string({}))".format(n),
        globals=globals(),
        number=10,
    )
    for n in sizes
])
# In[110]:
plt.figure(figsize=(15, 10))
plt.plot(sizes, times, 'o-')
plt.xlabel("Length $n$ of the binary sequence $S$")
plt.ylabel(r"Time in $\mathrm{s}$ (for 10 runs)")
plt.title("Time complexity of Lempel-Ziv complexity")
plt.show()
# In[111]:
plt.figure(figsize=(15, 10))
plt.loglog(sizes, times, 'o-')
plt.xlabel("Length $n$ of the binary sequence $S$")
plt.ylabel(r"Time in $\mathrm{s}$ (for 10 runs)")
plt.title("Time complexity of Lempel-Ziv complexity, loglog scale")
plt.show()
# The plot is a straight line of slope $\simeq 1$ in $\log\log$ scale, so the algorithm indeed seems to have an (almost) linear time complexity.
#
# To sum up: for a sequence $S$ of length $n$, it takes $\mathcal{O}(n)$ basic operations to compute its Lempel-Ziv complexity $\mathrm{Lempel}$-$\mathrm{Ziv}(S)$.
# ----
# ## Conclusion
#
# - The Lempel-Ziv complexity is not too hard to implement, and it indeed measures a certain complexity of a binary sequence, capturing its regularity and reproducibility.
#
# - Using [Cython](http://Cython.org/) was quite useful, giving a $\simeq \times 100$ speed-up over our naive pure-Python implementation!
#
# - The algorithm is not easy to analyze: we have a trivial $\mathcal{O}(n^2)$ bound, but experiments suggest it is more likely $\mathcal{O}(n \log n)$ in the worst case, and $\mathcal{O}(n)$ in practice for "not too complicated" sequences (or on average, for random sequences).
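# To probe these different regimes empirically, we can compare a maximally regular sequence (all zeros, where the candidate length `inc` grows large, so every slice and hash is expensive) against a random sequence of the same length. A minimal sketch with the naive implementation (the size and repetition count are arbitrary choices):

```python
import random
import timeit

def lempel_ziv_complexity(sequence):
    """Copy of the naive pure-Python implementation, so this cell is self-contained."""
    sub_strings = set()
    ind, inc = 0, 1
    while ind + inc <= len(sequence):
        sub_str = sequence[ind : ind + inc]
        if sub_str in sub_strings:
            inc += 1
        else:
            sub_strings.add(sub_str)
            ind += inc
            inc = 1
    return len(sub_strings)

n = 100000
results = {}
for name, s in [("all zeros", "0" * n),
                ("random", "".join(random.choice("01") for _ in range(n)))]:
    c = lempel_ziv_complexity(s)
    t = timeit.timeit(lambda: lempel_ziv_complexity(s), number=1)
    results[name] = (c, t)
    print("{:>9}: complexity = {:5d}, one run took {:.4f} s".format(name, c, t))
```

# The regular sequence has far fewer phrases, but each of them is long, which is exactly the regime where the per-iteration cost stops being constant.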
# ----
# ## (Experimental) [Julia](http://julialang.org) implementation
#
# I want to (quickly) try to see if I can use [Julia](http://julialang.org) to write a faster version of this function.
# See [issue #1](https://github.com/Naereen/Lempel-Ziv_Complexity/issues/1).
# In[118]:
get_ipython().run_cell_magic('time', '', '%%script julia\n\n"""Lempel-Ziv complexity for a sequence, in simple Julia code."""\nfunction lempel_ziv_complexity(sequence)\n    sub_strings = Set()\n    n = length(sequence)\n    ind = 1\n    inc = 1\n    while true\n        if ind + inc - 1 > n\n            break\n        end\n        sub_str = sequence[ind : ind + inc - 1]\n        if sub_str in sub_strings\n            inc += 1\n        else\n            push!(sub_strings, sub_str)\n            ind += inc\n            inc = 1\n        end\n    end\n    return length(sub_strings)\nend\n\ns = "1001111011000010"\nprintln(lempel_ziv_complexity(s))  # 1 / 0 / 01 / 11 / 10 / 110 / 00 / 010 ==> 8\n\nM = 1000;\nN = 10000;\nfor _ in 1:M\n    s = join(rand(0:1, N));\n    lempel_ziv_complexity(s);\nend\n')
# And to compare it fairly, let us use [Pypy](http://pypy.org) for comparison.
# In[122]:
get_ipython().run_cell_magic('time', '', '%%pypy\n\ndef lempel_ziv_complexity(sequence):\n    """Lempel-Ziv complexity for a binary sequence, in simple Python code."""\n    sub_strings = set()\n    n = len(sequence)\n    ind = 0\n    inc = 1\n    while True:\n        if ind + inc > n:\n            break\n        sub_str = sequence[ind : ind + inc]\n        if sub_str in sub_strings:\n            inc += 1\n        else:\n            sub_strings.add(sub_str)\n            ind += inc\n            inc = 1\n    return len(sub_strings)\n\ns = "1001111011000010"\nlempel_ziv_complexity(s)  # 1 / 0 / 01 / 11 / 10 / 110 / 00 / 010\n\nfrom random import random\n\nM = 1000\nN = 10000\nfor _ in range(M):\n    s = \'\'.join(str(int(random() < 0.5)) for _ in range(N))\n    lempel_ziv_complexity(s)\n')
# So we can check that, on these 1000 random trials on strings of size 10000, the naive Julia version is slower than the naive Python version (executed with Pypy for speedup).
# ----
# ## Ending notes
# > Thanks for reading!
# > My implementation is [now open-source and available on GitHub](https://github.com/Naereen/Lempel-Ziv_Complexity), at https://github.com/Naereen/Lempel-Ziv_Complexity.
#
# > It will be available from PyPI very soon; see https://pypi.python.org/pypi/lempel_ziv_complexity.
#
# > See [this repo on GitHub](https://github.com/Naereen/notebooks/) for more notebooks, or [on nbviewer.jupyter.org](https://nbviewer.jupyter.org/github/Naereen/notebooks/).
#
# > That's it for this demo! See you, folks!