You know from your own experience that some ways of accomplishing a task take a lot longer than others. Going more slowly may have its advantages: in real life we're juggling multiple tasks and have to weigh trade-offs.
For example, you could wolf down your dinner in a hurry, and if the only objective were to eat and run, that could be appropriate. However, dinner may be an important social occasion, and eating quickly and leaving the table only deprives you of catching up on important developments.
A first question, when faced with a computing challenge, is whether there's a solution at all. The second question is whether this solution is at all practical in terms of time and resources. A third question is how much original work will be required on your part, and how long might that take.
The first question involves research and not reinventing the wheel. If people stopped to reinvent every algorithm from scratch, we would move forward much too slowly. Everyone would need the same insights and genius as everyone else, and we know the world is not like that.
Competitive Programming and Coding Interviews do not usually ask for acts of genius, as these do not come on demand, regardless of practice.
On the other hand, simply grabbing a published algorithm from a book without having much insight into how the algorithm actually works, or why, may lead to its own host of troubles down the road.
Not really understanding how an algorithm works should not prevent you from using it, though. In computing, we learn to treat some elements of our system as "black boxes", meaning we have no insight into the internals but may use them anyway. If this seems like a dark-ages approach, with its reliance on "magic spells", so be it. The wizards and witches of the UNIX era earned that reputation for a reason. Not everyone could read a regex.
The second question (how long?) involves understanding what's called "Big O notation", which looks something like $O(n^{2})$ or $O(n \log n)$. Big O notation describes roughly how the time an algorithm needs grows as some input "n" grows in size.
From: Competitive Programmer’s Handbook, Antti Laaksonen, Draft, August 19, 2019 (link to PDF)
For example, n might be the number of digits in a number, and you want to know whether the number is prime, or what its prime factors are. Many off-the-shelf algorithms address these very questions.
from primes import isprime, factors, all_factors
isprime(23321)
True
all_factors(2)
[1, 2]
factors(4839274) # this version includes 1 as a factor...
(1, 2, 11, 11, 19997)
The cell below typifies what you will need if hacking on a module in the background and then testing it here in Jupyter Notebook. If you're not changing the underlying code, you need not force recompilation.
from importlib import reload  # imp is deprecated; reload now lives in importlib
import primes.primesplay
reload(primes.primesplay)
<module 'primes.primesplay' from '/Users/mac/Documents/elite_school/primes/primesplay.py'>
factors(12) # but not the number itself
(1, 2, 2, 3)
factors(11 * 11 * 13 * 13 * 7)
(1, 7, 11, 11, 13, 13)
# what factors() returns should multiply together giving
# the original number
from functools import reduce
from operator import mul
f = factors(360) # list of factors
reduce(mul, f) # multiply all fs together
360
all_factors(12)
[1, 2, 3, 4, 6, 12]
def proper_divisors(n):
    return [f for f in all_factors(n)][:-1]
print(proper_divisors(28))
[1, 2, 4, 7, 14]
def aliquot_sum(n):
    return sum(proper_divisors(n))
print([aliquot_sum(x) for x in range(1, 30)])
[0, 1, 1, 3, 1, 6, 1, 7, 4, 8, 1, 16, 1, 10, 9, 15, 1, 21, 1, 22, 11, 14, 1, 36, 6, 16, 13, 28, 1]
aliquot_sum(28) # perfect number
28
Or n is the number of points floating in XYZ space and you wish to generate their convex hull: the smallest convex polyhedron that contains all of them, either as corners or internally. An open source program named Qhull will do this for you.
If a measure of the algorithm's efficiency is something like $O(n!)$, that means the time needed grows explosively as n gets bigger. Often we say such an algorithm takes "exponential time" or worse (factorial time, in this case), and usually that means the algorithm is impractical for all but small values of n. But how large is too large?
If the efficiency is more like $O(n)$, we say the time increases "linearly with n", which is about as good as it gets, although $O(\log n)$ is even better. Best of all is $O(1)$, "constant time", meaning about the same amount of work regardless of size. "Just giving it a name" (e.g. bigdata = BigPile) is perhaps such an operation, but may not get us very far.
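To get a feel for how quickly these classes pull apart, here's a quick, purely illustrative sketch (standard library only):
from math import log2, factorial

# compare the growth of log n, n log n, n^2, 2^n and n! side by side
for n in (10, 20, 30):
    print(f"n={n}: log n = {log2(n):.1f}, n log n = {n * log2(n):.0f}, "
          f"n^2 = {n**2}, 2^n = {2**n}, n! = {factorial(n)}")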
For example, the amount of time it takes to scan a list for a specific value tends to increase linearly with the length of the list. A list 10 times longer takes, on average, 10 times longer to search. Looking up a value by key in a dict, however, takes about the same amount of time no matter how big the dict gets.
Remember #Python sets and dicts use a hash table internally which makes them very fast - algorithmic complexity of O(1) - for lookups (e.g. using the "in" operator).
— Bob Belderbos (@bbelderbos) February 15, 2022
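To see the difference for yourself, here's a small, illustrative timing sketch in the spirit of the %timeit cells further down; the exact numbers will vary by machine:
big_list = list(range(100_000))
big_set = set(big_list)
%timeit 99_999 in big_list   # scans the list: time grows with its length, O(n)
%timeit 99_999 in big_set    # hash table lookup: roughly constant time, O(1)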
Sometimes you might find exactly the algorithm you need, including in the computer language of your choice. You may need to customize it a little, to fit into your larger project, but that's a minor detail.
Other times, you may need to invent your own algorithms. You might want to do this anyway, because you see a clear way to accomplish a task and feel relying on others might take more time. You might elect to devise your own solution for many reasons, including because you're in school learning how to write algorithms.
One technique involved in writing your own solution is writing tests for that solution.
How will you know your solution is correct? This often requires having an independent and trusted source of "right answers" such as the digits of pi (if you're trying to generate them) or a list of consecutive prime numbers (if you're trying to generate primes).
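For instance, here is a minimal sketch of such a test, assuming the isprime imported above behaves as a standard primality test, with a short hand-written list of primes serving as the trusted source:
# the first ten primes, written out by hand as the trusted reference
known_primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]

for n in range(2, 30):
    # isprime should say True exactly when n appears in the reference list
    assert isprime(n) == (n in known_primes), f"disagreement at {n}"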
On the other hand, perhaps your algorithm is mathematically provable (many of them are, see below), meaning you're able to offer it with great confidence in its correctness, irrespective of implementation details.
Speaking of generating prime numbers, the Sieve of Eratosthenes is a great example of an ancient algorithm, passed down through the ages. See Notes.
from primes import eratosthenes
print(eratosthenes(100)) # all primes <= 100
[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97]
For example, in the case of our Wordle-like project, evaluating a guess against a right answer and coming up with the correct feedback is something one can do manually, without a computer, once the rules of the game are understood. If your algorithm matches all the results you know are correct, having derived them yourself, then that's evidence you're making progress.
More valuable than tests, or at least complementary to tests, would be some kind of mathematical proof that the algorithm is in fact correct.
Some algorithms may produce "false positives" or "false negatives". For example, the Fermat Primality test may falsely flag a composite as prime, but it will never falsely flag a prime as composite.
from math import gcd

def fermat_test(n, base=2):
    if gcd(n, base) > 1:  # n is composite
        return False
    if pow(base, n-1, n) == 1:
        return True       # pass
    else:
        return False      # no pass

def fermat(n):
    bases = [2, 3, 5, 7, 11]
    if 5 == sum([fermat_test(n, b) for b in bases]):
        return True
    return False
Although Li Shanlan (1811–1882), a Qing dynasty mathematician, later realized his error in claiming that any odd n is prime if and only if $2^{n-1} \equiv 1 \pmod n$, it was too late: the so-called "Chinese Hypothesis" had made it into the literature.
We can see why this conjecture was tempting:
shanlan = [n for n in range(1, 300) if fermat_test(n) ]
print(shanlan)
[3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151, 157, 163, 167, 173, 179, 181, 191, 193, 197, 199, 211, 223, 227, 229, 233, 239, 241, 251, 257, 263, 269, 271, 277, 281, 283, 293]
sifted = eratosthenes(300)
sifted[1:] == shanlan # but for 2 itself
True
shanlan = [ n for n in range(3, 301, 2) if fermat_test(n)]
print(shanlan)
[3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151, 157, 163, 167, 173, 179, 181, 191, 193, 197, 199, 211, 223, 227, 229, 233, 239, 241, 251, 257, 263, 269, 271, 277, 281, 283, 293]
fermat_test(341) # however...
True
Dr. Sarrus (1798–1861) discovered that 341 is the first (lowest) exception to this rule; Sarrus pseudoprimes are Fermat pseudoprimes that pass with base 2.
A Fermat pseudoprime passes the Fermat test for some relatively prime base, and yet is not prime. Mathematicians know such Fermat pseudoprimes exist for any base and have recipes for creating them.
The Carmichael Numbers pass the Fermat Test for all bases to which they are relatively prime.
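We can hunt for small Fermat pseudoprimes ourselves, combining the fermat_test above with the isprime imported at the top of the notebook (a sketch; 341 and the Carmichael number 561 should both show up):
# odd composites below 700 that nevertheless pass the base-2 Fermat test
pseudo = [n for n in range(3, 700, 2) if fermat_test(n) and not isprime(n)]
print(pseudo)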
factors(341)
(1, 11, 31)
One way to strengthen the Fermat Test is to apply several bases. The mere fact of checking for relative primality weeds out a lot of false positives.
fermat(1729)
False
The algorithms below are too slow for use with big numbers, but have the advantage of being conceptually clear, and so good for nailing down concepts ahead of time.
factors(1729)
(1, 7, 13, 19)
fermat(561) # Carmichael Number, not really prime
False
Along with proving and testing comes assessing the algorithm's efficiency (see above). As the inventor of the algorithm, it's up to you to get a sense of what its Big-O signature might be.
If you plan to publish your algorithm as a useful invention available to a wider public, such an analysis may be expected of you.
A lot of deep computer science has gone into figuring out how to prove whether a program is correct. In practice, one often can provide a proof to everyone's satisfaction. In theory, however, there's no guaranteed strategy for proving a program is bulletproof.
At this level of generality we're talking about the fabric of logic itself, and logicians had already come to similar conclusions, about the generic incompleteness of proof-based systems, even before computer scientists raised the issue of provability.
from math import gcd

def isprime(n):  # slow, inefficient, but clear
    return (True
            if sum([gcd(n, i) for i in range(1, n)]) == (n - 1)
            else False)
isprime(13)
True
def primes(n):
    output = []
    for i in range(2, n):
        if isprime(i):
            output.append(i)
    return output
One of the oldest algorithms that's been passed down through the ages is known as Euclid's Algorithm. Whether Euclid invented it is not the key question. The method was included among Euclid's published techniques.
The purpose of Euclid's Algorithm (EA) is to find the biggest number that will divide two other numbers. We're talking about positive integers, such that if EA(a, b) is 1, then a/b is already a fraction in lowest terms.
For example, EA(24, 16) is 8, because 8 is the largest integer that divides both 24 and 16: their greatest common divisor. 2 is also a divisor, but isn't the greatest. If you came across the fraction 16/24, you would know you could reduce it to 2/3, and at that point there's no further reduction possible: gcd(2, 3) = 1. When the numerator and denominator are relatively prime, i.e. have no factors in common other than 1 (e.g. 14/23), we say the fraction is in "lowest terms".
When computing the gcd of a pair of numbers, sometimes one of the numbers is already a divisor of the other, in which case that's the answer. The gcd of 10 and 5 is simply 5.
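As a quick illustration, the standard library's math.gcd and fractions.Fraction already embody these ideas; a Fraction reduces itself to lowest terms automatically:
from math import gcd
from fractions import Fraction

print(gcd(16, 24))       # 8, the greatest common divisor
print(Fraction(16, 24))  # 2/3, the same fraction in lowest terms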
The implementation of a "greatest common divisor" solution below is not Euclid's Method, but a relatively inefficient method that is nevertheless not hard to understand.
Get all the divisors of input numbers j and k, then get the greatest divisor they both have in common.
def divisors(n : int):
    divs = list()
    for d in range(1, n + 1):
        if n % d == 0:
            divs.append(d)
    return divs
all_factors(12)
[1, 2, 3, 4, 6, 12]
all_factors(36)
[1, 2, 3, 4, 6, 9, 12, 18, 36]
def gcd(a, b):
    return max(set(all_factors(a)).intersection(set(all_factors(b))))
gcd(24, 10)
2
gcd(19, 21)
1
gcd(10045, 4095)
35
This works!
However, Euclid's Method (another name for Euclid's Algorithm) is much cleverer.
Consider our two inputs to be m and n. If the smaller m divides the larger n with no remainder, we're done: m is the greatest common divisor.
It could be that m is down to 1 at this point (see below), in which case the original n and m were "strangers" to one another, meaning their only factor in common was the number 1. We could also say they were "relatively prime".
from math import gcd
gcd(19, 101) # both prime
1
gcd(18, 13) # relatively prime (strangers)
1
Otherwise, if m is not 1, and there's some remainder r after dividing n by m, then the next step is to find the greatest divisor of m and r. Since n = q·m + r, whatever divides both m and r will also divide n, the longer starting length.
And so on, we keep going. The n, m pair keeps getting smaller as we keep taking the remainder from the previous pair. Down to m = 1 at the limit, but we don't necessarily get to that. Whenever m divides n, we're done.
Here's a first attempt at implementing the above:
def EA(m, n):
    if m > n:
        m, n = n, m        # make m the smaller
    while True:
        q = n // m         # get quotient
        if q * m == n:     # if no remainder, done
            return m
        n, m = m, n - q*m  # m, r -- the new terms
EA(12, 100)
4
EA(10045, 4095)
35
%timeit gcd(4839274, 639232)
761 ns ± 304 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit EA(4839274, 639232)
5.1 µs ± 1.02 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
EA is hugely more efficient than the all-divisors based algorithm, taking only microseconds where the divisors approach would take milliseconds on numbers this size. (The gcd timed above for comparison is the built-in math.gcd imported earlier, which is faster still.)
EA does not require finding all the divisors of any number. EA eventually comes to an end, with an answer of at least 1, so there's no danger the while True loop will be infinite.
Speaking of loops, here's another implementation of EA employing recursion:
def EA(m, n):
    if m > n:
        m, n = n, m  # make m the smaller
    r = n % m        # remainder
    if r == 0:       # if no remainder, done
        return m
    return EA(m, r)
%timeit EA(4839274, 639232)
5.55 µs ± 823 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The pithiest (most Pythonic) implementation of EA is attributed to Guido van Rossum, Python's inventor.
Here it is:
def EA(m, n):
    while m:
        n, m = m, n % m
    return n
EA(12, 100)
4
%timeit EA(4839274, 639232)
1.74 µs ± 306 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
One way to get the Lowest Common Multiple of two numbers, m and n, is to take their product and then divide by their GCD.
def lcm(a, b):
    return (a * b) // gcd(a, b)
lcm(27, 72)
216
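The standard library now offers this directly as math.lcm (added in Python 3.9), in case you'd rather not roll your own:
import math      # math.lcm is available in Python 3.9 and later

math.lcm(27, 72) # 216, same as the home-grown version above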
Once we have the GCD of two numbers m and n, we should be able to stack bricks of these two lengths side-by-side such that the two stacks differ only by the GCD itself. How many bricks of each length will we need? That's what the Extended Euclidean Algorithm (EEA) will tell us.
This puzzle piece proves useful when we introduce RSA, the public key cryptography algorithm. Follow the 2nd link below for more information.
from primes.euler_test import xgcd
m, n = 117, 72
g, a, b = xgcd(m, n)
print(g, a, b)
9 -3 5
a * m + b * n
9
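The xgcd imported above comes from the course's primes package, so its internals aren't shown here; a minimal sketch of the Extended Euclidean Algorithm returning the same (g, a, b) triple might look like this:
def xgcd_sketch(a, b):
    """Return (g, x, y) such that x*a + y*b == g == gcd(a, b)."""
    old_r, r = a, b  # remainders
    old_x, x = 1, 0  # coefficients of a
    old_y, y = 0, 1  # coefficients of b
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_x, x = x, old_x - q * x
        old_y, y = y, old_y - q * y
    return old_r, old_x, old_y

xgcd_sketch(117, 72)  # should give (9, -3, 5), matching xgcd above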
Link to source code for the Sieve of Eratosthenes, factors, isprime and so on.
Suppose you have a sequence such as the Fibonacci numbers, and for some reason want a cumulative total of all the numbers in the sequence up to a specific term. In fact, you would like a cumulative total to go with every term in the original sequence.
Like this:
# Fibonacci Numbers
# https://oeis.org/A000045
def fibo(a=0, b=1):
    while True:
        yield a
        b, a = a + b, b
gen = fibo()
fibs = [next(gen) for _ in range(10)]
print(fibs)
[0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
def accumulate(seq):
    totals = []
    term = 0
    for i in seq:
        term += i
        totals.append(term)
    return totals
accumulate(fibs)
[0, 1, 2, 4, 7, 12, 20, 33, 54, 88]
Interestingly, the running totals of the Fibonacci numbers match the Fibonacci sequence itself, shifted forward a couple of terms, with one subtracted from each term.
Imagine subtracting 1 from each of these terms:
fibs[2:]
[1, 2, 3, 5, 8, 13, 21, 34]
Now let's do it:
[fib-1 for fib in fibs[2:]]
[0, 1, 2, 4, 7, 12, 20, 33]
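A quick check of that identity, using the fibs list and the accumulate function defined above:
# the first 8 running totals equal F(k+2) - 1 for k = 0..7
accumulate(fibs)[:8] == [fib - 1 for fib in fibs[2:]]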
Here's another sequence and its correlated cumulative sequence: the cuboctahedral numbers.
A way to pack (equi-sized) balls together very efficiently is in what's called the CCP arrangement, or Cubic Close Packing. The growing shape depicted above is a 14-faced (8 triangle, 6 square) "cuboctahedron" and it is growing outward in layers. The first layer around the nuclear ball has 12 balls. The next layer has 42. The sequence is known in the OEIS as A005901.
Let's generate it, and accumulate.
def cubocta(n : int):
    """
    n is number of layers (0 for nuclear ball)
    """
    if n == 0:
        return 1
    return 10 * n * n + 2
a005901 = [cubocta(i) for i in range(15)] # first 15 cuboctahedral numbers
print(a005901)
[1, 12, 42, 92, 162, 252, 362, 492, 642, 812, 1002, 1212, 1442, 1692, 1962]
a005902 = accumulate(a005901)
print(a005902)
[1, 13, 55, 147, 309, 561, 923, 1415, 2057, 2869, 3871, 5083, 6525, 8217, 10179]
Now that we have done the work to implement accumulate, let's take a look at the tool already provided in the standard library. It's in itertools, a goldmine of generic Python generator type objects.
from itertools import accumulate
result = accumulate(a005901)
print(list(result))
[1, 13, 55, 147, 309, 561, 923, 1415, 2057, 2869, 3871, 5083, 6525, 8217, 10179]
Remember itertools when your topic is permutations and/or combinations.
Write a function that takes a sequence and returns a sequence of differences between successive terms.
For example, [0, 1, 2, 4, 7, 12, 20, 33] would output [1, 1, 2, 3, 5, 8, 13]. The resulting sequence will have one less term than the input sequence.
from ads_solutions import diffs # secret stash of solutions? what's your solution?
diffs([0, 1, 2, 4, 7, 12, 20, 33])
[1, 1, 2, 3, 5, 8, 13]
fibs
[0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
print(diffs(fibs))
[1, 0, 1, 1, 2, 3, 5, 8, 13]
print(diffs(a005902))
[12, 42, 92, 162, 252, 362, 492, 642, 812, 1002, 1212, 1442, 1692, 1962]
list(zip([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]))
[(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
diffs([0, -10, 5, -1, -2, 100])
[-10, 15, -6, -1, 102]
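If you want to compare notes, here's one possible way to write diffs, a sketch built on the zip pairing shown just above (the imported diffs may well be implemented differently):
def diffs_sketch(seq):
    # pair each term with its successor, then subtract
    return [b - a for a, b in zip(seq, seq[1:])]

diffs_sketch([0, 1, 2, 4, 7, 12, 20, 33])   # [1, 1, 2, 3, 5, 8, 13]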
List Comprehension Syntax and related concepts
Learning Math with Python (slide show on Github) by the author of a book by that title, Amit Saha.
Math Adventures with Python by Peter Farrell, depicted below (on the left):
Lots of Python in this MIT course:
from IPython.display import YouTubeVideo
YouTubeVideo("k6U-i4gXkLM")