Notebook

Brute Forces, Secretaries, and Dichotomies¶

Chapter 11 of Real World Algorithms.

Panos Louridas
Athens University of Economics and Business

Sequential Search¶

Sequential search is perhaps the most straightforward search method.

We start from the beginning and check each item in turn until we find the one we want.

It can be used for both sorted and unsorted data, but there are much better methods for sorted data.

Here is a straightforward implementation:

In [1]:

def sequential_search(a, s):
    for i, element in enumerate(a):
        if element == s:
            return i
    return -1

Let's check it on a random list:

In [2]:

import random

a = list(range(1000))
random.shuffle(a)

pos = sequential_search(a, 314)
print(a[pos], 'found at', pos)

pos = sequential_search(a, 1001)
print(pos)

314 found at 942
-1

We need not write sequential_search(a, s) in Python.

If a is a list, we can use a.index(s) instead.

In fact that's what we should do, because it is way faster (we saw that also in Chapter 7).

Here is the timing for our version:

In [3]:

import timeit

total_elapsed = 0
for i in range(100):
    a = list(range(10000))
    random.shuffle(a)
    start = timeit.default_timer()
    index = sequential_search(a, 314)
    end = timeit.default_timer()
    total_elapsed += end - start
print(total_elapsed)

0.028058045892976224

And here is the timing for the native version (which actually calls a function written in C):

In [4]:

total_elapsed = 0
for i in range(100):
    a = list(range(10000))
    random.shuffle(a)
    start = timeit.default_timer()
    index = a.index(314)
    end = timeit.default_timer()
    total_elapsed += end - start

print(total_elapsed)

0.006791955849621445

Matching, Comparing, Records, Keys¶

When we are searching for an item in the list, Python performs an equality test between each item and the item we are searching for.

The equality test is performed with the operator ==.

Checking for equality is not the same as checking whether two items are the same.

This is called strict comparison and in Python it is implemented with the operator is.

That means that the following two are equal:

In [5]:

an_item = (1, 2)

another_item = (1, 2)

an_item == another_item

Out[5]:

True

But they are not the same:

In [6]:

an_item is another_item

Out[6]:

False

As another example, let's see what happens with Python's support for complex numbers:

In [7]:

x = 3.14+1.62j
y = 3.14+1.62j
print(x == y)
print(x is y)

True
False

String comparison is must faster than equality checking, but it is not what we usually want to use.

A common idiom for identity checking in Python is checking for None, like if a is None or if a is not None.

In many cases, we hold information for entities in records, which are collections of attributes.

In that case, we want to search for an entity based on a particular attribute that identifies it.

The attribute is called a key.

In Python we can represent records as objects that are instances of a class.

Alternatively, we can represent them as dictionaries.

In fact, Python objects use dictionaries internally.

Let's get a list of two persons.

In [8]:

john = {
    'first_name': 'John',
    'surname': 'Doe',
    'passport_no': 'AI892495',
    'year_of_birth': 1990,
    'occupation': 'teacher'
}

jane = {
    'first_name': 'Jane',
    'surname': 'Roe',
    'passport_no': 'AI485713',
    'year_of_birth': 1986,
    'occupation': 'civil engineer'
}

persons = [john, jane]

In this example, the key for our search would be the passport number, because we would like to find the full information for a person with that particular piece of identification.

To do that we could re-implement sequential search so that we provide to it the comparison function.

In [9]:

def sequential_search_m(a, s, matches):
    for i, element in enumerate(a):
        if matches(element, s):
            return i
    return -1

def match_passport(person, passport_no):
    return person['passport_no'] == passport_no

pos = sequential_search_m(persons, 'AI485713', match_passport)
print(persons[pos], 'found at', pos)

{'first_name': 'Jane', 'surname': 'Roe', 'passport_no': 'AI485713', 'year_of_birth': 1986, 'occupation': 'civil engineer'} found at 1

Although you would probably use something more Pythonic like:

In [10]:

results = [(i, p) for i, p in enumerate(persons) if p['passport_no'] == 'AI485713']
results

Out[10]:

[(1,
  {'first_name': 'Jane',
   'surname': 'Roe',
   'passport_no': 'AI485713',
   'year_of_birth': 1986,
   'occupation': 'civil engineer'})]

Self-Organizing Search¶

In self-organizing search, we take advantage of an item's popularity to move it to the front of the collection in which we are performing our searches.

In the move-to-front method, when we find an item we move it directly to the front.

In the transposition method, when we find an item we swap it with its predecessor (if any).

We cannot implement directly the algorithms for lists given in the book (that is, algorithm 11.2 and algorithm 11.3) for the simple reason that Python hides the list implementation from us.

Moreover, Python lists are not linked lists. They are variable-length arrays (see the online documentation for details on the implementation of lists in Python).

We can implement algorithm 11.3, which is the transposition method for arrays.

In [11]:

def transposition_search(a, s):
    for i, item in enumerate(a):
        if item == s:
            if i > 0:
                a[i-1], a[i] = a[i], a[i-1]
                return i-1
            else:
                return i
    return -1

How can we test transposition_search(a, s)?

We need to do some groundwork to emulate a situation of popularity-biased searches.

In particular, we will create a setting where the items we are searching for are governed by Zipf's law.

First, we'll write a function that provides the Zipf probability for $n$ items.

In [12]:

def zipf(n):
    h = 0
    for x in range(1, n+1):
        h += 1/x
    z = [ 1/x * 1/h for x in range(1, n+1) ]
    return z

We'll work with 1000 items:

In [13]:

zipf_1000 = zipf(1000)

We can check that they sum up to 1, and see the first 20 of the probabilities.

In [14]:

print(sum(zipf_1000))
print(zipf_1000[:20])

1.0000000000000018
[0.13359213049244018, 0.06679606524622009, 0.044530710164146725, 0.033398032623110044, 0.026718426098488037, 0.022265355082073363, 0.019084590070348597, 0.016699016311555022, 0.014843570054715576, 0.013359213049244019, 0.012144739135676381, 0.011132677541036681, 0.010276317730187707, 0.009542295035174298, 0.008906142032829346, 0.008349508155777511, 0.007858360617202364, 0.007421785027357788, 0.007031164762760009, 0.006679606524622009]

Again we will be performing our searches on 1000 items, in random order.

In [15]:

a = list(range(1000))
random.shuffle(a)
print(a[:20])

[965, 274, 365, 189, 152, 107, 624, 641, 767, 475, 750, 824, 490, 524, 698, 211, 958, 607, 13, 599]

We will perform 100000 searches among these items.

We want the searches to follow Zipf's law.

First, we will create another list of 1000 items in random order.

In [16]:

b = list(range(1000))
random.shuffle(b)
print(b[:20])

[934, 515, 787, 618, 387, 654, 322, 164, 810, 204, 453, 415, 109, 855, 402, 53, 728, 581, 280, 809]

Then, we will select 100000 items from the second list, using the Zipf probabilities.

That means that we will be selecting the first item with probability zipf_1000[0], the second item with probability zipf_1000[1], and so on.

In [17]:

searches = random.choices(b, weights=zipf_1000, k=100000)

Indeed, we can verify that the popularity of items in searches mirrors b:

In [18]:

from collections import Counter

counted = Counter(searches)
counted.most_common(20)

Out[18]:

[(934, 13486),
 (515, 6695),
 (787, 4358),
 (618, 3372),
 (387, 2703),
 (654, 2284),
 (322, 1879),
 (164, 1728),
 (810, 1461),
 (204, 1289),
 (453, 1224),
 (109, 1096),
 (415, 1073),
 (855, 941),
 (402, 845),
 (53, 808),
 (728, 774),
 (581, 759),
 (280, 709),
 (809, 701)]

So, we will perform 100000 searches in the first list, using as keys the items in searches.

Because transposition_search(a, s) changes a, we will keep a copy of it to use it to compare the performance with simple sequential search.

At the end, apart from displaying the time elapsed we will also show the first items of the changed a, to see how popular searches have gone to the beginning.

In [19]:

a_copy = a[:]

total_elapsed = 0
for s in searches:
    start = timeit.default_timer()
    index = transposition_search(a, s)
    end = timeit.default_timer()
    total_elapsed += end - start

print(total_elapsed)
print(a[:20])

1.3782846610993147
[934, 515, 387, 787, 618, 322, 654, 164, 204, 415, 810, 109, 53, 453, 402, 855, 750, 58, 581, 728]

We will now perform the same searches with a_copy using simple sequential search.

In [20]:

total_elapsed = 0
for s in searches:
    start = timeit.default_timer()
    index = sequential_search(a_copy, s)
    end = timeit.default_timer()
    total_elapsed += end - start

print(total_elapsed)

2.779128445254173

The Secretary Problem¶

The secretary problem requires selecting the best item when we have not seen, and we cannot wait to see, the full sets of items.

The solution is an online algorithm. We find the best item among the first $n/e$, where $n$ is the total expected number of items, and $e \approx 2.71828$ is [Euler's number](https://en.wikipedia.org/wiki/E_(mathematical_constant).

Then we select the first of the remaining items that is better than that. The probability that we'll indeed select the best item is $n/e \approx 37\%$.

Here is how we can do that:

In [21]:

import math

def secretary_search(a):
    # Calculate |a|/n items.
    m = int((len(a) // math.e) + 1)
    
    # Find the best among the first |a|/n.
    c = 0
    for i in range(1, m):
        if a[i] > a[c]:
            c = i
    
    # Get the first that is better from the one
    # we found, if possible.
    for i in range(m, len(a)):
        if a[i] > a[c]:
            return i
    return - 1

Does secretary_search(a) find the best item in a about 37% of the time?

To check that, we'll continue working in a similar fashion. We'll perform 1000 searches in 1000 items and see how often we do come up with the best item.

In [22]:

total_found = 0

for i in range(1000):
    a = list(range(1000))
    random.shuffle(a)
    index = secretary_search(a)
    max_index = a.index(max(a))
    if index == max_index:
        total_found += 1

print(f"found  {total_found} out of {i+1}, {100*total_found/(i+1)}%")

found  379 out of 1000, 37.9%

Binary Search¶

Binary search is the most efficient way to search for an item when the search space is ordered.

It is an iterative algorithm, where in each iteration we split the search space in half.

We start by asking if the search item is in the middle of the search space. Let's assume that the items are ordered in ascending orded.

If it is greater than the item in the middle, we repeat the question on the right part of the search space; it it is smaller, we repeat the question on the left part of the search space. We continue until we find the item, or we cannot perform a split any more.

In [23]:

def binary_search(a, item):

    # Initialize borders of search space.
    low = 0
    high = len(a) - 1

    # While the search space is not empty:
    while low <= high:
        # Split the search space in the middle.
        mid = low + (high - low) // 2
        # Compare with midpoint.
        c = (a[mid] > item) - (a[mid] < item)
        # If smaller, repeat on the left half.
        if c < 0:
            low = mid + 1
        # If greater, repeat on the right half.
        elif c > 0:
            high = mid - 1
        # If found, we are done.
        else:
            return mid

    return -1

In Python 3 there is no cmp(x, y) function that compares x and y and returns -1, 0, or 1, if x > y, x == y, or x < y, respectively. We use the

(x > y) - (y < x)```
idiom instead.

Note also the line where we calculate the midpoint:
```python
mid = low + (high - low) // 2

This guards against overflows. In Python that is not necessary, because there is no upper limit in integers, so it could be:

mid = (low + high) // 2

However, this is a problem in most other languages, so we'll stick with the foolproof version.

To see how binary search works we can add some tracing information in binary_search(a, item):

In [24]:

def binary_search_trace(a, item):

    print("Searching for", item)
    # Initialize borders of search space.
    low = 0
    high = len(a) - 1

    # While the search space is not empty:
    while low <= high:
        # Split the search space in the middle.
        mid = low + (high - low) // 2
        # Compare with midpoint.
        c = (a[mid] > item) - (a[mid] < item)
        print(f"({low}, {a[low]}), ({mid}, {a[mid]}), ({high}, {a[high]})")
        
        # If smaller, repeat on the left half.
        if c < 0:
            low = mid + 1
        # If greater, repeat on the right half.
        elif c > 0:
            high = mid - 1
        # If found, we are done.
        else:
            return mid

    return -1

a = [4, 10, 31, 65, 114, 149, 181, 437,
     480, 507, 551, 613, 680, 777, 782, 903] 

binary_search_trace(a, 149)
binary_search_trace(a, 181)
binary_search_trace(a, 583)
binary_search_trace(a, 450)
binary_search_trace(a, 3)

Searching for 149
(0, 4), (7, 437), (15, 903)
(0, 4), (3, 65), (6, 181)
(4, 114), (5, 149), (6, 181)
Searching for 181
(0, 4), (7, 437), (15, 903)
(0, 4), (3, 65), (6, 181)
(4, 114), (5, 149), (6, 181)
(6, 181), (6, 181), (6, 181)
Searching for 583
(0, 4), (7, 437), (15, 903)
(8, 480), (11, 613), (15, 903)
(8, 480), (9, 507), (10, 551)
(10, 551), (10, 551), (10, 551)
Searching for 450
(0, 4), (7, 437), (15, 903)
(8, 480), (11, 613), (15, 903)
(8, 480), (9, 507), (10, 551)
(8, 480), (8, 480), (8, 480)
Searching for 3
(0, 4), (7, 437), (15, 903)
(0, 4), (3, 65), (6, 181)
(0, 4), (1, 10), (2, 31)
(0, 4), (0, 4), (0, 4)

Out[24]:

-1

Binary search is very efficient—in fact, it is as efficient as a search method can be (there is a smalll caveat here, concerning searching in quantum computers, but we can leave that aside).

If have have $n$ items, it will complete the search in $O(\lg(n))$.

Once again, we can verify that theory agrees with practice. We will perform 1000 searches, 500 of them successful and 500 of them unsuccessful and count the average number of iterations required. To do that, we'll change binary_search(a, item) so that it also returns the number of iterations.

In [25]:

def binary_search_count(a, item):

    # Initialize borders of search space.
    low = 0
    high = len(a) - 1
    
    i = 0
    # While the search space is not empty:
    while low <= high:
        i += 1
        # Split the search space in the middle.
        mid = low + (high - low) // 2
        # Compare with midpoint.
        c = (a[mid] > item) - (a[mid] < item)
        
        # If smaller, repeat on the left half.
        if c < 0:
            low = mid + 1
        # If greater, repeat on the right half.
        elif c > 0:
            high = mid - 1
        # If found, we are done.
        else:
            return (i, mid)

    return (i, -1)

We build up our test suite. Our items will be 1000 random numbers in the range from 0 to 999,999.

We will select 500 of them to perform matching searches, and another 500, not in them, to perform non-matching searches.

In [26]:

num_range = 1000000
# Get 1000 random numbers from 0 to 999999.
a = random.sample(range(num_range), k=1000)
# Select 500 from them for our matching searches.
existing = random.sample(a, k=500)
# Select another 500 random numbers in the range, 
# not in the set a, for our non-matching searches
non_existing = random.sample(set(range(num_range)) - set(a), k=500)

# Verify that the matching and non-matchin sets are distinct.
print(set(existing) & set(non_existing))

set()

So now we can see how the average number of iterations in practice fares compared to what predicted by theory.

In [27]:

total_iters = 0
for matching, non_matching in zip(existing, non_existing):
    matching_iters, _ = binary_search_count(a, matching)
    non_matching_iters, _ = binary_search_count(a, non_matching)
    total_iters += (matching_iters + non_matching_iters)
    
print(f"Average iterations:", total_iters / (len(existing) + len(non_existing)))
print(f"lg(1000) = {math.log(1000, 2)}")

Average iterations: 9.743
lg(1000) = 9.965784284662087