This notebook shows off the canonical higher-order functions.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
def square(x):
    return x ** 2

def iseven(n):
    return n % 2 == 0

def add(x, y):
    return x + y

def mul(x, y):
    return x * y

def lesser(x, y):
    if x < y:
        return x
    else:
        return y

def greater(x, y):
    if x > y:
        return x
    else:
        return y
# map works like this
map(square, data)
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
# In this way it's like numpy's broadcasting operators
import numpy as np
X = np.arange(1, 11)
X**2
array([ 1, 4, 9, 16, 25, 36, 49, 64, 81, 100])
But map is pure Python, so it works with any Python function, not just the arithmetic that NumPy knows how to broadcast:
def fib(i):
    if i in (0, 1):
        return i
    else:
        return fib(i - 1) + fib(i - 2)
map(fib, data)
[1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
# Normally we would perform this in the following way
result = []
for item in data:
    result.append(fib(item))
result
[1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
# looking at the function above gives us a good pattern for how to define `map`
# We just abstract out the function `fib` for a user input
# `map` is easy to define
def map(fn, sequence):
    result = []
    for item in sequence:
        result.append(fn(item))
    return result
Little-known fact: object methods are perfectly valid functions too.
map(str.upper, ['Alice', 'Bob', 'Charlie'])
['ALICE', 'BOB', 'CHARLIE']
The map function is so important that it was given its own syntax: the list comprehension.
[fib(i) for i in data]
[1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
[name.upper() for name in ['Alice', 'Bob', 'Charlie']]
['ALICE', 'BOB', 'CHARLIE']
The filter higher-order function filters a dataset by a predicate. A predicate is a function that returns True or False. The filter function returns a new list containing only those elements for which the predicate holds.
filter(iseven, data)
[2, 4, 6, 8, 10]
from sympy import isprime # Only works if you have the sympy math library installed
filter(isprime, data)
[2, 3, 5, 7]
def filter(predicate, sequence):
    result = []
    for item in sequence:
        if predicate(item):
            result.append(item)
    return result
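Like map, filter also has a list-comprehension spelling: the if clause plays the role of the predicate. A small sketch using the iseven predicate from above (redefined here so the snippet stands alone):

```python
def iseven(n):
    return n % 2 == 0

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# The comprehension's `if` clause filters, just like the predicate in filter
evens = [item for item in data if iseven(item)]
# evens == [2, 4, 6, 8, 10]
```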
Reduce is the little sibling of map and filter. It is much less popular and is often scolded as being difficult to understand. Despite its social problems, reduce is quite powerful; once you write reduce yourself, you'll understand how it works. More importantly, you will learn how to identify reduction operations and how to pair them with binary operators. Reductions are common in data analytics, particularly when condensing large datasets into small synopses.
To show reduce we'll first implement two common reductions, sum and min. We've written them suggestively with the binary operators add and lesser to highlight their similar structure. Pick out the parts of the following two functions that differ from each other.
def sum(sequence):
    result = 0
    for item in sequence:
        # result = result + item
        result = add(result, item)
    return result
def min(sequence):
    result = 99999999999999  # a really big number
    for item in sequence:
        # result = result if result < item else item
        result = lesser(result, item)
    return result
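The same pattern with greater in place of lesser gives a maximum. A sketch (the name `maximum` is ours, chosen to avoid the builtin) to emphasize that only the starting value and the binary operator change between these functions:

```python
def greater(x, y):
    return x if x > y else y

def maximum(sequence):
    result = -99999999999999  # a really small number
    for item in sequence:
        # only the start value and the binary operator differ from sum/min
        result = greater(result, item)
    return result

# maximum([1, 2, ..., 10]) == 10
```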
Now fill in the blanks below to complete the definition of product, a function that multiplies the elements of the sequence together.
def product(sequence):
    result = ?
    for item in sequence:
        result = ?(result, item)
    return result
assert product([2, 3, 10]) == 60
  File "<ipython-input-16-92db2dd2fc1e>", line 2
    result = ?
             ^
SyntaxError: invalid syntax
Write reduce. Start by copying the pattern of the three functions above. The parts that differ between them are your inputs. Traditionally the arguments of reduce are ordered so that the example calls below work.
def reduce(...):
    ...
  File "<ipython-input-17-880dfea8dc75>", line 1
    def reduce(...):
               ^
SyntaxError: invalid syntax
reduce(add, data, 0)
55
reduce(mul, data, 1)
3628800
reduce(lesser, data, 10000000)
1
reduce(greater, data, -100000000)
10
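If you want to check your answer, here is one possible implementation that matches the calls above; the parts that varied between sum, min, and maximum (the starting value and the binary operator) have become arguments:

```python
def reduce(fn, sequence, start):
    result = start  # the varying initial value becomes a parameter
    for item in sequence:
        result = fn(result, item)  # the varying binary operator becomes a parameter
    return result

def add(x, y):
    return x + y

# reduce(add, [1, ..., 10], 0) == 55
```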
We started this notebook with lots of little definitions like
def add(x, y):
    return x + y
These one-line functions sometimes seem a little silly. The lambda keyword lets us create small functions on the fly. The above definition could be expressed as follows:
add = lambda x, y: x + y
The expression lambda x, y: x + y is a value, just like 3 or 'Alice'. Just like literal ints and strings, lambda expressions can be used on the fly without being stored in variables.
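For instance, a lambda can be called immediately without ever being given a name (a minimal illustration):

```python
# Call an anonymous function inline; nothing is stored in a variable
result = (lambda x, y: x + y)(3, 4)
# result == 7
```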
reduce(add, data, 0)
55
reduce(lambda x, y: x + y, data, 0) # Define `add` on the fly
55
Additionally, we can use lambda to quickly specify functions as specializations of more general ones. In the following we define sum, min, and max.
sum = lambda data: reduce(add, data, 0)
min = lambda data: reduce(lesser, data, 99999999999)
max = lambda data: reduce(greater, data, -999999999999)
sum(data)
55
As an exercise, make product using lambda, reduce, and mul.
product = ...
assert product([2, 3, 10]) == 60
  File "<ipython-input-26-405b2d336b95>", line 1
    product = ...
              ^
SyntaxError: invalid syntax
Groupby can be seen as a more powerful version of filter. Rather than giving you one subset of the data, it divides the data into all relevant subsets.
filter(iseven, data)
[2, 4, 6, 8, 10]
from toolz import groupby
groupby(iseven, data)
{False: [1, 3, 5, 7, 9], True: [2, 4, 6, 8, 10]}
groupby(isprime, data)
{False: [1, 4, 6, 8, 9, 10], True: [2, 3, 5, 7]}
But groupby is not restricted to predicates (functions which return True or False):
groupby(lambda n: n % 3, data)
{0: [3, 6, 9], 1: [1, 4, 7, 10], 2: [2, 5, 8]}
groupby(len, ['Alice', 'Bob', 'Charlie', 'Dan', 'Edith', 'Frank'])
{3: ['Bob', 'Dan'], 5: ['Alice', 'Edith', 'Frank'], 7: ['Charlie']}
Amazingly, groupby is not significantly more costly than filter in the common case. It computes all the groups in a single pass through the data.
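A minimal sketch of how such a single-pass groupby could be written (toolz's actual implementation differs, but the idea is the same):

```python
def groupby(key, sequence):
    groups = {}
    for item in sequence:
        k = key(item)  # compute the group key once per element
        if k not in groups:
            groups[k] = []
        groups[k].append(item)  # one pass: each element lands in exactly one group
    return groups

def iseven(n):
    return n % 2 == 0

# groupby(iseven, [1, ..., 10]) == {False: [1, 3, 5, 7, 9], True: [2, 4, 6, 8, 10]}
```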
Let's bring it all together on a tiny dataset.
likes = """Alice likes Chocolate
Bob likes Chocolate
Bob likes Apples
Charlie likes Apples
Alice likes Peanut Butter
Charlie likes Peanut Butter"""
tuples = map(lambda s: s.split(' likes '), likes.split('\n'))
tuples
[['Alice', 'Chocolate'], ['Bob', 'Chocolate'], ['Bob', 'Apples'], ['Charlie', 'Apples'], ['Alice', 'Peanut Butter'], ['Charlie', 'Peanut Butter']]
groups = groupby(lambda x: x[0], tuples)
groups
{'Alice': [['Alice', 'Chocolate'], ['Alice', 'Peanut Butter']], 'Bob': [['Bob', 'Chocolate'], ['Bob', 'Apples']], 'Charlie': [['Charlie', 'Apples'], ['Charlie', 'Peanut Butter']]}
from toolz import valmap, first, second
valmap(lambda L: map(second, L), groups)
{'Alice': ['Chocolate', 'Peanut Butter'], 'Bob': ['Chocolate', 'Apples'], 'Charlie': ['Apples', 'Peanut Butter']}
valmap(lambda L: map(first, L), groupby(lambda x: x[1], tuples))
{'Apples': ['Bob', 'Charlie'], 'Chocolate': ['Alice', 'Bob'], 'Peanut Butter': ['Alice', 'Charlie']}
from toolz.curried import map, valmap, groupby, first, second, get, curry, compose, pipe
tmap = curry(compose(tuple, map))
pipe(tuples, groupby(second), valmap(tmap(first)))
valmap(tmap(first), groupby(second, tuples))
f = compose(valmap(tmap(first)), groupby(second))
f(tuples)
{'Apples': ('Bob', 'Charlie'), 'Chocolate': ('Alice', 'Bob'), 'Peanut Butter': ('Alice', 'Charlie')}
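For readers without toolz installed, the same grouping can be sketched in plain Python (the pairs from above are inlined here so the snippet stands alone):

```python
pairs = [['Alice', 'Chocolate'], ['Bob', 'Chocolate'], ['Bob', 'Apples'],
         ['Charlie', 'Apples'], ['Alice', 'Peanut Butter'], ['Charlie', 'Peanut Butter']]

fans = {}
for name, food in pairs:
    # append each fan to the tuple for their favorite food
    fans[food] = fans.get(food, ()) + (name,)
# fans == {'Chocolate': ('Alice', 'Bob'), 'Apples': ('Bob', 'Charlie'),
#          'Peanut Butter': ('Alice', 'Charlie')}
```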