There are a number of features, idioms, and data types for manipulating Python data structures that are worth being familiar with. Much of the material that follows in this course assumes this familiarity. An overview of some of these concepts is below. The goal of this tutorial is not necessarily for you to be able to use these patterns fluently, but to be able to recognize them in use as the course proceeds, and to start to be able to reason about which built-in Python data structures and functions are the best match for the problems and tasks you'll encounter.
I begin by going through the concepts in a technical manner, and end with an example that combines many of the concepts and uses data from the Open Weather Map API.
(Some of this may be review from Foundations, but I want to make sure we have common ground on these items in particular.)
A very common task in both data analysis and computer programming is applying some operation to every item in a list (e.g., scaling the numbers in a list by a fixed factor), or creating a copy of a list with only those items that match a particular criterion (e.g., eliminating values that fall below a certain threshold). Python has a succinct syntax, called a list comprehension, which allows you to easily write expressions that transform and filter lists.
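A list comprehension has a few parts: a source list, whose values will be transformed or filtered; a predicate expression, which is evaluated for every item in the source list; (optionally) a membership expression, which determines whether an item from the source list is included in the result, based on whether the expression evaluates to True or False; and a temporary variable name, by which each item from the source list is known inside the predicate expression and membership expression.
These parts are arranged like so:
[predicate expression for temporary variable name in source list if membership expression]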
The words for, in, and if are a part of the syntax of the expression. They don't mean anything in particular (and in fact, they do completely different things in other parts of the Python language). You just have to spell them right and put them in the right place in order for the list comprehension to work.
Here's an example, returning the squares of the integers from zero up to (but not including) ten:
print [x * x for x in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
In the example above, x * x is the predicate expression; x is the temporary variable name; and [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] is the source list. There's no membership expression in this example, so we omit it (and the word if).
There's nothing special about the variable x; it's just a name that we chose. We could easily choose any other temporary variable name, as long as we use it in the predicate expression as well. Below, I use the name of one of my cats as the temporary variable name, and the expression evaluates the same way it did with x:
print [shumai * shumai for shumai in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
The expression in the list comprehension can be any expression, even just the temporary variable itself, in which case the list comprehension will simply evaluate to a copy of the original list:
print [x for x in range(10)]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
You don't technically even need to use the temporary variable in the predicate expression:
print [42 for x in range(10)]
[42, 42, 42, 42, 42, 42, 42, 42, 42, 42]
As indicated above, you can include an expression at the end of the list comprehension to determine whether or not the item in the source list will be evaluated and included in the resulting list. One way, for example, of including only those values from the source list that are greater than or equal to five:
print [x*x for x in range(10) if x >= 5]
[25, 36, 49, 64, 81]
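If the syntax still feels opaque, it can help to see a comprehension next to the for loop it replaces. The following is a minimal sketch; the names squares_loop and squares_comp are made up for this illustration, and the code is written so that it runs under both Python 2.7 and Python 3 (which is why print is called with parentheses here).

```python
# Build the list the long way, with a for loop and an if statement.
squares_loop = []
for x in range(10):
    if x >= 5:
        squares_loop.append(x * x)

# The same result, written as a list comprehension.
squares_comp = [x * x for x in range(10) if x >= 5]

print(squares_loop == squares_comp)
```

The loop and the comprehension contain the same three ingredients (the source list, the membership test, and the expression to keep), just in a different order.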
Random numbers are useful for a number of reasons, from testing to statistics to cryptography. Python has a built-in module, random, which allows you to do things with random numbers. I'm going to introduce just a few of the most useful functions from random.
In order to use random, you need to import it first. Once you do, you can call random.randrange() to generate a random integer from zero up to (but not including) the specified number:
import random
print random.randrange(100)
13
You can use random.randrange() to, for example, simulate a number of dice rolls. Here are 100 random rolls of a d6:
print [random.randrange(6)+1 for i in range(100)]
[6, 6, 5, 4, 4, 3, 2, 4, 4, 2, 6, 5, 6, 1, 2, 3, 2, 3, 2, 4, 4, 4, 1, 1, 2, 6, 4, 5, 4, 6, 3, 3, 4, 2, 1, 3, 4, 4, 5, 6, 5, 2, 1, 2, 1, 5, 3, 2, 4, 6, 4, 5, 1, 1, 2, 5, 6, 4, 4, 4, 1, 4, 4, 2, 5, 2, 2, 6, 4, 6, 1, 3, 4, 2, 6, 1, 3, 1, 6, 5, 5, 1, 4, 5, 1, 6, 3, 2, 5, 2, 3, 5, 5, 1, 3, 2, 3, 2, 1, 2]
The random module also has a number of functions for getting random items from lists. The first is random.choice(), which simply returns a random item from a list:
flavors = ["vanilla", "chocolate", "red velvet", "durian", "cinnamon", "~mystery~"]
print random.choice(flavors)
chocolate
The random.sample() function randomly samples a specified number of items from a list (guaranteeing that the same item won't be drawn twice):
print random.sample(flavors, 2)
['~mystery~', 'durian']
Finally, the random.shuffle() function rearranges the items of a list into random order, modifying the list in place:
print flavors
random.shuffle(flavors)
print flavors
['vanilla', 'chocolate', 'red velvet', 'durian', 'cinnamon', '~mystery~']
['cinnamon', '~mystery~', 'red velvet', 'durian', 'vanilla', 'chocolate']
These are just the most useful (in my opinion) functions from random; the module has many other helpful functions to (e.g.) generate random numbers with particular distributions. Read more in the Python documentation for the random module.
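One practical wrinkle with random results is that they change on every run, which makes code hard to check. Here is a small sketch, assuming you want repeatable draws: random.seed() puts the generator into a known state, so the same calls produce the same results. The seed value 42 and the variable names are arbitrary choices for this illustration.

```python
import random

flavors = ["vanilla", "chocolate", "red velvet", "durian", "cinnamon", "~mystery~"]

random.seed(42)                       # put the generator in a known state
first_pick = random.sample(flavors, 2)

random.seed(42)                       # same seed, same state...
second_pick = random.sample(flavors, 2)

print(first_pick == second_pick)      # ...so the same "random" draw

# random.shuffle() rearranges the list it is given and returns None,
# so shuffle a copy if you need to keep the original order around.
shuffled = list(flavors)
random.shuffle(shuffled)
print(sorted(shuffled) == sorted(flavors))
```

Shuffling a copy keeps flavors itself in its original order, which is handy if other code depends on that order.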
Tuples (rhymes with "supple") are data structures very similar to lists. You can create a tuple using parentheses (instead of square brackets, as you would with a list):
t = ("alpha", "beta", "gamma", "delta")
print t
('alpha', 'beta', 'gamma', 'delta')
You can access the values in a tuple in the same way as you access the values in a list: using square bracket indexing syntax. Tuples support slice syntax and negative indexes, just like lists:
t[-2]
'gamma'
t[1:3]
('beta', 'gamma')
The difference between a list and a tuple is that the values in a tuple can't be changed after the tuple is created. This means, for example, that attempting to .append() a value to a tuple will fail:
t.append("epsilon")
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-179-bd64cbe45262> in <module>()
----> 1 t.append("epsilon")

AttributeError: 'tuple' object has no attribute 'append'
Likewise, assigning to an index of a tuple will fail:
t[2] = "bravo"
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-180-442965e372c2> in <module>()
----> 1 t[2] = "bravo"

TypeError: 'tuple' object does not support item assignment
"So," you think to yourself. "Tuples are just like... broken lists. That's strange and a little unreasonable. Why even have them in your programming language?" That's a fair question, and answering it requires a bit of knowledge of how Python works with these two kinds of values (lists and tuples) behind the scenes.
Essentially, tuples are faster and smaller than lists. Because lists can be modified, potentially becoming larger after they're initialized, Python has to allocate more memory than is strictly necessary whenever you create a list value. If your list grows beyond what Python has already allocated, Python has to allocate more memory. Allocating memory, copying values into memory, and then freeing memory when it's no longer needed are all (perhaps surprisingly) slow processes---slower, at least, than using data already loaded into memory when your program begins.
Because a tuple can't grow or shrink after it's created, Python knows exactly how much memory to allocate when you create a tuple in your program. That means: less wasted memory, and less wasted time allocating and deallocating memory. The cost of this decreased resource footprint is less versatility.
Tuples are often called an immutable data type. "Immutable" in this context simply means that it can't be changed after it's created.
Because tuples are faster, they're often the data type that gets returned from methods and functions in Python's built-in library. For example, the .items() method of the dictionary object returns a list of tuples (rather than, as you might otherwise expect, a list of lists):
moon_counts = {'mercury': 0, 'venus': 0, 'earth': 1, 'mars': 2}
moon_counts.items()
[('mercury', 0), ('earth', 1), ('venus', 0), ('mars', 2)]
The tuple() function takes a list and returns it as a tuple:
tuple([1, 2, 3, 4, 5])
(1, 2, 3, 4, 5)
If you want to initialize a new list with data in a tuple, you can pass the tuple to the list() function:
list((1, 2, 3, 4, 5))
[1, 2, 3, 4, 5]
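To tie the pieces of this section together, here is a short sketch of what immutability looks like in practice. It catches the error from an attempted assignment instead of letting it stop the program, and then uses list() and tuple() to build a changed copy. The last part goes slightly beyond what's covered above: because tuples can't change, Python lets you use them as dictionary keys, which it won't do for lists. The grid dictionary and its contents are invented for this illustration.

```python
t = ("alpha", "beta", "gamma", "delta")

# Assigning to an index raises TypeError; catch it so the script continues.
try:
    t[2] = "bravo"
    changed = True
except TypeError:
    changed = False
print(changed)

# To get a "modified" tuple, copy into a list, change it, and convert back.
as_list = list(t)
as_list[2] = "bravo"
t2 = tuple(as_list)
print(t2)

# Tuples can serve as dictionary keys (lists can't).
grid = {(0, 0): "origin", (1, 0): "east"}
print(grid[(1, 0)])
```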
Sets are another list-like data structure. Like tuples, sets have limitations compared to lists, but are very useful in particular circumstances.
You can create a set like this, by passing a list to the set() function:
s = set(["alpha", "beta", "gamma", "delta", "epsilon"])
print type(s)
print s
<type 'set'>
set(['epsilon', 'alpha', 'beta', 'gamma', 'delta'])
In Python 2.7 and later, you can also create a set using curly brackets ({ and }) with a comma-separated sequence of values:
s = {"alpha", "beta", "gamma", "delta", "epsilon"}
print type(s)
print s
<type 'set'>
set(['epsilon', 'beta', 'alpha', 'gamma', 'delta'])
Sets, like lists, can be iterated over in a for loop and serve as the source value in a list comprehension:
for item in s:
    print item
epsilon
beta
alpha
gamma
delta
[item[0] for item in s]
['e', 'b', 'a', 'g', 'd']
You can add an item to a set using the set object's .add() method:
s.add("omega")
print s
set(['epsilon', 'beta', 'delta', 'alpha', 'omega', 'gamma'])
And you can check to see if a particular value is present in a set using the in operator:
"beta" in s
True
"emoji" in s
False
So now you're asking "okay, so... it's a list. You can put things in it and add things to it and check if things are in it. Big deal. Why am I even listening to this. I'm going to check Facebook. Ah, sweet, sweet Facebook." But wait! Sets are different from lists in several useful (and/or strange) ways. One useful property of the set is that once a value is in a set, any further attempts to add that value to the set will be ignored. That is: a set can't contain the same value twice. You can exploit this property of sets in order to remove duplicates from a list:
source_list = ["it", "is", "what", "it", "is"]
without_duplicates = set(source_list)
print without_duplicates
set(['is', 'it', 'what'])
Another useful property of sets is that the in operator is much faster when operating on sets than it is on lists, especially when the number of items you're working with is very large. The following code illustrates the speed difference by creating a list of 9999 values, and a set of the same values, then performing a thousand random in checks against both:
import time, random
values = range(9999)
values_set = set(values)
# time 1000 membership checks against the list
start_list = time.clock()
for i in range(1000):
    random.randrange(99999) in values
end_list = time.clock()
# time the same number of checks against the set
start_set = time.clock()
for i in range(1000):
    random.randrange(99999) in values_set
end_set = time.clock()
print "1000 random checks on list: ", end_list - start_list, "seconds"
print "1000 random checks on set: ", end_set - start_set, "seconds"
1000 random checks on list: 0.19032 seconds
1000 random checks on set: 0.00268700000001 seconds
(The time.clock() function returns a clock reading in seconds; subtracting one reading from another tells us roughly how much time has passed between the two calls.) Depending on your computer, using the set instead of the list will be 10 to 100 times faster. In this example, the actual clock time difference is minuscule---about two tenths of a second versus less than three thousandths of a second---but when you're working with billions or trillions of items instead of tens of thousands, that performance difference can really add up.
(The reasons for this performance difference are outside the scope of this tutorial. Suffice it to say that the in operator must potentially check every item in a list to see if the first operand matches---meaning that as the list grows larger, the operation gets slower. With a set, the in operator typically needs to perform only one lookup, regardless of how large the data structure is.)
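If you want to repeat this kind of measurement yourself, the standard library's timeit module is one option. The sketch below is not the measurement used above but a variation on it: it always looks for the value -1, which is never present, so the list has to be scanned from start to finish on every check. The variable names list_time and set_time are made up for this illustration, and your timings will differ from mine.

```python
import timeit

values = list(range(9999))
values_set = set(values)

# -1 is never present, so the list version has to scan all 9999 items.
list_time = timeit.timeit(lambda: -1 in values, number=1000)
set_time = timeit.timeit(lambda: -1 in values_set, number=1000)

print(list_time)
print(set_time)
```

timeit.timeit() calls the function you give it the requested number of times and returns the total elapsed time in seconds.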
The tradeoff for the set's speed is that it's less memory-efficient than a list. Here we'll use the sys.getsizeof() function to get a rough estimate of the size (in bytes) of both objects:
import sys
print "size of list: ", sys.getsizeof(values)
print "size of set: ", sys.getsizeof(values_set)
size of list: 80064
size of set: 524520
Another important difference between sets and lists is that sets are unordered. You may have noticed this in the examples above: the order of the items added to a set is not the same as the order of the items when you get them back out. (This is similar to how keys in Python dictionaries are unordered.)
Sets and lists are similar, but not interchangeable. Use lists when it's important to know the order of a particular sequence; use sets when it's important to be able to quickly check to see if a particular item is in the sequence.
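Sometimes you want both properties at once: the order of a list and the fast membership checks of a set. Here is one way to sketch that, assuming the goal is to drop duplicates while keeping each item in the position where it first appeared. The names seen and without_duplicates are just illustrative.

```python
source_list = ["it", "is", "what", "it", "is"]

seen = set()               # fast "have I seen this already?" checks
without_duplicates = []    # remembers the original order
for word in source_list:
    if word not in seen:
        seen.add(word)
        without_duplicates.append(word)

print(without_duplicates)
```

The set does the checking and the list does the remembering, so each structure is used for what it's good at.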
dict() function
The dict() function creates a new dictionary. You can create an empty dictionary by calling this function with no parameters:
t = dict() # same as t = {}
print type(t)
<type 'dict'>
But the dict() function can also be used to initialize a new dictionary from a list of tuples. Here's what that usage looks like:
items = [("a", 1), ("b", 2), ("c", 3)]
t = dict(items)
print t
{'a': 1, 'c': 3, 'b': 2}
This might not seem immediately useful, but as we'll see below, the dict() function can be used to quickly make a dictionary out of sequential data.
A very common task in Python is to take some kind of sequential data and then turn it into a dictionary. Say, for example, that we wanted to take a list of strings and then create a dictionary mapping the strings to their lengths. Here's how to do that with a for loop:
us_presidents = ["carter", "reagan", "bush", "clinton", "bush", "obama"]
prez_lengths = {}
for item in us_presidents:
    prez_lengths[item] = len(item)
print prez_lengths
{'clinton': 7, 'bush': 4, 'reagan': 6, 'carter': 6, 'obama': 5}
This task is so common that it's often written as a single expression. There are several ways to do this; the first is by passing the result of a list comprehension to the dict() function:
prez_length_tuples = [(item, len(item)) for item in us_presidents]
print "our list of tuples: ", prez_length_tuples
prez_lengths = dict(prez_length_tuples)
print "resulting dictionary: ", prez_lengths
our list of tuples: [('carter', 6), ('reagan', 6), ('bush', 4), ('clinton', 7), ('bush', 4), ('obama', 5)]
resulting dictionary: {'clinton': 7, 'bush': 4, 'reagan': 6, 'carter': 6, 'obama': 5}
The example above is a little bit complicated! The tricky part is the list comprehension. The source list of the comprehension is our list of presidential names; the predicate expression is a tuple with two items: the name itself, and the length of the name. We then pass the resulting list of tuples to the dict() function, which evaluates to the desired dictionary. This bit of code can be rewritten as one expression:
prez_lengths = dict([(item, len(item)) for item in us_presidents])
print prez_lengths
{'clinton': 7, 'bush': 4, 'reagan': 6, 'carter': 6, 'obama': 5}
If you're a beginner Python programmer, it might be a while before you can formulate these expressions on your own. But it's important to be able to recognize this idiom when you see it in other people's code.
Python 3 introduced a new syntax specifically for creating dictionaries in this manner; the syntax has subsequently been backported to Python 2.7. The expression above can be rewritten like so:
prez_lengths = {item: len(item) for item in us_presidents}
print prez_lengths
{'clinton': 7, 'bush': 4, 'reagan': 6, 'carter': 6, 'obama': 5}
This syntax is called a dictionary comprehension (by analogy with "list comprehension"). A dictionary comprehension is like a list comprehension, except the "predicate expression" is not an expression proper, but a key/value pair separated by a colon. We'll see another example of this syntax below.
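Like a list comprehension, a dictionary comprehension can take an if clause at the end. Here's a quick sketch; the names long_names and all_names are invented for this illustration. It also shows something that's easy to miss in the output above: "bush" appears twice in the source list but only once in the dictionary, because a dictionary can't contain the same key twice.

```python
us_presidents = ["carter", "reagan", "bush", "clinton", "bush", "obama"]

# Keep only the names that are longer than four letters.
long_names = {name: len(name) for name in us_presidents if len(name) > 4}
print(sorted(long_names.items()))

# Six names go in, but only five distinct keys come out.
all_names = {name: len(name) for name in us_presidents}
print(len(us_presidents))
print(len(all_names))
```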
zip
As you can see by now, the "list of tuples" is a very common configuration for data in Python. The zip() function allows you to create a list of tuples that combines values from two separate lists. For example, imagine that you've retrieved the names of certain US states from one source, and the estimated population for those states from a different source. You know that the data is in the same order in both sources, and you'd like to combine the two into one list (perhaps to eventually create a dictionary for easy population lookups). The zip() function does just this:
state_names = ["alabama", "alaska", "arizona", "arkansas", "california"]
state_pop = [4849377, 736732, 6731484, 2966369, 38802500]
combo = zip(state_names, state_pop)
print combo
[('alabama', 4849377), ('alaska', 736732), ('arizona', 6731484), ('arkansas', 2966369), ('california', 38802500)]
As you can see, the zip() function takes two lists as parameters and returns a list of tuples with the respective items from both lists. You could then (for example) pass the result of zip() to the dict() function, to create a dictionary mapping state names to state populations:
state_pop_lookup = dict(zip(state_names, state_pop))
print state_pop_lookup
{'california': 38802500, 'alabama': 4849377, 'arizona': 6731484, 'arkansas': 2966369, 'alaska': 736732}
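One thing to watch out for: zip() stops as soon as the shorter of its inputs runs out, without raising an error. The sketch below deliberately leaves two populations off the end of the list to show what happens, and adds a simple length check as a guard. (The call to list() around zip() is there only so the example also works under Python 3; it's harmless in Python 2.)

```python
state_names = ["alabama", "alaska", "arizona", "arkansas", "california"]
state_pop = [4849377, 736732, 6731484]   # two values missing on purpose

# zip() pairs items up until the shorter list is exhausted.
combo = list(zip(state_names, state_pop))
print(combo)

# A simple guard to run before building a lookup dictionary.
if len(state_names) != len(state_pop):
    print("warning: lists differ in length; some states were dropped")
```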
Let's say you want to iterate through a list, prepending each item in the list with its index: a numbered list. One way to do this is to write a for loop to print out the items of a list with their index, keeping track of the current index in a separate variable:
elements = ["hydrogen", "helium", "lithium", "beryllium", "boron"]
index = 0
for item in elements:
    print index, item
    index += 1
0 hydrogen
1 helium
2 lithium
3 beryllium
4 boron
That whole index variable thing, though---kind of ugly and non-Pythonic. What if we used the zip() function to create instead a list of tuples, where the first item is the index of the item in the source list, and the second item is the item itself? That might look like this:
# the range() function returns a list from 0 up to (but not including) the specified value
enumerated_elements = zip(range(len(elements)), elements)
print "enumerated list: ", enumerated_elements
# now, iterate over each tuple in the enumerated list...
for index_item_tuple in enumerated_elements:
    print index_item_tuple[0], index_item_tuple[1]
enumerated list: [(0, 'hydrogen'), (1, 'helium'), (2, 'lithium'), (3, 'beryllium'), (4, 'boron')]
0 hydrogen
1 helium
2 lithium
3 beryllium
4 boron
The zip() function here takes two lists: the first is a list returned from range() that has numbers from zero up to the number of items in the elements list (i.e., [0, 1, 2, 3, 4]). The second is the elements list itself. The call to zip() evaluates to the list shown above: a list of 2-tuples with index/item pairs.
The for loop above is a little awkward: the temporary loop variable index_item_tuple has the value of each tuple in the enumerated_elements list in turn, so we need to use square brackets to get the values from the tuple. It turns out there's an easier, more Pythonic way to do this, using a feature called "tuple unpacking":
for index, item in enumerated_elements:
    print index, item
0 hydrogen
1 helium
2 lithium
3 beryllium
4 boron
If you know that each item of a list is a tuple, you can write a series of comma-separated temporary variables between the for and the in of a for loop. Python will assign the first element of each tuple to the first variable listed, the second element of each tuple to the second variable listed, etc. This for loop accomplishes the same thing as the previous one, but it's much cleaner.
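Tuple unpacking isn't limited to for loops. As a brief sketch, the same comma-separated pattern works in an ordinary assignment and in a list comprehension. The data and variable names below are only for illustration.

```python
# Unpacking in a plain assignment: one variable per tuple element.
pair = ("alaska", 736732)
name, population = pair
print(name)
print(population)

# Unpacking in a list comprehension.
states = [("alabama", 4849377), ("alaska", 736732)]
state_names = [state for state, pop in states]
print(state_names)
```

The number of variables on the left has to match the number of items in the tuple; otherwise Python raises a ValueError.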
Lists of index/value 2-tuples are needed fairly frequently in Python. So frequently that there's a built-in function for constructing such lists. That function is called enumerate():
# this code:
print "with zip/range/len:"
for index, item in zip(range(len(elements)), elements):
    print index, item
print "\nwith enumerate:"
# ... can also be written like this:
for index, item in enumerate(elements):
    print index, item
with zip/range/len:
0 hydrogen
1 helium
2 lithium
3 beryllium
4 boron

with enumerate:
0 hydrogen
1 helium
2 lithium
3 beryllium
4 boron
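Numbered lists meant for people usually start at one, not zero. enumerate() accepts an optional second parameter giving the number to start counting from, so you don't need to add one by hand. A small sketch (the numbered variable is just a name I picked):

```python
elements = ["hydrogen", "helium", "lithium", "beryllium", "boron"]

# The second argument to enumerate() is the starting index.
numbered = ["%d. %s" % (n, name) for n, name in enumerate(elements, 1)]
for line in numbered:
    print(line)
```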
As you're aware, you can define a function in Python using the def keyword and an indented code block. For example, here's a function first() which returns the first item of a list or string:
def first(t):
    return t[0]
first("all of these wonderful characters")
'a'
What you may not know is that functions are themselves values, just like an integer or a floating-point number or a string or a list. Once a function has been defined, the name of a function is a variable that contains that value. You can ask Python to print that value out, just like you can ask Python to print the value of a list or string:
print first
<function first at 0x104967938>
Printing a function isn't very useful, of course; you just get a string with information about where in memory Python is storing the function's code. But you can do other interesting things. For example, you can create a new variable that points to that function:
grab_index_zero = first
grab_index_zero("all of these wonderful characters")
'a'
Above, we created a new variable grab_index_zero and assigned to it the value first. Now grab_index_zero can be called as a function, just like we can call first as a function! This works even for built-in functions like len():
how_many_things_are_in = len
how_many_things_are_in(["hi", "there", "how", "are", "you"])
5
Importantly, because Python functions are values, we can pass them as parameters to other Python functions. To illustrate, let's write a function say_hello which accomplishes a simple task: it prints out a random greeting.
import random
def say_hello():
    greetz = ["hey", "howdy", "hello", "greetings", "yo", "hi"]
    print random.choice(greetz) + "!"
say_hello()
greetings!
Now, let's write a function called thrice. This function takes another function as a parameter, and calls that function three times:
def thrice(func):
    for i in range(3):
        func()
# let's try it out...
thrice(say_hello)
hey!
greetings!
hi!
The thrice function has limited utility, of course (if for no other reason than the fact that most Python functions return values, instead of just printing them out). But it at least illustrates the concept.
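For a version that works with functions that return values, here's one possible sketch. Both pick_greeting and call_n_times are hypothetical names invented for this example: the first returns a greeting instead of printing it, and the second calls whatever function it's given a specified number of times and collects the results in a list.

```python
import random

def pick_greeting():
    greetz = ["hey", "howdy", "hello", "greetings", "yo", "hi"]
    return random.choice(greetz) + "!"

def call_n_times(func, n):
    # call func n times and collect whatever it returns
    return [func() for i in range(n)]

results = call_n_times(pick_greeting, 3)
print(results)
```

Notice that pick_greeting is passed without parentheses: we're handing over the function itself, not the result of calling it.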
There are two built-in Python functions that I want to mention here, called map() and filter(). These are functions that operate on lists and take other functions as parameters.
The map() function takes two parameters: a function and a list (or other sequence). It returns a new list, which contains the result of calling the given function on every item of the list:
def first(t):
    return t[0]
elements = ["hydrogen", "helium", "lithium", "beryllium", "boron"]
# a new list containing the first character of each string
map(first, elements)
['h', 'h', 'l', 'b', 'b']
The map() call above is essentially the same thing as this list comprehension:
[first(item) for item in elements]
['h', 'h', 'l', 'b', 'b']
There's no real reason to choose one idiom over the other (map() vs. list comprehension), and you'll often see Python programmers switch between the two. But it's important to be able to recognize that these two bits of code do the same thing.
The filter() function takes a function and a list (or other sequence), and returns a new list containing only those items from the source list that, when passed as a parameter to the given function, evaluate to True:
def greater_than_ten(num):
    return num > 10
numbers = [-10, 17, 4, 94, 2, 0, 10]
filter(greater_than_ten, numbers)
[17, 94]
Again, this call to filter() can be re-written as a list comprehension:
[item for item in numbers if greater_than_ten(item)]
[17, 94]
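A caveat in case you ever run these examples under Python 3: there, map() and filter() don't return lists directly but objects that produce their items on demand, so you need to wrap the call in list() to see the results. Wrapping in list() is harmless in Python 2 as well, so the sketch below behaves the same under either version. The names initials and big_numbers are just illustrative.

```python
def first(t):
    return t[0]

def greater_than_ten(num):
    return num > 10

elements = ["hydrogen", "helium", "lithium", "beryllium", "boron"]
numbers = [-10, 17, 4, 94, 2, 0, 10]

# list() makes the result a real list regardless of Python version.
initials = list(map(first, elements))
big_numbers = list(filter(greater_than_ten, numbers))

# Each result matches the equivalent list comprehension.
print(initials == [first(item) for item in elements])
print(big_numbers == [item for item in numbers if greater_than_ten(item)])
```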
With functions like map() and filter(), it's very common to write quick, one-off functions simply for the purposes of processing a list. The greater_than_ten and first functions above are great examples of this: these are tiny functions that only have a return statement in them.
Writing functions like this is so common, in fact, that there's a little shorthand for writing them. This shorthand allows you to define a function all in one line, without having to type out the def and the return. It looks like this:
# the regular way
def first(t):
    return t[0]
# the "shorthand" way
first = lambda t: t[0]
# test it out!
first("cheese")
'c'
This shorthand method is called a "lambda function." (Why "lambda"? For secret programming reasons that you can investigate on your own. It's a ~mystery~.) A lambda function is essentially an alternate syntax for defining a function. Schematically, it looks like this:
lambda vars: expression
... where "lambda" is the lambda keyword, vars is a comma-separated list of temporary variable names for parameters passed to the function, and expression is the expression that describes what the function will evaluate to.
Here's a lambda function that takes two parameters:
# squish combines the first item of its first parameter with the last item of its second parameter
squish = lambda one, two: one[0] + two[-1]
squish("hi", "there")
# you could also write "squish" the longhand way, like this:
#def squish(one, two):
# return one[0] + two[-1]
'he'
Lambda functions have a serious limitation, though: a lambda function can consist of only one expression---you can't have an entire block of statements like you can with a regular function.
The real utility of this alternate syntax comes from the fact that you can define a lambda function in-line: you don't have to assign a lambda function to a variable before you use it. So, for example, you can write this:
elements = ["hydrogen", "helium", "lithium", "beryllium", "boron"]
map(lambda x: x[0], elements)
['h', 'h', 'l', 'b', 'b']
... or this:
numbers = [-10, 17, 4, 94, 2, 0, 10]
filter(lambda x: x > 10, numbers)
[17, 94]
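You can also nest the two calls, feeding the result of filter() straight into map(). Here's a sketch that doubles every number greater than ten, alongside a list comprehension that does the same job in a single expression. The variable names are invented for this illustration, and list() is included only so that the example also runs under Python 3.

```python
numbers = [-10, 17, 4, 94, 2, 0, 10]

# filter first (keep values over ten), then map (double what's left)
doubled_big = list(map(lambda x: x * 2, filter(lambda x: x > 10, numbers)))

# the same transformation and filter as one list comprehension
same_thing = [x * 2 for x in numbers if x > 10]

print(doubled_big)
print(doubled_big == same_thing)
```

Once calls start nesting like this, many people find the list comprehension easier to read, but you'll encounter both styles.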
You can sort a list two ways. The first is to use the list object's .sort() method, which sorts the list in-place:
elements = ["hydrogen", "helium", "lithium", "beryllium", "boron"]
elements.sort()
print elements
['beryllium', 'boron', 'helium', 'hydrogen', 'lithium']
The second is to use Python's built-in sorted() function, which evaluates to a copy of the list with its elements in order, while leaving the original list the same:
elements = ["hydrogen", "helium", "lithium", "beryllium", "boron"]
print sorted(elements)
['beryllium', 'boron', 'helium', 'hydrogen', 'lithium']
Both the .sort() method and the sorted() function take an optional keyword parameter reverse which, if set to True, causes the sorting to happen in reverse order:
# with .sort()
numbers = [52, 54, 108, 13, 7, 2]
numbers.sort(reverse=True)
print numbers
[108, 54, 52, 13, 7, 2]
# with sorted()
numbers = [52, 54, 108, 13, 7, 2]
sorted(numbers, reverse=True)
[108, 54, 52, 13, 7, 2]
This is all well and good, but what if we want to sort using some method other than numerically or alphabetically? Say, for example, we had the following list of tuples describing state populations, and we wanted to sort the list by population. Just using .sort() or sorted() doesn't return the desired result:
states = [
    ('Alabama', 4849377),
    ('Alaska', 736732),
    ('Arizona', 6731484),
    ('Arkansas', 2966369),
    ('California', 38802500)
]
# doesn't sort based on population!
sorted(states)
[('Alabama', 4849377), ('Alaska', 736732), ('Arizona', 6731484), ('Arkansas', 2966369), ('California', 38802500)]
What we need is some way to tell Python which part of the data to look at when performing the sort. Python provides a way to do this with the key parameter, which you can pass to either .sort() or sorted(). The value passed to the key parameter should be a function. When sorting the list, Python will evaluate this function for each item in the list, and will decide how that item should be sorted based on the value returned from the function.
So, to perform the task outlined above (sorting the list by the second item in each tuple), we could do something like this:
def get_second(t):
    return t[1]
sorted(states, key=get_second)
[('Alaska', 736732), ('Arkansas', 2966369), ('Alabama', 4849377), ('Arizona', 6731484), ('California', 38802500)]
Because we specified the key parameter, Python calls the get_second function for each item in the list. The result of this function (i.e., the second value in the tuple) is then used when sorting the list. We can rewrite this more succinctly using a lambda function:
sorted(states, key=lambda t: t[1])
[('Alaska', 736732), ('Arkansas', 2966369), ('Alabama', 4849377), ('Arizona', 6731484), ('California', 38802500)]
The expression lambda t: t[1] is just a shorter way of writing the function get_second above.
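Two variations you may run into in other people's code, sketched here with the same data. First, the standard library's operator module has a function called itemgetter() that builds the "give me item number 1" function for you. Second, a key function can return a tuple, in which case Python sorts by the first value in the tuple and uses the second to break ties. The sort criteria in the second example (length of name, then population) are arbitrary and chosen only to show the technique.

```python
from operator import itemgetter

states = [
    ('Alabama', 4849377),
    ('Alaska', 736732),
    ('Arizona', 6731484),
    ('Arkansas', 2966369),
    ('California', 38802500)
]

# itemgetter(1) behaves like lambda t: t[1]
by_population = sorted(states, key=itemgetter(1))
print(by_population)

# sort by length of the state's name; ties are broken by population
by_length_then_pop = sorted(states, key=lambda t: (len(t[0]), t[1]))
print([name for name, pop in by_length_then_pop])
```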
It's common to use Python dictionaries to count things---say, for example, how often words are repeated in a given source text. You'll often end up with a dictionary that looks something like this:
word_counts = {'it': 123, 'was': 48, 'the': 423, 'best': 7, 'worst': 13, 'of': 350, 'times': 2}
Once you have data like this, it's only natural to want to see, e.g., what the most common word is and what the least common word is. It should be simple enough to do this, right? Just pass the dictionary to the sorted() function!
sorted(word_counts)
['best', 'it', 'of', 'the', 'times', 'was', 'worst']
Hmm. That didn't work. It looks like Python is sorting the dictionary... in alphabetical order? Which is weird. Actually, what's happening is that when you pass a dictionary to sorted(), Python implicitly assumes you meant to sort just the keys of the dictionary---and, in this case, it sorts them in alphabetical order, because we haven't specified an alternative order!
Maybe it would help to step back and remember that dictionaries are an inherently unordered data type. So sorting a dictionary doesn't make any sense! What we need is some way to turn a dictionary into a sortable data type, like a list. The .items() method of the dictionary object does just this: it evaluates to a list of tuples containing the key-value pairs from the dictionary.
word_counts.items()
[('of', 350), ('it', 123), ('times', 2), ('worst', 13), ('the', 423), ('was', 48), ('best', 7)]
Hey now, this is looking familiar! A list of tuples! We just finished learning how to sort lists of tuples by particular members of the tuples. We just need to use the sorted() function and specify a key parameter that is a function returning the second value from the tuple! Like so:
sorted(word_counts.items(), key=lambda x: x[1])
[('times', 2), ('best', 7), ('worst', 13), ('was', 48), ('it', 123), ('of', 350), ('the', 423)]
We did it! This expression evaluates to a list of tuples from the word_counts dictionary, ordered by the value for each key in the original dictionary. The least common word ("times") is the first item in the list. We can use the reverse parameter of sorted() to order from most common to least common instead:
sorted(word_counts.items(), key=lambda x: x[1], reverse=True)
[('the', 423), ('of', 350), ('it', 123), ('was', 48), ('worst', 13), ('best', 7), ('times', 2)]
Beautiful. Simply stunning.
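If all you need is the top of the list, you don't have to look at the whole sorted result. The sketch below takes a slice to get the three most common words, and also shows that the built-in max() function accepts a key parameter just like sorted() does. Here the key is the dictionary's own .get method, which returns the count for a given word. The names top_three and most_common_word are made up for this illustration.

```python
word_counts = {'it': 123, 'was': 48, 'the': 423, 'best': 7, 'worst': 13, 'of': 350, 'times': 2}

# sort from most to least common, then keep the first three pairs
top_three = sorted(word_counts.items(), key=lambda x: x[1], reverse=True)[:3]
print(top_three)

# max() iterates over the dictionary's keys and compares them by their counts
most_common_word = max(word_counts, key=word_counts.get)
print(most_common_word)
```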
Let's combine a number of these concepts to get useful information from a web API. For this section, we'll use the Open Weather Map API. The Open Weather Map API provides several different kinds of data about the weather.
Let's say that we want to pick which of the next five days will be the best day for our outdoor Data Journalism Picnic. We want to choose the day based on the day's weather---not rainy, not too hot. (I'm assuming you're running this code in the summer. If you're planning an outdoor picnic in some other season, you may want to change your criteria.) For this purpose, we can use the API endpoint that returns a daily weather forecast for a particular city. You can read more about how to use this endpoint here. Here's a query that gets the daily forecast for the next five days in New York City:
import urllib
import json
query_url = "http://api.openweathermap.org/data/2.5/forecast/daily?id=5128581&cnt=5&units=imperial"
resp = urllib.urlopen(query_url).read()
data = json.loads(resp)
data
{u'city': {u'coord': {u'lat': 40.714272, u'lon': -74.005966}, u'country': u'US', u'id': 5128581, u'name': u'New York', u'population': 0}, u'cnt': 5, u'cod': u'200', u'list': [{u'clouds': 68, u'deg': 177, u'dt': 1436893200, u'humidity': 60, u'pressure': 1000.31, u'rain': 1.4, u'speed': 6.09, u'temp': {u'day': 80.47, u'eve': 75.45, u'max': 80.47, u'min': 71.26, u'morn': 77.47, u'night': 71.26}, u'weather': [{u'description': u'light rain', u'icon': u'10d', u'id': 500, u'main': u'Rain'}]}, {u'clouds': 92, u'deg': 3, u'dt': 1436979600, u'humidity': 77, u'pressure': 998.18, u'rain': 0.58, u'speed': 3.93, u'temp': {u'day': 79.99, u'eve': 70.9, u'max': 79.99, u'min': 65.43, u'morn': 71.76, u'night': 65.43}, u'weather': [{u'description': u'light rain', u'icon': u'10d', u'id': 500, u'main': u'Rain'}]}, {u'clouds': 0, u'deg': 13, u'dt': 1437066000, u'humidity': 51, u'pressure': 1010.92, u'speed': 4.45, u'temp': {u'day': 74.95, u'eve': 70.14, u'max': 76.12, u'min': 59.11, u'morn': 63.81, u'night': 59.11}, u'weather': [{u'description': u'sky is clear', u'icon': u'01d', u'id': 800, u'main': u'Clear'}]}, {u'clouds': 8, u'deg': 179, u'dt': 1437152400, u'humidity': 52, u'pressure': 1013.8, u'speed': 4.13, u'temp': {u'day': 79.12, u'eve': 72.52, u'max': 79.12, u'min': 62.98, u'morn': 66.36, u'night': 62.98}, u'weather': [{u'description': u'sky is clear', u'icon': u'02d', u'id': 800, u'main': u'Clear'}]}, {u'clouds': 4, u'deg': 204, u'dt': 1437238800, u'humidity': 0, u'pressure': 1010.92, u'rain': 9.19, u'speed': 6.63, u'temp': {u'day': 82.94, u'eve': 77.65, u'max': 82.94, u'min': 72.09, u'morn': 72.09, u'night': 74.84}, u'weather': [{u'description': u'moderate rain', u'icon': u'10d', u'id': 501, u'main': u'Rain'}]}], u'message': 0.0306}
The part of this data structure that we're interested in is the value for the list key of the top-level dictionary, which contains a list of dictionaries, each describing the weather on a particular day. We'll create a variable days which points at just this information:
days = data['list']
# each item in "days" is a dictionary with weather information for that day
days[0]
{u'clouds': 68, u'deg': 177, u'dt': 1436893200, u'humidity': 60, u'pressure': 1000.31, u'rain': 1.4, u'speed': 6.09, u'temp': {u'day': 80.47, u'eve': 75.45, u'max': 80.47, u'min': 71.26, u'morn': 77.47, u'night': 71.26}, u'weather': [{u'description': u'light rain', u'icon': u'10d', u'id': 500, u'main': u'Rain'}]}
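Reaching into a structure like this means chaining square brackets, one pair per level of nesting. The sketch below uses a hand-trimmed copy of the first day's data shown above (only a few keys are kept, and the u prefixes are dropped), so it runs without calling the API. The names first_day, high, and description are introduced just for this illustration.

```python
# A trimmed copy of days[0] from the output above
first_day = {
    'dt': 1436893200,
    'humidity': 60,
    'temp': {'max': 80.47, 'min': 71.26},
    'weather': [{'description': 'light rain', 'main': 'Rain'}],
}

# 'temp' is a dictionary inside the dictionary: two keys in a row
high = first_day['temp']['max']

# 'weather' is a list of dictionaries: key, then list index, then key
description = first_day['weather'][0]['description']

print(high)         # 80.47
print(description)  # light rain
```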
Consult the API documentation to learn what each key/value pair in the dictionary represents. There's a lot of data here, and we're not necessarily interested in all of it. The data items that we are interested in are:
dt is a UNIX timestamp that indicates which day the forecast applies to
humidity indicates the day's humidity
temp['min'] and temp['max'] are the day's minimum and maximum temperatures, respectively
weather[0]['description'] has a brief description of the day's weather
To simplify the task of analyzing this data, let's create a new list of dictionaries that only has the parts of the list-item dictionaries that we're most interested in. We'll do this in a for loop:
import datetime

def timestamp_to_date(dt):
    return datetime.datetime.fromtimestamp(dt).date().isoformat()

cleaned = list()
for item in days:
    new_item = {
        'date': timestamp_to_date(item['dt']),
        'max_temp': item['temp']['max'],
        'min_temp': item['temp']['min'],
        'humidity': item['humidity'],
        'description': item['weather'][0]['description']
    }
    cleaned.append(new_item)
cleaned
[{'date': '2015-07-14', 'description': u'light rain', 'humidity': 60, 'max_temp': 80.47, 'min_temp': 71.26}, {'date': '2015-07-15', 'description': u'light rain', 'humidity': 77, 'max_temp': 79.99, 'min_temp': 65.43}, {'date': '2015-07-16', 'description': u'sky is clear', 'humidity': 51, 'max_temp': 76.12, 'min_temp': 59.11}, {'date': '2015-07-17', 'description': u'sky is clear', 'humidity': 52, 'max_temp': 79.12, 'min_temp': 62.98}, {'date': '2015-07-18', 'description': u'moderate rain', 'humidity': 0, 'max_temp': 82.94, 'min_temp': 72.09}]
Much better! Now we have just the data that we're interested in.
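Since this loop transforms every item in a list, it could also be written as a list comprehension. The sketch below is one way to do that, assuming the dictionary-building step is moved into a helper function; clean_day is a hypothetical name introduced here, and the two-item days list is trimmed by hand from the API response above so the code runs offline.

```python
import datetime

def timestamp_to_date(dt):
    # Same conversion as above: UNIX timestamp -> 'YYYY-MM-DD' (local time)
    return datetime.datetime.fromtimestamp(dt).date().isoformat()

def clean_day(item):
    # Hypothetical helper: keep only the fields we care about
    return {
        'date': timestamp_to_date(item['dt']),
        'max_temp': item['temp']['max'],
        'min_temp': item['temp']['min'],
        'humidity': item['humidity'],
        'description': item['weather'][0]['description'],
    }

# Two days trimmed from the API response above
days = [
    {'dt': 1436893200, 'humidity': 60,
     'temp': {'max': 80.47, 'min': 71.26},
     'weather': [{'description': 'light rain'}]},
    {'dt': 1437066000, 'humidity': 51,
     'temp': {'max': 76.12, 'min': 59.11},
     'weather': [{'description': 'sky is clear'}]},
]

# One cleaned dictionary per item in the source list
cleaned = [clean_day(item) for item in days]
print(cleaned)
```

Whether the loop or the comprehension reads better is a matter of taste; the comprehension is shorter, but only because the helper function carries the detail.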
SIDEBAR: UNIX timestamps? "Datetime"? What is all this nonsense? This is really a subject for its own tutorial, but here's the brief. Computers usually internally represent time as a number. One of the most common ways of indicating time is to count how many seconds have passed since the "UNIX epoch", or January 1st, 1970 at 12:00am UTC. The "timestamp" that the Open Weather API returns is such an integer. In order to convert this integer into a readable date, I used Python's built-in datetime module. The code in timestamp_to_date() looks convoluted, but basically boils down to "convert this weird UNIX timestamp into a readable representation of the date it corresponds to."
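To make the "seconds since the epoch" idea concrete, here is a small sketch that does the arithmetic by hand: start at the epoch and add the timestamp as a number of seconds. The names epoch and moment are introduced for illustration. Note that this version gives the time in UTC, whereas timestamp_to_date() uses the computer's local time zone, so the two can disagree about the date for timestamps near midnight.

```python
import datetime

# The UNIX epoch: midnight UTC on January 1st, 1970
epoch = datetime.datetime(1970, 1, 1)

# The "dt" value for the first forecast day in the response above
dt = 1436893200

# Adding that many seconds to the epoch gives the moment in UTC
moment = epoch + datetime.timedelta(seconds=dt)
print(moment.isoformat())
# 2015-07-14T17:00:00
```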
With our revised, cleaned data, we can start doing some fun tricks. First, let's sort the days by their high temperature, in ascending order:
sorted(cleaned, key=lambda x: x['max_temp'])
[{'date': '2015-07-16', 'description': u'sky is clear', 'humidity': 51, 'max_temp': 76.12, 'min_temp': 59.11}, {'date': '2015-07-17', 'description': u'sky is clear', 'humidity': 52, 'max_temp': 79.12, 'min_temp': 62.98}, {'date': '2015-07-15', 'description': u'light rain', 'humidity': 77, 'max_temp': 79.99, 'min_temp': 65.43}, {'date': '2015-07-14', 'description': u'light rain', 'humidity': 60, 'max_temp': 80.47, 'min_temp': 71.26}, {'date': '2015-07-18', 'description': u'moderate rain', 'humidity': 0, 'max_temp': 82.94, 'min_temp': 72.09}]
We can get the date of the coolest day by grabbing the first item from the sorted list and accessing its date key, like so:
by_temp = sorted(cleaned, key=lambda x: x['max_temp'])
by_temp[0]['date']
'2015-07-16'
Or as one expression:
sorted(cleaned, key=lambda x: x['max_temp'])[0]['date']
'2015-07-16'
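If all you need is the single coolest (or warmest) day, the built-in min() and max() functions accept the same key parameter as sorted(), so you can skip sorting the whole list. A minimal sketch, using a shortened copy of the cleaned data; coolest and warmest are names introduced for illustration:

```python
# A shortened copy of the cleaned forecast data from above
cleaned = [
    {'date': '2015-07-14', 'max_temp': 80.47},
    {'date': '2015-07-15', 'max_temp': 79.99},
    {'date': '2015-07-16', 'max_temp': 76.12},
]

# min() and max() return the one item with the smallest or largest key value
coolest = min(cleaned, key=lambda x: x['max_temp'])
warmest = max(cleaned, key=lambda x: x['max_temp'])

print(coolest['date'])  # 2015-07-16
print(warmest['date'])  # 2015-07-14
```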
Let's filter the list to ensure that we only have days where there is no rain. You might accomplish this like so:
filter(lambda x: "rain" not in x['description'], cleaned)
[{'date': '2015-07-16', 'description': u'sky is clear', 'humidity': 51, 'max_temp': 76.12, 'min_temp': 59.11}, {'date': '2015-07-17', 'description': u'sky is clear', 'humidity': 52, 'max_temp': 79.12, 'min_temp': 62.98}]
You could also write this as a list comprehension, of course:
[day for day in cleaned if "rain" not in day['description']]
[{'date': '2015-07-16', 'description': u'sky is clear', 'humidity': 51, 'max_temp': 76.12, 'min_temp': 59.11}, {'date': '2015-07-17', 'description': u'sky is clear', 'humidity': 52, 'max_temp': 79.12, 'min_temp': 62.98}]
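One caveat, in case you run this code under Python 3 rather than the Python 2 used in this tutorial: there, filter() returns a lazy iterator instead of a list, so you need to wrap it in list() to see the results. The list comprehension behaves the same in both versions. A short sketch with a trimmed copy of the data (dry_filter and dry_comp are illustrative names):

```python
# A trimmed copy of the cleaned forecast data from above
cleaned = [
    {'date': '2015-07-14', 'description': 'light rain'},
    {'date': '2015-07-16', 'description': 'sky is clear'},
    {'date': '2015-07-18', 'description': 'moderate rain'},
]

# list() forces the filter to produce all of its results (needed in Python 3,
# harmless in Python 2)
dry_filter = list(filter(lambda x: "rain" not in x['description'], cleaned))

# The equivalent list comprehension
dry_comp = [day for day in cleaned if "rain" not in day['description']]

print(dry_filter == dry_comp)  # True
```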
If we wanted an overview of all the kinds of weather that we might experience in the given forecast range, we could generate a set of all unique descriptions:
set([day['description'] for day in cleaned])
{u'light rain', u'moderate rain', u'sky is clear'}
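Python (2.7 and later) also has a set comprehension syntax, which builds the set directly with curly braces and skips the intermediate list. A minimal sketch with a trimmed copy of the data; kinds is a name introduced for illustration:

```python
# A trimmed copy of the cleaned forecast data from above
cleaned = [
    {'date': '2015-07-14', 'description': 'light rain'},
    {'date': '2015-07-15', 'description': 'light rain'},
    {'date': '2015-07-16', 'description': 'sky is clear'},
    {'date': '2015-07-18', 'description': 'moderate rain'},
]

# Curly braces with a single expression (no colon) make a set;
# the duplicate 'light rain' appears only once in the result
kinds = {day['description'] for day in cleaned}
print(len(kinds))  # 3
```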
Or, we could make a dictionary that maps dates to humidity:
dict([(day['date'], day['humidity']) for day in cleaned])
{'2015-07-14': 60, '2015-07-15': 77, '2015-07-16': 51, '2015-07-17': 52, '2015-07-18': 0}
The above could also be written as a dictionary comprehension:
date_humidity = {day['date']: day['humidity'] for day in cleaned}
date_humidity
{'2015-07-14': 60, '2015-07-15': 77, '2015-07-16': 51, '2015-07-17': 52, '2015-07-18': 0}
Now that we have a dictionary, we can easily look up the humidity for a given date:
date_humidity['2015-07-17']
52
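To close the loop on the picnic question, here is one possible sketch that combines filtering and sorting. The criteria (no rain in the description, then the lowest high temperature) are just an assumption about what "best" means, and picnic_candidates and best_day are names introduced for illustration. The data is the cleaned list from above, with the u prefixes dropped.

```python
# The cleaned forecast from above
cleaned = [
    {'date': '2015-07-14', 'description': 'light rain', 'humidity': 60, 'max_temp': 80.47, 'min_temp': 71.26},
    {'date': '2015-07-15', 'description': 'light rain', 'humidity': 77, 'max_temp': 79.99, 'min_temp': 65.43},
    {'date': '2015-07-16', 'description': 'sky is clear', 'humidity': 51, 'max_temp': 76.12, 'min_temp': 59.11},
    {'date': '2015-07-17', 'description': 'sky is clear', 'humidity': 52, 'max_temp': 79.12, 'min_temp': 62.98},
    {'date': '2015-07-18', 'description': 'moderate rain', 'humidity': 0, 'max_temp': 82.94, 'min_temp': 72.09},
]

# Step 1: keep only the days with no rain in the description
picnic_candidates = [day for day in cleaned if "rain" not in day['description']]

# Step 2: order the remaining days from coolest to warmest high temperature
picnic_candidates = sorted(picnic_candidates, key=lambda x: x['max_temp'])

# Step 3: take the first one, if any dry day exists at all
if picnic_candidates:
    best_day = picnic_candidates[0]['date']
else:
    best_day = None

print(best_day)
# 2015-07-16
```

You could swap in different criteria (say, sorting by humidity instead) by changing only the filter condition or the key function.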
I hope this has been a helpful overview! Here's some further reading: