#!/usr/bin/env python
# coding: utf-8

# # Let's jump in

# To pull in comma-separated datasets like those for our inflammation data, we make use of a package called ``numpy``. This is the package to use for anything numerical, especially when messing with whole arrays of data.

# In[1]:

import numpy

# In[2]:

numpy.loadtxt("inflammation-01.csv", delimiter=',')

# We can load one of the inflammation data sets using the ``numpy.loadtxt`` function. This function reads the data from disk, but because we haven't attached a name to the result, the array is displayed and then discarded rather than kept around for later use. A bit about names...

# The equal sign is known as the "assignment" operator. For example, given some weight in kg, we can attach a name to that weight like:

# In[3]:

weight_kg = 55

# In[4]:

print weight_kg

# The value ``55`` has the name ``weight_kg`` attached to it. To get the value, we need only use the name. We can use this in other expressions that may assign to other names:

# In[5]:

weight_lb = 2.2 * weight_kg

# In[6]:

weight_lb

# ``weight_lb`` is calculated from ``weight_kg``. What happens to ``weight_lb`` if we change the value attached to ``weight_kg``?

# In[7]:

weight_kg = 66

# In[8]:

weight_lb

# Nothing! That's because ``weight_lb`` only remembers the value assigned to it, not how that value was obtained. To update its value, it must be recalculated:

# In[9]:

weight_lb = 2.2 * weight_kg

# In[10]:

weight_lb

# We can assign a name to our data array, which quantifies the inflammation observed in a set of patients (rows) over succeeding days (columns), so we can begin working with it in memory:

# In[11]:

data = numpy.loadtxt('inflammation-01.csv', delimiter=',')

# In[12]:

print data

# Arrays are rectangular datasets that consist entirely of a single data type (in this case, floating-point numbers). They have attributes such as:

# In[13]:

data.shape

# and methods to calculate quantities from their elements such as:

# In[14]:

print "mean for all values:", data.mean()

# Many of their methods can be fed parameters to change their behavior. For example, instead of the mean of all elements in the array, perhaps we want the mean of each row (that is, for each patient, calculated across all days):

# In[15]:

print "mean for each patient:\n",
print data.mean(axis=1)

# Notice the shape of the output array:

# In[16]:

data.mean(axis=1).shape

# If you want to know the other parameters a function or method takes, put a question mark at the end of its name and run the cell. The documentation will pop up in the notebook:

# In[17]:

# displays the docstring
get_ipython().run_line_magic('pinfo', 'data.mean')

# We can also get the mean across all patients for each day:

# In[18]:

print "mean for each day:\n",
print data.mean(axis=0)

# *Note*: python indices are 0-based, meaning counting goes as 0, 1, 2, 3...; this means that the first axis (rows) is axis 0, the second axis (columns) is axis 1.

# How do we get at individual elements in an array? We can use indices to grab them. To get the inflammation value for the fifth patient on the 10th day, we can use:

# In[19]:

data[4, 9]

# Remember that python uses 0-based indices! The data for the first patient on the first day is:

# In[20]:

data[0, 0]

# We can select subsets of our data with slicing. To select only the first 5 rows (patients) of the array, we can do:

# In[21]:

data[0:5]

# In[22]:

data[0:5].shape

# This slice should be read as "get the 0th element up to and not including the 5th element". The "not including" is important, and the cause of much initial frustration. It does take some getting used to.
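# A quick way to convince yourself of the "up to but not including" rule is to compare a slice's bounds with its length; here is a minimal check (an addition to the original walkthrough) that assumes the same ``data`` array is still loaded:

# In[ ]:

subset = data[2:5]    # rows 2, 3 and 4 -- three rows, not four
print subset.shape    # the first dimension is 5 - 2 = 3
print len(subset)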
# We can also slice the columns; to get only the data for the first five rows (first five patients) and columns 9 through 12 (10th through 13th day), we can use:

# In[23]:

data[0:5, 9:13]

# Since the output of a sliced array is another array, we can run methods on these just like before. For example, we can get the standard deviation across days for each patient for this subset:

# In[24]:

data[0:5, 9:13].std(axis=1)

# We can treat arrays as if they are single numbers in arithmetic operations. For example, multiplying the array by 2 doubles each element of the array:

# In[25]:

double = data * 2

# In[26]:

data

# In[27]:

double

# Arithmetic operations can also be applied between arrays of the same shape; these operations are performed on corresponding elements (this element-wise behavior holds for addition, subtraction, multiplication, and division alike).

# In[28]:

triple = double + data

# In[29]:

triple

# This amounts to $3x_{i,j}^2$, for each element $x_{i,j}$ of the array:

# In[30]:

triple * data

# Interested in performing matrix operations between arrays? There are methods built into numpy for that. For example, the inner product:

# In[31]:

numpy.inner(triple, data)

# Let's try viewing this array graphically. Just as ``numpy`` is the de facto library for numerically intensive work, ``matplotlib`` is the de facto library for producing high-quality plots from numerical data. A saying often goes that ``matplotlib`` makes easy things easy and hard things possible when it comes to making plots.

# In[32]:

import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')

# We can throw our data into the ``matplotlib.pyplot.imshow`` function to view it as a heatmap:

# In[33]:

plt.imshow(data)

# We can try viewing the mean inflammation value across patients for each day directly using the ``matplotlib.pyplot.plot`` function.

# In[34]:

plt.plot(data.mean(axis=0))

# Huh...that looks a bit orderly. Almost completely linear with a positive slope to precisely day 20, then back down with about the same slope to day 40...

# We can try plotting the max value across all patients for each day:

# In[35]:

plt.plot(data.max(axis=0))

# That's fishy...

# The min across all patients for each day?

# In[36]:

plt.plot(data.min(axis=0))

# A step function, with steps that appear to go in increments of 4 or so across the whole domain of days. Weird...

# Matplotlib can make figures that include multiple sub-figures, referred to as *axes*. We can put the three plots above side-by-side with something like:

# In[37]:

# make a figure object
fig = plt.figure(figsize=(10,3))

# make three axes objects for the figure, each with a different position
# on a grid that has 1 row and 3 columns of axes
ax1 = fig.add_subplot(1, 3, 1)
ax2 = fig.add_subplot(1, 3, 2)
ax3 = fig.add_subplot(1, 3, 3)

# plot the means
ax1.set_ylabel('average')
ax1.plot(data.mean(axis=0))

# plot the maxes
ax2.set_ylabel('max')
ax2.plot(data.max(axis=0))

# plot the mins
ax3.set_ylabel('min')
ax3.plot(data.min(axis=0))

fig.tight_layout()

# It looks like the books were cooked for this data set. What about the other 11? Instead of checking them manually, we can let the computer do what it's good at: doing the same thing repetitively. We need a loop.

# # Loops are for looping

# First, an aside with strings. Most programming languages generally deal in a few basic data types, including integers, floating-point numbers, and characters. A string is a sequence of characters, and we can manipulate them easily with python.
# For example:

# In[38]:

word = 'lead'

# We have a string ``'lead'`` that we've assigned the name ``word``. We can examine the string's type, and indeed that of any object in python, with the ``type`` builtin.

# In[39]:

print type(word)

# We can also access its characters (it's a sequence!) with indexing:

# In[40]:

word[0]

# In[41]:

word[1]

# And slicing:

# In[42]:

word[1:]

# If I want to output each letter of ``word`` on a new line, I could do:

# In[43]:

print word[0]
print word[1]
print word[2]
print word[3]

# But this doesn't scale to words of arbitrary length. Instead, let's let the computer do the boring work of iterating with a ``for`` loop:

# In[44]:

for letter in word:
    print letter

# This same bit of code will work for *any* ``word``, including ridiculously long ones. Like this gem:

# In[45]:

word = 'supercallifragilisticexpialidocious'
print word

# In[46]:

for letter in word:
    print letter

# We can do more with loops than iterate through the elements of a sequence. A common use-case is building up a value, such as a count. For example, getting the number of letters in ``word``:

# In[47]:

counter = 0
for letter in word:
    counter = counter + 1

print counter

# For this very particular use there's a python built-in, ``len``, which is both clearer and more efficient. Use it instead:

# In[48]:

len(word)

# It works on any sequence! For arrays it gives the length of the first axis, in this case the number of rows:

# In[49]:

len(data)

# # Lists: ordered collections of other objects (even other lists)

# Loops work best when they have something to iterate over. In python, the ``list`` is a built-in data structure that serves as a container for a sequence of other objects. We can make an empty list with:

# In[50]:

stuff = list()

# In[51]:

print stuff

# And we can append things to the end of the list, in this case a bunch of random strings:

# In[52]:

stuff.append('marklar')

# In[53]:

stuff.append('chewbacca')

# In[54]:

stuff.append('chickenfingers')

# In[55]:

stuff

# Just like with the ``word`` example, we can iterate through the elements of the list:

# In[56]:

for item in stuff:
    print item

# And slicing works too! Slicing with negative numbers counts backwards from the end of the list. The slice ``stuff[-2:]`` could be read as "get the 2nd-to-last element of ``stuff``, through the last element":

# In[57]:

for item in stuff[-2:]:
    print item

# So indexing with ``-1`` gives the last element:

# In[58]:

stuff[-1]

# And consistently, but uselessly, slicing from and to the same index yields nothing. Remember, for a slice ``n:m``, it reads as "get all elements starting with element ``n`` and up to but not including element ``m``."

# In[59]:

stuff[0:0]

# Lists are mutable: that is, their elements can be changed. For example, we can replace ``'chewbacca'`` with ``'hansolo'``:

# In[60]:

stuff[1]

# In[61]:

stuff[1] = 'hansolo'

# In[62]:

stuff[1]

# ...or replace ``'hansolo'`` with our ``data`` array:

# In[63]:

stuff[1] = data

# In[64]:

stuff[1]

# In[65]:

print stuff

# We can make another list, including ``stuff`` as an element:

# In[66]:

morestuff = ['logs', stuff]

# In[67]:

morestuff

# Lists can contain lists can contain lists can contain lists...and of course any number of other python objects. Even functions can be included as elements:

# In[68]:

morestuff.append(len)

# In[69]:

morestuff

# Want the third element of the second element of ``morestuff``?
# In[70]:

morestuff[1][2]

# Lists also have their own methods, which usually alter the list itself:

# In[71]:

stuff

# In[72]:

stuff.reverse()

# In[73]:

stuff

# ``list.pop`` removes the element at the given index:

# In[74]:

stuff.pop(1)

# ...while ``list.remove`` removes the first instance of the given element in the list:

# In[75]:

stuff.remove('chickenfingers')

# So now our list looks like:

# In[76]:

stuff

# # Analyzing at scale

# We noticed what looked like fraudulent data in the first dataset we analyzed. To look at more datasets without manually running our block of plotting code on each, we'll take advantage of loops and lists.

# To grab all the filenames for our inflammation data sets, we'll use the ``glob`` library. This includes a function (``glob.glob``) that will return a list of filenames matching a "glob" pattern, which is what we used for matching files in the bash shell.

# In[77]:

import glob

# In[78]:

# this will get us all the inflammation data files; we don't care about order
print glob.glob('inflammation*.csv')

# Stealing our code block for the plots above and putting it in a loop over filenames, we can look at a whole block of figures. We can slice a subset of the filenames to limit the number we plot at one time.

# In[79]:

# get filenames
filenames = glob.glob('inflammation*.csv')

# iterate through each filename, perhaps only a slice of them
for filename in filenames[0:7]:

    # so we know what we're looking at
    print filename

    # load data
    data = numpy.loadtxt(fname=filename, delimiter=',')

    # make a figure object, with 3 axes in 1 row, 3 columns
    fig = plt.figure(figsize=(10,3))
    ax1 = fig.add_subplot(1, 3, 1)
    ax2 = fig.add_subplot(1, 3, 2)
    ax3 = fig.add_subplot(1, 3, 3)

    # plot means
    ax1.set_ylabel('average')
    ax1.plot(data.mean(axis=0))

    # plot maxes
    ax2.set_ylabel('max')
    ax2.plot(data.max(axis=0))

    # plot mins
    ax3.set_ylabel('min')
    ax3.plot(data.min(axis=0))

    # this is useful for making the plot elements not overlap
    fig.tight_layout()

    # this will tell matplotlib to draw the figure now, instead
    # of letting the notebook wait to draw them all at once when the cell
    # finishes executing
    plt.show()

print "done"

# At least one new pattern emerged: all mins were zero for one dataset! It only gets weirder...

# ## A short aside: showing off pandas

# We don't have time to cover ``pandas`` in this workshop, but it is quickly becoming the de facto tool for working with tabular data interactively, and it adds great power to ``numpy``, which it is built on top of. Among its features are the ability to cleanly handle datasets with missing values (non-rectangular data), to join datasets along non-integer indices, and to group datasets by column values, as well as many convenience methods for quickly generating useful visualizations. There's plenty more, but you'll have to explore it on your own for now.

# But here's a quick demo; say we want to examine how the distribution of inflammation values of patients changes with each passing day. We can do this easily by putting this data in a ``pandas.DataFrame``, then plotting directly:

# In[80]:

import pandas as pd

# In[81]:

df = pd.DataFrame(data)

# We can look at what the top of our data looks like as a DataFrame:

# In[82]:

df.head()

# We'll quickly plot a histogram for each day from 1 to 20 (the rise) to see how the distributions shift with day. Cyan distributions are closer to day 1; magenta distributions are closer to day 20.
# In[83]:

df.iloc[:, 1:21].plot(kind='hist', legend=False, xlim=(0, 20), bins=range(0, 21, 1),
                      colormap='cool', alpha=.2, histtype='stepfilled', figsize=(13,4))

# We can also plot day 21 through day 39 (the fall) to see how the distributions shift with day. Cyan distributions are closer to day 21; magenta distributions are closer to day 39.

# In[84]:

df.iloc[:, 21:40].plot(kind='hist', legend=False, xlim=(0, 20), bins=range(0, 21, 1),
                       colormap='cool', alpha=.2, histtype='stepfilled', figsize=(13,4))

# We can do the same plots as kernel density estimates quite easily, though with the dataset being discrete and rather small this might not be so useful:

# In[85]:

df.iloc[:, 1:21].plot(kind='kde', legend=False, xlim=(0, 20), colormap='cool', figsize=(13,4))
df.iloc[:, 21:41].plot(kind='kde', legend=False, xlim=(0, 20), colormap='cool', figsize=(13,4))

# Not pretty enough? There are tools for that. ``seaborn`` is quite nice, and ``matplotlib`` supports style presets:

# In[86]:

import seaborn.apionly as sns
plt.style.use('ggplot')
sns.set_style('ticks')

# In[87]:

ax = df.iloc[:, 1:21].plot(kind='hist', legend=False, xlim=(0, 20), bins=range(0, 21, 1),
                           colormap='cool', alpha=.2, histtype='stepfilled', figsize=(13,4))
ax.grid(False)
sns.despine(offset=10)

ax = df.iloc[:, 21:40].plot(kind='hist', legend=False, xlim=(0, 20), bins=range(0, 21, 1),
                            colormap='cool', alpha=.2, histtype='stepfilled', figsize=(13,4))
ax.grid(False)
sns.despine(offset=10)

# # Conditionals: or, what do I do now?

# We could look at each set of figures for each dataset manually, which is much easier now with the loop we wrote. However, what if we had hundreds of datasets, instead of twelve? That approach wouldn't scale well. Can we write some code to help identify which datasets, and how many, show what flavor of weird behavior?

# We can use conditionals. To demonstrate how these work, we'll use absurd toy examples. Say we have some number:

# In[88]:

num = 37

# We can ask whether that number is greater than 100 with a conditional like this:

# In[89]:

if num > 100:
    print 'greater'
else:
    print 'not greater'

# We could add some more sophistication to the conditional by checking for equality:

# In[90]:

if num > 100:
    print 'greater'
elif num == 100:
    print 'equal!'
else:
    print 'not greater'

# ...and we could test it:

# In[91]:

num = 100

# In[92]:

if num > 100:
    print 'greater'
elif num == 100:
    print 'equal!'
else:
    print 'not greater'

# Conditionals can include boolean operators such as ``and``:

# In[93]:

if (1 > 0) and (-1 > 0):
    print 'both parts are true'
else:
    print 'one part is not true'

# A "truth table" for ``and``, given two boolean values ``s1`` and ``s2`` (each either ``True`` or ``False``), looks like:

# | s1    | s2    | s1 and s2 |
# | ----- | ----- | --------- |
# | True  | True  | True      |
# | True  | False | False     |
# | False | True  | False     |
# | False | False | False     |

# There's also ``or``:

# In[94]:

if (1 > 0) or (-1 > 0):
    print 'at least one part is true'
else:
    print 'no part is true'

# | s1    | s2    | s1 or s2 |
# | ----- | ----- | -------- |
# | True  | True  | True     |
# | True  | False | True     |
# | False | True  | True     |
# | False | False | False    |

# And ``not``:

# In[95]:

if (1 > 0) and not (-1 > 0):
    print 'both parts are true'
else:
    print 'one part is not true'

# | s1    | not s1 |
# | ----- | ------ |
# | True  | False  |
# | False | True   |

# ## Let's check the data, automatically!
# We can try to filter the datasets according to some criteria, each matching a case of weird behavior we saw in our plots:

# In[96]:

filenames = glob.glob('inflammation*.csv')

# we'll count the number of each weird variety as we go
count_susp_max = 0
count_susp_min = 0
count_okay = 0

for filename in filenames:
    data = numpy.loadtxt(filename, delimiter=',')
    print filename

    # check for the cases where we see min inflammation as precisely 0 on day 0
    # and max inflammation as precisely 20 on day 20
    if data.min(axis=0)[0] == 0 and data.max(axis=0)[20] == 20:
        print 'suspicious looking maxima!'

        # increment count_susp_max
        count_susp_max += 1

    # check for cases where min is zero across all days
    elif data.min(axis=0).sum() == 0:
        print 'minima add up to zero!'

        # increment count_susp_min
        count_susp_min += 1

    # keep track of datasets that look okay, at least so far
    else:
        print "seems 'okay'!"
        count_okay += 1

# What did we get for each count?

# In[97]:

count_susp_max

# In[98]:

count_susp_min

# In[99]:

count_okay

# Sadly, all the data shows fishy behavior. Someone has explaining to do...

# # Functions: reusable, modular pieces of code

# If we had more datasets on the way, we might want to run our tests against those, too. Instead of rewriting the same block of code each time, or copying and pasting it here or there, we should put it into a self-contained function. This way we can also improve it by working on only one piece of code, not a ton of duplicates.

# We've already been using functions. One example is the python built-in ``len``:

# In[100]:

stuff

# In[101]:

len(stuff)

# In[102]:

# run this to view the docstring
get_ipython().run_line_magic('pinfo', 'len')

# Let's write a simple function to show the general form; the function ``smaller`` tests whether the input value is smaller than ``10``, then returns either ``True`` or ``False`` depending on the result:

# In[103]:

# def is followed by the name of the new function, then
# its arguments in parentheses
def smaller(somenumber):
    if somenumber < 10:
        # mind the indentation levels!
        out = True
    else:
        out = False

    return out    # return gives the value to be output by the function

# In[104]:

smaller(157)

# In[105]:

smaller(7)

# In[106]:

smaller(10)

# In fact, for such a simple function it can be condensed greatly:

# In[107]:

def smaller(somenumber):
    return somenumber < 10

# Writing functions makes it easier to write *better* code as well as reusable code. *Better* in that it is easier to test, and therefore more likely to actually do what you want it to do. Functions can (and should!) be written to do individual tasks (like bash tools) so they can be used as building blocks for larger programs. As an example, let's write a function that calculates the temperature in Kelvin from a temperature in Fahrenheit:

# In[108]:

def fahr_to_kelvin(temp):
    return ((temp - 32) * (5/9)) + 273.15

# Now, let's do some quick sanity tests:

# In[109]:

fahr_to_kelvin(32)

# The freezing point of water looks good!

# In[110]:

fahr_to_kelvin(212)

# Ummm...the boiling point of water should be closer to 373.15 K. What's up? Although languages like python have debugger facilities (see [pdb](https://docs.python.org/2/library/pdb.html), the built-in debugger), we can debug our simple function by checking the individual pieces to make sure they calculate what we think they should.

# In[141]:

print (212 - 32)

# That works. The next operation?

# In[142]:

print (212 - 32) * (5/9)

# Ah! Something funny is happening here!
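# Before narrowing it down further, one quick extra check (an addition to the original walkthrough) is to look at the last factor on its own, and at its type:

# In[ ]:

print 5/9          # both operands are integers, so python 2 performs integer division
print type(5/9)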
# In[143]:

180 * (5/9)

# This is a common "gotcha" that has to do with how computers store numbers in memory. In many programming languages, including python 2, arithmetic operations between integers (such as 5 and 9) always yield integers. So when two integers are divided, everything behind the decimal point is truncated (lopped off), leaving behind only what's in front of the decimal point. In this case, just ``0``.

# To fix this, we can convert one of the values into a floating-point number. Division between a float and an integer yields a float:

# In[144]:

float(5)/9

# Also we could do this:

# In[145]:

5.0/9

# But note that using ``float()`` works even for variables, so it's a bit more useful in general.

# We now have:

# In[146]:

def fahr_to_kelvin(temp):
    return ((temp - 32) * float(5)/9) + 273.15

# In[147]:

fahr_to_kelvin(32)

# In[148]:

fahr_to_kelvin(212)

# It appears to work!

# What about a Kelvin to Celsius converter?

# In[149]:

def kelvin_to_celsius(temp):
    return temp - 273.15

# In[150]:

kelvin_to_celsius(273.15)

# In[151]:

kelvin_to_celsius(373.15)

# The sanity checks work out.

# We can now write a function to convert Fahrenheit to Celsius temperatures by *composing* our two converters together:

# In[154]:

def fahrenheit_to_celsius(temp):
    temp = fahr_to_kelvin(temp)
    temp = kelvin_to_celsius(temp)
    return temp

# We can make this simpler:

# In[155]:

def fahrenheit_to_celsius(temp):
    return kelvin_to_celsius(fahr_to_kelvin(temp))

# We want to apply the principle of encapsulation (using functions) to our fraud-detection code. We'll write two functions: one that generates three plots for a dataset given the filename, and another that prints the category of weirdness (or lack thereof) the dataset displays.

# In[156]:

import numpy
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')

# For the first function, this works:

# In[159]:

def make_plots(filename):

    # so we know what we're looking at
    print filename

    # load data
    data = numpy.loadtxt(fname=filename, delimiter=',')

    # make a figure object, with 3 axes in 1 row, 3 columns
    fig = plt.figure(figsize=(10,3))
    ax1 = fig.add_subplot(1, 3, 1)
    ax2 = fig.add_subplot(1, 3, 2)
    ax3 = fig.add_subplot(1, 3, 3)

    # plot means
    ax1.set_ylabel('average')
    ax1.plot(data.mean(axis=0))

    # plot maxes
    ax2.set_ylabel('max')
    ax2.plot(data.max(axis=0))

    # plot mins
    ax3.set_ylabel('min')
    ax3.plot(data.min(axis=0))

    # this is useful for making the plot elements not overlap
    fig.tight_layout()

    plt.show()

# In[160]:

make_plots('inflammation-02.csv')

# And for the second:

# In[161]:

def filter_data(filename):
    """Apply labeling criteria for fishy datasets.

    Prints the label to standard out

    :Arguments:
        *filename*
            filename for a comma-separated file giving inflammation data

    :Returns:
        *label*
            the flavor of fishiness of this data

    """
    data = numpy.loadtxt(filename, delimiter=',')
    print filename

    # check for the cases where we see min inflammation as precisely 0 on day 0
    # and max inflammation as precisely 20 on day 20
    if data.min(axis=0)[0] == 0 and data.max(axis=0)[20] == 20:
        out = 'suspicious looking maxima!'

    # check for cases where min is zero across all days
    elif data.min(axis=0).sum() == 0:
        out = 'minima add up to zero!'

    # keep track of datasets that look okay, at least so far
    else:
        out = "seems 'okay'!"

    print out

    return out

# Note that we've added what looks like documentation in triple quotes (``"""``) at the beginning of the function definition.
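# Python keeps that triple-quoted text as the function's *docstring*; as a small aside (not part of the original walkthrough), it can also be read back directly through the built-in ``help`` function or the ``__doc__`` attribute:

# In[ ]:

print filter_data.__doc__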
# This allows us to quickly check what our function does in the same way we can for functions like ``numpy.loadtxt``:

# In[162]:

get_ipython().run_line_magic('pinfo', 'filter_data')

# In addition to just printing output to ``stdout``, the function also returns the output. This allows us to make use of it, perhaps in more code we would want to write later:

# In[164]:

value = filter_data('inflammation-01.csv')

# In[166]:

print value

# Now let's use our functions to examine many files at once, as before!

# In[167]:

import glob

# In[168]:

filenames = glob.glob('inflammation-*.csv')
filenames

# In[169]:

for f in filenames[:5]:
    make_plots(f)
    filter_data(f)

# Our ``for`` loop went from being long and complicated to simple and easy to understand. This is part of the power of functions. Because the human brain can only keep about $7 \pm 2$ concepts in working memory at any given time, functions allow us to abstract and build more complicated code without making it more difficult to understand. Instead of having to worry about *how* ``filter_data`` does its thing, we need only know its inputs and its outputs to wield it effectively. It can be re-used without copying and pasting, and we can maintain it and improve it easily since it is defined in a single place.

# Since we may want to use these functions later in other notebooks/code, we can put their definitions in a file and import it. Open a text editor and copy the function definitions (and the ``import``s they need) into a file called ``datastuff.py``, which looks like this:

# In[170]:

get_ipython().run_line_magic('cat', 'datastuff.py')

# We can then import this as a module:

# In[171]:

import datastuff

# And use the functions defined inside it:

# In[173]:

datastuff.filter_data('inflammation-01.csv')

# In[174]:

datastuff.make_plots('inflammation-01.csv')

# Note that if we make changes to the file and want to see those reflected in our current notebook session, we can use:

# In[175]:

reload(datastuff)

# # Errors and exceptions

# Encountering errors is part of coding. If you are coding, you will hit errors. The important thing to remember is that errors that are loud are the best kind, because they usually give hints as to what the problem is. In python, errors are known as ``exceptions``, and there are a few common varieties. Familiarity with these varieties helps to more quickly identify the root of the problem, which means we can more quickly fix it.

# Let's import some example code; we need to first add the location of the module we want to import to the list of places python looks for modules:

# In[177]:

import sys
sys.path.append('scripts')

# Then we can import it:

# In[178]:

import errors_01

# One example is the ``IndexError``:

# In[179]:

errors_01.favorite_ice_cream()

# Let's dissect this. The first line shows where the error originates in *our* code. In this case it's the only line in the cell. The next set of lines shows where the error originates at the source: that is, inside the definition of ``errors_01.favorite_ice_cream()`` on line 7 of the module ``errors_01``. The last line gives us a hint as to the nature of the ``IndexError``: we appear to be using an index for an element of a list that doesn't exist.

# We can see this type of error for lists of our own:

# In[180]:

a = ['marklar', 'lolcats']

# In[181]:

a[2]

# Another common type of exception is the ``SyntaxError``.
# This is the equivalent of using poor grammar, and the python interpreter is telling us it doesn't understand what we are saying:

# In[182]:

for item in a
    print item

# There's also the ``IOError``. "IO" means "input/output", which commonly means reading and writing to disk. It usually shows up when we try to read a file that doesn't exist:

# In[184]:

file_handle = open('myfile.txt', 'r')

# Or if we try to read a file that we've opened exclusively for writing, in which case python is telling us that our code is probably doing something we don't want it to.

# In[185]:

file_handle = open('scripts/myfile.txt', 'w')

# In[186]:

file_handle.read()

# There are other types of exceptions, but these are the most common. You need not know all of them, but having some knowledge of the basic varieties is valuable for understanding what's going wrong, and then how to fix it!

# # Defensive programming: using assertions

# Say we write a function that does something a bit obtuse (heh): given the coordinates of the lower-left and upper-right corners of a rectangle, it normalizes the rectangle so that its lower-left coordinate is at the origin (0, 0), and its longest side is 1.0 units long:

# In[217]:

def normalize_rectangle(rect):
    """Normalizes a rectangle so that it is at the origin, and 1.0 units long on its longest axis.
    :Arguments:
        *rect*
            a tuple of size four, giving (x0, y0, x1, y1), the (x,y) positions
            of the lower-left and upper-right corner
    :Returns:
        *newrect*
            a tuple of size four, giving the new coordinates of our rectangle's vertices
    """
    x0, y0, x1, y1 = rect

    assert x0 < x1, 'Invalid x coordinates'
    assert y0 < y1, 'Invalid y coordinates'

    dx = x1 - x0
    dy = y1 - y0

    if dx > dy:
        scaled = float(dx) / dy
        upper_x, upper_y = 1.0, scaled
    else:
        scaled = float(dx) / dy
        upper_x, upper_y = scaled, 1.0

    assert 0 < upper_x <= 1.0, 'Calculated upper X coordinate invalid'
    assert 0 < upper_y <= 1.0, 'Calculated upper Y coordinate invalid'

    return (0, 0, upper_x, upper_y)

# Notice that we added a nice documentation string at the top that explicitly states the function's inputs (arguments) and outputs (returns), so when we use it we don't need to mind its details anymore. Also, at the beginning of the function we have ``assert`` statements that check the validity of the inputs, while at the end of the function we have ``assert`` statements that check the validity of the outputs.

# These statements codify what our documentation says about what the inputs should mean and what the output should look like, guarding against cases where the function is given inputs it wasn't designed to handle.

# Let's test the function!

# In[192]:

normalize_rectangle((3, 4, 4, 6))

# Looks good. First point is at the origin; longest side is 1.0.

# In[193]:

normalize_rectangle((0.0, 1.0, 2.0, 5.0))

# Nice!

# In[194]:

normalize_rectangle((0.0, 0.0, 5.0, 1.0))

# Oh...what happened here? The inputs should be fine, but an assertion caught something strange in the output. Looking closely at the code, we spot a mistake in the scaling calculation for the ``dx > dy`` branch:

# In[222]:

def normalize_rectangle(rect):
    """Normalizes a rectangle so that it is at the origin, and 1.0 units long on its longest axis.
    :Arguments:
        *rect*
            a tuple of size four, giving (x0, y0, x1, y1), the (x,y) positions
            of the lower-left and upper-right corner
    :Returns:
        *newrect*
            a tuple of size four, giving the new coordinates of our rectangle's vertices
    """
    x0, y0, x1, y1 = rect

    assert x0 < x1, 'Invalid x coordinates'
    assert y0 < y1, 'Invalid y coordinates'

    dx = x1 - x0
    dy = y1 - y0

    if dx > dy:
        scaled = float(dy) / dx
        upper_x, upper_y = 1.0, scaled
    else:
        scaled = float(dx) / dy
        upper_x, upper_y = scaled, 1.0

    assert 0 < upper_x <= 1.0, 'Calculated upper X coordinate invalid'
    assert 0 < upper_y <= 1.0, 'Calculated upper Y coordinate invalid'

    return (0, 0, upper_x, upper_y)

# And now we get:

# In[213]:

normalize_rectangle((0.0, 0.0, 5.0, 1.0))

# In[214]:

normalize_rectangle((2, 3, 4, 6))

# That looks better.

# What if we were to try feeding in only three values instead of four?

# In[223]:

normalize_rectangle((2, 4, 5))

# That's a rather unhelpful error message. We can also add an assertion that checks that the input is long enough, and gives a message indicating what the problem is:

# In[225]:

def normalize_rectangle(rect):
    """Normalizes a rectangle so that it is at the origin, and 1.0 units long on its longest axis.
    :Arguments:
        *rect*
            a tuple of size four, giving (x0, y0, x1, y1), the (x,y) positions
            of the lower-left and upper-right corner
    :Returns:
        *newrect*
            a tuple of size four, giving the new coordinates of our rectangle's vertices
    """
    assert len(rect) == 4, "Rectangle must have 4 elements"

    x0, y0, x1, y1 = rect

    assert x0 < x1, 'Invalid x coordinates'
    assert y0 < y1, 'Invalid y coordinates'

    dx = x1 - x0
    dy = y1 - y0

    if dx > dy:
        scaled = float(dy) / dx
        upper_x, upper_y = 1.0, scaled
    else:
        scaled = float(dx) / dy
        upper_x, upper_y = scaled, 1.0

    assert 0 < upper_x <= 1.0, 'Calculated upper X coordinate invalid'
    assert 0 < upper_y <= 1.0, 'Calculated upper Y coordinate invalid'

    return (0, 0, upper_x, upper_y)

# In[226]:

normalize_rectangle((2, 4, 5))

# Now when we use this function six months later, we won't have to scratch our heads too long when this sort of thing happens! *Defensive programming* in this way makes code more robust, easier to debug, and easier to maintain. This is especially important for scientific code, since incorrect results can occur silently and end up getting published, perhaps only getting uncovered years later. Much embarrassment and misinformation can be avoided by being skeptical of our code, using tools such as ``assert`` statements to make sure it keeps doing what we think it should.
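# One way to take this a step further (a sketch, not part of the original lesson) is to collect the sanity checks we ran by hand into a small function that can be re-run any time ``normalize_rectangle`` changes:

# In[ ]:

def test_normalize_rectangle():
    """Re-run the sanity checks from above; raises AssertionError if any fail."""
    # a wide rectangle: the long (x) side should come out exactly 1.0 units long
    assert normalize_rectangle((0.0, 0.0, 5.0, 1.0))[2] == 1.0
    # a tall rectangle: the long (y) side should come out exactly 1.0 units long
    assert normalize_rectangle((0.0, 0.0, 1.0, 5.0))[3] == 1.0
    # the lower-left corner should always end up at the origin
    assert normalize_rectangle((3, 4, 4, 6))[:2] == (0, 0)
    print 'all checks passed'

test_normalize_rectangle()

# Collecting the checks in one place like this makes it easy to stay skeptical of the function every time it is edited.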