Hey, Dan Welling wrote this. If it's wrong, complain at him.
This intro is built from pieces of my own Python notebooks with some new material plugged in. If you found this material useful, there's more here.
The goal of this document is to give you a good first impression of Python: its capabilities, peculiarities, and potential power when employed for scientific research. I want you to leave for a bit of a flavor of how Python works and how to start to think in a "pythonic" way- using shortcuts made available through this language. I also want to introduce you to some of the Python terminology so that other talks make more sense.
This first line is for getting plots inline:
%matplotlib inline
Let's learn about some interesting features by assigning variables and printing to screen:
Dog = 1
Cat = 1E17
dog, cat = 'bark', 1/3.
print(Dog, Cat, dog, cat)
1 1e+17 bark 0.3333333333333333
In the above trivial example, we learned a few subtly important things:
Don't sleep on that last point. Multiple assignment is a core concept for writing efficient loops, handling function output, and more. It's a very useful feature!
Python has familiar data types that work as expected. Let's look at a list:
list1 = [1,2,'cat',cat,[1,2,3], 'f']
print(list1)
print(list1[0:3])
print(list1[-1])
print(list1[0::2])
print(list1[4][2])
[1, 2, 'cat', 0.3333333333333333, [1, 2, 3], 'f'] [1, 2, 'cat'] f [1, 'cat', [1, 2, 3]] 3
What did we do here? First, we created a list. Then, we used indexing and slicing to get the items we want to examine. Some important notes:
[0:3]
means that we obtain "all items starting from but not equal to item 3". Mathematically, we would say, $[0,3)$.Let's explore another common data type that a surprising number of scientists do not take advantage of: Dictionaries! Dictionaries (also known as associative arrays in other languages), are containers that organize things by key-value pairs. Contrast this with lists, where things have a clear order (denoted via indices). Let's look an example:
dict1 = {'dan':'cool guy', 1:Cat, 'gem':list1} # Again, we can put anything in here.
print(dict1)
print(dict1['dan'])
dict1['dan'] = 'Okay, not that cool IRL'
dict1['karl'] = 'A good name.'
print(dict1)
{'dan': 'cool guy', 1: 1e+17, 'gem': [1, 2, 'cat', 0.3333333333333333, [1, 2, 3], 'f']} cool guy {'dan': 'Okay, not that cool IRL', 1: 1e+17, 'gem': [1, 2, 'cat', 0.3333333333333333, [1, 2, 3], 'f'], 'karl': 'A good name.'}
When we store a value in our dictionary, we assign it a key -- either an integer or string literal that references that stored object. It is important to remember that dictionaries do not store things in order. When we print our dictionary out, the order it comes out in is set by sheer luck! Note that we can easily change or expand a dictionary by using old or creating new keys.
Why would you use this? Dictionaries are useful for grouping values that are related, but the order doesn't really matter. A good example is satellite time series data. Typically, you have an array of time, an array of one observed value (say, magnetic field) and an array of another observed value (say, plasma density). Strictly speaking, these values aren't related so closely that you want them in a three-by-number-of-time-points array. You could do that, but it gets unwieldy quickly because you have to remember what index refers to what value set. But the three values (time, magnetic field, and density) are related in the sense that they came from the same satellite over the same time period! If we stash those all in a dictionary, it groups them together in a way that makes sense- we can set the time array value to a key called time, etc. Dictionaries are useful. Try them out!
This is a short tutorial- I don't have time to get into many details. But I do want to show you what it looks like to write actual code in Python. Let's start with a simple for
loop:
for x in list1:
print('Our list item is ', x)
print('Loop over!')
Our list item is 1 Our list item is 2 Our list item is cat Our list item is 0.3333333333333333 Our list item is [1, 2, 3] Our list item is f Loop over!
A lot just happened. Let's break it down:
for
loops are unique in Python. For a full breakdown, see this primer. Essentially, instead of working over an incrementing index, we loop over the items in any sequence- a list, string, dictionary, etc.I won't re-litigate whitespace delimiters compared to other ways (e.g., curly braces) suffice to say that I really like it- it enforces well organized source code.
Let's do another example that highlights the importance of our multiple assignment:
for i, x in enumerate('A Catchy Sentence!'):
print("Letter #{} is {}".format(i,x))
Letter #0 is A Letter #1 is Letter #2 is C Letter #3 is a Letter #4 is t Letter #5 is c Letter #6 is h Letter #7 is y Letter #8 is Letter #9 is S Letter #10 is e Letter #11 is n Letter #12 is t Letter #13 is e Letter #14 is n Letter #15 is c Letter #16 is e Letter #17 is !
Even more things happening! We're getting complicated quickly!
i
and x
on each iteration. There are other very helpful functions that make loops insanely powerful, such as zip.Okay, one last thing- let's write a function. I'm going to be painfully superficial here...
def power(x, pow=2, factor=5):
'''
Take a value, *x*, to the power set by *pow* (defaults to 2).
Both the exponent result and the same thing multiplied by *factor*.
'''
return x**pow, factor*x**pow
result1, result2 = power(3)
print(result1, result2)
everything = power(10, factor=.5)
print(everything)
9 45 (100, 50.0)
We define functions with the def syntax. Inside the parentheses are arguments. When our function is called, all arguments are required in order. After that, we have keyword arguments, which, when the function is called, are optional and not ordered.
Next, we have the docstring. Docstrings are how we set our function's help information (discussed below). USE DOCSTRINGS EVERY TIME. After that, we have the meat of our function. In the end, we return two values.
The beauty in using functions in python is how multiple assignment is wielded. The function hands everything back at once, so we can collect the output all at once (as a tuple, or a list that can't be changed) or assign things to their own variable as we get them.
Let's talk more about that docstring. The docstring serves as the documentation for that function. To access it interactively, we use the following syntax inside of IPython:
power?
Let's do some real math now. Let's import "numpy", and explore some functionality:
import numpy as np
a1 = np.array( [1,2,3,4,5] )
a2 = np.arange( 5 ) + 1
print(a1[3::-1])
print(1.5 * a1)
print(a1+a2)
print(a1*a2)
[4 3 2 1] [1.5 3. 4.5 6. 7.5] [ 2 4 6 8 10] [ 1 4 9 16 25]
Let's start with the elephant in the source code: import
. Import makes material in a python module or package available for use. We can't use numpy until we import it into the current namespace. Note that we renamed it for the current session- it's easier to type np
than numpy
every time. We can import individual parts of a package by typing from module import something
. That's often a convenient way to get just one thing from a very large
Oh, and if anyone ever tells you to type from module import *
, don't listen to them. Ever. About anything. Stay away from that person. They're going to try to tell you that goto
is a good idea (it's not.)
As you can see, Numpy gives us real array syntax. Arrays work like lists that do math. Unlike lists, arrays are homogenous. We can do element-wise arithmetic, array and vector operations, and way more- masked arrays, means, stats, and much, much more. Numpy is very powerful and fast. Above, notice how we can index and slice arrays like regular lists. But Numpy arrays are better than that- we can do tricky indexing!
a3 = np.linspace(1,5, 10)
print(a3)
print(a3[ a3>=3 ])
x = (a3<4)&(a3>2)
print(x)
print(a3[x])
[1. 1.44444444 1.88888889 2.33333333 2.77777778 3.22222222 3.66666667 4.11111111 4.55555556 5. ] [3.22222222 3.66666667 4.11111111 4.55555556 5. ] [False False False True True True True False False False] [2.33333333 2.77777778 3.22222222 3.66666667]
Whoa, what was all that? First, we made an array of 10 evenly spaced values from 1 to 5 using np.linspace
. Then, instead of using indices to pull out values of interest, we used tricky indexing. Tricky indexing allows us to use a logic argument instead of a slice or index. The 2nd example breaks it down just a touch more: an numpy logic statement returns an array of the same shape with only boolean values that denote the result of the logic statement. We then plug that in to mask our array. Pretty cool, eh?
Let's do some more math:
print( a3.mean() )
print( a3.cumsum() )
3.0 [ 1. 2.44444444 4.33333333 6.66666667 9.44444444 12.66666667 16.33333333 20.44444444 25. 30. ]
Cool! We calculated the mean, then the cumulative sum of an array. There are way more functions (don't get married to that word, we're going to kill it in a second) that we can use. In IPython, we can tab-complete to see what else there is. We can also use ? to get help:
# SMASH TAB HERE! THEN USE QUESTION MARK TO SEE WHAT THINGS DO!
a3.all?
So what the heck is going on with this variable-dot-function syntax? This is actually an object-oriented way of doing business. The .mean
isn't a function per se, it's an object method. Object methods are functions that are part of and integrated into the object from which they come. Typically, they act on the object themselves. While our .mean
doesn't buy us much, object methods can save us a lot of time and energy.
There are also object attributes. These are variables that are contained within the namespace of the object (this is a fancy way of saying that the names of the attributes don't clash with variables in other spaces). To wit,
print(a3.size)
print(a3.shape)
size='cow'
print(a3.size)
10 (10,) 10
print( list1.count(2) ) # Count the number of times "2" appears in list1
print( dict1.keys() ) # Get all keys inside of dict1
print( dog.upper() ) # Make the string "dog" all upper case.
1 dict_keys(['dan', 1, 'gem', 'karl']) BARK
Before we get too much farther, I'd like to introduce you to another important object type: Python's datetimes.
from datetime import datetime
d1 = datetime(2008,1,18,21,54,17)
d2 = datetime(2010,9,14,12,13,30)
print(d1, d2)
print(d1.hour)
print(d1.isoformat())
delta = d2 - d1
print(delta)
print(delta.total_seconds())
2008-01-18 21:54:17 2010-09-14 12:13:30 21 2008-01-18T21:54:17 969 days, 14:19:13 83773153.0
datetime objects are wildly powerful in Python. They can be subtracted from each other. They can be stored in Numpy arrays. They can be used in plots without conversion. They solve many problems when handling dates, times, and data.
Let's expand this example by nesting a datetime object within a list object:
list4 = [d1, d2] # A list of datetimes
a = list4[-1] # What if we want to act on one item within our list? We could create a inbetween variable...
print(a.isoformat()) # ...then use that object's method as usual.
print( list4[-1].isoformat() ) # However, there's no need to waste so much space!
2010-09-14T12:13:30 2010-09-14T12:13:30
Here, we see a critical python concept: when dereferencing nested objects, we can use object methods and attributes as if we had the nested object directly. When we dereference a list item via indexing, we can treat the dereferencing statement, list4[-1]
, as if it were the original object itself- giving us access to the nested object's methods and attributes. This makes nested data structures in python incredibly powerful.
If you do not already, I urge you to embrace object oriented programming. In science, some of our work can be truly automated- we can write a script to create wide swaths of plots and do complicated calculations. However, much of it is exploratory - we get a new data file, and we need to poke and prod and play until we know what's in there. There's no point creating a new script to do that. It requires a human to play with a file at the command prompt. Let's see how that looks in a real life scenario.
Let's tie some of these concepts together with a quick example.
Download the example_imf.py module and the imf_jul2000.dat file. This is a simple ascii file of solar wind and IMF data. We're going to use a special class of object, ImfData, to read and explore this data. The source code is heavily commented, so feel free to read and explore to see how all of this was accomplished.
from example_imf import ImfData
imf = ImfData('data/imf_jul2000.dat')
print(imf)
print(imf.keys())
print(imf['bz'][:5])
print('The mean IMF Bz is {}nT'.format(imf['bz'].mean()))
print(imf['time'][:5])
imf.calc_epsilon()
print(imf)
imf.keys()
imf.plot_imf()
ImfData object of data/imf_jul2000.dat dict_keys(['time', 'bx', 'by', 'bz', 'vx', 'vy', 'vz', 'rho', 'temp']) [3.9 3.5 3.9 2.4 1.6] The mean IMF Bz is 3.865625nT [datetime.datetime(2000, 7, 15, 0, 0) datetime.datetime(2000, 7, 15, 1, 0) datetime.datetime(2000, 7, 15, 2, 0) datetime.datetime(2000, 7, 15, 3, 0) datetime.datetime(2000, 7, 15, 4, 0)] ImfData object of data/imf_jul2000.dat
So what did we do here? Let's break it down line by line:
imf
, of class ImfData
. When creating it, we handed it the name of a IMF data file to read.ImfData
is a sub-class of a dictionary- this means it works like a dictionary with new features. Each value is an array of data (e.g., magnetic field data); the key is the name of the data inside the array. We can explore the data like it's a dictionary..mean()
.Contrast this with a procedural way of doing this, or an approach that does not use dictionaries. You need many functions that know how to handle your data structure. They may be scattered in different subroutines or modules. With an OO approach, the functions are methods attached to the object- we can't lose them. The dictionary approach makes it easier to organize our data; we see that our epsilon value gets sorted directly with its peers. All of this can be easily sorted into a script that does this in an automated fashion.
I hope this very short introduction gives you the flavor of working in Python. There's so much more to explore, however. Consider checking out the following packages:
There's always so much more, but this is a start.