No, scratch that...
Anaconda uses conda as its package manager, instead of pip or easy_install.
This works on campus machines
Find the correct link at http://continuum.io/downloads then type something like the following in your command shell:
wget <copy-paste that link>
bash Anaconda-<your version>.sh
Answer yes at the prompts. At the end, the installer asks if you want to add Anaconda to your PATH; say yes if this will be your primary Python install.
To update all of Anaconda's packages at once:
conda update anaconda
conda and pip
Conda is preferable since it does dependency resolution:
conda install pymc
But not all packages are in the conda repositories, so this fails:
conda install geopy
In that case, use pip
pip install geopy
conda for package environments
The current version of NetworkX is 1.8.1, but suppose I have a script that depends on NetworkX 1.7 for now. This calls for package environments!
Create a new environment named "nx1.7", and link in all the Anaconda packages:
conda create -n nx1.7 anaconda
List all currently-defined environments:
conda info -e
Activate our new environment:
source activate nx1.7
Replace NetworkX 1.8 with 1.7:
conda install networkx=1.7
Deactivate our new environment, returning to the base Anaconda env:
source deactivate nx1.7
Environments are also a great way to have Python 2.7 and 3.3 side-by-side:
conda create -n py3.3 python=3.3 anaconda
Then, like before, I can just switch to py3.3 with a
source activate py3.3
and switch back with
source deactivate py3.3
import random as rd
import numpy as np
%timeit a = [rd.random() for x in range(1000)]
10000 loops, best of 3: 102 µs per loop
%timeit a = [np.random.random() for x in range(1000)]
10000 loops, best of 3: 182 µs per loop
%timeit a = np.random.random(1000)
100000 loops, best of 3: 14 µs per loop
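The same comparison works outside IPython with the stdlib timeit module; a sketch (the exact ratio varies by machine):

```python
import timeit

# Time 1000 runs of the pure-Python list comprehension...
loop = timeit.timeit('[rd.random() for _ in range(1000)]',
                     setup='import random as rd', number=1000)
# ...versus 1000 runs of the single vectorized NumPy call
vec = timeit.timeit('np.random.random(1000)',
                    setup='import numpy as np', number=1000)
print('speedup: %.1fx' % (loop / vec))
```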
There are also cell magics for use in the IPython notebook:
%%time
# This is a terrible way to do this
fib = [0, 1]
for i in range(10**5):
    fib.append(fib[-1] + fib[-2])
CPU times: user 208 ms, sys: 24 ms, total: 232 ms
Wall time: 210 ms
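Outside the notebook, the stdlib timeit module can time the same (still terrible) loop; a sketch:

```python
import timeit

def build_fib(n=10**5):
    # Same loop as above: appending ever-larger bignums
    fib = [0, 1]
    for _ in range(n):
        fib.append(fib[-1] + fib[-2])
    return fib

print('%.3f s' % timeit.timeit(build_fib, number=1))
```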
Run the following:
ipython qtconsole --pylab inline
then paste this in the Qt console:
from scipy.special import jn
x = linspace(0, 4*pi)
for i in range(6):
plot(x, jn(i, x))
Locally, you can run
ipython notebook --pylab inline
If you're reading this, you're seeing my slides online. It's hard for me to demo this for you, but luckily there are writeups and screencasts out there of how awesome IPython notebook is.
Run this notebook in Wakari to see Python + JavaScript in action
Actually...
The "native" Python solution for matrices is often to use a list of lists, but this can be really awful.
For example, let's look at column-wise operations on a list of lists.
import random as rd
N = 4
lol = [[rd.random() for c in range(N)] for r in range(N)]
# OR
lol = []
for r in range(N):
    row = []
    for c in range(N):
        row.append(rd.random())
    lol.append(row)
lol
[[0.9937229473980533, 0.995334868014186, 0.5942674962738761, 0.5154385022192677], [0.422784005229769, 0.7807343114323023, 0.09179473422846407, 0.609573372880339], [0.4972764651566518, 0.13311678867268917, 0.12249203373176598, 0.8453804747179231], [0.15205034547325147, 0.6133030575174816, 0.9485418183964225, 0.8287048466130321]]
List slicing: Can we get the first column of this "matrix"?
lol[:,0]
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-55-fb5f0d5bbc73> in <module>() ----> 1 lol[:,0] TypeError: list indices must be integers, not tuple
Try again.
How about this?
lol[:][0]
[0.9937229473980533, 0.995334868014186, 0.5942674962738761, 0.5154385022192677]
This is actually the first row. Seriously, go back and check.
Surely a list comprehension can save us:
[x[0] for x in lol]
[0.9937229473980533, 0.422784005229769, 0.4972764651566518, 0.15205034547325147]
That works, but... ugh.
Do you want to write a column-wise sum?
This isn't FORTRAN. I shouldn't have to think about which way my matrix is laid out, row- or column-major. Try again with Numpy:
import numpy as np
npm = np.random.random((N, N))
npm
array([[ 0.15602394, 0.98540191, 0.13926817, 0.23745073], [ 0.51627372, 0.53242708, 0.27257836, 0.30676598], [ 0.04349275, 0.16361163, 0.3415986 , 0.49103593], [ 0.33081051, 0.99886844, 0.76993015, 0.55732118]])
npm[:,0]
array([ 0.15602394, 0.51627372, 0.04349275, 0.33081051])
Nice. And that column-wise sum?
npm.sum(0)
array([ 1.04660092, 2.68030907, 1.52337528, 1.59257382])
Like a boss.
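For comparison, the pure-Python column sum needs an explicit transpose via zip; a sketch, checked against NumPy's axis-0 sum:

```python
import random as rd
import numpy as np

N = 4
lol = [[rd.random() for c in range(N)] for r in range(N)]

# zip(*lol) transposes the list of lists, yielding columns
col_sums = [sum(col) for col in zip(*lol)]

npm = np.array(lol)
assert np.allclose(npm.sum(0), col_sums)
```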
Uses SciPy's image convolution for a really fast, really elegant Game of Life implementation.
import numpy
import scipy.ndimage
from PIL import Image

class Life(object):
    def __init__(self, n, p=0.5, mode='wrap'):
        self.n = n
        self.mode = mode
        self.array = numpy.uint8(numpy.random.random((n, n)) < p)
        self.weights = numpy.array([[1,  1, 1],
                                    [1, 10, 1],
                                    [1,  1, 1]], dtype=numpy.uint8)

    def step(self):
        con = scipy.ndimage.filters.convolve(self.array,
                                             self.weights,
                                             mode=self.mode)
        boolean = (con == 3) | (con == 12) | (con == 13)
        self.array = numpy.int8(boolean)

    def run(self, N):
        for _ in range(N):
            self.step()

    def draw(self, scale):
        im = Image.fromarray(numpy.uint8(self.array) * 255)
        z = int(scale * self.n)
        return im.resize((z, z))
l = Life(50)
imshow(l.draw(15))
<matplotlib.image.AxesImage at 0x547cf10>
l.run(10)
imshow(l.draw(15))
<matplotlib.image.AxesImage at 0x55d0210>
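A quick sanity check on the rule, no matplotlib required: a vertical blinker should oscillate with period 2 under the same convolution trick. A sketch:

```python
import numpy as np
from scipy.ndimage import convolve

# Same kernel as Life: each neighbor counts 1, the center cell counts 10
weights = np.array([[1,  1, 1],
                    [1, 10, 1],
                    [1,  1, 1]], dtype=np.uint8)

def step(board):
    con = convolve(board, weights, mode='wrap')
    # 3: dead cell with 3 neighbors (birth); 12/13: live cell with 2/3 neighbors (survival)
    return np.uint8((con == 3) | (con == 12) | (con == 13))

board = np.zeros((5, 5), dtype=np.uint8)
board[1:4, 2] = 1                       # vertical blinker
assert (step(step(board)) == board).all()   # period 2
```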
from scipy import stats
import numpy as np
x = np.random.random(10)
y = np.random.random(10)
slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)
print "r-squared:", r_value**2
r-squared: 0.109819072488
We'll come back to this in a minute.
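As a cross-check, linregress's slope and intercept should match a degree-1 least-squares polyfit; a sketch, seeded so the numbers are reproducible:

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(42)   # fixed seed, purely for reproducibility
x = rng.random_sample(10)
y = rng.random_sample(10)

slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
coeffs = np.polyfit(x, y, 1)      # [slope, intercept], highest degree first
assert np.allclose([slope, intercept], coeffs)
```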
From before, we have the following:
from scipy import stats
import numpy as np
x = np.random.random(10)
y = np.random.random(10)
slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)
print "r-squared:", r_value**2
import matplotlib.pyplot as plt
sizes = 1000 * np.random.random(10)
colors = np.random.random(10)
fit_x = np.linspace(0, 1, 100)
fit_y = slope * fit_x + intercept
plt.scatter(x,y, sizes, colors, alpha=0.5)
plt.plot(fit_x, fit_y, '--r')
plt.title("Fit line to random junk", fontsize=16)
plt.show()
TASK: Calculate how many responses Kinsey Reporter received, with weekly resolution, for a given tag.
Reports come in, and each report has one or more tags associated with it.
First, we need to do a SQL query to get all the reports associated with a specific tag (don't worry if you're not familiar with SQL):
SELECT `option`,
DATE(timestamp) as datestamp,
count(1) as num_answers
FROM survey_event, survey_answer, survey_option
WHERE event_id = survey_event.id
AND option_id = survey_option.id
GROUP BY `option`, datestamp
The results come to Python looking like this:
[
...
('smile flirt', datetime.date(2013,5,23), 12),
('smile flirt', datetime.date(2013,5,24), 9),
...
]
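To see the query's shape in action, here is a sketch against an in-memory SQLite database, with a hypothetical minimal version of the schema (table and column names assumed from the query above; the real Kinsey Reporter schema surely has more columns):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
c = conn.cursor()

# Hypothetical minimal schema mirroring the table names in the query
c.execute('CREATE TABLE survey_event (id INTEGER PRIMARY KEY, timestamp TEXT)')
c.execute('CREATE TABLE survey_option (id INTEGER PRIMARY KEY, option TEXT)')
c.execute('CREATE TABLE survey_answer (event_id INTEGER, option_id INTEGER)')

c.execute("INSERT INTO survey_option VALUES (1, 'smile flirt')")
events = [(1, '2013-05-23 10:00:00'),
          (2, '2013-05-23 11:00:00'),
          (3, '2013-05-24 09:00:00')]
c.executemany('INSERT INTO survey_event VALUES (?, ?)', events)
c.executemany('INSERT INTO survey_answer VALUES (?, 1)', [(1,), (2,), (3,)])

rows = c.execute('''
    SELECT option, DATE(timestamp) AS datestamp, COUNT(1) AS num_answers
    FROM survey_event, survey_answer, survey_option
    WHERE event_id = survey_event.id
      AND option_id = survey_option.id
    GROUP BY option, datestamp''').fetchall()
print(rows)
```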
To calculate the weekly timeline with native Python would be... uncomfortable. One must:
parse each row's date
bucket the dates into weeks
sum the counts within each bucket
fill empty weeks with zeroes
With Pandas, it's as simple as:
import pandas as pd
df = pd.DataFrame.from_records(kr_rows, columns=['option', 'datestamp', 'num_answers'], index='datestamp')
timeline = df['num_answers'].asfreq('D').resample('W', how=sum).fillna(0)
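In newer pandas the `how=` argument is gone; the same timeline in the modern spelling, run here on hypothetical rows shaped like the SQL results (the real `kr_rows` come from the query above):

```python
import datetime
import pandas as pd

# Hypothetical rows shaped like the SQL results above
kr_rows = [('smile flirt', datetime.date(2013, 5, 23), 12),
           ('smile flirt', datetime.date(2013, 5, 24), 9),
           ('smile flirt', datetime.date(2013, 6, 2), 4)]

df = pd.DataFrame.from_records(
    kr_rows, columns=['option', 'datestamp', 'num_answers'])
df['datestamp'] = pd.to_datetime(df['datestamp'])
df = df.set_index('datestamp')

# resample('W').sum() replaces the old resample('W', how=sum)
timeline = df['num_answers'].resample('W').sum()
print(timeline)
```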
In addition to making time series easier, Pandas makes plotting a snap too.
This is a notebook where I slice and dice some automobile gas mileage data: https://www.wakari.io/sharing/bundle/clayadavis/mpg
Key concepts used in the MPG analysis:
DataFrame.groupby()
Series.describe()
Series.plot()
DataFrame.boxplot()
DataFrame.hist()
pandas.rolling_mean()
import sympy
sympy.init_printing(use_latex=True)  # or use_unicode=True in a console
x, y = sympy.symbols('x y')
We differentiate and get an answer...
sympy.diff(sympy.exp(x**2), x)
...or we can create an unevaluated expression for further manipulation...
my_deriv = sympy.Derivative(sympy.exp(x**2), x, x)
my_deriv
...which we can evaluate later.
my_deriv.doit()
sympy.Integral(sympy.exp(-x**2 - y**2), x, y)
from sympy import oo
sympy.integrate(sympy.exp(-x**2 - y**2), (x, -oo, oo), (y, -oo, oo))
sympy.solve([x*y - 7, x + y - 6], [x, y])
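The solutions can be checked by substituting them back into both equations; a quick sketch:

```python
import sympy

x, y = sympy.symbols('x y')
solutions = sympy.solve([x*y - 7, x + y - 6], [x, y])

# Each pair should satisfy x*y = 7 and x + y = 6 exactly
for xv, yv in solutions:
    assert sympy.simplify(xv*yv - 7) == 0
    assert sympy.simplify(xv + yv - 6) == 0
```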
Some truths about GUIs
Why it rocks:
Rule 1: "Premature optimization is the root of all evil" (Knuth)
Rule 2: Post-hoc optimization is fucking rad