- Use a Python distribution
- Use IPython + notebook
- Use the Scipy stack

- It's the best thing ever

- Expressive language with easy-to-read syntax makes it easier to share and reuse code
- Large ecosystem of high-quality modules makes it easy to not reinvent the wheel

No, scratch that...

- Many popular packages for science and data analysis preinstalled

- Use *your* version and *your* packages on any campus machine
- The Simpsons (CNetS)
- FutureGrid
- Big Red II

- Default Python installs on these machines are out of date

- Fix the "Python Packaging Problem" with dependency resolution
- Anaconda uses `conda` as its package manager, instead of `pip` or `easy_install`

- Use separate environments for incompatible packages

- Freeze the requirements for a particular package or script
- Share the requirements as dependencies
- If using Anaconda, share the entire environment

**This works on campus machines**

Find the correct link at http://continuum.io/downloads then type something like the following in your command shell:

wget <copy-paste that link>
bash Anaconda-<your version>.sh

Answer yes at the prompts. At the end, the installer asks if you want to add Anaconda to your `PATH`; say yes if this will be your primary Python install.

To update all of Anaconda's packages at once:

```
conda update anaconda
```

`conda` and `pip`

Conda is preferable since it does dependency resolution:

```
conda install pymc
```

But not all packages are in the conda repositories, so this fails:

```
conda install geopy
```

In that case, use `pip`

```
pip install geopy
```

`conda` for package environments

The current version of NetworkX is 1.8.1, but suppose I have a script that depends on NetworkX 1.7 for now. This calls for package environments!

Create a new environment named "nx1.7", and link in all the Anaconda packages:

`conda create -n nx1.7 anaconda`

List all currently-defined environments:

`conda info -e`

Activate our new environment:

`source activate nx1.7`

Replace NetworkX 1.8 with 1.7:

`conda install networkx=1.7`

Deactivate our new environment, returning to the base Anaconda env:

`source deactivate nx1.7`

Environments are also a great way to have Python 2.7 and 3.3 side-by-side:

```
conda create -n py3.3 python=3.3 anaconda
```

Then, like before, I can switch to Python 3 with

```
source activate py3.3
```

and switch back with

```
source deactivate py3.3
```

- Enhanced Python shell
- Comes in console and Qt versions
- Awesome features
- Tab-completion
- Magic (special %-prefixed commands)
- Inline plots in Qt console

In [1]:

```
import random as rd
import numpy as np
```

In [2]:

```
%timeit a = [rd.random() for x in range(1000)]
```

10000 loops, best of 3: 102 µs per loop

In [3]:

```
%timeit a = [np.random.random() for x in range(1000)]
```

10000 loops, best of 3: 182 µs per loop

In [4]:

```
%timeit a = np.random.random(1000)
```

100000 loops, best of 3: 14 µs per loop

There are also cell magics for use in IPython notebook

In [15]:

```
%%time
# This is a terrible way to do this
fib = [0, 1]
for i in range(10**5):
    fib.append(fib[-1] + fib[-2])
```

CPU times: user 208 ms, sys: 24 ms, total: 232 ms Wall time: 210 ms

Run the following:

ipython qtconsole --pylab inline

then paste this in the Qt console:

In [95]:

```
from scipy.special import jn
x = linspace(0, 4*pi)
for i in range(6):
    plot(x, jn(i, x))
```

Locally, you can run

ipython notebook --pylab inline

- Evaluate and edit code by the chunk (cell) instead of by line
- Keep images/plots/calculations inline with code
- Excellent for interactive/iterative data exploration
- Use markdown instead of comments for literate programming

If you're reading this, you're seeing my slides online. It's hard for me to demo this for you, but luckily there are writeups and screencasts out there of how awesome IPython notebook is.

- Converts notebooks into a variety of formats
- LaTeX/pdf
- HTML
- **Reveal.js**

- This entire presentation was made with IPython notebook! GitHub Source
- Notebooks can then be hosted/shared with nbviewer or Wakari

Run this notebook in Wakari to see Python + JavaScript in action

Actually...

The "native" Python solution for matrices is often to use a list of lists, but this can be really awful.

For example, let's look at column-wise operations on a list of lists.

In [53]:

```
import random as rd
N = 4
lol = [[rd.random() for c in range(N)] for r in range(N)]
# OR
lol = []
for r in range(N):
    row = []
    for c in range(N):
        row.append(rd.random())
    lol.append(row)
lol
```

Out[53]:

[[0.9937229473980533, 0.995334868014186, 0.5942674962738761, 0.5154385022192677], [0.422784005229769, 0.7807343114323023, 0.09179473422846407, 0.609573372880339], [0.4972764651566518, 0.13311678867268917, 0.12249203373176598, 0.8453804747179231], [0.15205034547325147, 0.6133030575174816, 0.9485418183964225, 0.8287048466130321]]

List slicing: can we get the first column of this "matrix"?

In [55]:

```
lol[:,0]
```

Nope: that raises a `TypeError`, since lists don't support tuple indices. Try again.

How about this?

In [56]:

```
lol[:][0]
```

Out[56]:

[0.9937229473980533, 0.995334868014186, 0.5942674962738761, 0.5154385022192677]

This is actually the first *row*. Seriously, go back and check.

Surely a list comprehension can save us:

In [57]:

```
[x[0] for x in lol]
```

Out[57]:

[0.9937229473980533, 0.422784005229769, 0.4972764651566518, 0.15205034547325147]

That works, but... ugh.

Do you want to write a column-wise sum?
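If you insisted on pure Python, one workable sketch (not from the original slides) transposes the list of lists with `zip(*lol)`, so each tuple is a column:

```python
import random as rd

N = 4
lol = [[rd.random() for c in range(N)] for r in range(N)]

# zip(*lol) transposes the "matrix": iterating over it yields columns
col_sums = [sum(col) for col in zip(*lol)]
print(col_sums)
```

It works, but you have to *know* the transpose trick; nothing about `zip(*lol)` says "column-wise sum."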

In [62]:

```
import numpy as np
npm = np.random.random((N, N))
npm
```

Out[62]:

array([[ 0.15602394, 0.98540191, 0.13926817, 0.23745073], [ 0.51627372, 0.53242708, 0.27257836, 0.30676598], [ 0.04349275, 0.16361163, 0.3415986 , 0.49103593], [ 0.33081051, 0.99886844, 0.76993015, 0.55732118]])

In [63]:

```
npm[:,0]
```

Out[63]:

array([ 0.15602394, 0.51627372, 0.04349275, 0.33081051])

Nice. And that column-wise sum?

In [67]:

```
npm.sum(0)
```

Out[67]:

array([ 1.04660092, 2.68030907, 1.52337528, 1.59257382])

Like a boss.
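One detail worth keeping straight: the argument to `sum` is the axis that gets *collapsed*. `sum(0)` collapses rows and returns column sums; `sum(1)` collapses columns and returns row sums. A tiny check with a known matrix:

```python
import numpy as np

m = np.array([[1, 2],
              [3, 4]])

print(m.sum(0))  # column sums: [4 6]
print(m.sum(1))  # row sums:    [3 7]
```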

Uses SciPy's image convolution for a really fast, really elegant Game of Life implementation.

In [98]:

```
import numpy
import scipy.ndimage
import Image

class Life(object):
    def __init__(self, n, p=0.5, mode='wrap'):
        self.n = n
        self.mode = mode
        self.array = numpy.uint8(numpy.random.random((n, n)) < p)
        self.weights = numpy.array([[1, 1, 1],
                                    [1, 10, 1],
                                    [1, 1, 1]], dtype=numpy.uint8)

    def step(self):
        con = scipy.ndimage.filters.convolve(self.array,
                                             self.weights,
                                             mode=self.mode)
        boolean = (con == 3) | (con == 12) | (con == 13)
        self.array = numpy.int8(boolean)

    def run(self, N):
        for _ in range(N):
            self.step()

    def draw(self, scale):
        im = Image.fromarray(numpy.uint8(self.array) * 255)
        z = int(scale * self.n)
        return im.resize((z, z))

In [99]:

```
l = Life(50)
imshow(l.draw(15))
```

Out[99]:

<matplotlib.image.AxesImage at 0x547cf10>

In [100]:

```
l.run(10)
imshow(l.draw(15))
```

Out[100]:

<matplotlib.image.AxesImage at 0x55d0210>
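A quick sanity check on the update rule: a "blinker" (three live cells in a row) should oscillate with period 2. This sketch reuses the same convolution trick directly, without the class or the PIL drawing:

```python
import numpy as np
from scipy.ndimage import convolve

weights = np.array([[1, 1, 1],
                    [1, 10, 1],
                    [1, 1, 1]], dtype=np.uint8)

def step(grid):
    # The self-weight of 10 encodes liveness in the convolution result:
    # 3 -> dead cell with 3 neighbors (birth);
    # 12 or 13 -> live cell with 2 or 3 neighbors (survival)
    con = convolve(grid, weights, mode='wrap')
    return np.uint8((con == 3) | (con == 12) | (con == 13))

grid = np.zeros((5, 5), dtype=np.uint8)
grid[2, 1:4] = 1             # horizontal blinker
after_one = step(grid)       # should be a vertical blinker
after_two = step(after_one)  # should be the original again
print((after_two == grid).all())
```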

In [2]:

```
from scipy import stats
import numpy as np
x = np.random.random(10)
y = np.random.random(10)
slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)
print "r-squared:", r_value**2
```

r-squared: 0.109819072488

We'll come back to this in a minute.

From before, we have the following:

In [ ]:

```
from scipy import stats
import numpy as np
x = np.random.random(10)
y = np.random.random(10)
slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)
print "r-squared:", r_value**2
```

In [48]:

```
import matplotlib.pyplot as plt
sizes = 1000* np.random.random(10)
colors = np.random.random(10)
fit_x = np.linspace(0,1,100)
fit_y = slope * fit_x + intercept
plt.scatter(x,y, sizes, colors, alpha=0.5)
plt.plot(fit_x, fit_y, '--r')
plt.title("Fit line to random junk", fontsize=16)
plt.show()
```

**TASK**: Calculate how many responses Kinsey Reporter received, with *weekly* resolution, for a given tag.

Reports come in, and each report has one or more tags associated with it.

First, we need to do a SQL query to get all the reports associated with a specific tag (don't worry if you're not familiar with SQL)

```
SELECT `option`, DATE(timestamp) as datestamp, count(1) as num_answers
FROM survey_event, survey_answer, survey_option
WHERE event_id = survey_event.id
  AND option_id = survey_option.id
GROUP BY datestamp
```

The results come to Python looking like this:

```
[
...
('smile flirt', datetime.date(2013,5,23), 12),
('smile flirt', datetime.date(2013,5,24), 9),
...
]
```

To calculate the weekly timeline with native Python would be... uncomfortable. One must:

- Fill in any missing days with zeros
- Sort the list by date, go through and sum up every set of seven counts
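A sketch of that manual bookkeeping, on hypothetical rows shaped like the query results above (tag, date, count):

```python
import datetime

# Hypothetical rows shaped like the SQL results above
rows = [
    ('smile flirt', datetime.date(2013, 5, 23), 12),
    ('smile flirt', datetime.date(2013, 5, 24), 9),
    ('smile flirt', datetime.date(2013, 5, 31), 4),
]

counts = {d: n for _, d, n in rows}
start, end = min(counts), max(counts)

# Fill in missing days with zeros, then sum every run of seven days
days = [start + datetime.timedelta(days=i) for i in range((end - start).days + 1)]
daily = [counts.get(d, 0) for d in days]
weekly = [sum(daily[i:i + 7]) for i in range(0, len(daily), 7)]
print(weekly)
```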

With Pandas, it's as simple as:

```
import pandas as pd
df = pd.DataFrame.from_records(kr_rows, columns=['option', 'datestamp', 'num_answers'], index='datestamp')
timeline = df['num_answers'].asfreq('D').resample('W', how=sum).fillna(0)
```
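Here is a runnable version of the same idea on made-up data. Note the slide's `how=` argument belongs to the pandas of that era; newer pandas spells it `.resample('W').sum()`, and the resample itself handles the missing days:

```python
import datetime
import pandas as pd

# Made-up rows shaped like the query results: (tag, date, count)
kr_rows = [
    ('smile flirt', datetime.date(2013, 5, 23), 12),
    ('smile flirt', datetime.date(2013, 5, 24), 9),
    ('smile flirt', datetime.date(2013, 5, 31), 4),
]

df = pd.DataFrame.from_records(
    kr_rows, columns=['option', 'datestamp', 'num_answers'], index='datestamp')
df.index = pd.to_datetime(df.index)

# Weekly totals for the tag; days with no reports contribute zero
timeline = df['num_answers'].resample('W').sum()
print(timeline)
```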

In addition to making time series easier, Pandas makes plotting a snap too.

This is a notebook where I slice and dice some automobile gas mileage data: https://www.wakari.io/sharing/bundle/clayadavis/mpg

Key concepts used in the MPG analysis:

- `DataFrame.groupby()`
- `Series.describe()`
- `Series.plot()`
- `DataFrame.boxplot()`
- `DataFrame.hist()`
- `Series.rolling_mean()`

In [93]:

```
import sympy
sympy.init_printing(use_latex=True)  # or use_unicode=True in a console
x, y = sympy.symbols('x y')
```

We differentiate and get an answer...

In [74]:

```
sympy.diff(sympy.exp(x**2), x)
```

Out[74]:

$$2 x e^{x^{2}}$$

...or we can create an unevaluated expression for further manipulation...

In [75]:

```
my_deriv = sympy.Derivative(sympy.exp(x**2), x, x)
my_deriv
```

Out[75]:

$$\frac{d^{2}}{d x^{2}} e^{x^{2}}$$

...which we can evaluate later.

In [76]:

```
my_deriv.doit()
```

Out[76]:

$$2 \left(2 x^{2} + 1\right) e^{x^{2}}$$

In [82]:

```
sympy.Integral(sympy.exp(-x**2 - y**2), x, y)
```

Out[82]:

$$\iint e^{- x^{2} - y^{2}}\, dx\, dy$$

In [84]:

```
from sympy import oo
sympy.integrate(sympy.exp(-x**2 - y**2), (x, -oo, oo), (y, -oo, oo))
```

Out[84]:

$$\pi$$

In [83]:

```
sympy.solve([x*y - 7, x + y - 6], [x, y])
```

Out[83]:

$$\begin{bmatrix}\begin{pmatrix}- \sqrt{2} + 3, & \sqrt{2} + 3\end{pmatrix}, & \begin{pmatrix}\sqrt{2} + 3, & - \sqrt{2} + 3\end{pmatrix}\end{bmatrix}$$
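Both solution pairs can be verified by substitution; a quick sketch:

```python
import sympy

x, y = sympy.symbols('x y')
solutions = sympy.solve([x*y - 7, x + y - 6], [x, y])

for xv, yv in solutions:
    # Each pair should satisfy x*y = 7 and x + y = 6 exactly
    assert sympy.simplify(xv*yv - 7) == 0
    assert sympy.simplify(xv + yv - 6) == 0
```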

- Use it to replace Wolfram Alpha **PRO**

- As a community, we've focused on creating awesome tools for data analysis
- Less attention has been paid to sharing workflows once created
- IPython notebook is a huge step in the right direction

- GUIs are cool: they lower the barrier to entry

Some truths about GUIs

- GUIs suck to program
- For 90% of use cases, a web GUI on a modern browser is as good as native
- 90% of (data) scientists want the same thing from a GUI:
- Give me knobs to twiddle
- Let me see how it affects the output

- A framework for making webapps with Python at their core
- GUI elements defined in HTML or Enaml
- No JavaScript needed

- (almost) free software
- Backed by Continuum Analytics

Why it rocks:

- Uses familiar Python libraries on the backend
- Pandas
- Numpy
- Matplotlib

- Allows rapid development of web applications from existing Python code
- Moves beyond "open data" -- share both the data and a framework to analyze it
- Useful at every stage of the research process
- Exploring data and forming hypotheses
- Eliciting collaboration and feedback from peers
- Enabling wide dialog and evaluation of the finished product

Rule 1: "Premature optimization is the root of all evil" (Knuth)

Rule 2: Post-hoc optimization is fucking rad

- My time is more valuable than CPU time.
- Optimization is only useful when it lets me do something otherwise impossible with the resources I have.