*Pandas* is a very strong library for manipulating large and complex datasets using a new data structure, the **data frame**, which models a table of data.
Pandas helps to close the gap between Python and R for data analysis and statistical computing.

Pandas data frames address three deficiencies of NumPy arrays:

- data frame hold heterogenous data; each column can have its own numpy.dtype,
- the axes of a data frame are labeled with column names and row indices,
- and, they account for missing values which this is not directly supported by arrays.

Data frames are extremely useful for data manipulation. They provide a large range of operations such as filter, join, and group-by aggregation, as well as plotting.

In [1]:

```
import numpy as np
import pandas as pd
print('Pandas version:', pd.__version__)
```

We will analyze animal life-history data from AnAge.

In [2]:

```
data = pd.read_csv('../data/anage_data.txt', sep='\t') # lots of other pd.read_... functions
print(type(data))
print(data.shape)
```

Pandas holds data in `DataFrame`

(similar to *R*).
`DataFrame`

have a single row per observation (in contrast to the previous exercise in which each table cell was one observation), and each column has a single variable. Variables can be numbers or strings.

The `head`

method gives us the 5 first rows of the data frame.

In [3]:

```
data.head()
```

Out[3]:

`DataFrame`

has many of the features of `numpy.ndarray`

- it also has a `shape`

and various statistical methods (`max`

, `mean`

etc.).
However, `DataFrame`

allows richer indexing.
For example, let's browse our data for species that have body mass greater than 300 kg.
First we will a create new column (`Series`

object) that tells us if a row is a large animal row or not:

In [4]:

```
large_index = data['Body mass (g)'] > 300 * 1000 # 300 kg
large_index.head()
```

Out[4]:

Now, we slice our data with this boolean index.
The `iterrows`

method let's us iterate over the rows of the data.
For each row we get both the row as a `Series`

object (similar to `dict`

for our use) and the row number as an `int`

(this is similar to the use of `enumerate`

on lists and strings).

In [5]:

```
large_data = data[large_index]
for i, row in large_data.iterrows():
print(row['Common name'], row['Body mass (g)']/1000, 'kg')
```

So... a Dromedary is the single-humped camel.

Let's continue with small and medium animals - we filter out anything that doesn't have body mass of less than 300 kg.

In [6]:

```
data = data[data['Body mass (g)'] < 300 * 1000]
```

For starters, let's plot a scatter of body mass vs. metabolic rate.
Because we work with pandas, we can do that with the `plot`

method of `DataFrame`

, specifying the columns for `x`

and `y`

and a plotting style (without the style we would get a line plot which makes no sense here).

You can change `%matplotlib inline`

to `%matplotlib widget`

to get interactive plotting -- if this causes errors, just stay with `inline`

, as the `widget`

feature is new and may require to update some packages.

In [7]:

```
%matplotlib inline
import matplotlib.pyplot as plt
```

In [8]:

```
data.plot.scatter(x='Body mass (g)', y='Metabolic rate (W)', legend=False)
plt.ylabel('Metabolic rate (W)');
```

If this plot looks funny, you are probably using Pandas with version <0.22; the bug was reported and fixed in version 0.22.

From this plot it seems that

- there is a correlation between body mass and metabolic rate, and
- there are many small animals (less than 30 kg) and not many medium animals (between 50 and 300 kg).

Before we continue, I prefer to have mass in kg, let's add a new column:

In [9]:

```
data['Body mass (kg)'] = data['Body mass (g)'] / 1000
```

Next, let's check how many records do we have for each Class (as in the taxonomic unit):

In [10]:

```
class_counts = data['Class'].value_counts()
print(class_counts)
```

In [12]:

```
# plt.figure() # only required if you used %matplotlib widget
class_counts.plot.bar()
plt.ylabel('Num. of species');
```

So we have lots of mammals and birds, and a few reptiles and amphibians. This is important as amphibian and reptiles could have a different replationship between mass and metabolism because they are cold blooded.

1) **Print the number** of reptiles are in this dataset, and how many of them are of the genus `Python`

.

**Reminder**

- Edit cell by double clicking
- Run cell by pressing
*Shift+Enter* - Get autocompletion by pressing
*Tab* - Get documentation by pressing
*Shift+Tab*

In [28]:

```
```

In [29]:

```
print("# of reptiles: ", reptiles)
print("# of pythons: ", pythons)
```

2) **Plot the histogram of the mammal body masses** using `plot.hist()`

.
Since most mammals are small, the histogram looks better if we plot a cumulative distribution rather then the distribution - we can do this with the `cumulative`

argument. You also need to specify a higher `bins`

argument then the default.

In [ ]:

```
```

In [34]:

```
```

Let's do a simple linear regression plot; but let's do it in separate for each Class. We can do this kind of thing with Matplotlib and SciPy, but a very good tool for statistical visualizations is **Seaborn**.

Seaborn adds on top of Pandas a set of sophisticated statistical visualizations, similar to ggplot2 for R.

In [13]:

```
import seaborn as sns
sns.set_context("talk")
```

In [14]:

```
sns.lmplot(
x='Body mass (kg)',
y='Metabolic rate (W)',
hue='Class',
data=data,
ci=False,
);
```

`hue`

means*color*, but it also causes*seaborn*to fit a different linear model to each of the Classes.`ci`

controls the confidence intervals. I chose`False`

, but setting it to`True`

will show them.

We can see that mammals and birds have a clear correlation between size and metabolism and that it extends over a nice range of mass, so let's stick to mammals; next up we will see which orders of mammals we have.

In [15]:

```
mammalia = data[data.Class=='Mammalia']
order_counts = mammalia.Order.value_counts()
ax = order_counts.plot.barh()
ax.set(
xlabel='Num. of species',
ylabel='Mammalia order'
)
ax.figure.set_figheight(7)
```

You see we have alot of rodents and carnivores, but also a good number of bats (*Chiroptera*) and primates.

Let's continue with orders that have at least 20 species - this also includes some cool marsupials like Kangaroo, Koala and Taz (Diprotodontia and Dasyuromorphia)

In [16]:

```
orders = order_counts[order_counts >= 20]
print(orders)
abund_mammalia = mammalia[mammalia.Order.isin(orders.index)]
```

In [17]:

```
sns.lmplot(
x='Body mass (kg)',
y='Metabolic rate (W)',
hue='Order',
data=abund_mammalia,
ci=False,
height=8,
aspect=1.3,
line_kws={'lw':2, 'ls':'--'},
scatter_kws={'s':50, 'alpha':0.5}
);
# if you get an error about height not being a keyword, change it to size or update seaborn: conda update seaborn
```

Because there is alot of data here I made the lines thinner - this can be done by giving *matplotlib* keywords as a dictionary to the argument `line_kws`

- and I made the markers bigger but with alpha (transperancy) 0.5 using the `scatter_kws`

argument.

Still ,there's too much data, and part of the problem is that some orders are large (e.g. primates) and some are small (e.g. rodents).

Let's plot a separate regression plot for each order.
We do this using the `col`

and `row`

arguments of `lmplot`

, but in general this can be done for any plot using seaborn's `FacetGrid`

function.

In [18]:

```
sns.lmplot(
x='Body mass (kg)',
y='Metabolic rate (W)',
data=abund_mammalia,
hue='Order',
col='Order',
col_wrap=3,
ci=None,
scatter_kws={'s':40},
sharex=False,
sharey=False
);
```

We used the `sharex=False`

and `sharey=False`

arguments so that each Order will have a different axis range and so the data is will spread nicely.

Lastly, let's do some quick statistics.

First, calculate a summary of the the mammals using `describe`

.

In [19]:

```
mass = abund_mammalia
mass.describe()
```

Out[19]:

Now lets check if we can significantly say that the body mass of rodents is lower than that of carnivores.

**Plot boxplots of the mammals body mass** using Seaborn, which is easier to use (and also makes nicer boxplots) then standard matplotlib boxplot.

In [ ]:

```
```

In [82]:

```
```

Now, we'll use a t-test (implemented in the `scipy.stats`

module) to test the hypothesis that there is *no difference* in body mass between rodents and carnivores.

`ttest_ind`

calculates the t-test for the means of*two independent*samples of scores.`scipy.stats`

has many more statistical tests, distributions, etc.

In [20]:

```
from scipy.stats import ttest_ind
```

In [21]:

```
carnivora_mass = abund_mammalia.loc[abund_mammalia['Order']=='Carnivora', 'Body mass (kg)']
rodentia_mass = abund_mammalia.loc[abund_mammalia['Order']=='Rodentia', 'Body mass (kg)']
res = ttest_ind(carnivora_mass, rodentia_mass, equal_var=False)
print("P-value of t-test: {:.2g}".format(res.pvalue))
```

- Examples: Seaborn example gallery
- Slides: Statistical inference with Python by Allen Downey
- Book: Think Stats by Allen Downey - statistics with Python. Free Ebook.
- Blog post: A modern guide to getting started with Data Science and Python
- Tutorial: An Introduction to Pandas

This notebook was written by Yoav Ram.

The notebook was written using Python 3.7. Dependencies listed in environment.yml.

This work is licensed under a CC BY-NC-SA 4.0 International License.