As a personal preference, I avoid importing all the functions of a library into the current namespace. Importing each library under a short alias keeps it clear where every name comes from.
import numpy as np
import pandas as pd
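To illustrate why the prefixed import is the safer habit, here is a minimal sketch. NumPy has its own `sum`, whose second positional argument is an axis, while the second argument of the built-in `sum` is a start value. A star import from NumPy would replace the built-in with the NumPy version without any warning. The variable names are only illustrative.

```python
import numpy as np

values = range(5)

# Built-in sum: the second argument is the start value, so this is 10 + (-1).
builtin_total = sum(values, -1)

# NumPy sum: the second argument is the axis, so this is simply 0+1+2+3+4.
numpy_total = np.sum(values, -1)

print(builtin_total, numpy_total)
```

With the `np.` prefix both functions stay available, and it is always obvious which one is being called.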
There are 3 types of data structures.
Data Structure | Dimensions |
---|---|
Series | 1-Dim |
DataFrame | 2-Dim |
Panel | 3-Dim |
We will be dealing with Series and DataFrames. We will not be handling Panel here (Panel was removed from pandas in version 0.25).
All data structures have both list-like and dict-like properties.
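As a quick sketch of what list-like and dict-like mean in practice, the example below uses a small made-up Series (the name `demo` and its values are only for illustration).

```python
import pandas as pd

demo = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# List-like behaviour: length, iteration over values, positional access.
n_items = len(demo)
as_list = list(demo)
second = demo.iloc[1]

# Dict-like behaviour: keys, membership test on labels, lookup with a default.
labels = list(demo.keys())
has_b = 'b' in demo
missing = demo.get('z', -1)

print(n_items, as_list, second, labels, has_b, missing)
```

Note that `in` tests the index labels, not the values, just as it tests the keys of a dict.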
In its simplest form, a Series can be created from a dict.
data = {'Mon':'Monday',
'Tues':'Tuesday',
'Wed':'Wednesday',
'Thurs':'Thursday',
}
s = pd.Series(data)
s
Mon         Monday
Thurs     Thursday
Tues       Tuesday
Wed      Wednesday
dtype: object
s.index
Index([u'Mon', u'Thurs', u'Tues', u'Wed'], dtype='object')
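When building a Series from a dict, an explicit `index` argument can be passed to choose which keys to keep and in what order. The sketch below uses a shortened, illustrative version of the dict; a label that is not a key of the dict ends up with a missing value.

```python
import pandas as pd

day_names = {'Mon': 'Monday', 'Tues': 'Tuesday', 'Wed': 'Wednesday'}

# index= selects and orders the dict keys. 'Fri' is not a key of the dict,
# so its value in the resulting Series is missing (NaN).
ordered = pd.Series(day_names, index=['Wed', 'Mon', 'Fri'])

print(ordered)
```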
A Series can also be created from a sequence of values and a sequence of index labels.
s = pd.Series(np.random.randint(5, 15, 7),
              index=('Mon', 'Tues', 'Wed', 'Thur', 'Fri', 'Sat', 'Sun'),
              name='Temperature')
s.index.name = "Day of the Week"
s
Day of the Week
Mon     12
Tues    12
Wed     14
Thur     5
Fri      5
Sat     14
Sun     12
Name: Temperature, dtype: int64
s['Tues']
12
'Mon' in s
True
'Son' in s
False
A Series can also be sliced by index label. Unlike a positional slice, a label slice includes its end point.
s['Thur':'Sun']
Day of the Week
Thur     5
Fri      5
Sat     14
Sun     12
Name: Temperature, dtype: int64
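One detail worth seeing side by side is how the two kinds of slices treat their end point. This sketch rebuilds the temperatures with the fixed values printed above (under the separate name `temps`, so that it does not depend on the random draw).

```python
import pandas as pd

temps = pd.Series([12, 12, 14, 5, 5, 14, 12],
                  index=['Mon', 'Tues', 'Wed', 'Thur', 'Fri', 'Sat', 'Sun'])

# Label slice: both end points are included, so 'Sat' is part of the result.
by_label = temps['Thur':'Sat']

# Positional slice: the stop position is excluded, as with a Python list.
by_position = temps.iloc[3:5]

print(list(by_label.index))
print(list(by_position.index))
```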
s.max()
14
s + 2*s #Vectorized operation
Day of the Week
Mon     36
Tues    36
Wed     42
Thur    15
Fri     15
Sat     42
Sun     36
Name: Temperature, dtype: int64
s.iloc[1] # Accessing a value by position (plain s[1] is deprecated for a label-indexed Series in recent pandas)
12
s.iloc[2:5] # Slicing the Series by position
Day of the Week
Wed     14
Thur     5
Fri      5
Name: Temperature, dtype: int64
s.iloc[:1]
Day of the Week
Mon     12
Name: Temperature, dtype: int64
s - np.random.randint(5, 15, 7)
Day of the Week
Mon      7
Tues     0
Wed      9
Thur    -1
Fri      0
Sat      0
Sun      5
Name: Temperature, dtype: int64
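Subtracting a NumPy array, as above, works element by element in positional order. When both operands are Series, pandas instead aligns them on their index labels first. A minimal sketch with two small made-up Series:

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=['Mon', 'Tues', 'Wed'])
right = pd.Series([10, 20, 30], index=['Tues', 'Wed', 'Thur'])

# Values are matched by label; labels present on only one side give NaN.
total = left + right

# The add() method accepts fill_value to treat a missing side as 0 instead.
total_filled = left.add(right, fill_value=0)

print(total)
print(total_filled)
```

Only 'Tues' and 'Wed' exist in both Series, so only those two labels get a number in `total`.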
for x in s: print(x) # iterating over values
12
12
14
5
5
14
12
for pos, value in enumerate(s): print(pos, ':', value)
0 : 12
1 : 12
2 : 14
3 : 5
4 : 5
5 : 14
6 : 12
for key, value in s.items(): print(key, ':', value) # iteritems() was removed in pandas 2.0
Mon : 12
Tues : 12
Wed : 14
Thur : 5
Fri : 5
Sat : 14
Sun : 12
A DataFrame is a two-dimensional labeled table, and probably the most used data structure in pandas. Different columns can have different data types, but all the values within one column share the same data type.
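The per-column data types can be inspected through the `dtypes` attribute. The sketch below uses a small frame with made-up values (the `humidity` column in particular is purely hypothetical).

```python
import pandas as pd

mixed = pd.DataFrame({
    'city': ['Chennai', 'Mumbai', 'Delhi'],   # text column
    'temperature': [29, 19, 5],               # integer column
    'humidity': [0.71, 0.64, 0.38],           # float column (made-up values)
})

# One dtype per column; every value inside a column shares that dtype.
column_types = mixed.dtypes
print(column_types)
```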
A DataFrame can be created from several kinds of input, for example a dict of equal-length arrays or lists.
Now let us look at the obligatory Day-Temperature example.
import datetime
base = datetime.datetime.today()
days = 20
date_list = [base - datetime.timedelta(days=x) for x in range(0, days)]
date_list = [datetime.date(x.year, x.month, x.day) for x in date_list]
date_list.reverse()
data = {'date':date_list,
'Chennai':np.random.randint(25,35,days),
'Mumbai':np.random.randint(15,25,days),
'Delhi':np.random.randint(5,15,days)}
df = pd.DataFrame(data)
type(df)
pandas.core.frame.DataFrame
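As an alternative to building the list of dates by hand, pandas can generate a date index directly with `pd.date_range`. This is only a sketch: the end date is fixed arbitrarily, the random generator is seeded so the numbers are reproducible, and the result has a `DatetimeIndex` rather than a column of `datetime.date` objects.

```python
import numpy as np
import pandas as pd

days = 20

# Twenty consecutive calendar days ending on a fixed, arbitrary date.
dates = pd.date_range(end='2014-11-21', periods=days, freq='D')

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible

frame = pd.DataFrame({'Chennai': rng.integers(25, 35, days),
                      'Mumbai': rng.integers(15, 25, days),
                      'Delhi': rng.integers(5, 15, days)},
                     index=dates)
frame.index.name = 'date'

print(frame.head())
```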
df.head()
| | Chennai | Delhi | Mumbai | date |
|---|---|---|---|---|
| 0 | 29 | 5 | 19 | 2014-11-02 |
| 1 | 33 | 11 | 24 | 2014-11-03 |
| 2 | 27 | 14 | 19 | 2014-11-04 |
| 3 | 30 | 9 | 20 | 2014-11-05 |
| 4 | 27 | 5 | 15 | 2014-11-06 |
df = df.set_index('date')
df.head()
date | Chennai | Delhi | Mumbai |
---|---|---|---|
2014-11-02 | 29 | 5 | 19 |
2014-11-03 | 33 | 11 | 24 |
2014-11-04 | 27 | 14 | 19 |
2014-11-05 | 30 | 9 | 20 |
2014-11-06 | 27 | 5 | 15 |
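Once the dates are the index, rows can be looked up by date with `.loc`. The sketch below rebuilds the first three rows of the table above under the illustrative name `sample`.

```python
import datetime
import pandas as pd

sample = pd.DataFrame(
    {'Chennai': [29, 33, 27], 'Delhi': [5, 11, 14], 'Mumbai': [19, 24, 19]},
    index=[datetime.date(2014, 11, 2),
           datetime.date(2014, 11, 3),
           datetime.date(2014, 11, 4)])
sample.index.name = 'date'

# A single label returns that row as a Series indexed by column name.
one_day = sample.loc[datetime.date(2014, 11, 3)]

# A (row label, column label) pair returns a single value.
one_value = sample.loc[datetime.date(2014, 11, 3), 'Delhi']

print(one_day)
print(one_value)
```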
df.median()
Chennai    30.0
Delhi       9.5
Mumbai     19.0
dtype: float64
df.mean()
Chennai    29.60
Delhi       9.15
Mumbai     19.20
dtype: float64
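By default these aggregations work down each column. Passing `axis=1` aggregates across the columns of each row instead. A small sketch, using the first three rows of the temperature table under the illustrative name `sample`:

```python
import pandas as pd

sample = pd.DataFrame({'Chennai': [29, 33, 27],
                       'Delhi': [5, 11, 14],
                       'Mumbai': [19, 24, 19]})

# Default (axis=0): one mean per column, i.e. per city.
per_city = sample.mean()

# axis=1: one mean per row, i.e. the average over the three cities each day.
per_row = sample.mean(axis=1)

print(per_city)
print(per_row)
```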
df.diff().head()
date | Chennai | Delhi | Mumbai |
---|---|---|---|
2014-11-02 | NaN | NaN | NaN |
2014-11-03 | 4 | 6 | 5 |
2014-11-04 | -6 | 3 | -5 |
2014-11-05 | 3 | -5 | 1 |
2014-11-06 | -3 | -4 | -5 |
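`diff()` subtracts the previous row from each row, which is why the first row has no value. The sketch below checks that it matches an explicit subtraction of the shifted frame, using two columns of the table above under the illustrative name `sample`.

```python
import pandas as pd

sample = pd.DataFrame({'Chennai': [29, 33, 27, 30, 27],
                       'Delhi': [5, 11, 14, 9, 5]})

# Row-to-row change computed by pandas.
by_diff = sample.diff()

# The same thing spelled out: each row minus the row before it.
by_shift = sample - sample.shift(1)

same = by_diff.equals(by_shift)
print(same)
```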
titanic = pd.read_csv('data/titanic.csv')
titanic = titanic.set_index('PassengerId')
titanic.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | NaN | S |
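Reading the file and setting the index can also be done in one step with the `index_col` argument of `read_csv`. To keep the sketch self-contained, it reads a tiny in-memory CSV that holds a few columns of the first three rows shown above, rather than the real `data/titanic.csv`.

```python
import io
import pandas as pd

# A tiny stand-in for the real file: three rows, five columns.
csv_text = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22
2,1,1,female,38
3,1,3,female,26
"""

# Two steps, as in the text: read, then set the index.
two_step = pd.read_csv(io.StringIO(csv_text)).set_index('PassengerId')

# One step: name the index column while reading.
one_step = pd.read_csv(io.StringIO(csv_text), index_col='PassengerId')

same = two_step.equals(one_step)
print(same)
```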
len(titanic)
891
titanic.Fare.sum()
28693.9493
titanic.Survived.value_counts()
0    549
1    342
dtype: int64
titanic.Pclass.value_counts()
3    491
1    216
2    184
dtype: int64
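Two natural follow-ups to these counts are proportions instead of raw counts, and the survival rate per class. The sketch below uses only the `Pclass` and `Survived` values of the five rows displayed by `titanic.head()`, so its numbers say nothing about the full dataset.

```python
import pandas as pd

# Pclass and Survived for the five passengers shown in the head() output.
passengers = pd.DataFrame({'Pclass': [3, 1, 3, 1, 3],
                           'Survived': [0, 1, 1, 1, 0]})

# normalize=True returns the share of each value instead of its count.
shares = passengers['Survived'].value_counts(normalize=True)

# Survived is 0/1, so its mean within each class is the survival rate.
rate_by_class = passengers.groupby('Pclass')['Survived'].mean()

print(shares)
print(rate_by_class)
```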