Chapter 3¶

Data Wrangling¶

3.0 Introduction¶

Data wrangling is a broad term used, often informally, to describe the process of transforming raw data to a clean and organized format ready for use.

The most common data structure used to "wrangle" data is the data frame, which can be both intuitive and incredibly versatile. Data frames are tabular, meaning that htey are based on rows and columns like you'd find in a spreadsheet

3.1 Creating a Data Frame¶

In [2]:

import pandas as pd
dataframe = pd.DataFrame()

3.2 Describing the Data¶

In [6]:

url = "http://tinyurl.com/titanic-csv"
df = pd.read_csv(url)
# show first two rows
print(df.head(2)) # also try tail(2) for last two rows

# show dimensions
print("Dimensions: {}".format(df.shape))

# show statistics
df.describe()

                           Name PClass   Age     Sex  Survived  SexCode
0  Allen, Miss Elisabeth Walton    1st  29.0  female         1        1
1   Allison, Miss Helen Loraine    1st   2.0  female         0        1
Dimensions: (1313, 6)

Out[6]:

	Age	Survived	SexCode
count	756.000000	1313.000000	1313.000000
mean	30.397989	0.342727	0.351866
std	14.259049	0.474802	0.477734
min	0.170000	0.000000	0.000000
25%	21.000000	0.000000	0.000000
50%	28.000000	0.000000	0.000000
75%	39.000000	1.000000	1.000000
max	71.000000	1.000000	1.000000

3.3 Navigating DataFrames¶

In [10]:

# select the first row
print(df.iloc[0])

# select three rows
print(df.iloc[1:4])

# all rows up to and including the fourth row
print(df.iloc[:4])

Name        Allen, Miss Elisabeth Walton
PClass                               1st
Age                                   29
Sex                               female
Survived                               1
SexCode                                1
Name: 0, dtype: object
                                            Name PClass   Age     Sex  \
1                    Allison, Miss Helen Loraine    1st   2.0  female   
2            Allison, Mr Hudson Joshua Creighton    1st  30.0    male   
3  Allison, Mrs Hudson JC (Bessie Waldo Daniels)    1st  25.0  female   

   Survived  SexCode  
1         0        1  
2         0        0  
3         0        1  
                                            Name PClass   Age     Sex  \
0                   Allen, Miss Elisabeth Walton    1st  29.0  female   
1                    Allison, Miss Helen Loraine    1st   2.0  female   
2            Allison, Mr Hudson Joshua Creighton    1st  30.0    male   
3  Allison, Mrs Hudson JC (Bessie Waldo Daniels)    1st  25.0  female   

   Survived  SexCode  
0         1        1  
1         0        1  
2         0        0  
3         0        1

DataFrames do not need to be numerically indexed. We can set the index of a DataFrame to any value where the value is unique to each row. For example, we can set the index to be passenger names and then select rows using a name:

In [13]:

# set index
df = df.set_index(df['Name'])

# show row
df.loc['Allen, Miss Elisabeth Walton']

Out[13]:

Name        Allen, Miss Elisabeth Walton
PClass                               1st
Age                                   29
Sex                               female
Survived                               1
SexCode                                1
Name: Allen, Miss Elisabeth Walton, dtype: object

Discussion¶

To select individual rows and slices of rows, pandas provides two methods:

loc is useful when the index of the DataFrame is a label (a string)
iloc works by looking for the position in the DataFrame. For exmaple, iloc[0] will return the first row regardless of whether the index is an integer or a label

3.4 Selecting Rows Based on Conditionals¶

In [14]:

# select top two rows where column 'sex' is 'female'
df[df['Sex'] == 'female'].head(2)

Out[14]:

	Name	PClass	Age	Sex	Survived	SexCode
Name
Allen, Miss Elisabeth Walton	Allen, Miss Elisabeth Walton	1st	29.0	female	1	1
Allison, Miss Helen Loraine	Allison, Miss Helen Loraine	1st	2.0	female	0	1

In [15]:

# multiple conditions
df[(df['Sex'] == 'female') & (df['Age'] >= 65)]

Out[15]:

	Name	PClass	Age	Sex	Survived	SexCode
Name
Crosby, Mrs Edward Gifford (Catherine Elizabeth Halstead)	Crosby, Mrs Edward Gifford (Catherine Elizabet...	1st	69.0	female	1	1

3.5 Replacing Values¶

In [16]:

# replace any instance of 'female' with Woman
df['Sex'].replace('female', 'Woman').head(2)

Out[16]:

Name
Allen, Miss Elisabeth Walton    Woman
Allison, Miss Helen Loraine     Woman
Name: Sex, dtype: object

In [18]:

# replace any instance of 'female' with Woman
df['Sex'].replace(['female', 'male'], ['Woman', 'Man']).head(5)

Out[18]:

Name
Allen, Miss Elisabeth Walton                     Woman
Allison, Miss Helen Loraine                      Woman
Allison, Mr Hudson Joshua Creighton                Man
Allison, Mrs Hudson JC (Bessie Waldo Daniels)    Woman
Allison, Master Hudson Trevor                      Man
Name: Sex, dtype: object

In [19]:

df.replace(1, "One").head(2)

Out[19]:

	Name	PClass	Age	Sex	Survived	SexCode
Name
Allen, Miss Elisabeth Walton	Allen, Miss Elisabeth Walton	1st	29	female	One	One
Allison, Miss Helen Loraine	Allison, Miss Helen Loraine	1st	2	female	0	One

3.6 Renaming Columns¶

In [20]:

df.rename(columns={'PClass': 'Passenger Class'}).head(2)

Out[20]:

	Name	Passenger Class	Age	Sex	Survived	SexCode
Name
Allen, Miss Elisabeth Walton	Allen, Miss Elisabeth Walton	1st	29.0	female	1	1
Allison, Miss Helen Loraine	Allison, Miss Helen Loraine	1st	2.0	female	0	1

In [22]:

df.rename(columns={'PClass': 'Passenger Class', 'Sex': 'Gender'}).head(2)

Out[22]:

	Name	Passenger Class	Age	Gender	Survived	SexCode
Name
Allen, Miss Elisabeth Walton	Allen, Miss Elisabeth Walton	1st	29.0	female	1	1
Allison, Miss Helen Loraine	Allison, Miss Helen Loraine	1st	2.0	female	0	1

3.7 Finding the Min, Max, Sum, Average, and Count¶

In [24]:

print('Maximum: {}'.format(df['Age'].max()))
print('Minimum: {}'.format(df['Age'].min()))
print('Mean: {}'.format(df['Age'].mean()))
print('Sum: {}'.format(df['Age'].sum()))
print('Count: {}'.format(df['Age'].count()))

Maximum: 71.0
Minimum: 0.17
Mean: 30.397989417989418
Sum: 22980.88
Count: 756

In addition to these, pandas also offers variance (var), standard deviation (std), kurtosis (kurt), skewness (skew), and a number of others.

We can also apply these methods to whole dataframes

In [27]:

print("Variance: {}".format(df.var()))
print("Standard Deviation: {}".format(df.std()))
print("Kurtosis: {}".format(df.kurt()))
print("Skewness: {}".format(df.skew()))

Variance: Age         203.320470
Survived      0.225437
SexCode       0.228230
dtype: float64
Standard Deviation: Age         14.259049
Survived     0.474802
SexCode      0.477734
dtype: float64
Kurtosis: Age        -0.036536
Survived   -1.562162
SexCode    -1.616702
dtype: float64
Skewness: Age         0.368511
Survived    0.663491
SexCode     0.621098
dtype: float64

3.8 Finding Unique Values¶

In [29]:

# unique will return an array of all unique values in a column
df['Sex'].unique()

Out[29]:

array(['female', 'male'], dtype=object)

In [31]:

# value_counts will display all unique values with the number of times each value appears
df['Sex'].value_counts()

Out[31]:

male      851
female    462
Name: Sex, dtype: int64

3.9 Handling Missing Values¶

In [33]:

# select missing values, show 2 rows
df[df['Age'].isnull()].head(2)

Out[33]:

	Name	PClass	Age	Sex	Survived	SexCode
Name
Aubert, Mrs Leontine Pauline	Aubert, Mrs Leontine Pauline	1st	NaN	female	1	1
Barkworth, Mr Algernon H	Barkworth, Mr Algernon H	1st	NaN	male	1	0

3.10 Deleting a Column¶

In [34]:

# axis=1 means the column axis
df.drop('Age', axis=1).head(2)

Out[34]:

	Name	PClass	Sex	Survived	SexCode
Name
Allen, Miss Elisabeth Walton	Allen, Miss Elisabeth Walton	1st	female	1	1
Allison, Miss Helen Loraine	Allison, Miss Helen Loraine	1st	female	0	1

3.11 Deleting a Row¶

In [35]:

# create new dataframe excluding the rows you want to delete
df[df['Sex'] != 'male'].head(2)

Out[35]:

	Name	PClass	Age	Sex	Survived	SexCode
Name
Allen, Miss Elisabeth Walton	Allen, Miss Elisabeth Walton	1st	29.0	female	1	1
Allison, Miss Helen Loraine	Allison, Miss Helen Loraine	1st	2.0	female	0	1

In [37]:

# delete a row by matching a unique value
df[df['Name'] != 'Allison, Miss Helen Loraine'].head(2)

Out[37]:

	Name	PClass	Age	Sex	Survived	SexCode
Name
Allen, Miss Elisabeth Walton	Allen, Miss Elisabeth Walton	1st	29.0	female	1	1
Allison, Mr Hudson Joshua Creighton	Allison, Mr Hudson Joshua Creighton	1st	30.0	male	0	0

In [38]:

# delete a row by index
df[df.index != 0].head(2)

Out[38]:

	Name	PClass	Age	Sex	Survived	SexCode
Name
Allen, Miss Elisabeth Walton	Allen, Miss Elisabeth Walton	1st	29.0	female	1	1
Allison, Miss Helen Loraine	Allison, Miss Helen Loraine	1st	2.0	female	0	1

3.13 Grouping Rows by Values¶

In [39]:

df.groupby('Sex').mean()

Out[39]:

	Age	Survived	SexCode
Sex
female	29.396424	0.666667	1.0
male	31.014338	0.166863	0.0

In [40]:

df.groupby('Survived')['Name'].count()

Out[40]:

Survived
0    863
1    450
Name: Name, dtype: int64

In [41]:

df.groupby(['Sex', 'Survived'])['Age'].mean()

Out[41]:

Sex     Survived
female  0           24.901408
        1           30.867143
male    0           32.320780
        1           25.951875
Name: Age, dtype: float64

3.14 Grouping Rows by Time¶

In [ ]:

3.15 Looping Over a Column¶

In [43]:

# for .. in .. loop
for name in df['Name'][0:2]:
    print(name.upper())

ALLEN, MISS ELISABETH WALTON
ALLISON, MISS HELEN LORAINE

In [45]:

# list comprehension (more "pythonic")
[name.upper() for name in df['Name'][0:2]]

Out[45]:

['ALLEN, MISS ELISABETH WALTON', 'ALLISON, MISS HELEN LORAINE']

3.16 Applying a Function Over All Elements in a Column¶

In [46]:

def uppercase(x):
    return x.upper()

df['Name'].apply(uppercase)[0:2]

Out[46]:

Name
Allen, Miss Elisabeth Walton    ALLEN, MISS ELISABETH WALTON
Allison, Miss Helen Loraine      ALLISON, MISS HELEN LORAINE
Name: Name, dtype: object

Discussion¶

apply is a great way to do data cleaning and wrangling. It is common to write a function to perform some useful operation (separate first and last names, convert string to floats, etc) and then map that funtion to every element in a column.

3.17 Applying a Function to Groups¶

In [47]:

df.groupby('Sex').apply(lambda x: x.count())

Out[47]:

	Name	PClass	Age	Sex	Survived	SexCode
Sex
female	462	462	288	462	462	462
male	851	851	468	851	851	851

By combining groupby and apply we can calculate custom statistics or apply any function to each group separately

3.18 Concatenating DataFrames¶

In [ ]:

Chapter 3¶

Data Wrangling¶

3.0 Introduction¶

3.1 Creating a Data Frame¶

3.2 Describing the Data¶

3.3 Navigating DataFrames¶

Discussion¶

3.4 Selecting Rows Based on Conditionals¶

3.5 Replacing Values¶

3.6 Renaming Columns¶

3.7 Finding the Min, Max, Sum, Average, and Count¶

3.8 Finding Unique Values¶

3.9 Handling Missing Values¶

3.10 Deleting a Column¶

3.11 Deleting a Row¶

3.13 Grouping Rows by Values¶

3.14 Grouping Rows by Time¶

3.15 Looping Over a Column¶

3.16 Applying a Function Over All Elements in a Column¶

Discussion¶

3.17 Applying a Function to Groups¶

3.18 Concatenating DataFrames¶

See Also¶