Data wrangling is a broad term used, often informally, to describe the process of transforming raw data to a clean and organized format ready for use.
The most common data structure used to "wrangle" data is the data frame, which can be both intuitive and incredibly versatile. Data frames are tabular, meaning that htey are based on rows and columns like you'd find in a spreadsheet
import pandas as pd
dataframe = pd.DataFrame()
url = "http://tinyurl.com/titanic-csv"
df = pd.read_csv(url)
# show first two rows
print(df.head(2)) # also try tail(2) for last two rows
# show dimensions
print("Dimensions: {}".format(df.shape))
# show statistics
df.describe()
Name PClass Age Sex Survived SexCode 0 Allen, Miss Elisabeth Walton 1st 29.0 female 1 1 1 Allison, Miss Helen Loraine 1st 2.0 female 0 1 Dimensions: (1313, 6)
Age | Survived | SexCode | |
---|---|---|---|
count | 756.000000 | 1313.000000 | 1313.000000 |
mean | 30.397989 | 0.342727 | 0.351866 |
std | 14.259049 | 0.474802 | 0.477734 |
min | 0.170000 | 0.000000 | 0.000000 |
25% | 21.000000 | 0.000000 | 0.000000 |
50% | 28.000000 | 0.000000 | 0.000000 |
75% | 39.000000 | 1.000000 | 1.000000 |
max | 71.000000 | 1.000000 | 1.000000 |
# select the first row
print(df.iloc[0])
# select three rows
print(df.iloc[1:4])
# all rows up to and including the fourth row
print(df.iloc[:4])
Name Allen, Miss Elisabeth Walton PClass 1st Age 29 Sex female Survived 1 SexCode 1 Name: 0, dtype: object Name PClass Age Sex \ 1 Allison, Miss Helen Loraine 1st 2.0 female 2 Allison, Mr Hudson Joshua Creighton 1st 30.0 male 3 Allison, Mrs Hudson JC (Bessie Waldo Daniels) 1st 25.0 female Survived SexCode 1 0 1 2 0 0 3 0 1 Name PClass Age Sex \ 0 Allen, Miss Elisabeth Walton 1st 29.0 female 1 Allison, Miss Helen Loraine 1st 2.0 female 2 Allison, Mr Hudson Joshua Creighton 1st 30.0 male 3 Allison, Mrs Hudson JC (Bessie Waldo Daniels) 1st 25.0 female Survived SexCode 0 1 1 1 0 1 2 0 0 3 0 1
DataFrames do not need to be numerically indexed. We can set the index of a DataFrame to any value where the value is unique to each row. For example, we can set the index to be passenger names and then select rows using a name:
# set index
df = df.set_index(df['Name'])
# show row
df.loc['Allen, Miss Elisabeth Walton']
Name Allen, Miss Elisabeth Walton PClass 1st Age 29 Sex female Survived 1 SexCode 1 Name: Allen, Miss Elisabeth Walton, dtype: object
To select individual rows and slices of rows, pandas provides two methods:
loc
is useful when the index of the DataFrame is a label (a string)iloc
works by looking for the position in the DataFrame. For exmaple, iloc[0] will return the first row regardless of whether the index is an integer or a label# select top two rows where column 'sex' is 'female'
df[df['Sex'] == 'female'].head(2)
Name | PClass | Age | Sex | Survived | SexCode | |
---|---|---|---|---|---|---|
Name | ||||||
Allen, Miss Elisabeth Walton | Allen, Miss Elisabeth Walton | 1st | 29.0 | female | 1 | 1 |
Allison, Miss Helen Loraine | Allison, Miss Helen Loraine | 1st | 2.0 | female | 0 | 1 |
# multiple conditions
df[(df['Sex'] == 'female') & (df['Age'] >= 65)]
Name | PClass | Age | Sex | Survived | SexCode | |
---|---|---|---|---|---|---|
Name | ||||||
Crosby, Mrs Edward Gifford (Catherine Elizabeth Halstead) | Crosby, Mrs Edward Gifford (Catherine Elizabet... | 1st | 69.0 | female | 1 | 1 |
# replace any instance of 'female' with Woman
df['Sex'].replace('female', 'Woman').head(2)
Name Allen, Miss Elisabeth Walton Woman Allison, Miss Helen Loraine Woman Name: Sex, dtype: object
# replace any instance of 'female' with Woman
df['Sex'].replace(['female', 'male'], ['Woman', 'Man']).head(5)
Name Allen, Miss Elisabeth Walton Woman Allison, Miss Helen Loraine Woman Allison, Mr Hudson Joshua Creighton Man Allison, Mrs Hudson JC (Bessie Waldo Daniels) Woman Allison, Master Hudson Trevor Man Name: Sex, dtype: object
df.replace(1, "One").head(2)
Name | PClass | Age | Sex | Survived | SexCode | |
---|---|---|---|---|---|---|
Name | ||||||
Allen, Miss Elisabeth Walton | Allen, Miss Elisabeth Walton | 1st | 29 | female | One | One |
Allison, Miss Helen Loraine | Allison, Miss Helen Loraine | 1st | 2 | female | 0 | One |
df.rename(columns={'PClass': 'Passenger Class'}).head(2)
Name | Passenger Class | Age | Sex | Survived | SexCode | |
---|---|---|---|---|---|---|
Name | ||||||
Allen, Miss Elisabeth Walton | Allen, Miss Elisabeth Walton | 1st | 29.0 | female | 1 | 1 |
Allison, Miss Helen Loraine | Allison, Miss Helen Loraine | 1st | 2.0 | female | 0 | 1 |
df.rename(columns={'PClass': 'Passenger Class', 'Sex': 'Gender'}).head(2)
Name | Passenger Class | Age | Gender | Survived | SexCode | |
---|---|---|---|---|---|---|
Name | ||||||
Allen, Miss Elisabeth Walton | Allen, Miss Elisabeth Walton | 1st | 29.0 | female | 1 | 1 |
Allison, Miss Helen Loraine | Allison, Miss Helen Loraine | 1st | 2.0 | female | 0 | 1 |
print('Maximum: {}'.format(df['Age'].max()))
print('Minimum: {}'.format(df['Age'].min()))
print('Mean: {}'.format(df['Age'].mean()))
print('Sum: {}'.format(df['Age'].sum()))
print('Count: {}'.format(df['Age'].count()))
Maximum: 71.0 Minimum: 0.17 Mean: 30.397989417989418 Sum: 22980.88 Count: 756
In addition to these, pandas also offers variance (var
), standard deviation (std
), kurtosis (kurt
), skewness (skew
), and a number of others.
We can also apply these methods to whole dataframes
print("Variance: {}".format(df.var()))
print("Standard Deviation: {}".format(df.std()))
print("Kurtosis: {}".format(df.kurt()))
print("Skewness: {}".format(df.skew()))
Variance: Age 203.320470 Survived 0.225437 SexCode 0.228230 dtype: float64 Standard Deviation: Age 14.259049 Survived 0.474802 SexCode 0.477734 dtype: float64 Kurtosis: Age -0.036536 Survived -1.562162 SexCode -1.616702 dtype: float64 Skewness: Age 0.368511 Survived 0.663491 SexCode 0.621098 dtype: float64
# unique will return an array of all unique values in a column
df['Sex'].unique()
array(['female', 'male'], dtype=object)
# value_counts will display all unique values with the number of times each value appears
df['Sex'].value_counts()
male 851 female 462 Name: Sex, dtype: int64
# select missing values, show 2 rows
df[df['Age'].isnull()].head(2)
Name | PClass | Age | Sex | Survived | SexCode | |
---|---|---|---|---|---|---|
Name | ||||||
Aubert, Mrs Leontine Pauline | Aubert, Mrs Leontine Pauline | 1st | NaN | female | 1 | 1 |
Barkworth, Mr Algernon H | Barkworth, Mr Algernon H | 1st | NaN | male | 1 | 0 |
# axis=1 means the column axis
df.drop('Age', axis=1).head(2)
Name | PClass | Sex | Survived | SexCode | |
---|---|---|---|---|---|
Name | |||||
Allen, Miss Elisabeth Walton | Allen, Miss Elisabeth Walton | 1st | female | 1 | 1 |
Allison, Miss Helen Loraine | Allison, Miss Helen Loraine | 1st | female | 0 | 1 |
# create new dataframe excluding the rows you want to delete
df[df['Sex'] != 'male'].head(2)
Name | PClass | Age | Sex | Survived | SexCode | |
---|---|---|---|---|---|---|
Name | ||||||
Allen, Miss Elisabeth Walton | Allen, Miss Elisabeth Walton | 1st | 29.0 | female | 1 | 1 |
Allison, Miss Helen Loraine | Allison, Miss Helen Loraine | 1st | 2.0 | female | 0 | 1 |
# delete a row by matching a unique value
df[df['Name'] != 'Allison, Miss Helen Loraine'].head(2)
Name | PClass | Age | Sex | Survived | SexCode | |
---|---|---|---|---|---|---|
Name | ||||||
Allen, Miss Elisabeth Walton | Allen, Miss Elisabeth Walton | 1st | 29.0 | female | 1 | 1 |
Allison, Mr Hudson Joshua Creighton | Allison, Mr Hudson Joshua Creighton | 1st | 30.0 | male | 0 | 0 |
# delete a row by index
df[df.index != 0].head(2)
Name | PClass | Age | Sex | Survived | SexCode | |
---|---|---|---|---|---|---|
Name | ||||||
Allen, Miss Elisabeth Walton | Allen, Miss Elisabeth Walton | 1st | 29.0 | female | 1 | 1 |
Allison, Miss Helen Loraine | Allison, Miss Helen Loraine | 1st | 2.0 | female | 0 | 1 |
df.groupby('Sex').mean()
Age | Survived | SexCode | |
---|---|---|---|
Sex | |||
female | 29.396424 | 0.666667 | 1.0 |
male | 31.014338 | 0.166863 | 0.0 |
df.groupby('Survived')['Name'].count()
Survived 0 863 1 450 Name: Name, dtype: int64
df.groupby(['Sex', 'Survived'])['Age'].mean()
Sex Survived female 0 24.901408 1 30.867143 male 0 32.320780 1 25.951875 Name: Age, dtype: float64
# for .. in .. loop
for name in df['Name'][0:2]:
print(name.upper())
ALLEN, MISS ELISABETH WALTON ALLISON, MISS HELEN LORAINE
# list comprehension (more "pythonic")
[name.upper() for name in df['Name'][0:2]]
['ALLEN, MISS ELISABETH WALTON', 'ALLISON, MISS HELEN LORAINE']
def uppercase(x):
return x.upper()
df['Name'].apply(uppercase)[0:2]
Name Allen, Miss Elisabeth Walton ALLEN, MISS ELISABETH WALTON Allison, Miss Helen Loraine ALLISON, MISS HELEN LORAINE Name: Name, dtype: object
df.groupby('Sex').apply(lambda x: x.count())
Name | PClass | Age | Sex | Survived | SexCode | |
---|---|---|---|---|---|---|
Sex | ||||||
female | 462 | 462 | 288 | 462 | 462 | 462 |
male | 851 | 851 | 468 | 851 | 851 | 851 |
By combining groupby
and apply
we can calculate custom statistics or apply any function to each group separately