In this notebook we will learn indexing DataFrame
with loc
and iloc
functionality and slicing
operation. Some Data Filtering
techniques will also be conducted.
Read more about these functionality from Pydata documentation for indexing[1]. Some parts of this notebook are taken from EuroScipy 2016 Pandas Tutorial by Joris Van den Bossche and Nicholas Devenish[2]
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
titanic = pd.read_csv('data/titanic.csv')
titanic = titanic.set_index("Name")
titanic.head(2)
PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|
Name | |||||||||||
Braund, Mr. Owen Harris | 1 | 0 | 3 | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
Cumings, Mrs. John Bradley (Florence Briggs Thayer) | 2 | 1 | 1 | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
sample = titanic[0:5]
iterrows
¶for index,row in sample.iterrows():
print(index,list(row))
Braund, Mr. Owen Harris [1, 0, 3, 'male', 22.0, 1, 0, 'A/5 21171', 7.25, nan, 'S'] Cumings, Mrs. John Bradley (Florence Briggs Thayer) [2, 1, 1, 'female', 38.0, 1, 0, 'PC 17599', 71.2833, 'C85', 'C'] Heikkinen, Miss. Laina [3, 1, 3, 'female', 26.0, 0, 0, 'STON/O2. 3101282', 7.925, nan, 'S'] Futrelle, Mrs. Jacques Heath (Lily May Peel) [4, 1, 1, 'female', 35.0, 1, 0, '113803', 53.1, 'C123', 'S'] Allen, Mr. William Henry [5, 0, 3, 'male', 35.0, 0, 0, '373450', 8.05, nan, 'S']
row()
inputfor index,row in sample.iterrows():
print(index,row['Sex'],row['Age'])
Braund, Mr. Owen Harris male 22.0 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38.0 Heikkinen, Miss. Laina female 26.0 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 Allen, Mr. William Henry male 35.0
iteritems
¶for index,row in sample.T.iteritems():
print(index,list(row))
Braund, Mr. Owen Harris [1, 0, 3, 'male', 22.0, 1, 0, 'A/5 21171', 7.25, nan, 'S'] Cumings, Mrs. John Bradley (Florence Briggs Thayer) [2, 1, 1, 'female', 38.0, 1, 0, 'PC 17599', 71.2833, 'C85', 'C'] Heikkinen, Miss. Laina [3, 1, 3, 'female', 26.0, 0, 0, 'STON/O2. 3101282', 7.925, nan, 'S'] Futrelle, Mrs. Jacques Heath (Lily May Peel) [4, 1, 1, 'female', 35.0, 1, 0, '113803', 53.1, 'C123', 'S'] Allen, Mr. William Henry [5, 0, 3, 'male', 35.0, 0, 0, '373450', 8.05, nan, 'S']
for index,row in sample.iteritems():
print(index,row[0],row[1],row[2])
PassengerId 1 2 3 Survived 0 1 1 Pclass 3 1 3 Sex male female female Age 22.0 38.0 26.0 SibSp 1 1 0 Parch 0 0 0 Ticket A/5 21171 PC 17599 STON/O2. 3101282 Fare 7.25 71.2833 7.925 Cabin nan C85 nan Embarked S C S
loc
and iloc
:The iloc indexer for Pandas Dataframe is used for integer-location based indexing / selection by position.
The Pandas loc indexer can be used with DataFrames for two different use cases:
sample.iloc[0:2,:]
PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|
Name | |||||||||||
Braund, Mr. Owen Harris | 1 | 0 | 3 | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
Cumings, Mrs. John Bradley (Florence Briggs Thayer) | 2 | 1 | 1 | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
sample.iloc[1,0:3]
PassengerId 2 Survived 1 Pclass 1 Name: Cumings, Mrs. John Bradley (Florence Briggs Thayer), dtype: object
sample.loc[:,'Survived': 'Ticket']
Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | |
---|---|---|---|---|---|---|---|
Name | |||||||
Braund, Mr. Owen Harris | 0 | 3 | male | 22.0 | 1 | 0 | A/5 21171 |
Cumings, Mrs. John Bradley (Florence Briggs Thayer) | 1 | 1 | female | 38.0 | 1 | 0 | PC 17599 |
Heikkinen, Miss. Laina | 1 | 3 | female | 26.0 | 0 | 0 | STON/O2. 3101282 |
Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | 1 | female | 35.0 | 1 | 0 | 113803 |
Allen, Mr. William Henry | 0 | 3 | male | 35.0 | 0 | 0 | 373450 |
sample.loc['Braund, Mr. Owen Harris',:]
PassengerId 1 Survived 0 Pclass 3 Sex male Age 22 SibSp 1 Parch 0 Ticket A/5 21171 Fare 7.25 Cabin NaN Embarked S Name: Braund, Mr. Owen Harris, dtype: object
Data Filters can be performed by either selecting specific set of column names or by seting boolean filters
.
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
'population': [11.3, 64.3, 81.3, 16.9, 64.9],
'area': [30510, 671308, 357050, 41526, 244820],
'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries
country | population | area | capital | |
---|---|---|---|---|
0 | Belgium | 11.3 | 30510 | Brussels |
1 | France | 64.3 | 671308 | Paris |
2 | Germany | 81.3 | 357050 | Berlin |
3 | Netherlands | 16.9 | 41526 | Amsterdam |
4 | United Kingdom | 64.9 | 244820 | London |
country
as index:countries = countries.set_index('country')
countries
population | area | capital | |
---|---|---|---|
country | |||
Belgium | 11.3 | 30510 | Brussels |
France | 64.3 | 671308 | Paris |
Germany | 81.3 | 357050 | Berlin |
Netherlands | 16.9 | 41526 | Amsterdam |
United Kingdom | 64.9 | 244820 | London |
area
of each country:countries['area']
country Belgium 30510 France 671308 Germany 357050 Netherlands 41526 United Kingdom 244820 Name: area, dtype: int64
list
instead of pandas Series
object.Area = list(countries['area'])
Area
[30510, 671308, 357050, 41526, 244820]
sub-DataFrame
by choosing area
and population
countries[['area', 'population']]
area | population | |
---|---|---|
country | ||
Belgium | 30510 | 11.3 |
France | 671308 | 64.3 |
Germany | 357050 | 81.3 |
Netherlands | 41526 | 16.9 |
United Kingdom | 244820 | 64.9 |
index
names.countries['France':'Netherlands']
population | area | capital | |
---|---|---|---|
country | |||
France | 64.3 | 671308 | Paris |
Germany | 81.3 | 357050 | Berlin |
Netherlands | 16.9 | 41526 | Amsterdam |
index
name(s) and spefic column identified by column
name(s):countries.loc['Germany', 'area']
357050
countries.loc['France':'Germany', ['area', 'population']]
area | population | |
---|---|---|
country | ||
France | 671308 | 64.3 |
Germany | 357050 | 81.3 |
boolean
selection for subDataFrame
:countries['area'] > 100000
country Belgium False France True Germany True Netherlands False United Kingdom True Name: area, dtype: bool
countries[countries['area'] > 100000]
population | area | capital | |
---|---|---|---|
country | |||
France | 64.3 | 671308 | Paris |
Germany | 81.3 | 357050 | Berlin |
United Kingdom | 64.9 | 244820 | London |
countries['density'] = countries['population']*1000000 / countries['area']
countries
population | area | capital | density | |
---|---|---|---|---|
country | ||||
Belgium | 11.3 | 30510 | Brussels | 370.370370 |
France | 64.3 | 671308 | Paris | 95.783158 |
Germany | 81.3 | 357050 | Berlin | 227.699202 |
Netherlands | 16.9 | 41526 | Amsterdam | 406.973944 |
United Kingdom | 64.9 | 244820 | London | 265.092721 |
countries.loc[countries['density'] > 300, ['capital', 'population']]
capital | population | |
---|---|---|
country | ||
Belgium | Brussels | 11.3 |
Netherlands | Amsterdam | 16.9 |
countries['density_ratio'] = countries['density'] / countries['density'].mean()
countries
population | area | capital | density | density_ratio | |
---|---|---|---|---|---|
country | |||||
Belgium | 11.3 | 30510 | Brussels | 370.370370 | 1.355755 |
France | 64.3 | 671308 | Paris | 95.783158 | 0.350618 |
Germany | 81.3 | 357050 | Berlin | 227.699202 | 0.833502 |
Netherlands | 16.9 | 41526 | Amsterdam | 406.973944 | 1.489744 |
United Kingdom | 64.9 | 244820 | London | 265.092721 | 0.970382 |
countries.loc['United Kingdom', 'capital'] = 'Cambridge'
countries
population | area | capital | density | density_ratio | |
---|---|---|---|---|---|
country | |||||
Belgium | 11.3 | 30510 | Brussels | 370.370370 | 1.355755 |
France | 64.3 | 671308 | Paris | 95.783158 | 0.350618 |
Germany | 81.3 | 357050 | Berlin | 227.699202 | 0.833502 |
Netherlands | 16.9 | 41526 | Amsterdam | 406.973944 | 1.489744 |
United Kingdom | 64.9 | 244820 | Cambridge | 265.092721 | 0.970382 |
countries[(countries['density'] > 100) & (countries['density'] < 300)]
population | area | capital | density | density_ratio | |
---|---|---|---|---|---|
country | |||||
Germany | 81.3 | 357050 | Berlin | 227.699202 | 0.833502 |
United Kingdom | 64.9 | 244820 | Cambridge | 265.092721 | 0.970382 |