Version 1.1. Prepared by Makzan. Updated at 2021 March.
Pandas stands for Python Data Analysis Library. It makes use of Series (Like a Numpy Array) and DataFrame (tabular data). In this lesson, we will learn the basic data processing, calculation and filtering by using Pandas.
import numpy as np
import pandas as pd
Remember the pass_or_fail
function we defined in lesson 2? Now we can further process this example with Pandas and its DataFrame.
# Apply label to data
def pass_or_fail(x):
if x >= 60:
return True
return False
# Prepare the data
data = [50, 60, 20, 50, 88, 45, 63, 59]
Given the data list, we can convert it into DataFrame
# Where is the pass and fail?
df = pd.DataFrame(data, columns=['Score'])
df
Score | |
---|---|
0 | 50 |
1 | 60 |
2 | 20 |
3 | 50 |
4 | 88 |
5 | 45 |
6 | 63 |
7 | 59 |
We can create extra column to indicate if the score is passed.
df['Is Passed'] = df['Score'].apply(pass_or_fail)
df
Score | Is Passed | |
---|---|---|
0 | 50 | False |
1 | 60 | True |
2 | 20 | False |
3 | 50 | False |
4 | 88 | True |
5 | 45 | False |
6 | 63 | True |
7 | 59 | False |
By using Boolean filtering, we can list only the data that meet the condition.
df[ (df['Score']>=60) & (df['Score']<=80) ]
Score | Is Passed | |
---|---|---|
1 | 60 | True |
6 | 63 | True |
Next, we will load the students dictionary from lesson 2.
students = [
{'name': 'Thomas', 'score': 65},
{'name': 'Alan', 'score': 95},
{'name': 'Jane', 'score': 85},
{'name': 'Susan', 'score': 75},
{'name': 'Chris', 'score': 45}
]
We convert the dictionary into DataFrame. Pandas will take care of the column name automatically.
df = pd.DataFrame(students)
df
name | score | |
---|---|---|
0 | Thomas | 65 |
1 | Alan | 95 |
2 | Jane | 85 |
3 | Susan | 75 |
4 | Chris | 45 |
Same as the example above, we apply the pass_or_fail
function to create a new column with Boolean.
df['Is Passed'] = df['score'].apply(pass_or_fail)
df
name | score | Is Passed | |
---|---|---|---|
0 | Thomas | 65 | True |
1 | Alan | 95 | True |
2 | Jane | 85 | True |
3 | Susan | 75 | True |
4 | Chris | 45 | False |
We can generate a new DataFrame with only the passed records.
df_passed = df[df["Is Passed"]]
df_passed
name | score | Is Passed | |
---|---|---|---|
0 | Thomas | 65 | True |
1 | Alan | 95 | True |
2 | Jane | 85 | True |
3 | Susan | 75 | True |
We can save the processed tabular data into Excel by using to_excel
.
df_passed.to_excel("Students Passed.xlsx")
We can get the name of students who passed.
df[df["Is Passed"]]["name"]
0 Thomas 1 Alan 2 Jane 3 Susan Name: name, dtype: object
An example of outputing the result.
count_of_passed = len(df[df["Is Passed"]])
result = f"We have total {len(df)} students, {count_of_passed} of them passed."
print(result)
We have total 5 students, 4 of them passed.
names = ', '.join(df[df["Is Passed"]]["name"])
names = f"They are {names}."
print(names)
They are Thomas, Alan, Jane, Susan.
df = pd.read_excel('visitors.xlsx')
df
Time | Visitors | |
---|---|---|
0 | 1 | 2 |
1 | 2 | 0 |
2 | 3 | 1 |
3 | 4 | 3 |
4 | 5 | 0 |
5 | 6 | 8 |
6 | 7 | 32 |
7 | 8 | 48 |
8 | 9 | 21 |
9 | 10 | 10 |
10 | 11 | 9 |
11 | 12 | 47 |
12 | 13 | 42 |
13 | 14 | 15 |
14 | 15 | 11 |
15 | 16 | 12 |
16 | 17 | 19 |
17 | 18 | 55 |
18 | 19 | 57 |
19 | 20 | 55 |
20 | 21 | 32 |
21 | 22 | 24 |
22 | 23 | 11 |
23 | 24 | 4 |
df.head()
Time | Visitors | |
---|---|---|
0 | 1 | 2 |
1 | 2 | 0 |
2 | 3 | 1 |
3 | 4 | 3 |
4 | 5 | 0 |
df.tail()
Time | Visitors | |
---|---|---|
19 | 20 | 55 |
20 | 21 | 32 |
21 | 22 | 24 |
22 | 23 | 11 |
23 | 24 | 4 |
df['Visitors']
0 2 1 0 2 1 3 3 4 0 5 8 6 32 7 48 8 21 9 10 10 9 11 47 12 42 13 15 14 11 15 12 16 19 17 55 18 57 19 55 20 32 21 24 22 11 23 4 Name: Visitors, dtype: int64
Which hours do we have more than 30 visitors?
df[ df['Visitors'] > 30 ]
Time | Visitors | |
---|---|---|
6 | 7 | 32 |
7 | 8 | 48 |
11 | 12 | 47 |
12 | 13 | 42 |
17 | 18 | 55 |
18 | 19 | 57 |
19 | 20 | 55 |
20 | 21 | 32 |
We can create new Boolean column with condition.
df['Good Timing'] = df['Visitors']>30
df
Time | Visitors | Good Timing | |
---|---|---|---|
0 | 1 | 2 | False |
1 | 2 | 0 | False |
2 | 3 | 1 | False |
3 | 4 | 3 | False |
4 | 5 | 0 | False |
5 | 6 | 8 | False |
6 | 7 | 32 | True |
7 | 8 | 48 | True |
8 | 9 | 21 | False |
9 | 10 | 10 | False |
10 | 11 | 9 | False |
11 | 12 | 47 | True |
12 | 13 | 42 | True |
13 | 14 | 15 | False |
14 | 15 | 11 | False |
15 | 16 | 12 | False |
16 | 17 | 19 | False |
17 | 18 | 55 | True |
18 | 19 | 57 | True |
19 | 20 | 55 | True |
20 | 21 | 32 | True |
21 | 22 | 24 | False |
22 | 23 | 11 | False |
23 | 24 | 4 | False |
df[df['Good Timing']]
Time | Visitors | Good Timing | |
---|---|---|---|
6 | 7 | 32 | True |
7 | 8 | 48 | True |
11 | 12 | 47 | True |
12 | 13 | 42 | True |
17 | 18 | 55 | True |
18 | 19 | 57 | True |
19 | 20 | 55 | True |
20 | 21 | 32 | True |
df = pd.read_csv('visitors.csv', delimiter=',', names=('date','visitors'))
df.head()
date | visitors | |
---|---|---|
0 | 2018-12-18 | 22 |
1 | 2018-12-17 | 0 |
2 | 2018-12-16 | 4 |
3 | 2018-12-15 | 218 |
4 | 2018-12-14 | 11 |
df['date'].head()
0 2018-12-18 1 2018-12-17 2 2018-12-16 3 2018-12-15 4 2018-12-14 Name: date, dtype: object
We can convert the data table into CSV by using https://wikitable2csv.ggor.de
For example, we can download the All-time Olympic Games medal table from Wikipedia:
https://en.wikipedia.org/wiki/All-time_Olympic_Games_medal_table
We can load the converted CSV directly into pandas and process the data set. Alternatively, we can clean up the CSV a little bit before loading it into pandas.
For instance, we can give each column a readable name in header. We can also convert thousands with comma too.
We have prepared a table-1-olympics.csv
file with basic format processed.
df = pd.read_csv('table-1-olympics.csv')
df = df[:-1] # Remove the last row of data
df.head()
Team | Summer Games | Gold | Silver | Bronze | Total | Winter Games | Gold.1 | Silver.1 | Bronze.1 | Total.1 | Combined Participated | Combined Gold | Combined Silver | Combined Bronze | Combined Total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan (AFG) | 14 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 14 | 0 | 0 | 2 | 2 |
1 | Algeria (ALG) | 13 | 5 | 4 | 8 | 17 | 3 | 0 | 0 | 0 | 0 | 16 | 5 | 4 | 8 | 17 |
2 | Argentina (ARG) | 24 | 21 | 25 | 28 | 74 | 19 | 0 | 0 | 0 | 0 | 43 | 21 | 25 | 28 | 74 |
3 | Armenia (ARM) | 6 | 2 | 6 | 6 | 14 | 7 | 0 | 0 | 0 | 0 | 13 | 2 | 6 | 6 | 14 |
4 | Australasia (ANZ) [ANZ] | 2 | 3 | 4 | 5 | 12 | 0 | 0 | 0 | 0 | 0 | 2 | 3 | 4 | 5 | 12 |
df.shape
(152, 16)
By checking the dtype
, we find that the data in "O" instead of "Int". "O" stands for Object because it is treated as String.
df['Total'].dtype
dtype('O')
We can convert the dtype by using astype
.
df['Total'] = df['Total'].astype(int)
df['Total'].dtype
dtype('int64')
Now the team name is not quite readable. There are extra symbols and footnote indicators.
We can clean up Team
name and replace space into underscore:
df["Team"] = df["Team"].apply(lambda x: x.split("(")[0].strip().replace(" ","_"))
df.head()
Team | Summer Games | Gold | Silver | Bronze | Total | Winter Games | Gold.1 | Silver.1 | Bronze.1 | Total.1 | Combined Participated | Combined Gold | Combined Silver | Combined Bronze | Combined Total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 14 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 14 | 0 | 0 | 2 | 2 |
1 | Algeria | 13 | 5 | 4 | 8 | 17 | 3 | 0 | 0 | 0 | 0 | 16 | 5 | 4 | 8 | 17 |
2 | Argentina | 24 | 21 | 25 | 28 | 74 | 19 | 0 | 0 | 0 | 0 | 43 | 21 | 25 | 28 | 74 |
3 | Armenia | 6 | 2 | 6 | 6 | 14 | 7 | 0 | 0 | 0 | 0 | 13 | 2 | 6 | 6 | 14 |
4 | Australasia | 2 | 3 | 4 | 5 | 12 | 0 | 0 | 0 | 0 | 0 | 2 | 3 | 4 | 5 | 12 |
Set the Team name as dataframe index:
df.set_index("Team", inplace=True)
Make sure all numbers in the dataframe are numeric:
df = df.astype(int)
We can inspect the column names by using df.columns
Expected result |
---|
Index(['Summer Games', 'Gold', 'Silver', 'Bronze', 'Total', 'Winter Games', |
'Gold.1', 'Silver.1', 'Bronze.1', 'Total.1', 'Combined Participated',
'Combined Gold', 'Combined Silver', 'Combined Bronze',
'Combined Total'],
dtype='object')|
We can access a particular row of data by using iloc
df.iloc[16]
Summer Games 22 Gold 30 Silver 36 Bronze 63 Total 129 Winter Games 8 Gold.1 0 Silver.1 0 Bronze.1 0 Total.1 0 Combined Participated 30 Combined Gold 30 Combined Silver 36 Combined Bronze 63 Combined Total 129 Name: Brazil, dtype: int64
We can also access a particular row of data by using the index label df.loc["Brazil"]
.
What is the Olymnpic Medal data from China?
Expected result |
---|
Summer Games 10 |
Gold 224 Silver 167 Bronze 155 Total 546 Winter Games 11 Gold.1 13 Silver.1 28 Bronze.1 21 Total.1 62 Combined Participated 21 Combined Gold 237 Combined Silver 195 Combined Bronze 176 Combined Total 608 Name: China, dtype: int32|
How many gold medal China won in Summer Olymnpics?
Expected result |
---|
224 |
We can get the rank of gold medal list by using df.sort_values
.
df.sort_values(by="Gold", ascending=False)[:5]
Summer Games | Gold | Silver | Bronze | Total | Winter Games | Gold.1 | Silver.1 | Bronze.1 | Total.1 | Combined Participated | Combined Gold | Combined Silver | Combined Bronze | Combined Total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Team | |||||||||||||||
United_States | 27 | 1022 | 795 | 705 | 2522 | 23 | 105 | 112 | 88 | 305 | 50 | 1127 | 907 | 793 | 2827 |
Soviet_Union | 9 | 395 | 319 | 296 | 1010 | 9 | 78 | 57 | 59 | 194 | 18 | 473 | 376 | 355 | 1204 |
Great_Britain | 28 | 263 | 295 | 293 | 851 | 23 | 11 | 4 | 17 | 32 | 51 | 274 | 299 | 310 | 883 |
China | 10 | 224 | 167 | 155 | 546 | 11 | 13 | 28 | 21 | 62 | 21 | 237 | 195 | 176 | 608 |
France | 28 | 212 | 241 | 263 | 716 | 23 | 36 | 35 | 53 | 124 | 51 | 248 | 276 | 316 | 840 |
How many teams earn gold medals in both summer Olympics and winder Olympics?
len(df[(df["Gold"] > 0) & (df["Gold.1"] > 0)])
38
Which teams are they?
df[(df["Gold"] > 0) & (df["Gold.1"] > 0)][['Gold','Gold.1']]
Gold | Gold.1 | |
---|---|---|
Team | ||
Australia | 147 | 5 |
Austria | 18 | 64 |
Belarus | 12 | 8 |
Belgium | 40 | 1 |
Bulgaria | 51 | 1 |
Canada | 64 | 73 |
China | 224 | 13 |
Croatia | 11 | 4 |
Czech_Republic | 15 | 9 |
Czechoslovakia | 49 | 2 |
Estonia | 9 | 4 |
Finland | 101 | 43 |
France | 212 | 36 |
Germany | 191 | 92 |
United_Team_of_Germany | 28 | 8 |
East_Germany | 153 | 39 |
West_Germany | 56 | 11 |
Great_Britain | 263 | 11 |
Hungary | 175 | 1 |
Italy | 206 | 40 |
Japan | 142 | 14 |
Kazakhstan | 15 | 1 |
South_Korea | 90 | 31 |
Latvia | 3 | 1 |
Netherlands | 85 | 45 |
Norway | 56 | 132 |
Poland | 68 | 7 |
Russia | 148 | 47 |
Soviet_Union | 395 | 78 |
Slovakia | 9 | 3 |
Slovenia | 5 | 2 |
Spain | 45 | 1 |
Sweden | 145 | 57 |
Switzerland | 50 | 56 |
Ukraine | 35 | 3 |
United_States | 1022 | 105 |
Uzbekistan | 8 | 1 |
Unified_Team | 45 | 9 |
How about the team that wins gold medals in winter Olympics but not summer Olympics?
Expected result |
---|
['Liechtenstein', 'Olympic_Athletes_from_Russia'] |
df = pd.read_excel("dsec_visitors.xlsx", skiprows=4, skipfooter=4)
df.columns
df = df.rename(columns={
"Unnamed: 0": "時期",
"按證件簽發地統計之旅客 b": "按證件簽發地統計之旅客",
"按證件簽發地統計之留宿旅客 b": "按證件簽發地統計之留宿旅客",
"按證件簽發地統計之不過夜旅客 b": "按證件簽發地統計之不過夜旅客"
})
df = df.drop(columns=["Unnamed: 2", "Unnamed: 4", "Unnamed: 6"])
df = df[1:]
df.head()
時期 | 按證件簽發地統計之旅客 | 按證件簽發地統計之留宿旅客 | 按證件簽發地統計之不過夜旅客 | |
---|---|---|---|---|
1 | 2011年1月 | 1.1 | 2.2 | 0.2 |
2 | 2011年2月 | 1 | 2 | 0.2 |
3 | 2011年3月 | 1 | 2.1 | 0.2 |
4 | 2011年4月 | 1 | 2 | 0.2 |
5 | 2011年5月 | 1.1 | 2.1 | 0.2 |
def year_from_dsec_date_period(period):
"""Convert 2011年5月 into 2011"""
return int(period.split("年")[0])
def month_from_dsec_date_period(period):
"""Convert 2011年5月 into 2011"""
return int(period.split("年")[1].replace("月",""))
df["Year"] = df["時期"].apply(year_from_dsec_date_period)
df["Month"] = df["時期"].apply(month_from_dsec_date_period)
df
時期 | 按證件簽發地統計之旅客 | 按證件簽發地統計之留宿旅客 | 按證件簽發地統計之不過夜旅客 | Year | Month | |
---|---|---|---|---|---|---|
1 | 2011年1月 | 1.1 | 2.2 | 0.2 | 2011 | 1 |
2 | 2011年2月 | 1 | 2 | 0.2 | 2011 | 2 |
3 | 2011年3月 | 1 | 2.1 | 0.2 | 2011 | 3 |
4 | 2011年4月 | 1 | 2 | 0.2 | 2011 | 4 |
5 | 2011年5月 | 1.1 | 2.1 | 0.2 | 2011 | 5 |
... | ... | ... | ... | ... | ... | ... |
115 | 2020年7月 | 1.3 | 4.7 | 0.1 | 2020 | 7 |
116 | 2020年8月 | 0.9 | 3.3 | 0.1 | 2020 | 8 |
117 | 2020年9月 | 1 | 2.7 | 0.1 | 2020 | 9 |
118 | 2020年10月 | 1.4 | 2.7 | 0.1 | 2020 | 10 |
119 | 2020年11月 | 1.7 | 3.1 | 0.1 | 2020 | 11 |
119 rows × 6 columns