Lesson 9b: Pandas¶

Version 1.1. Prepared by Makzan. Updated at 2021 March.

Pandas stands for Python Data Analysis Library. It makes use of Series (Like a Numpy Array) and DataFrame (tabular data). In this lesson, we will learn the basic data processing, calculation and filtering by using Pandas.

Is student passed?
Accessing the column
Reading Excel
Reading CSV
Fetching CSV from Wikipedia
Lab 4: Olymnic Games medal table

In [2]:

import numpy as np
import pandas as pd

Is student passed?¶

Remember the pass_or_fail function we defined in lesson 2? Now we can further process this example with Pandas and its DataFrame.

In [2]:

# Apply label to data
def pass_or_fail(x):
    if x >= 60:
        return True
    return False

In [3]:

# Prepare the data
data = [50, 60, 20, 50, 88, 45, 63, 59]

Given the data list, we can convert it into DataFrame

In [4]:

# Where is the pass and fail?
df = pd.DataFrame(data, columns=['Score'])
df

Out[4]:

	Score
0	50
1	60
2	20
3	50
4	88
5	45
6	63
7	59

We can create extra column to indicate if the score is passed.

In [5]:

df['Is Passed'] = df['Score'].apply(pass_or_fail)
df

Out[5]:

	Score	Is Passed
0	50	False
1	60	True
2	20	False
3	50	False
4	88	True
5	45	False
6	63	True
7	59	False

By using Boolean filtering, we can list only the data that meet the condition.

In [6]:

df[ (df['Score']>=60) & (df['Score']<=80) ]

Out[6]:

	Score	Is Passed
1	60	True
6	63	True

Next, we will load the students dictionary from lesson 2.

In [7]:

students = [
    {'name': 'Thomas', 'score': 65},
    {'name': 'Alan', 'score': 95},
    {'name': 'Jane', 'score': 85},
    {'name': 'Susan', 'score': 75},
    {'name': 'Chris', 'score': 45}
]

We convert the dictionary into DataFrame. Pandas will take care of the column name automatically.

In [8]:

df = pd.DataFrame(students)
df

Out[8]:

	name	score
0	Thomas	65
1	Alan	95
2	Jane	85
3	Susan	75
4	Chris	45

Same as the example above, we apply the pass_or_fail function to create a new column with Boolean.

In [9]:

df['Is Passed'] = df['score'].apply(pass_or_fail)
df

Out[9]:

	name	score	Is Passed
0	Thomas	65	True
1	Alan	95	True
2	Jane	85	True
3	Susan	75	True
4	Chris	45	False

We can generate a new DataFrame with only the passed records.

In [10]:

df_passed = df[df["Is Passed"]]
df_passed

Out[10]:

	name	score	Is Passed
0	Thomas	65	True
1	Alan	95	True
2	Jane	85	True
3	Susan	75	True

We can save the processed tabular data into Excel by using to_excel.

In [11]:

df_passed.to_excel("Students Passed.xlsx")

Accessing the column¶

We can get the name of students who passed.

In [12]:

df[df["Is Passed"]]["name"]

Out[12]:

0    Thomas
1      Alan
2      Jane
3     Susan
Name: name, dtype: object

An example of outputing the result.

In [13]:

count_of_passed = len(df[df["Is Passed"]])
result = f"We have total {len(df)} students, {count_of_passed} of them passed."
print(result)

We have total 5 students, 4 of them passed.

In [14]:

names = ', '.join(df[df["Is Passed"]]["name"])
names = f"They are {names}."

print(names)

They are Thomas, Alan, Jane, Susan.

Reading Excel¶

In [15]:

df = pd.read_excel('visitors.xlsx')
df

Out[15]:

	Time	Visitors
0	1	2
1	2	0
2	3	1
3	4	3
4	5	0
5	6	8
6	7	32
7	8	48
8	9	21
9	10	10
10	11	9
11	12	47
12	13	42
13	14	15
14	15	11
15	16	12
16	17	19
17	18	55
18	19	57
19	20	55
20	21	32
21	22	24
22	23	11
23	24	4

In [16]:

df.head()

Out[16]:

	Time	Visitors
0	1	2
1	2	0
2	3	1
3	4	3
4	5	0

In [17]:

df.tail()

Out[17]:

	Time	Visitors
19	20	55
20	21	32
21	22	24
22	23	11
23	24	4

In [18]:

df['Visitors']

Out[18]:

0      2
1      0
2      1
3      3
4      0
5      8
6     32
7     48
8     21
9     10
10     9
11    47
12    42
13    15
14    11
15    12
16    19
17    55
18    57
19    55
20    32
21    24
22    11
23     4
Name: Visitors, dtype: int64

Which hours do we have more than 30 visitors?

In [19]:

df[ df['Visitors'] > 30 ]

Out[19]:

	Time	Visitors
6	7	32
7	8	48
11	12	47
12	13	42
17	18	55
18	19	57
19	20	55
20	21	32

We can create new Boolean column with condition.

In [20]:

df['Good Timing'] = df['Visitors']>30

In [21]:

df

Out[21]:

	Time	Visitors	Good Timing
0	1	2	False
1	2	0	False
2	3	1	False
3	4	3	False
4	5	0	False
5	6	8	False
6	7	32	True
7	8	48	True
8	9	21	False
9	10	10	False
10	11	9	False
11	12	47	True
12	13	42	True
13	14	15	False
14	15	11	False
15	16	12	False
16	17	19	False
17	18	55	True
18	19	57	True
19	20	55	True
20	21	32	True
21	22	24	False
22	23	11	False
23	24	4	False

In [22]:

df[df['Good Timing']]

Out[22]:

	Time	Visitors	Good Timing
6	7	32	True
7	8	48	True
11	12	47	True
12	13	42	True
17	18	55	True
18	19	57	True
19	20	55	True
20	21	32	True

Reading CSV¶

In [23]:

df = pd.read_csv('visitors.csv', delimiter=',', names=('date','visitors'))

df.head()

Out[23]:

	date	visitors
0	2018-12-18	22
1	2018-12-17	0
2	2018-12-16	4
3	2018-12-15	218
4	2018-12-14	11

In [24]:

df['date'].head()

Out[24]:

0    2018-12-18
1    2018-12-17
2    2018-12-16
3    2018-12-15
4    2018-12-14
Name: date, dtype: object

Fetching CSV from Wikipedia¶

We can convert the data table into CSV by using https://wikitable2csv.ggor.de

For example, we can download the All-time Olympic Games medal table from Wikipedia:

https://en.wikipedia.org/wiki/All-time_Olympic_Games_medal_table

We can load the converted CSV directly into pandas and process the data set. Alternatively, we can clean up the CSV a little bit before loading it into pandas.

For instance, we can give each column a readable name in header. We can also convert thousands with comma too.

Lab 4: All-time Olympic Games medal table¶

We have prepared a table-1-olympics.csv file with basic format processed.

In [25]:

df = pd.read_csv('table-1-olympics.csv')
df = df[:-1] # Remove the last row of data
df.head()

Out[25]:

	Team	Summer Games	Gold	Silver	Bronze	Total	Winter Games	Combined Participated	Combined Gold	Combined Silver	Combined Bronze	Combined Total
0	Afghanistan (AFG)	14	0	0	2	2	0	14	0	0	2	2
1	Algeria (ALG)	13	5	4	8	17	3	16	5	4	8	17
2	Argentina (ARG)	24	21	25	28	74	19	43	21	25	28	74
3	Armenia (ARM)	6	2	6	6	14	7	13	2	6	6	14
4	Australasia (ANZ) [ANZ]	2	3	4	5	12	0	2	3	4	5	12

In [26]:

df.shape

Out[26]:

(152, 16)

By checking the dtype, we find that the data in "O" instead of "Int". "O" stands for Object because it is treated as String.

In [27]:

df['Total'].dtype

Out[27]:

dtype('O')

We can convert the dtype by using astype.

In [28]:

df['Total'] = df['Total'].astype(int)

In [29]:

df['Total'].dtype

Out[29]:

dtype('int64')

Now the team name is not quite readable. There are extra symbols and footnote indicators.

We can clean up Team name and replace space into underscore:

In [30]:

df["Team"] = df["Team"].apply(lambda x: x.split("(")[0].strip().replace(" ","_"))

In [31]:

df.head()

Out[31]:

	Team	Summer Games	Gold	Silver	Bronze	Total	Winter Games	Combined Participated	Combined Gold	Combined Silver	Combined Bronze	Combined Total
0	Afghanistan	14	0	0	2	2	0	14	0	0	2	2
1	Algeria	13	5	4	8	17	3	16	5	4	8	17
2	Argentina	24	21	25	28	74	19	43	21	25	28	74
3	Armenia	6	2	6	6	14	7	13	2	6	6	14
4	Australasia	2	3	4	5	12	0	2	3	4	5	12

Set the Team name as dataframe index:

In [32]:

df.set_index("Team", inplace=True)

Make sure all numbers in the dataframe are numeric:

In [33]:

df = df.astype(int)

We can inspect the column names by using df.columns

In [ ]:

Expected result
Index(['Summer Games', 'Gold', 'Silver', 'Bronze', 'Total', 'Winter Games',

   'Gold.1', 'Silver.1', 'Bronze.1', 'Total.1', 'Combined Participated',
   'Combined Gold', 'Combined Silver', 'Combined Bronze',
   'Combined Total'],
  dtype='object')|

We can access a particular row of data by using iloc

In [34]:

df.iloc[16]

Out[34]:

Summer Games              22
Gold                      30
Silver                    36
Bronze                    63
Total                    129
Winter Games               8
Gold.1                     0
Silver.1                   0
Bronze.1                   0
Total.1                    0
Combined Participated     30
Combined Gold             30
Combined Silver           36
Combined Bronze           63
Combined Total           129
Name: Brazil, dtype: int64

We can also access a particular row of data by using the index label df.loc["Brazil"].

What is the Olymnpic Medal data from China?

In [ ]:

Expected result
Summer Games 10

Gold 224 Silver 167 Bronze 155 Total 546 Winter Games 11 Gold.1 13 Silver.1 28 Bronze.1 21 Total.1 62 Combined Participated 21 Combined Gold 237 Combined Silver 195 Combined Bronze 176 Combined Total 608 Name: China, dtype: int32|

How many gold medal China won in Summer Olymnpics?

In [ ]:

Expected result
224

We can get the rank of gold medal list by using df.sort_values.

In [35]:

df.sort_values(by="Gold", ascending=False)[:5]

Out[35]:

	Summer Games	Gold	Silver	Bronze	Total	Winter Games	Gold.1	Silver.1	Bronze.1	Total.1	Combined Participated	Combined Gold	Combined Silver	Combined Bronze	Combined Total
Team
United_States	27	1022	795	705	2522	23	105	112	88	305	50	1127	907	793	2827
Soviet_Union	9	395	319	296	1010	9	78	57	59	194	18	473	376	355	1204
Great_Britain	28	263	295	293	851	23	11	4	17	32	51	274	299	310	883
China	10	224	167	155	546	11	13	28	21	62	21	237	195	176	608
France	28	212	241	263	716	23	36	35	53	124	51	248	276	316	840

How many teams earn gold medals in both summer Olympics and winder Olympics?

In [36]:

len(df[(df["Gold"] > 0) & (df["Gold.1"] > 0)])

Out[36]:

Which teams are they?

In [37]:

df[(df["Gold"] > 0) & (df["Gold.1"] > 0)][['Gold','Gold.1']]

Out[37]:

	Gold	Gold.1
Team
Australia	147	5
Austria	18	64
Belarus	12	8
Belgium	40	1
Bulgaria	51	1
Canada	64	73
China	224	13
Croatia	11	4
Czech_Republic	15	9
Czechoslovakia	49	2
Estonia	9	4
Finland	101	43
France	212	36
Germany	191	92
United_Team_of_Germany	28	8
East_Germany	153	39
West_Germany	56	11
Great_Britain	263	11
Hungary	175	1
Italy	206	40
Japan	142	14
Kazakhstan	15	1
South_Korea	90	31
Latvia	3	1
Netherlands	85	45
Norway	56	132
Poland	68	7
Russia	148	47
Soviet_Union	395	78
Slovakia	9	3
Slovenia	5	2
Spain	45	1
Sweden	145	57
Switzerland	50	56
Ukraine	35	3
United_States	1022	105
Uzbekistan	8	1
Unified_Team	45	9

How about the team that wins gold medals in winter Olympics but not summer Olympics?

In [ ]:

Expected result
['Liechtenstein', 'Olympic_Athletes_from_Russia']

Example: Reading DSEC Visitors.xlsx¶

In [3]:

df = pd.read_excel("dsec_visitors.xlsx", skiprows=4, skipfooter=4)
df.columns

df = df.rename(columns={
    "Unnamed: 0": "時期",
    "按證件簽發地統計之旅客 b": "按證件簽發地統計之旅客",
    "按證件簽發地統計之留宿旅客 b": "按證件簽發地統計之留宿旅客",
    "按證件簽發地統計之不過夜旅客 b": "按證件簽發地統計之不過夜旅客"
})
df = df.drop(columns=["Unnamed: 2", "Unnamed: 4", "Unnamed: 6"])
df = df[1:]
df.head()

Out[3]:

	時期	按證件簽發地統計之旅客	按證件簽發地統計之留宿旅客	按證件簽發地統計之不過夜旅客
1	2011年1月	1.1	2.2	0.2
2	2011年2月	1	2	0.2
3	2011年3月	1	2.1	0.2
4	2011年4月	1	2	0.2
5	2011年5月	1.1	2.1	0.2

In [4]:

def year_from_dsec_date_period(period):
    """Convert 2011年5月 into 2011"""
    return int(period.split("年")[0])
def month_from_dsec_date_period(period):
    """Convert 2011年5月 into 2011"""
    return int(period.split("年")[1].replace("月",""))

In [5]:

df["Year"] = df["時期"].apply(year_from_dsec_date_period)
df["Month"] = df["時期"].apply(month_from_dsec_date_period)
df

Out[5]:

	時期	按證件簽發地統計之旅客	按證件簽發地統計之留宿旅客	按證件簽發地統計之不過夜旅客	Year	Month
1	2011年1月	1.1	2.2	0.2	2011	1
2	2011年2月	1	2	0.2	2011	2
3	2011年3月	1	2.1	0.2	2011	3
4	2011年4月	1	2	0.2	2011	4
5	2011年5月	1.1	2.1	0.2	2011	5
...	...	...	...	...	...	...
115	2020年7月	1.3	4.7	0.1	2020	7
116	2020年8月	0.9	3.3	0.1	2020	8
117	2020年9月	1	2.7	0.1	2020	9
118	2020年10月	1.4	2.7	0.1	2020	10
119	2020年11月	1.7	3.1	0.1	2020	11

119 rows × 6 columns

Summary¶

In this lesson, we learn to process tabular data by using Pandas.

	Time	Visitors
0	1	2
1	2	0
2	3	1
3	4	3
4	5	0
5	6	8
6	7	32
7	8	48
8	9	21
9	10	10
10	11	9
11	12	47
12	13	42
13	14	15
14	15	11
15	16	12
16	17	19
17	18	55
18	19	57
19	20	55
20	21	32
21	22	24
22	23	11
23	24	4

	Time	Visitors
0	1	2
1	2	0
2	3	1
3	4	3
4	5	0
5	6	8
6	7	32
7	8	48
8	9	21
9	10	10
10	11	9
11	12	47
12	13	42
13	14	15
14	15	11
15	16	12
16	17	19
17	18	55
18	19	57
19	20	55
20	21	32
21	22	24
22	23	11
23	24	4

	Time	Visitors
0	1	2
1	2	0
2	3	1
3	4	3
4	5	0
5	6	8
6	7	32
7	8	48
8	9	21
9	10	10
10	11	9
11	12	47
12	13	42
13	14	15
14	15	11
15	16	12
16	17	19
17	18	55
18	19	57
19	20	55
20	21	32
21	22	24
22	23	11
23	24	4