Pandas Merge and String Methods¶

Lecture Notes and in-class exercises¶

▶️ First, run the code cell below to import unittest, a module used for 🧭 Check Your Work sections and the autograder.

In [1]:

import unittest
tc = unittest.TestCase()

👇 Tasks¶

✔️ Import the following Python packages.
1. pandas: Use alias pd.
2. numpy: Use alias np.

In [2]:

### BEGIN SOLUTION
import pandas as pd
import numpy as np
### END SOLUTION

🧭 Check Your Work¶

Once you're done, run the code cell below to test correctness.
✔️ If the code cell runs without an error, you're good to move on.
❌ If the code cell throws an error, go back and fix incorrect parts.

In [3]:

tc.assertTrue('pd' in globals(), 'Check whether you have correctly imported pandas with an alias.')
tc.assertTrue('np' in globals(), 'Check whether you have correctly imported NumPy with an alias.')

📌 Load employees and work laptops data¶

For the first part, we're going to work with two small DataFrames to see how merging combines them.

▶️ Run the code cell below to create df_employees and df_laptops.

In [4]:

df_employees = pd.DataFrame({
    'emp_id': [1, 2, 3],
    'name': ['Jasper', 'Gary', 'Sally'],
    'laptop_id': ['A', 'B', np.nan]
})

df_laptops = pd.DataFrame({
    'laptop_id': ['A', 'B', 'C', 'D'],
    'model': ['Red Touchbook', 'BlueGo', 'Eco Green', 'Hackbook Pro']
})

# Used for 🧭 Check Your Work sections
df_employees_check = df_employees.copy()
df_laptops_check = df_laptops.copy()
df_join_check = df_employees_check.merge(df_laptops_check, on='laptop_id', how='outer')

▶️ Run the code cell below to display df_employees.

In [5]:

df_employees

Out[5]:

	emp_id	name	laptop_id
0	1	Jasper	A
1	2	Gary	B
2	3	Sally	NaN

▶️ Run the code cell below to display df_laptops.

In [6]:

df_laptops

Out[6]:

	laptop_id	model
0	A	Red Touchbook
1	B	BlueGo
2	C	Eco Green
3	D	Hackbook Pro
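
Before starting the exercises, it may help to see how the how argument changes which rows survive a merge. The sketch below uses two hypothetical DataFrames (df_students and df_lockers are illustrative names, not part of this notebook) and only counts the rows produced by each merge type.

```python
import pandas as pd

# Hypothetical toy data, not part of the exercises
df_students = pd.DataFrame({
    'student': ['Ann', 'Bo', 'Cy'],
    'locker_id': [10, 20, 30]
})
df_lockers = pd.DataFrame({
    'locker_id': [20, 30, 40, 50],
    'floor': [1, 2, 2, 3]
})

# Merge with each strategy and record how many rows come back
row_counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(left=df_students, right=df_lockers, on='locker_id', how=how)
    row_counts[how] = len(merged)

print(row_counts)
# {'inner': 2, 'left': 3, 'right': 4, 'outer': 5}
```

Only lockers 20 and 30 appear in both tables, so the inner merge keeps 2 rows. The left and right merges keep every row of their respective side, and the outer merge keeps the union of all five keys.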

🎯 Exercise 1: Inner merge¶

👇 Tasks¶

✔️ Find employees who have been assigned a work laptop.
✔️ In other words, merge df_employees and df_laptops using an inner merge.
✔️ Store the merged result to a new variable named df_inner.

🚀 Sample Code¶

df_inner = pd.merge(
    left=...,
    right=...,
    on='...',
    how='...'
)

🔑 Expected Output¶

	emp_id	name	laptop_id	model
0	1	Jasper	A	Red Touchbook
1	2	Gary	B	BlueGo

In [7]:

### BEGIN SOLUTION
df_inner = pd.merge(
    left=df_employees,
    right=df_laptops,
    on='laptop_id',
    how='inner'
)
### END SOLUTION

display(df_inner)

	emp_id	name	laptop_id	model
0	1	Jasper	A	Red Touchbook
1	2	Gary	B	BlueGo
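
The same kind of merge can also be written with the DataFrame.merge method, and the key columns do not need to share a name. The sketch below assumes two hypothetical DataFrames (df_orders and df_customers) whose key columns are named differently, so it uses left_on and right_on instead of on.

```python
import pandas as pd

# Hypothetical toy data, not part of the exercises
df_orders = pd.DataFrame({
    'order_id': [1, 2, 3],
    'cust': ['x', 'y', 'z']
})
df_customers = pd.DataFrame({
    'customer_code': ['x', 'y'],
    'city': ['Urbana', 'Champaign']
})

# Function form
merged_fn = pd.merge(
    left=df_orders,
    right=df_customers,
    left_on='cust',
    right_on='customer_code',
    how='inner'
)

# Method form: the calling DataFrame plays the role of `left`
merged_method = df_orders.merge(
    df_customers,
    left_on='cust',
    right_on='customer_code',
    how='inner'
)

print(merged_fn.equals(merged_method))  # True
```

Note that when left_on and right_on are used, both key columns ('cust' and 'customer_code') are kept in the result.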

🧭 Check Your Work¶

Once you're done, run the code cell below to test correctness.
✔️ If the code cell runs without an error, you're good to move on.
❌ If the code cell throws an error, go back and fix incorrect parts.

In [8]:

# DO NOT CHANGE THE CODE IN THIS CELL
df_jc = df_join_check
df_sol = df_jc[df_jc['emp_id'].notna() & df_jc['laptop_id'].notna()].reset_index(drop=True)

pd.testing.assert_frame_equal(
    df_inner.reset_index(drop=True),
    df_sol.reset_index(drop=True),
    check_dtype=False
)
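
The check cells in this notebook rely on pd.testing.assert_frame_equal, which raises an AssertionError when two DataFrames differ and does nothing when they match. The sketch below, using two hypothetical frames (df_int and df_float), illustrates one likely reason check_dtype=False is passed: a merge can turn an integer column into a float column, and the checks should still accept the same values.

```python
import pandas as pd

# Hypothetical frames holding the same values with different dtypes
df_int = pd.DataFrame({'n': [1, 2]})        # int64 column
df_float = pd.DataFrame({'n': [1.0, 2.0]})  # float64 column

# Passes silently: values match and dtypes are ignored
pd.testing.assert_frame_equal(df_int, df_float, check_dtype=False)

# With the default strict dtype check, the same comparison fails
try:
    pd.testing.assert_frame_equal(df_int, df_float)
    strict_check_passed = True
except AssertionError:
    strict_check_passed = False

print(strict_check_passed)  # False
```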

🎯 Exercise 2: Left merge¶

👇 Tasks¶

✔️ List all employees and their assigned work laptops - if they are assigned one.
✔️ If an employee has not been assigned a work laptop, leave 'laptop_id' and 'model' as np.nan (or any other null-like value).
✔️ In other words, merge df_employees and df_laptops using a left merge.
✔️ Store the merged result to a new variable named df_left.

🔑 Expected Output¶

	emp_id	name	laptop_id	model
0	1	Jasper	A	Red Touchbook
1	2	Gary	B	BlueGo
2	3	Sally	NaN	NaN

In [9]:

### BEGIN SOLUTION
df_left = pd.merge(
    left=df_employees,
    right=df_laptops,
    on='laptop_id',
    how='left'
)
### END SOLUTION

display(df_left)

	emp_id	name	laptop_id	model
0	1	Jasper	A	Red Touchbook
1	2	Gary	B	BlueGo
2	3	Sally	NaN	NaN
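
A left merge is also a convenient way to find rows on the left side that have no match: after merging, the columns that came from the right side are missing for those rows. The sketch below shows this with hypothetical data (df_people and df_badges are illustrative names only).

```python
import pandas as pd

# Hypothetical toy data, not part of the exercises
df_people = pd.DataFrame({
    'person': ['Ann', 'Bo', 'Cy'],
    'badge_id': ['b1', 'b9', 'b3']
})
df_badges = pd.DataFrame({
    'badge_id': ['b1', 'b2'],
    'color': ['red', 'blue']
})

# Keep every person; 'color' is missing where no badge matched
df_all = df_people.merge(df_badges, on='badge_id', how='left')

# Rows whose right-side column is missing had no match
unmatched = df_all[df_all['color'].isna()]
unmatched_names = unmatched['person'].tolist()

print(unmatched_names)  # ['Bo', 'Cy']
```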

🧭 Check Your Work¶

Once you're done, run the code cell below to test correctness.
✔️ If the code cell runs without an error, you're good to move on.
❌ If the code cell throws an error, go back and fix incorrect parts.

In [10]:

# DO NOT CHANGE THE CODE IN THIS CELL
df_jc = df_join_check
df_sol = df_jc[df_jc['emp_id'].notna()].reset_index(drop=True)

pd.testing.assert_frame_equal(
    df_left.reset_index(drop=True),
    df_sol.reset_index(drop=True),
    check_dtype=False
)

🎯 Exercise 3: Right merge¶

👇 Tasks¶

✔️ List all laptops and their associated owners - if they are assigned one.
✔️ If a laptop has not been assigned to an employee, leave 'emp_id' and 'name' as np.nan (or any other null-like value).
✔️ In other words, merge df_employees and df_laptops using a right merge.
✔️ Store the merged result to a new variable named df_right.

🔑 Expected Output¶

	emp_id	name	laptop_id	model
0	1	Jasper	A	Red Touchbook
1	2	Gary	B	BlueGo
2	NaN	NaN	C	Eco Green
3	NaN	NaN	D	Hackbook Pro

In [11]:

### BEGIN SOLUTION
df_right = pd.merge(
    left=df_employees,
    right=df_laptops,
    on='laptop_id',
    how='right'
)
### END SOLUTION

display(df_right)

	emp_id	name	laptop_id	model
0	1.0	Jasper	A	Red Touchbook
1	2.0	Gary	B	BlueGo
2	NaN	NaN	C	Eco Green
3	NaN	NaN	D	Hackbook Pro
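
In the output above, emp_id is displayed as 1.0 and 2.0 rather than 1 and 2. A regular integer column cannot hold NaN, so pandas converts it to float once missing values appear. The sketch below reproduces this with hypothetical data and shows one optional way to get whole numbers back, using the nullable 'Int64' dtype; the exercises do not require this step.

```python
import pandas as pd

# Hypothetical toy data, not part of the exercises
df_left_side = pd.DataFrame({'key': ['a', 'b'], 'qty': [1, 2]})
df_right_side = pd.DataFrame({'key': ['a', 'b', 'c'], 'label': ['x', 'y', 'z']})

# Key 'c' has no match on the left, so 'qty' gains a missing value
merged = df_left_side.merge(df_right_side, on='key', how='right')
dtype_after_merge = str(merged['qty'].dtype)

# Optional: the nullable integer dtype keeps whole numbers alongside missing values
merged['qty'] = merged['qty'].astype('Int64')
dtype_after_cast = str(merged['qty'].dtype)

print(dtype_after_merge, dtype_after_cast)  # float64 Int64
```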

🧭 Check Your Work¶

Once you're done, run the code cell below to test correctness.
✔️ If the code cell runs without an error, you're good to move on.
❌ If the code cell throws an error, go back and fix incorrect parts.

In [12]:

# DO NOT CHANGE THE CODE IN THIS CELL
df_jc = df_join_check
df_sol = df_jc[df_jc['laptop_id'].notna()].reset_index(drop=True)

pd.testing.assert_frame_equal(
    df_right.reset_index(drop=True),
    df_sol.reset_index(drop=True),
    check_dtype=False
)

🎯 Exercise 4: Full outer merge¶

👇 Tasks¶

✔️ List all employees and all work laptops - regardless of whether they are associated with one another.
✔️ If an employee has not been assigned a work laptop, leave 'laptop_id' and 'model' as np.nan (or any other null-like value).
✔️ If a laptop has not been assigned to an employee, leave 'emp_id' and 'name' as np.nan (or any other null-like value).
✔️ In other words, merge df_employees and df_laptops using an outer merge.
✔️ Store the merged result to a new variable named df_outer.

🔑 Expected Output¶

	emp_id	name	laptop_id	model
0	1	Jasper	A	Red Touchbook
1	2	Gary	B	BlueGo
2	3	Sally	NaN	NaN
3	NaN	NaN	C	Eco Green
4	NaN	NaN	D	Hackbook Pro

In [13]:

### BEGIN SOLUTION
df_outer = pd.merge(
    left=df_employees,
    right=df_laptops,
    on='laptop_id',
    how='outer'
)
### END SOLUTION

display(df_outer)

	emp_id	name	laptop_id	model
0	1.0	Jasper	A	Red Touchbook
1	2.0	Gary	B	BlueGo
2	3.0	Sally	NaN	NaN
3	NaN	NaN	C	Eco Green
4	NaN	NaN	D	Hackbook Pro
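
After an outer merge it is often useful to know which side each row came from. pd.merge accepts indicator=True, which adds a '_merge' column with the values 'left_only', 'right_only', or 'both'. The sketch below uses hypothetical frames (df_a and df_b) and avoids relying on row order, since the order of rows in an outer merge can differ between pandas versions.

```python
import pandas as pd

# Hypothetical toy data, not part of the exercises
df_a = pd.DataFrame({'key': [1, 2, 3], 'a_val': ['p', 'q', 'r']})
df_b = pd.DataFrame({'key': [2, 3, 4, 5], 'b_val': ['w', 'x', 'y', 'z']})

# indicator=True adds a '_merge' column describing each row's origin
merged = pd.merge(left=df_a, right=df_b, on='key', how='outer', indicator=True)
source_counts = merged['_merge'].astype(str).value_counts().to_dict()

# Keys that exist only in the left frame
only_in_a = sorted(merged.loc[merged['_merge'] == 'left_only', 'key'].tolist())

print(source_counts)
print(only_in_a)  # [1]
```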

🧭 Check Your Work¶

Once you're done, run the code cell below to test correctness.
✔️ If the code cell runs without an error, you're good to move on.
❌ If the code cell throws an error, go back and fix incorrect parts.

In [14]:

# DO NOT CHANGE THE CODE IN THIS CELL
df_jc = df_join_check
df_sol = df_jc.reset_index(drop=True)

pd.testing.assert_frame_equal(
    df_outer.reset_index(drop=True),
    df_sol.reset_index(drop=True),
    check_dtype=False
)

Pandas string methods¶

📌 Load textual data¶

▶️ Run the code cell below to create df_libraries.

In [15]:

df_libraries = pd.DataFrame({
    'name': ['ACES (Funk)', 'Grainger', 'Law', 'Main'],
    'amenities': [
        'Rooms,Scanner,Printer',
        'Rooms,Scanner,Printer,Cafe',
        'Cafe',
        'Rooms,Scanner,Printer,Cafe'
    ],
})

# Used for 🧭 Check Your Work sections
df_libraries_check = df_libraries.copy()

▶️ Run the code cell below to display df_libraries.

In [16]:

df_libraries

Out[16]:

	name	amenities
0	ACES (Funk)	Rooms,Scanner,Printer
1	Grainger	Rooms,Scanner,Printer,Cafe
2	Law	Cafe
3	Main	Rooms,Scanner,Printer,Cafe

🎯 Exercise 5: Length of library names¶

👇 Tasks¶

✔️ Find the number of characters (i.e., string length) in each library name.
✔️ Store the result to a new column named 'name_length' in df_libraries.

🔑 Expected Output¶

	name	amenities	name_length
0	ACES (Funk)	Rooms,Scanner,Printer	11
1	Grainger	Rooms,Scanner,Printer,Cafe	8
2	Law	Cafe	3
3	Main	Rooms,Scanner,Printer,Cafe	4

In [17]:

### BEGIN SOLUTION
df_libraries['name_length'] = df_libraries['name'].str.len()
### END SOLUTION

display(df_libraries)

	name	amenities	name_length
0	ACES (Funk)	Rooms,Scanner,Printer	11
1	Grainger	Rooms,Scanner,Printer,Cafe	8
2	Law	Cafe	3
3	Main	Rooms,Scanner,Printer,Cafe	4
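
A common mix-up is calling Python's built-in len() on a column, which returns the number of rows rather than the length of each string. The sketch below contrasts the two on a hypothetical Series (names) that also contains a missing value, to show that .str.len() leaves missing entries missing instead of raising an error.

```python
import pandas as pd

# Hypothetical data with one missing name
names = pd.Series(['Law', None, 'Main'])

row_count = len(names)         # number of rows in the Series
char_counts = names.str.len()  # characters per entry; missing stays missing

print(row_count)  # 3
print(char_counts.isna().tolist())  # [False, True, False]
```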

🧭 Check Your Work¶

Once you're done, run the code cell below to test correctness.
✔️ If the code cell runs without an error, you're good to move on.
❌ If the code cell throws an error, go back and fix incorrect parts.

In [18]:

# DO NOT CHANGE THE CODE IN THIS CELL
df_lc = df_libraries_check
df_lc['name_length'] = df_lc['name'].str.len()

pd.testing.assert_frame_equal(
    df_libraries.reset_index(drop=True),
    df_lc.reset_index(drop=True),
    check_dtype=False
)

🎯 Exercise 6: Uppercase library names¶

👇 Tasks¶

✔️ Convert the library names to uppercase.
✔️ Directly update the 'name' column in df_libraries.

🔑 Expected Output¶

	name	amenities	name_length
0	ACES (FUNK)	Rooms,Scanner,Printer	11
1	GRAINGER	Rooms,Scanner,Printer,Cafe	8
2	LAW	Cafe	3
3	MAIN	Rooms,Scanner,Printer,Cafe	4

In [19]:

### BEGIN SOLUTION
df_libraries['name'] = df_libraries['name'].str.upper()
### END SOLUTION

display(df_libraries)

	name	amenities	name_length
0	ACES (FUNK)	Rooms,Scanner,Printer	11
1	GRAINGER	Rooms,Scanner,Printer,Cafe	8
2	LAW	Cafe	3
3	MAIN	Rooms,Scanner,Printer,Cafe	4
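
String methods under .str return a new Series and leave the original untouched, which is why the solution assigns the result back to the 'name' column. They can also be chained. The sketch below uses a hypothetical Series (raw) with stray whitespace to show .str.strip() followed by .str.upper().

```python
import pandas as pd

# Hypothetical messy names, not part of the exercises
raw = pd.Series(['  main ', 'Law', 'grainger'])

# Chain string methods: trim whitespace first, then convert to uppercase
cleaned = raw.str.strip().str.upper()

print(cleaned.tolist())  # ['MAIN', 'LAW', 'GRAINGER']
print(raw.tolist())      # original is unchanged: ['  main ', 'Law', 'grainger']
```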

🧭 Check Your Work¶

Once you're done, run the code cell below to test correctness.
✔️ If the code cell runs without an error, you're good to move on.
❌ If the code cell throws an error, go back and fix incorrect parts.

In [20]:

# DO NOT CHANGE THE CODE IN THIS CELL
df_lc = df_libraries_check
df_lc['name_length'] = df_lc['name'].str.len()
df_lc['name'] = df_lc['name'].str.upper()

pd.testing.assert_frame_equal(
    df_libraries.reset_index(drop=True),
    df_lc.reset_index(drop=True),
    check_dtype=False
)

🎯 Exercise 7: Split amenities into lists¶

👇 Tasks¶

✔️ Split the items in the 'amenities' column using the comma (,) as a delimiter.
✔️ Store the split result to a new column named 'amenities_list' in df_libraries.

🔑 Expected Output¶

	name	amenities	name_length	amenities_list
0	ACES (FUNK)	Rooms,Scanner,Printer	11	['Rooms', 'Scanner', 'Printer']
1	GRAINGER	Rooms,Scanner,Printer,Cafe	8	['Rooms', 'Scanner', 'Printer', 'Cafe']
2	LAW	Cafe	3	['Cafe']
3	MAIN	Rooms,Scanner,Printer,Cafe	4	['Rooms', 'Scanner', 'Printer', 'Cafe']

In [21]:

### BEGIN SOLUTION
df_libraries['amenities_list'] = df_libraries['amenities'].str.split(',')
### END SOLUTION

display(df_libraries)

	name	amenities	name_length	amenities_list
0	ACES (FUNK)	Rooms,Scanner,Printer	11	[Rooms, Scanner, Printer]
1	GRAINGER	Rooms,Scanner,Printer,Cafe	8	[Rooms, Scanner, Printer, Cafe]
2	LAW	Cafe	3	[Cafe]
3	MAIN	Rooms,Scanner,Printer,Cafe	4	[Rooms, Scanner, Printer, Cafe]
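
Once a column holds lists, two follow-up steps are often handy: counting the items in each list and turning each item into its own row. The sketch below is one possible approach on a hypothetical DataFrame (df_places); it uses .str.len(), which also works on lists, and DataFrame.explode.

```python
import pandas as pd

# Hypothetical toy data, not part of the exercises
df_places = pd.DataFrame({
    'place': ['North', 'South'],
    'features': ['Wifi,Cafe', 'Wifi']
})

# Split the comma-separated string into a list per row
df_places['features_list'] = df_places['features'].str.split(',')

# .str.len() on a column of lists counts the items in each list
n_features = df_places['features_list'].str.len()

# explode() creates one row per list item, repeating the other columns
df_long = df_places.explode('features_list')

print(n_features.tolist())                # [2, 1]
print(df_long['features_list'].tolist())  # ['Wifi', 'Cafe', 'Wifi']
```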

🧭 Check Your Work¶

Once you're done, run the code cell below to test correctness.
✔️ If the code cell runs without an error, you're good to move on.
❌ If the code cell throws an error, go back and fix incorrect parts.

In [22]:

# DO NOT CHANGE THE CODE IN THIS CELL
df_lc = df_libraries_check
df_lc['name_length'] = df_lc['name'].str.len()
df_lc['name'] = df_lc['name'].str.upper()
df_lc['amenities_list'] = df_lc['amenities'].str.split(',')

pd.testing.assert_frame_equal(
    df_libraries.reset_index(drop=True),
    df_lc.reset_index(drop=True),
    check_dtype=False
)