Introduction to Jupyter Notebooks and Pandas¶

What is a Jupyter Notebook?¶

A Jupyter notebook is a document that can contain live code with its results, visualizations, and rich text. It is widely used in data science and analytics. The cell below is a code cell: it contains a block of executable code.

Run the code below by clicking on the cell and then clicking the "Run" button (▶) in the toolbar at the top.

In [1]:

print(10 + 20)

▶️ Run the code cell below to import unittest, a module used for 🧭 Check Your Work sections and the autograder.

In [2]:

import unittest
tc = unittest.TestCase()

Types of cells¶

There are three different types of cells.

Code cell
Markdown cell
Raw cell

We will most frequently use the first two types of cells.

🎯 Challenge 1: Find the sum of a list¶

👇 Tasks¶

✔️ Complete the code cell below to find the sum of all values in my_list.
✔️ Store the result in a new variable named result.

In [3]:

my_list = [11, 20, 52, 91, 90, 75, 74, 20, 21, 10, 14]

### BEGIN SOLUTION
result = 0

for num in my_list:
    result = result + num
### END SOLUTION

print(result)

🧭 Check Your Work¶

Once you're done, run the code cell below to test correctness.
✔️ If the code cell runs without an error, you're good to move on.
❌ If the code cell throws an error, go back and fix any incorrect parts.

In [4]:

import unittest

tc = unittest.TestCase()

tc.assertEqual(result, 478)
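
As an aside, Python's built-in sum() function gives the same total as the loop used in Challenge 1. The sketch below reuses the values of my_list from above; the names loop_total and builtin_total are made up for this illustration.

```python
my_list = [11, 20, 52, 91, 90, 75, 74, 20, 21, 10, 14]

# Add the values one at a time, as in the challenge solution
loop_total = 0
for num in my_list:
    loop_total = loop_total + num

# The built-in sum() does the same work in a single call
builtin_total = sum(my_list)

print(loop_total, builtin_total)
```

Writing the loop by hand is still useful practice, since the same pattern works for calculations that have no built-in shortcut.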

Introduction to Pandas¶

Pandas is a Python library for data manipulation and analysis. Although it is now widely used in data-related programming, it was initially developed for financial analysis at AQR Capital Management.

Note: A library in the context of programming is a collection of functions (and other data) that others have already written for you.

Pandas is popular for many reasons:

🏃🏿‍♀️ It's fast (for most cases where the dataset fits in your computer's memory).
🪒 It supports most of the features required for data manipulation.
💡 Write less code. Get more done.

🎯 Challenge 2: Import packages¶

👇 Tasks¶

✔️ Import the following Python packages.
1. pandas: Use alias pd.
2. numpy: Use alias np.

In [5]:

### BEGIN SOLUTION
import pandas as pd
import numpy as np
### END SOLUTION

🧭 Check Your Work¶

Once you're done, run the code cell below to test correctness.
✔️ If the code cell runs without an error, you're good to move on.
❌ If the code cell throws an error, go back and fix any incorrect parts.

In [6]:

import sys
tc.assertTrue("pd" in globals(), "Check whether you have correctly imported Pandas with an alias.")
tc.assertTrue("np" in globals(), "Check whether you have correctly imported NumPy with an alias.")

It all starts with a `Series`...¶

The basic building block of Pandas is a Series. A Series is like a list, but with many more features.

You can create a Series by passing a list of values to pd.Series().

In [7]:

s = pd.Series([1, 2, 3, np.nan, 5, 6])

s

Out[7]:

0    1.0
1    2.0
2    3.0
3    NaN
4    5.0
5    6.0
dtype: float64

A few things to note here¶

A Series looks similar to a Python list.
The last line of the printed output tells us the data type of values in the Series (dtype: float64).

What the heck is np.nan?
- It is used to indicate a "missing value".
- np.nan is NOT the same as 0.
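
To make the difference between a missing value and 0 concrete, here is a small sketch that reuses the Series s from above. The variable name num_missing is only for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan, 5, 6])

# np.nan is not equal to 0 (it is not even equal to itself)
print(np.nan == 0)
print(np.nan == np.nan)

# isna() marks each missing value with True; sum() counts the True values
num_missing = s.isna().sum()
print(num_missing)

# By default, Series methods such as sum() and mean() skip missing values
print(s.sum())
print(s.mean())
```

Because the missing value is skipped rather than treated as 0, the mean is computed over the five values that are present.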

Differences between a list and a Series¶

In [8]:

my_list = [1, 2, 3, 4]

print(type(my_list))
display(my_list * 2)

<class 'list'>

[1, 2, 3, 4, 1, 2, 3, 4]

In [9]:

my_series = pd.Series([1, 2, 3, 4])

print(type(my_series))
display(my_series * 2)

<class 'pandas.core.series.Series'>

0    2
1    4
2    6
3    8
dtype: int64

What happens when you multiply a Python list by the number 2? It repeats the elements.

How about a Series? It multiplies each element by 2!
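
The same contrast can be sketched side by side. To double every element of a plain list you need a loop or a list comprehension, while a Series applies the operation to all elements at once. The names doubled_list, doubled_series, and shifted_series are made up for this example.

```python
import pandas as pd

my_list = [1, 2, 3, 4]
my_series = pd.Series([1, 2, 3, 4])

# With a list, element-wise math needs an explicit loop or comprehension
doubled_list = [value * 2 for value in my_list]

# With a Series, the operation applies to every element at once
doubled_series = my_series * 2
shifted_series = my_series + 10

print(doubled_list)
print(doubled_series.tolist())
print(shifted_series.tolist())
```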

🎯 Challenge 3: Create new `Series`¶

👇 Tasks¶

✔️ Create a new Pandas Series named my_series with the following three values: 10, 20, 30.

🚀 Hint¶

The code below creates a new Pandas Series with the values 1 and 2.

my_new_series = pd.Series([1, 2])

In [10]:

### BEGIN SOLUTION
my_series = pd.Series([10, 20, 30])
### END SOLUTION

my_series

Out[10]:

0    10
1    20
2    30
dtype: int64

🧭 Check Your Work¶

Once you're done, run the code cell below to test correctness.
✔️ If the code cell runs without an error, you're good to move on.
❌ If the code cell throws an error, go back and fix any incorrect parts.

In [11]:

pd.testing.assert_series_equal(my_series, pd.Series([1, 2, 3]) * 10)

Using `Series` methods¶

A Pandas Series is similar to a Python list. However, a Series also provides many methods (functions that belong to the Series) for you to use.

As an example, num_reviews.mean() will return the average number of reviews.

In [12]:

reviews_count = [12715, 2274, 2771, 3952, 528, 2766, 724]
num_reviews = pd.Series(reviews_count)

# YOUR CODE HERE
...

Out[12]:

Ellipsis
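
One possible way to try this out is sketched below, using the same review counts. Besides mean(), it calls a few other common Series methods. This is only an illustration, not necessarily the code written during the lecture.

```python
import pandas as pd

reviews_count = [12715, 2274, 2771, 3952, 528, 2766, 724]
num_reviews = pd.Series(reviews_count)

print(num_reviews.mean())    # average number of reviews
print(num_reviews.sum())     # total number of reviews
print(num_reviews.max())     # largest value
print(num_reviews.min())     # smallest value
print(num_reviews.median())  # middle value when sorted
```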

🎯 Challenge 4: Create a Pandas DataFrame¶

👇 Tasks¶

✔️ You are given two lists, product_names and num_reviews, which contain the names of make-up products and their number of reviews on Sephora.com.
✔️ Using the two lists, create a new Pandas DataFrame named df_top_products that has the following two columns:
1. product_name: Names of the products
2. num_review: Number of reviews
✔️ Note that the column names are singular.

🚀 Hint¶

The code below creates a new Pandas DataFrame from two series.

my_new_dataframe = pd.DataFrame({
    "column_one": my_series1,
    "column_two": my_series2
})

In [13]:

product_names = [
    "Laneige Lip Sleeping Mask",
    "The Ordinary Hyaluronic Acid 2% + B5",
    "Laneige Lip Glowy Balm",
    "Chanel COCO MADEMOISELLE Eau de Parfum"
]

num_reviews = [
    12715,
    2274,
    2766,
    724
]

### BEGIN SOLUTION
df_top_products = pd.DataFrame({
    "product_name": product_names,
    "num_review": num_reviews
})
### END SOLUTION

display(df_top_products)

	product_name	num_review
0	Laneige Lip Sleeping Mask	12715
1	The Ordinary Hyaluronic Acid 2% + B5	2274
2	Laneige Lip Glowy Balm	2766
3	Chanel COCO MADEMOISELLE Eau de Parfum	724

🧭 Check Your Work¶

Once you're done, run the code cell below to test correctness.
✔️ If the code cell runs without an error, you're good to move on.
❌ If the code cell throws an error, go back and fix any incorrect parts.

In [14]:

pd.testing.assert_frame_equal(
    df_top_products.reset_index(drop=True),
    pd.DataFrame({"product_name": {0: "Laneige Lip Sleeping Mask",
        1: "The Ordinary Hyaluronic Acid 2% + B5",
        2: "Laneige Lip Glowy Balm",
        3: "Chanel COCO MADEMOISELLE Eau de Parfum"},
        "num_review": {0: 12715, 1: 2274, 2: 2766, 3: 724}})
)
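
The hint builds a DataFrame from two Series, while the solution passes plain lists. As a rough sketch, both approaches should produce the same DataFrame here. The names df_from_lists and df_from_series are made up for this comparison.

```python
import pandas as pd

product_names = [
    "Laneige Lip Sleeping Mask",
    "The Ordinary Hyaluronic Acid 2% + B5",
    "Laneige Lip Glowy Balm",
    "Chanel COCO MADEMOISELLE Eau de Parfum",
]
num_reviews = [12715, 2274, 2766, 724]

# Build the DataFrame directly from lists
df_from_lists = pd.DataFrame({
    "product_name": product_names,
    "num_review": num_reviews,
})

# Build the DataFrame from Series objects instead
df_from_series = pd.DataFrame({
    "product_name": pd.Series(product_names),
    "num_review": pd.Series(num_reviews),
})

# Raises an AssertionError if the two DataFrames differ
pd.testing.assert_frame_equal(df_from_lists, df_from_series)
print(df_from_lists.shape)
```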

📌 Load data¶

The second part of today's lecture is all about you. 👻 Literally.

▶️ Run the code cell below to create a new DataFrame named df_you.

In [15]:

df_you = pd.read_csv("https://github.com/bdi475/datasets/raw/main/about-you.csv")

# Used to keep a clean copy
df_you_backup = df_you.copy()

# head() displays the first 5 rows of a DataFrame
df_you.head()

Out[15]:

	name	major1	major2	city	country	fav_restaurant	fav_movie	has_iphone
0	Ahana Chakraborty	Statistics	Business & Informatics	Chicago	USA	Poke Lab	Shrek 2	True
1	Andrew Rozmus	Psychology	NaN	Elmhurst	USA	NaN	NaN	True
2	Anusha Adira	Computer Engineering	Business	Cupertino	USA	Bangkok Thai	Three Idiots	True
3	Arthur Pyptyuk	Economics	Business	Hoffman Estates	USA	Sakanaya	Hereditary	True
4	Aryajit Das	Economics	Business & Global Markets plus Society	Streamwood	USA	Dubai Grill	Transformers: Age of Extinction	True

☝️ Hold on. Didn't we always create DataFrames using pd.DataFrame()?

Yes. But we can import existing data as a Pandas DataFrame using pd.read_csv(). There are many other similar import methods. For now, we'll mostly use pd.read_csv().
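
To see what pd.read_csv() does without downloading anything, the sketch below reads a tiny CSV from a string. The CSV content and the name df_example are made up for this illustration; io.StringIO simply makes the string behave like a file.

```python
import io

import pandas as pd

# Hypothetical CSV content, made up for this example
csv_text = """name,city,has_iphone
Alex,Chicago,True
Sam,,False
Jordan,Urbana,True
"""

# read_csv() accepts a file path, a URL, or a file-like object such as StringIO
df_example = pd.read_csv(io.StringIO(csv_text))

print(df_example.shape)
print(df_example.head())
```

The empty city field in the second row is read as a missing value (NaN).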

The table below explains each column in df_you.

Column Name	Description
name	Full name
major1	Major
major2	Second major OR minor (blank if no second major or minor)
city	City the person is from
country	Country the person is from
fav_restaurant	Favorite restaurant (blank if no restaurant was given)
fav_movie	Favorite movie (blank if no movie was given)
has_iphone	Whether the person uses an iPhone

📌 Concise summary of a `DataFrame`¶

👉 A common first step in working with a DataFrame is to use the info() method. info() prints a concise summary of a DataFrame, which includes:

Index data type
Column information: for each column, the following information is displayed:
- Number of non-missing values
- Data type of the column
Memory usage

▶️ Run df_you.info() below to see the info() method in action.

In [16]:

### BEGIN SOLUTION
df_you.info()
### END SOLUTION

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45 entries, 0 to 44
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   name            45 non-null     object
 1   major1          44 non-null     object
 2   major2          36 non-null     object
 3   city            35 non-null     object
 4   country         37 non-null     object
 5   fav_restaurant  33 non-null     object
 6   fav_movie       31 non-null     object
 7   has_iphone      45 non-null     bool  
dtypes: bool(1), object(7)
memory usage: 2.6+ KB

👉 From the result of df_you.info(), we can understand a couple of things:

There are 8 columns.
7 out of 8 columns have the object data type.
- In Pandas, a string data type is shown as object, not str.
  - We will skip the technical discussion for now.
The second line of the output tells us the number of rows (i.e., observations).
Some columns contain one or more missing values.
- Missing values are displayed as NaN.
- To denote a missing value, use NumPy's np.nan (more on this later).
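
The non-missing counts reported by info() can also be retrieved as values. The sketch below uses a small, made-up DataFrame (df_small is a hypothetical name) and compares count(), which counts non-missing values per column, with isna().sum(), which counts missing ones.

```python
import numpy as np
import pandas as pd

# Hypothetical data, made up for this example
df_small = pd.DataFrame({
    "name": ["Alex", "Sam", "Jordan"],
    "city": ["Chicago", np.nan, "Urbana"],
    "fav_movie": [np.nan, np.nan, "Up"],
})

# Prints the summary: index, columns, non-null counts, dtypes, memory usage
df_small.info()

# count() returns the number of non-missing values in each column
non_missing = df_small.count()

# isna().sum() returns the number of missing values in each column
missing = df_small.isna().sum()

print(non_missing)
print(missing)
```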

🎯 Challenge 5: Display first/last/random rows¶

▶️ Run df_you.head() to print the first 5 rows of df_you.

In [17]:

### BEGIN SOLUTION
df_you.head()
### END SOLUTION

Out[17]:

	name	major1	major2	city	country	fav_restaurant	fav_movie	has_iphone
0	Ahana Chakraborty	Statistics	Business & Informatics	Chicago	USA	Poke Lab	Shrek 2	True
1	Andrew Rozmus	Psychology	NaN	Elmhurst	USA	NaN	NaN	True
2	Anusha Adira	Computer Engineering	Business	Cupertino	USA	Bangkok Thai	Three Idiots	True
3	Arthur Pyptyuk	Economics	Business	Hoffman Estates	USA	Sakanaya	Hereditary	True
4	Aryajit Das	Economics	Business & Global Markets plus Society	Streamwood	USA	Dubai Grill	Transformers: Age of Extinction	True

▶️ Run df_you.tail(4) to print the last 4 rows of df_you.

In [18]:

### BEGIN SOLUTION
df_you.tail(4)
### END SOLUTION

Out[18]:

	name	major1	major2	city	country	fav_restaurant	fav_movie	has_iphone
41	Twinkle Yeruva	Computer Science	Business	Schaumburg	USA	Sticky Rice	Maze Runner	True
42	Valentina Flores	Economics	Business & French	Chicago	USA	NaN	Book of Life	True
43	Victoria Hernandez	Industrial Design	Spanish	East Moline	USA	Bangkok Thai	Mamma Mia or Shrek	True
44	Vikas Chavda	Economics	Business	Geneva	USA	Yogi	Kingsman: Secret Service	True

▶️ Run df_you.sample(3) to print 3 randomly sampled rows from df_you.

In [19]:

### BEGIN SOLUTION
df_you.sample(3)
### END SOLUTION

Out[19]:

	name	major1	major2	city	country	fav_restaurant	fav_movie	has_iphone
8	Cole Jordan	Computer Science	Business	NaN	USA	Chipotle	Ratatouille	True
21	Julia Kevin	Bioengineering	Business	Elmhurst	USA	KoFusion	Set it Up	True
38	Spencer Sadler	Computer Science	Business	Chicago	USA	Bangkok Thai	Ratatouille	True

In [20]:

# Autograder
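
Since sample() picks rows at random, each run can return different rows. If you need the same rows every time, sample() accepts a random_state argument. A minimal sketch on a made-up DataFrame (df_numbers is a hypothetical name):

```python
import pandas as pd

# Hypothetical DataFrame with 10 rows, indexed 0 to 9
df_numbers = pd.DataFrame({"value": range(10)})

first_rows = df_numbers.head(3)  # rows 0, 1, 2
last_rows = df_numbers.tail(2)   # rows 8, 9

# random_state fixes the random seed, so both calls return the same rows
sample_a = df_numbers.sample(3, random_state=42)
sample_b = df_numbers.sample(3, random_state=42)

print(first_rows)
print(last_rows)
print(sample_a.equals(sample_b))
```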

📌 Number of rows and columns in a `DataFrame`¶

👉 How many rows and columns does df_you have?

▶️ Run df_you.shape below to see the shape (number of rows and columns) of the DataFrame.

In [21]:

### BEGIN SOLUTION
df_you.shape
### END SOLUTION

Out[21]:

(45, 8)

👉 Can you store the number of rows and columns in variables?

df_you.shape returns a tuple in (num_rows, num_cols) format.
What is a tuple? 🙀
A tuple is like a list, except that it cannot be modified once created.

▶️ Run the code cell below to see how a tuple is nearly identical to a list.

In [22]:

# These two are nearly identical.
# The only difference is that my_tuple cannot be modified once created.
my_list = [10, 20]
my_tuple = (10, 20)

print(f"my_list[1]={my_list[1]}")    # prints 20
print(f"my_tuple[1]={my_tuple[1]}")  # also prints 20

my_list[1]=20
my_tuple[1]=20
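
To see the "cannot be modified" part in action, the sketch below tries to change the second element of each. Changing the list works, while the same assignment on the tuple raises a TypeError, which is caught here so that the cell keeps running.

```python
my_list = [10, 20]
my_tuple = (10, 20)

# Lists can be modified in place
my_list[1] = 99
print(my_list)

# Tuples cannot: item assignment raises a TypeError
try:
    my_tuple[1] = 99
except TypeError as error:
    print(f"TypeError: {error}")

# The tuple is unchanged
print(my_tuple)
```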

🎯 Challenge 6: Find the number of rows and columns in a `DataFrame`¶

👇 Tasks¶

✔️ Store the number of rows in df_you in a new variable named num_rows.
✔️ Store the number of columns in df_you in a new variable named num_cols.
✔️ Use .shape, not len().

In [23]:

### BEGIN SOLUTION
num_rows = df_you.shape[0]
num_cols = df_you.shape[1]
### END SOLUTION

print(num_rows)
print(num_cols)

45
8

🧭 Check Your Work¶

Once you're done, run the code cell below to test correctness.
✔️ If the code cell runs without an error, you're good to move on.
❌ If the code cell throws an error, go back and fix any incorrect parts.

In [24]:

tc.assertEqual(num_rows, len(df_you.index), f"Number of rows should be {len(df_you.index)}")
tc.assertEqual(num_cols, len(df_you.columns), f"Number of columns should be {len(df_you.columns)}")
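
Because .shape is a tuple, both values can also be pulled out in a single line with tuple unpacking. A short sketch on a made-up DataFrame (df_example, n_rows, and n_cols are hypothetical names):

```python
import pandas as pd

# Hypothetical DataFrame with 3 rows and 2 columns
df_example = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Unpack the (num_rows, num_cols) tuple into two variables at once
n_rows, n_cols = df_example.shape

print(n_rows, n_cols)
```

This is equivalent to indexing with shape[0] and shape[1], just more compact.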
