A Jupyter notebook is a document that can contain live code with results, visualizations, and rich text. It is widely used in data science and analytics. The cell below is a code cell. It contains a block of executable code.
Run the code by clicking on the cell below and then clicking the "Run" button at the top (▶).
print(10 + 20)
30
▶️ Run the code cell below to import unittest, a module used for 🧭 Check Your Work sections and the autograder.
import unittest
tc = unittest.TestCase()
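As a rough illustration of how the checks in this notebook behave (a sketch, not the autograder itself), assertEqual() returns silently when the two values match and raises an AssertionError when they do not. The check_failed variable and the message text are made up for this example.

```python
import unittest

tc = unittest.TestCase()

# A passing check produces no output at all.
tc.assertEqual(10 + 20, 30)

# A failing check raises AssertionError.
# We catch it here only so the message can be printed.
try:
    tc.assertEqual(10 + 20, 31, "10 + 20 should equal 30")
except AssertionError as err:
    check_failed = True
    print(f"Check failed: {err}")
else:
    check_failed = False
```

If a Check Your Work cell runs without printing anything, the check passed.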
There are three different types of cells.
We will most frequently use the first two types of cells.
my_list = [11, 20, 52, 91, 90, 75, 74, 20, 21, 10, 14]
### BEGIN SOLUTION
result = 0
for num in my_list:
    result = result + num
### END SOLUTION
print(result)
478
import unittest
tc = unittest.TestCase()
tc.assertEqual(result, 478)
Pandas is a Python library for data manipulation and analysis. Although it is now widely used in data-related programming, it was initially developed for financial analysis at AQR Capital Management.
Note: A library in the context of programming is a collection of functions (and other data) that others have already written for you.
Pandas is popular for many reasons:
### BEGIN SOLUTION
import pandas as pd
import numpy as np
### END SOLUTION
import sys
tc.assertTrue("pd" in globals(), "Check whether you have correctly imported Pandas with an alias.")
tc.assertTrue("np" in globals(), "Check whether you have correctly imported NumPy with an alias.")
Series ...
The basic building block of Pandas is a Series. A Series is like a list, but with many more features.
You can create a Series by passing a list of values to pd.Series().
s = pd.Series([1, 2, 3, np.nan, 5, 6])
s
0    1.0
1    2.0
2    3.0
3    NaN
4    5.0
5    6.0
dtype: float64
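A Series carries more than its values. The sketch below inspects the same Series with a few standard attributes and methods. Note that the dtype is float64 because np.nan is a floating-point value, so the integers are stored as floats.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan, 5, 6])

print(s.dtype)         # data type shared by all values
print(len(s))          # number of elements, including the missing one
print(s.isna().sum())  # number of missing (NaN) values
print(list(s.index))   # default integer labels: 0 through 5
```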
my_list = [1, 2, 3, 4]
print(type(my_list))
display(my_list * 2)
<class 'list'>
[1, 2, 3, 4, 1, 2, 3, 4]
my_series = pd.Series([1, 2, 3, 4])
print(type(my_series))
display(my_series * 2)
<class 'pandas.core.series.Series'>
0    2
1    4
2    6
3    8
dtype: int64
What happens when you multiply a Python list by the number 2? It repeats the elements.
How about a Series? It multiplies each element by 2!
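To get the same element-wise result from a plain list, you have to loop over it yourself, for example with a list comprehension. A small sketch for comparison (the variable names are only for illustration):

```python
import pandas as pd

my_list = [1, 2, 3, 4]
my_series = pd.Series(my_list)

repeated = my_list * 2                   # list repetition
doubled_list = [x * 2 for x in my_list]  # element-wise, written by hand
doubled_series = my_series * 2           # element-wise, done by pandas

print(repeated)
print(doubled_list)
print(doubled_series.tolist())
```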
### BEGIN SOLUTION
my_series = pd.Series([10, 20, 30])
### END SOLUTION
my_series
0    10
1    20
2    30
dtype: int64
pd.testing.assert_series_equal(my_series, pd.Series([1, 2, 3]) * 10)
Series methods
A pandas Series is similar to a Python list. However, a Series provides many methods (functions that belong to the object) for you to use.
As an example, num_reviews.mean() will return the average number of reviews.
reviews_count = [12715, 2274, 2771, 3952, 528, 2766, 724]
num_reviews = pd.Series(reviews_count)
# YOUR CODE HERE
...
Ellipsis
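Beyond mean(), a Series offers other summary methods such as sum(), max(), and min(). A short sketch using the same review counts:

```python
import pandas as pd

reviews_count = [12715, 2274, 2771, 3952, 528, 2766, 724]
num_reviews = pd.Series(reviews_count)

print(num_reviews.mean())  # average number of reviews
print(num_reviews.sum())   # total number of reviews
print(num_reviews.max())   # largest value
print(num_reviews.min())   # smallest value
```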
The lists product_names and num_reviews contain the names of make-up products and the number of reviews on Sephora.com.
Create a DataFrame named df_top_products that has the following two columns:
- product_name: Names of the products
- num_review: Number of reviews
The code below creates a new Pandas DataFrame from two series.
my_new_dataframe = pd.DataFrame({
    "column_one": my_series1,
    "column_two": my_series2
})
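The snippet above assumes my_series1 and my_series2 already exist. A runnable sketch with two made-up Series (the values are placeholders, not course data):

```python
import pandas as pd

# Hypothetical example data
my_series1 = pd.Series(["a", "b", "c"])
my_series2 = pd.Series([1, 2, 3])

# Each dictionary key becomes a column name
my_new_dataframe = pd.DataFrame({
    "column_one": my_series1,
    "column_two": my_series2
})

print(my_new_dataframe)
```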
product_names = [
    "Laneige Lip Sleeping Mask",
    "The Ordinary Hyaluronic Acid 2% + B5",
    "Laneige Lip Glowy Balm",
    "Chanel COCO MADEMOISELLE Eau de Parfum"
]
num_reviews = [
    12715,
    2274,
    2766,
    724
]
### BEGIN SOLUTION
df_top_products = pd.DataFrame({
    "product_name": product_names,
    "num_review": num_reviews
})
### END SOLUTION
display(df_top_products)
| | product_name | num_review |
|---|---|---|
| 0 | Laneige Lip Sleeping Mask | 12715 |
| 1 | The Ordinary Hyaluronic Acid 2% + B5 | 2274 |
| 2 | Laneige Lip Glowy Balm | 2766 |
| 3 | Chanel COCO MADEMOISELLE Eau de Parfum | 724 |
pd.testing.assert_frame_equal(
    df_top_products.reset_index(drop=True),
    pd.DataFrame({"product_name": {0: "Laneige Lip Sleeping Mask",
                                   1: "The Ordinary Hyaluronic Acid 2% + B5",
                                   2: "Laneige Lip Glowy Balm",
                                   3: "Chanel COCO MADEMOISELLE Eau de Parfum"},
                  "num_review": {0: 12715, 1: 2274, 2: 2766, 3: 724}})
)
The second part of today's lecture is all about you. 👻 Literally.
▶️ Run the code cell below to create a new DataFrame named df_you.
df_you = pd.read_csv("https://github.com/bdi475/datasets/raw/main/about-you.csv")
# Used to keep a clean copy
df_you_backup = df_you.copy()
# head() displays the first 5 rows of a DataFrame
df_you.head()
| | name | major1 | major2 | city | country | fav_restaurant | fav_movie | has_iphone |
|---|---|---|---|---|---|---|---|---|
| 0 | Dane Jacobsen | Agr & Consumer Economics | Business | Chicago | USA | Sushi Man | The Iron Claw | True |
| 1 | Bingqing Li | Statistics | Econometrics & Quant Econ | Shanghai | China | Yogi | Frozen | True |
| 2 | Kam Chiu Chong | Economics | Information Systems | Hong Kong | NaN | NaN | NaN | True |
| 3 | Jiaqi Zeng | Mathematics | Computer Science | Zhuhai | China | Northern Cuisine | Interstellar | True |
| 4 | Rishi Shah | Information Sciences + DS | Business | Chicago | USA | Taco Bell | Forrest Gump | True |
☝️ Hold on. Didn't we always create DataFrames using pd.DataFrame()?
Yes. But we can import existing data as a Pandas DataFrame using pd.read_csv(). There are many other similar import methods. For now, we'll mostly use pd.read_csv().
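pd.read_csv() accepts a file path, a URL, or a file-like object. To try it without a network connection, the sketch below reads a small CSV from an in-memory string. The data and the df_example name are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content; the blank city field becomes a missing value (NaN)
csv_text = """name,city,has_iphone
Ada,Chicago,True
Grace,,False
"""

df_example = pd.read_csv(io.StringIO(csv_text))

print(df_example)
print(df_example.dtypes)
```

The first line of the CSV is used as the column names, which is the default behavior of pd.read_csv().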
The table below explains each column in df_you.
Column Name | Description |
---|---|
name | Name of the person |
major1 | Major |
major2 | Second major OR minor (blank if no second major or minor) |
city | City the person is from |
country | Country the person is from |
fav_restaurant | Favorite restaurant (blank if no restaurant was given) |
fav_movie | Favorite movie (blank if no movie was given) |
has_iphone | Whether the person uses an iPhone |
DataFrame
👉 A common first step in working with a DataFrame is to use the info() method. info() prints a concise summary of a DataFrame.
▶️ Run df_you.info() below to see the info() method in action.
### BEGIN SOLUTION
df_you.info()
### END SOLUTION
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 118 entries, 0 to 117
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   name            118 non-null    object
 1   major1          117 non-null    object
 2   major2          93 non-null     object
 3   city            104 non-null    object
 4   country         107 non-null    object
 5   fav_restaurant  97 non-null     object
 6   fav_movie       99 non-null     object
 7   has_iphone      118 non-null    bool
dtypes: bool(1), object(7)
memory usage: 6.7+ KB
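The Non-Null Count and Dtype columns of this summary can also be retrieved separately. A small sketch on a made-up DataFrame (not df_you), using the dtypes attribute together with isna() and notna():

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in the "city" column
df_small = pd.DataFrame({
    "name": ["Ada", "Grace", "Linus"],
    "city": ["Chicago", np.nan, "Helsinki"],
    "has_iphone": [True, False, True],
})

print(df_small.dtypes)         # data type of each column
print(df_small.isna().sum())   # number of missing values per column
print(df_small.notna().sum())  # non-null count per column, as in info()
```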
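👉 From the result of df_you.info(), we can understand a couple of things:
- Seven of the eight columns have the object data type.
- Text columns are reported as object, not str.
- Columns with fewer than 118 non-null values contain missing values, which are displayed as NaN.
- A missing value is represented by np.nan (more on this later).
▶️ Run df_you.head() to print the first 5 rows of df_you.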
### BEGIN SOLUTION
df_you.head()
### END SOLUTION
| | name | major1 | major2 | city | country | fav_restaurant | fav_movie | has_iphone |
|---|---|---|---|---|---|---|---|---|
| 0 | Dane Jacobsen | Agr & Consumer Economics | Business | Chicago | USA | Sushi Man | The Iron Claw | True |
| 1 | Bingqing Li | Statistics | Econometrics & Quant Econ | Shanghai | China | Yogi | Frozen | True |
| 2 | Kam Chiu Chong | Economics | Information Systems | Hong Kong | NaN | NaN | NaN | True |
| 3 | Jiaqi Zeng | Mathematics | Computer Science | Zhuhai | China | Northern Cuisine | Interstellar | True |
| 4 | Rishi Shah | Information Sciences + DS | Business | Chicago | USA | Taco Bell | Forrest Gump | True |
▶️ Run df_you.tail(4) to print the last 4 rows of df_you.
### BEGIN SOLUTION
df_you.tail(4)
### END SOLUTION
| | name | major1 | major2 | city | country | fav_restaurant | fav_movie | has_iphone |
|---|---|---|---|---|---|---|---|---|
| 114 | Renee Crawley | Interdisciplinary Health | NaN | NaN | NaN | NaN | NaN | True |
| 115 | Kashvi Panjolia | Computer Science | NaN | Austin | USA | Oozu Ramen | NaN | True |
| 116 | Connor Gordon | Aerospace Engineering | Business | Sussex | USA | Papa Del's | Interstellar | True |
| 117 | Dhruv Nambisan | Industrial Engineering | Business | Chicago | USA | Fernandos Tacos | Django Unchained | True |
▶️ Run df_you.sample(3) to print 3 randomly sampled rows from df_you.
### BEGIN SOLUTION
df_you.sample(3)
### END SOLUTION
| | name | major1 | major2 | city | country | fav_restaurant | fav_movie | has_iphone |
|---|---|---|---|---|---|---|---|---|
| 71 | Youngjin Song | Economics | Business | Ridgewood | USA | Mia Za's | The Revenant | True |
| 66 | Andrew Jordan | Economics | Business | Chicago | USA | Mo's Burritos | Interstellar | True |
| 106 | Will Neff | Information Sciences + DS | Political Science | Baltimore | USA | Yogi | Babylon or Perfect Blue | True |
# Autograder
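head(), tail(), and sample() each return a new DataFrame and leave the original untouched. Because sample() is random, its output changes on each run; passing random_state makes it repeatable. A sketch on a made-up DataFrame (df_numbers is a hypothetical name):

```python
import pandas as pd

# Hypothetical data: ten rows numbered 0 through 9
df_numbers = pd.DataFrame({"n": range(10)})

first_three = df_numbers.head(3)  # rows 0, 1, 2
last_two = df_numbers.tail(2)     # rows 8, 9

# The same random_state gives the same rows every time
sample_a = df_numbers.sample(3, random_state=42)
sample_b = df_numbers.sample(3, random_state=42)

print(first_three)
print(last_two)
print(sample_a.equals(sample_b))
```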
DataFrame
👉 How many rows and columns does df_you have?
▶️ Run df_you.shape below to see the shape (number of rows and columns) of the DataFrame.
### BEGIN SOLUTION
df_you.shape
### END SOLUTION
(118, 8)
👉 Can you store the number of rows and columns in variables?
- df_you.shape returns a tuple in (num_rows, num_cols) format.
- What is a tuple? 🙀
- A tuple is like a list that cannot be modified once created.
▶️ Run the code cell below to see how a tuple is nearly identical to a list.
# These two are nearly identical,
# The only difference is that my_tuple cannot be modified once created
my_list = [10, 20]
my_tuple = (10, 20)
print(f"my_list[1]={my_list[1]}") # prints 20
print(f"my_tuple[1]={my_tuple[1]}") # also prints 20
my_list[1]=20
my_tuple[1]=20
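The difference shows up as soon as you try to change an element. A minimal sketch: assigning to a list index works, while the same assignment on a tuple raises a TypeError. The tuple_error variable exists only to capture the message.

```python
my_list = [10, 20]
my_tuple = (10, 20)

my_list[1] = 99  # lists can be modified in place
print(my_list)

try:
    my_tuple[1] = 99  # tuples cannot be modified
except TypeError as err:
    tuple_error = str(err)
    print(f"TypeError: {tuple_error}")
```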
DataFrame
- Store the number of rows in df_you to a new variable named num_rows.
- Store the number of columns in df_you to a new variable named num_cols.
- Use .shape, not len().
### BEGIN SOLUTION
num_rows = df_you.shape[0]
num_cols = df_you.shape[1]
### END SOLUTION
print(num_rows)
print(num_cols)
118
8
tc.assertEqual(num_rows, len(df_you.index), f"Number of rows should be {len(df_you.index)}")
tc.assertEqual(num_cols, len(df_you.columns), f"Number of columns should be {len(df_you.columns)}")
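Since shape is a tuple, both values can also be assigned in one line with tuple unpacking. A sketch on a made-up DataFrame so that it runs without the course dataset (df_example is a hypothetical name):

```python
import pandas as pd

# Hypothetical 3-row, 2-column DataFrame
df_example = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Unpack the (rows, columns) tuple into two variables
num_rows, num_cols = df_example.shape

print(num_rows)
print(num_cols)
```

This is equivalent to indexing shape[0] and shape[1] separately.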