A Jupyter notebook is a document that can contain live code with its results, visualizations, and rich text. It is widely used in data science and analytics. The cell below is a code cell. It contains a block of executable code.
Run the code below by clicking the cell and then clicking the "Run" button at the top (▶).
print(10 + 20)
30
▶️ Run the code cell below to import unittest, a module used for 🧭 Check Your Work sections and the autograder.
import unittest
tc = unittest.TestCase()
There are three different types of cells.
We will most frequently use the first two types of cells.
my_list = [11, 20, 52, 91, 90, 75, 74, 20, 21, 10, 14]
### BEGIN SOLUTION
result = 0
for num in my_list:
    result = result + num
### END SOLUTION
print(result)
478
import unittest
tc = unittest.TestCase()
tc.assertEqual(result, 478)
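The check cells in this notebook rely on tc.assertEqual(). As a minimal sketch of how such a check behaves (the values below are made up for illustration), the call returns silently when the two values match and raises an AssertionError when they do not.

```python
import unittest

tc = unittest.TestCase()

# Passes silently: both sides are equal
tc.assertEqual(1 + 2, 3)

# Fails: assertEqual raises AssertionError, which is caught here
# only so the message can be printed
try:
    tc.assertEqual(1 + 2, 4, "1 + 2 should be 3, not 4")
    check_failed = False
except AssertionError as err:
    check_failed = True
    print(f"Check failed: {err}")
```

In a notebook, an uncaught AssertionError stops the cell and shows the message, which is how a failed check becomes visible.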
Pandas is a Python library for data manipulation and analysis. Although it is now widely used in data-related programming, it was initially developed for financial analysis at AQR Capital Management.
Note: A library in the context of programming is a collection of functions (and other data) that others have already written for you.
Pandas is popular for many reasons:
### BEGIN SOLUTION
import pandas as pd
import numpy as np
### END SOLUTION
import sys
tc.assertTrue("pd" in globals(), "Check whether you have correctly imported Pandas with an alias.")
tc.assertTrue("np" in globals(), "Check whether you have correctly imported NumPy with an alias.")
Series ...
The basic building block of Pandas is a Series. A Series is like a list, but with many more features.
You can create a Series by passing a list of values to pd.Series().
s = pd.Series([1, 2, 3, np.nan, 5, 6])
s
0    1.0
1    2.0
2    3.0
3    NaN
4    5.0
5    6.0
dtype: float64
my_list = [1, 2, 3, 4]
print(type(my_list))
display(my_list * 2)
<class 'list'>
[1, 2, 3, 4, 1, 2, 3, 4]
my_series = pd.Series([1, 2, 3, 4])
print(type(my_series))
display(my_series * 2)
<class 'pandas.core.series.Series'>
0    2
1    4
2    6
3    8
dtype: int64
What happens when you multiply a Python list by the number 2? It repeats the elements.
How about a Series? It multiplies each element by 2!
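To make the contrast concrete, here is a small sketch (the variable names doubled_list and doubled_series are chosen for illustration). Doubling every element of a plain list takes an explicit loop or list comprehension, while a Series needs only a single arithmetic expression.

```python
import pandas as pd

my_list = [1, 2, 3, 4]

# A list comprehension doubles each element of a plain list
doubled_list = [value * 2 for value in my_list]

# A Series applies the multiplication to every element at once
doubled_series = pd.Series(my_list) * 2

print(doubled_list)
print(doubled_series.tolist())
```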
### BEGIN SOLUTION
my_series = pd.Series([10, 20, 30])
### END SOLUTION
my_series
0    10
1    20
2    30
dtype: int64
pd.testing.assert_series_equal(my_series, pd.Series([1, 2, 3]) * 10)
Series methods
A pandas Series is similar to a Python list. However, a Series provides many methods (functions that belong to the object) for you to use.
As an example, num_reviews.mean() will return the average number of reviews.
reviews_count = [12715, 2274, 2771, 3952, 528, 2766, 724]
num_reviews = pd.Series(reviews_count)
# YOUR CODE HERE
...
Ellipsis
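As a sketch of what calling Series methods looks like, the snippet below applies a few common ones to the same review counts. The result variable names are chosen here for illustration and are not part of the notebook.

```python
import pandas as pd

reviews_count = [12715, 2274, 2771, 3952, 528, 2766, 724]
num_reviews = pd.Series(reviews_count)

# Each method is called on the Series itself
average_reviews = num_reviews.mean()  # arithmetic mean
total_reviews = num_reviews.sum()     # sum of all values
most_reviews = num_reviews.max()      # largest value
fewest_reviews = num_reviews.min()    # smallest value

print(average_reviews, total_reviews, most_reviews, fewest_reviews)
```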
- You are given two lists, product_names and num_reviews, that contain the names of make-up products and the number of reviews on Sephora.com.
- Create a DataFrame named df_top_products that has the following two columns:
  - product_name: Names of the products
  - num_review: Number of reviews
The code below creates a new Pandas DataFrame from two Series.
my_new_dataframe = pd.DataFrame({
    "column_one": my_series1,
    "column_two": my_series2
})
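The template above leaves my_series1 and my_series2 undefined. A runnable sketch, assuming two small made-up Series (the values are hypothetical), might look like this:

```python
import pandas as pd

# Hypothetical example data, not from the notebook
my_series1 = pd.Series(["a", "b", "c"])
my_series2 = pd.Series([1, 2, 3])

# Each dictionary key becomes a column name; each Series becomes a column
my_new_dataframe = pd.DataFrame({
    "column_one": my_series1,
    "column_two": my_series2
})

print(my_new_dataframe.shape)
```

Plain Python lists work in place of the Series as well, which is what the exercise that follows uses.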
product_names = [
    "Laneige Lip Sleeping Mask",
    "The Ordinary Hyaluronic Acid 2% + B5",
    "Laneige Lip Glowy Balm",
    "Chanel COCO MADEMOISELLE Eau de Parfum"
]
num_reviews = [
    12715,
    2274,
    2766,
    724
]
### BEGIN SOLUTION
df_top_products = pd.DataFrame({
    "product_name": product_names,
    "num_review": num_reviews
})
### END SOLUTION
display(df_top_products)
| | product_name | num_review |
|---|---|---|
| 0 | Laneige Lip Sleeping Mask | 12715 |
| 1 | The Ordinary Hyaluronic Acid 2% + B5 | 2274 |
| 2 | Laneige Lip Glowy Balm | 2766 |
| 3 | Chanel COCO MADEMOISELLE Eau de Parfum | 724 |
pd.testing.assert_frame_equal(
    df_top_products.reset_index(drop=True),
    pd.DataFrame({
        "product_name": {
            0: "Laneige Lip Sleeping Mask",
            1: "The Ordinary Hyaluronic Acid 2% + B5",
            2: "Laneige Lip Glowy Balm",
            3: "Chanel COCO MADEMOISELLE Eau de Parfum"
        },
        "num_review": {0: 12715, 1: 2274, 2: 2766, 3: 724}
    })
)
The second part of today's lecture is all about you. 👻 Literally.
▶️ Run the code cell below to create a new DataFrame named df_you.
df_you = pd.read_csv("https://github.com/bdi475/datasets/raw/main/about-you.csv")
# Used to keep a clean copy
df_you_backup = df_you.copy()
# head() displays the first 5 rows of a DataFrame
df_you.head()
| | name | major1 | major2 | city | country | fav_restaurant | fav_movie | has_iphone |
|---|---|---|---|---|---|---|---|---|
| 0 | Ahana Chakraborty | Statistics | Business & Informatics | Chicago | USA | Poke Lab | Shrek 2 | True |
| 1 | Andrew Rozmus | Psychology | NaN | Elmhurst | USA | NaN | NaN | True |
| 2 | Anusha Adira | Computer Engineering | Business | Cupertino | USA | Bangkok Thai | Three Idiots | True |
| 3 | Arthur Pyptyuk | Economics | Business | Hoffman Estates | USA | Sakanaya | Hereditary | True |
| 4 | Aryajit Das | Economics | Business & Global Markets plus Society | Streamwood | USA | Dubai Grill | Transformers: Age of Extinction | True |
☝️ Hold on. Didn't we always create DataFrames using pd.DataFrame()?
Yes. But we can import existing data as a Pandas DataFrame using pd.read_csv(). There are many other similar import methods. For now, we'll mostly use pd.read_csv().
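pd.read_csv() accepts a file path, a URL, or a file-like object. The sketch below uses a small in-memory CSV so that it runs without a network connection. The csv_text content and the df_example name are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file or URL
csv_text = """name,city,has_iphone
Ada,Chicago,True
Grace,,False
"""

# io.StringIO wraps the string so read_csv can treat it like a file
df_example = pd.read_csv(io.StringIO(csv_text))

print(df_example)
```

The empty city field in the second row is read as a missing value, which is how blanks in a CSV file typically end up as NaN in a DataFrame.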
The table below explains each column in df_you.
Column Name | Description |
---|---|
name | Name of the person |
major1 | Major |
major2 | Second major OR minor (blank if no second major or minor) |
city | City the person is from |
country | Country the person is from |
fav_restaurant | Favorite restaurant (blank if no restaurant was given) |
fav_movie | Favorite movie (blank if no movie was given) |
has_iphone | Whether the person uses an iPhone |
DataFrame
👉 A common first step in working with a DataFrame is to use the info() method. info() prints a concise summary of a DataFrame.
▶️ Run df_you.info() below to see the info() method in action.
### BEGIN SOLUTION
df_you.info()
### END SOLUTION
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45 entries, 0 to 44
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   name            45 non-null     object
 1   major1          44 non-null     object
 2   major2          36 non-null     object
 3   city            35 non-null     object
 4   country         37 non-null     object
 5   fav_restaurant  33 non-null     object
 6   fav_movie       31 non-null     object
 7   has_iphone      45 non-null     bool
dtypes: bool(1), object(7)
memory usage: 2.6+ KB
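The Non-Null Count column reported by info() can also be obtained programmatically. Here is a small sketch on a made-up DataFrame (df_demo and its values are hypothetical), using count() for non-missing values and isna().sum() for missing ones.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing city
df_demo = pd.DataFrame({
    "name": ["Ada", "Grace", "Linus"],
    "city": ["Chicago", np.nan, "Geneva"],
    "has_iphone": [True, False, True],
})

# Number of non-missing values per column
non_null_counts = df_demo.count()

# Number of missing values per column
missing_counts = df_demo.isna().sum()

print(non_null_counts)
print(missing_counts)
```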
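👉 From the result of df_you.info(), we can understand a couple of things:
- Every column except has_iphone has the object data type.
- Text columns are reported as object, not str.
- Several columns have fewer than 45 non-null values, so they contain missing values, displayed as NaN.
- Missing values are represented by np.nan (more on this later).
▶️ Run df_you.head() to print the first 5 rows of df_you.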
### BEGIN SOLUTION
df_you.head()
### END SOLUTION
| | name | major1 | major2 | city | country | fav_restaurant | fav_movie | has_iphone |
|---|---|---|---|---|---|---|---|---|
| 0 | Ahana Chakraborty | Statistics | Business & Informatics | Chicago | USA | Poke Lab | Shrek 2 | True |
| 1 | Andrew Rozmus | Psychology | NaN | Elmhurst | USA | NaN | NaN | True |
| 2 | Anusha Adira | Computer Engineering | Business | Cupertino | USA | Bangkok Thai | Three Idiots | True |
| 3 | Arthur Pyptyuk | Economics | Business | Hoffman Estates | USA | Sakanaya | Hereditary | True |
| 4 | Aryajit Das | Economics | Business & Global Markets plus Society | Streamwood | USA | Dubai Grill | Transformers: Age of Extinction | True |
▶️ Run df_you.tail(4) to print the last 4 rows of df_you.
### BEGIN SOLUTION
df_you.tail(4)
### END SOLUTION
| | name | major1 | major2 | city | country | fav_restaurant | fav_movie | has_iphone |
|---|---|---|---|---|---|---|---|---|
| 41 | Twinkle Yeruva | Computer Science | Business | Schaumburg | USA | Sticky Rice | Maze Runner | True |
| 42 | Valentina Flores | Economics | Business & French | Chicago | USA | NaN | Book of Life | True |
| 43 | Victoria Hernandez | Industrial Design | Spanish | East Moline | USA | Bangkok Thai | Mamma Mia or Shrek | True |
| 44 | Vikas Chavda | Economics | Business | Geneva | USA | Yogi | Kingsman: Secret Service | True |
▶️ Run df_you.sample(3) to print 3 randomly sampled rows from df_you.
### BEGIN SOLUTION
df_you.sample(3)
### END SOLUTION
| | name | major1 | major2 | city | country | fav_restaurant | fav_movie | has_iphone |
|---|---|---|---|---|---|---|---|---|
| 8 | Cole Jordan | Computer Science | Business | NaN | USA | Chipotle | Ratatouille | True |
| 21 | Julia Kevin | Bioengineering | Business | Elmhurst | USA | KoFusion | Set it Up | True |
| 38 | Spencer Sadler | Computer Science | Business | Chicago | USA | Bangkok Thai | Ratatouille | True |
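Because sample() draws rows at random, its output will usually differ from run to run. As a sketch on a hypothetical DataFrame (df_demo is made up for illustration), passing random_state makes the draw repeatable. The seed value 42 is an arbitrary choice.

```python
import pandas as pd

# Hypothetical DataFrame with ten rows
df_demo = pd.DataFrame({"value": range(10)})

# The same seed produces the same sampled rows
first_draw = df_demo.sample(3, random_state=42)
second_draw = df_demo.sample(3, random_state=42)
print(first_draw.equals(second_draw))

# head() defaults to 5 rows; tail(4) returns the last 4
print(len(df_demo.head()), len(df_demo.tail(4)))
```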
# Autograder
DataFrame
👉 How many rows and columns does df_you have?
▶️ Run df_you.shape below to see the shape (number of rows and columns) of the DataFrame.
### BEGIN SOLUTION
df_you.shape
### END SOLUTION
(45, 8)
👉 Can you store the number of rows and columns in variables?
- df_you.shape returns a tuple in (num_rows, num_cols) format.
- What is a tuple? 🙀
- A tuple is like a list that cannot be modified once created.
▶️ Run the code cell below to see how a tuple is nearly identical to a list.
# These two are nearly identical,
# The only difference is that my_tuple cannot be modified once created
my_list = [10, 20]
my_tuple = (10, 20)
print(f"my_list[1]={my_list[1]}") # prints 20
print(f"my_tuple[1]={my_tuple[1]}") # also prints 20
my_list[1]=20
my_tuple[1]=20
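The one difference mentioned in the comments, that a tuple cannot be modified, can be seen directly. In this sketch, assigning to a list element succeeds, while the same assignment on a tuple raises a TypeError (caught here only so the message can be printed).

```python
my_list = [10, 20]
my_tuple = (10, 20)

# Lists can be changed in place
my_list[1] = 99

# Tuples cannot: item assignment raises TypeError
try:
    my_tuple[1] = 99
    tuple_modified = True
except TypeError as err:
    tuple_modified = False
    print(f"Cannot modify a tuple: {err}")

print(my_list)
print(my_tuple)
```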
DataFrame
- Store the number of rows in df_you to a new variable named num_rows.
- Store the number of columns in df_you to a new variable named num_cols.
- Use .shape, not len().
### BEGIN SOLUTION
num_rows = df_you.shape[0]
num_cols = df_you.shape[1]
### END SOLUTION
print(num_rows)
print(num_cols)
45
8
tc.assertEqual(num_rows, len(df_you.index), f"Number of rows should be {len(df_you.index)}")
tc.assertEqual(num_cols, len(df_you.columns), f"Number of columns should be {len(df_you.columns)}")
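Since .shape is a tuple, both values can also be pulled out in a single statement with tuple unpacking. This is a sketch of an alternative to indexing with [0] and [1], shown on a hypothetical DataFrame (df_demo and the variable names are made up for illustration).

```python
import pandas as pd

# Hypothetical DataFrame with 3 rows and 2 columns
df_demo = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Unpack the (num_rows, num_cols) tuple into two variables
num_rows_demo, num_cols_demo = df_demo.shape

print(num_rows_demo)
print(num_cols_demo)
```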