Description: You are a data analyst for a movie streaming service. You have been tasked with analyzing a dataset of movie ratings to determine which genres are the most popular among users.
The dataset contains the following columns:
Your task is to:
Data:
You can use the following sample data to get started:
user_id,movie_id,rating,genre
1,101,4,Action
1,102,3,Comedy
2,101,5,Action
2,103,4,Drama
3,102,2,Comedy
3,104,5,Action
# import libraries
import pandas as pd
import sys
print('Python version ' + sys.version)
print('Pandas version ' + pd.__version__)
Python version 3.11.7 | packaged by Anaconda, Inc. | (main, Dec 15 2023, 18:05:47) [MSC v.1916 64 bit (AMD64)] Pandas version 2.2.1
# let's try to copy the data using the clipboard
df = pd.read_clipboard(sep=",")
df
user_id | movie_id | rating | genre | |
---|---|---|---|---|
0 | 1 | 101 | 4 | Action |
1 | 1 | 102 | 3 | Comedy |
2 | 2 | 101 | 5 | Action |
3 | 2 | 103 | 4 | Drama |
4 | 3 | 102 | 2 | Comedy |
5 | 3 | 104 | 5 | Action |
# check the data types
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 6 entries, 0 to 5 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 user_id 6 non-null int64 1 movie_id 6 non-null int64 2 rating 6 non-null int64 3 genre 6 non-null object dtypes: int64(3), object(1) memory usage: 324.0+ bytes
# create groupby object
group = df.groupby('genre')
# calculate average rating
avg = group['rating'].mean()
avg
genre Action 4.666667 Comedy 2.500000 Drama 4.000000 Name: rating, dtype: float64
I decided to place it all in one line. Yes, it is a bit ugly.
Here is what I did:
df.set_index('genre').merge(avg.rename('average_rating'), on='genre').sort_values('average_rating', ascending=False)
user_id | movie_id | rating | average_rating | |
---|---|---|---|---|
genre | ||||
Action | 1 | 101 | 4 | 4.666667 |
Action | 2 | 101 | 5 | 4.666667 |
Action | 3 | 104 | 5 | 4.666667 |
Drama | 2 | 103 | 4 | 4.000000 |
Comedy | 1 | 102 | 3 | 2.500000 |
Comedy | 3 | 102 | 2 | 2.500000 |
If all you needed to see was the averages...
avg.sort_values(ascending=False)
genre Action 4.666667 Drama 4.000000 Comedy 2.500000 Name: rating, dtype: float64
This tutorial guided you through the analysis of a movie ratings dataset using Pandas. It covered loading data, grouping by genre, calculating average ratings, merging data, and sorting results.
pd.read_clipboard()
df.info()
df.groupby()
group['rating'].mean()
df.merge()
or df.set_index().merge()
rename()
sort_values()
This tutorial was created by HEDARO