Jace Bothner
Watch the video presentation here:
https://youtu.be/QsSIUkY10kY
If you're not able to view the interactive visuals, try viewing the project on my GitHub page:
https://github.com/jbothner21/MIS_665_Final_Project
The goal of this project is to analyze an IMDb movie dataset and use the x variables to predict the y variable, IMDb score. We will accomplish this using several data analytics methods: data visualization, correlation analysis, regression, classification, and clustering.
## Import packages
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
import pingouin as pg
import plotly.graph_objects as go
import plotly.offline as py
import plotly.express as px
import sklearn.linear_model as lm
from sklearn.metrics import mean_squared_error
from sklearn.metrics import explained_variance_score
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import f_regression
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import RFE
pd.set_option('display.max_columns', 500)
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import chi2
from sklearn.model_selection import GridSearchCV
import scikitplot as skplt
from graphviz import Source
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import pairwise_distances
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.externals.six import StringIO
import pydotplus
from sklearn.cluster import ward_tree
from scipy.cluster.hierarchy import dendrogram, linkage, ward
from scipy.spatial.distance import cdist
from math import sqrt
import warnings
warnings.filterwarnings("ignore")
## Import data
df = pd.read_csv("data/movie_metadata.csv")
df.head(1)
color | director_name | num_critic_for_reviews | duration | director_facebook_likes | actor_3_facebook_likes | actor_2_name | actor_1_facebook_likes | gross | genres | actor_1_name | movie_title | num_voted_users | cast_total_facebook_likes | actor_3_name | facenumber_in_poster | plot_keywords | movie_imdb_link | num_user_for_reviews | language | country | content_rating | budget | title_year | actor_2_facebook_likes | imdb_score | aspect_ratio | movie_facebook_likes | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Color | James Cameron | 723.0 | 178.0 | 0.0 | 855.0 | Joel David Moore | 1000.0 | 760505847.0 | Action|Adventure|Fantasy|Sci-Fi | CCH Pounder | Avatar | 886204 | 4834 | Wes Studi | 0.0 | avatar|future|marine|native|paraplegic | http://www.imdb.com/title/tt0499549/?ref_=fn_t... | 3054.0 | English | USA | PG-13 | 237000000.0 | 2009.0 | 936.0 | 7.9 | 1.78 | 33000 |
## Number of rows in dataset
len(df)
5043
## Check column names
df.columns
Index(['color', 'director_name', 'num_critic_for_reviews', 'duration', 'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name', 'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name', 'movie_title', 'num_voted_users', 'cast_total_facebook_likes', 'actor_3_name', 'facenumber_in_poster', 'plot_keywords', 'movie_imdb_link', 'num_user_for_reviews', 'language', 'country', 'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes', 'imdb_score', 'aspect_ratio', 'movie_facebook_likes'], dtype='object')
Some columns will clearly not be useful in our analysis. The 'movie_imdb_link', 'plot_keywords', and 'aspect_ratio' columns won't be meaningful, so we'll drop these extraneous columns.
## Drop unnecessary columns
columns = ['movie_imdb_link','plot_keywords','aspect_ratio']
df.drop(columns, inplace=True, axis=1)
df.head(1)
color | director_name | num_critic_for_reviews | duration | director_facebook_likes | actor_3_facebook_likes | actor_2_name | actor_1_facebook_likes | gross | genres | actor_1_name | movie_title | num_voted_users | cast_total_facebook_likes | actor_3_name | facenumber_in_poster | num_user_for_reviews | language | country | content_rating | budget | title_year | actor_2_facebook_likes | imdb_score | movie_facebook_likes | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Color | James Cameron | 723.0 | 178.0 | 0.0 | 855.0 | Joel David Moore | 1000.0 | 760505847.0 | Action|Adventure|Fantasy|Sci-Fi | CCH Pounder | Avatar | 886204 | 4834 | Wes Studi | 0.0 | 3054.0 | English | USA | PG-13 | 237000000.0 | 2009.0 | 936.0 | 7.9 | 33000 |
There are 45 duplicate rows in the dataset. We'll drop these duplicated rows to avoid repeated information that could influence our results.
## Check for duplicates
len(df[df.duplicated()])
45
## Drop duplicates and check number of rows
df = df.drop_duplicates()
len(df)
4998
Below are the null or missing value counts for each column. We'll have to deal with these null values in various ways because some methods of analysis will not allow null values.
## Check for missing values
df.isnull().sum().sort_values(ascending=False).head(19)
gross 874 budget 487 content_rating 301 title_year 107 director_facebook_likes 103 director_name 103 num_critic_for_reviews 49 actor_3_facebook_likes 23 actor_3_name 23 num_user_for_reviews 21 color 19 duration 15 facenumber_in_poster 13 actor_2_facebook_likes 13 actor_2_name 13 language 12 actor_1_name 7 actor_1_facebook_likes 7 country 5 dtype: int64
The three variables missing the most data are gross, budget, and content rating. Because of the nature of these variables, it's difficult to find a solution to fill the null values. There is too much variation in the dataset to replace with mean, median, or mode. Therefore we'll have to drop all rows that have null values for these three variables.
## Check for missing values
df.isnull().sum().sort_values(ascending=False).head(3)
gross 874 budget 487 content_rating 301 dtype: int64
## Drop null gross/budget
df.dropna(subset=['gross'], how='all', inplace = True)
df.dropna(subset=['budget'], how='all', inplace = True)
df.dropna(subset=['content_rating'], how='all', inplace = True)
len(df)
3806
Even after dropping the null values in gross, budget, and content rating, there are still 3,806 observations in the dataset. This is still a sufficient sample size to analyze and draw conclusions from. Now there are still 10 columns that have null values.
## Check for missing values
df.isnull().sum().sort_values(ascending=False).head(10)
actor_3_facebook_likes 6 facenumber_in_poster 6 actor_3_name 6 color 2 actor_2_facebook_likes 2 language 2 actor_2_name 2 num_critic_for_reviews 1 actor_1_facebook_likes 1 actor_1_name 1 dtype: int64
One method of filling missing data, especially categorical data, is to replace the null value with the most commonly observed value.
## Majority of movies are in color
df['color'].value_counts()
Color 3680 Black and White 124 Name: color, dtype: int64
## Majority of movies are in English
df['language'].value_counts().sort_values(ascending=False).head()
English 3644 French 34 Spanish 24 Mandarin 14 German 11 Name: language, dtype: int64
In the case of the color and language columns, each variable has one value that occurs far more frequently than the others. So we'll replace the missing color values with 'Color' and the missing language values with 'English'. Now only 8 columns have missing values.
# Replace null values with the most popular value in categorical columns
df = df.fillna({'color': 'Color'})
df = df.fillna({'language': 'English'})
df.isnull().sum().sort_values(ascending=False).head(8)
actor_3_facebook_likes 6 facenumber_in_poster 6 actor_3_name 6 actor_2_facebook_likes 2 actor_2_name 2 actor_1_name 1 num_critic_for_reviews 1 actor_1_facebook_likes 1 dtype: int64
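As a more general sketch of the same idea (not the exact cell above), the most frequent value can be computed from the data itself with pandas' mode(), which avoids hard-coding 'Color' and 'English':
## Sketch: fill categorical nulls with each column's most frequent value
for col in ['color', 'language']:
    most_common = df[col].mode()[0]        # mode() returns a Series; take the top value
    df[col] = df[col].fillna(most_common)  # equivalent to the hard-coded fillna above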
Another technique for filling missing data is to calculate a measure of central tendency, like the mean or median, for each column and replace the null values with it. This is the methodology we'll use for the missing numeric variables: the number of faces in the movie poster, the number of critic reviews, and the actors' Facebook likes. Since there is some variation in the data, and since all the values are whole numbers, we'll fill the missing values with the median.
## Median replace because of variation
df.median()
num_critic_for_reviews 136.0 duration 106.0 director_facebook_likes 59.0 actor_3_facebook_likes 432.0 actor_1_facebook_likes 1000.0 gross 28749642.5 num_voted_users 52312.5 cast_total_facebook_likes 3951.0 facenumber_in_poster 1.0 num_user_for_reviews 205.0 budget 25000000.0 title_year 2005.0 actor_2_facebook_likes 670.0 imdb_score 6.6 movie_facebook_likes 218.0 dtype: float64
# Replace null values with median
newposter = df['facenumber_in_poster'].median()
df = df.fillna({'facenumber_in_poster': newposter})
newcritic = df['num_critic_for_reviews'].median()
df = df.fillna({'num_critic_for_reviews': newcritic})
newact1 = df['actor_1_facebook_likes'].median()
df = df.fillna({'actor_1_facebook_likes': newact1})
newact2 = df['actor_2_facebook_likes'].median()
df = df.fillna({'actor_2_facebook_likes': newact2})
newact3 = df['actor_3_facebook_likes'].median()
df = df.fillna({'actor_3_facebook_likes': newact3})
df.isnull().sum().sort_values(ascending=False).head(3)
actor_3_name 6 actor_2_name 2 actor_1_name 1 dtype: int64
After filling these values with the median, three columns still have null values: the names of actors one, two, and three. We could fill these missing values, but since there are numerous actors in the dataset, and since these few missing values won't affect our analysis, we'll leave them as null.
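As an aside, the five median fills above can be written as a single loop; a minimal sketch that is equivalent to the cell above, assuming the same df:
## Sketch: median-fill the same numeric columns in one loop
median_cols = ['facenumber_in_poster', 'num_critic_for_reviews', 'actor_1_facebook_likes',
               'actor_2_facebook_likes', 'actor_3_facebook_likes']
for col in median_cols:
    df[col] = df[col].fillna(df[col].median())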
After filling null values, we'll want to handle any outliers in the data that might influence analysis. The median budget is 25 million dollars, so some outrageously high budgets in the dataset stand out as outliers. We need to remove these values, which are either typos or inaccurate information, since no movie has ever had a budget anywhere close to 12 billion dollars.
## Median budget for context
df['budget'].median()
25000000.0
## Check for outliers
pd.options.display.float_format = '{:,.0f}'.format
df['budget'].sort_values(ascending=False).head()
2988 12,215,500,000 3859 4,200,000,000 3005 2,500,000,000 2323 2,400,000,000 2334 2,127,519,898 Name: budget, dtype: float64
## Reset scientific notation
pd.reset_option('^display.', silent=True)
We'll treat any budget above 300 million dollars as an outlier and drop those rows. After doing so, 12 outliers are removed from the dataset.
## Drop outliers
df.drop( df[ df['budget'] >= 300000001 ].index , inplace=True)
len(df)
3794
Now we'll sort out the content rating category. There are 12 different options in the content rating column. We can consolidate these ratings and organize them into more manageable categories.
## Check content ratings
df['content_rating'].value_counts()
R 1715 PG-13 1311 PG 572 G 91 Not Rated 42 Unrated 24 Approved 17 X 10 NC-17 6 Passed 3 M 2 GP 1 Name: content_rating, dtype: int64
We'll redistribute the content rating categories the following way: GP, Approved, and Passed become PG; M becomes R; and Not Rated, X, and NC-17 are grouped as Unrated.
Now we have 5 different content ratings that follow a natural scale, from movies that are more family friendly to movies that have mature themes.
## Redistribute into more manageable groups
df = df.replace('GP', 'PG')
df = df.replace('Approved', 'PG')
df = df.replace('Passed', 'PG')
df = df.replace('M', 'R')
df = df.replace('Not Rated', 'Unrated')
df = df.replace('X', 'Unrated')
df = df.replace('NC-17', 'Unrated')
df['content_rating'].value_counts()
R 1717 PG-13 1311 PG 593 G 91 Unrated 82 Name: content_rating, dtype: int64
In order to analyze categorical columns with certain methodologies, we'll have to convert those columns from objects to integers. We'll do this for the following columns: color, language, country, and content rating. For color, we'll convert the column into a dummy variable since there are only two options. The movie is either in black and white, which is assigned a 0, or in color, which is assigned a 1.
## Convert color to dummy variable
df = pd.get_dummies(df, columns=['color'])
df = df.drop(['color_ Black and White'], axis=1)
df = df.rename(columns={'color_Color':'color'})
df['color'].value_counts()
1 3670 0 124 Name: color, dtype: int64
Since we already redistributed the content rating column into 5 groups that follow a natural scale, we can assign each of the ratings a number. G will be 1, PG will be 2, etc. In other words, a 1 will be a more family friendly movie, and a 5 will have more adult themes.
## Remap content rating from categorical to integer
df['content_rating'] = df['content_rating'].map({'G': 1, 'PG': 2, 'PG-13':3, 'R': 4, 'Unrated': 5})
df['content_rating'].value_counts()
4 1717 3 1311 2 593 1 91 5 82 Name: content_rating, dtype: int64
We've already established that the categorical columns language and country have a logical most frequent value: English is the most occurring language and the United States is the most common country. Therefore, we can create a new column with dummy variables for each of these categories. For language, 1 means the movie is in English, 0 means the movie is in a different language. For country, 1 means the movie is from the United States, and 0 means the movie is outside the USA.
## Remap language and country to dummy variables
df.loc[df.language == 'English', 'in_english'] = 1
df.loc[df.language != 'English', 'in_english'] = 0
df.loc[df.country == 'USA', 'from_USA'] = 1
df.loc[df.country != 'USA', 'from_USA'] = 0
df['in_english'].value_counts()
1.0 3645 0.0 149 Name: in_english, dtype: int64
df['from_USA'].value_counts()
1.0 3025 0.0 769 Name: from_USA, dtype: int64
Now we need to focus on the genres column. Since it's difficult to sum a single movie up into just one genre, many movies list multiple genres. We'll have to split each movie's listed genres so we can analyze them individually.
## Check genres in dataframe
df['genres'].value_counts().head()
Comedy|Drama|Romance 150 Drama 147 Comedy 143 Comedy|Drama 142 Comedy|Romance 136 Name: genres, dtype: int64
## Create a new data frame
gen = df[['genres','imdb_score']]
gen.head(3)
genres | imdb_score | |
---|---|---|
0 | Action|Adventure|Fantasy|Sci-Fi | 7.9 |
1 | Action|Adventure|Fantasy | 7.1 |
2 | Action|Adventure|Thriller | 6.8 |
## Check number of rows
len(gen)
3794
Since we're splitting up the genres, each movie could have multiple rows depending on the number of genres listed. For instance, if a certain movie has 5 genres listed, that movie will have 5 separate rows. For this reason, there are 11,281 rows after splitting the genres.
## Split genres into multiple rows
spl = pd.DataFrame(gen.genres.str.split('|').tolist(), index=gen.imdb_score).stack()
spl = spl.reset_index()[[0, 'imdb_score']]
spl.columns = ['genres', 'imdb_score']
spl.head(3)
genres | imdb_score | |
---|---|---|
0 | Action | 7.9 |
1 | Adventure | 7.9 |
2 | Fantasy | 7.9 |
len(spl)
11281
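On pandas 0.25 or newer, the same long-format table can be produced more directly with explode(); this is only a sketch of an equivalent alternative, not the code used above:
## Sketch: equivalent genre split using DataFrame.explode (pandas >= 0.25)
spl_alt = gen.assign(genres=gen['genres'].str.split('|')).explode('genres')
spl_alt = spl_alt.reset_index(drop=True)[['genres', 'imdb_score']]
len(spl_alt)  # should match the 11,281 rows above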
It may provide some insight to create a new column based on gross and budget. Subtracting budget from gross will yield the total profit of the movie.
## Calculate profit column
df['profit'] = df['gross'] - df['budget']
df.columns
Index(['director_name', 'num_critic_for_reviews', 'duration', 'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name', 'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name', 'movie_title', 'num_voted_users', 'cast_total_facebook_likes', 'actor_3_name', 'facenumber_in_poster', 'num_user_for_reviews', 'language', 'country', 'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes', 'imdb_score', 'movie_facebook_likes', 'color', 'in_english', 'from_USA', 'profit'], dtype='object')
For analysis later in the project, we'll set custom bins based on the IMDb score of the movie. The movies in each bin represent the following: bin 1 ('bad') is a score from 0 to 4, bin 2 ('ok') is 4 to 6, bin 3 ('good') is 6 to 8, and bin 4 ('excellent') is 8 to 10.
## Set bins
df['movie_quality'] = pd.cut(df['imdb_score'], [0, 4, 6, 8, 10], labels=['1', '2', '3', '4'])
df['movie_quality'].value_counts().sort_index()
1 95 2 1067 3 2476 4 156 Name: movie_quality, dtype: int64
Now we'll reorder the columns so that similar variables are next to each other for easier analysis. We'll also create a copy of the dataframe with only integer columns for use later in the project.
## Reorder columns
df = df[['imdb_score', 'movie_quality', 'movie_title', 'title_year', 'content_rating', 'duration', 'country', 'from_USA', 'language', 'in_english', 'color', 'gross', 'budget', 'profit', 'director_name', 'actor_1_name', 'actor_2_name', 'actor_3_name', 'movie_facebook_likes', 'director_facebook_likes', 'actor_1_facebook_likes', 'actor_2_facebook_likes', 'actor_3_facebook_likes', 'cast_total_facebook_likes', 'num_critic_for_reviews', 'num_user_for_reviews', 'num_voted_users', 'facenumber_in_poster']]
df.head(1)
imdb_score | movie_quality | movie_title | title_year | content_rating | duration | country | from_USA | language | in_english | ... | movie_facebook_likes | director_facebook_likes | actor_1_facebook_likes | actor_2_facebook_likes | actor_3_facebook_likes | cast_total_facebook_likes | num_critic_for_reviews | num_user_for_reviews | num_voted_users | facenumber_in_poster | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7.9 | 3 | Avatar | 2009.0 | 3 | 178.0 | USA | 1.0 | English | 1.0 | ... | 33000 | 0.0 | 1000.0 | 936.0 | 855.0 | 4834 | 723.0 | 3054.0 | 886204 | 0.0 |
1 rows × 28 columns
## New dataframe for integer columns
df1 = df.drop(['movie_title', 'country', 'language', 'director_name', 'actor_1_name', 'actor_2_name', 'actor_3_name'], axis =1)
df1.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 3794 entries, 0 to 5042 Data columns (total 21 columns): imdb_score 3794 non-null float64 movie_quality 3794 non-null category title_year 3794 non-null float64 content_rating 3794 non-null int64 duration 3794 non-null float64 from_USA 3794 non-null float64 in_english 3794 non-null float64 color 3794 non-null uint8 gross 3794 non-null float64 budget 3794 non-null float64 profit 3794 non-null float64 movie_facebook_likes 3794 non-null int64 director_facebook_likes 3794 non-null float64 actor_1_facebook_likes 3794 non-null float64 actor_2_facebook_likes 3794 non-null float64 actor_3_facebook_likes 3794 non-null float64 cast_total_facebook_likes 3794 non-null int64 num_critic_for_reviews 3794 non-null float64 num_user_for_reviews 3794 non-null float64 num_voted_users 3794 non-null int64 facenumber_in_poster 3794 non-null float64 dtypes: category(1), float64(15), int64(4), uint8(1) memory usage: 600.4 KB
While the ultimate goal of this analysis is to predict the IMDb score of certain movies, it may help to first get a better understanding of our data through visualization.
This first figure shows the distribution of the IMDb scores in the form of a box plot. The median IMDb score is 6.6, the first quartile is a score of 5.9, and the third quartile is 7.2.
## IMDb score
fig = px.box(df, y="imdb_score", points='all', hover_name='movie_title')
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='IMDb Score Boxplot')
Now we'll visualize the data by year. We can see from the histogram that we have more data from movies made in the late 1990s and onward. This might be because we had to drop the rows that were missing gross and budget values, and that information is less readily available for older movies. Alternatively, it's also possible that advances in technology have lowered the barrier to entry for making films, leading to more films being made.
## Year
fig = px.histogram(df, x="title_year")
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Movies by Year')
Drama and comedy are the most popular movie genres, as shown by the bar chart below.
## Genre
fig = px.bar(spl, x="genres")
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Movies by Genre')
We've already seen the numbers for content rating while cleaning the data, but by visualizing the content rating in a pie chart, we can see just how prevalent PG-13 and R rated movies are. In fact, almost 80% of the movies in the dataset fall into these two ratings.
## Content rating
labels = ['R','PG-13','PG','G','Unrated']
values = df['content_rating'].value_counts().values
trace=go.Pie(labels=labels,values=values)
py.iplot([trace])
A histogram of duration shows us that most movies are clustered around 100 minutes (1 hour and 40 minutes). However, the duration histogram is skewed right, meaning the distribution has a longer tail toward longer runtimes.
## Duration
fig = px.histogram(df, x="duration")
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Movie Duration')
The histogram of movie gross shown below is severely right-skewed. This means that it's fairly rare for a movie to gross more than 50 million dollars, and a decent portion of the movies in the dataset gross below 10 million dollars.
## Gross
fig = px.histogram(df, x="gross")
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Movie Gross')
The histogram displaying movie budget tells a similar story to that of movie gross. The histogram is skewed right, although less drastically, meaning it's more common for a movie to have a lower budget than a higher one. It's actually fairly common for a movie to have a budget of less than 5 million dollars.
## Budget
fig = px.histogram(df, x="budget")
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Movie Budget')
The relationship between a movie's gross and its budget is shown below. There is a moderate positive correlation between the two, which is fairly intuitive given the similarities between the two histograms above. Essentially, as the movie budget increases, the movie's gross should increase too; the extent of this relationship will be discussed in the correlation analysis section.
## Gross by Budget
fig = px.scatter(df, x="budget", y="gross", hover_name='movie_title')
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Movie Gross by Budget')
The profit histogram below shows that movie profit is fairly normally distributed, with the center around 0 dollars. This means that a large portion of movies just break even in terms of profit. However, profit isn't perfectly normally distributed, as the graph is slightly right-skewed, meaning that some movies make a huge profit.
## Profit
fig = px.histogram(df, x="profit")
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Movie Profit')
The scatter plot below shows that movie gross and profit are moderately positively correlated. This is not surprising and is fairly intuitive: as a movie grosses more, it will likely make a larger profit.
## Profit by gross
fig = px.scatter(df, x="profit", y="gross", hover_name='movie_title')
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Movie Profit by Gross')
The bar chart below shows the top 10 most prolific directors, or in other words, those who have directed the most movies.
## Director most movies
d1 = df['director_name'].value_counts().head(10)
d1 = pd.DataFrame(data=d1).reset_index()
d1.columns= ['director_name', 'count']
fig = px.bar(d1, x="director_name", y='count')
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Top 10 Most Prolific Directors')
The bar chart below shows the directors who have the highest average IMDb scores across all of their movies. For context, as we found from earlier analysis, the median IMDb score is 6.6.
## Director by IMDb score
d2 = df.groupby('director_name')['imdb_score'].mean().sort_values(ascending=False).head(10)
d2 = pd.DataFrame(data=d2).reset_index()
d2.columns= ['director_name', 'avg_imdb_score']
fig = px.bar(d2, x="director_name", y='avg_imdb_score')
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Top 10 Best Directors')
Similar to the prolific directors graph, the graph below shows the actors who have appeared in the most movies in a leading role.
## Actor 1 most prolific
a1 = df['actor_1_name'].value_counts().head(10)
a1 = pd.DataFrame(data=a1).reset_index()
a1.columns= ['actor_1_name', 'count']
fig = px.bar(a1, x="actor_1_name", y='count')
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Top 10 Most Prolific Leading Actors')
Also similar to the best directors graph, the graph below shows the top 10 leading actors, based on average IMDb score across all their movies.
## Actor 1 by IMDb score best
a2 = df.groupby('actor_1_name')['imdb_score'].mean().sort_values(ascending=False).head(10)
a2 = pd.DataFrame(data=a2).reset_index()
a2.columns= ['actor_1_name', 'avg_imdb_score']
fig = px.bar(a2, x="actor_1_name", y='avg_imdb_score')
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Top 10 Best Leading Actors')
Conversely, the graph below shows the 10 worst leading actors based on their average IMDb score.
## Actor 1 by IMDb score worst
a3 = df.groupby('actor_1_name')['imdb_score'].mean().sort_values(ascending=False).tail(10)
a3 = pd.DataFrame(data=a3).reset_index()
a3.columns= ['actor_1_name', 'avg_imdb_score']
fig = px.bar(a3, x="actor_1_name", y='avg_imdb_score')
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Top 10 Worst Leading Actors')
The top 10 most prolific supporting actors are shown below.
## Actor 2 most prolific
a4 = df['actor_2_name'].value_counts().head(10)
a4 = pd.DataFrame(data=a4).reset_index()
a4.columns= ['actor_2_name', 'count']
fig = px.bar(a4, x="actor_2_name", y='count')
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Top 10 Most Prolific Supporting Actors')
The top 10 supporting actors based on average IMDb score are shown in the bar chart below.
## Actor 2 by IMDb score best
a5 = df.groupby('actor_2_name')['imdb_score'].mean().sort_values(ascending=False).head(10)
a5 = pd.DataFrame(data=a5).reset_index()
a5.columns= ['actor_2_name', 'avg_imdb_score']
fig = px.bar(a5, x="actor_2_name", y='avg_imdb_score')
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Top 10 Best Supporting Actors')
The scatter plot below shows the relationship between movie Facebook likes and IMDb score. A substantial number of the movies in the dataset have 0 Facebook likes, which means that many of these movies simply don't have Facebook pages for one reason or another. Because of this, the variable may not be a strong determinant of IMDb score, but we'll discuss this more in the correlation analysis section.
## Movie likes by IMDb score
fig = px.scatter(df, x="movie_facebook_likes", y="imdb_score", hover_name='movie_title')
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Movie Facebook Likes by IMDb Score')
The director likes by IMDb score scatter plot tells a similar story to that of movie Facebook likes. At the time of data collection, many of the directors did not have a Facebook page.
## Director likes by IMDb score
fig = px.scatter(df, x="director_facebook_likes", y="imdb_score", hover_name='movie_title')
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Director Facebook Likes by IMDb Score')
Similar to the other Facebook variables, there is seemingly little correlation between cast Facebook likes and IMDb score.
## Cast likes
fig = px.scatter(df, x="cast_total_facebook_likes", y="imdb_score", hover_name='movie_title')
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Cast Facebook Likes by IMDb Score')
The figure below shows the distribution of the number of critic reviews in the form of a box plot.
## Critics
fig = px.box(df, y="num_critic_for_reviews", points='all', hover_name='movie_title')
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Number of Critic Reviews')
The histogram below shows the distribution of the number of users who gave a review on IMDb. As with other histograms shown before, this one is also skewed right. Almost half (1852/3794) of the movies in the dataset have under 200 user reviews.
## Users
fig = px.histogram(df, x="num_user_for_reviews")
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Number of IMDb User Reviews')
The histogram for the number of IMDb user votes is shown below, and much like the user reviews variable, it is skewed right; however, the skew is much more drastic. A substantial portion of movies in the dataset have fewer than 20,000 votes.
## Voted
fig = px.histogram(df, x="num_voted_users")
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Number of IMDb Users Who Voted')
Correlation is a measure of relationship strength between two variables. If two variables are highly positively correlated, then an increase in one will generally lead to an increase in the other. If two variables are highly negatively correlated, then an increase in one will generally lead to a decrease in the other. The correlation coefficients for each of the x variables with the y variable, IMDb score, are shown below.
## Correlation with IMDb score
df[df.columns[0:]].corr()['imdb_score'][:].sort_values(ascending=False)
imdb_score 1.000000 num_voted_users 0.479882 duration 0.370148 num_critic_for_reviews 0.349811 num_user_for_reviews 0.324998 movie_facebook_likes 0.283770 profit 0.255117 gross 0.218389 director_facebook_likes 0.191684 content_rating 0.120074 cast_total_facebook_likes 0.106502 actor_2_facebook_likes 0.101541 actor_1_facebook_likes 0.093774 actor_3_facebook_likes 0.065357 budget 0.038711 facenumber_in_poster -0.069214 color -0.118699 title_year -0.133974 from_USA -0.135822 in_english -0.169427 Name: imdb_score, dtype: float64
## Correlation heatmap
plt.figure(figsize=(14,10))
corr = df.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(df.corr(), mask=mask, vmax=.8, square=True, annot=True, fmt=".2f");
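For any single pair of variables, scipy (imported at the top) can also report whether a correlation is statistically significant. A minimal sketch using stats.pearsonr on one of the stronger relationships:
## Sketch: correlation coefficient and p-value for one pair of variables
r, p = stats.pearsonr(df1['num_voted_users'], df1['imdb_score'])
print("r = %.3f, p-value = %.3g" % (r, p))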
The variables that are the highest determinants of overall IMDb score are likely the ones that are most correlated with it. Interestingly, no variable has a strong correlation, so which variables count as 'highly' correlated has to be judged relative to the others. In this case, the variables we'll deem 'highly' correlated (above 0.2) are: num_voted_users (0.48), duration (0.37), num_critic_for_reviews (0.35), num_user_for_reviews (0.32), movie_facebook_likes (0.28), profit (0.26), and gross (0.22).
These findings are generalized, and how strongly each variable is likely to affect IMDb score follows the size of its correlation coefficient, as shown above. So now that we have this information, what can we do with it? The next step is to set up a regression model based on these determinants to predict IMDb score.
The first step in regression is setting x and y variables. The y variable, the variable we're trying to predict, is IMDb score. The x variables, or determinants of the y variable, will be all the numerical columns aside from IMDb score.
## Set X and y variables
y = df1['imdb_score']
X = df1.drop(['movie_quality', 'imdb_score'], axis =1)
The first model we'll build and test is Scikit-learn's linear regression algorithm.
## Build regression model based on scikit-learn algorithm
model1 = lm.LinearRegression()
model1.fit(X, y)
model1_y = model1.predict(X)
## Display coefficients for each variable
coef = ["%.3f" % i for i in model1.coef_]
xcolumns = [ i for i in X.columns ]
list(zip(xcolumns, coef))
[('title_year', '-0.017'), ('content_rating', '0.037'), ('duration', '0.011'), ('from_USA', '-0.205'), ('in_english', '-0.701'), ('color', '-0.335'), ('gross', '-0.000'), ('budget', '-0.000'), ('profit', '0.000'), ('movie_facebook_likes', '-0.000'), ('director_facebook_likes', '0.000'), ('actor_1_facebook_likes', '0.000'), ('actor_2_facebook_likes', '0.000'), ('actor_3_facebook_likes', '0.000'), ('cast_total_facebook_likes', '-0.000'), ('num_critic_for_reviews', '0.003'), ('num_user_for_reviews', '-0.001'), ('num_voted_users', '0.000'), ('facenumber_in_poster', '-0.025')]
The strength or accuracy of a regression model is evaluated through mean squared error (MSE) and R-squared. The goal of a regression model is to minimize MSE and maximize R-squared; a lower MSE means less error in the model.
## Model evaluation
print("Mean Square Error: ", mean_squared_error(y, model1_y))
print("Variance or R-squared: ", explained_variance_score(y, model1_y))
Mean Square Error: 0.6492138737974602 Variance or R-squared: 0.4164775382245356
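To make these metrics concrete, here is a minimal sketch of the underlying formulas computed directly with NumPy on the same actual and predicted values (explained_variance_score matches R-squared whenever the residuals average to zero, as they do for an ordinary least squares fit with an intercept):
## Sketch: MSE and R-squared from their definitions
residuals = y - model1_y
mse_manual = np.mean(residuals ** 2)                                   # average squared error
r2_manual = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)   # 1 - SS_res / SS_tot
print("Manual MSE: ", mse_manual)
print("Manual R-squared: ", r2_manual)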
Next we'll test the Statsmodels regression algorithm, which is based on ordinary least squares (OLS).
## Build regression model based on ordinary least squares
runs_reg_model = ols("imdb_score~title_year+content_rating+duration+from_USA+in_english+color+gross+budget+profit+movie_facebook_likes+actor_1_facebook_likes+actor_2_facebook_likes+actor_3_facebook_likes+cast_total_facebook_likes+num_critic_for_reviews+num_user_for_reviews+num_voted_users+facenumber_in_poster", df1)
runs_reg = runs_reg_model.fit()
print(runs_reg.summary())
OLS Regression Results ============================================================================== Dep. Variable: imdb_score R-squared: 0.416 Model: OLS Adj. R-squared: 0.414 Method: Least Squares F-statistic: 158.4 Date: Tue, 26 May 2020 Prob (F-statistic): 0.00 Time: 13:11:29 Log-Likelihood: -4564.7 No. Observations: 3794 AIC: 9165. Df Residuals: 3776 BIC: 9278. Df Model: 17 Covariance Type: nonrobust ============================================================================================= coef std err t P>|t| [0.025 0.975] --------------------------------------------------------------------------------------------- Intercept 40.5584 3.229 12.560 0.000 34.227 46.889 title_year -0.0173 0.002 -10.713 0.000 -0.020 -0.014 content_rating 0.0368 0.017 2.105 0.035 0.003 0.071 duration 0.0112 0.001 16.637 0.000 0.010 0.012 from_USA -0.2021 0.036 -5.683 0.000 -0.272 -0.132 in_english -0.7014 0.073 -9.603 0.000 -0.845 -0.558 color -0.3387 0.076 -4.486 0.000 -0.487 -0.191 gross -1.876e-06 1.49e-07 -12.561 0.000 -2.17e-06 -1.58e-06 budget 1.871e-06 1.49e-07 12.527 0.000 1.58e-06 2.16e-06 profit 1.877e-06 1.49e-07 12.570 0.000 1.58e-06 2.17e-06 movie_facebook_likes -1.927e-06 9.09e-07 -2.119 0.034 -3.71e-06 -1.44e-07 actor_1_facebook_likes 5.629e-05 1.26e-05 4.484 0.000 3.17e-05 8.09e-05 actor_2_facebook_likes 5.986e-05 1.32e-05 4.534 0.000 3.4e-05 8.57e-05 actor_3_facebook_likes 5.493e-05 2.04e-05 2.687 0.007 1.48e-05 9.5e-05 cast_total_facebook_likes -5.385e-05 1.25e-05 -4.301 0.000 -7.84e-05 -2.93e-05 num_critic_for_reviews 0.0026 0.000 13.581 0.000 0.002 0.003 num_user_for_reviews -0.0006 5.56e-05 -10.546 0.000 -0.001 -0.000 num_voted_users 3.357e-06 1.65e-07 20.288 0.000 3.03e-06 3.68e-06 facenumber_in_poster -0.0253 0.007 -3.853 0.000 -0.038 -0.012 ============================================================================== Omnibus: 601.448 Durbin-Watson: 1.975 Prob(Omnibus): 0.000 Jarque-Bera (JB): 1317.164 Skew: -0.927 Prob(JB): 9.58e-287 Kurtosis: 5.213 Cond. No. 6.77e+15 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The smallest eigenvalue is 9.34e-13. This might indicate that there are strong multicollinearity problems or that the design matrix is singular.
The R-squared values of the Statsmodels and Scikit-learn models are virtually identical (0.416), but Statsmodels reports a slightly higher MSE, meaning it's a slightly less accurate model than Scikit-learn's.
## Evaluation of MSE
runs_reg.mse_resid
0.6525603200981047
Now we'll build a regression model using the Lasso regularization algorithm.
## Build regression model based on regularization algorithm
modelL = lm.Lasso()
modelL.fit(X, y)
modelL_y = modelL.predict(X)
## Display coefficients for each variable
coef = ["%.3f" % i for i in modelL.coef_]
xcolumns = [ i for i in X.columns ]
list(zip(xcolumns, coef))
[('title_year', '-0.004'), ('content_rating', '0.000'), ('duration', '0.011'), ('from_USA', '-0.000'), ('in_english', '-0.000'), ('color', '-0.000'), ('gross', '-0.000'), ('budget', '-0.000'), ('profit', '0.000'), ('movie_facebook_likes', '-0.000'), ('director_facebook_likes', '0.000'), ('actor_1_facebook_likes', '0.000'), ('actor_2_facebook_likes', '0.000'), ('actor_3_facebook_likes', '0.000'), ('cast_total_facebook_likes', '-0.000'), ('num_critic_for_reviews', '0.002'), ('num_user_for_reviews', '-0.000'), ('num_voted_users', '0.000'), ('facenumber_in_poster', '-0.000')]
This model has a higher MSE and a lower R-squared than previous models, meaning it's less accurate.
## Model evaluation
print("Mean Square Error: ", mean_squared_error(y, modelL_y))
print("Variance or R-squared: ", explained_variance_score(y, modelL_y))
Mean Square Error: 0.705537813993835 Variance or R-squared: 0.36585279718498065
Next we'll build a model using feature selection. SelectKBest will choose the 12 most important variables that we'll be able to build a regression model from.
## Select 12 most important variables using SelectKBest algorithm
X1 = SelectKBest(f_regression, k=12).fit_transform(X, y)
selector = SelectKBest(f_regression, k=12).fit(X, y)
idxs_selected = selector.get_support(indices=True)
print(idxs_selected)
[ 0 1 2 3 4 6 8 9 10 15 16 17]
## Check which variables were chosen
X.head(1)
title_year | content_rating | duration | from_USA | in_english | color | gross | budget | profit | movie_facebook_likes | director_facebook_likes | actor_1_facebook_likes | actor_2_facebook_likes | actor_3_facebook_likes | cast_total_facebook_likes | num_critic_for_reviews | num_user_for_reviews | num_voted_users | facenumber_in_poster | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2009.0 | 3 | 178.0 | 1.0 | 1.0 | 1 | 760505847.0 | 237000000.0 | 523505847.0 | 33000 | 0.0 | 1000.0 | 936.0 | 855.0 | 4834 | 723.0 | 3054.0 | 886204 | 0.0 |
The 12 most important variables according to SelectKBest are: title_year, content_rating, duration, from_USA, in_english, gross, profit, movie_facebook_likes, director_facebook_likes, num_critic_for_reviews, num_user_for_reviews, and num_voted_users.
The regression model built with feature selection has a higher MSE and a lower R-squared than the previously tested models, meaning it's less accurate.
## Build regression model based on 12 chosen variables
modelF = lm.LinearRegression()
modelF.fit(X1, y)
modelF_y = modelF.predict(X1)
## Model evaluation
print("Mean Square Error: ", mean_squared_error(y, modelF_y))
print("Variance or R-squared: ", explained_variance_score(y, modelF_y))
Mean Square Error: 0.6608544967460929 Variance or R-squared: 0.4060147843713958
By analyzing the MSE of each of the regression models, we can tell that the Scikit-learn linear regression is the most accurate model because it has the lowest MSE of the 4 models tested. This means that the Scikit-learn model has the least total error in its predictions.
## Evaluation of all models
print("Scikit-learn MSE:", mean_squared_error(y, model1_y))
print("Statsmodel MSE:", runs_reg.mse_resid)
print("Regularization MSE:", mean_squared_error(y, modelL_y))
print("Feature selection MSE:", mean_squared_error(y, modelF_y))
Scikit-learn MSE: 0.6492138737974602 Statsmodel MSE: 0.6525603200981047 Regularization MSE: 0.705537813993835 Feature selection MSE: 0.6608544967460929
The plot below shows the actual IMDb scores plotted against Scikit-learn's predicted scores. Below the plot is the root mean squared error (RMS) of the Scikit-learn model. This is helpful in analysis because RMS puts the error term on the same scale as the y variable. In other words, on average, this model misses the IMDb score by about 0.8. This seems very high, especially since 0.8 can be the difference between a good movie and a great movie. However, the reason for the high RMS is that the model doesn't predict IMDb scores below 5 or 6, so there is a greater amount of error for lower-rated movies, which increases the error overall. This is most likely due to the nature of the dataset rather than an issue with the model. In other words, because of the dataset, the model will be more accurate for IMDb scores above 5 or 6.
## Scatterplot actual vs predicted for Scikit-learn
plt.subplots()
plt.scatter(y, model1_y)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4) #dotted line represents perfect prediction (actual = predicted)
plt.title('Scikit-learn Actual vs Predicted')
plt.xlabel('Actual IMDb Score')
plt.ylabel('Predicted IMDb Score')
plt.show()
## Display RMS of Scikit-learn model
print("Scikit-learn RMS: ", sqrt(mean_squared_error(y, model1_y)))
Scikit-learn RMS: 0.8057380925570419
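One caveat worth flagging: the MSE, R-squared, and RMS above are computed on the same rows the models were fit on. A hedged sketch of how the Scikit-learn model could instead be scored on a held-out 30% split (using train_test_split, which is already imported and is used in the classification section below); the exact numbers would differ from those above:
## Sketch: score the linear regression on a held-out test split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
holdout_model = lm.LinearRegression().fit(X_tr, y_tr)
holdout_pred = holdout_model.predict(X_te)
print("Hold-out MSE: ", mean_squared_error(y_te, holdout_pred))
print("Hold-out RMS: ", sqrt(mean_squared_error(y_te, holdout_pred)))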
Earlier on in the project, we set custom bins based on the IMDb score of the movie.
## Show custom movie quality bins
df1['movie_quality'].value_counts().sort_index()
1 95 2 1067 3 2476 4 156 Name: movie_quality, dtype: int64
Now we can use the dataset to build classification models to predict which bin each movie falls into. Effectively, we are predicting whether a movie is 'bad,' 'ok,' 'good,' or 'excellent' instead of predicting the IMDb score itself.
## Declare X/y variables
y = df1['movie_quality']
X = df1.drop(['movie_quality', 'imdb_score'], axis =1)
print(y.shape, X.shape)
(3794,) (3794, 19)
First, we'll use a decision tree classification model. We split the dataset into training and testing groups: 70% of the data will be used to train the algorithm, while the remaining 30% will be used to test the model and its accuracy.
## Validation
X_train, X_test_dt, y_train, y_test_dt = train_test_split(X, y, test_size=0.3, random_state=0)
## Initialize
dt = DecisionTreeClassifier(max_depth=5, min_samples_leaf=5)
## Train
dt = dt.fit(X_train, y_train)
## Model evaluation
print(metrics.accuracy_score(y_test_dt, dt.predict(X_test_dt)))
0.7155399473222125
A classification model is evaluated by its accuracy: the number of correctly classified data points in the test set divided by the total number of data points in the test set. In this case, the model correctly classifies the data 71.5% of the time. A confusion matrix is a visualization of the accuracy of a classification model. The goal is to maximize the true rates, the numbers on the diagonal from top left to bottom right; naturally, maximizing the true rates means minimizing the rest, which are the false rates. In this case, the true rate is high enough to suggest a decent amount of accuracy in the model. However, it's worth noting that the model does not correctly predict any 'bad' movies and instead falsely classifies them as 'ok' or 'good.' Similarly, there are quite a few instances of 'ok' movies misclassified as 'good' movies. This mirrors the regression model, which did not predict IMDb scores under 5, and further supports the idea that this error is due to the nature of the dataset and not the algorithm itself.
## Confusion matrix
skplt.metrics.plot_confusion_matrix(y_true=np.array(y_test_dt), y_pred=dt.predict(X_test_dt))
plt.show()
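Accuracy and the confusion matrix can be supplemented with per-class precision and recall, which make the weakness on 'bad' movies explicit. A minimal sketch using sklearn's classification_report (the metrics module is already imported):
## Sketch: per-class precision and recall for the decision tree
print(metrics.classification_report(y_test_dt, dt.predict(X_test_dt)))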
Shown below is the output of the decision tree used to classify each movie. The decision tree starts with num_voted_users, splits the movies into two groups, above and below 89,788, then moves to the next node. This process repeats until all observations have been classified. The gini score represents how homogeneous the data in each node is; zero represents complete homogeneity, which is the goal of classification. The issue with this decision tree is that some of the nodes on the final level are mostly heterogeneous when, in an ideal model, they should be completely homogeneous. Again, this is most likely due to the nature of the dataset, where the algorithms tend to misclassify movies with a lower IMDb score.
## Display decision tree
Source(tree.export_graphviz(dt, out_file=None, feature_names=X.columns))
Next, we'll use KNN to build a classification model. Again, like the decision tree, the data is split into 70% training and 30% testing. KNN stands for k-nearest neighbors: the algorithm uses the training data to find the nearest neighbors of a test data point and classifies it according to their majority class.
## Validation
X_train, X_test_knn, y_train, y_test_knn = train_test_split(X, y, test_size=0.3, random_state=0)
## Initialize
knn = KNeighborsClassifier()
## Train
knn = knn.fit(X_train, y_train)
## Model evaluation
print(metrics.accuracy_score(y_test_knn, knn.predict(X_test_knn)))
0.5943810359964882
The accuracy of the KNN model is shown above. The model correctly classifies the data 59.4% of the time. This is substantially lower than the decision tree model. According to the confusion matrix below, it appears that most of the error is coming from 'ok' movies misclassified as 'good' movies.
## Confusion matrix
skplt.metrics.plot_confusion_matrix(y_true=np.array(y_test_knn), y_pred=knn.predict(X_test_knn))
plt.show()
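The KNN model above uses scikit-learn's default of 5 neighbors. A hedged sketch of tuning that hyperparameter with GridSearchCV (imported at the top but not otherwise used here); the candidate values of k are arbitrary choices for illustration:
## Sketch: tune the number of neighbors with 5-fold cross-validation
param_grid = {'n_neighbors': [3, 5, 11, 21, 51]}
knn_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
knn_search.fit(X_train, y_train)
print("Best k:", knn_search.best_params_['n_neighbors'])
print("Test accuracy:", metrics.accuracy_score(y_test_knn, knn_search.predict(X_test_knn)))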
Logistic regression is similar to the linear regression used earlier in the regression section. However, logistic regression is used for classification and models the class probabilities with a non-linear (logistic) function. Again we'll split the data into 70% training and 30% testing.
## Validation
X_train, X_test_lr, y_train, y_test_lr = train_test_split(X, y, test_size=0.3, random_state=0)
## Initialize
lr = LogisticRegression(solver='lbfgs', max_iter=500)
## Train
lr.fit(X_train, y_train)
## Model evaluation
print(metrics.accuracy_score(y_test_lr, lr.predict(X_test_lr)))
0.659350307287094
The accuracy of the logistic regression model is shown above. The model correctly classifies the data 65.9% of the time. This is more accurate than the KNN model but still less accurate than the decision tree model. The confusion matrix shown below reveals that the algorithm mostly predicts movies as 'good' because that's where the bulk of the data is. This is not conducive to a good classification model.
## Confusion matrix
skplt.metrics.plot_confusion_matrix(y_true=np.array(y_test_lr), y_pred=lr.predict(X_test_lr))
plt.show()
Now we'll build a random forest classification model. Random forest operates similarly to the decision tree model, but instead of building one decision tree, it builds an ensemble of them (20 in this case) and combines their votes. Again we'll split the data into 70% training and 30% testing.
## Validation
X_train, X_test_clf, y_train, y_test_clf = train_test_split(X, y, test_size=0.3, random_state=0)
## Initialize
clf = RandomForestClassifier(n_estimators=20)
## Train
clf=clf.fit(X_train, y_train)
## Model evaluation
print(metrics.accuracy_score(y_test_clf, clf.predict(X_test_clf)))
0.7594381035996488
Random forest is the most accurate of any model tested so far. The model correctly classifies the data 76% of the time. The confusion matrix below shows that the random forest model is not only more accurate but also doesn't rely on predicting the most common class to achieve its high accuracy.
## Confusion matrix
skplt.metrics.plot_confusion_matrix(y_true=np.array(y_test_clf), y_pred=clf.predict(X_test_clf))
plt.show()
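A single 70/30 split can be sensitive to exactly how the rows are divided. As a sanity check, cross_val_score (imported at the top but unused so far) averages accuracy over several folds; a sketch assuming the same X, y, and random forest settings as above:
## Sketch: 5-fold cross-validated accuracy for the random forest
cv_scores = cross_val_score(RandomForestClassifier(n_estimators=20), X, y, cv=5)
print("Fold accuracies:", cv_scores)
print("Mean accuracy:", cv_scores.mean())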
Recursive feature elimination (RFE) ranks the importance of each variable. Those variables are then used to build a logistic regression classification model.
## Rank importance of variables based on feature selection
model = LogisticRegression()
rfe = RFE(model, 1)
rfe = rfe.fit(X, y)
## Features sorted by their rank
pd.DataFrame({'variable':X.columns, 'importance':rfe.ranking_}).sort_values(by='importance').reset_index().drop('index', axis=1).head(10)
variable | importance | |
---|---|---|
0 | in_english | 1 |
1 | color | 2 |
2 | from_USA | 3 |
3 | content_rating | 4 |
4 | facenumber_in_poster | 5 |
5 | duration | 6 |
6 | num_critic_for_reviews | 7 |
7 | title_year | 8 |
8 | num_user_for_reviews | 9 |
9 | actor_2_facebook_likes | 10 |
We'll use the top 10 most important variables according to RFE shown above to build a logistic regression model.
## Select top 10 ranked variables from RFE
X_logistic = df[['in_english','color','from_USA', 'content_rating', 'facenumber_in_poster', 'duration', 'num_critic_for_reviews', 'title_year', 'num_user_for_reviews', 'actor_2_facebook_likes']]
## Validation
X_train, X_test_lr1, y_train, y_test_lr1 = train_test_split(X_logistic, y, test_size=0.3, random_state=0)
## Initialize
lr1 = LogisticRegression()
## Train
lr1.fit(X_train, y_train)
#Model evaluation
print(metrics.accuracy_score(y_test_lr1, lr1.predict(X_test_lr1)))
0.6646180860403863
The RFE model correctly classifies the data 66.5% of the time. Because RFE uses logistic regression, most movies are classified as 'good' movies since that's where most of the training data falls. This is not conducive to building a good classification model.
## Confusion matrix
skplt.metrics.plot_confusion_matrix(y_test_lr1, lr1.predict(X_test_lr1))
plt.show()
The random forest model is the most accurate of all models tested, both in its overall accuracy and in the visualization of the confusion matrix. The random forest model correctly classifies the data 76% of the time.
## Print accuracy score for each model
dt = metrics.accuracy_score(y_test_dt, dt.predict(X_test_dt))
knn = metrics.accuracy_score(y_test_knn, knn.predict(X_test_knn))
lr = metrics.accuracy_score(y_test_lr, lr.predict(X_test_lr))
clf = metrics.accuracy_score(y_test_clf, clf.predict(X_test_clf))
lr1 = metrics.accuracy_score(y_test_lr1, lr1.predict(X_test_lr1))
print('Decision tree accuracy: %s' % (dt))
print('KNN accuracy: %s' % (knn))
print('Logistic regression accuracy: %s' % (lr))
print('Random forest classifier accuracy: %s' % (clf))
print('Recursive feature selection accuracy: %s' % (lr1))
Decision tree accuracy: 0.7155399473222125 KNN accuracy: 0.5943810359964882 Logistic regression accuracy: 0.659350307287094 Random forest classifier accuracy: 0.7594381035996488 Recursive feature selection accuracy: 0.6646180860403863
The goal of clustering is to segment the data into distinct clusters and develop cluster profiles based on each cluster's information. The first step is to determine the most important variables to use in the clustering analysis. We'll select the 8 most important variables according to feature importance.
## Select 8 most important variables with feature importance
model_extra = ExtraTreesClassifier()
model_extra.fit(X, y)
pd.DataFrame(model_extra.feature_importances_, index = X.columns, columns=['importance']).sort_values('importance', ascending=False).head(8)
importance | |
---|---|
num_voted_users | 0.115421 |
duration | 0.073528 |
num_critic_for_reviews | 0.069830 |
budget | 0.066979 |
title_year | 0.066449 |
num_user_for_reviews | 0.064334 |
profit | 0.060909 |
gross | 0.059518 |
## New dataframe based on these variables
df_c = df1[['title_year', 'duration', 'gross', 'budget', 'profit', 'num_critic_for_reviews', 'num_user_for_reviews', 'num_voted_users']]
We'll need to normalize the data before performing cluster analysis because the scale, or variance, of each variable is different and would distort the clustering. For example, the variance of gross, at roughly 4.8 quadrillion, is much higher than the variance of title year at 99.2, and the algorithm would give more weight to the higher-variance variable. The point of normalization is to put all variables on the same scale, and therefore on a level playing field, when determining the importance of each variable.
## Check variance
pd.options.display.float_format = '{:,.3f}'.format
df_c.var()
title_year 99.151 duration 500.154 gross 4,841,046,710,277,035.000 budget 1,851,932,001,982,648.750 profit 2,812,299,093,271,696.500 num_critic_for_reviews 15,337.725 num_user_for_reviews 167,781.108 num_voted_users 22,803,705,819.598 dtype: float64
## Reset scientific notation
pd.reset_option('^display.', silent=True)
Once the data has been normalized, all variables share the same scale, and the size of the initial variance won't influence the clustering results.
## Normalize data and check variance again
df_norm = (df_c - df_c.mean()) / (df_c.max() - df_c.min())
df_norm.var()
title_year 0.012518 duration 0.005826 gross 0.008370 budget 0.020577 profit 0.004148 num_critic_for_reviews 0.023262 num_user_for_reviews 0.006556 num_voted_users 0.007987 dtype: float64
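The cell above uses a mean-centered range scaling, (x - mean) / (max - min). A common alternative, shown here only as a sketch, is scikit-learn's MinMaxScaler, which rescales each column to the 0-1 range; either way the variables end up on comparable scales:
## Sketch: min-max scaling with scikit-learn as an alternative normalization
from sklearn.preprocessing import MinMaxScaler
df_minmax = pd.DataFrame(MinMaxScaler().fit_transform(df_c),
                         columns=df_c.columns, index=df_c.index)
df_minmax.var()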
The elbow plot below helps determine the optimal number of clusters for a dataset. With too few clusters there will be a loss of individual information; with too many, analysis becomes more difficult. The elbow plot gets its name from the optimal cluster point resembling an elbow joint. It appears that the 'elbow' is at 3 clusters. This judgment is subjective, though, and others may argue for 2 or 4 clusters.
## Display elbow method plot
kmeans = KMeans(random_state=1)
skplt.cluster.plot_elbow_curve(kmeans, df_norm, cluster_ranges=range(1, 8));
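The elbow curve above comes from scikit-plot; the same information can be recovered by hand from each fitted model's inertia_ (the within-cluster sum of squares). A minimal sketch:
## Sketch: within-cluster sum of squares for k = 1 through 7
for k in range(1, 8):
    km = KMeans(n_clusters=k, random_state=1).fit(df_norm)
    print(k, km.inertia_)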
Now we'll assign clusters using k-means clustering based on the number of clusters selected in the elbow plot.
## Clustering analysis using k-means
k_means = KMeans(init='k-means++', n_clusters=3, random_state=0);
## Fit normalized data
k_means.fit(df_norm);
## Cluster labels
k_means.labels_
array([0, 0, 0, ..., 1, 1, 1], dtype=int32)
## Cluster labels to dataframe
df_c1 = pd.DataFrame(k_means.labels_, columns = ['cluster'])
df_c1.head()
cluster | |
---|---|
0 | 0 |
1 | 0 |
2 | 0 |
3 | 0 |
4 | 0 |
## Join cluster labels and normalized data
df_c2 = df_norm.join(df_c1)
df_c2.head(1)
title_year | duration | gross | budget | profit | num_critic_for_reviews | num_user_for_reviews | num_voted_users | cluster | |
---|---|---|---|---|---|---|---|---|---|
0 | 0.066779 | 0.232443 | 0.93197 | 0.66281 | 0.61929 | 0.687488 | 0.538356 | 0.463054 | 0.0 |
## Cluster with original data
df_c3 = df_c.join(df_c1)
df_c3.head(1)
title_year | duration | gross | budget | profit | num_critic_for_reviews | num_user_for_reviews | num_voted_users | cluster | |
---|---|---|---|---|---|---|---|---|---|
0 | 2009.0 | 178.0 | 760505847.0 | 237000000.0 | 523505847.0 | 723.0 | 3054.0 | 886204 | 0.0 |
The number of movies in each cluster is shown below. The majority of observations fit into cluster 2 (with the index 1), while only a select few have made it into cluster 1 (with the index 0).
Note: python indexing starts at 0 not 1
## Number of observations in each cluster
df_c3['cluster'].value_counts().sort_index()
0.0 271 1.0 1854 2.0 1106 Name: cluster, dtype: int64
Now we can view each cluster's average value for each of the selected variables. Before we analyze this information and develop profiles from it, we need to confirm that the differences between clusters are statistically significant rather than relying on an arbitrary judgment.
## Mean values for each cluster
pd.options.display.float_format = '{:,.3f}'.format
df_c3.groupby('cluster').mean().T
cluster | 0.0 | 1.0 | 2.0 |
---|---|---|---|
title_year | 2,008.092 | 2,003.050 | 2,003.280 |
duration | 123.993 | 109.019 | 112.955 |
gross | 158,429,687.048 | 41,913,748.528 | 61,262,531.428 |
budget | 136,912,915.129 | 28,233,330.099 | 46,975,311.394 |
profit | 21,516,771.919 | 13,680,418.428 | 14,287,220.033 |
num_critic_for_reviews | 308.886 | 152.780 | 172.290 |
num_user_for_reviews | 744.550 | 291.023 | 357.489 |
num_voted_users | 250,640.672 | 91,945.350 | 115,319.672 |
## Reset scientific notation
pd.reset_option('^display.', silent=True)
The purpose of t-testing is to find statistically significant differences between two group means. In this case, for each variable, we'll test whether each of the 3 clusters' mean values deviates significantly from the other clusters' (the code below uses pairwise Tukey tests, which serve the same purpose for pairwise comparisons).
## Develop t-test for each variable between each cluster
a = pg.pairwise_tukey(data=df_c3, dv='title_year', between='cluster')['p-tukey']
b = pg.pairwise_tukey(data=df_c3, dv='duration', between='cluster')['p-tukey']
c = pg.pairwise_tukey(data=df_c3, dv='gross', between='cluster')['p-tukey']
d = pg.pairwise_tukey(data=df_c3, dv='budget', between='cluster')['p-tukey']
e = pg.pairwise_tukey(data=df_c3, dv='profit', between='cluster')['p-tukey']
f = pg.pairwise_tukey(data=df_c3, dv='num_critic_for_reviews', between='cluster')['p-tukey']
g = pg.pairwise_tukey(data=df_c3, dv='num_user_for_reviews', between='cluster')['p-tukey']
h = pg.pairwise_tukey(data=df_c3, dv='num_voted_users', between='cluster')['p-tukey']
## Define highlight function
def color(val):
color = 'red' if val > 0.05 else ''
return 'background-color: %s' % color
Two means are statistically significantly different if the p-value is less than 0.05. The cells highlighted in red have a p-value above this cutoff, so those differences are not statistically significant. In terms of title year, cluster 1 is statistically different from cluster 2, and the same is true between clusters 1 and 3; however, title year is not significantly different between clusters 2 and 3. Profit is the only variable for which none of the cluster differences are statistically significant, because all of its p-values are greater than 0.05. Therefore, we'll leave these highlighted cells out of our cluster profiles. The rest of the variables are statistically significantly different between each of the clusters. Now that we have this information, we can develop cluster profiles.
## Convert p-values to dataframe
ttest = pd.concat([a, b, c, d, e, f, g, h], axis=1)
ttest.columns = ['year', 'duration', 'gross', 'budget', 'profit', 'num_critic_for_reviews', 'num_user_for_reviews', 'num_voted_users']
ttest.index.name = 'difference between clusters'
ttest = ttest.rename(index={0: '1 and 2', 1: '1 and 3', 2: '2 and 3'})
ttest.style.applymap(color).format("{:.3}")
year | duration | gross | budget | profit | num_critic_for_reviews | num_user_for_reviews | num_voted_users | |
---|---|---|---|---|---|---|---|---|
difference between clusters | ||||||||
1 and 2 | 0.001 | 0.001 | 0.001 | 0.001 | 0.0827 | 0.001 | 0.001 | 0.001 |
1 and 3 | 0.001 | 0.001 | 0.001 | 0.001 | 0.141 | 0.001 | 0.001 | 0.001 |
2 and 3 | 0.818 | 0.001 | 0.001 | 0.001 | 0.9 | 0.001 | 0.001 | 0.001 |
Cluster 1 - "Blockbusters"
These are the movies that are hits at the box office, that everyone is talking about, and that are in contention during awards season.
Compared to the other clusters, these movies on average are newer and longer, with the highest gross, budget, number of critic reviews, number of user reviews, and number of IMDb user votes.
Cluster 2 - "Indie Movies/Cult Classics"
These are the movies that are made with limited resources and don't make large amounts at the box office but still have their place in the cinema world.
Compared to the other clusters, these movies on average are shorter, with the lowest gross, budget, number of critic reviews, number of user reviews, and number of IMDb user votes.
Cluster 3 - "Rainy Day Films"
These movies bridge the gap between the extremes of blockbusters and indie movies/cult classics. They have a decent-sized budget and usually recoup it at the box office, but they don't garner the same amount of attention or award nominations.
Compared to the other clusters, these movies on average sit in the middle on duration, gross, budget, number of critic reviews, number of user reviews, and number of IMDb user votes.
Before concluding the analysis, it's worth noting that there are a few drawbacks to this dataset and project: rows missing gross, budget, or content rating had to be dropped, which skews the remaining data toward newer movies; many of the Facebook like counts are 0 simply because no Facebook page exists; and because low-rated movies are underrepresented, the regression and classification models are least accurate for movies with low IMDb scores.
According to the dataset and analysis, movies are more likely to have these features: they are in color, in English, from the USA, rated PG-13 or R, roughly 100 minutes long, and most often dramas or comedies.
In general, movies with a higher IMDb score are more likely to have these features: more IMDb user votes, a longer duration, more critic and user reviews, more movie Facebook likes, and a higher profit and gross.
The goal of regression was to predict IMDb score, the y variable, using various numerical determinants from the dataset. We learned the following things through regression: of the four models tested, the Scikit-learn linear regression had the lowest MSE (about 0.65) with an R-squared of about 0.42, and its RMS of about 0.8 is driven largely by lower-rated movies, which the models struggle to predict.
The goal of classification was to predict whether a movie is 'bad,' 'ok,' 'good,' or 'excellent' instead of predicting the IMDb score overall. We achieved this by creating custom bins based on IMDb score. These are the things we learned through classification: the random forest model was the most accurate, correctly classifying about 76% of the test data, followed by the decision tree (71.5%), RFE with logistic regression (66.5%), logistic regression (65.9%), and KNN (59.4%); like the regression models, the classifiers tend to misclassify the lower-rated 'bad' movies.
The goal of clustering was to segment the data into distinct clusters and develop profiles based on each cluster's information using the 8 most important x variables. These are the things we learned after clustering:
We had to normalize the data before clustering so that variables were all on the same scale in terms of variance. Without normalization, the variables with a higher variance would be given more weight and distort the importance when determining clusters.
The ‘elbow’ in the elbow plot was at 3 clusters, therefore a total of 3 clusters were used. However, this is a subjective analysis, and others may have chosen a different number of clusters.
t-Testing was used to find statistically significant differences between each of the 3 clusters for each variable. Two clusters are statistically significantly different if the p-value is less than 0.05.
Using information gleaned from t-testing, we were able to build profiles with statistical certainty that each variable discussed was significantly different compared to other clusters. These profiles were:
Cluster 1 - "Blockbusters"
Compared to the other clusters, these movies on average are newer and longer, with the highest gross, budget, critic reviews, user reviews, and user votes.
Cluster 2 - "Indie Movies/Cult Classics"
Compared to the other clusters, these movies on average are shorter, with the lowest gross, budget, critic reviews, user reviews, and user votes.
Cluster 3 - "Rainy Day Films"
Compared to the other clusters, these movies on average sit in the middle on duration, gross, budget, critic reviews, user reviews, and user votes.
This dataset can be given to an executive producer, actor, director, agent, or anyone else in the movie business to help decide whether a potential movie is worth taking on, not necessarily in terms of making money but in terms of making a quality movie and winning awards. Before taking a role, producers, actors, directors, and others in the movie business should ask themselves whether the potential project has the features that this analysis associates with a high IMDb score.