Jace Bothner
Watch the video presentation here:
https://youtu.be/QsSIUkY10kY
If you're not able to view the interactive visuals, try viewing the project on my GitHub page:
https://github.com/jbothner21/MIS_665_Final_Project
The goal of this project is to analyze an IMDb movie dataset and use the x variables to predict the y variable, IMDb score. We will accomplish this using several data analytics methods: data visualization, correlation analysis, regression, classification, and clustering.
## Import packages
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
import pingouin as pg
import plotly.graph_objects as go
import plotly.offline as py
import plotly.express as px
import sklearn.linear_model as lm
from sklearn.metrics import mean_squared_error
from sklearn.metrics import explained_variance_score
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import f_regression
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import RFE
pd.set_option('display.max_columns', 500)
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import chi2
from sklearn.model_selection import GridSearchCV
import scikitplot as skplt
from graphviz import Source
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import pairwise_distances
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.externals.six import StringIO
import pydotplus
from sklearn.cluster import ward_tree
from scipy.cluster.hierarchy import dendrogram, linkage, ward
from scipy.spatial.distance import cdist
from math import sqrt
import warnings
warnings.filterwarnings("ignore")
## Import data
df = pd.read_csv("data/movie_metadata.csv")
df.head(1)
color | director_name | num_critic_for_reviews | duration | director_facebook_likes | actor_3_facebook_likes | actor_2_name | actor_1_facebook_likes | gross | genres | actor_1_name | movie_title | num_voted_users | cast_total_facebook_likes | actor_3_name | facenumber_in_poster | plot_keywords | movie_imdb_link | num_user_for_reviews | language | country | content_rating | budget | title_year | actor_2_facebook_likes | imdb_score | aspect_ratio | movie_facebook_likes | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Color | James Cameron | 723.0 | 178.0 | 0.0 | 855.0 | Joel David Moore | 1000.0 | 760505847.0 | Action|Adventure|Fantasy|Sci-Fi | CCH Pounder | Avatar | 886204 | 4834 | Wes Studi | 0.0 | avatar|future|marine|native|paraplegic | http://www.imdb.com/title/tt0499549/?ref_=fn_t... | 3054.0 | English | USA | PG-13 | 237000000.0 | 2009.0 | 936.0 | 7.9 | 1.78 | 33000 |
## Number of rows in dataset
len(df)
5043
## Check column names
df.columns
Index(['color', 'director_name', 'num_critic_for_reviews', 'duration', 'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name', 'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name', 'movie_title', 'num_voted_users', 'cast_total_facebook_likes', 'actor_3_name', 'facenumber_in_poster', 'plot_keywords', 'movie_imdb_link', 'num_user_for_reviews', 'language', 'country', 'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes', 'imdb_score', 'aspect_ratio', 'movie_facebook_likes'], dtype='object')
Some columns will clearly not be useful in our analysis. The 'movie_imdb_link', 'plot_keywords', and 'aspect_ratio' columns won't be meaningful, so we'll drop these extraneous columns.
## Drop unnecessary columns
columns = ['movie_imdb_link','plot_keywords','aspect_ratio']
df.drop(columns, inplace=True, axis=1)
df.head(1)
color | director_name | num_critic_for_reviews | duration | director_facebook_likes | actor_3_facebook_likes | actor_2_name | actor_1_facebook_likes | gross | genres | actor_1_name | movie_title | num_voted_users | cast_total_facebook_likes | actor_3_name | facenumber_in_poster | num_user_for_reviews | language | country | content_rating | budget | title_year | actor_2_facebook_likes | imdb_score | movie_facebook_likes | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Color | James Cameron | 723.0 | 178.0 | 0.0 | 855.0 | Joel David Moore | 1000.0 | 760505847.0 | Action|Adventure|Fantasy|Sci-Fi | CCH Pounder | Avatar | 886204 | 4834 | Wes Studi | 0.0 | 3054.0 | English | USA | PG-13 | 237000000.0 | 2009.0 | 936.0 | 7.9 | 33000 |
There are 45 duplicate rows in the dataset. We'll drop these duplicated rows to avoid repeated information that could influence our results.
## Check for duplicates
len(df[df.duplicated()])
45
## Drop duplicates and check number of rows
df = df.drop_duplicates()
len(df)
4998
Below are the null or missing value counts for each column. We'll have to deal with these null values in various ways because some methods of analysis will not allow null values.
## Check for missing values
df.isnull().sum().sort_values(ascending=False).head(19)
gross 874 budget 487 content_rating 301 title_year 107 director_facebook_likes 103 director_name 103 num_critic_for_reviews 49 actor_3_facebook_likes 23 actor_3_name 23 num_user_for_reviews 21 color 19 duration 15 facenumber_in_poster 13 actor_2_facebook_likes 13 actor_2_name 13 language 12 actor_1_name 7 actor_1_facebook_likes 7 country 5 dtype: int64
The three variables missing the most data are gross, budget, and content rating. Because of the nature of these variables, it's difficult to find a solution to fill the null values. There is too much variation in the dataset to replace with mean, median, or mode. Therefore we'll have to drop all rows that have null values for these three variables.
## Check for missing values
df.isnull().sum().sort_values(ascending=False).head(3)
gross 874 budget 487 content_rating 301 dtype: int64
## Drop null gross/budget
df.dropna(subset=['gross'], how='all', inplace = True)
df.dropna(subset=['budget'], how='all', inplace = True)
df.dropna(subset=['content_rating'], how='all', inplace = True)
len(df)
3806
Even after dropping the null values in gross, budget, and content rating, there are still 3,806 observations in the dataset. This is still a sufficient sample size to analyze and draw conclusions from. Now there are still 10 columns that have null values.
## Check for missing values
df.isnull().sum().sort_values(ascending=False).head(10)
actor_3_facebook_likes 6 facenumber_in_poster 6 actor_3_name 6 color 2 actor_2_facebook_likes 2 language 2 actor_2_name 2 num_critic_for_reviews 1 actor_1_facebook_likes 1 actor_1_name 1 dtype: int64
One method of filling missing data, especially categorical data, is to replace the null value with the most commonly observed value.
## Majority of movies are in color
df['color'].value_counts()
Color 3680 Black and White 124 Name: color, dtype: int64
## Majority of movies are in English
df['language'].value_counts().sort_values(ascending=False).head()
English 3644 French 34 Spanish 24 Mandarin 14 German 11 Name: language, dtype: int64
In the case of the color and language columns, each variable has one value that occurs far more frequently than the others. So we'll replace the missing color values with 'Color' and the missing language values with 'English'. Now only 8 columns have missing values.
# Replace null values with the most popular value in categorical columns
df = df.fillna({'color': 'Color'})
df = df.fillna({'language': 'English'})
df.isnull().sum().sort_values(ascending=False).head(8)
actor_3_facebook_likes 6 facenumber_in_poster 6 actor_3_name 6 actor_2_facebook_likes 2 actor_2_name 2 actor_1_name 1 num_critic_for_reviews 1 actor_1_facebook_likes 1 dtype: int64
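As a more general sketch of the same idea (not the exact cell above), the most frequent value can be computed from the data itself with pandas' mode(), which avoids hard-coding 'Color' and 'English':
## Sketch: fill categorical nulls with each column's most frequent value
for col in ['color', 'language']:
    most_common = df[col].mode()[0]        # mode() returns a Series; take the top value
    df[col] = df[col].fillna(most_common)  # equivalent to the hard-coded fillna above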
Another technique for filling missing data is to calculate a measure of central tendency, like the mean or median, for each column and replace the null values with it. This is the methodology we'll use for the missing numeric variables: the number of faces in the movie poster, the number of critic reviews, and the actors' Facebook likes. Since there is some variation in the data, and since all the values are whole numbers, we'll fill the missing values with the median.
## Median replace because of variation
df.median()
num_critic_for_reviews 136.0 duration 106.0 director_facebook_likes 59.0 actor_3_facebook_likes 432.0 actor_1_facebook_likes 1000.0 gross 28749642.5 num_voted_users 52312.5 cast_total_facebook_likes 3951.0 facenumber_in_poster 1.0 num_user_for_reviews 205.0 budget 25000000.0 title_year 2005.0 actor_2_facebook_likes 670.0 imdb_score 6.6 movie_facebook_likes 218.0 dtype: float64
# Replace null values with median
newposter = df['facenumber_in_poster'].median()
df = df.fillna({'facenumber_in_poster': newposter})
newcritic = df['num_critic_for_reviews'].median()
df = df.fillna({'num_critic_for_reviews': newcritic})
newact1 = df['actor_1_facebook_likes'].median()
df = df.fillna({'actor_1_facebook_likes': newact1})
newact2 = df['actor_2_facebook_likes'].median()
df = df.fillna({'actor_2_facebook_likes': newact2})
newact3 = df['actor_3_facebook_likes'].median()
df = df.fillna({'actor_3_facebook_likes': newact3})
df.isnull().sum().sort_values(ascending=False).head(3)
actor_3_name 6 actor_2_name 2 actor_1_name 1 dtype: int64
After filling these values with the median, three columns still have null values: the names of actors one, two, and three. We could fill these missing values, but since there are numerous actors in the dataset, and since these few missing values won't affect our analysis, we'll leave them as null.
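As an aside, the five median fills above can be written as a single loop; a minimal sketch that is equivalent to the cell above, assuming the same df:
## Sketch: median-fill the same numeric columns in one loop
median_cols = ['facenumber_in_poster', 'num_critic_for_reviews', 'actor_1_facebook_likes',
               'actor_2_facebook_likes', 'actor_3_facebook_likes']
for col in median_cols:
    df[col] = df[col].fillna(df[col].median())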
After filling null values, we'll want to handle any outliers in the data that might influence analysis. The median budget is 25 million dollars, so some outrageously high budgets in the dataset stand out as outliers. We need to remove these values, which are either typos or inaccurate information, since no movie has ever had a budget anywhere close to 12 billion dollars.
## Median budget for context
df['budget'].median()
25000000.0
## Check for outliers
pd.options.display.float_format = '{:,.0f}'.format
df['budget'].sort_values(ascending=False).head()
2988 12,215,500,000 3859 4,200,000,000 3005 2,500,000,000 2323 2,400,000,000 2334 2,127,519,898 Name: budget, dtype: float64
## Reset scientific notation
pd.reset_option('^display.', silent=True)
We'll treat any budget above 300 million dollars as an outlier and drop those rows. After doing so, 12 outliers are removed from the dataset.
## Drop outliers
df.drop( df[ df['budget'] >= 300000001 ].index , inplace=True)
len(df)
3794
Now we'll sort out the content rating category. There are 12 different options in the content rating column. We can consolidate these ratings and organize them into more manageable categories.
## Check content ratings
df['content_rating'].value_counts()
R 1715 PG-13 1311 PG 572 G 91 Not Rated 42 Unrated 24 Approved 17 X 10 NC-17 6 Passed 3 M 2 GP 1 Name: content_rating, dtype: int64
We'll redistribute the content rating categories the following way: GP, Approved, and Passed become PG; M becomes R; and Not Rated, X, and NC-17 are grouped as Unrated.
Now we have 5 different content ratings that follow a natural scale, from movies that are more family friendly to movies that have mature themes.
## Redistribute into more manageable groups
df = df.replace('GP', 'PG')
df = df.replace('Approved', 'PG')
df = df.replace('Passed', 'PG')
df = df.replace('M', 'R')
df = df.replace('Not Rated', 'Unrated')
df = df.replace('X', 'Unrated')
df = df.replace('NC-17', 'Unrated')
df['content_rating'].value_counts()
R 1717 PG-13 1311 PG 593 G 91 Unrated 82 Name: content_rating, dtype: int64
In order to analyze categorical columns with certain methodologies, we'll have to convert those columns from objects to integers. We'll do this for the following columns: color, language, country, and content rating. For color, we'll convert the column into a dummy variable since there are only two options. The movie is either in black and white, which is assigned a 0, or in color, which is assigned a 1.
## Convert color to dummy variable
df = pd.get_dummies(df, columns=['color'])
df = df.drop(['color_ Black and White'], axis=1)
df = df.rename(columns={'color_Color':'color'})
df['color'].value_counts()
1 3670 0 124 Name: color, dtype: int64
Since we already redistributed the content rating column into 5 groups that follow a natural scale, we can assign each of the ratings a number. G will be 1, PG will be 2, etc. In other words, a 1 will be a more family friendly movie, and a 5 will have more adult themes.
## Remap content rating from categorical to integer
df['content_rating'] = df['content_rating'].map({'G': 1, 'PG': 2, 'PG-13':3, 'R': 4, 'Unrated': 5})
df['content_rating'].value_counts()
4 1717 3 1311 2 593 1 91 5 82 Name: content_rating, dtype: int64
We've already established that the categorical columns language and country have a logical most frequent value: English is the most occurring language and the United States is the most common country. Therefore, we can create a new column with dummy variables for each of these categories. For language, 1 means the movie is in English, 0 means the movie is in a different language. For country, 1 means the movie is from the United States, and 0 means the movie is outside the USA.
## Remap language and country to dummy variables
df.loc[df.language == 'English', 'in_english'] = 1
df.loc[df.language != 'English', 'in_english'] = 0
df.loc[df.country == 'USA', 'from_USA'] = 1
df.loc[df.country != 'USA', 'from_USA'] = 0
df['in_english'].value_counts()
1.0 3645 0.0 149 Name: in_english, dtype: int64
df['from_USA'].value_counts()
1.0 3025 0.0 769 Name: from_USA, dtype: int64
Now we need to focus on the genres column. Since it's difficult to sum a single movie up into just one genre, many movies list multiple genres. We'll have to split each movie's listed genres so we can analyze them individually.
## Check genres in dataframe
df['genres'].value_counts().head()
Comedy|Drama|Romance 150 Drama 147 Comedy 143 Comedy|Drama 142 Comedy|Romance 136 Name: genres, dtype: int64
## Create a new data frame
gen = df[['genres','imdb_score']]
gen.head(3)
genres | imdb_score | |
---|---|---|
0 | Action|Adventure|Fantasy|Sci-Fi | 7.9 |
1 | Action|Adventure|Fantasy | 7.1 |
2 | Action|Adventure|Thriller | 6.8 |
## Check number of rows
len(gen)
3794
Since we're splitting up the genres, each movie could have multiple rows depending on the number of genres listed. For instance, if a certain movie has 5 genres listed, that movie will have 5 separate rows. For this reason, there are 11,281 rows after splitting the genres.
## Split genres into multiple rows
spl = pd.DataFrame(gen.genres.str.split('|').tolist(), index=gen.imdb_score).stack()
spl = spl.reset_index()[[0, 'imdb_score']]
spl.columns = ['genres', 'imdb_score']
spl.head(3)
genres | imdb_score | |
---|---|---|
0 | Action | 7.9 |
1 | Adventure | 7.9 |
2 | Fantasy | 7.9 |
len(spl)
11281
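On pandas 0.25 or newer, the same long-format table can be produced more directly with explode(); this is only a sketch of an equivalent alternative, not the code used above:
## Sketch: equivalent genre split using DataFrame.explode (pandas >= 0.25)
spl_alt = gen.assign(genres=gen['genres'].str.split('|')).explode('genres')
spl_alt = spl_alt.reset_index(drop=True)[['genres', 'imdb_score']]
len(spl_alt)  # should match the 11,281 rows above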
It may provide some insight to create a new column based on gross and budget. Subtracting budget from gross will yield the total profit of the movie.
## Calculate profit column
df['profit'] = df['gross'] - df['budget']
df.columns
Index(['director_name', 'num_critic_for_reviews', 'duration', 'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name', 'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name', 'movie_title', 'num_voted_users', 'cast_total_facebook_likes', 'actor_3_name', 'facenumber_in_poster', 'num_user_for_reviews', 'language', 'country', 'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes', 'imdb_score', 'movie_facebook_likes', 'color', 'in_english', 'from_USA', 'profit'], dtype='object')
For analysis later in the project, we'll set custom bins based on the IMDb score of the movie. The movies in each bin represent the following: bin 1 ('bad') is a score from 0 to 4, bin 2 ('ok') is 4 to 6, bin 3 ('good') is 6 to 8, and bin 4 ('excellent') is 8 to 10.
## Set bins
df['movie_quality'] = pd.cut(df['imdb_score'], [0, 4, 6, 8, 10], labels=['1', '2', '3', '4'])
df['movie_quality'].value_counts().sort_index()
1 95 2 1067 3 2476 4 156 Name: movie_quality, dtype: int64
Now we'll reorder the columns so that similar variables are next to each other for easier analysis. We'll also create a copy of the dataframe with only integer columns for use later in the project.
## Reorder columns
df = df[['imdb_score', 'movie_quality', 'movie_title', 'title_year', 'content_rating', 'duration', 'country', 'from_USA', 'language', 'in_english', 'color', 'gross', 'budget', 'profit', 'director_name', 'actor_1_name', 'actor_2_name', 'actor_3_name', 'movie_facebook_likes', 'director_facebook_likes', 'actor_1_facebook_likes', 'actor_2_facebook_likes', 'actor_3_facebook_likes', 'cast_total_facebook_likes', 'num_critic_for_reviews', 'num_user_for_reviews', 'num_voted_users', 'facenumber_in_poster']]
df.head(1)
imdb_score | movie_quality | movie_title | title_year | content_rating | duration | country | from_USA | language | in_english | ... | movie_facebook_likes | director_facebook_likes | actor_1_facebook_likes | actor_2_facebook_likes | actor_3_facebook_likes | cast_total_facebook_likes | num_critic_for_reviews | num_user_for_reviews | num_voted_users | facenumber_in_poster | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7.9 | 3 | Avatar | 2009.0 | 3 | 178.0 | USA | 1.0 | English | 1.0 | ... | 33000 | 0.0 | 1000.0 | 936.0 | 855.0 | 4834 | 723.0 | 3054.0 | 886204 | 0.0 |
1 rows × 28 columns
## New dataframe for integer columns
df1 = df.drop(['movie_title', 'country', 'language', 'director_name', 'actor_1_name', 'actor_2_name', 'actor_3_name'], axis =1)
df1.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 3794 entries, 0 to 5042 Data columns (total 21 columns): imdb_score 3794 non-null float64 movie_quality 3794 non-null category title_year 3794 non-null float64 content_rating 3794 non-null int64 duration 3794 non-null float64 from_USA 3794 non-null float64 in_english 3794 non-null float64 color 3794 non-null uint8 gross 3794 non-null float64 budget 3794 non-null float64 profit 3794 non-null float64 movie_facebook_likes 3794 non-null int64 director_facebook_likes 3794 non-null float64 actor_1_facebook_likes 3794 non-null float64 actor_2_facebook_likes 3794 non-null float64 actor_3_facebook_likes 3794 non-null float64 cast_total_facebook_likes 3794 non-null int64 num_critic_for_reviews 3794 non-null float64 num_user_for_reviews 3794 non-null float64 num_voted_users 3794 non-null int64 facenumber_in_poster 3794 non-null float64 dtypes: category(1), float64(15), int64(4), uint8(1) memory usage: 600.4 KB
While the ultimate goal of this analysis is to predict the IMDb score of certain movies, it may help to first get a better understanding of our data through visualization.
This first figure shows the distribution of the IMDb scores in the form of a box plot. The median IMDb score is 6.6, the first quartile is a score of 5.9, and the third quartile is 7.2.
## IMDb score
fig = px.box(df, y="imdb_score", points='all', hover_name='movie_title')
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='IMDb Score Boxplot')
Now we'll visualize the data by year. We can see from the histogram that we have more data from movies made in the late 1990s and onward. This might be because we had to drop the rows that were missing gross and budget values, and that information is less readily available for older movies. Alternatively, it's also possible that advances in technology have lowered the barrier to entry for making films, leading to more films being made.
## Year
fig = px.histogram(df, x="title_year")
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Movies by Year')
Drama and comedy are the most popular movie genres, as shown by the bar chart below.
## Genre
fig = px.bar(spl, x="genres")
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Movies by Genre')
We've already seen the numbers for content rating while cleaning the data, but by visualizing the content rating in a pie chart, we can see just how prevalent PG-13 and R rated movies are. In fact, almost 80% of the movies in the dataset fall into these two ratings.
## Content rating
labels = ['R','PG-13','PG','G','Unrated']
values = df['content_rating'].value_counts().values
trace=go.Pie(labels=labels,values=values)
py.iplot([trace])
A histogram of duration shows us that most movies are clustered around 100 minutes (1 hour and 40 minutes). However, the duration histogram is skewed right, meaning the distribution has a longer tail toward longer runtimes.
## Duration
fig = px.histogram(df, x="duration")
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Movie Duration')
The histogram of movie gross shown below is severely right-skewed. This means that it's fairly rare for a movie to gross more than 50 million dollars, and a decent portion of the movies in the dataset gross below 10 million dollars.
## Gross
fig = px.histogram(df, x="gross")
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Movie Gross')
The histogram displaying movie budget tells a similar story to that of movie gross. The histogram is skewed right, although less drastically, meaning it's more common for a movie to have a lower budget than a higher one. It's actually fairly common for a movie to have a budget of less than 5 million dollars.
## Budget
fig = px.histogram(df, x="budget")
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Movie Budget')
The relationship between a movie's gross and its budget is shown below. There is a moderate positive correlation between the two, which is fairly intuitive given the similarities between the two histograms above. Essentially, as the movie budget increases, the movie's gross should increase too; the extent of this relationship will be discussed in the correlation analysis section.
## Gross by Budget
fig = px.scatter(df, x="budget", y="gross", hover_name='movie_title')
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Movie Gross by Budget')
The profit histogram below shows that movie profit is fairly normally distributed, with the center around 0 dollars. This means that a large portion of movies just break even in terms of profit. However, profit isn't perfectly normally distributed, as the graph is slightly right-skewed, meaning that some movies make a huge profit.
## Profit
fig = px.histogram(df, x="profit")
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Movie Profit')
The scatter plot below shows that movie gross and profit are moderately positively correlated. This is not surprising and is fairly intuitive: as a movie grosses more, it will likely make a larger profit.
## Profit by gross
fig = px.scatter(df, x="profit", y="gross", hover_name='movie_title')
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Movie Profit by Gross')
The bar chart below shows the top 10 most prolific directors, or in other words, those who have directed the most movies.
## Director most movies
d1 = df['director_name'].value_counts().head(10)
d1 = pd.DataFrame(data=d1).reset_index()
d1.columns= ['director_name', 'count']
fig = px.bar(d1, x="director_name", y='count')
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Top 10 Most Prolific Directors')
The bar chart below shows the directors who have the highest average IMDb scores across all of their movies. For context, as we found from earlier analysis, the median IMDb score is 6.6.
## Director by IMDb score
d2 = df.groupby('director_name')['imdb_score'].mean().sort_values(ascending=False).head(10)
d2 = pd.DataFrame(data=d2).reset_index()
d2.columns= ['director_name', 'avg_imdb_score']
fig = px.bar(d2, x="director_name", y='avg_imdb_score')
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Top 10 Best Directors')
Similar to the prolific directors graph, the graph below shows the actors who have appeared in the most movies in a leading role.
## Actor 1 most prolific
a1 = df['actor_1_name'].value_counts().head(10)
a1 = pd.DataFrame(data=a1).reset_index()
a1.columns= ['actor_1_name', 'count']
fig = px.bar(a1, x="actor_1_name", y='count')
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Top 10 Most Prolific Leading Actors')
Also similar to the best directors graph, the graph below shows the top 10 leading actors, based on average IMDb score across all their movies.
## Actor 1 by IMDb score best
a2 = df.groupby('actor_1_name')['imdb_score'].mean().sort_values(ascending=False).head(10)
a2 = pd.DataFrame(data=a2).reset_index()
a2.columns= ['actor_1_name', 'avg_imdb_score']
fig = px.bar(a2, x="actor_1_name", y='avg_imdb_score')
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Top 10 Best Leading Actors')
Conversely, the graph below shows the 10 worst leading actors based on their average IMDb score.
## Actor 1 by IMDb score worst
a3 = df.groupby('actor_1_name')['imdb_score'].mean().sort_values(ascending=False).tail(10)
a3 = pd.DataFrame(data=a3).reset_index()
a3.columns= ['actor_1_name', 'avg_imdb_score']
fig = px.bar(a3, x="actor_1_name", y='avg_imdb_score')
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Top 10 Worst Leading Actors')
The top 10 most prolific supporting actors are shown below.
## Actor 2 most prolific
a4 = df['actor_2_name'].value_counts().head(10)
a4 = pd.DataFrame(data=a4).reset_index()
a4.columns= ['actor_2_name', 'count']
fig = px.bar(a4, x="actor_2_name", y='count')
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Top 10 Most Prolific Supporting Actors')
The top 10 supporting actors based on average IMDb score are shown in the bar chart below.
## Actor 2 by IMDb score best
a5 = df.groupby('actor_2_name')['imdb_score'].mean().sort_values(ascending=False).head(10)
a5 = pd.DataFrame(data=a5).reset_index()
a5.columns= ['actor_2_name', 'avg_imdb_score']
fig = px.bar(a5, x="actor_2_name", y='avg_imdb_score')
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Top 10 Best Supporting Actors')
The scatter plot below shows the relationship between movie Facebook likes and IMDb score. A substantial number of the movies in the dataset have 0 Facebook likes, which means that many of these movies simply don't have Facebook pages for one reason or another. Because of this, the variable may not be a strong determinant of IMDb score, but we'll discuss this more in the correlation analysis section.
## Movie likes by IMDb score
fig = px.scatter(df, x="movie_facebook_likes", y="imdb_score", hover_name='movie_title')
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Movie Facebook Likes by IMDb Score')
The director likes by IMDb score scatter plot tells a similar story to that of movie Facebook likes. At the time of data collection, many of the directors did not have a Facebook page.
## Director likes by IMDb score
fig = px.scatter(df, x="director_facebook_likes", y="imdb_score", hover_name='movie_title')
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Director Facebook Likes by IMDb Score')
Similar to the other Facebook variables, there is seemingly little correlation between cast Facebook likes and IMDb score.
## Cast likes
fig = px.scatter(df, x="cast_total_facebook_likes", y="imdb_score", hover_name='movie_title')
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Cast Facebook Likes by IMDb Score')
The figure below shows the distribution of the number of critic reviews in the form of a box plot.
## Critics
fig = px.box(df, y="num_critic_for_reviews", points='all', hover_name='movie_title')
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Number of Critic Reviews')
The histogram below shows the distribution of the number of users who gave a review on IMDb. As with other histograms shown before, this one is also skewed right. Almost half (1852/3794) of the movies in the dataset have under 200 user reviews.
## Users
fig = px.histogram(df, x="num_user_for_reviews")
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Number of IMDb User Reviews')
The histogram for the number of IMDb user votes is shown below, and much like the user reviews variable, it is skewed right; however, the skew is much more drastic. A substantial portion of movies in the dataset have fewer than 20,000 votes.
## Voted
fig = px.histogram(df, x="num_voted_users")
fig.update_traces(marker_color='blue', marker_line_color='blue', opacity=0.6)
fig.update_layout(title_text='Number of IMDb Users Who Voted')
Correlation is a measure of relationship strength between two variables. If two variables are highly positively correlated, then an increase in one will generally lead to an increase in the other. If two variables are highly negatively correlated, then an increase in one will generally lead to a decrease in the other. The correlation coefficients for each of the x variables with the y variable, IMDb score, are shown below.
## Correlation with IMDb score
df[df.columns[0:]].corr()['imdb_score'][:].sort_values(ascending=False)
imdb_score 1.000000 num_voted_users 0.479882 duration 0.370148 num_critic_for_reviews 0.349811 num_user_for_reviews 0.324998 movie_facebook_likes 0.283770 profit 0.255117 gross 0.218389 director_facebook_likes 0.191684 content_rating 0.120074 cast_total_facebook_likes 0.106502 actor_2_facebook_likes 0.101541 actor_1_facebook_likes 0.093774 actor_3_facebook_likes 0.065357 budget 0.038711 facenumber_in_poster -0.069214 color -0.118699 title_year -0.133974 from_USA -0.135822 in_english -0.169427 Name: imdb_score, dtype: float64
## Correlation heatmap
plt.figure(figsize=(14,10))
corr = df.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(df.corr(), mask=mask, vmax=.8, square=True, annot=True, fmt=".2f");
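For any single pair of variables, scipy (imported at the top) can also report whether a correlation is statistically significant. A minimal sketch using stats.pearsonr on one of the stronger relationships:
## Sketch: correlation coefficient and p-value for one pair of variables
r, p = stats.pearsonr(df1['num_voted_users'], df1['imdb_score'])
print("r = %.3f, p-value = %.3g" % (r, p))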
The variables that are the highest determinants of overall IMDb score are likely the ones that are most correlated with it. Interestingly, no variable has a strong correlation, so which variables count as 'highly' correlated has to be judged relative to the others. In this case, the variables we'll deem 'highly' correlated (above 0.2) are: num_voted_users (0.48), duration (0.37), num_critic_for_reviews (0.35), num_user_for_reviews (0.32), movie_facebook_likes (0.28), profit (0.26), and gross (0.22).
These findings are generalized, and how strongly each variable is likely to affect IMDb score follows the size of its correlation coefficient, as shown above. So now that we have this information, what can we do with it? The next step is to set up a regression model based on these determinants to predict IMDb score.
The first step in regression is setting x and y variables. The y variable, the variable we're trying to predict, is IMDb score. The x variables, or determinants of the y variable, will be all the numerical columns aside from IMDb score.
## Set X and y variables
y = df1['imdb_score']
X = df1.drop(['movie_quality', 'imdb_score'], axis =1)
The first model we'll build and test is Scikit-learn's linear regression algorithm.
## Build regression model based on scikit-learn algorithm
model1 = lm.LinearRegression()
model1.fit(X, y)
model1_y = model1.predict(X)
## Display coefficients for each variable
coef = ["%.3f" % i for i in model1.coef_]
xcolumns = [ i for i in X.columns ]
list(zip(xcolumns, coef))
[('title_year', '-0.017'), ('content_rating', '0.037'), ('duration', '0.011'), ('from_USA', '-0.205'), ('in_english', '-0.701'), ('color', '-0.335'), ('gross', '-0.000'), ('budget', '-0.000'), ('profit', '0.000'), ('movie_facebook_likes', '-0.000'), ('director_facebook_likes', '0.000'), ('actor_1_facebook_likes', '0.000'), ('actor_2_facebook_likes', '0.000'), ('actor_3_facebook_likes', '0.000'), ('cast_total_facebook_likes', '-0.000'), ('num_critic_for_reviews', '0.003'), ('num_user_for_reviews', '-0.001'), ('num_voted_users', '0.000'), ('facenumber_in_poster', '-0.025')]
The strength or accuracy of a regression model is evaluated through mean squared error (MSE) and R-squared. The goal of a regression model is to minimize MSE and maximize R-squared; a lower MSE means less error in the model.
## Model evaluation
print("Mean Square Error: ", mean_squared_error(y, model1_y))
print("Variance or R-squared: ", explained_variance_score(y, model1_y))
Mean Square Error: 0.6492138737974602 Variance or R-squared: 0.4164775382245356
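To make these metrics concrete, here is a minimal sketch of the underlying formulas computed directly with NumPy on the same actual and predicted values (explained_variance_score matches R-squared whenever the residuals average to zero, as they do for an ordinary least squares fit with an intercept):
## Sketch: MSE and R-squared from their definitions
residuals = y - model1_y
mse_manual = np.mean(residuals ** 2)                                   # average squared error
r2_manual = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)   # 1 - SS_res / SS_tot
print("Manual MSE: ", mse_manual)
print("Manual R-squared: ", r2_manual)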
Next we'll test the Statsmodels regression algorithm, which is based on ordinary least squares (OLS).
## Build regression model based on ordinary least squares
runs_reg_model = ols("imdb_score~title_year+content_rating+duration+from_USA+in_english+color+gross+budget+profit+movie_facebook_likes+actor_1_facebook_likes+actor_2_facebook_likes+actor_3_facebook_likes+cast_total_facebook_likes+num_critic_for_reviews+num_user_for_reviews+num_voted_users+facenumber_in_poster", df1)
runs_reg = runs_reg_model.fit()
print(runs_reg.summary())
OLS Regression Results ============================================================================== Dep. Variable: imdb_score R-squared: 0.416 Model: OLS Adj. R-squared: 0.414 Method: Least Squares F-statistic: 158.4 Date: Tue, 26 May 2020 Prob (F-statistic): 0.00 Time: 13:11:29 Log-Likelihood: -4564.7 No. Observations: 3794 AIC: 9165. Df Residuals: 3776 BIC: 9278. Df Model: 17 Covariance Type: nonrobust ============================================================================================= coef std err t P>|t| [0.025 0.975] --------------------------------------------------------------------------------------------- Intercept 40.5584 3.229 12.560 0.000 34.227 46.889 title_year -0.0173 0.002 -10.713 0.000 -0.020 -0.014 content_rating 0.0368 0.017 2.105 0.035 0.003 0.071 duration 0.0112 0.001 16.637 0.000 0.010 0.012 from_USA -0.2021 0.036 -5.683 0.000 -0.272 -0.132 in_english -0.7014 0.073 -9.603 0.000 -0.845 -0.558 color -0.3387 0.076 -4.486 0.000 -0.487 -0.191 gross -1.876e-06 1.49e-07 -12.561 0.000 -2.17e-06 -1.58e-06 budget 1.871e-06 1.49e-07 12.527 0.000 1.58e-06 2.16e-06 profit 1.877e-06 1.49e-07 12.570 0.000 1.58e-06 2.17e-06 movie_facebook_likes -1.927e-06 9.09e-07 -2.119 0.034 -3.71e-06 -1.44e-07 actor_1_facebook_likes 5.629e-05 1.26e-05 4.484 0.000 3.17e-05 8.09e-05 actor_2_facebook_likes 5.986e-05 1.32e-05 4.534 0.000 3.4e-05 8.57e-05 actor_3_facebook_likes 5.493e-05 2.04e-05 2.687 0.007 1.48e-05 9.5e-05 cast_total_facebook_likes -5.385e-05 1.25e-05 -4.301 0.000 -7.84e-05 -2.93e-05 num_critic_for_reviews 0.0026 0.000 13.581 0.000 0.002 0.003 num_user_for_reviews -0.0006 5.56e-05 -10.546 0.000 -0.001 -0.000 num_voted_users 3.357e-06 1.65e-07 20.288 0.000 3.03e-06 3.68e-06 facenumber_in_poster -0.0253 0.007 -3.853 0.000 -0.038 -0.012 ============================================================================== Omnibus: 601.448 Durbin-Watson: 1.975 Prob(Omnibus): 0.000 Jarque-Bera (JB): 1317.164 Skew: -0.927 Prob(JB): 9.58e-287 Kurtosis: 5.213 Cond. No. 6.77e+15 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The smallest eigenvalue is 9.34e-13. This might indicate that there are strong multicollinearity problems or that the design matrix is singular.
The R-squared values of the Statsmodels and Scikit-learn models are virtually identical (0.416), but Statsmodels reports a slightly higher MSE, meaning it's a slightly less accurate model than Scikit-learn's.
## Evaluation of MSE
runs_reg.mse_resid
0.6525603200981047
Now we'll build a regression model using the Lasso regularization algorithm.
## Build regression model based on regularization algorithm
modelL = lm.Lasso()
modelL.fit(X, y)
modelL_y = modelL.predict(X)
## Display coefficients for each variable
coef = ["%.3f" % i for i in modelL.coef_]
xcolumns = [ i for i in X.columns ]
list(zip(xcolumns, coef))
[('title_year', '-0.004'), ('content_rating', '0.000'), ('duration', '0.011'), ('from_USA', '-0.000'), ('in_english', '-0.000'), ('color', '-0.000'), ('gross', '-0.000'), ('budget', '-0.000'), ('profit', '0.000'), ('movie_facebook_likes', '-0.000'), ('director_facebook_likes', '0.000'), ('actor_1_facebook_likes', '0.000'), ('actor_2_facebook_likes', '0.000'), ('actor_3_facebook_likes', '0.000'), ('cast_total_facebook_likes', '-0.000'), ('num_critic_for_reviews', '0.002'), ('num_user_for_reviews', '-0.000'), ('num_voted_users', '0.000'), ('facenumber_in_poster', '-0.000')]
This model has a higher MSE and a lower R-squared than previous models, meaning it's less accurate.
## Model evaluation
print("Mean Square Error: ", mean_squared_error(y, modelL_y))
print("Variance or R-squared: ", explained_variance_score(y, modelL_y))
Mean Square Error: 0.705537813993835 Variance or R-squared: 0.36585279718498065
Next we'll build a model using feature selection. SelectKBest will choose the 12 most important variables that we'll be able to build a regression model from.
## Select 12 most important variables using SelectKBest algorithm
X1 = SelectKBest(f_regression, k=12).fit_transform(X, y)
selector = SelectKBest(f_regression, k=12).fit(X, y)
idxs_selected = selector.get_support(indices=True)
print(idxs_selected)
[ 0 1 2 3 4 6 8 9 10 15 16 17]
## Check which variables were chosen
X.head(1)
title_year | content_rating | duration | from_USA | in_english | color | gross | budget | profit | movie_facebook_likes | director_facebook_likes | actor_1_facebook_likes | actor_2_facebook_likes | actor_3_facebook_likes | cast_total_facebook_likes | num_critic_for_reviews | num_user_for_reviews | num_voted_users | facenumber_in_poster | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2009.0 | 3 | 178.0 | 1.0 | 1.0 | 1 | 760505847.0 | 237000000.0 | 523505847.0 | 33000 | 0.0 | 1000.0 | 936.0 | 855.0 | 4834 | 723.0 | 3054.0 | 886204 | 0.0 |
The 12 most important variables according to SelectKBest are: title_year, content_rating, duration, from_USA, in_english, gross, profit, movie_facebook_likes, director_facebook_likes, num_critic_for_reviews, num_user_for_reviews, and num_voted_users.
The regression model built with feature selection has a higher MSE and a lower R-squared than the previously tested models, meaning it's less accurate.
## Build regression model based on 12 chosen variables
modelF = lm.LinearRegression()
modelF.fit(X1, y)
modelF_y = modelF.predict(X1)
## Model evaluation
print("Mean Square Error: ", mean_squared_error(y, modelF_y))
print("Variance or R-squared: ", explained_variance_score(y, modelF_y))
Mean Square Error: 0.6608544967460929 Variance or R-squared: 0.4060147843713958
By analyzing the MSE of each of the regression models, we can tell that the Scikit-learn linear regression is the most accurate model because it has the lowest MSE of the 4 models tested. This means that the Scikit-learn model has the least total error in its predictions.
## Evaluation of all models
print("Scikit-learn MSE:", mean_squared_error(y, model1_y))
print("Statsmodel MSE:", runs_reg.mse_resid)
print("Regularization MSE:", mean_squared_error(y, modelL_y))
print("Feature selection MSE:", mean_squared_error(y, modelF_y))
Scikit-learn MSE: 0.6492138737974602 Statsmodel MSE: 0.6525603200981047 Regularization MSE: 0.705537813993835 Feature selection MSE: 0.6608544967460929
The plot below shows the actual IMDb scores plotted against Scikit-learn's predicted scores. Below the plot is the root mean squared error (RMS) of the Scikit-learn model. This is helpful in analysis because RMS puts the error term on the same scale as the y variable. In other words, on average, this model misses the IMDb score by about 0.8. This seems very high, especially since 0.8 can be the difference between a good movie and a great movie. However, the reason for the high RMS is that the model doesn't predict IMDb scores below 5 or 6, so there is a greater amount of error for lower-rated movies, which increases the error overall. This is most likely due to the nature of the dataset rather than an issue with the model. In other words, because of the dataset, the model will be more accurate for IMDb scores above 5 or 6.
## Scatterplot actual vs predicted for Scikit-learn
plt.subplots()
plt.scatter(y, model1_y)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4) #dotted line represents perfect prediction (actual = predicted)
plt.title('Scikit-learn Actual vs Predicted')
plt.xlabel('Actual IMDb Score')
plt.ylabel('Predicted IMDb Score')
plt.show()
## Display RMS of Scikit-learn model
print("Scikit-learn RMS: ", sqrt(mean_squared_error(y, model1_y)))
Scikit-learn RMS: 0.8057380925570419
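One caveat worth flagging: the MSE, R-squared, and RMS above are computed on the same rows the models were fit on. A hedged sketch of how the Scikit-learn model could instead be scored on a held-out 30% split (using train_test_split, which is already imported and is used in the classification section below); the exact numbers would differ from those above:
## Sketch: score the linear regression on a held-out test split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
holdout_model = lm.LinearRegression().fit(X_tr, y_tr)
holdout_pred = holdout_model.predict(X_te)
print("Hold-out MSE: ", mean_squared_error(y_te, holdout_pred))
print("Hold-out RMS: ", sqrt(mean_squared_error(y_te, holdout_pred)))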
Earlier on in the project, we set custom bins based on the IMDb score of the movie.
## Show custom movie quality bins
df1['movie_quality'].value_counts().sort_index()
1 95 2 1067 3 2476 4 156 Name: movie_quality, dtype: int64
Now we can use the dataset to build classification models to predict which bin each movie falls into. Effectively, we are predicting whether a movie is 'bad,' 'ok,' 'good,' or 'excellent' instead of predicting the IMDb score itself.
## Declare X/y variables
y = df1['movie_quality']
X = df1.drop(['movie_quality', 'imdb_score'], axis =1)
print(y.shape, X.shape)
(3794,) (3794, 19)
First, we'll use a decision tree classification model. We split the dataset into training and testing groups: 70% of the data will be used to train the algorithm, while the remaining 30% will be used to test the model and its accuracy.
## Validation
X_train, X_test_dt, y_train, y_test_dt = train_test_split(X, y, test_size=0.3, random_state=0)
## Initialize
dt = DecisionTreeClassifier(max_depth=5, min_samples_leaf=5)
## Train
dt = dt.fit(X_train, y_train)
## Model evaluation
print(metrics.accuracy_score(y_test_dt, dt.predict(X_test_dt)))
0.7155399473222125
A classification model is evaluated by its accuracy: the number of correctly classified data points in the test set divided by the total number of data points in the test set. In this case, the model correctly classifies the data 71.5% of the time. A confusion matrix is a visualization of the accuracy of a classification model. The goal is to maximize the true rates, the numbers on the diagonal from top left to bottom right; naturally, maximizing the true rates means minimizing the rest, which are the false rates. In this case, the true rate is high enough to suggest a decent amount of accuracy in the model. However, it's worth noting that the model does not correctly predict any 'bad' movies and instead falsely classifies them as 'ok' or 'good.' Similarly, there are quite a few instances of 'ok' movies misclassified as 'good' movies. This mirrors the regression model, which did not predict IMDb scores under 5, and further supports the idea that this error is due to the nature of the dataset and not the algorithm itself.
## Confusion matrix
skplt.metrics.plot_confusion_matrix(y_true=np.array(y_test_dt), y_pred=dt.predict(X_test_dt))
plt.show()
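Accuracy and the confusion matrix can be supplemented with per-class precision and recall, which make the weakness on 'bad' movies explicit. A minimal sketch using sklearn's classification_report (the metrics module is already imported):
## Sketch: per-class precision and recall for the decision tree
print(metrics.classification_report(y_test_dt, dt.predict(X_test_dt)))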
Shown below is the output of the decision tree used to classify each movie. The decision tree starts with num_voted_users, splits the movies into two groups, above and below 89,788, then moves to the next node. This process repeats until all observations have been classified. The gini score represents how homogeneous the data in each node is; zero represents complete homogeneity, which is the goal of classification. The issue with this decision tree is that some of the nodes on the final level are mostly heterogeneous when, in an ideal model, they should be completely homogeneous. Again, this is most likely due to the nature of the dataset, where the algorithms tend to misclassify movies with a lower IMDb score.
## Display decision tree
Source(tree.export_graphviz(dt, out_file=None, feature_names=X.columns))
Next, we'll use KNN to build a classification model. Again, like the decision tree, the data is split into 70% training and 30% testing. KNN stands for k-nearest neighbors: the algorithm uses the training data to find the nearest neighbors of a test data point and classifies it according to their majority class.
## Validation
X_train, X_test_knn, y_train, y_test_knn = train_test_split(X, y, test_size=0.3, random_state=0)
## Initialize
knn = KNeighborsClassifier()
## Train
knn = knn.fit(X_train, y_train)
## Model evaluation
print(metrics.accuracy_score(y_test_knn, knn.predict(X_test_knn)))
0.5943810359964882
The accuracy of the KNN model is shown above. The model correctly classifies the data 59.4% of the time. This is substantially lower than the decision tree model. According to the confusion matrix below, it appears that most of the error is coming from 'ok' movies misclassified as 'good' movies.
## Confusion matrix
skplt.metrics.plot_confusion_matrix(y_true=np.array(y_test_knn), y_pred=knn.predict(X_test_knn))
plt.show()
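The KNN model above uses scikit-learn's default of 5 neighbors. A hedged sketch of tuning that hyperparameter with GridSearchCV (imported at the top but not otherwise used here); the candidate values of k are arbitrary choices for illustration:
## Sketch: tune the number of neighbors with 5-fold cross-validation
param_grid = {'n_neighbors': [3, 5, 11, 21, 51]}
knn_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
knn_search.fit(X_train, y_train)
print("Best k:", knn_search.best_params_['n_neighbors'])
print("Test accuracy:", metrics.accuracy_score(y_test_knn, knn_search.predict(X_test_knn)))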
Logistic regression is similar to the linear regression used earlier in the regression section. However, logistic regression is used for classification and models the class probabilities with a non-linear (logistic) function. Again we'll split the data into 70% training and 30% testing.
## Validation
X_train, X_test_lr, y_train, y_test_lr = train_test_split(X, y, test_size=0.3, random_state=0)
## Initialize
lr = LogisticRegression(solver='lbfgs', max_iter=500)
## Train
lr.fit(X_train, y_train)
## Model evaluation
print(metrics.accuracy_score(y_test_lr, lr.predict(X_test_lr)))
0.659350307287094
The accuracy of the logistic regression model is shown above. The model correctly classifies the data 65.9% of the time. This is more accurate than the KNN model but still less accurate than the decision tree model. The confusion matrix shown below reveals that the algorithm mostly predicts movies as 'good' because that's where the bulk of the data is. This is not conducive to a good classification model.
## Confusion matrix
skplt.metrics.plot_confusion_matrix(y_true=np.array(y_test_lr), y_pred=lr.predict(X_test_lr))
plt.show()
Now we'll build a random forest classification model. Random forest operates similarly to the decision tree model, but instead of building one decision tree, it builds an ensemble of them (20 in this case) and combines their votes. Again we'll split the data into 70% training and 30% testing.
## Validation
X_train, X_test_clf, y_train, y_test_clf = train_test_split(X, y, test_size=0.3, random_state=0)
## Initialize
clf = RandomForestClassifier(n_estimators=20)
## Train
clf=clf.fit(X_train, y_train)
## Model evaluation
print(metrics.accuracy_score(y_test_clf, clf.predict(X_test_clf)))
0.7594381035996488
Random forest is the most accurate of any model tested so far. The model correctly classifies the data 76% of the time. The confusion matrix below shows that the random forest model is not only more accurate but also doesn't rely on predicting the most common class to achieve its high accuracy.
## Confusion matrix
skplt.metrics.plot_confusion_matrix(y_true=np.array(y_test_clf), y_pred=clf.predict(X_test_clf))
plt.show()
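A single 70/30 split can be sensitive to exactly how the rows are divided. As a sanity check, cross_val_score (imported at the top but unused so far) averages accuracy over several folds; a sketch assuming the same X, y, and random forest settings as above:
## Sketch: 5-fold cross-validated accuracy for the random forest
cv_scores = cross_val_score(RandomForestClassifier(n_estimators=20), X, y, cv=5)
print("Fold accuracies:", cv_scores)
print("Mean accuracy:", cv_scores.mean())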
Recursive feature elimination (RFE) ranks the importance of each variable. Those variables are then used to build a logistic regression classification model.
## Rank importance of variables based on feature selection
model = LogisticRegression()
rfe = RFE(model, 1)
rfe = rfe.fit(X, y)
## Features sorted by their rank
pd.DataFrame({'variable':X.columns, 'importance':rfe.ranking_}).sort_values(by='importance').reset_index().drop('index', axis=1).head(10)
variable | importance | |
---|---|---|
0 | in_english | 1 |
1 | color | 2 |
2 | from_USA | 3 |
3 | content_rating | 4 |
4 | facenumber_in_poster | 5 |
5 | duration | 6 |
6 | num_critic_for_reviews | 7 |
7 | title_year | 8 |
8 | num_user_for_reviews | 9 |
9 | actor_2_facebook_likes | 10 |
We'll use the top 10 most important variables according to RFE shown above to build a logistic regression model.
## Select top 10 ranked variables from RFE
X_logistic = df[['in_english','color','from_USA', 'content_rating', 'facenumber_in_poster', 'duration', 'num_critic_for_reviews', 'title_year', 'num_user_for_reviews', 'actor_2_facebook_likes']]
## Validation
X_train, X_test_lr1, y_train, y_test_lr1 = train_test_split(X_logistic, y, test_size=0.3, random_state=0)
## Initialize
lr1 = LogisticRegression()
## Train
lr1.fit(X_train, y_train)
#Model evaluation
print(metrics.accuracy_score(y_test_lr1, lr1.predict(X_test_lr1)))
0.6646180860403863
The RFE model correctly classifies the data 66.5% of the time. Because RFE uses logistic regression, most movies are classified as 'good' movies since that's where most of the training data falls. This is not conducive to building a good classification model.
## Confusion matrix
skplt.metrics.plot_confusion_matrix(y_test_lr1, lr1.predict(X_test_lr1))
plt.show()
The random forest model is the most accurate of all models tested, both in its overall accuracy and in the visualization of the confusion matrix. The random forest model correctly classifies the data 76% of the time.
## Print accuracy score for each model
dt = metrics.accuracy_score(y_test_dt, dt.predict(X_test_dt))
knn = metrics.accuracy_score(y_test_knn, knn.predict(X_test_knn))
lr = metrics.accuracy_score(y_test_lr, lr.predict(X_test_lr))
clf = metrics.accuracy_score(y_test_clf, clf.predict(X_test_clf))
lr1 = metrics.accuracy_score(y_test_lr1, lr1.predict(X_test_lr1))
print('Decision tree accuracy: %s' % (dt))
print('KNN accuracy: %s' % (knn))
print('Logistic regression accuracy: %s' % (lr))
print('Random forest classifier accuracy: %s' % (clf))
print('Recursive feature selection accuracy: %s' % (lr1))
Decision tree accuracy: 0.7155399473222125 KNN accuracy: 0.5943810359964882 Logistic regression accuracy: 0.659350307287094 Random forest classifier accuracy: 0.7594381035996488 Recursive feature selection accuracy: 0.6646180860403863
The goal of clustering is to segment the data into distinct clusters and develop cluster profiles based on each cluster's information. The first step is to determine the most important variables to use in the clustering analysis. We'll select the 8 most important variables according to feature importance.
## Select 8 most important variables with feature importance
model_extra = ExtraTreesClassifier()
model_extra.fit(X, y)
pd.DataFrame(model_extra.feature_importances_, index = X.columns, columns=['importance']).sort_values('importance', ascending=False).head(8)
importance | |
---|---|
num_voted_users | 0.115421 |
duration | 0.073528 |
num_critic_for_reviews | 0.069830 |
budget | 0.066979 |
title_year | 0.066449 |
num_user_for_reviews | 0.064334 |
profit | 0.060909 |
gross | 0.059518 |
## New dataframe based on these variables
df_c = df1[['title_year', 'duration', 'gross', 'budget', 'profit', 'num_critic_for_reviews', 'num_user_for_reviews', 'num_voted_users']]
We'll need to normalize the data before performing cluster analysis because the scale, or variance, of each variable is different and would distort the clustering. For example, the variance of gross, at roughly 4.8 quadrillion, is much higher than the variance of title year at 99.2, and the algorithm would give more weight to the higher-variance variable. The point of normalization is to put all variables on the same scale, and therefore on a level playing field, when determining the importance of each variable.
## Check variance
pd.options.display.float_format = '{:,.3f}'.format
df_c.var()
title_year 99.151 duration 500.154 gross 4,841,046,710,277,035.000 budget 1,851,932,001,982,648.750 profit 2,812,299,093,271,696.500 num_critic_for_reviews 15,337.725 num_user_for_reviews 167,781.108 num_voted_users 22,803,705,819.598 dtype: float64
## Reset scientific notation
pd.reset_option('^display.', silent=True)
Once the data has been normalized, all variables share the same scale, and the size of the initial variance won't influence the clustering results.
## Normalize data and check variance again
df_norm = (df_c - df_c.mean()) / (df_c.max() - df_c.min())
df_norm.var()
title_year 0.012518 duration 0.005826 gross 0.008370 budget 0.020577 profit 0.004148 num_critic_for_reviews 0.023262 num_user_for_reviews 0.006556 num_voted_users 0.007987 dtype: float64
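The cell above uses a mean-centered range scaling, (x - mean) / (max - min). A common alternative, shown here only as a sketch, is scikit-learn's MinMaxScaler, which rescales each column to the 0-1 range; either way the variables end up on comparable scales:
## Sketch: min-max scaling with scikit-learn as an alternative normalization
from sklearn.preprocessing import MinMaxScaler
df_minmax = pd.DataFrame(MinMaxScaler().fit_transform(df_c),
                         columns=df_c.columns, index=df_c.index)
df_minmax.var()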
The elbow plot below helps determine the optimal number of clusters for a dataset. With too few clusters there will be a loss of individual information; with too many, analysis becomes more difficult. The elbow plot gets its name from the optimal cluster point resembling an elbow joint. It appears that the 'elbow' is at 3 clusters. This judgment is subjective, though, and others may argue for 2 or 4 clusters.
## Display elbow method plot
kmeans = KMeans(random_state=1)
skplt.cluster.plot_elbow_curve(kmeans, df_norm, cluster_ranges=range(1, 8));
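The elbow curve above comes from scikit-plot; the same information can be recovered by hand from each fitted model's inertia_ (the within-cluster sum of squares). A minimal sketch:
## Sketch: within-cluster sum of squares for k = 1 through 7
for k in range(1, 8):
    km = KMeans(n_clusters=k, random_state=1).fit(df_norm)
    print(k, km.inertia_)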
Now we'll assign clusters using k-means clustering based on the number of clusters selected in the elbow plot.
## Clustering analysis using k-means
k_means = KMeans(init='k-means++', n_clusters=3, random_state=0);
## Fit normalized data
k_means.fit(df_norm);
## Cluster labels
k_means.labels_
array([0, 0, 0, ..., 1, 1, 1], dtype=int32)
## Cluster labels to dataframe
df_c1 = pd.DataFrame(k_means.labels_, columns = ['cluster'])
df_c1.head()
cluster | |
---|---|
0 | 0 |
1 | 0 |
2 | 0 |
3 | 0 |
4 | 0 |
## Join cluster labels and normalized data
df_c2 = df_norm.join(df_c1)
df_c2.head(1)
title_year | duration | gross | budget | profit | num_critic_for_reviews | num_user_for_reviews | num_voted_users | cluster | |
---|---|---|---|---|---|---|---|---|---|
0 | 0.066779 | 0.232443 | 0.93197 | 0.66281 | 0.61929 | 0.687488 | 0.538356 | 0.463054 | 0.0 |
## Cluster with original data
df_c3 = df_c.join(df_c1)
df_c3.head(1)
title_year | duration | gross | budget | profit | num_critic_for_reviews | num_user_for_reviews | num_voted_users | cluster | |
---|---|---|---|---|---|---|---|---|---|
0 | 2009.0 | 178.0 | 760505847.0 | 237000000.0 | 523505847.0 | 723.0 | 3054.0 | 886204 | 0.0 |
The number of movies in each cluster is shown below. The majority of observations fit into cluster 2 (with the index 1), while only a select few have made it into cluster 1 (with the index 0).
Note: python indexing starts at 0 not 1
## Number of observations in each cluster
df_c3['cluster'].value_counts().sort_index()
0.0 271 1.0 1854 2.0 1106 Name: cluster, dtype: int64
Now we can view each cluster's average value for each of the selected variables. Before we analyze this information and develop profiles from it, we need to confirm that the differences between clusters are statistically significant rather than relying on an arbitrary judgment.
## Mean values for each cluster
pd.options.display.float_format = '{:,.3f}'.format
df_c3.groupby('cluster').mean().T
cluster | 0.0 | 1.0 | 2.0 |
---|---|---|---|
title_year | 2,008.092 | 2,003.050 | 2,003.280 |
duration | 123.993 | 109.019 | 112.955 |
gross | 158,429,687.048 | 41,913,748.528 | 61,262,531.428 |
budget | 136,912,915.129 | 28,233,330.099 | 46,975,311.394 |
profit | 21,516,771.919 | 13,680,418.428 | 14,287,220.033 |
num_critic_for_reviews | 308.886 | 152.780 | 172.290 |
num_user_for_reviews | 744.550 | 291.023 | 357.489 |
num_voted_users | 250,640.672 | 91,945.350 | 115,319.672 |
## Reset scientific notation
pd.reset_option('^display.', silent=True)
The purpose of t-testing is to find statistically significant differences between two group means. In this case, for each variable, we'll test whether each of the 3 clusters' mean values deviates significantly from the other clusters' (the code below uses pairwise Tukey tests, which serve the same purpose for pairwise comparisons).
## Develop t-test for each variable between each cluster
a = pg.pairwise_tukey(data=df_c3, dv='title_year', between='cluster')['p-tukey']
b = pg.pairwise_tukey(data=df_c3, dv='duration', between='cluster')['p-tukey']
c = pg.pairwise_tukey(data=df_c3, dv='gross', between='cluster')['p-tukey']
d = pg.pairwise_tukey(data=df_c3, dv='budget', between='cluster')['p-tukey']
e = pg.pairwise_tukey(data=df_c3, dv='profit', between='cluster')['p-tukey']
f = pg.pairwise_tukey(data=df_c3, dv='num_critic_for_reviews', between='cluster')['p-tukey']
g = pg.pairwise_tukey(data=df_c3, dv='num_user_for_reviews', between='cluster')['p-tukey']
h = pg.pairwise_tukey(data=df_c3, dv='num_voted_users', between='cluster')['p-tukey']
## Define highlight function
def color(val):
color = 'red' if val > 0.05 else ''
return 'background-color: %s' % color
Two means are statistically significantly different if the p-value is less than 0.05. The cells highlighted in red have a p-value above this cutoff, so those differences are not statistically significant. In terms of title year, cluster 1 is statistically different from cluster 2, and the same is true between clusters 1 and 3; however, title year is not significantly different between clusters 2 and 3. Profit is the only variable for which none of the cluster differences are statistically significant, because all of its p-values are greater than 0.05. Therefore, we'll leave these highlighted cells out of our cluster profiles. The rest of the variables are statistically significantly different between each of the clusters. Now that we have this information, we can develop cluster profiles.
## Convert p-values to dataframe
ttest = pd.concat([a, b, c, d, e, f, g, h], axis=1)
ttest.columns = ['year', 'duration', 'gross', 'budget', 'profit', 'num_critic_for_reviews', 'num_user_for_reviews', 'num_voted_users']
ttest.index.name = 'difference between clusters'
ttest = ttest.rename(index={0: '1 and 2', 1: '1 and 3', 2: '2 and 3'})
ttest.style.applymap(color).format("{:.3}")
year | duration | gross | budget | profit | num_critic_for_reviews | num_user_for_reviews | num_voted_users | |
---|---|---|---|---|---|---|---|---|
difference between clusters | ||||||||
1 and 2 | 0.001 | 0.001 | 0.001 | 0.001 | 0.0827 | 0.001 | 0.001 | 0.001 |
1 and 3 | 0.001 | 0.001 | 0.001 | 0.001 | 0.141 | 0.001 | 0.001 | 0.001 |
2 and 3 | 0.818 | 0.001 | 0.001 | 0.001 | 0.9 | 0.001 | 0.001 | 0.001 |
Cluster 1 - "Blockbusters"
These are the movies that are hits at the box office, that everyone is talking about, and that are in contention during awards season.
Compared to the other clusters, these movies on average are newer and longer, with the highest gross, budget, number of critic reviews, number of user reviews, and number of IMDb user votes.
Cluster 2 - "Indie Movies/Cult Classics"
These are the movies that are made with limited resources and don't make large amounts at the box office but still have their place in the cinema world.
Compared to the other clusters, these movies on average are shorter, with the lowest gross, budget, number of critic reviews, number of user reviews, and number of IMDb user votes.
Cluster 3 - "Rainy Day Films"
These movies bridge the gap between the extremes of blockbusters and indie movies/cult classics. They have a decent-sized budget and usually recoup it at the box office, but they don't garner the same amount of attention or award nominations.
Compared to the other clusters, these movies on average sit in the middle on duration, gross, budget, number of critic reviews, number of user reviews, and number of IMDb user votes.
Before concluding the analysis, it's worth noting that there are a few drawbacks to this dataset and project: rows missing gross, budget, or content rating had to be dropped, which skews the remaining data toward newer movies; many of the Facebook like counts are 0 simply because no Facebook page exists; and because low-rated movies are underrepresented, the regression and classification models are least accurate for movies with low IMDb scores.
According to the dataset and analysis, movies are more likely to have these features: they are in color, in English, from the USA, rated PG-13 or R, roughly 100 minutes long, and most often dramas or comedies.
In general, movies with a higher IMDb score are more likely to have these features: more IMDb user votes, a longer duration, more critic and user reviews, more movie Facebook likes, and a higher profit and gross.
The goal of regression was to predict IMDb score, the y variable, using various numerical determinants from the dataset. We learned the following things through regression: of the four models tested, the Scikit-learn linear regression had the lowest MSE (about 0.65) with an R-squared of about 0.42, and its RMS of about 0.8 is driven largely by lower-rated movies, which the models struggle to predict.
The goal of classification was to predict whether a movie is 'bad,' 'ok,' 'good,' or 'excellent' instead of predicting the IMDb score overall. We achieved this by creating custom bins based on IMDb score. These are the things we learned through classification: the random forest model was the most accurate, correctly classifying about 76% of the test data, followed by the decision tree (71.5%), RFE with logistic regression (66.5%), logistic regression (65.9%), and KNN (59.4%); like the regression models, the classifiers tend to misclassify the lower-rated 'bad' movies.
The goal of clustering was to segment the data into distinct clusters and develop profiles based on each cluster's information using the 8 most important x variables. These are the things we learned after clustering:
We had to normalize the data before clustering so that variables were all on the same scale in terms of variance. Without normalization, the variables with a higher variance would be given more weight and distort the importance when determining clusters.
The ‘elbow’ in the elbow plot was at 3 clusters, therefore a total of 3 clusters were used. However, this is a subjective analysis, and others may have chosen a different number of clusters.
t-Testing was used to find statistically significant differences between each of the 3 clusters for each variable. Two clusters are statistically significantly different if the p-value is less than 0.05.
Using information gleaned from t-testing, we were able to build profiles with statistical certainty that each variable discussed was significantly different compared to other clusters. These profiles were:
Cluster 1 - "Blockbusters"
Compared to the other clusters, these movies on average are newer and longer, with the highest gross, budget, critic reviews, user reviews, and user votes.
Cluster 2 - "Indie Movies/Cult Classics"
Compared to the other clusters, these movies on average are shorter, with the lowest gross, budget, critic reviews, user reviews, and user votes.
Cluster 3 - "Rainy Day Films"
Compared to the other clusters, these movies on average sit in the middle on duration, gross, budget, critic reviews, user reviews, and user votes.
This dataset can be given to an executive producer, actor, director, agent, or anyone else in the movie business to help decide whether a potential movie is worth taking on, not necessarily in terms of making money but in terms of making a quality movie and winning awards. Before taking a role, producers, actors, directors, and others in the movie business should ask themselves whether the potential project has the features that this analysis associates with a high IMDb score.