PREDICTING MOVIES BOX OFFICE REVENUES

Cover

Introduction

Movies are one of the most universal forms of entertainment; people in every part of the world crowd into theaters waiting for the projector's light to hit the big screen. Before the COVID pandemic struck in 2020, the global yearly revenue generated by the movie industry was estimated at $41.7 billion.

The myriad of different stories to tell and the creative freedom that cinematographic art allows are arguably what make movies so attractive. Beneath this romantic view, however, we need to keep in mind that film studios are in business to make a profit; for the most part, the prospect of multimillion-dollar revenues is really what convinces stakeholders to invest in the production of a flick.

Project Objective

In this project we use machine-learning modeling to predict movies' box office gross revenues; to reach this goal, we rely on information such as each movie's budget, genres, release date and popularity.

Along the way, through data wrangling and exploratory analysis, we unveil interesting statistical facts about movies. By the end of the project, we have a clear idea of which factors most strongly influence the revenue of a film.

About the Data

To train our models we use a variety of metadata about 3,000 different movies. The dataset containing this information can be downloaded from this Kaggle page and was originally obtained using the API of The Movie Database (TMDB), a popular website which collects extensive data about motion pictures and TV shows.

Project Overview

The project is divided into 3 main steps:

  • Step 1: Loading the Data and Identifying the Target Variable.
  • Step 2: Data Wrangling, Exploratory Analysis and Feature Engineering: this is the most extensive part; its purpose is to get the data ready for modeling.
  • Step 3: Predictive Modeling and Performance Evaluation: here we actually train the regression models and predict the movies' revenues; the models' performance is evaluated using the Root Mean Squared Error (RMSE) metric (see the short sketch after this list).
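
As a reference, RMSE is the square root of the mean of the squared differences between the predictions and the true values. A minimal sketch with hypothetical numbers (not results from this project):

import numpy as np

y_true = np.array([16.2, 17.5, 15.1])  # hypothetical true values
y_pred = np.array([16.0, 18.0, 15.3])  # hypothetical predictions
rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
print(rmse)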

STEP 1: Loading the Data and Identifying the Target Variable

Let's first of all activate an IPython option that will improve the notebook's organization by reducing the number of cells we use: it makes every statement in a cell display its output, not just the last one.

In [1]:
# Display the output of every statement in a cell, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Getting Familiar with the Data

We are ready to dive into the data.

In [2]:
import pandas as pd

# Read-in the CSV file with the data
movies_df = pd.read_csv("Datasets/TMDB_data.csv")
In [3]:
# Print number of rows and columns
print("# of rows:", movies_df.shape[0])
print("# of columns:", movies_df.shape[1])
# of rows: 3000
# of columns: 23
In [4]:
# Print the names of the columns
movies_df.columns
Out[4]:
Index(['id', 'belongs_to_collection', 'budget', 'genres', 'homepage',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'runtime', 'spoken_languages',
       'status', 'tagline', 'title', 'Keywords', 'cast', 'crew', 'revenue'],
      dtype='object')
In [5]:
import numpy as np

# Print the names of the numeric columns
numeric_movies_df = movies_df.select_dtypes(include = np.number)
print("The numeric columns are:\n", numeric_movies_df.columns)

print("---")

# Print the names of the non-numeric columns
non_numeric_movies_df = movies_df.select_dtypes(exclude = np.number)
print("The non-numeric columns are:\n", non_numeric_movies_df.columns)
The numeric columns are:
 Index(['id', 'budget', 'popularity', 'runtime', 'revenue'], dtype='object')
---
The non-numeric columns are:
 Index(['belongs_to_collection', 'genres', 'homepage', 'imdb_id',
       'original_language', 'original_title', 'overview', 'poster_path',
       'production_companies', 'production_countries', 'release_date',
       'spoken_languages', 'status', 'tagline', 'title', 'Keywords', 'cast',
       'crew'],
      dtype='object')
In [6]:
# Allow display of all columns in a dataframe
pd.set_option("display.max_columns", None)

# Take a look at the first three rows
movies_df.head(3)
Out[6]:
id belongs_to_collection budget genres homepage imdb_id original_language original_title overview popularity poster_path production_companies production_countries release_date runtime spoken_languages status tagline title Keywords cast crew revenue
0 1 [{'id': 313576, 'name': 'Hot Tub Time Machine ... 14000000 [{'id': 35, 'name': 'Comedy'}] NaN tt2637294 en Hot Tub Time Machine 2 When Lou, who has become the "father of the In... 6.575393 /tQtWuwvMf0hCc2QR2tkolwl7c3c.jpg [{'name': 'Paramount Pictures', 'id': 4}, {'na... [{'iso_3166_1': 'US', 'name': 'United States o... 2/20/15 93.0 [{'iso_639_1': 'en', 'name': 'English'}] Released The Laws of Space and Time are About to be Vio... Hot Tub Time Machine 2 [{'id': 4379, 'name': 'time travel'}, {'id': 9... [{'cast_id': 4, 'character': 'Lou', 'credit_id... [{'credit_id': '59ac067c92514107af02c8c8', 'de... 12314651
1 2 [{'id': 107674, 'name': 'The Princess Diaries ... 40000000 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... NaN tt0368933 en The Princess Diaries 2: Royal Engagement Mia Thermopolis is now a college graduate and ... 8.248895 /w9Z7A0GHEhIp7etpj0vyKOeU1Wx.jpg [{'name': 'Walt Disney Pictures', 'id': 2}] [{'iso_3166_1': 'US', 'name': 'United States o... 8/6/04 113.0 [{'iso_639_1': 'en', 'name': 'English'}] Released It can take a lifetime to find true love; she'... The Princess Diaries 2: Royal Engagement [{'id': 2505, 'name': 'coronation'}, {'id': 42... [{'cast_id': 1, 'character': 'Mia Thermopolis'... [{'credit_id': '52fe43fe9251416c7502563d', 'de... 95149435
2 3 NaN 3300000 [{'id': 18, 'name': 'Drama'}] http://sonyclassics.com/whiplash/ tt2582802 en Whiplash Under the direction of a ruthless instructor, ... 64.299990 /lIv1QinFqz4dlp5U4lQ6HaiskOZ.jpg [{'name': 'Bold Films', 'id': 2266}, {'name': ... [{'iso_3166_1': 'US', 'name': 'United States o... 10/10/14 105.0 [{'iso_639_1': 'en', 'name': 'English'}] Released The road to greatness can take you to the edge. Whiplash [{'id': 1416, 'name': 'jazz'}, {'id': 1523, 'n... [{'cast_id': 5, 'character': 'Andrew Neimann',... [{'credit_id': '54d5356ec3a3683ba0000039', 'de... 13092000

The dataset contains 3,000 rows; each of them corresponds to a movie. We have a total of 23 columns. The kind of information contained in each column can easily be inferred from the column's name, but more specific details can be found on the Bible page of the TMDB website.

The id and imdb_id columns don't really contain any substantial information but serve as identifiers to distinguish between different movies. Specifically, the latter contains the path one needs to append to the URL https://www.imdb.com/title/ in order to reach the movie's IMDB page. IMDB is a popular and comprehensive movie database; its pages could come in handy in case we need to replace some missing information.
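
For instance, a quick sketch of how we could build such a URL for the first movie in the dataset:

# Build the IMDB page URL for the first movie
imdb_url = "https://www.imdb.com/title/" + movies_df["imdb_id"].iloc[0]
print(imdb_url)  # https://www.imdb.com/title/tt2637294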

A few columns (such as production_companies, spoken_languages, genres and cast) stick out because they contain strings which use a syntax reminiscent of JSON. This must be the result of how the API returns the data following a request. We will make sure to format these strings properly when needed.
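
For example, here is the raw genres value of the first movie, a list saved as a string:

# Peek at the raw `genres` string of the first movie
movies_df["genres"].iloc[0]  # returns "[{'id': 35, 'name': 'Comedy'}]"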

The popularity column contains a popularity score for each movie; it's a little unclear how this score is calculated, but some info can be found on the related page of the TMDB API website.

It's important to notice that out of all the 23 columns, only 5 are of numeric type; this suggests we will need quite a bit of data wrangling and feature engineering before being able to train our first model.

As one last piece of general information, let's print each column's null values count and let's check for duplicate rows.

In [7]:
# Print the columns' null values count
movies_df.isnull().sum()
Out[7]:
id                          0
belongs_to_collection    2396
budget                      0
genres                      7
homepage                 2054
imdb_id                     0
original_language           0
original_title              0
overview                    8
popularity                  0
poster_path                 1
production_companies      156
production_countries       55
release_date                0
runtime                     2
spoken_languages           20
status                      0
tagline                   597
title                       0
Keywords                  276
cast                       13
crew                       16
revenue                     0
dtype: int64
In [8]:
print("The dataframe has duplicate rows:", movies_df.duplicated().any())
print("The dataframe has duplicate movie titles:", movies_df.duplicated(subset = "title").any())
The dataframe has duplicate rows: False
The dataframe has duplicate movie titles: True

There are quite a few null values: we will look more into this issue during the exploratory data analysis step.

We don't have any duplicate rows; however, there are some duplicate movie titles. This means that, to distinguish between movies, we are better off using one of the identifier columns rather than the titles.
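
To see which titles repeat, here is a quick sketch:

# Show a few of the duplicated movie titles
duplicated_titles = movies_df.duplicated(subset = "title", keep = False)
movies_df.loc[duplicated_titles, ["id", "imdb_id", "title"]].sort_values("title").head()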

Setting Up a Plotting Environment

We will soon start exploring the data in more detail by looking at each column separately and displaying some of the variable distributions, but for now let's fix the plotting style and parameters we will use throughout the notebook.

In [9]:
import matplotlib.pyplot as plt

# Set the plotting style and parameters to make titles bigger and bold throughout the notebook
plt.style.use(style="ggplot")
par = {"axes.titlesize":18, "axes.titleweight":"bold", "figure.figsize":[12,6]}
plt.rcParams.update(par)

Selecting and Analyzing the Target Variable

Our goal is to predict movies' box office grosses. Looking at the first few rows we printed above, we soon realize this information is contained in the revenue column. However, it's not clear whether these numbers refer to US-only, domestic (US and Canada) or worldwide box office. Consulting the Bible page again, we learn that there's no enforced rule to establish this.

This lack of uniformity introduces a certain degree of variance we don't have control over. To make the data more rigorous we could try to pull more data from other resources which keep track of movies' financial information. However, most of them are either not as exhaustive (like BoxOfficeMojo.com) or require specific permissions in order to use their API (like TheNumbers.com).

Ultimately, we must be content with the data we have. Hence, we officially elect the revenue column as our target variable. Let's start getting acquainted with the distribution of its values.

In [10]:
# Show the 3 movies with the highest revenue
movies_df.sort_values("revenue", ascending = False)[["title", "revenue"]].head(3)
Out[10]:
title revenue
1126 The Avengers 1519557910
1761 Furious 7 1506249360
2770 Avengers: Age of Ultron 1405403694
In [11]:
# Print summary statistics of `revenue` column
movies_df["revenue"].describe()
Out[11]:
count    3.000000e+03
mean     6.672585e+07
std      1.375323e+08
min      1.000000e+00
25%      2.379808e+06
50%      1.680707e+07
75%      6.891920e+07
max      1.519558e+09
Name: revenue, dtype: float64

We immediately notice a minimum value of \$1, which clearly can't be any movie's real revenue; we will need to inspect this issue further and possibly identify other similar invalid values.

The mean box office revenue is close to \$67 million. However, this value is not representative of the whole distribution: for one thing, the standard deviation is insanely high (more than \$100 million); also, the median revenue doesn't reach \$20 million. In fact, the mean value falls just shy of the upper quartile.

Given these observations, we expect the distribution of revenues to have a very strong positive skew. Let's quantify this skew by checking its magnitude; moreover, to get a quick visual intuition, let's also draw a histogram of the revenue column.

In [12]:
print("The skew of the `revenue` distribution is:", movies_df["revenue"].skew())
The skew of the `revenue` distribution is: 4.542285301660683
In [13]:
import seaborn as sns

# Draw distribution of the revenues
sns.displot(data = movies_df, x = "revenue", kde = True, aspect = 16/8);
plt.title("Distribution of Revenue");

As expected, the distribution features a very long tail on the right side. This tail, however, gets very flat very quickly; in fact, the vast majority of movies appear near the origin, but the presence of a few blockbusters pulls the distribution in the positive direction.

Because of these outliers, the scale of the plot makes it hard to distinguish the different bins. To get a better picture, let's zoom into the part of the graph which contains most movies.

In [14]:
# Filter out blockbuster movies with very high revenue values
most_movies_df = movies_df[movies_df["revenue"] < 0.2 * 10**9]
In [15]:
# Draw distribution of the revenues zooming into the segment which contains the most movies
sns.displot(data = most_movies_df, x = "revenue", kde = True, aspect = 16/8);
plt.xlabel("");
plt.title("Zoomed-in Distribution of Revenue");

Now we see more clearly that roughly half of the movies have a revenue below \$12.5 million.

Filtering Out Invalid Values

In the plot above, the highest bar is the one that starts at the origin and ends at \$6.25 million. Such a range is quite large and can very well include many movies. Since we have already spotted an invalid minimum value, however, we should make sure this high density of "low-revenue" films is legit and is not influenced by the presence of bogus values.

At this point, it's hard to say what would be a good threshold to separate suspiciously low revenues, but for now let's set it at \$5,000.

In [16]:
# Find movies with revenue lower than $5,000
low_revenue = movies_df[movies_df["revenue"] < 5000]

print("There are {} movies with revenue lower than $5,000.".format(low_revenue.shape[0]))
There are 70 movies with revenue lower than $5,000.
In [17]:
print("The lowest revenues in ascending order are:")
np.sort(low_revenue["revenue"].unique())
The lowest revenues in ascending order are:
Out[17]:
array([   1,    2,    3,    4,    5,    6,    7,    8,   10,   11,   12,
         13,   15,   18,   20,   23,   25,   30,   32,   46,   60,   70,
         79,   85,   88,   97,  100,  121,  125,  126,  135,  198,  204,
        241,  306,  311,  404,  422,  500,  525,  641,  671,  923, 1404,
       1469, 1938, 2303, 2340, 2970, 3003, 3358, 3462, 4187, 4275, 4542,
       4631], dtype=int64)

Besides the minimum revenue of \$1, we have many other questionable values. Trying to find a method to replace these values could be cumbersome and possibly damaging, since we might introduce bias into the data. Since our search above returned just a small portion of the datapoints (70 out of 3,000), we can drop the rows corresponding to movies with suspicious revenues without suffering a significant loss of data; of course, the hard part is deciding which of the revenues above we should disregard.

We can probably safely assume that any value below \$100 is artificial. On the other hand, it's not impossible to imagine that a movie which is released for a very short period of time in a handful of theaters can accumulate a revenue of just a few hundred dollars. We notice that the first significant gap between two consecutive values occurs around the \$1,000 mark. Therefore, to be a little safer, we will drop all movies with a revenue below \$1,000.

In [18]:
# Keep only movies with revenue greater than $1,000
# Save the result in a new variable so that we can always refer back to the original dataframe
movies = movies_df[movies_df["revenue"] > 1000].reset_index(drop = True) # We reset and drop the index because we can use
                                                                         # the `id` column to loop back to the original data
In [19]:
print("We have {} datapoints remaining.".format(movies.shape[0]))
We have 2943 datapoints remaining.

Transforming the Target Column

In the previous step we dropped about 2% of all the movies we had available. Given this small fraction, the distribution of the revenue column can't have changed much and must still feature a strong positive skew.

In [20]:
print("The skew of the original `revenue` distribution was:", movies_df["revenue"].skew())
print("The skew of the current `revenue` distribution is:", movies["revenue"].skew())
The skew of the original `revenue` distribution was: 4.542285301660683
The skew of the current `revenue` distribution is: 4.504084089706932

Regression models, especially linear models, do not perform well on strongly skewed data like the one we have here. This is because:

  • Skewness is usually caused by the presence of outliers and it is well known that outliers tend to pull the predictions away from the true values.
  • Skewed data often leads to skewed residuals, and linear regression operates under the assumption that the residuals are normally distributed.
  • The skewness of the target variable could negatively impact its linear relationship with the features; since linearity is one of the basic hypotheses of linear regression, this could affect the model's accuracy.

To mitigate the skewness of a variable, it is advisable to transform it by applying a function to all values. A popular choice is the log function. Besides addressing the issues we mentioned above, training a model on log transformed data yields errors that are multiplicative with respect to the original data. This means that, when evaluating the performance of our models, we will be able to say something like "our predictions are on average one and a half times the true value". This is much better than saying "our predictions are on average \$1 million above the true value": such an error is acceptable for a blockbuster which grossed \$80 million but is utter nonsense if we consider movies with much lower revenues (possibly even lower than \$1 million).
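
To make the multiplicative interpretation concrete, here is a small sketch with made-up numbers (not results from our data): an additive error on the log scale becomes a ratio on the dollar scale.

# Hypothetical example: an error of 0.4 on the log scale means the prediction
# is exp(0.4), about one and a half times, the true value
true_revenue = 10_000_000                    # made-up true revenue
log_prediction = np.log(true_revenue) + 0.4  # prediction overshooting by 0.4 on the log scale
print(round(np.exp(log_prediction) / true_revenue, 2))  # 1.49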

In what follows we put the theory above into action by applying the log function to the revenue column. When we are done, we look at the distribution of the new, log transformed values and see how it compares to the original variable.

In [21]:
# Log transform the `revenue` column
movies["log_revenue"] = np.log(movies["revenue"])
In [22]:
print("The skew of the log transformed `revenue` distribution is:", movies["log_revenue"].skew())
The skew of the log transformed `revenue` distribution is: -0.8439178021333138
In [23]:
# Draw the distributions of the original and the log transformed revenues side by side
fig, axs = plt.subplots(1,2, figsize = (18,8))

# Draw the distribution of the original revenues on the left
revenue = sns.histplot(data = movies, x = "revenue", kde = True, ax = axs[0]);
_ = revenue.set_title("Distribution of Revenue");

# Draw the distribution of the log transformed revenues on the right
log_revenue = sns.histplot(data = movies, x = "log_revenue", kde = True, ax = axs[1]);
_ = log_revenue.set_title("Distribution of Log Transformed Revenue")

# Use same scale on y-axis to facilitate comparison
for ax in axs:
    _ = ax.set(yticks = np.arange(0,1100,100));

The skew of the distribution on the right is indeed much less severe. As a consequence, applying the log function has notably decreased the spread of the revenue values, resulting in a scale that is easier to read.

Having justified the transformation, we can update the target variable: it will be the log_revenue column, which contains the log transformed revenue values.

In [24]:
# Make the `log_revenue` column the target variable
target = ["log_revenue"]

STEP 2: Data Wrangling, Exploratory Analysis and Feature Engineering

Let's now look at each column separately and decide if we should include it in the list of features to feed the regression models.

Working with the Numeric Features

Since most regression models only accept numeric variables, let's start analyzing the numeric columns that appear in the dataframe.

In [25]:
# Select only the numeric columns
numeric_df = movies.select_dtypes(include = np.number)
numeric_features = numeric_df.columns
In [26]:
print("There are {} numeric columns. They are:\n{}".format(len(numeric_features), numeric_features))
There are 6 numeric columns. They are:
Index(['id', 'budget', 'popularity', 'runtime', 'revenue', 'log_revenue'], dtype='object')
In [27]:
# Print null values count
numeric_df.isnull().sum()
Out[27]:
id             0
budget         0
popularity     0
runtime        2
revenue        0
log_revenue    0
dtype: int64
In [28]:
# Print some summary statistics
numeric_df.describe()
Out[28]:
id budget popularity runtime revenue log_revenue
count 2943.000000 2.943000e+03 2943.000000 2941.000000 2.943000e+03 2943.000000
mean 1503.001019 2.294052e+07 8.561427 108.029582 6.801820e+07 16.202226
std 866.667067 3.726020e+07 12.175863 21.858298 1.385413e+08 2.541269
min 1.000000 0.000000e+00 0.000001 0.000000 1.404000e+03 7.247081
25% 753.500000 0.000000e+00 4.215948 94.000000 2.747264e+06 14.826115
50% 1506.000000 8.000000e+06 7.441117 104.000000 1.748753e+07 16.676999
75% 2252.000000 3.000000e+07 10.946306 118.000000 7.103961e+07 18.078748
max 3000.000000 3.800000e+08 294.337037 338.000000 1.519558e+09 21.141685

We have a total of 6 columns. We have already noticed that one of them (id) serves merely as an identifier; moreover, revenue and its transformation log_revenue constitute the target, which we have already taken care of in the previous section.

This leaves us with 3 extra numeric features: budget, popularity and runtime. Combined, these columns only have a couple of null values, but it's important to notice that budget and runtime also have minimum values of zero: considering the kind of information these numbers represent, it's clear that zero values should be treated as missing data too.
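
As a quick sanity check, we can count the zero entries in these columns (a sketch; only budget and runtime turn out to contain exact zeros):

# Count the zero values in the three candidate feature columns
(numeric_df[["budget", "popularity", "runtime"]] == 0).sum()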

Before inspecting these flawed values further and deciding how to deal with them, let's check the correlation between the numeric columns.

In [29]:
# Check numeric columns correlation
sns.heatmap(numeric_df.corr(), annot = True);

The strongest correlation is between revenue and budget. The other correlations above 0.5 are between revenue and log_revenue, and between log_revenue and budget: neither of them is surprising, since the log_revenue column was obtained directly by transforming the revenue one. Otherwise, revenue also shows some mild correlation with popularity.

Let's start exploring some of these correlations in more detail.

Analyzing the budget Column

Distribution of Budget Values and Correlation with Revenues

The budget column uses the same unit as revenue (dollars). Since we elected the transformed log_revenue as our target variable, we should log transform the budgets too in order to obtain a feature on the same scale. For now, however, let's plot the distribution and the correlation of the non-transformed variables.

In [30]:
# Draw the distribution of the budget values and their correlation with revenues side by side
fig, axs = plt.subplots(1,2, figsize = (18,8))

# Draw distribution of the budgets on the left
budget = sns.histplot(data = movies, x = "budget", ax = axs[0], kde = True);
_ = budget.set_title("Distribution of Budget");

# Draw `budget` vs. `revenue` scatterplot on the right
budget_revenue = sns.scatterplot(data = movies, x = "budget", y = "revenue", ax = axs[1]);
_ = budget_revenue.set_title("Correlation of Budget and Revenue");

From the picture on the left, we see that the budget distribution, just like the revenue one, is strongly skewed to the right. As a consequence, most datapoints in the scatterplot on the right accumulate in the lower left corner.

The real issue here, however, is that some of these datapoints lie directly on the y-axis: this confirms the presence of a group of movies with a budget of \$0.

Exposing Invalid Budget Values

Let's find out how many movies are listed as having a \$0 budget.

In [31]:
# Count movies with $0 budget
(movies["budget"] == 0).sum()
Out[31]:
778

Unfortunately, there are quite a few. While we are at it, let's also look for other possibly wrong budget values; in order to do so, we need to determine a minimum budget below which it's unreasonable to believe one can make a movie. In this list of low-budget movies, there is one that was made with as little as \$7,000. We'll be more lenient and set an initial threshold of \$5,000.

In [32]:
# Count movies with budget lower than $5,000
low_budget = movies[movies["budget"] < 5000]

print("There are {} movies with budget lower than $5,000".format(low_budget.shape[0]))
low_budget["budget"].value_counts().sort_index()
There are 784 movies with budget lower than $5,000
Out[32]:
0       778
1         1
8         1
10        1
93        1
113       1
2500      1
Name: budget, dtype: int64

The budget values increase somewhat uniformly up to \$113, then jump suddenly to \$2,500, and nothing is found between that and \$5,000. This suggests that \$5,000 is probably a viable budget threshold. Let's decide whether we should include the least severe outlier we discovered above.

In [33]:
# Check the movie with a $2,500 budget
movies_df[movies_df["budget"] == 2500]
Out[33]:
id belongs_to_collection budget genres homepage imdb_id original_language original_title overview popularity poster_path production_companies production_countries release_date runtime spoken_languages status tagline title Keywords cast crew revenue
1803 1804 NaN 2500 [{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n... NaN tt0914382 en Quiet City Jamie is 21. She's from Atlanta. She's come to... 0.404753 /vn4iLUhXYAtoVXgXeQ2L7fFZZBr.jpg NaN [{'iso_3166_1': 'US', 'name': 'United States o... 3/12/07 78.0 [{'iso_639_1': 'en', 'name': 'English'}] Released NaN Quiet City [{'id': 10183, 'name': 'independent film'}] [{'cast_id': 1000, 'character': 'Jamie', 'cred... [{'credit_id': '52fe4712c3a36847f8121ef5', 'de... 15425

The IMDB page of the movie (which we quickly reached using the imdb_id appearing in the row above) confirms both the budget and revenue values. Hence we keep this row and update our budget threshold to \$2,500.

In [34]:
# Isolate the movies with invalid budget value, i.e., lower than $2,500
invalid_budget = movies[movies["budget"] < 2500]

We need to find a way to fix these movies. The easiest solution would be to drop the corresponding rows altogether; however, this would result in a loss of more than a quarter of the data (783 out of 2,943 rows). To avoid this outcome, and in order to retain as much information as possible, we prefer to replace the invalid values through an appropriate imputation strategy.

Figuring Out a Strategy to Replace the Invalid Budget Values

It is legitimate to expect that the budgets a production company (especially a smaller one) allocates to its movies don't vary much. Operating under this assumption, replacing an invalid budget value with the median budget of movies produced by the same company seems a valid option.

The information we need to carry out our strategy is contained in the production_companies column. Let's gather more information about this variable.

In [35]:
# Print null values count of the `production_companies` column
movies["production_companies"].isnull().sum()
Out[35]:
145
In [36]:
# Count movies with invalid budget and missing `production_companies` value
invalid_budget[invalid_budget["production_companies"].isnull()].shape[0]
Out[36]:
73

Out of the 783 movies with invalid budgets, only 73 are missing a production company. This count doesn't take into account the fact that some of these movies might be produced by companies for which we have no other record; for these movies, our imputation strategy falls short. Nevertheless, we still believe there's a good margin of improvement, so we go ahead with our plan.

Cleaning the production_companies Column

Let's look at the format of the production_companies column.

In [37]:
# Print the first 3 values
movies["production_companies"].iloc[0]
movies["production_companies"].iloc[1]
movies["production_companies"].iloc[2]
Out[37]:
"[{'name': 'Paramount Pictures', 'id': 4}, {'name': 'United Artists', 'id': 60}, {'name': 'Metro-Goldwyn-Mayer (MGM)', 'id': 8411}]"
Out[37]:
"[{'name': 'Walt Disney Pictures', 'id': 2}]"
Out[37]:
"[{'name': 'Bold Films', 'id': 2266}, {'name': 'Blumhouse Productions', 'id': 3172}, {'name': 'Right of Way Films', 'id': 32157}]"

The data is stored in lists of dictionaries, each one containing information about a single production company. These lists, however, are actually saved as strings. This is most likely an effect of how the data was exported: the API returned JSON, which was parsed into Python objects and then written out as their string representation. These strings are what we see above.

To turn the strings back into actual lists of dictionaries, the json module provides json.loads, which parses valid JSON syntax. However, we first need to make sure our strings do indeed match the JSON syntax. With this in mind, we define below a function that serves exactly this purpose.
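
As a side note, these strings are also valid Python literals, so the standard library's ast.literal_eval could parse them directly; here is a one-line alternative sketch. We keep the more explicit JSON route below anyway, since it makes every formatting step visible.

import ast

# Parse one of the raw strings directly as a Python literal (alternative to the JSON route)
ast.literal_eval(movies["production_companies"].iloc[0])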

In [38]:
import re
import json

def format_json(string):
    """
    Turns a deserialized JSON string into a list of dictionaries containing the original information.
    
    :param string: the string that needs to be formatted 
    :type string: str

    :return: a list of dictionaries with the same info contained in the original string
    """
    
    # Initialize the list of dictionaries
    dicts_list = []
    
    # Substitute double quotes with single quote. These double quotes could appear inside the dictionary values
    string = re.sub("\"", "'", string)
    
    # Substitute the single quotes that delimit the dictionary keys and values with double quotes to follow JSON syntax
    # We can recognize such single quotes because the characters that appear before or after them come in finitely many combos
    
    string = re.sub("(?<={)\'", "\"", string)    # Substitute a single quote preceded by "{" 
    string = re.sub("(?<=,\s)\'", "\"", string)  # Substitute a single quote preceded by ", "
    string = re.sub("(?<=:\s)\'", "\"", string)  # Substitute a single quote preceded by ": "
    string = re.sub("\'(?=:\s)", "\"", string)   # Substitute a single quote followed by ": "
    string = re.sub("\'(?=,\s)", "\"", string)   # Substitute a single quote followed by ", "
    string = re.sub("\'(?=})", "\"", string)     # Substitute a single quote followed by "}"
   
    # Remove the `\xa0` unicode character (it signifies a hard space and could appear in the dictionary values) 
    string = string.replace(u'\\xa0', u'')
      
    # Split the string at every closing bracket, i.e., at the end of every dictionary
    # This creates a list with all the JSON dictionaries (which at this point are still saved as strings)
    str_dicts = re.split(r"(?<=}).", string)
    
    # Loop through the dictionaries in the list we just created
    for str_dict in str_dicts[:-1]: # We skip the last element because it will always be an empty string
        
        # Remove any leftover leading and trailing foreign character
        str_dict = str_dict.strip("[, ]")
        # Parse the JSON dictionary converting into an actual dictionary
        real_dict = json.loads(str_dict)
        # Add the dictionary to the list which collects them all
        dicts_list.append(real_dict)
        
    return dicts_list            
In [39]:
# Clean the `production_companies` column parsing the JSON strings
movies["production_companies"] = movies["production_companies"].apply(lambda x : format_json(x) if pd.notnull(x) else x)
In [40]:
# Take a look at the result
movies["production_companies"].iloc[0]
movies["production_companies"].iloc[1]
movies["production_companies"].iloc[2]
Out[40]:
[{'name': 'Paramount Pictures', 'id': 4},
 {'name': 'United Artists', 'id': 60},
 {'name': 'Metro-Goldwyn-Mayer (MGM)', 'id': 8411}]
Out[40]:
[{'name': 'Walt Disney Pictures', 'id': 2}]
Out[40]:
[{'name': 'Bold Films', 'id': 2266},
 {'name': 'Blumhouse Productions', 'id': 3172},
 {'name': 'Right of Way Films', 'id': 32157}]

Since the id associated with each company doesn't add any important information, let's extract the production companies' names from these lists.

In [41]:
# Extract the production companies' names
movies["production_companies"] = movies["production_companies"].apply(lambda x : [company["name"] for company in x]
                                                                      if type(x) == list else x)

Let's also create a column in which we keep only the main production company for each movie. To make our life easier, we assume the main company is the first one appearing in the list.

In [42]:
# Create a column with only the main production_company
movies["main_production_company"] = movies["production_companies"].apply(lambda x : x[0] if type(x) == list else x)
In [43]:
# Check the result
movies["main_production_company"]
Out[43]:
0         Paramount Pictures
1       Walt Disney Pictures
2                 Bold Films
3                        NaN
4                        NaN
                ...         
2938            Warner Bros.
2939             Memfis Film
2940         New Line Cinema
2941            Jersey Films
2942        Lions Gate Films
Name: main_production_company, Length: 2943, dtype: object

Replacing Invalid Budget Values

Now that we have obtained all the info we need, let's replace the invalid budget values using the strategy we originally outlined. First of all, since we established that all budgets below \$2,500 are wrong and meaningless, let's replace them with null values.

In [44]:
# Replace the invalid budget values with NaNs
movies.loc[movies["budget"] < 2500, "budget"] = np.nan

Next we compute the budgets we will use to fill the null values we just introduced. Remember, our strategy is to take the median budget of the movies produced by each main production company.

In [45]:
# Initialize a dictionary to contain the median budgets
median_budgets = dict()

# Isolate the segment we want to fix
segment_to_fix = movies[(movies["budget"].isnull()) & (movies["main_production_company"].notnull())]

# Iterate over the unique companies which produced movies with invalid budgets 
for prod_comp in segment_to_fix["main_production_company"].unique():
    
    # Calculate the median budget of movies produced by the current production company 
    median_budget = movies.loc[movies["main_production_company"] == prod_comp, "budget"].median()
    
    # Update the dictionary with the median budget for the current production company
    median_budgets[prod_comp] = median_budget
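
As a side note, the same dictionary can be computed without an explicit loop by using a pandas groupby; a sketch that should be equivalent:

# Same medians via groupby: median budget per main production company,
# restricted to the companies that produced movies with invalid budgets
companies_to_fix = segment_to_fix["main_production_company"].unique()
median_budgets = (movies[movies["main_production_company"].isin(companies_to_fix)]
                  .groupby("main_production_company")["budget"]
                  .median()
                  .to_dict())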

Finally, we use the dictionary we just created to replace the invalid budget values.

In [46]:
# Replace the missing budgets with the median budgets we calculated (when possible)
movies["budget"] = movies[["budget", "main_production_company"]].apply(lambda row : 
                                                                       median_budgets[row["main_production_company"]] 
                                                                       if row.name in segment_to_fix.index
                                                                       else row["budget"], axis = 1)
In [47]:
# Check how many invalid budgets survived
movies["budget"].isnull().sum()
Out[47]:
415

Unfortunately, we still have quite a large number of missing budgets. This must be due to the fact that many of the movies which had an invalid budget value were produced by companies for which we have no other record. Looking back, out of the 783 movies with invalid budgets we had at the beginning, we were able to fix almost half, so we are pretty content.

At this point we need to decide what to do with the remaining movies we were not able to fix. Since we saw that the distribution of the budget variable is very skewed, we are worried that replacing missing values with the mean or even the median budget of the other movies could introduce bias into the data. Hence we prefer to drop the troubled datapoints.

In [48]:
# Drop movies with missing budget
movies = movies.dropna(subset = ["budget"]).reset_index(drop = True)
In [49]:
print("We have {} datapoints remaining.".format(movies.shape[0]))
We have 2528 datapoints remaining.

Transforming the budget Column

Now that we have finally cleaned the budget values, we can apply the same log transformation we used to normalize the revenue column. Since we either replaced or dropped every \$0 budget, we can be sure this operation won't introduce any new null values.

In [50]:
# Log transform the `budget` column
movies["log_budget"] = np.log(movies["budget"])

Let's see how this transformation affected the shape of the distribution.

In [51]:
# Draw the distributions of the original and the log transformed budgets side by side
fig, axs = plt.subplots(1,2, figsize = (18,8))

# Draw distribution of the original budgets on the left
budget = sns.histplot(data = movies, x = "budget", ax = axs[0], kde = True);
_ = budget.set_title("Distribution of Budget");

# Draw distribution of the log transformed budgets on the right
log_budget = sns.histplot(data = movies, x = "log_budget", ax = axs[1], kde = True);
_ = log_budget.set_title("Distribution of Log Transformed Budget")

# Use same scale on y-axis to facilitate comparison
for ax in axs:
    _ = ax.set(yticks = np.arange(0,700,100));

Even in this case, the log transformation helped reduce the skew of the distribution.

Let's also double-check the correlation between the new, log transformed variables.

In [52]:
# Draw the correlation between the original and the log transformed budgets and revenues side by side
fig, axs = plt.subplots(1,2, figsize = (18,8))

# Draw correlation of original budgets and revenues on the left
budget_revenue = sns.scatterplot(data = movies, x = "budget", y = "revenue", ax = axs[0]);
_ = budget_revenue.set_title("Correlation of Budget and Revenue");

# Draw correlation of log transformed budgets and revenues on the right
log_budget_revenue = sns.scatterplot(data = movies, x = "log_budget", y = "log_revenue", ax = axs[1]);
_ = log_budget_revenue.set_title("Correlation of Log_Budget and Log_Revenue")

Thanks to the log transformation, the datapoints in the plot on the right are no longer squished into the lower left corner around the origin but appear more spread out. As a consequence, it is also easier to recognize a linear correlation between the two columns.

Working with the popularity and runtime Columns

Let's move on to the two remaining numeric features, popularity and runtime.

In [53]:
# Print null values count in the `runtime` and `popularity` columns
movies[["runtime", "popularity"]].isnull().sum()
Out[53]:
runtime       2
popularity    0
dtype: int64
In [54]:
# Print summary statistics
movies[["runtime", "popularity"]].describe()
Out[54]:
runtime popularity
count 2526.000000 2528.000000
mean 109.099367 9.254648
std 21.835570 12.910059
min 0.000000 0.000308
25% 95.000000 5.144377
50% 105.000000 7.943562
75% 119.750000 11.341189
max 338.000000 294.337037

Besides the 2 "official" null values, the runtime column also has a minimum of 0 minutes, which reveals the presence of other invalid values that need to be replaced. Let's check how many of them there are.

In [55]:
# Look for movies with 0 minutes runtime
(movies["runtime"] == 0).sum()
Out[55]:
6

There are 6 movies with a runtime value of zero; adding the 2 missing values, that makes a total of 8 movies that need to be fixed. Given this small number, we could easily drop the corresponding rows. However, since movies' runtimes are fairly easy to find online, we will enter the true values manually.

Once again, we will gather the missing information from the movies' IMDB pages by appending the code that appears in the imdb_id column to the path https://www.imdb.com/title/.

In [56]:
# Show the movies with invalid runtimes
invalid_runtimes = movies.loc[(movies["runtime"] == 0) | (movies["runtime"].isnull()), ["imdb_id", "title"]]
invalid_runtimes
Out[56]:
imdb_id title
324 tt2550838 The Worst Christmas of My Life
777 tt2604346 ¿Quién mató a Bambi?
825 tt2076251 La peggior settimana della mia vita
1111 tt1107828 Королёв
1794 tt0477337 Mechenosets
1932 tt0116485 Happy Weekend
2100 tt1620549 Hooked on the Game 2. The Next Level
2412 tt2456720 Tutto tutto niente niente
In [57]:
# Collect the true runtime values
# Unrelated curiosity: three of these movies are Italian and I actually saw them when I lived in Rome
true_runtimes = [86, 86, 93, 130, 108, 90, 86, 96]
In [58]:
# Replace the invalid runtimes with the true values
index_to_fix = invalid_runtimes.index
movies.loc[index_to_fix, "runtime"] = true_runtimes
In [59]:
# Make sure there's no missing runtime anymore
movies["runtime"].isnull().sum()

# Make sure there's no 0 runtime left
(movies["runtime"] == 0).sum()
Out[59]:
0
Out[59]:
0

Now that we have replaced the invalid values, we can check how runtimes correlate with movies' revenues.

In [60]:
# Draw the distribution of the runtime values and their correlation with log_revenues side by side
fig, axs = plt.subplots(1,2, figsize = (18,8))

# Draw distribution of runtimes on the left
runtime = sns.histplot(data = movies, x = "runtime", ax = axs[0], kde = True);
_ = runtime.set_title("Distribution of Runtime (in mins)");

# Draw correlation of runtimes and log_revenues on the right
runtime_log_revenue = sns.scatterplot(data = movies, x = "runtime", y = "log_revenue", ax = axs[1]);
_ = runtime_log_revenue.set_title("Correlation of Runtime and Log_Revenue")

From the histogram on the left, we can see that most movies are shorter than 2.5 hours and the most frequent runtimes fall just below the 100-minute mark. As expected, given the correlation coefficients we have already printed, we do not observe any strong correlation with the log_revenue values. The scatterplot on the right, however, clearly shows the presence of a group of movies with runtimes exceeding 200 minutes. To help guarantee a good performance of the regression models, we get rid of these outliers.

In [61]:
# Remove the `runtime` outliers
movies = movies[movies["runtime"] < 200]

Let's draw the same kind of plots for the popularity column.

In [62]:
# Draw the distribution of the popularity scores and their correlation with log_revenues side by side
fig, axs = plt.subplots(1,2, figsize = (18,8))

# Draw distribution of popularity scores on the left
popularity = sns.histplot(data = movies, x = "popularity", ax = axs[0], kde = True);
_ = popularity.set_title("Distribution of Popularity");

# Draw correlation of popularity scores and log_revenues on the right
popularity_log_revenue = sns.scatterplot(data = movies, x = "popularity", y = "log_revenue", ax = axs[1]);
_ = popularity_log_revenue.set_title("Correlation of Popularity and Log_Revenue")

The distribution of the popularity values has such a strong skew that the histogram on the left is nearly impossible to read. From the scatterplot on the right, we infer that this skew is mostly due to the presence of outliers with popularity greater than 50, extending all the way up to nearly 300.

Let's remove these outliers and look again at the distribution of the remaining datapoints.

In [63]:
# Remove the `popularity` outliers
# We save the result in a new variable because we haven't yet decided how to deal with the `popularity` column   
movies_temp = movies[movies["popularity"] < 50]
In [64]:
# Draw zoomed in distribution of the popularity scores and their correlation with log_revenues side by side
# Specifically, we leave out the outliers, i.e., movies with popularity > 50 
fig, axs = plt.subplots(1,2, figsize = (18,8))

# Draw zoomed in distribution of popularity scores on the left
popularity = sns.histplot(data = movies_temp, x = "popularity", ax = axs[0], kde = True);
_ = popularity.set_title("Zoomed in Distribution of Popularity");

# Draw zoomed in correlation of popularity scores and log_revenues on the right
popularity_log_revenue = sns.scatterplot(data = movies_temp, x = "popularity", y = "log_revenue", ax = axs[1]);
_ = popularity_log_revenue.set_title("Zoomed in Correlation of Popularity and Log_Revenue")

It looks like we still have a few outliers, characterized by a popularity score above 30. There is only a handful of them, so, just to be safe, we drop those too.

In [65]:
# Drop `popularity` outliers by setting threshold at 30
movies = movies[movies["popularity"] < 30].reset_index(drop = True)
In [66]:
print("We have {} datapoints remaining.".format(movies.shape[0]))
We have 2488 datapoints remaining.

Establishing the Numeric Features

We are done analyzing the numeric features; let's put them in a list so we don't lose track of them.

In [67]:
numeric_features = ["budget", "log_budget", "runtime", "popularity"]

Trimming the Non-Numeric Features

Let's now focus on the non-numeric columns which, as we've seen, make up the majority of the data we have available. We first need to decide which of these columns to use for making predictions; then, we'll need to do some feature engineering in order to encode the information into numeric values that we can feed to the regression models.

In [68]:
# Select the non-numeric columns and make them the initial set of non-numeric features
non_numeric_df = movies.select_dtypes(exclude = [np.number])
non_numeric_features = non_numeric_df.columns
In [69]:
print("There are {} non-numeric columns. They are:\n{}".format(len(non_numeric_features), non_numeric_features))
There are 19 non-numeric columns. They are:
Index(['belongs_to_collection', 'genres', 'homepage', 'imdb_id',
       'original_language', 'original_title', 'overview', 'poster_path',
       'production_companies', 'production_countries', 'release_date',
       'spoken_languages', 'status', 'tagline', 'title', 'Keywords', 'cast',
       'crew', 'main_production_company'],
      dtype='object')
In [70]:
# Print the first three rows and all non-numeric columns
non_numeric_df.head(3)
Out[70]:
belongs_to_collection genres homepage imdb_id original_language original_title overview poster_path production_companies production_countries release_date spoken_languages status tagline title Keywords cast crew main_production_company
0 [{'id': 313576, 'name': 'Hot Tub Time Machine ... [{'id': 35, 'name': 'Comedy'}] NaN tt2637294 en Hot Tub Time Machine 2 When Lou, who has become the "father of the In... /tQtWuwvMf0hCc2QR2tkolwl7c3c.jpg [Paramount Pictures, United Artists, Metro-Gol... [{'iso_3166_1': 'US', 'name': 'United States o... 2/20/15 [{'iso_639_1': 'en', 'name': 'English'}] Released The Laws of Space and Time are About to be Vio... Hot Tub Time Machine 2 [{'id': 4379, 'name': 'time travel'}, {'id': 9... [{'cast_id': 4, 'character': 'Lou', 'credit_id... [{'credit_id': '59ac067c92514107af02c8c8', 'de... Paramount Pictures
1 [{'id': 107674, 'name': 'The Princess Diaries ... [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... NaN tt0368933 en The Princess Diaries 2: Royal Engagement Mia Thermopolis is now a college graduate and ... /w9Z7A0GHEhIp7etpj0vyKOeU1Wx.jpg [Walt Disney Pictures] [{'iso_3166_1': 'US', 'name': 'United States o... 8/6/04 [{'iso_639_1': 'en', 'name': 'English'}] Released It can take a lifetime to find true love; she'... The Princess Diaries 2: Royal Engagement [{'id': 2505, 'name': 'coronation'}, {'id': 42... [{'cast_id': 1, 'character': 'Mia Thermopolis'... [{'credit_id': '52fe43fe9251416c7502563d', 'de... Walt Disney Pictures
2 NaN [{'id': 53, 'name': 'Thriller'}, {'id': 18, 'n... http://kahaanithefilm.com/ tt1821480 hi Kahaani Vidya Bagchi (Vidya Balan) arrives in Kolkata ... /aTXRaPrWSinhcmCrcfJK17urp3F.jpg NaN [{'iso_3166_1': 'IN', 'name': 'India'}] 3/9/12 [{'iso_639_1': 'en', 'name': 'English'}, {'iso... Released NaN Kahaani [{'id': 10092, 'name': 'mystery'}, {'id': 1054... [{'cast_id': 1, 'character': 'Vidya Bagchi', '... [{'credit_id': '52fe48779251416c9108d6eb', 'de... NaN
In [71]:
# Print null values count
non_numeric_df.isnull().sum()
Out[71]:
belongs_to_collection      1957
genres                        5
homepage                   1686
imdb_id                       0
original_language             0
original_title                0
overview                      6
poster_path                   1
production_companies         72
production_countries         26
release_date                  0
spoken_languages              8
status                        0
tagline                     373
title                         0
Keywords                    175
cast                          9
crew                         11
main_production_company      72
dtype: int64
In [72]:
# Print some summary statistics
non_numeric_df.describe()
Out[72]:
belongs_to_collection genres homepage imdb_id original_language original_title overview poster_path production_companies production_countries release_date spoken_languages status tagline title Keywords cast crew main_production_company
count 531 2483 802 2488 2488 2488 2482 2487 2416 2462 2488 2480 2488 2115 2488 2313 2479 2477 2416
unique 372 790 799 2488 32 2470 2482 2487 1969 269 2046 351 2 2112 2468 2266 2476 2477 715
top [{'id': 645, 'name': 'James Bond Collection', ... [{'id': 18, 'name': 'Drama'}] http://www.transformersmovie.com/ tt2637294 en Sahara When Lou, who has become the "father of the In... /tQtWuwvMf0hCc2QR2tkolwl7c3c.jpg [Paramount Pictures] [{'iso_3166_1': 'US', 'name': 'United States o... 9/10/10 [{'iso_639_1': 'en', 'name': 'English'}] Released Based on a true story. Fantastic Four [{'id': 10183, 'name': 'independent film'}] [] [{'credit_id': '59ac067c92514107af02c8c8', 'de... Universal Pictures
freq 16 199 3 1 2203 2 1 1 49 1529 5 1545 2487 3 2 19 4 1 164

Removing Useless Features

First of all, we spot some columns that are of no help for our regression purposes: both imdb_id and poster_path contain paths to web resources and are not useful for making predictions. We can remove these columns from the list of features we'll use.

In [73]:
# Remove the `imdb_id` and `poster_path` columns from the list of features we'll use
non_numeric_features = non_numeric_features.drop(["imdb_id", "poster_path"])

The title and original_title columns, too, are helpful for identifying and locating a movie but have very little predictive power. The only helpful piece of information a movie title could convey is whether the movie is part of a collection/series; but this kind of clue is already provided by the belongs_to_collection column.

For these reasons, we remove title and original_title from the list of features.

In [74]:
# Remove the `title` and `original_title` columns from the list of features we'll use
non_numeric_features = non_numeric_features.drop(["title", "original_title"])

Removing Redundant Features

The spoken_languages column contains lists (using the same JSON-syntax strings we have already encountered) of all the languages that are used throughout each movie. This information is more granular than the one contained in the original_language column (that is, the main language used in the movie) but ends up being redundant. Hence, we drop this feature as well.

In [75]:
# Remove the `spoken_languages` column from the list of features we'll use
non_numeric_features = non_numeric_features.drop(["spoken_languages"])

Furthermore, the information contained in the production_countries column is largely a mix of what can already be found in production_companies and original_language: we go ahead and remove this feature too.

In [76]:
# Remove the `production_countries` column from the list of features we'll use
non_numeric_features = non_numeric_features.drop(["production_countries"])

Finally, we have already transformed the production_companies column and obtained more manageable data by extracting each movie's main production company. As a consequence, the original column is redundant and can be removed.

In [77]:
# Remove the `production_companies` column from the list of features we'll use
non_numeric_features = non_numeric_features.drop(["production_companies"])
In [78]:
print("The remaining non-numeric features are:\n", non_numeric_features)
The remaining non-numeric features are:
 Index(['belongs_to_collection', 'genres', 'homepage', 'original_language',
       'overview', 'release_date', 'status', 'tagline', 'Keywords', 'cast',
       'crew', 'main_production_company'],
      dtype='object')

Removing Hard to Process Features

Both the overview and tagline columns contain strings which help describe each movie. These variables are neither numeric nor categorical, and it would be hard to find a method to transform this free text into viable features. Hence, we prefer to exclude these columns from the list of features we are crafting.

In [79]:
# Remove the `overview` and `tagline` columns from the list of features we'll use
non_numeric_features = non_numeric_features.drop(["overview", "tagline"])

Finally, the Keywords, cast and crew columns all contain potentially helpful information. For instance, it is logical to expect that the allure of a famous actor or the artistry of a talented director can boost a movie's revenue. We could treat these columns as qualitative variables by considering each keyword, cast and crew member as a different categorical level. However, even if we limited the cast members to just the leading roles and the crew members to just the directors, this would realistically result in an exorbitant number of levels which would be hard to handle.

For these reasons, we prefer to first train some models leaving these features out. After evaluating the accuracy of the predictions, we can always go back and strive to incorporate this information using some clever approach.

In [80]:
# Remove the `Keywords`, `cast` and `crew` columns from the list of features we'll use
non_numeric_features = non_numeric_features.drop(["Keywords", "cast", "crew"])
In [81]:
print("The remaining non-numeric features are:\n", non_numeric_features)
The remaining non-numeric features are:
 Index(['belongs_to_collection', 'genres', 'homepage', 'original_language',
       'release_date', 'status', 'main_production_company'],
      dtype='object')
In [82]:
# Print summary statistics of the remaining features
non_numeric_df = movies[non_numeric_features]
non_numeric_df.describe()
Out[82]:
belongs_to_collection genres homepage original_language release_date status main_production_company
count 531 2483 802 2488 2488 2488 2416
unique 372 790 799 32 2046 2 715
top [{'id': 645, 'name': 'James Bond Collection', ... [{'id': 18, 'name': 'Drama'}] http://www.transformersmovie.com/ en 9/10/10 Released Universal Pictures
freq 16 199 3 2203 5 2487 164

Before moving on, let's acknowledge that issues similar to the ones we raised above involve some of the remaining features too (for instance, genres, main_production_company and original_language). However, these columns should yield a smaller number of categorical levels; we feel confident we can find a good strategy to organize these levels and then, using one-hot encoding, obtain valid numeric variables.
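
As a preview of what one-hot encoding produces, here is a minimal sketch applied to the original_language column (an illustration only; the actual encoding will happen once we have organized the levels):

# One-hot encode the `original_language` column: one 0/1 indicator column per language
language_dummies = pd.get_dummies(movies["original_language"], prefix = "lang")
language_dummies.head(3)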

Removing Features with Low Variance

From the output above, we see that the status column contains only two distinct values, one of which appears just once. It's very unlikely that this distinction bears any relevance but, for the sake of completeness, let's still check the status of the single non-conforming datapoint.

In [83]:
# Check the only datapoint with non-uniform status
movies[movies["status"] != "Released"][["status"] + [col for col in movies.columns if col != "status"]]
                                                     # We make the `status` column appear in first position for ease of reading
Out[83]:
status id belongs_to_collection budget genres homepage imdb_id original_language original_title overview popularity poster_path production_companies production_countries release_date runtime spoken_languages tagline title Keywords cast crew revenue log_revenue main_production_company log_budget
1337 Rumored 1619 NaN 891000.0 [{'id': 99, 'name': 'Documentary'}] NaN tt0145394 en A Place Called Chiapas On January 1, 1994, the Zapatista National Lib... 0.179246 /4j9U7QMT5D5HhUoIsoAI1niVqz8.jpg NaN [{'iso_3166_1': 'CA', 'name': 'Canada'}] 2/20/98 89.0 [{'iso_639_1': 'en', 'name': 'English'}, {'iso... Inside the world's first postmodern revolution A Place Called Chiapas [{'id': 187056, 'name': 'woman director'}] [{'cast_id': 1, 'character': 'Himself', 'credi... [{'credit_id': '57b4f7619251417287000dfc', 'de... 229000 12.341477 NaN 13.7001

It's not clear why this movie is marked as Rumored even though it actually came out in 1998; but we have confirmation that the status column doesn't add any kind of useful knowledge. Therefore, we can remove it entirely from our dataframe.

In [84]:
# Drop the `status` column from the entire dataframe
movies = movies.drop(columns = ["status"])

# Remove the `status` column from the list of features we'll use
non_numeric_features = non_numeric_features.drop(["status"])
In [85]:
print("The remaining non-numeric features are:\n", non_numeric_features)
The remaining non-numeric features are:
 Index(['belongs_to_collection', 'genres', 'homepage', 'original_language',
       'release_date', 'main_production_company'],
      dtype='object')
In [86]:
# Print null values count
movies[non_numeric_features].isnull().sum()
Out[86]:
belongs_to_collection      1957
genres                        5
homepage                   1686
original_language             0
release_date                  0
main_production_company      72
dtype: int64

Engineering the Non-Numeric Features

We have done a good job trimming the list of non-numeric features, leaving only the ones that can be useful. Now we need to perform some feature engineering to transform these columns into numeric variables we can use in regression modeling. In doing this, we also need to figure out how to handle the null values which, as shown by the counts above, are copious.

Dichotomizing the belongs_to_collection and homepage Columns

The columns with the greatest number of null values are belongs_to_collection and homepage. In this case, however, rather than marking actual missing values, NaNs denote the absence of the related attribute. More specifically, if a movie is not part of a collection/series or doesn't have an official webpage, then the corresponding row will have a null value in the belongs_to_collection or homepage column respectively.

Let's get visual confirmation of this by printing a few rows.

In [87]:
# Print a few movies that don't/do have a null value in the `belongs_to_collection` column
# By locating these movies on IMDB one can check that NaNs are used to denote movies which are not part of a collection
movies.loc[movies["belongs_to_collection"].notnull(), ["imdb_id", "title", "belongs_to_collection"]].head(3)
movies.loc[movies["belongs_to_collection"].isnull(), ["imdb_id", "title", "belongs_to_collection"]].head(3)
Out[87]:
imdb_id title belongs_to_collection
0 tt2637294 Hot Tub Time Machine 2 [{'id': 313576, 'name': 'Hot Tub Time Machine ...
1 tt0368933 The Princess Diaries 2: Royal Engagement [{'id': 107674, 'name': 'The Princess Diaries ...
5 tt0117110 Muppet Treasure Island [{'id': 256377, 'name': 'The Muppet Collection...
Out[87]:
imdb_id title belongs_to_collection
2 tt1821480 Kahaani NaN
3 tt0093743 Pinocchio and the Emperor of the Night NaN
4 tt0431021 The Possession NaN
In [88]:
# Print a few movies that don't/do have a null value in the `homepage` column
# By locating these movies on IMDB one can check that NaNs are used to denote movies which do not have an official website
movies.loc[movies["homepage"].notnull(), ["imdb_id", "title", "homepage"]].tail(3)
movies.loc[movies["homepage"].isnull(), ["imdb_id", "title", "homepage"]].tail(3)
Out[88]:
imdb_id title homepage
2482 tt0362227 The Terminal http://www.theterminal-themovie.com/
2486 tt0343135 Along Came Polly http://www.alongcamepolly.com/
2487 tt1600195 Abduction http://www.abductionthefilm.com/
Out[88]:
imdb_id title homepage
2483 tt0105327 School Ties NaN
2484 tt0109403 Chasers NaN
2485 tt0116908 The Long Kiss Goodnight NaN

Given the description above, these two columns look perfectly suited to being transformed into dichotomous variables. In fact, rather than the specific collection or the link to the movie's website, we are mostly interested in knowing whether a movie is part of a collection and whether it has a website.

More specifically, we'll replace all the null values in these columns with 0s (to denote the absence of the attribute) and all the other values with 1s (to denote its presence).

In [89]:
# Dichotomize `belongs_to_collection` and `homepage` columns
movies["belongs_to_collection"] = np.where(movies["belongs_to_collection"].isnull(), 0, 1)
movies["homepage"] = np.where(movies["homepage"].isnull(), 0, 1)
In [90]:
# Check the updated values
movies["belongs_to_collection"].value_counts(dropna = False)
movies["homepage"].value_counts(dropna = False)
Out[90]:
0    1957
1     531
Name: belongs_to_collection, dtype: int64
Out[90]:
0    1686
1     802
Name: homepage, dtype: int64

The values have been updated and the numbers of 0s match the null value counts we had prior to the transformation.

Before moving on, let's check how the target variable depends on the variables we just created. In other words, we want to see if there's a difference between the revenues of movies that are/are not part of a collection and that do/do not have an official website.

In [91]:
fig, axs = plt.subplots(1,2, figsize = [16, 6])

# Draw distribution of log_revenues based on the `belongs_to_collection` value
sns.boxplot(data = movies, x = "belongs_to_collection", y = "log_revenue", ax = axs[0]);
axs[0].set_title("Log_Revenue Distribution Depending on \n Belongs/Not Belongs to Collection");

# Draw distribution of log_revenues based on the `homepage` value
sns.boxplot(data = movies, x = "homepage", y = "log_revenue", ax = axs[1]);
axs[1].set_title("Log_Revenue Distribution Depending on \n Has/Hasn't Homepage");

# To improve data to ink ratio, we only draw the y-axis label and y-tick labels on the first plot
axs[1].set(ylabel= "", yticklabels = []);

We can see that movies which are part of a collection or have a website do tend to have higher revenues than movies which don't. The difference is not negligible either, since the median of the blue distributions (the ones denoting the presence of the attribute) lies, in both cases, around the upper quartile of the corresponding red distributions.

These observations might not come as a surprise and follow common sense (for instance, if a movie got approved for a sequel, the previous installments were likely box office hits, so it is more probable the new film will generate a high gross). For our purposes, the figure above serves as a validation that the belongs_to_collection and homepage variables can help us predict movie revenues more accurately. Hence, we keep them both.
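
To complement the visual reading, here is a quick numeric comparison of the medians (a convenience snippet, not part of the original pipeline):

In [ ]:
# Median log-revenue without (0) vs. with (1) each attribute
movies.groupby("belongs_to_collection")["log_revenue"].median()
movies.groupby("homepage")["log_revenue"].median()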

In [92]:
# Update the list of features that are left to treat
features_to_treat = non_numeric_features.drop(["belongs_to_collection", "homepage"])
In [93]:
print("The remaining non-numeric features to treat:\n", features_to_treat)
The remaining non-numeric features to treat:
 Index(['genres', 'original_language', 'release_date',
       'main_production_company'],
      dtype='object')

Analyzing the genres Column

Cleaning the Values

The next column we take into consideration is genres. A priori, before even looking at the data, it's reasonable to believe that a movie's genre can have quite a big impact on its box office gross: more popular genres attract more people and hence have a better chance of making money. Given this premise, we are eager to include this variable in the list of features we use for training models; however, we need to decide how to encode the different genres as numeric values.

Baby steps first: let's inspect how the values in the genres column are stored.

In [94]:
# Print the first two values of the `genres` column
movies.iloc[0]["genres"]
movies.iloc[1]["genres"]
Out[94]:
"[{'id': 35, 'name': 'Comedy'}]"
Out[94]:
"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10751, 'name': 'Family'}, {'id': 10749, 'name': 'Romance'}]"
In [95]:
# Print the null values count
movies["genres"].isnull().sum()
Out[95]:
5

We recognize the same JSON-syntax strings that we found in the production_companies column. We can clean these values using the same function we defined back then. As usual, we will make sure to disregard null values.
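
As a reminder of the idea behind that function (the actual format_json defined earlier may be implemented differently; this is a minimal sketch), the strings use single quotes, so they are Python literals rather than strict JSON and can be safely parsed with ast.literal_eval:

In [ ]:
import ast

def format_json_sketch(value):
    # Minimal sketch of a parser: turn a TMDB-style string into a list of dictionaries
    return ast.literal_eval(value)

format_json_sketch("[{'id': 35, 'name': 'Comedy'}]")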

In [96]:
# Clean the `genres` column parsing the JSON strings
movies["genres"] = movies["genres"].apply(lambda x : format_json(x) if pd.notnull(x) else x)
In [97]:
# Take a look at the result
movies["genres"]
Out[97]:
0                          [{'id': 35, 'name': 'Comedy'}]
1       [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...
2       [{'id': 53, 'name': 'Thriller'}, {'id': 18, 'n...
3       [{'id': 16, 'name': 'Animation'}, {'id': 12, '...
4       [{'id': 27, 'name': 'Horror'}, {'id': 53, 'nam...
                              ...                        
2483                        [{'id': 18, 'name': 'Drama'}]
2484    [{'id': 35, 'name': 'Comedy'}, {'id': 10749, '...
2485    [{'id': 80, 'name': 'Crime'}, {'id': 28, 'name...
2486    [{'id': 35, 'name': 'Comedy'}, {'id': 10749, '...
2487    [{'id': 53, 'name': 'Thriller'}, {'id': 28, 'n...
Name: genres, Length: 2488, dtype: object

The last step consists of extracting the genre names from each dictionary in each list.

In [98]:
# Extract the genres names
movies["genres"] = movies["genres"].apply(lambda x : [genre["name"] for genre in x] if type(x) == list else x)
In [99]:
# Check the result
movies["genres"]
Out[99]:
0                                 [Comedy]
1         [Comedy, Drama, Family, Romance]
2                        [Thriller, Drama]
3           [Animation, Adventure, Family]
4                       [Horror, Thriller]
                       ...                
2483                               [Drama]
2484                     [Comedy, Romance]
2485    [Crime, Action, Mystery, Thriller]
2486                     [Comedy, Romance]
2487           [Thriller, Action, Mystery]
Name: genres, Length: 2488, dtype: object
In [100]:
# Make sure the number of null values hasn't changed
movies["genres"].isnull().sum()
Out[100]:
5

Filling in the Null Values

Before we analyze the genres in more detail, let's treat the null values. Since there are only 5 of them, we'll fill in the missing information by locating each movie on IMDB with the help of its imdb_id.

In [101]:
# Show the movies with missing genres
missing_genres = movies.loc[movies["genres"].isnull(), ["imdb_id", "title", "release_date"]]
missing_genres
Out[101]:
imdb_id title release_date
385 tt0349159 The Book of Mormon Movie, Volume 1: The Journey 9/12/03
1341 tt0261755 Jackpot 7/26/01
1502 tt0110289 Ryaba, My Chicken 10/1/94
2001 tt0984177 Amarkalam 8/25/99
2401 tt1766044 Rita's Last Fairy Tale 11/1/12
In [102]:
# Collect the true genres (looked up on IMDB, in the same row order as `missing_genres`)
true_genres = np.array([["Adventure"], ["Drama", "Comedy"], ["Comedy"],
                       ["Action", "Drama", "Romance"], ["Drama", "Fantasy", "Mystery"]], dtype = "object")
In [103]:
# Replace the missing genres with the true values
index_to_fix = missing_genres.index
movies.loc[index_to_fix, "genres"] = true_genres
In [104]:
# Make sure there's no missing value left
movies["genres"].isnull().sum()
Out[104]:
0

Analyzing the Different Genres

Let's take a look at all the unique genres that appear.

In [105]:
# Get a list of the unique genres
unique_genres = np.unique([genre for genres in movies["genres"] for genre in genres])
print("There are {} unique genres. They are:\n {}".format(len(unique_genres), unique_genres))
There are 20 unique genres. They are:
 ['Action' 'Adventure' 'Animation' 'Comedy' 'Crime' 'Documentary' 'Drama'
 'Family' 'Fantasy' 'Foreign' 'History' 'Horror' 'Music' 'Mystery'
 'Romance' 'Science Fiction' 'TV Movie' 'Thriller' 'War' 'Western']

Before visualizing each genre's frequency, we need to decide what counting method to use. More specifically, we need to take into account the fact that many movies are assigned multiple genres. We had a similar situation when we cleaned the production_companies column, and back then we decided to only keep the first value (the "main" production company).

Here, however, we are worried this same strategy would result in a hefty loss of granularity: oftentimes, confining a movie to just one genre doesn't do it justice; in fact, one might argue that the ability to blend the atmospheres, styles and themes typical of different genres is a main success factor for a movie.

For these reasons, we count genres allowing each movie to contribute to every genre it is assigned.

In [106]:
# Create a series containing all genres from all movies
stacked_genres = pd.Series([genre for genres in movies["genres"] for genre in genres], name = "genres")
In [107]:
# Plot the genres frequencies

plt.figure(figsize = [18,8]);

sns.countplot(x = stacked_genres, order = stacked_genres.value_counts(ascending = True).index)
plt.xticks(rotation = 90);
plt.title("Movies Count by Genre");

Drama is by far the most frequent genre; Comedy comes second, followed by Thriller and Action, which are almost tied for third place. There is a relatively large gap between fourth and fifth place, but then we find another group of three genres (in order: Romance, Crime and Adventure) with similar counts.

Let's also compare which genres draw the most interest using the movies' popularity scores. Once again, we will take into account the fact that a movie can have multiple genres.

In [108]:
# Repeat each movie's popularity score once for each of its genres
stacked_popularity = movies["popularity"].repeat(movies["genres"].apply(len)).reset_index(drop = True)
In [109]:
# Create a dataframe with the stacked genres and repeated popularity scores
# For each movie, this allows us to assign its popularity score to all of its genres
genres_popularity = pd.concat([stacked_genres, stacked_popularity], axis = 1)
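
As a quick sanity check (a convenience snippet, not in the original notebook), the repeat trick is only valid if the stacked series line up one-to-one:

In [ ]:
# Each stacked genre must have received exactly one popularity score
assert len(stacked_genres) == len(stacked_popularity) == len(genres_popularity)
genres_popularity.head()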
In [110]:
plt.figure(figsize = [18,8]);

# Plot the popularity score by genre
sns.stripplot(data = genres_popularity, x = "genres", y = "popularity",
              order = stacked_genres.value_counts(ascending = True).index); # Order the genres by movie count

plt.xticks(rotation = 90);
plt.xlabel("")
plt.title("Popularity Score by Genre");