Do movies that cost more make more money?

In this unit of the class we are going to consider ways to measure relationships between variables, such as budget and revenue.

This cell is a "markdown" cell. It displays formatted text in *italics*, **bold**, `monospace`, and $m^at_h$.

In [2]:
## You will often see import statements with
##  aliases: import numpy as np
## I can't see any reason to do this other than the
##  fact that lots of people seem to do it.
## Personally, if I have to choose between readability
##  and a few extra keystrokes, I will ALWAYS choose
##  readability, but if you are used to seeing np.,
##  that might be more readable.
## For beginning programming I prefer to keep things
##  simple and direct, but be aware that people may
##  expect you to use aliases, and may judge you.

import numpy
from matplotlib import pyplot

## The pandas library provides tools for working with
##  "data frames": tables of data with named columns
##  each with some type (int, float, date, etc)
import pandas

## This is a notebook command that makes sure plots
##  are displayed and not just returned as objects.
%matplotlib inline

## These are special packages for notebooks that
##  give us interactive controls.
from ipywidgets import interact, interactive, fixed


If we're going to use multiple variables, we need ways to represent them together.

If every variable is the same type we can combine them into a single numpy array with more than one axis. For two-dimensional arrays by convention we refer to the first axis as rows and the second as columns.
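As a small sketch of this (with made-up variable names, not the data used below), two separate 1-D samples of the same length can be combined into one two-column array with numpy.column_stack:

```python
import numpy

# two 1-D samples of the same length (hypothetical data)
heights = numpy.array([1.6, 1.7, 1.8])
weights = numpy.array([55.0, 70.0, 80.0])

# column_stack places each input array as a column, so rows are
#  observations and columns are variables
pairs = numpy.column_stack((heights, weights))
print(pairs.shape)    # (3, 2)
print(pairs[0, :])    # first row: one (height, weight) pair
```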

In [24]:
x = numpy.random.normal(0, 1, size=(100,2))
x.shape

Out[24]:
(100, 2)
In [30]:
# to access a cell in the array we can specify an index for both axes (row, column)
x[1,1]

Out[30]:
-0.025806743956922857
In [31]:
# we can ask for more than one value, here the first five rows of the second column
x[0:5,1]

Out[31]:
array([ 0.28624967, -0.02580674, -0.09846675,  0.43131053,  1.60502716])
In [32]:
# if we want every index from a certain axis, we use :
x[1,:]

Out[32]:
array([-0.23507934, -0.02580674])

I now have a matrix with 100 rows and two columns. I can think of the columns as random variables $X_0$ and $X_1$, and each row as a pair $(X_{i0}, X_{i1})$. (I'm trying to be consistent with notation in my variable subscripts: row, then column.) Visualizing the data as a scatter plot can give us an overview.

I know "by construction" that there is no connection between the variable in the first column and the variable in the second column, because I sampled them all independently from a normal distribution. But in the specific sample I have here (rerunning the notebook will almost certainly change this), there seems to be a slight downward trend. If I plot the mean of the pairs as a second layer (orange), the quadrant to the left of and below the mean looks sparser than the quadrant to the left of and above it. I know that's just random, but let's see how it shows up in our measurement of covariance.

In [38]:
pyplot.scatter(x[:,0], x[:,1]) # the x_0, x_1 pairs in blue
pyplot.scatter(x[:,0].mean(), x[:,1].mean()) # the mean in orange
pyplot.show()

In [39]:
# first, let's calculate the variance of the two dimensions.
#  the ddof=1 argument means divide by (n-1) instead of n.
# it's around 1, which is good since these values are from
#  a normal with variance 1, and sample size 100.
x[:,0].var(ddof=1)

Out[39]:
1.0584813638646358
In [40]:
# same with the second column. this one is quite close to 1.0.
x[:,1].var(ddof=1)

Out[40]:
1.0079081230212463

Now let's look at the cov command. It doesn't just give us the covariance, it gives us a full "covariance matrix". You should be able to recognize the two variances as the diagonal of the matrix, with the covariance in the two off-diagonal entries.

|       | col A     | col B     |
|-------|-----------|-----------|
| row A | var(A)    | cov(A, B) |
| row B | cov(A, B) | var(B)    |

The covariance here is negative, which matches our visual impression that the trend is slightly downward. We know (by construction) that that's purely a coincidence, but it is a measurably true fact about this sample.

Note that cov uses the $\frac{1}{n-1}$ form for variance, while var by default uses the $\frac{1}{n}$ form.
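To check that note, here is a quick sketch that computes the sample covariance by hand and compares it to numpy.cov. (It draws fresh random data, so the numbers won't match the outputs above, but the identities hold for any sample.)

```python
import numpy

data = numpy.random.normal(0, 1, size=(100, 2))
a, b = data[:, 0], data[:, 1]

# numpy.cov divides by (n - 1) by default
c = numpy.cov(a, b)

# manual sample covariance: sum of products of deviations, over n - 1
n = len(a)
manual = ((a - a.mean()) * (b - b.mean())).sum() / (n - 1)

print(numpy.allclose(c[0, 1], manual))         # True
print(numpy.allclose(c[0, 0], a.var(ddof=1)))  # True: the diagonal is the variance
```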

In [35]:
numpy.cov(x[:,0], x[:,1])

Out[35]:
array([[ 1.05848136, -0.19008835],
       [-0.19008835,  1.00790812]])

Now let's look at the Pearson correlation coefficient. Numpy returns it to us in the same matrix format as cov, but it's really just one number.

In [41]:
numpy.corrcoef(x[:,0], x[:,1])

Out[41]:
array([[ 1.        , -0.18403627],
       [-0.18403627,  1.        ]])

Where did this number come from? It's exactly the same as the covariance divided by the product of standard deviations.

In [43]:
-0.19008835 / (numpy.sqrt(1.05848136) * numpy.sqrt(1.00790812))

Out[43]:
-0.18403626966432207
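The same identity can be written without hard-coding the numbers. This sketch works for any two samples (fresh random data here, so the value itself will differ from the output above):

```python
import numpy

a = numpy.random.normal(0, 1, size=100)
b = numpy.random.normal(0, 1, size=100)

# covariance divided by the product of standard deviations;
#  ddof=1 matches the (n-1) form that numpy.cov uses
cov_ab = numpy.cov(a, b)[0, 1]
r = cov_ab / (a.std(ddof=1) * b.std(ddof=1))

print(numpy.allclose(r, numpy.corrcoef(a, b)[0, 1]))  # True
```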
In [46]:
inputs = numpy.random.normal(0, 1, size=100)
errors = numpy.random.normal(0, 1, size=100)

def show_linear(scale):
    outputs = 1.3 * inputs + 0.2 + scale * errors
    pyplot.scatter(inputs, outputs)
    pyplot.text(-2, 3, str(numpy.corrcoef(inputs, outputs)[0,1]))
    pyplot.show()


This next cell is very cool but inscrutable. We are passing the function show_linear as an argument to the function interact. This pattern rarely shows up in Python; it's much more common in Javascript. Since show_linear has one argument, scale, we also pass in a tuple that lists a minimum and maximum value for scale along with a step (0.1).

interact will now create an input element for scale. Since the range we gave it is numeric, that input will be a slider. When we move the slider, interact calls show_linear with scale set to the slider's value. As we scale up the noise, the correlation coefficient decreases.
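The idea of passing a function as an argument can be seen in plain Python, without any notebook machinery. A toy sketch (all names here are made up for illustration):

```python
# a function that takes another function f as its argument and
#  calls it once for each value in a list, collecting the results
def call_with_values(f, values):
    return [f(v) for v in values]

def double(scale):
    return 2 * scale

# double is passed by name, with no parentheses: we hand over the
#  function itself, not the result of calling it
print(call_with_values(double, [0.0, 0.5, 1.0]))  # [0.0, 1.0, 2.0]
```

interact does essentially this, except that instead of a fixed list of values it calls the function each time the slider moves.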

In [48]:
interact(show_linear, scale=(0, 2, 0.1))

Out[48]:
<function __main__.show_linear(scale)>

Now let's consider our original question: Do movies with larger budgets have more revenue?

First we need to load the data. I found this on Kaggle, but removed all movies before 2016 and movies with a budget or revenue less than \$100,000. The pandas library makes it easy to load this file:

In [18]:
movies = pandas.read_csv("movies.csv")

In [19]:
movies.describe()

Out[19]:
             budget             id  popularity       revenue     runtime  vote_average   vote_count
count  2.960000e+02     296.000000  296.000000  2.960000e+02  296.000000    296.000000   296.000000
mean   4.639639e+07  316878.361486   20.237346  1.504744e+08  112.016892      6.384797  1158.388514
std    5.647466e+07   73209.405016   34.197320  2.406840e+08   18.190416      0.765532  1531.144387
min    2.000000e+05   14564.000000    0.350207  1.006590e+05   66.000000      4.100000     4.000000
25%    9.000000e+06  291144.750000    8.119095  8.952934e+06   98.000000      5.800000   162.000000
50%    2.200000e+07  330220.000000   12.281325  4.753546e+07  110.000000      6.400000   597.500000
75%    6.000000e+07  368258.250000   17.928330  1.782043e+08  123.000000      6.925000  1570.500000
max    2.600000e+08  443319.000000  294.337037  1.262886e+09  170.000000      8.100000 11444.000000

In [20]:
# In a pandas data frame each column becomes an attribute:
movies.budget

Out[20]:
0      100000000.0
1      160000000.0
2      230000000.0
3       58000000.0
4      200000000.0
5      250000000.0
6      165000000.0
7      178000000.0
8        3500000.0
9       18000000.0
10     110000000.0
11     180000000.0
12      75000000.0
13      31500000.0
14     175000000.0
15     165000000.0
16     185000000.0
17     250000000.0
18     175000000.0
19      50000000.0
20      14000000.0
21       3500000.0
22       5000000.0
23       5000000.0
24      10000000.0
25     149000000.0
26      18000000.0
27      46000000.0
28      22000000.0
29      38000000.0
          ...
266      7075038.0
267     69000000.0
268     12000000.0
269     42000000.0
270    125000000.0
271    250000000.0
272     38000000.0
273    175000000.0
274     10000000.0
275      2800000.0
276       916000.0
277       707503.0
278     18000000.0
279     10500000.0
280     34000000.0
281     20000000.0
282      3500000.0
283     80000000.0
284     60000000.0
285      5000000.0
286    152000000.0
287     21000000.0
288    197471676.0
289     30000000.0
290    100000000.0
291      8520000.0
292    260000000.0
293     60000000.0
294     50000000.0
295     11000000.0
Name: budget, Length: 296, dtype: float64

Let's start by plotting budget against revenue. Most of the movies have budgets below \$50M. Almost all movies with revenue above \$30M have budgets above \$50M.

In [23]:
pyplot.scatter(movies.budget, movies.revenue)
pyplot.show()


How does this affect the covariance? Since these numbers are extremely big, their variance and covariance are also huge. This makes sense: variance is the expectation of the square of the deviation from the mean, so its units are dollars squared.

In [21]:
numpy.cov(movies.budget, movies.revenue)

Out[21]:
array([[3.18938774e+15, 1.07111151e+16],
       [1.07111151e+16, 5.79288099e+16]])
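A quick sketch of how units drive these magnitudes: covariance carries the product of both variables' units, so measuring in dollars rather than millions of dollars inflates every entry by a factor of $10^6 \times 10^6 = 10^{12}$. (This uses made-up stand-in data, not the movies file, so the numbers won't match the outputs here.)

```python
import numpy

# hypothetical budgets and revenues in dollars
budget = numpy.random.uniform(1e6, 2e8, size=50)
revenue = 2.5 * budget + numpy.random.normal(0, 5e7, size=50)

cov_dollars = numpy.cov(budget, revenue)
cov_millions = numpy.cov(budget / 1e6, revenue / 1e6)

# covariance scales with the product of the two unit changes
print(numpy.allclose(cov_dollars, cov_millions * 1e12))  # True

# the correlation coefficient is unit-free, so it doesn't change
r_dollars = numpy.corrcoef(budget, revenue)[0, 1]
r_millions = numpy.corrcoef(budget / 1e6, revenue / 1e6)[0, 1]
print(numpy.allclose(r_dollars, r_millions))  # True
```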

The correlation coefficient is better behaved: it is unitless and always between -1 and 1. Here we get a correlation of 0.788, which is pretty large for such variable data.

In [22]:
numpy.corrcoef(movies.budget, movies.revenue)

Out[22]:
array([[1.        , 0.78801362],
       [0.78801362, 1.        ]])

How much of this correlation is because small budget films rarely make huge revenue? Let's try cutting out films under \\$50M budget.

In [51]:
## movies.budget is a pandas Series, which behaves like a numpy array
## movies.budget > 50000000 returns an array of booleans,
##  where each element is the result of comparing that movie's
##  budget to 50000000. Indexing the data frame with this boolean
##  array keeps only the rows where the comparison is True.

big_budget_movies = movies[ movies.budget > 50000000 ]
numpy.corrcoef(big_budget_movies.budget, big_budget_movies.revenue)

Out[51]:
array([[1.        , 0.55165039],
       [0.55165039, 1.        ]])
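The boolean-indexing pattern above works the same way on a plain numpy array. A minimal sketch with made-up budget numbers:

```python
import numpy

budgets = numpy.array([9e6, 2.2e7, 6e7, 2.6e8, 2e5])

# the comparison produces a boolean array, one entry per element
mask = budgets > 5e7
print(mask)           # [False False  True  True False]

# indexing with the mask keeps only the True positions
print(budgets[mask])  # [6.0e+07 2.6e+08]
```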