The Need for Openness in Data Journalism

Brian Keegan, Ph.D. (@bkeegan) College of Humanities and Social Sciences, Northeastern University

Do films that pass the Bechdel Test make more money for their producers? I've replicated Walt Hickey's recent article in FiveThirtyEight to find out. My results confirm his own in part, but also find notable differences that point the need for clarification at a minimum. While I am far from the first to make this argument, this case is illustrative of a larger need for journalism and other data-driven enterprises to borrow from hard-won scientific practices of sharing data and code as well as supporting the review and revision of findings. This admittedly lengthy post is a critique of not only this particular case but also an attempt to work through what open data journalism could look like.

The Angle: Data Journalism should emulate the openness of science

New data-driven journalists such as FiveThirtyEight have faced criticism from many quarters and the critiques, particularly around the naïveté of assuming credentialed experts can be bowled over by quantitative analysis so easily as the terrifyingly innumerate pundits who infest our political media [1,2,3,4]. While I find these critiques persuasive, I depart from them here to instead argue that I have found this "new" brand of data journalism disappointing foremost because it wants to perform science without abiding by scientific norms.

The questions of demarcating what is or is not science are fraught, so let's instead label my gripe a "failure to be open." By openness, I don't mean users commenting on articles or publishing whistleblowers' documents. I mean "openness" more in the sense of "open source software" where the code is made freely available to everyone to inspect, copy, modify, and redistribute. But the principles of open-source software trace their roots more directly back norms in the scientific community that Robert Merton identified and came to known as "CUDOS" norms. It's worth reviewing two of these norms because Punk-ass Data Journalism is very much on the lawn of Old Man Science and therein lie exciting possibilities for exciting adventures.

The first and last elements of Merton's "CUDOS" norms merit special attention for our discussion of openness. Communalism is the norm that scientific results are shared and become part of a commons that others can build upon --- this is the bit about "standing upon the shoulders of giants." Skepticism is the norm that claims must be subject to organized scrutiny by community --- which typically manifests as peer review. Both of these strongly motivated philosophies in the open source movement, and while they are practiced imperfectly in my experience within the social and information sciences (see my colleagues' recent work on the "Parable of Google Flu"), I nevertheless think data journalists should strive to make them their own practice as well.

  1. Data journalists should be open in making their data and analysis available to all comers. This flies in the face of traditions and professional anxieties surrounding autonomy, scooping, and the protection of sources. But how can claims be evaluated as true unless they can be inspected? If I ask a data journalist for her data or code, is she bound by the same norms as a scientist to share it? Where and how should journalists share and document these code and data?

  2. Data journalists should be open in soliciting and publishing feedback. Sure, journalists are used to clearing their story with an editor, but have they solicited an expert's evaluation of their claims? How willing are they to publish critiques of, commentary on, or revisions to their findings? If not, what are the venues for these discussions? How should a reporter or editor manage such a system?

The Guardian's DataBlog and ProPublica have each been doing exemplary work in posting their datasets, code, and other tools for several years. Other organizations like the Sunlight Foundation develop outstanding tools to aid reporters and activists, the Knight Foundation has been funding exciting projects around journalism innovation for years, and the Data Journalism Handbook reviews other excellent cases as well. My former colleague, Professor Richard Gordon at Medill reminded me ideas around "computer assisted reporting" have been in circulation in the outer orbits of journalism for decades. For example, Philip Meyer has been (what we would now call) evangelizing since the 1970s for "precision journalism" in which journalists adopt the tools and methods of the social and behavioral sciences as well as its norms of sharing data and replicating research. Actually, if you stopped reading now and promised to read his 2011 Hedy Lamarr Lecture, I won't even be mad.

The remainder of this post is an attempt demonstrate some ideas of what an "open collaboration" model for data journalism might look like. To that end, this article tries to do many things for many audiences which admittedly makes it hard for any single person to read. Let me try to sketch some of these out now and send you off in the right path.

  • First, I use an article Walt Hickey of FiveThirtyEight published on the relationship between the financial performance of films that the extent to which they grant their female characters substantive roles as a case to illustrate some pitfalls in both the practice and interpretation of statistical data. This is a story about having good questions, ambiguous models, wrong inferences, and exciting opportunities for investigation going forward. If you don't care for code or statistics, you can start reading at "The Hook" below and stop after "The Clip" below.
  • Second, for those readers who are willing to pay what one might call the "Iron Price of Data Journalism", I go "soup to nuts" and attempt to replicate Hickey's findings. I document all the steps I took to crawl and analyze this data to illustrate the need for better documentation of analyses and methods. This level of documentation may be excessive or it may yet be insufficient for others to replicate my own findings. But providing this code and data may expose flaws in my technical style (almost certainly), shortcomings in my interpretations (likely), and errors in my data and modeling (hopefully not). I actively invite this feedback via email, tweets, comments, or pull requests and hope to learn from it. I wish new data journalism enterprises adopted the same openness and tentativeness in their empirical claims. You should start reading at "Start Your Kernels..."
  • Third, I want to experiment with styles for analyzing and narrating findings that make both available in the same document. The hope is that motivated users can find the detail and skimmers can learn something new or relevant while being confident they can come back and dive in deeper if they wish. Does it make sense to have the story up front and the analysis "below the fold" or to mix narrative with analysis? How much background should I presume or provide about different analytical techniques? How much time do I need to spend on tweaking a visualization? Are there better libraries or platforms for serving the needs of mixed audiences? This is a meta point as we're in it now, but it'll crop up in the conclusion.
  • Fourth, I want to experiment with technologies for supporting collaboration in data journalism by adopting best practices from open collaborations in free software, Wikipedia, and others. For example, this blog post is not written in a traditional content-management system like WordPress, but is an interactive "notebook" that you can download and execute the code to verify that it works. Furthermore, I'm also "hosting" this data on GitHub so that others can easily access the writeup, code, and data, to see how it's changed over time (and has it ever...), and to suggest changes that I should incorporate. These can be frustrating tools with demoralizing learning curves, but these are incredibly powerful once apprenticed. Moreover, there are amazing resources and communities who exist to support newcomers and new tools are being released to flatten these learning curves. If data journalists joined data scientists and data analysts in sharing their work, it would contribute to an incredible knowledge commons of examples and cases that is lowering the bars for others who want to learn. This is also a meta point since it exists outside of this story, but I'll also come back to it in the conclusion.

In this outro to a very unusual introduction, I want to thank Professor Gordon from above, Professor Deen Freelon, Nathan Matias, and Alex Leavitt for their invaluable feedback on earlier drafts of this... post? article? piece? notebook?

The Hook: The Bechdel Test article in FiveThirtyEight

Walk Hickey published an article on April 1 on FiveThirtyEight, titled The Dollar-And-Cents Case Against Hollywood’s Exclusion of Women. The article examines the relationship between movies' finances and their portrayals of women using a well-known heuristic call the Bechdel test. The test has 3 simple requirements: a movie passes the Bechdel test if there are (1) two women in it, (2) who talk to each other, (3) about something besides a man.

Let me say at the outset, I like this article: It identifies a troubling problem, asks important questions, identifies appropriate data, and brings in relevant voices to speak to these issues. I should also include the disclaimer that I am not an expert in the area of empirical film studies like Dean Keith Simonton or Nick Redfern. I've invested a good amount time in criticizing the methods and findings of this article, but to Hickey's credit, I also haven't come across any scholarship that has attempted to quantify this relationship before: this is new knowledge about the world. Crucially, it speaks to empirical scholarship that has exposed how films with award-winning female roles are significantly less likely to win awards themselves [5], older women are less likely to win awards [6], actresses' earnings peak 17 years earlier than actors' earnings [7], and differences in how male and female critics rate films [8]. I have qualms about the methods and others may be justified in complaining it overlooks related scholarship like those I cited above, but this article is in the best traditions of journalism that focuses our attention on problems we should address as a society.

Hickey's article makes two central claims:

  1. We found that the median budget of movies that passed the test...was substantially lower than the median budget of all films in the sample.

  2. We found evidence that films that feature meaningful interactions between women may in fact have a better return on investment, overall, than films that don’t.

I call Claim 1 the "Budgets Differ" finding and Claim 2 the "Earnings Differ" finding. The results, as they're summarized here are relatively straightforward to test whether there's an effect of Bechdel scores on earnings and budget controlling for other explanatory variables.

But before I even get to running the numbers, I want to examine the claims Hickey made in the article. The interpretations he makes about the return on investment are particularly problematic interpretations of basic statistics. Hickey reports the following findings from his models (emphasis added).

We did a statistical analysis of films to test two claims: first, that films that pass the Bechdel test — featuring women in stronger roles — see a lower return on investment, and second, that they see lower gross profits. We found no evidence to support either claim.

On the first test, we ran a regression to find out if passing the Bechdel test corresponded to lower return on investment. Controlling for the movie’s budget, which has a negative and significant relationship to a film’s return on investment, passing the Bechdel test had no effect on the film’s return on investment. In other words, adding women to a film’s cast didn’t hurt its investors’ returns, contrary to what Hollywood investors seem to believe.

The total median gross return on investment for a film that passed the Bechdel test was \$2.68 for each dollar spent. The total median gross return on investment for films that failed was only \$2.45 for each dollar spent.

...On the second test, we ran a regression to find out if passing the Bechdel test corresponded to having lower gross profits — domestic and international. Also controlling for the movie’s budget, which has a positive and significant relationship to a film’s gross profits, once again passing the Bechdel test did not have any effect on a film’s gross profits.

Both models (whatever their faults, and there are some as we will explore in the next section) apparently produce an estimate that the Bechdel test has no effect on a film's financial performance. That is to say, the statistical test could not determine with a greater than 95% confidence that the correlation between these two variables was greater or less than 0. Because we cannot confidently rule out the possibility of there being zero effect, we cannot make any claims about its direction.

Hickey argues that passing the test "didn't hurt its investors' returns", which is to say there was no significant negative relationship, but neither was there a significant positive relationship: The model provides no evidence of a positive correlation between Bechdel scores and financial performance. However, Hickey switches gears an in the conclusions, writes:

...our data demonstrates that films containing meaningful interactions between women do better at the box office than movies that don’t...

I don't know what analysis supports this interpretation. The analysis Hickey just performed, again taking the findings at their face, concluded that "passing the Bechdel test did not have any effect on a film’s gross profits" not "passing the Bechdel test increased the film's profits." While Bayesians will cavil about frequentist assumptions --- as they are wont to do --- and the absence of evidence is not evidence of absence, the "Results Differ finding" is not empirically supported in any appropriate interpretation of the analysis. The appropriate conclusion from Hickey's analysis is "there no relationship between the Bechdel test and financial performance," which he makes... then ignores.

What to make of this analysis? In the next section, I summarize the findings of my own analysis of the same data. In the subsequent sections, I attempt to replicate the findings of this article, and in so doing, highlight the perils of reporting statistical findings without adhering to scientific norms.

The Clip: Look in here for what to tweet

I tried to retrieve and re-analyze the data that Hickey described in his article, but came to some conclusions that were the same, others that were very different, and still others that I hope are new.

In the absence of knowing the precise methods used but making reasnable assumptions of what was done, I was able to replicate some of his findings, but not others because specific decisions had to be made about the data or modeling that dramatically change the results of the statistical models. However, the article provides no specifics so we're left to wonder when and where these findings hold, which points to the need for openness in sharing data and code. Specifically, while Hickey found that women's representation in movies had no significant relationship on revenue, I found a positive and significant relationship.

But the questions and hypotheses Hickey posed about systematic biases in Hollywood were also the right ones. With a reanalysis using different methods as well as adding in new data, I found statistically significant differences in popular ratings also exist. These differences persist after controlling for each other and in the face of other potential explanations about differences arising because of genres, MPAA ratings, time, and other effects.

In the image below, we see that movies that have non-trivial women's roles get 24% lower budgets, make 55% more revenue, get better reviews from critics, and face harsher criticism from IMDB users. Bars that are faded out mean my models are less confident about these findings being non-random (higher p-values) while bars that are darker mean my models are more confident that this is a significant finding (lower p-values).

Movies passing the Bechdel test (the red bars):

  • ...receive budgets that are 24% smaller

  • ...make 55% more revenue

  • ...are awarded 1.8 more Metacritic points by professional reviewers

  • ...are awarded 0.12 fewer stars by IMDB's amateur reviewers

In [165]:

Start your kernels...

In the sections below, I provide all the code I used to scrape data from the websites Hickey mentioned as sources as well as some others I use to extend his analysis to include other control measures commonly used in statstical models of movie performance (for example, Nagle & Riedl's work on movie ratings). The goal here is to attempt to replicate Hickey's analyses using the same data and same methods he described. I want to emphasize three points ahead of time:

  1. What are the data? Through this process of obtaining the data, it becomes clear that the analyst quickly has to make choices about the types of data to be obtained. Does the data come from Table X or Table Y? Does budget data include marketing expenditures or not? Are these inflation-corrected real dollars or nominal dollars? They should be the same, but there are things missing in one that aren't missing in the other. These decisions are not documented in the article nor are these data made available, both of which makes it hard to determine what exactly is being measured.

  2. What are the variables? The article also creates variables such as "return on investment" and "gross profits" from other variables in the data. These data can be highly-skewed or contain missing data which can break some statistical assumptions -- how were these dealt with? The article doesn't actually say how these variables are constructed or where they come from, so I could be wrong but the data and tradition definitions of these variables present obvious candidates involving different relationships between the same two variables Budget (Expenses) and Income (Revenue). In creating a new variable that combines two other variables, the behavior of this new variable is intrinsically related to the other variables. This can become problematic very quickly, as we will see in the next step.

  3. What are the models? The article then performs analyses on these new and old variables using methods that assume they should be independent. As the previous bullet emphasizes, "return on investment" and "gross profits" are financial performance outcomes that already capture a relationship between Budget and Income. The model could measure these outcomes as a function of the Bechdel score alone. The model could also estimate basic Income as a function of Bechdel plus throw in Budget as a control. The model might still also could try to estimate the combined financial performance as a function of Bechdel controlling for Budget again, even though is already in the outcome. The write-up makes it seem like the last model is the one used, which is problematic.

1. What are the data?

In [2]:
import numpy as np
import pandas as pd
import seaborn as sb
import json,urllib2,string
from itertools import product
from matplotlib.collections import LineCollection
import statsmodels.formula.api as smf
from bs4 import BeautifulSoup
from scipy.stats import chisquare
from IPython.display import Image

Revenue data from

Revenue data taken from scraping

Hickey's article uses data about international receipts, which doesn't appear to be available publicly from this website, so I cannot replicate these analyses. Indeed, if provided Hickey with the data, it may very well be different in extent and detail than what is possible to access from their website.

In [ ]:
# This may take a bit of time
revenue_df_list = list()

for character in string.printable[1:36]:
    soup = BeautifulSoup(urllib2.urlopen('{0}'.format(character)).read())
    archive = pd.read_html(soup.findAll('table')[0].encode('latin1'),header=0,flavor='bs4',infer_types=False)[0]
    archive.columns = ['Released','Movie','Genre','Budget','Revenue','Trailer']
    archive['Released'] = pd.to_datetime(archive['Released'],format=u"%b\xa0%d,\xa0%Y",coerce=True)
    archive = archive.replace([u'\xa0',u'nan'],[np.nan,np.nan])
    archive['Budget'] = archive['Budget'].dropna().str.replace('$','').apply(lambda x:int(x.replace(',','')))
    archive['Revenue'] = archive['Revenue'].dropna().str.replace('$','').apply(lambda x:int(x.replace(',','')))
    archive['Year'] = archive['Released'].apply(lambda x:x.year)
numbers_df = pd.concat(revenue_df_list)
In [3]:
numbers_df = pd.read_csv('revenue.csv',encoding='utf8',index_col=0)
numbers_df['Released'] = pd.to_datetime(numbers_df['Released'],unit='D')
Released Movie Genre Budget Revenue Trailer Year
19967 1979-12-14 Zulu Dawn NaN NaN 0 NaN 1979
19968 2003-02-07 Zus & Zo NaN NaN 49468 NaN 2003
19969 2007-04-06 Zwartboek Thriller/Suspense 22000000 4398532 NaN 2007
19970 2014-02-28 Zwei Leben Drama NaN 39673 NaN 2014
19971 2006-02-24 Zyzzyx Rd. Thriller/Suspense NaN 20 NaN 2006

5 rows × 7 columns

Inflation data from BLS

Here I tried to write some code to get historical CPI data from the BLS API, but I couldn't get it to work. Instead, I downloaded the January monthly data from 1913 through 2014 for Series ID CUUR0000SA0 which is the "CPI - All Urban Consumers" and saved it as a CSV. The inflator dictionary provides a mapping that lets me to adjust the dollars reported in previous years for inflation: a dollar in 1913 bought 23.8 times more goods and services than a dollar in 2014.

In [4]:
cpi = pd.read_csv('cpi.csv',index_col='Year')
inflator = dict(cpi.ix[2014,'Annual']/cpi['Annual'])

Bechdel Test data from's API

Documentation here:

First, download a list of all movie IDs using the getAllMovieIds method. Inspect the first 5. Create a list of IMDB ids out of this list of dictionaries to pass to the API.

In [5]:
movie_ids = json.loads(urllib2.urlopen('').read())
imdb_ids = [movie[u'imdbid'] for movie in movie_ids]
print u"There are {0} movies in the Bechdel test corpus".format(len(imdb_ids))
There are 5062 movies in the Bechdel test corpus

Pull an example getMovieByImdbId from the API.

In [5]:
{u'date': u'2013-12-25 14:31:21',
 u'dubious': u'0',
 u'id': u'4982',
 u'imdbid': u'0000091',
 u'rating': u'0',
 u'submitterid': u'9025',
 u'title': u'House of the Devil, The',
 u'visible': u'1',
 u'year': u'1896'}

Pull all of the data we can down. This will take a very long time. We don't want to have to repeat that crawl again, so let's write the data to disk as bechdel.json

In [ ]:
ratings = dict()
exceptions = list()
for num,imdb_id in enumerate(imdb_ids):
        if num in range(0,len(imdb_ids),int(round(len(imdb_ids)/10,0))):
            print imdb_id
        ratings[imdb_id] = json.loads(urllib2.urlopen('{0}'.format(imdb_id)).read())

with open('bechdel.json', 'wb') as f:
  json.dump(ratings.values(), f)

Read the data back from disk and convert to ratings_df.

In [5]:
ratings_df = pd.read_json('bechdel.json')
ratings_df['imdbid'] = ratings_df['imdbid'].dropna().apply(int)
ratings_df = ratings_df.set_index('imdbid')
In [6]:
ratings_df2 = ratings_df[['rating','title']]
rating title
435528 2 Whisper
1456472 1 We Have a Pope
435680 3 Kidulthood
460721 3 Big Bad Swim, The
1571222 3 A Dangerous Method

5 rows × 2 columns

The field we want is rating which is defined as:

"The actual score. Number from 0 to 3.
0.0 means no two women,
1.0 means no talking, 
2.0 means talking about a man, 
3.0 means it passes the test."

Additional data from OMDBAPI

Pull an example from

In [20]:
{u'Actors': u"Jeanne d'Alcy, Georges M\xe9li\xe8s",
 u'Awards': u'N/A',
 u'Country': u'France',
 u'Director': u'Georges M\xe9li\xe8s',
 u'Genre': u'Short, Horror',
 u'Language': u'N/A',
 u'Metascore': u'N/A',
 u'Plot': u'A bat flies into an ancient castle and transforms itself into Mephistopheles himself. Producing a cauldron, Mephistopheles conjures up a young girl and various supernatural creatures, one ...',
 u'Poster': u'N/A',
 u'Rated': u'N/A',
 u'Released': u'N/A',
 u'Response': u'True',
 u'Runtime': u'3 min',
 u'Title': u'The House of the Devil',
 u'Type': u'movie',
 u'Writer': u'Georges M\xe9li\xe8s',
 u'Year': u'1896',
 u'imdbID': u'tt0000091',
 u'imdbRating': u'6.8',
 u'imdbVotes': u'768'}

Crawl the data. This will take a while. Write the data to disk when you're done as imdb_data.json.

In [ ]:
imdb_data = dict()
exceptions = list()
for num,imdb_id in enumerate(imdb_ids):
        if num in range(0,len(imdb_ids),int(round(len(imdb_ids)/10,0))):
            print imdb_id
        imdb_data[imdb_id] = json.loads(urllib2.urlopen('{0}'.format(imdb_id)).read())

with open('imdb_data.json', 'wb') as f:
  json.dump(imdb_data.values(), f)

Read the data back from disk and convert to imdb_df. Print out an example row, but transpose it because there are a lot of columns.

In [7]:
# Read the data in
imdb_df = pd.read_json('imdb_data.json')

# Drop non-movies
imdb_df = imdb_df[imdb_df['Type'] == 'movie']

# Convert to datetime objects
imdb_df['Released'] = pd.to_datetime(imdb_df['Released'], format="%d %b %Y", unit='D', coerce=True)

# Drop errant identifying characters in the ID field
imdb_df['imdbID'] = imdb_df['imdbID'].str.slice(start=2)

More cleanup happens below.

In [8]:
# Remove the " min" at the end of Runtime entries so we can convert to ints
imdb_df['Runtime'] = imdb_df['Runtime'].str.slice(stop=-4).replace('',np.nan)

# Some errant runtimes have "h" in them. Commented-out code below identifies them.
#s = imdb_df['Runtime'].dropna()

# Manually recode these h-containing Runtimes to minutes
imdb_df.ix[946,'Runtime'] = '169'
imdb_df.ix[1192,'Runtime'] = '96'
imdb_df.ix[1652,'Runtime'] = '80'
imdb_df.ix[2337,'Runtime'] = '87'
imdb_df.ix[3335,'Runtime'] = '62'

# Blank out non-MPAA or minor ratings (NC-17, X)
imdb_df['Rated'] = imdb_df['Rated'].replace(to_replace=['N/A','Not Rated','Approved','Unrated','TV-PG','TV-G','TV-14','TV-MA','NC-17','X'],value=np.nan)

# Convert Release dateimte into new columns for year, month, and week
imdb_df['Year'] = imdb_df['Released'].apply(lambda x:x.year)
imdb_df['Month'] = imdb_df['Released'].apply(lambda x:x.month)
imdb_df['Week'] = imdb_df['Released'].apply(lambda x:x.week)

# Convert the series to float
imdb_df['Runtime'] = imdb_df['Runtime'].apply(float)

# Take the imdbVotes formatted as string containing "N/A" and comma-delimited thousands, convert to float
imdb_df['imdbVotes'] = imdb_df['imdbVotes'].dropna().replace('N/A',np.nan).dropna().apply(lambda x:float(x.replace(',','')))

# Take the Metascore formatted as string containing "N/A", convert to float
# Also divide by 10 to make effect sizes more comparable
imdb_df['Metascore'] = imdb_df['Metascore'].dropna().replace('N/A',np.nan).dropna().apply(float)/10.

# Take the imdbRating formatted as string containing "N/A", convert to float
imdb_df['imdbRating'] = imdb_df['imdbRating'].dropna().replace('N/A',np.nan).dropna().apply(float)

# Create a dummy variable for English language
imdb_df['English'] = (imdb_df['Language']  == u'English').astype(int)
imdb_df['USA'] = (imdb_df['Country']  == u'USA').astype(int)

# Convert imdb_ID to int, set it as the index
imdb_df['imdbID'] = imdb_df['imdbID'].dropna().apply(int)
imdb_df = imdb_df.set_index('imdbID')

Join datasets together

In [9]:
df = imdb_df.join(ratings_df2,how='inner').reset_index()
df = pd.merge(df,numbers_df,left_on=['Title','Year'],right_on=['Movie','Year'])
df['Year'] = df['Released_x'].apply(lambda x:x.year)
df['Adj_Revenue'] = df.apply(lambda x:x['Revenue']*inflator[x['Year']],axis=1)
df['Adj_Budget'] = df.apply(lambda x:x['Budget']*inflator[x['Year']],axis=1)

2624 2625
imdbID 1531663 282771
Actors Will Ferrell, Christopher Jordan Wallace, Rebe... Om Puri, Aasif Mandvi, Ayesha Dharker, Jimi Mi...
Awards N/A 1 win.
Country USA UK, India, USA
Director Dan Rush Ismail Merchant
Error NaN NaN
Genre_x Comedy, Drama Comedy, Drama
Language English English
Metascore 6.5 6.4
Plot When an alcoholic relapses, causing him to los... In 1950s Trinidad, a frustrated writer support...
Rated R PG
Released_x 2011-10-14 00:00:00 2002-03-29 00:00:00
Response True True
Runtime 97 117
Title Everything Must Go The Mystic Masseur
Type movie movie
Writer Dan Rush, Raymond Carver (short story "Why Don... V.S. Naipaul (novel), Caryl Phillips (screenplay)
Year 2011 2002
imdbRating 6.5 5.9
imdbVotes 31814 393
Month 10 3
Week 41 13
English 1 1
USA 1 0
rating 1 2
title Everything Must Go Mystic Masseur, The
Released_y 2011-05-13 00:00:00 2002-05-03 00:00:00
Movie Everything Must Go The Mystic Masseur
Genre_y Drama NaN
Budget 5000000 NaN
Revenue 2712131 398610
Trailer Play NaN
Adj_Revenue 2880766 526489.3
Adj_Budget 5310889 NaN

35 rows × 2 columns

One question that both Hickey and others have posed is whether the Bechdel site data is biased in some way. We can do a quick check to see if the distributions of the movies in the joined dataset (where movies are only included if they appear across datasets) and the raw Bechdel data. There's clearly a bias towards the Bechdel community covering movies that pass the test and potentially omitting data about movies that do not. This is problematic, but unless we want to watch these movies and code them ourselves, there's not much to be done. But in terms of the data we use for our analysis, it doesn't appear to be the case that using the ...

In [10]:
representative_df = pd.DataFrame()

# Group data by scores
representative_df['All Bechdel'] = ratings_df2.groupby('rating')['title'].agg(len)
representative_df['Analysis Subset'] = df.groupby('rating')['title'].agg(len)

# Convert values of data to fractions of total
representative_df2 = representative_df.apply(lambda x: x/x.sum(),axis=0)

# Make the plot
plt.legend(fontsize=15,loc='lower right')
plt.yticks(plt.yticks()[0],["Fewer than two women",
                             "Women don't talk to each other",
                             'Women only talk about men',
                             'Passes Bechdel Test'
plt.xlabel('Fraction of movies',fontsize=18)
plt.title('Does the analyzed data differ from the original data?',fontsize=18)
<matplotlib.text.Text at 0x10f448cd0>

Performing a $\chi^2$-test to see if these differences in these frequencies are significant, we find they are (p-values reported below). This suggests that the distribution of movies across Bechdel categories in the combined data we analyze are significantly different than the raw data from the site -- or there's a risk that the day we're using in the full analysis isn't representative of the data on the Bechdel site itself. Alternatively, the Bechdel site's data may be biased away from the expected distribution. $\chi^2(3,n=4) = 27.78, p < 0.01$

In [30]:
# The observed data are the counts of the number of movies in each of the different categories for the Analysis subset.
# The expected data are the fractions of the number of movies in the original Bechdel data,
# multiplied by the number of movies in the Analysis subset.
chisquare(f_obs=representative_df['Analysis Subset'].as_matrix(),
          f_exp=representative_df2['All Bechdel'].as_matrix()*representative_df['Analysis Subset'].sum())
(27.784302084004732, 4.0310508706887412e-06)

2 . What are the variables?

Now that we have data in hand about movies, their financial performance, and their performance on the Bechdel test from the data sources that Hickey descibed in the article, we can attempt to replicate the specific variables used. Because the article does not make clear which data or definitions were used for making these constructs, we're left to infer what exactly Hickey did.

The article claims to use two specific outcomes: return on investment and gross profits which are different ways of accounting for the relationship between income and costs. I don't claim to be an accounting or business expert, but "gross profit" is traditionally defined as "income minus costs" ($P = I - C$) while "return on investment" (RoI) is traditionally defined as profits divided by assets ($ROI = P/A$). Thus we need at least three variables: income, costs, and assets.

However, only two of these are available in the movie financial data we obtained from income and costs. We can calculate profits easily, but it's unclear what a movie's assets are here unless we use costs again. Again, Hickey may have done something different for make this variable, but we don't know so we can't replicate.

In [36]:
df['Profit'] = df['Adj_Revenue'] - df['Adj_Budget']
df['ROI'] = df['Profit']/df['Adj_Budget']

3. What are the models?

Bechdel Test over time

We can do some exploratory analysis, First, we might want to plot the change in the average Bechdel score over time. Of course, the number of movies that have been released has also increased in recent years. So we'll change the size of the line by the (logged) number of movies released that year. We observe a general upward trend but a lot of noise in the early years owing the sparsity of the data.

In [37]:
# Do some fancy re-indexing and groupby operations to get average scores and number of movies
avg_bechdel = df.set_index('Released_x').groupby(lambda x:x.year)['rating'].agg(np.mean)
num_movies = df.set_index('Released_x').groupby(lambda x:x.year)['rating'].agg(lambda x:np.log(len(x)+1))

# Changing line widths from:
coords = sorted(dict(avg_bechdel).items(),key=lambda x:x[0])
lines = [(start, end) for start, end in zip(coords[:-1], coords[1:])]
lines = LineCollection(lines, linewidths=dict(num_movies).values())

# Plot the figure
fig, ax = plt.subplots()
plt.ylabel('Avg. Bechdel score',fontsize=18)
<matplotlib.text.Text at 0x10f7696d0>

We can also try to replicate the stacked bar chart that Hickey used to show the distributions of different classes of movies by decade. I find this harder to read and interpret, but I reproduce it below.

In [38]:
df_rating_ct = pd.crosstab(df['Year'],df['rating'])

# Bakes in a lot here and is bad Pythonic form -- but I'm lazy

df_rating_ct.groupby((df_rating_ct.index//10)*10).sum().apply(lambda x: x/x.sum(),axis=1).ix[1930:].plot(kind='bar',stacked=True)

# Fix up the plot
plt.ylabel('Percentage of movies',fontsize=18)
<matplotlib.legend.Legend at 0x10f7d72d0>

We can also estimate a statistical model to forecast future changes in the average Bechdel score over time. We observe a general upward trend in movies passing more of the Bechdel test and can try to extrapolate this going forward.

In [39]:
avg_bechdel = df.set_index('Released_x').groupby(lambda x:x.year)['rating'].agg(np.mean) = 'date' = 'bechdel'
avg_bechdel = avg_bechdel.reset_index()

l_model = smf.ols(formula='bechdel ~ date', data=avg_bechdel).fit()
#print l_model.summary()

# Create a DataFrame "X" that we can extrapolate into
X = pd.DataFrame({"date": np.linspace(start=1912.,stop=2100.,num=100)})

# Plot the observed data

# Plot the predictions from the model
plt.plot(X,l_model.predict(X),c='r',label='Linear model',lw=4)
plt.legend(loc='lower right')
plt.ylabel('Avg. Bechdel Test')

plt.ylabel('Avg. Bechdel Test',fontsize=18)
<matplotlib.text.Text at 0x10bef8e90>

Re-arrange the $y=Bx+I$ equation to solve for $y=3$: $x=(y-I)/B$, or in other words, find the time in the future when the average movie will pass the Bechdel test.

Then convert the float into a proper date and return the day of the week as well (Mondays are 0).

In [40]:
import datetime
year_est = (3 - l_model.params['Intercept'])/l_model.params['date']

year_int = int(year_est)

d = datetime.timedelta(days=(year_est - year_int)*365.25)

day_one = datetime.datetime(year_int,1,1)

date = d + day_one

print date, date.weekday()
2089-08-30 04:54:43.147324 1

Extrapolating this linear model forward in time, on Tuesday, August 30, 2089, the average movie will finally pass the Bechdel test. Just 75 years to go even the average summer blockbuster will have minimally-developed female characters! Hooray!

But there could also be other things going on in the model. Maybe the rate at which movies passing the Bechdel test is accelerating or decelerating instead of changing constantly. We estimate a model with a quadratic term on time below. This model returns a concave function which suggests we've apparently passed "Peak Bechdel" back in the 1990s and we're well on our way regressing towards a misogynist cultural dystopia in the future.

But we shouldn't put too much stock in such simple models -- especially when they make such divergent predictions about the 2090s. However, the lack of progress and perhaps the presence of even a slowdown in the improvement in Bechdel scores over time should be a cause for concern.

In [41]:
q_model = smf.ols(formula='bechdel ~ date + I(date**2)', data=avg_bechdel).fit()

# Create a DataFrame "X" that we can extrapolate into
X = pd.DataFrame({"date": np.linspace(start=1912.,stop=2112.,num=201)})

# Plot the observed data

# Plot the predictions from the model
plt.plot(X,q_model.predict(X),c='r',label='Quadratic model',lw=4)
plt.legend(loc='upper right')
plt.ylabel('Avg. Bechdel Test',fontsize=18)
<matplotlib.text.Text at 0x10bf4ddd0>

Budgets differ

As I discussed above, the claim that budgets for Bechdel-passing movies is relatively uncontroversial and given the description of the data in the article, should be easy to reproduce. Visualizing the distribution of the data, there are no strong differences that jump out but if you squint hard enough, you can make out a negative trend: movies passing the Bechdel test have lower budgets.

You'll notice each cloud of points around 0, 1, 2, and 3. These are simply "jittered" by adding a bit of normally-distributed errors to the plots (but not the underlying data we're estimating) to show the frequency of datapoints without them all sitting on top of each other.

In [42]:
x = df['rating'].apply(lambda x:float(x)+np.random.normal(0, 0.05))
y = df['Adj_Budget']

# Plot with an alpha so the overlaps reveal something about relative density
plt.ylabel('Budgets (2014 $)',fontsize=18)
plt.xlabel('Bechdel dimensions',fontsize=18)

We can also visualize the median budgets with a bar chart to see that movies featuring two women who don't talk to each other appear to have much larger budgets than the rest. Movies that pass the Bechdel test also appear to have slightly smaller budgets than movies that don't pass.

In [43]:
plt.yticks(plt.yticks()[0],["Fewer than two women",
                             "Women don't talk to each other",
                             'Women only talk about men',
                             'Passes Bechdel Test'
plt.xlabel('Budget (2014 $millions)',fontsize=18)
plt.title('Median Budget for Films',fontsize=24)
<matplotlib.text.Text at 0x10c9fa490>

In order to get a consistent look at budget data, we’re going to focus on films released from 1990 to 2013, since the data has significantly more depth since then... We rely on median, as opposed to average, budgets to minimize the effect of outliers, or in this case, huge blockbusters whose budgets are orders of magnitude larger than the typical movie’s budget. We ran a statistical test analyzing the inflation-adjusted median budgets of films, and found that films passing the Bechdet Test had a median budget that was 16 percent lower than the median budget of all films in the set.

With skewed data choosing medians doesn't always help, so I'll log-transform the data instead to enforce more normal distributions. We can quantitatively test this hypothesis by regressing budget against these categories. Indeed, the -0.2504 estimate for C(rating)[T.3.0] confirms the finding that movies passing all 3 Bechdel dimensions have significantly lower budgets than movies that pass none of the Bechdel dimensions. The model predicts that movies that fail every Bechdel dimension have budgets of \$13.6 million on average while movies that pass every Bechdel dimension have budgets of \$10.6 million on average --- a $3.0 million dollar or 22% difference.

But this simple model does a poor job explaining the distribution of data -- it explains 2.3% of the variance. Furthermore, we can also chalk this difference up to a number of other explanations as well: the Bechdel test may be standing in for differences in other variables such as genre, year, time of year, and composition of the cast and crew, among many possibilities. We'll explore some of these later.

In [44]:
bd_m = smf.ols(formula='log(Adj_Budget+1) ~ C(rating)', data=df).fit()
print bd_m.summary()

print "\n\nMovies passing every Bechdel dimension have budgets of ${:,} on average.".format(round(np.exp(bd_m.params['Intercept']+bd_m.params['C(rating)[T.3.0]']-1),2))
print "Movies failing every Bechdel dimension have budgets of ${:,} on average.".format(round(np.exp(bd_m.params['Intercept']-1),2))
                             OLS Regression Results                            
Dep. Variable:     log(Adj_Budget + 1)   R-squared:                       0.023
Model:                             OLS   Adj. R-squared:                  0.021
Method:                  Least Squares   F-statistic:                     12.35
Date:                 Mon, 07 Apr 2014   Prob (F-statistic):           5.52e-08
Time:                         09:19:18   Log-Likelihood:                -2632.4
No. Observations:                 1583   AIC:                             5273.
Df Residuals:                     1579   BIC:                             5294.
Df Model:                            3                                         
                       coef    std err          t      P>|t|      [95.0% Conf. Int.]
Intercept           17.4318      0.115    151.286      0.000        17.206    17.658
C(rating)[T.1.0]     0.1693      0.130      1.304      0.192        -0.085     0.424
C(rating)[T.2.0]    -0.3081      0.148     -2.076      0.038        -0.599    -0.017
C(rating)[T.3.0]    -0.2504      0.124     -2.026      0.043        -0.493    -0.008
Omnibus:                      420.224   Durbin-Watson:                   2.067
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1253.612
Skew:                          -1.339   Prob(JB):                    6.05e-273
Kurtosis:                       6.440   Cond. No.                         8.76

Movies passing every Bechdel dimension have budgets of $10,653,341.87 on average.
Movies failing every Bechdel dimension have budgets of $13,684,932.09 on average.

Earnings differ

Next, we turn our attention to the second claim the article makes that earnings differ. Let's plot the data first.

In [45]:
plt.yticks(plt.yticks()[0],["Fewer than two women",
                             "Women don't talk to each other",
                             'Women only talk about men',
                             'Passes Bechdel Test'
plt.xlabel('Return on Investment',fontsize=18)
plt.title('Median ROI for Films',fontsize=24)
<matplotlib.text.Text at 0x10ca04110>

It looks like movies satisfying 2 or more of the Bechdel provide much better ROI, but we'll need to run the model to see if these differences are significant.

...we ran a regression to find out if passing the Bechdel test corresponded to lower return on investment. Controlling for the movie’s budget, which has a negative and significant relationship to a film’s return on investment, passing the Bechdel test had no effect on the film’s return on investment.

Our model replicates the same findings described in the article: negative and significant relationship with budget and no significant relationship with ratings. Thus, what appears to be higher ROIs actually isn't a significant difference.

However, it's important to note that we could have estimated this model differently by coding the rating as a continuous variable or not logging the budget. The article isn't clear whether it does this, and while estimating such a "bad" model (ROI ~ rating + Budget) doesn't change the significance or direction of the findings, it does produce a much poorer model (explaining only 2% of the variance, versus the 12% below).

In [46]:
# Estimate the model
ed_m1 = smf.ols(formula='ROI ~ C(rating)  + log(Adj_Budget+1)', data=df).fit()
print ed_m1.summary()
                            OLS Regression Results                            
Dep. Variable:                    ROI   R-squared:                       0.117
Model:                            OLS   Adj. R-squared:                  0.114
Method:                 Least Squares   F-statistic:                     52.12
Date:                Mon, 07 Apr 2014   Prob (F-statistic):           2.83e-41
Time:                        09:19:19   Log-Likelihood:                -6221.7
No. Observations:                1583   AIC:                         1.245e+04
Df Residuals:                    1578   BIC:                         1.248e+04
Df Model:                           4                                         
                          coef    std err          t      P>|t|      [95.0% Conf. Int.]
Intercept              61.2394      4.380     13.980      0.000        52.647    69.832
C(rating)[T.1.0]        1.3111      1.254      1.045      0.296        -1.149     3.771
C(rating)[T.2.0]        2.4859      1.435      1.733      0.083        -0.328     5.300
C(rating)[T.3.0]        0.7976      1.195      0.667      0.505        -1.547     3.142
log(Adj_Budget + 1)    -3.4338      0.243    -14.128      0.000        -3.911    -2.957
Omnibus:                     2613.532   Durbin-Watson:                   2.029
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          1650679.978
Skew:                          10.671   Prob(JB):                         0.00
Kurtosis:                     159.750   Cond. No.                         248.

The article ran a second regression model using gross profits as an outcome.

we ran a regression to find out if passing the Bechdel test corresponded to having lower gross profits — domestic and international. Also controlling for the movie’s budget, which has a positive and significant relationship to a film’s gross profits, once again passing the Bechdel test did not have any effect on a film’s gross profits.

It's unclear why Hickey expected this outcome to differ given that it's another relationship between the original Budget and Revenue data. The article also mentions international receipts when the data available from only reports domestic receipts, so we can't replicate this finding.

We find, unsurprisingly, that the same results as above hold that Bechdel scores don't significantly influence Profit and Budget has a negative relationship with Profit.

In [47]:
plt.yticks(plt.yticks()[0],["Fewer than two women",
                             "Women don't talk to each other",
                             'Women only talk about men',
                             'Passes Bechdel Test'
plt.xlabel('Gross Profit',fontsize=18)
plt.title('Median ROI for Films',fontsize=24)

# Estimate the model
ed_m2 = smf.ols(formula='Profit ~ C(rating)  + log(Budget+1)', data=df).fit()
print ed_m2.summary()
                            OLS Regression Results                            
Dep. Variable:                 Profit   R-squared:                       0.012
Model:                            OLS   Adj. R-squared:                  0.009
Method:                 Least Squares   F-statistic:                     4.622
Date:                Mon, 07 Apr 2014   Prob (F-statistic):            0.00103
Time:                        09:19:19   Log-Likelihood:                -31704.
No. Observations:                1583   AIC:                         6.342e+04
Df Residuals:                    1578   BIC:                         6.344e+04
Df Model:                           4                                         
                       coef    std err          t      P>|t|      [95.0% Conf. Int.]
Intercept         1.957e+08   3.84e+07      5.099      0.000       1.2e+08  2.71e+08
C(rating)[T.1.0]  6.223e+06   1.23e+07      0.506      0.613     -1.79e+07  3.03e+07
C(rating)[T.2.0]  9.217e+06   1.41e+07      0.656      0.512     -1.83e+07  3.68e+07
C(rating)[T.3.0] -6.948e+05   1.17e+07     -0.059      0.953     -2.36e+07  2.22e+07
log(Budget + 1)  -8.921e+06   2.17e+06     -4.115      0.000     -1.32e+07 -4.67e+06
Omnibus:                     1394.494   Durbin-Watson:                   1.955
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            48699.636
Skew:                           4.044   Prob(JB):                         0.00
Kurtosis:                      28.941   Cond. No.                         219.

A different model

The model above inexplicably use "Budget" on both sides of the equation, which is a big no-no. Remember, we constructed ROI as $(Revenue - Budget)/Budget$ and Profit as $Revenue - Budget$ so in these models budget ends up being a function of itself.

What happens if we just leave Budget on the right side of the equation and simply estimate Revenue as a function of Bechdel rating and controlling for Budget?

We get very different findings. Now there is a significant and positive relationship between Budget and Revenue, as we'd expect. Furthermore, there is also a significant and positive relationship between Bechdel criteria and Revenue. Better roles for women translates into better revenue, even controlling for the fact that bigger budgets also create more revenue. This model also explains approximately 24% of the variance, versus 15% in the article's model suggesting it's doing a better job modeling the relationship of Bechdel test and financial performance.

In [48]:
# Make the bar-plot

plt.yticks(plt.yticks()[0],["Fewer than two women",
                             "Women don't talk to each other",
                             'Women only talk about men',
                             'Passes Bechdel Test'
plt.xlabel('Return on Investment',fontsize=18)
plt.title('Revenue for Films',fontsize=24)


# Estimate the model
ed_m3 = smf.ols(formula='log(Adj_Revenue+1) ~ C(rating)  + log(Adj_Budget+1)', data=df).fit()
print ed_m3.summary()
                             OLS Regression Results                             
Dep. Variable:     log(Adj_Revenue + 1)   R-squared:                       0.244
Model:                              OLS   Adj. R-squared:                  0.242
Method:                   Least Squares   F-statistic:                     127.2
Date:                  Mon, 07 Apr 2014   Prob (F-statistic):           3.12e-94
Time:                          09:19:20   Log-Likelihood:                -3338.6
No. Observations:                  1583   AIC:                             6687.
Df Residuals:                      1578   BIC:                             6714.
Df Model:                             4                                         
                          coef    std err          t      P>|t|      [95.0% Conf. Int.]
Intercept               1.8864      0.709      2.661      0.008         0.496     3.277
C(rating)[T.1.0]        0.2741      0.203      1.351      0.177        -0.124     0.672
C(rating)[T.2.0]        0.6096      0.232      2.626      0.009         0.154     1.065
C(rating)[T.3.0]        0.4483      0.193      2.318      0.021         0.069     0.828
log(Adj_Budget + 1)     0.8825      0.039     22.438      0.000         0.805     0.960
Omnibus:                     1650.678   Durbin-Watson:                   1.998
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           111167.694
Skew:                          -5.042   Prob(JB):                         0.00
Kurtosis:                      42.796   Cond. No.                         248.

However, this effect is pretty weak and the Budget variable does most of the heavy lifting in terms of explaining why this model does well. In the figure below, we plot the relationships between Revenue and Bechdel scores after controlling for Budget. The clouds of green dots are the residual variance in the data that's not explained by the Budget. If we estimate a model that tries to explain this residual variance using the Bechdel scores now, we see the small but positive trend in red. This is the effect of Bechdel-passing movies making more money.

In [49]:
# Estimate model without the Bechdel test IV and containing only the IVs to get residuals
ed_m3_bad_res = smf.ols(formula='log(Adj_Revenue+1) ~ log(Adj_Budget+1)', data=df[df['Adj_Revenue']>0]).fit()

# Put these residual estimates in a temporary DataFrame
df_temp = df
df_temp['residuals'] = ed_m3_bad_res.resid

# Estimate a bad model with a continuous Bechdel rating against the residuals
ed_m3_bad = smf.ols(formula='residuals ~ rating', data=df_temp).fit()
ed_m3_bad_x = pd.DataFrame({'rating':np.linspace(min(df['rating']),max(df['rating']),100)})

# Add a jitter so the points aren't overlapping
x = df['rating'].apply(lambda x:float(x)+np.random.normal(0, 0.05))
y = df_temp['residuals']


# Plot with an alpha so the overlaps reveal something about relative density
plt.xlabel('Bechdel score',fontsize=18)
plt.ylabel('Residual Adj. Revenue',fontsize=18)
plt.legend(loc='lower left')
<matplotlib.legend.Legend at 0x10ca1b4d0>

Summary of replication results

I summarize the findings of my replication attempts, alternative statistical analyses, as well as other exploratory findings:

  • Hickey found that the number of movies passing the Bechdel test has been increasing over time. I found the same, but the effect had slowed in recent years. Simplistic extrapolations suggest it may take until the end of the century until the average movie passes the Bechdel test or that we run the risk of regressing towards more backwards portrayals of women in film.
  • Hickey found that Bechdel movies had lower median budgets. I also found that passing the Bechdel test was correlated with significantly lower budgets.
  • Hickey found that Bechdel movies had no effect on the film's ROI, controlling for budget. I also found that there was no significant relationship between ROI and movies passing the Bechdel test.
  • Hickey found that Bechdel movies had no effect on the film's gross profits, controlling for budget. I also found that there was no significant relationship between Profit and movies passing the Bechdel test.
  • I noted that ROI and Profit are problematic variables for the modeling Hickey described, so I used a simpler model of Revenue alone but still controlling for budget. This model provided a better fit to the data and also found that Bechdel movies had significantly higher revenue.

BONUS! Relationship between Bechdel scores and ratings

If you've made it this far, I want to reward you with some new analysis rather than trying to replicate and criticizing others' models. Here I want to extend the bigger --- and very important --- question that Hickey was posing of how gender biases manifest in the movie business. In particular, I want to ask: are movies that pass the Bechdel test are systematically rated lower by fans and critics?

Here I draw on data from the OMDB that captures Metacritic scores (professional critics) as well as the ratings of IMDB users (crowd critics). We'll examine each of these in turn, starting with the Metacritic.

As before, we can plot the distribution of median Metacritic scores across the four Bechdel dimensions. This suggests that movies passing the Bechdel test are more strongly criticized and received lower ratings that movies that fail it.

In [50]:
# Make the bar-plot
plt.yticks(plt.yticks()[0],["Fewer than two women",
                             "Women don't talk to each other",
                             'Women only talk about men',
                             'Passes Bechdel Test'
plt.title('Mean Metascore',fontsize=24)
<matplotlib.text.Text at 0x10d3cb210>

We run an analogous statistical model to those above to estimate the relationship between Metascores and Bechdel ratings controlling for budget. There is a significant and negative relationship between budget and Metascore suggesting that blockbuster budgets, while they do translate into money for studios, don't translate into glowing reviews from critics. However, we can put away the pitchforks because although there is a negative relationship between Bechdel scores and MetaCritic scores, it is not statistically significant. In other words the differences we saw above can be chalked up to differences in the budgets of movies and other variability rather than systematic biases by critics.

In [51]:
m2 = smf.ols(formula='Metascore ~ C(rating) + log(Adj_Budget+1)', data=df).fit()
print m2.summary()
                            OLS Regression Results                            
Dep. Variable:              Metascore   R-squared:                       0.022
Model:                            OLS   Adj. R-squared:                  0.019
Method:                 Least Squares   F-statistic:                     7.607
Date:                Mon, 07 Apr 2014   Prob (F-statistic):           4.63e-06
Time:                        09:19:23   Log-Likelihood:                -2672.0
No. Observations:                1373   AIC:                             5354.
Df Residuals:                    1368   BIC:                             5380.
Df Model:                           4                                         
                          coef    std err          t      P>|t|      [95.0% Conf. Int.]
Intercept               9.2635      0.655     14.138      0.000         7.978    10.549
C(rating)[T.1.0]       -0.0015      0.187     -0.008      0.994        -0.368     0.365
C(rating)[T.2.0]       -0.2650      0.213     -1.242      0.214        -0.683     0.153
C(rating)[T.3.0]       -0.1679      0.178     -0.944      0.345        -0.517     0.181
log(Adj_Budget + 1)    -0.1947      0.036     -5.388      0.000        -0.266    -0.124
Omnibus:                       15.070   Durbin-Watson:                   1.927
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               10.743
Skew:                          -0.095   Prob(JB):                      0.00465
Kurtosis:                       2.611   Cond. No.                         253.

Next I examine whether there are systematic biases among the ratings of IMDB users in their assessments of movies that pass or fail the Bechdel test. As before, it appears that users are more critical of movies passing the Bechdel test than movies that fail it.

In [52]:
# Make the bar-plot
plt.yticks(plt.yticks()[0],["Fewer than two women",
                             "Women don't talk to each other",
                             'Women only talk about men',
                             'Passes Bechdel Test'
plt.title('Mean IMDB rating',fontsize=24)
<matplotlib.text.Text at 0x10d3eae90>

Here we run a model that predicts the IMDB rating as a combination of Bechdel rating, Budget, and now we add the number of votes received as well. This new variable might control for the fact that popular movies may receive more glowing reviews while less popular movies are subject to the whims of receiving votes from cinephiles who don't like anything but early 20th century European avant-garde experimental films.

The results show a significant and negative relationship between Bechdel score and popular rating: IMDB users appear to rate "feminist" films significantly lower than movies that have negligible female presence Keep in mind we've controlled for other possible explanations like Budget (more budget, significantly lower scores) and the number of votes a movie received (more votes, significantly higher scores) and the model explains approximately 30% of the variance in the data. Put another way, for two movies with the same budget and same number of votes, the movie that passes the Bechdel test will receive an IMDB score 0.25 below the movie that passes none of the Bechdel criteria.

In [53]:
m3 = smf.ols(formula='imdbRating ~ C(rating) + log(Adj_Budget+1) + log(imdbVotes+1)', data=df).fit()
print m3.summary()
                            OLS Regression Results                            
Dep. Variable:             imdbRating   R-squared:                       0.300
Model:                            OLS   Adj. R-squared:                  0.298
Method:                 Least Squares   F-statistic:                     134.9
Date:                Mon, 07 Apr 2014   Prob (F-statistic):          3.05e-119
Time:                        09:19:24   Log-Likelihood:                -1927.4
No. Observations:                1582   AIC:                             3867.
Df Residuals:                    1576   BIC:                             3899.
Df Model:                           5                                         
                          coef    std err          t      P>|t|      [95.0% Conf. Int.]
Intercept               5.1252      0.308     16.624      0.000         4.520     5.730
C(rating)[T.1.0]       -0.1312      0.083     -1.575      0.116        -0.295     0.032
C(rating)[T.2.0]       -0.1067      0.095     -1.119      0.263        -0.294     0.080
C(rating)[T.3.0]       -0.2477      0.080     -3.113      0.002        -0.404    -0.092
log(Adj_Budget + 1)    -0.2154      0.018    -12.230      0.000        -0.250    -0.181
log(imdbVotes + 1)      0.4978      0.020     25.044      0.000         0.459     0.537
Omnibus:                      102.934   Durbin-Watson:                   2.017
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              153.629
Skew:                          -0.530   Prob(JB):                     4.36e-34
Kurtosis:                       4.098   Cond. No.                         312.

DOUBLE SECRET BONUS! The kitchen sink model

In [54]:

At this point I've found some evidence that there are systematic differences how movies that pass or fail the Bechdel test have different budgets, different earnings, and different reviews. But I've also included only a handful of relevant variables to explain the relationship when there may be others that explain it instead. You may be thinking to yourself that X, Y, Z probably explains the patterns we're observing and these models just aren't capturing it. So let's throw everything and the kitchen sink into the model and see what happens.

The significant differences seen above in the IMDB scores seem especially problematic, so let's try to see if adding other variables makes this finding change in direction or significance. In addition to adding Revenue, Budget, Metascore, and number of IMDB votes, we add several new variables as controls that we hadn't included before. The result is a model that has the potential to be overfitted, but we're not interested in the combined model for any predictive purposes --- only when the estimate for the Bechdel dimensions has changed.

  • MPAA Rating. People dislike G-rated movies that happen to pass the Bechdel test more, perhaps.
  • Runtime. Instead of people hating "feminist" movies, maybe movies passing the Bechdel test are just longer and people don't like 2-hour marathons.
  • Genre. Maybe some genres like romantic comedies or dramas have an easier time passing the Bechdel test.
  • Year. There may be a nostalgia effect of movies in the past that pass the test being rated differently than movies released more recently that pass the test.
  • Week. Summer and holiday blockbusters are different animals than awards vehicles that are released in the fall and winter.
  • English language. "Seriously, who likes strong female leads and subtitles? Get me a Bud Light Lime and let's fire up Michael Bay's magnum opus Transformers!"
  • USA. As bad as it may be here, other countries may have it worse.

There are certainly others that also may explain patterns we see, some of which might be relevant such as the reputation of the cast or crewmembers or others that may not be relevant such as the fraction of seats Republicans hold in the House. Neither the relevant nor unlikely variables are present in the data above, but you're welcome to collect them and integrate it into the analysis below -- that's what's so great about making the models, data, and findings open for others!

In [55]:
m4 = smf.ols(formula='imdbRating ~ C(rating) + log(Adj_Revenue+1) + log(Adj_Budget+1) + Metascore + log(imdbVotes+1) + Runtime + C(Genre_y) + Year + C(Week) + C(English) + C(USA) + C(Rated)', data=df).fit()
print m4.summary()
                            OLS Regression Results                            
Dep. Variable:             imdbRating   R-squared:                       0.994
Model:                            OLS   Adj. R-squared:                  0.994
Method:                 Least Squares   F-statistic:                     2827.
Date:                Mon, 07 Apr 2014   Prob (F-statistic):               0.00
Time:                        09:19:27   Log-Likelihood:                -1018.5
No. Observations:                1335   AIC:                             2189.
Df Residuals:                    1259   BIC:                             2584.
Df Model:                          76                                         
                                        coef    std err          t      P>|t|      [95.0% Conf. Int.]
Intercept                            19.0301      4.094      4.648      0.000        10.999    27.062
C(rating)[T.1.0]                     -0.0272      0.062     -0.440      0.660        -0.149     0.094
C(rating)[T.2.0]                     -0.0204      0.072     -0.284      0.777        -0.162     0.121
C(rating)[T.3.0]                     -0.1245      0.061     -2.034      0.042        -0.245    -0.004
C(Genre_y)[T.Adventure]               0.1224      0.063      1.944      0.052        -0.001     0.246
C(Genre_y)[T.Black Comedy]            0.1418      0.131      1.081      0.280        -0.116     0.399
C(Genre_y)[T.Comedy]                  0.0472      0.055      0.863      0.388        -0.060     0.155
C(Genre_y)[T.Concert/Performance]  7.664e-14   1.64e-14      4.675      0.000      4.45e-14  1.09e-13
C(Genre_y)[T.Documentary]             0.5056      0.391      1.293      0.196        -0.261     1.273
C(Genre_y)[T.Drama]                   0.2554      0.058      4.436      0.000         0.142     0.368
C(Genre_y)[T.Horror]                 -0.1923      0.070     -2.732      0.006        -0.330    -0.054
C(Genre_y)[T.Musical]                 0.0938      0.133      0.706      0.480        -0.167     0.354
C(Genre_y)[T.Romantic Comedy]         0.1290      0.079      1.627      0.104        -0.027     0.285
C(Genre_y)[T.Thriller/Suspense]       0.0999      0.060      1.671      0.095        -0.017     0.217
C(Genre_y)[T.Western]                 0.2121      0.177      1.197      0.232        -0.136     0.560
C(Week)[T.2]                          0.5191      0.237      2.191      0.029         0.054     0.984
C(Week)[T.3]                          0.4848      0.242      2.004      0.045         0.010     0.959
C(Week)[T.4]                          0.6735      0.236      2.852      0.004         0.210     1.137
C(Week)[T.5]                          0.6091      0.224      2.721      0.007         0.170     1.048
C(Week)[T.6]                          0.4566      0.219      2.083      0.037         0.027     0.887
C(Week)[T.7]                          0.5184      0.215      2.407      0.016         0.096     0.941
C(Week)[T.8]                          0.6153      0.231      2.661      0.008         0.162     1.069
C(Week)[T.9]                          0.5156      0.224      2.302      0.021         0.076     0.955
C(Week)[T.10]                         0.6411      0.225      2.847      0.004         0.199     1.083
C(Week)[T.11]                         0.6942      0.221      3.134      0.002         0.260     1.129
C(Week)[T.12]                         0.5089      0.218      2.329      0.020         0.080     0.938
C(Week)[T.13]                         0.4294      0.222      1.931      0.054        -0.007     0.866
C(Week)[T.14]                         0.4885      0.219      2.235      0.026         0.060     0.917
C(Week)[T.15]                         0.4384      0.226      1.937      0.053        -0.006     0.882
C(Week)[T.16]                         0.6169      0.232      2.656      0.008         0.161     1.073
C(Week)[T.17]                         0.5584      0.236      2.362      0.018         0.095     1.022
C(Week)[T.18]                         0.6164      0.238      2.593      0.010         0.150     1.083
C(Week)[T.19]                         0.4574      0.225      2.032      0.042         0.016     0.899
C(Week)[T.20]                         0.3126      0.236      1.323      0.186        -0.151     0.776
C(Week)[T.21]                         0.4790      0.213      2.245      0.025         0.060     0.898
C(Week)[T.22]                         0.3992      0.220      1.812      0.070        -0.033     0.832
C(Week)[T.23]                         0.3518      0.221      1.590      0.112        -0.082     0.786
C(Week)[T.24]                         0.4425      0.222      1.996      0.046         0.008     0.877
C(Week)[T.25]                         0.4382      0.215      2.043      0.041         0.017     0.859
C(Week)[T.26]                         0.2977      0.215      1.383      0.167        -0.125     0.720
C(Week)[T.27]                         0.4669      0.214      2.180      0.029         0.047     0.887
C(Week)[T.28]                         0.4005      0.225      1.778      0.076        -0.041     0.842
C(Week)[T.29]                         0.6816      0.215      3.164      0.002         0.259     1.104
C(Week)[T.30]                         0.1955      0.214      0.912      0.362        -0.225     0.616
C(Week)[T.31]                         0.4792      0.219      2.192      0.029         0.050     0.908
C(Week)[T.32]                         0.5554      0.218      2.547      0.011         0.128     0.983
C(Week)[T.33]                         0.4002      0.216      1.853      0.064        -0.024     0.824
C(Week)[T.34]                         0.4123      0.226      1.826      0.068        -0.031     0.855
C(Week)[T.35]                         0.4659      0.231      2.018      0.044         0.013     0.919
C(Week)[T.36]                         0.5719      0.247      2.315      0.021         0.087     1.057
C(Week)[T.37]                         0.5011      0.222      2.261      0.024         0.066     0.936
C(Week)[T.38]                         0.5194      0.215      2.412      0.016         0.097     0.942
C(Week)[T.39]                         0.7210      0.211      3.412      0.001         0.306     1.136
C(Week)[T.40]                         0.6133      0.219      2.806      0.005         0.184     1.042
C(Week)[T.41]                         0.5021      0.217      2.317      0.021         0.077     0.927
C(Week)[T.42]                         0.5390      0.211      2.554      0.011         0.125     0.953
C(Week)[T.43]                         0.3747      0.222      1.691      0.091        -0.060     0.809
C(Week)[T.44]                         0.5368      0.212      2.532      0.011         0.121     0.953
C(Week)[T.45]                         0.4429      0.217      2.039      0.042         0.017     0.869
C(Week)[T.46]                         0.5039      0.227      2.220      0.027         0.059     0.949
C(Week)[T.47]                         0.5634      0.211      2.667      0.008         0.149     0.978
C(Week)[T.48]                         0.4132      0.236      1.754      0.080        -0.049     0.875
C(Week)[T.49]                         0.4696      0.221      2.123      0.034         0.036     0.904
C(Week)[T.50]                         0.3296      0.215      1.529      0.126        -0.093     0.752
C(Week)[T.51]                         0.5439      0.210      2.587      0.010         0.131     0.956
C(Week)[T.52]                         0.6576      0.217      3.034      0.002         0.232     1.083
C(Week)[T.53]                     -5.274e-17   7.33e-17     -0.719      0.472     -1.97e-16  9.11e-17
C(English)[T.1]                      -0.0113      0.032     -0.352      0.725        -0.074     0.052
C(USA)[T.1]                          -0.0094      0.032     -0.290      0.772        -0.073     0.054
C(Rated)[T.PG]                       -0.0466      0.109     -0.430      0.668        -0.260     0.166
C(Rated)[T.PG-13]                    -0.1838      0.110     -1.665      0.096        -0.400     0.033
C(Rated)[T.R]                        -0.1111      0.113     -0.978      0.328        -0.334     0.112
log(Adj_Revenue + 1)                 -0.0557      0.014     -3.974      0.000        -0.083    -0.028
log(Adj_Budget + 1)                  -0.0810      0.019     -4.184      0.000        -0.119    -0.043
Metascore                             0.2885      0.011     25.339      0.000         0.266     0.311
log(imdbVotes + 1)                    0.3533      0.021     17.205      0.000         0.313     0.394
Runtime                               0.0069      0.001      6.804      0.000         0.005     0.009
Year                                 -0.0083      0.002     -4.100      0.000        -0.012    -0.004
Omnibus:                      135.731   Durbin-Watson:                   1.999
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              400.944
Skew:                          -0.518   Prob(JB):                     8.63e-88
Kurtosis:                       5.477   Cond. No.                          nan

[1] The smallest eigenvalue is -3.98e-14. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

Remarkably, even after controlling for all these other potential explanations for why IMDB voters might dislike movies that pass the Bechdel, the model still has a significant and negative estimate for the Bechdel test variable (C(rating)[T.3.0]): movies that pass the Bechdel test receive IMDB scores 0.12 below similar movies that pass none of the Bechdel criteria. The significance of this finding also means that the chances of this finding be attributable to randomness is less than 5%. Given everything that we did to try to erase this pattern of IMDB users systematically awarding lower scores to movies passing the Bechdel test, the persistence of this significant and negative finding suggests that there's something to the argument that the IMDB audience is biased against movies with "non-trivial" female roles.

But we can run the same kitchen sink model on the relationship between revenue and the Bechdel test. Indeed, the significant and positive finding remains here: movies that pass the Bechdel movie make 43% more money controlling for all the other explanatory factors than movies that do not pass the Bechdel test. Indeed, even for movies that don't pass all of its dimensions, a movie passing 2 makes 34% more and a movie passing 1 makes 21% more than movies passing none.

In [66]:
m5 = smf.ols(formula='Metascore ~ C(rating) + log(Adj_Revenue+1) + log(Adj_Budget+1) + imdbRating + log(imdbVotes+1) + Runtime + C(Genre_y) + Year + C(Week) + C(English) + C(USA) + C(Rated)', data=df).fit()
m6 = smf.ols(formula='log(Adj_Revenue+1) ~ C(rating) + Metascore + log(Adj_Budget+1) + imdbRating + log(imdbVotes+1) + Runtime + C(Genre_y) + Year + C(Week) + C(English) + C(USA) + C(Rated)', data=df).fit()
m7 = smf.ols(formula='log(Adj_Budget+1) ~ C(rating) + Metascore + log(Adj_Revenue+1) + imdbRating + log(imdbVotes+1) + Runtime + C(Genre_y) + Year + C(Week) + C(English) + C(USA) + C(Rated)', data=df).fit()

print "Difference in IMDB scores between movies that pass all and fail all requirements of the Bechdel test: {0}.".format(round(m4.params['C(rating)[T.3.0]'],2))
print "Difference in Metascores between movies that pass all and fail all requirements of the Bechdel test: {0}.".format(round(m5.params['C(rating)[T.3.0]']*10,2))
print "Difference in revenue between movies that pass all and fail all requirements of the Bechdel test: {0}%.".format(round(100*(np.exp(m6.params['C(rating)[T.3.0]']) - 1),2))
print "Difference in budget between movies that pass all and fail all requirements of the Bechdel test: {0}%.".format(round(100*(np.exp(m7.params['C(rating)[T.3.0]']) - 1),2))

results_df = pd.DataFrame({'m4_params':m4.params[['C(rating)[T.1.0]','C(rating)[T.2.0]','C(rating)[T.3.0]']],
results_df.index = ["Women don't talk to each other",
                    'Women only talk about men',
                    'Passes Bechdel Test']

ax = results_df[['m4_params','m5_params','m6_params','m7_params']].T.plot(kind='barh')

# Change the alpha channel to correspond with the p-values
# More transparent bars are less significant results
for i, (j, k) in enumerate(product(range(3), range(3))):
    patch = ax.patches[i]
    alpha = 1 - results_df[['m4_pvalues','m5_pvalues','m6_pvalues','m7_pvalues']].T.iloc[j, k]
    patch.set_alpha(max(alpha, .2))
plt.yticks(plt.yticks()[0],['IMDB Ratings','Metascore','Revenue','Budget'],fontsize=18)
plt.xlabel('Effect size compared to movies failing\nall elements of the Bechdel test',fontsize=18)
plt.title("Effects of women's roles in movies",fontsize=24)
Difference in IMDB scores between movies that pass all and fail all requirements of the Bechdel test: -0.12.
Difference in Metascores between movies that pass all and fail all requirements of the Bechdel test: 1.87.
Difference in revenue between movies that pass all and fail all requirements of the Bechdel test: 54.49%.
Difference in budget between movies that pass all and fail all requirements of the Bechdel test: -24.0%.


The four main findings from this analysis of the effects of women's roles in movies are summarized in the chart above. Even after controlling for everything in the "kitchen sink" model, compared to similar movies that pass none of the requirements, movies passing the Bechdel test (the red bars):

  • Receive budgets that are 24% smaller

  • Make 55% more revenue

  • Are awarded 1.8 more Metacritic points by professional reviewers

  • Are awarded 0.12 fewer stars by IMDB's amateur reviewers

These four points point to a paradox in which movies that pass an embarrassingly low bar for female character development make more money and are rated more highly by critics, but have to deal with lower budgets and more critical community responses. Is this definitive evidence of active discrimination in the film industry and culture? No, but it suggests systemic prejudices are contributing to producers irrationally ignoring significant evidence that "feminist" films make them more money and earn higher praise.

The data that I used here was scraped from public-facing websites, but there may be reasons to think that these data are inaccurate by those who are more familiar with how they're generated. Similarly, the models I used here are simple Stats 101 ordinary least squares regression models with some minor changes to account for categorical variables and skewed data. There are no Bayesian models, no cross-validation or bootstrapping, and no exotic machine learning methods here. But in making the data available (or at least the process for replicating how I obtained my own data), others are welcome to perform and share the results such analyses --- and this is ultimately my goal of asking data journalism to adopt the norms of open collaboration. When other people take their methodological hammers or other datasets and still can't break the finding, we have greater confidence that the finding is "real".

But the length and technical complexity of this post also raise the question of, who is the audience for this kind of work? Journalistic norms emphasize quick summaries turned around rapidly with opaque discussions of methods and analysis and making definitive claims. Scientific norms emphasize more deliberative and transparent processes that prize abstruse discussions and tentative claims about their "truth". I am certainly not saying that Hickey should have the output of regression models in his article --- 99% of people won't care to see that. But in the absence of soliciting peer reviews of this research, how are we as analysts, scientists, and journalists to evaluate the validity of the claims unless the code and data are made available for others to inspect? Even this is a higher bar than many scientific publications hold their authors to (and I'm certainly guilty of not doing more to make my own code and data available), but it should be the standard, especially for a genre of research like data journalism where the claims reach such large audiences.

However, there are exciting technologies for supporting this kind of open documentation and collaboration. I used an "open notebook" technology called IPython Notebook to write this post in such a way that the text, code, and figures I generated are all stitched together into one file. You're likely reading this post on a website that lets you view any such Notebook on the web where other developers and researchers share code about how to do all manner of data analysis. Unfortunately, this was intended as a word processing or blogging tool, so the the lack of features such as more dynamic layout options or spell-checking will frustrate many journalists (apologies for the typos!). However, there are tools for customizing the CSS so that it plays well (see here and here). The code and data are hosted on GitHub, which is traditionally used for software collaboration, but its features for others to discuss problems in my analysis (issue tracker) or propose changes to my code (pull requests) promote critique, deliberation, and improvement. I have no idea how these will work in the context of a journalistic project, and to be honest, I've never used them before, but I'd love to try and see what breaks.

Realistically, practices only change if there are incentives to do so. Academic scientists aren't awarded tenure on the basis of writing well-trafficed blogs or high-quality Wikipedia articles, they are promoted for publishing rigorous research in competitive, peer-reviewed outlets. Likewise, journalists aren't promoted for providing meticulously-documented supplemental material or replicating other analyses instead of contributing to coverage of a major news event. Amidst contemporary anxieties about information overload as well as the weaponization of fear, uncertainty, and doubt tactics, data-driven journalism could serve a crucial role in empirically grounding our discussions of policies, economic trends, and social changes. But unless the new leaders set and enforce standards that emulate the scientific community's norms, this data-driven journalism risks falling into traps that can undermine the public's and scientific community's trust.

This suggests several models going forward:

  • Open data. Data-driven journalists could share their code and data on open source repositories like GitHub for others to inspect, replicate, and extend. But as any data librarian will rush to tell you, there are non-trivial standards for ensuring the documentation, completeness, formatting, and non-proprietaryness of data.
  • Open collaboration. Journalists could collaborate with scientists and analysts to pose questions that they jointly analyze and then write up as articles or features as well as submitting for academic peer review. But peer review takes time and publishing results in advance of this review, even working with credentialed experts, doesn't imply their reliability.
  • Open deliberation. Organizations that practice data-driven journalism (to the extent this is different from other flavors of journalism) should invite and provide empirical critiques of their analyses and findings. Making well-documented data available or finding the right experts to collaborate with are extremely time-intensive, but if you're going to publish original empirical research, you should accept and respond to legitimate critiques.
  • Data omsbudsmen. Data-driven news organizations might consider appointing independent advocates to represent public interests and promote scientific norms of communalism, skepticism, and empirical rigor. Such a position would serve as a check against authors making sloppy claims, using improper methods, analyzing proprietary data, or acting for their personal benefit.

I have very much enjoyed thinking through many of these larger issues and confronting the challenges of the critiques I've raised. I look forward to your feedback and I very hope this drives conversations about what kind of science data-driven journalism hopes to become.