The Internet Movie Database (IMDb) is a website that serves as an online database of world cinema. It contains a large amount of public data on films: title, year of release, genre, audience rating, critic rating, duration, summary, actors, directors and much more. Given how much data is available on this site, I thought it would be interesting to analyze the movies listed on IMDb between 2000 and 2017.
This notebook analyses the movie data through a variety of charts and gives an interpretation of each.
Python stack: NumPy, Matplotlib, Pandas and Seaborn
On the IMDb website, it is possible to filter the searches, and thus to display all the movies for one year, such as the year 2017. For example, the first page of all 2017 IMDb movies is available under the following URL:
http://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1
I started by listing all the data available on this page, understanding their meaning, and above all thinking about a way to retrieve them from IMDb. Once I had inventoried the data and understood each item, I moved on to data selection, that is, choosing the data I want to keep for my data science study.
Here are the data I want to keep:
These data now have to be retrieved for all the films released between 2000 and 2017. My knowledge of HTML, CSS and JavaScript helped me a lot to find a way to do this automatically. Like any website, the IMDb pages are made of HTML, CSS and JavaScript. I therefore had to parse the HTML, extract only the relevant data found between certain HTML tags, and repeat this over several pages for every year from 2000 to 2017.
So I developed a Python script using the BeautifulSoup library, which parses HTML. I limited the parsing to 8 pages per year: starting with the year 2000, the script retrieves the data from 8 pages, then repeats the same step for the following year, up to 2017. This technique is known as web scraping.
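The crawl order described above (8 pages for each year from 2000 to 2017) can be sketched as a list of URLs to visit. This is only an illustration built on the URL pattern quoted earlier, not the original script: the function name `build_search_urls` and its parameters are my own assumptions, and nothing is downloaded here.

```python
from urllib.parse import urlencode

# URL pattern quoted earlier in this notebook.
BASE_URL = "http://www.imdb.com/search/title"


def build_search_urls(first_year=2000, last_year=2017, pages_per_year=8):
    """Return one search URL per (year, page) pair, in crawl order.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    urls = []
    for year in range(first_year, last_year + 1):
        for page in range(1, pages_per_year + 1):
            query = urlencode(
                {"release_date": year, "sort": "num_votes,desc", "page": page},
                safe=",",  # keep the comma of "num_votes,desc" unescaped
            )
            urls.append(f"{BASE_URL}?{query}")
    return urls


urls = build_search_urls()
print(len(urls))   # 18 years x 8 pages
print(urls[0])
```

Each URL would then be fetched and handed to the HTML parser, with a pause between requests.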
In my Python script, I send an HTTP GET request to the IMDb site at regular intervals to retrieve each page. Before launching the script, I looked again at the movie list on the IMDb website and realized that some data are missing. Some movies have, for example, no gross, no votes or no duration. Since there are a lot of movies, other data are probably missing too, so running the script as it was would have produced a dataset with missing values.
I considered several solutions to this missing-value problem:
I opted for the first solution and updated my Python script so that, during parsing, it skips the movies with missing data. I then ran the script and obtained the dataset for 2000 to 2017. From this dataset, I built a dataframe and saved it as a CSV file named 'dataframeMovies.csv'.
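As a rough sketch of this step, the snippet below drops incomplete records before building the dataframe and exporting it to CSV. The column names match the dataset shown in this notebook, but the helper `is_complete` and the second record are hypothetical and stand in for whatever the parser actually returned.

```python
import pandas as pd

# Column names follow the dataset shown in this notebook.
REQUIRED_FIELDS = ["audienceRating", "Genre", "criticRating", "timeMin",
                   "grossMillions", "Movie", "Vote", "Year"]


def is_complete(record):
    """True when every required field was found for this movie (hypothetical helper)."""
    return all(record.get(field) is not None for field in REQUIRED_FIELDS)


# Illustrative parser output: the second record is invented and lacks two fields.
scraped = [
    {"audienceRating": 8.5, "Genre": "Action", "criticRating": 67, "timeMin": 155,
     "grossMillions": 187.71, "Movie": "Gladiator", "Vote": 1096457, "Year": 2000},
    {"audienceRating": 6.2, "Genre": "Drama", "criticRating": None, "timeMin": 95,
     "grossMillions": None, "Movie": "Hypothetical incomplete entry", "Vote": 1200, "Year": 2000},
]

complete = [record for record in scraped if is_complete(record)]
movies = pd.DataFrame(complete, columns=REQUIRED_FIELDS)

# With no path, to_csv returns the CSV text instead of writing a file.
csv_text = movies.to_csv()
print(csv_text)
```

Passing a file name such as 'dataframeMovies.csv' to `to_csv` writes the file instead of returning a string.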
I loaded the CSV file 'dataframeMovies.csv' with the Pandas library and checked the contents of the dataset with functions such as info(), describe() and head().
import pandas as pd
import os
dataMoviesFull = pd.read_csv('dataframeMovies.csv')
len(dataMoviesFull)
4583
My dataset contains 4583 rows.
dataMoviesFull.head(10)
| | Unnamed: 0 | audienceRating | Genre | criticRating | timeMin | grossMillions | Movie | Vote | Year |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 8.5 | Action | 67 | 155 | 187.71 | Gladiator | 1096457 | 2000 |
1 | 1 | 8.5 | Mystery | 80 | 113 | 25.54 | Memento | 942923 | 2000 |
2 | 2 | 8.3 | Comedy | 55 | 104 | 30.33 | Snatch - Tu braques ou tu raques | 663244 | 2000 |
3 | 3 | 8.3 | Drama | 68 | 102 | 3.64 | Requiem for a Dream | 640047 | 2000 |
4 | 4 | 7.4 | Action | 64 | 104 | 157.30 | X-Men | 498388 | 2000 |
5 | 5 | 7.8 | Adventure | 73 | 143 | 233.63 | Seul au monde | 435798 | 2000 |
6 | 6 | 7.6 | Crime | 64 | 102 | 15.07 | American Psycho | 396271 | 2000 |
7 | 7 | 7.2 | Drama | 62 | 106 | 95.01 | Incassable | 284747 | 2000 |
8 | 8 | 7.0 | Comedy | 73 | 108 | 166.24 | Mon beau-père et moi | 277723 | 2000 |
9 | 9 | 6.1 | Action | 59 | 123 | 215.41 | M-I:2 Mission: Impossible 2 | 263188 | 2000 |
The first 10 rows are displayed above.
The 'Unnamed: 0' column only duplicates the index, so I delete it.
#delete the useless column
del dataMoviesFull['Unnamed: 0']
dataMoviesFull.head(10)
| | audienceRating | Genre | criticRating | timeMin | grossMillions | Movie | Vote | Year |
---|---|---|---|---|---|---|---|---|
0 | 8.5 | Action | 67 | 155 | 187.71 | Gladiator | 1096457 | 2000 |
1 | 8.5 | Mystery | 80 | 113 | 25.54 | Memento | 942923 | 2000 |
2 | 8.3 | Comedy | 55 | 104 | 30.33 | Snatch - Tu braques ou tu raques | 663244 | 2000 |
3 | 8.3 | Drama | 68 | 102 | 3.64 | Requiem for a Dream | 640047 | 2000 |
4 | 7.4 | Action | 64 | 104 | 157.30 | X-Men | 498388 | 2000 |
5 | 7.8 | Adventure | 73 | 143 | 233.63 | Seul au monde | 435798 | 2000 |
6 | 7.6 | Crime | 64 | 102 | 15.07 | American Psycho | 396271 | 2000 |
7 | 7.2 | Drama | 62 | 106 | 95.01 | Incassable | 284747 | 2000 |
8 | 7.0 | Comedy | 73 | 108 | 166.24 | Mon beau-père et moi | 277723 | 2000 |
9 | 6.1 | Action | 59 | 123 | 215.41 | M-I:2 Mission: Impossible 2 | 263188 | 2000 |
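The 'Unnamed: 0' column removed above is what pandas produces when a CSV written with its index (the default of `to_csv`) is read back without saying that the first column is the index. The small round trip below, on a two-row sample taken from the table above, shows the effect and two ways to avoid it; it is a sketch, not the code that produced 'dataframeMovies.csv'.

```python
import io
import pandas as pd

sample = pd.DataFrame({"Movie": ["Gladiator", "Memento"], "Year": [2000, 2000]})

# Default index=True writes the row labels as an unnamed first column.
csv_with_index = sample.to_csv()
reloaded = pd.read_csv(io.StringIO(csv_with_index))
print(list(reloaded.columns))  # ['Unnamed: 0', 'Movie', 'Year']

# Option 1: tell read_csv that the first column is the index.
reloaded_indexed = pd.read_csv(io.StringIO(csv_with_index), index_col=0)

# Option 2: do not write the index in the first place.
csv_without_index = sample.to_csv(index=False)
reloaded_clean = pd.read_csv(io.StringIO(csv_without_index))
print(list(reloaded_clean.columns))  # ['Movie', 'Year']
```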
max(dataMoviesFull.grossMillions)
936.66
dataMoviesFull[(dataMoviesFull['grossMillions'] == 936.66)]
| | audienceRating | Genre | criticRating | timeMin | grossMillions | Movie | Vote | Year |
---|---|---|---|---|---|---|---|---|
3900 | 8.1 | Action | 81 | 136 | 936.66 | Star Wars: Episode VII - Le réveil de la Force | 711233 | 2015 |
max(dataMoviesFull.audienceRating)
9.0
dataMoviesFull[(dataMoviesFull['audienceRating'] == 9.0)]
| | audienceRating | Genre | criticRating | timeMin | grossMillions | Movie | Vote | Year |
---|---|---|---|---|---|---|---|---|
769 | 9.0 | Documentary | 80 | 235 | 0.00 | The Century of the Self | 3680 | 2002 |
2081 | 9.0 | Action | 82 | 152 | 534.86 | The Dark Knight: Le chevalier noir | 1865768 | 2008 |
dataMoviesFull.columns
Index(['audienceRating', 'Genre', 'criticRating', 'timeMin', 'grossMillions', 'Movie', 'Vote', 'Year'], dtype='object')
dataMoviesFull.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4583 entries, 0 to 4582
Data columns (total 8 columns):
audienceRating    4583 non-null float64
Genre             4583 non-null object
criticRating      4583 non-null int64
timeMin           4583 non-null int64
grossMillions     4583 non-null float64
Movie             4583 non-null object
Vote              4583 non-null int64
Year              4583 non-null int64
dtypes: float64(2), int64(4), object(2)
memory usage: 286.5+ KB
dataMoviesFull.describe()
| | audienceRating | criticRating | timeMin | grossMillions | Vote | Year |
---|---|---|---|---|---|---|
count | 4583.000000 | 4583.000000 | 4583.000000 | 4583.000000 | 4.583000e+03 | 4583.000000 |
mean | 6.510081 | 55.227144 | 107.127646 | 37.234787 | 8.450176e+04 | 2008.365263 |
std | 0.993184 | 17.947463 | 18.810458 | 66.157452 | 1.334644e+05 | 5.102369 |
min | 1.500000 | 5.000000 | 48.000000 | 0.000000 | 2.760000e+03 | 2000.000000 |
25% | 5.900000 | 42.000000 | 94.000000 | 0.500000 | 1.428100e+04 | 2004.000000 |
50% | 6.600000 | 56.000000 | 104.000000 | 10.910000 | 3.541600e+04 | 2008.000000 |
75% | 7.200000 | 69.000000 | 116.000000 | 45.060000 | 9.317800e+04 | 2013.000000 |
max | 9.000000 | 100.000000 | 366.000000 | 936.660000 | 1.865768e+06 | 2017.000000 |
# convert Movie and Genre to the category dtype so their values can be counted and grouped
dataMoviesFull.Movie = dataMoviesFull.Movie.astype('category')
dataMoviesFull.Genre = dataMoviesFull.Genre.astype('category')
dataMoviesFull.Year = dataMoviesFull.Year.astype('int64')
dataMoviesFull.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4583 entries, 0 to 4582
Data columns (total 8 columns):
audienceRating    4583 non-null float64
Genre             4583 non-null category
criticRating      4583 non-null int64
timeMin           4583 non-null int64
grossMillions     4583 non-null float64
Movie             4583 non-null category
Vote              4583 non-null int64
Year              4583 non-null int64
dtypes: category(2), float64(2), int64(4)
memory usage: 424.7 KB
dataMoviesFull.Genre.cat.categories
Index(['Action', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Family', 'Fantasy', 'Horror', 'Music', 'Mystery', 'Romance', 'Sci', 'Thriller', 'War', 'Western'], dtype='object')
dataMoviesFull.Genre.cat.categories[0]
'Action'
# subsetting the dataframe
dataMovies_2000_2005_genre1 = dataMoviesFull.loc[(dataMoviesFull.Year >= 2000) & (dataMoviesFull.Year <= 2005) & \
((dataMoviesFull.Genre == 'Action') | (dataMoviesFull.Genre == 'Adventure') | \
(dataMoviesFull.Genre == 'Animation') | (dataMoviesFull.Genre == 'Biography') | \
(dataMoviesFull.Genre == 'Comedy') | (dataMoviesFull.Genre == 'Crime'))]
dataMovies_2000_2005_genre2 = dataMoviesFull.loc[(dataMoviesFull.Year >= 2000) & (dataMoviesFull.Year <= 2005) & \
((dataMoviesFull.Genre == 'Documentary') | (dataMoviesFull.Genre == 'Drama') | \
(dataMoviesFull.Genre == 'Family') | (dataMoviesFull.Genre == 'Fantasy') | \
(dataMoviesFull.Genre == 'Horror') | (dataMoviesFull.Genre == 'Music'))]
dataMovies_2000_2005_genre3 = dataMoviesFull.loc[(dataMoviesFull.Year >= 2000) & (dataMoviesFull.Year <= 2005) & \
((dataMoviesFull.Genre == 'Mystery') | (dataMoviesFull.Genre == 'Romance') | \
(dataMoviesFull.Genre == 'Sci') | (dataMoviesFull.Genre == 'Thriller') | \
(dataMoviesFull.Genre == 'War') | (dataMoviesFull.Genre == 'Western'))]
# subsetting the dataframe
dataMovies_2006_2011_genre1 = dataMoviesFull.loc[(dataMoviesFull.Year >= 2006) & (dataMoviesFull.Year <= 2011) & \
((dataMoviesFull.Genre == 'Action') | (dataMoviesFull.Genre == 'Adventure') | \
(dataMoviesFull.Genre == 'Animation') | (dataMoviesFull.Genre == 'Biography') | \
(dataMoviesFull.Genre == 'Comedy') | (dataMoviesFull.Genre == 'Crime'))]
dataMovies_2006_2011_genre2 = dataMoviesFull.loc[(dataMoviesFull.Year >= 2006) & (dataMoviesFull.Year <= 2011) & \
((dataMoviesFull.Genre == 'Documentary') | (dataMoviesFull.Genre == 'Drama') | \
(dataMoviesFull.Genre == 'Family') | (dataMoviesFull.Genre == 'Fantasy') | \
(dataMoviesFull.Genre == 'Horror') | (dataMoviesFull.Genre == 'Music'))]
dataMovies_2006_2011_genre3 = dataMoviesFull.loc[(dataMoviesFull.Year >= 2006) & (dataMoviesFull.Year <= 2011) & \
((dataMoviesFull.Genre == 'Mystery') | (dataMoviesFull.Genre == 'Romance') | \
(dataMoviesFull.Genre == 'Sci') | (dataMoviesFull.Genre == 'Thriller') | \
(dataMoviesFull.Genre == 'War') | (dataMoviesFull.Genre == 'Western'))]
# subsetting the dataframe
dataMovies_2012_2017_genre1 = dataMoviesFull.loc[(dataMoviesFull.Year >= 2012) & (dataMoviesFull.Year <= 2017) & \
((dataMoviesFull.Genre == 'Action') | (dataMoviesFull.Genre == 'Adventure') | \
(dataMoviesFull.Genre == 'Animation') | (dataMoviesFull.Genre == 'Biography') | \
(dataMoviesFull.Genre == 'Comedy') | (dataMoviesFull.Genre == 'Crime'))]
dataMovies_2012_2017_genre2 = dataMoviesFull.loc[(dataMoviesFull.Year >= 2012) & (dataMoviesFull.Year <= 2017) & \
((dataMoviesFull.Genre == 'Documentary') | (dataMoviesFull.Genre == 'Drama') | \
(dataMoviesFull.Genre == 'Family') | (dataMoviesFull.Genre == 'Fantasy') | \
(dataMoviesFull.Genre == 'Horror') | (dataMoviesFull.Genre == 'Music'))]
dataMovies_2012_2017_genre3 = dataMoviesFull.loc[(dataMoviesFull.Year >= 2012) & (dataMoviesFull.Year <= 2017) & \
((dataMoviesFull.Genre == 'Mystery') | (dataMoviesFull.Genre == 'Romance') | \
(dataMoviesFull.Genre == 'Sci') | (dataMoviesFull.Genre == 'Thriller') | \
(dataMoviesFull.Genre == 'War') | (dataMoviesFull.Genre == 'Western'))]
dataMovies_2012_2017_genre2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 439 entries, 3101 to 4579
Data columns (total 8 columns):
audienceRating    439 non-null float64
Genre             439 non-null category
criticRating      439 non-null int64
timeMin           439 non-null int64
grossMillions     439 non-null float64
Movie             439 non-null category
Vote              439 non-null int64
Year              439 non-null int64
dtypes: category(2), float64(2), int64(4)
memory usage: 221.7 KB
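The nine subsets above repeat the same filter with different years and genres. One possible way to write it more compactly is with `Series.between` and `Series.isin`, as sketched below. The names `GENRE_GROUPS`, `PERIODS` and `subset` are my own, and the sample frame reuses three rows from the head output plus one invented row so that a later period is not empty.

```python
import pandas as pd

# Same three groups of six genres and three periods as in the cells above.
GENRE_GROUPS = {
    "genre1": ["Action", "Adventure", "Animation", "Biography", "Comedy", "Crime"],
    "genre2": ["Documentary", "Drama", "Family", "Fantasy", "Horror", "Music"],
    "genre3": ["Mystery", "Romance", "Sci", "Thriller", "War", "Western"],
}
PERIODS = [(2000, 2005), (2006, 2011), (2012, 2017)]


def subset(df, first_year, last_year, genres):
    """Rows whose Year lies in [first_year, last_year] and whose Genre is in genres."""
    mask = df["Year"].between(first_year, last_year) & df["Genre"].isin(genres)
    return df.loc[mask]


# Small sample; the last row is hypothetical.
sample = pd.DataFrame({
    "Movie": ["Gladiator", "Memento", "Requiem for a Dream", "Hypothetical 2013 drama"],
    "Genre": ["Action", "Mystery", "Drama", "Drama"],
    "Year": [2000, 2000, 2000, 2013],
})

subsets = {
    (first, last, name): subset(sample, first, last, genres)
    for first, last in PERIODS
    for name, genres in GENRE_GROUPS.items()
}
print(subsets[(2000, 2005, "genre1")])
```

On the full dataset, `subsets[(2012, 2017, "genre2")]` would play the role of `dataMovies_2012_2017_genre2`.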
To sum up, here are the data that I use:
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
dataMovies_2000_2005_genre1.Genre.cat.categories[0:6]
Index(['Action', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime'], dtype='object')
j1 = sns.jointplot(data=dataMoviesFull, x='criticRating', y='audienceRating')
We see a high concentration of points along a straight line, which means that in most cases the audience ratings agree with the critic ratings. Audience ratings are concentrated between 5/10 and 8/10 and critic ratings between 30/100 and 80/100, which confirms that the two are consistent for most movies.
However, for some movies the public disagrees with the critics. For example, some movies have audience ratings between 1/10 and 3/10 while their critic ratings are between 40/100 and 60/100. For other films, the audience ratings are between 4/10 and 7/10 while the critic ratings are between 20/100 and 50/100.
From this graph, we can conclude that the public often appreciates the movies and generally gives a score above 5/10, while the critics are more severe: once both are put on the same scale, critic ratings are often lower than audience ratings. The describe() output above shows the same gap, with a mean audience rating of 6.51/10 against a mean critic rating of 55.2/100.
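The visual reading above can be backed by two numbers: the Pearson correlation between the two ratings, and the mean gap once audience ratings are rescaled from /10 to /100. The sketch below uses only the ten rows shown in the head(10) output, so its results describe that sample and not the full dataset.

```python
import pandas as pd

# First ten rows of the dataset, copied from the head(10) output above.
sample = pd.DataFrame({
    "audienceRating": [8.5, 8.5, 8.3, 8.3, 7.4, 7.8, 7.6, 7.2, 7.0, 6.1],
    "criticRating": [67, 80, 55, 68, 64, 73, 64, 62, 73, 59],
})

# Pearson correlation: close to 1 means the two ratings move together.
pearson_r = sample["criticRating"].corr(sample["audienceRating"])

# Put audience ratings on the critics' 0-100 scale before comparing levels.
audience_on_100 = sample["audienceRating"] * 10
mean_gap = (audience_on_100 - sample["criticRating"]).mean()

print(f"Pearson r = {pearson_r:.2f}")
print(f"Mean audience minus critic rating = {mean_gap:.1f} points")
```

A positive mean gap means the public is, on average, more generous than the critics. Running the same two lines on `dataMoviesFull` would give the figures for all 4583 movies.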
k1 = sns.jointplot(data=dataMoviesFull, x='criticRating', y='audienceRating', kind='hex')
On this graph, we can see the linear relationship between the audience ratings and the critic ratings.
After searching the dataset, we can determine the movies most appreciated by the public and by the critics. The movies best rated by the public between 2000 and 2017 are:
The movie best rated by the critics is:
j2 = sns.jointplot(data=dataMoviesFull, y='audienceRating', x='timeMin')
On this graph, we see that most movies last between 60 and 120 minutes. This is where most of the ratings are concentrated, between 4/10 and 8/10, with a majority above 6/10.
Films that last more than 3 hours (180 minutes) are generally appreciated by the public, with scores mostly above 7/10. The longest film lasts 366 minutes, i.e. 6 hours and 6 minutes, and has a score of 8.5/10. A search in the dataset shows that it is the drama “Our best years”, released in 2003.
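A search like the one mentioned above can be done with `idxmax`, which returns the index label of the largest value in a column. The sketch below runs on four rows copied from the head output, so the longest film it finds is the longest of that sample only.

```python
import pandas as pd

# Four rows copied from the head(10) output above.
sample = pd.DataFrame({
    "Movie": ["Gladiator", "Memento", "Requiem for a Dream", "X-Men"],
    "timeMin": [155, 113, 102, 104],
    "audienceRating": [8.5, 8.5, 8.3, 7.4],
})

# Row of the longest film in the sample.
longest = sample.loc[sample["timeMin"].idxmax()]

# Convert minutes to hours and minutes with integer division.
hours, minutes = divmod(int(longest["timeMin"]), 60)
print(f"{longest['Movie']}: {hours} h {minutes} min")

# nlargest gives the top-n rows directly, here the two longest films.
top2 = sample.nlargest(2, "timeMin")
print(top2[["Movie", "timeMin"]])
```

The same `divmod` conversion applied to 366 minutes gives 6 hours and 6 minutes.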
j3 = sns.jointplot(data=dataMoviesFull, y='criticRating', x='timeMin')
On this graph, we note that the critic ratings are concentrated on films lasting between 60 and 120 minutes, where they vary between 10/100 and 98/100.
j4 = sns.jointplot(data=dataMoviesFull, y='grossMillions', x='audienceRating')
On this chart, the movies that generated the most millions of dollars are mostly movies that were well rated by the public. This is plausible: people who enjoyed a movie talk about it, which encourages others to go to the cinema to see it and thus increases its gross. Audience ratings are concentrated between 5/10 and 8/10.
Movies that last a long time usually have a high score given by the audience.
In the dataset, the movie that brought in the most money is “Star Wars: Episode VII – The Force Awakens”, released in 2015, with 936.66 million dollars.
j5 = sns.jointplot(data=dataMoviesFull, x='criticRating', y='grossMillions')
In this graph, the critic ratings are concentrated between 30/100 and 80/100, which suggests that critics are more demanding towards films than the public. We also note that the films with high critic ratings are often those that brought in a lot of money.
j6 = sns.jointplot(data=dataMoviesFull, x='audienceRating', y='Vote')
j7 = sns.jointplot(data=dataMoviesFull, x='timeMin', y='grossMillions')
On this graph, we notice that movies lasting between 60 and 150 minutes (2h30) are the ones that bring in the most money. On the other hand, very long movies, exceeding 3 hours, gross much less, that is, under one million dollars.
We deduce that a director should avoid making a film that lasts 3 hours or more, and should keep the running time between 1 hour and 2 hours 30 minutes so that the audience does not get tired during the screening.
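To check this reading with numbers rather than by eye, one option is to group the movies into duration bands with `pd.cut` and compare the count and median gross of each band. The band edges below (60, 150 and 180 minutes) are taken from the text above and the band labels are my own. The sample is the ten rows of the head output, so it contains no film over 3 hours; on the full dataset the count column would also show how few such films support the conclusion.

```python
import pandas as pd

# Durations and grosses copied from the head(10) output above.
sample = pd.DataFrame({
    "timeMin": [155, 113, 104, 102, 104, 143, 102, 106, 108, 123],
    "grossMillions": [187.71, 25.54, 30.33, 3.64, 157.30, 233.63, 15.07, 95.01, 166.24, 215.41],
})

# Right-closed bands: (0, 60], (60, 150], (150, 180], (180, inf].
bins = [0, 60, 150, 180, float("inf")]
labels = ["<=60", "61-150", "151-180", ">180"]
sample["durationBand"] = pd.cut(sample["timeMin"], bins=bins, labels=labels)

# observed=True keeps only the bands that actually contain movies.
summary = (
    sample.groupby("durationBand", observed=True)["grossMillions"]
    .agg(["count", "median"])
)
print(summary)
```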
fig, axes = plt.subplots(2, 3)
fig.set_size_inches(12, 8)
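# distplot is deprecated since seaborn 0.11; histplot with kde=True draws a comparable histogram and density curve
m1 = sns.histplot(dataMoviesFull.audienceRating, bins=15, kde=True, stat='density', ax=axes[0, 0])
m2 = sns.histplot(dataMoviesFull.criticRating, bins=15, kde=True, stat='density', ax=axes[0, 1])
m3 = sns.histplot(dataMoviesFull.timeMin, bins=15, kde=True, stat='density', ax=axes[0, 2])
m4 = sns.histplot(dataMoviesFull.grossMillions, bins=15, kde=True, stat='density', ax=axes[1, 0])
m5 = sns.histplot(dataMoviesFull.Vote, bins=15, kde=True, stat='density', ax=axes[1, 1])
m6 = sns.histplot(dataMoviesFull.Year, bins=15, kde=True, stat='density', ax=axes[1, 2])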
#plt.setp(axes, yticks=[])
plt.tight_layout()
# subsetting the dataframe
dataMovies_2000_2017_genre1 = dataMoviesFull.loc[(dataMoviesFull.Genre == 'Action') | (dataMoviesFull.Genre == 'Adventure') | \
(dataMoviesFull.Genre == 'Animation') | (dataMoviesFull.Genre == 'Biography') | \
(dataMoviesFull.Genre == 'Comedy') | (dataMoviesFull.Genre == 'Crime')]
dataMovies_2000_2017_genre1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3206 entries, 0 to 4582
Data columns (total 8 columns):
audienceRating    3206 non-null float64
Genre             3206 non-null category
criticRating      3206 non-null int64
timeMin           3206 non-null int64
grossMillions     3206 non-null float64
Movie             3206 non-null category
Vote              3206 non-null int64
Year              3206 non-null int64
dtypes: category(2), float64(2), int64(4)
memory usage: 381.1 KB
dataMovies_2000_2017_genre1.Genre.cat.categories
Index(['Action', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Family', 'Fantasy', 'Horror', 'Music', 'Mystery', 'Romance', 'Sci', 'Thriller', 'War', 'Western'], dtype='object')
# remove all unused categories
dataMovies_2000_2017_genre1.Genre = dataMovies_2000_2017_genre1.Genre.cat.remove_unused_categories()
dataMovies_2000_2017_genre1.Genre.cat.categories
Index(['Action', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime'], dtype='object')
dataMovies_2000_2017_genre1.head()
| | audienceRating | Genre | criticRating | timeMin | grossMillions | Movie | Vote | Year |
---|---|---|---|---|---|---|---|---|
0 | 8.5 | Action | 67 | 155 | 187.71 | Gladiator | 1096457 | 2000 |
2 | 8.3 | Comedy | 55 | 104 | 30.33 | Snatch - Tu braques ou tu raques | 663244 | 2000 |
4 | 7.4 | Action | 64 | 104 | 157.30 | X-Men | 498388 | 2000 |
5 | 7.8 | Adventure | 73 | 143 | 233.63 | Seul au monde | 435798 | 2000 |
6 | 7.6 | Crime | 64 | 102 | 15.07 | American Psycho | 396271 | 2000 |
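As the outputs above show, filtering a categorical column keeps all the original categories until `remove_unused_categories()` is called. The sketch below reproduces this on four rows from the head output. It also takes an explicit `.copy()` of the filtered rows before assigning to the column, which is my own addition: it avoids the SettingWithCopyWarning that pandas can raise when a column is assigned on a slice of another dataframe.

```python
import pandas as pd

# Four rows copied from the head(10) output above.
sample = pd.DataFrame({
    "Movie": ["Gladiator", "Memento", "Snatch - Tu braques ou tu raques", "Requiem for a Dream"],
    "Genre": ["Action", "Mystery", "Comedy", "Drama"],
})
sample["Genre"] = sample["Genre"].astype("category")

group1 = ["Action", "Adventure", "Animation", "Biography", "Comedy", "Crime"]

# Explicit copy so that the assignment below modifies an independent dataframe.
subset = sample.loc[sample["Genre"].isin(group1)].copy()
categories_before = list(subset["Genre"].cat.categories)
print(categories_before)  # still lists the genres that were filtered out

subset["Genre"] = subset["Genre"].cat.remove_unused_categories()
categories_after = list(subset["Genre"].cat.categories)
print(categories_after)   # only the genres present in the subset
```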
dataMovies_2000_2017_genre2 = dataMoviesFull.loc[(dataMoviesFull.Genre == 'Documentary') | (dataMoviesFull.Genre == 'Drama') | \
(dataMoviesFull.Genre == 'Family') | (dataMoviesFull.Genre == 'Fantasy') | \
(dataMoviesFull.Genre == 'Horror') | (dataMoviesFull.Genre == 'Music')]
dataMovies_2000_2017_genre2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1329 entries, 3 to 4579
Data columns (total 8 columns):
audienceRating    1329 non-null float64
Genre             1329 non-null category
criticRating      1329 non-null int64
timeMin           1329 non-null int64
grossMillions     1329 non-null float64
Movie             1329 non-null category
Vote              1329 non-null int64
Year              1329 non-null int64
dtypes: category(2), float64(2), int64(4)
memory usage: 273.0 KB
dataMovies_2000_2017_genre2.Genre.cat.categories
Index(['Action', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Family', 'Fantasy', 'Horror', 'Music', 'Mystery', 'Romance', 'Sci', 'Thriller', 'War', 'Western'], dtype='object')
# remove all unused categories
dataMovies_2000_2017_genre2.Genre = dataMovies_2000_2017_genre2.Genre.cat.remove_unused_categories()
dataMovies_2000_2017_genre2.Genre.cat.categories
Index(['Documentary', 'Drama', 'Family', 'Fantasy', 'Horror', 'Music'], dtype='object')
dataMovies_2000_2017_genre2.head()
| | audienceRating | Genre | criticRating | timeMin | grossMillions | Movie | Vote | Year |
---|---|---|---|---|---|---|---|---|
3 | 8.3 | Drama | 68 | 102 | 3.64 | Requiem for a Dream | 640047 | 2000 |
7 | 7.2 | Drama | 62 | 106 | 95.01 | Incassable | 284747 | 2000 |
15 | 7.1 | Horror | 49 | 109 | 39.24 | Pitch Black | 204849 | 2000 |
18 | 6.7 | Horror | 36 | 98 | 53.33 | Destination finale | 196668 | 2000 |
20 | 8.1 | Drama | 83 | 154 | 5.38 | Amours chiennes | 190248 | 2000 |
dataMovies_2000_2017_genre3 = dataMoviesFull.loc[(dataMoviesFull.Genre == 'Mystery') | (dataMoviesFull.Genre == 'Romance') | \
(dataMoviesFull.Genre == 'Sci') | (dataMoviesFull.Genre == 'Thriller') | \
(dataMoviesFull.Genre == 'War') | (dataMoviesFull.Genre == 'Western')]
dataMovies_2000_2017_genre3.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 48 entries, 1 to 4571
Data columns (total 8 columns):
audienceRating    48 non-null float64
Genre             48 non-null category
criticRating      48 non-null int64
timeMin           48 non-null int64
grossMillions     48 non-null float64
Movie             48 non-null category
Vote              48 non-null int64
Year              48 non-null int64
dtypes: category(2), float64(2), int64(4)
memory usage: 199.2 KB
dataMovies_2000_2017_genre3.Genre.cat.categories
Index(['Action', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Family', 'Fantasy', 'Horror', 'Music', 'Mystery', 'Romance', 'Sci', 'Thriller', 'War', 'Western'], dtype='object')
# remove all unused categories
dataMovies_2000_2017_genre3.Genre = dataMovies_2000_2017_genre3.Genre.cat.remove_unused_categories()
dataMovies_2000_2017_genre3.Genre.cat.categories
Index(['Mystery', 'Romance', 'Sci', 'Thriller', 'War', 'Western'], dtype='object')
dataMovies_2000_2017_genre3.head()
| | audienceRating | Genre | criticRating | timeMin | grossMillions | Movie | Vote | Year |
---|---|---|---|---|---|---|---|---|
1 | 8.5 | Mystery | 80 | 113 | 25.54 | Memento | 942923 | 2000 |
184 | 7.1 | Romance | 78 | 135 | 3.04 | Chez les heureux du monde | 6641 | 2000 |
223 | 6.1 | Romance | 35 | 93 | 0.28 | Peines d'amour perdues | 4019 | 2000 |
326 | 6.6 | Mystery | 75 | 97 | 21.97 | Une virée en enfer | 56964 | 2001 |
477 | 7.3 | War | 56 | 112 | 0.26 | Tmavomodrý svet | 4941 | 2001 |
Because of the large amount of data, I divided my dataset into 3 sub-datasets of 6 genres each, since the whole dataset contains 18 genres.
The genres of movies are:
I thus obtain three histograms, one per group of 6 genres.
sns.set()
#2000-2017
grossByGenre_2000_2017_genre1 = list()
grossByGenre_2000_2017_genre2 = list()
grossByGenre_2000_2017_genre3 = list()
labels_2000_2017_genre1 = list()
labels_2000_2017_genre2 = list()
labels_2000_2017_genre3 = list()
#2000-2017
#2000-2017 genre1
for dataGenre_2000_2017_genre1 in dataMovies_2000_2017_genre1.Genre.cat.categories[0:6]:
    grossByGenre_2000_2017_genre1.append(dataMovies_2000_2017_genre1[dataMovies_2000_2017_genre1.Genre == \
                                         dataGenre_2000_2017_genre1].grossMillions)
    labels_2000_2017_genre1.append(dataGenre_2000_2017_genre1)
#2000-2017 genre2
for dataGenre_2000_2017_genre2 in dataMovies_2000_2017_genre2.Genre.cat.categories[0:6]:
    grossByGenre_2000_2017_genre2.append(dataMovies_2000_2017_genre2[dataMovies_2000_2017_genre2.Genre == \
                                         dataGenre_2000_2017_genre2].grossMillions)
    labels_2000_2017_genre2.append(dataGenre_2000_2017_genre2)
#2000-2017 genre3
for dataGenre_2000_2017_genre3 in dataMovies_2000_2017_genre3.Genre.cat.categories[0:6]:
    grossByGenre_2000_2017_genre3.append(dataMovies_2000_2017_genre3[dataMovies_2000_2017_genre3.Genre == \
                                         dataGenre_2000_2017_genre3].grossMillions)
    labels_2000_2017_genre3.append(dataGenre_2000_2017_genre3)
# 1 row, 3 columns
fig, ax = plt.subplots(1,3)
fig.set_size_inches(15, 3) # width and height in inches
#2000-2017
ax[0].hist(grossByGenre_2000_2017_genre1, bins=30, stacked=True, rwidth=1, label=labels_2000_2017_genre1)
ax[1].hist(grossByGenre_2000_2017_genre2, bins=30, stacked=True, rwidth=1, label=labels_2000_2017_genre2)
ax[2].hist(grossByGenre_2000_2017_genre3, bins=30, stacked=True, rwidth=1, label=labels_2000_2017_genre3)
ax[0].set(xlabel='GrossMillions 2000-2017', ylabel='Number of movies')
ax[1].set(xlabel='GrossMillions 2000-2017')
ax[2].set(xlabel='GrossMillions 2000-2017')
plt.suptitle('Movie Gross Distribution: 2000-2017')
for i in range(3):
    ax[i].legend()
plt.show()
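The three loops that fill the `grossByGenre_...` and `labels_...` lists differ only in the dataframe they read. A possible simplification, sketched below, is a single helper that returns both lists for any subset. The name `gross_by_genre` is my own, and the sample reuses five rows from the head output.

```python
import pandas as pd


def gross_by_genre(df):
    """Return (list of gross Series, list of genre labels), one entry per category.

    Hypothetical helper; it assumes df.Genre is categorical with unused
    categories already removed, as in the cells above.
    """
    genres = list(df["Genre"].cat.categories)
    series = [df.loc[df["Genre"] == genre, "grossMillions"] for genre in genres]
    return series, genres


# Five rows copied from the head(10) output above.
sample = pd.DataFrame({
    "Movie": ["Gladiator", "Snatch - Tu braques ou tu raques", "X-Men",
              "Mon beau-père et moi", "M-I:2 Mission: Impossible 2"],
    "Genre": pd.Categorical(["Action", "Comedy", "Action", "Comedy", "Action"]),
    "grossMillions": [187.71, 30.33, 157.30, 166.24, 215.41],
})

gross_series, gross_labels = gross_by_genre(sample)
print(gross_labels)
print([len(s) for s in gross_series])
```

The two returned lists can be passed directly to `ax.hist(gross_series, bins=30, stacked=True, label=gross_labels)`, once per group of genres.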