#!/usr/bin/env python
# coding: utf-8

# # The Evolution of Cinema: IMDb Movie Analysis from 2000 to 2020

# This project analyzes IMDb-listed movies released between 2000 and 2020 to uncover key trends, patterns, and insights within the film industry over two decades. By examining attributes such as genre, duration, language, actors, directors, IMDb ratings, and votes, this analysis seeks to reveal popular genres, shifts in movie durations, audience engagement patterns, and the rise of different languages and genres. Through data exploration and visualization, we aim to understand how the film landscape has evolved and what factors contribute to high ratings and audience interest. This project will provide valuable insights into the changing dynamics of the movie industry and viewer preferences over time.

# In[3]:


# Import the libraries used in this project.
import pandas as pd
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
get_ipython().run_line_magic('matplotlib', 'inline')


# In[4]:


# Import the data
data = pd.read_csv('IMDB Movies 2000 - 2020.csv')


# In[5]:


data.head()


# In[6]:


data.info()


# In[7]:


# This dataset is already cleaned and transformed, so we skip the data cleaning process.


# In[92]:


# Every data analysis starts with questions, so we jump directly to answering them.


# In[9]:


# Let's start with a basic question: the number of movies released each year.
# Group by the 'year' column and count the number of movies per year
movies_per_year = data.groupby('year').size()
movies_per_year.plot(kind='bar', color='skyblue')


# The year 2020 has the lowest number of movies because the dataset only runs through 31 July 2020, so that year is incomplete.

# In[ ]:


# #### Identify the most popular genres each year or over certain periods, and find out whether there are genre trends.

# #### Average Movie Duration by Genre and Year.
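# Both genre questions above depend on counting each genre separately, even when one movie lists several genres in a single cell. The following is a minimal sketch of that idea on a made-up three-row table. The titles and values are invented for illustration; only the `year` and `genre` column names mirror the real dataset.

```python
import pandas as pd

# Toy data, invented for illustration; the real dataset has many more columns.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "year": [2000, 2000, 2001],
    "genre": ["Drama, Comedy", "Drama", "Action, Drama"],
})

# Split the comma-separated string into a list, then give each genre its own row.
toy_exploded = toy.assign(genre=toy["genre"].str.split(",")).explode("genre")

# Remove the whitespace left over after splitting on the comma.
toy_exploded["genre"] = toy_exploded["genre"].str.strip()

# Count movies per (year, genre) pair.
toy_counts = (
    toy_exploded.groupby(["year", "genre"]).size().reset_index(name="count")
)
print(toy_counts)
```

# Note that a movie with two genres is counted once under each genre, so the per-genre counts add up to more than the number of movies.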
# In[10]:


# Split and explode the genre column. When genres are in a single cell separated by commas,
# it is difficult to count the number of times each genre appears individually.
# Convert all values in the 'genre' column to strings, because the column has object dtype.
data['genre'] = data['genre'].astype(str)


# In[11]:


# Split the 'genre' column into lists so it can be exploded into one genre per row
data['genre'] = data['genre'].str.split(',')


# In[12]:


# Explode: this function takes each element in the list and creates a new row for it, duplicating the other columns as needed.


# In[13]:


data = data.explode('genre')


# In[14]:


# Remove any leading/trailing whitespace in genres
data['genre'] = data['genre'].str.strip()


# In[15]:


data.head()


# In[16]:


# Calculate the most popular genres each year
# Group by 'year' and 'genre' to count the number of movies per genre each year
genre_counts = data.groupby(['year', 'genre']).size().reset_index(name='count')

# Find the most popular genre per year
most_popular_genres = genre_counts.loc[genre_counts.groupby('year')['count'].idxmax()]

# Display the most popular genres by year
print(most_popular_genres)


# Dominance of Drama: These trends suggest that Drama has been a staple genre with broad appeal across the years. The number of Drama movies released per year generally increases from 2000 to around 2014, where it peaks with 195 movies. This suggests a growing interest in the genre during this period.

# In[17]:


# Plot genre trends over time
# Pivot the data to create a matrix of years (rows) and genres (columns) with movie counts
genre_trends = genre_counts.pivot(index='year', columns='genre', values='count').fillna(0)

# Plot the genre trends
plt.figure(figsize=(12, 8))
sns.lineplot(data=genre_trends)
plt.title("Genre Trends Over Time (2000-2020)")
plt.xlabel("Year")
plt.ylabel("Number of Movies")
plt.legend(title="Genres", bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()


# 1. The graph provides insight into changing audience preferences, with Drama dominating for a long time but facing a gradual decline. Action and Comedy remain strong but show some decline as well.
# 2. All genres drop substantially in 2020. External factors like the COVID-19 pandemic may play a part, but the dataset also ends on 31 July 2020, so that year is incomplete.
# 3. After the mid-2010s, Action movies also start to decline, though not as drastically as Drama. This suggests a slight shift in audience preference away from Action in recent years.
# 4. Genres like Thriller, Horror, and Sci-Fi show a modest increase over time, particularly from 2010 onwards, though they never surpass the most popular genres like Drama or Action.

# In[18]:


# Calculate average movie duration by genre and year
average_duration = data.groupby(['year', 'genre'])['duration'].mean().reset_index()

# Display the average duration data
print(average_duration.head())


# In[19]:


# Plot average movie duration by genre and year
# Pivot the data to create a matrix for the heatmap
duration_trends = average_duration.pivot(index='year', columns='genre', values='duration')

# Plot the heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(duration_trends, cmap="YlGnBu", cbar_kws={'label': 'Average Duration (minutes)'})
plt.title("Average Movie Duration by Genre and Year")
plt.xlabel("Genre")
plt.ylabel("Year")
plt.show()


# In[20]:


# Plot average duration for a few selected genres over time
selected_genres = ['Animation', 'Biography', 'History', 'Musical']  # Select genres of interest

plt.figure(figsize=(12, 8))
for genre in selected_genres:
    sns.lineplot(data=average_duration[average_duration['genre'] == genre], x='year', y='duration', label=genre)

plt.title("Average Movie Duration by Genre (2000-2020)")
plt.xlabel("Year")
plt.ylabel("Average Duration (minutes)")
plt.legend(title="Genre")
plt.show()


# Calculating the average duration of movies in each genre gives the following insights:
# 1. Short duration: Animation movies are the shortest of the selected genres, but they have become slightly longer over the years.
# 2. Long duration: Biography, History, and Musical movies have the longest average running times.

# In[53]:


# Identify top-rated movies each year
# Reload the dataset, because exploding the genre column above duplicated each movie once per genre.
new_data = pd.read_csv('IMDB Movies 2000 - 2020.csv')

# Find the top-rated movie for each year
# Group by 'year' and select the row with the highest rating for each year
# (grouping new_data itself rather than the exploded frame keeps the lookup self-contained)
top_rated_each_year = new_data.loc[new_data.groupby('year')['avg_vote'].idxmax()]

# Select only the desired columns
top_rated_each_year = top_rated_each_year[['year', 'original_title', 'avg_vote', 'genre', 'country']]

# Display the result
top_rated_each_year


# It looks like India has produced several of the top-rated movies of the past 20 years, each the highest rated among all movies released in its year. As an Indian, this is a proud moment for me.

# In[54]:


# Analyze the most common languages used in movies over the years.
# Count the occurrences of each language in each year
language_counts = new_data.groupby(['year', 'language_1']).size().reset_index(name='count')

# Find the most popular language in each year
# For each year, find the language with the maximum count
most_popular_language_each_year = language_counts.loc[language_counts.groupby('year')['count'].idxmax()]

# Select the columns we are interested in for better readability
most_popular_language_each_year = most_popular_language_each_year[['year', 'language_1', 'count']]

# There are three language columns, so we repeat this step for each of them.


# In[55]:


most_popular_language_each_year


# Looking at the output, we can say that English is the most common language used in movies.
# I also queried the second most common language in SQL, which is simpler for this step; the output below shows that Hindi is the second most common language.
# ![image.png](attachment:image.png)

# In[56]:


# Analyze the second language column
language_count = new_data.groupby(['year', 'language_2']).size().reset_index(name='count')
most_popular_language_each = language_count.loc[language_count.groupby('year')['count'].idxmax()]
most_popular_language_each = most_popular_language_each[['year', 'language_2', 'count']]

# Display the result
most_popular_language_each


# Here we can see that Spanish and French are the most common secondary languages.
# Looking at the output from both columns, we can conclude that although English is the most popular language, French and Spanish also gained popularity in certain periods.

# In[57]:


# Directors with consistently high ratings: for this analysis I used SQL because it is simpler.


# In[58]:


# Calculate average rating by director: use the AVG aggregate function to find the average rating for each director.
# Filter for high ratings: use a HAVING clause to keep directors whose average rating is at or above a threshold (8 here).
# SELECT
#     director,
#     AVG(avg_vote) AS avg_rating,
#     COUNT(title) AS total_movies
# FROM
#     movies
# GROUP BY
#     director
# HAVING
#     AVG(avg_vote) >= 8
# ORDER BY
#     total_movies DESC;


# ![image.png](attachment:image.png)

# Here we can see that two directors who made some of the best movies of this period are at the top of the list.
# Christopher Nolan's movies, such as "Interstellar", "Inception", and "Tenet", are highly rated.

# Anurag Kashyap: Acclaimed Indian Director
#
# Known for bold narratives and social commentary, Kashyap's notable works include:
#
# - Gangs of Wasseypur
# - Dev.D
# - Black Friday
#
# Recipient of the National Film Award and the Filmfare Award. A pioneer of Indian new wave cinema.

# In[ ]:


# **Top 10 Actors in Terms of Total Votes**

# In[62]:


# Split the actors column because each row lists several actors.
new_data['actors'] = new_data['actors'].astype(str)
new_data['actors'] = new_data['actors'].str.split(', ')
new_data_exploded = new_data.explode('actors')

# Group by actor and calculate average rating, total votes, and movie count
top_actors = (
    new_data_exploded.groupby('actors')
    .agg(avg_rating=('avg_vote', 'mean'),
         total_votes=('votes', 'sum'),
         total_movies=('avg_vote', 'count'))
    .reset_index()
)

# Sort by total votes. Later we can sort by rating or movie count instead.
top_actors = top_actors.sort_values(by=['total_votes'], ascending=[False])
top_actors.head(10)


# In[67]:


# Get the top 10 actors
top_actors_sorted = top_actors.sort_values(by='total_votes', ascending=False).head(10)

# Plotting
plt.figure(figsize=(12, 8))
sns.barplot(x='total_votes', y='actors', data=top_actors_sorted, palette='viridis')
plt.xlabel("Total Votes")
plt.ylabel("Actors")
plt.title("Top 10 Actors by Total Votes")
plt.show()


# This list and graph show the actors whose movies drew the most votes. They need little introduction, since they are already people's favourites.
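# One detail of the actor split above: `astype(str)` turns a missing value into the literal string 'nan', which would then be counted as if it were an actor. The notebook states that the dataset is already cleaned, so this may not apply here. As a hedged sketch on made-up data, dropping rows with missing actors before splitting avoids the issue; the actor names and numbers below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; the second row has no actor information.
toy = pd.DataFrame({
    "actors": ["Actor One, Actor Two", np.nan, "Actor Two"],
    "avg_vote": [8.0, 5.0, 6.0],
    "votes": [1000, 50, 300],
})

# Drop rows with missing actors instead of converting NaN to the string 'nan'.
toy_clean = toy.dropna(subset=["actors"]).copy()
toy_clean["actors"] = toy_clean["actors"].str.split(", ")

# One row per (movie, actor), then per-actor summary statistics.
toy_actors = (
    toy_clean.explode("actors")
    .groupby("actors")
    .agg(avg_rating=("avg_vote", "mean"),
         total_votes=("votes", "sum"),
         total_movies=("avg_vote", "count"))
    .reset_index()
    .sort_values("total_votes", ascending=False)
)
print(toy_actors)
```

# The same consideration would apply to the genre column if it contained missing values.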
# In[ ]:


# **Actors who appear frequently in high-rated movies or movies with a high number of votes.**

# In[68]:


# Define thresholds for high rating and high votes
high_rating_threshold = 8.0
high_votes_threshold = 50000

# Find actors with an average rating at or above the high rating threshold
high_rating_actors = top_actors[top_actors['avg_rating'] >= high_rating_threshold]

# Find actors with a total number of votes at or above the high votes threshold
high_votes_actors = top_actors[top_actors['total_votes'] >= high_votes_threshold]

# Find actors who meet both criteria (high-rated movies and high votes)
high_rating_and_votes_actors = top_actors[
    (top_actors['avg_rating'] >= high_rating_threshold) &
    (top_actors['total_votes'] >= high_votes_threshold)
]

# Display the actors who meet both criteria
print("Actors frequently appearing in high-rated and high-vote movies:")
print(high_rating_and_votes_actors[['actors', 'avg_rating', 'total_votes', 'total_movies']])


# In[74]:


# Sort and plot the top 10 actors in high-rated, high-vote movies


# In[71]:


high = high_rating_and_votes_actors.sort_values(by='total_votes', ascending=False).head(10)


# In[75]:


plt.figure(figsize=(12, 8))
sns.barplot(x='total_votes', y='actors', data=high, palette='viridis')
plt.xlabel("Total Votes")
plt.ylabel("Actors")
plt.title("Top 10 Actors in high-rated movies")
plt.show()


# John Ratzenberger: The only person to voice a character in all of Pixar Animation's feature films.
#
# John Bach: His screen career spans more than 90 roles. John Bach was born on 5 June 1946 in Wales, UK. He is an actor, known for The Lord of the Rings.
#
# These actors have worked in several movies, many of which are highly rated.
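# The two thresholds above select on average rating and total votes, so an actor with a single highly rated, widely voted movie also qualifies. If "frequently" is meant literally, one option is to add a minimum movie count. The sketch below uses a made-up per-actor table in the same shape as `top_actors`; the `min_movies` cut-off of 3 is a hypothetical value, not something taken from the original analysis.

```python
import pandas as pd

# Toy per-actor summary (invented numbers) with the same columns as top_actors.
toy_actors = pd.DataFrame({
    "actors": ["Actor A", "Actor B", "Actor C"],
    "avg_rating": [8.4, 8.1, 7.2],
    "total_votes": [900000, 60000, 2000000],
    "total_movies": [6, 1, 12],
})

high_rating_threshold = 8.0
high_votes_threshold = 50000
min_movies = 3  # hypothetical cut-off, chosen only for illustration

# Keep actors who pass all three filters.
frequent_high = toy_actors[
    (toy_actors["avg_rating"] >= high_rating_threshold)
    & (toy_actors["total_votes"] >= high_votes_threshold)
    & (toy_actors["total_movies"] >= min_movies)
]
print(frequent_high)
```

# In this toy table, "Actor B" passes the rating and vote thresholds with a single movie and is removed by the extra filter.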
# In[ ]:


# ***Plot the average IMDb rating and number of votes each year to see if audience engagement has changed.***

# In[79]:


# Assuming 'year', 'avg_vote' (IMDb rating), and 'votes' columns are in your DataFrame

# Group by year to calculate the average IMDb rating and total votes each year
yearly_stats = new_data.groupby('year').agg(avg_rating=('avg_vote', 'mean'), total_votes=('votes', 'sum')).reset_index()

# Create the plot
fig, ax1 = plt.subplots(figsize=(10, 4))

# Plot average IMDb rating on the primary y-axis
sns.lineplot(data=yearly_stats, x='year', y='avg_rating', marker='o', color='b', ax=ax1)
ax1.set_ylabel('Average IMDb Rating', color='b')
ax1.tick_params(axis='y', labelcolor='b')
ax1.set_title('Average IMDb Rating and Total Votes by Year')

# Create a second y-axis for total votes
ax2 = ax1.twinx()
sns.lineplot(data=yearly_stats, x='year', y='total_votes', marker='o', color='r', ax=ax2)
ax2.set_ylabel('Total Votes', color='r')
ax2.tick_params(axis='y', labelcolor='r')

# Show the plot
plt.show()


# The average IMDb rating appears to stay relatively stable over the years, with slight fluctuations. Most years maintain an average rating between 6.3 and 6.6.
#
# The total votes exhibit an upward trend until around 2017, with a general increase in audience engagement. This trend suggests that more people started voting on IMDb over time, likely due to increased popularity of the platform and internet accessibility.
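# Total votes per year also depend on how many movies were released that year, so a per-movie figure can help separate engagement from volume. The following is a minimal sketch on made-up data; the `movie_count` and `votes_per_movie` columns are names introduced here for illustration and are not part of the original analysis.

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001, 2001],
    "avg_vote": [6.0, 7.0, 6.5, 6.5, 8.0],
    "votes": [100, 300, 200, 200, 500],
})

# Per-year average rating, total votes, and number of movies.
toy_yearly = (
    toy.groupby("year")
    .agg(avg_rating=("avg_vote", "mean"),
         total_votes=("votes", "sum"),
         movie_count=("votes", "count"))
    .reset_index()
)

# Normalize total votes by the number of movies released that year.
toy_yearly["votes_per_movie"] = toy_yearly["total_votes"] / toy_yearly["movie_count"]
print(toy_yearly)
```

# Plotting `votes_per_movie` next to `total_votes` would show whether a rise in total votes comes from more movies or from more votes per movie.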
# In[ ]:


# **Determine if there's a correlation between a movie's duration and its IMDb rating.**

# In[85]:


from scipy.stats import pearsonr

# Calculate the Pearson correlation coefficient
correlation, p_value = pearsonr(new_data['duration'], new_data['avg_vote'])
print(f"Correlation between duration and IMDb rating: {correlation:.2f}")
print(f"P-value: {p_value:.2f}")

# Scatter plot to visualize the relationship
plt.figure(figsize=(10, 6))
sns.scatterplot(data=new_data, x='duration', y='avg_vote', alpha=0.5)
plt.title("Scatter Plot of Movie Duration vs IMDb Rating")
plt.xlabel("Duration (minutes)")
plt.ylabel("IMDb Rating (avg_vote)")
plt.show()


# The moderate positive correlation (0.38) and the p-value (printed as 0.00, i.e. below 0.005 at two decimal places) imply that there is a statistically significant, albeit not very strong, relationship between movie duration and IMDb rating.
#
# The plot shows a large concentration of movies with a duration of around 100 minutes. This suggests that most movies in the dataset fall within this duration range.
#
# There appears to be a slight upward trend in the scatter, meaning that **movies with a longer duration (above 100 minutes) tend to have slightly higher ratings on average.** This aligns with the moderate positive correlation coefficient (0.38) observed above.
#
# For movies with durations above 150 minutes, there is a visible concentration of movies with IMDb ratings mostly above 6. **This suggests that longer movies are more likely to receive higher ratings.**
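# To put the coefficient in perspective: squaring a Pearson correlation gives the share of variance in one variable that a linear fit on the other accounts for, and 0.38 squared is about 0.14, so duration accounts for roughly 14% of the variation in rating. The sketch below shows the same calculation on made-up data using pandas' `Series.corr` (Pearson by default) instead of `scipy.stats.pearsonr`, and drops incomplete rows first. The values are invented for illustration.

```python
import pandas as pd

# Toy data, invented for illustration; the last row has a missing duration.
toy = pd.DataFrame({
    "duration": [90, 100, 110, 150, None],
    "avg_vote": [5.5, 6.0, 6.2, 7.5, 6.0],
})

# Keep only rows where both values are present.
pairs = toy[["duration", "avg_vote"]].dropna()

# Series.corr computes the Pearson correlation by default.
r = pairs["duration"].corr(pairs["avg_vote"])
print(f"Pearson r on toy data: {r:.2f}")

# r squared: share of variance accounted for by a linear relationship.
print(f"r squared on toy data: {r ** 2:.2f}")
```

# The toy values are nearly linear, so their coefficient is much higher than the 0.38 measured on the real dataset.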