In this notebook, we explore Netflix movies and TV shows with Kotlin DataFrame and use the Kandy library for data visualization.
We use the latest available versions of the libraries; the following line magic takes care of that:
%useLatestDescriptors
Importing dataframe
%use dataframe
Importing the visualization library
%use kandy
To get started, we need to read the data from a CSV file:
val rawDf = DataFrame.read("netflix_titles.csv")
Let's take a first look at its contents:
// taking a look at types and columns
rawDf.schema()
show_id: String
type: String
title: String
director: String?
cast: String?
country: String?
date_added: String?
release_year: Int
rating: String?
duration: String
listed_in: String
description: String
rawDf.size() // rowsCount x columnsCount
7787 x 12
rawDf.head() // return first five rows
show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
---|---|---|---|---|---|---|---|---|---|---|---|
s1 | TV Show | 3% | null | João Miguel, Bianca Comparato, Michel... | Brazil | August 14, 2020 | 2020 | TV-MA | 4 Seasons | International TV Shows, TV Dramas, TV... | In a future where the elite inhabit a... |
s2 | Movie | 7:19 | Jorge Michel Grau | Demián Bichir, Héctor Bonilla, Oscar ... | Mexico | December 23, 2016 | 2016 | TV-MA | 93 min | Dramas, International Movies | After a devastating earthquake hits M... |
s3 | Movie | 23:59 | Gilbert Chan | Tedd Chan, Stella Chung, Henley Hii, ... | Singapore | December 20, 2018 | 2011 | R | 78 min | Horror Movies, International Movies | When an army recruit is found dead, h... |
s4 | Movie | 9 | Shane Acker | Elijah Wood, John C. Reilly, Jennifer... | United States | November 16, 2017 | 2009 | PG-13 | 80 min | Action & Adventure, Independent Movie... | In a postapocalyptic world, rag-doll ... |
s5 | Movie | 21 | Robert Luketic | Jim Sturgess, Kevin Spacey, Kate Bosw... | United States | January 1, 2020 | 2008 | PG-13 | 123 min | Dramas | A brilliant group of students become ... |
// Getting general statistics and info for each column
rawDf.describe()
name | type | count | unique | nulls | top | freq | mean | std | min | median | max |
---|---|---|---|---|---|---|---|---|---|---|---|
show_id | String | 7787 | 7787 | 0 | s1 | 1 | null | null | s1 | s4502 | s999 |
type | String | 7787 | 2 | 0 | Movie | 5377 | null | null | Movie | Movie | TV Show |
title | String | 7787 | 7787 | 0 | 3% | 1 | null | null | #Alive | Manglehorn | 최강전사 미니특공대 : 영웅의 탄생 |
director | String? | 7787 | 4050 | 2389 | Raúl Campos, Jan Suter | 18 | null | null | A. L. Vijay | Lance Bangs | Şenol Sönmez |
cast | String? | 7787 | 6832 | 718 | David Attenborough | 18 | null | null | 'Najite Dede, Jude Chukwuka, Taiwo Ar... | Kay Kay Menon, Shiney Ahuja, Chitrang... | Ṣọpẹ́ Dìrísù, Wunmi Mosaku, Matt Smit... |
country | String? | 7787 | 682 | 507 | United States | 2555 | null | null | Argentina | Thailand | Zimbabwe |
date_added | String? | 7787 | 1566 | 10 | January 1, 2020 | 118 | null | null | April 15, 2018 | July 6, 2020 | September 9, 2020 |
release_year | Int | 7787 | 73 | 0 | 2018 | 1121 | 2013.932580 | 8.757395 | 1925 | 2017 | 2021 |
rating | String? | 7787 | 15 | 7 | TV-MA | 2863 | null | null | G | TV-MA | UR |
duration | String | 7787 | 216 | 0 | 1 Season | 1608 | null | null | 1 Season | 148 min | 99 min |
listed_in | String | 7787 | 492 | 0 | Documentaries | 334 | null | null | Action & Adventure | Documentaries, Sports Movies | Thrillers |
description | String | 7787 | 7769 | 0 | Multiple women report their husbands ... | 3 | null | null | "Brooklyn Nine-Nine" star Chelsea Per... | In a 1950s orphanage, a young girl re... | Zoe Walker leaves her quiet life behi... |
The data consists of Netflix TV shows and movies added to the platform up to early 2021. Each row describes one title and contains the following fields:
- `show_id` - unique show number
- `type` - *TV Show* or *Movie*
- `title` - the name of the TV show or movie
- `director` - director's name
- `cast` - cast list
- `country` - the country where the title was released
- `date_added` - when the title was added to Netflix
- `release_year` - the year the title was released
- `rating` - rating of the title
- `listed_in` - the lists/genres in which the title appears on Netflix
- `description` - title description
val df = rawDf.dropNulls { date_added } // remove rows where `date_added` is not specified
.convert { date_added }.toLocalDate("MMMM d, yyyy") // convert date_added to LocalDate using date pattern
.sortBy { date_added } // and let's also sort by date for easy operation later
df
show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
---|---|---|---|---|---|---|---|---|---|---|---|
s7114 | Movie | To and From New York | Sorin Dan Mihalcescu | Barbara King, Shaana Diya, John Krisi... | United States | 2008-01-01 | 2006 | TV-MA | 81 min | Dramas, Independent Movies, Thrillers | While covering a story in New York Ci... |
s1766 | TV Show | Dinner for Five | null | null | United States | 2008-02-04 | 2007 | TV-MA | 1 Season | Stand-Up Comedy & Talk Shows | In each episode, four celebrities joi... |
s3249 | Movie | Just Another Love Story | Ole Bornedal | Anders W. Berthelsen, Rebecka Hemse, ... | Denmark | 2009-05-05 | 2007 | TV-MA | 104 min | Dramas, International Movies | When he causes a car accident that le... |
s5766 | Movie | Splatter | Joe Dante | Corey Feldman, Tony Todd, Tara Leigh,... | United States | 2009-11-18 | 2009 | TV-MA | 29 min | Horror Movies | After committing suicide, a washed-up... |
s3841 | Movie | Mad Ron's Prevues from Hell | Jim Monaco | Nick Pawlow, Jordu Schell, Jay Kushwa... | United States | 2010-11-01 | 1987 | NR | 84 min | Cult Movies, Horror Movies | This collection cherry-picks trailers... |
s2042 | Movie | Even the Rain | Icíar Bollaín | Luis Tosar, Gael García Bernal, Juan ... | Spain, Mexico, France | 2011-05-17 | 2010 | TV-MA | 103 min | Dramas, International Movies | While making a film about the incursi... |
s3222 | Movie | Joseph: King of Dreams | Rob LaDuca, Robert C. Ramirez | Ben Affleck, Mark Hamill, Richard Her... | United States | 2011-09-27 | 2000 | TV-PG | 75 min | Children & Family Movies, Dramas, Fai... | With his gift of dream interpretation... |
s233 | Movie | A Stoning in Fulham County | Larry Elikann | Ken Olin, Jill Eikenberry, Maureen Mu... | United States | 2011-10-01 | 1988 | TV-14 | 95 min | Dramas | After reckless teens kill an Amish ch... |
s309 | Movie | Adam: His Song Continues | Robert Markowitz | Daniel J. Travanti, JoBeth Williams, ... | United States | 2011-10-01 | 1986 | TV-MA | 96 min | Dramas | After their child was abducted and mu... |
s2623 | Movie | Hard Lessons | Eric Laneuville | Denzel Washington, Lynn Whitfield, Ri... | United States | 2011-10-01 | 1986 | TV-14 | 94 min | Dramas | This drama based on real-life events ... |
s2963 | Movie | In Defense of a Married Man | Joel Oliansky | Judith Light, Michael Ontkean, Jerry ... | United States | 2011-10-01 | 1990 | TV-14 | 94 min | Dramas | A lawyer's husband is having an affai... |
s5042 | Movie | Quiet Victory: The Charlie Wedemeyer ... | Roy Campanella II | Pam Dawber, Michael Nouri, Bess Meyer... | United States | 2011-10-01 | 1988 | TV-PG | 93 min | Dramas, Sports Movies | When high school football coach Charl... |
s5833 | Movie | Strange Voices | Arthur Allan Seidelman | Nancy McKeon, Valerie Harper, Stephen... | United States | 2011-10-01 | 1987 | TV-PG | 96 min | Dramas | When their college-age daughter sudde... |
s6846 | Movie | The Ryan White Story | John Herzfeld | Judith Light, Lukas Haas, Michael Bow... | United States | 2011-10-01 | 1989 | TV-PG | 94 min | Dramas | After contracting HIV from a tainted ... |
s7150 | Movie | Too Young the Hero | Buzz Kulik | Ricky Schroder, Jon DeVries, Debra Mo... | United States | 2011-10-01 | 1988 | TV-MA | 94 min | Dramas | Twelve-year-old Calvin manages to joi... |
s7231 | Movie | Triumph of the Heart | Richard Michaels | Mario Van Peebles, Susan Ruttan, Lane... | United States | 2011-10-01 | 1991 | TV-PG | 93 min | Dramas, Sports Movies | This drama tells the tale of Ricky Be... |
s7363 | Movie | Unspeakable Acts | Linda Otto | Jill Clayburgh, Brad Davis, Sam Behrens | United States | 2011-10-01 | 1990 | TV-14 | 95 min | Dramas | Laurie and Joseph are doctors who int... |
s7415 | Movie | Victim of Beauty | Roger Young | William Devane, Jeri Ryan, Michele Ab... | United States | 2011-10-01 | 1991 | TV-14 | 93 min | Dramas, Thrillers | A beauty pageant winner is stalked by... |
s819 | Movie | Being Elmo: A Puppeteer's Journey | Constance Marks | Kevin Clash, Whoopi Goldberg | United States | 2012-02-21 | 2011 | PG | 76 min | Documentaries | Whoopi Goldberg narrates Elmo creator... |
s1230 | Movie | Casa de mi Padre | Matt Piedmont | Will Ferrell, Gael García Bernal, Die... | United States, Mexico | 2012-11-14 | 2012 | R | 84 min | Comedies | Will Ferrell stars as a Spanish-speak... |
// let's check the type of the converted column
df.date_added.type()
kotlinx.datetime.LocalDate
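To make the `MMMM d, yyyy` pattern concrete, here is a minimal plain-Kotlin sketch of the same kind of parsing. It uses `java.time` instead of the DataFrame `toLocalDate` call, and both the helper name `parseDateAdded` and the explicit `Locale.ENGLISH` are assumptions made for illustration.

```kotlin
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale

// MMMM = full month name, d = day of month without zero padding, yyyy = four-digit year.
// Locale.ENGLISH is an assumption: the month names in the dataset are English.
val dateAddedFormat: DateTimeFormatter =
    DateTimeFormatter.ofPattern("MMMM d, yyyy", Locale.ENGLISH)

// Hypothetical helper, not part of Kotlin DataFrame.
fun parseDateAdded(text: String): LocalDate =
    LocalDate.parse(text.trim(), dateAddedFormat)

fun main() {
    println(parseDateAdded("August 14, 2020")) // 2020-08-14
    println(parseDateAdded("January 1, 2020")) // 2020-01-01
}
```

Fixing the locale keeps the parsing independent of the machine's default language settings.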
First, let's see which there are more of: TV shows or movies.
rawDf
.valueCounts(sort = false) { type }
.plot {
bars {
x(type)
y("count")
fillColor(type) {
scale = categorical(range = listOf(Color.hex("#00BCD4"), Color.hex("#009688")))
}
}
layout {
title = "Count of TV Shows and Movies"
size = 900 to 550
}
}
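For readers new to value counts, the idea can be sketched in plain Kotlin without any DataFrame API. The function name `valueCounts` below only mirrors the operation used in the cell above and is a hypothetical standalone helper; the sample list is made up.

```kotlin
// Counts how many times each distinct value occurs, keeping first-seen order.
fun <T> valueCounts(values: List<T>): Map<T, Int> =
    values.groupingBy { it }.eachCount()

fun main() {
    // Made-up sample, not the real dataset
    val types = listOf("TV Show", "Movie", "Movie", "Movie", "TV Show")
    println(valueCounts(types)) // {TV Show=2, Movie=3}
}
```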
The number of movies on Netflix is about twice the number of TV shows. But has it always been this way? To find out, let's check whether there was a year in which more TV shows than movies were added, and then look at the cumulative totals for both.
val df_date_count = df
.convert { date_added }.with { it.year } // converting `date_added` to extract `year`
.groupBy { date_added } // grouping by `year` stored in `date_added`
.aggregate {
count { type == "TV Show" } into "tvshows" // counting TV Shows into column `tvshows`
count { type == "Movie" } into "movies" // counting Movies into column `movies`
}
df_date_count
date_added | tvshows | movies |
---|---|---|
2008 | 1 | 1 |
2009 | 0 | 2 |
2010 | 0 | 1 |
2011 | 0 | 13 |
2012 | 0 | 3 |
2013 | 5 | 6 |
2014 | 6 | 19 |
2015 | 30 | 58 |
2016 | 185 | 258 |
2017 | 361 | 864 |
2018 | 430 | 1255 |
2019 | 656 | 1497 |
2020 | 697 | 1312 |
2021 | 29 | 88 |
Let's pause and see how we can simplify this expression using more advanced operations. First of all, we can combine the conversion of `date_added` into a year with the grouping by using the `map` function within the column selector.
val df_date_count = df
.groupBy { date_added.map { it.year } } // grouping by year added extracted from `date_added`
.aggregate {
count { type == "TV Show" } into "tvshows" // counting TV Shows into column `tvshows`
count { type == "Movie" } into "movies" // counting Movies into column `movies`
}
df_date_count
date_added | tvshows | movies |
---|---|---|
2008 | 1 | 1 |
2009 | 0 | 2 |
2010 | 0 | 1 |
2011 | 0 | 13 |
2012 | 0 | 3 |
2013 | 5 | 6 |
2014 | 6 | 19 |
2015 | 30 | 58 |
2016 | 185 | 258 |
2017 | 361 | 864 |
2018 | 430 | 1255 |
2019 | 656 | 1497 |
2020 | 697 | 1312 |
2021 | 29 | 88 |
Our `groupBy` aggregation adds new columns for "TV Show" and "Movie". This is exactly what `pivot` does: it generates a new column for every unique value in `type`.
df.groupBy { date_added.map { it.year } }
.pivot { type }
date_added | type: Movie | type: TV Show |
---|---|---|
2008 | DataFrame [1 x 12] | DataFrame [1 x 12] |
2009 | DataFrame [2 x 12] | DataFrame [0 x 0] |
2010 | DataFrame [1 x 12] | DataFrame [0 x 0] |
2011 | DataFrame [13 x 12] | DataFrame [0 x 0] |
2012 | DataFrame [3 x 12] | DataFrame [0 x 0] |
2013 | DataFrame [6 x 12] | DataFrame [5 x 12] |
2014 | DataFrame [19 x 12] | DataFrame [6 x 12] |
2015 | DataFrame [58 x 12] | DataFrame [30 x 12] |
2016 | DataFrame [258 x 12] | DataFrame [185 x 12] |
2017 | DataFrame [864 x 12] | DataFrame [361 x 12] |
2018 | DataFrame [1255 x 12] | DataFrame [430 x 12] |
2019 | DataFrame [1497 x 12] | DataFrame [656 x 12] |
2020 | DataFrame [1312 x 12] | DataFrame [697 x 12] |
2021 | DataFrame [88 x 12] | DataFrame [29 x 12] |
After the `type` column is pivoted, we call `aggregate` to specify the metrics to be calculated for every data group.
df.groupBy { date_added.map { it.year } }
.pivot { type }.aggregate { count() }
date_added | type: Movie | type: TV Show |
---|---|---|
2008 | 1 | 1 |
2009 | 2 | null |
2010 | 1 | null |
2011 | 13 | null |
2012 | 3 | null |
2013 | 6 | 5 |
2014 | 19 | 6 |
2015 | 58 | 30 |
2016 | 258 | 185 |
2017 | 864 | 361 |
2018 | 1255 | 430 |
2019 | 1497 | 656 |
2020 | 1312 | 697 |
2021 | 88 | 29 |
Simple statistics such as `count` can be computed without `aggregate`; note that missing combinations are now 0 rather than null:
df.groupBy { date_added.map { it.year } }
.pivot { type }.count()
date_added | type: Movie | type: TV Show |
---|---|---|
2008 | 1 | 1 |
2009 | 2 | 0 |
2010 | 1 | 0 |
2011 | 13 | 0 |
2012 | 3 | 0 |
2013 | 6 | 5 |
2014 | 19 | 6 |
2015 | 58 | 30 |
2016 | 258 | 185 |
2017 | 864 | 361 |
2018 | 1255 | 430 |
2019 | 1497 | 656 |
2020 | 1312 | 697 |
2021 | 88 | 29 |
For `count` statistics there is an even shorter API, `pivotCounts`.
Here is the final version:
val df_date_count = df
.groupBy { date_added.map { it.year } }.pivotCounts { type }
df_date_count
date_added | type: Movie | type: TV Show |
---|---|---|
2008 | 1 | 1 |
2009 | 2 | 0 |
2010 | 1 | 0 |
2011 | 13 | 0 |
2012 | 3 | 0 |
2013 | 6 | 5 |
2014 | 19 | 6 |
2015 | 58 | 30 |
2016 | 258 | 185 |
2017 | 864 | 361 |
2018 | 1255 | 430 |
2019 | 1497 | 656 |
2020 | 1312 | 697 |
2021 | 88 | 29 |
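To see what the pivoted count table contains, here is a rough plain-Kotlin equivalent that does not use the DataFrame API. The `Title` class and the `pivotCountsByYear` function are hypothetical names introduced only for this sketch; the sample rows reproduce the 2008 and 2009 counts from the table above.

```kotlin
// Hypothetical minimal record: only the two fields needed for the pivot.
data class Title(val type: String, val yearAdded: Int)

// For each year, counts titles per type. Combinations that never occur
// are filled with 0, mirroring the zeros in the pivoted table.
fun pivotCountsByYear(titles: List<Title>): Map<Int, Map<String, Int>> {
    val allTypes = titles.map { it.type }.distinct().sorted()
    return titles
        .groupBy { it.yearAdded }
        .toSortedMap()
        .mapValues { (_, group) ->
            val counts = group.groupingBy { it.type }.eachCount()
            allTypes.associateWith { counts[it] ?: 0 }
        }
}

fun main() {
    val sample = listOf(
        Title("Movie", 2008), Title("TV Show", 2008),
        Title("Movie", 2009), Title("Movie", 2009),
    )
    println(pivotCountsByYear(sample))
    // {2008={Movie=1, TV Show=1}, 2009={Movie=2, TV Show=0}}
}
```

Each distinct value of `type` becomes a key of the inner map, which is the plain-collections analogue of a pivoted column.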
Now let's plot the result. The pivoted columns are nested under `type`, so in the plot we refer to them as type.`TV Show` and type.Movie.
df_date_count.plot {
x(date_added) { axis.name = "year" }
area {
y(type.`TV Show`) { axis.name = "count" }
fillColor = Color.hex("#BF360C")
borderLine.color = Color.hex("#BF360C")
alpha = .5
}
area {
y(type.Movie)
fillColor = Color.hex("#01579B")
borderLine.color = Color.hex("#01579B")
alpha = .5
}
layout {
title = "Number of titles by year"
size = 800 to 500
style {
panel {
background {
fillColor = Color.hex("#ECEFF1")
borderLineColor = Color.hex("#ECEFF1")
}
grid.lineGlobal { blank = true }
}
}
}
}
In every year except 2008, when one of each was added, more movies were added than TV shows. Consequently, the cumulative number of movies was never lower than that of TV shows, but let's plot it anyway.
val df_cumsum_titles = df_date_count
.sortBy { date_added } // sorting by date_added
.cumSum { type.allCols() } // count cumulative sum for columns `TV Show` and `Movie` that are nested under column `type`
df_cumsum_titles
date_added | type: Movie | type: TV Show |
---|---|---|
2008 | 1 | 1 |
2009 | 3 | 1 |
2010 | 4 | 1 |
2011 | 17 | 1 |
2012 | 20 | 1 |
2013 | 26 | 6 |
2014 | 45 | 12 |
2015 | 103 | 42 |
2016 | 361 | 227 |
2017 | 1225 | 588 |
2018 | 2480 | 1018 |
2019 | 3977 | 1674 |
2020 | 5289 | 2371 |
2021 | 5377 | 2400 |
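The cumulative sum itself is simple to reproduce with the Kotlin standard library. The sketch below is not the DataFrame `cumSum` implementation, just an illustration of the arithmetic; the function name `cumulativeSum` is made up, and the input is the yearly movie count for 2008 to 2013 from the earlier table.

```kotlin
// Running total: element i of the result is the sum of the first i + 1 inputs.
fun cumulativeSum(values: List<Int>): List<Int> =
    values.runningReduce { total, next -> total + next }

fun main() {
    // Movies added per year, 2008..2013
    val moviesPerYear = listOf(1, 2, 1, 13, 3, 6)
    println(cumulativeSum(moviesPerYear)) // [1, 3, 4, 17, 20, 26]
}
```

The printed values match the first six rows of the Movie column in the cumulative table.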
df_cumsum_titles.plot {
x(date_added) { axis.name = "year" }
area {
y(type.`TV Show`) { axis.name = "cumulative count" }
fillColor = Color.hex("#BF360C")
borderLine.color = Color.hex("#BF360C")
alpha = .5
}
area {
y(type.Movie)
fillColor = Color.hex("#01579B")
borderLine.color = Color.hex("#01579B")
alpha = .5
}
layout {
title = "Cumulative count of titles by year"
size = 800 to 500
style {
panel {
background {
fillColor = Color.hex("#ECEFF1")
borderLineColor = Color.hex("#ECEFF1")
}
grid.lineGlobal { blank = true }
}
}
}
}
Let's look at how long titles have been on the platform. To do this, we find the date of the most recently added title and, for every title, calculate the difference between the date it was added and that maximum date.
import kotlinx.datetime.*
val maxDate = df.date_added.max()
val df_days = df.add {
"days_on_platform" from { date_added.daysUntil(maxDate) } // adding column for number of days on the platform
"months_on_platform" from { date_added.monthsUntil(maxDate) } // adding column for number of months on the platform
"years_on_platform" from { date_added.yearsUntil(maxDate) } // adding column for number of years on the platform
}
val p1 = df_days.select { type and days_on_platform }.plot {
histogram(days_on_platform, binsOption = BinsOption.byNumber(30)) {
y(Stat.density)
fillColor = Color.hex("#ef0b0b")
borderLine.color = Color.hex("#ECEFF1")
}
statBin(days_on_platform, binsOption = BinsOption.byNumber(30)) {
area {
x(Stat.x)
y(Stat.density)
alpha = .5
fillColor = Color.hex("#0befef")
}
}
layout {
xAxisLabel = "days"
title = "Age distribution (in days) on Netflix"
}
}
val p2 = df_days.select { type and days_on_platform }.plot {
boxplot(x = type, y = days_on_platform) {
boxes {
fillColor(Stat.x) {
scale = categorical(range = listOf(Color.hex("#792020"), Color.hex("#207979")))
}
}
}
layout {
yAxisLabel = "days"
title = "Boxplot for age (in days) by type"
}
}
plotBunch {
add(p1, 0, 0, 500, 450)
add(p2, 500, 0, 500, 450)
}
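The three "time on platform" columns count whole days, months and years between two dates. As a sketch of that arithmetic, the example below uses `java.time` from the JDK as a stand-in for the `kotlinx.datetime` extension functions used in the cell above; the `Age` class and `ageOnPlatform` function are hypothetical names.

```kotlin
import java.time.LocalDate
import java.time.temporal.ChronoUnit

// Hypothetical holder for the three derived values.
data class Age(val days: Long, val months: Long, val years: Long)

// Whole days, months and years from the date a title was added to a reference date.
fun ageOnPlatform(dateAdded: LocalDate, maxDate: LocalDate): Age = Age(
    days = ChronoUnit.DAYS.between(dateAdded, maxDate),
    months = ChronoUnit.MONTHS.between(dateAdded, maxDate),
    years = ChronoUnit.YEARS.between(dateAdded, maxDate),
)

fun main() {
    val maxDate = LocalDate.of(2021, 1, 16)
    println(ageOnPlatform(LocalDate.of(2020, 1, 1), maxDate))
    // Age(days=381, months=12, years=1)
}
```

Months and years are truncated to complete units, so a title added eleven and a half months before the reference date counts as 11 months and 0 years.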
The age distribution of titles on the platform is similar for movies and TV shows. The second plot shows, however, that some movies have been on the platform much longer than any TV show. Let's take a closer look: we'll plot the number of movies and TV shows by the number of full years they have been on the platform.
df_days.valueCounts(sort = false) { type and years_on_platform }.plot {
bars {
x(years_on_platform) { axis.name = "years" }
y("count")
fillColor(type) {
scale = categorical(range = listOf(Color.hex("#bc3076"), Color.hex("#30bc76")))
}
position = Position.dodge()
}
layout {
title = "Years of Movies and TV Shows on Netflix"
size = 900 to 500
}
}
As you can see, movies have usually been on the platform longer than TV shows. A natural next question is how quickly titles were added to Netflix after their release. Finding the answer is quite simple.
val df_years = df
// adding a new column of the difference between the year of release and the year of addition
.add("years_off_platform") {
date_added.year - release_year
}
// dropping negative and zero values
.filter { "years_off_platform"<Int>() > 0 }
df_years
show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description | years_off_platform |
---|---|---|---|---|---|---|---|---|---|---|---|---|
s7114 | Movie | To and From New York | Sorin Dan Mihalcescu | Barbara King, Shaana Diya, John Krisi... | United States | 2008-01-01 | 2006 | TV-MA | 81 min | Dramas, Independent Movies, Thrillers | While covering a story in New York Ci... | 2 |
s1766 | TV Show | Dinner for Five | null | null | United States | 2008-02-04 | 2007 | TV-MA | 1 Season | Stand-Up Comedy & Talk Shows | In each episode, four celebrities joi... | 1 |
s3249 | Movie | Just Another Love Story | Ole Bornedal | Anders W. Berthelsen, Rebecka Hemse, ... | Denmark | 2009-05-05 | 2007 | TV-MA | 104 min | Dramas, International Movies | When he causes a car accident that le... | 2 |
s3841 | Movie | Mad Ron's Prevues from Hell | Jim Monaco | Nick Pawlow, Jordu Schell, Jay Kushwa... | United States | 2010-11-01 | 1987 | NR | 84 min | Cult Movies, Horror Movies | This collection cherry-picks trailers... | 23 |
s2042 | Movie | Even the Rain | Icíar Bollaín | Luis Tosar, Gael García Bernal, Juan ... | Spain, Mexico, France | 2011-05-17 | 2010 | TV-MA | 103 min | Dramas, International Movies | While making a film about the incursi... | 1 |
s3222 | Movie | Joseph: King of Dreams | Rob LaDuca, Robert C. Ramirez | Ben Affleck, Mark Hamill, Richard Her... | United States | 2011-09-27 | 2000 | TV-PG | 75 min | Children & Family Movies, Dramas, Fai... | With his gift of dream interpretation... | 11 |
s233 | Movie | A Stoning in Fulham County | Larry Elikann | Ken Olin, Jill Eikenberry, Maureen Mu... | United States | 2011-10-01 | 1988 | TV-14 | 95 min | Dramas | After reckless teens kill an Amish ch... | 23 |
s309 | Movie | Adam: His Song Continues | Robert Markowitz | Daniel J. Travanti, JoBeth Williams, ... | United States | 2011-10-01 | 1986 | TV-MA | 96 min | Dramas | After their child was abducted and mu... | 25 |
s2623 | Movie | Hard Lessons | Eric Laneuville | Denzel Washington, Lynn Whitfield, Ri... | United States | 2011-10-01 | 1986 | TV-14 | 94 min | Dramas | This drama based on real-life events ... | 25 |
s2963 | Movie | In Defense of a Married Man | Joel Oliansky | Judith Light, Michael Ontkean, Jerry ... | United States | 2011-10-01 | 1990 | TV-14 | 94 min | Dramas | A lawyer's husband is having an affai... | 21 |
s5042 | Movie | Quiet Victory: The Charlie Wedemeyer ... | Roy Campanella II | Pam Dawber, Michael Nouri, Bess Meyer... | United States | 2011-10-01 | 1988 | TV-PG | 93 min | Dramas, Sports Movies | When high school football coach Charl... | 23 |
s5833 | Movie | Strange Voices | Arthur Allan Seidelman | Nancy McKeon, Valerie Harper, Stephen... | United States | 2011-10-01 | 1987 | TV-PG | 96 min | Dramas | When their college-age daughter sudde... | 24 |
s6846 | Movie | The Ryan White Story | John Herzfeld | Judith Light, Lukas Haas, Michael Bow... | United States | 2011-10-01 | 1989 | TV-PG | 94 min | Dramas | After contracting HIV from a tainted ... | 22 |
s7150 | Movie | Too Young the Hero | Buzz Kulik | Ricky Schroder, Jon DeVries, Debra Mo... | United States | 2011-10-01 | 1988 | TV-MA | 94 min | Dramas | Twelve-year-old Calvin manages to joi... | 23 |
s7231 | Movie | Triumph of the Heart | Richard Michaels | Mario Van Peebles, Susan Ruttan, Lane... | United States | 2011-10-01 | 1991 | TV-PG | 93 min | Dramas, Sports Movies | This drama tells the tale of Ricky Be... | 20 |
s7363 | Movie | Unspeakable Acts | Linda Otto | Jill Clayburgh, Brad Davis, Sam Behrens | United States | 2011-10-01 | 1990 | TV-14 | 95 min | Dramas | Laurie and Joseph are doctors who int... | 21 |
s7415 | Movie | Victim of Beauty | Roger Young | William Devane, Jeri Ryan, Michele Ab... | United States | 2011-10-01 | 1991 | TV-14 | 93 min | Dramas, Thrillers | A beauty pageant winner is stalked by... | 20 |
s819 | Movie | Being Elmo: A Puppeteer's Journey | Constance Marks | Kevin Clash, Whoopi Goldberg | United States | 2012-02-21 | 2011 | PG | 76 min | Documentaries | Whoopi Goldberg narrates Elmo creator... | 1 |
s3467 | Movie | Kung Fu Panda: Holiday | Tim Johnson | Jack Black, Angelina Jolie, Dustin Ho... | United States | 2012-12-01 | 2010 | TV-PG | 26 min | Children & Family Movies, Comedies | As preparations for the Winter Feast ... | 2 |
s6057 | TV Show | The 4400 | null | Joel Gretsch, Jacqueline McKenzie, Pa... | United States, United Kingdom | 2013-09-01 | 2007 | TV-14 | 4 Seasons | TV Dramas, TV Mysteries, TV Sci-Fi & ... | 4400 people who vanished over the cou... | 6 |
We dropped negative values because titles are sometimes added to the platform while they are still in production. We also dropped the zero values, as they are of no interest.
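The filtering rule can be illustrated on three titles that appear in the tables above. This is a plain-Kotlin sketch, not the DataFrame code; `Release` and `positiveGaps` are hypothetical names.

```kotlin
// Hypothetical minimal record: title, year added to Netflix, release year.
data class Release(val title: String, val yearAdded: Int, val releaseYear: Int)

// Years between release and arrival on the platform, keeping only strictly positive gaps.
fun positiveGaps(titles: List<Release>): Map<String, Int> =
    titles.associate { it.title to (it.yearAdded - it.releaseYear) }
        .filterValues { it > 0 }

fun main() {
    val sample = listOf(
        Release("To and From New York", 2008, 2006), // gap 2: kept
        Release("Splatter", 2009, 2009),             // gap 0: dropped
        Release("Jack Taylor", 2013, 2016),          // gap -3: dropped
    )
    println(positiveGaps(sample)) // {To and From New York=2}
}
```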
df_years.valueCounts(false) { years_off_platform }.plot {
x(years_off_platform) { axis.name = "years" }
points {
y("count")
size = 7.5
color(years_off_platform) {
scale = continuous(range = Color.hex("#97a6d9")..Color.hex("#00256e"))
}
}
layout {
title = "How long does it take for a title to be added to Netflix?"
size = 1000 to 500
}
}
Now let's build informal top-five lists of the oldest and newest movies and TV shows on the platform.
// Top 5 oldest movies
df_days
.filter { type == "Movie" } // filtering by type
.sortByDesc { days_on_platform } // sorting by number of days on Netflix
.select { cols(type, title, country, date_added, release_year, duration) } // selecting required columns
.head() // taking first five rows
type | title | country | date_added | release_year | duration |
---|---|---|---|---|---|
Movie | To and From New York | United States | 2008-01-01 | 2006 | 81 min |
Movie | Just Another Love Story | Denmark | 2009-05-05 | 2007 | 104 min |
Movie | Splatter | United States | 2009-11-18 | 2009 | 29 min |
Movie | Mad Ron's Prevues from Hell | United States | 2010-11-01 | 1987 | 84 min |
Movie | Even the Rain | Spain, Mexico, France | 2011-05-17 | 2010 | 103 min |
// Top 5 newest movies
df_days
.filter { type == "Movie" }
.sortBy { days_on_platform }
.select { cols(type, title, country, date_added, release_year, duration) }
.head()
type | title | country | date_added | release_year | duration |
---|---|---|---|---|---|
Movie | A Monster Calls | United Kingdom, Spain, United States | 2021-01-16 | 2016 | 108 min |
Movie | Death of Me | United States, Thailand | 2021-01-16 | 2020 | 94 min |
Movie | Radium Girls | United States | 2021-01-16 | 2018 | 103 min |
Movie | Double Dad | Brazil | 2021-01-15 | 2020 | 105 min |
Movie | Hook | United States | 2021-01-15 | 1991 | 142 min |
// Top 5 oldest shows
df_days
.filter { type == "TV Show" }
.sortByDesc { days_on_platform }
.select { cols(type, title, country, date_added, release_year, duration) }
.head()
type | title | country | date_added | release_year | duration |
---|---|---|---|---|---|
TV Show | Dinner for Five | United States | 2008-02-04 | 2007 | 1 Season |
TV Show | Jack Taylor | United States, Ireland | 2013-03-31 | 2016 | 1 Season |
TV Show | Breaking Bad | United States | 2013-08-02 | 2013 | 5 Seasons |
TV Show | The 4400 | United States, United Kingdom | 2013-09-01 | 2007 | 4 Seasons |
TV Show | Gossip Girl | United States | 2013-10-08 | 2012 | 6 Seasons |
// Top 5 newest shows
df_days
.filter { type == "TV Show" }
.sortBy { days_on_platform }
.select { cols(type, title, country, date_added, release_year, duration) }
.head()
type | title | country | date_added | release_year | duration |
---|---|---|---|---|---|
TV Show | Bling Empire | null | 2021-01-15 | 2021 | 1 Season |
TV Show | Carmen Sandiego | United States | 2021-01-15 | 2021 | 4 Seasons |
TV Show | Disenchantment | United States | 2021-01-15 | 2021 | 3 Seasons |
TV Show | Henry Danger | United States | 2021-01-15 | 2016 | 3 Seasons |
TV Show | Kuroko's Basketball | Japan | 2021-01-15 | 2013 | 1 Season |
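All four lists above follow the same sort-then-take pattern. A minimal plain-Kotlin sketch of that pattern is shown below; `Entry`, `oldest` and `newest` are hypothetical names, and the sample values are made up rather than taken from the dataset.

```kotlin
// Hypothetical minimal record for the sketch.
data class Entry(val title: String, val daysOnPlatform: Int)

// The n entries that have been on the platform longest.
fun oldest(entries: List<Entry>, n: Int = 5): List<Entry> =
    entries.sortedByDescending { it.daysOnPlatform }.take(n)

// The n most recently added entries.
fun newest(entries: List<Entry>, n: Int = 5): List<Entry> =
    entries.sortedBy { it.daysOnPlatform }.take(n)

fun main() {
    // Made-up values for illustration only
    val sample = listOf(Entry("A", 10), Entry("B", 4000), Entry("C", 0), Entry("D", 250))
    println(oldest(sample, 2).map { it.title }) // [B, D]
    println(newest(sample, 2).map { it.title }) // [C, A]
}
```

Kotlin's list sorts are stable, so entries with the same number of days keep their original relative order.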
You might also be interested in which months titles are added most often.
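As a warm-up, it may help to see which parts a single date can be split into. The sketch below uses `java.time` from the JDK rather than the DataFrame `split` operation; `DateParts` and `splitDate` are hypothetical names, and the dates are ones that occur in the dataset.

```kotlin
import java.time.DayOfWeek
import java.time.LocalDate
import java.time.Month

// Hypothetical holder for the four parts of a date.
data class DateParts(val date: LocalDate, val day: DayOfWeek, val month: Month, val year: Int)

fun splitDate(date: LocalDate): DateParts =
    DateParts(date, date.dayOfWeek, date.month, date.year)

fun main() {
    val parts = listOf(LocalDate.of(2016, 12, 23), LocalDate.of(2008, 1, 1), LocalDate.of(2020, 8, 14))
        .map { splitDate(it) }
        .sortedBy { it.month } // Month is an enum, so this sorts in calendar order
    parts.forEach { println("${it.date} ${it.day} ${it.month} ${it.year}") }
}
```

Because `Month` is an enum, sorting by it orders rows from January to December rather than alphabetically.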
val df_split_date = df
// splitting dates into four columns
.split { date_added }.by { listOf(it, it.dayOfWeek, it.month, it.year) }
.into("date", "day", "month", "year")
.sortBy("month") // sorting by month
df_split_date
show_id | type | title | director | cast | country | date | day | month | year | release_year | rating | duration | listed_in | description |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
s7114 | Movie | To and From New York | Sorin Dan Mihalcescu | Barbara King, Shaana Diya, John Krisi... | United States | 2008-01-01 | TUESDAY | JANUARY | 2008 | 2006 | TV-MA | 81 min | Dramas, Independent Movies, Thrillers | While covering a story in New York Ci... |
s6899 | Movie | The Square | Jehane Noujaim | Ahmed Hassan, Khalid Abdalla, Magdy A... | United Kingdom, Egypt, United States | 2014-01-17 | FRIDAY | JANUARY | 2014 | 2013 | TV-MA | 105 min | Documentaries, International Movies | This Emmy-winning, street-level view ... |
s4154 | Movie | Mitt | Greg Whiteley | Mitt Romney | United States | 2014-01-24 | FRIDAY | JANUARY | 2014 | 2014 | TV-PG | 93 min | Documentaries | The real Mitt Romney is revealed in t... |
s2947 | Movie | Iliza Shlesinger: Freezing Hot | null | Iliza Shlesinger | null | 2015-01-23 | FRIDAY | JANUARY | 2015 | 2015 | TV-MA | 72 min | Stand-Up Comedy | Smart and brazen comedian Iliza Shles... |
s900 | TV Show | Big Bad Beetleborgs | null | Wesley Barker, Herbie Baez, Elisabeth... | United States, France, Japan | 2016-01-01 | FRIDAY | JANUARY | 2016 | 1997 | TV-G | 2 Seasons | Kids' TV, TV Comedies | When three teens free a spirit that o... |
s2843 | Movie | How to Change the World | Jerry Rothwell | null | Canada, United Kingdom, Netherlands | 2016-01-01 | FRIDAY | JANUARY | 2016 | 2015 | NR | 110 min | Documentaries, International Movies | In the 1970s, a group of activists wh... |
s4088 | TV Show | Mighty Morphin Alien Rangers | null | Julia Jordan, Matthew Sakimoto, Sicil... | null | 2016-01-01 | FRIDAY | JANUARY | 2016 | 1996 | TV-Y7 | 1 Season | Kids' TV | Visitors arrive from space to help Re... |
s4089 | TV Show | Mighty Morphin Power Rangers | null | Austin St. John, Thuy Trang, Walter J... | United States, Japan | 2016-01-01 | FRIDAY | JANUARY | 2016 | 2010 | TV-Y7 | 4 Seasons | Kids' TV | Five average teens are chosen by an i... |
s4480 | TV Show | Ninja Turtles: The Next Mutation | null | Jarred Blancard, Mitchell A. Lee Yuen... | Canada, United States | 2016-01-01 | FRIDAY | JANUARY | 2016 | 1997 | TV-G | 1 Season | Kids' TV, TV Comedies | Everyone's favorite teenage mutants a... |
s4931 | TV Show | Power Rangers Dino Thunder | null | James Napier, Kevin Duhaney, Emma Lah... | United States | 2016-01-01 | FRIDAY | JANUARY | 2016 | 2004 | TV-Y7 | 1 Season | Kids' TV | Dr. Tommy Oliver returns when his stu... |
s4932 | TV Show | Power Rangers in Space | null | Tracy Lynn Cruz, Patricia Ja Lee, Chr... | United States, France, Japan | 2016-01-01 | FRIDAY | JANUARY | 2016 | 1998 | TV-Y7 | 1 Season | Kids' TV | With the Power Chamber destroyed, for... |
s4933 | TV Show | Power Rangers Jungle Fury | null | Jason Smith, Aljin Abella, Anna Hutch... | United States | 2016-01-01 | FRIDAY | JANUARY | 2016 | 2008 | TV-Y7 | 1 Season | Kids' TV | The Power Rangers travel to Californi... |
s4934 | TV Show | Power Rangers Lightspeed Rescue | null | Michael Chaturantabut, Sean CW Johnso... | France, Japan, United States | 2016-01-01 | FRIDAY | JANUARY | 2016 | 2000 | TV-Y7 | 1 Season | Kids' TV | As demons rumble from their graves be... |
s4935 | TV Show | Power Rangers Lost Galaxy | null | Archie Kao, Reggie Rolle, Danny Slavi... | United States, France, Japan | 2016-01-01 | FRIDAY | JANUARY | 2016 | 1999 | TV-Y7 | 1 Season | Kids' TV | Five teenagers, transformed by the my... |
s4936 | TV Show | Power Rangers Mystic Force | null | Firass Dirani, Angie Diaz, Richard Br... | United States | 2016-01-01 | FRIDAY | JANUARY | 2016 | 2006 | TV-Y7 | 1 Season | Kids' TV | When the wicked Undead Army is unleas... |
s4938 | TV Show | Power Rangers Ninja Storm | null | Pua Magasiva, Sally Martin, Glenn McM... | United States, New Zealand | 2016-01-01 | FRIDAY | JANUARY | 2016 | 2003 | TV-Y7 | 1 Season | Kids' TV | When the elite warriors from the Wind... |
s4939 | TV Show | Power Rangers Operation Overdrive | null | James Maclurcan, Caitlin Murphy, Samu... | United States, New Zealand, Japan | 2016-01-01 | FRIDAY | JANUARY | 2016 | 2007 | TV-Y7 | 1 Season | Kids' TV | To keep powerful jewels from falling ... |
s4940 | TV Show | Power Rangers RPM | null | Eka Darville, Ari Boyland, Rose McIve... | United States | 2016-01-01 | FRIDAY | JANUARY | 2016 | 2009 | TV-Y7 | 1 Season | Kids' TV | The Power Rangers' new member, Dillon... |
s4941 | TV Show | Power Rangers S.P.D. | null | Brandon Jay McLaren, Chris Violette, ... | United States, New Zealand | 2016-01-01 | FRIDAY | JANUARY | 2016 | 2005 | TV-Y7 | 1 Season | Kids' TV | When the Troobian Empire attacks Eart... |
s4942 | TV Show | Power Rangers Samurai | null | Alex Heartman, Erika Fong, Hector Dav... | United States | 2016-01-01 | FRIDAY | JANUARY | 2016 | 2011 | TV-Y7 | 1 Season | Kids' TV | A new generation of Power Rangers mus... |
df_split_date
.valueCounts(false) { year and month }
.plot {
tiles {
x(year)
y(month)
width = .9
height = .9
fillColor("count") {
scale = continuous(range = Color.hex("#FFF3E0")..Color.hex("#E65100"))
}
}
layout {
title = "Content additions by month and year"
size = 900 to 700
style {
panel {
background {
blank = true
}
grid.lineGlobal { blank = true }
}
}
}
}
In this section, let's take a look at the actors and directors who make the content. First, let's determine the average number of actors in titles.
// splitting cast and counting the number of actors
val cast_df = df
.split { cast }.by(',').inplace()
.add("size_cast") { "cast"<List<String>>().size }
.convert { date_added } // Since we need the time in milliseconds since epoch for the plots, let's convert date_added to an Instant
.with { it.atStartOfDayIn(TimeZone.UTC) }
cast_df
show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description | size_cast |
---|---|---|---|---|---|---|---|---|---|---|---|---|
s7114 | Movie | To and From New York | Sorin Dan Mihalcescu | [Barbara King, Shaana Diya, John Kris... | United States | 2008-01-01T00:00:00Z | 2006 | TV-MA | 81 min | Dramas, Independent Movies, Thrillers | While covering a story in New York Ci... | 6 |
s1766 | TV Show | Dinner for Five | null | [ ] | United States | 2008-02-04T00:00:00Z | 2007 | TV-MA | 1 Season | Stand-Up Comedy & Talk Shows | In each episode, four celebrities joi... | 0 |
s3249 | Movie | Just Another Love Story | Ole Bornedal | [Anders W. Berthelsen, Rebecka Hemse,... | Denmark | 2009-05-05T00:00:00Z | 2007 | TV-MA | 104 min | Dramas, International Movies | When he causes a car accident that le... | 12 |
s5766 | Movie | Splatter | Joe Dante | [Corey Feldman, Tony Todd, Tara Leigh... | United States | 2009-11-18T00:00:00Z | 2009 | TV-MA | 29 min | Horror Movies | After committing suicide, a washed-up... | 6 |
s3841 | Movie | Mad Ron's Prevues from Hell | Jim Monaco | [Nick Pawlow, Jordu Schell, Jay Kushw... | United States | 2010-11-01T00:00:00Z | 1987 | NR | 84 min | Cult Movies, Horror Movies | This collection cherry-picks trailers... | 10 |
s2042 | Movie | Even the Rain | Icíar Bollaín | [Luis Tosar, Gael García Bernal, Juan... | Spain, Mexico, France | 2011-05-17T00:00:00Z | 2010 | TV-MA | 103 min | Dramas, International Movies | While making a film about the incursi... | 12 |
s3222 | Movie | Joseph: King of Dreams | Rob LaDuca, Robert C. Ramirez | [Ben Affleck, Mark Hamill, Richard He... | United States | 2011-09-27T00:00:00Z | 2000 | TV-PG | 75 min | Children & Family Movies, Dramas, Fai... | With his gift of dream interpretation... | 15 |
s233 | Movie | A Stoning in Fulham County | Larry Elikann | [Ken Olin, Jill Eikenberry, Maureen M... | United States | 2011-10-01T00:00:00Z | 1988 | TV-14 | 95 min | Dramas | After reckless teens kill an Amish ch... | 12 |
s309 | Movie | Adam: His Song Continues | Robert Markowitz | [Daniel J. Travanti, JoBeth Williams,... | United States | 2011-10-01T00:00:00Z | 1986 | TV-MA | 96 min | Dramas | After their child was abducted and mu... | 5 |
s2623 | Movie | Hard Lessons | Eric Laneuville | [Denzel Washington, Lynn Whitfield, R... | United States | 2011-10-01T00:00:00Z | 1986 | TV-14 | 94 min | Dramas | This drama based on real-life events ... | 4 |
s2963 | Movie | In Defense of a Married Man | Joel Oliansky | [Judith Light, Michael Ontkean, Jerry... | United States | 2011-10-01T00:00:00Z | 1990 | TV-14 | 94 min | Dramas | A lawyer's husband is having an affai... | 8 |
s5042 | Movie | Quiet Victory: The Charlie Wedemeyer ... | Roy Campanella II | [Pam Dawber, Michael Nouri, Bess Meye... | United States | 2011-10-01T00:00:00Z | 1988 | TV-PG | 93 min | Dramas, Sports Movies | When high school football coach Charl... | 6 |
s5833 | Movie | Strange Voices | Arthur Allan Seidelman | [Nancy McKeon, Valerie Harper, Stephe... | United States | 2011-10-01T00:00:00Z | 1987 | TV-PG | 96 min | Dramas | When their college-age daughter sudde... | 5 |
s6846 | Movie | The Ryan White Story | John Herzfeld | [Judith Light, Lukas Haas, Michael Bo... | United States | 2011-10-01T00:00:00Z | 1989 | TV-PG | 94 min | Dramas | After contracting HIV from a tainted ... | 8 |
s7150 | Movie | Too Young the Hero | Buzz Kulik | [Ricky Schroder, Jon DeVries, Debra M... | United States | 2011-10-01T00:00:00Z | 1988 | TV-MA | 94 min | Dramas | Twelve-year-old Calvin manages to joi... | 7 |
s7231 | Movie | Triumph of the Heart | Richard Michaels | [Mario Van Peebles, Susan Ruttan, Lan... | United States | 2011-10-01T00:00:00Z | 1991 | TV-PG | 93 min | Dramas, Sports Movies | This drama tells the tale of Ricky Be... | 8 |
s7363 | Movie | Unspeakable Acts | Linda Otto | [Jill Clayburgh, Brad Davis, Sam Behr... | United States | 2011-10-01T00:00:00Z | 1990 | TV-14 | 95 min | Dramas | Laurie and Joseph are doctors who int... | 3 |
s7415 | Movie | Victim of Beauty | Roger Young | [William Devane, Jeri Ryan, Michele A... | United States | 2011-10-01T00:00:00Z | 1991 | TV-14 | 93 min | Dramas, Thrillers | A beauty pageant winner is stalked by... | 8 |
s819 | Movie | Being Elmo: A Puppeteer's Journey | Constance Marks | [Kevin Clash, Whoopi Goldberg] | United States | 2012-02-21T00:00:00Z | 2011 | PG | 76 min | Documentaries | Whoopi Goldberg narrates Elmo creator... | 2 |
s1230 | Movie | Casa de mi Padre | Matt Piedmont | [Will Ferrell, Gael García Bernal, Di... | United States, Mexico | 2012-11-14T00:00:00Z | 2012 | R | 84 min | Comedies | Will Ferrell stars as a Spanish-speak... | 8 |
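The `date_added` conversion above turns each date into an instant at midnight UTC, which is what the `T00:00:00Z` suffix in the table reflects. As a rough illustration of what that amounts to in milliseconds since the epoch, here is a minimal sketch that uses `java.time` as a stand-in for kotlinx-datetime. The helper name `startOfDayUtcMillis` is made up for this example and is not part of the notebook.

```kotlin
import java.time.LocalDate
import java.time.ZoneOffset

// Hypothetical helper (not part of the notebook): midnight UTC of an
// ISO-8601 date, expressed as milliseconds since the Unix epoch.
fun startOfDayUtcMillis(isoDate: String): Long =
    LocalDate.parse(isoDate)          // e.g. "2008-01-01"
        .atStartOfDay(ZoneOffset.UTC) // 2008-01-01T00:00Z
        .toInstant()
        .toEpochMilli()

val millis = startOfDayUtcMillis("2008-01-01")
println(millis) // 1199145600000
```

Fixing the time zone to UTC keeps the result independent of the machine the notebook runs on.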
cast_df.plot {
histogram(size_cast, binsOption = BinsOption.byNumber(50)) {
fillColor(Stat.count) {
scale = continuous(range = Color.hex("#E0F7FA")..Color.hex("#006064"))
legend {
type = LegendType.None
}
}
}
layout {
xAxisLabel = "actors"
title = "Number of people on cast"
size = 950 to 650
}
}
It can be seen that usually 8-9 people are included in the cast.
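The histogram gives the shape of the distribution; a mean gives a single number to compare against it. The sketch below shows the same split-and-count idea in plain Kotlin on a made-up sample. The actor names, the `splitCast` helper, and the choice to treat a missing cast as an empty list are all assumptions for illustration, not the notebook's data or API.

```kotlin
// Hypothetical raw `cast` values; null stands for a title with no cast listed.
val sampleCasts: List<String?> = listOf(
    "Actor A, Actor B",
    "Actor C",
    null,
    "Actor D, Actor E, Actor F",
)

// Split on commas, trim whitespace, drop blanks.
// Assumption: a missing cast becomes an empty list (size 0).
fun splitCast(raw: String?): List<String> =
    raw?.split(',')?.map { it.trim() }?.filter { it.isNotEmpty() } ?: emptyList()

val castSizes = sampleCasts.map { splitCast(it).size }
val meanSize = castSizes.average()

println(castSizes) // [2, 1, 0, 3]
println(meanSize)  // 1.5
```

Note that titles without a cast pull the mean down, so it may be worth deciding whether they should be counted at all.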
But who exactly is involved in creating the content? Let's take a look at the actors and how many movies and shows each of them appears in.
// counting the participation of each actor
val actors_df = cast_df.cast.explode().valueCounts()
actors_df
cast | count |
---|---|
Anupam Kher | 42 |
Shah Rukh Khan | 35 |
Om Puri | 30 |
Naseeruddin Shah | 30 |
Takahiro Sakurai | 29 |
Akshay Kumar | 29 |
Yuki Kaji | 27 |
Amitabh Bachchan | 27 |
Boman Irani | 27 |
Paresh Rawal | 27 |
Kareena Kapoor | 25 |
Andrea Libman | 24 |
John Cleese | 24 |
Vincent Tong | 24 |
Tara Strong | 22 |
Ashleigh Ball | 22 |
Nawazuddin Siddiqui | 21 |
Ajay Devgn | 21 |
Daisuke Ono | 20 |
Salman Khan | 20 |
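To make the `explode().valueCounts()` step above more concrete, here is a minimal plain-Kotlin sketch of the same idea: flatten the per-title cast lists into one list of names, then count how often each name occurs. The sample data and variable names are hypothetical, and this is only an analogy for what the dataframe call does, not its implementation.

```kotlin
// Hypothetical cast lists, one per title (already split into names).
val casts: List<List<String>> = listOf(
    listOf("Actor A", "Actor B"),
    listOf("Actor A", "Actor C"),
    listOf("Actor A"),
    emptyList(),
)

// "explode": one entry per (title, actor) pair.
// "valueCounts": occurrences per actor, most frequent first.
val actorCounts: List<Pair<String, Int>> = casts
    .flatten()
    .groupingBy { it }
    .eachCount()
    .toList()
    .sortedByDescending { it.second }

println(actorCounts.first()) // (Actor A, 3)
```

Titles with an empty cast simply contribute nothing after flattening.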
actors_df.take(30).plot {
barsH {
y(cast) { scale = categorical() }
x(count)
fillColor(cast) {
scale = categoricalColorHue()
legend {
type = LegendType.None
}
}
}
layout.title = "Top 30 actors"
layout.size = 950 to 900
}
Anupam Kher is clearly in the lead with 42 titles. Now let's split the cast counts by whether the title is a movie or a TV show.
val actors = cast_df.pivot { type }.aggregate {
cast.explode().valueCounts()
}
actors
Movie | TV Show |
---|---|
DataFrame [23049 x 2] (showing only top 5 of 23049 rows) | DataFrame [13538 x 2] (showing only top 5 of 13538 rows) |
val p1 = actors.`TV Show`.take(30).plot {
barsH {
x(count)
y(cast)
fillColor(cast) {
scale = continuous(Color.hex("#263238")..Color.hex("#ECEFF1"))
legend { type = LegendType.None }
}
}
layout.title = "Top 30 actors in Shows"
}
val p2 = actors.Movie.take(30).plot {
barsH {
x(count)
y(cast)
fillColor(cast) {
scale = continuousColorGradientN(listOf(Color.hex("#006064"), Color.hex("#E0F7FA")))
legend { type = LegendType.None }
}
}
layout.title = "Top 30 actors in Movies"
}
plotBunch {
add(p1, 0, 0, 500, 700)
add(p2, 500, 0, 500, 700)
}
How about directors? Let's see the 10 directors with the most titles in the Netflix catalog.
val directors_df = df.valueCounts { director }
directors_df.take(10).plot {
barsH {
x(count)
y(director) { axis.name = "Name" }
fillColor(director) {
scale = categoricalColorHue()
legend { type = LegendType.None }
}
}
layout.title = "Top 10 directors"
layout.size = 850 to 500
}
These are the most prolific directors in the catalog.
This section focuses on analyzing how content is distributed across various countries. To do so, we will need to import libraries that work with geospatial data and maps, and then perform the necessary manipulations to render the maps.
%use lets-plot
%use lets-plot-gt(gt=30.1)
USE {
repository("https://repo.osgeo.org/repository/release")
dependencies {
implementation("org.geotools:gt-shapefile:30.1")
implementation("org.geotools:gt-cql:30.1")
}
}
import org.geotools.data.shapefile.ShapefileDataStoreFactory
import org.geotools.data.simple.SimpleFeatureCollection
import java.net.URL
val factory = ShapefileDataStoreFactory()
val worldFeatures: SimpleFeatureCollection = with("naturalearth_lowres") {
val url = "https://raw.githubusercontent.com/JetBrains/lets-plot-kotlin/master/docs/examples/shp/${this}/${this}.shp"
factory.createDataStore(URL(url)).featureSource.features
}
// Convert Feature Collection to SpatialDataset.
// Use 10 decimals to encode floating point numbers (this is the default).
val world = worldFeatures.toSpatialDataset(10)
val voidTheme = theme(
axisTitle = "blank",
axisLine = "blank",
axisTicks = "blank",
axisText = "blank",
)
val worldLimits = coordMap(ylim = -55 to 85)
Let's add another dataframe with country labels.
val countries = DataFrame.readCSV("country_codes.csv")
countries.head()
country | code | iso | lat | lon |
---|---|---|---|---|
United States | US | USA | 39.783730 | -100.445883 |
India | IN | IND | 22.351115 | 78.667743 |
United Kingdom | GB | GBR | 54.702355 | -3.276575 |
null | Unknown | Unknown | 51.146139 | 12.233285 |
Canada | CA | CAN | 61.066692 | -107.991707 |
// counting the number of titles by country and joining them with the country codes dataframe
val df_country = df.valueCounts { country }.join(countries)
df_country
country | count | code | iso | lat | lon |
---|---|---|---|---|---|
United States | 2549 | US | USA | 39.783730 | -100.445883 |
India | 923 | IN | IND | 22.351115 | 78.667743 |
United Kingdom | 396 | GB | GBR | 54.702355 | -3.276575 |
Japan | 225 | JP | JPN | 36.574844 | 139.239418 |
South Korea | 183 | KR | KOR | 36.638392 | 127.696119 |
Canada | 177 | CA | CAN | 61.066692 | -107.991707 |
Spain | 134 | ES | ESP | 39.326069 | -4.837979 |
France | 115 | FR | FRA | 46.603354 | 1.888334 |
Egypt | 101 | EG | EGY | 26.254049 | 29.267547 |
Mexico | 100 | MX | MEX | 22.500049 | -100.000038 |
Turkey | 100 | TR | TUR | 38.959759 | 34.924965 |
Australia | 82 | AU | AUS | -24.776109 | 134.755000 |
Taiwan | 78 | TW | TWN | 23.973937 | 120.982018 |
Brazil | 72 | BR | BRA | -10.333333 | -53.200000 |
Philippines | 71 | PH | PHL | 12.750349 | 122.731210 |
Nigeria | 70 | NG | NGA | 9.600036 | 7.999972 |
Indonesia | 70 | ID | IDN | -2.483383 | 117.890285 |
Germany | 61 | DE | DEU | 51.083420 | 10.423447 |
China | 57 | CN | CHN | 35.000074 | 104.999927 |
Thailand | 57 | TH | THA | 14.897192 | 100.832730 |
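The joined table lists only single countries, even though the raw `country` column also contains combined values such as "United States, France, Japan". That is consistent with the join keeping only rows whose `country` value also appears in the country codes table. The plain-Kotlin sketch below illustrates this, assuming the join matches on the shared `country` column and behaves like an inner join. The counts and coordinates here are illustrative, not taken from the dataset.

```kotlin
// Hypothetical lookup row, loosely mirroring the country codes table.
data class CountryInfo(val iso: String, val lat: Double, val lon: Double)

// Illustrative counts per raw `country` string.
val titleCounts = mapOf(
    "United States" to 5,
    "India" to 3,
    "United States, India" to 1, // co-production string, absent from the lookup
)

// Illustrative, rounded coordinates.
val codes = mapOf(
    "United States" to CountryInfo("USA", 39.78, -100.45),
    "India" to CountryInfo("IND", 22.35, 78.67),
)

// Inner join on the country name: keep only keys present on both sides.
val joined: List<Triple<String, Int, String>> = titleCounts.mapNotNull { (country, count) ->
    codes[country]?.let { info -> Triple(country, count, info.iso) }
}

println(joined) // [(United States, 5, USA), (India, 3, IND)]
```

If co-productions should count toward each participating country, the `country` column would need to be split and exploded first, much like `cast` earlier.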
ggplot() +
geomMap(
data = df_country.toMap(),
map = world,
mapJoin = "iso" to "iso_a3",
color = "white",
) { fill = "count" } +
scaleFillGradient(
low = "#FFF3E0",
high = "#E65100",
name = "Number of Titles",
) +
ggsize(width = 1000, height = 800) +
voidTheme +
worldLimits
The map clearly shows where most of the content on Netflix is produced. Let's take a closer look at the top countries.
df_country[0..9].sortByDesc { count }.plot {
bars {
x(country)
y(count)
fillColor = Color.hex("#00796B")
}
layout.title = "Top 10 Countries"
layout.size = 900 to 450
}
How long does the content usually run?
val df_dur = df
.split { duration }.by(" ").inward("duration_num", "duration_scale") // splitting duration by time and scale inward
.convert { "duration"["duration_num"] }.toInt() // converting by column path
.update { "duration"["duration_scale"] }.with { if (it == "Seasons") "Season" else it }
df_dur.head()
show_id | type | title | director | cast | country | date_added | release_year | rating | duration.duration_num | duration.duration_scale | listed_in | description |
---|---|---|---|---|---|---|---|---|---|---|---|---|
s7114 | Movie | To and From New York | Sorin Dan Mihalcescu | Barbara King, Shaana Diya, John Krisi... | United States | 2008-01-01 | 2006 | TV-MA | 81 | min | Dramas, Independent Movies, Thrillers | While covering a story in New York Ci... |
s1766 | TV Show | Dinner for Five | null | null | United States | 2008-02-04 | 2007 | TV-MA | 1 | Season | Stand-Up Comedy & Talk Shows | In each episode, four celebrities joi... |
s3249 | Movie | Just Another Love Story | Ole Bornedal | Anders W. Berthelsen, Rebecka Hemse, ... | Denmark | 2009-05-05 | 2007 | TV-MA | 104 | min | Dramas, International Movies | When he causes a car accident that le... |
s5766 | Movie | Splatter | Joe Dante | Corey Feldman, Tony Todd, Tara Leigh,... | United States | 2009-11-18 | 2009 | TV-MA | 29 | min | Horror Movies | After committing suicide, a washed-up... |
s3841 | Movie | Mad Ron's Prevues from Hell | Jim Monaco | Nick Pawlow, Jordu Schell, Jay Kushwa... | United States | 2010-11-01 | 1987 | NR | 84 | min | Cult Movies, Horror Movies | This collection cherry-picks trailers... |
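The split above turns strings like "93 min" or "4 Seasons" into a number and a unit, and normalizes "Seasons" to "Season". A minimal plain-Kotlin sketch of the same parsing step is shown below. `ParsedDuration` and `parseDuration` are hypothetical names for this example, and the sketch assumes every value has exactly the form "<number> <unit>".

```kotlin
// Hypothetical holder for the two parts of a duration string.
data class ParsedDuration(val num: Int, val scale: String)

// Assumes the input looks like "<number> <unit>", e.g. "93 min" or "4 Seasons".
// An input without a space would fail with IndexOutOfBoundsException.
fun parseDuration(raw: String): ParsedDuration {
    val (num, scale) = raw.trim().split(" ", limit = 2)
    // Normalize the plural so movies use "min" and all shows use "Season".
    return ParsedDuration(num.toInt(), if (scale == "Seasons") "Season" else scale)
}

println(parseDuration("93 min"))    // ParsedDuration(num=93, scale=min)
println(parseDuration("4 Seasons")) // ParsedDuration(num=4, scale=Season)
```

Keeping the unit next to the number matters here because minutes and seasons are not comparable.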
val durations = df_dur.pivot { type }.values { duration }
durations
Movie | TV Show |
---|---|
DataFrame [5377 x 2] (showing only top 5 of 5377 rows) | DataFrame [2400 x 2] (showing only top 5 of 2400 rows) |
val p1 = durations.Movie.plot {
histogram(duration_num, binsOption = BinsOption.byNumber(100)) {
y(Stat.density)
fillColor = Color.hex("#00BCD4")
}
statBin(duration_num, binsOption = BinsOption.byNumber(25)) {
line {
x(Stat.x) { axis.name = "minutes" }
y(Stat.density) { axis.name = "density" }
alpha = 1.0
width = 1.0
color = Color.hex("#d41900")
}
}
layout.title = "Distribution of movie durations in minutes"
}
val p2 = durations.`TV Show`.plot {
statBin(duration_num, binsOption = BinsOption.byNumber(15)) {
bars {
x(Stat.x)
y(Stat.count)
fillColor = Color.hex("#00BCD4")
}
}
}
plotBunch {
add(p1, 0, 0, 1000, 500)
add(p2, 0, 500, 1000, 500)
}
And, as tradition dictates, here are the longest movies and TV shows.
df_dur.xs("Movie") { type }
.sortByDesc { duration.duration_num }.head()
.select { title and country and date_added and release_year and duration.all() }
title | country | date_added | release_year | duration_num | duration_scale |
---|---|---|---|---|---|
Black Mirror: Bandersnatch | United States | 2018-12-28 | 2018 | 312 | min |
The School of Mischief | Egypt | 2020-05-21 | 1973 | 253 | min |
No Longer kids | Egypt | 2020-05-21 | 1979 | 237 | min |
Lock Your Girls In | null | 2020-05-21 | 1982 | 233 | min |
Raya and Sakina | null | 2020-05-21 | 1984 | 230 | min |
df_dur.xs("TV Show") { type }
.sortByDesc { duration.duration_num }.head()
.select { title and country and date_added and release_year and duration.all() }
title | country | date_added | release_year | duration_num | duration_scale |
---|---|---|---|---|---|
Grey's Anatomy | United States | 2020-05-09 | 2019 | 16 | Season |
NCIS | United States | 2018-07-01 | 2017 | 15 | Season |
Supernatural | United States, Canada | 2020-06-05 | 2019 | 15 | Season |
COMEDIANS of the world | United States | 2019-01-01 | 2019 | 13 | Season |
Criminal Minds | United States, Canada | 2017-06-30 | 2017 | 12 | Season |
And how long are movies and TV shows, on average, in the top content-producing countries?
val list_top_countries = df_country.country.take(10).toSet()
val df_cntr = df_dur
.filter { country in list_top_countries }
.pivot { type }.aggregate {
groupBy { country }.mean { duration.duration_num }
}
df_cntr
Movie | TV Show |
---|---|
DataFrame [10 x 2] (showing only top 5 of 10 rows) | DataFrame [10 x 2] (showing only top 5 of 10 rows) |
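The pivot-then-aggregate step above produces one small table per content type, each holding the mean `duration_num` per country. As a rough plain-Kotlin analogy, the sketch below groups by type first and then averages per country. The `Title` class and the sample rows are hypothetical and only meant to show the shape of the computation.

```kotlin
// Hypothetical row with just the columns needed for this aggregation.
data class Title(val type: String, val country: String, val durationNum: Int)

val titles = listOf(
    Title("Movie", "India", 120),
    Title("Movie", "India", 140),
    Title("Movie", "Japan", 100),
    Title("TV Show", "Japan", 1),
    Title("TV Show", "Japan", 3),
)

// Outer key: content type (the pivot). Inner key: country (the groupBy).
// Inner value: mean duration, in minutes for movies and seasons for TV shows.
val meanByTypeAndCountry: Map<String, Map<String, Double>> = titles
    .groupBy { it.type }
    .mapValues { (_, rows) ->
        rows.groupBy { it.country }
            .mapValues { (_, group) -> group.map { it.durationNum }.average() }
    }

println(meanByTypeAndCountry)
// {Movie={India=130.0, Japan=100.0}, TV Show={Japan=2.0}}
```

Because the two types use different units, the two inner maps are best plotted separately, as the notebook does next.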
val p1 = df_cntr.Movie.sortBy { duration_num }.plot {
bars {
x(country) { axis.name = "Name" }
y(duration_num) { axis.name = "Minute" }
fillColor(duration_num) {
scale = continuous(Color.hex("#ECEFF1")..Color.hex("#263238"))
legend.type = LegendType.None
}
}
layout.title = "Average movie duration in the top 10 countries"
}
val p2 = df_cntr.`TV Show`.sortBy { duration_num }.plot {
bars {
x(country) { axis.name = "Name" }
y(duration_num) { axis.name = "Season" }
fillColor(duration_num) {
scale = continuous(Color.hex("#E0F7FA")..Color.hex("#006064"))
legend.type = LegendType.None
}
}
layout.title = "Average number of TV show seasons in the top 10 countries"
}
plotBunch {
add(p1, 0, 0, 900, 550)
add(p2, 0, 550, 900, 550)
}
Finally, let's take a look at the rating column and find out which ratings are most commonly assigned to movies and shows.
val dfInstants = df.convert { date_added }.with { it.atStartOfDayIn(TimeZone.UTC) }
dfInstants.valueCounts(false) { rating }.sortBy("count").plot {
bars {
x(rating)
y("count")
fillColor(rating) {
scale = categoricalColorHue()
legend.type = LegendType.None
}
}
layout.title = "Rating of Titles"
layout.size = 950 to 500
}
dfInstants.valueCounts(sort = false) { rating and type }.plot {
bars {
x(rating)
y("count")
fillColor(type) { scale = categorical(listOf(Color.hex("#607D8B"), Color.hex("#00BCD4"))) }
position = Position.dodge()
}
layout.title = "Rating of Titles by type"
layout.size = 950 to 500
}