In this notebook, we explore Netflix movies and TV shows with Kotlin DataFrame and use the Kandy library for data visualization.
We use the latest available versions of the libraries; the following line magic is responsible for this:
%useLatestDescriptors
Importing dataframe
%use dataframe@kc25
Importing the visualization library
%use kandy@kc25
To get started, we need to read the data from a CSV file.
You can also drag-and-drop the file here!
val rawDf = DataFrame.read("netflix_titles.csv")
First, let's take a look at the data.
// taking a look at types and columns
rawDf.schema()
show_id: String
type: String
title: String
director: String?
cast: String?
country: String?
date_added: String?
release_year: Int
rating: String?
duration: String
listed_in: String
description: String
rawDf.size() // rowsCount x columnsCount
7787 x 12
rawDf.head() // return the first five rows
show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
---|---|---|---|---|---|---|---|---|---|---|---|
s1 | TV Show | 3% | null | João Miguel, Bianca Comparato, Michel... | Brazil | August 14, 2020 | 2020 | TV-MA | 4 Seasons | International TV Shows, TV Dramas, TV... | In a future where the elite inhabit a... |
s2 | Movie | 7:19 | Jorge Michel Grau | Demián Bichir, Héctor Bonilla, Oscar ... | Mexico | December 23, 2016 | 2016 | TV-MA | 93 min | Dramas, International Movies | After a devastating earthquake hits M... |
s3 | Movie | 23:59 | Gilbert Chan | Tedd Chan, Stella Chung, Henley Hii, ... | Singapore | December 20, 2018 | 2011 | R | 78 min | Horror Movies, International Movies | When an army recruit is found dead, h... |
s4 | Movie | 9 | Shane Acker | Elijah Wood, John C. Reilly, Jennifer... | United States | November 16, 2017 | 2009 | PG-13 | 80 min | Action & Adventure, Independent Movie... | In a postapocalyptic world, rag-doll ... |
s5 | Movie | 21 | Robert Luketic | Jim Sturgess, Kevin Spacey, Kate Bosw... | United States | January 1, 2020 | 2008 | PG-13 | 123 min | Dramas | A brilliant group of students become ... |
// Getting general statistics and info for each column
rawDf.describe()
name | type | count | unique | nulls | top | freq | mean | std | min | p25 | median | p75 | max |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
show_id | String | 7787 | 7787 | 0 | s1 | 1 | null | null | s1 | s2750 | s4502 | s6254 | s999 |
type | String | 7787 | 2 | 0 | Movie | 5377 | null | null | Movie | Movie | Movie | TV Show | TV Show |
title | String | 7787 | 7787 | 0 | 3% | 1 | null | null | #Alive | El club de los insomnes | Manglehorn | Stronger Than the World | 최강전사 미니특공대 : 영웅의 탄생 |
director | String? | 7787 | 4050 | 2389 | Raúl Campos, Jan Suter | 18 | null | null | A. L. Vijay | Eleonore Pourriat | Lance Bangs | Raúl Campos, Jan Suter | Şenol Sönmez |
cast | String? | 7787 | 6832 | 718 | David Attenborough | 18 | null | null | 'Najite Dede, Jude Chukwuka, Taiwo Ar... | Donnie Yen, Zhang Jin, Lynn Hung, Pat... | Kay Kay Menon, Ranvir Shorey, Manu Ri... | Rafael Ferro, Sol Moreno, Jonathan Sa... | Ṣọpẹ́ Dìrísù, Wunmi Mosaku, Matt Smit... |
country | String? | 7787 | 682 | 507 | United States | 2555 | null | null | Argentina | India | Thailand | United States | Zimbabwe |
date_added | String? | 7787 | 1566 | 10 | January 1, 2020 | 118 | null | null | April 15, 2018 | December 31, 2019 | July 6, 2019 | November 1, 2019 | September 9, 2020 |
release_year | Int | 7787 | 73 | 0 | 2018 | 1121 | 2013.932580 | 8.757395 | 1925 | 2013.000000 | 2017.000000 | 2018.000000 | 2021 |
rating | String? | 7787 | 15 | 7 | TV-MA | 2863 | null | null | G | TV-14 | TV-MA | TV-MA | UR |
duration | String | 7787 | 216 | 0 | 1 Season | 1608 | null | null | 1 Season | 103 min | 148 min | 79 min | 99 min |
listed_in | String | 7787 | 492 | 0 | Documentaries | 334 | null | null | Action & Adventure | Comedies, Dramas, Independent Movies | Documentaries, Sports Movies | International Movies, Thrillers | Thrillers |
description | String | 7787 | 7769 | 0 | Multiple women report their husbands ... | 3 | null | null | "Brooklyn Nine-Nine" star Chelsea Per... | An ad agency CEO is put under investi... | In a 1950s orphanage, a young girl re... | The isolated life of an extreme intro... | Zoe Walker leaves her quiet life behi... |
The data consists of Netflix TV shows and movies added to the platform up to January 2021. Each row describes one title and contains the following fields:
- `show_id`: unique show number
- `type`: *TV Show* or *Movie*
- `title`: the name of the TV show or movie
- `director`: director's name
- `cast`: cast list
- `country`: the country where the title was released
- `date_added`: when the title was added to Netflix
- `release_year`: the year the title was released
- `rating`: rating of the title
- `listed_in`: the lists/genres in which the title appears on Netflix
- `description`: title description
Before we get started, let's process the dataframe. The `date_added` column is of type `String`, so let's convert it to `LocalDate` for further convenience.
Kotlin DataFrame provides built-in type converters for major types. We will use the `String` -> `LocalDate` conversion and specify the date format pattern.
val df = rawDf.dropNulls { date_added } // remove rows where `date_added` is not specified
.convert { date_added }.toLocalDate("MMMM d, yyyy", java.util.Locale.ENGLISH) // convert date_added to LocalDate using date pattern
.sortBy { date_added } // and let's also sort by date for easy operation later
df
show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
---|---|---|---|---|---|---|---|---|---|---|---|
s7114 | Movie | To and From New York | Sorin Dan Mihalcescu | Barbara King, Shaana Diya, John Krisi... | United States | 2008-01-01 | 2006 | TV-MA | 81 min | Dramas, Independent Movies, Thrillers | While covering a story in New York Ci... |
s1766 | TV Show | Dinner for Five | null | null | United States | 2008-02-04 | 2007 | TV-MA | 1 Season | Stand-Up Comedy & Talk Shows | In each episode, four celebrities joi... |
s3249 | Movie | Just Another Love Story | Ole Bornedal | Anders W. Berthelsen, Rebecka Hemse, ... | Denmark | 2009-05-05 | 2007 | TV-MA | 104 min | Dramas, International Movies | When he causes a car accident that le... |
s5766 | Movie | Splatter | Joe Dante | Corey Feldman, Tony Todd, Tara Leigh,... | United States | 2009-11-18 | 2009 | TV-MA | 29 min | Horror Movies | After committing suicide, a washed-up... |
s3841 | Movie | Mad Ron's Prevues from Hell | Jim Monaco | Nick Pawlow, Jordu Schell, Jay Kushwa... | United States | 2010-11-01 | 1987 | NR | 84 min | Cult Movies, Horror Movies | This collection cherry-picks trailers... |
s2042 | Movie | Even the Rain | Icíar Bollaín | Luis Tosar, Gael García Bernal, Juan ... | Spain, Mexico, France | 2011-05-17 | 2010 | TV-MA | 103 min | Dramas, International Movies | While making a film about the incursi... |
s3222 | Movie | Joseph: King of Dreams | Rob LaDuca, Robert C. Ramirez | Ben Affleck, Mark Hamill, Richard Her... | United States | 2011-09-27 | 2000 | TV-PG | 75 min | Children & Family Movies, Dramas, Fai... | With his gift of dream interpretation... |
s233 | Movie | A Stoning in Fulham County | Larry Elikann | Ken Olin, Jill Eikenberry, Maureen Mu... | United States | 2011-10-01 | 1988 | TV-14 | 95 min | Dramas | After reckless teens kill an Amish ch... |
s309 | Movie | Adam: His Song Continues | Robert Markowitz | Daniel J. Travanti, JoBeth Williams, ... | United States | 2011-10-01 | 1986 | TV-MA | 96 min | Dramas | After their child was abducted and mu... |
s2623 | Movie | Hard Lessons | Eric Laneuville | Denzel Washington, Lynn Whitfield, Ri... | United States | 2011-10-01 | 1986 | TV-14 | 94 min | Dramas | This drama based on real-life events ... |
s2963 | Movie | In Defense of a Married Man | Joel Oliansky | Judith Light, Michael Ontkean, Jerry ... | United States | 2011-10-01 | 1990 | TV-14 | 94 min | Dramas | A lawyer's husband is having an affai... |
s5042 | Movie | Quiet Victory: The Charlie Wedemeyer ... | Roy Campanella II | Pam Dawber, Michael Nouri, Bess Meyer... | United States | 2011-10-01 | 1988 | TV-PG | 93 min | Dramas, Sports Movies | When high school football coach Charl... |
s5833 | Movie | Strange Voices | Arthur Allan Seidelman | Nancy McKeon, Valerie Harper, Stephen... | United States | 2011-10-01 | 1987 | TV-PG | 96 min | Dramas | When their college-age daughter sudde... |
s6846 | Movie | The Ryan White Story | John Herzfeld | Judith Light, Lukas Haas, Michael Bow... | United States | 2011-10-01 | 1989 | TV-PG | 94 min | Dramas | After contracting HIV from a tainted ... |
s7150 | Movie | Too Young the Hero | Buzz Kulik | Ricky Schroder, Jon DeVries, Debra Mo... | United States | 2011-10-01 | 1988 | TV-MA | 94 min | Dramas | Twelve-year-old Calvin manages to joi... |
s7231 | Movie | Triumph of the Heart | Richard Michaels | Mario Van Peebles, Susan Ruttan, Lane... | United States | 2011-10-01 | 1991 | TV-PG | 93 min | Dramas, Sports Movies | This drama tells the tale of Ricky Be... |
s7363 | Movie | Unspeakable Acts | Linda Otto | Jill Clayburgh, Brad Davis, Sam Behrens | United States | 2011-10-01 | 1990 | TV-14 | 95 min | Dramas | Laurie and Joseph are doctors who int... |
s7415 | Movie | Victim of Beauty | Roger Young | William Devane, Jeri Ryan, Michele Ab... | United States | 2011-10-01 | 1991 | TV-14 | 93 min | Dramas, Thrillers | A beauty pageant winner is stalked by... |
s819 | Movie | Being Elmo: A Puppeteer's Journey | Constance Marks | Kevin Clash, Whoopi Goldberg | United States | 2012-02-21 | 2011 | PG | 76 min | Documentaries | Whoopi Goldberg narrates Elmo creator... |
s1230 | Movie | Casa de mi Padre | Matt Piedmont | Will Ferrell, Gael García Bernal, Die... | United States, Mexico | 2012-11-14 | 2012 | R | 84 min | Comedies | Will Ferrell stars as a Spanish-speak... |
// let's check the type of the converted column
df.date_added.type()
kotlinx.datetime.LocalDate
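To make the date pattern concrete, here is a minimal sketch of the same parsing step done with plain `java.time` instead of the DataFrame converter. The helper name `parseDateAdded` and the `trim()` call are assumptions made for illustration; they are not part of the notebook or of the DataFrame API.

```kotlin
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale

// "MMMM" = full month name, "d" = day of month without padding, "yyyy" = four-digit year.
// Locale.ENGLISH makes the month names parse regardless of the machine's default locale.
val addedFormat: DateTimeFormatter = DateTimeFormatter.ofPattern("MMMM d, yyyy", Locale.ENGLISH)

// Hypothetical helper: returns null for a missing value instead of dropping the row.
fun parseDateAdded(raw: String?): LocalDate? =
    raw?.trim()?.let { LocalDate.parse(it, addedFormat) }

println(parseDateAdded("August 14, 2020"))  // 2020-08-14
println(parseDateAdded("January 1, 2020"))  // 2020-01-01
println(parseDateAdded(null))               // null
```

Passing an explicit locale matters here: without it, a machine with a non-English default locale would fail to recognise month names such as "August".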
Now let's see which there are more of: TV shows or movies.
rawDf
    .valueCounts(sort = false) { type }
    .plot {
        bars {
            x(type)
            y("count")
            fillColor(type) {
                scale = categorical(range = listOf(Color.hex("#00BCD4"), Color.hex("#009688")))
            }
        }
        layout {
            title = "Count of TV Shows and Movies"
            size = 900 to 550
        }
    }
You can see that there are about twice as many movies on Netflix as TV shows. But has it always been this way? To find out, let's check whether there was a year in which more TV shows than movies were added, and then look at the cumulative totals for movies and TV shows.
val df_date_count = df
    .convert { date_added }.with { it.year } // converting `date_added` to extract `year`
    .groupBy { date_added } // grouping by `year` stored in `date_added`
    .aggregate {
        count { type == "TV Show" } into "tvshows" // counting TV Shows into column `tvshows`
        count { type == "Movie" } into "movies" // counting Movies into column `movies`
    }
df_date_count
date_added | tvshows | movies |
---|---|---|
2008 | 1 | 1 |
2009 | 0 | 2 |
2010 | 0 | 1 |
2011 | 0 | 13 |
2012 | 0 | 3 |
2013 | 5 | 6 |
2014 | 6 | 19 |
2015 | 30 | 58 |
2016 | 185 | 258 |
2017 | 361 | 864 |
2018 | 430 | 1255 |
2019 | 656 | 1497 |
2020 | 697 | 1312 |
2021 | 29 | 88 |
Let's see if we can simplify this expression using more advanced operations. First, we can combine the conversion of `date_added` into a year and the grouping by using the `map` function within the column selector.
val df_date_count = df
    .groupBy { date_added.map { it.year } } // grouping by year added extracted from `date_added`
    .aggregate {
        count { type == "TV Show" } into "tvshows" // counting TV Shows into column `tvshows`
        count { type == "Movie" } into "movies" // counting Movies into column `movies`
    }
df_date_count
date_added | tvshows | movies |
---|---|---|
2008 | 1 | 1 |
2009 | 0 | 2 |
2010 | 0 | 1 |
2011 | 0 | 13 |
2012 | 0 | 3 |
2013 | 5 | 6 |
2014 | 6 | 19 |
2015 | 30 | 58 |
2016 | 185 | 258 |
2017 | 361 | 864 |
2018 | 430 | 1255 |
2019 | 656 | 1497 |
2020 | 697 | 1312 |
2021 | 29 | 88 |
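The grouping and conditional counting used above can be mirrored with plain Kotlin collections. This is only a sketch on a small hand-made sample, not the real dataset; the `Title` and `YearCounts` classes and the `countByYear` function are hypothetical names introduced for illustration.

```kotlin
// Hypothetical minimal record: only the two fields the aggregation needs.
data class Title(val type: String, val yearAdded: Int)

// One result row per year with two conditional counts.
data class YearCounts(val year: Int, val tvshows: Int, val movies: Int)

// A small hand-made sample, not the real dataset.
val sample = listOf(
    Title("Movie", 2008), Title("TV Show", 2008),
    Title("Movie", 2009), Title("Movie", 2009),
    Title("TV Show", 2013), Title("Movie", 2013),
)

// Group by year, then count the rows matching each predicate inside every group.
fun countByYear(titles: List<Title>): List<YearCounts> =
    titles.groupBy { it.yearAdded }
        .map { (year, group) ->
            YearCounts(
                year = year,
                tvshows = group.count { it.type == "TV Show" },
                movies = group.count { it.type == "Movie" },
            )
        }
        .sortedBy { it.year }

countByYear(sample).forEach(::println)
// YearCounts(year=2008, tvshows=1, movies=1)
// YearCounts(year=2009, tvshows=0, movies=2)
// YearCounts(year=2013, tvshows=1, movies=1)
```

Each predicate needs its own `count` call and its own output field, which is the repetition that a pivot removes.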
Our `groupBy` aggregation adds new columns for "TV Show" and "Movie". This is exactly what `pivot` does: it generates new columns for every unique value in `type`.
df.groupBy { date_added.map { it.year } }
.pivot { type }
date_added | type: Movie | type: TV Show |
---|---|---|
2008 | DataFrame [1 x 12] | DataFrame [1 x 12] |
2009 | DataFrame [2 x 12] | DataFrame [0 x 0] |
2010 | DataFrame [1 x 12] | DataFrame [0 x 0] |
2011 | DataFrame [13 x 12] | DataFrame [0 x 0] |
2012 | DataFrame [3 x 12] | DataFrame [0 x 0] |
2013 | DataFrame [6 x 12] | DataFrame [5 x 12] |
2014 | DataFrame [19 x 12] | DataFrame [6 x 12] |
2015 | DataFrame [58 x 12] | DataFrame [30 x 12] |
2016 | DataFrame [258 x 12] | DataFrame [185 x 12] |
2017 | DataFrame [864 x 12] | DataFrame [361 x 12] |
2018 | DataFrame [1255 x 12] | DataFrame [430 x 12] |
2019 | DataFrame [1497 x 12] | DataFrame [656 x 12] |
2020 | DataFrame [1312 x 12] | DataFrame [697 x 12] |
2021 | DataFrame [88 x 12] | DataFrame [29 x 12] |
After the `type` column is pivoted, we call `aggregate` to specify the metrics to be calculated for every data group.
df.groupBy { date_added.map { it.year } }
.pivot { type }.aggregate { count() }
date_added | type: Movie | type: TV Show |
---|---|---|
2008 | 1 | 1 |
2009 | 2 | null |
2010 | 1 | null |
2011 | 13 | null |
2012 | 3 | null |
2013 | 6 | 5 |
2014 | 19 | 6 |
2015 | 58 | 30 |
2016 | 258 | 185 |
2017 | 864 | 361 |
2018 | 1255 | 430 |
2019 | 1497 | 656 |
2020 | 1312 | 697 |
2021 | 88 | 29 |
Simple statistics can also be computed without `aggregate`:
df.groupBy { date_added.map { it.year } }
.pivot { type }.count()
date_added | type: Movie | type: TV Show |
---|---|---|
2008 | 1 | 1 |
2009 | 2 | 0 |
2010 | 1 | 0 |
2011 | 13 | 0 |
2012 | 3 | 0 |
2013 | 6 | 5 |
2014 | 19 | 6 |
2015 | 58 | 30 |
2016 | 258 | 185 |
2017 | 864 | 361 |
2018 | 1255 | 430 |
2019 | 1497 | 656 |
2020 | 1312 | 697 |
2021 | 88 | 29 |
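The two tables above differ in one detail: with `aggregate { count() }` a year without TV shows shows `null`, while `count()` shows `0`. The sketch below illustrates the same distinction with plain Kotlin maps. It is an analogy only and makes no claim about how DataFrame implements pivoting; the `Added` class and the sample rows are hypothetical.

```kotlin
// Hypothetical record: a title's type and the year it was added.
data class Added(val type: String, val year: Int)

val added = listOf(
    Added("Movie", 2008), Added("TV Show", 2008),
    Added("Movie", 2009), Added("Movie", 2009),
)

// year -> (type -> count); a type that never occurs in a year has no entry at all.
val pivoted: Map<Int, Map<String, Int>> =
    added.groupBy { it.year }
        .mapValues { (_, rows) -> rows.groupingBy { it.type }.eachCount() }

// Looking up an absent combination yields null ...
val tvShows2009: Int? = pivoted[2009]?.get("TV Show")
// ... unless a default is supplied for the empty group.
val tvShows2009OrZero: Int = pivoted[2009]?.get("TV Show") ?: 0

println(tvShows2009)        // null
println(tvShows2009OrZero)  // 0
```

For counts, zero is usually the more convenient representation because later arithmetic, such as a cumulative sum, does not have to handle missing values.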
For the `count` statistic, there is an even shorter API: `pivotCounts`.
Here is the final version:
val df_date_count = df
.groupBy { date_added.map { it.year } }.pivotCounts { type }
df_date_count
date_added | type: Movie | type: TV Show |
---|---|---|
2008 | 1 | 1 |
2009 | 2 | 0 |
2010 | 1 | 0 |
2011 | 13 | 0 |
2012 | 3 | 0 |
2013 | 6 | 5 |
2014 | 19 | 6 |
2015 | 58 | 30 |
2016 | 258 | 185 |
2017 | 864 | 361 |
2018 | 1255 | 430 |
2019 | 1497 | 656 |
2020 | 1312 | 697 |
2021 | 88 | 29 |
Let's plot the results.
df_date_count.plot {
    x(date_added) { axis.name = "year" }
    area {
        y(type.`TV Show`) { axis.name = "count" }
        fillColor = Color.hex("#BF360C")
        borderLine.color = Color.hex("#BF360C")
        alpha = .5
    }
    area {
        y(type.Movie)
        fillColor = Color.hex("#01579B")
        borderLine.color = Color.hex("#01579B")
        alpha = .5
    }
    layout {
        title = "Number of titles by year"
        size = 800 to 500
        style {
            panel {
                background {
                    fillColor = Color.hex("#ECEFF1")
                    borderLineColor = Color.hex("#ECEFF1")
                }
                grid.lineGlobal { blank = true }
            }
        }
    }
}
You can see that more movies than TV shows were added in every year except 2008, when one of each was added.
The cumulative number of movies is therefore never smaller than that of TV shows, but let's plot it anyway.
val df_cumsum_titles = df_date_count
    .sortBy { date_added } // sorting by date_added
    .cumSum { type.colsOf<Int>() } // compute the cumulative sum for the `TV Show` and `Movie` columns nested under `type`
df_cumsum_titles
date_added | type: Movie | type: TV Show |
---|---|---|
2008 | 1 | 1 |
2009 | 3 | 1 |
2010 | 4 | 1 |
2011 | 17 | 1 |
2012 | 20 | 1 |
2013 | 26 | 6 |
2014 | 45 | 12 |
2015 | 103 | 42 |
2016 | 361 | 227 |
2017 | 1225 | 588 |
2018 | 2480 | 1018 |
2019 | 3977 | 1674 |
2020 | 5289 | 2371 |
2021 | 5377 | 2400 |
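As a cross-check of the cumulative sum, the same running total can be computed with the Kotlin standard library. This is a small sketch; the `cumulativeSum` function name is an assumption for illustration, and the input list is the per-year Movie counts copied from the `pivotCounts` result.

```kotlin
// Movies added per year, 2008..2021, copied from the pivotCounts table.
val moviesPerYear = listOf(1, 2, 1, 13, 3, 6, 19, 58, 258, 864, 1255, 1497, 1312, 88)

// runningReduce keeps every intermediate total, which is exactly a cumulative sum.
fun cumulativeSum(values: List<Int>): List<Int> = values.runningReduce { acc, v -> acc + v }

println(cumulativeSum(moviesPerYear))
// [1, 3, 4, 17, 20, 26, 45, 103, 361, 1225, 2480, 3977, 5289, 5377]
```

The last element, 5377, equals the total number of movies reported by `describe()`, which confirms that no movie rows were lost when rows without `date_added` were dropped.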
df_cumsum_titles.plot {
    x(date_added) { axis.name = "year" }
    area {
        y(type.`TV Show`) { axis.name = "cumulative count" }
        fillColor = Color.hex("#BF360C")
        borderLine.color = Color.hex("#BF360C")
        alpha = .5
    }
    area {
        y(type.Movie)
        fillColor = Color.hex("#01579B")
        borderLine.color = Color.hex("#01579B")
        alpha = .5
    }
    layout {
        title = "Cumulative count of titles by year"
        size = 800 to 500
        style {
            panel {
                background {
                    fillColor = Color.hex("#ECEFF1")
                    borderLineColor = Color.hex("#ECEFF1")
                }
                grid.lineGlobal { blank = true }
            }
        }
    }
}
Let's take a look at the distribution of how long titles have been on the platform. To do this, we find the date on which the most recent title was added and, for every title, calculate the time between the date it was added and that most recent date.
import kotlinx.datetime.*
val maxDate = df.date_added.max()
val df_days = df.add {
    "days_on_platform" from { date_added.daysUntil(maxDate) } // adding column for number of days on the platform
    "months_on_platform" from { date_added.monthsUntil(maxDate) } // adding column for number of months on the platform
    "years_on_platform" from { date_added.yearsUntil(maxDate) } // adding column for number of years on the platform
}
val p1 = df_days.select { type and days_on_platform }.plot {
    histogram(days_on_platform, binsOption = BinsOption.byNumber(30)) {
        y(Stat.density)
        fillColor = Color.hex("#ef0b0b")
        borderLine.color = Color.hex("#ECEFF1")
    }
    statBin(days_on_platform, binsOption = BinsOption.byNumber(30)) {
        area {
            x(Stat.x)
            y(Stat.density)
            alpha = .5
            fillColor = Color.hex("#0befef")
        }
    }
    layout {
        xAxisLabel = "days"
        title = "Age distribution (in days) on Netflix"
    }
}
val p2 = df_days.select { type and days_on_platform }.plot {
    boxplot(x = type, y = days_on_platform) {
        boxes {
            fillColor(Stat.x) {
                scale = categorical(range = listOf(Color.hex("#792020"), Color.hex("#207979")))
            }
        }
    }
    layout {
        yAxisLabel = "days"
        title = "Boxplot for age (in days) by type"
    }
}
plotBunch {
    add(p1, 0, 0, 500, 450)
    add(p2, 500, 0, 500, 450)
}
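To get a feel for the three `..._on_platform` columns, here is a sketch that computes the same kind of differences for a single pair of dates. It uses `java.time` as a stand-in for `kotlinx.datetime`, on the assumption that both count only whole elapsed units. The two dates are the earliest and the latest `date_added` values that appear in this notebook's tables.

```kotlin
import java.time.LocalDate
import java.time.temporal.ChronoUnit

val firstAdded = LocalDate.of(2008, 1, 1)   // earliest date_added in the sorted dataframe
val lastAdded = LocalDate.of(2021, 1, 16)   // latest date_added, i.e. the value of maxDate

// Whole units only: a partial month or year is not counted.
val days = ChronoUnit.DAYS.between(firstAdded, lastAdded)
val months = ChronoUnit.MONTHS.between(firstAdded, lastAdded)
val years = ChronoUnit.YEARS.between(firstAdded, lastAdded)

println("$days days, $months months, $years years")  // 4764 days, 156 months, 13 years
```

Because the year and month values are truncated, many different day counts map to the same number of years, which is why the yearly view is better suited to a bar chart than to a histogram.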
The age distribution of titles on the platform is similar for movies and TV shows. However, the second graph shows that the movies include more titles that have been on the platform for a long time. Let's take a closer look by plotting the number of years on the platform for movies and TV shows.
df_days.valueCounts(sort = false) { type and years_on_platform }.plot {
    bars {
        x(years_on_platform) { axis.name = "years" }
        y("count")
        fillColor(type) {
            scale = categorical(range = listOf(Color.hex("#bc3076"), Color.hex("#30bc76")))
        }
        position = Position.dodge()
    }
    layout {
        title = "Years of Movies and TV Shows on Netflix"
        size = 900 to 500
    }
}
As you can see, movies have usually been on the platform longer than TV shows. Next, you might ask yourself: how quickly were titles added to Netflix after their release? Finding the answer is quite simple.
val df_years = df
    // adding a new column with the difference between the year of addition and the year of release
    .add("years_off_platform") {
        date_added.year - release_year
    }
    // dropping negative and zero values
    .filter { "years_off_platform"<Int>() > 0 }
df_years
show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description | years_off_platform |
---|---|---|---|---|---|---|---|---|---|---|---|---|
s7114 | Movie | To and From New York | Sorin Dan Mihalcescu | Barbara King, Shaana Diya, John Krisi... | United States | 2008-01-01 | 2006 | TV-MA | 81 min | Dramas, Independent Movies, Thrillers | While covering a story in New York Ci... | 2 |
s1766 | TV Show | Dinner for Five | null | null | United States | 2008-02-04 | 2007 | TV-MA | 1 Season | Stand-Up Comedy & Talk Shows | In each episode, four celebrities joi... | 1 |
s3249 | Movie | Just Another Love Story | Ole Bornedal | Anders W. Berthelsen, Rebecka Hemse, ... | Denmark | 2009-05-05 | 2007 | TV-MA | 104 min | Dramas, International Movies | When he causes a car accident that le... | 2 |
s3841 | Movie | Mad Ron's Prevues from Hell | Jim Monaco | Nick Pawlow, Jordu Schell, Jay Kushwa... | United States | 2010-11-01 | 1987 | NR | 84 min | Cult Movies, Horror Movies | This collection cherry-picks trailers... | 23 |
s2042 | Movie | Even the Rain | Icíar Bollaín | Luis Tosar, Gael García Bernal, Juan ... | Spain, Mexico, France | 2011-05-17 | 2010 | TV-MA | 103 min | Dramas, International Movies | While making a film about the incursi... | 1 |
s3222 | Movie | Joseph: King of Dreams | Rob LaDuca, Robert C. Ramirez | Ben Affleck, Mark Hamill, Richard Her... | United States | 2011-09-27 | 2000 | TV-PG | 75 min | Children & Family Movies, Dramas, Fai... | With his gift of dream interpretation... | 11 |
s233 | Movie | A Stoning in Fulham County | Larry Elikann | Ken Olin, Jill Eikenberry, Maureen Mu... | United States | 2011-10-01 | 1988 | TV-14 | 95 min | Dramas | After reckless teens kill an Amish ch... | 23 |
s309 | Movie | Adam: His Song Continues | Robert Markowitz | Daniel J. Travanti, JoBeth Williams, ... | United States | 2011-10-01 | 1986 | TV-MA | 96 min | Dramas | After their child was abducted and mu... | 25 |
s2623 | Movie | Hard Lessons | Eric Laneuville | Denzel Washington, Lynn Whitfield, Ri... | United States | 2011-10-01 | 1986 | TV-14 | 94 min | Dramas | This drama based on real-life events ... | 25 |
s2963 | Movie | In Defense of a Married Man | Joel Oliansky | Judith Light, Michael Ontkean, Jerry ... | United States | 2011-10-01 | 1990 | TV-14 | 94 min | Dramas | A lawyer's husband is having an affai... | 21 |
s5042 | Movie | Quiet Victory: The Charlie Wedemeyer ... | Roy Campanella II | Pam Dawber, Michael Nouri, Bess Meyer... | United States | 2011-10-01 | 1988 | TV-PG | 93 min | Dramas, Sports Movies | When high school football coach Charl... | 23 |
s5833 | Movie | Strange Voices | Arthur Allan Seidelman | Nancy McKeon, Valerie Harper, Stephen... | United States | 2011-10-01 | 1987 | TV-PG | 96 min | Dramas | When their college-age daughter sudde... | 24 |
s6846 | Movie | The Ryan White Story | John Herzfeld | Judith Light, Lukas Haas, Michael Bow... | United States | 2011-10-01 | 1989 | TV-PG | 94 min | Dramas | After contracting HIV from a tainted ... | 22 |
s7150 | Movie | Too Young the Hero | Buzz Kulik | Ricky Schroder, Jon DeVries, Debra Mo... | United States | 2011-10-01 | 1988 | TV-MA | 94 min | Dramas | Twelve-year-old Calvin manages to joi... | 23 |
s7231 | Movie | Triumph of the Heart | Richard Michaels | Mario Van Peebles, Susan Ruttan, Lane... | United States | 2011-10-01 | 1991 | TV-PG | 93 min | Dramas, Sports Movies | This drama tells the tale of Ricky Be... | 20 |
s7363 | Movie | Unspeakable Acts | Linda Otto | Jill Clayburgh, Brad Davis, Sam Behrens | United States | 2011-10-01 | 1990 | TV-14 | 95 min | Dramas | Laurie and Joseph are doctors who int... | 21 |
s7415 | Movie | Victim of Beauty | Roger Young | William Devane, Jeri Ryan, Michele Ab... | United States | 2011-10-01 | 1991 | TV-14 | 93 min | Dramas, Thrillers | A beauty pageant winner is stalked by... | 20 |
s819 | Movie | Being Elmo: A Puppeteer's Journey | Constance Marks | Kevin Clash, Whoopi Goldberg | United States | 2012-02-21 | 2011 | PG | 76 min | Documentaries | Whoopi Goldberg narrates Elmo creator... | 1 |
s3467 | Movie | Kung Fu Panda: Holiday | Tim Johnson | Jack Black, Angelina Jolie, Dustin Ho... | United States | 2012-12-01 | 2010 | TV-PG | 26 min | Children & Family Movies, Comedies | As preparations for the Winter Feast ... | 2 |
s6057 | TV Show | The 4400 | null | Joel Gretsch, Jacqueline McKenzie, Pa... | United States, United Kingdom | 2013-09-01 | 2007 | TV-14 | 4 Seasons | TV Dramas, TV Mysteries, TV Sci-Fi & ... | 4400 people who vanished over the cou... | 6 |
We dropped the negative values because titles are sometimes added to the platform while they are still in production. We also dropped the zero values, as they are of no interest.
df_years.valueCounts(false) { years_off_platform }.plot {
    x(years_off_platform) { axis.name = "years" }
    points {
        y("count")
        size = 7.5
        color(years_off_platform) {
            scale = continuous(range = Color.hex("#97a6d9")..Color.hex("#00256e"))
        }
    }
    layout {
        title = "How long does it take for a title to be added to Netflix?"
        size = 1000 to 500
    }
}
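The `years_off_platform` calculation and its filter can be checked by hand on a few rows. The sketch below uses three TV shows whose dates appear in this notebook's tables; the `Release` class and the `yearsOffPlatform` function are hypothetical names used only for illustration.

```kotlin
// Hypothetical record holding just the two years needed for the difference.
data class Release(val title: String, val yearAdded: Int, val releaseYear: Int)

val releases = listOf(
    Release("Jack Taylor", 2013, 2016),   // added before its listed release year: negative gap
    Release("Breaking Bad", 2013, 2013),  // added in its release year: zero gap
    Release("The 4400", 2013, 2007),      // added six years after release
)

fun yearsOffPlatform(r: Release): Int = r.yearAdded - r.releaseYear

// Keep only strictly positive gaps, as the filter in the notebook does.
val delayed: List<Pair<String, Int>> = releases
    .map { it.title to yearsOffPlatform(it) }
    .filter { (_, gap) -> gap > 0 }

println(delayed)  // [(The 4400, 6)]
```

Only "The 4400" survives the filter, matching its value of 6 in the `years_off_platform` column.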
Now let's build informal top-five charts of the oldest and newest movies and TV shows on the platform.
// Top 5 oldest movies
df_days
.filter { type == "Movie" } // filtering by type
.sortByDesc { days_on_platform } // sorting by number of days on Netflix
.select { cols(type, title, country, date_added, release_year, duration) } // selecting required columns
.head() // taking first five rows
type | title | country | date_added | release_year | duration |
---|---|---|---|---|---|
Movie | To and From New York | United States | 2008-01-01 | 2006 | 81 min |
Movie | Just Another Love Story | Denmark | 2009-05-05 | 2007 | 104 min |
Movie | Splatter | United States | 2009-11-18 | 2009 | 29 min |
Movie | Mad Ron's Prevues from Hell | United States | 2010-11-01 | 1987 | 84 min |
Movie | Even the Rain | Spain, Mexico, France | 2011-05-17 | 2010 | 103 min |
// Top 5 newest movies
df_days
.filter { type == "Movie" }
.sortBy { days_on_platform }
.select { cols(type, title, country, date_added, release_year, duration) }
.head()
type | title | country | date_added | release_year | duration |
---|---|---|---|---|---|
Movie | A Monster Calls | United Kingdom, Spain, United States | 2021-01-16 | 2016 | 108 min |
Movie | Death of Me | United States, Thailand | 2021-01-16 | 2020 | 94 min |
Movie | Radium Girls | United States | 2021-01-16 | 2018 | 103 min |
Movie | Double Dad | Brazil | 2021-01-15 | 2020 | 105 min |
Movie | Hook | United States | 2021-01-15 | 1991 | 142 min |
// Top 5 oldest shows
df_days
.filter { type == "TV Show" }
.sortByDesc { days_on_platform }
.select { cols(type, title, country, date_added, release_year, duration) }
.head()
type | title | country | date_added | release_year | duration |
---|---|---|---|---|---|
TV Show | Dinner for Five | United States | 2008-02-04 | 2007 | 1 Season |
TV Show | Jack Taylor | United States, Ireland | 2013-03-31 | 2016 | 1 Season |
TV Show | Breaking Bad | United States | 2013-08-02 | 2013 | 5 Seasons |
TV Show | The 4400 | United States, United Kingdom | 2013-09-01 | 2007 | 4 Seasons |
TV Show | Gossip Girl | United States | 2013-10-08 | 2012 | 6 Seasons |
// Top 5 newest shows
df_days
.filter { type == "TV Show" }
.sortBy { days_on_platform }
.select { cols(type, title, country, date_added, release_year, duration) }
.head()
type | title | country | date_added | release_year | duration |
---|---|---|---|---|---|
TV Show | Bling Empire | null | 2021-01-15 | 2021 | 1 Season |
TV Show | Carmen Sandiego | United States | 2021-01-15 | 2021 | 4 Seasons |
TV Show | Disenchantment | United States | 2021-01-15 | 2021 | 3 Seasons |
TV Show | Henry Danger | United States | 2021-01-15 | 2016 | 3 Seasons |
TV Show | Kuroko's Basketball | Japan | 2021-01-15 | 2013 | 1 Season |
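All four charts follow the same filter, sort, select, and take-first-rows pattern. The sketch below shows that pattern with plain Kotlin collections on four rows copied from the "oldest shows" table. The `Entry` class and the `oldest` and `newest` functions are hypothetical names, and sorting by `date_added` directly is used here in place of `days_on_platform`, which orders the rows the opposite way.

```kotlin
import java.time.LocalDate

// Hypothetical record with only the fields needed for ranking.
data class Entry(val title: String, val dateAdded: LocalDate)

// Rows copied from the "oldest shows" table, deliberately out of order.
val shows = listOf(
    Entry("Breaking Bad", LocalDate.of(2013, 8, 2)),
    Entry("Dinner for Five", LocalDate.of(2008, 2, 4)),
    Entry("Gossip Girl", LocalDate.of(2013, 10, 8)),
    Entry("Jack Taylor", LocalDate.of(2013, 3, 31)),
)

// Longest on the platform = earliest date_added, so sort ascending and keep the first n.
fun oldest(entries: List<Entry>, n: Int): List<String> =
    entries.sortedBy { it.dateAdded }.take(n).map { it.title }

// Most recently added = latest date_added, so sort descending instead.
fun newest(entries: List<Entry>, n: Int): List<String> =
    entries.sortedByDescending { it.dateAdded }.take(n).map { it.title }

println(oldest(shows, 2))  // [Dinner for Five, Jack Taylor]
println(newest(shows, 2))  // [Gossip Girl, Breaking Bad]
```

When several titles share the same date, as many do in the real data, which of them make the top five depends on their original order, because the sort is stable.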
You might also be interested in which months titles are added most often.
val df_split_date = df
// splitting dates into four columns
.split { date_added }.by { listOf(it, it.dayOfWeek, it.month, it.year) }
.into("date", "day", "month", "year")
.sortBy("month") // sorting by month
df_split_date
show_id | type | title | director | cast | country | date | day | month | year | release_year | rating | duration | listed_in | description |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
s7114 | Movie | To and From New York | Sorin Dan Mihalcescu | Barbara King, Shaana Diya, John Krisi... | United States | 2008-01-01 | TUESDAY | JANUARY | 2008 | 2006 | TV-MA | 81 min | Dramas, Independent Movies, Thrillers | While covering a story in New York Ci... |
s6899 | Movie | The Square | Jehane Noujaim | Ahmed Hassan, Khalid Abdalla, Magdy A... | United Kingdom, Egypt, United States | 2014-01-17 | FRIDAY | JANUARY | 2014 | 2013 | TV-MA | 105 min | Documentaries, International Movies | This Emmy-winning, street-level view ... |
s4154 | Movie | Mitt | Greg Whiteley | Mitt Romney | United States | 2014-01-24 | FRIDAY | JANUARY | 2014 | 2014 | TV-PG | 93 min | Documentaries | The real Mitt Romney is revealed in t... |
s2947 | Movie | Iliza Shlesinger: Freezing Hot | null | Iliza Shlesinger | null | 2015-01-23 | FRIDAY | JANUARY | 2015 | 2015 | TV-MA | 72 min | Stand-Up Comedy | Smart and brazen comedian Iliza Shles... |
s900 | TV Show | Big Bad Beetleborgs | null | Wesley Barker, Herbie Baez, Elisabeth... | United States, France, Japan | 2016-01-01 | FRIDAY | JANUARY | 2016 | 1997 | TV-G | 2 Seasons | Kids' TV, TV Comedies | When three teens free a spirit that o... |
s2843 | Movie | How to Change the World | Jerry Rothwell | null | Canada, United Kingdom, Netherlands | 2016-01-01 | FRIDAY | JANUARY | 2016 | 2015 | NR | 110 min | Documentaries, International Movies | In the 1970s, a group of activists wh... |
s4088 | TV Show | Mighty Morphin Alien Rangers | null | Julia Jordan, Matthew Sakimoto, Sicil... | null | 2016-01-01 | FRIDAY | JANUARY | 2016 | 1996 | TV-Y7 | 1 Season | Kids' TV | Visitors arrive from space to help Re... |
s4089 | TV Show | Mighty Morphin Power Rangers | null | Austin St. John, Thuy Trang, Walter J... | United States, Japan | 2016-01-01 | FRIDAY | JANUARY | 2016 | 2010 | TV-Y7 | 4 Seasons | Kids' TV | Five average teens are chosen by an i... |
s4480 | TV Show | Ninja Turtles: The Next Mutation | null | Jarred Blancard, Mitchell A. Lee Yuen... | Canada, United States | 2016-01-01 | FRIDAY | JANUARY | 2016 | 1997 | TV-G | 1 Season | Kids' TV, TV Comedies | Everyone's favorite teenage mutants a... |
s4931 | TV Show | Power Rangers Dino Thunder | null | James Napier, Kevin Duhaney, Emma Lah... | United States | 2016-01-01 | FRIDAY | JANUARY | 2016 | 2004 | TV-Y7 | 1 Season | Kids' TV | Dr. Tommy Oliver returns when his stu... |
s4932 | TV Show | Power Rangers in Space | null | Tracy Lynn Cruz, Patricia Ja Lee, Chr... | United States, France, Japan | 2016-01-01 | FRIDAY | JANUARY | 2016 | 1998 | TV-Y7 | 1 Season | Kids' TV | With the Power Chamber destroyed, for... |
s4933 | TV Show | Power Rangers Jungle Fury | null | Jason Smith, Aljin Abella, Anna Hutch... | United States | 2016-01-01 | FRIDAY | JANUARY | 2016 | 2008 | TV-Y7 | 1 Season | Kids' TV | The Power Rangers travel to Californi... |
s4934 | TV Show | Power Rangers Lightspeed Rescue | null | Michael Chaturantabut, Sean CW Johnso... | France, Japan, United States | 2016-01-01 | FRIDAY | JANUARY | 2016 | 2000 | TV-Y7 | 1 Season | Kids' TV | As demons rumble from their graves be... |
s4935 | TV Show | Power Rangers Lost Galaxy | null | Archie Kao, Reggie Rolle, Danny Slavi... | United States, France, Japan | 2016-01-01 | FRIDAY | JANUARY | 2016 | 1999 | TV-Y7 | 1 Season | Kids' TV | Five teenagers, transformed by the my... |
s4936 | TV Show | Power Rangers Mystic Force | null | Firass Dirani, Angie Diaz, Richard Br... | United States | 2016-01-01 | FRIDAY | JANUARY | 2016 | 2006 | TV-Y7 | 1 Season | Kids' TV | When the wicked Undead Army is unleas... |
s4938 | TV Show | Power Rangers Ninja Storm | null | Pua Magasiva, Sally Martin, Glenn McM... | United States, New Zealand | 2016-01-01 | FRIDAY | JANUARY | 2016 | 2003 | TV-Y7 | 1 Season | Kids' TV | When the elite warriors from the Wind... |
s4939 | TV Show | Power Rangers Operation Overdrive | null | James Maclurcan, Caitlin Murphy, Samu... | United States, New Zealand, Japan | 2016-01-01 | FRIDAY | JANUARY | 2016 | 2007 | TV-Y7 | 1 Season | Kids' TV | To keep powerful jewels from falling ... |
s4940 | TV Show | Power Rangers RPM | null | Eka Darville, Ari Boyland, Rose McIve... | United States | 2016-01-01 | FRIDAY | JANUARY | 2016 | 2009 | TV-Y7 | 1 Season | Kids' TV | The Power Rangers' new member, Dillon... |
s4941 | TV Show | Power Rangers S.P.D. | null | Brandon Jay McLaren, Chris Violette, ... | United States, New Zealand | 2016-01-01 | FRIDAY | JANUARY | 2016 | 2005 | TV-Y7 | 1 Season | Kids' TV | When the Troobian Empire attacks Eart... |
s4942 | TV Show | Power Rangers Samurai | null | Alex Heartman, Erika Fong, Hector Dav... | United States | 2016-01-01 | FRIDAY | JANUARY | 2016 | 2011 | TV-Y7 | 1 Season | Kids' TV | A new generation of Power Rangers mus... |
df_split_date
.valueCounts(false) { year and month }
.plot {
tiles {
x(year)
y(month)
width = .9
height = .9
fillColor("count") {
scale = continuous(range = Color.hex("#FFF3E0")..Color.hex("#E65100"))
}
}
layout {
title = "Content additions by month and year"
size = 900 to 700
style {
panel {
background {
blank = true
}
grid.lineGlobal { blank = true }
}
}
}
}
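To make the aggregation behind this heatmap more concrete, here is a minimal plain-Kotlin sketch of what counting rows per (year, month) pair amounts to. It does not use the dataframe API; `countByYearMonth` and the sample dates are hypothetical and only illustrate the idea behind `valueCounts { year and month }`.

```kotlin
import java.time.LocalDate
import java.time.Month

// Hypothetical helper: counts how many dates fall into each (year, month) pair.
fun countByYearMonth(dates: List<LocalDate>): Map<Pair<Int, Month>, Int> =
    dates.groupingBy { it.year to it.month }.eachCount()

// Made-up sample dates, not taken from the dataset.
val sampleDates = listOf(
    LocalDate.of(2016, 1, 1),
    LocalDate.of(2016, 1, 1),
    LocalDate.of(2016, 2, 10),
    LocalDate.of(2020, 1, 1),
)
val sampleCounts = countByYearMonth(sampleDates)
```

Each key of `sampleCounts` corresponds to one tile of the heatmap, and its value to the tile's fill color.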
In this section, let's take a look at the actors and directors who make the content. First, let's determine the average number of actors in titles.
// splitting cast and counting the number of actors
val cast_df = df
.split { cast }.by(',').inplace()
.add("size_cast") { "cast"<List<String>>().size }
// Since we need the time in milliseconds since epoch for the plots, let's convert date_added to an Instant
.convert { date_added }.with { it.atStartOfDayIn(TimeZone.UTC) }
cast_df
show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description | size_cast |
---|---|---|---|---|---|---|---|---|---|---|---|---|
s7114 | Movie | To and From New York | Sorin Dan Mihalcescu | [Barbara King, Shaana Diya, John Kris... | United States | 2008-01-01T00:00:00Z | 2006 | TV-MA | 81 min | Dramas, Independent Movies, Thrillers | While covering a story in New York Ci... | 6 |
s1766 | TV Show | Dinner for Five | null | [ ] | United States | 2008-02-04T00:00:00Z | 2007 | TV-MA | 1 Season | Stand-Up Comedy & Talk Shows | In each episode, four celebrities joi... | 0 |
s3249 | Movie | Just Another Love Story | Ole Bornedal | [Anders W. Berthelsen, Rebecka Hemse,... | Denmark | 2009-05-05T00:00:00Z | 2007 | TV-MA | 104 min | Dramas, International Movies | When he causes a car accident that le... | 12 |
s5766 | Movie | Splatter | Joe Dante | [Corey Feldman, Tony Todd, Tara Leigh... | United States | 2009-11-18T00:00:00Z | 2009 | TV-MA | 29 min | Horror Movies | After committing suicide, a washed-up... | 6 |
s3841 | Movie | Mad Ron's Prevues from Hell | Jim Monaco | [Nick Pawlow, Jordu Schell, Jay Kushw... | United States | 2010-11-01T00:00:00Z | 1987 | NR | 84 min | Cult Movies, Horror Movies | This collection cherry-picks trailers... | 10 |
s2042 | Movie | Even the Rain | Icíar Bollaín | [Luis Tosar, Gael García Bernal, Juan... | Spain, Mexico, France | 2011-05-17T00:00:00Z | 2010 | TV-MA | 103 min | Dramas, International Movies | While making a film about the incursi... | 12 |
s3222 | Movie | Joseph: King of Dreams | Rob LaDuca, Robert C. Ramirez | [Ben Affleck, Mark Hamill, Richard He... | United States | 2011-09-27T00:00:00Z | 2000 | TV-PG | 75 min | Children & Family Movies, Dramas, Fai... | With his gift of dream interpretation... | 15 |
s233 | Movie | A Stoning in Fulham County | Larry Elikann | [Ken Olin, Jill Eikenberry, Maureen M... | United States | 2011-10-01T00:00:00Z | 1988 | TV-14 | 95 min | Dramas | After reckless teens kill an Amish ch... | 12 |
s309 | Movie | Adam: His Song Continues | Robert Markowitz | [Daniel J. Travanti, JoBeth Williams,... | United States | 2011-10-01T00:00:00Z | 1986 | TV-MA | 96 min | Dramas | After their child was abducted and mu... | 5 |
s2623 | Movie | Hard Lessons | Eric Laneuville | [Denzel Washington, Lynn Whitfield, R... | United States | 2011-10-01T00:00:00Z | 1986 | TV-14 | 94 min | Dramas | This drama based on real-life events ... | 4 |
s2963 | Movie | In Defense of a Married Man | Joel Oliansky | [Judith Light, Michael Ontkean, Jerry... | United States | 2011-10-01T00:00:00Z | 1990 | TV-14 | 94 min | Dramas | A lawyer's husband is having an affai... | 8 |
s5042 | Movie | Quiet Victory: The Charlie Wedemeyer ... | Roy Campanella II | [Pam Dawber, Michael Nouri, Bess Meye... | United States | 2011-10-01T00:00:00Z | 1988 | TV-PG | 93 min | Dramas, Sports Movies | When high school football coach Charl... | 6 |
s5833 | Movie | Strange Voices | Arthur Allan Seidelman | [Nancy McKeon, Valerie Harper, Stephe... | United States | 2011-10-01T00:00:00Z | 1987 | TV-PG | 96 min | Dramas | When their college-age daughter sudde... | 5 |
s6846 | Movie | The Ryan White Story | John Herzfeld | [Judith Light, Lukas Haas, Michael Bo... | United States | 2011-10-01T00:00:00Z | 1989 | TV-PG | 94 min | Dramas | After contracting HIV from a tainted ... | 8 |
s7150 | Movie | Too Young the Hero | Buzz Kulik | [Ricky Schroder, Jon DeVries, Debra M... | United States | 2011-10-01T00:00:00Z | 1988 | TV-MA | 94 min | Dramas | Twelve-year-old Calvin manages to joi... | 7 |
s7231 | Movie | Triumph of the Heart | Richard Michaels | [Mario Van Peebles, Susan Ruttan, Lan... | United States | 2011-10-01T00:00:00Z | 1991 | TV-PG | 93 min | Dramas, Sports Movies | This drama tells the tale of Ricky Be... | 8 |
s7363 | Movie | Unspeakable Acts | Linda Otto | [Jill Clayburgh, Brad Davis, Sam Behr... | United States | 2011-10-01T00:00:00Z | 1990 | TV-14 | 95 min | Dramas | Laurie and Joseph are doctors who int... | 3 |
s7415 | Movie | Victim of Beauty | Roger Young | [William Devane, Jeri Ryan, Michele A... | United States | 2011-10-01T00:00:00Z | 1991 | TV-14 | 93 min | Dramas, Thrillers | A beauty pageant winner is stalked by... | 8 |
s819 | Movie | Being Elmo: A Puppeteer's Journey | Constance Marks | [Kevin Clash, Whoopi Goldberg] | United States | 2012-02-21T00:00:00Z | 2011 | PG | 76 min | Documentaries | Whoopi Goldberg narrates Elmo creator... | 2 |
s1230 | Movie | Casa de mi Padre | Matt Piedmont | [Will Ferrell, Gael García Bernal, Di... | United States, Mexico | 2012-11-14T00:00:00Z | 2012 | R | 84 min | Comedies | Will Ferrell stars as a Spanish-speak... | 8 |
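As a rough illustration of the split step, the sketch below does the same thing on a single string in plain Kotlin. `splitCast` is a hypothetical helper; it assumes names are trimmed and that a missing cast becomes an empty list, which is what the `[ ]` and `0` in the "Dinner for Five" row above suggest.

```kotlin
// Hypothetical helper: turns a comma-separated cast string into a list of names.
// A missing (null) or blank cast yields an empty list, so its size is 0.
fun splitCast(cast: String?): List<String> =
    cast?.split(',')?.map { it.trim() }?.filter { it.isNotEmpty() } ?: emptyList()

val elmoCast = splitCast("Kevin Clash, Whoopi Goldberg") // cast of s819 in the table above
val missingCast = splitCast(null)
```

Here `elmoCast.size` plays the role of `size_cast` for that row.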
cast_df.plot {
histogram(size_cast, binsOption = BinsOption.byNumber(50)) {
fillColor(Stat.count) {
scale = continuous(range = Color.hex("#E0F7FA")..Color.hex("#006064"))
legend {
type = LegendType.None
}
}
}
layout {
xAxisLabel = "actors"
title = "Number of people on cast"
size = 950 to 650
}
}
You can see that usually 8–9 people are included in the cast.
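The histogram gives a visual impression; if you want a single number, a mean or a median of the cast sizes is one option. The sketch below uses plain Kotlin on a list of sizes. `meanOf` and `medianOf` are hypothetical helpers, and the sample is just the first five `size_cast` values from the table above, so the results say nothing about the whole dataset.

```kotlin
// Hypothetical helpers operating on a plain list of cast sizes.
fun meanOf(sizes: List<Int>): Double = sizes.average()

fun medianOf(sizes: List<Int>): Double {
    require(sizes.isNotEmpty()) { "median of an empty list is undefined" }
    val sorted = sizes.sorted()
    val mid = sorted.size / 2
    return if (sorted.size % 2 == 1) sorted[mid].toDouble()
    else (sorted[mid - 1] + sorted[mid]) / 2.0
}

// size_cast values of the first five rows shown in the table above.
val firstFiveSizes = listOf(6, 0, 12, 6, 10)
val firstFiveMean = meanOf(firstFiveSizes)
val firstFiveMedian = medianOf(firstFiveSizes)
```

Titles with an empty cast pull the mean down, so the median can be the more robust summary here.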
But who exactly appears in this content? Let's take a look at the actors and how many movies and shows each of them took part in.
// counting the participation of each actor
val actors_df = cast_df.cast.explode().valueCounts()
actors_df
cast | count |
---|---|
Anupam Kher | 42 |
Shah Rukh Khan | 35 |
Om Puri | 30 |
Naseeruddin Shah | 30 |
Takahiro Sakurai | 29 |
Akshay Kumar | 29 |
Yuki Kaji | 27 |
Amitabh Bachchan | 27 |
Boman Irani | 27 |
Paresh Rawal | 27 |
Kareena Kapoor | 25 |
Andrea Libman | 24 |
John Cleese | 24 |
Vincent Tong | 24 |
Tara Strong | 22 |
Ashleigh Ball | 22 |
Nawazuddin Siddiqui | 21 |
Ajay Devgn | 21 |
Daisuke Ono | 20 |
Salman Khan | 20 |
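Conceptually, `explode()` followed by `valueCounts()` flattens the per-title cast lists into one long list of names and counts each name. Here is a minimal plain-Kotlin sketch of that idea; `actorCounts` and the sample casts are hypothetical and not part of the dataframe API.

```kotlin
// Hypothetical helper: flattens per-title cast lists and counts appearances per actor,
// sorted by count in descending order (ties keep first-seen order).
fun actorCounts(casts: List<List<String>>): List<Pair<String, Int>> =
    casts.flatten()
        .groupingBy { it }
        .eachCount()
        .toList()
        .sortedByDescending { it.second }

// Made-up casts for illustration only.
val demoCasts = listOf(
    listOf("Actor A", "Actor B"),
    listOf("Actor B", "Actor C"),
    listOf("Actor B"),
    emptyList(),
)
val demoActorCounts = actorCounts(demoCasts)
```

Titles with an empty cast simply contribute nothing to the counts.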
actors_df.take(30).plot {
barsH {
y(cast) { scale = categorical() }
x(count)
fillColor(cast) {
scale = categoricalColorHue()
legend {
type = LegendType.None
}
}
}
layout.title = "Top 30 actors"
layout.size = 950 to 900
}
Anupam Kher is definitely in the lead with 42 titles. Now let's split the cast counts by content type: movies and TV shows.
val actors = cast_df.pivot { type }.aggregate {
cast.explode().valueCounts()
}
actors
Movie | TV Show |
---|---|
DataFrame [23049 x 2] ... showing only top 5 of 23049 rows | DataFrame [13538 x 2] ... showing only top 5 of 13538 rows |
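The pivot above yields one nested table of actor counts per content type. In plain Kotlin the same shape can be sketched as a map from type to a map of counts. `TitleCast`, `actorCountsByType` and the rows are hypothetical and only mirror the idea, not the dataframe implementation.

```kotlin
// Hypothetical row type: only the two fields needed here.
data class TitleCast(val type: String, val cast: List<String>)

// Groups titles by type, then counts actor appearances inside each group.
fun actorCountsByType(titles: List<TitleCast>): Map<String, Map<String, Int>> =
    titles.groupBy { it.type }
        .mapValues { (_, group) ->
            group.flatMap { it.cast }.groupingBy { it }.eachCount()
        }

// Made-up rows for illustration only.
val demoTitles = listOf(
    TitleCast("Movie", listOf("Actor A", "Actor B")),
    TitleCast("TV Show", listOf("Actor B")),
    TitleCast("Movie", listOf("Actor A")),
)
val demoByType = actorCountsByType(demoTitles)
```

An actor who appears in both movies and shows is counted separately in each group.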
val p1 = actors.`TV Show`.take(30).plot {
barsH {
x(count)
y(cast)
fillColor(cast) {
scale = continuous(Color.hex("#263238")..Color.hex("#ECEFF1"))
legend { type = LegendType.None }
}
}
layout.title = "Top 30 actors in Shows"
}
val p2 = actors.Movie.take(30).plot {
barsH {
x(count)
y(cast)
fillColor(cast) {
scale = continuousColorGradientN(listOf(Color.hex("#006064"), Color.hex("#E0F7FA")))
legend { type = LegendType.None }
}
}
layout.title = "Top 30 actors in Movies"
}
plotBunch {
add(p1, 0, 0, 500, 700)
add(p2, 500, 0, 500, 700)
}
How about directors? Let's see the top 10 directors with the most titles in the Netflix catalog.
val directors_df = df.valueCounts { director }
directors_df.take(10).plot {
barsH {
x(count)
y(director) { axis.name = "Name" }
fillColor(director) {
scale = categoricalColorHue()
legend { type = LegendType.None }
}
}
layout.title = "Top 10 directors"
layout.size = 850 to 500
}
These directors have been remarkably prolific.
This section focuses on analyzing how content is distributed across various countries.
To do so, we will need to import libraries that work with geospatial data and maps and then perform the necessary manipulations to render the maps. However, let's prepare the data first.
Let's add another dataframe with country labels.
val countriesCodes = DataFrame.readCsv("country_codes.csv")
countriesCodes.head()
country | code | iso | lat | lon |
---|---|---|---|---|
United States | US | USA | 39.783730 | -100.445883 |
India | IN | IND | 22.351115 | 78.667743 |
United Kingdom | GB | GBR | 54.702355 | -3.276575 |
null | Unknown | Unknown | 51.146139 | 12.233285 |
Canada | CA | CAN | 61.066692 | -107.991707 |
// counting the number of titles by country and joining them with country codes in the dataframe
val dfCountry = df.valueCounts { country }.join(countriesCodes)
dfCountry
country | count | code | iso | lat | lon |
---|---|---|---|---|---|
United States | 2549 | US | USA | 39.783730 | -100.445883 |
India | 923 | IN | IND | 22.351115 | 78.667743 |
United Kingdom | 396 | GB | GBR | 54.702355 | -3.276575 |
Japan | 225 | JP | JPN | 36.574844 | 139.239418 |
South Korea | 183 | KR | KOR | 36.638392 | 127.696119 |
Canada | 177 | CA | CAN | 61.066692 | -107.991707 |
Spain | 134 | ES | ESP | 39.326069 | -4.837979 |
France | 115 | FR | FRA | 46.603354 | 1.888334 |
Egypt | 101 | EG | EGY | 26.254049 | 29.267547 |
Mexico | 100 | MX | MEX | 22.500049 | -100.000038 |
Turkey | 100 | TR | TUR | 38.959759 | 34.924965 |
Australia | 82 | AU | AUS | -24.776109 | 134.755000 |
Taiwan | 78 | TW | TWN | 23.973937 | 120.982018 |
Brazil | 72 | BR | BRA | -10.333333 | -53.200000 |
Philippines | 71 | PH | PHL | 12.750349 | 122.731210 |
Nigeria | 70 | NG | NGA | 9.600036 | 7.999972 |
Indonesia | 70 | ID | IDN | -2.483383 | 117.890285 |
Germany | 61 | DE | DEU | 51.083420 | 10.423447 |
China | 57 | CN | CHN | 35.000074 | 104.999927 |
Thailand | 57 | TH | THA | 14.897192 | 100.832730 |
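It helps to keep in mind what this join does to rows without a partner. The sketch below assumes `join` behaves as an inner join on the shared `country` column, so a counted value with no matching row in the codes table is dropped. `CountryCode`, `joinWithCodes` and the combined "United States, Canada" count are hypothetical; the other two counts are taken from the table above.

```kotlin
// Hypothetical row type for the codes table (only two of its columns).
data class CountryCode(val country: String, val iso: String)

// Inner join on the country name: counts without a matching code row are dropped.
fun joinWithCodes(
    counts: Map<String, Int>,
    codes: List<CountryCode>,
): List<Triple<String, Int, String>> {
    val isoByCountry = codes.associate { it.country to it.iso }
    return counts.mapNotNull { (country, count) ->
        isoByCountry[country]?.let { iso -> Triple(country, count, iso) }
    }
}

// The combined-country entry and its count are made up for illustration.
val demoCounts = mapOf("United States" to 2549, "India" to 923, "United States, Canada" to 1)
val demoCodes = listOf(CountryCode("United States", "USA"), CountryCode("India", "IND"))
val demoJoined = joinWithCodes(demoCounts, demoCodes)
```

Under that assumption, co-productions listed as several countries in one string would not contribute to any single country's count.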
The result shows which countries produce most of the content that ends up on Netflix. Let's take a closer look at the top ten.
dfCountry[0..9].sortByDesc { count }.plot {
bars {
x(country)
y(count)
fillColor = Color.hex("#00796B")
}
layout.title = "Top 10 Countries"
layout.size = 900 to 450
}
How long is the content? How much time would you need to spend to watch it all?
Let's get the number of minutes for films and the number of seasons for shows.
val df_dur = df
.split { duration }.by(" ").inward("duration_num", "duration_scale") // splitting duration by time and scale inward
.convert { "duration"["duration_num"] }.toInt() // converting by column path
.update { "duration"["duration_scale"] }.with { if (it == "Seasons") "Season" else it }
df_dur.head()
show_id | type | title | director | cast | country | date_added | release_year | rating | duration.duration_num | duration.duration_scale | listed_in | description |
---|---|---|---|---|---|---|---|---|---|---|---|---|
s7114 | Movie | To and From New York | Sorin Dan Mihalcescu | Barbara King, Shaana Diya, John Krisi... | United States | 2008-01-01 | 2006 | TV-MA | 81 | min | Dramas, Independent Movies, Thrillers | While covering a story in New York Ci... |
s1766 | TV Show | Dinner for Five | null | null | United States | 2008-02-04 | 2007 | TV-MA | 1 | Season | Stand-Up Comedy & Talk Shows | In each episode, four celebrities joi... |
s3249 | Movie | Just Another Love Story | Ole Bornedal | Anders W. Berthelsen, Rebecka Hemse, ... | Denmark | 2009-05-05 | 2007 | TV-MA | 104 | min | Dramas, International Movies | When he causes a car accident that le... |
s5766 | Movie | Splatter | Joe Dante | Corey Feldman, Tony Todd, Tara Leigh,... | United States | 2009-11-18 | 2009 | TV-MA | 29 | min | Horror Movies | After committing suicide, a washed-up... |
s3841 | Movie | Mad Ron's Prevues from Hell | Jim Monaco | Nick Pawlow, Jordu Schell, Jay Kushwa... | United States | 2010-11-01 | 1987 | NR | 84 | min | Cult Movies, Horror Movies | This collection cherry-picks trailers... |
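To see what the split, convert and update steps do to a single value, here is a plain-Kotlin sketch that parses one duration string and, as a small step toward the "how long to watch it all" question, sums the minutes of movie entries. `ParsedDuration`, `parseDuration` and `totalMovieMinutes` are hypothetical names, and the input assumes the "number, space, unit" shape seen in the data.

```kotlin
// Hypothetical value type for a parsed duration such as "93 min" or "4 Seasons".
data class ParsedDuration(val num: Int, val scale: String)

fun parseDuration(raw: String): ParsedDuration {
    val (num, scale) = raw.trim().split(" ", limit = 2)
    // Normalize the plural so that all shows share one unit label.
    return ParsedDuration(num.toInt(), if (scale == "Seasons") "Season" else scale)
}

// Total minutes over movie entries only (unit "min"); seasons are ignored.
fun totalMovieMinutes(durations: List<String>): Int =
    durations.map(::parseDuration).filter { it.scale == "min" }.sumOf { it.num }

// Durations of the five rows printed by rawDf.head() at the top of the notebook.
val headDurations = listOf("4 Seasons", "93 min", "78 min", "80 min", "123 min")
val headMovieMinutes = totalMovieMinutes(headDurations)
```

Seasons cannot be converted to minutes without episode data, so shows are left out of the total.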
val durations = df_dur.pivot { type }.values { duration }
durations
Movie | TV Show |
---|---|
DataFrame [5377 x 2] ... showing only top 5 of 5377 rows | DataFrame [2400 x 2] ... showing only top 5 of 2400 rows |
val p1 = durations.Movie.plot {
histogram(duration_num, binsOption = BinsOption.byNumber(100)) {
y(Stat.density)
fillColor = Color.hex("#00BCD4")
}
statBin(duration_num, binsOption = BinsOption.byNumber(25)) {
line {
x(Stat.x) { axis.name = "minutes" }
y(Stat.density) { axis.name = "density" }
alpha = 1.0
width = 1.0
color = Color.hex("#d41900")
}
}
layout.title = "Distribution of movies duration in minutes"
}
val p2 = durations.`TV Show`.plot {
statBin(duration_num, binsOption = BinsOption.byNumber(15)) {
bars {
x(Stat.x)
y(Stat.count)
fillColor = Color.hex("#00BCD4")
}
}
}
plotBunch {
add(p1, 0, 0, 1000, 500)
add(p2, 0, 500, 1000, 500)
}
And by tradition, the longest movies and TV shows.
df_dur.xs("Movie") { type }
.sortByDesc { duration.duration_num }.head()
.select { title and country and date_added and release_year and duration.cols() }
title | country | date_added | release_year | duration_num | duration_scale |
---|---|---|---|---|---|
Black Mirror: Bandersnatch | United States | 2018-12-28 | 2018 | 312 | min |
The School of Mischief | Egypt | 2020-05-21 | 1973 | 253 | min |
No Longer kids | Egypt | 2020-05-21 | 1979 | 237 | min |
Lock Your Girls In | null | 2020-05-21 | 1982 | 233 | min |
Raya and Sakina | null | 2020-05-21 | 1984 | 230 | min |
df_dur.xs("TV Show") { type }
.sortByDesc { duration.duration_num }.head()
.select { title and country and date_added and release_year and duration.cols() }
title | country | date_added | release_year | duration_num | duration_scale |
---|---|---|---|---|---|
Grey's Anatomy | United States | 2020-05-09 | 2019 | 16 | Season |
NCIS | United States | 2018-07-01 | 2017 | 15 | Season |
Supernatural | United States, Canada | 2020-06-05 | 2019 | 15 | Season |
COMEDIANS of the world | United States | 2019-01-01 | 2019 | 13 | Season |
Criminal Minds | United States, Canada | 2017-06-30 | 2017 | 12 | Season |
And in the top content-producing countries, how long are movies and TV shows?
val list_top_countries = dfCountry.country.take(10).toSet()
val df_cntr = df_dur
.filter { country in list_top_countries }
.pivot { type }.aggregate {
groupBy { country }.mean { duration.duration_num }
}
df_cntr
Movie | TV Show |
---|---|
DataFrame [10 x 2] ... showing only top 5 of 10 rows | DataFrame [10 x 2] ... showing only top 5 of 10 rows |
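The aggregation above boils down to "for each type, the mean duration per country". A minimal plain-Kotlin sketch of that nesting is shown below; `DurationRow`, `meanDurationByTypeAndCountry` and the rows are hypothetical and only illustrate the grouping, not the dataframe API.

```kotlin
// Hypothetical row type: durationNum is minutes for movies and seasons for shows.
data class DurationRow(val type: String, val country: String, val durationNum: Int)

// For each type, the mean durationNum per country.
fun meanDurationByTypeAndCountry(rows: List<DurationRow>): Map<String, Map<String, Double>> =
    rows.groupBy { it.type }.mapValues { (_, ofType) ->
        ofType.groupBy { it.country }
            .mapValues { (_, ofCountry) -> ofCountry.map { it.durationNum }.average() }
    }

// Made-up rows for illustration only.
val demoDurationRows = listOf(
    DurationRow("Movie", "India", 120),
    DurationRow("Movie", "India", 140),
    DurationRow("Movie", "Japan", 100),
    DurationRow("TV Show", "Japan", 1),
    DurationRow("TV Show", "Japan", 2),
)
val demoMeans = meanDurationByTypeAndCountry(demoDurationRows)
```

The two inner maps use different units (minutes and seasons), which is why the notebook plots them separately.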
val p1 = df_cntr.Movie.sortBy { duration_num }.plot {
bars {
x(country) { axis.name = "Name" }
y(duration_num) { axis.name = "Minute" }
fillColor(duration_num) {
scale = continuous(Color.hex("#ECEFF1")..Color.hex("#263238"))
legend.type = LegendType.None
}
}
layout.title = "Mean movie duration in the top 10 countries"
}
val p2 = df_cntr.`TV Show`.sortBy { duration_num }.plot {
bars {
x(country) { axis.name = "Name" }
y(duration_num) { axis.name = "Season" }
fillColor(duration_num) {
scale = continuous(Color.hex("#E0F7FA")..Color.hex("#006064"))
legend.type = LegendType.None
}
}
layout.title = "Mean number of TV show seasons in the top 10 countries"
}
plotBunch {
add(p1, 0, 0, 900, 550)
add(p2, 0, 550, 900, 550)
}
Next, let's take a look at the rating column to find out which ratings are most commonly assigned to movies and shows.
val dfInstants = df.convert { date_added }.with { it.atStartOfDayIn(TimeZone.UTC) }
dfInstants.valueCounts(false) { rating }.sortBy("count").plot {
bars {
x(rating)
y("count")
fillColor(rating) {
scale = categoricalColorHue()
legend.type = LegendType.None
}
}
layout.title = "Rating of Titles"
layout.size = 950 to 500
}
dfInstants.valueCounts(sort = false) { rating and type }.plot {
bars {
x(rating)
y("count")
fillColor(type) { scale = categorical(listOf(Color.hex("#607D8B"), Color.hex("#00BCD4"))) }
position = Position.dodge()
}
layout.title = "Rating of Titles by Type"
layout.size = 950 to 500
}
Finally, let's use the experimental Kandy-Geo integration to plot some results on a map!
// Dataframe & Kandy geo extensions
%use kandy-geo@kc25
val worldGeo = with("naturalearth_lowres") {
val url = "https://raw.githubusercontent.com/JetBrains/lets-plot-kotlin/master/docs/examples/shp/${this}/${this}.shp"
GeoDataFrame.readShapefile(url)
}
val geoCountriesCount = worldGeo.modify {
join(dfCountry) {
iso_a3 match right.iso
}
}
geoCountriesCount.df
pop_est | continent | name | iso_a3 | gdp_md_est | geometry | country | count | code | lat | lon |
---|---|---|---|---|---|---|---|---|---|---|
35623680 | North America | Canada | CAN | 1674000.000000 | MULTIPOLYGON (((-122.84000000000003 4... | Canada | 177 | CA | 61.066692 | -107.991707 |
326625791 | North America | United States of America | USA | 18560000.000000 | MULTIPOLYGON (((-122.84000000000003 4... | United States | 2549 | US | 39.783730 | -100.445883 |
260580739 | Asia | Indonesia | IDN | 3028000.000000 | MULTIPOLYGON (((141.00021040259185 -2... | Indonesia | 70 | ID | -2.483383 | 117.890285 |
44293293 | South America | Argentina | ARG | 879400.000000 | MULTIPOLYGON (((-68.63401022758323 -5... | Argentina | 50 | AR | -34.996496 | -64.967282 |
17789267 | South America | Chile | CHL | 436100.000000 | MULTIPOLYGON (((-68.63401022758323 -5... | Chile | 14 | CL | -31.761337 | -71.318770 |
47615739 | Africa | Kenya | KEN | 152700.000000 | MULTIPOLYGON (((39.20222 -4.67677, 37... | Kenya | 2 | KE | 1.441968 | 38.431398 |
142257519 | Europe | Russia | RUS | 3745000.000000 | MULTIPOLYGON (((178.7253 71.0988, 180... | Russia | 16 | RU | 64.686314 | 97.745306 |
54841552 | Africa | South Africa | ZAF | 739100.000000 | MULTIPOLYGON (((16.344976840895242 -2... | South Africa | 25 | ZA | -28.816624 | 24.991639 |
124574795 | North America | Mexico | MEX | 2307000.000000 | MULTIPOLYGON (((-117.12775999999985 3... | Mexico | 100 | MX | 22.500049 | -100.000038 |
3360148 | South America | Uruguay | URY | 73250.000000 | MULTIPOLYGON (((-57.62513342958296 -3... | Uruguay | 3 | UY | -32.875555 | -56.020153 |
207353391 | South America | Brazil | BRA | 3081000.000000 | MULTIPOLYGON (((-53.373661668498244 -... | Brazil | 72 | BR | -10.333333 | -53.200000 |
31036656 | South America | Peru | PER | 410400.000000 | MULTIPOLYGON (((-69.89363521999663 -4... | Peru | 4 | PE | -6.869970 | -75.045852 |
47698524 | South America | Colombia | COL | 688000.000000 | MULTIPOLYGON (((-66.87632585312258 1.... | Colombia | 31 | CO | 2.889443 | -73.783892 |
15460732 | North America | Guatemala | GTM | 131800.000000 | MULTIPOLYGON (((-92.22775000686983 14... | Guatemala | 1 | GT | 15.635609 | -89.898809 |
31304016 | South America | Venezuela | VEN | 468600.000000 | MULTIPOLYGON (((-60.73357418480372 5.... | Venezuela | 1 | VE | 8.001871 | -66.110932 |
13805084 | Africa | Zimbabwe | ZWE | 28330.000000 | MULTIPOLYGON (((31.19140913262129 -22... | Zimbabwe | 1 | ZW | -19.016880 | 29.353650 |
2484780 | Africa | Namibia | NAM | 25990.000000 | MULTIPOLYGON (((19.895767856534434 -2... | Namibia | 1 | null | -23.233550 | 17.323111 |
14668522 | Africa | Senegal | SEN | 39720.000000 | MULTIPOLYGON (((-16.71372880702347 13... | Senegal | 1 | SN | 14.465177 | -14.765341 |
190632261 | Africa | Nigeria | NGA | 1089000.000000 | MULTIPOLYGON (((2.6917016943562544 6.... | Nigeria | 70 | NG | 9.600036 | 7.999972 |
27499924 | Africa | Ghana | GHA | 120800.000000 | MULTIPOLYGON (((0.0238025244237008 11... | Ghana | 3 | GH | 7.857371 | -1.084098 |
geoCountriesCount.plot {
geoMap() {
fillColor(count) {
scale = continuous(Color.hex("#FFF3E0")..Color.hex("#E65100"), transform = Transformation.SQRT)
legend.name = "Number of Titles"
}
}
coordinatesTransformation = CoordinatesTransformation.mercator()
y.axis.limits = -55 .. 85
layout {
size = 1000 to 800
style(Style.Void)
}
}
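The square-root transform on the fill scale matters because the counts are very skewed: the United States has 2549 titles while most countries have far fewer. The sketch below is a simplified model of such a scale, not Kandy's actual implementation. `sqrtScalePosition` is a hypothetical helper that maps a count to a position between 0 and 1, assuming the scale starts at 0 and ends at the largest count.

```kotlin
import kotlin.math.sqrt

// Hypothetical helper: position of `count` on a 0..1 color scale after a
// square-root transform, given the largest count on the map.
fun sqrtScalePosition(count: Int, maxCount: Int): Double {
    require(maxCount > 0 && count in 0..maxCount) { "count must be within 0..maxCount" }
    return sqrt(count.toDouble()) / sqrt(maxCount.toDouble())
}

// Counts taken from the joined table above: Mexico (100) and the United States (2549).
val mexicoLinear = 100.0 / 2549.0
val mexicoSqrt = sqrtScalePosition(100, 2549)
```

On a linear scale Mexico would sit very close to the lightest color, while after the transform it moves noticeably further along the gradient, so mid-sized producers remain distinguishable on the map.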