This is an experiment in using the Kotlin kernel for Jupyter

Note: this notebook was updated in July 2021 to point to newer versions of its dependencies; the old versions had been deprecated and were preventing the notebook from completing successfully. It was also used for a presentation on Kotlin's Jupyter kernel in March 2021, so the 2020 season data, which didn't exist when the initial article was written, was added.

In [1]:
// these are two "supported" packages, so we can skip the full dependency & import boilerplate
%use lets-plot, krangl
In [2]:
// the CSV is courtesy of Pro-Football-Reference: https://www.pro-football-reference.com/years/NFL/scoring.htm
val dfScoring = DataFrame.readCSV("nfl_scoring.csv")
dfScoring
Out[2]:
Rk | Year | Tms | RshTD | RecTD | PR TD | KR TD | FblTD | IntTD | OthTD | AllTD | 2PM | 2PA | XPM | XPA | FGM | FGA | Sfty | Pts | Pts/G

[20 data rows, one per season from 2020 down to 2001; the column values were run together in this export and are omitted]

... only showing top 20 rows

Look how `filter` is a normal, native Kotlin call! The only difference is `lt` or `gt` instead of `<` or `>`.

Compare to the non-native syntax Pandas requires for simple filtering: `df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]`

https://stackoverflow.com/questions/17071871/how-do-i-select-rows-from-a-dataframe-based-on-column-values
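Since krangl may not be on your classpath outside the notebook, here's a minimal, self-contained sketch of the same "native filtering" idea using plain Kotlin collections and a hypothetical `SeasonRow` type (not part of the notebook's data):

```kotlin
// Hypothetical row type standing in for a DataFrame row.
data class SeasonRow(val year: Int)

fun main() {
    val rows = listOf(SeasonRow(2020), SeasonRow(1995), SeasonRow(1985))

    // ordinary stdlib filter with ordinary comparison operators --
    // krangl's lt/gt exist only because its columns are whole vectors
    val recent = rows.filter { it.year in 1991..2020 }

    println(recent.map { it.year })  // [2020, 1995]
}
```

The point of comparison: in Pandas the predicate is built from overloaded operators on Series objects, while here the lambda is just an ordinary Kotlin function evaluated per element.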

In [3]:
// DataFrames are cool but so are native Kotlin data structures, like Maps
val mapScoring = dfScoring.filter { (it["Year"] lt 2021) AND (it["Year"] gt 1990) }.toMap()
mapScoring.keys
Out[3]:
[Rk, Year, Tms, RshTD, RecTD, PR TD, KR TD, FblTD, IntTD, OthTD, AllTD, 2PM, 2PA, XPM, XPA, FGM, FGA, Sfty, Pts, Pts/G]
In [4]:
// the map's keys are strings (column titles), the values are lists, the individual lists contain the column data
mapScoring["Year"]?.map { it }
Out[4]:
[2020, 2019, 2018, 2017, 2016, 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991]
In [5]:
// boom... we can easily plot a key column, Total Points
val p = letsPlot(mapScoring) { x = "Year"; y = "Pts" } + ggsize(640, 240)
p + geomBar(stat=Stat.identity) +
    ggtitle("Total Points per NFL regular season")
Out[5]:
In [6]:
// and another, Receiving TDs
val p = letsPlot(mapScoring) { x = "Year"; y = "RecTD" } + ggsize(640, 240)
p + geomBar(stat=Stat.identity) +
    ggtitle("Total Receiving Touchdowns per NFL regular season")
Out[6]:

The graphs will be a bit more dramatic if we group the years into 5-year buckets.

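The bucketing arithmetic used in the next cell maps each year to the first year of its 5-year bucket via `floor((year - 1) / 5) * 5 + 1`. A quick plain-Kotlin check of that formula (the `bucketStart` helper name is mine, not the notebook's):

```kotlin
import kotlin.math.floor

// floor((year - 1) / 5) * 5 + 1 sends each year to the first year of its
// 5-year bucket: 1991..1995 -> 1991, 1996..2000 -> 1996, and so on
fun bucketStart(year: Int): Int = (floor((year - 1) / 5.0) * 5 + 1).toInt()

fun main() {
    println(bucketStart(1991))  // 1991
    println(bucketStart(1995))  // 1991
    println(bucketStart(1996))  // 1996
    println(bucketStart(2020))  // 2016
}
```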
In [27]:
// to add new columns (for bucketing), we return to the original DataFrame and create new columns based on existing values
// krangl's `addColumn` is not a native Kotlin method, but its syntax is just like `filter` or `map`, it accesses `it`, etc.
val dfScoringRanges = dfScoring
    .filter { (it["Year"] lt 2021) AND (it["Year"] gt 1990) }
    .addColumn("YearRange") { it["Year"].map<Double> { (floor((it - 1) / 5.0) * 5 + 1).toInt() } }
    .addColumn("Years") { it["YearRange"].map<Int> { "$it - ${it + 4}" } }
 
// we're creating another Map, but now we are grouping by year bucket and averaging the values within each bucket
val mapScoringRanges = dfScoringRanges
    .select({ listOf("Year", "Pts", "RecTD", "YearRange", "Years") })
    .groupBy("YearRange", "Years")
    .summarize(
        "mean_Pts" to { it["Pts"].mean(removeNA = true) },
        "mean_RecTD" to { it["RecTD"].mean(removeNA = true) }
    ).toMap()

// these xlimits are the discrete values used on the x-axis (and the labels)
// only annoying thing is all the null handling of a data source we know is non-null
val xlimits = mapScoringRanges["Years"]?.toSet()?.reversed()?.filterNotNull()
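The `groupBy`/`summarize` step above has a direct stdlib analogue. A minimal sketch with a hypothetical `Season` type and made-up numbers (not the real NFL figures), assuming we only want the mean points per bucket:

```kotlin
// Hypothetical row type with illustrative values, standing in for the DataFrame.
data class Season(val yearRange: Int, val pts: Double)

fun main() {
    val seasons = listOf(
        Season(2016, 100.0), Season(2016, 110.0),
        Season(2011, 90.0), Season(2011, 94.0),
    )

    // stdlib groupBy + mapValues + average mirrors krangl's groupBy/summarize
    val meanPtsByBucket = seasons
        .groupBy { it.yearRange }
        .mapValues { (_, rows) -> rows.map { it.pts }.average() }

    println(meanPtsByBucket)  // {2016=105.0, 2011=92.0}
}
```

The DataFrame version wins once you want several aggregates over many columns at once, but for one aggregate the stdlib form is arguably just as readable.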
In [28]:
// same plot as before, but bucketed; unlike the graph above, every bar is higher than the previous one, with no ups & downs
val p = letsPlot(mapScoringRanges) { x = "Years"; y = "mean_Pts" } + ggsize(780, 240)
p + geomBar(stat=Stat.identity) + scaleXDiscrete(limits = xlimits) +
    ggtitle("Average total points per NFL regular season")
Out[28]:
In [29]:
// ggsave(p + geom_bar(stat=Stat.identity) + scale_x_discrete(limits = xlimits) +
//     ggtitle("Average total points per NFL regular season"), "avg_points_binned.png")
In [30]:
// again, same plot, bucketed
val p2 = letsPlot(mapScoringRanges) { x = "Years"; y = "mean_RecTD" } + ggsize(780, 240)
p2 + geomBar(stat=Stat.identity) + scaleXDiscrete(limits = xlimits) +
    ggtitle("Average Receiving Touchdowns per NFL regular season")
Out[30]:
In [31]:
// ggsave(p2 + geom_bar(stat=Stat.identity) + scale_x_discrete(limits = xlimits) +
//     ggtitle("Average Receiving Touchdowns per NFL regular season"), "avg_rectd_binned.png")