using CSV, DataFramesMeta, Statistics
# --- Plotting Functions - Makie ----
# using CairoMakie;
# set_theme!(theme_ggplot2())
# --- Plotting Functions - GadFly ----
using Gadfly
Gadfly.push_theme(:dark)
set_default_plot_size(15cm, 15cm)
Today we'll demonstrate a quick exploration and model that you might perform with TidyTuesday data for practice in Julia.
tidytuesday
repo¶Users of Julia 1.6+ can use Base.download
to get the data from the subfolders in the official tidytuesday
repo and then CSV.read
to read it into memory as a DataFrame
.
filepath = "https://github.com/rfordatascience/tidytuesday/blob/master/data/2022/2022-03-29/sports.csv?raw=true"
# Declare `missingstring` to convert 'NA' and 'NAN' values to the `missing` type
df = CSV.read(download(filepath), DataFrame; missingstring=["NA", "NAN"])
df |> describe
Row | variable | mean | min | median | max | nmissing | eltype |
---|---|---|---|---|---|---|---|
Symbol | Union… | Any | Union… | Any | Int64 | Type | |
1 | year | 2017.68 | 2015 | 2018.0 | 2019 | 0 | Int64 |
2 | unitid | 1.8436e5 | 100654 | 181738.0 | 800001 | 0 | Int64 |
3 | institution_name | ASA College | Yuba College | 0 | String | ||
4 | city_txt | Aberdeen | Yuma | 45 | Union{Missing, String31} | ||
5 | state_cd | AK | WY | 45 | Union{Missing, String3} | ||
6 | zip_text | 4.20049e7 | 604 | 49104.0 | 997757500 | 45 | Union{Missing, Int64} |
7 | classification_code | 7.77933 | 1 | 6.0 | 20 | 0 | Int64 |
8 | classification_name | CCCAA | USCAA | 0 | String | ||
9 | classification_other | ACCA | uscaa | 130642 | Union{Missing, String} | ||
10 | ef_male_count | 2126.25 | 0 | 986.0 | 35954 | 0 | Int64 |
11 | ef_female_count | 2496.21 | 0 | 1248.0 | 30325 | 0 | Int64 |
12 | ef_total_count | 4622.46 | 0 | 2259.0 | 66279 | 0 | Int64 |
13 | sector_cd | 2.2103 | 1 | 2.0 | 99 | 0 | Int64 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
17 | partic_women | 20.7083 | 1 | 16.0 | 327 | 63442 | Union{Missing, Int64} |
18 | partic_coed_men | 11.0469 | 1 | 8.0 | 130 | 131560 | Union{Missing, Int64} |
19 | partic_coed_women | 14.1669 | 1 | 10.0 | 91 | 131560 | Union{Missing, Int64} |
20 | sum_partic_men | 14.4924 | 0 | 0.0 | 331 | 0 | Int64 |
21 | sum_partic_women | 10.8622 | 0 | 6.0 | 327 | 0 | Int64 |
22 | rev_men | 809011.0 | 65 | 158126.0 | 156147208 | 70462 | Union{Missing, Int64} |
23 | rev_women | 2.79346e5 | 0 | 138318.0 | 21440365 | 63444 | Union{Missing, Int64} |
24 | total_rev_menwomen | 7.95231e5 | 130 | 2.28776e5 | 156147208 | 45193 | Union{Missing, Int64} |
25 | exp_men | 6.62386e5 | 65 | 159666.0 | 69718059 | 70462 | Union{Missing, Int64} |
26 | exp_women | 3.31594e5 | 65 | 141800.0 | 9485162 | 63442 | Union{Missing, Int64} |
27 | total_exp_menwomen | 7.32422e5 | 130 | 234559.0 | 69718059 | 45191 | Union{Missing, Int64} |
28 | sports | All Track Combined | Wrestling | 0 | String31 |
first(df, 5)
Row | year | unitid | institution_name | city_txt | state_cd | zip_text | classification_code | classification_name | classification_other | ef_male_count | ef_female_count | ef_total_count | sector_cd | sector_name | sportscode | partic_men | partic_women | partic_coed_men | partic_coed_women | sum_partic_men | sum_partic_women | rev_men | rev_women | total_rev_menwomen | exp_men | exp_women | total_exp_menwomen | sports |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | String | String31? | String3? | Int64? | Int64 | String | String? | Int64 | Int64 | Int64 | Int64 | String? | Int64 | Int64? | Int64? | Int64? | Int64? | Int64 | Int64 | Int64? | Int64? | Int64? | Int64? | Int64? | Int64? | String31 | |
1 | 2015 | 100654 | Alabama A & M University | Normal | AL | 35762 | 2 | NCAA Division I-FCS | missing | 1923 | 2300 | 4223 | 1 | Public, 4-year or above | 1 | 31 | missing | missing | missing | 31 | 0 | 345592 | missing | 345592 | 397818 | missing | 397818 | Baseball |
2 | 2015 | 100654 | Alabama A & M University | Normal | AL | 35762 | 2 | NCAA Division I-FCS | missing | 1923 | 2300 | 4223 | 1 | Public, 4-year or above | 2 | 19 | 16 | missing | missing | 19 | 16 | 1211095 | 748833 | 1959928 | 817868 | 742460 | 1560328 | Basketball |
3 | 2015 | 100654 | Alabama A & M University | Normal | AL | 35762 | 2 | NCAA Division I-FCS | missing | 1923 | 2300 | 4223 | 1 | Public, 4-year or above | 3 | 61 | 46 | missing | missing | 61 | 46 | 183333 | 315574 | 498907 | 246949 | 251184 | 498133 | All Track Combined |
4 | 2015 | 100654 | Alabama A & M University | Normal | AL | 35762 | 2 | NCAA Division I-FCS | missing | 1923 | 2300 | 4223 | 1 | Public, 4-year or above | 7 | 99 | missing | missing | missing | 99 | 0 | 2808949 | missing | 2808949 | 3059353 | missing | 3059353 | Football |
5 | 2015 | 100654 | Alabama A & M University | Normal | AL | 35762 | 2 | NCAA Division I-FCS | missing | 1923 | 2300 | 4223 | 1 | Public, 4-year or above | 8 | 9 | missing | missing | missing | 9 | 0 | 78270 | missing | 78270 | 83913 | missing | 83913 | Golf |
RCall
and the tidytuesdayR
library¶We can use the tidytuesdayR
R library and your local R installation with the RCall
library to call R from Julia.
Example:
YEAR=2022;
WEEK=13;
# TidytuesdayR
tt_data = R"tt_data <- tidytuesdayR::tt_load($YEAR, week=$WEEK)";
# R --> Julia
df= rcopy(tt_data["sports"])
For plotting there's lots of options. I initially used GadFly but switched to Makie recently.
Here's an example with GadFly using the data from 2022 Week 13.
This dataset comes from the 'Equity in Athletics Data Analysis', from Data is Plural So we'll want to make comparisons of sports, colleges, and genders.
Let's look at the kinds of sports there are:
@chain df begin
plot(x=:sports, Geom.histogram,
Guide.title("Types of Sports"))
end
What about gender differences in different sports, by total participation?
@chain df begin
@rsubset(:sum_partic_women > 0, :sum_partic_men > 0)
@by(:sports,
:partic_difference=(sum(:sum_partic_women) - sum(:sum_partic_men)))
@orderby(-:partic_difference)
plot(x=:partic_difference, y=:sports,
Geom.bar(orientation=:horizontal),
Guide.title("Gender Participation Differences by Sport\n(Left: Skews Male, Right: Skews Female)"))
end
What about revenues?
@chain df begin
@rsubset(:sum_partic_women > 0, :sum_partic_men > 0)
@by([:sports],
:difference=(sum(skipmissing(:rev_women)) - sum(skipmissing(:rev_men))),
:abs_difference=abs(sum(skipmissing(:rev_women)) - sum(skipmissing(:rev_men))))
@orderby(:abs_difference)
@rsubset(abs(:abs_difference)>1e6)
last(5)
plot(x=:difference, y=:sports,
Geom.bar(orientation=:horizontal),
Guide.title("Which sports have the highest absolute revenue imbalance?"))
end