Julia is getting more and more interesting as a way to translate math into beautiful-looking code. After some halted attempts earlier at Julia 1.0 and 1.4, I'm giving it a try again at 1.6 after it seems like Julia is looking more and more stable. So here I'll write down a simple task that would be trivial to do in Python, except I'll do it in Julia instead.

I want to:

  1. Read a CSV file from the web (i.e. in a github repository) as a dataframe
  2. Do some data cleaning - text replacement, fill na's, etc

Load a CSV File

In [1]:
using HTTP, CSV, DataFrames, PyCall, DataFramesMeta
In [2]:
# Use a GET command to get some data, in this case it's a String
# r = HTTP.get("https://api.github.com/users/banditkings").body |> String;
In [3]:
url = "https://ocw.mit.edu/courses/sloan-school-of-management/15-071-the-analytics-edge-spring-2017/linear-regression/assignment-2/climate_change.csv";

Docstrings must be placed before a function call. On the plus side, it seems to render the docstring really nicely:

In [4]:
"""
Basic function call example. In this case we're using it to fetch a CSV file from
github here.

You can even use latex in here! \$f(x) = Ax + b\$

<span style="color:red">and HTML!!</span>

Parameters
----------
url : String
* url address for the CSV file that you want to read in as a dataframe

Returns
-------
dataframe
"""    
function get_data(url)
    body = HTTP.get(url).body
    df = CSV.read(body, DataFrame);
end;
In [5]:
?get_data() # display the docstring we just generated
Out[5]:

Basic function call example. In this case we're using it to fetch a CSV file from github here.

You can even use latex in here! $f(x) = Ax + b$

and HTML!!

Parameters

url : String

  • url address for the CSV file that you want to read in as a dataframe

Returns

dataframe


Control Flows and try:catch

Our little get_data function is fine but I don't want to constantly be downloading this file if I don't have to...how about something like python's try:except and learn a little about how Julia handles control flows?

  • Goal: I want to try to load from CSV file if it's available, and load from a url if it doesn't exist.

  • Reference: Control Flows and try:catch here!

In [6]:
try
    df = CSV.read("climate_change.csv", DataFrame);
catch
    println("No CSV Found, Downloading from source")
    df = get_data(url);
    # We can also save the csv file locally
    CSV.write("climate_change.csv", df)
end;

df = CSV.read("climate_change.csv", DataFrame);

Python vs Julia: appending an item to a list (and PyCall in Jupyter)

Another key difference from python is appending lists together. In python, you'll create a list like:

>>> foo = [1]
>>> foo = foo + ['a', 'b']
>>> foo
[1, 'a', 'b']

Actually why don't we use the PyCall module to demonstrate the above? With PyCall, you can call Python and get functions and variables out. I tried this a while back and it seemed to be really bad at supporting pandas dataframes but things like numpy arrays are fine.

In [7]:
# Example of using Python within a jupyter notebook cell with a Julia kernel

py"""
foo = [1]
foo = foo + ['a', 'b']
"""

# Reference an existing variable or function
py"foo"
Out[7]:
3-element Vector{Any}:
 1
  "a"
  "b"

Now here's the same thing in Julia:

In [8]:
foo = Any[1]; # need to set to Any or it'll assume it's an Integer and it'll fail on the two strings there
append!(foo, ["a"])
Out[8]:
2-element Vector{Any}:
 1
  "a"

What is that ! symbol?

Okay notice the exclamation mark ! after append - this is special Julia notation for functions that modify something inplace. In this case, append! adds an item to the list in place. You could use push! if adding a single item. And there's no append function, only append!. How do you do something where you don't modify it in place? You can't, but you can make a copy:

In [9]:
bar = foo |> copy
append!(bar, ["b"])
Out[9]:
3-element Vector{Any}:
 1
  "a"
  "b"
In [10]:
foo == bar
Out[10]:
false

DataFrame Manipulation

With Base Julia: Documentation

head and tail become first and last with Julia.

In [11]:
first(df, 5)
Out[11]:

5 rows × 11 columns (omitted printing of 1 columns)

YearMonthMEICO2CH4N2OCFC-11CFC-12TSIAerosols
Int64Int64Float64Float64Float64Float64Float64Float64Float64Float64
1198352.556345.961638.59303.677191.324350.1131366.10.0863
2198362.167345.521633.71303.746192.057351.8481366.120.0794
3198371.741344.151633.22303.795192.818353.7251366.290.0731
4198381.13342.251631.35303.839193.602355.6331366.420.0673
5198390.428340.171648.4303.901194.392357.4651366.230.0619

You can also pipe commands as well with Julia's base pipe |>

In [12]:
df |> last
Out[12]:

DataFrameRow (11 columns)

YearMonthMEICO2CH4N2OCFC-11CFC-12TSIAerosols
Int64Int64Float64Float64Float64Float64Float64Float64Float64Float64
308200812-0.666385.561812.88322.182244.204535.0051365.690.0046

How about the last 5 by pipe |>? Well it's not as good as R's dplyr unfortunately...the base pipe function in Julia can't handle multiple parameters. There are other libraries like Pipe and Chain that make use of macros, but compared to dplyr or even method chaining in Python, they both are too clunky.

Libraries like the Query from the Queryverse attempt to address this but one of the many issues of trying to do work in a new language that's decades behind Python and R.

Indexing and Slicing

Indexing is just like pandas like:

df[rows, columns]
In [13]:
# get Rows 5-10 (INCLUSIVE), and all columns
df[5:10, :]
Out[13]:

6 rows × 11 columns (omitted printing of 1 columns)

YearMonthMEICO2CH4N2OCFC-11CFC-12TSIAerosols
Int64Int64Float64Float64Float64Float64Float64Float64Float64Float64
1198390.428340.171648.4303.901194.392357.4651366.230.0619
21983100.002340.31663.79303.97195.171359.1741366.060.0569
3198311-0.176341.531658.23304.032195.921360.7581366.110.0524
4198312-0.176343.071654.31304.082196.609362.1741366.060.0486
519841-0.339344.051658.98304.13197.219363.3591365.430.0451
619842-0.565344.771656.48304.194197.759364.2961365.660.0416

You can exclude columns with Not(), and you can use regex, too. What a dream!!

In [14]:
# Exclude all columns that have a capital C in it
df[1, Not(r"C.*")]
Out[14]:

DataFrameRow (7 columns)

YearMonthMEIN2OTSIAerosolsTemp
Int64Int64Float64Float64Float64Float64Float64
1198352.556303.6771366.10.08630.109

Checking out the data:

  • size(df) : To get dataframe dimensions you need to use size instead of length or len or shape:
  • describe(df) : Gives you summary info
In [15]:
size(df)
Out[15]:
(308, 11)
In [16]:
describe(df)
Out[16]:

11 rows × 7 columns

variablemeanminmedianmaxnmissingeltype
SymbolFloat64RealFloat64RealInt64DataType
1Year1995.6619831996.020080Int64
2Month6.5519517.0120Int64
3MEI0.275555-1.6350.23753.0010Float64
4CO2363.227340.17361.735388.50Float64
5CH41749.821629.891764.041814.180Float64
6N2O312.392303.677311.507322.1820Float64
7CFC-11251.973191.324258.344271.4940Float64
8CFC-12497.525350.113528.356543.8130Float64
9TSI1366.071365.431365.981367.320Float64
10Aerosols0.01665710.00160.005750.14940Float64
11Temp0.256776-0.2820.2480.7390Float64

Filtering DataFrames Based on Conditions

How do you conditionally filter a dataframe? One way to do it is through indexing, similar to how you would do it in Pandas.

Note that you can reference a dataframe column (like a pandas series) both with and without quotes, so df.Year is the same as df."Year". Handy for dealing with spaces!

In [17]:
df[(df.Year.==1985) .& (df."Month".<=4), :]
Out[17]:

4 rows × 11 columns (omitted printing of 1 columns)

YearMonthMEICO2CH4N2OCFC-11CFC-12TSIAerosols
Int64Int64Float64Float64Float64Float64Float64Float64Float64Float64
119851-0.561345.251663.42305.301208.537373.6231365.610.0164
219852-0.602346.061666.21305.243209.543374.6811365.710.016
319853-0.737347.661678.34305.165210.368376.0041365.660.0141
419854-0.484348.21675.24305.093211.111377.6351365.510.0138

What is that . doing in front of the <=, == and & sign?

The . sign there is doing some heavy lifting. It's an 'elementwise' operator, meaning that it does the comparison across every element in the vector df.Year and df.Month. If you don't have them this is what it would return for the df.Year comparison:

In [18]:
df.Year==1985 # Returns true if ALL values of Year are equal to 1985
Out[18]:
false

When what we really want is something like a list (vector) of values that compares each element and returns either true or false for each element in each vector.

In [19]:
println((df.Year.==1985 )[1:5])
Bool[0, 0, 0, 0, 0]

Another example

In [20]:
println(([1,1,0].==[1,0,1]))
Bool[1, 0, 0]
In [21]:
println((df.Year.==1985) .& (df."Month".<=4))
Bool[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

DataFrames also has a filter and subset option to index based on row values, but they're a little messy

The filter(f, df) function lets you filter by passing in a function, like:

In [22]:
filter(:Year=> ==(1985), df) |> first
Out[22]:

DataFrameRow (11 columns)

YearMonthMEICO2CH4N2OCFC-11CFC-12TSIAerosols
Int64Int64Float64Float64Float64Float64Float64Float64Float64Float64
119851-0.561345.251663.42305.301208.537373.6231365.610.0164

What is the colon : doing in front of Year?

Another nuance with Julia is this little colon in front of a column name, and that's because Year is a Symbol type. This is a special thing you can do for performance, but you can still index columns by string too - handy if your column name has spaces in it.

In [23]:
filter(:"Year"=> ==(1985), df) |> first
Out[23]:

DataFrameRow (11 columns)

YearMonthMEICO2CH4N2OCFC-11CFC-12TSIAerosols
Int64Int64Float64Float64Float64Float64Float64Float64Float64Float64
119851-0.561345.251663.42305.301208.537373.6231365.610.0164

What the heck is the => doing after the column?

More nuance, and this is usually the part where I go back to Pandas or Dplyr. What's actually happening here is that it's assigning a Pair like key=>value.

Filtering Rows

The row logic makes sense to me more, but it's not as performant

In [24]:
@time begin
    filter(row -> (row.Year in [1983, 1984]), df) |> first
end
  0.209174 seconds (834.07 k allocations: 43.821 MiB, 99.62% compilation time)
Out[24]:

DataFrameRow (11 columns)

YearMonthMEICO2CH4N2OCFC-11CFC-12TSIAerosols
Int64Int64Float64Float64Float64Float64Float64Float64Float64Float64
1198352.556345.961638.59303.677191.324350.1131366.10.0863

Remember that -> is an anonymous function and filter is expecting a function as its first argument.

This row filtering is equivalent to:

In [25]:
@time begin
    df[(df.Year .==1983) .| (df.Year .==1984), :] |> first
end
  0.143445 seconds (370.22 k allocations: 18.119 MiB, 22.80% gc time, 99.72% compilation time)
Out[25]:

DataFrameRow (11 columns)

YearMonthMEICO2CH4N2OCFC-11CFC-12TSIAerosols
Int64Int64Float64Float64Float64Float64Float64Float64Float64Float64
1198352.556345.961638.59303.677191.324350.1131366.10.0863

Or also:

In [26]:
# using a list comprehension works for the `in` operator here
@time begin
    df[[ year in [1983,1984] for year in df.Year], :] |> first
end
  0.082639 seconds (159.67 k allocations: 8.747 MiB, 99.63% compilation time)
Out[26]:

DataFrameRow (11 columns)

YearMonthMEICO2CH4N2OCFC-11CFC-12TSIAerosols
Int64Int64Float64Float64Float64Float64Float64Float64Float64Float64
1198352.556345.961638.59303.677191.324350.1131366.10.0863

And you can combine indexing like this:

In [27]:
@time begin 
    filter(row -> (row.Year in [1983, 1984]) & (row.Month <= 4), df)
end
  0.095871 seconds (455.86 k allocations: 23.344 MiB, 99.63% compilation time)
Out[27]:

4 rows × 11 columns (omitted printing of 1 columns)

YearMonthMEICO2CH4N2OCFC-11CFC-12TSIAerosols
Int64Int64Float64Float64Float64Float64Float64Float64Float64Float64
119841-0.339344.051658.98304.13197.219363.3591365.430.0451
219842-0.565344.771656.48304.194197.759364.2961365.660.0416
3198430.131345.461655.77304.285198.249365.0441366.170.0383
4198440.331346.771657.68304.389198.723365.6921365.570.0352

Or equivalently with plain indexing:

In [28]:
# using a list comprehension works because there's no elementwise `.in` or `.∈` operator here
@time begin
    condition1 = [year in [1983,1984] for year in df.Year]
    condition2 = [month <= 5 for month in df.Month]
    df[condition1 .& condition2, :] |> first
end
  0.167747 seconds (533.72 k allocations: 27.179 MiB, 99.48% compilation time)
Out[28]:

DataFrameRow (11 columns)

YearMonthMEICO2CH4N2OCFC-11CFC-12TSIAerosols
Int64Int64Float64Float64Float64Float64Float64Float64Float64Float64
1198352.556345.961638.59303.677191.324350.1131366.10.0863

So the nice thing about filter is that you can apply functions and stuff to it, like you could do calculations and then do a boolean off of the calculation, etc etc. Either way it's not too slow.

Subset(df, args)

Returns a copy of the dataframe where args are true

In [29]:
subset(df, :Year=> ByRow(==(1983)))
Out[29]:

8 rows × 11 columns (omitted printing of 1 columns)

YearMonthMEICO2CH4N2OCFC-11CFC-12TSIAerosols
Int64Int64Float64Float64Float64Float64Float64Float64Float64Float64
1198352.556345.961638.59303.677191.324350.1131366.10.0863
2198362.167345.521633.71303.746192.057351.8481366.120.0794
3198371.741344.151633.22303.795192.818353.7251366.290.0731
4198381.13342.251631.35303.839193.602355.6331366.420.0673
5198390.428340.171648.4303.901194.392357.4651366.230.0619
61983100.002340.31663.79303.97195.171359.1741366.060.0569
7198311-0.176341.531658.23304.032195.921360.7581366.110.0524
8198312-0.176343.071654.31304.082196.609362.1741366.060.0486

Subset with multiple conditions

To do multiple conditions (including an OR operator) we need to use a function. Both filter and subset are powerful because of the ability to pass functions as arguments.

In [30]:
function my_filter(year, month)
    year_match = (year in [1983, 1984])
    month_match = (month<=4)
    return year_match & month_match
end;

# apply this filter for each row
subset(df, [:Year, :Month]=> ByRow(my_filter))
Out[30]:

4 rows × 11 columns (omitted printing of 1 columns)

YearMonthMEICO2CH4N2OCFC-11CFC-12TSIAerosols
Int64Int64Float64Float64Float64Float64Float64Float64Float64Float64
119841-0.339344.051658.98304.13197.219363.3591365.430.0451
219842-0.565344.771656.48304.194197.759364.2961365.660.0416
3198430.131345.461655.77304.285198.249365.0441366.170.0383
4198440.331346.771657.68304.389198.723365.6921365.570.0352

subset is nice because it can handle missing values, whereas filter won't. So in short, for most of your use cases you should stick with indexing or using subset.

In [31]:
@time begin
    subset(df, [:Year, :Month]=> ByRow(my_filter))
end
  0.000166 seconds (507 allocations: 39.328 KiB)
Out[31]:

4 rows × 11 columns (omitted printing of 1 columns)

YearMonthMEICO2CH4N2OCFC-11CFC-12TSIAerosols
Int64Int64Float64Float64Float64Float64Float64Float64Float64Float64
119841-0.339344.051658.98304.13197.219363.3591365.430.0451
219842-0.565344.771656.48304.194197.759364.2961365.660.0416
3198430.131345.461655.77304.285198.249365.0441366.170.0383
4198440.331346.771657.68304.389198.723365.6921365.570.0352

@subset

Instead of the filter and subset stuff in base dataframes, we can use DataFramesMeta:

In [32]:
@subset(df, (:Year .<1984), (:Month .==5))
Out[32]:

1 rows × 11 columns (omitted printing of 1 columns)

YearMonthMEICO2CH4N2OCFC-11CFC-12TSIAerosols
Int64Int64Float64Float64Float64Float64Float64Float64Float64Float64
1198352.556345.961638.59303.677191.324350.1131366.10.0863

You can do @rsubset to filter across rows so you don't need to do the elementwise . every time:

In [33]:
@rsubset(df, :Year<1984, :Month==5)
Out[33]:

1 rows × 11 columns (omitted printing of 1 columns)

YearMonthMEICO2CH4N2OCFC-11CFC-12TSIAerosols
Int64Int64Float64Float64Float64Float64Float64Float64Float64Float64
1198352.556345.961638.59303.677191.324350.1131366.10.0863

So in the above example we are using an AND statement, it'll keep doing the ANDs across each item there. So if you want to do something fancier like an OR statement you'll need to have that in a function and pass the function in to @rsubset:

In [34]:
@time begin
    @rsubset(df, my_filter(:Year, :Month))
end
  0.106380 seconds (275.40 k allocations: 13.698 MiB, 25.59% gc time, 99.50% compilation time)
Out[34]:

4 rows × 11 columns (omitted printing of 1 columns)

YearMonthMEICO2CH4N2OCFC-11CFC-12TSIAerosols
Int64Int64Float64Float64Float64Float64Float64Float64Float64Float64
119841-0.339344.051658.98304.13197.219363.3591365.430.0451
219842-0.565344.771656.48304.194197.759364.2961365.660.0416
3198430.131345.461655.77304.285198.249365.0441366.170.0383
4198440.331346.771657.68304.389198.723365.6921365.570.0352

Nice, it's super fast too! I think this is the way.

@chain from DataFramesMeta

Instead of the clunky filtering and indexing, you can try out some of the querying libraries that Julia has to offer. They are all going to be different than pandas and dplyr, but that might not be a bad thing. The downside is that there isn't one obvious/clear standard yet in Julia, so for the practical business user or accidental coder it's still tough sledding.

It's November 2021 and we've had stuff like Query and Queryverse for a while now, with DataFramesMeta in the mix. I'll show off @chain macro stuff here from the DataFramesMeta package, and we've already been familiar with @rsubset from earlier.

In [35]:
df2 = rename(df,:CO2 => "Carbon Dioxide");

But there's more! Here's a guide for people coming from dplyr from the DataFramesMeta folks.

In [36]:
@chain df2 begin
    @rsubset(:Year==1985, :Month<=4)
    @select(:MEI, :Aerosols, :"Carbon Dioxide")
end
Out[36]:

4 rows × 3 columns

MEIAerosolsCarbon Dioxide
Float64Float64Float64
1-0.5610.0164345.25
2-0.6020.016346.06
3-0.7370.0141347.66
4-0.4840.0138348.2

Using our previous OR filter:

In [37]:
@chain df2 begin
    @rsubset(my_filter(:Year, :Month))
    @select(:MEI, :Aerosols, :"Carbon Dioxide")
end
Out[37]:

4 rows × 3 columns

MEIAerosolsCarbon Dioxide
Float64Float64Float64
1-0.3390.0451344.05
2-0.5650.0416344.77
30.1310.0383345.46
40.3310.0352346.77
In [ ]: