Julia is getting more and more interesting as a way to translate math into beautiful-looking code. After some halted attempts earlier at Julia 1.0 and 1.4, I'm giving it a try again at 1.6 after it seems like Julia is looking more and more stable. So here I'll write down a simple task that would be trivial to do in Python, except I'll do it in Julia instead.
I want to:
using HTTP, CSV, DataFrames, PyCall, DataFramesMeta
# Use a GET command to get some data, in this case it's a String
# r = HTTP.get("https://api.github.com/users/banditkings").body |> String;
url = "https://ocw.mit.edu/courses/sloan-school-of-management/15-071-the-analytics-edge-spring-2017/linear-regression/assignment-2/climate_change.csv";
Docstrings must be placed before a function call. On the plus side, it seems to render the docstring really nicely:
"""
Basic function call example. In this case we're using it to fetch a CSV file from
github here.
You can even use latex in here! \$f(x) = Ax + b\$
<span style="color:red">and HTML!!</span>
Parameters
----------
url : String
* url address for the CSV file that you want to read in as a dataframe
Returns
-------
dataframe
"""
function get_data(url)
body = HTTP.get(url).body
df = CSV.read(body, DataFrame);
end;
?get_data() # display the docstring we just generated
try
:catch
¶Our little get_data
function is fine but I don't want to constantly be downloading this file if I don't have to...how about something like python's try
:except
and learn a little about how Julia handles control flows?
Goal: I want to try to load from CSV file if it's available, and load from a url if it doesn't exist.
Reference: Control Flows and try
:catch
here!
try
df = CSV.read("climate_change.csv", DataFrame);
catch
println("No CSV Found, Downloading from source")
df = get_data(url);
# We can also save the csv file locally
CSV.write("climate_change.csv", df)
end;
df = CSV.read("climate_change.csv", DataFrame);
Another key difference from python is appending lists together. In python, you'll create a list like:
>>> foo = [1]
>>> foo = foo + ['a', 'b']
>>> foo
[1, 'a', 'b']
Actually why don't we use the PyCall
module to demonstrate the above? With PyCall
, you can call Python and get functions and variables out. I tried this a while back and it seemed to be really bad at supporting pandas dataframes but things like numpy arrays are fine.
# Example of using Python within a jupyter notebook cell with a Julia kernel
py"""
foo = [1]
foo = foo + ['a', 'b']
"""
# Reference an existing variable or function
py"foo"
3-element Vector{Any}: 1 "a" "b"
Now here's the same thing in Julia:
foo = Any[1]; # need to set to Any or it'll assume it's an Integer and it'll fail on the two strings there
append!(foo, ["a"])
2-element Vector{Any}: 1 "a"
!
symbol?¶Okay notice the exclamation mark !
after append
- this is special Julia notation for functions that modify something inplace. In this case, append!
adds an item to the list in place. You could use push!
if adding a single item. And there's no append
function, only append!
. How do you do something where you don't modify it in place? You can't, but you can make a copy:
bar = foo |> copy
append!(bar, ["b"])
3-element Vector{Any}: 1 "a" "b"
foo == bar
false
With Base Julia: Documentation
head
and tail
become first
and last
with Julia.
first(df, 5)
5 rows × 11 columns (omitted printing of 1 columns)
Year | Month | MEI | CO2 | CH4 | N2O | CFC-11 | CFC-12 | TSI | Aerosols | |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1983 | 5 | 2.556 | 345.96 | 1638.59 | 303.677 | 191.324 | 350.113 | 1366.1 | 0.0863 |
2 | 1983 | 6 | 2.167 | 345.52 | 1633.71 | 303.746 | 192.057 | 351.848 | 1366.12 | 0.0794 |
3 | 1983 | 7 | 1.741 | 344.15 | 1633.22 | 303.795 | 192.818 | 353.725 | 1366.29 | 0.0731 |
4 | 1983 | 8 | 1.13 | 342.25 | 1631.35 | 303.839 | 193.602 | 355.633 | 1366.42 | 0.0673 |
5 | 1983 | 9 | 0.428 | 340.17 | 1648.4 | 303.901 | 194.392 | 357.465 | 1366.23 | 0.0619 |
You can also pipe commands as well with Julia's base pipe |>
df |> last
DataFrameRow (11 columns)
Year | Month | MEI | CO2 | CH4 | N2O | CFC-11 | CFC-12 | TSI | Aerosols | |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
308 | 2008 | 12 | -0.666 | 385.56 | 1812.88 | 322.182 | 244.204 | 535.005 | 1365.69 | 0.0046 |
How about the last 5 by pipe |>
? Well it's not as good as R's dplyr
unfortunately...the base pipe function in Julia can't handle multiple parameters. There are other libraries like Pipe
and Chain
that make use of macros, but compared to dplyr
or even method chaining in Python, they both are too clunky.
Libraries like the Query from the Queryverse attempt to address this but one of the many issues of trying to do work in a new language that's decades behind Python and R.
Indexing is just like pandas like:
df[rows, columns]
# get Rows 5-10 (INCLUSIVE), and all columns
df[5:10, :]
6 rows × 11 columns (omitted printing of 1 columns)
Year | Month | MEI | CO2 | CH4 | N2O | CFC-11 | CFC-12 | TSI | Aerosols | |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1983 | 9 | 0.428 | 340.17 | 1648.4 | 303.901 | 194.392 | 357.465 | 1366.23 | 0.0619 |
2 | 1983 | 10 | 0.002 | 340.3 | 1663.79 | 303.97 | 195.171 | 359.174 | 1366.06 | 0.0569 |
3 | 1983 | 11 | -0.176 | 341.53 | 1658.23 | 304.032 | 195.921 | 360.758 | 1366.11 | 0.0524 |
4 | 1983 | 12 | -0.176 | 343.07 | 1654.31 | 304.082 | 196.609 | 362.174 | 1366.06 | 0.0486 |
5 | 1984 | 1 | -0.339 | 344.05 | 1658.98 | 304.13 | 197.219 | 363.359 | 1365.43 | 0.0451 |
6 | 1984 | 2 | -0.565 | 344.77 | 1656.48 | 304.194 | 197.759 | 364.296 | 1365.66 | 0.0416 |
You can exclude columns with Not()
, and you can use regex, too. What a dream!!
# Exclude all columns that have a capital C in it
df[1, Not(r"C.*")]
DataFrameRow (7 columns)
Year | Month | MEI | N2O | TSI | Aerosols | Temp | |
---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1983 | 5 | 2.556 | 303.677 | 1366.1 | 0.0863 | 0.109 |
Checking out the data:
size(df)
: To get dataframe dimensions you need to use size
instead of length
or len
or shape
:describe(df)
: Gives you summary infosize(df)
(308, 11)
describe(df)
11 rows × 7 columns
variable | mean | min | median | max | nmissing | eltype | |
---|---|---|---|---|---|---|---|
Symbol | Float64 | Real | Float64 | Real | Int64 | DataType | |
1 | Year | 1995.66 | 1983 | 1996.0 | 2008 | 0 | Int64 |
2 | Month | 6.55195 | 1 | 7.0 | 12 | 0 | Int64 |
3 | MEI | 0.275555 | -1.635 | 0.2375 | 3.001 | 0 | Float64 |
4 | CO2 | 363.227 | 340.17 | 361.735 | 388.5 | 0 | Float64 |
5 | CH4 | 1749.82 | 1629.89 | 1764.04 | 1814.18 | 0 | Float64 |
6 | N2O | 312.392 | 303.677 | 311.507 | 322.182 | 0 | Float64 |
7 | CFC-11 | 251.973 | 191.324 | 258.344 | 271.494 | 0 | Float64 |
8 | CFC-12 | 497.525 | 350.113 | 528.356 | 543.813 | 0 | Float64 |
9 | TSI | 1366.07 | 1365.43 | 1365.98 | 1367.32 | 0 | Float64 |
10 | Aerosols | 0.0166571 | 0.0016 | 0.00575 | 0.1494 | 0 | Float64 |
11 | Temp | 0.256776 | -0.282 | 0.248 | 0.739 | 0 | Float64 |
How do you conditionally filter a dataframe? One way to do it is through indexing, similar to how you would do it in Pandas.
Note that you can reference a dataframe column (like a pandas series) both with and without quotes, so df.Year
is the same as df."Year"
. Handy for dealing with spaces!
df[(df.Year.==1985) .& (df."Month".<=4), :]
4 rows × 11 columns (omitted printing of 1 columns)
Year | Month | MEI | CO2 | CH4 | N2O | CFC-11 | CFC-12 | TSI | Aerosols | |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1985 | 1 | -0.561 | 345.25 | 1663.42 | 305.301 | 208.537 | 373.623 | 1365.61 | 0.0164 |
2 | 1985 | 2 | -0.602 | 346.06 | 1666.21 | 305.243 | 209.543 | 374.681 | 1365.71 | 0.016 |
3 | 1985 | 3 | -0.737 | 347.66 | 1678.34 | 305.165 | 210.368 | 376.004 | 1365.66 | 0.0141 |
4 | 1985 | 4 | -0.484 | 348.2 | 1675.24 | 305.093 | 211.111 | 377.635 | 1365.51 | 0.0138 |
.
doing in front of the <=
, ==
and &
sign?¶The .
sign there is doing some heavy lifting. It's an 'elementwise' operator, meaning that it does the comparison across every element in the vector df.Year
and df.Month
. If you don't have them this is what it would return for the df.Year
comparison:
df.Year==1985 # Returns true if ALL values of Year are equal to 1985
false
When what we really want is something like a list (vector) of values that compares each element and returns either true
or false
for each element in each vector.
println((df.Year.==1985 )[1:5])
Bool[0, 0, 0, 0, 0]
Another example
println(([1,1,0].==[1,0,1]))
Bool[1, 0, 0]
println((df.Year.==1985) .& (df."Month".<=4))
Bool[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
DataFrames
also has a filter
and subset
option to index based on row values, but they're a little messy
The filter(f, df)
function lets you filter by passing in a function, like:
filter(:Year=> ==(1985), df) |> first
DataFrameRow (11 columns)
Year | Month | MEI | CO2 | CH4 | N2O | CFC-11 | CFC-12 | TSI | Aerosols | |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1985 | 1 | -0.561 | 345.25 | 1663.42 | 305.301 | 208.537 | 373.623 | 1365.61 | 0.0164 |
:
doing in front of Year
?¶Another nuance with Julia is this little colon in front of a column name, and that's because Year
is a Symbol
type. This is a special thing you can do for performance, but you can still index columns by string too - handy if your column name has spaces in it.
filter(:"Year"=> ==(1985), df) |> first
DataFrameRow (11 columns)
Year | Month | MEI | CO2 | CH4 | N2O | CFC-11 | CFC-12 | TSI | Aerosols | |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1985 | 1 | -0.561 | 345.25 | 1663.42 | 305.301 | 208.537 | 373.623 | 1365.61 | 0.0164 |
=>
doing after the column?¶More nuance, and this is usually the part where I go back to Pandas or Dplyr. What's actually happening here is that it's assigning a Pair
like key=>value
.
The row logic makes sense to me more, but it's not as performant
@time begin
filter(row -> (row.Year in [1983, 1984]), df) |> first
end
0.190525 seconds (834.12 k allocations: 43.820 MiB, 99.68% compilation time)
DataFrameRow (11 columns)
Year | Month | MEI | CO2 | CH4 | N2O | CFC-11 | CFC-12 | TSI | Aerosols | |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1983 | 5 | 2.556 | 345.96 | 1638.59 | 303.677 | 191.324 | 350.113 | 1366.1 | 0.0863 |
Remember that ->
is an anonymous function and filter
is expecting a function as its first argument.
filter
and subset
: julia for data scienceThis row filtering is equivalent to:
@time begin
df[(df.Year .==1983) .| (df.Year .==1984), :] |> first
end
0.098306 seconds (370.27 k allocations: 18.121 MiB, 99.74% compilation time)
DataFrameRow (11 columns)
Year | Month | MEI | CO2 | CH4 | N2O | CFC-11 | CFC-12 | TSI | Aerosols | |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1983 | 5 | 2.556 | 345.96 | 1638.59 | 303.677 | 191.324 | 350.113 | 1366.1 | 0.0863 |
Or also:
# using a list comprehension works for the `in` operator here
@time begin
df[[ year in [1983,1984] for year in df.Year], :] |> first
end
0.071469 seconds (159.67 k allocations: 8.747 MiB, 99.50% compilation time)
DataFrameRow (11 columns)
Year | Month | MEI | CO2 | CH4 | N2O | CFC-11 | CFC-12 | TSI | Aerosols | |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1983 | 5 | 2.556 | 345.96 | 1638.59 | 303.677 | 191.324 | 350.113 | 1366.1 | 0.0863 |
And you can combine indexing like this:
@time begin
filter(row -> (row.Year in [1983, 1984]) & (row.Month <= 4), df)
end
0.084675 seconds (455.85 k allocations: 23.343 MiB, 99.34% compilation time)
4 rows × 11 columns (omitted printing of 1 columns)
Year | Month | MEI | CO2 | CH4 | N2O | CFC-11 | CFC-12 | TSI | Aerosols | |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1984 | 1 | -0.339 | 344.05 | 1658.98 | 304.13 | 197.219 | 363.359 | 1365.43 | 0.0451 |
2 | 1984 | 2 | -0.565 | 344.77 | 1656.48 | 304.194 | 197.759 | 364.296 | 1365.66 | 0.0416 |
3 | 1984 | 3 | 0.131 | 345.46 | 1655.77 | 304.285 | 198.249 | 365.044 | 1366.17 | 0.0383 |
4 | 1984 | 4 | 0.331 | 346.77 | 1657.68 | 304.389 | 198.723 | 365.692 | 1365.57 | 0.0352 |
Or equivalently with plain indexing:
# using a list comprehension works because there's no elementwise `.in` or `.∈` operator here
@time begin
condition1 = [year in [1983,1984] for year in df.Year]
condition2 = [month <= 5 for month in df.Month]
df[condition1 .& condition2, :] |> first
end
0.206541 seconds (533.77 k allocations: 27.184 MiB, 13.27% gc time, 99.64% compilation time)
DataFrameRow (11 columns)
Year | Month | MEI | CO2 | CH4 | N2O | CFC-11 | CFC-12 | TSI | Aerosols | |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1983 | 5 | 2.556 | 345.96 | 1638.59 | 303.677 | 191.324 | 350.113 | 1366.1 | 0.0863 |
So the nice thing about filter
is that you can apply functions and stuff to it, like you could do calculations and then do a boolean off of the calculation, etc etc. Either way it's not too slow.
Returns a copy of the dataframe where args are true
subset(df, :Year=> ByRow(==(1983)))
8 rows × 11 columns (omitted printing of 1 columns)
Year | Month | MEI | CO2 | CH4 | N2O | CFC-11 | CFC-12 | TSI | Aerosols | |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1983 | 5 | 2.556 | 345.96 | 1638.59 | 303.677 | 191.324 | 350.113 | 1366.1 | 0.0863 |
2 | 1983 | 6 | 2.167 | 345.52 | 1633.71 | 303.746 | 192.057 | 351.848 | 1366.12 | 0.0794 |
3 | 1983 | 7 | 1.741 | 344.15 | 1633.22 | 303.795 | 192.818 | 353.725 | 1366.29 | 0.0731 |
4 | 1983 | 8 | 1.13 | 342.25 | 1631.35 | 303.839 | 193.602 | 355.633 | 1366.42 | 0.0673 |
5 | 1983 | 9 | 0.428 | 340.17 | 1648.4 | 303.901 | 194.392 | 357.465 | 1366.23 | 0.0619 |
6 | 1983 | 10 | 0.002 | 340.3 | 1663.79 | 303.97 | 195.171 | 359.174 | 1366.06 | 0.0569 |
7 | 1983 | 11 | -0.176 | 341.53 | 1658.23 | 304.032 | 195.921 | 360.758 | 1366.11 | 0.0524 |
8 | 1983 | 12 | -0.176 | 343.07 | 1654.31 | 304.082 | 196.609 | 362.174 | 1366.06 | 0.0486 |
To do multiple conditions (including an OR operator) we need to use a function. Both filter
and subset
are powerful because of the ability to pass functions as arguments.
function my_filter(year, month)
year_match = (year in [1983, 1984])
month_match = (month<=4)
return year_match & month_match
end;
# apply this filter for each row
subset(df, [:Year, :Month]=> ByRow(my_filter))
4 rows × 11 columns (omitted printing of 1 columns)
Year | Month | MEI | CO2 | CH4 | N2O | CFC-11 | CFC-12 | TSI | Aerosols | |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1984 | 1 | -0.339 | 344.05 | 1658.98 | 304.13 | 197.219 | 363.359 | 1365.43 | 0.0451 |
2 | 1984 | 2 | -0.565 | 344.77 | 1656.48 | 304.194 | 197.759 | 364.296 | 1365.66 | 0.0416 |
3 | 1984 | 3 | 0.131 | 345.46 | 1655.77 | 304.285 | 198.249 | 365.044 | 1366.17 | 0.0383 |
4 | 1984 | 4 | 0.331 | 346.77 | 1657.68 | 304.389 | 198.723 | 365.692 | 1365.57 | 0.0352 |
subset
is nice because it can handle missing values, whereas filter
won't. So in short, for most of your use cases you should stick with indexing or using subset
.
@time begin
subset(df, [:Year, :Month]=> ByRow(my_filter))
end
0.000168 seconds (507 allocations: 39.328 KiB)
4 rows × 11 columns (omitted printing of 1 columns)
Year | Month | MEI | CO2 | CH4 | N2O | CFC-11 | CFC-12 | TSI | Aerosols | |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1984 | 1 | -0.339 | 344.05 | 1658.98 | 304.13 | 197.219 | 363.359 | 1365.43 | 0.0451 |
2 | 1984 | 2 | -0.565 | 344.77 | 1656.48 | 304.194 | 197.759 | 364.296 | 1365.66 | 0.0416 |
3 | 1984 | 3 | 0.131 | 345.46 | 1655.77 | 304.285 | 198.249 | 365.044 | 1366.17 | 0.0383 |
4 | 1984 | 4 | 0.331 | 346.77 | 1657.68 | 304.389 | 198.723 | 365.692 | 1365.57 | 0.0352 |
@subset
¶Instead of the filter and subset stuff in base dataframes, we can use DataFramesMeta:
@subset(df, (:Year .<1984), (:Month .==5))
1 rows × 11 columns (omitted printing of 1 columns)
Year | Month | MEI | CO2 | CH4 | N2O | CFC-11 | CFC-12 | TSI | Aerosols | |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1983 | 5 | 2.556 | 345.96 | 1638.59 | 303.677 | 191.324 | 350.113 | 1366.1 | 0.0863 |
You can do @rsubset
to filter across rows so you don't need to do the elementwise .
every time:
@rsubset(df, :Year<1984, :Month==5)
1 rows × 11 columns (omitted printing of 1 columns)
Year | Month | MEI | CO2 | CH4 | N2O | CFC-11 | CFC-12 | TSI | Aerosols | |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1983 | 5 | 2.556 | 345.96 | 1638.59 | 303.677 | 191.324 | 350.113 | 1366.1 | 0.0863 |
So in the above example we are using an AND statement, it'll keep doing the ANDs across each item there. So if you want to do something fancier like an OR statement you'll need to wrap it in parenthesis:
@time begin
@rsubset(df, (:Year<1984 && :Month==5) || (:Year==1985 && :Month==3))
end
0.134172 seconds (490.13 k allocations: 26.135 MiB, 99.18% compilation time)
2 rows × 11 columns (omitted printing of 1 columns)
Year | Month | MEI | CO2 | CH4 | N2O | CFC-11 | CFC-12 | TSI | Aerosols | |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1983 | 5 | 2.556 | 345.96 | 1638.59 | 303.677 | 191.324 | 350.113 | 1366.1 | 0.0863 |
2 | 1985 | 3 | -0.737 | 347.66 | 1678.34 | 305.165 | 210.368 | 376.004 | 1365.66 | 0.0141 |
Or you can always pass a custom function too, just like before (this is fast!)
@time begin
@rsubset(df, my_filter(:Year, :Month))
end
0.000125 seconds (508 allocations: 39.375 KiB)
4 rows × 11 columns (omitted printing of 1 columns)
Year | Month | MEI | CO2 | CH4 | N2O | CFC-11 | CFC-12 | TSI | Aerosols | |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1984 | 1 | -0.339 | 344.05 | 1658.98 | 304.13 | 197.219 | 363.359 | 1365.43 | 0.0451 |
2 | 1984 | 2 | -0.565 | 344.77 | 1656.48 | 304.194 | 197.759 | 364.296 | 1365.66 | 0.0416 |
3 | 1984 | 3 | 0.131 | 345.46 | 1655.77 | 304.285 | 198.249 | 365.044 | 1366.17 | 0.0383 |
4 | 1984 | 4 | 0.331 | 346.77 | 1657.68 | 304.389 | 198.723 | 365.692 | 1365.57 | 0.0352 |
Nice, it's super fast too! I think this is the way.
Instead of the clunky filtering and indexing, you can try out some of the querying libraries that Julia has to offer. They are all going to be different than pandas and dplyr, but that might not be a bad thing. The downside is that there isn't one obvious/clear standard yet in Julia, so for the practical business user or accidental coder it's still tough sledding.
It's November 2021 and we've had stuff like Query and Queryverse for a while now, with DataFramesMeta in the mix. I'll show off @chain
macro stuff here from the DataFramesMeta package, and we've already been familiar with @rsubset
from earlier.
df2 = rename(df,:CO2 => "Carbon Dioxide");
But there's more! Here's a guide for people coming from dplyr from the DataFramesMeta folks.
@chain df2 begin
@rsubset(:Year==1985, :Month<=4)
@select(:MEI, :Aerosols, :"Carbon Dioxide")
end
4 rows × 3 columns
MEI | Aerosols | Carbon Dioxide | |
---|---|---|---|
Float64 | Float64 | Float64 | |
1 | -0.561 | 0.0164 | 345.25 |
2 | -0.602 | 0.016 | 346.06 |
3 | -0.737 | 0.0141 | 347.66 |
4 | -0.484 | 0.0138 | 348.2 |
Using our previous OR filter:
@chain df2 begin
@rsubset(my_filter(:Year, :Month))
@select(:MEI, :Aerosols, :"Carbon Dioxide")
end
4 rows × 3 columns
MEI | Aerosols | Carbon Dioxide | |
---|---|---|---|
Float64 | Float64 | Float64 | |
1 | -0.339 | 0.0451 | 344.05 |
2 | -0.565 | 0.0416 | 344.77 |
3 | 0.131 | 0.0383 | 345.46 |
4 | 0.331 | 0.0352 | 346.77 |