Start with a table of data, we'll make up an employee database and a sales database
using CSV, DataFramesMeta, Statistics, Dates
# DataFrame(column=data)
df = DataFrame(id=1:8,
first_name=["Michael", "Dwight", "Angela", "Jim", "Pam", "Oscar", "Meredith", "Creed"],
last_name=["Scott", "Schrute", "Martin", "Halpert", "Beesly", "Nunez", "Palmer", "Bratton"],
department=["Management & Admin", "Sales", "Accounting", "Sales", "Management & Admin", "Accounting",
"Purchasing", "Purchasing"],
salary=[5100, 4200, 3750, 4300, 2200, 3400, 3300, 3200])
8 rows × 5 columns
id | first_name | last_name | department | salary | |
---|---|---|---|---|---|
Int64 | String | String | String | Int64 | |
1 | 1 | Michael | Scott | Management & Admin | 5100 |
2 | 2 | Dwight | Schrute | Sales | 4200 |
3 | 3 | Angela | Martin | Accounting | 3750 |
4 | 4 | Jim | Halpert | Sales | 4300 |
5 | 5 | Pam | Beesly | Management & Admin | 2200 |
6 | 6 | Oscar | Nunez | Accounting | 3400 |
7 | 7 | Meredith | Palmer | Purchasing | 3300 |
8 | 8 | Creed | Bratton | Purchasing | 3200 |
Let's create a sales database called db_sales
with client information (thanks to this site)
# Parse dates as Date objects
dates = ["1-2-2006", "1-29-2006", "2-1-2006", "2-14-2006", "3-1-2006", "3-20-2006"]
dates = parse.(Date, dates, dateformat"m-d-y")
db_sales = DataFrame(id=1:6,
transaction_date=dates,
employee_id=[4, 2, 4, 2, 4, 2],
quantity=[100, 500, 600, 200, 400, 250],
customer=["Dunmore High School", "Harper Collins", "Blue Cross of Pennsylvania",
"Apex Technology", "Blue Cross of Pennsylvania",
"Stone, Cooper, and Grandy: Attorneys at Law"])
db_sales |> print
6×5 DataFrame Row │ id transaction_date employee_id quantity customer │ Int64 Date Int64 Int64 String ─────┼─────────────────────────────────────────────────────────────────────────────────── 1 │ 1 2006-01-02 4 100 Dunmore High School 2 │ 2 2006-01-29 2 500 Harper Collins 3 │ 3 2006-02-01 4 600 Blue Cross of Pennsylvania 4 │ 4 2006-02-14 2 200 Apex Technology 5 │ 5 2006-03-01 4 400 Blue Cross of Pennsylvania 6 │ 6 2006-03-20 2 250 Stone, Cooper, and Grandy: Attor…
#DataFrames
subset(df,
:id => ByRow(x -> x >= 1),
:first_name => ByRow(x -> x =="Michael")
)
1 rows × 5 columns
id | first_name | last_name | department | salary | |
---|---|---|---|---|---|
Int64 | String | String | String | Int64 | |
1 | 1 | Michael | Scott | Management & Admin | 5100 |
#With DataFramesMeta it's easier to layer on complexity like do 'or' statements a little easier, plus it's more terse
@rsubset(df,
(:id == 2) | (:first_name =="Michael")
)
2 rows × 5 columns
id | first_name | last_name | department | salary | |
---|---|---|---|---|---|
Int64 | String | String | String | Int64 | |
1 | 1 | Michael | Scott | Management & Admin | 5100 |
2 | 2 | Dwight | Schrute | Sales | 4200 |
From Regex String Matching Docs: regex literals
julia> occursin(r"^\s*(?:#|$)", "# a comment")
true
Similar to pandas
pd.Series.str.contains()
, we use the base occursin()
string function here to the powerful subset
or rsubset
functions.
#DataFrames
subset(df, :department => ByRow(x -> occursin("Admin", x)))
#DataFramesMeta
@rsubset(df, occursin("Admin", :department))
2 rows × 5 columns
id | first_name | last_name | department | salary | |
---|---|---|---|---|---|
Int64 | String | String | String | Int64 | |
1 | 1 | Michael | Scott | Management & Admin | 5100 |
2 | 5 | Pam | Beesly | Management & Admin | 2200 |
Grouping and aggregating with groupby
, combine
, @combine
, and @by
gd = groupby(df, :department)
#DataFrames
combine(gd, :salary => mean => :"Average Salary",
:department => length => :count)
#DataFramesMeta
@combine(gd, :"Average Salary" = mean(:salary),
:count = length(:department))
4 rows × 3 columns
department | Average Salary | count | |
---|---|---|---|
String | Float64 | Int64 | |
1 | Management & Admin | 3650.0 | 2 |
2 | Sales | 4250.0 | 2 |
3 | Accounting | 3575.0 | 2 |
4 | Purchasing | 3250.0 | 2 |
bonus: counts
Alternatively, instead of doing groupby()
and then combine()
, you can do @by
as a shorthand:
@by(df, :department,
:"Average Salary" = mean(:salary),
:count=length(:salary))
4 rows × 3 columns
department | Average Salary | count | |
---|---|---|---|
String | Float64 | Int64 | |
1 | Management & Admin | 3650.0 | 2 |
2 | Sales | 4250.0 | 2 |
3 | Accounting | 3575.0 | 2 |
4 | Purchasing | 3250.0 | 2 |
#DataFrames
transform(gd, :salary => mean => :"Average Salary")
#DataFramesMeta
@transform(gd, :"Average Salary"=mean(:salary))
8 rows × 6 columns
id | first_name | last_name | department | salary | Average Salary | |
---|---|---|---|---|---|---|
Int64 | String | String | String | Int64 | Float64 | |
1 | 1 | Michael | Scott | Management & Admin | 5100 | 3650.0 |
2 | 2 | Dwight | Schrute | Sales | 4200 | 4250.0 |
3 | 3 | Angela | Martin | Accounting | 3750 | 3575.0 |
4 | 4 | Jim | Halpert | Sales | 4300 | 4250.0 |
5 | 5 | Pam | Beesly | Management & Admin | 2200 | 3650.0 |
6 | 6 | Oscar | Nunez | Accounting | 3400 | 3575.0 |
7 | 7 | Meredith | Palmer | Purchasing | 3300 | 3250.0 |
8 | 8 | Creed | Bratton | Purchasing | 3200 | 3250.0 |
Joins are fairly self explanatory - the on
argument is nice and flexible where you can pass it a column or pairs of columns (i.e. [:left=>:right]
) for the mapping. In the case of duplicate column names, you can set makeunique=true
to append a string when there are conflicts:
leftjoin(df, db_sales, on=[:id=>:employee_id], makeunique=true)
12 rows × 9 columns (omitted printing of 2 columns)
id | first_name | last_name | department | salary | id_1 | transaction_date | |
---|---|---|---|---|---|---|---|
Int64 | String | String | String | Int64 | Int64? | Date? | |
1 | 2 | Dwight | Schrute | Sales | 4200 | 2 | 2006-01-29 |
2 | 2 | Dwight | Schrute | Sales | 4200 | 4 | 2006-02-14 |
3 | 2 | Dwight | Schrute | Sales | 4200 | 6 | 2006-03-20 |
4 | 4 | Jim | Halpert | Sales | 4300 | 1 | 2006-01-02 |
5 | 4 | Jim | Halpert | Sales | 4300 | 3 | 2006-02-01 |
6 | 4 | Jim | Halpert | Sales | 4300 | 5 | 2006-03-01 |
7 | 1 | Michael | Scott | Management & Admin | 5100 | missing | missing |
8 | 3 | Angela | Martin | Accounting | 3750 | missing | missing |
9 | 5 | Pam | Beesly | Management & Admin | 2200 | missing | missing |
10 | 6 | Oscar | Nunez | Accounting | 3400 | missing | missing |
11 | 7 | Meredith | Palmer | Purchasing | 3300 | missing | missing |
12 | 8 | Creed | Bratton | Purchasing | 3200 | missing | missing |
sales = @chain db_sales begin
groupby(:employee_id)
@combine(:total_quantity=sum(:quantity),
:number_of_customers=length(:customer))
end
result = @chain df begin
@select(:id, :first_name, :last_name, :department)
@rsubset(:department=="Sales")
leftjoin(sales, on=[:id=>:employee_id])
@orderby(sort(:total_quantity, rev=true)) # order by descending
end
2 rows × 6 columns
id | first_name | last_name | department | total_quantity | number_of_customers | |
---|---|---|---|---|---|---|
Int64 | String | String | String | Int64? | Int64? | |
1 | 4 | Jim | Halpert | Sales | 1100 | 3 |
2 | 2 | Dwight | Schrute | Sales | 950 | 3 |