Start with a table of data, we'll make up an employee database and a sales database

In [1]:
using CSV, DataFramesMeta, Statistics, Dates
In [2]:
# DataFrame(column=data)
df = DataFrame(id=1:8,
               first_name=["Michael", "Dwight", "Angela", "Jim", "Pam", "Oscar", "Meredith", "Creed"],
               last_name=["Scott", "Schrute", "Martin", "Halpert", "Beesly", "Nunez", "Palmer", "Bratton"],
               department=["Management & Admin", "Sales", "Accounting", "Sales", "Management & Admin", "Accounting", 
                           "Purchasing", "Purchasing"],
               salary=[5100, 4200, 3750, 4300, 2200, 3400, 3300, 3200])
Out[2]:

8 rows × 5 columns

idfirst_namelast_namedepartmentsalary
Int64StringStringStringInt64
11MichaelScottManagement & Admin5100
22DwightSchruteSales4200
33AngelaMartinAccounting3750
44JimHalpertSales4300
55PamBeeslyManagement & Admin2200
66OscarNunezAccounting3400
77MeredithPalmerPurchasing3300
88CreedBrattonPurchasing3200

Let's create a sales database called db_sales with client information (thanks to this site)

In [3]:
# Parse dates as Date objects
dates = ["1-2-2006", "1-29-2006", "2-1-2006", "2-14-2006", "3-1-2006", "3-20-2006"]
dates = parse.(Date, dates, dateformat"m-d-y")

db_sales = DataFrame(id=1:6,
                     transaction_date=dates,
                     employee_id=[4, 2, 4, 2, 4, 2],
                     quantity=[100, 500, 600, 200, 400, 250],
                     customer=["Dunmore High School", "Harper Collins", "Blue Cross of Pennsylvania", 
                                "Apex Technology", "Blue Cross of Pennsylvania", 
                                "Stone, Cooper, and Grandy: Attorneys at Law"])

db_sales |> print
6×5 DataFrame
 Row  id     transaction_date  employee_id  quantity  customer                          
      Int64  Date              Int64        Int64     String                            
─────┼───────────────────────────────────────────────────────────────────────────────────
   1 │     1  2006-01-02                  4       100  Dunmore High School
   2 │     2  2006-01-29                  2       500  Harper Collins
   3 │     3  2006-02-01                  4       600  Blue Cross of Pennsylvania
   4 │     4  2006-02-14                  2       200  Apex Technology
   5 │     5  2006-03-01                  4       400  Blue Cross of Pennsylvania
   6 │     6  2006-03-20                  2       250  Stone, Cooper, and Grandy: Attor…
In [4]:
#DataFrames
subset(df, 
       :id => ByRow(x -> x >= 1),
       :first_name => ByRow(x -> x =="Michael")
       )
Out[4]:

1 rows × 5 columns

idfirst_namelast_namedepartmentsalary
Int64StringStringStringInt64
11MichaelScottManagement & Admin5100
In [5]:
#With DataFramesMeta it's easier to layer on complexity like do 'or' statements a little easier, plus it's more terse
@rsubset(df,
        (:id == 2) | (:first_name =="Michael")
        )
Out[5]:

2 rows × 5 columns

idfirst_namelast_namedepartmentsalary
Int64StringStringStringInt64
11MichaelScottManagement & Admin5100
22DwightSchruteSales4200

Regex string matching

From Regex String Matching Docs: regex literals

julia> occursin(r"^\s*(?:#|$)", "# a comment")
true

Similar to pandas pd.Series.str.contains(), we use the base occursin() string function here to the powerful subset or rsubset functions.

In [6]:
#DataFrames
subset(df, :department => ByRow(x -> occursin("Admin", x)))

#DataFramesMeta
@rsubset(df, occursin("Admin", :department))
Out[6]:

2 rows × 5 columns

idfirst_namelast_namedepartmentsalary
Int64StringStringStringInt64
11MichaelScottManagement & Admin5100
25PamBeeslyManagement & Admin2200

GROUP BY

Grouping and aggregating with groupby, combine, @combine, and @by

In [7]:
gd = groupby(df, :department)
#DataFrames
combine(gd, :salary => mean => :"Average Salary", 
            :department => length => :count)

#DataFramesMeta
@combine(gd, :"Average Salary" = mean(:salary),
             :count = length(:department))
Out[7]:

4 rows × 3 columns

departmentAverage Salarycount
StringFloat64Int64
1Management & Admin3650.02
2Sales4250.02
3Accounting3575.02
4Purchasing3250.02

bonus: counts

Alternatively, instead of doing groupby() and then combine(), you can do @by as a shorthand:

In [8]:
@by(df, :department, 
        :"Average Salary" = mean(:salary),          
        :count=length(:salary))
Out[8]:

4 rows × 3 columns

departmentAverage Salarycount
StringFloat64Int64
1Management & Admin3650.02
2Sales4250.02
3Accounting3575.02
4Purchasing3250.02

PARTITION BY

In [9]:
#DataFrames
transform(gd, :salary => mean => :"Average Salary")

#DataFramesMeta
@transform(gd, :"Average Salary"=mean(:salary))
Out[9]:

8 rows × 6 columns

idfirst_namelast_namedepartmentsalaryAverage Salary
Int64StringStringStringInt64Float64
11MichaelScottManagement & Admin51003650.0
22DwightSchruteSales42004250.0
33AngelaMartinAccounting37503575.0
44JimHalpertSales43004250.0
55PamBeeslyManagement & Admin22003650.0
66OscarNunezAccounting34003575.0
77MeredithPalmerPurchasing33003250.0
88CreedBrattonPurchasing32003250.0

JOIN

Joins are fairly self explanatory - the on argument is nice and flexible where you can pass it a column or pairs of columns (i.e. [:left=>:right]) for the mapping. In the case of duplicate column names, you can set makeunique=true to append a string when there are conflicts:

In [10]:
leftjoin(df, db_sales, on=[:id=>:employee_id], makeunique=true)
Out[10]:

12 rows × 9 columns (omitted printing of 2 columns)

idfirst_namelast_namedepartmentsalaryid_1transaction_date
Int64StringStringStringInt64Int64?Date?
12DwightSchruteSales420022006-01-29
22DwightSchruteSales420042006-02-14
32DwightSchruteSales420062006-03-20
44JimHalpertSales430012006-01-02
54JimHalpertSales430032006-02-01
64JimHalpertSales430052006-03-01
71MichaelScottManagement & Admin5100missingmissing
83AngelaMartinAccounting3750missingmissing
95PamBeeslyManagement & Admin2200missingmissing
106OscarNunezAccounting3400missingmissing
117MeredithPalmerPurchasing3300missingmissing
128CreedBrattonPurchasing3200missingmissing

Putting it all together

@chain

The @chain macro allows you to chain things together so you don't need to keep specifying the first argument in the function calls. This is similar to method chaining in python or the pipe in dplyr in R.

In [11]:
sales = @chain db_sales begin
    groupby(:employee_id)
    @combine(:total_quantity=sum(:quantity), 
             :number_of_customers=length(:customer))
end

result = @chain df begin
                @select(:id, :first_name, :last_name, :department)
                @rsubset(:department=="Sales")
                leftjoin(sales, on=[:id=>:employee_id])
                @orderby(sort(:total_quantity, rev=true))  # order by descending
          end
Out[11]:

2 rows × 6 columns

idfirst_namelast_namedepartmenttotal_quantitynumber_of_customers
Int64StringStringStringInt64?Int64?
14JimHalpertSales11003
22DwightSchruteSales9503
In [ ]: