Notebook

Start with a table of data, we'll make up an employee database and a sales database

In [1]:

using CSV, DataFramesMeta, Statistics, Dates

In [2]:

# DataFrame(column=data)
df = DataFrame(id=1:8,
               first_name=["Michael", "Dwight", "Angela", "Jim", "Pam", "Oscar", "Meredith", "Creed"],
               last_name=["Scott", "Schrute", "Martin", "Halpert", "Beesly", "Nunez", "Palmer", "Bratton"],
               department=["Management & Admin", "Sales", "Accounting", "Sales", "Management & Admin", "Accounting", 
                           "Purchasing", "Purchasing"],
               salary=[5100, 4200, 3750, 4300, 2200, 3400, 3300, 3200])

Out[2]:

8 rows × 5 columns

	id	first_name	last_name	department	salary
	Int64	String	String	String	Int64
1	1	Michael	Scott	Management & Admin	5100
2	2	Dwight	Schrute	Sales	4200
3	3	Angela	Martin	Accounting	3750
4	4	Jim	Halpert	Sales	4300
5	5	Pam	Beesly	Management & Admin	2200
6	6	Oscar	Nunez	Accounting	3400
7	7	Meredith	Palmer	Purchasing	3300
8	8	Creed	Bratton	Purchasing	3200

Let's create a sales database called db_sales with client information (thanks to this site)

In [3]:

# Parse dates as Date objects
dates = ["1-2-2006", "1-29-2006", "2-1-2006", "2-14-2006", "3-1-2006", "3-20-2006"]
dates = parse.(Date, dates, dateformat"m-d-y")

db_sales = DataFrame(id=1:6,
                     transaction_date=dates,
                     employee_id=[4, 2, 4, 2, 4, 2],
                     quantity=[100, 500, 600, 200, 400, 250],
                     customer=["Dunmore High School", "Harper Collins", "Blue Cross of Pennsylvania", 
                                "Apex Technology", "Blue Cross of Pennsylvania", 
                                "Stone, Cooper, and Grandy: Attorneys at Law"])

db_sales |> print

6×5 DataFrame
 Row │ id     transaction_date  employee_id  quantity  customer                          
     │ Int64  Date              Int64        Int64     String                            
─────┼───────────────────────────────────────────────────────────────────────────────────
   1 │     1  2006-01-02                  4       100  Dunmore High School
   2 │     2  2006-01-29                  2       500  Harper Collins
   3 │     3  2006-02-01                  4       600  Blue Cross of Pennsylvania
   4 │     4  2006-02-14                  2       200  Apex Technology
   5 │     5  2006-03-01                  4       400  Blue Cross of Pennsylvania
   6 │     6  2006-03-20                  2       250  Stone, Cooper, and Grandy: Attor…

In [4]:

#DataFrames
subset(df, 
       :id => ByRow(x -> x >= 1),
       :first_name => ByRow(x -> x =="Michael")
       )

Out[4]:

1 rows × 5 columns

	id	first_name	last_name	department	salary
	Int64	String	String	String	Int64
1	1	Michael	Scott	Management & Admin	5100

In [5]:

#With DataFramesMeta it's easier to layer on complexity like do 'or' statements a little easier, plus it's more terse
@rsubset(df,
        (:id == 2) | (:first_name =="Michael")
        )

Out[5]:

2 rows × 5 columns

	id	first_name	last_name	department	salary
	Int64	String	String	String	Int64
1	1	Michael	Scott	Management & Admin	5100
2	2	Dwight	Schrute	Sales	4200

Regex string matching¶

From Regex String Matching Docs: regex literals

julia> occursin(r"^\s*(?:#|$)", "# a comment")
true

Similar to pandas pd.Series.str.contains(), we use the base occursin() string function here to the powerful subset or rsubset functions.

In [6]:

#DataFrames
subset(df, :department => ByRow(x -> occursin("Admin", x)))

#DataFramesMeta
@rsubset(df, occursin("Admin", :department))

Out[6]:

2 rows × 5 columns

	id	first_name	last_name	department	salary
	Int64	String	String	String	Int64
1	1	Michael	Scott	Management & Admin	5100
2	5	Pam	Beesly	Management & Admin	2200

GROUP BY¶

Grouping and aggregating with groupby, combine, @combine, and @by

In [7]:

gd = groupby(df, :department)
#DataFrames
combine(gd, :salary => mean => :"Average Salary", 
            :department => length => :count)

#DataFramesMeta
@combine(gd, :"Average Salary" = mean(:salary),
             :count = length(:department))

Out[7]:

4 rows × 3 columns

	department	Average Salary	count
	String	Float64	Int64
1	Management & Admin	3650.0	2
2	Sales	4250.0	2
3	Accounting	3575.0	2
4	Purchasing	3250.0	2

bonus: counts

Alternatively, instead of doing groupby() and then combine(), you can do @by as a shorthand:

In [8]:

@by(df, :department, 
        :"Average Salary" = mean(:salary),          
        :count=length(:salary))

Out[8]:

4 rows × 3 columns

	department	Average Salary	count
	String	Float64	Int64
1	Management & Admin	3650.0	2
2	Sales	4250.0	2
3	Accounting	3575.0	2
4	Purchasing	3250.0	2

PARTITION BY¶

In [9]:

#DataFrames
transform(gd, :salary => mean => :"Average Salary")

#DataFramesMeta
@transform(gd, :"Average Salary"=mean(:salary))

Out[9]:

8 rows × 6 columns

	id	first_name	last_name	department	salary	Average Salary
	Int64	String	String	String	Int64	Float64
1	1	Michael	Scott	Management & Admin	5100	3650.0
2	2	Dwight	Schrute	Sales	4200	4250.0
3	3	Angela	Martin	Accounting	3750	3575.0
4	4	Jim	Halpert	Sales	4300	4250.0
5	5	Pam	Beesly	Management & Admin	2200	3650.0
6	6	Oscar	Nunez	Accounting	3400	3575.0
7	7	Meredith	Palmer	Purchasing	3300	3250.0
8	8	Creed	Bratton	Purchasing	3200	3250.0

JOIN¶

Joins are fairly self explanatory - the on argument is nice and flexible where you can pass it a column or pairs of columns (i.e. [:left=>:right]) for the mapping. In the case of duplicate column names, you can set makeunique=true to append a string when there are conflicts:

In [10]:

leftjoin(df, db_sales, on=[:id=>:employee_id], makeunique=true)

Out[10]:

12 rows × 9 columns (omitted printing of 2 columns)

	id	first_name	last_name	department	salary	id_1	transaction_date
	Int64	String	String	String	Int64	Int64?	Date?
1	2	Dwight	Schrute	Sales	4200	2	2006-01-29
2	2	Dwight	Schrute	Sales	4200	4	2006-02-14
3	2	Dwight	Schrute	Sales	4200	6	2006-03-20
4	4	Jim	Halpert	Sales	4300	1	2006-01-02
5	4	Jim	Halpert	Sales	4300	3	2006-02-01
6	4	Jim	Halpert	Sales	4300	5	2006-03-01
7	1	Michael	Scott	Management & Admin	5100	missing	missing
8	3	Angela	Martin	Accounting	3750	missing	missing
9	5	Pam	Beesly	Management & Admin	2200	missing	missing
10	6	Oscar	Nunez	Accounting	3400	missing	missing
11	7	Meredith	Palmer	Purchasing	3300	missing	missing
12	8	Creed	Bratton	Purchasing	3200	missing	missing

Putting it all together¶

`@chain`¶

The @chain macro allows you to chain things together so you don't need to keep specifying the first argument in the function calls. This is similar to method chaining in python or the pipe in dplyr in R.

In [11]:

sales = @chain db_sales begin
    groupby(:employee_id)
    @combine(:total_quantity=sum(:quantity), 
             :number_of_customers=length(:customer))
end

result = @chain df begin
                @select(:id, :first_name, :last_name, :department)
                @rsubset(:department=="Sales")
                leftjoin(sales, on=[:id=>:employee_id])
                @orderby(sort(:total_quantity, rev=true))  # order by descending
          end

Out[11]:

2 rows × 6 columns

	id	first_name	last_name	department	total_quantity	number_of_customers
	Int64	String	String	String	Int64?	Int64?
1	4	Jim	Halpert	Sales	1100	3
2	2	Dwight	Schrute	Sales	950	3

In [ ]: