using Plots
using Statistics
using StatsBase
using PyCall
using DataFrames
using GLM
using Tables
using XLSX
using MLBase
using RDatasets
using LsqFit
xvals = repeat(1:0.5:10, inner=2)
yvals = 3 .+ xvals .+ 2 .* rand(length(xvals)) .- 1
scatter(xvals, yvals, color=:black, leg=false)
function find_best_fit(xvals, yvals)
    # closed-form simple linear regression:
    # slope a = r * (std of y / std of x), intercept b = mean(y) - a * mean(x)
    meanx = mean(xvals)
    meany = mean(yvals)
    stdx = std(xvals)
    stdy = std(yvals)
    r = cor(xvals, yvals)
    a = r * stdy / stdx
    b = meany - a * meanx
    return a, b
end
a,b = find_best_fit(xvals,yvals)
ynew = a .* xvals .+ b
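As a quick sanity check (optional; it uses only functions from Statistics, which is already loaded), the closed-form slope above should agree with the covariance-over-variance formulation:
# the least-squares slope equals cov(x, y) / var(x)
a_alt = cov(xvals, yvals) / var(xvals)
isapprox(a, a_alt)   # expected: true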
np = pyimport("numpy");
xdata = xvals
ydata = yvals
@time myfit = np.polyfit(xdata, ydata, 1);
ynew2 = collect(xdata) .* myfit[1] .+ myfit[2];
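For comparison, here is a sketch of the same degree-1 fit using only base Julia: the backslash operator solves the least-squares problem for the design matrix [x 1].
# least-squares line fit via backslash: solves [x 1] * [slope; intercept] ≈ y
A = [xdata ones(length(xdata))]
slope, intercept = A \ ydata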
scatter(xvals,yvals)
plot!(xvals,ynew)
plot!(xvals,ynew2)
data = DataFrame(X=xdata, Y=ydata)
ols = lm(@formula(Y ~ X), data)
plot!(xdata,predict(ols))
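Beyond predict, the fitted model exposes the standard GLM/StatsAPI accessors, for example:
coef(ols)       # [intercept, slope]
stderror(ols)   # standard errors of the coefficient estimates
r2(ols)         # coefficient of determination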
Now let's get some real data. We will use housing information from Zillow; check out the file zillow_data_download_april2020.xlsx for a quick look at the data. Our goal will be to build a linear regression model relating the number of houses listed to the number of houses sold in a few states. Fitting these models can serve as a key real estate indicator.
# play around with data for a bit
R = XLSX.readxlsx("data/zillow_data_download_april2020.xlsx")
XLSXFile("zillow_data_download_april2020.xlsx") containing 4 Worksheets sheetname size range ------------------------------------------------- MonthlyListings_City 8348x91 A1:CM8348 Sale_counts_city 28760x148 A1:ER28760 Sales_median_price_c… 3767x148 A1:ER3767 meta 1x1 A1:A1
sale_counts = R["Sale_counts_city"][:]
df_sale_counts = DataFrame(sale_counts[2:end,:],Symbol.(sale_counts[1,:]))
monthly_listings = R["MonthlyListings_City"][:]
df_monthly_listings = DataFrame(monthly_listings[2:end,:],Symbol.(monthly_listings[1,:]))
| | SizeRank | RegionID | RegionName | RegionType | StateName | 2013-01 | 2013-02 | 2013-03 |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 6181 | New York | City | NY | 28904 | 28155 | 30596 |
| 2 | 2 | 12447 | Los Angeles | City | CA | 6431 | 6613 | 6467 |
| 3 | 3 | 39051 | Houston | City | TX | 11696 | 11737 | 12291 |
| 4 | 4 | 17426 | Chicago | City | IL | 9523 | 9609 | 9772 |
| 5 | 5 | 6915 | San Antonio | City | TX | 7223 | 7170 | 7438 |
| 6 | 6 | 13271 | Philadelphia | City | PA | 7505 | 7266 | 7578 |
| 7 | 7 | 40326 | Phoenix | City | AZ | 6232 | 5720 | 5872 |
| 8 | 8 | 18959 | Las Vegas | City | NV | 7027 | 6218 | 6199 |
| 9 | 9 | 54296 | San Diego | City | CA | 4121 | 3844 | 3932 |
| 10 | 10 | 38128 | Dallas | City | TX | 4332 | 4395 | 4568 |
| 11 | 11 | 10221 | Austin | City | TX | 2345 | 2230 | 2301 |
| 12 | 12 | 33839 | San Jose | City | CA | missing | missing | missing |
| 13 | 13 | 25290 | Jacksonville | City | FL | 4517 | 4495 | 4481 |
| 14 | 14 | 24043 | Charlotte | City | NC | 4550 | 4451 | 4812 |
| 15 | 15 | 18172 | Fort Worth | City | TX | 3709 | 3738 | 3703 |
| 16 | 16 | 7481 | Tucson | City | AZ | 4422 | 4355 | 4376 |
| 17 | 17 | 10920 | Columbus | City | OH | 3873 | 3849 | 4099 |
| 18 | 18 | 12455 | Louisville | City | KY | 3517 | 3355 | 3567 |
| 19 | 19 | 13121 | Orlando | City | FL | 2821 | 2781 | 2711 |
| 20 | 20 | 17933 | El Paso | City | TX | 3255 | 3287 | 3405 |
| 21 | 21 | 17762 | Detroit | City | MI | 3550 | 3413 | 3358 |
| 22 | 22 | 11093 | Denver | City | CO | 2486 | 2404 | 2588 |
| 23 | 23 | 16037 | Seattle | City | WA | 1514 | 1569 | 1709 |
| 24 | 24 | 32811 | Memphis | City | TN | 3024 | 2867 | 2986 |
| 25 | 25 | 44269 | Boston | City | MA | 1193 | 1287 | 1570 |
| 26 | 26 | 41568 | Washington | City | DC | 1362 | 1339 | 1418 |
| 27 | 27 | 13373 | Portland | City | OR | 2133 | 2158 | 2220 |
| 28 | 28 | 6118 | Nashville | City | TN | 4285 | 4280 | 4448 |
| 29 | 29 | 20288 | Sacramento | City | CA | 905 | 938 | 990 |
| 30 | 30 | 3523 | Baltimore | City | MD | 2907 | 2894 | 3026 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
monthly_listings_2020_02 = df_monthly_listings[!,[1,2,3,4,5,end]]
rename!(monthly_listings_2020_02, Symbol("2020-02") => :listings)
sale_counts_2020_02 = df_sale_counts[!,[1,end]]
rename!(sale_counts_2020_02, Symbol("2020-02") => :sales)
| | RegionID | sales |
|---|---|---|
| 1 | 6181 | 4054 |
| 2 | 12447 | 1522 |
| 3 | 39051 | 2682 |
| 4 | 17426 | 2100 |
| 5 | 6915 | 1626 |
| 6 | 13271 | 1620 |
| 7 | 40326 | 2325 |
| 8 | 18959 | 2547 |
| 9 | 54296 | missing |
| 10 | 38128 | 1292 |
| 11 | 10221 | 1023 |
| 12 | 33839 | 439 |
| 13 | 25290 | missing |
| 14 | 32149 | 1443 |
| 15 | 20330 | 291 |
| 16 | 24043 | 1439 |
| 17 | 18172 | 1059 |
| 18 | 7481 | 870 |
| 19 | 10920 | 895 |
| 20 | 12455 | 972 |
| 21 | 13121 | 726 |
| 22 | 17933 | 465 |
| 23 | 17762 | missing |
| 24 | 11093 | 1198 |
| 25 | 16037 | 482 |
| 26 | 32811 | 898 |
| 27 | 44269 | 147 |
| 28 | 41568 | 845 |
| 29 | 13373 | 642 |
| 30 | 6118 | 1107 |
| ⋮ | ⋮ | ⋮ |
Feb2020data = innerjoin(monthly_listings_2020_02,sale_counts_2020_02,on=:RegionID) #, type="outer")
dropmissing!(Feb2020data)
sales = Feb2020data[!,:sales]
# prices = Feb2020data[!,:price]
counts = Feb2020data[!,:listings]
using DataStructures
states = Feb2020data[!,:StateName]
# count how many cities each state contributes to the data
C = counter(states)
C.map
# pick the 10 states with the most cities
countvals = values(C.map)
topstates = sortperm(collect(countvals),rev=true)[1:10]
states_of_interest = collect(keys(C.map))[topstates]
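As an aside, StatsBase's countmap (already loaded above) offers a shorter route to the same top-10 list; a minimal sketch:
# count cities per state, then sort the (state => count) pairs by count
cm = countmap(states)
states_top10 = first.(sort(collect(cm), by=last, rev=true)[1:10])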
all_plots = Array{Plots.Plot}(undef,10)
for (i,si) in enumerate(states_of_interest)
    curids = findall(Feb2020data[!,:StateName].==si)
    data = DataFrame(X=float.(counts[curids]), Y=float.(sales[curids]))
    # fit a no-intercept model: sales ≈ slope * listings
    ols = GLM.lm(@formula(Y ~ 0 + X), data)
    all_plots[i] = scatter(counts[curids],sales[curids],markersize=2,
        xlim=(0,500),ylim=(0,500),color=i,aspect_ratio=:equal,
        legend=false,title=si)
    @show si,coef(ols)
    plot!(counts[curids],predict(ols),color=:black)
end
plot(all_plots...,layout=(2,5),size=(900,300))
(si, coef(ols)) = ("CA", [0.29595375979515304]) (si, coef(ols)) = ("FL", [0.16693529057344902]) (si, coef(ols)) = ("IL", [0.22291703276169347]) (si, coef(ols)) = ("NJ", [0.25556391864492345]) (si, coef(ols)) = ("MI", [0.2925436820147375]) (si, coef(ols)) = ("TX", [0.24732835944836878]) (si, coef(ols)) = ("PA", [0.310066636803275]) (si, coef(ols)) = ("OH", [0.4569949754409864]) (si, coef(ols)) = ("NC", [0.4800747758857476]) (si, coef(ols)) = ("NY", [0.19151984869233082])
all_plots = Array{Plots.Plot}(undef,10)
for (i,si) in enumerate(states_of_interest)
    curids = findall(Feb2020data[!,:StateName].==si)
    data = DataFrame(X=float.(counts[curids]), Y=float.(sales[curids]))
    # same fit as above, but now with an intercept term
    ols = GLM.lm(@formula(Y ~ X), data)
    all_plots[i] = scatter(counts[curids],sales[curids],markersize=2,
        xlim=(0,500),ylim=(0,500),color=i,aspect_ratio=:equal,
        legend=false,title=si)
    @show si,coef(ols)
    plot!(counts[curids],predict(ols),color=:black)
end
plot(all_plots...,layout=(2,5),size=(900,300))
(si, coef(ols)) = ("CA", [6.803086464628309, 0.28798374289950357]) (si, coef(ols)) = ("FL", [19.126255819233386, 0.15510798334559428]) (si, coef(ols)) = ("IL", [0.6626439876883514, 0.2226325765807212]) (si, coef(ols)) = ("NJ", [-1.9635911642225732, 0.2650303535602978]) (si, coef(ols)) = ("MI", [-0.9181942041703473, 0.29768705032459525]) (si, coef(ols)) = ("TX", [5.241220767319292, 0.24566620049636262]) (si, coef(ols)) = ("PA", [-0.21362448579952723, 0.3101955410309725]) (si, coef(ols)) = ("OH", [-6.026958739166818, 0.47909280478486926]) (si, coef(ols)) = ("NC", [-24.60003985327959, 0.5302793903364814]) (si, coef(ols)) = ("NY", [-0.20048638474403055, 0.191535092167981])
plot()
for (i,si) in enumerate(states_of_interest)
    curids = findall(Feb2020data[!,:StateName].==si)
    data = DataFrame(X=float.(counts[curids]), Y=float.(sales[curids]))
    ols = GLM.lm(@formula(Y ~ 0 + X), data)
    scatter!(counts[curids],sales[curids],markersize=2,
        xlim=(0,500),ylim=(0,500),color=i,aspect_ratio=:equal,
        legend=false,marker=(3,3,stroke(0)),alpha=0.2)
    # annotate a few states of interest at the right edge of the plot
    if si == "NC" || si == "CA" || si == "FL"
        annotate!([(500-20,10+coef(ols)[1]*500,text(si,10))])
    end
    @show si,coef(ols)
    plot!(counts[curids],predict(ols),color=i,linewidth=2)
end
# plot(all_plots...,layout=(2,5),size=(900,300))
xlabel!("listings")
ylabel!("sales")
(si, coef(ols)) = ("CA", [0.29595375979515304]) (si, coef(ols)) = ("FL", [0.16693529057344902]) (si, coef(ols)) = ("IL", [0.22291703276169347]) (si, coef(ols)) = ("NJ", [0.25556391864492345]) (si, coef(ols)) = ("MI", [0.2925436820147375]) (si, coef(ols)) = ("TX", [0.24732835944836878]) (si, coef(ols)) = ("PA", [0.310066636803275]) (si, coef(ols)) = ("OH", [0.4569949754409864]) (si, coef(ols)) = ("NC", [0.4800747758857476]) (si, coef(ols)) = ("NY", [0.19151984869233082])
So far, we have shown several ways to solve the linear regression problem in Julia. Here, we will start with a motivating example of when you would want to use logistic regression. Let's assume that our outcome vector is binary (0 or 1) and first fit a linear regression model.
data = DataFrame(X=[1,2,3,4,5,6,7], Y=[1,0,1,1,1,1,1])
linear_reg = lm(@formula(Y ~ X), data)
scatter(data[!,:X],data[!,:Y],legend=false,size=(300,200))
plot!(1:7,predict(linear_reg))
What this plot quickly shows is that linear regression may end up predicting values outside the [0,1] interval. For an example like this, we will use logistic regression. Interestingly, a generalized linear model (https://en.wikipedia.org/wiki/Generalized_linear_model) unifies concepts like linear regression and logistic regression, and the GLM package lets you apply either of them easily by specifying the distribution family and the link function.

To apply logistic regression via the GLM package, you can readily use the Binomial() family and the LogitLink() link function.
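As a quick illustration of that unification (using the same data DataFrame as in the binary example above), ordinary linear regression is just the GLM with a Normal() family and the IdentityLink() link:
# linear regression expressed as a GLM: Normal family + identity link
lm_as_glm = glm(@formula(Y ~ X), data, Normal(), IdentityLink())
coef(lm_as_glm)   # should match coef(linear_reg)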
Let's load some data and take a look at one example.
# we will load this data from RDatasets
cats = dataset("MASS", "cats")
| | Sex | BWt | HWt |
|---|---|---|---|
| 1 | F | 2.0 | 7.0 |
| 2 | F | 2.0 | 7.4 |
| 3 | F | 2.0 | 9.5 |
| 4 | F | 2.1 | 7.2 |
| 5 | F | 2.1 | 7.3 |
| 6 | F | 2.1 | 7.6 |
| 7 | F | 2.1 | 8.1 |
| 8 | F | 2.1 | 8.2 |
| 9 | F | 2.1 | 8.3 |
| 10 | F | 2.1 | 8.5 |
| 11 | F | 2.1 | 8.7 |
| 12 | F | 2.1 | 9.8 |
| 13 | F | 2.2 | 7.1 |
| 14 | F | 2.2 | 8.7 |
| 15 | F | 2.2 | 9.1 |
| 16 | F | 2.2 | 9.7 |
| 17 | F | 2.2 | 10.9 |
| 18 | F | 2.2 | 11.0 |
| 19 | F | 2.3 | 7.3 |
| 20 | F | 2.3 | 7.9 |
| 21 | F | 2.3 | 8.4 |
| 22 | F | 2.3 | 9.0 |
| 23 | F | 2.3 | 9.0 |
| 24 | F | 2.3 | 9.5 |
| 25 | F | 2.3 | 9.6 |
| 26 | F | 2.3 | 9.7 |
| 27 | F | 2.3 | 10.1 |
| 28 | F | 2.3 | 10.1 |
| 29 | F | 2.3 | 10.6 |
| 30 | F | 2.3 | 11.2 |
| ⋮ | ⋮ | ⋮ | ⋮ |

(In the MASS cats data, BWt is body weight in kg and HWt is heart weight in g.)
We will map the sex of each cat to a binary 0/1 value.
lmap = labelmap(cats[!,:Sex])
ci = labelencode(lmap, cats[!,:Sex])
scatter(cats[!,:BWt],cats[!,:HWt],color=ci,legend=false)
lmap
LabelMap (with 2 labels):
[1] F
[2] M
Female cats (color 1) appear mostly in the lower-left corner, while male cats (color 2) appear mostly in the upper-right corner. Let's run a logistic regression model on this data.
data = DataFrame(X=cats[!,:HWt], Y=ci.-1)
# note: with LogitLink this is logistic regression (a probit model would use ProbitLink instead)
logistic_reg = glm(@formula(Y ~ X), data, Binomial(), LogitLink())
scatter(data[!,:X],data[!,:Y],label="ground truth gender",color=6)
scatter!(data[!,:X],predict(logistic_reg),label="predicted gender",color=7)
As you can see, contrary to the linear regression case, the predicted values stay within the [0,1] interval.
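If you need hard class labels rather than probabilities, one common post-processing step (a sketch, not part of the original analysis) is to threshold the predicted probabilities at 0.5:
# convert predicted probabilities into 0/1 labels
pred_labels = predict(logistic_reg) .> 0.5
mean(pred_labels .== data[!,:Y])   # fraction of cats classified correctly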
Finally, sometimes you may have a set of points and the goal is to fit a nonlinear function (maybe a quadratic, a cubic, or an exponential function). The way we solve such a problem is by minimizing the least-squares error between the fitted function and the observations. We will use the LsqFit package for this task. Note that this problem is usually modeled as a numerical optimization problem.
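Concretely, given observations (x_i, y_i) and a parametric model f(x, p), the fit seeks the parameter vector p minimizing the sum of squared residuals:

\min_{p} \sum_{i} \bigl( y_i - f(x_i, p) \bigr)^2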
We will first set up our data.
xvals = 0:0.05:10
yvals = 1*exp.(-xvals*2) + 2*sin.(0.8*pi*xvals) + 0.15 * randn(length(xvals));
scatter(xvals,yvals,legend=false)
Then, we set up the model with model(x,p). The vector p holds the parameters to be estimated given a set of values x.
@. model(x, p) = p[1]*exp(-x*p[2]) + p[3]*sin(0.8*pi*x)
p0 = [0.5, 0.5, 0.5]
myfit = curve_fit(model, xvals, yvals, p0)
The fit converges to p ≈ [1.042, 2.169, 2.018], close to the true parameters [1, 2, 2] used to generate the data.
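To judge the quality of the fit, you can query the result object; stderror is documented in the LsqFit package (assuming a reasonably recent version):
myfit.converged     # whether the solver reported convergence
stderror(myfit)     # asymptotic standard errors of the parameters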
⚠️ A note about curve_fit: this function can take multiple other inputs, for instance the Jacobian of what you are trying to fit. We don't dive into these details here, but be sure to check out the LsqFit package to see what else you can pass to create a better fit. Also note that Julia has multiple packages that can compute Jacobians for you so you don't have to write them yourself; two such packages are FiniteDifferences and ForwardDiff.
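For instance, here is a minimal sketch of supplying a hand-written Jacobian for the model above; curve_fit accepts a Jacobian function as its second positional argument, and jacobian_model is our own illustrative name:
# Jacobian of f(x, p) = p[1]*exp(-x*p[2]) + p[3]*sin(0.8*pi*x)
function jacobian_model(x, p)
    J = Array{Float64}(undef, length(x), length(p))
    J[:, 1] = exp.(-x .* p[2])                 # ∂f/∂p[1]
    J[:, 2] = -p[1] .* x .* exp.(-x .* p[2])   # ∂f/∂p[2]
    J[:, 3] = sin.(0.8 .* pi .* x)             # ∂f/∂p[3]
    return J
end
myfit_jac = curve_fit(model, jacobian_model, xvals, yvals, p0)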
↪️ Back to our example. We are now ready to plot the curve we have fitted.
p = myfit.param
findyvals = p[1]*exp.(-xvals*p[2]) + p[3]*sin.(0.8*pi*xvals)
scatter(xvals,yvals,legend=false)
plot!(xvals,findyvals)
Just for fun... let's try to fit a linear function.
@. model(x, p) = p[1]*x
myfit = curve_fit(model, xvals, yvals, [0.5])
p = myfit.param
findyvals = p[1]*xvals
scatter(xvals,yvals,legend=false)
plot!(xvals,findyvals)
After finishing this notebook, you should be able to:
- fit a linear regression model, both with a hand-written least-squares function and with packages (numpy's polyfit via PyCall, and GLM)
- run a logistic regression with the GLM package by choosing the Binomial() family and the LogitLink() link function
- fit a nonlinear function to data with the LsqFit package

One metric in real estate is the fraction of listed houses that end up being sold. We collected data points from multiple states and fit a linear model for each state. It turns out that North Carolina has the highest sold-to-listed ratio, Florida has one of the lowest, and California sits somewhere in between.