using Plots
using Statistics
using StatsBase
using PyCall
using DataFrames
using GLM
using Tables
using XLSX
using MLBase
using RDatasets
using LsqFit
xvals = repeat(1:0.5:10, inner=2)
yvals = 3 .+ xvals .+ 2 .* rand(length(xvals)) .- 1
scatter(xvals, yvals, color=:black, leg=false)
function find_best_fit(xvals, yvals)
    # closed-form simple linear regression:
    # slope a = r * (std of y / std of x), intercept b = mean(y) - a * mean(x)
    meanx = mean(xvals)
    meany = mean(yvals)
    stdx = std(xvals)
    stdy = std(yvals)
    r = cor(xvals, yvals)
    a = r * stdy / stdx
    b = meany - a * meanx
    return a, b
end
a,b = find_best_fit(xvals,yvals)
ynew = a .* xvals .+ b
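As a quick sanity check (optional; it uses only functions from Statistics, which is already loaded), the closed-form slope above should agree with the covariance-over-variance formulation:
# the least-squares slope equals cov(x, y) / var(x)
a_alt = cov(xvals, yvals) / var(xvals)
isapprox(a, a_alt)   # expected: true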
np = pyimport("numpy");
xdata = xvals
ydata = yvals
@time myfit = np.polyfit(xdata, ydata, 1);
ynew2 = collect(xdata) .* myfit[1] .+ myfit[2];
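For comparison, here is a sketch of the same degree-1 fit using only base Julia: the backslash operator solves the least-squares problem for the design matrix [x 1].
# least-squares line fit via backslash: solves [x 1] * [slope; intercept] ≈ y
A = [xdata ones(length(xdata))]
slope, intercept = A \ ydata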
scatter(xvals,yvals)
plot!(xvals,ynew)
plot!(xvals,ynew2)
data = DataFrame(X=xdata, Y=ydata)
ols = lm(@formula(Y ~ X), data)
plot!(xdata,predict(ols))
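Beyond predict, the fitted model exposes the standard GLM/StatsAPI accessors, for example:
coef(ols)       # [intercept, slope]
stderror(ols)   # standard errors of the coefficient estimates
r2(ols)         # coefficient of determination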
Now let's get some real data. We will use housing information from Zillow; check out the file zillow_data_download_april2020.xlsx for a quick look at the data. Our goal will be to build a linear regression model relating the number of houses listed to the number of houses sold in a few states. Fitting these models can serve as a key real estate indicator.
# play around with data for a bit
R = XLSX.readxlsx("data/zillow_data_download_april2020.xlsx")
XLSXFile("zillow_data_download_april2020.xlsx") containing 4 Worksheets sheetname size range ------------------------------------------------- MonthlyListings_City 8348x91 A1:CM8348 Sale_counts_city 28760x148 A1:ER28760 Sales_median_price_c… 3767x148 A1:ER3767 meta 1x1 A1:A1
sale_counts = R["Sale_counts_city"][:]
df_sale_counts = DataFrame(sale_counts[2:end,:],Symbol.(sale_counts[1,:]))
monthly_listings = R["MonthlyListings_City"][:]
df_monthly_listings = DataFrame(monthly_listings[2:end,:],Symbol.(monthly_listings[1,:]))
| | SizeRank | RegionID | RegionName | RegionType | StateName | 2013-01 | 2013-02 | 2013-03 |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 6181 | New York | City | NY | 28904 | 28155 | 30596 |
| 2 | 2 | 12447 | Los Angeles | City | CA | 6431 | 6613 | 6467 |
| 3 | 3 | 39051 | Houston | City | TX | 11696 | 11737 | 12291 |
| 4 | 4 | 17426 | Chicago | City | IL | 9523 | 9609 | 9772 |
| 5 | 5 | 6915 | San Antonio | City | TX | 7223 | 7170 | 7438 |
| 6 | 6 | 13271 | Philadelphia | City | PA | 7505 | 7266 | 7578 |
| 7 | 7 | 40326 | Phoenix | City | AZ | 6232 | 5720 | 5872 |
| 8 | 8 | 18959 | Las Vegas | City | NV | 7027 | 6218 | 6199 |
| 9 | 9 | 54296 | San Diego | City | CA | 4121 | 3844 | 3932 |
| 10 | 10 | 38128 | Dallas | City | TX | 4332 | 4395 | 4568 |
| 11 | 11 | 10221 | Austin | City | TX | 2345 | 2230 | 2301 |
| 12 | 12 | 33839 | San Jose | City | CA | missing | missing | missing |
| 13 | 13 | 25290 | Jacksonville | City | FL | 4517 | 4495 | 4481 |
| 14 | 14 | 24043 | Charlotte | City | NC | 4550 | 4451 | 4812 |
| 15 | 15 | 18172 | Fort Worth | City | TX | 3709 | 3738 | 3703 |
| 16 | 16 | 7481 | Tucson | City | AZ | 4422 | 4355 | 4376 |
| 17 | 17 | 10920 | Columbus | City | OH | 3873 | 3849 | 4099 |
| 18 | 18 | 12455 | Louisville | City | KY | 3517 | 3355 | 3567 |
| 19 | 19 | 13121 | Orlando | City | FL | 2821 | 2781 | 2711 |
| 20 | 20 | 17933 | El Paso | City | TX | 3255 | 3287 | 3405 |
| 21 | 21 | 17762 | Detroit | City | MI | 3550 | 3413 | 3358 |
| 22 | 22 | 11093 | Denver | City | CO | 2486 | 2404 | 2588 |
| 23 | 23 | 16037 | Seattle | City | WA | 1514 | 1569 | 1709 |
| 24 | 24 | 32811 | Memphis | City | TN | 3024 | 2867 | 2986 |
| 25 | 25 | 44269 | Boston | City | MA | 1193 | 1287 | 1570 |
| 26 | 26 | 41568 | Washington | City | DC | 1362 | 1339 | 1418 |
| 27 | 27 | 13373 | Portland | City | OR | 2133 | 2158 | 2220 |
| 28 | 28 | 6118 | Nashville | City | TN | 4285 | 4280 | 4448 |
| 29 | 29 | 20288 | Sacramento | City | CA | 905 | 938 | 990 |
| 30 | 30 | 3523 | Baltimore | City | MD | 2907 | 2894 | 3026 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
monthly_listings_2020_02 = df_monthly_listings[!,[1,2,3,4,5,end]]
rename!(monthly_listings_2020_02, Symbol("2020-02") => :listings)
sale_counts_2020_02 = df_sale_counts[!,[1,end]]
rename!(sale_counts_2020_02, Symbol("2020-02") => :sales)
| | RegionID | sales |
|---|---|---|
| 1 | 6181 | 4054 |
| 2 | 12447 | 1522 |
| 3 | 39051 | 2682 |
| 4 | 17426 | 2100 |
| 5 | 6915 | 1626 |
| 6 | 13271 | 1620 |
| 7 | 40326 | 2325 |
| 8 | 18959 | 2547 |
| 9 | 54296 | missing |
| 10 | 38128 | 1292 |
| 11 | 10221 | 1023 |
| 12 | 33839 | 439 |
| 13 | 25290 | missing |
| 14 | 32149 | 1443 |
| 15 | 20330 | 291 |
| 16 | 24043 | 1439 |
| 17 | 18172 | 1059 |
| 18 | 7481 | 870 |
| 19 | 10920 | 895 |
| 20 | 12455 | 972 |
| 21 | 13121 | 726 |
| 22 | 17933 | 465 |
| 23 | 17762 | missing |
| 24 | 11093 | 1198 |
| 25 | 16037 | 482 |
| 26 | 32811 | 898 |
| 27 | 44269 | 147 |
| 28 | 41568 | 845 |
| 29 | 13373 | 642 |
| 30 | 6118 | 1107 |
| ⋮ | ⋮ | ⋮ |
Feb2020data = innerjoin(monthly_listings_2020_02,sale_counts_2020_02,on=:RegionID) #, type="outer")
dropmissing!(Feb2020data)
sales = Feb2020data[!,:sales]
# prices = Feb2020data[!,:price]
counts = Feb2020data[!,:listings]
using DataStructures
states = Feb2020data[!,:StateName]
# count how many cities each state contributes to the data
C = counter(states)
C.map
# pick the 10 states with the most cities
countvals = values(C.map)
topstates = sortperm(collect(countvals),rev=true)[1:10]
states_of_interest = collect(keys(C.map))[topstates]
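As an aside, StatsBase's countmap (already loaded above) offers a shorter route to the same top-10 list; a minimal sketch:
# count cities per state, then sort the (state => count) pairs by count
cm = countmap(states)
states_top10 = first.(sort(collect(cm), by=last, rev=true)[1:10])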
all_plots = Array{Plots.Plot}(undef,10)
for (i,si) in enumerate(states_of_interest)
    curids = findall(Feb2020data[!,:StateName].==si)
    data = DataFrame(X=float.(counts[curids]), Y=float.(sales[curids]))
    # fit a no-intercept model: sales ≈ slope * listings
    ols = GLM.lm(@formula(Y ~ 0 + X), data)
    all_plots[i] = scatter(counts[curids],sales[curids],markersize=2,
        xlim=(0,500),ylim=(0,500),color=i,aspect_ratio=:equal,
        legend=false,title=si)
    @show si,coef(ols)
    plot!(counts[curids],predict(ols),color=:black)
end
plot(all_plots...,layout=(2,5),size=(900,300))
(si, coef(ols)) = ("CA", [0.29595375979515304]) (si, coef(ols)) = ("FL", [0.16693529057344902]) (si, coef(ols)) = ("IL", [0.22291703276169347]) (si, coef(ols)) = ("NJ", [0.25556391864492345]) (si, coef(ols)) = ("MI", [0.2925436820147375]) (si, coef(ols)) = ("TX", [0.24732835944836878]) (si, coef(ols)) = ("PA", [0.310066636803275]) (si, coef(ols)) = ("OH", [0.4569949754409864]) (si, coef(ols)) = ("NC", [0.4800747758857476]) (si, coef(ols)) = ("NY", [0.19151984869233082])
all_plots = Array{Plots.Plot}(undef,10)
for (i,si) in enumerate(states_of_interest)
    curids = findall(Feb2020data[!,:StateName].==si)
    data = DataFrame(X=float.(counts[curids]), Y=float.(sales[curids]))
    # same fit as above, but now with an intercept term
    ols = GLM.lm(@formula(Y ~ X), data)
    all_plots[i] = scatter(counts[curids],sales[curids],markersize=2,
        xlim=(0,500),ylim=(0,500),color=i,aspect_ratio=:equal,
        legend=false,title=si)
    @show si,coef(ols)
    plot!(counts[curids],predict(ols),color=:black)
end
plot(all_plots...,layout=(2,5),size=(900,300))
(si, coef(ols)) = ("CA", [6.803086464628309, 0.28798374289950357]) (si, coef(ols)) = ("FL", [19.126255819233386, 0.15510798334559428]) (si, coef(ols)) = ("IL", [0.6626439876883514, 0.2226325765807212]) (si, coef(ols)) = ("NJ", [-1.9635911642225732, 0.2650303535602978]) (si, coef(ols)) = ("MI", [-0.9181942041703473, 0.29768705032459525]) (si, coef(ols)) = ("TX", [5.241220767319292, 0.24566620049636262]) (si, coef(ols)) = ("PA", [-0.21362448579952723, 0.3101955410309725]) (si, coef(ols)) = ("OH", [-6.026958739166818, 0.47909280478486926]) (si, coef(ols)) = ("NC", [-24.60003985327959, 0.5302793903364814]) (si, coef(ols)) = ("NY", [-0.20048638474403055, 0.191535092167981])
plot()
for (i,si) in enumerate(states_of_interest)
    curids = findall(Feb2020data[!,:StateName].==si)
    data = DataFrame(X=float.(counts[curids]), Y=float.(sales[curids]))
    ols = GLM.lm(@formula(Y ~ 0 + X), data)
    scatter!(counts[curids],sales[curids],markersize=2,
        xlim=(0,500),ylim=(0,500),color=i,aspect_ratio=:equal,
        legend=false,marker=(3,3,stroke(0)),alpha=0.2)
    # annotate a few states of interest at the right edge of the plot
    if si == "NC" || si == "CA" || si == "FL"
        annotate!([(500-20,10+coef(ols)[1]*500,text(si,10))])
    end
    @show si,coef(ols)
    plot!(counts[curids],predict(ols),color=i,linewidth=2)
end
# plot(all_plots...,layout=(2,5),size=(900,300))
xlabel!("listings")
ylabel!("sales")
(si, coef(ols)) = ("CA", [0.29595375979515304]) (si, coef(ols)) = ("FL", [0.16693529057344902]) (si, coef(ols)) = ("IL", [0.22291703276169347]) (si, coef(ols)) = ("NJ", [0.25556391864492345]) (si, coef(ols)) = ("MI", [0.2925436820147375]) (si, coef(ols)) = ("TX", [0.24732835944836878]) (si, coef(ols)) = ("PA", [0.310066636803275]) (si, coef(ols)) = ("OH", [0.4569949754409864]) (si, coef(ols)) = ("NC", [0.4800747758857476]) (si, coef(ols)) = ("NY", [0.19151984869233082])
So far, we have shown several ways to solve the linear regression problem in Julia. Here, we will start with a motivating example of when you would want to use logistic regression. Let's assume that our outcome vector is binary (0 or 1) and first fit a linear regression model.
data = DataFrame(X=[1,2,3,4,5,6,7], Y=[1,0,1,1,1,1,1])
linear_reg = lm(@formula(Y ~ X), data)
scatter(data[!,:X],data[!,:Y],legend=false,size=(300,200))
plot!(1:7,predict(linear_reg))
What this plot quickly shows is that linear regression may end up predicting values outside the [0,1] interval. For an example like this, we will use logistic regression. Interestingly, a generalized linear model (https://en.wikipedia.org/wiki/Generalized_linear_model) unifies concepts like linear regression and logistic regression, and the GLM package lets you apply either of them easily by specifying the distribution family and the link function.

To apply logistic regression via the GLM package, you can readily use the Binomial() family and the LogitLink() link function.
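As a quick illustration of that unification (using the same data DataFrame as in the binary example above), ordinary linear regression is just the GLM with a Normal() family and the IdentityLink() link:
# linear regression expressed as a GLM: Normal family + identity link
lm_as_glm = glm(@formula(Y ~ X), data, Normal(), IdentityLink())
coef(lm_as_glm)   # should match coef(linear_reg)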
Let's load some data and take a look at one example.
# we will load this data from RDatasets
cats = dataset("MASS", "cats")
| | Sex | BWt | HWt |
|---|---|---|---|
| 1 | F | 2.0 | 7.0 |
| 2 | F | 2.0 | 7.4 |
| 3 | F | 2.0 | 9.5 |
| 4 | F | 2.1 | 7.2 |
| 5 | F | 2.1 | 7.3 |
| 6 | F | 2.1 | 7.6 |
| 7 | F | 2.1 | 8.1 |
| 8 | F | 2.1 | 8.2 |
| 9 | F | 2.1 | 8.3 |
| 10 | F | 2.1 | 8.5 |
| 11 | F | 2.1 | 8.7 |
| 12 | F | 2.1 | 9.8 |
| 13 | F | 2.2 | 7.1 |
| 14 | F | 2.2 | 8.7 |
| 15 | F | 2.2 | 9.1 |
| 16 | F | 2.2 | 9.7 |
| 17 | F | 2.2 | 10.9 |
| 18 | F | 2.2 | 11.0 |
| 19 | F | 2.3 | 7.3 |
| 20 | F | 2.3 | 7.9 |
| 21 | F | 2.3 | 8.4 |
| 22 | F | 2.3 | 9.0 |
| 23 | F | 2.3 | 9.0 |
| 24 | F | 2.3 | 9.5 |
| 25 | F | 2.3 | 9.6 |
| 26 | F | 2.3 | 9.7 |
| 27 | F | 2.3 | 10.1 |
| 28 | F | 2.3 | 10.1 |
| 29 | F | 2.3 | 10.6 |
| 30 | F | 2.3 | 11.2 |
| ⋮ | ⋮ | ⋮ | ⋮ |

(In the MASS cats data, BWt is body weight in kg and HWt is heart weight in g.)
We will map the sex of each cat to a binary 0/1 value.
lmap = labelmap(cats[!,:Sex])
ci = labelencode(lmap, cats[!,:Sex])
scatter(cats[!,:BWt],cats[!,:HWt],color=ci,legend=false)
lmap
LabelMap (with 2 labels):
[1] F
[2] M
Female cats (color 1) appear mostly in the lower-left corner, while male cats (color 2) appear mostly in the upper-right corner. Let's run a logistic regression model on this data.
data = DataFrame(X=cats[!,:HWt], Y=ci.-1)
# note: with LogitLink this is logistic regression (a probit model would use ProbitLink instead)
logistic_reg = glm(@formula(Y ~ X), data, Binomial(), LogitLink())
scatter(data[!,:X],data[!,:Y],label="ground truth gender",color=6)
scatter!(data[!,:X],predict(logistic_reg),label="predicted gender",color=7)
As you can see, contrary to the linear regression case, the predicted values stay within the [0,1] interval.
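If you need hard class labels rather than probabilities, one common post-processing step (a sketch, not part of the original analysis) is to threshold the predicted probabilities at 0.5:
# convert predicted probabilities into 0/1 labels
pred_labels = predict(logistic_reg) .> 0.5
mean(pred_labels .== data[!,:Y])   # fraction of cats classified correctly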
Finally, sometimes you may have a set of points and the goal is to fit a nonlinear function (maybe a quadratic, a cubic, or an exponential function). The way we solve such a problem is by minimizing the least-squares error between the fitted function and the observations. We will use the LsqFit package for this task. Note that this problem is usually modeled as a numerical optimization problem.
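Concretely, given observations (x_i, y_i) and a parametric model f(x, p), the fit seeks the parameter vector p minimizing the sum of squared residuals:

\min_{p} \sum_{i} \bigl( y_i - f(x_i, p) \bigr)^2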
We will first set up our data.
xvals = 0:0.05:10
yvals = 1*exp.(-xvals*2) + 2*sin.(0.8*pi*xvals) + 0.15 * randn(length(xvals));
scatter(xvals,yvals,legend=false)
Then, we set up the model with model(x,p). The vector p holds the parameters to be estimated given a set of values x.
@. model(x, p) = p[1]*exp(-x*p[2]) + p[3]*sin(0.8*pi*x)
p0 = [0.5, 0.5, 0.5]
myfit = curve_fit(model, xvals, yvals, p0)
The fit converges to p ≈ [1.042, 2.169, 2.018], close to the true parameters [1, 2, 2] used to generate the data.
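To judge the quality of the fit, you can query the result object; stderror is documented in the LsqFit package (assuming a reasonably recent version):
myfit.converged     # whether the solver reported convergence
stderror(myfit)     # asymptotic standard errors of the parameters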
⚠️ A note about curve_fit: this function can take multiple other inputs, for instance the Jacobian of what you are trying to fit. We don't dive into these details here, but be sure to check out the LsqFit package to see what else you can pass to create a better fit. Also note that Julia has multiple packages that can compute Jacobians for you so you don't have to write them yourself; two such packages are FiniteDifferences and ForwardDiff.
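For instance, here is a minimal sketch of supplying a hand-written Jacobian for the model above; curve_fit accepts a Jacobian function as its second positional argument, and jacobian_model is our own illustrative name:
# Jacobian of f(x, p) = p[1]*exp(-x*p[2]) + p[3]*sin(0.8*pi*x)
function jacobian_model(x, p)
    J = Array{Float64}(undef, length(x), length(p))
    J[:, 1] = exp.(-x .* p[2])                 # ∂f/∂p[1]
    J[:, 2] = -p[1] .* x .* exp.(-x .* p[2])   # ∂f/∂p[2]
    J[:, 3] = sin.(0.8 .* pi .* x)             # ∂f/∂p[3]
    return J
end
myfit_jac = curve_fit(model, jacobian_model, xvals, yvals, p0)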
↪️ Back to our example. We are now ready to plot the curve we have fitted.
p = myfit.param
findyvals = p[1]*exp.(-xvals*p[2]) + p[3]*sin.(0.8*pi*xvals)
scatter(xvals,yvals,legend=false)
plot!(xvals,findyvals)
Just for fun... let's try to fit a linear function.
@. model(x, p) = p[1]*x
myfit = curve_fit(model, xvals, yvals, [0.5])
p = myfit.param
findyvals = p[1]*xvals
scatter(xvals,yvals,legend=false)
plot!(xvals,findyvals)
After finishing this notebook, you should be able to:
- fit a linear regression model, both with a hand-written least-squares function and with packages (numpy's polyfit via PyCall, and GLM)
- run a logistic regression with the GLM package by choosing the Binomial() family and the LogitLink() link function
- fit a nonlinear function to data with the LsqFit package

One metric in real estate is the fraction of listed houses that end up being sold. We collected data points from multiple states and fit a linear model for each state. It turns out that North Carolina has the highest sold-to-listed ratio, Florida has one of the lowest, and California sits somewhere in between.