By Tyler Ransom, Duke University Social Science Research Institute
tyler.ransom@duke.edu
With our knowledge of the DataFrames
package, we can use Julia to estimate common regression models which are useful in summarizing and describing data.
First, let's call the package we'll need for this demonstration. We'll be using Julia version 0.4.1 with GLM
version 0.5.0 and DataFrames
version 0.6.10.
using GLM
using DataFrames
Let's start by loading in the same CSV file from our DataFrames
example: (available at https://github.com/jmxpearson/duke-julia-ssri-2016/auto.csv)
auto = readtable("auto.csv");
showcols(auto)
74x12 DataFrames.DataFrame | Col # | Name | Eltype | Missing | |-------|--------------|------------|---------| | 1 | make | UTF8String | 0 | | 2 | price | Int64 | 0 | | 3 | mpg | Int64 | 0 | | 4 | rep78 | Int64 | 5 | | 5 | headroom | Float64 | 0 | | 6 | trunk | Int64 | 0 | | 7 | weight | Int64 | 0 | | 8 | length | Int64 | 0 | | 9 | turn | Int64 | 0 | | 10 | displacement | Int64 | 0 | | 11 | gear_ratio | Float64 | 0 | | 12 | foreign | Int64 | 0 |
Now let's use the GLM
package to estimate a simple ordinary least squares regression. Let's use :price
as the dependent variable and :mpg
and :weight
as the indepenent variables.
The syntax for estimating an OLS model with the GLM package is:
object = glm(Y~X,data,Normal(),IdentityLink())
where
object
is the Julia object that stores all of the resultsY
is the dependent variableX
is a linear combination of independent variablesdata
is the name of the data frame that holds the dataUsers with experience in R
will recognize the tight correspondence between Julia's GLM estimation syntax and R
's.
Let's estimate our model now:
olsest = glm(price~mpg+weight,auto,Normal(),IdentityLink())
DataFrames.DataFrameRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Normal,GLM.IdentityLink},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Float64}: Coefficients: Estimate Std.Error z value Pr(>|z|) (Intercept) 1946.07 3597.05 0.541018 0.5885 mpg -49.5122 86.156 -0.574681 0.5655 weight 1.74656 0.641354 2.72324 0.0065
We see the reported coefficients, standard errors, z-values, and corresponding p-values for each covariate included in the model. Note that the intercept was automatically included.
Additionally, we can extract these objects, and others, by the following syntax:
coef(olsest)
3-element Array{Float64,1}: 1946.07 -49.5122 1.74656
stderr(olsest)
3-element Array{Float64,1}: 3597.05 86.156 0.641354
[coef(olsest) stderr(olsest)]
3x2 Array{Float64,2}: 1946.07 3597.05 -49.5122 86.156 1.74656 0.641354
Additional objects that we can extract from the stored estimates object include:
deviance()
, which is the weighted sum of squares of the residualsvcov()
, which is the estimated variance-covariance matrix of the parameterspredict()
, which obtains predicted values of the outcome for each observationdeviance(olsest)
4.487441163821706e8
pricehat = predict(olsest)
74-element Array{Float64,1}: 5974.22 6955.33 5467.72 6632.14 8329.35 7464.72 4553.58 6684.54 7930.52 6943.64 8815.5 8064.48 8399.05 ⋮ 3918.89 7226.13 3854.95 3793.59 5264.06 4253.62 5718.16 4579.86 3479.05 4079.12 4183.92 6640.95
[convert(Array,auto[:,:price]) pricehat]
74x2 Array{Float64,2}: 4099.0 5974.22 4749.0 6955.33 3799.0 5467.72 4816.0 6632.14 7827.0 8329.35 5788.0 7464.72 4453.0 4553.58 5189.0 6684.54 10372.0 7930.52 4082.0 6943.64 11385.0 8815.5 14500.0 8064.48 15906.0 8399.05 ⋮ 3995.0 3918.89 12990.0 7226.13 3895.0 3854.95 3798.0 3793.59 5899.0 5264.06 3748.0 4253.62 5719.0 5718.16 7140.0 4579.86 5397.0 3479.05 4697.0 4079.12 6850.0 4183.92 11995.0 6640.95
vcov(olsest)
3x3 Array{Float64,2}: 1.29388e7 -2.9276e5 -2191.9 -2.9276e5 7422.86 44.6017 -2191.9 44.6017 0.411335
[sqrt(diag(vcov(olsest))) stderr(olsest)]
3x2 Array{Float64,2}: 3597.05 3597.05 86.156 86.156 0.641354 0.641354
Here we will estimate parameters associated with a binary dependent variable. The two canonical models that we will discuss here are the logistic regression model (logit) and the probit model. The two models usually give similar results, and discussing their differences is beyond the current scope.
The logit model is called using the Binomial()
and LogitLink()
options in the glm()
function:
logitest = glm(foreign~mpg+weight,auto,Binomial(),LogitLink())
DataFrames.DataFrameRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Binomial,GLM.LogitLink},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Float64}: Coefficients: Estimate Std.Error z value Pr(>|z|) (Intercept) 13.7084 4.51742 3.03456 0.0024 mpg -0.168587 0.0919032 -1.8344 0.0666 weight -0.0039067 0.00101117 -3.86356 0.0001
Here, we can predict the probability that a given vehicle is from a foreign maker and compare that with what was observed.
[convert(Array,auto[:,:foreign]) predict(logitest)]
74x2 Array{Float64,2}: 0.0 0.190436 0.0 0.0957767 0.0 0.422082 0.0 0.0862626 0.0 0.00849477 0.0 0.0249945 0.0 0.648662 0.0 0.0774615 0.0 0.0155653 0.0 0.0585486 0.0 0.00380411 0.0 0.0200754 0.0 0.00136983 ⋮ 1.0 0.714123 1.0 0.117869 1.0 0.898059 1.0 0.449941 1.0 0.778794 1.0 0.471888 1.0 0.560431 1.0 0.800974 1.0 0.236247 1.0 0.875856 1.0 0.848046 1.0 0.176266
Similar to the logit case, the probit model is called using the Binomial()
and ProbitLink()
options in the glm()
function:
probitest = glm(foreign~mpg+weight,auto,Binomial(),ProbitLink())
DataFrames.DataFrameRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Binomial,GLM.ProbitLink},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Float64}: Coefficients: Estimate Std.Error z value Pr(>|z|) (Intercept) 8.27532 2.57855 3.20929 0.0013 mpg -0.103948 0.0542063 -1.91763 0.0552 weight -0.00233551 0.000556961 -4.19331 <1e-4
[convert(Array,auto[:,:foreign]) predict(probitest)]
74x2 Array{Float64,2}: 0.0 0.196393 0.0 0.0941283 0.0 0.429645 0.0 0.0816519 0.0 0.00245574 0.0 0.0151149 0.0 0.642254 0.0 0.0715818 0.0 0.0071502 0.0 0.0504584 0.0 0.000496129 0.0 0.0110559 0.0 4.30191e-5 ⋮ 1.0 0.702837 1.0 0.121525 1.0 0.902976 1.0 0.440127 1.0 0.781031 1.0 0.466058 1.0 0.566884 1.0 0.799495 1.0 0.226333 1.0 0.878817 1.0 0.848251 1.0 0.185297
Suppose you have a categorical variable (e.g. ethnicity, or gender) that you want to include as an independent variable in your model.
Several existing software packages have this capability. In Stata, use i.gender
; in R
use gender <- as.factor(gender)
.
In Julia, factor levels are created using what is called a "pooled data array."
Let's see how they work by estimating an OLS model on the auto
data frame, using the :rep78
variable, which takes on categorical values 1 through 5.
olsest2 = glm(price~mpg+weight+rep78,auto,Normal(),IdentityLink())
DataFrames.DataFrameRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Normal,GLM.IdentityLink},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Float64}: Coefficients: Estimate Std.Error z value Pr(>|z|) (Intercept) -1939.87 3622.35 -0.535528 0.5923 mpg -52.2172 83.7396 -0.623567 0.5329 weight 2.11149 0.619008 3.41108 0.0006 rep78 820.812 320.899 2.55786 0.0105
Here, the glm()
function thinks that :rep78
is a continuous variable. In order to "pool" this variable, we use the following syntax:
pool!(auto,:rep78);
Once the pooling is done, we can look at the levels of the pooled vector using the levels()
function:
levels(auto[:rep78])
5-element Array{Int64,1}: 1 2 3 4 5
Now when we re-run this regression, we should expect to obtain four different coefficient estimates representing the four uniquely identifiable categories of :rep78
.
olsest3 = glm(price~mpg+weight+rep78,auto,Normal(),IdentityLink())
DataFrames.DataFrameRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Normal,GLM.IdentityLink},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Float64}: Coefficients: Estimate Std.Error z value Pr(>|z|) (Intercept) -598.967 3960.9 -0.15122 0.8798 mpg -63.0971 87.4528 -0.721499 0.4706 weight 2.09307 0.636901 3.28633 0.0010 rep78 - 2 753.702 1919.76 0.392602 0.6946 rep78 - 3 1349.36 1772.71 0.761187 0.4465 rep78 - 4 2030.47 1810.09 1.12175 0.2620 rep78 - 5 3376.91 1900.17 1.77716 0.0755
This is indeed the case.
It is possible to automatically pool the data frame upon import using the makefactors
option of readtable
:
auto2 = readtable("auto.csv",makefactors=true);
However, this functionality only works with string variables, so any numerical categorical variables must be pooled manually.