Foundations of Computational Economics #17¶

by Fedor Iskhakov, ANU

Linear regression using Pandas and Numpy¶

Description: Using Numpy and Pandas to estimate simple regression.

Linear regression¶

Recall the classic linear regression model with data in columns of $ (X,y) $, where $ X $ are independent variables and $ y $ is the dependent variable. Parameter vector to be estimated is $ \beta $, and we assume that errors follow $ \varepsilon \sim N(0, \sigma) $

$$ y = X \beta + \varepsilon \quad \quad \varepsilon \sim N(0, \sigma) $$

Let $ \hat{\beta} $ denote the estimate of the parameters $ \beta $. To find it, we minimize the sum of squares of the residuals $ e = y - X \hat{\beta} $, i.e. $ e'e \longrightarrow_{\hat{\beta}} \min $, which leads to the well known OLS formula

$$ \hat{\beta} = (X'X)^{-1} X' y $$

The mean standard error (MSE) of the regression is calculated as $ s = \sqrt{\frac{1}{n-k} e'e} $, where $ n $ is the number of observations and $ k $ is the number of parameters (elements in $ \beta $).

The variance-covariance matrix of the estimates is given by $ \hat{\Sigma} = s^2 (X'X)^{-1} $. The square root of the diagonal elements of this matrix are them standard deviations of the estimates, and give us the measure of the accuracy of the estimated parameters.

William Greene “Econometric Analysis”

In [1]:

import numpy as np
def ols(X,y,addConstant=True,verbose=True):
    '''Return the OLS estimates and their variance-covariance matrix for the given data X,y
    When addConstant is True, constant is added to X
    When verbose is True, a report is printed
    '''
    pass

In [2]:

# test on small dataset
X = np.array([[5, 3],
              [2, 3],
              [3, 1],
              [2, 8],
              [4.5, 2.5],
              [2.5, 1.5],
              [4.3, 4.2],
              [0.5, 3.5],
              [1, 5],
              [3, 8]])
truebeta = np.array([1.234,-0.345])[:,np.newaxis]  # column vector
y = X @ truebeta + 2.5 + np.random.normal(size=(X.shape[0],1),scale=0.2)

beta,S=ols(X,y)
beta,S=ols(X,y,addConstant=False)

Number of observations: 10
Number of parameters: 3
Parameter estimates (std in brackets)
   2.18817 (   0.18051)
   1.28565 (   0.04215)
  -0.34394 (   0.02592)
MSE = 0.18440

Number of observations: 10
Number of parameters: 2
Parameter estimates (std in brackets)
   1.67610 (   0.11924)
  -0.12370 (   0.08107)
MSE = 0.80889

In [3]:

# test with one dimensional arrays
X = np.array([1,2,3,4,5,6,7,8,9,10])
y = np.array([9.4,8.1,7.7,6.3,5.7,4.4,3.0,2.1,1.1,0.8])

beta,S=ols(X,y)
beta,S=ols(X,y,addConstant=False)

Number of observations: 10
Number of parameters: 2
Parameter estimates (std in brackets)
  10.38000 (   0.21096)
  -1.00364 (   0.03400)
MSE = 0.30881

Number of observations: 10
Number of parameters: 1
Parameter estimates (std in brackets)
   0.47922 (   0.25856)
MSE = 5.07328

In [1]:

import numpy as np
def ols(X,y,addConstant=True,verbose=True):
    '''Return the OLS estimates and their variance-covariance matrix for the given data X,y
    When addConstant is True, constant is added to X
    When verbose is True, a report is printed
    '''
    y = y.squeeze()  # we are better off if y is one-dimensional
    if addConstant and X.ndim==1:
        X = np.hstack((np.ones(X.shape[0])[:,np.newaxis],X[:,np.newaxis]))
        k = 2
    elif addConstant and X.ndim>1:
        X = np.hstack((np.ones(X.shape[0])[:,np.newaxis],X))
        k = X.shape[1]+1
    elif X.ndim==1:
        X = X[:,np.newaxis]
    xxinv = np.linalg.inv(X.T@X)  # inv(X'X)
    beta = xxinv @ X.T@y  # OLS estimates
    e = y - X@beta  # residuals
    n,k = X.shape  # number of observations and parameters
    s2 = e.T@e / (n-k)
    Sigma = s2*xxinv
    if verbose:
        # report the estimates
        print('Number of observations: {:d}\nNumber of parameters: {:d}'.format(n,k))
        print('Parameter estimates (std in brackets)')
        for b,s in zip(beta,np.sqrt(np.diag(Sigma))):
            print('{:10.5f} ({:10.5f})'.format(b,s))
        print('MSE = {:1.5f}\n'.format(np.sqrt(s2)))
    return beta,Sigma

Data on median wages¶

The Economic Guide To Picking A College Major

Data dictionary available at

https://github.com/fivethirtyeight/data/tree/master/college-majors

In [5]:

import pandas as pd
# same data as in video 15
data = pd.read_csv('./_static/data/recent-grads.csv')

In [6]:

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173 entries, 0 to 172
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Rank                  173 non-null    int64  
 1   Major_code            173 non-null    int64  
 2   Major                 173 non-null    object 
 3   Total                 172 non-null    float64
 4   Men                   172 non-null    float64
 5   Women                 172 non-null    float64
 6   Major_category        173 non-null    object 
 7   ShareWomen            172 non-null    float64
 8   Sample_size           173 non-null    int64  
 9   Employed              173 non-null    int64  
 10  Full_time             173 non-null    int64  
 11  Part_time             173 non-null    int64  
 12  Full_time_year_round  173 non-null    int64  
 13  Unemployed            173 non-null    int64  
 14  Unemployment_rate     173 non-null    float64
 15  Median                173 non-null    int64  
 16  P25th                 173 non-null    int64  
 17  P75th                 173 non-null    int64  
 18  College_jobs          173 non-null    int64  
 19  Non_college_jobs      173 non-null    int64  
 20  Low_wage_jobs         173 non-null    int64  
dtypes: float64(5), int64(14), object(2)
memory usage: 28.5+ KB

In [7]:

data.head(n=15)

Out[7]:

	Rank	Major_code	Major	Total	Men	Women	Major_category	ShareWomen	Sample_size	Employed	...	Part_time	Full_time_year_round	Unemployed	Unemployment_rate	Median	P25th	P75th	College_jobs	Non_college_jobs	Low_wage_jobs
0	1	2419	PETROLEUM ENGINEERING	2339.0	2057.0	282.0	Engineering	0.120564	36	1976	...	270	1207	37	0.018381	110000	95000	125000	1534	364	193
1	2	2416	MINING AND MINERAL ENGINEERING	756.0	679.0	77.0	Engineering	0.101852	7	640	...	170	388	85	0.117241	75000	55000	90000	350	257	50
2	3	2415	METALLURGICAL ENGINEERING	856.0	725.0	131.0	Engineering	0.153037	3	648	...	133	340	16	0.024096	73000	50000	105000	456	176	0
3	4	2417	NAVAL ARCHITECTURE AND MARINE ENGINEERING	1258.0	1123.0	135.0	Engineering	0.107313	16	758	...	150	692	40	0.050125	70000	43000	80000	529	102	0
4	5	2405	CHEMICAL ENGINEERING	32260.0	21239.0	11021.0	Engineering	0.341631	289	25694	...	5180	16697	1672	0.061098	65000	50000	75000	18314	4440	972
5	6	2418	NUCLEAR ENGINEERING	2573.0	2200.0	373.0	Engineering	0.144967	17	1857	...	264	1449	400	0.177226	65000	50000	102000	1142	657	244
6	7	6202	ACTUARIAL SCIENCE	3777.0	2110.0	1667.0	Business	0.441356	51	2912	...	296	2482	308	0.095652	62000	53000	72000	1768	314	259
7	8	5001	ASTRONOMY AND ASTROPHYSICS	1792.0	832.0	960.0	Physical Sciences	0.535714	10	1526	...	553	827	33	0.021167	62000	31500	109000	972	500	220
8	9	2414	MECHANICAL ENGINEERING	91227.0	80320.0	10907.0	Engineering	0.119559	1029	76442	...	13101	54639	4650	0.057342	60000	48000	70000	52844	16384	3253
9	10	2408	ELECTRICAL ENGINEERING	81527.0	65511.0	16016.0	Engineering	0.196450	631	61928	...	12695	41413	3895	0.059174	60000	45000	72000	45829	10874	3170
10	11	2407	COMPUTER ENGINEERING	41542.0	33258.0	8284.0	Engineering	0.199413	399	32506	...	5146	23621	2275	0.065409	60000	45000	75000	23694	5721	980
11	12	2401	AEROSPACE ENGINEERING	15058.0	12953.0	2105.0	Engineering	0.139793	147	11391	...	2724	8790	794	0.065162	60000	42000	70000	8184	2425	372
12	13	2404	BIOMEDICAL ENGINEERING	14955.0	8407.0	6548.0	Engineering	0.437847	79	10047	...	2694	5986	1019	0.092084	60000	36000	70000	6439	2471	789
13	14	5008	MATERIALS SCIENCE	4279.0	2949.0	1330.0	Engineering	0.310820	22	3307	...	878	1967	78	0.023043	60000	39000	65000	2626	391	81
14	15	2409	ENGINEERING MECHANICS PHYSICS AND SCIENCE	4321.0	3526.0	795.0	Engineering	0.183985	30	3608	...	811	2004	23	0.006334	58000	25000	74000	2439	947	263

15 rows × 21 columns

In [8]:

import matplotlib.pyplot as plt
%matplotlib inline
data.plot(x='ShareWomen', y='Median', kind='scatter', figsize=(10, 8), color='red')
plt.xlabel('Share of women')
plt.ylabel('Median salary')
# add a linear regression line to the plot

Out[8]:

Text(0, 0.5, 'Median salary')

In [9]:

print(data[['Median','ShareWomen']].isnull().sum()) # check if there are NaNs in the data!
data1 = data[['Median','ShareWomen']].dropna()  # drop NaNs
data1.plot(x='ShareWomen', y='Median', kind='scatter', figsize=(10, 8), color='red')
plt.xlabel('Share of women')
plt.ylabel('Median salary')
# add a linear regression line to the plot
b,_ = ols(X=data['ShareWomen'],y=data['Median'],verbose=False)
fn = lambda x: b[0]+b[1]*x
xx = np.linspace(0,1,100)
plt.plot(xx,fn(xx),color='navy',linewidth=3)
plt.show()

Median        0
ShareWomen    1
dtype: int64

In [10]:

# create fraction variables
data.drop(index=data[data['Total']==0].index,inplace=True)  # drop zero Totals
data.drop(index=data[data['Employed']==0].index,inplace=True)  # drop zero Employed
data['Employment rate'] = data['Employed'] / data['Total']
data['Fulltime rate'] = data['Full_time'] / data['Employed']
data2 = data[['Median','ShareWomen','Employment rate','Fulltime rate']].dropna()  # drop NaNs
y = data2['Median']/1000  # rescale salary

In [11]:

# run the full model
ols(data2[['ShareWomen','Employment rate','Fulltime rate']],y);

Number of observations: 171
Number of parameters: 4
Parameter estimates (std in brackets)
  47.78925 (  10.54264)
 -27.86576 (   3.38424)
 -10.81362 (   9.80911)
  18.51797 (   7.52103)
MSE = 8.83857

Further learning resources¶

Regression analysis using sklearn library https://datascience.quantecon.org/applications/regression.html