Canada's indigenous population is currently facing a water crisis. More than 100 First Nation communities have long term water boil advisories in place, meaning they are without access to clean drinking water.
Many Canadian indigneous populations have faced these conditions for years, and some, for even decades. The average duration of a water boil advisory lasts eight years. In a water-rich country, indigenous nations are facing conditions similar to those in impoverished and developing countries. As a result of increasing pressure from indigenous and social justice organizations, in 2016, the Government of Canada committed to ending all long term drinking water advisories by 2021.
In this project, I will be investigating how long term water boil advisories affect migration, education, and unemployment levels within indigenous nations in Canada. Using data from Statistics Canada and Indigenous Services Canada, I will be looking at long term water boil advisories for the period of 1996-2016. This project is based upon my existing thesis in Econ 494.
I take a data set in which I have merged individual census data for each affected Indigenous nation (which was previously merged within Python) and clean it further.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as sm
import qeds
qeds.themes.mpl_style();
plotly_template = qeds.themes.plotly_template()
colors = qeds.themes.COLOR_CYCLE
from sklearn import (
linear_model, metrics, neural_network, pipeline, model_selection
)
from sklearn import linear_model
Demo_data = pd.read_excel(r'master_data_2016_copy.xlsx')
demo_df = pd.DataFrame(Demo_data)
demo_df.head()
#dropping columns
demo_df.drop([" Married spouses and common-law partners", " Lone parents"], axis = 1, inplace = True)
demo_df.drop(["Number of Years Advisory was On "], axis=1, inplace = True)
#cleaning column names
demo_df.rename(columns={"Year":"year", "First Nation":"first_nation", "Population": "population"}, inplace = True)
demo_df.rename(columns={"Total - Marital Status 15 years and over":"total_married"}, inplace = True)
demo_df.rename(columns={"% of the Aboriginal identity population with an Aboriginal language as mother tongue":"aboriginal_mother_tongue"}, inplace = True)
demo_df.rename(columns={"% of the Aboriginal identity population who speak an Aboriginal language most often at home":"aboriginal_home_language"}, inplace = True)
demo_df.rename(columns={" Children in census families":"children"}, inplace = True)
demo_df.rename(columns={" Married or living common law":"married_commonlaw"," Not married and not living common law":"notmarried_commonlaw"}, inplace = True)
demo_df.rename(columns={"Total - Family Characteristics":"total_family"," Persons not in census families":"not_in_family"}, inplace = True)
demo_df.rename(columns={"Total - Mobility status 1 year ago ":"total_mobility_1yr", "Total - Mobility status 5 years ago":"total_mobility_5yr"}, inplace = True)
demo_df.rename(columns={" Non-movers":"non-movers_1yr"," Movers":"movers_1yr"," Non-movers.1":"non-movers_5yr"," Movers.1":"movers_5yr"}, inplace = True)
demo_df.rename(columns={"Total - Highest certificate, diploma or degree ":"total_hs_higher_edu"," No certificate, diploma or degree":"no_hs_higheredu"," Secondary (high) school diploma or equivalency certificate":"hs_edu"}, inplace = True)
demo_df.rename(columns={" Persons with a trades; college or university certificate or diploma (below bachelor's degree)":"higher_edu_below"}, inplace = True)
demo_df.rename(columns={" University certificate, diploma or degree at bachelor level or above":"higher_edu_bach"}, inplace = True)
demo_df.rename(columns={"Total - Labour force status ": "total_lbr"," In the labour force":"in_lbr_force"," Employed":"employed"," Unemployed":"unemployed"," Not in the labour force":"not_in_lbr_force"},inplace = True)
demo_df.rename(columns={"Participation rate":"participation_rate", "Employment rate":"employment_rate","Unemployment rate":"unemployment_rate","Multiple On":"multiple_on","Advisory On":"advisory_on"}, inplace = True)
#converting all relevant columns into percentages represented in decimal format
demo_df["employment_rate"] = demo_df["employment_rate"].div(100)
demo_df["unemployment_rate"] = demo_df["unemployment_rate"].div(100)
demo_df["participation_rate"] = demo_df["participation_rate"].div(100)
demo_df["married"] = demo_df["married_commonlaw"]/demo_df["total_married"]
demo_df["child_perc"] = demo_df["children"]/demo_df["total_family"]
demo_df["single_perc"] = demo_df["not_in_family"]/demo_df["total_family"]
demo_df["movers_1yr_perc"] = demo_df["movers_1yr"]/demo_df["total_mobility_1yr"]
demo_df["movers_5yr_perc"] = demo_df["movers_5yr"]/demo_df["total_mobility_5yr"]
#generating new column for hs and uni education
demo_df["high_school_more"] = demo_df["hs_edu"] + demo_df["higher_edu_bach"]+ demo_df["higher_edu_below"]
demo_df["high_school_more"] = demo_df["high_school_more"]/demo_df["total_hs_higher_edu"]
#filling Nan with 0's
demo_df = demo_df.fillna(0)
#preview of cleaned data
demo_df.head()
year | first_nation | population | total_married | married_commonlaw | notmarried_commonlaw | total_family | children | not_in_family | aboriginal_mother_tongue | ... | employment_rate | unemployment_rate | multiple_on | advisory_on | married | child_perc | single_perc | movers_1yr_perc | movers_5yr_perc | high_school_more | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1996 | Shoal Lake No.40 | 164 | 100.0 | 35.0 | 70.0 | 40.0 | 30.0 | 0.0 | 0.666667 | ... | 0.450 | 0.167 | 0 | 0 | 0.350000 | 0.750000 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 1996 | North Spirit Lake | 157 | 110.0 | 35.0 | 75.0 | 35.0 | 25.0 | 0.0 | 0.750000 | ... | 0.478 | 0.231 | 0 | 0 | 0.318182 | 0.714286 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 1996 | Sandy Lake | 1611 | 950.0 | 450.0 | 500.0 | 355.0 | 290.0 | 0.0 | 0.618012 | ... | 0.416 | 0.186 | 0 | 0 | 0.473684 | 0.816901 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 1996 | Toosey | 75 | 55.0 | 0.0 | 50.0 | 15.0 | 10.0 | 0.0 | 0.133333 | ... | 0.500 | 0.250 | 0 | 0 | 0.000000 | 0.666667 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 1996 | Indian Island | 52 | 30.0 | 5.0 | 25.0 | 15.0 | 0.0 | 0.0 | 0.272727 | ... | 0.333 | 0.500 | 0 | 0 | 0.166667 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 38 columns
Here I prepare some visualizations to better understand the data I have cleaned. These graphs are meant to give some preliminary observations on the composition of the data before regressions are run.
For an initial visualization of the data, here I am going to look at the number of advisories over my time frame, 1996-2016. Since demographic data is only available for each census year in this time frame, I coded the occurance of an advisory for a census year if an advisory was on within the last five years of the relevant census year.
#number of advisories by year
year = demo_df.groupby("year")
year.head()
df_1996 = year.get_group(1996)
#print (pd.value_counts(df_1996['advisory_on'].values, sort=False))
df_2001 = year.get_group(2001)
#print (pd.value_counts(df_2001['advisory_on'].values, sort=False))
df_2006 = year.get_group(2006)
#print(pd.value_counts(df_2006["advisory_on"].values, sort=False))
df_2011 = year.get_group(2011)
#print(pd.value_counts(df_2011["advisory_on"].values, sort=False))
df_2016 = year.get_group(2016)
#print(pd.value_counts(df_2016["advisory_on"].values, sort=False))
info = {"year":[1996, 2001, 2006, 2011, 2016], "Advisory On":[0,0,12,26,66],"Advisory Off":[46,58,80,47,27]}
year_df = pd.DataFrame(data=info)
year_df.columns
Index(['year', 'Advisory On', 'Advisory Off'], dtype='object')
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
import matplotlib.pyplot as plt
y = year_df["year"]
x = year_df['Advisory On']
plt.bar(y,x, width = 3, color = ["white","white","lightblue","skyblue","steelblue"])
plt.xticks(y)
plt.grid(None)
plt.title("Indigenous Water Boil Advisories per Census Year", fontsize = 15)
plt.xlabel("Census Year", fontsize = 10)
plt.ylabel("Count of Advisories", fontsize = 10)
plt.tight_layout()
plt.figure(figsize=(20,10))
plt.savefig("boil_advisories_4.png")
#plt.savefig('myimage.svg', format='svg', dpi=1200)
plt.show()
<Figure size 1440x720 with 0 Axes>
As we can see from the graph above, the number of advisories is increasing per year, with no advisories occuring in 1996 and 2001. Due to limitations of data, within my dataset, the number of advisories in 1996 and 2001 is zero, however in reality this is not the case. The reason there could be "missing advisories" is due to the fact that demographic data for the affected First Nation was not available for the relevant census year, therefore, the First Nation could not be marked as having an advisory. Note that the count of advisories means the number of First Nation communities that have water boil advisories within 5 years of the relevant census year.
#boxpot of HS/Higher Education
import seaborn as sns
edu_boxplot = sns.boxplot(x="advisory_on", y="high_school_more", data=demo_df, palette = "Blues")
edu_boxplot.set_xlabel("Advisory On", fontsize = 13)
edu_boxplot.set_ylabel("HS/Higher Education", fontsize = 13)
edu_boxplot.set_title("Proportion of Indigneous community with HS/Higher Education", fontsize= 13)
edu_boxplot.grid(False)
#boxplot of unemployment rate
unemp_boxplot = sns.boxplot(x="advisory_on", y="unemployment_rate", data=demo_df, palette = "Blues")
unemp_boxplot.set_xlabel("Advisory On", fontsize = 13)
unemp_boxplot.set_ylabel("Unemployment Rate", fontsize = 13)
unemp_boxplot.set_title("Unemployment Rate Dynamics", fontsize=16)
unemp_boxplot.grid(False)
#unemp_boxplot.legend(handles = unemp_boxplot, labels = ["Advisory Off", "Advisory On"])
#boxplot of migration (population)
plt.figure(figsize=(5,9))
pop_boxplot = sns.boxplot(x="advisory_on", y="population", data=demo_df, palette = "Blues")
pop_boxplot.set_xlabel("Advisory On", fontsize = "13")
pop_boxplot.set_ylabel("Population", fontsize = "13")
pop_boxplot.set_title("Population Dynamics", fontsize = 15)
pop_boxplot.grid(False)
#zooming in
pop_boxplot = sns.boxplot(x="advisory_on", y="population", data=demo_df, palette = "Blues")
pop_boxplot.set_xlabel("Advisory On", fontsize = 13)
pop_boxplot.set_ylabel("Population", fontsize = 13)
pop_boxplot.set_title("Zoomed In Box Plot of Population Dynamics", fontsize =15 )
plt.axis([-1,2,0,1500])
pop_boxplot.grid(False)
From the graphs, we see that when an advisory is turned on, the mean of the unemployment rate and attainment of a highschool/higher level education is higher than when an adivsory is off or not in place. For population (migration change) we see that it is slightly lower when an advisory is in effect.
To get a better idea of how long term water boil advisories are distributed across Canada, I'm going to make a map of the individual First Nation communities that are affected. First, I'll make my dataset of coordinates so I will be ready to put the point on the map. I created this dataset by searching up affected First Nation community names, and obtained coordinates from the GeoHacks website from Wikipedia, or from Google Maps.
Using folium, I create a map of Canada with interactive points with the names of affected First Nation communities. Click to take a look at the names!
import folium
map_1 = folium.Map(location=[55.585901, -105.750596], zoom_start = 5)
map_data = pd.read_excel(r'map data 2.0.xlsx')
map_df_1 = pd.DataFrame(map_data)
map_df_1["Coordinates"] = list(zip(map_df_1["Longitude"], map_df_1["Latitude"]))
map_df_1.head()
map_df_1['Latitude'] = map_df_1['Latitude'].astype(float)
map_df_1["Longitude"] = map_df_1["Longitude"].astype(float)
for i in range(0,len(map_df_1)):
folium.Marker([map_df_1.iloc[i]['Latitude'], map_df_1.iloc[i]['Longitude']], popup=map_df_1.iloc[i]['First Nation']).add_to(map_1)
map_1
As we can see from the graph above, the provinces that are most affected by long term water boil advisories are Manitoba and Ontario. However, water boil advisories seem to plague the majority of Canada, with exception to the North West Territories, Nunavat, and the Yukon.
Here, I will be looking at a simple fixed effects panel OLS regression for the outcomes of long term water boil advisories on migration, educational attainment, and unemployment rates.
For the regression on migration, I will run one (First Nation community and year) fixed effects regression without additional controls, and another with additional controls. These controls will control for noise in the estimate.
import statsmodels.formula.api as smf
from statsmodels.iolib.summary2 import summary_col
demo_df.columns
#regression on population
lm_pop = list()
#no controls, fixed effects
lm_pop.append(smf.ols(formula="population ~ advisory_on + C(first_nation) + C(year)", data=demo_df,
missing="drop").fit(cov_type='HC0'))
#controls, fixed effects
lm_pop.append(smf.ols(formula="population ~ advisory_on + C(first_nation) + C(year) + child_perc + movers_1yr_perc + movers_5yr_perc + high_school_more + aboriginal_mother_tongue + aboriginal_home_language", data=demo_df,
missing="drop").fit(cov_type='HC0'))
regressors = ["Intercept","advisory_on", "child_perc", "movers_1yr_perc","movers_5yr_perc","high_school_more","aboriginal_mother_tongue","aboriginal_home_language"]
tble = summary_col(lm_pop,regressor_order=regressors, stars = True)
tble.tables[0] = tble.tables[0].loc[regressors]
print (tble)
=================================================== population I population II --------------------------------------------------- Intercept 139.6264*** 196.9917*** advisory_on -20.4577 -15.7457 child_perc -76.8143** movers_1yr_perc -224.3292 movers_5yr_perc 41.3787 high_school_more -46.3176 aboriginal_mother_tongue 3.2227*** aboriginal_home_language -2.5327** =================================================== Standard errors in parentheses. * p<.1, ** p<.05, ***p<.01
With the results observed above, our estimates with and without controls are not significant. The effects of water boil advisory on population change are quite small, decreasing the population by around 20 people without controls, and around 15 in the estimate with controls.
Like the regression on migration, I will run two fixed effect regressions controlling for variation that occurs First Nation community and years. One regression will run without additional controls, another will incorporate additional controls.
#regression on attainment of high school education and/or higher level education
lm_edu = list()
#no controls, fixed effects
lm_edu.append(smf.ols(formula="high_school_more ~ advisory_on + C(first_nation) + C(year)", data=demo_df,
missing="drop").fit(cov_type='HC0'))
#controls, fixed effects
lm_edu.append(smf.ols(formula="high_school_more ~ advisory_on + C(first_nation) + C(year) + child_perc + married", data=demo_df,
missing="drop").fit(cov_type='HC0'))
regressors_1 = ["Intercept","advisory_on", "child_perc", "married"]
tble_1 = summary_col(lm_edu,regressor_order=regressors_1, stars = True)
tble_1.tables[0] = tble_1.tables[0].loc[regressors_1]
print (tble_1)
================================================== high_school_more I high_school_more II -------------------------------------------------- Intercept 0.0179 -0.2316** advisory_on 0.0218 -0.0019 child_perc 0.1647*** married 0.4152*** ================================================== Standard errors in parentheses. * p<.1, ** p<.05, ***p<.01
Similar to our regression looking at migration through population change, our effects on high school and/or higher level attainment are insignificant. Without controls, it seems that the proportion of a indigneous community that would have achieved a high school diploma or higher would have actually increased by ~2%, and when I add controls, it would have decreased by less than 1%.
#regression on unemployment rate
lm_unemp = list()
#no controls, fixed effects
lm_unemp.append(smf.ols(formula="unemployment_rate ~ advisory_on + C(first_nation) + C(year)", data=demo_df,
missing="drop").fit(cov_type='HC0'))
#controls, fixed effects
lm_unemp.append(smf.ols(formula="unemployment_rate ~ advisory_on + C(first_nation) + C(year) + high_school_more + child_perc + married", data=demo_df,
missing="drop").fit(cov_type='HC0'))
regressors_2 = ["Intercept","advisory_on", "child_perc", "married", "high_school_more"]
tble_2= summary_col(lm_unemp,regressor_order=regressors_2, stars = True)
tble_2.tables[0] = tble_2.tables[0].loc[regressors_2]
print (tble_2)
========================================================= unemployment_rate I unemployment_rate II --------------------------------------------------------- Intercept 0.2145*** 0.1384*** advisory_on 0.0021 -0.0171 child_perc 0.1971*** married -0.1118 high_school_more 0.2742*** ========================================================= Standard errors in parentheses. * p<.1, ** p<.05, ***p<.01
The effects on unemployment rate are also found to be insignificant. For the regression without additional controls, the unemployment rate increases by less than 1%. For the regression with additional controls, it seems that the unemployment rate for an average indigneous community would actually decrease by around 2%.
Throughout these three regressions, what I find is that there are no significant effects on migration, high school and/or higher level educational attainment, or unemployment levels. This means that when a long term water boil advisory is in effect, indigenous populations do not move out of their communities, their levels of educational attainment are unaffected, and unemployment rates are unaffected.
I will implement a lasso regression as my regularization tool to look at which coefficients from my fixed effects regression could be eliminated due to overfitting. For each of my regressions looking at different outcomes, I will test out the different controls that are particular to each outcome variable and see how they perform under lasso regression. Instead of the fixed effects regression above, I will instead be implementing a simple linear regression, omitting time and First Nation community fixed effects.
#lasso regression for controls pertaining to population
demo_df.columns
X_pop = demo_df.drop(["year", "higher_edu_bach", "total_lbr","in_lbr_force", "first_nation", "population", "total_married", "not_in_family", "married_commonlaw", "notmarried_commonlaw", "employed", "employment_rate", "single_perc", "total_family", "children","total_mobility_1yr","non-movers_1yr","movers_1yr","total_mobility_5yr","non-movers_5yr","movers_5yr","total_hs_higher_edu","no_hs_higheredu","hs_edu","higher_edu_below","unemployed","not_in_lbr_force","participation_rate","unemployment_rate","multiple_on","married"], axis=1).copy()
X_pop.head()
y_pop = demo_df["population"]
lasso_model_pop = linear_model.Lasso()
lasso_model_pop.fit(X_pop, y_pop)
lr_model_pop = linear_model.LinearRegression()
lr_model_pop.fit(X_pop, y_pop)
lasso_coefs_pop = pd.Series(dict(zip(list(X_pop), lasso_model_pop.coef_)))
lr_coefs_pop = pd.Series(dict(zip(list(X_pop), lr_model_pop.coef_)))
coefs_pop = pd.DataFrame(dict(lasso=lasso_coefs_pop, linreg=lr_coefs_pop))
coefs_pop
lasso | linreg | |
---|---|---|
aboriginal_mother_tongue | 6.541572 | 6.144288 |
aboriginal_home_language | -1.897842 | -1.187683 |
advisory_on | -9.126670 | -22.827865 |
child_perc | 346.293725 | 368.148840 |
movers_1yr_perc | -4.705196 | -704.278714 |
movers_5yr_perc | -0.000000 | 105.654375 |
high_school_more | 363.173397 | 449.845937 |
Here we see that most of the control variables used do not contribute to overfitting within the regression, this is very good news! The only variable that could cause an overfitting issue and should be eliminated from my regression is movers_5yr_perc.
from itertools import cycle
#alphas = np.exp(np.linspace(10, -2, 50))
alphas, coefs_lasso_pop, _ = linear_model.lasso_path(X_pop, y_pop, alphas=None, fit_intercept=True, max_iter=10000)
# plotting
fig, ax = plt.subplots(figsize=(12, 8))
colors = cycle(qeds.themes.COLOR_CYCLE)
log_alphas = -np.log10(alphas)
for coef_l, c, name in zip(coefs_lasso_pop, colors, list(X_pop)):
ax.plot(log_alphas, coef_l, c=c)
ax.set_xlabel('-Log(alpha)')
ax.set_ylabel('lasso coefficients')
ax.set_title('Lasso Path for Coefficients of Controls in Migration Regression')
ax.axis('tight')
maxabs = np.max(np.abs(coef_l))
i = [idx for idx in range(len(coef_l)) if abs(coef_l[idx]) >= (0.9*maxabs)][0]
xnote = log_alphas[i]
ynote = coef_l[i]
ax.annotate(name, (xnote, ynote), color=c)
#for regression on completion of high school and/or some or all higher level education
demo_df.columns
X_edu = demo_df.drop(["year","higher_edu_bach","total_lbr","in_lbr_force", "aboriginal_home_language","aboriginal_mother_tongue", "high_school_more","movers_1yr_perc","movers_5yr_perc", "first_nation", "population", "total_married", "not_in_family", "married_commonlaw", "notmarried_commonlaw", "employed", "employment_rate", "single_perc", "total_family", "children","total_mobility_1yr","non-movers_1yr","movers_1yr","total_mobility_5yr","non-movers_5yr","movers_5yr","total_hs_higher_edu","no_hs_higheredu","hs_edu","higher_edu_below","unemployed","not_in_lbr_force","participation_rate","unemployment_rate","multiple_on"], axis=1).copy()
X_edu.head()
y_edu = demo_df["high_school_more"]
lasso_model_edu = linear_model.Lasso()
lasso_model_edu.fit(X_edu, y_edu)
lr_model_edu = linear_model.LinearRegression()
lr_model_edu.fit(X_edu, y_edu)
lasso_coefs_edu = pd.Series(dict(zip(list(X_edu), lasso_model_pop.coef_)))
lr_coefs_edu = pd.Series(dict(zip(list(X_edu), lr_model_pop.coef_)))
coefs_edu = pd.DataFrame(dict(lasso=lasso_coefs_edu, linreg=lr_coefs_edu))
coefs_edu
lasso | linreg | |
---|---|---|
advisory_on | 6.541572 | 6.144288 |
married | -1.897842 | -1.187683 |
child_perc | -9.126670 | -22.827865 |
It looks like from above, we don't have to eliminate any of our control variables! Let's also see a visualization below.
from itertools import cycle
#alphas = np.exp(np.linspace(10, -2, 50))
alphas, coefs_lasso_edu, _ = linear_model.lasso_path(X_edu, y_edu, alphas = None, fit_intercept=True, max_iter=10000)
# plotting
fig, ax = plt.subplots(figsize=(12, 8))
colors = cycle(qeds.themes.COLOR_CYCLE)
log_alphas = -np.log10(alphas)
for coef_l_e, c, name in zip(coefs_lasso_edu, colors, list(X_edu)):
ax.plot(log_alphas, coef_l_e, c=c)
ax.set_xlabel('-Log(alpha)')
ax.set_ylabel('lasso coefficients')
ax.set_title('Lasso Path for Coefficients of Controls in Education Regression')
ax.axis('tight')
maxabs = np.max(np.abs(coef_l_e))
i = [idx for idx in range(len(coef_l_e)) if abs(coef_l_e[idx]) >= (0.9*maxabs)][0]
xnote = log_alphas[i]
ynote = coef_l_e[i]
ax.annotate(name, (xnote, ynote), color=c)
#for regression on unemployment rates
demo_df.columns
X_unemp = demo_df.drop(["year","higher_edu_bach","total_lbr","in_lbr_force", "aboriginal_home_language","aboriginal_mother_tongue","movers_1yr_perc","movers_5yr_perc", "first_nation", "population", "total_married", "not_in_family", "married_commonlaw", "notmarried_commonlaw", "employed", "employment_rate", "single_perc", "total_family", "children","total_mobility_1yr","non-movers_1yr","movers_1yr","total_mobility_5yr","non-movers_5yr","movers_5yr","total_hs_higher_edu","no_hs_higheredu","hs_edu","higher_edu_below","unemployed","not_in_lbr_force","participation_rate","unemployment_rate","multiple_on"], axis=1).copy()
X_unemp.head()
y_unemp = demo_df["unemployment_rate"]
lasso_model_unemp = linear_model.Lasso()
lasso_model_unemp.fit(X_unemp, y_unemp)
lr_model_unemp = linear_model.LinearRegression()
lr_model_unemp.fit(X_unemp, y_unemp)
lasso_coefs_unemp = pd.Series(dict(zip(list(X_unemp), lasso_model_unemp.coef_)))
lr_coefs_unemp = pd.Series(dict(zip(list(X_unemp), lr_model_unemp.coef_)))
coefs_unemp = pd.DataFrame(dict(lasso=lasso_coefs_unemp, linreg=lr_coefs_unemp))
coefs_unemp
lasso | linreg | |
---|---|---|
advisory_on | 0.0 | 0.001543 |
married | 0.0 | -0.137155 |
child_perc | 0.0 | 0.348416 |
high_school_more | 0.0 | 0.232869 |
This is surprising - all of the controls, including the treatment variable, contribute to overfitting within the regression! However, I can't chuck out the treatment variable, or else, I would have no regression. But, I can disinclude the other controls and just use fixed effects.
from itertools import cycle
alphas = np.exp(np.linspace(10, -2, 50))
alphas, coefs_lasso_unemp, _ = linear_model.lasso_path(X_unemp, y_unemp, alphas = alphas, fit_intercept=True, max_iter=10000)
# plotting
fig, ax = plt.subplots(figsize=(12, 8))
colors = cycle(qeds.themes.COLOR_CYCLE)
log_alphas = -np.log10(alphas)
for coef_l, c, name in zip(coefs_lasso_unemp, colors, list(X_unemp)):
ax.plot(log_alphas, coef_l, c=c)
ax.set_xlabel('-Log(alpha)')
ax.set_ylabel('lasso coefficients')
ax.set_title('Lasso Path for Coefficients of Controls in Unemployment Regression')
ax.axis('tight')
maxabs = np.max(np.abs(coef_l))
i = [idx for idx in range(len(coef_l)) if abs(coef_l[idx]) >= (0.9*maxabs)][0]
xnote = log_alphas[i]
ynote = coef_l[i]
ax.annotate(name, (xnote, ynote), color=c)
#regression on population
lm_pop_1 = list()
#no controls, fixed effects
lm_pop_1.append(smf.ols(formula="population ~ advisory_on + C(first_nation) + C(year)", data=demo_df,
missing="drop").fit(cov_type='HC0'))
#old controls before lasso regression, fixed effects
lm_pop_1.append(smf.ols(formula="population ~ advisory_on + C(first_nation) + C(year) + child_perc + movers_1yr_perc + movers_5yr_perc + high_school_more + aboriginal_mother_tongue + aboriginal_home_language", data=demo_df,
missing="drop").fit(cov_type='HC0'))
#new regression with considerations from the lasso regression, fixed effects
lm_pop_1.append(smf.ols(formula="population ~ advisory_on + C(first_nation) + C(year) + child_perc + movers_1yr_perc + high_school_more + aboriginal_mother_tongue + aboriginal_home_language", data=demo_df,
missing="drop").fit(cov_type='HC0'))
regressors_pop_1 = ["Intercept","advisory_on", "child_perc", "movers_1yr_perc","movers_5yr_perc","high_school_more","aboriginal_mother_tongue","aboriginal_home_language"]
table_pop_1 = summary_col(lm_pop_1,regressor_order=regressors_pop_1, stars = True)
table_pop_1.tables[0] = table_pop_1.tables[0].loc[regressors_pop_1]
print (table_pop_1)
================================================================== population I population II population III ------------------------------------------------------------------ Intercept 139.6264*** 196.9917*** 197.1705*** advisory_on -20.4577 -15.7457 -16.5453 child_perc -76.8143** -77.1064** movers_1yr_perc -224.3292 -167.4374 movers_5yr_perc 41.3787 high_school_more -46.3176 -39.0160 aboriginal_mother_tongue 3.2227*** 3.2881*** aboriginal_home_language -2.5327** -2.5524** ================================================================== Standard errors in parentheses. * p<.1, ** p<.05, ***p<.01
By eliminating the control that looked at the proportion of people who moved within the last five years, our estimate when the advisory is turned on slightly decreases to 16, as seen in the population III column. This estimate says that when an advisory is turned on, more people migrate out of their community than previously seen with the population II regression. It is important to note that all three estimates are still not signficant.
From the lasso regression, I found out that all of the control variables I introduced didn't contribute to overfitting! Here are the results of the original regression, with and without additional controls below.
#regression on attainment of high school education and/or higher level education
lm_edu = list()
#no controls, fixed effects
lm_edu.append(smf.ols(formula="high_school_more ~ advisory_on + C(first_nation) + C(year)", data=demo_df,
missing="drop").fit(cov_type='HC0'))
#controls, fixed effects
lm_edu.append(smf.ols(formula="high_school_more ~ advisory_on + C(first_nation) + C(year) + child_perc + married", data=demo_df,
missing="drop").fit(cov_type='HC0'))
regressors_1 = ["Intercept","advisory_on", "child_perc", "married"]
tble_1 = summary_col(lm_edu,regressor_order=regressors_1, stars = True)
tble_1.tables[0] = tble_1.tables[0].loc[regressors_1]
print (tble_1)
================================================== high_school_more I high_school_more II -------------------------------------------------- Intercept 0.0179 -0.2316** advisory_on 0.0218 -0.0019 child_perc 0.1647*** married 0.4152*** ================================================== Standard errors in parentheses. * p<.1, ** p<.05, ***p<.01
Since the additional control variables don't contribute to overfitting, they might give a more accurate estimate than high_school_more I, which only controls for fixed effects. In high_school_more II I see find that when there is a long term water boil advisory in place for an indigenous community, the proportion of people within the community who attain a high school education and or higher education decreases by ~ 0.2%, although this effect is not significant.
#regression on unemployment rate
lm_unemp = list()
#no controls, fixed effects
lm_unemp.append(smf.ols(formula="unemployment_rate ~ advisory_on + C(first_nation) + C(year)", data=demo_df,
missing="drop").fit(cov_type='HC0'))
regressors_2 = ["Intercept","advisory_on"]
tble_2= summary_col(lm_unemp,regressor_order=regressors_2, stars = True)
tble_2.tables[0] = tble_2.tables[0].loc[regressors_2]
print (tble_2)
============================= unemployment_rate ----------------------------- Intercept 0.2145*** advisory_on 0.0021 ============================= Standard errors in parentheses. * p<.1, ** p<.05, ***p<.01
Since the lasso regression indicated that I was overfitting my model by incorporating my additonal controls, I only run one regression - one with fixed effects and no other controls. I see that the effect of a long term water boil advisory on unemployment rates is still not signficant, however, the impact is quite small at 0.2% increase.