import warnings
warnings.filterwarnings("ignore")
Regression (and prediction more generally) provides a perfect case for examining how spatial structure can help us understand and analyze our data. In this chapter, we discuss how spatial structure can be used to both validate and improve prediction algorithms, focusing on linear regression specifically.
Usually, spatial structure helps regression models in one of two ways. The first (and clearest) way space can have an impact on our data is when the process generating the data is itself explicitly spatial. Here, think of something like the prices of single-family homes. Individuals often pay a premium on their house price in order to live in a better school district, even for a house of the same quality. Alternatively, homes closer to noise or chemical polluters, such as wastewater treatment plants, recycling facilities, or wide highways, may actually be cheaper than we would otherwise anticipate. In cases like asthma incidence, the locations individuals tend to travel to throughout the day, such as their places of work or recreation, may have more impact on their health than their residential addresses; it may then be necessary to use data from other sites to predict the asthma incidence at a given site. Regardless of the specific case at play, geography here is a feature: it directly helps us make predictions about outcomes because those outcomes are obtained from geographical processes.
An alternative (and more skeptical) understanding reluctantly acknowledges geography's instrumental value. Often, in the analysis of predictive methods and classifiers, we are interested in analyzing what we get wrong. This is common in econometrics; an analyst may be concerned that the model systematically mis-predicts some types of observations. If we know our model routinely performs poorly on a known set of observations or type of input, we might be able to build a better model if we can account for this. Among other kinds of error diagnostics, geography provides us with an exceptionally useful embedding in which to assess structure in our errors. Mapping classification or prediction error can help show whether there are clusters of error in our data. If we know that errors tend to be larger in some areas than in others (or if error is "contagious" between observations), then we might be able to exploit this structure to make better predictions.
Spatial structure in our errors might arise when geography should be an attribute of some kind, but we are not sure exactly how to include it in our model. It might also arise because there is some other feature whose omission causes the spatial patterns in the errors we see; if this additional feature were included, the structure would disappear. Or it might arise from the complex interactions and interdependencies between the features that we have chosen to use as predictors, resulting in intrinsic structure in mis-prediction. Most of the predictors we use in models of social processes contain embodied spatial information: patterning intrinsic to the feature that we get for free in the model. Whether we intend it or not, using a spatially patterned predictor in a model can result in spatially patterned errors; using more than one can amplify this effect. Thus, regardless of whether the true process is explicitly geographic, additional information about the spatial relationships between our observations or more information about nearby sites can make our predictions better.
In this chapter, we build space into the traditional regression framework. We begin with a standard linear regression model, devoid of any geographical reference. From there, we formalize space and spatial relationships in three main ways: first, by encoding it in exogenous variables; second, through spatial heterogeneity, or systematic variation of outcomes across space; third, as dependence, or through the effect associated with the characteristics of spatial neighbors. Throughout, we focus on the conceptual differences each approach entails rather than on the technical details.
from pysal.lib import weights
from pysal.explore import esda
import numpy
import pandas
import geopandas
import matplotlib.pyplot as plt
import seaborn
import contextily
To learn a little more about how regression works, we'll examine information about Airbnb properties in San Diego, CA. This dataset contains characteristics intrinsic to each house, both continuous (e.g., the number of beds, as in `beds`) and categorical (e.g., the type of rental or, in Airbnb jargon, the property group, as in the series of `pg_X` binary variables), as well as variables that explicitly refer to the location and spatial configuration of the dataset (e.g., distance to Balboa Park, `d2balboa`, or neighborhood ID, `neighborhood_cleansed`).
db = geopandas.read_file("../data/airbnb/regression_db.geojson")
These are the explanatory variables we will use throughout the chapter.
variable_names = [
"accommodates", # Number of people it accommodates
"bathrooms", # Number of bathrooms
"bedrooms", # Number of bedrooms
"beds", # Number of beds
# Below are binary variables, 1 True, 0 False
"rt_Private_room", # Room type: private room
"rt_Shared_room", # Room type: shared room
"pg_Condominium", # Property group: condo
"pg_House", # Property group: house
"pg_Other", # Property group: other
"pg_Townhouse", # Property group: townhouse
]
Before we discuss how to explicitly include space into the linear regression framework, let us show how basic regression can be carried out in Python, and how one can begin to interpret the results. By no means is this a formal and complete introduction to regression; if that is what you are looking for, we recommend {cite}`Gelman_2006`, in particular chapters 3 and 4, which provide a fantastic, non-spatial introduction.
The core idea of linear regression is to explain the variation in a given (dependent) variable as a linear function of a collection of other (explanatory) variables. For example, in our case, we may want to express the price of a house as a function of the number of bedrooms it has and whether it is a condominium or not. At the individual level, we can express this as:
$$ P_i = \alpha + \sum_k \mathbf{X}_{ik}\beta_k + \epsilon_i $$

where $P_i$ is the Airbnb price of house $i$, and $X$ is a set of covariates that we use to explain such price (e.g., the number of bedrooms and the condominium binary variable). $\beta$ is a vector of parameters that tell us the direction and extent to which each variable is related to the price, and $\alpha$, the constant term, is the average house price when all the other variables are zero. The term $\epsilon_i$ is usually referred to as the "error" and captures elements that influence the price of a house but are not included in $X$. We can also express this relation in matrix form, excluding sub-indices for $i$, which yields:
$$ P = \alpha + \mathbf{X}\beta + \epsilon $$

A regression can be seen as a multivariate extension of bivariate correlations. Indeed, one way to interpret the $\beta_k$ coefficients in the equation above is as the degree of correlation between the explanatory variable $k$ and the dependent variable, keeping all the other explanatory variables constant. When one calculates bivariate correlations, the coefficient of a variable picks up the correlation between the two variables, but it also subsumes variation associated with other correlated variables, also called confounding factors. Regression allows us to isolate the distinct effect that a single variable has on the dependent one, once we control for those other variables.
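To make the matrix form concrete, here is a minimal sketch, using only the `db` table and `variable_names` list defined above, of how the $\beta$ vector could be estimated by ordinary least squares directly with `numpy`; the dedicated regression packages introduced next do this for us and add standard errors and diagnostics.

# Minimal OLS sketch following the matrix form above, with the log of the
# price as the dependent variable (matching the model fitted later)
y = db[["log_price"]].values                    # (n, 1) dependent variable
X = db[variable_names].values                   # (n, k) covariates
X = numpy.hstack([numpy.ones((len(X), 1)), X])  # prepend the constant term
# Solve the least-squares problem min ||y - Xb||^2 for b = (alpha, beta)
betas, *_ = numpy.linalg.lstsq(X, y, rcond=None)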
Practically speaking, linear regressions in Python are rather streamlined and easy to work with. There are also several packages which will run them (e.g., `statsmodels`, `scikit-learn`, `pysal`). We will import the `spreg` module in Pysal:
from pysal.model import spreg
In the context of this chapter, it makes sense to start with `spreg`, as that is the only library that will allow us to move into explicitly spatial econometric models. To fit the model specified in the equation above with $X$ as the list defined, using ordinary least squares (OLS), we only need the following line of code:
# Fit OLS model
m1 = spreg.OLS(
# Dependent variable
db[["log_price"]].values,
# Independent variables
db[variable_names].values,
# Dependent variable name
name_y="log_price",
# Independent variable name
name_x=variable_names,
)
We use the command `OLS`, part of the `spreg` sub-package, and specify the dependent variable (the log of the price, so we can interpret results in terms of percentage change) and the explanatory ones. Note that both objects need to be arrays, so we extract them from the `pandas.DataFrame` object using `.values`.

In order to inspect the results of the model, we can print the `summary` attribute:
print(m1.summary)
REGRESSION
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set            :     unknown
Weights matrix      :        None
Dependent Variable  :   log_price                Number of Observations:        6110
Mean dependent var  :      4.9958                Number of Variables   :          11
S.D. dependent var  :      0.8072                Degrees of Freedom    :        6099
R-squared           :      0.6683
Adjusted R-squared  :      0.6678
Sum squared residual:    1320.148                F-statistic           :   1229.0564
Sigma-square        :       0.216                Prob(F-statistic)     :           0
S.E. of regression  :       0.465                Log likelihood        :   -3988.895
Sigma-square ML     :       0.216                Akaike info criterion :    7999.790
S.E of regression ML:      0.4648                Schwarz criterion     :    8073.685

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     t-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT       4.3883830       0.0161147     272.3217773       0.0000000
        accommodates       0.0834523       0.0050781      16.4336318       0.0000000
           bathrooms       0.1923790       0.0109668      17.5419773       0.0000000
            bedrooms       0.1525221       0.0111323      13.7009195       0.0000000
                beds      -0.0417231       0.0069383      -6.0134430       0.0000000
     rt_Private_room      -0.5506868       0.0159046     -34.6244758       0.0000000
      rt_Shared_room      -1.2383055       0.0384329     -32.2198992       0.0000000
      pg_Condominium       0.1436347       0.0221499       6.4846529       0.0000000
            pg_House      -0.0104894       0.0145315      -0.7218393       0.4704209
            pg_Other       0.1411546       0.0228016       6.1905633       0.0000000
        pg_Townhouse      -0.0416702       0.0342758      -1.2157316       0.2241342
------------------------------------------------------------------------------------

REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER           11.964

TEST ON NORMALITY OF ERRORS
TEST                             DF        VALUE           PROB
Jarque-Bera                       2       2671.611           0.0000

DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST                             DF        VALUE           PROB
Breusch-Pagan test               10        322.532           0.0000
Koenker-Bassett test             10        135.581           0.0000
================================ END OF REPORT =====================================
A full, detailed explanation of the output is beyond the scope of this chapter, so we will focus on the relevant bits for our main purpose. This is concentrated on the coefficients section, which gives us the estimates for $\beta_k$ in our model. In other words, these numbers express the relationship between each explanatory variable and the dependent one, once the effect of confounding factors has been accounted for. Keep in mind, however, that regression is no magic: we are only discounting the effect of the confounding factors that we include in the model, not of all potentially confounding factors.
Results are largely as expected: houses tend to be significantly more expensive if they accommodate more people (`accommodates`), if they have more bathrooms and bedrooms, and if they are a condominium or part of the "other" category of house type. Conversely, for a given number of rooms, houses with more beds (i.e., listings that are more "crowded") tend to go for cheaper, as is the case for properties where one does not rent the entire house but only a private room (`rt_Private_room`) or even a shared room (`rt_Shared_room`). Of course, you might reasonably doubt the assumption that it is possible to arbitrarily change the number of beds within an Airbnb without eventually changing the number of people it accommodates, but methods to address these concerns using interaction effects won't be discussed here.
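Because the dependent variable is the log of the price, the coefficients can be read (approximately) as relative changes in price. As a small worked example, the snippet below converts the `accommodates` estimate reported in the summary above into a percentage effect using $\exp(\beta) - 1$:

# Convert the `accommodates` coefficient into an approximate percentage
# change in nightly price per additional guest, other covariates constant
beta_accommodates = 0.0834523  # taken from the summary table above
print(f"~{(numpy.exp(beta_accommodates) - 1) * 100:.1f}% per additional guest")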
In general, our model performs well, predicting about two-thirds ($R^2=0.67$) of the variation in the nightly (log) price using the covariates discussed above. But our model might display some clustering in its errors, which may be a problem, as that violates the i.i.d. assumption that linear models typically rely on. To interrogate this, we can do a few things. One simple check is to look at the correlation between the error in predicting an Airbnb and the error in predicting its nearest neighbor. To examine this, we first might want to split our data up by region and see if we've got some spatial structure in our residuals. One reasonable theory might be that our model does not include any information about beaches, a critical aspect of why people live in and vacation in San Diego. Therefore, we might want to see whether our errors are higher or lower depending on whether an Airbnb is in a "beach" neighborhood, i.e., a neighborhood near the ocean. We use the code below to generate Figure XXX1XXX, which compares the distribution of model errors between the two groups of houses, "beach" and "no beach".
# Create a Boolean (True/False) with whether a
# property is coastal or not
is_coastal = db.coastal.astype(bool)
# Split residuals (m1.u) between coastal and not
coastal = m1.u[is_coastal]
not_coastal = m1.u[~is_coastal]
# Create histogram of the distribution of coastal residuals
plt.hist(coastal, density=True, label="Coastal")
# Create histogram of the distribution of non-coastal residuals
plt.hist(
not_coastal,
histtype="step",
density=True,
linewidth=4,
label="Not Coastal",
)
# Add Line on 0
plt.vlines(0, 0, 1, linestyle=":", color="k", linewidth=4)
# Add legend
plt.legend()
# Display
plt.show()
While it appears that the neighborhoods on the coast have only slightly higher average errors (and have lower variance in their prediction errors), the two distributions are significantly distinct from one another when compared using a classic $t$-test:
from scipy.stats import ttest_ind
ttest_ind(coastal, not_coastal)
Ttest_indResult(statistic=array([13.98193858]), pvalue=array([9.442438e-44]))
There are more sophisticated (and harder to fool) tests that may be applicable for this data, however. We cover them in the Challenge section.
Additionally, it might be the case that some neighborhoods are more desirable than others due to unmodeled latent preferences or marketing. For instance, despite being close to the sea, living near Camp Pendleton (a Marine base in the north of the city) may carry a significant penalty in area desirability due to noise and pollution. These are questions that domain knowledge suggests and that data analysis can help us answer. To determine whether this is the case, we might be interested in the full distribution of model residuals within each neighborhood.
To make this clearer, we'll first sort the data by the median residual in each neighborhood, and then make a boxplot (Fig. XXX2XXX) that shows the distribution of residuals within each neighborhood:
# Create column with residual values from m1
db["residual"] = m1.u
# Obtain the median value of residuals in each neighborhood
medians = (
db.groupby("neighborhood")
.residual.median()
.to_frame("hood_residual")
)
# Increase fontsize
seaborn.set(font_scale=1.25)
# Set up figure
f = plt.figure(figsize=(15, 3))
# Grab figure's axis
ax = plt.gca()
# Generate boxplot of residual values by neighborhood
# Note the data includes the median values merged on-the-fly
seaborn.boxplot(
    x="neighborhood",
    y="residual",
    ax=ax,
    data=db.merge(
        medians, how="left", left_on="neighborhood", right_index=True
    ).sort_values("hood_residual"),
    palette="bwr",
)
# Auto-format of the X labels
f.autofmt_xdate()
# Display
plt.show()
No neighborhood's residual distribution is entirely separate from the others, but some do appear to sit higher than others, such as the well-known downtown tourist neighborhoods of the Gaslamp Quarter, Little Italy, or The Core. Thus, there may be a distinctive effect of intangible neighborhood fashionableness that matters in this model.
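To put names to the extremes in the figure, we can peek at the sorted neighborhood medians computed above; this is a small optional sketch, and recall that positive residuals mean the model under-predicts while negative residuals mean it over-predicts:

# Neighborhoods where the model most over-predicts (most negative medians)
print(medians.sort_values("hood_residual").head())
# Neighborhoods where the model most under-predicts (most positive medians)
print(medians.sort_values("hood_residual").tail())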
Noting that many of the most over- and under-predicted neighborhoods are near one another in the city, it may also be the case that there is some sort of contagion or spatial spillovers in the nightly rent price. This often is apparent when individuals seek to price their Airbnb listings to compete with similar nearby listings. Since our model is not aware of this behavior, its errors may tend to cluster. One exceptionally simple way we can look into this structure is by examining the relationship between an observation's residuals and its surrounding residuals.
To do this, we will use spatial weights to represent the geographic relationships between observations. We cover spatial weights in detail in Chapter 4, so we will not repeat ourselves here. For this example, we'll start off with a $KNN$ matrix where $k=1$, meaning we're focusing only on the link between each Airbnb and its single closest listing.
knn = weights.KNN.from_dataframe(db, k=1)
This means that, when we compute the spatial lag of the residuals using that $KNN$ weight, we get the residual of the Airbnb listing closest to each observation.
lag_residual = weights.spatial_lag.lag_spatial(knn, m1.u)
ax = seaborn.regplot(
    x=m1.u.flatten(),
    y=lag_residual.flatten(),
    line_kws=dict(color="orangered"),
    ci=None,
)
ax.set_xlabel("Model Residuals - $u$")
ax.set_ylabel("Spatial Lag of Model Residuals - $W u$");
In Figure XXX3XXX, we see that our prediction errors tend to cluster! Above, we show the relationship between our prediction error at each site and the prediction error at the site nearest to it. Here, we're using this nearest site to stand in for the surroundings of that Airbnb. This means that, when the model tends to over-predict a given Airbnb's nightly log price, sites around that Airbnb are more likely to also be over-predicted. An interesting property of this relationship is that it tends to stabilize as the number of nearest neighbors used to construct each Airbnb's surroundings increases. Consult the Challenge section for more on this property.
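That stabilizing tendency can be checked directly with a short loop; the sketch below is optional and the particular values of $k$ are arbitrary, but it recomputes the correlation between the residuals and their spatial lag as the neighborhood used to define each listing's surroundings grows:

# Correlation between residuals and their spatial lag for a growing k
for k in (1, 5, 10, 20, 50):
    w_k = weights.KNN.from_dataframe(db, k=k)
    w_k.transform = "R"  # row-standardize so the lag is an average
    lag_k = weights.spatial_lag.lag_spatial(w_k, m1.u)
    corr = numpy.corrcoef(m1.u.flatten(), lag_k.flatten())[0, 1]
    print(f"k={k:2d}  corr(u, Wu) = {corr:.3f}")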
Given this behavior, let's use a stable number of neighbors, $k=20$. Examining the relationship between this surrounding average and the focal Airbnb, we can even find clusters in our model error. Recalling the local Moran statistics from Chapter 7, Figure XXX4XXX, generated by the code below, identifies areas where our predictions of the nightly (log) Airbnb price tend to be significantly off:
# Re-weight W to 20 nearest neighbors
knn.reweight(k=20, inplace=True)
# Row standardize weights
knn.transform = "R"
# Run LISA on residuals
outliers = esda.moran.Moran_Local(m1.u, knn, permutations=9999)
# Select only LISA cluster cores: the HH and LL quadrants
# (odd-numbered values of `outliers.q`)
error_clusters = outliers.q % 2 == 1
# Filter out non-significant clusters
error_clusters &= outliers.p_sim <= 0.001
# Add `error_clusters` and `local_I` columns
ax = (
db.assign(
error_clusters=error_clusters,
local_I=outliers.Is
# Retain error clusters only
)
.query(
"error_clusters"
# Sort by I value so the largest values plot on top
)
.sort_values(
"local_I"
# Plot I values
)
.plot("local_I", cmap="bwr", marker=".")
)
# Add basemap
contextily.add_basemap(ax, crs=db.crs)
# Remove axes
ax.set_axis_off();
Thus, these areas tend to be locations where our model significantly under-predicts the nightly Airbnb price both for that specific observation and observations in its immediate surroundings. This is critical since, if we can identify how these areas are structured — if they have a consistent geography that we can model — then we might make our predictions even better, or at least not systematically mis-predict prices in some areas while correctly predicting prices in other areas. Since significant under- and over-predictions do appear to cluster in a highly structured way, we might be able to use a better model to fix the geography of our model errors.
There are many different ways that spatial structure shows up in our models, predictions, and data, even if we do not explicitly intend to study it. Fortunately, there are nearly as many techniques, called spatial regression methods, that are designed to handle these sorts of structures. Spatial regression is about explicitly introducing space or geographical context into the statistical framework of a regression. Conceptually, we want to introduce space into our model whenever we think it plays an important role in the process we are interested in, or when space can act as a reasonable proxy for other factors that we should include in our model but cannot. As an example of the former, we can imagine how houses at the seafront are probably more expensive than those in the second row, given their better views. To illustrate the latter, we can think of how the character of a neighborhood is important in determining the price of a house; however, it is very hard to identify and quantify "character" per se, although it might be easier to get at its spatial variation, hence a case of space as a proxy.
Spatial regression is a large field of development in the econometrics and statistics literatures.
In this brief introduction, we will consider two related but very different processes that give rise to spatial effects: spatial heterogeneity and spatial dependence. Before diving into them, we begin with another approach that introduces space into a regression model without modifying the model itself, but rather by creating spatially explicit independent variables.
For more rigorous treatments of the topics introduced here, we refer you to {cite}`Anselin_2003,Anselin_2014,Gelman_2006`.
Using geographic information to "construct" new data is a common approach to bring spatial information into data analysis. Often, this reflects the fact that processes are not the same everywhere on the map, or that geographical information may be useful to predict our outcome of interest. In this section, we will briefly present how to insert spatial features, or $X$ variables that are constructed from geographical relationships, into a standard linear model. We discuss spatial feature engineering extensively in Chapter 12, and its depth and extent are difficult to overstate. Rather than going into detail here, this section will show how spatially explicit variables you engineer can be "plugged" into a model to improve its performance or help you explain the underlying process of interest with more accuracy.
One relevant proximity-driven variable that could influence our San Diego model is each listing's proximity to Balboa Park. A common tourist destination, Balboa Park is a central recreation hub for the city of San Diego, containing many museums and the San Diego Zoo. Thus, it could be the case that people searching for Airbnbs in San Diego are willing to pay a premium to live closer to the park. If this were true and we omitted this variable from our model, we may indeed see a significant spatial pattern caused by this distance decay effect.
This is therefore sometimes called a spatially patterned omitted covariate: geographic information that our model needs in order to make good predictions, but which we have left out. Let's build a new model containing this distance to Balboa Park covariate. First, though, it helps to visualize (Fig. XXX5XXX) the structure of this distance covariate itself:
ax = db.plot("d2balboa", marker=".", s=5)
contextily.add_basemap(ax, crs=db.crs)
ax.set_axis_off();
To run a linear model that includes the additional variable of distance to the park, we add the name to the list of variables we included originally:
balboa_names = variable_names + ["d2balboa"]
And then fit the model using the `OLS` class in Pysal's `spreg`:
m2 = spreg.OLS(
db[["log_price"]].values,
db[balboa_names].values,
name_y="log_price",
name_x=balboa_names,
)
When you inspect the regression diagnostics and output, you see that this covariate is not quite as helpful as we might anticipate:
pandas.DataFrame(
[[m1.r2, m1.ar2], [m2.r2, m2.ar2]],
index=["M1", "M2"],
columns=["R2", "Adj. R2"],
)
|    | R2       | Adj. R2  |
|----|----------|----------|
| M1 | 0.668345 | 0.667801 |
| M2 | 0.668502 | 0.667904 |
The new variable is not statistically significant at conventional significance levels, and the model fit does not substantially change:
# Set up table of regression coefficients
pandas.DataFrame(
{
# Pull out regression coefficients and
# flatten as they are returned as Nx1 array
"Coeff.": m2.betas.flatten(),
# Pull out and flatten standard errors
"Std. Error": m2.std_err.flatten(),
# Pull out P-values from t-stat object
"P-Value": [i[1] for i in m2.t_stat],
},
index=m2.name_x,
)
|                 | Coeff.    | Std. Error | P-Value       |
|-----------------|-----------|------------|---------------|
| CONSTANT        | 4.379624  | 0.016915   | 0.000000e+00  |
| accommodates    | 0.083644  | 0.005079   | 1.156896e-59  |
| bathrooms       | 0.190791  | 0.011005   | 9.120139e-66  |
| bedrooms        | 0.150746  | 0.011179   | 7.418035e-41  |
| beds            | -0.041476 | 0.006939   | 2.394322e-09  |
| rt_Private_room | -0.552996 | 0.015960   | 2.680270e-240 |
| rt_Shared_room  | -1.235521 | 0.038462   | 2.586867e-209 |
| pg_Condominium  | 0.140459  | 0.022225   | 2.803765e-10  |
| pg_House        | -0.013302 | 0.014623   | 3.630396e-01  |
| pg_Other        | 0.141176  | 0.022798   | 6.309880e-10  |
| pg_Townhouse    | -0.045784 | 0.034356   | 1.826992e-01  |
| d2balboa        | 0.001645  | 0.000967   | 8.902052e-02  |
And, there still appears to be spatial structure in our model's errors, as we can see in Figure XXX6XXX, generated by the code below:
lag_residual = weights.spatial_lag.lag_spatial(knn, m2.u)
ax = seaborn.regplot(
    x=m2.u.flatten(),
    y=lag_residual.flatten(),
    line_kws=dict(color="orangered"),
    ci=None,
)
ax.set_xlabel("Residuals ($u$)")
ax.set_ylabel("Spatial lag of residuals ($Wu$)");