Using U.S. Permanent Residents Data to Explain Housing Prices

by Yucheng Liu

A common claim made by many is that immigration pushes up housing prices. Moreover, sometimes these claims are targeted to immigrants with a specific origin. An example is that Chinese immigrants causes housing prices to rise.

In this project, I use annual country-specific Lawful Permanent Residents (PR) data to the United States and housing data to find evidence for/against these claims. I target to answer 2 questions:

  1. Do newly admitted Permanent Residents push up housing prices?
  2. Specifically, do Chinese immigrants affect housing prices differently than an average immigrant?

I use linear regression models to estimate correlations between appropriate variables. My findings are:

  1. Immigration is positively correlated with housing prices.
  2. Chinese immigrants does not affect prices differently than an average immigrant.

As a bonus thing that I got from the data, I estimated correlations of PR admission with prior years of housing prices. Interstingly, PR admission is positively correlated with housing prices 6 years ago and negatively correlated with housing prices 7 years ago.


  1. Zillow Home Value Index (ZHVI), 1996-2020
  2. U.S. Lawful Permanent Residents data, 2004-2018

Data Cleaning

Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import qeds
import geopandas as gpd
from pathlib import Path
folder = Path(r"C:\Users\YuchengLiu\Documents\323")
In [2]:
import json
from bokeh.plotting import figure
from import output_notebook, show, output_file
from bokeh.models import GeoJSONDataSource, LinearColorMapper, LogColorMapper
from bokeh.models import ColorBar, HoverTool, LogTicker
from bokeh.palettes import brewer
Loading BokehJS ...
In [3]:
import seaborn as sns
import sklearn
from sklearn import linear_model
import statsmodels.api as sm
from import add_constant

Import and Clean Data

The housing data is monthly. I turned it to yearly data by averaging.

In [4]:
years = range(2004,2019)

# import housing file
housing_file = folder / 'housing_by_metro.csv'
housing = pd.read_csv(housing_file)
metro_housing = housing['RegionName']

# take year average
housing=housing.T.reset_index().groupby(pd.Grouper(key='index',freq='Y')).mean().T = 'year'

log_housing = housing.apply(np.log)
In [5]:
def get_metro_map(dfa,metro_housing):
    '''search the housing file to match metro names'''
    metro_pr = dfa.index
    d = {}
    for metro in metro_pr:
        if metro == 'Total':
            d[metro]='United States'
        if (metro_housing==metro).sum() != 0:
            x = set()
            end = metro.find(',')
            cities = metro[:end].split('-')
            states = metro[end+1:].strip().split('-')
            for city in cities:
                for state in states:
                    if city != '':
                        a = metro_housing.str.contains(city)
                        b = metro_housing.str.contains(state)
                        s = a&b
                        values = metro_housing[s[s].index].values
                        x = x.union(metro_housing[s[s].index].values)
            if x:
                #print(len(x), end=' ')
                #if len(x)!=1:
                    #print(metro, x)
                d[metro] = x.pop() # empirically, unique 
    d['Louisville/Jefferson County, KY-IN']='Louisville-Jefferson County, KY'
    return d
In [6]:
def load_code_file():
    df = pd.read_csv(folder / 'cbsa_to_county.csv')
    df = df[['cbsacode','cbsatitle']][1:].drop_duplicates()
    df = df.set_index('cbsatitle')
    mapping = get_metro_map(df,metro_housing)
    df.index =
    return df

The PR data has a file for each year. I cleaned them individually and append them together.

In [7]:
# import PR file
def get_year(year, matched_names, common_names, country_level=False):
    '''load and process PR file for a single year'''
    pr_folder_name = folder / 'pr_tables'
    for file in pr_folder_name.iterdir():
    # extract the data
    header = 3 if year > 2012 else 4
    df = pd.read_excel(file, header=header)
    dfa= df.loc[10:] if country_level else df.loc[1:8]

    # remove footnotes
    dfa=dfa.set_index('Region and country of birth')
    # match metro names with housing data
    mapping = get_metro_map(dfa,metro_housing)
    matched_names = matched_names.union(mapping.values())
    common_names  = common_names.intersection(mapping.values())
    dfa.index =
    # merge with cbsa code
    dfa = (dfa.reset_index()
    dfa.loc['United States','cbsacode']=0 # put 0 for U.S.
    # add year
    # fix some names
    dfa=dfa.rename(columns={'Korea':'Korea, South', 'Antigua-Barbuda':'Antigua and Barbuda'})
    return dfa, matched_names, common_names

# append all the years together
pr = pd.DataFrame()
matched_names = set()
common_names  = set(metro_housing)
for year in range(2004,2019):
    dta, matched_names, common_names = get_year(year, matched_names, common_names, True)
    pr = pr.append(dta)
pr = pr.dropna(axis=1)
print('Done loading PR files!')
---------------Done loading PR files!

These states and cbsa's dataframs are for mapping.

In [8]:
def load_state_df():
    df = gpd.read_file('')
    df['cbsacode'] = 0.0
    return df[['NAME','cbsacode','geometry']]

In [9]:
def load_cbsa_df():
    df = gpd.read_file("")
    df['cbsacode'] = df['CBSAFP'].astype(float)
    return df[['cbsacode','geometry']]

Visualizing the PR data

First, I give a visualization of the Permanent Residence data.

In [10]:
def graph_pr_bokeh(df1,df2, year, paint_according_to='Total'):
    if paint_according_to in {'Total','China','Mexico','Russia'}:
        color_mapper = LogColorMapper(palette = brewer['YlOrRd'][9][::-1])
        color_mapper = LinearColorMapper(palette = brewer['YlOrRd'][9][::-1])
    color_bar = ColorBar(color_mapper=color_mapper, ticker=LogTicker(),
                         label_standoff=12, height = 20,
                         border_line_color=None, location = (0,0), orientation = 'horizontal')
    title=f"Persons Obtaining PR Status by Region and Country of Birth in {year}"
    if paint_according_to != 'Total':
        title += f', colored by persons from {paint_according_to}'
    p = figure(title=title, plot_width=750, plot_height=500, tools="pan,box_zoom,reset,previewsave")
    hover_off=p.patches("xs","ys",source=geojson1, fill_color='white', line_color='grey')
    hover_on =p.patches("xs","ys",source=geojson2, line_color='black',
                       fill_color = {'field' :paint_according_to, 'transform' : color_mapper})
    tooltips = [('Region','@index'), ('Total' ,'@Total'), ('Canada','@Canada'), 
                ('China' ,'@China'), ('Mexico','@Mexico'),]
    temp = (paint_according_to, '@'+paint_according_to)
    if temp not in tooltips:
    hover = HoverTool(tooltips = tooltips, renderers=[hover_on])
    #p.add_layout(color_bar, 'below')
def graph_pr(year=2018, paint_according_to='Total', bokeh=True):
    year2 = f"{year}-12-31"
    data = (pr[pr['year']==year2]
            .drop(index=['Urban Honolulu, HI'])
            .rename(columns={"China, People's Republic":'China'}))
    # First draw US, then draw the metros
     39:US Virgin Islands, 
     48:Commonwealth of the Northern Mariana Islands,
     49:American Samoa, 
     51:Puerto Rico, 
    df1 = state_df.drop(index=[27,39,47,48,49,51,55]).merge(data, on='cbsacode', how='left')
    df2 = cbsa_df.merge(data,on='cbsacode', how='inner')
    if bokeh:
        graph_pr_bokeh(df1, df2, year, paint_according_to)
        fig, gax = plt.subplots(figsize=(15, 10))
        df1.plot(ax=gax, alpha=0.3)
        df2.plot(ax=gax, edgecolor="black", color="white")
    #return df1,df2
graph_pr(2018, 'China')