Contents¶

Introduction
Block Using the Sorted Neighborhood Blocker
Sorted Neighborhood Blocker Limitations

Introduction¶

WARNING: The sorted neighborhood blocker is still experimental and has not been fully tested yet. Use this blocker at your own risk.

Blocking is typically done to reduce the number of tuple pairs considered for matching. There are several blocking methods proposed. The py_entitymatching package supports a subset of such blocking methods (#ref to what is supported). One such supported blocker is the sorted neighborhood blocker. This IPython notebook illustrates how to perform blocking using the sorted neighborhood blocker.

Note, often the sorted neighborhood blocking technique is used on a single table. In this case we have implemented sorted neighborhood blocking between two tables. We first enrich the tables with whether the table is the left table, or right table. Then we merge the tables. At this point we perform sorted neighborhood blocking, which is to pass a sliding window of window_size (default 2) across the merged dataset. Within the sliding window all tuple pairs that have one tuple from the left table and one tuple from the right table are returned.

First, we need to import py_entitymatching package and other libraries as follows:

In [2]:

# Import py_entitymatching package
import py_entitymatching as em
import os
import pandas as pd

Then, read the input tablse from the datasets directory

In [3]:

# Get the datasets directory
datasets_dir = em.get_install_path() + os.sep + 'datasets'

# Get the paths of the input tables
path_A = datasets_dir + os.sep + 'person_table_A.csv'
path_B = datasets_dir + os.sep + 'person_table_B.csv'

In [4]:

# Read the CSV files and set 'ID' as the key attribute
A = em.read_csv_metadata(path_A, key='ID')
B = em.read_csv_metadata(path_B, key='ID')

In [5]:

A.head()

Out[5]:

	ID	name	birth_year	hourly_wage	address	zipcode
0	a1	Kevin Smith	1989	30.0	607 From St, San Francisco	94107
1	a2	Michael Franklin	1988	27.5	1652 Stockton St, San Francisco	94122
2	a3	William Bridge	1986	32.0	3131 Webster St, San Francisco	94107
3	a4	Binto George	1987	32.5	423 Powell St, San Francisco	94122
4	a5	Alphonse Kemper	1984	35.0	1702 Post Street, San Francisco	94122

In [6]:

B.head()

Out[6]:

	ID	name	birth_year	hourly_wage	address	zipcode
0	b1	Mark Levene	1987	29.5	108 Clement St, San Francisco	94107
1	b2	Bill Bridge	1986	32.0	3131 Webster St, San Francisco	94107
2	b3	Mike Franklin	1988	27.5	1652 Stockton St, San Francisco	94122
3	b4	Joseph Kuan	1982	26.0	108 South Park, San Francisco	94122
4	b5	Alfons Kemper	1984	35.0	170 Post St, Apt 4, San Francisco	94122

Block Using the Sorted Neighborhood Blocker¶

Once the tables are read, we can do blocking using sorted neighborhood blocker.

With the sorted neighborhood blocker, you can only block between two tables to produce a candidate set of tuple pairs.

Block Tables to Produce a Candidate Set of Tuple Pairs¶

In [7]:

# Instantiate attribute equivalence blocker object
sn = em.SortedNeighborhoodBlocker()

WARNING: THIS IS AN EXPERIMENTAL COMMAND. THIS COMMAND IS NOT TESTED. USE AT YOUR OWN RISK.

For the given two tables, we will assume that two persons with different zipcode values do not refer to the same real world person. So, we apply attribute equivalence blocking on zipcode. That is, we block all the tuple pairs that have different zipcodes.

In [8]:

# Use block_tables to apply blocking over two input tables.
C1 = sn.block_tables(A, B, 
                    l_block_attr='birth_year', r_block_attr='birth_year', 
                    l_output_attrs=['name', 'birth_year', 'zipcode'],
                    r_output_attrs=['name', 'birth_year', 'zipcode'],
                    l_output_prefix='l_', r_output_prefix='r_', window_size=3)

WARNING: THIS IS AN EXPERIMENTAL COMMAND. THIS COMMAND IS NOT TESTED. USE AT YOUR OWN RISK.

In [9]:

# Display the candidate set of tuple pairs
C1.head()

Out[9]:

	_id	l_ID	r_ID	l_name	l_birth_year	l_zipcode	r_name	r_birth_year	r_zipcode
0	0	a5	b4	Alphonse Kemper	1984	94122	Joseph Kuan	1982	94122
1	1	a5	b5	Alphonse Kemper	1984	94122	Alfons Kemper	1984	94122
2	2	a3	b5	William Bridge	1986	94107	Alfons Kemper	1984	94122
3	3	a3	b2	William Bridge	1986	94107	Bill Bridge	1986	94107
4	4	a4	b2	Binto George	1987	94122	Bill Bridge	1986	94107

Note that the tuple pairs in the candidate set have the same zipcode.

The attributes included in the candidate set are based on l_output_attrs and r_output_attrs mentioned in block_tables command (the key columns are included by default). Specifically, the list of attributes mentioned in l_output_attrs are picked from table A and the list of attributes mentioned in r_output_attrs are picked from table B. The attributes in the candidate set are prefixed based on l_output_prefix and r_ouptut_prefix parameter values mentioned in block_tables command.

In [10]:

# Show the metadata of C1
em.show_properties(C1)

id: 139837734495736
key: _id
fk_ltable: l_ID
fk_rtable: r_ID
ltable(obj.id): 139837734692656
rtable(obj.id): 139837734692520

In [11]:

id(A), id(B)

Out[11]:

(139837734759952, 139837734846816)

Note that the metadata of C1 includes key, foreign key to the left and right tables (i.e A and B) and pointers to left and right tables.

Handling Missing Values¶

If the input tuples have missing values in the blocking attribute, then they are ignored by default. This is because, including all possible tuple pairs with missing values can significantly increase the size of the candidate set. But if you want to include them, then you can set allow_missing paramater to be True.

In [12]:

# Introduce some missing values
A1 = em.read_csv_metadata(path_A, key='ID')
A1.ix[0, 'zipcode'] = pd.np.NaN
A1.ix[0, 'birth_year'] = pd.np.NaN

/u/p/m/pmartinkus/Applications/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:3: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  This is separate from the ipykernel package so we can avoid doing imports until

In [13]:

A1

Out[13]:

	ID	name	birth_year	hourly_wage	address	zipcode
0	a1	Kevin Smith	NaN	30.0	607 From St, San Francisco	NaN
1	a2	Michael Franklin	1988.0	27.5	1652 Stockton St, San Francisco	94122.0
2	a3	William Bridge	1986.0	32.0	3131 Webster St, San Francisco	94107.0
3	a4	Binto George	1987.0	32.5	423 Powell St, San Francisco	94122.0
4	a5	Alphonse Kemper	1984.0	35.0	1702 Post Street, San Francisco	94122.0

In [14]:

# Use block_tables to apply blocking over two input tables.
C2 = sn.block_tables(A1, B, 
                    l_block_attr='zipcode', r_block_attr='zipcode', 
                    l_output_attrs=['name', 'birth_year', 'zipcode'],
                    r_output_attrs=['name', 'birth_year', 'zipcode'],
                    l_output_prefix='l_', r_output_prefix='r_', 
                    allow_missing=True) # setting allow_missing parameter to True

WARNING: THIS IS AN EXPERIMENTAL COMMAND. THIS COMMAND IS NOT TESTED. USE AT YOUR OWN RISK.

In [15]:

len(C1), len(C2)

Out[15]:

(11, 9)

In [16]:

C2

Out[16]:

	_id	l_ID	r_ID	l_name	l_birth_year	l_zipcode	r_name	r_birth_year	r_zipcode
0	0	a3	b1	William Bridge	1986.0	94107.0	Mark Levene	1987.0	94107.0
1	1	a1	b1	Kevin Smith	NaN	NaN	Mark Levene	1987.0	94107.0
2	2	a1	b2	Kevin Smith	NaN	NaN	Bill Bridge	1986.0	94107.0
3	3	a1	b6	Kevin Smith	NaN	NaN	Michael Brodie	1987.0	94107.0
4	4	a2	b6	Michael Franklin	1988.0	94122.0	Michael Brodie	1987.0	94107.0
5	5	a5	b3	Alphonse Kemper	1984.0	94122.0	Mike Franklin	1988.0	94122.0
6	6	a1	b3	Kevin Smith	NaN	NaN	Mike Franklin	1988.0	94122.0
7	7	a1	b4	Kevin Smith	NaN	NaN	Joseph Kuan	1982.0	94122.0
8	8	a1	b5	Kevin Smith	NaN	NaN	Alfons Kemper	1984.0	94122.0

The candidate set C2 includes all possible tuple pairs with missing values.

Window Size¶

A tunable parameter to the Sorted Neighborhood Blocker is the Window size. To perform the same result as above with a larger window size is via the window_size argument. Note that it has more results than C1.

In [17]:

C3 = sn.block_tables(A, B, 
                    l_block_attr='birth_year', r_block_attr='birth_year', 
                    l_output_attrs=['name', 'birth_year', 'zipcode'],
                    r_output_attrs=['name', 'birth_year', 'zipcode'],
                    l_output_prefix='l_', r_output_prefix='r_', window_size=5)

WARNING: THIS IS AN EXPERIMENTAL COMMAND. THIS COMMAND IS NOT TESTED. USE AT YOUR OWN RISK.

In [18]:

len(C1)

Out[18]:

In [19]:

len(C3)

Out[19]:

Stable Sort Order¶

One final challenge for the Sorted Neighborhood Blocker is making the sort order stable. If the column being sorted on has multiple identical keys, and those keys are longer than the window size, then different results may occur between runs. To always guarantee the same results for every run, make sure to make the sorting column unique. One method to do so is to append the id of the tuple onto the end of the sorting column. Here is an example.

In [20]:

A["birth_year_plus_id"]=A["birth_year"].map(str)+'-'+A["ID"].map(str)
B["birth_year_plus_id"]=B["birth_year"].map(str)+'-'+A["ID"].map(str)
C3 = sn.block_tables(A, B, 
                    l_block_attr='birth_year_plus_id', r_block_attr='birth_year_plus_id', 
                    l_output_attrs=['name', 'birth_year_plus_id', 'birth_year', 'zipcode'],
                    r_output_attrs=['name', 'birth_year_plus_id', 'birth_year', 'zipcode'],
                    l_output_prefix='l_', r_output_prefix='r_', window_size=5)

WARNING: THIS IS AN EXPERIMENTAL COMMAND. THIS COMMAND IS NOT TESTED. USE AT YOUR OWN RISK.

In [21]:

C3.head()

Out[21]:

	_id	l_ID	r_ID	l_name	l_birth_year_plus_id	l_birth_year	l_zipcode	r_name	r_birth_year_plus_id	r_birth_year	r_zipcode
0	0	a5	b4	Alphonse Kemper	1984-a5	1984	94122	Joseph Kuan	1982-a4	1982	94122
1	1	a5	b5	Alphonse Kemper	1984-a5	1984	94122	Alfons Kemper	1984-a5	1984	94122
2	2	a5	b2	Alphonse Kemper	1984-a5	1984	94122	Bill Bridge	1986-a2	1986	94107
3	3	a3	b4	William Bridge	1986-a3	1986	94107	Joseph Kuan	1982-a4	1982	94122
4	4	a3	b5	William Bridge	1986-a3	1986	94107	Alfons Kemper	1984-a5	1984	94122

Sorted Neighborhood Blocker limitations¶

Since the sorted neighborhood blocker requires position in sorted order, unlike other blockers, blocking on a candidate set or checking two tuples is not applicable. Attempts to call block_candset or block_tuples will raise an assertion.