This IPython notebook illustrates how to perform blocking using a black box blocker.
First, we need to import the py_entitymatching package and other libraries as follows:
# Import py_entitymatching package
import py_entitymatching as em
import os
import pandas as pd
Then, read the (sample) input tables for blocking purposes.
# Get the datasets directory
datasets_dir = em.get_install_path() + os.sep + 'datasets'
# Get the paths of the input tables
path_A = datasets_dir + os.sep + 'person_table_A.csv'
path_B = datasets_dir + os.sep + 'person_table_B.csv'
# Read the CSV files and set 'ID' as the key attribute
A = em.read_csv_metadata(path_A, key='ID')
B = em.read_csv_metadata(path_B, key='ID')
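There are three different ways to do black box blocking:
1. Block two tables to produce a candidate set of tuple pairs.
2. Block a candidate set of tuple pairs to typically produce a reduced candidate set of tuple pairs.
3. Block two tuples to check whether the tuple pair would get blocked.
To block two tables, first define a black box function. It takes two tuples and returns True if the pair should be blocked (dropped), and False otherwise.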
def address_address_function(x, y):
    # x and y are pandas Series, one per tuple
    # Get the address attribute
    x_address = x['address']
    y_address = y['address']
    # Get the city: the last comma-separated component of the address
    x_split, y_split = x_address.split(','), y_address.split(',')
    x_city = x_split[-1]
    y_city = y_split[-1]
    # Block the pair (return True) if the cities do not match
    return x_city != y_city
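Before handing the function to a blocker, its contract can be checked in isolation: True for a pair that should be dropped, False for a pair that should be kept. The sketch below assumes plain dicts as stand-ins for the pandas Series rows (both support `row['address']` lookups). The two San Francisco addresses come from the sample output in this notebook; the Seattle record is made up for illustration.

```python
def address_address_function(x, y):
    # Same logic as above: compare the last comma-separated address component
    x_city = x['address'].split(',')[-1]
    y_city = y['address'].split(',')[-1]
    # True means "block (drop) this pair"
    return x_city != y_city

# Dicts stand in for pandas Series here (an assumption for this sketch)
row_sf_1 = {'address': '607 From St, San Francisco'}
row_sf_2 = {'address': '108 Clement St, San Francisco'}
row_other = {'address': '12 Example Ave, Seattle'}  # made-up record

same_city = address_address_function(row_sf_1, row_sf_2)
different_city = address_address_function(row_sf_1, row_other)
print(same_city, different_city)
```

A pair whose cities match is kept (False); a pair whose cities differ is dropped (True).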
# Instantiate blackbox blocker
bb = em.BlackBoxBlocker()
# Set the black box function
bb.set_black_box_function(address_address_function)
C = bb.block_tables(A, B, l_output_attrs=['name', 'address'], r_output_attrs=['name', 'address'])
C
| | _id | ltable_ID | rtable_ID | ltable_name | ltable_address | rtable_name | rtable_address |
---|---|---|---|---|---|---|---|
0 | 0 | a1 | b1 | Kevin Smith | 607 From St, San Francisco | Mark Levene | 108 Clement St, San Francisco |
1 | 1 | a1 | b2 | Kevin Smith | 607 From St, San Francisco | Bill Bridge | 3131 Webster St, San Francisco |
2 | 2 | a1 | b3 | Kevin Smith | 607 From St, San Francisco | Mike Franklin | 1652 Stockton St, San Francisco |
3 | 3 | a1 | b4 | Kevin Smith | 607 From St, San Francisco | Joseph Kuan | 108 South Park, San Francisco |
4 | 4 | a1 | b6 | Kevin Smith | 607 From St, San Francisco | Michael Brodie | 133 Clement Street, San Francisco |
5 | 5 | a2 | b1 | Michael Franklin | 1652 Stockton St, San Francisco | Mark Levene | 108 Clement St, San Francisco |
6 | 6 | a2 | b2 | Michael Franklin | 1652 Stockton St, San Francisco | Bill Bridge | 3131 Webster St, San Francisco |
7 | 7 | a2 | b3 | Michael Franklin | 1652 Stockton St, San Francisco | Mike Franklin | 1652 Stockton St, San Francisco |
8 | 8 | a2 | b4 | Michael Franklin | 1652 Stockton St, San Francisco | Joseph Kuan | 108 South Park, San Francisco |
9 | 9 | a2 | b6 | Michael Franklin | 1652 Stockton St, San Francisco | Michael Brodie | 133 Clement Street, San Francisco |
10 | 10 | a3 | b1 | William Bridge | 3131 Webster St, San Francisco | Mark Levene | 108 Clement St, San Francisco |
11 | 11 | a3 | b2 | William Bridge | 3131 Webster St, San Francisco | Bill Bridge | 3131 Webster St, San Francisco |
12 | 12 | a3 | b3 | William Bridge | 3131 Webster St, San Francisco | Mike Franklin | 1652 Stockton St, San Francisco |
13 | 13 | a3 | b4 | William Bridge | 3131 Webster St, San Francisco | Joseph Kuan | 108 South Park, San Francisco |
14 | 14 | a3 | b6 | William Bridge | 3131 Webster St, San Francisco | Michael Brodie | 133 Clement Street, San Francisco |
15 | 15 | a4 | b1 | Binto George | 423 Powell St, San Francisco | Mark Levene | 108 Clement St, San Francisco |
16 | 16 | a4 | b2 | Binto George | 423 Powell St, San Francisco | Bill Bridge | 3131 Webster St, San Francisco |
17 | 17 | a4 | b3 | Binto George | 423 Powell St, San Francisco | Mike Franklin | 1652 Stockton St, San Francisco |
18 | 18 | a4 | b4 | Binto George | 423 Powell St, San Francisco | Joseph Kuan | 108 South Park, San Francisco |
19 | 19 | a4 | b6 | Binto George | 423 Powell St, San Francisco | Michael Brodie | 133 Clement Street, San Francisco |
20 | 20 | a5 | b1 | Alphonse Kemper | 1702 Post Street, San Francisco | Mark Levene | 108 Clement St, San Francisco |
21 | 21 | a5 | b2 | Alphonse Kemper | 1702 Post Street, San Francisco | Bill Bridge | 3131 Webster St, San Francisco |
22 | 22 | a5 | b3 | Alphonse Kemper | 1702 Post Street, San Francisco | Mike Franklin | 1652 Stockton St, San Francisco |
23 | 23 | a5 | b4 | Alphonse Kemper | 1702 Post Street, San Francisco | Joseph Kuan | 108 South Park, San Francisco |
24 | 24 | a5 | b6 | Alphonse Kemper | 1702 Post Street, San Francisco | Michael Brodie | 133 Clement Street, San Francisco |
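Conceptually, blocking two tables with a black box function amounts to keeping every pair from the cross product for which the function returns False. The following is only a minimal sketch of that idea, not the library's implementation: `block_tables_sketch` and `city_differs` are hypothetical names, the tables are plain lists of dicts, and the Oakland rows are invented.

```python
from itertools import product

def block_tables_sketch(table_a, table_b, black_box_function):
    """Keep every (a, b) ID pair for which the black box function returns False."""
    return [(a['ID'], b['ID'])
            for a, b in product(table_a, table_b)
            if not black_box_function(a, b)]

def city_differs(x, y):
    # True (drop the pair) when the last address components differ
    return x['address'].split(',')[-1] != y['address'].split(',')[-1]

table_a = [
    {'ID': 'a1', 'address': '607 From St, San Francisco'},
    {'ID': 'a9', 'address': '1 Hypothetical Rd, Oakland'},  # made-up row
]
table_b = [
    {'ID': 'b1', 'address': '108 Clement St, San Francisco'},
    {'ID': 'b9', 'address': '2 Hypothetical Rd, Oakland'},  # made-up row
]

candidates = block_tables_sketch(table_a, table_b, city_differs)
print(candidates)
```

Only the pairs whose cities agree survive, which mirrors why every row in the candidate set shown here has San Francisco on both sides.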
To block a candidate set of tuple pairs, first define a black box function. This one blocks a pair when the last names differ:
def name_name_function(x, y):
    # x and y are pandas Series, one per tuple
    # Get the name attribute
    x_name = x['name']
    y_name = y['name']
    # Get the last names (the second token, assuming 'First Last' names)
    x_last = x_name.split(' ')[1]
    y_last = y_name.split(' ')[1]
    # Block the pair (return True) if the last names do not match
    return x_last != y_last
# Instantiate blackbox blocker
bb = em.BlackBoxBlocker()
# Set the black box function
bb.set_black_box_function(name_name_function)
D = bb.block_candset(C)
D
| | _id | ltable_ID | rtable_ID | ltable_name | ltable_address | rtable_name | rtable_address |
---|---|---|---|---|---|---|---|
7 | 7 | a2 | b3 | Michael Franklin | 1652 Stockton St, San Francisco | Mike Franklin | 1652 Stockton St, San Francisco |
11 | 11 | a3 | b2 | William Bridge | 3131 Webster St, San Francisco | Bill Bridge | 3131 Webster St, San Francisco |
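Note that `split(' ')[1]` takes the second token of the name. That works for the two-word names in the sample tables, but it would raise an IndexError on a single-word name and pick a middle name from a three-word one. A possible, more defensive variant is sketched below, assuming that the last whitespace-separated token is the last name. The names `last_name` and `name_name_function_robust` are hypothetical, dicts stand in for pandas Series, and the final pair is made up.

```python
def last_name(name):
    # Last whitespace-separated token; empty string for a blank name
    tokens = name.split()
    return tokens[-1] if tokens else ''

def name_name_function_robust(x, y):
    # True means "drop the pair": the last names differ
    return last_name(x['name']) != last_name(y['name'])

pairs = [
    ({'name': 'Michael Franklin'}, {'name': 'Mike Franklin'}),
    ({'name': 'William Bridge'}, {'name': 'Bill Bridge'}),
    ({'name': 'Kevin Smith'}, {'name': 'Mark Levene'}),
    ({'name': 'Mary Ann Smith'}, {'name': 'Smith'}),  # made-up names
]

blocked = [name_name_function_robust(x, y) for x, y in pairs]
print(blocked)
```

The first two pairs match the rows that survive in the reduced candidate set; the third is dropped because the last names differ.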
To block two tuples, first define the black box function:
def address_address_function(x, y):
    # x and y are pandas Series, one per tuple
    # Get the address attribute
    x_address = x['address']
    y_address = y['address']
    # Get the city: the last comma-separated component of the address
    x_split, y_split = x_address.split(','), y_address.split(',')
    x_city = x_split[-1]
    y_city = y_split[-1]
    # Block the pair (return True) if the cities do not match
    return x_city != y_city
# Instantiate black box blocker
bb = em.BlackBoxBlocker()
# Set the blackbox function
bb.set_black_box_function(address_address_function)
A.iloc[[0]]
| | ID | name | birth_year | hourly_wage | address | zipcode |
---|---|---|---|---|---|---|
0 | a1 | Kevin Smith | 1989 | 30.0 | 607 From St, San Francisco | 94107 |
B.iloc[[0]]
| | ID | name | birth_year | hourly_wage | address | zipcode |
---|---|---|---|---|---|---|
0 | b1 | Mark Levene | 1987 | 29.5 | 108 Clement St, San Francisco | 94107 |
status = bb.block_tuples(A.iloc[0], B.iloc[0])
print(status)
False
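The status is False because both addresses end in San Francisco, so the pair is not blocked. The comparison is made on raw strings: `split(',')` keeps the space after the comma, and the match is case-sensitive, so small formatting differences would cause a pair to be dropped. A possible normalizing variant is sketched below; `city_of` and `address_address_function_normalized` are hypothetical names, dicts stand in for pandas Series, and the second address is a made-up formatting variation.

```python
def city_of(address):
    # Last comma-separated component, trimmed and lower-cased
    return address.split(',')[-1].strip().lower()

def address_address_function_normalized(x, y):
    # True means "drop the pair": the normalized cities differ
    return city_of(x['address']) != city_of(y['address'])

a = {'address': '607 From St, San Francisco'}
b = {'address': '108 Clement St,san francisco '}  # made-up formatting variation

# Raw comparison, as in the function above
raw_blocked = a['address'].split(',')[-1] != b['address'].split(',')[-1]
# Comparison after normalization
normalized_blocked = address_address_function_normalized(a, b)
print(raw_blocked, normalized_blocked)
```

Whether to normalize depends on how clean the address attribute is; on the sample tables shown here the raw comparison already suffices.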