Blocking is typically done to reduce the number of tuple pairs considered for matching. There are several blocking methods proposed. The py_entitymatching package supports a subset of such blocking methods (#ref to what is supported). One such supported blocker is attribute equivalence blocker. This IPython notebook illustrates how to perform blocking using attribute equivalence blocker.
First, we need to import py_entitymatching package and other libraries as follows:
%load_ext autotime
# Import py_entitymatching package
import py_entitymatching as em
import os
import pandas as pd
Then, read the input tablse from the datasets directory
# Get the datasets directory
datasets_dir = em.get_install_path() + os.sep + 'datasets'
# Get the paths of the input tables
path_A = datasets_dir + os.sep + 'person_table_A.csv'
path_B = datasets_dir + os.sep + 'person_table_B.csv'
# Read the CSV files and set 'ID' as the key attribute
A = em.read_csv_metadata(path_A, key='ID')
B = em.read_csv_metadata(path_B, key='ID')
A.head()
ID | name | birth_year | hourly_wage | address | zipcode | |
---|---|---|---|---|---|---|
0 | a1 | Kevin Smith | 1989 | 30.0 | 607 From St, San Francisco | 94107 |
1 | a2 | Michael Franklin | 1988 | 27.5 | 1652 Stockton St, San Francisco | 94122 |
2 | a3 | William Bridge | 1986 | 32.0 | 3131 Webster St, San Francisco | 94107 |
3 | a4 | Binto George | 1987 | 32.5 | 423 Powell St, San Francisco | 94122 |
4 | a5 | Alphonse Kemper | 1984 | 35.0 | 1702 Post Street, San Francisco | 94122 |
B.head()
ID | name | birth_year | hourly_wage | address | zipcode | |
---|---|---|---|---|---|---|
0 | b1 | Mark Levene | 1987 | 29.5 | 108 Clement St, San Francisco | 94107 |
1 | b2 | Bill Bridge | 1986 | 32.0 | 3131 Webster St, San Francisco | 94107 |
2 | b3 | Mike Franklin | 1988 | 27.5 | 1652 Stockton St, San Francisco | 94122 |
3 | b4 | Joseph Kuan | 1982 | 26.0 | 108 South Park, San Francisco | 94122 |
4 | b5 | Alfons Kemper | 1984 | 35.0 | 170 Post St, Apt 4, San Francisco | 94122 |
Once the tables are read, we can do blocking using attribute equivalence blocker.
There are three different ways to do attribute equivalence blocking:
# Instantiate attribute equivalence blocker object
ab = em.AttrEquivalenceBlocker()
For the given two tables, we will assume that two persons with different zipcode
values do not refer to the same real world person. So, we apply attribute equivalence blocking on zipcode
. That is, we block all the tuple pairs that have different zipcodes.
# Use block_tables to apply blocking over two input tables.
C1 = ab.block_tables(A, B,
l_block_attr='zipcode', r_block_attr='zipcode',
l_output_attrs=['name', 'birth_year', 'zipcode'],
r_output_attrs=['name', 'birth_year', 'zipcode'],
l_output_prefix='l_', r_output_prefix='r_')
# Display the candidate set of tuple pairs
C1.head()
_id | l_ID | r_ID | l_name | l_birth_year | l_zipcode | r_name | r_birth_year | r_zipcode | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | a1 | b1 | Kevin Smith | 1989 | 94107 | Mark Levene | 1987 | 94107 |
1 | 1 | a1 | b2 | Kevin Smith | 1989 | 94107 | Bill Bridge | 1986 | 94107 |
2 | 2 | a1 | b6 | Kevin Smith | 1989 | 94107 | Michael Brodie | 1987 | 94107 |
3 | 3 | a3 | b1 | William Bridge | 1986 | 94107 | Mark Levene | 1987 | 94107 |
4 | 4 | a3 | b2 | William Bridge | 1986 | 94107 | Bill Bridge | 1986 | 94107 |
Note that the tuple pairs in the candidate set have the same zipcode.
The attributes included in the candidate set are based on l_output_attrs and r_output_attrs mentioned in block_tables command (the key columns are included by default). Specifically, the list of attributes mentioned in l_output_attrs are picked from table A and the list of attributes mentioned in r_output_attrs are picked from table B. The attributes in the candidate set are prefixed based on l_output_prefix and r_ouptut_prefix parameter values mentioned in block_tables command.
# Show the metadata of C1
em.show_properties(C1)
id: 4565680872 fk_rtable: r_ID rtable(obj.id): 4565204272 fk_ltable: l_ID key: _id ltable(obj.id): 4565203432
id(A), id(B)
(4565203432, 4565204272)
Note that the metadata of C1 includes key, foreign key to the left and right tables (i.e A and B) and pointers to left and right tables.
If the input tuples have missing values in the blocking attribute, then they are ignored by default. This is because, including all possible tuple pairs with missing values can significantly increase the size of the candidate set. But if you want to include them, then you can set allow_missing
paramater to be True.
# Introduce some missing values
A1 = em.read_csv_metadata(path_A, key='ID')
A1.ix[0, 'zipcode'] = pd.np.NaN
A1.ix[0, 'birth_year'] = pd.np.NaN
A1
ID | name | birth_year | hourly_wage | address | zipcode | |
---|---|---|---|---|---|---|
0 | a1 | Kevin Smith | NaN | 30.0 | 607 From St, San Francisco | NaN |
1 | a2 | Michael Franklin | 1988.0 | 27.5 | 1652 Stockton St, San Francisco | 94122.0 |
2 | a3 | William Bridge | 1986.0 | 32.0 | 3131 Webster St, San Francisco | 94107.0 |
3 | a4 | Binto George | 1987.0 | 32.5 | 423 Powell St, San Francisco | 94122.0 |
4 | a5 | Alphonse Kemper | 1984.0 | 35.0 | 1702 Post Street, San Francisco | 94122.0 |
# Use block_tables to apply blocking over two input tables.
C2 = ab.block_tables(A1, B,
l_block_attr='zipcode', r_block_attr='zipcode',
l_output_attrs=['name', 'birth_year', 'zipcode'],
r_output_attrs=['name', 'birth_year', 'zipcode'],
l_output_prefix='l_', r_output_prefix='r_',
allow_missing=True) # setting allow_missing parameter to True
len(C1), len(C2)
(15, 18)
C2
_id | l_ID | r_ID | l_name | l_birth_year | l_zipcode | r_name | r_birth_year | r_zipcode | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | a2 | b3 | Michael Franklin | 1988.0 | 94122.0 | Mike Franklin | 1988 | 94122.0 |
1 | 1 | a2 | b4 | Michael Franklin | 1988.0 | 94122.0 | Joseph Kuan | 1982 | 94122.0 |
2 | 2 | a2 | b5 | Michael Franklin | 1988.0 | 94122.0 | Alfons Kemper | 1984 | 94122.0 |
3 | 3 | a4 | b3 | Binto George | 1987.0 | 94122.0 | Mike Franklin | 1988 | 94122.0 |
4 | 4 | a4 | b4 | Binto George | 1987.0 | 94122.0 | Joseph Kuan | 1982 | 94122.0 |
5 | 5 | a4 | b5 | Binto George | 1987.0 | 94122.0 | Alfons Kemper | 1984 | 94122.0 |
6 | 6 | a5 | b3 | Alphonse Kemper | 1984.0 | 94122.0 | Mike Franklin | 1988 | 94122.0 |
7 | 7 | a5 | b4 | Alphonse Kemper | 1984.0 | 94122.0 | Joseph Kuan | 1982 | 94122.0 |
8 | 8 | a5 | b5 | Alphonse Kemper | 1984.0 | 94122.0 | Alfons Kemper | 1984 | 94122.0 |
9 | 9 | a3 | b1 | William Bridge | 1986.0 | 94107.0 | Mark Levene | 1987 | 94107.0 |
10 | 10 | a3 | b2 | William Bridge | 1986.0 | 94107.0 | Bill Bridge | 1986 | 94107.0 |
11 | 11 | a3 | b6 | William Bridge | 1986.0 | 94107.0 | Michael Brodie | 1987 | 94107.0 |
12 | 12 | a1 | b1 | Kevin Smith | NaN | NaN | Mark Levene | 1987 | 94107.0 |
13 | 13 | a1 | b2 | Kevin Smith | NaN | NaN | Bill Bridge | 1986 | 94107.0 |
14 | 14 | a1 | b3 | Kevin Smith | NaN | NaN | Mike Franklin | 1988 | 94122.0 |
15 | 15 | a1 | b4 | Kevin Smith | NaN | NaN | Joseph Kuan | 1982 | 94122.0 |
16 | 16 | a1 | b5 | Kevin Smith | NaN | NaN | Alfons Kemper | 1984 | 94122.0 |
17 | 17 | a1 | b6 | Kevin Smith | NaN | NaN | Michael Brodie | 1987 | 94107.0 |
The candidate set C2 includes all possible tuple pairs with missing values.
In the above, we see that the candidate set produced after blocking over input tables include tuple pairs that have different birth years. We will assume that two persons with different birth years cannot refer to the same person. So, we block the candidate set of tuple pairs on birth_year
. That is, we block all the tuple pairs that have different birth years.
# Instantiate Attr. Equivalence Blocker
ab = em.AttrEquivalenceBlocker()
# Use block_tables to apply blocking over two input tables.
C3 = ab.block_candset(C1, l_block_attr='birth_year', r_block_attr='birth_year')
0% 100% [###############] | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00
time: 45.3 ms
Total time elapsed: 00:00:00
C3.head()
_id | l_ID | r_ID | l_name | l_birth_year | l_zipcode | r_name | r_birth_year | r_zipcode | |
---|---|---|---|---|---|---|---|---|---|
4 | 4 | a3 | b2 | William Bridge | 1986 | 94107 | Bill Bridge | 1986 | 94107 |
6 | 6 | a2 | b3 | Michael Franklin | 1988 | 94122 | Mike Franklin | 1988 | 94122 |
14 | 14 | a5 | b5 | Alphonse Kemper | 1984 | 94122 | Alfons Kemper | 1984 | 94122 |
Note that, the tuple pairs in the resulting candidate set have the same birth year.
The attributes included in the resulting candidate set are based on the input candidate set (i.e the same attributes are retained).
# Show the metadata of C1
em.show_properties(C3)
id: 4565765144 fk_rtable: r_ID rtable(obj.id): 4565204272 fk_ltable: l_ID key: _id ltable(obj.id): 4565203432
id(A), id(B)
(4565203432, 4565204272)
As we saw earlier the metadata of C3 includes the same metadata as C1. That is, it includes key, foreign key to the left and right tables (i.e A and B) and pointers to left and right tables.
If the tuple pairs included in the candidate set have missing values in the blocking attribute, then they are ignored by default. This is because, including all possible tuple pairs with missing values can significantly increase the size of the candidate set. But if you want to include them, then you can set allow_missing
paramater to be True.
# Display C2 (got by blocking over A1 and B)
C2
_id | l_ID | r_ID | l_name | l_birth_year | l_zipcode | r_name | r_birth_year | r_zipcode | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | a2 | b3 | Michael Franklin | 1988.0 | 94122.0 | Mike Franklin | 1988 | 94122.0 |
1 | 1 | a2 | b4 | Michael Franklin | 1988.0 | 94122.0 | Joseph Kuan | 1982 | 94122.0 |
2 | 2 | a2 | b5 | Michael Franklin | 1988.0 | 94122.0 | Alfons Kemper | 1984 | 94122.0 |
3 | 3 | a4 | b3 | Binto George | 1987.0 | 94122.0 | Mike Franklin | 1988 | 94122.0 |
4 | 4 | a4 | b4 | Binto George | 1987.0 | 94122.0 | Joseph Kuan | 1982 | 94122.0 |
5 | 5 | a4 | b5 | Binto George | 1987.0 | 94122.0 | Alfons Kemper | 1984 | 94122.0 |
6 | 6 | a5 | b3 | Alphonse Kemper | 1984.0 | 94122.0 | Mike Franklin | 1988 | 94122.0 |
7 | 7 | a5 | b4 | Alphonse Kemper | 1984.0 | 94122.0 | Joseph Kuan | 1982 | 94122.0 |
8 | 8 | a5 | b5 | Alphonse Kemper | 1984.0 | 94122.0 | Alfons Kemper | 1984 | 94122.0 |
9 | 9 | a3 | b1 | William Bridge | 1986.0 | 94107.0 | Mark Levene | 1987 | 94107.0 |
10 | 10 | a3 | b2 | William Bridge | 1986.0 | 94107.0 | Bill Bridge | 1986 | 94107.0 |
11 | 11 | a3 | b6 | William Bridge | 1986.0 | 94107.0 | Michael Brodie | 1987 | 94107.0 |
12 | 12 | a1 | b1 | Kevin Smith | NaN | NaN | Mark Levene | 1987 | 94107.0 |
13 | 13 | a1 | b2 | Kevin Smith | NaN | NaN | Bill Bridge | 1986 | 94107.0 |
14 | 14 | a1 | b3 | Kevin Smith | NaN | NaN | Mike Franklin | 1988 | 94122.0 |
15 | 15 | a1 | b4 | Kevin Smith | NaN | NaN | Joseph Kuan | 1982 | 94122.0 |
16 | 16 | a1 | b5 | Kevin Smith | NaN | NaN | Alfons Kemper | 1984 | 94122.0 |
17 | 17 | a1 | b6 | Kevin Smith | NaN | NaN | Michael Brodie | 1987 | 94107.0 |
em.show_properties(C2)
id: 4565567248 fk_rtable: r_ID rtable(obj.id): 4565204272 fk_ltable: l_ID key: _id ltable(obj.id): 4565680928
em.show_properties(A1)
id: 4565680928 key: ID
We see that A1
is the left table to C2
.
A1.head()
ID | name | birth_year | hourly_wage | address | zipcode | |
---|---|---|---|---|---|---|
0 | a1 | Kevin Smith | NaN | 30.0 | 607 From St, San Francisco | NaN |
1 | a2 | Michael Franklin | 1988.0 | 27.5 | 1652 Stockton St, San Francisco | 94122.0 |
2 | a3 | William Bridge | 1986.0 | 32.0 | 3131 Webster St, San Francisco | 94107.0 |
3 | a4 | Binto George | 1987.0 | 32.5 | 423 Powell St, San Francisco | 94122.0 |
4 | a5 | Alphonse Kemper | 1984.0 | 35.0 | 1702 Post Street, San Francisco | 94122.0 |
C4 = ab.block_candset(C2, l_block_attr='birth_year', r_block_attr='birth_year', allow_missing=False)
0% 100% [##################] | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 Total time elapsed: 00:00:00
C4
_id | l_ID | r_ID | l_name | l_birth_year | l_zipcode | r_name | r_birth_year | r_zipcode | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | a2 | b3 | Michael Franklin | 1988.0 | 94122.0 | Mike Franklin | 1988 | 94122.0 |
8 | 8 | a5 | b5 | Alphonse Kemper | 1984.0 | 94122.0 | Alfons Kemper | 1984 | 94122.0 |
10 | 10 | a3 | b2 | William Bridge | 1986.0 | 94107.0 | Bill Bridge | 1986 | 94107.0 |
# Set allow_missing to True
C5 = ab.block_candset(C2, l_block_attr='birth_year', r_block_attr='birth_year', allow_missing=True)
0% 100% [##################] | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 | ETA: 00:00:00 Total time elapsed: 00:00:00
len(C4), len(C5)
(3, 9)
C5
_id | l_ID | r_ID | l_name | l_birth_year | l_zipcode | r_name | r_birth_year | r_zipcode | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | a2 | b3 | Michael Franklin | 1988.0 | 94122.0 | Mike Franklin | 1988 | 94122.0 |
8 | 8 | a5 | b5 | Alphonse Kemper | 1984.0 | 94122.0 | Alfons Kemper | 1984 | 94122.0 |
10 | 10 | a3 | b2 | William Bridge | 1986.0 | 94107.0 | Bill Bridge | 1986 | 94107.0 |
12 | 12 | a1 | b1 | Kevin Smith | NaN | NaN | Mark Levene | 1987 | 94107.0 |
13 | 13 | a1 | b2 | Kevin Smith | NaN | NaN | Bill Bridge | 1986 | 94107.0 |
14 | 14 | a1 | b3 | Kevin Smith | NaN | NaN | Mike Franklin | 1988 | 94122.0 |
15 | 15 | a1 | b4 | Kevin Smith | NaN | NaN | Joseph Kuan | 1982 | 94122.0 |
16 | 16 | a1 | b5 | Kevin Smith | NaN | NaN | Alfons Kemper | 1984 | 94122.0 |
17 | 17 | a1 | b6 | Kevin Smith | NaN | NaN | Michael Brodie | 1987 | 94107.0 |
We can apply attribute equivalence blocking to a tuple pair to check if it is going to get blocked. For example, we can check if the first tuple from A and B will get blocked if we block on zipcode
.
# Display the first tuple from table A
A.ix[[0]]
ID | name | birth_year | hourly_wage | address | zipcode | |
---|---|---|---|---|---|---|
0 | a1 | Kevin Smith | 1989 | 30.0 | 607 From St, San Francisco | 94107 |
# Display the first tuple from table B
B.ix[[0]]
ID | name | birth_year | hourly_wage | address | zipcode | |
---|---|---|---|---|---|---|
0 | b1 | Mark Levene | 1987 | 29.5 | 108 Clement St, San Francisco | 94107 |
# Instantiate Attr. Equivalence Blocker
ab = em.AttrEquivalenceBlocker()
# Apply blocking to a tuple pair from the input tables on zipcode and get blocking status
status = ab.block_tuples(A.ix[0], B.ix[0], l_block_attr='zipcode', r_block_attr='zipcode')
# Print the blocking status
print(status)
False
The above result says that the tuple pair will not be blocked, i.e. this tuple pair will be included in the candidate set.
Currently, block_tuples command does not handle missing values