This IPython notebook illustrates how to select the best learning-based matcher. First, we need to import the py_entitymatching package and other libraries as follows:
# Import py_entitymatching package
import py_entitymatching as em
import os
import pandas as pd
# Set the seed value
seed = 0
# Get the datasets directory
datasets_dir = em.get_install_path() + os.sep + 'datasets'
path_A = datasets_dir + os.sep + 'dblp_demo.csv'
path_B = datasets_dir + os.sep + 'acm_demo.csv'
path_labeled_data = datasets_dir + os.sep + 'labeled_data_demo.csv'
A = em.read_csv_metadata(path_A, key='id')
B = em.read_csv_metadata(path_B, key='id')
# Load the pre-labeled data
S = em.read_csv_metadata(path_labeled_data,
key='_id',
ltable=A, rtable=B,
fk_ltable='ltable_id', fk_rtable='rtable_id')
Then, we split the labeled data into a development set (I) and an evaluation set (J). We use the development set to select the best learning-based matcher.
# Split S into I and J
IJ = em.split_train_test(S, train_proportion=0.5, random_state=0)
I = IJ['train']
J = IJ['test']
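The split above is essentially a reproducible shuffle-and-cut. A minimal pure-Python sketch of the idea behind em.split_train_test (not its actual implementation, which also propagates metadata):

```python
import random

def split_train_test(rows, train_proportion=0.5, random_state=0):
    # Shuffle reproducibly, then cut at the requested proportion.
    rng = random.Random(random_state)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_proportion)
    return {'train': shuffled[:cut], 'test': shuffled[cut:]}

parts = split_train_test(range(10), train_proportion=0.5)
# parts['train'] and parts['test'] each hold 5 of the 10 rows, disjointly
```

Using the same random_state yields the same split every run, which is what makes the experiments below repeatable.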
This typically involves the following steps: (1) creating a set of learning-based matchers, (2) generating features, (3) converting the development set into feature vectors, and (4) selecting the best matcher using k-fold cross-validation.
First, we need to create a set of learning-based matchers. The following matchers are supported in Magellan: (1) decision tree, (2) random forest, (3) naive Bayes, (4) SVM, (5) logistic regression, and (6) linear regression.
# Create a set of ML-matchers
dt = em.DTMatcher(name='DecisionTree', random_state=0)
svm = em.SVMMatcher(name='SVM', random_state=0)
rf = em.RFMatcher(name='RF', random_state=0)
lg = em.LogRegMatcher(name='LogReg', random_state=0)
ln = em.LinRegMatcher(name='LinReg')
Next, we need to create a set of features for the development set. Magellan provides a way to automatically generate features based on the attributes in the input tables. For the purposes of this guide, we use the automatically generated features.
# Generate a set of features
F = em.get_features_for_matching(A, B, validate_inferred_attr_types=False)
We can list the names of the generated features; we observe that 20 features were generated. (If desired, one could also restrict matching to a subset of these features, for example only the 'year'-related ones.)
F.feature_name
0                      id_id_lev_dist
1                       id_id_lev_sim
2                           id_id_jar
3                           id_id_jwn
4                           id_id_exm
5               id_id_jac_qgm_3_qgm_3
6         title_title_jac_qgm_3_qgm_3
7     title_title_cos_dlm_dc0_dlm_dc0
8                     title_title_mel
9                title_title_lev_dist
10                title_title_lev_sim
11    authors_authors_jac_qgm_3_qgm_3
12  authors_authors_cos_dlm_dc0_dlm_dc0
13                authors_authors_mel
14           authors_authors_lev_dist
15            authors_authors_lev_sim
16                      year_year_exm
17                      year_year_anm
18                 year_year_lev_dist
19                  year_year_lev_sim
Name: feature_name, dtype: object
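If you want to experiment with only a subset of the generated features (say, the 'year'-related ones), note that F is a pandas DataFrame and can be filtered on its feature_name column. A sketch using a hypothetical miniature of F (F_demo below is made up for illustration):

```python
import pandas as pd

# Hypothetical miniature of the generated feature table; the real F has a
# 'feature_name' column like the one printed above (plus function objects).
F_demo = pd.DataFrame({'feature_name': [
    'title_title_mel', 'year_year_exm', 'year_year_anm', 'year_year_lev_sim']})

# Keep only the 'year'-related features by name prefix
F_year = F_demo[F_demo['feature_name'].str.startswith('year_')]
print(list(F_year['feature_name']))
# ['year_year_exm', 'year_year_anm', 'year_year_lev_sim']
```

The same boolean-indexing pattern works on the real F, since em.get_features_for_matching returns a pandas DataFrame.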
In this step, we extract feature vectors using the development set and the created features.
# Convert the I into a set of feature vectors using F
H = em.extract_feature_vecs(I,
feature_table=F,
attrs_after='label',
show_progress=False)
# Display first few rows
H.head()
| | _id | ltable_id | rtable_id | id_id_lev_dist | id_id_lev_sim | id_id_jar | id_id_jwn | id_id_exm | id_id_jac_qgm_3_qgm_3 | title_title_jac_qgm_3_qgm_3 | ... | authors_authors_jac_qgm_3_qgm_3 | authors_authors_cos_dlm_dc0_dlm_dc0 | authors_authors_mel | authors_authors_lev_dist | authors_authors_lev_sim | year_year_exm | year_year_anm | year_year_lev_dist | year_year_lev_sim | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 430 | 430 | l1494 | r1257 | 4 | 0.20 | 0.466667 | 0.466667 | 0 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.445707 | 44.0 | 0.083333 | 1 | 1.0 | 0.0 | 1.0 | 0 |
| 35 | 35 | l1385 | r1160 | 4 | 0.20 | 0.466667 | 0.466667 | 0 | 0.000000 | 0.025641 | ... | 0.000000 | 0.000000 | 0.589417 | 43.0 | 0.271186 | 1 | 1.0 | 0.0 | 1.0 | 0 |
| 394 | 394 | l1345 | r85 | 4 | 0.20 | 0.000000 | 0.000000 | 0 | 0.090909 | 1.000000 | ... | 0.951111 | 0.945946 | 0.822080 | 172.0 | 0.338462 | 1 | 1.0 | 0.0 | 1.0 | 1 |
| 29 | 29 | l611 | r141 | 3 | 0.25 | 0.666667 | 0.666667 | 0 | 0.090909 | 0.049383 | ... | 0.000000 | 0.000000 | 0.531543 | 26.0 | 0.277778 | 1 | 1.0 | 0.0 | 1.0 | 0 |
| 181 | 181 | l1164 | r1161 | 2 | 0.60 | 0.733333 | 0.733333 | 0 | 0.076923 | 1.000000 | ... | 0.592593 | 0.668153 | 0.684700 | 34.0 | 0.244444 | 1 | 1.0 | 0.0 | 1.0 | 1 |
5 rows × 24 columns
# Check if the feature vectors contain missing values
# A return value of True means that there are missing values
H.isnull().values.any()
True
We observe that the extracted feature vectors contain missing values. We have to impute them for the learning-based matchers to fit their models correctly. For the purposes of this guide, we impute the missing values in each column with the mean of that column.
# Impute feature vectors with the mean of the column values.
H = em.impute_table(H,
exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],
strategy='mean')
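Conceptually, strategy='mean' replaces each NaN with the mean of its column. A minimal pandas sketch of that idea on a toy table (independent of py_entitymatching, which additionally preserves the table's metadata):

```python
import pandas as pd

# Toy feature-vector table with one missing value per column
df = pd.DataFrame({'a': [1.0, None, 3.0], 'b': [4.0, 5.0, None]})

# df.mean() skips NaNs, so each NaN is filled with its column's mean:
# 'a': NaN -> mean(1.0, 3.0) = 2.0; 'b': NaN -> mean(4.0, 5.0) = 4.5
imputed = df.fillna(df.mean())
```

Note that em.impute_table also lets you exclude key columns (here _id, ltable_id, rtable_id) and the label from imputation, since averaging identifiers or labels would be meaningless.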
Now, we select the best matcher using k-fold cross-validation. For the purposes of this guide, we use five-fold cross-validation and the 'f1' metric to select the best matcher.
# Select the best ML matcher using CV
result = em.select_matcher([dt, rf, svm, ln, lg], table=H,
exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],
k=5,
target_attr='label', metric_to_select_matcher='f1', random_state=0)
result['cv_stats']
| | Matcher | Average precision | Average recall | Average f1 |
|---|---|---|---|---|
| 0 | DecisionTree | 0.915322 | 0.950714 | 0.930980 |
| 1 | RF | 1.000000 | 0.950714 | 0.974131 |
| 2 | SVM | 0.977778 | 0.810632 | 0.883248 |
| 3 | LinReg | 1.000000 | 0.935330 | 0.966131 |
| 4 | LogReg | 0.985714 | 0.935330 | 0.958724 |
result['drill_down_cv_stats']['precision']
| | Name | Matcher | Num folds | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Mean score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | DecisionTree | <py_entitymatching.matcher.dtmatcher.DTMatcher object at 0x10db02990> | 5 | 0.95 | 1.000000 | 0.764706 | 0.933333 | 0.928571 | 0.915322 |
| 1 | RF | <py_entitymatching.matcher.rfmatcher.RFMatcher object at 0x10db02310> | 5 | 1.00 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| 2 | SVM | <py_entitymatching.matcher.svmmatcher.SVMMatcher object at 0x10db02390> | 5 | 1.00 | 1.000000 | 0.888889 | 1.000000 | 1.000000 | 0.977778 |
| 3 | LinReg | <py_entitymatching.matcher.linregmatcher.LinRegMatcher object at 0x10db020d0> | 5 | 1.00 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| 4 | LogReg | <py_entitymatching.matcher.logregmatcher.LogRegMatcher object at 0x10db02210> | 5 | 1.00 | 0.928571 | 1.000000 | 1.000000 | 1.000000 | 0.985714 |
result['drill_down_cv_stats']['recall']
| | Name | Matcher | Num folds | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Mean score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | DecisionTree | <py_entitymatching.matcher.dtmatcher.DTMatcher object at 0x10db02990> | 5 | 0.95 | 1.000000 | 0.928571 | 0.8750 | 1.000000 | 0.950714 |
| 1 | RF | <py_entitymatching.matcher.rfmatcher.RFMatcher object at 0x10db02310> | 5 | 0.95 | 1.000000 | 0.928571 | 0.8750 | 1.000000 | 0.950714 |
| 2 | SVM | <py_entitymatching.matcher.svmmatcher.SVMMatcher object at 0x10db02390> | 5 | 0.90 | 0.923077 | 0.571429 | 0.8125 | 0.846154 | 0.810632 |
| 3 | LinReg | <py_entitymatching.matcher.linregmatcher.LinRegMatcher object at 0x10db020d0> | 5 | 0.95 | 1.000000 | 0.928571 | 0.8750 | 0.923077 | 0.935330 |
| 4 | LogReg | <py_entitymatching.matcher.logregmatcher.LogRegMatcher object at 0x10db02210> | 5 | 0.95 | 1.000000 | 0.928571 | 0.8750 | 0.923077 | 0.935330 |
result['drill_down_cv_stats']['f1']
| | Name | Matcher | Num folds | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Mean score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | DecisionTree | <py_entitymatching.matcher.dtmatcher.DTMatcher object at 0x10db02990> | 5 | 0.950000 | 1.000000 | 0.838710 | 0.903226 | 0.962963 | 0.930980 |
| 1 | RF | <py_entitymatching.matcher.rfmatcher.RFMatcher object at 0x10db02310> | 5 | 0.974359 | 1.000000 | 0.962963 | 0.933333 | 1.000000 | 0.974131 |
| 2 | SVM | <py_entitymatching.matcher.svmmatcher.SVMMatcher object at 0x10db02390> | 5 | 0.947368 | 0.960000 | 0.695652 | 0.896552 | 0.916667 | 0.883248 |
| 3 | LinReg | <py_entitymatching.matcher.linregmatcher.LinRegMatcher object at 0x10db020d0> | 5 | 0.974359 | 1.000000 | 0.962963 | 0.933333 | 0.960000 | 0.966131 |
| 4 | LogReg | <py_entitymatching.matcher.logregmatcher.LogRegMatcher object at 0x10db02210> | 5 | 0.974359 | 0.962963 | 0.962963 | 0.933333 | 0.960000 | 0.958724 |
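For intuition, k-fold cross-validation partitions the development set into k folds, trains on k−1 of them, scores on the held-out fold, and averages the per-fold scores, which is where the "Fold 1" through "Fold 5" columns above come from. A pure-Python sketch of just the fold bookkeeping (not Magellan's implementation, which delegates to scikit-learn):

```python
def kfold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k-fold CV over n rows."""
    idx = list(range(n))
    start = 0
    for i in range(k):
        # Earlier folds absorb the remainder when n is not divisible by k
        size = n // k + (1 if i < n % k else 0)
        test = idx[start:start + size]
        train = idx[:start] + idx[start + size:]
        yield train, test
        start += size

folds = list(kfold_indices(10, 5))
# 5 folds; each test fold holds 2 rows and the folds jointly cover all 10 rows
```

In practice the rows are shuffled before folding (as select_matcher's random_state controls); the sketch omits that for clarity.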
The best matcher (RF) still makes some mistakes (its average f1 is below 1), so we debug it to see what might be wrong. To do this, we first split the feature vectors into a training set P and a test set Q:
# Split H into P and Q
PQ = em.split_train_test(H, train_proportion=0.5, random_state=0)
P = PQ['train']
Q = PQ['test']
# Debug RF matcher using GUI
em.vis_debug_rf(rf, P, Q,
exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],
target_attr='label')
# Create a feature declaratively: Jaccard similarity on title + authors
sim = em.get_sim_funs_for_matching()
tok = em.get_tokenizers_for_matching()
feature_string = """jaccard(wspace((ltuple['title'] + ' ' + ltuple['authors']).lower()),
wspace((rtuple['title'] + ' ' + rtuple['authors']).lower()))"""
feature = em.get_feature_fn(feature_string, sim, tok)
# Add feature to F
em.add_feature(F, 'jac_ws_title_authors', feature)
True
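The declarative feature above computes Jaccard similarity over lowercased whitespace tokens. The computation it encodes can be sketched in plain Python (this helper is illustrative, not part of Magellan's sim/tok function set):

```python
def jaccard_ws(s1, s2):
    # Jaccard similarity over lowercased whitespace tokens: |A ∩ B| / |A ∪ B|
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    if not t1 and not t2:
        return 1.0  # convention: two empty strings are identical
    return len(t1 & t2) / len(t1 | t2)

score = jaccard_ws('Data Integration Systems', 'data integration')
# {'data', 'integration', 'systems'} vs {'data', 'integration'} -> 2/3
```

Concatenating title and authors before tokenizing (as the feature string does) lets a strong match on one attribute compensate for a weak match on the other.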
# Convert I into feature vectors using updated F
H = em.extract_feature_vecs(I,
feature_table=F,
attrs_after='label',
show_progress=False)
# Check whether the updated F improves the RF (Random Forest) matcher
result = em.select_matcher([rf], table=H,
exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],
k=5,
target_attr='label', metric_to_select_matcher='f1', random_state=0)
result['drill_down_cv_stats']['f1']
| | Name | Matcher | Num folds | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Mean score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | RF | <py_entitymatching.matcher.rfmatcher.RFMatcher object at 0x10db02310> | 5 | 0.974359 | 1.0 | 0.962963 | 0.933333 | 1.0 | 0.974131 |
# Select the best matcher again using CV
result = em.select_matcher([dt, rf, svm, ln, lg], table=H,
exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],
k=5,
target_attr='label', metric_to_select_matcher='f1', random_state=0)
result['cv_stats']
| | Matcher | Average precision | Average recall | Average f1 |
|---|---|---|---|---|
| 0 | DecisionTree | 1.000000 | 1.000000 | 1.000000 |
| 1 | RF | 1.000000 | 0.950714 | 0.974131 |
| 2 | SVM | 1.000000 | 0.837418 | 0.907995 |
| 3 | LinReg | 1.000000 | 0.970330 | 0.984593 |
| 4 | LogReg | 0.985714 | 0.935330 | 0.958724 |
result['drill_down_cv_stats']['f1']
| | Name | Matcher | Num folds | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Mean score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | DecisionTree | <py_entitymatching.matcher.dtmatcher.DTMatcher object at 0x10db02990> | 5 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| 1 | RF | <py_entitymatching.matcher.rfmatcher.RFMatcher object at 0x10db02310> | 5 | 0.974359 | 1.000000 | 0.962963 | 0.933333 | 1.000000 | 0.974131 |
| 2 | SVM | <py_entitymatching.matcher.svmmatcher.SVMMatcher object at 0x10db02390> | 5 | 0.947368 | 0.960000 | 0.782609 | 0.933333 | 0.916667 | 0.907995 |
| 3 | LinReg | <py_entitymatching.matcher.linregmatcher.LinRegMatcher object at 0x10db020d0> | 5 | 1.000000 | 1.000000 | 0.962963 | 1.000000 | 0.960000 | 0.984593 |
| 4 | LogReg | <py_entitymatching.matcher.logregmatcher.LogRegMatcher object at 0x10db02210> | 5 | 0.974359 | 0.962963 | 0.962963 | 0.933333 | 0.960000 | 0.958724 |