This IPython notebook illustrates how to select the best learning-based matcher. First, we need to import the py_entitymatching package and other libraries as follows:
# Import py_entitymatching package
import py_entitymatching as em
import os
import pandas as pd
# Set the seed value
seed = 0
# Get the datasets directory
datasets_dir = em.get_install_path() + os.sep + 'datasets'
path_A = datasets_dir + os.sep + 'dblp_demo.csv'
path_B = datasets_dir + os.sep + 'acm_demo.csv'
path_labeled_data = datasets_dir + os.sep + 'labeled_data_demo.csv'
A = em.read_csv_metadata(path_A, key='id')
B = em.read_csv_metadata(path_B, key='id')
# Load the pre-labeled data
S = em.read_csv_metadata(path_labeled_data,
key='_id',
ltable=A, rtable=B,
fk_ltable='ltable_id', fk_rtable='rtable_id')
Then, we split the labeled data into a development set (I) and an evaluation set (J). We use the development set to select the best learning-based matcher.
# Split S into I and J
IJ = em.split_train_test(S, train_proportion=0.5, random_state=0)
I = IJ['train']
J = IJ['test']
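The split above is essentially a reproducible shuffle-and-cut. A minimal pure-Python sketch of the idea behind em.split_train_test (not its actual implementation, which also propagates metadata):

```python
import random

def split_train_test(rows, train_proportion=0.5, random_state=0):
    # Shuffle reproducibly, then cut at the requested proportion.
    rng = random.Random(random_state)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_proportion)
    return {'train': shuffled[:cut], 'test': shuffled[cut:]}

parts = split_train_test(range(10), train_proportion=0.5)
# parts['train'] and parts['test'] each hold 5 of the 10 rows, disjointly
```

Using the same random_state yields the same split every run, which is what makes the experiments below repeatable.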
This typically involves the following steps: (1) creating a set of learning-based matchers, (2) generating features, (3) converting the development set into feature vectors, and (4) selecting the best matcher using k-fold cross-validation.
First, we need to create a set of learning-based matchers. The following matchers are supported in Magellan: (1) decision tree, (2) random forest, (3) naive Bayes, (4) SVM, (5) logistic regression, and (6) linear regression.
# Create a set of ML-matchers
dt = em.DTMatcher(name='DecisionTree', random_state=0)
svm = em.SVMMatcher(name='SVM', random_state=0)
rf = em.RFMatcher(name='RF', random_state=0)
lg = em.LogRegMatcher(name='LogReg', random_state=0)
ln = em.LinRegMatcher(name='LinReg')
Next, we need to create a set of features for the development set. Magellan provides a way to automatically generate features based on the attributes in the input tables. For the purposes of this guide, we use the automatically generated features.
# Generate a set of features
F = em.get_features_for_matching(A, B, validate_inferred_attr_types=False)
We can list the names of the generated features; we observe that 20 features were generated. (If desired, one could also restrict matching to a subset of these features, for example only the 'year'-related ones.)
F.feature_name
0                      id_id_lev_dist
1                       id_id_lev_sim
2                           id_id_jar
3                           id_id_jwn
4                           id_id_exm
5               id_id_jac_qgm_3_qgm_3
6         title_title_jac_qgm_3_qgm_3
7     title_title_cos_dlm_dc0_dlm_dc0
8                     title_title_mel
9                title_title_lev_dist
10                title_title_lev_sim
11    authors_authors_jac_qgm_3_qgm_3
12  authors_authors_cos_dlm_dc0_dlm_dc0
13                authors_authors_mel
14           authors_authors_lev_dist
15            authors_authors_lev_sim
16                      year_year_exm
17                      year_year_anm
18                 year_year_lev_dist
19                  year_year_lev_sim
Name: feature_name, dtype: object
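If you want to experiment with only a subset of the generated features (say, the 'year'-related ones), note that F is a pandas DataFrame and can be filtered on its feature_name column. A sketch using a hypothetical miniature of F (F_demo below is made up for illustration):

```python
import pandas as pd

# Hypothetical miniature of the generated feature table; the real F has a
# 'feature_name' column like the one printed above (plus function objects).
F_demo = pd.DataFrame({'feature_name': [
    'title_title_mel', 'year_year_exm', 'year_year_anm', 'year_year_lev_sim']})

# Keep only the 'year'-related features by name prefix
F_year = F_demo[F_demo['feature_name'].str.startswith('year_')]
print(list(F_year['feature_name']))
# ['year_year_exm', 'year_year_anm', 'year_year_lev_sim']
```

The same boolean-indexing pattern works on the real F, since em.get_features_for_matching returns a pandas DataFrame.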
In this step, we extract feature vectors using the development set and the created features.
# Convert the I into a set of feature vectors using F
H = em.extract_feature_vecs(I,
feature_table=F,
attrs_after='label',
show_progress=False)
# Display first few rows
H.head()
| | _id | ltable_id | rtable_id | id_id_lev_dist | id_id_lev_sim | id_id_jar | id_id_jwn | id_id_exm | id_id_jac_qgm_3_qgm_3 | title_title_jac_qgm_3_qgm_3 | ... | authors_authors_jac_qgm_3_qgm_3 | authors_authors_cos_dlm_dc0_dlm_dc0 | authors_authors_mel | authors_authors_lev_dist | authors_authors_lev_sim | year_year_exm | year_year_anm | year_year_lev_dist | year_year_lev_sim | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 430 | 430 | l1494 | r1257 | 4 | 0.20 | 0.466667 | 0.466667 | 0 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.445707 | 44.0 | 0.083333 | 1 | 1.0 | 0.0 | 1.0 | 0 |
| 35 | 35 | l1385 | r1160 | 4 | 0.20 | 0.466667 | 0.466667 | 0 | 0.000000 | 0.025641 | ... | 0.000000 | 0.000000 | 0.589417 | 43.0 | 0.271186 | 1 | 1.0 | 0.0 | 1.0 | 0 |
| 394 | 394 | l1345 | r85 | 4 | 0.20 | 0.000000 | 0.000000 | 0 | 0.090909 | 1.000000 | ... | 0.951111 | 0.945946 | 0.822080 | 172.0 | 0.338462 | 1 | 1.0 | 0.0 | 1.0 | 1 |
| 29 | 29 | l611 | r141 | 3 | 0.25 | 0.666667 | 0.666667 | 0 | 0.090909 | 0.049383 | ... | 0.000000 | 0.000000 | 0.531543 | 26.0 | 0.277778 | 1 | 1.0 | 0.0 | 1.0 | 0 |
| 181 | 181 | l1164 | r1161 | 2 | 0.60 | 0.733333 | 0.733333 | 0 | 0.076923 | 1.000000 | ... | 0.592593 | 0.668153 | 0.684700 | 34.0 | 0.244444 | 1 | 1.0 | 0.0 | 1.0 | 1 |
5 rows × 24 columns
# Check if the feature vectors contain missing values
# A return value of True means that there are missing values
H.isnull().values.any()
True
We observe that the extracted feature vectors contain missing values. We have to impute them for the learning-based matchers to fit their models correctly. For the purposes of this guide, we impute the missing values in each column with the mean of that column.
# Impute feature vectors with the mean of the column values.
H = em.impute_table(H,
exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],
strategy='mean')
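Conceptually, strategy='mean' replaces each NaN with the mean of its column. A minimal pandas sketch of that idea on a toy table (independent of py_entitymatching, which additionally preserves the table's metadata):

```python
import pandas as pd

# Toy feature-vector table with one missing value per column
df = pd.DataFrame({'a': [1.0, None, 3.0], 'b': [4.0, 5.0, None]})

# df.mean() skips NaNs, so each NaN is filled with its column's mean:
# 'a': NaN -> mean(1.0, 3.0) = 2.0; 'b': NaN -> mean(4.0, 5.0) = 4.5
imputed = df.fillna(df.mean())
```

Note that em.impute_table also lets you exclude key columns (here _id, ltable_id, rtable_id) and the label from imputation, since averaging identifiers or labels would be meaningless.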
Now, we select the best matcher using k-fold cross-validation. For the purposes of this guide, we use five-fold cross-validation and the 'f1' metric to select the best matcher.
# Select the best ML matcher using CV
result = em.select_matcher([dt, rf, svm, ln, lg], table=H,
exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],
k=5,
target_attr='label', metric_to_select_matcher='f1', random_state=0)
result['cv_stats']
| | Matcher | Average precision | Average recall | Average f1 |
|---|---|---|---|---|
| 0 | DecisionTree | 0.915322 | 0.950714 | 0.930980 |
| 1 | RF | 1.000000 | 0.950714 | 0.974131 |
| 2 | SVM | 0.977778 | 0.810632 | 0.883248 |
| 3 | LinReg | 1.000000 | 0.935330 | 0.966131 |
| 4 | LogReg | 0.985714 | 0.935330 | 0.958724 |
result['drill_down_cv_stats']['precision']
| | Name | Matcher | Num folds | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Mean score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | DecisionTree | <py_entitymatching.matcher.dtmatcher.DTMatcher object at 0x10db02990> | 5 | 0.95 | 1.000000 | 0.764706 | 0.933333 | 0.928571 | 0.915322 |
| 1 | RF | <py_entitymatching.matcher.rfmatcher.RFMatcher object at 0x10db02310> | 5 | 1.00 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| 2 | SVM | <py_entitymatching.matcher.svmmatcher.SVMMatcher object at 0x10db02390> | 5 | 1.00 | 1.000000 | 0.888889 | 1.000000 | 1.000000 | 0.977778 |
| 3 | LinReg | <py_entitymatching.matcher.linregmatcher.LinRegMatcher object at 0x10db020d0> | 5 | 1.00 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| 4 | LogReg | <py_entitymatching.matcher.logregmatcher.LogRegMatcher object at 0x10db02210> | 5 | 1.00 | 0.928571 | 1.000000 | 1.000000 | 1.000000 | 0.985714 |
result['drill_down_cv_stats']['recall']
| | Name | Matcher | Num folds | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Mean score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | DecisionTree | <py_entitymatching.matcher.dtmatcher.DTMatcher object at 0x10db02990> | 5 | 0.95 | 1.000000 | 0.928571 | 0.8750 | 1.000000 | 0.950714 |
| 1 | RF | <py_entitymatching.matcher.rfmatcher.RFMatcher object at 0x10db02310> | 5 | 0.95 | 1.000000 | 0.928571 | 0.8750 | 1.000000 | 0.950714 |
| 2 | SVM | <py_entitymatching.matcher.svmmatcher.SVMMatcher object at 0x10db02390> | 5 | 0.90 | 0.923077 | 0.571429 | 0.8125 | 0.846154 | 0.810632 |
| 3 | LinReg | <py_entitymatching.matcher.linregmatcher.LinRegMatcher object at 0x10db020d0> | 5 | 0.95 | 1.000000 | 0.928571 | 0.8750 | 0.923077 | 0.935330 |
| 4 | LogReg | <py_entitymatching.matcher.logregmatcher.LogRegMatcher object at 0x10db02210> | 5 | 0.95 | 1.000000 | 0.928571 | 0.8750 | 0.923077 | 0.935330 |
result['drill_down_cv_stats']['f1']
| | Name | Matcher | Num folds | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Mean score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | DecisionTree | <py_entitymatching.matcher.dtmatcher.DTMatcher object at 0x10db02990> | 5 | 0.950000 | 1.000000 | 0.838710 | 0.903226 | 0.962963 | 0.930980 |
| 1 | RF | <py_entitymatching.matcher.rfmatcher.RFMatcher object at 0x10db02310> | 5 | 0.974359 | 1.000000 | 0.962963 | 0.933333 | 1.000000 | 0.974131 |
| 2 | SVM | <py_entitymatching.matcher.svmmatcher.SVMMatcher object at 0x10db02390> | 5 | 0.947368 | 0.960000 | 0.695652 | 0.896552 | 0.916667 | 0.883248 |
| 3 | LinReg | <py_entitymatching.matcher.linregmatcher.LinRegMatcher object at 0x10db020d0> | 5 | 0.974359 | 1.000000 | 0.962963 | 0.933333 | 0.960000 | 0.966131 |
| 4 | LogReg | <py_entitymatching.matcher.logregmatcher.LogRegMatcher object at 0x10db02210> | 5 | 0.974359 | 0.962963 | 0.962963 | 0.933333 | 0.960000 | 0.958724 |
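For intuition, k-fold cross-validation partitions the development set into k folds, trains on k−1 of them, scores on the held-out fold, and averages the per-fold scores, which is where the "Fold 1" through "Fold 5" columns above come from. A pure-Python sketch of just the fold bookkeeping (not Magellan's implementation, which delegates to scikit-learn):

```python
def kfold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k-fold CV over n rows."""
    idx = list(range(n))
    start = 0
    for i in range(k):
        # Earlier folds absorb the remainder when n is not divisible by k
        size = n // k + (1 if i < n % k else 0)
        test = idx[start:start + size]
        train = idx[:start] + idx[start + size:]
        yield train, test
        start += size

folds = list(kfold_indices(10, 5))
# 5 folds; each test fold holds 2 rows and the folds jointly cover all 10 rows
```

In practice the rows are shuffled before folding (as select_matcher's random_state controls); the sketch omits that for clarity.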
The best matcher (RF) still makes some mistakes (its average f1 is below 1), so we debug it to see what might be wrong. To do this, we first split the feature vectors into a training set P and a test set Q:
# Split H into P and Q
PQ = em.split_train_test(H, train_proportion=0.5, random_state=0)
P = PQ['train']
Q = PQ['test']
# Debug RF matcher using GUI
em.vis_debug_rf(rf, P, Q,
exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],
target_attr='label')
# Create a feature declaratively: Jaccard similarity on title + authors
sim = em.get_sim_funs_for_matching()
tok = em.get_tokenizers_for_matching()
feature_string = """jaccard(wspace((ltuple['title'] + ' ' + ltuple['authors']).lower()),
wspace((rtuple['title'] + ' ' + rtuple['authors']).lower()))"""
feature = em.get_feature_fn(feature_string, sim, tok)
# Add feature to F
em.add_feature(F, 'jac_ws_title_authors', feature)
True
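The declarative feature above computes Jaccard similarity over lowercased whitespace tokens. The computation it encodes can be sketched in plain Python (this helper is illustrative, not part of Magellan's sim/tok function set):

```python
def jaccard_ws(s1, s2):
    # Jaccard similarity over lowercased whitespace tokens: |A ∩ B| / |A ∪ B|
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    if not t1 and not t2:
        return 1.0  # convention: two empty strings are identical
    return len(t1 & t2) / len(t1 | t2)

score = jaccard_ws('Data Integration Systems', 'data integration')
# {'data', 'integration', 'systems'} vs {'data', 'integration'} -> 2/3
```

Concatenating title and authors before tokenizing (as the feature string does) lets a strong match on one attribute compensate for a weak match on the other.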
# Convert I into feature vectors using updated F
H = em.extract_feature_vecs(I,
feature_table=F,
attrs_after='label',
show_progress=False)
# Check whether the updated F improves the RF (Random Forest) matcher
result = em.select_matcher([rf], table=H,
exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],
k=5,
target_attr='label', metric_to_select_matcher='f1', random_state=0)
result['drill_down_cv_stats']['f1']
| | Name | Matcher | Num folds | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Mean score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | RF | <py_entitymatching.matcher.rfmatcher.RFMatcher object at 0x10db02310> | 5 | 0.974359 | 1.0 | 0.962963 | 0.933333 | 1.0 | 0.974131 |
# Select the best matcher again using CV
result = em.select_matcher([dt, rf, svm, ln, lg], table=H,
exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],
k=5,
target_attr='label', metric_to_select_matcher='f1', random_state=0)
result['cv_stats']
| | Matcher | Average precision | Average recall | Average f1 |
|---|---|---|---|---|
| 0 | DecisionTree | 1.000000 | 1.000000 | 1.000000 |
| 1 | RF | 1.000000 | 0.950714 | 0.974131 |
| 2 | SVM | 1.000000 | 0.837418 | 0.907995 |
| 3 | LinReg | 1.000000 | 0.970330 | 0.984593 |
| 4 | LogReg | 0.985714 | 0.935330 | 0.958724 |
result['drill_down_cv_stats']['f1']
| | Name | Matcher | Num folds | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Mean score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | DecisionTree | <py_entitymatching.matcher.dtmatcher.DTMatcher object at 0x10db02990> | 5 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| 1 | RF | <py_entitymatching.matcher.rfmatcher.RFMatcher object at 0x10db02310> | 5 | 0.974359 | 1.000000 | 0.962963 | 0.933333 | 1.000000 | 0.974131 |
| 2 | SVM | <py_entitymatching.matcher.svmmatcher.SVMMatcher object at 0x10db02390> | 5 | 0.947368 | 0.960000 | 0.782609 | 0.933333 | 0.916667 | 0.907995 |
| 3 | LinReg | <py_entitymatching.matcher.linregmatcher.LinRegMatcher object at 0x10db020d0> | 5 | 1.000000 | 1.000000 | 0.962963 | 1.000000 | 0.960000 | 0.984593 |
| 4 | LogReg | <py_entitymatching.matcher.logregmatcher.LogRegMatcher object at 0x10db02210> | 5 | 0.974359 | 0.962963 | 0.962963 | 0.933333 | 0.960000 | 0.958724 |