Introduction

This IPython notebook illustrates how to perform matching with an ML matcher. In particular, we show examples with a decision tree matcher, but the same principles apply to all of the other ML matchers.

In [2]:

# Import py_entitymatching package
import py_entitymatching as em
import os
import pandas as pd

Read the original tables and a set of labeled data into py_entitymatching.

In [3]:

# Get the datasets directory
datasets_dir = em.get_install_path() + os.sep + 'datasets'

path_A = datasets_dir + os.sep + 'dblp_demo.csv'
path_B = datasets_dir + os.sep + 'acm_demo.csv'
path_labeled_data = datasets_dir + os.sep + 'labeled_data_demo.csv'

In [4]:

A = em.read_csv_metadata(path_A, key='id')
B = em.read_csv_metadata(path_B, key='id')
# Load the pre-labeled data
S = em.read_csv_metadata(path_labeled_data, 
                         key='_id',
                         ltable=A, rtable=B, 
                         fk_ltable='ltable_id', fk_rtable='rtable_id')

Metadata file is not present in the given path; proceeding to read the csv file.
Metadata file is not present in the given path; proceeding to read the csv file.

Training the ML Matcher

Now we can train our ML matcher. In this notebook we demonstrate the process with a decision tree matcher. First, we split our labeled data into a training set and a test set. Then we extract feature vectors from the training set and train our decision tree with the fit command.

In [5]:

# Split S into I (training set) and J (test set)
IJ = em.split_train_test(S, train_proportion=0.5, random_state=0)
I = IJ['train']
J = IJ['test']
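
To make the roles of train_proportion and random_state concrete, here is a minimal plain-Python sketch of a seeded split. It is an illustration only and not how py_entitymatching implements split_train_test; the helper name seeded_split and the sample pairs are hypothetical.

```python
import random


def seeded_split(rows, train_proportion=0.5, random_state=0):
    """Shuffle row positions with a fixed seed, then cut them into two parts."""
    positions = list(range(len(rows)))
    # A dedicated Random instance keeps the shuffle reproducible
    random.Random(random_state).shuffle(positions)
    n_train = int(round(len(rows) * train_proportion))
    train = [rows[p] for p in positions[:n_train]]
    test = [rows[p] for p in positions[n_train:]]
    return {'train': train, 'test': test}


# Hypothetical labeled pairs standing in for S
pairs = ['pair_{0}'.format(i) for i in range(10)]
split = seeded_split(pairs, train_proportion=0.5, random_state=0)
print(len(split['train']), len(split['test']))
```

Because the seed is fixed, calling the helper again with the same arguments yields the same two parts, which is why fixing random_state makes a notebook like this one repeatable.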

In [6]:

# Generate a set of features
F = em.get_features_for_matching(A, B, validate_inferred_attr_types=False)

In [7]:

# Convert I into feature vectors using F
H = em.extract_feature_vecs(I, 
                            feature_table=F, 
                            attrs_after='label',
                            show_progress=False)

In [8]:

# Instantiate the matcher to evaluate.
dt = em.DTMatcher(name='DecisionTree', random_state=0)

In [9]:

# Train using feature vectors from I 
dt.fit(table=H, 
       exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'], 
       target_attr='label')
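
The exclude_attrs list matters because the key columns and the label are not features. The plain-Python sketch below shows the column selection this implies. The helper feature_columns is hypothetical, and the feature names are made up for illustration rather than taken from F.

```python
def feature_columns(all_columns, exclude_attrs):
    """Return the columns left for training, preserving their order."""
    excluded = set(exclude_attrs)
    return [c for c in all_columns if c not in excluded]


# Hypothetical column layout of a feature vector table such as H
columns = ['_id', 'ltable_id', 'rtable_id',
           'title_sim', 'authors_sim', 'year_exact', 'label']

used = feature_columns(columns,
                       exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'])
print(used)
```

Leaving 'label' out of the features is the important part: if the label were used as an input, the tree could simply copy it and the test results would be meaningless.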

Getting Predictions with the ML Matcher

Since we now have a trained decision tree, we can use our matcher to get predictions on the test set. Below, we show four different ways to get predictions with the predict command, each useful in a different context.

Getting a List of Predictions

First, we demonstrate how to get just a list of predictions using the predict command. This is the default method of getting predictions. As shown below, the resulting variable, predictions, is an array containing one prediction for each feature vector in the test set.

In [10]:

# Convert J into a set of feature vectors using F
L1 = em.extract_feature_vecs(J, feature_table=F,
                            attrs_after='label', show_progress=False)

# Predict on L1
predictions = dt.predict(table=L1, exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'])

# Show the predictions
predictions[0:10]

Out[10]:

array([0, 0, 0, 1, 1, 1, 0, 1, 0, 0])

Getting a List of Predictions and a List of Probabilities

Next, we demonstrate how to get both a list of predictions for the test set and a list of the associated probabilities. This is done by setting the 'return_probs' argument to True. Note that each probability shown is the probability of a match.

In [11]:

# Convert J into a set of feature vectors using F
L2 = em.extract_feature_vecs(J, feature_table=F,
                            attrs_after='label', show_progress=False)

# Predict on L2
predictions, probs = dt.predict(table=L2, exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'], return_probs=True)

# Show the predictions and probabilities
print('Predictions for first ten entries: {0}'.format(predictions[0:10]))
print('Probabilities of a match for first ten entries: {0}'.format(probs[0:10]))

Predictions for first ten entries: [0 0 0 1 1 1 0 1 0 0]
Probabilities of a match for first ten entries: [0. 0. 0. 1. 1. 1. 0. 1. 0. 0.]
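
The probabilities printed above are all exactly 0 or 1. A likely explanation, assuming the matcher wraps a scikit-learn decision tree grown without a depth limit, is that a tree's class probability is the fraction of training samples of that class in the leaf a vector falls into, and fully grown leaves tend to contain a single class. The sketch below illustrates that idea in plain Python; the helper names and the example leaves are hypothetical.

```python
def leaf_match_probability(leaf_labels):
    """Fraction of training pairs in a leaf that are labeled as matches (1)."""
    return sum(leaf_labels) / float(len(leaf_labels))


def predict_from_probability(prob_match):
    """Predict a match only when the match class is strictly more likely."""
    return 1 if prob_match > 0.5 else 0


# Hypothetical leaves: two pure leaves and one mixed leaf
leaves = {'pure_non_match': [0, 0, 0, 0],
          'pure_match': [1, 1, 1],
          'mixed': [1, 1, 1, 0]}

for name, labels in leaves.items():
    p = leaf_match_probability(labels)
    print(name, p, predict_from_probability(p))
```

Under this reading, intermediate probabilities would only appear for mixed leaves, for example when tree growth is restricted. Other ML matchers may produce a wider range of probabilities.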

Appending the Predictions to the Feature Vectors Table

Often, we want to keep the predictions alongside the feature vectors. Setting the 'append' argument to True makes predict return a copy of the feature vector table with the predictions appended. The 'target_attr' argument names the new predictions column. To append the probabilities as well, set 'return_probs' to True and name the new probabilities column with the 'probs_attr' argument.

In [12]:

# Convert J into a set of feature vectors using F
L3 = em.extract_feature_vecs(J, feature_table=F,
                            attrs_after='label', show_progress=False)

# Predict on L3
predictions = dt.predict(table=L3, exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'], 
                         target_attr='prediction', append=True,
                         return_probs=True, probs_attr='probability')

# Show the predictions and probabilities
predictions[['_id', 'ltable_id', 'rtable_id', 'label', 'prediction', 'probability']].head()

Out[12]:

	_id	ltable_id	rtable_id	label	prediction	probability
124	124	l1647	r366	0	0	0.0
54	54	l332	r1463	0	0	0.0
268	268	l1499	r1725	0	0	0.0
293	293	l759	r1749	1	1	1.0
230	230	l1580	r1711	1	1	1.0

Appending the Predictions to the Original Feature Vectors Table In-place

Lastly, we show how to append the predictions to the original feature vector dataframe. We accomplish this by setting the 'append' argument to True, naming the new column with the 'target_attr' argument, and setting the 'inplace' argument to True. Again, we can include the probabilities with the 'return_probs' and 'probs_attr' arguments. This appends the predictions and probabilities to the original feature vector dataframe, whereas the method used above creates a copy of the feature vectors and appends the predictions to that copy.

In [13]:

# Convert J into a set of feature vectors using F
L4 = em.extract_feature_vecs(J, feature_table=F,
                            attrs_after='label', show_progress=False)

# Predict on L4
dt.predict(table=L4, exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'], 
           target_attr='prediction', append=True,
           return_probs=True, probs_attr='probabilities',
           inplace=True)

# Show the predictions and probabilities
L4[['_id', 'ltable_id', 'rtable_id', 'label', 'prediction', 'probabilities']].head()

Out[13]:

	_id	ltable_id	rtable_id	label	prediction	probabilities
124	124	l1647	r366	0	0	0.0
54	54	l332	r1463	0	0	0.0
268	268	l1499	r1725	0	0	0.0
293	293	l759	r1749	1	1	1.0
230	230	l1580	r1711	1	1	1.0
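
The difference between the copy-based and in-place variants can be sketched in plain Python with a dict-of-lists table. This is only an analogy for the behavior described above, not the library's code; append_predictions and the sample table are hypothetical.

```python
import copy


def append_predictions(table, predictions, target_attr, inplace=False):
    """Add a predictions column to a dict-of-lists table.

    With inplace=False the caller's table is left untouched and a copy is
    returned; with inplace=True the caller's table itself is modified.
    """
    result = table if inplace else copy.deepcopy(table)
    result[target_attr] = list(predictions)
    return result


# Hypothetical stand-in for a feature vector table
vectors = {'_id': [124, 54, 268], 'label': [0, 0, 0]}

# Copy-based: the original table does not gain the new column
copied = append_predictions(vectors, [0, 0, 0], 'prediction')
print('prediction' in vectors, 'prediction' in copied)

# In-place: the original table itself gains the new column
append_predictions(vectors, [0, 0, 0], 'prediction', inplace=True)
print('prediction' in vectors)
```

The copy-based form is the safer choice when the same feature vectors are reused later, for example to compare several matchers, while the in-place form avoids duplicating a large table.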