In this notebook we demonstrate the pyJedAI approach on the well-known ABT-BUY dataset: Clean-Clean Entity Resolution (ER), i.e., link discovery/deduplication between two duplicate-free sets of entities.
pyJedAI is an open-source library that can be installed from PyPI.
!pip install pyjedai -U
!pip show pyjedai
Name: pyjedai
Version: 0.1.0
Summary: An open-source library that builds powerful end-to-end Entity Resolution workflows.
Home-page:
Author:
Author-email: Konstantinos Nikoletos <nikoletos.kon@gmail.com>, George Papadakis <gpapadis84@gmail.com>, Jakub Maciejewski <jacobb.maciejewski@gmail.com>, Manolis Koubarakis <koubarak@di.uoa.gr>
License: Apache Software License 2.0
Location: /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages
Requires: faiss-cpu, gensim, matplotlib, matplotlib-inline, networkx, nltk, numpy, optuna, ordered-set, pandas, pandas-profiling, pandocfilters, plotly, py-stringmatching, PyYAML, rdflib, rdfpandas, regex, scipy, seaborn, sentence-transformers, strsim, strsimpy, tomli, tqdm, transformers, valentine
Required-by:
Imports
import os
import sys
import pandas as pd
import networkx
from networkx import draw, Graph
from pyjedai.utils import (
text_cleaning_method,
print_clusters,
print_blocks,
print_candidate_pairs
)
from pyjedai.evaluation import Evaluation
from pyjedai.datamodel import Data
d1 = pd.read_csv("../data/ccer/D2/abt.csv", sep='|', engine='python')
d2 = pd.read_csv("../data/ccer/D2/buy.csv", sep='|', engine='python')
gt = pd.read_csv("../data/ccer/D2/gt.csv", sep='|', engine='python')
data = Data(
dataset_1=d1,
id_column_name_1='id',
dataset_2=d2,
id_column_name_2='id',
ground_truth=gt
)
[nltk_data] Downloading package stopwords to /home/jm/nltk_data... [nltk_data] Package stopwords is already up-to-date!
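For reference, the `Data` object above expects two DataFrames that each carry an id column, plus a ground-truth DataFrame with one row per true match. A minimal sketch with hypothetical toy records (not the actual ABT-BUY files, whose column layout may differ):

```python
import pandas as pd

# Hypothetical toy records standing in for abt.csv / buy.csv.
toy_d1 = pd.DataFrame({"id": [0, 1], "name": ["Sony TV 40in", "Apple iPod 8GB"]})
toy_d2 = pd.DataFrame({"id": [0, 1], "name": ["Sony 40-inch television", "iPod nano 8 GB"]})

# Ground truth: one row per true match (an id from D1 paired with an id from D2).
toy_gt = pd.DataFrame({"D1": [0, 1], "D2": [0, 1]})

print(toy_d1.shape, toy_d2.shape, len(toy_gt))
```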
from pyjedai.block_building import StandardBlocking
from pyjedai.block_cleaning import BlockFiltering
from pyjedai.block_cleaning import BlockPurging
from pyjedai.comparison_cleaning import WeightedEdgePruning, WeightedNodePruning, CardinalityEdgePruning, CardinalityNodePruning
from pyjedai.matching import EntityMatching
sb = StandardBlocking()
blocks = sb.build_blocks(data)
sb.evaluate(blocks, with_classification_report=True)
cbbp = BlockPurging()
blocks = cbbp.process(blocks, data, tqdm_disable=False)
cbbp.evaluate(blocks, with_classification_report=True)
bf = BlockFiltering(ratio=0.8)
blocks = bf.process(blocks, data, tqdm_disable=False)
cep = CardinalityEdgePruning(weighting_scheme='X2')
candidate_pairs_blocks = cep.process(blocks, data)
cep.evaluate(candidate_pairs_blocks, with_classification_report=True)
Standard Blocking: 0%| | 0/2152 [00:00<?, ?it/s]
***************************************************************************
                         Method: Standard Blocking
***************************************************************************
Method name: Standard Blocking
Parameters:
Runtime: 0.6493 seconds
───────────────────────────────────────────────────────────────────────────
Performance:
    Precision: 0.11%
    Recall:    99.81%
    F1-score:  0.23%
───────────────────────────────────────────────────────────────────────────
Classification report:
    True positives: 1074
    False positives: 948081
    True negatives: 1156698
    False negatives: 2
    Total comparisons: 949155
───────────────────────────────────────────────────────────────────────────
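Conceptually, Standard Blocking builds one block per distinct attribute token: every entity whose text contains that token joins the block, and only entities that co-occur in at least one block become comparison candidates. A minimal pure-Python sketch of the idea (not pyJedAI's implementation):

```python
from collections import defaultdict

def standard_blocking(entities):
    """Map each token to the set of entity ids whose text contains it."""
    blocks = defaultdict(set)
    for eid, text in entities.items():
        for token in text.lower().split():
            blocks[token].add(eid)
    return blocks

entities = {
    1: "Sony Bravia TV",
    2: "Sony Walkman",
    3: "Apple iPod",
}
blocks = standard_blocking(entities)
print(blocks["sony"])  # entities 1 and 2 share the token "sony"
```

This explains the very high recall but tiny precision reported above: sharing any single token is enough to make a pair a candidate.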
Block Purging: 0%| | 0/4264 [00:00<?, ?it/s]
***************************************************************************
                           Method: Block Purging
***************************************************************************
Method name: Block Purging
Parameters:
    Smoothing factor: 1.025
    Max Comparisons per Block: 13920.0
Runtime: 0.0899 seconds
───────────────────────────────────────────────────────────────────────────
Performance:
    Precision: 0.25%
    Recall:    99.81%
    F1-score:  0.49%
───────────────────────────────────────────────────────────────────────────
Classification report:
    True positives: 1074
    False positives: 436832
    True negatives: 1156698
    False negatives: 2
    Total comparisons: 437906
───────────────────────────────────────────────────────────────────────────
Block Filtering: 0%| | 0/3 [00:00<?, ?it/s]
Cardinality Edge Pruning: 0%| | 0/1076 [00:00<?, ?it/s]
***************************************************************************
                     Method: Cardinality Edge Pruning
***************************************************************************
Method name: Cardinality Edge Pruning
Parameters:
    Node centric: False
    Weighting scheme: X2
Runtime: 13.3849 seconds
───────────────────────────────────────────────────────────────────────────
Performance:
    Precision: 7.32%
    Recall:    93.77%
    F1-score:  13.58%
───────────────────────────────────────────────────────────────────────────
Classification report:
    True positives: 1009
    False positives: 12773
    True negatives: 1156633
    False negatives: 67
    Total comparisons: 13782
───────────────────────────────────────────────────────────────────────────
{'Precision %': 7.32114352053403, 'Recall %': 93.77323420074349, 'F1 %': 13.581908736034462, 'True Positives': 1009, 'False Positives': 12773, 'True Negatives': 1156633, 'False Negatives': 67}
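Cardinality Edge Pruning keeps only the globally top-weighted candidate pairs, up to a fixed budget, and discards everything else. A simplified sketch of that selection step, assuming the pair weights are already computed (pyJedAI derives them from the block structure, here with the X2 scheme):

```python
import heapq

def cardinality_edge_pruning(weighted_pairs, k):
    """Keep the k highest-weighted candidate pairs (id1, id2, weight)."""
    return heapq.nlargest(k, weighted_pairs, key=lambda p: p[2])

pairs = [("a1", "b1", 0.9), ("a1", "b2", 0.2), ("a2", "b1", 0.5), ("a2", "b2", 0.7)]
top = cardinality_edge_pruning(pairs, k=2)
print(top)  # the two strongest edges survive
```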
Available `tokenizer` options:

    tokenizer = { 'tfidf_char_1gram', ..., 'tfidf_char_6gram',
                  'tfidf_word_1gram', ..., 'tfidf_word_6gram',
                  'tf_char_1gram', ..., 'tf_char_6gram',
                  'tf_word_1gram', ..., 'tf_word_6gram',
                  'boolean_char_1gram', ..., 'boolean_char_6gram',
                  'boolean_word_1gram', ..., 'boolean_word_6gram' }

where:

- `tfidf` calculates TF-IDF over the whole dataset using `TfidfVectorizer`
- `tf` calculates TF over the whole dataset using `CountVectorizer`
- `boolean` transforms the TF-IDF matrix to boolean values

and the available metrics for vector similarity:

    metric = ['dice', 'jaccard', 'cosine']
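To illustrate what `metric='cosine'` with character 3-grams computes in the cell below, here is a simplified sketch using plain term-frequency character 3-gram vectors and cosine similarity (pyJedAI's `tfidf` vectorizer additionally applies IDF weighting over the whole dataset):

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Term-frequency vector of character n-grams."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(v1, v2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v1[g] * v2[g] for g in v1)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

sim_match = cosine(char_ngrams("Sony Bravia 40in TV"), char_ngrams("Sony Bravia 40-inch TV"))
sim_diff = cosine(char_ngrams("Sony Bravia 40in TV"), char_ngrams("Apple iPod nano"))
print(sim_match > sim_diff)  # near-duplicate strings score higher
```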
em = EntityMatching(metric='cosine',
tokenizer='char_tokenizer',
vectorizer='tfidf',
qgram=3,
similarity_threshold=0.0)
pairs_graph = em.predict(candidate_pairs_blocks, data)
em.evaluate(pairs_graph)
Entity Matching (cosine, char_tokenizer): 0%| | 0/1073 [00:00<?, ?it/s]
***************************************************************************
                          Method: Entity Matching
***************************************************************************
Method name: Entity Matching
Parameters:
    Metric: cosine
    Attributes: None
    Similarity threshold: 0.0
    Tokenizer: char_tokenizer
    Vectorizer: tfidf
    Qgrams: 3
Runtime: 2.3742 seconds
───────────────────────────────────────────────────────────────────────────
Performance:
    Precision: 7.32%
    Recall:    93.77%
    F1-score:  13.58%
───────────────────────────────────────────────────────────────────────────
{'Precision %': 7.32114352053403, 'Recall %': 93.77323420074349, 'F1 %': 13.581908736034462, 'True Positives': 1009, 'False Positives': 12773, 'True Negatives': 1156633, 'False Negatives': 67}
em.plot_distribution_of_all_weights()
em.plot_distribution_of_scores()
Distribution-% of predicted scores: [8.997242780438253, 34.20403424756929, 31.359744594398492, 14.576984472500362, 6.827746335800319, 2.7354520388913075, 0.9650268466115223, 0.3192570018865186, 0.014511681903932667, 0.0]
Unique Mapping Clustering takes as input the similarity graph produced by Entity Matching and partitions it into a set of equivalence clusters, with every cluster corresponding to a distinct real-world object.
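In the Clean-Clean setting each entity can match at most one entity from the other dataset, so the clustering reduces to a one-to-one matching. A simplified greedy sketch of the idea (strongest edges first, no entity reused; this illustrates the principle, not pyJedAI's exact implementation):

```python
def unique_mapping(weighted_pairs):
    """Greedy one-to-one matching: accept an edge only if both ends are free."""
    matched_left, matched_right, clusters = set(), set(), []
    for left, right, weight in sorted(weighted_pairs, key=lambda p: -p[2]):
        if left not in matched_left and right not in matched_right:
            matched_left.add(left)
            matched_right.add(right)
            clusters.append((left, right))
    return clusters

pairs = [("a1", "b1", 0.9), ("a1", "b2", 0.8), ("a2", "b2", 0.7)]
print(unique_mapping(pairs))  # [('a1', 'b1'), ('a2', 'b2')]
```

Discarding the weaker conflicting edges is what lifts precision from 7.32% after matching to 92.56% after clustering.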
from pyjedai.clustering import UniqueMappingClustering
umc = UniqueMappingClustering()
clusters = umc.process(pairs_graph, data)
umc.evaluate(clusters, with_classification_report=True)
***************************************************************************
                     Method: Unique Mapping Clustering
***************************************************************************
Method name: Unique Mapping Clustering
Parameters:
Runtime: 0.2460 seconds
───────────────────────────────────────────────────────────────────────────
Performance:
    Precision: 92.56%
    Recall:    87.92%
    F1-score:  90.18%
───────────────────────────────────────────────────────────────────────────
Classification report:
    True positives: 946
    False positives: 76
    True negatives: 1156570
    False negatives: 130
    Total comparisons: 1022
───────────────────────────────────────────────────────────────────────────
{'Precision %': 92.56360078277886, 'Recall %': 87.9182156133829, 'F1 %': 90.1811248808389, 'True Positives': 946, 'False Positives': 76, 'True Negatives': 1156570, 'False Negatives': 130}