In this notebook we apply the pyJedAI approach to the well-known ABT-BUY dataset. Clean-Clean ER is the link discovery/deduplication task between two sets of entities.
pyJedAI is an open-source library that can be installed from PyPI.
%pip install pyjedai -U
%pip show pyjedai
Imports
import os
import sys
import pandas as pd
import networkx
from networkx import draw, Graph
from pyjedai.utils import (
    text_cleaning_method,
    print_clusters,
    print_blocks,
    print_candidate_pairs
)
from pyjedai.evaluation import Evaluation
from pyjedai.datamodel import Data
d1 = pd.read_csv("../data/ccer/D2/abt.csv", sep='|', engine='python')
d2 = pd.read_csv("../data/ccer/D2/buy.csv", sep='|', engine='python')
gt = pd.read_csv("../data/ccer/D2/gt.csv", sep='|', engine='python')
data = Data(
    dataset_1=d1,
    id_column_name_1='id',
    dataset_2=d2,
    id_column_name_2='id',
    ground_truth=gt
)
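Before building blocks, it can be worth checking that every ground-truth pair points at ids that exist in both datasets. The sketch below does this with the standard library on tiny in-memory stand-ins for the pipe-separated files. The column names `name`, `D1` and `D2`, the helper names, and all row values are invented for illustration; they are not taken from the Abt-Buy files.

```python
import csv
import io

# Hypothetical pipe-separated stand-ins; the real files have more columns and rows.
ABT = "id|name\n0|Sony Turntable PSLX350H\n1|Bose Acoustimass 5\n"
BUY = "id|name\n0|Bose Acoustimass 5 Series III\n1|Sony PS-LX350H Turntable\n"
GT = "D1|D2\n0|1\n1|0\n"


def read_pipe_csv(text):
    """Parse pipe-separated text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="|"))


def dangling_pairs(gt_rows, ids_1, ids_2):
    """Return ground-truth pairs that reference an id missing from either dataset."""
    return [
        (row["D1"], row["D2"])
        for row in gt_rows
        if row["D1"] not in ids_1 or row["D2"] not in ids_2
    ]


d1_rows, d2_rows, gt_rows = (read_pipe_csv(t) for t in (ABT, BUY, GT))
ids_1 = {row["id"] for row in d1_rows}
ids_2 = {row["id"] for row in d2_rows}
missing = dangling_pairs(gt_rows, ids_1, ids_2)
print(f"{len(gt_rows)} ground-truth pairs, {len(missing)} dangling")
```

An empty `missing` list means the ground truth is consistent with both id columns.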
from pyjedai.block_building import StandardBlocking
from pyjedai.block_cleaning import BlockFiltering
from pyjedai.block_cleaning import BlockPurging
from pyjedai.comparison_cleaning import WeightedEdgePruning, WeightedNodePruning, CardinalityEdgePruning, CardinalityNodePruning
from pyjedai.matching import EntityMatching
sb = StandardBlocking()
blocks = sb.build_blocks(data)
sb.evaluate(blocks, with_classification_report=True)
cbbp = BlockPurging()
blocks = cbbp.process(blocks, data, tqdm_disable=False)
cbbp.evaluate(blocks, with_classification_report=True)
bf = BlockFiltering(ratio=0.8)
blocks = bf.process(blocks, data, tqdm_disable=False)
wep = CardinalityEdgePruning(weighting_scheme='X2')
candidate_pairs_blocks = wep.process(blocks, data)
wep.evaluate(candidate_pairs_blocks, with_classification_report=True)
Standard Blocking: 100%|██████████| 2152/2152 [00:00<00:00, 20775.30it/s]
***************************************************************************************************************************
Method: Standard Blocking
***************************************************************************************************************************
Method name: Standard Blocking
Parameters:
Runtime: 0.1049 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
    Precision: 0.08%
    Recall: 99.81%
    F1-score: 0.15%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Classification report:
    True positives: 1074
    False positives: 1406781
    True negatives: 1156698
    False negatives: 2
    Total comparisons: 1407855
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
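The very high recall and very low precision reported for Standard Blocking are what token-based blocking typically produces. As a rough illustration, the sketch below builds one block per token and keeps only blocks that contain entities from both datasets. It is a simplified sketch, not pyJedAI's implementation: whitespace tokenization of lowercased text, the function names, and the toy records are all assumptions.

```python
from collections import defaultdict


def token_blocks(dataset_1, dataset_2):
    """Build one block per token; keep blocks holding entities from both sides."""
    blocks = defaultdict(lambda: (set(), set()))
    for side, dataset in enumerate((dataset_1, dataset_2)):
        for entity_id, text in dataset.items():
            for token in set(text.lower().split()):
                blocks[token][side].add(entity_id)
    return {tok: sides for tok, sides in blocks.items() if sides[0] and sides[1]}


def total_comparisons(blocks):
    """Count cross-dataset comparisons, including pairs repeated across blocks."""
    return sum(len(left) * len(right) for left, right in blocks.values())


# Toy records (invented), keyed by entity id.
abt = {0: "Sony Turntable PSLX350H", 1: "Bose Acoustimass 5 Speakers"}
buy = {0: "Bose Acoustimass 5 Series III", 1: "Sony PS-LX350H Turntable"}

blocks = token_blocks(abt, buy)
print(sorted(blocks), total_comparisons(blocks))
```

Because a pair that shares several tokens is counted once per shared block, the comparison total can grow well beyond the number of distinct pairs, which is why the later cleaning steps matter.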
Block Purging: 100%|██████████| 4266/4266 [00:00<00:00, 462839.21it/s]
***************************************************************************************************************************
Method: Block Purging
***************************************************************************************************************************
Method name: Block Purging
Parameters:
    Smoothing factor: 1.025
    Max Comparisons per Block: 13920.0
Runtime: 0.0110 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
    Precision: 0.25%
    Recall: 99.81%
    F1-score: 0.49%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Classification report:
    True positives: 1074
    False positives: 436832
    True negatives: 1156698
    False negatives: 2
    Total comparisons: 437906
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
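The report above lists a "Max Comparisons per Block" limit, and the drop in total comparisons with unchanged recall suggests that only oversized blocks were removed. The sketch below shows that idea in isolation: blocks whose comparison count exceeds a given limit are discarded. How pyJedAI derives the limit from the smoothing factor is not covered here; the limit is simply passed in, and the function name, the `<=` boundary, and the toy blocks are assumptions.

```python
def purge_blocks(blocks, max_comparisons):
    """Keep only blocks whose cross-dataset comparison count is within the limit."""
    return {
        key: (left, right)
        for key, (left, right) in blocks.items()
        if len(left) * len(right) <= max_comparisons
    }


# Toy blocks (invented): token -> (ids from dataset 1, ids from dataset 2).
toy_blocks = {
    "sony": ({0, 3}, {1}),
    "the": ({0, 1, 2, 3}, {0, 1, 2, 3}),  # stop-word-like block, 16 comparisons
    "bose": ({1}, {0, 2}),
}

purged = purge_blocks(toy_blocks, max_comparisons=4)
print(sorted(purged))
```

Very frequent tokens tend to form the largest blocks and contribute mostly non-matching pairs, so removing them usually costs little recall.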
Block Filtering: 100%|██████████| 3/3 [00:00<00:00, 26.37it/s]
Cardinality Edge Pruning: 100%|██████████| 1076/1076 [00:02<00:00, 532.97it/s]
***************************************************************************************************************************
Method: Cardinality Edge Pruning
***************************************************************************************************************************
Method name: Cardinality Edge Pruning
Parameters:
    Node centric: False
    Weighting scheme: X2
Runtime: 2.0199 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
    Precision: 7.32%
    Recall: 93.77%
    F1-score: 13.58%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Classification report:
    True positives: 1009
    False positives: 12773
    True negatives: 1156633
    False negatives: 67
    Total comparisons: 13782
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
{'Precision %': 7.32114352053403, 'Recall %': 93.77323420074349, 'F1 %': 13.581908736034462, 'True Positives': 1009, 'False Positives': 12773, 'True Negatives': 1156633, 'False Negatives': 67}
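The percentages in the reports above follow from the confusion counts using the standard definitions of precision, recall and F1. The sketch below recomputes them from the true positives, false positives and false negatives printed for each step. The function name `pair_metrics` is illustrative, and this is not pyJedAI's own evaluation code.

```python
def pair_metrics(tp, fp, fn):
    """Return (precision, recall, F1) as percentages from confusion counts."""
    precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    recall = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# (true positives, false positives, false negatives) copied from the reports above.
stages = {
    "Standard Blocking": (1074, 1406781, 2),
    "Block Purging": (1074, 436832, 2),
    "Cardinality Edge Pruning": (1009, 12773, 67),
}

for name, counts in stages.items():
    p, r, f = pair_metrics(*counts)
    print(f"{name}: precision={p:.2f}% recall={r:.2f}% F1={f:.2f}%")
```

Read this way, each step trades a small amount of recall for a large reduction in false positives, which is what drives precision up.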
Available options with `metric='tf-idf'`:
tokenizer = { 'tfidf_char_1gram', ..., 'tfidf_char_6gram',
'tfidf_word_1gram', ..., 'tfidf_word_6gram',
'tf_char_1gram', ... , 'tf_char_6gram',
'tf_word_1gram', ... , 'tf_word_6gram',
'boolean_char_1gram', ... , 'boolean_char_6gram',
'boolean_word_1gram', ... , 'boolean_word_6gram' }
where:
- `tfidf` calculates tf-idf over the whole dataset using `TfidfVectorizer`
- `tf` calculates tf over the whole dataset using `CountVectorizer`
- `boolean` transforms the tf-idf matrix to boolean

and the available vector similarity metrics are:
metric = ['dice', 'jaccard', 'cosine']
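To make the three metric names concrete, the sketch below computes Dice, Jaccard and cosine similarity on character q-grams using only the standard library. It is a simplified illustration under stated assumptions: Dice and Jaccard are computed on q-gram sets, cosine is computed on raw q-gram counts (closest to the `tf` flavour, with no idf weighting), and the function names are not part of pyJedAI.

```python
import math
from collections import Counter


def char_qgrams(text, q=3):
    """Split lowercased text into overlapping character q-grams."""
    text = text.lower()
    return [text[i:i + q] for i in range(len(text) - q + 1)]


def jaccard(a, b):
    """Shared q-grams divided by all distinct q-grams."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def dice(a, b):
    """Twice the shared q-grams divided by the sum of both set sizes."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0


def cosine(a, b):
    """Cosine of the angle between the two q-gram count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[g] * cb[g] for g in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


x = char_qgrams("Sony PS-LX350H")
y = char_qgrams("Sony PSLX350H")
print(f"dice={dice(x, y):.3f} jaccard={jaccard(x, y):.3f} cosine={cosine(x, y):.3f}")
```

Character q-grams tolerate small spelling differences such as a missing hyphen, which is one reason they are a common choice for noisy product names.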
em = EntityMatching(
    metric='cosine',
    tokenizer='char_tokenizer',
    vectorizer='tfidf',
    qgram=3,
    similarity_threshold=0.0
)
pairs_graph = em.predict(candidate_pairs_blocks, data)
em.evaluate(pairs_graph)
Entity Matching (cosine, char_tokenizer): 100%|██████████| 1073/1073 [00:00<00:00, 2605.80it/s]
***************************************************************************************************************************
Method: Entity Matching
***************************************************************************************************************************
Method name: Entity Matching
Parameters:
    Metric: cosine
    Attributes: None
    Similarity threshold: 0.0
    Tokenizer: char_tokenizer
    Vectorizer: tfidf
    Qgrams: 3
Runtime: 0.4132 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
    Precision: 7.32%
    Recall: 93.77%
    F1-score: 13.58%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
{'Precision %': 7.32114352053403, 'Recall %': 93.77323420074349, 'F1 %': 13.581908736034462, 'True Positives': 1009, 'False Positives': 12773, 'True Negatives': 1156633, 'False Negatives': 67}
em.plot_distribution_of_all_weights()
em.plot_distribution_of_scores()
Distribution-% of predicted scores: [8.960963575678422, 34.465244521840084, 31.410535481062258, 14.4826585401248, 6.733420403424757, 2.7064286750834423, 0.9432593237556233, 0.28297779712668697, 0.014511681903932667, 0.0]
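The list above has ten entries that sum to roughly 100, which is consistent with a histogram of similarity scores expressed as percentages. The sketch below shows one way such a list could be computed. Using ten equal-width bins over [0, 1] is an assumption about how the values are produced, and the function name and sample scores are invented.

```python
def score_distribution(scores, bins=10):
    """Return the percentage of scores that fall in each equal-width bin of [0, 1]."""
    counts = [0] * bins
    for score in scores:
        # A score of exactly 1.0 is placed in the last bin.
        index = min(int(score * bins), bins - 1)
        counts[index] += 1
    total = len(scores)
    return [100.0 * c / total for c in counts] if total else [0.0] * bins


sample_scores = [0.05, 0.15, 0.12, 0.95]
distribution = score_distribution(sample_scores)
print(distribution)
```

With `similarity_threshold=0.0` in the matching step, a distribution like this can help pick a more selective threshold for matching or clustering.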
Entity clustering takes as input the similarity graph produced by Entity Matching and partitions it into a set of equivalence clusters, with each cluster corresponding to a distinct real-world object.
from pyjedai.clustering import UniqueMappingClustering
umc = UniqueMappingClustering()
clusters = umc.process(pairs_graph, data)
umc.evaluate(clusters, with_classification_report=True)
***************************************************************************************************************************
Method: Unique Mapping Clustering
***************************************************************************************************************************
Method name: Unique Mapping Clustering
Parameters:
    Similarity Threshold: 0.1
Runtime: 0.0654 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
    Precision: 92.56%
    Recall: 87.92%
    F1-score: 90.18%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Classification report:
    True positives: 946
    False positives: 76
    True negatives: 1156570
    False negatives: 130
    Total comparisons: 1022
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
{'Precision %': 92.56360078277886, 'Recall %': 87.9182156133829, 'F1 %': 90.1811248808389, 'True Positives': 946, 'False Positives': 76, 'True Negatives': 1156570, 'False Negatives': 130}
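The jump in precision after clustering comes from the one-to-one constraint of Clean-Clean ER: each entity can match at most one entity in the other dataset. The sketch below illustrates unique mapping clustering as it is commonly described: edges are visited in decreasing similarity and a pair is accepted only if neither entity is already matched. It is a simplified sketch, not pyJedAI's implementation; the edge format, the function name, and the handling of scores exactly at the threshold are assumptions.

```python
def unique_mapping_clustering(edges, threshold=0.1):
    """Greedily pair entities one-to-one, visiting edges in decreasing similarity.

    edges: iterable of (id_in_dataset_1, id_in_dataset_2, similarity).
    """
    matched_1, matched_2, clusters = set(), set(), []
    for id_1, id_2, similarity in sorted(edges, key=lambda e: e[2], reverse=True):
        if similarity < threshold:
            break  # every remaining edge is weaker still
        if id_1 in matched_1 or id_2 in matched_2:
            continue  # one endpoint already belongs to a stronger pair
        matched_1.add(id_1)
        matched_2.add(id_2)
        clusters.append((id_1, id_2))
    return clusters


# Toy similarity graph (invented ids and weights).
toy_edges = [
    ("a1", "b1", 0.9),
    ("a1", "b2", 0.8),
    ("a2", "b2", 0.6),
    ("a3", "b1", 0.5),
    ("a3", "b3", 0.05),
]
print(unique_mapping_clustering(toy_edges))
```

In the toy graph, `a1` keeps its strongest edge, `a2` then claims `b2`, and `a3` stays unmatched because its only remaining edge falls below the threshold.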