In this notebook we present a user-friendly approach to Entity Resolution (ER) on the well-known ABT-BUY dataset. It is a simple approach, developed especially for users who are new to ER.
pyJedAI is an open-source library that can be installed from PyPI.
For more: pypi.org/project/pyjedai/
!pip install pyjedai -U
!pip show pyjedai
Name: pyjedai
Version: 0.0.3
Summary: An open-source library that builds powerful end-to-end Entity Resolution workflows.
Home-page:
Author:
Author-email: Konstantinos Nikoletos <nikoletos.kon@gmail.com>, George Papadakis <gpapadis84@gmail.com>
License: Apache Software License 2.0
Location: c:\users\nikol\appdata\local\programs\python\python310\lib\site-packages
Requires: faiss-cpu, gensim, matplotlib, matplotlib-inline, networkx, nltk, numpy, optuna, pandas, pandas-profiling, pandocfilters, PyYAML, rdflib, rdfpandas, regex, scipy, seaborn, sentence-transformers, strsim, strsimpy, tomli, tqdm, transformers
Required-by:
Imports
import os
import sys
import pandas as pd
from pyjedai.datamodel import Data
data = Data(
    dataset_1=pd.read_csv("./../data/ccer/D2/abt.csv", sep='|', engine='python', na_filter=False).astype(str),
    attributes_1=['id', 'name', 'description'],
    id_column_name_1='id',
    dataset_2=pd.read_csv("./../data/ccer/D2/buy.csv", sep='|', engine='python', na_filter=False).astype(str),
    attributes_2=['id', 'name', 'description'],
    id_column_name_2='id',
    ground_truth=pd.read_csv("./../data/ccer/D2/gt.csv", sep='|', engine='python'),
)
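The two `read_csv` calls above pass `na_filter=False` and then `.astype(str)`. The sketch below illustrates what these options do, using a tiny in-memory sample. The sample rows are hypothetical and are not taken from `abt.csv`; only the pipe-separated layout and the column names mirror the cell above.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same pipe-separated layout (not the real abt.csv).
sample_csv = "id|name|description\n0|Sony Turntable|\n1|Bose Speakers|Black, pair\n"

# With default settings, the empty description is parsed as a missing value (NaN).
default_df = pd.read_csv(io.StringIO(sample_csv), sep="|", engine="python")

# na_filter=False keeps empty fields as empty strings, and astype(str)
# converts every column (including the numeric id) to text.
text_df = pd.read_csv(
    io.StringIO(sample_csv), sep="|", engine="python", na_filter=False
).astype(str)

print(default_df["description"].isna().tolist())
print(text_df["description"].tolist())
print(text_df["id"].tolist())
```

Keeping every attribute as a string is a reasonable choice here, since the later steps tokenize and compare text and would otherwise have to handle missing values and mixed types.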
from pyjedai.workflow import WorkFlow, compare_workflows
from pyjedai.block_building import StandardBlocking, QGramsBlocking, ExtendedQGramsBlocking, SuffixArraysBlocking, ExtendedSuffixArraysBlocking
from pyjedai.block_cleaning import BlockFiltering, BlockPurging
from pyjedai.comparison_cleaning import WeightedEdgePruning, WeightedNodePruning, CardinalityEdgePruning, CardinalityNodePruning, BLAST, ReciprocalCardinalityNodePruning, ReciprocalWeightedNodePruning, ComparisonPropagation
from pyjedai.matching import EntityMatching
from pyjedai.clustering import ConnectedComponentsClustering
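Of the block-building methods imported above, `QGramsBlocking` is the one used in the rest of this notebook. As a rough intuition for what blocking on character q-grams means, here is a minimal, library-independent sketch. It is a simplification written for illustration, not pyJedAI's implementation: the helper names `char_qgrams` and `build_blocks` are hypothetical, the records are made up, and the two source datasets are not kept apart as they would be in clean-clean ER.

```python
from collections import defaultdict


def char_qgrams(text, q=3):
    """Return the set of character q-grams of each whitespace token in text."""
    grams = set()
    for token in text.lower().split():
        if len(token) < q:
            grams.add(token)  # tokens shorter than q act as their own key
        else:
            grams.update(token[i:i + q] for i in range(len(token) - q + 1))
    return grams


def build_blocks(records, q=3):
    """Map each q-gram key to the set of record ids whose text contains it."""
    blocks = defaultdict(set)
    for record_id, text in records.items():
        for gram in char_qgrams(text, q):
            blocks[gram].add(record_id)
    # Keep only blocks that yield at least one comparison.
    return {key: ids for key, ids in blocks.items() if len(ids) > 1}


# Hypothetical product names, not taken from the ABT-BUY files.
records = {"a1": "sony bravia", "b1": "sony bravia tv", "b2": "bose radio"}
blocks = build_blocks(records, q=3)
print(sorted(blocks["son"]))
```

Records that share at least one q-gram land in a common block and become candidate pairs. This is why such a step tends to keep almost all true matches while producing many superfluous comparisons that the later cleaning steps must prune.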
w = WorkFlow(
    block_building = dict(
        method=QGramsBlocking,
        params=dict(qgrams=3),
        attributes_1=['name'],
        attributes_2=['name']
    ),
    block_cleaning = [
        dict(
            method=BlockFiltering,
            params=dict(ratio=0.8)
        ),
        dict(
            method=BlockPurging,
            params=dict(smoothing_factor=1.025)
        )
    ],
    comparison_cleaning = dict(method=CardinalityEdgePruning),
    entity_matching = dict(
        method=EntityMatching,
        metric='sorensen_dice',
        similarity_threshold=0.5,
        attributes = ['description', 'name']
    ),
    clustering = dict(method=ConnectedComponentsClustering),
    name="Worflow-Test"
)
w.run(data, verbose=True)
****************************************
Method: Q-Grams Blocking
****************************************
Method name: Q-Grams Blocking
Parameters:
    Q-Gramms: 3
Runtime: 0.2410 seconds
────────────────────────────────────────
Performance:
    Precision: 0.08%
    Recall: 100.00%
    F1-score: 0.17%
────────────────────────────────────────
****************************************
Method: Block Filtering
****************************************
Method name: Block Filtering
Parameters:
    Ratio: 0.8
Runtime: 0.0990 seconds
────────────────────────────────────────
Performance:
    Precision: 0.12%
    Recall: 100.00%
    F1-score: 0.24%
────────────────────────────────────────
****************************************
Method: Block Purging
****************************************
Method name: Block Purging
Parameters:
    Smoothing factor: 1.025
    Max Comparisons per Block: 22500.0
Runtime: 0.0250 seconds
────────────────────────────────────────
Performance:
    Precision: 0.14%
    Recall: 99.91%
    F1-score: 0.28%
────────────────────────────────────────
****************************************
Method: Cardinality Edge Pruning
****************************************
Method name: Cardinality Edge Pruning
Parameters:
    Node centric: False
    Weighting scheme: JS
Runtime: 3.0797 seconds
────────────────────────────────────────
Performance:
    Precision: 4.74%
    Recall: 97.30%
    F1-score: 9.04%
────────────────────────────────────────
****************************************
Method: Entity Matching
****************************************
Method name: Entity Matching
Parameters:
    Tokenizer: white_space_tokenizer
    Metric: dice
    Similarity Threshold: 0.5
Runtime: 10.4458 seconds
────────────────────────────────────────
Performance:
    Precision: 77.78%
    Recall: 1.95%
    F1-score: 3.81%
────────────────────────────────────────
****************************************
Method: Connected Components Clustering
****************************************
Method name: Connected Components Clustering
Parameters:
Runtime: 0.0000 seconds
────────────────────────────────────────
Performance:
    Precision: 90.48%
    Recall: 1.77%
    F1-score: 3.46%
────────────────────────────────────────
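The Entity Matching step in the log above reports a whitespace tokenizer, the Dice metric and a similarity threshold of 0.5. The following sketch shows one common formulation of the Sørensen-Dice coefficient over token sets, 2·|A ∩ B| / (|A| + |B|). It is an illustration under assumptions: the function name `dice_similarity` and the example strings are hypothetical, and pyJedAI's own matcher may differ in details such as lowercasing, multiset handling, or whether the threshold comparison is strict.

```python
def dice_similarity(text_a, text_b):
    """Sørensen-Dice coefficient over lowercase, whitespace-separated token sets."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0  # convention chosen here for two empty strings
    overlap = len(tokens_a & tokens_b)
    return 2 * overlap / (len(tokens_a) + len(tokens_b))


# Hypothetical product descriptions, not taken from the ABT-BUY files.
pairs = [
    ("sony bravia 40 inch tv", "sony bravia tv 40"),
    ("sony bravia 40 inch tv", "bose wave radio"),
]
threshold = 0.5  # same value as similarity_threshold in the workflow above

for left, right in pairs:
    score = dice_similarity(left, right)
    # Assumption: a pair counts as a match when the score reaches the threshold.
    print(f"{score:.2f}", score >= threshold)
```

The first pair shares four of its tokens and scores well above the threshold, while the second pair shares none and scores zero.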
w.to_df()
| | Algorithm | F1 (%) | Recall (%) | Precision (%) | Runtime (sec) | Params |
|---|---|---|---|---|---|---|
| 0 | Q-Grams Blocking | 0.167526 | 100.000000 | 0.083833 | 0.240996 | {'Q-Gramms': 3} |
| 1 | Block Filtering | 0.238795 | 100.000000 | 0.119540 | 0.098999 | {'Ratio': 0.8} |
| 2 | Block Purging | 0.277537 | 99.907063 | 0.138962 | 0.025035 | {'Smoothing factor': 1.025, 'Max Comparisons p... |
| 3 | Cardinality Edge Pruning | 9.037939 | 97.304833 | 4.739058 | 3.079692 | {'Node centric': False, 'Weighting scheme': 'JS'} |
| 4 | Entity Matching | 3.807797 | 1.951673 | 77.777778 | 10.445795 | {'Tokenizer': 'white_space_tokenizer', 'Metric... |
| 5 | Connected Components Clustering | 3.463993 | 1.765799 | 90.476190 | 0.000000 | {} |
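The F1 column in the table above is consistent with the usual definition of F1 as the harmonic mean of precision and recall. As a quick check, the sketch below recomputes it for two rows, using the precision and recall values copied from the Cardinality Edge Pruning and Connected Components Clustering rows. The helper name `f1_score` is hypothetical and is not part of pyJedAI.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in the same unit, here percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) copied from the table above.
pruning_f1 = f1_score(4.739058, 97.304833)
clustering_f1 = f1_score(90.476190, 1.765799)

print(f"Cardinality Edge Pruning F1: {pruning_f1:.6f}")
print(f"Connected Components Clustering F1: {clustering_f1:.6f}")
```

Because the harmonic mean is dominated by the smaller of the two values, the final F1 stays close to twice the recall even though precision exceeds 90%.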
w.visualize()
w.visualize(separate=True)
w1 = WorkFlow(
    block_building = dict(
        method=QGramsBlocking,
        params=dict(qgrams=4),
        attributes_1=['name'],
        attributes_2=['name']
    ),
    block_cleaning = [
        dict(
            method=BlockFiltering,
            params=dict(ratio=0.6)
        ),
        dict(
            method=BlockPurging,
            params=dict(smoothing_factor=1.025)
        )
    ],
    comparison_cleaning = dict(method=CardinalityEdgePruning),
    entity_matching = dict(
        method=EntityMatching,
        metric='sorensen_dice',
        similarity_threshold=0.5,
        attributes = ['description', 'name']
    ),
    clustering = dict(method=ConnectedComponentsClustering)
)
w1.run(data, verbose=False, workflow_tqdm_enable=True)
w2 = WorkFlow(
    block_building = dict(
        method=QGramsBlocking,
        params=dict(qgrams=4),
        attributes_1=['name'],
        attributes_2=['name']
    ),
    block_cleaning = [
        dict(
            method=BlockFiltering,
            params=dict(ratio=0.6)
        ),
        dict(
            method=BlockPurging,
            params=dict(smoothing_factor=1.025)
        )
    ],
    comparison_cleaning = dict(method=CardinalityEdgePruning),
    entity_matching = dict(
        method=EntityMatching,
        metric='sorensen_dice',
        similarity_threshold=0.6,
        attributes = ['description']
    ),
    clustering = dict(method=ConnectedComponentsClustering)
)
w2.run(data, verbose=False, workflow_tqdm_enable=True)
Workflow-1: 0%| | 0/5 [00:00<?, ?it/s]
Workflow-2: 0%| | 0/5 [00:00<?, ?it/s]
compare_workflows([w, w1, w2], with_visualization=True)
| | Name | F1 (%) | Recall (%) | Precision (%) | Runtime (sec) |
|---|---|---|---|---|---|
| 0 | Worflow-Test | 3.463993 | 1.765799 | 90.47619 | 14.490555 |
| 1 | Workflow-1 | 3.467153 | 1.765799 | 95.00000 | 7.563514 |
| 2 | Workflow-2 | 3.467153 | 1.765799 | 95.00000 | 7.482991 |