In this notebook we present the a user-friendly approach in the well-known ABT-BUY dataset. This is a simple approach, specially developed for novice users in ER.
pyJedAI is an open-source library that can be installed from PyPI.
%pip install pyjedai -U
%pip show pyjedai
Imports
import os
import sys
import pandas as pd
from pyjedai.datamodel import Data
data = Data(
dataset_1=pd.read_csv("./../data/ccer/D2/abt.csv", sep='|', engine='python', na_filter=False).astype(str),
attributes_1=['id','name','description'],
id_column_name_1='id',
dataset_2=pd.read_csv("./../data/ccer/D2/buy.csv", sep='|', engine='python', na_filter=False).astype(str),
attributes_2=['id','name','description'],
id_column_name_2='id',
ground_truth=pd.read_csv("./../data/ccer/D2/gt.csv", sep='|', engine='python'),
)
from pyjedai.workflow import BlockingBasedWorkFlow, EmbeddingsNNWorkFlow, compare_workflows
from pyjedai.block_building import StandardBlocking, QGramsBlocking, ExtendedQGramsBlocking, SuffixArraysBlocking, ExtendedSuffixArraysBlocking
from pyjedai.block_cleaning import BlockFiltering, BlockPurging
from pyjedai.comparison_cleaning import WeightedEdgePruning, WeightedNodePruning, CardinalityEdgePruning, CardinalityNodePruning, BLAST, ReciprocalCardinalityNodePruning, ReciprocalWeightedNodePruning, ComparisonPropagation
from pyjedai.matching import EntityMatching
from pyjedai.clustering import ConnectedComponentsClustering, UniqueMappingClustering
from pyjedai.vector_based_blocking import EmbeddingsNNBlockBuilding
/home/konstantinos/pyJedAI-Dev/src/pyjedai/workflow.py:11: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html from tqdm.autonotebook import tqdm
w = BlockingBasedWorkFlow(
block_building = dict(
method=QGramsBlocking,
params=dict(qgrams=3),
attributes_1=['name'],
attributes_2=['name']
),
block_cleaning = [
dict(
method=BlockPurging,
params=dict(smoothing_factor=1.025)
),
dict(
method=BlockFiltering,
params=dict(ratio=0.8)
)
],
comparison_cleaning = dict(method=CardinalityEdgePruning),
entity_matching = dict(
method=EntityMatching,
metric='sorensen_dice',
similarity_threshold=0.5,
attributes = ['description', 'name']
),
clustering = dict(method=ConnectedComponentsClustering),
name="Worflow-Test"
)
w.run(data, verbose=True)
*************************************************************************************************************************** Method: Q-Grams Blocking *************************************************************************************************************************** Method name: Q-Grams Blocking Parameters: Q-Gramms: 3 Runtime: 0.1329 seconds ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── Performance: Precision: 0.08% Recall: 100.00% F1-score: 0.17% ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── *************************************************************************************************************************** Method: Block Purging *************************************************************************************************************************** Method name: Block Purging Parameters: Smoothing factor: 1.025 Max Comparisons per Block: 34452.0 Runtime: 0.0135 seconds ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── Performance: Precision: 0.10% Recall: 100.00% F1-score: 0.19% ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── *************************************************************************************************************************** Method: Block Filtering *************************************************************************************************************************** Method name: Block Filtering Parameters: Ratio: 0.8 Runtime: 0.0966 seconds ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── Performance: Precision: 0.25% Recall: 99.91% F1-score: 0.49% ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── *************************************************************************************************************************** Method: Cardinality Edge Pruning *************************************************************************************************************************** Method name: Cardinality Edge Pruning Parameters: Node centric: False Weighting scheme: JS Runtime: 0.9665 seconds ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── Performance: Precision: 4.71% Recall: 98.70% F1-score: 8.99% ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── *************************************************************************************************************************** Method: Entity Matching *************************************************************************************************************************** Method name: Entity Matching Parameters: Metric: dice Attributes: None Similarity threshold: 0.5 Tokenizer: white_space_tokenizer Vectorizer: None Qgrams: 1 Runtime: 3.7806 seconds ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── Performance: Precision: 77.78% Recall: 1.95% F1-score: 3.81% ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── *************************************************************************************************************************** Method: Connected Components Clustering *************************************************************************************************************************** Method name: Connected Components Clustering Parameters: Similarity Threshold: None Runtime: 0.0003 seconds ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── Performance: Precision: 90.48% Recall: 1.77% F1-score: 3.46% ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
w.to_df()
Algorithm | F1 | Recall | Precision | Runtime (sec) | Params | |
---|---|---|---|---|---|---|
0 | Q-Grams Blocking | 0.167526 | 100.000000 | 0.083833 | 0.132942 | {'Q-Gramms': 3} |
1 | Block Purging | 0.190665 | 100.000000 | 0.095423 | 0.013543 | {'Smoothing factor': 1.025, 'Max Comparisons p... |
2 | Block Filtering | 0.493133 | 99.907063 | 0.247176 | 0.096584 | {'Ratio': 0.8} |
3 | Cardinality Edge Pruning | 8.985532 | 98.698885 | 4.707030 | 0.966452 | {'Node centric': False, 'Weighting scheme': 'JS'} |
4 | Entity Matching | 3.807797 | 1.951673 | 77.777778 | 3.780590 | {'Metric': 'dice', 'Attributes': None, 'Simila... |
5 | Connected Components Clustering | 3.463993 | 1.765799 | 90.476190 | 0.000292 | {'Similarity Threshold': None} |
w.visualize()
w.visualize(separate=True)
w1 = BlockingBasedWorkFlow(
block_building = dict(
method=QGramsBlocking,
params=dict(qgrams=4),
attributes_1=['name'],
attributes_2=['name']
),
block_cleaning = [
dict(
method=BlockFiltering,
params=dict(ratio=0.6)
),
dict(
method=BlockPurging,
params=dict(smoothing_factor=1.025)
)
],
comparison_cleaning = dict(method=CardinalityEdgePruning),
entity_matching = dict(
method=EntityMatching,
metric='sorensen_dice',
similarity_threshold=0.5,
attributes = ['description', 'name']
),
clustering = dict(method=ConnectedComponentsClustering)
)
w1.run(data, verbose=False, workflow_tqdm_enable=True)
BlockingBasedWorkFlow-1: 100%|██████████| 5/5 [00:02<00:00, 1.26it/s]
block_building
:method
: EmbeddingsNNBlockBuildingparams
: Constructor parametersexec_params
: build_blocks
parametersclustering
:method
: UniqueMappingClusteringparams
: Constructor parameters (i.e similarity threshold)w2 = EmbeddingsNNWorkFlow(
block_building = dict(
method=EmbeddingsNNBlockBuilding,
params=dict(vectorizer='sminilm', similarity_search='faiss'),
exec_params=dict(top_k=5,
similarity_distance='euclidean',
load_embeddings_if_exist=False,
save_embeddings=False)
),
clustering = dict(method=UniqueMappingClustering),
name="EmbeddingsNNWorkFlow-Test"
)
w2.run(data, verbose=True)
Building blocks via Embeddings-NN Block Building [sminilm, faiss]
Embeddings-NN Block Building [sminilm, faiss, cuda]: 100%|██████████| 2152/2152 [00:21<00:00, 102.16it/s]
disable True *************************************************************************************************************************** Method: Embeddings-NN Block Building *************************************************************************************************************************** Method name: Embeddings-NN Block Building Parameters: Vectorizer: sminilm Similarity-Search: faiss Top-K: 5 Vector size: 384 Runtime: 21.7341 seconds ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── Performance: Precision: 18.35% Recall: 91.73% F1-score: 30.58% ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── *************************************************************************************************************************** Method: Unique Mapping Clustering *************************************************************************************************************************** Method name: Unique Mapping Clustering Parameters: Similarity Threshold: 0.1 Runtime: 0.0474 seconds ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── Performance: Precision: 79.44% Recall: 73.61% F1-score: 76.41% ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
compare_workflows([w, w1, w2], with_visualization=True)
Name | F1 | Recall | Precision | Runtime (sec) | |
---|---|---|---|---|---|
0 | Worflow-Test | 3.463993 | 1.765799 | 90.476190 | 5.330923 |
1 | BlockingBasedWorkFlow-1 | 3.463993 | 1.765799 | 90.476190 | 2.949135 |
2 | EmbeddingsNNWorkFlow-Test | 76.410999 | 73.605948 | 79.438315 | 21.874120 |
w = BlockingBasedWorkFlow()
w.best_blocking_workflow_ccer()
w.run(data, verbose=True)
*************************************************************************************************************************** Method: Standard Blocking *************************************************************************************************************************** Method name: Standard Blocking Parameters: Runtime: 0.1045 seconds ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── Performance: Precision: 0.12% Recall: 99.81% F1-score: 0.24% ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── *************************************************************************************************************************** Method: Block Filtering *************************************************************************************************************************** Method name: Block Filtering Parameters: Ratio: 0.9 Runtime: 0.0792 seconds ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── Performance: Precision: 0.24% Recall: 99.72% F1-score: 0.48% ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── *************************************************************************************************************************** Method: Weighted Edge Pruning *************************************************************************************************************************** Method name: Weighted Edge Pruning Parameters: Node centric: False Weighting scheme: EJS Runtime: 1.5783 seconds ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── Performance: Precision: 1.33% Recall: 99.54% F1-score: 2.63% ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── *************************************************************************************************************************** Method: Entity Matching *************************************************************************************************************************** Method name: Entity Matching Parameters: Metric: cosine Attributes: None Similarity threshold: 0.0 Tokenizer: char_tokenizer Vectorizer: tfidf Qgrams: 3 Runtime: 0.5095 seconds ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── Performance: Precision: 1.33% Recall: 99.54% F1-score: 2.63% ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── *************************************************************************************************************************** Method: Unique Mapping Clustering *************************************************************************************************************************** Method name: Unique Mapping Clustering Parameters: Similarity Threshold: 0.17 Runtime: 0.1052 seconds ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── Performance: Precision: 94.52% Recall: 93.03% F1-score: 93.77% ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────