In this notebook we present a user-friendly approach to Entity Resolution (ER) on the well-known ABT-BUY dataset. It is a simple approach, developed especially for users who are new to ER.
pyJedAI is an open-source library that can be installed from PyPI.
!pip install pyjedai -U
!pip show pyjedai
Name: pyjedai
Version: 0.1.0
Summary: An open-source library that builds powerful end-to-end Entity Resolution workflows.
Home-page:
Author:
Author-email: Konstantinos Nikoletos <nikoletos.kon@gmail.com>, George Papadakis <gpapadis84@gmail.com>, Jakub Maciejewski <jacobb.maciejewski@gmail.com>, Manolis Koubarakis <koubarak@di.uoa.gr>
License: Apache Software License 2.0
Location: /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages
Requires: faiss-cpu, gensim, matplotlib, matplotlib-inline, networkx, nltk, numpy, optuna, ordered-set, pandas, pandas-profiling, pandocfilters, plotly, py-stringmatching, PyYAML, rdflib, rdfpandas, regex, scipy, seaborn, sentence-transformers, strsim, strsimpy, tomli, tqdm, transformers, valentine
Required-by:
Imports
import os
import sys
import pandas as pd
from pyjedai.datamodel import Data
data = Data(
    dataset_1=pd.read_csv("./../data/ccer/D2/abt.csv", sep='|', engine='python', na_filter=False).astype(str),
    attributes_1=['id','name','description'],
    id_column_name_1='id',
    dataset_2=pd.read_csv("./../data/ccer/D2/buy.csv", sep='|', engine='python', na_filter=False).astype(str),
    attributes_2=['id','name','description'],
    id_column_name_2='id',
    ground_truth=pd.read_csv("./../data/ccer/D2/gt.csv", sep='|', engine='python'),
)
[nltk_data] Downloading package stopwords to /home/jm/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
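The `na_filter=False` and `.astype(str)` arguments used when loading the CSV files keep empty cells as empty strings and turn every column into text. The sketch below illustrates that effect on a small in-memory sample; the two rows are made up for illustration and are not taken from ABT-BUY.

```python
import io
import pandas as pd

# Hypothetical pipe-separated sample; the second row has an empty description.
sample = "id|name|description\n0|Sony Turntable|Belt-drive turntable\n1|Bose Speakers|\n"

# Default parsing turns the empty cell into NaN.
default_df = pd.read_csv(io.StringIO(sample), sep='|', engine='python')

# With na_filter=False the empty cell stays an empty string, and
# astype(str) makes every column (including the numeric id) textual.
text_df = pd.read_csv(
    io.StringIO(sample), sep='|', engine='python', na_filter=False
).astype(str)

print(default_df['description'].isna().tolist())
print(text_df['description'].tolist())
print(text_df['id'].tolist())
```

Keeping all values as strings avoids mixing NaN floats with text when attribute values are later tokenized and compared.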
from pyjedai.workflow import BlockingBasedWorkFlow, EmbeddingsNNWorkFlow, compare_workflows
from pyjedai.block_building import StandardBlocking, QGramsBlocking, ExtendedQGramsBlocking, SuffixArraysBlocking, ExtendedSuffixArraysBlocking
from pyjedai.block_cleaning import BlockFiltering, BlockPurging
from pyjedai.comparison_cleaning import WeightedEdgePruning, WeightedNodePruning, CardinalityEdgePruning, CardinalityNodePruning, BLAST, ReciprocalCardinalityNodePruning, ReciprocalWeightedNodePruning, ComparisonPropagation
from pyjedai.matching import EntityMatching
from pyjedai.clustering import ConnectedComponentsClustering, UniqueMappingClustering
from pyjedai.vector_based_blocking import EmbeddingsNNBlockBuilding
w = BlockingBasedWorkFlow(
    block_building = dict(
        method=QGramsBlocking,
        params=dict(qgrams=3),
        attributes_1=['name'],
        attributes_2=['name']
    ),
    block_cleaning = [
        dict(
            method=BlockPurging,
            params=dict(smoothing_factor=1.025)
        ),
        dict(
            method=BlockFiltering,
            params=dict(ratio=0.8)
        )
    ],
    comparison_cleaning = dict(method=CardinalityEdgePruning),
    entity_matching = dict(
        method=EntityMatching,
        metric='sorensen_dice',
        similarity_threshold=0.5,
        attributes = ['description', 'name']
    ),
    clustering = dict(method=ConnectedComponentsClustering),
    name="Worflow-Test"
)
w.run(data, verbose=True)
***************************************************************************************************************************
                                         Μethod:  Q-Grams Blocking
***************************************************************************************************************************
Method name: Q-Grams Blocking
Parameters:
    Q-Gramms: 3
Runtime: 0.7395 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
    Precision:      0.08%
    Recall:       100.00%
    F1-score:       0.17%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
***************************************************************************************************************************
                                         Μethod:  Block Purging
***************************************************************************************************************************
Method name: Block Purging
Parameters:
    Smoothing factor: 1.025
    Max Comparisons per Block: 34452.0
Runtime: 0.0444 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
    Precision:      0.10%
    Recall:       100.00%
    F1-score:       0.19%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
***************************************************************************************************************************
                                         Μethod:  Block Filtering
***************************************************************************************************************************
Method name: Block Filtering
Parameters:
    Ratio: 0.8
Runtime: 0.4790 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
    Precision:      0.25%
    Recall:        99.91%
    F1-score:       0.49%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
***************************************************************************************************************************
                                         Μethod:  Cardinality Edge Pruning
***************************************************************************************************************************
Method name: Cardinality Edge Pruning
Parameters:
    Node centric: False
    Weighting scheme: JS
Runtime: 2.7094 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
    Precision:      4.71%
    Recall:        98.70%
    F1-score:       8.99%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
***************************************************************************************************************************
                                         Μethod:  Entity Matching
***************************************************************************************************************************
Method name: Entity Matching
Parameters:
    Metric: dice
    Attributes: None
    Similarity threshold: 0.5
    Tokenizer: white_space_tokenizer
    Vectorizer: None
    Qgrams: 1
Runtime: 11.1440 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
    Precision:     77.78%
    Recall:         1.95%
    F1-score:       3.81%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
***************************************************************************************************************************
                                         Μethod:  Connected Components Clustering
***************************************************************************************************************************
Method name: Connected Components Clustering
Parameters:
Runtime: 0.0013 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
    Precision:     90.48%
    Recall:         1.77%
    F1-score:       3.46%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
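The Entity Matching step above reports the `dice` metric with a whitespace tokenizer and a similarity threshold of 0.5. As a rough illustration of what such a comparison computes, here is a minimal sketch of the Sørensen-Dice coefficient over whitespace token sets. The function name, the lowercasing step, and the example strings are assumptions made for this sketch; pyJedAI's own implementation may tokenize or count tokens differently.

```python
def dice_similarity(text_a: str, text_b: str) -> float:
    """Sørensen-Dice coefficient over whitespace-separated token sets."""
    # Lowercasing is an assumption of this sketch, not confirmed library behavior.
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return 2 * len(tokens_a & tokens_b) / (len(tokens_a) + len(tokens_b))


THRESHOLD = 0.5  # same value as similarity_threshold in the workflow above

# Hypothetical product descriptions, not taken from ABT-BUY.
similar = dice_similarity("sony bravia 40 inch lcd tv", "sony bravia lcd television 40")
dissimilar = dice_similarity("sony bravia 40 inch lcd tv", "sony dvd player")

print(similar, similar >= THRESHOLD)
print(dissimilar, dissimilar >= THRESHOLD)
```

A pair is kept only when its score reaches the threshold, so a strict threshold tends to raise precision at the cost of recall, which is the pattern visible in the Entity Matching numbers above.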
w.to_df()
| | Algorithm | F1 | Recall | Precision | Runtime (sec) | Params |
|---|---|---|---|---|---|---|
| 0 | Q-Grams Blocking | 0.167526 | 100.000000 | 0.083833 | 0.739494 | {'Q-Gramms': 3} |
| 1 | Block Purging | 0.190665 | 100.000000 | 0.095423 | 0.044369 | {'Smoothing factor': 1.025, 'Max Comparisons p... |
| 2 | Block Filtering | 0.493133 | 99.907063 | 0.247176 | 0.479028 | {'Ratio': 0.8} |
| 3 | Cardinality Edge Pruning | 8.985532 | 98.698885 | 4.707030 | 2.709405 | {'Node centric': False, 'Weighting scheme': 'JS'} |
| 4 | Entity Matching | 3.807797 | 1.951673 | 77.777778 | 11.143954 | {'Metric': 'dice', 'Attributes': None, 'Simila... |
| 5 | Connected Components Clustering | 3.463993 | 1.765799 | 90.476190 | 0.001340 | {} |
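The F1 column in the table is the harmonic mean of the Precision and Recall columns. The short check below recomputes it for three rows, using the precision and recall values copied from the table; the helper name `f1_score` is chosen for this sketch only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above.
rows = {
    "Cardinality Edge Pruning": (4.707030, 98.698885),
    "Entity Matching": (77.777778, 1.951673),
    "Connected Components Clustering": (90.476190, 1.765799),
}

recomputed = {name: f1_score(p, r) for name, (p, r) in rows.items()}
for name, value in recomputed.items():
    print(f"{name}: {value:.6f}")
```

Because the harmonic mean is dominated by the smaller of the two values, the last two steps keep a low F1 despite their high precision.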
w.visualize()
w.visualize(separate=True)
w1 = BlockingBasedWorkFlow(
    block_building = dict(
        method=QGramsBlocking,
        params=dict(qgrams=4),
        attributes_1=['name'],
        attributes_2=['name']
    ),
    block_cleaning = [
        dict(
            method=BlockFiltering,
            params=dict(ratio=0.6)
        ),
        dict(
            method=BlockPurging,
            params=dict(smoothing_factor=1.025)
        )
    ],
    comparison_cleaning = dict(method=CardinalityEdgePruning),
    entity_matching = dict(
        method=EntityMatching,
        metric='sorensen_dice',
        similarity_threshold=0.5,
        attributes = ['description', 'name']
    ),
    clustering = dict(method=ConnectedComponentsClustering)
)
w1.run(data, verbose=False, workflow_tqdm_enable=True)
An EmbeddingsNNWorkFlow is configured with the following arguments:
block_building:
    method: EmbeddingsNNBlockBuilding
    params: constructor parameters
    exec_params: build_blocks parameters
clustering:
    method: UniqueMappingClustering
    params: constructor parameters (i.e., similarity threshold)
w2 = EmbeddingsNNWorkFlow(
    block_building = dict(
        method=EmbeddingsNNBlockBuilding,
        params=dict(vectorizer='sminilm', similarity_search='faiss'),
        exec_params=dict(top_k=5,
                         similarity_distance='euclidean',
                         load_embeddings_if_exist=False,
                         save_embeddings=False)
    ),
    clustering = dict(method=UniqueMappingClustering),
    name="EmbeddingsNNWorkFlow-Test"
)
w2.run(data, verbose=True)
Building blocks via Embeddings-NN Block Building [sminilm, faiss]
Device selected: cpu
***************************************************************************************************************************
                                         Μethod:  Embeddings-NN Block Building
***************************************************************************************************************************
Method name: Embeddings-NN Block Building
Parameters:
    Vectorizer: sminilm
    Similarity-Search: faiss
    Top-K: 5
    Vector size: 384
Runtime: 163.8474 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
    Precision:      9.17%
    Recall:        91.73%
    F1-score:      16.68%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
***************************************************************************************************************************
                                         Μethod:  Unique Mapping Clustering
***************************************************************************************************************************
Method name: Unique Mapping Clustering
Parameters:
Runtime: 0.1756 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
    Precision:     79.44%
    Recall:        73.61%
    F1-score:      76.41%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
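The block-building step above embeds each entity as a vector and retrieves the `top_k` nearest neighbours under Euclidean distance. The sketch below shows the idea with a brute-force NumPy search on toy 2-D vectors. It is not pyJedAI's implementation: the library uses faiss and 384-dimensional `sminilm` embeddings, and the function name, the choice of which dataset is indexed and which is queried, and the vectors are all assumptions of this sketch.

```python
import numpy as np


def top_k_candidates(queries: np.ndarray, index: np.ndarray, k: int) -> set:
    """Return (index_id, query_id) candidate pairs from a brute-force k-NN search."""
    # Pairwise squared Euclidean distances, shape (n_queries, n_index).
    diff = queries[:, None, :] - index[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    # Positions of the k closest indexed vectors for every query.
    nearest = np.argsort(dist, axis=1)[:, :k]
    return {(int(i), int(q)) for q, row in enumerate(nearest) for i in row}


# Toy "embeddings": three indexed entities and two query entities.
index_vectors = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
query_vectors = np.array([[0.1, 0.0], [4.9, 5.2]])

pairs_k1 = top_k_candidates(query_vectors, index_vectors, k=1)
pairs_k2 = top_k_candidates(query_vectors, index_vectors, k=2)
print(sorted(pairs_k1))
print(sorted(pairs_k2))
```

Raising `k` adds more candidates per entity, which can only help recall but adds pairs that are mostly non-matches; this is consistent with the high recall and low precision reported for the block-building step, before clustering prunes the candidates.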
compare_workflows([w, w1, w2], with_visualization=True)
| | Name | F1 | Recall | Precision | Runtime (sec) |
|---|---|---|---|---|---|
| 0 | Worflow-Test | 3.463993 | 1.765799 | 90.476190 | 16.132164 |
| 1 | BlockingBasedWorkFlow-1 | 3.463993 | 1.765799 | 90.476190 | 9.111916 |
| 2 | EmbeddingsNNWorkFlow-Test | 76.410999 | 73.605948 | 79.438315 | 164.297292 |
w = BlockingBasedWorkFlow()
w.best_blocking_workflow_ccer()
w.run(data, verbose=True)
***************************************************************************************************************************
                                         Μethod:  Standard Blocking
***************************************************************************************************************************
Method name: Standard Blocking
Parameters:
Runtime: 0.6960 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
    Precision:      0.12%
    Recall:        99.81%
    F1-score:       0.24%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
***************************************************************************************************************************
                                         Μethod:  Block Filtering
***************************************************************************************************************************
Method name: Block Filtering
Parameters:
    Ratio: 0.9
Runtime: 0.4103 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
    Precision:      0.24%
    Recall:        99.72%
    F1-score:       0.48%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
***************************************************************************************************************************
                                         Μethod:  Weighted Edge Pruning
***************************************************************************************************************************
Method name: Weighted Edge Pruning
Parameters:
    Node centric: False
    Weighting scheme: EJS
Runtime: 4.4405 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
    Precision:      1.33%
    Recall:        99.54%
    F1-score:       2.63%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
***************************************************************************************************************************
                                         Μethod:  Entity Matching
***************************************************************************************************************************
Method name: Entity Matching
Parameters:
    Metric: cosine
    Attributes: None
    Similarity threshold: 0.0
    Tokenizer: char_tokenizer
    Vectorizer: tfidf
    Qgrams: 3
Runtime: 2.2846 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
    Precision:      1.33%
    Recall:        99.54%
    F1-score:       2.63%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
***************************************************************************************************************************
                                         Μethod:  Unique Mapping Clustering
***************************************************************************************************************************
Method name: Unique Mapping Clustering
Parameters:
Runtime: 0.3374 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
    Precision:     94.52%
    Recall:        93.03%
    F1-score:      93.77%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
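In the output above, Entity Matching runs with a similarity threshold of 0.0, so it keeps every candidate pair, and the jump in precision comes from Unique Mapping Clustering. A common formulation of that algorithm for two duplicate-free datasets is a greedy one-to-one assignment: scan the pairs by descending similarity and accept a pair only if neither entity has already been matched. The sketch below follows that formulation; the function signature, the threshold handling, and the scored pairs are hypothetical, and pyJedAI's implementation may differ in its details.

```python
def unique_mapping_clustering(scored_pairs, threshold=0.0):
    """Greedy one-to-one matching over (left_id, right_id, similarity) triples."""
    matched_left, matched_right = set(), set()
    matches = []
    # Visit the most similar pairs first.
    for left, right, score in sorted(scored_pairs, key=lambda p: p[2], reverse=True):
        if score < threshold:
            break  # all remaining pairs are even less similar
        if left in matched_left or right in matched_right:
            continue  # one of the two entities is already matched
        matched_left.add(left)
        matched_right.add(right)
        matches.append((left, right))
    return matches


# Hypothetical similarity scores between entities of two clean datasets.
scored = [
    ("a1", "b1", 0.9),
    ("a1", "b2", 0.8),
    ("a2", "b1", 0.7),
    ("a2", "b2", 0.4),
    ("a3", "b3", 0.2),
]

all_matches = unique_mapping_clustering(scored)
strict_matches = unique_mapping_clustering(scored, threshold=0.3)
print(all_matches)
print(strict_matches)
```

Since each entity ends up in at most one pair, most of the low-scoring candidates are discarded, which would explain why precision rises sharply while recall drops only slightly.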