This notebook demonstrates the pyJedAI workflow on the well-known Abt-Buy dataset. The task is Clean-Clean ER, i.e., link discovery/deduplication between two sets of entities.
Dataset: Abt-Buy dataset
The Abt-Buy dataset for entity resolution derives from the online retailers Abt.com and Buy.com. It contains 1076 entities from Abt.com and 1076 entities from Buy.com, as well as a gold standard (perfect mapping) with 1076 matching record pairs between the two data sources. The attributes common to both data sources are product name, product description and product price.
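To give a sense of scale, the short calculation below works out how many comparisons an exhaustive Clean-Clean ER run would need on this dataset. It only uses the entity and match counts quoted above; the variable names are illustrative.

```python
# Sizes taken from the dataset description above.
n_abt = 1076
n_buy = 1076
n_matches = 1076

# In Clean-Clean ER only cross-source pairs are candidates,
# so the exhaustive search space is the Cartesian product.
brute_force_comparisons = n_abt * n_buy

# Share of the exhaustive comparisons that are true matches.
match_ratio = n_matches / brute_force_comparisons

print(f"Exhaustive comparisons: {brute_force_comparisons:,}")
print(f"Share of true matches:  {match_ratio:.4%}")
```

Fewer than one in a thousand of these pairs is a match, which is why the workflow first narrows the candidates before comparing profiles.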
pyJedAI is an open-source library that can be installed from PyPI.
For more: pypi.org/project/pyjedai/
!pip install pyjedai -U
!pip show pyjedai
Name: pyjedai
Version: 0.0.5
Summary: An open-source library that builds powerful end-to-end Entity Resolution workflows.
Home-page:
Author:
Author-email: Konstantinos Nikoletos <nikoletos.kon@gmail.com>, George Papadakis <gpapadis84@gmail.com>
License: Apache Software License 2.0
Location: c:\users\nikol\anaconda3\lib\site-packages
Requires: PyYAML, optuna, scipy, gensim, pandocfilters, numpy, rdflib, pandas, transformers, regex, strsim, tqdm, networkx, seaborn, rdfpandas, strsimpy, matplotlib-inline, matplotlib, pandas-profiling, tomli, nltk, faiss-cpu, sentence-transformers
Required-by:
Imports
import os
import sys
import pandas as pd
import networkx
from networkx import draw, Graph
import pyjedai
from pyjedai.utils import (
    text_cleaning_method,
    print_clusters,
    print_blocks,
    print_candidate_pairs
)
from pyjedai.evaluation import Evaluation, write
To run, pyJedAI only needs the initial data transformed into a pandas DataFrame. Hence, pyJedAI can work with any structured or semi-structured data. In this case, the Abt-Buy dataset is provided as .csv files.
from pyjedai.datamodel import Data
from pyjedai.evaluation import Evaluation
d1 = pd.read_csv("./../data/ccer/D2/abt.csv", sep='|', engine='python', na_filter=False).astype(str)
d2 = pd.read_csv("./../data/ccer/D2/buy.csv", sep='|', engine='python', na_filter=False).astype(str)
gt = pd.read_csv("./../data/ccer/D2/gt.csv", sep='|', engine='python').astype(str)
data = Data(
    dataset_1=d1,
    id_column_name_1='id',
    dataset_2=d2,
    id_column_name_2='id',
    ground_truth=gt
)
pyJedAI also offers dataset analysis methods (more will be developed).
data.print_specs()
------------------------- Data -------------------------
Type of Entity Resolution: Clean-Clean
Dataset-1:
    Number of entities: 1076
    Number of NaN values: 0
    Attributes: ['name', 'description', 'price']
Dataset-2:
    Number of entities: 1076
    Number of NaN values: 0
    Attributes: ['name', 'description', 'price']
Total number of entities: 2152
Number of matching pairs in ground-truth: 1076
--------------------------------------------------------
data.dataset_1.head(5)
 | id | name | description | price |
---|---|---|---|---|
0 | 0 | Sony Turntable - PSLX350H | Sony Turntable - PSLX350H/ Belt Drive System/ ... | |
1 | 1 | Bose Acoustimass 5 Series III Speaker System -... | Bose Acoustimass 5 Series III Speaker System -... | 399 |
2 | 2 | Sony Switcher - SBV40S | Sony Switcher - SBV40S/ Eliminates Disconnecti... | 49 |
3 | 3 | Sony 5 Disc CD Player - CDPCE375 | Sony 5 Disc CD Player- CDPCE375/ 5 Disc Change... | |
4 | 4 | Bose 27028 161 Bookshelf Pair Speakers In Whit... | Bose 161 Bookshelf Speakers In White - 161WH/ ... | 158 |
data.dataset_2.head(5)
 | id | name | description | price |
---|---|---|---|---|
0 | 0 | Linksys EtherFast EZXS88W Ethernet Switch - EZ... | Linksys EtherFast 8-Port 10/100 Switch (New/Wo... | |
1 | 1 | Linksys EtherFast EZXS55W Ethernet Switch | 5 x 10/100Base-TX LAN | |
2 | 2 | Netgear ProSafe FS105 Ethernet Switch - FS105NA | NETGEAR FS105 Prosafe 5 Port 10/100 Desktop Sw... | |
3 | 3 | Belkin Pro Series High Integrity VGA/SVGA Moni... | 1 x HD-15 - 1 x HD-15 - 10ft - Beige | |
4 | 4 | Netgear ProSafe JFS516 Ethernet Switch | Netgear ProSafe 16 Port 10/100 Rackmount Switc... | |
data.ground_truth.head(3)
 | D1 | D2 |
---|---|---|
0 | 206 | 216 |
1 | 60 | 46 |
2 | 182 | 160 |
Block Building clusters entities into overlapping blocks in a lazy manner that relies on unsupervised blocking keys: every token in an attribute value forms a key. Blocks are then extracted, possibly after transforming the key, based on the key's equality or similarity with other keys.
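The idea that every token forms a blocking key can be sketched in a few lines of plain Python. This is a minimal illustration and not pyJedAI's implementation: the function `token_blocking`, the whitespace tokenization and the toy profiles are all assumptions made for the example.

```python
from collections import defaultdict


def token_blocking(profiles_1, profiles_2):
    """Group entity ids of two sources by the lower-cased tokens they share."""
    # key -> (ids from source 1, ids from source 2)
    blocks = defaultdict(lambda: (set(), set()))
    for side, profiles in enumerate((profiles_1, profiles_2)):
        for entity_id, text in profiles.items():
            for token in set(text.lower().split()):
                blocks[token][side].add(entity_id)
    # In Clean-Clean ER a block is useful only if both sources appear in it.
    return {key: ids for key, ids in blocks.items() if ids[0] and ids[1]}


# Hypothetical toy profiles, loosely modelled on the product names above.
toy_abt = {0: "Sony Turntable PSLX350H", 1: "Bose Acoustimass Speaker System"}
toy_buy = {0: "Sony PSLX350H Belt Drive Turntable", 1: "Linksys Ethernet Switch"}

token_blocks = token_blocking(toy_abt, toy_buy)
for key, (ids_1, ids_2) in sorted(token_blocks.items()):
    print(key, sorted(ids_1), sorted(ids_2))
```

The two Sony profiles end up together in three overlapping blocks, while the profiles with no shared token are never placed in a common block.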
The following methods are currently supported:
from pyjedai.block_building import (
StandardBlocking,
QGramsBlocking,
ExtendedQGramsBlocking,
SuffixArraysBlocking,
ExtendedSuffixArraysBlocking,
)
Created embeddings directory at: C:\Users\nikol\Desktop\test\tutorials\.embeddings
qgb = SuffixArraysBlocking()
blocks = qgb.build_blocks(data, attributes_1=['name'], attributes_2=['name'])
qgb.report()
Method name: Suffix Arrays Blocking
Method info: Creates one block for every suffix that appears in the attribute value tokens of at least two entities.
Parameters:
    Suffix length: 6
    Maximum Block Size: 53
Attributes from D1: name
Attributes from D2: name
Runtime: 0.2220 seconds
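The method description in the report can be illustrated with a small, self-contained sketch. It is not the library's code: `suffix_blocking`, the way the block size is counted and the toy profiles are assumptions, and only the two parameter values (suffix length 6, maximum block size 53) are taken from the report.

```python
from collections import defaultdict


def suffixes(token, min_length):
    """All suffixes of `token` that have at least `min_length` characters."""
    return [token[i:] for i in range(len(token) - min_length + 1)]


def suffix_blocking(profiles_1, profiles_2, min_length=6, max_block_size=53):
    """Create one block per token suffix shared by entities of both sources."""
    blocks = defaultdict(lambda: (set(), set()))
    for side, profiles in enumerate((profiles_1, profiles_2)):
        for entity_id, text in profiles.items():
            for token in text.lower().split():
                for key in suffixes(token, min_length):
                    blocks[key][side].add(entity_id)
    # Keep blocks that span both sources and are not oversized.
    # Counting the block size as the total number of entities is an assumption.
    return {
        key: ids
        for key, ids in blocks.items()
        if ids[0] and ids[1] and len(ids[0]) + len(ids[1]) <= max_block_size
    }


# Hypothetical toy profiles: the model number is glued to the brand in one source.
suffix_abt = {0: "Sony Turntable PSLX350H"}
suffix_buy = {0: "Sony-PSLX350H belt drive", 1: "Linksys Ethernet Switch"}

suffix_blocks = suffix_blocking(suffix_abt, suffix_buy)
print(sorted(suffix_blocks))
```

Whole-token keys would miss this pair, because `pslx350h` and `sony-pslx350h` are different tokens; their shared suffixes still bring the two profiles into common blocks.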
_ = qgb.evaluate(blocks, with_classification_report=True)
***************************************************************************************************************************
Μethod: Suffix Arrays Blocking
***************************************************************************************************************************
Method name: Suffix Arrays Blocking
Parameters:
    Suffix length: 6
    Maximum Block Size: 53
Runtime: 0.2220 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
    Precision: 1.41%
    Recall: 97.03%
    F1-score: 2.78%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Classification report:
    True positives: 1044
    False positives: 73021
    True negatives: 1084723
    False negatives: 32
    Total comparisons: 74065
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
{'Precision %': 1.4095726726523998, 'Recall %': 97.02602230483272, 'F1 %': 2.7787759013055453, 'True Positives': 1044, 'False Positives': 73021, 'True Negatives': 1084723, 'False Negatives': 32}
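The percentages in this report follow directly from the confusion counts. The sketch below recomputes them from the true positives, false positives and false negatives printed above, using the standard definitions of precision, recall and F1; the variable names are illustrative.

```python
# Counts copied from the classification report above.
true_positives = 1044
false_positives = 73021
false_negatives = 32

# Precision: share of the retained comparisons that are true matches.
precision = true_positives / (true_positives + false_positives)
# Recall: share of the ground-truth matches that survive blocking.
recall = true_positives / (true_positives + false_negatives)
# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.2%}")
print(f"Recall:    {recall:.2%}")
print(f"F1-score:  {f1:.2%}")
```

At this stage high recall matters most: blocking is expected to keep nearly all matches, and the later cleaning and matching steps are what raise precision.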
___Optional step___
Block Cleaning aims to rid a set of overlapping blocks of unnecessary comparisons, which can be either redundant (i.e., repeated comparisons that have already been executed in a previously examined block) or superfluous (i.e., comparisons that involve non-matching entities). Its methods operate on the coarse level of individual blocks or entities.
from pyjedai.block_cleaning import BlockFiltering
bf = BlockFiltering(ratio=0.8)
filtered_blocks = bf.process(blocks, data, tqdm_disable=False)
_ = bf.evaluate(filtered_blocks)
***************************************************************************************************************************
Μethod: Block Filtering
***************************************************************************************************************************
Method name: Block Filtering
Parameters:
    Ratio: 0.8
Runtime: 0.0600 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
    Precision: 1.90%
    Recall: 94.42%
    F1-score: 3.72%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
{'Precision %': 1.8975048558195131, 'Recall %': 94.42379182156134, 'F1 %': 3.7202489930428415, 'True Positives': 1016, 'False Positives': 52528, 'True Negatives': 1105188, 'False Negatives': 60}
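As an illustration of how a ratio parameter can drive an entity-level cleaning step, the sketch below keeps every entity only in the given share of its smallest blocks, which is the usual description of Block Filtering. It is a simplified sketch and not pyJedAI's code: `block_filtering`, the rounding rule, the flat block representation and the toy data (with a ratio of 0.5 rather than 0.8) are assumptions.

```python
import math
from collections import defaultdict


def block_filtering(blocks, ratio=0.8):
    """Keep each entity only in the `ratio` share of its smallest blocks."""
    # Rank block keys from the smallest to the largest block (ties by key).
    ranked_keys = sorted(blocks, key=lambda key: (len(blocks[key]), key))
    entity_to_keys = defaultdict(list)
    for key in ranked_keys:
        for entity in blocks[key]:
            entity_to_keys[entity].append(key)

    filtered = defaultdict(set)
    for entity, keys in entity_to_keys.items():
        # Round half up, and keep at least one block (assumed rounding rule).
        keep = max(1, math.floor(ratio * len(keys) + 0.5))
        for key in keys[:keep]:
            filtered[key].add(entity)
    # A block left with a single entity yields no comparison, so drop it.
    return {key: ids for key, ids in filtered.items() if len(ids) > 1}


# Hypothetical toy blocks; "a*" ids come from one source and "b*" from the other.
toy_blocks = {
    "sony": {"a0", "a1", "a2", "b0", "b1"},
    "turntable": {"a0", "b0"},
    "pslx350h": {"a0", "b0"},
    "bose": {"a1", "b1"},
}

kept_blocks = block_filtering(toy_blocks, ratio=0.5)
print(sorted(kept_blocks))
```

The large, unspecific `sony` block disappears because every entity in it also belongs to smaller, more distinctive blocks.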
___Optional step___
Similar to Block Cleaning, Comparison Cleaning aims to rid a set of blocks of both redundant and superfluous comparisons. Unlike Block Cleaning, its methods operate on the finer granularity of individual comparisons.
The currently supported methods are the ones imported from pyjedai.comparison_cleaning further below.
Most of these methods are Meta-blocking techniques. All methods are optional but competitive, in the sense that only one of them can be part of an ER workflow. For more details on the functionality of these methods, see here. They can be combined with a weighting scheme; the example below uses 'X2'.
from pyjedai.block_cleaning import BlockPurging
cbbp = BlockPurging()
cleaned_blocks = cbbp.process(filtered_blocks, data, tqdm_disable=False)
cbbp.report()
Method name: Block Purging
Method info: Discards the blocks exceeding a certain number of comparisons.
Parameters:
    Smoothing factor: 1.025
    Max Comparisons per Block: 600.0
Runtime: 0.0630 seconds
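The rule stated in the report, discarding blocks that exceed a certain number of comparisons, can be sketched as follows. The report suggests that the library derives the limit itself (it prints a smoothing factor and a resulting maximum); this simplified version takes the limit as an explicit argument, and `block_purging` and the toy blocks are hypothetical.

```python
def comparisons_in_block(ids_1, ids_2):
    """Clean-Clean ER: each entity of source 1 meets each entity of source 2."""
    return len(ids_1) * len(ids_2)


def block_purging(blocks, max_comparisons):
    """Discard every block that needs more than `max_comparisons` comparisons."""
    return {
        key: (ids_1, ids_2)
        for key, (ids_1, ids_2) in blocks.items()
        if comparisons_in_block(ids_1, ids_2) <= max_comparisons
    }


# Hypothetical toy blocks: key -> (ids from source 1, ids from source 2).
purging_input = {
    "sony": ({0, 1, 2}, {0, 1, 2, 3}),  # 12 comparisons
    "pslx350h": ({0}, {0}),             # 1 comparison
}

purged_blocks = block_purging(purging_input, max_comparisons=4)
print(sorted(purged_blocks))
```

Oversized blocks tend to stem from very frequent keys, so they add many comparisons while contributing few matches that smaller blocks do not already cover.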
_ = cbbp.evaluate(cleaned_blocks)
***************************************************************************************************************************
Μethod: Block Purging
***************************************************************************************************************************
Method name: Block Purging
Parameters:
    Smoothing factor: 1.025
    Max Comparisons per Block: 600.0
Runtime: 0.0630 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
    Precision: 1.90%
    Recall: 94.42%
    F1-score: 3.72%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
from pyjedai.comparison_cleaning import (
WeightedEdgePruning,
WeightedNodePruning,
CardinalityEdgePruning,
CardinalityNodePruning,
BLAST,
ReciprocalCardinalityNodePruning,
ReciprocalWeightedNodePruning,
ComparisonPropagation
)
mb = CardinalityEdgePruning(weighting_scheme='X2')
candidate_pairs_blocks = mb.process(filtered_blocks, data, tqdm_disable=True)
_ = mb.evaluate(candidate_pairs_blocks)
***************************************************************************************************************************
Μethod: Cardinality Edge Pruning
***************************************************************************************************************************
Method name: Cardinality Edge Pruning
Parameters:
    Node centric: False
    Weighting scheme: X2
Runtime: 1.4746 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
    Precision: 13.19%
    Recall: 85.97%
    F1-score: 22.86%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
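The general shape of an edge-centric, cardinality-based pruning step can be sketched in plain Python: weight every candidate pair, then keep only the top-weighted ones. This is a simplified illustration, not pyJedAI's implementation. It swaps the X2 scheme used above for a plain count of shared blocks, and `cardinality_edge_pruning`, the explicit `k` argument and the toy blocks are assumptions.

```python
from collections import Counter
from itertools import product


def cardinality_edge_pruning(blocks, k):
    """Keep the k candidate pairs that co-occur in the largest number of blocks."""
    weights = Counter()
    for ids_1, ids_2 in blocks.values():
        for pair in product(sorted(ids_1), sorted(ids_2)):
            # Stand-in weight: number of blocks the two entities share.
            weights[pair] += 1
    # Highest weight first; ties are broken by the pair itself for determinism.
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Hypothetical toy blocks: key -> (ids from source 1, ids from source 2).
pruning_input = {
    "sony": ({0, 1}, {0, 1}),
    "turntable": ({0}, {0}),
    "pslx350h": ({0}, {0}),
    "bose": ({1}, {1}),
}

top_pairs = cardinality_edge_pruning(pruning_input, k=2)
print(top_pairs)
```

Counting each pair once per shared block also removes redundant comparisons: a pair that appears in several blocks is weighted once and compared at most once.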
Entity Matching compares pairs of entity profiles, associating every pair with a similarity in [0,1]. Its output is the similarity graph, i.e., an undirected, weighted graph where the nodes correspond to entities and the edges connect pairs of compared entities.
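A similarity graph of this kind can be sketched without any library. The example below scores candidate pairs with the Dice coefficient over whitespace tokens and keeps the pairs above a threshold as weighted edges. It is an assumption-laden illustration: the set-based Dice variant, the strict `>` comparison, the helper names and the toy profiles are not taken from pyJedAI.

```python
def dice_similarity(text_1, text_2):
    """Dice coefficient over the sets of lower-cased whitespace tokens."""
    tokens_1 = set(text_1.lower().split())
    tokens_2 = set(text_2.lower().split())
    if not tokens_1 and not tokens_2:
        return 0.0
    return 2 * len(tokens_1 & tokens_2) / (len(tokens_1) + len(tokens_2))


def build_similarity_graph(candidate_pairs, profiles_1, profiles_2, threshold=0.5):
    """Return the weighted edges {(id_1, id_2): similarity} above the threshold."""
    edges = {}
    for id_1, id_2 in candidate_pairs:
        similarity = dice_similarity(profiles_1[id_1], profiles_2[id_2])
        if similarity > threshold:  # strict comparison is an assumption
            edges[(id_1, id_2)] = similarity
    return edges


# Hypothetical toy profiles and candidate pairs.
matching_abt = {0: "Sony Turntable PSLX350H", 1: "Bose Acoustimass Speaker System"}
matching_buy = {0: "Sony PSLX350H Belt Drive Turntable", 1: "Bose Wave Music System II"}
toy_candidates = [(0, 0), (1, 1), (0, 1)]

similarity_edges = build_similarity_graph(toy_candidates, matching_abt, matching_buy)
print(similarity_edges)
```

Only the Sony pair clears the 0.5 threshold here. Raising the threshold trades recall for precision, which is the trade-off to keep in mind when reading the evaluation of this step.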
from pyjedai.matching import EntityMatching
EM = EntityMatching(
    metric='dice',
    similarity_threshold=0.5,
    attributes=['description', 'name']
)
pairs_graph = EM.predict(candidate_pairs_blocks, data, tqdm_disable=True)
draw(pairs_graph)
_ = EM.evaluate(pairs_graph)
***************************************************************************************************************************
Μethod: Entity Matching
***************************************************************************************************************************
Method name: Entity Matching
Parameters:
    Tokenizer: white_space_tokenizer
    Metric: dice
    Similarity Threshold: 0.5
Runtime: 2.2469 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
    Precision: 97.14%
    Recall: 3.16%
    F1-score: 6.12%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Entity Clustering takes as input the similarity graph produced by Entity Matching and partitions it into a set of equivalence clusters, with every cluster corresponding to a distinct real-world object.
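One simple way to partition a similarity graph is to take its connected components, which is the transitive closure of the matched pairs. The sketch below does this with a small union-find structure; it is an illustration rather than the library's code, and the function name, the node labels and the toy edges are hypothetical.

```python
def connected_components(edges):
    """Partition the nodes of an undirected graph into connected components."""
    parent = {}

    def find(node):
        # Register unseen nodes, then walk to the root while halving the path.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for node_a, node_b in edges:
        root_a, root_b = find(node_a), find(node_b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return sorted(sorted(cluster) for cluster in clusters.values())


# Hypothetical matched pairs; the labels encode the source of each entity.
toy_edges = [("abt:0", "buy:0"), ("abt:1", "buy:7"), ("abt:2", "buy:7")]

toy_clusters = connected_components(toy_edges)
print(toy_clusters)
```

Note that the transitive closure puts `abt:1` and `abt:2` in the same cluster because both match `buy:7`, even though a Clean-Clean setting expects at most one entity per source in each cluster.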
from pyjedai.clustering import ConnectedComponentsClustering
ccc = ConnectedComponentsClustering()
clusters = ccc.process(pairs_graph, data)
ccc.report()
Method name: Connected Components Clustering
Method info: Gets equivalence clusters from the transitive closure of the similarity graph.
Parameters: None
Runtime: 0.0010 seconds
_ = ccc.evaluate(clusters)
***************************************************************************************************************************
Μethod: Connected Components Clustering
***************************************************************************************************************************
Method name: Connected Components Clustering
Parameters:
Runtime: 0.0010 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
    Precision: 97.14%
    Recall: 3.16%
    F1-score: 6.12%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────