In this notebook we present the pyJedAI approach. pyJedAI is an upcoming, end-to-end Python framework for Entity Resolution (ER) that aims to serve as a manual for the field. It is designed to be easy to use and highly optimized, since it builds on established Python libraries (e.g. pandas, networkX), with the goal of outperforming other state-of-the-art ER frameworks.
pyJedAI is an open-source library that can be installed from PyPI.
For more: pypi.org/project/pyjedai/
!pip install pyjedai -U
!pip show pyjedai
Name: pyjedai
Version: 0.0.5
Summary: An open-source library that builds powerful end-to-end Entity Resolution workflows.
Home-page:
Author:
Author-email: Konstantinos Nikoletos <nikoletos.kon@gmail.com>, George Papadakis <gpapadis84@gmail.com>
License: Apache Software License 2.0
Location: c:\users\nikol\anaconda3\lib\site-packages
Requires: numpy, rdfpandas, pandocfilters, pandas, seaborn, networkx, PyYAML, strsim, gensim, optuna, transformers, nltk, matplotlib-inline, tqdm, tomli, pandas-profiling, regex, matplotlib, sentence-transformers, scipy, strsimpy, faiss-cpu, rdflib
Required-by:
To run, pyJedAI only needs the initial data to be transformed into a pandas DataFrame. Hence, pyJedAI can work with any structured or semi-structured data. In this case, the Abt-Buy dataset is provided as .csv files.
The Data module offers a number of options:
import pandas as pd
from pyjedai.datamodel import Data
d1 = pd.read_csv("./../data/ccer/D2/abt.csv", sep='|', engine='python', na_filter=False).astype(str)
d2 = pd.read_csv("./../data/ccer/D2/buy.csv", sep='|', engine='python', na_filter=False).astype(str)
gt = pd.read_csv("./../data/ccer/D2/gt.csv", sep='|', engine='python')
data = Data(
    dataset_1=d1,
    attributes_1=['id','name','description'],
    id_column_name_1='id',
    dataset_2=d2,
    attributes_2=['id','name','description'],
    id_column_name_2='id',
    ground_truth=gt,
)
data.print_specs()
------------------------- Data -------------------------
Type of Entity Resolution: Clean-Clean
Dataset-1:
    Number of entities: 1076
    Number of NaN values: 0
    Attributes: ['id', 'name', 'description']
Dataset-2:
    Number of entities: 1076
    Number of NaN values: 0
    Attributes: ['name', 'description', 'price']
Total number of entities: 2152
Number of matching pairs in ground-truth: 1076
--------------------------------------------------------
data.dataset_1.head(2)
| | id | name | description | price |
|---|---|---|---|---|
| 0 | 0 | Sony Turntable - PSLX350H | Sony Turntable - PSLX350H/ Belt Drive System/ ... | |
| 1 | 1 | Bose Acoustimass 5 Series III Speaker System -... | Bose Acoustimass 5 Series III Speaker System -... | 399 |
data.dataset_2.head(2)
| | id | name | description | price |
|---|---|---|---|---|
| 0 | 0 | Linksys EtherFast EZXS88W Ethernet Switch - EZ... | Linksys EtherFast 8-Port 10/100 Switch (New/Wo... | |
| 1 | 1 | Linksys EtherFast EZXS55W Ethernet Switch | 5 x 10/100Base-TX LAN | |
data.ground_truth.head(2)
| | D1 | D2 |
|---|---|---|
| 0 | 206 | 216 |
| 1 | 60 | 46 |
Multiple algorithms, techniques and features have already been implemented, so we can import the methods we need and proceed to the workflow architecture.
For example, we demonstrate a variety of algorithms for each step, as shown in the cell below.
from pyjedai.workflow import WorkFlow, compare_workflows
from pyjedai.block_building import (
    StandardBlocking, QGramsBlocking, ExtendedQGramsBlocking,
    SuffixArraysBlocking, ExtendedSuffixArraysBlocking
)
from pyjedai.block_cleaning import BlockFiltering, BlockPurging
from pyjedai.comparison_cleaning import (
    WeightedEdgePruning, WeightedNodePruning, CardinalityEdgePruning,
    CardinalityNodePruning, BLAST, ReciprocalCardinalityNodePruning,
    ReciprocalWeightedNodePruning, ComparisonPropagation
)
from pyjedai.matching import EntityMatching
from pyjedai.clustering import ConnectedComponentsClustering
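To give a feel for what a block-building method does, here is a minimal pure-Python sketch of character q-gram blocking for two clean collections. It is an illustration of the general technique, not pyJedAI's implementation: the helper names `qgrams` and `build_blocks`, the handling of short tokens, and the toy records are all assumptions made for this example.

```python
from collections import defaultdict


def qgrams(text, q=3):
    """Return the set of character q-grams of every whitespace token in text."""
    grams = set()
    for token in text.lower().split():
        if len(token) <= q:
            grams.add(token)  # assumption: short tokens are kept whole
        else:
            grams.update(token[i:i + q] for i in range(len(token) - q + 1))
    return grams


def build_blocks(d1, d2, q=3):
    """Group entity ids from two clean collections by shared q-gram.

    d1 and d2 map an entity id to its text. A block is kept only if it
    holds at least one entity from each side, because only cross-source
    pairs are candidate comparisons in Clean-Clean ER.
    """
    blocks = defaultdict(lambda: (set(), set()))
    for side, collection in enumerate((d1, d2)):
        for entity_id, text in collection.items():
            for gram in qgrams(text, q):
                blocks[gram][side].add(entity_id)
    return {key: pair for key, pair in blocks.items() if pair[0] and pair[1]}


# Hypothetical toy records, loosely styled after the product names above.
d1 = {0: "Sony Turntable", 1: "Bose Speaker"}
d2 = {0: "Sony Belt Turntable", 1: "Linksys Switch"}

blocks = build_blocks(d1, d2)
# Every cross-source pair that shares at least one block is a candidate.
candidates = {(a, b) for left, right in blocks.values() for a in left for b in right}
print(sorted(candidates))
```

Only the two Sony records share any 3-gram here, so blocking reduces the four possible cross-source pairs to a single candidate comparison.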
The main workflow that pyJedAI supports consists of 8 steps.
For this demo, we created a simple architecture, shown below:
w = WorkFlow(
    block_building = dict(
        method=QGramsBlocking,
        params=dict(qgrams=3)
    ),
    block_cleaning = [
        dict(
            method=BlockFiltering,
            params=dict(ratio=0.8)
        ),
        dict(
            method=BlockPurging,
            params=dict(smoothing_factor=1.025)
        )
    ],
    comparison_cleaning = dict(method=CardinalityEdgePruning),
    entity_matching = dict(
        method=EntityMatching,
        metric='sorensen_dice',
        similarity_threshold=0.5,
        attributes = ['description', 'name']
    ),
    clustering = dict(method=ConnectedComponentsClustering),
    name="Worflow-QGramsBlocking"
)
w.run(data, workflow_tqdm_enable=True, verbose=False)
Worflow-QGramsBlocking: 0%| | 0/5 [00:00<?, ?it/s]
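The block-cleaning step in the workflow above uses Block Filtering with `ratio=0.8`. The general idea is that each entity stays only in the smallest share of the blocks it appears in, since small blocks tend to be more distinctive. The sketch below is a simplified pure-Python reading of that idea, not pyJedAI's code: the function name `block_filtering`, the rounding rule and the toy blocks are assumptions.

```python
import math


def block_filtering(blocks, ratio=0.8):
    """Keep each entity only in the smallest `ratio` share of its blocks.

    blocks maps a key to a pair (ids from dataset 1, ids from dataset 2).
    """
    # Visit blocks from fewest to most comparisons.
    order = sorted(blocks, key=lambda k: len(blocks[k][0]) * len(blocks[k][1]))

    # Count how many blocks each (side, id) entity appears in.
    counts = {}
    for key in order:
        for side in (0, 1):
            for entity_id in blocks[key][side]:
                counts[(side, entity_id)] = counts.get((side, entity_id), 0) + 1

    # Assumption: round half up, and always allow at least one block.
    limits = {e: max(1, math.floor(ratio * n + 0.5)) for e, n in counts.items()}

    used = {entity: 0 for entity in counts}
    filtered = {}
    for key in order:
        kept = (set(), set())
        for side in (0, 1):
            for entity_id in blocks[key][side]:
                entity = (side, entity_id)
                if used[entity] < limits[entity]:
                    used[entity] += 1
                    kept[side].add(entity_id)
        if kept[0] and kept[1]:  # a block needs both sides to yield comparisons
            filtered[key] = kept
    return filtered


def comparisons(blocks):
    """Total number of cross-source pairs over all blocks."""
    return sum(len(left) * len(right) for left, right in blocks.values())


# Hypothetical blocks: two distinctive keys and one oversized, generic key.
toy_blocks = {
    "son": ({0}, {0}),
    "tur": ({0}, {0}),
    "the": ({0, 1, 2}, {0, 1, 2}),
}
filtered = block_filtering(toy_blocks, ratio=0.8)
print(comparisons(toy_blocks), "->", comparisons(filtered))
```

In this toy case entity 0 on each side appears in three blocks and is allowed to stay in two, so it is dropped from the large "the" block while remaining in the two small ones.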
w.to_df()
| | Algorithm | F1 | Recall | Precision | Runtime (sec) | Params |
|---|---|---|---|---|---|---|
| 0 | Q-Grams Blocking | 0.041558 | 100.000000 | 0.020784 | 0.467047 | {'Q-Gramms': 3} |
| 1 | Block Filtering | 0.063367 | 100.000000 | 0.031694 | 0.299006 | {'Ratio': 0.8} |
| 2 | Block Purging | 0.072956 | 100.000000 | 0.036491 | 0.026966 | {'Smoothing factor': 1.025, 'Max Comparisons p... |
| 3 | Cardinality Edge Pruning | 3.554522 | 98.048327 | 1.810071 | 8.116037 | {'Node centric': False, 'Weighting scheme': 'JS'} |
| 4 | Entity Matching | 3.807797 | 1.951673 | 77.777778 | 26.423726 | {'Tokenizer': 'white_space_tokenizer', 'Metric... |
| 5 | Connected Components Clustering | 3.463993 | 1.765799 | 90.476190 | 0.000000 | {} |
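The F1, Recall and Precision columns are percentages, and F1 is the harmonic mean of the other two. As a sanity check on the last row, the sketch below recomputes the three scores from pair counts. The counts used (19 correct pairs out of 21 emitted) do not appear in the notebook output; they are inferred here from the reported percentages and the 1076 ground-truth pairs, so treat them as an assumption.

```python
def scores(true_positives, emitted_pairs, ground_truth_pairs):
    """Return (precision, recall, F1) as percentages."""
    precision = true_positives / emitted_pairs
    recall = true_positives / ground_truth_pairs
    f1 = 2 * precision * recall / (precision + recall)
    return 100 * precision, 100 * recall, 100 * f1


# Assumed counts, inferred from the reported percentages:
# 19 correct pairs out of 21 emitted, against 1076 ground-truth pairs.
precision, recall, f1 = scores(19, 21, 1076)
print(f"precision={precision:.6f} recall={recall:.6f} f1={f1:.6f}")
```

Under these counts the three values agree with the Connected Components Clustering row to six decimals, which also shows why F1 stays low: high precision cannot compensate for a recall below 2%.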
w.visualize()
w.visualize(separate=True)
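The last two steps of the workflow above score candidate pairs with the `sorensen_dice` metric at a threshold of 0.5 and then cluster the matches into connected components. The following pure-Python sketch shows both ideas on toy data. It is not pyJedAI's implementation: the helper names, the use of lower-cased whitespace tokens, the `>=` comparison against the threshold and the sample records are assumptions.

```python
def dice(a, b):
    """Sorensen-Dice coefficient between the whitespace-token sets of two strings."""
    x, y = set(a.lower().split()), set(b.lower().split())
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def connected_components(edges):
    """Group nodes linked by matching edges, using union-find."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for left, right in edges:
        parent[find(left)] = find(right)

    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())


# Hypothetical records; ids are prefixed to keep the two sources apart.
d1 = {"a0": "sony turntable pslx350h", "a1": "bose acoustimass speaker system"}
d2 = {"b0": "sony pslx350h belt drive turntable", "b1": "linksys ethernet switch"}

threshold = 0.5
matches = [
    (i, j)
    for i, left in d1.items()
    for j, right in d2.items()
    if dice(left, right) >= threshold
]
clusters = connected_components(matches)
print(matches, clusters)
```

The two Sony descriptions share three of their eight tokens in total, giving a score of 0.75, so they are the only pair that passes the threshold and they form a single cluster.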
pyJedAI provides methods for comparing multiple workflows. For example, we can test the above example with all the Block Building methods provided.
block_building_methods = [
    StandardBlocking, QGramsBlocking, ExtendedQGramsBlocking,
    SuffixArraysBlocking, ExtendedSuffixArraysBlocking
]
workflows = []
for bbm in block_building_methods:
    workflows.append(WorkFlow(
        block_building = dict(
            method=bbm,
        ),
        block_cleaning = [
            dict(
                method=BlockFiltering,
                params=dict(ratio=0.8)
            ),
            dict(
                method=BlockPurging,
                params=dict(smoothing_factor=1.025)
            )
        ],
        comparison_cleaning = dict(method=CardinalityEdgePruning),
        entity_matching = dict(
            method=EntityMatching,
            metric='sorensen_dice',
            similarity_threshold=0.5,
            attributes = ['description', 'name']
        ),
        clustering = dict(method=ConnectedComponentsClustering),
        name="Workflow-"+str(bbm.__name__)
    ))
    workflows[-1].run(data, workflow_tqdm_enable=True)
Workflow-StandardBlocking: 0%| | 0/5 [00:00<?, ?it/s]
Workflow-QGramsBlocking: 0%| | 0/5 [00:00<?, ?it/s]
Workflow-ExtendedQGramsBlocking: 0%| | 0/5 [00:00<?, ?it/s]
Workflow-SuffixArraysBlocking: 0%| | 0/5 [00:00<?, ?it/s]
Workflow-ExtendedSuffixArraysBlocking: 0%| | 0/5 [00:00<?, ?it/s]
compare_workflows(workflows, with_visualization=True)
| | Name | F1 | Recall | Precision | Runtime (sec) |
|---|---|---|---|---|---|
| 0 | Workflow-StandardBlocking | 3.463993 | 1.765799 | 90.47619 | 8.139925 |
| 1 | Workflow-QGramsBlocking | 3.463993 | 1.765799 | 90.47619 | 14.409635 |
| 2 | Workflow-ExtendedQGramsBlocking | 3.463993 | 1.765799 | 90.47619 | 15.853234 |
| 3 | Workflow-SuffixArraysBlocking | 3.463993 | 1.765799 | 90.47619 | 7.863116 |
| 4 | Workflow-ExtendedSuffixArraysBlocking | 3.467153 | 1.765799 | 95.00000 | 12.269665 |
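One way to read a comparison like this is to rank the workflows by F1 and break ties by runtime. The sketch below does that in plain Python, with the F1 and runtime values copied by hand from the table above. It does not use the object returned by `compare_workflows`, and the variable names are chosen for this example only.

```python
# F1 (%) and runtime (sec), copied from the comparison table above.
results = [
    {"name": "Workflow-StandardBlocking", "f1": 3.463993, "runtime": 8.139925},
    {"name": "Workflow-QGramsBlocking", "f1": 3.463993, "runtime": 14.409635},
    {"name": "Workflow-ExtendedQGramsBlocking", "f1": 3.463993, "runtime": 15.853234},
    {"name": "Workflow-SuffixArraysBlocking", "f1": 3.463993, "runtime": 7.863116},
    {"name": "Workflow-ExtendedSuffixArraysBlocking", "f1": 3.467153, "runtime": 12.269665},
]

# Highest F1 first; among equal F1 scores, the faster workflow wins.
ranking = sorted(results, key=lambda row: (-row["f1"], row["runtime"]))
best = ranking[0]
fastest = min(results, key=lambda row: row["runtime"])

for row in ranking:
    print(f'{row["name"]:<40} F1={row["f1"]:.6f} runtime={row["runtime"]:.2f}s')
```

On these numbers the F1 differences are tiny, so runtime is the more informative column: the suffix-array and standard blocking workflows finish in roughly half the time of the q-gram variants.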