In this notebook we present the pyJedAI approach on the well-known Cora dataset. Dirty ER is the process of deduplicating a single set of entities.
pyJedAI is an open-source library that can be installed from PyPI.
For more: pypi.org/project/pyjedai/
!python --version
Python 3.8.17
!pip install pyjedai -U
!pip show pyjedai
Name: pyjedai
Version: 0.1.0
Summary: An open-source library that builds powerful end-to-end Entity Resolution workflows.
Home-page:
Author:
Author-email: Konstantinos Nikoletos <nikoletos.kon@gmail.com>, George Papadakis <gpapadis84@gmail.com>, Jakub Maciejewski <jacobb.maciejewski@gmail.com>, Manolis Koubarakis <koubarak@di.uoa.gr>
License: Apache Software License 2.0
Location: /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages
Requires: faiss-cpu, gensim, matplotlib, matplotlib-inline, networkx, nltk, numpy, optuna, ordered-set, pandas, pandas-profiling, pandocfilters, plotly, py-stringmatching, PyYAML, rdflib, rdfpandas, regex, scipy, seaborn, sentence-transformers, strsim, strsimpy, tomli, tqdm, transformers, valentine
Required-by:
Imports
import os
import sys
import pandas as pd
import networkx
from networkx import draw, Graph
from pyjedai.utils import print_clusters, print_blocks, print_candidate_pairs
from pyjedai.evaluation import Evaluation
[nltk_data] Downloading package stopwords to /home/jm/nltk_data... [nltk_data] Package stopwords is already up-to-date!
To operate, pyJedAI only needs the initial data transformed into a pandas DataFrame. Hence, pyJedAI can work with any structured or semi-structured data. In this case, the Cora dataset is provided as .csv files.
The Data module offers a number of options.
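As a minimal sketch of that transformation step, the snippet below flattens semi-structured records into a DataFrame with plain pandas. The records and their field names are hypothetical and are not taken from the dataset used in this notebook.

```python
import pandas as pd

# Hypothetical semi-structured records: fields differ per record and one is nested.
toy_records = [
    {"id": 0, "author": "a. blum", "title": "learning boolean functions"},
    {"id": 1, "title": "learning functions", "venue": {"name": "colt"}},
]

# json_normalize flattens nested fields into dotted column names
# and leaves NaN where a record lacks an attribute.
toy_df = pd.json_normalize(toy_records)

# Replace missing values with empty strings so every cell is comparable text.
toy_df = toy_df.fillna("")

print(toy_df.columns.tolist())
```

A DataFrame like `toy_df` is then the kind of object passed to the data module in the cells that follow.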
from pyjedai.datamodel import Data
d1 = pd.read_csv("./../data/der/cora/cora.csv", sep='|')
gt = pd.read_csv("./../data/der/cora/cora_gt.csv", sep='|', header=None)
attr = ['Entity Id','author', 'title']
Data is the connecting module of all steps of the workflow
data = Data(
dataset_1=d1,
id_column_name_1='Entity Id',
ground_truth=gt,
attributes_1=attr
)
It clusters entities into overlapping blocks in a lazy manner that relies on unsupervised blocking keys: every token in an attribute value forms a key. Blocks are then extracted, possibly using a transformation, based on its equality or on its similarity with other keys.
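The idea that every token in an attribute value forms a blocking key can be illustrated with a small pure-Python sketch. This is not pyJedAI's implementation; `token_blocking` and the toy profiles are hypothetical names used only for illustration.

```python
from collections import defaultdict


def token_blocking(profiles):
    """Map every whitespace token to the set of entity ids whose values contain it."""
    blocks_by_key = defaultdict(set)
    for entity_id, attributes in profiles.items():
        for value in attributes.values():
            for token in str(value).lower().split():
                blocks_by_key[token].add(entity_id)
    # A block holding a single entity yields no comparisons, so it is dropped.
    return {key: ids for key, ids in blocks_by_key.items() if len(ids) > 1}


toy_profiles = {
    0: {"author": "a. blum", "title": "learning boolean functions"},
    1: {"author": "avrim blum", "title": "learning functions"},
    2: {"author": "j. smith", "title": "graph theory"},
}
toy_token_blocks = token_blocking(toy_profiles)
print(toy_token_blocks)
```

Entities 0 and 1 end up together in three overlapping blocks, while entity 2 shares no token with the others and appears in none.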
The following methods are currently supported:
from pyjedai.block_building import (
StandardBlocking,
QGramsBlocking,
SuffixArraysBlocking,
ExtendedSuffixArraysBlocking,
ExtendedQGramsBlocking
)
bb = SuffixArraysBlocking(suffix_length=2)
blocks = bb.build_blocks(data)
Suffix Arrays Blocking: 0%| | 0/1295 [00:00<?, ?it/s]
_ = bb.evaluate(blocks)
Method: Suffix Arrays Blocking
Method name: Suffix Arrays Blocking
Parameters:
    Suffix length: 2
    Maximum Block Size: 53
Runtime: 1.3212 seconds
Performance:
    Precision: 4.28%
    Recall: 75.77%
    F1-score: 8.10%
from pyjedai.block_cleaning import BlockPurging
bp = BlockPurging()
cleaned_blocks = bp.process(blocks, data, tqdm_disable=False)
Block Purging: 0%| | 0/3420 [00:00<?, ?it/s]
bp.report()
Method name: Block Purging
Method info: Discards the blocks exceeding a certain number of comparisons.
Parameters:
    Smoothing factor: 1.025
    Max Comparisons per Block: 1378.0
Runtime: 0.0678 seconds
_ = bp.evaluate(cleaned_blocks)
Method: Block Purging
Method name: Block Purging
Parameters:
    Smoothing factor: 1.025
    Max Comparisons per Block: 1378.0
Runtime: 0.0678 seconds
Performance:
    Precision: 4.28%
    Recall: 75.77%
    F1-score: 8.10%
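The purging rule reported above, discarding blocks that exceed a certain number of comparisons, can be sketched as follows. The sketch assumes that a Dirty ER block of n entities implies n(n-1)/2 comparisons; under that assumption the reported maximum block size of 53 gives exactly the reported limit of 1378 comparisons. How pyJedAI derives its limit from the smoothing factor is not shown here, and `purge_blocks` is a hypothetical helper.

```python
def comparisons_in_block(block_size):
    """Number of distinct entity pairs inside one Dirty ER block."""
    return block_size * (block_size - 1) // 2


def purge_blocks(blocks_by_key, max_comparisons):
    """Keep only the blocks whose pair count does not exceed the limit."""
    return {
        key: ids
        for key, ids in blocks_by_key.items()
        if comparisons_in_block(len(ids)) <= max_comparisons
    }


toy_purge_input = {
    "a": {1, 2},              # 1 comparison
    "b": {1, 2, 3, 4},        # 6 comparisons
    "c": set(range(10)),      # 45 comparisons
}
toy_purged = purge_blocks(toy_purge_input, max_comparisons=6)

print(comparisons_in_block(53))  # 1378
print(sorted(toy_purged))        # ['a', 'b']
```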
___Optional step___
Its goal is to clean a set of overlapping blocks from unnecessary comparisons, which can be either redundant (i.e., repeated comparisons that have already been executed in a previously examined block) or superfluous (i.e., comparisons that involve non-matching entities). Its methods operate on the coarse level of individual blocks or entities.
from pyjedai.block_cleaning import BlockFiltering
bc = BlockFiltering(ratio=0.9)
blocks = bc.process(cleaned_blocks, data)
Block Filtering: 0%| | 0/3 [00:00<?, ?it/s]
_ = bc.evaluate(blocks)
Method: Block Filtering
Method name: Block Filtering
Parameters:
    Ratio: 0.9
Runtime: 0.3659 seconds
Performance:
    Precision: 5.08%
    Recall: 73.73%
    F1-score: 9.51%
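One common reading of the `ratio` parameter is that each entity is retained only in that fraction of its smallest blocks. The sketch below follows that reading as an assumption; the rounding rule (`math.ceil`, with at least one block kept per entity) and the helper name `filter_blocks` are illustrative choices, not pyJedAI's code.

```python
import math
from collections import defaultdict


def filter_blocks(blocks_by_key, ratio):
    """Keep every entity only in the given fraction of its smallest blocks."""
    # Smallest blocks first; ties are broken by key to stay deterministic.
    ordered_keys = sorted(blocks_by_key, key=lambda k: (len(blocks_by_key[k]), k))

    # For each entity, list its blocks from smallest to largest.
    blocks_of_entity = defaultdict(list)
    for key in ordered_keys:
        for entity_id in blocks_by_key[key]:
            blocks_of_entity[entity_id].append(key)

    kept = defaultdict(set)
    for entity_id, keys in blocks_of_entity.items():
        limit = max(1, math.ceil(ratio * len(keys)))  # assumed rounding rule
        for key in keys[:limit]:
            kept[key].add(entity_id)

    # Blocks left with fewer than two entities produce no comparisons.
    return {key: ids for key, ids in kept.items() if len(ids) > 1}


toy_filter_input = {"x": {1, 2}, "y": {1, 2, 3}, "z": {1, 2, 3, 4}}
toy_filtered = filter_blocks(toy_filter_input, ratio=0.5)
print(toy_filtered)
```

In this toy run the largest block `z` disappears entirely, because only entity 4 still belongs to it after filtering.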
___Optional step___
Similar to Block Cleaning, this step aims to clean a set of blocks from both redundant and superfluous comparisons. Unlike Block Cleaning, its methods operate on the finer granularity of individual comparisons.
The following methods are currently supported:
Most of these methods are Meta-blocking techniques. All methods are optional but competitive, in the sense that only one of them can be part of an ER workflow. For more details on the functionality of these methods, see here. They can be combined with one of the following weighting schemes:
from pyjedai.comparison_cleaning import (
WeightedEdgePruning,
WeightedNodePruning,
CardinalityEdgePruning,
CardinalityNodePruning,
BLAST,
ReciprocalCardinalityNodePruning,
ComparisonPropagation
)
mb = WeightedEdgePruning(weighting_scheme='CBS')
blocks = mb.process(blocks, data)
Weighted Edge Pruning: 0%| | 0/1295 [00:00<?, ?it/s]
_ = mb.evaluate(blocks)
Method: Weighted Edge Pruning
Method name: Weighted Edge Pruning
Parameters:
    Node centric: False
    Weighting scheme: CBS
Runtime: 1.2461 seconds
Performance:
    Precision: 74.46%
    Recall: 48.51%
    F1-score: 58.75%
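To make the comparison-level pruning above more concrete, here is a simplified sketch of edge pruning with a common-blocks weight. It assumes that the weight of a pair is the number of blocks the two entities share and that pairs below the mean weight are discarded; both points are assumptions about the general technique, and `weighted_edge_pruning_cbs` is a hypothetical helper rather than the library's implementation.

```python
from collections import Counter
from itertools import combinations


def weighted_edge_pruning_cbs(blocks_by_key):
    """Weight each pair by its number of shared blocks, then keep pairs at or above the mean."""
    weights = Counter()
    for ids in blocks_by_key.values():
        for pair in combinations(sorted(ids), 2):
            weights[pair] += 1
    if not weights:
        return {}
    mean_weight = sum(weights.values()) / len(weights)
    return {pair: w for pair, w in weights.items() if w >= mean_weight}


toy_wep_input = {
    "blum": {0, 1},
    "learning": {0, 1, 2},
    "functions": {0, 1},
    "graph": {2, 3},
}
toy_wep_pairs = weighted_edge_pruning_cbs(toy_wep_input)
print(toy_wep_pairs)
```

Four pairs are candidates here, with weights 3, 1, 1 and 1; the mean is 1.5, so only the pair (0, 1) survives. This mirrors the jump in precision and drop in recall seen in the evaluation above.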
It compares pairs of entity profiles, associating every pair with a similarity in [0,1]. Its output comprises the similarity graph, i.e., an undirected, weighted graph where the nodes correspond to entities and the edges connect pairs of compared entities.
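As a small illustration of a similarity in [0,1], the sketch below computes Jaccard similarity over whitespace tokens, the metric and tokenizer named in the cells that follow. Lowercasing the text and returning 0.0 for two empty inputs are choices made for this sketch only; `token_jaccard` is a hypothetical helper.

```python
def token_jaccard(text_a, text_b):
    """Jaccard similarity of the whitespace-token sets of two strings."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


toy_similarity = token_jaccard("learning boolean functions", "Learning functions")
print(round(toy_similarity, 3))
```

The two titles share two of three distinct tokens, so the edge between them in the similarity graph would carry a weight of 2/3.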
from pyjedai.matching import EntityMatching
em = EntityMatching(
metric='jaccard',
similarity_threshold=0.0
)
pairs_graph = em.predict(blocks, data)
Entity Matching (jaccard, white_space_tokenizer): 0%| | 0/795 [00:00<?, ?it/s]
draw(pairs_graph)
_ = em.evaluate(pairs_graph)
Method: Entity Matching
Method name: Entity Matching
Parameters:
    Metric: jaccard
    Attributes: None
    Similarity threshold: 0.0
    Tokenizer: white_space_tokenizer
    Vectorizer: None
    Qgrams: 1
Runtime: 7.9103 seconds
Performance:
    Precision: 74.46%
    Recall: 48.51%
    F1-score: 58.75%
The similarity threshold can be configured with a grid search or with an Optuna search. pyJedAI also provides some visualizations of the score distributions.
For example with a classic histogram:
em.plot_distribution_of_all_weights()
Or grouped into bins of width 0.1 over the range 0.0 to 1.0:
em.plot_distribution_of_scores()
Distribution-% of predicted scores: [3.0010718113612005, 7.725973561986424, 13.844230082172205, 23.97284744551626, 18.854948195784207, 13.71025366202215, 8.145766345123258, 4.948195784208646, 3.858520900321544, 1.9381922115041088]
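A percentage list like the one above can be produced by counting scores in ten bins of width 0.1. The following is a sketch of that binning in plain Python; it may differ from what `plot_distribution_of_scores` does internally, and `score_distribution` is a hypothetical helper.

```python
def score_distribution(scores, n_bins=10):
    """Percentage of scores falling into each of n_bins equal-width bins over [0, 1]."""
    counts = [0] * n_bins
    for score in scores:
        # A score of exactly 1.0 is placed in the last bin.
        index = min(int(score * n_bins), n_bins - 1)
        counts[index] += 1
    total = len(scores)
    if total == 0:
        return [0.0] * n_bins
    return [100.0 * count / total for count in counts]


toy_scores = [0.05, 0.12, 0.18, 0.55, 1.0]
toy_distribution = score_distribution(toy_scores)
print(toy_distribution)
```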
It takes as input the similarity graph produced by Entity Matching and partitions it into a set of equivalence clusters, with every cluster corresponding to a distinct real-world object.
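The partitioning step can be sketched with a union-find structure: edges whose similarity reaches a threshold merge their endpoints, and each resulting component is one cluster. This is a generic illustration of connected components, not pyJedAI's code; treating the threshold as inclusive is an assumption, and `connected_components` is a hypothetical helper.

```python
def connected_components(edges, threshold):
    """Cluster nodes linked by edges with similarity >= threshold (assumed inclusive)."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for node_a, node_b, similarity in edges:
        # Both endpoints are registered even when the edge itself is dropped.
        root_a, root_b = find(node_a), find(node_b)
        if similarity >= threshold and root_a != root_b:
            parent[root_a] = root_b

    components = {}
    for node in list(parent):
        components.setdefault(find(node), set()).add(node)
    return sorted(sorted(members) for members in components.values())


toy_edges = [(0, 1, 0.8), (1, 2, 0.5), (2, 3, 0.1), (4, 5, 0.9)]
toy_clusters = connected_components(toy_edges, threshold=0.3)
print(toy_clusters)
```

The weak edge between 2 and 3 is ignored, so entity 3 ends up in a cluster of its own.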
from pyjedai.clustering import ConnectedComponentsClustering
ec = ConnectedComponentsClustering()
clusters = ec.process(pairs_graph, data, similarity_threshold=0.3)
_ = ec.evaluate(clusters)
Method: Connected Components Clustering
Method name: Connected Components Clustering
Parameters:
Runtime: 0.0894 seconds
Performance:
    Precision: 76.51%
    Recall: 51.54%
    F1-score: 61.59%
Data is the connecting module of all steps of the workflow
from pyjedai.datamodel import Data
d1 = pd.read_csv("./../data/der/cora/cora.csv", sep='|')
gt = pd.read_csv("./../data/der/cora/cora_gt.csv", sep='|', header=None)
attr = ['Entity Id','author', 'title']
data = Data(
dataset_1=d1,
id_column_name_1='Entity Id',
ground_truth=gt,
attributes_1=attr
)
from pyjedai.joins import EJoin, TopKJoin
join = EJoin(similarity_threshold = 0.5,
metric = 'jaccard',
tokenization = 'qgrams_multiset',
qgrams = 2)
g = join.fit(data)
EJoin (jaccard): 0%| | 0/2590 [00:00<?, ?it/s]
_ = join.evaluate(g)
Method: EJoin
Method name: EJoin
Parameters:
    similarity_threshold: 0.5
    metric: jaccard
    tokenization: qgrams_multiset
    qgrams: 2
Runtime: 49.7947 seconds
Performance:
    Precision: 65.80%
    Recall: 93.03%
    F1-score: 77.08%
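The join above compares entities on character 2-grams treated as a multiset. A rough sketch of that idea follows: repeated q-grams are counted, and Jaccard is taken as the sum of minimum counts over the sum of maximum counts. Whether pyJedAI's `qgrams_multiset` tokenization matches this exactly is an assumption, and both helper names are hypothetical.

```python
from collections import Counter


def char_qgrams(text, q=2):
    """All overlapping character q-grams of a string, repeats included."""
    return [text[i:i + q] for i in range(len(text) - q + 1)]


def multiset_jaccard(text_a, text_b, q=2):
    """Jaccard similarity over q-gram multisets (min counts / max counts)."""
    counts_a = Counter(char_qgrams(text_a, q))
    counts_b = Counter(char_qgrams(text_b, q))
    intersection = sum((counts_a & counts_b).values())
    union = sum((counts_a | counts_b).values())
    return intersection / union if union else 0.0


toy_join_similarity = multiset_jaccard("banana", "bandana")
print(round(toy_join_similarity, 3))
```

With a threshold of 0.5, as in the EJoin call above, this toy pair (4 shared q-grams out of 7) would be kept.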
topk_join = TopKJoin(K=20,
metric = 'jaccard',
tokenization = 'qgrams',
qgrams = 3)
g = topk_join.fit(data)
Top-K Join (jaccard): 0%| | 0/2590 [00:00<?, ?it/s]
draw(g)
topk_join.evaluate(g)
Method: Top-K Join
Method name: Top-K Join
Parameters:
    similarity_threshold: 0.25547445255474455
    K: 20
    metric: jaccard
    tokenization: qgrams
    qgrams: 3
Runtime: 33.9919 seconds
Performance:
    Precision: 58.34%
    Recall: 63.75%
    F1-score: 60.92%
{'Precision %': 58.340434597358325, 'Recall %': 63.74534450651769, 'F1 %': 60.923248053392655, 'True Positives': 10954, 'False Positives': 7822, 'True Negatives': 814451.0, 'False Negatives': 6230}
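The percentages in the dictionary above follow from its true-positive, false-positive and false-negative counts. The short check below recomputes them with the standard definitions of precision, recall and F1; `pair_metrics` is a hypothetical helper, not part of pyJedAI.

```python
def pair_metrics(true_positives, false_positives, false_negatives):
    """Precision, recall and F1 (as fractions) from pair-level counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Counts copied from the Top-K Join evaluation output above.
toy_precision, toy_recall, toy_f1 = pair_metrics(10954, 7822, 6230)
print(f"{100 * toy_precision:.2f} {100 * toy_recall:.2f} {100 * toy_f1:.2f}")
# prints: 58.34 63.75 60.92
```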
from pyjedai.clustering import ConnectedComponentsClustering
ccc = ConnectedComponentsClustering()
clusters = ccc.process(g, data)
_ = ccc.evaluate(clusters)
Method: Connected Components Clustering
Method name: Connected Components Clustering
Parameters:
Runtime: 0.1218 seconds
Performance:
    Precision: 2.05%
    Recall: 100.00%
    F1-score: 4.02%