In this notebook we present the pyJedAI approach on the well-known Cora dataset (the code below loads cora.csv) using a Similarity Join workflow.
pyJedAI is an open-source library that can be installed from PyPI.
For more: pypi.org/project/pyjedai/
!python --version
Python 3.8.17
!pip install pyjedai -U
!pip show pyjedai
Name: pyjedai
Version: 0.1.0
Summary: An open-source library that builds powerful end-to-end Entity Resolution workflows.
Home-page:
Author:
Author-email: Konstantinos Nikoletos <nikoletos.kon@gmail.com>, George Papadakis <gpapadis84@gmail.com>, Jakub Maciejewski <jacobb.maciejewski@gmail.com>, Manolis Koubarakis <koubarak@di.uoa.gr>
License: Apache Software License 2.0
Location: /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages
Requires: faiss-cpu, gensim, matplotlib, matplotlib-inline, networkx, nltk, numpy, optuna, ordered-set, pandas, pandas-profiling, pandocfilters, plotly, py-stringmatching, PyYAML, rdflib, rdfpandas, regex, scipy, seaborn, sentence-transformers, strsim, strsimpy, tomli, tqdm, transformers, valentine
Required-by:
Imports
import os
import sys
import pandas as pd
import networkx
from networkx import draw, Graph
from pyjedai.utils import print_clusters, print_blocks, print_candidate_pairs
from pyjedai.evaluation import Evaluation
To run, pyJedAI only needs the initial data to be transformed into a pandas DataFrame. Hence, pyJedAI can work with any structured or semi-structured data. In this case, the Cora dataset is provided as .csv files.
The Data module offers a number of options.
from pyjedai.datamodel import Data
d1 = pd.read_csv("./../data/der/cora/cora.csv", sep='|')
gt = pd.read_csv("./../data/der/cora/cora_gt.csv", sep='|', header=None)
attr = ['Entity Id','author', 'title']
Data is the connecting module of all steps of the workflow
data = Data(
    dataset_1=d1,
    id_column_name_1='Entity Id',
    ground_truth=gt,
    attributes_1=attr
)
from pyjedai.joins import EJoin, TopKJoin
join = EJoin(
    similarity_threshold=0.5,
    metric='jaccard',
    tokenization='qgrams_multiset',
    qgrams=2
)
g = join.fit(data)
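To make the join parameters above more concrete, here is a minimal pure-Python sketch of a threshold join using Jaccard similarity over character 2-gram multisets. It is not pyJedAI's implementation: the helper names (`qgram_multiset`, `multiset_jaccard`, `naive_join`), the lowercasing step, the all-pairs comparison, and the use of `>=` for the threshold are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def qgram_multiset(text, q=2):
    """Count every overlapping character q-gram (lowercasing is an assumption)."""
    text = text.lower()
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))


def multiset_jaccard(a, b):
    """Multiset Jaccard: sum of minimum counts over sum of maximum counts."""
    intersection = sum((a & b).values())
    union = sum((a | b).values())
    return intersection / union if union else 0.0


def naive_join(records, threshold, q=2):
    """Hypothetical all-pairs join: keep pairs scoring at or above the threshold."""
    profiles = {rid: qgram_multiset(text, q) for rid, text in records.items()}
    matches = []
    for a, b in combinations(sorted(profiles), 2):
        score = multiset_jaccard(profiles[a], profiles[b])
        if score >= threshold:
            matches.append((a, b, score))
    return matches


# Toy records, invented for this example (not taken from the dataset).
toy = {1: "sony tv", 2: "sony tvs", 3: "canon"}
for a, b, score in naive_join(toy, threshold=0.5):
    print(a, b, round(score, 3))
```

Each surviving pair can be read as a weighted edge of a similarity graph, which is the role `g` plays in this notebook. Comparing all pairs is quadratic in the number of records; a real similarity join would normally avoid that, so treat this only as a specification of what is being computed.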
_ = join.evaluate(g)
***************************************************************************************************************************
Μethod: EJoin
***************************************************************************************************************************
Method name: EJoin
Parameters:
    similarity_threshold: 0.5
    metric: jaccard
    tokenization: qgrams_multiset
    qgrams: 2
Runtime: 51.6994 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
    Precision: 65.80%
    Recall: 93.03%
    F1-score: 77.08%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
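The three figures reported by the evaluation can be related with a short sketch. It assumes the usual pair-level definitions (precision is the share of predicted pairs that are true matches, recall is the share of true matches that were predicted, and F1 is their harmonic mean). The function name `pair_metrics` and the toy pairs are hypothetical; how pyJedAI counts pairs internally is not shown in this notebook.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def pair_metrics(predicted, ground_truth):
    """Pair-level precision, recall and F1; pairs are treated as unordered."""
    pred = {frozenset(p) for p in predicted}
    truth = {frozenset(p) for p in ground_truth}
    true_positives = len(pred & truth)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall, f1_score(precision, recall)


# Toy pairs, invented for this example.
predicted = [(1, 2), (2, 3), (4, 5)]
ground_truth = [(2, 1), (4, 5), (6, 7)]
print(pair_metrics(predicted, ground_truth))

# Consistency check against the figures reported for EJoin above.
print(round(f1_score(0.6580, 0.9303), 4))
```

The last line recomputes the F1-score from the reported precision and recall, which agrees with the 77.08% printed by the evaluation.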
draw(g)
Entity Clustering takes as input the similarity graph produced by Entity Matching (here, the graph g returned by the join) and partitions it into a set of equivalence clusters, with every cluster corresponding to a distinct real-world object.
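As a rough illustration of this partitioning step, the sketch below groups entities by connected components after discarding weak edges. It is a plain-Python approximation, not the library's code: the function name `connected_components`, the `(u, v, weight)` edge format, and the assumption that a similarity threshold simply drops edges below it are all illustrative.

```python
from collections import defaultdict


def connected_components(edges, threshold=0.0):
    """Group nodes linked by edges whose weight is at or above the threshold."""
    adjacency = defaultdict(set)
    for u, v, weight in edges:
        if weight >= threshold:  # assumed behaviour: weak edges are ignored
            adjacency[u].add(v)
            adjacency[v].add(u)

    seen = set()
    clusters = []
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first traversal collects everything reachable from `start`.
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(adjacency[node] - seen)
        clusters.append(component)
    return clusters


# Toy similarity graph, invented for this example.
toy_edges = [(1, 2, 0.9), (2, 3, 0.6), (4, 5, 0.7), (5, 6, 0.2)]
print(connected_components(toy_edges, threshold=0.3))
```

In the toy graph, entities 1 and 3 end up in the same cluster even though no edge links them directly. This transitive closure adds pairs the matcher never proposed, which is one plausible reason precision can be lower after clustering than after the join itself. Nodes whose edges all fall below the threshold (entity 6 here) are simply omitted by this sketch.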
from pyjedai.clustering import ConnectedComponentsClustering
ec = ConnectedComponentsClustering()
clusters = ec.process(g, data, similarity_threshold=0.3)
_ = ec.evaluate(clusters)
***************************************************************************************************************************
Μethod: Connected Components Clustering
***************************************************************************************************************************
Method name: Connected Components Clustering
Parameters:
Runtime: 0.3853 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
    Precision: 48.42%
    Recall: 93.19%
    F1-score: 63.73%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────