This notebook demonstrates a pyJedAI Similarity Join workflow on the well-known CORA dataset, which is the dataset loaded in the cells below.
pyJedAI is an open-source library that can be installed from PyPI.
For more: pypi.org/project/pyjedai/
!python --version
Python 3.8.17
!pip install pyjedai -U
!pip show pyjedai
Name: pyjedai
Version: 0.1.0
Summary: An open-source library that builds powerful end-to-end Entity Resolution workflows.
Home-page:
Author:
Author-email: Konstantinos Nikoletos <nikoletos.kon@gmail.com>, George Papadakis <gpapadis84@gmail.com>, Jakub Maciejewski <jacobb.maciejewski@gmail.com>, Manolis Koubarakis <koubarak@di.uoa.gr>
License: Apache Software License 2.0
Location: /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages
Requires: faiss-cpu, gensim, matplotlib, matplotlib-inline, networkx, nltk, numpy, optuna, ordered-set, pandas, pandas-profiling, pandocfilters, plotly, py-stringmatching, PyYAML, rdflib, rdfpandas, regex, scipy, seaborn, sentence-transformers, strsim, strsimpy, tomli, tqdm, transformers, valentine
Required-by:
Imports
import os
import sys
import pandas as pd
import networkx
from networkx import draw, Graph
from pyjedai.utils import print_clusters, print_blocks, print_candidate_pairs
from pyjedai.evaluation import Evaluation
pyJedAI only requires that the input data be loaded into a pandas DataFrame, so it can work with any structured or semi-structured data. In this case, the CORA dataset is provided as .csv files.
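To illustrate the semi-structured case, here is a minimal sketch of flattening nested records into flat rows before they are handed to pandas. The `flatten` helper and the two sample records are made up for this illustration; they are not part of pyJedAI or of the CORA files.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten a nested dict into a single-level dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, prefixing their keys.
            flat.update(flatten(value, name, sep))
        elif isinstance(value, (list, tuple)):
            # Join multi-valued fields into one text value.
            flat[name] = " ".join(str(item) for item in value)
        else:
            flat[name] = value
    return flat


# Hypothetical semi-structured records (not taken from any real dataset).
records = [
    {"id": 1, "title": "A sample title",
     "authors": ["A. Author", "B. Author"],
     "venue": {"name": "Some Venue", "year": 1995}},
    {"id": 2, "title": "Another title", "venue": {"name": "Other Venue"}},
]

rows = [flatten(record) for record in records]
```

The resulting list of flat dictionaries can be passed to `pd.DataFrame(rows)`; keys missing from a record become NaN in the corresponding column.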
The Data module offers a number of options.
from pyjedai.datamodel import Data
d1 = pd.read_csv("./../data/der/cora/cora.csv", sep='|')
gt = pd.read_csv("./../data/der/cora/cora_gt.csv", sep='|', header=None)
attr = ['Entity Id','author', 'title']
The Data object connects all steps of the workflow.
data = Data(
    dataset_1=d1,
    id_column_name_1='Entity Id',
    ground_truth=gt,
    attributes_1=attr
)
from pyjedai.joins import EJoin, TopKJoin
join = EJoin(
    similarity_threshold=0.5,
    metric='jaccard',
    tokenization='qgrams_multiset',
    qgrams=2
)
g = join.fit(data)
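To make the join settings more concrete, the sketch below computes a Jaccard similarity over multisets of character 2-grams and compares it with the 0.5 threshold. It is a simplified illustration, not pyJedAI's implementation: the helper names are ours, and the multiset Jaccard used here (sum of minimum counts over sum of maximum counts) is one common definition that we assume for the example.

```python
from collections import Counter


def qgram_multiset(text, q=2):
    """Count every overlapping character q-gram of the lowercased text."""
    text = text.lower()
    if len(text) < q:
        # Strings shorter than q are kept as a single token.
        return Counter([text]) if text else Counter()
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))


def multiset_jaccard(a, b):
    """Sum of minimum counts divided by sum of maximum counts."""
    union = sum((a | b).values())  # Counter | keeps the max count per key
    if union == 0:
        return 0.0
    return sum((a & b).values()) / union  # Counter & keeps the min count


left = qgram_multiset("entity resolution")
right = qgram_multiset("entity resolutions")
similarity = multiset_jaccard(left, right)

# A pair is kept only if it reaches the join threshold.
is_candidate = similarity >= 0.5
```

Unlike a plain set of q-grams, the multiset keeps repeated q-grams (here `ti` occurs twice), so repeated substrings contribute to the score.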
_ = join.evaluate(g)
***************************************************************************************************************************
                                         Method:  EJoin
***************************************************************************************************************************
Method name: EJoin
Parameters:
    similarity_threshold: 0.5
    metric: jaccard
    tokenization: qgrams_multiset
    qgrams: 2
Runtime: 51.6994 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
    Precision:     65.80%
    Recall:        93.03%
    F1-score:      77.08%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
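For readers who want to see how such scores relate to each other, here is a small sketch of the standard pair-based definitions of precision, recall and F1. It assumes that predictions and ground truth are sets of unordered record-id pairs; how pyJedAI computes these values internally is not shown in this notebook, and the toy pairs are invented.

```python
def normalize(pair):
    """Treat (a, b) and (b, a) as the same unordered pair."""
    return tuple(sorted(pair))


def precision_recall_f1(predicted_pairs, true_pairs):
    predicted = {normalize(p) for p in predicted_pairs}
    truth = {normalize(p) for p in true_pairs}
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example with invented ids: 2 of 4 predicted pairs are correct,
# and 2 of 3 true pairs are found.
predicted = [(1, 2), (2, 3), (4, 5), (6, 7)]
truth = [(2, 1), (3, 2), (8, 9)]
precision, recall, f1 = precision_recall_f1(predicted, truth)

# Consistency check: the harmonic mean of the reported precision (65.80%)
# and recall (93.03%) reproduces the reported F1-score.
reported_f1 = 2 * 0.6580 * 0.9303 / (0.6580 + 0.9303)
```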
draw(g)
Entity Clustering takes as input the similarity graph produced by the matching step (here, the similarity join) and partitions it into a set of equivalence clusters, with every cluster corresponding to a distinct real-world object.
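As a rough illustration of this idea, the sketch below groups nodes into connected components after discarding edges below a similarity threshold, using a small union-find structure. It is not pyJedAI's code: the function name, the edge format `(u, v, similarity)` and the use of `>=` for the threshold are assumptions made for the example.

```python
from collections import defaultdict


def connected_components(edges, threshold):
    """Cluster nodes linked by edges whose similarity reaches the threshold."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for u, v, similarity in edges:
        if similarity >= threshold:
            root_u, root_v = find(u), find(v)
            if root_u != root_v:
                parent[root_u] = root_v  # merge the two components

    clusters = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)
    # Nodes without any retained edge are not reported in this sketch.
    return sorted(sorted(cluster) for cluster in clusters.values())


# Invented toy graph: the 0.2 edge is dropped at threshold 0.3.
edges = [(0, 1, 0.9), (1, 2, 0.6), (3, 4, 0.8), (2, 3, 0.2)]
clusters_strict = connected_components(edges, threshold=0.3)
clusters_loose = connected_components(edges, threshold=0.1)
```

Because components are transitive closures of the retained edges, a single spurious edge is enough to merge two otherwise separate clusters.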
from pyjedai.clustering import ConnectedComponentsClustering
ec = ConnectedComponentsClustering()
clusters = ec.process(g, data, similarity_threshold=0.3)
_ = ec.evaluate(clusters)
***************************************************************************************************************************
                                         Method:  Connected Components Clustering
***************************************************************************************************************************
Method name: Connected Components Clustering
Parameters:
Runtime: 0.3853 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
    Precision:     48.42%
    Recall:        93.19%
    F1-score:      63.73%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────