pyJedAI is an open-source library that can be installed from PyPI.
%pip install pyjedai -U
%pip show pyjedai
Imports
import os
import sys
import pandas as pd
import networkx
from networkx import draw, Graph
from pyjedai.evaluation import Evaluation
from pyjedai.datamodel import Data
d1 = pd.read_csv("../data/ccer/D2/abt.csv", sep='|', engine='python', na_filter=False).astype(str)
d2 = pd.read_csv("../data/ccer/D2/buy.csv", sep='|', engine='python', na_filter=False).astype(str)
gt = pd.read_csv("../data/ccer/D2/gt.csv", sep='|', engine='python')
attr1 = d1.columns[1:].to_list()
attr2 = d2.columns[1:].to_list()
data = Data(dataset_1=d1,
            attributes_1=attr1,
            id_column_name_1='id',
            dataset_2=d2,
            attributes_2=attr2,
            id_column_name_2='id',
            ground_truth=gt)
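Available embeddings (values accepted by the vectorizer argument):
Static word embeddings: {'fasttext', 'glove', 'word2vec'}
Sentence embeddings: {'smpnet', 'st5', 'sdistilroberta', 'sminilm', 'sent_glove'}
Transformer language models: {'bert', 'distilbert', 'roberta', 'xlnet', 'albert'}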
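faiss.IndexIVFFlat is an implementation of an inverted file index with coarse quantization. It is used to search efficiently for the nearest neighbors of a query vector in a large collection of vectors. Its main parameters are the vector dimensionality d, the number of coarse clusters nlist into which the vectors are partitioned (each cluster keeps an inverted list of the vectors assigned to it), and, at query time, nprobe, the number of clusters whose lists are scanned for each query. A larger nprobe improves recall at the cost of speed.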
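The idea behind an inverted file index can be sketched in plain Python. This is an illustrative toy under stated assumptions, not faiss's implementation: the centroids are supplied by hand rather than trained with k-means, and the helper names `l2`, `build_ivf`, and `search_ivf` are hypothetical, chosen only for this example. The `nprobe` argument plays the same role as faiss's parameter of that name.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_ivf(vectors, centroids):
    """Put every vector id into the inverted list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for vid, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        lists[nearest].append(vid)
    return lists


def search_ivf(query, vectors, centroids, lists, k, nprobe):
    """Return the k nearest ids, scanning only the nprobe closest inverted lists."""
    probe = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [vid for c in probe for vid in lists[c]]
    candidates.sort(key=lambda vid: l2(query, vectors[vid]))
    return candidates[:k]


# Toy 2-D data: two well-separated groups, one hand-picked centroid per group.
vectors = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.8, 5.3)]
centroids = [(0.1, 0.1), (5.0, 5.0)]

lists = build_ivf(vectors, centroids)
top2 = search_ivf((4.9, 5.0), vectors, centroids, lists, k=2, nprobe=1)

print(lists)  # {0: [0, 1, 2], 1: [3, 4, 5]}
print(top2)   # [3, 4]
```

With `nprobe=1` the query is compared against three vectors instead of six. That saving is the point of the coarse quantization step, and it is also why a true neighbor that falls in an unprobed list can be missed.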
from pyjedai.vector_based_blocking import EmbeddingsNNBlockBuilding
emb = EmbeddingsNNBlockBuilding(vectorizer='sminilm',
                                similarity_search='faiss')
blocks, g = emb.build_blocks(data,
                             top_k=5,
                             similarity_distance='euclidean',
                             load_embeddings_if_exist=False,
                             save_embeddings=False,
                             with_entity_matching=True)
Building blocks via Embeddings-NN Block Building [sminilm, faiss]
Embeddings-NN Block Building [sminilm, faiss, cuda]: 100%|██████████| 2152/2152 [00:21<00:00, 101.10it/s]
emb.evaluate(blocks, with_classification_report=True, with_stats=True)
***************************************************************************************************************************
                                         Method: Embeddings-NN Block Building
***************************************************************************************************************************
Method name: Embeddings-NN Block Building
Parameters:
    Vectorizer: sminilm
    Similarity-Search: faiss
    Top-K: 5
    Vector size: 384
Runtime: 21.9882 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
    Precision: 18.75%
    Recall: 93.77%
    F1-score: 31.26%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Classification report:
    True positives: 1009
    False positives: 4371
    True negatives: 1156633
    False negatives: 67
    Total comparisons: 5380
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Statistics:
    FAISS:
        Indices shape returned after search: (1076, 5)
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
{'Precision %': 18.7546468401487, 'Recall %': 93.77323420074349, 'F1 %': 31.257744733581166, 'True Positives': 1009, 'False Positives': 4371, 'True Negatives': 1156633, 'False Negatives': 67}
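The percentages in this report follow directly from the counts in the classification report. The short check below recomputes them using the standard definitions of precision, recall, and F1. The helper name `pair_metrics` is hypothetical and is not part of pyJedAI.

```python
def pair_metrics(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts copied from the classification report above.
precision, recall, f1 = pair_metrics(tp=1009, fp=4371, fn=67)

print(f"Precision: {100 * precision:.2f}%")  # Precision: 18.75%
print(f"Recall:    {100 * recall:.2f}%")     # Recall:    93.77%
print(f"F1-score:  {100 * f1:.2f}%")         # F1-score:  31.26%

# 1076 query entities with top_k=5 neighbours each give the 5380 candidate pairs,
# which is also the sum of true and false positives.
total_comparisons = 1076 * 5
print(total_comparisons == 1009 + 4371)      # True
```

The low precision is expected at this stage: every query entity contributes five candidates, while each entity has at most a handful of true matches, so most candidate pairs are necessarily false positives.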
from pyjedai.clustering import ConnectedComponentsClustering, UniqueMappingClustering
umc = UniqueMappingClustering()
clusters = umc.process(g, data, similarity_threshold=0.63)
umc.evaluate(clusters, with_classification_report=True)
***************************************************************************************************************************
                                         Method: Unique Mapping Clustering
***************************************************************************************************************************
Method name: Unique Mapping Clustering
Parameters:
    Similarity Threshold: 0.63
Runtime: 0.0330 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
    Precision: 83.41%
    Recall: 67.29%
    F1-score: 74.49%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Classification report:
    True positives: 724
    False positives: 144
    True negatives: 1156348
    False negatives: 352
    Total comparisons: 868
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
{'Precision %': 83.41013824884793, 'Recall %': 67.28624535315984, 'F1 %': 74.48559670781893, 'True Positives': 724, 'False Positives': 144, 'True Negatives': 1156348, 'False Negatives': 352}
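Unique mapping clustering is commonly described as a greedy one-to-one matching for clean-clean entity resolution: candidate pairs are visited in decreasing order of similarity, and a pair is kept only if it exceeds the threshold and neither of its entities has been matched yet. The sketch below illustrates that idea on made-up data. It is not pyJedAI's implementation: the function name `unique_mapping`, the toy pairs, and the choice of a strict comparison against the threshold are all assumptions of this example.

```python
def unique_mapping(scored_pairs, similarity_threshold):
    """Greedy one-to-one matching over (left_id, right_id, similarity) triples."""
    matched_left, matched_right, matches = set(), set(), []
    # Visit the most similar pairs first.
    for left, right, sim in sorted(scored_pairs, key=lambda p: p[2], reverse=True):
        if sim <= similarity_threshold:
            break  # every remaining pair is at or below the threshold
        if left in matched_left or right in matched_right:
            continue  # one of the two entities already has a partner
        matched_left.add(left)
        matched_right.add(right)
        matches.append((left, right))
    return matches


# Hypothetical candidate pairs with similarity scores.
pairs = [
    ("a1", "b1", 0.91),
    ("a1", "b2", 0.88),
    ("a2", "b2", 0.70),
    ("a2", "b1", 0.65),
    ("a3", "b3", 0.55),
]

matches = unique_mapping(pairs, similarity_threshold=0.63)
print(matches)  # [('a1', 'b1'), ('a2', 'b2')]
```

Allowing each entity at most one partner, together with the threshold, discards most of the five candidates per entity produced by blocking. That is consistent with the results above, where the number of compared pairs drops from 5380 to 868 and precision rises while recall falls.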