pyJedAI is an open-source library that can be installed from PyPI.
!pip install pyjedai -U
!pip show pyjedai
Name: pyjedai
Version: 0.1.0
Summary: An open-source library that builds powerful end-to-end Entity Resolution workflows.
Home-page:
Author:
Author-email: Konstantinos Nikoletos <nikoletos.kon@gmail.com>, George Papadakis <gpapadis84@gmail.com>, Jakub Maciejewski <jacobb.maciejewski@gmail.com>, Manolis Koubarakis <koubarak@di.uoa.gr>
License: Apache Software License 2.0
Location: /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages
Requires: faiss-cpu, gensim, matplotlib, matplotlib-inline, networkx, nltk, numpy, optuna, ordered-set, pandas, pandas-profiling, pandocfilters, plotly, py-stringmatching, PyYAML, rdflib, rdfpandas, regex, scipy, seaborn, sentence-transformers, strsim, strsimpy, tomli, tqdm, transformers, valentine
Required-by:
Imports
import os
import sys
import pandas as pd
import networkx
from networkx import draw, Graph
from pyjedai.evaluation import Evaluation
from pyjedai.datamodel import Data
[nltk_data] Downloading package stopwords to /home/jm/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
d1 = pd.read_csv("../data/ccer/D2/abt.csv", sep='|', engine='python', na_filter=False).astype(str)
d2 = pd.read_csv("../data/ccer/D2/buy.csv", sep='|', engine='python', na_filter=False).astype(str)
gt = pd.read_csv("../data/ccer/D2/gt.csv", sep='|', engine='python')
attr1 = d1.columns[1:].to_list()
attr2 = d2.columns[1:].to_list()
data = Data(dataset_1=d1,
            attributes_1=attr1,
            id_column_name_1='id',
            dataset_2=d2,
            attributes_2=attr2,
            id_column_name_2='id',
            ground_truth=gt)
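Available embeddings:
Word embeddings: {'fasttext', 'glove', 'word2vec'}
Sentence embeddings (Sentence-Transformers models): {'smpnet', 'st5', 'sdistilroberta', 'sminilm', 'sent_glove'}
Pre-trained transformer language models: {'bert', 'distilbert', 'roberta', 'xlnet', 'albert'}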
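faiss.IndexIVFFlat implements an inverted file (IVF) index with coarse quantization. It is used to search efficiently for the nearest neighbors of a query vector in a large set of vectors. Its main parameters are the coarse quantizer (an index over the cluster centroids), the vector dimensionality d, the number of clusters nlist, and the distance metric. At query time, nprobe sets how many of the nearest clusters are scanned: a larger nprobe improves recall at the cost of speed.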
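To make the inverted-file idea concrete, here is a minimal pure-Python sketch of coarse quantization and probing. It is not FAISS code: `ToyIVFIndex` and `sq_dist` are hypothetical names, the centroids are supplied by hand (FAISS learns them with k-means during training), and the toy vectors are made up for illustration.

```python
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class ToyIVFIndex:
    """Toy inverted-file index (hypothetical class, not part of FAISS)."""

    def __init__(self, centroids):
        # Assumption: centroids are given. A real IVF index learns them.
        self.centroids = centroids
        self.lists = defaultdict(list)  # cell id -> [(vector id, vector)]
        self.ntotal = 0                 # number of vectors added so far

    def _nearest_cells(self, vec, n):
        """Ids of the n centroids closest to vec, nearest first."""
        order = sorted(range(len(self.centroids)),
                       key=lambda c: sq_dist(vec, self.centroids[c]))
        return order[:n]

    def add(self, vectors):
        # Coarse quantization: file each vector under its closest centroid.
        for vec in vectors:
            cell = self._nearest_cells(vec, 1)[0]
            self.lists[cell].append((self.ntotal, vec))
            self.ntotal += 1

    def search(self, query, k, nprobe=1):
        # Scan only the nprobe closest cells instead of every stored vector.
        candidates = []
        for cell in self._nearest_cells(query, nprobe):
            candidates.extend(self.lists[cell])
        candidates.sort(key=lambda item: sq_dist(query, item[1]))
        return [vec_id for vec_id, _ in candidates[:k]]


vectors = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (0.3, 0.1)]
index = ToyIVFIndex(centroids=[(0.0, 0.0), (5.0, 5.0)])
index.add(vectors)

neighbours = index.search((0.2, 0.1), k=2, nprobe=1)
print(neighbours)  # [4, 1]
```

With nprobe=1 only the cell around (0.0, 0.0) is scanned, so the two vectors near (5.0, 5.0) are never compared against the query. That skipped work is where the speed-up comes from, and it is also why a true neighbor sitting in an unprobed cell can be missed.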
from pyjedai.vector_based_blocking import EmbeddingsNNBlockBuilding
emb = EmbeddingsNNBlockBuilding(vectorizer='sminilm',
                                similarity_search='faiss')
blocks, g = emb.build_blocks(data,
                             top_k=5,
                             similarity_distance='euclidean',
                             load_embeddings_if_exist=False,
                             save_embeddings=False,
                             with_entity_matching=True)
Building blocks via Embeddings-NN Block Building [sminilm, faiss]
Device selected: cpu
emb.evaluate(blocks, with_classification_report=True, with_stats=True)
********************************************************************************
                     Method: Embeddings-NN Block Building
********************************************************************************
Method name: Embeddings-NN Block Building
Parameters:
    Vectorizer: sminilm
    Similarity-Search: faiss
    Top-K: 5
    Vector size: 384
Runtime: 157.0182 seconds
────────────────────────────────────────────────────────────────────────────────
Performance:
    Precision: 9.38%
    Recall: 93.77%
    F1-score: 17.05%
────────────────────────────────────────────────────────────────────────────────
Classification report:
    True positives: 1009
    False positives: 9751
    True negatives: 1156633
    False negatives: 67
    Total comparisons: 10760
────────────────────────────────────────────────────────────────────────────────
Statistics:
    FAISS:
        Indices shape returned after search: (1076, 5)
────────────────────────────────────────────────────────────────────────────────
{'Precision %': 9.37732342007435, 'Recall %': 93.77323420074349, 'F1 %': 17.049678945589726, 'True Positives': 1009, 'False Positives': 9751, 'True Negatives': 1156633, 'False Negatives': 67}
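The percentages in this report follow from the true-positive, false-positive and false-negative counts. The sketch below recomputes them, assuming the standard definitions of precision, recall and F1; `pair_metrics` is a hypothetical helper written for this illustration, not a pyJedAI function.

```python
def pair_metrics(tp, fp, fn):
    """Precision, recall and F1 (as percentages) from pair counts.

    Hypothetical helper: assumes tp + fp > 0 and tp + fn > 0.
    """
    precision = tp / (tp + fp)   # share of emitted pairs that are true matches
    recall = tp / (tp + fn)      # share of true matches that were emitted
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "Precision %": 100 * precision,
        "Recall %": 100 * recall,
        "F1 %": 100 * f1,
    }


# Counts copied from the classification report above.
scores = pair_metrics(tp=1009, fp=9751, fn=67)
print({name: round(value, 2) for name, value in scores.items()})
# {'Precision %': 9.38, 'Recall %': 93.77, 'F1 %': 17.05}
```

Recall is high and precision is low here because the block-building step emits many candidate pairs for every entity: most true matches are covered, but most candidates are not matches.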
from pyjedai.clustering import ConnectedComponentsClustering, UniqueMappingClustering
# Cluster the similarity graph g returned by build_blocks above.
umc = UniqueMappingClustering()
clusters = umc.process(g, data, similarity_threshold=0.63)
umc.evaluate(clusters, with_classification_report=True)
********************************************************************************
                       Method: Unique Mapping Clustering
********************************************************************************
Method name: Unique Mapping Clustering
Parameters:
Runtime: 0.1209 seconds
────────────────────────────────────────────────────────────────────────────────
Performance:
    Precision: 83.41%
    Recall: 67.29%
    F1-score: 74.49%
────────────────────────────────────────────────────────────────────────────────
Classification report:
    True positives: 724
    False positives: 144
    True negatives: 1156348
    False negatives: 352
    Total comparisons: 868
────────────────────────────────────────────────────────────────────────────────
{'Precision %': 83.41013824884793, 'Recall %': 67.28624535315984, 'F1 %': 74.48559670781893, 'True Positives': 724, 'False Positives': 144, 'True Negatives': 1156348, 'False Negatives': 352}
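For intuition about the clustering step, the sketch below shows the general idea behind unique mapping clustering: visit candidate pairs in decreasing similarity and accept a pair only if neither entity has been matched yet. It is a simplified illustration, not pyJedAI's implementation. `unique_mapping` and the toy edges are hypothetical, and the exact treatment of ties and of pairs lying exactly on the threshold is an assumption.

```python
def unique_mapping(edges, similarity_threshold):
    """Greedy one-to-one matching over weighted candidate pairs.

    edges: iterable of (left_id, right_id, similarity) tuples.
    Hypothetical helper; pairs below the threshold are discarded (assumption).
    """
    matched_left, matched_right = set(), set()
    pairs = []
    # Strongest candidates first, so each entity keeps its best free partner.
    for left, right, sim in sorted(edges, key=lambda e: e[2], reverse=True):
        if sim < similarity_threshold:
            break  # every remaining edge is weaker still
        if left in matched_left or right in matched_right:
            continue  # one of the two entities is already paired
        matched_left.add(left)
        matched_right.add(right)
        pairs.append((left, right))
    return pairs


# Made-up similarity graph between two clean sources (ids are illustrative).
edges = [
    ("a1", "b1", 0.95),
    ("a1", "b2", 0.90),
    ("a2", "b2", 0.80),
    ("a3", "b1", 0.70),
    ("a3", "b3", 0.50),
]
pairs = unique_mapping(edges, similarity_threshold=0.63)
print(pairs)  # [('a1', 'b1'), ('a2', 'b2')]
```

Because each entity is matched at most once, far fewer pairs survive than in the blocking step. This is consistent with the reports above, where precision rises sharply after clustering while recall drops.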