pyJedAI with PyTorch pre-trained embeddings and FAISS

How to install?

pyJedAI is an open-source library that can be installed from PyPI.

In [ ]:

!pip install pyjedai -U

In [2]:

!pip show pyjedai

Name: pyjedai
Version: 0.1.0
Summary: An open-source library that builds powerful end-to-end Entity Resolution workflows.
Home-page: 
Author: 
Author-email: Konstantinos Nikoletos <nikoletos.kon@gmail.com>, George Papadakis <gpapadis84@gmail.com>, Jakub Maciejewski <jacobb.maciejewski@gmail.com>, Manolis Koubarakis <koubarak@di.uoa.gr>
License: Apache Software License 2.0
Location: /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages
Requires: faiss-cpu, gensim, matplotlib, matplotlib-inline, networkx, nltk, numpy, optuna, ordered-set, pandas, pandas-profiling, pandocfilters, plotly, py-stringmatching, PyYAML, rdflib, rdfpandas, regex, scipy, seaborn, sentence-transformers, strsim, strsimpy, tomli, tqdm, transformers, valentine
Required-by:

Imports

In [3]:

import os
import sys
import pandas as pd
import networkx
from networkx import draw, Graph

from pyjedai.evaluation import Evaluation
from pyjedai.datamodel import Data

[nltk_data] Downloading package stopwords to /home/jm/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

In [4]:

d1 = pd.read_csv("../data/ccer/D2/abt.csv", sep='|', engine='python', na_filter=False).astype(str)
d2 = pd.read_csv("../data/ccer/D2/buy.csv", sep='|', engine='python', na_filter=False).astype(str)
gt = pd.read_csv("../data/ccer/D2/gt.csv", sep='|', engine='python')

attr1 = d1.columns[1:].to_list()
attr2 = d2.columns[1:].to_list()

data = Data(dataset_1=d1,
            attributes_1=attr1,
            id_column_name_1='id',
            dataset_2=d2,
            attributes_2=attr2,
            id_column_name_2='id',
            ground_truth=gt)

Block Building

Pre-trained PyTorch & Gensim embeddings

Available embeddings:

Gensim: {'fasttext', 'glove', 'word2vec'}
PyTorch sentence transformers: {'smpnet', 'st5', 'sdistilroberta', 'sminilm', 'sent_glove'}
PyTorch word transformers: {'bert', 'distilbert', 'roberta', 'xlnet', 'albert'}

FAISS

faiss.IndexIVFFlat is an inverted file (IVF) index with coarse quantization. It partitions the indexed vectors into cells around learned centroids and, at query time, scans only the cells closest to the query vector, which makes nearest-neighbor search over a large set of vectors efficient.
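To make the inverted-file idea concrete, here is a minimal pure-Python sketch. It is a toy illustration, not FAISS code and not what pyJedAI runs internally: the helper names (`l2`, `build_ivf`, `search_ivf`) are hypothetical, and the centroids are simply sampled from the data as a stand-in for the k-means training a real index performs. The number of cells and the `nprobe` argument loosely correspond to the `nlist` and `nprobe` settings of a FAISS IVF index.

```python
import random

def l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, centroids):
    """Put every vector id into the inverted list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for vid, vec in enumerate(vectors):
        cell = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        lists[cell].append(vid)
    return lists

def search_ivf(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))
    candidates = [vid for c in order[:nprobe] for vid in lists[c]]
    return sorted(candidates, key=lambda vid: l2(query, vectors[vid]))[:k]

random.seed(0)
vectors = [[random.random(), random.random()] for _ in range(200)]
centroids = random.sample(vectors, 8)  # assumption: stand-in for k-means training
lists = build_ivf(vectors, centroids)

query = [0.5, 0.5]
# Probing 2 of 8 cells is fast but may miss some true neighbors.
approx = search_ivf(query, vectors, centroids, lists, k=5, nprobe=2)
# Probing every cell degenerates to an exhaustive (exact) search.
exact = search_ivf(query, vectors, centroids, lists, k=5, nprobe=len(centroids))
print("approximate:", approx)
print("exact:      ", exact)
```

Raising `nprobe` trades speed for recall: with `nprobe` equal to the number of cells, the result matches a brute-force scan.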

In [5]:

from pyjedai.vector_based_blocking import EmbeddingsNNBlockBuilding

In [6]:

emb = EmbeddingsNNBlockBuilding(vectorizer='sminilm',
                                similarity_search='faiss')

blocks, g = emb.build_blocks(data,
                             top_k=5,
                             similarity_distance='euclidean',
                             load_embeddings_if_exist=False,
                             save_embeddings=False,
                             with_entity_matching=True)

Building blocks via Embeddings-NN Block Building [sminilm, faiss]

Embeddings-NN Block Building [sminilm, faiss]:   0%|          | 0/2152 [00:00<?, ?it/s]

Device selected:  cpu

In [7]:

emb.evaluate(blocks, with_classification_report=True, with_stats=True)

***************************************************************************************************************************
                                         Method:  Embeddings-NN Block Building
***************************************************************************************************************************
Method name: Embeddings-NN Block Building
Parameters: 
	Vectorizer: sminilm
	Similarity-Search: faiss
	Top-K: 5
	Vector size: 384
Runtime: 157.0182 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
	Precision:      9.38% 
	Recall:        93.77%
	F1-score:      17.05%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Classification report:
	True positives: 1009
	False positives: 9751
	True negatives: 1156633
	False negatives: 67
	Total comparisons: 10760
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Statistics:
 FAISS:
	Indices shape returned after search: (1076, 5)
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Out[7]:

{'Precision %': 9.37732342007435,
 'Recall %': 93.77323420074349,
 'F1 %': 17.049678945589726,
 'True Positives': 1009,
 'False Positives': 9751,
 'True Negatives': 1156633,
 'False Negatives': 67}
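The reported percentages can be re-derived from the confusion counts in the classification report. The sketch below assumes the standard definitions of precision, recall, and F1; the function name `pair_metrics` is illustrative and not part of pyJedAI. Note also that the total number of comparisons equals true positives plus false positives (1009 + 9751 = 10760), which matches the 1076 × 5 index shape returned by the top-5 search.

```python
def pair_metrics(tp, fp, fn):
    """Precision, recall and F1 (as percentages) from pair-level counts."""
    precision = tp / (tp + fp)   # share of suggested pairs that are true matches
    recall = tp / (tp + fn)      # share of true matches that were suggested
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "Precision %": 100 * precision,
        "Recall %": 100 * recall,
        "F1 %": 100 * f1,
    }

# Counts taken from the block-building report above.
metrics = pair_metrics(tp=1009, fp=9751, fn=67)
print({name: round(value, 2) for name, value in metrics.items()})
# {'Precision %': 9.38, 'Recall %': 93.77, 'F1 %': 17.05}
```

High recall with low precision is the expected profile for a blocking step: it keeps nearly all true matches while leaving the pruning of false candidates to later stages.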

Entity Clustering

In [8]:

from pyjedai.clustering import ConnectedComponentsClustering, UniqueMappingClustering

In [9]:

umc = UniqueMappingClustering()
clusters = umc.process(g, data, similarity_threshold=0.63)

In [10]:

umc.evaluate(clusters, with_classification_report=True)

***************************************************************************************************************************
                                         Method:  Unique Mapping Clustering
***************************************************************************************************************************
Method name: Unique Mapping Clustering
Parameters: 
Runtime: 0.1209 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
	Precision:     83.41% 
	Recall:        67.29%
	F1-score:      74.49%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Classification report:
	True positives: 724
	False positives: 144
	True negatives: 1156348
	False negatives: 352
	Total comparisons: 868
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Out[10]:

{'Precision %': 83.41013824884793,
 'Recall %': 67.28624535315984,
 'F1 %': 74.48559670781893,
 'True Positives': 724,
 'False Positives': 144,
 'True Negatives': 1156348,
 'False Negatives': 352}
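For intuition about why clustering raises precision so sharply, here is a minimal sketch of the greedy one-to-one matching idea usually associated with Unique Mapping Clustering. It is not pyJedAI's implementation: the function `unique_mapping`, the toy edge list, and the choice of a strict comparison against the threshold are all assumptions made for illustration.

```python
def unique_mapping(edges, threshold):
    """Greedy one-to-one matching over (left_id, right_id, similarity) edges."""
    matched_left, matched_right, pairs = set(), set(), []
    # Visit candidate pairs from most to least similar.
    for left, right, sim in sorted(edges, key=lambda e: e[2], reverse=True):
        if sim <= threshold:
            break  # every remaining edge is at or below the threshold
        if left in matched_left or right in matched_right:
            continue  # each entity may take part in at most one match
        matched_left.add(left)
        matched_right.add(right)
        pairs.append((left, right))
    return pairs

# Hypothetical similarity edges between two clean datasets.
edges = [
    ("a1", "b1", 0.92),
    ("a1", "b2", 0.80),
    ("a2", "b1", 0.75),
    ("a2", "b2", 0.70),
    ("a3", "b3", 0.40),
]
pairs = unique_mapping(edges, threshold=0.63)
print(pairs)  # [('a1', 'b1'), ('a2', 'b2')]
```

Because each entity is matched at most once, most of the top-5 candidates per entity are discarded, which is consistent with the drop from 10760 to 868 retained comparisons reported above.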

K. Nikoletos, J. Maciejewski, G. Papadakis & M. Koubarakis

Apache License 2.0
