This notebook presents the schema matching functionality of pyJedAI. Schema matching looks for semantic correspondences between structures or models, such as database schemas, XML message formats, and ontologies. It identifies differently named attributes that describe the same feature (e.g., “profession” and “job” are semantically equivalent). pyJedAI performs schema matching through Valentine, which is described next.
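To make the idea concrete, here is a minimal sketch of a purely name-based matcher built on Python's standard `difflib`. It is not how pyJedAI or Valentine match schemas; the helper names `name_similarity` and `match_columns` and the 0.8 threshold are assumptions chosen for illustration. It pairs identically or similarly spelled columns, but it cannot relate “profession” to “job”, which is the gap that semantic matchers aim to close.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_columns(cols1, cols2, threshold=0.8):
    """Pair each column in cols1 with its most similar column in cols2,
    keeping only pairs whose similarity reaches the threshold."""
    matches = []
    for c1 in cols1:
        best = max(cols2, key=lambda c2: name_similarity(c1, c2))
        score = name_similarity(c1, best)
        if score >= threshold:
            matches.append((c1, best, score))
    return matches


# Hypothetical column lists, used only to illustrate the behaviour.
left = ["EID", "Authors", "profession"]
right = ["EID", "Authors", "job"]
for c1, c2, score in match_columns(left, right):
    print(f"{c1} <-> {c2} (similarity {score:.2f})")
```

Because the spellings of “profession” and “job” share almost nothing, that pair falls below the threshold and is not reported.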
Website: https://delftdata.github.io/valentine
Valentine is an extensible open-source tool for executing and organizing large-scale automated matching processes on tabular data, either for experimentation or for deployment on real-world data. It includes implementations of seminal schema matching methods, which its authors either implemented from scratch (where no open-source code was available) or imported from open repositories. To enable proper evaluation, Valentine offers a fabricator that creates evaluation dataset pairs respecting specific semantics.
!pip install pyjedai -U
!python --version
Python 3.9.16
import os
import sys
import pandas as pd
from pyjedai.utils import (
    text_cleaning_method,
    print_clusters,
    print_blocks,
    print_candidate_pairs
)
from pyjedai.evaluation import Evaluation
from pyjedai.datamodel import Data
d1 = pd.read_csv(r"C:\Users\nikol\Desktop\GitHub\pyJedAI-Dev\data\ccer\schema_matching\authors\authors1.csv")
d2 = pd.read_csv(r"C:\Users\nikol\Desktop\GitHub\pyJedAI-Dev\data\ccer\schema_matching\authors\authors2.csv")
gt = pd.read_csv(r"C:\Users\nikol\Desktop\GitHub\pyJedAI-Dev\data\ccer\schema_matching\authors\pairs.csv")
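The absolute Windows paths above resolve only on the machine they were written for. A minimal sketch of building the same three paths relative to a data directory follows; the helper `authors_dataset_paths` and the `"data"` location are hypothetical, and the sub-folder layout is simply copied from the paths above.

```python
from pathlib import Path


def authors_dataset_paths(data_dir):
    """Return the paths of the three CSV files used in this notebook,
    relative to a data directory (layout copied from the paths above)."""
    base = Path(data_dir) / "ccer" / "schema_matching" / "authors"
    return {
        "d1": base / "authors1.csv",
        "d2": base / "authors2.csv",
        "gt": base / "pairs.csv",
    }


paths = authors_dataset_paths("data")  # "data" is a placeholder location
for name, path in paths.items():
    print(name, path, "(found)" if path.exists() else "(missing)")
```

Each resulting path can be passed to `pd.read_csv` in place of the hardcoded string.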
data = Data(
    dataset_1=d1,
    attributes_1=['EID','Authors','Cited by','Title','Year','Source tittle','DOI'],
    id_column_name_1='EID',
    dataset_2=d2,
    attributes_2=['EID','Authors','Cited by','Country','Document Type','City','Access Type','aggregationType'],
    id_column_name_2='EID',
    ground_truth=gt,
)
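The attribute lists are expected to name columns that actually exist in each CSV file, so a quick check before constructing `Data` can catch spelling mismatches early. The sketch below is not part of pyJedAI: `missing_columns` is a hypothetical helper, and it accepts any iterable of column names, such as `d1.columns`.

```python
def missing_columns(available, requested):
    """Return the requested column names that are absent from `available`,
    preserving the order in which they were requested."""
    available = set(available)
    return [name for name in requested if name not in available]


# Stand-in for d1.columns; with pandas loaded, pass d1.columns directly.
example_columns = ["EID", "Authors", "Cited by", "Title", "Year"]
requested = ["EID", "Authors", "Title", "DOI"]
print(missing_columns(example_columns, requested))
```

An empty list means every requested attribute is present.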
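from pyjedai.schema.matching import ValentineMethodBuilder, ValentineSchemaMatching
# Build a schema matcher that uses Valentine's Cupid method
sm = ValentineSchemaMatching(ValentineMethodBuilder.cupid_matcher())
# Run the matcher on the two datasets
sm.process(data)
# Retrieve the matches, print them, and evaluate the result
matches = sm.get_matches()
sm.print_matches()
sm.evaluate()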
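Schema matching results are commonly scored by comparing predicted attribute pairs with a ground truth. The sketch below shows that calculation on plain Python sets. It is not pyJedAI's `evaluate` implementation, and the function name `score_matches` and the example pairs are made up for illustration.

```python
def score_matches(predicted, ground_truth):
    """Compute precision, recall and F1 for predicted attribute pairs."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    true_positives = len(predicted & ground_truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Illustrative pairs only, not output of the notebook.
predicted = {("EID", "EID"), ("Authors", "Authors"), ("Title", "City")}
truth = {("EID", "EID"), ("Authors", "Authors"), ("Cited by", "Cited by")}
precision, recall, f1 = score_matches(predicted, truth)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Precision penalizes spurious pairs, recall penalizes missed ones, and F1 balances the two.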