This IPython notebook illustrates how to sample and label a table (candidate set). First, we need to import the py_entitymatching package and other libraries as follows:
# Import py_entitymatching package
import py_entitymatching as em
import os
import pandas as pd
# Get the datasets directory
datasets_dir = em.get_install_path() + os.sep + 'datasets'
path_A = datasets_dir + os.sep + 'DBLP.csv'
path_B = datasets_dir + os.sep + 'ACM.csv'
path_C = datasets_dir + os.sep + 'tableC.csv'
A = em.read_csv_metadata(path_A, key='id')
B = em.read_csv_metadata(path_B, key='id')
C = em.read_csv_metadata(path_C, key='_id',
                         fk_ltable='ltable_id', fk_rtable='rtable_id',
                         ltable=A, rtable=B)
Metadata file is not present in the given path; proceeding to read the csv file.
Metadata file is not present in the given path; proceeding to read the csv file.
Metadata file is not present in the given path; proceeding to read the csv file.
C.head()
| | _id | ltable_id | rtable_id | ltable_authors | ltable_title | rtable_authors | rtable_title |
---|---|---|---|---|---|---|---|
0 | 0 | conf/sigmod/AbadiC02 | 191915 | Daniel J. Abadi, Mitch Cherniack | Visual COKO: a debugger for query optimizer development | Michael J. Carey, David J. DeWitt, Michael J. Franklin, Nancy E. Hall, Mark L. McAuliffe, Jeffre... | Shoring up persistent applications |
1 | 1 | conf/sigmod/AbadiC02 | 191931 | Daniel J. Abadi, Mitch Cherniack | Visual COKO: a debugger for query optimizer development | Daniel J. Dietterich | DEC data distributor: for data replication and data warehousing |
2 | 2 | conf/sigmod/AbadiC02 | 233356 | Daniel J. Abadi, Mitch Cherniack | Visual COKO: a debugger for query optimizer development | Mitch Cherniack, Stanley B. Zdonik | Rule languages and internal algebras for rule-based optimizers |
3 | 3 | conf/sigmod/AbadiC02 | 276311 | Daniel J. Abadi, Mitch Cherniack | Visual COKO: a debugger for query optimizer development | Mitch Cherniack, Stan Zdonik | Changing the rules: transformations for rule-based optimizers |
4 | 4 | conf/sigmod/AbadiC02 | 335432 | Daniel J. Abadi, Mitch Cherniack | Visual COKO: a debugger for query optimizer development | Jianjun Chen, David J. DeWitt, Feng Tian, Yuan Wang | NiagaraCQ: a scalable continuous query system for Internet databases |
len(C)
14673
From the candidate set, a sample (for labeling purposes) can be obtained like this:
S = em.sample_table(C, 450)
# Label the sampled set
# Specify the name for the label column
G = em.label_table(S, 'gold_label')
Column name (gold_label) is not present in dataframe
The user must enter 0 for a non-match and 1 for a match. Typically, sampling and labeling are done in iterations (until the labeled sample has a sufficient density of matches). Once labeled, the data set will look like this:
# Assume that we have labeled the data and stored it in
# labeled_data_demo.csv
path_labeled_data = datasets_dir + os.sep + 'labeled_data_demo.csv'
G = em.read_csv_metadata(path_labeled_data, key='_id',
                         fk_ltable='ltable_id', fk_rtable='rtable_id',
                         ltable=A, rtable=B)
Metadata file is not present in the given path; proceeding to read the csv file.
G.head()
| | _id | ltable_id | rtable_id | ltable_title | ltable_authors | ltable_year | rtable_title | rtable_authors | rtable_year | label |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | l1223 | r498 | Dynamic Information Visualization | Yannis E. Ioannidis | 1996 | Dynamic information visualization | Yannis E. Ioannidis | 1996 | 1 |
1 | 1 | l1563 | r1285 | Dynamic Load Balancing in Hierarchical Parallel Database Systems | Luc Bouganim, Daniela Florescu, Patrick Valduriez | 1996 | Dynamic Load Balancing in Hierarchical Parallel Database Systems | Luc Bouganim, Daniela Florescu, Patrick Valduriez | 1996 | 1 |
2 | 2 | l1514 | r1348 | Query Processing and Optimization in Oracle Rdb | Gennady Antoshenkov, Mohamed Ziauddin | 1996 | prospector: a content-based multimedia server for massively parallel architectures | S. Choo, W. O'Connell, G. Linerman, H. Chen, K. Ganapathy, A. Biliris, E. Panagos, D. Schrader | 1996 | 0 |
3 | 3 | l206 | r1641 | An Asymptotically Optimal Multiversion B-Tree | Thomas Ohler, Peter Widmayer, Bruno Becker, Stephan Gschwind, Bernhard Seeger | 1996 | A complete temporal relational algebra | Debabrata Dey, Terence M. Barron, Veda C. Storey | 1996 | 0 |
4 | 4 | l1589 | r495 | Evaluating Probabilistic Queries over Imprecise Data | Reynold Cheng, Dmitri V. Kalashnikov, Sunil Prabhakar | 2003 | Evaluating probabilistic queries over imprecise data | Reynold Cheng, Dmitri V. Kalashnikov, Sunil Prabhakar | 2003 | 1 |
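Since sampling and labeling are repeated until the labeled set contains enough matches, it can help to check the label values and the match density after each round. The following is a minimal sketch of such a check, not part of py_entitymatching: the helper names `match_density` and `needs_more_labels` and the `min_density=0.2` threshold are assumptions chosen for illustration. The helpers accept any iterable of labels, so a column such as `G['label']` could be passed in directly.

```python
def match_density(labels):
    """Return the fraction of labels equal to 1 (match).

    `labels` is any iterable of label values, for example a list or a
    pandas Series. Every value must be 0 (non-match) or 1 (match).
    """
    values = list(labels)
    if not values:
        raise ValueError("no labels to evaluate")
    # Anything other than 0 or 1 (including missing values) is rejected.
    bad = [v for v in values if v not in (0, 1)]
    if bad:
        raise ValueError(f"labels must be 0 or 1, found: {bad}")
    return sum(values) / len(values)


def needs_more_labels(labels, min_density=0.2):
    """Return True if the match density is below `min_density`.

    The default threshold is a hypothetical value, not a recommendation
    from the library; choose one that suits your data.
    """
    return match_density(labels) < min_density


# Labels of the five rows shown in the G.head() output above.
demo_labels = [1, 1, 0, 0, 1]
print(match_density(demo_labels))
print(needs_more_labels(demo_labels))
```

If `needs_more_labels` returns True, one option is to draw another sample from the candidate set, label it, and combine it with the pairs labeled so far before checking again.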