This IPython notebook illustrates how to perform matching using the rule-based matcher.
First, we need to import the py_entitymatching package and other libraries as follows:
# Import py_entitymatching package
import py_entitymatching as em
import os
import pandas as pd
Then, read the (sample) input tables for matching purposes.
# Get the datasets directory
datasets_dir = em.get_install_path() + os.sep + 'datasets'
path_A = datasets_dir + os.sep + 'dblp_demo.csv'
path_B = datasets_dir + os.sep + 'acm_demo.csv'
path_labeled_data = datasets_dir + os.sep + 'labeled_data_demo.csv'
A = em.read_csv_metadata(path_A, key='id')
B = em.read_csv_metadata(path_B, key='id')
# Load the pre-labeled data
S = em.read_csv_metadata(path_labeled_data,
                         key='_id',
                         ltable=A, rtable=B,
                         fk_ltable='ltable_id', fk_rtable='rtable_id')
S.head()
 | _id | ltable_id | rtable_id | ltable_title | ltable_authors | ltable_year | rtable_title | rtable_authors | rtable_year | label |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | l1223 | r498 | Dynamic Information Visualization | Yannis E. Ioannidis | 1996 | Dynamic information visualization | Yannis E. Ioannidis | 1996 | 1 |
1 | 1 | l1563 | r1285 | Dynamic Load Balancing in Hierarchical Parallel Database Systems | Luc Bouganim, Daniela Florescu, Patrick Valduriez | 1996 | Dynamic Load Balancing in Hierarchical Parallel Database Systems | Luc Bouganim, Daniela Florescu, Patrick Valduriez | 1996 | 1 |
2 | 2 | l1514 | r1348 | Query Processing and Optimization in Oracle Rdb | Gennady Antoshenkov, Mohamed Ziauddin | 1996 | prospector: a content-based multimedia server for massively parallel architectures | S. Choo, W. O'Connell, G. Linerman, H. Chen, K. Ganapathy, A. Biliris, E. Panagos, D. Schrader | 1996 | 0 |
3 | 3 | l206 | r1641 | An Asymptotically Optimal Multiversion B-Tree | Thomas Ohler, Peter Widmayer, Bruno Becker, Stephan Gschwind, Bernhard Seeger | 1996 | A complete temporal relational algebra | Debabrata Dey, Terence M. Barron, Veda C. Storey | 1996 | 0 |
4 | 4 | l1589 | r495 | Evaluating Probabilistic Queries over Imprecise Data | Reynold Cheng, Dmitri V. Kalashnikov, Sunil Prabhakar | 2003 | Evaluating probabilistic queries over imprecise data | Reynold Cheng, Dmitri V. Kalashnikov, Sunil Prabhakar | 2003 | 1 |
Then, split the labeled data into a development set (I) and an evaluation set (J), and create the rule-based matcher.
# Split S into I and J
IJ = em.split_train_test(S, train_proportion=0.5, random_state=0)
I = IJ['train']
J = IJ['test']
brm = em.BooleanRuleMatcher()
Next, we need to create a set of features for the development set. Magellan provides a way to automatically generate features based on the attributes in the input tables. For the purposes of this guide, we use the automatically generated features.
# Generate a set of features
F = em.get_features_for_matching(A, B, validate_inferred_attr_types=False)
We observe that 20 features were generated. The rules written below use only a few of them: the title, authors, and year features.
F.feature_name
0     id_id_lev_dist
1     id_id_lev_sim
2     id_id_jar
3     id_id_jwn
4     id_id_exm
5     id_id_jac_qgm_3_qgm_3
6     title_title_jac_qgm_3_qgm_3
7     title_title_cos_dlm_dc0_dlm_dc0
8     title_title_mel
9     title_title_lev_dist
10    title_title_lev_sim
11    authors_authors_jac_qgm_3_qgm_3
12    authors_authors_cos_dlm_dc0_dlm_dc0
13    authors_authors_mel
14    authors_authors_lev_dist
15    authors_authors_lev_sim
16    year_year_exm
17    year_year_anm
18    year_year_lev_dist
19    year_year_lev_sim
Name: feature_name, dtype: object
Before we can use the rule-based matcher, we need to create rules to evaluate tuple pairs. Each rule is a list of strings. Each string specifies one predicate, and the list as a whole is the conjunction of those predicates. Each predicate has three parts: (1) an expression, (2) a comparison operator, and (3) a value. The expression is evaluated over a tuple pair, producing a numeric value that is then compared against the given value.
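To make the three-part structure of a predicate concrete, here is a minimal pure-Python sketch that splits a predicate string into its expression, comparison operator, and value, and then applies the comparison to an already computed feature value. The helper names `split_predicate` and `evaluate_predicate` and the set of accepted operators are assumptions made for illustration; this is not how py_entitymatching parses rules internally.

```python
import operator
import re

# Comparison operators accepted by this sketch (an assumption, not the
# library's documented list).
_OPS = {
    '>=': operator.ge, '<=': operator.le, '==': operator.eq,
    '!=': operator.ne, '>': operator.gt, '<': operator.lt,
}

# expression, then operator, then a numeric value.
# Two-character operators are listed first so '>=' is not read as '>'.
_PREDICATE = re.compile(
    r'^\s*(?P<expr>.+?)\s*(?P<op>>=|<=|==|!=|>|<)\s*'
    r'(?P<value>-?\d+(?:\.\d+)?)\s*$'
)


def split_predicate(predicate):
    """Return (expression, operator, value) for one predicate string."""
    match = _PREDICATE.match(predicate)
    if match is None:
        raise ValueError('not a predicate: %r' % predicate)
    return match.group('expr'), match.group('op'), float(match.group('value'))


def evaluate_predicate(predicate, feature_value):
    """Compare a precomputed feature value against the predicate's value."""
    _, op, value = split_predicate(predicate)
    return _OPS[op](feature_value, value)


parts = split_predicate('title_title_lev_sim(ltuple, rtuple) > 0.4')
print(parts)
```

In this sketch the expression is only extracted as text; the caller supplies the numeric value that the feature function would have produced for a tuple pair.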
# Add two rules to the rule-based matcher
# The first rule has two predicates, one comparing the titles and the other looking for an exact match of the years
brm.add_rule(['title_title_lev_sim(ltuple, rtuple) > 0.4', 'year_year_exm(ltuple, rtuple) == 1'], F)
# This second rule compares the authors
brm.add_rule(['authors_authors_lev_sim(ltuple, rtuple) > 0.4'], F)
brm.get_rule_names()
['_rule_0', '_rule_1']
# Rules can also be deleted from the rule-based matcher
brm.delete_rule('_rule_1')
True
Now that our rule-based matcher has some rules, we can use it to predict whether a tuple pair is a match. Each rule is a conjunction of predicates and returns True only if all of its predicates return True. The matcher is a disjunction of rules: if any one of the rules returns True, the tuple pair is predicted to be a match.
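The conjunction and disjunction semantics can be sketched in a few lines of plain Python. The functions `rule_fires` and `matcher_predicts` are hypothetical names, and the feature values in `pair` are invented for illustration rather than computed from the tables; the sketch only mirrors the logic described above and is not the library's implementation.

```python
def rule_fires(feature_values, rule):
    """A rule is a conjunction: every predicate in it must hold."""
    return all(test(feature_values[name]) for name, test in rule)


def matcher_predicts(feature_values, rules):
    """The matcher is a disjunction: one firing rule is enough for a match."""
    return int(any(rule_fires(feature_values, rule) for rule in rules))


# The two rules added earlier, written as (feature name, test) pairs.
rule_0 = [
    ('title_title_lev_sim', lambda v: v > 0.4),
    ('year_year_exm', lambda v: v == 1),
]
rule_1 = [('authors_authors_lev_sim', lambda v: v > 0.4)]

# Invented feature values for one tuple pair: dissimilar titles,
# same year, similar authors.
pair = {
    'title_title_lev_sim': 0.2,
    'year_year_exm': 1,
    'authors_authors_lev_sim': 0.9,
}

print(matcher_predicts(pair, [rule_0, rule_1]))  # rule_1 fires
print(matcher_predicts(pair, [rule_0]))          # no rule fires
```

With both rules present the pair is predicted as a match because the authors rule fires; once the authors rule is removed, as was done with `delete_rule`, only the title-and-year rule remains and the same pair is predicted as a non-match.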
brm.predict(S, target_attr='pred_label', append=True)
S
 | _id | ltable_id | rtable_id | ltable_title | ltable_authors | ltable_year | rtable_title | rtable_authors | rtable_year | label | pred_label |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | l1223 | r498 | Dynamic Information Visualization | Yannis E. Ioannidis | 1996 | Dynamic information visualization | Yannis E. Ioannidis | 1996 | 1 | 1 |
1 | 1 | l1563 | r1285 | Dynamic Load Balancing in Hierarchical Parallel Database Systems | Luc Bouganim, Daniela Florescu, Patrick Valduriez | 1996 | Dynamic Load Balancing in Hierarchical Parallel Database Systems | Luc Bouganim, Daniela Florescu, Patrick Valduriez | 1996 | 1 | 1 |
2 | 2 | l1514 | r1348 | Query Processing and Optimization in Oracle Rdb | Gennady Antoshenkov, Mohamed Ziauddin | 1996 | prospector: a content-based multimedia server for massively parallel architectures | S. Choo, W. O'Connell, G. Linerman, H. Chen, K. Ganapathy, A. Biliris, E. Panagos, D. Schrader | 1996 | 0 | 0 |
3 | 3 | l206 | r1641 | An Asymptotically Optimal Multiversion B-Tree | Thomas Ohler, Peter Widmayer, Bruno Becker, Stephan Gschwind, Bernhard Seeger | 1996 | A complete temporal relational algebra | Debabrata Dey, Terence M. Barron, Veda C. Storey | 1996 | 0 | 0 |
4 | 4 | l1589 | r495 | Evaluating Probabilistic Queries over Imprecise Data | Reynold Cheng, Dmitri V. Kalashnikov, Sunil Prabhakar | 2003 | Evaluating probabilistic queries over imprecise data | Reynold Cheng, Dmitri V. Kalashnikov, Sunil Prabhakar | 2003 | 1 | 1 |
5 | 5 | l43 | r1415 | Optimization of Run-time Management of Data Intensive Web-sites | Khaled Yagoub, Dan Suciu, Alon Y. Levy, Daniela Florescu | 1999 | On random sampling over joins | Surajit Chaudhuri, Rajeev Motwani, Vivek Narasayya | 1999 | 0 | 0 |
6 | 6 | l1466 | r1348 | Access Path Support for Referential Integrity in SQL2 | Joachim Reinert, Theo Hrder | 1996 | prospector: a content-based multimedia server for massively parallel architectures | S. Choo, W. O'Connell, G. Linerman, H. Chen, K. Ganapathy, A. Biliris, E. Panagos, D. Schrader | 1996 | 0 | 0 |
7 | 7 | l1535 | r1800 | Mariposa: A Wide-Area Distributed Database System | Carl Staelin, Paul M. Aoki, Witold Litwin, Michael Stonebraker, Adam Sah, Jeff Sidell, Andrew Yu... | 1996 | Further Improvements on Integrity Constraint Checking for Stratifiable Deductive Databases | Sin Yeung Lee, Tok Wang Ling | 1996 | 0 | 0 |
8 | 8 | l1317 | r1676 | QuickStore: A High Performance Mapped Object Store | David J. DeWitt, Seth J. White | 1994 | An Overview of Repository Technology | Philip A. Bernstein, Umeshwar Dayal | 1994 | 0 | 0 |
9 | 9 | l621 | r175 | Communication Efficient Distributed Mining of Association Rules | Ran Wolff, Assaf Schuster | 2001 | Editorial | Richard Snodgrass | 2001 | 0 | 0 |
10 | 10 | l668 | r1694 | Indexing Multimedia Databases (Tutorial) | Christos Faloutsos | 1995 | Information finding in a digital library: the Stanford perspective | Tak W. Yan, Héctor García-Molina | 1995 | 0 | 0 |
11 | 11 | l1189 | r1674 | Weimin Du, Xiangning Liu, Abdelsalam Helal | Multiview Access Protocols for Large-Scale Replication | 1998 | Multiview access protocols for large-scale replication | Xiangning Liu, Abdelsalam Helal, Weimin Du | 1998 | 1 | 0 |
12 | 12 | l1657 | r110 | Semantic B2B Integration | Christoph Bussler | 2001 | Monitoring business processes through event correlation based on dependency model | Asaf Adii, David Botzer, Opher Etzion, Tali Yatzkar-Haham | 2001 | 0 | 0 |
13 | 13 | l1490 | r599 | Extracting Large Data Sets using DB2 Parallel Edition | Sriram Padmanabhan | 1996 | Extracting Large Data Sets using DB2 Parallel Edition | Sriram Padmanabhan | 1996 | 1 | 1 |
14 | 14 | l595 | r87 | Of Crawlers, Portals, Mice and Men: Is there more to Mining the Web? (Panel) | Kyuseok Shim, Rajeev Rastogi, Minos N. Garofalakis, Sridhar Ramaswamy | 1999 | Of crawlers, portals, mice, and men: is there more to mining the Web? | Minos N. Garofalakis, Sridhar Ramaswamy, Rajeev Rastogi, Kyuseok Shim | 1999 | 1 | 1 |
15 | 15 | l380 | r1337 | Outerjoin Simplification and Reordering for Query Optimization | Csar A. Galindo-Legaria, Arnon Rosenthal | 1997 | Outerjoin simplification and reordering for query optimization | César Galindo-Legaria, Arnon Rosenthal | 1997 | 1 | 1 |
16 | 16 | l165 | r1118 | Cache-and-Query for Wide Area Sensor Databases | Phillip B. Gibbons, Srinivasan Seshan, Suman Kumar Nath, Amol Deshpande | 2003 | Cache-and-query for wide area sensor databases | Amol Deshpande, Suman Nath, Phillip B. Gibbons, Srinivasan Seshan | 2003 | 1 | 1 |
17 | 17 | l796 | r588 | Generating Dynamic Content at Database-Backed Web Servers: cgi-bin vs. mod_perl | Alexandros Labrinidis, Nick Roussopoulos | 2000 | Novel Approaches in Query Processing for Moving Object Trajectories | Dieter Pfoser, Christian S. Jensen, Yannis Theodoridis | 2000 | 0 | 0 |
18 | 18 | l1160 | r1733 | Khaled Alsabti, Vineet Singh, Sanjay Ranka | A One-Pass Algorithm for Accurately Estimating Quantiles for Disk-Resident Data | 1997 | A One-Pass Algorithm for Accurately Estimating Quantiles for Disk-Resident Data | Khaled Alsabti, Sanjay Ranka, Vineet Singh | 1997 | 1 | 0 |
19 | 19 | l1752 | r3 | SHORE: Combining the Best Features of OODBMS and File Systems | Shore Team | 1995 | The LyriC language: querying constraint objects | Alexander Brodsky, Yoram Kornatzky | 1995 | 0 | 0 |
20 | 20 | l1647 | r945 | Cost Based Query Scrambling for Initial Delays | Tolga Urhan, Michael J. Franklin, Laurent Amsaleg | 1998 | The Cubetree Storage Organization | Nick Roussopoulos, Yannis Kotidis | 1998 | 0 | 0 |
21 | 21 | l1135 | r1127 | Sampling-Based Estimation of the Number of Distinct Values of an Attribute | Peter J. Haas, Lynne Stokes, S. Seshadri, Jeffrey F. Naughton | 1995 | View maintenance in a warehousing environment | Yue Zhuge, Héctor García-Molina, Joachim Hammer, Jennifer Widom | 1995 | 0 | 0 |
22 | 22 | l1776 | r987 | Walking Through a Very Large Virtual Environment in Real-time | Yixin Ruan, Kian-Lee Tan, Jason Chionh, Lidan Shou, Zhiyong Huang | 2001 | Walking Through a Very Large Virtual Environment in Real-time | Lidan Shou, Jason Chionh, Zhiyong Huang, Yixin Ruan, Kian-Lee Tan | 2001 | 1 | 1 |
23 | 23 | l676 | r1395 | Datawarehousing Has More Colours Than Just Black & White | Thomas Zurek, Markus Sinnwell | 1999 | Datawarehousing Has More Colours Than Just Black &; White | Thomas Zurek, Markus Sinnwell | 1999 | 1 | 1 |
24 | 24 | l1087 | r648 | The Grid: An Application of the Semantic Web | Carole A. Goble, David De Roure | 2002 | An XML query engine for network-bound data | Zachary G. Ives, A. Y. Halevy, D. S. Weld | 2002 | 0 | 0 |
25 | 25 | l629 | r1478 | Engineering Federated Information Systems: Report of EFIS '99 Workshop | Flix Saltor, Uwe Hohenstein, Ralf-Detlef Kutsche, Wilhelm Hasselbring, Gunter Saake, Stefan Conr... | 1999 | Engineering federated information systems: report of EEFIS '99 workshop | S. Conrad, W. Hasselbring, U. Hohenstein, R.-D. Kutsche, M. Roantree, G. Saake, F. Saltor | 1999 | 1 | 1 |
26 | 26 | l649 | r1366 | Random Sampling for Histogram Construction: How much is enough? | Vivek R. Narasayya, Rajeev Motwani, Surajit Chaudhuri | 1998 | Random sampling for histogram construction: how much is enough? | Surajit Chaudhuri, Rajeev Motwani, Vivek Narasayya | 1998 | 1 | 1 |
27 | 27 | l211 | r1490 | BeSS: Storage Support for Interactive Visualization Systems | William O'Connell, Thomas A. Funkhouser, Alexandros Biliris, Euthimios Panagos | 1996 | BeSS: storage support for interactive visualization systems | A. Biliris, T. A. Funkhouser, W. O'Connell, E. Panagos | 1996 | 1 | 1 |
28 | 28 | l734 | r384 | Min-Max Compression Methods for Medical Image Databases | John M. Tyler, Kosmas Karadimitriou | 1997 | Min-max compression methods for medical image databases | Kosmas Karadimitriou, John M. Tyler | 1997 | 1 | 1 |
29 | 29 | l611 | r141 | Mining Generalized Association Rules | Ramakrishnan Srikant, Rakesh Agrawal | 1995 | Multi-table joins through bitmapped join indices | Patrick O'Neil, Goetz Graefe | 1995 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
420 | 420 | l834 | r883 | Estimating the Selectivity of XML Path Expressions for Internet Scale Applications | Ashraf Aboulnaga, Jeffrey F. Naughton, Alaa R. Alameldeen | 2001 | Estimating the Selectivity of XML Path Expressions for Internet Scale Applications | Ashraf Aboulnaga, Alaa R. Alameldeen, Jeffrey F. Naughton | 2001 | 1 | 1 |
421 | 421 | l746 | r301 | Providing Database Migration Tools - A Practicioner's Approach | Andreas Meier | 1995 | Providing Database Migration Tools - A Practicioner's Approach | Andreas Meier | 1995 | 1 | 1 |
422 | 422 | l1332 | r619 | Workshop on Workflow Management in Scientific and Engineering Applications - Report | Gottfried Vossen, Richard McClatchey | 1997 | Workshop on workflow management in scientific and engineering applications-report | R. McClatchey, G. Vossen | 1997 | 1 | 1 |
423 | 423 | l942 | r1473 | Research in Databases and Data-Intensive Applications - Computer Science Department and FZI, Uni... | Birgitta Knig-Ries, Peter C. Lockemann | 1997 | Research in databases and data-intensive applications: Computer Science Dept. and FIZ, Universit... | Brigitta König-Ries, Peter C. Lockermann | 1997 | 1 | 1 |
424 | 424 | l806 | r356 | Tribeca: A Stream Database Manager for Network Traffic Analysis | Mark Sullivan | 1996 | Type-safe relaxing of schema consistency rules for flexible modelling in OODBMS | Eric Amiel, Marie-Jo Bellosta, Eric Dujardin, Eric Simon | 1996 | 0 | 0 |
425 | 425 | l794 | r784 | Spatial Data Management for Computer Aided Design | Andreas Mller, Marco Ptke, Thomas Seidl, Hans-Peter Kriegel | 2001 | Dynamic content acceleration: a caching solution to enable scalable dynamic Web page generation | Anindya Datta, Kaushik Dutta, Krithi Ramamritham, Helen Thomas, Debra VanderMeer | 2001 | 0 | 0 |
426 | 426 | l28 | r1618 | Storage Technology: RAID and Beyond | Garth A. Gibson | 1995 | Tutorial on storage technology: RAID and beyond | Garth A. Gibson | 1995 | 1 | 1 |
427 | 427 | l1183 | r1409 | Stephen Blott, Roger Weber, Hans-Jrg Schek | A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional ... | 1998 | A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional ... | Roger Weber, Hans-Jörg Schek, Stephen Blott | 1998 | 1 | 0 |
428 | 428 | l1122 | r232 | Interview with Jim Gray | Marianne Winslett | 2003 | In-context peer-to-peer information filtering on the Web | Aris M. Ouksel | 2003 | 0 | 0 |
429 | 429 | l1430 | r1444 | Condition Handling in SQL Persistent Stored Modules | Jeff Richey | 1995 | Condition handling in SQL persistent stored modules | Jeff Richey | 1995 | 1 | 1 |
430 | 430 | l1494 | r1257 | The Mariposa Distributed Database Management System | Jeff Sidell | 1996 | Open issues in parallel query optimization | Waqar Hasan, Daniela Florescu, Patrick Valduriez | 1996 | 0 | 0 |
431 | 431 | l1592 | r439 | Report on the 18th British National Conference on Databases (BNCOD) | Carole A. Goble, Brian J. Read | 2002 | Contracting in the days of eBusiness | W. Hümmer, W. Lehner, H. Wedekind | 2002 | 0 | 0 |
432 | 432 | l1015 | r45 | Database Systems - Breaking Out of the Box | Abraham Silberschatz, Stanley B. Zdonik | 1997 | Dynamic Memory Adjustment for External Mergesort | Weiye Zhang, Per-Åke Larson | 1997 | 0 | 0 |
433 | 433 | l1147 | r1016 | Xiaolei Qian | Scientist's Called Upon to Take Actions | 1996 | Scientists called upon to take actions | Xiaolei Qian | 1996 | 1 | 0 |
434 | 434 | l1756 | r310 | ARIES/CSA: A Method for Database Recovery in Client-Server Architectures | C. Mohan, Inderpal Narang | 1994 | Enterprise information architectures-they're finally changing | Wesley P. Melling | 1994 | 0 | 0 |
435 | 435 | l1044 | r67 | Digital Library Services in Mobile Computing | Evaggelia Pitoura, Melliyal Annamalai, Bharat K. Bhargava | 1995 | Ordered shared locks for real-time databases | Divyakant Agrawal, Amr El Abbadi, Richard Jeffers, Lijing Lin | 1995 | 0 | 0 |
436 | 436 | l412 | r651 | Phoenix: Making Applications Robust | David B. Lomet, Roger S. Barga | 1999 | DataBlitz storage manager: main-memory database performance for critical applications | J. Baulier, P. Bohannon, S. Gogate, C. Gupta, S. Haldar | 1999 | 0 | 0 |
437 | 437 | l796 | r1808 | Generating Dynamic Content at Database-Backed Web Servers: cgi-bin vs. mod_perl | Alexandros Labrinidis, Nick Roussopoulos | 2000 | On wrapping query languages and efficient XML integration | Vassilis Christophides, Sophie Cluet, Jérǒme Simèon | 2000 | 0 | 0 |
438 | 438 | l1570 | r1468 | Instance-based attribute identification in database integration | Roger H. L. Chiang, Ee-Peng Lim, Chua Eng Huang Cecil | 2003 | Index-driven similarity search in metric spaces | Gisli R. Hjaltason, Hanan Samet | 2003 | 0 | 0 |
439 | 439 | l1577 | r688 | Data Mining Using Two-Dimensional Optimized Accociation Rules: Scheme, Algorithms, and Visualiza... | Shinichi Morishita, Yasuhiko Morimoto, Takeshi Tokuyama, Takeshi Fukuda | 1996 | Static detection of security flaws in object-oriented databases | Keishi Tajima | 1996 | 0 | 0 |
440 | 440 | l617 | r310 | Fine-Grained Sharing in a Page Server OODBMS | Michael J. Carey, Markos Zaharioudakis, Michael J. Franklin | 1994 | Enterprise information architectures-they're finally changing | Wesley P. Melling | 1994 | 0 | 0 |
441 | 441 | l1304 | r1178 | Query Rewriting for Semistructured Data | Vasilis Vassalos, Yannis Papakonstantinou | 1999 | The Aqua approximate query answering system | Swarup Acharya, Phillip B. Gibbons, Viswanath Poosala, Sridhar Ramaswamy | 1999 | 0 | 0 |
442 | 442 | l727 | r597 | Design and Analysis of Parametric Query Optimization Algorithms | Sumit Ganguly | 1998 | Incremental distance join algorithms for spatial databases | Gísli R. Hjaltason, Hanan Samet | 1998 | 0 | 0 |
443 | 443 | l1205 | r395 | Proxy-Server Architectures for OLAP | Panos Kalnis, Dimitris Papadias | 2001 | Proxy-server architectures for OLAP | Panos Kalnis, Dimitris Papadias | 2001 | 1 | 1 |
444 | 444 | l915 | r1532 | Efficient k-NN search on vertically decomposed data | Niels Nes, Martin L. Kersten, Nikos Mamoulis, Arjen P. de Vries | 2002 | Efficient k-NN search on vertically decomposed data | Arjen P. de Vries, Nikos Mamoulis, Niels Nes, Martin Kersten | 2002 | 1 | 1 |
445 | 445 | l365 | r53 | 50,000 Users on an Oracle8 Universal Server Database | Ashok Josji, Tirthankar Lahiri, Amit Jasuja, Sumanta Chatterjee | 1998 | A workflow-based electronic marketplace on the Web | Asuman Dogac, Ilker Durusoy, Sena Arpinar, Nesime Tatbul, Pinar Koksal, Ibrahim Cingil, Nazife D... | 1998 | 0 | 0 |
446 | 446 | l458 | r767 | Comparing Hierarchical Data in External Memory | Sudarshan S. Chawathe | 1999 | Context-Based Prefetch for Implementing Objects on Relations | Philip A. Bernstein, Shankar Pal, David Shutt | 1999 | 0 | 0 |
447 | 447 | l655 | r412 | The SDSS skyserver: public access to the sloan digital sky server data | Tanu Malik, Jordan Raddick, Alexander S. Szalay, Peter Z. Kunszt, Jim Gray, Christopher Stoughto... | 2002 | Report on the ACM fourth international workshop on data warehousing and OLAP (DOLAP 2001) | Joachim Hammer | 2002 | 0 | 0 |
448 | 448 | l123 | r1493 | Change-Centric Management of Versions in an XML Warehouse | Laurent Mignet, Amlie Marian, Gregory Cobena, Serge Abiteboul | 2001 | A Sequential Pattern Query Language for Supporting Instant Data Mining for e-Services | Reza Sadri, Carlo Zaniolo, Amir M. Zarkesh, Jafar Adibi | 2001 | 0 | 0 |
449 | 449 | l590 | r295 | Skew handling techniques in sort-merge join | Richard T. Snodgrass, Wei Li, Dengfeng Gao | 2002 | QURSED: querying and reporting semistructured data | Yannis Papakonstantinou, Michalis Petropoulos, Vasilis Vassalos | 2002 | 0 | 0 |
450 rows × 11 columns
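To see how the remaining rule could lead to the predictions above, the sketch below recomputes a title similarity for rows 0 and 11 using values copied from the table. It assumes that `title_title_lev_sim` behaves like a normalized Levenshtein similarity, 1 - distance / (length of the longer string), and that `year_year_exm` is an exact equality check; both are assumptions about the generated features, and the helper functions are written here for illustration only.

```python
def levenshtein_distance(s, t):
    """Edit distance with unit-cost insert, delete, and substitute."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,               # delete cs
                current[j - 1] + 1,            # insert ct
                previous[j - 1] + (cs != ct),  # substitute or keep
            ))
        previous = current
    return previous[-1]


def lev_sim(s, t):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(s, t) / longest


def rule_0(row):
    """Title similarity above 0.4 AND identical years."""
    return (lev_sim(row['ltable_title'], row['rtable_title']) > 0.4
            and row['ltable_year'] == row['rtable_year'])


# Values copied from rows 0 and 11 of the table above.
row_0 = {
    'ltable_title': 'Dynamic Information Visualization',
    'rtable_title': 'Dynamic information visualization',
    'ltable_year': 1996, 'rtable_year': 1996,
}
row_11 = {
    'ltable_title': 'Weimin Du, Xiangning Liu, Abdelsalam Helal',
    'rtable_title': 'Multiview access protocols for large-scale replication',
    'ltable_year': 1998, 'rtable_year': 1998,
}

for name, row in (('row 0', row_0), ('row 11', row_11)):
    sim = lev_sim(row['ltable_title'], row['rtable_title'])
    print(name, round(sim, 3), rule_0(row))
```

Under these assumptions, row 0 passes the rule because its titles differ only in letter case, while row 11 fails it: the `ltable_title` cell of that row appears to hold the author list, so the title comparison scores low even though the pair is labeled as a match.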