This IPython notebook illustrates how to down sample two large tables that have been loaded into memory.
import py_entitymatching as em
Down sampling is typically done when the input tables are large (e.g., each containing more than 100K tuples). For the purposes of this notebook we will use two large datasets: Citeseer and DBLP. You can download the Citeseer dataset from http://pages.cs.wisc.edu/~anhai/data/falcon_data/citations/citeseer.csv and the DBLP dataset from http://pages.cs.wisc.edu/~anhai/data/falcon_data/citations/dblp.csv. Once downloaded, save them as 'citeseer.csv' and 'dblp.csv' in the current directory.
# Read the CSV files
A = em.read_csv_metadata('./citeseer.csv',low_memory=False) # setting the parameter low_memory to False to speed up loading.
B = em.read_csv_metadata('./dblp.csv', low_memory=False)
len(A), len(B)
(1823978, 2512927)
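With tables this large, it can be useful to check how much memory they actually occupy before deciding to down sample. pandas can report this directly; the helper below is our own convenience wrapper, not part of py_entitymatching:

```python
import pandas as pd

def table_size_mb(df: pd.DataFrame) -> float:
    """Total in-memory size of a DataFrame in megabytes.
    deep=True counts the actual string contents of object
    columns, not just the 8-byte pointers to them."""
    return df.memory_usage(deep=True).sum() / 1e6

# For example: table_size_mb(A), table_size_mb(B)
```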
A.head()
|   | id | title | authors | journal | month | year | publication_type |
|---|---|---|---|---|---|---|---|
| 0 | 1 | An Arithmetic Analogue of Bezouts Theorem | David Mckinnon | NaN | NaN | NaN | NaN |
| 1 | 2 | Thompsons Group F is Not Minimally Almost Convex | James Belk, Kai-uwe Bux | NaN | NaN | 2002.0 | NaN |
| 2 | 3 | Cognitive Dimensions Tradeoffs in Tangible User Interface Design | Darren Edge, Alan Blackwell | NaN | NaN | NaN | NaN |
| 3 | 4 | ACTIVITY NOUNS, UNACCUSATIVITY, AND ARGUMENT MARKING IN YUKATEKAN SSILA meeting; Special Session... | J. Bohnemeyer, Max Planck, I. Introduction | NaN | NaN | 2002.0 | NaN |
| 4 | 5 | PS1-6 A6 ULTRASOUND-GUIDED HIFU NEUROLYSIS OF PERIPHERAL NERVES TO TREAT SPASTICITY AND | J. L. Foley, J. W. Little, F. L. Starr Iii, C. Frantz | NaN | NaN | NaN | NaN |
B.head()
|   | id | title | authors | journal | month | year | publication_type |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Klaus Tschira Stiftung gemeinntzige GmbH, KTS | Klaus Tschira | NaN | NaN | 2012 | www |
| 1 | 2 | The SGML/XML Web Page | Robin Cover | NaN | NaN | 2006 | www |
| 2 | 3 | The Future of Classic Data Administration: Objects + Databases + CASE | Arnon Rosenthal | NaN | NaN | 1998 | www |
| 3 | 4 | XML Query Data Model | Mary F. Fernandez, Jonathan Robie | NaN | NaN | 2001 | www |
| 4 | 5 | The XML Query Algebra | Peter Fankhauser, Mary F. Fernndez, Ashok Malhotra, Michael Rys, Jrme Simon, Philip Wadler | NaN | NaN | 2001 | www |
# Set 'id' as the keys to the input tables
em.set_key(A, 'id')
em.set_key(B, 'id')
True
# Display the keys
em.get_key(A), em.get_key(B)
('id', 'id')
# Downsample the datasets
sample_A, sample_B = em.down_sample(A, B, size=1000, y_param=1)
In the down_sample command, size is the number of tuples to be sampled from B (i.e., the size of the sampled B table), and y_param is the number of matching tuples to pick from A for each sampled tuple. Above, we set size to 1000 and y_param to 1, meaning that for each tuple sampled from B, one matching tuple is picked from A.
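To make the roles of these two parameters concrete, here is a toy re-implementation of the down-sampling idea in plain pandas: sample size tuples from B, then, for each sampled tuple, use a word-level inverted index over A to pick the y_param tuples of A that share the most words with it. This is only an illustrative sketch (the function name and details are ours, not the library's implementation, which is more careful about things like very frequent words):

```python
from collections import Counter, defaultdict

import pandas as pd

def toy_down_sample(A, B, size, y_param, seed=0):
    """Illustrative sketch of down sampling (NOT em.down_sample):
    sample `size` tuples from B, then pick, for each of them, the
    `y_param` tuples of A sharing the most words with it."""
    # Concatenate each A tuple into one lowercase string and build
    # an inverted index: word -> set of A row positions.
    a_text = A.astype(str).agg(' '.join, axis=1).str.lower()
    index = defaultdict(set)
    for pos, text in enumerate(a_text):
        for word in text.split():
            index[word].add(pos)
    # Randomly sample `size` tuples from B.
    sample_B = B.sample(n=min(size, len(B)), random_state=seed)
    b_text = sample_B.astype(str).agg(' '.join, axis=1).str.lower()
    # Probe the index with each sampled B tuple and keep the
    # y_param A tuples hit by the most shared words.
    picked = set()
    for text in b_text:
        hits = Counter()
        for word in set(text.split()):
            for pos in index.get(word, ()):
                hits[pos] += 1
        picked.update(pos for pos, _ in hits.most_common(y_param))
    sample_A = A.iloc[sorted(picked)]
    return sample_A, sample_B
```

Note that sample_B always has exactly size tuples, while sample_A has at most size × y_param tuples (fewer when different B tuples match the same A tuples, or when a B tuple matches nothing).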
# Display the lengths of the sampled datasets
len(sample_A), len(sample_B)
The input tables A and B (with 1.8M and 2.5M tuples) have now been down sampled to the much smaller tables sample_A and sample_B.
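After down sampling, you will typically want to persist sample_A and sample_B so that later steps of the workflow (blocking, matching) can reload them without repeating this step. Here is a minimal sketch using plain pandas; the helper name and file paths are only examples:

```python
import pandas as pd

def save_samples(sample_A: pd.DataFrame, sample_B: pd.DataFrame,
                 a_path: str = './sample_citeseer.csv',
                 b_path: str = './sample_dblp.csv') -> None:
    """Write the down-sampled tables to CSV (example helper;
    the default paths are placeholders). index=False keeps 'id'
    as the key column instead of writing the pandas row index."""
    sample_A.to_csv(a_path, index=False)
    sample_B.to_csv(b_path, index=False)
```

If you want the key metadata saved alongside the data, py_entitymatching's own CSV-writing helper can be used instead of plain to_csv.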