Trevor Muñoz
18 August 2013
from google.refine import refine, facet
import pystache
This client library assumes that Open Refine is installed and that the Refine server application is running. To create a connection to the server:
server = refine.RefineServer()
grefine = refine.Refine(server)
Now that we have a connection to the server from our application we can make calls that replicate the functionality available in the standard graphical user interface (GUI). For instance, I can list the projects I have in my copy of Refine (ymmv). In this case, I have just one project at the moment, containing the 8/1/2013 data release from NYPL. The list_projects command returns a JSON response containing metadata about my project(s). To do any work, I need the project identifier (2310205155087) so I can open the relevant project:
grefine.list_projects()
nypl_dishes = grefine.open_project('2310205155087')
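Rather than copying the identifier by hand, it could be looked up by project name. This is only a minimal sketch: it assumes the response behaves like a dictionary keyed by project identifier, with each value holding a 'name' field. That shape, the sample data, and the helper find_project_id are all my assumptions for illustration, not confirmed behavior of the client library.

```python
def find_project_id(projects, name):
    """Return the identifier of the first project whose 'name' matches, else None."""
    for project_id, metadata in projects.items():
        if metadata.get('name') == name:
            return project_id
    return None

# Hypothetical stand-in for whatever grefine.list_projects() returns;
# the project name below is invented.
sample_projects = {
    '2310205155087': {'name': 'NYPL dishes (hypothetical name)'},
}

project_id = find_project_id(sample_projects, 'NYPL dishes (hypothetical name)')
print(project_id)
```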
Now I can issue commands to facet the values in the 'name' column. To demonstrate that the failure I experienced using the standard GUI is a bottleneck in the frontend code and not the server code that is actually computing the data facets, I'll time the execution of the facet command:
name_facet = facet.TextFacet('name')
%%timeit
facet_response = nypl_dishes.compute_facets(name_facet)
At 7.68 seconds for best time, this command is not amazingly quick but neither is the execution so slow that the whole application should grind to a halt.
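Outside of IPython, roughly the same "best of several runs" measurement can be taken with the standard library's timeit module. In this sketch compute_facets_stand_in is a hypothetical placeholder for the real call to nypl_dishes.compute_facets(name_facet), so that the example runs without a Refine server.

```python
import timeit

def compute_facets_stand_in():
    # Hypothetical placeholder for nypl_dishes.compute_facets(name_facet)
    return sum(range(10000))

# Three trials of one call each; report the fastest trial
timings = timeit.repeat(compute_facets_stand_in, repeat=3, number=1)
best = min(timings)
print('best of 3: %.4f seconds' % best)
```

Taking the minimum rather than the mean is the usual choice here, since slower trials mostly reflect other activity on the machine.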
To check that we're getting the same behavior with this command as in the GUI, let's see if the number of calculated facets is the same (370,004):
facet_response = nypl_dishes.compute_facets(name_facet)
facets = facet_response.facets[0]
From this object we can access a dictionary called 'choices' that contains the facets of the data along with their associated counts. The number of keys in this dictionary is the number of facets.
len(facets.choices.keys())
Now I can display the list of unique values for dish 'name' in descending order of their raw count. This is the same information that would appear in one of the boxes of the sidebar of the Refine GUI if it didn't choke. For the sake of space, I'll only output the first 25 here:
for k in sorted(facets.choices, key=lambda k: facets.choices[k].count, reverse=True)[:25]:
    print facets.choices[k].count, k
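With several hundred thousand distinct values, sorting the whole dictionary just to show the top few is more work than necessary. One alternative is heapq.nlargest from the standard library, which returns the same leading entries without a full sort. The Choice class and the counts below are invented stand-ins for the facet choices, modeling only the 'count' attribute used above.

```python
import heapq

class Choice(object):
    """Hypothetical stand-in for one facet choice; only 'count' is modeled."""
    def __init__(self, count):
        self.count = count

# Invented counts, for illustration only
choices = {
    'Coffee': Choice(8000),
    'Tea': Choice(4500),
    'Milk': Choice(3000),
    'Celery': Choice(4200),
}

# Keys of the three largest entries, ordered by descending count
top_three = heapq.nlargest(3, choices, key=lambda k: choices[k].count)
for name in top_three:
    print('%d\t%s' % (choices[name].count, name))
```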
In the GUI, the order of operations is to calculate the facets for a column then generate clusters. From the perspective of this little script, the dependency is not clear. In any case, it's now also possible to calculate and inspect the output of Refine's clustering functionality operating on the 'name' column. At first we'll just use the default binning clusterer:
cluster_response = nypl_dishes.compute_clusters('name')
Again, what is returned is a JSON response with information about the clusters found in the data. In the GUI, this is turned into a modal overlay where users can select the best value to represent a cluster and then normalize batches of values at the same time.
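To give a feel for what a binning clusterer does, here is a simplified sketch of key-collision clustering with a fingerprint-style key. I'm assuming the default behaves along these lines; Refine's own fingerprint function does more (for example, normalizing accented characters), so treat this as an approximation rather than the server's implementation. The function names and sample counts are invented, and the output shape (lists of dictionaries with 'value' and 'count') is chosen to resemble the cluster response.

```python
import re
import string
from collections import defaultdict

def fingerprint(value):
    """Simplified key: trim, lowercase, drop ASCII punctuation, sort unique tokens."""
    pattern = '[%s]' % re.escape(string.punctuation)
    cleaned = re.sub(pattern, '', value.strip().lower())
    return ' '.join(sorted(set(cleaned.split())))

def cluster_by_fingerprint(counts):
    """Group values that share a key; keep only groups with more than one member."""
    bins = defaultdict(list)
    for value, count in counts.items():
        bins[fingerprint(value)].append({'value': value, 'count': count})
    clusters = []
    for members in bins.values():
        if len(members) > 1:
            # Most frequent spelling first
            clusters.append(sorted(members, key=lambda m: m['count'], reverse=True))
    return clusters

# Invented counts, for illustration only
sample_counts = {'Coffee, black': 12, 'coffee black': 3, 'Black Coffee.': 5, 'Tea': 40}
clusters = cluster_by_fingerprint(sample_counts)
print(clusters)
```

All three coffee spellings reduce to the key 'black coffee' and so land in one bin, while 'Tea' has no collisions and is dropped.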
We can see how many clusters were generated in this first pass:
len(cluster_response)
And we can inspect the data in the clusters (for space, only the first five are shown):
for cluster in cluster_response[:5]:
    print '\n'
    for line in cluster:
        print(pystache.render('{{count}} \t {{value}}', line))
Directly inspecting the clusters produced by Refine shows that there is great potential for improving the quality of the data by normalizing names of dishes.
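As a rough sketch of what that normalization could look like, the function below maps every value in a cluster to the cluster's most frequent spelling. Picking the most frequent value is only a heuristic I'm assuming for illustration (in the GUI a person makes that choice), and build_normalization_map and the sample cluster are invented. The clusters are assumed to be lists of dictionaries with 'count' and 'value' keys, as in the loop above.

```python
def build_normalization_map(clusters):
    """Map every value in a cluster to that cluster's most frequent value."""
    mapping = {}
    for cluster in clusters:
        canonical = max(cluster, key=lambda member: member['count'])['value']
        for member in cluster:
            mapping[member['value']] = canonical
    return mapping

# Invented cluster, for illustration only
sample_clusters = [
    [{'value': 'Coffee, black', 'count': 12}, {'value': 'coffee black', 'count': 3}],
]
normalization = build_normalization_map(sample_clusters)
print(normalization)
```

The resulting dictionary could be reviewed by hand before any values are actually rewritten.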