First we need an OpenRefine server running and the openrefine-client installed.
import os
if 'openrefineder' in os.environ.get('HOSTNAME', ''):
notebook = !jupyter notebook list | grep -o -E 'http\S+'
openrefine_url = notebook[0].replace('?token', 'openrefine?token')
openrefine_url = openrefine_url.replace('http://0.0.0.0:8888','')
from IPython.display import display, HTML
display(HTML('<a href="' + openrefine_url + '" target="blank">Click here to open OpenRefine</a>'))
Ensure you have an OpenRefine server running. Then install the OpenRefine client as follows.
pip install openrefine-client
Because we will store some files, it is cleaner to work in a new folder.
import os, datetime
path = os.path.join(os.path.expanduser('~'), datetime.datetime.now().strftime('%Y%m%d_%H%M%S'))
try:
    os.mkdir(path)
    os.chdir(path)
except OSError:
    print("Creation of the directory %s failed" % path)
else:
    print(os.getcwd())
from google.refine import cli
Download sample data
cli.download('https://git.io/fj5hF','duplicates.csv')
Import the file into OpenRefine (and store the returned project)
p1 = cli.create('duplicates.csv')
cli.ls()
cli.info(p1.project_id)
cli.export(p1.project_id)
Download a sample JSON file (its content was previously extracted via the Undo/Redo history in the OpenRefine graphical user interface)
cli.download('https://git.io/fj5ju','duplicates-deletion.json')
Apply the transformation rules
cli.apply(p1.project_id, 'duplicates-deletion.json')
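If you want to see what a transformation file will do before applying it, the operations JSON can be inspected with the standard library. The snippet below uses a minimal illustrative sample in OpenRefine's undo/redo format (it is not the actual content of duplicates-deletion.json); each operation carries an "op" key naming the OpenRefine operation.

```python
import json

# Hypothetical sample operations list in OpenRefine's undo/redo JSON format.
# (Not the actual content of duplicates-deletion.json.)
sample = '''[
  {"op": "core/column-reorder", "description": "Reorder columns",
   "columnNames": ["email", "name", "state", "gender", "purchase"]},
  {"op": "core/blank-down", "description": "Blank down cells in column email",
   "columnName": "email"}
]'''

operations = json.loads(sample)
for operation in operations:
    # Print the operation identifier and its human-readable description.
    print(operation["op"], "-", operation["description"])
```

With the real file you would pass `open('duplicates-deletion.json')` to `json.load` instead of parsing an inline string.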
Export the project to the terminal again
cli.export(p1.project_id)
Export data in Excel (.xls) format
cli.export(p1.project_id, 'deduped.xls')
cli.delete(p1.project_id)
Create another project from the example file above
p2 = cli.create('duplicates.csv')
The following example code exports the columns "name" and "purchase" in JSON format, for rows matching the regular expression text filter ^F$ in the column "gender".
cli.templating(p2.project_id,
prefix='''{ "events" : [
''',
template=' { "name" : {{jsonize(cells["name"].value)}}, "purchase" : {{jsonize(cells["purchase"].value)}} }',
rowSeparator=''',
''',
suffix='''
] }''',
filterQuery='^F$',
filterColumn='gender')
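To see why ^F$ selects only rows whose gender cell is exactly "F" (and not, say, "Female"), the same regular expression can be tried in plain Python:

```python
import re

pattern = re.compile('^F$')

# ^ and $ anchor the match to the whole cell value,
# so only the exact string "F" passes the filter.
for value in ['F', 'M', 'Female', 'F ']:
    print(repr(value), bool(pattern.search(value)))
```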
There is also an option to store the results in multiple files. Each file will contain the prefix, a processed row, and the suffix.
cli.templating(p2.project_id,
prefix='''{ "events" : [
''',
template=' { "name" : {{jsonize(cells["name"].value)}}, "purchase" : {{jsonize(cells["purchase"].value)}} }',
rowSeparator=''',
''',
suffix='''
] }''',
filterQuery='^F$',
filterColumn='gender',
output_file='advanced.json',
splitToFiles=True)
Filenames are suffixed with the row number by default (e.g. advanced_1.json, advanced_2.json, etc.). There is another option to use the value in the first column instead:
cli.templating(p2.project_id,
prefix='''{ "events" : [
''',
template=' { "name" : {{jsonize(cells["name"].value)}}, "purchase" : {{jsonize(cells["purchase"].value)}} }',
rowSeparator=''',
''',
suffix='''
] }''',
filterQuery='^F$',
filterColumn='gender',
output_file='advanced.json',
splitToFiles=True,
suffixById=True)
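The filename convention described above can be mimicked with a few lines of plain Python. This is only an illustration of the naming scheme, not the client's internal code:

```python
import os

def split_filename(output_file, suffix):
    # Insert the per-file suffix (row number or first-column value)
    # before the extension, e.g. advanced.json -> advanced_1.json.
    base, ext = os.path.splitext(output_file)
    return '%s_%s%s' % (base, suffix, ext)

print(split_filename('advanced.json', 1))
# -> advanced_1.json
print(split_filename('advanced.json', 'melanie.white@example2.edu'))
# -> advanced_melanie.white@example2.edu.json
```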
Check the results in the current directory
os.listdir(os.getcwd())
Because our project "advanced" contains duplicates in the first column "email", this command will overwrite files (e.g. advanced_melanie.white@example2.edu.json). When using this option, the first column should contain unique identifiers.
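Since suffixById derives filenames from the first column, it is worth checking that column for duplicates beforehand. A sketch with the standard library, using inline sample data (with a real project you would read the exported CSV file instead):

```python
import csv
import io
from collections import Counter

# Inline sample standing in for an exported CSV;
# in practice, open the file exported from the project instead.
sample_csv = '''email,name
melanie.white@example2.edu,Melanie White
danny.baron@example1.com,Danny Baron
melanie.white@example2.edu,Melanie White
'''

reader = csv.reader(io.StringIO(sample_csv))
next(reader)  # skip the header row
counts = Counter(row[0] for row in reader)
duplicates = [value for value, n in counts.items() if n > 1]
print(duplicates)
```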
cli.delete(p2.project_id)
help(cli)
The client and server can run on different machines. The host and port of the OpenRefine server can be specified:
cli.refine.REFINE_HOST = 'localhost'
cli.refine.REFINE_PORT = '3333'
Please file an issue if you are missing features in the command-line interface or if you have found a bug. You are also welcome to ask questions!