First we need an OpenRefine server running and the openrefine-client installed.
import os
if 'openrefineder' in os.environ.get('HOSTNAME', ''):
notebook = !jupyter notebook list | grep -o -E 'http\S+'
openrefine_url = notebook[0].replace('?token', 'openrefine?token')
openrefine_url = openrefine_url.replace('http://0.0.0.0:8888','')
from IPython.display import display, HTML
display(HTML('<a href="' + openrefine_url + '" target="blank">Click here to open OpenRefine</a>'))
Ensure you have an OpenRefine server running. Then install the OpenRefine client as follows.
pip install openrefine-client
Because we will store some files, it is cleaner to work in a new folder.
import os, datetime
path = os.path.join(os.path.expanduser('~'), datetime.datetime.now().strftime('%Y%m%d_%H%M%S'))
try:
    os.mkdir(path)
    os.chdir(path)
except OSError:
    print("Creation of the directory %s failed" % path)
else:
    print(os.getcwd())
from google.refine import cli
Download sample data
cli.download('https://git.io/fj5hF','duplicates.csv')
Import the file into OpenRefine (and store the returned project)
p1 = cli.create('duplicates.csv')
cli.ls()
cli.info(p1.project_id)
cli.export(p1.project_id)
Download a sample JSON file (its content was previously extracted via the Undo/Redo history in the OpenRefine graphical user interface)
cli.download('https://git.io/fj5ju','duplicates-deletion.json')
Apply the transformation rules
cli.apply(p1.project_id, 'duplicates-deletion.json')
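If you want to see what a transformation file will do before applying it, the operations JSON can be inspected with the standard library. The snippet below uses a minimal illustrative sample in OpenRefine's undo/redo format (it is not the actual content of duplicates-deletion.json); each operation carries an "op" key naming the OpenRefine operation.

```python
import json

# Hypothetical sample operations list in OpenRefine's undo/redo JSON format.
# (Not the actual content of duplicates-deletion.json.)
sample = '''[
  {"op": "core/column-reorder", "description": "Reorder columns",
   "columnNames": ["email", "name", "state", "gender", "purchase"]},
  {"op": "core/blank-down", "description": "Blank down cells in column email",
   "columnName": "email"}
]'''

operations = json.loads(sample)
for operation in operations:
    # Print the operation identifier and its human-readable description.
    print(operation["op"], "-", operation["description"])
```

With the real file you would pass `open('duplicates-deletion.json')` to `json.load` instead of parsing an inline string.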
Export the project to the terminal again
cli.export(p1.project_id)
Export data in Excel (.xls) format
cli.export(p1.project_id, 'deduped.xls')
cli.delete(p1.project_id)
Create another project from the example file above
p2 = cli.create('duplicates.csv')
The following example code exports the columns "name" and "purchase" in JSON format, for rows matching the regular expression text filter ^F$ in the column "gender".
cli.templating(p2.project_id,
prefix='''{ "events" : [
''',
template=' { "name" : {{jsonize(cells["name"].value)}}, "purchase" : {{jsonize(cells["purchase"].value)}} }',
rowSeparator=''',
''',
suffix='''
] }''',
filterQuery='^F$',
filterColumn='gender')
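To see why ^F$ selects only rows whose gender cell is exactly "F" (and not, say, "Female"), the same regular expression can be tried in plain Python:

```python
import re

pattern = re.compile('^F$')

# ^ and $ anchor the match to the whole cell value,
# so only the exact string "F" passes the filter.
for value in ['F', 'M', 'Female', 'F ']:
    print(repr(value), bool(pattern.search(value)))
```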
There is also an option to store the results in multiple files. Each file will contain the prefix, a processed row, and the suffix.
cli.templating(p2.project_id,
prefix='''{ "events" : [
''',
template=' { "name" : {{jsonize(cells["name"].value)}}, "purchase" : {{jsonize(cells["purchase"].value)}} }',
rowSeparator=''',
''',
suffix='''
] }''',
filterQuery='^F$',
filterColumn='gender',
output_file='advanced.json',
splitToFiles=True)
Filenames are suffixed with the row number by default (e.g. advanced_1.json, advanced_2.json, etc.). There is another option to use the value in the first column instead:
cli.templating(p2.project_id,
prefix='''{ "events" : [
''',
template=' { "name" : {{jsonize(cells["name"].value)}}, "purchase" : {{jsonize(cells["purchase"].value)}} }',
rowSeparator=''',
''',
suffix='''
] }''',
filterQuery='^F$',
filterColumn='gender',
output_file='advanced.json',
splitToFiles=True,
suffixById=True)
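The filename convention described above can be mimicked with a few lines of plain Python. This is only an illustration of the naming scheme, not the client's internal code:

```python
import os

def split_filename(output_file, suffix):
    # Insert the per-file suffix (row number or first-column value)
    # before the extension, e.g. advanced.json -> advanced_1.json.
    base, ext = os.path.splitext(output_file)
    return '%s_%s%s' % (base, suffix, ext)

print(split_filename('advanced.json', 1))
# -> advanced_1.json
print(split_filename('advanced.json', 'melanie.white@example2.edu'))
# -> advanced_melanie.white@example2.edu.json
```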
Check the results in the current directory
os.listdir(os.getcwd())
Because our project "advanced" contains duplicates in the first column "email", this command will overwrite files (e.g. advanced_melanie.white@example2.edu.json). When using this option, the first column should contain unique identifiers.
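Since suffixById derives filenames from the first column, it is worth checking that column for duplicates beforehand. A sketch with the standard library, using inline sample data (with a real project you would read the exported CSV file instead):

```python
import csv
import io
from collections import Counter

# Inline sample standing in for an exported CSV;
# in practice, open the file exported from the project instead.
sample_csv = '''email,name
melanie.white@example2.edu,Melanie White
danny.baron@example1.com,Danny Baron
melanie.white@example2.edu,Melanie White
'''

reader = csv.reader(io.StringIO(sample_csv))
next(reader)  # skip the header row
counts = Counter(row[0] for row in reader)
duplicates = [value for value, n in counts.items() if n > 1]
print(duplicates)
```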
cli.delete(p2.project_id)
help(cli)
The client and server can run on different machines. The host and port of the OpenRefine server can be specified:
cli.refine.REFINE_HOST = 'localhost'
cli.refine.REFINE_PORT = '3333'
Please file an issue if you are missing features in the command-line interface or if you have found a bug. You are also welcome to ask questions!