First, make sure an OpenRefine server is running. Then install the openrefine-client as follows.
mkdir -p ~/.local/bin
wget -nv https://github.com/opencultureconsulting/openrefine-client/releases/download/v0.3.10/openrefine-client_0-3-10_linux -O ~/.local/bin/openrefine-client
chmod +x ~/.local/bin/openrefine-client
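The commands above place the binary in ~/.local/bin, which is not on PATH in every shell setup. The following is a minimal sketch, assuming a POSIX-compatible shell, that checks this and adds the directory for the current session only. To make the change permanent you would add the PATH line to your shell profile.

```shell
# Add ~/.local/bin to PATH for this session if it is not already there.
case ":$PATH:" in
  *":$HOME/.local/bin:"*)
    echo "$HOME/.local/bin is already on PATH"
    ;;
  *)
    PATH="$HOME/.local/bin:$PATH"
    export PATH
    echo "added $HOME/.local/bin to PATH"
    ;;
esac
```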
We will create several files, so it is cleaner to work in a new folder.
workspace=$(date +%Y%m%d_%H%M%S)
mkdir -p ~/$workspace && cd ~/$workspace && pwd
Download sample data
openrefine-client --download "https://git.io/fj5hF" --output=duplicates.csv
Import file into OpenRefine
openrefine-client --create duplicates.csv
openrefine-client --list
openrefine-client --info "duplicates"
openrefine-client --export "duplicates"
Download sample JSON file (its content was previously extracted via the Undo/Redo history in the OpenRefine graphical user interface)
openrefine-client --download "https://git.io/fj5ju" --output=duplicates-deletion.json
Apply transformation rules
openrefine-client --apply duplicates-deletion.json "duplicates"
Export project to terminal again
openrefine-client --export "duplicates"
Export data in Excel (.xls) format
openrefine-client --export "duplicates" --output deduped.xls
openrefine-client --delete "duplicates"
Create another project from the example file above
openrefine-client --create duplicates.csv --projectName=advanced
The following example exports the columns "name" and "purchase" in JSON format from the project "advanced" for rows matching the regex text filter ^F$ in the column "gender":
openrefine-client "advanced" \
--prefix='{ "events" : [
' \
--template=' { "name" : {{jsonize(cells["name"].value)}}, "purchase" : {{jsonize(cells["purchase"].value)}} }' \
--rowSeparator=',
' \
--suffix='
] }' \
--filterQuery='^F$' \
--filterColumn='gender'
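To make the roles of --prefix, --rowSeparator and --suffix easier to see, here is a small sketch that assembles the same structure by hand in plain shell, without contacting OpenRefine. The two rows and their values are invented for illustration, and events-sketch.json is a hypothetical file name. It only mimics how the pieces are presumably concatenated: the prefix, then the rendered rows joined by the row separator, then the suffix.

```shell
# The same prefix, row separator and suffix as in the command above
prefix='{ "events" : [
'
row_separator=',
'
suffix='
] }'

# Hypothetical rendered rows (made-up values, not taken from the sample data)
row1='    { "name" : "Jane Doe", "purchase" : "Book" }'
row2='    { "name" : "Ann Lee", "purchase" : "Lamp" }'

# prefix + row + separator + row + suffix
printf '%s%s%s%s%s\n' "$prefix" "$row1" "$row_separator" "$row2" "$suffix" > events-sketch.json
cat events-sketch.json
```

Because the separator is only placed between rows, the last row has no trailing comma and the result stays valid JSON.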
There is also an option to store the results in multiple files. Each file will contain the prefix, one processed row, and the suffix.
openrefine-client "advanced" \
--prefix='{ "events" : [
' \
--template=' { "name" : {{jsonize(cells["name"].value)}}, "purchase" : {{jsonize(cells["purchase"].value)}} }' \
--rowSeparator=',
' \
--suffix='
] }' \
--filterQuery='^F$' \
--filterColumn='gender' \
--output=advanced.json \
--splitToFiles=true
Filenames are suffixed with the row number by default (e.g. advanced_1.json, advanced_2.json, etc.). Another option uses the value in the first column instead:
openrefine-client "advanced" \
--prefix='{ "events" : [
' \
--template=' { "name" : {{jsonize(cells["name"].value)}}, "purchase" : {{jsonize(cells["purchase"].value)}} }' \
--rowSeparator=',
' \
--suffix='
] }' \
--filterQuery='^F$' \
--filterColumn='gender' \
--output=advanced.json \
--splitToFiles=true \
--suffixById=true
Check the results in the current directory.
ls
Because our project "advanced" contains duplicates in the first column "email", this command will overwrite files (e.g. advanced_melanie.white@example2.edu.json). When using this option, the first column should contain unique identifiers.
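Before using --suffixById it can help to check whether the first column really is unique. The following is a rough sketch with standard Unix tools, assuming a simple comma-separated file without quoted commas or embedded newlines in the first column. The file ids-sketch.csv and its contents are made up for illustration. For the tutorial data you would point the check at duplicates.csv instead.

```shell
# Hypothetical sample file; replace with duplicates.csv for the tutorial data
cat > ids-sketch.csv <<'EOF'
email,name
a@example.org,Ann
b@example.org,Bob
a@example.org,Anna
EOF

# Skip the header, take the first column, keep values that occur more than once
dupes=$(tail -n +2 ids-sketch.csv | cut -d, -f1 | sort | uniq -d)

if [ -n "$dupes" ]; then
  echo "first column is not unique:"
  echo "$dupes"
else
  echo "first column is unique"
fi
```

Any value reported here would map to the same output filename, so only the last matching row would survive on disk.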
openrefine-client --delete "advanced"
openrefine-client --help
The openrefine-client is available as a single-file executable for Windows, macOS and Linux. Client and server can run on different machines (host and port of the OpenRefine server can be specified, e.g. -H 127.0.0.1 -P 80).
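If the server does not listen on the default host and port, repeating -H and -P on every call gets tedious. One possible convenience is a small shell function, sketched below. The function name refine and the variables REFINE_HOST and REFINE_PORT are made up for this example and are not part of openrefine-client.

```shell
# Hypothetical defaults; adjust to where your OpenRefine server listens
REFINE_HOST=${REFINE_HOST:-127.0.0.1}
REFINE_PORT=${REFINE_PORT:-80}

# Wrapper that passes host and port, followed by whatever arguments you give it
refine() {
  openrefine-client -H "$REFINE_HOST" -P "$REFINE_PORT" "$@"
}
```

With this in place, calling refine --list would run openrefine-client -H 127.0.0.1 -P 80 --list.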
Please file an issue if you are missing a feature in the command line interface or have found a bug. You are also welcome to ask questions!