Intake has been designed with a plugable structure, to allow adding new data sources and other features as easily as possible. Here we demonstrate making a new plugin to read data from the GitHub API, using a small and handy package that already exists, pygithub
. We will access only one particular part of the API, as a simple example.
The directory intake-github
contains a full package to do the job. Let's investigate its contents by changing directory - the exact line you need next will vary depending on how you are running this notebook.
cd intake-github/
The contents are typical python package files:
import os
cwd = os.getcwd()
for path, dirs, files in os.walk(cwd):
for f in files:
print(os.path.join(path, f)[len(cwd):])
The setup.py
and meta.yaml
are standard distutils and conda package information boilerplate, specifying dependencies and a little metadata. The LICENSE
file is empty, but in general, all open-source projects should have an appropriate license. With this structure, one can install the package:
pip install .
or add the modifier -e
to link rather than copy. In the following section, we will execute this command. It is also possible to use pip
to install directly for a remote github repository.
You can create a distribution (in the dist/ folder), which could be uploaded to PyPI
python setup.py sdist
And finally, you can create a conda package, which could be uploaded to a channel on anaconda.org .
conda build conda
(all of these commands to be run in the intake-github directory)
Note that, once the package is installed via pip or conda, the package github_issues
will become importable in your environment, and the driver defined below, "github_issues" will automatically be loaded into the Intake, registry, so that the function intake.open_github_issues
gets created at import time.
This example contains just one DataSource class, although it could have had several related source.
The source code has very little code: a subclass of DataSource
with the following class-level attributes:
container = 'dataframe'
version = '0.0.1'
partition_access = False
name = 'github_issues'
These name the new plugin, give it a version and an output data type (Pandas dataframe), and specify that the data will always be loaded in a single shot, without partitioning. Whether or not a dataframe is the right representation for this data is left for another discussion.
The only real logic is in the read function:
def _get_partition(self, _):
from github import Github
import pandas as pd
gh = Github()
repo = gh.get_repo('%s/%s' % (self.organization, self.repo))
raw_data = repo.get_issues(**self.ghkwargs)
data = {}
for column in ['number', 'title', 'user', 'state', 'comments',
'created_at', 'updated_at', 'body']:
# user is special case
if column == 'user':
data['user'] = [issue.user.login for issue in raw_data]
else:
data[column] = [getattr(issue, column) for issue in raw_data]
return pd.DataFrame(data)
This code is typical of the kind of coercion of data that might be required for different sources. Since we do not attempt partitioning here, the read()
method is guaranteed to do the right thing.
The external module pygithub
takes care of actually formatting the HTTP REST calls to the github API. We simply pass on keyword arguments, so if a catalog author would like more fine control over what is downloaded for any specific instance of this data source, they would have to read the pygithub
and/or GitHub API documentation.
We can install the package with pip
, or we could build the conda package and install that instead. To distribute the package, it could be uploaded to PyPI, to a user conda channel on anaconda.org, or added as a feedstock to conda-forge.
# install with dependencies. This produces a lot of text output and can take a little while.
!pip install -e .
Since the class is in the top-level of the package, and the package name starts with intake_
, it will be scanned when Intake is imported.
Now the plugin automatically appears in the set of known plugins in the Intake registry, and an associated intake.open_github_issues
function is created at import time.
import intake
'github_issues' in intake.registry
If this automatic behaviour is not desired, then either catalogs would need to reference the plugin module/location explicitly in the plugins:
section, or the package would need to insert its class into the Intake registry when it is imported.
Now we can use our new plugin to load data.
source = intake.open_github_issues('intake', 'intake')
source.discover()
out = source.read()
# currently open issues and PRs in the Intake repo
# would it make sense to here to regard the issue number as the index?
out[['number', 'title', 'state']]
The point of doing this, is to enable data engineers to specify the driver in their catalogs or when distributing data packages with conda. Notice that in YAML form, our data source references the driver class in the source. If the package were not found on the system when such a catalog entry is loaded, an error would be raised. The second form also includes the optional plugins section. The plugins directory is a convenient place in which to maintain links to the various drivers.
print(source.yaml())
Now there is a clear path for this particular type of data to be accessed by end-users, and authors of catalogs can use a spec like the one above to list particular data-sets in catalogs. If included in a conda data package, then the dependency (this intake-github package) can be automatically installed along with the catalog. If not using conda, users will find from the catalog spec that they need the github_issues python package and will have to install it themselves.
In a small amount of code, we have shown how to wrap a third-party data-loading package, to make it available to Intake as a plugin. The creation of the plugin enables inclusion of data-sets made from this kind of data possible in data catalogs. In this way, we build a uniform interface to access a wide variety of arbirtary data sources.