Loading Graphs

In addition to the NetworkX-compatible APIs, GraphScope provides a set of Python APIs for loading, analyzing, and querying very large graphs.

GraphScope models graph data as property graphs, in which the edges and vertices are labeled and can have many properties. In this tutorial, we show how GraphScope loads graphs, including:

  • How to load a built-in dataset quickly;
  • How to define the schema of a property graph;
  • How to load graphs from various locations;
  • How to serialize/deserialize a graph to/from disk.

Prerequisite

First, we launch a session and import necessary packages.

In [ ]:
# Install graphscope package if you are NOT in the Playground
!pip3 install graphscope
In [ ]:
import graphscope

graphscope.set_option(show_log=False)  # set show_log=True to enable logging

Load Built-in Datasets

GraphScope comes with a set of popular datasets and utility functions to load them into memory, making it easy for users to get started. Here's an example:

In [ ]:
from graphscope.dataset import load_ldbc

graph = load_ldbc()

In standalone mode, it will automatically download the data to ${HOME}/.graphscope/datasets, where it remains for future use.

Loading Your Own Datasets

However, it's common that users need to load their own data and do some analysis.

To build a property graph on GraphScope, we first create an empty graph using g().

In [ ]:
import graphscope
from graphscope.framework.loader import Loader

graph = graphscope.g()

The class Graph has several methods:

def add_vertices(self, vertices, label="_", properties=None, vid_field=0):
    pass

def add_edges(self, edges, label="_e", properties=None, src_label=None, dst_label=None, src_field=0, dst_field=1):
    pass

These methods help users construct the schema of the property graph iteratively.

We will use the files in ldbc_sample throughout this tutorial. You can get the files here. We have already downloaded them locally in the previous step.

You can inspect the graph schema at any point with print(graph.schema).

Adding Vertices

We can add one kind of vertices to the graph at a time. The method has the following parameters:

vertices: A location for the vertex data source, which can be a file location, a numpy array, a pandas DataFrame, etc.

A simple example:

In [ ]:
graph = graphscope.g()
graph = graph.add_vertices(
    Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|")
)

It will read data from the location ${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv. Since we didn't give additional arguments, these vertices will be labeled _ by default, using the first column in the file as their ID and the other columns as their properties. Both the names and the data types of the properties will be deduced.
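To make these defaults concrete, here is a minimal sketch that mimics them with the standard library only. It is not GraphScope's implementation: the helper name read_vertices and the sample rows are made up for illustration, the file is assumed to have a header row, and type deduction is not modeled (all values stay strings).

```python
import csv
import io

# Made-up sample in a pipe-delimited layout similar to person_0_0.csv.
SAMPLE = "id|firstName|lastName\n1|Alice|Smith\n2|Bob|Jones\n"


def read_vertices(text, delimiter="|"):
    """Return (vertex_id, properties) pairs; column 0 is the ID by default."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)  # assumes the first row holds the column names
    vertices = []
    for row in reader:
        props = dict(zip(header[1:], row[1:]))  # every remaining column
        vertices.append((row[0], props))
    return vertices


vertices = read_vertices(SAMPLE)
print(vertices[0])
```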

Another commonly used parameter is label:

label: The label name of the vertices, defaults to _.

Since a property graph allows many kinds of vertices, we suggest giving each kind of vertices a meaningful label name. For example:

In [ ]:
graph = graphscope.g()

graph = graph.add_vertices(
    Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
    label="person",
)

Now we have a graph with one kind of vertices, labeled person.

In addition, each kind of labeled vertices has its own properties. Here is the third parameter:

properties: A list of properties, optional, defaults to None.

This parameter selects the corresponding columns from the source data file or pandas DataFrame as properties. Please note that the values of this parameter must exist in the file/DataFrame. By default (the value None), all columns except the vid_field column will be added as properties. If it is an empty list [], no properties will be added.

For example:

In [ ]:
# All columns (firstName,lastName,gender,birthday,creationDate,locationIP,browserUsed) will be added as properties.
graph = graphscope.g()
graph = graph.add_vertices(
    Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
    label="person",
    properties=None,
)

# Only columns firstName, lastName will be added as properties.
graph = graphscope.g()
graph = graph.add_vertices(
    Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
    label="person",
    properties=["firstName", "lastName"],
)

# no properties will be added.
graph = graphscope.g()
graph = graph.add_vertices(
    Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
    label="person",
    properties=[],
)
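The three cases above follow one selection rule, which can be written down in a few lines of plain Python. This is only a model of the rule as described in this tutorial; select_properties is a hypothetical helper, and raising ValueError for unknown column names is an assumption about how such a check might look.

```python
def select_properties(columns, properties=None, vid_index=0):
    """Model of the rule: None -> all but the ID column, [] -> no properties."""
    if properties is None:
        return [c for i, c in enumerate(columns) if i != vid_index]
    missing = [p for p in properties if p not in columns]
    if missing:
        # Assumption: names that are not in the source are rejected.
        raise ValueError(f"columns not found in source: {missing}")
    return list(properties)


columns = ["id", "firstName", "lastName", "gender"]
all_props = select_properties(columns)  # properties=None
some_props = select_properties(columns, ["firstName", "lastName"])
no_props = select_properties(columns, [])
print(all_props, some_props, no_props)
```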

vid_field determines which column is used as the vertex ID. (These IDs are also what the source and destination IDs refer to when loading edges.)

It can be a str, giving the name of the column, or an int, giving the index of the column.

By default, the value is 0, hence the first column will be used as vertex ID.

In [ ]:
graph = graphscope.g()
graph = graph.add_vertices(
    Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
    vid_field="id",
)

graph = graphscope.g()
graph = graph.add_vertices(
    Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
    vid_field=0,
)

Adding Edges

Next, let's take a look at the parameters for loading edges.

edges: The location indicating where to read the data, e.g.,

In [ ]:
graph = graphscope.g()
graph = graph.add_vertices(
    Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
    label="person",
)

# Note we already added a vertex label named 'person'.
graph = graph.add_edges(
    Loader(
        "${HOME}/.graphscope/datasets/ldbc_sample/person_knows_person_0_0.csv",
        delimiter="|",
    ),
    src_label="person",
    dst_label="person",
)

This will load edges with the label _e (the default value). Both the source and destination vertices are labeled person; the first column is used as the source vertex ID, the second column as the destination vertex ID, and the remaining columns as properties.

Similar to vertices, we can use the parameter label to assign a label name and properties to select properties.

label: The label name of the edges, defaults to _e. (It's recommended to use a meaningful label name.)

properties: A list of properties, defaults to None (add all columns as properties).

In [ ]:
graph = graphscope.g()
graph = graph.add_vertices(
    Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
    label="person",
)
graph = graph.add_edges(
    Loader(
        "${HOME}/.graphscope/datasets/ldbc_sample/person_knows_person_0_0.csv",
        delimiter="|",
    ),
    label="knows",
    src_label="person",
    dst_label="person",
)

Unlike vertices, edges have some additional parameters.

src_label: The label name of the source vertex.

dst_label: The label name of the destination vertex. It can be different from src_label.

src_field and dst_field: The columns used for the source and destination vertex IDs. Default to 0 and 1, respectively.

For example:

In [ ]:
graph = graphscope.g()
graph = graph.add_vertices(
    Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
    label="person",
)
graph = graph.add_vertices(
    Loader("${HOME}/.graphscope/datasets/ldbc_sample/comment_0_0.csv", delimiter="|"),
    label="comment",
)

# Please note we already added a vertex label named 'person'.
graph = graph.add_edges(
    Loader(
        "${HOME}/.graphscope/datasets/ldbc_sample/person_likes_comment_0_0.csv",
        delimiter="|",
    ),
    label="likes",
    src_label="person",
    dst_label="comment",
)
In [ ]:
# examples for ``src_field`` and ``dst_field``

graph = graphscope.g()
graph = graph.add_vertices(
    Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
    label="person",
)
graph = graph.add_vertices(
    Loader("${HOME}/.graphscope/datasets/ldbc_sample/comment_0_0.csv", delimiter="|"),
    label="comment",
)

graph = graph.add_edges(
    Loader(
        "${HOME}/.graphscope/datasets/ldbc_sample/person_likes_comment_0_0.csv",
        delimiter="|",
    ),
    label="likes",
    src_label="person",
    dst_label="comment",
    src_field="Person.id",
    dst_field="Comment.id",
)
# Or use the index.
# graph = graph.add_edges(Loader('${HOME}/.graphscope/datasets/ldbc_sample/person_likes_comment_0_0.csv', delimiter='|'), label='likes', src_label='person', dst_label='comment', src_field=0, dst_field=1)
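The name-based and index-based forms above pick the same two columns. The sketch below models how one edge record could be split into source ID, destination ID, and properties. It is an illustration rather than GraphScope's code: split_edge_row is a hypothetical helper, and the creationDate column and the row values are assumed for the example.

```python
def split_edge_row(header, row, src_field=0, dst_field=1):
    """Return (src_id, dst_id, properties) for one edge record."""

    def index_of(field):
        # A field is either a column index (int) or a column name (str).
        return field if isinstance(field, int) else header.index(field)

    src, dst = index_of(src_field), index_of(dst_field)
    props = {
        name: value
        for i, (name, value) in enumerate(zip(header, row))
        if i not in (src, dst)  # everything else becomes a property
    }
    return row[src], row[dst], props


header = ["Person.id", "Comment.id", "creationDate"]
row = ["1", "42", "2012-01-01"]  # made-up values
by_name = split_edge_row(header, row, "Person.id", "Comment.id")
by_index = split_edge_row(header, row, 0, 1)
print(by_name)
```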

Advanced Usage

Here are some advanced usages for dealing with homogeneous graphs or very complex graphs.

Deduce vertex labels when not ambiguous

If there is only one kind of vertices in a graph, the vertex labels can be omitted when adding edges. GraphScope will set both the source and the destination vertex label to that single label.

In [ ]:
graph = graphscope.g()
graph = graph.add_vertices(
    Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
    label="person",
)
# GraphScope will assign ``src_label`` and ``dst_label`` to ``person`` automatically.
graph = graph.add_edges(
    Loader(
        "${HOME}/.graphscope/datasets/ldbc_sample/person_knows_person_0_0.csv",
        delimiter="|",
    )
)

Inducing vertices from edges

If users add edges with an unseen src_label or dst_label, GraphScope will extract a vertex table for the given labels from the edge data.

In [ ]:
graph = graphscope.g()
# Deduce vertex label `person` from the source and destination endpoints of edges.
graph = graph.add_edges(
    Loader(
        "${HOME}/.graphscope/datasets/ldbc_sample/person_knows_person_0_0.csv",
        delimiter="|",
    ),
    src_label="person",
    dst_label="person",
)

graph = graphscope.g()
# Deduce the vertex label `person` from the source endpoint,
# and vertex label `comment` from the destination endpoint of edges.
graph = graph.add_edges(
    Loader(
        "${HOME}/.graphscope/datasets/ldbc_sample/person_likes_comment_0_0.csv",
        delimiter="|",
    ),
    label="likes",
    src_label="person",
    dst_label="comment",
)
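Conceptually, inducing vertices means collecting the distinct endpoint IDs of the edges under the given labels. A minimal sketch of that idea follows; induce_vertices is a hypothetical helper, the edge lists are made up, and in this model the induced vertices carry only their IDs because the edge data holds no other vertex columns.

```python
def induce_vertices(edge_rows, src_label, dst_label):
    """Collect the distinct endpoint IDs per vertex label from edge rows."""
    tables = {}
    for src, dst in edge_rows:
        tables.setdefault(src_label, set()).add(src)
        tables.setdefault(dst_label, set()).add(dst)
    return {label: sorted(ids) for label, ids in tables.items()}


knows = [(1, 2), (1, 3), (2, 3)]     # made-up person -> person edges
likes = [(1, 10), (2, 10), (3, 11)]  # made-up person -> comment edges

same_label = induce_vertices(knows, "person", "person")
two_labels = induce_vertices(likes, "person", "comment")
print(same_label, two_labels)
```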

Multiple relations

In some cases, one edge label may connect more than one pair of vertex kinds. For example, a graph may contain two kinds of edges that are both labeled likes but represent two different relations, i.e., person -> likes -> comment and person -> likes -> post.

In this case, we can simply add the relation again with the same edge label, but with different source and destination labels.

In [ ]:
sess = graphscope.session(cluster_type="hosts", num_workers=1, mode="lazy")
graph = sess.g()
graph = graph.add_vertices(
    Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
    label="person",
)
graph = graph.add_vertices(
    Loader("${HOME}/.graphscope/datasets/ldbc_sample/comment_0_0.csv", delimiter="|"),
    label="comment",
)
graph = graph.add_vertices(
    Loader("${HOME}/.graphscope/datasets/ldbc_sample/post_0_0.csv", delimiter="|"),
    label="post",
)

graph = graph.add_edges(
    Loader(
        "${HOME}/.graphscope/datasets/ldbc_sample/person_likes_comment_0_0.csv",
        delimiter="|",
    ),
    label="likes",
    src_label="person",
    dst_label="comment",
)

graph = graph.add_edges(
    Loader(
        "${HOME}/.graphscope/datasets/ldbc_sample/person_likes_post_0_0.csv",
        delimiter="|",
    ),
    label="likes",
    src_label="person",
    dst_label="post",
)
graph = sess.run(graph)
print(graph.schema)

Please note:

  1. This feature (multiple relations using the same edge label) is currently only available in lazy mode.
  2. When several relations share one edge label, their properties should agree in number and type, and preferably in name, because all data with the same label is put into one table, and the property names are taken from the first relation that was added.
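The second note can be read as a compatibility check between the property lists of two relations. The sketch below models that check under the assumption that each relation is described by (name, type) pairs; check_same_shape and the property names are hypothetical and not part of GraphScope.

```python
def check_same_shape(first, other):
    """Compare the (name, type) property lists of two relations of one label."""
    if len(first) != len(other):
        raise ValueError("relations differ in the number of properties")
    for (name_a, type_a), (name_b, type_b) in zip(first, other):
        if type_a != type_b:
            raise ValueError(
                f"type mismatch: {name_a}:{type_a} vs {name_b}:{type_b}"
            )
    # The shared table keeps the names given by the first relation.
    return [name for name, _ in first]


likes_comment = [("creationDate", "str")]
likes_post = [("created", "str")]  # same type, different name
merged_names = check_same_shape(likes_comment, likes_post)
print(merged_names)
```

Differing names pass the check but are replaced by the names of the first relation, which is why matching names are preferable.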

Specifying data types of properties manually

GraphScope deduces data types from the input files, and this works as expected in most cases. However, sometimes users may want to specify the data types themselves, e.g.

In [ ]:
graph = graphscope.g()
graph = graph.add_vertices(
    Loader("${HOME}/.graphscope/datasets/ldbc_sample/post_0_0.csv", delimiter="|"),
    label="post",
    properties=["content", ("length", "int")],
)

This forces the property to be cast to, and loaded as, the specified data type. Each typed property is given as a tuple of the name and the type. In this case, the property length will have type int rather than the default int64_t. The available types are int, int64, float, double, and str.
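A properties list may therefore mix plain names with (name, type) tuples. As a sketch of how such a list can be read, the hypothetical helper below expands it into uniform pairs, using None for "deduce the type". The validation step is an assumption; only the type names listed above are taken from this tutorial.

```python
ALLOWED_TYPES = {"int", "int64", "float", "double", "str"}


def normalize_properties(properties):
    """Expand a mixed list into (name, type) pairs; None means 'deduce'."""
    pairs = []
    for item in properties:
        if isinstance(item, str):
            pairs.append((item, None))  # plain name: type is deduced
            continue
        name, type_name = item
        if type_name not in ALLOWED_TYPES:
            raise ValueError(f"unsupported type {type_name!r} for {name!r}")
        pairs.append((name, type_name))
    return pairs


spec = normalize_properties(["content", ("length", "int")])
print(spec)
```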

Other parameters of graph

The class Graph has three meta options:

  • oid_type, can be int64_t or string. Defaults to int64_t for efficiency. If the ID column can't be represented as int64_t, use string instead.
  • directed, bool, defaults to True. Controls whether to load a directed or an undirected graph.
  • generate_eid, bool, defaults to True. Controls whether to generate a unique ID for every edge automatically.

Putting them Together

Let's put the pieces together into a complete example. A more complex example that loads the LDBC SNB graph can be found here.

In [ ]:
graph = graphscope.g(oid_type="int64_t", directed=True, generate_eid=True)
graph = graph.add_vertices(
    Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
    label="person",
)
graph = graph.add_vertices(
    Loader("${HOME}/.graphscope/datasets/ldbc_sample/comment_0_0.csv", delimiter="|"),
    label="comment",
)
graph = graph.add_vertices(
    Loader("${HOME}/.graphscope/datasets/ldbc_sample/post_0_0.csv", delimiter="|"),
    label="post",
)

graph = graph.add_edges(
    Loader(
        "${HOME}/.graphscope/datasets/ldbc_sample/person_knows_person_0_0.csv",
        delimiter="|",
    ),
    label="knows",
    src_label="person",
    dst_label="person",
)
graph = graph.add_edges(
    Loader(
        "${HOME}/.graphscope/datasets/ldbc_sample/person_likes_comment_0_0.csv",
        delimiter="|",
    ),
    label="likes",
    src_label="person",
    dst_label="comment",
)

print(graph.schema)

Loading From Pandas or Numpy

The data source used so far is a Loader object. A loader wraps either a location or the data itself. GraphScope supports loading a graph from pandas DataFrames or numpy ndarrays, making it easy to construct a graph right in the Python console.

Apart from the loader, the other fields such as properties and label are the same as in the examples above.

From Pandas

In [ ]:
import numpy as np
import pandas as pd

leader_id = np.array([0, 0, 0, 1, 1, 3, 3, 6, 6, 6, 7, 7, 8])
member_id = np.array([2, 3, 4, 5, 6, 6, 8, 0, 2, 8, 8, 9, 9])
group_size = np.array([4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2])
e_data = np.transpose(np.vstack([leader_id, member_id, group_size]))
df_group = pd.DataFrame(e_data, columns=["leader_id", "member_id", "group_size"])
In [ ]:
student_id = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
avg_score = np.array(
    [490.33, 164.5, 190.25, 762.0, 434.2, 513.0, 569.0, 25.0, 308.0, 87.0]
)
v_data = np.transpose(np.vstack([student_id, avg_score]))
df_student = pd.DataFrame(v_data, columns=["student_id", "avg_score"]).astype(
    {"student_id": np.int64}
)
In [ ]:
# use a dataframe as datasource, properties omitted, col_0/col_1 will be used as src/dst by default.
# (for vertices, col_0 will be used as vertex_id by default)
graph = graphscope.g().add_vertices(df_student).add_edges(df_group)

From Numpy

Note that each array is a column; we pass the columns to the loader in a form similar to the COO matrix format.

In [ ]:
array_group = [df_group[col].values for col in ["leader_id", "member_id", "group_size"]]
array_student = [df_student[col].values for col in ["student_id", "avg_score"]]

graph = graphscope.g().add_vertices(array_student).add_edges(array_group)
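If the data starts out as a list of edge records rather than as columns, it has to be transposed first. The following sketch shows the conversion with plain Python lists standing in for ndarrays; the three rows are the first three records of the group data defined earlier.

```python
# Row-wise records: (leader_id, member_id, group_size)
rows = [(0, 2, 4), (0, 3, 4), (0, 4, 4)]

# Column-wise layout: one sequence per column, as in the COO format,
# where the first two columns are the endpoints and the third is the value.
leader_id, member_id, group_size = (list(col) for col in zip(*rows))
columns = [leader_id, member_id, group_size]

# Transposing again recovers the original records.
round_trip = list(zip(*columns))
print(columns)
```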

Loader Variants

When a loader wraps a location, it may contain only a str. The string follows the URI standard. When receiving a request to load a graph from a location, GraphScope parses the URI and invokes the corresponding loader according to the scheme.

Currently, GraphScope supports loaders for local, s3, oss, and hdfs. Under the hood, data is loaded in a distributed fashion by v6d, which takes advantage of fsspec to resolve specific schemes and formats. Any additional configuration can be passed in the kwargs of Loader and will be parsed directly by the specific class, e.g., host and port for hdfs, or access-id and secret-access-key for oss or s3.


    from graphscope.framework.loader import Loader

    ds1 = Loader("file:///var/datafiles/group.e")
    ds2 = Loader("oss://graphscope_bucket/datafiles/group.e", key='access-id', secret='secret-access-key', endpoint='oss-cn-hangzhou.aliyuncs.com')
    ds3 = Loader("hdfs:///datafiles/group.e", host='localhost', port='9000', extra_conf={'conf1': 'value1'})
    ds4 = Loader("s3://datafiles/group.e", key='access-id', secret='secret-access-key', client_kwargs={'region_name': 'us-east-1'})
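The scheme-based dispatch can be illustrated with the standard library alone. This sketch only extracts and checks the scheme of each location; scheme_of is a hypothetical helper, and treating a bare path as a local file is an assumption, not documented GraphScope behavior.

```python
from urllib.parse import urlparse

SUPPORTED = {"file", "oss", "hdfs", "s3"}


def scheme_of(location):
    """Return the URI scheme; a bare path is treated as a local file."""
    scheme = urlparse(location).scheme or "file"
    if scheme not in SUPPORTED:
        raise ValueError(f"no loader registered for scheme {scheme!r}")
    return scheme


schemes = [
    scheme_of(uri)
    for uri in (
        "file:///var/datafiles/group.e",
        "oss://graphscope_bucket/datafiles/group.e",
        "hdfs:///datafiles/group.e",
        "s3://datafiles/group.e",
        "/var/datafiles/group.e",
    )
]
print(schemes)
```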

Users can implement customized loaders to support additional data sources. Taking ossfs as an example, a user needs to subclass AbstractFileSystem, which is used to resolve the specific protocol scheme, and AbstractBufferedFile, which is used to read and write. The only methods the user needs to override are _upload_chunk, _initiate_upload and _fetch_range. Finally, the user needs to call fsspec.register_implementation('protocol_name', 'protocol_file_system') to register the corresponding resolver.

Serialization and Deserialization (Only available in k8s mode)

When the graph is huge, loading it can take a large amount of time (e.g., hours). GraphScope provides serialization and deserialization for graph data, which dump and load constructed graphs as binary data to and from disk.

Serialization

graph.save_to takes a path argument, indicating the location to store the binary data.

graph.save_to('/tmp/serial')

Deserialization

graph.load_from is a classmethod. Its path argument should be exactly the same as the path passed to graph.save_to. Please note that during serialization, each worker dumps its own data to files with its index as the suffix. Thus the number of workers used for deserialization should be exactly the same as the number used for serialization.

In addition, graph.load_from needs an extra sess parameter, specifying the session in which the graph will be deserialized.

import graphscope
from graphscope import Graph

sess = graphscope.session()
deserialized_graph = Graph.load_from('/tmp/serial', sess)
print(deserialized_graph.schema)
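To see why the worker counts must match, consider a toy model in which every worker writes and later reads exactly one shard. This is not GraphScope's on-disk format: the <prefix>_<index> naming, the use of pickle, and both helper functions are assumptions made purely for illustration.

```python
import pickle
import tempfile
from pathlib import Path


def save_shards(parts, prefix):
    """Worker i writes its own part to <prefix>_<i> (hypothetical naming)."""
    for i, part in enumerate(parts):
        Path(f"{prefix}_{i}").write_bytes(pickle.dumps(part))


def load_shards(prefix, num_workers):
    """Worker i reads back <prefix>_<i>; the shard count must match."""
    prefix = Path(prefix)
    found = len(list(prefix.parent.glob(prefix.name + "_*")))
    if found != num_workers:
        raise ValueError(f"saved by {found} workers, loading with {num_workers}")
    return [
        pickle.loads(Path(f"{prefix}_{i}").read_bytes())
        for i in range(num_workers)
    ]


with tempfile.TemporaryDirectory() as tmp:
    prefix = str(Path(tmp) / "serial")
    save_shards([{"v": [0, 1]}, {"v": [2, 3]}], prefix)  # two workers
    restored = load_shards(prefix, num_workers=2)
    try:
        load_shards(prefix, num_workers=3)  # wrong worker count
        mismatch_detected = False
    except ValueError:
        mismatch_detected = True
print(restored, mismatch_detected)
```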