In addition to the NetworkX-compatible APIs, GraphScope provides a set of Python APIs for loading, analyzing, and querying very large graphs.
GraphScope models graph data as property graphs, in which the edges and vertices are labeled and can carry many properties. In this tutorial, we show how GraphScope loads graphs.
First, we launch a session and import necessary packages.
# Install graphscope package if you are NOT in the Playground
!pip3 install graphscope
import graphscope
graphscope.set_option(show_log=False)  # set show_log=True to enable logging
GraphScope comes with a set of popular datasets and utility functions that load them into memory, making it easy for users to get started. Here's an example:
from graphscope.dataset import load_ldbc
graph = load_ldbc()
In standalone mode, it automatically downloads the data to ${HOME}/.graphscope/datasets, where it remains for future use.
However, it's common that users need to load their own data and do some analysis.
To build a property graph in GraphScope, we first create an empty graph using g().
import graphscope
from graphscope.framework.loader import Loader
graph = graphscope.g()
The class Graph has several methods:
def add_vertices(self, vertices, label="_", properties=None, vid_field=0):
    pass

def add_edges(self, edges, label="_e", properties=None, src_label=None, dst_label=None, src_field=0, dst_field=1):
    pass
These methods help users construct the schema of the property graph iteratively.
We will use the files in ldbc_sample throughout this tutorial. You can get the files here. In this tutorial, they were already downloaded to the local machine in the previous step.
You can inspect the graph schema at any time with print(graph.schema).
We can add one kind of vertices to the graph. The method has the following parameters:
vertices: A location for the vertex data source, which can be a file location, a numpy array, a pandas DataFrame, etc.
A simple example:
graph = graphscope.g()
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|")
)
It reads data from the location ${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv. Since we didn't give additional arguments, these vertices are labeled _ by default, the first column in the file is used as their ID, and the other columns become their properties. Both the names and the data types of the properties are deduced.
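To make these defaults concrete, here is a minimal plain-Python sketch of the behavior just described: the first column becomes the vertex ID and every other column becomes a property with a deduced type. It does not call GraphScope; the sample rows and the helpers `deduce_type` and `read_vertices` are hypothetical, and GraphScope's real type inference may well differ.

```python
import csv
import io

# Hypothetical miniature of a pipe-delimited vertex file (not real LDBC data).
SAMPLE = "id|firstName|age\n933|Mahinda|31\n1129|Carmen|27\n"


def deduce_type(values):
    """Return 'int', 'double', or 'str' depending on what every value parses as."""
    for name, caster in (("int", int), ("double", float)):
        try:
            for value in values:
                caster(value)
            return name
        except ValueError:
            continue
    return "str"


def read_vertices(text, delimiter="|", vid_field=0):
    """Split a delimited table into vertex IDs and a {property: type} mapping."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    header, body = rows[0], rows[1:]
    ids = [row[vid_field] for row in body]
    props = {}
    for index, name in enumerate(header):
        if index == vid_field:
            continue  # the ID column is not a property
        props[name] = deduce_type([row[index] for row in body])
    return ids, props


ids, props = read_vertices(SAMPLE)
print(ids)
print(props)
```

The point of the sketch is only the column roles: one column identifies the vertex, and the rest are carried along as typed properties.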
Another commonly used parameter is label:
label: The label name of the vertex, defaults to _.
Since a property graph allows many kinds of vertices, we suggest giving each kind of vertices a meaningful label name. For example:
graph = graphscope.g()
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
label="person",
)
Then we have a graph with one kind of vertices, whose label name is person.
In addition, each kind of labeled vertices has its own properties. Here is the third parameter:
properties: A list of properties. Optional, defaults to None.
This parameter selects the corresponding columns from the source data file or pandas DataFrame as properties. Please note that the values of this parameter must exist in the file/DataFrame. By default (value None), all columns except the vid_field column are added as properties. If it is an empty list [], no properties are added.
For example:
# All columns (firstName,lastName,gender,birthday,creationDate,locationIP,browserUsed) will be added as properties.
graph = graphscope.g()
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
label="person",
properties=None,
)
# Only columns firstName, lastName will be added as properties.
graph = graphscope.g()
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
label="person",
properties=["firstName", "lastName"],
)
# no properties will be added.
graph = graphscope.g()
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
label="person",
properties=[],
)
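The three cases above (None, a list of names, and an empty list) can be summarized in a small plain-Python sketch. It only mirrors the selection rule stated in the text and does not call GraphScope; `select_properties` and the column list are hypothetical names chosen for illustration.

```python
def select_properties(columns, properties=None, vid_field=0):
    """Return the column names that would become properties.

    columns    -- all column names found in the data source
    properties -- None (all but the ID column), [] (none), or a list of names
    vid_field  -- name or index of the ID column
    """
    vid_index = columns.index(vid_field) if isinstance(vid_field, str) else vid_field
    if properties is None:
        # Default: every column except the ID column.
        return [name for i, name in enumerate(columns) if i != vid_index]
    missing = [name for name in properties if name not in columns]
    if missing:
        # The text requires that selected names exist in the source.
        raise ValueError(f"columns not found in the data source: {missing}")
    return list(properties)


columns = ["id", "firstName", "lastName", "gender"]
print(select_properties(columns))
print(select_properties(columns, ["firstName", "lastName"]))
print(select_properties(columns, []))
```

Raising on an unknown column is a choice made for this sketch; the text only says that the names must exist.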
vid_field determines which column is used as the vertex ID (as well as the source ID or destination ID when loading edges).
It can be a str, the name of the column, or an int, the index of the column.
By default, the value is 0, hence the first column is used as the vertex ID.
graph = graphscope.g()
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
vid_field="id",
)
graph = graphscope.g()
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
vid_field=0,
)
Next, let's take a look at the parameters for loading edges.
edges: The location indicating where to read the data, e.g.,
graph = graphscope.g()
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
label="person",
)
# Note we already added a vertex label named 'person'.
graph = graph.add_edges(
Loader(
"${HOME}/.graphscope/datasets/ldbc_sample/person_knows_person_0_0.csv",
delimiter="|",
),
src_label="person",
dst_label="person",
)
This loads one kind of edges whose label is _e (the default value). Its source and destination vertices are both labeled person; the first column is used as the source vertex ID, the second column as the destination vertex ID, and the others as properties.
Similar to vertices, we can use the parameter label to assign a label name and the parameter properties to select properties.
label: The label name of the edges, defaults to _e. (It's recommended to use a meaningful label name.)
properties: A list of properties, defaults to None (add all columns as properties).
graph = graphscope.g()
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
label="person",
)
graph = graph.add_edges(
Loader(
"${HOME}/.graphscope/datasets/ldbc_sample/person_knows_person_0_0.csv",
delimiter="|",
),
label="knows",
src_label="person",
dst_label="person",
)
Unlike vertices, edges have some additional parameters.
src_label: The label name of the source vertex.
dst_label: The label name of the destination vertex. It can be different from src_label.
src_field and dst_field: The columns used for the source and destination vertex IDs. They default to 0 and 1, respectively.
e.g.,
graph = graphscope.g()
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
label="person",
)
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/comment_0_0.csv", delimiter="|"),
label="comment",
)
# Please note we already added a vertex label named 'person'.
graph = graph.add_edges(
Loader(
"${HOME}/.graphscope/datasets/ldbc_sample/person_likes_comment_0_0.csv",
delimiter="|",
),
label="likes",
src_label="person",
dst_label="comment",
)
# examples for ``src_field`` and ``dst_field``
graph = graphscope.g()
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
label="person",
)
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/comment_0_0.csv", delimiter="|"),
label="comment",
)
graph = graph.add_edges(
Loader(
"${HOME}/.graphscope/datasets/ldbc_sample/person_likes_comment_0_0.csv",
delimiter="|",
),
label="likes",
src_label="person",
dst_label="comment",
src_field="Person.id",
dst_field="Comment.id",
)
# Or use the index.
# graph = graph.add_edges(Loader('${HOME}/.graphscope/datasets/ldbc_sample/person_likes_comment_0_0.csv', delimiter='|'), label='likes', src_label='person', dst_label='comment', src_field=0, dst_field=1)
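As a plain-Python illustration of how src_field and dst_field partition an edge row, the sketch below resolves each field by name or by index and treats the remaining columns as properties. It does not call GraphScope; `split_edge_row` is a hypothetical helper, and the `creationDate` column and the row values are assumptions made up for the example.

```python
def split_edge_row(header, row, src_field=0, dst_field=1):
    """Return (source ID, destination ID, properties) for one edge row."""

    def to_index(field):
        # A str is looked up in the header; an int is already an index.
        return header.index(field) if isinstance(field, str) else field

    src_index, dst_index = to_index(src_field), to_index(dst_field)
    properties = {
        name: value
        for i, (name, value) in enumerate(zip(header, row))
        if i not in (src_index, dst_index)
    }
    return row[src_index], row[dst_index], properties


# Hypothetical header and row; only the two ID column names come from the text.
header = ["Person.id", "Comment.id", "creationDate"]
row = ["933", "2061584", "2012-01-01"]

by_index = split_edge_row(header, row)
by_name = split_edge_row(header, row, src_field="Person.id", dst_field="Comment.id")
print(by_index == by_name)
```

Under these assumptions, addressing the columns by name or by position yields the same split, which is why the two forms in the example above are interchangeable.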
Here are some advanced usages for dealing with homogeneous graphs or very complex graphs.
If there is only one kind of vertices in a graph, the vertex labels can be omitted when adding edges. GraphScope will infer both the source and the destination vertex label to be that single label.
graph = graphscope.g()
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
label="person",
)
# GraphScope will assign ``src_label`` and ``dst_label`` to ``person`` automatically.
graph = graph.add_edges(
Loader(
"${HOME}/.graphscope/datasets/ldbc_sample/person_knows_person_0_0.csv",
delimiter="|",
)
)
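The inference rule can be pictured with a short plain-Python sketch. It is not GraphScope's implementation; `infer_endpoint_labels` is a hypothetical helper, and raising an error when the labels cannot be inferred is an assumption of this sketch rather than documented behavior.

```python
def infer_endpoint_labels(vertex_labels, src_label=None, dst_label=None):
    """Fill in missing endpoint labels when the graph has exactly one vertex label."""
    if src_label is not None and dst_label is not None:
        return src_label, dst_label  # nothing to infer
    if len(vertex_labels) != 1:
        # Assumption of this sketch: ambiguity is reported as an error.
        raise ValueError("src_label and dst_label cannot be inferred")
    only = vertex_labels[0]
    return src_label or only, dst_label or only


print(infer_endpoint_labels(["person"]))
```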
If a user adds edges with an unseen src_label or dst_label, GraphScope will extract a vertex table for the given labels from the edge data.
graph = graphscope.g()
# Deduce vertex label `person` from the source and destination endpoints of edges.
graph = graph.add_edges(
Loader(
"${HOME}/.graphscope/datasets/ldbc_sample/person_knows_person_0_0.csv",
delimiter="|",
),
src_label="person",
dst_label="person",
)
graph = graphscope.g()
# Deduce the vertex label `person` from the source endpoint,
# and vertex label `comment` from the destination endpoint of edges.
graph = graph.add_edges(
Loader(
"${HOME}/.graphscope/datasets/ldbc_sample/person_likes_comment_0_0.csv",
delimiter="|",
),
label="likes",
src_label="person",
dst_label="comment",
)
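Conceptually, extracting a vertex table from edge data means collecting the distinct endpoint IDs per label. The following plain-Python sketch shows that idea under simple assumptions; it does not call GraphScope, and `extract_vertices` and the edge tuples are hypothetical.

```python
def extract_vertices(edges, src_label, dst_label):
    """Collect distinct endpoint IDs per vertex label from (src, dst) pairs."""
    tables = {}
    for src, dst in edges:
        tables.setdefault(src_label, set()).add(src)
        tables.setdefault(dst_label, set()).add(dst)
    # Sort only to make the result deterministic for display.
    return {label: sorted(ids) for label, ids in tables.items()}


# Same label on both ends: one vertex table.
print(extract_vertices([(1, 2), (2, 3)], "person", "person"))
# Different labels: one vertex table per label.
print(extract_vertices([(1, 10), (2, 10)], "person", "comment"))
```

Vertices deduced this way carry only an ID, since the edge file holds no vertex properties.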
In some cases, one edge label may connect more than one pair of vertex kinds. For example, in a graph, two kinds of edges are both labeled likes but represent two different relations, i.e., person -> likes <- comment and person -> likes <- post.
In this case, we can simply add the relation again with the same edge label, but with different source and destination labels.
sess = graphscope.session(cluster_type="hosts", num_workers=1, mode="lazy")
graph = sess.g()
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
label="person",
)
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/comment_0_0.csv", delimiter="|"),
label="comment",
)
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/post_0_0.csv", delimiter="|"),
label="post",
)
graph = graph.add_edges(
Loader(
"${HOME}/.graphscope/datasets/ldbc_sample/person_likes_comment_0_0.csv",
delimiter="|",
),
label="likes",
src_label="person",
dst_label="comment",
)
graph = graph.add_edges(
Loader(
"${HOME}/.graphscope/datasets/ldbc_sample/person_likes_post_0_0.csv",
delimiter="|",
),
label="likes",
src_label="person",
dst_label="post",
)
graph = sess.run(graph)
print(graph.schema)
Please note:
This feature (several relations sharing one edge label) is currently only available in lazy mode.
For the several relations defined under the same Label, the attributes should be the same in number and type, and preferably have the same name, because the data of the same Label will be put into one Table, and the attribute names will use the names specified by the first configuration.
GraphScope will deduce data types from input files, and it works as expected in most cases. However, sometimes users may want to specify the data types themselves, e.g.
graph = graphscope.g()
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/post_0_0.csv", delimiter="|"),
label="post",
properties=["content", ("length", "int")],
)
This forces the property to be cast to and loaded as the specified data type. The format of this parameter is a tuple of the name and the type.
E.g., in this case, the property length will have type int rather than the default int64_t. The available types are int, int64, float, double, and str.
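The mixed list form, where plain names sit next to (name, type) tuples, can be normalized and applied as in the plain-Python sketch below. It does not call GraphScope; `CASTERS`, `parse_property_specs`, and `cast_row` are hypothetical, and because Python has a single integer and a single floating-point type, mapping both int and int64 to `int` (and both float and double to `float`) is a simplification of this sketch.

```python
# Type names from the text, mapped to Python casters (simplified: no bit widths).
CASTERS = {"int": int, "int64": int, "float": float, "double": float, "str": str}


def parse_property_specs(properties):
    """Normalize ["name", ("name", "type"), ...] into (name, type-or-None) pairs."""
    specs = []
    for item in properties:
        if isinstance(item, str):
            specs.append((item, None))  # type left to be deduced
            continue
        name, type_name = item
        if type_name not in CASTERS:
            raise ValueError(f"unsupported type: {type_name}")
        specs.append((name, type_name))
    return specs


def cast_row(row, specs):
    """Pick the selected properties from a row and cast those with an explicit type."""
    result = {}
    for name, type_name in specs:
        value = row[name]
        result[name] = value if type_name is None else CASTERS[type_name](value)
    return result


specs = parse_property_specs(["content", ("length", "int")])
print(cast_row({"content": "hello", "length": "5", "language": "en"}, specs))
```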
The class Graph has four meta options:
oid_type: can be int32_t, int64_t or string. Defaults to int64_t for efficiency. If the ID column can't be represented by int64_t, use string. When the IDs are known to be within the range of int32, using int32_t helps reduce memory usage.
directed: bool, defaults to True. Controls whether to load a directed or an undirected graph.
generate_eid: bool, defaults to True. Whether to generate a unique ID for every edge automatically.
retain_oid: bool, defaults to True. Whether to keep the original ID in the vertex table.
Let's make this example complete. A more complex example that loads the LDBC SNB graph can be found here.
graph = graphscope.g(oid_type="int64_t", directed=True, generate_eid=True, retain_oid=True)
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
label="person",
)
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/comment_0_0.csv", delimiter="|"),
label="comment",
)
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/post_0_0.csv", delimiter="|"),
label="post",
)
graph = graph.add_edges(
Loader(
"${HOME}/.graphscope/datasets/ldbc_sample/person_knows_person_0_0.csv",
delimiter="|",
),
label="knows",
src_label="person",
dst_label="person",
)
graph = graph.add_edges(
Loader(
"${HOME}/.graphscope/datasets/ldbc_sample/person_likes_comment_0_0.csv",
delimiter="|",
),
label="likes",
src_label="person",
dst_label="comment",
)
print(graph.schema)
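The guidance on choosing oid_type (int32_t when the IDs fit, int64_t otherwise, string as a fallback) can be expressed as a small pre-check on the ID column. This is a plain-Python sketch, not part of GraphScope; `suggest_oid_type` is a hypothetical helper that returns one of the three option names listed above.

```python
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def suggest_oid_type(ids):
    """Suggest 'int32_t', 'int64_t', or 'string' for a collection of vertex IDs."""
    numbers = []
    for value in ids:
        try:
            # Going through str() rejects floats such as 1.5 instead of truncating.
            numbers.append(int(str(value)))
        except ValueError:
            return "string"
    if all(INT32_MIN <= n <= INT32_MAX for n in numbers):
        return "int32_t"
    if all(INT64_MIN <= n <= INT64_MAX for n in numbers):
        return "int64_t"
    return "string"


print(suggest_oid_type([933, 1129]))
print(suggest_oid_type([2**40]))
print(suggest_oid_type(["user-1", "user-2"]))
```

The result could then be passed as the oid_type argument of graphscope.g(), as in the complete example above.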
The data source mentioned above is an object of Loader. A loader wraps a location or the data itself.
GraphScope supports loading a graph from pandas DataFrames or numpy ndarrays, making it easy to construct a graph right in the Python console.
Apart from the loader, the other fields such as properties and label are the same as in the examples above.
import numpy as np
import pandas as pd
leader_id = np.array([0, 0, 0, 1, 1, 3, 3, 6, 6, 6, 7, 7, 8])
member_id = np.array([2, 3, 4, 5, 6, 6, 8, 0, 2, 8, 8, 9, 9])
group_size = np.array([4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2])
e_data = np.transpose(np.vstack([leader_id, member_id, group_size]))
df_group = pd.DataFrame(e_data, columns=["leader_id", "member_id", "group_size"])
student_id = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
avg_score = np.array(
[490.33, 164.5, 190.25, 762.0, 434.2, 513.0, 569.0, 25.0, 308.0, 87.0]
)
v_data = np.transpose(np.vstack([student_id, avg_score]))
df_student = pd.DataFrame(v_data, columns=["student_id", "avg_score"]).astype(
{"student_id": np.int64}
)
# use a dataframe as datasource, properties omitted, col_0/col_1 will be used as src/dst by default.
# (for vertices, col_0 will be used as vertex_id by default)
graph = graphscope.g().add_vertices(df_student).add_edges(df_group)
Note that each array is a column; we pass the columns to the loader in a layout similar to the COO matrix format.
array_group = [df_group[col].values for col in ["leader_id", "member_id", "group_size"]]
array_student = [df_student[col].values for col in ["student_id", "avg_score"]]
graph = graphscope.g().add_vertices(array_student).add_edges(array_group)
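The column-wise layout can be checked with plain Python lists: each inner list is one column, and zipping the columns recovers the per-edge records. The sketch does not call GraphScope or numpy; `columns_to_rows` is a hypothetical helper and the values are a shortened version of the arrays above.

```python
def columns_to_rows(columns):
    """Turn a list of equally long columns into a list of row tuples."""
    if len({len(column) for column in columns}) > 1:
        raise ValueError("all columns must have the same length")
    return list(zip(*columns))


leader_id = [0, 0, 1]   # source IDs
member_id = [2, 3, 5]   # destination IDs
group_size = [4, 4, 3]  # one edge property

rows = columns_to_rows([leader_id, member_id, group_size])
print(rows)
```

Each resulting tuple reads as (source, destination, property), which is the same information a three-column DataFrame row would carry.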
When a loader wraps a location, it may contain only a str.
The string follows the URI standard. When receiving a request to load a graph from a location, graphscope parses the URI and invokes the corresponding loader according to the scheme.
Currently, graphscope supports loaders for local, s3, oss, and hdfs.
Under the hood, data is loaded in a distributed manner by v6d, which takes advantage of fsspec to resolve specific schemes and formats.
Any additional configurations can be passed in the kwargs of Loader, which will be parsed directly by the specific class, e.g., host and port for hdfs, or access-id and secret-access-key for oss or s3.
from graphscope.framework.loader import Loader
ds1 = Loader("file:///var/datafiles/group.e")
ds2 = Loader("oss://graphscope_bucket/datafiles/group.e", key='access-id', secret='secret-access-key', endpoint='oss-cn-hangzhou.aliyuncs.com')
ds3 = Loader("hdfs:///datafiles/group.e", host='localhost', port='9000', extra_conf={'conf1': 'value1'})
ds4 = Loader("s3://datafiles/group.e", key='access-id', secret='secret-access-key', client_kwargs={'region_name': 'us-east-1'})
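How a scheme is picked out of such a location string can be seen with the standard library alone. The sketch below is not GraphScope's dispatch code; `resolve_scheme` is a hypothetical helper, and treating a bare path without a scheme as local is an assumption of this sketch.

```python
from urllib.parse import urlparse

# Loader kinds named in the text.
SUPPORTED = {"local", "s3", "oss", "hdfs"}


def resolve_scheme(location):
    """Map a location string to one of the supported loader kinds."""
    scheme = urlparse(location).scheme
    if scheme in ("", "file"):
        scheme = "local"  # assumption: bare paths and file:// are both local
    if scheme not in SUPPORTED:
        raise ValueError(f"no loader registered for scheme: {scheme}")
    return scheme


for location in (
    "file:///var/datafiles/group.e",
    "oss://graphscope_bucket/datafiles/group.e",
    "hdfs:///datafiles/group.e",
    "s3://datafiles/group.e",
):
    print(location, "->", resolve_scheme(location))
```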
Users can implement customized loaders to support additional data sources. Taking ossfs as an example, a user needs to subclass AbstractFileSystem, which is used to resolve the specific protocol scheme, and AbstractBufferedFile to read and write.
The only methods the user needs to override are _upload_chunk, _initiate_upload and _fetch_range. In the end, the user needs to call fsspec.register_implementation('protocol_name', 'protocol_file_system') to register the corresponding resolver.
When the graph is huge, loading it can take a large amount of time (e.g., hours). GraphScope provides serialization and deserialization for graph data, which dump and load the constructed graph as binary data to and from disk.
graph.save_to takes a path argument, indicating the location to store the binary data.
graph.save_to('/tmp/serial')
graph.load_from is a classmethod; its path argument should be exactly the same as the path passed to graph.save_to. Please note that during serialization, each worker dumps its own data to files with its index as the suffix. Thus the number of workers for deserialization should be exactly the same as that for serialization.
In addition, graph.load_from needs an extra sess parameter, specifying which session the graph will be deserialized in.
import graphscope
from graphscope import Graph
sess = graphscope.session()
deserialized_graph = Graph.load_from('/tmp/serial', sess)
print(deserialized_graph.schema)
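Why the worker count has to match can be illustrated with a toy model in plain Python: each worker writes its own fragment to a file suffixed with its index, and loading expects exactly one file per worker. This is not GraphScope's on-disk format; the `fragment_<index>` file names, the use of pickle, and both helpers are assumptions made for the sketch.

```python
import os
import pickle
import tempfile


def save_fragments(fragments, path):
    """Write one file per worker, using the worker index as the suffix."""
    os.makedirs(path, exist_ok=True)
    for index, fragment in enumerate(fragments):
        with open(os.path.join(path, f"fragment_{index}"), "wb") as handle:
            pickle.dump(fragment, handle)


def load_fragments(path, num_workers):
    """Read the fragments back; the worker count must match the saved files."""
    if len(os.listdir(path)) != num_workers:
        raise ValueError("number of workers differs from the serialized data")
    fragments = []
    for index in range(num_workers):
        with open(os.path.join(path, f"fragment_{index}"), "rb") as handle:
            fragments.append(pickle.load(handle))
    return fragments


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "serial")
    save_fragments([{"vertices": [0, 1]}, {"vertices": [2, 3]}], target)
    restored = load_fragments(target, num_workers=2)
    try:
        load_fragments(target, num_workers=3)
        mismatch_rejected = False
    except ValueError:
        mismatch_rejected = True

print(restored, mismatch_rejected)
```

In the toy model, loading with three workers fails because only two fragment files exist, which mirrors the requirement stated above.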