First, create a session and import the required packages.
# Install graphscope package if you are NOT in the Playground
!pip3 install graphscope
import graphscope
graphscope.set_option(show_log=False)
from graphscope.dataset import load_ldbc
graph = load_ldbc()
In standalone mode, GraphScope downloads the data files to `${HOME}/.graphscope/datasets` and keeps them for future use.
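More commonly, however, users need to load their own datasets and analyze them.
GraphScope provides functions to define the schema of a property graph and load the graph into GraphScope.
First, create an empty graph: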
import graphscope
from graphscope.framework.loader import Loader
graph = graphscope.g()
The `Graph` class has several methods for configuring the graph:
def add_vertices(self, vertices, label="_", properties=None, vid_field=0):
    pass
def add_edges(self, edges, label="_e", properties=None, src_label=None, dst_label=None, src_field=0, dst_field=1):
    pass
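These methods build a property graph incrementally.
We will use the files in `ldbc_sample` for the examples in this tutorial. You can find the source data here.
You can call `print(graph.schema)` at any time to inspect the graph's schema.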
graph = graphscope.g()
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|")
)
This loads data from the file `${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv` and creates a vertex label named `_`.
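`label` is the name of the vertex label and defaults to `_`.
A graph cannot contain two labels with the same name, so when there are two or more labels you must name them explicitly. Giving each label a meaningful name is good practice in any case.
The name can be any identifier.
For example: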
graph = graphscope.g()
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
label="person",
)
The result is identical to that of the previous step, except for the label name.
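`properties` is a list of property names. It is optional and defaults to `None`.
Property names should match the names in the header row of the data.
If it is omitted or `None`, all columns except the ID column are loaded as properties; if it is an empty list `[]`, no properties are loaded; otherwise only the listed columns are loaded as properties.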
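The three cases of this rule can be sketched in plain Python. This is only an illustration of the selection logic, not GraphScope code: `select_properties` is a hypothetical helper, and the header is assumed to be the `person_0_0.csv` column list with `id` as the first column.

```python
def select_properties(header, properties=None, vid_field=0):
    """Return the column names that would be loaded as properties.

    Hypothetical helper that mirrors the rule in the text; it does not
    call GraphScope.
    """
    # Resolve the ID column to a name so that it can be excluded.
    id_name = header[vid_field] if isinstance(vid_field, int) else vid_field
    if properties is None:
        # Omitted or None: every column except the ID column.
        return [name for name in header if name != id_name]
    # An explicit list (possibly empty) is used as given, after validation.
    missing = [name for name in properties if name not in header]
    if missing:
        raise ValueError(f"columns not found in header: {missing}")
    return list(properties)


# Assumed header of person_0_0.csv.
header = ["id", "firstName", "lastName", "gender", "birthday",
          "creationDate", "locationIP", "browserUsed"]

print(select_properties(header))                             # all but "id"
print(select_properties(header, ["firstName", "lastName"]))  # two columns
print(select_properties(header, []))                         # no properties
```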
For example:
# properties will be firstName,lastName,gender,birthday,creationDate,locationIP,browserUsed
graph = graphscope.g()
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
label="person",
properties=None,
)
# properties will be firstName, lastName
graph = graphscope.g()
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
label="person",
properties=["firstName", "lastName"],
)
# no properties
graph = graphscope.g()
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
label="person",
properties=[],
)
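`vid_field` is the column used as the ID column; it defaults to 0. This column is used as the source ID or destination ID when loading edges.
The value can be a string, which refers to a column name,
or a non-negative integer giving the column index (starting from 0).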
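To make the two accepted forms concrete, here is a small sketch that reads the ID column from pipe-delimited text either by column name or by 0-based index. `read_ids` is a hypothetical helper and the sample rows are made up for illustration; it only demonstrates how the two forms refer to the same column.

```python
import csv
import io

# Made-up sample in the same pipe-delimited layout as the LDBC files.
SAMPLE = "id|firstName|lastName\n1|Ada|Lovelace\n2|Alan|Turing\n"


def read_ids(text, vid_field=0, delimiter="|"):
    """Return the values of the ID column, given a column name or an index."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    # A string is looked up in the header; an integer is used directly.
    index = header.index(vid_field) if isinstance(vid_field, str) else vid_field
    return [row[index] for row in reader]


print(read_ids(SAMPLE, vid_field="id"))  # select by column name
print(read_ids(SAMPLE, vid_field=0))     # select by 0-based index
```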
graph = graphscope.g()
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
vid_field="id",
)
graph = graphscope.g()
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
vid_field=0,
)
graph = graphscope.g()
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
label="person",
)
# Note we already added a vertex label named 'person'.
graph = graph.add_edges(
Loader(
"${HOME}/.graphscope/datasets/ldbc_sample/person_knows_person_0_0.csv",
delimiter="|",
),
src_label="person",
dst_label="person",
)
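This loads edges with the label `_e`, whose source and destination vertex labels are both `person`. The first column is used as the source vertex ID and the second column as the destination vertex ID; all other columns are loaded as properties.
`label` is the name of the edge label and defaults to `_e`. Using a meaningful label name is always recommended.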
graph = graphscope.g()
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
label="person",
)
graph = graph.add_edges(
Loader(
"${HOME}/.graphscope/datasets/ldbc_sample/person_knows_person_0_0.csv",
delimiter="|",
),
label="knows",
src_label="person",
dst_label="person",
)
graph = graphscope.g()
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
label="person",
)
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/comment_0_0.csv", delimiter="|"),
label="comment",
)
# Note we already added a vertex label named 'person'.
graph = graph.add_edges(
Loader(
"${HOME}/.graphscope/datasets/ldbc_sample/person_likes_comment_0_0.csv",
delimiter="|",
),
label="likes",
src_label="person",
dst_label="comment",
)
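`src_field` and `dst_field` are the source ID column and the destination ID column. They default to 0 and 1 respectively.
They behave like `vid_field` for vertices, except that two columns are needed: one for the source ID and one for the destination ID. Here is an example: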
graph = graphscope.g()
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
label="person",
)
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/comment_0_0.csv", delimiter="|"),
label="comment",
)
graph = graph.add_edges(
Loader(
"${HOME}/.graphscope/datasets/ldbc_sample/person_likes_comment_0_0.csv",
delimiter="|",
),
label="likes",
src_label="person",
dst_label="comment",
src_field="Person.id",
dst_field="Comment.id",
)
# Or use the index.
# graph = graph.add_edges(Loader('${HOME}/.graphscope/datasets/ldbc_sample/person_likes_comment_0_0.csv', delimiter='|'), label='likes', src_label='person', dst_label='comment', src_field=0, dst_field=1)
graph = graphscope.g()
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
label="person",
)
# GraphScope will assign ``src_label`` and ``dst_label`` to ``person`` automatically.
graph = graph.add_edges(
Loader(
"${HOME}/.graphscope/datasets/ldbc_sample/person_knows_person_0_0.csv",
delimiter="|",
)
)
If `src_label` or `dst_label` in `add_edges` refers to a vertex label that does not exist in the graph, graphscope builds the vertex table by aggregating the edge endpoints.
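Conceptually, deducing a vertex table amounts to collecting the distinct endpoint IDs under each label. The sketch below shows that idea in plain Python; `deduce_vertices` is a hypothetical helper, the edge lists are made up, and GraphScope's actual implementation is not shown here.

```python
def deduce_vertices(edges, src_label, dst_label):
    """Collect the distinct endpoint IDs of `edges` under their vertex labels.

    `edges` is an iterable of rows whose first two fields are the source
    and destination IDs; any further fields are ignored.
    """
    vertices = {}
    for src, dst, *_ in edges:
        vertices.setdefault(src_label, set()).add(src)
        vertices.setdefault(dst_label, set()).add(dst)
    return {label: sorted(ids) for label, ids in vertices.items()}


# Made-up edge lists for illustration.
knows = [(1, 2), (1, 3), (2, 3)]
likes = [(1, 100), (2, 100), (2, 101)]

print(deduce_vertices(knows, "person", "person"))
print(deduce_vertices(likes, "person", "comment"))
```

When the source and destination labels are the same, both endpoint columns contribute to a single vertex table; otherwise each column produces its own.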
graph = graphscope.g()
# Deduce vertex label `person` from the source and destination endpoints of edges.
graph = graph.add_edges(
Loader(
"${HOME}/.graphscope/datasets/ldbc_sample/person_knows_person_0_0.csv",
delimiter="|",
),
src_label="person",
dst_label="person",
)
graph = graphscope.g()
# Deduce the vertex label `person` from the source endpoint,
# and vertex label `comment` from the destination endpoint of edges.
graph = graph.add_edges(
Loader(
"${HOME}/.graphscope/datasets/ldbc_sample/person_likes_comment_0_0.csv",
delimiter="|",
),
label="likes",
src_label="person",
dst_label="comment",
)
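In some cases, the same edge label may connect more than one pair of vertex labels. For example, in the property graph below, the edge label `likes` connects two pairs of vertex labels: `person` -> `likes` <- `comment` and `person` -> `likes` <- `post`.
In that case, add the `likes` edges twice, each time with different source and destination vertex labels.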
sess = graphscope.session(cluster_type="hosts", num_workers=1, mode="lazy")
graph = sess.g()
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
label="person",
)
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/comment_0_0.csv", delimiter="|"),
label="comment",
)
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/post_0_0.csv", delimiter="|"),
label="post",
)
graph = graph.add_edges(
Loader(
"${HOME}/.graphscope/datasets/ldbc_sample/person_likes_comment_0_0.csv",
delimiter="|",
),
label="likes",
src_label="person",
dst_label="comment",
)
graph = graph.add_edges(
Loader(
"${HOME}/.graphscope/datasets/ldbc_sample/person_likes_post_0_0.csv",
delimiter="|",
),
label="likes",
src_label="person",
dst_label="post",
)
graph = sess.run(graph)
print(graph.schema)
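Note: adding the same edge label more than once is supported only in `lazy` sessions.
GraphScope can infer property types from the input files, which works well in most cases.
Sometimes, however, you need more control. To specify a type explicitly, add the type after the property name, like this: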
graph = graphscope.g()
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/post_0_0.csv", delimiter="|"),
label="post",
properties=["content", ("length", "int")],
)
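This casts the property to the specified type. Note that the property name and its type must be in the same tuple.
Here the property `length` will have type `int`, whereas it would default to `int64_t` if left unspecified. Commonly specified types include `int`, `int64_t`, `float`, `double` and `str`.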
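A `properties` list may therefore mix bare names with `(name, type)` tuples. The following sketch shows one way such a list could be normalized and applied to a row of raw strings. `normalize_properties`, `cast_row` and the `CASTERS` table are hypothetical names used for illustration, and mapping both `int` and `int64_t` to Python's `int` is a simplification, since Python has a single integer type.

```python
# Rough Python stand-ins for the type names listed above (an assumption:
# Python does not distinguish integer widths or float from double).
CASTERS = {"int": int, "int64_t": int, "float": float, "double": float, "str": str}


def normalize_properties(properties):
    """Turn a mixed list of names and (name, type) tuples into uniform pairs."""
    pairs = []
    for item in properties:
        if isinstance(item, str):
            pairs.append((item, None))  # None: leave the type to inference
        else:
            name, type_name = item
            pairs.append((name, type_name))
    return pairs


def cast_row(row, properties):
    """Keep only the requested properties and cast those with an explicit type."""
    result = {}
    for name, type_name in normalize_properties(properties):
        value = row[name]
        result[name] = CASTERS[type_name](value) if type_name else value
    return result


# Made-up row; every field arrives as a string, as it would from a CSV file.
row = {"id": "1", "content": "About Alan", "length": "10"}
print(cast_row(row, ["content", ("length", "int")]))
```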
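The `Graph` class has four parameters that configure its metadata:
- `oid_type`: can be `int32_t`, `int64_t` or `string`. Defaults to `int64_t`, which is faster and uses less memory. Use `string` only when the IDs cannot be represented as `int64_t`.
- `directed`: bool, defaults to `True`. Whether to load the graph as directed or undirected.
- `generate_eid`: bool, defaults to `True`. Whether to assign a globally unique ID to every edge.
- `retain_oid`: bool, defaults to `True`. Whether to keep the original vertex IDs in the vertex property table.
Let's write a complete graph definition.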
graph = graphscope.g(oid_type="int64_t", directed=True, generate_eid=True, retain_oid=True)
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/person_0_0.csv", delimiter="|"),
label="person",
)
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/comment_0_0.csv", delimiter="|"),
label="comment",
)
graph = graph.add_vertices(
Loader("${HOME}/.graphscope/datasets/ldbc_sample/post_0_0.csv", delimiter="|"),
label="post",
)
graph = graph.add_edges(
Loader(
"${HOME}/.graphscope/datasets/ldbc_sample/person_knows_person_0_0.csv",
delimiter="|",
),
label="knows",
src_label="person",
dst_label="person",
)
graph = graph.add_edges(
Loader(
"${HOME}/.graphscope/datasets/ldbc_sample/person_likes_comment_0_0.csv",
delimiter="|",
),
label="likes",
src_label="person",
dst_label="comment",
)
print(graph.schema)
import numpy as np
import pandas as pd
leader_id = np.array([0, 0, 0, 1, 1, 3, 3, 6, 6, 6, 7, 7, 8])
member_id = np.array([2, 3, 4, 5, 6, 6, 8, 0, 2, 8, 8, 9, 9])
group_size = np.array([4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2])
e_data = np.transpose(np.vstack([leader_id, member_id, group_size]))
df_group = pd.DataFrame(e_data, columns=["leader_id", "member_id", "group_size"])
student_id = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
avg_score = np.array(
[490.33, 164.5, 190.25, 762.0, 434.2, 513.0, 569.0, 25.0, 308.0, 87.0]
)
v_data = np.transpose(np.vstack([student_id, avg_score]))
df_student = pd.DataFrame(v_data, columns=["student_id", "avg_score"]).astype(
{"student_id": np.int64}
)
# use a dataframe as datasource, properties omitted, col_0/col_1 will be used as src/dst by default.
# (for vertices, col_0 will be used as vertex_id by default)
graph = graphscope.g().add_vertices(df_student).add_edges(df_group)
Note that each array represents one column; the arrays are passed in the style of a COO matrix.
array_group = [df_group[col].values for col in ["leader_id", "member_id", "group_size"]]
array_student = [df_student[col].values for col in ["student_id", "avg_score"]]
graph = graphscope.g().add_vertices(array_student).add_edges(array_group)
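The column-per-array layout can be pictured as three parallel sequences whose i-th entries together describe the i-th edge. In the sketch below, plain Python lists stand in for the NumPy arrays defined earlier (same values), and `zip` converts between the column view and the row view. It only illustrates the layout and does not involve GraphScope.

```python
# Same values as the leader_id, member_id and group_size arrays above.
leader_id = [0, 0, 0, 1, 1, 3, 3, 6, 6, 6, 7, 7, 8]
member_id = [2, 3, 4, 5, 6, 6, 8, 0, 2, 8, 8, 9, 9]
group_size = [4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2]

# Column view -> row view: one (src, dst, property) tuple per edge.
edges = list(zip(leader_id, member_id, group_size))
print(len(edges), edges[0], edges[-1])

# Row view -> column view: transposing gives back the three columns.
columns = [list(column) for column in zip(*edges)]
print(columns[0] == leader_id)
```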
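When a `Loader` wraps a file path, it may contain nothing but a string.
The file path should follow the URI standard. When graphscope receives a load request that contains a file path, it parses the URI and invokes the corresponding loading module.
Currently, graphscope supports several data sources: local files, OSS, S3 and HDFS.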
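Data is loaded by Vineyard, which uses [fsspec](https://github.com/intake/filesystem_spec) to resolve the different data formats and parameters. Any additional, source-specific configuration can be passed as keyword arguments to `Loader`; these are forwarded directly to the corresponding storage class. Examples are `host` and `port` for HDFS, or `access-id` and `secret-access-key` for OSS or S3.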
from graphscope.framework.loader import Loader
ds1 = Loader("file:///var/datafiles/group.e")
ds2 = Loader("oss://graphscope_bucket/datafiles/group.e", key='access-id', secret='secret-access-key', endpoint='oss-cn-hangzhou.aliyuncs.com')
ds3 = Loader("hdfs:///datafiles/group.e", host='localhost', port='9000', extra_conf={'conf1': 'value1'})
ds4 = Loader("s3://datafiles/group.e", key='access-id', secret='secret-access-key', client_kwargs={'region_name': 'us-east-1'})
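The scheme-based dispatch described above can be illustrated with the standard library alone. The sketch below splits each of the example URIs into scheme, bucket or host, and path using `urllib.parse.urlparse`. `describe_source` and `SUPPORTED_SCHEMES` are hypothetical names; how GraphScope and Vineyard actually resolve a URI is not shown here.

```python
from urllib.parse import urlparse

# Schemes named in the text above; the dispatch itself is an illustration.
SUPPORTED_SCHEMES = {"file", "oss", "s3", "hdfs"}


def describe_source(uri):
    """Split a data-source URI into (scheme, bucket_or_host, path)."""
    parsed = urlparse(uri)
    if parsed.scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported scheme: {parsed.scheme!r}")
    return parsed.scheme, parsed.netloc, parsed.path


for uri in (
    "file:///var/datafiles/group.e",
    "oss://graphscope_bucket/datafiles/group.e",
    "hdfs:///datafiles/group.e",
    "s3://datafiles/group.e",
):
    print(describe_source(uri))
```

Note that for `s3://datafiles/group.e` the first component after the scheme is parsed as the bucket, so the path is only `/group.e`.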
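You can implement your own driver to support additional data sources, for example by following the implementation of the ossfs driver.
You need to subclass `AbstractFileSystem`, which acts as the resolver for the scheme, and `AbstractBufferedFile`. Implementing just `_upload_chunk`, `_initiate_upload` and `_fetch_range` is enough for basic read and write support. Finally, register the custom resolver with `fsspec.register_implementation('protocol_name', 'protocol_file_system')`.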
Loading a very large graph can take a long time (possibly several hours).
GraphScope can serialize a loaded graph to disk in binary form and deserialize those files back into a graph.
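The parameters of `graph.load_from` are similar to those of `graph.save_to`. However, its `path` argument must be exactly the same as the `path` passed to `graph.save_to` during serialization, because GraphScope relies on a naming convention to find all the files. During serialization, every worker writes the graph data it holds to a file whose name ends with its own worker ID, so the number of workers at deserialization must also exactly match the number used at serialization.
`graph.load_from` additionally takes a `sess` argument, the session into which the deserialized graph is loaded.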
import graphscope
from graphscope import Graph
sess = graphscope.session()
deserialized_graph = Graph.load_from('/tmp/seri', sess)
print(deserialized_graph.schema)
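The reason the path and the worker count must both match can be seen in a toy model of per-worker files. In the sketch below, each worker's shard is written to a file named after a shared prefix plus the worker ID, and loading refuses to proceed when the number of files differs from the number of workers. The `prefix_<worker id>` naming, the use of `pickle`, and the helpers `save_shards` and `load_shards` are all assumptions made for illustration; GraphScope's real file format and naming convention are not shown.

```python
import pickle
import tempfile
from pathlib import Path


def save_shards(shards, prefix):
    """Write one file per worker, named '<prefix>_<worker id>' (assumed scheme)."""
    for worker_id, shard in enumerate(shards):
        Path(f"{prefix}_{worker_id}").write_bytes(pickle.dumps(shard))


def load_shards(prefix, num_workers):
    """Read the shards back, requiring exactly one file per worker."""
    base = Path(prefix)
    found = list(base.parent.glob(base.name + "_*"))
    if len(found) != num_workers:
        raise ValueError(
            f"found {len(found)} shard files but {num_workers} workers were requested"
        )
    return [
        pickle.loads(Path(f"{prefix}_{worker_id}").read_bytes())
        for worker_id in range(num_workers)
    ]


with tempfile.TemporaryDirectory() as tmp:
    prefix = str(Path(tmp) / "seri")
    save_shards([{"vertices": [0, 1]}, {"vertices": [2, 3]}], prefix)
    print(load_shards(prefix, 2))      # same prefix, same worker count: succeeds
    try:
        load_shards(prefix, 3)         # different worker count: rejected
    except ValueError as err:
        print("error:", err)
```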