In this tutorial, we demostrate how GraphScope process node classification task on citation network by combining analytics, interactive and graph neural networks computation.
In this example, we use ogbn-mag dataset. ogbn-mag is a heterogeneous network composed of a subset of the Microsoft Academic Graph. It contains 4 types of entities(i.e., papers, authors, institutions, and fields of study), as well as four types of directed relations connecting two entities.
Given the heterogeneous ogbn-mag data, the task is to predict the class of each paper. We apply both the attribute and structural information to classify papers. In the graph, each paper node contains a 128-dimensional word2vec vector representing its content, which is obtained by averaging the embeddings of words in its title and abstract. The embeddings of individual words are pre-trained. The structural information is computed on-the-fly.
This tutorial has the following steps:
First, let's create a session and load obgn_mag dataset as a graph.
import os
import graphscope
from graphscope.dataset.ogbn_mag import load_ogbn_mag
k8s_volumes = {
"data": {
"type": "hostPath",
"field": {
"path": "/testingdata",
"type": "Directory"
},
"mounts": {
"mountPath": "/home/jovyan/datasets",
"readOnly": True
}
}
}
graphscope.set_option(show_log=True)
sess = graphscope.session(k8s_volumes=k8s_volumes)
graph = load_ogbn_mag(sess, "/home/jovyan/datasets/ogbn_mag_small/")
In this example, we launch a interactive query and use graph traversal to count the number of papers two given authors have co-authored. To simplify the query, we assume the authors can be uniquely identified by ID 2
and 4307
, respectively.
# get the entrypoint for submitting Gremlin queries on graph g.
interactive = sess.gremlin(graph)
# count the number of papers two authors (with id 2 and 4307) have co-authored.
papers = interactive.execute("g.V().has('author', 'id', 2).out('writes').where(__.in('writes').has('id', 4307)).count()").one()
print("result", papers)
Continuing our example, we run graph algorithms on graph to generate structural features. below we first derive a subgraph by extracting publications in specific time out of the entire graph (using Gremlin!), and then run k-core decomposition and triangle counting to generate the structural features of each paper node.
# exact a subgraph of publication within a time range.
sub_graph = interactive.subgraph(
"g.V().has('year', inside(2014, 2020)).outE('cites')"
)
# project the subgraph to simple graph by selecting papers and their citations.
simple_g = sub_graph.project(vertices={"paper": []}, edges={"cites": []})
# compute the kcore and triangle-counting.
kc_result = graphscope.k_core(simple_g, k=5)
tc_result = graphscope.triangles(simple_g)
# add the results as new columns to the citation graph.
sub_graph = sub_graph.add_column(kc_result, {"kcore": "r"})
sub_graph = sub_graph.add_column(tc_result, {"tc": "r"})
Then, we use the generated structural features and original features to train a learning model with learning engine.
In our example, we train a GCN model to classify the nodes (papers) into 349 categories, each of which represents a venue (e.g. pre-print and conference).
# define the features for learning,
# we chose original 128-dimension feature and k-core, triangle count result as new features.
paper_features = []
for i in range(128):
paper_features.append("feat_" + str(i))
paper_features.append("kcore")
paper_features.append("tc")
# launch a learning engine. here we split the dataset, 75% as train, 10% as validation and 15% as test.
lg = sess.learning(sub_graph, nodes=[("paper", paper_features)],
edges=[("paper", "cites", "paper")],
gen_labels=[
("train", "paper", 100, (0, 75)),
("val", "paper", 100, (75, 85)),
("test", "paper", 100, (85, 100))
])
# Then we define the training process, use internal GCN model.
from graphscope.learning.examples import GCN
from graphscope.learning.graphlearn.python.model.tf.trainer import LocalTFTrainer
from graphscope.learning.graphlearn.python.model.tf.optimizer import get_tf_optimizer
def train(config, graph):
def model_fn():
return GCN(graph,
config["class_num"],
config["features_num"],
config["batch_size"],
val_batch_size=config["val_batch_size"],
test_batch_size=config["test_batch_size"],
categorical_attrs_desc=config["categorical_attrs_desc"],
hidden_dim=config["hidden_dim"],
in_drop_rate=config["in_drop_rate"],
neighs_num=config["neighs_num"],
hops_num=config["hops_num"],
node_type=config["node_type"],
edge_type=config["edge_type"],
full_graph_mode=config["full_graph_mode"])
trainer = LocalTFTrainer(model_fn,
epoch=config["epoch"],
optimizer=get_tf_optimizer(
config["learning_algo"],
config["learning_rate"],
config["weight_decay"]))
trainer.train_and_evaluate()
# hyperparameters config.
config = {"class_num": 349, # output dimension
"features_num": 130, # 128 dimension + kcore + triangle count
"batch_size": 500,
"val_batch_size": 100,
"test_batch_size":100,
"categorical_attrs_desc": "",
"hidden_dim": 256,
"in_drop_rate": 0.5,
"hops_num": 2,
"neighs_num": [5, 10],
"full_graph_mode": False,
"agg_type": "gcn", # mean, sum
"learning_algo": "adam",
"learning_rate": 0.01,
"weight_decay": 0.0005,
"epoch": 5,
"node_type": "paper",
"edge_type": "cites"}
# start traning and evaluating
train(config, lg)
Finally, don't forget to close the session.
# close the session.
sess.close()