This demo explains how to load data into a form that can be used by the StellarGraph library. See all other demos.
The StellarGraph library supports loading graph information from Pandas, a library for working with data frames. This approach offers a good balance between performance and convenience.
The StellarGraph library supports many deep machine learning (ML) algorithms on graphs. A graph consists of a set of nodes connected by edges, potentially with information associated with each node and edge. Any task using the StellarGraph library needs data to be loaded into an instance of the `StellarGraph` class. This class stores the graph structure (the nodes and the edges between them), as well as information about them: node and edge types, node and edge features, and edge weights.
All of these are optional, because they have sensible defaults if they're not relevant to the task at hand.
This notebook walks through loading several kinds of graphs using Pandas, which is reasonably efficient and convenient for preprocessing.
StellarGraph supports loading data from many sources with all sorts of data preprocessing, via Pandas DataFrames, NumPy arrays, Neo4j and NetworkX graphs. See all loading demos for more details.
The documentation for the `StellarGraph` class includes a compressed reminder of everything discussed in this file, as well as explanations of all of the parameters.
The `StellarGraph` class is available at the top level of the `stellargraph` library:
# install StellarGraph if running on Google Colab
import sys
if 'google.colab' in sys.modules:
    %pip install -q stellargraph[demos]==1.2.1
# verify that we're using the correct version of StellarGraph for this notebook
import stellargraph as sg
try:
    sg.utils.validate_notebook_version("1.2.1")
except AttributeError:
    raise ValueError(
        f"This notebook requires StellarGraph version 1.2.1, but a different version {sg.__version__} is installed. Please see <https://github.com/stellargraph/stellargraph/issues/1172>."
    ) from None
from stellargraph import StellarGraph
Pandas DataFrames are tables of data that can be created from many input sources, such as CSV files and SQL databases. StellarGraph builds on this power by allowing construction from these DataFrames.
import pandas as pd
Pandas is widely supported by other libraries and products, like scikit-learn, and thus a user of StellarGraph gets to benefit from these easily too.
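As a small illustration of the "many input sources" point, the sketch below reads an edge list from a SQL database into a DataFrame. The in-memory SQLite database, the `edges` table and the name `edges_from_sql` are all hypothetical stand-ins for a real data source.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory database standing in for a real edge table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (source TEXT, target TEXT)")
conn.executemany(
    "INSERT INTO edges VALUES (?, ?)",
    [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("a", "c")],
)

# read_sql_query returns an ordinary DataFrame with one row per edge
edges_from_sql = pd.read_sql_query("SELECT source, target FROM edges", conn)
conn.close()

print(edges_from_sql)
```

The result is a plain DataFrame with `source` and `target` columns, the same shape of edge data used throughout the rest of this notebook.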
We'll start with a homogeneous graph without any node features. This means the graph consists of only nodes and edges without any information other than a unique identifier.
The basic form of constructing a `StellarGraph` is passing in an edge `DataFrame` with two columns (`source` and `target`), where each row represents a pair of nodes that are connected. Let's construct a `StellarGraph` representing a square with a diagonal:
a -- b
| \  |
|  \ |
d -- c
We'll start with a synthetic DataFrame defined in code here (there are some examples later of reading DataFrames from files).
Each row represents a connection: for instance, the first one is the edge from `a` to `b`.
square_edges = pd.DataFrame(
{"source": ["a", "b", "c", "d", "a"], "target": ["b", "c", "d", "a", "c"]}
)
square_edges
| | source | target |
---|---|---|
0 | a | b |
1 | b | c |
2 | c | d |
3 | d | a |
4 | a | c |
Given our edges, we can create a `StellarGraph` directly:
square = StellarGraph(edges=square_edges)
The `info` method gives a high-level summary of a `StellarGraph`:
print(square.info())
StellarGraph: Undirected multigraph
 Nodes: 4, Edges: 5

 Node types:
  default: [4]
    Features: none
    Edge types: default-default->default

 Edge types:
    default-default->default: [5]
        Weights: all 1 (default)
        Features: none
On this square, it tells us that there are 4 nodes of type `default` (a homogeneous graph still has node and edge types, but they default to `default`), with no features, and one type of edge that touches them. It also tells us that there are 5 edges of type `default` that go between nodes of type `default`. This matches what we expect: it's a graph with 4 nodes and 5 edges and one type of each.
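The counts reported by `info` can be cross-checked against the edge DataFrame with Pandas alone. This is only a sanity-check sketch; the names `node_ids`, `n_nodes` and `n_edges` are illustrative.

```python
import pandas as pd

square_edges = pd.DataFrame(
    {"source": ["a", "b", "c", "d", "a"], "target": ["b", "c", "d", "a", "c"]}
)

# every node ID that appears at either end of an edge
node_ids = pd.unique(square_edges[["source", "target"]].to_numpy().ravel())

n_nodes = len(node_ids)
n_edges = len(square_edges)  # one row per edge
print(n_nodes, n_edges)
```

With no node DataFrame supplied, the nodes can only come from the edge endpoints, which is why the four distinct IDs here match the node count above.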
The default node type and edge types can be set using the `node_type_default` and `edge_type_default` parameters to `StellarGraph(...)`:
square_named = StellarGraph(
edges=square_edges, node_type_default="corner", edge_type_default="line"
)
print(square_named.info())
StellarGraph: Undirected multigraph
 Nodes: 4, Edges: 5

 Node types:
  corner: [4]
    Features: none
    Edge types: corner-line->corner

 Edge types:
    corner-line->corner: [5]
        Weights: all 1 (default)
        Features: none
The names of the columns used for the edges can be controlled with the `source_column` and `target_column` parameters to `StellarGraph(...)`. For instance, maybe our graph comes from a file with `first` and `second` columns:
square_edges_first_second = square_edges.rename(
columns={"source": "first", "target": "second"}
)
square_edges_first_second
| | first | second |
---|---|---|
0 | a | b |
1 | b | c |
2 | c | d |
3 | d | a |
4 | a | c |
square_first_second = StellarGraph(
edges=square_edges_first_second, source_column="first", target_column="second"
)
print(square_first_second.info())
StellarGraph: Undirected multigraph
 Nodes: 4, Edges: 5

 Node types:
  default: [4]
    Features: none
    Edge types: default-default->default

 Edge types:
    default-default->default: [5]
        Weights: all 1 (default)
        Features: none
For many real-world problems, we have more than just graph structure: we have information about the nodes and edges. For instance, we might have a graph of academic papers (nodes) and how they cite each other (edges): we might have information about the nodes such as the authors and the publication year, and even the abstract or full paper contents. If we're doing a machine learning task, it can be useful to feed this information into models. The `StellarGraph` class supports this using a Pandas DataFrame: each row corresponds to a feature vector for a node or edge.
Let's imagine the nodes have two features, which might be their coordinates, or maybe some other piece of information. We'll continue using synthetic DataFrames, but these could easily be read from a file. (There's an example in the "Real data: Homogeneous graph from CSV files" section at the end of this notebook.)
square_node_data = pd.DataFrame(
{"x": [1, 2, 3, 4], "y": [-0.2, 0.3, 0.0, -0.5]}, index=["a", "b", "c", "d"]
)
square_node_data
| | x | y |
---|---|---|
a | 1 | -0.2 |
b | 2 | 0.3 |
c | 3 | 0.0 |
d | 4 | -0.5 |
`StellarGraph` uses the index of the DataFrame as the connection between a node and a row of the DataFrame. Notice that the `square_node_data` DataFrame has `a`, ..., `d` as its index, matching the identifiers used in the edges.
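Because the index is what ties node rows to edge endpoints, it can be worth checking that every endpoint appears in the node index before building a graph. The sketch below does this with Pandas; `endpoints`, `missing`, `bad_edges` and `bad_missing` are illustrative names, and the extra node `e` is a made-up example of a mismatch.

```python
import pandas as pd

square_node_data = pd.DataFrame(
    {"x": [1, 2, 3, 4], "y": [-0.2, 0.3, 0.0, -0.5]}, index=["a", "b", "c", "d"]
)
square_edges = pd.DataFrame(
    {"source": ["a", "b", "c", "d", "a"], "target": ["b", "c", "d", "a", "c"]}
)

# all IDs used at either end of an edge
endpoints = pd.concat([square_edges["source"], square_edges["target"]])

# IDs that are used by edges but have no row in the node DataFrame
missing = endpoints[~endpoints.isin(square_node_data.index)].unique()
print(list(missing))

# a made-up edge list that refers to a node "e" with no node data
bad_edges = pd.DataFrame({"source": ["a", "e"], "target": ["b", "a"]})
bad_endpoints = pd.concat([bad_edges["source"], bad_edges["target"]])
bad_missing = bad_endpoints[~bad_endpoints.isin(square_node_data.index)].unique()
print(list(bad_missing))
```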
We've now got all the right node data, in addition to the edges from before, so now we can create a `StellarGraph`.
square_node_features = StellarGraph(square_node_data, square_edges)
print(square_node_features.info())
StellarGraph: Undirected multigraph
 Nodes: 4, Edges: 5

 Node types:
  default: [4]
    Features: float32 vector, length 2
    Edge types: default-default->default

 Edge types:
    default-default->default: [5]
        Weights: all 1 (default)
        Features: none
Notice the output of `info` now says that the nodes of the `default` type have 2 features.
We can also give the node and edge types helpful names, using either the `node_type_default`/`edge_type_default` parameters we saw before, or by passing the DataFrames in with a dictionary, where the key is the name of the type.
square_named_node_features = StellarGraph(
{"corner": square_node_data}, {"line": square_edges}
)
print(square_named_node_features.info())
StellarGraph: Undirected multigraph
 Nodes: 4, Edges: 5

 Node types:
  corner: [4]
    Features: float32 vector, length 2
    Edge types: corner-line->corner

 Edge types:
    corner-line->corner: [5]
        Weights: all 1 (default)
        Features: none
Edges can have features in the same way as nodes. Any columns that don't have a special meaning are taken as feature vector elements. This means that the source and target columns are not included in the feature vectors (nor are the weight or edge type columns, which are discussed later).
Let's imagine the edges have 3 features each.
square_edge_data = pd.DataFrame(
{
"source": ["a", "b", "c", "d", "a"],
"target": ["b", "c", "d", "a", "c"],
"A": [-1, 2, -3, 4, -5],
"B": [0.4, 0.1, 0.9, 0, 0.9],
"C": [12, 34, 56, 78, 90],
}
)
square_edge_data
| | source | target | A | B | C |
---|---|---|---|---|---|
0 | a | b | -1 | 0.4 | 12 |
1 | b | c | 2 | 0.1 | 34 |
2 | c | d | -3 | 0.9 | 56 |
3 | d | a | 4 | 0.0 | 78 |
4 | a | c | -5 | 0.9 | 90 |
square_named_features = StellarGraph(
{"corner": square_node_data}, {"line": square_edge_data}
)
print(square_named_features.info())
StellarGraph: Undirected multigraph
 Nodes: 4, Edges: 5

 Node types:
  corner: [4]
    Features: float32 vector, length 2
    Edge types: corner-line->corner

 Edge types:
    corner-line->corner: [5]
        Weights: all 1 (default)
        Features: float32 vector, length 3
Notice the output of `info` now says that the edges of the `line` type have 3 features, in addition to the 2 features for each node of type `corner`.
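The rule that "columns without a special meaning become features" can be mimicked in plain Pandas, as a way to see which columns end up in the feature vectors. This is a sketch of the rule as described in the text, not StellarGraph's internal implementation; the `special_columns` set assumes the default column names, and `feature_columns` and `edge_features` are illustrative names.

```python
import pandas as pd

square_edge_data = pd.DataFrame(
    {
        "source": ["a", "b", "c", "d", "a"],
        "target": ["b", "c", "d", "a", "c"],
        "A": [-1, 2, -3, 4, -5],
        "B": [0.4, 0.1, 0.9, 0, 0.9],
        "C": [12, 34, 56, 78, 90],
    }
)

# columns with a special meaning under the default parameter names (assumption)
special_columns = {"source", "target", "weight"}

# everything else is treated as one element of the feature vector
feature_columns = [c for c in square_edge_data.columns if c not in special_columns]

# one float32 row per edge, matching the "float32 vector, length 3" reported above
edge_features = square_edge_data[feature_columns].to_numpy(dtype="float32")
print(feature_columns, edge_features.shape)
```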
Some algorithms can understand edge weights, which can be used as a measure of the strength of the connection, or a measure of distance between nodes. A `StellarGraph` instance can have weighted edges, by including a `weight` column in the DataFrame of edges.
We'll continue with the synthetic square example, by adding that extra `weight` column into the DataFrame. This column might be part of the data naturally, or it might need to be computed. Either of these is fine with Pandas: in the first case, it can be loaded at the same time as the source and target information, and in the second, the full power of Pandas is available to compute it (such as by manipulating other information associated with the edge DataFrame, or even by comparing the nodes at each end).
square_weighted_edges = pd.DataFrame(
{
"source": ["a", "b", "c", "d", "a"],
"target": ["b", "c", "d", "a", "c"],
"weight": [1.0, 0.2, 3.4, 5.67, 1.0],
}
)
square_weighted_edges
| | source | target | weight |
---|---|---|---|
0 | a | b | 1.00 |
1 | b | c | 0.20 |
2 | c | d | 3.40 |
3 | d | a | 5.67 |
4 | a | c | 1.00 |
square_weighted = StellarGraph(edges=square_weighted_edges)
print(square_weighted.info())
StellarGraph: Undirected multigraph
 Nodes: 4, Edges: 5

 Node types:
  default: [4]
    Features: none
    Edge types: default-default->default

 Edge types:
    default-default->default: [5]
        Weights: range=[0.2, 5.67], mean=2.254, std=2.25534
        Features: none
Notice the output of `info` now shows additional information about edge weights.
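The weight summary can be reproduced directly from the `weight` column. The sketch below uses the Pandas defaults; the reported `std` matches the sample standard deviation (`ddof=1`), which is what `Series.std()` computes, although that reading is inferred from the numbers rather than stated in the text.

```python
import pandas as pd

weights = pd.Series([1.0, 0.2, 3.4, 5.67, 1.0], name="weight")

weight_summary = {
    "min": weights.min(),
    "max": weights.max(),
    "mean": weights.mean(),
    "std": weights.std(),  # sample standard deviation (ddof=1)
}
print(weight_summary)
```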
Edge weights can be used with node and edge features; for instance, we can create a graph similar to the last graph in the "Homogeneous graph with features" section, but with our edge weights:
square_weighted_edge_data = pd.DataFrame(
{
"source": ["a", "b", "c", "d", "a"],
"target": ["b", "c", "d", "a", "c"],
"weight": [1.0, 0.2, 3.4, 5.67, 1.0],
"A": [-1, 2, -3, 4, -5],
"B": [0.4, 0.1, 0.9, 0, 0.9],
"C": [12, 34, 56, 78, 90],
}
)
square_weighted_edge_data
| | source | target | weight | A | B | C |
---|---|---|---|---|---|---|
0 | a | b | 1.00 | -1 | 0.4 | 12 |
1 | b | c | 0.20 | 2 | 0.1 | 34 |
2 | c | d | 3.40 | -3 | 0.9 | 56 |
3 | d | a | 5.67 | 4 | 0.0 | 78 |
4 | a | c | 1.00 | -5 | 0.9 | 90 |
square_features_weighted = StellarGraph(
{"corner": square_node_data}, {"line": square_weighted_edge_data}
)
print(square_features_weighted.info())
StellarGraph: Undirected multigraph
 Nodes: 4, Edges: 5

 Node types:
  corner: [4]
    Features: float32 vector, length 2
    Edge types: corner-line->corner

 Edge types:
    corner-line->corner: [5]
        Weights: range=[0.2, 5.67], mean=2.254, std=2.25534
        Features: float32 vector, length 3
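When weights are not part of the data, they can be computed by comparing the nodes at each end of an edge. The sketch below is one hypothetical choice: it treats the `x` and `y` node features as coordinates and uses the Euclidean distance between the endpoints as the weight. The names `source_xy`, `target_xy` and `square_distance_edges` are illustrative.

```python
import numpy as np
import pandas as pd

square_node_data = pd.DataFrame(
    {"x": [1, 2, 3, 4], "y": [-0.2, 0.3, 0.0, -0.5]}, index=["a", "b", "c", "d"]
)
square_edges = pd.DataFrame(
    {"source": ["a", "b", "c", "d", "a"], "target": ["b", "c", "d", "a", "c"]}
)

# look up the (x, y) row for the node at each end of every edge
source_xy = square_node_data.loc[square_edges["source"]].to_numpy()
target_xy = square_node_data.loc[square_edges["target"]].to_numpy()

# weight = straight-line distance between the two endpoints (an assumption
# made for illustration; any per-edge computation would work the same way)
square_distance_edges = square_edges.assign(
    weight=np.linalg.norm(source_xy - target_xy, axis=1)
)
print(square_distance_edges)
```

The resulting DataFrame has the same `source`, `target` and `weight` columns as `square_weighted_edges`, so it could be used as the edges in the same way.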
Some graphs have edge directions, where going from source to target has a different meaning to going from target to source.
A directed graph can be created by using the `StellarDiGraph` class instead of the `StellarGraph` one. The construction is almost identical, and we can reuse any of the DataFrames that we created in the sections above. For instance, continuing from the previous cell, we can have a directed homogeneous graph with node features and edge weights.
from stellargraph import StellarDiGraph
square_features_weighted_directed = StellarDiGraph(
{"corner": square_node_data}, {"line": square_weighted_edge_data}
)
print(square_features_weighted_directed.info())
StellarDiGraph: Directed multigraph
 Nodes: 4, Edges: 5

 Node types:
  corner: [4]
    Features: float32 vector, length 2
    Edge types: corner-line->corner

 Edge types:
    corner-line->corner: [5]
        Weights: range=[0.2, 5.67], mean=2.254, std=2.25534
        Features: float32 vector, length 3
Everything discussed about `StellarGraph` in this file also works with `StellarDiGraph`, including parameters like `node_type_default` and `source_column`.
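To see why direction matters for the same edge DataFrame, the sketch below counts outgoing and incoming edges per node with Pandas. In a directed reading, `a` has two outgoing edges but only one incoming; in an undirected reading only the combined degree is meaningful. The names `out_degree`, `in_degree` and `degree` are illustrative.

```python
import pandas as pd

square_edges = pd.DataFrame(
    {"source": ["a", "b", "c", "d", "a"], "target": ["b", "c", "d", "a", "c"]}
)

# directed view: edges leaving and entering each node
out_degree = square_edges["source"].value_counts()
in_degree = square_edges["target"].value_counts()

# undirected view: every edge touching the node, regardless of direction
degree = out_degree.add(in_degree, fill_value=0).astype(int)

print(out_degree.sort_index())
print(in_degree.sort_index())
print(degree.sort_index())
```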
Some graphs have multiple types of nodes and multiple types of edges.
For example, an academic citation network that includes authors might have `wrote` edges connecting `author` nodes to `paper` nodes, in addition to the `cites` edges between `paper` nodes. There could be `supervised` edges between `author`s too, or any number of additional node and edge types. A knowledge graph (also known as RDF, a triple store or a knowledge base) is an extreme form of a heterogeneous graph, with dozens, hundreds or even thousands of edge (or relation) types. Typically in a knowledge graph, edges and their types represent the information associated with a node, rather than node features.
`StellarGraph` supports all forms of heterogeneous graphs.
A heterogeneous `StellarGraph` can be constructed in a similar way to a homogeneous graph, except we pass a dictionary with multiple elements instead of a single element like we did for the square examples in the "Homogeneous graph with features" section and others above. For a heterogeneous graph, a dictionary has to be passed; passing a single DataFrame does not work.
Let's return to the square graph from earlier:
a -- b
| \  |
|  \ |
d -- c
Suppose `a` is of type `foo` and has no features, but `b`, `c` and `d` are of type `bar` and have two features each, e.g. for `b`, `y = 0.4, z = 100`. Since the features are different shapes (`a` has zero), they need to be modeled as different types, with separate `DataFrame`s.
square_foo = pd.DataFrame(index=["a"])
square_foo
a |
---|
square_bar = pd.DataFrame(
{"y": [0.4, 0.1, 0.9], "z": [100, 200, 300]}, index=["b", "c", "d"]
)
square_bar
| | y | z |
---|---|---|
b | 0.4 | 100 |
c | 0.1 | 200 |
d | 0.9 | 300 |
We have the information for the two node types `foo` and `bar` in separate DataFrames, so we can now put them in a dictionary to create a `StellarGraph`. Notice that `info()` is now reporting multiple node types, as well as information specific to each.
square_foo_and_bar = StellarGraph({"foo": square_foo, "bar": square_bar}, square_edges)
print(square_foo_and_bar.info())
StellarGraph: Undirected multigraph
 Nodes: 4, Edges: 5

 Node types:
  bar: [3]
    Features: float32 vector, length 2
    Edge types: bar-default->bar, bar-default->foo
  foo: [1]
    Features: none
    Edge types: foo-default->bar

 Edge types:
    foo-default->bar: [2]
        Weights: all 1 (default)
        Features: none
    bar-default->bar: [2]
        Weights: all 1 (default)
        Features: none
    bar-default->foo: [1]
        Weights: all 1 (default)
        Features: none
Node IDs (the DataFrame index) need to be unique across all types. For example, renaming the `a` corner to `b`, like `square_foo_overlap` in the next cell, is not accepted and a `StellarGraph(...)` call will throw an error.
square_foo_overlap = pd.DataFrame({"x": [-1]}, index=["b"])
square_foo_overlap
| | x |
---|---|
b | -1 |
# Uncomment to see the error
# StellarGraph({"foo": square_foo_overlap, "bar": square_bar}, square_edges)
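Overlapping IDs can be detected up front with Pandas, before attempting to construct the graph. The sketch below intersects the two node indexes; `shared_ids` is an illustrative name.

```python
import pandas as pd

square_foo_overlap = pd.DataFrame({"x": [-1]}, index=["b"])
square_bar = pd.DataFrame(
    {"y": [0.4, 0.1, 0.9], "z": [100, 200, 300]}, index=["b", "c", "d"]
)

# node IDs that appear in both node types
shared_ids = square_foo_overlap.index.intersection(square_bar.index)
print(list(shared_ids))
```

A non-empty result means the IDs need to be made unique before the DataFrames can be combined into one graph.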
If the node IDs aren't unique across types, one way to make them unique is to add a string prefix. You'll need to add the same prefix to the node IDs used in the edges too. Adding a prefix can be done by replacing the index:
square_foo_overlap_prefix = square_foo_overlap.set_index(
"foo-" + square_foo_overlap.index.astype(str)
)
square_foo_overlap_prefix
| | x |
---|---|
foo-b | -1 |
square_bar_prefix = square_bar.set_index("bar-" + square_bar.index.astype(str))
square_bar_prefix
| | y | z |
---|---|---|
bar-b | 0.4 | 100 |
bar-c | 0.1 | 200 |
bar-d | 0.9 | 300 |
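The edge side of the prefixing step might look like the sketch below. It assumes the IDs in `square_foo` and `square_bar` are still unique, so that the type of each endpoint can be looked up from the node DataFrames; `node_type`, `add_prefix` and `prefixed_edges` are illustrative names. When IDs genuinely overlap, the edge data itself has to record which type each endpoint belongs to.

```python
import pandas as pd

square_foo = pd.DataFrame(index=["a"])
square_bar = pd.DataFrame(
    {"y": [0.4, 0.1, 0.9], "z": [100, 200, 300]}, index=["b", "c", "d"]
)
square_edges = pd.DataFrame(
    {"source": ["a", "b", "c", "d", "a"], "target": ["b", "c", "d", "a", "c"]}
)

# look-up table from node ID to node type (illustrative helper)
node_type = {node: "foo" for node in square_foo.index}
node_type.update({node: "bar" for node in square_bar.index})


def add_prefix(ids):
    # "a" -> "foo-a", "b" -> "bar-b", ...
    return ids.map(node_type) + "-" + ids


prefixed_edges = square_edges.assign(
    source=add_prefix(square_edges["source"]),
    target=add_prefix(square_edges["target"]),
)
print(prefixed_edges)
```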
Graphs with multiple edge types can be simpler. Since there are often no features on the edges, we can pass a DataFrame with an additional column for the type, specifying it via the `edge_type_column` parameter. If there are features on the edges, multiple edge types can also be created in the same way as multiple node types, by passing a dictionary of DataFrames.
For example, suppose the edges in our square graph have types based on their orientation.
square_edges_types = square_edges.assign(
orientation=["horizontal", "vertical", "horizontal", "vertical", "diagonal"]
)
square_edges_types
| | source | target | orientation |
---|---|---|---|
0 | a | b | horizontal |
1 | b | c | vertical |
2 | c | d | horizontal |
3 | d | a | vertical |
4 | a | c | diagonal |
square_orientation = StellarGraph(
edges=square_edges_types, edge_type_column="orientation"
)
print(square_orientation.info())
StellarGraph: Undirected multigraph
 Nodes: 4, Edges: 5

 Node types:
  default: [4]
    Features: none
    Edge types: default-diagonal->default, default-horizontal->default, default-vertical->default

 Edge types:
    default-vertical->default: [2]
        Weights: all 1 (default)
        Features: none
    default-horizontal->default: [2]
        Weights: all 1 (default)
        Features: none
    default-diagonal->default: [1]
        Weights: all 1 (default)
        Features: none
Edge weights are supported, in the same way as for a homogeneous graph above, with a `weight` column:
square_edges_types_weighted = square_edges_types.assign(weight=[1.0, 0.2, 3.4, 5.67, 1.0])
square_edges_types_weighted
| | source | target | orientation | weight |
---|---|---|---|---|
0 | a | b | horizontal | 1.00 |
1 | b | c | vertical | 0.20 |
2 | c | d | horizontal | 3.40 |
3 | d | a | vertical | 5.67 |
4 | a | c | diagonal | 1.00 |
square_orientation_weighted = StellarGraph(
edges=square_edges_types_weighted, edge_type_column="orientation"
)
print(square_orientation_weighted.info())
StellarGraph: Undirected multigraph
 Nodes: 4, Edges: 5

 Node types:
  default: [4]
    Features: none
    Edge types: default-diagonal->default, default-horizontal->default, default-vertical->default

 Edge types:
    default-vertical->default: [2]
        Weights: range=[0.2, 5.67], mean=2.935, std=3.86787
        Features: none
    default-horizontal->default: [2]
        Weights: range=[1, 3.4], mean=2.2, std=1.69706
        Features: none
    default-diagonal->default: [1]
        Weights: all 1 (default)
        Features: none
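The per-type weight summaries can be reproduced with a Pandas `groupby` on the type column. This is only a cross-check sketch, with `per_type_weights` as an illustrative name; the standard deviation is the sample one (`ddof=1`), so it is `NaN` for the single `diagonal` edge.

```python
import pandas as pd

square_edges_types_weighted = pd.DataFrame(
    {
        "source": ["a", "b", "c", "d", "a"],
        "target": ["b", "c", "d", "a", "c"],
        "orientation": ["horizontal", "vertical", "horizontal", "vertical", "diagonal"],
        "weight": [1.0, 0.2, 3.4, 5.67, 1.0],
    }
)

# one row of summary statistics per edge type
per_type_weights = square_edges_types_weighted.groupby("orientation")["weight"].agg(
    ["min", "max", "mean", "std"]
)
print(per_type_weights)
```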
As mentioned above, if there are multiple edge types and the edges have edge features, one will typically need to pass a dictionary of DataFrames similar to multiple node types. The features of each type can be different.
Note: Edges also have IDs (the DataFrame index, like nodes), and they need to be unique across all edge types.
square_edges_horizontal = pd.DataFrame(
{"source": ["a", "c"], "target": ["b", "d"], "A": [-1, -3]}, index=[0, 2]
)
square_edges_vertical = pd.DataFrame(
{"source": ["b", "d"], "target": ["c", "a"], "B": [0.1, 0], "C": [34, 78]},
index=[1, 3],
)
square_edges_diagonal = pd.DataFrame({"source": ["a"], "target": ["c"]}, index=[4])
# example:
square_edges_horizontal
| | source | target | A |
---|---|---|---|
0 | a | b | -1 |
2 | c | d | -3 |
square_orientation_separate = StellarGraph(
edges={
"horizontal": square_edges_horizontal,
"vertical": square_edges_vertical,
"diagonal": square_edges_diagonal,
},
)
print(square_orientation_separate.info())
StellarGraph: Undirected multigraph
 Nodes: 4, Edges: 5

 Node types:
  default: [4]
    Features: none
    Edge types: default-diagonal->default, default-horizontal->default, default-vertical->default

 Edge types:
    default-vertical->default: [2]
        Weights: all 1 (default)
        Features: float32 vector, length 2
    default-horizontal->default: [2]
        Weights: all 1 (default)
        Features: float32 vector, length 1
    default-diagonal->default: [1]
        Weights: all 1 (default)
        Features: none
Notice that `vertical` edges have 2 features, `horizontal` have 1, and `diagonal` have 0.
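If the edges start out in a single DataFrame with a type column, they can be split into the dictionary-of-DataFrames form with a `groupby`. The sketch below is one way to do that, with `edges_by_type` as an illustrative name; each group keeps its original index, so the edge IDs stay unique across types.

```python
import pandas as pd

square_edges_types = pd.DataFrame(
    {
        "source": ["a", "b", "c", "d", "a"],
        "target": ["b", "c", "d", "a", "c"],
        "orientation": ["horizontal", "vertical", "horizontal", "vertical", "diagonal"],
    }
)

# one DataFrame per edge type, keyed by the type name; the type column itself
# is dropped so it is not mistaken for a feature
edges_by_type = {
    name: group.drop(columns="orientation")
    for name, group in square_edges_types.groupby("orientation")
}

for name, frame in edges_by_type.items():
    print(name)
    print(frame)
```

Per-type feature columns could then be added to each DataFrame separately, as in the `square_edges_horizontal` and `square_edges_vertical` examples.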
Edge weights can be specified with this multiple-DataFrames form too. Any or all of the DataFrames for an edge type can contain a `weight` column.
square_edges_horizontal_weighted = square_edges_horizontal.assign(weight=[12.3, 45.6])
square_edges_horizontal_weighted
| | source | target | A | weight |
---|---|---|---|---|
0 | a | b | -1 | 12.3 |
2 | c | d | -3 | 45.6 |
square_orientation_separate_weighted = StellarGraph(
edges={
"horizontal": square_edges_horizontal_weighted,
"vertical": square_edges_vertical,
"diagonal": square_edges_diagonal,
},
)
print(square_orientation_separate_weighted.info())
StellarGraph: Undirected multigraph
 Nodes: 4, Edges: 5

 Node types:
  default: [4]
    Features: none
    Edge types: default-diagonal->default, default-horizontal->default, default-vertical->default

 Edge types:
    default-vertical->default: [2]
        Weights: all 1 (default)
        Features: float32 vector, length 2
    default-horizontal->default: [2]
        Weights: range=[12.3, 45.6], mean=28.95, std=23.5467
        Features: float32 vector, length 1
    default-diagonal->default: [1]
        Weights: all 1 (default)
        Features: none
A graph can have multiple node types and multiple edge types, with features or without, with edge weights or without, and with `edge_type_column=...` (shown here) or with multiple DataFrames for edge types. We can put everything together from the previous sections to make a single complicated `StellarGraph`.
square_everything = StellarGraph(
{"foo": square_foo, "bar": square_bar},
square_edges_types_weighted,
edge_type_column="orientation",
)
print(square_everything.info())
StellarGraph: Undirected multigraph
 Nodes: 4, Edges: 5

 Node types:
  bar: [3]
    Features: float32 vector, length 2
    Edge types: bar-diagonal->foo, bar-horizontal->bar, bar-horizontal->foo, bar-vertical->bar, bar-vertical->foo
  foo: [1]
    Features: none
    Edge types: foo-diagonal->bar, foo-horizontal->bar, foo-vertical->bar

 Edge types:
    foo-horizontal->bar: [1]
        Weights: all 1 (default)
        Features: none
    foo-diagonal->bar: [1]
        Weights: all 1 (default)
        Features: none
    bar-vertical->foo: [1]
        Weights: all 5.67
        Features: none
    bar-vertical->bar: [1]
        Weights: all 0.2
        Features: none
    bar-horizontal->bar: [1]
        Weights: all 3.4
        Features: none
A heterogeneous graph can be directed by using `StellarDiGraph` to construct it, similar to a homogeneous graph.
from stellargraph import StellarDiGraph
square_everything_directed = StellarDiGraph(
{"foo": square_foo, "bar": square_bar},
square_edges_types_weighted,
edge_type_column="orientation",
)
print(square_everything_directed.info())
StellarDiGraph: Directed multigraph
 Nodes: 4, Edges: 5

 Node types:
  bar: [3]
    Features: float32 vector, length 2
    Edge types: bar-horizontal->bar, bar-vertical->bar, bar-vertical->foo
  foo: [1]
    Features: none
    Edge types: foo-diagonal->bar, foo-horizontal->bar

 Edge types:
    foo-horizontal->bar: [1]
        Weights: all 1 (default)
        Features: none
    foo-diagonal->bar: [1]
        Weights: all 1 (default)
        Features: none
    bar-vertical->foo: [1]
        Weights: all 5.67
        Features: none
    bar-vertical->bar: [1]
        Weights: all 0.2
        Features: none
    bar-horizontal->bar: [1]
        Weights: all 3.4
        Features: none
We've been using a synthetic square graph with perfectly formatted data as an example for this whole notebook, because it helps us focus on just the core `StellarGraph` functionality. Real life isn't so simple; there are usually files to wrangle and formats to convert, so we'll finish this demo by covering some example steps to go from data in files to a `StellarGraph`.
We'll work with the Cora dataset from https://linqs.soe.ucsc.edu/data:
The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words. The README file in the dataset provides more details.
The dataset contains two files: `cora.cites` and `cora.content`.
`cora.cites` is a tab-separated values (TSV) file of the graph edges. The first column identifies the cited paper, and the second column identifies the paper that cites it. The first three lines of the file look like:
35 1033
35 103482
35 103515
...
`cora.content` is also a TSV file of information about each node (paper), with 1435 columns: the first column is the node ID (matching the IDs used in `cora.cites`), the next 1433 are the 0/1-values of word vectors, and the last is the subject area class of the paper. The first three lines of the file look like this (with 1423 of the 0/1 columns truncated):
31336 0 0 ... 0 1 0 0 0 0 0 0 Neural_Networks
1061127 0 0 ... 1 0 0 0 0 0 0 0 Rule_Learning
1106406 0 0 ... 0 0 0 0 0 0 0 0 Reinforcement_Learning
...
This graph is homogeneous (all nodes are papers, and all edges are citations), with node features (the 0/1-values) but no edge weights.
The StellarGraph library provides the `datasets` module for working with some common datasets via classes like `Cora`. It can download the necessary files via the `download` method. (The `load` method also converts it into a `StellarGraph`, but that's too helpful for this tutorial: we're learning how to do that ourselves.)
from stellargraph.datasets import Cora
import os
cora = Cora()
cora.download()
# the base_directory property tells us where it was downloaded to:
cora_cites_file = os.path.join(cora.base_directory, "cora.cites")
cora_content_file = os.path.join(cora.base_directory, "cora.content")
We've now got the files on disk, so we can read them using the `pd.read_csv` function. Despite the "CSV" in the name, this function can be used to read TSV files too. The files don't have a row of column headings, so we'll want to set our own.
First, the edges. We can use `source` and `target` as the column headings, to match `StellarGraph`'s defaults. However, the natural phrasing is "paper X cites paper Y", not "paper Y is cited by paper X", so we use the columns in reverse order to match.
cora_cites = pd.read_csv(
cora_cites_file,
sep="\t", # tab-separated
header=None, # no heading row
names=["target", "source"], # set our own names for the columns
)
cora_cites
| | target | source |
---|---|---|
0 | 35 | 1033 |
1 | 35 | 103482 |
2 | 35 | 103515 |
3 | 35 | 1050679 |
4 | 35 | 1103960 |
... | ... | ... |
5424 | 853116 | 19621 |
5425 | 853116 | 853155 |
5426 | 853118 | 1140289 |
5427 | 853155 | 853118 |
5428 | 954315 | 1155073 |
5429 rows × 2 columns
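The effect of the reversed column names can be seen without downloading anything, by parsing the three sample lines shown earlier from an in-memory string. This is a sketch only; `sample_cites_text` and `sample_cites` are illustrative names.

```python
import io

import pandas as pd

# the first three lines of cora.cites, as shown above (tab-separated)
sample_cites_text = "35\t1033\n35\t103482\n35\t103515\n"

sample_cites = pd.read_csv(
    io.StringIO(sample_cites_text),
    sep="\t",  # tab-separated
    header=None,  # no heading row
    names=["target", "source"],  # first column is the cited paper
)
print(sample_cites)
```

Each row then reads as "paper `source` cites paper `target`", for example paper 1033 cites paper 35.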
Now, the nodes. Again, we have to choose the columns' names. The names of the 0/1-columns don't matter so much, but we can give the first column (of IDs) and the last one (of subjects) useful names.
cora_feature_names = [f"w{i}" for i in range(1433)]
cora_raw_content = pd.read_csv(
cora_content_file,
sep="\t", # tab-separated
header=None, # no heading row
names=["id", *cora_feature_names, "subject"], # set our own names for the columns
)
cora_raw_content
| | id | w0 | w1 | w2 | w3 | w4 | w5 | w6 | w7 | w8 | ... | w1424 | w1425 | w1426 | w1427 | w1428 | w1429 | w1430 | w1431 | w1432 | subject |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 31336 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | Neural_Networks |
1 | 1061127 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Rule_Learning |
2 | 1106406 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Reinforcement_Learning |
3 | 13195 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Reinforcement_Learning |
4 | 37879 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Probabilistic_Methods |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2703 | 1128975 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Genetic_Algorithms |
2704 | 1128977 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Genetic_Algorithms |
2705 | 1128978 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Genetic_Algorithms |
2706 | 117328 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Case_Based |
2707 | 24043 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Neural_Networks |
2708 rows × 1435 columns
As we saw above when adding node features, `StellarGraph` uses the index of the DataFrame as the connection between a node and a row of the DataFrame. Currently our DataFrame just has a simple numeric range as the index, but it needs to be using the `id` column. Pandas offers a few ways to control the indexing; in this case, we want to replace the current index by moving the `id` column to it, which is done most easily with `set_index`:
cora_content_str_subject = cora_raw_content.set_index("id")
cora_content_str_subject
id | w0 | w1 | w2 | w3 | w4 | w5 | w6 | w7 | w8 | w9 | ... | w1424 | w1425 | w1426 | w1427 | w1428 | w1429 | w1430 | w1431 | w1432 | subject |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
31336 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | Neural_Networks |
1061127 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Rule_Learning |
1106406 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Reinforcement_Learning |
13195 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Reinforcement_Learning |
37879 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Probabilistic_Methods |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1128975 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Genetic_Algorithms |
1128977 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Genetic_Algorithms |
1128978 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Genetic_Algorithms |
117328 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Case_Based |
24043 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Neural_Networks |
2708 rows × 1434 columns
We're almost ready to create the `StellarGraph`; we just have to do something about the non-numeric `subject` column. Many machine learning models only work on numeric features, requiring text and other data to be converted before use; the models in StellarGraph are no different.
There are two options, depending on the task:
1. Remove the `subject` column entirely: many uses of Cora are predicting the `subject` of a node, given all of the graph structure and other information, so including it as information in the graph is giving the answer directly.
2. ... features)
We'll look at both (feel free to skip ahead to 2).
Let's start with the first, removing the column. The `drop` method lets us remove one or more columns.
cora_content_no_subject = cora_content_str_subject.drop(columns="subject")
cora_content_no_subject
id | w0 | w1 | w2 | w3 | w4 | w5 | w6 | w7 | w8 | w9 | ... | w1423 | w1424 | w1425 | w1426 | w1427 | w1428 | w1429 | w1430 | w1431 | w1432 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
31336 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
1061127 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1106406 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
13195 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
37879 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1128975 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1128977 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1128978 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
117328 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
24043 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2708 rows × 1433 columns
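One detail worth noting is that drop returns a new DataFrame and leaves the original untouched, which is why the subject column can still be read from cora_content_str_subject later on. A minimal sketch with made-up data (the values are illustrative only):

```python
import pandas as pd

# Hypothetical two-node table with a text label column.
with_label = pd.DataFrame(
    {"w0": [0, 1], "w1": [1, 0], "subject": ["Theory", "Case_Based"]},
    index=pd.Index([31336, 1061127], name="id"),
)

# drop returns a copy without the named column; with_label itself is not modified
without_label = with_label.drop(columns="subject")

print(list(without_label.columns))
print("subject" in with_label.columns)
```

The index (the node IDs) is carried over unchanged, so the two DataFrames stay aligned row for row.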
We've got all the right node data, and the right edges, so now we can create a StellarGraph
using the techniques we saw in the "homogeneous graph with features" section above.
cora_no_subject = StellarGraph({"paper": cora_content_no_subject}, {"cites": cora_cites})
print(cora_no_subject.info())
StellarGraph: Undirected multigraph
 Nodes: 2708, Edges: 5429

 Node types:
  paper: [2708]
    Features: float32 vector, length 1433
    Edge types: paper-cites->paper

 Edge types:
    paper-cites->paper: [5429]
        Weights: all 1 (default)
        Features: none
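When assembling node and edge DataFrames by hand, it can help to confirm that every edge endpoint refers to a known node before constructing the graph. The following is a plain Pandas sketch on made-up data; the column names source and target are assumed here, and the IDs are hypothetical.

```python
import pandas as pd

# Hypothetical node table (indexed by node ID) and edge table.
nodes = pd.DataFrame({"w0": [0, 1, 0]}, index=pd.Index([10, 20, 30], name="id"))
edges = pd.DataFrame({"source": [10, 20, 30], "target": [20, 30, 40]})

# stack both endpoint columns into a single Series of node IDs
endpoints = pd.concat([edges["source"], edges["target"]])

# keep the IDs that do not appear in the node index
unknown_ids = endpoints[~endpoints.isin(nodes.index)].unique().tolist()
print(unknown_ids)
```

An empty list means the two DataFrames are consistent; here the ID 40 appears in an edge but has no row in the node table.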
If we're trying to predict the subject, we'll probably need to use the subject
labels as ground-truth labels in a supervised or semi-supervised machine learning task. This can be extracted from the DataFrame and held separately, to be passed in as training, validation or test examples.
cora_subject = cora_content_str_subject["subject"]
cora_subject
id
31336             Neural_Networks
1061127             Rule_Learning
1106406    Reinforcement_Learning
13195      Reinforcement_Learning
37879       Probabilistic_Methods
                    ...
1128975        Genetic_Algorithms
1128977        Genetic_Algorithms
1128978        Genetic_Algorithms
117328                 Case_Based
24043             Neural_Networks
Name: subject, Length: 2708, dtype: object
This is a normal Pandas Series, and so can be manipulated with any of the functions that support it. For example, if we wanted to train a machine learning algorithm using 25% of the nodes, we could use the train_test_split
function (docs) from the scikit-learn library.
from sklearn import model_selection
cora_train, cora_test = model_selection.train_test_split(
    cora_subject, train_size=0.25, random_state=123
)
cora_train
id
191222            Neural_Networks
1109208        Genetic_Algorithms
308003              Rule_Learning
13205      Reinforcement_Learning
3217                       Theory
                    ...
642827      Probabilistic_Methods
1126315           Neural_Networks
1105718           Neural_Networks
3084                   Case_Based
80491             Neural_Networks
Name: subject, Length: 677, dtype: object
cora_test
id
1103969     Probabilistic_Methods
1119295             Rule_Learning
1130567    Reinforcement_Learning
59045                      Theory
1129494           Neural_Networks
                    ...
126867                 Case_Based
1105764    Reinforcement_Learning
782486            Neural_Networks
74821       Probabilistic_Methods
41732      Reinforcement_Learning
Name: subject, Length: 2031, dtype: object
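The lengths above follow from the split fraction: 25% of 2708 nodes is 677 for training, leaving 2031 for testing. If the classes are imbalanced, one possible variation (not used in the cells above) is to pass the stratify argument of train_test_split so that each subject appears in the training set in roughly the same proportion as in the full data. A small sketch with made-up labels:

```python
import pandas as pd
from sklearn import model_selection

# Hypothetical labels: 5 classes with 8 nodes each, indexed by made-up node IDs.
labels = pd.Series(
    [f"class_{i % 5}" for i in range(40)],
    index=pd.Index(range(1000, 1040), name="id"),
    name="subject",
)

# stratify=labels asks for class proportions to be preserved in both parts
train, test = model_selection.train_test_split(
    labels, train_size=0.25, stratify=labels, random_state=123
)

print(len(train), len(test))
print(train.value_counts().to_dict())
```

As in the unstratified version, the results are ordinary Pandas Series indexed by node ID, so they can be used to look up the corresponding nodes in the graph.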
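This dataset, with this preparation, is used in a demo of the GCN algorithm for node classification. The task is to predict the subject of each node.
Now for the second option: converting the subject column to numeric features with one-hot encoding. The Pandas get_dummies function replaces the subject column with one 0/1 column for each of its distinct values.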
cora_content_one_hot_subject = pd.get_dummies(
    cora_content_str_subject, columns=["subject"]
)
cora_content_one_hot_subject
w0 | w1 | w2 | w3 | w4 | w5 | w6 | w7 | w8 | w9 | ... | w1430 | w1431 | w1432 | subject_Case_Based | subject_Genetic_Algorithms | subject_Neural_Networks | subject_Probabilistic_Methods | subject_Reinforcement_Learning | subject_Rule_Learning | subject_Theory | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
31336 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
1061127 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
1106406 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
13195 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
37879 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1128975 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
1128977 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
1128978 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
117328 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
24043 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
2708 rows × 1440 columns
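The column count above can be checked by hand: the 1434 original columns lose the single subject column and gain one column for each of the 7 subject values, giving 1434 - 1 + 7 = 1440. The sketch below shows the same transformation on a made-up three-row table. The dtype argument is an addition not used in the cell above: recent Pandas versions produce boolean dummy columns by default, so requesting a numeric dtype explicitly is one way to keep the result numeric regardless of version.

```python
import pandas as pd

# Hypothetical miniature node table with one word column and a text label.
nodes = pd.DataFrame(
    {"w0": [0, 1, 0], "subject": ["Theory", "Case_Based", "Theory"]},
    index=pd.Index([31336, 1061127, 1106406], name="id"),
)

# one 0/1 column per distinct subject value, stored as float32
one_hot = pd.get_dummies(nodes, columns=["subject"], dtype="float32")

print(list(one_hot.columns))
print(one_hot["subject_Theory"].tolist())
```

The new columns are named by joining the original column name and each value with an underscore, matching the subject_Case_Based to subject_Theory headers in the table above.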
Using this DataFrame, we can create a StellarGraph
with 1440 features per node, rather than the 1433 of the previous section.
cora_one_hot_subject = StellarGraph(
    {"paper": cora_content_one_hot_subject}, {"cites": cora_cites}
)
print(cora_one_hot_subject.info())
StellarGraph: Undirected multigraph Nodes: 2708, Edges: 5429 Node types: paper: [2708] Features: float32 vector, length 1440 Edge types: paper-cites->paper Edge types: paper-cites->paper: [5429] Weights: all 1 (default) Features: none
You hopefully now know more about building a StellarGraph
in various configurations via Pandas DataFrames, including some feature preprocessing in the "Real data: Homogeneous graph from CSV files" section.
Revisit this document as a reminder, or consult the documentation for the StellarGraph
class.
Once you've loaded your data, you can start doing machine learning: a good place to start is the demo of the GCN algorithm on the Cora dataset for node classification. Additionally, StellarGraph includes many other demos of other algorithms, solving other tasks.