In this notebook you will learn how to build a link prediction classifier using Neo4j and scikit-learn.
Import the libraries that you will need (remember to unset Reset all runtimes before running):
from py2neo import Graph
import pandas as pd
from collections import Counter

import matplotlib
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

plt.style.use('fivethirtyeight')
pd.set_option('display.float_format', lambda x: '%.3f' % x)
Next, create a connection to your Neo4j database, just as you did previously when you set up your environment.
Update the cell below to use the Bolt URL and password of your Neo4j instance, as you did previously.
# Change the line of code below to use the Bolt URL and Password of your Neo4j Database instance.
# graph = Graph("bolt://<IP Address>:<Bolt Port>", auth=("neo4j", "<Password>"))
#graph = Graph("bolt://52.3.242.176:33698", auth=("neo4j", "equivalent-listing-parts"))
graph = Graph("bolt://localhost:7687", auth=("neo4j", "neo"))
You will build an inferred graph of co-authors based on people collaborating on the same papers. You will store a property on the relationship indicating the year of their first collaboration.
Run this code to do this:
query = """
CALL apoc.periodic.iterate(
"MATCH (a1)<-[:AUTHOR]-(paper)-[:AUTHOR]->(a2:Author)
WITH a1, a2, paper
ORDER BY a1, paper.year
RETURN a1, a2, collect(paper)[0].year AS year, count(*) AS collaborations",
"MERGE (a1)-[coauthor:CO_AUTHOR {year: year}]-(a2)
SET coauthor.collaborations = collaborations",
{batchSize: 100})
"""
graph.run(query).stats()
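To confirm that the inferred relationships were created, you can count them (an optional check):
# Optional check: count the inferred CO_AUTHOR relationships.
graph.run("""
MATCH ()-[:CO_AUTHOR]->()
RETURN count(*) AS count
""").to_data_frame()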
Now that you have created a co-author graph, you need an approach that will allow you to predict future links (relationships) that will be created between people.
You will use the link prediction algorithms that you learned about in the previous section. Once you have computed scores with these algorithms, what should you do?
There are two main approaches that one can take:
You can use the scores from the link prediction algorithms directly, specifying a threshold value above which you predict that a link will be created between two nodes.
You can take a supervised learning approach where you use the scores as features to train a binary classifier. The binary classifier then predicts whether a pair of nodes will have a link.
In this notebook you will apply the supervised learning approach.
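For contrast, the first, threshold-based approach might look like the following sketch, where the scores and the cut-off value are purely hypothetical:
# Hypothetical sketch of the threshold approach (not used in this notebook):
# predict a future link whenever a pair's link prediction score exceeds a
# chosen cut-off. The scores and threshold below are made up.
scores = {("a", "b"): 0.72, ("a", "c"): 0.15, ("b", "c"): 0.41}
THRESHOLD = 0.4
predicted_links = [pair for pair, score in scores.items() if score > THRESHOLD]
print(predicted_links)  # [('a', 'b'), ('b', 'c')]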
Next, you must create the train and test datasets on which you can build, and then evaluate a model.
The tricky thing when working with graph data is that you cannot just randomly split the data, as this could lead to data leakage.
Data leakage can occur when data outside of your training data is inadvertently used to create your model. This can easily happen when working with graphs because pairs of nodes in the training set may be connected to those in the test set.
When you compute link prediction measures over that training set the measures computed contain information from the test set that you will later evaluate the model against.
Instead, you need to split the graph into training and test sub-graphs. If the graph has a concept of time, this is easier: you can split the graph at a point in time, using everything before that point as the training sub-graph and everything after it as the test sub-graph.
This is still not a perfect solution, and you must ensure that the general network structure in the training and test sub-graphs is similar.
The pairs of nodes in your train and test datasets that have relationships between them will be the positive examples in your machine learning model.
Because the citation graph contains years, you can create train and test graphs by splitting the data on a particular year. Next, you must determine what that year should be. Start by determining the distribution of the first year that co-authors collaborated:
query = """
MATCH p=()-[r:CO_AUTHOR]->()
WITH r.year AS year, count(*) AS count
ORDER BY year
RETURN toString(year) AS year, count
"""
by_year = graph.run(query).to_data_frame()
ax = by_year.plot(kind='bar', x='year', y='count', legend=None, figsize=(15,8))
ax.xaxis.set_label_text("")
plt.tight_layout()
plt.show()
It looks like 2006 would be a good year on which to split the data: you will use all co-authorships from 2005 and earlier as the train graph, and everything from 2006 onwards as the test graph.
Create explicit CO_AUTHOR_EARLY and CO_AUTHOR_LATE relationships in the graph based on that year. The following code will create these relationships:
query = """
MATCH (a)-[r:CO_AUTHOR]->(b)
WHERE r.year < 2006
MERGE (a)-[:CO_AUTHOR_EARLY {year: r.year}]-(b);
"""
graph.run(query).stats()
query = """
MATCH (a)-[r:CO_AUTHOR]->(b)
WHERE r.year >= 2006
MERGE (a)-[:CO_AUTHOR_LATE {year: r.year}]-(b);
"""
graph.run(query).stats()
Determine how many co-author relationships you have in each of these sub-graphs:
query = """
MATCH ()-[:CO_AUTHOR_EARLY]->()
RETURN count(*) AS count
"""
graph.run(query).to_data_frame()
query = """
MATCH ()-[:CO_AUTHOR_LATE]->()
RETURN count(*) AS count
"""
graph.run(query).to_data_frame()
This gives a 52-48 train/test split. The test set is a little larger than ideal, but it should be okay. Next, you will create negative examples.
The simplest approach is to use all pairs of nodes that don't have a relationship. The problem with this approach is that there are significantly more pairs of nodes that don't have a relationship than pairs that do.
The maximum number of negative examples is equal to:
# negative examples = (# nodes)² - (# relationships) - (# nodes)
i.e. the number of nodes squared, minus the relationships that the graph has, minus self relationships.
If you were to use all of these negative examples in your training set, you would have a massive class imbalance — there are many negative examples and relatively few positive ones.
A model trained on data this imbalanced can achieve very high accuracy simply by predicting that no pair of nodes has a relationship, which is not what you want!
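To see why, here is a quick back-of-the-envelope illustration using made-up numbers. Even a classifier that always predicts "no link" scores very high accuracy on data like this:
# Illustration with hypothetical numbers: a graph with 80,000 nodes and
# 150,000 relationships has an enormous pool of candidate negative pairs.
nodes, relationships = 80_000, 150_000
max_negatives = nodes ** 2 - relationships - nodes  # the formula above
print(f"{max_negatives:,} possible negative examples")  # 6,399,770,000

# A model that always predicts "no link" would be right almost every
# time on this data, despite being useless.
baseline_accuracy = max_negatives / (max_negatives + relationships)
print(f"Accuracy of always predicting 'no link': {baseline_accuracy:.5f}")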
You need to reduce the number of negative examples. An approach described in several link prediction papers is to use pairs of nodes that are a specific number of hops away from each other.
This will significantly reduce the number of negative examples, although there will still be a lot more negative examples than positive.
To solve this problem, you either need to down-sample the negative examples or up-sample the positive examples.
You will take the down-sampling approach. The following function will do this:
def down_sample(df):
    copy = df.copy()
    # Count the negative (0) and positive (1) examples.
    zero = Counter(copy.label.values)[0]
    un = Counter(copy.label.values)[1]
    # Drop randomly chosen negative rows until the two classes are balanced.
    n = zero - un
    copy = copy.drop(copy[copy.label == 0].sample(n=n, random_state=1).index)
    # Shuffle the remaining rows.
    return copy.sample(frac=1)
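Here is what down_sample does to a tiny, made-up DataFrame: it drops randomly chosen negative rows until both classes are the same size, then shuffles the result:
# Toy illustration on made-up data: five negatives and two positives in,
# two rows of each class out (in shuffled order).
toy = pd.DataFrame({"node1": range(7), "node2": range(7, 14),
                    "label": [0, 0, 0, 0, 0, 1, 1]})
print(down_sample(toy).label.value_counts())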
Now you are ready to build the train and test datasets based on the train and test sub-graphs that you created.
train_existing_links = graph.run("""
MATCH (author:Author)-[:CO_AUTHOR_EARLY]->(other:Author)
RETURN id(author) AS node1, id(other) AS node2, 1 AS label
""").to_data_frame()
train_missing_links = graph.run("""
MATCH (author:Author)
WHERE (author)-[:CO_AUTHOR_EARLY]-()
MATCH (author)-[:CO_AUTHOR_EARLY*2..3]-(other)
WHERE not((author)-[:CO_AUTHOR_EARLY]-(other))
RETURN id(author) AS node1, id(other) AS node2, 0 AS label
""").to_data_frame()
train_missing_links = train_missing_links.drop_duplicates()
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead.
training_df = pd.concat([train_missing_links, train_existing_links], ignore_index=True)
training_df['label'] = training_df['label'].astype('category')
training_df = down_sample(training_df)
The train DataFrame contains:
training_df.head()
Repeat the process for the test set:
test_existing_links = graph.run("""
MATCH (author:Author)-[:CO_AUTHOR_LATE]->(other:Author)
RETURN id(author) AS node1, id(other) AS node2, 1 AS label
""").to_data_frame()
test_missing_links = graph.run("""
MATCH (author:Author)
WHERE (author)-[:CO_AUTHOR_LATE]-()
MATCH (author)-[:CO_AUTHOR_LATE*2..3]-(other)
WHERE not((author)-[:CO_AUTHOR_LATE]-(other))
RETURN id(author) AS node1, id(other) AS node2, 0 AS label
""").to_data_frame()
test_missing_links = test_missing_links.drop_duplicates()
test_df = pd.concat([test_missing_links, test_existing_links], ignore_index=True)
test_df['label'] = test_df['label'].astype('category')
test_df = down_sample(test_df)
Now sample the test DataFrame:
test_df.head()
Next, you will create a machine learning pipeline based on a random forest classifier. This method is well suited here because the dataset will be composed of a mix of strong and weak features. While the weak features will sometimes be helpful, the random forest method helps ensure that you don't create a model that only fits the training data.
classifier = RandomForestClassifier(n_estimators=30, max_depth=10, random_state=0)
Start by creating a simple model that tries to predict whether two authors will have a future collaboration, based on three features: common neighbors, preferential attachment, and total neighbors (the size of the union of the two nodes' neighbor sets).
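Before computing these measures with the Graph Data Science library, it may help to see what they mean on plain neighbor sets. This is an illustrative pure-Python sketch with made-up neighbors, not how the notebook computes them:
# Illustrative definitions of the three measures on plain Python sets.
n1 = {"alice", "bob", "carol"}        # neighbors of the first node
n2 = {"bob", "carol", "dave", "eve"}  # neighbors of the second node

cn = len(n1 & n2)       # common neighbors: size of the intersection -> 2
pa = len(n1) * len(n2)  # preferential attachment: product of degrees -> 12
tn = len(n1 | n2)       # total neighbors: size of the union -> 5
print(cn, pa, tn)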
The following function computes each of these measures for pairs of nodes:
def apply_graphy_features(data, rel_type):
    query = """
    UNWIND $pairs AS pair
    MATCH (p1) WHERE id(p1) = pair.node1
    MATCH (p2) WHERE id(p2) = pair.node2
    RETURN pair.node1 AS node1,
           pair.node2 AS node2,
           gds.alpha.linkprediction.commonNeighbors(
               p1, p2, {relationshipQuery: $relType}) AS cn,
           gds.alpha.linkprediction.preferentialAttachment(
               p1, p2, {relationshipQuery: $relType}) AS pa,
           gds.alpha.linkprediction.totalNeighbors(
               p1, p2, {relationshipQuery: $relType}) AS tn
    """
    # Compute the three measures for every pair of nodes in the DataFrame...
    pairs = [{"node1": node1, "node2": node2}
             for node1, node2 in data[["node1", "node2"]].values.tolist()]
    features = graph.run(query, {"pairs": pairs, "relType": rel_type}).to_data_frame()
    # ...and join them back onto the original DataFrame.
    return pd.merge(data, features, on=["node1", "node2"])
Now apply the function to the training DataFrame:
training_df = apply_graphy_features(training_df, "CO_AUTHOR_EARLY")
This is what the DataFrame looks like now:
training_df.head()
Do the same to the test DataFrame:
test_df = apply_graphy_features(test_df, "CO_AUTHOR")
test_df.head()
Next, you will build a model based on these graphy features. You will start by using just one of the features: common neighbors.
The following code builds a random forest model, evaluates it against the test dataset, and then indicates which of the features had the most importance in the model.
columns = ["cn"]
X = training_df[columns]
y = training_df["label"]
classifier.fit(X, y)
Next, you need to evaluate the model. You will compute its accuracy, precision, and recall. Then, you will return the importance of each feature used in the model. The following functions will help with this:
def evaluate_model(predictions, actual):
    # Summarize accuracy, precision, and recall in a single DataFrame.
    return pd.DataFrame({
        "Measure": ["Accuracy", "Precision", "Recall"],
        "Score": [accuracy_score(actual, predictions),
                  precision_score(actual, predictions),
                  recall_score(actual, predictions)]
    })
def feature_importance(columns, classifier):
    display("Feature Importance")
    df = pd.DataFrame({
        "Feature": columns,
        "Importance": classifier.feature_importances_
    })
    # Plot the features, most important first.
    df = df.sort_values("Importance", ascending=False)
    ax = df.plot(kind='bar', x='Feature', y='Importance', legend=None)
    ax.xaxis.set_label_text("")
    plt.tight_layout()
    plt.show()
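To make the measures concrete, here is what evaluate_model returns for a small set of hypothetical predictions (one positive missed, none falsely predicted):
# Toy illustration with made-up labels: 3 of 4 predictions correct, with
# one false negative. Accuracy 0.75, precision 1.0, recall 0.667.
display(evaluate_model(predictions=[1, 0, 1, 0], actual=[1, 0, 1, 1]))
Now evaluate the actual model: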
predictions = classifier.predict(test_df[columns])
y_test = test_df["label"]
display(evaluate_model(predictions, y_test))
feature_importance(columns, classifier)
The scores for accuracy and precision are adequate, but the recall is not very good. What happens if you include preferential attachment and total neighbors as well?
columns = ["cn", "pa", "tn"]
X = training_df[columns]
y = training_df["label"]
classifier.fit(X, y)
predictions = classifier.predict(test_df[columns])
y_test = test_df["label"]
display(evaluate_model(predictions, y_test))
feature_importance(columns, classifier)
Common Neighbors is the dominant feature, but including the two other features has improved the accuracy and recall of the model.
Next, you will add some new features that are generated from graph algorithms.
You will reuse the same projected graph multiple times during the workflow, so you will store it as a named graph in the graph catalog. You can project multiple relationship types and then filter them at algorithm execution time. The following code projects the named graph used later in the analysis; the relationships will be treated as undirected.
graph.run("""
CALL gds.graph.create('coauthor','Author',
{CO_AUTHOR_EARLY: {orientation:'UNDIRECTED'},
CO_AUTHOR: {orientation:'UNDIRECTED'}})
""").to_data_frame()
Start by running the triangle count and local clustering coefficient algorithms over the train and test sub-graphs. These algorithms return the number of triangles that each node forms, as well as each node's clustering coefficient. The clustering coefficient of a node indicates the likelihood that its neighbors are also connected to one another.
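As a quick intuition check, the following pure-Python sketch computes the local clustering coefficient by hand for one node of a tiny, made-up graph; GDS computes the equivalent at scale:
from itertools import combinations

# Made-up graph with edges a-b, a-c, a-d, and b-c. Node "a" has 3
# neighbors, so 3 possible links among them; only b-c exists, giving 1/3.
neighbors = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}

nbrs = neighbors["a"]
actual_links = sum(1 for u, v in combinations(nbrs, 2) if v in neighbors[u])
possible_links = len(nbrs) * (len(nbrs) - 1) / 2
print(actual_links / possible_links)  # 0.3333...
Now run the algorithms over the named graph: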
graph.run("""
CALL gds.triangleCount.write('coauthor',
{relationshipTypes:['CO_AUTHOR_EARLY'],
writeProperty:'trianglesTrain'});
""").to_data_frame()
graph.run("""
CALL gds.triangleCount.write('coauthor',
{relationshipTypes:['CO_AUTHOR'],
writeProperty:'trianglesTest'});
""").to_data_frame()
graph.run("""
CALL gds.localClusteringCoefficient.write('coauthor',
{relationshipTypes:['CO_AUTHOR_EARLY'],
writeProperty:'coefficientTrain'});
""").to_data_frame()
graph.run("""
CALL gds.localClusteringCoefficient.write('coauthor',
{relationshipTypes:['CO_AUTHOR'],
writeProperty:'coefficientTest'});
""").to_data_frame()
The following function will add these features to the train and test DataFrames:
def apply_triangles_features(data, triangles_prop, coefficient_prop):
    query = """
    UNWIND $pairs AS pair
    MATCH (p1) WHERE id(p1) = pair.node1
    MATCH (p2) WHERE id(p2) = pair.node2
    RETURN pair.node1 AS node1,
           pair.node2 AS node2,
           apoc.coll.min([p1[$trianglesProp], p2[$trianglesProp]]) AS minTriangles,
           apoc.coll.max([p1[$trianglesProp], p2[$trianglesProp]]) AS maxTriangles,
           apoc.coll.min([p1[$coefficientProp], p2[$coefficientProp]]) AS minCoefficient,
           apoc.coll.max([p1[$coefficientProp], p2[$coefficientProp]]) AS maxCoefficient
    """
    # Taking the min and max makes the features independent of the order
    # of the nodes within a pair.
    pairs = [{"node1": node1, "node2": node2}
             for node1, node2 in data[["node1", "node2"]].values.tolist()]
    params = {
        "pairs": pairs,
        "trianglesProp": triangles_prop,
        "coefficientProp": coefficient_prop
    }
    features = graph.run(query, params).to_data_frame()
    return pd.merge(data, features, on=["node1", "node2"])
Add the new features:
training_df = apply_triangles_features(training_df, "trianglesTrain", "coefficientTrain")
test_df = apply_triangles_features(test_df, "trianglesTest", "coefficientTest")
training_df.head()
test_df.head()
Now train and evaluate a model with these features:
columns = [
"cn", "pa", "tn", # graph features
"minTriangles", "maxTriangles", "minCoefficient", "maxCoefficient" # triangle features
]
X = training_df[columns]
y = training_df["label"]
classifier.fit(X, y)
predictions = classifier.predict(test_df[columns])
y_test = test_df["label"]
display(evaluate_model(predictions, y_test))
feature_importance(columns, classifier)
The coefficient features have not added much to the model, but the triangle counts are useful. Next, you will see if community detection algorithms can help improve the model.
Community Detection algorithms evaluate how a group is clustered or partitioned. Nodes are considered more similar to nodes that fall in their community than to nodes in other communities.
You will run two Community Detection algorithms over the train and test sub-graphs - Label Propagation and Louvain. First, Label Propagation:
graph.run("""
CALL gds.labelPropagation.write("coauthor",
{relationshipTypes:["CO_AUTHOR_EARLY"],
writeProperty: "partitionTrain"});
""").to_data_frame()
graph.run("""
CALL gds.labelPropagation.write("coauthor",
{relationshipTypes:["CO_AUTHOR"],
writeProperty: "partitionTest"});
""").to_data_frame()
And now Louvain. The Louvain algorithm returns intermediate communities, which are useful for finding the fine-grained communities that exist in a graph. You will add a property to each node containing the community revealed on the first iteration of the algorithm:
graph.run("""
CALL gds.louvain.stream("coauthor",
{relationshipTypes:["CO_AUTHOR_EARLY"],
includeIntermediateCommunities:true})
YIELD nodeId, communityId, intermediateCommunityIds
WITH gds.util.asNode(nodeId) AS node, intermediateCommunityIds[0] AS smallestCommunity
SET node.louvainTrain = smallestCommunity;
""").stats()
graph.run("""
CALL gds.louvain.stream("coauthor",
{relationshipTypes:["CO_AUTHOR"],
includeIntermediateCommunities:true})
YIELD nodeId, communityId, intermediateCommunityIds
WITH gds.util.asNode(nodeId) AS node, intermediateCommunityIds[0] AS smallestCommunity
SET node.louvainTest = smallestCommunity;
""").stats()
The following function will add these features to the train and test DataFrames:
def apply_community_features(data, partition_prop, louvain_prop):
    query = """
    UNWIND $pairs AS pair
    MATCH (p1) WHERE id(p1) = pair.node1
    MATCH (p2) WHERE id(p2) = pair.node2
    RETURN pair.node1 AS node1,
           pair.node2 AS node2,
           gds.alpha.linkprediction.sameCommunity(p1, p2, $partitionProp) AS sp,
           gds.alpha.linkprediction.sameCommunity(p1, p2, $louvainProp) AS sl
    """
    pairs = [{"node1": node1, "node2": node2}
             for node1, node2 in data[["node1", "node2"]].values.tolist()]
    params = {
        "pairs": pairs,
        "partitionProp": partition_prop,
        "louvainProp": louvain_prop
    }
    features = graph.run(query, params).to_data_frame()
    return pd.merge(data, features, on=["node1", "node2"])
training_df = apply_community_features(training_df, "partitionTrain", "louvainTrain")
test_df = apply_community_features(test_df, "partitionTest", "louvainTest")
training_df.head()
test_df.head()
columns = [
"cn", "pa", "tn", # graph features
"minTriangles", "maxTriangles", "minCoefficient", "maxCoefficient", # triangle features
"sp", "sl" # community features
]
X = training_df[columns]
y = training_df["label"]
classifier.fit(X, y)
predictions = classifier.predict(test_df[columns])
y_test = test_df["label"]
display(evaluate_model(predictions, y_test))
feature_importance(columns, classifier)
After you have finished the graph analysis, do not forget to release the named graph from memory:
graph.run("""
CALL gds.graph.drop('coauthor')
""").to_data_frame()