%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID";
os.environ["CUDA_VISIBLE_DEVICES"]="0"
import ktrain
from ktrain import graph as gr
Using TensorFlow backend.
using Keras version: 2.2.4
Consider a social network (e.g., Facebook, LinkedIn, Twitter) where each node is a person and links represent friendships. Each node (or person) in the graph can be described by various attributes such as location, alma mater, organizational memberships, gender, relationship status, children, etc. Suppose we had the U.S. political affiliation (e.g., Democrat, Republican, Libertarian, Green Party) of only a small subset of nodes, with the affiliations of the remaining nodes unknown. Here, node classification involves predicting the political affiliation of the unknown nodes based only on the small subset of nodes for which the political affiliation is known.
Whereas traditional tabular models (e.g., logistic regression, SVM) use only a node's attributes to predict its label, graph neural networks use both the node's attributes and the graph's structure. For instance, to predict a person's political affiliation, it helps to look not only at that person's attributes but also at the attributes of other people in their vicinity in the social network: birds of a feather typically flock together. By exploiting graph structure, graph neural networks require much less labeled ground truth than non-graph approaches. In the example below, we will use the labels of only a very small fraction of all nodes to build our model.
In this notebook, we will use ktrain to perform node classification on a Twitter graph to predict hateful users. Each Twitter user is described by various attributes related to both their profile and their tweeting behavior, such as the number of tweets and retweets, status length, etc.
The dataset can be downloaded from here.
For node classification, ktrain requires two files formatted in a specific way: a tab-delimited node attributes file (node ID first, attributes next, and the target label as the last column) and a tab-delimited edge list (one source/destination pair per line).
We must first transform the raw dataset into the file formats described above. We work with two files: users.edges, which describes the graph structure, and users_neighborhood_anon.csv, which contains each node's label and attributes. The file users.edges is the edge list and is, for the most part, already in the format expected by ktrain. We must clean and prepare users_neighborhood_anon.csv into the format expected by ktrain: we will drop unused columns, normalize numeric attributes, re-order and transform the target column hate into an interpretable string label, and save the data as a tab-delimited file.
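As a concrete illustration, here is a minimal sketch (with made-up node IDs and attribute values) of how two such files could be produced; it simply mirrors the format we create for the real data below.
# Toy sketch of the two tab-delimited files ktrain expects (hypothetical values)
import pandas as pd
# node file: index is the node ID, attributes in the middle, label in the last column
toy_nodes = pd.DataFrame({'feat1': [0.12, -0.30, 1.24],
                          'feat2': [1.05, 0.51, -0.77],
                          'label': ['normal', 'hateful', 'unknown']},
                         index=['0', '1', '2'])
toy_nodes.to_csv('/tmp/toy-nodes.tab', sep='\t', header=False)
# edge file: one (source, destination) pair per line
toy_edges = pd.DataFrame({'Source': ['0', '1'], 'Destination': ['1', '2']})
toy_edges.to_csv('/tmp/toy-edges.tab', sep='\t', header=False, index=False)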
# useful imports
import sklearn
import sklearn.preprocessing  # the preprocessing submodule (PowerTransformer) is used below
import numpy as np
import pandas as pd
# read in data
data_dir = 'data/hateful-twitter-users/'
users_feat = pd.read_csv(os.path.join(data_dir, 'users_neighborhood_anon.csv'))
# clean the data and drop unused columns
def data_cleaning(feat):
    feat = feat.drop(columns=["hate_neigh", "normal_neigh"])

    # Convert target values in hate column from strings to integers (0, 1, 2)
    feat['hate'] = np.where(feat['hate'] == 'hateful', 1, np.where(feat['hate'] == 'normal', 0, 2))

    # missing information
    number_of_missing = feat.isnull().sum()
    number_of_missing[number_of_missing != 0]

    # Replace NA with 0
    feat.fillna(0, inplace=True)

    # drop info about suspension and deletion, as it should not be used in the predictive model
    feat.drop(feat.columns[feat.columns.str.contains("is_")], axis=1, inplace=True)

    # drop glove features
    feat.drop(feat.columns[feat.columns.str.contains("_glove")], axis=1, inplace=True)

    # drop c_ features
    feat.drop(feat.columns[feat.columns.str.contains("c_")], axis=1, inplace=True)

    # drop sentiment features for now
    feat.drop(feat.columns[feat.columns.str.contains("sentiment")], axis=1, inplace=True)

    # drop hashtag feature
    feat.drop(['hashtags'], axis=1, inplace=True)

    # drop centrality-based measures
    feat.drop(columns=['betweenness', 'eigenvector', 'in_degree', 'out_degree'], inplace=True)

    feat.drop(columns=['created_at'], inplace=True)

    return feat
node_data = data_cleaning(users_feat)
# recode the target column into human-readable string labels
node_data = node_data.replace({'hate': {0:'normal', 1:'hateful', 2:'unknown'}})
# normalize the numeric columns (skip user_id and hate, which is the label column)
df_values = node_data.iloc[:, 2:].values
pt = sklearn.preprocessing.PowerTransformer(method='yeo-johnson', standardize=True)
df_values_log = pt.fit_transform(df_values)
node_data.iloc[:, 2:] = df_values_log
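As an optional sanity check (a sketch, not part of the original preprocessing), we can confirm that standardize=True left each transformed column with roughly zero mean and unit variance.
# each transformed numeric column should now have ~zero mean and ~unit variance
print(np.isclose(df_values_log.mean(axis=0), 0, atol=1e-6).all())
print(np.isclose(df_values_log.std(axis=0), 1, atol=1e-2).all())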
# drop user_id and use the equivalent index as node ID
node_data.index = node_data.index.map(str)
node_data.drop(columns=['user_id'], inplace=True)
# move target column to last position
cols = list(node_data)
cols.remove('hate')
cols.append('hate')
node_data = node_data.reindex(columns= cols)
node_data.head()
 | statuses_count | followers_count | followees_count | favorites_count | listed_count | negotiate_empath | vehicle_empath | science_empath | timidity_empath | gain_empath | ... | tweet number | retweet number | quote number | status length | number urls | baddies | mentions | time_diff | time_diff_median | hate |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.541150 | 0.046773 | 1.104767 | 1.869391 | 0.017835 | -1.752256 | 0.164900 | 0.181173 | 0.875069 | 1.130523 | ... | -0.049013 | 0.321929 | -0.369992 | -1.036127 | -0.796091 | 0.047430 | 0.356495 | -1.888186 | -1.299249 | normal |
1 | -0.700240 | 0.772450 | -0.526061 | -1.434183 | 0.613187 | -0.735320 | -0.864337 | 0.599279 | 1.610977 | -1.203049 | ... | 1.479066 | -1.999580 | -1.545285 | -0.188945 | -1.875745 | -0.626192 | -1.972207 | 0.160925 | -1.512603 | unknown |
2 | -1.077284 | -0.127775 | 0.767345 | -0.669050 | -0.523882 | -0.118440 | -1.573040 | 1.211083 | -0.154213 | 0.932754 | ... | -0.201320 | 0.452537 | -1.545285 | 0.637869 | 0.884530 | -0.096918 | 0.348954 | 0.698841 | 0.122176 | unknown |
3 | 1.908494 | -0.021575 | -0.548705 | 0.078540 | 0.017835 | -0.472125 | 1.281633 | -0.544862 | 1.259492 | -0.456470 | ... | -1.018822 | 1.085858 | -0.662393 | -0.701835 | 0.088472 | -0.626192 | -1.254997 | -1.576801 | -1.311031 | unknown |
4 | -0.778589 | 0.729918 | 2.296049 | -0.725089 | 0.700128 | -1.488804 | -1.573040 | -0.969812 | 0.199834 | -1.203049 | ... | -0.427866 | 0.638106 | -1.545285 | 1.370832 | 0.655433 | 0.955922 | -1.914894 | 0.803553 | 1.472247 | unknown |
5 rows × 205 columns
# save both data files as tab-delimited to be consistent
node_data.to_csv('/tmp/twitter-nodes.tab', sep='\t', header=False)
edge_data = pd.read_csv(os.path.join(data_dir, 'users.edges'), header=None, names=['Source', 'Destination'])
edge_data.to_csv('/tmp/twitter-edges.tab', sep='\t', header=False, index=False)
# check to make sure there are no missing values in the non-target columns
node_data[node_data.drop('hate', axis=1).isnull().any(axis=1)].shape[0]
0
Here, we will load the preprocessed dataset. Of the nodes annotated with labels (i.e., hateful vs. normal), we will use 15% as the training set and 85% for validation. For the nodes with no labels (i.e., those tagged as 'unknown' in this dataset), we will create df_holdout and G_complete. The dataframe df_holdout contains the features of the nodes with no labels, and G_complete is the entire graph including the nodes in df_holdout that were held out.
If holdout_for_inductive=True, then the features of the holdout nodes are not visible during training, and the training graph is a subgraph of G_complete. Otherwise, the features (but not the labels) of the held-out nodes can be exploited during training. The holdout_for_inductive=True parameter is useful for assessing how well your model can make predictions for new nodes added to the graph later using G_complete (inductive inference). In this case, we set holdout_for_inductive=False, as we would like to use the features of unlabeled nodes to help learn to make accurate predictions. G_complete, then, is identical to the training graph and is not used, since we are only doing transductive inference. See this example notebook to better understand the difference between transductive and inductive inference on graphs.
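For comparison only, here is a hedged sketch of what the inductive setup might look like (not run in this notebook); the holdout_pct value is purely illustrative.
# Inductive variant (sketch, not executed here): hold out a fraction of nodes whose
# features remain hidden during training, so the model can later be evaluated on
# truly unseen nodes via G_complete.
# (train_data, val_data, preproc,
#  df_holdout, G_complete) = gr.graph_nodes_from_csv('/tmp/twitter-nodes.tab',
#                                                    '/tmp/twitter-edges.tab',
#                                                    sample_size=20,
#                                                    holdout_pct=0.1,   # illustrative value
#                                                    holdout_for_inductive=True,
#                                                    missing_label_value='unknown',
#                                                    train_pct=0.15,
#                                                    sep='\t')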
(train_data, val_data, preproc,
 df_holdout, G_complete) = gr.graph_nodes_from_csv('/tmp/twitter-nodes.tab',
                                                   '/tmp/twitter-edges.tab',
                                                   sample_size=20,
                                                   holdout_pct=None,  # using missing_label_value for holdout
                                                   holdout_for_inductive=False,
                                                   missing_label_value='unknown',
                                                   train_pct=0.15,
                                                   sep='\t')
Largest subgraph statistics: 100386 nodes, 2194979 edges
using 95415 nodes with missing target as holdout set
Size of training graph: 100386 nodes
Training nodes: 745
Validation nodes: 4226
Nodes treated as unlabeled for testing/inference: 95415
Holdout node features are visible during training (transductive inference)
The training graph and the dataframe containing the features of all nodes in the training graph are both accessible via the Preprocessor instance. Let's look at the class distribution in the training graph. There is class imbalance here, which might be addressed by computing class weights and supplying them to the class_weight parameter of any *fit* method in ktrain and Keras (see the sketch below). We will train without doing so here, though.
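Below is a minimal sketch of how such class weights could be computed from the labeled nodes. The class_weight argument and the class-index ordering (assumed to match preproc.get_classes()) follow the Keras fit() convention mentioned above; the training call is left commented out since we do not use it here.
# Sketch only: compute balanced class weights from the labeled nodes
from sklearn.utils.class_weight import compute_class_weight
y_labeled = preproc.df.target[preproc.df.target != 'unknown']
classes = np.array(sorted(y_labeled.unique()))   # assumed to match preproc.get_classes() ordering
weights = compute_class_weight('balanced', classes=classes, y=y_labeled)
class_weights = dict(enumerate(weights))
# learner.autofit(0.005, 30, class_weight=class_weights)  # hypothetical usage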
print("Initial hateful/normal users distribution")
print(preproc.df.target.value_counts())
Initial hateful/normal users distribution
unknown    95415
normal      4427
hateful      544
Name: target, dtype: int64
gr.print_node_classifiers()
graphsage: GraphSAGE: https://arxiv.org/pdf/1706.02216.pdf
learner = ktrain.get_learner(model=gr.graph_node_classifier('graphsage', train_data),
                             train_data=train_data,
                             val_data=val_data,
                             batch_size=64)
Is Multi-Label? False
done
Given the small number of batches per epoch, a larger number of epochs is required to estimate the learning rate. We will cap it at 100 here.
learner.lr_find(max_epochs=100)
simulating training for different learning rates... this may take a few moments... Epoch 1/100 11/11 [==============================] - 1s 129ms/step - loss: 0.6290 - acc: 0.6650 Epoch 2/100 11/11 [==============================] - 1s 106ms/step - loss: 0.6213 - acc: 0.6693 Epoch 3/100 11/11 [==============================] - 1s 106ms/step - loss: 0.6265 - acc: 0.6871 Epoch 4/100 11/11 [==============================] - 1s 105ms/step - loss: 0.6252 - acc: 0.6963 Epoch 5/100 11/11 [==============================] - 1s 108ms/step - loss: 0.6309 - acc: 0.6832 Epoch 6/100 11/11 [==============================] - 1s 104ms/step - loss: 0.6445 - acc: 0.6350 Epoch 7/100 11/11 [==============================] - 1s 109ms/step - loss: 0.6300 - acc: 0.6543 Epoch 8/100 11/11 [==============================] - 1s 109ms/step - loss: 0.6207 - acc: 0.6785 Epoch 9/100 11/11 [==============================] - 1s 98ms/step - loss: 0.6455 - acc: 0.6495 Epoch 10/100 11/11 [==============================] - 1s 108ms/step - loss: 0.6296 - acc: 0.6577 Epoch 11/100 11/11 [==============================] - 1s 91ms/step - loss: 0.6227 - acc: 0.6906 Epoch 12/100 11/11 [==============================] - 1s 96ms/step - loss: 0.6279 - acc: 0.6828 Epoch 13/100 11/11 [==============================] - 1s 106ms/step - loss: 0.6233 - acc: 0.6757 Epoch 14/100 11/11 [==============================] - 1s 119ms/step - loss: 0.6258 - acc: 0.6884 Epoch 15/100 11/11 [==============================] - 1s 111ms/step - loss: 0.6312 - acc: 0.6757 Epoch 16/100 11/11 [==============================] - 1s 106ms/step - loss: 0.6386 - acc: 0.6600 Epoch 17/100 11/11 [==============================] - 1s 104ms/step - loss: 0.6318 - acc: 0.6729 Epoch 18/100 11/11 [==============================] - 1s 107ms/step - loss: 0.6249 - acc: 0.6985 Epoch 19/100 11/11 [==============================] - 1s 110ms/step - loss: 0.6316 - acc: 0.6656 Epoch 20/100 11/11 [==============================] - 1s 110ms/step - loss: 0.6154 - acc: 0.7031 Epoch 21/100 11/11 [==============================] - 1s 101ms/step - loss: 0.6404 - acc: 0.6308 Epoch 22/100 11/11 [==============================] - 1s 92ms/step - loss: 0.6310 - acc: 0.6750 Epoch 23/100 11/11 [==============================] - 1s 93ms/step - loss: 0.6150 - acc: 0.6814 Epoch 24/100 11/11 [==============================] - 1s 97ms/step - loss: 0.6292 - acc: 0.6629 Epoch 25/100 11/11 [==============================] - 1s 114ms/step - loss: 0.6173 - acc: 0.6776 Epoch 26/100 11/11 [==============================] - 1s 110ms/step - loss: 0.6191 - acc: 0.6942 Epoch 27/100 11/11 [==============================] - 1s 112ms/step - loss: 0.6225 - acc: 0.6771 Epoch 28/100 11/11 [==============================] - 1s 107ms/step - loss: 0.6172 - acc: 0.6895 Epoch 29/100 11/11 [==============================] - 1s 104ms/step - loss: 0.6170 - acc: 0.6776 Epoch 30/100 11/11 [==============================] - 1s 100ms/step - loss: 0.6127 - acc: 0.6952 Epoch 31/100 11/11 [==============================] - 1s 110ms/step - loss: 0.6057 - acc: 0.7107 Epoch 32/100 11/11 [==============================] - 1s 108ms/step - loss: 0.6125 - acc: 0.7045 Epoch 33/100 11/11 [==============================] - 1s 101ms/step - loss: 0.5887 - acc: 0.7348 Epoch 34/100 11/11 [==============================] - 1s 108ms/step - loss: 0.5973 - acc: 0.7348 Epoch 35/100 11/11 [==============================] - 1s 108ms/step - loss: 0.5597 - acc: 0.7776 Epoch 36/100 11/11 [==============================] - 1s 100ms/step - loss: 0.5613 - acc: 
0.7818 Epoch 37/100 11/11 [==============================] - 1s 109ms/step - loss: 0.5502 - acc: 0.7839 Epoch 38/100 11/11 [==============================] - 1s 104ms/step - loss: 0.5477 - acc: 0.7897 Epoch 39/100 11/11 [==============================] - 1s 110ms/step - loss: 0.5317 - acc: 0.8054 0s - loss: 0.5096 - acc: 0 Epoch 40/100 11/11 [==============================] - 1s 119ms/step - loss: 0.5094 - acc: 0.8511 Epoch 41/100 11/11 [==============================] - 1s 108ms/step - loss: 0.5038 - acc: 0.8439 Epoch 42/100 11/11 [==============================] - 1s 106ms/step - loss: 0.4800 - acc: 0.8622 Epoch 43/100 11/11 [==============================] - 1s 106ms/step - loss: 0.4655 - acc: 0.8710 Epoch 44/100 11/11 [==============================] - 1s 109ms/step - loss: 0.4374 - acc: 0.8845 Epoch 45/100 11/11 [==============================] - 1s 111ms/step - loss: 0.4223 - acc: 0.8866 Epoch 46/100 11/11 [==============================] - 1s 108ms/step - loss: 0.4025 - acc: 0.8824 Epoch 47/100 11/11 [==============================] - 1s 97ms/step - loss: 0.3779 - acc: 0.8945 Epoch 48/100 11/11 [==============================] - 1s 94ms/step - loss: 0.3659 - acc: 0.8888 Epoch 49/100 11/11 [==============================] - 1s 108ms/step - loss: 0.3548 - acc: 0.8859 Epoch 50/100 11/11 [==============================] - 1s 111ms/step - loss: 0.3284 - acc: 0.8845 Epoch 51/100 11/11 [==============================] - 1s 110ms/step - loss: 0.3067 - acc: 0.8945 Epoch 52/100 11/11 [==============================] - 1s 121ms/step - loss: 0.2858 - acc: 0.8977 Epoch 53/100 11/11 [==============================] - 1s 110ms/step - loss: 0.2695 - acc: 0.8905 Epoch 54/100 11/11 [==============================] - 1s 108ms/step - loss: 0.2523 - acc: 0.9077 Epoch 55/100 11/11 [==============================] - 1s 104ms/step - loss: 0.2539 - acc: 0.9237 Epoch 56/100 11/11 [==============================] - 1s 90ms/step - loss: 0.2307 - acc: 0.9185 Epoch 57/100 11/11 [==============================] - 1s 108ms/step - loss: 0.2081 - acc: 0.9318 Epoch 58/100 11/11 [==============================] - 1s 104ms/step - loss: 0.2110 - acc: 0.9237 Epoch 59/100 11/11 [==============================] - 1s 101ms/step - loss: 0.1850 - acc: 0.9437 Epoch 60/100 11/11 [==============================] - 1s 96ms/step - loss: 0.1805 - acc: 0.9373 Epoch 61/100 11/11 [==============================] - 1s 108ms/step - loss: 0.1932 - acc: 0.9252 Epoch 62/100 11/11 [==============================] - 2s 141ms/step - loss: 0.1815 - acc: 0.9347 Epoch 63/100 11/11 [==============================] - 1s 110ms/step - loss: 0.1673 - acc: 0.9316 Epoch 64/100 11/11 [==============================] - 1s 98ms/step - loss: 0.1852 - acc: 0.9407 Epoch 65/100 11/11 [==============================] - 1s 109ms/step - loss: 0.1670 - acc: 0.9332 Epoch 66/100 11/11 [==============================] - 1s 106ms/step - loss: 0.1820 - acc: 0.9373 Epoch 67/100 11/11 [==============================] - 1s 96ms/step - loss: 0.1355 - acc: 0.9521 Epoch 68/100 11/11 [==============================] - 1s 115ms/step - loss: 0.1501 - acc: 0.9517 Epoch 69/100 11/11 [==============================] - 1s 106ms/step - loss: 0.1522 - acc: 0.9521 Epoch 70/100 11/11 [==============================] - 1s 109ms/step - loss: 0.1639 - acc: 0.9460 Epoch 71/100 11/11 [==============================] - 1s 101ms/step - loss: 0.1746 - acc: 0.9287 Epoch 72/100 11/11 [==============================] - 1s 121ms/step - loss: 0.1861 - acc: 0.9294 Epoch 73/100 11/11 
[==============================] - 1s 115ms/step - loss: 0.1656 - acc: 0.9444 Epoch 74/100 11/11 [==============================] - 1s 110ms/step - loss: 0.1401 - acc: 0.9516 Epoch 75/100 11/11 [==============================] - 1s 103ms/step - loss: 0.1422 - acc: 0.9401 Epoch 76/100 11/11 [==============================] - 1s 100ms/step - loss: 0.1438 - acc: 0.9537 Epoch 77/100 11/11 [==============================] - 1s 110ms/step - loss: 0.1594 - acc: 0.9394 Epoch 78/100 11/11 [==============================] - 1s 105ms/step - loss: 0.1477 - acc: 0.9380 Epoch 79/100 11/11 [==============================] - 1s 116ms/step - loss: 0.1931 - acc: 0.9261 Epoch 80/100 11/11 [==============================] - 1s 111ms/step - loss: 0.1378 - acc: 0.9587 Epoch 81/100 11/11 [==============================] - 1s 105ms/step - loss: 0.1332 - acc: 0.9456 Epoch 82/100 11/11 [==============================] - 1s 109ms/step - loss: 0.1417 - acc: 0.9560 Epoch 83/100 11/11 [==============================] - 1s 98ms/step - loss: 0.1865 - acc: 0.9344 Epoch 84/100 11/11 [==============================] - 1s 99ms/step - loss: 0.1545 - acc: 0.9416 Epoch 85/100 11/11 [==============================] - 1s 113ms/step - loss: 0.1665 - acc: 0.9387 Epoch 86/100 11/11 [==============================] - 1s 102ms/step - loss: 0.1233 - acc: 0.9558 Epoch 87/100 11/11 [==============================] - 1s 109ms/step - loss: 0.2232 - acc: 0.9095 Epoch 88/100 11/11 [==============================] - 1s 99ms/step - loss: 0.1615 - acc: 0.9437 Epoch 89/100 11/11 [==============================] - 1s 103ms/step - loss: 0.1802 - acc: 0.9373 Epoch 90/100 11/11 [==============================] - 1s 109ms/step - loss: 0.1947 - acc: 0.9460 Epoch 91/100 11/11 [==============================] - 1s 104ms/step - loss: 0.2548 - acc: 0.8948 Epoch 92/100 11/11 [==============================] - 1s 104ms/step - loss: 0.1870 - acc: 0.9347 Epoch 93/100 11/11 [==============================] - 1s 98ms/step - loss: 0.2973 - acc: 0.9002 Epoch 94/100 11/11 [==============================] - 1s 102ms/step - loss: 0.2487 - acc: 0.9245 Epoch 95/100 11/11 [==============================] - 1s 103ms/step - loss: 0.1924 - acc: 0.9294 Epoch 96/100 11/11 [==============================] - 1s 101ms/step - loss: 0.3887 - acc: 0.9009 Epoch 97/100 11/11 [==============================] - 1s 109ms/step - loss: 0.3643 - acc: 0.9045 Epoch 98/100 11/11 [==============================] - 1s 100ms/step - loss: 0.3606 - acc: 0.9159 Epoch 99/100 11/11 [==============================] - 1s 98ms/step - loss: 0.4588 - acc: 0.8867 Epoch 100/100 11/11 [==============================] - 1s 99ms/step - loss: 0.5343 - acc: 0.9173 done. Please invoke the Learner.lr_plot() method to visually inspect the loss plot to help identify the maximal learning rate associated with falling loss.
learner.lr_plot()
We will train the model using autofit, which uses a triangular learning rate policy. We will save the weights for each epoch so that we can reload the best weights when training completes.
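For intuition, here is a rough sketch of the triangular shape of such a schedule; the exact schedule (cycle length, minimum learning rate) is internal to ktrain, so the values below are illustrative only.
# Illustrative only: a generic triangular learning-rate schedule, rising from a base
# rate to max_lr and back again over one cycle (ktrain's actual schedule may differ
# in its cycle length and base rate).
def triangular_lr(step, steps_per_cycle, max_lr, base_lr=None):
    base_lr = base_lr if base_lr is not None else max_lr / 10.0  # assumed floor
    half = steps_per_cycle / 2.0
    frac = 1.0 - abs((step % steps_per_cycle) - half) / half
    return base_lr + (max_lr - base_lr) * frac

print([round(triangular_lr(s, 12, 0.005), 4) for s in range(12)])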
learner.autofit(0.005, 30, checkpoint_folder='/tmp/saved_weights')
begin training using triangular learning rate policy with max lr of 0.005... Epoch 1/30 12/12 [==============================] - 9s 748ms/step - loss: 0.6526 - acc: 0.5750 - val_loss: 0.4119 - val_acc: 0.8907 Epoch 2/30 12/12 [==============================] - 7s 578ms/step - loss: 0.4043 - acc: 0.8889 - val_loss: 0.3422 - val_acc: 0.8907 Epoch 3/30 12/12 [==============================] - 8s 674ms/step - loss: 0.3405 - acc: 0.8889 - val_loss: 0.3063 - val_acc: 0.8907 Epoch 4/30 12/12 [==============================] - 8s 644ms/step - loss: 0.3079 - acc: 0.8889 - val_loss: 0.2740 - val_acc: 0.8907 Epoch 5/30 12/12 [==============================] - 7s 620ms/step - loss: 0.2774 - acc: 0.8889 - val_loss: 0.2527 - val_acc: 0.8907 Epoch 6/30 12/12 [==============================] - 8s 662ms/step - loss: 0.2488 - acc: 0.8902 - val_loss: 0.2311 - val_acc: 0.9193 Epoch 7/30 12/12 [==============================] - 7s 614ms/step - loss: 0.2264 - acc: 0.9118 - val_loss: 0.2206 - val_acc: 0.9243 Epoch 8/30 12/12 [==============================] - 8s 667ms/step - loss: 0.2084 - acc: 0.9347 - val_loss: 0.2185 - val_acc: 0.9158 Epoch 9/30 12/12 [==============================] - 8s 640ms/step - loss: 0.2012 - acc: 0.9380 - val_loss: 0.2146 - val_acc: 0.9188 Epoch 10/30 12/12 [==============================] - 7s 621ms/step - loss: 0.2021 - acc: 0.9301 - val_loss: 0.2158 - val_acc: 0.9240 Epoch 11/30 12/12 [==============================] - 8s 675ms/step - loss: 0.1944 - acc: 0.9367 - val_loss: 0.2254 - val_acc: 0.9101 Epoch 12/30 12/12 [==============================] - 8s 673ms/step - loss: 0.1904 - acc: 0.9341 - val_loss: 0.2141 - val_acc: 0.9207 Epoch 13/30 12/12 [==============================] - 7s 622ms/step - loss: 0.1785 - acc: 0.9406 - val_loss: 0.2164 - val_acc: 0.9195 Epoch 14/30 12/12 [==============================] - 7s 621ms/step - loss: 0.1702 - acc: 0.9432 - val_loss: 0.2188 - val_acc: 0.9200 Epoch 15/30 12/12 [==============================] - 7s 582ms/step - loss: 0.1738 - acc: 0.9399 - val_loss: 0.2218 - val_acc: 0.9177 Epoch 16/30 12/12 [==============================] - 7s 580ms/step - loss: 0.1644 - acc: 0.9406 - val_loss: 0.2223 - val_acc: 0.9160 Epoch 17/30 12/12 [==============================] - 8s 652ms/step - loss: 0.1713 - acc: 0.9341 - val_loss: 0.2230 - val_acc: 0.9167 Epoch 18/30 12/12 [==============================] - 7s 584ms/step - loss: 0.1687 - acc: 0.9393 - val_loss: 0.2262 - val_acc: 0.9160 Epoch 19/30 12/12 [==============================] - 7s 620ms/step - loss: 0.1685 - acc: 0.9426 - val_loss: 0.2224 - val_acc: 0.9179 Epoch 20/30 12/12 [==============================] - 8s 645ms/step - loss: 0.1555 - acc: 0.9439 - val_loss: 0.2420 - val_acc: 0.9042 Epoch 21/30 12/12 [==============================] - 8s 649ms/step - loss: 0.1730 - acc: 0.9406 - val_loss: 0.2194 - val_acc: 0.9179 Epoch 22/30 12/12 [==============================] - 7s 572ms/step - loss: 0.1606 - acc: 0.9425 - val_loss: 0.2232 - val_acc: 0.9203 Epoch 23/30 12/12 [==============================] - 7s 614ms/step - loss: 0.1529 - acc: 0.9504 - val_loss: 0.2301 - val_acc: 0.9096 Epoch 24/30 12/12 [==============================] - 7s 575ms/step - loss: 0.1606 - acc: 0.9399 - val_loss: 0.2243 - val_acc: 0.9210 Epoch 25/30 12/12 [==============================] - 7s 588ms/step - loss: 0.1453 - acc: 0.9536 - val_loss: 0.2297 - val_acc: 0.9122 Epoch 26/30 12/12 [==============================] - 8s 701ms/step - loss: 0.1427 - acc: 0.9517 - val_loss: 0.2262 - val_acc: 0.9165 Epoch 27/30 12/12 
[==============================] - 7s 568ms/step - loss: 0.1424 - acc: 0.9451 - val_loss: 0.2420 - val_acc: 0.9113 Epoch 28/30 12/12 [==============================] - 8s 632ms/step - loss: 0.1402 - acc: 0.9471 - val_loss: 0.2382 - val_acc: 0.9167 Epoch 29/30 12/12 [==============================] - 8s 688ms/step - loss: 0.1393 - acc: 0.9530 - val_loss: 0.2369 - val_acc: 0.9103 Epoch 30/30 12/12 [==============================] - 7s 590ms/step - loss: 0.1352 - acc: 0.9517 - val_loss: 0.2465 - val_acc: 0.9080
<keras.callbacks.History at 0x7fbc080336d8>
Let's load the weights from epoch 12, which achieved the lowest validation loss.
learner.model.load_weights('/tmp/saved_weights/weights-12.hdf5')
learner.validate(class_names=preproc.get_classes())
              precision    recall  f1-score   support

     hateful       0.68      0.57      0.62       462
      normal       0.95      0.97      0.96      3764

    accuracy                           0.92      4226
   macro avg       0.82      0.77      0.79      4226
weighted avg       0.92      0.92      0.92      4226
array([[ 262,  200],
       [ 122, 3642]])
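As a quick check, the hateful-class recall implied by the confusion matrix above matches the reported 0.57:
# 262 hateful users correctly identified out of 462 hateful users in the validation set
print(262 / (262 + 200))   # ~0.567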
Let's make predictions for all Twitter users that are unlabeled (i.e., we don't know whether or not they are hateful).
p = ktrain.get_predictor(learner.model, preproc)
df_unlabeled = preproc.df[preproc.df.target=='unknown']
preds = p.predict_transductive(df_unlabeled.index)
preds = np.array(preds)
import pandas as pd
df_preds = pd.DataFrame(zip(df_unlabeled.index, preds), columns=['UserID', 'Predicted'])
df_preds[df_preds.Predicted=='hateful'].head()
 | UserID | Predicted |
---|---|---|
52 | 56 | hateful |
129 | 140 | hateful |
243 | 259 | hateful |
475 | 499 | hateful |
501 | 526 | hateful |
df_preds[df_preds.Predicted=='hateful'].shape[0]
578
Out of more than 95,000 unlabeled nodes in the Twitter graph, our model flagged 578 as potentially hateful users, which would seem to warrant review.