%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID";
os.environ["CUDA_VISIBLE_DEVICES"]="0"
import ktrain
from ktrain import graph as gr
Using TensorFlow backend.
using Keras version: 2.2.4
Consider a social network (e.g., Facebook, LinkedIn, Twitter) where each node is a person and links represent friendships. Each node (or person) in the graph can be described by various attributes such as location, alma mater, organizational memberships, gender, relationship status, children, etc. Suppose we had the U.S. political affiliation (e.g., Democrat, Republican, Libertarian, Green Party) of only a small subset of nodes, with the affiliations of the remaining nodes unknown. Here, node classification involves predicting the political affiliation of the unknown nodes based only on the small subset of nodes for which the political affiliation is known.
Whereas traditional tabular models (e.g., logistic regression, SVM) use only a node's attributes to predict its label, graph neural networks use both the node's attributes and the graph's structure. For instance, to predict a person's political affiliation, it helps to look not only at that person's attributes but also at the attributes of other people in their vicinity in the social network: birds of a feather typically flock together. By exploiting graph structure, graph neural networks require much less labeled ground truth than non-graph approaches. In the example below, we will use the labels of only a very small fraction of all nodes to build our model.
In this notebook, we will use ktrain to perform node classification on a Twitter graph to predict hateful users. Each Twitter user is described by various attributes related to both their profile and their tweeting behavior, such as the number of tweets and retweets, status length, etc.
The dataset can be downloaded from here.
For node classification, ktrain requires two files formatted in a specific way: a tab-delimited node attributes file (node ID first, attributes next, and the target label as the last column) and a tab-delimited edge list (one source/destination pair per line).
We must first transform the raw dataset into the file formats described above. We work with two files: users.edges, which describes the graph structure, and users_neighborhood_anon.csv, which contains each node's label and attributes. The file users.edges is the edge list and is, for the most part, already in the format expected by ktrain. We must clean and prepare users_neighborhood_anon.csv into the format expected by ktrain: we will drop unused columns, normalize numeric attributes, re-order and transform the target column hate into an interpretable string label, and save the data as a tab-delimited file.
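As a concrete illustration, here is a minimal sketch (with made-up node IDs and attribute values) of how two such files could be produced; it simply mirrors the format we create for the real data below.
# Toy sketch of the two tab-delimited files ktrain expects (hypothetical values)
import pandas as pd
# node file: index is the node ID, attributes in the middle, label in the last column
toy_nodes = pd.DataFrame({'feat1': [0.12, -0.30, 1.24],
                          'feat2': [1.05, 0.51, -0.77],
                          'label': ['normal', 'hateful', 'unknown']},
                         index=['0', '1', '2'])
toy_nodes.to_csv('/tmp/toy-nodes.tab', sep='\t', header=False)
# edge file: one (source, destination) pair per line
toy_edges = pd.DataFrame({'Source': ['0', '1'], 'Destination': ['1', '2']})
toy_edges.to_csv('/tmp/toy-edges.tab', sep='\t', header=False, index=False)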
# useful imports
import sklearn
import sklearn.preprocessing  # the preprocessing submodule (PowerTransformer) is used below
import numpy as np
import pandas as pd
# read in data
data_dir = 'data/hateful-twitter-users/'
users_feat = pd.read_csv(os.path.join(data_dir, 'users_neighborhood_anon.csv'))
# clean the data and drop unused columns
def data_cleaning(feat):
    feat = feat.drop(columns=["hate_neigh", "normal_neigh"])

    # Convert target values in hate column from strings to integers (0, 1, 2)
    feat['hate'] = np.where(feat['hate'] == 'hateful', 1, np.where(feat['hate'] == 'normal', 0, 2))

    # missing information
    number_of_missing = feat.isnull().sum()
    number_of_missing[number_of_missing != 0]

    # Replace NA with 0
    feat.fillna(0, inplace=True)

    # drop info about suspension and deletion, as it should not be used in the predictive model
    feat.drop(feat.columns[feat.columns.str.contains("is_")], axis=1, inplace=True)

    # drop glove features
    feat.drop(feat.columns[feat.columns.str.contains("_glove")], axis=1, inplace=True)

    # drop c_ features
    feat.drop(feat.columns[feat.columns.str.contains("c_")], axis=1, inplace=True)

    # drop sentiment features for now
    feat.drop(feat.columns[feat.columns.str.contains("sentiment")], axis=1, inplace=True)

    # drop hashtag feature
    feat.drop(['hashtags'], axis=1, inplace=True)

    # drop centrality-based measures
    feat.drop(columns=['betweenness', 'eigenvector', 'in_degree', 'out_degree'], inplace=True)

    feat.drop(columns=['created_at'], inplace=True)

    return feat
node_data = data_cleaning(users_feat)
# recode the target column into human-readable string labels
node_data = node_data.replace({'hate': {0:'normal', 1:'hateful', 2:'unknown'}})
# normalize the numeric columns (skip user_id and hate, which is the label column)
df_values = node_data.iloc[:, 2:].values
pt = sklearn.preprocessing.PowerTransformer(method='yeo-johnson', standardize=True)
df_values_log = pt.fit_transform(df_values)
node_data.iloc[:, 2:] = df_values_log
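As an optional sanity check (a sketch, not part of the original preprocessing), we can confirm that standardize=True left each transformed column with roughly zero mean and unit variance.
# each transformed numeric column should now have ~zero mean and ~unit variance
print(np.isclose(df_values_log.mean(axis=0), 0, atol=1e-6).all())
print(np.isclose(df_values_log.std(axis=0), 1, atol=1e-2).all())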
# drop user_id and use the equivalent index as node ID
node_data.index = node_data.index.map(str)
node_data.drop(columns=['user_id'], inplace=True)
# move target column to last position
cols = list(node_data)
cols.remove('hate')
cols.append('hate')
node_data = node_data.reindex(columns= cols)
node_data.head()
 | statuses_count | followers_count | followees_count | favorites_count | listed_count | negotiate_empath | vehicle_empath | science_empath | timidity_empath | gain_empath | ... | tweet number | retweet number | quote number | status length | number urls | baddies | mentions | time_diff | time_diff_median | hate |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.541150 | 0.046773 | 1.104767 | 1.869391 | 0.017835 | -1.752256 | 0.164900 | 0.181173 | 0.875069 | 1.130523 | ... | -0.049013 | 0.321929 | -0.369992 | -1.036127 | -0.796091 | 0.047430 | 0.356495 | -1.888186 | -1.299249 | normal |
1 | -0.700240 | 0.772450 | -0.526061 | -1.434183 | 0.613187 | -0.735320 | -0.864337 | 0.599279 | 1.610977 | -1.203049 | ... | 1.479066 | -1.999580 | -1.545285 | -0.188945 | -1.875745 | -0.626192 | -1.972207 | 0.160925 | -1.512603 | unknown |
2 | -1.077284 | -0.127775 | 0.767345 | -0.669050 | -0.523882 | -0.118440 | -1.573040 | 1.211083 | -0.154213 | 0.932754 | ... | -0.201320 | 0.452537 | -1.545285 | 0.637869 | 0.884530 | -0.096918 | 0.348954 | 0.698841 | 0.122176 | unknown |
3 | 1.908494 | -0.021575 | -0.548705 | 0.078540 | 0.017835 | -0.472125 | 1.281633 | -0.544862 | 1.259492 | -0.456470 | ... | -1.018822 | 1.085858 | -0.662393 | -0.701835 | 0.088472 | -0.626192 | -1.254997 | -1.576801 | -1.311031 | unknown |
4 | -0.778589 | 0.729918 | 2.296049 | -0.725089 | 0.700128 | -1.488804 | -1.573040 | -0.969812 | 0.199834 | -1.203049 | ... | -0.427866 | 0.638106 | -1.545285 | 1.370832 | 0.655433 | 0.955922 | -1.914894 | 0.803553 | 1.472247 | unknown |
5 rows × 205 columns
# save both data files as tab-delimited to be consistent
node_data.to_csv('/tmp/twitter-nodes.tab', sep='\t', header=False)
edge_data = pd.read_csv(os.path.join(data_dir, 'users.edges'), header=None, names=['Source', 'Destination'])
edge_data.to_csv('/tmp/twitter-edges.tab', sep='\t', header=False, index=False)
# check to make sure there are no missing values in the non-target columns
node_data[node_data.drop('hate', axis=1).isnull().any(axis=1)].shape[0]
0
Here, we will load the preprocessed dataset. Of the nodes annotated with labels (i.e., hateful vs. normal), we will use 15% as the training set and 85% for validation. For the nodes with no labels (i.e., those tagged as 'unknown' in this dataset), we will create df_holdout and G_complete. The dataframe df_holdout contains the features of the nodes with no labels, and G_complete is the entire graph including the nodes in df_holdout that were held out.
If holdout_for_inductive=True, then the features of the holdout nodes are not visible during training, and the training graph is a subgraph of G_complete. Otherwise, the features (but not the labels) of the held-out nodes can be exploited during training. The holdout_for_inductive=True parameter is useful for assessing how well your model can make predictions for new nodes added to the graph later using G_complete (inductive inference). In this case, we set holdout_for_inductive=False, as we would like to use the features of unlabeled nodes to help learn to make accurate predictions. G_complete, then, is identical to the training graph and is not used, since we are only doing transductive inference. See this example notebook to better understand the difference between transductive and inductive inference on graphs.
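For comparison only, here is a hedged sketch of what the inductive setup might look like (not run in this notebook); the holdout_pct value is purely illustrative.
# Inductive variant (sketch, not executed here): hold out a fraction of nodes whose
# features remain hidden during training, so the model can later be evaluated on
# truly unseen nodes via G_complete.
# (train_data, val_data, preproc,
#  df_holdout, G_complete) = gr.graph_nodes_from_csv('/tmp/twitter-nodes.tab',
#                                                    '/tmp/twitter-edges.tab',
#                                                    sample_size=20,
#                                                    holdout_pct=0.1,   # illustrative value
#                                                    holdout_for_inductive=True,
#                                                    missing_label_value='unknown',
#                                                    train_pct=0.15,
#                                                    sep='\t')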
(train_data, val_data, preproc,
 df_holdout, G_complete) = gr.graph_nodes_from_csv('/tmp/twitter-nodes.tab',
                                                   '/tmp/twitter-edges.tab',
                                                   sample_size=20,
                                                   holdout_pct=None,  # using missing_label_value for holdout
                                                   holdout_for_inductive=False,
                                                   missing_label_value='unknown',
                                                   train_pct=0.15,
                                                   sep='\t')
Largest subgraph statistics: 100386 nodes, 2194979 edges
using 95415 nodes with missing target as holdout set
Size of training graph: 100386 nodes
Training nodes: 745
Validation nodes: 4226
Nodes treated as unlabeled for testing/inference: 95415
Holdout node features are visible during training (transductive inference)
The training graph and the dataframe containing the features of all nodes in the training graph are both accessible via the Preprocessor instance. Let's look at the class distribution in the training graph. There is class imbalance here, which might be addressed by computing class weights and supplying them to the class_weight parameter of any *fit* method in ktrain and Keras (see the sketch below). We will train without doing so here, though.
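Below is a minimal sketch of how such class weights could be computed from the labeled nodes. The class_weight argument and the class-index ordering (assumed to match preproc.get_classes()) follow the Keras fit() convention mentioned above; the training call is left commented out since we do not use it here.
# Sketch only: compute balanced class weights from the labeled nodes
from sklearn.utils.class_weight import compute_class_weight
y_labeled = preproc.df.target[preproc.df.target != 'unknown']
classes = np.array(sorted(y_labeled.unique()))   # assumed to match preproc.get_classes() ordering
weights = compute_class_weight('balanced', classes=classes, y=y_labeled)
class_weights = dict(enumerate(weights))
# learner.autofit(0.005, 30, class_weight=class_weights)  # hypothetical usage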
print("Initial hateful/normal users distribution")
print(preproc.df.target.value_counts())
Initial hateful/normal users distribution
unknown    95415
normal      4427
hateful      544
Name: target, dtype: int64
gr.print_node_classifiers()
graphsage: GraphSAGE: https://arxiv.org/pdf/1706.02216.pdf
learner = ktrain.get_learner(model=gr.graph_node_classifier('graphsage', train_data),
                             train_data=train_data,
                             val_data=val_data,
                             batch_size=64)
Is Multi-Label? False
done
Given the small number of batches per epoch, a larger number of epochs is required to estimate the learning rate. We will cap it at 100 here.
learner.lr_find(max_epochs=100)
simulating training for different learning rates... this may take a few moments... Epoch 1/100 11/11 [==============================] - 1s 129ms/step - loss: 0.6290 - acc: 0.6650 Epoch 2/100 11/11 [==============================] - 1s 106ms/step - loss: 0.6213 - acc: 0.6693 Epoch 3/100 11/11 [==============================] - 1s 106ms/step - loss: 0.6265 - acc: 0.6871 Epoch 4/100 11/11 [==============================] - 1s 105ms/step - loss: 0.6252 - acc: 0.6963 Epoch 5/100 11/11 [==============================] - 1s 108ms/step - loss: 0.6309 - acc: 0.6832 Epoch 6/100 11/11 [==============================] - 1s 104ms/step - loss: 0.6445 - acc: 0.6350 Epoch 7/100 11/11 [==============================] - 1s 109ms/step - loss: 0.6300 - acc: 0.6543 Epoch 8/100 11/11 [==============================] - 1s 109ms/step - loss: 0.6207 - acc: 0.6785 Epoch 9/100 11/11 [==============================] - 1s 98ms/step - loss: 0.6455 - acc: 0.6495 Epoch 10/100 11/11 [==============================] - 1s 108ms/step - loss: 0.6296 - acc: 0.6577 Epoch 11/100 11/11 [==============================] - 1s 91ms/step - loss: 0.6227 - acc: 0.6906 Epoch 12/100 11/11 [==============================] - 1s 96ms/step - loss: 0.6279 - acc: 0.6828 Epoch 13/100 11/11 [==============================] - 1s 106ms/step - loss: 0.6233 - acc: 0.6757 Epoch 14/100 11/11 [==============================] - 1s 119ms/step - loss: 0.6258 - acc: 0.6884 Epoch 15/100 11/11 [==============================] - 1s 111ms/step - loss: 0.6312 - acc: 0.6757 Epoch 16/100 11/11 [==============================] - 1s 106ms/step - loss: 0.6386 - acc: 0.6600 Epoch 17/100 11/11 [==============================] - 1s 104ms/step - loss: 0.6318 - acc: 0.6729 Epoch 18/100 11/11 [==============================] - 1s 107ms/step - loss: 0.6249 - acc: 0.6985 Epoch 19/100 11/11 [==============================] - 1s 110ms/step - loss: 0.6316 - acc: 0.6656 Epoch 20/100 11/11 [==============================] - 1s 110ms/step - loss: 0.6154 - acc: 0.7031 Epoch 21/100 11/11 [==============================] - 1s 101ms/step - loss: 0.6404 - acc: 0.6308 Epoch 22/100 11/11 [==============================] - 1s 92ms/step - loss: 0.6310 - acc: 0.6750 Epoch 23/100 11/11 [==============================] - 1s 93ms/step - loss: 0.6150 - acc: 0.6814 Epoch 24/100 11/11 [==============================] - 1s 97ms/step - loss: 0.6292 - acc: 0.6629 Epoch 25/100 11/11 [==============================] - 1s 114ms/step - loss: 0.6173 - acc: 0.6776 Epoch 26/100 11/11 [==============================] - 1s 110ms/step - loss: 0.6191 - acc: 0.6942 Epoch 27/100 11/11 [==============================] - 1s 112ms/step - loss: 0.6225 - acc: 0.6771 Epoch 28/100 11/11 [==============================] - 1s 107ms/step - loss: 0.6172 - acc: 0.6895 Epoch 29/100 11/11 [==============================] - 1s 104ms/step - loss: 0.6170 - acc: 0.6776 Epoch 30/100 11/11 [==============================] - 1s 100ms/step - loss: 0.6127 - acc: 0.6952 Epoch 31/100 11/11 [==============================] - 1s 110ms/step - loss: 0.6057 - acc: 0.7107 Epoch 32/100 11/11 [==============================] - 1s 108ms/step - loss: 0.6125 - acc: 0.7045 Epoch 33/100 11/11 [==============================] - 1s 101ms/step - loss: 0.5887 - acc: 0.7348 Epoch 34/100 11/11 [==============================] - 1s 108ms/step - loss: 0.5973 - acc: 0.7348 Epoch 35/100 11/11 [==============================] - 1s 108ms/step - loss: 0.5597 - acc: 0.7776 Epoch 36/100 11/11 [==============================] - 1s 100ms/step - loss: 0.5613 - acc: 
0.7818 Epoch 37/100 11/11 [==============================] - 1s 109ms/step - loss: 0.5502 - acc: 0.7839 Epoch 38/100 11/11 [==============================] - 1s 104ms/step - loss: 0.5477 - acc: 0.7897 Epoch 39/100 11/11 [==============================] - 1s 110ms/step - loss: 0.5317 - acc: 0.8054 0s - loss: 0.5096 - acc: 0 Epoch 40/100 11/11 [==============================] - 1s 119ms/step - loss: 0.5094 - acc: 0.8511 Epoch 41/100 11/11 [==============================] - 1s 108ms/step - loss: 0.5038 - acc: 0.8439 Epoch 42/100 11/11 [==============================] - 1s 106ms/step - loss: 0.4800 - acc: 0.8622 Epoch 43/100 11/11 [==============================] - 1s 106ms/step - loss: 0.4655 - acc: 0.8710 Epoch 44/100 11/11 [==============================] - 1s 109ms/step - loss: 0.4374 - acc: 0.8845 Epoch 45/100 11/11 [==============================] - 1s 111ms/step - loss: 0.4223 - acc: 0.8866 Epoch 46/100 11/11 [==============================] - 1s 108ms/step - loss: 0.4025 - acc: 0.8824 Epoch 47/100 11/11 [==============================] - 1s 97ms/step - loss: 0.3779 - acc: 0.8945 Epoch 48/100 11/11 [==============================] - 1s 94ms/step - loss: 0.3659 - acc: 0.8888 Epoch 49/100 11/11 [==============================] - 1s 108ms/step - loss: 0.3548 - acc: 0.8859 Epoch 50/100 11/11 [==============================] - 1s 111ms/step - loss: 0.3284 - acc: 0.8845 Epoch 51/100 11/11 [==============================] - 1s 110ms/step - loss: 0.3067 - acc: 0.8945 Epoch 52/100 11/11 [==============================] - 1s 121ms/step - loss: 0.2858 - acc: 0.8977 Epoch 53/100 11/11 [==============================] - 1s 110ms/step - loss: 0.2695 - acc: 0.8905 Epoch 54/100 11/11 [==============================] - 1s 108ms/step - loss: 0.2523 - acc: 0.9077 Epoch 55/100 11/11 [==============================] - 1s 104ms/step - loss: 0.2539 - acc: 0.9237 Epoch 56/100 11/11 [==============================] - 1s 90ms/step - loss: 0.2307 - acc: 0.9185 Epoch 57/100 11/11 [==============================] - 1s 108ms/step - loss: 0.2081 - acc: 0.9318 Epoch 58/100 11/11 [==============================] - 1s 104ms/step - loss: 0.2110 - acc: 0.9237 Epoch 59/100 11/11 [==============================] - 1s 101ms/step - loss: 0.1850 - acc: 0.9437 Epoch 60/100 11/11 [==============================] - 1s 96ms/step - loss: 0.1805 - acc: 0.9373 Epoch 61/100 11/11 [==============================] - 1s 108ms/step - loss: 0.1932 - acc: 0.9252 Epoch 62/100 11/11 [==============================] - 2s 141ms/step - loss: 0.1815 - acc: 0.9347 Epoch 63/100 11/11 [==============================] - 1s 110ms/step - loss: 0.1673 - acc: 0.9316 Epoch 64/100 11/11 [==============================] - 1s 98ms/step - loss: 0.1852 - acc: 0.9407 Epoch 65/100 11/11 [==============================] - 1s 109ms/step - loss: 0.1670 - acc: 0.9332 Epoch 66/100 11/11 [==============================] - 1s 106ms/step - loss: 0.1820 - acc: 0.9373 Epoch 67/100 11/11 [==============================] - 1s 96ms/step - loss: 0.1355 - acc: 0.9521 Epoch 68/100 11/11 [==============================] - 1s 115ms/step - loss: 0.1501 - acc: 0.9517 Epoch 69/100 11/11 [==============================] - 1s 106ms/step - loss: 0.1522 - acc: 0.9521 Epoch 70/100 11/11 [==============================] - 1s 109ms/step - loss: 0.1639 - acc: 0.9460 Epoch 71/100 11/11 [==============================] - 1s 101ms/step - loss: 0.1746 - acc: 0.9287 Epoch 72/100 11/11 [==============================] - 1s 121ms/step - loss: 0.1861 - acc: 0.9294 Epoch 73/100 11/11 
[==============================] - 1s 115ms/step - loss: 0.1656 - acc: 0.9444 Epoch 74/100 11/11 [==============================] - 1s 110ms/step - loss: 0.1401 - acc: 0.9516 Epoch 75/100 11/11 [==============================] - 1s 103ms/step - loss: 0.1422 - acc: 0.9401 Epoch 76/100 11/11 [==============================] - 1s 100ms/step - loss: 0.1438 - acc: 0.9537 Epoch 77/100 11/11 [==============================] - 1s 110ms/step - loss: 0.1594 - acc: 0.9394 Epoch 78/100 11/11 [==============================] - 1s 105ms/step - loss: 0.1477 - acc: 0.9380 Epoch 79/100 11/11 [==============================] - 1s 116ms/step - loss: 0.1931 - acc: 0.9261 Epoch 80/100 11/11 [==============================] - 1s 111ms/step - loss: 0.1378 - acc: 0.9587 Epoch 81/100 11/11 [==============================] - 1s 105ms/step - loss: 0.1332 - acc: 0.9456 Epoch 82/100 11/11 [==============================] - 1s 109ms/step - loss: 0.1417 - acc: 0.9560 Epoch 83/100 11/11 [==============================] - 1s 98ms/step - loss: 0.1865 - acc: 0.9344 Epoch 84/100 11/11 [==============================] - 1s 99ms/step - loss: 0.1545 - acc: 0.9416 Epoch 85/100 11/11 [==============================] - 1s 113ms/step - loss: 0.1665 - acc: 0.9387 Epoch 86/100 11/11 [==============================] - 1s 102ms/step - loss: 0.1233 - acc: 0.9558 Epoch 87/100 11/11 [==============================] - 1s 109ms/step - loss: 0.2232 - acc: 0.9095 Epoch 88/100 11/11 [==============================] - 1s 99ms/step - loss: 0.1615 - acc: 0.9437 Epoch 89/100 11/11 [==============================] - 1s 103ms/step - loss: 0.1802 - acc: 0.9373 Epoch 90/100 11/11 [==============================] - 1s 109ms/step - loss: 0.1947 - acc: 0.9460 Epoch 91/100 11/11 [==============================] - 1s 104ms/step - loss: 0.2548 - acc: 0.8948 Epoch 92/100 11/11 [==============================] - 1s 104ms/step - loss: 0.1870 - acc: 0.9347 Epoch 93/100 11/11 [==============================] - 1s 98ms/step - loss: 0.2973 - acc: 0.9002 Epoch 94/100 11/11 [==============================] - 1s 102ms/step - loss: 0.2487 - acc: 0.9245 Epoch 95/100 11/11 [==============================] - 1s 103ms/step - loss: 0.1924 - acc: 0.9294 Epoch 96/100 11/11 [==============================] - 1s 101ms/step - loss: 0.3887 - acc: 0.9009 Epoch 97/100 11/11 [==============================] - 1s 109ms/step - loss: 0.3643 - acc: 0.9045 Epoch 98/100 11/11 [==============================] - 1s 100ms/step - loss: 0.3606 - acc: 0.9159 Epoch 99/100 11/11 [==============================] - 1s 98ms/step - loss: 0.4588 - acc: 0.8867 Epoch 100/100 11/11 [==============================] - 1s 99ms/step - loss: 0.5343 - acc: 0.9173 done. Please invoke the Learner.lr_plot() method to visually inspect the loss plot to help identify the maximal learning rate associated with falling loss.
learner.lr_plot()
We will train the model using autofit, which uses a triangular learning rate policy. We will save the weights for each epoch so that we can reload the best weights when training completes.
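For intuition, here is a rough sketch of the triangular shape of such a schedule; the exact schedule (cycle length, minimum learning rate) is internal to ktrain, so the values below are illustrative only.
# Illustrative only: a generic triangular learning-rate schedule, rising from a base
# rate to max_lr and back again over one cycle (ktrain's actual schedule may differ
# in its cycle length and base rate).
def triangular_lr(step, steps_per_cycle, max_lr, base_lr=None):
    base_lr = base_lr if base_lr is not None else max_lr / 10.0  # assumed floor
    half = steps_per_cycle / 2.0
    frac = 1.0 - abs((step % steps_per_cycle) - half) / half
    return base_lr + (max_lr - base_lr) * frac

print([round(triangular_lr(s, 12, 0.005), 4) for s in range(12)])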
learner.autofit(0.005, 30, checkpoint_folder='/tmp/saved_weights')
begin training using triangular learning rate policy with max lr of 0.005... Epoch 1/30 12/12 [==============================] - 9s 748ms/step - loss: 0.6526 - acc: 0.5750 - val_loss: 0.4119 - val_acc: 0.8907 Epoch 2/30 12/12 [==============================] - 7s 578ms/step - loss: 0.4043 - acc: 0.8889 - val_loss: 0.3422 - val_acc: 0.8907 Epoch 3/30 12/12 [==============================] - 8s 674ms/step - loss: 0.3405 - acc: 0.8889 - val_loss: 0.3063 - val_acc: 0.8907 Epoch 4/30 12/12 [==============================] - 8s 644ms/step - loss: 0.3079 - acc: 0.8889 - val_loss: 0.2740 - val_acc: 0.8907 Epoch 5/30 12/12 [==============================] - 7s 620ms/step - loss: 0.2774 - acc: 0.8889 - val_loss: 0.2527 - val_acc: 0.8907 Epoch 6/30 12/12 [==============================] - 8s 662ms/step - loss: 0.2488 - acc: 0.8902 - val_loss: 0.2311 - val_acc: 0.9193 Epoch 7/30 12/12 [==============================] - 7s 614ms/step - loss: 0.2264 - acc: 0.9118 - val_loss: 0.2206 - val_acc: 0.9243 Epoch 8/30 12/12 [==============================] - 8s 667ms/step - loss: 0.2084 - acc: 0.9347 - val_loss: 0.2185 - val_acc: 0.9158 Epoch 9/30 12/12 [==============================] - 8s 640ms/step - loss: 0.2012 - acc: 0.9380 - val_loss: 0.2146 - val_acc: 0.9188 Epoch 10/30 12/12 [==============================] - 7s 621ms/step - loss: 0.2021 - acc: 0.9301 - val_loss: 0.2158 - val_acc: 0.9240 Epoch 11/30 12/12 [==============================] - 8s 675ms/step - loss: 0.1944 - acc: 0.9367 - val_loss: 0.2254 - val_acc: 0.9101 Epoch 12/30 12/12 [==============================] - 8s 673ms/step - loss: 0.1904 - acc: 0.9341 - val_loss: 0.2141 - val_acc: 0.9207 Epoch 13/30 12/12 [==============================] - 7s 622ms/step - loss: 0.1785 - acc: 0.9406 - val_loss: 0.2164 - val_acc: 0.9195 Epoch 14/30 12/12 [==============================] - 7s 621ms/step - loss: 0.1702 - acc: 0.9432 - val_loss: 0.2188 - val_acc: 0.9200 Epoch 15/30 12/12 [==============================] - 7s 582ms/step - loss: 0.1738 - acc: 0.9399 - val_loss: 0.2218 - val_acc: 0.9177 Epoch 16/30 12/12 [==============================] - 7s 580ms/step - loss: 0.1644 - acc: 0.9406 - val_loss: 0.2223 - val_acc: 0.9160 Epoch 17/30 12/12 [==============================] - 8s 652ms/step - loss: 0.1713 - acc: 0.9341 - val_loss: 0.2230 - val_acc: 0.9167 Epoch 18/30 12/12 [==============================] - 7s 584ms/step - loss: 0.1687 - acc: 0.9393 - val_loss: 0.2262 - val_acc: 0.9160 Epoch 19/30 12/12 [==============================] - 7s 620ms/step - loss: 0.1685 - acc: 0.9426 - val_loss: 0.2224 - val_acc: 0.9179 Epoch 20/30 12/12 [==============================] - 8s 645ms/step - loss: 0.1555 - acc: 0.9439 - val_loss: 0.2420 - val_acc: 0.9042 Epoch 21/30 12/12 [==============================] - 8s 649ms/step - loss: 0.1730 - acc: 0.9406 - val_loss: 0.2194 - val_acc: 0.9179 Epoch 22/30 12/12 [==============================] - 7s 572ms/step - loss: 0.1606 - acc: 0.9425 - val_loss: 0.2232 - val_acc: 0.9203 Epoch 23/30 12/12 [==============================] - 7s 614ms/step - loss: 0.1529 - acc: 0.9504 - val_loss: 0.2301 - val_acc: 0.9096 Epoch 24/30 12/12 [==============================] - 7s 575ms/step - loss: 0.1606 - acc: 0.9399 - val_loss: 0.2243 - val_acc: 0.9210 Epoch 25/30 12/12 [==============================] - 7s 588ms/step - loss: 0.1453 - acc: 0.9536 - val_loss: 0.2297 - val_acc: 0.9122 Epoch 26/30 12/12 [==============================] - 8s 701ms/step - loss: 0.1427 - acc: 0.9517 - val_loss: 0.2262 - val_acc: 0.9165 Epoch 27/30 12/12 
[==============================] - 7s 568ms/step - loss: 0.1424 - acc: 0.9451 - val_loss: 0.2420 - val_acc: 0.9113 Epoch 28/30 12/12 [==============================] - 8s 632ms/step - loss: 0.1402 - acc: 0.9471 - val_loss: 0.2382 - val_acc: 0.9167 Epoch 29/30 12/12 [==============================] - 8s 688ms/step - loss: 0.1393 - acc: 0.9530 - val_loss: 0.2369 - val_acc: 0.9103 Epoch 30/30 12/12 [==============================] - 7s 590ms/step - loss: 0.1352 - acc: 0.9517 - val_loss: 0.2465 - val_acc: 0.9080
<keras.callbacks.History at 0x7fbc080336d8>
Let's load the weights from epoch 12, which achieved the lowest validation loss.
learner.model.load_weights('/tmp/saved_weights/weights-12.hdf5')
learner.validate(class_names=preproc.get_classes())
              precision    recall  f1-score   support

     hateful       0.68      0.57      0.62       462
      normal       0.95      0.97      0.96      3764

    accuracy                           0.92      4226
   macro avg       0.82      0.77      0.79      4226
weighted avg       0.92      0.92      0.92      4226
array([[ 262,  200],
       [ 122, 3642]])
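As a quick check, the hateful-class recall implied by the confusion matrix above matches the reported 0.57:
# 262 hateful users correctly identified out of 462 hateful users in the validation set
print(262 / (262 + 200))   # ~0.567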
Let's make predictions for all Twitter users that are unlabeled (i.e., we don't know whether or not they are hateful).
p = ktrain.get_predictor(learner.model, preproc)
df_unlabeled = preproc.df[preproc.df.target=='unknown']
preds = p.predict_transductive(df_unlabeled.index)
preds = np.array(preds)
import pandas as pd
df_preds = pd.DataFrame(zip(df_unlabeled.index, preds), columns=['UserID', 'Predicted'])
df_preds[df_preds.Predicted=='hateful'].head()
 | UserID | Predicted |
---|---|---|
52 | 56 | hateful |
129 | 140 | hateful |
243 | 259 | hateful |
475 | 499 | hateful |
501 | 526 | hateful |
df_preds[df_preds.Predicted=='hateful'].shape[0]
578
Out of more than 95,000 unlabeled nodes in the Twitter graph, our model flagged 578 as potentially hateful users, which would seem to warrant review.