This notebook presents training of multi-label classifiers using the structured SVM framework in Shogun. We will use MultilabelModel for multi-label classification.
We begin with a brief introduction to Multi-Label Structured Prediction [1], followed by the corresponding API in Shogun. We then implement a toy example (for illustration) before getting to a real one. Finally, we evaluate multi-label classification on well-known datasets [2]. We show that SHOGUN's [3] implementation delivers the same accuracy as scikit-learn with the same or better training time.
Multi-Label Structured Prediction combines aspects of multi-label prediction and structured output learning. Structured prediction typically involves an input $\mathbf{x}$ (which can itself be structured) and a structured output $\mathbf{y}$. Given a training set $\{(x^i, y^i)\}_{i=1,...,n} \subset \mathcal{X} \times \mathbb{P}(\mathcal{Y})$, where $\mathcal{Y}$ is a structured output set of potentially very large size (in this case $\mathcal{Y} = \{y_1, y_2, ...., y_q\}$, where $q$ is the total number of possible classes), a joint feature map $\psi(x, y)$ is defined to incorporate structure information into the labels.
The joint feature map $\psi(x, y)$ for MultilabelModel
is defined as $\psi(x, y) = x \otimes y$, where $\otimes$ denotes the tensor product.
We formulate the prediction as:
$h(x) = \{y \in \mathcal{Y} : f(x, y) > 0\}$
The compatibility function, $f(x, y)$, acts on individual inputs and outputs, as in single-label prediction, but the prediction step consists of collecting all outputs of positive scores instead of finding the outputs of maximal score.
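With a linear compatibility function $f(x, y) = \langle w, \psi(x, y) \rangle$, the tensor-product feature map makes each class score a plain dot product with that class's block of $w$. A minimal NumPy sketch of this prediction rule, using hypothetical weights (not learned by Shogun):

```python
import numpy as np

# Hypothetical weight blocks for q = 3 classes over 2-dimensional inputs.
# Row k plays the role of w_k, so f(x, k) = <w_k, x>.
w = np.array([[ 1.0, -0.5],   # w_0
              [-1.0,  2.0],   # w_1
              [ 0.1,  0.1]])  # w_2

def predict(x, w):
    """Collect every class whose compatibility score is positive."""
    scores = w.dot(x)                 # f(x, k) for each class k
    return set(np.where(scores > 0)[0])

x = np.array([2.0, 0.5])
print(predict(x, w))                  # scores are [1.75, -1.0, 0.25] -> {0, 2}
```

Note how, unlike multi-class prediction, the result is a *set* of labels: it can contain several classes, exactly one, or none at all.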
In this notebook, we are going to compare the performance of two multi-label models:

- MultilabelModel model: with a constant entry of $0$ in the joint feature vector, so that no bias term is modelled.
- MultilabelModel model_with_bias: with a constant entry of $1$ in the joint feature vector, to model a bias term.

The joint feature vectors are:

- model $\leftrightarrow \psi(x, y) = [x \| 0] \otimes y$
- model_with_bias $\leftrightarrow \psi(x, y) = [x \| 1] \otimes y$

To compare the two models, we will evaluate them on datasets with binary labels.
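To see why the constant entry models a bias, split $w$ into per-class blocks, each one entry longer than $x$, with last entry $b_k$. The score for class $k$ then decomposes as:

```latex
f(x, k) = \left\langle [w_k \,\|\, b_k],\; [x \,\|\, 1] \right\rangle
        = \langle w_k, x \rangle + b_k
```

With the constant entry $0$ instead, the $b_k$ term is multiplied by $0$ and drops out, so every class boundary $f(x, k) = 0$ is forced through the origin.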
First of all, we create some synthetic data for our toy example. We add a static offset to the data so we can compare the models with and without threshold tuning.
from __future__ import print_function

try:
    from sklearn.datasets import make_classification
except ImportError:
    import pip
    pip.main(['install', '--user', 'scikit-learn'])
    from sklearn.datasets import make_classification

import numpy as np

X, Y = make_classification(n_samples=1000,
                           n_features=2,
                           n_informative=2,
                           n_redundant=0,
                           n_clusters_per_class=2)
# adding some static offset to the data
X = X + 1
To create a multi-label model in Shogun, we first create an instance of MultilabelModel and initialize it with the features and labels. The labels should be MultilabelSOLabels, which are initialized with n_labels
(the number of examples) and n_classes
(the total number of classes); each example's label set is then added individually using the set_sparse_label() method.
from modshogun import RealFeatures, MultilabelSOLabels, MultilabelModel

def create_features(X, constant):
    # append a constant column (0 or 1) to optionally model the bias term
    features = RealFeatures(
        np.c_[X, constant * np.ones(X.shape[0])].T)
    return features

def create_labels(Y, n_classes):
    try:
        n_samples = Y.shape[0]
    except AttributeError:
        n_samples = len(Y)
    labels = MultilabelSOLabels(n_samples, n_classes)
    for i, sparse_label in enumerate(Y):
        try:
            sparse_label = sorted(sparse_label)
        except TypeError:
            # a single integer label
            sparse_label = [sparse_label]
        labels.set_sparse_label(i, np.array(sparse_label, dtype=np.int32))
    return labels
def split_data(X, Y, ratio):
    num_samples = X.shape[0]
    train_samples = int(ratio * num_samples)
    return (X[:train_samples], Y[:train_samples],
            X[train_samples:], Y[train_samples:])
X_train, Y_train, X_test, Y_test = split_data(X, Y, 0.9)
feats_0 = create_features(X_train, 0)
feats_1 = create_features(X_train, 1)
labels = create_labels(Y_train, 2)
model = MultilabelModel(feats_0, labels)
model_with_bias = MultilabelModel(feats_1, labels)
In Shogun, several batch and online solvers have been implemented for SO-learning. Let's train the models using the online solver StochasticSOSVM.
from modshogun import StochasticSOSVM, DualLibQPBMSOSVM, StructuredAccuracy, LabelsFactory
from time import time
sgd = StochasticSOSVM(model, labels)
sgd_with_bias = StochasticSOSVM(model_with_bias, labels)
start = time()
sgd.train()
print(">>> Time taken for SGD *without* threshold tuning = %f" % (time() - start))
start = time()
sgd_with_bias.train()
print(">>> Time taken for SGD *with* threshold tuning = %f" % (time() - start))
For measuring accuracy in multi-label classification, the Jaccard similarity coefficient $\big(J(A, B) = \frac{|A \cap B|}{|A \cup B|}\big)$ is used:
$Accuracy = \frac{1}{p}\sum_{i=1}^{p}\frac{ |Y_i \cap h(x_i)|}{|Y_i \cup h(x_i)|}$
This is available as MultilabelAccuracy for MultilabelLabels
and as StructuredAccuracy for MultilabelSOLabels.
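The formula above is simple enough to compute by hand; a small standalone sketch (independent of Shogun's evaluators) with made-up label sets:

```python
def jaccard_accuracy(true_labels, predicted_labels):
    """Mean Jaccard similarity over examples; empty-vs-empty counts as 1."""
    total = 0.0
    for true, pred in zip(true_labels, predicted_labels):
        t, p = set(true), set(pred)
        union = t | p
        total += len(t & p) / float(len(union)) if union else 1.0
    return total / len(true_labels)

Y_true = [[0], [1], [0, 1]]
Y_pred = [[0], [0, 1], [1]]
print(jaccard_accuracy(Y_true, Y_pred))  # (1 + 0.5 + 0.5) / 3 = 2/3
```

Exact matches score $1$, disjoint label sets score $0$, and partial overlaps fall in between, which is why this metric is well suited to multi-label outputs.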
def evaluate_machine(machine, X_test, Y_test, n_classes, bias):
    if bias:
        feats_test = create_features(X_test, 1)
    else:
        feats_test = create_features(X_test, 0)
    test_labels = create_labels(Y_test, n_classes)
    out_labels = LabelsFactory.to_structured(machine.apply(feats_test))
    evaluator = StructuredAccuracy()
    jaccard_similarity_score = evaluator.evaluate(out_labels, test_labels)
    return jaccard_similarity_score
print(">>> Accuracy of SGD *without* threshold tuning = %f" % evaluate_machine(sgd, X_test, Y_test, 2, False))
print(">>> Accuracy of SGD *with* threshold tuning = %f" % evaluate_machine(sgd_with_bias, X_test, Y_test, 2, True))
import matplotlib.pyplot as plt
%matplotlib inline

def get_parameters(weights):
    # slope and intercept of the line <w, [x0, x1, 1]> = 0
    return -weights[0] / weights[1], -weights[2] / weights[1]

def scatter_plot(X, y):
    zeros_class = np.where(y == 0)
    ones_class = np.where(y == 1)
    plt.scatter(X[zeros_class, 0], X[zeros_class, 1], c='b', label="Negative Class")
    plt.scatter(X[ones_class, 0], X[ones_class, 1], c='r', label="Positive Class")

def plot_hyperplane(machine_0, machine_1, label_0, label_1, title, X, y):
    scatter_plot(X, y)
    x_min, x_max = np.min(X[:, 0]) - 0.5, np.max(X[:, 0]) + 0.5
    y_min, y_max = np.min(X[:, 1]) - 0.5, np.max(X[:, 1]) + 0.5
    xx = np.linspace(x_min, x_max, 1000)
    m_0, c_0 = get_parameters(machine_0.get_w())
    m_1, c_1 = get_parameters(machine_1.get_w())
    yy_0 = m_0 * xx + c_0
    yy_1 = m_1 * xx + c_1
    plt.plot(xx, yy_0, "k--", label=label_0)
    plt.plot(xx, yy_1, "g-", label=label_1)
    plt.xlim((x_min, x_max))
    plt.ylim((y_min, y_max))
    plt.grid()
    plt.legend(loc="best")
    plt.title(title)
    plt.show()

fig = plt.figure(figsize=(10, 10))
plot_hyperplane(sgd, sgd_with_bias,
                "Boundary for machine *without* bias for class 0",
                "Boundary for machine *with* bias for class 0",
                "Binary Classification using SO-SVM with/without threshold tuning",
                X, Y)
As we can see from the above plot, sgd_with_bias
produces a better classification boundary. The boundary of the model without threshold tuning passes through the origin, while the boundary of the model with threshold tuning passes near $(1, 1)$ (the constant offset we added earlier).
from modshogun import SparseMultilabel_obtain_from_generic

def plot_decision_plane(machine, title, X, y, bias):
    plt.figure(figsize=(24, 8))
    plt.suptitle(title)
    plt.subplot(1, 2, 1)
    x_min, x_max = np.min(X[:, 0]) - 0.5, np.max(X[:, 0]) + 0.5
    y_min, y_max = np.min(X[:, 1]) - 0.5, np.max(X[:, 1]) + 0.5
    xx = np.linspace(x_min, x_max, 200)
    yy = np.linspace(y_min, y_max, 200)
    x_mesh, y_mesh = np.meshgrid(xx, yy)
    if bias:
        feats = create_features(np.c_[x_mesh.ravel(), y_mesh.ravel()], 1)
    else:
        feats = create_features(np.c_[x_mesh.ravel(), y_mesh.ravel()], 0)
    out_labels = machine.apply(feats)
    z = []
    for i in range(out_labels.get_num_labels()):
        label = SparseMultilabel_obtain_from_generic(out_labels.get_label(i)).get_data()
        if label.shape[0] == 1:
            # predicted a single label
            z.append(label[0])
        elif label.shape[0] == 2:
            # predicted both of the classes
            z.append(2)
        elif label.shape[0] == 0:
            # predicted none of the classes
            z.append(3)
    z = np.array(z)
    z = z.reshape(x_mesh.shape)
    c = plt.pcolor(x_mesh, y_mesh, z, cmap=plt.cm.gist_heat)
    scatter_plot(X, y)
    plt.xlim((x_min, x_max))
    plt.ylim((y_min, y_max))
    plt.colorbar(c)
    plt.title("Decision Surface")
    plt.legend(loc="best")

    plt.subplot(1, 2, 2)
    weights = machine.get_w()
    m_0, c_0 = get_parameters(weights[:3])
    m_1, c_1 = get_parameters(weights[3:])
    yy_0 = m_0 * xx + c_0
    yy_1 = m_1 * xx + c_1
    plt.plot(xx, yy_0, "r--", label="Boundary for class 0")
    plt.plot(xx, yy_1, "g-", label="Boundary for class 1")
    plt.title("Hyper-planes for the different classes")
    plt.legend(loc="best")
    plt.xlim((x_min, x_max))
    plt.ylim((y_min, y_max))
    plt.show()

plot_decision_plane(sgd, "Model *without* Threshold Tuning", X, Y, False)
plot_decision_plane(sgd_with_bias, "Model *with* Threshold Tuning", X, Y, True)
As we can see from the above decision-surface plots, the black region corresponds to the negative class (label $= 0$), whereas the red region corresponds to the positive class (label $= 1$). In addition, there are some (very small) white and orange regions. The white region is classified to neither label, whereas the orange region is classified to both labels. These regions exist because the boundaries of the two classes do not overlap exactly (as illustrated above): there are some points where both compatibility functions are positive, $f(x, 0) > 0$ and $f(x, 1) > 0$ (both labels predicted), and some points where both are negative, $f(x, 0) < 0$ and $f(x, 1) < 0$ (no label predicted).
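The effect can be reproduced with two hypothetical linear scoring functions whose zero-level boundaries do not coincide; the weights below are made up for illustration, not learned:

```python
def f0(x):
    # class-0 score; boundary is the vertical line x0 = 1
    return 1.0 - x[0]

def f1(x):
    # class-1 score; boundary is the horizontal line x1 = 1
    return x[1] - 1.0

def predicted_set(x):
    """Collect the classes with positive score, as in h(x)."""
    return {k for k, f in enumerate((f0, f1)) if f(x) > 0}

for point in ([0, 0], [2, 2], [0, 2], [2, 0]):
    print(point, predicted_set(point))
# [0, 0] -> {0}      only class 0
# [2, 2] -> {1}      only class 1
# [0, 2] -> {0, 1}   both labels predicted ("orange" region)
# [2, 0] -> set()    no label predicted ("white" region)
```

Because the two boundaries cross, the plane is split into four regions, matching the four colours in the decision surface above.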
def load_data(file_name):
    input_file = open(file_name)
    lines = input_file.readlines()
    input_file.close()
    n_samples = len(lines)
    n_features = len(lines[0].split()) - 1
    Y = []
    X = []
    for line in lines:
        data = line.split()
        # the labels are a comma-separated list at the start of each line;
        # materialize the map into a list so it can be iterated repeatedly
        Y.append(list(map(int, data[0].split(","))))
        feats = []
        for feat in data[1:]:
            feats.append(float(feat.split(":")[1]))
        X.append(feats)
    X = np.array(X)
    n_classes = max(max(label) for label in Y) + 1
    return X, Y, n_samples, n_features, n_classes
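load_data expects LIBSVM-style multi-label lines: a comma-separated label list, then space-separated index:value feature pairs. A quick sketch of the per-line parsing on a hypothetical line (the actual label and feature values in the yeast/scene files will differ):

```python
# One hypothetical line in the multi-label LIBSVM format parsed above
line = "1,3 1:0.25 2:0.50 3:0.75"

labels_part, feats_part = line.split(" ", 1)
labels = [int(l) for l in labels_part.split(",")]
feats = [float(tok.split(":")[1]) for tok in feats_part.split()]

print(labels)  # [1, 3]
print(feats)   # [0.25, 0.5, 0.75]
```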
def test_multilabel_data(train_file, test_file):
    X_train, Y_train, n_samples, n_features, n_classes = load_data(train_file)
    X_test, Y_test, n_samples, n_features, n_classes = load_data(test_file)

    # create features and labels
    multilabel_feats_0 = create_features(X_train, 0)
    multilabel_feats_1 = create_features(X_train, 1)
    multilabel_labels = create_labels(Y_train, n_classes)

    # create multi-label models
    multilabel_model = MultilabelModel(multilabel_feats_0, multilabel_labels)
    multilabel_model_with_bias = MultilabelModel(multilabel_feats_1, multilabel_labels)

    # initialize machines for SO-learning
    multilabel_sgd = StochasticSOSVM(multilabel_model, multilabel_labels)
    multilabel_sgd_with_bias = StochasticSOSVM(multilabel_model_with_bias, multilabel_labels)

    start = time()
    multilabel_sgd.train()
    t1 = time() - start

    start = time()
    multilabel_sgd_with_bias.train()
    t2 = time() - start

    return (evaluate_machine(multilabel_sgd,
                             X_test, Y_test,
                             n_classes, False), t1,
            evaluate_machine(multilabel_sgd_with_bias,
                             X_test, Y_test,
                             n_classes, True), t2)
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.metrics import jaccard_similarity_score
from sklearn.preprocessing import LabelBinarizer

def sklearn_implementation(train_file, test_file):
    label_binarizer = LabelBinarizer()
    X_train, Y_train, n_samples, n_features, n_classes = load_data(train_file)
    X_test, Y_test, n_samples, n_features, n_classes = load_data(test_file)
    clf = OneVsRestClassifier(SVC(kernel='linear'))
    start = time()
    clf.fit(X_train, label_binarizer.fit_transform(Y_train))
    t1 = time() - start
    # use transform (not fit_transform) so the test labels are binarized
    # with the classes seen during training
    return (jaccard_similarity_score(label_binarizer.transform(Y_test),
                                     clf.predict(X_test)), t1)
def print_table(train_file, test_file, caption):
    acc_0, t1, acc_1, t2 = test_multilabel_data(train_file, test_file)
    sk_acc, sk_t1 = sklearn_implementation(train_file, test_file)
    result = '''
\t\t%s
Machine\t\t\t\tAccuracy\tTrain-time\n
SGD *without* threshold tuning \t%f \t%f
SGD *with* threshold tuning \t%f \t%f
scikit-learn's implementation \t%f \t%f
''' % (caption, acc_0, t1, acc_1, t2, sk_acc, sk_t1)
    print(result)
print_table("../../../data/multilabel/yeast_train.svm",
"../../../data/multilabel/yeast_test.svm",
"Yeast dataset")
print_table("../../../data/multilabel/scene_train",
"../../../data/multilabel/scene_test",
"Scene dataset")
As we can see, the accuracy of the machine with threshold tuning is comparable to that of scikit-learn's implementation. A possible explanation: for multi-label classification with scikit-learn, we used the OneVsRestClassifier
strategy. This strategy fits one classifier per class and also supports multi-label classification. It is initialized with an estimator, e.g. in our case:
clf = OneVsRestClassifier(SVC(kernel='linear'))
Here the estimator is SVC(kernel="linear"),
a support vector machine for classification with a linear kernel. So, OneVsRestClassifier
trains a number of estimators (one for each class), and each SVC
estimator learns the weights ($w$) as well as the threshold/bias ($b$).
In the Shogun implementation, the structured machines learn only the weights ($w$); there is no threshold or bias. So, to model the threshold, we have to add a constant entry to the joint feature vector.
Thus the machines with the constant entry achieve the same accuracy as the scikit-learn implementation.
[1] C. Lampert. Maximum Margin Multi-Label Structured Prediction, NIPS 2011