Support Vector Machines

Machine Learning Classification

The math behind the Support Vector Machines algorithm.

Andrea Bonvini https://github.com/andreabonvini
07-20-2021

Introduction & Brief History

In this blog post we’re going to talk about one of the most powerful and fascinating techniques in Machine Learning: Support Vector Machines.

In the field of Statistical Learning, the Support Vector Machine is a binary classification algorithm that aims to find the hyperplane separating the data with the largest possible margin. The concept of margin is illustrated in the following images.

Suppose we have a set of points in \(\mathbb{R}^2\), each belonging to a class \(\in\{-1,+1\}\).

We want to find the best hyperplane (in this case a line) which is able to correctly separate the data.

We identify this hyperplane by maximizing the margin, i.e. the distance from the hyperplane to the closest points of both classes; we call these points support vectors.

In this case we identified two support vectors; they are called this way because they support the dashed lines, which represent the sets of points equidistant from the separating hyperplane.

The margins from the support vectors to the hyperplane are drawn in red.

Before diving into the theory of the algorithm let’s have a look at the history behind it.

The birth of SVMs dates back to \(1963\) in Russia, when Vladimir Vapnik and Aleksandr Lerner introduced the Generalized Portrait algorithm.

After almost \(30\) years, at the end of \(1990\), Vapnik moved to the USA and joined Bernhard Boser and Isabelle Guyon at the Adaptive Systems Research Department at AT&T Bell Labs in New Jersey, where the algorithm was refined.

“The invention of SVMs happened when Bernhard decided to implement Vladimir’s algorithm in the three months we had left before we moved to Berkeley. After some initial success of the linear algorithm, Vladimir suggested introducing products of features. I proposed to rather use the kernel trick of the ‘potential function’ algorithm. Vladimir initially resisted the idea because the inventors of the ‘potential functions’ algorithm (Aizerman, Braverman, and Rozonoer) were from a competing team of his institute back in the 1960’s in Russia! But Bernhard tried it anyways, and the SVMs were born!”

Isabelle Guyon

Premise on linear classifiers

For a binary classification problem, one can visualize the operation of a linear classifier as splitting a \(d\)-dimensional input space with a hyperplane, i.e. a \((d-1)\)-dimensional affine subspace: all points on one side of the hyperplane are classified as \(+1\) (or \(-1\)), while the others are classified as \(-1\) (or \(+1\)). In case you doubt the power of linear classifiers, just observe that we can always transform (or enrich) our input space by means of some basis functions: if we “guess” the right transformation, we may be able to correctly classify our samples with a linear classifier.

If, for instance, we have the following non-separable data in 2D space,

there’s nothing stopping us from enriching the input space with some new coordinates which depend on the old features, e.g. by adding a new dimension \(x_3 = \sqrt{x_1^2+x_2^2}\).

This way, in the new 3D input space, we are able to correctly classify the data by means of a 2D plane.
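To make this concrete, here is a minimal Python sketch of that radial enrichment on a made-up toy dataset (the data and variable names below are purely illustrative, not taken from the figures above):

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: class +1 fills a small square around the origin,
# class -1 lies on a ring of radius 3. Not linearly separable in 2D.
X_pos = rng.uniform(-1, 1, size=(50, 2))
angles = rng.uniform(0, 2 * np.pi, size=50)
X_neg = np.c_[3 * np.cos(angles), 3 * np.sin(angles)]
X = np.vstack([X_pos, X_neg])
y = np.r_[np.ones(50), -np.ones(50)]

# Enrich the input space with the radial feature x3 = sqrt(x1^2 + x2^2).
x3 = np.sqrt(X[:, 0] ** 2 + X[:, 1] ** 2)
X_enriched = np.c_[X, x3]

# In the enriched 3D space the plane x3 = 2 separates the two classes perfectly.
print(np.all((x3 < 2) == (y == 1)))  # True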

Derivation

First of all, we should be familiar with the equation of a generic hyperplane in a \(D\)-dimensional space:

\[ \text{hyperplane: }\\ \mathbf{w}^T\mathbf{x}=0\\ \mathbf{w} = [w_0,w_1,\dots,w_D]^T\\ \mathbf{x} = [1,x_1,\dots,x_D]^T \]

If \(D=2\) we have that

\[ \text{hyperplane: }\\ w_0+w_1x_1+w_2x_2=0\\ \mathbf{w} = [w_0,w_1,w_2]^T\\ \mathbf{x} = [1,x_1,x_2]^T \]

Let \(\mathbf{x}_N\) be the nearest data point to the hyperplane \(\mathbf{w}^T\mathbf{x} = 0\). Before finding the distance we just have to make two observations.

First, rescaling \(\mathbf{w}\) does not change the hyperplane, so we can always normalize it such that \(\vert\mathbf{w}^T\mathbf{x}_N\vert=1\).

Second, we pull the bias term \(w_0\) out of \(\mathbf{w}\) and treat it separately:

\[ \mathbf{w} = (w_1,\dots,w_D)\\w_0=b \]

So now our notation changes:

The hyperplane is represented by

\[ \mathbf{w}^T\mathbf{x} +b= 0 \]

and our constraint becomes

\[ |\mathbf{w}^T\mathbf{x}_N+b|=1 \]

It’s trivial to demonstrate that the vector \(\mathbf{w}\) is orthogonal to the hyperplane: just take two points \(\mathbf{x}'\) and \(\mathbf{x}''\) belonging to the hyperplane, so that \(\mathbf{w}^T\mathbf{x}' +b= 0\) and \(\mathbf{w}^T\mathbf{x}'' +b= 0\).

And of course \(\mathbf{w}^T\mathbf{x}'' +b - (\mathbf{w}^T\mathbf{x}' +b)=\mathbf{w}^T(\mathbf{x}''-\mathbf{x}') = 0\)

Since \(\mathbf{x}''-\mathbf{x}'\) is a vector which lies on the hyperplane, we deduce that \(\mathbf{w}\) is orthogonal to the hyperplane.

Then the distance from \(\mathbf{x}_N\) to the hyperplane can be expressed as a dot product between \(\mathbf{x}_N-\mathbf{x}\) (where \(\mathbf{x}\) is any point belonging to the plane) and the unit vector \(\hat{\mathbf{w}}\) where \(\hat{\mathbf{w}} = \frac{\mathbf{w}}{\vert\vert\mathbf{w}\vert\vert}\)

(the distance is just the projection of \(\mathbf{x}_N-\mathbf{x}\) in the direction of \(\hat{\mathbf{w}}\)!)

\[ \text{distance} = |\hat{\mathbf{w}}^T(\mathbf{x}_N-\mathbf{x})| \]

We take the absolute value since we don’t know whether \(\mathbf{w}\) is facing \(\mathbf{x}_N\) or the other direction.

We’ll now try to simplify our notion of distance.

\[ \text{distance} = |\hat{\mathbf{w}}^T(\mathbf{x}_N-\mathbf{x})\;| = \frac{1}{||\mathbf{w}||}|\;\mathbf{w}^T\mathbf{x}_N-\mathbf{w}^T\mathbf{x}| \]

This can be simplified if we add and subtract the missing term \(b\).

\[ \text{distance} = \frac{1}{||\mathbf{w}||}|\;\mathbf{w}^T\mathbf{x}_N+b-\mathbf{w}^T\mathbf{x}-b\;| = \frac{1}{||\mathbf{w}||}|\;\mathbf{w}^T\mathbf{x}_N+b-(\mathbf{w}^T\mathbf{x}+b)\;| \]

Well, \(\mathbf{w}^T\mathbf{x}+b\) is just the value of the equation of the plane… for a point on the plane. So without any doubt \(\mathbf{w}^T\mathbf{x}+b= 0\), and our notion of distance becomes

\[ \text{distance} = \frac{1}{||\mathbf{w}||}|\;\mathbf{w}^T\mathbf{x}_N+b\;| \]

But wait… what is \(\vert\mathbf{w}^T\mathbf{x}_N+b\vert\) ? It is the constraint that we defined at the beginning of our derivation!

\[ \vert\mathbf{w}^T\mathbf{x}_N+b\vert=1 \]

So we end up with the formula for the distance being just

\[ \text{distance} = \frac{1}{\vert\vert\mathbf{w}\vert\vert} \]
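As a quick numerical sanity check of the pre-normalization formula \(\frac{1}{||\mathbf{w}||}|\mathbf{w}^T\mathbf{x}_N+b|\), here is a tiny Python sketch with made-up numbers:

import numpy as np

# Hypothetical hyperplane w^T x + b = 0 in 2D and a nearest point x_N.
w = np.array([3.0, 4.0])
b = -5.0
x_N = np.array([3.0, 1.0])

# Distance according to the derivation: |w^T x_N + b| / ||w||.
dist = abs(w @ x_N + b) / np.linalg.norm(w)
print(dist)  # 1.6

# Cross-check: project x_N onto the hyperplane and measure the gap directly.
x_proj = x_N - ((w @ x_N + b) / (w @ w)) * w
print(np.linalg.norm(x_N - x_proj))  # 1.6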

Let’s now formulate the optimization problem. We have:

\[ \underset{w}{\operatorname{max}}\frac{1}{||\mathbf{w}||}\\\text{subject to}\;\underset{n=1,2,\dots,N}{\operatorname{min}}|\mathbf{w}^T\mathbf{x}_n+b|=1 \]

Since this is not a friendly optimization problem (the constraint involves a minimum and an absolute value, which are annoying), we are going to find an equivalent problem which is easier to solve. Our optimization problem can be rewritten as

\[ \underset{w}{\operatorname{min}}\frac{1}{2}\mathbf{w}^T\mathbf{w} \\ \text{subject to} \ \ \ \ y_n \cdot(\mathbf{w}^T\mathbf{x}_n+b)\ge1 \;\;\;\;\text{for $n = 1,2,\dots,N$} \]

where \(y_n\) is a variable that we introduce, equal to either \(+1\) or \(-1\) according to the real target value of sample \(n\) (remember that this is a supervised learning technique and we know the real target value of each sample). One could argue that the new constraint is actually different from the former one, since maybe the \(\mathbf{w}\) that we’ll find will allow the constraint to be strictly greater than \(1\) for every possible point in our dataset [ \(y_n(\mathbf{w}^T\mathbf{x}_n+b)> 1 \;\;\forall{n}\) ] while we’d like it to be exactly equal to \(1\) for at least one value of \(n\). But that’s actually not true! Since we’re trying to minimize \(\frac{1}{2}\mathbf{w}^T\mathbf{w}\), our algorithm will scale down \(\mathbf{w}\) until \(y_n(\mathbf{w}^T\mathbf{x}_n+b)\) touches \(1\) for some specific point \(n\) of the dataset.
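Here is a tiny numeric sketch of that rescaling argument (the hyperplane and the labelled points below are made up): any correctly separating \((\mathbf{w}, b)\) can be shrunk until the closest point satisfies the constraint with equality, without moving the hyperplane.

import numpy as np

# Hypothetical separating hyperplane and correctly classified labelled points.
w, b = np.array([2.0, -1.0]), 0.5
X = np.array([[2.0, 1.0], [3.0, -1.0], [-1.0, 2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

margins = y * (X @ w + b)        # all strictly greater than 1 here, none equal to 1
scale = 1.0 / margins.min()      # shrink (w, b) until the closest point touches 1
w, b = scale * w, scale * b

print(y * (X @ w + b))           # smallest entry is now exactly 1; hyperplane unchanged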

So how can we solve this? This is a constrained optimization problem with inequality constraints: we have to derive the Lagrangian and apply the KKT (Karush–Kuhn–Tucker) conditions.

Objective Function:

We have to minimize

\[ \mathcal{L}(\mathbf{w},b,\mathbf{\alpha}) = \frac{1}{2}\mathbf{w}^T\mathbf{w}-\sum_{n=1}^{N}\alpha_n(y_n(\mathbf{w}^T\mathbf{x}_n+b)-1)\\ \]

w.r.t. \(\mathbf{w}\) and \(b\), and maximize it w.r.t. the Lagrange multipliers \(\alpha_n\).

We can easily get the two conditions for the unconstrained part:

\[ \nabla_{\mathbf{w}}\mathcal{L}=\mathbf{w}-\sum_{n=1}^{N}\alpha_n y_n\mathbf{x}_n = 0 \;\;\;\;\;\;\;\; \mathbf{w}=\sum_{n=1}^{N}\alpha_n y_n\mathbf{x}_n\\ \frac{\partial\mathcal{L}}{\partial b} = -\sum_{n=1}^{N}\alpha_n y_n = 0\;\;\;\;\;\;\;\;\;\;\;\sum_{n=1}^{N}\alpha_n y_n=0 \]

And list the other KKT conditions:

\[ y_n(\mathbf{w}^T\mathbf{x}_n+b)-1\ge0\;\;\;\;\;\;\forall{n}\\ \alpha_n\ge0\;\;\;\;\;\;\;\forall{n}\\ \alpha_n(y_n(\mathbf{w}^T\mathbf{x}_n+b)-1)=0\;\;\;\;\;\;\forall{n} \]

Alert: the last condition is called the KKT dual complementarity condition and will be key for showing that the SVM has only a small number of “support vectors”; it will also give us our convergence test when we talk about the SMO algorithm.

Now we can reformulate the Lagrangian by applying some substitutions

\[ \mathcal{L}(\mathbf{w},b,\mathbf{\alpha}) = \frac{1}{2}\mathbf{w}^T\mathbf{w}-\sum_{n=1}^{N}\alpha_n(y_n(\mathbf{w}^T\mathbf{x}_n+b)-1)\\ \mathcal{L}(\mathbf{\alpha}) =\sum_{n=1}^{N}\alpha_n-\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}y_n y_m\alpha_n\alpha_m\mathbf{x}_n^T\mathbf{x}_m \]

(If you have doubts just go to minute 36.50 of this excellent lecture by Professor Yaser Abu-Mostafa at Caltech.)
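Spelling the substitution out: expanding the Lagrangian gives

\[ \mathcal{L} = \frac{1}{2}\mathbf{w}^T\mathbf{w}-\sum_{n=1}^{N}\alpha_n y_n\mathbf{w}^T\mathbf{x}_n-b\sum_{n=1}^{N}\alpha_n y_n+\sum_{n=1}^{N}\alpha_n \]

where the \(b\) term vanishes because \(\sum_{n=1}^{N}\alpha_n y_n=0\), and plugging \(\mathbf{w}=\sum_{n=1}^{N}\alpha_n y_n\mathbf{x}_n\) into the first two terms turns both of them into \(\sum_{n=1}^{N}\sum_{m=1}^{N}y_n y_m\alpha_n\alpha_m\mathbf{x}_n^T\mathbf{x}_m\) (with coefficients \(\frac{1}{2}\) and \(-1\) respectively), leaving exactly the expression above.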

We end up with the dual formulation of the problem

\[ \underset{\alpha}{\operatorname{argmax}}\sum_{n=1}^{N}\alpha_n-\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}y_n y_m\alpha_n\alpha_m\mathbf{x}_n^T\mathbf{x}_m\\ \;\\ s.t. \;\;\;\;\;\;\;\;\alpha_n\ge0\;\;\;\forall{n}\\ \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\sum_{n=1}^{N}\alpha_n y_n=0 \]

We can notice that the old constraint \(\mathbf{w}=\sum_{n=1}^{N}\alpha_n y_n\mathbf{x}_n\) doesn’t appear in the new formulation: it is not a constraint on \(\alpha\), it was a constraint on \(\mathbf{w}\), which is not part of our formulation anymore.

How do we find the solution? We hand this objective (which, by the way, happens to be a convex quadratic program) to a quadratic programming package.
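For concreteness, here is a minimal sketch of how one might hand this dual to an off-the-shelf QP solver in Python, using the cvxopt package (the function name and the setup are mine, not from the original derivation; any QP library would do):

import numpy as np
from cvxopt import matrix, solvers

def solve_hard_margin_dual(X, y):
    """Solve the hard-margin SVM dual with a generic QP solver.

    cvxopt solves:  min 1/2 a^T P a + q^T a  s.t.  G a <= h,  A a = b,
    so we encode:   P = (y y^T) * (X X^T),  q = -1  (flip max into min),
                    G = -I, h = 0           (alpha_n >= 0),
                    A = y^T, b = 0          (sum_n alpha_n y_n = 0).
    """
    N = X.shape[0]
    K = X @ X.T                                    # Gram matrix x_n^T x_m
    P = matrix((np.outer(y, y) * K).astype(float))
    q = matrix(-np.ones(N))
    G = matrix(-np.eye(N))
    h = matrix(np.zeros(N))
    A = matrix(y.reshape(1, -1).astype(float))
    b = matrix(0.0)
    solvers.options["show_progress"] = False
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol["x"])                      # the alphas, one per training sample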

Once the quadratic programming package gives us back the solution, we find out that a whole bunch of the \(\alpha\)’s are just \(0\)! All the \(\alpha\)’s which are not \(0\) are the ones associated with the so-called support vectors! (which are just samples from our dataset)
They are called support vectors because they are the vectors that determine the width of the margin; this can be seen by observing the last KKT condition

\[ \big\{\alpha_n(y_n(\mathbf{w}^T\mathbf{x}_n+b)-1)=0\;\;\;\forall{n}\big\} \]

In fact, either the constraint is active, and hence the point is a support vector, or its multiplier \(\alpha_n\) is zero.

Now that we solved the problem we can get both \(\mathbf{w}\) and \(b\).

\[ \mathbf{w} = \sum_{\mathbf{x}_n \in \text{ SV}}\alpha_ny_n\mathbf{x}_n\\ y_n(\mathbf{w}^T\mathbf{x}_{n\in\text{SV}}+b)=1 \]

where \(\mathbf{x}_{n\in\text{SV}}\) is any support vector. (you’d find the same \(b\) for every support vector)
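Continuing the hypothetical sketch above, once the QP solver hands back the \(\alpha\)’s we can read off the support vectors and recover \(\mathbf{w}\) and \(b\) exactly as in the two equations above:

import numpy as np

def recover_w_b(X, y, alphas, tol=1e-6):
    """Recover the primal solution (w, b) from the dual alphas (hard-margin case)."""
    sv = alphas > tol                                   # support vectors: alpha_n > 0
    w = ((alphas[sv] * y[sv])[:, None] * X[sv]).sum(axis=0)
    # Any support vector satisfies y_n (w^T x_n + b) = 1, hence b = y_n - w^T x_n.
    b = y[sv][0] - X[sv][0] @ w
    return w, b, sv

In practice, with finite numerical precision, \(b\) is often averaged over all support vectors rather than read off a single one.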

But the coolest thing about SVMs is that we can rewrite our objective function from

\[ \mathcal{L}(\mathbf{\alpha}) =\sum_{n=1}^{N}\alpha_n-\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}y_n y_m\alpha_n\alpha_m\mathbf{x}_n^T\mathbf{x}_m \]

to

\[ \mathcal{L}(\mathbf{\alpha}) =\sum_{n=1}^{N}\alpha_n-\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}y_n y_m\alpha_n\alpha_mk(\mathbf{x}_n,\mathbf{x}_m) \]

We can use kernels! (If you don’t know what I’m talking about, just check this one.)

Finally we end up with the following equation for classifying new points:

\[ \hat{y}(\mathbf{x}) = \operatorname{sign}\left(\sum_{n=1}^{N}\alpha_n y_n k(\mathbf{x},\mathbf{x}_n)+b\right) \]
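As a minimal sketch, the resulting decision function is easy to write down once a kernel is chosen; here we assume a Gaussian (RBF) kernel, with \(\gamma\) a made-up parameter:

import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def predict(x, X_sv, y_sv, alpha_sv, b, kernel=rbf_kernel):
    """Classify a new point x using only the support vectors (alpha_n = 0 elsewhere)."""
    s = sum(a * t * kernel(x, x_n) for a, t, x_n in zip(alpha_sv, y_sv, X_sv))
    return np.sign(s + b)

Note that the sum runs only over the support vectors, since every other \(\alpha_n\) is zero.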

Soft-margin Formulation

The method described so far is called hard-margin SVM, since the margin constraint has to be satisfied strictly. It can happen that the points are not linearly separable in any way, or that we simply want to handle noisy data to avoid overfitting; so now we’re going to briefly define another version, called soft-margin SVM, which allows for a few errors and penalizes them.

We introduce slack variables \(\xi_n\): this way we allow the margin constraint to be violated, but we add a penalty expressed by how much each sample violates the margin (samples that satisfy the margin constraint have \(\xi_n=0\)).

We now have to

\[ \text{Minimize}\ \ ||\mathbf{w}||_2^2+C\sum_n \xi_n \\ \text{s.t.}\\ \ y_n(\mathbf{w}^T\mathbf{x}_n+b)\ge1-\xi_n\ ,\ \ \ \forall{n}\\ \xi_n\ge0\ ,\ \ \ \forall{n} \]

\(C\) is a coefficient that allows us to trade off bias and variance, and it is chosen by cross-validation.

And obtain the Dual Representation

\[ \text{Maximize}\ \ \ \mathcal{L}(\mathbf{\alpha}) =\sum_{n=1}^{N}\alpha_n-\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}y_n y_m\alpha_n\alpha_mk(\mathbf{x}_n,\mathbf{x}_m)\\ \text{s.t.}\\ 0\le\alpha_n\le C\ \ \ \ \ \forall{n}\\ \sum_{n=1}^N\alpha_n y_n = 0 \]

If \(\alpha_n=0\) the point \(\mathbf{x}_n\) is just correctly classified.

If \(0<\alpha_n<C\) the point lies on the margin. These are indeed support vectors.

If \(\alpha_n = C\) the point lies inside the margin, and it can be either correctly classified (\(\xi_n \le 1\)) or misclassified (\(\xi_n>1\)).
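Going back to the earlier hypothetical QP sketch, the only change the soft-margin dual requires is the box constraint \(0\le\alpha_n\le C\), i.e. different \(G\) and \(h\) matrices:

import numpy as np
from cvxopt import matrix

def soft_margin_constraints(N, C):
    """Inequality constraints for the soft-margin dual: 0 <= alpha_n <= C."""
    G = matrix(np.vstack([-np.eye(N), np.eye(N)]))    # -alpha_n <= 0  and  alpha_n <= C
    h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
    return G, h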

Fun fact: when \(C\) is large, larger slacks penalize the objective function of the SVM more than when \(C\) is small. As \(C\) approaches infinity, having any slack variable set to a non-zero value would incur an infinite penalty. Consequently, as \(C\) approaches infinity, all slack variables are set to \(0\) and we end up with a hard-margin SVM classifier.

Error bounds

And what about generalization? Can we compute an Error bound in order to see if our model is overfitting?

As Vapnik said:

“In the support-vectors learning algorithm the complexity of the construction does not depend on the dimensionality of the feature space, but on the number of support vectors.”

It’s reasonable to define an upper bound on the error as:

\[ L_h\le\frac{\mathbb{E}[\text{number of support vectors}]}{N} \]

where \(N\) is the total number of samples in the dataset. The good thing is that this bound can be easily computed and we don’t need to run the SVM multiple times.
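Continuing the hypothetical sketch from before, computing this estimate is a one-liner once we have the \(\alpha\)’s:

import numpy as np

def loo_error_bound(alphas, tol=1e-6):
    """Fraction of training samples that are support vectors: a cheap estimate
    of the bound above, with no need to retrain the SVM N times."""
    alphas = np.asarray(alphas)
    return (alphas > tol).sum() / len(alphas)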

Citation

For attribution, please cite this work as

Bonvini (2021, July 20). Last Week's Potatoes: Support Vector Machines. Retrieved from https://lastweekspotatoes.com/posts/2021-07-20-support-vector-machines/

BibTeX citation

@misc{bonvini2021support,
  author = {Bonvini, Andrea},
  title = {Last Week's Potatoes: Support Vector Machines},
  url = {https://lastweekspotatoes.com/posts/2021-07-20-support-vector-machines/},
  year = {2021}
}