Support Vector Machines

Machine Learning Classification

The math behind the Support Vector Machines algorithm.

Andrea Bonvini https://github.com/andreabonvini
07-20-2021

Introduction & Brief History

In this blog post we’re going to talk about one of the most powerful and fascinating techniques in Machine Learning: Support Vector Machines.

In the field of Statistical Learning, the Support Vector Machine is a binary classification algorithm that aims to find the hyperplane separating the data with the largest possible margin. The concept of margin is illustrated in the following images.

Suppose we have a set of points in \(\mathbb{R}^2\), each belonging to a class \(\in\{-1,+1\}\).

We want to find the best hyperplane (in this case a line) which is able to correctly separate the data.

We identify this hyperplane by maximizing the margin, i.e. the distance from the hyperplane to the closest points of both classes; we call these points support vectors.

In this case we identified two support vectors; they are called this way because they support the dashed lines, which represent the sets of points equidistant from the separating hyperplane.

The margins from the support vectors to the hyperplane are drawn in red.

Before diving into the theory of the algorithm let’s have a look at the history behind it.

The birth of SVMs dates back to \(1963\) in Russia, when Vladimir Vapnik and Aleksandr Lerner introduced the Generalized Portrait algorithm.

After almost \(30\) years, at the end of \(1990\), Vapnik moved to the USA and joined Bernhard Boser and Isabelle Guyon at the Adaptive Systems Research Department at AT&T Bell Labs in New Jersey, where the algorithm was refined.

“The invention of SVMs happened when Bernhard decided to implement Vladimir’s algorithm in the three months we had left before we moved to Berkeley. After some initial success of the linear algorithm, Vladimir suggested introducing products of features. I proposed to rather use the kernel trick of the ‘potential function’ algorithm. Vladimir initially resisted the idea because the inventors of the ‘potential functions’ algorithm (Aizerman, Braverman, and Rozonoer) were from a competing team of his institute back in the 1960’s in Russia! But Bernhard tried it anyways, and the SVMs were born!”

Isabelle Guyon

Premise on linear classifiers

For a binary classification problem, one can visualize the operation of a linear classifier as splitting a \(d\)-dimensional input space with a hyperplane, i.e. a \((d-1)\)-dimensional affine subspace: all points on one side of the hyperplane are classified as \(+1\) (or \(-1\)), while the others are classified as \(-1\) (or \(+1\)). In case you doubt the power of linear classifiers, just observe that we can always transform (or enrich) our input space by means of some basis functions: if we “guess” the right transformation, we may be able to correctly classify our samples with a linear classifier.

If, for instance, we have the following non-separable data in 2D space,

there’s nothing stopping us from enriching the input space with some new coordinates which depend on the old features, e.g. by adding a new dimension \(x_3 = \sqrt{x_1^2+x_2^2}\).

This way, in the new 3D input space, we are able to correctly classify the data by means of a 2D plane.
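To make this concrete, here is a minimal Python sketch of that radial enrichment on a made-up toy dataset (the data and variable names below are purely illustrative, not taken from the figures above):

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: class +1 fills a small square around the origin,
# class -1 lies on a ring of radius 3. Not linearly separable in 2D.
X_pos = rng.uniform(-1, 1, size=(50, 2))
angles = rng.uniform(0, 2 * np.pi, size=50)
X_neg = np.c_[3 * np.cos(angles), 3 * np.sin(angles)]
X = np.vstack([X_pos, X_neg])
y = np.r_[np.ones(50), -np.ones(50)]

# Enrich the input space with the radial feature x3 = sqrt(x1^2 + x2^2).
x3 = np.sqrt(X[:, 0] ** 2 + X[:, 1] ** 2)
X_enriched = np.c_[X, x3]

# In the enriched 3D space the plane x3 = 2 separates the two classes perfectly.
print(np.all((x3 < 2) == (y == 1)))  # True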

Derivation

First of all, we should be familiar with the equation of a generic hyperplane in a \(D\)-dimensional space:

\[ \text{hyperplane: }\\ \mathbf{w}^T\mathbf{x}=0\\ \mathbf{w} = [w_0,w_1,\dots,w_D]^T\\ \mathbf{x} = [1,x_1,\dots,x_D]^T \]

If \(D=2\) we have that

\[ \text{hyperplane: }\\ w_0+w_1x_1+w_2x_2=0\\ \mathbf{w} = [w_0,w_1,w_2]^T\\ \mathbf{x} = [1,x_1,x_2]^T \]

Let \(\mathbf{x}_N\) be the nearest data point to the hyperplane \(\mathbf{w}^T\mathbf{x} = 0\). Before finding the distance we just have to make two observations.

First, rescaling \(\mathbf{w}\) does not change the hyperplane, so we can always normalize it such that \(\vert\mathbf{w}^T\mathbf{x}_N\vert=1\).

Second, we pull the bias term \(w_0\) out of \(\mathbf{w}\) and treat it separately:

\[ \mathbf{w} = (w_1,\dots,w_D)\\w_0=b \]

So now our notation changes:

The hyperplane is represented by

\[ \mathbf{w}^T\mathbf{x} +b= 0 \]

and our constraint becomes

\[ |\mathbf{w}^T\mathbf{x}_N+b|=1 \]

It’s trivial to demonstrate that the vector \(\mathbf{w}\) is orthogonal to the hyperplane: just take two points \(\mathbf{x}'\) and \(\mathbf{x}''\) belonging to the hyperplane, so that \(\mathbf{w}^T\mathbf{x}' +b= 0\) and \(\mathbf{w}^T\mathbf{x}'' +b= 0\).

And of course \(\mathbf{w}^T\mathbf{x}'' +b - (\mathbf{w}^T\mathbf{x}' +b)=\mathbf{w}^T(\mathbf{x}''-\mathbf{x}') = 0\)

Since \(\mathbf{x}''-\mathbf{x}'\) is a vector which lies on the hyperplane, we deduce that \(\mathbf{w}\) is orthogonal to the hyperplane.

Then the distance from \(\mathbf{x}_N\) to the hyperplane can be expressed as a dot product between \(\mathbf{x}_N-\mathbf{x}\) (where \(\mathbf{x}\) is any point belonging to the plane) and the unit vector \(\hat{\mathbf{w}}\) where \(\hat{\mathbf{w}} = \frac{\mathbf{w}}{\vert\vert\mathbf{w}\vert\vert}\)

(the distance is just the projection of \(\mathbf{x}_N-\mathbf{x}\) in the direction of \(\hat{\mathbf{w}}\)!)

\[ \text{distance} = |\hat{\mathbf{w}}^T(\mathbf{x}_N-\mathbf{x})| \]

We take the absolute value since we don’t know whether \(\mathbf{w}\) is facing \(\mathbf{x}_N\) or the other direction.

We’ll now try to simplify our notion of distance.

\[ \text{distance} = |\hat{\mathbf{w}}^T(\mathbf{x}_N-\mathbf{x})\;| = \frac{1}{||\mathbf{w}||}|\;\mathbf{w}^T\mathbf{x}_N-\mathbf{w}^T\mathbf{x}| \]

This can be simplified if we add and subtract the missing term \(b\).

\[ \text{distance} = \frac{1}{||\mathbf{w}||}|\;\mathbf{w}^T\mathbf{x}_N+b-\mathbf{w}^T\mathbf{x}-b\;| = \frac{1}{||\mathbf{w}||}|\;\mathbf{w}^T\mathbf{x}_N+b-(\mathbf{w}^T\mathbf{x}+b)\;| \]

Well, \(\mathbf{w}^T\mathbf{x}+b\) is just the value of the equation of the plane… for a point on the plane. So without any doubt \(\mathbf{w}^T\mathbf{x}+b= 0\), and our notion of distance becomes

\[ \text{distance} = \frac{1}{||\mathbf{w}||}|\;\mathbf{w}^T\mathbf{x}_N+b\;| \]

But wait… what is \(\vert\mathbf{w}^T\mathbf{x}_N+b\vert\) ? It is the constraint that we defined at the beginning of our derivation!

\[ \vert\mathbf{w}^T\mathbf{x}_N+b\vert=1 \]

So we end up with the formula for the distance being just

\[ \text{distance} = \frac{1}{\vert\vert\mathbf{w}\vert\vert} \]
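As a quick numerical sanity check of the pre-normalization formula \(\frac{1}{||\mathbf{w}||}|\mathbf{w}^T\mathbf{x}_N+b|\), here is a tiny Python sketch with made-up numbers:

import numpy as np

# Hypothetical hyperplane w^T x + b = 0 in 2D and a nearest point x_N.
w = np.array([3.0, 4.0])
b = -5.0
x_N = np.array([3.0, 1.0])

# Distance according to the derivation: |w^T x_N + b| / ||w||.
dist = abs(w @ x_N + b) / np.linalg.norm(w)
print(dist)  # 1.6

# Cross-check: project x_N onto the hyperplane and measure the gap directly.
x_proj = x_N - ((w @ x_N + b) / (w @ w)) * w
print(np.linalg.norm(x_N - x_proj))  # 1.6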

Let’s now formulate the optimization problem. We have:

\[ \underset{w}{\operatorname{max}}\frac{1}{||\mathbf{w}||}\\\text{subject to}\;\underset{n=1,2,\dots,N}{\operatorname{min}}|\mathbf{w}^T\mathbf{x}_n+b|=1 \]

Since this is not a friendly optimization problem (the constraint involves a minimum and an absolute value, which are annoying), we are going to find an equivalent problem which is easier to solve. Our optimization problem can be rewritten as

\[ \underset{w}{\operatorname{min}}\frac{1}{2}\mathbf{w}^T\mathbf{w} \\ \text{subject to} \ \ \ \ y_n \cdot(\mathbf{w}^T\mathbf{x}_n+b)\ge1 \;\;\;\;\text{for $n = 1,2,\dots,N$} \]

where \(y_n\) is a variable that we introduce, equal to either \(+1\) or \(-1\) according to the real target value of sample \(n\) (remember that this is a supervised learning technique and we know the real target value of each sample). One could argue that the new constraint is actually different from the former one, since maybe the \(\mathbf{w}\) that we’ll find will allow the constraint to be strictly greater than \(1\) for every possible point in our dataset [ \(y_n(\mathbf{w}^T\mathbf{x}_n+b)> 1 \;\;\forall{n}\) ] while we’d like it to be exactly equal to \(1\) for at least one value of \(n\). But that’s actually not true! Since we’re trying to minimize \(\frac{1}{2}\mathbf{w}^T\mathbf{w}\), our algorithm will scale down \(\mathbf{w}\) until \(y_n(\mathbf{w}^T\mathbf{x}_n+b)\) touches \(1\) for some specific point \(n\) of the dataset.
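Here is a tiny numeric sketch of that rescaling argument (the hyperplane and the labelled points below are made up): any correctly separating \((\mathbf{w}, b)\) can be shrunk until the closest point satisfies the constraint with equality, without moving the hyperplane.

import numpy as np

# Hypothetical separating hyperplane and correctly classified labelled points.
w, b = np.array([2.0, -1.0]), 0.5
X = np.array([[2.0, 1.0], [3.0, -1.0], [-1.0, 2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

margins = y * (X @ w + b)        # all strictly greater than 1 here, none equal to 1
scale = 1.0 / margins.min()      # shrink (w, b) until the closest point touches 1
w, b = scale * w, scale * b

print(y * (X @ w + b))           # smallest entry is now exactly 1; hyperplane unchanged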

So how can we solve this? This is a constrained optimization problem with inequality constraints: we have to derive the Lagrangian and apply the KKT (Karush–Kuhn–Tucker) conditions.

Objective Function:

We have to minimize

\[ \mathcal{L}(\mathbf{w},b,\mathbf{\alpha}) = \frac{1}{2}\mathbf{w}^T\mathbf{w}-\sum_{n=1}^{N}\alpha_n(y_n(\mathbf{w}^T\mathbf{x}_n+b)-1)\\ \]

w.r.t. \(\mathbf{w}\) and \(b\), and maximize it w.r.t. the Lagrange multipliers \(\alpha_n\).

We can easily get the two conditions for the unconstrained part:

\[ \nabla_{\mathbf{w}}\mathcal{L}=\mathbf{w}-\sum_{n=1}^{N}\alpha_n y_n\mathbf{x}_n = 0 \;\;\;\;\;\;\;\; \mathbf{w}=\sum_{n=1}^{N}\alpha_n y_n\mathbf{x}_n\\ \frac{\partial\mathcal{L}}{\partial b} = -\sum_{n=1}^{N}\alpha_n y_n = 0\;\;\;\;\;\;\;\;\;\;\;\sum_{n=1}^{N}\alpha_n y_n=0 \]

And list the other KKT conditions:

\[ y_n(\mathbf{w}^T\mathbf{x}_n+b)-1\ge0\;\;\;\;\;\;\forall{n}\\ \alpha_n\ge0\;\;\;\;\;\;\;\forall{n}\\ \alpha_n(y_n(\mathbf{w}^T\mathbf{x}_n+b)-1)=0\;\;\;\;\;\;\forall{n} \]

Alert: the last condition is called the KKT dual complementarity condition and will be key for showing that the SVM has only a small number of “support vectors”; it will also give us our convergence test when we talk about the SMO algorithm.

Now we can reformulate the Lagrangian by applying some substitutions

\[ \mathcal{L}(\mathbf{w},b,\mathbf{\alpha}) = \frac{1}{2}\mathbf{w}^T\mathbf{w}-\sum_{n=1}^{N}\alpha_n(y_n(\mathbf{w}^T\mathbf{x}_n+b)-1)\\ \mathcal{L}(\mathbf{\alpha}) =\sum_{n=1}^{N}\alpha_n-\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}y_n y_m\alpha_n\alpha_m\mathbf{x}_n^T\mathbf{x}_m \]

(If you have doubts just go to minute 36.50 of this excellent lecture by Professor Yaser Abu-Mostafa at Caltech.)
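Spelling the substitution out: expanding the Lagrangian gives

\[ \mathcal{L} = \frac{1}{2}\mathbf{w}^T\mathbf{w}-\sum_{n=1}^{N}\alpha_n y_n\mathbf{w}^T\mathbf{x}_n-b\sum_{n=1}^{N}\alpha_n y_n+\sum_{n=1}^{N}\alpha_n \]

where the \(b\) term vanishes because \(\sum_{n=1}^{N}\alpha_n y_n=0\), and plugging \(\mathbf{w}=\sum_{n=1}^{N}\alpha_n y_n\mathbf{x}_n\) into the first two terms turns both of them into \(\sum_{n=1}^{N}\sum_{m=1}^{N}y_n y_m\alpha_n\alpha_m\mathbf{x}_n^T\mathbf{x}_m\) (with coefficients \(\frac{1}{2}\) and \(-1\) respectively), leaving exactly the expression above.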

We end up with the dual formulation of the problem

\[ \underset{\alpha}{\operatorname{argmax}}\sum_{n=1}^{N}\alpha_n-\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}y_n y_m\alpha_n\alpha_m\mathbf{x}_n^T\mathbf{x}_m\\ \;\\ s.t. \;\;\;\;\;\;\;\;\alpha_n\ge0\;\;\;\forall{n}\\ \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\sum_{n=1}^{N}\alpha_n y_n=0 \]

We can notice that the old constraint \(\mathbf{w}=\sum_{n=1}^{N}\alpha_n y_n\mathbf{x}_n\) doesn’t appear in the new formulation: it is not a constraint on \(\alpha\), it was a constraint on \(\mathbf{w}\), which is not part of our formulation anymore.

How do we find the solution? We hand this objective (which, by the way, happens to be a convex quadratic program) to a quadratic programming package.
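For concreteness, here is a minimal sketch of how one might hand this dual to an off-the-shelf QP solver in Python, using the cvxopt package (the function name and the setup are mine, not from the original derivation; any QP library would do):

import numpy as np
from cvxopt import matrix, solvers

def solve_hard_margin_dual(X, y):
    """Solve the hard-margin SVM dual with a generic QP solver.

    cvxopt solves:  min 1/2 a^T P a + q^T a  s.t.  G a <= h,  A a = b,
    so we encode:   P = (y y^T) * (X X^T),  q = -1  (flip max into min),
                    G = -I, h = 0           (alpha_n >= 0),
                    A = y^T, b = 0          (sum_n alpha_n y_n = 0).
    """
    N = X.shape[0]
    K = X @ X.T                                    # Gram matrix x_n^T x_m
    P = matrix((np.outer(y, y) * K).astype(float))
    q = matrix(-np.ones(N))
    G = matrix(-np.eye(N))
    h = matrix(np.zeros(N))
    A = matrix(y.reshape(1, -1).astype(float))
    b = matrix(0.0)
    solvers.options["show_progress"] = False
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol["x"])                      # the alphas, one per training sample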

Once the quadratic programming package gives us back the solution, we find out that a whole bunch of the \(\alpha\)’s are just \(0\)! All the \(\alpha\)’s which are not \(0\) are the ones associated with the so-called support vectors! (which are just samples from our dataset)
They are called support vectors because they are the vectors that determine the width of the margin; this can be seen by observing the last KKT condition

\[ \big\{\alpha_n(y_n(\mathbf{w}^T\mathbf{x}_n+b)-1)=0\;\;\;\forall{n}\big\} \]

In fact, either the constraint is active, and hence the point is a support vector, or its multiplier \(\alpha_n\) is zero.

Now that we solved the problem we can get both \(\mathbf{w}\) and \(b\).

\[ \mathbf{w} = \sum_{\mathbf{x}_n \in \text{ SV}}\alpha_ny_n\mathbf{x}_n\\ y_n(\mathbf{w}^T\mathbf{x}_{n\in\text{SV}}+b)=1 \]

where \(\mathbf{x}_{n\in\text{SV}}\) is any support vector. (you’d find the same \(b\) for every support vector)
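Continuing the hypothetical sketch above, once the QP solver hands back the \(\alpha\)’s we can read off the support vectors and recover \(\mathbf{w}\) and \(b\) exactly as in the two equations above:

import numpy as np

def recover_w_b(X, y, alphas, tol=1e-6):
    """Recover the primal solution (w, b) from the dual alphas (hard-margin case)."""
    sv = alphas > tol                                   # support vectors: alpha_n > 0
    w = ((alphas[sv] * y[sv])[:, None] * X[sv]).sum(axis=0)
    # Any support vector satisfies y_n (w^T x_n + b) = 1, hence b = y_n - w^T x_n.
    b = y[sv][0] - X[sv][0] @ w
    return w, b, sv

In practice, with finite numerical precision, \(b\) is often averaged over all support vectors rather than read off a single one.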

But the coolest thing about SVMs is that we can rewrite our objective function from

\[ \mathcal{L}(\mathbf{\alpha}) =\sum_{n=1}^{N}\alpha_n-\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}y_n y_m\alpha_n\alpha_m\mathbf{x}_n^T\mathbf{x}_m \]

to

\[ \mathcal{L}(\mathbf{\alpha}) =\sum_{n=1}^{N}\alpha_n-\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}y_n y_m\alpha_n\alpha_mk(\mathbf{x}_n,\mathbf{x}_m) \]

We can use kernels! (If you don’t know what I’m talking about, just check this one.)

Finally we end up with the following equation for classifying new points:

\[ \hat{y}(\mathbf{x}) = \operatorname{sign}\left(\sum_{n=1}^{N}\alpha_n y_n k(\mathbf{x},\mathbf{x}_n)+b\right) \]
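As a minimal sketch, the resulting decision function is easy to write down once a kernel is chosen; here we assume a Gaussian (RBF) kernel, with \(\gamma\) a made-up parameter:

import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def predict(x, X_sv, y_sv, alpha_sv, b, kernel=rbf_kernel):
    """Classify a new point x using only the support vectors (alpha_n = 0 elsewhere)."""
    s = sum(a * t * kernel(x, x_n) for a, t, x_n in zip(alpha_sv, y_sv, X_sv))
    return np.sign(s + b)

Note that the sum runs only over the support vectors, since every other \(\alpha_n\) is zero.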

Soft-margin Formulation

The method described so far is called hard-margin SVM, since the margin constraint has to be satisfied strictly. It can happen that the points are not linearly separable in any way, or that we simply want to handle noisy data to avoid overfitting; so now we’re going to briefly define another version, called soft-margin SVM, which allows for a few errors and penalizes them.

We introduce slack variables \(\xi_n\): this way we allow the margin constraint to be violated, but we add a penalty expressed by how much each sample violates the margin (samples that satisfy the margin constraint have \(\xi_n=0\)).

We now have to

\[ \text{Minimize}\ \ ||\mathbf{w}||_2^2+C\sum_n \xi_n \\ \text{s.t.}\\ \ y_n(\mathbf{w}^T\mathbf{x}_n+b)\ge1-\xi_n\ ,\ \ \ \forall{n}\\ \xi_n\ge0\ ,\ \ \ \forall{n} \]

\(C\) is a coefficient that allows us to trade off bias and variance, and it is chosen by cross-validation.

And obtain the Dual Representation

\[ \text{Maximize}\ \ \ \mathcal{L}(\mathbf{\alpha}) =\sum_{n=1}^{N}\alpha_n-\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}y_n y_m\alpha_n\alpha_mk(\mathbf{x}_n,\mathbf{x}_m)\\ \text{s.t.}\\ 0\le\alpha_n\le C\ \ \ \ \ \forall{n}\\ \sum_{n=1}^N\alpha_n y_n = 0 \]

If \(\alpha_n=0\) the point \(\mathbf{x}_n\) is just correctly classified.

If \(0<\alpha_n<C\) the point lies on the margin. These are indeed support vectors.

If \(\alpha_n = C\) the point lies inside the margin, and it can be either correctly classified (\(\xi_n \le 1\)) or misclassified (\(\xi_n>1\)).
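Going back to the earlier hypothetical QP sketch, the only change the soft-margin dual requires is the box constraint \(0\le\alpha_n\le C\), i.e. different \(G\) and \(h\) matrices:

import numpy as np
from cvxopt import matrix

def soft_margin_constraints(N, C):
    """Inequality constraints for the soft-margin dual: 0 <= alpha_n <= C."""
    G = matrix(np.vstack([-np.eye(N), np.eye(N)]))    # -alpha_n <= 0  and  alpha_n <= C
    h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
    return G, h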

Fun fact: when \(C\) is large, larger slacks penalize the objective function of the SVM more than when \(C\) is small. As \(C\) approaches infinity, having any slack variable set to a non-zero value would incur an infinite penalty. Consequently, as \(C\) approaches infinity, all slack variables are set to \(0\) and we end up with a hard-margin SVM classifier.

Error bounds

And what about generalization? Can we compute an Error bound in order to see if our model is overfitting?

As Vapnik said:

“In the support-vectors learning algorithm the complexity of the construction does not depend on the dimensionality of the feature space, but on the number of support vectors.”

It’s reasonable to define an upper bound on the error as:

\[ L_h\le\frac{\mathbb{E}[\text{number of support vectors}]}{N} \]

where \(N\) is the total number of samples in the dataset. The good thing is that this bound can be easily computed and we don’t need to run the SVM multiple times.
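Continuing the hypothetical sketch from before, computing this estimate is a one-liner once we have the \(\alpha\)’s:

import numpy as np

def loo_error_bound(alphas, tol=1e-6):
    """Fraction of training samples that are support vectors: a cheap estimate
    of the bound above, with no need to retrain the SVM N times."""
    alphas = np.asarray(alphas)
    return (alphas > tol).sum() / len(alphas)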

Citation

For attribution, please cite this work as

Bonvini (2021, July 20). Last Week's Potatoes: Support Vector Machines. Retrieved from https://lastweekspotatoes.com/posts/2021-07-20-support-vector-machines/

BibTeX citation

@misc{bonvini2021support,
  author = {Bonvini, Andrea},
  title = {Last Week's Potatoes: Support Vector Machines},
  url = {https://lastweekspotatoes.com/posts/2021-07-20-support-vector-machines/},
  year = {2021}
}