The math behind the Support Vector Machines algorithm.

In this blog post we're going to talk about one of the most powerful and fascinating techniques in Machine Learning: Support Vector Machines.

In the field of Statistical Learning the *Support Vector Machine* technique is a *binary classification* algorithm which aims to find the hyperplane that separates the data with the largest *margin* possible. The concept of margin is illustrated in the following images.

Suppose we have a set of points in \(\mathbb{R}^2\), where each point belongs to a class \(\in\{-1,+1\}\).

We want to find the best hyperplane (in this case a *line*) which is able to correctly separate the data.

We identify this hyperplane by maximizing the *margin*, i.e. the distance from the *hyperplane* to the closest points of both classes; we call these points *support vectors*.

In this case we identified two *support vectors*; they are so called because they support the dashed lines, which represent the set of points equidistant from the separating hyperplane.

The *margins* from the support vectors to the hyperplane are drawn in red.
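To make the picture concrete, here is a minimal sketch of the idea using scikit-learn's `SVC` (a library implementation, not the derivation this post builds up to). The toy points and the large value of `C`, which approximates a hard-margin classifier, are our own choices for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: two well-separated clusters labelled -1 and +1.
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A linear SVM with a very large C behaves like a hard-margin classifier.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

# The support vectors are the training points closest to the hyperplane;
# they alone determine the solution.
print(clf.support_vectors_)

# The margin (hyperplane-to-support-vector distance) equals 1 / ||w||.
w = clf.coef_[0]
print(1.0 / np.linalg.norm(w))
```

Note that removing any non-support-vector point and refitting leaves the hyperplane unchanged, which is exactly what the name "support" suggests.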

Before diving into the theory of the algorithm let’s have a look at the history behind it.

The birth of SVMs dates back to \(1963\) in Russia, when *Vladimir Vapnik* and *Aleksandr Lerner* introduced the Generalized Portrait algorithm.

After almost \(30\) years, at the end of \(1990\), *Vapnik* moved to the USA and joined *Bernhard Boser* and *Isabelle Guyon* at the Adaptive Systems Research Department at AT&T Bell Labs in New Jersey, where the algorithm was refined.

“The invention of SVMs happened when Bernhard decided to implement Vladimir’s algorithm in the three months we had left before we moved to Berkeley. After some initial success of the linear algorithm, Vladimir suggested introducing products of features. I proposed to rather use the kernel trick of the ‘potential function’ algorithm. Vladimir initially resisted the idea because the inventors of the ‘potential functions’ algorithm (Aizerman, Braverman, and Rozonoer) were from a competing team of his institute back in the 1960’s in Russia! But Bernhard tried it anyways, and the SVMs were born!”

*Isabelle Guyon*

For a *binary* classification problem, one can visualize the operation of a linear classifier as splitting an input space of dimension \(d\) with a *hyperplane* (which, as you'll see in a minute, is a \((d-1)\)-dimensional subspace): all points on one side of the hyperplane are classified as \(+1\) (or \(-1\)), while the others are classified as \(-1\) (or \(+1\)). In case you doubt the power of linear classifiers, just observe that we're always able to transform (or enrich) our input space by means of some *basis functions*; if we "guess" the right transformation, we may be able to correctly classify our samples with a *linear* classifier.
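Both ideas fit in a few lines of NumPy. The decision rule is just the sign of \(w \cdot x + b\), and a basis function can turn non-separable data into separable data; the function names, the circle-shaped toy data, and the threshold below are our own illustrative choices, not anything from the original algorithm.

```python
import numpy as np

# A linear classifier splits the space with a hyperplane:
# predict +1 if w . x + b > 0, else -1.
def linear_classify(w, b, x):
    return 1 if np.dot(w, x) + b > 0 else -1

# Points near the origin (class -1) vs far from it (class +1)
# are not separable by any line in the original 2-D space ...
X = np.array([[0.5, 0.5], [-1.0, 0.3], [3.0, 0.0], [0.0, -3.5]])
y = np.array([-1, -1, 1, 1])

# ... but the basis function phi(x) = ||x||^2 maps them to a 1-D space
# where a simple threshold (a 0-dimensional "hyperplane") separates them.
phi = (X ** 2).sum(axis=1)          # squared distance from the origin
preds = np.where(phi > 4.0, 1, -1)  # classify in the transformed space
print(preds)                        # matches y exactly
```

This is precisely the "enrich the input space" trick: the classifier is still linear, only the features changed.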

If, for instance, we have the following non-separable data in the 2D space