The Kernel Trick.

Machine Learning

What is the kernel trick? What’s the main advantage of this technique?

Andrea Bonvini https://github.com/andreabonvini
05-18-2021

Traditionally, the theory and algorithms of machine learning and statistics have been well developed for the linear case. Real-world data analysis problems, on the other hand, often require nonlinear methods to detect the kinds of dependencies that allow successful prediction of the properties of interest. By using a positive definite kernel, one can sometimes get the best of both worlds. The kernel corresponds to a dot product in a (usually high-dimensional, possibly infinite-dimensional) feature space. In this space, our estimation methods are linear, and as long as we can formulate everything in terms of kernel evaluations, we never have to explicitly compute anything in the high-dimensional feature space! (this is called the Kernel Trick)

Suppose we have a mapping \(\varphi : \mathbb{R}^d \to \mathbb{R}^m\) that maps our vectors into some feature space \(\mathbb{R}^m\). Then the dot product of \(\textbf{x}\) and \(\textbf{y}\) in this space is \(\varphi (\textbf{x})^T\varphi (\textbf{y})\).

A kernel is a function \(k\) that corresponds to this dot product, i.e. \(k(\textbf{x},\textbf{y})=\varphi(\textbf{x})^T\varphi(\textbf{y})\).

Why is this useful? Kernels give us a way to compute dot products in some feature space without even knowing what this space is or what \(\varphi\) looks like.

For example, consider a simple polynomial kernel \(k(\textbf{x},\textbf{y})=(1+\textbf{x}^T\textbf{y})^2\) with \(\textbf{x},\textbf{y} \in \mathbb{R}^2\).

This doesn’t seem to correspond to any mapping function \(\varphi\): it’s just a function that returns a real number. Assuming that \(\textbf{x} = (x_1,x_2)\) and \(\textbf{y} = (y_1,y_2)\), let’s expand this expression:

\[ k(\textbf{x},\textbf{y})=(1+\textbf{x}^T\textbf{y})^2 = (1+x_1y_1 + x_2y_2)^2=\\1+x_1^2y_1^2+x_2^2y_2^2+2x_1y_1+2x_2y_2+2x_1x_2y_1y_2 \]

Note that this is nothing else but a dot product between two vectors:

\[\varphi(\mathbf x) = \varphi(x_1, x_2) = (1, x_1^2, x_2^2, \sqrt{2} x_1, \sqrt{2} x_2, \sqrt{2} x_1 x_2)\]

and

\[\varphi(\mathbf y) = \varphi(y_1, y_2) = (1, y_1^2, y_2^2, \sqrt{2} y_1, \sqrt{2} y_2, \sqrt{2} y_1 y_2)\]

So the kernel \(k(\mathbf x, \mathbf y) = (1 + \mathbf x^T \mathbf y)^2 = \varphi(\mathbf x)^T \varphi(\mathbf y)\) computes a dot product in a 6-dimensional space without explicitly visiting this space.
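
As a quick numerical sanity check (a minimal sketch in NumPy; the function names are purely illustrative), we can verify that the kernel evaluation in \(\mathbb{R}^2\) and the explicit dot product in \(\mathbb{R}^6\) coincide:

```python
import numpy as np

def poly_kernel(x, y):
    # k(x, y) = (1 + x^T y)^2, the polynomial kernel from the text
    return (1.0 + x @ y) ** 2

def phi(x):
    # Explicit 6-dimensional feature map corresponding to the kernel above
    x1, x2 = x
    return np.array([1.0, x1**2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2])

x = np.array([0.3, -1.2])
y = np.array([2.0, 0.5])

# Both quantities match up to floating-point error
print(poly_kernel(x, y))   # kernel evaluation in R^2
print(phi(x) @ phi(y))     # explicit dot product in R^6
```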

Another example is the Gaussian kernel \(k(\mathbf x, \mathbf y) = \exp\big(- \gamma \, \|\mathbf x - \mathbf y\|^2 \big)\). If we Taylor-expand this function, we see that it corresponds to a feature map \(\varphi\) with an infinite-dimensional codomain.
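
To sketch why, write \(\|\mathbf x - \mathbf y\|^2 = \|\mathbf x\|^2 + \|\mathbf y\|^2 - 2\,\mathbf x^T \mathbf y\), so that

\[ k(\mathbf x, \mathbf y) = e^{-\gamma\|\mathbf x\|^2}\,e^{-\gamma\|\mathbf y\|^2}\,e^{2\gamma\,\mathbf x^T \mathbf y} = e^{-\gamma\|\mathbf x\|^2}\,e^{-\gamma\|\mathbf y\|^2}\sum_{j=0}^{\infty}\frac{(2\gamma)^j}{j!}\left(\mathbf x^T \mathbf y\right)^j \]

Each term \(\left(\mathbf x^T \mathbf y\right)^j\) is a polynomial kernel of degree \(j\), so the series corresponds to a feature map with infinitely many components (the factors \(e^{-\gamma\|\mathbf x\|^2}\) and \(e^{-\gamma\|\mathbf y\|^2}\) can be absorbed into the feature map).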

Instead, the simplest kernel is the linear kernel, which corresponds to the identity mapping \(\varphi(\mathbf{x}) = \mathbf{x}\): \(k(\mathbf{x},\mathbf{x'}) = \varphi(\mathbf{x})^T\varphi(\mathbf{x'}) = \mathbf{x}^T\mathbf{x'}\)

Moreover, the kernel is a symmetric function of its arguments: \(k(\mathbf{x},\mathbf{x'}) = k(\mathbf{x'},\mathbf{x})\)

Many linear models for regression and classification can be reformulated in terms of a dual representation in which the kernel function arises naturally! For example, if we consider a linear ridge regression model, we know that we obtain the best parameters by minimizing the regularized sum-of-squares error function (ridge):

\[ L_{\mathbf{w}} = \frac{1}{2}\sum_{n=1}^{N}(\mathbf{w}^T\varphi(\mathbf{x_n})-t_n)^2+\frac{\lambda}{2}\mathbf{w}^T\mathbf{w}=\\ \frac{1}{2}\left(\mathbf{\Phi\mathbf{w}}-\mathbf{t}\right)^T\left(\mathbf{\Phi\mathbf{w}}-\mathbf{t}\right)+\frac{\lambda}{2}\mathbf{w}^T\mathbf{w} \]

Where \(\Phi\) is the design matrix whose \(n^{th}\) row is \(\varphi(\mathbf{x}_n)^T\) (remember that in \(L_{\mathbf{w}}\) all the vectors are column vectors) and \(\mathbf{t} = (t_1,...,t_N)^T\) is the target vector.

Setting the gradient of \(L_{\mathbf{w}}\) w.r.t. \(\mathbf{w}\) equal to \(0\) we obtain the following:

\[ \frac{\partial L_\mathbf{w}}{\partial \mathbf{w}}=0\\ \frac{\partial \left(\frac{1}{2}\left(\mathbf{\Phi\mathbf{w}}-\mathbf{t}\right)^T\left(\mathbf{\Phi\mathbf{w}}-\mathbf{t}\right)+\frac{\lambda}{2}\mathbf{w}^T\mathbf{w}\right)}{\partial \mathbf{w}}=0\\ \mathbf{\Phi}^T\left(\mathbf{\Phi\mathbf{w}}-\mathbf{t}\right)+\lambda\mathbf{w} = 0\\ \mathbf{w} = -\frac{1}{\lambda}\Phi^T\left(\mathbf{\Phi\mathbf{w}}-\mathbf{t}\right)\\ \mathbf{w} = \Phi^T\mathbf{a} \]

Where \(\mathbf{a}=-\frac{1}{\lambda}\left(\mathbf{\Phi\mathbf{w}}-\mathbf{t}\right)\) is an \(N\times 1\) vector.

We observe that the coefficients \(a_n\) are functions of \(\mathbf{w}\), so our definition of \(\mathbf{w}\) is a function of \(\mathbf{w}\) itself… which surely looks weird. Just wait for it…

We now define the Gram Matrix \(\mathbf{K} = \Phi \Phi^T\), an \(N \times N\) matrix, with elements:

\[ K_{nm} = \varphi(\mathbf{x_n})^T\varphi(\mathbf{x_m})=k(\mathbf{x}_n,\mathbf{x}_m) \]

So, given \(N\) samples, the Gram Matrix is the matrix of all pairwise inner products:

\[ K = \begin{bmatrix} k(\mathbf{x}_1,\mathbf{x}_1) & \dots & k(\mathbf{x}_1,\mathbf{x}_N) \\ \vdots &\ddots & \vdots\\ k(\mathbf{x}_N,\mathbf{x}_1) & \dots& k(\mathbf{x}_N,\mathbf{x}_N) \end{bmatrix} \]
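
As a minimal sketch (again in NumPy, with an illustrative helper name), the Gram Matrix for a set of samples can be computed directly from any kernel function:

```python
import numpy as np

def gram_matrix(X, kernel):
    # X has shape (N, d); returns the N x N matrix with entries k(x_n, x_m)
    N = X.shape[0]
    K = np.empty((N, N))
    for n in range(N):
        for m in range(N):
            K[n, m] = kernel(X[n], X[m])
    return K

# Three 2-D samples and the polynomial kernel from the earlier example
X = np.array([[0.3, -1.2], [2.0, 0.5], [1.0, 1.0]])
K = gram_matrix(X, lambda x, y: (1.0 + x @ y) ** 2)
print(K)  # 3 x 3 and symmetric, since k(x, x') = k(x', x)
```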

This will come in handy in a few seconds…

If we substitute \(\mathbf{w} = \Phi^T\mathbf{a}\) into \(L_{\mathbf{w}}\) we get

\[ L_{\mathbf{w}} = \frac{1}{2}\left(\mathbf{\Phi\mathbf{w}}-\mathbf{t}\right)^T\left(\mathbf{\Phi\mathbf{w}}-\mathbf{t}\right)+\frac{\lambda}{2}\mathbf{w}^T\mathbf{w} \]

\[ L_{\mathbf{a}} = \frac{1}{2}\left(\mathbf{\Phi\Phi^T\mathbf{a}}-\mathbf{t}\right)^T\left(\mathbf{\Phi\Phi^T\mathbf{a}}-\mathbf{t}\right)+\frac{\lambda}{2}\left(\Phi^T\mathbf{a}\right)^T\left(\Phi^T\mathbf{a}\right) \]

\[ L_{\mathbf{a}} = \frac{1}{2}\mathbf{a}^T\Phi\Phi^T\Phi\Phi^T\mathbf{a}-\mathbf{a}^T\Phi\Phi^T\mathbf{t}+\frac{1}{2}\mathbf{t}^T\mathbf{t}+\frac{\lambda}{2}\mathbf{a}^T\Phi\Phi^T\mathbf{a} \]

Guess what? We can rewrite the loss function in terms of the Gram Matrix!

\[ L_{\mathbf{a}} = \frac{1}{2}\mathbf{a}^TKK\mathbf{a}-\mathbf{a}^TK\mathbf{t}+\frac{1}{2}\mathbf{t}^T\mathbf{t}+\frac{\lambda}{2}\mathbf{a}^TK\mathbf{a} \]

Setting the gradient of \(L_{\mathbf{a}}\) w.r.t. \(\mathbf{a}\) equal to \(0\) and isolating \(\mathbf{a}\) (or, equivalently, combining \(\mathbf{w} = \Phi^T\mathbf{a}\) with \(a_n = -\frac{1}{\lambda}(\mathbf{w}^T\varphi(\mathbf{x}_n)-t_n)\)), we obtain:

\[ \mathbf{a}=(K+\lambda\mathbf{I}_N)^{-1}\mathbf{t} \]

Where \(\mathbf{I}_N\) is the \(N \times N\) identity matrix. Note that \(K\) is \(N\times N\) and \(\mathbf{t}\) is \(N\times 1\), so \(\mathbf{a}\) is \(N \times 1\).
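
To see where this comes from, note that (using the symmetry of \(K\))

\[ \frac{\partial L_\mathbf{a}}{\partial \mathbf{a}} = KK\mathbf{a} - K\mathbf{t} + \lambda K\mathbf{a} = K\left((K+\lambda\mathbf{I}_N)\mathbf{a} - \mathbf{t}\right) = 0 \]

which is satisfied by \((K+\lambda\mathbf{I}_N)\,\mathbf{a} = \mathbf{t}\), i.e. by the expression above.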

So we can make our prediction for a new input \(\mathbf{x}\) by substituting back into our linear regression model:

\[ y(\mathbf{x}) = \mathbf{w}^T\varphi(\mathbf{x}) = (\Phi^T\mathbf{a})^T\varphi(\mathbf{x}) = \mathbf{a}^T\Phi\varphi(\mathbf{x})= \mathbf{k}(\mathbf{x})^T(K+\lambda\mathbf{I}_N)^{-1}\mathbf{t} \]

where \(\mathbf{k}(\mathbf{x})\) is an \(N\)-dimensional column vector with elements \(k_n(\mathbf{x}) = k(\mathbf{x}_n,\mathbf{x})\).

The good thing is that instead of inverting an \(M\times M\) matrix (where \(M\) is the dimensionality of the feature space), we are inverting an \(N\times N\) matrix! Since both training and prediction are now expressed entirely in terms of kernel evaluations, we can work with feature spaces \(\varphi(\mathbf{x})\) of very high, or even infinite, dimensionality without ever computing them explicitly.
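
Putting the pieces together, here is a minimal kernel ridge regression sketch in NumPy, assuming a Gaussian kernel; all names and the toy data are illustrative, not part of the original derivation:

```python
import numpy as np

def gaussian_kernel(x, y, gamma=1.0):
    # k(x, y) = exp(-gamma * ||x - y||^2)
    return np.exp(-gamma * np.sum((x - y) ** 2))

def fit(X, t, lam=0.1, kernel=gaussian_kernel):
    # Dual solution: a = (K + lambda * I_N)^{-1} t
    N = X.shape[0]
    K = np.array([[kernel(xn, xm) for xm in X] for xn in X])
    return np.linalg.solve(K + lam * np.eye(N), t)

def predict(X_train, a, x_new, kernel=gaussian_kernel):
    # y(x) = k(x)^T a, where k_n(x) = k(x_n, x)
    k_x = np.array([kernel(xn, x_new) for xn in X_train])
    return k_x @ a

# Toy 1-D regression problem: t = sin(x) + noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
t = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)

a = fit(X, t, lam=0.1)
print(predict(X, a, np.array([0.5])))  # should be close to sin(0.5) ≈ 0.48
```

Note that the system \((K+\lambda\mathbf{I}_N)\,\mathbf{a} = \mathbf{t}\) is solved with np.linalg.solve rather than by explicitly inverting the matrix, which is mathematically equivalent but numerically preferable.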

But how can we build a valid kernel?

We have mainly two ways to do it: we can construct the kernel function \(k(\mathbf{x},\mathbf{x'})\) directly, making sure that it is valid, i.e. that it corresponds to a scalar product in some (possibly infinite-dimensional) feature space; a necessary and sufficient condition is that the Gram Matrix \(K\) be positive semidefinite for any possible choice of the points \(\{\mathbf{x}_n\}\). Alternatively, new kernels can be constructed from simpler kernels used as building blocks.

Given valid kernels \(k_1(\mathbf{x},\mathbf{x'})\) and \(k_2(\mathbf{x},\mathbf{x'})\), the following new kernels will be valid (a quick numerical check of some of these rules is sketched below):

\[ k(\mathbf{x},\mathbf{x'}) = c\,k_1(\mathbf{x},\mathbf{x'}), \quad c>0\\ k(\mathbf{x},\mathbf{x'}) = f(\mathbf{x})\,k_1(\mathbf{x},\mathbf{x'})\,f(\mathbf{x'}), \quad f \text{ any function}\\ k(\mathbf{x},\mathbf{x'}) = q\left(k_1(\mathbf{x},\mathbf{x'})\right), \quad q \text{ a polynomial with non-negative coefficients}\\ k(\mathbf{x},\mathbf{x'}) = \exp\left(k_1(\mathbf{x},\mathbf{x'})\right)\\ k(\mathbf{x},\mathbf{x'}) = k_1(\mathbf{x},\mathbf{x'}) + k_2(\mathbf{x},\mathbf{x'})\\ k(\mathbf{x},\mathbf{x'}) = k_1(\mathbf{x},\mathbf{x'})\,k_2(\mathbf{x},\mathbf{x'}) \]
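
As a quick numerical illustration (a sketch, not a proof), we can check on a random set of points that the Gram matrices of a sum and a product of two valid kernels remain positive semidefinite:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(20, 3))  # 20 random points in R^3

def gram(kernel, X):
    return np.array([[kernel(x, y) for y in X] for x in X])

k1 = lambda x, y: (1.0 + x @ y) ** 2                   # polynomial kernel
k2 = lambda x, y: np.exp(-0.5 * np.sum((x - y) ** 2))  # Gaussian kernel
k_sum = lambda x, y: k1(x, y) + k2(x, y)               # sum rule
k_prod = lambda x, y: k1(x, y) * k2(x, y)              # product rule

for k in (k1, k2, k_sum, k_prod):
    smallest = np.linalg.eigvalsh(gram(k, X)).min()
    print(smallest >= -1e-9)  # True: eigenvalues are (numerically) non-negative
```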
