The Kernel Trick.

What is the kernel trick? What’s the main advantage of this technique?


Author: Andrea Bonvini

Published: May 17, 2021



Traditionally, theory and algorithms of machine learning and statistics have been very well developed for the linear case. Real-world data analysis problems, on the other hand, often require nonlinear methods to detect the kind of dependencies that allow successful prediction of properties of interest. By using a positive definite kernel, one can sometimes have the best of both worlds. The kernel corresponds to a dot product in a usually high-dimensional (possibly infinite-dimensional) feature space. In this space our estimation methods are linear, but as long as we can formulate everything in terms of kernel evaluations, we never explicitly have to compute in the high-dimensional feature space! (This is called the Kernel Trick.)

Suppose we have a mapping $\varphi: \mathbb{R}^d \to \mathbb{R}^m$ that brings our vectors in $\mathbb{R}^d$ to some feature space $\mathbb{R}^m$. Then the dot product of $x$ and $y$ in this space is $\varphi(x)^T\varphi(y)$.

A kernel is a function $k$ that corresponds to this dot product, i.e. $k(x, y) = \varphi(x)^T\varphi(y)$.

Why is this useful? Kernels give us a way to compute dot products in some feature space without even knowing what this space is or what $\varphi$ is.

For example, consider a simple polynomial kernel $k(x, y) = (1 + x^Ty)^2$ with $x, y \in \mathbb{R}^2$.

This doesn't seem to correspond to any mapping function $\varphi$; it's just a function that returns a real number. Assuming that $x = (x_1, x_2)$ and $y = (y_1, y_2)$, let's expand this expression:

$$k(x, y) = (1 + x^Ty)^2 = (1 + x_1y_1 + x_2y_2)^2 = 1 + x_1^2y_1^2 + x_2^2y_2^2 + 2x_1y_1 + 2x_2y_2 + 2x_1x_2y_1y_2$$

Note that this is nothing else but a dot product between two vectors:

$$\varphi(x) = \varphi(x_1, x_2) = (1,\ x_1^2,\ x_2^2,\ \sqrt{2}x_1,\ \sqrt{2}x_2,\ \sqrt{2}x_1x_2)$$

and

$$\varphi(y) = \varphi(y_1, y_2) = (1,\ y_1^2,\ y_2^2,\ \sqrt{2}y_1,\ \sqrt{2}y_2,\ \sqrt{2}y_1y_2)$$

So the kernel $k(x, y) = (1 + x^Ty)^2 = \varphi(x)^T\varphi(y)$ computes a dot product in a 6-dimensional space without explicitly visiting this space.
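As a quick sanity check, here is a minimal Python/NumPy sketch (the helper names `poly_kernel` and `phi` are just illustrative): the kernel evaluation in $\mathbb{R}^2$ and the explicit dot product in $\mathbb{R}^6$ return exactly the same number.

```python
import numpy as np

def poly_kernel(x, y):
    """k(x, y) = (1 + x^T y)^2, evaluated directly in R^2."""
    return (1.0 + x @ y) ** 2

def phi(x):
    """Explicit 6-dimensional feature map corresponding to the kernel above."""
    x1, x2 = x
    return np.array([1.0, x1**2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2, np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print(poly_kernel(x, y))   # 4.0 -- computed entirely in R^2
print(phi(x) @ phi(y))     # 4.0 -- the same value via the dot product in R^6
```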

Another example is the Gaussian kernel $k(x, y) = \exp(-\gamma\|x - y\|^2)$. If we Taylor-expand this function, we'll see that it corresponds to an infinite-dimensional codomain of $\varphi$.
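To sketch why: since $\|x - y\|^2 = \|x\|^2 + \|y\|^2 - 2x^Ty$, we can write

$$k(x, y) = \exp(-\gamma\|x\|^2)\exp(-\gamma\|y\|^2)\exp(2\gamma\, x^Ty) = \exp(-\gamma\|x\|^2)\exp(-\gamma\|y\|^2)\sum_{j=0}^{\infty}\frac{(2\gamma)^j}{j!}(x^Ty)^j$$

and each term $(x^Ty)^j$ is itself a polynomial kernel of degree $j$, so the series corresponds to a feature map with infinitely many components.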

At the other extreme, the simplest kernel is the linear kernel, which corresponds to an identity mapping in the feature space: $k(x, x') = \varphi(x)^T\varphi(x') = x^Tx'$

Moreover, the kernel is a symmetric function of its arguments: $k(x, x') = k(x', x)$

Many linear models for regression and classification can be reformulated in terms of a dual representation in which the kernel function arises naturally! For example, if we consider a linear ridge regression model, we know that we obtain the best parameters by minimizing the regularized sum-of-squares error function:

$$L_w = \frac{1}{2}\sum_{n=1}^{N}\left(w^T\varphi(x_n) - t_n\right)^2 + \frac{\lambda}{2}w^Tw = \frac{1}{2}(\Phi w - t)^T(\Phi w - t) + \frac{\lambda}{2}w^Tw$$

Where $\Phi$ is the design matrix whose $n$-th row is $\varphi(x_n)^T$ (remember that in $L_w$ all the vectors are column vectors) and $t = (t_1, \ldots, t_N)^T$ is the target vector.

Setting the gradient of $L_w$ w.r.t. $w$ equal to $0$ we obtain the following:

$$\frac{\partial L_w}{\partial w} = 0 \;\Rightarrow\; \frac{\partial}{\partial w}\left(\frac{1}{2}(\Phi w - t)^T(\Phi w - t) + \frac{\lambda}{2}w^Tw\right) = 0 \;\Rightarrow\; \Phi^T(\Phi w - t) + \lambda w = 0 \;\Rightarrow\; w = -\frac{1}{\lambda}\Phi^T(\Phi w - t) \;\Rightarrow\; w = \Phi^Ta$$

Where $a = -\frac{1}{\lambda}(\Phi w - t)$ is an $N \times 1$ vector.

We observe that the coefficients $a_n$ are functions of $w$. So our definition of $w$ is a function of $w$ itself… which surely looks weird, just wait for it…

We now define the Gram matrix $K = \Phi\Phi^T$, an $N \times N$ matrix, with elements:

$$K_{nm} = \varphi(x_n)^T\varphi(x_m) = k(x_n, x_m)$$

So, given $N$ samples, the Gram matrix is the matrix of all inner products:

$$K = \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_N)\\ \vdots & \ddots & \vdots \\ k(x_N, x_1) & \cdots & k(x_N, x_N)\end{bmatrix}$$

This will come in handy in a few seconds…
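In code, building the Gram matrix is a couple of loops; here is a minimal NumPy sketch (`gram_matrix` and `gaussian_kernel` are just illustrative helper names):

```python
import numpy as np

def gaussian_kernel(x, y, gamma=0.5):
    """k(x, y) = exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def gram_matrix(X, kernel):
    """Return the N x N matrix K with K[n, m] = k(x_n, x_m)."""
    N = X.shape[0]
    K = np.empty((N, N))
    for n in range(N):
        for m in range(N):
            K[n, m] = kernel(X[n], X[m])
    return K

X = np.random.default_rng(0).standard_normal((5, 2))  # 5 samples in R^2
K = gram_matrix(X, gaussian_kernel)

print(K.shape)              # (5, 5)
print(np.allclose(K, K.T))  # True: the Gram matrix is symmetric
```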

If we substitute $w = \Phi^Ta$ into $L_w$ we get

$$L_w = \frac{1}{2}(\Phi w - t)^T(\Phi w - t) + \frac{\lambda}{2}w^Tw$$

$$L_w = \frac{1}{2}(\Phi\Phi^Ta - t)^T(\Phi\Phi^Ta - t) + \frac{\lambda}{2}(\Phi^Ta)^T(\Phi^Ta)$$

$$L_a = \frac{1}{2}a^T\Phi\Phi^T\Phi\Phi^Ta - a^T\Phi\Phi^Tt + \frac{1}{2}t^Tt + \frac{\lambda}{2}a^T\Phi\Phi^Ta$$

Guess what? We can rewrite the loss function in terms of the Gram matrix!

$$L_a = \frac{1}{2}a^TKKa - a^TKt + \frac{1}{2}t^Tt + \frac{\lambda}{2}a^TKa$$
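A tiny NumPy check that the two forms of the loss agree (the variable names are just illustrative and the data is random):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 8, 3
Phi = rng.standard_normal((N, M))   # design matrix, rows are phi(x_n)^T
t = rng.standard_normal(N)          # targets
a = rng.standard_normal(N)          # arbitrary dual coefficients
lam = 0.1
w = Phi.T @ a                       # w = Phi^T a

# Loss written through the weights w ...
L_w = 0.5 * np.sum((Phi @ w - t) ** 2) + 0.5 * lam * w @ w

# ... and the same loss written only through the Gram matrix K = Phi Phi^T
K = Phi @ Phi.T
L_a = 0.5 * a @ K @ K @ a - a @ K @ t + 0.5 * t @ t + 0.5 * lam * a @ K @ a

print(np.allclose(L_w, L_a))        # True
```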

By combining $w = \Phi^Ta$ and $a_n = -\frac{1}{\lambda}(w^T\varphi(x_n) - t_n)$, setting the gradient w.r.t. $a$ equal to $0$ and isolating $a$ we obtain:

$$a = (K + \lambda I_N)^{-1}t$$

Where $I_N$ is the identity matrix of dimension $N$. Note that $K$ is $N \times N$ and $t$ is $N \times 1$, so $a$ is $N \times 1$.
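To double-check the algebra, the sketch below verifies numerically (on random data) that $w = \Phi^Ta$ with $a = (K + \lambda I_N)^{-1}t$ coincides with the familiar closed-form ridge solution $w = (\Phi^T\Phi + \lambda I_M)^{-1}\Phi^Tt$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 10, 4                        # N samples, M features
Phi = rng.standard_normal((N, M))   # design matrix
t = rng.standard_normal(N)          # targets
lam = 0.1

# Primal ridge solution: solve an M x M system.
w_primal = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)

# Dual ridge solution: solve an N x N system, then map back with w = Phi^T a.
K = Phi @ Phi.T
a = np.linalg.solve(K + lam * np.eye(N), t)
w_dual = Phi.T @ a

print(np.allclose(w_primal, w_dual))  # True
```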

So we can make our prediction for a new input x by substituting back into our linear regression model:

$$y(x) = w^T\varphi(x) = (\Phi^Ta)^T\varphi(x) = a^T\Phi\varphi(x) = k(x)^T(K + \lambda I_N)^{-1}t$$

where $k(x)$ is an $N$-dimensional column vector with elements $k_n(x) = k(x_n, x)$.

The good thing is that, instead of inverting an $M \times M$ matrix (with $M$ the dimensionality of the feature space), we are inverting an $N \times N$ matrix (with $N$ the number of samples)! This allows us to work with feature spaces of very high, or even infinite, dimensionality.
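Putting it all together, kernel ridge regression fits in a few lines; here is a minimal sketch on toy 1-D data with a Gaussian kernel (the `fit`/`predict` helpers and the hyperparameters are illustrative choices, not a reference implementation):

```python
import numpy as np

def gaussian_kernel(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def fit(X, t, kernel, lam=0.1):
    """Dual coefficients a = (K + lambda * I_N)^{-1} t."""
    N = X.shape[0]
    K = np.array([[kernel(xn, xm) for xm in X] for xn in X])
    return np.linalg.solve(K + lam * np.eye(N), t)

def predict(x_new, X, a, kernel):
    """y(x) = k(x)^T a, with k_n(x) = k(x_n, x)."""
    k_x = np.array([kernel(xn, x_new) for xn in X])
    return k_x @ a

# Toy data: t = sin(x) + noise; we never touch the feature space explicitly.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
t = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)

a = fit(X, t, gaussian_kernel)
print(predict(np.array([0.5]), X, a, gaussian_kernel))  # roughly sin(0.5) ~ 0.48
```

Note that the only matrix we ever solve against is the $N \times N$ matrix $K + \lambda I_N$, even though the feature space of the Gaussian kernel is infinite-dimensional.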

But how can we build a valid kernel?

We have mainly two ways to do it:

- We can construct a kernel function directly and then verify that it is valid, i.e. that it really corresponds to a scalar product in some (possibly unknown) feature space; a necessary and sufficient condition is that the Gram matrix $K$ is positive semidefinite for every possible choice of the samples $\{x_n\}$.
- New kernels can be constructed from simpler kernels used as building blocks: given valid kernels $k_1(x, x')$ and $k_2(x, x')$, the following new kernels will be valid (a small numerical sanity check follows the list):
  - $k(x, x') = c\,k_1(x, x')$, with $c > 0$
  - $k(x, x') = f(x)\,k_1(x, x')\,f(x')$, for any function $f(\cdot)$
  - $k(x, x') = q(k_1(x, x'))$, where $q(\cdot)$ is a polynomial with non-negative coefficients
  - $k(x, x') = \exp(k_1(x, x'))$
  - $k(x, x') = k_1(x, x') + k_2(x, x')$
  - $k(x, x') = k_1(x, x')\,k_2(x, x')$
  - $k(x, x') = x^TAx'$, where $A$ is a symmetric positive semidefinite matrix
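Here is that sanity check, a small sketch on random data (the helper names are illustrative): the Gram matrices obtained by scaling, summing, multiplying element-wise and exponentiating valid kernels have no negative eigenvalues, up to floating-point noise.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))    # 20 random samples in R^3

def linear_k(x, y):
    return x @ y

def gaussian_k(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def gram(X, kernel):
    return np.array([[kernel(xn, xm) for xm in X] for xn in X])

K1, K2 = gram(X, linear_k), gram(X, gaussian_k)

# Combinations built with the rules above; each Gram matrix should stay PSD.
for K in (3.0 * K1, K1 + K2, K1 * K2, np.exp(K1)):
    eigvals = np.linalg.eigvalsh(K)
    print(eigvals.min() >= -1e-8)   # True: no significantly negative eigenvalues
```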
