If you haven’t checked out my previous blog posts on Support Vector Machines (SVMs), I suggest you do, because this post requires familiarity with the dual formulation of our optimization problem covered there.

For now, before we get into the math of the so-called “Kernel Trick”, we first have to ask: what even is it?

**Non-Linearly Separable Data Sets**

Well, as with all great techniques, there must be an equally challenging **problem**.

While it is great that our Support Vector Machine can divide data sets using hyperplanes in any number of dimensions, most data in the real world can’t be divided linearly. Take a look at the image below as an example.

No matter how we cut it, I think we can all agree there is no hyperplane that can accurately classify both the green and red points. In fact, it seems we might need some sort of circular boundary if we wanted to correctly classify or “divide” our data.

**THIS IS A HUGE PROBLEM!** **This means all that work we did to solve for our weights and biases in the previous blog posts means nothing.**

Desperate for a solution, talented researchers found a method called the “kernel trick” that extends the applicability of Support Vector Machines. Essentially, the rough idea goes like this: if we can’t divide our data in the dimensions it lives in (like 2 dimensions, for example), then maybe we can **transform** our data so that it maps into a higher-dimensional space (like 3 dimensions), where it can be divided.

How could this transformation work? To get an intuition, let’s reconsider the same example, shown below, where it seems like we need to create a circular boundary.

Consider a function \phi defined by \phi(x,y) = (x-a)^2 + (y-b)^2, where a and b are simply constants. We will make this our transformation: each point (x, y) gets a third coordinate z = \phi(x,y). If we map our transformed x-y plane into 3 dimensions (or, more formally, \mathbb{R}^3), we get a paraboloid where points near the center (a, b) sit low while farther points sit higher.

We can exploit this transformation because, if we adjust our a and b values so that the center of the paraboloid sits roughly where our green cluster is, then we can transform our points as such.

As we can see, these points are now linearly separable on the transformed x-y plane, **which is amazing by the way**, and all we really have to do to translate the separating plane back into 2 dimensions is algebraically rewrite the variables of the plane using \phi.

In this case, let’s just assume the plane was z = 2. All I have to do is manipulate the equation above algebraically: z = (x-a)^2 + (y-b)^2 = 2, or in other words, our boundary curve is the circle (x-a)^2 + (y-b)^2 = 2. **Tada! We did it!**
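To make the lift concrete, here is a minimal sketch in Python (NumPy only). The data, the center (a, b) = (0, 0), and the threshold z = 2 are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 0.0, 0.0  # assumed center of the inner (green) cluster

def ring(n, r_lo, r_hi):
    """Sample n points whose distance from the origin lies in [r_lo, r_hi]."""
    theta = rng.uniform(0, 2 * np.pi, n)
    r = rng.uniform(r_lo, r_hi, n)
    return np.column_stack([r * np.cos(theta), r * np.sin(theta)])

green = ring(50, 0.0, 1.0)   # inner cluster
red = ring(50, 1.5, 2.5)     # surrounding ring: not linearly separable in 2D

def phi(p):
    """Lift (x, y) to (x, y, z) with z = (x - a)^2 + (y - b)^2."""
    z = (p[:, 0] - a) ** 2 + (p[:, 1] - b) ** 2
    return np.column_stack([p, z])

# In the lifted space, the plane z = 2 cleanly separates the two groups.
print(phi(green)[:, 2].max() < 2 < phi(red)[:, 2].min())  # True
```

Substituting z back then recovers the circular boundary in the original plane, exactly as in the algebra above.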

So, as you can see, with a transformation we can convert a non-linear data set into a linear one, apply our original SVM formulas, and then convert the hyperplane in the transformed space back to our original space. It’s a very roundabout (and some may call it ugly) approach; however, it does the job well, and it’s frankly amazing how well it works for Support Vector Machines!

But there is still one **BIG** question lingering:

**How can we find \phi , our transformation function for any data set?**

We sort of cheated by eyeballing our data set and the properties it had; in reality, the computers of today can’t intuit these things. We need some algorithm or method for finding \phi regardless of the data set we are given. To get there, we need to go back to the dual optimization problem we covered in our previous blog post, because we missed one fundamental concept that’s a game-changer in our understanding of SVMs.

**Getting Back Into The Math** **of Support Vector Machines**

Recall the dual problem we encountered back in our previous blog post: choose our Lagrange multipliers \lambda_i such that we are left with purely maximizing L(w,b,\lambda_i), where

L(w,b,\lambda_i) = \frac{1}{2}||w||^2 + \sum_{i = 1}^{M} \lambda_i(L - y_i(x_i^Tw+b))

When we dealt with this problem back in our previous blog post, we just analyzed the terms to maximize the function; however, we never considered what the equation is actually telling us.

What do I mean by that? Well, let’s try converting the above formula into something with more meaning, and you will see.

Recall that previously, we defined the weights (remembering that the Lagrange multipliers are zero for all points that are not support vectors) as

w = \sum_{i=1}^{Z} \lambda_iy_ix_i

where Z is the number of support vectors. Plugging in w (extending the sums over all M points, since \lambda_i = 0 for non-support vectors) and realizing that ||w||^2 is w^Tw, we can rewrite the above as the following.

L(w,b,\lambda_i) = \frac{1}{2}(\sum_{i=1}^{M}\lambda_iy_ix_i^T)(\sum_{j=1}^{M}\lambda_jy_jx_j) - (\sum_{i=1}^{M}\lambda_iy_ix_i^T)(\sum_{j=1}^{M}\lambda_jy_jx_j) - \sum_{i=1}^{M}\lambda_iy_ib + L\sum_{i=1}^{M}\lambda_i

If we look at the term \sum_{i=1}^{M}\lambda_iy_ib, we can see that b can be pulled out, since it is independent of i, and we can rewrite the term as b\sum_{i=1}^{M}\lambda_iy_i. Recalling our previous constraint, this term equals 0, because in the previous blog post we derived that \sum_{i=1}^{M}\lambda_iy_i = 0. This means we can simplify our equation to

L(w,b,\lambda_i) = \frac{1}{2}(\sum_{i=1}^{M}\lambda_iy_ix_i^T)(\sum_{j=1}^{M}\lambda_jy_jx_j) - (\sum_{i=1}^{M}\lambda_iy_ix_i^T)(\sum_{j=1}^{M}\lambda_jy_jx_j) + L\sum_{i=1}^{M}\lambda_i

Noticing that our first term and second term are like terms, we can further simplify our Lagrangian to

L(w,b,\lambda_i) = -\frac{1}{2}(\sum_{i=1}^{M}\lambda_iy_ix_i^T)(\sum_{j=1}^{M}\lambda_jy_jx_j) + L\sum_{i=1}^{M}\lambda_i

Combining the sums and reorganizing our terms, we can finally rewrite our equation as

L(w,b,\lambda_i) = L\sum_{i=1}^{M}\lambda_i -\frac{1}{2}\sum_{i=1}^{M}\sum_{j=1}^{M}\lambda_i\lambda_jy_iy_jx_i^Tx_j

Notice something: if we run the same maximization analysis as before (assuming our multipliers and labels are non-zero), the optimal solution depends on the data only through x_i^Tx_j, the pairwise dot products of our data set with itself.

English, please?

Well, basically, x_i^Tx_j is the dot product between the coordinate vectors of a pair of points, as shown below. If two points point in the same direction away from the origin, their dot product is large, so the negative term grows; essentially, the closer together two points are, the harder it will be for our boundary to divide them well.

Why is this important, you ask? Because it gives us an intuition for how Support Vector Machines work: a linear support vector machine is built on a notion of similarity based on direction, where points that are closer are harder to classify.
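Before moving on, the dual objective above can be evaluated directly in code; note how the data enters only through the Gram matrix of pairwise dot products. The multipliers and points below are arbitrary, and L is the margin constant from the previous post:

```python
import numpy as np

def dual_objective(lam, y, X, L=1.0):
    """Evaluate L * sum_i lam_i - 0.5 * sum_ij lam_i lam_j y_i y_j (x_i . x_j)."""
    G = X @ X.T  # Gram matrix: every pairwise dot product x_i^T x_j
    return L * lam.sum() - 0.5 * lam @ ((np.outer(y, y) * G) @ lam)

# Tiny hand-checkable case: two orthogonal unit points with opposite labels.
lam = np.array([1.0, 1.0])
y = np.array([1.0, -1.0])
X = np.eye(2)
print(dual_objective(lam, y, X))  # 1.0
```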

This intuition matters because, once you look at the equation this way, it becomes very easy to write down the corresponding Lagrangian in our transformed space, namely

L(w,b,\lambda_i) = L\sum_{i=1}^{M}\lambda_i -\frac{1}{2}\sum_{i=1}^{M}\sum_{j=1}^{M}\lambda_i\lambda_jy_iy_j\phi(x_i)^T\phi(x_j)

just by applying \phi. Using our earlier intuition of similarity, this means that in our transformed space the boundary depends mainly on the behavior of \phi(x_i)^T\phi(x_j). Depending on how I choose \phi, I can bend what the support vector machine counts as similar, say, treating points that are really close to each other as highly similar (just as an example). Now, that might be a bad choice, but I could make it if I wanted to.

**However, once again, the question still lingers: how do we decide the right \phi for our data set, knowing that it affects our optimization like this?**

Well, realize that we actually don’t need to find \phi. If our Lagrangian depends only on \phi(x_i)^T\phi(x_j), then it makes more sense to compute \phi(x_i)^T\phi(x_j) directly than to somehow find \phi and then compute the product. Also, to be honest, it’s quite hard to find \phi by itself, but much easier to find \phi(x_i)^T\phi(x_j), mostly because it has genuine significance: it is a measure of similarity between transformed data points. **And that’s the kernel trick!**

**Instead of finding something complex like \phi, we define a kernel function K where**

K(x_i,x_j) = \phi(x_i)^T\phi(x_j)
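To see why computing K directly is such a win, here is a classic small example: for 2-dimensional inputs, the degree-2 polynomial kernel K(u,v) = (u^Tv)^2 corresponds to the explicit feature map \phi(x) = (x_1^2, \sqrt{2}x_1x_2, x_2^2). The sketch below checks that both routes give the same number:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on R^2."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def K(u, v):
    """The same similarity, computed directly: no trip through R^3 needed."""
    return (u @ v) ** 2

u, v = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(np.isclose(phi(u) @ phi(v), K(u, v)))  # True
```

The direct route is one dot product and a square, no matter how high-dimensional (even infinite-dimensional) the feature space of \phi is.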

**It’s genius because we avoid having to find some obscure transformation and instead work with something of genuine meaning.** The thing is, depending on your data set, you will have to define your similarity function (kernel function) differently. For a linear data set, K is simply x_i^Tx_j, whereas for a data set that might need a circular boundary, the kernel could be based on distance (instead of direction), so that points about the same distance apart count as more similar. Because there are so many possible kernel functions (this is **A VERY ACTIVE AREA OF RESEARCH**), it can be confusing to keep track of which ones you should use. Below are a few of them and how they define similarity for data sets used in Support Vector Machines.
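For reference, here are a few of the most widely used kernel functions written out in code; the hyperparameter defaults (degree, c, gamma) are arbitrary choices for illustration:

```python
import numpy as np

def linear_kernel(u, v):
    """Plain dot product: similarity by direction, as in the original SVM."""
    return u @ v

def polynomial_kernel(u, v, degree=3, c=1.0):
    """Implicitly compares all monomials of the coordinates up to `degree`."""
    return (u @ v + c) ** degree

def rbf_kernel(u, v, gamma=1.0):
    """Gaussian / RBF: similarity decays with squared distance, peaking at 1."""
    return np.exp(-gamma * np.sum((u - v) ** 2))
```

Note how the RBF kernel is exactly the distance-based notion of similarity mentioned above, which is why it handles circular-style boundaries so naturally.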

**But wait, so how do I use the kernel to find a boundary curve?**

Great question, because we still haven’t covered how to actually use the kernel function. We know how to make one: it should measure similarity, with more positive values of K representing more similar points and vice versa (and as I said before, there is a ton of research being done on kernels, so I really suggest you try making your own for fun). But how do we apply it?

Well, we know that in our non-transformed space,

w = \sum_{i=1}^{Z}\lambda_iy_ix_i

where Z is the number of support vectors. If we apply that equation to our hyperplane in our non-transformed space, which is x^Tw+b = 0, we get

\sum_{i=1}^{Z}\lambda_iy_ix^Tx_i + b = 0

Now, because we have written the hyperplane purely in terms of the coordinates of our data set, we can find the corresponding hyperplane in the higher dimension by applying \phi. This makes our hyperplane in the transformed space (sometimes called a feature space)

\sum_{i=1}^{Z}\lambda_iy_i\phi(x)^T\phi(x_i) + b = 0

**OH, WELL, LOOK WHO’S THERE? THE KERNEL FUNCTION!** Thus, the equation of our hyperplane in the feature space is (drumroll please…)

\sum_{i=1}^{Z}\lambda_iy_iK(x,x_i) + b = 0

**This means we can classify points regardless of our data set (assuming we have the right kernel): points that evaluate above our hyperplane (> 0) get one label, and points that evaluate below get the other.**
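Putting the pieces together, a minimal decision function looks like the sketch below. The support vectors, labels, and multipliers are made-up values purely to exercise the formula (in practice they come from solving the dual):

```python
import numpy as np

def decision(x, sv_X, sv_y, lam, b, kernel):
    """Evaluate sum_i lam_i * y_i * K(x, x_i) + b; the sign is the label."""
    return sum(l * yi * kernel(x, xi) for l, yi, xi in zip(lam, sv_y, sv_X)) + b

def predict(x, sv_X, sv_y, lam, b, kernel):
    return 1 if decision(x, sv_X, sv_y, lam, b, kernel) > 0 else -1

# Hypothetical support vectors straddling the y-axis, with a linear kernel.
sv_X = [np.array([1.0, 0.0]), np.array([-1.0, 0.0])]
sv_y = [1, -1]
lam = [1.0, 1.0]
linear = lambda u, v: u @ v
print(predict(np.array([2.0, 0.0]), sv_X, sv_y, lam, 0.0, linear))   # 1
print(predict(np.array([-2.0, 0.0]), sv_X, sv_y, lam, 0.0, linear))  # -1
```

Swapping `linear` for any other kernel changes the shape of the boundary without touching the rest of the code, which is the whole point of the trick.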

**Conclusion**

There is still tons of research being done on how to choose the best kernels for different data sets, and as I hinted before, I suggest you play around and invent kernel functions of your own. I know this can be a very confusing topic, but it is an important one. The kernel trick can be applied beyond Support Vector Machines and is a highly valuable trick to learn. Obviously, this post doesn’t cover all the insights of the Kernel Trick, so I will link helpful and more articulate resources below, though I hope this at least made sense to you. We essentially have a complete classifier that can distinguish between two labels…. or do we?

Actually, we can classify as many labels as we want by chaining binary classifiers. If we have three groups, we can first divide the first group from the other two. Then, among those remaining two, we can divide them with another classifier, and we can continue this process as long as we want. That means we have a **REALLY DARN GOOD** classifier that works in **ANY NUMBER OF DIMENSIONS, WITH ANY NUMBER OF GROUPS, AND ON NON-LINEAR DATA SETS.**
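One common way to realize this idea in code is the one-vs-rest scheme sketched below. To keep the example self-contained, a trivial nearest-centroid classifier (my stand-in, not an SVM) plays the role of the binary classifier:

```python
import numpy as np

class CentroidBinary:
    """Stand-in binary classifier (nearest centroid); a kernel SVM would go here."""
    def fit(self, X, y):  # y in {+1, -1}
        self.c_pos = X[y == 1].mean(axis=0)
        self.c_neg = X[y == -1].mean(axis=0)
        return self
    def decision(self, x):
        # Positive when x is closer to the +1 centroid than the -1 centroid.
        return np.linalg.norm(x - self.c_neg) - np.linalg.norm(x - self.c_pos)

def one_vs_rest_fit(X, labels):
    """Train one binary classifier per label: that label vs. everything else."""
    return {k: CentroidBinary().fit(X, np.where(labels == k, 1, -1))
            for k in np.unique(labels)}

def one_vs_rest_predict(models, x):
    # Pick the label whose binary classifier is most confident.
    return max(models, key=lambda k: models[k].decision(x))

# Three well-separated toy clusters.
X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.], [10., 0.], [10., 1.]])
labels = np.array([0, 0, 1, 1, 2, 2])
models = one_vs_rest_fit(X, labels)
print(one_vs_rest_predict(models, np.array([5.0, 5.5])))  # 1
```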

I think for now, unless you guys want something more on Support Vector Machines, I shall move on to digital signal processing and how that works. Stay safe and thanks! If you haven’t already considered subscribing, you may want to do that! Also, if you want to write a guest post on my blog, just contact me and I’ll respond ASAP! Comment below if I made some sort of mistake or you have questions!


**Resources**

https://towardsdatascience.com/kernel-function-6f1d2be6091

http://www-prima.imag.fr/Prima/Homepages/jlc/Courses/2015/ENSI2.SIRR/ENSI2.SIRR.S5.pdf

https://www.cs.cmu.edu/~ggordon/SVMs/new-svms-and-kernels.pdf