# The Math Behind Support Vector Machines

Warning: This post is going to have a lot of math, but I will run through it as simply as possible.

If you haven’t checked out my previous blog post on SVMs, I suggest you do that. For those of you who want a quick rundown on what Support Vector Machines (SVM) are again, it is a classification algorithm that decides how to draw a boundary line between two different groups of data, as shown below.

The above task seems simple enough for humans because we can more or less see with our eyes which line seems best, but computers can only read 1’s and 0’s. What Support Vector Machines do is represent this line mathematically so that a computer can find it. But how, you may ask?! Well, before we get there, we need to do a bit of brushing up on some basic linear algebra.

## Linear Algebra Representation of Lines

Do you guys remember what matrices are? Well, if you don’t, no need to worry!

A matrix is simply an array or table of numbers, as shown below.

In the above case, the matrix has 4 rows and 4 columns, but a matrix can have any number of rows and columns. All it really does is hold a bunch of numbers.

Whenever we have a matrix with 1 column, we call it a column vector, and whenever we have a matrix with 1 row, we call it a row vector.

Whenever we multiply two matrices, such as those depicted in the image below, you take the first row of the first matrix and the first column of the second, multiply corresponding elements, and add them up to get the element that goes in the first row and first column of the resultant matrix. You repeat the procedure for every row and column combination until the resultant matrix is complete. In the image below, we can see that by doing the operation $3*1 + 4*3$, we get 15 in the first row, first column position. Similarly, for the first row, second column position, we do $3*5 + 4*7$, which results in 43. We repeat the same procedure for the second row until we get our finalized matrix.
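The row-times-column procedure above can be sketched in a few lines of plain Python. The first row [3, 4] of the left matrix and the results 15 and 43 come from the text; the second row [5, 6] is a hypothetical filler, since the full matrices from the image aren't reproduced here.

```python
def mat_mul(A, B):
    """Multiply matrix A (m x n) by matrix B (n x p), both as lists of rows."""
    m, n, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(m)]
    for i in range(m):            # each row of A
        for j in range(p):        # each column of B
            for k in range(n):    # multiply corresponding elements, then add
                C[i][j] += A[i][k] * B[k][j]
    return C

A = [[3, 4], [5, 6]]   # second row [5, 6] is made up for illustration
B = [[1, 5], [3, 7]]
print(mat_mul(A, B))   # first row: 3*1 + 4*3 = 15, then 3*5 + 4*7 = 43
```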

Considering all this, you may wonder how matrices tie into the so-called “Linear Algebra Representation of Lines”. Well, normally in algebra, we denote a line in the form $Ax + By + C = 0$ when looking at the x-y plane (some of you may have learned it as $y = mx + b$, but they are really the same thing).

Now, we can generalize this formulation of a line in higher dimensions, like 3 dimensions or even 4 dimensions, by continuing to add terms. If you don’t understand what I mean, look at the equation of a “line” in 3-Dimensions $Ax+By+Cz + D = 0$. It’s very similar to the equation of 2-D line but with one more $Cz$ term. In fact, this is a plane.

If we were to look at 4 dimensions, the equation for a line, or hyperplane, as we call it, is $Ax+By+Cz+Dw + E = 0$. This pattern of adding terms continues, so to generalize the equation of a line to all dimensions, we make three vectors: the weight vector, the variable vector, and the bias vector. The weight vector contains all the constant coefficients in the equation of a line (like A, B, C, D, and so on) and has 1 column and N rows, where N is the number of dimensions the line lives in. An example of a weight vector in 3-D might be $\begin{bmatrix} 1 \\2\\ 3 \end{bmatrix}$. The variable vector contains all the variables, like x, y, z, and so on. This vector contains no constants and has 1 column and N rows. An example of a variable vector in 3-D might be $\begin{bmatrix}x \\ y \\ z \end{bmatrix}$. The bias vector corresponds to the extra constant term in the equation of the line; an example in 3-D might be $\begin{bmatrix} 5 \end{bmatrix}$, as it is always 1 row by 1 column. Now great, you may be asking, but what significance do these vectors have?

Well, let’s try multiplying the weight vector in the example above with the variable vector in the example above. Matrix multiplication is not commutative, and it requires that the number of columns in the first matrix, namely $w = \begin{bmatrix} 1 \\2\\ 3 \end{bmatrix}$, equals the number of rows in the second matrix, namely $x = \begin{bmatrix}x \\ y \\ z \end{bmatrix}$. Unfortunately, our weight vector has only 1 column, but our variable vector has 3 rows, so to compensate, we transform w into a row vector by “transposing” it, which is just a fancy word for flipping it. The transpose of w is written $w^T$ and is represented as $\begin{bmatrix} 1 & 2 & 3 \end{bmatrix}$. Now we can multiply the transpose of w with x, and if we do so we get

$w^Tx = \begin{bmatrix} 1 & 2 & 3 \end{bmatrix} * \begin{bmatrix}x \\ y \\ z \end{bmatrix} = 1x + 2y + 3z.$

Amazing, it’s exactly an equation of a 3-D Line, except without the extra term at the end! To compensate for this, we can just add the extra D term, through our bias vector, and voila

$w^Tx + b = \begin{bmatrix} 1 & 2 & 3 \end{bmatrix} * \begin{bmatrix}x \\ y \\ z \end{bmatrix} + \begin{bmatrix} 5 \end{bmatrix} = 1x + 2y + 3z + 5$

Now, technically the above is not an equation unless we add the 0, because $1x+2y+3z+5$ is an expression, so we can just set $1x+2y+3z+5 = 0$.
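As a quick numeric sanity check of $w^Tx + b$, here is a tiny sketch using the weight vector [1, 2, 3] and bias 5 from the text; the sample point is a made-up example chosen to land exactly on the hyperplane.

```python
w = [1, 2, 3]        # weight vector (written as a row after transposing)
b = 5                # bias
point = [2, 1, -3]   # hypothetical (x, y, z) coordinates

# w^T x + b: multiply corresponding elements, sum, then add the bias
value = sum(wi * xi for wi, xi in zip(w, point)) + b
# 1*2 + 2*1 + 3*(-3) + 5 = 0, so this point lies exactly on the hyperplane
print(value)
```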

Similarly, we can now get our equation for a hyperplane in any dimension, that is

$w^Tx + b = 0$.

Notice that a hyperplane has an interesting property: it always divides the space it exists in, whether 2 dimensions, 3 dimensions, 4 dimensions, or more, into two pieces. Hence, an SVM, which might be used to divide points in higher dimensions, can only divide between 2 different types of points, not more.

## Finally, how does the SVM algorithm work?!

Consider the data set below (for simplicity’s sake it is in 2-Dimensions) and an SVM wants to divide the red dots and the green dots.

How does it do it? Well first, it creates a randomized weight vector and bias vector and attempts to draw a hyperplane in the form of $w^Tx+b = 0$. Because we haven’t fed any information about the points yet, the line is pretty much random. Let’s just assume, after randomization, the line looks like this.

So, it’s easy to see this random line doesn’t divide the sets accurately, but let’s try something so we can fix this line. Let’s find the closest red point and the closest green point to our hyperplane. In this case, we can see in the image below, the dots circled in black are the closest such points for our current line.

These points, believe it or not, are called our support vectors, because we can represent their coordinates as two column vectors whose first elements are their x coordinates and whose second elements are their y coordinates. To find these points, the computer applies the following formula to every candidate point, where $x_1$ represents the coordinates of a point as a column vector.

$Distance = \frac{|w^Tx_1 + b|}{||w||}$ where ||w|| is the magnitude or the length of the weight vector.

It applies this formula to each of the candidate points until it finds the support vectors, which are the points with the minimum value. This is rather simple in code, because we can just keep a minimum variable equal to the lowest distance seen so far and store the coordinates of the point that produces it.
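The minimum-tracking search just described can be sketched like this. The line $-2x - y = 0$ matches the example used later in the text; the candidate points are made up.

```python
import math

def distance(w, b, x):
    """Perpendicular distance |w^T x + b| / ||w|| from point x to the hyperplane."""
    numerator = abs(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return numerator / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical line -2x - y = 0 and candidate points; the closest one
# to the hyperplane is our support vector for that class
w, b = [-2, -1], 0
points = [(4, 4), (1, 1), (3, 0)]
closest = min(points, key=lambda p: distance(w, b, p))
print(closest)  # (1, 1): |(-2)(1) + (-1)(1)| / sqrt(5) is the smallest distance
```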

For now, let’s assume these points are equidistant from the hyperplane we currently have, even if they might not be. We will adjust for this later, trust me :). Let us call the coordinates of the green dot $x_g$ and the coordinates of the red dot $x_r$, both as column vectors. Notice that if we plug the coordinates of any red dot into the left-hand side of our hyperplane equation, we will always get a negative value. For example’s sake, let’s assume the blue line above is $-2x - y = 0$, and the coordinates of the red point are (4,4).

Plugging (4,4) into $-2x-y$, we get -12. In fact, for all points on the right side of the hyperplane we get negative values, and for all points to the left of the hyperplane we get positive values. Now, since we said our support vectors are equidistant (even though they might not actually be, we will deal with that later), we also know that if we plug in the green point we should get 12. So what we do is draw two lines parallel to the original hyperplane that pass through the support vectors; in this case, these lines are $w^Tx+b = 12$ and $w^Tx+b = -12$.

Sorry for the weird text formatting :(, but if you are wondering why I put L and -L instead of 12 and -12, it is because, depending on your support vectors and your current boundary plane, you will get a different constant; the 12 was specific to the example. Now, what we want to do is maximize the distance between the red and green hyperplanes by changing our weight and bias vectors, so that our blue plane is as far as possible from the optimal support vectors. To do this, we first have to write an equation for the distance between the lines. Well, remember $x_g$ and $x_r$? Those were the column vectors that represented the coordinates of each point, and we can draw them as arrows to each point. If we subtract $x_g$ from $x_r$, we get a vector that points directly from the green point to the red point. The magnitude of this arrow, however, will not represent the distance between the hyperplanes, as we want the perpendicular distance, so we have to take the component of this vector perpendicular to the hyperplanes.

So how do we find the magnitude of the perpendicular yellow vector? Well, we know that the yellow vector can be represented by $x_g-x_r$ (doing it this way makes the vector point up instead of down), and we also know that the vector perpendicular to the green hyperplane is $w^T$, also known as the normal vector. If you don’t know, by taking the dot product (multiplying corresponding elements, then adding them up) of a unit vector (a vector with magnitude 1) with another vector, we get the component/magnitude of that vector in the direction of the unit vector. The unit vector in the direction of the purple vector would be $\frac{w^T}{||w^T||}$, as we have to divide by the magnitude in order to get a magnitude of 1. Thus, the length of the vector spanning the red and green hyperplanes can be represented as

$\frac{w^T}{||w^T||}\cdot(x_g-x_r)$

Now let’s use two other pieces of information we know, namely that $w^Tx_r+b = -L$ and $w^Tx_g+b = L$, as both the red and green support vectors lie on their respective hyperplanes and thus satisfy their equations. Subtracting the two equations, we get

$(w^Tx_g+b) - (w^Tx_r+b) = L - (-L)$, which simplifies to $w^T(x_g-x_r) = 2L$

Dividing by $||w^T||$, we get

$\frac{w^T}{||w^T||}(x_g-x_r) = \frac{2L}{||w^T||}$

Amazing!!! We already proved the left side of the above equation was the magnitude of the purple vector, meaning the distance between both hyperplanes is $\frac{2L}{||w^T||}$ and that’s what we have to maximize!

## Minimization Time, Baby!

Maximizing $\frac{2L}{||w^T||}$ is the same as minimizing its reciprocal $\frac{||w^T||}{2L}$. How simple, right?!

Except it isn’t just a simple minimization problem. We forgot something: we need to make sure our data is still classified correctly.

What do I mean by the above? Well, we aren’t guaranteeing that everything on the left of the hyperplane is a green dot and everything on the right is a red dot. In essence, after we minimize the weights, we might end up with the line with the greatest margins, but its support vectors may sit somewhere in the middle of both clusters. For example, in the image below, we can see two support vectors with the greatest margins possible, yet the line still doesn’t divide the data sets properly.

And in fact, you can even show the above line wouldn’t be possible. What’s the minimum of $\frac{||w^T||}{2L}$? It’s 0, if we set all of w’s components to 0. That means there really wouldn’t be a line! (Also, technically there should be no points in between the red and green hyperplanes.)

So, how do we fix this problem? Seems, tough!

Maybe we can take advantage of labels! Think about it: the whole reason we know the red dots aren’t the green dots is because they are red. We already know from our data what each dot is classified as, so we can represent this mathematically by using two labels, namely -1 and 1.

Any dot that is red will be given a label of -1, and any dot that is green will be given a label of 1. We will now refer to each point in the data set as $x_i$, where i goes from 1 to M and M is the number of points in the data set. We will call the corresponding label $y_i$ (-1 for red, 1 for green). Now, why did I label the red ones -1 and the green ones 1 and not vice versa? Well, there is a reason: going back to the beginning, remember how we said that if we plug the coordinates of a red point into the left-hand side of the hyperplane equation, the result is always negative, and for a green point, always positive? With that in mind, consider the following expression

$y_i(w^Tx_i+b)$

We know that for a red dot, $y_i$ is -1 and $(w^Tx_i+b)$ is negative, so their product is positive. Similarly, for a green dot, $y_i$ is 1 and $w^Tx_i+b$ is positive, so the product is again positive. We also know that, for any given line, the support vectors are by definition the closest points, so every other point must produce a value of $w^Tx_i+b$ whose absolute value is greater than L. That is, for every point, we have a constraint that must be followed, namely

$y_i(w^Tx_i+b) \geq L$ assuming our line divides both clusters correctly

Now, we have a real minimization problem! Let’s go, BABY!

We have to minimize $\frac{||w^T||}{2L}$ such that for every point $y_i(w^Tx_i+b) \geq L$. The fact that we have M constraints should make sense because now our answer min($\frac{||w^T||}{2L}$) can’t simply be 0.
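The M constraints above are easy to check mechanically. Here is a small sketch of such a check; the line, the margin L, and the two labeled points are all hypothetical.

```python
def satisfies_margin(w, b, L, points, labels):
    """Check y_i * (w^T x_i + b) >= L for every labeled point."""
    return all(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) >= L
        for x, y in zip(points, labels)
    )

w, b, L = [-2, -1], 0, 3
points = [(4, 4), (-2, -2)]   # hypothetical red and green points
labels = [-1, 1]              # red = -1, green = +1
# red: (-1) * (-12) = 12 >= 3, green: (+1) * 6 = 6 >= 3, so both constraints hold
print(satisfies_margin(w, b, L, points, labels))  # True
```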

## Lagrangian and Simplification

Now that we have formally defined the problem, we basically have an optimization problem with M constraints (I know the letters and numbers can be confusing, but bear with me). It is also good that the problem is relatively simple in nature, as we are minimizing a function subject to similar constraints. There is a principle called Occam’s Razor, which even shows up in economics, stating that the simpler and fewer our equations, the fewer assumptions we make, and thus the more generalizable our solutions will be. This is very promising!

Now, we are going to use Lagrange Multipliers, a common way of solving such constrained optimization problems. If you don’t remember Lagrange Multipliers, the idea is basically that a maximum or minimum of a function occurs where the gradient of the function is proportional to the gradients of the constraints. Basically,

$\nabla f(x,y,z,...) =\sum^{M}_{i=1} \lambda_i \nabla g_i(x,y,z,...)$, assuming the $g_i$ are the constraint functions (arranged so that the left side of each inequality is greater than or equal to 0) and the $\lambda_i$ are constants multiplied by each constraint. Technically, the classic statement doesn’t include the sigma; however, we include it because we can just scale down each multiplier to compensate for summing all the constraints.

Whenever we mention the “Lagrangian”, how fancy, we just say

$L(x,y,z,..., \lambda) = f(x,y,z,...) - \sum^{M}_{i=1} \lambda_i g_i(x,y,z,...)$

Finding the gradient of the Lagrangian and setting each partial derivative equal to 0 is equivalent to solving $\nabla f(x,y,z,...) =\sum^{M}_{i=1} \lambda_i \nabla g_i(x,y,z,...)$. The idea is that by writing our problem in this way, we have actually transformed a constrained optimization problem into an unconstrained one. Our constraints are already taken into account in the Lagrangian, so now we just have to work on maximizing and minimizing the Lagrangian, as you will see a little later in the blog post after we solve for w and b.

Side Note: Thank you to Theo, a master’s student in A.I, for letting me know that we are actually turning a constrained optimization problem to an unconstrained one by using the Lagrangian.

Before we solve our problem, we are going to make it more mathematically suitable for Lagrange Multipliers.

Side Note: Remember how I said to trust me that the support vectors will be equidistant? Well, now think about it: because our constraints are based on the premise that they are equidistant, that information is encapsulated in the problem above. Essentially, the optimized weights will make it come true. All we have to do is make sure our first randomized line is between two points, and the math will carry itself.

To make this problem more Lagrangian-ready, we rewrite the constraint so that the left-hand side is less than or equal to 0, giving us $L-y_i(w^Tx_i+b) \leq 0$. And because with Lagrange Multipliers we generally want a convex objective for these problems (then we can make guarantees about whether extrema are minima or maxima), we use $\frac{||w||^2}{2}$ instead of $\frac{||w||}{2}$.

Our new problem:

Minimize $\frac{||w||^2}{2}$ such that every point satisfies $L-y_i(w^Tx_i+b) \leq 0$

This way of stating the problem is also called the primal form for SVMs.

Right now, we are getting into dangerous territory, because we are doing something called convex optimization, and without a background in the subject, the math can be quite daunting. Don’t worry if this part starts sounding like too much to handle; as long as you know that you are just doing a minimization problem with constraints, you should know it’s generally possible to solve these problems using mathematical techniques, the Lagrangian being just one of them. Also, if you got this far, congrats! SVMs are more complicated than they look, but I wanted to make sure every detail was taken care of. You may want to read the post a few more times to get a better understanding. For those who want to continue with the math, feel free!

Now that our problem is properly set up for convex optimization using the Lagrangian, we define a Lagrangian function, namely

$L(w,b,\lambda) = \frac{||w||^2}{2} + \sum^{M}_{i=1}\lambda_i (L-y_i(w^Tx_i+b))$

To get rid of $w^T$, we can also rewrite the above as

$L(w,b,\lambda) = \frac{||w||^2}{2} + \sum^{M}_{i=1}\lambda_i (L-y_i(x_i^Tw+b))$

The $\lambda_i$ are simply our Lagrange Multipliers, one for each constraint. I know it may seem like magic, but if we find $\frac{\partial L}{\partial w}$ and set it equal to 0, we can solve for w, and that will be our optimized solution. Similarly, if we find $\frac{\partial L}{\partial b}$ and set it equal to 0, we can solve for b. Essentially, we are setting the gradient of the Lagrangian equal to 0 and adding any new constraints we get from solving for w and b.

## Finding w

To find $\frac{\partial L}{\partial w}$, let’s go term by term and apply what we know (I love linearity!). Our first term in the Lagrangian is $\frac{||w||^2}{2}$, and we know that $||w||^2 = w^Tw$ (a linear algebra property you can prove yourself). Now, for the first term,

$\frac{\partial L_1}{\partial w} = \frac{\partial L_1}{\partial w^Tw} \frac{\partial w^Tw}{\partial w}$

$\frac{\partial L_1}{\partial w^Tw} = \frac{\partial \frac{w^Tw}{2} }{\partial w^Tw} = \frac{1}{2}$

$\frac{\partial w^Tw}{\partial w} = \frac{\partial \sum^{n}_{i=1}w_i^2 }{\partial w} = (2w_1, 2w_2, 2w_3, ...) = 2w$

$\frac{\partial L_1}{\partial w} = \frac{\partial L_1}{\partial w^Tw} \frac{\partial w^Tw}{\partial w} = \frac{1}{2} * 2w = w$

Thus the partial derivative of the first term is w, which is quite extraordinarily simple because of how we set up our problem.

For the second term $L_2 = \sum^{M}_{i=1}\lambda_i (L-y_i(x_i^Tw+b))$

$\frac{\partial L_2}{\partial w} = \frac{\partial \sum^{M}_{i=1}\lambda_i (L-y_i(x_i^Tw+b))}{\partial w} = \frac{\partial -\sum^{M}_{i=1}\lambda_i y_ix_i^Tw}{\partial w} = -\sum^{M}_{i=1}\lambda_i y_ix_i$

Thus

$\frac{\partial L}{\partial w} = \frac{\partial L_1}{\partial w} + \frac{\partial L_2}{\partial w} = w -\sum^{M}_{i=1}\lambda_i y_ix_i$

Amazing! We found it!

Setting $\frac{\partial L}{\partial w} = 0$, we get

$w = \sum^{M}_{i=1}\lambda_i y_ix_i$

To think the solution would be simple, just wow! It is simply a weighted linear combination of all our training examples in our data set.
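That weighted combination is simple enough to compute by hand. In the sketch below, the multipliers, labels, and points are all hypothetical; note the zero multipliers, since (as shown later in the post) only the support vectors end up with nonzero $\lambda_i$.

```python
# Hypothetical multipliers, labels, and 2-D points; only the two middle
# points (the support vectors) carry nonzero multipliers
lambdas = [0.0, 0.5, 0.5, 0.0]
labels  = [1, 1, -1, -1]
points  = [(5.0, 5.0), (2.0, 2.0), (-2.0, -2.0), (-5.0, -5.0)]

# w = sum_i lambda_i * y_i * x_i, accumulated dimension by dimension
w = [0.0, 0.0]
for lam, y, x in zip(lambdas, labels, points):
    for d in range(2):
        w[d] += lam * y * x[d]
print(w)  # [2.0, 2.0] -- driven entirely by the two support vectors
```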

## Finding b

Following a similar procedure to above, we get

$\frac{\partial L}{\partial b} = \frac{\partial L_1}{\partial b} + \frac{\partial L_2}{\partial b} = 0 + \frac{\partial L_2}{\partial b} = \frac{\partial L_2}{\partial b} = \frac{\partial \sum^{M}_{i=1}\lambda_i (L-y_i(x_i^Tw+b))}{\partial b} = \frac{\partial \sum^{M}_{i=1}\lambda_i y_i}{\partial b} =\sum^{M}_{i=1}\lambda_i y_i$

$\frac{\partial L}{\partial b} = \frac{\partial L_1}{\partial b} + \frac{\partial L_2}{\partial b} = 0 + \frac{\partial \sum^{M}_{i=1}\lambda_i (L-y_i(x_i^Tw+b))}{\partial b} = -\sum^{M}_{i=1}\lambda_i y_i$

Setting this equal to 0, we get

$\sum^{M}_{i=1}\lambda_i y_i = 0$

Essentially, we have another constraint that we have to take into account in our problem. But how? Well, remember how I said the way we stated the problem is its primal form. Believe it or not, after long mathematical proofs, Lagrangian duality makes another claim: upon finding the solutions and constraints of the primal problem, we must plug them back into the Lagrangian and maximize that function over the multipliers. We call this the dual problem, because it’s like the second step we have to go through in order to actually find not just b, but the Lagrange Multipliers we are using. Formally, it is stated as such:

Find the multipliers that maximize $L(w, \lambda)$

Looking at the equation like this, we can do a simple analysis of our Lagrangian, and we see that $\frac{||w||^2}{2}$ will always be positive.

$L(w,b,\lambda) = \frac{||w||^2}{2} + \sum^{M}_{i=1}\lambda_i (L-y_i(x_i^Tw+b))$

The second term, on the other hand, has to be negative or zero: the $\lambda_i$ are always non-negative in this formulation, and we know that $y_i(x_i^Tw+b)$ is always greater than or equal to L, meaning $L-y_i(x_i^Tw+b) \leq 0$ for all points. This means that overall $\sum^{M}_{i=1}\lambda_i (L-y_i(x_i^Tw+b)) \leq 0$, and thus, to maximize our Lagrangian with respect to the multipliers, we should set $\sum^{M}_{i=1}\lambda_i (L-y_i(x_i^Tw+b)) = 0$. Doing this, we arrive at another pivotal conclusion.

$\sum^{M}_{i=1}\lambda_i (L-y_i(x_i^Tw+b)) = 0$ can only be true if, for every point, either $\lambda_i$ is 0 or $L-y_i(x_i^Tw+b)$ is 0. Consider when $L-y_i(x_i^Tw+b) = 0$. Non-support-vector points lie farther from the boundary hyperplane, so for such points $y_i(x_i^Tw+b) > L$, meaning $L-y_i(x_i^Tw+b)$ cannot equal 0, and the $\lambda_i$ for those points must be 0. For support vectors, $L-y_i(x_i^Tw+b)$ must be 0. THIS MEANS ONLY THE SUPPORT VECTORS HAVE NONZERO LAGRANGE MULTIPLIERS; ALL OTHER POINTS NEED THEIR MULTIPLIERS TO BE 0.

## Putting it all together

So, we know three things:

$w = \sum^{M}_{i=1}\lambda_i y_ix_i$

$\sum^{M}_{i=1}\lambda_i y_i = 0$

$w^Tx+b = y$

Consider $w = \sum^{M}_{i=1}\lambda_i y_ix_i$. Knowing that only the support vectors have nonzero Lagrange Multipliers, instead of summing over every point, we can just write $w = \sum^{Z}_{s=1}\lambda_s y_s x_s$, where Z is the number of support vectors we are dealing with. That gives us the equation for our weights. For our bias, because we know the equation of one of our support hyperplanes is $w^Tx_s + b = L$, we can say $b = L - w^Tx_s$. That gives us the equation for our bias. To find the multipliers, we know $\sum^{M}_{i=1}\lambda_i y_i = 0$, but since our implementation only considers two support vectors, we can set both multipliers equal to 1, as their labels cancel each other out, being on opposite sides of the line (one positive, the other negative). Essentially, this means our final two equations are simply

$w = \sum^{Z}_{s=1}y_sx_s$

$b = L-w^Tx_s$

And that’s all! I know it can be confusing with so many steps, but the main idea is that our weights and biases are based solely on our support vectors. There are other methods of optimization, like Stochastic Gradient Descent, as used in many other machine learning and A.I. systems these days, but the Lagrangian really simplifies things for us by giving us equations we can plug into. Because our way of doing SVMs requires only 2 support vectors, I simplified the equations further, though you don’t necessarily have to. If you want to learn more sophisticated algorithms for SVMs, there are resources linked below. Up next, we will talk about how we can use kernel functions on our current work to make SVMs work on non-linearly separable datasets. Below is my pure Python (+ matplotlib for visualization) implementation of an SVM.

This implementation uses ONLY 2 support vectors, implements a hard-margin SVM, and uses Lagrangian optimization. Please use the code below as you like!

```python
# Author: Manu Gupta
# Feel free to use this code as you like :)
# Also, sorry for the excessive comments, but this code was originally used in a blog post for educational purposes

import random
import matplotlib.pyplot as plt
import numpy as np

# Just some hyper-parameters
NUM_POINTS = 50
DIMENSIONS = 2
X_MAX, Y_MAX = 10, 10

# Preparing my data in a way where it can be plotted and is linearly separable
Red_Dots, Green_Dots = [], []
for i in range(DIMENSIONS):
    Temp = []
    for j in range(NUM_POINTS):
        Temp.append(random.uniform(X_MAX * 0.05, X_MAX * 0.45))
    Red_Dots.append(Temp)
for i in range(DIMENSIONS):
    Temp = []
    for j in range(NUM_POINTS):
        Temp.append(random.uniform(X_MAX * 0.55, X_MAX * 0.95))
    Green_Dots.append(Temp)

# Plotting the points and graphing it
plt.plot(Red_Dots[0], Red_Dots[1], 'ro')
plt.plot(Green_Dots[0], Green_Dots[1], 'go')
plt.axis([0, X_MAX, 0, Y_MAX])

# The hardest issue with SVM is having two equidistant support vectors, so first we have to choose random ones
r = random.randint(0, NUM_POINTS - 1)
Red_SV, Green_SV = [], []
for i in range(DIMENSIONS):
    Red_SV.append(Red_Dots[i][r])
    Green_SV.append(Green_Dots[i][r])

# We find the midpoint of the support vectors, as that point will lie on our boundary
Midpoint = []
for i in range(DIMENSIONS):
    Midpoint.append((Red_SV[i] + Green_SV[i]) / 2)

# Now we use the formula we derived for the weights in the blog post
# (here we label red positive and green negative)
Weights = []
for i in range(DIMENSIONS):
    Weights.append(Red_SV[i] - Green_SV[i])

# We will now calculate the bias, namely b = -w^T x, assuming our midpoint is on the line w^T x + b = 0.
# Because our code picks support vectors first instead of a random line, we don't have to do
# b = L - w^T x_s. In reality, the concept still remains the same.
Bias = 0
for i in range(DIMENSIONS):
    Bias -= Weights[i] * Midpoint[i]

# Our line works, but to choose the optimal support vectors, we go through each array of points
# and find the lowest distance through the formula in the blog
Weight_Magnitude = 0
for i in range(DIMENSIONS):
    Weight_Magnitude += Weights[i] ** 2
Weight_Magnitude = Weight_Magnitude ** 0.5

Min_RD, Min_GD = X_MAX + Y_MAX, X_MAX + Y_MAX
Red_SV, Green_SV = [], []
for i in range(NUM_POINTS):
    R_Distance, G_Distance = 0, 0
    for j in range(DIMENSIONS):
        R_Distance += Weights[j] * Red_Dots[j][i]
        G_Distance += Weights[j] * Green_Dots[j][i]
    R_Distance = abs((R_Distance + Bias) / Weight_Magnitude)
    G_Distance = abs((G_Distance + Bias) / Weight_Magnitude)
    if R_Distance < Min_RD:
        Min_RD = R_Distance
        Red_SV = []
        for j in range(DIMENSIONS):
            Red_SV.append(Red_Dots[j][i])
    if G_Distance < Min_GD:
        Min_GD = G_Distance
        Green_SV = []
        for j in range(DIMENSIONS):
            Green_SV.append(Green_Dots[j][i])

# Now that we have better support vectors, we just rinse and repeat
Midpoint, Weights, Bias = [], [], 0
for i in range(DIMENSIONS):
    Midpoint.append((Red_SV[i] + Green_SV[i]) / 2)
    Weights.append(Red_SV[i] - Green_SV[i])
    Bias -= Weights[i] * Midpoint[i]

# Plotting the line
x = np.linspace(0, X_MAX)
y = (-Bias - Weights[0] * x) / Weights[1]
plt.plot(x, y)
plt.show()

# What I basically did was use a roundabout way of doing SVM with Lagrangian optimization and 2 support vectors.
# I used a hard margin; for soft margins, SGD might be a better form of optimization.
```



## Resources

http://cs229.stanford.edu/notes/cs229-notes3.pdf

https://datascience.stackexchange.com/questions/13266/understanding-the-math-behind-svm

http://web.mit.edu/6.034/wwwbob/svm.pdf