Abstract

$K$-Nearest Neighbors ($K$NN) is one of the simplest (but effective) algorithms in machine learning, used for classification and regression. When I have a new piece of data that I want to classify, I simply compare it to every piece of data in my training set. I then take the $K$ ($K$ can be any positive integer) training examples most similar to the new data point and predict the majority label among those $K$ examples. Similarity is usually measured by the Euclidean distance. There is no model to train; all I need is data and a choice of $K$.

Once Upon A Time

You’re a new student at a high school in Kansas. It’s a stereotypical high school with a bunch of different cliques. It has the jocks, the nerds, the cheerleaders, the rock band kids, and the theatre students. None of these are mutually exclusive, but for this story let’s assume they are :). You, being a new student who wants friends, look for the clique that’s most similar to you to hang out with.

During your first week, you go around the whole school meeting each student (training data). You know what clique each student belongs to (label). When you meet each person, you compare your hobbies, wardrobe, and GPA (features). You record in your journal how similar they are to you, based on those 3 features.

By the end of the week, you have met the entire school and decide it’s time to fully integrate into one of the cliques. You look in your journal and decide that you are going to find the 5 (you set $K$ = 5) people that are most similar to you and are going to join the clique that the majority of the 5 belong to.

You see the top 5 people most similar to you are Ceejorn (jock), Nickenny (cheerleader), Nacolm (nerd), Dayaan (nerd), and Adam (nerd). You see that a majority of the people that are most similar to you are the nerds. You now classify yourself as a nerd. As a result, you hang out with the nerds for the remainder of your high school career.

K-Nearest Neighbors

$K$-Nearest Neighbors, unlike most other algorithms, doesn’t build a model; there is no training step. The main idea is to simply use the similarity between examples’ features. The assumption driving this algorithm is that similar examples should have the same label, which makes intuitive sense in real life. We also assume that all examples (instances) are points in $d$-dimensional space $\mathbb{R}^{d}$.

For example, if you wanted to classify whether someone was an adult or child (labels), and you were given 2 features (height and weight), you would model it with a 2 dimensional graph.

For our similarity measure, we use the Euclidean distance: the closer an example is, the more similar it is. Given two examples $x_i$ and $x_z$, we can calculate the Euclidean distance as:

$$d(x_i, x_z) = \sqrt{\sum_{j=1}^{d} (x_{i,j} - x_{z,j})^2}$$

with $j$ being each feature in the examples.
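This distance can be computed in a few lines of Python (a minimal sketch; the function name is mine):

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two feature vectors a and b."""
    return math.sqrt(sum((a_j - b_j) ** 2 for a_j, b_j in zip(a, b)))

# Two points with d = 2 features: sqrt(3^2 + 4^2)
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```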

Given a new example $x_{new}$ that we want to classify, we simply compare it to our entire training set using the Euclidean distance:

$$d(x_{new}, x_i) = \sqrt{\sum_{j=1}^{d} (x_{new,j} - x_{i,j})^2}$$

for all $i$ in our training set.

We then take the labels $y_i$ of the $K$ training examples with the smallest Euclidean distances, which correspond to the $K$ nearest neighbors, and place them in the set $N_K$, e.g. $N_2 = \{y_3, y_{48}\}$. Since all the labels are -1 or +1, we can simply compute:

$$y_{predicted} = \text{sign}\left(\sum_{y_i \in N_K} y_i\right)$$

which will return the majority label within $N_K$. That is the label we will predict.

If you're still a little confused, keep reading on and it should clear up by the time you read through the example section.

Algorithm

Given a training set $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ with $d$ features and parameter $K$, the $K$-Nearest Neighbors algorithm works like this:

• Given a new example to be classified $x_{new}$:
• For each example in the training set:
• Calculate $d(x_{new}, x_i) = \sqrt{\sum_{j=1}^d (x_{new,j}-x_{i,j})^2}$ for all $1 \leq i \leq n$
• Return the $K$ closest examples as the set $N_K(x_{new})$
• Predict label $y_{predicted} = \text{sign}\left(\sum_{x_i \in N_K(x_{new})} y_i\right)$
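The steps above can be sketched in plain Python (a minimal sketch; the name `knn_predict` is mine, and labels are assumed to be ±1 as in the text):

```python
import math

def knn_predict(train, x_new, K):
    """Predict the ±1 label of x_new from (features, label) training pairs."""
    # Distance from x_new to every training example
    dists = [(math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_new))), y_i)
             for x_i, y_i in train]
    # Keep the K nearest neighbors' labels
    nearest = sorted(dists)[:K]
    # The sign of the label sum is the majority vote
    return 1 if sum(y for _, y in nearest) > 0 else -1

train = [((0, 0), -1), ((1, 0), -1), ((5, 5), 1)]
print(knn_predict(train, (0, 1), K=3))  # -1 (two of the three neighbors are -1)
```

Choosing $K$ odd avoids ties in the ±1 vote.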

Example

Going with our example, assume we are trying to predict which clique you $(x_{you})$ will fit into the best.

The two class labels are jocks ($y = -1$) or rock band kids ($y = +1$).

The 2 features are $x_{you, 1} =$ love of sports and $x_{you, 2} =$ love of rock music, both rated on a scale of 0-9. You love sports $x_{you,1} = 9$ and you’re pretty neutral about rock music $x_{you,2} = 5$. So, $x_{you} = (9, 5)$.

You are trying to predict which clique you will belong to $y_{predicted} = ?$.

Let’s say we set $K$ = 3 and we’re given (training data) 10 students, 5 jocks and 5 rock band kids:

$$x_1 = (9, 0), \ y_1 = -1$$
$$x_2 = (7, 3), \ y_2 = -1$$
$$x_3 = (8, 3), \ y_3 = -1$$
$$x_4 = (9, 2), \ y_4 = -1$$
$$x_5 = (7, 1), \ y_5 = -1$$
$$x_6 = (3, 9), \ y_6 = +1$$
$$x_7 = (4, 8), \ y_7 = +1$$
$$x_8 = (2, 7), \ y_8 = +1$$
$$x_9 = (4, 7), \ y_9 = +1$$
$$x_{10} = (0, 9), \ y_{10} = +1$$

We compare each one’s features to $x_{you}$’s features using the Euclidean distance formula. For example, if we compare the 3rd person to you, we’d get:

$$d(x_{you}, x_3) = \sqrt{(9-8)^2 + (5-3)^2} = \sqrt{5} \approx 2.24$$

We calculate this for the other 9 examples and keep the 3 examples with the smallest distances, which correspond to the 3 closest examples, in the set $N_K$. The 3 closest to $x_{you}$ are $x_2, x_3, x_4$.

To find our predicted label $y_{predicted}$, we take the sum of the 3 labels corresponding to $N_K(x_{you}) = \{x_2, x_3, x_4\}$, given by:

$$y_{predicted} = \text{sign}(y_2 + y_3 + y_4) = \text{sign}((-1)+(-1)+(-1)) = -1.$$

So based on your features, we predict that you will most likely get along with the jocks!
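The whole worked example can be checked in code. Here is a self-contained sketch using the 10 students and $x_{you} = (9, 5)$ from above (variable names are mine):

```python
import math

# Training data from the example: 5 jocks (-1) and 5 rock band kids (+1)
students = [((9, 0), -1), ((7, 3), -1), ((8, 3), -1), ((9, 2), -1), ((7, 1), -1),
            ((3, 9), 1), ((4, 8), 1), ((2, 7), 1), ((4, 7), 1), ((0, 9), 1)]
x_you = (9, 5)
K = 3

# Distance from you to every student, sorted so the K nearest come first
dists = sorted((math.sqrt((x[0] - x_you[0]) ** 2 + (x[1] - x_you[1]) ** 2), y)
               for x, y in students)
vote = sum(y for _, y in dists[:K])
prediction = 1 if vote > 0 else -1
print(prediction)  # -1: you're predicted to fit in with the jocks
```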

Summary/Discussion

$K$-Nearest Neighbors is simple to implement, works well in practice, and can be extended easily with new examples. It does not require the user to build a model or make assumptions about the data's distribution; the only parameter to choose is $K$.

There is no training step. When given new data to classify, you simply compare the new data’s features to each training example’s features and find the $K$ closest examples, as determined by the Euclidean distance. The predicted label is the majority label among those $K$ examples.

In order to choose a good value of the parameter $K$, see cross validation.
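One common variant of cross validation for this is leave-one-out: for each candidate $K$, predict each training example from all the others and keep the $K$ with the fewest mistakes. A rough sketch (function names are mine; labels are assumed ±1):

```python
import math

def knn_label(train, x_new, K):
    """±1 majority vote among the K nearest (features, label) pairs."""
    dists = sorted((math.dist(x_i, x_new), y_i) for x_i, y_i in train)
    return 1 if sum(y for _, y in dists[:K]) > 0 else -1

def best_k(data, candidates):
    """Leave-one-out error for each K; return the K with the fewest mistakes."""
    def errors(K):
        return sum(knn_label(data[:i] + data[i + 1:], x, K) != y
                   for i, (x, y) in enumerate(data))
    return min(candidates, key=errors)
```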

Note that there are definitely cons when it comes to $K$NN. For one, it requires enough space to store the entire training dataset. The algorithm can also be extremely slow: given $n$ examples and $d$ features, each prediction takes $O(n \times d)$ time with brute-force search. Last but not least, it suffers from the curse of dimensionality: in high dimensions, distances become less informative, and the speedups offered by nearest-neighbor structures like K-D trees degrade toward brute-force performance.

Other observations:

• $K$NN can be used for regression.
• Regression: The output will be the average of the values of its $k$ nearest neighbors, instead of the majority label.
• $K$NN is non-parametric, meaning it does not require the modeler to make any assumption about the distribution of the population.
• $K$NN is discriminative, since it models the conditional probability $p(y\mid x)$.
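The regression variant mentioned above only changes the last step: average the neighbors' values instead of voting. A minimal sketch (names and data are mine):

```python
import math

def knn_regress(train, x_new, K):
    """Average the values of the K nearest neighbors instead of taking a vote."""
    dists = sorted((math.dist(x_i, x_new), y_i) for x_i, y_i in train)
    return sum(y for _, y in dists[:K]) / K

# Predict a continuous value from the 2 nearest neighbors
train = [((0, 0), 10.0), ((1, 0), 20.0), ((9, 9), 100.0)]
print(knn_regress(train, (0, 1), K=2))  # 15.0
```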
...