## k-Nearest-Neighbor Classifier

"Show me who your friends are and I’ll tell you who you are?"

The concept of the k-nearest neighbor classifier can hardly be simpler described. This is an old saying, which can be found in many languages and many cultures. It's also metnioned in other words in the Bible: "He who walks with wise men will be wise, but the companion of fools will suffer harm" (Proverbs 13:20 )

This means that the concept of the k-nearest neighbor classifier is part of our everyday life and judging: Imagine you meet a group of people, they are all very young, stylish and sportive. They talk about there friend Ben, who isn't with them. So, what is your imagination of Ben? Right, you imagine him as being yong, stylish and sportive as well.

If you learn that Ben lives in a neighborhood where people vote conservative and that the average income is above 200000 dollars a year? Both his neighbors make even more than 300,000 dollars per year? What do you think of Ben? Most probably, you do not consider him to be an underdog and you may suspect him to be a conservative as well?

The principle behind nearest neighbor classification consists in finding a predefined number, i.e. the 'k' - of training samples closest in distance to a new sample, which has to be classified. The label of the new sample will be defined from these neighbors. k-nearest neighbor classifiers have a fixed user defined constant for the number of neighbors which have to be determined. There are also radius-based neighbor learning algorithms, which have a varying number of neighbors based on the local density of points, all the samples inside of a fixed radius. The distance can, in general, be any metric measure: standard Euclidean distance is the most common choice. Neighbors-based methods are known as non-generalizing machine learning methods, since they simply "remember" all of its training data. Classification can be computed by a majority vote of the nearest neighbors of the unknown sample.

The k-NN algorithm is among the simplest of all machine learning algorithms, but despite its simplicity, it has been quite successful in a large number of classification and regression problems, for example character recognition or image analysis.

Now let's get a little bit more mathematically:

The k-Nearest-Neighbor Classifier (k-NN) works directly on the learned samples, instead of creating rules compared to other classification methods.

**Nearest Neighbor Algorithm:**

Given a set of categories $\{c_1, c_2, ... c_n\}$, also called classes, e.g. {"male", "female"}. There is also a learnset $LS$ consisting of labelled instances.

The task of classification consists in assigning a category or class to an arbitrary instance. If the instance $o$ is an element of $LS$, the label of the instance will be used.

Now, we will look at the case where $o$ is not in $LS$:

$o$ is compared with all instances of $LS$. A distance metric is used for comparison. We determine the $k$ closest neighbors of $o$, i.e. the items with the smallest distances. $k$ is a user defined constant and a positive integer, which is usually small.

The most common class of $LS$ will be assigned to the instance $o$. If k = 1, then the object is simply assigned to the class of that single nearest neighbor.

The algorithm for the k-nearest neighbor classifier is among the simplest of all machine learning algorithms. k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all the computations are performed, when we do the actual classification.

### k-nearest-neighbor from Scratch

#### Preparing the Dataset

Before we actually start with writing a nearest neighbor classifier, we need to think about the data, i.e. the learnset. We will use the "iris" dataset provided by the datasets of the sklearn module.

The data set consists of 50 samples from each of three species of Iris

- Iris setosa,
- Iris virginica and
- Iris versicolor.

Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres.

import numpy as np from sklearn import datasets iris = datasets.load_iris() iris_data = iris.data iris_labels = iris.target print(iris_data[0], iris_data[79], iris_data[100]) print(iris_labels[0], iris_labels[79], iris_labels[100])

[5.1 3.5 1.4 0.2] [5.7 2.6 3.5 1. ] [6.3 3.3 6. 2.5] 0 1 2

We create a learnset from the sets above. We use permutation from np.random to split the data randomly.

np.random.seed(42) indices = np.random.permutation(len(iris_data)) n_training_samples = 12 learnset_data = iris_data[indices[:-n_training_samples]] learnset_labels = iris_labels[indices[:-n_training_samples]] testset_data = iris_data[indices[-n_training_samples:]] testset_labels = iris_labels[indices[-n_training_samples:]] print(learnset_data[:4], learnset_labels[:4]) print(testset_data[:4], testset_labels[:4])

[[6.1 2.8 4.7 1.2] [5.7 3.8 1.7 0.3] [7.7 2.6 6.9 2.3] [6. 2.9 4.5 1.5]] [1 0 2 1] [[5.7 2.8 4.1 1.3] [6.5 3. 5.5 1.8] [6.3 2.3 4.4 1.3] [6.4 2.9 4.3 1.3]] [1 2 1 1]

The following code is only necessary to visualize the data of our learnset. Our data consists of four values per iris item, so we will reduce the data to three values by summing up the third and fourth value. This way, we are capable of depicting the data in 3-dimensional space:

# following line is only necessary, if you use ipython notebook!!! %matplotlib inline import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import Axes3D colours = ("r", "b") X = [] for iclass in range(3): X.append([[], [], []]) for i in range(len(learnset_data)): if learnset_labels[i] == iclass: X[iclass][0].append(learnset_data[i][0]) X[iclass][1].append(learnset_data[i][1]) X[iclass][2].append(sum(learnset_data[i][2:])) colours = ("r", "g", "y") fig = plt.figure() ax = fig.add_subplot(111, projection='3d') for iclass in range(3): ax.scatter(X[iclass][0], X[iclass][1], X[iclass][2], c=colours[iclass]) plt.show()

#### Determining the Neighbors

To determine the similarity between two instances, we need a distance function. In our example, the Euclidean distance is ideal:

In [ ]:def distance(instance1, instance2): # just in case, if the instances are lists or tuples: instance1 = np.array(instance1) instance2 = np.array(instance2) return np.linalg.norm(instance1 - instance2) print(distance([3, 5], [1, 1])) print(distance(learnset_data[3], learnset_data[44]))

The function 'get_neighbors returns a list with 'k' neighbors, which are closest to the instance 'test_instance':

In [ ]:def get_neighbors(training_set, labels, test_instance, k, distance=distance): """ get_neighors calculates a list of the k nearest neighbors of an instance 'test_instance'. The list neighbors contains 3-tuples with (index, dist, label) where index is the index from the training_set, dist is the distance between the test_instance and the instance training_set[index] distance is a reference to a function used to calculate the distances """ distances = [] for index in range(len(training_set)): dist = distance(test_instance, training_set[index]) distances.append((training_set[index], dist, labels[index])) distances.sort(key=lambda x: x[1]) neighbors = distances[:k] return neighbors

We will test the function with our iris samples:

In [ ]:for i in range(5): neighbors = get_neighbors(learnset_data, learnset_labels, testset_data[i], 3, distance=distance) print(i, testset_data[i], testset_labels[i], neighbors)

#### Voting to get a Single Result

We will write a vote function now. This functions uses the class 'Counter' from collections to count the quantity of the classes inside of an instance list. This instance list will be the neighbors of course. The function 'vote' returns the most common class:

In [ ]:from collections import Counter def vote(neighbors): class_counter = Counter() for neighbor in neighbors: class_counter[neighbor[2]] += 1 return class_counter.most_common(1)[0][0]

We will test 'vote' on our training samples:

In [ ]:for i in range(n_training_samples): neighbors = get_neighbors(learnset_data, learnset_labels, testset_data[i], 3, distance=distance) print("index: ", i, ", result of vote: ", vote(neighbors), ", label: ", testset_labels[i], ", data: ", testset_data[i])

We can see that the predictions correspond to the labelled results, except in case of the item with the index 8.

'vote_prob' is a function like 'vote' but returns the class name and the probability for this class:

In [ ]:def vote_prob(neighbors): class_counter = Counter() for neighbor in neighbors: class_counter[neighbor[2]] += 1 labels, votes = zip(*class_counter.most_common()) winner = class_counter.most_common(1)[0][0] votes4winner = class_counter.most_common(1)[0][1] return winner, votes4winner/sum(votes)In [ ]:

for i in range(n_training_samples): neighbors = get_neighbors(learnset_data, learnset_labels, testset_data[i], 5, distance=distance) print("index: ", i, ", vote_prob: ", vote_prob(neighbors), ", label: ", testset_labels[i], ", data: ", testset_data[i])

#### The Weighted Nearest Neighbour Classifier

We looked only at k items in the vicinity of an unknown object „UO", and had a majority vote. Using the majority vote has shown quite efficient in our previous example, but this didn't take into account the following reasoning: The farther a neighbor is, the more it "deviates" from the "real" result. Or in other words, we can trust the closest neighbors more than the farther ones. Let's assume, we have 11 neighbors of an unknown item UO. The closest five neighbors belong to a class A and all the other six, which are farther away belong to a class B. What class should be assigned to UO? The previous approach says B, because we have a 6 to 5 vote in favor of B. On the other hand the closest 5 are all A and this should count more.

To pursue this strategy, we can assign weights to the neighbors in the following way: The nearest neighbor of an instance gets a weight $1 / 1$, the second closest gets a weight of $1 / 2$ and then going on up to $1/k$ for the farthest away neighbor.

This means that we are using the harmonic series as weights:

$$\sum_{i}^{k}{1/(i+1)} = 1 + \frac{1}{2} + \frac{1}{3} + ... + \frac{1}{k}$$

We implement this in the following function:

In [ ]:def vote_harmonic_weights(neighbors, all_results=True): class_counter = Counter() number_of_neighbors = len(neighbors) for index in range(number_of_neighbors): class_counter[neighbors[index][2]] += 1/(index+1) labels, votes = zip(*class_counter.most_common()) #print(labels, votes) winner = class_counter.most_common(1)[0][0] votes4winner = class_counter.most_common(1)[0][1] if all_results: total = sum(class_counter.values(), 0.0) for key in class_counter: class_counter[key] /= total return winner, class_counter.most_common() else: return winner, votes4winner / sum(votes)In [ ]:

for i in range(n_training_samples): neighbors = get_neighbors(learnset_data, learnset_labels, testset_data[i], 6, distance=distance) print("index: ", i, ", result of vote: ", vote_harmonic_weights(neighbors, all_results=True))

The previous approach took only the ranking of the neighbors according to their distance in account. We can improve the voting by using the actual distance. To this purpos we will write a new voting function:

In [ ]:def vote_distance_weights(neighbors, all_results=True): class_counter = Counter() number_of_neighbors = len(neighbors) for index in range(number_of_neighbors): dist = neighbors[index][1] label = neighbors[index][2] class_counter[label] += 1 / (dist**2 + 1) labels, votes = zip(*class_counter.most_common()) #print(labels, votes) winner = class_counter.most_common(1)[0][0] votes4winner = class_counter.most_common(1)[0][1] if all_results: total = sum(class_counter.values(), 0.0) for key in class_counter: class_counter[key] /= total return winner, class_counter.most_common() else: return winner, votes4winner / sum(votes)In [ ]:

for i in range(n_training_samples): neighbors = get_neighbors(learnset_data, learnset_labels, testset_data[i], 6, distance=distance) print("index: ", i, ", result of vote: ", vote_distance_weights(neighbors, all_results=True))

### Another Example for Nearest Neighbor Classification

We want to test the previous functions with another very simple dataset:

In [ ]:train_set = [(1, 2, 2), (-3, -2, 0), (1, 1, 3), (-3, -3, -1), (-3, -2, -0.5), (0, 0.3, 0.8), (-0.5, 0.6, 0.7), (0, 0, 0) ] labels = ['apple', 'banana', 'apple', 'banana', 'apple', "orange", 'orange', 'orange'] k = 1 for test_instance in [(0, 0, 0), (2, 2, 2), (-3, -1, 0), (0, 1, 0.9), (1, 1.5, 1.8), (0.9, 0.8, 1.6)]: neighbors = get_neighbors(train_set, labels, test_instance, 2) print("vote distance weights: ", vote_distance_weights(neighbors))

### kNN in Linguistics

The next example comes from computer linguistics. We show how we can use a k-nearest neighbor classifier to recognize misspelled words.

We use a module called levenshtein, which we have implemented in our tutorial on Levenshtein Distance.

In [ ]:from levenshtein import levenshtein cities = [] with open("data/city_names.txt") as fh: for line in fh: city = line.strip() if " " in city: # like Freiburg im Breisgau add city only as well cities.append(city.split()[0]) cities.append(city) #cities = cities[:20] for city in ["Freiburg", "Frieburg", "Freiborg", "Hamborg", "Sahrluis"]: neighbors = get_neighbors(cities, cities, city, 2, distance=levenshtein) print("vote_distance_weights: ", vote_distance_weights(neighbors))

If you work under Linux (especially Ubuntu), you can find a file with a British-English dictionary under /usr/share/dict/british-english. Windows users and others can download the file as

british-english.txtWe use extremely misspelled words in the following example. We see that our simple vote_prob function is doing well only in two cases: In correcting "holpposs" to "helpless" and "blagrufoo" to "barefoot". Whereas our distance voting is doing well in all cases. Okay, we have to admit that we had "liberty" in mind, when we wrote "liberdi", but suggesting "liberal" is a good choice.

In [ ]:words = [] with open("british-english.txt") as fh: for line in fh: word = line.strip() words.append(word) for word in ["holpful", "kundnoss", "holpposs", "blagrufoo", "liberdi"]: neighbors = get_neighbors(words, words, word, 3, distance=levenshtein) print("vote_distance_weights: ", vote_distance_weights(neighbors, all_results=False)) print("vote_prob: ", vote_prob(neighbors))

### Using sklearn for kNN

*neighbors* is a package of the *sklearn*, which provides functionalities for nearest neighbor classifiers both for unsupervised and supervised learning.

The classes in sklearn.neighbors can handle both Numpy arrays and scipy.sparse matrices as input. For dense matrices, a large number of possible distance metrics are supported. For sparse matrices, arbitrary Minkowski metrics are supported for searches.

*scikit-learn* implements two different nearest neighbors classifiers:

- KNeighborsClassifier
- is based on the k nearest neighbors of a sample, which has to be classified. The number 'k' is an integer value specified by the user. This is the most frequently used classifiers of both algorithms.
- RadiusNeighborsClassifier
- is based on the number of neighbors within a fixed radius r for each sample which has to be classified. 'r' is float value specified by the user. This classifier is less often used.

There is no general way to define an optimal value for 'k'. This value depends on the data. As a general rule we can say that increasing 'k' reduces the noise but on the other hand makes the boundaries less distinct.

The decision based on the nearest neighbors can be reached either uniform weights, the class assigned to a query sample is calculated by a simple majority vote of the k-nearest neighbors. This does not take into account that the neighbors closer to the sample should contribute more than the ones further away. The weighting can be controlled by the *weights* keyword:

weights = 'uniform' assigns uniform weights to each neighbor. This is also the default value.

weights = 'distance' assigns weights proportional to the inverse of the distance from the query sample.

It is also possible to supply a user-defined function to compute the distance.

**Parameters**

- n_neighbors
- int, optional (default = 5)

Number of neighbors to use by default for meth:'kneighbors' queries. - weights
- str or callable, optional (default = 'uniform')

weight function used in prediction. Possible values:- 'uniform' : uniform weights. All points in each neighborhood are weighted equally.
- 'distance' : weight points by the inverse of their distance. in this case, closer neighbors of a query point will have a greater influence than neighbors which are further away.
- [callable] : a user-defined function which accepts an array of distances, and returns an array of the same shape containing the weights.

- algorithm
- optional Algorithm used to compute the nearest neighbors:
- 'ball_tree' will use :class:'BallTree'
- 'kd_tree' will use :class:'KDTree'
- 'brute' will use a brute-force search.
- 'auto' will attempt to decide the most appropriate algorithm based on the values passed to :meth:'fit' method.

Note: fitting on sparse input will override the setting of this parameter, using brute force.

- leaf_size
- int, optional (default = 30)

Leaf size passed to BallTree or KDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem. - p
- integer, optional (default = 2)

Power parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used. - metric
- string or callable, default 'minkowski'

the distance metric to use for the tree. The default metric is minkowski, and with p=2 is equivalent to the standard Euclidean metric. See the documentation of the DistanceMetric class for a list of available metrics. - metric_params
- dict, optional (default = None)

Additional keyword arguments for the metric function. - n_jobs
- int, optional (default = 1)

The number of parallel jobs to run for neighbors search. If '-1', then the number of jobs is set to the number of CPU cores. Doesn't affect :meth:'fit' method.

### Example with kNN

We will use the k-nearest neighbor classifier 'KNeighborsClassifier' from 'sklearn.neighbors' on the Iris data set:

In [ ]:# Create and fit a nearest-neighbor classifier from sklearn.neighbors import KNeighborsClassifier knn = KNeighborsClassifier() knn.fit(learnset_data, learnset_labels) KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=5, p=2, weights='uniform') print("Predictions form the classifier:") print(knn.predict(testset_data)) print("Target values:") print(testset_labels)In [ ]:

learnset_data[:5], learnset_labels[:5]

We can see that the sample with the index 8 is again not recognized properly.