python-course.eu

9. k-Nearest-Neighbor Classifier with sklearn

By Bernd Klein. Last modified: 06 Mar 2024.

Introduction

The underlying concepts of the k-Nearest-Neighbor classifier (kNN) can be found in the chapter k-Nearest-Neighbor Classifier of our Machine Learning Tutorial. In that chapter we also presented simple Python functions to demonstrate the fundamental principles.

Even though those functions produced impressive results, we recommend using the functionality of the sklearn module instead. We have already used sklearn in previous chapters.

Using sklearn for kNN

sklearn.neighbors is a package of the sklearn module which provides functionality for nearest-neighbor classifiers, for both unsupervised and supervised learning.

The classes in sklearn.neighbors can handle both NumPy arrays and scipy.sparse matrices as input. For dense matrices, a large number of possible distance metrics are supported. For sparse matrices, arbitrary Minkowski metrics are supported for searches.
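
As a small sketch with made-up data: a classifier from sklearn.neighbors accepts a scipy.sparse matrix just as well as a dense NumPy array.

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0, 1], [1, 1], [5, 5], [6, 5]], dtype=float)
y = np.array([0, 0, 1, 1])

knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(csr_matrix(X), y)                      # sparse training data
print(knn.predict(csr_matrix([[0.5, 1.0]])))   # both nearest neighbors belong to class 0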

scikit-learn implements two different nearest neighbors classifiers:

KNeighborsClassifier
is based on the k nearest neighbors of the sample which has to be classified. The number 'k' is an integer value specified by the user. It is the more frequently used of the two classifiers (see the sketch below).
RadiusNeighborsClassifier
is based on the neighbors within a fixed radius r around the sample which has to be classified. 'r' is a float value specified by the user. This classifier is used less often.
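
A minimal sketch of how both classifiers are instantiated; the parameter values are only examples:

from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)      # classify by the 5 nearest neighbors
rnc = RadiusNeighborsClassifier(radius=1.0)    # classify by all neighbors within radius 1.0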

KNeighborsClassifier with sklearn

We will artificially create a dataset with three classes to test the k-nearest-neighbor classifier 'KNeighborsClassifier' from 'sklearn.neighbors'. We described how to create such datasets in our chapter Data Set Creation for Machine Learning.

from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np

centers = [[2, 3], [5, 5], [1, 8]]
n_classes = len(centers)
data, labels = make_blobs(n_samples=150, 
                          centers=np.array(centers),
                          random_state=1)

Let us visualize what we have created:

import matplotlib.pyplot as plt

colours = ('green', 'red', 'blue')
n_classes = 3

fig, ax = plt.subplots()
for n_class in range(0, n_classes):
    ax.scatter(data[labels==n_class, 0], data[labels==n_class, 1], 
               c=colours[n_class], s=10, label=str(n_class))



ax.legend(loc='upper right');

Now we have to split the data into a training and a test set.

from sklearn.model_selection import train_test_split
res = train_test_split(data, labels, 
                       train_size=0.8,
                       test_size=0.2,
                       random_state=1)

train_data, test_data, train_labels, test_labels = res 

We are now ready to perform the classification with the KNeighborsClassifier:

# Create and fit a nearest-neighbor classifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(train_data, train_labels) 

predicted = knn.predict(test_data)
print("Predictions from the classifier:")
print(predicted)
print("Target values:")
print(test_labels)

OUTPUT:

Predictions from the classifier:
[2 2 2 0 0 1 1 2 2 1 0 1 0 0 2 0 0 0 1 0 0 1 1 2 0 0 0 1 2 1]
Target values:
[2 2 2 0 0 1 1 2 2 1 0 1 0 0 2 0 0 0 1 0 0 1 1 2 0 0 0 1 2 1]

To evaluate the result, we will use accuracy_score from the module sklearn.metrics. To see how accuracy_score works, we will use a simple example with pseudo predictions and labels:

from sklearn.metrics import accuracy_score
example_predictions = [0, 2, 1, 3, 2, 0, 1]
example_labels      = [0, 1, 2, 3, 2, 1, 1]
print(accuracy_score(example_predictions, example_labels))

OUTPUT:

0.5714285714285714

The return value corresponds to the quotient of the number of correctly classified items and the total number of predictions. If you are only interested in the number of correctly classified items, you can set the parameter normalize to False. The default value is True.

print(accuracy_score(example_predictions, 
                     example_labels,
                     normalize=False))

OUTPUT:

4

Now we are ready to evaluate the results of our previous classification example:

print(accuracy_score(predicted, test_labels))

OUTPUT:

1.0

You may have noticed that we instantiated the k-nearest neighbor classifier in our previous example by calling it without any arguments, i.e. KNeighborsClassifier(). In the following, we instantiate it with some possible keyword parameters:

knn = KNeighborsClassifier(algorithm='auto', 
                           leaf_size=30, 
                           metric='minkowski',
                           p=2,
                           metric_params=None, 
                           n_jobs=1, 
                           n_neighbors=5, 
                           weights='uniform')

The parameter metric is 'minkowski' by default. We explained the Minkowski distance in our chapter k-Nearest-Neighbor Classifier. The parameter p is the p of the Minkowski formula: when p is set to 1, this is equivalent to using the manhattan_distance, and the euclidean_distance will be used if p is assigned the value 2.
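
The effect of p can be illustrated with plain NumPy; this is a small sketch with two made-up points:

import numpy as np

a, b = np.array([2, 3]), np.array([5, 7])

def minkowski(x, y, p):
    return np.sum(np.abs(x - y) ** p) ** (1 / p)

print(minkowski(a, b, 1))   # 7.0 -> Manhattan distance |2-5| + |3-7|
print(minkowski(a, b, 2))   # 5.0 -> Euclidean distance sqrt(3**2 + 4**2)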

The parameter 'algorithm' determines which algorithm will be used to compute the nearest neighbors: 'ball_tree', 'kd_tree', 'brute' (a brute-force search), or 'auto', which tries to decide the most appropriate algorithm based on the data passed to the fit method.

The parameter leaf_size is needed by BallTree or KDTree. It can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.
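
As a sketch, we could force the classifier above to use a k-d tree with a different leaf size; whether this pays off depends on the data, and the values below are only examples. The parameter weights can also be set to 'distance', in which case closer neighbors have a greater influence on the prediction than more distant ones.

knn_kd = KNeighborsClassifier(n_neighbors=5,
                              algorithm='kd_tree',   # instead of 'auto'
                              leaf_size=50,          # default is 30
                              weights='distance')    # weight neighbors by inverse distance
knn_kd.fit(train_data, train_labels)
print(knn_kd.score(test_data, test_labels))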

Using the Iris Data

In the following example we will use the Iris data set:

from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
data, labels = iris.data, iris.target

res = train_test_split(data, labels, 
                       train_size=0.8,
                       test_size=0.2,
                       random_state=12)
train_data, test_data, train_labels, test_labels = res 
# Create and fit a nearest-neighbor classifier
from sklearn.neighbors import KNeighborsClassifier
# classifier "out of the box", no parameters
knn = KNeighborsClassifier()
knn.fit(train_data, train_labels) 


print("Predictions from the classifier:")
test_data_predicted = knn.predict(test_data)
print(test_data_predicted)
print("Target values:")
print(test_labels)

OUTPUT:

Predictions from the classifier:
[0 2 0 1 2 2 2 0 2 0 1 0 0 0 1 2 2 1 0 2 0 1 2 1 0 2 1 1 0 0]
Target values:
[0 2 0 1 2 2 2 0 2 0 1 0 0 0 1 2 2 1 0 1 0 1 2 1 0 2 1 1 0 0]

print(accuracy_score(test_data_predicted, test_labels))

OUTPUT:

0.9666666666666667
print("Predictions from the classifier:")
learn_data_predicted = knn.predict(train_data)
print(learn_data_predicted)
print("Target values:")
print(train_labels)
print(accuracy_score(learn_data_predicted, train_labels))

OUTPUT:

Predictions from the classifier:
[0 1 2 0 2 0 1 1 0 1 1 0 0 0 0 0 0 0 2 0 2 1 1 1 0 2 1 1 2 0 2 0 2 1 2 2 1
 1 1 2 2 0 2 2 0 1 0 2 2 0 1 1 0 0 1 1 1 1 2 1 2 0 0 1 1 2 0 2 1 0 2 2 1 2
 2 0 0 2 1 1 2 0 1 1 0 1 1 2 2 1 0 2 0 2 0 0 1 2 2 1 2 2 0 1 1 0 2 2 2 1 2
 2 2 0 0 1 0 2 2 1]
Target values:
[0 1 2 0 2 0 1 1 0 1 1 0 0 0 0 0 0 0 2 0 2 1 1 1 0 2 1 1 2 0 2 0 2 2 2 2 1
 1 1 1 2 0 2 2 0 1 0 2 2 0 1 1 0 0 1 1 1 1 2 1 2 0 0 1 1 1 0 2 1 0 2 2 1 2
 2 0 0 2 1 1 2 0 1 1 0 1 1 2 2 1 0 2 0 2 0 0 1 2 2 1 2 2 0 1 1 0 2 2 2 1 2
 2 2 0 0 1 0 2 2 1]
0.975

knn2 = KNeighborsClassifier(algorithm='auto', 
                            leaf_size=30, 
                            metric='minkowski',
                            p=2,         # p=2 is equivalent to the Euclidean distance
                            metric_params=None, 
                            n_jobs=1, 
                            n_neighbors=5, 
                            weights='uniform')

knn2.fit(train_data, train_labels) 
test_data_predicted = knn2.predict(test_data)
accuracy_score(test_data_predicted, test_labels)

OUTPUT:

0.9666666666666667

RadiusNeighborsClassifier

The k-nearest neighbor classifier operates by expanding a circle around the unknown sample (i.e., the item needing classification) until it encompasses exactly k neighboring items. In contrast, the Radius Neighbors Classifier uses a fixed radius to define its search space: it identifies all items of the training dataset that fall within this radius around the item awaiting classification. As a consequence of the fixed radius, dense regions of the feature distribution contribute more information to the decision, while sparse regions contribute less.

In the context of the k-nearest neighbor (KNN) classifier, as the number of neighbors (k) increases, the algorithm expands its search radius to include more neighboring points in the classification process. If k is set too high, the classifier might eventually include points from another class that are far away from the sample being classified.

This situation can occur when the dataset has regions where classes are not well-separated, leading to overlapping regions between classes. As the search radius increases, the algorithm may start including points from distant classes, which can result in misclassification or reduced performance.
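
A tiny sketch with made-up one-dimensional data illustrates this effect: with a small k the query point is classified by its close neighbors, while a k that is too large lets the distant majority class outvote them.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0.0], [0.2], [0.4],                   # three points of class 0 near the query
              [5.0], [5.1], [5.2], [5.3], [5.4]])    # five points of class 1 far away
y = np.array([0, 0, 0, 1, 1, 1, 1, 1])

print(KNeighborsClassifier(n_neighbors=3).fit(X, y).predict([[0.1]]))  # -> [0]
print(KNeighborsClassifier(n_neighbors=7).fit(X, y).predict([[0.1]]))  # -> [1]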

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Generate the dataset
centers = [[-2, 2], [2, 2]]
X, y = make_blobs(n_samples=100, centers=centers, cluster_std=0.5, random_state=42)

X[:3], y[:3]

OUTPUT:

(array([[ 2.23661881,  1.96358554],
        [-2.41960876,  1.84539381],
        [ 1.04061439,  1.98674306]]),
 array([1, 0, 1]))

We now create an isolated cluster of 3 items of class 0 outside of the main cluster of class 0. We append these items to the dataset with concatenate:

centers = [[0.2, 2.2]]
X1, y1 = make_blobs(n_samples=3, centers=centers, cluster_std=0.1, random_state=42)

X = np.concatenate((X, X1), axis=0)
y = np.concatenate((y, y1))

Let's visualize our two classes:

# Plot the data
plt.figure(figsize=(10, 6))
for class_value in range(2):
    # select indices of points with the current class label
    row_ix = np.where(y == class_value)
    # plot points for the current class
    plt.scatter(X[row_ix, 0], X[row_ix, 1], label=str(class_value))
plt.title('Generated Classification Dataset with Blobs')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()

With the following code we split the dataset into training and testing sets for both features and labels, with 80% of the data used for training and 20% (test_size=0.2) used for testing.

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

We will now visualize our training and testing sets. Green and blue points represent the training data, while light green and light blue points represent the testing data. We observe that three points of the green class (class 0) are situated in close proximity to the blue class (class 1). In such cases, the Radius Neighbors Classifier tends to perform better, unless we choose a sufficiently small value for k.

import matplotlib.pyplot as plt


fig, ax = plt.subplots()
# plotting learn data
colours = ('green', 'blue')
for n_class in range(2):
    ax.scatter(X_train[y_train==n_class][:, 0], 
               X_train[y_train==n_class][:, 1], 
               c=colours[n_class], s=40, label=str(n_class))
    
    
# plotting test data
colours = ('lightgreen', 'lightblue')
for n_class in range(2):
    ax.scatter(X_test[y_test==n_class][:, 0], 
               X_test[y_test==n_class][:, 1], 
               c=colours[n_class], s=40, label=str(n_class))

ax.plot()

OUTPUT:

[]

from sklearn.neighbors import RadiusNeighborsClassifier

# Instantiate the RadiusNeighborsClassifier
rnc = RadiusNeighborsClassifier(radius=0.5)

# Fit the model to the training data
rnc.fit(X_train, y_train)

# Predict the labels for the testing data
y_pred = rnc.predict(X_test)

# Evaluate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

OUTPUT:

Accuracy: 0.9685534591194969

# Instantiate the KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=15)

# Fit the model to the training data
knn.fit(X_train, y_train)

# Predict the labels for the testing data
y_pred = knn.predict(X_test)

# Evaluate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

OUTPUT:

Accuracy: 0.9622641509433962

Setting the number of neighbors to 4 also enables the k-nearest neighbors classifier to perform effectively in this scenario.

knn = KNeighborsClassifier(n_neighbors=4)

# Fit the model to the training data
knn.fit(X_train, y_train)

# Predict the labels for the testing data
y_pred = knn.predict(X_test)

# Evaluate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

OUTPUT:

Accuracy: 0.9622641509433962

Another Simple Example with RadiusNeighborsClassifier

from sklearn.neighbors import RadiusNeighborsClassifier

X = [[0, 1], [0.5, 1], [3, 1], [3, 2], [1.3, 0.8], [2.5, 2.5], [2.4, 2.6]]
y = [0, 0, 1, 1, 0, 1, 1]

neigh = RadiusNeighborsClassifier(radius=1.0)
neigh.fit(X, y)

print(neigh.predict([[1.5, 1.2]]))

print(neigh.predict([[3.1, 2.1]]))

OUTPUT:

[0]
[1]

If we try to make a prediction on a sample like [30, 20], the algorithm cannot find any neighbors within the radius 1.0. It will therefore raise an exception with the following text:

ValueError: No neighbors found for test samples array([0]), you can try using larger radius, giving a label for outliers, or considering removing them from your dataset.
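
If we do not want the prediction to fail with this exception, we can catch it; this is a small sketch using the classifier fitted above:

try:
    print(neigh.predict([[30, 20]]))
except ValueError as err:
    print("No neighbors within the radius:", err)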

Alternatively, there is a parameter for setting the label for outliers, i.e. outlier_label.

There are three ways to use it:

  1. A manual label: a str or int label (it should have the same type as the labels in our data), or a list of manual labels if multi-output is used.
  2. It can be set to the value 'most_frequent'. This will assign the most frequently occurring label of the data set to outliers.
  3. If it is set to None (the default), a ValueError will be raised when an outlier is detected.

Let's do it again with 'most_frequent':

neigh = RadiusNeighborsClassifier(radius=1.0,
                                  outlier_label='most_frequent')
neigh.fit(X, y)

print(neigh.predict([[1.5, 1.2]]))

# the following is the previously mentioned outlier:
print(neigh.predict([[30, 20]]))

OUTPUT:

[0]
[1]

Alternatively, we set the outlier class to 2. We add one outlier element to our learnset:

from sklearn.neighbors import RadiusNeighborsClassifier

X = [[0, 1], [0.5, 1], [3, 1], [3, 2], [1.3, 0.8], [2.5, 2.5], [2.4, 2.6], [10000, -2321]]
y = [0, 0, 1, 1, 0, 1, 1, 2]

neigh = RadiusNeighborsClassifier(radius=1.0,
                                  outlier_label=2)
neigh.fit(X, y)

print(neigh.predict([[1.5, 1.2]]))
print(neigh.predict([[30, 20]]))

OUTPUT:

[0]
[2]

Let's work again on a larger dataset:

from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np


centers = [[2, 3], [9, 2], [7, 9]]
n_classes = len(centers)
data, labels = make_blobs(n_samples=255, 
                          centers=np.array(centers),
                          cluster_std = 1.3,
                          random_state=1)
data[:5]

OUTPUT:

array([[10.88685804,  1.1965521 ],
       [ 9.67101133,  9.0694324 ],
       [ 4.56489073, 10.19679965],
       [ 8.99754107,  0.18439345],
       [ 1.10084102,  2.48422042]])

import matplotlib.pyplot as plt

colours = ('green', 'red', 'blue')
n_classes = 3    # not using the outlier 'class'

fig, ax = plt.subplots()
for n_class in range(0, n_classes):
    ax.scatter(data[labels==n_class, 0], data[labels==n_class, 1], 
               c=colours[n_class], s=10, label=str(n_class))

res = train_test_split(data, labels, 
                       train_size=0.8,
                       test_size=0.2,
                       random_state=1)
train_data, test_data, train_labels, test_labels = res 

Let's add one row to the end of train_data which contains outlier data, i.e. data not belonging to any of the classes:

outlier = [4242.2, 4242.2]
train_data = np.vstack([train_data, outlier])
train_data[-3:]

OUTPUT:

array([[   8.42869523,    7.82787516],
       [   8.01064497,    8.84559748],
       [4242.2       , 4242.2       ]])

Now we have to add an outlier label to the labels.

outlier_label = len(np.unique(labels))
train_labels = np.append(train_labels, outlier_label)
train_labels[-10:]

OUTPUT:

array([0, 0, 0, 1, 0, 0, 0, 2, 2, 3])

With the unique command we can check which classes are available:

np.unique(train_labels)

OUTPUT:

array([0, 1, 2, 3])

In the following code we initialize a Radius Neighbors Classifier with a specified radius, train it on a training dataset, and then use the trained classifier to predict labels for a separate test dataset.

rnn = RadiusNeighborsClassifier(radius=1)
rnn.fit(train_data, train_labels)
predicted = rnn.predict(test_data)
print(accuracy_score(predicted, test_labels))

OUTPUT:

1.0

Let's shrink the radius:

rnn = RadiusNeighborsClassifier(radius=0.9,
                                outlier_label=outlier_label)
rnn.fit(train_data, train_labels)
predicted = rnn.predict(test_data)
print(accuracy_score(predicted, test_labels))

OUTPUT:

0.9803921568627451

Let's create some outliers and test them:

centers = [[100, 300]]
data_outliers, labels_outliers = make_blobs(n_samples=10, 
                                  centers=np.array(centers),
                                  random_state=1)
predicted = rnn.predict(data_outliers)
predicted

OUTPUT:

array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3])

A good value for k is the square root of the number of samples in our dataset:

k = int(len(labels) ** 0.5)
# make this value odd:
if k % 2 == 0:
    k += 1
k

OUTPUT:

15

Let us compare this with a k nearest neighbor classifier:

knn = KNeighborsClassifier(algorithm='auto', 
                     leaf_size=30, 
                     metric='minkowski',
                     metric_params=None, 
                     n_jobs=1, 
                     n_neighbors=k, # default is 5
                     p=2,         # p=2 is equivalent to the Euclidean distance
                     weights='uniform')

knn.fit(data, labels)   # note: fitted on the complete dataset, including the test samples
predicted = knn.predict(test_data)
print(accuracy_score(predicted, test_labels))

OUTPUT:

1.0

from sklearn.metrics import confusion_matrix 
# Evaluate Model
cm = confusion_matrix(predicted, test_labels)
print(cm) 

OUTPUT:

[[24  0  0]
 [ 0 18  0]
 [ 0  0  9]]

predicted = knn.predict(data_outliers)
predicted

OUTPUT:

array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

We can see that all the outliers have been wrongly classified as class 2, because this is the closest existing class to the outliers. In the following, we create three clusters of outliers:

centers = [[100, 300], [10, -10], [-200, -200]]
data_outliers2, labels_outliers2 = make_blobs(n_samples=30, 
                                              centers=np.array(centers),
                                              random_state=1)

predicted = knn.predict(data_outliers2)
predicted

OUTPUT:

array([2, 2, 2, 1, 0, 2, 1, 1, 1, 1, 1, 0, 0, 2, 0, 0, 2, 0, 2, 0, 1, 1,
       2, 2, 1, 2, 0, 1, 0, 0])

The outliers are assigned to the existing clusters even though they are far away from them. The RadiusNeighborsClassifier, on the other hand, will recognize them as outliers:

rnn = RadiusNeighborsClassifier(radius=0.9,
                                outlier_label=outlier_label)
rnn.fit(train_data, train_labels)
predicted = rnn.predict(data_outliers2)
predicted

OUTPUT:

array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3])

Determining the Optimal k Value

As we have written above, a good value for k is usually the square root of n, where n is the total number of samples in our dataset.

We can also determine a value for k by plotting the accuracy values for different k values:

import matplotlib.pyplot as plt

from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np


n_classes = 6
data, labels = make_blobs(n_samples=1000, 
                          centers=n_classes,
                          cluster_std = 1.3,
                          random_state=1)
import matplotlib.pyplot as plt

colours = ('green', 'red', 'blue', 'magenta', 'yellow', 'pink')

fig, ax = plt.subplots()
for n_class in range(0, n_classes):
    ax.scatter(data[labels==n_class, 0], data[labels==n_class, 1], 
               c=colours[n_class], s=10, label=str(n_class))

res = train_test_split(data, labels, 
                       train_size=0.7,
                       test_size=0.3,
                       random_state=1)
train_data, test_data, train_labels, test_labels = res 

print(len(train_data), len(test_data), len(train_labels))

X, Y = [], []
for k in range(1, 25):
    classifier = KNeighborsClassifier(n_neighbors=k, 
                                      p=2,    # Euclidean
                                      metric="minkowski")
    classifier.fit(train_data, train_labels)
    predictions = classifier.predict(test_data)
    score = accuracy_score(test_labels, predictions)
    X.append(k)
    Y.append(score)
    


fig, ax = plt.subplots()
ax.set_xlabel('k')
ax.set_ylabel('accuracy')
ax.plot(X, Y, "go")

OUTPUT:

700 300 700
[<matplotlib.lines.Line2D at 0x7f4324a1e290>]

Exercises

Exercise 1

Classify the data in "strange_flowers.txt" with a k nearest neighbor classifier.

Exercise 2

Classify the data in "fruits_data.txt" with a k nearest neighbor classifier.

Exercise 3

Use sklearn to correct the city names from the previous chapter:

The misspelled city names are: "Freiburg", "Frieburg", "Freiborg", "Hamborg", "Sahrluis"

The correct city names are saved in data/city_names.txt

The Levenshtein distance can be imported with

import Levenshtein

It needs to be installed first:

pip install python-Levenshtein

Exercise 4

Do the same now for the misspelled words "holpful", "kundnoss", "holpposs", "thoes", "innerstand", "blagrufoo" and "liberdi"

Use the file british-english.txt for a list of all correctly spelled English words!

Solutions

Solution to Exercise 1

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler # scaling prevents features with large values from dominating the distance
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix 
from sklearn.metrics import f1_score 
from sklearn.metrics import accuracy_score 

dataset = pd.read_csv("data/strange_flowers.txt", 
                      header=None, 
                      names=["red", "green", "blue", "size", "label"],
                      sep=" ")
dataset
OUTPUT:

       red  green   blue  size  label
0    252.0   96.0   10.0  3.63    1.0
1    249.0  115.0   10.0  3.59    1.0
2    235.0  107.0    0.0  3.81    1.0
3    255.0  110.0    6.0  3.91    1.0
4    247.0  104.0    8.0  3.41    1.0
..     ...    ...    ...   ...    ...
790  197.0  250.0  108.0  2.69    4.0
791  197.0  250.0  107.0  3.05    4.0
792  197.0  241.0  109.0  3.23    4.0
793  197.0  243.0   92.0  3.00    4.0
794  197.0  252.0   96.0  3.06    4.0

795 rows × 5 columns

Instead of using Pandas to read in the 'strange_flowers.txt' data, we could use 'loadtxt' from numpy:

# alternative way to read and extract the data

import numpy as np

raw_data = np.loadtxt("data/strange_flowers.txt")
data = raw_data[:,:-1]
labels = raw_data[:,-1]

We will continue now with the Pandas DataFrame object 'dataset', which we read in with 'read_csv':

data = dataset.drop('label', axis=1)
labels = dataset.label
X_train, X_test, y_train, y_test = train_test_split(data, 
                                                    labels, 
                                                    random_state=0, 
                                                    test_size=0.2) 
scaler = StandardScaler() 
X_train = scaler.fit_transform(X_train) # fit the scaler on the training data and transform it
X_test = scaler.transform(X_test) # transform the test data with the same scaling
X_train

OUTPUT:

array([[ 1.0031888 , -0.39408598, -0.38229346, -0.06392483],
       [-1.1023726 ,  1.9321053 ,  1.79682762, -1.61096419],
       [ 1.30398328, -0.51208119, -0.48484033,  0.94680755],
       ...,
       [-1.1023726 ,  1.83096655,  2.1813784 , -1.63159138],
       [-1.57504965, -0.39408598, -0.66429736,  1.48311452],
       [-1.1023726 ,  1.79725363,  2.00192137, -0.70336777]])

We set k to the square root of the size of the training set:

k = int(len(X_train) ** 0.5)
k

OUTPUT:

25

# Define the model
classifier = KNeighborsClassifier(n_neighbors=k, 
                                  metric="minkowski",
                                  p=2,    # p=2 is equivalent to the Euclidean distance
                                 )
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
y_pred

OUTPUT:

array([3., 1., 3., 4., 3., 3., 1., 4., 3., 3., 4., 1., 3., 1., 2., 2., 2.,
       3., 1., 4., 2., 3., 4., 2., 3., 3., 4., 4., 1., 2., 1., 2., 2., 3.,
       1., 3., 3., 2., 2., 2., 3., 3., 4., 1., 4., 2., 3., 2., 3., 2., 2.,
       3., 1., 3., 4., 1., 2., 4., 2., 3., 3., 4., 3., 4., 3., 2., 1., 2.,
       1., 3., 3., 1., 4., 2., 2., 3., 2., 4., 2., 4., 1., 3., 4., 2., 4.,
       3., 2., 2., 2., 3., 1., 2., 3., 3., 1., 4., 2., 2., 2., 2., 1., 1.,
       4., 3., 3., 3., 2., 1., 1., 4., 2., 3., 3., 1., 2., 4., 3., 1., 1.,
       2., 1., 4., 3., 4., 2., 2., 3., 2., 4., 1., 4., 2., 4., 4., 4., 4.,
       4., 2., 4., 4., 4., 2., 3., 2., 1., 2., 2., 3., 1., 1., 3., 1., 2.,
       4., 2., 4., 1., 3., 1.])
# Evaluate Model
cm = confusion_matrix(y_test, y_pred)
print(cm) 

OUTPUT:

[[28  4  0  0]
 [ 4 43  0  0]
 [ 0  0 44  0]
 [ 0  0  0 36]]
print(accuracy_score(y_test, y_pred))

OUTPUT:

0.949685534591195

Solution to Exercise 2

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

# Read in the CSV file
df = pd.read_csv('data/fruits_data.csv')

# Extract features (X) and labels (y)
X = df[['Sweetness', 'Acidity', 'Weight']].values
y = df['Fruit'].values
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    random_state=0, 
                                                    test_size=0.2) 
scaler = StandardScaler() 
X_train = scaler.fit_transform(X_train) #  transform
X_test = scaler.transform(X_test) #  transform
# Define the model
classifier = KNeighborsClassifier(n_neighbors=k, 
                                  metric="minkowski",
                                  p=2,    # Euclidean
                                 ) 
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
y_pred

OUTPUT:

array(['Apple', 'Apple', 'Mango', 'Mango', 'Mango', 'Apple', 'Apple',
       'Lemon', 'Mango', 'Apple', 'Lemon', 'Mango', 'Lemon', 'Mango',
       'Lemon', 'Apple', 'Apple', 'Apple', 'Apple', 'Mango', 'Lemon',
       'Mango', 'Lemon', 'Apple', 'Mango', 'Lemon', 'Lemon', 'Mango',
       'Apple', 'Mango', 'Lemon', 'Lemon', 'Lemon', 'Apple', 'Mango',
       'Mango', 'Apple', 'Mango', 'Lemon', 'Lemon', 'Mango', 'Mango',
       'Mango', 'Mango', 'Lemon', 'Lemon', 'Apple', 'Apple', 'Lemon',
       'Apple', 'Mango', 'Lemon', 'Mango', 'Apple', 'Apple', 'Apple',
       'Lemon', 'Apple', 'Apple', 'Lemon', 'Apple', 'Lemon', 'Lemon',
       'Apple', 'Apple', 'Apple', 'Apple', 'Lemon', 'Lemon', 'Lemon',
       'Apple', 'Lemon', 'Lemon', 'Apple', 'Apple', 'Lemon', 'Apple',
       'Lemon', 'Apple', 'Apple', 'Lemon', 'Lemon', 'Mango', 'Lemon',
       'Mango', 'Apple', 'Mango', 'Mango', 'Mango', 'Apple'], dtype=object)
# Evaluate Model
cm = confusion_matrix(y_test, y_pred)
print(cm) 

OUTPUT:

[[28  0  0]
 [ 0 31  0]
 [ 6  0 25]]
print(accuracy_score(y_test, y_pred))

OUTPUT:

0.9333333333333333

Solution to Exercise 3

import Levenshtein

# Load the file containing correct city names
with open('data/city_names.txt', 'r') as file:
    correct_city_names = file.readlines()
correct_city_names = [name.strip() for name in correct_city_names]

# Misspelled city names
misspelled_city_names = ["Freiburg", "Frieburg", "Freiborg", "Hamborg", "Sahrluis"]

# Find the closest match for each misspelled city name
for misspelled_city in misspelled_city_names:
    min_distance = float('inf')
    closest_match = None
    
    # Calculate Levenshtein distance to all correct city names
    for correct_city in correct_city_names:
        distance = Levenshtein.distance(misspelled_city, correct_city)
        if distance < min_distance:
            min_distance = distance
            closest_match = correct_city
    
    print(f"Closest match for '{misspelled_city}': {closest_match}")

OUTPUT:

Closest match for 'Freiburg': Freiburg
Closest match for 'Frieburg': Freiburg
Closest match for 'Freiborg': Freiburg
Closest match for 'Hamborg': Hamburg
Closest match for 'Sahrluis': Saarlouis

Solution to Exercise 4

import Levenshtein

# Load the file containing the correctly spelled English words
with open('british-english.txt', 'r') as file:
    correct_words = file.readlines()
correct_words = [name.strip() for name in correct_words]

# misspelled words
misspelled_words = ["holpful", "kundnoss", "holpposs", 
                    "thoes", "innerstand", "blagrufoo", 
                    "liberdi"]

# Find the closest match for each misspelled word
for misspelled_word in misspelled_words:
    min_distance = float('inf')
    closest_match = None
    
    # Calculate Levenshtein distance to all correct words
    for correct_word in correct_words:
        distance = Levenshtein.distance(misspelled_word, correct_word)
        if distance < min_distance:
            min_distance = distance
            closest_match = correct_word
    
    print(f"Closest match for '{misspelled_word}': {closest_match}")

OUTPUT:

Closest match for 'holpful': helpful
Closest match for 'kundnoss': kindness
Closest match for 'holpposs': helpless
Closest match for 'thoes': hoes
Closest match for 'innerstand': understand
Closest match for 'blagrufoo': barefoot
Closest match for 'liberdi': liberal

We are dissatisfied with the outcome of 'hoes' for 'thoes'. Therefore, let's enhance the program to also display the second closest match.

import Levenshtein

# Load the file containing correct words
with open('british-english.txt', 'r') as file:
    correct_words = file.readlines()
correct_words = [word.strip() for word in correct_words]

# Misspelled words
misspelled_words = ["holpful", "kundnoss", "holpposs", 
                    "thoes", "innerstand", "blagrufoo", 
                    "liberdi"]

# Find the closest and second closest match for each misspelled word
for misspelled_word in misspelled_words:
    distances = []
    
    # Calculate Levenshtein distance to all correct words
    for correct_word in correct_words:
        distance = Levenshtein.distance(misspelled_word, correct_word)
        distances.append((distance, correct_word))
    
    # Sort distances by the first element (distance)
    distances.sort(key=lambda x: x[0])
    
    # Get the closest and second closest matches
    closest_match = distances[0][1]
    second_closest_match = distances[1][1]
    
    print(f"Misspelled word: {misspelled_word}")
    print(f"Closest match: {closest_match}")
    print(f"Second closest match: {second_closest_match}")
    print()

OUTPUT:

Misspelled word: holpful
Closest match: helpful
Second closest match: doleful

Misspelled word: kundnoss
Closest match: kindness
Second closest match: fondness

Misspelled word: holpposs
Closest match: helpless
Second closest match: hippo's

Misspelled word: thoes
Closest match: hoes
Second closest match: shoes

Misspelled word: innerstand
Closest match: understand
Second closest match: interstate

Misspelled word: blagrufoo
Closest match: barefoot
Second closest match: Baguio

Misspelled word: liberdi
Closest match: liberal
Second closest match: liberty
