Introduction

K-Nearest Neighbor Matching is to classify a new input vector x, examine the k-closest training data points to x and assign the object to the most frequently occurring class. Optionally, we give closer points larger weights and more distant points smaller weights. Common value for k is 3 or 5. At k=1, the error rate is always zero for the training sample because the closest point to any training data point is itself. Therefore, it will always overfit.

Keep in Mind

When to Consider	Advantages	Disadvantages
Instances map to points in $R^{n}$	Traning is very fast	Slow at query time
Less than 20 attributes per instance	Learn complex target functions	Easily fooled by irrelevant attributes
Lots of training data	Do not lose information

Also Consider

Distance measure
- Most common: Euclidean distance
- Euclidean distance makes sense when different measurements are commensurate; each is variable measured in the same units.
- If the measurements are different, say length and weight, it is not clear.

\[d_{E}(x^{i}, x^{j}) = (\sum_{k=1}^{p}(x^{i}_k - x^{j}_k)^2)^\frac{1}{2}\]

Standardization
- When variables are not commensurate, we want to standardize them by dividing by the sample standard deviation. This makes them all equally important.
- The estimate for the standard deviation of $x_k$: $\hat{\sigma}_k = \biggl(\frac{1}{n}\sum_{i=1}^{n}(x^{i}_k - \bar{x}_k)^2\biggr)^\frac{1}{2}$
  
  where $\bar{x}_k$ is the sample mean: $\bar{x}_k = \frac{1}{n}\sum_{i=1}^{n}x^i_k$
Weighted Euclidean Distance
- Finally, if we have some idea of the relative importance of each variable, we can weight them:

\[d_{WE}(i,j) = \biggl(\sum_{k=1}^{p}w_k(x^i_k - x^j_k)^2\biggr)^\frac{1}{2}\]

Choosing k
- Increasing k reduces variance and increases bias.
For high-dimensional space, problem that the nearest neighbor may not be very close at all.
Memory-based technique. Must make a pass through the data for each classification. This can be prohibitive for large data sets.

Implementations

Python

For KNN, it is not required to import packages other than numpy. You can basically do KNN with one package because it is mostly about computing distance and normalization. You would need TensorFlow and Keras as you try more advanced algorithms such as convolutional neural network.

import argparse
import numpy as np
from collections import Counter

# Process arguments for k-NN classification
def handle_args():
    parser = argparse.ArgumentParser(description=
                 'Make predictions using the k-NN algorithms.')

    parser.add_argument('-k', type=int, default=1, help='Number of nearest neighbors to consider')
    parser.add_argument('--varnorm', action='store_true', help='Normalize features to zero mean and unit variance')
    parser.add_argument('--rangenorm', action='store_true', help='Normalize features to the range [-1,+1]')
    parser.add_argument('--exnorm', action='store_true', help='Normalize examples to unit length')
    parser.add_argument('train',  help='Training data file')
    parser.add_argument('test',   help='Test data file')

    return parser.parse_args()


# Load data from a file
def read_data(filename):
  data = np.genfromtxt(filename, delimiter=',', skip_header=1)
  x = data[:, 0:-1]
  y = data[:, -1]
  return (x,y)


# Distance between instances x1 and x2
def dist(x1, x2):
    euclidean_distance = np.linalg.norm(x1 - x2)
    return euclidean_distance


# Predict label for instance x, using k nearest neighbors in training data
def classify(train_x, train_y, k, x):
    dists = np.sqrt(np.sum((x - train_x) ** 2, axis=1))
    idx = np.argsort(dists, 0)[:k]
    k_labels = [train_y[index] for index in idx]
    prediction = list()
    prediction.append(max(k_labels, key=k_labels.count))
    prediction = np.array(prediction)
    return prediction


# Process the data to normalize features and/or examples.
# NOTE: You need to normalize both train and test data the same way.
def normalize_data(train_x, test_x, rangenorm, varnorm, exnorm):
  if rangenorm:
    train_x = 2 * (train_x - np.min(train_x, axis=0)) / np.nan_to_num(np.ptp(train_x, axis=0)) - 1
    test_x = 2 * (test_x - np.min(test_x, axis=0)) / np.nan_to_num(np.ptp(train_x, axis=0)) - 1

    pass

  if varnorm:
    train_x = (train_x - np.mean(train_x, axis=0)) / np.nan_to_num(np.std(train_x, axis=0))
    test_x = (test_x - np.mean(test_x, axis=0)) / np.nan_to_num(np.std(test_x, axis=0))

    pass

  if exnorm:
    for i in train_x:
      train_x = i / np.linalg.norm(i)
    for k in test_x:
      test_x = k / np.linalg.norm(k)

    pass

  return train_x, test_x


# Run classifier and compute accuracy
def runTest(test_x, test_y, train_x, train_y, k):
  correct = 0
  for (x,y) in zip(test_x, test_y):
    if classify(train_x, train_y, k, x) == y:
      correct += 1
  acc = float(correct)/len(test_x)
  return acc


# Load train and test data.  Learn model.  Report accuracy.
def main():

  args = handle_args()

  # Read in lists of examples.  Each example is a list of attribute values,
  # where the last element in the list is the class value.
  (train_x, train_y) = read_data(args.train)
  (test_x, test_y)   = read_data(args.test)

  # Normalize the training data
  (train_x, test_x) = normalize_data(train_x, test_x, 
                          args.rangenorm, args.varnorm, args.exnorm)
    
  acc = runTest(test_x, test_y,train_x, train_y,args.k)
  print("Accuracy: ",acc)

if __name__ == "__main__":
  main()

A very simple way to also get a very basic KNN down in Python is leverage the knowledge of the many smart people that contribute to sci-kit learn library (sklean) as it is a powerhouse of machine learning models, as well as other very useful tools like data splitting, model evaluation, and feature selections.

#Import Libraries
from seaborn import load_dataset
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load a sample dataset
iris_df = load_dataset('iris')


# Quick and rough sketch comparing the petal feature to species
sns.scatterplot(data=iris_df, x='petal_length', y='petal_width', hue='species')


# Quick and rough sketch comparing the sepals feature to species
sns.scatterplot(data=iris_df, x='sepal_length', y='sepal_width', hue='species')



# Let's seperate the data into X and Y (features and target)
X = iris_df.drop(columns='species')
Y = iris_df['species']


# Split the data into training and testing for model evaluations
X_train, X_test, y_train, y_test = train_test_split(X, Y, train_size=.70, shuffle=True,
                                                    random_state=777)


# Iterate through different neighbors to find the best accuracy with N neighbors.
accuracies = {}
errors = {}
for i in range(1, 15):
    clf = KNeighborsClassifier(n_neighbors=i)

    clf.fit(X=X_train, y=y_train)
    y_pred = clf.predict(X_test)
    
    accu_score = accuracy_score(y_true=y_test, y_pred=y_pred)
    accuracies[i] = accu_score

sns.lineplot(x=accuracies.keys(), y=accuracies.values()).set_title('Accuracies by N-Neighbors')

# Looks like about 8 is the first best accuracy, so we'll go with that.
print(f"{accuracies[8]:.1%}") #100% accuracy for 8 neighbors.

R

The simplest way to perform KNN in R is with the package class. It has a KNN function that is rather user friendly and does not require you to do distance computing as it runs everything with Euclidean distance. For more advanced types of nearest neighbors matching it would be best to use the matchit function from the matchit package. To verify results this example also used the confusionMatrix function from the package caret.

Due to how this package is designed the most room for error is during normalization, by normalizing variables that do not require or cannot accept it, like character variables. Another good source of error is not including drop = TRUE for your target, or y, vector which will prevent the model from running. Finally, given the way this example verifies results it is vital to convert the target into a factor as the data has to be in similar kind in order for R to give you an output.

This walkthrough uses data from the UCI Machine Learning Repository under Breast Cancer Wisconsin (Diagnostic) Data Set (Rdocumentation for KNN). Also, statology’s “how to create a confusion matrix”.

library(tidyverse)
library(readr)

# For KNN
library(class)
library(caret)

# Import the Dataset
df <- read_csv("https://github.com/LOST-STATS/lost-stats.github.io/files/7088929/wdbc.csv")
view(df)

# the first column is an identifier so remove that, anything that does not aid in classifying can be removed
df <- df[-1]

# See the count of the target, either B, benign, or M, malignant
table(df[1])

# Normalize the Dataset

normal <- function(x) { return ((x - min(x)) / (max(x) - min(x))) }

# Apply to what needs to be normalized, in this case not the target
df_norm <- as.data.frame(lapply(df[2:31], normal))

# Verify that normalization has occurred
summary(df_norm[1])
summary(df_norm[3])
summary(df_norm[11])
summary(df_norm[23])

# Split the dataframe into test and train datasets - note there are two dataframes
# First test and train from the features, here is an example of about a 70/30 split for testing and training
x_train <- df_norm[1:397,]
x_test <- df_norm[398:568,]

# Now test and train for the target - here it is important that you do ", 1" to indicate only one column
# It will not work unless you use drop = TRUE
y_train <- df[1:397, 1, drop = TRUE]
y_test <- df[398:568, 1, drop = TRUE]

# The purpose of installing those packages were to use these next functions, first KNN
# Like the Python example states, the best practice for a choice of K unless assigned is the square root of the number of observations
pred <- knn(train = x_train, test = x_test, cl = y_train, k = 23)

# Confusion Matrix from Caret

# KNN converts to a factor with two levels so we need to make sure the test dataset is similar
y_test <- y_test %>% factor(levels = c("B", "M"))

# See how well the model did
confusionMatrix(y_test, pred)