K Nearest Neighbours Algorithm

This algorithm assigns a data point the class of its nearest neighbours. It uses a majority-voting mechanism: a k-NN model looks at the ‘k’ nearest neighbours of a point to determine its class. The algorithm can be used for both classification and regression problems, although it is mostly used for classification.

K-NN is a non-parametric algorithm, which means it makes no assumptions about the underlying data distribution.
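
As a quick illustration (a minimal sketch; the toy data and the choice of k = 3 are assumptions for demonstration, not from the original text), scikit-learn’s KNeighborsClassifier implements this majority-voting scheme:

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy training data: two features per sample, two classes (0 and 1)
X_train = [[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [5.1, 4.8]]
y_train = [0, 0, 1, 1]

# k=3: each prediction is a majority vote among the 3 nearest neighbours
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

print(model.predict([[1.1, 1.0], [5.0, 5.0]]))  # expected: [0 1]
```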

Advantages

  1. Simple to understand and implement
  2. No training phase; all computation is deferred to prediction time (a ‘lazy learner’)
  3. Works for both classification and regression
  4. Makes no assumptions about the underlying data distribution

Disadvantages

  1. Prediction is slow on large datasets, since distances to all training points must be computed
  2. Sensitive to irrelevant features and to the scale of the data, so feature scaling is usually needed
  3. Performance degrades in high dimensions (see the sections on data sparsity and distance concentration below)
  4. The choice of ‘k’ strongly affects the results

Algorithm

  1. Pick a value for ‘k’
  2. Calculate the distance from the unknown case to each case in the training data
  3. Select the k observations in the training data that are nearest to the unknown data point
  4. Predict the class using the most popular response among the k neighbours; for regression, the average of the neighbours’ responses is taken (see the sketch after this list)
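
The following is a minimal from-scratch sketch of these four steps, assuming NumPy arrays and Euclidean distance; the function name knn_predict and the toy data are illustrative choices:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the class of x_new by majority vote among its k nearest neighbours."""
    # Step 2: Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the k nearest training points
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among the k neighbours
    # (for regression, return y_train[nearest].mean() instead)
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [5.1, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 1.0])))  # expected: 0
```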

Calculating Similarity in k-NN

Usually in k-NN we use the Euclidean distance to measure the similarity between two data points; however, other distance functions can also be used.

A. Euclidean Distance

The Euclidean distance (L2 norm) between two points is the square root of the sum of the squared differences of their coordinates.
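
A minimal sketch of the Euclidean distance, assuming equal-length NumPy arrays:

```python
import numpy as np

def euclidean(x, y):
    # Square root of the sum of squared coordinate differences (L2 norm)
    return np.sqrt(np.sum((x - y) ** 2))

print(euclidean(np.array([1.0, 2.0]), np.array([4.0, 6.0])))  # 5.0
```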

Data Sparsity

This aspect, where the training samples do not capture all combinations of feature values, is referred to as ‘data sparsity’ or simply ‘sparsity’ in high-dimensional data. For example, a dataset with 10 binary features already has 2^10 = 1,024 possible combinations, so a modest training set can cover only a fraction of the space. Training a model on sparse data can lead to high variance, i.e. overfitting.

Distance Concentration

Distance concentration refers to the problem of all pairwise distances between samples/points in the space converging to the same value as the dimensionality of the data increases; when this happens, the ‘nearest’ neighbour is barely closer than any other point, which undermines k-NN. For this reason the L1 norm (Manhattan distance) is often preferred to the L2 norm (Euclidean distance) when processing high-dimensional data.
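
A small illustrative experiment (the point counts and dimensions here are arbitrary choices) shows the effect: as dimensionality grows, the farthest neighbour of a query point is barely farther away than the nearest one.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    X = rng.random((1000, d))   # 1000 random points in d dimensions
    q = rng.random(d)           # a random query point
    dist = np.sqrt(((X - q) ** 2).sum(axis=1))
    # Relative contrast: how much farther the farthest point is than the nearest
    print(d, (dist.max() - dist.min()) / dist.min())
```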

B. Cosine Similarity

Cosine similarity is the cosine of the angle between two vectors: their dot product divided by the product of their magnitudes. It ignores magnitude and compares only direction, which makes it popular for text data.
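
A minimal sketch of cosine similarity, assuming non-zero NumPy vectors:

```python
import numpy as np

def cosine_similarity(x, y):
    # Dot product divided by the product of the vector magnitudes
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ~0.707
```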

C. Manhattan Distance

The Manhattan distance (L1 norm) is the sum of the absolute differences of the coordinates.
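
A minimal sketch of the Manhattan distance:

```python
import numpy as np

def manhattan(x, y):
    # Sum of absolute coordinate differences (L1 norm)
    return np.sum(np.abs(x - y))

print(manhattan(np.array([1.0, 2.0]), np.array([4.0, 6.0])))  # 7.0
```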

D. Minkowski Distance

The Minkowski distance generalises the distances above: it is the p-th root of the sum of the absolute coordinate differences raised to the power p. With p = 1 it reduces to the Manhattan distance, and with p = 2 to the Euclidean distance.
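
A minimal sketch of the Minkowski distance, with p as a parameter:

```python
import numpy as np

def minkowski(x, y, p=2):
    # p-th root of the sum of |xi - yi|^p; p=1 gives Manhattan, p=2 Euclidean
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x, y = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(minkowski(x, y, p=1), minkowski(x, y, p=2))  # 7.0 5.0
```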
