K Nearest Neighbours Algorithm
This algorithm assigns a data point the class of its nearest neighbour(s). It uses a majority voting mechanism: a k-NN model looks at the 'k' nearest neighbours to determine the class of a data point. The algorithm can be used for both classification and regression problems, although it is mostly used for classification.
K-NN is a non-parametric algorithm, which means it does not make any assumptions about the underlying data.
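As a quick orientation, here is a minimal sketch of k-NN classification using scikit-learn; the Iris dataset and k=5 are illustrative choices, not ones prescribed by the text.

```python
# Minimal k-NN classification sketch (illustrative dataset and k value).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # 'k' = 5 neighbours
knn.fit(X_train, y_train)                   # k-NN essentially stores the training data
print(knn.score(X_test, y_test))            # accuracy from majority voting on the test set
```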
Advantages
- It performs well if the data is properly labelled
- It is robust to noisy training data
- It can be more effective if the training data is large
Disadvantages
- Determining the value of K can be time-consuming if we don’t know the number of classes
- The computation cost is high because the similarity function must be evaluated against every training data point
Algorithm
- Pick a value for ‘k’
- Calculate the distance from the unknown case to the training data points
- Select the k observations in the training data that are nearest to the unknown data point
- Predict the class using the most popular response from the k neighbours (for regression, the average of the k responses is taken); a minimal sketch follows this list
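The four steps above can be expressed directly in code. This is a plain from-scratch sketch using Euclidean distance; the function names are illustrative.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_unknown, k=3):
    """Classify one unknown point with plain k-NN (illustrative sketch)."""
    # Step 2: distance from the unknown case to every training point
    distances = np.linalg.norm(X_train - x_unknown, axis=1)
    # Step 3: indices of the k nearest training observations
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among the k neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]

def knn_regress(X_train, y_train, x_unknown, k=3):
    """For regression, the vote is replaced by the mean of the k responses."""
    distances = np.linalg.norm(X_train - x_unknown, axis=1)
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()
```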
Calculating Similarity in k-NN
Usually in k-NN we use the Euclidean distance to calculate the similarity between two data points; however, other similarity functions can also be used.
A. Euclidean
- This similarity (distance) measure is not scale invariant, which means the dataset needs to be normalized before it is used (see the sketch after this list)
- As the dimension increases this distance measure becomes less useful. This is because of the ‘curse of dimensionality’. Notable aspects of this curse are ‘data sparsity’ and ‘distance concentration’
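The following sketch shows the Euclidean distance together with one common normalization choice (min-max scaling); both the scaling method and the sample data are illustrative assumptions.

```python
import numpy as np

def euclidean(a, b):
    """Euclidean (L2) distance between two points."""
    return np.sqrt(np.sum((a - b) ** 2))

def min_max_normalise(X):
    """Scale each feature to [0, 1] so no single feature dominates the distance."""
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Second feature is on a much larger scale than the first.
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
Xn = min_max_normalise(X)
print(euclidean(Xn[0], Xn[1]))   # distance on the normalized data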
Data Sparsity:
This aspect, where the training samples do not capture all combinations of feature values, is referred to as ‘data sparsity’ or simply ‘sparsity’ in high-dimensional data. Training a model with sparse data can lead to a high-variance or overfitting condition.
Distance Concentration
Distance concentration refers to the problem of all the pairwise distances between different samples/points in the space converging to the same value as the dimensionality of the data increases. The L1 norm or Manhattan distance is preferred to the L2 norm or the Euclidean distance for high dimensional data processing.
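A small illustrative experiment (not from the original text) makes distance concentration visible: as the dimensionality grows, the ratio between the farthest and nearest pairwise Euclidean distances among random points shrinks towards 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# As d grows, max/min pairwise distance approaches 1: distances "concentrate".
for d in (2, 10, 100, 1000):
    X = rng.random((100, d))                       # 100 random points in d dimensions
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    dists = dists[np.triu_indices(100, k=1)]       # keep each unique pair once
    print(d, dists.max() / dists.min())
```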
B. Cosine Similarity
- This was introduced to overcome the drawbacks of the Euclidean distance
- The magnitude of the vectors is not taken into account, so the difference in values is not relevant; only the orientation of the vectors matters
- Used particularly with high-dimensional data, where magnitude is not especially important (see the sketch after this list)
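A minimal cosine similarity sketch; the example vectors are illustrative.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; magnitude is ignored."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = 10 * a                       # same direction, much larger magnitude
print(cosine_similarity(a, b))   # ~1.0: identical orientation despite different magnitudes
```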
C. Manhattan Distance
- This refers to the distance between two vectors if movement is restricted to right angles; there is no diagonal movement in calculating the distance (see the sketch after this list)
- It also works with higher-dimensional data, but is somewhat less intuitive
- It does not give the shortest path either
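A minimal Manhattan distance sketch; the sample points are illustrative.

```python
import numpy as np

def manhattan(a, b):
    """Manhattan (L1) distance: sum of absolute coordinate differences,
    i.e. movement restricted to right-angle (grid) steps."""
    return np.sum(np.abs(a - b))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
print(manhattan(a, b))   # 7.0, versus the straight-line (Euclidean) distance of 5.0
```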
D. Minkowski Distance
- This metric is used in a normed vector space, which means a space where distances can be represented as vectors that have a length, and the lengths cannot be negative (see the sketch after this list)
- The metric has 3 requirements:
- Zero vector: Distance between one place and itself is zero
- Scalar factor: When you multiply the distance with a scalar only the magnitude changes
- Triangle Inequality: Shortest distance between two points is a straight line
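A minimal Minkowski distance sketch; the order p is a parameter, and p=1 recovers the Manhattan distance while p=2 recovers the Euclidean distance.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p; p=1 gives Manhattan, p=2 gives Euclidean."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
print(minkowski(a, b, 1))   # 7.0  (Manhattan)
print(minkowski(a, b, 2))   # 5.0  (Euclidean)
```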