K-Nearest Neighbors (KNN) is one of the simplest and most intuitive machine learning algorithms. Despite its simplicity, it can be remarkably effective for various classification and regression tasks. In this blog post, we will explore how KNN works, where it is applied, and how to implement it.
K-Nearest Neighbors (KNN) is a non-parametric, instance-based learning algorithm used for classification and regression. In classification, the output for a query instance is determined by a majority vote of its K nearest neighbors. In regression, the output is the average of the values of its K nearest neighbors.
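To make the two modes concrete, here is a minimal sketch using scikit-learn on a tiny made-up 1-D dataset (the values below are arbitrary and only for illustration):

from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Tiny made-up 1-D dataset to illustrate the two modes
X = [[1], [2], [3], [10], [11], [12]]
y_class = [0, 0, 0, 1, 1, 1]              # class labels
y_value = [1.0, 1.2, 0.9, 5.0, 5.2, 4.8]  # numeric targets

# Classification: majority vote among the 3 nearest neighbors
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)
print(clf.predict([[2.5]]))   # -> [0], since the 3 closest points are all class 0

# Regression: mean of the 3 nearest neighbors' values
reg = KNeighborsRegressor(n_neighbors=3).fit(X, y_value)
print(reg.predict([[2.5]]))   # -> mean of 1.0, 1.2 and 0.9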
Unlike many other machine learning algorithms, KNN does not require any prior assumptions about the data distribution. This makes KNN particularly useful in situations where you don’t know much about the underlying data and don’t want to make any assumptions.
KNN operates on the idea of "closeness," where we measure the similarity between a query point and other points in the dataset. There are different methods to calculate distance, with Euclidean distance being the most commonly used. Here's the formula for Euclidean distance between two points $p$ and $q$:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$

where $p_i$ and $q_i$ are the $i$-th coordinates of points $p$ and $q$, respectively.
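As a quick illustration, here is a minimal sketch of this calculation with NumPy; the two points below are made-up 4-dimensional feature vectors:

import numpy as np

# Two arbitrary points in a 4-dimensional feature space
p = np.array([5.1, 3.5, 1.4, 0.2])
q = np.array([6.2, 2.9, 4.3, 1.3])

# Euclidean distance: square root of the sum of squared coordinate differences
distance = np.sqrt(np.sum((p - q) ** 2))
print(distance)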
Other distance metrics include Manhattan distance, Minkowski distance, and cosine similarity, depending on the problem and the type of data you're working with.
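If you want to experiment with these metrics, SciPy provides ready-made implementations; a small sketch (with arbitrary example vectors) could look like this:

from scipy.spatial import distance

# Two arbitrary example vectors
u = [1.0, 2.0, 3.0]
v = [4.0, 0.0, 3.0]

print(distance.euclidean(u, v))       # Euclidean distance
print(distance.cityblock(u, v))       # Manhattan (city block) distance
print(distance.minkowski(u, v, p=3))  # Minkowski distance with p=3
print(distance.cosine(u, v))          # Cosine distance (1 - cosine similarity)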
The parameter K represents the number of nearest neighbors to consider when making a prediction. Here's how it works: the algorithm computes the distance from the query point to every point in the training set, selects the K closest points, and then combines their labels, by majority vote for classification or by averaging for regression, to produce the prediction.
Choosing the right value for K is important. A small K makes the model sensitive to noise, while a large K may smooth over distinct patterns in the data. For binary classification tasks, odd values of K are typically preferred to avoid ties. One common way to pick K is to compare cross-validation scores across candidate values, as shown in the sketch below.
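Here is a minimal sketch of that idea using scikit-learn's cross_val_score on the Iris dataset; the candidate values of K are chosen arbitrarily for illustration:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try a few candidate values of K and compare their 5-fold cross-validation scores
for k in [1, 3, 5, 7, 9, 11]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f'K={k}: mean accuracy={scores.mean():.3f}')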
KNN can be applied to a wide variety of domains, such as recommendation systems, image recognition, and anomaly detection.
Let’s look at an example of implementing KNN using Python and the popular scikit-learn library.
First, you need to install scikit-learn if it’s not already installed:
pip install scikit-learn
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Load the Iris dataset
data = load_iris()
X = data.data
y = data.target
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Initialize the KNN classifier with K=3
knn = KNeighborsClassifier(n_neighbors=3)
# Train the model
knn.fit(X_train, y_train)
# Make predictions on the test set
y_pred = knn.predict(X_test)
# Evaluate the model
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
In this example, we use the famous Iris dataset to classify flower species based on four features: sepal length, sepal width, petal length, and petal width. The model is trained on 70% of the data and tested on the remaining 30%.
The accuracy of the model will be printed, showing how well it predicts the species of flowers on the test set.
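Continuing from the snippet above, you can also use the trained classifier on a single new sample; the measurements below are made up for illustration:

# Predict the species of one new flower (made-up measurements:
# sepal length, sepal width, petal length, petal width)
new_flower = [[5.1, 3.5, 1.4, 0.2]]
predicted_class = knn.predict(new_flower)[0]
print(data.target_names[predicted_class])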