Linear regression is one of the most fundamental and widely used algorithms in machine learning. Despite its simplicity, it remains an essential tool for making predictions and understanding relationships between variables. In this post, we’ll explore what linear regression is and how it works, then walk through sample code to help you implement it.
Linear regression is a statistical method used to model the relationship between a dependent (target) variable and one or more independent (predictor) variables. It assumes that there is a linear relationship between the input variables and the target variable.
In the case of simple linear regression, the model is based on the relationship between a single independent variable (feature) and a dependent variable (target). For multiple linear regression, there are multiple independent variables.
The goal of linear regression is to find the best-fitting straight line (or hyperplane in the case of multiple variables) that minimizes the error between the predicted and actual values.
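The equation for linear regression in the simple case (one predictor) is:
$$y = \beta_0 + \beta_1 x + \varepsilon$$
Where $y$ is the dependent variable (target), $x$ is the independent variable (feature), $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\varepsilon$ is the error term.
In multiple linear regression, the equation becomes:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon$$
Where $x_1, x_2, \dots, x_n$ are the independent variables.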
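To make the multiple-regression form concrete, here is a minimal sketch that computes predictions as an intercept plus a weighted sum of the features. The coefficient values and feature rows are made up for illustration and are not fitted to any data.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration
intercept = 10.0                     # beta_0
coefs = np.array([2.0, -1.0, 0.5])   # beta_1, beta_2, beta_3

# Two observations with three features each (made-up values)
X = np.array([
    [1.0, 2.0, 4.0],
    [3.0, 0.0, 2.0],
])

# Prediction = intercept + weighted sum of the features
y_hat = intercept + X @ coefs
print(y_hat)
```

Each row of `X` yields one prediction, so the matrix product handles every observation at once.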
Linear regression works by fitting a line to the data that minimizes the sum of squared residuals (errors), where a residual is the difference between an observed value and the model's prediction. This method is called Ordinary Least Squares (OLS). It chooses the intercept $\beta_0$ and slope $\beta_1$ (and the other coefficients in multiple regression) so that the total squared vertical distance between the observed data points and the predicted values is as small as possible.
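For a single predictor, the least-squares slope and intercept have a well-known closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means. The sketch below applies those formulas with NumPy to a small made-up dataset. The data values are an assumption, chosen so the answer is easy to check by hand.

```python
import numpy as np

# Made-up data that lies exactly on the line y = 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

x_mean, y_mean = x.mean(), y.mean()

# Least-squares slope: sum of cross-deviations over sum of squared x-deviations
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)

# Least-squares intercept: the fitted line passes through (x_mean, y_mean)
intercept = y_mean - slope * x_mean

# Residuals and the quantity OLS minimizes (sum of squared errors)
residuals = y - (intercept + slope * x)
sse = np.sum(residuals ** 2)

print(f"slope={slope:.2f}, intercept={intercept:.2f}, SSE={sse:.2f}")
```

Since the points sit exactly on a line, the recovered slope is 2, the intercept is 1, and the sum of squared errors is zero. With noisy data the same formulas apply, but the residuals will not vanish.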
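For linear regression to work effectively, certain assumptions should hold: the relationship between the predictors and the target is linear; the errors are independent of one another; the errors have constant variance (homoscedasticity); the errors are approximately normally distributed; and, in multiple regression, the predictors are not strongly correlated with each other (no severe multicollinearity).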
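One informal way to probe assumptions such as linearity and constant error variance is to inspect the residuals after fitting. The sketch below is a rough check on synthetic data, not a formal statistical test. The data, the random seed, and the "compare the two halves" heuristic are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed): a linear trend plus constant-variance noise
x = np.linspace(0, 10, 200)
y = 3.0 + 2.0 * x + rng.normal(scale=1.0, size=x.size)

# Fit a straight line; np.polyfit returns the highest-degree coefficient first
slope, intercept = np.polyfit(x, y, deg=1)
fitted = intercept + slope * x
residuals = y - fitted

# Rough constant-variance check: residual spread in the upper vs lower half of x
half = x.size // 2
spread_ratio = residuals[half:].std() / residuals[:half].std()

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"residual spread ratio (upper/lower half): {spread_ratio:.2f}")
```

A ratio close to 1 is consistent with constant variance, while a ratio far from 1 suggests the spread of the errors changes with x. In practice, a plot of residuals against fitted values is usually more informative than a single number.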
Linear regression is widely used across many fields because it is simple and easy to interpret.
Simple linear regression is used when there is only one independent variable.
Multiple linear regression is used when there are multiple independent variables influencing the dependent variable.
Let’s now walk through how to implement simple linear regression using Python and the popular scikit-learn library.
Suppose we have a dataset of years of experience and corresponding salaries, and we want to predict the salary based on years of experience.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Sample dataset (years of experience vs salary)
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)  # Independent variable (experience)
y = np.array([40000, 45000, 50000, 55000, 60000, 65000, 70000, 75000, 80000, 85000])  # Dependent variable (salary)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

# Predict the salary using the trained model
y_pred = model.predict(X_test)

# Model Evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

# Plot the data and the regression line
plt.scatter(X, y, color='blue', label='Data points')
plt.plot(X, model.predict(X), color='red', label='Regression Line')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.legend()
plt.title('Simple Linear Regression: Salary Prediction')
plt.show()
```
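After fitting, the learned slope and intercept can be read from the model's `coef_` and `intercept_` attributes. The sketch below refits on the full sample dataset rather than the training split (a simplification assumed here) and predicts the salary for a hypothetical 12 years of experience.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same toy data as the walkthrough: 40,000 at year 1, rising 5,000 per year
X = np.arange(1, 11).reshape(-1, 1)
y = 35000 + 5000 * X.ravel()

model = LinearRegression().fit(X, y)

slope = model.coef_[0]        # estimated change in salary per extra year
intercept = model.intercept_  # estimated salary at zero years of experience

# Predict for a hypothetical candidate with 12 years of experience
predicted = model.predict(np.array([[12]]))[0]

print(f"slope={slope:.1f}, intercept={intercept:.1f}, prediction={predicted:.1f}")
```

Because the sample salaries lie exactly on a straight line, the fit recovers a slope of about 5,000 and an intercept of about 35,000. For the same reason, the evaluation step in the walkthrough should report a mean squared error near zero and an R-squared of essentially 1.0, which real data will rarely produce.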
In this example, X holds the years of experience and y holds the corresponding salaries.
Let’s take an example where we want to predict house prices based on multiple factors: square footage, number of rooms, and age of the house.
```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Sample dataset (features: square footage, number of rooms, age of house)
X = np.array([
    [1400, 3, 5], [1600, 3, 10], [1700, 3, 15], [1875, 4, 10], [1100, 2, 5],
    [1550, 3, 10], [2350, 4, 20], [2450, 4, 25], [1425, 3, 5], [1700, 3, 15],
])  # Features
y = np.array([245000, 312000, 279000, 308000, 199000, 219000, 405000, 324000, 319000, 255000])  # Target (Price)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict the house prices using the trained model
y_pred = model.predict(X_test)

# Model Evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')
```
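In multiple regression, each feature gets its own coefficient, which estimates how the predicted price changes when that feature increases by one unit and the others stay fixed. The sketch below refits on all ten sample houses and prints the coefficients. The `feature_names` labels and the `new_house` values are hypothetical, introduced only for readability.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Labels assumed here for readability; the original data has no column names
feature_names = ["square_footage", "rooms", "age"]

# Same sample dataset as above
X = np.array([
    [1400, 3, 5], [1600, 3, 10], [1700, 3, 15], [1875, 4, 10], [1100, 2, 5],
    [1550, 3, 10], [2350, 4, 20], [2450, 4, 25], [1425, 3, 5], [1700, 3, 15],
])
y = np.array([245000, 312000, 279000, 308000, 199000,
              219000, 405000, 324000, 319000, 255000])

# Fit on the full dataset (a simplification; no train/test split here)
model = LinearRegression().fit(X, y)

# One coefficient per feature, in the same order as the columns of X
for name, coef in zip(feature_names, model.coef_):
    print(f"{name}: {coef:.2f}")
print(f"intercept: {model.intercept_:.2f}")

# Predict the price of a hypothetical house: 1500 sq ft, 3 rooms, 8 years old
new_house = np.array([[1500, 3, 8]])
predicted_price = model.predict(new_house)[0]
print(f"predicted price: {predicted_price:.0f}")
```

With only ten houses, and only two of them in the test set of the earlier split, both the coefficients and the reported R-squared are very noisy, so treat the numbers as illustrative rather than as reliable estimates.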