Linear Regression
Key Concept: Linear Regression is a supervised learning algorithm used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation.
Introduction
Linear regression is one of the most fundamental algorithms in machine learning. It's used to predict a continuous output variable based on one or more input features. The goal is to find the best-fitting line (or hyperplane in higher dimensions) that describes the relationship between variables.
Mathematical Formulation
The linear regression model can be expressed as:
y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε
Where:
- y is the dependent variable (target)
- x₁, x₂, ..., xₙ are the independent variables (features)
- β₀ is the intercept term
- β₁, β₂, ..., βₙ are the coefficients
- ε is the error term
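To make the equation above concrete, here is a minimal sketch that plugs hypothetical coefficient values into the model to produce a single prediction. The values β₀ = 1.0, β₁ = 0.5, β₂ = 2.0 and the feature values are made up purely for illustration and are not tied to any dataset.
import numpy as np
# Hypothetical parameters: intercept and two coefficients (illustrative values only)
beta_0 = 1.0
betas = np.array([0.5, 2.0])
# A single observation with two features, x₁ and x₂
x = np.array([3.0, 4.0])
# ŷ = β₀ + β₁x₁ + β₂x₂ (the error term ε is unknown at prediction time)
y_hat = beta_0 + np.dot(betas, x)
print(y_hat)  # 1.0 + 0.5*3.0 + 2.0*4.0 = 10.5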
Cost Function
Linear regression uses Mean Squared Error (MSE) as its cost function:
MSE = (1/m) Σ(yᵢ - ŷᵢ)²
Here m is the number of training observations, yᵢ is the observed target, and ŷᵢ is the model's prediction for observation i (m is used to avoid clashing with n, the number of features above). The goal is to minimize this cost function with respect to the coefficients in order to find the optimal parameters.
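To make the cost function concrete, the sketch below computes MSE for a small set of observed and predicted values; the numbers are hypothetical and chosen only for illustration.
import numpy as np
# Hypothetical observed targets and model predictions
y_true = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
y_pred = np.array([2.2, 3.8, 4.9, 4.3, 5.1])
# MSE = (1/m) Σ(yᵢ - ŷᵢ)²
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # ≈ 0.038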
Implementation in Python
Here's a simple implementation using scikit-learn:
import numpy as np
from sklearn.linear_model import LinearRegression
# Create sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])
# Create and train the model
model = LinearRegression()
model.fit(X, y)
# Make predictions
predictions = model.predict([[6]])
print(f"Prediction for X=6: {predictions[0]}")Key Assumptions
Linear regression makes several important assumptions:
- Linearity: The relationship between X and Y is linear
- Independence: Observations are independent of each other
- Homoscedasticity: Constant variance of errors
- Normality: Errors are normally distributed
- No multicollinearity: Independent variables are not highly correlated (a simple check is sketched after this list)
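One quick, informal way to screen for multicollinearity is to inspect the pairwise correlations between features. The sketch below uses a hypothetical feature matrix and a rule-of-thumb threshold of 0.9; both the data and the threshold are assumptions made for illustration only.
import numpy as np
# Hypothetical feature matrix: five observations, three features (columns)
X = np.array([
    [1.0, 2.0, 1.1],
    [2.0, 4.1, 2.0],
    [3.0, 6.2, 2.9],
    [4.0, 7.9, 4.2],
    [5.0, 10.1, 5.0],
])
# np.corrcoef treats rows as variables, so transpose to correlate the feature columns
corr = np.corrcoef(X.T)
print(corr)
# Flag feature pairs whose absolute correlation exceeds the rule-of-thumb threshold
threshold = 0.9
for i in range(corr.shape[0]):
    for j in range(i + 1, corr.shape[1]):
        if abs(corr[i, j]) > threshold:
            print(f"Features {i} and {j} are highly correlated: {corr[i, j]:.2f}")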
Advantages and Disadvantages
Advantages
- Simple to implement and interpret
- Computationally efficient
- Works well when the relationship is linear
- Less prone to overfitting with few features
Disadvantages
- Assumes a linear relationship
- Sensitive to outliers
- Cannot capture complex patterns
- Requires assumptions to be met for optimal performance
Pro Tip: Always visualize your data before applying linear regression to check if the relationship is indeed linear. Use residual plots to validate your model's assumptions.
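Following the tip above, here is a minimal sketch of a residual plot using matplotlib; it reuses the small illustrative dataset from the implementation section. Roughly random scatter of residuals around zero, with no obvious curve or funnel shape, is consistent with the linearity and homoscedasticity assumptions.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Same small illustrative dataset as in the implementation section
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])
model = LinearRegression()
model.fit(X, y)
# Residuals = observed - predicted
residuals = y - model.predict(X)
# Plot residuals against predicted values and look for random scatter around zero
plt.scatter(model.predict(X), residuals)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.title("Residual plot")
plt.show()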