6-Overfitting

elementsofai.com

Published

March 11, 2025

Overfitting is one of the fundamental issues that should be a mandatory part of any practical machine learning course. Without understanding it, one can do a great deal of harm. So what is it? Simply put, overfitting means being too confident in predictions that worked in the training data.

Overfitting happens when a machine learning model learns the training data too well, including noise and irrelevant patterns, making it perform poorly on new, unseen data. It’s like memorizing answers instead of understanding concepts. This leads to high accuracy on training data but poor generalization.

How to avoid it?

Cross-validation

A simple solution is splitting your data into training and testing data but splitting it means we have less train data so what is the better way to split it?

One solution is called leave-one-out cross-validation and it is one of the simplest ways of doing cross validation

Split the data into n different sets, and train the model n times, each time with a different combination of n - 1 sets, with the remaining set being used as a test set.

For example, if you have three data points (1, 2, and 3), you’ll train the model three times: - Train on {1, 2} and test on {3} - Train on {1, 3} and test on {2} - Train on {2, 3} and test on {1}

And check which model is the most accurate. This way you will get n estimates on how your selected model performs when using unseen data.

Example

Let’s create a dataset with make moons, a function that creates data, in this case 500 data poits, wiht a noise of 0.3 to make it more random. random_state is a parameter for reproducibility that is used in lots of these functions that integrate random numbers

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt


# do not edit this
# create fake data
x, y = make_moons(
    n_samples=500,  # the number of points
    random_state=42, # state for reproducibility
    noise=0.3 # to make it a scatter plot insetad of just two crescends trully separated
)

plt.figure(figsize=(8, 6))
plt.scatter(x[:, 0], x[:, 1], c=y, cmap=plt.cm.Paired, edgecolors="k")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Two Interleaving Moons Dataset")
plt.grid(True)
plt.show()

Now we can use this data to split it into train and test datasets. Then we can create a k-nearest neighbour model and test it with the train data and with the test data to check the differences

## split the x,y sets into test and train of 33% as test
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)

def evaluate(k):
    knn = KNeighborsClassifier(n_neighbors=k) #create k-nearest neighbours model
    knn.fit(x_train, y_train)

    train_acc = knn.score(x_train, y_train) #get the score with train data when k=1 this will be 1
    test_acc = knn.score(x_test, y_test) #get the score with test data
    print(f"K = {k}:")
    print(f"Training accuracy: {train_acc:.4f}")
    print(f"Testing accuracy: {test_acc:.4f}")


evaluate(1)
evaluate(250)
evaluate(42)
evaluate(100)

K = 1:
Training accuracy: 1.0000
Testing accuracy: 0.8606
K = 250:
Training accuracy: 0.7672
Testing accuracy: 0.8121
K = 42:
Training accuracy: 0.9254
Testing accuracy: 0.9091
K = 100:
Training accuracy: 0.8806
Testing accuracy: 0.8909

Conclusions from this example

When k=1, train accuracy will be 1 because closest would be itself (That’s definitly overfitting)
A reasonable baseline to beat in this code would be 50% of the test data to give accurate results since the goal is to classify data in two classes. If it consistently gives a score of more than 50%, it means the model is learning something rather than just guessing the result.
Out of the values tested, k=42 preformed the best on the test data so we can say it was the best because it gave the highest test accuracy (0.9091)
To have a higher test set accuracy than training set accuracy. For example if you are doing a classification task like here, if your data sets have class imbalance it can easily lead to such a scenario, or if your test set points in this example here would have been picked far away from the decision boundary then they would have been easier to classify correctly than those near the border.

Regularization (TODO)

Ridge Regression (L2 Regularization)

A regularized version of Multiple Linear Regression that prevents overfitting.
Adds a penalty term (L2 norm) to shrink coefficients toward zero but never makes them exactly zero.
Equation: \(y = w_1 x_1 + w_2 x_2 + ... + w_n x_n + b + \lambda \sum w_i^2\)
Use case: When you have highly correlated features and want to avoid overfitting while keeping all features.

Lasso Regression (L1 Regularization)

Another regularized regression, but this one uses an L1 norm penalty, which can force some weights to be exactly zero (feature selection).
Equation: \(y = w_1x_1 + w_2x_2 + ... + w_nx_n + b + \lambda \sum |w_i|\)
Use case: When you want automatic feature selection, meaning it removes irrelevant features by setting their coefficients to zero.