The Enigma of Model Complexity Pt 1
Model complexity, also known as the capacity of the model, is a key factor in the effective performance of a machine learning model. It is easy to fit a model to the given data and achieve more than 90% accuracy on the training data, but the real task is to achieve comparable accuracy on the test data. We will discuss methods to assess model capacity so that the model has enough capacity to generalize well on the given data distribution without overfitting or underfitting.
In this article, we will discuss the importance of model complexity and try to quantify it using the VC dimension. In the next few articles, we will dive deeper into other metrics as well, including Rademacher complexity and Minimum Description Length (MDL).
Learning vs Memorization
As per Google,
Learning is the acquisition of knowledge or skills through study, experience, or being taught.
It is a process of acquiring an understanding of a subject, which you may learn with someone's assistance or through your own experience. The experience you gain throughout your day-to-day life affects your attitude, knowledge and perceptions, which improves your future decisions. On the other hand, if you have the guidance of a mentor in your life, you will be less prone to making trivial mistakes and will get a head start on learning quickly.
Learning becomes memorization when you don't have any real understanding of the problem but just an overview. This may not always be a problem if the solution is simple, but things go wrong when the problem does not have a straightforward solution and instead requires an understanding of the problem at hand.
Let's go back to school now. Suppose you are taught the theory of some topic and are given some exercises to work with, and the next day you have a quiz on the topic. In the quiz, some questions are exactly the same as in the exercises, while others are changed, with different values or rephrased wording, but the crux of the question is unchanged.
The results have been declared!
There will be 3 types of students:
- Students who scored well on both types of questions.
- Students who scored well on the first type of questions but poorly on the second type.
- Students who could not score well on either type of question.
The first type of students had a good understanding of the theory and were able to formulate the solution even when the questions were slightly changed. This is an example of generalization, where you have an understanding of the task rather than a memorization of it.
The second type of students did not gain an understanding of the theory and were just memorizing the exercise problems from school. Thus, they could not solve the second type of problems, which required more thinking and understanding. This is also called overfitting in machine learning.
The third type of students could not even grasp the method of solution and were unable to approach the problems, which led to bad scores on the first type of questions as well. This is also called underfitting in machine learning.
Model Complexity/Capacity
Model complexity refers to the complexity of the function being learned during training, which depends on the number of parameters to be learned.
These parameters differ across machine learning models: a decision tree's complexity depends on the depth of the tree, linear regression's complexity increases with the number of features, and polynomial regression's complexity increases with the degree of the polynomial.
In terms of generalizability,
- High model complexity leads to overfitting.
- Low model complexity leads to underfitting.
The overfitting case is prone to memorization and hence to poor performance on unseen data. On the other hand, the underfitting case fails to capture important details of the data distribution. So we need a tradeoff between the two to reach the optimal model capacity. But how do we measure a model's capacity?
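To make this tradeoff concrete, here is a small sketch (the data-generating function, noise level, and polynomial degrees are arbitrary illustrative choices) that fits polynomial regression models of increasing degree and compares training and test error:

```python
# Sketch: how model capacity (polynomial degree) affects train vs. test error.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))                      # illustrative inputs
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)    # noisy sine targets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in [1, 3, 15]:          # low, moderate, and high capacity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

Typically the degree-1 model underfits (high error on both sets), while the very high-degree model drives the training error down yet tends to generalize worse.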
VC Dimension
The VC dimension (Vapnik-Chervonenkis dimension) is a measure of the complexity of a machine learning model. It is named after the mathematicians Vladimir Vapnik and Alexey Chervonenkis, who developed the concept in the 1970s.
Given a set of n points in d-dimensional space and a hypothesis set H for a binary classification task with classes c ∈ {-1, 1}, the VC dimension h is given by
h = max{ q : there exists a set of q points that can be shattered by the hypothesis set H }
We know that q points can be labeled in 2^q ways, and for each labeling there should exist a hypothesis in H that correctly classifies the points into the two classes. If this holds for at least one arrangement of q points, we say that the hypothesis set H can shatter q points.
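For intuition, linear classifiers (straight lines) in the plane can shatter any 3 points in general position but no set of 4 points, so their VC dimension is 3. The sketch below (an illustrative check; the point coordinates are arbitrary) enumerates all 2^3 labelings of 3 non-collinear points and verifies that a linear classifier separates each one:

```python
# Sketch: verify that a linear classifier can shatter 3 non-collinear points in 2-D.
import itertools
import numpy as np
from sklearn.svm import SVC

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # three non-collinear points

shattered = True
for labels in itertools.product([-1, 1], repeat=len(points)):
    if len(set(labels)) == 1:
        continue                          # a single-class labeling is trivially separable
    clf = SVC(kernel="linear", C=1e6)     # (near) hard-margin linear classifier
    clf.fit(points, np.array(labels))
    if clf.score(points, np.array(labels)) < 1.0:
        shattered = False
        print("labeling", labels, "could not be separated")

print("These 3 points are shattered by linear classifiers:", shattered)
```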
The VC dimension is a measure of model capacity, so it is evident that models with a high VC dimension have high capacity and are more prone to overfitting. Hence we prefer models with a lower VC dimension, which increases the chances of better generalization and leads to better performance on unseen data.
Let's consider a binary classification task with samples x and classes y ∈ {-1, 1}. The goal of the task is to accurately learn a mapping x → f(x, a). We train the model on the training set and expect it to perform well on the test set, so we want a bound on the error on the test data without actually evaluating on it, since it is unseen.
R(a) is the risk, i.e. the expected error on unseen data, which is what we are ultimately concerned about. Here a denotes the adjustable parameters of the model, such as its weights and biases.
R_e(a) is the empirical risk, i.e. the average error measured on the training data.
Now choose some η such that 0 ≤ η ≤ 1. Then, for losses taking these values, the following bound holds with probability 1 − η (Vapnik, 1995):
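(Restated from the SVM tutorial in the references [1], with l denoting the number of training samples and R_e(a) the empirical risk defined above:)

$$R(a) \le R_e(a) + \sqrt{\frac{h\left(\log(2l/h) + 1\right) - \log(\eta/4)}{l}}$$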
Our aim is to find a bound on the total risk (the combined effect of performance on known and unknown data), i.e. the left-hand side, which can be computed using the two terms on the right-hand side. The first term is the error on the training data, and the second term corresponds to the complexity of the function being learned.
The second term is also called the VC confidence or confidence interval, and is a function of the VC dimension (h) and the confidence parameter (η). Thus, given several different learning machines and a fixed, sufficiently small η, we choose the machine that minimizes the right-hand side, giving the lowest upper bound on the actual risk. This provides a principled method for choosing a learning machine for a given task, and is the essential idea of structural risk minimization. The VC confidence is a monotonically increasing function of h, and this holds for any value of l (the number of samples), as the numerical sketch after the inferences below illustrates.
Inferences:
- If we have a high VC dimension, the VC confidence term increases, and hence so does the bound on the total risk.
- If we have a low VC dimension, the VC confidence term is smaller, and hence so is the bound on the total risk.
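As a quick numerical sketch of this behaviour (the values of l and η below are arbitrary illustrative choices), we can evaluate the VC confidence term from the bound above for a few values of h and see that it grows with h:

```python
# Sketch: the VC confidence term grows with the VC dimension h.
import math

l = 10_000    # number of training samples (illustrative)
eta = 0.05    # confidence parameter (illustrative)

def vc_confidence(h: int) -> float:
    """VC confidence term from the bound above."""
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)

for h in [10, 100, 1000, 5000]:
    print(f"h={h:5d}  VC confidence = {vc_confidence(h):.3f}")
```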
It is important to note that a model with a high VC dimension may still perform better on some datasets than a model with a lower VC dimension. Overfitting does not mean that the model will always perform worse on unseen data; it only means that the bound on the expected test error is looser.
References
[1] C. J. C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, 1998.
[2] VC Dimension
[3] Model Complexity, https://vitalflux.com/model-complexity-overfitting-in-machine-learning/amp/