"We are not concerned with the very poor. They are unthinkable, and only to be approached by the statistician or the poet." (E.M. Forster, English novelist, essayist and librettist)
[Image source: Udacity]
First, let's understand the term overfitting through a real-life example you might have faced yourself. Suppose your exam is tomorrow and you have not prepared anything. You can do one of two things: either bring cheat notes or prepare only a few specific topics.
Let's take the case where you have prepared 2 important topics and memorized all the questions from those 2 topics. Now you know everything, but only from those 2 topics. You feel confident that you will pass the exam with a great score, right? But what if, when the exam starts, all the questions are different from what you have memorized? Exactly; that is the situation when your trained model is overfit.
You memorized one specific pattern expecting it to appear in the exam, but on unseen questions you fail or pass with a low score. That is overfitting in simple terms. Hopefully you now have a basic understanding of this concept.
It is normal to get confused by concepts like this, but a simple example makes them easy to remember.
Methods to Prevent Overfitting
1. Training with More Data!
Take the overfitting example I explained above. Instead of learning only 2 topics, if the student could study more topics, or had been given more time to prepare, there is a good chance they would score higher.
Similarly, when you give more data to the model, it learns from more and more different types of examples, so its performance on unseen data improves. The model will be able to perform better than before, because it has now been trained on a wider variety of data.
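As a rough illustration, here is a minimal sketch using scikit-learn's learning_curve on a synthetic dataset (both choices are mine, purely for demonstration); it trains the same model on larger and larger slices of the data and reports the cross-validated score for each slice:

```python
# A minimal sketch: validation performance usually improves as the
# amount of training data grows.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for a real dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

# Train the same model on 10%, 32.5%, ..., 100% of the training data.
train_sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=42),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
)

for size, score in zip(train_sizes, val_scores.mean(axis=1)):
    print(f"{size:5d} training samples -> mean validation accuracy {score:.3f}")
```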
2. Removing Features
This method is popularly known as "feature selection". It is an iterative process of identifying which features are essential to the model, that is, finding the features that provide unique information about the dependent variable.
Thus, if, statistically, you see that a particular variable does not contribute much to the dependent variable, you can remove it. Besides doing this manually, there are built-in algorithms that spare you the hassle of doing feature selection by hand.
Suppose there are 500 rows and 100 columns of data. We have a small amount of data, yet a lot of features are provided for training. In this situation you most likely cannot use all the features, because that would result in overfitting; instead, you select only the few important ones.
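As a minimal sketch of that 500-row, 100-column situation (using scikit-learn's SelectKBest as one example of a built-in selector; the dataset is synthetic and the choice of k=10 is arbitrary):

```python
# A minimal sketch: keep only the 10 most informative of 100 features.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in for the 500 x 100 dataset described above.
X, y = make_regression(n_samples=500, n_features=100,
                       n_informative=10, random_state=42)

# Score every feature against the dependent variable, keep the top 10.
selector = SelectKBest(score_func=f_regression, k=10)
X_selected = selector.fit_transform(X, y)

print(X.shape, "->", X_selected.shape)  # (500, 100) -> (500, 10)
```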
3. Cross Validation
The main idea behind cross validation is that every observation in our dataset gets the opportunity to be tested. We repeat the experiment multiple times, using different parts of the training data as the validation set each time.
In standard k-fold cross validation, we iterate over the dataset k times. On each iteration, we split the dataset into k parts: one part is used for validation and the remaining k-1 parts are merged together and used as the training set for model evaluation.
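A minimal sketch of 5-fold cross-validation with scikit-learn (the model, the synthetic data, and k=5 are illustrative choices only):

```python
# A minimal sketch of 5-fold cross-validation: each fold is used once
# for validation while the remaining 4 folds train the model.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
)

print("Fold accuracies:", scores.round(3))
print("Mean accuracy:  ", scores.mean().round(3))
```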
4. Early Stopping
This method does exactly what its name suggests: "early stopping" means stopping before a certain point/iteration. After a certain iteration, when we find that from there on the model's performance is only going to degrade, we stop training right there.
This technique is mostly used in deep learning. Instead of running your model for a fixed number of epochs/iterations, you stop as soon as the validation loss starts to rise.
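As one self-contained sketch, scikit-learn's MLPClassifier has this behaviour built in via its early_stopping parameter (deep learning frameworks offer equivalent callbacks); the parameter values below are illustrative, not recommendations:

```python
# A minimal sketch of early stopping: training halts once the validation
# score has not improved for n_iter_no_change consecutive epochs.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

model = MLPClassifier(
    max_iter=500,             # upper bound, not a fixed number of epochs
    early_stopping=True,      # hold out part of the training data for validation
    validation_fraction=0.1,  # 10% of the training data used as validation set
    n_iter_no_change=10,      # stop after 10 epochs without improvement
    random_state=42,
)
model.fit(X, y)

print(f"Stopped after {model.n_iter_} of at most {model.max_iter} epochs")
```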
5. Regularization
For this method, I suggest you visit this post for a deeper understanding.
This is a method that tries to artificially penalize complex models, pushing them towards simpler ones.
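A minimal sketch with scikit-learn's Ridge regression, where the alpha parameter is the penalty on large coefficients (the alpha values and synthetic data are illustrative choices):

```python
# A minimal sketch of L2 regularization: larger alpha penalizes large
# coefficients more strongly, pushing the model towards a simpler fit.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=50, noise=10.0,
                       random_state=42)

for alpha in (0.01, 1.0, 100.0):
    score = cross_val_score(Ridge(alpha=alpha), X, y, cv=5).mean()
    print(f"alpha={alpha:>6}: mean cross-validated R^2 = {score:.3f}")
```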
6. Ensembling
Ensemble models combine the decisions from multiple models to improve the overall performance.
Most often, to tackle overfitting, we average the predictions of several models to reduce variance. With that in mind, let's walk through ensemble techniques from the simplest to the more complex (a short code sketch follows the list).
- Averaging: We take the average of the predictions from all models and use it as the final prediction.
- Weighted Averaging: This is an extension of averaging. It assigns a weight to each model according to that model's importance.
- Max Voting: In this technique, we use the mode as our judge to decide the final prediction. Each model casts a separate vote, and the prediction made by the majority of the models is used as the final prediction.
- Bagging: In this technique, individual models are built separately and equal weight is assigned to all of them. It combines several strong learners in order to smooth out their predictions.
- Boosting: In this technique, each new model is influenced by the performance of the previously built models, and weights are assigned according to their performance. It combines many weak learners into a single strong learner.
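Here is a minimal sketch of two of these ideas using scikit-learn (the base models and synthetic data are my own illustrative choices): VotingClassifier implements max voting and BaggingClassifier implements bagging.

```python
# A minimal sketch of max voting and bagging.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# Max voting: each model casts one vote, the majority class wins.
voting = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(random_state=42)),
        ("knn", KNeighborsClassifier()),
    ],
    voting="hard",
)

# Bagging: many trees trained on bootstrap samples of the data,
# their predictions combined by voting.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            random_state=42)

for name, model in [("max voting", voting), ("bagging", bagging)]:
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))
```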
Feel free to share your opinions in the comments!