"Mathematicians come to the solution of a problem by the simple arrangement of the data, and reducing the reasoning to such simple operations, to judgments so brief, that they never lose sight of the evidence that serves as their guide." by Antoine Lavoisier, prominent French chemist and a leading figure of the 18th century


Model Building

The first crucial step in modeling is to decide which algorithms to use, based on the target variable. The target variable can fall into one of four categories:
  1. Nominal
  2. Ordinal
  3. Interval 
  4. Ratio
The target variable in our model is a continuous variable, i.e., the count of bike rentals (cnt). Hence, the algorithms we choose to build the model with are linear regression and random forest, as discussed in Part 5 of this series.
From now on, we will use the term multiple regression in place of linear regression because we have more than one independent variable. That's it.😀
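To make "multiple regression" concrete, here is a minimal sketch in plain Python of fitting one target on several predictors with ordinary least squares. It is not the code used in this post: the function name `fit_multiple_regression` and the toy numbers are made up for illustration, and a real project would rely on a statistics library instead.

```python
def fit_multiple_regression(X, y):
    """Ordinary least squares via the normal equations (A^T A) b = A^T y.

    X is a list of rows (one list of predictor values per observation),
    y is the list of target values. Returns [intercept, b1, b2, ...].
    """
    # Add a leading column of ones so the first coefficient is the intercept.
    A = [[1.0] + [float(v) for v in row] for row in X]
    k = len(A[0])

    # Augmented matrix [A^T A | A^T y].
    M = []
    for i in range(k):
        row = [sum(r[i] * r[j] for r in A) for j in range(k)]
        row.append(sum(r[i] * t for r, t in zip(A, y)))
        M.append(row)

    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; cannot solve")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(k):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]

    return [M[i][k] / M[i][i] for i in range(k)]


# Hypothetical toy data: two predictors, target built as 10 + 2*x1 + 3*x2.
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]]
y = [10 + 2 * x1 + 3 * x2 for x1, x2 in X]

coefficients = fit_multiple_regression(X, y)
print([round(c, 6) for c in coefficients])
```

Because the toy target is an exact linear combination of the two predictors, the fitted intercept and slopes should come back as approximately 10, 2 and 3.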


We split the data set in the ratio 80:20, which means we train the model on 80% of the data and test its performance on the remaining 20%. The ratio you use to build the model is entirely up to you.
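The 80:20 split can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the code from this post: the helper name `train_test_split`, the seed and the stand-in rows are assumptions.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle row indices and cut them into a training and a test part."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    cut = int(round(train_fraction * len(rows)))
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test


rows = list(range(100))                # stand-in for 100 rows of data
train, test = train_test_split(rows)
print(len(train), len(test))           # 80 20
```

Fixing the seed is a design choice: it makes the split reproducible, so the two models are compared on exactly the same test rows.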

[Output: Summary of Multiple Regression]

If you are wondering why I used the 13th variable in lines 4, 5 and 6 of the code, you are on the right path. We have already excluded three unnecessary variables from the 16 features, which is why our target variable moved to position 13. The features "casual" and "registered" are also excluded from the formula, but that is up to you. It is a trial-and-error method: try different combinations of variables to obtain the best result. In the end, our main aim is to select only those features that are beneficial and a good fit for our model.

[Output: Summary of Random Forest]

[Figure: Model Plot of the random forest model]


Plotting the model illustrates its error rate as the number of trees grows. The error is based on the Out Of Bag (OOB) sample. With the two lines of code above, we can find the number of trees that gives the lowest error rate: 191 trees, with an average bike rental count error of 169.
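The idea of picking the tree count with the lowest OOB error can be sketched as follows. This is a plain Python illustration, not the code from this post. It assumes the forest reports one OOB mean squared error per number of trees, and that the reported "average error" is the square root of the smallest one. The function name and the error values are made up.

```python
import math


def best_tree_count(oob_mse):
    """oob_mse[i] is the OOB mean squared error of a forest with i + 1 trees.

    Returns the tree count with the lowest error and the matching
    root mean squared error (in the units of the target variable).
    """
    best_index = min(range(len(oob_mse)), key=lambda i: oob_mse[i])
    return best_index + 1, math.sqrt(oob_mse[best_index])


# Made-up OOB errors for forests of 1 to 5 trees.
oob_mse = [900.0, 625.0, 400.0, 441.0, 484.0]
n_trees, rmse = best_tree_count(oob_mse)
print(n_trees, rmse)                   # 3 20.0
```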


[Output: variable importance (varimp)]

[Figure: variable importance plot (varimpplot)]

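%IncMSE (the regression counterpart of Mean Decrease Accuracy) shows how much the prediction error (mean squared error) of our model increases, and so how much its accuracy drops, when the values of that variable are randomly permuted. IncNodePurity (the regression counterpart of Mean Decrease Gini) measures variable importance by the total decrease in node impurity from the splits on that variable. For a regression forest like ours the impurity is the residual sum of squares, while the Gini impurity index is used for classification.
In simple words, the higher the value of both metrics, the more important the variable is to the model.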
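The permutation idea behind %IncMSE can be sketched in plain Python. This is a simplified illustration and not the exact calculation a random forest package performs: it works on a single data set instead of per-tree OOB samples, and the `predict` function, the helper names and the toy data are all hypothetical.

```python
import random


def mse(predict, X, y):
    """Mean squared error of predict() over the rows of X."""
    return sum((predict(row) - t) ** 2 for row, t in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, column, seed=0):
    """Increase in MSE after shuffling the values of one column."""
    rng = random.Random(seed)
    baseline = mse(predict, X, y)
    shuffled = [row[column] for row in X]
    rng.shuffle(shuffled)
    # Rebuild the rows with only the chosen column replaced.
    X_perm = [row[:column] + [v] + row[column + 1:]
              for row, v in zip(X, shuffled)]
    return mse(predict, X_perm, y) - baseline


def predict(row):
    """Hypothetical model: uses column 0 only and ignores column 1."""
    return 5 * row[0]


X = [[i, (i * 7) % 10] for i in range(10)]
y = [5 * row[0] for row in X]

importance_used = permutation_importance(predict, X, y, column=0)
importance_ignored = permutation_importance(predict, X, y, column=1)
print(importance_used, importance_ignored)
```

Shuffling the column the model relies on raises the error, while shuffling the ignored column leaves it unchanged, which is why a higher value points to a more important variable.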

In this post, I have explained how to implement the algorithms. To evaluate the performance of the models, I will explain every important metric in the next and last post of the series.


Feel free to share your opinions in the comments💓