Data Science Project | Bike Rental Count | Part 6

"Mathematicians come to the solution of a problem by the simple arrangement of the data, and reducing the reasoning to such simple operations, to judgments so brief, that they never lose sight of the evidence that serves as their guide." (Antoine Lavoisier, prominent French chemist and a leading figure of the 18th century)

Model Building

The first crucial step in modeling is to know which algorithms to use, based on the target variable. The target variable can fall into one of four levels of measurement:

Nominal
Ordinal
Interval
Ratio

The target variable in our model is the count of bike rentals (cnt), a numeric variable that we treat as continuous. Hence, the algorithms we choose to build the model with are linear regression and random forest, as discussed in Part 5.

From now on, we will use the term multiple regression in place of linear regression because we have more than one independent variable. That's it.😀

We split the data set in the ratio 80:20, which means we train our model on 80% of the data and test how well it performs on the other 20%. The ratio you use to build the model is entirely up to you.
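The split described above can be sketched as follows. This is a minimal illustration and not the original code of this post: the data frame `bike` is a simulated stand-in for the cleaned bike rental data, and the names `train_index`, `train` and `test` are assumptions made for the example.

```r
# Simulated stand-in for the cleaned bike rental data (illustrative only)
set.seed(123)  # makes the random split reproducible
n <- 100
bike <- data.frame(
  temp = runif(n),
  hum  = runif(n),
  cnt  = round(runif(n, 500, 8000))
)

# Draw 80% of the row indices at random for training
train_index <- sample(seq_len(nrow(bike)), size = floor(0.8 * nrow(bike)))

train <- bike[train_index, ]   # 80% of the rows
test  <- bike[-train_index, ]  # the remaining 20%

nrow(train)
nrow(test)
```

Setting a seed before `sample()` means the same rows land in the training set every time the script is run, which makes results easier to compare across experiments.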

Output: Summary of Multiple Regression

If you wonder why I used the 13th variable in lines 4, 5 and 6 of the code, you are on the right path. We have already excluded three unnecessary variables from the 16 features, which is why our target variable shifted to position 13. The features "casual" and "registered" are also excluded from the formula. Since cnt is simply the sum of these two, keeping them would hand the model the answer, but feel free to experiment. It is a trial-and-error method: try different combinations of variables to obtain the best result. In the end, our main aim is to select only those features that are beneficial and a good fit for our model.
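One way to leave features out of a model formula in R is sketched below. It is an assumption about how this could be done, not the code used in this post: the training data is simulated, and only the column names `casual`, `registered` and `cnt` come from the bike rental data.

```r
# Simulated training data (illustrative only); cnt = casual + registered
set.seed(123)
n <- 200
train <- data.frame(temp = runif(n), hum = runif(n), windspeed = runif(n))
train$casual     <- round(1000 * train$temp + rnorm(n, 200, 50))
train$registered <- round(3000 * train$temp - 1000 * train$hum + rnorm(n, 2000, 300))
train$cnt        <- train$casual + train$registered

# "." stands for every other column in the data;
# "- casual - registered" removes those two from the predictors
lm_model <- lm(cnt ~ . - casual - registered, data = train)

summary(lm_model)
```

The same formula syntax can be passed to `randomForest()`, so different combinations of variables can be tried by editing only the formula.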

Output: Summary of Random Forest

Model Plot

Plotting the model shows how the error rate changes as trees are added. It is based on the out-of-bag (OOB) sample error. Thus, with the two lines of code above, we can find the number of trees that gives the lowest error rate: 191 trees, with an average bike rental count error of 169.
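A common way to read the best number of trees off a regression forest is sketched below. It is not necessarily what the two lines in this post do: the data is simulated, the names `rf_model`, `best_ntree` and `best_rmse` are made up for the example, and taking the square root of the OOB mean squared error to get an error in rental counts is an assumption.

```r
library(randomForest)  # assumes the randomForest package is installed

# Simulated training data (illustrative only)
set.seed(123)
n <- 300
train <- data.frame(temp = runif(n), hum = runif(n), windspeed = runif(n))
train$cnt <- 4000 * train$temp - 1500 * train$hum + rnorm(n, 3000, 400)

rf_model <- randomForest(cnt ~ ., data = train, ntree = 500)

# For a regression forest, plot() draws the OOB mean squared error
# against the number of trees
plot(rf_model)

# rf_model$mse holds the OOB MSE after 1, 2, ..., ntree trees
best_ntree <- which.min(rf_model$mse)         # number of trees with the lowest OOB MSE
best_rmse  <- sqrt(rf_model$mse[best_ntree])  # error on the scale of cnt

best_ntree
best_rmse
```

The exact values depend on the data and the seed, so they will differ from the 191 trees and error of 169 reported for the bike rental data.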

Output: varimp and varimpplot

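%IncMSE (mean decrease in accuracy) measures how much the prediction error (MSE) of the model increases when the values of that variable are randomly permuted, i.e. how much accuracy we lose without the information in that variable. IncNodePurity (the regression counterpart of Mean Decrease Gini) measures the total decrease in node impurity from the splits made on that variable. For a regression forest, impurity is measured by the residual sum of squares rather than the Gini index.
In simple words, the higher the value of both metrics, the more important the variable is to the model.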
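Both metrics can be inspected with the randomForest package, as in the sketch below. The data is simulated and the column `noise` is a deliberately useless predictor added for contrast, so the numbers are illustrative only. Note that %IncMSE is only computed when the forest is fitted with `importance = TRUE`.

```r
library(randomForest)  # assumes the randomForest package is installed

# Simulated training data (illustrative only); "noise" has no effect on cnt
set.seed(123)
n <- 300
train <- data.frame(temp = runif(n), hum = runif(n), noise = runif(n))
train$cnt <- 4000 * train$temp - 1500 * train$hum + rnorm(n, 3000, 400)

# importance = TRUE asks for the permutation-based %IncMSE as well
rf_model <- randomForest(cnt ~ ., data = train, importance = TRUE)

# Matrix with one row per variable and the columns %IncMSE and IncNodePurity
imp <- importance(rf_model)
imp[order(imp[, "%IncMSE"], decreasing = TRUE), ]

# Dot charts of both metrics, most important variable at the top
varImpPlot(rf_model)
```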

In this post, I have explained how to implement the algorithms. To evaluate the performance of the models, I will explain every important metric in the next and last post of the series.

Feel free to share your opinions in the comments 💓

Datacian
