So far we have learned to predict something with one model, one set of selected features (or columns), and one set of parameters. And in EP 7, we predicted with one model and one set of selected features, where the parameters came out of a model selection process.
What if we cannot select features, say because there are too many of them to pick from?
ensemble is a module in the scikit-learn package. It combines many estimators, each trained on a different subset of the data, to produce a better result than any single one.
For example, suppose we have many data dimensions for residences such as geo-location, size, land price, number of floors, reference web rating, etc., and we need to predict the price of a house downtown. With that many features we run into tough problems, and this is exactly what ensembles are for.
This time we look at two common ensemble types: Bagging and Random Forest.
Bagging stands for Bootstrap Aggregating. It trains several estimators, each on a random sample of the dataset, over all features. Here are its main parameters:
- estimator: the type of base estimator (a decision tree by default).
- n_estimators: the number of estimators (10 by default).
- max_samples: the number of samples drawn to train each estimator.
Import sklearn.ensemble and create a BaggingRegressor() with a DecisionTreeRegressor() inside. Set n_estimators to 5 and max_samples to 25. After prediction, we found its MedAE to be 65,667.
Then we created 3 more models with different values of max_samples. The first one was still the best.
As in the latest episode, we try running GridSearchCV() over it.
The best estimator from the search gives a model with a MedAE of only 63,443 points. It uses 16 of the 44 features from the source.
Now we move on to Random Forest. Random Forest differs from Bagging in that each tree is also built on a random subset of the features, not all of them. We can apply the Random Forest estimator in the same way as a decision tree: just set the parameters and fit. Ooh, we made an estimator from Random Forest and it's better than Bagging's. Its MedAE is just 14,334, with 17 features occupied.
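The Random Forest step might look like the sketch below, again on synthetic stand-in data (so the printed MedAE will not match the 14,334 reported here). The max_features parameter is what restricts each split to a random feature subset, which is the difference from plain Bagging.

```python
# Sketch: a Random Forest regressor, where each split considers only
# a random subset of features (max_features), unlike plain bagging.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import median_absolute_error

X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestRegressor(
    n_estimators=100,      # number of trees
    max_features="sqrt",   # features considered at each split
    random_state=0,
)
forest.fit(X_train, y_train)
print(median_absolute_error(y_test, forest.predict(X_test)))
```

forest.feature_importances_ afterwards shows how much each feature contributed, which is one way to see which features the forest actually relies on.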
I have to say this one is quite complex for me, and I need more practice.
Let's see what's next, and I'll share it with you all.