ENSEMBLE METHODS — Bagging, Boosting, and Stacking

A comprehensive guide to Ensemble Learning.

Ankit Chauhan
Analytics Vidhya

--

In this article, I will give a theoretical explanation of what ensemble learning is and of the common types of ensemble methods.

We regularly come across game shows on television, and you must have noticed the “Audience Poll” option. A contestant usually goes with the option that received the highest vote from the audience, and more often than not they win. We can generalize this to real life as well, where the opinion of a majority of people is usually preferred over the opinion of a single person. The Ensemble technique has a similar underlying idea: we aggregate the predictions from a group of predictors, which may be classifiers or regressors, and most of the time the aggregated prediction is better than the one obtained from a single predictor.

Definition: — Ensemble learning is a machine learning paradigm where multiple models (often called “weak learners”) are trained to solve the same problem and combined to get better results. The main hypothesis is that when weak models are correctly combined, we can obtain more accurate and/or robust models.

Weak Learners: A ‘weak learner’ is any ML algorithm (for regression/classification) that provides an accuracy slightly better than random guessing.

In ensemble learning theory, we call weak learners (or base models) the models that can be used as building blocks for designing more complex models by combining several of them. Most of the time, these basic models do not perform very well by themselves, either because they have a high bias or because they have too much variance to be robust. The idea of ensemble methods is then to try to reduce the bias and/or variance of such weak learners by combining several of them together to create a strong learner (or ensemble model) that achieves better performance.

Let’s suppose we have ‘n’ predictors/models:

Z1, Z2, Z3, …, Zn, each with expected value µ and standard deviation σ, so that

Variance(Zi) = σ²

If we use any single predictor Zi on its own, its expected value is µ and the variance associated with it is σ².

Now let’s consider the average of the predictors:

Z̄ = (Z1 + Z2 + Z3 + … + Zn)/n

The expected value of Z̄ is still µ, but, assuming the predictors are independent, the variance becomes:

Variance(Z̄) = σ²/n

So the expected value stays the same, while the variance decreases when we use the average of all the predictors. (In practice the predictors are somewhat correlated, so the reduction is smaller, but the effect goes in the same direction.)

This is why taking the mean of several predictors is preferred over using a single predictor.
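To see this effect numerically, here is a minimal simulation sketch (my own illustration, not from the original article) using NumPy: it draws predictions from n independent predictors that all share the same expected value and variance, and compares the variance of a single predictor with the variance of their average.

```python
import numpy as np

rng = np.random.default_rng(0)

true_value = 10.0     # the quantity every predictor tries to estimate
sigma = 2.0           # standard deviation of each individual predictor
n_predictors = 25     # number of predictors being averaged
n_trials = 100_000    # repeated experiments used to estimate the variances

# Each row is one trial; each column is one independent predictor Z_i.
predictions = rng.normal(true_value, sigma, size=(n_trials, n_predictors))

single = predictions[:, 0]           # using a single predictor
averaged = predictions.mean(axis=1)  # using the average of all n predictors

print("single predictor:  mean %.3f, variance %.3f" % (single.mean(), single.var()))
print("averaged ensemble: mean %.3f, variance %.3f" % (averaged.mean(), averaged.var()))
print("theoretical variance of the average: %.3f" % (sigma**2 / n_predictors))
```

With these numbers, the single predictor’s variance comes out near σ² = 4.0, while the variance of the average comes out near σ²/n = 0.16, matching the formula above.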

Ensemble methods take multiple small models and combine their predictions to obtain a single, more powerful predictor.

There are a few very popular ensemble techniques which we will talk about in detail: Bagging, Boosting, and Stacking.

1. BAGGING

· Bagging stands for Bootstrap Aggregation.

· In real-life scenarios, we don’t have multiple different training sets on which we can train our model separately and at the end combine their results. Here, bootstrapping comes into the picture.

· Bootstrapping is a technique for creating different sets of data from a given training set by sampling with replacement. After bootstrapping the training dataset, we train the model on all the different sets and aggregate the results. This technique is known as Bootstrap Aggregation or Bagging.

Definition: — Bagging is a type of ensemble technique in which a single training algorithm is used on different subsets of the training data, where the subset sampling is done with replacement (bootstrap). Once the algorithm has been trained on all the subsets, bagging predicts by aggregating all the predictions made by the algorithm on the different subsets.

· For aggregating the outputs of base learners, bagging uses majority voting (most frequent prediction among all predictions) for classification and averaging (mean of all the predictions) for regression.
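As a concrete illustration (my own sketch, not from the original article), scikit-learn’s BaggingClassifier implements exactly this scheme: it draws bootstrap samples, trains one copy of the base estimator on each, and aggregates by majority vote (BaggingRegressor averages instead). The dataset here is synthetic and used only for demonstration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data, just for illustration.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

bagging = BaggingClassifier(
    DecisionTreeClassifier(),  # base learner, one copy per bootstrap sample
    n_estimators=100,          # number of bootstrap samples / base learners
    bootstrap=True,            # sample the training instances with replacement
    random_state=42,
)
bagging.fit(X_train, y_train)

# Predictions are aggregated over the 100 trees by majority voting.
print("test accuracy:", bagging.score(X_test, y_test))
```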

Bagging visual representation:


· Advantages of a Bagging Model:

1. Bagging significantly decreases the variance without increasing bias.

2. Bagging methods work so well because of the diversity in the training subsets created by bootstrapping.

3. Also, if the training set is very large, bagging can save computational time by training each model on a relatively smaller subset while still increasing the accuracy of the model.

4. Works well with small datasets as well.

· Disadvantages of a Bagging Model:

1. The main disadvantage of bagging is that it improves the accuracy of the model at the expense of interpretability, i.e., if a single tree were used as the base model, it would produce an easily interpretable diagram, but with bagging this interpretability is lost.

2. Another disadvantage of Bootstrap Aggregation is that we have no control over which observations end up in each sampled subset, i.e., there is a chance that some observations are never used, which may result in a loss of important information.

Out-of-Bag Evaluation: In bagging, when the different bootstrap samples are drawn, no sample contains all the data, only a fraction of the original dataset, and some instances may never be sampled at all. The instances that are not sampled for a given predictor are called its out-of-bag (OOB) instances, and because that predictor never sees them during training, they can be used to evaluate it without a separate validation set.

The Random Forest approach is a bagging method in which deep decision trees, fitted on bootstrap samples, are combined to produce an output with lower variance.
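A small sketch of both ideas with scikit-learn (again my own illustration, not from the article): passing oob_score=True makes the ensemble score each training instance using only the estimators that did not see it during bootstrapping, and RandomForestClassifier, being a bagging-of-deep-trees method, exposes the same option.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Out-of-bag evaluation: each instance is scored only by the estimators
# whose bootstrap sample did not contain it.
bagging = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100, oob_score=True, random_state=42
)
bagging.fit(X, y)
print("bagging OOB accuracy:", bagging.oob_score_)

# Random Forest: bagging of deep decision trees, with the same OOB option.
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
forest.fit(X, y)
print("random forest OOB accuracy:", forest.oob_score_)
```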

2. BOOSTING

· Boosting models form another family of ensemble methods.

· Boosting, initially named Hypothesis Boosting, is based on the idea of filtering or weighting the data used to train our team of weak learners, so that each new learner gives more weight to, or is trained only on, the observations that were poorly classified by the previous learners.

· By doing this, our team of models learns to make accurate predictions on all kinds of data, not just on the most common or easiest observations. Also, if one of the individual models is very bad at making predictions on some kind of observation, it does not matter much, as the other N-1 models will most likely make up for it.

Definition: — The term ‘Boosting’ refers to a family of algorithms which convert weak learners into strong learners. Boosting is an ensemble method for improving the model predictions of any given learning algorithm. The idea of boosting is to train weak learners sequentially, each trying to correct the errors of its predecessor; in the process, the sequence of weak learners is combined into a strong learner.


· Also, in boosting the training data is re-weighted, so that observations that were incorrectly classified by classifier n are given more importance in the training of classifier n + 1, while in bagging the training samples are drawn randomly from the whole population.

· While in bagging the weak learners are trained in parallel using randomness, in boosting the learners are trained sequentially, such that each subsequent learner aims to reduce the errors of the previous learners.

· Boosting, like bagging, can be used for regression as well as for classification problems.

· Boosting is mainly focused on reducing bias.
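As a hedged illustration of this sequential re-weighting (my own sketch, not from the original article), here is AdaBoost in scikit-learn: each new decision stump is fitted to a re-weighted version of the data in which the previously misclassified points count for more.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

boosting = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # a decision "stump", the classic weak learner
    n_estimators=200,                     # number of sequential boosting rounds
    learning_rate=0.5,                    # shrinks each learner's contribution
    random_state=42,
)
boosting.fit(X_train, y_train)
print("test accuracy:", boosting.score(X_test, y_test))
```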

Any algorithm could be used as the base learner for the boosting technique, but decision trees are usually chosen for the following reasons:

Pros

· Computational scalability,

· Handles missing values,

· Robust to outliers,

· Does not require feature scaling,

· Can deal with irrelevant inputs,

· Interpretable (if small),

· Handles mixed predictors as well (quantitative and qualitative)

Cons

· Inability to capture a linear combination of features

· High variance (a single deep tree overfits easily)

And that’s where boosting comes into the picture: by combining the results from many small trees, it keeps the variance in check while steadily reducing the bias.

· Advantages of a Boosting Model:

1. Boosting is a resilient method that curbs over-fitting easily.

2. Provably effective

3. Versatile — can be applied to a wide variety of problems.

· Disadvantages of a Boosting Model:

1. A disadvantage of boosting is that it is sensitive to outliers, since every classifier is obliged to fix the errors of its predecessors. Thus, the method can become overly dependent on outliers.

2. Another disadvantage is that the method is hard to scale up. This is because every estimator bases its corrections on the previous predictors, which makes the training procedure difficult to parallelize.

AdaBoost (Adaptive Boosting), Gradient Boosting, and XGBoost (Extreme Gradient Boosting) are a few common examples of boosting techniques.
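For completeness, here is a short gradient-boosting sketch with scikit-learn (my own example, not from the article); XGBoost follows the same idea of fitting each new tree to the errors of the current ensemble, with a very similar fit/predict interface.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

gbm = GradientBoostingClassifier(
    n_estimators=200,   # number of sequentially added trees
    learning_rate=0.1,  # contribution of each new tree
    max_depth=3,        # shallow trees act as the weak learners
    random_state=42,
)
gbm.fit(X_train, y_train)
print("test accuracy:", gbm.score(X_test, y_test))
```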

3. STACKING

· Stacked Generalization or “Stacking” for short is an ensemble machine learning algorithm.

· Stacking mainly differs from bagging and boosting on two points. First, stacking often considers heterogeneous weak learners (different learning algorithms are combined), whereas bagging and boosting mainly consider homogeneous weak learners. Second, stacking learns to combine the base models using a meta-model, whereas bagging and boosting combine weak learners following deterministic algorithms.

Definition: — Stacking is an ensemble learning method that combines multiple machine learning algorithms via meta-learning: the base-level algorithms are trained on the complete training dataset, and a meta-model is then trained using the outputs of all the base-level models as features. Bagging and boosting are mainly aimed at handling variance and bias; stacking is used to improve the overall prediction accuracy of the model.

Visual representation of Stacked Generalization:


In the figure above, we can see that different samples are not drawn from the training data to train the classifiers. Instead, the whole dataset is used to train every individual classifier. In this process, each classifier works independently, which permits classifiers with different hypotheses and algorithms. For instance, we can train a logistic regression classifier, a decision tree, and a random forest as base models, and then combine their predictions using a support vector machine, as sketched below.
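Here is a sketch of that exact stack with scikit-learn (my own illustration, not from the article). One practical detail: StackingClassifier builds the meta-model’s training features from cross-validated predictions of the base models rather than from their fit on the full training set, which helps avoid leaking the training labels into the meta-model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

stack = StackingClassifier(
    estimators=[  # heterogeneous base (level-0) models
        ("logreg", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier()),
        ("forest", RandomForestClassifier(n_estimators=100)),
    ],
    final_estimator=SVC(),  # meta-model trained on the base models' predictions
)
stack.fit(X_train, y_train)
print("test accuracy:", stack.score(X_test, y_test))
```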

· Stacking, just like other ensemble techniques, tries to improve the accuracy of a model by using the predictions of not-so-good models as input features for a better model.

· Advantages of a Stacked Generalization Model:

1. The benefit of stacking is that it can harness the capabilities of a range of well-performing models on a classification or regression task and make predictions that have better performance than any single model in the ensemble.

2. Stacking improves the model prediction accuracy.

· Disadvantage of a Stacked Generalization Model:

1. As the whole dataset is used to train every individual classifier, the computational time can be high for huge datasets, since each classifier works independently on all of the data.

Conclusion:

Finally, we would like to conclude by reminding you that ensemble learning is about combining base models to obtain an ensemble model with better performance/properties. Even though bagging, boosting, and stacking are the most commonly used ensemble methods, variants are possible and can be designed to better adapt to specific problems.

In my next article, I’ll come up with a detailed explanation of the most common example of bagging, i.e., the Random Forest classifier. Till then, keep reading.

Thanks for reading!

