Mixture of Experts and Stacking

The application of the technique does not have to be limited to neural network models; a range of standard machine learning algorithms can be used in its place to a similar end.

In this way, the mixture of experts method belongs to a broader class of ensemble learning methods that also includes stacked generalization, known more simply as stacking. Like a mixture of experts, stacking trains a diverse ensemble of machine learning models and then learns a higher-level model to best combine their predictions.

We might refer to this class of ensemble learning methods as meta-learning models. That is, models that attempt to learn from the output of, or learn how to best combine the output of, other lower-level models.

Meta-learning is a process of learning from learners (classifiers). […] In order to induce a meta classifier, first the base classifiers are trained (stage one), and then the Meta classifier (second stage).

— Page 82, Pattern Classification Using Ensemble Methods, 2010.

Unlike a mixture of experts, stacking models are often all fit on the same training dataset; that is, there is no decomposition of the task into subtasks. Also unlike a mixture of experts, the higher-level model that combines the predictions from the lower-level models typically does not receive the input pattern provided to the lower-level models; instead, it takes as input the predictions from each lower-level model.
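To make this concrete, a minimal stacking sketch using scikit-learn (assuming it is available) might look as follows; note the meta-model (the final estimator) is fed only the base models' predictions, not the original input pattern:

```python
# Minimal stacking sketch: a logistic regression meta-model learns to
# combine the predictions of two diverse base models (illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# synthetic dataset for demonstration
X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

stack = StackingClassifier(
    estimators=[
        ('tree', DecisionTreeClassifier(random_state=1)),
        ('knn', KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(),  # meta-model combines predictions
    passthrough=False,  # meta-model does NOT receive the input pattern
)
stack.fit(X_train, y_train)
print('accuracy: %.3f' % stack.score(X_test, y_test))
```

Here, the base models are first trained (stage one) on cross-validated folds of the training data, and the meta classifier is then fit on their out-of-fold predictions (stage two), mirroring the two-stage process described in the quote above.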

Meta-learning methods are best suited for cases in which certain classifiers consistently correctly classify, or consistently misclassify, certain instances.

— Page 82, Pattern Classification Using Ensemble Methods, 2010.

Nevertheless, there is no reason why hybrid stacking and mixture-of-experts models cannot be developed that may perform better than either approach in isolation on a given predictive modeling problem.

For example:

  • Consider treating the lower-level models in stacking as experts trained on different perspectives of the training data. Perhaps this could involve using a softer approach to decomposing the problem into subproblems where different data transforms or feature selection methods are used for each model.
  • Consider providing the input pattern to the meta model in stacking in an effort to make the weighting or contribution of lower-level models conditional on the specific context of the prediction.
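Both ideas can be sketched together in scikit-learn (assumed available): each base model sees a different "view" of the data via its own transform pipeline, a soft decomposition of the problem, and setting `passthrough=True` gives the meta-model the input pattern alongside the base predictions. The specific transforms and models below are illustrative choices, not a prescription:

```python
# Hybrid sketch: per-model data transforms (soft subproblem decomposition)
# plus an input-aware meta-model via passthrough=True.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import StackingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=1)

hybrid = StackingClassifier(
    estimators=[
        # each "expert" works from a different perspective of the data
        ('pca_tree', Pipeline([('pca', PCA(n_components=5)),
                               ('tree', DecisionTreeClassifier(random_state=1))])),
        ('kbest_lr', Pipeline([('kbest', SelectKBest(f_classif, k=5)),
                               ('lr', LogisticRegression())])),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    passthrough=True,  # meta-model also receives the raw input pattern
)
hybrid.fit(X, y)
print('train accuracy: %.3f' % hybrid.score(X, y))
```

With the input pattern available, the meta-model can, in effect, make the weighting of each lower-level model conditional on where in the input space a prediction falls, which is the defining behavior of the gating network in a mixture of experts.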