An ML technique that divides a model into multiple “expert” models.

Main components:

- A set of expert networks $E_1, \dots, E_n$, each a sub-model (e.g., a small feed-forward network).
- A gating network (the gate) $G$ that produces a score for each expert given the input.

The output of the gate is used to combine the outputs of the experts in some way.

Given a learned gating network $G$ and the experts $E_1, \dots, E_n$, one typical example is:

$$y = \sum_{i=1}^{n} G(x)_i \, E_i(x)$$

where the gating network with weights $W_g$ is defined as:

$$G(x) = \mathrm{softmax}(x \, W_g)$$

In plain English, the experts' outputs are combined in a weighted sum, with the weights given by the gating scores.
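As a concrete illustration, here is a minimal dense MoE layer sketch in PyTorch (the class, parameter names, and MLP expert shape are illustrative choices, not from this note): every expert runs on every input, and the softmax gate weights their outputs as in the formula above.

```python
import torch
import torch.nn as nn


class DenseMoE(nn.Module):
    """Dense mixture of experts: every expert runs, outputs are softmax-weighted."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int):
        super().__init__()
        # The experts E_1, ..., E_n: small two-layer MLPs here.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.ReLU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(n_experts)
        ])
        # The gating network G(x) = softmax(x W_g): one linear map to n scores.
        self.gate = nn.Linear(d_model, n_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model)
        scores = torch.softmax(self.gate(x), dim=-1)             # (batch, n_experts)
        outs = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, n_experts, d_model)
        # y = sum_i G(x)_i * E_i(x): weighted sum over the expert dimension.
        return (scores.unsqueeze(-1) * outs).sum(dim=1)


# Usage: 8 experts over 16-dimensional inputs.
layer = DenseMoE(d_model=16, d_hidden=32, n_experts=8)
y = layer(torch.randn(4, 16))  # shape (4, 16)
```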

A more modern way of using MoE is sparsely-gated mixture of experts: the gate selects only the top-$k$ experts for each input, so most experts are never evaluated and compute stays roughly constant as the number of experts grows.
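A rough sketch of top-$k$ (sparse) gating, under the same assumed setup as the dense layer above (again, names and hyperparameters are illustrative): only the $k$ selected experts are run per input, and their gate scores are renormalised with a softmax.

```python
import torch
import torch.nn as nn


class SparseMoE(nn.Module):
    """Sparsely-gated MoE: each input is routed to its top-k experts only."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.ReLU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(n_experts)
        ])
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model)
        logits = self.gate(x)                               # (batch, n_experts)
        top_vals, top_idx = logits.topk(self.k, dim=-1)     # keep only the k best experts per row
        weights = torch.softmax(top_vals, dim=-1)           # renormalise over the selected experts
        y = torch.zeros_like(x)
        for slot in range(self.k):
            idx = top_idx[:, slot]                          # expert chosen at this slot, per row
            w = weights[:, slot].unsqueeze(-1)              # its renormalised gate weight
            for e in idx.unique().tolist():
                rows = idx == e
                # Route only the matching rows through expert e.
                y[rows] += w[rows] * self.experts[e](x[rows])
        return y


# Usage: 8 experts, but each input only activates 2 of them.
layer = SparseMoE(d_model=16, d_hidden=32, n_experts=8, k=2)
y = layer(torch.randn(4, 16))  # shape (4, 16)
```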