Mixture of experts, but with conditional computation (AKA don’t evaluate all the experts for every input).
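As a rough illustration of conditional computation, the sketch below evaluates only the k experts with the largest gate weights and skips the rest. Top-k selection is picked here purely as an example, and the names `sparse_moe`, `make_expert`, and the toy scalar experts are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_moe(x, experts, gate_scores, k=2):
    """Combine only the k experts with the largest gate weights (assumed top-k rule)."""
    weights = softmax(gate_scores)
    # Indices of the k largest gate weights.
    top = sorted(range(len(experts)), key=lambda i: weights[i], reverse=True)[:k]
    # Renormalize so the selected weights sum to 1.
    norm = sum(weights[i] for i in top)
    # Only the selected experts are called; the others are never evaluated.
    return sum(weights[i] / norm * experts[i](x) for i in top)

calls = []  # records which toy experts actually ran

def make_expert(name, scale):
    # Hypothetical toy expert: multiplies its input by a constant.
    def expert(x):
        calls.append(name)
        return scale * x
    return expert

experts = [make_expert(f"e{i}", float(i + 1)) for i in range(4)]
y = sparse_moe(1.0, experts, gate_scores=[0.0, 5.0, 5.0, 0.0], k=2)
```

With these made-up gate scores, only `e1` and `e2` are called, so the cost scales with k, not with the total number of experts.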
Using the output of the gate network, some common approaches are