
Quarrel on April 17, 2024 | on: Mixtral 8x22B

So it is somewhat like a classic random forest or maybe bagging, where you're trying to stop overfitting, but you're also training that top layer to know which "experts" suit the current inputs, so that you minimise the number of MLP sublayers called during inference?

Yea, it's very much bagging + top layer (router) for the importance score!
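
For anyone who wants to see the idea in code, here is a rough sketch of top-k routing for a single token, not Mixtral's actual implementation. The class name `ToyMoE`, the random weights, the tanh stand-in for each expert MLP, and the defaults of 8 experts with 2 active are all assumptions made for illustration. The router produces one score per expert, only the top-scoring experts are called, and their outputs are mixed using the softmaxed scores.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a short list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(w, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


class ToyMoE:
    """Hypothetical toy mixture-of-experts layer for one token vector."""

    def __init__(self, dim, n_experts=8, top_k=2, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols):
            return [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.top_k = top_k
        self.router = rand(n_experts, dim)  # the "top layer": one score per expert
        self.experts = [rand(dim, dim) for _ in range(n_experts)]
        self.calls = 0  # counts how many expert sublayers actually ran

    def forward(self, x):
        logits = matvec(self.router, x)
        # Keep only the top_k highest-scoring experts for this input.
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        chosen = order[: self.top_k]
        # Importance weights: softmax over the chosen experts' scores only.
        weights = softmax([logits[i] for i in chosen])
        out = [0.0] * len(x)
        for w, i in zip(weights, chosen):
            self.calls += 1
            # Stand-in for a real expert MLP: one linear map plus tanh.
            y = [math.tanh(v) for v in matvec(self.experts[i], x)]
            out = [o + w * yi for o, yi in zip(out, y)]
        return out, chosen, weights


moe = ToyMoE(dim=4)
out, chosen, weights = moe.forward([1.0, -0.5, 0.25, 2.0])
print("experts called:", chosen, "weights:", weights)
```

The other experts are never evaluated for that token, which is where the inference saving comes from. One difference from bagging in this sketch: the mixing weights depend on the input rather than being a fixed average.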