Does anyone have a good layman's explanation of the "Mixture-of-Experts" concept...

huevosabio · on April 17, 2024

Ignore the "experts" part, it misleads a lot of people [0]. There is no explicit specialization in the most popular setups; whatever specialization emerges is learned implicitly through training. In short: MoEs add multiple MLP sublayers and a routing mechanism after each attention sublayer and let the training procedure learn the MLP parameters and the routing parameters.

In a longer, but still rough, form...

How these transformers work is roughly:

``` x_{l+1} = mlp_l(attention_l(x_l)) ```

where `x_l` is the hidden representation at layer l, `attention_l` is the attention sublayer at layer l, and `mlp_l` is the multilayer perceptron sublayer at layer l.

This MLP layer is very expensive because it is fully connected (i.e. every input has a weight to every output). So instead of creating an even bigger, more expensive MLP to get more capability, MoEs create K MLP sublayers (the "experts") and a router that decides which MLP sublayers to use. This router spits out an importance score for each MLP "expert", and then you choose the top T MLPs and take an average weighted by importance, so roughly:

``` x_{l+1} = \sum_e mlp_{l,e}(attention_l(x_l)) * importance_score_{l, e} ```

where `importance_score_{l, e}` is the score computed by the router at layer l for "expert" e. That is, `importance_score_{l} = router_l(attention_l(x_l))`. Note that here we are summing over all experts, but in reality we choose the top T, often 2, and use only those.
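
Written out for the top-T case, a rough version of the formula could look like the following. The names `router_l`, `h_l`, `s_l` and the set of selected experts are introduced here for illustration, and renormalizing the selected scores with a softmax is an assumption (one common choice, not the only one):

```latex
\begin{align*}
h_l &= \mathrm{attention}_l(x_l), \qquad s_l = \mathrm{router}_l(h_l) \in \mathbb{R}^{K} \\
\mathcal{T}_l &= \text{indices of the } T \text{ largest entries of } s_l \\
x_{l+1} &= \sum_{e \in \mathcal{T}_l}
  \frac{\exp(s_{l,e})}{\sum_{e' \in \mathcal{T}_l} \exp(s_{l,e'})}
  \,\mathrm{mlp}_{l,e}(h_l)
\end{align*}
```

Only the T selected MLPs are evaluated; the other K - T contribute nothing and cost nothing for that token.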

[0] some architectures do, in fact, combine domain experts to make a greater whole, but not the currently popular flavor

Quarrel · on April 17, 2024

So it is somewhat like a classic random forest or maybe bagging, where you're trying to stop overfitting, but you're also trying to train that top layer to know who could be the "experts" given the current inputs, so that you're minimising the number of MLP sublayers called during inference?

huevosabio · on April 17, 2024

Yea, it's very much bagging + top layer (router) for the importance score!

DougBTX · on April 17, 2024

Would this be a reasonable explanation?

> MLPs are universal function approximators, but these models are big enough that it is better to train many small functions rather than a single unified function. MoE is a mechanism to force different parts of the model to learn distinct functions.

samus · on April 17, 2024

It misses the crucial detail that every transformer layer chooses the experts independently from the others. Of course they still indirectly influence each other since each layer processes the output of the previous one.

hlfshell · on April 17, 2024

This is a bit of a misnomer. Each expert is a sub network that specializes in sub understanding we can't possibly track.

During training, the routing network is penalized if it does not distribute training tokens evenly across the experts. This prevents any one or two networks from becoming the primary networks.

The result of this is that each token has essentially even probability of being routed to one of the sub models, with the underlying logic of why that model is an expert for that token being beyond our understanding or description.
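
To make the "penalized for uneven routing" idea concrete, here is a minimal sketch of one common form of auxiliary load-balancing loss (the style used in Switch-Transformer-like routers). It is an assumption that any given model uses exactly this form; the function name and the toy inputs are made up for illustration.

```python
def load_balance_loss(router_probs, assignments, n_experts):
    """Auxiliary loss: n_experts * sum_e (fraction of tokens sent to e) * (mean router prob for e).

    router_probs: one list of per-expert probabilities per token.
    assignments:  the expert index each token was actually routed to (top-1 here).
    """
    n_tokens = len(router_probs)
    # Fraction of tokens dispatched to each expert.
    frac_tokens = [0.0] * n_experts
    for e in assignments:
        frac_tokens[e] += 1.0 / n_tokens
    # Average probability the router assigned to each expert.
    mean_prob = [sum(p[e] for p in router_probs) / n_tokens for e in range(n_experts)]
    return n_experts * sum(f * p for f, p in zip(frac_tokens, mean_prob))


# Evenly spread routing versus routing that collapses onto expert 0.
balanced = load_balance_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], n_experts=2)
collapsed = load_balance_loss([[0.9, 0.1]] * 4, [0, 0, 0, 0], n_experts=2)
print(balanced, collapsed)
```

The collapsed case scores higher than the balanced one, so adding this term (scaled by a small coefficient) to the training loss nudges the router toward spreading tokens out.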

andai · on April 17, 2024

I heard MoE reduces inference costs. Is that true? Don't all the sub networks need to be kept in RAM the whole time? Or is the idea that it only needs to run compute on a small part of the total network, so it runs faster? (So you complete more requests per minute on same hardware.)

Edit: Apparently each part of the network is on a separate device. Fascinating! That would also explain why the routing network is trained to choose equally between experts.

I imagine that may reduce quality somewhat though? By forcing it to distribute problems equally across all of them, whereas in reality you'd expect task type to conform to the pareto distribution.

MPSimmons · on April 17, 2024

>I heard MoE reduces inference costs

Computational costs, yes. You still take the same amount of time for processing the prompt, but each token created through inference costs less computationally than if you were running it through _all_ experts.

samus · on April 17, 2024

It should increase quality since those layers can specialize on subsets of the training data. This means that getting better in one domain won't make the model worse in all the others anymore.

We can't really tell what the router does. There have been experiments where the router in the early blocks was compromised, and quality only suffered moderately. In later layers, as the embeddings pick up more semantic information, it matters more and might approach our naive understanding of the term "expert".

Filligree · on April 17, 2024

The latter. Yes, it all needs to stay in memory.

fire_lake · on April 17, 2024

Why do we expect this to perform better? Couldn’t a regular network converge on this structure anyways?

famouswaffles · on April 17, 2024

It doesn't perform better and until recently, MoE models actually underperformed their dense counterparts. The real gain is sparsity. You have this huge x parameter model that is performing like an x parameter model but you don't have to use all those parameters at once every time so you save a lot on compute, both in training and inference.
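
A back-of-the-envelope sketch of that sparsity argument, assuming the parameters split cleanly into a shared part (attention, embeddings) and per-expert MLP parts. The function name and the figures below are made up and do not describe any real model:

```python
def moe_param_counts(shared, per_expert, n_experts, top_k):
    """Return (total, active) parameter counts for a simplified MoE model."""
    total = shared + n_experts * per_expert   # must all be held in memory
    active = shared + top_k * per_expert      # actually used for one token
    return total, active


# Hypothetical sizes: 2B shared parameters, 8 experts of 5B each, top-2 routing.
total, active = moe_param_counts(
    shared=2_000_000_000, per_expert=5_000_000_000, n_experts=8, top_k=2
)
print(f"total={total / 1e9:.0f}B active={active / 1e9:.0f}B")
```

Memory scales with the total, while per-token compute scales with the active count, which is where the savings come from.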

imjonse · on April 17, 2024

It is a type of ensemble model. A regular network could do it, but a MoE will select a subset to do the task faster than the whole model would.

rgbrgb · on April 17, 2024

Here's my naive intuition: in general bigger models can store more knowledge but take longer to do inference. MoE provides a way to blend the advantages of having a bigger model (more storage) with the advantages of having smaller models at inference time (faster, less memory required). When you do inference, tokens hit a small layer that is load balancing the experts then activate 1 or 2 experts. So you're storing roughly 8 x 22B "worth" of knowledge without having to run a model that big.

Maybe a real expert can confirm if this is correct :)

nialv7 · on April 17, 2024

Sounds like the "you only use 10% of your brain" myth, but actually real this time.

samus · on April 17, 2024

Almost :) the model chooses experts in every block. For a typical 7B with 8 experts there will be 8^32=2^96 paths through the whole model.
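
The arithmetic can be checked with a few lines, assuming 32 blocks and one expert chosen per block (top-1); the `routing_paths` helper is a name made up here. With top-2 routing each block picks an unordered pair instead, which gives even more combinations:

```python
from math import comb


def routing_paths(n_experts, n_layers, top_k=1):
    """Number of distinct expert selections across all layers."""
    return comb(n_experts, top_k) ** n_layers


top1_paths = routing_paths(8, 32)           # 8 choices per block
top2_paths = routing_paths(8, 32, top_k=2)  # C(8, 2) = 28 pairs per block
print(top1_paths == 2**96, top2_paths == 28**32)
```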

cjbprime · on April 17, 2024

Not quite, you don't save memory, only compute.

api · on April 17, 2024

A decent loose analogy might be database sharding.

Basically you're sharding the neural network by "something" that is itself tuned during the learning process.

wenc · on April 17, 2024

Would it be analogous to say instead of having a single Von Neumann who is a polymath, we’re posing the question to a pool of people who are good at their own thing, and one of them gets picked to answer?

Filligree · on April 17, 2024

Not really. The “expert” term is a misnomer; it would be better put as “brain region”.

Human brains seem to do something similar, inasmuch as blood flow (and hence energy use) per region varies depending on the current problem.

andai · on April 17, 2024

Any idea why everyone seems to be using 8 experts? (Or was GPT-4 using 16?) Did we just try different numbers and find 8 was the optimum?

wongarsu · on April 17, 2024

Probably because 8 GPUs is a common setup, and with 8 experts you can put each expert on a different GPU

andai · on April 17, 2024

Has anyone tried MoE at smaller scales? e.g. a 7B model that's made of a bunch of smaller ones? I guess that would be 8x1B.

Or would that make each expert too small to be useful? TinyLlama is 1B and it's almost useful! I guess 8x1B would be Mixture of TinyLLaMAs...

samus · on April 17, 2024

There is Qwen1.5-MoE-A2.7B, which was made by upcycling the weights of Qwen1.5-1.8B, splitting it and finetuning it.

jasonjmcghee · on April 17, 2024

Yes there are many fine tunes on huggingface. Search "8x1B huggingface"

auspiv · on April 17, 2024

The previous Mixtral is 8x7B

londons_explore · on April 17, 2024

Nobody decides. The network itself determines which expert(s) to activate based on the context. It uses a small neural network for the task.

It typically won't behave like human experts - you might find one of the networks is an expert in determining where to place capital letters or full stops for example.

MoEs do not really improve accuracy; instead they reduce the amount of compute required. And, assuming you have a fixed compute budget, that in turn might mean you can make the model bigger to get better accuracy.

woadwarrior01 · on April 17, 2024

Not quite a layman's explanation, but if you're familiar with the implementation(s) of vanilla decoder only transformers, mixture-of-experts is just a small extension.

During inference, instead of a single MLP in each transformer layer, MoEs have `n` MLPs and a single layer "gate" in each transformer layer. In the forward pass, softmax of the gate's output is used to pick the top `k` (where k is < n) MLPs to use. The relevant code snippet in the HF transformers implementation is very readable IMO, and only about 40 lines.
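
For readers who want to see the shape of it without opening the repository, here is a minimal plain-Python sketch of that forward pass for a single token. It is not the HF code: `make_mlp`, `moe_forward`, the toy sizes, and the choice to renormalize the top `k` probabilities so they sum to 1 are all assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(w, x):
    # w is a list of rows; returns the matrix-vector product w @ x.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def make_mlp(rng, d, hidden):
    # A tiny two-layer MLP with ReLU and random weights (one "expert").
    w1 = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(hidden)]
    w2 = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(d)]

    def mlp(x):
        a = [max(0.0, v) for v in matvec(w1, x)]
        return matvec(w2, a)

    return mlp


def moe_forward(h, gate_w, experts, k=2):
    # Single-layer gate: one score per expert, turned into probabilities.
    probs = softmax(matvec(gate_w, h))
    # Keep only the k most probable experts.
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    out = [0.0] * len(h)
    for i in top:
        weight = probs[i] / norm          # renormalized gate weight
        y = experts[i](h)                 # only selected experts are run
        out = [o + weight * v for o, v in zip(out, y)]
    return out, top


rng = random.Random(0)
d, hidden, n = 4, 8, 8
experts = [make_mlp(rng, d, hidden) for _ in range(n)]
gate_w = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
h = [rng.gauss(0, 1) for _ in range(d)]
out, top = moe_forward(h, gate_w, experts, k=2)
print(top, out)
```

Real implementations do the same thing batched over many tokens, grouping tokens by selected expert so each expert runs once per batch.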

https://github.com/huggingface/transformers/blob/main/src/tr...

vineyardmike · on April 17, 2024

It’s not “experts” in the typical sense of the word. There is no discrete training to learn a particular skill in one expert. It’s more closely modeled as a bunch of smaller models grafted together.

These models are actually a collection of weights for different parts of the system. It’s not “one” neural network. Transformers are composed of layers of transformations to the input, and each step can have its own set of weights. There was a recent video on the front page that had a good introduction to this. There is the MLP, there are the attention heads, etc.

With that in mind, a MoE model is basically where one of those layers has X different versions of the weights, and then an added layer (another neural network with its own weights) that picks the version of “expert” weights to use.

zozbot234 · on April 17, 2024

It's really a kind of enforced sparsity, in that it requires that only a limited number of blocks be active at a time during inference. Which blocks will be active for each token is decided by the network itself as part of training.

(Notably, MoE should not be conflated with ensemble techniques, which is where you would train entire separate networks, then use heuristic techniques to run inference across all of them simultaneously and combine the results.)

jerpint · on April 17, 2024

The simplest way to think about it is a form of dropout but instead of dropping weights, you drop an entire path of the network

adtac · on April 17, 2024

As always, code is the best documentation: https://github.com/ggerganov/llama.cpp/blob/8dd1ec8b3ffbfa2d...

Keyframe · on April 17, 2024

maybe there's one that is maitre d'llm?

jsemrau · on April 17, 2024

There is some good documentation around mergekit available that actually explains a lot and might be a good place to start.

HeatrayEnjoyer · on April 17, 2024

Correct, the experts are determined by the algorithm, not by anything humans would understand.