[–] feynmanatom@alien.top 1 points 10 months ago

Hmm, not sure I follow what an "encoding layer" would be? The encoding (prefill) phase involves filling the KV cache at every layer across the depth of the model. I don't think there's a single activation you could just pass across to another model without model surgery plus additional fine-tuning.
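
For anyone following along, here's roughly what I mean, as a toy PyTorch sketch (the sizes and the omitted attention/MLP step are just placeholders): during prefill, every layer writes its own K/V into the cache, and those tensors are a function of that layer's own weights, which is why you can't just hand them to a different model.

```python
import torch

# Toy sizes, purely for illustration (not any real model's config).
n_layers, n_heads, head_dim, d_model = 4, 8, 64, 512
seq_len = 16  # prompt length

# One (K, V) entry per layer, filled during the encoding / prefill phase.
kv_cache = [None] * n_layers

def prefill(hidden, wk, wv):
    """Run the prompt through every layer, caching each layer's keys/values."""
    for layer in range(n_layers):
        k = (hidden @ wk[layer]).view(1, seq_len, n_heads, head_dim).transpose(1, 2)
        v = (hidden @ wv[layer]).view(1, seq_len, n_heads, head_dim).transpose(1, 2)
        kv_cache[layer] = (k, v)  # tied to *this* layer's weights
        # ...attention + MLP would update `hidden` here; omitted for brevity.
    return hidden

hidden = torch.randn(1, seq_len, d_model)
wk = torch.randn(n_layers, d_model, n_heads * head_dim)
wv = torch.randn(n_layers, d_model, n_heads * head_dim)
prefill(hidden, wk, wv)
```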

[–] feynmanatom@alien.top 1 points 10 months ago (1 children)

Lots of rumors, but tbh I think it's highly unlikely they're using an MoE. MoEs pay off at batch size = 1, where you can actually exploit the sparsity (only the few experts the gate selects have their weights read per token), but not at larger batch sizes, where different tokens get routed to different experts and you end up activating nearly all of them. You'd still need enough RAM to hold every expert, so you'd miss out on the point of using an MoE.
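
Rough toy numbers to see why (assuming top-2 routing over 8 experts, which is just an illustrative config, not a claim about any real model):

```python
import torch

n_experts, top_k = 8, 2  # assumed toy MoE config, not any real model's

def experts_touched(batch_size):
    """Distinct experts one MoE layer must load for a single decode step."""
    logits = torch.randn(batch_size, n_experts)   # stand-in gating scores
    chosen = logits.topk(top_k, dim=-1).indices   # top-k expert choice per token
    return chosen.unique().numel()

for bs in (1, 4, 16, 64):
    print(f"batch={bs:>3}  experts touched: {experts_touched(bs)}")
```

At batch size 1 only top_k experts' weights get read per step; as the batch grows, the union of selected experts approaches all of them, so every expert has to sit in memory and most of the savings disappear.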

[–] feynmanatom@alien.top 1 points 10 months ago (2 children)

This might be pedantic, but this field has so much random vocabulary that it's better for folks not to be confused.

MoE is slightly different. An MoE is a single LLM with gated layers, where a gating network "selects" which experts (parallel feed-forward blocks inside the layer) each token's embedding gets routed to. It's pretty difficult to scale and serve in practice.
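
Stripped down, a gated MoE block looks something like this (a minimal PyTorch sketch with assumed sizes; real implementations add load balancing, capacity factors, expert parallelism, etc.):

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """One MoE feed-forward block: a gate picks top-k experts per token."""
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                       # x: (tokens, d_model)
        scores = self.gate(x).softmax(dim=-1)   # routing probabilities per token
        weights, idx = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):             # dispatch tokens to their chosen experts
            for e in range(len(self.experts)):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * self.experts[e](x[mask])
        return out

x = torch.randn(10, 64)       # 10 tokens
print(TinyMoE()(x).shape)     # torch.Size([10, 64])
```

The key point: the gate picks experts inside a layer, per token; it doesn't pick between separate models, which is why this is different from the router idea below.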

I think what you're referring to is more like a model router. You can use a general LLM to "classify" a prompt and then route the entire prompt to a downstream LLM. It's unclear whether this would be faster than just running a 70B LLM, since you'd repeat the encoding phase and spend some generation on the classification step, but the quality of the answers could certainly be better.
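
Something like this, where the classifier and the downstream models are just stand-in functions I made up (in practice each would be its own LLM call, and the chosen model re-encodes the whole prompt from scratch, which is the repeated encoding cost I mean):

```python
def classify_prompt(prompt: str) -> str:
    """Stand-in for a small LLM that labels the prompt (categories are assumed)."""
    if any(w in prompt.lower() for w in ("code", "python", "bug")):
        return "code"
    return "general"

MODELS = {
    "code": lambda p: f"[code-specialist model answers] {p}",
    "general": lambda p: f"[general 70B-class model answers] {p}",
}

def route(prompt: str) -> str:
    label = classify_prompt(prompt)   # extra encoding + a little generation
    return MODELS[label](prompt)      # full prompt is re-encoded downstream

print(route("Why does my Python code raise a KeyError?"))
```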