If it works that way, it will only be short term. The only reason it doesn't run well on a GPU is the conditional matrix ops, so the GPU makers will just add support for them. Then they'll be back on top with the same margins again.
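For anyone wondering what "conditional matrix ops" means here, a rough sketch (my own toy version, not the paper's actual code, and the routing rule is made up) of a conditional feedforward that only touches one slice of the weights, picked by data-dependent branches:

```python
import numpy as np

def dense_ff(x, W):
    # Ordinary feedforward: every input multiplies the full weight matrix.
    # This is exactly what GPU matmul kernels are optimized for.
    return x @ W

def conditional_ff(x, W, depth=3):
    # Sketch of a conditional matrix op: a few cheap comparisons pick ONE
    # column block of W per input, so most of W is never touched.
    # The data-dependent branching is what stock GPU kernels don't do.
    n_blocks = 2 ** depth
    block = W.shape[1] // n_blocks
    idx = 0
    for d in range(depth):
        # Illustrative routing rule: branch on the sign of one feature.
        idx = 2 * idx + (1 if x[d] > 0 else 0)
    cols = slice(idx * block, (idx + 1) * block)
    return x @ W[:, cols]          # only 1/n_blocks of the work

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
W = rng.standard_normal((64, 256))
print(dense_ff(x, W).shape)        # (256,)
print(conditional_ff(x, W).shape)  # (32,) - one selected block
```

Per token it does a fraction of the math, but the branches make the memory access pattern input-dependent, which is the part current GPU matmul hardware isn't built for.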
Also, they say the speedup decreases with more layers, so the bigger the model, the smaller the benefit. A 512B model is much bigger than a 7B model, so the speedup will be much less, possibly none at all.
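The shrinking benefit is basically Amdahl's law: if the trick only speeds up part of each forward pass, the overall gain is capped by everything it doesn't touch. A back-of-envelope sketch with made-up numbers:

```python
def overall_speedup(accel_fraction, local_speedup):
    # Amdahl's law: only the accelerated fraction of the compute benefits;
    # the rest (attention, norms, etc.) still runs at the old speed.
    return 1.0 / ((1.0 - accel_fraction) + accel_fraction / local_speedup)

# Hypothetical numbers just to show the shape of the effect: even if the
# conditional layers themselves were 78x faster, the end-to-end gain
# collapses as their share of total compute drops.
for frac in (0.9, 0.6, 0.3):
    print(f"{frac:.0%} of compute accelerated 78x -> "
          f"{overall_speedup(frac, 78):.1f}x overall")
# 90% -> ~9.0x, 60% -> ~2.5x, 30% -> ~1.4x
```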