Break the Sequential Dependency of LLM Inference Using Lookahead Decoding (alien.top)

submitted 2 years ago by Thistleknot@alien.top to c/localllama@poweruser.forum

https://lmsys.org/blog/2023-11-21-lookahead-decoding/

[–] yahma@alien.top 1 points 2 years ago

Game changer! Would love to see this incorporated into ExLLama, AutoGPTQ and LlamaCPP

[–] lone_striker@alien.top 1 points 2 years ago (1 children)

It's an innovative approach, but the practical real-world use cases where it is beneficial are very narrow:
https://twitter.com/joao_gante/status/1727985956404465959

TL;DR: you need massive spare compute to get a modest gain in speed. In most cases, you get slower inference. They are also comparing speeds against relatively slow native transformers inference. The speedups that Exllamav2, GPTQ, and llama.cpp deliver over base transformers are much more impressive.
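
To make the spare-compute point concrete, here is a rough toy sketch of the general guess-and-verify idea behind this kind of decoding. It is not the LMSYS implementation: `toy_next_token` and `verify_guess` are made-up names, and the "model" is just a deterministic arithmetic rule standing in for greedy decoding. The assumption is that a real system would score all guessed positions in one batched forward pass, so the work per step grows with the guess length whether or not the guesses are accepted.

```python
from typing import Callable, List, Sequence, Tuple


def toy_next_token(context: Sequence[int]) -> int:
    """Hypothetical stand-in for one greedy LLM step (not a real model)."""
    return (context[-1] * 3 + 1) % 7


def verify_guess(
    context: Sequence[int],
    guess: Sequence[int],
    next_token: Callable[[Sequence[int]], int],
) -> Tuple[List[int], int]:
    """Accept the longest prefix of `guess` that matches greedy decoding.

    Returns (accepted_tokens, positions_scored). The step always yields one
    token from the model itself, so at least one token is produced.
    The loop simulates what would be a single batched pass in practice.
    """
    ctx = list(context)
    accepted: List[int] = []
    for g in guess:
        if g != next_token(ctx):
            break  # first mismatch: the rest of the guess is discarded
        accepted.append(g)
        ctx.append(g)
    accepted.append(next_token(ctx))  # the guaranteed token for this step
    positions_scored = len(guess) + 1  # paid for even if the guess is rejected
    return accepted, positions_scored


good = verify_guess([1], [4, 6, 0], toy_next_token)  # two guessed tokens match
bad = verify_guess([1], [2, 6, 5], toy_next_token)   # first guessed token is wrong
print(good, bad)
```

In this toy, the good guess yields three tokens for four scored positions, while the bad guess yields one token for the same four positions. If the extra positions are not effectively free on your hardware, the bad case is strictly slower than plain one-token-at-a-time decoding.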

[–] CorporationFlayer@alien.top 1 points 2 years ago (1 children)

Maybe not for speed, but do you think this approach could be well suited to environments where you have complex tasks that require knowledge across many different multidisciplinary fronts?

I.e., a complex system-building task spawns many fast models with different initializations, each heading in a different direction, and then aggregates the results?

[–] lone_striker@alien.top 1 points 2 years ago

I'm not sure how this would be applicable in the other scenarios you've mentioned; anything is possible. There may be other uses for this novel decoding method. But being X percent faster than transformers in a useful way, as it's being touted, isn't one of them.