LocalLLaMA
Community to discuss about Llama, the family of large language models created by Meta AI.
If we can fit 1 layer at a time, can we do 3 or 4 at a time? It would use a bit more memory but run a bit faster than 1 layer at a time. Or am I dreaming?
That doesn't make much of a difference. You still have to transfer the whole model to the GPU for every single inference step, and that transfer is the bottleneck. The GPU only saves you time if you can load the model (or parts of it) once and then do lots of inference steps against the resident weights.
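A rough back-of-the-envelope sketch of why the chunk size barely matters. All numbers here are illustrative assumptions (model size, PCIe bandwidth, compute time, per-chunk overhead), not measurements: whatever the group size, the full set of weights has to cross the bus every step, so grouping layers only shaves a tiny per-chunk overhead.

```python
# Illustrative estimate: per-token latency when streaming all weights
# to the GPU each step, for different layer-group sizes.
# All constants below are assumed example values, not benchmarks.

model_size_gb = 40.0    # assumed: e.g. a large model at ~4-bit quantization
pcie_bw_gbs   = 25.0    # assumed: effective PCIe 4.0 x16 bandwidth
compute_s     = 0.05    # assumed: GPU compute time for one full forward pass
chunk_ovh_s   = 0.0002  # assumed: fixed overhead per transferred chunk
n_layers      = 80

for layers_per_chunk in (1, 4, 8):
    n_chunks = -(-n_layers // layers_per_chunk)  # ceiling division
    # The whole model crosses the bus every step, regardless of chunking.
    transfer_s = model_size_gb / pcie_bw_gbs
    total_s = transfer_s + n_chunks * chunk_ovh_s + compute_s
    print(f"{layers_per_chunk} layer(s) per chunk: {total_s:.3f} s per step")
```

Under these assumptions the transfer term (~1.6 s) dominates, and going from 1 layer to 8 layers per chunk changes the total by well under 1%, which is why batching a few layers together doesn't meaningfully speed things up.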