this post was submitted on 30 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.


Found out about air_llm, https://github.com/lyogavin/Anima/tree/main/air_llm, which loads one layer at a time, so each layer works out to about 1.6GB for a 70B with 80 layers. There's about 30MB for the KV cache, and I'm not sure where the rest goes.

Works with HF out of the box too, apparently. The weaknesses appear to be context length and speed, but anyway, anyone want to try Goliath 120B unquantized?
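
Rough sketch of what "one layer at a time" means in practice (this is just the concept, not the actual air_llm API; `build_layer` and the per-layer checkpoint files are made up for illustration):

```python
# Conceptual sketch of layer-streaming inference, assuming the 80 decoder layers
# were saved as separate checkpoints (~1.6 GB each) and each fits in VRAM on its own.
import torch

def forward_one_pass(hidden, layer_files, build_layer, device="cuda"):
    """Run one forward pass by streaming layers through the GPU one at a time."""
    hidden = hidden.to(device)                      # activations stay on the GPU
    for path in layer_files:
        layer = build_layer()                       # hypothetical: builds an empty decoder layer
        layer.load_state_dict(torch.load(path))     # pull ~1.6 GB of weights from disk
        layer.to(device)
        with torch.no_grad():
            hidden = layer(hidden)                  # apply this layer
        layer.to("cpu")                             # free VRAM before the next layer
        del layer
        torch.cuda.empty_cache()
    return hidden
```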

top 16 comments
[–] xinranli@alien.top 1 points 9 months ago (1 children)

This seems like a brilliant and almost obvious idea. Is there a reason this method wasn't a thing before, besides the PCIe bandwidth and storage speed requirements?

[–] fallingdowndizzyvr@alien.top 1 points 9 months ago (1 children)

Because it wouldn't be any faster than doing CPU inference. Both CPUs and GPUs already spend most of their time waiting for data; the I/O is the limiter, and this changes none of that.
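
Back-of-envelope numbers (bandwidth figures are assumptions, just to show the scale):

```python
# Single-stream decoding is memory-bandwidth bound: every weight is read once per token.
model_bytes = 70e9 * 2            # 70B params in fp16, ~140 GB

cpu_ram_bw = 60e9                 # ~60 GB/s dual-channel DDR5, assumed
pcie4_x16  = 32e9                 # ~32 GB/s theoretical PCIe 4.0 x16

print(model_bytes / cpu_ram_bw)   # ~2.3 s/token with the CPU reading weights from RAM
print(model_bytes / pcie4_x16)    # ~4.4 s/token streaming the same weights to a GPU
```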

[–] radianart@alien.top 1 points 9 months ago (1 children)

Is there a better way to use models bigger than what fits in RAM/VRAM? I'd like to try 70B or maybe even 120B, but I only have 32GB/8GB.

[–] TheTerrasque@alien.top 1 points 9 months ago

70b? Q4, llama.cpp, some layers on gpu.

You might need to run Linux to get system RAM usage low enough.
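
Something along these lines, as a sketch using the llama-cpp-python bindings (the model filename is a placeholder and `n_gpu_layers` needs tuning to whatever fits in 8GB of VRAM):

```python
# Sketch: run a Q4-quantized 70B with llama-cpp-python, offloading some layers to the GPU.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-70b.Q4_K_M.gguf",  # placeholder path to a local GGUF file
    n_gpu_layers=15,                       # offload as many of the 80 layers as VRAM allows
    n_ctx=2048,                            # keep context modest to save RAM
)

out = llm("Explain the KV cache in one paragraph.", max_tokens=128)
print(out["choices"][0]["text"])
```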

[–] Art10001@alien.top 1 points 9 months ago

If it works it's a miracle.

[–] watkykjynaaier@alien.top 1 points 9 months ago (1 children)

Given my M1 Max's 400GB/s memory bandwidth, what would be the bottleneck for this on Apple Silicon? Disk speed? Is it possible to get this running on Metal?

[–] fallingdowndizzyvr@alien.top 1 points 9 months ago

There's no point. If the model is too big to fit in RAM, disk I/O becomes the limiter, and then it doesn't matter whether you have 400GB/s of memory bandwidth or 40GB/s; the disk is the bottleneck either way.
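
Rough numbers (the SSD figure is an assumption, just to show the gap):

```python
# Why 400GB/s of unified memory bandwidth doesn't help once you fall back to disk.
model_bytes = 70e9 * 2               # ~140 GB of fp16 weights

unified_mem_bw = 400e9               # M1 Max memory bandwidth
ssd_read = 5e9                       # ~5 GB/s sequential read, assumed

print(model_bytes / unified_mem_bw)  # ~0.35 s/token if the weights already sat in RAM
print(model_bytes / ssd_read)        # ~28 s/token if they have to stream from the SSD
```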

[–] sdmat@alien.top 1 points 9 months ago

This technique is actually really useful for batch processing.

I.e., if you run 100 generations and reuse each layer while it's loaded, that goes much faster than running them serially.
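
Rough amortization math (all timings assumed, purely to illustrate the shape of the win):

```python
# Loading a layer dominates; computing it on one more sequence is comparatively cheap.
load_per_layer  = 0.05    # s to pull ~1.6 GB over PCIe 4.0 x16, assumed
compute_per_seq = 0.001   # s of GPU compute per layer per sequence, assumed
layers, batch   = 80, 100

serial  = batch * layers * (load_per_layer + compute_per_seq)  # reload every layer for every sequence
batched = layers * (load_per_layer + batch * compute_per_seq)  # load each layer once, reuse it

print(serial, batched)    # ~408 s vs ~12 s for one pass over all 100 sequences
```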

[–] Spirited_Employee_61@alien.top 1 points 9 months ago (1 children)

If we can fit 1 layer at a time, can we do 3 or 4 at a time? A bit bigger but a bit faster than 1 at a time. Or am I dreaming?

[–] ron_krugman@alien.top 1 points 9 months ago

That doesn't make much of a difference. You still have to transfer the whole model to the GPU for every single inference step. The GPU only saves you time if you can load the model (or parts of it) once and then do lots of inference steps.
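
The arithmetic behind that (layer size from the post, PCIe figure assumed): grouping changes how many transfers you make, not how many bytes you move.

```python
layer_bytes = 1.6e9       # ~1.6 GB per layer, per the post
n_layers    = 80
pcie4_x16   = 32e9        # ~32 GB/s theoretical PCIe 4.0 x16, assumed

for layers_per_load in (1, 4, 8):
    loads = n_layers / layers_per_load             # fewer, bigger transfers...
    total = loads * layers_per_load * layer_bytes  # ...but the same total bytes per token
    print(layers_per_load, total / pcie4_x16)      # ~4 s/token in every case
```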

[–] petitmottin@alien.top 1 points 9 months ago

Would you choose speed over quality? Personally, I prefer quality, so this is a great project. In real life, quality usually takes time…

[–] hackerllama@alien.top 1 points 9 months ago (1 children)

Hey there! I think this is doing offloading?

If so, it's not a new thing. Check out https://huggingface.co/docs/accelerate/usage_guides/big_modeling for a guide with code and videos about it.
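
For comparison, the stock Transformers/Accelerate route already does this kind of offloading (sketch; the model id and offload folder are placeholders):

```python
# device_map="auto" spreads layers across GPU, CPU RAM, and disk as needed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-70b-hf"     # placeholder model id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    offload_folder="offload",              # where layers that don't fit in memory get paged
)

inputs = tok("Hello", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```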

[–] kabachuha@alien.top 1 points 9 months ago

Was going to say the same XD

[–] ThisGonBHard@alien.top 1 points 9 months ago

I think it would be faster to run it on the CPU than to do that.

[–] Tiny_Arugula_5648@alien.top 1 points 9 months ago

One of those cases where proving something can be done doesn't make it useful. This has to be one of the least efficient ways to do inference. Like the people who got Doom running on an HP printer: great, you did it, but it's the worst possible version.

[–] Atharv_Jaju@alien.top 1 points 9 months ago

I feel the 70B model itself won't fit on my storage (SSD), since I only have 50GB left.
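
For scale (simple arithmetic, sizes approximate):

```python
params = 70e9
print(params * 2 / 1e9)     # ~140 GB in fp16, what unquantized layer streaming needs on disk
print(params * 0.5 / 1e9)   # ~35 GB at 4 bits per weight; real Q4 GGUF files run a bit larger
```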