this post was submitted on 30 Nov 2023

LocalLLaMA

Community to discuss Llama, the family of large language models created by Meta AI.

Found out about air_llm (https://github.com/lyogavin/Anima/tree/main/air_llm), which loads one layer at a time, so each layer comes to about 1.6GB for a 70B model with 80 layers. There's about 30MB for the KV cache, and I'm not sure where the rest goes.

Apparently it works with HF out of the box too. The weaknesses appear to be context length and speed (it's going to be slow), but anyway, anyone want to try Goliath 120B unquantized?
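To make the layer-at-a-time idea concrete, here's a minimal sketch in plain PyTorch; it is not AirLLM's actual code, and the shard filenames and the build_empty_block helper are made up for illustration. At fp16, 70e9 parameters / 80 layers × 2 bytes comes to roughly 1.75GB of weights per decoder block, in the same ballpark as the 1.6GB figure above.

```python
# Hypothetical sketch of layer-at-a-time inference (not AirLLM's real code).
# Assumes the 80 decoder blocks were pre-split into one safetensors shard each.
import torch
from safetensors.torch import load_file

NUM_LAYERS = 80  # Llama 2 70B has 80 decoder blocks

@torch.no_grad()
def forward_layer_by_layer(hidden_states, build_empty_block):
    """hidden_states: activations after the embedding layer, already on the GPU.
    build_empty_block(i): placeholder for however the model class constructs
    an uninitialised decoder block for layer i."""
    for i in range(NUM_LAYERS):
        block = build_empty_block(i).half().to("cuda")
        # Pull only this layer's weights off disk, straight onto the GPU.
        block.load_state_dict(load_file(f"layer_{i:02d}.safetensors", device="cuda"))
        hidden_states = block(hidden_states)[0]
        # Drop the block so the next one can reuse the ~1.6GB of VRAM.
        del block
        torch.cuda.empty_cache()
    return hidden_states
```

The obvious cost is that every generated token re-reads all 80 shards from disk, which is exactly what the replies below pick up on.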

[–] xinranli@alien.top 1 points 2 years ago (3 children)

This seems like a brilliant and almost obvious idea. Is there a reason this method wasn't a thing before, besides the PCIe bandwidth and storage speed requirements?

[–] fallingdowndizzyvr@alien.top 1 points 2 years ago (1 children)

Because it wouldn't be any faster than doing CPU inference. Both CPUs and GPUs already spend most of their time waiting for data to process; it's that I/O that's the limiter, and this changes none of it.
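A rough back-of-envelope calculation shows why (the GB/s figures are assumptions, not measurements): every generated token has to pull all ~140GB of fp16 weights through whatever link feeds the compute, so bandwidth puts a hard floor on seconds per token.

```python
# Each token needs every weight, so bandwidth bounds tokens/sec.
# The GB/s figures are rough assumptions for illustration.
weights_gb = 70e9 * 2 / 1e9  # 70B params at fp16 ~= 140 GB moved per token

for source, gb_per_s in [("PCIe 4.0 NVMe SSD", 7),
                         ("dual-channel DDR5 RAM", 70),
                         ("high-end GPU VRAM", 1000)]:
    print(f"{source:>22}: ~{weights_gb / gb_per_s:5.1f} s per token")
# -> roughly 20 s/token streaming from NVMe vs ~2 s/token from system RAM:
#    streaming layers from disk just moves where you wait.
```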

[–] radianart@alien.top 1 points 2 years ago (1 children)

Is there a better way to run models bigger than what fits in RAM/VRAM? I'd like to try a 70B or maybe even a 120B, but I only have 32GB of RAM and 8GB of VRAM.

[–] TheTerrasque@alien.top 1 points 2 years ago

A 70B? Q4 quant, llama.cpp, with some layers offloaded to the GPU.

You might need to run Linux to get system RAM usage low enough.
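For reference, a hedged sketch of that setup via the llama-cpp-python bindings (the plain llama.cpp CLI exposes the same knobs); the model filename and the number of offloaded layers are placeholders, and how many layers actually fit depends on the quant and the 8GB card.

```python
# Sketch: Q4 70B with partial GPU offload through llama-cpp-python.
# The GGUF file is memory-mapped by default, so it can be larger than
# system RAM at the cost of paging.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-70b.Q4_K_M.gguf",  # placeholder filename
    n_gpu_layers=12,   # offload as many layers as fit on the 8GB card
    n_ctx=2048,        # keep the context modest to save memory
)

out = llm("Q: Why is the sky blue? A:", max_tokens=64)
print(out["choices"][0]["text"])
```

This assumes the bindings were installed with GPU support; without it, n_gpu_layers has no effect.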
