this post was submitted on 28 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.


Yes. This has to be the worst RAM you guys have ever seen, but hear me out. Is it possible? Running the full 70B model is far out of the question and I'm not even going to bother, but can I at least run the 13B, or failing that the 7B?

top 7 comments
[–] DarthNebo@alien.top 1 points 2 years ago

Yeah, 7B is no problem on phones, even at 4 tok/s.

[–] phree_radical@alien.top 1 points 2 years ago

RemindMe! 10 months

[–] Aaaaaaaaaeeeee@alien.top 1 points 2 years ago (1 children)

Cramming Mistral at 2.7 bpw I get 2k context. Are you talking about VRAM, though?

[–] TheHumanFixer@alien.top 1 points 2 years ago

Nope, regular RAM.
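For scale, some back-of-the-envelope arithmetic on why ~2.7 bpw plus a 2k context can squeeze into 4 GB of regular RAM. This is a rough sketch: the parameter count is Mistral-7B's 7.24B, and the KV-cache figures assume an fp16 cache with Mistral's grouped-query attention (32 layers, 8 KV heads, head dim 128), so treat the numbers as ballpark.

```python
# Back-of-the-envelope memory estimate for a 7B-class model at ~2.7 bits per weight.
params = 7.24e9          # Mistral-7B parameter count
bpw = 2.7                # bits per weight after quantization
weight_bytes = params * bpw / 8
print(f"weights: {weight_bytes / 2**30:.2f} GiB")            # ~2.3 GiB

# KV cache for a 2048-token context, fp16 (2 bytes), keys + values:
layers, kv_heads, head_dim, ctx = 32, 8, 128, 2048
kv_bytes = 2 * 2 * layers * kv_heads * head_dim * ctx
print(f"KV cache @ {ctx} ctx: {kv_bytes / 2**20:.0f} MiB")   # ~256 MiB

total = weight_bytes + kv_bytes
print(f"total: {total / 2**30:.2f} GiB (plus runtime overhead)")
```

That leaves roughly a gigabyte of headroom out of 4 GB for the OS and runtime, which matches the "it fits, barely" experience.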

[–] DarthInfinix@alien.top 1 points 2 years ago

Hmm, theoretically, if you switch to a super-light Linux distro and grab a Q2-quantized 7B, using llama.cpp (where mmap is on by default), you should be able to run a 7B model. I can run a 7B on a shitty $150 Android with about 3 GB of RAM free using llama.cpp.
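For reference, a minimal sketch of that setup via the llama-cpp-python bindings (the model path and context size are placeholders; use_mmap is on by default, so the weights are paged in from disk rather than copied wholesale into RAM):

```python
# Minimal llama-cpp-python example for a Q2_K-quantized 7B GGUF on a low-RAM box.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-7b.Q2_K.gguf",  # placeholder path to any 7B Q2_K GGUF
    n_ctx=512,        # keep the context small to keep the KV cache small
    n_threads=4,      # match your CPU core count
    use_mmap=True,    # default: memory-map the weights instead of loading them all
)

out = llm("Q: What is the capital of France? A:", max_tokens=32)
print(out["choices"][0]["text"])
```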

[–] Delicious-View-8688@alien.top 1 points 2 years ago

Yes. There is an implementation that loads each layer as required, thereby reducing the VRAM requirements. Just Google it: LLaMA 70B with 4 GB.
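That sounds like AirLLM-style layer streaming: keep only one transformer block's weights in memory at a time and page the rest from disk. Below is a toy Python sketch of the idea, with dummy linear layers standing in for transformer blocks; it illustrates the concept, not the actual library's API.

```python
# Toy illustration of layer-streaming inference: only one layer's weights are
# resident at a time; everything else stays on disk.
import os, tempfile
import torch
import torch.nn as nn

workdir = tempfile.mkdtemp()
n_layers, hidden = 4, 256

# "Shard" the model: save each layer's weights as its own file.
for i in range(n_layers):
    torch.save(nn.Linear(hidden, hidden).state_dict(),
               os.path.join(workdir, f"layer_{i}.pt"))

def forward_streaming(x):
    # Load each layer just before it is needed, run it, then drop it.
    for i in range(n_layers):
        layer = nn.Linear(hidden, hidden)
        layer.load_state_dict(torch.load(os.path.join(workdir, f"layer_{i}.pt")))
        with torch.no_grad():
            x = torch.relu(layer(x))
        del layer  # peak memory stays ~one layer, at the cost of disk reads per pass
    return x

print(forward_streaming(torch.randn(1, hidden)).shape)
```

The trade-off is the same one people report in practice: memory drops to roughly one layer's worth, but every forward pass re-reads the weights from storage, so generation is slow.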

[–] m18coppola@alien.top 1 points 2 years ago

I have run 7B models with Q2_K on my Raspberry Pi with 4 GB, lol. It's kinda slow (still faster than I bargained for), but Q2_K models tend to be pretty stupid at the 7B size, no matter the speed. You can theoretically run a bigger model using swap space (kind of like using your storage drive as RAM), but then the token generation speed comes crawling to a halt.
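If you do lean on swap, it's worth sanity-checking RAM plus swap against the model file size before loading. A small sketch using psutil (the GGUF filename is a placeholder):

```python
# Quick check that physical RAM + swap can hold a model file before trying to load it.
# Requires `pip install psutil`; the model path is a placeholder.
import os
import psutil

model_path = "llama-2-13b.Q2_K.gguf"   # placeholder

model_bytes = os.path.getsize(model_path)
ram = psutil.virtual_memory()
swap = psutil.swap_memory()
budget = ram.available + swap.free

print(f"model: {model_bytes / 2**30:.1f} GiB, "
      f"free RAM: {ram.available / 2**30:.1f} GiB, "
      f"free swap: {swap.free / 2**30:.1f} GiB")

if model_bytes > ram.available:
    print("Won't fit in RAM alone; expect heavy swapping and very slow generation.")
if model_bytes > budget:
    print("Won't fit even with swap; pick a smaller quant or model.")
```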