this post was submitted on 25 Nov 2023

LocalLLaMA


Title essentially. I'm currently running an RTX 3060 with 12GB of VRAM, 32GB of RAM and an i5-9600K. I've been running 7B and 13B models effortlessly via KoboldCPP (I tend to offload all 35 layers to the GPU for 7Bs, and 40 for 13Bs) + SillyTavern for role-playing purposes, but slowdown becomes noticeable at higher context with 13Bs (not too bad, so I deal with it). Is this setup capable of running bigger models like 20B or potentially even 34B?

top 14 comments
[–] vikarti_anatra@alien.top 1 points 11 months ago

!remindme 7 days

[–] No_Pilot_1974@alien.top 1 points 11 months ago (2 children)
[–] 875ysh@alien.top 1 points 11 months ago (1 children)

I’m personally too afraid to brick something

[–] localhost80@alien.top 1 points 11 months ago

Is this sarcasm?

Bricking from an ML model is not possible.

[–] localhost80@alien.top 1 points 11 months ago

Not sure why this is getting downvoted. This is the correct answer.

All models can be run on any reasonable computer. It's a matter of whether or not the speed is acceptable.

[–] flurbz@alien.top 1 points 11 months ago (2 children)

My setup has the same amount of VRAM and RAM as yours and I'm running 20B models at tolerable speed, meaning it generates tokens at almost reading speed. This is using the ROCm version of koboldcpp under Linux with a Q4_K_M model (I have a 5600X and a 6700XT).

Using the settings below, VRAM is maxed out and RAM sits at about 24GB used.

./koboldcpp.py --model ~/AI/LLMS/models/mlewd-remm-l2-chat-20b.Q4_K_M.gguf --threads 5 --contextsize 4096 --usecublas --gpulayers 47 --nommap --usemlock --port 8334

I have no idea how this would perform on Windows or with an Nvidia card, but good luck.
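
For anyone wanting a starting value for --gpulayers before trial and error, one rough approach is to split the GGUF file size evenly across the layers and count how many fit in the VRAM left after reserving some headroom. The Python below is only a back-of-the-envelope sketch: the function name, the even-split assumption, and the default reserve are all hypothetical, and none of this is something KoboldCpp computes itself.

```python
def estimate_gpu_layers(model_file_gb, n_layers, vram_gb, reserve_gb=2.0):
    """Guess how many layers fit in VRAM, assuming all layers are equal in size.

    model_file_gb: size of the quantized GGUF file on disk (assumption: the
                   weights are spread evenly across layers, which is only
                   approximately true).
    n_layers:      number of offloadable layers the loader reports.
    vram_gb:       total VRAM on the card.
    reserve_gb:    headroom for the KV cache, scratch buffers, and the desktop
                   (hypothetical default; raise it for larger context sizes).
    """
    if n_layers <= 0 or model_file_gb <= 0:
        raise ValueError("model_file_gb and n_layers must be positive")
    per_layer_gb = model_file_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    # Never report more layers than the model actually has.
    return min(n_layers, int(usable_gb / per_layer_gb))
```

Treat the result as a first guess only: start there, watch VRAM usage while the model loads, and nudge the layer count up or down.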

[–] FullOf_Bad_Ideas@alien.top 1 points 11 months ago (1 children)

Isn't cuBLAS specific to Nvidia cards, and CLBlast compatible with both Nvidia and AMD? I'm not sure how cuBLAS could work with AMD cards. Through ROCm?

[–] flurbz@alien.top 1 points 11 months ago

You're right, this shouldn't work. But for some strange reason, using --usecublas loads the hipblas library:

Welcome to KoboldCpp - Version 1.49.yr1-ROCm
Attempting to use hipBLAS library for faster prompt ingestion. A compatible AMD GPU will be required.
Initializing dynamic library: koboldcpp_hipblas.so

I have no idea why this works, but it does, and since the 6700XT took quite a bit of effort to get going, I'm keeping it this way.

[–] wakuboys@alien.top 1 points 11 months ago

I can run similar models on my phone at reading speeds (i am illiterate)

[–] sebo3d@alien.top 1 points 11 months ago

Don't quite know about 34B and beyond as I've never tested it myself, but you can more or less easily run a 20B model with these specs. I also have a 3060 with 32 gigs of RAM and I get around 3 tokens/second while generating using u-amethyst20B (I believe this is the best, or at least the most popular, 20B model at the moment) Q4_K_M after offloading 50 layers to the GPU.

[–] sampdoria_supporter@alien.top 1 points 11 months ago (1 children)

Honestly, Ollama + LiteLLM is fantastic for people in your position (assuming you're running Linux). Way easier to focus on your application and not have to deal with the complications you're describing. It just works.

[–] henk717@alien.top 1 points 11 months ago

Koboldcpp, which he is already using, is a better fit due to the superior context shifting.

[–] blkknighter@alien.top 1 points 11 months ago
[–] henk717@alien.top 1 points 11 months ago

With Q4_K_S MMQ it should be possible to do a full offload on 13B. I'm not sure if you can fully fit 4K context since that is a tight call, but it's definitely worth a try.
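
To see why 4K is a tight call on a 12GB card, here is a rough Python estimate of the KV cache on top of the weights. It is a sketch under stated assumptions, not a measurement: it assumes a 16-bit cache, uses Llama 2 13B's 40 layers and hidden size of 5120, and takes about 7.4 GB as an approximate size for a 13B Q4_K_S file. Scratch buffers and whatever the desktop is using are not counted.

```python
def kv_cache_bytes(n_layers, n_ctx, hidden_size, bytes_per_value=2):
    """Size of the key/value cache for a model without grouped-query attention.

    Each layer stores one key vector and one value vector of hidden_size per
    token. bytes_per_value=2 assumes a 16-bit cache.
    """
    return 2 * n_layers * n_ctx * hidden_size * bytes_per_value


GIB = 1024 ** 3

# Llama 2 13B: 40 layers, hidden size 5120, at a 4096-token context.
kv_gib = kv_cache_bytes(n_layers=40, n_ctx=4096, hidden_size=5120) / GIB

# Assumption: approximate size of a 13B Q4_K_S GGUF file; check your own file.
weights_gib = 7.4e9 / GIB

print(f"KV cache: {kv_gib:.2f} GiB")
print(f"Weights + KV cache: {weights_gib + kv_gib:.2f} GiB, before scratch buffers")
```

Under those assumptions the cache alone is a little over 3 GiB and the total lands around 10 GiB, which leaves only a couple of GiB on a 12GB card for everything else.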