this post was submitted on 25 Nov 2023
LocalLLaMA
My setup has the same amount of VRAM and RAM as yours, and I'm running 20B models at a tolerable speed, meaning it generates tokens at almost reading speed. This is using the ROCm version of koboldcpp under Linux with a Q4_K_M model (I have a 5600X and a 6700 XT).
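For a rough sense of what "almost reading speed" means in tokens per second, here is a small back-of-envelope sketch. The figures used (about 250 words per minute for an average reader, and roughly 0.75 English words per LLM token) are illustrative assumptions, not measurements from this setup:

```python
# Back-of-envelope: what generation rate counts as "reading speed"?
# Assumed figures (not from the comment above):
#   - average adult reading speed: ~250 words per minute
#   - ~0.75 English words per LLM token (i.e. ~1.33 tokens per word)
def reading_speed_tokens_per_sec(wpm: float = 250.0,
                                 words_per_token: float = 0.75) -> float:
    words_per_sec = wpm / 60.0
    return words_per_sec / words_per_token

# Under these assumptions, "reading speed" is on the order of
# 5-6 tokens per second.
rate = reading_speed_tokens_per_sec()
```

So a setup sustaining around 5 tokens/s would keep pace with a typical reader under these assumptions.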
Using the settings below, VRAM is maxed out and RAM sits at about 24GB used.
./koboldcpp.py --model ~/AI/LLMS/models/mlewd-remm-l2-chat-20b.Q4_K_M.gguf --threads 5 --contextsize 4096 --usecublas --gpulayers 47 --nommap --usemlock --port 8334
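To see why this split roughly maxes out a 12 GB card while spilling the rest into system RAM, here is a hedged sketch of the arithmetic. The bits-per-weight figure (~4.85 for Q4_K_M) and the total layer count used below are illustrative assumptions, not values taken from this model or command:

```python
# Hedged sketch: approximate on-disk/in-memory size of a 20B model at
# Q4_K_M, and how --gpulayers divides it between VRAM and system RAM.
# Assumptions (for illustration only):
#   - Q4_K_M averages ~4.85 bits per weight
#   - the model has a hypothetical 65 transformer layers in total
def model_size_gib(n_params: float, bits_per_weight: float = 4.85) -> float:
    return n_params * bits_per_weight / 8 / 2**30

def vram_ram_split(size_gib: float, gpu_layers: int,
                   total_layers: int) -> tuple[float, float]:
    # Crude proportional split: ignores KV cache and per-layer size
    # differences, so treat the result as an order-of-magnitude guide.
    on_gpu = size_gib * gpu_layers / total_layers
    return on_gpu, size_gib - on_gpu

size = model_size_gib(20e9)              # roughly 11-12 GiB of weights
gpu_gib, ram_gib = vram_ram_split(size, gpu_layers=47, total_layers=65)
```

Under these assumptions the weights alone come to roughly 11–12 GiB, with most of them offloaded to the GPU and the remainder (plus context/KV cache) landing in system RAM, which is consistent with VRAM being maxed out on a 12 GB card.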
I have no idea how this would perform on Windows or with an Nvidia card, but good luck.
I can run similar models on my phone at reading speeds (I am illiterate)