Kevinswelt

joined 10 months ago
[–] Kevinswelt@alien.top 1 points 9 months ago

You can find an n-gpu-layers slider when you select llama.cpp. You can just set it to the maximum if you want everything on the GPU. Otherwise, the model you load will report how many layers it has in the terminal during loading.
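The same knob exists outside the webui too. A minimal sketch using the llama-cpp-python bindings (assumed installed; the model path is hypothetical, point it at your own GGUF file):

```python
# Minimal sketch with llama-cpp-python; the GGUF path below is hypothetical.
from llama_cpp import Llama

llm = Llama(
    model_path="./mistral-7b-instruct.Q8_0.gguf",  # hypothetical path
    n_gpu_layers=-1,  # -1 offloads every layer to the GPU
    n_ctx=4096,       # context window
)

out = llm("Q: What does n_gpu_layers control? A:", max_tokens=64)
print(out["choices"][0]["text"])
```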

[–] Kevinswelt@alien.top 1 points 10 months ago

I am currently running Windows with my AMD card, but only because I prefer Windows. Pretty much nothing works, except Stable Diffusion at very slow speeds via DirectML and koboldcpp-rocm for inference. I was able to get normal Stable Diffusion working on Ubuntu after ~2h of trying. Sadly, it randomly stopped working the next week. I never managed to get Ooba working, but I gave up rather quickly after I found koboldcpp-rocm.

[–] Kevinswelt@alien.top 1 points 10 months ago (2 children)
  • Model loaders: If you want to load a GPTQ model, you can use ExLlama v1 or v2. AutoGPTQ is outdated. I personally only use GGUF models, loaded via llama.cpp.

  • Start-up parameters: I only use --auto-launch.

  • Context length: The normal length for Llama 1 based models is 2048; for Llama 2 based models (every model except the new 7Bs) it's 4096; and for Mistral (the new 7Bs) it's 8192. You can use the alpha_value and rope_freq_base settings to make more context usable, at the cost of more VRAM. If you want to double your context (4k to 8k), set alpha_value to 2.5 (or, equivalently, rope_freq_base to 25000; see the sketch after this list). Do not use compress_pos_emb.

  • Models: On 24GB you can fit any 7B and 13B model. 20B models exist, but they aren't that great. Recently a few good 34B models have been released, but you won't be able to run them with a large context window.
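The alpha and rope-base numbers above aren't independent: with NTK-aware RoPE scaling the base is derived from alpha. A rough sketch of that conversion (assuming the common NTK-aware formula with head dimension 128, as in Llama-family models; not taken from any particular codebase):

```python
# NTK-aware RoPE scaling: derive rope_freq_base from an alpha value.
# Assumes the usual formula base' = base * alpha^(d / (d - 2)) with
# head dimension d = 128 (Llama-family models); figures are approximate.

def rope_freq_base(alpha: float, base: float = 10000.0, head_dim: int = 128) -> float:
    return base * alpha ** (head_dim / (head_dim - 2))

print(rope_freq_base(2.5))  # ~25365, i.e. roughly the 25000 quoted above
```

So doubling context from 4k to 8k with alpha 2.5 lands on roughly the 25000 base mentioned in the list.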

[–] Kevinswelt@alien.top 1 points 10 months ago

That depends on your use case. If you only need ChatGPT for school/office work, then even GPT-3.5 is way better than Llama 70B. If you run into limitations and get warnings all the time, then any uncensored model will greatly improve your experience. You can easily run GPTQ and GGUF Q8 versions of 7B models like Mistral; 13B models are also possible as GPTQ or GGUF Q3_K_M.
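For a rough sense of why those quants fit, file size scales with bits per weight. A back-of-the-envelope sketch (the bits-per-weight figures are approximate llama.cpp values and the parameter counts are nominal; real files carry some overhead, and inference needs extra room for context):

```python
# Back-of-the-envelope GGUF size estimate: params * bits-per-weight / 8.
# Bits-per-weight figures are approximate; actual files are a bit larger.

def gguf_size_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * bits_per_weight / 8

print(f"Mistral 7B Q8_0: ~{gguf_size_gb(7.2, 8.5):.1f} GB")   # ~7.7 GB
print(f"13B Q3_K_M:      ~{gguf_size_gb(13.0, 3.91):.1f} GB") # ~6.4 GB
```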

[–] Kevinswelt@alien.top 1 points 10 months ago

At 20 words per minute... Oh, the joys of CPU inference.