Thanks for the informative answer. I will take a look at GGUF models, although I'm not sure yet how to split them between CPU and GPU (I'll look into the llama.cpp parameters).
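
(If I understand correctly, the splitting is controlled in llama.cpp with the --n-gpu-layers flag; something like the following is what I'm planning to try, where the model path and layer count are placeholders I'd tune per model:)

    # Offload 35 transformer layers to the GPU, keep the rest on the CPU;
    # -c sets the context size, -p the prompt.
    ./main -m models/some-model.Q4_K_M.gguf --n-gpu-layers 35 -c 4096 -p "Hello"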


Well, I'm not a total n00b, as I've been playing with LLMs for almost a year and a half now, though with local LLMs only since the summer. And although I have considerable experience with local image generators and thought I could carry some of that knowledge over to setting up LLMs, it doesn't seem to be that easy ;)

Any input that sheds some light on the problems I'm having will be greatly appreciated :)

Hardware:

Ryzen 9 3900X, 48GB RAM, RTX 4090

Oobabooga startup params:

--load-in-8bit --auto-devices --gpu-memory 23 --cpu-memory 42 --auto-launch --listen
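
For reference, here's what I believe each flag does (my reading of the text-generation-webui docs, so correct me if I'm off):

    # --load-in-8bit    load weights in 8-bit via bitsandbytes (Transformers loader only)
    # --auto-devices    split the model across GPU and CPU automatically
    # --gpu-memory 23   cap GPU allocation at 23 GiB (leaves headroom on the 24 GiB 4090)
    # --cpu-memory 42   cap CPU RAM used for offloaded weights at 42 GiB
    # --auto-launch     open the browser on startup
    # --listen          make the UI reachable from the local network
    python server.py --load-in-8bit --auto-devices --gpu-memory 23 --cpu-memory 42 --auto-launch --listen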

I'm still having trouble getting around some issues, likely caused by improper loader settings.

I'm looking for some tips on how to set them optimally. I use the oobabooga UI as it's the most comfortable for me and lets me test models before deploying them elsewhere (e.g. to company UIs; I'm working on a chatbot connected to a vector DB for local document storage, and I thought of using ooba as a backend for quickly loading models, setting parameters, and exposing them via its API; a sketch of the call I have in mind follows below). However, its documentation is vague, and I get the feeling the parameter names and so on aren't standardized either.

Which loader is optimal, ExLlamav2_HF or AutoGPTQ? The latter pretty much always gives me issues :( And with ExLlamav2, when I try to set a longer context length and adjust alpha_value or compress_pos_emb, it starts having trouble, especially with repeating numbers, e.g. it will say 190 instead of 1990 or 3137 instead of 31337 (but sometimes also with words, shortening them in a strange way). Is that expected behaviour?
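
For the API part mentioned above, this is the kind of call I have in mind (assuming I start the webui with --api, which from what I've read exposes an OpenAI-compatible endpoint on port 5000; the prompt here is just a placeholder):

    # Whatever model is currently loaded in the UI answers the request.
    curl http://localhost:5000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 512}'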

I would like to use a longer context length (4k or even 8k hardly cuts it), and I would also like the LLM to generate longer replies. That's not always necessary, but sometimes it's desired (e.g. for code generation); instructing the model to "continue" usually helps, but longer answers would be nice.
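
For what it's worth, this is the kind of startup I've been experimenting with for 8k context (flag names taken from the webui's --help; the alpha value is just the rule of thumb I've seen quoted for roughly doubling a 4k model's context, so it may well be part of my problem):

    # Model directory name is a placeholder.
    python server.py --loader exllamav2_hf --model MyModel-GPTQ --max_seq_len 8192 --alpha_value 2.5 --api --listen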

BTW, is "max_position_embeddings" in the model's config the same as "max_seq_len" in the ExLlamav2 loader settings?
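
(To check the former, I've just been reading it straight out of the model's config; the path is a placeholder, and Llama-2-based models usually report 4096 here:)

    grep max_position_embeddings models/MyModel/config.json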

Or maybe you can just point me to some more advanced tutorial discussing these things? All the material I find doesn't delve into them; it's just basic tutorials on how to run oobabooga or another UI, and they always use default configs.