DarthNebo
Mine is mostly summarisation & extraction work, so Mistral-instruct is way better than llama13b for me.
HuggingFace has Inference Endpoints, which can be private or public as needed, with sleep built in.
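Roughly what calling one of those endpoints looks like (just a sketch; the URL and token env var are placeholders, and the JSON shape assumes a standard text-generation endpoint):

    import os
    import requests

    # Placeholder URL - copy the real one from the endpoint's overview page
    ENDPOINT_URL = "https://your-endpoint.us-east-1.aws.endpoints.huggingface.cloud"
    HF_TOKEN = os.environ["HF_TOKEN"]

    resp = requests.post(
        ENDPOINT_URL,
        headers={"Authorization": f"Bearer {HF_TOKEN}"},
        json={"inputs": "Summarise: ...", "parameters": {"max_new_tokens": 256}},
        timeout=120,
    )
    resp.raise_for_status()
    print(resp.json())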
Yeah, 7B is no problem on phones, even at 4 tok/s.
There's hardly any case for using the 70B chat model; most LLM tasks work just fine with Mistral-7B-instruct at 30 tok/s.
It should be on the model page on HuggingFace; they also have an explicit template module which you can import automatically when interacting using the model-id.
The Llama ones are forgiving if you don't use the structure, but mistral-instruct is very bad if the structure is not maintained.
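For reference, this is roughly how you pull that template in with transformers (assuming a recent version with apply_chat_template; the example prompt is made up):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

    messages = [{"role": "user", "content": "Summarise this email: ..."}]
    # Produces the [INST] ... [/INST] structure Mistral-instruct expects
    prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    print(prompt)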
The fastest way would be to use the ggerganov llama.cpp server (server.cpp) & make HTTP calls to it. It's way easier to package into other apps & supports parallel decoding, at 30 tok/s on Apple Silicon (M1 Pro).
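Something like this, assuming the server is already running locally (model filename, flags and prompt are illustrative):

    import requests

    # e.g. ./server -m mistral-7b-instruct.Q4_K_M.gguf -c 4096 --parallel 4
    resp = requests.post(
        "http://127.0.0.1:8080/completion",
        json={
            "prompt": "[INST] Extract the dates from this text: ... [/INST]",
            "n_predict": 128,
            "temperature": 0.2,
        },
    )
    print(resp.json()["content"])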
Try to use the instruct models like Mistral. Ensure your template is the correct one as well.
Have you tried a combination of mistral-instruct & LangChain? If not, can you share some sample inputs you're having problems with?
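Something along these lines is what I mean (a sketch, assuming llama-cpp-python is installed; the GGUF path and prompt are stand-ins):

    from langchain_community.llms import LlamaCpp
    from langchain_core.prompts import PromptTemplate

    llm = LlamaCpp(model_path="./mistral-7b-instruct.Q4_K_M.gguf", n_ctx=4096, temperature=0.2)

    prompt = PromptTemplate.from_template(
        "[INST] Extract the sender, date and action items from this email:\n{email} [/INST]"
    )
    chain = prompt | llm
    print(chain.invoke({"email": "Hi team, ..."}))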
7B models will work just fine with FP4 or INT4. Similarly for 13B, with a little offloading if needed.
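With transformers + bitsandbytes that's basically this (sketch; needs a CUDA GPU and the bitsandbytes package, and the model id is just an example):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # 4-bit quantised weights; device_map="auto" offloads layers to CPU/RAM
    # if the GPU runs out of room (the "little offloading" case for 13B)
    bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
    model = AutoModelForCausalLM.from_pretrained(
        "mistralai/Mistral-7B-Instruct-v0.1",
        quantization_config=bnb,
        device_map="auto",
    )
    tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")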
I reverted to mistral-instruct
Try HuggingFace Endpoints with any of the cheap T4-based serverless instances; these go to sleep as well after 15 mins.
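You can spin one up from huggingface_hub too, roughly like this (sketch: the instance_type/instance_size strings and region are assumptions, check what the Endpoints UI actually offers):

    from huggingface_hub import create_inference_endpoint

    endpoint = create_inference_endpoint(
        "mistral-7b-instruct",
        repository="mistralai/Mistral-7B-Instruct-v0.1",
        framework="pytorch",
        task="text-generation",
        accelerator="gpu",
        vendor="aws",
        region="us-east-1",
        instance_type="nvidia-t4",   # assumed name for the cheap T4 tier
        instance_size="x1",
        min_replica=0,               # lets it scale to zero when idle
        max_replica=1,
    )
    endpoint.wait()
    print(endpoint.url)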