DarthNebo

joined 1 year ago
[–] DarthNebo@alien.top 1 points 11 months ago

Try HuggingFace Inference Endpoints with any of the cheap T4-based serverless instances; these go to sleep after 15 minutes as well.
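
For illustration, a minimal sketch of querying such an endpoint with `huggingface_hub` (the endpoint URL, retry wait, and prompt are placeholders, not from the original comment):

```python
# Sketch: query a HuggingFace Inference Endpoint (URL is a placeholder).
# An endpoint that has scaled to zero typically errors until it wakes up,
# so a single retry after a short wait is a reasonable pattern.
import time
from huggingface_hub import InferenceClient

client = InferenceClient(model="https://<your-endpoint>.endpoints.huggingface.cloud")

def generate(prompt: str) -> str:
    for attempt in range(2):
        try:
            return client.text_generation(prompt, max_new_tokens=256)
        except Exception:
            if attempt == 0:
                time.sleep(30)  # give the sleeping instance time to spin back up
            else:
                raise

print(generate("Summarise: serverless endpoints can scale to zero when idle."))
```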

[–] DarthNebo@alien.top 1 points 11 months ago

Mine is mostly summarisation & extraction work, so Mistral-Instruct is way better than Llama-13B.

[–] DarthNebo@alien.top 1 points 11 months ago

HuggingFace has Inference Endpoints, which can be private or public as needed, with sleep (scale to zero) built in.

[–] DarthNebo@alien.top 1 points 11 months ago

Yeah, 7B is no problem on phones, even at 4 tok/s.

[–] DarthNebo@alien.top 1 points 11 months ago (4 children)

There's hardly any case for using the 70B chat model; most LLM tasks work just fine with Mistral-7B-Instruct at 30 tok/s.

[–] DarthNebo@alien.top 1 points 11 months ago

It should be on the model page on HuggingFace; they also provide an explicit chat template that you can load automatically from the model ID.

The Llama models are forgiving if you don't follow the structure, but Mistral-Instruct behaves very badly if the structure is not maintained.
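
A minimal sketch of pulling that template via `transformers` (the model ID and messages here are illustrative):

```python
# Sketch: build a Mistral-Instruct prompt from the chat template shipped with the
# tokenizer, instead of hand-writing the [INST] ... [/INST] structure.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

messages = [
    {"role": "user", "content": "Summarise in one line: HTTP 503 means the service is unavailable."},
]

# Returns the correctly formatted prompt string the model was fine-tuned on.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```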

[–] DarthNebo@alien.top 1 points 11 months ago

The fastest way would be to embed the ggerganov server.cpp module (llama.cpp's built-in server) & make HTTP calls to it. It's way easier to package into other apps & supports parallel decoding, at 30 tok/s on Apple Silicon (M1 Pro).
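
As a rough sketch, calling that server over HTTP (assumes it is already running locally on the default port; the model file, prompt, and sampling settings are placeholders):

```python
# Sketch: hit the llama.cpp server's /completion endpoint over HTTP.
# Assumes something like `./server -m mistral-7b-instruct.Q4_K_M.gguf` is running.
import requests

resp = requests.post(
    "http://127.0.0.1:8080/completion",
    json={
        "prompt": "[INST] Extract the dates from: The meeting moved from 3 May to 12 May. [/INST]",
        "n_predict": 128,
        "temperature": 0.2,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["content"])  # generated text
```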

[–] DarthNebo@alien.top 1 points 11 months ago (2 children)

Try to use the instruct models like Mistral-Instruct. Ensure your template is the correct one as well.

[–] DarthNebo@alien.top 1 points 11 months ago

Have you tried a combination of Mistral-Instruct & LangChain? If not, can you share some sample inputs you're having problems with?
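
For reference, a minimal sketch of what that combination might look like (LangChain wrapping a local GGUF model via llama-cpp-python; the model path, prompt, and sample email are placeholders, and the LangChain API has shifted between versions):

```python
# Sketch: Mistral-Instruct through LangChain's LlamaCpp wrapper (llama-cpp-python).
# Model path is a placeholder; the extraction prompt follows the [INST] template.
from langchain.llms import LlamaCpp
from langchain.prompts import PromptTemplate

llm = LlamaCpp(model_path="./mistral-7b-instruct.Q4_K_M.gguf", n_ctx=4096, temperature=0.1)

prompt = PromptTemplate.from_template(
    "[INST] Extract the sender's name and the requested action from this email:\n{email} [/INST]"
)

chain = prompt | llm  # LCEL pipe: format the prompt, then run the model
print(chain.invoke({"email": "Hi, this is Priya. Could you resend last month's invoice?"}))
```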

[–] DarthNebo@alien.top 1 points 11 months ago

7B models will work just fine in FP4 or INT4. Similarly, a 13B will work as well, with a little offloading if needed.
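
A minimal sketch of loading a 7B model in 4-bit with bitsandbytes via `transformers` (the model ID and quantisation settings are illustrative assumptions):

```python
# Sketch: load a 7B model with FP4 weights via bitsandbytes; device_map="auto"
# lets accelerate offload layers to CPU if a larger model doesn't fully fit on the GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",              # "nf4" is the other common choice
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "mistralai/Mistral-7B-Instruct-v0.1"  # placeholder choice
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
```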

[–] DarthNebo@alien.top 1 points 11 months ago

I reverted to Mistral-Instruct.
