DarthNebo
Mine is mostly summarisation & extraction work, so Mistral-instruct is way better than llama13b for me.
HuggingFace has Inference Endpoints, which can be private or public as needed, with sleep built in.
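Roughly what calling one of those endpoints looks like (just a sketch; the URL and token env var are placeholders, and the JSON shape assumes a standard text-generation endpoint):

    import os
    import requests

    # Placeholder URL - copy the real one from the endpoint's overview page
    ENDPOINT_URL = "https://your-endpoint.us-east-1.aws.endpoints.huggingface.cloud"
    HF_TOKEN = os.environ["HF_TOKEN"]

    resp = requests.post(
        ENDPOINT_URL,
        headers={"Authorization": f"Bearer {HF_TOKEN}"},
        json={"inputs": "Summarise: ...", "parameters": {"max_new_tokens": 256}},
        timeout=120,
    )
    resp.raise_for_status()
    print(resp.json())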
Yeah, 7B is no problem on phones, even at 4 tok/s.
There's hardly any case for using the 70B chat model; most LLM tasks work just fine with Mistral-7B-instruct at 30 tok/s.
It should be on the model page on HuggingFace; they also have an explicit template module which you can import automatically when interacting using the model-id.
The Llama ones are forgiving if you don't use the structure, but mistral-instruct is very bad if the structure is not maintained.
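For reference, this is roughly how you pull that template in with transformers (assuming a recent version with apply_chat_template; the example prompt is made up):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

    messages = [{"role": "user", "content": "Summarise this email: ..."}]
    # Produces the [INST] ... [/INST] structure Mistral-instruct expects
    prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    print(prompt)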
The fastest way would be to use the ggerganov llama.cpp server (server.cpp) & make HTTP calls to it. It's way easier to package into other apps & supports parallel decoding, at 30 tok/s on Apple Silicon (M1 Pro).
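Something like this, assuming the server is already running locally (model filename, flags and prompt are illustrative):

    import requests

    # e.g. ./server -m mistral-7b-instruct.Q4_K_M.gguf -c 4096 --parallel 4
    resp = requests.post(
        "http://127.0.0.1:8080/completion",
        json={
            "prompt": "[INST] Extract the dates from this text: ... [/INST]",
            "n_predict": 128,
            "temperature": 0.2,
        },
    )
    print(resp.json()["content"])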
Try to use the instruct models like Mistral. Ensure your template is the correct one as well.
Have you tried a combination of mistral-instruct & LangChain? If not, can you share some sample inputs you're having problems with?
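Something along these lines is what I mean (a sketch, assuming llama-cpp-python is installed; the GGUF path and prompt are stand-ins):

    from langchain_community.llms import LlamaCpp
    from langchain_core.prompts import PromptTemplate

    llm = LlamaCpp(model_path="./mistral-7b-instruct.Q4_K_M.gguf", n_ctx=4096, temperature=0.2)

    prompt = PromptTemplate.from_template(
        "[INST] Extract the sender, date and action items from this email:\n{email} [/INST]"
    )
    chain = prompt | llm
    print(chain.invoke({"email": "Hi team, ..."}))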
7B models will work just fine with FP4 or INT4. Similarly for 13B, with a little offloading if needed.
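With transformers + bitsandbytes that's basically this (sketch; needs a CUDA GPU and the bitsandbytes package, and the model id is just an example):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # 4-bit quantised weights; device_map="auto" offloads layers to CPU/RAM
    # if the GPU runs out of room (the "little offloading" case for 13B)
    bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
    model = AutoModelForCausalLM.from_pretrained(
        "mistralai/Mistral-7B-Instruct-v0.1",
        quantization_config=bnb,
        device_map="auto",
    )
    tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")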
I reverted to mistral-instruct
Try HuggingFace Endpoints with any of the cheap T4-based serverless instances; these go to sleep as well after 15 mins.
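You can spin one up from huggingface_hub too, roughly like this (sketch: the instance_type/instance_size strings and region are assumptions, check what the Endpoints UI actually offers):

    from huggingface_hub import create_inference_endpoint

    endpoint = create_inference_endpoint(
        "mistral-7b-instruct",
        repository="mistralai/Mistral-7B-Instruct-v0.1",
        framework="pytorch",
        task="text-generation",
        accelerator="gpu",
        vendor="aws",
        region="us-east-1",
        instance_type="nvidia-t4",   # assumed name for the cheap T4 tier
        instance_size="x1",
        min_replica=0,               # lets it scale to zero when idle
        max_replica=1,
    )
    endpoint.wait()
    print(endpoint.url)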