vasileer

joined 1 year ago
[–] vasileer@alien.top 1 points 11 months ago (1 children)

did you test the model before advertising it?

[–] vasileer@alien.top 1 points 11 months ago

"A 34B model beating all 70Bs and achieving the same perfect scores as GPT-4 and Goliath 120B in this series of tests!"

https://www.reddit.com/r/LocalLLaMA/comments/17vcr9d/llm_comparisontest_2x_34b_yi_dolphin_nous/

from a link another commenter posted

[–] vasileer@alien.top 1 points 11 months ago

works for me with the latest llama.cpp on Windows (CPU only, AVX)

command

`main -m ../models/deepseek-coder-6.7b-instruct.Q4_K_M.gguf -p "### Instruction:\nwrite Snake game in Python\n### Response:" -n 2048 -e`

result

https://preview.redd.it/k0poo4o1171c1.png?width=978&format=png&auto=webp&s=3bf1fc497ed66da28742af4d53972c5e15928390
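
if you'd rather call it from Python, here is a minimal sketch using the llama-cpp-python bindings (assuming `pip install llama-cpp-python`; same GGUF file and prompt as the command above):

```python
# minimal sketch: same model and prompt as the CLI command above,
# but through the llama-cpp-python bindings
from llama_cpp import Llama

llm = Llama(
    model_path="../models/deepseek-coder-6.7b-instruct.Q4_K_M.gguf",
    n_ctx=4096,  # room for the prompt plus up to 2048 generated tokens
)

prompt = "### Instruction:\nwrite Snake game in Python\n### Response:"
out = llm(prompt, max_tokens=2048)
print(out["choices"][0]["text"])
```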

[–] vasileer@alien.top 1 points 11 months ago (1 children)

3 ideas

  1. quantization

fastchat-t5 is a 3B model stored in bfloat16, which means it needs at least 3B x 16 bits ~ 6GB of RAM for the model weights alone, plus the 2K-token context limit (for both prompt and answer),

a quick way to speed it up is to use a quantized version:

an 8-bit quant, with almost no quality loss, like https://huggingface.co/limcheekin/fastchat-t5-3b-ct2,

will get you a 2x smaller file and 2x faster inference (see the memory sketch after this list),

but better read #2 :)

  2. a better model/finetune for better quality

a Mistral finetune like https://huggingface.co/TheBloke/neural-chat-7B-v3-1-GGUF, which at 7B quantized to 4 bits will have ~ the same size as the 8-bit fastchat-t5,

but superior performance, as Mistral was most probably trained on more tokens than Llama 2 (~2T tokens), while flan-t5 (the base model of fastchat-t5) saw only ~1T,

why a larger model quantized is better than a smaller one unquantized is explained here: https://github.com/ggerganov/llama.cpp/pull/1684

  3. use HuggingFace for hosting: it is ~$20/month for the same server you mentioned that costs $160, so it is 8x cheaper

https://preview.redd.it/54x2ff87gk0c1.png?width=839&format=png&auto=webp&s=dae1d27376c9c858935c285dd765246af79a86a4
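
the memory math from #1 and #2, as a quick sketch:

```python
# back-of-the-envelope RAM estimate for the weights alone
# (real usage adds the KV cache and runtime overhead on top)
def model_ram_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(model_ram_gb(3, 16))  # fastchat-t5 3B @ bfloat16 -> ~6.0 GB
print(model_ram_gb(3, 8))   # its 8-bit quant           -> ~3.0 GB
print(model_ram_gb(7, 4))   # 7B Mistral @ 4-bit        -> ~3.5 GB
```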

[–] vasileer@alien.top 0 points 11 months ago (4 children)

Is this a scam or what? None of the models above are from NurtureAI:

- zephyr-beta is trained by HuggingFace and already has a 32K context by default

- neural-chat is from Intel

- synthia is from migtissera

Original links:

https://huggingface.co/HuggingFaceH4/zephyr-7b-beta

https://huggingface.co/Intel/neural-chat-7b-v3-1

https://huggingface.co/migtissera/SynthIA-7B-v2.0

[–] vasileer@alien.top 1 points 1 year ago (1 children)

200K context!!

[–] vasileer@alien.top 1 points 1 year ago

on quality: if you go with a smaller model, or even another model, you will lose quality, as Mistral (and its finetunes) is the best among <70B models, and another rule of thumb is that a bigger model quantized (even to 2 bits) is better than a smaller unquantized one,

on speed: the fastest inference comes from Q4_K_S: https://github.com/ggerganov/llama.cpp/pull/1684
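
as a rough illustration of that rule of thumb (the bits-per-weight numbers here are approximate ballpark figures, not exact values from the quant tables):

```python
# ballpark weight sizes per quant type (approximate bits per weight)
QUANTS = {"Q2_K": 2.6, "Q4_K_S": 4.5, "Q8_0": 8.5, "F16": 16.0}

def weights_gb(params_billion: float, quant: str) -> float:
    return params_billion * 1e9 * QUANTS[quant] / 8 / 1e9

# within an ~8 GB budget, a 13B at Q4_K_S (~7.3 GB) fits just like
# a 7B at Q8_0 (~7.4 GB) -- and the bigger quantized model usually wins
print(weights_gb(13, "Q4_K_S"), weights_gb(7, "Q8_0"))
```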

[–] vasileer@alien.top 1 points 1 year ago

I tested the 3B model on Romanian, Russian, French, and German translations of "The sun rises in the East and sets in the West." and it works 100%: each translation gets 10/10 from ChatGPT
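
roughly what that test looks like (a hypothetical sketch: the model file name and grading prompt here are illustrative, not the exact ones I used):

```python
# hypothetical sketch: translate with a local GGUF model via
# llama-cpp-python, then ask ChatGPT (openai client) to grade each one
from llama_cpp import Llama
from openai import OpenAI

sentence = "The sun rises in the East and sets in the West."
llm = Llama(model_path="translator-3b.Q8_0.gguf")  # illustrative file name
client = OpenAI()  # needs OPENAI_API_KEY in the environment

for lang in ["Romanian", "Russian", "French", "German"]:
    out = llm(f"Translate into {lang}: {sentence}\nTranslation:", max_tokens=64)
    translation = out["choices"][0]["text"].strip()
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": f"Rate from 1 to 10 this {lang} translation of "
                              f"'{sentence}': {translation}"}],
    )
    print(lang, reply.choices[0].message.content)
```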

[–] vasileer@alien.top 1 points 1 year ago

I think it really depends on the finetune. For example, Mistral-Instruct is able to summarize or extract information from a 32K context; for writing, you will have to find a model finetuned for that task

[–] vasileer@alien.top 1 points 1 year ago

Mistral and Llama2 work with many languages even if they are marked as English-only.

Here is a quote from a benchmark on the German language; I think you will reach a similar conclusion if you do it for Portuguese.

"Kinda ironic that the English models worked better with the German data and exam than the ones finetuned in German. Looks like language doesn't matter as much as general intelligence and a more intelligent model can cope with different languages more easily. German-specific models need better tuning to compete in general and excel in German."

https://www.reddit.com/r/LocalLLaMA/comments/178nf6i/mistral_llm_comparisontest_instruct_openorca/
