How about a T4 GPU, or something like a 3090 from RunPod? The 3090 costs around $0.50 per hour, which is roughly $350 per month, and it gives you 24 GB of VRAM, which should be plenty for fastchat-t5.
3 ideas
- quantization
fastchat-t5 is a 3B model in bfloat16, which means it needs at least 3B params x 16 bits ~ 6 GB of RAM for the model weights alone, plus memory for the 2K-token context limit (shared between prompt and answer),
a quick way to speed it up is to use a quantized version:
an 8-bit quant, with almost no quality loss, like https://huggingface.co/limcheekin/fastchat-t5-3b-ct2,
will give you a 2x smaller file and 2x faster inference (a minimal loading sketch is at the end of this comment),
but better read #2 :)
- a better model/finetune for better quality
a Mistral finetune like https://huggingface.co/TheBloke/neural-chat-7B-v3-1-GGUF, which is 7B quantized to 4 bits, will be roughly the same size as 8-bit fastchat-t5,
but with superior quality, since Mistral was most probably trained on more tokens than llama2 (~2T tokens), while flan-t5 (the base model of fastchat-t5) saw only ~1T,
an explanation of why a larger quantized model beats a smaller unquantized one is here: https://github.com/ggerganov/llama.cpp/pull/1684 (a llama.cpp loading sketch is at the end of this comment)
- use HuggingFace for hosting, it is ~$20/month for the same kind of server you mentioned that costs $160, so it is 8x cheaper (a minimal Space sketch is at the end of this comment)
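
For #1, here is a minimal sketch of running the 8-bit CTranslate2 conversion with the ctranslate2 Python package; the local model path, the tokenizer source (the original lmsys/fastchat-t5-3b-v1.0 repo) and the prompt are assumptions on my side, so check the model card before copying:

```python
# Hedged sketch: int8 inference with the CTranslate2 conversion of fastchat-t5.
# Assumes https://huggingface.co/limcheekin/fastchat-t5-3b-ct2 was downloaded
# to ./fastchat-t5-3b-ct2; the prompt below is only illustrative.
import ctranslate2
from transformers import AutoTokenizer

model_dir = "./fastchat-t5-3b-ct2"
# tokenizer taken from the original (non-converted) model repo
tokenizer = AutoTokenizer.from_pretrained("lmsys/fastchat-t5-3b-v1.0")

# compute_type="int8" keeps the weights in 8-bit, roughly halving memory use
translator = ctranslate2.Translator(model_dir, device="cuda", compute_type="int8")

prompt = "What is quantization?"
input_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

results = translator.translate_batch([input_tokens], max_decoding_length=256)
output_ids = tokenizer.convert_tokens_to_ids(results[0].hypotheses[0])
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```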
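
For #2, a similarly hedged sketch with llama-cpp-python; the exact GGUF filename and the chat template are assumptions based on the usual TheBloke naming, so verify both on the model card:

```python
# Hedged sketch: running a 4-bit GGUF Mistral finetune with llama-cpp-python.
# Assumes the Q4_K_M file from TheBloke/neural-chat-7B-v3-1-GGUF has been
# downloaded locally; the prompt format is an assumption.
from llama_cpp import Llama

llm = Llama(
    model_path="./neural-chat-7b-v3-1.Q4_K_M.gguf",  # ~4 GB file, comparable to 8-bit fastchat-t5
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to the GPU if one is available
)

out = llm(
    "### User:\nExplain why 8-bit quantization halves the model size.\n### Assistant:\n",
    max_tokens=256,
    stop=["### User:"],
)
print(out["choices"][0]["text"])
```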
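
For #3, hosting on Hugging Face usually means a Space; below is a minimal Gradio app.py sketch that would serve the GGUF model from #2, assuming the file sits in the Space repo next to the script; hardware tier, filename and prompt format are all assumptions:

```python
# Hedged sketch of an app.py for a Hugging Face Space using the Gradio SDK.
# Assumes the GGUF file from #2 is stored alongside this script.
import gradio as gr
from llama_cpp import Llama

llm = Llama(model_path="./neural-chat-7b-v3-1.Q4_K_M.gguf", n_ctx=4096)

def chat(message: str) -> str:
    # Prompt format is an assumption; adjust to the model card's template.
    out = llm(
        f"### User:\n{message}\n### Assistant:\n",
        max_tokens=256,
        stop=["### User:"],
    )
    return out["choices"][0]["text"]

gr.Interface(fn=chat, inputs="text", outputs="text", title="neural-chat demo").launch()
```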
Wow, thanks, that's really an in-depth comment, I will try what you say!