this post was submitted on 24 Nov 2023
LocalLLaMA
you are viewing a single comment's thread
view the rest of the comments
It depends on your inference engine. It will probably be much higher in TGI or vLLM than in what you're presumably using, plain Transformers. You also need to measure input and output token rates separately. Additionally, longer contexts will take more time than shorter ones.
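A minimal sketch of measuring the two rates separately with Transformers (the model name and prompt are placeholders, not from this thread):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; substitute your own model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Explain memory bandwidth in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
n_input = inputs["input_ids"].shape[1]

# Prefill (input) rate: generating a single token forces the whole prompt
# to be processed once.
start = time.perf_counter()
model.generate(**inputs, max_new_tokens=1, do_sample=False)
prefill_s = time.perf_counter() - start

# Decode (output) rate: generate a longer completion and subtract the
# prefill time measured above.
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
total_s = time.perf_counter() - start
n_output = out.shape[1] - n_input

print(f"input tokens/s : {n_input / prefill_s:.1f}")
print(f"output tokens/s: {n_output / (total_s - prefill_s):.1f}")
```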
No, it's mostly bound by memory bandwidth.
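For a rough sense of what "bound by memory bandwidth" implies: at batch size 1, each generated token has to read roughly all of the model weights once, so decode speed is capped at about bandwidth divided by model size. A back-of-envelope sketch with illustrative numbers (assumptions, not figures from this thread):

```python
# Rough upper bound for single-stream decoding, assuming every new token
# requires streaming the full set of weights from memory once.
bandwidth_gb_s = 1000   # e.g. ~1 TB/s of HBM on a high-end GPU (illustrative)
model_size_gb = 14      # e.g. a 7B-parameter model in fp16, ~2 bytes/parameter

est_tokens_per_s = bandwidth_gb_s / model_size_gb
print(f"~{est_tokens_per_s:.0f} tokens/s upper bound")  # ~71 tokens/s here
```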
Your fine-tuned model, assuming it uses the same output format (fp16, GGUF, AWQ, etc.) as the base model, will have the same inference speed as the base model.