I just wrap it in tqdm
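For context, a minimal sketch of what that can look like, assuming a Hugging Face transformers model streamed through `TextIteratorStreamer` (the model name and prompt below are just placeholders):

```python
# Rough sketch: wrap a streaming generate() call in tqdm to eyeball tokens/s.
# tqdm's "it/s" readout is then roughly output tokens per second (the streamer
# yields approximately one text chunk per generated token).
from threading import Thread

from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Explain memory bandwidth in one paragraph.", return_tensors="pt").to(model.device)
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)

# generate() blocks, so run it in a thread and consume the streamer here.
thread = Thread(target=model.generate, kwargs=dict(**inputs, streamer=streamer, max_new_tokens=256))
thread.start()

for _ in tqdm(streamer, unit="tok"):
    pass
thread.join()
```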
-
It depends on your inference engine. It will probably be much higher in TGI or vLLM than in what you're presumably using, plain Transformers. You also need to measure the input and output token rates separately, and longer contexts will take more time than shorter ones.
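One rough way to get the two rates separately with plain Transformers (just a sketch, and it approximates prompt processing time as the time to the first generated token; the model name is a placeholder):

```python
# Sketch: separate the prompt-processing (prefill) rate from the generation (decode) rate.
# Treating a 1-token generate() as "prefill time" is an approximation.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Some reasonably long prompt... " * 50  # longer contexts take longer to prefill
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
n_input = inputs["input_ids"].shape[1]

with torch.no_grad():
    t0 = time.perf_counter()
    model.generate(**inputs, max_new_tokens=1)          # ~ prefill + one decode step
    t1 = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=256)  # prefill + up to 256 decode steps
    t2 = time.perf_counter()

prefill_time = t1 - t0
decode_time = (t2 - t1) - prefill_time                  # subtract the second prefill
n_output = out.shape[1] - n_input

print(f"input:  {n_input / prefill_time:.1f} tok/s (prompt processing)")
print(f"output: {n_output / max(decode_time, 1e-9):.1f} tok/s (generation)")
```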
-
No, it's mostly bound by memory bandwidth.
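That's why the usual back-of-envelope ceiling for single-stream decoding is memory bandwidth divided by the bytes of weights read per token. The numbers below are illustrative, not measurements:

```python
# Rough upper bound for batch-size-1 decode speed: every generated token has to
# stream the full set of weights through memory, so tokens/s <= bandwidth / model size.
bandwidth_gb_s = 1000        # e.g. a GPU with ~1 TB/s memory bandwidth
params_b = 7                 # 7B-parameter model
bytes_per_param = 2          # fp16/bf16 weights

model_bytes_gb = params_b * bytes_per_param          # ~14 GB of weights
max_tok_s = bandwidth_gb_s / model_bytes_gb          # ~70 tok/s theoretical ceiling
print(f"~{max_tok_s:.0f} tok/s upper bound (ignoring KV cache and compute)")
```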
-
Your finetuned model, assuming it's in the same format (fp16, GGUF, AWQ, etc.) as the base model, will have the same inference speed as the base model.
My understanding is that tokens per second typically splits into two parts: the preprocessing (prompt processing) time and the actual token generation time.
At least that's what I remember from oobabooga.