this post was submitted on 24 Nov 2023
LocalLLaMA
you are viewing a single comment's thread
view the rest of the comments
It depends on your inference engine. It will probably be much higher in TGI or vLLM than in what you're presumably using, plain Transformers. You also need to measure input and output token rates separately. Additionally, longer contexts will take more time than shorter ones.
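A minimal sketch of measuring the two rates separately with Transformers (the model name and prompt are placeholders, not from this thread):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; substitute your own model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Explain memory bandwidth in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
n_input = inputs["input_ids"].shape[1]

# Prefill (input) rate: generating a single token forces the whole prompt
# to be processed once.
start = time.perf_counter()
model.generate(**inputs, max_new_tokens=1, do_sample=False)
prefill_s = time.perf_counter() - start

# Decode (output) rate: generate a longer completion and subtract the
# prefill time measured above.
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
total_s = time.perf_counter() - start
n_output = out.shape[1] - n_input

print(f"input tokens/s : {n_input / prefill_s:.1f}")
print(f"output tokens/s: {n_output / (total_s - prefill_s):.1f}")
```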
No, it's mostly bound by memory bandwidth.
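For a rough sense of what "bound by memory bandwidth" implies: at batch size 1, each generated token has to read roughly all of the model weights once, so decode speed is capped at about bandwidth divided by model size. A back-of-envelope sketch with illustrative numbers (assumptions, not figures from this thread):

```python
# Rough upper bound for single-stream decoding, assuming every new token
# requires streaming the full set of weights from memory once.
bandwidth_gb_s = 1000   # e.g. ~1 TB/s of HBM on a high-end GPU (illustrative)
model_size_gb = 14      # e.g. a 7B-parameter model in fp16, ~2 bytes/parameter

est_tokens_per_s = bandwidth_gb_s / model_size_gb
print(f"~{est_tokens_per_s:.0f} tokens/s upper bound")  # ~71 tokens/s here
```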
Your fine-tuned model, assuming it uses the same output format (fp16, GGUF, AWQ, etc.) as the base model, will have the same inference speed as the base model.