I just wrap it in tqdm
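For context, a minimal sketch of what that can look like, assuming a Hugging Face transformers model streamed through `TextIteratorStreamer` (the model name and prompt below are just placeholders):

```python
# Rough sketch: wrap a streaming generate() call in tqdm to eyeball tokens/s.
# tqdm's "it/s" readout is then roughly output tokens per second (the streamer
# yields approximately one text chunk per generated token).
from threading import Thread

from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Explain memory bandwidth in one paragraph.", return_tensors="pt").to(model.device)
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)

# generate() blocks, so run it in a thread and consume the streamer here.
thread = Thread(target=model.generate, kwargs=dict(**inputs, streamer=streamer, max_new_tokens=256))
thread.start()

for _ in tqdm(streamer, unit="tok"):
    pass
thread.join()
```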
-
It depends on your inference engine. It will probably be much higher in TGI or vLLM than in what you're presumably using, plain Transformers. You also need to measure the input and output token rates separately, and longer contexts will take more time than shorter ones.
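One rough way to get the two rates separately with plain Transformers (just a sketch, and it approximates prompt processing time as the time to the first generated token; the model name is a placeholder):

```python
# Sketch: separate the prompt-processing (prefill) rate from the generation (decode) rate.
# Treating a 1-token generate() as "prefill time" is an approximation.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Some reasonably long prompt... " * 50  # longer contexts take longer to prefill
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
n_input = inputs["input_ids"].shape[1]

with torch.no_grad():
    t0 = time.perf_counter()
    model.generate(**inputs, max_new_tokens=1)          # ~ prefill + one decode step
    t1 = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=256)  # prefill + up to 256 decode steps
    t2 = time.perf_counter()

prefill_time = t1 - t0
decode_time = (t2 - t1) - prefill_time                  # subtract the second prefill
n_output = out.shape[1] - n_input

print(f"input:  {n_input / prefill_time:.1f} tok/s (prompt processing)")
print(f"output: {n_output / max(decode_time, 1e-9):.1f} tok/s (generation)")
```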
-
No, it's mostly bound by memory bandwidth.
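That's why the usual back-of-envelope ceiling for single-stream decoding is memory bandwidth divided by the bytes of weights read per token. The numbers below are illustrative, not measurements:

```python
# Rough upper bound for batch-size-1 decode speed: every generated token has to
# stream the full set of weights through memory, so tokens/s <= bandwidth / model size.
bandwidth_gb_s = 1000        # e.g. a GPU with ~1 TB/s memory bandwidth
params_b = 7                 # 7B-parameter model
bytes_per_param = 2          # fp16/bf16 weights

model_bytes_gb = params_b * bytes_per_param          # ~14 GB of weights
max_tok_s = bandwidth_gb_s / model_bytes_gb          # ~70 tok/s theoretical ceiling
print(f"~{max_tok_s:.0f} tok/s upper bound (ignoring KV cache and compute)")
```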
-
Your finetuned model, assuming it's in the same format (fp16, GGUF, AWQ, etc.) as the base model, will have the same inference speed as the base model.
My understanding is that tokens per second typically splits into two parts: the preprocessing (prompt processing) time and the actual token generation time.
At least that's what I remember from oobabooga.