LocalLLaMA

Community to discuss Llama, the family of large language models created by Meta AI.

Hey All,

I have a few questions about how to calculate the tokens per second of an LLM.

  1. The way I calculate tokens per second for my fine-tuned models is to put a timer in my Python code and divide the number of generated tokens by the elapsed time (see the sketch after this list). So if my output is 20 tokens and the model took 5 seconds, that is 4 tokens per second. Is this the correct method, or is there a better one?

  2. If my model runs at 4 tokens per second on 8 GB of VRAM, will it run at 8 tokens per second on 16 GB of VRAM?
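
For reference, a minimal sketch of the timing approach from point 1, assuming a Hugging Face Transformers model (the model name, prompt, and generation settings below are placeholders, not my actual setup):

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # placeholder; substitute the fine-tuned model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Explain what tokens per second means."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.perf_counter()
output_ids = model.generate(**inputs, max_new_tokens=128)
elapsed = time.perf_counter() - start

# Count only the newly generated tokens, not the prompt tokens.
new_tokens = output_ids.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} tokens in {elapsed:.2f}s -> {new_tokens / elapsed:.2f} tokens/s")
```

Subtracting the prompt length matters because generate() returns the prompt tokens plus the new ones, and counting both would inflate the rate.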

top 3 comments
[–] phree_radical@alien.top 1 points 11 months ago

I just wrap it in tqdm

[–] andrewlapp@alien.top 1 points 11 months ago
  1. It depends on your inference engine. It will probably be much higher in TGI or vLLM than in what you're presumably using, Transformers. You also need to measure the input and output token rates separately. Additionally, longer contexts take more time to process than shorter ones.

  2. No. Generation speed is mostly bound by memory bandwidth, not VRAM capacity (see the rough estimate after this list).

  3. Your fine-tuned model will have the same inference speed as the base model, assuming it is in the same format (fp16, GGUF, AWQ, etc.) as the base model.
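
To put a number on point 2: during single-stream generation, each new token requires reading roughly all of the model weights from VRAM once, so the ceiling is set by memory bandwidth rather than by how much VRAM you have. A back-of-the-envelope sketch (the model size and bandwidth figures are illustrative assumptions):

```python
# Rough upper bound on single-stream generation speed:
# each generated token reads (roughly) every weight once, so
#   tokens/s <= memory_bandwidth / model_size_in_bytes
params = 7e9                    # 7B-parameter model (illustrative)
bytes_per_param = 2             # fp16
model_size_gb = params * bytes_per_param / 1e9   # ~14 GB
memory_bandwidth_gb_s = 448     # example GPU bandwidth (illustrative)

print(f"~{memory_bandwidth_gb_s / model_size_gb:.0f} tokens/s upper bound")  # ~32
```

A card with twice the VRAM but the same bandwidth hits the same ceiling; the extra memory lets you fit a larger model or a longer context, not generate faster.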

[–] MINIMAN10001@alien.top 1 points 11 months ago

My understanding is that tokens per second typically splits into two categories: the prompt preprocessing time and the actual token generation time.

At least, that's what I remember from oobabooga.
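
A minimal sketch of measuring the two phases separately in Transformers (the model name and prompt are placeholders; backends such as llama.cpp and oobabooga report both numbers for you): time one forward pass over the prompt as the prompt-processing phase, then time a full generate() call and subtract.

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # placeholder model name
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Write a short story about a GPU. " * 20  # longer prompt so prefill is visible
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
n_prompt = inputs["input_ids"].shape[-1]

def sync():
    # Wait for queued GPU work so the timer measures real compute time.
    if torch.cuda.is_available():
        torch.cuda.synchronize()

# Phase 1: prompt processing (prefill) -- one forward pass over the whole prompt.
with torch.no_grad():
    sync(); start = time.perf_counter()
    model(**inputs)
    sync(); prefill_s = time.perf_counter() - start

# Phase 2: full generation, which repeats the prefill and then decodes token by token.
sync(); start = time.perf_counter()
output_ids = model.generate(**inputs, max_new_tokens=64)
sync(); total_s = time.perf_counter() - start

n_new = output_ids.shape[-1] - n_prompt
decode_s = max(total_s - prefill_s, 1e-9)  # crude approximation of pure decode time
print(f"prompt processing: {n_prompt / prefill_s:.1f} tokens/s")
print(f"generation:        {n_new / decode_s:.1f} tokens/s")
```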