LocalLLaMA

A community for discussing Llama, the family of large language models created by Meta AI.

submitted 11 months ago by Aaaaaaaaaeeeee@alien.top to c/localllama@poweruser.forum

I get 20 t/s with a 70B 2.5bpw model, but that is only 47% of the theoretical maximum for a 3090.

In comparison, the benchmarks on the exl2 GitHub homepage show 35 t/s, which is 76% of the theoretical maximum for a 4090.

The memory bandwidth difference between the two GPUs isn't huge: the 4090's is only 7-8% higher.
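For anyone who wants to check the percentages, here is a rough sketch of the arithmetic. It assumes decoding is purely memory-bandwidth bound (every weight is read once per token) and ignores the KV cache and other overhead. The bandwidth figures (936 GB/s for the 3090, 1008 GB/s for the 4090) are spec-sheet values I'm assuming, not numbers from the benchmarks, and the function names are made up for illustration.

```python
# Rough upper bound on decode speed, assuming each generated token
# requires reading all model weights from VRAM exactly once.
# Bandwidth values are assumed spec-sheet numbers, not measurements.

BANDWIDTH_GB_S = {"3090": 936.0, "4090": 1008.0}


def max_tokens_per_second(params_billion, bits_per_weight, bandwidth_gb_s):
    """Theoretical ceiling in tokens/s for a bandwidth-bound decoder."""
    weights_gb = params_billion * bits_per_weight / 8  # size of the weights in GB
    return bandwidth_gb_s / weights_gb


def utilization(measured_tps, params_billion, bits_per_weight, bandwidth_gb_s):
    """Fraction of the theoretical ceiling actually achieved."""
    ceiling = max_tokens_per_second(params_billion, bits_per_weight, bandwidth_gb_s)
    return measured_tps / ceiling


if __name__ == "__main__":
    for gpu, measured in (("3090", 20.0), ("4090", 35.0)):
        bw = BANDWIDTH_GB_S[gpu]
        ceiling = max_tokens_per_second(70, 2.5, bw)
        used = utilization(measured, 70, 2.5, bw)
        print(f"{gpu}: ceiling {ceiling:.1f} t/s, measured {measured:.0f} t/s, {used:.0%}")
```

Under those assumptions the 70B 2.5bpw weights come to about 21.9 GB, which gives the 47% and 76% figures quoted above.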

Why? Is anyone else seeing a similar 20 t/s? I don't think my CPU performance is the issue.

The benchmarks also show ~85% utilization for a 34B model at 4bpw (normal models).

Sat0r1r1@alien.top 1 point 11 months ago

My results are the same as yours.

I use TabbyAPI with a 70B 2.4bpw model and get 20 t/s.