Radiant-Practice-270


I'm using an A100 PCIe 80 GB, with CUDA toolkit 11.8 and driver 525.x.

But when I run inference with CodeLlama 13B in oobabooga (text-generation-webui),

it only gets about 5 tokens/s.

That is very slow.

Is there some config or something else I need to set for the A100?
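In case it helps to rule out the UI itself, here is a minimal sketch of how you could time tokens/s directly. The `generate_fn` is a hypothetical stand-in for whatever backend call you use (the dummy below just simulates generation), not oobabooga's actual API:

```python
import time

def tokens_per_second(generate_fn, prompt, max_new_tokens):
    """Time one generation call and return (tokens, tokens/s).

    generate_fn is a hypothetical stand-in: anything that accepts
    (prompt, max_new_tokens) and returns a list of generated tokens.
    """
    start = time.perf_counter()
    tokens = generate_fn(prompt, max_new_tokens)
    elapsed = time.perf_counter() - start
    return tokens, len(tokens) / elapsed

# Dummy backend that "generates" one token per millisecond,
# just to exercise the harness; swap in a real model call.
def dummy_generate(prompt, max_new_tokens):
    out = []
    for i in range(max_new_tokens):
        time.sleep(0.001)
        out.append(f"tok{i}")
    return out

tokens, tps = tokens_per_second(dummy_generate, "def fib(n):", 50)
print(f"{len(tokens)} tokens at {tps:.0f} tokens/s")
```

If the number you measure this way matches the web UI, the bottleneck is the backend/loader (e.g. unquantized fp16 vs. a quantized loader), not the UI.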

[–] Radiant-Practice-270@alien.top 1 points 9 months ago

Sorry for the late reply. I already tested that; it is better than the CodeLlama 13B model, but still only about 30 tokens/s.

 

At work we use four A100 cards (0,1 NVLinked and 2,3 NVLinked), and I'm curious how to connect all four.

Also, with four A100 cards the performance seems slower, and the token throughput is much lower than on a 4060 Ti at home. Why might that be? When I check with nvidia-smi, the VRAM shows as fully allocated, but the volatile GPU utilization is not 100% on all four cards, usually something like 100, 70, 16, 16. (KVM passthrough, RHEL 8 server.)
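Not knowing the exact loader, one possible explanation for that 100/70/16/16 pattern (this is an assumption): if the model's layers are split across the four cards (layer-wise/pipeline sharding, as `device_map="auto"`-style loading does), each token's forward pass visits the GPUs one after another, so only the card holding the currently active layers is busy, and splitting adds inter-GPU transfer hops rather than speeding anything up. A toy arithmetic sketch, with made-up layer and transfer timings:

```python
def token_latency_ms(layers, per_layer_ms, n_gpus, transfer_ms):
    """Sequential (pipeline-sharded) decode: one token still passes
    through every layer in series, so splitting the layers across
    more GPUs does not shorten the path -- it only adds transfer
    hops between cards.  All timings here are hypothetical."""
    return layers * per_layer_ms + (n_gpus - 1) * transfer_ms

one_gpu  = token_latency_ms(40, 1.0, 1, 0.0)  # whole model on one card
four_gpu = token_latency_ms(40, 1.0, 4, 5.0)  # same layers over four cards
print(one_gpu, four_gpu)  # 40.0 55.0 -- four cards is *slower* per token

# Each GPU is busy only for its share of the layers, so utilization
# is uneven and sums to roughly one "full" GPU -- consistent with
# seeing something like 100, 70, 16, 16 in nvidia-smi.
```

You can also check how the four cards are actually linked with `nvidia-smi topo -m`, which prints the link matrix (NV#, PIX, PHB, SYS) between every GPU pair; with only 0-1 and 2-3 NVLinked, traffic between the pairs goes over PCIe, and PCIe passthrough under KVM can slow those hops further. To get a real speedup from multiple cards you'd want tensor parallelism (as in e.g. vLLM or TGI) rather than plain layer splitting.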