this post was submitted on 17 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.


At work we are using four A100 cards (GPUs 0 and 1 NVLinked as one pair, 2 and 3 as another), and I am curious how to connect all four cards together. Also, when using the four A100s, generation seems slower and the token throughput (tokens per second) is much lower than on my 4060 Ti at home. Why might that be? nvidia-smi shows the VRAM fully allocated, but volatile GPU utilization is not 100% on all four cards; it is usually something like 100, 70, 16, 16. (RHEL 8 server, GPUs passed through to a KVM guest.)
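
For reference, this is roughly how I am watching the interconnect and per-card load while it generates. It is only a rough sketch, assuming the nvidia-ml-py (pynvml) package is installed; the polling loop and its length are just illustrative:

    import subprocess
    import time

    import pynvml  # assumption: nvidia-ml-py (pynvml) is installed

    # Print the GPU interconnect matrix; NVLinked pairs show up as NV# entries,
    # anything routed over PCIe/CPU shows up as PHB/NODE/SYS.
    subprocess.run(["nvidia-smi", "topo", "-m"], check=True)

    pynvml.nvmlInit()
    count = pynvml.nvmlDeviceGetCount()

    # Poll utilization for a few seconds during generation to see how evenly
    # the four cards are loaded (e.g. the 100/70/16/16 pattern above).
    for _ in range(10):
        readings = []
        for i in range(count):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            readings.append(f"GPU{i}: {util.gpu}% util, {mem.used // 2**20} MiB used")
        print(" | ".join(readings))
        time.sleep(1)

    pynvml.nvmlShutdown()

My guess is that a pattern like 100/70/16/16 often just means the model is split layer-wise across the cards, so they mostly wait on each other, and if the two NVLink pairs are only bridged within each pair, any traffic between the pairs goes over PCIe.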

[–] Semi_Tech@alien.top 1 points 10 months ago (1 children)

Try a different version of the model.

What is the performance of a GGUF Q4-Q6 quant on a single card?
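
If you want a quick apples-to-apples number, something like this llama-cpp-python sketch gives tokens/s on one card. The model path and prompt are placeholders, and it assumes llama-cpp-python was built with CUDA support:

    import os
    import time

    # Keep the test on a single card; set this before importing anything CUDA-related.
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"

    from llama_cpp import Llama  # assumption: built with GPU (CUDA) support

    # Placeholder path to a Q4_K_M / Q5_K_M / Q6_K GGUF quant of the model.
    llm = Llama(model_path="model-q5_k_m.gguf", n_gpu_layers=-1, n_ctx=4096)

    start = time.time()
    out = llm("Write a bubble sort in Python.", max_tokens=256)
    elapsed = time.time() - start

    n_gen = out["usage"]["completion_tokens"]
    print(f"{n_gen} tokens in {elapsed:.1f}s -> {n_gen / elapsed:.1f} tok/s")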

Sorry for the late reply. I already tested that; it is better than the CodeLlama 13B model, but only around 30 tokens/s.