this post was submitted on 17 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.


At work we are running four A100 cards (0,1 NVLinked and 2,3 NVLinked), and I am curious how to connect all four cards together. Also, when using the four A100s, performance seems slower and the token generation speed is much lower than on a 4060 Ti at home. Why might this be? When I check with nvidia-smi, it shows the VRAM is fully used, but volatile GPU utilization is not 100% on all four cards; it is usually something like 100, 70, 16, 16. (The server is RHEL 8 with KVM passthrough.)
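That 100/70/16/16 pattern often shows up when the model is split layer-wise across the GPUs (e.g. Hugging Face Accelerate's device_map="auto"), because the cards then run one after another instead of in parallel. A minimal sketch to check how the layers are actually placed; the model name and dtype here are placeholders, not necessarily the OP's setup:

```python
# Minimal sketch: load a model sharded across the visible GPUs with
# device_map="auto" and inspect where each layer ended up.
# The model_id below is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-34b-hf"  # placeholder model

print("visible GPUs:", torch.cuda.device_count())

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",   # layer-wise split: layers execute sequentially across GPUs
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Shows which module sits on which GPU; with this kind of split only one GPU
# is busy at a time during generation, so utilization looks uneven.
print(model.hf_device_map)
```

If that is what is happening, a tensor-parallel backend (for example vLLM with tensor_parallel_size=4) keeps all four cards busy at once and usually gives much higher throughput than the naive layer split.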

top 2 comments
[–] Semi_Tech@alien.top 1 points 10 months ago (1 children)

Try a different version of the model.

What is the performance of a GGUF Q4-Q6 on a single card?

Sorry for the late reply. I already tested that; it is better than the CodeLlama 13B model, but only about 30 tokens/s.
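For reference, a rough sketch of how a single-card GGUF tokens/s number like that can be measured with llama-cpp-python; the model path and prompt are placeholders, not the exact file used above:

```python
# Rough sketch: time a GGUF quant (Q4-Q6) on a single GPU with llama-cpp-python.
# Model path and prompt are placeholders; n_gpu_layers=-1 offloads all layers.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="./models/codellama-34b.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,   # offload every layer to the GPU
    n_ctx=4096,
)

prompt = "Write a Python function that reverses a string."
start = time.time()
out = llm(prompt, max_tokens=256)
elapsed = time.time() - start

completion_tokens = out["usage"]["completion_tokens"]
print(f"{completion_tokens / elapsed:.1f} tokens/s")
```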