this post was submitted on 17 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.


At work, we are using four A100 cards (0,1 NVLinked and 2,3 NVLinked) and I am curious how to connect all four cards. Additionally, when using the four A100s, generation seems slower and the token throughput is much lower than on a 4060 Ti at home. Why might this be? nvidia-smi shows the VRAM fully allocated, but volatile GPU utilization is not 100% on all four cards; it is usually something like 100, 70, 16, 16. (KVM passthrough, RHEL 8 server.)
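
For context on how the cards are linked, `nvidia-smi topo -m` prints the NVLink/PCIe topology. The snippet below is only a minimal sketch, assuming vLLM is an option and using a placeholder model name, of loading a model with explicit tensor parallelism so all four GPUs work on every token. A layer-by-layer split (e.g. `device_map="auto"` in Transformers/Accelerate) instead runs the cards largely one after another, which would fit an uneven utilization pattern like 100/70/16/16.

```python
# Sketch only: assumes vLLM is installed and the model name is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="codellama/CodeLlama-34b-Instruct-hf",  # placeholder; substitute the model you actually serve
    tensor_parallel_size=4,                       # shard the weights across all four A100s
)

params = SamplingParams(max_tokens=256, temperature=0.7)
out = llm.generate(["Write a hello world in Python."], params)
print(out[0].outputs[0].text)
```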

top 2 comments
[–] Semi_Tech@alien.top 1 points 11 months ago

Try a different version of the model.

What is the performance of a GGUF q4-q6 quant on a single card?
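
A rough sketch of how that single-card timing could be done with llama-cpp-python (the model path is only a placeholder, and it assumes the package was built with CUDA support):

```python
# Sketch: time tokens/s for a q4 GGUF model fully offloaded to one GPU.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="./codellama-13b.Q4_K_M.gguf",  # placeholder path to the quantized model
    n_gpu_layers=-1,                           # offload all layers to the single GPU
    n_ctx=4096,
)

start = time.time()
out = llm("### Instruction: write a bubble sort in Python\n### Response:", max_tokens=256)
elapsed = time.time() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens / elapsed:.1f} tokens/s")
```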

[–] Radiant-Practice-270@alien.top 1 points 11 months ago

Sorry for the late reply. I already tested that; it is better than the CodeLlama 13B model, but only around 30 tokens/s.