this post was submitted on 27 Nov 2023

LocalLLaMA


Community to discuss about Llama, the family of large language models created by Meta AI.

The title, pretty much.

I'm wondering whether a 70B model quantized to 4-bit would perform better than a 7B/13B/34B model at fp16. Would be great to get some insights from the community.
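
For a rough sense of scale, here's a back-of-the-envelope comparison of weight memory only. This is a sketch assuming ~2 bytes per parameter at fp16 and ~0.5 bytes per parameter at 4-bit, and it ignores KV cache and runtime overhead:

```python
# Rough VRAM estimate for model weights only (ignores KV cache, activations,
# and quantization overhead). Bytes-per-parameter values are approximations.
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * 1e9 * bytes_per_param / 1024**3

print(f"70B @ 4-bit : ~{weight_gb(70, 0.5):.0f} GB")  # ~33 GB
print(f"34B @ fp16  : ~{weight_gb(34, 2.0):.0f} GB")  # ~63 GB
print(f"13B @ fp16  : ~{weight_gb(13, 2.0):.0f} GB")  # ~24 GB
print(f"7B  @ fp16  : ~{weight_gb(7, 2.0):.0f} GB")   # ~13 GB
```

So a 4-bit 70B actually fits in less memory than a fp16 34B, which is part of why the question comes up.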

[–] Dry-Vermicelli-682@alien.top 1 points 9 months ago (1 children)

So you have 2 GPUs on a single motherboard, and llama.cpp knows to use both? Does this work with AMD GPUs too?

[–] harrro@alien.top 1 points 9 months ago

Yes, llama.cpp will automatically split the model across GPUs. You can also specify how much of the full model should go on each GPU; see the sketch below.

Not sure about AMD support, but for NVIDIA it's pretty easy to do.
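
For reference, a minimal sketch using the llama-cpp-python bindings (the `n_gpu_layers` and `tensor_split` parameters exist in those bindings, but check your installed version; the model filename is hypothetical):

```python
from llama_cpp import Llama

# Minimal multi-GPU sketch, assuming llama-cpp-python was built with GPU support.
# n_gpu_layers=-1 offloads all layers to GPU; tensor_split sets each GPU's share
# of the model, e.g. 60% on GPU 0 and 40% on GPU 1.
llm = Llama(
    model_path="llama-2-70b.Q4_K_M.gguf",  # hypothetical quantized model file
    n_gpu_layers=-1,
    tensor_split=[0.6, 0.4],
    n_ctx=4096,
)

out = llm("Q: What does 4-bit quantization do? A:", max_tokens=64)
print(out["choices"][0]["text"])
```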