this post was submitted on 14 Nov 2023
LocalLLaMA
Community to discuss about Llama, the family of large language models created by Meta AI.
you are viewing a single comment's thread
I imagine it's pretty solid.
I've tested around with the q4_K_M and the q8 on my Mac Studio, and the q4 is pretty darn good. There's some difference in that the q4 sometimes seems to get confused when I talk to it, whereas the q8 seems unshakeable in its quality, but honestly the q4 still feels better than almost any other model I've ever used.
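If anyone wants to run a similar side-by-side comparison, something along these lines with llama-cpp-python should work; the model filenames below are just placeholders, not the exact files I used:

```python
# Rough sketch: load two quants of the same model with llama-cpp-python
# and compare their answers to the same prompt. Paths are placeholders.
from llama_cpp import Llama

prompt = "Explain the trade-off between q4_K_M and q8_0 quantization in one paragraph."

for path in ["llama-2-70b.Q4_K_M.gguf", "llama-2-70b.Q8_0.gguf"]:
    # n_gpu_layers=-1 offloads all layers (Metal on a Mac Studio)
    llm = Llama(model_path=path, n_ctx=4096, n_gpu_layers=-1, verbose=False)
    out = llm(prompt, max_tokens=256, temperature=0.7)
    print(f"--- {path} ---")
    print(out["choices"][0]["text"].strip())
```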
What's the tok/s for each of those models on that system?
Edit: also, if you don't mind my asking, how much context are you able to use before inference degrades?
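In case it helps, one simple way to get a tok/s number is to time a fixed generation and divide; a minimal sketch (again llama-cpp-python, path is a placeholder):

```python
# Minimal tok/s check: time a generation and divide the completion
# token count by elapsed wall-clock time. Model path is a placeholder.
import time
from llama_cpp import Llama

llm = Llama(model_path="llama-2-70b.Q4_K_M.gguf", n_ctx=8192, n_gpu_layers=-1, verbose=False)

start = time.perf_counter()
out = llm("Write a short story about a robot learning to paint.", max_tokens=200)
elapsed = time.perf_counter() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.2f} tok/s")
```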
For comparison's sake, the EXL2 4.85bpw version runs at around 6-8 t/s on 4x3090s at 8k context, and that's on the lower end.