this post was submitted on 26 Nov 2023

LocalLLaMA

A community for discussing Llama, the family of large language models created by Meta AI.

I understand that more memory lets you run a model with more parameters or less aggressive quantization, but how does context size factor in? I believe it's possible to increase the context size, and that doing so lengthens the initial processing before the model starts outputting tokens, but does anyone have numbers?

Is the memory needed for context independent of the model size, or does a bigger model mean that each bit of extra context 'costs' more memory?

I'm considering an M2 Ultra for its large memory and low energy per token, although its speed lags behind RTX cards. Is it the best option for tasks like writing novels, where quality and comprehension of long texts matter more than speed?

a_beautiful_rhind@alien.top 1 point 11 months ago

From what llama.cpp reports, I see it being ~2 GB per 4k tokens of context. Load a model and read what it puts in the log.
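To see why context memory depends on the model and not only on the token count, here is a rough sketch of the usual key/value (KV) cache estimate: one key tensor and one value tensor per layer, each holding n_ctx × n_kv_heads × head_dim elements. The layer and head counts are the published Llama 2 shapes; the 2-byte (f16) cache element and the function name `kv_cache_bytes` are my assumptions, and runtimes allocate other buffers on top, so treat this as an estimate of the cache alone.

```python
def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, bytes_per_elem=2):
    """Estimated KV cache size in bytes (assumes an f16 cache by default)."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem


# Published Llama 2 shapes; head_dim = hidden size / attention heads = 128.
MODELS = {
    "llama-2-7b": dict(n_layers=32, n_kv_heads=32, head_dim=128),
    "llama-2-13b": dict(n_layers=40, n_kv_heads=40, head_dim=128),
    # 70B uses grouped-query attention: only 8 KV heads are cached.
    "llama-2-70b": dict(n_layers=80, n_kv_heads=8, head_dim=128),
}

if __name__ == "__main__":
    for name, cfg in MODELS.items():
        mib = kv_cache_bytes(n_ctx=4096, **cfg) / 2**20
        print(f"{name}: {mib:.0f} MiB for 4096 tokens")
```

Under these assumptions, 7B comes out to exactly 2 GiB per 4096 tokens, consistent with the ~2 GB per 4k above, 13B to 3200 MiB, and 70B to only 1280 MiB because of its 8 KV heads. So a bigger model usually costs more memory per token of context, but the attention layout matters as much as the parameter count, and the cost grows linearly with context length.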

As for Mac vs. RTX: you can build a system with the same or a similar amount of VRAM as the Mac for a lower price, but it depends on your skill level and your electricity and space constraints.

If you live in a studio apartment, I don't recommend buying an 8-card inference server, regardless of the couple thousand dollars in either direction and the faster speed.

EvokerTCG@alien.top 1 point 11 months ago

Thanks. Yes, a 2 kW PC that doubles as a space heater would only be welcome in winter, and it could get pricey to run.
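To put "pricey to run" into numbers, the running cost is just power × hours × tariff. This is a minimal sketch; the 8 hours a day at full 2 kW draw and the 0.15-per-kWh tariff are purely hypothetical inputs, and `monthly_energy_cost` is an illustrative name, not anything from the thread.

```python
def monthly_energy_cost(watts: float, hours_per_day: float,
                        price_per_kwh: float, days: int = 30) -> float:
    """Cost of running a constant load; result is in the tariff's currency."""
    kwh = watts / 1000 * hours_per_day * days  # energy used over the period
    return kwh * price_per_kwh


if __name__ == "__main__":
    # Hypothetical inputs: full 2 kW draw, 8 hours a day, 0.15 per kWh.
    cost = monthly_energy_cost(2000, 8, 0.15)
    print(f"{cost:.2f} per month")
```

With those made-up inputs the box uses 2 kW × 8 h × 30 days = 480 kWh a month. Swap in your own tariff and a realistic average draw, since an inference machine likely won't sit at full load the whole time.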
