What software are you using to run LLaMA and Stable Diffusion?
What version of the LLaMA model are you trying to run? How many parameters? What quantization?
It seems like this approach could also be useful in situations where the goal isn't speed, but rather "quality" (by a variety of metrics).
Please be specific: what's "quite slow" and what's "extremely quickly"? Use numbers with units that include time, e.g. tokens per second.
What hardware are you running on? Without changing hardware, your best bet is a smaller model (in terms of parameters), or a smaller quantization of a 13B model, or both.
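To make "smaller model or smaller quantization" concrete, here's a back-of-the-envelope sketch of the weight-only footprint. The bits-per-weight figures are approximations for llama.cpp-style quant formats (q4_0 is roughly 4.5 bits per weight once scales are included, q8_0 roughly 8.5), and this ignores the KV cache and runtime overhead:

    # Rough weight-only memory estimate; KV cache and runtime overhead add more.
    # Bits-per-weight values are approximate for llama.cpp-style quant formats.
    def weight_gb(params_billion: float, bits_per_weight: float) -> float:
        return params_billion * 1e9 * bits_per_weight / 8 / 1e9

    for params in (7, 13):
        for label, bits in (("fp16", 16.0), ("q8_0", 8.5), ("q4_0", 4.5)):
            print(f"{params}B {label}: ~{weight_gb(params, bits):.1f} GB")
    # 7B:  fp16 ~14.0 GB, q8_0 ~7.4 GB, q4_0 ~3.9 GB
    # 13B: fp16 ~26.0 GB, q8_0 ~13.8 GB, q4_0 ~7.3 GB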
Game GPU use probably hits the cache. LLM inference really won't, since generating each token involves reading all of the model's weights.
I think part of the answer is that RAM uses more power than you'd think when it's running near full tilt, as it is during generation. Micron's guidance is to figure roughly 3 W per 8 GB for DDR4, and more than that for the highest-performance parts. The RAM being on-package probably offsets that somewhat, but it's still more than single-digit watts.
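Just to put numbers on that rule of thumb (assuming the ~3 W per 8 GB figure applies; on-package LPDDR should land lower):

    # Back-of-the-envelope DRAM power from Micron's ~3 W per 8 GB (DDR4) guidance.
    # On-package LPDDR in Apple Silicon should come in lower, but scales the same way.
    WATTS_PER_8GB = 3.0
    for ram_gb in (16, 32, 64, 96):
        print(f"{ram_gb} GB of busy RAM: ~{ram_gb / 8 * WATTS_PER_8GB:.0f} W")
    # 16 GB ~6 W, 32 GB ~12 W, 64 GB ~24 W, 96 GB ~36 W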
Power consumption on my M1 Max with the 24-core GPU is similar to yours, though somewhat lower as you'd expect, according to both iStat Menus and Stats.app.
There is also the question of how accurate they are.
LLVM or LLM?
Apple is increasing differentiation among their chips. Previously the Pro and Max differed primarily in GPU cores; now they are also differentiated in CPU cores and memory bandwidth.
I was disappointed to see that the M3 Max's memory bandwidth is, on paper, the same as the M2 Max's. But I'm also mindful that no single functional unit could use all the available memory bandwidth in the first place, so I hope the M3 will allow higher utilization.
We'll see once people get their hands on them.
Apple has further segmented the Apple silicon lineup with the M3.
With the M1 & M2 Max, all the GPU variants had the same memory bandwidth (400GB/s for the M2 Max). The top-of-the-line M3 Max (16 CPU / 40 GPU cores) is still limited to 400GB/s, but the lower-spec variant (14 CPU / 30 GPU cores) is now capped at 300GB/s.
Inference is generally bound by memory bandwidth, so the M3 generation may not be much of an improvement. Apple claims improvements in the cache, but that may not mean much for inference. We'll know more once people have them in hand, which shouldn't take too long.
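A rough way to see what "bound by memory bandwidth" implies for decode speed: every generated token has to stream essentially all the weights, so bandwidth divided by model size gives a ceiling (real numbers come in lower once you account for compute, KV-cache reads, and imperfect utilization). A sketch using the M3 Max bandwidth figures above and some example quantized model sizes:

    # Crude upper bound on single-stream decode speed:
    # each token reads all weights once, so tok/s <= bandwidth / model size.
    def max_tok_per_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
        return bandwidth_gb_s / model_size_gb

    for bw in (300, 400):                # M3 Max variants, GB/s
        for size_gb in (3.9, 7.3):       # e.g. 7B q4_0 vs 13B q4_0 weights
            print(f"{bw} GB/s, {size_gb} GB model: <= ~{max_tok_per_s(bw, size_gb):.0f} tok/s")
    # 300 GB/s: ~77 tok/s (3.9 GB), ~41 tok/s (7.3 GB)
    # 400 GB/s: ~103 tok/s (3.9 GB), ~55 tok/s (7.3 GB)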
Someone asked something very similar in the past day, but I think it was from the angle of training as proof of work.
There is this https://petals.dev
When I load yarn-mistral-64k in Ollama (which uses llama.cpp) on my 32GB Mac, it allocates 16.359 GB for the GPU. I don't remember how much the 128k-context version needs, but it was more than the 21.845 GB that macOS allows for GPU use on a 32GB machine. You aren't going to get very far on a 16GB machine.
Maybe if you don't send any layers to the GPU and force it to use the CPU you could eke out a little more. On Apple Silicon, CPU-only inference only seems to be about a 50% hit relative to GPU speeds, if I remember right.
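For what it's worth, that 21.845 GB figure matches the commonly reported heuristic that Metal's recommendedMaxWorkingSetSize (which llama.cpp prints at load time) defaults to roughly 2/3 of unified memory on these machines. The exact fraction is Apple's call and larger-RAM configs are reported to get a bigger share, so treat this as an estimate:

    # Assumed heuristic: Metal's GPU working-set limit ~= 2/3 of unified memory
    # (matches reported values for 16 GB and 32 GB Macs; larger configs may differ).
    def approx_gpu_limit_mib(ram_gib: int) -> float:
        return ram_gib * 1024 * 2 / 3

    for ram in (16, 32):
        print(f"{ram} GiB Mac: ~{approx_gpu_limit_mib(ram):.0f} MiB usable by the GPU")
    # 16 GiB -> ~10923 MiB; 32 GiB -> ~21845 MiB (the 21.845 GB above)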