What software are you using to run LLaMA and Stable Diffusion?
What version of the LLaMA model are you trying to run? How many parameters? What quantization?
It seems like this approach could also be useful in situations where the goal isn't speed, but rather "quality" (by a variety of metrics).
Please be specific: what's "quite slow" and what's "extremely quickly"? Use numbers with units that include time, e.g. tokens per second.
What hardware are you running on? Without changing hardware, your best bet is a smaller model (in terms of parameters), or a smaller quantization of a 13B model, or both.
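To make "smaller model or smaller quantization" concrete, here's a back-of-the-envelope sketch of the weight-only footprint. The bits-per-weight figures are approximations for llama.cpp-style quant formats (q4_0 is roughly 4.5 bits per weight once scales are included, q8_0 roughly 8.5), and this ignores the KV cache and runtime overhead:

    # Rough weight-only memory estimate; KV cache and runtime overhead add more.
    # Bits-per-weight values are approximate for llama.cpp-style quant formats.
    def weight_gb(params_billion: float, bits_per_weight: float) -> float:
        return params_billion * 1e9 * bits_per_weight / 8 / 1e9

    for params in (7, 13):
        for label, bits in (("fp16", 16.0), ("q8_0", 8.5), ("q4_0", 4.5)):
            print(f"{params}B {label}: ~{weight_gb(params, bits):.1f} GB")
    # 7B:  fp16 ~14.0 GB, q8_0 ~7.4 GB, q4_0 ~3.9 GB
    # 13B: fp16 ~26.0 GB, q8_0 ~13.8 GB, q4_0 ~7.3 GB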
Game GPU use probably hits the cache. LLM inference really won't, since generating each token involves reading all of the model's weights.
I think part of the answer is that RAM uses more power than you'd think when it's running near full tilt, as it is during generation. Micron's guidance is to figure roughly 3 W per 8 GB for DDR4, and more than that for the highest-performance parts. The RAM being on-package probably offsets that somewhat, but it's still more than single-digit watts.
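Just to put numbers on that rule of thumb (assuming the ~3 W per 8 GB figure applies; on-package LPDDR should land lower):

    # Back-of-the-envelope DRAM power from Micron's ~3 W per 8 GB (DDR4) guidance.
    # On-package LPDDR in Apple Silicon should come in lower, but scales the same way.
    WATTS_PER_8GB = 3.0
    for ram_gb in (16, 32, 64, 96):
        print(f"{ram_gb} GB of busy RAM: ~{ram_gb / 8 * WATTS_PER_8GB:.0f} W")
    # 16 GB ~6 W, 32 GB ~12 W, 64 GB ~24 W, 96 GB ~36 W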
Power consumption on my M1 Max with the 24-core GPU is similar to yours, though somewhat lower as you'd expect, according to both iStat Menus and Stats.app.
There is also the question of how accurate they are.
LLVM or LLM?
Apple is increasing differentiation among their chips. Previously the Pro and Max differed primarily in GPU cores; now they are also differentiated in CPU cores and memory bandwidth.
I was disappointed to see that the M3 Max's memory bandwidth is, on paper, the same as the M2 Max's. But I'm also mindful that no single functional unit could use all the available memory bandwidth in the first place, so I hope the M3 will allow higher utilization.
We'll see once people get their hands on them.
Apple has further segmented the Apple silicon lineup with the M3.
With the M1 & M2 Max, all the GPU variants had the same memory bandwidth (400GB/s for the M2 Max). The top-of-the-line M3 Max (16 CPU / 40 GPU cores) is still limited to 400GB/s, but the lower-spec variant (14 CPU / 30 GPU cores) is now capped at 300GB/s.
Inference is generally bound by memory bandwidth, so the M3 generation may not be much of an improvement. Apple claims improvements in the cache, but that may not mean much for inference. We'll know more once people have them in hand, which shouldn't take too long.
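A rough way to see what "bound by memory bandwidth" implies for decode speed: every generated token has to stream essentially all the weights, so bandwidth divided by model size gives a ceiling (real numbers come in lower once you account for compute, KV-cache reads, and imperfect utilization). A sketch using the M3 Max bandwidth figures above and some example quantized model sizes:

    # Crude upper bound on single-stream decode speed:
    # each token reads all weights once, so tok/s <= bandwidth / model size.
    def max_tok_per_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
        return bandwidth_gb_s / model_size_gb

    for bw in (300, 400):                # M3 Max variants, GB/s
        for size_gb in (3.9, 7.3):       # e.g. 7B q4_0 vs 13B q4_0 weights
            print(f"{bw} GB/s, {size_gb} GB model: <= ~{max_tok_per_s(bw, size_gb):.0f} tok/s")
    # 300 GB/s: ~77 tok/s (3.9 GB), ~41 tok/s (7.3 GB)
    # 400 GB/s: ~103 tok/s (3.9 GB), ~55 tok/s (7.3 GB)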
Someone asked something very similar in the past day, but I think it was from the angle of training as proof of work.
There is this https://petals.dev
When I load yarn-mistral-64k in Ollama (which uses llama.cpp) on my 32GB Mac, it allocates 16.359 GB for the GPU. I don't remember how much the 128k-context version needs, but it was more than the 21.845 GB that macOS allows for GPU use on a 32GB machine. You aren't going to get very far on a 16GB machine.
Maybe if you don't send any layers to the GPU and force it to use the CPU you could eke out a little more. On Apple Silicon, CPU-only inference only seems to be about a 50% hit relative to GPU speeds, if I remember right.
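For what it's worth, that 21.845 GB figure matches the commonly reported heuristic that Metal's recommendedMaxWorkingSetSize (which llama.cpp prints at load time) defaults to roughly 2/3 of unified memory on these machines. The exact fraction is Apple's call and larger-RAM configs are reported to get a bigger share, so treat this as an estimate:

    # Assumed heuristic: Metal's GPU working-set limit ~= 2/3 of unified memory
    # (matches reported values for 16 GB and 32 GB Macs; larger configs may differ).
    def approx_gpu_limit_mib(ram_gib: int) -> float:
        return ram_gib * 1024 * 2 / 3

    for ram in (16, 32):
        print(f"{ram} GiB Mac: ~{approx_gpu_limit_mib(ram):.0f} MiB usable by the GPU")
    # 16 GiB -> ~10923 MiB; 32 GiB -> ~21845 MiB (the 21.845 GB above)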