overview for Aphid

Your settings are (probably) hurting your model - Why sampler settings matter in c/localllama@poweruser.forum

Aphid_red@alien.top 1 points 11 months ago

If you want determinism, use a fixed seed. The actual sampler settings shouldn't matter for that. This way you always get the same output for the same prompt (and can thus, for example, cache common prompts).
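A toy illustration of the seed point, not tied to any particular inference library. `toy_generate`, `cached_generate` and the fixed token distribution are made up for the example, and Python's `random.Random` stands in for the sampler's RNG:

```python
import random

# Hypothetical next-token distribution; a real model would recompute it every step.
VOCAB = ["the", "cat", "sat", "mat", "."]
WEIGHTS = [0.4, 0.2, 0.2, 0.15, 0.05]

def toy_generate(prompt, seed, n_tokens=8):
    """Sample n_tokens from the toy distribution using a private, seeded RNG."""
    rng = random.Random(seed)  # same seed -> same sequence of draws
    sampled = rng.choices(VOCAB, weights=WEIGHTS, k=n_tokens)
    return prompt + " " + " ".join(sampled)

_cache = {}

def cached_generate(prompt, seed):
    """Reuse earlier output; only safe because generation is deterministic."""
    key = (prompt, seed)
    if key not in _cache:
        _cache[key] = toy_generate(prompt, seed)
    return _cache[key]

first = cached_generate("Once upon a time", seed=42)
second = toy_generate("Once upon a time", seed=42)
print(first == second)  # True: same prompt and seed give the same text
```

In a real setup the cache key would also need to include the model and the sampler settings, since changing any of those changes the output even with the same seed.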

If you also want to 'measure' things about the model, such as its perplexity or how well it can predict an existing text, use top k=1, temperature = 1.0, disable all other samplers, and correct it whenever it predicts the wrong next token (don't let the model generate more than one token at a time).
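A minimal sketch of that measuring loop, assuming a hypothetical `next_token_probs(prefix)` callable that returns the model's next-token probabilities at temperature 1.0. The uniform toy model is only there to make the example runnable:

```python
import math

def score_text(next_token_probs, tokens):
    """Teacher-forced scoring: predict one token at a time, then feed in the true token.

    next_token_probs(prefix) is a hypothetical callable returning {token: probability}
    at temperature 1.0 with no other samplers applied.
    """
    nll = 0.0   # summed negative log-likelihood of the true tokens
    hits = 0    # how often the top-1 (top k=1) guess was the true token
    for i in range(1, len(tokens)):
        probs = next_token_probs(tokens[:i])
        target = tokens[i]
        nll -= math.log(probs[target])
        if max(probs, key=probs.get) == target:
            hits += 1
        # The next prefix is always the real text, never the model's own guess.
    n = len(tokens) - 1
    return math.exp(nll / n), hits / n

def uniform_model(prefix):
    """Toy stand-in: each of 4 tokens is equally likely, whatever the prefix."""
    return {t: 0.25 for t in "abcd"}

ppl, top1 = score_text(uniform_model, list("abcabc"))
print(round(ppl, 6))  # 4.0: a uniform guess over 4 tokens has perplexity 4
```

The perplexity comes from the probability assigned to the true token, while the top-1 hit rate is what the top k=1 setting would report.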

Regarding long context and quadratic attention in c/localllama@poweruser.forum

Aphid_red@alien.top 1 points 1 year ago

I've been doing some off-hand calculations about this and found that really, most base models look like they've been made with a context that is shorter than really needed. Quadratic scaling of attention is a problem, but not something any of the currently trained models have actually encountered, as their sizes are too small for the quadratic factor to be relevant.

The neural network uses c=ctx_len and d=hidden_dimension when computing data. Its size in parameters doesn't depend much on ctx_len at all, but does depend a lot on d (quadratically)!

How much memory besides parameters do we need to run the computation? Well, there's the matrix QK^T, which has shape c*c and is the only truly quadratic step in the attention. There's also c*3d for making the matrices Q, K, and V.

Next, there is the 'linear' step. This involves a matrix of size c*4d.

Take c = d. The first number (c*c + c*3d) ends up as d^2 + 3d^2 = 4d^2. The second (c*4d) is also 4d^2.

So it doesn't make much sense to take c < d: the 'attention' part of the transformer (quadratic in c) will then be smaller, in memory and compute, than the 'feedforward' part (which is linear in both c and d). Note that the attention part also includes the c*3d term, which comes from creating Q, K, and V (the inputs to attention) out of X (the (masked) input to self-attend).
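The tally above is easy to play with in a few lines. This is only a sketch of the post's own counting (elements per layer, not bytes, and ignoring heads and batch size), with `activation_elements` being a name invented here:

```python
def activation_elements(c, d):
    """Per-layer activation counts as tallied in the post (elements, not bytes)."""
    attention = c * c + 3 * c * d   # QK^T scores (c*c) plus Q, K and V (c*3d)
    feedforward = 4 * c * d         # the c*4d matrix of the 'linear' step
    return attention, feedforward

for c in (2048, 8192, 32768):
    att, ff = activation_elements(c, d=8192)
    print(c, att / ff)  # ratios: 0.8125, 1.0, 1.75
```

The ratio crosses 1.0 exactly at c = d, which is the break-even point the argument relies on.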

For example, Llama (the 70B model) has d=8192. So it seems c>=8192 is optimal (in terms of making attention and feedforward take roughly equal time and memory), but c=4096 was actually chosen.

For Falcon, d=14848. So c>=14848 would make sense, but c=2048 was chosen.
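To put rough numbers on how small the quadratic part is for these two models, here is a sketch using the same tally as above (c*c against c*3d + c*4d). `quadratic_share` is a helper made up for this example, and the (c, d) pairs are the ones quoted in the post:

```python
def quadratic_share(c, d):
    """Fraction of the post's activation tally taken by the c*c score matrix."""
    quadratic = c * c
    linear_in_c = 3 * c * d + 4 * c * d   # Q/K/V projections plus feedforward
    return quadratic / (quadratic + linear_in_c)

models = {"llama": (4096, 8192), "Falcon": (2048, 14848)}  # (c, d) from the post
for name, (c, d) in models.items():
    print(f"{name}: c/d = {c / d:.2f}, quadratic share = {quadratic_share(c, d):.1%}")
```

Under this counting the c*c term is about 7% of the total for llama and about 2% for Falcon, compared with 12.5% at c = d.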

Why? Probably to do cheaper training. But it might make more sense to train a smaller dimension model with bigger context size.

That won't win you any of the current benchmarks, as benchmarks don't check for long context. As long as the questions fit in the context, there's no gain in making it longer. This is a bit of a problem with the current benchmarks if you want to use LLMs for longer-form content (like having one write a book).

Looking for CPU Inference Hardware (8 Channel Ram Server Motherboards) in c/localllama@poweruser.forum

Aphid_red@alien.top 1 points 1 year ago

Getting 3-5 tokens a second on a 120GB model requires a minimum of 360-600 GB/s of memory throughput (just multiply the numbers, since every token has to read the whole model once). In practice you likely need about 30% more due to various inefficiencies, as you usually never reach the maximum theoretical RAM throughput and there are other steps to evaluating the LLM than just the matmuls. So 468-780 GB/s.
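The arithmetic as a small sketch. `required_bandwidth_gbs` is a made-up helper, and the 30% overhead is the post's rough estimate rather than a measured figure:

```python
def required_bandwidth_gbs(model_gb, tokens_per_s, overhead=0.30):
    """Memory throughput needed if every token streams the whole model once."""
    ideal = model_gb * tokens_per_s
    return ideal, ideal * (1 + overhead)

for tps in (3, 5):
    ideal, realistic = required_bandwidth_gbs(120, tps)
    print(f"{tps} tok/s: {ideal:.0f} GB/s ideal, {realistic:.0f} GB/s with 30% margin")
# 3 tok/s: 360 GB/s ideal, 468 GB/s with 30% margin
# 5 tok/s: 600 GB/s ideal, 780 GB/s with 30% margin
```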

This might be what you're looking for, as a platform base:

https://www.gigabyte.com/Enterprise/Server-Motherboard/MZ73-LM1-rev-10

24 channels of DDR5 get you up to 920 GB/s of total memory throughput, so that meets the criterion. About as much as a high-end GPU, actually. The numbers on Genoa look surprisingly good (well, maybe not the power consumption: ~1100W for CPUs and RAM is a lot more than the ~300W an A100 would use, and you could probably power-limit the A100 to 150W and it would still be faster).
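Where a figure like 920 GB/s comes from, as a sketch. The DDR5-4800 speed and the 8 bytes (64 bits) per transfer per channel are assumptions filled in here, not numbers stated in the post, and `peak_bandwidth_gbs` is a hypothetical helper:

```python
def peak_bandwidth_gbs(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak: channels * transfers/s * bytes per transfer, in GB/s."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000

# Assumed: DDR5-4800, 12 channels per socket, two sockets.
peak = peak_bandwidth_gbs(channels=24, mega_transfers_per_s=4800)
print(peak)        # 921.6
print(peak / 120)  # upper bound on tokens/s for a 120 GB model, before any overhead
```

This is a theoretical ceiling; sustained throughput will be lower, which is what the 30% margin above is meant to cover.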

Of course, during prompt processing, you'll be bottlenecked by the CPU speed. I'd estimate a 32-core Genoa CPU does ~2 tflops or so of fp64 (based on the 9654's number of 5.4 tflops; it'll be a bit more than a third of that due to the higher clock speed), so perhaps 4 tflops of fp32 (I don't think fp16 is a native instruction yet on Genoa afaik, and fp32 should be 2x fp64 using AVX). Compare that with 36 tflops for the 3090: prompt processing, which is compute limited, is going to run at about 1/5th the speed with two CPUs, or 1/10th if the code is unoptimized for NUMA. Honestly, that's not too bad. But if you want the best of both worlds, add a 3090, 4090 or 7900XTX and offload the prompt processing with BLAS. You then get decent inference speed for a huge model (roughly equal to or better than anything except A100/H100), and also good prompt processing, as the kv cache should fit in the GPU memory.
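The back-of-the-envelope estimate can be written out as follows. The 96-core count for the 9654 is filled in here as an assumption, the other figures are the post's own rough numbers, and `scaled_tflops` is a helper invented for the sketch:

```python
def scaled_tflops(ref_tflops, ref_cores, cores):
    """Scale a reference throughput by core count, ignoring clock differences."""
    return ref_tflops * cores / ref_cores

# Assumed: the 9654 has 96 cores, so a 32-core part is a third of it by core count.
fp64_by_cores = scaled_tflops(5.4, ref_cores=96, cores=32)
fp64_est = 2.0            # the post rounds up for the higher clock speed
fp32_est = 2 * fp64_est   # fp32 taken as 2x fp64 with AVX
gpu_tflops = 36.0         # 3090 figure from the post

print(f"fp64 by core count alone: {fp64_by_cores:.2f} tflops")
print(f"two CPUs vs 3090: {2 * fp32_est / gpu_tflops:.2f}")  # roughly 1/5
print(f"one CPU vs 3090:  {fp32_est / gpu_tflops:.2f}")      # roughly 1/10
```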

As far as CPU prices go... the 9334 seems to range from about $700 (used, qualification samples) to $2700 (new), and would have the core count. A bit of a step up is the 9354, which has the full cache size. That might be relevant for inference.