tu9jn

joined 10 months ago
[–] tu9jn@alien.top 1 points 9 months ago (1 children)

V-Cache only helps when you access lots of small chunks of data that fit inside the 96-128 MB cache.

During inference you have to read the entire multi-GB model for every generated token, so your bottleneck is still the RAM bandwidth.
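As a rough sanity check (the bandwidth and model size numbers below are just assumed examples, not measurements): each generated token has to stream the whole weight file out of RAM once, so the bandwidth puts a hard ceiling on tokens/s.

```python
# Back-of-the-envelope ceiling on CPU token generation.
# All numbers here are illustrative assumptions, not measurements.

def peak_bandwidth_gb_s(channels: int, mt_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical DRAM bandwidth: channels * transfer rate * bus width (bytes)."""
    return channels * mt_per_s * bus_bytes / 1e3   # MT/s * bytes = MB/s -> GB/s

def tokens_per_s_ceiling(bandwidth_gb_s: float, model_gb: float) -> float:
    """Every token reads the full weight file once, so t/s <= bandwidth / size."""
    return bandwidth_gb_s / model_gb

bw = peak_bandwidth_gb_s(channels=8, mt_per_s=3200)   # 8ch DDR4-3200: ~205 GB/s
print(f"peak bandwidth ~ {bw:.0f} GB/s")
print(f"70b q4_k_m (~40 GB): <= {tokens_per_s_ceiling(bw, 40):.1f} t/s")
```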

[–] tu9jn@alien.top 1 points 9 months ago

I have 64 cores with 8-channel RAM; if I use more than 24-32 cores the speed drops somewhat.

This is for token generation; prompt processing benefits from all the threads.

But it is much better to spend your money on GPUs than CPU cores. I have 3x Radeon MI25s in an i9 9900K box, and that is more than twice as fast as the 64-core EPYC build.
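For the generation/prompt thread split above, here is a minimal sketch using the llama-cpp-python bindings (I'm assuming that wrapper and its n_threads / n_threads_batch parameters; the model path is hypothetical):

```python
# Minimal sketch: separate thread counts for generation vs. prompt processing.
# Assumes the llama-cpp-python package; the model path is hypothetical.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/70b.q4_k_m.gguf",  # hypothetical path
    n_threads=24,        # token generation: more than ~24-32 cores starts to hurt
    n_threads_batch=64,  # prompt processing: scales with all the cores
)

out = llm("Q: Why is token generation bandwidth bound?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
```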

[–] tu9jn@alien.top 1 points 9 months ago (1 children)

Usually the number of parameters matters more than the bits per weight, but I had some problems with really low-bpw models like 70b at 2.55 bpw with exllamav2.

34b Yi could be a good compromise; I am impressed with it, and it has a long context length as well.
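For a rough size comparison (the parameter counts and bpw values below are approximate assumptions):

```python
# Approximate weight footprint: params * bits_per_weight / 8, ignoring overhead.
def weight_gb(params_billion: float, bpw: float) -> float:
    return params_billion * bpw / 8   # billions of params * bits / 8 = GB

for name, params, bpw in [
    ("70b @ 2.55 bpw exl2", 70, 2.55),
    ("34b Yi @ 4.65 bpw",   34, 4.65),
]:
    print(f"{name:22s} ~ {weight_gb(params, bpw):4.1f} GB")
```

So a 34b at a saner bpw lands around the same footprint as a heavily squeezed 70b.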

[–] tu9jn@alien.top 1 points 9 months ago

You won't get banned from a local model for asking the wrong questions, and GPT-4 has an hourly limit as well.

If you already have the hardware, why not try it? It's literally free.

[–] tu9jn@alien.top 1 points 10 months ago

I run 3x MI25s; a 70b q4_k_m model starts at 7 t/s and slows to ~3 t/s at full context. A 7b f16 model does about 18 t/s. As far as I know, the MI series only has Linux drivers.