vikarti_anatra

joined 11 months ago
[–] vikarti_anatra@alien.top 1 points 9 months ago

That's why llama.cpp has the '--numa' option.

From my experience, the number of memory channels matters a lot, so all memory channels should be populated.
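For illustration, a minimal sketch of that kind of setup using the llama-cpp-python bindings (the model path and thread count are placeholders for your own machine, and the `numa` flag assumes a build/version that supports it):

```python
from llama_cpp import Llama

# Placeholder path - point this at your own GGUF file.
llm = Llama(
    model_path="models/openhermes-2.5-mistral-7b.Q5_K_M.gguf",
    n_ctx=8192,
    n_threads=16,  # raising this past the number of memory channels gains little
    numa=True,     # NUMA-aware allocation; relevant on dual-socket boards
)

out = llm("Hello,", max_tokens=32)
print(out["choices"][0]["text"])
```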

[–] vikarti_anatra@alien.top 1 points 9 months ago

some of my results:

System:

2x Xeon E5-2680v4 (28 cores / 56 threads total), 128 GB RAM

RTX 2060 6 GB via PCIe 3.0 x16

RTX 4060 Ti 16 GB via PCIe 4.0 x8

Windows 11 Pro

OpenHermes-2.5-AshhLimaRP-Mistral-7B (llama.cpp in text generation UI):

Q4_K_M, RTX 2060 6 GB, all 35 layers offloaded, 8k context - approx 3 t/s

Q5_K_M, RTX 4060 Ti 16 GB, all 35 layers offloaded, 32k context - approx 25 t/s

Q5_K_M, CPU-only, 8 threads, 32k context - approx 2.5-3.5 t/s

Q5_K_M, CPU-only, 16 threads, 32k context - approx 3-3.5 t/s

Q5_K_M, CPU-only, 32 threads, 32k context - approx 3-3.6 t/s

euryale-1.3-l2-70b (llama.cpp in text generation UI):

Q4_K_M, RTX 2060 + RTX 4060 Ti, 35 layers offloaded, 4K context - 0.6-0.8 t/s

goliath-120b (llama.cpp in text generation UI):

Q2_K, CPU-only, 32 threads - 0.4-0.5 t/s

Q2_K, CPU-only, 8 threads - 0.25-0.3 t/s

Noromaid-20b-v0.1.1 (llama.cpp in text generation UI):

Q5_K_M, RTX 2060 + RTX 4060 Ti, 65 layers offloaded, 4K context - approx 5 t/s

Noromaid-20b-v0.1.1 (exllamav2 in text generation UI):

3bpw-h8-exl2, RTX 2060 + RTX 4060 Ti, 8-bit cache, 4k context - approx 15 t/s (looks like it fits entirely on the 4060 Ti)

6bpw-h8-exl2, RTX 2060 + RTX 4060 Ti, 8-bit cache, 4k context, no flash attention, GPU split 12,6 - approx 10 t/s

Observations:

- the number of CPU cores matters very little in CPU-only mode

- the "numa" option does matter (I have 2 CPU sockets)

I would say: try to get another card?
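If it helps, here's roughly the shape of the dual-GPU offload configuration above as a llama-cpp-python sketch (the model filename, split ratio and layer count are assumptions you'd tune for your own cards):

```python
from llama_cpp import Llama

# Sketch of the 20B dual-GPU case: offload all layers and split the weights
# unevenly between the 16 GB 4060 Ti and the 6 GB 2060 (device order follows
# the CUDA device order on your system).
llm = Llama(
    model_path="models/noromaid-20b-v0.1.1.Q5_K_M.gguf",  # placeholder path
    n_ctx=4096,
    n_gpu_layers=65,           # all layers on GPU
    tensor_split=[16.0, 6.0],  # rough proportion of VRAM per device
    n_threads=8,               # CPU thread count barely matters once offloaded
)

print(llm("Test prompt", max_tokens=32)["choices"][0]["text"])
```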

[–] vikarti_anatra@alien.top 1 points 9 months ago (1 children)

I would be interested in using such a tool (especially if it's possible to pass custom options to llama.cpp and request custom models to be loaded).

Would it be possible to do something like this:

I provide a list of models: OpenHermes-2.5-Mistral-7B, Toppy-7B, OpenHermes-2.5-AshhLimaRP-Mistral-7B, Noromaid-v0.1.1-20B, Noromaid-v1.1-13B.

The tool downloads every model from HF in every quantization, runs the tests, and provides a table with the test results (including failed ones).
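Something like this rough sketch, maybe (the repo IDs, the filename pattern and the timing are simplified assumptions - a real tool would discover the available quantizations from HF and do proper error handling):

```python
import time
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Hypothetical model -> repo mapping; real repo IDs would come from a config.
MODELS = {"OpenHermes-2.5-Mistral-7B": "TheBloke/OpenHermes-2.5-Mistral-7B-GGUF"}
QUANTS = ["Q4_K_M", "Q5_K_M"]

results = []
for name, repo in MODELS.items():
    for quant in QUANTS:
        filename = f"{name.lower()}.{quant}.gguf"  # guessed naming scheme
        try:
            path = hf_hub_download(repo_id=repo, filename=filename)
            llm = Llama(model_path=path, n_ctx=4096, n_gpu_layers=35)
            t0 = time.time()
            out = llm("Once upon a time", max_tokens=128)
            tps = out["usage"]["completion_tokens"] / (time.time() - t0)
            results.append((name, quant, f"{tps:.2f} t/s"))
        except Exception as e:  # keep failed runs in the table too
            results.append((name, quant, f"FAILED: {e}"))

for row in results:
    print(row)
```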

[–] vikarti_anatra@alien.top 1 points 9 months ago

Unlikely. As far as I understood, the first limit is not even the matrix multiplication cores, it's memory bandwidth, and the solution for that is faster RAM and multi-channel memory.
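A rough back-of-the-envelope illustration (the numbers below are assumptions, not measurements): each generated token has to stream essentially the whole weight file from RAM, so the ceiling is roughly bandwidth divided by model size.

```python
# Rough upper bound: tokens/s ≈ memory bandwidth / model size,
# since every token reads (almost) all weights once.
def max_tps(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

# Assumed figures: quad-channel DDR4-2400 ≈ 4 * 19.2 GB/s; a Q4 70B ≈ 40 GB.
print(max_tps(4 * 19.2, 40.0))  # ≈ 1.9 t/s ceiling, before any other overhead
```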

[–] vikarti_anatra@alien.top 1 points 9 months ago

Interesting idea.

[–] vikarti_anatra@alien.top 1 points 10 months ago

!remindme 7 days

[–] vikarti_anatra@alien.top 1 points 10 months ago

!remindme 7 days

[–] vikarti_anatra@alien.top 1 points 10 months ago

They think it's not a problem for them, because:

  • they have nothing to hide
  • they don't think CF (or the TLAs who have access) will use it against them (possible examples: Ukrainian sites, or Russian sites that disagree with the government on at least some things)
  • they think alternatives are worse - it's...rather difficult to make CF censor you.
  • they only use CF's DNS services and not other things
  • It's just easier this way

This reminds me of the current situation with "AI": there is OpenAI/Anthropic with their APIs (requests are sent via HTTPS, but OpenAI/Anthropic don't just need access to do their work - they also censor it). There are paid alternatives that either host proxies for OpenAI/Anthropic/others (like OpenRouter.ai) or host local models for others (hosting requires significant resources, which sit unused if you don't query often). There are ways to host locally at home if you can. Some people prefer not to use local hosting even when they can.

[–] vikarti_anatra@alien.top 1 points 10 months ago

Just my thoughts on this:

Would be great.

Would be rather limited but possible (thanks to https://llm.mlc.ai/ and increasing memory).

A lot of CHEAP Chinese devices will say they can actually do it. They will - at 2-bit quantization and <1 t/s, with 7B models or even smaller. They will be unusable.

Google says it's not necessary because you can use their Firebase services for AI, and you can use NNAPI anyway. You must also censor your LLM-using apps on the Play Store to adhere to their rules.

Apple says it's not necessary; later they will advertise it as a very good thing and provide optimized libraries and some pretrained models, but you'll need to buy the latest iPhone (last year's won't work, because Apple). You must also censor your apps AND mark them as 18+.

Areas of usage?

- Language translation (including voice-to-voice). Basically a much-improved Google Translate.

- AI assistant (basically a MUCH improved Siri, used not only as a command interface).

[–] vikarti_anatra@alien.top 1 points 10 months ago (1 children)

How do you actually use it at home? 3 or 4 old Tesla P40s from eBay/local alternatives? Just CPU?