Inferencing with AMD X3D Processors (alien.top)

submitted 2 years ago by ccbadd@alien.top to c/localllama@poweruser.forum

With the proof of concept done and users able to get over 180 GB/s on a PC with AMD's 3D V-Cache, it sure would be nice if we could figure out a way to use that bandwidth for CPU-based inferencing. I think it only worked on Windows, but if that is the case we should be able to come up with a way to do it under Linux too.

[–] tu9jn@alien.top 1 points 2 years ago (1 children)

V-Cache only helps when you want to access lots of tiny chunks of data that fit inside the 96-128 MB cache.

During inference you have to read the entire multi-GB model for each generated token, so your bottleneck is still the RAM bandwidth.
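
To put rough numbers on that bottleneck, here is a minimal back-of-the-envelope sketch. It assumes every generated token needs one full read of the weights and ignores compute, caching, and batching, so the result is only an upper bound. The function name, the 4 GB model size, and the 60 GB/s RAM figure are hypothetical values picked for illustration; 180 GB/s is the RAM-disk figure from the post, and whether it would carry over to a full pass over model weights is exactly what is in question here.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on tokens/s if each token requires reading all weights once."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# Hypothetical model size: "several GB" in the comment, 4 GB assumed here.
MODEL_SIZE_GB = 4.0

# Bandwidth figures in GB/s: the first is an assumption, the second is the
# RAM-disk number quoted in the post.
scenarios = [
    ("assumed system RAM", 60.0),
    ("RAM-disk figure from the post", 180.0),
]

for label, bandwidth in scenarios:
    bound = max_tokens_per_second(bandwidth, MODEL_SIZE_GB)
    print(f"{label}: at most {bound:.1f} tokens/s")
```

The bound scales linearly with bandwidth and inversely with model size, which is why a cache much smaller than the model does not change it unless the weights themselves can be served from that cache.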

[–] ccbadd@alien.top 1 points 2 years ago

The article says that is what was expected, but the gains applied to the entire RAM drive, and the concept has been proven now. The test used a 500 MB+ block, so bigger than the cache alone.

https://www.tomshardware.com/news/amd-3d-v-cache-ram-disk-182-gbs-12x-faster-pcie-5-ssd

[–] FlishFlashman@alien.top 1 points 2 years ago (1 children)

180GB/s isn't really all that fast.

[–] ccbadd@alien.top 1 points 2 years ago

Maybe, but it's a lot faster than what we can do right now, and it's only the start.
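
For context on how 180 GB/s compares with ordinary desktop memory, here is a small sketch of the usual theoretical peak formula: channels × bytes per transfer × transfers per second. The dual-channel DDR5-6000 configuration is an assumption chosen for illustration, not something stated in the thread, and real sustained bandwidth is lower than this theoretical peak.

```python
def peak_dram_bandwidth_gb_s(channels: int, bytes_per_transfer: int,
                             mega_transfers_per_s: int) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 1000 MB)."""
    return channels * bytes_per_transfer * mega_transfers_per_s / 1000


# Assumed desktop setup: 2 channels, 64-bit (8-byte) bus each, DDR5-6000.
ddr5_peak = peak_dram_bandwidth_gb_s(channels=2, bytes_per_transfer=8,
                                     mega_transfers_per_s=6000)

# Figure quoted in the post for the V-Cache RAM-disk test.
reported_gb_s = 180.0

print(f"assumed dual-channel DDR5-6000 peak: {ddr5_peak:.1f} GB/s")
print(f"reported figure is {reported_gb_s / ddr5_peak:.2f}x that peak")
```

Under these assumptions the reported figure is a bit under twice the theoretical DRAM peak, which fits both views above: a clear step up from system RAM, but not an order of magnitude.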

[–] FaustBargain@alien.top 1 points 2 years ago

So there are CPU intrinsics for prefetching data. If we can get better at anticipating the next pieces of data that need to be calculated, we can sprinkle in those prefetch instructions and achieve more speed.

[–] mcmoose1900@alien.top 1 points 2 years ago

There are actually TSVs for 3D Cache on the AMD 7900 series, but AMD doesn't use them, presumably because stacking the cache would make the chip run hotter, so they'd have to downclock it.

But I think it would be a great candidate for an ML card. Not for directly accelerating models, but for basically fitting any kind of intermediate calculations in cache to preserve all the RAM bandwidth for model weights.