this post was submitted on 28 Nov 2023
LocalLLaMA
Community for discussing Llama, the family of large language models created by Meta AI.
3D V-Cache only helps when you access lots of small chunks of data that fit inside the 96-128 MB cache.
During inference you have to read the entire multi-GB model for every generated token, so your bottleneck is still RAM bandwidth.
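A rough back-of-the-envelope sketch of that limit (Python): the model size and DDR5 bandwidth below are assumed example numbers, not from the post; only the 182 GB/s figure comes from the linked article.

```python
# Back-of-the-envelope: if every generated token has to stream all model
# weights from memory, token rate is capped at roughly bandwidth / model size.

def bandwidth_bound_tokens_per_sec(model_size_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s when weight reads dominate (simplifying assumption)."""
    return bandwidth_gb_s / model_size_gb

# Assumed example: a ~4 GB quantized model, ~80 GB/s dual-channel DDR5 (rough guess).
print(bandwidth_bound_tokens_per_sec(4.0, 80.0))   # ~20 tokens/s on plain RAM
# 182 GB/s is the RAM-disk throughput reported in the linked Tom's Hardware article.
print(bandwidth_bound_tokens_per_sec(4.0, 182.0))  # ~45 tokens/s if weights streamed at that rate
```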
In the article they said that was what they expected, but the gains applied to the entire RAM disk, so the concept has now been proven. The test used a 500 MB+ block, which is larger than the cache alone.
https://www.tomshardware.com/news/amd-3d-v-cache-ram-disk-182-gbs-12x-faster-pcie-5-ssd