mrjackspade

joined 10 months ago
[–] mrjackspade@alien.top 1 points 9 months ago

So I don't know much about the architecture, but I'm assuming that if we want to run something like this in Llama, we're going to need to submit a request? If it's built from the ground up, then pretty much everything is going to need to be implemented, right?

[–] mrjackspade@alien.top 1 points 9 months ago (1 children)

Switching to YARN is the best option I'm aware of at the moment.

YARN is basically dynamic alpha scaling with extra steps; it functions better without fine-tuning and also benefits from fine-tuning.

https://private-user-images.githubusercontent.com/567732/276779985-6b37697c-896e-4199-a541-a489b6fad213.png
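
For intuition, here's a rough C# sketch (my own illustration, not from the YaRN paper or the llama.cpp source; the head size and alpha values are made up) of what plain NTK "alpha" scaling does to the RoPE frequencies. YARN starts from the same idea, but blends interpolation and extrapolation per frequency band and adds an attention temperature, which is why it holds up better without fine-tuning and still improves with it.

```csharp
using System;

// Illustrative sketch of NTK "alpha" scaling of RoPE frequencies.
int headDim = 128;          // per-head dimension (assumed value)
double baseFreq = 10000.0;  // standard RoPE base
double alpha = 4.0;         // desired context-extension factor (assumed value)

// NTK-aware scaling: stretch the base so low-frequency dimensions are
// interpolated far more than high-frequency ones.
double scaledBase = baseFreq * Math.Pow(alpha, headDim / (double)(headDim - 2));

var theta = new double[headDim / 2];
for (int i = 0; i < headDim / 2; i++)
{
    // Same per-dimension frequency formula as vanilla RoPE, just with the
    // stretched base.
    theta[i] = Math.Pow(scaledBase, -2.0 * i / headDim);
}

Console.WriteLine($"theta[0]={theta[0]:G4}, theta[last]={theta[^1]:E2}");

// YARN ("NTK-by-parts") goes further: it ramps between pure interpolation and
// pure extrapolation per frequency band and rescales the attention logits,
// instead of applying one global alpha like this.
```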

[–] mrjackspade@alien.top 1 points 9 months ago

Gonna be honest, you can totally just skip LlamaSharp and call the Llama.dll methods using interop in C#.

It's really not difficult to do, and it cuts an entire layer of dependency out of your project.
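
If you want a starting point, this is roughly what the interop layer looks like. The entry-point names come from llama.h, but the exact signatures and struct layouts change between llama.cpp versions, so treat this as a sketch and mirror the header of the build you actually ship:

```csharp
using System;
using System.Runtime.InteropServices;

// Minimal sketch of P/Invoking llama.dll directly, no LlamaSharp in between.
internal static class LlamaNative
{
    private const string Lib = "llama"; // resolves to llama.dll / libllama.so

    [DllImport(Lib, CallingConvention = CallingConvention.Cdecl)]
    public static extern void llama_backend_init([MarshalAs(UnmanagedType.I1)] bool numa);

    [DllImport(Lib, CallingConvention = CallingConvention.Cdecl)]
    public static extern void llama_backend_free();

    // Model loading, tokenization and eval follow the same pattern: declare the
    // extern and mirror any structs (llama_model_params, llama_context_params, ...)
    // field-for-field from llama.h for the version you're linking against.
}

internal static class Program
{
    private static void Main()
    {
        LlamaNative.llama_backend_init(numa: false);
        // ... load the model, create a context, run inference ...
        LlamaNative.llama_backend_free();
    }
}
```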

[–] mrjackspade@alien.top 1 points 10 months ago

I actually don't know how much overhead that's going to be. I'd start by just kicking it off on the command line first as a proof of concept; it's super easy.
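
Something like this is all the proof of concept needs to be if the end goal is C#; the binary path, model path and flags are placeholders, so check the help output of your llama.cpp build for the exact options it supports:

```csharp
using System;
using System.Diagnostics;

// Quick-and-dirty proof of concept: launch llama.cpp's example binary from C#
// and read back whatever it prints. Paths and flags below are placeholders.
var psi = new ProcessStartInfo
{
    FileName = "./main",  // llama.cpp example binary (name/location depends on your build)
    Arguments = "-m ./models/model.Q5_K_M.gguf -p \"Once upon a time\"",
    RedirectStandardOutput = true,
    UseShellExecute = false,
};

using var proc = Process.Start(psi)!;
string output = proc.StandardOutput.ReadToEnd();
proc.WaitForExit();
Console.WriteLine(output);
```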

5_K_M is just the quantization I use. There's almost no loss of perplexity with 5_K_M, but it's also larger than the 4-bit quants most people use.

| Name | Quant method | Bits | Size | Max RAM required | Use case |
|------|--------------|------|------|------------------|----------|
| goat-70b-storytelling.Q2_K.gguf | Q2_K | 2 | 29.28 GB | 31.78 GB | smallest, significant quality loss - not recommended for most purposes |
| goat-70b-storytelling.Q3_K_S.gguf | Q3_K_S | 3 | 29.92 GB | 32.42 GB | very small, high quality loss |
| goat-70b-storytelling.Q3_K_M.gguf | Q3_K_M | 3 | 33.19 GB | 35.69 GB | very small, high quality loss |
| goat-70b-storytelling.Q3_K_L.gguf | Q3_K_L | 3 | 36.15 GB | 38.65 GB | small, substantial quality loss |
| goat-70b-storytelling.Q4_0.gguf | Q4_0 | 4 | 38.87 GB | 41.37 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| goat-70b-storytelling.Q4_K_S.gguf | Q4_K_S | 4 | 39.07 GB | 41.57 GB | small, greater quality loss |
| goat-70b-storytelling.Q4_K_M.gguf | Q4_K_M | 4 | 41.42 GB | 43.92 GB | medium, balanced quality - recommended |
| goat-70b-storytelling.Q5_0.gguf | Q5_0 | 5 | 47.46 GB | 49.96 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| goat-70b-storytelling.Q5_K_S.gguf | Q5_K_S | 5 | 47.46 GB | 49.96 GB | large, low quality loss - recommended |
| goat-70b-storytelling.Q5_K_M.gguf | Q5_K_M | 5 | 48.75 GB | 51.25 GB | large, very low quality loss - recommended |
| goat-70b-storytelling.Q6_K.gguf | Q6_K | 6 | 56.59 GB | 59.09 GB | very large, extremely low quality loss |
| goat-70b-storytelling.Q8_0.gguf | Q8_0 | 8 | 73.29 GB | 75.79 GB | very large, extremely low quality loss - not recommended |

[–] mrjackspade@alien.top 1 points 10 months ago (4 children)

If you're only getting 0.1 t/s, then you've probably overshot your layer offloading.

I can get up to 1.5 t/s with a 3090, at 5_K_M

Try running Llama.cpp from the command line with 30 layers offloaded to the GPU, and make sure your thread count is set to match your (physical) CPU core count.

The other problem you're likely running into is that 64 GB of RAM is cutting it pretty close. Make sure your base OS usage is below 8 GB if possible, and try memory-locking the model on load. With that amount of system RAM, it's possible you have other applications running that are causing the OS to page the model data out to disk, which kills performance.
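
To make that concrete, these are roughly the knobs I'm describing, written out as the arguments you'd pass to llama.cpp (the values are examples, not prescriptions; dividing the logical core count by two to get physical cores assumes SMT, so verify it for your CPU):

```csharp
using System;

// The tuning knobs described above, as llama.cpp command-line arguments.
int physicalCores = Environment.ProcessorCount / 2; // rough physical-core estimate on SMT CPUs

string args = string.Join(" ",
    "-m ./models/model.Q5_K_M.gguf", // placeholder model path
    "-ngl 30",                       // layers offloaded to the GPU; back off if VRAM overflows
    $"-t {physicalCores}",           // match physical, not logical, core count
    "--mlock");                      // lock the model in RAM so the OS can't page it to disk

Console.WriteLine(args); // pass this to ProcessStartInfo.Arguments or paste it after the binary
```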

[–] mrjackspade@alien.top 1 points 10 months ago (1 children)

/u/Noxusequal

I just went through this exact issue building Llama.cpp inside an Ubuntu Docker container. If you don't have nvcc installed, it will compile without error but won't include CUDA support, regardless of what options you set. Check to make sure nvcc is installed on the machine.

[–] mrjackspade@alien.top 1 points 10 months ago

If you're trying to do something novel as part of a learning experiment, just pick a good UI framework and then either wrap LlamaSharp, or interop directly with Llama.cpp using PInvoke.

Personally, I just use PInvoke to cut out the middleman.

As for a UI framework, I've been having a lot of fun with Avalonia lately.
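
If it helps, the Avalonia entry point is only a few lines; this is roughly the stock desktop template, where the App class is the Application subclass generated in the project template's App.axaml rather than anything model-specific:

```csharp
using System;
using Avalonia;

internal static class Program
{
    // Standard Avalonia desktop bootstrap from the default template.
    [STAThread]
    public static void Main(string[] args) =>
        BuildAvaloniaApp().StartWithClassicDesktopLifetime(args);

    public static AppBuilder BuildAvaloniaApp() =>
        AppBuilder.Configure<App>() // "App" comes from the template's App.axaml
                  .UsePlatformDetect()
                  .LogToTrace();
}
```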

You're not going to get 100% compatibility with all models using Llama.cpp as the core, but if this is a learning exercise, I'm sure that's not an issue.

That being said, if you really want to, you can fuck around with Python.net, but you may find yourself spending way more time trying to manage model interop and execution than you want to.

[–] mrjackspade@alien.top 1 points 10 months ago

I tried coaxing an answer out of it, and the furthest I got was:

  1. One random redditor's comment saying "next year" was announced at Meta Connect
  2. "It makes sense" due to the spacing between versions 1 and 2