TheTerrasque

[–] TheTerrasque@alien.top 1 points 11 months ago

70B? Q4 quant, llama.cpp, with some layers offloaded to the GPU.

You might need to run Linux to get system RAM usage low enough.
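
Something like this, as a minimal sketch using the llama-cpp-python bindings (not mentioned above; the model path and layer count are placeholders you'd tune to your hardware):

```python
from llama_cpp import Llama

# Load a Q4-quantized 70B GGUF and offload part of the layers to the GPU.
# model_path and n_gpu_layers are placeholders - raise n_gpu_layers until
# you run out of VRAM; the remaining layers stay in system RAM.
llm = Llama(
    model_path="models/llama-2-70b.Q4_K_M.gguf",  # hypothetical path
    n_gpu_layers=40,   # layers offloaded to the GPU; 0 = CPU only
    n_ctx=4096,        # context window
)

out = llm("Q: What is the capital of France? A:", max_tokens=32)
print(out["choices"][0]["text"])
```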

[–] TheTerrasque@alien.top 1 points 11 months ago

I don't know of an alternative, but I did some experimenting with it. I kinda rewrote large parts of it, and I also used a custom build of the llama.cpp DLLs. I'm pretty sure it'll still work with the newest llama.cpp build; you might just need to update some native calls if they've been expanded or renamed.

My changes are at https://github.com/TheTerrasque/LLamaSharp/tree/feature/clblast - I haven't really documented it much, but maybe the git history will help.
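
To spot which native calls need updating, a quick hedged sketch (Python/ctypes here rather than C#, and the symbol names are just examples from llama.cpp's C API that have changed between versions, so treat the list as illustrative):

```python
import ctypes

# Load the native library (llama.dll on Windows, libllama.so on Linux).
lib = ctypes.CDLL("./libllama.so")

# Example entry points a binding might depend on - names and signatures
# have shifted across llama.cpp versions, which is exactly the problem.
expected = [
    "llama_backend_init",
    "llama_load_model_from_file",
]

for name in expected:
    status = "ok" if hasattr(lib, name) else "MISSING - update the native call"
    print(f"{name}: {status}")
```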

[–] TheTerrasque@alien.top 1 points 11 months ago (1 children)

Well, it gets posted a few times a week, so it kinda is...

[–] TheTerrasque@alien.top 1 points 1 year ago (1 children)

Transferring the state over the internet so the next card can take over is slow. You'd want cards that can each hold a lot of layers, to minimize how often that transfer happens.

In other words, you want a few big GPUs in the network, not a bunch of small ones.
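
A rough back-of-envelope makes the point (the numbers are assumptions, not from the thread: Llama-2-70B's hidden size of 8192, fp16 activations, a 20 Mbit/s uplink, and 50 ms round-trip per hop):

```python
# Back-of-envelope: network cost of pipeline-splitting a 70B model.
# Assumed figures: hidden size 8192 (Llama-2-70B), fp16 activations,
# 20 Mbit/s uplink, 50 ms round-trip latency per hop.
hidden_size = 8192
bytes_per_token = hidden_size * 2   # fp16 activations at one split point
uplink_bps = 20e6 / 8               # 20 Mbit/s in bytes per second
rtt_s = 0.050                       # round-trip latency per hop

def net_seconds_per_token(num_splits: int) -> float:
    # Each generated token crosses every split point once, sequentially,
    # so transfer time and latency both multiply by the number of splits.
    per_split = bytes_per_token / uplink_bps + rtt_s
    return num_splits * per_split

for splits in (1, 3, 7):
    print(f"{splits} split(s): ~{net_seconds_per_token(splits) * 1000:.0f} ms/token")
# ~57 ms/token at 1 split vs ~396 ms/token at 7 splits, before any compute:
# the latency per hop dominates, so fewer, bigger GPUs win.
```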