Yes, llama.cpp will automatically split the model across GPUs, and you can also specify how much of the full model should go on each GPU.
Not sure about AMD support, but for Nvidia it's pretty easy to do.
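For reference, a minimal sketch with the llama-cpp-python bindings (assuming a CUDA build; the model path and the 1:2 split are placeholders for a two-GPU box, and the same options exist as the --n-gpu-layers / --tensor-split flags on the llama.cpp CLI):

```python
# Minimal sketch, assuming llama-cpp-python was built with CUDA support.
# The model path and the 1:2 split are placeholders for a two-GPU setup.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-70b.Q3_K_M.gguf",  # hypothetical local GGUF
    n_gpu_layers=-1,        # offload every layer to GPU
    tensor_split=[1, 2],    # ~1/3 of the model on GPU 0, ~2/3 on GPU 1
    n_ctx=4096,
)

out = llm("Q: What does tensor_split do?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
```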
Using Q3, you can fit it in 36GB (I have a weird combo of RTX 3060 with 12GB and P40 with 24GB and I can run a 70B at 3bit fully on GPU).
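Rough math, assuming Q3_K_M works out to about 3.8 bits per weight: 70B x 3.8 / 8 is roughly 33 GB of weights, which leaves only ~3 GB of the 36 GB for the KV cache and compute buffers, so it fits as long as the context is kept fairly small.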
Which coqui model did you use? The new xtts2 model is excellent IMO.
The repo doesn't contain a GGUF. Did you forget to upload it?
I'm using langchain with qdrant as the vector store.
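Roughly this shape, as a sketch (import paths move around between langchain versions; the PDF loader, embedding model, and collection name here are placeholders, not my exact setup):

```python
# Sketch of indexing a PDF into Qdrant with langchain.
# Assumes langchain, qdrant-client, pypdf and sentence-transformers are installed.
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Qdrant

docs = PyPDFLoader("report.pdf").load()  # hypothetical PDF
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
).split_documents(docs)

store = Qdrant.from_documents(
    chunks,
    HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2"),
    location=":memory:",       # or url="http://localhost:6333" for a running Qdrant
    collection_name="pdf_chunks",
)

print(store.similarity_search("What does the report conclude?", k=3))
```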
VRAM is full
How is a 7B model maxing out your VRAM? A 7B model at 4-bit with 4k context shouldn't come close to filling the 12GB of VRAM on a 3060.
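Rough numbers: at ~4.5 bits per weight a 7B model is about 4 GB of weights, and the KV cache at 4k context adds somewhere around 0.5-2 GB depending on the architecture, so the whole thing should sit around 5-6 GB.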
Once the document/PDF is indexed, generating a 256-token answer should only take a few seconds (assuming you're using a 7-13B model).
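For scale, assuming something like 30 tokens/s for a 7B Q4 model fully offloaded to a 12GB card, 256 tokens works out to roughly 8-9 seconds, plus a bit of prompt-processing time for the retrieved chunks.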
Check that CUDA is actually being used (watch your video card's RAM usage to see whether the model gets loaded into VRAM).
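If you'd rather check programmatically than eyeball nvidia-smi, here's a quick sketch using the nvidia-ml-py (pynvml) package, assuming the model sits on GPU 0:

```python
# Print current VRAM usage on GPU 0; the number should jump when the model loads.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # adjust index for other cards
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"VRAM used: {mem.used / 2**30:.1f} GiB / {mem.total / 2**30:.1f} GiB")
pynvml.nvmlShutdown()
```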
Seems like a miss from Microsoft's lawyers if they didn't check out how the board and company were organized before making such a large investment.
And at this point, there are plenty of companies that would jump at the chance to invest/get a controlling interest in OpenAI (and obviously they'd ask for a board seat at the very least) -- Google, Apple, even Meta.
Quantized GGUF here: https://huggingface.co/TheBloke/Tess-Medium-200K-v1.0-GGUF
And GPTQ: https://huggingface.co/TheBloke/Tess-Medium-200K-v1.0-GPTQ
Any GGUF quantized download available?
It's nice to see this when every other ToS we click through says the reverse:
"By using this service, you grant Meta/Google/Microsoft a perpetual, royalty free right to reprint, reproduce and use your content".