harrro

joined 1 year ago
[–] harrro@alien.top 1 points 11 months ago

It's nice to see this when every other ToS we click through says the reverse...

"By using this service, you grant Meta/Google/Microsoft a perpetual, royalty free right to reprint, reproduce and use your content".

[–] harrro@alien.top 1 points 11 months ago

Yes, llama.cpp will automatically split the model across GPUs. You can also specify how much of the model goes on each GPU.

Not sure about AMD support, but for Nvidia it's pretty easy to do.
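For example, with the llama-cpp-python bindings it looks roughly like this (the model path and split ratios below are placeholders; the CLI equivalents are the -ngl and --tensor-split flags):

```python
# Minimal sketch with llama-cpp-python; adjust the path and ratios to your setup.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-70b.Q3_K_M.gguf",  # any GGUF model file
    n_gpu_layers=-1,        # offload every layer to the GPUs
    tensor_split=[12, 24],  # relative share per GPU, e.g. a 12GB + 24GB pair
)
print(llm("Q: What does llama.cpp do? A:", max_tokens=64)["choices"][0]["text"])
```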

[–] harrro@alien.top 1 points 11 months ago (2 children)

Using Q3, you can fit it in 36GB (I have a weird combo of an RTX 3060 with 12GB and a P40 with 24GB, and I can run a 70B at 3-bit fully on GPU).
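The back-of-the-envelope math checks out (rough sketch; the K-quant formats average a bit more than 3 bits per weight, and the KV cache adds more on top, which is why the real footprint lands near 36GB rather than the raw number below):

```python
# Rough lower bound for 70B weights at 3 bits per weight
# (ignores KV cache, activations, and quantization overhead).
params = 70e9
bits_per_weight = 3
print(params * bits_per_weight / 8 / 1024**3)  # ~24.4 GiB
```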

[–] harrro@alien.top 1 points 11 months ago (1 children)

Which Coqui model did you use? The new xtts2 model is excellent IMO.
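If anyone wants to try it, loading xtts2 looks roughly like this with the Coqui TTS package (a sketch; speaker.wav is a placeholder reference clip for voice cloning):

```python
# Minimal sketch with Coqui's TTS package and the XTTS v2 model.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")
tts.tts_to_file(
    text="Hello from XTTS v2.",
    speaker_wav="speaker.wav",  # short sample of the voice to clone
    language="en",
    file_path="out.wav",
)
```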

[–] harrro@alien.top 1 points 11 months ago

The repo doesn't contain a GGUF. Did you forget to upload it?

[–] harrro@alien.top 1 points 11 months ago (1 children)

I'm using LangChain with Qdrant as the vector store.
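The wiring is roughly this shape (a minimal sketch assuming the langchain-community package; the embedding model, text, and collection name are placeholders):

```python
# Minimal sketch: Qdrant as a LangChain vector store, in-memory for testing.
from langchain_community.vectorstores import Qdrant
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
store = Qdrant.from_texts(
    ["some chunk of your document"],
    embeddings,
    location=":memory:",   # or url="http://localhost:6333" for a real server
    collection_name="docs",
)
print(store.similarity_search("your question", k=3))
```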

> VRAM is full

How is a 7B model maxing out your VRAM? A 7B model at 4-bit with 4k context shouldn't come close to using the 12GB of VRAM on a 3060.
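For the rough numbers (ignoring the KV cache and CUDA overhead, which add a couple of GB at most at 4k context):

```python
# Back-of-envelope: weights for a 7B model at 4 bits per weight.
print(7e9 * 4 / 8 / 1024**3)  # ~3.3 GiB, far below 12GB
```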

[–] harrro@alien.top 1 points 11 months ago (3 children)

After the document/PDF is already indexed, generating a 256 token answer should take a few seconds (assuming you're using a 7-13B model).

Check that CUDA is actually being used (watch your video card's RAM usage to see whether the model is loaded into VRAM).
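One quick way to spot-check from Python (assumes nvidia-smi is on your PATH):

```python
# Query per-GPU memory use; run this while the model loads.
import subprocess

out = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,memory.used,memory.total",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(out.stdout)  # VRAM usage should jump by roughly the model's size
```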

[–] harrro@alien.top 1 points 11 months ago (1 children)

Seems like a miss by Microsoft's lawyers if they didn't check how the board and company were organized before making such a large investment.

And at this point, there are plenty of companies that would jump at the chance to invest in or get a controlling interest in OpenAI (and obviously they'd ask for a board seat at the very least): Google, Apple, even Meta.

[–] harrro@alien.top 1 points 1 year ago (2 children)

Any GGUF quantized download available?