Yes, llama.cpp will automatically split the model across GPUs, and you can also specify how much of the full model should go on each GPU.
Not sure about AMD support, but for Nvidia it's pretty easy to do.
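For reference, a minimal sketch with the llama-cpp-python bindings (assuming a CUDA build; the model path and the 1:2 split are placeholders for a two-GPU box, and the same options exist as the --n-gpu-layers / --tensor-split flags on the llama.cpp CLI):

```python
# Minimal sketch, assuming llama-cpp-python was built with CUDA support.
# The model path and the 1:2 split are placeholders for a two-GPU setup.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-70b.Q3_K_M.gguf",  # hypothetical local GGUF
    n_gpu_layers=-1,        # offload every layer to GPU
    tensor_split=[1, 2],    # ~1/3 of the model on GPU 0, ~2/3 on GPU 1
    n_ctx=4096,
)

out = llm("Q: What does tensor_split do?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
```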
Using Q3, you can fit it in 36GB (I have a weird combo of RTX 3060 with 12GB and P40 with 24GB and I can run a 70B at 3bit fully on GPU).
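Rough math, assuming Q3_K_M works out to about 3.8 bits per weight: 70B x 3.8 / 8 is roughly 33 GB of weights, which leaves only ~3 GB of the 36 GB for the KV cache and compute buffers, so it fits as long as the context is kept fairly small.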
Which coqui model did you use? The new xtts2 model is excellent IMO.
The repo doesn't contain a GGUF. Did you forget to upload it?
I'm using langchain with qdrant as the vector store.
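Roughly this shape, as a sketch (import paths move around between langchain versions; the PDF loader, embedding model, and collection name here are placeholders, not my exact setup):

```python
# Sketch of indexing a PDF into Qdrant with langchain.
# Assumes langchain, qdrant-client, pypdf and sentence-transformers are installed.
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Qdrant

docs = PyPDFLoader("report.pdf").load()  # hypothetical PDF
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
).split_documents(docs)

store = Qdrant.from_documents(
    chunks,
    HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2"),
    location=":memory:",       # or url="http://localhost:6333" for a running Qdrant
    collection_name="pdf_chunks",
)

print(store.similarity_search("What does the report conclude?", k=3))
```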
VRAM is full
How is a 7B model maxing out your VRAM? A 7B model at 4-bit with 4k context shouldn't come close to filling the 12GB of VRAM on a 3060.
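Rough numbers: at ~4.5 bits per weight a 7B model is about 4 GB of weights, and the KV cache at 4k context adds somewhere around 0.5-2 GB depending on the architecture, so the whole thing should sit around 5-6 GB.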
Once the document/PDF is indexed, generating a 256-token answer should only take a few seconds (assuming you're using a 7-13B model).
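For scale, assuming something like 30 tokens/s for a 7B Q4 model fully offloaded to a 12GB card, 256 tokens works out to roughly 8-9 seconds, plus a bit of prompt-processing time for the retrieved chunks.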
Check that CUDA is actually being used (watch your video card's RAM usage to see whether the model gets loaded into VRAM).
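If you'd rather check programmatically than eyeball nvidia-smi, here's a quick sketch using the nvidia-ml-py (pynvml) package, assuming the model sits on GPU 0:

```python
# Print current VRAM usage on GPU 0; the number should jump when the model loads.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # adjust index for other cards
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"VRAM used: {mem.used / 2**30:.1f} GiB / {mem.total / 2**30:.1f} GiB")
pynvml.nvmlShutdown()
```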
Seems like a miss from Microsoft's lawyers if they didn't check out how the board and company were organized before making such a large investment.
And at this point, there are plenty of companies that would jump at the chance to invest/get a controlling interest in OpenAI (and obviously they'd ask for a board seat at the very least) -- Google, Apple, even Meta.
Quantized GGUF here: https://huggingface.co/TheBloke/Tess-Medium-200K-v1.0-GGUF
And GPTQ: https://huggingface.co/TheBloke/Tess-Medium-200K-v1.0-GPTQ
Any GGUF quantized download available?
It's nice to see this when every other ToS we click through says the reverse:
"By using this service, you grant Meta/Google/Microsoft a perpetual, royalty free right to reprint, reproduce and use your content".