__SlimeQ__

joined 1 year ago
[–] __SlimeQ__@alien.top 1 points 11 months ago

I'm not doing that, but my guess is it's fun, easy, and cheap to do (only $8/mo!), and potentially lucrative if you can cheese a following somehow.

Using GPT is really lazy though, when it's so easy to do a custom 13B LoRA that will actually interact like a human.

[–] __SlimeQ__@alien.top 1 points 11 months ago

I use C#. Initially I'd gone all out trying to wrap Llama.cpp myself, but my wrapper was getting outdated in a matter of weeks and it was going to take a ton of effort to keep up.

So instead I run a local ooba server and use its API. I get to do all my business logic in nice, structured C#, while all the Python stuff stays in ooba and I don't really have to dig into it at all.
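
Roughly, the call looks like this (a minimal sketch, assuming ooba was launched with --api and is exposing its OpenAI-compatible endpoint on the default port; adjust the URL and fields for your build):

```csharp
// Sketch: talking to a local ooba (text-generation-webui) server from C#.
// Assumes the server was launched with --api and serves the OpenAI-compatible
// route at http://127.0.0.1:5000/v1/chat/completions -- adjust for your setup.
using System;
using System.Net.Http;
using System.Net.Http.Json;
using System.Text.Json;
using System.Threading.Tasks;

class OobaClient
{
    static readonly HttpClient Http = new()
    {
        BaseAddress = new Uri("http://127.0.0.1:5000")
    };

    public static async Task<string> ChatAsync(string userMessage)
    {
        var request = new
        {
            messages = new[] { new { role = "user", content = userMessage } },
            max_tokens = 300,
            temperature = 0.7
        };

        var response = await Http.PostAsJsonAsync("/v1/chat/completions", request);
        response.EnsureSuccessStatusCode();

        // Pull choices[0].message.content out of the JSON reply.
        using var doc = JsonDocument.Parse(await response.Content.ReadAsStringAsync());
        return doc.RootElement
                  .GetProperty("choices")[0]
                  .GetProperty("message")
                  .GetProperty("content")
                  .GetString() ?? "";
    }
}
```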

[–] __SlimeQ__@alien.top 1 points 11 months ago (1 children)

Let me start off by saying I haven't gotten this right yet. But still...

When AutoGPT went viral, the thing everyone started talking about was vector DBs and how they can magically extend the context window. This was not a very well-informed idea, and implementations have been lacking.

It turns out that merely finding similar messages in the history and dumping them into the context is not enough. While this may sometimes give you a valuable nugget, most of the time it will just fill the context with repetitive garbage.

What you really need for this to work, imo, is a structured narrative around finding the data, reading it, and reporting it. LLMs respond extremely poorly to random, disconnected dialogue; they don't know what to do with it. So for one thing, you'll need a reasonable amount of pre-context for each data point so that the bot can even understand what's being talked about. But now this is prohibitively long: 4 or 5 matches on your search and your context is probably full. So you'll need to do some summarizing before squeezing it into the live conversation, which means your request takes 2x longer at a minimum, and then you need to weave that result into your chat context in as natural a way as possible.
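
In code, the shape of it is something like this (just a sketch of the flow, not a finished implementation; `Complete` is a stand-in for however you actually call the model, e.g. the ooba API):

```csharp
// Sketch of the two-stage flow: summarize the retrieved snippets (each with
// some pre-context) in a separate request, then weave the summary into the
// live chat as a "memory" rather than dumping raw, disconnected dialogue.
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

class RagNarrative
{
    readonly Func<string, Task<string>> Complete; // your model call goes here

    public RagNarrative(Func<string, Task<string>> complete) => Complete = complete;

    public async Task<string> BuildPromptAsync(
        string userMessage,
        IReadOnlyList<string> retrievedSnippets, // each already includes a few lines of pre-context
        string recentChat)
    {
        // Stage 1: compress the matches so they don't flood the context window.
        var summaryPrompt =
            "Summarize whatever is relevant to \"" + userMessage + "\" in these old messages:\n\n" +
            string.Join("\n---\n", retrievedSnippets);
        var summary = await Complete(summaryPrompt);

        // Stage 2: present the summary as something the bot remembers, woven
        // into the conversation as naturally as possible.
        return recentChat +
               "\n[Memory: " + summary + "]" +
               "\nUser: " + userMessage +
               "\nBot:";
    }
}
```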

Honestly, RAG as a task is so weird that I no longer expect any general model to be capable of it, especially not 7B/13B ones. Even GPT-4 can just barely do it. I think with a very clever dataset somebody could make an effective RAG LoRA, but I've yet to see it.

[–] __SlimeQ__@alien.top 1 points 11 months ago

I have the 16GB version of the 4060 Ti, so the cards have nearly identical capabilities.

[–] __SlimeQ__@alien.top 1 points 11 months ago (4 children)

I can't speak for the desktop 3080 Ti, but I have that laptop card and it's roughly equivalent in performance to my 4060 Ti desktop card.

[–] __SlimeQ__@alien.top 1 points 11 months ago

I did this 1.1B TinyLlama quant last week. It's not smart by any means, but it can hold a conversation somewhat. And, crucially, it's 600MB.


I've been in need of a dedicated training rig lately, so I've been looking at GPUs. For context, I'm already running training on a 16GB 3080 Ti laptop and inference on a 16GB 4060 Ti. Both are really just fine for 13B models.

When I look at cards, though, it appears I could buy nearly 4 more 16GB 4060 Ti cards for the price of a 24GB 4090.

I understand that the 4090 is potentially 2-3 times faster based on benchmarks, but does this actually translate to improved Llama speeds? Would it even be viable to go for dual 4060 Tis instead?

Currently I'm standardized on 16GB/13B/4-bit, but I'd love to push beyond that, have more VRAM for training, etc. What are my options?

[–] __SlimeQ__@alien.top 1 points 11 months ago (2 children)

My personal LoRA does this just because it was trained on actual human conversations. It's super unnatural for people to try answering just any off-the-wall question; most people will just go "lmao" or "idk, wtf", and if you methodically strip that from the data (like most instruct datasets do), it makes the bots act weird as hell.

[–] __SlimeQ__@alien.top 1 points 11 months ago (1 children)

If you're not using a chat/instruct-tuned model, you should be using the notebook tab, since the input that the chat tab creates will be chat/instruct formatted.
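
For example, with an Alpaca-style template selected, typing "hello" into the chat tab gets wrapped into something roughly like this before it hits the model (the exact wrapper depends on the instruction template you've picked), which a plain base model was never trained on:

```
Below is an instruction that describes a task. Write a response that
appropriately completes the request.

### Instruction:
hello

### Response:
```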

[–] __SlimeQ__@alien.top 1 points 11 months ago

I can't speak for 1B models, but you're going to have a really hard time training with no GPU. It's just going to take an insanely long time.

For $500, though, you can get a 4060 Ti with 16GB of VRAM, which is good enough to train a 13B LoRA.

[–] __SlimeQ__@alien.top 1 points 11 months ago

Without knowing what you've tried, it's impossible to really know what to recommend. Tiefighter or OpenHermes 2.5 is probably your best bet.

[–] __SlimeQ__@alien.top 1 points 11 months ago

Mistral, Llama 2, and the original Llama are foundation models, meaning all of their weights were actually trained from scratch. Almost anything worth using is a derivative of these three foundation models. They are really expensive to train.

Just about everything else is a LoRA fine tune on top of one of them. Fine tunes only change a small fraction of the weights, something like 1%. Functionally speaking, the important part of these is the additional data they were trained on, and that training can be done on any underlying model.
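
For concreteness, a LoRA adapter leaves the original weight matrix frozen and just adds a low-rank update next to it:

```latex
% LoRA: the pretrained weight W stays frozen; only the low-rank factors A, B are trained.
% W \in \mathbb{R}^{d \times k}, \quad B \in \mathbb{R}^{d \times r}, \quad A \in \mathbb{R}^{r \times k}, \quad r \ll \min(d, k)
W' = W + \frac{\alpha}{r} B A
% trainable parameters per adapted matrix: r(d + k) instead of d \cdot k
```

So instead of the d·k parameters in W, you only train r(d+k). With illustrative numbers, say d = k = 5120 and r = 8, that's about 82k trainable parameters sitting next to ~26M frozen ones, roughly 0.3% for that matrix; the overall fraction depends on the rank and on how many matrices get adapters, which is why it lands in the low single digits of a percent.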

So OpenHermes is a LoRA tune on top of Mistral, and it's an open-source offshoot of Nous Hermes, which is an instruction dataset for giving good, smart answers (or something) in a given instruction format.
