overview for ihexx

What do these words mean? Hermes, OpenHermes, OpenChat, Vicuna, Alpaca, Orca, OpenOrca, Airoboros, Synthia, Guanaco, Dolphin, Samantha, Synthia, ... in c/localllama@poweruser.forum

ihexx@alien.top 1 points 2 years ago

They are just made-up names. People name their projects whatever they like. Sometimes the name is related to the prior work it's based on (like the underlying model or dataset), but otherwise it's arbitrary.

Why can't we just run local reinforcement learning? in c/localllama@poweruser.forum

ihexx@alien.top 1 points 2 years ago

There are lots of different kinds of RL algorithms, each with different requirements.

In general though, the tradeoff you're making is data efficiency vs. compute complexity.

On one end, evolutionary and other gradient-free optimization methods are simple but data hungry.

On the other end, things like model-based RL (e.g. building reward models to train your generator model) are more data efficient, but they are more complex since they have more moving parts and more live models to train.

So to answer:

"Seriously though, what makes it require more VRAM than regular inference? You're still loading the same model, aren't you?"

No, on the model-based end, you're training at least 2 models: the generator and the reward model.
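To put rough numbers on that, here is a back-of-envelope sketch. The byte counts are assumptions, not measurements: 2 bytes per parameter for fp16 inference weights, and about 16 bytes per parameter for mixed-precision training with Adam (fp16 weights and gradients, plus fp32 master weights, momentum and variance). Activations, KV cache and framework overhead are ignored, and the 7B model size is just an example.

```python
GB = 1000 ** 3  # decimal gigabytes, to keep the arithmetic readable

def inference_bytes(n_params, bytes_per_weight=2):
    # Assumption: fp16 weights only, no activations or KV cache.
    return n_params * bytes_per_weight

def training_bytes(n_params, bytes_per_param=16):
    # Assumption: mixed-precision Adam, i.e. 2 (weights) + 2 (grads)
    # + 4 (fp32 master weights) + 4 (momentum) + 4 (variance) bytes.
    return n_params * bytes_per_param

n_params = 7_000_000_000  # hypothetical 7B-parameter model

one_model_inference = inference_bytes(n_params)
two_models_training = 2 * training_bytes(n_params)  # generator + reward model

print(f"inference, one model:  {one_model_inference / GB:.0f} GB")
print(f"training, two models:  {two_models_training / GB:.0f} GB")
```

Under these assumptions the gap comes mostly from gradients and optimizer state, and it doubles again once a second model is being trained alongside the first.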

On the evolutionary and gradient-free end, you need far more data than supervised learning. Reinforcement learning doesn't tell the agent what to do at every time step; feedback arrives only after N time steps, so you're getting roughly 1/Nth of the training signal per step compared to supervised learning.
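A minimal sketch of the gradient-free end, to show where the data goes. This is a plain random-search hill climber, not any particular published evolutionary algorithm, and the toy task, `episode_return` and all parameter values are made up for illustration. The point is that each candidate costs a full episode and gets back only one number.

```python
import random

def episode_return(params, n_steps=20):
    # Hypothetical toy task: the "action" at step t is params[t % len(params)]
    # and the ideal action is 1.0. Only the summed score comes back at the
    # end of the episode; there is no per-step feedback.
    return -sum((params[t % len(params)] - 1.0) ** 2 for t in range(n_steps))

def evolve(n_params=4, population=20, generations=50, sigma=0.1, seed=0):
    rng = random.Random(seed)
    best = [0.0] * n_params
    best_score = episode_return(best)
    episodes = 1  # count every full episode we had to run
    for _ in range(generations):
        for _ in range(population):
            # Mutate the current best with Gaussian noise; no gradients used.
            candidate = [p + rng.gauss(0.0, sigma) for p in best]
            score = episode_return(candidate)
            episodes += 1
            if score > best_score:
                best, best_score = candidate, score
    return best, best_score, episodes

best, best_score, episodes = evolve()
print(f"episodes run: {episodes}, env steps: {episodes * 20}")
```

With these settings that is 1001 episodes, or 20,020 environment steps, to tune four numbers. The code is about as simple as training code gets, but it scales badly once the parameter count is in the billions.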

Basically, we GPU poors are in the weird position where anything we can train under these limitations would probably perform worse than just training a larger model on supervised datasets.

[N] Why Gym/Gymnasium removed done from the step function in c/machinelearning@academy.garden

ihexx@alien.top 1 points 2 years ago

Yeah, it makes sense to standardize this. Before, you had to do weird hacks, unique to each environment, to figure out what 'done' meant.

And if you want the old API back, it's trivial to write a wrapper.
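Something like this would do it. It's a sketch, not an official compatibility layer: `OldStepAPI` and the toy `CountdownEnv` are hypothetical names, and the wrapper only assumes the new-style signatures, where `reset()` returns `(obs, info)` and `step()` returns `(obs, reward, terminated, truncated, info)`. Storing the truncation flag under `"TimeLimit.truncated"` follows the convention of old Gym's time-limit wrapper as I understand it; treat the key name as an assumption.

```python
class OldStepAPI:
    """Expose the old 4-tuple step API on top of a new-style environment."""

    def __init__(self, env):
        self.env = env

    def reset(self, **kwargs):
        # New-style reset returns (obs, info); old-style returned obs only.
        obs, _info = self.env.reset(**kwargs)
        return obs

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        done = bool(terminated or truncated)
        info = dict(info)  # copy so we don't mutate the env's own dict
        # Keep the distinction around for code that still needs it.
        info["TimeLimit.truncated"] = bool(truncated and not terminated)
        return obs, reward, done, info


class CountdownEnv:
    """Hypothetical toy env: terminates at 0, truncates after max_steps."""

    def __init__(self, start=3, max_steps=10):
        self.start = start
        self.max_steps = max_steps

    def reset(self, seed=None, options=None):
        self.counter = self.start
        self.steps = 0
        return self.counter, {}

    def step(self, action):
        self.counter -= action
        self.steps += 1
        terminated = self.counter <= 0
        truncated = self.steps >= self.max_steps
        return self.counter, -1.0, terminated, truncated, {}


env = OldStepAPI(CountdownEnv(start=2))
obs = env.reset()
done = False
while not done:
    obs, reward, done, info = env.step(1)
```

If you're using the real library, you'd probably subclass its wrapper base class instead so attributes like the action and observation spaces pass through.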