this post was submitted on 21 Nov 2023
Machine Learning
I'm interested to see how model-based RL could work for reasoning.
Instead of training a model to predict data and then fine-tuning it with RL to be a chatbot, you use RL as the primary training objective and train the data model as a side effect. This lets your pretraining objective be the actual objective you care about, so your reward function could punish issues like hallucination or prompt injection.
I haven't seen any papers using model-based RL for language modeling yet, but it's starting to work well in more traditional RL domains like game-playing (DreamerV3, TD-MPC2).
How would such a loss function work for a chat-like objective?
I think a natural way to do it would be to simultaneously train the same model to predict user responses, by minimizing negative log-likelihood on chat data, while optimizing the assistant responses to maximize a reward signal. Then you could have the language model generate imagined user responses and optimize the reward on those, perhaps in addition to the actual dataset of user interactions. This could be more powerful than conventional RLHF, since the model could generate multi-step interactions and optimize its responses for utility over multiple steps, rather than greedily for human preference on the immediate response. One tricky question here is the reward signal. If it comes from human feedback, then naively you might need human preferences over entire dialogues rather than single responses, which is both more labour-intensive and a sparser training signal.
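To make that concrete, here is a rough sketch of the kind of combined loss I have in mind. It is only an illustration under my own assumptions: the names (`joint_loss`, `discounted_returns`, `is_user`, `advantage`, `nll_weight`) are made up for this comment, the policy term is a plain REINFORCE-style surrogate standing in for whatever RL algorithm you would really use, and it computes loss values in NumPy only. In practice you would write the same thing in an autodiff framework so the gradients flow into the shared model.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def joint_loss(logits, tokens, is_user, advantage, nll_weight=1.0):
    """Hypothetical combined objective for one dialogue.

    logits:    (T, V) next-token logits from the shared model
    tokens:    (T,)   token ids that actually occurred
    is_user:   (T,)   bool mask, True where the token belongs to a user turn
    advantage: scalar, reward minus baseline for the assistant's tokens
    """
    logp = log_softmax(logits)[np.arange(len(tokens)), tokens]  # (T,)
    user = is_user.astype(float)
    assistant = 1.0 - user
    # "World model" term: negative log-likelihood of the user's tokens.
    nll = -(logp * user).sum() / max(user.sum(), 1.0)
    # Policy term: REINFORCE surrogate on the assistant's tokens.
    pg = -(advantage * logp * assistant).sum() / max(assistant.sum(), 1.0)
    return nll_weight * nll + pg, nll, pg

def discounted_returns(rewards, gamma=0.9):
    # Per-turn returns over a (real or imagined) multi-turn dialogue,
    # so each assistant turn is credited with later rewards too.
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out
```

With a dialogue-level reward that only arrives at the end, `discounted_returns` shows how sparse the signal is for early turns, which is the labelling problem mentioned above.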
Wouldn't this just amount to the model overfitting to noise?
It’s a risk if your model can’t accurately predict user responses, but I don’t see how it’s a necessary characteristic of the approach. If it were, the same issue would apply to model-based RL in general, no? Unless you are suggesting there is something special about language modelling or user responses that makes them fundamentally hard to learn a model of.