Machine Learning
[–] koi691337@alien.top 1 points 10 months ago (1 children)

Then you could have the language model generate imagined user responses and optimize the reward signal on the imagined user responses

Wouldn't this just amount to the model sort of overfitting to noise?

[–] til_life_do_us_part@alien.top 1 points 10 months ago

It’s a risk if your model can’t accurately predict user responses, but I don’t see how it’s a necessary characteristic of the approach. If so, wouldn’t the same issue apply to model-based RL in general? Unless you’re suggesting there is something special about language modelling or user responses that makes them fundamentally hard to learn a model of.
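
For concreteness, here is a minimal sketch of the Dyna-style loop being debated above; every name in it is a hypothetical stand-in, not anyone's actual implementation. A learned user model generates imagined responses to the policy's replies, and the policy is updated against rewards computed on those imagined responses. If the user model is inaccurate, the reward is evaluated on noise, which is the overfitting concern raised; the counterpoint is that this failure mode applies to model-based RL in general, not just to language models.

```python
# Sketch of model-based RL with imagined user responses.
# All names are hypothetical stand-ins for illustration.

import random

def policy_generate(prompt):
    # Stand-in for the language model (policy) producing a reply.
    return prompt + " [assistant reply]"

def user_model_respond(assistant_reply):
    # Learned model of user behaviour. If this model is inaccurate,
    # the reward below is computed on noise -- the overfitting risk
    # raised in the thread.
    return assistant_reply + " [imagined user response]"

def reward(imagined_response):
    # Stand-in reward signal evaluated on the *imagined* response
    # rather than on real user feedback.
    return random.random()

def update_policy(trajectory, r):
    # Placeholder for the RL update (e.g. a policy-gradient step).
    pass

for step in range(1000):
    prompt = "some user prompt"
    reply = policy_generate(prompt)
    imagined = user_model_respond(reply)        # model-based rollout
    r = reward(imagined)                        # reward on imagined data
    update_policy((prompt, reply, imagined), r)
```

The quality of `user_model_respond` is the crux: the loop only optimizes the true objective to the extent that the imagined responses match real user behaviour.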