Then you could have the language model generate imagined user responses and optimize the reward signal on those imagined responses.
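To make the proposal concrete, here is a minimal sketch of that loop, assuming a hypothetical `model.generate` text-sampling interface and a learned `reward_model.score` function; none of these names, nor the `n_samples` averaging, come from the conversation itself.

```python
# Hypothetical stand-ins: a policy model that samples text, and a learned
# reward model that scores an assistant turn against a user reaction.

def sample_imagined_user_response(model, assistant_turn: str) -> str:
    """Have the same language model role-play the user's next message."""
    return model.generate(f"User reply to: {assistant_turn}")

def reward(reward_model, assistant_turn: str, imagined_reply: str) -> float:
    """Score the assistant turn against one imagined user reaction."""
    return reward_model.score(assistant_turn, imagined_reply)

def training_signal(model, reward_model, assistant_turn: str,
                    n_samples: int = 8) -> float:
    """Average reward over several imagined user responses; this estimate
    is the signal the optimizer would push the assistant policy toward."""
    replies = [sample_imagined_user_response(model, assistant_turn)
               for _ in range(n_samples)]
    return sum(reward(reward_model, assistant_turn, r)
               for r in replies) / n_samples
```

Note that both the responses and the reward here are produced by the model's own machinery rather than by real users, which is exactly what the next question pushes on.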
Wouldn't this just amount to the model sort of overfitting to noise?