Then you could have the language model generate imagined user responses and optimize the reward signal on those imagined responses.
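To make the proposal concrete, here is a minimal sketch of that loop, assuming a hypothetical `model.generate` text-sampling interface and a learned `reward_model.score` function; none of these names, nor the `n_samples` averaging, come from the conversation itself.

```python
# Hypothetical stand-ins: a policy model that samples text, and a learned
# reward model that scores an assistant turn against a user reaction.

def sample_imagined_user_response(model, assistant_turn: str) -> str:
    """Have the same language model role-play the user's next message."""
    return model.generate(f"User reply to: {assistant_turn}")

def reward(reward_model, assistant_turn: str, imagined_reply: str) -> float:
    """Score the assistant turn against one imagined user reaction."""
    return reward_model.score(assistant_turn, imagined_reply)

def training_signal(model, reward_model, assistant_turn: str,
                    n_samples: int = 8) -> float:
    """Average reward over several imagined user responses; this estimate
    is the signal the optimizer would push the assistant policy toward."""
    replies = [sample_imagined_user_response(model, assistant_turn)
               for _ in range(n_samples)]
    return sum(reward(reward_model, assistant_turn, r)
               for r in replies) / n_samples
```

Note that both the responses and the reward here are produced by the model's own machinery rather than by real users, which is exactly what the next question pushes on.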
Wouldn't this just amount to the model sort of overfitting to noise?