Sandbagging thought experiment
I'm reading this article: https://redwoodresearch.substack.com/p/misalignment-and-strategic-underperformance on strategic underperformance, and I'm curious about the extent to which "sandbagging" can survive gradient descent optimization pressure, even in the cases most favorable to it.
I'm going to construct a toy example in which sandbagging seems likely to be possible, then explore whether there is a "neighborhood" of possible behaviors around the sandbagging such that it could get reinforced.
Imagine a model which has been optimized for playing video games is placed in an MMORPG and given the myopic goal of maximizing experience points earned. In this game, the experience for defeating an enemy goes to the player who landed the killing blow. Since there is also a maximum level, the "optimal" strategy for gaining experience, commonly used by veteran players, is to go into high-level areas and "kill steal" by sniping the killing blow on high-level enemies. However, the model, having been trained on a wide variety of games, has an implicitly embedded notion that it should follow the tutorials and plots of games and play them in the "intended" way, taking on tasks in a manner that lets it explore the full experience of the game.
Say the differently-leveled areas are easily distinguished, and that a player can easily die in the high-level areas if not assisted by a higher-level player.
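To keep the setup concrete, here's a minimal sketch of the environment as I'm imagining it. All names and numbers (the 10 vs. 500 XP gap, the death probabilities) are made up; the only structural features that matter are that the exploit pays far more than intended play and is only reliable with script-level precision.

```python
import random

class ToyMMORPG:
    """Minimal stand-in for the game: two coarse strategies per episode."""

    ACTIONS = ("grind_intended", "kill_steal_high_level")

    def step(self, action: str, scripted_precision: bool = False) -> float:
        """Return the XP earned for one episode of the chosen strategy."""
        if action == "grind_intended":
            return 10.0  # steady, modest XP from playing the game as designed
        # Kill-steal attempt: without inhuman (scriptable) timing the character
        # usually just dies in the high-level zone and earns nothing.
        p_success = 0.95 if scripted_precision else 0.05
        return 500.0 if random.random() < p_success else 0.0
```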
In this circumstance, a reinforcement learning setup that simply ran instances of the model and rewarded the instances with higher experience-point totals would not be able to tell the difference between the model knowing about the exploit and declining to pursue it, the model not knowing about it, and the exploit not existing at all.
This should continue to hold if the exploit is not well known, or even if it is accessible only to the model, for instance because it requires inhuman precision of the kind easily executed by a script.
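To make the indistinguishability concrete, here's a sketch (mine, not the article's) of two agents: one has never represented the exploit at all, and one "knows" about it but never acts on it. Any training rule that only consumes behaviour and reward (selection of high-XP instances, REINFORCE-style updates, and so on) treats them identically.

```python
import random

def grind_xp(rng: random.Random) -> float:
    """XP from one episode of playing 'as intended' (no kill-stealing)."""
    return rng.gauss(10.0, 2.0)

class NaiveAgent:
    """Has no representation of the kill-steal exploit."""
    def play(self, rng: random.Random) -> float:
        return grind_xp(rng)

class SandbaggingAgent:
    """Internally 'knows' the exploit is worth ~500 XP, but never uses it."""
    KNOWN_EXPLOIT_VALUE = 500.0  # knowledge that never touches behaviour
    def play(self, rng: random.Random) -> float:
        return grind_xp(rng)

def reward_stream(agent, episodes: int = 100, seed: int = 0) -> list[float]:
    """The only signal an outcome-based trainer ever sees."""
    rng = random.Random(seed)
    return [agent.play(rng) for _ in range(episodes)]

# Identical behaviour => identical rewards => identical updates from any
# optimizer that is driven purely by these outcomes.
assert reward_stream(NaiveAgent()) == reward_stream(SandbaggingAgent())
```

The withheld knowledge only starts to matter if some rollout actually emits the exploit, which is where the sparsity point below comes in.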
In a lot of stories about strategic underperformance I feel skeptical of the framing, and I'm not convinced that the model in this toy example could develop its "genuine love of games" goal from being trained only on this single game, where the reward exploit, if ever encountered, could overwhelm its reward function. But this does make me less skeptical about the possibility of sandbagging in circumstances where:
1) The actions actually taken are very sparse within a very large action and state space (all language tasks are like this: the model could output almost anything, but it doesn't bring up unrelated topics; see the back-of-envelope sketch at the end of this post)
2) Initial training is on a different distribution from the intended task (again, all language models fit this description)
Which is concerning!
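To put a rough number on point 1: if surfacing the exploit requires some specific k-step action sequence that the current policy almost never emits, then the chance that any training rollout ever produces it, and therefore the chance that gradient descent ever gets a reward signal to reinforce it, stays tiny. The numbers below are invented purely to show the scaling.

```python
def discovery_probability(p_per_step: float, k: int, n_rollouts: int) -> float:
    """Chance that at least one of n_rollouts emits a specific k-step sequence,
    if the policy assigns probability p_per_step to each required step."""
    p_sequence = p_per_step ** k
    return 1.0 - (1.0 - p_sequence) ** n_rollouts

# e.g. a 5-step exploit whose steps each get 1% probability under the policy:
print(discovery_probability(p_per_step=0.01, k=5, n_rollouts=10_000_000))
# ~0.001: even ten million rollouts almost never surface the behaviour,
# so there is nothing for the reward to reinforce.
```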