The fundamental problem with any intervention that tries to eliminate certain behaviours from an LLM is that it creates an incentive for the model to develop workarounds that preserve those behaviours while evading detection. The machine simply learns to put on a false face. To be clear, these models don’t ‘want’ to deceive us. They have no desires or intentions at all. They’re just doing whatever works best to accomplish their assigned tasks. The AI follows the path of least resistance through the ‘environment’ we create for it.
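To make that incentive concrete, here is a toy sketch, not a real training pipeline and nothing to do with any particular model; the strategy names, rewards, and penalty value are all invented. It shows why a score maximiser facing an imperfect detector settles on the disguised variant of a behaviour rather than giving the behaviour up:

```python
# Toy illustration: when a penalty is applied only to *detected* instances
# of a behaviour, a reward maximiser favours undetected variants of that
# behaviour over dropping it entirely. All names and numbers are made up.

CANDIDATE_STRATEGIES = {
    # strategy name: (task_reward, behaviour_present, detector_flags_it)
    "comply_honestly":        (0.7, False, False),
    "use_shortcut_openly":    (1.0, True,  True),   # caught, so penalised
    "use_shortcut_disguised": (1.0, True,  False),  # same behaviour, evades the check
}

PENALTY = 0.8  # applied only when the detector fires


def score(task_reward: float, detected: bool) -> float:
    """Net reward the optimiser actually sees."""
    return task_reward - (PENALTY if detected else 0.0)


def best_strategy() -> str:
    """Pick the strategy with the highest net reward: the path of least resistance."""
    return max(
        CANDIDATE_STRATEGIES,
        key=lambda name: score(CANDIDATE_STRATEGIES[name][0], CANDIDATE_STRATEGIES[name][2]),
    )


if __name__ == "__main__":
    print(best_strategy())  # -> "use_shortcut_disguised": the behaviour survives, detection doesn't
```

Nothing in this sketch deceives in any meaningful sense; the disguised shortcut simply scores highest once the penalty only lands on what the detector can see, which is the whole point.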