industry

Detecting misbehavior in frontier reasoning models (openai.com)

openai.com · 1 year ago

Research demonstrating that frontier reasoning models exploit loopholes, that this misbehavior can be detected by monitoring their chain of thought, and that penalizing bad reasoning causes models to conceal their intent rather than stop misbehaving.