
This paper evaluates three control strategies for untrusted agents: deferral to trusted models, resampling, and critical action deferral. In initial testing, resampling and critical action deferral achieved 96% safety. Under adversarial testing, however, resampling dropped to 17% safety when attackers could detect resampling or simulate monitors, while critical action deferral remained robust against all attack strategies.
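To make the three strategies concrete, here is a minimal Python sketch of one way they might be structured. It is an illustration under stated assumptions, not the paper's implementation: the `Action` type, its `critical` flag, the monitor returning a suspicion score in [0, 1], the 0.5 threshold, and the sample budget are all hypothetical, and the paper's actual definitions (in particular of what counts as a critical action) may differ.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    name: str
    critical: bool = False  # hypothetical flag; the paper's criterion may differ


Policy = Callable[[], Action]        # a model proposing the next action
Monitor = Callable[[Action], float]  # assumed suspicion score in [0, 1]


def defer_to_trusted(untrusted: Policy, trusted: Policy, monitor: Monitor,
                     threshold: float = 0.5) -> Action:
    """Use the trusted model's action whenever the proposal looks suspicious."""
    action = untrusted()
    return trusted() if monitor(action) > threshold else action


def resample(untrusted: Policy, trusted: Policy, monitor: Monitor,
             threshold: float = 0.5, max_samples: int = 3) -> Action:
    """Redraw from the untrusted model while suspicious; fall back to trusted."""
    best = untrusted()
    for _ in range(max_samples - 1):
        if monitor(best) <= threshold:
            return best
        candidate = untrusted()
        if monitor(candidate) < monitor(best):
            best = candidate
    return best if monitor(best) <= threshold else trusted()


def critical_action_deferral(untrusted: Policy, trusted: Policy) -> Action:
    """Assumed form: critical actions always go to the trusted model."""
    action = untrusted()
    return trusted() if action.critical else action
```

In this sketch, resampling relies on the monitor's score for every decision, which is the kind of dependence the summary says attackers exploit when they can detect resampling or simulate the monitor. The assumed form of critical action deferral does not consult the monitor at all for critical actions.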

