This is interesting, but I wonder how reliable this kind of monitoring will really be in the long run. There are fairly strong indications that CoT faithfulness can be trained out of models, and there's already research showing that they won't always reveal their actual reasoning on certain topics.
See: https://arxiv.org/pdf/2305.04388
On a related note, if anyone here is also reading a lot of papers to keep up with AI safety, what tools have been helpful for you? I'm building https://openpaper.ai to help me read papers more effectively without losing accuracy, and I'm looking for feedback to tune its features further. It's also open source :)