I don't think that's accurate. The malware prompt has been around since Sonnet 3.7. We carefully evaled it for each new model release and found no regression to intelligence, alongside improved scores for cyber risk. That said, we have removed the prompt for Opus 4.6 since it no longer needed it.
I started seeing "not a malware, continuing" in almost every reply since around 2 weeks ago. Maybe you just reintroduced it with some regression? Opus 4.6
I'm happy to provide any other info that can be useful (as long as i'm not sharing any information about the code or tools we use into a public github issue).
1. I've never seen this. Is there a config option to unhide it if it's happening? Is this in Claude Code? Does it have to be set to verbose or something?
2. Can we pay more/do more rigorous KYC to disable it if it's active?