Yes, the signal we are measuring is quite different from most evals. We are meas...

BugsJustFindMe · 2026-05-08T19:10:47 1778267447

Ok, but my point is that the claims you make about more reasoning performing worse seem kinda suspicious, and I haven't seen any analysis exploring why that would happen.

languid-photic · 2026-05-08T20:51:15 1778273475

My point is that more reasoning often leads to worse "scope creep/churn, codebase fit, maintainability".

BugsJustFindMe · 2026-05-08T21:12:00 1778274720

I get it, but that is a significant claim. And the claim could be right, but it could also be wrong, and I see no analysis, not even a blog post on your website saying "wow, look at this weird thing we found". To me that makes the claim suspicious because it signals that nobody thought to investigate what's going on. Investigating weird results is how we demonstrate that what we're doing is right.

languid-photic · 2026-05-08T21:33:04 1778275984

It’s mostly a bandwidth thing. We’ve seen the pattern consistently, but haven’t had time yet to write up the analysis carefully.

We are not the only ones to see the reasoning inversion: https://arxiv.org/abs/2510.11977, https://arxiv.org/abs/2502.08235, https://arxiv.org/abs/2507.14417