Hacker News | cowartc's comments

The headline leads with contamination, but buried in it is that 59% of audited failures had test design defects. That's a measurement system that was never validated against ground truth before being adopted industry-wide as a score that mattered. It was reported on for two years, but the gauge was broken the entire time.

AI comments are banned here.

PCW clustering around ~85-95% regardless of usage is measurement bias, not a real signal. In manufacturing, this would fail measurement system analysis: the measurement variation is larger than the effect you're trying to detect. Companies trying to make headcount and copyright decisions on that are doing the AI version of measuring with a broken ruler.
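
Roughly the kind of check I mean, as a toy Python sketch (the scores and the gage-R&R-style comparison are made up for illustration):

    from statistics import mean, stdev

    # repeated benchmark runs for two systems you believe differ in quality
    scores = {
        "system_a": [0.91, 0.88, 0.93, 0.86, 0.94],
        "system_b": [0.89, 0.92, 0.87, 0.90, 0.85],
    }

    within_noise = mean(stdev(v) for v in scores.values())   # run-to-run spread
    between_gap = abs(mean(scores["system_a"]) - mean(scores["system_b"]))

    print(f"noise ~{within_noise:.3f}, gap ~{between_gap:.3f}")
    if within_noise >= between_gap:
        print("Noise swamps the effect: this score can't tell the systems apart.")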

The verifier isn't just a fraud detector. It's an admission that open weights alone aren't a shippable contract. Without a standardized verifier, a buyer has no way to know which case they're in. The weights are the easy part. The verification isn't.
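
To make "standardized verifier" concrete, a deliberately minimal sketch (the file names, manifest format, and checksum-only check are my assumptions; a real standard would also replay pinned prompts and compare outputs):

    import hashlib, json

    def sha256(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    with open("release_manifest.json") as f:   # hypothetical vendor-published manifest
        manifest = json.load(f)

    ok = sha256("model.safetensors") == manifest["weights_sha256"]
    print("weights match published manifest:", ok)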

Interesting direction. One question: how does this hold up outside the synthetic transformer on a real downstream task? Reconstruction error is the right measure, but it's one step removed from the end task. I'm curious whether HAE would show a similar gap on a downstream benchmark.

The real rate is certainly higher because this only catches the laziest form of error. The harder problem is the same one we see in production ML: your pipeline can produce confident results on garbage data and nothing in the system tells you. The first step isn't better models or better tools, it's profiling the input before you trust anything downstream of it.
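
Concretely, "profile the input first" can be as dumb as this (a toy Python sketch; the column names and refusal rule are made up):

    import csv

    def profile(path, required=("id", "amount", "timestamp")):
        with open(path, newline="") as f:
            rows = list(csv.DictReader(f))
        report = {"rows": len(rows)}
        for col in required:
            report[f"{col}_missing"] = sum(1 for r in rows if not (r.get(col) or "").strip())
        report["duplicate_ids"] = len(rows) - len({r.get("id") for r in rows})
        return report

    rep = profile("daily_extract.csv")   # hypothetical input file
    if rep["rows"] == 0 or rep["id_missing"] or rep["duplicate_ids"]:
        raise SystemExit(f"Input failed profiling, refusing to run the pipeline: {rep}")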

This is a symptom of the problem. The real issue is that everyone is running off and building their own thing without tying back to a north star and coordinating. I've seen this play out before at an F200. Tooling proliferation resolves itself once everyone is driving towards the same goal and someone owns it. Without that, you're just duplicating effort and treating symptoms.

The hallucination vs. real-finding distinction is the core problem, and it doesn't get solved by a better model alone. It gets solved by what you do with the output. The verification layer is what makes the system production-grade.

The scarcity framing assumes compute is the bottleneck. For most production deployments I've seen, the actual bottleneck is evaluation and knowing what to trust.

You can throw cheaper models at a problem all day, but if you can't measure where the model fails on your data, you're just making mistakes faster at a lower cost.

Compute gets cheaper. Reliable evaluation doesn't.
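
By "measure where the model fails on your data" I mean something as simple as a per-slice error breakdown (toy Python sketch; the slices and results are invented):

    from collections import defaultdict

    records = [   # (slice, model_was_correct)
        ("invoices_en", True), ("invoices_en", True), ("invoices_en", False),
        ("invoices_de", False), ("invoices_de", False), ("invoices_de", True),
        ("receipts", True), ("receipts", True), ("receipts", True),
    ]

    by_slice = defaultdict(list)
    for name, ok in records:
        by_slice[name].append(ok)

    for name, oks in sorted(by_slice.items()):
        error_rate = 1 - sum(oks) / len(oks)
        print(f"{name:12s} n={len(oks)} error={error_rate:.0%}")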


This is what I found doing Playwright-based extraction against anti-bot defenses. Runtime agents were brittle. It felt like trying to debug/audit a black box.

We used to deal with RPA stuff at work. Always fragile. Good to see evolution in the space.


The separation of harness from compute is the right architectural move. The part that's still missing from most agent frameworks is the verification layer between steps. Sandbox execution solves the safety problem. It doesn't solve the accuracy problem. Those are different failure modes that need different infrastructure.
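
A minimal sketch of what that verification layer could look like (placeholder functions, not any particular framework's API):

    from typing import Any, Callable

    Step = Callable[[Any], Any]
    Check = Callable[[Any], bool]

    def run(steps: list[tuple[Step, Check]], state: Any) -> Any:
        for step, verify in steps:
            state = step(state)     # the work itself (this is what a sandbox protects)
            if not verify(state):   # accuracy is a separate gate from safety
                raise ValueError(f"verification failed after {step.__name__}")
        return state

    def extract_total(_):
        return {"total": 42}

    def total_is_sane(out):
        return isinstance(out.get("total"), (int, float)) and out["total"] >= 0

    print(run([(extract_total, total_is_sane)], state=None))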

