aluminum96 · 2025-07-19T17:12:34 1752945154

“they must be lying because I personally dislike them”

This is why HN threads about AI have become exhausting to read

nosianu · 2025-07-19T18:36:10 1752950170

In general I agree with you, but I see the point of requiring proof for statements made by them instead of accepting them at face value. Given previous experience, and considering that they benefit if these statements are believed, the burden of proof should be on those making the statements, not on those questioning them, no?

Those models seem to be special and not part of their normal product line, as is pointed out in the comments here. I would assume that in that case they indeed had the purpose of passing these tests in mind when creating them. Or were they created for something different, and completely by chance they discovered the models could be used for the challenge, unintentionally?

otabdeveloper4 · 2025-07-19T18:30:17 1752949817

Yeah, that's how the concept of "reputation" works.

queenkjuul · 2025-07-19T23:48:35 1752968915

No, they are likely lying, because they have huge incentives to lie

aluminum96 · 2025-07-19T17:11:15 1752945075

OpenAI explicitly stated that it is natural language only, with no tools such as Lean.

https://x.com/alexwei_/status/1946477745627934979?s=46&t=Hov...

aluminum96 · 2025-07-19T17:09:46 1752944986

Why do people keep making up controversial claims like this? There is no evidence at all to this effect

blibble · 2025-07-19T17:59:59 1752947999

it was widely covered in the press earlier in the year

helloplanets · 2025-07-19T20:44:18 1752957858

Source?

aluminum96 · 2025-07-19T17:08:48 1752944928

Mark Chen posted that the system was locked before the contest. [1] It would obviously be crazy cheating to give verifiers a solution to the problem!

[1] https://x.com/markchen90/status/1946573740986257614?s=46&t=H...

aluminum96 · 2025-07-19T17:00:23 1752944423

The proofs were published on GitHub for inspection, along with some details (generated within the time limit, by a system locked before the problems were released, with no external tools).

https://github.com/aw31/openai-imo-2025-proofs/tree/main

aluminum96 · 2025-07-19T16:58:46 1752944326

The solutions were publicly posted to GitHub: https://github.com/aw31/openai-imo-2025-proofs/tree/main

bwfan123 · 2025-07-19T17:09:26 1752944966

Did humans formalize the inputs, or was the exact natural-language input provided to the LLM? A lot of detail is missing on the methodology used, not to mention any independent validation.

My skepticism stems from the past FrontierMath announcement, which turned out to be a bluff.

aluminum96 · 2025-07-19T17:17:05 1752945425

People are reading a lot into the FrontierMath articles from a couple months ago, but tbh I don’t really understand what the controversy is supposed to be there. Failing to clearly disclose sponsoring Epoch to make the benchmark doesn’t affect the performance of a model on it.

aluminum96 · on April 29, 2024

What, you mean your fruit preferences don't form a total order?

7734128 · on April 30, 2024

Of course they do, but in this example there's no way to compare cherries to bananas.

Grapefruit is of course the best fruit.

aluminum96 · on April 26, 2024

Just vaporized a whole team so the roles can be moved overseas :(

lbruno · on April 28, 2024

Full-timers or contractors?

aluminum96 · on April 10, 2024

SF voters rejected Proposition A in 2022 [1], which would have included funding to upgrade Muni's control systems (among many other projects). We'll eventually have to find the money somewhere else when the system fails.

[1] https://www.sfchronicle.com/sf/article/S-F-voters-narrowly-r...

aluminum96 · on Feb 24, 2024

Google needs much stronger SVP-level product leadership. Directors and VPs should not be fighting for product turf, and major user-facing products, such as an entire default Android app in this case, need to outlast the tenure of any individual VP-level patron.

AtlasBarfed · on Feb 24, 2024

I get that upper management is playboy billionaires who have checked out by now, but it would be B_A_S_I_C management to look at this churn and think "why aren't we conserving the codebases?", because fundamentally this is just branding and reskinning.