In general I agree with you, but I see the point of requiring proof for their statements instead of accepting them at face value. Given previous experiences, and considering that they benefit if these statements are believed, the burden of proof should be on those making the statements, not on those questioning them, no?
Those models seem to be special and not part of their normal product line, as is pointed out in the comments here. I would assume that in that case they indeed had the purpose of passing these tests in mind when creating them. Or were they created for something different, and it was discovered completely by chance that they could be used for the challenge?
The proofs were published on GitHub for inspection, along with some details (generated within the time limit, by a system locked before the problems were released, with no external tools).
Did humans formalize the inputs, or was the exact natural-language input provided to the LLM? A lot of detail is missing on the methodology used, not to mention any independent validation.
My skepticism stems from the past FrontierMath announcement, which turned out to be a bluff.
People are reading a lot into the FrontierMath articles from a couple months ago, but tbh I don’t really understand what the controversy is supposed to be there. Failing to clearly disclose sponsoring Epoch to make the benchmark clearly doesn’t affect performance of a model on it.
SF Voters rejected Proposition A in 2022 [1], which would have included funding to upgrade Muni's control systems (among many other projects). We'll eventually have to find the money somewhere else when the system fails.
Google needs much stronger SVP-level product leadership. Directors and VPs should not be fighting for product turf, and major user-facing products, such as an entire default Android app in this case, need to outlast the tenure of any individual VP-level patron.
I get that upper management is playboy billionaires who have checked out at this point, but it would be B_A_S_I_C management to look at this churn and think "why aren't we conserving the codebases?", because fundamentally this is just branding and reskinning.
This is why HN threads about AI have become exhausting to read.