The "cannon" one is maybe the funniest thing I've seen on the internet in months. It almost makes me want to add autoplaying music to my own website, just so I can add that too.
you mention voice ai in the announcement but I wonder how this works in practice. most voice AI systems are bound not by full response latency but by time-to-first-non-reasoning-token (because once output heads to TTS, the effective speed is capped at the speed of speech, and even the slowest models generate tokens faster than that once they get going).
what do ttft numbers look like for mercury 2? I can see how, at least compared to other reasoning models, it could improve things quite a bit, but i'm wondering if it really makes reasoning viable in voice, given that total latency still seems to be in single-digit seconds, not hundreds of milliseconds.
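the back-of-envelope version of the claim above (speech rate caps playback, so TTFT is what the user actually feels) looks something like this. all numbers here are illustrative assumptions, not measured figures:

```python
# Rough check: once streaming starts, even slow decoding outpaces speech
# playback, so perceived latency collapses to TTFT.
SPEECH_WPM = 150          # assumed conversational speech rate
WORDS_PER_TOKEN = 0.75    # rough English average (assumption)
speech_tokens_per_sec = SPEECH_WPM / 60 / WORDS_PER_TOKEN  # ~3.3 tok/s

def perceived_latency(ttft_s: float, decode_tok_s: float) -> float:
    """Time until the user hears audio, assuming TTS starts on the first
    non-reasoning token and decoding stays ahead of playback."""
    if decode_tok_s < speech_tokens_per_sec:
        raise ValueError("decoding slower than speech; playback would stall")
    return ttft_s  # generation speed never becomes the bottleneck

print(f"need ~{speech_tokens_per_sec:.1f} tok/s to keep up with speech")
print(perceived_latency(ttft_s=0.4, decode_tok_s=20))  # 0.4
```

even ~20 tok/s decoding is several times faster than speech needs, which is why the single-digit-second total latency only matters insofar as it pushes out the first token.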
Spot on about the TTFT bottleneck. In the voice world, the "thinking" silence is what kills the illusion.
At eboo.ai, we see this constantly—even with faster models, the orchestrator needs to be incredibly tight to keep the total loop under 500-800ms. If Mercury 2 can consistently hit low enough TTFT to keep the turn-taking natural, that would be a game changer for "smart" voice agents.
Right now, most "reasoning" in voice happens asynchronously or with very awkward filler audio. Lowering that floor is the real challenge.
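to make the 500-800ms target concrete, a turn-taking budget might break down like this. the component numbers are hypothetical, just to show how quickly the budget gets eaten:

```python
# Hypothetical voice-agent latency budget (all figures are assumptions,
# not measurements from any real system).
BUDGET_MS = 800

pipeline = {
    "endpoint_detection": 150,  # deciding the user has stopped talking
    "asr_final": 100,           # finalizing the transcript
    "llm_ttft": 350,            # time to first non-reasoning token
    "tts_first_audio": 120,     # first synthesized audio chunk
}

total = sum(pipeline.values())
print(total, "ms,", "within budget" if total <= BUDGET_MS else "over budget")
```

with ~450ms already spent before the model even starts, the LLM's TTFT allowance is a few hundred milliseconds at most, which is why any reasoning pass that adds whole seconds has to happen asynchronously.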
This isn't really the author's point, but I think one effect of AI and the forthcoming robotics revolution will be the unrolling of a lot of consolidated supply chains for all sorts of products. It could usher in a renewed era of bespoke products.
For instance, when the cost of building a new (good) app goes to zero, it becomes economical to make a great app for a narrow niche, with a skeleton staff (maybe just one) and no VC money. And this can happen thousands of times over.
Robotics could open up bespoke local supply chains even beyond what's possible with a 3D printer today. For instance, if you had an actually dextrous humanoid robot "living" in your home, why wouldn't you have it just make all of your clothes? You could have any fabric, any style, exactly the right size. And only for the cost of materials (assuming you already own or lease the robot itself).
I do think the author is right in the big picture - the future will be more fun.
wow thanks for leaving this comment - i now realize two things:
1. the farmer's almanac i thought of when i saw the title and even read the article is not going anywhere
2. i have never before heard of the farmer's almanac referred to in this notice
yeah i think they shot themselves in the foot a bit here by creating the o series. the truth is that GPT-5 _is_ a huge step forward, for the "GPT-x" models. The current GPT-x model was basically still 4o, with 4.1 available in some capacity. GPT-5 vs GPT-4o looks like a massive upgrade.
But it's only an incremental improvement over the existing o line. So people feel like the improvement from the current OpenAI SoTA isn't there to justify a whole bump. They probably should have just called o1 GPT-5 last year.
"The sculpture is already complete within the marble block, before I start my work. It is already there, I just have to chisel away the superfluous material."
Chat is a great UX _around_ development tools. Imagine having a pair programmer and never being allowed to speak to them. You could only communicate by taking over the keyboard and editing the code. You'd never get anything done.
Chat is an awesome powerup for any serious tool you already have, so long as the entity on the other side of the chat has the agency to actually manipulate the tool alongside you as well.
a little glossed over, but they do point out that the most important improvement o1 has over gpt-4o is not its "correct" score improving from 38% to 42% but its "not attempted" going from 1% to 9%. The improvement is even more stark for o1-mini vs gpt-4o-mini: 1% to 28%.
They don't really describe what "success" would look like, but it seems to me the primary goal is to minimize "incorrect" rather than to maximize "correct". The mini models would get there by maximizing "not attempted", with the larger models having much higher "correct". Then both model sizes could hopefully reach 90%+ "correct" when given access to external lookup tools.
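reading the benchmark through that lens, with only the percentages quoted above (everything not correct or not attempted counts as incorrect):

```python
# "Incorrect" rate implied by the figures quoted in the comment above.
scores = {
    "gpt-4o": {"correct": 38, "not_attempted": 1},
    "o1":     {"correct": 42, "not_attempted": 9},
}

for model, s in scores.items():
    incorrect = 100 - s["correct"] - s["not_attempted"]
    print(f"{model}: incorrect = {incorrect}%")
```

that's 61% vs 49% incorrect, a 12-point drop, against only a 4-point gain in "correct" — the abstention behavior is doing most of the work.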
disagree - good products meet their users where they are and bury complexity under the hood. i can't imagine trying to use a calendar app (or any app really) that refuses to operate in any mode other than UTC.
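the usual way calendars bury this particular complexity is to store and compute in UTC and only convert at the display edge. a minimal sketch using the stdlib (the event and zone here are just illustrative):

```python
# Store/compute in UTC; render local wall-clock time at the edge,
# so UTC never surfaces in the UI.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# What the database holds: an aware UTC timestamp (example value).
event_utc = datetime(2024, 3, 10, 14, 30, tzinfo=timezone.utc)

def render(event: datetime, user_tz: str) -> str:
    """What the calendar actually shows the user."""
    return event.astimezone(ZoneInfo(user_tz)).strftime("%Y-%m-%d %H:%M")

print(render(event_utc, "America/New_York"))  # 2024-03-10 10:30 (EDT)
```

note the example date straddles a US DST transition: the same stored instant renders correctly either way precisely because the UTC value never changes, only the presentation.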
OK but most people would agree that "only UTC" is not an ergonomic default. There is a balance.
Also, are the users where they are because they want to be there, or because long ago some government or religious leader forced something through and they go along with it because of some kind of inertia?