Hacker News | mnicky's comments

> Very often, after a correction, it will focus a lot on the correction itself making for weird-sounding/confusing statements in commit messages and comments.

I've experienced that too. Usually when I request a correction, I add something like "Include only production-level comments (not change notes)". Recently I also added a special instruction for this to CLAUDE.md.
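For illustration, the kind of CLAUDE.md instruction I mean looks roughly like this (the wording is my own, not an official directive):

```markdown
# Comment and commit style

- Code comments must describe the code as it stands (production-level),
  never the change history ("fixed per review", "updated after feedback").
- Commit messages should describe the resulting behavior, not the
  back-and-forth of corrections that led to it.
```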


For a while now, Claude Code's plan mode has also written the plan to a file that you can edit, etc. It's located in ~/.claude/plans/ for me. Actually, there's a whole history of plans there.

I sometimes reference some of them to build context, e.g. after a few unsuccessful tries to implement something, so that Claude doesn't try the same thing again.
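As a sketch, a quick way to surface the most recent plan files to reference (the path is from my setup; the layout may differ on yours or change between versions):

```shell
#!/bin/sh
# List the most recent Claude Code plan files, newest first.
# ~/.claude/plans/ is where they land on my machine; adjust if yours differs.
PLANS_DIR="${HOME}/.claude/plans"
if [ -d "$PLANS_DIR" ]; then
  echo "latest plans in $PLANS_DIR:"
  ls -t "$PLANS_DIR" | head -n 5
else
  echo "no plans directory at $PLANS_DIR"
fi
```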


Can you compare it to Opus 4.6 with thinking disabled? It seems to have very impressive benchmark scores. Could also be pretty fast.

Added a thinking-disabled Opus 4.6 timing. It took 1m 4s – coincidentally the same as 5.3-codex-low.

> What am I missing?

Largest production capacity maybe?

Also, market demand will be so high that every player's chips will be sold out.


> Largest production capacity maybe?

Anyone can buy TSMC's output...


Which I'm sure is 100% reserved through at least 2030.

Aren't they building new fabs, though? Or are even those already booked?

Can anyone buy TSMC though?

No. TSMC will not take the risk on allocating capacity to just anyone given the opportunity cost.

Not without an army

Well, a fair comparison would be with GPT-5.x Pro, which is the same class of model as Gemini Deep Think.

> can a sufficiently large non thinking model perform the same as a smaller thinking?

Models from Anthropic have always been excellent at this. See e.g. https://imgur.com/a/EwW9H6q (top-left Opus 4.6 is without thinking).


It's interesting that Opus 4.6 added a parameter to make it think extra hard.

At least now we also have a tracker: https://marginlab.ai/trackers/claude-code/

Saw this the other day and loved it. Especially seeing Opus 4.5 degrading prior to the 4.6 release (IIRC) and Codex staying very stable and even improving over time.

But FYI the blog post is not about the actual model being dumbed down, but the command line interface.


What I haven't seen discussed anywhere so far is how big a lead Anthropic seems to have in intelligence per output token, e.g. if you look at [1].

We already know that intelligence scales with the log of tokens used for reasoning, but Anthropic seems to have much more powerful non-reasoning models than its competitors.

I read somewhere that they have a policy of not advancing capabilities too much, so could it be that they are sandbagging and releasing models with artificially capped reasoning to be at a similar level to their competitors?

How do you read this?

[1] https://imgur.com/a/EwW9H6q


Intelligence per token doesn't seem quite right to me.

Intelligence per <consumable> feels closer. Per dollar, or per second, or per watt.


It is possible to think of tokens as some proxy for thinking space. At least reasoning tokens work like this.

Dollar and watt figures are not public, and time has confounders like hardware.


> I think GPT-5.3-Codex was a disappointment

Care to elaborate more?


Evaluation then depends on your specific cost-benefit tradeoff of accuracy vs. hallucinations.

For some tasks where detecting hallucinations is easy I can see it being beneficial.

In the general case, not so much...

