Hacker News | mnicky's comments

> Very often, after a correction, it will focus a lot on the correction itself making for weird-sounding/confusing statements in commit messages and comments.

I've experienced that too. Usually when I request a correction, I add something like "Include only production-level comments (not change notes)". Recently I also added a special instruction for this to CLAUDE.md.
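For illustration, the kind of CLAUDE.md instruction I mean looks roughly like this (the wording is my own, not an official directive):

```markdown
# Comment and commit style

- Code comments must describe the code as it stands (production-level),
  never the change history ("fixed per review", "updated after feedback").
- Commit messages should describe the resulting behavior, not the
  back-and-forth of corrections that led to it.
```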


For a while now, Claude Code's plan mode has also written the plan to a file that you can edit, etc. It's located in ~/.claude/plans/ for me. Actually, there's a whole history of plans there.

I sometimes reference some of them to build context, e.g. after a few unsuccessful tries to implement something, so that Claude doesn't try the same thing again.
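As a sketch, a quick way to surface the most recent plan files to reference (the path is from my setup; the layout may differ on yours or change between versions):

```shell
#!/bin/sh
# List the most recent Claude Code plan files, newest first.
# ~/.claude/plans/ is where they land on my machine; adjust if yours differs.
PLANS_DIR="${HOME}/.claude/plans"
if [ -d "$PLANS_DIR" ]; then
  echo "latest plans in $PLANS_DIR:"
  ls -t "$PLANS_DIR" | head -n 5
else
  echo "no plans directory at $PLANS_DIR"
fi
```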


Can you compare it to Opus 4.6 with thinking disabled? It seems to have very impressive benchmark scores. Could also be pretty fast.

Added a thinking-disabled Opus 4.6 timing. It took 1m 4s – coincidentally the same as 5.3-codex-low.

> What am I missing?

Largest production capacity maybe?

Also, market demand will be so high that every player's chips will be sold out.


> Largest production capacity maybe?

Anyone can buy TSMC's output...


Which I'm sure is 100% reserved through at least 2030.

Aren't they building new fabs, though? Or are even those already booked?

Can anyone buy TSMC though?

No. TSMC will not take the risk on allocating capacity to just anyone given the opportunity cost.

Not without an army

Well, a fair comparison would be with GPT-5.x Pro, which is the same class of model as Gemini Deep Think.

> can a sufficiently large non thinking model perform the same as a smaller thinking?

Models from Anthropic have always been excellent at this. See e.g. https://imgur.com/a/EwW9H6q (top-left Opus 4.6 is without thinking).


It's interesting that Opus 4.6 added a parameter to make it think extra hard.

At least now we also have a tracker: https://marginlab.ai/trackers/claude-code/

Saw this the other day and loved it. Especially seeing Opus 4.5 degrading prior to the 4.6 release (IIRC) and Codex staying very stable and even improving over time.

But FYI the blog post is not about the actual model being dumbed down, but the command line interface.


What I haven't seen discussed anywhere so far is how big a lead Anthropic seems to have in intelligence per output token, e.g. if you look at [1].

We already know that intelligence scales with the log of tokens used for reasoning, but Anthropic seems to have much more powerful non-reasoning models than its competitors.

I read somewhere that they have a policy of not advancing capabilities too much, so could it be that they are sandbagging and releasing models with artificially capped reasoning to be at a similar level to their competitors?

How do you read this?

[1] https://imgur.com/a/EwW9H6q


Intelligence per token doesn't seem quite right to me.

Intelligence per <consumable> feels closer. Per dollar, or per second, or per watt.


It is possible to think of tokens as some proxy for thinking space. At least reasoning tokens work like this.

Dollar and watt figures are not public, and time has confounders like hardware.


> I think GPT-5.3-Codex was a disappointment

Care to elaborate more?


Evaluation then depends on your specific cost-benefit tradeoff of accuracy vs. hallucinations.

For some tasks where detecting hallucinations is easy I can see it being beneficial.

In the general case, not so much...

