Hacker News | nbardy's comments

You can estimate it from tok/second.
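A rough sketch of that estimate, assuming decoding is memory-bandwidth-bound so each generated token streams every active parameter from memory once (the bandwidth and throughput numbers below are illustrative assumptions, not measurements):

```python
# For memory-bandwidth-bound decoding:
#   tok/s ≈ bandwidth / (active_params * bytes_per_param)
# so observed throughput lets you back out the active parameter count.

def estimate_active_params(bandwidth_gbps: float, tok_per_s: float,
                           bytes_per_param: float = 2.0) -> float:
    """Estimate active parameters (in billions) from observed throughput."""
    bytes_per_token = bandwidth_gbps * 1e9 / tok_per_s
    return bytes_per_token / bytes_per_param / 1e9

# e.g. a hypothetical 3 TB/s HBM setup serving ~100 tok/s at fp16:
print(estimate_active_params(3000, 100))  # ~15B active params
```

This only bounds the *active* parameters per token, which is why a sparse MoE can show a much smaller number than its total parameter count.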

The "trillions of parameters" claim is about pretraining.

It’s most efficient in pretraining to train the biggest models possible: sample efficiency increases with every increase in parameter count.

However those models end up very sparse and incredibly distillable.

And it’s way too expensive and slow to serve models that size so they are distilled down a lot.


How much of your RAM does that use, including the KV cache? Is there enough left to run real dev workloads AND the LLM?

Also, can you run it batchwise effectively, like vLLM on CUDA?

Enough to run multiple agents at the same time with throughput?
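For reference, a sketch of the KV-cache arithmetic behind that question, using a made-up Llama-3-8B-like configuration (32 layers, 8 KV heads via GQA, head dim 128, fp16); the formula is standard but the config is an assumption:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_el: int = 2) -> float:
    """KV cache size in GB: K and V each store
    layers * kv_heads * head_dim * seq_len * batch elements."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_el / 1e9

# Hypothetical 8B-class model, 32k context, batch of 4 concurrent agents:
print(round(kv_cache_gb(32, 8, 128, 32768, 4), 1))  # ~17.2 GB of cache alone
```

That cache grows linearly with both context length and batch size, which is exactly why running several agents concurrently eats into the RAM left for dev workloads.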


Why does Apple want to make this hardware hard to access?

What actual benefits do they get?

I guess they can have their own models run faster than the competition on their hardware? But they don't even really have anything that consumers use on the ANE as far as I can tell, and local LLMs are taking off on Macs and could really benefit from this.


I suspect the main benefits are that they have no need to maintain the hardware or software for any longer than it makes sense for their own needs, and they don't have to handhold users through a constantly evolving minefield of performance and technical capabilities.

They are far behind. Go check SWE-Rebench to see the overfitting measured.

Or just try to use them. They don’t generalize as well.

They are benchmaxxed.


They should probably fund their military first.

It’s petulant the way the EU is throwing a hissy fit after we’ve had lop-sided trade deals for years and funding the entire NATO alliance ourselves.

They act like we’re going to war with them when we’re asking for parity and for their self reliance to increase.


>They act like we’re going to war with them when we’re asking for parity and for their self reliance to increase.

The US is literally threatening to invade an EU overseas territory.


That's because not everyone thinks the trade deals were lop-sided, and it's difficult to objectively determine whether they are, given that trade deals are just one lever among millions in the relationship between two countries, one that is constantly recalibrated depending on the others. In a system like this, I think it's pretty difficult to say who's getting more and who's getting less. But Trump doesn't care what is true or false, so for him it's easy to just say whatever suits him best.

Regarding the war, I can assure you that Trump not ruling out taking Greenland by force has been seen by the EU as a threat of starting a war, given that Greenland is part of the Kingdom of Denmark, an EU member state. Also, applying tariffs after European NATO countries sent some troops to Greenland has been perceived as: "Trump wanted to invade Greenland, he felt EU countries wanted to defend it, so he imposed tariffs because he wanted to invade."

I'm not saying everyone in the EU thinks this, but I think a lot of people do, and this is some context to help you understand Europe's point of view.


> They act like we’re going to war with them when we’re asking for parity and for their self reliance to increase.

Threatening to take over Greenland by force isn't considered "going to war" for you?


Comrade, what is the weather in St. Petersburg?


> They should probably fund their military first.

They should do both. Resilience must be achieved in depth.

> It’s petulant the way the EU is throwing a hissy fit after we’ve had lop-sided trade deals for years and funding the entire NATO alliance ourselves.

Most of the outrage in the EU right now is about Trump's threats against another NATO country (Denmark / Greenland). The funding of NATO has been slowly shifting for a few years already.


If you’re honestly OK with the maths Trump used to calculate the trade deficits then I’m not really sure you’re going to fit in here at HN.


No it’s not. I have written cuda kernels and 8bit optimizers with this.

They’re actually very good at speed optimization and can iterate very quickly, taking notes on trials, failures, and benchmarks. I’ve had one write 10 different attempts in around an hour, benchmark them all, then merge them and beat very strong baselines in torch.
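A minimal sketch of that iterate-and-benchmark loop (all names are made up; real CUDA kernel timing would need `torch.cuda.synchronize()` or CUDA events around the timer, this CPU-side version only shows the shape of the workflow):

```python
import time

def bench(fn, warmup: int = 3, iters: int = 20) -> float:
    """Crude wall-clock benchmark of one candidate implementation."""
    for _ in range(warmup):      # warm caches / JIT before timing
        fn()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) / iters

# Hypothetical candidates produced across attempts; keep the fastest.
candidates = {
    "baseline":  lambda: sum(range(10_000)),
    "attempt_1": lambda: sum(range(5_000)),
}
best = min(candidates, key=lambda name: bench(candidates[name]))
print(best)
```

An agent can run this harness after every attempt, append the numbers to a notes file, and only keep variants that beat the current best.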


> Claude Code officially added native support for the Language Server Protocol (LSP) in version 2.0.74, released in December 2025.

I think from training it's still biased towards simple tooling.

But also, there is real power in simple tools: a small set of general-purpose tools beats a bunch of narrow, specific-use-case tools. It's easier for humans to use high-level tools, but LLMs can instantly compose low-level tools for their use case and learn to generalize; writing insane Perl one-liners is second nature to them in a way it isn't for us.

If you watch the tool calls you'll see they write a ton of small one-off Python programs to test, validate, explore, etc.

If you think about it, any time you use a tool there is probably a 20-line Python program better fitted to your use case; it's just that it would take you too long to write it, but for an LLM that's 0.5 seconds.
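A made-up example of the kind of throwaway script meant here: instead of chaining `grep`/`sort`/`uniq`, a purpose-built twenty-ish lines that does exactly one job and gets discarded afterwards:

```python
import os
import re

def todo_counts(root: str) -> dict[str, int]:
    """Count TODO markers per .py file under a directory tree —
    a hypothetical one-shot script of the sort an LLM writes in a
    single tool call."""
    counts: dict[str, int] = {}
    pattern = re.compile(r"\bTODO\b")
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".py"):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as f:
                    n = sum(len(pattern.findall(line)) for line in f)
            except OSError:
                continue  # unreadable file: skip, this is a throwaway tool
            if n:
                counts[path] = n
    return counts

print(todo_counts("."))
```

The point isn't that this is better software than `grep -rc`, it's that the marginal cost of the exact-fit version is near zero for the model.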


> but for LLM's they can instantly compose the low level tools for their use case and learn to generalize

Hard disagree; this wastes enormous amounts of tokens, and massively pollutes the context window. In addition to being a waste of resources (compute, money, time), this also significantly decreases their output quality. Manually combining painfully rudimentary tools to achieve simple, obvious things -- over and over and over -- is *not* an effective use of a human mind or an expensive LLM.

Just like humans, LLMs benefit from automating the things they need to do repeatedly so that they can reserve their computational capacity for much more interesting problems.

I've written[1] custom MCP servers to provide narrowly focused API search and code indexing, build system wrappers that filter all spurious noise and present only the material warnings and errors, "edit file" hooks that speculatively trigger builds before the LLM even has to ask for it, and a litany of other similar tools.

Due to LLMs' annoying tendency to fall back on inefficient shell scripting, I also had to write a full bash syntax parser and a shell-script-rewriting ruleset engine that lets me silently and trivially rewrite their shell invocations into more optimal forms using the other tools I've written. That way they don't do expensive, wasteful things like piping build output through `head`/`tail`/`grep`/etc., which invariably makes them miss important information and either wander off into the weeds or, if they notice, consume a huge number of turns (and time) re-running the commands to get what they need.

Instead, they call build systems directly with arbitrary options, | filters, etc, and magically the command gets rewritten to something that will produce the ideal output they actually need, without eating more context and unnecessary turns.
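A toy sketch of that rewriting idea (the `build-log` tool name and the single rule are invented for illustration; the engine described above parses bash properly rather than regex-matching):

```python
import re

# Rule: a build command lossily piped through head/tail/grep gets rewritten
# to a hypothetical `build-log` wrapper that preserves warnings and errors.
REWRITES = [
    (re.compile(r"^make\b(.*?)\s*\|\s*(?:head|tail|grep)\b.*$"),
     r"build-log make\1"),
]

def rewrite(cmd: str) -> str:
    """Return an optimized form of a shell invocation, or the original."""
    for pattern, replacement in REWRITES:
        if pattern.match(cmd):
            return pattern.sub(replacement, cmd)
    return cmd

print(rewrite("make -j8 | tail -n 20"))  # rewritten to the wrapper
print(rewrite("ls -la"))                 # untouched, no rule matches
```

The LLM never sees the rewrite happen: it issues the command it was going to issue anyway and simply gets better output back.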

LLMs benefit from an IDE just like humans do -- even if an "IDE" for them looks very different. The difference is night and day. They produce vastly better code, faster.

[1] And by "I've written", I mean I had an LLM do it.


Note that the Claude Code LSP integration was actually broken for a while after it was released, so make sure you have a very recent version if you want to try it out.

However, as the parent comment said, it seems to always grep instead, unless explicitly told to use the LSP tool.


Correct. If you try to create a coding agent using the raw Codex or Claude Code API, build your own “write tool”, and don’t give the model its “native patch tool”, 70%+ of the time its write/patch fails because it tries to do the operation using the write/patch tool it was trained on.


part of the value add of owning both the model and the tooling


We are back to RISC vs CISC!


history doesn't repeat but it definitely rhymes


You're way off; this reads more like anti-capitalist political rhetoric than real reasoning.

Look at Nvidia's Nemotron series. They have become a leading open-source training lab themselves, and they're releasing the best training data, training tooling, and models at this point.


When are people going to drop the immigration-is-good-at-all-costs assumption?

We need a well-managed set of immigration policies or countries WILL take advantage of the US. These are our military rivals, and we sell our most advanced math, physics, and engineering seats to the highest bidder. It’s a self-destructive disaster, and it’s not just on us to treat people better.

Look at the rate of Indian asylum seekers in Canada to see the most extreme case. It happens anywhere you extend naivety and boundless good will.


Those arc agi 2 improvements are insane.

Thats especially encouraging to me because those are all about generalization.

5 and 5.1 both felt overfit and would break down and become stubborn when you got them outside their lane, as opposed to Opus 4.5, which is lovely at self-correcting.

It’s one of those things you really feel in the model: not whether it can tackle a harder problem, but whether I can go back and forth with this thing, learning and correcting together.

This whole release makes me insanely optimistic. If they can push this much improvement WITHOUT the new huge data centers and without a new scaled-up base model, that's incredibly encouraging for what comes next.

Remember, the next big data centers are 20-30x the chip count with 6-8x the efficiency on the new chips.

I expect they can saturate the benchmarks WITHOUT any novel research or algorithmic gains. But at this point it’s clear they’re capable of pushing research forward qualitatively as well.


It's also possible that OpenAI used a lot of human-generated, ARC-like data for training (semi-cheating). OpenAI has enough incentive to fake a high score.

Without full disclosure of the training data, you will never be sure whether good performance comes from memorization or "semi-memorization".


> 5 and 5.1 both felt overfit and would break down and be stubborn when you got them outside their lane. As opposed to Opus 4.5 which is lovely at self correcting.

This is simply the "openness vs directive-following" spectrum, which as a side effect produces the sycophancy spectrum, and none of them have found an answer to it yet.

Recent GPT models follow directives more closely than Claude models, and are less sycophantic. Even Claude 4.5 models are still somewhat prone to "You're absolutely right!". GPT 5+ (API) models never do this. The byproduct is that the former are willing to self-correct, and the latter is more stubborn.


Opus 4.5 answers most of my non-question comments with ‘you’re right.’ as the first thing in the output. At least I’m not absolutely right; I’ll take this as an improvement.


Hah, maybe 5th gen Claude will change to "you may be right".

The positive thing is that it seems to be more performative than anything. Claude models will say "you're [absolutely] right" and then immediately do something that contradicts it (because you weren't right).

Gemini 3 Pro seems to have struck a decent balance between stubbornness and you're-right-ness, though I still need to test it more.


5.2 seems worse on overfitting for esoteric logic puzzles in my testing: tests using precise language where attention has to be paid to picking the correct definition among many for a given word. It now charges ahead with wrong definitions far more often and with much lower accuracy.


Same. Also got my attention re ARC-AGI-2. That's meaningful. And a HUGE leap.


Slight tangent, yet I think it's quite interesting: you can try out the ARC-AGI-2 tasks by hand at this website [0] (along with other similar problem sets). Really puts into perspective the type of thinking AI is learning!

[0] https://neoneye.github.io/arc/?dataset=ARC-AGI-2

