Hacker News | fergal_reid's comments

I think the most direct answer is that at scale, inference can be batched, so that processing many queries together in a parallel batch is more efficient than interactively dedicating a single GPU per user (like your home setup).

If you want a survey of intermediate level engineering tricks, this post we wrote on the Fin AI blog might be interesting. (There's probably a level of proprietary techniques OpenAI etc have again beyond these): https://fin.ai/research/think-fast-reasoning-at-3ms-a-token/


This is the real answer; I don't know what people above are even discussing, given that batching is the biggest reduction in costs. If it costs, say, $50k to serve one request, with batching it also costs $50k to serve 100 at the same time with minimal performance loss. I don't know the real number of users you can serve before you need to buy new hardware, but I know it's in the hundreds, so going from $50,000 to $500 in effective cost per request is a pretty big deal (assuming you have the users to saturate the hardware).

My simple explanation of how batching works: since the bottleneck of processing LLMs is loading the weights of the model onto the GPU to do the computing, instead of computing each request separately, you can compute multiple at the same time - ergo, batching.

Let's make a visual example, let's say you have a model with 3 sets of weights that can fit inside the GPU's cache (A, B, C) and you need to serve 2 requests (1, 2). A naive approach would be to serve them one at a time.

(Legend: LA = Load weight set A, CA1 = Compute weight set A for request 1)

LA->CA1->LB->CB1->LC->CC1->LA->CA2->LB->CB2->LC->CC2

But you could instead batch the compute parts together.

LA->CA1->CA2->LB->CB1->CB2->LC->CC1->CC2

Now if you consider that the loading is hundreds if not thousands of times slower than computing the same data, you'll see the big difference. Here's a "chart" visualizing the difference between the two approaches if loading were just 10 times slower. (Consider 1 letter a unit of time.)

Time spent using approach 1 (1 request at a time):

LLLLLLLLLLCLLLLLLLLLLCLLLLLLLLLLCLLLLLLLLLLCLLLLLLLLLLCLLLLLLLLLLC

Time spent using approach 2 (batching):

LLLLLLLLLLCCLLLLLLLLLLCCLLLLLLLLLLCC

The difference is even more dramatic in the real world because, as I said, loading is many times slower than computing, so you'd have to serve many users before you see a serious difference in speeds. I believe in the real world the restriction is actually that serving more users requires more memory to store the activation state of the weights, so you'll end up running out of memory and have to balance how many people per GPU cluster you want to serve at the same time.
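
If it helps to see it as code, here's the same toy timing model as a few lines of Python (purely illustrative: like the charts above, it just assumes loading a weight chunk costs 10 time units and computing one request against it costs 1):

    # Toy timing model of the example above, not a real inference engine.
    LOAD_COST = 10     # time units to pull one weight chunk into on-chip cache
    COMPUTE_COST = 1   # time units to run one request through a loaded chunk
    CHUNKS = 3         # the A, B, C weight sets from the example

    def sequential_time(n_requests):
        # Reload every chunk for every request: LA->CA1->LB->CB1->... per request.
        return n_requests * CHUNKS * (LOAD_COST + COMPUTE_COST)

    def batched_time(n_requests):
        # Load each chunk once, then compute all requests against it before moving on.
        return CHUNKS * (LOAD_COST + n_requests * COMPUTE_COST)

    for n in (1, 2, 10, 100):
        print(n, sequential_time(n), batched_time(n))
    # 2 requests: 66 vs 36 time units, matching the charts above.
    # 100 requests: 3300 vs 330, i.e. roughly 10x cheaper per request here,
    # and far more in reality, where loading dominates even harder.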

TL;DR: It's pretty expensive to get enough hardware to serve an LLM, but once you do, you can serve hundreds of users at the same time with minimal performance loss.


Thanks for the helpful reply! As I wasn't able to fully understand it still, I pasted your reply into ChatGPT and asked it some follow-up questions, and here is what I understand from my interaction:

- Big models like GPT-4 are split across many GPUs (sharding).

- Each GPU holds some layers in VRAM.

- To process a request, weights for a layer must be loaded from VRAM into the GPU's tiny on-chip cache before doing the math.

- Loading into cache is slow, the ops are fast though.

- Without batching: load layer > compute user1 > load again > compute user2.

- With batching: load layer once > compute for all users > send to gpu 2 etc

- This makes cost per user drop massively if you have enough simultaneous users.

- But bigger batches need more GPU memory for activations, so there's a max size.

This does make sense to me, but does it sound accurate to you?

Would love to know if I'm still missing something important.


This seems a bit complicated to me. They don't serve very many models. My assumption is they just dedicate GPUs to specific models, so the model is always in VRAM. No loading per request - it takes a while to load a model in anyway.

The limiting factor compared to local is dedicated VRAM - if you dedicate 80GB of VRAM locally 24 hours/day so response times are fast, you're wasting most of the time when you're not querying.


Loading here refers to loading from VRAM into the GPU's core cache; loading from VRAM is so slow in terms of GPU time that GPU cores end up idle most of the time, just waiting for more data to come in.


Thanks, got it! Think I need a deeper article on this - as comment below says you'd then need to load the request specific state in instead.


Yeah chatgpt pretty much nailed it.


But you still have to load the data for each request. And in an LLM doesn't this mean the WHOLE KV cache, because the KV cache changes after every computation? So why isn't THIS the bottleneck? Gemini is talking about a context window of a million tokens - how big would the KV cache for this get?
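
For a rough sense of scale, here's a back-of-envelope calculation; all the model dimensions are assumptions (roughly a 70B-class open model with grouped-query attention), since the real numbers for Gemini aren't public:

    # Back-of-envelope KV cache size per request; every dimension here is an assumption.
    layers        = 80         # transformer layers
    kv_heads      = 8          # KV heads (GQA keeps this much smaller than the query heads)
    head_dim      = 128        # dimension per head
    bytes_per_val = 2          # fp16 / bf16
    tokens        = 1_000_000  # a Gemini-sized context window

    per_token = 2 * layers * kv_heads * head_dim * bytes_per_val   # the 2 is for K and V
    print(per_token / 1024, "KiB per token")                 # ~320 KiB
    print(per_token * tokens / 1024**3, "GiB per request")   # ~305 GiB at 1M tokens

So at long contexts the KV cache really is a huge part of the memory budget, which is presumably one reason batch sizes can't grow without limit.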


Yes we're familiar with the terminology and framing of gofai. Fwiw I read (most of) the 3rd edition of Russell and Norvig in my undergrad days.

However, the point we're trying to make here is at a higher level of abstraction.

Basically most demos of agents you see these days don't prioritize reliability. Even a Copilot use case is quite a bit less demanding than a really frustrated user trying to get a refund or locate a missing order.

I'm not sure putting that in the language of pomdps is going to improve things for the reader, rather than just make us look more well read.

But your feedback is noted!


At Intercom we've also a lot of experience here.

I disagree, basically. In our experience actual real world processes are not compactly defined, and don't have sharp edges.

When you actually go to pull them out of a customer they have messy probabilistic edges, where you can sometimes make progress a lot faster, and end up with a much more compact and manageable representation of the process, by leveraging an LLM.

We've a strong opinion this is the future of the space and that purely deterministic workflows will get left behind! I guess we'll see.


Strongly agree.

This seems to be very hard for people to accept, per the other comments here.

Until recently I was willing to accept an argument that perhaps LLMs had mostly learned the patterns; e.g. to maybe believe 'well there aren't that many really different leetcode questions'.

But with recent models (eg sonnet-3.7-thinking) they are operating well on such large and novel chunks of code that the idea they've seen everything in the training set, or even, like, a close structural match, is becoming ridiculous.


All due respect to Simon but I would love to see some of that groundbreaking code that the LLMs are coming up with.

I am sure that the functionalities implemented are novel but do you really think the training data cannot possibly have had the patterns being used to deliver these features, really? How is it that in the past few months or years people suddenly found the opportunity and motivation to write code that cannot possibly be in any way shape or form represented by patterns in the diffs that have been pushed in the past 30 years?


When I said "the thing I am doing has never been done by anyone else before" I didn't necessarily mean groundbreaking pushes-the-edge-of-computer-science stuff - I meant more pedestrian things like "nobody has ever published Python code to condense and uncondense JSON using this new format I just invented today": https://github.com/simonw/condense-json

I'm not claiming LLMs can invent new computer science. I'm saying it's not accurate to say "they can only produce code that's almost identical to what's in their training data".


> "they can only produce code that's almost identical to what's in their training data"

Again, you're misinterpreting in a way that seems like you are reacting to the perception that someone attacked some of your core beliefs rather than considering what I am saying and conversing about that.

I never even used the words "exact same thing" or "almost identical". Not even synonyms. I just said overfitting and quoted from an OpenAI/Anthropic paper that said "predict plausible changes to code from examples of changes"

Think about that. Don't react, think. Why do you equate overfitting and plausibility prediction with "exact" and "identical". It very obviously is not what I said.

What I am getting at is that a cannon will kill the mosquito. But drawing a fly swatter on the cannonball and saying the plastic ones are obsolete now would be in bad faith. No need to tell someone pointing that out that they are claiming the cannon can only fire on mosquitoes that have been swatted before.


I don't think I understood your point then. I matched it with the common "LLMs can only produce code that's similar to what they've seen before" argument.

Reading back, you said:

> I often see people wondering if the some coding task is performed well or not because of availability of code examples in the training data. It's way worse than that. It's overfitting to diffs it was trained on.

I'll be honest: I don't understand what you mean by "overfitting to diffs it was trained on" there.

Maybe I don't understand what "overfitting" means in this context?

(I'm afraid I didn't understand your cannon / fly swatter analogy either.)


It's overkill. The models do not capture knowledge about coding. They overfit to the dataset. When one distills data into a useful model the model can be used to predict future behavior of the system.

That is the premise of LLM-as-AI. By training these models on enough data, knowledge of the world is purported as having been captured, creating something useful that can be leveraged to process new input and get a prediction of the trajectory of the system in some phase space.

But this, I argue, is not the case. The models merely overfit to the training data. Hence the variable results perceived by people. When their intentions and prompt fit the data in the training, the model appears to give good output. But when the situation and prompt do not, the models do not "reason" about it or "infer" anything. It fails. It gives you gibberish or goes in circles, or worse, if there is some "agentic" arrangement it fails to terminate and burns tokens until you intervene.

It's overkill. And I am pointing out it is overkill. It's not a clever system for creating code for any given situation. It overfits to the training data set. And your response is to claim that my argument is something else - not that it's overkill, but that it can only kill dead things. I never said that. I see it's more than capable of spitting out useful code even if that exact same code is not in the training dataset. But it is just automating the process of going through Google, docs and Stack Overflow and assembling something for you. You might be good at searching, and lucky, and it is just what you need. You might not be so used to using the right keywords, or just be using some uncommon language, or working in a domain that happens to not be well represented, and then it feels less useful. But instead of just coming up short the way search does, the model overkills and wastes your time and god knows how much subsidized energy and compute. Lucky you if you're not burning tokens on some agentic monstrosity.


You are correct that variable results could be a symptom of a failure to generalise well beyond the training set.

Such failure could happen if the models were overfit, or for other reasons. I don't think 'overfit', which is pretty well defined, is exactly the word you mean to use here.

However, I respectfully disagree with your claim. I think they are generalising well beyond the training dataset (though not as far beyond as say a good programmer would - at least not yet). I further think they are learning semantically.

Can't prove it in a comment, except to say that there's simply no way they'd be able to successfully manipulate such large pieces of code, using English language instructions, if they weren't great at generalisation and OK at understanding semantics.


I understand your position. But I think you're underestimating just how much training data is used and how much information can be encoded in hundreds of billions of parameters.

But this is the crux of the disagreement. I think the models overfit to the training data hence the fluctuating behavior. And you think they show generalization and semantic understanding. Which yeah they apparently do. But the failure modes in my opinion show that they don't and would be explained by overfitting.


If that's the case, it turns out that what I want is a system that's "overfitted to the dataset" on code, since I'm getting incredibly useful results for code out of it.

(I'm not personally interested in the whole AGI thing.)


Good man I never said anything about AGI. Why do you keep responding to things I never said?

This whole exchange was you having knee-jerk reactions to things you imagined I said. It has been incredibly frustrating. And at the end you shrug and say "eh it's useful to me"??

I am talking about this because of deceitfulness, resource efficiency, societal implications of technology.


"That is the premise of LLM-as-AI" - I assumed that was an AGI reference. My definition of AGI is pretty much "hyped AI". What did you mean by "LLM-as-AI"?

In my own writing I don't even use the term "AI" very often because its meaning is so vague.

You're right to call me out on this: I did, in this earlier comment - https://news.ycombinator.com/item?id=43644662#43647037 - commit the sin of responding to something you hadn't actually said.

(Worse than that, I said "... is uninformed in my opinion" which was rude because I was saying that about a strawman argument.)

I did that thing where I saw an excuse to bang on one of my pet peeves (people saying "LLMs can't create new code if it's not already in their training data") and jumped at the opportunity.

I've tried to continue the rest of the conversation in good faith though. I'm sorry if it didn't come across that way.


> My definition of AGI is pretty much

Simon, intelligence exists (and unintelligence exists). When you write «I'm not claiming LLMs can invent new computer science», you imply intelligence exists.

We can implement it. And it is somehow urgent, because intelligence is very desirable wealth - there is definite scarcity. It is even more urgent after the recent hype has made some people perversely confused about the idea of intelligence.

We can and must go well beyond the current state.


I’ve spent a fair amount of time trying to coax assistance out of LLMs when trying to design novel or custom neural network architectures. They are sometimes helpful with narrow aspects of this. But more often, they disregard key requirements in favor of the common patterns they were trained on.


As an Irish person when I saw the article title, I was immediately sceptical.

I personally believe most articles about the famine shy away from the horror of it, and also from a frank discussion.

Going to give some subjective opinion here: people generally downplay the role of the British government and ruling class in it.

Why? One personal theory - growing up in the 80s in Ireland there was a lot of violence in the north. (Most) Irish people who were educated or middle class were worried about basically their kids joining the IRA, and so kind of downplayed the historical beef with the British. That's come through in the culture.

There's also kind of a fight over the historical narrative with the British, maybe including the history establishment, who yes care a lot about historical accuracy, but, also, very subjectively, see the world through a different lens, and often come up through British institutions that view the British empire positively.

It's often easier to say the famine was the blight, rather than political. (They do teach the political angle in schools in Ireland; but I think it's fair to say it's contested or downplayed in the popular understanding, especially in Britain.)

However that article is written by a famous Irish journalist and doesn't shy away from going beyond that.

Perhaps a note of caution - even by Irish standards he'd be left leaning, so would be very politically left by American standards; he's maybe prone to emphasize the angle that the root cause was laissez-faire economic and political policies. (I'm not saying it wasn't.)

I personally would emphasize more the fact that the government did not care much about the Irish people specifically. The Irish were looked down on as a people; and also viewed as troublesome in the empire.

Some government folks did sympathize, of course, and did try to help.

But I personally do not think the famine would have happened in England, no matter how laissez-faire the economic policies of the government. A major dimension must be a lack of care for the Irish people, over whom they were governing; and there are instances of people in power being glad to see the Irish being brought low:

"Public works projects achieved little, while Sir Charles Trevelyan, who was in charge of the relief effort, limited government aid on the basis of laissez-faire principles and an evangelical belief that “the judgement of God sent the calamity to teach the Irish a lesson”." per the UK parliament website!

It's not an easy thing to come to terms with even today. I recently recorded a video talking about how fast the build out of rail infrastructure was in the UK, as an analogy for how fast the AI infra build out could be; and I got a little queasy realizing that during the Irish potato famine the UK was spending double digit GDP percent on rail build out. Far sighted, yes, and powering the industrial revolution, but wow, doing that while mass exporting food from the starving country next door, yikes.


Crop failures are natural disasters. Famines are political disasters.

The Indian economist Amartya Sen wrote a book in 1999, _Development as Freedom_, which argues, relatively convincingly, that famines don't happen in functioning democracies among their own citizens. The book makes the observation that famines happened regularly in British colonial India, every few decades, but basically stopped in democratic, self-governing India. (1) And, as far back as the Romans, Egyptians, and Chinese, many of the stories told about what good governance looked like involved beating famines - either because they were able to organize shipments of food from unaffected areas or because they stored up enough grain in the good times to survive the crop failures.

It is the general consensus among people who study this sort of thing that, as the United Nations OHCHR wrote in 2023, "Hunger and famine did not arise because there was not enough food to go around; they were caused by political failures, meaning that hunger and famine could only be addressed through political action." (2) Yes, a particular crop failure can be a natural disaster, but a famine happening requires a political failure on top of that (and the research does seem to indicate causation: the political failure is not caused by the crop failure but was pre-existing, and caused the crop failure to turn into a famine).

So, basically, yeah, the general consensus of people who study famines today and in the past is that the British government made choices that turned a crop failure into a famine. The same with the Great Famine of India, the Bengal Famine, the Soviets and the Holodomor, etc.

1: Generally, my understanding is that people who look at this think that Sen was basically correct. There might be a couple of occasions where a democracy failed to govern and suffered a famine, but, the way that democracies distribute power makes it far more unusual for them to fail so catastrophically that they can't deliver food to an area experiencing crop failure. This is one of the reasons that democracies are better than authoritarian governments!

2: https://www.ohchr.org/en/news/2023/03/conflict-and-violence-...


Also Irish person here. My primary school was 100m from one of the old workhouses, and I was taught from maybe age 7 what happened there. All the old stone walls in the nearby fields were built by forced famine labor. There's abandoned roads to nowhere (famine roads) all around, likewise built by forced labor.

I think it was taught quite well, and people around me while I was growing up didn't downplay it. It's still a significant event in the Irish psyche, especially in the parts of the country most deeply affected at the time.

The thing is, though, it's a fairly distant historical event at this stage, and I don't think it's healthy or helpful to the Irish collective psyche to hold on to it as strongly as we still do - not just the famine but all aspects of our being "the oppressed". We're no longer oppressed, we're a privileged and filthy rich country (even if it doesn't feel like that right now, but we have no one to blame for the housing crisis except our own politicians and capitalists).

While we should be mindful of the English tendency to play down and rewrite history, I know many Irish people who are straight up racist towards the English - defended with the tired caveat that "oppressed people can't be racist towards the oppressors". Yes, they can. Maybe it's a less harmful form of racism, but it holds back the psychological development of the person with racist views nonetheless.

In secondary school "Up the Ra" was a common slogan shouted by my classmates. There's still pubs in Dublin and other places around the country where you wouldn't want to go with an English accent.

I'm not saying any of this to defend the English - they did terrible things in history, and those must not be forgotten or rewritten. There's also a fair few English people who are racist towards Irish too, not to mention a lot of "harking back to the glory days of Empire", mostly from older English men whose ancestors were probably peasants back then.

But for us Irish, holding onto this old identity of "the oppressed" is a part of our collective psyche which I struggled with a lot growing up, and it holds back our country. It's time we moved on.

Yes, I know that's hard when a quarter of the geographical landmass of Ireland still belongs to the old oppressors. But that's another thing we need to let go of. The people living in the North voted, several times, to remain in the UK. It's their choice, not ours. If they look like they're leaning to vote differently in the future we can restart the conversation.


> I know many Irish people who are straight up racist towards the English

I'm Irish. I've spent a lot of time in the countryside and the cities. This is not true. It's very rare to find an Irish person who is racist towards the British

> secondary school "Up the Ra" was a common slogan shouted by my classmates.

These days it's just a catchy rebel chant. It does not necessarily mean the people chanting it support the IRA.

> There's still pubs in Dublin and other places around the country where you wouldn't want to go with an English accent.

No there's not.

I can think of maybe 2 pubs in Dublin where you might get an unfriendly welcome. On a bad day.

> But for us Irish, holding onto this old identity of "the oppressed" is a part of our collective psyche

You're really, really overstating how prevalent this is.

> a quarter of the geographical landmass of Ireland still belongs to the old oppressors. But that's another thing we need to let go of.

We did. Remember the referendum? The one where we collectively voted to remove the territorial claim from our constitution?

Your whole comment is vastly exaggerated.

There's Americans reading. Don't be giving them the wrong ideas, they've enough to be dealing with.


> It's very rare to find an Irish person who is racist towards the British

Oh come off it. No it's not. Unless you're in deep denial about what constitutes racism.

> Your whole comment is vastly exaggerated.

Maybe we have different lived experiences? We can both be Irish and have very different lives and experiences, small country though it is.

For me, nothing I said is exaggerated. Irish people do hate to state things directly though, and I'm used to being told to be quiet whenever I speak out about our issues.

> There's Americans reading. Don't be giving them the wrong ideas, they've enough to be dealing with.

Ok can't argue with that one.


Another Irish person here… Going to have to agree with biorach on this one, but not by a lot.

>> It's very rare to find an Irish person who is racist towards the British

> Oh come off it. No it's not. Unless you're in deep denial about what constitutes racism.

The Irish that are racist against the British are, in my experience, the same people who have things to say about other groups, ethnicities, religions.

Not uncommon, not prolific, but not the crowd you’d go hang out with either.


"There's Americans reading. Don't be giving them the wrong ideas, they've enough to be dealing with."

Thanks for making me laugh for a bit before I went back to staring at my screen in disbelief.


Sure, it's unhelpful to dwell too much on the past, but I don't think the Ireland of today is as consumed by victimhood or anti-Britishness as you are making out. I don't doubt there are pockets of society where anti-British sentiment is still strong but there is no society in the world without similar pockets of backwards, racist thinking. By and large, Irish people do not dislike or begrudge British people. While Brexit stoked some of the old tensions (again, we were far from the only country getting frustrated with Britain during those negotiations) we have, both before and since, largely regarded the British as our friends and allies.

The famine was a huge event in our history. Our population still hasn't recovered from it and the mass emigration it triggered still has an impact on our relations with other countries, particularly the US. We shouldn't be (and aren't) consumed by it but it would be madness to forget it. The same goes for our broader struggle for independence, which is literally the origin story of our country.

> Yes, I know that's hard when a quarter of the geographical landmass of Ireland still belongs to the old oppressors. But that's another thing we need to let go off. The people living in the North voted, several times, to remain in the UK. It's their choice, not ours. If they look like they're leaning to vote differently in the future we can restart the conversation.

The Irish position on the North is clear and has been since 1998. We don't lay claim to it so there is nothing to "let go". No one questions the right of the North to choose its own way, but equally we have a relationship and a history with that part of the island that we cannot just ignore.


It’s important to teach about bad times during the good times, because the horrors of what humans are capable of seem unfathomable with time and distance.


I'm American of Irish descent and have spent a lot of time in Ireland. The walls mentioned were sort of an academic trick. They had to do "work" to get "paid" and so they were made to just build walls so that they could then be paid in food and not starve.

If you hike around and see them, it's stunning. They were handmade. The rocks weren't in situ, they were carried in. It's not the pyramids, but in a relatively contemporary time they were made rather than just providing assistance.


"Up the RA" is a great slogan. The IRA made an important and undeniable contribution to Irish statehood. I don't think we'd be "a privileged and filthy rich country" were it not for their activities in the 20th century. There is an unfortunate tendency among some people to be unwilling to recognise that for fear of offending our neighbours to the east. As you say, it's in the distant past and not worth getting too offended about.


They also did that to Bengal in the famine there much later. It's a pattern with the Brits.


Similar arguments to LeCun.

People are going to keep saying this about autoregressive models, how small errors accumulate and can't be corrected, while we literally watch reasoning models say things like "oh that's not right, let me try a different approach".

To me, this is like people saying "well NAND gates clearly can't sort things so I don't see how a computer could".

Large transformers can clearly learn very complex behavior, and the limits of that are not obvious from their low level building blocks or training paradigms.


> while we literally watch reasoning models say things like "oh that's not right, let me try a different approach".

Not saying I disagree with your premise that errors can be corrected by using more and more tokens, but this argument is weird to me.

The model isn’t intentionally generating text. The kinds of “oh let me try a different approach” lines I see are often followed by the same approach just taken. I wouldn’t say most of the time, but often enough that I notice.

Just because a model generates text doesn’t mean that the text actually represents anything at all, let alone a reflection of an internal process.


> Just because a model generates text doesn’t mean that the text actually represents anything at all, let alone a reflection of an internal process.

What does it represent then? What are all these billion weights for? It's not a bag full of NULLs that just pulls next words from a look-up table. Obviously there is some kind of internal process.

Also I don't get why people ignore the temporal aspect. Humans too generate thoughts in sequence, and can't arbitrarily mutate what came before. Time and memory is what forces sequential order - we too just keep piling on more thoughts to correct previous thoughts while they are still in working memory (context).


The text represents a prediction of how a human may respond, one word(ish) at a time, that's it.

With "reasoning" models, the reasoning layer is basically another LLM instructed to specifically predict how a human may respond to the underlying LLM's answer, fake prompt engineering if you will.

There of course is some kind of internal process, but we can't prove any kind of reasoning. We ask a question, the main LLM responds, and we see how the reasoning layer LLM itself responds to that.


Please don't confuse people with wrong information; the reasoning part in reasoning models is the exact same LLM that produces the final answer. For example o1 uses special "thinking" tokens to demarcate between reasoning and answer sections of its output.


Sure, that's a great clarification, though maybe a bit of an implementation detail in this context.

Functionally my argument stands in this context - just because we can see one stream of LLM responses responding to the primary response stream says nothing of reasoning or what is going on internally in the reasoning layer.


> what is going on internally in the reasoning layer.

We literally know exactly what is going on with every layer.

It’s well defined. There are mathematical proofs for everything.

Moreover it’s all machine instructions which can be observed.

The emergent properties we see in LLMs are surprising and impressive, but not magic. Internally what is happening is a bunch of matrix multiplications.

There’s no internal thought or process or anything like that.

It’s all “just” math.

To assume anything else is personification bias.

To look at LLMs outputting text and a human writing text and think “oh these two things must be working in the same way” is just… not a very critical line of thought.


> We literally know exactly what is going on with every layer.

Unless I missed a huge break in the observability problem, this isn't correct.

We know exactly how every layer is designed and we know how we functionally expect that to work. We don't know what actually happens in the model at time of inference.

I.e. we know what pieces were used to build the thing but when we actually use it its a black box - we only know inputs and outputs.


> We don't know what actually happens in the model at time of inference.

How could we not know? Every processor instruction is observable.

What we specifically don't have a good view of is the causal relationship between input tokens, a model's weights, and the output.

We don’t know specifically what weights matter or why.

That’s very different than not understanding what processes are taking place.


This paper [1] may be an interesting place to start.

We only know how the structures are designed to work, and we have hypotheses of how they likely work. We can't interpret what actually happens when the LLM is actually going through the process of generating a response.

That seems pedantic or unimportant on the surface, but there are some really important implications. At the more benign level, we don't know why a model gave a bad response when a person wasn't happy with the output. On the more important end, any concerns related to the risk of these models becoming self-directed or malicious simply can't be recognized or guarded against. We won't know if a model becomes self-directed until after it acts on it in ways that don't match how we already expect them to work.

Both alignment and interpretability were important research topics for decades of AI research. We effectively abandoned those topics once we made real technological advancement - once an AI-like tool was no longer entirely theoretical we couldn't be bothered focusing resources on figuring out how to do it safely. The horse was already out of the barn.

Does this mean they will turn evil or end up going poorly for us? Absolutely not. It just means that we have to cross our fingers and hope because we can't detect issues early.

[1] https://arxiv.org/abs/2309.01029


> We can't interpret what actually happens when the LLM is actually going through the process of generating a response.

There are 2 things we’re talking about here.

There’s the physical, mechanical operations going on during inference and there’s potentially a higher order process happening as an emergent property of those mechanical operations.

We know precisely the mechanical operations that take place during inference as they are machine instructions which are both man-made and very well understood. I hope we can agree here.

Then there’s potentially a higher order process. The existence of that process and what that process is still a mystery.

We do not know how the human brain works, physically. We can’t inspect discrete units of brain operations as we can with machine instructions.

For that reason, it is uncritical to assume that there is any kind of “thought” process occurring at inference which is similar to our thought processes.

Comparing the two is like apples and oranges anyway and is pedantic in a non-useful way, especially with our limited understanding of the human brain.


> There are 2 things we’re talking about here.

I was never actually talking about the physical mechanisms. Sure we can agree that GPUs, logical gates, etc physically work in a certain way. That just isn't important here at all.

> For that reason, it is uncritical to assume that there is any kind of “thought” process occurring at inference which is similar to our thought processes.

I wasn't intending to raise concerns over emergent consciousness or similar. Whether thought goes on is a bit less clear depending on how you define thought, but that still wasn't the point I was making.

We have effectively abandoned the alignment problem and the interpretability problem. Sure we know how GPUs work, and we don't need to assume that consciousness emerged, but we don't know why the model gives a certain answer. We're empowering these models with more and more authority: not only are they given access to the public internet but now we're making agents that are starting to interact with the world on our behalf. Models are given plenty of resources and access to do very dangerous things if they tried to, and my point is we don't have any idea what goes on other than input/output pairs. There's a lot of risk there.

> Comparing the two is like apples and oranges anyway and is pedantic in a non-useful way, especially with our limited understanding of the human brain.

Comparing the two is precisely what we're meant to do. If the comparison wasn't intended they wouldn't be called "artificial intelligence". That isn't pedantic, if the term isn't meant to imply the comparison then they were either accidentally or intentionally named horribly.


> I wasn't intending to raise concerns over emergent consciousness or similar

Oh jeez, then we may have just been talking past each other. I thought that’s what you were arguing for.

> That just isn't important here at all.

It is, though. The fact that the underlying processes are well understood means that, if we so wished, we could work backwards and understand what the model is doing.

I recall some papers on this, but can’t seem to find them right now. One suggested that groups of weights relate to specific kinds of high level info (like people) which I thought was neat.

> the comparison wasn't intended they wouldn't be called "artificial intelligence"

Remember “smart” appliances? Were we meant to compare an internet connected washing machine to smart people? Names are all made up.

I do actually think AI is a horrible name as it invites these kinds of comparisons and obfuscates more useful questions.

Machine Learning is a better name, imo, but I’m not a fan of personifying machines in science.

Too many people get sci-fi brain.


Haha, well it's funny sometimes when you realize too late there were two different conversations happening.

I definitely agree on the term machine learning - it seems a much better fit but still doesn't feel quite right. Naming things is hard, but AI seems particularly egregious here.

> The fact that the underlying processes are well understood means that, if we so wished, we could work backwards and understand what the model is doing.

I'm not sure we can take that leap. We understand pretty well how a neuron functions but we understand very little about how the brain works or how it relates to what we experience. We understand how light is initially recognized in the eye with cones and rods, but we don't really know exactly how it goes from there to what we experience as vision.

In complex systems it's often easy to understand the function of a small, more fundamental bit of the system. It's much harder to understand the full system, and if you do, you should be able to predict it. For LLMs, that would mean being able to predict a model's output for a given input (even if that prediction has to account for the randomness added into the inference algorithm).


Subbarao Kambhampati, who seems to only use X, is a good resource. He points out how the CoT text is not of semantic importance.

This work from his team shows how few 'reasoning' traces are valid.

https://atharva.gundawar.com/searchformer_response_analysis....

This paper shows how the scratch space gets transformers to PTIME, up from TC0 without it.

https://arxiv.org/abs/2502.02393

OpenAI may be able to do more in the long term because they don't show the <think> and can spend more of that scratch space on improving answers vs appeasing users, but time will show.

Remember that probabilistically checkable proofs show how random data can improve computation.

The AI field has always had a problem with wishful mnemonics.

But it is probably not a binary choice; if we could get the scratch space to reliably simulate Dijkstra's shunting-yard algorithm and convert to postfix, as an example, that would be great.


> Humans too generate thoughts in sequence,

You don’t know this. I don’t feel like I generate thoughts in sequence, for me it feels hierarchical.

> can't arbitrarily mutate what came before

Uhh… what?

Do you remember your memories as a child? Or what you ate for breakfast 3 weeks ago?

Have you ever misremembered an event or half remembered a solution to a problem?

The information in human minds are entirely mutable. They are not like computers…

> It's not a bag full of NULLs that just pulls next words from a look-up table.

Funny enough, the attention mechanism that’s popular right now is effectively lots and lots of stacked look up tables. That’s how it’s taught as well (what with the Q K and V)
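
Purely as an illustration, here's what that "soft lookup" looks like for a single attention head in NumPy (a sketch, not how any particular model is implemented):

    # Single-head scaled dot-product attention: Q asks, K indexes, V stores,
    # and the softmax turns match scores into a soft (weighted) lookup.
    import numpy as np

    def attention(Q, K, V):
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                      # query/key match scores
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)     # softmax: a soft one-hot select
        return weights @ V                                 # blended values, not a hard fetch

    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
    print(attention(Q, K, V).shape)                        # (4, 8): one blended value per query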

Tho I don’t think that’s a requirement for LLMs in general.

I find a lot of people who half understand cognition and understand computing look at LLMs and work backwards to convince themselves that it’s “thinking” or doing more cognitive functions like we humans do. It’s personification bias.


Not OP.

> Do you remember your memories as a child? Or what you ate for breakfast 3 weeks ago?

For me, conjuring up and thinking about a childhood event seems like putting what came out of my nebulous 'memory' fresh into context at the point in time I'm thinking about it, along with whatever thoughts I had about it (how embarrassed I was, how I felt proud because of X, etc). As that context fades into the past, some of those thoughts may get mixed back into the region of my 'memory' associated with that event.


> The model isn’t intentionally generating text.

What's the mechanistic model of "intention" that you're using to claim that there is no intention in the model's operation?

> Just because a model generates text doesn’t mean that the text actually represents anything at all, let alone a reflection of an internal process.

Generating text is the trace of an internal process in an LLM.


> What's the mechanistic model of "intention" that you're using to claim that there is no intention in the model's operation?

You can’t prove intention, but I can show examples of LLMs lacking intent (as when repeating the same solution even after being told it was incorrect)

> Generating text is the trace of an internal process in an LLM.

Not really sure precisely what you mean by trace, but the output from an LLM (as with any statistical model) is the result of the calculations, not a representation of some emergent internal state.


> You can’t prove intention, but I can show examples of LLMs lacking intent (as when repeating the same solution even after being told it was incorrect)

I don't think that shows lack of intent, any more than someone who has dementia forgetting why they entered a room shows they lack intent.


I'd argue that humans are by definition autoregressive "models", and we can change our minds mid thought as we process logical arguments. The issue around small errors accumulating makes sense if there is no sense of evaluation and recovery, but clearly, both evaluation and recovery are done.

Of course, this usually requires the human to have some sense of humility and admit their mistakes.

I wonder, what if we trained more models with data that self-heals or recovers mid sentence?


As the number of self-corrections increases, it also increases the likelihood that it will say "oh that's not right, let me try a different approach" after finding the correct solution. Then you can get into a second-guessing loop that never arrives at the correct answer.

If the self-check is more reliable than the solution-generating process, that's still an improvement, but as long as the model makes small errors when correcting itself, those errors will still accumulate. On the other hand, if you can have a reliable external system do the checking, you can actually guarantee correctness.


Error correction is possible even if the error correction is itself noisy. The error does not need to accumulate, it can be made as small as you like at the cost of some efficiency. This is not a new problem, the relevant theorems are incredibly robust and have been known for decades.
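
A toy way to see this, just the standard repetition-code argument (nothing LLM-specific):

    # A step that's correct with probability 0.7, repeated n times with a
    # majority vote: the error shrinks as n grows, at the cost of n-fold work.
    import random

    def majority_error(p_correct, votes, trials=100_000):
        wrong = 0
        for _ in range(trials):
            correct = sum(random.random() < p_correct for _ in range(votes))
            if correct <= votes // 2:      # majority wrong (ties count as wrong)
                wrong += 1
        return wrong / trials

    for n in (1, 5, 15, 51):
        print(n, majority_error(0.7, n))
    # Roughly 0.30, 0.16, 0.05, <0.01: the residual error can be pushed as low
    # as you like by stacking more (individually noisy) checks.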


Can you link me to a proof demonstrating that the error can be made arbitrarily small? (Or at least a precise statement of the theorem you have in mind.) I would think that if the last step of error correction turns a correct intermediate result into an incorrect final result with probability p, that puts a lower bound of p on the overall error rate.


Yann LeCun's prediction was empirically refuted. He says that the longer LLMs run, the less accurate they get. OpenAI showed the opposite is true.


They didn't show this, they just increased the length where accuracy breaks down.


Explain? OpenAI showed the new scaling law in December 2024 that performance keeps increasing proportional to ln(N reasoning tokens)


link?


LeCun is for sure a source of inspiration, and I think he has a fair critique that still holds true despite what people think when they see reasoning models in action. But I don't think like him that autoregressive models are a doomed path or whatever. I just like to question things (and don't have absolute answers).

I-JEPA and V-JEPA have recently shown promising results as well.


I think recurrent training approaches like those discussed in COCONUT and similar papers show promising potential. As these techniques mature, models could eventually leverage their recurrent architecture to perform tasks requiring precise sequential reasoning, like odd/even bit counting that current architectures struggle with.


I think the authors misunderstand what's actually going on.

I think this is the crux:

>They are vastly more powerful than what you get on an iPhone, but the principle is similar.

This analogy is bad.

It is true that the _training objective_ of LLMs during pretraining might be next token prediction, but that doesn't mean that 'your phone's autocomplete' is a good analogy, because systems can develop far beyond what their training objective might suggest.

Literally humans, optimized to spread their genes, have developed much higher level faculties than you might naively guess from the simplicity of the optimisation objective.

If the behavior of top LLMs didn't convince you of this, they clearly develop much more powerful internal representations than an autocomplete does, are much more capable etc.

I would point to papers like Othello-gpt, or lines of work on mechanistic interpretability, by Anthropic, and others, as very compelling evidence.

I think that, contrary to the authors, using words like 'understand' and 'think' for these systems is much more helpful than to conceptualise them as autocomplete.

The irony is that many people are autocompleting from the training objective to the limits of the system; or from generally being right by calling BS on AI, to concluding it's right to call BS here.


I think most of the replies, here and on stack exchange, are answering slightly the wrong question.

It is fair to ask why the likelihoods are useful if they are so small, and it's not a good answer to talk about how they could be expressed as logs, or even to talk about the properties of continuous distributions.

I think the answer is:

Yes, individual likelihoods are so small that, yes, even an MLE solution is extremely unlikely to be correct.

However, the idea is that often a lot of the probability mass - an amount that is not small - will be concentrated around the maximum likelihood estimate, and so that's why it makes a good estimate, and worth using.

Much like how the average is unlikely to be the exact value of a new sample from the distribution, but it's a good way of describing what to expect. (And gets better if you augment it with some measure of dispersion, and so on). (If the distribution is very dispersed, then while the average is less useful as an idea of what to expect, it still minimises prediction error in some loss; but that's a different thing and I think less relevant here).
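
To make that concrete, a quick sketch using the setup from the original question (true mean 5, true standard deviation 5), assuming NumPy and SciPy are available:

    # The likelihood of the data is astronomically small even at the best estimate,
    # but the estimate lands near the truth, and what matters is how the likelihood
    # compares across parameter values, not its absolute size.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    x = rng.normal(loc=5, scale=5, size=50)

    mu_hat, sd_hat = x.mean(), x.std()        # the MLE for a normal distribution
    print(mu_hat, sd_hat)                     # both close to 5

    ll_hat  = stats.norm(mu_hat, sd_hat).logpdf(x).sum()
    ll_true = stats.norm(5, 5).logpdf(x).sum()
    ll_far  = stats.norm(20, 5).logpdf(x).sum()

    print(np.exp(ll_hat))     # tiny, like the 9e-65 in the question
    print(ll_hat - ll_true)   # small gap: the truth is nearly as plausible as the MLE
    print(ll_hat - ll_far)    # huge gap: distant parameters are vastly less plausible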


> It is fair to ask why the likelihoods are useful if they are so small

The way the question demonstrates "smallness" is wrong, however. They quote the product of the likelihoods of 50 randomly sampled values - 9.183016e-65 - as if the smallness of this value is significant or meant anything at all. Forget the issue of continuous sampling from a normal distribution, and just consider the simple discrete case of flipping a coin. The combined probability of any permutation of 50 flips is 0.5 ^ 50, a really small number. That's because the probability is, in fact, really small!


Right - and so the more appropriate thing to do is not look at the raw likelihood of any one particular value but instead look at relative likelihoods to understand what values are more likely than other values.


Therefore, likelihood ratios! (Or log likelihood ratios)


For the discrete case, it seems that a better thing to do is consider the likelihood of getting that number of heads, rather than the likelihood of getting that exact sequence.

I am not sure how to handle the continuous case, however.


Of course you ignore irrelevant ordering of data points. That's not the issue.

The issue, for discrete or continuous (which are mathematically approximations of each other), is that the value at a point is less important than the integral over a range. That's why standard deviation is useful. The argmax is a convenient average over a weightable range of values. The larger your range, the greater the likelihood that the "truth" is in that range.

If you only need to be correct up to 1% tolerance, the likelihood of a range of values that have $SAMPLING_PRECISION tolerance is not important. Only the argmax is, to give you a center of the range.


Yes - the most enlightening concept for me was "Highest Probability Density Interval" which basically always is clustered around the mean. But you can choose any interval which contains as much probability mass!

https://en.wikipedia.org/wiki/Credible_interval#Choosing_a_c...

It's a fairly common "mistake" to assume that the MLE is useful as a point estimate without considering covariance/spread/CI/HPDI/FIM/CRLB/Entropy/MI/KLD or some other measure of precision given the measurement set.


> However, the idea is that often a lot of the probability mass - an amount that is not small - will be concentrated around the maximum likelihood estimate, and so that's why it makes a good estimate, and worth using.

This may be true in low dimensions but doesn't generalise to high dimensions. Consider a 100-dimensional standard normal distribution, for example. The MLE will still be at the origin, but most of the mass will live in a thin shell at a distance of roughly 10 units from the origin.
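
A quick numerical check of that:

    # In 100 dimensions the density peaks at the origin, but essentially all the
    # mass sits in a thin shell at radius about sqrt(100) = 10.
    import numpy as np

    rng = np.random.default_rng(0)
    radii = np.linalg.norm(rng.standard_normal((100_000, 100)), axis=1)
    print(radii.mean(), radii.std())   # ~10.0 and ~0.7
    print((radii < 5).mean())          # ~0.0: effectively no samples anywhere near the mode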


I think the "mass" they are referring to might the mass of the Bayesian posterior in parameter space, not the mass of the data in event space.


Yes, in parameter space.

However, TobyTheCamel's point is valid in that there are some parameter spaces where the MLE is going to be much less useful than others.

Even without having to go to high dimensions, if you've got a posterior that looks like a normal distribution, the MLE is going to tell you a lot, whereas if it's a multimodal distribution with a lot of mass scattered around, knowing the MLE is much less informative.

But this is a complex topic to address in general, so I'm trying to stick to what I see as the intuition behind the original question!


Concentration of mass is density. A shell is not dense.

If I am looking for a needle in a hyperhaystack, it's not important to know that it's more likely to be "somewhere on the huge hyperboundary" than "in the center hypercubic inch".


Disagree:

A lot of why large corporations fail to make products that people enjoy is tied up in this behavior, and that mass is not independently distributed along each dimension - you end up with "continents of taste" that your centroid product sucks for equally.


This is similar to how they originally tried to build fighter jet seats for the average pilot, but it failed because it turned out there were no average pilots, so they had to make them adjustable.


And yet your parent comment was right in saying that it won't be true that "a lot of the probability mass - an amount that is not small - will be concentrated" in the center hypercubic inch.


> Yes, individual likelihoods are so small, that yes even a MLE solution is extremely unlikely to be correct.

Can you elaborate? An MLE is never going to come up with the exact parameters that produced the samples, but in the original example, as long as you know it's a normal distribution, MLE is probably going to come up with a mean between 4 and 6 and a SD within a similar range as well (I haven't calculated it, just eyeballing it) -- when the original parameters were 5 and 5.

I guess I don't know what you mean by "correct", but that's as correct as you can get, based on just 50 samples.


Right - I think this is what's at the heart of the original question.

I know they asked with a continuous example, but I don't interpret their question as limited to continuous cases, and I think it's easier to address using a discrete example, as we avoid the issue of each exact parameter having infinitesimal mass which occurs in a continuous setting.

Let's imagine the parameter we're trying to estimate is discrete and has, say, 500 different possible values.

Let's say the parameter can have the value of the integers between 1 and 500 and most of the mass is clustered in the middle between 230 and 270.

Given some data, it would actually be possible that MLE would come up with the exact value, say 250.

But maybe, given the data, a range of values between 240 and 260 are also very plausible, so the exact value 250 still has a fairly low probability.

The original poster is confused, because they are basically saying, well, if the actual probability is so low, why is this MLE stuff useful?

You are pointing out they should really frame things in terms of a range and not a point estimate. You are right; but I think their question is still legitimate, because often in practice we do not give a range, and just give the maximum likelihood estimate of the parameter. (And also, separately, in a discrete parameter setting, a specific parameter value could have substantial mass.)

So why is the MLE useful?

My answer would be, well, that's because for many posterior distributions, a lot of the probability mass will be near the MLE, if not exactly at it - so knowing the MLE is often useful, even if the probability of that exact value of the parameter is low.


I agree with your points, and that's why it's useful to compare an MLE to an alternative model via a likelihood ratio test, in which case one sees how much better the generative model performs compared to the wrong model.

Similarly, AIC values do not make a lot of sense on an absolute scale but only relative to each other, as written in [1].

[1] Burnham, K. P., & Anderson, D. R. (2004). Multimodel inference: understanding AIC and BIC in model selection. Sociological methods & research, 33(2), 261-304.


> However, the idea is that often a lot of the probability mass - an amount that is not small - will be concentrated around the maximum likelihood estimate, and so that's why it makes a good estimate, and worth using.

This is a Bayesian point of view. The other answers are more frequentist, pointing out that likelihood at a parameter theta is NOT the probability of theta being the true parameter (given data). So we can't and don't interpret it like a probability.


Given enough data, Bayesian and frequentist models tend to converge to the same answer anyway.

Bayesian priors have a similar effect to regularization (e.g. ridge regression / penalizing large parameter values).
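
A concrete toy case of that correspondence (linear regression with a Gaussian prior on the weights, whose MAP estimate has the same closed form as ridge regression with penalty sigma^2/tau^2):

    # MAP for y = Xw + noise, noise ~ N(0, sigma^2), prior w ~ N(0, tau^2 I).
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 3))
    w_true = np.array([2.0, -1.0, 0.5])
    y = X @ w_true + 0.5 * rng.standard_normal(200)   # noise sd sigma = 0.5

    sigma2, tau2 = 0.25, 1.0
    alpha = sigma2 / tau2                             # the equivalent ridge penalty

    # Ridge / Gaussian-prior MAP closed form: solve (X'X + alpha*I) w = X'y
    w_map = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)
    print(w_map)                                      # close to w_true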


That's not a Bayesian point of view. You can re-word it in terms of a confidence interval / coverage probability. It is true that in frequentist statistics parameters don't have probability distributions, but their estimators very much do. And one of the main properties of a good estimator is formulated in terms of convergence in probability to the true parameter value (consistency).


The Austen novels made a lot more sense to me when I started to think of them as closer to tales of corporate Mergers & Acquisitions, rather than love stories. The rich families then were like large corporations are now, and a marriage was a very financial merger.

I think I realised this when I first read Pride and Prejudice and the main character started talking about basically falling in love with the Pemberley estate.

Thereafter, any time I visit an English country house with extensive gardens, the massive wealth expenditure to create them makes a lot more sense when you view them as M&A marketing budget.

This is hopefully too cynical, and the truth is somewhere in between - but it's equally naive to read Austen as straight love stories with a modern perspective - there's a lot of clear focus on the incomes and social situations in the text.


For most of human civilization, marriages _were_ mergers and acquisitions. Over the course of your married life you developed love through mutual companionship. The idea of them being primarily driven by romance and love is a very recent artifact. In many ways it's also an incomplete and somewhat inimical development, because I've observed modern couples ignore aspects of duty, responsibility, service etc. that are central to building a life together, and obsess just singularly over love or attraction.


Marriage among the landed gentry was both an economic and an emotional arrangement, and the novels explore the tension between these aspects. If you focus on just one aspect you miss the conflict of the story. It is literally in the title of "Sense and Sensibility", where sense is the financial aspect and sensibility the emotional.


Lizzie was joking about falling in love with the Pemberley estates. She fell in love through seeing him through the eyes of the people who knew him best.

However, marriage was primarily a financial arrangement back then. That is true.


> Lizzie was joking

Well, obviously we shouldn't get too hung up on what a fictional character thought - but I stand by my recollection.

Just googling it, and finding this page: https://www.sparknotes.com/lit/pride/quotes/symbol/pemberley...

I think you can say the last quote on that page is the character joking (although I'm not sure I read it that way); but the second last quote was the one I was referring to, and is in the narrator's voice.

But, look, while reading that did change my perspective on the story, I also don't want to interpret things too cynically; I'm not saying the character of Elizabeth should be read as purely seeking advantage; just that they were clearly evaluating marriage on a combination of advantage, and 'love', with a lot of weight on the former; and all of Austen made a lot more sense when I realised that.


> This is hopefully too cynical...

Nah, it's just how it is.

A casual web search will show that women care about money in relationships and marriage... a lot.


I'd be willing to bet that the majority of results you find in that search will be men talking about how much women care about money... of course some women do, but far far from all of them


Well, it's not about money per se, rather power, competence and whatever it translates to. This is an almost daily sight in any society if you know what to look for, and the basis of who we were and who we are today. I don't see any problems with it; there is good logic and history behind it, and it explains a lot of the behavior of various folks.

A certain age usually brings this 'seeing' effortlessly, and you actually become annoyed by the additional layer of personality that you see (since you can't escape anymore the fact that a lot of people of both sexes are just badly broken, or simply not good people to the core). You just need to grind through enough people/stories around you, like any other skill.


Whatever your position on this, reducing things back to 21st century individualistic dating preferences and gender norms is ahistorical and shallow. The parent comment was spot on in the analogy — this isn’t about what individuals choose, it’s dynastic. A family is “the firm” and marriage in one of the primary strategic tools to advance its interests. At the very top of the pile, the Austrian Habsburg family built a 500 year imperial dynasty in Central Europe primarily from marriages. But it’s similar further down: land, property, family reputation, strategic alliances, and sometimes (but not always) individual preferences as well.


Meh, when we got married she had assets that were easily more than 10x what I owned. I wasn't even able to make any real money until two years later.


Don't forget the political alliance aspect.


There's a big gap between training an algorithm on a toy problem, vs building a useful product.

Software engineers often are missing key skills. They can learn them, but won't automatically get them in their traditional training.

First, measuring success. Actually telling how well a production system is doing is tricky. There's an art to developing metrics that tell you if an ML system is delivering value, and a lot of engineers don't have the metric design skills. Often to productionize an ML system, you need a bunch of proxy metrics and a pretty good backtesting setup. This will often depend on the specific problem, and the skill of it is something you won't get in a standard software setting.

Engineers - and especially designers - also struggle with edge cases when things go off the happy path. It's often easy to make an ML prototype that works in 90% of cases and get a project started - but a nightmare to solve enough of the edge cases for a production-grade system. Finding and papering over and designing around all those edge cases effectively can require a deep bag of tricks a pure software engineer won't have.

Finally there's a struggle with tactics and culture. A lot of the bread and butter tactics of high performing software delivery are the opposite of what you need for ML projects. E.g. In high velocity frontend work you want to lock a design early, and your designer can probably do a lot of iteration before engineering starts. In ML projects you want to keep the design floating and low fidelity, as you prototype, and lock it late in the project.

So many development tactics, and cultural patterns, that lead to high performing software teams, in a SaaS setting, say, are anathema to ML projects.


> Engineers - and especially designers - also struggle with edge cases when things go off the happy path. It's often easy to make an ML prototype that works in 90% of cases, and get a project started - but a nightmare to solve enough the edge cases for a production grade system. Finding and papering over and designing around all those edge cases effectively can require a deep bag of tricks a pure software engineer won't have.

YMMV. Finding and papering over the things that prevent a model from being deployable can also require a deep bag of engineering tricks that an average ML research scientist does not have. In my personal experience I've seen teams burned by this more often than the other way around.

