
> It's not really that, it's that recalling a set of rules and following a set of rules are fundamentally different tasks for an LLM.

My point is that this appears to be the case for people too. It is often necessary to explicitly remind people to recall a set of rules to get them to follow the specific rules rather than act in a way that may or may not match the rules.

Having observed this many times, I simply don't believe that most humans will see e.g. an addition problem and go "oh, right, this is the set of rules I should follow for addition, let me apply them step by step". If we've had the rules reinforced through repetitive training enough times, we will end up applying them. But a lot of the time people will know the steps yet still not apply them unless prompted, just like LLMs. Quite often people will still give an answer. Sometimes even the correct one.

But without applying the methods we've been taught. To the point that when dealing with e.g. new learners - children in particular - who haven't had enough reinforcement in just applying a method, it's not at all unusual to find yourself having conversations like this: "Ok, so to do X, what are the steps you've been taught? Ok, so you remember that they are A, B and C. Great. Do A. You've done A? Now do B..." and so on.

To me, getting a child to apply a method they know to solve a problem is remarkably close to getting an LLM to actually recall and follow these methods.

But even for professionals, checklists exist for a reason: We often forget steps, or do them wrong, and forget to even try to explicitly recall a list of steps and do them one by one when we don't have a list of steps in front of us.
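To make concrete what "applying the rules of addition step by step" amounts to, here is a minimal sketch of school-style column addition (the decomposition into steps is my own, not something from the thread):

```python
def column_add(a: int, b: int) -> int:
    """Add two non-negative integers digit by digit, as taught in school."""
    xs, ys = str(a)[::-1], str(b)[::-1]  # least-significant digit first
    carry, digits = 0, []
    for i in range(max(len(xs), len(ys))):
        x = int(xs[i]) if i < len(xs) else 0
        y = int(ys[i]) if i < len(ys) else 0
        total = x + y + carry          # step: add the column plus the carry
        digits.append(total % 10)      # step: write down the units digit
        carry = total // 10            # step: carry the tens digit
    if carry:
        digits.append(carry)           # step: write any final carry
    return int("".join(str(d) for d in reversed(digits)))

print(column_add(478, 356))  # → 834
```

The point of the analogy is that each loop iteration is one of the "do A, now do B" prompts: the procedure only gives the right answer if every step is actually executed, not merely known.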



I don't believe this works the way you think. Within the same chat session with GPT3 you can ask it to explain addition, then ask it to do addition, and the explanation will be perfectly accurate but the sums it does will be complete rubbish. It's not enough to remind it.

The article og_kalu posted above goes into detail about what they had to do to teach an LLM to reason algorithmically in a specific problem domain, and it was incredibly hard; much, much more convoluted and involved than just reminding it of the rules. Only an LLM that has gone through this intensive, multi-step, highly domain-specific training regime has a hope of getting good results, and then only in that specific problem domain. With a human you teach a reasoning ability and get them to apply it in different domains; with LLMs that doesn't work.

Take this comment in the article: "However, despite significant progress, these models still struggle with out-of-distribution (OOD) generalization on reasoning tasks". Where humans naturally generalise reasoning techniques from one problem area to another, LLMs flat out don't. If you teach it some reasoning techniques while teaching it to do sums, you have to start again from scratch when teaching it to apply even the same reasoning techniques to any other problem domain, every single time. You can't remind them that they learned this or that when learning to do sums and tell them to use it again in the new context, as you would with a human; at the moment that flat out doesn't work.

The reason it doesn't work is precisely due to the limitations imposed by token stream prediction. The different tasks involving reasoning are different token stream domains, and techniques the LLM uses to optimise for one token stream domain currently only seem to apply to that token stream domain. If you don't take that into account you will make fundamental errors in reasoning about the capabilities of the system.

So what we need to do is come up with architectures and training techniques to somehow enable them to generalise these reasoning capabilities.


> I don't believe this works the way you think. Within the same chat session with GPT3 you can ask it to explain addition, then ask it to do addition, and the explanation will be perfectly accurate but the sums it does will be complete rubbish. It's not enough to remind it.

Again, I've had this exact experience with people many times as well, so again I don't think this in itself is any kind of indication of whether or not LLMs are all that different from humans in this regard. The point is not that there aren't things missing from LLMs, but that I don't find the claim that this behaviour shows how different they are to be at all convincing.

My experience is that people do not appear to naturally generalise reasoning techniques very well unless - possibly - they are trained at doing so (possibly, because I'm not convinced that even most of those of us with significantly above average intelligence generalise reasoning nearly as well as we'd like to think).

Most people seem to learn not by being taught a new technique and then "automatically applying it", but by being taught a new technique and then being made to practice it repetitively, prompted step by step, until they've learnt to apply it separately from the process of following the steps - and they tend to perform really poorly and make lots of mistakes while still working from instructions.

> You can't remind them they learned this or that when learning to do sums and to use it again in this context, as you would with a human, at the moment that flat out doesn't work.

I don't know what you're trying to say here. Mentioning a technique to ChatGPT and telling it to go through it step by step is not flawless, but it often does work. E.g. I just tested by asking GPT4 for a multiplication method, then asked it to use that method on two numbers I provided and show its working, and it did just fine. At the same time, doing this with humans often requires a disturbingly high level of step by step prompting (having a child, I've been through a torturous amount of this). I won't suggest ChatGPT is as good at following instructions as people, yet, but most people are also really awfully horrible at following instructions.
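For reference, the kind of "multiplication method" being asked about - long multiplication via partial products - can be sketched like this (a hypothetical reconstruction of such a method, not the actual chat transcript):

```python
def long_multiply(a: int, b: int) -> int:
    """Multiply via partial products, printing each step's working."""
    result = 0
    for place, digit_char in enumerate(reversed(str(b))):
        digit = int(digit_char)
        partial = a * digit * 10 ** place   # one written row of the method
        print(f"{a} x {digit} x 10^{place} = {partial}")
        result += partial                   # sum the rows at the end
    return result

print(long_multiply(123, 45))  # → 5535
```

"Show its working" corresponds to the printed rows; skipping or botching any single row is exactly the kind of step-level failure both children and LLMs exhibit.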


>Again, I've had this exact experience with people many times as well

There's a difference between some people sometimes needing to be reminded to do something, and them flat out not being able to do it due to fundamental cognitive limitations.

> E.g. I just tested by asking GPT4 for a multiplication method and then asked it to use it on two numbers I provided and show its working, and it did just fine.

That's because GPT4 has been custom tuned and trained on that specific task as well, along with many others. That training - why it was necessary and how it works - is what the paper referred to previously was about.

This is literally the subject under discussion. You're using the fact that the system was custom trained to do that specific task, and is therefore good at it, along with other basic mathematical tasks, as evidence that it has a general ability that doesn't need custom training.

This is what I'm talking about. As these models get better problem-domain-specific training, two things will happen. One is that they will become dramatically more powerful and capable tools. That's great.

The other is that more and more people will come to fundamentally misunderstand how they function and what they do, because they will see that they produce similar results to humans in many ways. They will infer from that that the models work cognitively in the same way as humans, will reason about their abilities on that basis, and will make errors as a result, because they're not aware of the highly specialised training the systems had to go through - training that was necessary precisely because there are very important and impactful ways in which they don't cognitively function like humans.

This matters, particularly when non specialists like politicians are making decisions about how this technology should be regulated and governed.


> There's a difference between some people sometimes needing to be reminded to do something, and them flat out not being able to do it due to fundamental cognitive limitations.

GPT4 isn't "flat out not able to do it" when reminded. My point was that I have had the same experience of having to prompt step by step and go "why did you do that? Follow the steps" with both fully functional, normally intelligent people and with GPT4 for similarly complex tasks, and given the improvement between 3.5 and 4 there's little reason to assume this won't keep improving for at least some time more.

> That's because GPT4 has been custom tuned and trained on that specific task as well, along with many others. It's that training, why it was necessary and how it works that the paper referred to previously was about.

So it can do it when trained, just like people, in other words.

> They will infer from that they work cognitively in the same way as humans

And that would be bad. But so is automatically assuming that there's any fundamental difference between how they work and how human reasoning works, given that we simply do not know how human reasoning works, and given that LLMs in an increasing number of areas show behaviour similar to untrained people (e.g. failing to fall back on learned rules) when their reasoning breaks down.

Again, I'm not saying they're reasoning like people, but I'm saying that we know very little about what the qualitative differences are outside of the few glaringly obvious aspects (e.g. lack of lasting memory and lack of ongoing reinforcement during operation), and we don't know how necessary those will be (we do know that humans can "function" for some values of function without the ability to form new lasting memories, but obviously it provides significant functional impairment).


> Again, I'm not saying they're reasoning like people

Cool, that’s really the only point I’m making. On the one hand it’s certainly true we can overcome a lot of the limitations imposed by that basic token sequence prediction paradigm, but they are just workarounds rather than general solutions and therefore are limited in interesting ways.

Obviously I don't know for sure how things will pan out, but I suspect we will soon encounter scaling limitations in the current approach. Not necessarily scaling limitations fundamental to the architecture as such, but limitations in our ability to develop sufficiently well developed training texts and strategies across so many problem domains. That may be several model generations away though.


> Cool, that’s really the only point I’m making.

To be clear, I'm saying that I don't know if they are, not that we know that it's not the same.

It's not at all clear that humans do much more than "that basic token sequence prediction" for our reasoning itself. There are glaringly obvious auxiliary differences, such as memory, but we just don't know how human reasoning works, so writing off a predictive mechanism like this is just as unjustified as assuming it's the same. It's highly likely there are differences, but whether they are significant remains to be seen.

> Not necessarily scaling limitations fundamental to the architecture as such, but limitations in our ability to develop sufficiently well developed training texts and strategies across so many problem domains.

I think there are several big issues with that thinking. One is that this constraint is an issue now in large part because GPT doesn't have "memory" or an ability to continue learning. Those two need to be overcome to let it truly scale, but once they are, the game fundamentally changes.

The second is that we're already at a stage where using LLMs to generate and validate training data works well for a whole lot of domains, and that will accelerate, especially when coupled with "plugins" and the ability to capture interactions with real-life users [1].

E.g. a large part of the human ability to do maths with any kind of efficiency comes down to rote repetition, and generating large sets of simple quizzes for such areas is near trivial if you combine an LLM with tools for it to validate its answers. And unlike with humans, where we have to make this effort for billions of individuals, once you have the ability to let these models continue learning you make this investment in training once (or once per major LLM effort).
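Generating and grading such drills mechanically is indeed near trivial. A minimal sketch, with the validation done by exact computation (the actual LLM call is left out here - a real pipeline would feed `quiz["prompt"]` to a model and pass its reply to `grade`):

```python
import random

def make_quiz(n: int, seed: int = 0) -> list[dict]:
    """Generate n simple addition drills with machine-checkable answers."""
    rng = random.Random(seed)
    quizzes = []
    for _ in range(n):
        a, b = rng.randint(10, 999), rng.randint(10, 999)
        quizzes.append({"prompt": f"What is {a} + {b}?", "answer": a + b})
    return quizzes

def grade(quiz: dict, model_output: str) -> bool:
    """Validate a model's free-text answer against the exact result."""
    digits = "".join(ch for ch in model_output if ch.isdigit())
    return digits != "" and int(digits) == quiz["answer"]

quiz = make_quiz(1)[0]
print(quiz["prompt"])
print(grade(quiz, f"The sum is {quiz['answer']}."))  # a correct reply passes
```

The same shape works for any domain where answers can be checked by a tool rather than a human, which is what makes the training-data generation loop cheap.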

A third is that GPT hasn't even scratched the surface of what is available in digital collections alone. E.g. GPT3 was trained on "only" about 200 million Norwegian words (I don't have data for GPT4). Norwegian is a tiny language - this was 0.1% of GPT3's total corpus. But the Norwegian National Library holds 8.5m items, which include something like 10-20 billion words in books alone, and many tens of billions more in newspapers, magazines and other data. That's one tiny language. We're many generations of LLMs away from even approaching exhausting the already available digital collections, and that's before we look at having models trained on that data generate and judge training data.
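As a back-of-envelope check on those figures (every number here is just the rough estimate from the paragraph above):

```python
norwegian_words = 200_000_000   # ~200M Norwegian words in GPT3's corpus
share_of_corpus = 0.001         # stated as 0.1% of the total
implied_corpus = norwegian_words / share_of_corpus
print(f"Implied total GPT3 corpus: {implied_corpus:.0e} words")

library_words = 15_000_000_000  # midpoint of the 10-20 billion book estimate
ratio = library_words / norwegian_words
print(f"Norwegian books alone: ~{ratio:.0f}x what GPT3 saw of the language")
```

So even taking the conservative midpoint, the books in that one national library alone are roughly 75 times the Norwegian text GPT3 was trained on.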

[1] https://sharegpt.com/



