I feel like your observation that this "isn't a complicated question" is leaning on an implicit assumption that ChatGPT is a general AI and not an LLM. It is just generating text based on probabilities -- it isn't "reasoning". I might go as far as to say that inferences computed by an LLM are all of the same complexity, but I don't really know enough about ChatGPT to be confident in that statement.
People keep repeating that LLMs are "just generating text based on probabilities". That statement doesn't mean anything.
I think people who say this are imagining LLMs work something like a statistical model. Maybe it's doing a linear regression or works like a Markov chain. It's not.
A single artificial neuron does work something like that. But that's like saying a single transistor is just an electronically controlled switch, so the only thing computers can do is switching. It's true in some sense that computers are just doing a lot of switching, but it turns out all this switching is Turing-complete. That means computers can theoretically compute anything that's possible to compute given enough time and memory, which includes anything a human could figure out.
A similar principle applies to LLMs. Using probabilities is part of what they do, but that doesn't preclude them from using logic and rules of inference.
It's worth noting that "just generating text based on probabilities" describes Markov algorithms [1], which are Turing-complete. People overestimate how much it takes to end up with something Turing-complete. (Markov algorithms only generate text with probability 100% or 0%, based on whether a certain rule matches or not, so they're even simpler.)
(A Markov algorithm is distinct from a Markov chain, but as far as I can tell you could emulate a Markov algorithm with a Markov chain given a sufficient number of states, transitions clamped to 0% or 100%, and the ability to iterate over its own output; with a large enough state machine, iteration, and a mechanism for memory, it's almost hard not to end up with a Turing machine.)
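To make that concrete, here's a minimal sketch in Python of a Markov algorithm as I understand it: an ordered list of deterministic string-rewrite rules applied repeatedly. The run_markov helper and the unary-addition rule are just my own illustration.

    # A minimal sketch of a Markov algorithm: an ordered list of deterministic
    # string-rewrite rules, always applying the first rule that matches (at its
    # leftmost occurrence) until no rule applies. No probabilities other than
    # 0% or 100%, yet the formalism is Turing-complete.

    def run_markov(rules, text, max_steps=10_000):
        for _ in range(max_steps):
            for pattern, replacement, terminal in rules:
                if pattern in text:
                    text = text.replace(pattern, replacement, 1)
                    if terminal:
                        return text
                    break
            else:
                return text  # no rule matched: halt
        raise RuntimeError("step limit exceeded")

    # Unary addition: deleting the '+' turns "111+11" (3 + 2) into "11111" (5).
    print(run_markov([("+", "", False)], "111+11"))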
That's a valid point, in that we don't fully understand how LLMs solve some problems, and using logic and rules of inference isn't excluded by the architecture. On the other hand, understanding that they are generating probabilistic token sequences is a very powerful and effective way to learn how to engineer prompts and to understand some of their failure modes. If we discard that insight, reasoning about those failure modes and limitations becomes near impossible.
For example, we often see people assuming that because an LLM can explain how to do something, like arithmetic, it therefore knows how to do it. That's because if a human can explain how to do something, we can generally assume they can do it. Yet for an LLM, outputting a token sequence that explains a technique and outputting a token sequence that solves a problem in that domain are fundamentally different tasks.
We can get round this with very clever prompt engineering to 'force' chain of reasoning behaviour, as this discussion shows, but the reason we have to do that is precisely because the cognitive architecture of these LLMs is fundamentally different from that of humans.
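As a rough sketch of the prompting difference (not anyone's actual code; query_llm is a hypothetical placeholder for whatever LLM API is in use):

    # Explaining a method, answering directly, and showing working are three
    # different token-sequence tasks for the model.

    def query_llm(prompt: str) -> str:
        # Stand-in: a real implementation would call an LLM API here.
        return "<model output for: " + prompt + ">"

    problem = "What is 48293 * 1207?"

    explanation = query_llm("Explain how to multiply two large numbers by hand.")
    direct_answer = query_llm(problem)  # the model must emit the digits in one shot

    # Chain-of-thought style prompt: the generated steps act as external working memory.
    worked_answer = query_llm(
        problem + " Work through the long multiplication step by step, "
        "writing out each partial product, then state the final answer."
    )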
Yet these systems are clearly highly capable, and it is possible to dramatically improve their abilities with clever engineering. I think this means LLMs may be incredibly powerful components of systems that become far more advanced and sophisticated AIs. However, to do that engineering and build dramatically more capable systems, we need a clear understanding of how and why LLMs work, what their advantages and limitations are, and how to reason about and work with those features.
>For example, we often see people assuming that because an LLM can explain how to do something, like arithmetic, it therefore knows how to do it. That's because if a human can explain how to do something, we can generally assume they can do it. Yet for an LLM, outputting a token sequence that explains a technique and outputting a token sequence that solves a problem in that domain are fundamentally different tasks.
Please explain to me the process currently happening in your visual cortex as you read this text.
The fact that neuroscience exists as a field (with so many remaining questions) shows that humans also do not understand how we can do all the things we do.
I agree generally with what you're saying. I was arguing against people who seem to have concluded LLMs can't do any reasoning because they're "just generating text based on probabilities". I've seen people express that point of view quite a few times and it seems to be based on a superficial and incorrect understanding of how LLMs work.
And although I think one could demonstrate fairly easily that ChatGPT is capable of some level of deductive reasoning, my last post wasn't even arguing about any actual capabilities of current LLMs. I was just saying you can't conclude that LLMs can't reason (even in theory) because they're "just generating text based on probabilities".
That said, it's not clear to me what the limits are on LLMs as they scale up. GPT-4 can usually add very large numbers together (I've tested it with 20 digit numbers) without any chain-of-thought, something older models struggled with. I think addition works well because you almost don't need internal working memory to do it. You can _usually_ compute a digit of the answer just by looking at 2-3 digits of each of the summands. Occasionally this isn't true: if you have a long sequence of columns that each sum to 9, then a carry from many digits away can affect the current digit. But that's rare.
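A quick illustration of that carry behaviour, with a toy adder (just my own sketch) that tracks the longest carry chain: most columns only need the local digits, but a run of 9s lets a carry ripple arbitrarily far.

    def add_tracking_carries(a: str, b: str):
        width = max(len(a), len(b))
        a, b = a.zfill(width), b.zfill(width)
        carry, chain, longest_chain = 0, 0, 0
        digits = []
        for x, y in zip(reversed(a), reversed(b)):
            total = int(x) + int(y) + carry
            digits.append(str(total % 10))
            carry = total // 10
            chain = chain + 1 if carry else 0
            longest_chain = max(longest_chain, chain)
        if carry:
            digits.append("1")
        return "".join(reversed(digits)), longest_chain

    print(add_tracking_carries("382417", "517292"))  # ('899709', 1): carries stay local
    print(add_tracking_carries("9999999999", "1"))   # ('10000000000', 10): one carry ripples through every column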
Multiplication of large numbers, by contrast, does require working memory and an iterative algorithm. It makes a lot of sense that chain-of-thought helps with this. The text the LLM writes functions as working memory, and it iteratively generates the response, token by token.
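Here's a rough analogy in code (my own illustration): long multiplication where the recorded partial products play the role of the working memory that chain-of-thought text provides.

    def long_multiply(a: int, b: int) -> int:
        partials = []
        for position, digit in enumerate(reversed(str(b))):
            partial = a * int(digit) * 10 ** position
            partials.append(partial)
            print(f"{a} x {digit} (shifted {position} places) = {partial}")
        result = sum(partials)
        print(f"sum of partial products = {result}")
        return result

    long_multiply(4378, 269)  # prints each partial product, then 1177682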
Still, just scaling up the models has also helped a lot with multiplication (even without using chain-of-thought). Presumably larger models can have a larger part of the network devoted to arithmetic. It still doesn't compare to a calculator, and integrating LLMs with other tools or AI models sounds promising. But so far, the results of just scaling LLMs and training data have been surprisingly impressive.
> For example, we often see people assuming that because an LLM can explain how to do something, like arithmetic, it therefore knows how to do it. That's because if a human can explain how to do something, we can generally assume they can do it.
I think this example shows LLMs to be more like people, not less. It's not at all unusual to see humans struggle to do something until you remind them that they know an algorithm for doing so, and nudge them to apply it step by step. Sometimes you even have to prod them through each step.
LLMs definitely have missing pieces, such as a working memory, an ability to continue to learn, and an inner monologue, but I don't think their sometimes poor ability to recall and follow a set of rules is what sets them apart.
>I don't think their sometimes poor ability to recall and follow a set of rules is what sets them apart
It's not really that, it's that recalling a set of rules and following a set of rules are fundamentally different tasks for an LLM. This is why we need, and have implemented, different training and reinforcement strategies to close that gap. The chain of reasoning ability has had to be specifically trained into the LLMs; it didn't arise spontaneously. However, clearly this limitation can be, and is being, worked around. The issue is that it's a real and very significant problem that we can't ignore, and which must be worked around in order to make these systems more capable.
The fact is, LLMs as they are today have a radically different form of knowledge compared to us, and their reasoning ability is very different. This can lead people to look at an LLM's performance on one task and infer things about its other abilities that we think of as closely related, inferences which simply don't apply.
I see a lot of naive statements to the effect that these systems already reason like humans do and know things in the same way that humans do, when investigation into the actual characteristics of these systems shows that we can characterise very important ways in which they are completely unlike us. Yet they do know things and can reason. That's really important because if we're going to close that gap, we need to really understand that gap very well.
> It's not really that, it's that recalling a set of rules and following a set of rules are fundamentally different tasks for an LLM.
My point is that this appears to be the case for people too. It is often necessary to explicitly remind people to recall a set of rules to get them to follow the specific rules rather than act in a way that may or may not match the rules.
Having observed this many times, I simply don't believe that most humans will see e.g. an addition problem and go "oh, right, these are the rules I should follow for addition, let me apply them step by step". If we've had the rules reinforced through repetitive training enough times, we will end up applying them. But a lot of the time people will know the steps but still not necessarily apply them unless prompted, just like LLMs. Quite often people will still give an answer. Sometimes even the correct one.
But without applying the methods we've been taught. To the point where, when dealing with new learners - children in particular - who haven't had enough reinforcement in just applying a method, it's not at all unusual to find yourself having conversations like this: "Ok, so to do X, what are the steps you've been taught? Ok, so you remember that they are A, B and C. Great. Do A. You've done A? Now do B..." and so on.
To me, getting a child to apply a method they know to solve a problem is remarkably close to getting an LLM to actually recall and follow these methods.
But even for professionals, checklists exist for a reason: We often forget steps, or do them wrong, and forget to even try to explicitly recall a list of steps and do them one by one when we don't have a list of steps in front of us.
I don't believe this works the way you think. Within the same chat session with GPT3 you can ask it to explain addition, then ask it to do addition, and the explanation will be perfectly accurate but the sums it does will be complete rubbish. It's not enough to remind it.
The article og_kalu posted above goes into detail about what they had to do to teach an LLM to reason algorithmically in a specific problem domain, and it was incredibly hard; much, much more convoluted and involved than just reminding it of the rules. Only an LLM that has gone through this intensive, multi-step, highly domain-specific training regime has a hope of getting good results, and then only in that specific problem domain. With a human you teach a reasoning ability and get them to apply it in different domains; with LLMs that doesn't work.
Take this comment in the article: "However, despite significant progress, these models still struggle with out-of-distribution (OOD) generalization on reasoning tasks". Where humans naturally generalise reasoning techniques from one problem area to another, LLMs flat out don't. If you teach an LLM some reasoning techniques while teaching it to do sums, you have to start again from scratch when teaching it to apply even the same reasoning techniques to any other problem domain, every single time. You can't remind them they learned this or that when learning to do sums and to use it again in this context, as you would with a human, at the moment that flat out doesn't work.
The reason it doesn't work is precisely the limitations imposed by token stream prediction. The different tasks involving reasoning are different token stream domains, and the techniques the LLM uses to optimise for one token stream domain currently seem to apply only to that domain. If you don't take that into account you will make fundamental errors in reasoning about the capabilities of the system.
So what we need to do is come up with architectures and training techniques to somehow enable them to generalise these reasoning capabilities.
> I don't believe this works the way you think. Within the same chat session with GPT3 you can ask it to explain addition, then ask it to do addition, and the explanation will be perfectly accurate but the sums it does will be complete rubbish. It's not enough to remind it.
Again, I've had this exact experience with people many times as well, so again I don't think this in itself is any kind of indication of whether or not LLMs are all that different from humans in this regard. The point is not that there aren't things missing from LLMs, but that I don't find the claim that this behaviour shows how different they are to be at all convincing.
My experience is that people do not appear to naturally generalise reasoning techniques very well unless, possibly, they are trained at doing that (possibly, because I'm not convinced that even most of those of us with significantly above-average intelligence generalise reasoning nearly as well as we'd like to think).
Most people seem to learn not by being taught a new technique and then "automatically applying it", but by being taught a new technique and then being made to repetitively practice it, prompted step by step, until they've learnt to apply it separately from the process of following the steps; and they tend to perform really poorly and make lots of mistakes while still doing it by instruction.
> You can't remind them they learned this or that when learning to do sums and to use it again in this context, as you would with a human, at the moment that flat out doesn't work.
I don't know what you're trying to say here. Mentioning a technique to ChatGPT and telling it to go through it step by step is not flawless but it often does work. E.g. I just tested by asking GPT4 for a multiplication method and then asked it to use it on two numbers I provided and show its working, and it did just fine. At the same time, doing this with humans often requires a disturbingly high level of step by step prompting (having a child, I've been through a torturous amount of this). I won't suggest ChatGPT is as good at following instructions as people yet, but most people are also really awfully horrible at following instructions.
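Roughly, the test was structured like this (sketched with a hypothetical chat helper, not a specific vendor API): ask for a method in one turn, then ask the model to apply it in the next.

    def chat(messages):
        # Stand-in: a real implementation would send the conversation to a chat model.
        return "<model reply to: " + messages[-1]["content"] + ">"

    messages = [{"role": "user",
                 "content": "Describe a method for multiplying two large numbers by hand."}]
    method = chat(messages)  # turn 1: recall the method

    messages += [{"role": "assistant", "content": method},
                 {"role": "user",
                  "content": "Now use that method on 4378 * 269, showing every step."}]
    answer = chat(messages)  # turn 2: apply it, step by step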
>Again, I've had this exact experience with people many times as well
There's a difference between some people sometimes needing to be reminded to do something, and them flat out not being able to do it due to fundamental cognitive limitations.
>"E.g. I just tested by asking GPT4 for a multiplication method and then asked it to use it on two numbers I provided and show its working, and it did just fine."
That's because GPT4 has been custom tuned and trained on that specific task as well, along with many others. It's that training, why it was necessary and how it works that the paper referred to previously was about.
This is literally the subject under discussion. You're using the fact that the system was custom trained to do that specific task, and is therefore good at it along with other basic mathematical tasks, as evidence that it has a general ability that doesn't need to be custom trained.
This is what I'm talking about. As these models get better problem-specific training, two things will happen. One is that they will become dramatically more powerful and capable tools. That's great.
The other is that more and more people will come to fundamentally misunderstand how they function and what they do, because they will see that they produce similar results to humans in many ways. They will infer from that that these systems work cognitively in the same way as humans, and will reason about their abilities on that basis and make errors as a result, because they're not aware of the highly specialised training the systems had to go through, precisely because there are very important and impactful ways in which they don't cognitively function like humans.
This matters, particularly when non specialists like politicians are making decisions about how this technology should be regulated and governed.
> There's a difference between some people sometimes needing to be reminded to do something, and them flat out not being able to do it due to fundamental cognitive limitations.
GPT4 isn't "flat out not able to do it" when reminded. My point was that I have had the same experience of having to prompt step by step and go "why did you do that? Follow the steps" with both fully functional, normally intelligent people and with GPT4 for similarly complex tasks, and given the improvement between 3.5 and 4 there's little reason to assume this won't keep improving for at least some time more.
> That's because GPT4 has been custom tuned and trained on that specific task as well, along with many others. It's that training, why it was necessary and how it works that the paper referred to previously was about.
So it can do it when trained, just like people, in other words.
> They will infer from that that these systems work cognitively in the same way as humans
And that would be bad. But so is automatically assuming that there's any fundamental difference between how they work and how human reasoning works, given that we simply do not know how human reasoning works, and given that LLMs in an increasing number of areas show the same behaviour as untrained people (e.g. failure to fall back on learned rules) when their reasoning breaks down.
Again, I'm not saying they're reasoning like people, but I'm saying that we know very little about what the qualitative differences are outside of the few glaringly obvious aspects (e.g. lack of lasting memory and lack of ongoing reinforcement during operation), and we don't know how necessary those will be (we do know that humans can "function" for some values of function without the ability to form new lasting memories, but obviously it causes significant functional impairment).
> Again, I'm not saying they're reasoning like people
Cool, that's really the only point I'm making. It's certainly true we can overcome a lot of the limitations imposed by that basic token sequence prediction paradigm, but those fixes are workarounds rather than general solutions, and are therefore limited in interesting ways.
Obviously I don't know for sure how things will pan out, but I suspect we will soon encounter scaling limitations in the current approach. Not necessarily scaling limitations fundamental to the architecture as such, but limitations in our ability to develop sufficiently well developed training texts and strategies across so many problem domains. That may be several model generations away though.
To be clear, I'm saying that I don't know if they are, not that we know that it's not the same.
It's not at all clear that humans do much more than "that basic token sequence prediction" for our reasoning itself. There are glaringly obvious auxiliary differences, such as memory, but we just don't know how human reasoning works, so writing off a predictive mechanism like this is just as unjustified as assuming it's the same. It's highly likely there are differences, but whether they are significant remains to be seen.
> Not necessarily scaling limitations fundamental to the architecture as such, but limitations in our ability to develop sufficiently well developed training texts and strategies across so many problem domains.
I think there are several big issues with that thinking. One is that this constraint is an issue now in large part because GPT doesn't have "memory" or an ability to continue learning. Those two need to be overcome to let it truly scale, but once they are, the game fundamentally changes.
The second is that we're already at a stage where using LLMs to generate and validate training data works well for a whole lot of domains, and that will accelerate, especially when coupled with "plugins" and the ability to capture interactions with real-life users [1].
E.g. a large part of the human ability to do maths with any kind of efficiency comes down to rote repetition, and generating large sets of simple quizzes for such areas is near trivial if you combine an LLM with tools for it to validate its answers. And unlike with humans, where we have to repeat this effort for billions of individuals, once you have the ability to let these models continue learning you make this investment in training once (or once per major LLM effort).
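A sketch of what that quiz-generation loop could look like, with ask_llm as a hypothetical stand-in for the model and plain Python arithmetic as the validating tool:

    import random

    def ask_llm(prompt):
        # Stand-in: a real implementation would call the model being trained or evaluated.
        return "<model answer to: " + prompt + ">"

    def make_quiz(n):
        # Generate simple addition drills with known ground truth.
        for _ in range(n):
            a, b = random.randint(100, 999), random.randint(100, 999)
            yield f"What is {a} + {b}?", a + b

    verified_examples = []
    for question, truth in make_quiz(1000):
        reply = ask_llm(question + " Answer with just the number.")
        if reply.strip() == str(truth):  # keep only tool-verified answers as training data
            verified_examples.append((question, reply))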
A third is that GPT hasn't even scratched the surface of what is available in digital collections alone. E.g. GPT3 was trained on "only" about 200 million Norwegian words (I don't have data for GPT4). Norwegian is a tiny language - this was 0.1% of GPT3's total corpus. But the Norwegian National Library has 8.5m items, which include something like 10-20 billion words in books alone, and many tens of billions more in newspapers, magazines and other data. That's one tiny language. We're many generations of LLMs away from even approaching exhausting the already available digital collections, and that's before we look at having the models trained on that data generate and judge training data.
> That means computers can theoretically compute anything that's possible to compute given enough time and memory, which includes anything a human could figure out.
Whoa, that's quite a leap there. Not sure where we (as society) are with our understanding of intuition, but I doubt a million monkeys would recognize that the falling of an apple is caused by the same agent as the orbit of planets.