I also hope we get something like this. But sadly, it's not going to work. The reason is this line from the article, which is much harder than it looks:
> and a critic model filters the results for genuinely valuable ideas.
In fact, people have tried this idea. And if you use an LLM or anything similar as the critic, the model's performance actually degrades in the process: the LLM tries too hard to satisfy the critic, and the critic itself is far from a good reasoner.
So the reason we don't hear much about this idea is not that nobody tried it, but that they tried, it didn't work, and people are reluctant to publish negative results.
This doesn't just affect a potential critic model; the entire concept of a "reasoning" model is based on the same flawed idea: that the model can generate intermediate context to improve its final output. If that self-generated context contains hallucinations, baseless assumptions, or doubt, the final output can only be an amalgamation of those. I've seen the "thinking" output arrive at a correct solution in the first few steps, then talk itself out of it later. Or go into logical loops without ever arriving at anything.
The reason why "reasoning" models tend to perform better is simply due to larger scale and better training data. There's nothing inherently better about them. There's nothing intelligent either, but that's a separate discussion.
Reasoning models are trained from non-reasoning models of the same scale, and the training data is the output of the same model, filtered through a verifier. Generating intermediate context to improve the final output is not an idea that reasoning models are based on, but an outcome of the training process: empirically, the model produces answers that pass the verifier more often when it generates the intermediate steps first.
That the model still makes mistakes doesn't mean it's not an improvement: the non-reasoning base model makes even more mistakes when it tries to skip straight to the answer.
Thanks. I trust that you're more familiar with the internals than myself, so I stand corrected.
I'm only speaking from personal usage experience, and don't trust benchmarks since they are often gamed, but if this process produces objectively better results that aren't achieved by scaling up alone, then that's a good thing.
> The reason why "reasoning" models tend to perform better is simply due to larger scale and better training data.
Except that we can try the exact same pre-trained model with reasoning enabled vs. disabled and empirically observe that reasoning produces better, more accurate results.
Research/benchmarks aside, try giving a somewhat hard programming task to Opus 4 with reasoning off vs. on. Similarly, try the same with o3 vs. o3-pro (o3-pro reasons for much longer).
I'm not going to dig through my history for specific examples, but I do these kinds of comparisons occasionally when coding, and it's not unusual to have e.g. a bug that o3 can't figure out, but o3-pro can. I think this is widely accepted by engineers using LLMs to help them code; it's not controversial.
Huh, I wasn't aware that reasoning could be toggled. I use the OpenRouter API, and just saw that this is supported both via their web UI and API. I'm used to Sonnet 3.5 and 4 without reasoning, and their performance is roughly the same IME.
I wouldn't trust comparing two different models, even from the same provider and family, since there could be many reasons for the performance to be different. Their system prompts, training data, context size, or runtime parameters could be different. Even the same model with the same prompt could have varying performance. So it's difficult to get a clear indication that the reasoning steps are the only changing variable.
But toggling it on the same model would be a more reliable way to test this, so I'll try that, thanks.
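For reference, a minimal sketch of such an A/B test via the OpenRouter chat completions endpoint. The `reasoning` request field follows OpenRouter's documented schema at the time of writing; treat the exact field names (and the model slug) as assumptions and check the current docs before relying on them. The sketch only builds the two payloads, so everything except the toggle is provably held constant:

```python
import json

# Endpoint per OpenRouter docs; no request is actually sent here.
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def make_payload(prompt: str, reasoning_enabled: bool) -> dict:
    return {
        "model": "anthropic/claude-sonnet-4",   # same model in both runs (assumed slug)
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,                        # reduce run-to-run variance
        # "reasoning" object is OpenRouter's documented toggle (assumption).
        "reasoning": {"enabled": reasoning_enabled},
    }

on = make_payload("Find the bug in ...", True)
off = make_payload("Find the bug in ...", False)

# Everything except the reasoning toggle is identical.
assert {k: v for k, v in on.items() if k != "reasoning"} == \
       {k: v for k, v in off.items() if k != "reasoning"}
print(json.dumps(on["reasoning"]), json.dumps(off["reasoning"]))
```

Sending both payloads to the same model slug is about as close as you can get to making the reasoning steps the only changing variable.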
It depends on the problem domain and the way you prompt things. Basically, reasoning is better in the cases where using the same model to critique itself over multiple turns would be better.
With code, for example: a single shot without reasoning might hallucinate a package or not conform to the rest of the project's style. Then you ask the LLM to check its output, and then ask it to revise itself to fix the issue. If the base model can do that, then turning on reasoning basically allows it to self-check for the self-correctable failures.
When generating content, you can ask it to consider or produce intermediate deliverables, like summaries of input documents, that it then synthesizes into the whole. With reasoning on, it can do the intermediate steps and then use them.
The main advantage is that the system autonomously figures out a bunch of intermediate steps and works through them. Again, probably no better than it could do with some guidance over multiple interactions, but that itself is a big productivity benefit. The second-gen (or really 1.5-gen) reasoning models also seem to have been trained on enough reasoning traces that they are starting to know about additional factors to consider, so the reasoning loop is tighter.
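The multi-turn critique-and-revise loop described above can be sketched as follows. `call_llm` is a hypothetical stand-in for a real chat API call, with canned responses so the control flow is runnable:

```python
# Sketch of the generate -> self-check -> revise loop; the stub model
# hallucinates a package on the first draft and fixes it when asked.
def call_llm(prompt: str) -> str:
    if prompt.startswith("List any problems"):
        # The "check" turn: flag the phantom import if present.
        return ("PROBLEM: imports a package that does not exist"
                if "magic_pkg" in prompt else "OK")
    if prompt.startswith("Revise"):
        # The "fix" turn: return a corrected draft.
        return "def solve():\n    return 42"
    # The initial draft: plausible-looking, but hallucinates a package.
    return "import magic_pkg\ndef solve():\n    return magic_pkg.answer()"

def generate_with_self_check(task: str, max_rounds: int = 3) -> str:
    draft = call_llm(task)
    for _ in range(max_rounds):
        critique = call_llm(f"List any problems in:\n{draft}")
        if not critique.startswith("PROBLEM"):
            break
        draft = call_llm(f"Revise to fix: {critique}\n\n{draft}")
    return draft

result = generate_with_self_check("Write solve()")
```

Turning reasoning on effectively folds these extra turns into a single response, but only for the failure modes the model can already detect about itself.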
Reasoning cannot actually be toggled. LLM companies serve completely different models based on whether you have reasoning enabled or disabled for "Opus 4".
But what if the critic is just hard reality? If you ask an LLM to write a computer program, instead of criticizing it, you can run it and test it. If you ask an LLM to prove a theorem, let it write the proof in a formal logic language so it can be verified. Etcetera.
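A minimal sketch of that idea for code, where the critic is literally executing each candidate against tests rather than judging it with another model. The candidates here are hard-coded stand-ins for LLM samples:

```python
# "Hard reality" as the critic: run the program and test it.
CANDIDATES = [
    "def add(a, b):\n    return a - b",   # plausible-looking but wrong
    "def add(a, b):\n    return a + b",   # correct
]

def passes_tests(src: str) -> bool:
    ns: dict = {}
    try:
        exec(src, ns)                     # "run it"
        assert ns["add"](2, 3) == 5       # "...and test it"
        assert ns["add"](-1, 1) == 0
        return True
    except Exception:
        return False

# Only candidates that survive contact with reality are kept.
survivors = [src for src in CANDIDATES if passes_tests(src)]
```

(In production you'd run candidates in a sandboxed subprocess rather than `exec`, but the filtering principle is the same.)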
Generated code only works because the "test" part (compile/validate/analyze etc.) is completely external and was written before any mass-market LLMs. There is no such external validator for new theorems, books, pictures, text guides, etc. You can't just run hard_reality.exe on a generated poem or a scientific paper to deem it "correct". It is only possible with programming languages, and even then not always.
Your proposed approach to science would cover only an extremely tiny subset of math: theorems proven by automation. And it is questionable whether those theorems would even be useful. A good mathematician with CS experience could probably write a generator of new useless theorems, something along the lines of "is every sequential cube plus the square of a number divisible by the root of the seventh smallest prime multiplied by log n of that number plus blabla...". One can generate such theorems and formally prove or disprove them, yes.
On the other hand, any novel science usually requires deep and wide exploratory research, often involving hard or flawed experimentation or observation. One can train an LLM on a PhD curriculum in astrophysics, then give that LLM an API to some new observatory and instruct it to "go prove the cosmological constant". And it will do so, but the result will be generated garbage, because there is no formal way to prove such results. There is no formal way to prove why the pharaohs decided to stop building pyramids, despite there being some decent theories. This is science too, you know. You can't formally prove that some gene sequence is responsible for trait X, etc.
I would say a majority of science is not formally provable.
And lastly, you dismiss books/texts, but those are a huge chunk of humans' intellectual and creative work. Say you are an engineer and you have a CAD model with a list of parts and parameters for a rocket, for example. Now you need to write a guide for it. An LLM can do that; it can generate guide-looking output. The issue is that there is no way to automatically verify it or find the issues in it. And there are lots of items like that.
> You can't formally prove that some gene sequence is responsible for trait X etc.
Maybe not formally in some kind of mathematical sense. But you certainly could have simulation models of protein synthesis, and maybe even higher-order simulations of tissues and organs. You could also let the AI scientist verify experimental hypotheses by giving it access to robotic lab processes. In fact, it seems we are going down both fronts right now.
Nobody argues that LLMs aren't useful for bulk processing of a billion datapoints or looking for obscure correlations in unedited data. But the premise of Gwern's article is that to be considered thinking, an LLM must initiate such a search on its own and arrive at a novel conclusion on its own.
Basically if:
A) A scientist has an idea > triggers an LLM program to sift through a ton of data > the LLM prints out correlation results > the scientist reads them and proves/disproves the idea. In this case, while the LLM did the bulk of the work, it did not arrive at a breakthrough on its own.
B) The LLM is idling > then the LLM triggers some API to get a specific set of data > the LLM correlates the results > the LLM prints out a complete hypothesis with proof (or disproves it). In this case we can say the LLM made a breakthrough.
I think the problem here is that you assume the LLM has to operate isolated from the world, i.e. without interaction. If you put a human scientist in isolation, then you cannot have high expectations either.
I'm not assuming the LLM would be isolated; I'm assuming the LLM would be incapable of interacting in any meaningful way on its own (i.e., not triggered by direct input from a programmer).
IME, on a daily basis, Claude Code (supposed SoTA agent) constantly disables and bypasses tests and checks on my codebase - despite following clear prompting guidelines and all the /woo/ like ultrathink etc.
I think if we had a good enough and fast simulation of reality, something like an accelerated Minecraft with real-world physics, then this idea might actually work.
But the hard reality we currently can generate efficiently and feed into LLMs usually has a narrow scope. It feels like teaching a kid only textbook math for several years and nothing else. The LLM overoptimizes for these very specific fields, but its overall performance might even get worse.
True, and the successful ones usually require an external source of information.
For AlphaGo, it is the simple algorithm that decides who wins a game of Go. For GANs, it is the images labeled by humans.
In these scenarios, the critic is the medium that transforms external information into the gradients that optimize the actor, not the direct source of that information.
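As a toy illustration of this point (a two-action REINFORCE bandit, not AlphaGo itself): the only source of information below is the external win rule, and the update step merely converts that outcome into a gradient for the actor. Remove the external rule and there is nothing for the critic to transmit.

```python
import math
import random

random.seed(0)

def external_outcome(action: int) -> float:
    # The external source of information: an analogue of the Go rule
    # that decides who won. The learning signal originates here.
    return 1.0 if action == 1 else 0.0

logits = [0.0, 0.0]
LR = 0.5
for _ in range(200):
    # Softmax policy over two actions.
    exps = [math.exp(x) for x in logits]
    probs = [e / sum(exps) for e in exps]
    action = 0 if random.random() < probs[0] else 1
    reward = external_outcome(action)          # information enters here
    # Critic-as-medium: turn the external outcome into a gradient
    # (REINFORCE update for a softmax policy).
    for a in range(2):
        grad = (1.0 if a == action else 0.0) - probs[a]
        logits[a] += LR * reward * grad

exps = [math.exp(x) for x in logits]
final = [e / sum(exps) for e in exps]
# The actor has absorbed the external signal and now prefers action 1.
```

The update rule never invents information; it only propagates what the environment already decided.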
The LLM doesn't have to know about the critic though. It can just output things and the critic is a second process that filters the output for the end user.
It is a counterargument against the claim that we can build a perfect simulator of our universe, but not a good one against the claim that our universe is simulated.
In fact, if I were to build a simulator, I would most likely have to design a mechanism to prevent its residents from observing beyond a certain micro scale, due to limited CPU/memory resources and laziness about implementing all the details. A tiny black hole is a good mechanism for reducing resource consumption when simulating a fixed volume of the universe. Imagine living in the world of Minecraft, where the minimal unit is a block: trying to look inside one yields nothing, and all physically meaningful characteristics are described by its surface. In our universe, this looks very much like a black hole.
The market has changed drastically since some Chinese factories started mass-producing synthetic diamonds around 2019.
Currently, the end-market price is $200~300 for a 1 ct synthetic stone, and the quality is simply better all around. The only way to distinguish natural ones is to track each stone from the start, from where it is mined through processing, which is a bit absurd.
Many of these factories were associated with the drill-head industry and the like, until they found the jewelry market much more profitable and figured out the trick to mass-producing gems.
If they are willing to compete a bit, the price might go even lower.
QFT is so successful that experimentalists have been stuck for half a century with no meaningful surprise. But theoreticians have to maintain a publication streak if they want to stay in academia. Thus string theory became the perfect field. Most people doing it seem to know it is a wild-goose chase, but no one dares admit it publicly, as your colleagues, stripped of funding, would tear you to pieces.
I feel we really should sort these papers into a category "physics in imaginary universe", so other people do not get confused.
Some additional context here. The author has a further statement, in which he apologizes that the previous video was misleading due to two points:
1. The sample was not LK99.
2. Both the larger piece held by the tweezers and the smaller levitating one were from the same sample.
But he also states that no tricks of any kind were used, nor was the video edited.
So, some other material levitating on its own at seemingly room temperature?
Currently this guy is practically using his real name, with his university and professor exposed. It takes some courage to lie at this point.
Edit: People sometimes speak in convoluted ways. I feel "The sample was not LK99" could have two interpretations here:
1. It is a completely different compound.
2. It is a derivative from LK99 with a different synthesis/doping method.
>Currently this guy is practically using his real name, with his university and professor exposed. It takes some courage to lie at this point.
No, it doesn't. It is a nine second video. Not a research paper. Right now the topic is being hyped up and you might care but next year nobody is going to give a damn about some random video uploaded to bilibili.
Why does everyone on HN pretend that even the smallest of missteps will end a researcher's career, and that therefore even videos with almost no effort put into them, made to feed the rumor mill, are somehow paragons of truth?
I guess it really depends on what you are working on.
If you are writing some non-trivial algorithm or working on a project that requires delicate handling of things, then Copilot is most likely going to mess up.
But if you are working on frontend code or backend CRUD, which is usually quite repetitive, then Copilot can be helpful.
It is really interesting to read the comments in this thread.
From an outsider's perspective, the US has been the most prosperous country on the planet. People generally have had good chances to move up the class ladder, and there are a lot of active enterprises. But I have the strong feeling that this has stopped being the case in recent decades.
The reason, as I see it, is the natural tendency of capitalism: people seeking to maximize their individual capital. In the past century, under the pressure of the Cold War, this tendency was balanced by the government, as the US needed its working class to actually innovate and make products to gain an advantage in the competition with the Soviet Union. In fact, the working class got its best treatment during the Cold War, and things have worsened in recent decades.
Now, with the Soviet Union dead and China still too young to be an actual competitor, there is no obstacle to the capitalists playing their games. It should be evident to most Americans that property prices are inflating while salary growth has slowed. The government has also been eroded; most of the people there are either capitalists or have good relationships with them. What would be their incentive to change the rules?
This is not just a problem with the US; it happens in China too. And I don't see a solution. The lower classes are controlled tightly by the media, which is controlled by the capitalists. And the middle class is thin and divided. In fact, a lot of the middle class try very hard to climb the social ladder so they can join the upper class. They don't want the game changed, just "modified" to benefit themselves a bit.