Sooo... it can *play* Pokemon. Feels like they had to throw that in after Google...

minimaxir · 2025-05-22T16:49:52 1747932592

That Google IO slide was somewhat misleading as the maintainer of Gemini Plays Pokemon had a much better agentic harness that was constantly iterated upon throughout the runtime (e.g. the maintainer had to give specific instructions on how to use Strength to get past Victory Road), unlike Claude Plays Pokemon.

The Elite Four/Champion was a non-issue in comparison especially when you have a lv. 81 Blastoise.

fourier456 · 2025-05-23T01:55:55 1747965355

Okay, wait though like I want to know the full transcript because that actually is a better / softer benchmark if you measure in terms of the necessary human input.

archon1410 · 2025-05-22T17:26:11 1747934771

Claude Plays Pokemon was the original concept and inspiration behind "Gemini Plays Pokemon". Gemini arguably only did better because it had access to a much better agent harness and was being actively developed during the run.

See: https://www.lesswrong.com/posts/7mqp8uRnnPdbBzJZE/is-gemini-...

montebicyclelo · 2025-05-22T17:47:18 1747936038

Not sure "original concept" is quite right, given it had been tried earlier, e.g. here's a 2023 attempt to get gpt-4-vision to play pokemon, (it didn't really work, but it's clearly "the concept")

https://x.com/sidradcliffe/status/1722355983643525427

archon1410 · 2025-05-22T17:54:23 1747936463

I see, I wasn't aware of that. The earliest attempt I knew of was from May 2024,[1] while this gpt-4-vision attempt is from November 2023. I guess Claude Plays Pokemon was the first attempt that had any real success (won a badge), and got a lot of attention over its entertaining "chain-of-thought".

[1] https://community.aws/content/2gbBSofaMK7IDUev2wcUbqQXTK6/ca...

silvr · 2025-05-23T15:04:42 1748012682

I disagree - this is all an homage to Twitch Plays Pokemon, which was a noteworthy moment in internet culture/history.

https://en.wikipedia.org/wiki/Twitch_Plays_Pok%C3%A9mon

throwaway314155 · 2025-05-22T16:48:47 1747932527

Gemini can beat the game?

mxwsn · 2025-05-22T16:51:50 1747932710

Gemini has beat it already, but using a different and notably more helpful harness. The creator has said they think harness design is the most important factor right now, and that the results don't mean much for comparing Claude to Gemini.

throwaway314155 · 2025-05-22T17:20:14 1747934414

Way offtopic to TFA now, but isn't using an improved harness a bit like saying "I'm going to hardcore as many priors as possible into this thing so it succeeds regardless of its ability to strategize, plan and execute?

silvr · 2025-05-22T22:46:57 1747954017

While true to a degree, I think this is largely wrong. Wouldn't it still count as a "harness" if we provided these LLMs with full robotic control of two humanoid arms, so that it could hold a Gameboy and play the game that way? I don't think the lack of that level of human-ness takes away from the demonstration of long-context reasoning that the GPP stream showed.

Claude got stuck reasoning its way through one of the more complex puzzle areas. Gemini took a while on it also, but made it through. I don't that difference can be fully attributed up to the harnesses.

Obviously, the best thing to do would be to run a SxS in the same harness of the two models. Maybe that will happen?

throwaway314155 · 2025-05-23T01:16:53 1747963013

I can appreciate that the model is likely still highly capable with a good harness. Still, I think this is more in line with ideas from say, speed running (or hell even reinforcement learning) where you want to prove something profound is possible and to do so before others do, you need to accumulate a series of "tricks" (refining exploits/hacking rewards) in order to achieve the goal. but if you use too many tricks you're no longer proving something as profound as originally claimed. In speed running this tends to splinter into multiple categories.

Basically, the gane being conpleted by gemini was in an inferior category (however minuscule) of experiment.

I get it though. People demanded these types of changes in the CPP twitch chat, because the pain of watching the model fail in slow motion is simply too much.

samrus · 2025-05-22T17:39:32 1747935572

it is. the benchmark was somewhat cheated, from the perspective of finding out how the model adjusts and plans within a dynamic reactive environment

11101010001100 · 2025-05-22T19:20:03 1747941603

They asked gemini to come up with another word for cheating and it came up with 'harness'.

klohto · 2025-05-22T16:54:06 1747932846

2 weeks ago

hansmayer · 2025-05-22T16:59:08 1747933148

Right, but on the other hand... how is it even useful? Let's say it can beat the game, so what? So it can (kind of) summarise or write my emails - which is something I neither want nor need, they produce mountains of sloppy code, which I would have to end up fixing, and finally they can play a game? Where is the killer app? The gaming approach was exactly the premise of the original AI efforts in the 1960s, that teaching computers to play chess and other 'brainy' games will somehow lead to development of real AI. It ended as we know in the AI nuclear winter.

samrus · 2025-05-22T17:37:45 1747935465

from a foundational research perspective, the pokemon benchmark is one of the most important ones.

these models are trained on a static task, text generation, which is to say the state they are operating in does not change as they operate. but now that they are out we are implicitly demanding they do dynamic tasks like coding, navigation, operating in a market, or playing games. this are tasks where your state changes as you operate

an example would be that as these models predict the next word, the ground truth of any further words doesnt change. if it misinterprets the word bank in the sentence "i went to the bank" as a river bank rather than a financial bank, the later ground truth wont change, if it was talking about the visit to the financial bank before, it will still be talking about that regardless of the model's misinterpretation. But if a model takes a wrong turn on the road, or makes a weird buy in the stock market, the environment will react and change and suddenly, what it should have done as the n+1th move before isnt the right move anymore, it needs to figure out a route of the freeway first, or deal with the FOMO bullrush it caused by mistakenly buying alot of stock

we need to push against these limits to set the stage for the next evolution of AI, RL based models that are trained in dynamic reactive environments in the first place

hansmayer · 2025-05-22T18:23:11 1747938191

Honestly I have no idea what is this supposed to mean, and the high verbosity of whatever it is trying to prove is not helping it. To repeat: We already tried making computers play games. Ever heard of Deep Blue, and ever heard of it again since the early 2000s?

lechatonnoir · 2025-05-22T19:46:46 1747943206

Here's a summary for you:

llm trained to do few step thing. pokemon test whether llm can do many step thing. many step thing very important.

hansmayer · 2025-05-23T03:59:42 1747972782

Are you showing off how the the extensive LLM usage impaired your writing and speaking capabilities?

lechatonnoir · 2025-06-02T20:56:33 1748897793

I am mocking you, but you didn't get it.

drdeca · 2025-05-23T22:34:54 1748039694

You complained about the high verbosity.

Rudybega · 2025-05-22T21:00:42 1747947642

The state space for actions in Pokemon is hilariously, unbelievably larger than the state space for chess. Older chess algorithms mostly used Brute Force (things like minimax) and the number of actions needed to determine a reward (winning or losing) was way lower (chess ends in many, many, many fewer moves than Pokemon).

Successfully navigating through Pokemon to accomplish a goal (beating the game) requires a completely different approach, one that much more accurately mirrors the way you navigate and goal set in real world environments. That's why it's an important and interesting test of AI performance.

hansmayer · 2025-05-23T04:12:54 1747973574

Thats all wishful thinking, with no direct relation to the actual use cases. Are you going to use it to play games for you? Here is a much more reliable test: Would you blindly copy and paste the code the GenAI spits out at you? Or blindly trust the recommendations it makes about your terraform code ? Unless you are a complete beginner, you would not, because it sometimes generates downright the opposite of what you asked it to do. It is because the tool is guessing the outputs and not really knowing what it all means. It just "knows" what character sequences are most likely (probability-wise) to follow the previous sequence. Thats all there is to it. There is no big magic, no oracle having knowledge you dont etc. So unless you tell me you are ready to blindly use whatever the GenAI playing pokemon tells you to do, I am sorry, but you are just fooling yourself. And in the case you are ready to blindly follow it - I sure hope you are ready for a life of an Eloi?

Rudybega · 2025-05-23T18:34:13 1748025253

All of that is totally unrelated to the point I'm trying to make.

Pokemon is interesting because it's a test of whether these models can solve long time horizon tasks.

That's it.

hansmayer · 2025-05-23T18:57:26 1748026646

Ok, well now that you phrase it clearly like that, it makes much more sense, so it's a test of being able to keep a relatively long context-length. Another incremental improvement I suppose.

Rudybega · 2025-05-27T18:13:43 1748369623

It's not really a function of maintaining coherency across context length. It's more about whether the model can accomplish a long time horizon task when the context length of a single message isn't even close to sufficient for keeping track of the all the things that have occurred in pursuit of the task's completion.

Basically, the model has to keep some notes about its overall goals and current progress. Then the context window has to be seeded with the relevant sections from these notes to accomplish sub goals that help with the completion of the overall goal (beat the game).

The interesting part here is whether the models can even do this. A single context window isn't even close to sufficient to store all the things the model has done to drive the next action, so you have to figure out alternate methods and see if the model itself is smart enough to maintain coherency using those methods.

j_maffe · 2025-05-22T17:58:42 1747936722

> Where is the killer app?

My man, ChatGPT is the sixth most visited website in the world right now.

hansmayer · 2025-05-22T18:13:44 1747937624

But I did not ask "what was the sixth most visited website in the world right now?", did I? I asked what was the killer app here. I am afraid vague and un-related KPIs will not help here, otherwise we may as well compare ChatGPT and PornHub based on the number of visits, as you seem to suggest.

signatoremo · 2025-05-22T18:50:18 1747939818

Are you saying PornHub isn’t a killer app?

hansmayer · 2025-05-23T06:07:03 1747980423

Well in the AI space definitely not...

j_maffe · 2025-05-22T20:31:22 1747945882

If the now default go-to source for quick questions and formulating text isn't a killer app in your eyes, I don't know what is.

Jensson · 2025-05-23T03:26:47 1747970807

Not killer enough to warrant trillions of dollars valuation that the VC money are looking for here.

hansmayer · 2025-05-23T04:14:34 1747973674

The VCs have already burnt on the order of 200B USD, to generate about 10B operating income, within the total "industry" (source: The Information). It will be interesting when some of them start asking about the returns.

hansmayer · 2025-05-23T04:02:14 1747972934

I already know how to read, write and think by myself, so no - that is not a killer app. Especially when it produces wrong answers with an authoritative attitude.

j_maffe · 2025-05-23T06:59:34 1747983574

Then it's not a killer app for you. Doesn't stop it from being so for many others.

minimaxir · 2025-05-22T17:14:35 1747934075

It's a fun benchmark, like simonw's pelican riding a bike. Sometimes fun is the best metric.

lechatonnoir · 2025-05-22T19:49:36 1747943376

This is a weirdly cherry-picked example. The gaming approach was also the premise of DeepMind's AI efforts in 2016, which was nine years ago. Regardless of what you think about the utility of text (code), video, audio, and image generation, surely you think that their progress on the protein-folding problem and weather prediction have been useful to society?

What counts as a killer app to you? Can you name one?

hansmayer · 2025-05-23T06:22:54 1747981374

Well the example came from their own press-release, so who cherry-picked it? Why should I name the next killer app ? Isnt that something that we just recognise the moment it shows up, like we did with www and e-commerce? Its not something a comittee staffed by a bunch of MBAs defines ahead of the time, as is currently the case with the use-cases that are being pushed into our faces every day. I would applaud and cheer if their efforts were focused on scientific problems that you mentioned. Unfortunately for us, this is not what the bean-counters heading all major tech corps see as useful. Do you honestly think any one of them has the benefit of society at heart? No, they want to make money by selling you bullshit products like e-mail summarising and such. Perhaps in the process also to get rid of software developers altogether as well. Then once we as the society lose the ability to do anything on our own, relying on these bullshit machines they gain not only in terms of being able to entshittify their products and squeeze that extra buck, but also opens a "world of possibilities" (for the rich) in terms of societal control. But sure, at least you will still have your, what is it now, two-day delivery from Amazon and a handholding tool to help you speak, write and do anything meaningful as a human being.

lechatonnoir · 2025-06-02T20:55:18 1748897718

you assert that people know a killer app when they see one

a bunch of people think that something like chatgpt is a killer app, and they know it when they see it. you assert that it obviously is not, so clearly the above intuition isn't working for the purposes of discussion.

instead, someone should define the term so that we know what we're talking about, and i offer you the ability to do it so that the frame of the discussion can be favorable to your point of view. but you are also not willing to do that, so how do you expect to convince anyone of your viewpoint?

rxtexit · 2025-05-23T11:23:55 1747999435

The whole idea of a "killer app" is stupid.

It is a dismissive rhetorical device to prove a wrong point on an internet forum such as this that has nothing to do with reality.

hansmayer · 2025-05-23T11:49:54 1748000994

Are you sure you are not describing your own argument here?