Thanks, I missed that. It's very interesting. They're quite close, but I found Qwen 3.6 plus was just marginally better than Kimi 2.5. But looking at the stats, I'll definitely give GLM 5.1 a try now. [edit: even though, looking at it, it's not cheap and has a much smaller context size. And I can't tell about tool use.]
I've spent probably over 100 hours working on this benchmarking/site platform, and all tests are manually written. For me (and many others who reached out to me), it's not useless either. I use it myself regularly when choosing and comparing new models. I honestly believe it is providing value to the conversation.
Let me know if you know of a better platform for comparing models; I built this one because I couldn't find any with good enough UX.
Yeah, but actually that's not a good look. Anyone who's used Gemini will know how random it is in terms of getting anything serious done, compared to the rock-solid Opus experience.
Their benchmark is chock-full of things like that: it's deeply flawed and essentially rates how LLMs perform when you go out of your way to hold them entirely the wrong way.
Nice idea. I added it now, and it's in the latest VSCode extension and also the GitHub repository (you can add any API key, even OpenAI, Anthropic, etc., not just OpenRouter).
I also added tons of other things: live test updates; copy buttons on all test cards and issue cards (so you can just copy and paste into Claude Code to fix the given issue); and a QA report card emitted as the final summary of a test, which shows you everything that was tested in one place (no need to scroll up) and even lets you export it as a PDF (all tests performed, their outcomes, and all issues found).
Let me know if there is anything else
I've just installed it in VSCode and started playing and it looks really good. The only thing it's missing is a persistent icon in the sidebar of VS Code - so it's always available with a click. Apart from that, I haven't really started using it yet in earnest, but from first looks setting up the API, etc., it seems very solid.
Yes, currently it is modelled a bit after Claude Code, which has an icon at the top right (an orange one) that is only visible if you have an actual file open in VSCode.
There is a similar icon for QA Panda in the same place.
By having the button there, you can open multiple QA Panda instances for the same repository, not just a single one (unlike Claude Code): every click on the icon opens a new instance.
The extension is designed to spin up an isolated Chrome instance for every extension tab you open, so you can run multiple parallel QA tests on the same repository with multiple built-in browser instances.
But yeah, I will look into adding a persistent icon, maybe in the sidebar, too.
Good idea. Thank you for the feedback.
Also let me know if there is anything else missing.
I think the problem is there are so many different aspects of this thing we call AI that it's hard to pin down any particular use case. For some users it's brilliant: if you're doing something like marketing imagery, it can dramatically reduce costs, especially if you're using on-premise models on your own hardware without touching the cloud.
But for other uses, e.g. companies who've just thrown AI money at the wall, probably using ChatGPT, they wonder why they're not getting the return on investment they were promised. It's all a bit confused at the moment, rather like the early days of the internet.
OpenClaw config needs cleanup
Picnic stopped before restart because C:\Users\User\.picnic\openclaw.json contains keys OpenClaw does not accept.
Unsupported keys
error: too many arguments for 'config'. Expected 0 arguments but got 1.
What is safe to edit
Only documented OpenClaw schema fields should live in openclaw.json. Picnic metadata or experiments should go in separate files.
Recommended fix
Remove unsupported keys, then retry. If you need Picnic-specific metadata, store it outside openclaw.json.
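The recommended fix above can be sketched as a small script that splits the config into supported and unsupported keys and moves the latter into a separate sidecar file. The allow-list below is a placeholder, not the real OpenClaw schema (the message doesn't list the documented fields), and the file paths are just examples:

```python
import json
from pathlib import Path

# Placeholder allow-list: replace with the documented OpenClaw schema fields.
ALLOWED_KEYS = {"model", "apiKey", "provider"}  # assumption, not the real schema


def split_config(config: dict) -> tuple[dict, dict]:
    """Split a config dict into (supported, unsupported) parts."""
    supported = {k: v for k, v in config.items() if k in ALLOWED_KEYS}
    unsupported = {k: v for k, v in config.items() if k not in ALLOWED_KEYS}
    return supported, unsupported


def clean(openclaw_path: Path, sidecar_path: Path) -> None:
    """Rewrite openclaw.json with only supported keys; park the rest in a sidecar file."""
    config = json.loads(openclaw_path.read_text())
    supported, unsupported = split_config(config)
    if unsupported:
        # Picnic-specific metadata goes outside openclaw.json, as recommended above.
        sidecar_path.write_text(json.dumps(unsupported, indent=2))
        openclaw_path.write_text(json.dumps(supported, indent=2))
```

For example, `clean(Path(r"C:\Users\User\.picnic\openclaw.json"), Path(r"C:\Users\User\.picnic\picnic-meta.json"))` would leave only schema-conformant keys in `openclaw.json` before retrying the restart.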
I'm not having a go at you specifically, but take a guess at how many OpenClaw / agentic system installations I've done recently that have worked perfectly out of the box. Yep. 0%. Interesting, isn't it?