Hacker News | new | past | comments | ask | show | jobs | submit | jaggs's comments

How does it compare to Kimi 2.5 or Qwen 3.6 Plus?

The blog post has a benchmark comparison table that includes both of them.

Thanks, I missed that. It's very interesting. They're quite close, but I found Qwen 3.6 Plus was just marginally better than Kimi 2.5. Looking at the stats, I'll definitely give GLM 5.1 a try now. [edit: although looking at it, it's not cheap and has a much smaller context size. And I can't tell about tool use.]

Compared to Kimi 2.5 or Qwen 3.6 Plus I don't know, but I ran GLM 5 (not 5.1) side by side with Qwen 3.5 Plus and it was visibly better.

General intelligence (not coding) comparison: https://aibenchy.com/compare/z-ai-glm-5-medium/z-ai-glm-5-1-...

Is there really no rule that discourages having 99% of your interactions on HN be peddling some useless slop benchmark?

If it's relevant to the discussion, I hope not.

I've spent probably over 100 hours working on this benchmarking/site platform, and all tests are manually written. For me (and many others who reached out to me) it's not useless either. I use it myself regularly when choosing and comparing new models. I honestly believe it provides value to the conversation.

Let me know if you know of a better platform you can use to compare models, I built this one because I didn't find any with good enough UX.


It's a great benchmark. Don't listen to the haters. This one is especially interesting.

https://aibenchy.com/compare/anthropic-claude-sonnet-4-6-med...


This one's even more interesting

https://aibenchy.com/compare/anthropic-claude-opus-4-6-mediu...

Who knew Anthropic was this far behind???


Yeah, but actually that's not a good look. Anyone who's used Gemini will know how random it is in terms of getting anything serious done, compared to the rock solid opus experience.

Their benchmark is chock-full of things like that: It's deeply flawed and is essentially rating how LLMs perform if you exert yourself trying to hold them entirely the wrong way.

Long live LibreOffice.


Doesn't exist? Wow you'd better tell VTT to stop their testing then.

Nice. It would be great if it could be used with something like OpenRouter BYOK, to give more flexibility on models and costs.


Nice idea. I added it now and it's in the latest VSCode extension and also the GitHub repository (you can add any API key, even OpenAI, Anthropic, etc., not just OpenRouter).

I also added tons of other things: live test updates; copy buttons on all test cards and issue cards (so you can just copy and paste to Claude Code to fix the given issue); and a QA report card, emitted as the final summary of a test, which shows everything that was tested in one place (no need to scroll up) and even lets you export it as a PDF (all tests performed, their outcomes, and all issues found).

Let me know if there is anything else.
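Worth noting why "any API key" works: OpenRouter, OpenAI, and most proxies expose the same OpenAI-compatible chat-completions shape, so BYOK mostly means swapping a base URL and key. A rough sketch (the model names and the helper function are illustrative, not taken from the extension):

```python
# Hedged sketch of BYOK against OpenAI-compatible endpoints.
# build_chat_request is a hypothetical helper, not QA Panda's actual code.

def build_chat_request(base_url: str, api_key: str, model: str, prompt: str) -> dict:
    """Build an OpenAI-compatible chat-completions request.

    The same shape works for OpenRouter, OpenAI, or any compatible proxy:
    only the base URL, key, and model name differ per provider.
    """
    return {
        "url": f"{base_url.rstrip('/')}/chat/completions",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "json": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    }

# Same function, different provider: swap the base URL and key.
openrouter = build_chat_request(
    "https://openrouter.ai/api/v1", "sk-or-...", "z-ai/glm-5", "hello")
openai = build_chat_request(
    "https://api.openai.com/v1", "sk-...", "gpt-4o", "hello")
```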

Awesome, I'll definitely take a look asap.

I've just installed it in VSCode and started playing and it looks really good. The only thing it's missing is a persistent icon in the sidebar of VS Code - so it's always available with a click. Apart from that, I haven't really started using it yet in earnest, but from first looks setting up the API, etc., it seems very solid.

Nice, it's good to hear that it works.

Yes, currently it's modelled a bit after Claude Code, which has an icon at the top right (an orange one) that is only visible if you have a file open (an actual file in VSCode).

There is the same icon for QA Panda, in the same place. By having the button there, you can open multiple QA Panda instances for the same repository, not just a single one (as with Claude Code), by just clicking that icon repeatedly (it opens a new instance every time).

The extension is fully designed to have an isolated Chrome instance for every extension tab that you open so you can do multiple parallel QA tests for the same repository and have multiple browser instances built in.
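That per-tab isolation pattern boils down to giving each session its own browser profile directory, so parallel sessions share no cookies, cache, or login state. A toy sketch of the idea (class and flag choices are my own illustration, not the extension's actual implementation):

```python
import tempfile
from pathlib import Path


class QASession:
    """Toy sketch: one isolated browser profile per QA session.

    Illustrative only — not QA Panda's actual code. The key point is
    that a fresh profile directory per instance is what lets multiple
    sessions run in parallel against the same repository without
    sharing any browser state.
    """

    def __init__(self, repo: str):
        self.repo = repo
        # A unique, throwaway profile dir is created per instance.
        self.profile_dir = Path(tempfile.mkdtemp(prefix="qa-session-"))

    def launch_args(self) -> list:
        # Typical Chromium flags for an isolated, automatable instance.
        return [
            f"--user-data-dir={self.profile_dir}",
            "--no-first-run",
        ]


# Two sessions on the same repo never collide:
a = QASession("my-repo")
b = QASession("my-repo")
```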

But yeah I will look into also adding a persistent icon maybe in the sidebar too. Good idea. Thank you for the feedback. Also let me know if there is anything else missing.


Totally fair and balanced though. /s


I think the problem is there are so many different aspects of this thing we call AI that it's hard to pin down any particular use case. For some users it's brilliant because if you're doing something like marketing imagery etc it can dramatically reduce costs, especially if you're using on premise models on your own hardware without touching the cloud.

But for other uses, i.e. companies who've just thrown AI money at the wall, probably using ChatGPT, they wonder why they're not getting the return on investment they were promised. It's all a bit confused at the moment, rather like the early days of the internet.


Will this work with an openrouter key?


Yes, OpenRouter is an option in the onboarding process and in the settings. Try it out and let me know how it goes :D


OpenClaw config needs cleanup: Picnic stopped before restart because C:\Users\User\.picnic\openclaw.json contains keys OpenClaw does not accept.

Unsupported keys error: too many arguments for 'config'. Expected 0 arguments but got 1.

What is safe to edit: only documented OpenClaw schema fields should live in openclaw.json. Picnic metadata or experiments should go in separate files.

Recommended fix: remove the unsupported keys, then retry. If you need Picnic-specific metadata, store it outside openclaw.json.
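The recommended fix can be scripted: split the config into the fields the schema accepts and everything else, then write the extras to a sidecar file. A minimal sketch — the allow-list below is hypothetical; check the actual documented OpenClaw schema fields before using anything like this:

```python
import json
from pathlib import Path

# Hypothetical allow-list for illustration only; the real documented
# OpenClaw schema fields may differ.
OPENCLAW_FIELDS = {"model", "apiKey", "provider", "tools"}


def split_config(config: dict, allowed: set) -> tuple:
    """Split a config dict into (accepted, extras)."""
    accepted = {k: v for k, v in config.items() if k in allowed}
    extras = {k: v for k, v in config.items() if k not in allowed}
    return accepted, extras


def clean_openclaw_json(path: Path, sidecar: Path) -> None:
    """Move keys the schema doesn't accept into a sidecar file."""
    config = json.loads(path.read_text())
    accepted, extras = split_config(config, OPENCLAW_FIELDS)
    if extras:
        sidecar.write_text(json.dumps(extras, indent=2))
        path.write_text(json.dumps(accepted, indent=2))
```

Running `clean_openclaw_json` on the broken file would leave only schema fields in openclaw.json and park the Picnic metadata in the sidecar, which is exactly what the error report asks for.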


Thanks, dude. Yeah, I'm aware of the bug. We're fixing it right now. The next update should be in about six to eight hours.


I'm not having a go at you specifically, but take a guess at how many openclaw / agentic system installations I've done recently that have worked perfectly out of the box. Yep. 0%. Interesting, isn't it?


xD no, I get it, I'm with you. We started Picnic out of frustration with the others haha


Looks good. Strange unmemorable name though?


Another classic from the master.

