Thanks, I missed that. It's very interesting. They're quite close, but I found Qwen 3.6 plus was just marginally better than Kimi 2.5. But looking at the stats, I'll definitely give GLM 5.1 a try now. [edit: even though, looking at it, it's not cheap and has a much smaller context size. And I can't tell about tool use.]
I've spent probably over 100 hours working on this benchmarking/site platform, and all tests are manually written. For me (and many others who reached out to me), it's not useless either. I use it myself regularly when choosing and comparing new models. I honestly believe it is providing value to the conversation.
Let me know if you know of a better platform for comparing models; I built this one because I couldn't find any with good enough UX.
Yeah, but actually that's not a good look. Anyone who's used Gemini will know how random it is in terms of getting anything serious done, compared to the rock-solid Opus experience.
Their benchmark is chock-full of things like that: it's deeply flawed and essentially rates how LLMs perform when you go out of your way to hold them entirely the wrong way.
Nice idea. I added it now, and it's in the latest VSCode extension and also the GitHub repository (you can add any API key, even OpenAI, Anthropic, etc., not just OpenRouter).
I also added tons of other things: live test updates; copy buttons on all test cards and issue cards (so you can just copy and paste into Claude Code to fix the given issue); and a QA report card emitted as the final summary of a test, which shows you everything that was tested in one place (no need to scroll up) and even lets you export it as a PDF (all tests performed, their outcomes, and all issues found).
Let me know if there is anything else
I've just installed it in VSCode and started playing and it looks really good. The only thing it's missing is a persistent icon in the sidebar of VS Code - so it's always available with a click. Apart from that, I haven't really started using it yet in earnest, but from first looks setting up the API, etc., it seems very solid.
Yes, currently it is modelled a bit after Claude Code, which has an icon at the top right (an orange one) that is only visible if you have an actual file open in VSCode.
There is a similar icon for QA Panda in the same place.
By having the button there, you can open multiple QA Panda instances for the same repository, not just a single one (unlike Claude Code): every click on the icon opens a new instance.
The extension is designed to spin up an isolated Chrome instance for every extension tab you open, so you can run multiple parallel QA tests on the same repository with multiple built-in browser instances.
But yeah, I will look into adding a persistent icon, maybe in the sidebar, too.
Good idea. Thank you for the feedback.
Also let me know if there is anything else missing.
I think the problem is there are so many different aspects of this thing we call AI that it's hard to pin down any particular use case. For some users it's brilliant: if you're doing something like marketing imagery, it can dramatically reduce costs, especially if you're using on-premise models on your own hardware without touching the cloud.
But for other uses, e.g. companies who've just thrown AI money at the wall, probably using ChatGPT, they wonder why they're not getting the return on investment they were promised. It's all a bit confused at the moment, rather like the early days of the internet.
OpenClaw config needs cleanup
Picnic stopped before restart because C:\Users\User\.picnic\openclaw.json contains keys OpenClaw does not accept.
Unsupported keys
error: too many arguments for 'config'. Expected 0 arguments but got 1.
What is safe to edit
Only documented OpenClaw schema fields should live in openclaw.json. Picnic metadata or experiments should go in separate files.
Recommended fix
Remove unsupported keys, then retry. If you need Picnic-specific metadata, store it outside openclaw.json.
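The recommended fix above can be sketched as a small script that splits the config into supported and unsupported keys and moves the latter into a separate sidecar file. The allow-list below is a placeholder, not the real OpenClaw schema (the message doesn't list the documented fields), and the file paths are just examples:

```python
import json
from pathlib import Path

# Placeholder allow-list: replace with the documented OpenClaw schema fields.
ALLOWED_KEYS = {"model", "apiKey", "provider"}  # assumption, not the real schema


def split_config(config: dict) -> tuple[dict, dict]:
    """Split a config dict into (supported, unsupported) parts."""
    supported = {k: v for k, v in config.items() if k in ALLOWED_KEYS}
    unsupported = {k: v for k, v in config.items() if k not in ALLOWED_KEYS}
    return supported, unsupported


def clean(openclaw_path: Path, sidecar_path: Path) -> None:
    """Rewrite openclaw.json with only supported keys; park the rest in a sidecar file."""
    config = json.loads(openclaw_path.read_text())
    supported, unsupported = split_config(config)
    if unsupported:
        # Picnic-specific metadata goes outside openclaw.json, as recommended above.
        sidecar_path.write_text(json.dumps(unsupported, indent=2))
        openclaw_path.write_text(json.dumps(supported, indent=2))
```

For example, `clean(Path(r"C:\Users\User\.picnic\openclaw.json"), Path(r"C:\Users\User\.picnic\picnic-meta.json"))` would leave only schema-conformant keys in `openclaw.json` before retrying the restart.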
I'm not having a go at you specifically, but take a guess at how many OpenClaw / agentic system installations I've done recently that have worked perfectly out of the box. Yep. 0%. Interesting, isn't it?