Hacker News | bisonbear's comments

I'm becoming convinced that test pass rate is not a great indicator of model quality. Instead we have to look at agent behavior beyond the test gate, such as how aligned it is with human intent and whether it follows the repo's coding standards.

I wrote a short blog about this phenomenon here if you're interested https://www.stet.sh/blog/both-pass

Also +1 on placing heavy emphasis on the plan: if you have a good plan, the code becomes trivial. I've started doing a 70/30 or even 80/20 split of time spent planning vs. implementing and reviewing.


I agree with your analysis but not the conclusion.

Evals are broken: OpenAI showed that SWE-bench Verified leaked into the training data, and models were able to reconstruct the changes from memory (https://openai.com/index/why-we-no-longer-evaluate-swe-bench...)

However, this doesn't mean we should completely give up on benchmarking. In fact, as models get more intelligent, and we give them more autonomy, I believe that tracking agent alignment to your coding standards becomes even more important.

What I've been exploring is making a benchmark that is unique per repo, answering the question: how does the coding agent perform in my repo, on my tasks, with my context? That way we no longer have to trust general benchmarks.
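The cheapest source of repo-specific tasks is the repo's own merged PRs. Here's a minimal sketch of that idea; the `prs` input shape and field names are made up for illustration, and in practice you'd pull them from `git log` or your forge's API:

```python
from dataclasses import dataclass

@dataclass
class EvalTask:
    prompt: str          # the task the agent is given (e.g. the PR description)
    reference_diff: str  # the human-authored change to compare against

def tasks_from_history(prs):
    """Turn merged PRs into repo-specific eval tasks, skipping empty diffs."""
    return [EvalTask(prompt=pr["description"], reference_diff=pr["diff"])
            for pr in prs if pr["diff"].strip()]

prs = [
    {"description": "Add retry logic to the HTTP client", "diff": "+ retries = 3"},
    {"description": "Empty PR", "diff": ""},
]
tasks = tasks_from_history(prs)
print(len(tasks))  # 1
```

The agent then gets the prompt with the PR's change reverted, and its output is scored against the reference diff.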

Of course there will still be difficulties and limitations, but it's a step towards giving devs more information about agent performance and letting them use that information to tweak and optimize the agent further.


Really interesting study. One thing I keep coming back to is that tests have no way of catching this sort of tech debt. The agent can introduce something that will make you rip your hair out in 6 months, but tests are green...

My theory is that at least some of this is solvable with prompting / orchestration - the question is how to measure and improve that metric. i.e. how do we know which of Claude/Codex/Cursor/Whoever is going to produce the best, most maintainable code *in our codebase*? And how do we measure how that changes over time, with model/harness updates?


Curious how you measure/track whether this actually impacts the coding agent?

Fair question. I haven't done a systematic benchmark yet, so I don't have hard numbers to point to. Honestly I've mostly been iterating from actual use. The main test has been whether it helps me keep the good parts of brainstorming with the agent, recover context across longer multi-PR or multi-session work, and reduce friction overall. So right now the evidence is mostly qualitative and based on my own workflow, not a formal evaluation.

For agentic development teams, I see there being two ways to measure performance:

How good is the human at using the agent, and how good is the agent itself?

I agree with the thesis here that the traditional DORA metrics don't have as much signal in an agentic world. I like the metrics mentioned in the article. Another one I would propose is "number of turns": if the agent goes off course, the human has to spend more turns course-correcting it, whereas if the agent is aligned, the conversation needs only a few turns.
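As a rough sketch of the turns metric: count human messages after the initial instruction, treating each as a course correction. The transcript shape (a list of role/text pairs) is an assumption, not any tool's real log format:

```python
def correction_turns(transcript):
    """Count human messages after the first instruction — a crude proxy
    for how much course-correcting the agent needed.
    `transcript` is a list of (role, text) pairs with role 'human' or 'agent'."""
    human_msgs = [text for role, text in transcript if role == "human"]
    return max(0, len(human_msgs) - 1)

aligned = [("human", "add a healthcheck endpoint"), ("agent", "done")]
misaligned = aligned + [("human", "no, follow the router pattern"), ("agent", "fixed")]
print(correction_turns(aligned), correction_turns(misaligned))  # 0 1
```

Averaged over many tasks, this gives one number per model/harness combo that you can watch over time.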

For the "measuring the agent itself" part, I'm convinced that traditional benchmarks are broken, and that we need a way to measure our coding agents on our tasks, and anything else is irrelevant/noise.


I've been working on building out "evals for your repo" based on the theory that commonly used benchmarks like SWE-bench are broken as they are not testing the right / valuable things, and are baked into the training data (see OpenAI's research on this here https://openai.com/index/why-we-no-longer-evaluate-swe-bench...)

Interestingly, I had a similar finding: on the 3 open-source repos I ran evals on, the models (5.1-codex-mini, 5.3-codex, 5.4) all had relatively similar test scores, but on other metrics, such as code quality or equivalence to the original PR the task was based on, they had massive differences. Posted results here if anyone is curious: https://www.stet.sh/leaderboard
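For anyone wondering what "equivalence to the original PR" can look like concretely, here's a deliberately crude sketch using stdlib sequence matching; a real scorer would normalize file paths, ignore whitespace-only hunks, etc.:

```python
import difflib

def pr_equivalence(agent_diff: str, reference_diff: str) -> float:
    """Line-level similarity between the agent's diff and the original
    human PR, in [0, 1]. 1.0 means the diffs are line-for-line identical."""
    return difflib.SequenceMatcher(
        None, agent_diff.splitlines(), reference_diff.splitlines()
    ).ratio()

same = pr_equivalence("+ retries = 3\n+ timeout = 5",
                      "+ retries = 3\n+ timeout = 5")
print(same)  # 1.0
```

It's exactly the kind of metric where two models with identical test pass rates can land far apart.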


This sounds amazing. In particular, I like comps to existing PRs. But I'm also not sure I want existing PRs to be the template for what counts as reasonable or best practice.

I’ve been building out internal linters that enforce design patterns I want and raise common code smells (also note tools like eslint allow custom rules which are easy write with something like opus 4.6). The use case is a total refactor of react and fastapi apps. We are suffering from everything’s a snowflake syndrome and just want the same pattern employed across features.

This works pretty well when the linter has a companion agents.md file which explains the architecture and the intended way of doing things.

But I still haven't cracked how to get the agent (Claude Code with opus 4.6 currently) to nail the directory structure and design primitives, limit some doofus behavior, and keep literally every line of code simple and sensible. And I haven't figured out how to prevent agents from going out of bounds and doing weird things, short of catching it in review and adding another rule.

This is a relatively new endeavor, but my gut is that it's not much more time (linter rules and perhaps "evals" or a beefy agent review cycle) before I have bespoke linters in place that enforce what I want from our architecture.

Note that a huge bottleneck to all of this is that the codebase our current team inherited has no tests. It’s too easy to accidentally nuke a screen’s subtle details. It’s also really hard to write good tests without knowing what all of the functionality is. It feels like a blocker to a lot of large-swath agentic changes is a test strategy or solution first then a rigid push for rearchitecture or new design.


Yikes, using AI without tests is not fun. With tests you at least have some confidence that the AI isn't going completely off track; without them you're pretty much flying blind.

Having linters is super important IMO; I never try to make the AI do a linter's job. Let the AI focus on the hard stuff (architecture, maintainability, cleanliness) and let the linter handle the boring pieces.

I also definitely see the AI making changes that are way larger than necessary. I try to capture that in the eval with a "footprint risk" metric, which is essentially how many unnecessary changes the AI made vs. the original PR.
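One way that footprint number could be computed (a sketch, not the actual implementation — here over sets of touched files, but the same works over changed lines):

```python
def footprint_risk(agent_changed: set, reference_changed: set) -> float:
    """Fraction of the agent's touched files that the original PR never
    touched: 0.0 means the agent stayed inside the human change's
    footprint, 1.0 means everything it did was extra."""
    if not agent_changed:
        return 0.0
    return len(agent_changed - reference_changed) / len(agent_changed)

print(footprint_risk({"a.py", "b.py", "c.py"}, {"a.py"}))  # ~0.667
```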

I would certainly like to move beyond using PRs as a sole source of truth, since humans don't always write great code either. Maybe having LLM-as-a-judge looking for scope creep/bloat would be a decent band-aid?


Nice, I really like your idea. First I've heard of something like that


Working on that too. Lmk if you’re up for a chat?

yea I'm down - feel free to send me an email [email protected]

sounds like it's another openclaw-as-a-service provider?


Assuming you're referencing coding agents: I don't think people are. If they are, it's likely using:

- AI to evaluate itself (e.g. ask claude to test out its own skill)
- a custom-built platform (I see interest in this space)

I've actually been thinking about this problem a lot and am working on a custom eval runner for your codebase. What would your use case be for this?


I'd love to hear more about what you're working on (if you're open to sharing!).

I like to play with knowledge-base-powered chatbots, but what's most useful to me (and probably my primary use case) is coding agents, since I use CC every day. Recently I heard about Minimax m2.5, which is apparently a pretty good coding agent (they say it's comparable to opus 4.6), but I haven't tried it yet; plus it'd take a lot of time to figure out whether it's better or not.


Intuitively makes sense, but in my experience, a more realistic workflow is using the main agent to sub-agent delegation pattern instead of straight 7x-ing token costs.

By delegating to sub agents (eg for brainstorming or review), you can break out of local maxima while not using quite as many more tokens.

Additionally, when doing any sort of complex task, I do research -> plan -> implement -> review, clearing context after each stage. In that case, would I want to make 7x research docs, 7x plans, etc.? Probably not. Instead, a more prudent use of tokens might be to have Claude do research + planning, and have Codex review that plan prior to implementation.
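That staged flow is just plain orchestration code. A sketch, where `run_stage` and `reviewer` are hypothetical stand-ins for fresh-context calls to Claude/Codex rather than real APIs:

```python
def run_pipeline(task, run_stage, reviewer):
    """Staged workflow: each stage starts with fresh context and only the
    previous stage's artifact; a second model reviews the plan before
    implementation begins."""
    research = run_stage("research", task)   # fresh context
    plan = run_stage("plan", research)       # fresh context
    plan = reviewer(plan)                    # cross-model review of the plan
    code = run_stage("implement", plan)      # fresh context
    return run_stage("review", code)

# Toy stand-ins so the sketch runs end to end:
artifacts = []
stage = lambda name, inp: artifacts.append(name) or f"{name}({inp})"
result = run_pipeline("fix bug", stage, lambda p: p + "+reviewed")
print(result)
```

Passing only the artifact forward is what keeps each stage's context small, and swapping `reviewer` for a different model is where the "uncorrelated second opinion" comes from.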


Yes, understandable.

The question is which multi-agent architecture, hierarchical or competitive, yields the best results under some task/time/cost constraints.

In general, our sense is that competitive is better when you want breadth and uncorrelated solutions. Or when the failure modes across agents are unknown (which is always, right now, but may not be true forever).


> straight 7x-ing token costs

You are probably right, but my work pays for as many tokens as I want, which opens up a bunch of tactics that otherwise would be untenable.

I stick with sub-agent approaches outside of work for this reason though, which is a more than fair point.


Maybe an evolution-based approach does make sense: 3x instead, and over time drop the least effective agents, replacing them with new ones, even chosen at random.

Edit: And this is why you should read the article before you post!


Yes indeed, you get a big lift out of running just the few top agents.

We run big ensembles because we are doing a lot of analysis over the system, etc.


curious how this is different from claude-mem?

https://github.com/thedotmack/claude-mem


great question

claude-mem uses a compaction approach. It records session activity, compresses it, and injects summaries into future sessions. Great for replaying what happened.

A-MEM builds a self-evolving knowledge graph. Memories aren’t compressed logs. They’re atomic insights that automatically link to related memories and update each other over time. Newer memories impact past memories.

For example: if Claude learns “auth uses JWT” in session 1, then learns “JWT tokens expire after 1 hour” in session 5, A-MEM links these memories and updates the context on both. The older memory now knows about expiration. With compaction, these stay as separate compressed logs that don’t talk to each other.
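The JWT example above boils down to bidirectional linking on insert. A toy sketch of that behavior (keyword overlap standing in for whatever semantic linking A-MEM actually uses):

```python
class MemoryGraph:
    """Toy version of the linking described above: each new memory is
    connected to existing memories sharing a keyword, and both sides
    keep the link, so the older memory 'learns' too."""

    def __init__(self):
        self.memories = {}  # memory text -> set of linked memory texts

    def add(self, text):
        words = set(text.lower().split())
        for old, links in self.memories.items():
            if words & set(old.lower().split()):
                links.add(text)  # update the older memory in place
        self.memories[text] = {
            old for old in self.memories if words & set(old.lower().split())
        }

g = MemoryGraph()
g.add("auth uses JWT")                      # session 1
g.add("JWT tokens expire after 1 hour")     # session 5
print("JWT tokens expire after 1 hour" in g.memories["auth uses JWT"])  # True
```

With a compaction approach, neither entry would ever be revisited once written; here, session 5's insert mutates session 1's memory.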

