
Good summary.

Upshot: Steve thinks he’s built a quality task tracker/work system (beads), is iterating on architectures, and has become convinced that an architecture-builder is going to make sense.

Meanwhile, work output is going to improve independently. The bet is that leverage on the top side is going to be the key factor.

To co-believe this with Steve, you have to believe that workers can self-stabilize (e.g. that with something like the Wiggum loop you can get some actual quality out of them, unsupervised by a human), and that their coordinators can self-stabilize too.

If you believe those to be true, then you’re going to be eyeing 100-1000x productivity just because you get to multiply 10 coordinators by 10 workers.

I’ll say that I’m generally bought into this math. Anecdotally, I currently (last 2 months) spend about half my coding-agent time asking for easy inroads into what’s been done; a year ago, I spent 10% specifying and 90% complaining about bugs.

For example, I just pulled up an old project and asked for a status report — I got one based on the existing beads. I asked it to verify, and it ran the program and produced a fairly high-quality status report. I then asked it to read the output (a PDF), and it read the PDF, noticed my main complaints, and issued 20 or so beads to get things into the right shape. I had no real complaints about the response or the workplan.

I haven’t said “go” yet, but I presume when I do, I’m going to be basically checking work, and encouraging that work-checking to get automated as well.

There’s a sort of non-obvious thing that happens as we move from 0.5 9s to, say, 3 9s of effectiveness: we go from needing constant intervention at one order of magnitude of work to needing constant intervention at roughly 2.5 more orders of magnitude of work. It’s a little hard to believe unless you’ve really poked around, but I think it’s coming pretty soon, as does Steve.
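
To put rough numbers on that (my own back-of-envelope, on the assumption that "n 9s" means a per-task intervention rate of 10^-n):

    # Back-of-envelope only: expected tasks between human interventions,
    # reading "n 9s" as a per-task intervention rate of 10**-n (an assumption).
    def tasks_per_intervention(nines: float) -> float:
        return 1 / (10 ** -nines)

    print(tasks_per_intervention(0.5))  # ~3.2 tasks before a human has to step in
    print(tasks_per_intervention(3.0))  # ~1000 tasks, i.e. ~2.5 orders of magnitude more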

Who, nota bene, is working at such a pace that he is turning down 20 VCs a week, selling memecoin earnings in the hundreds of thousands of dollars, and randomly ‘napping’ in the middle of the day. Stay rested, Steve, and please keep on this side of the manic curve; we need you. I’d say it’s a good sign he didn’t buy any GAS token himself.


> Stay rested Steve, keep on this side of the manic curve please, we need you

This is my biggest takeaway. He may or may not be on to something really big, but regardless, it's advancing the conversation and we're all learning from it. He is clearly kicking ass at something.

I would definitely prefer to see this be a well paced marathon rather than a series of trips and falls. It needs time to play out.


> He is clearly kicking ass at something.

Publishing unmaintainable garbage code to Github?

Have you looked at the beads codebase? It's a bad joke at our expense.


That something being psychosis

And crypto

Yep, it works. Like anything, getting the most out of these tools is its own (human) skill.

With that in mind, a couple of comments: think of the coding agents as personalities with blind spots. A code review by all of them plus a synthesis step is a good idea. Currently popular is the “rule of 5”, which suggests having the LLM review five times, varying the level of review each pass, e.g. bugs, architecture, structure, etc. Anecdotally, I find this extremely effective.
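
If it helps, here's a minimal sketch of what I mean; `ask_agent` is a stand-in for whatever agent/LLM call you actually use, not any particular tool's API:

    # "Rule of 5" sketch: review the same diff through five lenses, then synthesize.
    # `ask_agent` is a placeholder for your own LLM/agent call, not a real API.
    REVIEW_LENSES = ["bugs", "architecture", "structure", "tests", "security"]

    def rule_of_five(diff: str, ask_agent) -> str:
        findings = [ask_agent(f"Review this diff strictly for {lens}:\n{diff}")
                    for lens in REVIEW_LENSES]
        return ask_agent("Synthesize these reviews into one prioritized list:\n"
                         + "\n---\n".join(findings))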

Right now, Claude is in my opinion the best coding agent out there. With Claude Code, the best harnesses are starting to automate the review/PR process a bit, but the hand-holding around bugs is real.

I also really like Yegge’s beads for helping LLMs keep state and track what they’re doing — upshot: I suggest you install beads, load Claude, run ‘!bd prime’ and say “Give me a full, thorough code review for all sorts of bugs, architecture, incorrect tests, specification, usability, code bugs, plus anything else you see, and write out beads based on your findings.” Then you could have Claude (or Codex) work through them. But you’ll probably find a fresh eye will save time, e.g. give Claude a try for a day.

Your ‘duplicated code’ complaint is likely an artifact of how codex interacts with your codebase - codex in particular likes to load smaller chunks of code in to do work, and sometimes it can get too little context. You can always just cat the relevant files right into the context, which can be helpful.

Finally, iOS is a tough target — I’d expect a few more bumps. The vast bulk of iOS apps are not up on GitHub, so there’s less facility in the coding models.

And front-end work doesn’t really have good native visual harnesses set up (although Claude has the Claude Chrome extension for web UIs), so there’s going to be more back and forth.

Anyway - if you’re a career engineer, I’d tell you - learn this stuff. It’s going to be how you work in very short order. If you’re a hobbyist, have a good time and do whatever you want.


I still don't get what beads needs a daemon for, or a db. After a while of using 'bd --no-daemon --no-db' I was sick of it and switched to beans, and my agents seem to be able to make use of it much better: on the one hand it's directly editable by them since it's just markdown, and on the other hand the CLI still gives them structure and makes the thing queryable.

Steve runs beads across something like 100 coding environments simultaneously, so you need some sort of coordination, whether that's your db or a daemon. Realistically, with 100 simultaneous connections I would probably reach for both myself. I haven't tried beans; thanks for the reference.

Yeah, that does make sense; these choices are related to beads being a big part of Gas Town. Still, I feel it would be much more sensible to make a different abstraction that separates beads' core features from the coordination layer.

It's 100% both sides. We haven't had a president work to roll back his own power since ... hmm. Maybe Gerald Ford? I guess Carter was fairly principled on some of this.

This part of the system - executive power grabs - is supposed to be curtailed by the courts first and Congress second in the US system.


> We haven't had a president work to roll back his own power,

This is just not true. For example, all under the Obama administration:

* the closure of Guantanamo Bay and other black sites, and the prohibition of torture as an interrogation method (including updates to the Army Field Manual and mandatory Red Cross access to any POW), all represented a significant reduction in executive power in how we treat detainees.

* following the Snowden leaks, there were several actions taken to curtail executive power in applying surveillance programs to both US citizens and non-US persons. These also rolled back several components of the PATRIOT Act (passed under his predecessor we all know and love, Dubya).

* the signing statements reform meant the executive no longer had an effective line-item veto

* the AG under Obama implemented a new DoJ policy limiting the use of "state secret" privilege during litigations.


The Guantanamo Bay Detention Facility is still open, and hosting detainees: https://en.wikipedia.org/wiki/Guantanamo_Bay_detention_camp

Obama rejected signing statements on the campaign trail, but his actions in office were more nuanced: https://en.wikipedia.org/wiki/Signing_statement#Obama_admini...

Eric Holder, the notable AG under the Obama administration, had a very mixed record, and did not support limitations of his power, or oversight of his actions: https://en.wikipedia.org/wiki/Eric_Holder#Tenure_as_Attorney...


> The Guantanamo Bay Detention Facility is still open, and hosting detainees

remind me, who reopened it?


It never closed. Congress, with the support of many Democrats, prevented its closure and Obama just kind of gave up.

Trump obviously wants it open and Biden kind of just ignored it so it remains.


I agree with those things, but they were not rollbacks of executive power. That was Obama using executive power to rein in bad policy, not ceding the power entirely.

Of course, perhaps he couldn’t. Congress needs to do that, and the courts, and neither seems interested in doing its job. Lower courts sometimes step up, but the Supreme Court has seemed to be on the side of a dictatorial executive for some time now.

What does Congress even do these days? Seems like half crackpot debate club and half hospice care facility.


There are light years of space between the behavior we're seeing now and "a president working to roll back his own power," and even that has arguably happened in many presidencies, depending on what you mean. You would need much more than that to demonstrate anything approaching behavioral parity on this dimension. Otherwise - yes, politicians from every party, forever, everywhere, exhibit some similar faults.

A lot of this depends on your workflow. A language with great typing, type checking, and good compiler errors will work better in a loop than one with a large surface overhead and lots of syntax complexity, even if it's well represented. This is the instinct behind, e.g., https://github.com/toon-format/toon, a JSON-alternative format. They test LLM accuracy with the format against JSON (and are generally slightly ahead of JSON).

Additionally just the ability to put an entire language into context for an LLM - a single document explaining everything - is also likely to close the gap.

I was skimming some nano files and while I can't say I loved how it looked, it did look extremely clear. Likely a benefit.


Thanks for sharing this! A question I've grappled with is "how do you make the DOM of a rendered webpage optimal for complex retrieval in both accuracy and tokens?" This could be a really useful transformation to throw in the mix!

Those are sort of like 2D splats - if only they'd thought to make it all differentiable!

Looks like solid incremental improvements. The UI oneshot demos are a big improvement over 4.6. Open models continue to lag roughly a year on benchmarks; pretty exciting over the long term. As always, GLM is really big - 355B parameters with 31B active - so it’s a tough one to self-host. It’s a good candidate for a Cerebras endpoint in my mind; getting Sonnet 4.x (x<5) quality with ultra-low latency seems appealing.

I tried Cerebras with GLM-4.7 (not Flash) yesterday using paid API credits ($10). They have per-minute rate limits and count cached tokens against them, so you'll get limited in the first few seconds of every minute, then you have to wait out the rest of the minute. So they're "fast" at 1000 tok/sec - but not really for practical usage. You effectively get <50 tok/sec with rate limits and being penalized for cached tokens.

They also charge full price for the same cached tokens on every request/response, so I burned through $4 for 1 relatively simple coding task - would've cost <$0.50 using GPT-5.2-Codex or any other model besides Opus and maybe Sonnet that supports caching. And it would've been much faster.


I hope cerebras figures out a way to be worth the premium - seeing two pages of written content output in the literal blink of an eye is magical.

The pay-per-use API sucks. If you end up on the $50/mo plan, it's better, with caveats:

1 million tokens per minute, 24 million tokens per day. BUT: cached tokens count in full, so if you have 100,000 tokens of context you can burn through a minute's allowance in about ten requests.


Try a nano-gpt subscription. Not going to be as fast as Cerebras, obviously, but it's $8/mo for 60,000 requests.

It’s wild that cached tokens count full - what’s in it for you to care about caching at all then? Is the processing speed gain significant?

Not really worth it, in general. It does reduce latency a little. In practice, you do have a continuing context, though, so you end up using it whether you care or not.

I wonder why they chose per minute? That method of rate limiting would seem to defeat their entire value proposition.

In general, with per-minute rate limiting you limit load spikes, and load spikes are what you pay for: they force you to ramp up your capacity, and usually you are then slow to ramp down to avoid paying the ramp-up cost too many times. A VM might boot relatively fast, but loading a large model into GPU memory takes time.

I use GLM 4.7 with DeepInfra.com and it's extremely reasonable, though maybe a bit on the slower side. But faster than DeepSeek 3.2 and about the same quality.

It's even cheaper to just use it through z.ai themselves I think.


I know this might not be the most effective use case, but I ended up using the "try AI" feature in Cerebras, which opens a window in the browser.

Yes, it has some restrictions as well, but it still works for free. I have a private repository where I ended up creating a Puppeteer instance so that I can just input something in a CLI and get the output back in the CLI as well.

With current agents, I don't see why I couldn't just expand that with a cheap model (I think MiniMax 2.1 is pretty good for agents) and get the agent to write the files and do the work in a loop.

I think the repository might have gotten deleted after I reset my old system, but I can look for it if this interests you.

Cerebras is such a good company. I talked to their CEO on Discord once and have been following them for 1-2 years now. I hope they don't get enshittified with the recent OpenAI deal, and that they improve their developer experience, because people wish to pay them, but for now I had to do a shenanigan that was free (though really I was just curious how Puppeteer works and wanted to find out whether such an idea was even possible; I didn't use it that much after building it).


I hear this said, but never substantiated. Indeed, I think our big issue right now is making actual benchmarks relevant to our own workloads.

Due to US foreign policy, I quit Claude yesterday and picked up MiniMax M2.1. We wrote a whole design spec for a project I’ve previously written a spec for with Claude (with some changes to the architecture this time; adjacent, not the same).

My gut feel? I prefer MiniMax M2.1 with opencode to Claude. Easiest boycott ever.

(I even picked the $10 plan; it was fine for now.)


Unless one of the open model labs has a breakthrough, they will always lag. Their main trick is distilling the SOTA models.

People talk about these models like they are "catching up", they don't see that they are just trailers hooked up to a truck, pulling them along.


FWIW this is what Linux and the early open-source databases (e.g. PostgreSQL and MySQL) did.

They usually lagged for large sets of users: Linux was not as advanced as Solaris, PostgreSQL lacked important features contained in Oracle. The practical effect of this is that it puts the proprietary implementation on a treadmill of improvement where there are two likely outcomes: 1) the rate of improvement slows enough to let the OSS catch up or 2) improvement continues, but smaller subsets of people need the further improvements so the OSS becomes "good enough." (This is similar to how most people now do not pay attention to CPU speeds because they got "fast enough" for most people well over a decade ago.)


You know, this is also the case of Proxmox vs. VMWare.

Proxmox became good and reliable enough as an open-source alternative for server management. Especially for the Linux enthusiasts out there.


DeepSeek 3.2 scores gold at the IMO and others. Google had to use parallel reasoning to do that with Gemini, and the public version still only achieves silver.

How does this work? Do they buy lots of openai credits and then hit their api billions of times and somehow try to train on the results?

Don't forget the plethora of middleman chat services with liberal logging policies. I've no doubt there is a whole subindustry lurking in here.

I wasn't judging, I was asking how it works. Why would OpenAI/Anthropic/Google let a competitor scrape their results in sufficient amounts that it lets them train their own thing?

I think the point is that they can't really stop it. Let's say that I purchase API credits and then resell them to DeepSeek.

That's going to be pretty hard for OpenAI to figure out, and even if they figure it out and stop me, there will be thousands of other companies willing to do that arbitrage. (Just for the record, I'm not doing this, but I'm sure people are.)

They would need to be very restrictive about who is allowed to use the API, and that would kill their growth, because then customers would just go to Google or another provider that is less restrictive.


Yeah but are we all just speculating or is it accepted knowledge that this is actually happening?

Speculation, I think, because for one those supposed proxy providers would have to provide some kind of pricing advantage compared to the original provider. Maybe I missed them, but where are the X0% cheaper SOTA-model proxies?

Number two, I'm not sure if random samples collected over even a moderately large number of users make a great base of training examples for distillation. I would expect they need more focused samples over very specific areas to achieve good results.


Thanks. In that case my conclusion is that all the people saying these models are "distilling SOTA models" are, by extension, also speculating. How can you distill what you don't have?

The only way I can think of is paying to synthesize training data using SOTA models yourself. But yeah, I'm not aware of anyone publicly sharing that they did, so it's also speculation.

The economics probably work out though, collecting, cleaning and preparing original datasets is very cumbersome.

What we do know for sure is that the SOTA providers are distilling their own models, I remember reading about this at least for Gemini (Flash is distilled) and Meta.


OpenAI implemented ID verification for their API at some point and I think they stated that this was the reason.

> The UI oneshot demos are a big improvement over 4.6.

This is a terrible "test" of model quality. All these models fail when your UI is out of distribution; Codex gets close but still fails.


Note that this is the Flash variant, which is only 31B parameters in total.

And yet, in terms of coding performance (at least as measured by SWE-Bench Verified), it seems to be roughly on par with o3/GPT-5 mini, which would be pretty impressive if it translated to real-world usage, for something you can realistically run at home.


Sonnet was already very good a year ago; are open-weights models right now as good?

FWIW, Sonnet 4.5 is very far ahead of where Sonnet was a year ago.

From my experience, Kimi K2, GLM 4.7 (not flash, full), Mistral Large 3, and DeepSeek are all about Sonnet 4 level. I prefer GLM of the bunch.

If you were happy with Claude at its Sonnet 3.7 & 4 levels 6 months ago, you'll be fine with them as a substitute.

But they're nowhere near Opus 4.5


Probably the wrong attitude here - beads is infra for your coding agents, not you. The most I directly interact with it is by invoking `bd prime` at the start of some sessions if the LLM hasn’t gotten the message; maybe very occasionally running `bd ready` — but really it’s a planning tool and work scheduler for the agents, not the human.

What agent do you use it with, out of curiosity?

At any rate, to directly answer your question, I used it this weekend like this:

“Make a tool that lets me ink on a remarkable tablet and capture the inking output on a remote server; I want that to send off the inking to a VLM of some sort, and parse the writing into a request; send that request and any information we get to nanobanana pro, and then inject the image back onto the remarkable. Use beads to plan this.”

We had a few more conversations, but got a workable v1 out of this five hours later.


It not only sort of works; the 10% of the time it does work, it works surprisingly well at scale! Tantalizing.

Counterpoint: you can go much faster if you get lots of people engaging with something and testing it. This is exploratory work, not some sort of ivory-tower rationalism exercise (if those even truly exist); there’s no compulsion involved, so everyone engaged does so for self-motivated reasons.

Don’t be mad!

Also, beads is genuinely useful. In my estimation, Gas Town, or a successor built on a similar architecture, will not only be useful but will likely be considered ‘state of the art’ for at least a month sometime in the future. We should be glad this stuff is developed in the open, in my opinion.


It is worth an install; it works very differently than an agent in a single loop.

Beads formalizes building a DAG for a given workload. This has a bunch of implications, but one is that you can specify larger workloads and the agents won’t get stuck or confused. At some level Gas Town is a bunch of scaffolding around the benefits of beads; an orchestrator that is native to dealing with beads opens up many more benefits than one that isn’t custom-coded for it.
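
To make that concrete, here's a minimal toy sketch of the DAG idea (my own illustration, not beads' actual schema or CLI): tasks are nodes, dependencies are edges, and a task is "ready" when everything it depends on is done, so multiple agents can pull ready work in parallel.

    # Toy model of the idea (not beads' real data model): a task is "ready"
    # once all of its dependencies are done, so agents can pull ready work in parallel.
    from dataclasses import dataclass, field

    @dataclass
    class Task:
        id: str
        deps: list[str] = field(default_factory=list)
        done: bool = False

    def ready(tasks: dict[str, Task]) -> list[Task]:
        return [t for t in tasks.values()
                if not t.done and all(tasks[d].done for d in t.deps)]

    tasks = {
        "spec": Task("spec", done=True),
        "parser": Task("parser", deps=["spec"]),
        "ui": Task("ui", deps=["spec"]),
        "integration": Task("integration", deps=["parser", "ui"]),
    }
    print([t.id for t in ready(tasks)])  # ['parser', 'ui'], i.e. both can be worked at once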

Think of a human needing to be interacted with as a ‘fault’ in an agentic coding system — a copilot agent might be at 0.5 9s or so: 50% of tasks can complete without intervention, given a certain set of tasks. All the Gas Town scaffolding is trying to increase the number of 9s, and the size of the task that can be given.

My take: Gas Town (as an architecture) certainly has more nines in it than a single agent; the rest is just a lot of fun experimentation.


> Beads formalizes building a DAG for a given workload

> gas town is [...] an orchestrator that is native to dealing with beads

Thanks - this is very helpful in deciding when and where to use them. Steve's descriptions sounded to me like more RAM and Copilot Agents:

> [Beads:] A memory upgrade for your coding agent

> [Gas Town:] a new take on the IDE for 2026. Gas Town helps you with the tedium of running lots of Claude Code instances


Yes, he is in an extended manic episode right now - we can only sit back and enjoy the fruits of his extreme labor. I expect the dust will settle at some point, and I think he’s right that he’s on to some quality architecture.
