Come on now, you didn't expect someone linking to that trash website to actually read any of it, did you? Grokipedia tries to downplay the progressive part, but it does still mention it.
Maybe I'm just a fuddy-duddy, but my eyes about rolled out of my head reading this. The same article could probably be written about multiple companies and it'd be just as uninteresting. It's my understanding that there isn't anything special about WD-40, in that alternatives exist that can work just as well. Now, I think WD-40 is a brand name that can be trusted to work well more often than most alternatives, but that's more about process than recipe (I would think).
I've long thought that every restaurant/bakery/etc. could publish their full internal cookbooks and not see a drop in sales. People don't buy the food because they're incapable (or think they are) of making it themselves; they buy it because it's faster, they don't have all the ingredients, they don't have the time, they don't have the skill, the list goes on. I bet I could give people the instructions, the equipment, and the ingredients and they'd still choose to buy it. Sure, you might lose a tiny bit of sales to "home bakers" [0], but I think it'd be eclipsed by people who saw/read/heard about the cookbook (maybe without ever opening it) and for whom that was enough "marketing" to get them in the door.
I've always found "secret knowledge" to be a little silly. A sort of security through obscurity. Knowing a recipe doesn't make you special; being able to build/run a company around it and make it consistently good does.
[0] I love to cook, and I sometimes like making copy-cat recipes. I can't think of a copy-cat recipe I've made more than 2-3 times. While it's fun to do, it's never exactly the same, and I also believe that "food tastes better when someone else makes it". It can also be just as expensive or more expensive to make some food items, because you need a bunch of ingredients that aren't sold in exactly the quantities the recipe calls for.
> I've long thought that every restaurant/bakery/etc could publish their full internal cookbooks and not see a drop in sales.
Makes me think of all those stories[0] employing a "secret recipe" plot. Some baking/cooking recipe (or a whole cookbook), written down by grandma and passed down in the family, or such, is critical to the fate of a bakery/restaurant/Thanksgiving dinner/etc.; predictably, it gets stolen, and suddenly the meal everyone loves cannot be made anymore.
It's a dumb idea if you think about it for more than a second: even the worst home cook will naturally memorize all the ingredients and steps after using the recipe more than a couple of times. If the process involves more than one person, there are bound to be copies and derivative documents (e.g. shopping lists) around, too. Recipes are good checklists and are particularly helpful when onboarding new cooks, but losing an actively used one isn't a big deal; it can be recreated on the spot by those who already know it by heart.
--
[0] - One I've watched recently was Hoodwinked! - https://en.wikipedia.org/wiki/Hoodwinked!. Great movie, but out of all the absurdities in it, by far the biggest one was the whole "stealing recipes to put bakeries out of business" plot driver.
Recently Fallow posted a video on how they make demi-glace. The ingredients are a bit specialized, but probably doable if you order far enough ahead from a butcher. The bigger issue was the larger scale and the time and effort involved.
I really feel the main difference is the scale, then getting the right ingredients, then actually using all of them, and finally making the thing somewhat cost effective.
I have no doubt that any serious company could make something like WD-40. Not exactly the same stuff, but close enough in general. Probably close enough that if you swapped the labels, nearly all users wouldn't notice.
Nothing gets gearhead nerds going more than arguing about lubricants and gas. Ask the wrong group of dudes about when to change your oil at breakfast, and they will still be going at dinner.
I don't know if that's really fair. It's much rarer for HN link posts to have bodies, and this one's body is just a single line with the gift link. Yes, that gift link works today, but it's also completely reasonable to post the archive link.
We are all lucky that Alan Dye is off screwing up Meta now and that Tim Cook is not long for Apple. Even with just Dye gone I expect things to improve but I doubt a big change will happen until Tim Cook leaves (which cannot come soon enough).
I'm not trying to be rude here at all, but are you manually verifying any of that? When I've had LLMs write unit tests, they're quick to produce pointless tests that look impressive ("2123/2123 tests passed!") but in reality test almost nothing of value. And that's when they aren't bypassing commit checks, commenting out tests, or saying "I fixed it all" while multiple tests are broken.
Maybe I need a stricter harness but I feel like I did try that and still didn't get good results.
I feel like it was doing what you're describing about 4-6 months ago, especially the commenting out tests. Not always, but I'd have to do more things step by step and keep the LLM on track. Now, though, over the last 3-4 months, it's writing decent unit tests without much hand-holding or refactoring.
Hmm, my last experience was within the last 2 months, but I'm trying not to write it off as "this sucked and will always suck". That's the #1 reason I keep testing and playing with these things: the capabilities are increasing quickly, and what did/didn't work last week (especially "last model") might work this week.
I'll keep testing it, but that just hasn't been my experience. I sincerely hope that changes, because an agent that runs unit tests [0] and can write them would be very powerful.
[0] This is a pain point for me. The number of times I've watched Claude run "git commit --no-verify"... I've told it in CLAUDE.md to never bypass commit checks, I've told it in the prompt, I've added it 10 more times in different places in CLAUDE.md, but still, the agent will always reach for that if it can't fix something in 1-3 iterations. And yes, I've told it "If you can't get the checks to pass, ask me before bypassing them".
It doesn't matter how many guardrails I put up and how good they are if the agent will lazily bypass them at the drop of a hat. I'm not sure how other people are dealing with this (maybe with agents managing agents and checking their work? A la Gas Town?).
I haven't seen your issue, but git is actually one of the things I don't have the llm do.
When I work on issues I create a new branch off of master, let the llm go to town on it, then I manually commit and push to remote for an MR/PR. If there are any errors on the commit hooks I just feed the errors back into the agent.
Interesting, ok, I might try that on my next attempt. I was having it commit so that I could use pre-commit hooks to enforce the things I want (test, lint, prettier, etc.), but maybe I should handle that myself and be more explicit in my prompts/CLAUDE.md about testing/linting/etc. In reality I should just create a `/prep` command or similar that asks it to do all of that, so that once it thinks it's done, I can quickly type that, have it get everything passing/fixed, and then give a final report on what it did.
You'll likely have the same issue relying on CLAUDE.md instructions to test/lint/etc.; mine get ignored constantly, to the point of uselessness.
I'm trying to redesign my setup to use hooks now instead (rough sketch below), because poor adherence to rules files across all the agentic CLIs is exhausting to work around.
(And no, Opus 4.5 didn't magically solve this problem, to preemptively respond to that reply.)
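The rough shape I'm going for is a PreToolUse hook that vetoes risky Bash commands before they ever run, e.g. blocking `git commit --no-verify` outright. Here's a minimal sketch in Python, assuming my reading of the hook contract is right (the tool call arrives as JSON on stdin, and exiting with code 2 blocks the call and feeds stderr back to the agent); double-check that against the current Claude Code hooks docs:

```python
#!/usr/bin/env python3
# Sketch of a PreToolUse hook that refuses `git commit --no-verify`.
# Assumption: the hook receives the tool call as JSON on stdin, and exit code 2
# blocks the call while showing stderr to the agent (verify against the docs).
import json
import sys

payload = json.load(sys.stdin)
command = payload.get("tool_input", {}).get("command", "")

if "git commit" in command and "--no-verify" in command:
    print(
        "Blocked: do not bypass commit checks. Fix the failing checks "
        "or ask the user how to proceed.",
        file=sys.stderr,
    )
    sys.exit(2)  # veto this Bash call

sys.exit(0)  # allow everything else
```

As I understand it, you register that under `hooks` -> `PreToolUse` with a `Bash` matcher in `.claude/settings.json`. The appeal is that it sits outside the context window entirely, so there's nothing for the model to ignore.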
I wonder if some people are putting too much into their markdown files about what NOT to do.
I hate people saying LLMs are just better autocorrect, but in some ways they're right. I think putting in too much "don't do this" leads the LLM down the path of doing "this" because you mentioned it at all. The LLM is probabilistically generating its response based on what you've said and what's in the markdown files; the fact that you put some of that stuff in there at all probably increases the probability those things will show up.
In my projects there's generally a "developer" way to do things and an "llm agent" way to do things.
For the LLM, a lot of the linting and build/test tooling goes into simple scripts that the LLM can run to get shorthand info out of. With some tools, if you have the LLM run them directly, it's going to ingest a lot from the output (like a big stack trace or something). I want to keep context clean, so I have the LLM create the tool it will use for build/test/linting, tell it to build the tool so the outputs keep its context clean, and then have it document the tool in the .md file.
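As a made-up example of the shape these scripts take (the test runner and the filtering are just placeholders for whatever the project actually uses), one of them might look like:

```python
#!/usr/bin/env python3
# Hypothetical "agent-friendly" test wrapper: run the suite, drop the noisy
# output, and print only failures plus the final summary line so the LLM's
# context stays small. Assumes pytest; substitute your project's test runner.
import subprocess
import sys

result = subprocess.run(
    ["pytest", "-q", "--tb=line"],  # quiet mode, one-line tracebacks
    capture_output=True,
    text=True,
)

lines = (result.stdout + result.stderr).splitlines()
keep = [
    line for line in lines
    if line.startswith("FAILED")
    or line.startswith("ERROR")
    or "passed" in line
    or "failed" in line
]

print("\n".join(keep[-20:]) if keep else "no summary captured")
sys.exit(result.returncode)
```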
When working with the LLM, I have to start out pretty explicit about using the tooling. As we work through things it will start to run the tooling automatically. Sometimes it will want to do something else; I just nudge it back to the tooling (or I'll ask it why, and whether there are benefits to the other way, and if there are we'll rebuild the tooling to work that way).
Finally, if the LLM is really having trouble, I kill the session and start a new one. It used to feel bad to do that. I'd feel like I'm losing a lot of info that's in context. But now, I feel like it's not so bad... but I'm not sure if that's because the llms are better or if my workflow has adapted.
Now, let me back up a little bit. I mentioned that I don't have the LLM use git. That's the control I maintain. With that, my workflow is: LLM builds feature -> LLM runs linters/tests -> I e2e test whatever I'm building by deploying to a dev/staging/local env -> once verified, I commit. I will continue that context window/session until I feel like the LLM starts fucking up. Then I kill the session and start a new one. I rarely compact, but it does happen, and I generally don't fret about it too much.
I try to keep my units of work small and I feel like it does the best when I do. But then I often find myself surprised at how much it can do from a single prompt, so idk. I do understand some of the skepticism because a lot of this stuff sounds "hand-wavy". I'm hoping we all start to hone in on some general more concrete patterns but with it being so non-deterministic I'm not sure if we will. It feels like everyone is using it differently and people are having successes and failures across different things. People where I work LOVE MCPs but I can't stand them. When I use them it always feels like I have to remind the llm that it has an MCP, then it feels like the MCP takes too much context window and sometimes the llm still trips over how to use it.
Ok, that's a good tip about separate tools/scripts for the LLM. I did something similar less than a year ago to keep lint/test output to a minimum, but it was still invoked via git hooks; I'll try standalone scripts next time. My hope was to let the agent commit to a branch (with code that passed lint/test/prettier/etc.), push it, have it auto-deploy to a preview branch, and then that's where I'd do my e2e/QA. Once I was happy, I could merge it and it would get deployed to the main site.
I discussed approaches in my earlier reply, but what you are saying now makes me think you are having problems with too much context. Pare down your CLAUDE.md massively and never let your context usage get over 60-65%. And tell Claude not to commit anything without explicit instructions from you (unless you are working in a branch/worktree and are willing to throw it all away).
Literally yesterday I was using Claude for writing a SymPy symbolic verification of a mathematical assertion it was making with respect to some rigorous algebra/calculus I was having it do for me. This is the best possible hygiene I could adopt for checking its output, and it still failed to report on results correctly.
After manual line-by-line inspection and hand-tweaks, it still saved me time. But it's going to be a long, long time before I no longer manually tweak things or trust that there are no silent mistakes.
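The checks themselves are the easy part; it's the reporting on them that it botched. For context, the kind of check I mean looks roughly like this (the identity here is a made-up stand-in, not the actual assertion from my session):

```python
# Sketch of the kind of SymPy check I have it write: state the claimed identity,
# then let simplify() decide whether the difference really is zero.
# The identity below is a made-up stand-in, not the one from my actual session.
import sympy as sp

x = sp.symbols("x")
claimed = sp.sin(x) ** 2            # what the LLM asserted
derived = (1 - sp.cos(2 * x)) / 2   # what the algebra gives

residual = sp.simplify(claimed - derived)
assert residual == 0, f"identity does not hold, residual: {residual}"
print("identity verified symbolically")
```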
Those kinds of errors were super common 4-6 months ago, but LLM quality moves fast, and nowadays I don't see them very often at all. Two things make a huge difference. First, work on writing a spec: GitHub's Spec Kit, GSD, BMAD, whatever tool you like can help with this. Do several passes on the spec to refine it and focus on the key ideas.
Now that you have a spec, task it out, but tell the LLM to write the tests first (like test-driven development, but without all the formalisms). This forces the LLM to focus on the desired behavior instead of the algorithms. Make sure the tests cover real behavior: client APIs doing the right error handling when given bad input, handling tricky cases, etc. Tell the system not to write "struct" tests; checking that getters/setters work isn't interesting or useful.
Then you implement 1-3 tasks at a time, getting the tests to pass. The rules prevent disabling tests, commenting out tests, and, most importantly, changing the behavior of the tests. This doesn't use a lot of context, there's little to no hallucinating, and progress is easily measurable.
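To make "real behavior, not struct tests" concrete, this is roughly the kind of test I ask for up front, written before the implementation exists (the module, function, and exception names here are hypothetical placeholders for your own API):

```python
# Hypothetical behavior-focused tests, written first: they pin down error
# handling for bad input rather than checking getters/setters.
# `orders.parse_order` and `InvalidOrderError` are placeholder names.
import pytest

from orders import parse_order, InvalidOrderError  # implemented after the tests


def test_rejects_negative_quantity():
    with pytest.raises(InvalidOrderError):
        parse_order({"sku": "ABC-123", "quantity": -1})


def test_missing_sku_names_the_field():
    with pytest.raises(InvalidOrderError, match="sku"):
        parse_order({"quantity": 2})
```

A "struct" test, by contrast, would just assert that a field round-trips through a constructor, which tells the model nothing about the behavior you actually care about.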
>> When I've had LLMs write unit tests they are quick to write pointless unit tests that seem impressive "2123/2123 tests passed!" but in reality it's testing mostly nothing of value.
This has not happened to me since Sonnet 4.5. Opus 4.5 is especially robust when it comes to writing tests. I use it daily in multiple projects and verify the test code.
I thought I did use Opus 4.5 when I tested this last time, but I might have still been on the $20 plan, and I can't remember if you get any Opus 4.5 on that in Claude Code (I thought you did, with really low limits?). So maybe I wasn't using Opus 4.5; I'll need to try again.
Where is the "super upvote button" when you need it?
YES! I have been playing with vibe-coding tools since they came out. "Playing" because only on rare occasions have I created something good enough to commit/keep/use. I keep playing with them because, well, I have a subscription, but also so I don't fall into the fuddy-duddy camp of "all AI is bad" and can legitimately speak on the value, or lack thereof, of these tools.
Claude Code is super cool, no doubt, and with _highly targeted_ and _well planned_ tasks it can produce valuable output. Period. But every attempt at full vibe-coding I've done has gotten hung up at some point and I've had to step in and manually fix things. My experience is often:
1. First Prompt: Oh wow, this is amazing, this is the future
2. Second Prompt: Ok, let me just add/tweak a few things
10. 10th prompt: Ugh, every time I fix one thing, something else breaks
I'm not sure at all what I'm doing "wrong". Flogging the agents along doesn't work well for me, or maybe I'm just having trouble letting go of control and I'm not flogging enough?
But the bottom line is I'm generally shocked that something like Gas Town could be vibe-coded at all. Maybe it's a case of the LLM overstating what it's accomplished (typical), and if you look under the hood it's doing 1% of what it says it is, but I really don't know. Clearly it's doing something, but then I sit over here trying to build a simple agent with some MCPs hooked up to it, using an LLM agent framework, and it's falling over after a few iterations.
So I’m probably in a similar spot - I mostly prompt-and-check, unless it’s a throwaway script or something, and even then I give it a quick glance.
One thing that stands out in your steps, and that I've noticed myself: yeah, by prompt 10, it starts to suck. If it ever hits "compaction", that's past the point of no return.
I still find myself slipping into this trap sometimes because I’m just in the flow of getting good results (until it nosedives), but the better strategy is to do a small unit of work per session. It keeps the context small and that keeps the model smarter.
I will check out Ralph (thank you for that link!).
> Another way is “Write out what we did to PROGRESS.md” - then start new session - then “Read @PROGRESS.md and do X”
I agree on small context and if I hit "compacting" I've normally gone too far. I'm a huge fan of `/clear`-ing regularly or `/compact <Here is what you should remember for the next task we will work on>` and I've also tried "TODO.md"-style tracking.
I'm conflicted on TODO.md-style tracking because in practice I've had an agent work through every item on the list, confidently telling me steps are done, only to find that's not the case when I check its work. Both a TODO.md that I created and one I had the agent create suffer from this. Also, getting it to update the TODO.md has been frustrating; even when I add "Make sure to mark tasks as complete in TODO.md as you finish them" to CLAUDE.md, or add the same message to the end of all my prompts, it won't always update it.
I've been interested in trying out beads to see if it works better than a markdown TODO file, but I haven't played with it yet.
But overall I agree with you, smaller chunks are key to success.
I hate TODO.mds too. If I ever have to use one, I'll keep track of it manually and split the work myself into chunks of the size I believe CC/codex can handle. TODO.md is a recipe for failure because you'll quickly have more code than you can review and no basis for trusting that it was executed well.
> 10. 10th prompt: Ugh, every time I fix one thing, something else breaks
Maybe that is the time to start making changes by hand. I think this dream of humans never writing any more code might be going too far, and it's unnecessary.
I’ve definitely hit that same pattern in the early iterations, but for me it hasn’t really been a blocker. I’ve found the iteration loop itself isn’t that bad as long as you treat it like normal software work. I still test, review, and check what it actually did each time, but that’s expected anyway. What’s surprised me is how quickly things can scale once the overall architecture is thought through. I’ve built out working pieces in a couple of weeks using Claude Code, and a lot of that time was just deciding on the architecture up front and then letting it help fill in the details. It’s not hands-off, but used deliberately, it’s been quite effective https://robos.rnsu.net
I agree that it can be very useful when used like that, but I'm referring to fully vibe-coding, the "I've never looked at the code" people. CC is a great tool when you plan carefully, review its work, etc., but people are building things they say they've never read the code for, and that just hasn't been my experience; it always falls over on its own if I'm not in the code reviewing/tweaking.
> Keep in mind that Steve has LLMs write his posts on that blog.
Ok, I can accept that, it's a choice.
> Things said there may not reflect his actual thoughts on the subject(s) at hand.
Nope, you don't get to have it both ways. LLMs are just tools; there is always a human behind them, and that human is responsible for what they let the LLM do/say/post/etc.
We have seen the hell that comes from playing the "they said that but they don't mean it" or "it's just a joke" game (re: Trump), and I'm not a fan of whitewashing with LLMs.
This is not an anti or pro Gas Town comment, just a comment on giving people a pass because they used an LLM.
Do you read that as giving him a pass? I read it as more of a condemnation. If you have an LLM write "your" blog posts then of course their content doesn't represent your thoughts. Discussing the contents of the post then is pointless, and we can disregard it entirely. Separately we can talk about what the person's actual views might be, using the fact that he has a machine generate his blog posts as a clue. I'm not sure I buy that the post was meaningfully LLM-generated though.
The same approach actually applies to Trump and other liars. You can't take anything they say as truth or serious intent on its own; they're not engaging in good faith. You can remove yourself one step and attempt to analyze why they say what they do, and from there get at what to take seriously and what to disregard.
In Steve's case, my interpretation is that he's extremely bullish on AI and sees his setup or something similar as the inevitable future, but he sprinkles in silly warnings to lampshade criticism. That's how the two messages of "this isn't serious" and "this is the future of software development" co-exist. The first is largely just a cover and an admission that his particular project is a mess. Note that this interpretation assumes the contents of the blog post in question were largely written by him, even if LLM assistance was used.
Hmm, maybe I read the original comment wrong then? I did read it as "You can't blame him, that might not even be what he thinks" and my stance is "He posted it on his blog, directly or indirectly, what else am I supposed to think?".
I agree with you on Steve's case, and I have no ill will towards him. Mostly it was just me trying to "stomp" on giving him a pass, but, as you point out, that may not have been what the original commenter meant.
I'm building a house currently and I really wish there were more options to have the things I want without needing all the extra space in places I don't care about. The problem is, even if I was able to build such a house (I'm using a large builder, this is not a fully custom house) the resale prospects would be poor.
I missed that thread originally; the post and the comments were a good read, thank you for sharing.
I got a kick out of this comment [0]. "BenjiWiebe" made a comment about the SSH packets you stumbled across in that thread. Obviously making the connection between what you were seeing in your game and this random off-hand comment would be insane (if you had seen the comment at all), but I got a smile out of it.
I wanted to look into their pricing for Devin+ and I have to say, ACUs are entirely too opaque/confusing/complicated. The entire description of them is shrouded in mystery. And this part confuses me even more:
> Aside from the few ACUs required to keep the Devin VM running, Devin will not consume ACUs when:
> Waiting for your response
> Waiting for a test suite to run
> Setting up and cloning repositories
Ok, that kind of makes sense, but what does "the few ACUs required to keep the Devin VM running" mean? These cost $2.50 each, so "a few" means $5+, but on what time scale? Daily? Monthly?
The lowest plan comes with $20 ACUs but they don't list anywhere how far that gets you or even rough examples. I guess if you want to kick the tires $20 isn't a crazy amount to test it out yourself and maybe I'm just not the target market (I kind of feel like I am though?) but I wish their pricing made sense.