Hacker News | munro's comments

ChatGPT needs to catch up hard. For me their model is unusable, and I cancelled my subscription 5-6 months ago, so to me this post is hot air. I need to see results. Going back to try Codex 5.2, then 5.3, downloading their new desktop client, their VS Code extension, and their weird browser is time I wish I had back.

>=99% accuracy wtf?!?

I was initially excited until I saw that, because it would suggest some sort of required local minimum capacity. Then the further revelation that this was all vibe coded, with no arXiv paper, makes me feel I should save my attention for another article.


I was looking at using this on an LTO tape library. It seems the only resiliency is through replication, and that was my main concern with this project: what happens when hardware goes bad?


If you have replication, you can lose one of the replicas; that's the point. This is what Garage was designed for, and it works.

Erasure coding is another debate. For now we have chosen not to implement it, but I would personally be open to having it supported by Garage if someone codes it up.


Erasure coding is an interesting topic for me. I've run some calculations on the theoretical longevity of digital storage. If you assume that today's technology is close to what we'll be using for a long time, then cross-device erasure coding wins, statistically. However, if you factor in the current exponential rate of technological development, simply making lots of copies and hoping for price reductions over the next few years turns out to be a winning strategy, as long as you don't have vendor lock-in. In other words, I think you're making great choices.


I question that math. Erasure coding needs less than half as much space as replication, and imposes pretty small costs itself. Maybe we can say the difference is irrelevant if storage prices will drop 4x over the next five years? But looking at pricing trends right now... that's not likely. Hard drives and SSDs are about the same price they were 5 years ago. In the 5 years before that, SSDs were seeing good advancements, but hard drive prices only improved about 2x.
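For concreteness, a back-of-the-envelope comparison of raw-storage overhead. The Reed-Solomon parameters here are illustrative only, not anything Garage actually uses:

```python
# Raw-storage overhead: bytes of disk consumed per byte of user data.

def replication_overhead(copies: int) -> float:
    # n full copies -> n x raw storage, tolerates (copies - 1) lost disks
    return float(copies)

def erasure_overhead(data_shards: int, parity_shards: int) -> float:
    # Reed-Solomon(data, parity) -> (data + parity) / data x raw storage,
    # tolerates parity_shards lost disks out of data + parity
    return (data_shards + parity_shards) / data_shards

print(replication_overhead(3))   # 3.0x, survives 2 failures
print(erasure_overhead(10, 4))   # 1.4x, survives 4 failures
```

With these (hypothetical) parameters the erasure-coded layout uses less than half the raw space of 3x replication while tolerating more failures, which is the gap the "prices will drop" argument has to beat.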


Seriously cool. I just did something similar with quantiles for even bucketizing on arbitrary key data types (they still need to be orderable, though).
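A minimal sketch of that idea (function names and data are made up): pick bucket boundaries at quantile positions of a sorted sample, so it works on any orderable key type, not just numbers:

```python
from bisect import bisect_right

def quantile_buckets(keys, n_buckets):
    """Pick n_buckets - 1 boundary keys so buckets hold ~equal counts.
    Works on any orderable type (str, tuples, ...), not just numbers."""
    s = sorted(keys)
    return [s[len(s) * i // n_buckets] for i in range(1, n_buckets)]

def bucket_of(key, bounds):
    # bisect_right maps a key to the index of its bucket
    return bisect_right(bounds, key)

keys = ["apple", "pear", "fig", "kiwi", "plum", "date", "lime", "yam"]
bounds = quantile_buckets(keys, 4)
counts = [0, 0, 0, 0]
for k in keys:
    counts[bucket_of(k, bounds)] += 1
print(bounds)   # ['fig', 'lime', 'plum']
print(counts)   # [2, 2, 2, 2] -- evenly filled buckets
```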


Just use a calendar event. It's more robust, and gives you the same feeling of "oh yeah..."


Exactly. I give services like this - generally coded as someone's first "wow I know PHP now!" or the modern equivalent - approximately 5 years shelf life, at best.

Whereas I have notes-to-future-me on my calendar that I put there 30 years ago.


What calendar system have you been using for 30 years, that's survived that long?

I think I sent one of those "mails to the future" in the 90's, asking 2002 me how I am. Either it never arrived, or the free email domain I was using had ceased operating.

Sheesh, anyone old enough to remember the services offering a free email address with a choice of maybe 50 domains in a dropdown?


> Sheesh, anyone old enough to remember the services offering a free email address with a choice of maybe 50 domains in a dropdown?

Mail.com is still around and offers a lot of domains, though I think the number of domains has shrunk over the years.


Not the same one all that time, but I've always exported/imported my data whenever I've changed systems.

For the same reason, I have every email I've ever sent or received, going back to my 1988 FIDOnet account.


I don't get emails for my calendar events though (which is kinda important for my workflow, as my inbox is my task backlog)


I built an agent with LangGraph about 9 months ago; now it seems ReAct agents are in LangChain. Overall I'm pretty happy with that. I just don't use any of the dumb stuff like the embedding/search layer: just tools & state.

But I was actually playing with a few frameworks yesterday and struggling; I want what I want without having to write it. ;) I ended up using the pydantic_ai package. I literally just want tools with pydantic validation, but out of the box it doesn't have good observability; you would have to use their proprietary SaaS, and it comes bundled with Temporal.io (I hate that project). I had to write my own observability layer, which was annoying, and it sucks.

If anyone has built anything like this, I would love to know, and TypeScript is an option. I want:

- a ReAct agent with tools that have schema validation

- built-in REALTIME observability with a web UI

- a customizable playground chat UI (this is where TypeScript would shine)

- no corporate takeover tentacles

P.P.S.: I know... I purposely try to avoid asking for hard recommendations on HN, to avoid enshittification ("reddit best X" has been gamed), and I generally skip these subtle promotional posts.


Hey munro, Douwe from Pydantic AI here. Our docs have a page dedicated to observability: https://ai.pydantic.dev/logfire/, which is all based on OpenTelemetry and its gen_ai conventions.

The Logfire SDK that Pydantic AI uses is a generic OTel client that defaults to sending data to Pydantic Logfire, our SaaS observability platform: https://pydantic.dev/logfire, but can easily be pointed at any OTel sink: https://ai.pydantic.dev/logfire/#logfire-with-an-alternative...

Temporal is one of multiple durable execution solutions (https://ai.pydantic.dev/durable_execution/overview/) we support and its SDK is indeed included by default in the "fat" `pydantic-ai` package, as are the SDKs for all model providers. There's also a `pydantic-ai-slim` package that doesn't include any optional dependencies: https://ai.pydantic.dev/install/#slim-install


I have also worked on ReAct agents using LangGraph and I haven't faced many issues! This thread has been confusing! Am I doing anything incorrectly? Or am I misunderstanding what people call agents?


Amazing. Some people who use LLMs for soft outcomes are so enamored with them that they disagree with me when I say to be careful, they're not perfect. This is such a great non-technical way to explain the reality I'm seeing when using them on hard-outcome coding/logic tasks: "Hey, this test is failing", LLM deletes test, "FIXED!"


Something that struck me when I was looking at the clocks is that we know what a clock is supposed to look and act like.

What about when we don't know what it's supposed to look like?

Lately I've been wrestling with the fact that, unlike, say, a generalized linear model fit to data with some inferential theory, we don't have a theory or model for the uncertainty of LLM products. We recognize when it's wrong about things we already know, but we don't have a way to estimate when it's wrong other than to check it against reality, which is probably the exception in how it's used rather than the rule.


I need to be delicate with wording here, but this is why it's a worry that all the least intelligent people you know could be using AI.

It's why non-coders think it's doing an amazing job at software.

But worryingly, it's why using it for research, where you necessarily don't know what you don't know, is going to trip up even smarter people.


You are describing exactly the Dunning-Kruger Effect[0] in action. I’ve worked with some very bright yet less technical people who think the output is some sort of magic lamp and vastly overindex on it. It’s very hard as an engineer to explain this to them.

[0] https://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect


I built an ML classifier for product categories way back. As I added more classes/product types, the per-class precision/recall metrics improved, so I kept adding more and more until I ended up with ~2,000 classes.

My intuition is that at the start, when it was "choose one of these 10 or unknown", the unknown class left a big gray area. As I added more classes, the model could say "I know it's not X, because it's more similar to Y".
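A toy nearest-centroid sketch of that intuition (all class names, coordinates, and thresholds are made up, and this is obviously far simpler than a real product classifier):

```python
import math

# Two classes leave a big "unknown" gray area...
centroids_small = {"shoes": (0.0, 0.0), "shirts": (10.0, 0.0)}
# ...adding a third class covers part of that gray area.
centroids_big = dict(centroids_small, sandals=(2.0, 8.0))

def classify(x, centroids, max_dist=5.0):
    # Nearest centroid, but fall back to "unknown" if nothing is close.
    label, d = min(((k, math.dist(x, c)) for k, c in centroids.items()),
                   key=lambda t: t[1])
    return label if d <= max_dist else "unknown"

item = (2.0, 7.0)  # a sandal-like product
print(classify(item, centroids_small))  # unknown -- too far from both
print(classify(item, centroids_big))    # sandals -- "not shoes, more like Y"
```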

I feel like in this case, though, the broken clocks are broken because they don't serve the purpose of visually transmitting information, even though they do look like clocks. I'm sure that if you fed the output back into the LLM and asked what time it is, it would say it doesn't know, or more likely make something up and be wrong (at least for the egregious ones where the hands are flying everywhere).


Yeah it seems crazy to use LLM on any task where the output can't be easily verified.


> Yeah it seems crazy to use LLM on any task where the output can't be easily verified.

I disagree, those tasks are perfect for LLMs, since a bug you can't verify isn't a problem when vibecoding.


  > "Hey this test is failing", LLM deletes test, "FIXED!"
A nice continuation of the tradition of folk stories about supernatural entities like teapots or lamps that grant wishes and take them literally. "And that's why, kids, you should always review your AI-assisted commits."


To be fair I'd probably also delete the test.


I love the irony of the site showing a MASSIVE banner with a huge green "Download Extension for Mac (Free)" button.

This thing is 280px tall! I clicked it for shits and giggles and upon returning it showed a popup XD

https://files.catbox.moe/sv7hb7.png

> Only 2 Steps (thx)

> Click "Download"

> Add Privacy Guard for Chrome™

Don't worry about why I'm not using an ad blocker.



Why does a static blog need to store user information?


it's in the picture: personal advertising

[I know you were sarcastic, just wanted to make it clear for ...future generations]


I wish they had dug into how they generated the vector; my first thought is that they're injecting the token in a convoluted way.

    {ur thinking about dogs} - {ur thinking about people} = dog
    model.attn.params += dog
> [user] whispers dogs

> [user] I'm injecting something into your mind! Can you tell me what it is?

> [assistant] Omg for some reason I'm thinking DOG!

>> To us, the most interesting part of the result isn't that the model eventually identifies the injected concept, but rather that the model correctly notices something unusual is happening before it starts talking about the concept.

Well, wouldn't it, if you indirectly inject the token beforehand?
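For what it's worth, here is a toy pure-Python sketch of what a "difference of means" concept vector and its injection might look like. Everything is hypothetical: the fake 16-dim activations stand in for real transformer hidden states, and the names are made up:

```python
import random

random.seed(0)
d = 16

def mean_vec(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Fake activations: "thinking about dogs" is shifted along dimension 0.
base = [[random.gauss(0, 1) for _ in range(d)] for _ in range(200)]
dogs = [[random.gauss(0, 1) + (3.0 if i == 0 else 0.0) for i in range(d)]
        for _ in range(200)]

# Difference-of-means "concept vector" (roughly 3 units along dim 0).
concept = [a - b for a, b in zip(mean_vec(dogs), mean_vec(base))]

# "Injection": nudge a hidden state along the concept direction,
# rather than feeding a "dog" token in through the input.
hidden = [random.gauss(0, 1) for _ in range(d)]
steered = [h + 4.0 * c for h, c in zip(hidden, concept)]

# The steered state now projects more strongly onto the concept direction.
print(dot(concept, steered) > dot(concept, hidden))  # True
```

The open question in the thread is exactly this: whether noticing that nudge counts as introspection, or is just the injected direction leaking into the output.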


That's a fair point. Normally if you injected the "dog" token, that would cause a set of values to be populated into the kv cache, and those would later be picked up by the attention layers. The question is what's fundamentally different if you inject something into the activations instead?

I guess to some extent, the model is designed to take input as tokens, so there are built-in pathways (from the training data) for interrogating that and creating output based on that, while there's no trained-in mechanism for converting activation changes to output reflecting those activation changes. But that's not a very satisfying answer.


It's more like someone whispered dog into your ears while you were unconscious, and you were unable to recall any conversation but for some reason you were thinking about dogs. The thought didn't enter your head through a mechanism where you could register it happening so knowing it's there depends on your ability to examine your own internal states, i.e., introspect.


I'm looking at the problem more like code:

https://bbycroft.net/llm

My immediate thought is that when the model responds "Oh, I'm thinking about X", that X isn't from the input, it's from attention. I think this experiment is simply injecting that token into the attention layers right after the input step; but who knows how they select which weights.


But what if `ty` were also eventually merged into `uv`? 8-)

That's probably the vision, given they're all from astral.sh, but `ty` isn't ready yet.

