We're building something similar and found that no matter how good the agent loop is, you still need "canonical metrics" that are human-curated. Otherwise non-technical users (marketing, product managers) are playing a guessing game with high-stakes decisions, and they can't verify the SQL themselves.
Our approach:
1. We control the data pipeline and work with a discrete set of data sources where schemas are consistent across customers
2. We benchmark extensively so the agent uses a verified metric when one exists, falls back to raw SQL when it doesn't, and captures those gaps as "opportunities" for human review
Over time, most queries hit canonical metrics. The agent becomes less of a SQL generator and more of a smart router from user intent -> verified metric.
The "Moving fast without breaking trust" section resonates; their eval system with golden SQL is essentially the same insight: you need ground truth to catch drift.
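The routing idea above can be sketched in a few lines. This is a minimal, hypothetical illustration, not our real system: `CANONICAL_METRICS`, `generate_sql_with_llm`, and `review_queue` are stand-in names for the verified-metric registry, the raw-SQL fallback, and the gap-capture path.

```python
# Hypothetical sketch: route user intent to a verified metric when one
# exists, fall back to generated SQL when it doesn't, and capture the gap.

CANONICAL_METRICS = {
    # verified, human-curated metric -> vetted SQL (illustrative)
    "monthly active users": "SELECT COUNT(DISTINCT user_id) FROM events ...",
}

review_queue: list[str] = []  # gaps captured as "opportunities" for human review


def generate_sql_with_llm(intent: str) -> str:
    # Stand-in for the raw-SQL fallback path (an LLM call in practice).
    return f"-- generated SQL for: {intent}"


def route(intent: str) -> tuple[str, str]:
    """Return (source, sql): a canonical metric if one matches, else a fallback."""
    key = intent.lower().strip()
    if key in CANONICAL_METRICS:
        return "canonical", CANONICAL_METRICS[key]
    review_queue.append(intent)  # record the gap for later curation
    return "generated", generate_sql_with_llm(intent)
```

Over time, as reviewed gaps get promoted into `CANONICAL_METRICS`, more traffic takes the "canonical" branch and the agent does less open-ended SQL generation.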
Yes, I’ve been working on this and you need a clear semantic layer.
If there are multiple paths (or perceived paths) to an answer, you'll get two answers. Plus, LLMs like to create pointless "xyz_index" metrics that are not standard, clear, or useful. Yet I see users just go "that sounds right" and run with it.
In my experience, archived objects are almost never accessed, and if they are, it's within a few hours or days of deletion, which leaves a fairly small chance that schema changes will have a significant impact on restoring any archived object. If you pair that with "best-effort" tooling that restores objects by calling standard "create" APIs, perhaps it's fairly safe to _not_ deal with schema changes.
Of course, as always, it depends on the system and how the archive is used. That's just my experience. I can imagine that if there are more tools or features built around the archive, the situation might be different.
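The "best-effort restore through standard create APIs" idea could look something like this sketch. Everything here is hypothetical (`CURRENT_FIELDS`, `create_object`, the field names): the point is just that dropping fields the current schema no longer understands, and surfacing them, can be safer than maintaining migrations for archived data.

```python
# Hypothetical sketch: restore an archived object by replaying the standard
# "create" API, tolerating fields the current schema no longer accepts.

CURRENT_FIELDS = {"id", "name", "owner"}  # what today's schema accepts


def create_object(payload: dict) -> dict:
    # Stand-in for the system's normal public "create" API.
    return {"status": "created", **payload}


def restore(archived: dict) -> dict:
    """Best-effort restore: drop unknown fields instead of failing outright."""
    known = {k: v for k, v in archived.items() if k in CURRENT_FIELDS}
    dropped = sorted(set(archived) - set(known))
    result = create_object(known)
    result["dropped_fields"] = dropped  # surfaced so a human can review losses
    return result
```

Because restores are rare and usually happen soon after deletion, the `dropped_fields` list will almost always be empty in practice; when it isn't, that is the signal a human should look.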
I think maintaining schema changes and migrations on archived objects can be tricky in its own ways, even when they're kept in the live tables with an 'archived_at' column, especially when objects span multiple tables with relationships. I've worked on migrations where really old archived objects just didn't make sense anymore in the new data model, and figuring out a safe migration became a difficult, error-prone project.
My partner and I have been playing this almost every morning. We're really enjoying it!
Some feedback:
1) It would be great if the incomplete clues could move to the top. This would avoid having to scroll down toward the end of the puzzle.
2) Better collision behavior: it would be nice if we could drag a chunk of words and have it just "move the other words" out of the way. Sometimes we have to spend time making a path to move chunks of words around.
1) This is an interesting idea! I’ll play with that when I have time.
2) I am experimenting with this but have gotten mixed feedback from players. Some people don’t like it. I’m curious what you think! If I don’t do this I’ll explore other options: https://sunny.garden/@paulhebert/115698266272946749
Nah, that's too smart a behavior. What exists now may have some edge cases, but it is otherwise straightforward and intuitive. The only real "hassle" is swapping two large assembled pieces near the end of a round, and even that's not a big deal, really.
I’m thinking of adding a “shuffle” button to rearrange the tiles if you get really stuck. It’s theoretically possible to get into an unwinnable state where you can’t swap two tiles.
I was coming to the comments to ask about this, as I noticed other (finance) companies [1] were providing this for free and I wanted to know what the game was about.
I've been checking about twice a week for the last 6 months, and they are very rare, but it does happen. I caught one on video 2 weeks ago! https://youtu.be/NkNx6tx3nu0?t=744
Despite some pushback, I worked on something similar (internally) for a previous company.
Having React/JSX for email templating (even if it was on top of MJML) is a great win for productivity.
All our front-end devs knew React; a couple knew Jinja, Pug, or mustache. And every time a team needed to add a new email template, its frontend devs needed to learn those again.
Instead, they could just write email templates the way they write their regular components every day.
Uhh, I am saving this one! Thanks! I remember the last time I had to set up a few mail templates it was so incredibly painful. Especially the testing is so painfully slow.
Wrote about the tradeoffs here: https://www.graphed.com/blog/update-2