We did not want to make the post engineering-focused, but we have 18 companies in production today (we wrote about PostHog on the blog). At some point we should post some case studies. The metric we track for usefulness is our monthly revenue :)
Mendral is replacing a human Platform Engineer. It debugs the CI logs, looks at the associated commit, looks at the implementation of the tests, etc. It then proposes fixes and takes care of opening a PR.
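To make that flow concrete, here is a toy sketch of the investigation steps described above (all names and data are made up for illustration; this is not Mendral's code):

```python
# Toy sketch of the investigation flow: debug logs -> gather commit and
# test context -> propose a fix. Everything here is illustrative.

def investigate(build):
    # 1. Debug the CI logs: pull out the error lines
    errors = [line for line in build["logs"] if line.startswith("ERROR")]
    # 2. Look at the associated commit and the failing test's source
    context = {
        "commit": build["commit"],
        "test_source": build["tests"].get(build["failing_test"], ""),
        "errors": errors,
    }
    # 3. Propose a fix; opening the PR is a separate, permissioned step
    return {
        "root_cause": errors[0] if errors else "unknown",
        "proposed_fix": f"patch {build['failing_test']} (see {context['commit']})",
    }

build = {
    "logs": ["INFO start", "ERROR test_checkout timed out after 30s"],
    "commit": "abc123",
    "failing_test": "test_checkout",
    "tests": {"test_checkout": "def test_checkout(): ..."},
}
print(investigate(build)["root_cause"])
# ERROR test_checkout timed out after 30s
```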
There is a cost associated with each investigation the Mendral agent does, and we spend time tuning the orchestration between agents. Yes, it's expensive, but we're making money on top of what it costs us. So far we've been able to take the cost down while increasing the relevance of each root cause analysis.
We're writing another post about that specifically; we'll publish it sometime next week.
I agree. In the Mendral agent we automated what is time-consuming for a human (like debugging a flaky test), but it still needs permission to confirm the remediation and open a PR.
But it's night and day to fix your CI when someone (in this case an agent) has already dug into the logs and the test code and proposed options to fix it. We have several customers asking us to automate the rest (all the way to merging code), but we haven't done it for the reasons you mention. Although I am sure we'll get there sometime this year.
Shameless plug here for Lexega—a deterministic policy enforcement layer for SQL in CI/CD :) https://lexega.com
There are bridges here that the industry has yet to figure out. There is absolutely a place for LLMs in these workflows, and what you've done here with the Mendral agent is very disciplined, which is, I'd venture to say, uncommon. Leadership wants results, which presses teams to ship things that maybe shouldn't be shipped quite yet. IMO the industry is moving faster than teams can keep up with the implications.
LLMs are now better at pulling context (as opposed to feeding everything you can into the prompt). So you can expose enough query primitives to the LLM that it's able to filter out the noise.
I don't think implementing filtering at log ingestion is the right approach, because you don't know what is noise at that stage. We spent more time thinking about the schema and indexes to make sure complex queries perform at scale.
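A minimal sketch of that idea, assuming a SQL log store (the schema and index here are my own illustration, not Mendral's actual design): store every line unfiltered at ingestion, and lean on an index that matches the agent's access pattern so filtering happens at query time instead.

```python
# Sketch: keep all log lines at ingestion; let the agent's query
# primitives filter noise later. Schema/index names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ci_logs (
    build_id TEXT NOT NULL,
    step     TEXT NOT NULL,
    ts       INTEGER NOT NULL,  -- epoch millis
    level    TEXT NOT NULL,     -- INFO / WARN / ERROR
    line     TEXT NOT NULL
);
-- Composite index matching the common access pattern:
-- "errors for one build, in time order".
CREATE INDEX idx_logs_build_level_ts ON ci_logs (build_id, level, ts);
""")
conn.executemany(
    "INSERT INTO ci_logs VALUES (?, ?, ?, ?, ?)",
    [
        ("b1", "test", 1, "INFO", "collecting tests"),
        ("b1", "test", 2, "ERROR", "test_login flaked: TimeoutError"),
        ("b1", "test", 3, "INFO", "retrying"),
    ],
)
# Nothing was dropped at ingestion; the noise is filtered at query time.
rows = conn.execute(
    "SELECT line FROM ci_logs WHERE build_id = ? AND level = 'ERROR' ORDER BY ts",
    ("b1",),
).fetchall()
print(rows)  # [('test_login flaked: TimeoutError',)]
```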
From my own experience it's true, and I think it's due to the amount of SQL content (docs, best practices, code) that you can find online, which is now in every LLM's training corpus.
Same applies when picking a programming language nowadays.
Yeah, that sounds very similar to what we went through while building this agent.
We're focused on CI logs for now because we wanted something that works really well for things like flaky tests, but we're planning to expand the context to infrastructure logs very soon.
We started writing very recently: https://www.mendral.com/blog - there is another post we published yesterday about the overall architecture. And we have a long list of things we're planning to write about in more detail.
What we learned while building this is that every token in the context matters. We spent a lot of time watching logs of agent sessions, tuning the tool params, the errors returned by tools, the agent prompts, etc.
We noticed, for example, the importance of letting the model pull from the context instead of pushing lots of data into the prompt. We have "complex" error reporting because we have to differentiate between real non-retryable errors and errors that teach the model to retry differently. It changes the model's behavior completely.
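A minimal sketch of that error-reporting idea, assuming JSON-ish tool results (the error kinds and hints are my invention, not Mendral's implementation): errors the model can recover from carry a hint about how to retry differently, while genuinely non-retryable errors tell it to stop.

```python
# Sketch: tool errors returned to the model distinguish "retry with
# different parameters" from "permanent failure". All names illustrative.
def tool_error(kind: str, message: str) -> dict:
    retryable_hints = {
        "query_timeout": "Narrow the time range or add a filter, then retry.",
        "too_many_rows": "Add a LIMIT or a stricter WHERE clause, then retry.",
    }
    if kind in retryable_hints:
        # Teach the model HOW to retry differently
        return {"error": message, "retry": True, "hint": retryable_hints[kind]}
    # Real non-retryable error: tell the model to stop and report it
    return {"error": message, "retry": False,
            "hint": "Do not retry; report this failure in your analysis."}

print(tool_error("too_many_rows", "query returned 2M rows"))
```

The point is that the error payload itself steers the next model step, rather than forcing the model to guess from a raw stack trace.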
Also, I agree about the "significant weight of human input and judgement": we spent a lot of time optimizing the indexes and thinking about how to organize data so queries perform at scale. Claude wasn't very helpful there.
Very interesting work here, no doubt. It's a measured approach to using an LLM with SQL rather than trying to make it responsible for everything end-to-end.