This is one of our biggest challenges, you're spot on! What we're building to address it includes a memory layer the agents have access to - state changes become part of their knowledge and are accounted for while conducting a test.
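If it helps to picture it, here's a toy Python sketch of that idea - every name here (AgentMemory, record_state_change, relevant) is made up for illustration, not our actual implementation:

```python
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    """Toy memory layer: notes on state changes the agent has observed."""
    observations: list[str] = field(default_factory=list)

    def record_state_change(self, note: str) -> None:
        # e.g. "checkout flow now shows a 2FA step before payment"
        self.observations.append(note)

    def relevant(self, objective: str) -> list[str]:
        # Naive keyword recall for the sketch; a real system would
        # likely use embeddings or something smarter.
        words = objective.lower().split()
        return [o for o in self.observations
                if any(w in o.lower() for w in words)]


memory = AgentMemory()
memory.record_state_change("login page moved the SSO button below the form")
print(memory.relevant("test the login flow"))
```

The point is just that observed changes persist across runs, so the next test starts from what the agent already knows rather than from scratch.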
They're also smart enough not to be frazzled when things have changed: they still have their objectives and will work to figure out whether the functionality is there or not. The beauty of non-determinism!
You could definitely do that and get some good results! But if you want a repeatable process with detailed bug reports (including logs, reasoning, etc.), a large enough search area, and agents that continuously build an understanding of your app - that's us :)
The agents can definitely detect when something is off, given they're using VLMs. They don't necessarily compare it to previous versions; rather, they form opinionated takes on whether something looks broken or off. So - yes!
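Roughly the shape of it, as a toy Python sketch - vlm_complete is a stand-in stub we made up, not any real model API:

```python
def vlm_complete(image_bytes: bytes, prompt: str) -> str:
    # Stand-in for a real vision-language-model call; stubbed here
    # so the sketch runs without an API key.
    return "BROKEN: the submit button overlaps the footer"


def looks_broken(screenshot: bytes) -> tuple[bool, str]:
    # Note there's no golden screenshot to diff against - the model
    # gives an opinionated take on the current render alone.
    answer = vlm_complete(
        screenshot,
        "Does this page look broken, misrendered, or visually off? "
        "Answer 'BROKEN: <reason>' or 'OK'.",
    )
    return answer.startswith("BROKEN"), answer


is_broken, why = looks_broken(b"<png bytes>")
print(is_broken, why)
```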
Hi! Looking at your swarm results, you might not have given the swarm login credentials to use, which is likely why most of the runs are failing. Please feel free to try again after giving them access.
Great question! The swarm takes a first pass to generate tests and can continuously add more as it runs over time.
On the off chance it misses specific tests, we have tools to let you build them directly with AI support, either by giving the agents objectives or by dropping in a video of the actions you're taking! A rough sketch of the objective route follows.
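For the objective route, the input is about this simple - a hypothetical Python sketch, with field names we made up for illustration rather than our actual schema:

```python
from dataclasses import dataclass


@dataclass
class TestSpec:
    # Illustrative fields only: an objective in plain language, a
    # starting URL, and optional credentials for gated flows.
    objective: str
    start_url: str
    credentials: dict[str, str] | None = None


spec = TestSpec(
    objective="Add an item to the cart and verify the total updates",
    start_url="https://staging.example.com",
    credentials={"user": "qa@example.com", "password": "example-password"},
)
print(spec.objective)
```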