Hacker News | tybug's comments

To put it on the record: my position is that current models can't get us there, and neither can the next iteration of models, but two model iterations from now this will be worth doing. There are a lot of fiddly details in Hypothesis that are critical to get right. You can get a plausible 80% port with agents today, but find they've structured it in a way that makes it impossible to get to 100%.

Yep, `#[derive(DefaultGenerator)]` and `generators::default<T>()` are the right tools here.

This is one of the areas we've dogfooded the least, so we'd definitely be happy to get feedback on any sharp corners here!

I think `from_type` is one of Hypothesis's most powerful and ergonomic strategies, and that while we probably can't get quite to that level in Rust, we can still get something that's pretty great.


What do you think we're currently missing that Python's `from_type` has? I actually think the auto-deriving stuff we currently have in Rust is as good or better than from_type (e.g. it gets you the builder methods, has support for enums), but I've never been a heavy from_type user.

`from_type` just supports a bunch more things than Rust ever can, due to the flexibility of Python's type system. `from_type(object)` is amazing, for example, and not something we can write in Rust.
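To illustrate the kind of runtime flexibility being pointed at here, below is a hypothetical stdlib-only sketch (not Hypothesis's actual implementation) of type-driven generation: a function that resolves a generator from a runtime type object, including `object` itself, something a statically typed language can't express directly.

```python
import random
import typing

def generate(tp, rng=random.Random(0)):
    """Hypothetical sketch: resolve a random-value generator from a type."""
    origin = typing.get_origin(tp)
    if origin is list:
        (inner,) = typing.get_args(tp)
        return [generate(inner, rng) for _ in range(rng.randrange(3))]
    if tp is int:
        return rng.randrange(-100, 100)
    if tp is str:
        return "".join(rng.choice("abc") for _ in range(rng.randrange(5)))
    if tp is object:
        # `object` can resolve to any supported type at runtime, which is
        # the flexibility that has no direct Rust analogue.
        return generate(rng.choice([int, str, typing.List[int]]), rng)
    raise NotImplementedError(tp)

print(generate(typing.List[int]))
print(type(generate(object)))
```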

Yeah, that's true. I was going to say that it's maybe not fair to count things that just don't even make sense in Rust, but I guess the logical analogue is something like `Box<dyn MyTrait>` which it would make sense to have a default generator for but also we're totally not going to support that.

Thank you! I have some particularly annoying proptest-based tests that I'll try porting over to Hegel soon. (Thanks for writing the Claude skill to do this.)

Please let us know how it goes!

As Liam says, the derive generator is not very well dogfooded at present. The Claude skill is a bit better, but we've only been through a few iterations of using it and getting Claude to improve it, and porting from proptest is one of the less well tested areas (because we don't use proptest much ourselves).

I expect all of this works, but I'd like to know ways that it works less well than it could. Or, you know, to bask in the glow of praise of it working perfectly if that turns out to be an option.


I actually think there's another angle here where PBT helps, which wasn't explored in the blog post.

That angle is legibility. How do you know your AI-written slop software is doing the right thing? One would normally read all the code. Bad news: that's not much less labor-intensive than not using AI at all.

But if one has comprehensive property-based tests, one can instead read only those tests to become convinced the software is doing the right thing.

By analogy: one doesn't need to see the machine-checked proof to know the claim is correct. One only needs to check the theorem statement is saying the right thing.
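To make the legibility point concrete, here's a hand-rolled sketch of a property-based test with no framework at all. `normalize_whitespace` is a hypothetical function under test; the point is that the two asserted properties are the readable specification.

```python
import random

def normalize_whitespace(s):
    """Hypothetical function under test: collapse runs of whitespace."""
    return " ".join(s.split())

def test_property(trials=200, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        # Generate strings over a small alphabet including whitespace.
        s = "".join(rng.choice("ab \t\n") for _ in range(rng.randrange(20)))
        out = normalize_whitespace(s)
        # Property 1: normalizing is idempotent.
        assert normalize_whitespace(out) == out
        # Property 2: no leading/trailing whitespace, no doubled spaces.
        assert out == out.strip() and "  " not in out

test_property()
```

A reviewer who trusts the generator only has to check that these two properties really are what "doing the right thing" means here.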


Right, I said that property-based tests are easier to read, and that's good. But people still have to actually read them. Also, because they still work best at the "unit" level, the people reading them need to know how all the units are connected in order to understand them (e.g. a single person cannot review even the PBTs required for 10KLOC per day [1]).

My point isn't so much about PBT, but about how we don't yet know just how much agents help write real software (and how to get the most help from them).

[1]: I'm only using that number because Garry Tan, CEO of YC, claimed to generate 10K lines of text per day that he believes to be working code, and developers working with AI agents know they can't be.


As possibly the one community on earth where it's actually better to post the code than the blog post: TL;DR this is a universal property-based testing protocol (https://github.com/hegeldev/hegel-core) and family of libraries (https://github.com/hegeldev/hegel-rust, more to come later).

I've talked with lots of people in the PBT world who have always seen something like this as the end goal of the PBT ecosystem. It seemed like a thing that would happen eventually, someone just had to do it. I'm super excited to actually be doing it and bringing great PBT to every and any language.

It doesn't hurt that this is coming right as great PBT in every language is suddenly a lot more important thanks to AI code!


(Hypothesis maintainer here) If you have recommendations for a better example on the front page, I'd love to hear them! (I mean this entirely genuinely and non-sarcastically; I agree sorting can give misleading ideas, but it is also concise and well understood by every reader).


The more I think about it the more I think calling it a bad example may be unfair. It can be extremely misleading for someone unfamiliar with the concept coming at it with a particular viewpoint, but I’m less sure, with more time to think, that an example that is better for that wouldn’t be worse in other ways.

I like sorting as an example, and I like that using the built-in is concise; reimplementing the behavior of an existing function, where the existing function serves as a reasonable test oracle, isn't all that uncommon.

I feel like something with a couple of properties described in comments with assertions testing those properties (but where the functionality and properties are familiar enough that it would make a clear connection) would be a bit better, in theory, but I don't have a great particular example to use, and anything done that way will be, at best, somewhat less concise.


Appreciate the thoughts <3. I do think there might be stronger examples we could choose. Possibly JSON encode/decode...
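For what it's worth, a stdlib-only sketch of the json encode/decode round-trip property mentioned above might look like this. `gen_value` is a deliberately tiny hypothetical generator; a real PBT library would generate much richer values and shrink any failures.

```python
import json
import random

def gen_value(rng, depth=0):
    """Hypothetical generator for JSON-representable values."""
    kind = rng.randrange(7 if depth < 2 else 5)  # no containers past depth 2
    if kind == 0:
        return None
    if kind == 1:
        return rng.choice([True, False])
    if kind == 2:
        return rng.randrange(-1000, 1000)
    if kind == 3:
        return rng.random()  # finite floats round-trip exactly
    if kind == 4:
        return "".join(rng.choice("abc") for _ in range(rng.randrange(4)))
    if kind == 5:
        return [gen_value(rng, depth + 1) for _ in range(rng.randrange(3))]
    return {f"k{i}": gen_value(rng, depth + 1) for i in range(rng.randrange(3))}

def check_roundtrip(trials=300, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        value = gen_value(rng)
        # The property: decoding an encoded value gives the value back.
        assert json.loads(json.dumps(value)) == value

check_roundtrip()
```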


(Hypothesis maintainer here)

Yup, a standard test suite just doesn't run for long enough for coverage guidance to be worthwhile by default.

That said, coverage-guided fuzzing can be a really valuable and effective form of testing (see eg https://hypofuzz.com/).


Thank you, Hypothesis is brilliant!


Thanks for the good work!


Nice! "testing your test code" is particularly important when dealing with PBT distributions, especially when your generator gets more complicated.

Tyche [0] is another cool tool for addressing the same problem, visualizing the PBT distribution but not making assertions about it.

[0] https://github.com/tyche-pbt/tyche-extension


The Hypothesis explain phase [1][2] does this!

  fails_on_empty_third_arg(
      a = "",  # or any other generated value
      b = "",  # or any other generated value
      c = "",  
      d = "",  # or any other generated value
  )
[1] https://hypothesis.readthedocs.io/en/latest/reference/api.ht...

[2] https://github.com/HypothesisWorks/hypothesis/pull/3555


That kind of behavior can happen at the threshold of Hypothesis' internal limit on entropy - though if you're not hitting HealthCheck.data_too_large then this seems unlikely.

Let me know if you have a reproducer, I'd be curious to take a look.


(not OP but I would be surprised if the answer wasn't) yes, because you're changing the order in which the random draws are interpreted. But this isn't a problem in practice because you generally aren't changing the generator in the middle of debugging a failure.
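The positional-interpretation point above can be sketched in a few lines: a seeded PRNG stream is consumed in order, so refactoring a generator to draw values in a different order changes what a saved seed reproduces.

```python
import random

def draws_v1(seed):
    rng = random.Random(seed)
    a = rng.randrange(100)  # first draw
    b = rng.randrange(100)  # second draw
    return a, b

def draws_v2(seed):
    # Same two draws, but the generator was refactored to draw b first:
    # the same seed now yields a different (a, b) pair.
    rng = random.Random(seed)
    b = rng.randrange(100)
    a = rng.randrange(100)
    return a, b

print(draws_v1(42), draws_v2(42))
```

Replay with a fixed seed is deterministic per generator version, which is why this only bites if you change the generator while debugging a failure.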


So this means refactoring becomes potentially difficult. While the gem is still a great accomplishment and very useful, I'd have to engineer my way around this issue before using it with things like a Rails model, whose shape could change.

@OP:

I wonder if the README (and possibly runner) should suggest writing a test-case that doesn't rely on PBT when the user wants to preserve a case for future testing.

The issue here is that if you're saving a singular example and it represents a weird corner case, it's totally conceivable that a small change will result in an invisible change to that test case.

Another idea: it'd be great if the test could simply take examples that are failing and add them to a `failing_examples.rb` or some such. I know I'd use a feature like this quite a bit.


I think there are usually three actions a programmer can take when a property-based test fails.

- Create a test case that doesn't depend on PBT as you suggest.

- Fix the production code being tested since its failure is an unexpected bug.

- Fix the PBT itself. This means the programmer had a wrong assumption about the test target.

I think it's difficult for the tool to know which choice is the best on a failure. But if there's any good idea, I'd like to incorporate it. :)


>yes, because you're changing the order in which the random draws are interpreted. But this isn't a problem in practice because you generally aren't changing the generator in the middle of debugging a failure.

Correct. The test inputs are determined by a seed and generators (including the order of generators).

