sfink's comments | Hacker News

Perf is not normally distributed. Or rather, it is very common for it not to be. Where I work, we often see multimodal (usually bimodal) distributions, and we'll get performance alerts that, when you look at them, are purely the result of more samples happening to shift from one mode to another.

It's easy to construct ways that this could happen. Maybe you're running a benchmark that does a garbage collection or three. It's easy for a GC to be a little earlier or a little later, and sneak in or out of the timed portion of the test.

Warm starts vs cold starts can also do this. If you don't tear everything down and flush before beginning a test, you might have some amount of stuff cached.

The central limit theorem says you can still make it approximately normal by running enough times and adding the results up (or running each iteration long enough), but (1) that takes much longer and (2) smushed-together data is often less actionable. You kind of want to know about fast paths and slow paths, and that you're falling off the fast path more often than intended.
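Here's a toy sketch of that failure mode (invented numbers, assuming two Gaussian modes): neither code path changes at all, but the mean shifts by several milliseconds just because more runs land on the slow mode.

```python
import random
import statistics

def sample_runtimes(n, p_slow, rng):
    # Two modes: a ~100ms fast path and a ~150ms slow path (e.g. an extra GC).
    return [rng.gauss(150, 5) if rng.random() < p_slow else rng.gauss(100, 5)
            for _ in range(n)]

rng = random.Random(0)
before = sample_runtimes(1000, 0.20, rng)  # 20% of runs fall off the fast path
after = sample_runtimes(1000, 0.35, rng)   # 35% do; neither mode itself moved
# The mean shifts even though both code paths are exactly as fast as before.
print(statistics.mean(before), statistics.mean(after))
```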

As usual you can probably cover your eyes, stick your fingers in your ears, and proceed as if everything were Gaussian. It'll probably work well enough!


> We ran benchmarks comparing bisect vs bayesect across flakiness levels. At 90/10, bisect drops to ~44% accuracy while bayesect holds at ~96%. At 70/30 it's 9% vs 67%.

I don't understand what you're comparing. Can't you increase bayesect accuracy arbitrarily by running it longer? When are you choosing to terminate? Perhaps I don't understand this after all.


Yes, bayesect accuracy increases with more iterations. The comparison was at a fixed budget (300 test runs). Sorry, I should have clarified that.

Yep, you can run bayesect to an arbitrary confidence level.

This script in the repo https://github.com/hauntsaninja/git_bayesect/blob/main/scrip... will show you a) that the confidence level is calibrated, and b) how quickly you reach that confidence level (on average, at p50 and p95)

For the failure rates you describe, calibration.py shows that you should see much higher accuracy at 300 tests
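For intuition, the core update is simple to sketch. This is an illustrative toy, not the repo's actual code: it assumes clean commits never fail and the culprit fails with probability p_fail, keeps a posterior over the first-bad-commit index, and always tests the split point whose posterior mass is nearest 50/50.

```python
import random

def toy_bayesect(n, bad, p_fail=0.7, budget=300, seed=0):
    rng = random.Random(seed)
    post = [1.0 / n] * n          # posterior over the first-bad-commit index
    for _ in range(budget):
        # Test the commit whose "contains the bug" probability is nearest 1/2.
        best, best_d, cum = 0, 1.0, 0.0
        for i, p in enumerate(post):
            cum += p
            if abs(cum - 0.5) < best_d:
                best, best_d = i, abs(cum - 0.5)
        i = best
        fails = i >= bad and rng.random() < p_fail
        # Bayes update: hypothesis j says commit i is bad iff j <= i.
        like = [(p_fail if j <= i else 0.0) if fails
                else ((1 - p_fail) if j <= i else 1.0) for j in range(n)]
        z = sum(p * l for p, l in zip(post, like))
        post = [p * l / z for p, l in zip(post, like)]
    return max(range(n), key=post.__getitem__)
```

At a 70/30 flake rate with a 300-test budget, this toy converges on the planted culprit essentially every time, consistent with the calibration numbers above.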


You're right, at 300 tests bayesect converges to ~97-100% across the board. I reran with calibration.py and confirmed.

Went a step further and tested graph-weighted priors (per-commit weight proportional to transitive dependents, Pareto-distributed). The prior helps in the budget-constrained regime:

128 commits, 500 trials:

  Budget=50,  70/30: uniform 22% → graph 33%
  Budget=50,  80/20: uniform 71% → graph 77%
  Budget=100, 70/30: uniform 56% → graph 65%

At 300 tests the gap disappears since there's enough data to converge anyway. The prior is worth a few bits, which matters when bits are scarce.

Script: https://gist.github.com/rs545837/b3266ecf22e12726f0d55c56466...
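A hedged sketch of the prior construction described above (names are mine and illustrative; the gist's actual code may differ): weight each commit by its transitive-dependent count and normalize, with add-one smoothing so no commit gets zero prior mass.

```python
def dependents_prior(dependents):
    """dependents[i]: transitive dependents of the code touched by commit i."""
    weights = [d + 1 for d in dependents]   # +1 so no commit has zero prior
    total = sum(weights)
    return [w / total for w in weights]
```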


This doesn't sound quite right, but I'm not sure why.

Perhaps: a reasonable objective would be to say that for N bits of information, I would like to pick the test schedule that requires the least total elapsed time. If you have two candidate commits and a slow recompile time, it seems like your algorithm would do many repeats of commit A until the gain in information per run drops below the expected gain from B divided by the recompile time, then it would do many repeats of B, then go back to A, etc. So there are long runs, but you're still switching back and forth. You would get the same number of bits by doing the same number of test runs for each commit, but batching all of the A runs before all of the B runs.

Then again: you wouldn't know how many times to run each in advance, and "run A an infinite number of times, then run B an infinite number of times" is clearly not a winning strategy. Even with a fixed N, I don't think you could figure it out without knowing the results of the runs in advance. So perhaps your algorithm is optimal?

It still feels off. You're normalizing everything to bits/sec and choosing the maximum. But comparing an initial test run divided by the rebuild time vs a subsequent test run divided by a much faster time seems like you're pretending a discrete thing is continuous.
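To make that concrete, here's a toy of the bits-per-second rule under discussion (my own sketch, assuming Bernoulli tests, clean commits that never fail, and a fixed rebuild cost paid only when switching commits):

```python
import math

def h2(p):
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bits_per_second(post, i, p_fail, test_s, rebuild_s, current):
    # P(commit i contains the bug) under the posterior over first-bad index.
    s = sum(post[:i + 1])
    # Mutual information between the test outcome and "is commit i bad?":
    # the test fails w.p. s * p_fail overall, and w.p. p_fail given bad.
    gain = h2(s * p_fail) - s * h2(p_fail)
    cost = test_s + (0.0 if i == current else rebuild_s)
    return gain / cost
```

With a large rebuild_s, the argmax of this quantity tends to stay on the current commit until its marginal gain decays, then jump, roughly the batched back-and-forth schedule described above.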

I wish I could math good.


The general requirement for this approach to be optimal is called "dynamical consistency". A good description is in [1]. It is the situation where, suppose you have a budget B and you search until that budget is exhausted. Then you are informed that there is an additional budget B2, and you can continue searching until that is exhausted. A situation is dynamically consistent if, for any B and B2, the optimal strategy makes the same choices whether or not you know in advance that you will get B2.

So you are correct that discreteness is a problem, because if you are nearing the end of the budget you may optimally prefer to get more dice rolls than take bigger bets. But the optimal solution is then often analytically intractable (or at least it was - I last read about this a while back), and the entropy approach is often reasonable anyway. (For cases where search effort is significant, a good search plan can be found by simulation).

[1] https://bayes.wustl.edu/etj/articles/search.pdf


Note that "pick the commit with best expected information gain" in git_bayesect isn't optimal even in the no overhead regime. I provide a counterexample in the writeup, which implies ajb's heuristic is also not optimal. I don't see a tractable way to compute the optimal policy.

One idea: if you always spend time testing equal to your constant overhead, I think you're guaranteed to be no more than 2x off optimal.

(and agreed with ajb on "just use ccache" in practice!)


> My idea was to make X-to-safe-lang translators, X initially being Python and Javascript.

Both of those languages are already safe. Then you talk about translating to C, so you're actually doing a safe-to-unsafe translation. I'm not sure what properties you're checking with the static analysis at that point. I think what would be more important is that your translator maintains safety.


No. It means we get an extra dose of stearates and inaccurate science.

The gloves are not plastic.


Stone tires.


I've read much of the HN discussion on the previous post and skimmed the rest, but I didn't see a couple of things addressed:

First, how could you make this deal with copies and renames? It seems to me like the pure version of this would require a weave of your whole repository.

Second, how different is this from something like jujutsu? As in, of course it's different, your primary data structure is a weave. But jj keeps all of the old commits around for any change (and prevents git from garbage collecting them by maintaining refs from the op log). So in theory, you could replay the entire history of a file at a particular commit by tracing back through the evolog. That, plus the exact diff algorithm, seems like enough to recreate the weave at any point in time (well, at any commit), and so you could think of this whole thing as a caching layer on top of what jj already provides. I'm not saying you would want to implement it like that, but conceptually I don't see the difference and so there might be useful "in between" implementations to consider.

In fact, you could even specify different diff algorithms for different commits if you really wanted to. Which would be a bit of a mess, because you'd have to store that and a weave would be a function of those diff algorithms and when they were used, but it would at least be possible. (Cohen's system could do this too, at the cost of tracking lots of stuff that currently it doesn't need or want to track.) I'm skeptical that this would be useful except in a very limited sense (eg you could switch diff algorithms and have all new commits use the new one, without needing to rebuild your entire repository). It breaks distributed scenarios -- everyone has to agree on which diff to use for each commit. It's just something that falls out of having the complete history available.

I'm cheating with jj a bit here, since normally you won't be pushing the evolog to remotes so in practice you probably don't have the complete history. In practice, when pushing to a remote you might want to materialize a weave or a weave-like "compiled history" and push that too/instead, just like in Cohen's model, if you really wanted to do this. And that would come with limitations on the diff used for history-less usage, since the weave has to assume a specific deterministic diff.


You can generically represent copies and renames (aka "moves") in a CRDT, OT or CTM using a Portal: https://braid.org/meeting-62/portals

...which is why I recently went to about:keyboard and removed that hotkey. I love that page.

That, and Ctrl-N. No more forest of blank browser windows when using a terminal emulator in a web page!

(Firefox only)


Ctrl+W is undoable.

Ctrl+Shift+T will undo your recent tab closures in reverse order. The tabs maintain their history as well.

I am very surprised at how many people in here don’t seem to know that. I learned about Ctrl+Shift+T before I learned about Ctrl+W. I was using the middle mouse button on a tab to close tabs before then.


I know. I used to use it fairly often when Ctrl-W still did something. It helps, but (1) it doesn't work if you closed the last tab and thus the whole window, you'd need to restore recently closed windows instead; and (2) it is still more disruptive and potentially state-losing than preventing an unwanted close in the first place. Tab history retention isn't perfect.

We still try to limit any additional exposure as much as possible, and WR/FG (WeakRef / FinalizationRegistry) are specced to keep the visibility as coarse as possible. (Collections won't be visible until the current script execution finishes, though async adds a lot more places where that can happen.)

A proposal to add new ways of observing garbage collection will still be shot down immediately without a damn good justification.

