That caught my attention. Retrying failed tasks isn’t easy. There are all kinds of corner cases that pop up one by one. If you have some nice way to handle the common failure modes ("text me" or "retry every ten minutes" or "retry 5 times, then give up" or "keep retrying, but with exponential backoff") then that’s something I’d love to use.
(Wiring together 40+ preemptible TPUs was a nice crucible for learning about all of these. And much like a crucible, it was as painful as it sounds. Hatchet would’ve been nice.)
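For anyone wondering what the "retry 5 times, then give up" and "exponential backoff" cases look like when combined, here's a rough sketch in Python. This is not Hatchet's API: `retry`, its parameters, and the defaults are all made up for illustration, and the jitter is my own addition rather than anything mentioned above.

```python
import random
import time


def retry(task, max_attempts=5, base_delay=1.0, max_delay=600.0, sleep=time.sleep):
    """Call task() until it succeeds or max_attempts tries are used up.

    Hypothetical helper for illustration only; not part of any library.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                # Out of attempts: give up and surface the last error.
                raise
            # Exponential backoff: up to 1s, 2s, 4s, ... capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Random jitter (an assumption) so many workers don't retry in lockstep.
            sleep(random.uniform(0, delay))
```

Even this toy version skips most of the hard parts: it treats every exception as retryable, it blocks the caller while sleeping, and it loses all state if the process itself gets preempted.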
Thanks for making this!