A guy I once worked with had a really good approach to exceptions: don't.
Basically, if your code is about to throw, do something about it instead of throwing. Roll back the filesystem or transaction so things can retry or the process can exit.
Ideally, do not rely on someone reading a log message.
Instead, do what that person reading the log would need to do.
If that's an escalation to support staff or the CEO, send the email. If the developer reading the log would just press a retry button, press the button for them.
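A rough sketch of "press the button for them" in Python. The names `run_or_escalate`, `notify`, `attempts` and `delay_seconds` are made up for illustration, and `notify` stands in for whatever actually sends the email; this is one way to do it under those assumptions, not how that team did it.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_or_escalate(
    operation: Callable[[], T],
    notify: Callable[[str], None],  # hypothetical: sends the email / pages someone
    attempts: int = 3,
    delay_seconds: float = 0.0,
) -> Optional[T]:
    """Retry `operation`; if every attempt fails, escalate instead of only logging."""
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        try:
            return operation()  # success: nothing else to do
        except Exception as exc:  # deliberately broad in this sketch
            last_error = exc
            time.sleep(delay_seconds)  # pause, then "press retry" again
    # Retries exhausted: do what the log reader would have done and tell a human.
    notify(f"operation failed after {attempts} attempts: {last_error!r}")
    return None
```

In real code you would narrow the `except` to the failures you actually know how to retry.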
Basically, giving up was just not an option, and it worked pretty well tbh. Sure, we had logs and they were crucial for debugging failures, but weekend callouts were not a thing. We never had to babysit processes. We never had to worry about filesystem corruption if someone did a kill -9 on a process mid-run.
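For the kill -9 part, one common trick (an assumption on my side, I don't know what that codebase actually did) is to write to a temp file and rename it over the real one. `atomic_write` is a made-up helper name:

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Write `data` so readers see the old file or the new one, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # Temp file in the same directory, so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, path)  # swap in the new file in one step
    except BaseException:
        # Failed before the swap: remove the temp file, leave the original alone.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the process is killed mid-write, the worst case is a stray temp file next to an intact original.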
Sure, that works for application code. In library code, you raise an exception so that the application code can decide on the domain-specific response, i.e. the thing the person reading the log message would otherwise do. Exceptions are raised by code that can be used in multiple contexts, where the person reading the log message would be doing entirely different things.
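To make that concrete, a toy example. Everything here (`QuotaExceeded`, `store_blob`, the two callers) is hypothetical; the point is only that the same exception gets a different response depending on who called.

```python
class QuotaExceeded(Exception):
    """Raised by the (hypothetical) library; it cannot know what the caller wants."""


def store_blob(used: int, limit: int, size: int) -> int:
    """Library code: report the problem, don't pick a policy."""
    if used + size > limit:
        raise QuotaExceeded(f"need {size} bytes, {limit - used} free")
    return used + size


def batch_job(used: int, limit: int, size: int) -> str:
    """One application: a nightly job that just skips the item."""
    try:
        store_blob(used, limit, size)
        return "stored"
    except QuotaExceeded:
        return "skipped"


def interactive_upload(used: int, limit: int, size: int) -> str:
    """Another application: ask the user to free up space."""
    try:
        store_blob(used, limit, size)
        return "stored"
    except QuotaExceeded as exc:
        return f"please delete something first ({exc})"
```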
I agree with part of this. When possible, code should either succeed or revert to the state before the operation before it throws. Avoid half-completed work if at all possible. This is generally quite easy: do the work with the higher chance of failure first, and only then connect that work to the rest of the system state. E.g. don't add an item to a list and then do something with it; do it the other way round and only add items to the list once the work on them has already been done.
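A minimal sketch of the list example in Python (`parse_price` and `add_prices` are made-up names): all the parsing that can fail happens on a local list, and the shared list is only touched once everything has succeeded.

```python
def parse_price(raw: str) -> float:
    """The step that is likely to fail: float() raises ValueError on bad input."""
    return round(float(raw), 2)


def add_prices(prices: list[float], raw_values: list[str]) -> None:
    """Do all the risky parsing first, then touch shared state in one step."""
    parsed = [parse_price(raw) for raw in raw_values]  # may raise; `prices` untouched
    prices.extend(parsed)  # only reached when every value parsed
```

If any value is bad, the caller gets the `ValueError` and `prices` is exactly as it was before the call.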
If you are in a known failure scenario, you can try known resolutions. If some necessary resource like a DB or file system disappeared, try to self-heal as soon as the resource reappears. But if you're in unknown territory, better stop working and complain. Writing a message in a log is good enough, IF some monitoring system will pick up that log and alert the right channel.
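Roughly what I mean, as a Python sketch. `run_with_self_heal`, `alert`, `max_waits` and `wait_seconds` are hypothetical names, and treating `ConnectionError` as "the resource disappeared" is an assumption for the example:

```python
import time
from typing import Callable


def run_with_self_heal(
    task: Callable[[], str],
    alert: Callable[[str], None],  # hypothetical hook into monitoring
    max_waits: int = 5,
    wait_seconds: float = 0.0,
) -> str:
    """Wait out a known failure (resource gone); stop and complain on anything else."""
    for _ in range(max_waits):
        try:
            return task()
        except ConnectionError:
            # Known scenario: the resource disappeared. Wait for it to come back.
            time.sleep(wait_seconds)
        except Exception as exc:
            # Unknown territory: don't guess, alert and stop.
            alert(f"unexpected failure, stopping: {exc!r}")
            raise
    alert("resource did not come back in time")
    raise TimeoutError("gave up waiting for resource")
```

The waiting is bounded on purpose, so even the known case eventually ends up in front of a human.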
I've seen systems try to self-heal while working on wrong assumptions. Operators now have two problems: fix the system AND stop it from damaging itself. Self-healing can turn into self-damaging quickly.
I’ve written a few systems that were aggressively self-healing and operated them in production. The benefit is as you say. When done well, the systems kind of run themselves and require much less attention than systems that are not designed this way. From an operations perspective it was great. From a software development perspective, not so much, and this largely explains why it is uncommon.
In all typical software architectures, many places in the code do not have enough context to handle exceptional conditions. Single errors may have multiple possible root causes that have to be determined by inference or deduction in the code so that the handling is appropriate. Evaluating some causes requires complex code far outside the purview of the software’s main purpose and possibly the skill set of the developers. Appropriate resolution of an exception at a single call site can be context dependent: not only do you have to determine the root cause at runtime, you also have to determine the correct resolution at runtime. A single resolution may need to implement multiple strategies to take into account real-time environmental context that changes how that resolution is handled.
Making this logic maintainable requires an architecture that pretty heavily revolves around the software infrastructure required to make this type of exception handling scalable. You’re replacing all of the error handling idioms every software engineer knows with something alien that colors the entire code base. Also, there is little in the way of robust frameworks that do a lot of this grunt work for you so you are usually left writing your own.
The tl;dr is that implementation is quite expensive and difficult in practice, even though it usually has no performance overhead and is great from an operations perspective. While people like the idea, the software development overhead is usually considered too high to justify making operations’ life easier.
I've certainly had success with an application with a very aggressive transactional approach; every button press would commit-or-crash, and in the event that it did crash you could simply resume from another terminal by logging in with a quick key fob tap.
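The commit half of commit-or-crash looks something like this in Python. `sqlite3` is just a stand-in here (not what that application used), and the schema and the `open_db` / `press_button` names are invented for the example:

```python
import sqlite3


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create the hypothetical schema: a press log plus one row of session state."""
    conn = sqlite3.connect(path)
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS presses (id INTEGER PRIMARY KEY, action TEXT)"
        )
        conn.execute(
            "CREATE TABLE IF NOT EXISTS session "
            "(id INTEGER PRIMARY KEY, last_action TEXT NOT NULL)"
        )
        conn.execute("INSERT OR IGNORE INTO session (id, last_action) VALUES (1, 'login')")
    return conn


def press_button(conn: sqlite3.Connection, action: str) -> None:
    """Record one button press: both writes land, or neither does."""
    with conn:  # commits on success, rolls back if the block raises
        conn.execute("INSERT INTO presses (action) VALUES (?)", (action,))
        conn.execute("UPDATE session SET last_action = ? WHERE id = 1", (action,))
```

With a file-backed database, a crash between the two statements leaves the last committed state on disk, which is what makes resuming from another terminal safe.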
Don't like choice? Go back to Mac, spare us your hot takes.