I work in this field. Derivatives (and in fact second derivatives, or a Hessian matrix), when they are reliable and when you have them, make your search faster.
Unfortunately, for many problems they are either unavailable or expensive to compute. (Imagine a case where your “function” is a full CFD simulation, a 2-hour job: every time the optimizer iterates, it has to run an entire simulation. You can’t see the function, and you can’t really calculate the derivatives, except crudely through finite differences.)
Derivative-free optimizers forgo derivative information and instead try to search the space efficiently to optimize the objective. Their convergence is often much slower than that of derivative-based optimizers, but they are much simpler to implement and tend not to get stuck in valleys. All they really are is a way to cleverly evaluate f(x) to find the optimum, without needing f’(x) or f’’(x).
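As a minimal sketch of that idea (assuming SciPy is available; the Rosenbrock function here is just a standard test objective), Nelder–Mead only ever asks for values of f(x), never a gradient:

```python
from scipy.optimize import minimize

def f(x):
    # Rosenbrock test function; global minimum at (1, 1)
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

# Nelder-Mead never requests f'(x): it only compares function values
# at the vertices of a moving simplex.
res = minimize(f, x0=[-1.2, 1.0], method="Nelder-Mead",
               options={"xatol": 1e-6, "fatol": 1e-6, "maxiter": 2000})
```

`res.x` should land close to (1, 1) after a few hundred function evaluations, which is the slow-but-derivative-free convergence described above.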
Caveat: derivative-free optimizers are terrible for constrained optimization, however (especially if your constraints are anything but simple bounds). You can use barrier methods, but they’re still not great.
Also, there’s an element of luck involved. It’s hard to benchmark derivative-free methods because the methods themselves don’t matter too much (the Nelder–Mead simplex method outperforms Powell’s method in some cases but not others; Powell’s method is well known but isn’t categorically better than any other algorithm): when you don’t use derivatives, higher-order effects like the initial point and the shape of the terrain dominate the choice of algorithm. I wouldn’t get too hung up on picking the right algorithm; instead I would focus my efforts on heuristics to get to a good initial point cheaply.
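For what it’s worth, that “good initial point cheaply” advice can be sketched as a multistart wrapper (SciPy assumed; the bumpy objective is an arbitrary stand-in with many local minima):

```python
import numpy as np
from scipy.optimize import minimize

def f(x):
    # arbitrary stand-in: a quadratic bowl with sinusoidal bumps,
    # i.e. many local minima to get trapped in
    return np.sin(3 * x[0]) * np.sin(3 * x[1]) + x[0]**2 + x[1]**2

rng = np.random.default_rng(0)
best = None
for _ in range(20):
    x0 = rng.uniform(-2, 2, size=2)     # cheap heuristic: random restarts
    res = minimize(f, x0, method="Nelder-Mead")
    if best is None or res.fun < best.fun:
        best = res                      # keep the best local optimum found
```

Which local minimum each run lands in depends almost entirely on x0, which is the “terrain dominates the choice of algorithm” point above.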
According to the papers mentioned by others, the algorithms under discussion do not seem to support this point. The author of the project mentions that the FORTRAN 77 implementation is “nontrivial to understand”, which motivates this work.
> Caveat: derivative-free optimizers are terrible for constrained optimization, however (especially if your constraints are anything but simple bounds). You can use barrier methods, but they’re still not great.
LINCOA and COBYLA in the package can handle problems with nontrivial constraints according to the documentation in the Git repository, though I do not know how well they work.
Many common functions have derivative-unfriendly discontinuities: logic functions, rounding, conditional clauses, sawtooth functions, etc.
Other functions have flat regions where the derivative is constantly zero, like logic functions and rounding (again).
Some functions have derivatives but have so many minima (or maxima) that non-optimal or unacceptable minima are likely to trap the learning process. Sine and cosine waves, for instance.
Some functions even have repeating discontinuities going through +/-infinity. Tangent and secant.
Or, in some cases a model may exist of the system (or part of the system) to be optimized, which can be used to calculate the system response, but it contains relationships that are not easily accessible, so accurate derivatives are difficult or costly to estimate: there is no symbolic expression that allows efficient calculation of the derivatives.
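A concrete illustration of the discontinuous and flat cases mentioned above, using a plain central finite difference on `round`:

```python
def fd_slope(f, x, h=1e-4):
    # central finite difference, the "crude" derivative estimate
    return (f(x + h) - f(x - h)) / (2 * h)

flat = fd_slope(round, 0.2)    # inside a flat step: estimated slope is 0.0
spike = fd_slope(round, 0.5)   # straddling a jump: a huge, meaningless slope
```

A gradient-based method sees either no direction to move in (the flat case) or a wildly wrong one (the spike); a derivative-free method that only compares function values is unaffected.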
Yes, sometimes it’s hard to measure a derivative. E.g., when doing hyperparameter tuning in ML, you can read out a metric at a given choice of parameters, but it’s generally not easy to get a gradient.
Shameless plug: I happen to have recently written a package for the opposite limit! It finds roots when you can only measure the derivative.
(It was written in another time, don't expect it to be nice. My 20-year-old Octave/Matlab wrappers on that page have almost certainly bit-rotted, don't expect them to work.)
I assume backprop is still most efficient for the neural net itself, rather than one of these algorithms? I’m taking a course and am aware that there are algorithms other than gradient descent, but I haven’t seen the details yet.
I know things like cross entropy loss are designed to be easy to differentiate and probably more efficient to do that than something that approximates it.
Yeah, backprop gives efficient gradient calculations for training at a given net architecture. To find the best architecture, though, people try various other methods, e.g. random search or Gaussian processes, which don’t evaluate derivatives w.r.t. architecture params.
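A hedged sketch of that kind of derivative-free search: plain random search over a mixed continuous/integer space (the `score` function here is a hypothetical stand-in for an expensive training run returning a validation loss):

```python
import random

def score(learning_rate, num_layers):
    # hypothetical stand-in for "train the net, return validation loss"
    return (learning_rate - 0.01)**2 + (num_layers - 4)**2

random.seed(0)
best_cfg, best_loss = None, float("inf")
for _ in range(100):
    cfg = (10 ** random.uniform(-4, 0),   # log-uniform learning rate
           random.randint(1, 8))          # integer-valued: no gradient exists
    loss = score(*cfg)
    if loss < best_loss:
        best_cfg, best_loss = cfg, loss
```

The integer `num_layers` axis is exactly why no gradient w.r.t. architecture params is available in the first place.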
All of the above. Some functions might not have derivatives in some places (where the function is discontinuous), and the derivatives can be very difficult or impossible to calculate even where they exist.
I recently wrote some straightforward TypeScript code, and I want to find the inputs that make one of the outputs exactly zero. My options are to a) rewrite it in an autograd tool, b) manually calculate the derivatives and write a new function for them, or c) just use a root-finder that doesn't need derivatives. I have the CPU cycles to spare, so it will save a huge amount of time to just do c.
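Option (c) in Python terms (assuming SciPy; the cubic here is a stand-in for the actual code): Brent's method only needs function evaluations plus a sign change that brackets the root.

```python
from scipy.optimize import brentq

def black_box(x):
    # stand-in for the real code whose output should be driven to zero
    return x**3 - 2 * x - 5

# black_box(2) < 0 and black_box(3) > 0, so a root lies in [2, 3]
root = brentq(black_box, 2, 3)
```

No derivative, no autograd rewrite: just bracket the sign change and let the bisection/interpolation hybrid grind it out.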
And as a tangential gripe, I wish js had operator overloading. In python, I could just call my existing function with a symbolic argument and get a symbolic output to differentiate, because my `x+y*z` code would work on both kinds of arguments.
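For what that looks like in Python (assuming SymPy; `model` is a made-up numeric function), the same arithmetic code serves both floats and symbols:

```python
import sympy

def model(x, y, z):
    # ordinary numeric code; nothing here knows about symbols
    return x + y * z**2

x, y, z = sympy.symbols("x y z")
expr = model(x, y, z)          # operator overloading builds a symbolic expression
grad_z = sympy.diff(expr, z)   # symbolic derivative: 2*y*z
```

`model(1.0, 2.0, 3.0)` still works as plain arithmetic; that dual use is what operator overloading buys you.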
I'm trying to come up with a simple example. I think that for a bunch of flying hard spheres in a box, any measurement that requires sufficiently time-spaced samples of their positions will be hard to differentiate.
Sometimes you really don't know how to calculate the derivatives.
But perhaps more often, even if in principle you could do the additional work, formulate how to calculate the derivatives, and add the required extra code, you just don't want to invest the additional resources when you can simply call a derivative-free optimization library. The optimization will run slower, but perhaps you have some extra seconds to spare.
If you are using derivative-based optimization, smooth derivatives of your function need to exist.
Trying to use derivative-based optimization where a smooth derivative doesn't exist will likely result in getting stuck at a cliff, where there's no change to either the function value or the gradient.
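A small demonstration of that failure mode (SciPy assumed; the staircase objective is a toy example): BFGS estimates the gradient by finite differences, gets exactly zero inside a flat step, and declares itself done without moving.

```python
import math
from scipy.optimize import minimize

def staircase(x):
    # piecewise-constant objective: flat everywhere, cliffs at integers
    return math.floor(abs(x[0]))

# finite-difference gradient at x0 = 5.5 is exactly zero, so BFGS stops there
res = minimize(staircase, x0=[5.5], method="BFGS")
```

`res.x` stays at 5.5 even though the objective is minimized for |x| < 1; the optimizer never saw a nonzero gradient to follow.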
I work professionally in computational optimization, so here are some of my observations coming from the world of mechanics.
In my experience, it's because the client is trying to reuse an existing simulation code for something that requires optimization. This could be something like imaging (parameter estimation), calibration, or control. For example, say someone has something like a PDE solver that can simulate sound waves going through a media. If you can match the output of this simulation to some collected data, the parameters for the simulation will tell you what material is where.
Anyway, the issue is that these existing codes are rarely instrumented or derived in a way that makes it easy to get derivatives. Management doesn't want to pay for a rewrite, so they demand that existing equity (the simulator) be used. Then, a derivative-free optimizer (DFO) gets slapped on as a first attempt. To be clear, this doesn't work particularly well, but it tends to be the place where I most often see these algorithms used.
As to why it doesn't work very well, derivative based optimizers can scale to the order of hundreds of millions of variables without too many issues (modulo the kind of problem, but it works very well for a lot of problems of value). DFOs tend to have issues past a few dozen variables.
Now, for anyone not wanting to end up in this conundrum: when you write your original simulation code, parameterize on the floating point type. Basically, don't write the code with double; use some kind of generic, template parameter, or whatever on something we'll call MyFloat. The purpose here is to allow the code to be more easily compatible with automatic differentiation (AD) tools, which will give a real and machine-precision-exact derivative. Don't do finite differences. Don't think reinstrumenting later is easy. It's not. It's a pain. It's expensive. And, to be clear, AD can be done poorly and it can be slow. However, I've never seen a code instrumented with AD and a good gradient-based optimizer do more poorly than a derivative-free optimization code. They've always done dramatically better. Beyond just giving the derivative, I'll also mention that even if the client wants to recode the simulator to give an analytic derivative for speed, it is vastly easier to do so if the results can be checked against a simulator instrumented with AD. There are also other tricks of value that can be used in a properly instrumented code, such as interval arithmetic for some kinds of sensitivity analysis.
Anyway, there are other reasons to use derivative-free optimization. Some functions just flat out don't have derivatives. However, in the world of mechanics, most do, and DFOs get abused due to poor prior planning. In either case, please, for the love of math, just parameterize on your floating point type. It would make my job much easier.
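To make the "parameterize on the floating point type" advice concrete, here's a toy sketch (the `Dual` class is a minimal forward-mode AD number made up for illustration; real codes would use an AD library, and in C++ this would be a template parameter):

```python
class Dual:
    """Toy forward-mode AD number: carries a value and its derivative."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)
    __rmul__ = __mul__

def simulate(k):
    # stand-in "simulation kernel": never hard-codes the number type,
    # so it runs on plain floats and on Dual numbers alike
    return 3.0 * k * k + 2.0 * k + 1.0

value = simulate(2.0)                  # plain float path
deriv = simulate(Dual(2.0, 1.0)).dot   # AD path: d/dk of 3k^2 + 2k + 1 at k=2
```

Because `simulate` never commits to `double`, swapping the number type yields an exact derivative with no reinstrumentation, which is the whole argument for writing it generically from day one.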
Agreed. If (approximate) 1st-order information can be obtained in any way (even by *carefully* deployed finite difference), then gradient-based methods should be used. It is wrong to use DFO in such cases.
> DFOs tend to have issues past a few dozen variables.
It seems that the PRIMA Git repo https://github.com/libprima/prima shows results on problems with hundreds of variables. Not sure how to interpret the results.
Imagine if you were trying to optimize a real physical system you didn't have a model of. These methods will still work; you could implement them "by hand" if necessary.