Am I understanding this right? Surely, I must be missing the entire point because...
To me, this looks like: keep adding more and more bullshit to a model while managing to increase its accuracy, and eventually you end up with a "smaller" model with less bullshit?
That is to say, adding correlated or endogenous variables to a model (over-parameterization), so long as it increases accuracy, will one day yield a smaller, more optimized model with fewer variables?
If so, why is this news? Isn't this the fundamental process of most statistics and optimization problems? Or isn't adding more data (when available) a fundamental method of fixing multicollinearity?
I think you do misunderstand. They do not add "correlated variables" to a model. The idea is that if you have an overparameterised model for a specific problem, this model contains a smaller model that performs about as well as the trained large model, without any training! In that sense, pruning weights in a random network can stand in for gradient descent. There is no algorithm that does this efficiently (as they show), but that does not rule out (so far unknown) heuristics that would get you close. This is exciting because it means a potential alternative to backprop is out there. That would be cool because it might lead to more efficient algorithms and, something I haven't seen mentioned in the paper, an alternative to backprop that might be easier to understand in a biologically plausible way.
What you are stating is the original Lottery Ticket Hypothesis. What they prove in this paper is the stronger version, empirically noticed here https://arxiv.org/abs/1905.01067 and referred to as "supermasks". To quote from the paper posted here: "within a sufficiently overparameterized neural network with random weights (e.g. at initialization), there exists a subnetwork that achieves competitive accuracy".
Seems like a "Library of Babel" type of thing. I'd have to read the full paper to see how they find the subnets, but their mere existence is not so surprising. There's a huge sea of possible subnetworks. Basically, SGD is replaced by whatever procedure you use to traverse the space of parameter subsets. Definitely an interesting direction.
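To make the "traverse the space of parameter subsets" idea concrete, here is a toy sketch. It is not the procedure from the paper: the greedy bit-flipping search, the tiny one-hidden-layer ReLU network, the toy target, and the names `forward`, `mse`, and `search_mask` are all assumptions made for illustration. The random weights are never updated; only a binary mask over them changes.

```python
import math
import random

def forward(x, W1, w2, mask):
    # One hidden ReLU layer; mask[i][j] in {0, 1} switches weight W1[i][j] on or off.
    out = 0.0
    for j in range(len(w2)):
        pre = sum(x[i] * W1[i][j] * mask[i][j] for i in range(len(x)))
        out += max(pre, 0.0) * w2[j]
    return out

def mse(X, y, W1, w2, mask):
    # Mean squared error of the masked network over the whole dataset.
    return sum((forward(x, W1, w2, mask) - t) ** 2 for x, t in zip(X, y)) / len(X)

def search_mask(X, y, W1, w2, steps=300, seed=0):
    # Hypothetical search procedure (an assumption, not the paper's method):
    # flip one random mask bit at a time and keep the flip only if the loss
    # does not get worse. The weights themselves are never modified.
    rng = random.Random(seed)
    mask = [[1] * len(w2) for _ in W1]
    best = mse(X, y, W1, w2, mask)
    for _ in range(steps):
        i = rng.randrange(len(W1))
        j = rng.randrange(len(w2))
        mask[i][j] ^= 1
        loss = mse(X, y, W1, w2, mask)
        if loss <= best:
            best = loss
        else:
            mask[i][j] ^= 1  # revert the flip
    return mask, best

# Toy data and a fixed random network (all values are illustrative).
rng = random.Random(42)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(40)]
y = [math.sin(x[0]) + 0.5 * x[1] for x in X]
W1 = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(3)]
w2 = [rng.gauss(0, 1) / 4 for _ in range(16)]
W1_before = [row[:] for row in W1]

dense_mask = [[1] * len(w2) for _ in W1]
dense_loss = mse(X, y, W1, w2, dense_mask)
mask, pruned_loss = search_mask(X, y, W1, w2)
kept = sum(sum(row) for row in mask)
print(f"dense loss {dense_loss:.3f}, masked loss {pruned_loss:.3f}, weights kept {kept}/48")
```

Because a flip is only kept when the loss does not increase, the masked loss can never end up worse than the unmasked random network. How close it gets to a trained network depends on how overparameterized the random network is, which is the point the quoted result is about.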