> it does so using a network with 480 million parameters. The training to ascertain the values of such a large number of parameters is even more remarkable because it was done with only 1.2 million labeled images—which may understandably confuse those of us who remember from high school algebra that we are supposed to have more equations than unknowns.
This is one of the key misunderstandings that is still deeply rooted in people's minds. For modern DL, a large part of the learning comes from "internal" data points, in this case the pixels of the image, as opposed to the labels. If you count the number of pixels, you will likely get something like 1.2 trillion, more than enough to justify the 4.8e8 parameters. It's the usage of internal data that prevents overfitting, NOT the random initialization and SGD as claimed in the article.
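As a rough back-of-the-envelope check (the per-image size below is an assumption for illustration, not a figure from the article):

    # Hedged back-of-the-envelope arithmetic; per-image resolution is assumed.
    images = 1.2e6              # labeled ImageNet training images
    pixels_per_image = 350_000  # assumed average resolution, not from the article
    channels = 3                # RGB values per pixel
    raw_values = images * pixels_per_image * channels
    print(f"{raw_values:.2e}")  # ~1.26e+12 raw input values vs 4.8e8 parameters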
Another way to see this is: if you need more labels than parameters, how can GPT3 have ANY parameters at all? It is trained purely on raw text data.
GPT3 has millions of labels. Every vocabulary term is a label. It’s equivalent to supervised learning in architecture. The “self-supervised” business is mostly spin to make it sound a bit more novel. People have been predicting the next word for ages (Turing did this).
Input: <previous words of article>
Label: <next word>
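In other words, something like this (a minimal sketch; real systems use subword tokenizers rather than whitespace splitting):

    # Next-word prediction framed as (input, label) pairs.
    text = "the cat sat on the mat"
    words = text.split()
    pairs = [(words[:i], words[i]) for i in range(1, len(words))]
    # [(['the'], 'cat'), (['the', 'cat'], 'sat'), ...]
    # every position in the raw text supplies a label for free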
Your point is well taken that the number of input data points is also important when considering the complexity of the problem. In this case however the number of data points more or less exactly equals the number of labels.
(About Me: the first year+ of my PhD was focused on large scale language modelling, during which transformers came out.)
You are incorrect about the input dimensionality mattering. Let's say you have 100 high-res images with yes/no labels. If you hash the images and put their labels in a hashmap, you can say this is a "learned" function of 100 parameters which achieves zero training error on the dataset. This parameter count is independent of input dimension. Why do you think this would change when this mapping is replaced by a smooth neural network mapping?
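A sketch of that hash-map "model" (assuming a `training_set` list of (image_bytes, label) pairs; the helper names are made up for illustration):

    import hashlib

    # "Memorize" 100 images in a dict: one entry ("parameter") per image,
    # regardless of how many pixels each image has. Zero training error,
    # zero generalization.
    def fingerprint(image_bytes):
        return hashlib.sha256(image_bytes).hexdigest()

    table = {fingerprint(img): label for img, label in training_set}

    def predict(image_bytes):
        return table.get(fingerprint(image_bytes))  # None for anything unseen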
GPT is trained to predict the input (estimating p(x)), versus predicting a label given an input (p(y|x)). So in the case of GPT you can use the input dimensionality as a "label", as another responder has mentioned. ImageNet classification is different (excepting recent semi-supervised or unsupervised approaches to image recognition).
The ability to generalize in the typical imagenet setting is, as the article says, a byproduct of SGD with early stopping, which in practice limits the number of functions a deep neural network can express (something not considered in an analysis which only considers parameter count).
The point is that your simple mapping with zero error on the training dataset also has zero predictive power, both on the test dataset and in real life. It has learned nothing; it's overfitting taken to the extreme.
Input dimensionality is absolutely important when determining net size.
Seems like cross-talk to me. They were responding to the erroneous claim of "input dimensionality" being equivalent to data. What the first poster referred to as "internal data points" may be better described as the presumption of differentiability, that is, a small disturbance of the pixels should result in a "small" change of the labels. But it was ridiculous to claim that the total number of pixels is somehow a meaningful measure of sample size. The pixels are not independent, as dramatized by the hash map example given above.
That's the point. 100 parameters is sufficient to overfit, and it's a number that's independent of the input size. Do you have a reference for your statement?
Reference for what exactly? That input dimensionality is important when determining net size? That seems quite self-explanatory; try training an image classifier with only 100 parameters.
Maybe I understood that question wrong, but regardless, even if early stopping wasn't implemented, an NN would have more predictive power than the hash mapping. Both would be completely overfit on the training data set, yet the NN would most likely be able to make some okay guesses on OOD data.
For modern DL, a large part also comes from regularization. And then also data augmentation. And self-supervision in whatever way, either prediction, masked prediction, contrastive losses, etc.
Which all adds to the number of constraints / equations.
I don't think it quite works out that way. If you compare it to solving a system of equations, then the size of the input is irrelevant. Indeed, a very large input is often the main reason for a problem to be under-specified.
What you should look at is the number of outputs times the number of data points for each output. If this number is lower than the number of parameters then it should be possible to find multiple solutions.
Of course in this case you're not looking for a solution, but an optimum, and not even a global one, so it's not too troubling per se that you don't get a unique answer. Though it does somewhat suggest you should be able to get an equivalent fit with far fewer parameters, but finding it could be quite tricky.
The whole point is weird - having more equations than unknowns is only needed if you want a unique exact solution; it's entirely possible to solve underconstrained problems that have multiple solutions via optimization (which is what DL is). To be fair, the article says just that afterward, it's just weird that the high school thing was mentioned in the first place.
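A tiny numpy illustration of that point: an underconstrained linear system with more unknowns than equations still has solutions, and optimization simply picks one of them.

    import numpy as np

    # 10 equations, 100 unknowns: far fewer constraints than parameters.
    rng = np.random.default_rng(0)
    A = rng.normal(size=(10, 100))
    b = rng.normal(size=10)

    # lstsq picks the minimum-norm solution out of infinitely many exact ones;
    # gradient descent from a random start would land on a different, equally
    # valid solution.
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    print(np.allclose(A @ w, b))  # True: fits exactly despite being underdetermined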
Here's the original study[1] that seems to be the primary source for this article. It's an important study from a respectable journal. To be frank, it's pretty disconcerting that the top comments on this thread are the ones writing off the topic on its premise alone, while the comments actually engaging with the topic sit at the bottom.
It is a short paper from a leading conference in the field of Natural Language Processing. I have not engaged with the work, so I can not speak about its quality. But at least get your facts straight and avoid appealing to authority.
All journals have good and bad papers, the same goes for conferences. So to understand the quality of a piece of work you need to at the very least talk to a few people in the field that have engaged with the work to form a reasonable opinion about its quality.
> ... A similar sentiment was expressed by Marvin Minsky: "Unfortunately, the strategies most popular among AI researchers in the 1980s have come to a dead end," said Minsky. So-called “expert systems,” which emulated human expertise within tightly defined subject areas like law and medicine, could match users’ queries to relevant diagnoses, papers and abstracts, yet they could not learn concepts that most children know by the time they are 3 years old. “For each different kind of problem,” said Minsky, “the construction of expert systems had to start all over again, because they didn’t accumulate common-sense knowledge.” Only one researcher has committed himself to the colossal task of building a comprehensive common-sense reasoning system, according to Minsky. Douglas Lenat, through his Cyc project, has directed the line-by-line entry of more than 1 million rules into a commonsense knowledge base."
> And then, GPT-3 came along and rendered Cyc a wasted effort.
I have yet to see GPT-3 do anything commercially important. Cyc, on the other hand, seems to have been used in a number of sectors. Not to downplay GPT-3 - it's cool tech that produces cool demos - Cyc just seems more like a tool rather than a toy.
Thanks for the response. I'm not familiar with Markcopy.ai or jenni.ai, but the others are "toys" if we're being honest (although, as I said, very cool toys). You wouldn't want to use GPT-3 to recommend drugs for a condition you feed it... would you? As far as I understand, this is the kind of problem Cyc is trying to solve with its domain-specific rules.
edit: Cyc can also tell you why it gave the response it did - something that deep nets cannot do. This is important in many fields, otherwise you cannot trust any response it produces.
Amazing how our own brain can delude us by creating a feeling that we understand something because (apparently) we have produced a clear definition of it.
Just a hint - each of these "facts" or "rules" means something slightly different depending on the context it is used in. This context is not hard-encodable because it slightly mutates with every new piece of information learned or forgotten.
can you automate your fate with gpt-3? that is where the money lives.
will sublime conclusions made by GPT-3 EVER be trusted if the reasoning behind those conclusions is not understood by a human? perhaps the gestalt of gpt3 implies a meta gpt3 that could derive human-grokkable explanations of its "dumber" self. or maybe not.
The current set of tools in explainable AI does allow deep neural nets to explain, to a certain extent, why they gave the response that they did.
Contact me if you want to see it applied to text-based DL models, time-series-based DL models, or image-based DL models.
I doubt that gains from symbolic AI or any algorithmic improvement will outpace Moore’s law in the long run. I certainly wouldn’t want to spend decades working to improve AI, only to have my work undercut by cheaper silicon.
And here I thought one of the big benefits of DL was that it could handle the complexities which would be too hard to specify symbolically in order to give symbolic AI “common sense”.
The following argument comes to mind, but I don’t really buy it (it just came to mind as something that one might say next):
‘
Perhaps there is an analogy between the solutions of “we just need to get better (more varied, better fitting the desired behavior, etc.) training data, and maybe better training procedures” and “we just need to add more/better inference rules and symbolic ways to encode statements, and add more facts about the world”. Similar in that both will produce the specific improvements they target, but where solving “the real/whole/big problem” that way is infeasible. If so, then maybe this indicates that a practical full-solution to artificial “common sense” would require something fundamentally different than both of them, if it is even possible at all.
‘
Again, I don’t really buy that line of reasoning, just expressing my inner GPT2 I guess, haha.
Ok, but I presented an argument (or something like an argument) which I made up, and said that I don't buy it. So, I should say why I don't buy it, right? Like many of the things I write, it is chock-full of qualifiers like "perhaps" and "maybe", to the point that one might say that it hardly makes any claims at all. But ignoring that part of it, one major difference is that the DL-style architectures seem to be working? And it isn't clear what kinds of (practically speaking) hard limits they could run into. Now, on the other hand, perhaps at the time that symbolic AI was all the rage, it appeared the same way. (Is this what people mean when they talk about inside view vs outside view?)
Why should these two things not be especially analogous? Well, saying "proposed solution X to the problem says to just [do more of what X is/do X better], and that is just like how proposed solution Y says to just [do more of what Y is/do Y better]" is kind of a fully general argument for dismissing any proposed type of solution where partial solutions of that type have been tried, but the whole problem hasn't been solved that way yet, and another proposed kind of solution has already lost favor. This doesn't seem like a generally valid line of reasoning. Sometimes you really do just need more dakka (spelling? I mean "more of the thing you already tried some of").
Of course, if one is convinced that it really was right for the older proposed kind of solution to be discarded, that probably should say something about the currently popular kind of solution. Especially if there have been many proposed kinds of solutions which have been discarded. But, it seems like much of what that says is just that the problem is hard. And, sure, that may mean an increased probability that the currently popular proposed kind of solution also doesn't end up being satisfactory, but that doesn't mean one should be too quick to discard it. Tautologically: if no known alternative is currently at least as promising as the type of solution currently being considered, then the current one is the most promising of the currently known options. Whether it is promising enough to actively pursue may be a different question, but it shouldn't be marked as discarded until something else (perhaps something previously discarded, or something novel) becomes more promising.
I still think the next breakthrough will be when we figure out how to simplify/optimize inner portions of feedforward networks. I think it is extremely likely from working with deep nets in the past that a lot of the inner structure ends up being superfluous. The best way to test this is take a well trained network and then remove a neuron, then re-train until the prior accuracy is achieved, and repeat this process until the previous accuracy cannot be achieved. At this point you have a network that is theoretically as simple as it can be without an accuracy loss. This won't work for the large nets they are talking about in the article since training takes days and uses an inordinate amount of compute resources. So the real breakthrough will be when we come up with some mathematical technique (in my mind something almost analogous to AVL rotations) that yields a bunch of structural simplifications you can apply to inner structures within these nets, turning a network with thousands of weights into a network with hundreds of weights.
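In pseudocode, that procedure might look roughly like this (a sketch only: `initial_model`, `data`, `train`, `evaluate`, and `remove_one_neuron` are hypothetical names, not a real API):

    # Iteratively remove neurons until accuracy can no longer be recovered.
    model = train(initial_model, data)
    target_acc = evaluate(model, data)

    while True:
        candidate = remove_one_neuron(model)        # e.g. drop the least-used unit
        candidate = train(candidate, data)          # retrain to convergence
        if evaluate(candidate, data) < target_acc:  # accuracy can't be recovered
            break
        model = candidate                           # accept the smaller network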
You should look at Frankle's work on The Lottery Ticket Hypothesis [1]; it turns out that in most cases you can remove 80+% of the network's weights and still get very similar output quality. The hypothesized reason is that regions of the network which are randomly initialized end up getting a "winning lottery ticket" which already has structure reasonably well optimized for your end task to be further finetuned, and everything else mostly ends up just being set dressing, which is why you can delete those neurons without a major impact to performance.
That said, I don't think we are going to be able to actually step away from large networks any time soon. It seems to be the case that when you have more parameters to optimize, you have more degrees of freedom and you are less likely to end up getting stuck in a local minimum, which is why it's actually easier to train a larger network to solve a task versus a smaller network, despite the fact that both are wildly overparameterized.
I think recursive networks will beat feed forward: figure out a way to feed some outputs of the upper layers back into the lower layers ("fine-tuning/focus")
I’ve had similar thoughts in the past but I started playing around with some newer models recently and I have about 25-30 projects that I can think of right off that could be considered commercially viable. And certainly VC fundable in this investment environment.
It's not inherently true. Technically, deep learning is essentially any neural network model with hidden layers (i.e., at least one layer in between the input layer and the output layer). You could have a "deep learning" model with a couple dozen parameters, perhaps. But at that end of the scale, most people would probably reach for other approaches that are more easily interpretable (e.g., logistic regression, random forest). So in practice, yes, virtually any deep learning model you see out there in the wild, even most "toy examples" used to teach machine learning, is going to be overparameterized.
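For instance, a network that is "deep" in this minimal technical sense can be absurdly small (a sketch; parameters counted by hand):

    import numpy as np

    # A 2-4-1 network: one hidden layer, so "deep" in the minimal technical sense.
    # Parameter count: (2*4 + 4) + (4*1 + 1) = 17 -- a couple dozen at most.
    W1, b1 = np.zeros((2, 4)), np.zeros(4)
    W2, b2 = np.zeros((4, 1)), np.zeros(1)

    def forward(x):
        h = np.maximum(0, x @ W1 + b1)  # ReLU hidden layer
        return h @ W2 + b2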
Not even close, most of my work has been in naturally occurring data and there is way waay more data available than can possibly be used (petabytes). Where they get this idea as being the rule and not the exception is beyond me.
> At OpenAI, an important machine-learning think tank, researchers recently designed and trained a much-lauded deep-learning language system called GPT-3 at the cost of more than $4 million. Even though they made a mistake when they implemented the system, they didn't fix it, explaining simply in a supplement to their scholarly publication that "due to the cost of training, it wasn't feasible to retrain the model."
Putting aside energy costs, Object detection is still crappy and has stalled. YOLO/SSDMN were impressive as all get-out, but they stink for general purpose use. It's been 3 years (?) and general object detection, even with 100 classes, is still unusable off the shelf. Yes, I understand incremental training of pre-trained nets is a thing, but that's not where we all hoped it would go.
You're on the right track and don't even know it! ;-P
That's why the current Tesla has like five LIDARs, two radars, 16 cameras, and lots of traditional algorithms that aren't NNets. (I don't know the counts, but my point stands.)
I've spent 5 years on this with tier 1 automotives, I don't think ADAS5 will happen in my lifetime, and it definitely won't be due to neural nets: It is FAR more likely autonomy will be achieved by billions of IoT sensors in a mesh, guiding vehicles: e.g., lane sensors, weather sensors, c2c (car-to-car telemetry), c2i (car-to-infrastructure telemetry, like traffic lights), rather than some fancy in-car brain-emulation.
This reminds me a bit of what I said once when someone asked me whether I believed P = NP would have any practical impact. What I said was that even if P = NP, so that essentially every problem has a polynomial time algorithm, I suspected there were certain practical problems that would prove intractable to solve exactly, because the exponent would be too high, like n^10 or something.
There's a good reason that consumer software either has small n, or basically doesn't use algorithms with exponents worse than n^2 (and, we only go to n^2 if it's really necessary). Throwing compute resources at the problem only goes so far when that exponent works against you.
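To put rough numbers on that (assuming, just for illustration, on the order of 10^9 operations per second per core):

    n = 1_000_000       # a modest input size for consumer software
    ops_per_sec = 1e9   # assumed rough single-core throughput

    print(n**2 / ops_per_sec)             # ~1e3 seconds: slow but survivable
    print(n**10 / ops_per_sec / 3.15e7)   # ~3e43 years: effectively never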
At that time, I wasn't thinking about it in terms of energy consumption, or carbon emissions, but, given the scale of things now, that does look like an appropriate set of units. Nor did I ever conceive of algorithms that would only be run once, because it was just too expensive to run it more than that.
Longish article about the cost of training increasingly big neural nets. Worried about carbon. "Training such a model would cost US $100 billion and would produce as much carbon emissions as New York City does in a month. And if we estimate the computational burden of a 1 percent error rate, the results are considerably worse."
This number is not credible. Are we done with ImageNet?[1] points out that the best ImageNet models that currently exist surpass the accuracy of the original labels they are trained on. Expecting to reduce error by a factor of 10 on unseen data points is incoherent.
Anybody who argues against deep learning based on energy consumption immediately fails to impress me. This article is particularly bad - claiming you need k^2 more data points to improve a model by a factor of k, and using that to extrapolate unrealistic energy consumption targets for DL training.
The sum of all DL training in the world is noise compared to the other big consumers of energy in computing. That's because the main players all invested in energy-efficient architectures. DL training energy is not something to optimize if your goal is to have a measurable impact on total power consumption.
> That's because the main players all invested in energy-efficient architectures.
If the cost was gigantic enough to make the investment worth it they must have found some really great improvements for it to end up being just noise. Improvements that somehow didn't have a noteworthy impact on general computing.
And yet people here have no trouble crying about electricity wastage of crypto. Also, from my limited knowledge, I think DNN models are not very transferable in real-world settings, requiring constant retraining even for a small drift in the signal or a change in noise modes.
> And yet people here have no trouble crying about electricity wastage of crypto
Which is many orders of magnitude more energy-intensive, on the scale of a small nation-state, and in most cases fundamentally wasteful by design. A very large pre-trained model can be reused very cheaply once it's finished.
> Also, from my limited knowledge, I think DNN models are not very transferable in real-world settings, requiring constant retraining even for a small drift in the signal or a change in noise modes.
This is FUD, promulgated by people who expected deep learning to solve all their problems overnight. All models will suffer from "drift" whenever the underlying data changes.
Part of what made deep learning so good was that it was able to generalize exceptionally well from exceptionally complicated input data.
It is unreasonable to expect that a model pre-trained on a huge generic corpus will be a perfect match for your very specific business problem. However it is _not_ unreasonable to expect that said model will be a useful baseline and starting point for your very specific business problem.
We are not yet (and might never be) at the point where you can dump a pile of garbage data into an API and get great predictions out the other end on the first try. But nobody ever thought you could do that, except the people selling expensive subscriptions to those kinds of APIs. The fact that they work at all should be taken as evidence of how amazing deep learning is; the fact that they don't work perfectly should not be taken as evidence that deep learning is bad/useless/wasteful/hype/whatever.
Don't let the clueless tech media set your expectations.
Professional data scientists and machine learning practitioners for the most part take their work very seriously and take pride in delivering good outcomes, just like professional software engineers. If deep learning wasn't useful to that end, nobody would be using it.
Open ended crying about electricity doesn’t make sense in the absence of specifics.
A big company like Microsoft probably wasted more money on pentium 4s 15 years ago. Electricity is just another resource - if the numbers work, burn away.
I know we’re nowhere near the following scenario, this is just to illustrate how things can go wrong even if the numbers tell you to “burn away”:
Imagine we have computronium with negligible manufacturing cost; the only important thing is the power cost to use it.
Imagine you’re using it to run an uploaded mind, spending $35,805/year on energy.
The 50% of Americans earning more than this [0] are no longer economically viable, because their work can now be done at the same cost by a computer program.
Doing this with the current power mixture would be disastrous, doing it with PV needs about 1400m^2 per simultaneous real time mind upload instance (depending on your assumption about energy costs and cell efficiency, naturally).
In a more near-term sense, there are plenty of examples where the Nash equilibrium tells each of us to benefit ourselves at the expense of all of us. Not saying that is the case for Deep Learning right now, but can (and frequently does) happen.
I hate to be the one to tell you, but it turns out we are living in the middle of an ecological catastrophe, and it also turns out that means electricity is a resource we are going to have to conserve.
It’s a resource whose cost is flattening due to the rise of PV and wind power, unless you live in some backward place where they are still mostly coal.
> > The first part is true of all statistical models: To improve performance by a factor of k, at least k^2 more data points must be used to train the model. The second part of the computational cost comes explicitly from overparameterization. Once accounted for, this yields a total computational cost for improvement of at least k^4.
Those claims are entirely new to me, and I've been a researcher in the field for almost 10 years. Where do they come from/what theorems are they based on? It's unfortunate this article doesn't have any citations.
But the k^2 part only applies to estimation (like how well a population parameter can be estimated) in certain regimes, and not, e.g., to expected risk (like how well one can do at prediction), so I'm not sure how it would apply here.
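For reference, the classical estimation-error behavior that k^2 figure resembles (a sketch of the standard 1/sqrt(n) rate, not a citation for the article's claim):

    import numpy as np

    # Standard error of a sample mean shrinks like 1/sqrt(n), so cutting the
    # estimation error by a factor of k takes roughly k^2 times more samples.
    rng = np.random.default_rng(0)
    for n in [1_000, 100_000]:  # 100x more data
        errs = [abs(rng.normal(size=n).mean()) for _ in range(200)]
        print(n, np.mean(errs))  # the average error drops by roughly 10x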
This assumes that our processes and algorithms don't get more targeted or improve. The rate of new approach discovery is staggering. For every problem, some combination of approaches will more efficiently pre-process and understand the training data.
The article also ignores training vs running tradeoffs. Training a model once may be extremely resource intensive, but running the resulting model on millions of devices can be negligible while having huge value add.
A good example is the discovery of attention mechanisms/transformers replacing more cumbersome and computationally expensive RNNs and LSTMs in NLP, and more recently outperforming more expensive models in computer vision.
I'm wondering if there is a way to combine optimization of model weights in a neural net with a set of heuristics limiting the search space, as a sort of rules engine/decision tree integrated within ANN backprop training. Basically pruning irrelevant and redundant features early and focusing on more informative ones.
Yes, there are many approaches like that. In one approach, they train a network and prune it, then mask the pruned weights and retrain the resulting sparse network from scratch, starting from the original untrained weights.
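A rough sketch of that prune-mask-retrain loop (lottery-ticket style; `init_weights` and `train` are hypothetical helpers, and real implementations such as Frankle's use iterative magnitude pruning over several rounds):

    import numpy as np

    # 1) keep the untrained weights, 2) train, 3) prune the smallest weights,
    # 4) rewind survivors to their initial values and retrain the sparse net.
    w_init = init_weights()
    w_trained = train(w_init)

    threshold = np.quantile(np.abs(w_trained), 0.8)        # drop the smallest 80%
    mask = (np.abs(w_trained) >= threshold).astype(float)

    w_rewound = w_init * mask                  # the "winning ticket" at init
    w_sparse = train(w_rewound, mask=mask)     # pruned weights stay at zero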