Is anyone else at least a bit worried about a bunch of developers running around doing "machine learning" without much understanding of mathematics and probability? E.g. consider the creation of fragile models that overfit data being used in finance, infrastructure, medicine, etc.
I'm a little bit worried. At least at the same level as when I see a bunch of developers compiling programs without much understanding of what an LL(k) parser does, or how a pushdown automaton works, or what a Turing machine is. I usually feel the same every time I see an elevator without a liftman, take a square root without computing it by hand, or hear about Google self-driving cars.
The difference is that the software or the elevator will work, but the statistical model is wrong and doesn't work. It is like an elevator that only lifts people above 120 and below 90, and for the others it just doesn't work or takes them to the wrong floor.
> The difference is that the software... will work
Lots of software doesn't work. Is there a substantial difference between putting an overfitting model in production, and putting a poorly tested program in production?
I think there might be. When ML fails, the only individual capable of noticing is someone who understands the math. When code breaks often the "lay" user notices. The result is obvious to a novice. When ML fails, it looks like a duck, quacks like a duck, but after multiple years of study it's immediately recognizable as an antelope. Though to disagree with my own point, security vulnerabilities have a similar profile. In essence, to all but the highly trained the difference is imperceptible.
>"When code breaks often the "lay" user notices. The result is obvious to a novice."
That depends on "how" it breaks. As a novice coder myself, I've had things go wrong that I don't notice or can't identify, and it looks like my program is running fine.
I think that's the parent's point: it might be stupid to put crappy machine learning models into production, but it isn't worrisome. It's expected.
I hear ya. I knew that assertion was going to draw some criticism, as it's a judgement call about where we draw the line. Who's a novice and what's obvious? However, I can't get away from my nagging impression that statistical validity is not inherently clear even to the absolute best practitioners. Causality is the goal, and it's notoriously difficult, even for world-class minds. In my experience the only similar effervescent specter for software development is in security. Such circumstances seem to me to require great humility and introspection about one's abilities, but I suppose a little of that would go a long way in general too!
That the program is likely to fail loudly and obviously, but the overfitted model will just sit there being subtly yet perniciously much more wrong than you think it is, forever.
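A minimal sketch of that silent failure, under made-up assumptions: the data is synthetic (a noisy straight line), and the "overfitted model" is a 1-nearest-neighbour lookup that simply memorises the training set. None of the names below come from the thread; they are purely illustrative. The memorising model gets exactly zero error on the data it was built from, and nothing crashes or warns. The problem only becomes visible as error on data it has not seen.

```python
import random

def make_data(n, rng, noise_sd=1.0):
    """Synthetic data (an assumption for illustration): y = 2x + 1 + Gaussian noise."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def fit_memoriser(xs, ys):
    """1-nearest-neighbour 'model': returns the y of the closest training x."""
    points = list(zip(xs, ys))
    def predict(x):
        return min(points, key=lambda p: abs(p[0] - x))[1]
    return predict

def fit_line(xs, ys):
    """Ordinary least-squares straight line, for comparison."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def mse(predict, xs, ys):
    """Mean squared error of a predictor on a data set."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)  # fixed seed so the run is repeatable
train_x, train_y = make_data(50, rng)
test_x, test_y = make_data(200, rng)

memoriser = fit_memoriser(train_x, train_y)
line = fit_line(train_x, train_y)

results = {
    "memoriser_train": mse(memoriser, train_x, train_y),
    "memoriser_test": mse(memoriser, test_x, test_y),
    "line_train": mse(line, train_x, train_y),
    "line_test": mse(line, test_x, test_y),
}
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```

Judged on training error alone, the memoriser looks like the better of the two models. Only the held-out numbers show that it is not the perfect fit it appears to be.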
How many elevators failed today? How many were fixed today? And how many are monitored in real time so they get fixed as they fail? Likewise, statistical models can be monitored and fixed automatically, and, of course, even so they will fail from time to time like elevators do.
Your comment is a bit obtuse; developers are not creating parsers and compilers. Elevators and calculators are very robust technologies that already work.
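One hedged sketch of what "monitored" could mean here, assuming the true outcomes eventually arrive so that predictions can be scored after the fact. The class name, window size, and tolerance factor are all hypothetical choices, not anything from the thread. It keeps a rolling mean squared error and raises a flag when that drifts well above the error measured at validation time.

```python
from collections import deque

class RollingErrorMonitor:
    """Flags a model whose recent error drifts above its validation-time error."""

    def __init__(self, baseline_mse, window=100, tolerance=2.0):
        self.baseline_mse = baseline_mse    # error measured before deployment
        self.tolerance = tolerance          # allowed multiple of the baseline
        self.errors = deque(maxlen=window)  # squared errors of recent predictions

    def update(self, predicted, actual):
        """Record one scored prediction; return True if the model needs attention."""
        self.errors.append((predicted - actual) ** 2)
        if len(self.errors) < self.errors.maxlen:
            return False  # window not full yet, so not enough evidence
        rolling_mse = sum(self.errors) / len(self.errors)
        return rolling_mse > self.tolerance * self.baseline_mse

monitor = RollingErrorMonitor(baseline_mse=1.0, window=5, tolerance=2.0)

# While the model behaves as it did in validation, no alarm is raised.
healthy = [monitor.update(predicted=10.0, actual=10.5) for _ in range(10)]

# After the world shifts, the errors grow and the alarm fires.
drifted = [monitor.update(predicted=10.0, actual=14.0) for _ in range(5)]

print(healthy)
print(drifted)
```

This only covers the detection half. What to do when the flag goes up (retrain, roll back, page a human) is left out, and that is arguably where the statistical judgement comes back in.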
What I worry about is a new wave of engineers and developers thinking they understand statistical models and then proceeding to work at the big banks and have their models blow things up. If PhDs can make such disastrous non-robust models, how on earth is a random developer who took a summer course not going to do the same?
Now if the banks actually failed on their own, then by natural selection the less skilled would be out of jobs and people would stop trying to "short-cut" gaining this type of knowledge. But that's not what happens. Academics keep writing papers and hyping up specific techniques that they can give conferences on, and the taxpayer bails out the idiots at the top.
I'm a developer with very little statistics knowledge who at one time had extensive math knowledge, but I haven't applied it in so long that I don't recall most of it these days.
I worked in the finance sector as a (lead) developer on a production use trading system for quite some years. None of the developers had formal math or statistics knowledge to the extent required to develop this system.
This didn't really bother anyone in the least despite the fact a mistake could cause a loss of millions of dollars in practically no time at all...
The reason for this wasn't that anyone was ignorant enough to think the programmers knew what was going on. It was that it was a finance company that also employed plenty of mathematicians, statisticians, physicists, and others with the proper math/stats background. The programmers wrote the code, but the math/stats people wrote the business rules and the formulas, and extensively tested that the system worked correctly against a large enough variety of models with expected outcomes to have sufficient confidence that the system's reward/risk measurements were appropriate.
So my answer would be no; I don't find this all that troublesome. We can't be experts at everything and smart companies realize this so they should be creating teams with the correct skillset to be successful.
This is overly pedantic. The code is doing machine learning. In order to write and understand the code you have to understand the machine learning algorithms. Before you even choose a model and tune the parameters you have to know how the parameters interact and how different models work.
Is that any different from a bunch of developers plugging magic numbers into a formula that they made up, which (to a first approximation) is roughly what happens now?
Realistically, the outcome will be the same as it is now: those firms whose models don't reflect reality will blow up, those whose do will get bigger, a few will get too big to fail off some very confidently-expressed models and make a lot of people mad at them, and eventually the market will straighten out who's lying and who's not. Won't be painless, but then, capitalism never is.
> those firms whose models don't reflect reality will blow up,
I'm picturing one of those dystopic films/novels where the main character is deleted/fired/jailed as a result of an algorithm error. Yes, in real life the trends will overcome the bad models. But just think of the potential for harm on an individual basis!
That's sorta the way society has functioned for millennia. In the 20th century alone, hundreds of millions of people died miserable deaths because some guy in power had an incorrect model of reality.
It sucks, it's not fair or just, and everything would run more smoothly if we were omniscient beings living in a completely egalitarian society. Unfortunately, that's not the reality we live in. In the meantime, we accept it as simply fate or vulnerability, and muddle through as best as we can.
And creating "fragile" models because they don't have the tools to reproduce their own experiments. How many authors of academic papers in ML could reproduce the exact same results a year later? I would guess around 10%.
This pisses me off so much. I'm not a mathematician, but I like to think I'm a pretty good programmer. I feel like I could pick up a mathematical concept described in a computer science paper more easily if I could actually see the damn code and run it myself. But most of the papers I've read haven't mentioned where to find the referenced source code or, if they do, it's either horribly written and only runs on the author's machine or it requires specialized software that only a university could afford.
From my interactions with researchers in ML, most of them are actually pretty good programmers. There just isn't an incentive to make your code clean:
1. There isn't much correlation between quantity or even quality of papers you publish and the quality of your code. Meaning, writing cleaner code is not going to help you get that postdoc or faculty position.
2. Doing research is full of stops and starts and branches that fail and approaches that get thrown out. It's a waste of time to write clean code since you know it'll most likely be thrown out. When you do get an approach that works, you publish your paper and move on.
> most of them are actually pretty good programmers.
What is 'good'? In 'software development', 'good' is usually connected to writing clear, maintainable, test-covered code. In most scientific research it means something completely different. I think most people on HN adhere to the former definition, and in that sense most researchers (especially in physics, but also CS / ML) are not 'good', and are actually even bad, because that usually takes quite a lot of years of experience in a corporate setting. But the code works and implements the concepts in their papers, so they are 'good' in that respect. That is more like rapid prototyping: you make a POC to show it works, after which you properly rewrite it.
A decent CS undergrad degree a decade ago included abstract math concepts. I took engineering math, information theory, numerical analysis, probability, and simulation in my sophomore and junior years. NLP and AI were electives in senior year.
As a junior, we were building toy programs that did operations-research type work: solving linear equations via various matrix operations, and designing optimal queue processes based on the Poisson process.
Assuming a software engineer is a CS undergrad, he/she most likely has good footing to learn more by themselves.
In all fairness, that's pretty atypical of a standard CS degree. In my anecdotal experience (knowing people that went to Stanford/Berkeley/MIT/CMU), most people take at most 1 probability class, 1 linear algebra class, and maybe 1 AI/ML class. Info theory, NLP, numerical analysis, optimization, etc. are not at all common.
Or just got a CS degree too long ago. I got lots of discrete math - formal methods, automata theory, and number theory. All that stuff that's in Knuth. But no number-crunching beyond matrix inversion and Fourier transforms.
Bad CS school student here, we don't take any math aside from a very basic "discrete structures" class, which is simplified discrete structures :/ Wish we did more math...
That all seems like quasi-maths, however. What hopefully was being referred to earlier is algebra and analysis, at least up to the 2nd iteration, so one has real understanding of methods of proof.
E.g., no probability class that doesn't require analysis 1 and 2 is truly a probability class.
>> doing "machine learning" without much understanding of mathematics and probability
In my understanding, "machine learning" is just a buzzword for good old-fashioned data mining, which is still a part of applied mathematics/statistics. Just because it involves computers doesn't mean it belongs to CS.
So what you have written sounds to me like "doing applied statistics without much understanding of mathematics and probability". And yes, I am worried about it.
Nah. The only topic I would be worried about is cryptography, when used in a non-learning context. That has a high potential to cause harm. Otherwise with machine learning, I don't see how it is necessarily more dangerous than any other software -- databases, network protocols and so on...
Databases, networking protocols and so forth are hardened, relatively speaking (less the occasional Heartbleed or PoW-blockchain fork). If you have autonomous systems built on top of hardened infrastructure but behaving according to ML models, the impact of their wrongdoings is exponentially higher. It's about top-level autonomy through ML models really: from flash crashes to (future) autopilots.
The same effect of severity vs. position in the control hierarchy goes for human organizations. A cashier can defraud for a couple of hundred dollars, the C-suite at Goldman Sachs / Enron for many times that. The invention of the corporation, as much risk as it entails, was a milestone in human progress though. So yeah, let the predictive, but in their entirety not quite predictable, models run the world. It's worth it.
On top of this, I would add that general trends in information workflow / technological advancement, which ML models like this running the world would certainly fall under, are as close to unstoppable forces as we've ever seen due to the complexity and power of the smaller trends that cause them.
Basically, if this is going to become a thing, then there is no stopping it.
Not really. What will have to happen is a readjustment: a realization that many models, especially those made by people with little experience and training, will be wrong.
Right now it's the glory days of ML when nobody much has the ability to judge success. Unlike software engineering broadly, where these glory days just keep going, ML is all about measuring success. People will detect failures.
The real risk is when people systematically underestimate the risk, like the copula thing that occurred with the subprime market. That was anything but untrained people using models; they would not have been as dangerous as they were if they weren't so damn good to begin with. This is a robustness failure, not a poorly trained workforce failure.
ML is the next commodity on the development stack. It is good to worry over the next few years, but after that, there should be a bunch of pretty solid tools out there for developers to work with. I am among the people that I believe are working on these tools.
The main goal of this article is to make developers understand some of the basics of machine learning, not to make everyone think they can be an expert data scientist after reading just this one article online. This is also stated clearly throughout the article, as there are no golden rules for finding good features, getting the data right, etc. Given this, I don't think you should be worried. Additionally, lots of testing and validation mechanisms and people are involved in the complex systems that you say you are concerned about.
No, because their businesses won't work if they don't understand that stuff. It took us a while to learn it in the anti-spam world, but not that long in the big scheme of things.
No. You won't get hired to do machine learning just because you did a Coursera course and read a few books. If you do, you know you're working at a dead end.
9 times out of 10 it will be a clueless manager who read in Gartner that ML is the next big thing, so they'll put some programmers on it. The programmers will click their heels together at learning something new, and nothing will come out of it, except that they funded the platform for the much smaller group of people who'll actually use it for something useful. Win-win.
I always assume that if a person wants to do much better in ML, they will try to learn things they didn't know before, like statistics, math, etc. Everyone's knowledge is limited, but that doesn't limit what people can do; they just need to learn more, I guess.