Hacker News

The opposite of what you’d think when studying machine learning…

95% of the job is data cleaning, joining datasets together and feature engineering. 5% is fitting and testing models.
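A minimal sketch of that 95/5 split, with entirely made-up data and column names: almost every line is joining, cleaning, or feature engineering, and the model fit is a single call at the end.

```python
import numpy as np
import pandas as pd

# --- the "95%": collect, join, clean, engineer features ---
users = pd.DataFrame({"user_id": [1, 2, 3, 4], "age": [25.0, 32.0, None, 41.0]})
events = pd.DataFrame({"user_id": [1, 1, 2, 3, 4], "clicks": [3, 5, 2, 7, 1]})

df = events.merge(users, on="user_id", how="left")      # join datasets together
df["age"] = df["age"].fillna(df["age"].median())        # clean missing values
feats = (df.groupby("user_id")
           .agg(total_clicks=("clicks", "sum"), age=("age", "first"))
           .reset_index())
feats["clicks_per_year"] = feats["total_clicks"] / feats["age"]  # feature engineering

# --- the "5%": fit a model (ordinary least squares via numpy) ---
X = np.c_[np.ones(len(feats)), feats["age"]]
y = feats["total_clicks"].to_numpy(dtype=float)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef.shape)
```

In real pipelines the prep section grows to hundreds of lines (deduplication, type coercion, outlier handling) while the fit stays roughly this size.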



As it was in the beginning and now and ever shall be amen

At the staff/principal level it’s all about maintaining “data impedance” between the product features that rely on inference models and the data capture

This is to ensure that as the product or features change it doesn’t break the instrumentation and data granularity that feed your data stores and training corpus

For RL problems however it’s about making sure you have the right variables captured for state and action space tuple and then finding how to adjust the interfaces or environment models for reward feedback
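One way to picture the RL point: log complete (state, action, reward, next_state) tuples at the product interface so the training corpus never ends up missing a variable. This is a hypothetical sketch; all names are invented.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass(frozen=True)
class Transition:
    state: tuple       # snapshot of the variables defining the state space
    action: Any        # what the product/agent did
    reward: float      # feedback derived from the interface or environment model
    next_state: tuple  # state observed after the action

@dataclass
class TransitionLog:
    buffer: list = field(default_factory=list)

    def record(self, state, action, reward, next_state):
        # Capture the full tuple in one place, so a product or interface
        # change that drops a state variable fails loudly here rather than
        # silently degrading the training data.
        self.buffer.append(
            Transition(tuple(state), action, float(reward), tuple(next_state))
        )

log = TransitionLog()
log.record(state=(0, 0), action="show_banner", reward=1.0, next_state=(0, 1))
log.record(state=(0, 1), action="show_banner", reward=0.0, next_state=(0, 2))
print(len(log.buffer))
```

The design choice is to make the tuple schema explicit and immutable, so instrumentation changes surface as schema changes rather than as gaps discovered at training time.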


As somebody whose machine learning expertise consists of the first cohort of Andrew Ng's MOOC back in 2011, I'm not too surprised. One of the big takeaways I took from that experience was the importance of getting the features right.


I remember that class. Someone from Blackrock taught it at Hacker Dojo. The good old days of support vector machines and Matlab.


This was very important with classical machine learning. With deep learning, explicit feature engineering matters much less, since the model can learn the relevant features by itself.

However, having a quality and diverse dataset is more important now than ever.


That depends on the type of data, and regardless, your goal is to minimize the input data, since it has a direct impact on the performance overhead and duration of inference.


no we just replaced feature engineering with architectural engineering


>was the importance of getting the features right.

Yeah, but also knowing which features to get right. Right?


In a sense, the data _is_ the model (inductive bias), so splitting "data work" and "model work" like you do is arbitrary.


Same here, it's tons of work to collect, clean, validate data, followed by a tiny fun portion where you train models, then you do the whole loop over again.


> it's tons of work to collect, clean, validate data

That's my fun part. The discovery process is a joy especially if it means ingesting a whole new domain and meeting people.


Sounds like a Data Scientist job?


This is a large problem in industry: defining away some of the most important parts of a job or role as (should be) someone else's.

There is a lot of toil and unnecessary toil in the whole data field, but if you define away all of the "yucky" parts, you might find that all of those "someone elses" will end up eating your lunch.


It's not about "yucky" so much as specialization and only having a limited time in life to learn everything.

Should your researcher have to manage nvidia drivers and infiniband networking? Should your operations engineer need to understand the math behind transformers? Does your researcher really gain any value from understanding the intricacies of docker layer caching?

I've seen what it looks like when a company hires mostly researchers and ignores other expertise, versus what happens when a company hires diverse talent sets to build a cross domain team. The second option works way better.


My answer is yes to both of those

If other peoples work is reliant on yours then you should know how their part of the system transforms your inputs

Similarly you should fully understand how all the inputs to your part of the system are generated

No matter your coupling pattern, if you have more than 1 person product, knowing at least one level above and below your stack is a baseline expectation

This is true with personnel leadership too, I should be able to troubleshoot one level above and below me to some level of capacity.


The parent comment had three examples...


2/3 is close enough in ML world


> I've seen what it looks like when a company hires mostly researchers and ignores other expertise, versus what happens when a company hires diverse talent sets to build a cross domain team. The second option works way better.

I've seen these too, and you aren't wrong. Division into specializations can work "way better" (i.e. the overall potential is higher), but in practice the differentiating factors that matter come down to organizational and ultimately human factors. The anecdotal cases I draw my observations from span organizations operating at the scale of 1-10 people, as well as thousands, working in this field.

> Should your researcher have to manage nvidia drivers and infiniband networking? Should your operations engineer need to understand the math behind transformers? Does your researcher really gain any value from understanding the intricacies of docker layer caching?

To realize the higher potential mentioned above, they need to appreciate the value of those things and of the people who do them, beyond "these are the people who do the things I don't want to do or don't want to understand." That appreciation usually comes from having done and understood that work yourself.

When specializations are used, they tend to manifest in organizational structures and dynamics that are ultimately composed of humans. Conway's Law is worth mentioning here, because the interfaces between these specializations become the bottleneck of your system in realizing that "higher potential."

As another commenter mentions, the effectiveness of these interfaces, corresponding bottlenecking effects, and ultimately the entire people-driven system is very much driven by how the parties on each side understand each other's work/methods/priorities/needs/constraints/etc, and having an appreciation for how they affect (i.e. complement) each other and the larger system.


> There is a lot of toil and unnecessary toil in the whole data field, but if you define away all of the "yucky" parts, you might find that all of those "someone elses" will end up eating your lunch.

See: the use of "devops" to encapsulate "everything besides feature development"


Used to do this job once upon a time - can't overstate the importance of just being knee-deep in the data all day long.

If you outsource that to somebody else, you'll miss out on all the pattern-matching eureka moments, and will never know the answers to questions you never think to ask.


My partner is a data engineer, from what I’ve gathered the departments are often very small or one person so the roles end up blending together a lot.


A good DS can double as an MLE.


And sometimes, a good MLE can double as a DS.

Personally I think we calcified the roles around data a little too soon but that's probably because there was such demand and the space is wide.


“Scientist”? Is this like Software Engineer?


I guess it means "someone who has or is about to have a PhD".


Sounds like a data engineer job to me



