Hacker News

The opposite of what you’d think when studying machine learning…

95% of the job is data cleaning, joining datasets together and feature engineering. 5% is fitting and testing models.
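A minimal sketch of that 95/5 split, with entirely made-up data and column names: almost every line is joining, cleaning, or feature engineering, and the model fit is a single call at the end.

```python
import numpy as np
import pandas as pd

# --- the "95%": collect, join, clean, engineer features ---
users = pd.DataFrame({"user_id": [1, 2, 3, 4], "age": [25.0, 32.0, None, 41.0]})
events = pd.DataFrame({"user_id": [1, 1, 2, 3, 4], "clicks": [3, 5, 2, 7, 1]})

df = events.merge(users, on="user_id", how="left")      # join datasets together
df["age"] = df["age"].fillna(df["age"].median())        # clean missing values
feats = (df.groupby("user_id")
           .agg(total_clicks=("clicks", "sum"), age=("age", "first"))
           .reset_index())
feats["clicks_per_year"] = feats["total_clicks"] / feats["age"]  # feature engineering

# --- the "5%": fit a model (ordinary least squares via numpy) ---
X = np.c_[np.ones(len(feats)), feats["age"]]
y = feats["total_clicks"].to_numpy(dtype=float)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef.shape)
```

In real pipelines the prep section grows to hundreds of lines (deduplication, type coercion, outlier handling) while the fit stays roughly this size.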



As it was in the beginning and now and ever shall be amen

At the staff/principal level it’s all about maintaining “data impedance” between the product features that rely on inference models and the data capture

This is to ensure that as the product or features change it doesn’t break the instrumentation and data granularity that feed your data stores and training corpus

For RL problems however it’s about making sure you have the right variables captured for state and action space tuple and then finding how to adjust the interfaces or environment models for reward feedback
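One way to picture the RL point: log complete (state, action, reward, next_state) tuples at the product interface so the training corpus never ends up missing a variable. This is a hypothetical sketch; all names are invented.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass(frozen=True)
class Transition:
    state: tuple       # snapshot of the variables defining the state space
    action: Any        # what the product/agent did
    reward: float      # feedback derived from the interface or environment model
    next_state: tuple  # state observed after the action

@dataclass
class TransitionLog:
    buffer: list = field(default_factory=list)

    def record(self, state, action, reward, next_state):
        # Capture the full tuple in one place, so a product or interface
        # change that drops a state variable fails loudly here rather than
        # silently degrading the training data.
        self.buffer.append(
            Transition(tuple(state), action, float(reward), tuple(next_state))
        )

log = TransitionLog()
log.record(state=(0, 0), action="show_banner", reward=1.0, next_state=(0, 1))
log.record(state=(0, 1), action="show_banner", reward=0.0, next_state=(0, 2))
print(len(log.buffer))
```

The design choice is to make the tuple schema explicit and immutable, so instrumentation changes surface as schema changes rather than as gaps discovered at training time.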


As somebody whose machine learning expertise consists of the first cohort of Andrew Ng's MOOC back in 2011, I'm not too surprised. One of the big takeaways I took from that experience was the importance of getting the features right.


I remember that class. Someone from Blackrock taught it at Hacker Dojo. The good old days of support vector machines and Matlab.


This was very important with classical machine learning. With deep learning, explicit feature engineering matters much less, since the model can learn the relevant features by itself.

However, having a quality and diverse dataset is more important now than ever.


That depends on the type of data, and regardless, your goal is to minimize the input data, since it has a direct impact on the performance overhead and duration of inference.


no we just replaced feature engineering with architectural engineering


>was the importance of getting the features right.

Yeah, but also knowing which features to get right. Right?


In a sense, the data _is_ the model (inductive bias), so splitting "data work" and "model work" like you do is arbitrary.


Same here, it's tons of work to collect, clean, validate data, followed by a tiny fun portion where you train models, then you do the whole loop over again.


> it's tons of work to collect, clean, validate data

That's my fun part. The discovery process is a joy especially if it means ingesting a whole new domain and meeting people.


Sounds like a Data Scientist job?


This is a large problem in industry: defining away some of the most important parts of a job or role as (should be) someone else's.

There is a lot of toil and unnecessary toil in the whole data field, but if you define away all of the "yucky" parts, you might find that all of those "someone elses" will end up eating your lunch.


It's not about "yucky" so much as specialization and only having a limited time in life to learn everything.

Should your researcher have to manage nvidia drivers and infiniband networking? Should your operations engineer need to understand the math behind transformers? Does your researcher really gain any value from understanding the intricacies of docker layer caching?

I've seen what it looks like when a company hires mostly researchers and ignores other expertise, versus what happens when a company hires diverse talent sets to build a cross domain team. The second option works way better.


My answer is yes to both of those

If other peoples work is reliant on yours then you should know how their part of the system transforms your inputs

Similarly you should fully understand how all the inputs to your part of the system are generated

No matter your coupling pattern, if you have more than 1 person product, knowing at least one level above and below your stack is a baseline expectation

This is true with personnel leadership too, I should be able to troubleshoot one level above and below me to some level of capacity.


The parent comment had three examples...


2/3 is close enough in ML world


> I've seen what it looks like when a company hires mostly researchers and ignores other expertise, versus what happens when a company hires diverse talent sets to build a cross domain team. The second option works way better.

I've seen these too, and you aren't wrong. Division into specializations can work "way better" (i.e. the overall potential is higher), but in practice the differentiating factors that matter come down to organizational and ultimately human factors. The anecdotal cases I draw my observations from span organizations operating at the scale of 1-10 people, as well as thousands, working in this field.

> Should your researcher have to manage nvidia drivers and infiniband networking? Should your operations engineer need to understand the math behind transformers? Does your researcher really gain any value from understanding the intricacies of docker layer caching?

To realize the higher potential mentioned above, they need to appreciate the value of those things and of the people who do them, beyond "these are the people who do the things I don't want to do or don't want to understand." That appreciation usually comes from having done and understood that work yourself.

When specializations are used, they tend to manifest in organizational structures and dynamics that are ultimately composed of humans. Conway's Law is worth mentioning here, because the interfaces between these specializations become the bottleneck of your system in realizing that "higher potential."

As another commenter mentions, the effectiveness of these interfaces, corresponding bottlenecking effects, and ultimately the entire people-driven system is very much driven by how the parties on each side understand each other's work/methods/priorities/needs/constraints/etc, and having an appreciation for how they affect (i.e. complement) each other and the larger system.


> There is a lot of toil and unnecessary toil in the whole data field, but if you define away all of the "yucky" parts, you might find that all of those "someone elses" will end up eating your lunch.

See: the use of "devops" to encapsulate "everything besides feature development"


Used to do this job once upon a time - can't overstate the importance of just being knee-deep in the data all day long.

If you outsource that to somebody else, you'll miss out on all the pattern-matching eureka moments, and will never know the answers to questions you never think to ask.


My partner is a data engineer, from what I’ve gathered the departments are often very small or one person so the roles end up blending together a lot.


A good DS can double as an MLE.


And sometimes, a good MLE can double as a DS.

Personally I think we calcified the roles around data a little too soon but that's probably because there was such demand and the space is wide.


“Scientist”? Is this like Software Engineer?


I guess it means "someone who has or is about to have a PhD".


Sounds like a data engineer job to me



