Why Databricks Is Winning (cloudnativeenterprise.substack.com)
85 points by _ttnp on Feb 14, 2021 | 75 comments


The one thing I see in my current company, and a growing trend with SaaS apps, is that companies are forgetting how to actually engineer. Like Boeing: the more you outsource, the less you're able to react to changing market forces and fix issues.

We run Hadoop & Spark internally, but the team is underfunded and stuck in a constant cycle of fighting fires. The result (part of a larger push by the company, driven by the same cycle of under-funding and culture issues) is that we're moving our petabytes of data into cloud providers' systems. Not only does the cost of doing this dwarf what it would take to actually fix our issues, but we're going to lose the people who know how to design and manage petabyte scale hadoop clusters.

We wind up in a situation where we've locked up data fundamental to our company and our position in the market with a 3rd party, and are losing the talent that would allow us to maintain full control over the data. If the service increases prices, changes its offering, or we get to a point where the offering doesn't meet our needs, we're fucked.

It's nice that Databricks has a nice "offramp" that you can take to go somewhere else, but the general idea is the same.


It's incredibly difficult to fund internal platform teams appropriately. Usually one of three failure patterns emerges:

1) The team is competent but picks up migration work to arbitrary technologies and approaches with no clear ROI. These migrations block feature development and never seem to end, e.g. teams ceaselessly migrating from GCP to AWS, to kubernetes, to podman, from mysql to postgresql, etc.

2) The team is operationally heavy and generates arbitrary requirements for everyone else to follow. The toolchain seems to get worse over time, and the number of hoops to jump through to get anything done endlessly grows e.g. Wait 2 weeks and get three business approvals for a server which you aren't allowed to have root access to.

3) The team has big ideas, but the business constantly underinvests. The team is called to fight every fire but is unable to stop the fires through any meaningful project. The company ends up on a platform that's constantly on fire.

When weighing these execution risks, building an internal platform for just about anything looks incredibly expensive. I've only been at 1 company out of 6 which nailed the internal platform tooling requirements. The only thing I can attribute their success to is quarterly NPS surveys on the developer experience for every major piece of the company's toolchain + hard to meet SLAs for uptime.


I think you need to also include the reality that you graduate to 2 to stop the bleeding from 1 and 3. I’ve been at companies where we have had literally no choice but to implement 2 because the ops team was drowning due to every part of the business demanding to be priority 1 and not being able to actually reason about prod because devs would manually make changes.

The underlying symptom is bad project management and more work than the team can handle but convincing the business that the ops team actually needs to be twice the size and have dedicated managers falls on deaf ears.


As you say, this de-skilling problem is broader than SaaS; it is tied to outsourcing of critical competencies or, in younger companies, never having them in the first place.

If you want to see de-skilling in action, hard core, go look into the service providers, wireless and fixed. They are running on fumes and attrition-victorious teams that last had a new technology in the late 70s/early 80s because everyone good at networking went to the FANG predecessors and then FANG proper.


It’s also a problem with compensation. Many companies take an incredibly stupid and paternalistic attitude that employees should be happy with 3% - 5% raises, no bonuses, and refresher equity way less than their original grant.

Anyone actually talented just leaves after 2-4 years, so you get zero cohesion or snowballing effect of talented tech leaders who other good people actually want to work with.

These places become total career death. Eventually it becomes a place where only people with kids or other significant life obligation go, specifically because they value no-ambition bureaucracy that trades off compensation and risk taking in exchange for work life balance.

Those places are so soul crushing. Avoid at all costs.


> we're going to lose the people who know how to design and manage petabyte scale hadoop clusters.

Why is that different from "lose the people who know how to design and write accounting systems from scratch?" That was what happened when packaged accounting systems showed up. I'm not sure why you would want to preserve knowledge of Hadoop if other technologies are more efficient.


This is a good point. By the same token, most of us have no idea how to milk our own cows or forge our own knives. Having businesses dedicated to one craft and providing their services to those who'd rather be doing other things seems like a natural model humans - and nature itself - have been following for millennia.

It's not like stem cells worry about losing the ability to excrete bile when they become neurons.


This happens with every technology stack from steam engines to software. There was a time when programmers could solder together an ALU using transistor gates.


It might be the same with 'trains' (steam engines); I don't know much about that. But losing skills in 'the cloud' makes you a prisoner of PaaS/SaaS providers; if these skills almost completely disappear (let's say, akin to your transistor gates, that almost no-one can install or maintain server setups anymore), the cloud providers really can do whatever they want price-wise. For instance, if a cloud provider misconfigures something at scale in this great future, there is no way for anyone to verify that; it might be that you are paying 1000x what you should be paying for a particular database query that triggers a perfect storm in your set-up. As no-one in your company knows and no-one can check what is going on, or even if anything is, it'll be considered 'normal' and you just pay. This is already happening, at scale, not through any fault of the cloud providers but because people are already not learning these skills. If I got $100 for every missing index I find per month in clients' setups, I would be working a few hours/month for a very nice wage. Heck, most startup databases I see these days don't have any indexes at all; and then they wonder why they are paying $10k/month at aws while I can get them the same or better performance for a few bucks/month on a vps. Not because I'm so smart, but because they are doing everything 'low-level' wrong, since no-one knows anymore how a database works (it's the cloud! we shouldn't have to!). It's not restricted to relational dbs either; sometimes I cry when I see dynamodb setups made by people who think this is just the cloud way of working: 'throw in whatever data, it'll always optimise itself'. So this is already happening at many companies and it's getting worse.
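
To illustrate the missing-index point, here is a minimal sketch using Python's built-in sqlite3 module. The table, column and index names are made up for the example, and SQLite merely stands in for whatever database a given setup actually runs; the exact wording of the plan text varies by SQLite version.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]


# Hypothetical table: 10,000 orders spread over 1,000 customers.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 1000, float(i)) for i in range(10_000)],
)

lookup = "SELECT total FROM orders WHERE customer_id = ?"

# Without an index the planner has to scan the whole table.
plan_before = query_plan(conn, lookup, (42,))

# With an index on the filtered column it can search the index instead.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, lookup, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

On a managed database billed per unit of work, the difference between the two plans shows up as cost rather than just latency, which is why nobody notices until the bill arrives.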


A similar thing happened in the past when application developers (most apps were written in assembly back then) built a dependency on instruction sets. If the computer manufacturer underperforms in future years, you are stuck. This dynamic can never be erased even as we keep climbing higher on the stack.


Well yeah I could probably have done that with some 7400 series TTL ... but the MIPS wouldn't nearly cut it in today's world :)


Another thing I noticed that adds to this point: the investors (well, their tech DD people) seem to really push this. I have seen more and more teams not advancing to the next round because they wanted (with good reason imho) to engineer their own tech rather than 'just go with aws' blindly, with AWS costing literally over 10x as much (that's lowballing; more often 20-30x and sometimes 100x). As a startup, that's not always the best choice imho, but I see people getting forced into it because investors (again, their tech DD peeps) reason that it scales better than people. It does, but that's usually not needed at all at seed, and even far beyond for many types of businesses. In that, they lose sight of the fact that these skilled people might not join at all, because the money you could've paid them in the early days to get/keep them on board now disappears into the cloud. It seems to be the dream of most investors (here in the EU at least) to run companies with only MBAs, and this is a solid step in that direction.


Using third party tools doesn’t lock you out of them. Nothing stops you from collecting that data before sending it off.

Every company outsources something fundamental. Does Google mine its own metal? Generate its own electricity?

Even if you did it on prem, that doesn’t save you from license renewal costs or upgrades. You can write the software yourself, but that’s not cheap either.


Are metal and electricity Google's core competencies though?

I don't think anyone is arguing one should maintain their own silica, atoms, independent universe, etc.


Using Hadoop doesn’t mean your core competency is maintaining Hadoop. Analysts don’t want to be engineers.


> independent universe,

Solipsystem, Inc. begs to differ. Having it your way is a booming business nowadays.

(j/k, borrowed an SF title https://en.wikipedia.org/wiki/Solip:System )


Some additional background:

* AWS has a managed Spark offering called EMR

* EMR pricing (https://aws.amazon.com/emr/pricing/) is lower than Databricks pricing (https://databricks.com/product/aws-pricing)

* Databricks notebook development experience is better than EMR (but still really basic compared to IntelliJ / PyCharm text editing)

* Both Databricks & EMR have proprietary Spark runtimes

* Databricks is building a Spark runtime in C++ that might be faster (Delta Engine)

* Spark lets you process massive datasets easily, with small teams. 2-3 person teams can build data ingestion pipelines to clean & process terabytes of data a day. It's an incredible technology.

* The difference between the PySpark & Scala APIs confuses the hell out of people

* Whether or not ppl can run Python machine learning models on Spark clusters confuses people

* Overreliance on notebooks causes big issues (no version control, tests, deployment process, dependency management)

The big data ecosystem is constantly evolving and you need to study constantly to keep up.


We recently ran a large clustering job over billions of records (and a few TBs of data on spinning disks) on a single machine with minimal command line tooling in a few hours. Not really optimized yet. People forget how fast modern hardware is and overestimate how much useful data they have (or need).

I think I should start a company around minimalistic data tooling or the like - the amount of waste seems large across the industry.

I saw the de-skilling a few years ago, where a guy stitched together a complete application from a couple of SaaS APIs. Cool, but it somehow does not impress me.


> People forget how fast modern hardware is and overestimate how much useful data they have (or need).

Where your model breaks down is when you have folks throughout your engineering org who have data needs but don't have folks to spin up bespoke pipelines like this and spend time optimizing them. You have data, you want to write something SQL-ish or have some nice APIs, and you want to get results dumped into a predictable place. When you have mixed workloads, mixed data sources, and those jobs are being tweaked and changed frequently, the actual underlying compute cost is hardly the issue. Getting the data, chewing on it [fast enough] without having to spend much time optimizing, getting it to its destination, and making that happen regularly and reliably is where the value is.


Great point, r5.metal instances have 96 CPUs and 768 GB of RAM. Lots of "big data problems" can actually be solved with a single big EC2 instance. Cluster computing should always be avoided when a single node will do.
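
As a rough illustration of why one node often suffices, here is a minimal Python sketch of a streaming group-by count. The column names and sample data are made up. Because rows are consumed one at a time, memory use grows with the number of distinct keys rather than with the size of the input, so the same function works on a file far larger than RAM.

```python
import csv
import io
from collections import Counter


def count_by_key(lines, key_column):
    """Count rows per value of key_column while streaming over the input."""
    counts = Counter()
    # csv.DictReader pulls one line at a time from any iterable of strings,
    # so the full dataset is never held in memory.
    for row in csv.DictReader(lines):
        counts[row[key_column]] += 1
    return counts


# Tiny in-memory stand-in for a large file; with real data you would pass
# an open file handle instead.
sample = io.StringIO("user,event\na,click\nb,view\na,view\n")
counts = count_by_key(sample, "user")
print(counts.most_common())
```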


We had something like 24G of RAM, but we have 500GB RAM machines as well (we own the hardware) and the job would have been even more of a breeze there.


Bare metal can be incredibly fast, if you can get access for a reasonable price. Virtualization is becoming a continuum but the overhead is always there.


> * AWS has a managed Spark offering called EMR

There is also my rinky-dink open source project, Flintrock [0], that will launch open source Spark clusters on AWS for you.

It's probably not the right tool for production use (and you would be right to wonder why Flintrock exists when we have EMR [1]), but I know of several companies that have used Flintrock at one point or other in production at large scale (like, 400+ node clusters).

[0]: https://github.com/nchammas/flintrock

[1]: https://github.com/nchammas/flintrock#why-build-flintrock-wh...


The last point isn’t so dramatic. There is version control for notebooks, and better version control is coming (right now it’s in preview, code-named Projects, and it’s in the official docs). Tests are possible: either Nutter from Microsoft, or home grown (about 20-30 lines on top of the built-in unittest). Deployment is also not so complicated: there are tools (the provider for terraform, the cicd-templates project, Databricks-cli, etc.), and you can refer to MS Learn for a short course about CI/CD for Databricks and Azure DevOps, etc.
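
For a sense of what a home-grown test harness on top of unittest can look like, here is a minimal sketch. It is plain Python unittest, not Nutter, and `dedupe_latest` is a hypothetical stand-in for logic that would otherwise live in a notebook cell. The runner uses `TextTestRunner` instead of `unittest.main()` so that it returns a result object instead of exiting the interpreter, which matters inside a notebook session.

```python
import unittest


def dedupe_latest(records):
    """Keep only the newest record per id (hypothetical notebook logic)."""
    latest = {}
    for rec in records:
        current = latest.get(rec["id"])
        if current is None or rec["updated_at"] > current["updated_at"]:
            latest[rec["id"]] = rec
    return sorted(latest.values(), key=lambda r: r["id"])


class DedupeLatestTest(unittest.TestCase):
    def test_keeps_newest_record_per_id(self):
        records = [
            {"id": 1, "updated_at": 1, "value": "old"},
            {"id": 1, "updated_at": 2, "value": "new"},
            {"id": 2, "updated_at": 1, "value": "only"},
        ]
        result = dedupe_latest(records)
        self.assertEqual([r["value"] for r in result], ["new", "only"])

    def test_empty_input(self):
        self.assertEqual(dedupe_latest([]), [])


def run_tests():
    """Run the suite and return the result object without exiting."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(DedupeLatestTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_tests()
```

A CI job can call `run_tests()` and fail the build when `wasSuccessful()` is false.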


The last point was for teams that only rely on notebooks, sorry if I didn't make that clear.

You're right that all those issues can be sidestepped if you build projects in version controlled Git repos, test the code, and deploy JAR / Wheel files.

Speaking of testing, can you let me know if this PySpark testing fix worked for you ;) https://github.com/MrPowers/chispa/issues/6


As alexott says, you can link notebooks to a git repository and commit and rollback them from the Databricks UI (https://docs.databricks.com/notebooks/github-version-control...)

Problem is that it’s limited. You can’t, for example, commit multiple files in one go (that improves a little bit in https://docs.databricks.com/projects.html), or merge changes a colleague made with your changes. You also have to use the UI Databricks provides. You can’t use a git CLI or whatever GUI you prefer. (all AFAIK, but I’m fairly certain about it)


I’m sorry for delay, will fix ASAP...

My point is that you can do that even without jars/wheels - you can do VC and tests of notebooks. For example, https://github.com/alexott/databricks-nutter-projects-demo


> Overreliance on notebooks causes big issues (no version control, tests, deployment process, dependency management)

The last three of those things might be valid issues, but since when are notebooks not just as subject as any other source code format to version control (I get that the difference between the UI and the on-site structure may make typical diff tools less-than-ideal, but VC itself is unaffected.)


In my experience the use of notebooks (exclusively) goes hand in hand with not knowing things such as version control, tests, deployment process or dependency management exist.

I don't mean to sound harsh about other Data Scientists from a non software engineering background but the standard workflow is to fiddle around with a notebook until you can get a result. That's as far as it goes, no real robustness to it.

That's a pretty big generalisation but in organisations where they "home grow" their Data Science capability many of the online courses don't cover production level Data Science.


Your experience aligns with what I've seen.

All the notebooks are in one place. Some are for important production jobs, others are for data exploration.

It's easy to make a little edit in a notebook and accidentally break production jobs.

Comparatively harder to make an edit in a git repo and do a deploy that'll break production jobs (e.g. if the JAR doesn't compile or the CI errors out cause the tests don't pass).

Notebook based production jobs get even more dangerous when NotebookA depends on NotebookB and so on.


Treat notebooks like other code - separate onto staging and production, with defined promotions between them - it’s possible. You can run tests in CI/CD pipelines, etc. You can set permissions so nobody can update production notebooks manually, ...


> The difference between the PySpark & Scala APIs confuse the hell out of people

I personally had this experience when first wanting to learn spark, and it really turned me off to the whole spark ecosystem. Curious if you have any suggested resources that do a good job on this?


Spark offers Scala, Python, Java, and R APIs. Scala & Python are the most viable options (R lacks a lot of features and Java is only good for ppl that love Java).

Scala & PySpark are both great options. Lots of devs are terrified of Scala, so PySpark is more popular now. I'd say Scala has a slight technical advantage, see here for more details: https://mungingdata.com/apache-spark/python-pyspark-scala-wh.... Both are great overall.

I wrote a book that's a practical introduction to Spark: https://leanpub.com/beautiful-spark/

Most of the training materials are theoretical, which makes Spark seem really intimidating. You can learn some basic Spark principles and get up-and-running with production workflows quickly.


"Fully managed Spark" sounded awesome a few years ago. But now, Spark can run on Kubernetes clusters [1]. If your infrastructure is already running on Kubernetes then you already have a cluster capable of running Spark. And because of the magic of Kubernetes you don't even have to dedicate nodes to Spark.

[1] https://spark.apache.org/docs/latest/running-on-kubernetes.h...


100%.

The DataBricks notebooks have a lot of value (for now) but ever since running Spark on Kube... I have had literally 0 cluster issues. It's absolutely shocking, coming from a YARN-based environment, where I was constantly plagued with ResourceManager issues, autoscaling issues, preemption issues, network disconnect issues...


For reasons I don't understand our company ended up choosing Google Cloud, and it is a complete mess. You have BigQuery, Dataproc, AI Platform training, Colab, AI Platform notebooks, and many tools that do the same thing but are not well integrated. My personal favourite is AI Notebooks, but I need additional plugins to interact with BigQuery, S3 and GCS. It requires a lot of customization; before this we used Azure and Databricks, where we had a one-stop shop. I heard Google is integrating some of their products, but that will take a while. In the meantime we lose hours of productivity figuring out how to use their products (whose stability is horrible; for example, AI Platform Jobs and the permission model).


Hey, I replied[0] to your submission[1] a couple of weeks ago and elaborated on our internal platform for collaborative notebooks, scheduled notebooks, and model management. How is your effort going so far?

- [0]: https://news.ycombinator.com/item?id=25940923

- [1]: https://news.ycombinator.com/item?id=25938654


I’ve had a lot of success with Dask lately. It’s comparable to spark in some ways [0]. Being written in python and built on top of pandas/numpy it allows much more flexibility. It’s also easier to get adoption from data scientists who are usually more comfortable with python than scala. It also has great tools built on top of kubernetes making deployment quick and easy [1].

[0]https://docs.dask.org/en/latest/spark.html

[1]https://github.com/dask/dask-gateway


Just for clarification, dask lets you run any python function on any python datatype. Numpy/pandas is faster, but anything is doable with dask. This makes it eminently flexible, unlike many «big data» tools that only work on tables, or other arbitrary limitations.


Yes, you can use dask’s lower level APIs, like futures [0] or dask delayed, and it’ll just pickle the python objects. Dask provides a group of collections APIs, like dataframes [1] and arrays that use more efficient serialization methods.

[0]https://docs.dask.org/en/latest/futures.html

[1]https://docs.dask.org/en/latest/dataframe.html


My feeling is that Spark is better for huge ETL jobs and Dask is better for certain types of model building (because of easy access to Python libraries), but I don't have any benchmarks to back this up. Would love some benchmark results comparing processing tens of terabytes on equal sized i3.xlarge EC2 clusters to get a better idea of relative performance.

Have you seen any good Spark vs. Dask benchmarks?


The area that we've seen Dask shine the most is ML. This is because the Python ecosystem has more advantages when it comes to ML and there is better GPU support.

https://towardsdatascience.com/supercharging-hyperparameter-...

https://towardsdatascience.com/random-forest-on-gpus-2000x-f...

Disclaimer: We produced those benchmarks. I'm a founder of Saturn (https://www.saturncloud.io/) and we focus on providing Databricks-like capabilities with Jupyter + Dask + Prefect, so I definitely have strong feelings in this area.


I would be interested in seeing some spark v dask benchmarks too. Haven’t seen any yet though.

In my experience, Dask really shines when you implement custom numpy computations that could only be done in spark UDFs. We saw a decent performance difference there, but for common built-in computations I’d imagine that spark has better performance.

Edit: after some googling I found this paper with benchmarks.

https://arxiv.org/pdf/1907.13030.pdf

> Results show that despite slight differences between Spark and Dask, both engines perform comparably. However, Dask pipelines risk being limited by Python’s GIL depending on task type and cluster configuration. In all cases, the major limiting factor was data transfer.


It's also worth noting that with Spark, you can perform arbitrary computation using the Dataset API and operating on case classes.


Databricks seems to be on a convergent evolution towards Snowflake. Between the two of them, I’d rather be starting from Snowflake’s position than Databricks’.


They are both going towards the "Data Lakehouse" end point which is driving some of the convergence. Silly term, but basically providing analytics and a database type experience over a data lake.

That said, Databricks is a much broader platform, with all of the collaboration environments and is generally much more programmable than Snowflake.


> providing analytics and a database type experience over a data lake.

It's interesting how the modern data lake is developing in this way, recreating many patterns from the traditional database for distributed systems and massive scale: SQL and query optimization, transactions and time travel, schema evolution and data constraints...

Having started out as a database developer / DBA many years ago, working with data lakes today reminds me in many ways of that early part of my career.

I wrote a post tracing a common interface from the typical relational database to the modern data lake.

https://nchammas.com/writing/modern-data-lake-database


We've had mixed experience with Databricks, to be honest. The quality and responsiveness of support we were getting was not worth the amount of issues we were seeing + the considerable cost, so we decided to use "exit strategy to DIY Spark" (using the original post's terms) instead and we're pretty happy so far.


Cost is one thing we didn't really get to grips with. They use a fairly abstract Databricks Processing Unit - https://databricks.com/product/aws-pricing - and the costs felt disjointed compared to the workloads we were running.

This said, cost didn't really spiral or become an issue for us, especially when you take into account the cost avoidance of administering the Spark cluster and all of the tools. However, the pricing model did feel a little opaque.


Their sales team is somewhere between Oracle and a pro wrestling villain in terms of sleaziness.


Interestingly, my team is actually moving off Databricks. I have anecdotally heard the same from other teams in the industry (buy side finance).

We found that notebook-based development is actually an antipattern for software engineering. It was ostensibly helpful for the narrower "data science" use case, but we have a much more robust ETL and research platform we built on our own using Pandas, Dask, Prefect and AWS.

And personally I hated writing code in notebooks. If you're attached to that, you can basically get the same thing by using PyCharm in scientific mode with cell execution.


Maybe I'm missing something, but revenue per employee doesn't seem to be rising, which is what I would expect to see out of a successful startup that has found product market fit and is beginning to gain economies of scale. All I see is a capital raise 2x the previous raise each year, with no sign of profit and only a proportional increase in revenue.

I can build a business like that: sell Hadoop consulting services at a loss, below the market price. Win lots of business by being really cheap; grow as fast as your cashflow allows, which will require annual investment rounds of ever increasing size.

When I look at Google Trends, Hadoop's long-term trajectory looks like "dead by 2024". Do we expect that Databricks will be able to justify a $8B capital raising if Hadoop doesn't really exist? If not, what's the path to profitability?

(Not trying to say anything bad about Databricks or their management, I'm just genuinely wondering what's going on.)


For what it’s worth, we work with them and thus have been following closely. Databricks’ value add is that they are end to end for interfaces to use your data. From ingest to exploration to DS/ML.

1. Your metrics are wrong about revenue/employee. We’ve heard they’re growing well by all standard SaaS company metrics.

2. Hadoop consulting business is just a completely incorrect description of the platform. Hadoop is dying because of companies like Databricks, and all that needs to exist to justify their valuation is people bringing them large data workloads.

Their biggest problem is release maturity. Between Azure being down and them being down, stuff is down too often.


DBFS is a complete joke though. A filesystem that has no timestamps? Really?



It's an abstraction over other file systems, like S3 or Azure BLOB storage. More a convenience than anything else, and helps ease porting code between cloud providers.

Not a traditional file system, or even a HDFS clone.


> logging into the same system and interacting with the same datasets through the same Notebook based UI

I don't think there's a good one size fits all UI that can be applied to the different types of work that take place with data and consumption by users. This is evidenced by Databricks' own feature set, which includes integrations with R Studio and Tableau that treat Databricks as a data source rather than a work environment.


A (big?) part of Databricks' success is the complicated mess that Spark is to run.


Considering the makers of Databricks are also the maintainers of Spark it makes you wonder whether this is deliberate


I'd say the difference in valuation between Snowflake and Databricks clearly shows the latter is not "winning".


Snowflake and Databricks are different, sometimes complementary technologies. You can store data in Snowflake & query it with Databricks for example: https://github.com/snowflakedb/spark-snowflake

Snowflake predicate pushdown filtering seems quite promising: https://www.snowflake.com/blog/snowflake-spark-part-2-pushin...

Think both these companies can win.


There's definitely a place in the industry for a no 1 and a no 2, but no. I don't think they're complementary, I think they're competitive. Both are trying to own the whole storage + compute layer eventually. I guess you can see Snowflake as a storage layer but they have attached scalable compute to it. Databricks started out as scalable compute and is now becoming more focused on storage.

The idea of storing your data in Snowflake and querying it in Databricks is pretty silly. Why would you want to do that? Why not just use Snowflake's compute? Sure you could argue Spark has some transformations that are hard to express in SQL, but that is why Snowflake introduced Snowpark.


Snowflake spends an insane amount on marketing.

And they’re Apples and Oranges. Snowflake is more comparable to Athena and BigQuery.


Why does nobody discuss Apache Pulsar as a viable alternative?


Apache Pulsar is a streaming platform more comparable to Kafka. It doesn’t have built in parallel computation APIs like spark. You can hook spark streaming up with pulsar as a data source though.

https://pulsar.apache.org/docs/en/adaptors-spark/


I actually posted on a somewhat related topic here:

- https://managingml.substack.com/p/ml-model-training-is-an-et...

As an ML manager one of the things that I dislike about the Databricks model is the reliance on notebooks and bespoke interactive cluster workflows.

ML workloads are stuck in the stone age because there’s no common pattern to plug them into existing ETL frameworks, but really that is what they are.

Model training is an ETL that just happens to need unique domain tools to visualize how it’s succeeding / track observability metrics. But apart from that, you really need to automate model training and experimentation.

It is a total antipattern to ever use notebooks, or even non-notebook code developed interactively. Rather you should be submitting tasks to a task queue that maps your ML exploratory workload or training workload to a scheduled compute resource, runs it with observability baked in, and treats outputs like schema’d outputs of more traditional ETLs.

Instead of that, Databricks is much more of an early 2000s MATLAB model. Make it addictive and easy for unpragmatic researchers detached from real production use cases, then figure employers will have to adopt it since all the expensive-to-hire researchers can only be productive with it.

Long term I think it’s a very bad gamble. Just consider how much open source Python tools have eaten MATLAB’s lunch in the past 10 years.


Can't disagree with you more here. Databricks is a tool. It's on you to create your frameworks or whatever to optimize your workflow. I've been using Databricks and it's such a great platform, at least compared to what I've used in the past (EMR/Sagemaker). Simply build a framework where you offload the notebooks and exploratory stuff to databricks and build out productionable pipelines locally using sample data from dbricks, then test on databricks. That's what we've been doing and it's working very nicely. Furthermore you can remotely kick off jobs on databricks, so once you have your pipeline worked out you really don't need to interact with notebooks if you don't want...just use it as a scalable backend, dashboard, model tracker, whatever fits in your workflow.

I feel like instead of using the tool a lot of folks have specific expectations and if the tool doesn't fit exactly into how they work then it's just written off.


The trouble is that those expectations exist for a reason, and there’s a whole world of best practices and efficient patterns that have developed specifically to solve the challenges of orchestrating and monitoring ETLs. What feels convenient for researchers using a notebook is just fool’s gold (for known, legit reasons), and it damages credibility with other domains of engineering when their widely vetted best practices are ignored just for the sake of convenience tools like notebooks or manual job kickoff interfaces through a vendor tool like Databricks.


They (Databricks) are not advertising themselves as a best practices way of doing anything, they're just a platform for doing data science and analytics. It's up to the data scientist to use the tool properly and to have proper methodologies. A platform is simply a set of tools that helps a target audience, and that's what databricks is. Now the issue here is that a lot of data scientists either cannot code/do engineering or just don't want to. But in my view people like that are just glorified analysts. Data Science is built on the foundation of software engineering (+ stats, viz, math, etc...); this is why it's so complicated. If you cannot code or don't understand the SDLC or best practices you're basically a carpenter who can't hammer or saw.


The question is not whether Databricks claims to be anything. The question is just what counts as a best practice from an ETL / DevOps point of view, and how to make ML tooling adhere to that from first principles.

I’ll give a concrete example. In my org we use Dataproc on GCP as a model training task execution paradigm. You define your base environment via some Docker container, put it in GCR, and then define Dataproc jobs in terms of the base environment, the backing compute resources, any GCS bucket connections, and any ML-specific config like hyperparameters.

A human being never under any circumstances triggers these jobs. Instead a human user deploys the config as a cronjob or regular job in Kubernetes, and then a scheduler picks them up and runs them. For experimental workloads only, developers can manually trigger a Kubernetes job.

Each job consults config, spins up the appropriate Dataproc cluster, runs the job (with visualization tools exposed on ports at the cluster node IPs), and saves artifacts to GCS when done.
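
To make the shape of that config concrete, here is a minimal Python sketch. It is not our actual tooling: `TrainingJobConfig`, `config_fingerprint`, `plan_job`, and all field names and example values are made up for illustration, and the steps are only described as strings rather than calling Dataproc or Kubernetes.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class TrainingJobConfig:
    """Hypothetical job definition; every field name is an assumption."""
    image: str            # container image that holds the base environment
    machine_type: str     # backing compute for each cluster node
    num_workers: int
    input_bucket: str     # GCS location the job reads from
    output_bucket: str    # GCS location artifacts are saved to
    hyperparameters: dict = field(default_factory=dict)


def config_fingerprint(config: TrainingJobConfig) -> str:
    """Stable short hash of the config, so identical configs map to one ID."""
    payload = json.dumps(asdict(config), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def plan_job(config: TrainingJobConfig) -> list:
    """List the steps a scheduler-triggered job would take, in order."""
    fingerprint = config_fingerprint(config)
    return [
        f"spin up cluster: {config.num_workers} x {config.machine_type}",
        f"run job from image: {config.image}",
        f"read inputs from: {config.input_bucket}",
        f"save artifacts to: {config.output_bucket}/{fingerprint}/",
    ]


example = TrainingJobConfig(
    image="gcr.io/example-project/trainer:1.0",  # placeholder image name
    machine_type="n1-standard-8",
    num_workers=4,
    input_bucket="gs://example-input",
    output_bucket="gs://example-artifacts",
    hyperparameters={"learning_rate": 0.01, "epochs": 10},
)

for step in plan_job(example):
    print(step)
```

Because the fingerprint depends only on the committed config, two people running the same config land on the same artifact path, which is the reproducibility property described further down.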

All of this is controlled via clean and easy internal CLI tools and wrappers to make it simple for any developer.

The number one thing this ensures is that no work ever exists in notebook format, beyond tiny scratch work a developer might do strictly to debug code or try a small data proof of concept.

The number two thing this ensures is complete reproducibility. Since every possible training task must go through code review, commit all config to version control, get impounded into a container, and execute via a deployed Kubernetes job, it is by definition impossible for someone to execute an ad hoc task that other engineers can’t rerun or have to follow weird setup steps to recreate (it’s all impounded in the container).

The third thing it ensures is that all accuracy, monitoring and results artifacts are explicitly tied to the Kubernetes job that controlled the process. It is not possible for some accuracy result to float around untethered from a job ID that uniquely and conclusively ties it to all relevant code for the job. This can be facilitated through MLflow or whatever else.
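
A rough Python sketch of that rule, with invented names (`JobRef`, `ResultStore`) and an in-memory list standing in for MLflow or whatever tracker you use: a result simply cannot be recorded unless it carries the job ID, commit, and image it came from.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobRef:
    """Hypothetical identity of one training job."""
    job_id: str        # ID of the Kubernetes job that ran the training
    git_commit: str    # commit of the reviewed code and config
    image: str         # container image the job executed in


class ResultStore:
    """In-memory stand-in for a real tracking backend."""

    def __init__(self):
        self._records = []

    def record(self, job: JobRef, metric: str, value: float) -> dict:
        # Refuse any result that is not fully tied to a job.
        if not (job.job_id and job.git_commit and job.image):
            raise ValueError("result must reference a fully identified job")
        entry = {
            "job_id": job.job_id,
            "git_commit": job.git_commit,
            "image": job.image,
            "metric": metric,
            "value": value,
        }
        self._records.append(entry)
        return entry

    def for_job(self, job_id: str) -> list:
        """Return every result recorded for one job ID."""
        return [r for r in self._records if r["job_id"] == job_id]


store = ResultStore()
job = JobRef(job_id="train-0001", git_commit="abc1234", image="trainer:1.0")
store.record(job, "f1", 0.87)
print(store.for_job("train-0001"))
```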

Getting data scientists to “wear corrective shoes” and learn to reorient their way of working to align it with this process has universally paid dividends, both for letting the data scientists experiment faster and more reliably, and for ensuring model training adheres to SRE-related compliance and best practices, so it is pluggable into various tools and constraints those non-ML support teams need in order to do their jobs and offer support to ML teams without getting hit with unstructured notebook spaghetti and bespoke execution paradigms.


This is the direction we're going towards with our internal platform[0] that we built to make our life easier.

We do have real-time collaborative notebooks. However, we also have long-running notebook scheduling, where notebooks get picked up by runners and executed against certain environments. One cool feature is that you can watch the notebook's output as it runs, even if you closed the browser or logged in from another computer. Another cool feature is automatic experiment tracking: we detect parameters, models, metrics, and the notebook automatically, removing the cognitive load from the user to "remember" to do that. People rarely remember clerical things, and even when they do, we didn't want to pollute the notebooks with tracking code from, say, MLflow. This happens automatically.

These long-running notebooks are simply jobs. There are other types of jobs: building a Docker image for the model so that you can push it to a registry is also a job, so is deploying a model to get a "REST" endpoint and a live dashboard to monitor model performance.

One of our guidelines is: "Anything that is possible with point and click must be possible with an API call", so we make sure to make everything programmable. For example, scheduling a long-running notebook can be done by sending a request. We are also working on adding webhooks that use the CloudEvents spec, so that people can take information from our system, such as a job having finished, and trigger things in other systems. Another guideline is workload portability.
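
As a rough illustration of the webhook idea (not our actual payloads), a job-finished notification wrapped in a CloudEvents 1.0 JSON envelope could look like the Python sketch below. The `specversion`, `id`, `source`, and `type` attributes are the ones the spec requires; the event type string, the source path, and the contents of `data` are invented for this example.

```python
import json
import uuid
from datetime import datetime, timezone


def job_finished_event(job_id: str, status: str) -> dict:
    """Build a CloudEvents 1.0 envelope for a finished job (illustrative)."""
    return {
        # Required CloudEvents attributes.
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": f"/jobs/{job_id}",            # hypothetical source path
        "type": "com.example.job.finished",     # hypothetical event type
        # Optional attributes.
        "time": datetime.now(timezone.utc).isoformat(),
        "datacontenttype": "application/json",
        # Payload shape is an assumption for this sketch.
        "data": {"job_id": job_id, "status": status},
    }


event = job_finished_event("1234", "succeeded")
body = json.dumps(event)  # what a webhook would send as the request body
print(body)
```

A receiving system only needs to look at `type` to decide whether it cares, then read `data` to trigger its own work.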

Again, it's still early and we're prioritizing the issues that have frustrated us the most over the past years in our projects. There's a long way to go, especially since there are many people such as yourself who may have worked on problems we've never seen, or at a scale or with data we haven't handled before.

- [0]: https://iko.ai


Also there's a level of automation for model tuning (experiments), especially if this is an update to an existing model (you still need to do model validation). For the initial model there's gonna be automation, but a lot of time will be spent validating the model against business KPIs, not just some model metric like accuracy. Databricks makes this automation and tracking pretty simple. You don't have to do this in notebooks. Once you're done with exploration, build out the pipeline in your "framework" and kick off jobs to execute it on Databricks. Yes, you can use ETL tools here to, I dunno, train every x hours... whatever.


Each round trip of model training (such as when optimizing hyperparameters) should be a distinct run through the ETL stages, with dedicated artifact tracking. Observability and monitoring for ML model training ETLs should always include whatever end criteria the business uses to judge the model, such as an ETL step for real acceptance testing, not just domain specific accuracy like F1 score or ROC curves.
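
As a minimal Python sketch of this, assuming a toy scoring function in place of real training: every hyperparameter combination becomes its own run with its own ID and record, and each record carries the acceptance verdict next to the model metric. `train_and_score`, `passes_acceptance`, and the 0.9 threshold are placeholders, not anyone's real criteria.

```python
import itertools
import uuid


def train_and_score(params: dict) -> float:
    """Toy stand-in for a full training run; returns a made-up score."""
    return 1.0 - abs(params["lr"] - 0.1) - 0.01 * params["depth"]


def passes_acceptance(score: float, threshold: float = 0.9) -> bool:
    """Placeholder for the business acceptance test, run as its own step."""
    return score >= threshold


def sweep(grid: dict) -> list:
    """Run every combination in the grid as a distinct, tracked run."""
    keys = sorted(grid)
    runs = []
    for values in itertools.product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = train_and_score(params)
        runs.append({
            "run_id": uuid.uuid4().hex,   # one ID per round trip
            "params": params,
            "score": score,
            "accepted": passes_acceptance(score),
        })
    return runs


runs = sweep({"lr": [0.1, 0.5], "depth": [1, 2]})
for run in runs:
    print(run["run_id"], run["params"], run["accepted"])
```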


I agree, but you're looking at everything through an automated ETL workflow, which sorta invokes the idea of using a tool like Airflow. But generally you don't need that. What you need is a specific set of functionality (training, tuning, testing) that you can either kick off manually when you wanna do things interactively or hook up into a pipeline when you wanna automate things. And yes, you can track everything using MLflow on Databricks. As a matter of fact, Delta Lake is one of the most powerful features, as you track not just migrations but data changes, so you can track end-to-end lineage, and so far it's the ONLY platform (that I've tested) that allows this with ease.


> “ But generally you don't need that.”

I think this is wrong. This is what ML researchers tend to think when they are ignorant of why DevOps best practices exist and what other teams need in order to provide underlying infrastructure support. You’re only thinking of “what you need” from the point of view of the developer experience of the ML engineer, which frankly is usually the least important part by a wide margin.



