stkbailey's comments

Interesting, Astronomer was actually my last choice for orchestrator. We went with Dagster, but I didn't want to make the takeaway "Dagster solves these problems", because it doesn't directly. Astronomer was just the best foil for the "meta-orchestrator" space that seems to be evolving, and which _can_ address these problems.


Author here - appreciate the comments and reads. To add a bit of color -- earlier this year I spent about a month looking into orchestrators to migrate Whatnot's data platform onto, and it was a miserable experience. We were on AWS Managed Airflow, but to stay on it and have a solid platform, I would have been writing GitHub Actions for CI/CD, standing up ECR and IAM roles with Terraform, setting up EKS to run Kubernetes jobs, managing infra monitoring with Datadog, etc., etc.

In fact, I did end up doing all those things, but we opted for Dagster Cloud because of their focus on improving developer efficiency. Their team provided pre-built GitHub Actions for CI/CD and recently introduced PR-specific branch deployments, which have been amazing. They're moving towards serverless execution, built-in ECR repositories, and managed secrets. I expect Prefect and Astronomer are moving in this direction, too, but I liked the Dagster project's energy quite a bit.

As I've waded into the MLOps world as well, it just keeps looking like every platform basically devolves into the same thing: an orchestrator that provisions compute resources and logs metadata into an opinionated data model. Catalog tools like Atlan are metadata sinks that are trying to build out orchestration/workflow capabilities. dbt Cloud, of course, is just an orchestrator for a specific type of data product, and it is aiming to operationalize metadata with its metrics layer.

Orchestration + a metadata data model is a common denominator here, and I think the fact that Airflow is so inevitable has made it really hard for people to imagine the category as anything other than a scheduler, but perhaps some of these new companies can break new ground.
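
To make the "orchestrator + metadata data model" shape concrete, here is a minimal sketch in plain Python. It is not how Dagster, Airflow, or any other tool mentioned here is implemented; `MiniOrchestrator`, `RunRecord`, and the compute labels are hypothetical names made up for illustration, and "provisioning compute" is reduced to recording a label.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Dict, List


@dataclass
class RunRecord:
    """One row in the (hypothetical) opinionated metadata model."""
    task: str
    compute: str          # where the task was meant to run, e.g. "k8s-job"
    status: str           # "success" or "failed"
    started_at: datetime
    finished_at: datetime


@dataclass
class MiniOrchestrator:
    """Toy orchestrator: runs registered callables and logs run metadata."""
    tasks: Dict[str, Callable[[], object]] = field(default_factory=dict)
    compute: Dict[str, str] = field(default_factory=dict)
    runs: List[RunRecord] = field(default_factory=list)

    def register(self, name: str, fn: Callable[[], object], compute: str = "local") -> None:
        # A real platform would provision resources here; this only stores a label.
        self.tasks[name] = fn
        self.compute[name] = compute

    def run(self, name: str) -> RunRecord:
        started = datetime.now(timezone.utc)
        try:
            self.tasks[name]()
            status = "success"
        except Exception:
            # Failures are captured as metadata rather than raised.
            status = "failed"
        record = RunRecord(name, self.compute[name], status, started,
                           datetime.now(timezone.utc))
        self.runs.append(record)
        return record
```

Everything a catalog or observability layer would want (what ran, where, when, and how it ended) falls out of the `runs` list, which is roughly the common denominator described above.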


Thanks for the experience report - I have Dagster and Prefect on my shortlist to evaluate next time I need to build this, and Dagster seems the most promising, so it’s good to get another datapoint.

One Q - it seems to me that another possible solve (and probably how the big guys tend to do it) is to use a dataflow engine like Spark/Flink. Did you compare a managed platform like Google Dataproc? They also have serverless if you don’t want a heavy managed cluster, which might make this approach more viable for non-huge companies that wouldn’t utilize a min-spec cluster. (When I last evaluated this they didn’t have serverless, which was a dealbreaker at my small scale.)


We didn't look into a dataflow engine specifically, in part because we have a heterogeneous set of workloads. Our core use case is loading mission-critical data in chunks, but the orchestrator is also coordinating SaaS tools and managed services like SageMaker. So the "just run this arbitrary code, reliably and scalably" role is an important one in our case, not just the dataflow part of things.


>We were on AWS Managed Airflow, but to stay on it and have a solid platform, I would have been writing GitHub Actions for CI/CD, standing up ECR and IAM roles with Terraform, setting up EKS to run Kubernetes jobs, managing infra monitoring with Datadog, etc., etc.

This sounds like an issue not with Airflow but with integration.


Yep, that's what I tried to point out in the article.


“We were on AWS Managed Airflow, but to stay on it and have a solid platform, I would have been writing GitHub Actions for CI/CD, standing up ECR and IAM roles with Terraform, setting up EKS to run Kubernetes jobs, managing infra monitoring with Datadog, etc., etc.”

DAGs can be published to S3, which cuts out about half of these dependencies. And the nice thing about MWAA is log & stats publishing over CloudWatch, which should flow into any existing Amazon-integrated tooling.

For our team, setting up Terraform for IAM & MWAA, some deploy pipelines to S3, and connecting some config bits to wire up Splunk logs / monitoring pieces was not that much work. Initiating a separate vendor relationship & pricing out data ingress/egress costs would blow that work out of the water, but maybe it’s a difference in company size/placement.


In your investigation did you try out Flyte at all?


Nope -- just MWAA, Astronomer, Dagster Cloud, and Prefect Cloud. In the past I used Argo Workflows pretty extensively and have talked about its pros and cons here: https://www.youtube.com/watch?v=-cyr_kL-9fc

