If you've ever used a tool similar to HP Operations Orchestration, it's pretty much a stripped-down version of that.
Cons:
- It's not very stable. (It requires a lot of configuration to get it to do more than one process at a time)
- It's very easy to get the UI to fall over.
- It's very difficult to get tasks+jobs to stop running once they've started. (You can delete/stop/cancel a job, but under the covers it keeps running, and your next iteration is going to wait for it, if it ever does complete before you can go through the develop/test cycle again.)
- It's written in Python: Expect to have issues with your environment. The latest version of Airflow doesn't work with 3.7.x-ish because `async` was made a reserved keyword. There goes that method.
- There is no sharing of data from one process to another (XComs are frowned upon). This means that if you're trying to pull data from S3, you're going to have to hard-code it to a predictable place. The next operation acts completely independently and just runs against that.
- The connections between tasks are superficial. They're just there to order the tasks based on how you specified them. Also, it can be a bit difficult to debug when you have multiple layers and multiple dependency declarations where something is both upstream and downstream of the same dependency.
- No optimization. It will not split up the work per task. You have to define that work manually. (See the next complaint)
- No dynamic tasks or DAGs. You cannot generate a new DAG or task after the DAG is initialized. That means that if you have to perform 1 000 000 000 000 API calls, you can't just break that up into 200 API calls per task and then max out the capacity of your workers.
- That example that they had of a dag of thousands of tasks. That's a bad practice. Timeouts on dags are going to be reached by the time that completes, and it'll try to restart on a schedule.
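For reference, the kind of split described above (200 API calls per task) is simple to express in plain Python. This is only a sketch of the batching itself, with a made-up `chunk` helper and a made-up workload size; it does not use any Airflow API and says nothing about how the batches would be scheduled.

```python
from typing import Iterator, List, Sequence


def chunk(items: Sequence[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of `items`, each holding at most `size` entries."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Hypothetical workload: 1,000 call IDs split into batches of 200.
call_ids = range(1000)
batches = list(chunk(call_ids, 200))
print(len(batches), len(batches[0]))  # prints "5 200"
```

The hard part being complained about is not this loop but getting the orchestrator to turn each batch into its own task once the workload size is only known at run time.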
Any opinion on NiFi or other open source alternatives? I'm evaluating products in this space, but it seems pretty hard to tell without just giving them a try how well they'd integrate into my work.
Another reason: Your tasks/dags aren't portable from one system to another. Your tasks+dags are your code base. You can't create something and share it with others very easily.
This is why I wish attempts at standardizing ways to express DAGs in YAML/JSON/XML like CWL [0] and WDL [1] had more steam to them. If these standards took off, you'd be able to take your workflow and execute it on another batch system scheduler if you got tired of your current workflow orchestration tool.
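To make the idea concrete, here is a minimal sketch of a workflow described as data and ordered by a generic tool. The JSON layout is made up for illustration (it is not CWL or WDL syntax), and `execution_order` is a hypothetical helper. It only uses Python's standard library; `graphlib` needs Python 3.9+.

```python
import json
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical format: each task name maps to the list of tasks it depends on.
WORKFLOW_JSON = """
{
  "extract": [],
  "transform": ["extract"],
  "load": ["transform"],
  "report": ["load"]
}
"""


def execution_order(spec: str) -> list:
    """Parse the JSON spec and return task names in a dependency-respecting order."""
    graph = json.loads(spec)
    # TopologicalSorter takes {node: predecessors} and raises CycleError
    # (a ValueError subclass) if the declared dependencies form a cycle.
    return list(TopologicalSorter(graph).static_order())


print(execution_order(WORKFLOW_JSON))
# prints ['extract', 'transform', 'load', 'report']
```

Because the description is plain data, any scheduler that understands the format could run it, and a task declared both upstream and downstream of the same dependency is rejected as a cycle up front.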
[full disclosure, Airflow committer here]
I've never heard of "HP Operation Orchestration", but that looks like a drag-and-drop enterprise tool from a different Windows-GUI era. Airflow is very different: its workflows are defined as code, which is a totally different paradigm. The Airflow community believes that when workflows are defined as code, it's easier to collaborate on, test, evolve and maintain them. Though maybe the HP tool exposes an API?
To address some of your comments:
- About stability, I'm not sure what version you've used, or which executor you were using, but if stability was a concern I don't think we'd have such a large and thriving community. Nothing is perfect, but clearly it's working well for hundreds of companies.
- About stopping jobs, if the task instance state is altered or cleared (through the web ui or CLI), the task will exit and the failure will be handled. Earlier versions (maybe 1-2 years back?) did not always do that properly.
- About Python: Python 3.7 was released late June 2018, and I think there are PRs addressing the `async` issue already. We fully support 2.7 to 3.6, and 3.7 very soon. You need to give software a chance and a bit of time to adapt to new standards. I wonder which % of pypi packages support 3.7 at this point, or how many have 3.7 in their build matrix, but my guess is that it's very low.
- XComs are fine in many cases, though if you're not passing data or metadata from one task to another, that doesn't mean that there's no context that exists for the execution of the task. We recommend having deterministic context and data locations (meaning the same task instance would always target the same partition or data location).
- Dynamic: the talk linked to above clarifies what kind of dynamism Airflow supports. It's very common to build DAGs dynamically, though the shape of the DAG cannot change at runtime. Conceptually an Airflow DAG is a proper directed acyclic graph, not a DAG factory or many DAGs at once. Note that you can still write dynamic DAG factories if you want to create DAGs that change based on input.
- No optimization: the contract is simple, Airflow executes the tasks you define. While you can do data-processing work in Python inside an Airflow task (data flow), we recommend using Airflow to orchestrate more specialized systems like Spark/Hadoop/Flink/BigQuery/Presto to do the actual data processing. Airflow is first and foremost an orchestrator.
- DAG parsing timeouts are configurable. DAG creation times for large DAGs have been improved in the past versions. But clearly it's easier for humans to reason about smaller DAGs. DAGs with thousands of tasks aren't ideal but they are manageable.
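As a rough illustration of the deterministic-location point, a task can derive where it reads and writes purely from its own execution date rather than from data handed over by the previous task. The bucket, dataset name, and `partition_path` helper below are hypothetical, and the path layout is just one possible convention, not something Airflow prescribes.

```python
from datetime import date


def partition_path(bucket: str, dataset: str, execution_date: date) -> str:
    """Build a storage path that depends only on the arguments given."""
    return f"s3://{bucket}/{dataset}/ds={execution_date.isoformat()}/"


# Hypothetical names; rerunning the same task instance targets the same partition.
path = partition_path("example-bucket", "api_events", date(2018, 8, 28))
print(path)  # prints "s3://example-bucket/api_events/ds=2018-08-28/"
```

With this convention an upstream task writes to the path for its date and the downstream task recomputes the same path, so a rerun overwrites one partition instead of producing a new, unpredictable location.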
One thing to keep in mind is that thriving open source software evolves quickly, and Airflow gets 100+ commits monthly, and has dozens of engineers from many organizations working on it full time. From what I can see it's clearly the most active project in this space.
HP OO is an entire system that has an XML command structure to customize the job that you're working on. It has a GUI that is used to build out the flows, run them and test them. It has a backend system to audit, administer, and visualize the current process, and it has workers to scale out the work. It's a bit more of a mature setup.
The claim that the setup of your workflow has to be code isn't necessarily a good thing. Your recipe should be descriptive, not imperative.
----
To answer your response:
1. Stability: I was working with apache-airflow 1.9 (last release: Jan 2018); 1.10 was just released 2 days ago. I frequently had issues where deleting more than 3 tasks would cause that mushroom cloud error message. Also, I've had cases where a task could max out on memory and take the whole system down.
2. Stopping jobs (1.9.0): I saw this issue where a task would be running, I would stop+delete the task and start a new one. I frequently saw the case where I had to wait a while for the triggered DAG to continue running.
3. Python 3.7: Yes, it was addressed in the PR. However, for things like that we (the users) need a quick turnaround/hotfix. Python 3.7 got released in late June (let's say 27 June), and the latest Airflow version, 1.10, came out 28 August [with a 7-month gap]. It's just painful to have this upgrade break something internally in Airflow.
4. From what I've seen, in situations where the work for the task is huge, there is an expectation that the task handles the workload and splits up the workload itself (since you can't define a span out of the tasks based on the workload). That's no bueno.
Timeouts:
From what I have seen, there are issues where the next scheduled DAG run can interfere with the last one. This is an issue given the timeouts, retries, and recurring schedules. (Yes, you can say that's the user's choice. However, workloads and performance can change without notice.)
----
Another issue I had: There is no way to trigger a task and its dependent tasks without triggering the whole DAG. This makes long-running DAGs with lots of tasks difficult to debug and test.
Also there is a slight difference between airflow run (task) and test. Sometimes you use one vs the other.