You will have to solve this eventually
As data projects evolve beyond their proof of concept (PoC) phase, orchestration becomes a critical component. This necessity arises not only from the inherent dependencies in processing tasks but also from the need to manage ancillary yet crucial elements such as alerts, SLAs (Service Level Agreements), retries, historical reports, re-runs, backfills, etc.
Selecting the right orchestration tool is a pivotal decision. Despite appearing easily interchangeable, these tools often become deeply integrated into a project's framework. This integration is primarily due to organizational factors. Teams develop proficiency and confidence in a particular tool, naturally setting expectations for any subsequent replacements. Thus, it’s vital to choose wisely, balancing the decision with the flexibility for potential future changes. We'll return to this trade-off later on.
Presently, there are three primary categories of orchestration tools to consider:
Cloud-Native Tools: Examples include AWS Managed Workflows for Apache Airflow (MWAA), Azure Data Factory, and Google Cloud Composer. Notably, MWAA and Cloud Composer are essentially managed versions of Apache Airflow.
Open Source Specialty Tools: Apache Airflow is the prominent example in this category; Luigi is a good alternative with a different dependency philosophy.
Custom Solutions: Utilizing classic or cloud-native functionalities, such as AWS Step Functions, to build your own orchestration framework.
Let’s look at them in reverse order to build some suspense!
3. Custom Orchestration Solutions
Building your own orchestration solution can be both engaging and cost-effective, particularly if executed correctly. However, it's often not the optimal choice, except for projects at very small or extremely large scales.
For instance, consider AWS Step Functions, although other cloud providers offer similar capabilities. These solutions are cost-effective, since you pay per execution rather than for always-on infrastructure, and they are versatile enough to model a wide range of workflows as state machines. Nevertheless, if your workflow complexity extends beyond what state machines can express, you might need a bigger boat.
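To make the state-machine model concrete, here is a minimal sketch of what such a workflow might look like in Amazon States Language, registered via boto3. The Lambda function ARNs, role ARN, and step names are placeholders for illustration, not a recommended setup.

```python
import json

import boto3

# A tiny two-step pipeline: extract, then load, with a retry policy on
# the first step. All ARNs below are placeholders.
definition = {
    "Comment": "Minimal two-step pipeline sketch",
    "StartAt": "ExtractData",
    "States": {
        "ExtractData": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:extract",
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 30,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }
            ],
            "Next": "LoadData",
        },
        "LoadData": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:load",
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="tiny-pipeline",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/step-functions-role",  # placeholder
)
```

Retries and branching live in the definition itself, which is pleasant for simple pipelines but gets verbose as the workflow grows.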
The primary challenge with custom solutions is the unexpected amount of engineering effort required. The idea is initially appealing to engineers, but the reality of building features like retries, SLAs, alerts, backfills, and user interfaces for operational teams can become daunting. Additionally, consider the long-term aspects: the sustainability of maintenance and the ability to attract talent for ongoing development and growth.
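To illustrate the kind of plumbing that quietly accumulates, here is a hedged sketch of just one of those features, a retry wrapper with an alert hook. The send_alert function is hypothetical; in a real setup it might post to Slack or a paging service, and this is only one piece next to backfills, SLAs, and a UI.

```python
import time
from typing import Any, Callable


def send_alert(message: str) -> None:
    """Hypothetical alert hook; replace with Slack, PagerDuty, email, etc."""
    print(f"ALERT: {message}")


def run_with_retries(task: Callable[[], Any], name: str,
                     max_attempts: int = 3, backoff_seconds: int = 60) -> Any:
    """Run a task, retrying with increasing backoff and alerting on final failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:  # deliberately broad for the sketch
            if attempt == max_attempts:
                send_alert(f"Task {name} failed after {max_attempts} attempts: {exc}")
                raise
            time.sleep(backoff_seconds * attempt)
```

Each feature like this is simple in isolation; the daunting part is the sum of them plus their maintenance.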
2. Open Source Tools (e.g., Airflow, Luigi)
Open-source tools like Apache Airflow or Luigi were, and remain, solid choices, especially for smaller organizations. Startups operating on limited resources might find customizing an Airflow instance a pragmatic short-term solution, allowing them to focus on core business goals and existing customers.
For organizations where data orchestration is a strategic priority, adopting an open-source tool involves significant infrastructure work that might still be worth it. You're essentially managing a complex system comprising a web server, scheduler, and associated computing resources. The expertise required is substantial, though the market has become better at supplying these skills over recent years.
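As a point of reference, here is a minimal sketch of what the day-to-day authoring work looks like in Airflow (2.4+ style): an example DAG wiring up dependencies, retries, SLAs, and backfills. The DAG id, schedule, and task callables are placeholders.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    ...  # pull data from a source system (placeholder)


def transform():
    ...  # apply business logic (placeholder)


default_args = {
    "retries": 2,                        # automatic retries per task
    "retry_delay": timedelta(minutes=5),
    "sla": timedelta(hours=1),           # emits an SLA miss if a task runs late
}

with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=True,                        # allows backfilling past runs
    default_args=default_args,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    extract_task >> transform_task       # dependency: transform runs after extract
```

The point is that the features you would otherwise build yourself (retries, SLAs, backfills, a UI) come as configuration here; what you pay for instead is running the scheduler, web server, and workers underneath.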
While Airflow is often the go-to recommendation, conducting a thorough technical review based on current needs and trends is invaluable. This process not only ensures a well-informed choice but also provides excellent material for a thoughtful discussion or a subsequent blog post.
1. Cloud-Native Tools
Opting for cloud-native solutions like AWS Managed Airflow or GCP Cloud Composer essentially means outsourcing infrastructure and administration to a major tech provider. This approach is generally a good idea unless your infrastructure team feels strongly otherwise. We do recommend listening to them in that case; they probably know a lot more about your situation than we do.
Such tools allow teams to concentrate on developing their data workflows, transformations, and custom operators. However, it's wise to document instances where cloud-native features are used within these operators, maintaining awareness of how these dependencies might affect future migrations to other tools, as in the sketch below.
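For example, a custom operator that quietly leans on a cloud-native feature might look like this hedged sketch. The operator name, bucket, and paths are made up; the point is simply that the boto3 call is a dependency worth documenting for any future migration.

```python
import boto3
from airflow.models import BaseOperator


class CopyToS3Operator(BaseOperator):
    """Hypothetical operator that uploads a local file to S3.

    NOTE: depends on AWS-specific APIs (boto3 / S3). Moving to another
    cloud or orchestrator means rewriting this part.
    """

    def __init__(self, local_path: str, bucket: str, key: str, **kwargs):
        super().__init__(**kwargs)
        self.local_path = local_path
        self.bucket = bucket
        self.key = key

    def execute(self, context):
        s3 = boto3.client("s3")  # the cloud-native dependency lives here
        s3.upload_file(self.local_path, self.bucket, self.key)
        self.log.info("Uploaded %s to s3://%s/%s",
                      self.local_path, self.bucket, self.key)
```

A short note in the docstring or an internal registry of such operators is usually enough to keep the eventual migration conversation honest.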
Outsourcing aspects like authentication to specialized providers is often prudent as well, drawing on their expertise for better security and efficiency.
Conclusion: A Thoughtful Approach to Tool Selection
In sum, we recommend thinking about this early on and having an open discussion about it. Choose a tool that aligns with your project's needs, and once chosen, the team should fully embrace it and use it as designed. Yet maintain an awareness of potential future changes, and be mindful of any strong dependencies that arise. Conscious decision-making is key here; avoid lock-in that results merely from taking the path of least resistance.
Our experience spans a wide range of tools, including those mentioned and others, and we adapt quickly to new ones as required. For instance, thanks to its robust internal documentation, Meta/FB's custom tool felt indistinguishable from Airflow after a couple of days. Maybe not as pretty.
Feel free to reach out if you want to talk about this, or if you have ideas for further blog posts.