Many companies have been forced to adapt their work organization and structure for their employees working from home due to COVID-19. As a result, cloud and distributed computing have become much more prevalent, driven by cost optimization and the fact that certain data centers may no longer be easily accessible. In preparation for an upcoming interview, I decided to further my knowledge of the technologies and processes used in data engineering that make this new work structure possible. One such technology that I had heard of but never had a chance to work with is Apache Airflow. Since I've not had much prior experience with it, I decided to take DataCamp's "Introduction to Airflow in Python" course to learn more.
Airflow is primarily used to program data engineering workflows. A workflow refers to the set of steps required to accomplish a certain task. Workflow tools exist to ensure that dependency requirements are fulfilled before tasks execute, such as ensuring that a dataset is filtered prior to another computational process. Task dependencies are designated as either upstream or downstream tasks, where an upstream task must be done before a downstream task. Airflow implements its workflows as directed acyclic graphs, or DAGs.
Within Airflow, a DAG represents the tasks and dependencies between tasks and contains identifying information such as the name of the workflow and the owner. Being directed means there is a particular flow with respect to the order of operations between each component. Acyclic means that the workflow does not repeat itself. However, it does not mean that the entire operation cannot be rerun, simply that the individual components are executed only once. Each component or task makes up a vertex of the DAG.
In order to work with DAGs, you use the airflow command-line tool and its numerous subcommands. These subcommands control many operations and aspects of our DAGs, such as running a DAG or starting the scheduler for all tasks and DAGs. DAGs are written in Python but can use components in other languages if needed. Airflow also has a web UI that is useful for monitoring and developing workflows.
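A few of the most common subcommands look like this. This is a sketch assuming the Airflow 2.x CLI syntax (the 1.x commands were named differently), and example_etl is a made-up DAG id:

```shell
# List all DAGs that Airflow knows about
airflow dags list

# Trigger a manual run of a DAG by its dag_id (example_etl is hypothetical)
airflow dags trigger example_etl

# Start the scheduler, which executes tasks for all scheduled DAGs
airflow scheduler

# Start the web UI for monitoring (serves on port 8080 by default)
airflow webserver
```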
Operators represent a single task within a workflow. They run independently, meaning that all the resources needed to complete the task are contained within that operator. Various operators are used in order to perform certain tasks. BashOperator, for instance, executes Bash commands and scripts. DAGs can contain numerous operators that perform different tasks depending on what is needed.
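To make this concrete, here is a minimal sketch of a DAG containing two BashOperator tasks with an upstream/downstream dependency between them. It assumes the Airflow 2.x import paths, and the dag_id, URL, and shell commands are made up for illustration:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator  # Airflow 2.x import path

# A hypothetical two-task workflow: download a file, then deduplicate it.
with DAG(
    dag_id="example_etl",              # made-up identifier
    start_date=datetime(2021, 2, 14),
    schedule_interval="@daily",
) as dag:
    download = BashOperator(
        task_id="download_data",
        bash_command="wget -q https://example.com/data.csv -O /tmp/data.csv",
    )
    clean = BashOperator(
        task_id="clean_data",
        bash_command="sort -u /tmp/data.csv > /tmp/data_clean.csv",
    )

    # download is the upstream task: it must finish before clean starts
    download >> clean
```

The >> operator is Airflow's shorthand for declaring that the left-hand task is upstream of the right-hand one.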
A DAG run refers to an instance of a workflow at a given time. A DAG run may be currently running, have failed, or have completed successfully. We can view all DAG runs and their states through the Airflow UI. When scheduling a DAG to run automatically, several attributes can be set depending on your needs. A start_date is specified for a date/time to initially schedule the DAG. There are other optional attributes, such as max_tries for the number of attempts to retry before failing a DAG run, or schedule_interval to specify how often to schedule a DAG run.
There are certain peculiarities when scheduling DAGs that need to be mentioned. You can state a start date and end date, but this is not a guarantee of a DAG running within this range, rather a minimum/maximum value of when it could be scheduled. In addition, when using a start date and a schedule interval, Airflow will schedule the DAG once at least one schedule interval has passed beyond the start date. For example, if I scheduled a DAG to run daily and stated the start date as February 14th, Airflow would use February 15th as the earliest possible date for the first run of the DAG. This can cause some issues when dealing with workflows with longer intervals.
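This first-run rule can be illustrated with plain datetime arithmetic, no Airflow required: the earliest possible run is the start date plus one full schedule interval. The dates here mirror the February example above:

```python
from datetime import datetime, timedelta

# Hypothetical DAG settings: a daily schedule starting February 14th
start_date = datetime(2021, 2, 14)
schedule_interval = timedelta(days=1)  # the equivalent of @daily

# Airflow waits for one full interval to pass beyond the start date,
# so the earliest the first DAG run can execute is February 15th.
first_run = start_date + schedule_interval
print(first_run)  # 2021-02-15 00:00:00
```

With a @monthly interval the same rule would push the first run a full month past the start date, which is why longer intervals can surprise you.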
The schedule interval can be defined with certain presets or with cron-style syntax. Airflow presets include @once, @daily, and @monthly, among others. Using a cron expression, a string consisting of five or six fields representing a point or interval in time, enables more specificity when scheduling your workflows.
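As a quick sketch of the five-field form, the fields read minute, hour, day of month, month, and day of week, left to right. The expression below is an arbitrary example, the kind of string you could pass as a DAG's schedule_interval:

```python
# Five-field cron expression: minute hour day-of-month month day-of-week.
# "0 12 * * 3" means: at 12:00 on every Wednesday (day-of-week 3).
cron = "0 12 * * 3"

minute, hour, day_of_month, month, day_of_week = cron.split()
print(minute, hour, day_of_week)  # 0 12 3
```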
In the next segment of this blog, I hope to cover monitoring Airflow workflows and learning how to build production pipelines. I look forward to covering more of Airflow and other cloud computing technologies to help broaden my knowledge and skills further.