
Scheduling in Airflow

Pre-requisite: Airflow

When working on big projects as a team, we need workflow management tools that can keep track of activities without getting lost in a sea of tasks. Airflow is one such workflow management tool.



Airflow is an open-source platform and workflow engine that can schedule and monitor batch-oriented workflows and run complex data pipelines. In other words, it is a workflow orchestration tool, commonly used in data transformation pipelines. Airflow uses Python to define workflows, which makes scheduling straightforward.

To understand Airflow scheduling, we should first be aware of the Airflow DAG. Let's discuss it in detail.



Airflow DAG:

DAG stands for Directed Acyclic Graph. In Airflow, a DAG is a data pipeline defined in Python. In this graph, each node represents a task (responsible for completing a unit of work) and each edge represents a dependency between tasks. Overall, a DAG is a collection of tasks, organized so that their relationships and dependencies (as shown in the Airflow UI) are reflected. This model provides a simple technique for executing a pipeline by dividing it into discrete, incremental tasks rather than depending on a single unit to perform all the work. Now let's see how a DAG's mathematical properties are used to build pipelines.
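The "acyclic" property is what guarantees that a valid execution order always exists. As a minimal sketch (plain Python, not Airflow itself, with hypothetical task names), we can model a pipeline as a dependency mapping and derive an execution order with a topological sort:

```python
from graphlib import TopologicalSorter  # Python standard library (3.9+)

# A hypothetical pipeline: each key depends on the tasks in its set.
# In an Airflow DAG file, the same shape would be expressed with
# operators and dependencies like extract >> transform >> load.
dag = {
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}

# Because the graph has no cycles, a topological order exists:
# every task appears after all of its dependencies.
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['extract', 'transform', 'load', 'notify']
```

If the graph contained a cycle (for example, `extract` also depending on `load`), `TopologicalSorter` would raise a `CycleError`, which is exactly why Airflow requires its graphs to be acyclic.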

Airflow Scheduler: Scheduling Concepts and Terminology

In simple words, scheduling is the process of assigning resources to perform tasks. A scheduler controls when and where tasks take place. The Airflow scheduler keeps track of all tasks and DAGs, and triggers tasks whose dependencies have been met.

To start the Airflow scheduler service, all we need is one simple command:

airflow scheduler

Let's understand what this command does. It starts the Airflow scheduler using the configuration specified in airflow.cfg. Once the scheduler is running, our DAGs automatically start executing based on their start_date, schedule_interval, and end_date. All of these parameters and scheduling concepts are discussed in detail below.
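The core arithmetic the scheduler applies to start_date, schedule_interval, and end_date can be sketched in plain Python. This is a simplification (real Airflow also handles cron expressions, catchup, and timetables), and the dates below are illustrative values, not from a real DAG:

```python
from datetime import datetime, timedelta

# Hypothetical DAG settings (illustrative values only).
start_date = datetime(2023, 1, 1)
end_date = datetime(2023, 1, 5)
schedule_interval = timedelta(days=1)

# Each run covers one interval between start_date and end_date;
# a run is triggered only once its whole interval has passed.
runs = []
interval_start = start_date
while interval_start + schedule_interval <= end_date:
    runs.append((interval_start, interval_start + schedule_interval))
    interval_start += schedule_interval

for s, e in runs:
    print(f"run covering {s:%Y-%m-%d} .. {e:%Y-%m-%d}")
```

With these values the scheduler would create four runs, one per day from January 1 through January 4, each covering the day that has just finished.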

Airflow Scheduler Parameters:
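As a hedged illustration, a few scheduler options commonly set in the [scheduler] section of airflow.cfg are shown below. The option names are from Airflow 2.x as best recalled; check the configuration reference for your installed version:

```ini
[scheduler]
# How often (in seconds) the scheduler heartbeats to signal it is alive.
scheduler_heartbeat_sec = 5
# Minimum number of seconds between re-parsing the same DAG file.
min_file_process_interval = 30
# How often (in seconds) to scan the DAGs directory for new files.
dag_dir_list_interval = 300
```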

 

Airflow Scheduler Terms:

It is important to be aware of these scheduling concepts before working in depth with the Airflow scheduler. Some key terms are described below:

  data_interval_start = the start date of the data interval (this is also the run's execution date).
  data_interval_end = the end date of the data interval.
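Assuming a daily schedule, the relationship between these two fields can be sketched with plain datetime arithmetic (illustrative values, not taken from a real DAG run):

```python
from datetime import datetime, timedelta

# Illustrative values: a daily schedule whose interval starts on 2023-01-01.
schedule_interval = timedelta(days=1)
data_interval_start = datetime(2023, 1, 1)  # also the run's execution date
data_interval_end = data_interval_start + schedule_interval

# A run processes data for [data_interval_start, data_interval_end)
# and is triggered only after data_interval_end has passed.
print(data_interval_start, data_interval_end)
```

Note that the execution date of a run equals data_interval_start, i.e. the beginning of the period the run covers, not the moment the run actually fires.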

Working On Airflow Scheduler:

