Data Pipelines with Airflow
Orchestrating Data with Apache Airflow
Apache Airflow is an open-source platform created by the community to programmatically author, schedule and monitor workflows.
It is heavily used in conjunction with Kafka for modern event-driven architectures.
What is a DAG?
In Airflow, a workflow is defined as a Directed Acyclic Graph (DAG). A DAG is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.
- Directed: Dependencies have a specific direction (Task A must run before Task B).
- Acyclic: The graph cannot have loops (Task A cannot depend on Task B if Task B depends on Task A).
Writing Your First DAG
Airflow DAGs are written in standard Python.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
default_args = {
'owner': 'data_engineering_team',
'depends_on_past': False,
'email_on_failure': True,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
}
# Define the DAG
with DAG(
'daily_etl_pipeline',
default_args=default_args,
description='A simple daily ETL DAG',
schedule_interval=timedelta(days=1),
start_date=datetime(2023, 1, 1),
catchup=False,
) as dag:
# Define tasks
extract_task = BashOperator(
task_id='extract_data',
bash_command='python /scripts/extract.py',
)
transform_task = BashOperator(
task_id='transform_data',
bash_command='python /scripts/transform.py',
)
load_task = BashOperator(
task_id='load_data',
bash_command='python /scripts/load.py',
)
# Define dependencies (Bitshift operators)
extract_task >> transform_task >> load_task
Advanced Concepts: XComs
XComs (short for "cross-communication") are a mechanism that let tasks talk to each other. By default, tasks are completely isolated and may run on entirely different machines. XComs allow a task to push a small amount of metadata (like a processed file path) to the Airflow metadata database, which downstream tasks can then pull.