Well, here it is: getting Apache Airflow up and running! 🎉

To begin, you’ll want to familiarize yourself with the example docker-compose.yaml that Apache provides in the Airflow repository:

🔗 Official Docker Compose Example


🔍 Overview of the Services, Environment Variables, and Volumes

The Apache-provided docker-compose.yaml file is described as:

“Basic Airflow cluster configuration for CeleryExecutor with Redis and PostgreSQL.”

This setup provides a fully functional Airflow instance using a CeleryExecutor, supported by Redis for task queuing and PostgreSQL for the metadata database. Below is a breakdown of each service in the stack and its role:

1. PostgreSQL Database

The PostgreSQL service acts as the metadata database for Airflow. It stores crucial information such as:

  • DAG runs and serialized DAG definitions
  • Task states and execution history
  • Connections, variables, and pools
  • User accounts and roles (if you use the built-in authentication)
Environment Variables to Configure:
  • POSTGRES_USER: Username for the database (e.g., airflow).
  • POSTGRES_PASSWORD: Password for the database.
  • POSTGRES_DB: Name of the database (e.g., airflow).
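
For reference, the postgres service in the example compose file looks roughly like this (a simplified sketch; the exact image tag, healthcheck, and named volume depend on the version of the file you download):

postgres:
  image: postgres:13
  environment:
    POSTGRES_USER: airflow
    POSTGRES_PASSWORD: airflow
    POSTGRES_DB: airflow
  volumes:
    - postgres-db-volume:/var/lib/postgresql/data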

2. Redis Database

Redis is a high-performance in-memory key-value store that Airflow uses as the message broker for the CeleryExecutor. It handles task queuing and ensures that:

  • Tasks are efficiently distributed to workers.
  • Communication between components remains fast and reliable.
Exposed Ports:
  • Redis does not expose a UI, but it runs on port 6379 internally for communication.
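
The service definition itself is short. A simplified sketch (the official file also adds a healthcheck, and the image tag varies between Airflow releases):

redis:
  image: redis:latest
  expose:
    - 6379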

3. Airflow Webserver

The Airflow Webserver provides the main user interface (UI) for interacting with Airflow. From this web application, you can:

  • View and manage DAGs.
  • Monitor task progress and logs.
  • Pause, resume, and manually trigger DAGs.
Environment Variables to Configure:
  • _AIRFLOW_WWW_USER_USERNAME: Username for the admin account (default: airflow).
  • _AIRFLOW_WWW_USER_PASSWORD: Password for the admin account (default: airflow).
Exposed Ports:
  • The webserver UI is available on port 8080 by default.
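
In the compose file, only the command and the port mapping are specific to this service; everything else is inherited from the shared Airflow block. A simplified sketch (healthcheck and dependencies omitted):

airflow-webserver:
  <<: *airflow-common
  command: webserver
  ports:
    - "8080:8080"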

4. Airflow Scheduler

The Airflow Scheduler continuously monitors all active DAGs and determines which tasks are ready to run. It is responsible for:

  • Triggering tasks at the appropriate time based on dependencies and schedules.
  • Updating task states in the metadata database.

This is the brain of the workflow orchestration process.

5. Airflow Worker

The Airflow Worker is a Celery worker responsible for executing tasks assigned by the scheduler. Workers pull tasks from the Redis queue and execute them as isolated jobs. Multiple workers can run in parallel to distribute the workload and improve scalability.

6. Airflow Triggerer

The Airflow Triggerer handles deferred tasks in Airflow, leveraging asynchronous programming. It ensures that tasks like sensors (e.g., monitoring for an external file) don’t consume unnecessary system resources by checking conditions asynchronously.

7. Airflow CLI

The Airflow CLI is a command-line interface for interacting with the Airflow instance. It can be used to:

  • Trigger DAGs manually.
  • Test or debug tasks.
  • Manage Airflow users and configurations.

This is especially useful for debugging or automation scripts.
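
With the stack already running, a simple way to reach the CLI is to exec into one of the running containers (the service name airflow-webserver matches the example file; the DAG and task IDs below are placeholders):

docker compose exec airflow-webserver airflow dags list
docker compose exec airflow-webserver airflow tasks test <dag_id> <task_id> 2024-01-01

Recent versions of the example file also define a dedicated airflow-cli service behind a debug Compose profile, which you can use instead for one-off containers.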

8. Flower

Flower is a real-time monitoring tool for Celery tasks. It provides a web-based UI to:

  • Track task progress and statuses across workers.
  • View performance metrics and worker health.
  • Debug task-related issues.
Exposed Ports:
  • The Flower UI is available on port 5555 by default.
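
In recent versions of the example file, Flower sits behind a Compose profile and only starts when you explicitly enable it. If that's the case in your copy, bring it up with:

docker compose --profile flower up -d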

⚙️ Configurations to Consider Before Spinning It Up

Before starting the Airflow stack, ensure you’ve configured the following environment variables and settings:

1. PostgreSQL Database Credentials

  • POSTGRES_USER: Set the database username.
  • POSTGRES_PASSWORD: Set the database password.
  • POSTGRES_DB: Set the name of the database.

These credentials must match the values in the Airflow environment variables:

  • AIRFLOW__DATABASE__SQL_ALCHEMY_CONN
  • AIRFLOW__CELERY__RESULT_BACKEND

Example:

AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:<password>@postgres/airflow
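
The Celery result backend points at the same database, just with a different scheme prefix:

AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://airflow:<password>@postgres/airflow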

2. Webserver and Flower Ports

  • Webserver Port (8080): Make sure port 8080 is not already in use on your system, or map it to an alternative port (e.g., 8081:8080).
  • Flower Port (5555): Similarly, ensure 5555 is available or remap it.
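
If either port is taken, change only the host-side (left-hand) number in the relevant service's port mapping. For example, to serve the webserver UI on 8081 instead:

airflow-webserver:
  ports:
    - "8081:8080"  # host port 8081, container port 8080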

3. Admin Account for Webserver

Set credentials for the default admin user:

  • _AIRFLOW_WWW_USER_USERNAME: Admin username (e.g., airflow).
  • _AIRFLOW_WWW_USER_PASSWORD: Admin password (e.g., airflow123).
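
The example compose file reads these values from your shell environment and falls back to airflow/airflow if they are unset, so a convenient way to set them is a .env file next to the compose file. A minimal sketch using the values above:

_AIRFLOW_WWW_USER_USERNAME=airflow
_AIRFLOW_WWW_USER_PASSWORD=airflow123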

4. Fernet Key for Encryption

If you need to encrypt sensitive data like connection passwords, generate a Fernet key and set it in:

AIRFLOW__CORE__FERNET_KEY=<your_fernet_key>
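
One way to generate a key is with the cryptography package (already a dependency of Airflow), for example:

python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"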

5. DAGs, Logs, Plugins, and Configuration Directory

Map these directories to ensure Airflow persists files locally:

  • ./dags:/opt/airflow/dags
  • ./logs:/opt/airflow/logs
  • ./plugins:/opt/airflow/plugins
  • ./config:/opt/airflow/config

These mappings ensure your DAGs and logs remain accessible outside the container.
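
In the compose file these mappings live under the volumes key of the shared Airflow block, roughly like so (newer versions of the example prefix each path with a project-directory variable):

volumes:
  - ./dags:/opt/airflow/dags
  - ./logs:/opt/airflow/logs
  - ./plugins:/opt/airflow/plugins
  - ./config:/opt/airflow/config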


📦 Extending the Docker Image with More PIP Libraries

If you need additional Python libraries beyond what’s included by default, you can extend the Airflow image by modifying the Docker Compose file and adding a custom Dockerfile.

Steps to Extend the Image

Modify the Docker Compose File
In the example docker-compose.yaml, comment out the image line and uncomment the build: . line. This tells Docker Compose to build the image from a Dockerfile in the same directory instead of pulling the prebuilt one.
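
After the change, the top of the shared x-airflow-common block should look something like this (the exact image line depends on the version of the file you downloaded):

x-airflow-common:
  &airflow-common
  # image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.7.3}
  build: .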

Create a Dockerfile
Here’s an example Dockerfile that extends the Airflow image with custom commands:

FROM apache/airflow:2.7.3

# Update pip to the latest version
RUN pip install --upgrade pip

# Copy the requirements file and install the additional Python libraries
COPY requirements.txt .
RUN pip install -r requirements.txt

# Install system dependencies (optional) -- this part needs root
USER root
RUN apt-get update && apt-get install -y wget

# Switch back to the unprivileged airflow user, which the official image expects
USER airflow

📝 About the Dockerfile

  • COPY requirements.txt .: Copies the requirements.txt file into the image.
  • pip install -r requirements.txt: Installs every Python library listed in the file.
  • USER root / RUN apt-get install -y wget: Temporarily switches to root to install extra system utilities (wget is just an example).
  • USER airflow: Switches back to the unprivileged airflow user once the root-level work is done.
    Feel free to add more commands as required.

This setup is flexible, allowing you to customize the Airflow environment with the libraries and tools you need.

📋 Sample requirements.txt

Here’s a list of libraries you might find useful for an Airflow environment:

requests
pandas
numpy
sqlalchemy
beautifulsoup4
pyyaml

💡 Pin library versions in requirements.txt to avoid compatibility issues and to keep control over your environment. Example:

pandas==1.5.3
numpy==1.23.0

⚠️ A Note on Maintenance

Keeping your Python libraries up to date is critical. Security teams often require regular updates or the removal of outdated libraries to reduce vulnerabilities. This process can be a pain point, but it’s essential for a stable and secure setup.


🛠️ Putting It All Together and Spinning It Up

Now that everything is in place, it’s time to fire up the containers and get Airflow running! 🌟

Remember the directive in the docker-compose.yaml file:

build: .

This requires a slightly different Docker Compose command than usual. Instead of the standard:

docker compose up -d

We’ll use:

docker compose up --build -d

Why Use --build?

  • The build: . directive in the docker-compose.yaml file tells Docker Compose to build the image from your Dockerfile on the first run.
  • If the image already exists, Compose will not rebuild it unless you explicitly pass the --build flag.

💡 Pro-Tip: You can also set the number of Celery workers here with the --scale flag:

docker compose up --build -d --scale airflow-worker=3

Explanation:

  • --scale: Lets you run multiple instances of a service defined in the docker-compose.yaml file.
  • airflow-worker: The name of the service to scale, as defined in the compose file.
  • 3: Replace this with the desired number of workers.

Spin it up and start orchestrating workflows like a pro. 🚀