Well, here it is: getting Apache Airflow up and running! 🎉
To begin, you’ll want to familiarize yourself with the example `docker-compose.yaml` that Apache provides in the Airflow repository:
🔗 Official Docker Compose Example
🔍 Overview of the Services, Environment Variables, and Volumes
The Apache-provided `docker-compose.yaml` file is described as:
“Basic Airflow cluster configuration for CeleryExecutor with Redis and PostgreSQL.”
This setup provides a fully functional Airflow instance using a CeleryExecutor, supported by Redis for task queuing and PostgreSQL for the metadata database. Below is a breakdown of each service in the stack and its role:
1. PostgreSQL Database
The PostgreSQL service acts as the metadata database for Airflow. It stores crucial information such as:
- DAG definitions
- Task states and histories
- Airflow configuration settings
- User credentials and roles (if applicable)
Environment Variables to Configure:
- `POSTGRES_USER`: Username for the database (e.g., `airflow`).
- `POSTGRES_PASSWORD`: Password for the database.
- `POSTGRES_DB`: Name of the database (e.g., `airflow`).
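For reference, these settings live under the `postgres` service in the official compose file. A trimmed sketch (the image tag and healthcheck details vary by Airflow release, and the `airflow` values are simply the defaults) looks roughly like this:
```yaml
postgres:
  image: postgres:13
  environment:
    POSTGRES_USER: airflow
    POSTGRES_PASSWORD: airflow
    POSTGRES_DB: airflow
  volumes:
    - postgres-db-volume:/var/lib/postgresql/data
  healthcheck:
    test: ["CMD", "pg_isready", "-U", "airflow"]
    interval: 10s
    retries: 5
  restart: always
```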
2. Redis Database
The Redis service is a high-performance in-memory key-value store that Airflow uses as a message broker for the CeleryExecutor. It handles task queuing and ensures that:
- Tasks are efficiently distributed to workers.
- Communication between components remains fast and reliable.
Exposed Ports:
- Redis does not expose a UI, but it runs on port `6379` internally for communication.
3. Airflow Webserver
The Airflow Webserver provides the main user interface (UI) for interacting with Airflow. From this web application, you can:
- View and manage DAGs.
- Monitor task progress and logs.
- Pause, trigger, or edit workflows.
Environment Variables to Configure:
- `_AIRFLOW_WWW_USER_USERNAME`: Username for the admin account (default: `airflow`).
- `_AIRFLOW_WWW_USER_PASSWORD`: Password for the admin account (default: `airflow`).
Exposed Ports:
- The webserver UI is available on port `8080` by default.
4. Airflow Scheduler
The Airflow Scheduler continuously monitors all active DAGs and determines which tasks are ready to run. It is responsible for:
- Triggering tasks at the appropriate time based on dependencies and schedules.
- Updating task states in the metadata database.
This is the brain of the workflow orchestration process.
5. Airflow Worker
The Airflow Worker is a Celery worker responsible for executing tasks assigned by the scheduler. Workers pull tasks from the Redis queue and execute them as isolated jobs. Multiple workers can run in parallel to distribute the workload and improve scalability.
6. Airflow Triggerer
The Airflow Triggerer handles deferred tasks in Airflow, leveraging asynchronous programming. It ensures that tasks like sensors (e.g., monitoring for an external file) don’t consume unnecessary system resources by checking conditions asynchronously.
7. Airflow CLI
The Airflow CLI is a command-line interface for interacting with the Airflow instance. It can be used to:
- Trigger DAGs manually.
- Test or debug tasks.
- Manage Airflow users and configurations.
This is especially useful for debugging or automation scripts.
8. Flower
Flower is a real-time monitoring tool for Celery tasks. It provides a web-based UI to:
- Track task progress and statuses across workers.
- View performance metrics and worker health.
- Debug task-related issues.
Exposed Ports:
- The Flower UI is available on port `5555` by default.
⚙️ Configurations to Consider Before Spinning It Up
Before starting the Airflow stack, ensure you’ve configured the following environment variables and settings:
1. PostgreSQL Database Credentials
- `POSTGRES_USER`: Set the database username.
- `POSTGRES_PASSWORD`: Set the database password.
- `POSTGRES_DB`: Set the name of the database.
These credentials must match the values in the Airflow environment variables:
- `AIRFLOW__DATABASE__SQL_ALCHEMY_CONN`
- `AIRFLOW__CELERY__RESULT_BACKEND`
Example:
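(The values below are the defaults from the official compose file, where these settings sit in the shared `x-airflow-common` environment; swap in your own username, password, and database name.)
```yaml
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres/airflow
```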
2. Webserver and Flower Ports
- Webserver Port (8080): Ensure port `8080` is not already in use on your system, or map it to an alternative port (e.g., `8081:8080`).
- Flower Port (5555): Similarly, ensure `5555` is available or remap it.
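If a port is taken, remapping it is a one-line change in the service's `ports` section. For example (host port `8081` here is just an illustration):
```yaml
airflow-webserver:
  ports:
    - "8081:8080"   # host:container; the UI is then reachable at http://localhost:8081
```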
3. Admin Account for Webserver
Set credentials for the default admin user:
- `_AIRFLOW_WWW_USER_USERNAME`: Admin username (e.g., `airflow`).
- `_AIRFLOW_WWW_USER_PASSWORD`: Admin password (e.g., `airflow123`).
4. Fernet Key for Encryption
If you need to encrypt sensitive data like connection passwords, generate a Fernet key and set it in:
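A common way to generate a key is with the `cryptography` package:
```bash
# Prints a new Fernet key (requires the 'cryptography' package)
python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
```
Then set the result in the Airflow environment of the compose file (the value below is a placeholder):
```yaml
AIRFLOW__CORE__FERNET_KEY: '<your-generated-key>'
```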
5. DAGs, Logs, Plugins, and Configuration Directory
Map these directories to ensure Airflow persists files locally:
- `./dags:/opt/airflow/dags`
- `./logs:/opt/airflow/logs`
- `./plugins:/opt/airflow/plugins`
- `./config:/opt/airflow/config`
These mappings ensure your DAGs and logs remain accessible outside the container.
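In the compose file these mappings sit under the `volumes:` key of the shared `x-airflow-common` block, so every Airflow service sees the same host directories, roughly like this:
```yaml
volumes:
  - ./dags:/opt/airflow/dags
  - ./logs:/opt/airflow/logs
  - ./plugins:/opt/airflow/plugins
  - ./config:/opt/airflow/config
```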
📦 Extending the Docker Image with More PIP Libraries
If you need additional Python libraries beyond what’s included by default, you can extend the Airflow image by modifying the Docker Compose file and adding a custom `Dockerfile`.
Steps to Extend the Image
Modify the Docker Compose File
In the example `docker-compose.yml`, comment out the `image` line and uncomment the `build: .` line. This tells Docker Compose to look for a `Dockerfile` in the same directory and execute its instructions.
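After the edit, the relevant lines in the `x-airflow-common` section look something like this (the image tag is whatever version your copy of the file pins):
```yaml
# image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.10.3}
build: .
```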
Create a Dockerfile
Here’s an example `Dockerfile` that extends the Airflow image with custom commands:
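Here is a minimal sketch, assuming a `requirements.txt` sits next to the `Dockerfile` and using `apache/airflow:2.10.3` as the base tag (pin whichever version matches your compose file):
```dockerfile
FROM apache/airflow:2.10.3

# System packages need root; switch back to the airflow user afterwards
USER root
RUN apt-get update \
    && apt-get install -y --no-install-recommends wget \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*
USER airflow

# Python dependencies from the requirements file copied in below
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
```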
📝 About the Dockerfile
- `COPY requirements.txt .`: Copies the `requirements.txt` file into the container.
- `pip install -r requirements.txt`: Installs all the Python libraries listed in the file.
- `RUN apt-get install wget`: Installs additional system utilities if needed (like `wget`).
Feel free to add more commands as required.
This setup is flexible, allowing you to customize the Airflow environment with the libraries and tools you need.
📋 Sample requirements.txt
Here’s a list of libraries you might find useful for an Airflow environment:
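Something like this (purely illustrative; include only what your DAGs actually import):
```text
pandas
numpy
requests
openpyxl
apache-airflow-providers-amazon
```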
💡 Try to specify library versions in `requirements.txt` to avoid compatibility issues and to maintain control over your environment. Example:
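(The pins below are placeholders; use the versions you have actually tested against your Airflow release.)
```text
pandas==2.1.4
requests==2.31.0
apache-airflow-providers-amazon==8.13.0
```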
⚠️ A Note on Maintenance
Keeping your Python libraries up to date is critical. Security teams often require regular updates or the removal of outdated libraries to reduce vulnerabilities. This process can be a pain point, but it’s essential for a stable and secure setup.
🛠️ Putting It All Together and Spinning It Up
Now that everything is in place, it’s time to fire up the containers and get Airflow running! 🌟
Remember the directive in the `docker-compose.yml` file:
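That is, the `build: .` line you uncommented earlier:
```yaml
build: .
```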
This requires a slightly different Docker Compose command than usual. Instead of the standard:
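(Shown with Compose v2 syntax; older installations use the hyphenated `docker-compose` binary instead.)
```bash
docker compose up -d
```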
We’ll use:
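(The `-d` flag keeps everything running in the background; drop it if you would rather watch the logs in the foreground.)
```bash
docker compose up --build -d
```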
Why Use `--build`?
- The `build: .` directive in the `docker-compose.yml` file ensures Docker builds the image based on your `Dockerfile` the first time.
- If the image already exists, Docker will not rebuild it unless you explicitly use the `--build` flag.
💡 Pro-Tip: You can specify the number of Celery workers here as well using the `--scale` flag:
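For example, to start three workers (the service name `airflow-worker` matches the official compose file; change the number to whatever you need):
```bash
docker compose up --build --scale airflow-worker=3 -d
```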
Explanation:
- `--scale`: This flag lets you specify the number of instances to run for a particular service defined in the `docker-compose.yml` file.
- `airflow-worker`: The name of the service from the `docker-compose.yml` file for which you want to scale the number of instances.
- `3`: Replace this with the desired number of workers.
Spin it up and start orchestrating workflows like a pro. 🚀