Generate Apache Airflow ETL pipelines for government websites and other document sources. Explore a target website to find downloadable documents, verify that licenses permit commercial use, and create complete Airflow DAG assets with daily scheduling. Use when the user wants to create ETL pipelines, scrape government documents, or automate document collection workflows.
Generate production-ready Apache Airflow ETL pipelines that automatically discover, download, and transform documents from government websites and other data sources into structured markdown files.
Initial Analysis: Explore the target website to identify pages that link to downloadable documents (PDF, DOCX, CSV, and similar formats).
License Verification: Check the site's terms of use and per-document licenses to confirm that commercial use is permitted.
Document Inventory: Compile a list of the documents found, including their URLs and file formats.
User Confirmation: Present the inventory and license findings to the user and wait for approval before generating any pipeline code.
Create a complete, production-ready Airflow project structure:
airflow_pipelines/
├── dags/
│   └── [source_name]_etl_dag.py
├── operators/
│   ├── __init__.py
│   ├── document_scraper.py
│   └── document_converter.py
├── utils/
│   ├── __init__.py
│   ├── license_checker.py
│   └── file_manager.py
├── config/
│   └── [source_name]_config.yaml
├── requirements.txt
└── README.md
1. DAG File (dags/[source_name]_etl_dag.py):
2. Document Scraper (operators/document_scraper.py):
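A sketch of the discovery logic, using only the standard library so it stays dependency-light; the extension list and function names are illustrative, and a production scraper would add pagination, politeness delays, and robots.txt handling.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# File extensions treated as downloadable documents (an assumption,
# configurable in the real pipeline).
DOC_EXTENSIONS = (".pdf", ".docx", ".xlsx", ".csv")


class DocumentLinkParser(HTMLParser):
    """Collect absolute URLs of downloadable documents from an HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.lower().endswith(DOC_EXTENSIONS):
            # Resolve relative hrefs against the page URL.
            self.links.append(urljoin(self.base_url, href))


def find_document_links(html, base_url):
    """Return absolute document URLs linked from the given HTML."""
    parser = DocumentLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

For example, a page containing `<a href="/files/report.pdf">` fetched from `https://example.gov/docs/` yields `https://example.gov/files/report.pdf`.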
3. Document Converter (operators/document_converter.py):
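A sketch of the conversion step, assuming the goal is markdown output with YAML front matter recording provenance; the function name and front-matter fields are assumptions, not a fixed schema.

```python
from datetime import date


def to_markdown(title, source_url, body_text):
    """Wrap extracted document text in markdown with YAML front matter."""
    front_matter = "\n".join(
        [
            "---",
            f"title: {title}",
            f"source_url: {source_url}",
            f"retrieved: {date.today().isoformat()}",
            "---",
        ]
    )
    # Normalize paragraph breaks so downstream tools see clean markdown.
    paragraphs = [p.strip() for p in body_text.split("\n\n") if p.strip()]
    return front_matter + "\n\n" + "\n\n".join(paragraphs) + "\n"
```

Extracting `body_text` from a PDF or DOCX requires a parsing library chosen per format; this sketch only covers the markdown assembly.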
4. License Checker (utils/license_checker.py):
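One plausible approach to the license check is keyword matching against the source's terms-of-use text. The marker lists below are illustrative and deliberately incomplete; anything classified as unknown should be escalated to the user rather than assumed safe.

```python
# Checked first: markers that forbid or restrict commercial use.
RESTRICTED_MARKERS = (
    "cc by-nc",
    "non-commercial",
    "noncommercial",
    "all rights reserved",
)

# Licenses generally understood to permit commercial reuse.
PERMISSIVE_MARKERS = (
    "public domain",
    "cc0",
    "cc by",
    "open government licence",
    "open government license",
)


def check_commercial_use(license_text):
    """Classify a license blurb as ('denied'|'allowed'|'unknown', marker).

    Restricted markers are tested first because e.g. "CC BY-NC"
    contains the permissive substring "CC BY".
    """
    text = license_text.lower()
    for marker in RESTRICTED_MARKERS:
        if marker in text:
            return ("denied", marker)
    for marker in PERMISSIVE_MARKERS:
        if marker in text:
            return ("allowed", marker)
    return ("unknown", None)
```

The ordering of the two loops is the important design choice: restrictive markers must win over permissive substrings they contain.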
5. File Manager (utils/file_manager.py):
6. Configuration (config/[source_name]_config.yaml):
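The config file could look something like the sketch below; every key shown is an assumption about what the generated operators would read, not a fixed schema.

```yaml
source:
  name: example_source            # templates the dag_id and storage paths
  base_url: https://example.gov/documents
  document_extensions: [.pdf, .docx, .csv]

license:
  require_commercial_use: true    # abort generation if the check fails

schedule:
  cron: "@daily"
  catchup: false

storage:
  root: data/example_source
  output_format: markdown

http:
  timeout_seconds: 30
  retries: 2
  user_agent: "example-etl-bot/1.0"
```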