Interactive walkthrough for setting up a DataSurface Yellow environment on Azure AKS with Azure Database for PostgreSQL, Azure SQL Database, Helm Airflow 3.x, Azure Key Vault, and Workload Identity. Use this skill to guide users through the complete Azure installation process step-by-step.
This skill guides you through deploying a DataSurface Yellow environment on Azure AKS (Azure Kubernetes Service). It uses az CLI commands for infrastructure, Azure Database for PostgreSQL Flexible Server for the Airflow metadata database, Azure SQL Database for the merge engine, Azure Files NFS for shared storage, Azure Key Vault for credentials, and Workload Identity for pod-level Azure access. Follow each step in order and verify completion before proceeding.
Before starting, verify the user has:
- az CLI installed and logged in (az account show succeeds)
- kubectl CLI installed
- helm CLI installed
- GitHub Personal Access Token (repo scope for pushing to model/DAG repos; if you only need to push to existing repos, fine-grained tokens scoped to those repos work too -- but repo scope is required if the agent needs to create new repositories)
- If using Bitwarden for credentials, run bw sync after adding new items from another session before attempting to retrieve them
Ask the user for these environment variables if not already set:
AZURE_SUBSCRIPTION_ID # Azure subscription ID
AZURE_REGION # Azure region (recommend westus2 - see Region Selection note below)
RESOURCE_GROUP # Resource group name (e.g., ds-demo-rg)
CLUSTER_NAME # AKS cluster name (e.g., ds-demo-aks)
VNET_NAME # VNet name (e.g., ds-vnet)
PG_ADMIN_USER # PostgreSQL admin username (NOT 'admin' - reserved by Azure)
PG_ADMIN_PASSWORD # PostgreSQL admin password (see password rules below)
SQL_ADMIN_USER # Azure SQL admin username
SQL_ADMIN_PASSWORD # Azure SQL admin password (see password rules below)
GITHUB_USERNAME # GitHub username
GITHUB_TOKEN # GitHub Personal Access Token (repo access)
GITLAB_CUSTOMER_USER # GitLab deploy token username
GITLAB_CUSTOMER_TOKEN # GitLab deploy token
DATASURFACE_VERSION # DataSurface version (default: 1.1.0)
MODEL_REPO # Target model repo (e.g., yourorg/demo1)
AIRFLOW_REPO # Target DAG repo (e.g., yourorg/demo1_airflow)
NAMESPACE # K8s namespace (default: demo1-azure)
KEY_VAULT_NAME # Globally unique Key Vault name (3-24 chars, alphanumeric + hyphens, no consecutive hyphens, e.g., dsdemokv<random>)
MANAGED_IDENTITY_NAME # Managed identity name (e.g., ds-airflow-identity)
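The Key Vault naming rules above can be checked locally before any Azure calls. This is a sketch: dsdemokv$RANDOM is just an example generator, and the regex encodes the rules listed above (starts with a letter, alphanumerics and hyphens only, no consecutive hyphens, 3-24 characters):

```shell
# Generate a candidate name (example only) and validate it locally
KEY_VAULT_NAME="dsdemokv$RANDOM"
# Regex: leading letter, then alphanumerics or single hyphens (no "--"),
# total length capped separately at 24
if printf '%s' "$KEY_VAULT_NAME" | grep -Eq '^[a-zA-Z]([a-zA-Z0-9]|-[a-zA-Z0-9]){2,23}$' \
   && [ "${#KEY_VAULT_NAME}" -le 24 ]; then
  echo "Key Vault name OK: $KEY_VAULT_NAME"
else
  echo "Invalid Key Vault name: $KEY_VAULT_NAME"
fi
```

Validating up front avoids a failed az keyvault create late in the walkthrough.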
CRITICAL: Password Character Restrictions. Database passwords must NOT contain !, \, $, backticks, or other shell metacharacters. Interactive zsh (the macOS default shell) applies history expansion to !, which can silently corrupt passwords passed to kubectl create secret --from-literal (e.g., DsDemo2024pg! becomes DsDemo2024pg\! in the stored secret). This causes persistent "password authentication failed" errors that are very hard to diagnose, because the password looks correct in the az CLI but doesn't match what's stored in the K8s secret. Use only alphanumeric characters and simple symbols like - or _ in all passwords. Minimum 8 characters with uppercase + lowercase + number to satisfy Azure complexity requirements.
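A compliant password can be generated and checked locally. This sketch restricts the random part to lowercase alphanumerics and adds a fixed prefix/suffix to satisfy the uppercase/lowercase/number requirement (any generator works as long as the character set stays restricted):

```shell
# Random 12-char core from a safe character set; "Ds" + "1" guarantee
# uppercase, lowercase, and a digit for Azure complexity rules
PG_ADMIN_PASSWORD="Ds$(LC_ALL=C tr -dc 'a-z0-9' < /dev/urandom | head -c 12)1"
# Double-check there are no shell metacharacters before storing it anywhere
case "$PG_ADMIN_PASSWORD" in
  *[!A-Za-z0-9_-]*) echo "unsafe password: regenerate" ;;
  *) echo "password OK" ;;
esac
```

Use the same approach for SQL_ADMIN_PASSWORD.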
Verify Azure CLI is configured:
az account show --query "{subscriptionId:id, name:name, state:state}" -o table
Set the active subscription:
az account set --subscription $AZURE_SUBSCRIPTION_ID
IMPORTANT: Region Selection. Azure periodically restricts new database provisioning in popular regions (especially eastus). Both PostgreSQL Flexible Server and Azure SQL Database can be blocked simultaneously. Recommend westus2 as the default region -- it has broad service availability and fewer provisioning restrictions. If your chosen region fails with RegionDoesNotAllowProvisioning or "location is restricted for provisioning", you must tear down and recreate everything in a different region. All resources (VNet, AKS, PostgreSQL, SQL) must be in the same region for VNet integration to work. Cross-region VNet rules and VNet-integrated PostgreSQL are not supported.
IMPORTANT: Pre-register Azure resource providers. New subscriptions often lack provider registrations, which causes MissingSubscriptionRegistration errors during resource creation. Register all needed providers upfront (takes 1-2 minutes):
az provider register --namespace Microsoft.ContainerService
az provider register --namespace Microsoft.DBforPostgreSQL
az provider register --namespace Microsoft.Sql
az provider register --namespace Microsoft.KeyVault
az provider register --namespace Microsoft.ManagedIdentity
az provider register --namespace Microsoft.Storage
az provider register --namespace Microsoft.Network
# Wait for the critical one (AKS)
while [ "$(az provider show -n Microsoft.ContainerService --query registrationState -o tsv)" != "Registered" ]; do sleep 10; done
echo "ContainerService registered"
IMPORTANT: Check vCPU quota. New Azure subscriptions typically have a 10 vCPU regional quota. Our default config (3 x Standard_D2s_v3 = 6 cores) fits within this. If you need larger VMs, check your quota first:
az vm list-usage --location $AZURE_REGION -o table | grep "Total Regional"
If your quota is less than the cores you need, request an increase via the Azure Portal before proceeding.
Always run this step, even for "fresh" installations. Previous resource groups, namespaces, or Key Vault soft-deleted secrets can cause conflicts.
kubectl get namespace $NAMESPACE 2>/dev/null && \
kubectl delete namespace $NAMESPACE
# If namespace is stuck in Terminating state (wait 30 seconds, then check):
kubectl get namespace $NAMESPACE -o json 2>/dev/null | jq '.spec.finalizers = []' | \
kubectl replace --raw "/api/v1/namespaces/$NAMESPACE/finalize" -f -
WARNING: This deletes ALL resources in the group (AKS, databases, Key Vault, VNet). Only do this if you want a completely fresh start.
# az group exists prints "true"/"false" but always exits 0, so test its output
if [ "$(az group exists --name $RESOURCE_GROUP)" = "true" ]; then
az group delete --name $RESOURCE_GROUP --yes --no-wait
fi
Wait for the resource group to be fully deleted before proceeding (can take 5-10 minutes):
while az group exists --name $RESOURCE_GROUP 2>/dev/null | grep -q true; do
echo "Waiting for resource group deletion..."
sleep 30
done
echo "Resource group deleted"
Azure Key Vault has soft-delete enabled by default. If you previously deleted a Key Vault with the same name, you must purge it before recreating:
az keyvault purge --name $KEY_VAULT_NAME 2>/dev/null || true
If you are reusing an existing Key Vault rather than deleting the resource group:
az keyvault secret delete --vault-name $KEY_VAULT_NAME \
--name "datasurface--${NAMESPACE}--Demo--sqlserver-demo-merge--credentials" 2>/dev/null || true
az keyvault secret delete --vault-name $KEY_VAULT_NAME \
--name "datasurface--${NAMESPACE}--Demo--git--credentials" 2>/dev/null || true
# Purge deleted secrets (required before re-creating with same name)
az keyvault secret purge --vault-name $KEY_VAULT_NAME \
--name "datasurface--${NAMESPACE}--Demo--sqlserver-demo-merge--credentials" 2>/dev/null || true
az keyvault secret purge --vault-name $KEY_VAULT_NAME \
--name "datasurface--${NAMESPACE}--Demo--git--credentials" 2>/dev/null || true
CRITICAL: If the model or DAG repositories were used in a previous deployment, stale tags and releases will cause the VersionPatternReleaseSelector to find old commits with wrong hostnames/config. The infrastructure DAG and all job pods resolve the model via the GitHub Releases API — a stale release pointing to an old commit means every job runs against the wrong configuration, and the error ("No tags found matching pattern") is misleading because the tag exists but the release doesn't match.
# Delete all releases and their tags from the model repo
# (gh api deletes the tag ref via the GitHub API, so no local clone is needed)
for tag in $(gh release list --repo $MODEL_REPO --json tagName -q '.[].tagName'); do
gh release delete "$tag" --repo $MODEL_REPO --yes 2>/dev/null || true
gh api -X DELETE "repos/$MODEL_REPO/git/refs/tags/$tag" 2>/dev/null || true
done
# Delete all releases and their tags from the DAG repo
for tag in $(gh release list --repo $AIRFLOW_REPO --json tagName -q '.[].tagName'); do
gh release delete "$tag" --repo $AIRFLOW_REPO --yes 2>/dev/null || true
gh api -X DELETE "repos/$AIRFLOW_REPO/git/refs/tags/$tag" 2>/dev/null || true
done
Checkpoint:
- az group exists --name $RESOURCE_GROUP returns false
- kubectl get namespace $NAMESPACE returns "not found"
- az keyvault show --name $KEY_VAULT_NAME returns "not found" or has been purged
- gh release list --repo $MODEL_REPO returns empty
- gh release list --repo $AIRFLOW_REPO returns empty
Create the resource group:
az group create --name $RESOURCE_GROUP --location $AZURE_REGION
Checkpoint:
az group show --name $RESOURCE_GROUP --query "{name:name, location:location, state:properties.provisioningState}" -o table
Provisioning state must be Succeeded.
Create a VNet with three subnets: one for AKS, one delegated to PostgreSQL Flexible Server, and one for Azure SQL private endpoints.
# Create VNet
az network vnet create \
--resource-group $RESOURCE_GROUP \
--name $VNET_NAME \
--address-prefix 10.0.0.0/16 \
--location $AZURE_REGION
# AKS subnet (large - AKS needs IPs for nodes + pods)
az network vnet subnet create \
--resource-group $RESOURCE_GROUP \
--vnet-name $VNET_NAME \
--name aks-subnet \
--address-prefix 10.0.0.0/20
# PostgreSQL delegated subnet (required for VNet-integrated Flex Server)
az network vnet subnet create \
--resource-group $RESOURCE_GROUP \
--vnet-name $VNET_NAME \
--name pg-subnet \
--address-prefix 10.0.16.0/24 \
--delegations Microsoft.DBforPostgreSQL/flexibleServers
# Azure SQL private endpoint subnet
az network vnet subnet create \
--resource-group $RESOURCE_GROUP \
--vnet-name $VNET_NAME \
--name sql-subnet \
--address-prefix 10.0.17.0/24
Checkpoint:
az network vnet subnet list --resource-group $RESOURCE_GROUP --vnet-name $VNET_NAME -o table
Should show three subnets: aks-subnet, pg-subnet, sql-subnet.
AKS_SUBNET_ID=$(az network vnet subnet show \
--resource-group $RESOURCE_GROUP \
--vnet-name $VNET_NAME \
--name aks-subnet \
--query id -o tsv)
az aks create \
--resource-group $RESOURCE_GROUP \
--name $CLUSTER_NAME \
--node-count 3 \
--node-vm-size Standard_D2s_v3 \
--vnet-subnet-id $AKS_SUBNET_ID \
--enable-oidc-issuer \
--enable-workload-identity \
--generate-ssh-keys \
--location $AZURE_REGION \
--service-cidr 172.16.0.0/16 \
--dns-service-ip 172.16.0.10
Note: We use Standard_D2s_v3 (2 vCPU, 8 GB RAM) by default because most new Azure subscriptions have a 10 vCPU regional quota. 3 x D2s_v3 = 6 cores fits comfortably. Use Standard_D4s_v3 (4 vCPU) if you have quota for 12+ cores.
Note: The --service-cidr 172.16.0.0/16 and --dns-service-ip 172.16.0.10 flags are required because the default AKS service CIDR (10.0.0.0/16) overlaps with our VNet address space (10.0.0.0/16). Using 172.16.0.0/16 avoids the conflict.
This takes approximately 5-10 minutes.
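The CIDR overlap rule can be sanity-checked locally before running az aks create. This is a crude sketch that only compares the first two octets, which is sufficient for the two /16 blocks used in this walkthrough:

```shell
VNET_CIDR="10.0.0.0/16"       # matches --address-prefix in Step 3
SERVICE_CIDR="172.16.0.0/16"  # matches --service-cidr above
# For /16 blocks, overlap means the first two octets are equal
if [ "${VNET_CIDR%.*.*/*}" = "${SERVICE_CIDR%.*.*/*}" ]; then
  echo "CIDR overlap: pick a different service CIDR"
else
  echo "no overlap between $VNET_CIDR and $SERVICE_CIDR"
fi
```

A full check would need proper prefix-length math; this shortcut only holds because both ranges here are /16.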
Checkpoint:
az aks show --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME \
--query "{name:name, state:provisioningState, k8sVersion:kubernetesVersion, nodeCount:agentPoolProfiles[0].count}" -o table
Provisioning state must be Succeeded. Verify OIDC issuer is enabled:
az aks show --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME \
--query "oidcIssuerProfile.enabled" -o tsv
Must return true.
This PostgreSQL instance is used only for the Airflow metadata database. The merge engine uses Azure SQL Database (Step 5).
Note: If this fails with "location is restricted for provisioning of flexible servers", your chosen region is blocking new PostgreSQL Flex Server creation. You must tear down ALL resources and start over in a different region (e.g., westus2). PostgreSQL Flex Server with VNet integration requires the server and VNet to be in the same region -- there is no cross-region workaround.
az postgres flexible-server create \
--resource-group $RESOURCE_GROUP \
--name ${RESOURCE_GROUP}-pgflex \
--location $AZURE_REGION \
--admin-user $PG_ADMIN_USER \
--admin-password "$PG_ADMIN_PASSWORD" \
--sku-name Standard_B1ms \
--tier Burstable \
--version 16 \
--vnet $VNET_NAME \
--subnet pg-subnet \
--yes
This takes approximately 5-10 minutes. The --vnet and --subnet flags create the server with VNet integration, so it is only accessible from within the VNet (including AKS pods).
Get the PostgreSQL FQDN:
PG_FQDN=$(az postgres flexible-server show \
--resource-group $RESOURCE_GROUP \
--name ${RESOURCE_GROUP}-pgflex \
--query fullyQualifiedDomainName -o tsv)
echo "PostgreSQL FQDN: $PG_FQDN"
IMPORTANT: Create the Airflow database. The Flexible Server is created with a default postgres database, but Airflow needs its own database. We will create it after configuring kubeconfig (Step 7), since the PostgreSQL server is only accessible from within the VNet.
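For reference, the metadata database URL that the Airflow Helm configuration will need later has this shape. This is a sketch with placeholder values (the _EXAMPLE variables are illustrative only; substitute your real $PG_FQDN, $PG_ADMIN_USER, and $PG_ADMIN_PASSWORD). The postgresql+psycopg2 scheme is Airflow's standard SQLAlchemy dialect for PostgreSQL:

```shell
# Placeholder values for illustration -- replace with your real ones
PG_ADMIN_USER_EXAMPLE="dsadmin"
PG_ADMIN_PASSWORD_EXAMPLE="DsDemo2024-pg"
PG_FQDN_EXAMPLE="ds-demo-rg-pgflex.postgres.database.azure.com"
# Airflow metadata DB connection URL (sql_alchemy_conn format)
AIRFLOW_DB_URL="postgresql+psycopg2://${PG_ADMIN_USER_EXAMPLE}:${PG_ADMIN_PASSWORD_EXAMPLE}@${PG_FQDN_EXAMPLE}:5432/airflow_db"
echo "$AIRFLOW_DB_URL"
```

Because the password is embedded in this URL, the character restrictions from the prerequisites section apply here too.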
Checkpoint:
az postgres flexible-server show \
--resource-group $RESOURCE_GROUP \
--name ${RESOURCE_GROUP}-pgflex \
--query "{name:name, state:state, fqdn:fullyQualifiedDomainName, version:version}" -o table
State must be Ready.
This Azure SQL Database is used as the merge engine (SQL Server). DataSurface's merge jobs connect here to perform SCD2 operations.
Note: Some Azure regions periodically stop accepting new SQL Database server provisioning. If eastus fails with RegionDoesNotAllowProvisioning, try eastus2 or another nearby region. When using a different region than the VNet, you cannot use VNet rules for firewall -- instead use the "Allow Azure services" firewall rule (0.0.0.0 to 0.0.0.0) or create a cross-region private endpoint.
SQL_SERVER_NAME="${RESOURCE_GROUP}-sqlserver"
# Create the logical SQL server
az sql server create \
--resource-group $RESOURCE_GROUP \
--name $SQL_SERVER_NAME \
--location $AZURE_REGION \
--admin-user $SQL_ADMIN_USER \
--admin-password "$SQL_ADMIN_PASSWORD"
# Create the merge_db database
az sql db create \
--resource-group $RESOURCE_GROUP \
--server $SQL_SERVER_NAME \
--name merge_db \
--service-objective S1
# Get the SQL Server FQDN
SQL_SERVER_FQDN="${SQL_SERVER_NAME}.database.windows.net"
echo "SQL Server FQDN: $SQL_SERVER_FQDN"
There are two approaches for AKS-to-SQL connectivity. For test/dev environments, use the simpler public access approach (Option B below). Private endpoints (Option A) are more secure but have known issues with NIC IP assignment that can require significant debugging.
WARNING: The private endpoint NIC may report privateIpAddress: None immediately after creation. This is a known Azure timing issue. If PE_IP comes back empty, wait 60-120 seconds and retry the NIC query. You can also query the IP via the private endpoint's customDnsConfigs:
# Alternative way to get the PE IP if ipConfigurations returns None
az network private-endpoint show \
--resource-group $RESOURCE_GROUP \
--name "${SQL_SERVER_NAME}-pe" \
--query "customDnsConfigs[0].ipAddresses[0]" -o tsv
For production deployments, the private endpoint must work — do not fall back to public access. If the IP remains empty after multiple retries, delete and recreate the private endpoint.
Option A (private endpoint): To allow AKS pods to reach Azure SQL over the VNet (without public internet):
# Disable public network access (security best practice)
az sql server update \
--resource-group $RESOURCE_GROUP \
--name $SQL_SERVER_NAME \
--set publicNetworkAccess=Disabled
# Get SQL Server resource ID
SQL_SERVER_ID=$(az sql server show \
--resource-group $RESOURCE_GROUP \
--name $SQL_SERVER_NAME \
--query id -o tsv)
# Create private endpoint
az network private-endpoint create \
--resource-group $RESOURCE_GROUP \
--name "${SQL_SERVER_NAME}-pe" \
--vnet-name $VNET_NAME \
--subnet sql-subnet \
--private-connection-resource-id $SQL_SERVER_ID \
--group-id sqlServer \
--connection-name "${SQL_SERVER_NAME}-pe-conn"
# Create private DNS zone for SQL Server
az network private-dns zone create \
--resource-group $RESOURCE_GROUP \
--name "privatelink.database.windows.net"
# Link DNS zone to VNet
az network private-dns zone vnet-link create \
--resource-group $RESOURCE_GROUP \
--zone-name "privatelink.database.windows.net" \
--name "${VNET_NAME}-sql-link" \
--virtual-network $VNET_NAME \
--registration-enabled false
# Create DNS records for the private endpoint
PE_NIC_ID=$(az network private-endpoint show \
--resource-group $RESOURCE_GROUP \
--name "${SQL_SERVER_NAME}-pe" \
--query "networkInterfaces[0].id" -o tsv)
PE_IP=$(az network nic show --ids $PE_NIC_ID \
--query "ipConfigurations[0].privateIpAddress" -o tsv)
az network private-dns record-set a create \
--resource-group $RESOURCE_GROUP \
--zone-name "privatelink.database.windows.net" \
--name $SQL_SERVER_NAME
az network private-dns record-set a add-record \
--resource-group $RESOURCE_GROUP \
--zone-name "privatelink.database.windows.net" \
--record-set-name $SQL_SERVER_NAME \
--ipv4-address $PE_IP
Option B (public access with firewall rules): This is the simpler and more reliable approach. Enable public access and restrict it to Azure services and your VNet:
# Only if NOT using private endpoint.
# Enable the Microsoft.Sql service endpoint on the AKS subnet FIRST --
# creating the VNet rule fails if the subnet does not have the endpoint yet
az network vnet subnet update \
--resource-group $RESOURCE_GROUP \
--vnet-name $VNET_NAME \
--name aks-subnet \
--service-endpoints Microsoft.Sql
# Allow traffic from the AKS subnet
az sql server vnet-rule create \
--resource-group $RESOURCE_GROUP \
--server $SQL_SERVER_NAME \
--name aks-access \
--vnet-name $VNET_NAME \
--subnet aks-subnet
# Allow Azure services (required for AKS pods to reach SQL via Azure backbone)
az sql server firewall-rule create \
--resource-group $RESOURCE_GROUP \
--server $SQL_SERVER_NAME \
--name AllowAzureServices \
--start-ip-address 0.0.0.0 \
--end-ip-address 0.0.0.0
Checkpoint:
az sql db show --resource-group $RESOURCE_GROUP --server $SQL_SERVER_NAME --name merge_db \
--query "{name:name, status:status, serviceObjective:currentServiceObjectiveName}" -o table
Status must be Online.
If using private endpoint, verify:
az network private-endpoint show --resource-group $RESOURCE_GROUP \
--name "${SQL_SERVER_NAME}-pe" \
--query "{name:name, state:provisioningState, status:privateLinkServiceConnections[0].privateLinkServiceConnectionState.status}" -o table
State must be Succeeded and status must be Approved.
az keyvault create \
--resource-group $RESOURCE_GROUP \
--name $KEY_VAULT_NAME \
--location $AZURE_REGION \
--enable-rbac-authorization true
IMPORTANT: We use --enable-rbac-authorization true so that access is controlled via Azure RBAC roles (specifically "Key Vault Secrets User") rather than vault access policies. This integrates cleanly with Workload Identity.
Grant yourself access to manage secrets during setup:
CURRENT_USER_ID=$(az ad signed-in-user show --query id -o tsv)
az role assignment create \
--role "Key Vault Secrets Officer" \
--assignee-object-id $CURRENT_USER_ID \
--scope $(az keyvault show --name $KEY_VAULT_NAME --query id -o tsv)
Checkpoint:
az keyvault show --name $KEY_VAULT_NAME \
--query "{name:name, state:properties.provisioningState, rbac:properties.enableRbacAuthorization, uri:properties.vaultUri}" -o table
State must be Succeeded, RBAC must be True.
az aks get-credentials \
--resource-group $RESOURCE_GROUP \
--name $CLUSTER_NAME \
--overwrite-existing
kubectl get nodes
Checkpoint: All nodes should be in Ready status:
kubectl get nodes -o wide
IMPORTANT: Create the Airflow database now. The PostgreSQL Flexible Server is only accessible from within the VNet, so we must create the airflow_db database from inside the AKS cluster:
kubectl run db-setup --rm -i --restart=Never \
--image=postgres:16 \
--env="PGPASSWORD=$PG_ADMIN_PASSWORD" \
-- bash -c "psql -h $PG_FQDN -U $PG_ADMIN_USER -d postgres -c 'CREATE DATABASE airflow_db;'"
Note: This runs in the default namespace since our application namespace does not exist yet.
Checkpoint: Test PostgreSQL connectivity from a pod:
kubectl run db-test --rm -i --restart=Never \
--image=postgres:16 \
--env="PGPASSWORD=$PG_ADMIN_PASSWORD" \
-- psql -h $PG_FQDN -U $PG_ADMIN_USER -d airflow_db -c "SELECT 1;"
NOTE: For test/dev environments, you can skip this step entirely and use plain Kubernetes secrets instead (Steps 16, 18, and 23 create the necessary K8s secrets). The CSI Secrets Store driver is only needed if you want to sync Azure Key Vault secrets directly into pods. If you skip this step, also skip Steps 10b-10c (Workload Identity federation) and Step 21 (SA annotations for Workload Identity). The Key Vault created in Step 6 is still useful for storing secrets centrally even without CSI — you just manage K8s secrets manually.
If you do want CSI Secrets Store (recommended for production):
IMPORTANT: Install the Azure provider Helm chart, which includes the CSI driver as a dependency. Do NOT install them separately — installing the CSI driver first and the Azure provider second causes hostPort conflicts (port 9808) and DaemonSet scheduling failures on multi-node clusters.
helm repo add csi-secrets-store-provider-azure \
https://azure.github.io/secrets-store-csi-driver-provider-azure/charts
helm repo update
helm install csi-azure csi-secrets-store-provider-azure/csi-secrets-store-provider-azure \
--namespace kube-system \
--set secrets-store-csi-driver.install=true
Wait for all pods to be ready:
kubectl wait --for=condition=ready pod -l app=csi-secrets-store-provider-azure -n kube-system --timeout=120s
kubectl wait --for=condition=ready pod -l app=secrets-store-csi-driver -n kube-system --timeout=120s
Checkpoint:
kubectl get pods -n kube-system -l app=csi-secrets-store-provider-azure
kubectl get pods -n kube-system -l app=secrets-store-csi-driver
Both CSI driver and Azure provider pods should be Running.
Troubleshooting: If provider pods are Pending with "didn't have free ports" errors, the CSI driver and provider were likely installed separately. Uninstall both (helm uninstall <release-name> -n kube-system), wait for all pods to terminate, then reinstall using the single command above.
Azure Files NFS is built-in to AKS and does not require installing any additional CSI drivers or addons. The azurefile-csi-nfs StorageClass is available by default.
kubectl get storageclass azurefile-csi-nfs
If the StorageClass exists, test it with a temporary PVC:
cat > /tmp/test-azurefile-pvc.yaml << 'EOF'
apiVersion: v1