Interactive walkthrough for setting up a DataSurface Yellow environment on Azure AKS with Azure Database for PostgreSQL, Azure SQL Database, Helm Airflow 3.x, Azure Key Vault, and Workload Identity. Use this skill to guide users through the complete Azure installation process step-by-step.
This skill guides you through deploying a DataSurface Yellow environment on Azure AKS (Azure Kubernetes Service). It uses az CLI commands for infrastructure, Azure Database for PostgreSQL Flexible Server for the Airflow metadata database, Azure SQL Database for the merge engine, Azure Files NFS for shared storage, Azure Key Vault for credentials, and Workload Identity for pod-level Azure access. Follow each step in order and verify completion before proceeding.
Before starting, verify the user has:
- az CLI installed and authenticated (az account show succeeds)
- kubectl CLI installed
- helm CLI installed

Ask the user for these environment variables if not already set:
AZURE_SUBSCRIPTION_ID # Azure subscription ID
AZURE_REGION # Azure region (recommend westus2 - see Region Selection note below)
RESOURCE_GROUP # Resource group name (e.g., ds-demo-rg)
CLUSTER_NAME # AKS cluster name (e.g., ds-demo-aks)
VNET_NAME # VNet name (e.g., ds-vnet)
PG_ADMIN_USER # PostgreSQL admin username (NOT 'admin' - reserved by Azure)
PG_ADMIN_PASSWORD # PostgreSQL admin password (see password rules below)
SQL_ADMIN_USER # Azure SQL admin username
SQL_ADMIN_PASSWORD # Azure SQL admin password (see password rules below)
GITHUB_USERNAME # GitHub username
GITHUB_TOKEN # GitHub Personal Access Token (repo access)
GITLAB_CUSTOMER_USER # GitLab deploy token username
GITLAB_CUSTOMER_TOKEN # GitLab deploy token
DATASURFACE_VERSION # DataSurface version (default: 1.1.0)
MODEL_REPO # Target model repo (e.g., yourorg/demo1)
AIRFLOW_REPO # Target DAG repo (e.g., yourorg/demo1_airflow)
NAMESPACE # K8s namespace (default: demo1-azure)
KEY_VAULT_NAME # Globally unique Key Vault name (3-24 chars, alphanumeric + hyphens, no consecutive hyphens, e.g., dsdemokv<random>)
MANAGED_IDENTITY_NAME # Managed identity name (e.g., ds-airflow-identity)
CRITICAL: Password Character Restrictions. Database passwords must NOT contain !, \, $, backticks, or other shell metacharacters. The macOS default shell (zsh) escapes ! even inside single quotes when used with kubectl create secret --from-literal, silently corrupting passwords (e.g., DsDemo2024pg! becomes DsDemo2024pg\! in the stored secret). This causes persistent "password authentication failed" errors that are extremely difficult to diagnose because the password looks correct in the az CLI but doesn't match what's stored in the K8s secret. Use only alphanumeric characters and simple symbols like - or _ in all passwords. Minimum 8 characters with uppercase + lowercase + number to satisfy Azure complexity requirements.
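To sidestep these issues entirely, you can generate compliant passwords up front. A sketch in plain POSIX shell; the 16-character length and the restriction to alphanumerics plus - and _ are choices made here, not Azure requirements:

```shell
# Generate a password from shell-safe characters only (alphanumeric
# plus '-' and '_'), retrying until it contains at least one
# uppercase letter, one lowercase letter, and one digit.
gen_password() {
  len="${1:-16}"
  while :; do
    pw=$(LC_ALL=C tr -dc 'A-Za-z0-9_-' < /dev/urandom | head -c "$len")
    case "$pw" in *[A-Z]*) ;; *) continue ;; esac
    case "$pw" in *[a-z]*) ;; *) continue ;; esac
    case "$pw" in *[0-9]*) ;; *) continue ;; esac
    printf '%s\n' "$pw"
    return 0
  done
}
PG_ADMIN_PASSWORD=$(gen_password)
SQL_ADMIN_PASSWORD=$(gen_password)
```

Because the output contains no shell metacharacters, it survives zsh, kubectl secret creation, and az CLI quoting unchanged.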
Verify Azure CLI is configured:
az account show --query "{subscriptionId:id, name:name, state:state}" -o table
Set the active subscription:
az account set --subscription $AZURE_SUBSCRIPTION_ID
IMPORTANT: Region Selection. Azure periodically restricts new database provisioning in popular regions (especially eastus). Both PostgreSQL Flexible Server and Azure SQL Database can be blocked simultaneously. Recommend westus2 as the default region -- it has broad service availability and fewer provisioning restrictions. If your chosen region fails with RegionDoesNotAllowProvisioning or "location is restricted for provisioning", you must tear down and recreate everything in a different region. All resources (VNet, AKS, PostgreSQL, SQL) must be in the same region for VNet integration to work. Cross-region VNet rules and VNet-integrated PostgreSQL are not supported.
IMPORTANT: Pre-register Azure resource providers. New subscriptions often lack provider registrations, which causes MissingSubscriptionRegistration errors during resource creation. Register all needed providers upfront (takes 1-2 minutes):
az provider register --namespace Microsoft.ContainerService
az provider register --namespace Microsoft.DBforPostgreSQL
az provider register --namespace Microsoft.Sql
az provider register --namespace Microsoft.KeyVault
az provider register --namespace Microsoft.ManagedIdentity
az provider register --namespace Microsoft.Storage
az provider register --namespace Microsoft.Network
# Wait for the critical one (AKS)
while [ "$(az provider show -n Microsoft.ContainerService --query registrationState -o tsv)" != "Registered" ]; do sleep 10; done
echo "ContainerService registered"
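The same wait loop can be generalized to cover every provider, so later steps never hit MissingSubscriptionRegistration mid-install. A sketch:

```shell
# Poll a resource provider namespace until it reports Registered.
wait_registered() {
  while [ "$(az provider show -n "$1" --query registrationState -o tsv)" != "Registered" ]; do
    echo "Waiting for $1..."
    sleep 10
  done
  echo "$1 registered"
}
# Usage (after the 'az provider register' commands above):
#   for ns in Microsoft.DBforPostgreSQL Microsoft.Sql Microsoft.KeyVault \
#             Microsoft.ManagedIdentity Microsoft.Storage Microsoft.Network; do
#     wait_registered "$ns"
#   done
```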
IMPORTANT: Check vCPU quota. New Azure subscriptions typically have a 10 vCPU regional quota. Our default config (3 x Standard_D2s_v3 = 6 cores) fits within this. If you need larger VMs, check your quota first:
az vm list-usage --location $AZURE_REGION -o table | grep "Total Regional"
If your quota is less than the cores you need, request an increase via the Azure Portal before proceeding.
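The quota check can be scripted so the comparison is explicit. A sketch, assuming the regional total appears under the usage name cores ("Total Regional vCPUs") in az vm list-usage output:

```shell
# Check whether the region has enough free vCPU quota for the
# requested core count.
check_vcpu_quota() {
  needed="$1"
  usage=$(az vm list-usage --location "$AZURE_REGION" \
    --query "[?name.value=='cores'].[currentValue,limit]" -o tsv)
  current=$(printf '%s' "$usage" | cut -f1)
  limit=$(printf '%s' "$usage" | cut -f2)
  free=$((limit - current))
  if [ "$free" -lt "$needed" ]; then
    echo "Insufficient quota: need $needed vCPUs, $free free of $limit"
    return 1
  fi
  echo "Quota OK: $free of $limit vCPUs free"
}
# Usage: check_vcpu_quota 6   # 3 x Standard_D2s_v3
```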
Always run this step, even for "fresh" installations. Previous resource groups, namespaces, or Key Vault soft-deleted secrets can cause conflicts.
kubectl get namespace $NAMESPACE 2>/dev/null && \
kubectl delete namespace $NAMESPACE
# If namespace is stuck in Terminating state (wait 30 seconds, then check):
kubectl get namespace $NAMESPACE -o json 2>/dev/null | jq '.spec.finalizers = []' | \
kubectl replace --raw "/api/v1/namespaces/$NAMESPACE/finalize" -f -
WARNING: This deletes ALL resources in the group (AKS, databases, Key Vault, VNet). Only do this if you want a completely fresh start.
# Note: 'az group exists' prints "true"/"false" and always exits 0,
# so test its output rather than chaining with &&
[ "$(az group exists --name $RESOURCE_GROUP)" = "true" ] && \
az group delete --name $RESOURCE_GROUP --yes --no-wait
Wait for the resource group to be fully deleted before proceeding (can take 5-10 minutes):
while az group exists --name $RESOURCE_GROUP 2>/dev/null | grep -q true; do
echo "Waiting for resource group deletion..."
sleep 30
done
echo "Resource group deleted"
Azure Key Vault has soft-delete enabled by default. If you previously deleted a Key Vault with the same name, you must purge it before recreating:
az keyvault purge --name $KEY_VAULT_NAME 2>/dev/null || true
If you are reusing an existing Key Vault rather than deleting the resource group:
az keyvault secret delete --vault-name $KEY_VAULT_NAME \
--name "datasurface--${NAMESPACE}--Demo--sqlserver-demo-merge--credentials" 2>/dev/null || true
az keyvault secret delete --vault-name $KEY_VAULT_NAME \
--name "datasurface--${NAMESPACE}--Demo--git--credentials" 2>/dev/null || true
# Purge deleted secrets (required before re-creating with same name)
az keyvault secret purge --vault-name $KEY_VAULT_NAME \
--name "datasurface--${NAMESPACE}--Demo--sqlserver-demo-merge--credentials" 2>/dev/null || true
az keyvault secret purge --vault-name $KEY_VAULT_NAME \
--name "datasurface--${NAMESPACE}--Demo--git--credentials" 2>/dev/null || true
Checkpoint:
- az group exists --name $RESOURCE_GROUP returns false
- kubectl get namespace $NAMESPACE returns "not found"
- az keyvault show --name $KEY_VAULT_NAME returns "not found" or has been purged

Create the resource group:
az group create --name $RESOURCE_GROUP --location $AZURE_REGION
Checkpoint:
az group show --name $RESOURCE_GROUP --query "{name:name, location:location, state:properties.provisioningState}" -o table
Provisioning state must be Succeeded.
Create a VNet with three subnets: one for AKS, one delegated to PostgreSQL Flexible Server, and one for Azure SQL private endpoints.
# Create VNet
az network vnet create \
--resource-group $RESOURCE_GROUP \
--name $VNET_NAME \
--address-prefix 10.0.0.0/16 \
--location $AZURE_REGION
# AKS subnet (large - AKS needs IPs for nodes + pods)
az network vnet subnet create \
--resource-group $RESOURCE_GROUP \
--vnet-name $VNET_NAME \
--name aks-subnet \
--address-prefix 10.0.0.0/20
# PostgreSQL delegated subnet (required for VNet-integrated Flex Server)
az network vnet subnet create \
--resource-group $RESOURCE_GROUP \
--vnet-name $VNET_NAME \
--name pg-subnet \
--address-prefix 10.0.16.0/24 \
--delegations Microsoft.DBforPostgreSQL/flexibleServers
# Azure SQL private endpoint subnet
az network vnet subnet create \
--resource-group $RESOURCE_GROUP \
--vnet-name $VNET_NAME \
--name sql-subnet \
--address-prefix 10.0.17.0/24
Checkpoint:
az network vnet subnet list --resource-group $RESOURCE_GROUP --vnet-name $VNET_NAME -o table
Should show three subnets: aks-subnet, pg-subnet, sql-subnet.
AKS_SUBNET_ID=$(az network vnet subnet show \
--resource-group $RESOURCE_GROUP \
--vnet-name $VNET_NAME \
--name aks-subnet \
--query id -o tsv)
az aks create \
--resource-group $RESOURCE_GROUP \
--name $CLUSTER_NAME \
--node-count 3 \
--node-vm-size Standard_D2s_v3 \
--vnet-subnet-id $AKS_SUBNET_ID \
--enable-oidc-issuer \
--enable-workload-identity \
--generate-ssh-keys \
--location $AZURE_REGION \
--service-cidr 172.16.0.0/16 \
--dns-service-ip 172.16.0.10
Note: We use Standard_D2s_v3 (2 vCPU, 8 GB RAM) by default because most new Azure subscriptions have a 10 vCPU regional quota. 3 x D2s_v3 = 6 cores fits comfortably. Use Standard_D4s_v3 (4 vCPU) if you have quota for 12+ cores.
Note: The --service-cidr 172.16.0.0/16 and --dns-service-ip 172.16.0.10 flags are required because the default AKS service CIDR (10.0.0.0/16) overlaps with our VNet address space (10.0.0.0/16). Using 172.16.0.0/16 avoids the conflict.
This takes approximately 5-10 minutes.
Checkpoint:
az aks show --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME \
--query "{name:name, state:provisioningState, k8sVersion:kubernetesVersion, nodeCount:agentPoolProfiles[0].count}" -o table
Provisioning state must be Succeeded. Verify OIDC issuer is enabled:
az aks show --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME \
--query "oidcIssuerProfile.enabled" -o tsv
Must return true.
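Later Workload Identity steps need the cluster's OIDC issuer URL when creating federated identity credentials, so it is convenient to capture it now. A sketch:

```shell
# Fetch the AKS cluster's OIDC issuer URL (used when creating
# federated identity credentials for Workload Identity).
get_oidc_issuer() {
  az aks show --resource-group "$RESOURCE_GROUP" --name "$CLUSTER_NAME" \
    --query "oidcIssuerProfile.issuerUrl" -o tsv
}
# Usage:
#   OIDC_ISSUER=$(get_oidc_issuer)
#   echo "OIDC issuer: $OIDC_ISSUER"
```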
This PostgreSQL instance is used only for the Airflow metadata database. The merge engine uses Azure SQL Database (Step 5).
Note: If this fails with "location is restricted for provisioning of flexible servers", your chosen region is blocking new PostgreSQL Flex Server creation. You must tear down ALL resources and start over in a different region (e.g., westus2). PostgreSQL Flex Server with VNet integration requires the server and VNet to be in the same region -- there is no cross-region workaround.
az postgres flexible-server create \
--resource-group $RESOURCE_GROUP \
--name ${RESOURCE_GROUP}-pgflex \
--location $AZURE_REGION \
--admin-user $PG_ADMIN_USER \
--admin-password "$PG_ADMIN_PASSWORD" \
--sku-name Standard_B1ms \
--tier Burstable \
--version 16 \
--vnet $VNET_NAME \
--subnet pg-subnet \
--yes
This takes approximately 5-10 minutes. The --vnet and --subnet flags create the server with VNet integration, so it is only accessible from within the VNet (including AKS pods).
Get the PostgreSQL FQDN:
PG_FQDN=$(az postgres flexible-server show \
--resource-group $RESOURCE_GROUP \
--name ${RESOURCE_GROUP}-pgflex \
--query fullyQualifiedDomainName -o tsv)
echo "PostgreSQL FQDN: $PG_FQDN"
IMPORTANT: Create the Airflow database. The Flexible Server is created with a default postgres database, but Airflow needs its own database. We will create it after configuring kubeconfig (Step 7), since the PostgreSQL server is only accessible from within the VNet.
Checkpoint:
az postgres flexible-server show \
--resource-group $RESOURCE_GROUP \
--name ${RESOURCE_GROUP}-pgflex \
--query "{name:name, state:state, fqdn:fullyQualifiedDomainName, version:version}" -o table
State must be Ready.
This Azure SQL Database is used as the merge engine (SQL Server). DataSurface's merge jobs connect here to perform SCD2 operations.
Note: Some Azure regions periodically stop accepting new SQL Database server provisioning. If eastus fails with RegionDoesNotAllowProvisioning, try eastus2 or another nearby region. When using a different region than the VNet, you cannot use VNet rules for firewall -- instead use the "Allow Azure services" firewall rule (0.0.0.0 to 0.0.0.0) or create a cross-region private endpoint.
SQL_SERVER_NAME="${RESOURCE_GROUP}-sqlserver"
# Create the logical SQL server
az sql server create \
--resource-group $RESOURCE_GROUP \
--name $SQL_SERVER_NAME \
--location $AZURE_REGION \
--admin-user $SQL_ADMIN_USER \
--admin-password "$SQL_ADMIN_PASSWORD"
# Create the merge_db database
az sql db create \
--resource-group $RESOURCE_GROUP \
--server $SQL_SERVER_NAME \
--name merge_db \
--service-objective S1
# Get the SQL Server FQDN
SQL_SERVER_FQDN="${SQL_SERVER_NAME}.database.windows.net"
echo "SQL Server FQDN: $SQL_SERVER_FQDN"
To allow AKS pods to reach Azure SQL over the VNet (without public internet):
# Disable public network access (security best practice)
az sql server update \
--resource-group $RESOURCE_GROUP \
--name $SQL_SERVER_NAME \
--set publicNetworkAccess=Disabled
# Get SQL Server resource ID
SQL_SERVER_ID=$(az sql server show \
--resource-group $RESOURCE_GROUP \
--name $SQL_SERVER_NAME \
--query id -o tsv)
# Create private endpoint
az network private-endpoint create \
--resource-group $RESOURCE_GROUP \
--name "${SQL_SERVER_NAME}-pe" \
--vnet-name $VNET_NAME \
--subnet sql-subnet \
--private-connection-resource-id $SQL_SERVER_ID \
--group-id sqlServer \
--connection-name "${SQL_SERVER_NAME}-pe-conn"
# Create private DNS zone for SQL Server
az network private-dns zone create \
--resource-group $RESOURCE_GROUP \
--name "privatelink.database.windows.net"
# Link DNS zone to VNet
az network private-dns zone vnet-link create \
--resource-group $RESOURCE_GROUP \
--zone-name "privatelink.database.windows.net" \
--name "${VNET_NAME}-sql-link" \
--virtual-network $VNET_NAME \
--registration-enabled false
# Create DNS records for the private endpoint
PE_NIC_ID=$(az network private-endpoint show \
--resource-group $RESOURCE_GROUP \
--name "${SQL_SERVER_NAME}-pe" \
--query "networkInterfaces[0].id" -o tsv)
PE_IP=$(az network nic show --ids $PE_NIC_ID \
--query "ipConfigurations[0].privateIpAddress" -o tsv)
az network private-dns record-set a create \
--resource-group $RESOURCE_GROUP \
--zone-name "privatelink.database.windows.net" \
--name $SQL_SERVER_NAME
az network private-dns record-set a add-record \
--resource-group $RESOURCE_GROUP \
--zone-name "privatelink.database.windows.net" \
--record-set-name $SQL_SERVER_NAME \
--ipv4-address $PE_IP
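To confirm the zone is wired up, it helps to know what name should appear: Azure aliases the server's public FQDN to a privatelink name, which the linked zone then resolves to the endpoint's private IP. A sketch with a small helper that derives that name (the busybox image is just one convenient choice for a throwaway DNS test pod):

```shell
# Map a SQL server's public FQDN to its privatelink record name.
privatelink_fqdn() {
  echo "${1%.database.windows.net}.privatelink.database.windows.net"
}
# From inside the cluster, the public name should resolve to the
# private endpoint IP via the privatelink zone, e.g.:
#   kubectl run dns-test --rm -i --restart=Never --image=busybox:1.36 \
#     -- nslookup "${SQL_SERVER_NAME}.database.windows.net"
```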
Alternative (simpler but less secure): If you prefer public access during setup, skip the private endpoint and instead add a VNet firewall rule:
# Only if NOT using private endpoint:
# Enable the Microsoft.Sql service endpoint on the AKS subnet first --
# creating the VNet rule fails if the subnet lacks the service endpoint
az network vnet subnet update \
--resource-group $RESOURCE_GROUP \
--vnet-name $VNET_NAME \
--name aks-subnet \
--service-endpoints Microsoft.Sql
# Then create the VNet firewall rule on the SQL server
az sql server vnet-rule create \
--resource-group $RESOURCE_GROUP \
--server $SQL_SERVER_NAME \
--name aks-access \
--vnet-name $VNET_NAME \
--subnet aks-subnet
Checkpoint:
az sql db show --resource-group $RESOURCE_GROUP --server $SQL_SERVER_NAME --name merge_db \
--query "{name:name, status:status, serviceObjective:currentServiceObjectiveName}" -o table
Status must be Online.
If using private endpoint, verify:
az network private-endpoint show --resource-group $RESOURCE_GROUP \
--name "${SQL_SERVER_NAME}-pe" \
--query "{name:name, state:provisioningState, status:privateLinkServiceConnections[0].privateLinkServiceConnectionState.status}" -o table
State must be Succeeded and status must be Approved.
az keyvault create \
--resource-group $RESOURCE_GROUP \
--name $KEY_VAULT_NAME \
--location $AZURE_REGION \
--enable-rbac-authorization true
IMPORTANT: We use --enable-rbac-authorization true so that access is controlled via Azure RBAC roles (specifically "Key Vault Secrets User") rather than vault access policies. This integrates cleanly with Workload Identity.
Grant yourself access to manage secrets during setup:
CURRENT_USER_ID=$(az ad signed-in-user show --query id -o tsv)
az role assignment create \
--role "Key Vault Secrets Officer" \
--assignee-object-id $CURRENT_USER_ID \
--scope $(az keyvault show --name $KEY_VAULT_NAME --query id -o tsv)
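Azure RBAC role assignments can take a minute or two to propagate, so the first secret operation may return 403 even though the assignment succeeded. A sketch that retries a throwaway secret write until access works (the secret name rbac-propagation-test is arbitrary):

```shell
# Retry a throwaway secret write until RBAC propagation completes,
# then clean up the test secret.
wait_for_kv_access() {
  until az keyvault secret set --vault-name "$KEY_VAULT_NAME" \
      --name rbac-propagation-test --value ok >/dev/null 2>&1; do
    echo "Waiting for Key Vault RBAC propagation..."
    sleep 15
  done
  az keyvault secret delete --vault-name "$KEY_VAULT_NAME" \
    --name rbac-propagation-test >/dev/null 2>&1 || true
  echo "Key Vault access confirmed"
}
# Usage: wait_for_kv_access
```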
Checkpoint:
az keyvault show --name $KEY_VAULT_NAME \
--query "{name:name, state:properties.provisioningState, rbac:properties.enableRbacAuthorization, uri:properties.vaultUri}" -o table
State must be Succeeded, RBAC must be True.
az aks get-credentials \
--resource-group $RESOURCE_GROUP \
--name $CLUSTER_NAME \
--overwrite-existing
kubectl get nodes
Checkpoint: All nodes should be in Ready status:
kubectl get nodes -o wide
IMPORTANT: Create the Airflow database now. The PostgreSQL Flexible Server is only accessible from within the VNet, so we must create the airflow_db database from inside the AKS cluster:
kubectl run db-setup --rm -i --restart=Never \
--image=postgres:16 \
--env="PGPASSWORD=$PG_ADMIN_PASSWORD" \
-- bash -c "psql -h $PG_FQDN -U $PG_ADMIN_USER -d postgres -c 'CREATE DATABASE airflow_db;'"
Note: This runs in the default namespace since our application namespace does not exist yet.
Checkpoint: Test PostgreSQL connectivity from a pod:
kubectl run db-test --rm -i --restart=Never \
--image=postgres:16 \
--env="PGPASSWORD=$PG_ADMIN_PASSWORD" \
-- psql -h $PG_FQDN -U $PG_ADMIN_USER -d airflow_db -c "SELECT 1;"
helm repo add secrets-store-csi-driver https://kubernetes-sigs.github.io/secrets-store-csi-driver/charts
helm repo update
helm install csi-secrets-store secrets-store-csi-driver/secrets-store-csi-driver \
--namespace kube-system
# Install Azure Key Vault provider for the CSI driver
kubectl apply -f https://raw.githubusercontent.com/Azure/secrets-store-csi-driver-provider-azure/master/deployment/provider-azure-installer.yaml
# Wait for provider pods
kubectl wait --for=condition=ready pod -l app=csi-secrets-store-provider-azure -n kube-system --timeout=60s
Checkpoint:
kubectl get pods -n kube-system -l app=csi-secrets-store-provider-azure
kubectl get pods -n kube-system -l app=secrets-store-csi-driver
Both CSI driver and Azure provider pods should be Running.
Azure Files NFS support is built into AKS and does not require installing any additional CSI drivers or add-ons. The azurefile-csi-nfs StorageClass is available by default.
kubectl get storageclass azurefile-csi-nfs
If the StorageClass exists, test it with a temporary PVC:
cat > /tmp/test-azurefile-pvc.yaml << 'EOF'
apiVersion: v1