Analyze Azure resource health, diagnose issues from logs and telemetry, and create a remediation plan for identified problems.
This workflow analyzes a specific Azure resource to assess its health status, diagnose potential issues using logs and telemetry data, and develop a comprehensive remediation plan for any problems discovered.
azmcp-*) over direct Azure CLI when availableAction: Retrieve diagnostic and troubleshooting best practices Tools: Azure MCP best practices tool Process:
Action: Locate and identify the target Azure resource Tools: Azure MCP tools + Azure CLI fallback Process:
Resource Lookup:
azmcp-subscription-listaz resource list --name <resource-name> to find matching resourcesResource Type Detection:
Action: Evaluate current resource health and availability Tools: Azure MCP monitoring tools + Azure CLI Process:
Basic Health Check:
Service-Specific Health Indicators:
Action: Analyze logs and telemetry to identify issues and patterns Tools: Azure MCP monitoring tools for Log Analytics queries Process:
Find Monitoring Sources:
azmcp-monitor-workspace-list to identify Log Analytics workspacesazmcp-monitor-table-listExecute Diagnostic Queries:
Use azmcp-monitor-log-query with targeted KQL queries based on resource type:
General Error Analysis:
// Recent errors and exceptions
union isfuzzy=true
AzureDiagnostics,
AppServiceHTTPLogs,
AppServiceAppLogs,
AzureActivity
| where TimeGenerated > ago(24h)
| where Level == "Error" or ResultType != "Success"
| summarize ErrorCount=count() by Resource, ResultType, bin(TimeGenerated, 1h)
| order by TimeGenerated desc
Performance Analysis:
// Performance degradation patterns
Perf
| where TimeGenerated > ago(7d)
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| summarize avg(CounterValue) by Computer, bin(TimeGenerated, 1h)
| where avg_CounterValue > 80
Application-Specific Queries:
// Application Insights - Failed requests
requests
| where timestamp > ago(24h)
| where success == false
| summarize FailureCount=count() by resultCode, bin(timestamp, 1h)
| order by timestamp desc
// Database - Connection failures
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.SQL"
| where Category == "SQLSecurityAuditEvents"
| where action_name_s == "CONNECTION_FAILED"
| summarize ConnectionFailures=count() by bin(TimeGenerated, 1h)
Pattern Recognition:
Action: Categorize identified issues and determine root causes Process:
Issue Classification:
Root Cause Analysis:
Impact Assessment:
Action: Create a comprehensive plan to address identified issues Process:
Immediate Actions (Critical issues):
Short-term Fixes (High/Medium issues):
Long-term Improvements (All issues):
Implementation Steps:
Action: Present findings and get approval for remediation actions Process:
Display Health Assessment Summary:
🏥 Azure Resource Health Assessment
📊 Resource Overview:
• Resource: [Name] ([Type])
• Status: [Healthy/Warning/Critical]
• Location: [Region]
• Last Analyzed: [Timestamp]
🚨 Issues Identified:
• Critical: X issues requiring immediate attention
• High: Y issues affecting performance/reliability
• Medium: Z issues for optimization
• Low: N informational items
🔍 Top Issues:
1. [Issue Type]: [Description] - Impact: [High/Medium/Low]
2. [Issue Type]: [Description] - Impact: [High/Medium/Low]
3. [Issue Type]: [Description] - Impact: [High/Medium/Low]
🛠️ Remediation Plan:
• Immediate Actions: X items
• Short-term Fixes: Y items
• Long-term Improvements: Z items
• Estimated Resolution Time: [Timeline]
❓ Proceed with detailed remediation plan? (y/n)
Generate Detailed Report:
# Azure Resource Health Report: [Resource Name]
**Generated**: [Timestamp]
**Resource**: [Full Resource ID]
**Overall Health**: [Status with color indicator]
## 🔍 Executive Summary
[Brief overview of health status and key findings]
## 📊 Health Metrics
- **Availability**: X% over last 24h
- **Performance**: [Average response time/throughput]
- **Error Rate**: X% over last 24h
- **Resource Utilization**: [CPU/Memory/Storage percentages]
## 🚨 Issues Identified
### Critical Issues
- **[Issue 1]**: [Description]
- **Root Cause**: [Analysis]
- **Impact**: [Business impact]
- **Immediate Action**: [Required steps]
### High Priority Issues
- **[Issue 2]**: [Description]
- **Root Cause**: [Analysis]
- **Impact**: [Performance/reliability impact]
- **Recommended Fix**: [Solution steps]
## 🛠️ Remediation Plan
### Phase 1: Immediate Actions (0-2 hours)
```bash
# Critical fixes to restore service
[Azure CLI commands with explanations]
# Performance and reliability improvements
[Azure CLI commands with explanations]
# Architectural and preventive measures
[Azure CLI commands and configuration changes]