Systematic methodology for diagnosing infrastructure problems across compute, storage, network, and virtualization layers. Use as the starting framework for any troubleshooting scenario, then reference platform-specific skills (vmware-esxi-performance, etc.) for deep dives.
This is the parent methodology skill. For platform-specific deep dives:
| Platform | Skill | Covers |
|---|---|---|
| VMware ESXi | vmware-esxi-performance | esxtop, KAVG/DAVG, DSNRO, vmxnet3, PVSCSI |
| OpenShift Virt | (future) | KubeVirt, Portworx, OVN networking |
| Pure Storage | (future) | FlashArray diagnostics, replication |
Start here for methodology, go to child skills for commands and thresholds.
Work from the bottom of the stack up. Symptoms at the application layer often have root causes in infrastructure.
Application Layer ← Symptoms appear here
↓
Operating System ← Guest drivers, kernel, services
↓
Virtualization ← VM config, hypervisor scheduling, virtual hardware
↓
Infrastructure ← Physical compute, storage, network
Rule: Don't chase symptoms. Isolate which layer owns the problem before diving deep.
Performance problem reported
│
├── Affecting multiple VMs/apps?
│ YES → Infrastructure layer (storage, network, host)
│ NO → VM-specific or app-specific
│
├── Correlates with time of day?
│ YES → Contention, scheduled jobs, backups
│ NO → Configuration or hardware issue
│
├── Did anything change recently?
│ YES → Start there (patches, config, new workloads)
│ NO → Gradual degradation or intermittent failure
│
└── Can you reproduce it?
YES → Capture data during reproduction
NO → Set up monitoring to catch next occurrence
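The triage tree above can be sketched as a small function. This is an illustrative encoding only; the question names and action strings are not from any tool:

```python
def triage(multiple_affected: bool, time_correlated: bool,
           recent_change: bool, reproducible: bool) -> list[str]:
    """Map the four triage questions to next actions (illustrative)."""
    steps = []
    steps.append("suspect infrastructure layer (storage, network, host)"
                 if multiple_affected else
                 "suspect VM-specific or app-specific issue")
    steps.append("check contention, scheduled jobs, backups"
                 if time_correlated else
                 "check configuration or hardware")
    steps.append("review recent patches, config changes, new workloads"
                 if recent_change else
                 "treat as gradual degradation or intermittent failure")
    steps.append("capture data during reproduction"
                 if reproducible else
                 "set up monitoring to catch the next occurrence")
    return steps
```

Answering the four questions up front, in order, keeps the investigation scoped before any deep dive begins.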
First question: Is latency from the array (DAVG) or the hypervisor (KAVG)?
Deep dive: See vmware-esxi-performance skill for metrics, thresholds, esxtop commands, and DSNRO tuning.
First question: Is it latency-bound or throughput-bound?
Test raw throughput (bypass protocols):
iperf3 -c <target> -t 30 -P 4
Bandwidth Delay Product (BDP):
Max throughput = TCP Window Size / RTT
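The window-imposed ceiling is easy to compute; a quick sketch with illustrative numbers:

```python
def max_tcp_throughput_mbps(window_bytes: int, rtt_seconds: float) -> float:
    """Single-stream TCP ceiling: window / RTT, expressed in Mbit/s."""
    return window_bytes * 8 / rtt_seconds / 1e6

# A 64 KiB window over a 50 ms WAN path caps a single stream at roughly
# 10 Mbit/s regardless of how much raw bandwidth the link has.
cap = max_tcp_throughput_mbps(64 * 1024, 0.050)  # ~10.5 Mbit/s
```

This is why a "slow WAN transfer" often has nothing to do with the pipe: the window, not the bandwidth, is the limit.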
Check for packet loss:
pathping <target>
Even 1-2% loss destroys TCP throughput.
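The Mathis et al. approximation for loss-limited TCP (Reno-style congestion control) makes the damage concrete; a sketch with illustrative numbers:

```python
import math

def mathis_throughput_mbps(mss_bytes: int, rtt_s: float, loss: float) -> float:
    """Mathis approximation: throughput ~= (MSS/RTT) * (1.22 / sqrt(p))."""
    return (mss_bytes * 8 / rtt_s) * (1.22 / math.sqrt(loss)) / 1e6

# Same 1460-byte MSS and 50 ms RTT; only the loss rate changes.
clean = mathis_throughput_mbps(1460, 0.050, 0.0001)  # 0.01% loss
lossy = mathis_throughput_mbps(1460, 0.050, 0.02)    # 2% loss
```

Going from 0.01% to 2% loss cuts the achievable rate by roughly 14x (throughput scales with 1/sqrt(p)), which is why a hop showing even low single-digit loss in pathping deserves immediate attention.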
First question: Is the workload starved for CPU? Check %RDY (ready time) and %CSTP (co-stop) in esxtop.
Deep dive: See vmware-esxi-performance skill for thresholds and analysis.
Symptoms: Traffic is slow through the firewall but fast on the same segment.
Quick isolation:
Palo Alto quick checks:
show session id <id>
show counter global filter delta yes severity drop
Look for: threat inspection latency, session limits, QoS throttling.
Common culprits for slow SMB: signing overhead, an older dialect negotiated, TCP window scaling disabled.
Diagnostics:
# Check SMB version and signing
Get-SmbConnection | Select-Object ServerName, Dialect
Get-SmbClientConfiguration | Select-Object RequireSecuritySignature
# Check TCP window scaling
netsh interface tcp show global
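Window scaling matters because without it the advertised receive window is capped at 65,535 bytes. A sketch of the RFC 7323 math:

```python
def effective_window(advertised_bytes: int, scale_factor: int) -> int:
    """RFC 7323 window scaling: effective window = advertised << scale."""
    return advertised_bytes << scale_factor

# With scaling disabled (scale 0), the window tops out at 65535 bytes --
# far too small for high-BDP WAN paths. A scale factor of 8 lifts the
# same advertised value to ~16 MB.
no_scaling = effective_window(65535, 0)
scaled = effective_window(65535, 8)
```

If `netsh` shows autotuning disabled, every transfer is stuck behind that 64 KB ceiling no matter what the application does.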
Wireshark analysis:
tcp.flags.syn == 1 to find the handshake
tcp.analysis.retransmission to spot retransmissions
If the same file transfers fast locally but slow over the WAN:
| Check | Command | Looking for |
|---|---|---|
| Latency | ping -n 20 <target> | Baseline RTT |
| Packet loss | pathping <target> | Loss at any hop |
| MTU issues | ping -f -l 1472 <target> | Fragmentation |
| Raw throughput | iperf3 -c <target> | Actual bandwidth |
If iperf shows good throughput but robocopy is slow:
/MT:16 for parallel streams

| Log | Location | Retention |
|---|---|---|
| vmkernel | /var/log/vmkernel.log | Rotates at 1MB |
| VM logs | /vmfs/volumes/<ds>/<vm>/vmware.log | Longer retention |
| hostd | /var/log/hostd.log | VM management events |
Note: vmkernel rotates quickly. Check .log.1, .log.2 for historical data.
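Because vmkernel rotates quickly, any search has to cover the rotated copies as well. A minimal sketch (the path and patterns come from this section; the helper name is illustrative):

```python
import glob
import re

def scan_logs(pattern: str, paths: list[str]) -> list[tuple[str, str]]:
    """Return (filename, line) pairs matching pattern across log files."""
    rx = re.compile(pattern)
    hits = []
    for path in paths:
        with open(path, errors="replace") as f:
            for line in f:
                if rx.search(line):
                    hits.append((path, line.rstrip()))
    return hits

# The glob covers vmkernel.log plus rotated vmkernel.log.1, .log.2, ...
files = sorted(glob.glob("/var/log/vmkernel.log*"))
# hits = scan_logs(r"H:0x|APD|PDL", files)
```

Collecting the rotated files into one pass avoids the classic miss where the interesting window has already aged into `.log.2`.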
dmesg -T # Kernel ring buffer with timestamps
journalctl -u <service> # Systemd service logs
journalctl -k # Kernel messages only
journalctl --since "1 hour ago"
Get-EventLog -LogName System -Newest 100
Get-WinEvent -FilterHashtable @{LogName='System'; Level=2} # Errors only
oc adm node-logs <node>
oc logs <pod> --previous # Previous container instance
oc describe node <node> # Events and conditions
Collect data at time of incident. Logs rotate, metrics age out.
vm-support bundle from ESXi host
vmware.log (from datastore, not guest)
esxtop batch capture: esxtop -b -d 2 -n 30 > capture.csv
grep -E "H:0x|APD|PDL" /var/log/vmkernel.log
oc adm must-gather
sosreport from affected nodes
dmesg and journalctl output
oc get clusterversion

| Don't | Do Instead |
|---|---|
| Chase symptoms at app layer | Isolate which layer owns the problem |
| Change multiple variables | One change, measure, repeat |
| Skip data collection | Capture first, analyze later |
| Assume the obvious | Verify with data |
| Escalate without logs | Package data vendors actually need |
When documenting troubleshooting:
## Symptom
[What exactly is happening, quantified]
## Timeline
[When started, any correlation with changes/events]
## Scope
[What's affected, what's working normally]
## Data Collected
[Logs, metrics, captures — with timestamps]
## Layer Isolation
[Infrastructure → Virtualization → OS → Application]
## Root Cause
[Which layer, which component, why]
## Resolution
[What fixed it, or escalation path]