Systematic methodology for diagnosing infrastructure problems across compute, storage, network, and virtualization layers. Use as the starting framework for any troubleshooting scenario, then reference platform-specific skills (vmware-esxi-performance, etc.) for deep dives.
This is the parent methodology skill. For platform-specific deep dives:
| Platform | Skill | Covers |
|---|---|---|
| VMware ESXi | vmware-esxi-performance | esxtop, KAVG/DAVG, DSNRO, vmxnet3, PVSCSI |
| OpenShift Virt | (future) | KubeVirt, Portworx, OVN networking |
| Pure Storage | (future) | FlashArray diagnostics, replication |
Start here for methodology, go to child skills for commands and thresholds.
Work from the bottom of the stack up. Symptoms at the application layer often have root causes in infrastructure.
Application Layer ← Symptoms appear here
↓
Operating System ← Guest drivers, kernel, services
↓
Virtualization ← VM config, hypervisor scheduling, virtual hardware
↓
Infrastructure ← Physical compute, storage, network
Rule: Don't chase symptoms. Isolate which layer owns the problem before diving deep.
Performance problem reported
│
├── Affecting multiple VMs/apps?
│ YES → Infrastructure layer (storage, network, host)
│ NO → VM-specific or app-specific
│
├── Correlates with time of day?
│ YES → Contention, scheduled jobs, backups
│ NO → Configuration or hardware issue
│
├── Did anything change recently?
│ YES → Start there (patches, config, new workloads)
│ NO → Gradual degradation or intermittent failure
│
└── Can you reproduce it?
YES → Capture data during reproduction
NO → Set up monitoring to catch next occurrence
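The triage tree above can be sketched as a small function. This is an illustrative encoding only; the question names and action strings are not from any tool:

```python
def triage(multiple_affected: bool, time_correlated: bool,
           recent_change: bool, reproducible: bool) -> list[str]:
    """Map the four triage questions to next actions (illustrative)."""
    steps = []
    steps.append("suspect infrastructure layer (storage, network, host)"
                 if multiple_affected else
                 "suspect VM-specific or app-specific issue")
    steps.append("check contention, scheduled jobs, backups"
                 if time_correlated else
                 "check configuration or hardware")
    steps.append("review recent patches, config changes, new workloads"
                 if recent_change else
                 "treat as gradual degradation or intermittent failure")
    steps.append("capture data during reproduction"
                 if reproducible else
                 "set up monitoring to catch the next occurrence")
    return steps
```

Answering the four questions up front, in order, keeps the investigation scoped before any deep dive begins.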
First question: Is latency from the array (DAVG) or the hypervisor (KAVG)?
Deep dive: See vmware-esxi-performance skill for metrics, thresholds, esxtop commands, and DSNRO tuning.
First question: Is it latency-bound or throughput-bound?
Test raw throughput (bypass protocols):
iperf3 -c <target> -t 30 -P 4
Bandwidth Delay Product (BDP):
Max throughput = TCP Window Size / RTT
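The window-imposed ceiling is easy to compute; a quick sketch with illustrative numbers:

```python
def max_tcp_throughput_mbps(window_bytes: int, rtt_seconds: float) -> float:
    """Single-stream TCP ceiling: window / RTT, expressed in Mbit/s."""
    return window_bytes * 8 / rtt_seconds / 1e6

# A 64 KiB window over a 50 ms WAN path caps a single stream at roughly
# 10 Mbit/s regardless of how much raw bandwidth the link has.
cap = max_tcp_throughput_mbps(64 * 1024, 0.050)  # ~10.5 Mbit/s
```

This is why a "slow WAN transfer" often has nothing to do with the pipe: the window, not the bandwidth, is the limit.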
Check for packet loss:
pathping <target>
Even 1-2% loss destroys TCP throughput.
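The Mathis et al. approximation for loss-limited TCP (Reno-style congestion control) makes the damage concrete; a sketch with illustrative numbers:

```python
import math

def mathis_throughput_mbps(mss_bytes: int, rtt_s: float, loss: float) -> float:
    """Mathis approximation: throughput ~= (MSS/RTT) * (1.22 / sqrt(p))."""
    return (mss_bytes * 8 / rtt_s) * (1.22 / math.sqrt(loss)) / 1e6

# Same 1460-byte MSS and 50 ms RTT; only the loss rate changes.
clean = mathis_throughput_mbps(1460, 0.050, 0.0001)  # 0.01% loss
lossy = mathis_throughput_mbps(1460, 0.050, 0.02)    # 2% loss
```

Going from 0.01% to 2% loss cuts the achievable rate by roughly 14x (throughput scales with 1/sqrt(p)), which is why a hop showing even low single-digit loss in pathping deserves immediate attention.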
First question: Is the workload starved for CPU? Check %RDY (ready time) and %CSTP (co-stop) in esxtop.
Deep dive: See vmware-esxi-performance skill for thresholds and analysis.
Symptoms: Traffic is slow through the firewall but fast on the same segment.
Quick isolation:
Palo Alto quick checks:
show session id <id>
show counter global filter delta yes severity drop
Look for: threat inspection latency, session limits, QoS throttling.
Common culprits for slow SMB: signing overhead, an older dialect negotiated, TCP window scaling disabled.
Diagnostics:
# Check SMB version and signing
Get-SmbConnection | Select-Object ServerName, Dialect
Get-SmbClientConfiguration | Select-Object RequireSecuritySignature
# Check TCP window scaling
netsh interface tcp show global
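Window scaling matters because without it the advertised receive window is capped at 65,535 bytes. A sketch of the RFC 7323 math:

```python
def effective_window(advertised_bytes: int, scale_factor: int) -> int:
    """RFC 7323 window scaling: effective window = advertised << scale."""
    return advertised_bytes << scale_factor

# With scaling disabled (scale 0), the window tops out at 65535 bytes --
# far too small for high-BDP WAN paths. A scale factor of 8 lifts the
# same advertised value to ~16 MB.
no_scaling = effective_window(65535, 0)
scaled = effective_window(65535, 8)
```

If `netsh` shows autotuning disabled, every transfer is stuck behind that 64 KB ceiling no matter what the application does.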
Wireshark analysis:
tcp.flags.syn == 1 to find the handshake
tcp.analysis.retransmission to spot retransmissions
If the same file transfers fast locally but slow over the WAN:
| Check | Command | Looking for |
|---|---|---|
| Latency | ping -n 20 <target> | Baseline RTT |
| Packet loss | pathping <target> | Loss at any hop |
| MTU issues | ping -f -l 1472 <target> | Fragmentation |
| Raw throughput | iperf3 -c <target> | Actual bandwidth |
If iperf shows good throughput but robocopy is slow:
/MT:16 for parallel streams

| Log | Location | Retention |
|---|---|---|
| vmkernel | /var/log/vmkernel.log | Rotates at 1MB |
| VM logs | /vmfs/volumes/<ds>/<vm>/vmware.log | Longer retention |
| hostd | /var/log/hostd.log | VM management events |
Note: vmkernel rotates quickly. Check .log.1, .log.2 for historical data.
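Because vmkernel rotates quickly, any search has to cover the rotated copies as well. A minimal sketch (the path and patterns come from this section; the helper name is illustrative):

```python
import glob
import re

def scan_logs(pattern: str, paths: list[str]) -> list[tuple[str, str]]:
    """Return (filename, line) pairs matching pattern across log files."""
    rx = re.compile(pattern)
    hits = []
    for path in paths:
        with open(path, errors="replace") as f:
            for line in f:
                if rx.search(line):
                    hits.append((path, line.rstrip()))
    return hits

# The glob covers vmkernel.log plus rotated vmkernel.log.1, .log.2, ...
files = sorted(glob.glob("/var/log/vmkernel.log*"))
# hits = scan_logs(r"H:0x|APD|PDL", files)
```

Collecting the rotated files into one pass avoids the classic miss where the interesting window has already aged into `.log.2`.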
dmesg -T # Kernel ring buffer with timestamps
journalctl -u <service> # Systemd service logs
journalctl -k # Kernel messages only
journalctl --since "1 hour ago"
Get-EventLog -LogName System -Newest 100
Get-WinEvent -FilterHashtable @{LogName='System'; Level=2} # Errors only
oc adm node-logs <node>
oc logs <pod> --previous # Previous container instance
oc describe node <node> # Events and conditions
Collect data at time of incident. Logs rotate, metrics age out.
vm-support bundle from ESXi host
vmware.log (from datastore, not guest)
esxtop batch capture: esxtop -b -d 2 -n 30 > capture.csv
grep -E "H:0x|APD|PDL" /var/log/vmkernel.log
oc adm must-gather
sosreport from affected nodes
dmesg and journalctl output
oc get clusterversion

| Don't | Do Instead |
|---|---|
| Chase symptoms at app layer | Isolate which layer owns the problem |
| Change multiple variables | One change, measure, repeat |
| Skip data collection | Capture first, analyze later |
| Assume the obvious | Verify with data |
| Escalate without logs | Package data vendors actually need |
When documenting troubleshooting:
## Symptom
[What exactly is happening, quantified]
## Timeline
[When started, any correlation with changes/events]
## Scope
[What's affected, what's working normally]
## Data Collected
[Logs, metrics, captures — with timestamps]
## Layer Isolation
[Infrastructure → Virtualization → OS → Application]
## Root Cause
[Which layer, which component, why]
## Resolution
[What fixed it, or escalation path]