功能描述

Cisco IOS-XE and NX-OS device health check and triage procedure. Use when troubleshooting Cisco routers, switches, or Nexus platforms — assessing CPU, memory...

使用说明 (SKILL.md)

Cisco IOS-XE / NX-OS Device Health Check

Name: Cisco Device Health
Author: vahagn-madatyan

Structured triage procedure for assessing Cisco device health across IOS-XE and NX-OS platforms. Produces a prioritized findings report with severity classifications and recommended actions.

Commands are labeled [IOS-XE] or [NX-OS] where they diverge. Unlabeled commands work on both platforms.

When to Use

Device reported as slow, dropping traffic, or unresponsive
Scheduled health audit of IOS-XE routers/switches or Nexus platforms
Post-change verification after upgrades, patches, or configuration changes
Capacity planning data collection for CPU, memory, and link utilization
Incident response when a Cisco device is suspected as the fault domain
NX-OS VDC isolation troubleshooting — health check scoped to a specific VDC

Prerequisites

SSH or console access to the device (privilege 1 minimum for IOS-XE, show role for NX-OS)
IOS-XE 16.x/17.x or NX-OS 9.x/10.x (commands validated against IOS-XE 17.3+ and NX-OS 10.2+)
Network reachability to management interface confirmed
Awareness of the device's normal baseline (CPU, memory, traffic patterns)
For NX-OS: identify which VDC you are operating in (show vdc current)

Procedure

Follow this sequence. Each step produces data for the final report. If the device is unresponsive, skip to Step 6 for crash recovery.

Step 1: Establish Baseline Context

Collect identity, uptime, and platform context.

[IOS-XE]

show version | include uptime|Version|bytes of memory
show inventory | include PID
show clock

[NX-OS]

show version | include uptime|system version|Memory
show inventory | include PID
show clock
show vdc current

Record: hostname, OS version, uptime, hardware model, current time. Short uptime means a recent reload or crash — investigate immediately. On NX-OS, confirm VDC context: all subsequent commands apply to this VDC only.

Step 2: CPU Utilization

IOS-XE separates Route Processor (RP) and QuantumFlow Processor (QFP). Triage both planes independently. NX-OS uses a unified supervisor model — show system resources is the primary view.

[IOS-XE]

show processes cpu sorted | head 20
show processes cpu history
show platform software status control-processor brief
show platform hardware qfp active statistics drop

[NX-OS]

show system resources
show processes cpu sort | head 20
show module

IOS-XE interpretation: Compare 5-second, 1-minute, and 5-minute RP averages. If RP CPU is normal but QFP drops are climbing, the data plane is overloaded while control plane is healthy — different remediation than RP overload. Key RP processes: IP Input (traffic storms), BGP Router (route churn), Crypto IKMP (VPN storms), IOSD (control plane congestion).

NX-OS interpretation: show system resources reports CPU as kernel + user percentages. Normal idle should be above 60%. NX-OS control plane CPU runs hotter than IOS-XE RP because the supervisor handles both planes. Key processes: netstack (forwarding), bgp (route processing), stp (spanning tree), vpc (vPC keepalive).

Step 3: Memory Utilization

[IOS-XE]

show memory statistics
show processes memory sorted | head 15

[NX-OS]

show system resources
show processes memory | head 15

IOS-XE: Calculate used percentage from processor pool: (Total - Free) / Total * 100. Check fragmentation: if largest free block \x3C 10% of total free, memory is fragmented.

NX-OS: show system resources reports memory as total/used/free in KB. NX-OS reserves a larger kernel memory footprint than IOS-XE — used percentages above 60% are typical. Watch for MemFree dropping below 15% of total.

Step 4: Interface Health

show interfaces summary
show interfaces counters errors

[IOS-XE]

show interfaces | include line protocol|drops|error|CRC
show controllers [name]

[NX-OS]

show interface [name] | include errors|drops|CRC
show hardware internal errors module [n]

For each interface with errors:

Calculate error rate: errors / (input + output packets) * 100
CRC errors → Layer 1 issue (cabling, optics, SFP seating)
Input errors without CRC → buffer overruns
Output drops → congestion; check QoS policy or link capacity
On NX-OS, show hardware internal errors reveals ASIC-level drops not visible in standard interface counters

Step 5: Routing Protocol Health

show ip route summary

[IOS-XE]

show ip bgp summary
show ip ospf neighbor
show ip eigrp neighbors

[NX-OS]

show bgp ipv4 unicast summary
show ip ospf neighbors
show ip eigrp neighbors
show vpc brief

Verify: expected route count, all neighbors in established/full state, no unexpected withdrawals. On NX-OS, vPC peer-link state is critical — a failed peer-link orphans traffic. BGP prefix counts that deviate >10% from baseline indicate route churn.

Step 6: Environment and Platform

[IOS-XE]

show environment all
show platform software status control-processor brief
show logging | include %|Error|Warning|traceback | tail 50
dir crashinfo:

[NX-OS]

show environment
show module
show logging last 50
show cores

Check: power supply status, fan health, temperature readings. Any environmental alarm escalates immediately. On IOS-XE, review crashinfo: for recent crash dumps. On NX-OS, show cores lists supervisor and module core dumps — presence indicates a process crash requiring investigation.

Threshold Tables

Reference: references/threshold-tables.md for detailed per-parameter thresholds.

Parameter	Normal	Warning	Critical	Platform Notes
CPU 5-min avg	\x3C 40%	40–70%	> 70%	NX-OS: expect 5–10% higher baseline
CPU 5-sec spike	\x3C 80%	80–90%	> 90%	IOS-XE RP only; NX-OS: use system resources
QFP drops/sec	\x3C 100	100–1000	> 1000	IOS-XE only
Memory used	\x3C 70%	70–85%	> 85%	NX-OS: normal baseline is 55–65%
Interface error rate	\x3C 0.01%	0.01–0.1%	> 0.1%	Both platforms
Output drops/hr	\x3C 100	100–1000	> 1000	Both platforms
Routing neighbors	All established	Flapping	Down	NX-OS: include vPC state
Temperature	Within spec	5°C of max	At/above max	Per-sensor spec varies

Decision Trees

Primary Triage

Is the device reachable?
├── No → Check console, power, environment. Collect crashinfo after recovery.
└── Yes
    ├── Identify platform → IOS-XE or NX-OS?
    │
    ├── [IOS-XE] CPU issue?
    │   ├── RP CPU high, QFP normal → Control plane overload
    │   │   ├── BGP Router high → Route churn, check peer stability
    │   │   ├── IP Input high → Traffic storm, check ACLs/CoPP
    │   │   └── Other process → Collect 'show tech-support' for TAC
    │   ├── RP normal, QFP drops high → Data plane overload
    │   │   └── Check traffic rates, ACL complexity, NAT sessions
    │   └── Both high → Platform capacity exceeded, reduce load
    │
    ├── [NX-OS] CPU issue?
    │   ├── Check VDC context → Are you in the correct VDC?
    │   ├── System resources CPU high
    │   │   ├── netstack high → Forwarding overload, check ARP/routes
    │   │   ├── bgp high → Route churn, check table size and peers
    │   │   ├── stp/vpc high → L2 instability, check topology
    │   │   └── Other → Collect 'show tech-support' for TAC
    │   └── Per-module issue → Specific linecard problem
    │       └── show module → Identify failed/degraded module
    │
    ├── Memory issue? (either platform)
    │   ├── [IOS-XE] Fragmentation high → Schedule reload during window
    │   ├── [NX-OS] MemFree \x3C 15% → Identify top consumers
    │   └── Steady growth → Memory leak, collect allocation data, plan reload
    │
    ├── Interface errors? → Classify error type
    │   ├── CRC/input errors → Layer 1 (cable, optic, SFP)
    │   ├── Output drops → QoS or congestion
    │   └── [NX-OS] ASIC errors → Hardware issue, check module health
    │
    └── All within thresholds → Document clean health

Escalation Criteria

Escalate to senior engineer or TAC when:

CPU sustained > 90% for 15+ minutes with no identifiable cause
Memory below 15% free with no recent change to explain consumption
Traceback, CPUHOG, or MALLOCFAIL in logs (IOS-XE); core dump present (NX-OS)
Any environmental alarm (power, fan, temperature)
More than 3 routing neighbor state changes in the last hour
NX-OS module in non-OK state or vPC peer-link failure
QFP drops exceeding 10,000/sec with no traffic anomaly explanation (IOS-XE)

Report Template

DEVICE HEALTH REPORT
====================
Device: [hostname]
Platform: [IOS-XE | NX-OS]
Model: [PID from inventory]
Software: [version]
Uptime: [uptime string]
VDC: [VDC name — NX-OS only, omit for IOS-XE]
Check Time: [timestamp]
Performed By: [operator/agent]

SUMMARY: [HEALTHY | WARNING | CRITICAL | EMERGENCY]

FINDINGS:
1. [Severity] [Component] — [Description]
   Platform: [IOS-XE | NX-OS]
   Observed: [metric value]
   Threshold: [normal/warning/critical range]
   Action: [recommended action]

2. ...

PLATFORM-SPECIFIC NOTES:
- [IOS-XE: QFP drop summary or NX-OS: VDC/module status]

RECOMMENDATIONS:
- [Prioritized action list]

NEXT CHECK: [date based on severity — CRITICAL: 24hr, WARNING: 7d, HEALTHY: 30d]

Troubleshooting

Device Unresponsive to SSH

Try console access. If console also unresponsive, check power and environment via out-of-band management (CIMC for UCS-based, smart PDU, or console server). On IOS-XE: collect dir crashinfo: after recovery. On NX-OS: show cores and show system reset-reason.

NX-OS VDC Context Confusion

Health data is VDC-scoped. If metrics look wrong, verify VDC: show vdc current. Switch VDC with switchto vdc [name]. The default VDC contains system-level resources; admin VDC shows aggregate hardware.

CPU Spikes During Health Check

Show commands and SNMP polling can briefly spike CPU. Wait 30 seconds after connecting before collecting CPU data. Use terminal length 0 (IOS-XE) or terminal length 0 (NX-OS) to avoid paging pauses.

IOS-XE QFP Drops vs RP CPU

High QFP drops with normal RP CPU means the data plane is stressed but control plane is healthy. Do not restart the device — the RP is functioning normally. Investigate traffic patterns, ACL complexity, and NAT session counts. Use show platform hardware qfp active feature to identify which QFP feature is consuming resources.

NX-OS Module Issues

If show module reports a module not in "ok" state, collect show module internal exception-log module [n] before taking action. Powercycling a linecard (poweroff module [n] / no poweroff module [n]) can resolve transient failures but requires change control.

Inconsistent Memory Between Platforms

IOS-XE and NX-OS report memory differently. IOS-XE show memory statistics shows IOS memory pools; NX-OS show system resources shows Linux-level memory. NX-OS baseline memory usage is higher (55–65% is normal) because the Linux kernel and system daemons consume more than IOS-XE's monolithic process model.

安全使用建议

This skill is a read-only checklist for Cisco IOS‑XE and NX‑OS health checks and appears coherent for that purpose, but consider the following before use: - The agent or environment that runs these instructions must have network/SSH or console access and credentials to the devices; the skill itself does not manage credentials. Only provide those credentials in secure, controlled ways. - The commands will return device-internal data (logs, crash dumps, routing tables, interface counters) which can contain sensitive info — ensure the outputs are handled appropriately and not exfiltrated to untrusted destinations. - Resolve the small metadata mismatch (SKILL.md notes 'ssh' as a required binary while registry lists none) so callers know whether an 'ssh' client must be present. - Test the procedure on non-production devices first to validate command compatibility with your exact IOS‑XE/NX‑OS versions and hardware before running in production. - If you plan to let an autonomous agent run this skill, restrict which devices it can reach and audit command outputs and access logs regularly.

功能分析

Type: OpenClaw Skill Name: cisco-device-health Version: 1.0.0 The skill bundle provides a comprehensive and structured procedure for performing read-only health checks on Cisco IOS-XE and NX-OS devices. All commands listed in SKILL.md and references/cli-reference.md are standard diagnostic 'show' commands, and the documentation includes clear thresholds and decision trees aligned with legitimate network administration tasks. No evidence of malicious intent, data exfiltration, or unauthorized access was found.

能力评估

ℹ Purpose & Capability

The skill's name and description match the runtime instructions: it is a procedural playbook for IOS‑XE and NX‑OS health checks using only read-only 'show' commands. Minor metadata inconsistency: the SKILL.md's embedded openclaw metadata lists 'ssh' as a required binary, while the registry metadata for this upload lists no required binaries — expecting SSH/console access is reasonable for this purpose, but the mismatch should be resolved.

ℹ Instruction Scope

All runtime instructions are read-only 'show' commands and platform-local diagnostics (version, cpu, memory, interfaces, logs, crash cores). This stays within the stated triage scope. Note: 'show logging', 'dir crashinfo:', and 'show cores' can surface sensitive operational data and crash dumps from the device; that is expected for triage but may expose sensitive information to the agent executing the commands.

✓ Install Mechanism

Instruction-only skill with no install spec and no code files. Nothing will be downloaded or written to disk by the skill package itself.

✓ Credentials

The skill declares no required environment variables or credentials. It assumes the operator (or agent) has SSH/console access to the target devices — this is proportional to the stated purpose. There is no request for unrelated cloud or service credentials.

✓ Persistence & Privilege

always:false and default model-invocation behavior. The skill does not request persistent/system-wide privileges or to modify other skills; no elevated persistence is requested.

版本历史

v1.0.0

Initial release of Cisco device health check and triage skill. - Provides structured procedures for assessing Cisco IOS-XE and NX-OS device health, covering routers, switches, and Nexus platforms. - Includes platform-specific commands, decision trees, and severity thresholds for CPU, memory, interface, routing, and environmental health. - Supports QFP/RP architecture on IOS-XE and VDC isolation on NX-OS. - Outputs prioritized findings reports with recommended actions. - Designed for audit, troubleshooting, post-upgrade checks, incident response, and capacity planning.

元数据

Slug cisco-device-health

版本 1.0.0

许可证 MIT-0

累计安装 2

当前安装数 1

历史版本数 1

常见问题

Cisco Device Health 是什么？

Cisco IOS-XE and NX-OS device health check and triage procedure. Use when troubleshooting Cisco routers, switches, or Nexus platforms — assessing CPU, memory... 它是一个面向 Claude Code / OpenClaw 的 AI Agent Skill 插件，目前累计下载 191 次。

如何安装 Cisco Device Health？

在 OpenClaw 或 Claude Code 对话框中运行命令「/install cisco-device-health」即可一键安装，无需额外配置。

Cisco Device Health 是免费的吗？

是的，Cisco Device Health 完全免费，采用 MIT-0 许可证，可自由下载、安装和使用。

Cisco Device Health 支持哪些平台？

Cisco Device Health 跨平台运行，可在任意部署了 OpenClaw / Claude Code 的环境中使用（cross-platform）。

谁开发了 Cisco Device Health？

由 Vahagn Madatyan（@vahagn-madatyan）开发并维护，当前版本 v1.0.0。

Cisco Device Health

Cisco IOS-XE / NX-OS Device Health Check

When to Use

Prerequisites

Procedure

Step 1: Establish Baseline Context

Step 2: CPU Utilization

Step 3: Memory Utilization

Step 4: Interface Health

Step 5: Routing Protocol Health

Step 6: Environment and Platform

Threshold Tables

Decision Trees

Primary Triage

Escalation Criteria

Report Template

Troubleshooting

Device Unresponsive to SSH

NX-OS VDC Context Confusion

CPU Spikes During Health Check

IOS-XE QFP Drops vs RP CPU

NX-OS Module Issues

Inconsistent Memory Between Platforms

Cisco Device Health 是什么？

如何安装 Cisco Device Health？

Cisco Device Health 是免费的吗？

Cisco Device Health 支持哪些平台？

谁开发了 Cisco Device Health？

💬 留言讨论