功能描述

Cisco IOS-XE device health check and triage procedure. Use when troubleshooting Cisco IOS-XE routers or switches, checking CPU utilization, memory usage, int...

使用说明 (SKILL.md)

Cisco IOS-XE Device Health Check

Name: Example Device Health
Author: vahagn-madatyan

Structured triage procedure for assessing Cisco IOS-XE device health. Produces a prioritized findings report with severity classifications and recommended actions.

When to Use

Device is reported as slow, unresponsive, or dropping traffic
Scheduled health audit of IOS-XE routers or switches
Post-change verification after configuration or software updates
Capacity planning data collection for CPU, memory, and interface utilization
Incident response when a device is suspected as the fault domain

Prerequisites

SSH or console access to the target IOS-XE device (privilege level 1 minimum)
Device running IOS-XE 16.x or 17.x (commands validated against 17.3+)
Network reachability confirmed (ping/traceroute to management IP succeeds)
Knowledge of the device's normal baseline (typical CPU, memory, traffic levels)
Change control approval if performing checks during a maintenance window

Procedure

Follow this sequence. Each step produces data for the final report. Do not skip steps unless the device is unresponsive (jump to Step 6 for crash recovery).

Step 1: Establish Baseline Context

Collect device identity and uptime to frame the health check.

show version | include uptime|Version|bytes of memory
show inventory | include PID
show clock

Record: hostname, software version, uptime, hardware model, current time. Flag if uptime is unexpectedly short — indicates recent reload or crash.

Step 2: CPU Utilization Assessment

show processes cpu sorted | head 20
show processes cpu history
show processes cpu platform sorted 5sec

Compare 5-second, 1-minute, and 5-minute averages against thresholds. If 5-second average exceeds 80%, identify the top process immediately.

Key processes to watch:

IP Input — high values indicate traffic processing overload
Crypto IKMP — VPN negotiation storms
SNMP ENGINE — aggressive polling
BGP Router — large table churn or route oscillation
IOSD — general control plane congestion

Step 3: Memory Utilization Assessment

show memory statistics
show memory platform information
show processes memory sorted | head 15

Calculate used percentage: (Total - Free) / Total * 100. Check for memory fragmentation: compare Largest Free block to Total Free. If largest free block is less than 10% of total free, fragmentation is a concern.

Step 4: Interface Health

show interfaces summary
show interfaces counters errors
show interfaces | include line protocol|drops|error|CRC|collision

For each interface with errors:

Calculate error rate: errors / (input packets + output packets) * 100
Error rate above 0.1% is warning, above 1% is critical
CRC errors suggest Layer 1 issues (cabling, optics, SFP)
Input errors with no CRC suggest buffer or overrun issues
Output drops indicate congestion — check QoS policy

Step 5: Routing Table Health

show ip route summary
show ip bgp summary (if BGP is configured)
show ip ospf neighbor (if OSPF is configured)
show ip eigrp neighbors (if EIGRP is configured)

Verify: expected number of routes present, no unexpected route withdrawals, all routing protocol neighbors in established/full state.

Flag: neighbor state changes in the last hour, route count significantly different from baseline, any routes via unexpected next-hops.

Step 6: Platform and Environment

show environment all
show platform software status control-processor brief
show logging | include %|Error|Warning|traceback (last 50 lines)

Check: power supply status, fan status, temperature readings. Any environmental alarm is an immediate escalation trigger. Review recent syslog for crash signatures (traceback, CPUHOG, MALLOCFAIL).

Threshold Tables

Reference: references/threshold-tables.md for detailed per-parameter thresholds.

Parameter	Normal	Warning	Critical
CPU 5-min avg	\x3C 40%	40–70%	> 70%
CPU 5-sec spike	\x3C 80%	80–90%	> 90%
Memory used	\x3C 70%	70–85%	> 85%
Memory fragmentation	> 10% largest/total	5–10%	\x3C 5%
Interface error rate	\x3C 0.01%	0.01–0.1%	> 0.1%
Interface output drops	\x3C 100/hr	100–1000/hr	> 1000/hr
Routing neighbors	All established	Flapping	Down
Temperature	Within spec	Within 5°C of max	At or above max

Decision Trees

Triage Priority

Is the device reachable?
├── No → Escalate immediately. Check console access, power, environment.
└── Yes
    ├── CPU critical? → Identify top process → Apply mitigation per process
    │   ├── IP Input → Check for traffic storm, ACL optimization
    │   ├── BGP Router → Check for route churn, peer flap, table size
    │   └── Other → Collect 'show tech-support' for TAC escalation
    ├── Memory critical? → Check for memory leak
    │   ├── Largest free \x3C 5% of total → Likely fragmentation, schedule reload
    │   └── Steady growth over time → Memory leak, collect 'show mem alloc'
    ├── Interface errors? → Classify error type
    │   ├── CRC/input errors → Layer 1 (cable, optic, SFP)
    │   └── Output drops → QoS policy or congestion
    └── All within thresholds → Document clean health, schedule next check

Escalation Criteria

Escalate to senior engineer or TAC when any of these conditions are met:

CPU sustained above 90% for more than 15 minutes with no identifiable cause
Memory below 15% free with no recent change to explain consumption
Traceback or CPUHOG messages in logs within last 24 hours
Environmental alarm (power, fan, temperature) present
More than 3 routing neighbor state changes in last hour

Report Template

Generate a structured report with these sections:

DEVICE HEALTH REPORT
====================
Device: [hostname]
Model: [PID from inventory]
Software: [version]
Uptime: [uptime string]
Check Time: [timestamp]
Performed By: [operator/agent]

SUMMARY: [HEALTHY | WARNING | CRITICAL]

FINDINGS:
1. [Severity] [Component] — [Description]
   Observed: [metric value]
   Threshold: [normal/warning/critical range]
   Action: [recommended action]

2. ...

RECOMMENDATIONS:
- [Prioritized list of actions]

NEXT CHECK: [scheduled date based on findings severity]

Severity levels for findings:

INFO — within normal thresholds, noted for baseline
WARNING — approaching threshold, monitor closely
CRITICAL — threshold exceeded, action required
EMERGENCY — device at risk of failure, immediate action

Troubleshooting

Device Unresponsive to SSH

Try console access. If console is also unresponsive, check power and environment remotely (smart PDU, out-of-band management). If the device has crashed, collect crashinfo: dir crashinfo: after recovery.

CPU Spikes During Health Check

SNMP polling or show commands themselves can briefly spike CPU. Wait 30 seconds after connecting before collecting CPU data. Use terminal length 0 to avoid paging pauses that extend session time.

Inconsistent Memory Readings

Memory values fluctuate during normal operation. Collect three samples at 30-second intervals and average them. Check show memory dead for memory that is allocated but unreachable (leak indicator).

Interface Counter Interpretation

Counters are cumulative since last clear. Use show interfaces [name] to see the last clear time. For rate calculations, collect counters twice with a known interval: (counter2 - counter1) / interval_seconds.

Routing Protocol Neighbor Issues

If OSPF neighbors are stuck in INIT/2WAY, check MTU mismatch and area configuration. If BGP peers show "Active" state, verify TCP connectivity on port 179 and check for ACL blocking. EIGRP stuck-in-active indicates a convergence problem downstream.

安全使用建议

This skill is an instruction-only Cisco IOS‑XE triage checklist and appears coherent with its stated purpose. Before installing/using: (1) Confirm how you will provide SSH/console credentials (the skill needs device access but does not declare credential requirements); ensure credentials are supplied securely and not pasted into untrusted channels. (2) Be aware that outputs like 'show tech-support' and crash dumps can include sensitive configuration and logs — avoid sharing them outside authorized support channels. (3) Verify change-control/maintenance-window approval before running commands on production devices. (4) The SKILL.md metadata mentions 'ssh' while registry metadata does not—this is likely benign but worth double-checking with the skill author if you require strict inventory/attestation. If you require the skill to never exfiltrate data, confirm there are no hidden egress endpoints (there are none declared) and prefer running the procedure locally or via an isolated management jump host.

能力评估

✓ Purpose & Capability

The name/description (IOS‑XE health and triage) match the content: only IOS show-commands, thresholds, and escalation guidance are present. The procedure legitimately needs SSH/console access to target devices. One minor inconsistency: the SKILL.md embedded metadata lists 'ssh' under requires.bins, but the registry-level requirements list no required binaries—this is likely a metadata mismatch rather than malicious behavior.

✓ Instruction Scope

SKILL.md instructs running read-only IOS show commands, collecting counters, environment and routing info, and producing a structured report. It does not instruct reading local files, environment variables, or contacting external endpoints. It does advise collecting larger outputs (e.g., 'show tech-support') for escalation, which is standard for TAC workflows but may contain sensitive device data.

✓ Install Mechanism

No install spec and no code files — instruction-only. Nothing will be downloaded or written to disk by the skill itself, which is the lowest-risk install posture.

ℹ Credentials

The skill does not declare or request environment variables or credentials, yet operationally SSH/console credentials (device management credentials, jump-host keys) are required to run the described checks. This is proportionate to the task but the skill does not declare how credentials will be supplied or handled — a point the user should verify before use.

✓ Persistence & Privilege

always is false, the skill is user-invocable and not force-included. The skill contains no instructions to modify other skills, agent config, or system-wide settings.

版本历史

v1.0.0

example-device-health 1.0.0 initial release: - Provides a step-by-step triage procedure for Cisco IOS-XE routers and switches. - Covers health checks for CPU, memory, interfaces, routing protocols, and platform environment. - Includes recommended show commands, escalation criteria, and decision trees for troubleshooting. - Supplies threshold tables for interpreting health metrics. - Generates structured device health reports with severity and actionable recommendations. - Designed for use during outages, audits, post-change verification, and incident response.

元数据

Slug example-device-health

版本 1.0.0

许可证 MIT-0

累计安装 1

当前安装数 1

历史版本数 1

常见问题

Example Device Health 是什么？

Cisco IOS-XE device health check and triage procedure. Use when troubleshooting Cisco IOS-XE routers or switches, checking CPU utilization, memory usage, int... 它是一个面向 Claude Code / OpenClaw 的 AI Agent Skill 插件，目前累计下载 149 次。

如何安装 Example Device Health？

在 OpenClaw 或 Claude Code 对话框中运行命令「/install example-device-health」即可一键安装，无需额外配置。

Example Device Health 是免费的吗？

是的，Example Device Health 完全免费，采用 MIT-0 许可证，可自由下载、安装和使用。

Example Device Health 支持哪些平台？

Example Device Health 跨平台运行，可在任意部署了 OpenClaw / Claude Code 的环境中使用（cross-platform）。

谁开发了 Example Device Health？

由 Vahagn Madatyan（@vahagn-madatyan）开发并维护，当前版本 v1.0.0。

Example Device Health