Description

Cisco IOS-XE device health check and triage procedure. Use when troubleshooting Cisco IOS-XE routers or switches, checking CPU utilization, memory usage, int...

README (SKILL.md)

Cisco IOS-XE Device Health Check

Name: Example Device Health
Author: vahagn-madatyan

Structured triage procedure for assessing Cisco IOS-XE device health. Produces a prioritized findings report with severity classifications and recommended actions.

When to Use

Device is reported as slow, unresponsive, or dropping traffic
Scheduled health audit of IOS-XE routers or switches
Post-change verification after configuration or software updates
Capacity planning data collection for CPU, memory, and interface utilization
Incident response when a device is suspected as the fault domain

Prerequisites

SSH or console access to the target IOS-XE device (privilege level 1 minimum)
Device running IOS-XE 16.x or 17.x (commands validated against 17.3+)
Network reachability confirmed (ping/traceroute to management IP succeeds)
Knowledge of the device's normal baseline (typical CPU, memory, traffic levels)
Change control approval if performing checks during a maintenance window

Procedure

Follow this sequence. Each step produces data for the final report. Do not skip steps unless the device is unresponsive (jump to Step 6 for crash recovery).

Step 1: Establish Baseline Context

Collect device identity and uptime to frame the health check.

show version | include uptime|Version|bytes of memory
show inventory | include PID
show clock

Record: hostname, software version, uptime, hardware model, current time. Flag if uptime is unexpectedly short — indicates recent reload or crash.

Step 2: CPU Utilization Assessment

show processes cpu sorted | head 20
show processes cpu history
show processes cpu platform sorted 5sec

Compare 5-second, 1-minute, and 5-minute averages against thresholds. If 5-second average exceeds 80%, identify the top process immediately.

Key processes to watch:

IP Input — high values indicate traffic processing overload
Crypto IKMP — VPN negotiation storms
SNMP ENGINE — aggressive polling
BGP Router — large table churn or route oscillation
IOSD — general control plane congestion

Step 3: Memory Utilization Assessment

show memory statistics
show memory platform information
show processes memory sorted | head 15

Calculate used percentage: (Total - Free) / Total * 100. Check for memory fragmentation: compare Largest Free block to Total Free. If largest free block is less than 10% of total free, fragmentation is a concern.

Step 4: Interface Health

show interfaces summary
show interfaces counters errors
show interfaces | include line protocol|drops|error|CRC|collision

For each interface with errors:

Calculate error rate: errors / (input packets + output packets) * 100
Error rate above 0.1% is warning, above 1% is critical
CRC errors suggest Layer 1 issues (cabling, optics, SFP)
Input errors with no CRC suggest buffer or overrun issues
Output drops indicate congestion — check QoS policy

Step 5: Routing Table Health

show ip route summary
show ip bgp summary (if BGP is configured)
show ip ospf neighbor (if OSPF is configured)
show ip eigrp neighbors (if EIGRP is configured)

Verify: expected number of routes present, no unexpected route withdrawals, all routing protocol neighbors in established/full state.

Flag: neighbor state changes in the last hour, route count significantly different from baseline, any routes via unexpected next-hops.

Step 6: Platform and Environment

show environment all
show platform software status control-processor brief
show logging | include %|Error|Warning|traceback (last 50 lines)

Check: power supply status, fan status, temperature readings. Any environmental alarm is an immediate escalation trigger. Review recent syslog for crash signatures (traceback, CPUHOG, MALLOCFAIL).

Threshold Tables

Reference: references/threshold-tables.md for detailed per-parameter thresholds.

Parameter	Normal	Warning	Critical
CPU 5-min avg	\x3C 40%	40–70%	> 70%
CPU 5-sec spike	\x3C 80%	80–90%	> 90%
Memory used	\x3C 70%	70–85%	> 85%
Memory fragmentation	> 10% largest/total	5–10%	\x3C 5%
Interface error rate	\x3C 0.01%	0.01–0.1%	> 0.1%
Interface output drops	\x3C 100/hr	100–1000/hr	> 1000/hr
Routing neighbors	All established	Flapping	Down
Temperature	Within spec	Within 5°C of max	At or above max

Decision Trees

Triage Priority

Is the device reachable?
├── No → Escalate immediately. Check console access, power, environment.
└── Yes
    ├── CPU critical? → Identify top process → Apply mitigation per process
    │   ├── IP Input → Check for traffic storm, ACL optimization
    │   ├── BGP Router → Check for route churn, peer flap, table size
    │   └── Other → Collect 'show tech-support' for TAC escalation
    ├── Memory critical? → Check for memory leak
    │   ├── Largest free \x3C 5% of total → Likely fragmentation, schedule reload
    │   └── Steady growth over time → Memory leak, collect 'show mem alloc'
    ├── Interface errors? → Classify error type
    │   ├── CRC/input errors → Layer 1 (cable, optic, SFP)
    │   └── Output drops → QoS policy or congestion
    └── All within thresholds → Document clean health, schedule next check

Escalation Criteria

Escalate to senior engineer or TAC when any of these conditions are met:

CPU sustained above 90% for more than 15 minutes with no identifiable cause
Memory below 15% free with no recent change to explain consumption
Traceback or CPUHOG messages in logs within last 24 hours
Environmental alarm (power, fan, temperature) present
More than 3 routing neighbor state changes in last hour

Report Template

Generate a structured report with these sections:

DEVICE HEALTH REPORT
====================
Device: [hostname]
Model: [PID from inventory]
Software: [version]
Uptime: [uptime string]
Check Time: [timestamp]
Performed By: [operator/agent]

SUMMARY: [HEALTHY | WARNING | CRITICAL]

FINDINGS:
1. [Severity] [Component] — [Description]
   Observed: [metric value]
   Threshold: [normal/warning/critical range]
   Action: [recommended action]

2. ...

RECOMMENDATIONS:
- [Prioritized list of actions]

NEXT CHECK: [scheduled date based on findings severity]

Severity levels for findings:

INFO — within normal thresholds, noted for baseline
WARNING — approaching threshold, monitor closely
CRITICAL — threshold exceeded, action required
EMERGENCY — device at risk of failure, immediate action

Troubleshooting

Device Unresponsive to SSH

Try console access. If console is also unresponsive, check power and environment remotely (smart PDU, out-of-band management). If the device has crashed, collect crashinfo: dir crashinfo: after recovery.

CPU Spikes During Health Check

SNMP polling or show commands themselves can briefly spike CPU. Wait 30 seconds after connecting before collecting CPU data. Use terminal length 0 to avoid paging pauses that extend session time.

Inconsistent Memory Readings

Memory values fluctuate during normal operation. Collect three samples at 30-second intervals and average them. Check show memory dead for memory that is allocated but unreachable (leak indicator).

Interface Counter Interpretation

Counters are cumulative since last clear. Use show interfaces [name] to see the last clear time. For rate calculations, collect counters twice with a known interval: (counter2 - counter1) / interval_seconds.

Routing Protocol Neighbor Issues

If OSPF neighbors are stuck in INIT/2WAY, check MTU mismatch and area configuration. If BGP peers show "Active" state, verify TCP connectivity on port 179 and check for ACL blocking. EIGRP stuck-in-active indicates a convergence problem downstream.

Usage Guidance

This skill is an instruction-only Cisco IOS‑XE triage checklist and appears coherent with its stated purpose. Before installing/using: (1) Confirm how you will provide SSH/console credentials (the skill needs device access but does not declare credential requirements); ensure credentials are supplied securely and not pasted into untrusted channels. (2) Be aware that outputs like 'show tech-support' and crash dumps can include sensitive configuration and logs — avoid sharing them outside authorized support channels. (3) Verify change-control/maintenance-window approval before running commands on production devices. (4) The SKILL.md metadata mentions 'ssh' while registry metadata does not—this is likely benign but worth double-checking with the skill author if you require strict inventory/attestation. If you require the skill to never exfiltrate data, confirm there are no hidden egress endpoints (there are none declared) and prefer running the procedure locally or via an isolated management jump host.

Capability Assessment

✓ Purpose & Capability

The name/description (IOS‑XE health and triage) match the content: only IOS show-commands, thresholds, and escalation guidance are present. The procedure legitimately needs SSH/console access to target devices. One minor inconsistency: the SKILL.md embedded metadata lists 'ssh' under requires.bins, but the registry-level requirements list no required binaries—this is likely a metadata mismatch rather than malicious behavior.

✓ Instruction Scope

SKILL.md instructs running read-only IOS show commands, collecting counters, environment and routing info, and producing a structured report. It does not instruct reading local files, environment variables, or contacting external endpoints. It does advise collecting larger outputs (e.g., 'show tech-support') for escalation, which is standard for TAC workflows but may contain sensitive device data.

✓ Install Mechanism

No install spec and no code files — instruction-only. Nothing will be downloaded or written to disk by the skill itself, which is the lowest-risk install posture.

ℹ Credentials

The skill does not declare or request environment variables or credentials, yet operationally SSH/console credentials (device management credentials, jump-host keys) are required to run the described checks. This is proportionate to the task but the skill does not declare how credentials will be supplied or handled — a point the user should verify before use.

✓ Persistence & Privilege

always is false, the skill is user-invocable and not force-included. The skill contains no instructions to modify other skills, agent config, or system-wide settings.

Version History

v1.0.0

example-device-health 1.0.0 initial release: - Provides a step-by-step triage procedure for Cisco IOS-XE routers and switches. - Covers health checks for CPU, memory, interfaces, routing protocols, and platform environment. - Includes recommended show commands, escalation criteria, and decision trees for troubleshooting. - Supplies threshold tables for interpreting health metrics. - Generates structured device health reports with severity and actionable recommendations. - Designed for use during outages, audits, post-change verification, and incident response.

Metadata

Slug example-device-health

Version 1.0.0

License MIT-0

All-time Installs 1

Active Installs 1

Total Versions 1

Frequently Asked Questions

What is Example Device Health?

Cisco IOS-XE device health check and triage procedure. Use when troubleshooting Cisco IOS-XE routers or switches, checking CPU utilization, memory usage, int... It is an AI Agent Skill for Claude Code / OpenClaw, with 149 downloads so far.

How do I install Example Device Health?

Run "/install example-device-health" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Example Device Health free?

Yes, Example Device Health is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Example Device Health support?

Example Device Health is cross-platform and runs anywhere OpenClaw / Claude Code is available (cross-platform).

Who created Example Device Health?

It is built and maintained by Vahagn Madatyan (@vahagn-madatyan); the current version is v1.0.0.

More Skills

Example Device Health