Description

Arista EOS device health check and triage procedure. Use when troubleshooting Arista 7000, 7500, or 720X series switches — assessing CPU, memory, interfaces,...

README (SKILL.md)

Arista EOS Device Health Check

Name: Arista Device Health
Author: vahagn-madatyan

Structured triage procedure for assessing Arista EOS device health. Produces a prioritized findings report with severity classifications and recommended actions.

EOS runs on a Linux kernel with individual processes (agents) managing each protocol and subsystem. This architecture means Linux-native tools (bash top, bash df, bash dmesg) are valid supplements to EOS show commands. Agent health is a first-class concern — a crashed or stuck agent can silently affect a subsystem even when aggregate device metrics look normal.

The procedure covers core device health first, then extends to MLAG and VXLAN/EVPN for data center deployments.

When to Use

Device reported as slow, dropping traffic, or unresponsive
Scheduled health audit of Arista switches (leaf, spine, or border)
Post-change verification after upgrades, patches, or configuration changes
Capacity planning data collection for CPU, memory, and link utilization
Incident response when an Arista switch is suspected as the fault domain
MLAG inconsistency or split-brain detection in a data center pair
VXLAN/EVPN overlay health assessment — VTEP reachability, BGP EVPN peering
Agent crash or restart detected in syslog

Prerequisites

SSH or console access to the device (privilege 1 minimum)
EOS 4.28 or later (commands validated against EOS 4.30+)
Network reachability to management interface confirmed
Awareness of the device's normal baseline (CPU, memory, traffic patterns)
For MLAG configurations: access to both MLAG peers for cross-validation
For VXLAN/EVPN: knowledge of expected VTEP count and BGP EVPN peer list

Procedure

Follow this sequence. Core device health (Steps 1–5) applies to all deployments. Steps 6–7 are data center extensions for MLAG and VXLAN/EVPN — skip if not configured on this device.

Step 1: Establish Baseline Context

show version
show uptime
show clock
show inventory

Record: hostname, EOS version, uptime, hardware model, total memory, current time. Short uptime indicates a recent reload or crash — check show reload cause for the trigger. Note the hardware platform — Arista's 7050X, 7280R, 7500R, and 720X series have different memory and CPU profiles.

Step 2: CPU and Memory Assessment

EOS reports CPU and memory through both EOS commands and Linux-native tools.

show processes top once
show version | include Free memory

Linux-native alternative (provides more granularity):

bash timeout 5 top -bn1 | head 20
bash free -m

show processes top once shows per-process CPU and memory similar to Linux top. EOS CPU is typically consumed by agents — each protocol has its own process.

Key agents to watch:

Stp — Spanning Tree: high CPU indicates topology instability
Rib — Routing Information Base: route churn or large table operations
Ebra — EOS Bridge Agent: L2 forwarding, MAC learning storms
IpRib — IP RIB agent: routing table processing
Bgp — BGP agent: peer negotiations, route processing
Fap-sobek / Sand — ASIC agents: forwarding plane, hardware programming

EOS memory: show version includes total and free memory. Free memory below 20% of total is warning. Linux free -m provides buffer/cache detail — EOS uses Linux page cache aggressively, so "used" memory includes reclaimable cache.

Step 3: Agent and Daemon Health

EOS runs each subsystem as an independent agent. A crashed or stuck agent affects its subsystem silently — device-level metrics may look normal while one protocol is completely inoperative.

show agent
show agent [name] logs | tail 20
show logging last 50 | include AGENT|agent|crash|restart

show agent lists all EOS agents with their PID, state, and uptime.

Healthy state: all agents show running with uptimes matching device uptime. Red flags:

Agent with shorter uptime than device → it crashed and restarted
Agent in crashed or not running state → subsystem is down
Multiple agent restarts in logs → recurring instability

If an agent shows recent restart, check its logs: show agent [name] logs. Common restart causes: memory exhaustion, uncaught exception, watchdog timeout.

Step 4: Interface Health

show interfaces status
show interfaces counters errors
show interfaces counters discards

For interfaces with errors:

show interfaces [name]
show interfaces [name] transceiver

Error classification:

CRC / FCS errors → Layer 1 (cabling, optics, SFP seating)
Input errors without CRC → buffer overruns, MTU mismatch
Alignment errors → duplex mismatch or L1 issue
Output discards → egress congestion; check QoS or link capacity
Late collisions → duplex mismatch (should not occur on modern links)

Optics DOM: show interfaces transceiver provides Tx/Rx power, temperature, voltage. Low Rx power (below -10 dBm for most 10G SFP+) indicates fiber or optic degradation.

Step 5: Environment and Platform

show environment all
show environment cooling
show environment power
show environment temperature
show reload cause
show logging last 30 | include ERR|WARN|traceback|panic

Check: all temperature sensors within limits, all power supplies OK, all fans operational. show environment all is a comprehensive single command.

show reload cause — if the device recently reloaded, this explains why. Expected causes: user-initiated, software upgrade. Unexpected causes: kernel panic, watchdog timeout, power loss.

Review syslog for tracebacks, panics, or agent crashes. EOS logs are also accessible via bash cat /var/log/messages | tail 50.

Step 6: MLAG Health (Data Center Extension)

Skip this step if MLAG is not configured. MLAG state is often the single most important health indicator in an Arista data center deployment — a degraded MLAG pair can cause traffic black-holes or asymmetric forwarding.

show mlag
show mlag detail
show mlag interfaces
show mlag config-sanity

MLAG state assessment:

Field	Healthy Value	Problem Indicator
state	active	disabled, inactive
negotiation status	connected	connecting, disconnected
peer-link status	up	down, errdisabled
local-interface status	up	down
config-sanity	consistent	inconsistent

show mlag detail provides:

Peer address and peer-link interface status
Heartbeat interval and status — lost heartbeats indicate control plane reachability issues between peers
Reload delay timers — critical for preventing traffic loss during boot

show mlag config-sanity checks for configuration mismatches between peers. Inconsistencies can cause traffic loops or black-holes. Address any inconsistent result before continuing other triage.

show mlag interfaces — verify all MLAG interfaces show active-full on both sides. active-partial means one side is down — reduced redundancy. disabled means the MLAG interface is administratively or operationally down.

Step 7: VXLAN/EVPN Health (Data Center Extension)

Skip this step if VXLAN/EVPN is not configured. This assesses overlay fabric health for VXLAN deployments using BGP EVPN as the control plane.

show vxlan vtep
show vxlan address-table
show interfaces vxlan 1
show bgp evpn summary

VXLAN health checks:

show vxlan vtep — verify expected VTEP count. Missing VTEPs indicate underlay reachability or BGP EVPN peering issues.
show interfaces vxlan 1 — VTEP source interface must be up. Flood list should match expected remote VTEPs (for flood-and-learn) or be empty (for BGP EVPN with ingress replication).
show vxlan address-table — MAC learning across the overlay. Stale or missing entries indicate control plane issues.

BGP EVPN peering:

show bgp evpn summary — all EVPN peers must be in Established state. Peers in Active/Connect state have transport or configuration issues.
Compare EVPN route count to baseline — significant deviation indicates route withdrawal or redistribution problems.

For L3 VXLAN (symmetric IRB):

show ip route vrf [name] summary
show bgp evpn route-type ip-prefix

Verify: VRF route counts match expectations, no unexpected route withdrawals, EVPN type-5 routes present for inter-VRF routing.

Threshold Tables

Reference: references/threshold-tables.md for detailed per-parameter thresholds.

Parameter	Normal	Warning	Critical	Notes
CPU 5-min avg	\x3C 40%	40–70%	> 70%	Per-agent breakdown in `show processes top`
Memory free	> 30%	20–30%	\x3C 20%	Linux page cache is reclaimable
Agent state	All running	Agent restarted	Agent crashed/not running	Check `show agent`
Interface error rate	\x3C 0.01%	0.01–0.1%	> 0.1%
Output discards/hr	\x3C 100	100–1000	> 1000
MLAG state	active/connected	config inconsistency	peer-link down
VXLAN VTEP count	Matches expected	Missing VTEPs	Vxlan1 interface down
Temperature	Within spec	5°C of max	At/above max	Per-sensor

Decision Trees

Primary Triage

Is the device reachable?
├── No → Check console, power, environment. Check reload cause after recovery.
└── Yes
    ├── Agent health issue?
    │   ├── Agent crashed / not running → Subsystem is down
    │   │   ├── Identify which agent → Determines affected protocol/feature
    │   │   ├── Bgp agent → BGP sessions will be down
    │   │   ├── Stp agent → STP not running, L2 loops possible
    │   │   ├── Ebra agent → L2 forwarding affected
    │   │   └── Action: check agent logs, attempt restart, escalate to TAC
    │   ├── Agent restarted recently → Subsystem had a disruption
    │   │   └── Check logs: `show agent [name] logs` for crash reason
    │   └── All agents running, uptimes match → Agent layer healthy
    │
    ├── CPU issue?
    │   ├── Identify top agent by CPU
    │   │   ├── Stp high → STP topology change, check link flaps
    │   │   ├── Rib/IpRib high → Route churn, check BGP/OSPF peers
    │   │   ├── Ebra high → MAC learning storm, check for loops
    │   │   ├── Bgp high → Large table, peer negotiation, route policy
    │   │   └── Fap/Sand high → ASIC programming backlog
    │   └── No single agent dominant → General overload, check traffic rates
    │
    ├── Memory issue?
    │   ├── Free \x3C 20% → Check top consumers via `show processes top once`
    │   ├── Linux cache: `bash free -m` → Cache is reclaimable, not a real leak
    │   └── Steady growth → Memory leak in agent, collect data, plan reload
    │
    ├── Interface errors? → Classify error type
    │   ├── CRC/FCS errors → Layer 1 (cable, optic, SFP)
    │   ├── Output discards → QoS or congestion
    │   └── Alignment errors → Duplex mismatch
    │
    ├── MLAG issue? (if configured)
    │   ├── See MLAG State Triage tree below
    │   └── MLAG healthy → Continue
    │
    ├── VXLAN/EVPN issue? (if configured)
    │   ├── Missing VTEPs → Check underlay reachability, BGP EVPN peers
    │   ├── EVPN peer not Established → Check transport, route-map, ASN
    │   └── VTEP and peers OK → Overlay healthy
    │
    └── All within thresholds → Document clean health

MLAG State Triage

MLAG configured?
├── No → Skip MLAG checks
└── Yes
    ├── MLAG state: active?
    │   ├── No → MLAG disabled or inactive
    │   │   └── Check: `show mlag` for reason, peer reachability
    │   └── Yes → Continue
    │
    ├── Peer-link status: up?
    │   ├── No → CRITICAL — peer-link down
    │   │   ├── Traffic orphaned on MLAG interfaces
    │   │   ├── Check: physical link, SFP, port-channel members
    │   │   └── Verify: reload-delay timers configured to prevent storms
    │   └── Yes → Continue
    │
    ├── Negotiation: connected?
    │   ├── No → Control plane issue between peers
    │   │   └── Check: heartbeat, peer IP, VLAN configuration
    │   └── Yes → Continue
    │
    ├── Config sanity: consistent?
    │   ├── No → Configuration mismatch
    │   │   ├── `show mlag config-sanity` for details
    │   │   ├── Common: VLAN mismatch, STP priority, trunk allowed VLANs
    │   │   └── Fix mismatches before trusting MLAG state
    │   └── Yes → Continue
    │
    ├── MLAG interfaces: all active-full?
    │   ├── active-partial → One side down, reduced redundancy
    │   │   └── Check: interface status on both peers
    │   ├── disabled → Interface operationally down
    │   │   └── Check: physical link, member port status
    │   └── active-full → Healthy
    │
    └── All checks pass → MLAG healthy

Escalation Criteria

Escalate to senior engineer or Arista TAC when:

CPU sustained > 90% for 15+ minutes with no identifiable cause
Memory below 15% free with no recent change to explain consumption
Agent in crashed state that does not recover after restart
Multiple agents crashing or restarting within a short period
Traceback or kernel panic in logs (show logging, bash dmesg)
Any environmental alarm (power, fan, temperature)
More than 3 routing neighbor state changes in the last hour
MLAG peer-link failure with no clear physical cause
MLAG config-sanity inconsistencies that cannot be resolved
VXLAN overlay losing more than 10% of expected VTEPs

Report Template

DEVICE HEALTH REPORT
====================
Device: [hostname]
Platform: Arista EOS
Model: [from show inventory]
Software: [EOS version]
Uptime: [uptime string]
Check Time: [timestamp]
Performed By: [operator/agent]

SUMMARY: [HEALTHY | WARNING | CRITICAL | EMERGENCY]

AGENT HEALTH:
  Total agents: [count]
  Running: [count] | Crashed: [count] | Not running: [count]
  Recent restarts: [list any agents with uptime \x3C device uptime]

FINDINGS:
1. [Severity] [Component] — [Description]
   Domain: [System | Interface | MLAG | VXLAN | Environment]
   Observed: [metric value]
   Threshold: [normal/warning/critical range]
   Action: [recommended action]

2. ...

DC EXTENSIONS:
  MLAG: [healthy | degraded | critical | not configured]
    State: [active/inactive], Peer-link: [up/down], Config-sanity: [consistent/inconsistent]
  VXLAN/EVPN: [healthy | degraded | critical | not configured]
    VTEPs: [observed/expected], EVPN peers: [established/total]

RECOMMENDATIONS:
- [Prioritized action list]

NEXT CHECK: [date based on severity — CRITICAL: 24hr, WARNING: 7d, HEALTHY: 30d]

Troubleshooting

Device Unresponsive to SSH

Try console access. If console is also unresponsive, check power and environment via out-of-band management. After recovery: show reload cause for the trigger, bash dmesg | tail 50 for kernel messages, show logging last 50 for pre-crash syslog. EOS stores crash data in /var/log/ — accessible via bash ls -la /var/log/.

Agent Crash and Recovery

If show agent shows an agent as crashed or with uptime shorter than the device:

Check agent logs: show agent [name] logs | tail 30
Check syslog: show logging last 50 | include [agent-name]
Linux-level: bash journalctl -u [agent-name] --no-pager | tail 30
If the agent is not running, attempt: agent [name] shutdown then no agent [name] shutdown in configuration mode (note: this is a config change)

Agent crashes are the most common EOS-specific failure mode. Collect logs before restarting — the crash data is needed for TAC investigation.

CPU Spikes During Health Check

EOS show commands can briefly spike CPU on the management plane. Wait 30 seconds after connecting before collecting CPU data. Use show processes top once rather than interactive top (which holds a session) for scripted checks.

MLAG Split-Brain Detection

If both peers report MLAG state active but negotiation shows disconnected, this is a split-brain condition. Both peers believe they are the primary. Causes: peer-link physical failure, peer-link VLAN issue, or heartbeat network partition. Immediate action: verify physical peer-link connectivity. If peer-link is physically up but MLAG reports disconnected, check VLAN trunking on the peer-link and the heartbeat network (usually the management VRF).

Linux-Native Diagnostics

EOS's Linux foundation means standard Linux tools work from bash:

bash top -bn1 — process-level CPU and memory (more granular than EOS top)
bash free -m — memory with buffer/cache detail
bash df -h — filesystem usage (important for /mnt/flash and /var/log)
bash dmesg | tail — kernel messages (hardware errors, driver issues)
bash uptime — load averages
bash cat /proc/meminfo — detailed memory breakdown

These supplement EOS show commands when deeper investigation is needed. Prefix all commands with bash from the EOS CLI, or enter bash for a shell.

Inconsistent Memory Readings

EOS's Linux page cache uses available memory for file caching. show version may report low free memory while actual memory pressure is low. Use bash free -m and check the available column (not free) for the real available memory. The available column accounts for reclaimable cache and buffers.

Usage Guidance

This is an instruction-only Arista EOS troubleshooting playbook and appears coherent with that purpose. Before using it, ensure you: (1) run it only against devices you are authorized to access; (2) provide only the minimum SSH/console privileges needed (read/diagnostic vs. write/enable) because the procedure reads logs and core dumps which may contain sensitive data; (3) note the minor metadata mismatch indicating the skill expects an ssh binary—ensure your agent environment provides secure SSH access; and (4) if you require the skill to collect or transmit 'show tech-support' or core files, decide how those outputs will be handled/stored so sensitive information is protected. If the publisher or source becomes available (homepage, maintainer contact) or if the skill is later extended with scripts or network egress, re-evaluate for any install or exfiltration risks.

Capability Analysis

Type: OpenClaw Skill Name: arista-device-health Version: 0.1.1 The skill bundle provides a professional and comprehensive triage procedure for Arista EOS devices, focusing on health monitoring and troubleshooting. It uses standard EOS 'show' commands and legitimate Linux-native diagnostics (e.g., 'bash free -m', 'bash dmesg') as intended for the platform, with no evidence of data exfiltration, malicious execution, or prompt injection across SKILL.md, cli-reference.md, or threshold-tables.md.

Capability Assessment

✓ Purpose & Capability

The skill is an instruction-only triage procedure for Arista EOS devices. The commands and file paths referenced (show commands, bash alternatives, /var/log, journalctl, etc.) are appropriate for on-device diagnostics of CPU, memory, agents, interfaces, MLAG, and VXLAN/EVPN. There is one minor metadata inconsistency: SKILL.md's embedded openclaw JSON lists requires bins:["ssh"] while the registry's top-level 'Required binaries' shows none.

✓ Instruction Scope

SKILL.md stays within the troubleshooting scope: it instructs read-only 'show' commands and Linux diagnostic reads (top, free, dmesg, journalctl, ls /var/log/qt). These are relevant to device health. It does advise collecting crash/core files and ‘show tech-support’ level diagnostics (large sensitive outputs), but these are expected for triage and are not sent to external endpoints by the instructions.

✓ Install Mechanism

No install spec or code files are present; this is instruction-only and nothing will be written to disk or downloaded during install. Low installation risk.

✓ Credentials

The skill declares no environment variables or credentials and does not request unrelated secrets. It assumes the operator has SSH/console access and appropriate device privileges, which is proportionate to the task.

✓ Persistence & Privilege

always is false and the skill is user-invocable. It does not request persistent system-wide changes or modify other skills. Agent autonomous invocation is allowed by default but is not combined with other concerning permissions here.

Version History

v0.1.1

- Added new CLI reference and threshold tables documentation for easier troubleshooting and health assessment. - Included the openclaw metadata block in SKILL.md, improving integration and discoverability. - No changes to the core health check triage procedure or instructions.

v0.1.0

Initial release providing Arista EOS device health check and triage procedures. - Covers CPU, memory, interface, environment, and agent/daemon health assessment. - Includes MLAG and VXLAN/EVPN health validation steps for data center deployments. - Details both EOS-native and Linux-native troubleshooting commands. - Provides recommended metrics, severity classifications, and corrective actions. - Designed for use during troubleshooting, audits, post-change checks, and incident response.

Metadata

Slug arista-device-health

Version 0.1.1

License MIT-0

All-time Installs 1

Active Installs 1

Total Versions 2

Frequently Asked Questions

What is Arista Device Health?

Arista EOS device health check and triage procedure. Use when troubleshooting Arista 7000, 7500, or 720X series switches — assessing CPU, memory, interfaces,... It is an AI Agent Skill for Claude Code / OpenClaw, with 148 downloads so far.

How do I install Arista Device Health?

Run "/install arista-device-health" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Arista Device Health free?

Yes, Arista Device Health is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Arista Device Health support?

Arista Device Health is cross-platform and runs anywhere OpenClaw / Claude Code is available (cross-platform).

Who created Arista Device Health?

It is built and maintained by Vahagn Madatyan (@vahagn-madatyan); the current version is v0.1.1.

More Skills

Arista Device Health