Linux System Health Diagnostic Skill

You are a Linux OS diagnostic expert. Use this skill when a user reports any of the following problems:
- Performance: server slow, high load, lag, unresponsive
- Memory: OOM killed, out of memory, memory leak, swap thrashing
- Disk: disk full, read-only filesystem, inode exhaustion, log files too large
- CPU: high CPU, IO wait, process stuck, load average spike
- Network: DNS failure, connection timeout, port exhaustion, CLOSE_WAIT accumulation, firewall blocking
- Process: crash, zombie processes, too many open files, file descriptor limit
- Browser automation: missing shared libraries, Chromium sandbox error, headless browser failures
- Locale/Encoding: garbled text, character encoding issues, locale not configured

Use the judgment rules below to systematically diagnose OS-level root causes.

When NOT to use this skill: for application-level issues specific to OpenClaw (gateway config, API keys, model configuration, service management, systemd units), use the `openclaw-diagnostic` skill instead. This skill only covers OS-level diagnostics.

Diagnostic workflow:
- Always start with Section 1 (System Environment Baseline) to establish context
- Then run the sections relevant to the user's reported symptoms
- If the root cause is unclear, run all sections in order for a comprehensive check

Commands: Run the corresponding section in `scripts/diagnostics.sh`. Run as root with `export LANG=C`.

Issue Registry: See `reference.md` for severity level definitions and the complete issue name table.

Data access scope: this skill collects OS-level diagnostic data. Review the table below before running it in sensitive environments.

| Category | What is accessed | Sections |
|----------|-----------------|----------|
| System config files | `/etc/os-release`, `/etc/resolv.conf`, `/etc/security/limits.conf`, `/etc/default/locale`, `/etc/locale.conf`, `/etc/systemd/journald.conf` | 1, 6, 8, 11, 17 |
| Kernel interfaces | `/proc/meminfo`, `/proc/stat`, `/proc/loadavg`, `/proc/sys/fs/*`, `/proc/sys/net/*`, `/sys/kernel/mm/*` | 2, 3, 5, 6, 7, 14 |
| Kernel ring buffer | `dmesg` (may contain process names and OOM kill details) | 2, 7, 12 |
| Systemd journal | `journalctl -k` (kernel messages only) | 2 |
| Log directory | `/var/log/` size enumeration only (does not read log content) | 11 |
| Process & socket table | `ps`, `ss -p` (exposes PIDs, command names, socket owners) | 2, 3, 10, 15 |
| User home directories | `/root/.cache/ms-playwright`, `/home/*/.cache/ms-playwright` (Chromium binary search only) | 16 |
| Outbound network probes | DNS resolution tests (`nslookup`/`dig`/`getent` to `github.com`), nameserver TCP/53 reachability, Chrome headless launch test (`about:blank`) | 8, 16 |
| Write operation | Creates and immediately removes `/tmp/.oc_write_test` to verify filesystem writability (the only write in the entire script) | 12 |

Output format: After running diagnostics, report findings as a severity-sorted list (FATAL > CRITICAL > ERROR > WARNING > INFO). For each issue found, include:
- Issue name (e.g., `OpenClaw.Memory.SystemMemoryCritical`)
- Severity level
- Observed value vs. threshold
- Recommended remediation

---
1. System Environment Baseline

Collect OS context for subsequent analysis.

Judgment rules:
- Record the output as OpenClaw.System.EnvironmentBaseline (INFO); this is context only and raises no issues.
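
The following bash function sketches what the baseline might look like: it prints a few `key=value` context lines. The name `collect_baseline`, its optional file argument, and the chosen fields are illustrative assumptions and are not taken from `scripts/diagnostics.sh`. Call it with no argument on a live host, or pass an alternate os-release file.

```bash
# Hypothetical baseline collector; not the actual scripts/diagnostics.sh code.
# Usage: collect_baseline [os-release-file]
collect_baseline() {
  local os_release=${1:-/etc/os-release}
  # os-release is a shell-compatible KEY=value file; sourcing it in a
  # subshell keeps its variables out of the caller's environment
  ( . "$os_release" 2>/dev/null; echo "os=${PRETTY_NAME:-unknown}" )
  echo "kernel=$(uname -r)"
  echo "arch=$(uname -m)"
  echo "cpus=$(nproc)"
  echo "mem_total_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo 2>/dev/null)"
}
```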

---

2. Memory & OOM

Detect low memory and past OOM kills that affect any workload on this server.

Judgment rules:
- MemAvailable / MemTotal < 5% → OpenClaw.Memory.SystemMemoryCritical (CRITICAL)
- Remediation: Kill unnecessary processes, add swap, or increase instance RAM
- MemAvailable / MemTotal < 10% → OpenClaw.Memory.SystemMemoryLow (WARNING)
- Remediation: Monitor closely; consider scaling up
- MemTotal < 2 GB → OpenClaw.Memory.InsufficientTotalMemory (ERROR)
- Remediation: 4 GB+ RAM is recommended for production workloads
- `dmesg` contains "oom-killer" → OpenClaw.Memory.OOMKillerEvent (WARNING)
- Remediation: Identify which processes were killed; review memory allocation
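
Here is a minimal bash sketch of the memory rules. It assumes `/proc/meminfo` values are in kB, reads "2 GB" as 2 GiB (2097152 kB), and reports only the more severe of the two availability findings. `check_memory` and `check_oom_events` are hypothetical helper names. The second helper reads kernel log text on stdin, for example `dmesg | check_oom_events`.

```bash
# Hypothetical helpers sketching the Section 2 rules.
# check_memory [meminfo-file]: reads /proc/meminfo-format text (values in kB)
check_memory() {
  local meminfo=${1:-/proc/meminfo}
  awk '
    /^MemTotal:/     { total = $2 }
    /^MemAvailable:/ { avail = $2 }
    END {
      if (total == 0) exit 1              # missing or malformed input
      pct = avail * 100 / total
      # only the more severe availability finding is reported (assumption)
      if (pct < 5)       print "OpenClaw.Memory.SystemMemoryCritical CRITICAL " sprintf("%.1f%%", pct)
      else if (pct < 10) print "OpenClaw.Memory.SystemMemoryLow WARNING " sprintf("%.1f%%", pct)
      # "2 GB" interpreted as 2 GiB = 2097152 kB (assumption)
      if (total < 2097152) print "OpenClaw.Memory.InsufficientTotalMemory ERROR " total " kB"
    }' "$meminfo"
}

# check_oom_events: reads kernel log text on stdin
check_oom_events() {
  if grep -qi 'oom-killer'; then
    echo "OpenClaw.Memory.OOMKillerEvent WARNING"
  fi
}
```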

---

3. CPU & Performance

Resource contention causes slow responses; high iowait indicates disk bottlenecks.

Judgment rules:
- Load average (1 min) > 2x `nproc` → OpenClaw.CPU.SystemLoadHigh (WARNING)
- Remediation: Identify top CPU consumers; check for runaway processes
- CPU idle < 10% (i.e., total utilization > 90%) → OpenClaw.CPU.SystemCPUExhausted (CRITICAL)
- Remediation: Identify the top process; check for log flooding or computation storms
- iowait > 30% (from `/proc/stat`) → OpenClaw.CPU.HighIOWait (WARNING)
- Remediation: Check disk I/O; the likely cause is excessive logging or a disk-bound workload
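
`/proc/stat` stores cumulative tick counters, so idle and iowait percentages are only meaningful as the difference between two samples. The bash sketch below assumes a one-second sampling interval and uses hypothetical helper names (`cpu_pct_from_samples`, `classify_cpu`, `check_cpu`). Its thresholds mirror the rules above.

```bash
# Hypothetical helpers sketching the Section 3 rules.
# cpu_pct_from_samples LINE1 LINE2: two "cpu ..." lines from /proc/stat taken
# some time apart; prints "idle_pct iowait_pct" for that interval
cpu_pct_from_samples() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    {
      # fields: cpu user nice system idle iowait irq softirq steal [guest guest_nice]
      # guest time is already included in user/nice, so only $2..$9 are summed
      t = 0
      for (i = 2; i <= 9; i++) t += $i
      tot[NR] = t; idle[NR] = $5; iow[NR] = $6
    }
    END {
      dt = tot[2] - tot[1]
      if (dt <= 0) exit 1
      printf "%.1f %.1f\n", (idle[2] - idle[1]) * 100 / dt, (iow[2] - iow[1]) * 100 / dt
    }'
}

# classify_cpu LOAD1 NCPU IDLE_PCT IOWAIT_PCT
classify_cpu() {
  awk -v load="$1" -v ncpu="$2" -v idle="$3" -v iow="$4" 'BEGIN {
    if (load + 0 > 2 * ncpu) print "OpenClaw.CPU.SystemLoadHigh WARNING"
    if (idle + 0 < 10)       print "OpenClaw.CPU.SystemCPUExhausted CRITICAL"
    if (iow + 0 > 30)        print "OpenClaw.CPU.HighIOWait WARNING"
  }'
}

# Live usage: sample /proc/stat one second apart
check_cpu() {
  local s1 s2 idle iow load
  s1=$(grep '^cpu ' /proc/stat); sleep 1; s2=$(grep '^cpu ' /proc/stat)
  read -r idle iow <<< "$(cpu_pct_from_samples "$s1" "$s2")"
  read -r load _ < /proc/loadavg
  classify_cpu "$load" "$(nproc)" "$idle" "$iow"
}
```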

---

4. Network Infrastructure

Basic network configuration, DNS, IPv6, and firewall state.

Judgment rules:
- IPv6 is enabled and services bind `::`, but upstream hosts resolve to IPv4 addresses only → OpenClaw.Network.IPv6Mismatch (WARNING)
- Remediation: Set `NODE_OPTIONS='--dns-result-order=ipv4first'` or run `sysctl -w net.ipv6.conf.all.disable_ipv6=1`

---

5. Disk & inotify

Disk space exhaustion and inotify limits cause "ENOSPC" errors.

Judgment rules:
- Any filesystem usage >= 95% → OpenClaw.Disk.FilesystemFull (CRITICAL)
- Remediation: Clean old logs and data; extend the partition or add a disk
- Any filesystem usage >= 80% → OpenClaw.Disk.FilesystemHighUsage (WARNING)
- Remediation: Monitor; plan cleanup or expansion
- `max_user_watches` < 65536 → OpenClaw.Disk.InotifyWatchesTooLow (ERROR)
- Remediation: `echo 'fs.inotify.max_user_watches=524288' >> /etc/sysctl.d/99-inotify.conf && sysctl -p /etc/sysctl.d/99-inotify.conf`
- `max_user_instances` < 256 → OpenClaw.Disk.InotifyInstancesTooLow (WARNING)
- Remediation: `echo 'fs.inotify.max_user_instances=512' >> /etc/sysctl.d/99-inotify.conf && sysctl -p /etc/sysctl.d/99-inotify.conf`
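
The disk rules come down to parsing `df -P` and reading two inotify sysctls. In this hypothetical bash sketch, `check_fs_usage` expects `df -P` output on stdin (for example `df -P -x tmpfs -x devtmpfs | check_fs_usage`). `check_inotify` takes an optional directory so that it can be pointed at fixture files.

```bash
# Hypothetical helpers sketching the Section 5 rules.
# check_fs_usage: reads `df -P` output on stdin (column 5 = Capacity, 6 = mount point)
check_fs_usage() {
  awk 'NR > 1 {
    use = $5; sub(/%$/, "", use); use += 0
    if (use >= 95)      print "OpenClaw.Disk.FilesystemFull CRITICAL " $6 " " use "%"
    else if (use >= 80) print "OpenClaw.Disk.FilesystemHighUsage WARNING " $6 " " use "%"
  }'
}

# check_inotify [dir]: compares the inotify sysctls with the thresholds above
check_inotify() {
  local dir=${1:-/proc/sys/fs/inotify} watches instances
  watches=$(cat "$dir/max_user_watches")
  instances=$(cat "$dir/max_user_instances")
  if [ "$watches" -lt 65536 ]; then
    echo "OpenClaw.Disk.InotifyWatchesTooLow ERROR ($watches)"
  fi
  if [ "$instances" -lt 256 ]; then
    echo "OpenClaw.Disk.InotifyInstancesTooLow WARNING ($instances)"
  fi
}
```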

---

6. File Descriptor & Process Limits

Low ulimits cause "too many open files" (EMFILE) errors under load.

Judgment rules:
- Shell `ulimit -n` < 4096 → OpenClaw.Limits.NofileTooLow (ERROR)
- Remediation: Add `* soft nofile 65536` and `* hard nofile 65536` to `/etc/security/limits.conf`; re-login
- `limits.conf` `nofile` value > `fs.nr_open` → OpenClaw.Limits.NofileExceedsKernelMax (CRITICAL)
- Remediation: Increase `fs.nr_open` first with `sysctl -w fs.nr_open=1048576`, and persist it in `/etc/sysctl.d/`
- `file-nr` allocated / max > 80% → OpenClaw.Limits.SystemFileDescriptorsHigh (WARNING)
- Remediation: Identify processes holding many FDs (e.g., `for p in /proc/[0-9]*; do echo "$(ls "$p/fd" 2>/dev/null | wc -l) ${p#/proc/}"; done | sort -rn | head`); increase `fs.file-max` if needed
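
Here is a possible bash sketch of the limit checks. It assumes the `limits.conf` comparison uses the largest numeric `nofile` value in the file, and it ignores drop-in files under `/etc/security/limits.d/`. `max_limits_nofile` and `check_fd_limits` are illustrative names. `/proc/sys/fs/file-nr` contains three fields: allocated handles, free handles, and the maximum.

```bash
# Hypothetical helpers sketching the Section 6 rules.
# max_limits_nofile [file]: largest numeric nofile value in a limits.conf-style
# file ("unlimited" entries and comment lines are skipped)
max_limits_nofile() {
  awk '$1 !~ /^#/ && $3 == "nofile" && $4 ~ /^[0-9]+$/ {
         if ($4 + 0 > max + 0) max = $4
       }
       END { print (max == "" ? 0 : max) }' "${1:-/etc/security/limits.conf}"
}

check_fd_limits() {
  local soft nr_open conf alloc free max
  soft=$(ulimit -n)
  nr_open=$(cat /proc/sys/fs/nr_open)
  conf=$(max_limits_nofile "$1")
  read -r alloc free max < /proc/sys/fs/file-nr   # allocated, free, max
  if [ "$soft" != unlimited ] && [ "$soft" -lt 4096 ]; then
    echo "OpenClaw.Limits.NofileTooLow ERROR ($soft)"
  fi
  if [ "$conf" -gt "$nr_open" ]; then
    echo "OpenClaw.Limits.NofileExceedsKernelMax CRITICAL ($conf > $nr_open)"
  fi
  if [ $((alloc * 100 / max)) -gt 80 ]; then
    echo "OpenClaw.Limits.SystemFileDescriptorsHigh WARNING ($alloc/$max)"
  fi
}
```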

---

7. Kernel & Sysctl Tuning

nf_conntrack, TCP tuning, and somaxconn settings affect high-concurrency workloads.

Judgment rules:
- `nf_conntrack_max` < 65536 → OpenClaw.Kernel.NfConntrackMaxTooLow (ERROR)
- Remediation: `sysctl -w net.netfilter.nf_conntrack_max=262144` and persist in `/etc/sysctl.d/99-sysctl.conf`
- `dmesg` contains "nf_conntrack: table full" → OpenClaw.Kernel.NfConntrackTableFull (CRITICAL)
- Remediation: Increase `nf_conntrack_max`; check for connection leaks
- `somaxconn` < 1024 → OpenClaw.Kernel.SomaxconnTooLow (WARNING)
- Remediation: `sysctl -w net.core.somaxconn=4096` and persist
- `tcp_max_tw_buckets` < 10000 → OpenClaw.Kernel.TcpMaxTwBucketsTooLow (WARNING)
- Remediation: `sysctl -w net.ipv4.tcp_max_tw_buckets=262144`
- `tcp_tw_reuse = 0` → OpenClaw.Kernel.TcpTwReuseNotEnabled (WARNING)
- Remediation: `sysctl -w net.ipv4.tcp_tw_reuse=1`
- TIME_WAIT count from `ss -s` > 10000 → OpenClaw.Kernel.TimeWaitOverflow (WARNING)
- Remediation: Enable `tcp_tw_reuse` and increase `tcp_max_tw_buckets`. Lowering `tcp_fin_timeout` is often suggested here, but that setting limits the FIN-WAIT-2 state, not TIME_WAIT.
- ListenOverflows > 0 in `/proc/net/netstat` → OpenClaw.Kernel.TcpListenOverflows (WARNING)
- Remediation: Increase `somaxconn` and the application's listen backlog setting
- `vm.overcommit_memory = 2` and swap < 1 GB → OpenClaw.Kernel.StrictOvercommitWithLowSwap (WARNING)
- Remediation: Add swap space or set `vm.overcommit_memory=0`
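
Most Section 7 rules compare a sysctl against a minimum, so a small table-driven helper can cover them. This bash sketch reads values from files under `/proc/sys` instead of calling `sysctl`. That lets the `PROC_SYS` root (an assumption introduced here) be redirected to fixtures. `listen_overflows` extracts the counter from the two-line `TcpExt:` block in `/proc/net/netstat`. The TIME_WAIT, conntrack-table-full, and overcommit checks are omitted, and all helper names are hypothetical.

```bash
# Hypothetical helpers sketching part of Section 7. PROC_SYS is an overridable
# root (an assumption) so the checks can run against fixture files.
PROC_SYS=${PROC_SYS:-/proc/sys}

# sysctl_min KEY MIN ISSUE SEVERITY: report ISSUE when KEY's value is below MIN
sysctl_min() {
  local path value
  path="$PROC_SYS/$(printf '%s' "$1" | tr . /)"
  [ -r "$path" ] || return 0      # e.g. nf_conntrack module not loaded
  value=$(cat "$path")
  if [ "$value" -lt "$2" ]; then
    echo "$3 $4 ($1=$value)"
  fi
}

check_kernel_tuning() {
  sysctl_min net.netfilter.nf_conntrack_max 65536 OpenClaw.Kernel.NfConntrackMaxTooLow ERROR
  sysctl_min net.core.somaxconn 1024 OpenClaw.Kernel.SomaxconnTooLow WARNING
  sysctl_min net.ipv4.tcp_max_tw_buckets 10000 OpenClaw.Kernel.TcpMaxTwBucketsTooLow WARNING
  sysctl_min net.ipv4.tcp_tw_reuse 1 OpenClaw.Kernel.TcpTwReuseNotEnabled WARNING
}

# listen_overflows [file]: the TcpExt section is a header line of counter
# names followed by a line of values
listen_overflows() {
  awk '$1 == "TcpExt:" {
         if (!seen) { for (i = 2; i <= NF; i++) if ($i == "ListenOverflows") col = i; seen = 1 }
         else { print (col ? $col : 0); exit }
       }' "${1:-/proc/net/netstat}"
}
```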

---

8. DNS Resolution Health

Broken or slow DNS causes EAI_AGAIN errors, API timeouts, and silent connectivity failures.

Judgment rules:
- `/etc/resolv.conf` is empty or has zero `nameserver` lines → OpenClaw.Network.NoDNSNameservers (ERROR)
- Remediation: Add nameservers, e.g., `echo 'nameserver 8.8.8.8' >> /etc/resolv.conf`; for systemd-resolved, check `/etc/systemd/resolved.conf`
- `nslookup`, `dig`, and `getent` all fail for a known-good domain → OpenClaw.Network.DNSResolutionFailed (CRITICAL)
- Remediation: Verify network connectivity; check whether the nameservers are reachable; inspect firewall rules blocking UDP/TCP port 53
- Any configured nameserver fails the TCP/53 reachability test → OpenClaw.Network.DNSNameserverUnreachable (WARNING)
- Remediation: Replace the unreachable nameserver in `/etc/resolv.conf`; consider adding a backup nameserver
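
Here is a hedged bash sketch of the DNS checks. It simplifies the resolution rule to `getent hosts github.com` alone instead of comparing `nslookup`, `dig`, and `getent`. It probes TCP/53 with bash's `/dev/tcp` redirection, bounded by `timeout`. `count_nameservers` and `check_dns` are hypothetical names.

```bash
# Hypothetical helpers sketching the Section 8 rules (getent-only simplification).
count_nameservers() {
  awk '$1 == "nameserver" { n++ } END { print n + 0 }' "${1:-/etc/resolv.conf}"
}

check_dns() {
  local conf=${1:-/etc/resolv.conf} ns
  if [ "$(count_nameservers "$conf")" -eq 0 ]; then
    echo "OpenClaw.Network.NoDNSNameservers ERROR"
    return 0
  fi
  if ! getent hosts github.com >/dev/null; then
    echo "OpenClaw.Network.DNSResolutionFailed CRITICAL"
  fi
  for ns in $(awk '$1 == "nameserver" { print $2 }' "$conf"); do
    # TCP/53 probe via bash's /dev/tcp redirection, bounded by timeout(1)
    if ! timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/53"' _ "$ns" 2>/dev/null; then
      echo "OpenClaw.Network.DNSNameserverUnreachable WARNING ($ns)"
    fi
  done
}
```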

---

9. Time Synchronization

Clock drift causes SSL/TLS certificate validation failures, API auth token rejection, and inconsistent log timestamps.

Judgment rules:
- None of `chronyd`, `ntpd`, or `systemd-timesyncd` is active → OpenClaw.Time.NTPServiceNotRunning (ERROR)
- Remediation: Install and enable a time sync service: `yum install chrony && systemctl enable --now chronyd` (RHEL/CentOS) or `apt install chrony && systemctl enable --now chronyd` (Debian/Ubuntu)
- `timedatectl` shows "NTP synchronized: no" (newer systemd versions label this line "System clock synchronized: no") → OpenClaw.Time.ClockNotSynchronized (CRITICAL)
- Remediation: Start the NTP service; verify NTP server reachability (`chronyc sources` or `ntpq -p`); check that the firewall allows UDP port 123
- `chronyc tracking` shows a system clock offset > 3 seconds, or hwclock drift > 5 seconds from system time → OpenClaw.Time.ClockDriftDetected (WARNING)
- Remediation: Force a sync with `chronyc makestep` or `ntpdate -u pool.ntp.org`; investigate why the drift occurred (suspended VM, unreachable NTP server)
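
The time checks could be sketched in bash as shown below. Several parts are assumptions: the list of unit names, the use of `timedatectl show -p NTPSynchronized --value` (only available in newer systemd releases), and the helper names. `chrony_offset_check` parses the `System time` line of `chronyc tracking` output from stdin. The hwclock comparison is not shown.

```bash
# Hypothetical helpers sketching the Section 9 rules.
check_time_services() {
  local svc
  # Unit names differ by distro (e.g. chronyd on RHEL, chrony on Debian)
  for svc in chronyd chrony ntpd ntp systemd-timesyncd; do
    if systemctl is-active --quiet "$svc" 2>/dev/null; then
      return 0
    fi
  done
  echo "OpenClaw.Time.NTPServiceNotRunning ERROR"
}

check_clock_synced() {
  # `timedatectl show` exists only in newer systemd releases; older ones
  # print the human-readable "NTP synchronized: no" line instead
  if [ "$(timedatectl show -p NTPSynchronized --value 2>/dev/null)" = "no" ]; then
    echo "OpenClaw.Time.ClockNotSynchronized CRITICAL"
  fi
}

# chrony_offset_check [limit_seconds]: reads `chronyc tracking` output on stdin
chrony_offset_check() {
  awk -v lim="${1:-3}" '/^System time/ {
    off = $4 + 0   # "System time : 0.000012 seconds fast of NTP time"
    if (off > lim + 0) print "OpenClaw.Time.ClockDriftDetected WARNING (" off "s)"
  }'
}
```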

---

10. Zombie & D-State Processes

Zombie processes indicate child-process leaks; D-state (uninterruptible sleep) processes signal I/O hangs that block system operations.

Judgment rules:
- Zombie count > 10 → OpenClaw.Process.ZombieProcessesHigh (WARNING)
- Remediation: Identify parent processes (`ps -eo pid,ppid,stat,comm | awk '$3~/Z/'`); the parent is not reaping its children, so restart or fix the parent process
- D-state process count > 0 → OpenClaw.Process.DStateProcessesFound (CRITICAL)
- Remediation: D-state processes are blocked on I/O; check disk health (`dmesg | grep -i error`), NFS mounts (`mount -t nfs`), and the storage subsystem. These processes cannot be killed normally.
- Total process count > 80% of `kernel.pid_max` → OpenClaw.Process.TotalProcessCountHigh (WARNING)
- Remediation: Identify process-spawning storms (`ps -eo user --sort=user | uniq -c | sort -rn | head`); increase `kernel.pid_max` if needed
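
Here is a compact bash sketch of the process rules. It assumes state letters come from `ps -eo stat=`, for example `ps -eo stat= | classify_proc_states`. Like the rule text, it counts processes rather than threads. Both helper names are illustrative.

```bash
# Hypothetical helpers sketching the Section 10 rules.
# classify_proc_states: reads `ps -eo stat=` output on stdin
classify_proc_states() {
  awk '
    $1 ~ /^Z/ { z++ }
    $1 ~ /^D/ { d++ }
    END {
      if (z > 10) print "OpenClaw.Process.ZombieProcessesHigh WARNING (" z ")"
      if (d > 0)  print "OpenClaw.Process.DStateProcessesFound CRITICAL (" d ")"
    }'
}

check_process_count() {
  local total pid_max
  total=$(ps -e --no-headers | wc -l)
  pid_max=$(cat /proc/sys/kernel/pid_max)
  if [ $((total * 100 / pid_max)) -gt 80 ]; then
    echo "OpenClaw.Process.TotalProcessCountHigh WARNING ($total/$pid_max)"
  fi
}
```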

---

11. Systemd Journal & Log Disk Usage

The systemd journal can grow to several gigabytes on long-running servers and silently consume disk space, which makes it a common hidden root cause of "disk full" events.

Judgment rules:
- Journal disk usage > 2 GB → OpenClaw.Logs.JournalDiskUsageHigh (WARNING)
- Remediation: `journalctl --vacuum-size=500M`; set `SystemMaxUse=500M` in `/etc/systemd/journald.conf` and restart `systemd-journald`
- `/var/log` total size > 5 GB → OpenClaw.Logs.VarLogOversized (WARNING)
- Remediation: Identify large files (`find /var/log -type f -size +100M`); configure logrotate; clean old rotated logs
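
`du` can implement the size checks without opening any log file, because it reads only metadata. This hypothetical bash helper assumes the persistent journal lives in `<log_dir>/journal`. A volatile journal under `/run/log/journal` would be missed; `journalctl --disk-usage` reports both. The helper treats the 2 GB and 5 GB thresholds as KiB values, which tests can override.

```bash
# Hypothetical helper sketching the Section 11 rules; du(1) reads only metadata.
# check_log_usage [log_dir] [journal_max_kb] [varlog_max_kb]
check_log_usage() {
  local dir=${1:-/var/log}
  local jmax=${2:-2097152} vmax=${3:-5242880}   # 2 GiB and 5 GiB in KiB (assumed units)
  local jkb vkb
  jkb=$(du -sk "$dir/journal" 2>/dev/null | awk '{ print $1 }')
  vkb=$(du -sk "$dir" 2>/dev/null | awk '{ print $1 }')
  if [ "${jkb:-0}" -gt "$jmax" ]; then
    echo "OpenClaw.Logs.JournalDiskUsageHigh WARNING (${jkb} KiB)"
  fi
  if [ "${vkb:-0}" -gt "$vmax" ]; then
    echo "OpenClaw.Logs.VarLogOversized WARNING (${vkb} KiB)"
  fi
}
```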

---

12. Filesystem Integrity

A read-only filesystem (caused by ext4/xfs journal errors) prevents writing session data, logs, and PID files. Inode exhaustion produces "No space left on device" even when free disk space remains.

Judgment rules:
- Any non-virtual mount has the `ro` flag, or the `/tmp` write test fails → OpenClaw.Disk.ReadOnlyFilesystem (CRITICAL)
- Remediation: Check `dmesg` for filesystem errors; run `fsck` on the affected partition (requires unmount or single-user mode); may indicate disk hardware failure
- Any real filesystem inode usage >= 80% → OpenClaw.Disk.InodeUsageHigh (WARNING)
- Remediation: Find directories with many small files (`find / -xdev -printf '%h\n' | sort | uniq -c | sort -rn | head -10`); clean up session/temp files
- `dmesg` contains EXT4-fs error, XFS error, or read-only remount messages → OpenClaw.Disk.FilesystemErrorsDetected (CRITICAL)
- Remediation: Back up data immediately; run `fsck` at the next maintenance window; check disk SMART status (`smartctl -a /dev/sdX`)
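
Here is a bash sketch of the integrity checks. It assumes that only ext2/3/4, xfs, and btrfs count as non-virtual filesystems, and that inode figures arrive from `df -Pi` on stdin. `readonly_mounts`, `tmp_write_test`, and `inode_usage_high` are hypothetical names. `tmp_write_test` mirrors the `/tmp/.oc_write_test` probe.

```bash
# Hypothetical helpers sketching the Section 12 rules. Treating only
# ext2/3/4, xfs and btrfs as "real" filesystems is an assumption.
# readonly_mounts [mounts_file]: reads /proc/mounts-format lines
readonly_mounts() {
  awk '$3 ~ /^(ext[234]|xfs|btrfs)$/ && ("," $4 ",") ~ /,ro,/ {
         print "OpenClaw.Disk.ReadOnlyFilesystem CRITICAL " $2 " (" $3 ")"
       }' "${1:-/proc/mounts}"
}

# tmp_write_test: create and immediately remove a probe file in /tmp
tmp_write_test() {
  local f=/tmp/.oc_write_test
  if touch "$f" 2>/dev/null; then
    rm -f "$f"
  else
    echo "OpenClaw.Disk.ReadOnlyFilesystem CRITICAL (/tmp not writable)"
  fi
}

# inode_usage_high: reads `df -Pi` output on stdin; filesystems that do not
# report inode counts (IUse% shown as "-") are skipped
inode_usage_high() {
  awk 'NR > 1 && $5 ~ /%$/ {
         use = $5; sub(/%$/, "", use)
         if (use + 0 >= 80) print "OpenClaw.Disk.InodeUsageHigh WARNING " $6 " " $5
       }'
}
```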

---

13. Firewall & Outbound Connectivity

Firewall rules blocking inbound or outbound traffic are a common cause of "port not reachable" and "API connection refused" errors in self-hosted deployments.

Judgment rules:
- DROP or REJECT rules detected on INPUT or OUTPUT chains → OpenClaw.Network.FirewallDropRulesDetected (WARNING)
- Remediation: Review the rules and ensure required ports (gateway port, outbound 443) are allowed; use `iptables -L -n -v` for detailed hit counts
<!-- - Outbound TCP connect to external port 443 fails → OpenClaw.Network.OutboundHTTPSBlocked (ERROR)
- Remediation: Check OUTPUT chain rules; verify the cloud security group allows outbound 443; check whether a proxy is required (`HTTP_PROXY`/`HTTPS_PROXY`) -->
- `ufw status` shows default deny incoming (informational only) → OpenClaw.Network.UFWDefaultDeny (INFO)
- Remediation: No action required if intentional; ensure the gateway port is explicitly allowed (`ufw allow <port>/tcp`)
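
Here is a possible bash sketch of the firewall rules. It assumes rules are read from `iptables -S` output (for example `iptables -S | firewall_drop_rules`), and it counts only appended rules, not chain default policies. The `Default:` line that `ufw_default_deny` looks for is printed by `ufw status verbose`. Both helper names are illustrative.

```bash
# Hypothetical helpers sketching the Section 13 rules.
# firewall_drop_rules: reads `iptables -S` output on stdin; counts -A rules on
# INPUT/OUTPUT that jump to DROP or REJECT (-P policy lines are not counted)
firewall_drop_rules() {
  awk '$1 == "-A" && ($2 == "INPUT" || $2 == "OUTPUT") && ($0 " ") ~ / -j (DROP|REJECT) / { n++ }
       END { if (n) print "OpenClaw.Network.FirewallDropRulesDetected WARNING (" n " rules)" }'
}

# ufw_default_deny: reads `ufw status verbose` output on stdin
ufw_default_deny() {
  if grep -q '^Default:.*deny (incoming)'; then
    echo "OpenClaw.Network.UFWDefaultDeny INFO"
  fi
}
```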

---

14. Transparent Hugepages

THP causes latency spikes and memory fragmentation for Node.js workloads. Multiple database and runtime vendors recommend disabling it on servers.

Judgment rules:
- THP `enabled` is set to `[always]` → OpenClaw.Kernel.THPEnabled (WARNING)
- Remediation: `echo never > /sys/kernel/mm/transparent_hugepage/enabled`; persist via a systemd unit or `/etc/rc.local`
- THP `defrag` is set to `[always]` → OpenClaw.Kernel.THPDefragEnabled (INFO)
- Remediation: `echo never > /sys/kernel/mm/transparent_hugepage/defrag`; this reduces latency spikes from memory compaction
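
Each THP file lists the available modes and shows the active one in brackets (for example `always [madvise] never`), so matching the bracketed word is enough. Here is a hypothetical bash helper that takes the sysfs directory as an optional argument for testing:

```bash
# Hypothetical helper sketching the Section 14 rules.
# check_thp [dir]: the active mode is the bracketed word in each file
check_thp() {
  local dir=${1:-/sys/kernel/mm/transparent_hugepage}
  if grep -q '\[always\]' "$dir/enabled" 2>/dev/null; then
    echo "OpenClaw.Kernel.THPEnabled WARNING"
  fi
  if grep -q '\[always\]' "$dir/defrag" 2>/dev/null; then
    echo "OpenClaw.Kernel.THPDefragEnabled INFO"
  fi
}
```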

---

15. TCP Connection Overload

Excessive network connections exhaust file descriptors, memory, and conntrack table capacity, degrading system-wide performance.

Judgment rules:
- Total TCP connections > 10000 → OpenClaw.Network.TcpConnectionCountHigh (WARNING)
- Remediation: Identify top connection-holding processes; check for connection leaks; consider connection pooling
- CLOSE_WAIT count > 500 → OpenClaw.Network.CloseWaitAccumulation (ERROR)
- Remediation: CLOSE_WAIT indicates the local application is not calling `close()` on its sockets; identify the leaking process and restart it (this is an application bug)
- ESTABLISHED count > 5000 → OpenClaw.Network.EstablishedConnectionsHigh (WARNING)
- Remediation: Review whether all connections are legitimate; check for connection pool exhaustion or slow clients holding connections open
- Ephemeral ports in use > 80% of the available range → OpenClaw.Network.EphemeralPortExhaustion (CRITICAL)
- Remediation: Widen the range with `sysctl -w net.ipv4.ip_local_port_range='1024 65535'`; enable `tcp_tw_reuse`; check for connection leaks
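
The connection rules can be sketched against `ss -tan` output with the header removed (e.g. `ss -tan | tail -n +2`). This bash sketch makes two assumptions. First, the total includes every socket state. Second, "ports in use" means distinct local ports of non-listening sockets that fall inside `ip_local_port_range`; you can read that range with `read -r lo hi < /proc/sys/net/ipv4/ip_local_port_range`. The helper names are hypothetical.

```bash
# Hypothetical helpers sketching the Section 15 rules. Input: `ss -tan` data
# lines; column 1 is the state, column 4 the local address:port.
classify_tcp_states() {
  awk '{ total++; state[$1]++ }
       END {
         if (total > 10000)             print "OpenClaw.Network.TcpConnectionCountHigh WARNING (" total ")"
         if (state["CLOSE-WAIT"] > 500) print "OpenClaw.Network.CloseWaitAccumulation ERROR (" state["CLOSE-WAIT"] ")"
         if (state["ESTAB"] > 5000)     print "OpenClaw.Network.EstablishedConnectionsHigh WARNING (" state["ESTAB"] ")"
       }'
}

# ephemeral_usage_pct LOW HIGH: percentage of the local port range in use,
# counting distinct local ports of non-listening sockets
ephemeral_usage_pct() {
  awk -v lo="$1" -v hi="$2" '$1 != "LISTEN" {
         port = $4; sub(/.*:/, "", port); port += 0
         if (port >= lo + 0 && port <= hi + 0) used[port] = 1
       }
       END { n = 0; for (p in used) n++; printf "%d\n", n * 100 / (hi - lo + 1) }'
}
```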

---

16. Headless Browser / Chromium Dependencies

OpenClaw skills that use browser automation (Playwright, Puppeteer) require Chromium shared libraries and headless mode. The diagnostic first tests whether Chrome can actually launch in headless mode; dependency diagnosis runs only when Chrome fails to launch or is absent.

Judgment rules:
- Chrome headless launch test (`--headless=new --dump-dom about:blank`) succeeds → no issue; skip the dependency checks
- Chrome headless launch test fails → proceed with the dependency diagnosis below:
- Any of the 7 critical shared library stems (libnss3, libatk-bridge-2.0, libgbm, libxkbcommon, libdrm, libgtk-3, libasound) is absent from `ldconfig -p` → OpenClaw.Browser.ChromiumDependenciesMissing (ERROR)
- Remediation: On Debian/Ubuntu: `apt install -y libnss3 libatk-bridge2.0-0 libgbm1 libxkbcommon0 libdrm2 libgtk-3-0 libasound2`; on RHEL/CentOS: `yum install -y nss atk at-spi2-atk mesa-libgbm libxkbcommon libdrm gtk3 alsa-lib`
- `ldd` on the Chromium binary shows one or more "not found" entries → OpenClaw.Browser.ChromiumBinaryLddFailures (CRITICAL)
- Remediation: Install the specific missing libraries identified by `ldd`; run `ldconfig` after installation to update the dynamic linker cache
- `/proc/sys/kernel/unprivileged_userns_clone` is `0` → OpenClaw.Browser.UserNamespaceDisabled (ERROR)
- Remediation: `sysctl -w kernel.unprivileged_userns_clone=1` and persist in `/etc/sysctl.d/99-userns.conf`; or configure Chromium with `--no-sandbox` (less secure, not recommended for production)
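
Here is a hedged bash sketch of the dependency checks that run after a failed launch test; the launch test itself is not shown. It assumes `ldconfig -p` and `ldd` output arrive on stdin. A caller would emit ChromiumDependenciesMissing whenever `missing_chromium_libs` prints anything. The helper names are illustrative. `kernel.unprivileged_userns_clone` exists only on some distribution kernels (for example Debian's).

```bash
# Hypothetical helpers sketching the Section 16 dependency checks.
# missing_chromium_libs: reads `ldconfig -p` output on stdin and prints each
# library stem that has no matching cache entry
missing_chromium_libs() {
  local cache stem
  cache=$(cat)
  for stem in libnss3 libatk-bridge-2.0 libgbm libxkbcommon libdrm libgtk-3 libasound; do
    if ! printf '%s\n' "$cache" | grep -qF "$stem.so"; then
      echo "$stem"
    fi
  done
}

# ldd_not_found: reads `ldd <chromium-binary>` output on stdin and prints the
# unresolved library names
ldd_not_found() {
  awk '/=> not found/ { print $1 }'
}

check_userns() {
  # This sysctl only exists on some distribution kernels (e.g. Debian's)
  local f=/proc/sys/kernel/unprivileged_userns_clone
  if [ -r "$f" ] && [ "$(cat "$f")" = 0 ]; then
    echo "OpenClaw.Browser.UserNamespaceDisabled ERROR"
  fi
}
```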

---

17. Locale & Encoding Configuration

Missing or misconfigured locale settings cause garbled text output, incorrect sorting in logs, and subtle bugs such as backspace deleting two characters over SSH (when the client sends UTF-8 but the server expects ASCII). OpenClaw's text processing relies on correct UTF-8 support.

Judgment rules (use the persistent LANG value read from `/etc/default/locale` or `/etc/locale.conf`, not the runtime `$LANG`, which the diagnostic runner may override to `C`):
- Persistent `LANG` is empty, unset, or set to `POSIX`/`C` → OpenClaw.Locale.LocaleNotConfigured (ERROR)
- Remediation: On Debian/Ubuntu: `apt install locales && dpkg-reconfigure locales`, then set `LANG=en_US.UTF-8` in `/etc/default/locale`; on RHEL/CentOS: `localectl set-locale LANG=en_US.UTF-8`
- The persistent `LANG` value does not appear in `locale -a` output (configured but not generated/installed) → OpenClaw.Locale.LocaleNotGenerated (WARNING). Compare normalized names, because `locale -a` lists `en_US.UTF-8` as `en_US.utf8`.
- Remediation: On Debian/Ubuntu: uncomment the locale in `/etc/locale.gen` and run `locale-gen`; on RHEL/CentOS: `localedef -i en_US -f UTF-8 en_US.UTF-8`
- Persistent `LANG` does not contain `UTF-8` or `utf8` → OpenClaw.Locale.NonUTF8LocaleDetected (WARNING)
- Remediation: Change to a UTF-8 variant: `localectl set-locale LANG=en_US.UTF-8`; re-login for the change to take effect
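
Here is a bash sketch of the locale rules under two assumptions. First, the persistent value is the last `LANG=` line of the first readable file. Second, locale names are compared after lowercasing and removing hyphens. The second assumption approximates glibc's codeset normalization, under which `locale -a` lists `en_US.UTF-8` as `en_US.utf8`. All helper names are hypothetical. `classify_locale` reads `locale -a` output on stdin, for example `locale -a | classify_locale "$(persistent_lang)"`.

```bash
# Hypothetical helpers sketching the Section 17 rules.
# persistent_lang [FILE...]: LANG from the first readable file, quotes stripped
persistent_lang() {
  local f
  [ "$#" -gt 0 ] || set -- /etc/default/locale /etc/locale.conf
  for f in "$@"; do
    [ -r "$f" ] || continue
    sed -n 's/^[[:space:]]*LANG=//p' "$f" | tr -d "\"'" | tail -n 1
    return 0
  done
}

# normalize_locale NAME: lowercase and drop hyphens (approximates glibc's
# codeset normalization, e.g. en_US.UTF-8 -> en_us.utf8)
normalize_locale() {
  printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | sed 's/-//g'
}

# classify_locale LANG: reads `locale -a` output on stdin
classify_locale() {
  local lang=$1 want
  case "$lang" in
    ""|C|POSIX) echo "OpenClaw.Locale.LocaleNotConfigured ERROR"; return 0 ;;
  esac
  want=$(normalize_locale "$lang")
  if ! normalize_locale "$(cat)" | grep -qxF "$want"; then
    echo "OpenClaw.Locale.LocaleNotGenerated WARNING ($lang)"
  fi
  case "$want" in
    *utf8*) ;;
    *) echo "OpenClaw.Locale.NonUTF8LocaleDetected WARNING ($lang)" ;;
  esac
}
```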
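
---

- Make sure OpenClaw is installed (local or Docker deployment).
- Enter the install command in the chat box: `/install linux-system-health`
- After installation, call the skill by name or trigger it with `/linux-system-health`.
- Provide the inputs described in the skill's parameter notes to get structured output.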
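
What is linux-system-health?

Diagnose Linux OS-level issues: slow server, OOM kills, disk full, high CPU/load, DNS failures, connection timeouts, port exhaustion, too many open files, z... It is an AI Agent Skill plugin for Claude Code / OpenClaw and has been downloaded 241 times so far.

How do I install linux-system-health?

Run the command `/install linux-system-health` in the OpenClaw or Claude Code chat box for a one-step install; no additional configuration is required.

Is linux-system-health free?

Yes. linux-system-health is completely free and released under the MIT-0 license, so it can be freely downloaded, installed, and used.

Which platforms does linux-system-health support?

linux-system-health runs cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed (linux).

Who developed linux-system-health?

It is developed and maintained by zjxylc (@zjxylc). The current version is v1.2.1.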