sdk-team

Alibabacloud Ecs Reboot Or Crash Diagnosis

by alibabacloud-skills-team · GitHub ↗ · v0.0.1 · MIT-0
cross-platform ⚠ suspicious
44 downloads · 0 stars · 0 active installs · 1 version
Install in OpenClaw
/install alibabacloud-ecs-reboot-or-crash-diagnosis
Description
Diagnose ECS instance reboot or crash issues. First checks for abnormal maintenance events, then uses Cloud Assistant to check for internal restarts or kerne...
README (SKILL.md)

ECS Instance Reboot/Crash Diagnosis

Diagnose the root cause of an unexpected ECS instance reboot or crash. The workflow is fixed: check platform maintenance events first, then check the instance's internal system logs. Both Linux and Windows instances are supported.

Required Parameters

Before starting diagnosis, obtain the following parameters from the user:

| Parameter | Description | Example |
| --- | --- | --- |
| INSTANCE_ID | ECS instance ID | i-bp1a2b3c4d5e6f7g8h9j |
| REGION_ID | Region ID | cn-hangzhou |

If the user has not provided either of these parameters, ask for the missing value first. Do not start diagnosis until both are known.
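
The parameter gate above can be sketched as a small shell check. This is a minimal sketch and not part of the skill: the function name validate_params is hypothetical, and the ID shapes it accepts are assumptions inferred only from the two examples in the table, not official format rules.

```shell
# Hypothetical helper: refuse to start diagnosis until both parameters look usable.
# The accepted shapes are assumptions based on the examples above.
validate_params() {
  instance_id=$1
  region_id=$2

  if [ -z "$instance_id" ] || [ -z "$region_id" ]; then
    echo "Missing INSTANCE_ID or REGION_ID: ask the user before continuing." >&2
    return 1
  fi

  # Assumed shape: "i-" followed by lowercase letters and digits.
  case $instance_id in
    i-*) ;;
    *) echo "INSTANCE_ID should start with 'i-'." >&2; return 1 ;;
  esac
  id_rest=${instance_id#i-}
  case $id_rest in
    ''|*[!a-z0-9]*) echo "INSTANCE_ID has an unexpected format." >&2; return 1 ;;
  esac

  # Assumed shape: lowercase words joined by hyphens, for example cn-hangzhou.
  case $region_id in
    *[!a-z0-9-]*|-*|*-) echo "REGION_ID has an unexpected format." >&2; return 1 ;;
    *-*) ;;
    *) echo "REGION_ID has an unexpected format." >&2; return 1 ;;
  esac
  return 0
}
```

A caller would run `validate_params "$INSTANCE_ID" "$REGION_ID"` and stop to ask the user whenever it returns non-zero.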

Mandatory Execution Rules

  1. Must obtain parameters first — Instance ID and Region ID are required. Must ask user if missing.
  2. Standard workflow cannot be skipped — Must execute in order: Maintenance Event Check → OSType Detection → System Log Check
  3. Must check Cloud Assistant status before diagnostics — Before executing Step 3A/3B, must verify Cloud Assistant is running via DescribeCloudAssistantStatus. If not running, provide alternative diagnostic approaches.
  4. All diagnostic conclusions must be based on actual data — No fabrication, speculation, or assumptions
  5. Output format must be strictly followed — After diagnosis, must read the complete template in references/output-format.md, output strictly according to template structure. No free-form output, no omitted sections, no changed hierarchy. Every placeholder {...} in the template must be filled with actual data.

Prerequisites

CLI Tools

  • aliyun-cli 3.3.3+ (required) — For calling Alibaba Cloud API
  • Installation & configuration: see CLI Installation Guide

AI-Mode Configuration (Required)

Before using aliyun CLI commands, must configure AI-Mode:

# Enable AI-Mode
aliyun configure ai-mode enable

# Set user-agent for skill identification
aliyun configure ai-mode set-user-agent --user-agent "AlibabaCloud-Agent-Skills/alibabacloud-ecs-reboot-or-crash-diagnosis"

# Update plugins
aliyun plugin update

After diagnosis is complete, disable AI-Mode:

aliyun configure ai-mode disable
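
Because AI-Mode has to be switched off again even when a diagnostic step fails, the enable/disable pair can be wrapped around the work. The sketch below uses only the two `aliyun configure ai-mode` commands shown above; the wrapper name with_ai_mode is hypothetical, and it assumes the user-agent and plugin update steps have already been run.

```shell
# Hypothetical wrapper: enable AI-Mode, run one command, and always disable
# AI-Mode afterwards, preserving the exit status of the wrapped command.
with_ai_mode() {
  aliyun configure ai-mode enable || return 1
  "$@"
  rc=$?
  aliyun configure ai-mode disable
  return "$rc"
}

# Usage (not executed here):
#   with_ai_mode aliyun configure list
```

The wrapped command's exit status is returned unchanged, so a failed diagnostic call is still visible to the caller after AI-Mode has been disabled.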

Alibaba Cloud Credentials

Credentials must be configured outside the agent session. The agent only verifies that a profile exists:

aliyun configure list

Instance Requirements

  • Cloud Assistant client must be installed and running on the instance
  • Instance status must be Running
  • Note: If Cloud Assistant is not running, diagnostic commands cannot be executed remotely. Must provide manual diagnostic steps to user.

Required RAM Permissions

See RAM Policies for the complete permission list and custom policy example.


Step 1: Confirm Instance Information (Cannot Skip)

Verify instance exists and get basic information:

aliyun ecs describe-instances \
  --biz-region-id <REGION_ID> \
  --region <REGION_ID> \
  --instance-ids '["<INSTANCE_ID>"]'

Confirm from returned JSON:

  • RegionId — Region ID (matches user provided)
  • Status — Instance status (Running/Stopped)
  • InstanceName — Instance name
  • OSType — Operating system type (windows / linux)

Record OSType for Step 3 branch selection.
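
The branch selection that Step 1 feeds into can be written down as a lookup. This is a minimal sketch: the function name step3_branch and the labels it prints are hypothetical, and treating any status other than Running as a case for manual diagnosis is an assumption drawn from the Instance Requirements section.

```shell
# Hypothetical helper: map OSType and Status from Step 1 to a Step 3 branch.
step3_branch() {
  os_type=$1
  status=$2

  # Remote diagnostics need a running instance (see Instance Requirements).
  if [ "$status" != "Running" ]; then
    echo "manual"
    return 0
  fi

  case $os_type in
    linux)   echo "3A" ;;
    windows) echo "3B" ;;
    *)       echo "unknown" ;;
  esac
}
```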


Step 2: Check ECS Maintenance Events

Query instance historical system events to determine if platform maintenance caused reboot:

aliyun ecs describe-instance-history-events \
  --biz-region-id <REGION_ID> \
  --region <REGION_ID> \
  --instance-id <INSTANCE_ID> \
  --event-cycle-status Executed

Event Analysis:

| Event Type | Meaning | Determination | Next Step |
| --- | --- | --- | --- |
| SystemMaintenance.Reboot | Reboot caused by system maintenance | Platform-initiated maintenance | Inform user, no further investigation needed |
| SystemFailure.Reboot | Reboot caused by underlying hardware/system failure | Platform infrastructure failure | Suggest instance migration or contact support |
| InstanceFailure.Reboot | Reboot caused by instance-level failure | Instance internal issue detected by platform | Must continue to Step 3 for system log check |
| InstanceExpiration.Stop | Instance stopped due to expiration | Billing issue | Need renewal, no further investigation |
| No relevant events | No platform maintenance events found | Not platform-initiated | Continue to Step 3 |

Important Notes for InstanceFailure.Reboot:

  • This event indicates the platform detected an instance-level anomaly and triggered automatic recovery
  • Common causes: kernel panic, OOM, system hang, critical process failure
  • Must execute Step 3 to check system logs for root cause
  • Even if no obvious errors in logs, the instance may have been unresponsive at kernel level

If a platform-side event is found (any event type above except InstanceFailure.Reboot, which always continues to Step 3):

  • Clearly inform user of reboot cause (event type, time, reason)
  • Provide handling suggestions
  • End diagnosis flow

If no maintenance event found:

  • Continue to Step 3, check internal system logs based on OSType
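
The Step 2 decision can be summarized as a lookup from event type to next action. This is a minimal sketch: the function name next_step_for_event and the labels end, step3, and review are hypothetical, and sending unlisted event types to manual review is an assumption, since the table above does not cover them.

```shell
# Hypothetical helper: decide what follows Step 2 for a given event type.
# Pass an empty string when no relevant event was returned.
next_step_for_event() {
  case $1 in
    SystemMaintenance.Reboot|SystemFailure.Reboot|InstanceExpiration.Stop)
      echo "end" ;;      # platform-side cause: inform the user and stop
    InstanceFailure.Reboot|'')
      echo "step3" ;;    # instance-level or no event: check system logs
    *)
      echo "review" ;;   # assumption: event types not in the table need a human look
  esac
}
```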

Step 3A: Linux System Diagnosis (Execute when OSType is linux)

Step 3A.1: Check Cloud Assistant Status (Mandatory)

Before executing diagnostic commands, verify Cloud Assistant is running:

aliyun ecs describe-cloud-assistant-status \
  --biz-region-id <REGION_ID> \
  --region <REGION_ID> \
  --instance-id <INSTANCE_ID>

Check the response:

{
  "InstanceCloudAssistantStatusSet": {
    "InstanceCloudAssistantStatus": [
      {
        "InstanceId": "i-xxx",
        "RegionId": "cn-xxx",
        "CloudAssistantStatus": "true",
        "LastHeartbeatTime": "2026-04-09T07:26:58Z"
      }
    ]
  }
}

Important Notes:

  • CloudAssistantStatus is a string ("true"/"false"), not boolean
  • Check LastHeartbeatTime to ensure it's recent (within last few minutes)
  • Even if status is "true", RunCommand may still fail if service is unstable
  • Always check RunCommand execution result and handle failures gracefully
  • Ubuntu vs RHEL differences:
    • RHEL/CentOS/Alibaba Cloud Linux: Service name is kdump, crash files named vmcore-*
    • Ubuntu/Debian: Service name is kdump-tools, crash files named dump.* and dmesg.*
    • Diagnostic script now checks both service names and all crash file types

If CloudAssistantStatus is false or command fails:

  • Cloud Assistant is not installed or not running on the instance
  • Cannot proceed with remote diagnostic commands
  • Alternative approaches:
    1. Guide user to SSH into the instance and check logs manually
    2. Provide manual diagnostic commands for user to execute
    3. Suggest installing Cloud Assistant: Installation Guide
    4. Check instance monitoring data via CloudMonitor API

If CloudAssistantStatus is true:

  • Proceed to Step 3A.2
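
The two checks from the notes above, the string comparison on CloudAssistantStatus and the freshness of LastHeartbeatTime, can be combined into one test. This is a minimal sketch: the function name cloud_assistant_ready is hypothetical, the 300-second default is an assumed reading of "within last few minutes", and the `date -d` call assumes GNU or BusyBox date (it will not work with BSD/macOS date). The two field values are expected to have been extracted from the JSON response already.

```shell
# Hypothetical helper: succeed only if the status is the string "true" and the
# heartbeat is recent enough.
#   $1 status string, $2 heartbeat such as 2026-04-09T07:26:58Z,
#   $3 current epoch seconds (optional), $4 max age in seconds (optional, assumed 300)
cloud_assistant_ready() {
  status=$1
  heartbeat=$2
  now_epoch=${3:-$(date -u +%s)}
  max_age=${4:-300}

  # The API returns "true"/"false" as strings, so compare as strings.
  [ "$status" = "true" ] || return 1

  # 2026-04-09T07:26:58Z -> 2026-04-09 07:26:58 (a form both GNU and BusyBox date accept)
  ts=$(printf '%s' "$heartbeat" | sed 's/T/ /; s/Z$//')
  hb_epoch=$(date -u -d "$ts" +%s) || return 1

  age=$((now_epoch - hb_epoch))
  [ "$age" -le "$max_age" ]
}
```

Even when this check passes, the RunCommand result still has to be inspected, as the notes above point out.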

Step 3A.2: Execute Linux Diagnostic Script

Execute Linux diagnostic script via Cloud Assistant to check:

  • System reboot records (last reboot, /var/log/messages or /var/log/syslog)
  • Kernel Panic records (dmesg)
  • OOM records and vm.panic_on_oom configuration
  • Kdump configuration and crash dump file status
  • Crash dump files: vmcore-* (RHEL/CentOS) or dump.* / dmesg.* (Ubuntu/Debian)

Complete diagnostic commands: see diagnostic-commands.md

Linux Result Analysis:

| Finding | Possible Cause | Suggestion |
| --- | --- | --- |
| Kernel Panic + crash dump (`vmcore` / `dump.*`) | Kernel crash, dump file generated | Read `dmesg.*` file for panic reason, contact Alibaba Cloud technical support for deep analysis |
| Kernel Panic + no crash dump | Kernel crash, but kdump not configured or not working | Proceed to Step 5: Recommend Kdump configuration for future crash capture |
| OOM + panic_on_oom=1 | OOM triggered kernel panic | Disable panic_on_oom or increase memory |
| OOM Killer | Memory insufficient causing process killed | Optimize memory usage or upgrade instance type |
| SysRq triggered crash | Manual crash trigger via /proc/sysrq-trigger | Check if intentional test, review bash history and audit logs |
| Normal reboot records | User or program triggered reboot | Check cron jobs or ops scripts |
| No abnormal records | No system-level issues found | May be external factors, suggest monitoring |
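
As an illustration of how log text maps onto the findings above, the sketch below scans text on standard input for a few typical kernel messages. It is not the skill's diagnostic script (that lives in diagnostic-commands.md); the function name classify_linux_log, the labels it prints, and the exact search patterns are assumptions based on common kernel log wording.

```shell
# Hypothetical helper: read log text on stdin and print one label per finding.
# The patterns are assumptions about typical kernel messages.
classify_linux_log() {
  text=$(cat)
  found=0

  if printf '%s\n' "$text" | grep -qi 'kernel panic'; then
    echo "kernel_panic"; found=1
  fi
  if printf '%s\n' "$text" | grep -qiE 'out of memory|oom-killer'; then
    echo "oom"; found=1
  fi
  if printf '%s\n' "$text" | grep -qi 'sysrq.*trigger a crash'; then
    echo "sysrq_crash"; found=1
  fi

  [ "$found" -eq 1 ] || echo "no_abnormal_records"
}
```

For example, `dmesg | classify_linux_log` on the instance would print the labels that apply; each label still has to be confirmed against the actual log lines before it is reported as a conclusion.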

Step 3B: Windows System Diagnosis (Execute when OSType is windows)

Step 3B.1: Check Cloud Assistant Status (Mandatory)

Before executing diagnostic commands, verify Cloud Assistant is running:

aliyun ecs describe-cloud-assistant-status \
  --biz-region-id <REGION_ID> \
  --region <REGION_ID> \
  --instance-id <INSTANCE_ID>

Check the response:

  • CloudAssistantStatus: true — Cloud Assistant is running, proceed to Step 3B.2
  • CloudAssistantStatus: false — Cloud Assistant is not running
    • Cannot proceed with remote diagnostic commands
    • Guide user to SSH/RDP into instance and run diagnostics manually
    • Suggest reinstalling Cloud Assistant: Windows Installation Guide

Step 3B.2: Execute Windows Diagnostic Script

Execute Windows diagnostic script via Cloud Assistant to check:

  • System uptime and unexpected shutdown events (Event ID 41, 1074, 6008, 6006)
  • Memory dump configuration and pagefile settings
  • MEMORY.DMP and minidump files existence
  • BSOD events and application crashes

Complete diagnostic commands: see diagnostic-commands.md

Windows Result Analysis:

| Finding | Possible Cause | Suggestion |
| --- | --- | --- |
| Event 41 (Kernel-Power) | Unexpected shutdown/crash | Check for BSOD, dump files |
| Dump configured + dump file exists | System crashed and captured dump | Contact Alibaba Cloud technical support for dump file analysis |
| Dump configured + no dump file | Crash occurred but no dump captured | Check pagefile and disk space |
| Dump not configured | Crash dumps disabled | Enable memory dump for diagnosis |
| BSOD events found | Blue screen crash occurred | Check bug check code in dump |
| No abnormal events | No system-level crash records | May be power issue or external factor |
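
Step 3B.2 lists four event IDs without saying what each one means. The lookup below records the standard Windows meanings so that a finding can be worded correctly; the function name describe_windows_event is hypothetical, and the descriptions are paraphrased rather than quoted from the event log.

```shell
# Hypothetical helper: describe the shutdown-related event IDs checked in Step 3B.2.
describe_windows_event() {
  case $1 in
    41)   echo "Kernel-Power: system rebooted without a clean shutdown" ;;
    1074) echo "User32: shutdown or restart initiated by a user or process" ;;
    6006) echo "EventLog: Event Log service stopped (clean shutdown)" ;;
    6008) echo "EventLog: previous system shutdown was unexpected" ;;
    *)    echo "not one of the event IDs covered here" ;;
  esac
}
```

Events 41 and 6008 point to an unclean shutdown, while 1074 and 6006 indicate a planned one, which corresponds to the "normal reboot" case on the Linux side.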

Step 3.5: Get Cloud Assistant Command Output (Required after Step 3)

After executing diagnostic script via RunCommand, query the execution result:

aliyun ecs describe-invocations \
  --biz-region-id <REGION_ID> \
  --region <REGION_ID> \
  --instance-id <INSTANCE_ID> \
  --invoke-id <INVOKE_ID>

Important Notes:

  • Use --instance-id (not --instance-id.1) for describe-invocations API
  • The InvokeId is returned by the RunCommand API call
  • Decode the Output field from Base64 to get diagnostic results
  • Check InvokeStatus to ensure command execution completed successfully
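
Decoding the Output field is a single pipeline. This is a minimal sketch: the function name decode_invocation_output is hypothetical, the Base64 string is assumed to have been extracted from the JSON response already, and `base64 -d` assumes a GNU coreutils or BusyBox base64.

```shell
# Hypothetical helper: decode the Base64 text taken from the Output field.
decode_invocation_output() {
  printf '%s' "$1" | base64 -d
}
```

For example, `decode_invocation_output "aGVsbG8="` prints `hello`.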

Step 4: Analyze Crash Dump Files

If Step 3 found crash dump files (vmcore on Linux, MEMORY.DMP/minidump on Windows), perform preliminary analysis.

Complete analysis commands: see diagnostic-commands.md

Important: If Linux vmcore files need deep analysis or Windows dump files (MEMORY.DMP/minidump) are found, recommend the user contact Alibaba Cloud technical support team for professional crash dump analysis assistance.


Step 5: Recommend Kdump Configuration (If Not Configured)

If Step 3A found Kernel Panic records but no vmcore files, must advise user to configure Kdump.

When to Recommend Kdump Configuration

  • Kernel panic records found in dmesg or system logs, but /var/crash has no vmcore files
  • Kdump service status shows inactive or failed
  • /proc/cmdline does not contain crashkernel= parameter
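
The three conditions above can be expressed as one check. This is a minimal sketch: the function name kdump_recommended and its argument convention are hypothetical, and treating any single condition as sufficient is an assumption, since the list does not say whether the conditions combine with "and" or "or".

```shell
# Hypothetical helper: succeed when Kdump configuration should be recommended.
#   $1 panic records found (yes|no)   $2 number of vmcore files in /var/crash
#   $3 kdump service state            $4 contents of /proc/cmdline
kdump_recommended() {
  panic_found=$1
  vmcore_count=$2
  kdump_state=$3
  cmdline=$4

  # Panic records exist but nothing was captured.
  if [ "$panic_found" = "yes" ] && [ "$vmcore_count" -eq 0 ]; then
    return 0
  fi

  # Service is not running.
  case $kdump_state in
    inactive|failed) return 0 ;;
  esac

  # No memory reserved for the crash kernel.
  case $cmdline in
    *crashkernel=*) ;;
    *) return 0 ;;
  esac

  return 1
}
```

On the instance, the last argument would typically be supplied as `"$(cat /proc/cmdline)"`.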

Key Points to Communicate

  1. Why Kdump is needed: Without Kdump, kernel crashes will not generate vmcore files, making root cause analysis impossible.

  2. Configuration requirements:

    • Reserve memory for crash kernel via crashkernel= kernel parameter
    • Enable and start the kdump (RHEL/CentOS) or kdump-tools (Ubuntu/Debian) service
    • Ensure sufficient disk space in /var/crash (or configured path)
  3. Configuration reference: Provide guidance from diagnostic-commands.md

Kdump Configuration Steps Summary

RHEL/CentOS/Alibaba Cloud Linux:

  1. Install: yum install -y kexec-tools
  2. Add crashkernel=auto to kernel parameters in /etc/default/grub
  3. Run grub2-mkconfig -o /boot/grub2/grub.cfg
  4. Reboot the instance
  5. Enable: systemctl enable --now kdump

Ubuntu/Debian:

  1. Install: apt-get install -y kdump-tools
  2. Set USE_KDUMP=1 in /etc/default/kdump-tools
  3. Run update-grub (crashkernel parameter usually auto-added)
  4. Reboot the instance
  5. Verify: systemctl status kdump-tools

Windows Memory Dump Configuration

If Step 3B found BSOD events but no dump files:

  1. Verify pagefile is configured and has sufficient size
  2. Enable memory dump: System Properties → Advanced → Startup and Recovery → Settings
  3. Select "Automatic memory dump" or "Kernel memory dump"
  4. Ensure CrashDumpEnabled registry value is not 0

Final Output (Must execute after diagnosis complete)

After all diagnostic steps complete, must do both of the following:

  1. Read references/output-format.md — Get complete output format template
  2. Output strictly according to template structure — Choose corresponding template based on actual result
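
Since every {...} placeholder in the template has to be filled, a quick scan of the final text can catch leftovers before it is returned. This is a minimal sketch: the function name has_unfilled_placeholders is hypothetical, and it assumes placeholders are single-line text in braces; legitimate braces in the output (for example, quoted JSON) would be flagged too.

```shell
# Hypothetical helper: read the rendered report on stdin and succeed (exit 0)
# if any {...} placeholder is still present.
has_unfilled_placeholders() {
  grep -q '{[^{}]*}'
}
```

For example, `printf 'Instance: {INSTANCE_ID}\n' | has_unfilled_placeholders` succeeds, signalling that the report is not ready yet.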

References

Usage Guidance
This skill appears to do what it says (ECS reboot/crash diagnosis) but has some mismatches you should address before installing:

  1. Expect to provide Alibaba Cloud credentials (with ecs:RunCommand and related permissions). Do not use highly privileged/global credentials; create a least-privilege RAM user scoped to the specific resources and actions.
  2. Confirm the policy Resource is restricted to the instances/regions you intend to diagnose (avoid Resource="*").
  3. Understand that the agent will run remote scripts on your instances via Cloud Assistant (ecs:RunCommand). Test on non-production first.
  4. Verify aliyun-cli is installed locally and review the AI-Mode user-agent step (it will identify the agent).
  5. Review the exact diagnostic commands the agent will run (they are in references/diagnostic-commands.md) to ensure no sensitive data is printed/collected unintentionally.

If the manifest provided explicit required env vars and a scoped example policy (no wildcard resources), my confidence would be higher.
Capability Analysis
Type: OpenClaw Skill
Name: alibabacloud-ecs-reboot-or-crash-diagnosis
Version: 0.0.1

The skill bundle provides a legitimate diagnostic workflow for Alibaba Cloud ECS instances to identify causes of reboots or crashes. It utilizes the `aliyun-cli` to check platform maintenance events and executes standard diagnostic scripts (Shell for Linux and PowerShell for Windows) via the Cloud Assistant service. The commands in `references/diagnostic-commands.md` are focused on system log analysis, kernel panic detection, and crash dump verification, with no evidence of data exfiltration, persistence mechanisms, or malicious prompt injection.
Capability Assessment
Purpose & Capability
The skill's stated purpose (ECS reboot/crash diagnosis) legitimately requires calling Alibaba Cloud ECS APIs and using Cloud Assistant to run remote diagnostics. However, the skill manifest declares no required binaries or credentials, while SKILL.md explicitly requires aliyun-cli, AI-Mode configuration, and pre-configured Alibaba Cloud credentials. That omission is an incoherence: the declared requirements do not list the real capabilities the skill needs.
Instruction Scope
The runtime instructions stay within the diagnostic domain: they call DescribeInstances, DescribeInstanceHistoryEvents, DescribeCloudAssistantStatus, RunCommand and DescribeInvocations and run diagnostic scripts on the instance. This is appropriate for the stated purpose, but it grants the agent a workflow that will execute arbitrary shell/PowerShell content on user instances via ecs:RunCommand — a powerful capability that must be limited and audited. The instructions also demand a strict output template and require enabling aliyun CLI AI-Mode and setting a custom user-agent.
Install Mechanism
No install spec is present (instruction-only). That reduces installation risk because nothing is downloaded or written by the skill itself. The SKILL.md expects the operator to have aliyun-cli installed, rather than installing it automatically.
Credentials
The skill implicitly requires Alibaba Cloud credentials (to run aliyun CLI and ecs:RunCommand), but the manifest lists no required environment variables or primary credential. references/ram-policies.md suggests the skill needs ecs:RunCommand and other ECS permissions; the example policy uses Resource="*", which is broader than ideal. The lack of explicit credential declaration in the manifest is an inconsistency and increases the chance a user will accidentally grant overly broad credentials.
Persistence & Privilege
The skill is not always-enabled and does not request persistent presence. Autonomous invocation is allowed (platform default) but there is no evidence the skill modifies other skills or system-wide settings. The AI-Mode enable/disable steps affect CLI configuration during runtime; the skill asks the user to enable AI-Mode and then disable it afterwards.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install alibabacloud-ecs-reboot-or-crash-diagnosis
  3. After installation, invoke the skill by name or use /alibabacloud-ecs-reboot-or-crash-diagnosis
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v0.0.1
  • Initial release of the ECS Instance Reboot/Crash Diagnosis skill.
  • Guides diagnosis of unexpected ECS instance reboots or crashes by checking platform maintenance events and analyzing internal system logs.
  • Supports both Linux and Windows ECS instances, including vmcore, crash dump, and system log analysis.
  • Enforces required parameter checks and a standardized diagnostic workflow that cannot be skipped.
  • Includes detailed execution rules, prerequisites, and output format requirements.
Metadata
Slug alibabacloud-ecs-reboot-or-crash-diagnosis
Version 0.0.1
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is Alibabacloud Ecs Reboot Or Crash Diagnosis?

Diagnose ECS instance reboot or crash issues. First checks for abnormal maintenance events, then uses Cloud Assistant to check for internal restarts or kerne... It is an AI Agent Skill for Claude Code / OpenClaw, with 44 downloads so far.

How do I install Alibabacloud Ecs Reboot Or Crash Diagnosis?

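Run "/install alibabacloud-ecs-reboot-or-crash-diagnosis" in the OpenClaw or Claude Code chat to install it in one step. Running the skill additionally requires aliyun-cli 3.3.3+ and pre-configured Alibaba Cloud credentials, as listed under Prerequisites.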

Is Alibabacloud Ecs Reboot Or Crash Diagnosis free?

Yes, Alibabacloud Ecs Reboot Or Crash Diagnosis is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Alibabacloud Ecs Reboot Or Crash Diagnosis support?

Alibabacloud Ecs Reboot Or Crash Diagnosis is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Alibabacloud Ecs Reboot Or Crash Diagnosis?

It is built and maintained by alibabacloud-skills-team (@sdk-team); the current version is v0.0.1.
