Alibabacloud Aes Sysom Pai Diagnosis
/install alibabacloud-aes-sysom-pai-diagnosis
Skill Name: alibabacloud-aes-sysom-pai-diagnosis
Goal: Perform SysOM deep diagnosis on Alibaba Cloud PAI products (EAS / DLC) to identify root causes of instance-level performance and health issues.
Credential Security
[CRITICAL] Credential Security Rules:
- NEVER print, echo, or display AccessKey ID / AccessKey Secret values in conversation or command output (even partial masking of LTAI_ACCESS_KEY_ID is FORBIDDEN)
- NEVER ask the user to input AK/SK directly in the conversation or command line
- NEVER use aliyun configure set with literal credential values
- ONLY use aliyun configure list to check credential status
aliyun configure list
Check the output for a valid profile (AK, STS, or OAuth identity).
If no valid profile exists, STOP here.
- Obtain credentials from Alibaba Cloud Console
- Configure credentials outside of this session (via aliyun configure in terminal or environment variables in shell profile)
- Return and re-run after aliyun configure list shows a valid profile
RAM Policy
For the full list of RAM permissions required by this skill, see references/ram-policies.md.
[MUST] Permission Failure Handling: When any command or API call fails due to permission errors at any point during execution, follow this process:
- Read references/ram-policies.md to get the full list of permissions required by this skill
- Use the ram-permission-diagnose skill to guide the user through requesting the necessary permissions
- Pause and wait until the user confirms that the required permissions have been granted
Parameter Confirmation
IMPORTANT: Parameter Confirmation — Before executing any command or API call, ALL user-customizable parameters (e.g., RegionId, instance IDs, product type, time ranges, etc.) MUST be confirmed with the user. Do NOT assume or use default values without explicit user approval.
| Parameter | Required/Optional | Description | Default Value |
|---|---|---|---|
| region | Required | Region of the PAI resource (e.g., cn-hangzhou) | None, must be provided by user |
| instance | Required | PAI instance ID (EAS service ID eas-m-xxx or DLC job ID dlcxxxxxxxx) | None, must be provided by user |
| product | Required | PAI sub-product type, one of EAS or DLC | Auto-inferred from instance prefix (eas- → EAS, dlc → DLC); only ask user when inference fails |
| start_time | Optional | Diagnosis start timestamp (Unix seconds) | 0 (real-time) |
| end_time | Optional | Diagnosis end timestamp (Unix seconds) | 0 |
| enable_diagnosis | Optional | Force real-time diagnosis (highest priority) | false |
| uid | Optional | Account ID owning the resource | None |
| ocd_description | Optional | User's problem description in English, with words joined by underscores (_). No Chinese characters, no spaces. Example: GPU_OOM_instance_restart | None |
Product Auto-Inference Rule
The product field MUST be present in the params JSON. The value is determined as follows:
- If the user explicitly specifies product (EAS or DLC), use the user's value
- Otherwise, infer from the instance prefix:
  - eas- → EAS
  - dlc (no hyphen, e.g., dlcxxxxxxxx) → DLC
- If inference is ambiguous or fails, you MUST explicitly ask the user to choose between EAS and DLC
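The inference rule above can be sketched as a small function. This is a minimal illustration, not part of the skill itself: the name infer_product is hypothetical, and the prefix match is assumed to be case-sensitive because the rule only lists lowercase prefixes.

```python
from typing import Optional


def infer_product(instance: str, explicit: Optional[str] = None) -> Optional[str]:
    """Return "EAS" or "DLC", or None when the caller must ask the user."""
    if explicit:
        value = explicit.strip().upper()
        # An explicit but unrecognized value is treated as a failed inference.
        return value if value in ("EAS", "DLC") else None
    if instance.startswith("eas-"):
        return "EAS"
    if instance.startswith("dlc"):  # DLC job IDs carry no hyphen
        return "DLC"
    return None  # ambiguous or unknown prefix: ask the user
```

A None result corresponds to the third bullet: stop and ask the user to choose between EAS and DLC.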
Core Workflow
The workflow has two phases and nine steps (Steps 0 to 8). All aliyun CLI business commands (SysOM, EAS, DLC API calls) MUST include --user-agent AlibabaCloud-Agent-Skills/alibabacloud-aes-sysom-pai-diagnosis. System commands (version, configure, plugin) do NOT use --user-agent.
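One way to apply the --user-agent rule consistently is to build every command through a single helper. The sketch below is an assumption about how an agent might organize this; with_user_agent and SYSTEM_COMMANDS are hypothetical names, and the helper only builds the argument list without running anything.

```python
USER_AGENT = "AlibabaCloud-Agent-Skills/alibabacloud-aes-sysom-pai-diagnosis"
# System commands named in the rule above; they never receive --user-agent.
SYSTEM_COMMANDS = {"version", "configure", "plugin"}


def with_user_agent(args: list) -> list:
    """Build the full argv for `aliyun`, given the arguments that follow it."""
    if args and args[0] in SYSTEM_COMMANDS:
        return ["aliyun", *args]
    # Business commands (sysom, eas, pai-dlc, ...) always carry the user agent.
    return ["aliyun", *args, "--user-agent", USER_AGENT]
```

For example, with_user_agent(["sysom", "get-diagnosis-result", "--task-id", task_id]) appends the flag, while with_user_agent(["version"]) leaves the command unchanged.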
Phase 1: Environment Setup (Steps 0–3)
Step 0 — Enable AI-Mode and Update Plugins
Before executing any CLI commands, enable AI-Mode, set User-Agent, and update plugins:
aliyun configure ai-mode enable
aliyun configure ai-mode set-user-agent --user-agent "AlibabaCloud-Agent-Skills/alibabacloud-aes-sysom-pai-diagnosis"
aliyun plugin update
⚠️ The above three commands must be executed before all CLI operations, and only need to be run once.
Step 1 — CLI Version Check
aliyun version
Verify version >= 3.3.1. If not met, refer to references/cli-installation-guide.md for installation.
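The version comparison should be numeric rather than string-based, since "3.10.0" sorts before "3.3.1" as text. The sketch below assumes the output of aliyun version contains a dotted three-part version somewhere in it; the exact output format is not specified here, and meets_minimum is a hypothetical helper name.

```python
import re


def meets_minimum(version_output: str, minimum: tuple = (3, 3, 1)) -> bool:
    """Compare the first dotted version found in the text against `minimum`."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        return False  # unparseable output is treated as "not met"
    found = tuple(int(part) for part in match.groups())
    return found >= minimum  # tuple comparison is element-wise and numeric
```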
Step 2 — Enable Auto Plugin Installation
aliyun configure set --auto-plugin-install true
Step 3 — Credential Verification
aliyun configure list
If no valid credentials exist, STOP and guide the user to configure credentials outside the session.
Phase 2: Diagnosis Execution (Steps 4–8)
For detailed workflow, see references/diagnose-workflow.md.
Step 4 — Ambiguous Problem Clarification (Inversion Gate)
Must confirm region, instance, and when the anomaly occurred. If not provided by the user, ask explicitly. product is auto-inferred from the instance prefix (eas- → EAS, dlc → DLC); only ask user when inference fails. Also extract optional time range.
⚠️ Time Inference Rule: When the user's description contains any temporal reference (e.g., "this morning", "yesterday afternoon", "around 3pm", "last night"), you MUST proactively ask for the specific time range and recommend historical diagnosis mode. Do NOT silently default to real-time diagnosis when the problem clearly occurred in the past.
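Once the user has confirmed a concrete time range, it must be turned into Unix seconds for start_time and end_time. The following is a minimal sketch under two loudly stated assumptions: the user gives times as "YYYY-MM-DD HH:MM", and the times are in UTC+8. Neither is specified by this skill, so confirm the time zone with the user. The name to_unix_range is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the user reports times in UTC+8; confirm the zone with the user.
ASSUMED_TZ = timezone(timedelta(hours=8))


def to_unix_range(start: str, end: str, tz: timezone = ASSUMED_TZ) -> tuple:
    """Convert 'YYYY-MM-DD HH:MM' strings into (start_time, end_time) in Unix seconds."""
    fmt = "%Y-%m-%d %H:%M"
    start_ts = int(datetime.strptime(start, fmt).replace(tzinfo=tz).timestamp())
    end_ts = int(datetime.strptime(end, fmt).replace(tzinfo=tz).timestamp())
    if end_ts <= start_ts:
        raise ValueError("end must be later than start")
    return start_ts, end_ts
```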
Step 5 — SysOM Role Initialization
aliyun sysom initial-sysom --check-only false --source aes-skills --user-agent AlibabaCloud-Agent-Skills/alibabacloud-aes-sysom-pai-diagnosis
Step 6 — Resource Validation
Before invoking diagnosis, you MUST validate the resource based on the inferred product:
6A. EAS — Verify Service Exists
aliyun eas list-services \
--region <region> \
--filter <eas_service_id> \
--user-agent AlibabaCloud-Agent-Skills/alibabacloud-aes-sysom-pai-diagnosis
From the returned Services array, verify that an entry with a matching ServiceId exists. If no match is found, inform the user that the service ID is invalid and stop the pipeline.
6B. DLC — Verify Resource Type is Lingjun
aliyun pai-dlc get-job \
--region <region> \
--job-id <dlc_job_id> \
--user-agent AlibabaCloud-Agent-Skills/alibabacloud-aes-sysom-pai-diagnosis
Check the ResourceType field in the response:
- Lingjun → proceed to Step 7
- Any other value → STOP and inform the user: "SysOM diagnosis currently only supports DLC jobs running on Lingjun resources. Your job uses <ResourceType>, which is not yet supported."
⚠️ The instance field in params JSON uses the original instance ID directly (eas-m-xxx or dlcxxxxxxxx); this step is purely for validation.
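Both validation checks in Step 6 reduce to simple lookups on the parsed JSON response. The sketch below assumes that Services and ResourceType are top-level keys of the respective responses, as the text suggests; the function names are hypothetical and the functions operate on already-parsed dictionaries, not on the CLI itself.

```python
from typing import Optional


def eas_service_exists(list_services_response: dict, service_id: str) -> bool:
    """6A: True when the Services array has an entry with a matching ServiceId."""
    services = list_services_response.get("Services") or []
    return any(item.get("ServiceId") == service_id for item in services)


def dlc_unsupported_message(get_job_response: dict) -> Optional[str]:
    """6B: None when the job runs on Lingjun, else the message to show the user."""
    resource_type = get_job_response.get("ResourceType")
    if resource_type == "Lingjun":
        return None
    return (
        "SysOM diagnosis currently only supports DLC jobs running on Lingjun "
        f"resources. Your job uses {resource_type}, which is not yet supported."
    )
```

In both failure cases the pipeline stops; neither function alters the instance ID that later goes into the params JSON.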
Step 7 — Invoke Diagnosis and Poll Results
Diagnosis Mode Decision Rules
if enable_diagnosis == true:
    mode = real-time diagnosis    # enable_diagnosis has highest priority
elif start_time != 0:
    mode = historical diagnosis   # time range specified, retrospective analysis
else:
    mode = real-time diagnosis    # default
- Real-time: start_time=0, end_time=0
- Historical: start_time=<unix_ts>, end_time=<unix_ts>
- Forced real-time: when enable_diagnosis=true, force start_time to 0 even if provided
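The decision rules can be expressed as one function that returns both the mode and the timestamps to send. This is a sketch: resolve_time_window is a hypothetical name, and resetting end_time to 0 alongside start_time in real-time mode is an assumption drawn from the "Real-time: start_time=0, end_time=0" bullet.

```python
def resolve_time_window(enable_diagnosis: bool, start_time: int, end_time: int) -> tuple:
    """Return (mode, start_time, end_time) following the decision rules."""
    if enable_diagnosis:
        # Highest priority: any supplied time range is ignored.
        return "real-time", 0, 0
    if start_time != 0:
        # A time range was specified: retrospective analysis.
        return "historical", start_time, end_time
    return "real-time", 0, 0  # default
```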
Build params JSON
Use snake_case keys (consistent with SDK). Required base fields (ALL must be included):
{
  "instance": "<eas_service_id_or_dlc_job_id>",
  "region": "<region>",
  "product": "<EAS_or_DLC>",
  "start_time": 0,
  "end_time": 0,
  "type": "ocd",
  "ai_roadmap": true,
  "enable_sysom_link": false,
  "ocd_description": "<user_problem_description_in_english_with_underscores>"
}
⚠️ Anti-confusion Warning: "type": "ocd" and "product": "<EAS|DLC>" are BOTH REQUIRED fields inside the params JSON. Do NOT omit either!
- --service-name ocd (CLI argument) → tells CLI which diagnosis service endpoint to call
- "type": "ocd" (params JSON field) → tells the diagnosis engine which diagnosis type to execute internally
- "product": "EAS" or "product": "DLC" (params JSON field) → tells the diagnosis engine which PAI sub-product to target
All three are mandatory; do NOT omit any of them.
⚠️ The instance field uses the original instance ID directly: eas-m-xxx for EAS, dlcxxxxxxxx for DLC. Do NOT convert to ServiceName or any other identifier.
Conditional fields (add only when non-empty):
- uid: account ID owning the resource (integer)
- ocd_description: user's problem description (string). Format constraints: must be in English, no Chinese characters, no spaces; use underscores (_) to join words. Example: high_latency_first_token, GPU_OOM_killed
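Assembling the params JSON programmatically helps avoid omitting type or product. The sketch below is illustrative only: build_params and to_ocd_description are hypothetical helpers. Note that to_ocd_description only strips non-ASCII characters and joins the remaining words with underscores; it does not translate, so a Chinese description still has to be rendered into English first.

```python
import json
import re
from typing import Optional


def to_ocd_description(text: str) -> str:
    """Keep ASCII letters and digits, joined by underscores; drop everything else."""
    return "_".join(re.findall(r"[A-Za-z0-9]+", text))


def build_params(instance: str, region: str, product: str,
                 start_time: int = 0, end_time: int = 0,
                 uid: Optional[int] = None, description: str = "") -> str:
    params = {
        "instance": instance,  # original ID, never a ServiceName
        "region": region,
        "product": product,
        "start_time": start_time,
        "end_time": end_time,
        "type": "ocd",
        "ai_roadmap": True,
        "enable_sysom_link": False,
    }
    # Conditional fields are added only when non-empty.
    if uid is not None:
        params["uid"] = uid
    ocd_description = to_ocd_description(description)
    if ocd_description:
        params["ocd_description"] = ocd_description
    return json.dumps(params, separators=(",", ":"))
```

The returned string is compact JSON intended to be passed as the value of --params.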
Invoke Diagnosis
aliyun sysom invoke-diagnosis \
--service-name ocd \
--channel ecs \
--params '{"instance":"<eas_service_id_or_dlc_job_id>","region":"<region>","product":"<EAS|DLC>","start_time":<start_time>,"end_time":<end_time>,"type":"ocd","ai_roadmap":true,"enable_sysom_link":false}' \
--user-agent AlibabaCloud-Agent-Skills/alibabacloud-aes-sysom-pai-diagnosis
Extract task_id from the response.
⚠️ [CRITICAL] Sysom.TaskInProgress Error Handling: If invoke-diagnosis returns a Sysom.TaskInProgress error, this means a diagnosis task is already running. You MUST:
- Extract the existing task_id from the error message using string match (pattern: ocd(<task_id>) or similar identifier in the message body)
- Immediately proceed to the polling flow with the extracted task_id
- NEVER treat TaskInProgress as a fatal failure or abort the workflow
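The string match described above might look like the following. The exact wording of the error message is not documented here, so this sketch assumes only that the ID appears as ocd(<task_id>) somewhere in the message body; extract_task_id is a hypothetical name, and a None result means the pattern needs adjusting to the real message.

```python
import re
from typing import Optional

# Assumption: the task ID appears as "ocd(<task_id>)" in the error message.
TASK_ID_PATTERN = re.compile(r"ocd\(([^()\s]+)\)")


def extract_task_id(error_message: str) -> Optional[str]:
    """Return the task ID embedded in a TaskInProgress message, or None."""
    match = TASK_ID_PATTERN.search(error_message)
    return match.group(1) if match else None
```

When an ID is found, the workflow continues directly with polling instead of invoking a new diagnosis.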
Poll Results (interval: 10s, max: 60 attempts)
aliyun sysom get-diagnosis-result --task-id <task_id> --user-agent AlibabaCloud-Agent-Skills/alibabacloud-aes-sysom-pai-diagnosis
Check the status field in the response:
- Ready / Running → MUST continue polling at 10s intervals
- Success → diagnosis complete, proceed to Step 8
- Fail → diagnosis failed, inform the user
⛔ [CRITICAL] Mandatory Polling Rules (MUST OBEY; violations will produce incorrect results):
- Running status is NORMAL: it simply means the diagnosis engine is still working. You MUST continue polling every 10 seconds. Running is NOT an error and MUST NOT trigger early termination.
- NEVER abandon polling early: do NOT stop polling before reaching Success, Fail, or the 60-attempt limit. Do NOT "give up" after a few Running responses.
- NEVER fall back to manual analysis: if polling is ongoing or timed out, you MUST NOT attempt to manually diagnose the issue by analyzing ListServices output, instance metadata, or any other data source. The diagnosis report is the ONLY valid source of root cause information.
- NEVER fabricate diagnosis results: if the task has not reached Success status, you MUST NOT output any summary.overall_status, summary.root_cause, or summary.suggestions values. These fields come exclusively from the completed diagnosis result.
- Timeout handling: if still incomplete after 60 polling attempts, output ONLY this template and stop:
⏳ SysOM diagnosis task timed out
- Task ID: <task_id>
- Current status: <status>
- Suggestion: Please continue waiting for the diagnosis to complete.
FORBIDDEN to add alternative suggestions, manual analysis, or fabricated conclusions in timeout output.
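The polling rules translate into a bounded loop with exactly three exits: Success, Fail, or the attempt limit. The sketch below assumes status is a top-level key of the parsed response; poll_diagnosis is a hypothetical name, and fetch stands in for whatever function runs aliyun sysom get-diagnosis-result and parses its JSON output.

```python
import time
from typing import Callable

TIMEOUT_TEMPLATE = (
    "⏳ SysOM diagnosis task timed out\n"
    "- Task ID: {task_id}\n"
    "- Current status: {status}\n"
    "- Suggestion: Please continue waiting for the diagnosis to complete."
)


def poll_diagnosis(task_id: str, fetch: Callable[[str], dict],
                   interval: float = 10.0, max_attempts: int = 60,
                   sleep: Callable[[float], None] = time.sleep) -> tuple:
    """Return (outcome, payload); outcome is "success", "fail", or "timeout"."""
    status = "Unknown"
    for attempt in range(max_attempts):
        result = fetch(task_id)
        status = result.get("status", "Unknown")
        if status == "Success":
            return "success", result
        if status == "Fail":
            return "fail", result
        # Ready / Running: normal, keep polling; never analyze manually.
        if attempt < max_attempts - 1:
            sleep(interval)
    # Attempt limit reached: return the fixed template and nothing else.
    return "timeout", TIMEOUT_TEMPLATE.format(task_id=task_id, status=status)
```

Injecting sleep makes the loop testable without waiting; in normal use the default time.sleep gives the 10-second interval.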
Step 8 — Result Parsing and Output
Parse the returned JSON and present summary.overall_status, summary.root_cause, summary.suggestions, issues[], and other key information to the user.
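A minimal rendering of those fields is sketched below. The field names come from the text above, but their types are assumptions: suggestions is treated as either a string or a list of strings, and issues as a list. render_report is a hypothetical name, and it should only be called on a result whose status is Success.

```python
def render_report(result: dict) -> str:
    """Format the key fields of a completed diagnosis result as plain text."""
    summary = result.get("summary") or {}
    suggestions = summary.get("suggestions") or []
    if isinstance(suggestions, str):
        suggestions = [suggestions]  # tolerate a single-string value
    lines = [
        f"Overall status: {summary.get('overall_status', 'n/a')}",
        f"Root cause: {summary.get('root_cause', 'n/a')}",
        "Suggestions:",
    ]
    lines += [f"- {item}" for item in suggestions] or ["- (none reported)"]
    lines.append(f"Issues reported: {len(result.get('issues') or [])}")
    return "\n".join(lines)
```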
Success Verification
For verification methods of each phase, see references/verification-method.md.
Cleanup
The diagnosis operations in this skill are read-only and do not modify the PAI service / job state — no cleanup is needed.
PAI EAS / DLC are fully managed services — there is no agent to install or uninstall.
After all CLI operations are complete, you MUST disable AI-Mode:
aliyun configure ai-mode disable
Command Tables
For the full CLI command list, see references/related-commands.md.
Best Practices
- Product auto-inferred silently: product is determined from the instance prefix (eas- → EAS, dlc → DLC); only ask the user when the prefix is unrecognizable
- Resource validation is mandatory: EAS calls ListServices to verify existence; DLC calls GetJob to verify existence AND check that ResourceType is Lingjun
- Instance ID used directly in params: Both EAS (eas-m-xxx) and DLC (dlcxxxxxxxx) instance IDs are passed as-is in the instance field; do NOT convert to ServiceName
- Use real-time diagnosis mode by default: Unless the user specifies a time range or describes a past incident (see the Time Inference Rule in Step 4), default to real-time diagnosis
- Credential security: Never print or echo AK/SK values in conversation
- All business CLI commands must include --user-agent AlibabaCloud-Agent-Skills/alibabacloud-aes-sysom-pai-diagnosis (system commands like version, configure, plugin do not use --user-agent)
- Remediation suggestions may involve high-risk operations: Follow the Human-in-the-loop protocol and wait for user confirmation
- No enrollment / agent installation needed: PAI EAS and DLC are managed services; SysOM accesses them through the platform side, not via instance-level agents
Unsupported Scenarios
- Non-PAI products (use alibabacloud-aes-sysom-os-diagnosis for ECS instances)
- PAI products other than EAS and DLC (e.g., DSW, MaxCompute); current skill scope is EAS / DLC only
- Pure configuration issues (e.g., model version mismatch, EAS routing config), where no OS-level diagnosis is needed
Error Handling
| Error Scenario | CLI Response | Agent Action |
|---|---|---|
| Invalid EAS ServiceId | ListServices returns empty | Inform user the service ID does not exist in the region, stop pipeline |
| Invalid DLC JobId | GetJob returns not found | Inform user the DLC job ID does not exist, stop pipeline |
| DLC ResourceType not Lingjun | GetJob returns non-Lingjun type | Inform user SysOM only supports Lingjun resources, stop pipeline |
| Unknown product / ambiguous prefix | Cannot infer from instance | Explicitly ask user to choose EAS or DLC |
| Role authorization failure | initial-sysom returns error | Prompt user to check SysOM service activation status |
| Diagnosis invocation failure | invoke-diagnosis returns error | Check credential, permission, and product field correctness |
| Diagnosis timeout | get-diagnosis-result polling timeout | Output only the timeout template from Step 7 and stop |
| Insufficient permissions | API returns Forbidden | Read references/ram-policies.md and guide user to request permissions |
Reference Links
| Reference | Description |
|---|---|
| references/cli-installation-guide.md | Aliyun CLI installation and configuration guide |
| references/ram-policies.md | RAM permission policy list |
| references/related-commands.md | Full CLI command list |
| references/verification-method.md | Success verification methods for each phase |
| references/diagnose-workflow.md | Detailed diagnosis workflow (Steps 4–8) |
| references/acceptance-criteria.md | Test acceptance criteria |
- Make sure OpenClaw is installed (locally or via Docker)
- Enter the install command in the chat box: /install alibabacloud-aes-sysom-pai-diagnosis
- After installation, invoke the skill by name or trigger it with /alibabacloud-aes-sysom-pai-diagnosis
- Provide the required inputs described in the skill's parameters to get structured output
What is Alibabacloud Aes Sysom Pai Diagnosis?
Perform SysOM deep diagnosis on Alibaba Cloud PAI products (EAS / DLC) to identify root causes of instance-level issues. Use when users report: - EAS instanc... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 34 downloads to date.
How do I install Alibabacloud Aes Sysom Pai Diagnosis?
Run the command "/install alibabacloud-aes-sysom-pai-diagnosis" in the OpenClaw or Claude Code chat box for one-step installation; no additional configuration is required.
Is Alibabacloud Aes Sysom Pai Diagnosis free?
Yes. Alibabacloud Aes Sysom Pai Diagnosis is completely free under the MIT-0 license and may be freely downloaded, installed, and used.
Which platforms does Alibabacloud Aes Sysom Pai Diagnosis support?
Alibabacloud Aes Sysom Pai Diagnosis is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.
Who develops Alibabacloud Aes Sysom Pai Diagnosis?
It is developed and maintained by alibabacloud-skills-team (@sdk-team); the current version is v0.0.1.