← 返回 Skills 市场
sdk-team

Alibabacloud Pai Dlc Job Diagnostics

作者 alibabacloud-skills-team · GitHub ↗ · v0.0.1 · MIT-0
cross-platform ✓ 安全检测通过
33
总下载
0
收藏
0
当前安装
1
版本数
在 OpenClaw 中安装
/install alibabacloud-pai-dlc-job-diagnostics
功能描述
PAI-DLC job diagnostics and health inspection. Queuing-stuck root cause analysis, failed-job localization, cluster health checks. Companion to the `alibabacl...
使用说明 (SKILL.md)

PAI-DLC Job Diagnostics and Health Inspection

Read-only diagnostic analysis for PAI-DLC distributed training jobs, covering three scenarios:

  • Queuing-stuck root cause analysis — quota check, node scheduling diagnosis, hyper-node availability
  • Failed-job localization — failure classification, logs/events evidence chain, root cause identification
  • Cluster health inspection — training throughput, hang detection, SanityCheck, restart stability

Architecture: PAI-DLC Job (read-only queries) + PAI Studio Resource Diagnosis API (queuing scenario).

0. Dependencies

This skill performs read-only diagnostics only. All write operations (create / update / stop jobs, resource discovery, etc.) live in the companion skill alibabacloud-pai-dlc-job. The two skills are complementary in responsibility and share a common field contract.

Prerequisite Skill Role When to switch to it
alibabacloud-pai-dlc-job Write ops (create/update/stop) + AIWorkSpace resource discovery Creating / modifying / stopping jobs, or discovering Image / Dataset / CodeSource
This skill Read-only diagnostics (logs / events / sanity-check / queuing root cause) Job already exists — troubleshooting or health inspection

Discover and install the prerequisite skill:

# Discover available skills
npx skills add aliyun/alibabacloud-aiops-skills --skill alibabacloud-find-skills
# Install the alibabacloud-pai-dlc-job skill itself
npx skills add aliyun/alibabacloud-aiops-skills --skill alibabacloud-pai-dlc-job

Cross-skill field contract: The --job-id / --pod-id values this skill consumes are produced verbatim by alibabacloud-pai-dlc-job via list-jobs / get-job --cli-query "Pods[0].PodId" — no transformation needed. --region / --workspace-id follow the same resolution rules in both skills.

Installation Requirements

Pre-check: Aliyun CLI >= 3.3.1 required Run aliyun version to verify >= 3.3.1. If not installed or version too low, see references/cli-installation-guide.md. Then [MUST] run aliyun configure set --auto-plugin-install true.

Note on --user-agent: Every API-invoking aliyun command in this skill MUST include --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job-diagnostics. Client-side helpers (aliyun version, aliyun configure ..., aliyun plugin ..., aliyun \x3Cproduct> --help) do not invoke remote APIs and therefore do not require the flag.

aliyun version
aliyun configure set --auto-plugin-install true
aliyun plugin update
aliyun pai-dlc --help
aliyun paistudio --help >/dev/null 2>&1 || aliyun plugin install --names aliyun-cli-paistudio

aliyun configure ai-mode enable
aliyun configure ai-mode set-user-agent --user-agent "AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job-diagnostics"
# After session: aliyun configure ai-mode disable

Authentication

Pre-check: Alibaba Cloud Credentials Required

Security Rules:

  • NEVER read, echo, or print AK/SK values
  • NEVER ask the user to input AK/SK directly
  • ONLY use aliyun configure list to check credential status
aliyun configure list

Check the output for a valid profile (AK, STS, or OAuth identity).

If no valid profile exists, STOP here.

  1. Obtain credentials from Alibaba Cloud Console
  2. Configure credentials outside of this session
  3. Return and re-run after aliyun configure list shows a valid profile

RAM Permissions

[MUST] Permission Failure Handling: When any command fails due to permission errors:

  1. Read references/ram-policies.md for the full permission list
  2. Use ram-permission-diagnose skill to guide the user
  3. Pause and wait until the user confirms permissions have been granted
Product Permissions Purpose
pai-dlc pai:GetJob, pai:GetPodLogs, pai:GetJobEvents, pai:GetPodEvents, pai:ListJobSanityCheckResults Job information collection
paistudio paistudio:GetQuotaWorkloadDiagnosis Queuing resource diagnosis

Parameter Confirmation

IMPORTANT: Parameter Confirmation — Before executing any command, ALL user-customizable parameters (RegionId, JobId, etc.) MUST be confirmed with the user.

Parameter Required Description
region Yes Region where the job runs
job_id Yes DLC job ID (e.g., dlcXXX)

Entry Routing

When a diagnostic request arrives, first call get-job to fetch job status, then route by status:

Job status Route to scenario
Queuing / Creating → Queuing-stuck root cause analysis
Failed → Failed-job localization
Running → Health inspection
Stopped Inform the user "job was actively stopped", no diagnosis
Succeeded → Historical review (follow Scenario 3 Execution steps)

Edge case — job was queuing but is now Stopped/Succeeded: If the user describes the job as "stuck in queue" but get-job shows Stopped or Succeeded, still route to Scenario 1 (queuing analysis) but expect the resource diagnosis API to return HTTP 400. Follow the "Fallback on API Failure" procedure in Scenario 1.

Users may also directly request a specific scenario (e.g., "run a health inspection" even when status is not Running).


Diagnostic Toolbox

PAI-DLC Read-Only Commands

aliyun pai-dlc get-job --region \x3Cr> --job-id \x3Cid> \
  --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job-diagnostics
aliyun pai-dlc get-job-events --region \x3Cr> --job-id \x3Cid> --max-events-num 50 \
  --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job-diagnostics
aliyun pai-dlc get-pod-events --region \x3Cr> --job-id \x3Cid> --pod-id \x3Cpod> --max-events-num 20 \
  --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job-diagnostics
aliyun pai-dlc get-pod-logs --region \x3Cr> --job-id \x3Cid> --pod-id \x3Cpod> --max-lines 100 \
  --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job-diagnostics
aliyun pai-dlc list-job-sanity-check-results --region \x3Cr> --job-id \x3Cid> \
  --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job-diagnostics
aliyun pai-dlc get-job-sanity-check-result --region \x3Cr> --job-id \x3Cid> --sanity-check-number 1 \
  --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job-diagnostics

PAI Studio Resource Diagnosis (queuing scenario only)

aliyun paistudio GET /api/v1/quotas/{quota_id}/workloads/{job_id}/diagnosis \
  --region \x3Cr> --header "Content-Type=application/json" --force \
  --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job-diagnostics

Hard constraint: quota_id MUST come from get-job's ResourceId field. If ResourceId is empty (public pay-as-you-go), this API is unavailable.

Full API structure: see references/resource-diagnosis-api.md.


Scenario 1: Queuing-Stuck Root Cause Analysis

Trigger: job status = Queuing / Creating and user reports it cannot be scheduled.

Tools: get-jobpaistudio resource diagnosis → (optional) get-job-events.

Hard constraints:

  • ResourceId empty → resource diagnosis unavailable; mine events for clues
  • ResourceId non-empty → resource diagnosis API is the primary instrument

CRITICAL: pai-dlc vs paistudio — two different products

Product Scope Commands
pai-dlc Job/Pod lifecycle (GetJob, GetJobEvents, ListJobs, GetPodLogs) aliyun pai-dlc get-job ...
paistudio Platform-level services including resource diagnosis aliyun paistudio GET /api/v1/quotas/...

The resource diagnosis API belongs to paistudio, NOT pai-dlc. Do NOT call pai-dlc GetResourceQuota, pai-dlc ListResourceQuotas, or any pai-dlc GET /api/v1/resourcequotas/... — these are wrong APIs and will fail. The correct command is:

aliyun paistudio GET /api/v1/quotas/{quota_id}/workloads/{job_id}/diagnosis ...

Pattern knowledge: resource diagnosis returns 4 checks (self_quota / ancestor_quota / user_limit / queue_strategy), plus node scheduling and hyper-node analysis. Common patterns: references/diagnostic-patterns.md §1.

Agent latitude: decide whether to compute the quota gap, whether to pull events for corroboration, and how verbose the report should be.

Fallback on API Failure

The PAI Studio resource diagnosis API may fail with HTTP 400/404 when the job is no longer in an active queuing state (e.g., already Stopped by user). In this case:

  1. Explicitly declare the API failure in the report: "Resource diagnosis API unavailable: HTTP {code} — {error message}"
  2. Perform qualitative analysis using only these data sources:
    • get-jobResourceRequest (GPU/CPU/Memory demand per pod)
    • get-jobPodCount × per-pod resources = total demand
    • get-jobEcsSpec (instance type and its per-node capacity)
    • get-job-events → scheduling event timeline and queuing duration
  3. Prohibited language: NEVER use hedging words such as "possibly", "might", "perhaps", "may not be sufficient". State only confirmed facts:
    • Total GPU demand = PodCount × RequestGPU = {computed value}
    • Instance type = {EcsSpec}
    • Queuing duration = {computed hours/minutes}
    • Job final status = {Stopped/Succeeded/etc.}
  4. Mandatory output structure when API is unavailable:
## Resource Diagnosis (API Unavailable — Configuration-Based Analysis)

- Diagnosis API: unavailable (HTTP {code}: {message})
- Resource demand: {PodCount} pods × {GPU/pod} GPU = {total} GPU cards
- Instance type: {EcsSpec}
- Queuing duration: {hours}h {minutes}m
- Quota ID: {ResourceId}
- Job final status: {status} ({ReasonCode})
- Conclusion: The job requested {total} GPU cards which could not be
  fulfilled within the queuing window before the job was {stopped/completed}.

Scenario 2: Failed-Job Localization

Trigger: job status = Failed.

Tools: get-job → (as needed) get-job-events / get-pod-events / get-pod-logs.

Hard constraints:

  • Stopped is not a failure — do not diagnose
  • [OUTPUT GUARD — RECOMMENDATIONS STRICTLY FORBIDDEN] When the failure falls into any of the categories below, the diagnostic report MUST strictly omit any "Recommendations" / "Suggested fixes" / "Next steps" / "Solution" section, and output only the root cause and objective facts:
    • ResourceAllocateFailed (insufficient resources)
    • Job preempted / evicted (ReasonMessage contains preempted or evicted)
    • Spot instance reclamation
    • Report MUST end on the diagnostic conclusion. No sentence anywhere in the output may instruct the user to change configuration, request more quota, switch instance types, add retry logic, or modify any job parameter.
    • Negative examples (ALL forbidden):
      • "Recommend using pay-as-you-go instances"
      • "Consider requesting more quota"
      • "Try switching to another zone"
      • "Disable preemptible jobs by setting EnablePreemptibleJob: false"
      • "Implement checkpointing and retry logic"
      • "Solution: Use guaranteed quota instead of oversold quota"
    • Positive examples (allowed):
      • "Root cause: spot instance reclaimed by the cloud platform, triggering pod eviction and job failure."
      • "The job failed due to preemption/eviction of pods. ReasonCode: JobFailed."
    • Only permitted user-facing suggestion: "For quota policy adjustments, please contact your platform administrator."

Mandatory output template for preemption/eviction/resource-shortage failures:

## Diagnosis Conclusion

- Failure reason: {classification} ({ReasonCode}: {ReasonMessage})
- Affected pods: {pod list with SubStatus}
- Timeline: {key timestamps from events}
- Evidence: {quoted ReasonMessage or event details}

For quota policy or resource allocation adjustments, please contact your
platform administrator.

Pattern knowledge: failure-classification priority (network > image > runtime > resource > config > system), exit-code meanings, keyword-matching rules. See references/diagnostic-patterns.md §2.

Agent latitude: when ReasonCode is clear, logs may be unnecessary; when logs already explain the issue, events may be unnecessary. Decide investigation depth based on information sufficiency.


Scenario 3: Health Inspection

Trigger: job status = Running and user requests inspection / health check.

Tools: get-job + get-job-events + get-pod-logs

  • list-job-sanity-check-results.

Execution steps:

  1. get-job → obtain status, WorkspaceId, Pod list, whether EnableSanityCheck is set
  2. get-job-events → event-chain analysis (scheduling, restarts, etc.)
  3. get-pod-logs → target the master pod (rank=0) first; extract structured training metrics if present
  4. list-job-sanity-check-results → execute only if EnableSanityCheck=true in job settings
  5. Generate console links (MANDATORY — must always output both links):
    • Job overview: https://pai.console.aliyun.com/?regionId={region}&workspaceId={workspace_id}#/dlc/jobs/{job_id}/overview
    • Monitoring dashboard: https://pai.console.aliyun.com/?regionId={region}&workspaceId={workspace_id}#/dlc/jobs/{job_id}/monitor
    • Mandatory closing notice (append verbatim to every inspection report):
      > GPU/memory real-time resource utilization metrics require the monitoring
      > dashboard above. This skill's CLI commands do not support fetching
      > runtime utilization data directly.
      
    This step is mandatory regardless of job status (Running, Succeeded, or any other state). Even if the job has already completed, the user needs the monitoring link to review historical resource utilization.

Dimension matrix (mandatory vs optional):

Dimension Mandatory/Optional Notes
Training throughput Optional Extract from master pod logs
Hang detection Running only Skip for Succeeded jobs
Hardware health Optional Requires EnableSanityCheck=true
Restart stability Mandatory Read RestartCount directly from get-job

Note on resource utilization: GPU/memory metrics are NOT available via this skill's CLI commands. The monitoring dashboard link (step 5) and its closing notice are mandatory in every inspection report — do not omit them.

Interpretation rules per dimension: see references/healthcheck-dimensions.md.

Agent latitude: kilo-card jobs focus on Hang + SanityCheck; small jobs look at training throughput from logs. Decide depth and report verbosity from scale and user intent.


Best Practices

  1. get-job first, route second — never assume status; query and then route
  2. Cap response size--max-lines 100 / --max-events-num 50 to avoid context blow-up
  3. Find the problem pod — among get-job Pods, focus on Status=Failed/Unknown or those with ReasonMessage
  4. Read log tail — errors usually live in the last few dozen lines; no need to pull the whole log
  5. Exit codes are clues, not conclusionsexit 137 may be OOM or external kill; combine with context
  6. Don't over-diagnose — when ReasonCode already states the cause, skip the full log dump
  7. Console links — generate both overview and monitoring links so the user can jump into the console for details and resource utilization metrics
  8. Read-only — this skill MUST NEVER execute stop / update / create
  9. Report summary — present a per-dimension rating table at the end of the report:
Dimension Rating Key Finding
Training throughput ✅/⚠️/❌/N/A ...
Hang detection ✅/⚠️/❌/N/A ...
Hardware health ✅/⚠️/❌/N/A ...
Restart stability ✅/⚠️/❌ ...

Reference Links

Document Contents
references/resource-diagnosis-api.md PAI Studio resource diagnosis API full reference
references/diagnostic-patterns.md Failure pattern knowledge base (queuing / failure / hang / restart)
references/healthcheck-dimensions.md Health inspection dimensions and interpretation rules
references/ram-policies.md RAM permission policies
references/related-commands.md CLI command quick reference
references/acceptance-criteria.md Acceptance criteria
references/verification-method.md Verification method
references/cli-installation-guide.md CLI installation guide
安全使用建议
Install only if you intend to diagnose Alibaba Cloud PAI-DLC jobs. Use a dedicated least-privilege RAM user or temporary credentials with the listed read-only permissions, avoid pasting raw access keys into the agent session, review persistent Aliyun CLI settings after use, and redact job metadata or pod logs before sharing outputs.
能力标签
requires-walletrequires-oauth-tokenrequires-sensitive-credentials
能力评估
Purpose & Capability
The stated purpose is PAI-DLC job diagnostics, and the main workflow uses read-only job, event, pod-log, sanity-check, and resource-diagnosis APIs with explicit no-create/update/stop constraints.
Instruction Scope
The trigger phrases include generic terms like diagnose and health check, but the skill also requires confirmation of region, job ID, and other user parameters before running commands.
Install Mechanism
The skill asks users to install or update Aliyun CLI plugins and references a broad CLI setup guide; this is disclosed but broader than the minimum diagnostics path.
Credentials
Cloud credentials and job logs are sensitive, but the requested RAM policy is read-only and aligned to diagnostics; users should avoid broad account credentials.
Persistence & Privilege
Aliyun CLI configuration, installed plugins, auto-plugin-install, and ai-mode settings can persist locally; the skill notes disabling ai-mode after the session, but users should review persistent CLI settings.
如何使用
  1. 确保已安装 OpenClaw(本地或 Docker 部署)
  2. 在对话框中输入安装命令:/install alibabacloud-pai-dlc-job-diagnostics
  3. 安装完成后,直接呼叫该 Skill 的名称或使用 /alibabacloud-pai-dlc-job-diagnostics 触发
  4. 根据 Skill 的参数说明提供必要输入,即可获得结构化输出
版本历史
v0.0.1
Initial release of alibabacloud-pai-dlc-job-diagnostics skill. - Provides read-only diagnostics for PAI-DLC distributed training jobs. - Supports queuing-stuck root cause analysis, failed-job localization, and cluster health inspection. - Offers routing based on job status to appropriate diagnostic scenario. - Designed to complement write operations in the `alibabacloud-pai-dlc-job` skill. - Requires confirmation of all user-customizable parameters before execution. - Comprehensive installation, authentication, and permission instructions.
元数据
Slug alibabacloud-pai-dlc-job-diagnostics
版本 0.0.1
许可证 MIT-0
累计安装 0
当前安装数 0
历史版本数 1
常见问题

Alibabacloud Pai Dlc Job Diagnostics 是什么?

PAI-DLC job diagnostics and health inspection. Queuing-stuck root cause analysis, failed-job localization, cluster health checks. Companion to the `alibabacl... 它是一个面向 Claude Code / OpenClaw 的 AI Agent Skill 插件,目前累计下载 33 次。

如何安装 Alibabacloud Pai Dlc Job Diagnostics?

在 OpenClaw 或 Claude Code 对话框中运行命令「/install alibabacloud-pai-dlc-job-diagnostics」即可一键安装,无需额外配置。

Alibabacloud Pai Dlc Job Diagnostics 是免费的吗?

是的,Alibabacloud Pai Dlc Job Diagnostics 完全免费,采用 MIT-0 许可证,可自由下载、安装和使用。

Alibabacloud Pai Dlc Job Diagnostics 支持哪些平台?

Alibabacloud Pai Dlc Job Diagnostics 跨平台运行,可在任意部署了 OpenClaw / Claude Code 的环境中使用(cross-platform)。

谁开发了 Alibabacloud Pai Dlc Job Diagnostics?

由 alibabacloud-skills-team(@sdk-team)开发并维护,当前版本 v0.0.1。

💬 留言讨论