Alibabacloud Pai Dlc Job
PAI-DLC Deep Learning Job Management
Manage deep learning training jobs on the Alibaba Cloud PAI-DLC (Platform for AI - Deep Learning Containers) service.
Scenario Description
PAI-DLC is a distributed training service provided by Alibaba Cloud's AI Platform PAI, supporting:
- Job Creation and Execution — Create distributed training jobs for TensorFlow, PyTorch, XGBoost, and other frameworks
- Job Monitoring — Get job status, logs, events, and monitoring metrics
- Compute Health Check — Check health status of GPU and other compute devices
- Job Management — Update and stop jobs
- Job Templates — Save reusable CreateJob configurations as templates with multi-version management and field constraints
Architecture: PAI Workspace + DLC Job + Computing Resources (ECS public pay-as-you-go or Lingjun dedicated quota) + AIWorkSpace catalog (images / datasets / code sources / quotas / workspaces).
Installation Requirements
Pre-check: Aliyun CLI >= 3.3.1 required. Run aliyun version to verify that the version is >= 3.3.1. If the CLI is not installed or the version is too low, see references/cli-installation-guide.md for installation instructions. Then [Required] run aliyun configure set --auto-plugin-install true to enable automatic plugin installation.
Note on --user-agent: Every API-invoking aliyun command in this skill MUST include --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job. Client-side helpers (aliyun version, aliyun configure ..., aliyun plugin ..., aliyun <product> --help) do not invoke remote APIs and therefore do not require the flag.
aliyun version
aliyun configure set --auto-plugin-install true
aliyun pai-dlc --help
# JobTemplate (§7.7) requires aliyun-cli-pai-dlc >= 0.3.1.
# If create-job-template --help fails: aliyun plugin update --name aliyun-cli-pai-dlc
aliyun aiworkspace --help >/dev/null 2>&1 || aliyun plugin install --names aliyun-cli-aiworkspace
aliyun configure ai-mode enable
aliyun configure ai-mode set-user-agent --user-agent "AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job"
# After session: aliyun configure ai-mode disable
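The version pre-check above can be scripted. The following is a minimal sketch, not part of the skill itself: version_ge and check_cli_version are hypothetical helper names, and the sketch assumes that aliyun version prints a bare dotted version such as 3.3.1 (pre-release suffixes are not handled).

```shell
# Returns 0 when dotted version $1 is >= $2, comparing numeric fields left to right.
version_ge() {
  have=$1 want=$2
  while [ -n "$have" ] || [ -n "$want" ]; do
    h=${have%%.*}; w=${want%%.*}
    # Drop the leading field (and its dot) for the next round.
    case $have in *.*) have=${have#*.} ;; *) have= ;; esac
    case $want in *.*) want=${want#*.} ;; *) want= ;; esac
    h=${h:-0}; w=${w:-0}
    [ "$h" -gt "$w" ] && return 0
    [ "$h" -lt "$w" ] && return 1
  done
  return 0
}

# Hypothetical wrapper: assumes `aliyun version` prints only the version string.
check_cli_version() {
  current=$(aliyun version 2>/dev/null) || { echo "aliyun CLI not found" >&2; return 1; }
  if version_ge "$current" "3.3.1"; then
    echo "aliyun CLI $current satisfies >= 3.3.1"
  else
    echo "aliyun CLI $current is older than 3.3.1" >&2
    return 1
  fi
}
```

Calling check_cli_version before any other command gives an early, readable failure instead of an obscure plugin error later on.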
Environment Variables
This skill does not require any custom environment variables. Credentials are handled by the Alibaba Cloud CLI configuration (see Authentication below). Optionally:
| Variable | Required | Purpose |
|---|---|---|
| ALIBABA_CLOUD_PROFILE | Optional | Selects a non-default aliyun configure profile |
| ALIBABA_CLOUD_REGION_ID | Optional | Default region when --region is omitted (still recommended to pass --region explicitly) |
Do NOT export ALIBABA_CLOUD_ACCESS_KEY_ID / ALIBABA_CLOUD_ACCESS_KEY_SECRET from
within this session; configure them outside (aliyun configure or shell profile).
Authentication Configuration
Pre-check: Alibaba Cloud Credentials Required
Security Rules:
- NEVER read, echo, or print AK/SK values (e.g., echo $ALIBABA_CLOUD_ACCESS_KEY_ID is FORBIDDEN)
- NEVER ask the user to input AK/SK directly in the conversation or command line
- NEVER use aliyun configure set with literal credential values
- ONLY use aliyun configure list to check credential status
aliyun configure list
Check the output for a valid profile (AK, STS, or OAuth identity).
If no valid profile exists, STOP here.
- Obtain credentials from Alibaba Cloud Console
- Configure credentials outside of this session (via aliyun configure in terminal or environment variables in shell profile)
- Return and re-run after aliyun configure list shows a valid profile
RAM Permissions
[MUST] Permission Failure Handling: When any command or API call fails due to permission errors at any point during execution, follow this process:
- Read references/ram-policies.md to get the full list of permissions required by this SKILL
- Use the ram-permission-diagnose skill to guide the user through requesting the necessary permissions
- Pause and wait until the user confirms that the required permissions have been granted
For detailed permission list, see references/ram-policies.md.
Required Permissions Overview:
| Operation | Required Permission |
|---|---|
| Create Job | pai:CreateJob |
| List Jobs | pai:ListJobs |
| Get Job Details | pai:GetJob |
| Get Pod Logs | pai:GetPodLogs |
| Get Job Events | pai:GetJobEvents |
| Get Job Metrics | pai:GetJobMetrics |
| Update Job | pai:UpdateJob |
| Stop Job | pai:StopJob |
| Create / Read / Update Job Template | paidlc:CreateJobTemplate / paidlc:GetJobTemplate / paidlc:ListJobTemplates / paidlc:UpdateJobTemplate / paidlc:SetJobTemplateDefaultVersion |
| AIWorkSpace Resource Discovery | paiworkspace:ListWorkspaces / paiimage:ListImages,GetImage / paidataset:ListDatasets,GetDataset / paicodesource:ListCodeSources,GetCodeSource |
AIWorkSpace authorization note:
Image / DataSourceId / CodeSourceId / WorkspaceId field values for create-job come from the AIWorkSpace resource-discovery APIs. --resource-id (QuotaId) is manually provided by the user. RAM users MUST hold the corresponding AIWorkSpace-namespaced permissions listed above (do not abbreviate as aiworkspace:*).
Parameter Confirmation
IMPORTANT: Parameter Confirmation — Before executing any command or API call, ALL user-customizable parameters (e.g., RegionId, instance names, CIDR blocks, passwords, domain names, resource specifications, etc.) MUST be confirmed with the user. Do NOT assume or use default values without explicit user approval.
Parameters Requiring User Confirmation
| Parameter | Required | Notes |
|---|---|---|
| --region | Yes | e.g., cn-hangzhou |
| --workspace-id | Yes | From aliyun aiworkspace list-workspaces |
| --job-type | Yes | PyTorchJob, TFJob, RayJob, etc. |
| --display-name | Yes | Meaningful name (project + model + date) |
| --job-specs[].Image | Yes | Verbatim ImageUri from list-images (see §7.6 red line) |
| --user-command | Yes | e.g., python train.py |
| --job-specs[].EcsSpec | Conditional | Public pay-as-you-go (mutually exclusive with ResourceConfig) |
| --resource-id + ResourceConfig | Conditional | Dedicated quota path (mutually exclusive with EcsSpec). User MUST manually provide the QuotaId. |
| --data-sources / --code-source | Optional | From list-datasets / list-code-sources |
| --template-id | Conditional | When creating Job from JobTemplate |
For all parameters: aliyun pai-dlc create-job --help.
Mutual exclusion summary:
- EcsSpec and ResourceConfig are mutually exclusive within a single TaskSpec.
- Uri and DataSourceId within --data-sources[] are mutually exclusive.
- Uri and CodeSourceId within --code-source are mutually exclusive.
For full parameter reference: see references/related-apis.md.
Core Workflows
7.1 Resource Selection Decision Guide
Before calling create-job, determine the resource path:
- Public pay-as-you-go → Use EcsSpec in TaskSpec; do NOT pass --resource-id.
  - Use cases: quick start, testing, no dedicated quota.
  - Example: "EcsSpec": "ecs.gn6i-c4g1.xlarge"
- Dedicated quota (Lingjun / enterprise quota) → Use ResourceConfig in TaskSpec AND pass --resource-id <QuotaId>.
  - Use cases: dedicated resource group, Lingjun smart compute, Spot bidding.
  - Example: --resource-id quotaXXX + "ResourceConfig": {"CPU": "4", "Memory": "8Gi", "GPU": "1"}
EcsSpec and ResourceConfig MUST NOT both appear in the same TaskSpec.
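The mutual-exclusion rule can be checked before create-job is called. The sketch below is a rough textual check, not a JSON parser: resource_path is a hypothetical helper that takes one TaskSpec object as a string and only looks for the quoted key names, so it assumes those names do not appear inside other string values.

```shell
# Prints "public" or "quota" for a single TaskSpec JSON string.
# Fails when the spec names both EcsSpec and ResourceConfig, or neither.
resource_path() {
  spec=$1
  has_ecs=0; has_rc=0
  case $spec in *'"EcsSpec"'*) has_ecs=1 ;; esac
  case $spec in *'"ResourceConfig"'*) has_rc=1 ;; esac
  if [ "$has_ecs" -eq 1 ] && [ "$has_rc" -eq 1 ]; then
    echo "error: EcsSpec and ResourceConfig are mutually exclusive" >&2
    return 1
  elif [ "$has_ecs" -eq 1 ]; then
    echo "public"   # pay-as-you-go path: do NOT pass --resource-id
  elif [ "$has_rc" -eq 1 ]; then
    echo "quota"    # dedicated quota path: --resource-id <QuotaId> is required
  else
    echo "error: TaskSpec needs either EcsSpec or ResourceConfig" >&2
    return 1
  fi
}
```

For example, resource_path '{"Type":"Worker","EcsSpec":"ecs.gn6i-c4g1.xlarge"}' prints public, which signals that --resource-id should be left out of the command.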
Also required before create-job: --job-specs[].Image MUST come from aliyun aiworkspace list-images; --data-sources[].DataSourceId from list-datasets; --code-source.CodeSourceId from list-code-sources. Full discovery flow → see §7.6.
Distributed architecture choices:
| Topology | JobSpecs shape |
|---|---|
| Single-node | One Worker only |
| TFJob PS-Worker | Both PS (CPU) and Worker (GPU) roles |
| PyTorch multi-node | One Worker with PodCount > 1 |
Optional flags: --enable-gang-scheduling true (all-or-nothing scheduling),
Settings.EnableRDMA: true (high-performance network for multi-node GPU),
Settings.EnableSanityCheck: true (GPU health verification).
7.2 Create Training Job
# Minimal single-node PyTorch job (public pay-as-you-go)
aliyun pai-dlc create-job \
  --region <region> \
  --workspace-id <workspace-id> \
  --display-name "my-pytorch-training" \
  --job-type PyTorchJob \
  --job-specs '[{
    "Type": "Worker",
    "PodCount": 1,
    "Image": "<ImageUri-from-aiworkspace-list-images>",
    "EcsSpec": "ecs.gn6i-c4g1.xlarge"
  }]' \
  --user-command 'python train.py' \
  --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
For multi-node topologies, see §7.1. For Spot, RDMA, data mounting parameters, use aliyun pai-dlc create-job --help.
7.3 List / Get Job
# List running jobs (status filter: Creating/Queuing/Running/Succeeded/Failed/Stopped)
aliyun pai-dlc list-jobs \
  --region <region> \
  --status Running \
  --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
# Get job detail
aliyun pai-dlc get-job \
  --region <region> \
  --job-id <job-id> \
  --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
# Get a specific PodId (for log/event queries)
aliyun pai-dlc get-job \
  --region <region> \
  --job-id <job-id> \
  --cli-query "Pods[0].PodId" \
  --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
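When a job has to be followed until it finishes, the get-job call can be wrapped in a polling loop. This is a minimal sketch under stated assumptions: is_terminal_status and wait_for_job are hypothetical helpers, the terminal states are taken from the status filter list above, and the top-level Status field name passed to --cli-query is an assumption about the get-job response.

```shell
# Returns 0 for states that will not change any more.
is_terminal_status() {
  case $1 in
    Succeeded|Failed|Stopped) return 0 ;;
    *) return 1 ;;
  esac
}

# Polls get-job until the job reaches a terminal state.
# Usage: wait_for_job <region> <job-id> [interval-seconds]
wait_for_job() {
  region=$1 job_id=$2 interval=${3:-30}
  while :; do
    # Assumption: the response exposes the job state as a top-level "Status" field.
    status=$(aliyun pai-dlc get-job --region "$region" --job-id "$job_id" \
      --cli-query "Status" \
      --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job) || return 1
    # Strip quotes and whitespace in case the CLI prints a JSON string.
    status=$(printf '%s' "$status" | tr -d '"[:space:]')
    echo "status: $status"
    is_terminal_status "$status" && return 0
    sleep "$interval"
  done
}
```

A polling interval of 30 seconds is an arbitrary default; a longer one keeps API call volume low for jobs that run for hours.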
7.4 Logs, Events, and Metrics
IMPORTANT: Always limit return size: --max-lines 100 for logs, --max-events-num 50 for events.
# Get PodId first, then query logs/events/metrics
POD_ID=$(aliyun pai-dlc get-job --region <region> --job-id <job-id> \
  --cli-query "Pods[0].PodId" --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job)
aliyun pai-dlc get-pod-logs --region <region> --job-id <job-id> --pod-id "$POD_ID" --max-lines 100 --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
aliyun pai-dlc get-pod-events --region <region> --job-id <job-id> --pod-id "$POD_ID" --max-events-num 20 --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
aliyun pai-dlc get-job-events --region <region> --job-id <job-id> --max-events-num 50 --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
aliyun pai-dlc get-job-metrics --region <region> --job-id <job-id> --metric-type GpuCoreUsage --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
Metric types: GpuCoreUsage, GpuMemoryUsage, CpuCoreUsage, MemoryUsage, NetworkInputRate, NetworkOutputRate, DiskReadRate, DiskWriteRate.
Diagnosis order: get-job (status) → get-job-events → get-pod-logs → get-pod-events.
7.5 Compute Health Check
# All sanity check results
aliyun pai-dlc list-job-sanity-check-results \
  --region <region> \
  --job-id <job-id> \
  --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
# Single sanity check result
aliyun pai-dlc get-job-sanity-check-result \
  --region <region> \
  --job-id <job-id> \
  --sanity-check-number 1 \
  --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
7.6 Pre-Create Resource Discovery (AIWorkSpace)
Discovery flow: list-workspaces → list-image-labels →
list-images → list-datasets → list-code-sources → pai-dlc create-job.
Quota (--resource-id): User MUST manually provide the QuotaId. No CLI discovery step.
# Step 1: Pick a workspace (yields --workspace-id)
aliyun aiworkspace list-workspaces \
  --region <region> --page-number 1 --page-size 20 \
  --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
# Step 2: Discover available image labels (MUST run before list-images)
# list-image-labels returns all label Key-Value pairs available in this region.
# Use this to discover valid --labels filters for list-images.
aliyun aiworkspace list-image-labels \
  --region <region> \
  --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
# How to use list-image-labels results:
# - Extract label Keys (e.g. system.chipType, system.framework, system.cudaVersion)
# and their available Values to construct --labels filters
# - Multiple labels can be combined with comma: --labels "key1=val1,key2=val2"
# - Labels format: --labels "Key=Value" (single key-value pair), NOT JSON or spaces
# Step 3: Pick an image (yields WorkerSpec.Image / --job-specs[].Image)
# Labels MUST come from list-image-labels output — NEVER guess or invent label values
# NOTE: Do NOT pass --workspace-id to list-images; official images are global
aliyun aiworkspace list-images \
  --region <region> \
  --labels "<Key1=Value1,Key2=Value2>" \
  --page-size 20 \
  --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
# RED LINE: --job-specs[].Image MUST be a verbatim ImageUri from list-images.
# NEVER invent, rewrite, or copy Name/ImageId instead of ImageUri.
# Step 4: Pick a dataset (yields DataSources[].DataSourceId)
aliyun aiworkspace list-datasets \
  --region <region> --workspace-id <workspace-id> --page-size 20 \
  --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
# Step 5: Pick a code source (yields CodeSource.CodeSourceId)
aliyun aiworkspace list-code-sources \
  --region <region> --workspace-id <workspace-id> --page-size 20 \
  --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
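The --labels value in Step 3 is a comma-joined list of Key=Value pairs. A small helper can assemble it and reject malformed input. This is only a sketch: join_labels is a hypothetical name, and the label keys and values in the usage note are placeholders that must be replaced with real output from list-image-labels.

```shell
# Joins Key=Value arguments with commas, the format expected by --labels.
# Rejects any argument that is not Key=Value or that contains a space.
join_labels() {
  out=
  for pair in "$@"; do
    case $pair in
      *' '*) echo "error: '$pair' contains a space" >&2; return 1 ;;
      ?*=?*) ;;
      *) echo "error: '$pair' is not Key=Value" >&2; return 1 ;;
    esac
    out=${out:+$out,}$pair
  done
  printf '%s\n' "$out"
}
```

For instance, join_labels system.chipType=GPU system.framework=PyTorch yields a single string that can be passed as --labels "$(join_labels ...)"; whether those two example values exist in a given region is not guaranteed.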
Red line (also applies in Section 7.7): Do NOT fall back to ROA generic invocations (--pathPattern / --method GET|POST|PUT|DELETE) when a plugin is missing or returns an error. Install/upgrade the plugin instead.
Field-mapping, full parameters, and error codes: see references/related-apis.md and references/verification-method.md.
7.7 JobTemplate Management (Reusable Templates)
JobTemplate stores a CreateJob configuration (JobSpecs, UserCommand,
DataSources, etc.) as a versioned, reusable template. The subcommands listed
here are exposed by aliyun-cli-pai-dlc >= 0.3.1:
create-job-template, get-job-template, list-job-templates,
update-job-template, set-job-template-default-version. A Job can be launched from a template via
aliyun pai-dlc create-job --template-id <id>.
Constraints format: When passing --constraints, use escaped-quote JSON: --constraints '{\"JobSpecs[0].Image\":\"locked\",\"UserCommand\":\"locked\"}'.
For full CRUD workflow, Constraints semantics, JSONPath rules, and pitfalls, see references/job-template-management.md.
7.8 Job Lifecycle Management (Stop / Update / Web Terminal)
Stop is a high-risk operation. Before proceeding, query status with
get-job, present the result to the user, and require explicit confirmation.
- Stop Job: applicable only when status is Running or Queuing.
For the full pre-check + confirmation + execution templates, plus the
update-job low-risk path and get-web-terminal / get-token sharing
commands, see references/job-management.md.
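The pre-check and confirmation steps can be expressed as a gate that runs before any stop command. The sketch below is illustrative only: can_stop and confirm_stop are hypothetical helpers, the status is assumed to have been fetched with get-job beforehand, and the gate does not stop anything itself. The actual execution template remains the one in references/job-management.md.

```shell
# Returns 0 only for the states in which Stop applies.
can_stop() {
  case $1 in
    Running|Queuing) return 0 ;;
    *) return 1 ;;
  esac
}

# Shows the current status and requires the user to type "yes".
# Usage: confirm_stop <job-id> <status-from-get-job>
confirm_stop() {
  job_id=$1 status=$2
  if ! can_stop "$status"; then
    echo "Job $job_id is $status; Stop does not apply." >&2
    return 1
  fi
  printf 'Job %s is %s. Type "yes" to stop it: ' "$job_id" "$status"
  read -r answer
  [ "$answer" = "yes" ]
}
```

Only an exact yes passes the gate, so an empty line or a typo leaves the job untouched.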
7.9 ECS Spec Discovery
Discover available instance types before choosing EcsSpec in --job-specs.
Results from list-ecs-specs provide the exact EcsSpec value to use.
# GPU public pay-as-you-go instances
aliyun pai-dlc list-ecs-specs \
  --region <region> \
  --accelerator-type GPU \
  --resource-type ECS \
  --sort-by GPU \
  --order desc \
  --page-size 20 \
  --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
# Lingjun dedicated instances
# Note: --quota-id is only available for whitelisted users
Copy the returned EcsSpec value verbatim into --job-specs[].EcsSpec.
For full parameters see aliyun pai-dlc list-ecs-specs --help.
7.10 TensorBoard Management
TensorBoard visualizes training metrics. Seven subcommands under aliyun pai-dlc:
create-tensorboard, list-tensorboards, get-tensorboard, start-tensorboard,
stop-tensorboard, update-tensorboard, get-tensorboard-shared-url.
--job-id and --data-sources are mutually exclusive in create.
# Create from a job (most common)
aliyun pai-dlc create-tensorboard \
  --region <region> \
  --job-id <job-id> \
  --display-name "my-training-tb" \
  --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
# Create from a dataset summary path
aliyun pai-dlc create-tensorboard \
  --region <region> \
  --data-sources '[{"DataSourceId":"<dataset-id>","MountPath":"/mnt/logs"}]' \
  --summary-path /mnt/logs \
  --display-name "dataset-tb" \
  --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
For full parameters and lifecycle, see references/related-apis.md TensorBoard section.
7.11 Dashboard & Ray Dashboard
Both get-dashboard and get-ray-dashboard return a URL only for RayJob type
jobs. For non-Ray jobs, the response is empty.
# Generic DLC dashboard (RayJob only)
aliyun pai-dlc get-dashboard \
  --region <region> \
  --job-id <job-id> \
  --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
# Ray-specific dashboard with optional sharing
aliyun pai-dlc get-ray-dashboard \
  --region <region> \
  --job-id <job-id> \
  --is-shared true \
  --token <sharing-token> \
  --user-agent AlibabaCloud-Agent-Skills/alibabacloud-pai-dlc-job
For shared access, first obtain a token via get-token --target-type job,
then pass it to get-ray-dashboard --token <token> --is-shared true.
Success Verification Method
For step-by-step end-to-end verification scripts (resource discovery → CreateJob → log query → cleanup, plus JobTemplate CRUD verification), see references/verification-method.md.
Quick verification:
- get-job → Status should be Creating / Queuing / Running shortly after create-job returns.
- list-jobs --status Running → Should return the freshly created Job until it finishes or is stopped.
- get-pod-logs → Should return non-empty log content once the Pod is past EnvPreparing.
Command Tables
A flat list of every CLI command used by this skill (Product / Command / Description) is in references/related-commands.md.
Best Practices
- Job Naming Convention — Use meaningful names containing project, model, and date, e.g., resnet50-imagenet-20260320.
- Resource Configuration Optimization — Choose appropriate GPU type and quantity based on model size and dataset size.
- Log Monitoring — Periodically check logs and events to detect failures early.
- Priority Management — Set higher priority for critical jobs (1-9, 9 highest).
- Cost Control — Spot instances reduce cost at the risk of preemption; use --job-max-running-time-minutes as an auto-stop guard for any long-running experiment.
- Health Check — Enable Settings.EnableSanityCheck: true to verify GPU devices before training starts.
- Resource Cleanup — Stop completed jobs promptly to free resource quotas.
- Template Reuse — Capture standardized training pipelines as JobTemplates; mark Image / DataSources as locked and UserCommand / Envs as overridable so consumers focus on business parameters via create-job --template-id.
- TensorBoard Monitoring — Attach a TensorBoard instance to training jobs for real-time metric visualization.
- ECS Spec Discovery — Run list-ecs-specs --accelerator-type GPU before choosing EcsSpec to confirm which instance types are available in the region.
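The naming convention in the first item can be applied mechanically. A minimal sketch, assuming a hypothetical helper named job_name that defaults the date part to the current day in YYYYMMDD form:

```shell
# Builds a display name of the form <project>-<model>-<date>.
# Usage: job_name <project> <model> [YYYYMMDD]
job_name() {
  project=$1 model=$2 day=${3:-$(date +%Y%m%d)}
  printf '%s-%s-%s\n' "$project" "$model" "$day"
}
```

The result can be passed straight to --display-name, for example --display-name "$(job_name resnet50 imagenet)".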
Reference Links
| Reference Document | Description |
|---|---|
| references/related-apis.md | Complete API and CLI command reference |
| references/related-commands.md | Flat list of all CLI commands |
| references/ram-policies.md | RAM permission policy details |
| references/verification-method.md | End-to-end verification scripts |
| references/job-management.md | High-risk Stop/Delete/Update flow + Web Terminal |
| references/job-template-management.md | JobTemplate CRUD + Constraints + version management |
| references/acceptance-criteria.md | Skill testing acceptance criteria |
| references/cli-installation-guide.md | CLI installation guide |
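- Make sure OpenClaw is installed (locally or via Docker).
- Enter the install command in the chat box: /install alibabacloud-pai-dlc-job
- After installation, invoke the Skill by name or trigger it with /alibabacloud-pai-dlc-job
- Provide the required inputs described in the Skill's parameter documentation to get structured output.
What is Alibabacloud Pai Dlc Job?
Alibaba Cloud PAI-DLC (Deep Learning Containers) job management skill. Use for creating, managing, and monitoring DLC training jobs and managing reusable job... It is an AI Agent Skill plugin for Claude Code / OpenClaw and has been downloaded 20 times so far.
How do I install Alibabacloud Pai Dlc Job?
Run the command "/install alibabacloud-pai-dlc-job" in the OpenClaw or Claude Code chat box for a one-step install; no extra configuration is needed.
Is Alibabacloud Pai Dlc Job free?
Yes. Alibabacloud Pai Dlc Job is completely free and is released under the MIT-0 license, so it can be downloaded, installed, and used freely.
Which platforms does Alibabacloud Pai Dlc Job support?
Alibabacloud Pai Dlc Job is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.
Who develops Alibabacloud Pai Dlc Job?
It is developed and maintained by alibabacloud-skills-team (@sdk-team); the current version is v0.0.1-beta.1.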