Description

Operates remote model training jobs on AutoDL Linux servers over SSH. Use when starting a training run, checking whether training is still alive, reviewing G...

README (SKILL.md)

Operating AutoDL Training

Name: autodl-train
Author: zhuoran-liu

Use this skill for remote training operations on an AutoDL Linux server. It is designed for high-frequency workflows around "start training, watch progress, inspect resources, read logs, diagnose failures, and decide what to do next" while keeping execution constrained to one configured project directory.

What This Skill Does

Starts a configured training command in the target project directory over SSH.
Activates the remote Python environment with Conda or virtualenv fallbacks.
Checks whether training is still running by combining process, GPU, and log freshness signals.
Summarizes GPU, CPU, memory, and disk pressure instead of dumping raw command output.
Reads recent logs and extracts likely metrics such as epoch, step, loss, lr, grad_norm, val_loss, accuracy, mAP, and F1.
Detects common training failures such as CUDA OOM, NCCL errors, NaN, disk full, timeout, and segmentation faults.
Produces a human-readable training summary and recommends whether to continue, tune, or resume from a checkpoint.
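
As a rough illustration of the metric extraction described above, the sketch below pulls `name=value` or `name: value` pairs out of a single log line. The function name `extract_metrics` and the regular expression are assumptions made for this example; the real parser in scripts/log_utils.py may work differently and may support other log formats.

```python
import re

# Metric names taken from the list above; the matching rules are an assumption.
METRIC_NAMES = ("epoch", "step", "loss", "lr", "grad_norm",
                "val_loss", "accuracy", "mAP", "F1")

_NUMBER = r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?"
_PATTERN = re.compile(
    # The lookbehind stops "loss" from matching inside "val_loss".
    r"(?<![A-Za-z0-9_])("
    + "|".join(re.escape(name) for name in METRIC_NAMES)
    + r")\s*[:=]\s*(" + _NUMBER + r")"
)


def extract_metrics(line: str) -> dict[str, float]:
    """Return every recognized metric found in one log line."""
    return {name: float(value) for name, value in _PATTERN.findall(line)}
```

Lines that report metrics in another shape (for example `epoch 3/10`) are simply ignored by this sketch, so an empty result means "nothing recognized", not "no progress".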

Required Inputs

Collect or confirm these values before running any script:

host: AutoDL server hostname or IP.
port: SSH port, usually 22.
username: Remote Linux username.
project_path: Absolute project directory on the remote server, for example /root/autodl-tmp/your-project.
One environment option: env_name, env_activate, or venv_path.
train_command: The training launch command, such as python train.py, python -m torch.distributed.run ..., or bash scripts/train.sh.
Optional password mode: provide AUTOCLAW_TRAIN_SSH_PASSWORD as an environment variable or local .env file when SSH key login is not available.

Prefer a config file: copy config.example.json to a real file such as config.json and fill it in. Alternatively, set environment variables based on .env.example.
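
A minimal pre-flight check for the inputs listed above might look like the following. It assumes the JSON keys carry the same names as the fields in this section, which has not been verified against config.example.json, and `find_config_problems` is a hypothetical helper rather than part of the skill.

```python
# Assumption: config keys use the field names listed under "Required Inputs".
REQUIRED = ("host", "port", "username", "project_path", "train_command")
ENV_OPTIONS = ("env_name", "env_activate", "venv_path")


def find_config_problems(config: dict) -> list[str]:
    """Return human-readable problems; an empty list means the config looks usable."""
    problems = [f"missing: {key}" for key in REQUIRED if not config.get(key)]
    if not any(config.get(key) for key in ENV_OPTIONS):
        problems.append("missing: one of " + ", ".join(ENV_OPTIONS))
    path = config.get("project_path")
    if path and not str(path).startswith("/"):
        problems.append("project_path must be absolute")
    return problems
```

Reporting every problem at once, instead of failing on the first one, makes it easier to ask the user for all missing values in a single question.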

Safety Rules

Only operate inside the configured project_path.
Do not invent missing SSH credentials or secrets.
Do not write plaintext passwords into files.
Prefer SSH keys or environment variables.
Refuse obviously destructive launch commands such as rm -rf, reboot, shutdown, mkfs, or fork bombs.
Do not kill unrelated processes or run global destructive recovery commands.
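
Two of the rules above, staying inside project_path and refusing destructive commands, can be sketched as simple checks. The helper names and the fragment list are illustrative assumptions; the actual checks in scripts/common.py were not available for inspection and may be stricter.

```python
import posixpath
from typing import Optional

# Illustrative fragments based on the rule above; not the skill's real list.
BLOCKED_FRAGMENTS = ("rm -rf", "reboot", "shutdown", "mkfs", ":(){")


def is_inside_project(project_path: str, candidate: str) -> bool:
    """True if candidate (relative or absolute) resolves inside project_path."""
    root = posixpath.normpath(project_path)
    target = posixpath.normpath(posixpath.join(root, candidate))
    return target == root or target.startswith(root.rstrip("/") + "/")


def find_blocked_fragment(command: str) -> Optional[str]:
    """Return the first blocked fragment found in command, or None."""
    normalized = " ".join(command.lower().split())  # collapse repeated whitespace
    for fragment in BLOCKED_FRAGMENTS:
        if fragment in normalized:
            return fragment
    return None
```

Plain substring matching is deliberately crude: it can reject harmless commands (a script whose name happens to contain "reboot") and it will not catch every destructive command, so it complements careful review and does not replace it. The path check is purely lexical and does not follow symlinks on the remote host.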

Workflow

1. Confirm Configuration

Read config.example.json and references/usage.md to understand the expected fields. Ask the user for any missing values instead of guessing.

2. Start Or Resume Training

Run scripts/remote_train.py to start a background job or build a resume command:

python scripts/remote_train.py --config config.json
python scripts/remote_train.py --config config.json --resume-from outputs/checkpoints/last.ckpt

Use this when the user asks to launch training, re-launch after interruption, or resume from a checkpoint.
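
How the script templates a resume command is not documented here. One plausible approach, shown purely as a sketch, is a command template with a `{checkpoint}` placeholder that is filled in with a shell-quoted path. Both the placeholder convention and the `--resume` flag in the usage note are assumptions; your training script may expect a different argument.

```python
import shlex


def build_resume_command(template: str, checkpoint: str) -> str:
    """Fill a hypothetical {checkpoint} placeholder with a shell-quoted path."""
    if "{checkpoint}" not in template:
        raise ValueError("template has no {checkpoint} placeholder")
    # shlex.quote keeps paths with spaces or shell metacharacters as one argument.
    return template.replace("{checkpoint}", shlex.quote(checkpoint))
```

For example, `build_resume_command("python train.py --resume {checkpoint}", "outputs/checkpoints/last.ckpt")` yields a single explicit command string that can be shown to the user before it is launched.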

3. Check Live Status

Run scripts/check_status.py when the user asks whether training is still running:

python scripts/check_status.py --config config.json

This script combines process matching, nvidia-smi, and recent log updates to classify the run as running, stopped, failed, or unknown.
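
The exact rules used to combine the three signals are not spelled out in this document. The sketch below shows one way such a classification could work; the function name, the 10-minute staleness threshold, and the precedence of the rules are all assumptions.

```python
from typing import Optional


def classify_run(process_found: bool, gpu_busy: bool,
                 log_age_seconds: Optional[float], failure_in_log: bool,
                 stale_after_seconds: float = 600.0) -> str:
    """Combine process, GPU, and log-freshness signals into one status label.

    log_age_seconds is None when no log file could be found.
    """
    log_fresh = (log_age_seconds is not None
                 and log_age_seconds <= stale_after_seconds)
    if process_found and (gpu_busy or log_fresh):
        return "running"
    if not process_found and failure_in_log:
        return "failed"
    if not process_found and not gpu_busy and log_age_seconds is not None:
        return "stopped"
    # Signals disagree or are missing, so avoid guessing.
    return "unknown"
```

Requiring at least two agreeing signals before answering "running" guards against a matching process that has hung, and returning "unknown" when the evidence conflicts is safer than a confident wrong answer.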

4. Inspect Resource Pressure

Run scripts/monitor_resources.py to summarize GPU/CPU/memory/disk usage:

python scripts/monitor_resources.py --config config.json

Use the human-readable bottleneck assessment in the output instead of pasting raw command output unless the user asks for raw data.
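
To give a sense of what a bottleneck assessment can look like, here is a small sketch that turns utilization percentages into hints. The thresholds and wording are assumptions chosen for illustration and are not taken from scripts/monitor_resources.py.

```python
def assess_bottlenecks(gpu_util: float, gpu_mem: float, cpu: float,
                       ram: float, disk: float) -> list[str]:
    """Map utilization percentages (0-100) to short human-readable hints."""
    hints = []
    if disk >= 95:
        hints.append("disk almost full: checkpoints and logs may fail to write")
    if gpu_mem >= 95:
        hints.append("GPU memory almost full: CUDA OOM risk")
    if ram >= 90:
        hints.append("host memory pressure: data loader workers may be killed")
    if gpu_util < 30 and cpu >= 90:
        hints.append("GPU underused while CPU is saturated: "
                     "input pipeline is a likely bottleneck")
    return hints or ["no obvious bottleneck"]
```

A single sample can mislead (GPU utilization dips between steps and during validation), so hints like these are more trustworthy when they persist across several readings.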

5. Read Logs And Summaries

Run scripts/summarize_log.py in one of these modes:

python scripts/summarize_log.py --config config.json --action read --tail 200
python scripts/summarize_log.py --config config.json --action detect-failure --tail 400
python scripts/summarize_log.py --config config.json --action summarize --tail 400

Use read for recent excerpts and metrics, detect-failure for exception diagnosis, and summarize for a concise human-facing assessment with next steps.
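
Failure detection of this kind usually comes down to matching known error messages in the log tail. The patterns below are a hedged sketch built from commonly seen messages (for example PyTorch's "CUDA out of memory" and the operating system's "No space left on device"); the labels and the pattern set are assumptions and may not match what scripts/log_utils.py looks for.

```python
import re

# Assumed labels and patterns; extend them for the frameworks you actually use.
FAILURE_PATTERNS = {
    "cuda_oom": re.compile(r"CUDA out of memory", re.IGNORECASE),
    "nccl_error": re.compile(r"NCCL error", re.IGNORECASE),
    "nan": re.compile(r"\bnan\b", re.IGNORECASE),
    "disk_full": re.compile(r"No space left on device", re.IGNORECASE),
    "timeout": re.compile(r"timed out", re.IGNORECASE),
    "segfault": re.compile(r"Segmentation fault|SIGSEGV", re.IGNORECASE),
}


def detect_failures(log_text: str) -> list[str]:
    """Return the labels of all failure patterns that occur in log_text."""
    return [label for label, pattern in FAILURE_PATTERNS.items()
            if pattern.search(log_text)]
```

A match is a lead rather than a diagnosis: a log can mention "nan" in a harmless warning, so the surrounding lines should still be read before recommending a fix.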

Script Map

scripts/remote_train.py: start training, optional resume templating, structured launch result.
scripts/check_status.py: process/GPU/log-based training status.
scripts/monitor_resources.py: GPU/CPU/memory/disk summary and bottleneck hints.
scripts/summarize_log.py: read logs, detect failures, summarize convergence and next actions.
scripts/common.py: shared config loading, SSH execution, safe path checks, remote helpers.
scripts/log_utils.py: reusable log parsing, failure detection, trend analysis, recommendation logic.

References

Read references/usage.md for setup steps, example configs, and example commands.
Read references/troubleshooting.md when SSH, environment activation, logs, or training recovery fail.

Agent Guidance

Start with the least invasive action that answers the user’s request.
When the user asks a yes/no status question, prefer scripts/check_status.py before reading a long log.
When the user asks why training stopped, run scripts/check_status.py and then scripts/summarize_log.py --action detect-failure.
When the user asks whether to continue training, run scripts/summarize_log.py --action summarize and include the recommendations from the script in the final response.
When a checkpoint path is provided, prefer scripts/remote_train.py --resume-from ... so the resume command is explicit and auditable.
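
The guidance above amounts to a small lookup from the user's question to an ordered list of script invocations. The sketch below encodes it with the script names and flags given in this document; the intent labels and the `plan_commands` helper are hypothetical and exist only for illustration.

```python
# Intent labels are hypothetical; scripts and flags come from the guidance above.
PLAYBOOK = {
    "is_running": [["scripts/check_status.py"]],
    "why_stopped": [
        ["scripts/check_status.py"],
        ["scripts/summarize_log.py", "--action", "detect-failure"],
    ],
    "should_continue": [["scripts/summarize_log.py", "--action", "summarize"]],
}


def plan_commands(intent: str, config: str = "config.json") -> list[list[str]]:
    """Return the argv lists to run, in order, for a given user intent."""
    steps = PLAYBOOK.get(intent)
    if steps is None:
        raise KeyError(f"unknown intent: {intent}")
    return [["python", script, "--config", config, *extra]
            for script, *extra in steps]
```

Keeping commands as argument lists, not shell strings, makes each planned step easy to show to the user and to pass to `subprocess.run` without extra quoting.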

Usage Guidance

This skill appears coherent and implements what it claims: it needs SSH access to the remote AutoDL server and will run commands inside the configured project_path. Before installing or running:

1) Treat SSH credentials as powerful: only grant access to hosts you trust. Prefer SSH keys over putting passwords in environment variables.
2) Verify the truncated helper functions (run_remote_script, build_ssh_command, build_activation_block) to ensure they do not write secrets to disk or leak credentials and that password handling (if used) is secure.
3) Confirm allowed_project_roots in your config so the skill cannot be pointed to an overly broad path (e.g., '/').
4) Test first against a non-production project/host to observe behavior.
5) If you need higher assurance, request the remaining parts of common.py (the SSH/remote-run implementation) so they can be inspected for any unsafe temporary-file or subprocess patterns.

Overall risk is typical for any tool that executes commands on a remote server via SSH.

Capability Analysis

Type: OpenClaw Skill
Name: autodl-train
Version: 1.0.0

The autodl-train skill bundle provides a suite of tools for managing remote machine learning training jobs on AutoDL servers via SSH. It includes scripts for launching background training processes (scripts/remote_train.py), monitoring system resources like GPU and disk usage (scripts/monitor_resources.py), and performing heuristic log analysis to detect failures or convergence (scripts/summarize_log.py). While the skill performs high-privilege remote execution, it includes several safety mechanisms in scripts/common.py, such as a blacklist of dangerous command fragments (e.g., 'rm -rf', 'mkfs'), strict path validation against allowed project roots, and the use of SSH_ASKPASS for secure password handling. The behavior is consistent with the stated purpose, and no evidence of malicious intent or data exfiltration was found.

Capability Assessment

✓ Purpose & Capability

Name/description match the code and SKILL.md: scripts start/resume training, check status, monitor resources and parse logs. Declared behavior (SSH to host, operate inside a configured project_path, read logs, detect failures) is exactly what the included scripts implement.

✓ Instruction Scope

SKILL.md instructs the agent to run the included scripts and to operate only inside project_path; the scripts follow that model (they create a launcher in the project directory, read log files from configured candidates, and run nvidia-smi and read /proc on the remote host). There is no instruction to collect or transmit files to third-party endpoints beyond SSHing to the target server.

✓ Install Mechanism

No install spec is present (instruction-only skill with local Python scripts). Nothing is downloaded or executed from arbitrary URLs; risk from installs is minimal.

ℹ Credentials

The skill declares no required environment variables but supports many AUTOCLAW_* environment overrides (host, username, SSH key path, and SSH password, among others). Those variables are relevant to SSH-based operation. Note: supplying an SSH password through the environment (AUTOCLAW_TRAIN_SSH_PASSWORD) is supported; this is expected but carries the usual operational risk of exposing a password in the environment, so prefer SSH keys. All declared env mappings are proportional to the task.

✓ Persistence & Privilege

Skill does not request permanent/global privileges (always=false). Its operations are limited to running commands on a user-provided remote host and creating a launcher file inside the configured project_path. It does not attempt to modify other skills or system-wide settings.

Version History

v1.0.0

AutoDL remote training skill, initial release:

- Enables starting, monitoring, and diagnosing model training jobs on AutoDL Linux servers via SSH.
- Provides scripts for launching training, checking run status, monitoring GPU/CPU/memory/disk usage, reading recent logs, detecting failures, and summarizing outcomes.
- Uses safe practices: only operates in the designated project directory, avoids destructive commands, and secures credentials.
- Extracts key training metrics and produces actionable run summaries with next-step advice.
- Designed for fast, frequent remote training cycles with minimal user overhead.

Metadata

Slug autodl-train

Version 1.0.0

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 1

Frequently Asked Questions

What is autodl-train?

Operates remote model training jobs on AutoDL Linux servers over SSH. Use when starting a training run, checking whether training is still alive, reviewing G... It is an AI Agent Skill for Claude Code / OpenClaw, with 225 downloads so far.

How do I install autodl-train?

Run "/install autodl-train" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is autodl-train free?

Yes, autodl-train is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does autodl-train support?

autodl-train is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created autodl-train?

It is built and maintained by Zhuoran-Liu (@zhuoran-liu); the current version is v1.0.0.
