Description

Fine-tune large language models using DeepSpeed on local or remote GPUs.

README (SKILL.md)

DeepSpeed Fine-tuning Skill

Name: Deepspeed Finetune
Author: delock

This skill enables efficient model fine-tuning using DeepSpeed with various optimization strategies.

Prerequisites

Python 3.8+
GPU(s) or accelerator(s) with DeepSpeed-supported backend (CUDA, ROCm, Intel XPU, etc.)
DeepSpeed: pip install deepspeed
Transformers, Datasets, PEFT (for LoRA support)
sshpass: sudo apt-get install sshpass (for remote training)

Plan Selection Workflow

Never auto-select a plan. List viable options based on user hardware and requirements, and let the user decide.

Step 1: Gather Information

Confirm the following with the user:

Target model: Model name and parameter count (e.g., Qwen2.5-7B)
Hardware environment:
- GPU VRAM x count (e.g., "single 24GB GPU")
- CPU core count
- RAM size
- Free disk space
- NVMe SSD availability (affects ZeRO NVMe offload)
Training goal: Full fine-tuning or parameter-efficient? Dataset size? Expected quality?
Budget/time constraints: Acceptable training duration?

If the user only provides an SSH or remote machine address, connect first and auto-detect hardware (nvidia-smi, free -h, df -h, nproc).
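
The auto-detection step can be sketched in Python as follows. The helper names `parse_gpu_csv` and `detect_hardware` are hypothetical and not part of the skill. The sketch assumes `nvidia-smi` is on the PATH for NVIDIA hosts and reports an empty GPU list otherwise. RAM detection is left out because the standard library has no portable call for it.

```python
import os
import shutil
import subprocess


def parse_gpu_csv(text):
    """Parse the output of
    `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits`
    (one GPU per line, memory in MiB)."""
    gpus = []
    for line in text.strip().splitlines():
        name, mem_mib = (field.strip() for field in line.rsplit(",", 1))
        gpus.append({"name": name, "vram_gb": round(int(mem_mib) / 1024, 1)})
    return gpus


def detect_hardware(path="."):
    """Collect GPU, CPU, and free-disk facts for the machine this runs on."""
    gpus = []
    if shutil.which("nvidia-smi"):
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=False,
        )
        if result.returncode == 0:
            gpus = parse_gpu_csv(result.stdout)
    return {
        "gpus": gpus,
        "cpu_cores": os.cpu_count(),
        "free_disk_gb": round(shutil.disk_usage(path).free / 1e9, 1),
    }
```

On a remote machine the same `nvidia-smi` query could be run over SSH and its output passed to `parse_gpu_csv`.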

Step 2: Evaluate Feasibility

Estimate VRAM requirements based on model size (bf16):

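| Params | Model weights (bf16) | Adam optimizer states + gradients (additional) |
| --- | --- | --- |
| 0.5B | ~1 GB | ~5 GB |
| 1.5B | ~3 GB | ~15 GB |
| 3B | ~6 GB | ~30 GB |
| 7B | ~14 GB | ~70 GB |
| 14B | ~28 GB | ~140 GB |
| 32B | ~64 GB | ~320 GB |
| 72B | ~144 GB | ~720 GB |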

Breakdown: Adam optimizer stores 2 fp32 state tensors (momentum + variance) = 8 bytes/param. Gradients = 2 bytes/param (bf16). Total approx. 10 bytes/param (5x model weight size).
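
The arithmetic behind the table can be reproduced with a short sketch. The function name is hypothetical. It uses only the byte counts given in the breakdown (2 bytes/param for bf16 weights, 2 for gradients, 8 for Adam states) and decimal GB.

```python
def estimate_training_memory_gb(num_params, weight_bytes=2, grad_bytes=2,
                                adam_state_bytes=8):
    """Return (weights_gb, gradients_plus_adam_gb) in decimal GB (1e9 bytes).

    Activations are not included; they depend on sequence length and batch size.
    """
    weights_gb = num_params * weight_bytes / 1e9
    extra_gb = num_params * (grad_bytes + adam_state_bytes) / 1e9
    return weights_gb, extra_gb


for label, n in [("0.5B", 0.5e9), ("7B", 7e9), ("72B", 72e9)]:
    weights, extra = estimate_training_memory_gb(n)
    print(f"{label}: weights ~{weights:.0f} GB, gradients + Adam ~{extra:.0f} GB")
```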

Activation memory: Depends on sequence length and batch size, not model params alone.

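Formula: activation memory per layer approx. 34 x seq_len x hidden_size x batch_size bytes (the factor 34 already assumes 16-bit activations and leaves out the attention-score term)
Example: 7B model (hidden=4096, 32 layers), seq_len=2048, batch_size=4, bf16 -> ~1.1 GB per layer; ~37 GB total (can dominate VRAM)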
Gradient checkpointing reduces this by ~80% (recomputes instead of storing), but adds ~20% compute overhead
Rule of thumb: if seq_len x batch_size > 8192, activation memory likely exceeds model weights
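
A rough calculator for the activation estimate is sketched below. It reads the factor 34 as bytes per token per hidden unit for 16-bit activations, which is an assumption about how the estimate is meant to be applied. The function names and the 32-layer count in the usage line are also assumptions.

```python
def activation_bytes_per_layer(seq_len, hidden_size, batch_size, coeff=34):
    """Per-layer activation estimate in bytes.

    `coeff` is treated as already covering 16-bit element size (assumption).
    """
    return coeff * seq_len * hidden_size * batch_size


def activation_gb(seq_len, hidden_size, batch_size, num_layers,
                  checkpoint_saving=0.0):
    """Total activation estimate in decimal GB.

    `checkpoint_saving` is the fraction removed by gradient checkpointing
    (the text above suggests about 0.8).
    """
    per_layer = activation_bytes_per_layer(seq_len, hidden_size, batch_size)
    return per_layer * num_layers * (1.0 - checkpoint_saving) / 1e9


def exceeds_token_rule_of_thumb(seq_len, batch_size, limit=8192):
    """True when seq_len x batch_size is above the rule-of-thumb limit."""
    return seq_len * batch_size > limit


# Usage: hidden=4096, seq_len=2048, batch_size=4, 32 layers (assumed).
print(round(activation_gb(2048, 4096, 4, num_layers=32), 1))
print(round(activation_gb(2048, 4096, 4, num_layers=32, checkpoint_saving=0.8), 1))
```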

LoRA/QLoRA: VRAM depends on rank, target modules, and layer dimensions — not directly proportional to total model params. See references/lora_guide.md for LoRA-specific memory estimation.

Step 2.5: Activation Checkpointing

If VRAM is tight, activation checkpointing is the most impactful knob — it can reduce activation memory by ~80%.

How it works: Instead of storing all intermediate activations for backprop, only save checkpoints at select layers. Remaining activations are recomputed during backward pass. Trades compute for memory.
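
The compute-for-memory trade can be illustrated with a toy model in plain Python. This is a sketch with scalar "layers", not how DeepSpeed or PyTorch implement checkpointing. All names are hypothetical. It shows that keeping only every fourth layer input and recomputing the rest gives the same gradient while holding fewer values at once.

```python
import math


def layer(x, w):
    """One toy layer: scalar weight followed by tanh."""
    return math.tanh(w * x)


def layer_grad_input(x, w):
    """Derivative of the layer output with respect to its input x."""
    return w * (1.0 - math.tanh(w * x) ** 2)


def grad_full_storage(x, weights):
    """Baseline: store every layer input, then backpropagate."""
    inputs = []
    for w in weights:
        inputs.append(x)
        x = layer(x, w)
    grad = 1.0
    for x_in, w in zip(reversed(inputs), reversed(weights)):
        grad *= layer_grad_input(x_in, w)
    return grad, len(inputs)


def grad_checkpointed(x, weights, every=4):
    """Store only every `every`-th layer input; recompute each segment."""
    checkpoints = {}
    for i, w in enumerate(weights):
        if i % every == 0:
            checkpoints[i] = x  # kept for the backward pass
        x = layer(x, w)         # all other inputs are discarded
    grad = 1.0
    peak_stored = len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        segment = weights[start:start + every]
        seg_inputs, h = [], checkpoints[start]
        for w in segment:       # recompute this segment's inputs
            seg_inputs.append(h)
            h = layer(h, w)
        # Values held at this point: all checkpoints plus one recomputed segment.
        peak_stored = max(peak_stored, len(checkpoints) + len(seg_inputs))
        for x_in, w in zip(reversed(seg_inputs), reversed(segment)):
            grad *= layer_grad_input(x_in, w)
    return grad, peak_stored
```

With 16 layers and `every=4`, the checkpointed version holds at most 8 values instead of 16, at the cost of running each layer's forward computation a second time.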

Two ways to enable:

HF Trainer flag (simplest, works out of the box):

python scripts/ds_train.py --gradient_checkpointing ...

DeepSpeed config (fine-grained control):

{
  "activation_checkpointing": {
    "partition_activations": true,
    "cpu_checkpointing": true,
    "contiguous_memory_optimization": true,
    "number_checkpoints": 4
  }
}

| Option | Effect | When to use |
| --- | --- | --- |
| `partition_activations` | Shard checkpoints across model-parallel GPUs | Multi-GPU with model parallelism |
| `cpu_checkpointing` | Store checkpoints in CPU RAM instead of GPU | GPU memory very tight |
| `contiguous_memory_optimization` | Reduce memory fragmentation | Large models, many checkpoints |
| `number_checkpoints` | Control checkpoint frequency (fewer = less VRAM, more compute) | Tune based on VRAM budget |

Step 3: List Options

Based on the VRAM assessment, list all viable approaches. Example:

Based on your hardware (single 24GB GPU, 64GB RAM, 500GB disk),
Qwen2.5-7B has these training options:

Option A: LoRA Fine-tuning (Recommended)
  - VRAM needed: ~22 GB
  - Speed: Fast
  - Quality: Good for instruction alignment, style adaptation
  - Trainable params: ~20M (~0.3% of total)

Option B: QLoRA Fine-tuning (Saves VRAM)
  - VRAM needed: ~12 GB
  - Speed: Medium (quantization/dequantization overhead)
  - Quality: Slightly below LoRA, but gap is small

Option C: Full Fine-tuning (Not feasible)
  - VRAM needed: ~84 GB before activations (exceeds 24GB)
  - Requires ZeRO-2 + CPU offload, or larger GPU

Which option do you prefer?

Step 4: Hardware Insufficient? Make Recommendations

If no plan is viable on current hardware, recommend specs using generic hardware metrics (no brand names):

You want to fully fine-tune a 7B model, but current hardware (single 24GB GPU) is insufficient.
Recommended hardware specs:

Minimum:
  - GPU: single 80GB VRAM (with optimizer states offloaded to CPU RAM)
  - CPU: 16+ cores
  - RAM: 128 GB+
  - Disk: 200 GB+ free space

Recommended:
  - GPU: 2x 80GB VRAM (ZeRO-2 shards optimizer states and gradients across both GPUs; throughput roughly doubles)
  - CPU: 32+ cores
  - RAM: 256 GB+
  - Disk: 500 GB+ free space

Alternatively, use LoRA — 24GB VRAM is sufficient for 7B models.

Key Principles

Never auto-select and start training — always list options and wait for user confirmation
Recommend but don't decide — say "I recommend Option A because..." but let the user choose
Use generic hardware metrics — VRAM in GB, GPU count, CPU cores, RAM in GB, disk in GB. No brand names.
Leave VRAM headroom — recommend at least 20% buffer to avoid OOM
If user picks an infeasible option, warn them clearly rather than silently switching
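
The headroom principle can be expressed as a small labelling helper. This is a sketch with hypothetical names. It only labels options and never picks one, so the decision stays with the user.

```python
def classify_options(options, vram_per_gpu_gb, headroom=0.20):
    """Label each option by whether it fits within VRAM minus `headroom`.

    `options` maps an option name to its estimated VRAM need in GB.
    """
    usable = vram_per_gpu_gb * (1.0 - headroom)
    labels = {}
    for name, vram_needed_gb in options.items():
        if vram_needed_gb <= usable:
            labels[name] = "fits with headroom"
        elif vram_needed_gb <= vram_per_gpu_gb:
            labels[name] = "fits, but below the recommended headroom"
        else:
            labels[name] = "infeasible without offload or more VRAM"
    return labels
```

An option labelled as below the recommended headroom should still be listed, with a clear warning about OOM risk.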

Core Capabilities

1. Training Configuration

Generate DeepSpeed ZeRO configurations:

from scripts.generate_ds_config import generate_zero_config

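# ZeRO Stage 3 with optimizer states offloaded to NVMe.
# DeepSpeed supports NVMe offload only with ZeRO stage 3;
# stage 2 can offload optimizer states to CPU only.
config = generate_zero_config(
    zero_stage=3,
    offload_optimizer=True,
    offload_device="nvme",
    nvme_path="/local_nvme"
)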
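
For orientation, the kind of dictionary such a generator might produce is sketched below. `build_zero_config` is a hypothetical stand-in, not the skill's `generate_zero_config`, whose real signature and output are not shown here. The keys used (`zero_optimization`, `stage`, `offload_optimizer`, `device`, `nvme_path`, `pin_memory`, `bf16`) are standard DeepSpeed config keys, and the check reflects DeepSpeed's documented rule that NVMe offload is available only with ZeRO stage 3.

```python
import json


def build_zero_config(zero_stage=2, offload_device=None, nvme_path=None):
    """Build a minimal DeepSpeed config dict (illustrative helper)."""
    if offload_device not in (None, "cpu", "nvme"):
        raise ValueError("offload_device must be None, 'cpu', or 'nvme'")
    if offload_device == "nvme":
        if zero_stage != 3:
            raise ValueError("NVMe offload requires ZeRO stage 3")
        if not nvme_path:
            raise ValueError("nvme_path is required for NVMe offload")

    zero = {"stage": zero_stage}
    if offload_device:
        offload = {"device": offload_device, "pin_memory": True}
        if offload_device == "nvme":
            offload["nvme_path"] = nvme_path
        zero["offload_optimizer"] = offload

    return {
        "bf16": {"enabled": True},
        "zero_optimization": zero,
        # "auto" lets the Hugging Face Trainer integration fill these in.
        "gradient_accumulation_steps": "auto",
        "train_micro_batch_size_per_gpu": "auto",
    }


print(json.dumps(build_zero_config(2, offload_device="cpu"), indent=2))
```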

2. Training Launch

Use the training launcher script:

python scripts/ds_train.py \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --dataset_path data/my_dataset \
  --output_dir ./outputs \
  --deepspeed assets/ds_config_zero2.json \
  --num_train_epochs 3 \
  --per_device_train_batch_size 4 \
  --learning_rate 2e-5 \
  --lora_r 16 \
  --lora_alpha 32

3. LoRA/QLoRA Integration

For parameter-efficient fine-tuning:

# LoRA config is auto-generated based on arguments
peft_config = {
    "peft_type": "LORA",
    "r": 16,
    "lora_alpha": 32,
    "target_modules": ["q_proj", "v_proj", "k_proj", "o_proj"],
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM"
}
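
To see how rank and target modules drive the trainable-parameter count, the following sketch applies the standard LoRA count of r x (d_in + d_out) per adapted weight matrix. The function name is hypothetical. The shapes in the usage lines assume a model with hidden size 4096, 32 layers, and square attention projections (no grouped-query attention), which matches the published Llama-2-7B configuration.

```python
def lora_trainable_params(rank, module_shapes, num_layers):
    """Count LoRA parameters: each adapted (d_in, d_out) matrix adds
    rank * (d_in + d_out) parameters (matrices A and B)."""
    per_layer = sum(rank * (d_in + d_out)
                    for d_in, d_out in module_shapes.values())
    return per_layer * num_layers


# Assumed shapes: four 4096 x 4096 attention projections per layer.
shapes = {name: (4096, 4096)
          for name in ("q_proj", "k_proj", "v_proj", "o_proj")}
total = lora_trainable_params(rank=16, module_shapes=shapes, num_layers=32)
print(f"{total / 1e6:.1f}M trainable parameters")
```

Halving the rank halves the count, which is why reducing `--lora_r` is a common response to memory pressure.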

4. Multi-GPU Training

Use the deepspeed launcher for multi-GPU training (recommended over torchrun):

# Multi-GPU on single node
deepspeed --num_gpus=4 scripts/ds_train.py \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --deepspeed assets/ds_config_zero3.json \
  ...

# Multi-node
deepspeed --hostfile hosts.txt scripts/ds_train.py \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --deepspeed assets/ds_config_zero3.json \
  ...
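
The hostfile passed to `--hostfile` lists one node per line in the form `hostname slots=N`, where N is the number of GPUs on that node. A small sketch for generating it follows. The helper name and the example host names are hypothetical. DeepSpeed's launcher also expects passwordless SSH between the nodes.

```python
def render_hostfile(hosts):
    """Render a DeepSpeed hostfile from a mapping of hostname -> GPU count."""
    lines = []
    for host, slots in hosts.items():
        if slots < 1:
            raise ValueError(f"{host}: slots must be at least 1")
        lines.append(f"{host} slots={slots}")
    return "\n".join(lines) + "\n"


# Example host names are placeholders.
with open("hosts.txt", "w", encoding="utf-8") as handle:
    handle.write(render_hostfile({"worker-1": 4, "worker-2": 4}))
```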

5. Training Monitoring

Monitor training progress:

from scripts.monitor_training import TrainingMonitor

monitor = TrainingMonitor(log_dir="./outputs")
monitor.plot_loss()
monitor.get_latest_checkpoint()
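
If the monitor script is not available, similar information can be read straight from the output directory. The sketch below assumes the Hugging Face Trainer layout: `checkpoint-<step>` directories, each with a `trainer_state.json` whose `log_history` holds logged losses. The function names are hypothetical and this is not the skill's `TrainingMonitor` implementation.

```python
import json
import re
from pathlib import Path


def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> directory with the highest step, or None."""
    best_step, best_path = -1, None
    for path in Path(output_dir).glob("checkpoint-*"):
        match = re.fullmatch(r"checkpoint-(\d+)", path.name)
        if path.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


def read_losses(checkpoint_dir):
    """Return (step, loss) pairs from trainer_state.json's log_history."""
    state_file = Path(checkpoint_dir) / "trainer_state.json"
    state = json.loads(state_file.read_text(encoding="utf-8"))
    return [(entry["step"], entry["loss"])
            for entry in state.get("log_history", [])
            if "loss" in entry]
```

Steps are compared numerically, so `checkpoint-1000` is correctly treated as newer than `checkpoint-900`.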

6. Early Stopping

Automatically monitors eval loss and stops training early when there's no improvement across consecutive evaluations, then loads the best checkpoint.

Parameters:

--early_stopping_patience — How many consecutive evals without improvement to tolerate. Set to 0 to disable (default). Recommended: 3-10.
--early_stopping_threshold — Minimum eval loss improvement to count as an improvement. Default 0.0 (any decrease counts).

Example:

python scripts/ds_train.py \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_path tatsu-lab/alpaca \
  --use_peft True \
  --early_stopping_patience 5 \
  --early_stopping_threshold 0.001 \
  --eval_strategy steps \
  --eval_steps 100 \
  --num_train_epochs 3 \
  ...

Auto-configuration: When early_stopping_patience > 0, the script automatically:

Enables load_best_model_at_end=True
Sets metric_for_best_model=eval_loss, greater_is_better=False
Aligns save_strategy with eval_strategy (synced saving is needed to restore best checkpoint)

Notes:

Must also set eval_strategy (e.g., steps + eval_steps), otherwise early stopping won't work
Don't set patience too low (<3): early training fluctuations may cause premature stopping
For LoRA fine-tuning, patience=5 with eval_steps=100 typically works well
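
The patience and threshold behaviour described above can be sketched in a few lines. This mirrors the description only. It is not the script's code or the library callback, and the class name is hypothetical.

```python
class EarlyStopper:
    """Track eval loss and signal when to stop."""

    def __init__(self, patience, threshold=0.0):
        self.patience = patience    # 0 disables early stopping
        self.threshold = threshold  # minimum decrease that counts
        self.best = None
        self.bad_evals = 0

    def update(self, eval_loss):
        """Record one evaluation; return True when training should stop."""
        if self.patience <= 0:
            return False
        if self.best is None or self.best - eval_loss > self.threshold:
            self.best = eval_loss   # improvement: reset the counter
            self.bad_evals = 0
        else:
            self.bad_evals += 1     # no sufficient improvement
        return self.bad_evals >= self.patience


stopper = EarlyStopper(patience=2, threshold=0.001)
decisions = [stopper.update(loss) for loss in (1.0, 0.9, 0.8995, 0.8999)]
```

In the usage lines, the last two losses decrease by less than the threshold relative to the best value of 0.9, so the second of them triggers the stop.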

Remote Training

When training needs to run on a remote GPU server, see references/remote_training.md for the complete guide including agent guidelines, security model, and command reference.

Troubleshooting

OOM Errors

Reduce batch size or increase gradient accumulation steps
Enable gradient checkpointing: --gradient_checkpointing
Use ZeRO-3 with CPU/NVMe offloading
Reduce LoRA rank: --lora_r 8
See references/troubleshooting.md for detailed solutions

Slow Training

Ensure bf16/fp16 is enabled
Check GPU utilization with nvidia-smi
Use FlashAttention if available
Optimize data loading with --dataloader_num_workers
See references/troubleshooting.md for detailed solutions

Checkpoint Issues

Use --save_strategy steps with --save_steps
Enable --save_total_limit to cap checkpoint count
For ZeRO-3, use --zero3_save_16bit_model to save consolidated 16-bit (fp16/bf16) weights
See references/troubleshooting.md for detailed solutions

MPI Errors (multi-GPU only)

Single-GPU training does not need MPI
If you see MPI errors on single GPU, use python3 directly instead of deepspeed launcher
See references/troubleshooting.md for full MPI debugging guide

Single-GPU Strategy

See references/single_gpu_strategy.md for strategy selection, CPU/NVMe offload examples, and decision principles

References

Quick Start Guide — Common training patterns and full examples
DeepSpeed Guide — DeepSpeed documentation and configuration reference
LoRA/PEFT Best Practices — LoRA/QLoRA parameter tuning guide
ZeRO Optimization Guide — ZeRO stage comparison and optimization tips
Single-GPU Strategy — Strategy selection for single-GPU training
Remote Training Guide — Remote training via SSH, agent guidelines, and security model
Troubleshooting — Common errors and solutions (OOM, NaN loss, MPI, NCCL, etc.)

Usage Guidance

This skill appears to do what it says (DeepSpeed fine-tuning, including remote training), but review and operate cautiously:

- Review remote_train.py before use: it orchestrates SSH, installs, and key setup. Verify there is no unexpected network exfiltration or telemetry. The source is listed in the SKILL.md homepage, but the registry source is 'unknown', so confirm code provenance.
- Prefer SSH key auth over password-based automation (sshpass). The skill supports generating/uploading keys, but auto-generating a private key with no passphrase creates a persistent credential: store the private key securely and revoke it if the host is compromised.
- Be aware that StrictHostKeyChecking=no is used for automation: this disables SSH host-key validation and makes MITM attacks possible. If you must use password automation initially, add the host key to known_hosts afterwards (ssh-keyscan <host> >> ~/.ssh/known_hosts) and switch to key-based auth.
- Passwords passed via environment variables are common for ephemeral automation but can leak in process listings or logs on misconfigured systems. Provide REMOTE_SSH_PASSWORD only on trusted machines and avoid storing it on disk.
- The skill will create a session file (.remote_train_session.json) and temporary ControlMaster sockets in /tmp. Clean these up periodically (rm -rf /tmp/deepspeed_remote_ssh/ and remove the session file) if you are concerned about lingering access.
- Test first on a non-sensitive or disposable remote VM to validate behavior and side effects (install steps, file writes, ports opened) before using on production hosts or with sensitive data.

If you want higher assurance, ask the publisher for an auditable release (named maintainer, commit hashes) or run the skill code in a sandbox and review remote_train.py, ds_train.py, and monitor_training.py for any unexpected network connections or data uploads beyond standard model/dataset transfer.

Capability Analysis

Type: OpenClaw Skill
Name: deepspeed-finetune
Version: 1.0.5

The skill bundle provides powerful capabilities for remote LLM fine-tuning, including a remote training manager (scripts/remote_train.py) that facilitates SSH connections, file uploads, and arbitrary command execution on remote servers. It utilizes 'sshpass' for password handling and explicitly disables SSH host key verification (StrictHostKeyChecking=no), which introduces significant security vulnerabilities such as man-in-the-middle risks. While these high-risk behaviors and the associated agent instructions in SKILL.md and references/remote_training.md are plausibly aligned with the stated purpose of managing remote GPU resources, the combination of remote administrative access and weak security defaults meets the threshold for a suspicious classification.

Capability Assessment

✓ Purpose & Capability

Name/description match the included files and functionality: scripts for training, config generators, monitoring, and a remote_train helper. Required binaries (python3, deepspeed, sshpass) are appropriate for local DeepSpeed runs and optional password-based remote SSH automation. No unrelated cloud credentials or unexpected tooling are requested.

ℹ Instruction Scope

SKILL.md instructs the agent to perform local and remote operations (auto-detect remote hardware via nvidia-smi, free, df, launch training, monitor logs) and to use subagents (sessions_spawn/sessions_yield) for remote tasks. It also shows passing REMOTE_SSH_PASSWORD via environment variables, generating/uploading SSH keys, and disabling StrictHostKeyChecking for automation. These actions are coherent for remote training but have clear security trade-offs (see user guidance). The SKILL.md references REMOTE_SSH_PASSWORD and remote file creation (.remote_train_session.json) even though no env vars were declared in the registry metadata—this is an explicit runtime usage rather than a static registry requirement.

✓ Install Mechanism

No install spec is provided (instruction-only skill), and all code is bundled with the skill (no external downloads or extract steps). That lowers install-time risk—scripts run at runtime rather than pulling arbitrary remote binaries during install.

ℹ Credentials

The skill declares no required environment variables in the registry, which is reasonable, but the runtime instructions and examples rely on an environment variable REMOTE_SSH_PASSWORD when the user supplies a password. No unrelated secret tokens (cloud keys, API tokens) are requested. The number and type of environment interactions are proportionate to the stated remote-training purpose, but the skill relies on password passing and key generation which are sensitive operations and should be handled consciously by the user.

ℹ Persistence & Privilege

always:false and default autonomous invocation are normal. The skill will create local artifacts during remote workflows: a ControlMaster socket in a temp directory and a .remote_train_session.json file with connection metadata (claimed to be non-sensitive). It also recommends generating an ed25519 keypair with no passphrase and uploading the public key for passwordless login—this is functional but increases long-term access if the private key is stored insecurely. Nothing in the skill attempts to modify other skills or system-wide agent settings.

Version History

v1.0.5

Replace nohup with tmux to prevent SSH disconnect from killing training processes

v1.0.4

Fix DeepSpeed 0.18.8 and Transformers 5.x compatibility: add missing Trainer import, use processing_class= instead of deprecated tokenizer=, replace 'auto' with concrete values in ZeRO config bucket sizes

v1.0.3

Remove requires.env: REMOTE_SSH_PASSWORD is passed by agent dynamically, not a global credential

v1.0.2

Declare REMOTE_SSH_PASSWORD in requires.env and sshpass in requires.bins to resolve security scan mismatch

v1.0.1

Initial release: DeepSpeed fine-tuning skill with remote training support, LoRA/QLoRA, ZeRO optimization, and multi-GPU training.

v1.0.0

Initial release: DeepSpeed fine-tuning skill for OpenClaw

Metadata

Slug deepspeed-finetune

Version 1.0.5

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 6

Frequently Asked Questions

What is Deepspeed Finetune?

Fine-tune large language models using DeepSpeed on local or remote GPUs. It is an AI Agent Skill for Claude Code / OpenClaw, with 147 downloads so far.

How do I install Deepspeed Finetune?

Run "/install deepspeed-finetune" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Deepspeed Finetune free?

Yes, Deepspeed Finetune is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Deepspeed Finetune support?

Deepspeed Finetune is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Deepspeed Finetune?

It is built and maintained by delock (@delock); the current version is v1.0.5.
