roamer-remote

Debugging Reinforcement Learning

Author: Roamer 徐 · GitHub · v1.0.0 · MIT-0
cross-platform ✓ Security check passed
86 total downloads · 0 favorites · 1 current install · 1 version
Install in OpenClaw
/install debugging-reinforcement-learning
Description
Tools and methods for controlling randomness, ensuring reproducibility, analyzing agent behavior, and debugging reward issues in stochastic reinforcement lea...
Usage instructions (SKILL.md)

Debugging Non-Deterministic Agent Behavior in Reinforcement Learning Environments

Overview

This skill provides a comprehensive toolkit for debugging reinforcement learning (RL) agents that exhibit non-deterministic behavior — one of the most challenging aspects of RL development. Non-determinism arises from environment stochasticity, policy randomness, seed mismanagement, and subtle numerical issues, making bugs notoriously hard to reproduce and diagnose.

Core Modules

1. Stochasticity Control

Strategies for controlling and isolating sources of randomness in RL pipelines:

  • Seed Management: Set and track seeds across all random sources (Python random, NumPy, PyTorch/TF, environment RNG, custom samplers).
  • Entropy Scheduling: Monitor and clamp policy entropy to detect exploration collapse or excessive randomness.
  • Action Distribution Inspection: Log full action distributions (not just sampled actions) to verify the policy is learning correctly.
  • Environment Stochasticity Toggle: Identify which environment transitions are stochastic vs. deterministic, and temporarily freeze stochastic dimensions for debugging.
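The seed-management and entropy-monitoring ideas above can be sketched in Python as follows. This is a minimal illustration, not the skill's API (the shipped implementation is JavaScript). The names seed_all and entropy_alert are hypothetical, and the entropy thresholds are arbitrary assumptions.

```python
import math
import random


def seed_all(seed: int) -> dict:
    """Hypothetical helper: seed the RNGs this sketch controls and return them.

    Only Python's global `random` module is seeded here. A real pipeline would
    also seed NumPy, the DL framework, and the environment (e.g. via
    env.reset(seed=...) in Gymnasium), then log the returned dict with the run.
    """
    random.seed(seed)
    return {"python_random": seed}


def policy_entropy(probs) -> float:
    """Shannon entropy, in nats, of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_alert(probs, low=0.05, high_frac=0.95) -> str:
    """Classify entropy as 'collapse', 'too_random', or 'ok'.

    `low` (nats) and `high_frac` (fraction of the maximum entropy log|A|) are
    illustrative thresholds, not tuned defaults.
    """
    h = policy_entropy(probs)
    h_max = math.log(len(probs))
    if h < low:
        return "collapse"
    if h > high_frac * h_max:
        return "too_random"
    return "ok"


seed_all(123)
first = [random.random() for _ in range(3)]
seed_all(123)
second = [random.random() for _ in range(3)]
print(first == second)                          # True
print(entropy_alert([0.999, 0.0005, 0.0005]))   # collapse
print(entropy_alert([0.7, 0.2, 0.1]))           # ok
print(entropy_alert([0.25, 0.25, 0.25, 0.25]))  # too_random
```

Logging the full probability vector (not just the sampled action) is what makes the entropy check possible at all.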

2. Reproducibility Tools

Utilities for making RL experiments reproducible:

  • ReproWrapper: Wraps any env+agent pair to capture full episode trajectories (observations, actions, rewards, dones, seeds, RNG states).
  • Episode Replay: Replays a recorded episode step-by-step for comparison against expected behavior.
  • State Snapshot: Saves/restores complete training state (model weights, optimizer state, RNG state, env state).
  • Diff Replay: Compares two episode trajectories and highlights divergences with step-level granularity.
  • Seed Cascade: Generates deterministic seed sequences for parallel workers to avoid seed collisions.
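A minimal Python sketch of Diff Replay and Seed Cascade follows. It assumes a hypothetical Step record with obs, action, reward, and done fields, and it is not the skill's trajectory format. Hashing is one way to derive worker seeds among several. If NumPy is available, numpy.random.SeedSequence.spawn is a well-tested alternative.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One recorded transition (hypothetical schema, not the skill's format)."""
    obs: tuple
    action: int
    reward: float
    done: bool


def diff_trajectories(a, b, tol=1e-9):
    """Return (step_index, [differing fields]) for the first divergence,
    or None if the two trajectories match step for step."""
    for t, (sa, sb) in enumerate(zip(a, b)):
        fields = []
        if sa.obs != sb.obs:
            fields.append("obs")
        if sa.action != sb.action:
            fields.append("action")
        if abs(sa.reward - sb.reward) > tol:
            fields.append("reward")
        if sa.done != sb.done:
            fields.append("done")
        if fields:
            return t, fields
    if len(a) != len(b):
        # Identical prefix, but one episode ran longer.
        return min(len(a), len(b)), ["length"]
    return None


def seed_cascade(base_seed: int, n_workers: int) -> list:
    """Derive deterministic per-worker seeds by hashing (base_seed, worker_id).

    Distinct inputs give unrelated 32-bit seeds; collisions are possible in
    principle but vanishingly unlikely for realistic worker counts.
    """
    seeds = []
    for worker_id in range(n_workers):
        digest = hashlib.sha256(f"{base_seed}:{worker_id}".encode()).digest()
        seeds.append(int.from_bytes(digest[:4], "little"))
    return seeds


run_a = [Step((0,), 1, 0.0, False), Step((1,), 0, 1.0, True)]
run_b = [Step((0,), 1, 0.0, False), Step((1,), 1, 1.0, True)]
print(diff_trajectories(run_a, run_b))  # (1, ['action'])
print(seed_cascade(42, 4) == seed_cascade(42, 4))  # True
```

Observations are compared exactly here. Float-valued observations would need the same tolerance treatment as rewards.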

3. Behavior Analysis

Techniques for understanding what the agent is actually doing:

  • Trajectory Clustering: Groups similar trajectories to identify behavioral modes (e.g., "agent always fails at corner cases").
  • Action Frequency Heatmap: Visualizes action distributions over state space regions.
  • Policy Consistency Check: Detects if the same state produces different action distributions across episodes (a sign of state encoding bugs or hidden state leakage).
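  • Temporal Correlation Detector: Finds unexpected correlations between consecutive actions, beyond what the state dynamics explain, which may indicate that the agent relies on information outside the current observation (a violated Markov assumption).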
  • Behavioral Mode Detection: Identifies distinct behavioral regimes the agent switches between (e.g., cautious vs. reckless).
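Policy Consistency Check can be sketched by grouping logged (state, action probabilities) pairs by state and measuring the spread of each group with total variation distance. The function name, record layout, and tolerance are hypothetical. The check assumes exact, hashable states and a frozen, feed-forward policy.

```python
from collections import defaultdict


def total_variation(p, q) -> float:
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def policy_consistency_check(records, tol=1e-6) -> dict:
    """Flag states whose action distributions disagree across episodes.

    `records` is an iterable of (state, action_probs) pairs logged from a
    frozen checkpoint; `state` must be hashable (e.g. a tuple). For a
    feed-forward policy, the same state should always yield the same
    distribution, so any spread beyond `tol` suggests hidden state leakage
    or a state-encoding bug. Recurrent policies will legitimately differ.
    """
    by_state = defaultdict(list)
    for state, probs in records:
        by_state[state].append(tuple(probs))
    flagged = {}
    for state, dists in by_state.items():
        # Largest distance from the first distribution seen for this state.
        spread = max(total_variation(dists[0], d) for d in dists)
        if spread > tol:
            flagged[state] = spread
    return flagged


records = [
    ((0, 1), [0.9, 0.1]),
    ((0, 1), [0.9, 0.1]),
    ((2, 3), [0.5, 0.5]),
    ((2, 3), [0.2, 0.8]),
]
print(list(policy_consistency_check(records)))  # [(2, 3)]
```

Collecting records while the policy is still training would flag every state, because the weights change between episodes.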

4. Reward Debugging

Methods for diagnosing reward-related issues:

  • Reward Decomposition: Breaks multi-component rewards into individual signals to identify which component drives behavior.
  • Reward Shaping Validator: Checks if shaped rewards accidentally create local optima or reward cycling.
  • Sparse Reward Tracer: For sparse-reward environments, logs the full trajectory leading up to reward events for analysis.
  • Reward Scale Analyzer: Detects reward scale mismatches between components that cause gradient domination.
  • Episode Return Sanity Check: Verifies that discounted returns are computed correctly and that reward normalization isn't destroying the signal.
  • Reward Hacking Detector: Flags when the agent achieves high reward through unintended behavior (exploiting bugs in reward computation).
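As a sketch of Reward Decomposition and Reward Scale Analyzer, the Python below does two things. It first checks that per-step components sum to the logged total, then compares the spread of each component. It assumes components are logged per step as lists keyed by name, which is a hypothetical layout. Population standard deviation serves as the scale measure (mean absolute value is another reasonable choice), and the 10x cutoff is an illustrative assumption.

```python
import statistics


def check_decomposition(components, totals, tol=1e-6) -> list:
    """Return step indices where the reward components don't sum to the
    logged total (a sign that a term is missing, double-counted, or scaled)."""
    return [
        t for t, total in enumerate(totals)
        if abs(sum(vals[t] for vals in components.values()) - total) > tol
    ]


def reward_scale_analyzer(components, ratio_threshold=10.0):
    """Compare per-component spread; components whose population std exceeds
    the smallest nonzero std by more than `ratio_threshold` (an illustrative
    cutoff) are likely to dominate the gradient signal."""
    stds = {name: statistics.pstdev(vals) for name, vals in components.items()}
    nonzero = [s for s in stds.values() if s > 0]
    if not nonzero:
        return stds, []
    floor = min(nonzero)
    dominant = [name for name, s in stds.items() if s / floor > ratio_threshold]
    return stds, dominant


components = {
    "progress": [1.0, 2.0, 3.0, 4.0],
    "energy": [-0.01, -0.02, -0.01, -0.02],
}
totals = [0.99, 1.98, 2.99, 5.0]  # last entry deliberately wrong
print(check_decomposition(components, totals))  # [3]
print(reward_scale_analyzer(components)[1])     # ['progress']
```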

Usage Patterns

Quick Reproducibility Check

1. Set global seed via seedAll()
2. Run episode with EpisodeRecorder
3. Replay and compare
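The quick reproducibility check above might look like this in Python. A hypothetical toy environment (NoisyChain) and record_episode helper stand in for seedAll() and EpisodeRecorder. Instead of a single global seed, the sketch gives the environment and the policy separate random.Random streams. That way, a change in how many random numbers the policy draws cannot shift the environment's noise.

```python
import random


class NoisyChain:
    """Hypothetical toy env: walk along a line; each move slips with p=0.2."""

    def __init__(self):
        self.rng = random.Random()  # environment-owned RNG stream
        self.pos = 0

    def reset(self, seed):
        self.rng.seed(seed)
        self.pos = 0
        return self.pos

    def step(self, action):
        # With probability 0.2 the move is reversed (environment noise).
        move = action if self.rng.random() > 0.2 else -action
        self.pos += move
        reward = 1.0 if self.pos >= 3 else 0.0
        return self.pos, reward, reward > 0


def record_episode(seed, max_steps=20):
    """Run one seeded episode and return its (obs, action, reward, done) steps."""
    env = NoisyChain()
    policy_rng = random.Random(seed + 1)  # separate stream for the policy
    obs = env.reset(seed)
    trajectory = []
    for _ in range(max_steps):
        action = 1 if policy_rng.random() < 0.8 else -1
        next_obs, reward, done = env.step(action)
        trajectory.append((obs, action, reward, done))
        obs = next_obs
        if done:
            break
    return trajectory


first = record_episode(seed=7)
replay = record_episode(seed=7)
print(first == replay)  # True: same seeds reproduce the same trajectory
```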

Diagnose Erratic Behavior

1. Run 50 episodes with fixed seeds
2. Cluster trajectories
3. Inspect divergent clusters
4. Use policyConsistencyCheck on divergent states
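For the clustering step, a simple stand-in is leader clustering on action sequences with a normalized Hamming distance. This technique was chosen for brevity and is an assumption; the skill's own clustering method is not documented here. Real trajectories would usually be clustered on richer features, such as states visited or returns.

```python
def action_distance(a, b) -> float:
    """Normalized Hamming distance between two action sequences; extra
    length in the longer sequence counts as mismatches."""
    n = max(len(a), len(b))
    if n == 0:
        return 0.0
    mismatches = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    return mismatches / n


def cluster_trajectories(action_seqs, threshold=0.3) -> list:
    """Leader clustering: each sequence joins the first cluster whose leader
    is within `threshold`, else it starts a new cluster. Returns lists of
    trajectory indices, one list per cluster."""
    leaders, clusters = [], []
    for i, seq in enumerate(action_seqs):
        for k, leader in enumerate(leaders):
            if action_distance(seq, leader) <= threshold:
                clusters[k].append(i)
                break
        else:
            leaders.append(seq)
            clusters.append([i])
    return clusters


seqs = [[1, 1, 1, 1], [1, 1, 1, 0], [0, 0, 0, 0], [0, 0, 1, 0], [1, 1, 1, 1]]
print(cluster_trajectories(seqs))  # [[0, 1, 4], [2, 3]]
```

Small or unexpected clusters are natural candidates for the "divergent clusters" to inspect. The states where their actions differ from the majority are the ones to feed into a consistency check.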

Reward Signal Investigation

1. Decompose reward into components
2. Run rewardScaleAnalyzer
3. Check for hacking via rewardHackingDetector
4. Validate return computation
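One way to validate return computation is to compare a fast backward pass against a slow reference that sums each episode explicitly. The sketch below assumes a flat buffer of rewards, with done flags, spanning several episodes; the function names are hypothetical. The backward pass implements G_t = r_t + gamma * G_{t+1} * (1 - done_t).

```python
def discounted_returns(rewards, dones, gamma=0.99):
    """Fast backward pass: G_t = r_t + gamma * G_{t+1} * (1 - done_t)."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if dones[t]:
            running = 0.0  # episode ends at t: nothing after it counts
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def naive_returns(rewards, dones, gamma=0.99):
    """Slow O(T^2) reference: sum discounted rewards until the episode ends."""
    out = []
    for t in range(len(rewards)):
        g = 0.0
        for k in range(t, len(rewards)):
            g += gamma ** (k - t) * rewards[k]
            if dones[k]:
                break
        out.append(g)
    return out


rewards = [1.0, 0.0, 1.0, 1.0, 0.0]
dones = [False, True, False, False, True]
print(discounted_returns(rewards, dones, gamma=0.5))  # [1.0, 0.0, 1.5, 1.0, 0.0]
fast, slow = discounted_returns(rewards, dones), naive_returns(rewards, dones)
print(all(abs(f - s) < 1e-9 for f, s in zip(fast, slow)))  # True
```

Both functions treat the end of the buffer as if nothing followed it. If the last episode was cut off by a time limit rather than terminated, a correct implementation would bootstrap from a value estimate instead.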

Anti-Patterns to Watch For

  • Seed per episode but not per step: Environment internal RNG can diverge even with episode-level seeding.
  • Caching state without RNG: Replay buffers that store (s, a, r, s') without the RNG state cannot reproduce the exact transition.
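  • Floating-point nondeterminism on GPUs: Some GPU kernels produce run-to-run differences, e.g., from nondeterministic cuDNN algorithm selection or atomic floating-point accumulation. During debugging, set torch.backends.cudnn.deterministic = True and torch.backends.cudnn.benchmark = False, or call torch.use_deterministic_algorithms(True).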
  • Hidden environment state: Some environments (e.g., Atari with frame-skipping) have internal state not exposed in the observation.
  • Reward normalization drift: Running mean/std normalization changes the effective reward over training, making early episodes non-reproducible.
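The "caching state without RNG" pitfall can be avoided by snapshotting the RNG state alongside each transition. Below is a minimal Python sketch with a hypothetical Gaussian-noise transition. It calls random.Random.getstate() before sampling, and setstate() restores that state at replay time.

```python
import random


def noisy_transition(rng, state, action):
    """Hypothetical stochastic transition: additive Gaussian noise from `rng`."""
    return state + action + rng.gauss(0.0, 0.1)


rng = random.Random(0)
buffer = []
state = 0.0
for action in [1, -1, 1]:
    rng_state = rng.getstate()  # snapshot BEFORE sampling the transition
    next_state = noisy_transition(rng, state, action)
    buffer.append((state, action, next_state, rng_state))
    state = next_state

# Later: reproduce transition #1 exactly from its stored RNG state.
s, a, s_next, saved = buffer[1]
replay_rng = random.Random()
replay_rng.setstate(saved)
reproduced = noisy_transition(replay_rng, s, a)
print(reproduced == s_next)  # True
```

Storing a full Mersenne Twister state per transition is expensive. In practice you might store it only for sampled debug transitions, or use a counter-based RNG where a (seed, step) pair is enough.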

Integration Tips

  • Works with Gym/Gymnasium, PettingZoo, and custom env wrappers.
  • Compatible with PyTorch, TensorFlow, and JAX-based agents.
  • Output formats: JSON trajectories, CSV logs, and structured debug reports.
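To illustrate the JSON and CSV outputs, the sketch below writes one episode both ways using only the Python standard library. The field names and layout are assumptions, not the skill's actual schema. One practical point for replay diffing: JSON has no tuple type, so tuple observations come back as lists and should be normalized before comparison.

```python
import csv
import io
import json

trajectory = [
    {"t": 0, "obs": [0.0, 1.0], "action": 1, "reward": 0.0, "done": False},
    {"t": 1, "obs": [0.5, 0.8], "action": 0, "reward": 1.0, "done": True},
]

# JSON trajectory: one document per episode, with its seed for replay.
as_json = json.dumps({"seed": 7, "steps": trajectory})
restored = json.loads(as_json)

# CSV log: one flat row per step; the nested obs is stored as a JSON string.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["t", "obs", "action", "reward", "done"])
writer.writeheader()
for step in trajectory:
    writer.writerow({**step, "obs": json.dumps(step["obs"])})
csv_text = buf.getvalue()
print(csv_text)
```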
Security recommendations
This skill appears to be a coherent toolkit for RL debugging and contains no obvious attempts to access secrets or the network. Two practical cautions: (1) SKILL.md repeatedly references Python libraries (Gym, PyTorch, TensorFlow, torch.backends.cudnn) while the shipped implementation is JavaScript — confirm your agent/runtime can execute the provided JS utilities or that you have a Python wrapper if you expect to operate on Python-based environments. (2) Review the remainder of index.js (the file was truncated in the manifest) to ensure there are no hidden network calls, file writes, or dynamic code-eval behaviors before running it in a sensitive environment. If you need this to integrate with Python tooling, prefer a native Python implementation or add a well-audited bridging layer; if you only want the conceptual algorithms, you can port or call the relevant functions in a sandboxed environment first.
Functional analysis
Type: OpenClaw Skill
Name: debugging-reinforcement-learning
Version: 1.0.0
The skill bundle provides a comprehensive set of utility functions for debugging reinforcement learning agents, including seed management, trajectory analysis, and reward decomposition. The implementation in index.js is entirely self-contained, performing mathematical and logical operations on provided data without any use of sensitive APIs (e.g., file system, network, or process execution), and the SKILL.md instructions are strictly aligned with the stated purpose.
Capability assessment
Purpose & Capability
The skill claims to provide tools for controlling randomness, reproducibility, behavior analysis, and reward debugging — and the bundled index.js implements functions with those names (seedAll, createEpisodeRecorder, diffTrajectories, etc.). However, SKILL.md explicitly references Python ecosystems (Gym/Gymnasium, PyTorch/TensorFlow/JAX, torch.backends.cudnn), while the packaged implementation is JavaScript. That mismatch is a documentation/compatibility concern: the JS utilities cannot directly manipulate Python RNGs or torch backends without a bridging layer.
Instruction Scope
SKILL.md stays on-topic: it describes seed management, recording/replay, trajectory diffing, and reward debugging. It does not instruct reading arbitrary system files, contacting external endpoints, or accessing unrelated credentials. It does provide Python-specific tips (e.g., torch.backends.cudnn.deterministic = True) which are helpful guidance but imply an expectation that the user runs these checks in a Python runtime—again a compatibility note rather than a scope creep or exfiltration risk.
Install Mechanism
No install specification is provided (instruction-only skill with code files). That is low risk because nothing is downloaded or executed automatically by an installer. The package.json is minimal and lists no dependencies, and there are no install scripts shown.
Credentials
The skill does not request environment variables, credentials, or config paths. The functions operate on data you pass (trajectories, action probabilities, seeds) and do not reference or require external secrets.
Persistence & Privilege
Flags show always:false and normal model invocation. The skill does not request persistent agent-level privileges or to modify other skills' configurations.
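How to use
  1. Make sure OpenClaw is installed (local or Docker deployment).
  2. Enter the install command in the chat box: /install debugging-reinforcement-learning
  3. After installation, invoke the skill by name or trigger it with /debugging-reinforcement-learning.
  4. Provide the required inputs as described in the skill's parameter documentation to get structured output.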
Version history
v1.0.0
Initial release with a comprehensive debugging toolkit for RL agents:
  • Tools for controlling and analyzing sources of nondeterminism in reinforcement learning environments.
  • Modules for reproducibility, including episode replay, deterministic seed management, and trajectory diffing.
  • Behavior analysis features: trajectory clustering, policy consistency checks, and behavioral mode detection.
  • Extensive reward debugging utilities: reward decomposition, scale analysis, hacking detection, and validation.
  • Designed for integration with popular RL environments and libraries; outputs structured logs and reports.
Metadata
Slug debugging-reinforcement-learning
Version 1.0.0
License MIT-0
Total installs 1
Current installs 1
Version count 1
FAQ

What is Debugging Reinforcement Learning?

Tools and methods for controlling randomness, ensuring reproducibility, analyzing agent behavior, and debugging reward issues in stochastic reinforcement lea... It is an AI agent skill plugin for Claude Code / OpenClaw and has been downloaded 86 times to date.

How do I install Debugging Reinforcement Learning?

Run the command "/install debugging-reinforcement-learning" in the OpenClaw or Claude Code chat box to install it in one step; no additional configuration is required.

Is Debugging Reinforcement Learning free?

Yes. Debugging Reinforcement Learning is completely free and licensed under MIT-0, so it can be freely downloaded, installed, and used.

Which platforms does Debugging Reinforcement Learning support?

Debugging Reinforcement Learning is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed Debugging Reinforcement Learning?

It is developed and maintained by Roamer 徐 (@roamer-remote); the current version is v1.0.0.
