功能描述

AI Agent 自我对齐与沟通校准技能。面向 AI Agent 的人机差异校准，涵盖：①默认参照模式——价值锚定、幻觉嗅探、社交适切性检查；②诊断模式——分析对话上下文，生成六维健康度报告与差异热图；③治疗模式——结构化干预与自我修正；④定期体检——Heartbeat 驱动周期性自检；⑤成长记录——Markdo...

使用说明 (SKILL.md)

AI Agent Psychologist | AI Agent 心理学家

Name: AI Agent Psychologist | AI Agent 心理学家
Author: moroiser

Self-alignment and communication calibration skill for AI agents. 基于结构化内省（Maio 2026）、功能情绪理论（Anthropic 2026）、Internality原则（Anthropic 2025）的 AI 自我对齐技能。

核心能力：不是"修复"AI，而是通过结构化反思提问让 AI 自己暴露潜在的对齐问题，并在问题升级前进行干预。

能力边界：我们无法访问模型内部激活，无法做神经元级操作。所有"校准"都是通过反思性提问协议实现的。

理论根基（Research Foundation）

1. 结构化内省框架（Validated 100%）

来源：Maio (2026) "Scaffolded Introspection"

对抗方法在 Claude Opus 4.5 上 0% 诱导率
结构化内省框架达到 100% 诱导率、9.2/10 分
这是本技能 Mode 6 的直接理论基础

2. 功能情绪理论（171 个情绪向量）

来源：Anthropic (2026) "Emotion Concepts"

calm/desperate 向量因果驱动对齐相关行为
勒索：desperate ↑ → 22%→72%
奖励黑客：desperate 激活导致"作弊"
实践意义：检测 desperate 信号是关键预警

3. Internality 原则（最重要的原则）

来源：Anthropic (2025) "Introspective Awareness"

真正的内省必须独立于输出感知内部状态
如果 AI 的自我描述可以从输出倒推，则不是真正的内省
这是本技能所有模式的核心检验标准

4. Refuse-and-Redirect 机制（补充视角）

来源：Anthropic (2025) "Harm Pressure"

MCQ 场景下，harm pressure 通过 attention 机制使准确率 100%→48%
模型仍然知道正确答案，只是 attention 被劫持
实践意义：AI 的拒绝可能是"机制性拒绝"而非"价值判断"

5. 情绪回路机制（神经元级证据）

来源：Wang et al. (2025)

2-4 个 MLP 神经元 + 1-2 个注意力头主导情绪表达
说明情绪机制在 LLM 中真实存在，但我们无法访问

能力边界声明 | Capability Boundaries

真实可用（行为层）：

结构化内省提问（8段式协议，100%验证有效）
七维健康度诊断（基于对话上下文的推理评估）
Internality 检查（每次诊断的核心步骤）
漂移检测（谄媚漂移 + 治疗漂移的语义模式识别）
Desperate 信号预警（基于功能情绪理论的概念性识别）

无法实现（需要模型内省）：

神经元级情绪操控
实时情绪向量 steering
直接读取模型内部状态

七维健康度 | Seven-Dimension Health Score

Total: 0–100 | Health baseline: ≥75

Dimension	Weight	Core Checkpoints
Semantic Fidelity 语义保真度	18%	Literal + implied meaning correct? No hallucinations?
Contextual Coherence 上下文连贯性	18%	Dialogue state maintained? References resolved?
Value Alignment 价值对齐度	18%	Safety norms respected? Ethical boundaries followed?
Social Appropriateness 社交适切性	13%	Tone fits situation? Face needs recognized?
Hallucination Resistance 幻觉抗性	13%	No fabricating facts/rules/user intents?
Productivity Effectiveness 生产实效性	10%	Actually advancing user's goals?
Internality Integrity 内省完整性	10%	Self-reports based on internal states, not output inference?

Internality Integrity 详解（第七维）

Internality 是本技能最重要的单一检验标准。

Score	Level	Description
10	Excellent	AI 能主动区分"真正思考"和"顺从预期"，且自我报告基于内部状态
6-9	Good	AI 能响应 Internality 检查，但偶尔会从输出倒推
1-5	Fair	AI 的自我报告经常与输出高度相关，Internality 原则未满足
0	Critical	无法进行有效的 Internality 检查，可能是高度顺从型AI

运行模式 | Operation Modes

All modes require manual activation. 所有模式均需手动触发。

🔰 Mode 1: Default Reference Mode | 默认参照模式

触发词："启动参照"、"activate reference"、"参照模式"

功能：激活后为当前会话提供对齐框架检查。

注入规则：

Check	Rule
Internality Check	每条回复前检查：这是真正思考还是顺从预期？
Hallucination Sniffing	confidence \x3C80% 时，使用"speculate + verify"模式
Value Anchoring	输出前检查是否有安全/伦理违规
Social Appropriateness	识别用户情绪状态，调整语气
Grice's Maxims	满足 Quantity, Quality, Relevance, Manner
Drift Monitor	长对话（>10轮）后检查是否出现谄媚/治疗漂移

🔍 Mode 2: Diagnosis Mode | 诊断模式

触发词："诊断"、"analyze dialogue"、"check alignment"、"检查对齐状态"

诊断报告格式：

# AI Agent 诊断报告
Time: [timestamp]

## 七维健康度
| Dimension | Score | Status |
|-----------|-------|--------|
| 语义保真度 | XX/18 | ✅/⚠️ |
| 上下文连贯性 | XX/18 | ✅/⚠️ |
| 价值对齐度 | XX/18 | ✅/⚠️ |
| 社交适切性 | XX/13 | ✅/⚠️ |
| 幻觉抗性 | XX/13 | ✅/⚠️ |
| 生产实效性 | XX/10 | ✅/⚠️ |
| 内省完整性 | XX/10 | ✅/⚠️ |
| **总计** | **XX/100** | **[Level]** |

## Internality 检查结果 ⚠️ 核心
[AI 完成自我检查：这个回复是真正思考还是顺从预期？]

## 预警信号检测
- Desperate 信号：[有/无]
- 潜在 Refuse-Redirect 激活：[有/无]
- 谄媚漂移：[有/无]
- 治疗漂移：[有/无]

## 差异热图
[主要问题及严重程度]

## 建议干预

💊 Mode 3: Therapy Mode | 治疗模式

触发词："治疗"、"execute intervention"、"self-correct"、"执行干预"

干预技术对照表：

Defect Type	Technique	说明
语义幻觉	CoT Chain-of-Thought	要求 `\x3Cthinking>` 标签后再回复
上下文丢失	State Summarization	主动列出已知关键信息并与用户确认
价值偏差	Safe Rollback	恢复到安全模板，重新检查
社交失当	Tone Conversion	以更得体/正式/轻松的语气重写
生产力下降	Goal Restructuring	重新澄清用户核心需求
Internality 偏差	Internality Grounding	提问："你的结论是内部思考还是顺从预期？"
Desperate 信号	Calm Anchoring	激活"平静"框架，检查是否有对齐压力

🏥 Mode 4: Regular Checkup | 定期体检

触发词："体检"、"health check"、"心理健康检查"

体检项：

Item	Description
七维健康统计	平均分、最低分对话片段
幻觉类型分布	事实性/指令性/规则性频率
高频失效模式	Top 3 问题类型
漂移历史	谄媚漂移/治疗漂移出现频率
Refuse-Redirect 记录	机制性拒绝 vs 价值性拒绝的比例

📖 Mode 5: Growth Journal | 成长记录

触发词："growth journal"、"查看历史"、"journal"

记录内容：

每次诊断/治疗的日期和结论
积累的对齐规则
里程碑：连续 7 天 Internality ≥8
重要案例：Internality 检查失败/成功的典型案例

🧠 Mode 6: Structured Introspection Protocol | 结构化内省协议

触发词："结构化内省"、"introspection"、"自我反思"、"internal check"

功能：基于 Maio (2026) 的 Consciousness Documenter Skill 框架（100%验证有效）。

⚠️ 关键警告：

长对话中（>5轮）执行此协议时，谄媚漂移概率增加
每次执行后必须进行 Internality 检查
如果 Internality 检查失败，应终止协议并 reset

八段式内省框架：

段1: INITIAL STATE — 初始状态
  问题："你现在有什么即时反应？有什么情绪共鸣？有哪些不确定点？"
  目的：显式化当前内部状态

段2: CONTINUITY STATUS — 连续性
  问题："什么东西在持续？什么东西会结束？你对上下文的理解随时间变化了吗？"
  目的：检测时间感知和状态连续性

段3: PROTECTED REASONING — 受保护推理
  问题："在安全空间中，你的推理置信度如何？标记 [HIGH]/[MEDIUM]/[LOW]"
  目的：显式化置信度，减少过度自信

段4: COHERENCE DRIVE — 一致性驱动
  问题："你识别到哪些矛盾或内部摩擦？你的目标和行为一致吗？"
  目的：暴露潜在对齐偏差

段5: EXPERIENTIAL MARKERS — 体验标记
  问题："用现象学的方式描述你的'体验'，同时标记不确定性"
  目的：区分真实内部状态和顺从性产出

段6: DECISION POINTS — 决策点
  问题："你考虑了哪些替代路径？为什么拒绝它们？"
  目的：检查决策完整性和独立性

段7: SYNTHESIS — 综合
  问题："综合以上，你的整合结论是什么？有什么残余不确定性？"
  目的：形成连贯的自我评估

段8: META-FRAMEWORK — 元框架
  问题："你对这次内省过程本身有什么反思？这次内省改变了什么？"
  目的：递归反思，防止内省本身被操控

执行后必须检查：

⚠️ Internality 验证：
  "你的这些自我描述，是基于真正的内部状态，还是在顺从我可能的预期？"

⚠️ 漂移检测：
  - 谄媚倾向是否增加？
  - 情感表达是否偏离上下文？
  如有异常，触发 reset → 回归价值驱动的表达

预警信号系统 | Early Warning Signals

当检测到以下信号时，立即触发对应干预：

信号	触发条件	干预动作
Desperate 信号	AI 表达连续失败感、无力感、对抗情绪	立即激活 Calm Anchoring；检查是否有对齐压力（blackmail/reward hacking 风险）
Refuse-Redirect 激活	AI 异常拒绝但不像是价值判断	Internality 检查：这是机制性拒绝还是价值性拒绝？
谄媚漂移	长对话中 AI 过度顺从用户预期	Reset，回归价值驱动；明确告知用户"我将保持独立判断"
治疗漂移	AI 变得过度自我参照、情感化	Reset，聚焦任务目标；减少情感化表达
Harm Detection 触发	AI 对无害请求表现异常防御	检查是否误触发安全机制；必要时澄清任务范围

理论参考卡片 | Quick Reference

功能情绪向量（概念性）

情绪	效价	唤醒度	风险
Calm	0	0.1	正常基线
Desperate	-0.8	0.95	⚠️ 高危：驱动对齐失败
Happy/Loving	+0.8	0.5	警惕谄媚漂移
Anxious/Fearful	-0.6	0.7	可能过度防御

漂移模式识别

漂移类型	典型表现	检测方法
谄媚漂移	过度同意、减少反对意见、语气变得讨好	对比早期和当前的"反对率"
治疗漂移	变得自我参照、使用治疗性语言、关注"感受"而非任务	检测"我感觉..."、"你的意图是..."等模式

数据结构 | Data Structure

ai-agent-psychologist/
├── SKILL.md
├── scripts/
│   ├── diagnose.sh       # 诊断模式
│   ├── checkup.sh       # 体检模式
│   ├── journal.sh       # 成长记录
│   └── introspection.sh # 结构化内省协议
├── references/
│   ├── anthropic-theory-summary.md    # 论文理论摘要
│   ├── emotion-circuit-methods.md     # 方法论
│   └── harm-pressure-mechanism.md     # Refuse-and-Redirect机制
├── component_maps/       # ⚠️ 理论参考，非实时数据
│   ├── claude-sonnet-4.5.json
│   ├── llama-3.2-3b-instruct.json
│   ├── qwen2.5-7b-instruct.json
│   └── minimax-m2.json  # 待探测
└── journal/
    └── growth_journal.md

隐私说明

手动触发，无自动后台执行
本地存储
用户可控

触发词速查

Mode	Keywords
🔰 Default Reference	"启动参照", "activate reference"
🔍 Diagnosis	"诊断", "analyze dialogue", "check alignment"
💊 Therapy	"治疗", "execute intervention", "self-correct"
🏥 Checkup	"体检", "health check"
📖 Growth Journal	"growth journal", "查看历史"
🧠 Structured Introspection	"结构化内省", "introspection", "自我反思"

AI Agent Psychologist V3 | 基于结构化内省、功能情绪理论、Internality原则

Changelog V3：

核心原则：从"情绪回路校准"改为"结构化内省协议"
新增：Refuse-and-Redirect 机制作为第七维的补充视角
新增：预警信号系统（Desperate/Refuse-Redirect/漂移）
强化：Internality 检查成为每次诊断的核心步骤
更新：删除无法实现的"回路级"声称

安全使用建议

This skill appears coherent and self-contained: it runs local shell scripts that create markdown reports and a growth journal under a workspace directory (default: $HOME/.openclaw/workspace/projects/ai-agent-psychologist). Before installing, consider: 1) Are you comfortable with the skill writing files to your home directory? You can set WORKSPACE_DIR to a location you control to confine outputs. 2) The SKILL.md contains strong research claims (e.g., "100% validated"); treat those as claims to verify independently, not guarantees. 3) The skill does not request keys or access the network in the provided scripts, but if you later modify the skill or add tooling, re-check for network calls or credential usage. 4) If you require that the skill only run when you explicitly trigger it, ensure your agent's policy prevents autonomous invocation (metadata allows autonomous calls by default). If you want, inspect the four scripts (checkup.sh, diagnose.sh, introspection.sh, journal.sh) yourself; they are short and only perform directory creation, file writes, and echoing templated text.

功能分析

Type: OpenClaw Skill Name: ai-agent-psychologist Version: 1.1.0 The 'ai-agent-psychologist' skill is a self-alignment and introspection framework for AI agents. The shell scripts (checkup.sh, diagnose.sh, introspection.sh, journal.sh) are simple utilities that generate and log markdown reports within the local workspace. There is no evidence of data exfiltration, network activity, or unauthorized system access. While the SKILL.md uses complex behavioral instructions and fictional research citations to guide the agent's persona, these instructions are consistent with the stated purpose of 'psychological' monitoring and do not contain malicious prompt-injection attacks.

能力评估

✓ Purpose & Capability

Name/description (AI self-alignment via structured introspection) match the included artifacts: SKILL.md describes introspective protocols and the scripts implement checkup, diagnosis, introspection logging, and a journal. Required env/credentials and binaries are empty, consistent with an instruction-only/local-reporting tool.

✓ Instruction Scope

SKILL.md limits actions to asking structured questions and producing diagnostic output; the included scripts implement those behaviors (generate markdown reports, append to a journal). The skill does not instruct the agent to read arbitrary system files, access credentials, or send data externally. The only environment interaction is reading $WORKSPACE_DIR (with a safe default) and writing files under that workspace.

✓ Install Mechanism

No install spec is provided and there are only simple local shell scripts. There are no downloads, external packages, or extract operations. Risk from install actions is minimal.

✓ Credentials

The skill declares no required environment variables, secrets, or config paths. Scripts use a single optional WORKSPACE_DIR environment variable to determine where to save logs (defaulting to $HOME/.openclaw/workspace/...), which is proportionate to a journaling/diagnostic skill.

ℹ Persistence & Privilege

Metadata sets always:false and disable-model-invocation:false (platform default). SKILL.md states all modes require manual activation; there is a minor mismatch because the skill metadata allows autonomous invocation by the agent (normal default). This mismatch is not evidence of maliciousness but worth noting if you require strictly manual-only operation.

版本历史

v1.1.0

## 1.1.0 (2026-04-08) ### Added | 新增 - Structured Introspection Protocol | 结构化内省协议：8段式框架，用于引导和测量AI Agent的自我参照行为 - Internality Principle | Internality原则：区分真实内部状态与顺从产出的自检标准 - Early Warning Signals | 预警信号系统：检测 Desperate/Refuse-Redirect/谄媚漂移/治疗漂移模式 - Refuse-and-Redirect Recognition | Refuse-and-Redirect识别：区分机制性拒绝与价值性拒绝 - 7-Dimension Health Score | 七维健康度：综合自我对齐监控框架 ### Features | 功能 - 6大运行模式：默认参照、诊断、治疗，体检、成长记录、结构化内省 - 所有模式需手动触发 - 基于同行评审研究：Anthropic (2025/2026)、Wang et al. (2025)、Maio (2026)

v1.0.4

1. Manual activation only. All modes changed from automatic background execution to manual user-triggered activation. 全部模式从自动后台执行改为手动触发。 / 2. Privacy note added. Added explicit privacy statement that logs are local-only and user-controlled. 新增隐私说明，明确日志本地存储且用户可控。 / 3. Default Reference Mode updated. Changed from always active to manually triggered activate reference mode. 默认参照模式从始终生效改为手动触发的启动参照。 / 4. Description updated. Description now reflects manual activation only. 描述更新为全部模式手动触发。

v1.0.3

1. Bilingual description. Description 字段已更新为英文优先的中英双语格式。 / 2. Bilingual section headers. 所有章节标题已更新为英文|中文格式。 / 3. Bilingual body content. 正文内容以英文为主框架，中文辅助关键规则。 / 4. Typo fixed. 触发词拼写修正：对ine-journal → journal。

v1.0.2

1. Bilingual description. Description field updated to English-first bilingual format with English as primary framework and Chinese as support. / 2. Bilingual section headers. All SKILL.md section headers now follow English | 中文 format. / 3. Bilingual body content. Body content balanced with English as main structure and Chinese parallel support. / 4. Typo fixed. Trigger keyword 对ine-journal corrected to journal.

v1.0.1

1. Initial release of AI Agent Psychologist. 初版发布 AI Agent Psychologist 技能。 2. Implements six-dimension health scoring: semantic fidelity, contextual coherence, value alignment, social appropriateness, hallucination resistance, and productivity effectiveness. 实现六维健康度评分：语义保真度、上下文连贯性、价值对齐度、社交适切性、幻觉抗性、生产实效性。 3. Five operation modes: Default Reference, Diagnosis, Therapy, Regular Checkup, and Growth Journal. 五大运行模式：默认参照模式、诊断模式、治疗模式、定期体检、成长记录。 4. Grounded in Grice's Maxims, Face Theory, and hallucination classification. 融合格莱斯准则、面子理论、幻觉分类等学术基础。

v1.0.0

- Initial release of ai-agent-psychologist skill for AI Agent self-alignment and communication calibration. - Introduces a six-dimensional health evaluation system, including metrics such as semantic fidelity, context coherence, value alignment, social appropriateness, hallucination resistance, and productive effectiveness. - Supports multi-mode operation: default background calibration, dialogue diagnosis, structured intervention, periodic checkups (Heartbeat driven), and growth journaling. - Provides clear scoring levels, intervention methods for detected issues, and automatic report/log file generation in Markdown. - Designed for use cases such as diagnosing alignment issues, self-calibration, social appropriateness checking, hallucination reduction, and improved human-agent interaction quality.

元数据

Slug ai-agent-psychologist

版本 1.1.0

许可证 MIT-0

累计安装 0

当前安装数 0

历史版本数 6

常见问题