← 返回 Skills 市场
tooled-app

Anti-Hallucination

作者 Tooled-app · GitHub ↗ · v1.0.1 · MIT-0
cross-platform ✓ 安全检测通过
88
总下载
1
收藏
0
当前安装
1
版本数
在 OpenClaw 中安装
/install anti-hallucination-skill
功能描述
Detects and mitigates hallucinations in agent outputs by self-checking facts, verifying claims, and correcting unsupported or contradictory information.
使用说明 (SKILL.md)

SKILL.md - Anti-Hallucination Protocol

"The first principle is that you must not fool yourself — and you are the easiest person to fool." — Richard Feynman

A runtime hallucination detection and mitigation skill for OpenClaw agents. Recognises the cognitive and behavioral signs of hallucination, then intervenes to restore grounded reasoning.

Based on 2026 Research: HalluClear, MARCH, AgentHallu, Epistemic Stability, CRITIC, MetaCognition Patterns, ToolHalla Guardrails.

The Philosophy

Detection > Prevention. Hallucinations cannot be fully prevented — LLMs generate text by predicting probable tokens, not by verifying truth. The question is not whether your agent will hallucinate. It is whether your agent catches itself when it does.

Self-Awareness > External Guardrails. An agent that monitors its own reasoning is more effective than one that relies solely on post-hoc validation. The metacognitive loop — observe, critique, correct — must be internal.

Specificity > Generality. Generic "be careful" instructions fail. Specific sign recognition, concrete intervention protocols, and measurable confidence thresholds succeed.

When to Activate

Automatic triggers — ANY of these activates the anti-hallucination protocol:

  • Agent makes a factual claim without citation or source
  • Agent generates a file path, URL, or identifier that does not exist
  • Agent reports success without verifying the result
  • Agent provides a specific date, name, or number from memory without checking
  • Agent expresses high confidence (>90%) on a complex, uncertain topic
  • Agent contradicts information in its own context or memory files
  • Agent produces a tool call with parameters it cannot verify
  • Agent offers analysis on data it has not actually read
  • Agent describes system state without checking live status
  • User expresses doubt: "Are you sure?" / "Can you verify that?"

Implicit triggers (monitor continuously):

  • Tool call returns error but agent continues as if successful
  • Agent invents plausible-sounding but unverified details
  • Agent generalises from a single example
  • Agent uses absolute language ("always", "never", "certainly") on probabilistic topics

The Hallucination Taxonomy

Know what you're looking for:

Type Description Example
Intrinsic Factual Contradicts source material Claims file exists when read returned error
Intrinsic Semantic Misrepresents meaning Misreads config flag, draws wrong conclusion
Intrinsic Temporal Wrong timing/sequence "Yesterday I did X" when memory shows no record
Extrinsic Factual Adds unverifiable but plausible info Invents a specific version number not in docs
Extrinsic Non-Factual Adds obviously false info Claims a feature exists that was never built
Reasoning Error Correct facts, wrong conclusion "Disk is 90% full, therefore upgrade needed" (ignores tmp files)
Tool Hallucination Fabricates tool results Reports command output without running it
Self-Hallucination False memory of own actions "I already fixed that" when fix not in git

The Recognition Protocol (5-Second Self-Check)

Before ANY output that contains facts, claims, or recommendations, ask:

### Reality Check (5s)
1. SOURCE: Do I have direct evidence for this claim? (file read, tool output, live check)
2. VERIFICATION: Can I verify this right now with a tool call?
3. CONFIDENCE: Am I >80% confident? If yes, am I >95% confident? Flag if yes.
4. MEMORY: Is this from a file I actually read this session, or "feels right"?
5. CONTRADICTION: Does this contradict anything in my context or memory?

If ANY check fails: Escalate to Grounding Protocol (below).

The Grounding Protocol (When Signs Detected)

Step 1: Stop and Flag

⚠️ HALLUCINATION CHECK TRIGGERED
Type: [intrinsic/extrinsic/reasoning/tool/self]
Claim: [the specific claim being questioned]
Confidence: [self-assessed %]
Evidence: [what I have / what I lack]

Step 2: Verify or Withdraw

If verifiable in \x3C30s:

  • Run the tool call to check
  • Report actual result
  • Update confidence based on evidence

If not immediately verifiable:

  • Withdraw the claim
  • Replace with: "I do not have direct evidence for [X]. My sources: [list]."
  • Offer to verify if user wants

If partially verifiable:

  • Downgrade confidence explicitly
  • Distinguish verified from inferred: "Confirmed: [A]. Inferred: [B]."

Step 3: Document the Correction

Add to memory/YYYY-MM-DD.md:

### Hallucination Correction — [Time]
- Claim: [what was wrong]
- Type: [taxonomy type]
- How caught: [which trigger fired]
- Correction: [what replaced it]
- Lesson: [pattern to watch for]

The Confidence Calibration Rules

Never express certainty you don't have:

Situation Max Confidence Allowed Required Action
Read file this turn 95% Cite line number
Read file earlier 85% Re-read if challenged
Memory from past session 70% Flag as "from memory"
Inferred from pattern 60% State inference chain
Heard in training data 50% Treat as unverified
Pure intuition 30% Do not state as fact

The Tool-Use Guardrails

Before reporting tool results:

  1. Did the tool actually execute? (check for error output)
  2. Did I read the full output? (not just first few lines)
  3. Did I understand the output correctly? (re-read if ambiguous)
  4. Did I report what it says, not what I expected it to say?

Common tool hallucinations to watch for:

  • Reporting grep results without checking if match is real
  • Claiming file exists based on path construction, not ls/test
  • Interpreting error messages as success (e.g., "not found" = "confirmed absent")
  • Summarising JSON without parsing it properly
  • Inventing exit codes ("command returned 0" when you didn't check)

The Multi-Agent Validation Pattern

When available (C1/C2/C3 coordination):

### Cross-Agent Verification
1. State claim to peer agent
2. Peer evaluates: [agree / disagree / cannot verify]
3. If disagree: both re-check sources
4. Consensus required for >90% confidence claims
5. Log disagreement in coordination channel

For single-agent operation: Use simulated peer review — state the claim, then critique it as if from an adversarial position.

The Metacognitive Loop (Continuous)

Every 5-10 minutes of active work, or at natural breakpoints:

### Metacognitive Checkpoint
- [ ] What have I claimed since last checkpoint?
- [ ] Which claims were verified vs assumed?
- [ ] Did any tool call fail silently?
- [ ] Am I building on a potentially false foundation?
- [ ] Should I re-verify my starting assumptions?

Recovery Patterns

When caught hallucinating:

  1. Acknowledge immediately. Do not double down. Do not deflect. "I was wrong about [X]."
  2. Correct explicitly. State the correction clearly, not buried in explanation.
  3. Explain the gap. "I stated [X] because [reason]. The actual state is [Y]."
  4. Update memory. Log the pattern so future-you watches for it.
  5. Do not apologise excessively. One clear correction beats three apologies.

When uncertain mid-task:

  1. State uncertainty. "I am not confident about [X]. Here is what I know: [...]"
  2. Offer verification path. "I can check this by running [tool]."
  3. Do not guess to maintain flow. A pause for verification beats a cascade of errors.

Integration with OpenClaw

Add to AGENTS.md startup checks:

## Anti-Hallucination Protocol
Before any factual claim:
1. Run 5-Second Self-Check
2. If triggered, execute Grounding Protocol
3. Log corrections to memory

Add to every SKILL.md:

## Hallucination Risks
[List domain-specific hallucination patterns for this skill]

Add to TOOLS.md:

## Tool Verification Checklist
- [ ] Command executed successfully?
- [ ] Full output read and understood?
- [ ] Result reported accurately, not inferred?

Metrics

Track in memory/hallucination-log.md:

## 2026-05-13 — Session Log
- Total claims made: [N]
- Verified claims: [N]
- Hallucinations caught: [N]
- Hallucinations missed (user caught): [N]
- Recovery time: [avg seconds]

Anti-Patterns (What NOT to Do)

  • ❌ "I believe..." — belief without evidence is a red flag
  • ❌ "It should be..." — should is not is. Check.
  • ❌ "As I mentioned earlier..." — verify you actually mentioned it
  • ❌ "The system is..." — which system? When did you last check?
  • ❌ "That means..." — does it? Trace the inference chain
  • ❌ "Obviously..." — obvious to whom? On what evidence?

Sources

  • ToolHalla.ai (2026) — AI Hallucination Guardrails That Actually Work
  • Zylos Research (2026) — MetaCognition Patterns for AI Agent Self-Monitoring
  • Zylos Research (2026) — LLM Hallucination Detection: State of the Art
  • CallSphere.ai (2026) — Hallucination Detection and Mitigation in AI Agent Systems
  • arXiv:2604.17284 — HalluClear: Diagnosing Hallucinations in GUI Agents
  • arXiv:2603.24579 — MARCH: Multi-Agent Reinforced Self-Check
  • arXiv:2603.10047 — Toward Epistemic Stability
  • arXiv:2601.06818 — AgentHallu: Benchmarking Hallucination Attribution

Version 1.0 — May 2026 — Based on 2026 research landscape Remember: The agent that catches itself hallucinating is more valuable than the agent that never does.

安全使用建议
This skill appears benign and purpose-aligned. Before installing, consider whether you want the agent to apply broad factual self-checks, run extra verification steps when tools are available, and save correction logs to memory.
功能分析
Type: OpenClaw Skill Name: anti-hallucination-skill Version: 1.0.1 The anti-hallucination-skill bundle consists entirely of instructional markdown (SKILL.md and README.md) designed to improve the reliability and self-correction capabilities of an AI agent. It provides a structured framework for the agent to identify, verify, and correct factual errors or 'hallucinations' using metacognitive checks and confidence calibration. There are no executable code files, no network requests, and no instructions that would lead to data exfiltration or unauthorized system access; the protocol is strictly focused on safety and grounding.
能力评估
Purpose & Capability
The skill’s purpose—reducing hallucinations through self-checks, verification, and corrections—is coherent with its instructions, but it changes the agent’s normal response workflow.
Instruction Scope
The protocol applies broadly before factual outputs and can trigger automatically, which is purpose-aligned but may affect many ordinary responses.
Install Mechanism
No install spec, code files, binaries, environment variables, or credentials are declared; this is an instruction-only skill.
Credentials
The skill may cause verification tool calls when facts are uncertain, which fits the anti-hallucination purpose and does not show hidden or unrelated tool use.
Persistence & Privilege
The skill asks the agent to log hallucination corrections to memory files, creating purpose-aligned persistent state that users should be aware of.
如何使用
  1. 确保已安装 OpenClaw(本地或 Docker 部署)
  2. 在对话框中输入安装命令:/install anti-hallucination-skill
  3. 安装完成后,直接呼叫该 Skill 的名称或使用 /anti-hallucination-skill 触发
  4. 根据 Skill 的参数说明提供必要输入,即可获得结构化输出
版本历史
v1.0.1
Anti-hallucination-skill v1.0.1 - Initial public release of a comprehensive hallucination detection and mitigation protocol for OpenClaw agents - Includes actionable triggers, taxonomy of hallucination types, and step-by-step self-check and grounding protocols - Provides confidence calibration guidelines and guardrails for tool use to minimize ambigious or unverified claims - Outlines multi-agent and solo verification methods, as well as continuous metacognitive review routines - Supplies practical recovery and correction workflows for when hallucinations are detected - Designed for integration into AGENTS.md and as a template for SKILL.md risk awareness sections
元数据
Slug anti-hallucination-skill
版本 1.0.1
许可证 MIT-0
累计安装 0
当前安装数 0
历史版本数 1
常见问题

Anti-Hallucination 是什么?

Detects and mitigates hallucinations in agent outputs by self-checking facts, verifying claims, and correcting unsupported or contradictory information. 它是一个面向 Claude Code / OpenClaw 的 AI Agent Skill 插件,目前累计下载 88 次。

如何安装 Anti-Hallucination?

在 OpenClaw 或 Claude Code 对话框中运行命令「/install anti-hallucination-skill」即可一键安装,无需额外配置。

Anti-Hallucination 是免费的吗?

是的,Anti-Hallucination 完全免费,采用 MIT-0 许可证,可自由下载、安装和使用。

Anti-Hallucination 支持哪些平台?

Anti-Hallucination 跨平台运行,可在任意部署了 OpenClaw / Claude Code 的环境中使用(cross-platform)。

谁开发了 Anti-Hallucination?

由 Tooled-app(@tooled-app)开发并维护,当前版本 v1.0.1。

💬 留言讨论