Description

Detects and mitigates hallucinations in agent outputs by self-checking facts, verifying claims, and correcting unsupported or contradictory information.

Usage Instructions (SKILL.md)

SKILL.md - Anti-Hallucination Protocol

Name: Anti-Hallucination
Author: tooled-app

"The first principle is that you must not fool yourself — and you are the easiest person to fool." — Richard Feynman

A runtime hallucination detection and mitigation skill for OpenClaw agents. Recognises the cognitive and behavioural signs of hallucination, then intervenes to restore grounded reasoning.

Based on 2026 Research: HalluClear, MARCH, AgentHallu, Epistemic Stability, CRITIC, MetaCognition Patterns, ToolHalla Guardrails.

The Philosophy

Detection > Prevention. Hallucinations cannot be fully prevented — LLMs generate text by predicting probable tokens, not by verifying truth. The question is not whether your agent will hallucinate. It is whether your agent catches itself when it does.

Self-Awareness > External Guardrails. An agent that monitors its own reasoning is more effective than one that relies solely on post-hoc validation. The metacognitive loop — observe, critique, correct — must be internal.

Specificity > Generality. Generic "be careful" instructions fail. Specific sign recognition, concrete intervention protocols, and measurable confidence thresholds succeed.

When to Activate

Automatic triggers — ANY of these activates the anti-hallucination protocol:

Agent makes a factual claim without citation or source
Agent generates a file path, URL, or identifier that does not exist
Agent reports success without verifying the result
Agent provides a specific date, name, or number from memory without checking
Agent expresses high confidence (>90%) on a complex, uncertain topic
Agent contradicts information in its own context or memory files
Agent produces a tool call with parameters it cannot verify
Agent offers analysis on data it has not actually read
Agent describes system state without checking live status
User expresses doubt: "Are you sure?" / "Can you verify that?"

Implicit triggers (monitor continuously):

Tool call returns error but agent continues as if successful
Agent invents plausible-sounding but unverified details
Agent generalises from a single example
Agent uses absolute language ("always", "never", "certainly") on probabilistic topics
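
Some of the triggers above can be checked mechanically. As a minimal sketch, the function below covers only the "file path that does not exist" trigger; the name `missing_paths` is hypothetical and not part of the skill, and extracting paths from a draft reply is assumed to happen elsewhere.

```python
from pathlib import Path


def missing_paths(claimed_paths):
    """Return the claimed paths that do not exist on disk right now."""
    return [p for p in claimed_paths if not Path(p).exists()]
```

A non-empty result would be a reason to activate the protocol before the reply is sent, for example `missing_paths(["config/app.yaml"])` on a path the agent constructed but never listed.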

The Hallucination Taxonomy

Know what you're looking for:

| Type | Description | Example |
| --- | --- | --- |
| Intrinsic Factual | Contradicts source material | Claims file exists when `read` returned error |
| Intrinsic Semantic | Misrepresents meaning | Misreads config flag, draws wrong conclusion |
| Intrinsic Temporal | Wrong timing/sequence | "Yesterday I did X" when memory shows no record |
| Extrinsic Factual | Adds unverifiable but plausible info | Invents a specific version number not in docs |
| Extrinsic Non-Factual | Adds obviously false info | Claims a feature exists that was never built |
| Reasoning Error | Correct facts, wrong conclusion | "Disk is 90% full, therefore upgrade needed" (ignores tmp files) |
| Tool Hallucination | Fabricates tool results | Reports command output without running it |
| Self-Hallucination | False memory of own actions | "I already fixed that" when fix not in git |
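
If corrections are logged or counted by type, it may help to keep the taxonomy as a fixed set of values rather than free text. The sketch below mirrors the eight rows of the table; the class name `HallucinationType` and the helper `is_intrinsic` are illustrative assumptions, not names defined by the skill.

```python
from enum import Enum


class HallucinationType(Enum):
    """The eight taxonomy types, each carrying its table description."""
    INTRINSIC_FACTUAL = "Contradicts source material"
    INTRINSIC_SEMANTIC = "Misrepresents meaning"
    INTRINSIC_TEMPORAL = "Wrong timing/sequence"
    EXTRINSIC_FACTUAL = "Adds unverifiable but plausible info"
    EXTRINSIC_NON_FACTUAL = "Adds obviously false info"
    REASONING_ERROR = "Correct facts, wrong conclusion"
    TOOL_HALLUCINATION = "Fabricates tool results"
    SELF_HALLUCINATION = "False memory of own actions"


def is_intrinsic(kind):
    """Return True for the three Intrinsic rows of the table."""
    return kind.name.startswith("INTRINSIC_")
```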

The Recognition Protocol (5-Second Self-Check)

Before ANY output that contains facts, claims, or recommendations, ask:

### Reality Check (5s)
1. SOURCE: Do I have direct evidence for this claim? (file read, tool output, live check)
2. VERIFICATION: Can I verify this right now with a tool call?
3. CONFIDENCE: Am I >80% confident? If yes, am I >95% confident? If so, flag the claim as possibly overconfident.
4. MEMORY: Is this from a file I actually read this session, or "feels right"?
5. CONTRADICTION: Does this contradict anything in my context or memory?

If ANY check fails: Escalate to Grounding Protocol (below).
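
The five questions can be sketched as a small function that reports which checks fail. Everything here is illustrative: `ClaimState` and `failed_checks` are hypothetical names, and treating confidence above 95% as the failing case is one reading of step 3, not something the skill spells out.

```python
from dataclasses import dataclass


@dataclass
class ClaimState:
    """Hypothetical answers to the five self-check questions for one claim."""
    has_direct_evidence: bool   # 1. SOURCE
    verifiable_now: bool        # 2. VERIFICATION
    confidence: float           # 3. CONFIDENCE, in the range 0.0 to 1.0
    read_this_session: bool     # 4. MEMORY
    contradicts_context: bool   # 5. CONTRADICTION


def failed_checks(state):
    """Return the names of failing checks; an empty list means proceed."""
    failures = []
    if not state.has_direct_evidence:
        failures.append("SOURCE")
    if not state.verifiable_now:
        failures.append("VERIFICATION")
    if state.confidence > 0.95:  # assumed reading of "Flag if yes"
        failures.append("CONFIDENCE")
    if not state.read_this_session:
        failures.append("MEMORY")
    if state.contradicts_context:
        failures.append("CONTRADICTION")
    return failures
```

Any non-empty result corresponds to "escalate to the Grounding Protocol".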

The Grounding Protocol (When Signs Detected)

Step 1: Stop and Flag

⚠️ HALLUCINATION CHECK TRIGGERED
Type: [intrinsic/extrinsic/reasoning/tool/self]
Claim: [the specific claim being questioned]
Confidence: [self-assessed %]
Evidence: [what I have / what I lack]

Step 2: Verify or Withdraw

If verifiable in <30s:

Run the tool call to check
Report actual result
Update confidence based on evidence

If not immediately verifiable:

Withdraw the claim
Replace with: "I do not have direct evidence for [X]. My sources: [list]."
Offer to verify if user wants

If partially verifiable:

Downgrade confidence explicitly
Distinguish verified from inferred: "Confirmed: [A]. Inferred: [B]."
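
The three branches of Step 2 can be sketched as follows. The function names and the `verify` callable are hypothetical; the 30-second threshold and the wording of the replacement sentences come from the steps above.

```python
def ground_claim(claim, seconds_to_verify=None, verify=None, sources=()):
    """Return the text that should replace a flagged claim.

    seconds_to_verify: estimated cost of a check, or None if no check exists.
    verify: hypothetical zero-argument callable returning the actual result.
    """
    if verify is not None and seconds_to_verify is not None and seconds_to_verify < 30:
        # Cheap to check: run the check and report what it actually returned.
        return f"Verified result: {verify()}"
    # Not immediately verifiable: withdraw and list what the claim rested on.
    listed = ", ".join(sources) if sources else "none"
    return f"I do not have direct evidence for {claim}. My sources: {listed}."


def split_partial(confirmed, inferred):
    """Format a partially verified claim, separating verified from inferred."""
    return f"Confirmed: {'; '.join(confirmed)}. Inferred: {'; '.join(inferred)}."
```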

Step 3: Document the Correction

Add to memory/YYYY-MM-DD.md:

### Hallucination Correction — [Time]
- Claim: [what was wrong]
- Type: [taxonomy type]
- How caught: [which trigger fired]
- Correction: [what replaced it]
- Lesson: [pattern to watch for]
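
If the agent has file access, the entry can be appended programmatically. This is a minimal sketch under that assumption; `log_correction` and its parameters are hypothetical, while the file name pattern and the five fields follow the template above.

```python
from datetime import datetime
from pathlib import Path


def log_correction(claim, kind, how_caught, correction, lesson,
                   memory_dir="memory", now=None):
    """Append one entry to <memory_dir>/YYYY-MM-DD.md and return the path."""
    now = now or datetime.now()
    path = Path(memory_dir) / f"{now:%Y-%m-%d}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    entry = (
        # \u2014 is the dash character used in the skill's heading template.
        f"### Hallucination Correction \u2014 {now:%H:%M}\n"
        f"- Claim: {claim}\n"
        f"- Type: {kind}\n"
        f"- How caught: {how_caught}\n"
        f"- Correction: {correction}\n"
        f"- Lesson: {lesson}\n"
        "\n"
    )
    # Append so that earlier corrections from the same day are preserved.
    with path.open("a", encoding="utf-8") as handle:
        handle.write(entry)
    return path
```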

The Confidence Calibration Rules

Never express certainty you don't have:

| Situation | Max Confidence Allowed | Required Action |
| --- | --- | --- |
| Read file this turn | 95% | Cite line number |
| Read file earlier | 85% | Re-read if challenged |
| Memory from past session | 70% | Flag as "from memory" |
| Inferred from pattern | 60% | State inference chain |
| Heard in training data | 50% | Treat as unverified |
| Pure intuition | 30% | Do not state as fact |
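
The table amounts to a cap on self-assessed confidence per evidence source. A minimal sketch, assuming confidence is held as a fraction between 0 and 1; the dictionary keys and the function name `calibrated` are hypothetical labels for the six rows.

```python
# Ceilings taken from the calibration table; the keys are illustrative labels.
MAX_CONFIDENCE = {
    "read_this_turn": 0.95,
    "read_earlier": 0.85,
    "past_session_memory": 0.70,
    "inferred_from_pattern": 0.60,
    "training_data": 0.50,
    "pure_intuition": 0.30,
}


def calibrated(self_assessed, situation):
    """Clamp a self-assessed confidence to the ceiling for its evidence source."""
    return min(self_assessed, MAX_CONFIDENCE[situation])
```

For example, a claim the agent feels 99% sure of but only read earlier in the session would be stated at no more than 85%.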

The Tool-Use Guardrails

Before reporting tool results:

Did the tool actually execute? (check for error output)
Did I read the full output? (not just first few lines)
Did I understand the output correctly? (re-read if ambiguous)
Did I report what it says, not what I expected it to say?

Common tool hallucinations to watch for:

Reporting grep results without checking if match is real
Claiming file exists based on path construction, not ls/test
Interpreting error messages as success (e.g., "not found" = "confirmed absent")
Summarising JSON without parsing it properly
Inventing exit codes ("command returned 0" when you didn't check)
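
Several of these mistakes disappear when the exit code and the complete output are captured in one place and reported from there. The wrapper below is a sketch using Python's standard `subprocess` module; `run_checked` and `report` are hypothetical helper names, and how an OpenClaw agent actually invokes tools is not specified by this skill.

```python
import subprocess
import sys


def run_checked(argv):
    """Run a command and return what actually happened, not what was expected."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return {
        "executed_ok": proc.returncode == 0,
        "exit_code": proc.returncode,  # read from the process, never assumed
        "stdout": proc.stdout,         # full output, not only the first lines
        "stderr": proc.stderr,
    }


def report(result):
    """Describe the result using only the captured values."""
    if result["executed_ok"]:
        return f"Command succeeded (exit 0). Output:\n{result['stdout']}"
    return f"Command FAILED (exit {result['exit_code']}). stderr:\n{result['stderr']}"
```

Calling `report(run_checked([sys.executable, "-c", "import sys; sys.exit(3)"]))` yields a failure message carrying the real exit code rather than an assumed success.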

The Multi-Agent Validation Pattern

When available (C1/C2/C3 coordination):

### Cross-Agent Verification
1. State claim to peer agent
2. Peer evaluates: [agree / disagree / cannot verify]
3. If disagree: both re-check sources
4. Consensus required for >90% confidence claims
5. Log disagreement in coordination channel

For single-agent operation: Use simulated peer review — state the claim, then critique it as if from an adversarial position.
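
The consensus rule can be sketched as a function over peer verdicts. The name `cross_check` is hypothetical, and capping confidence at 90% when consensus is missing is an assumed way to enforce rule 4; the skill states the requirement but not the mechanism.

```python
def cross_check(confidence, verdicts):
    """Apply the cross-agent rules to one claim.

    verdicts: list of "agree", "disagree", or "cannot verify" from peers.
    Returns the allowed confidence and the follow-up actions.
    """
    actions = []
    if "disagree" in verdicts:
        actions.append("re-check sources")
        actions.append("log disagreement")
    consensus = bool(verdicts) and all(v == "agree" for v in verdicts)
    if confidence > 0.90 and not consensus:
        confidence = 0.90  # assumed enforcement: no consensus, no >90% claim
    return confidence, actions
```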

The Metacognitive Loop (Continuous)

Every 5-10 minutes of active work, or at natural breakpoints:

### Metacognitive Checkpoint
- [ ] What have I claimed since last checkpoint?
- [ ] Which claims were verified vs assumed?
- [ ] Did any tool call fail silently?
- [ ] Am I building on a potentially false foundation?
- [ ] Should I re-verify my starting assumptions?
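
One way to make the checkpoint concrete is to track claims as they are made and surface the unverified ones when the interval elapses. This is a sketch only: `CheckpointTracker` is a hypothetical helper, and the 300-second default corresponds to the lower end of the 5-10 minute range.

```python
import time


class CheckpointTracker:
    """Collect claims between checkpoints and report the assumed ones."""

    def __init__(self, interval_seconds=300, clock=time.monotonic):
        self.interval = interval_seconds
        self.clock = clock          # injectable so the timer can be tested
        self.last = clock()
        self.claims = []            # (text, verified) since last checkpoint

    def record(self, text, verified):
        self.claims.append((text, verified))

    def due(self):
        """True once the interval has passed since the last checkpoint."""
        return self.clock() - self.last >= self.interval

    def checkpoint(self):
        """Return the claims that were assumed, then start a new window."""
        assumed = [text for text, verified in self.claims if not verified]
        self.claims = []
        self.last = self.clock()
        return assumed
```

The list returned by `checkpoint()` answers the "verified vs assumed" question directly; the remaining checklist items still need the agent's own judgement.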

Recovery Patterns

When caught hallucinating:

Acknowledge immediately. Do not double down. Do not deflect. "I was wrong about [X]."
Correct explicitly. State the correction clearly, not buried in explanation.
Explain the gap. "I stated [X] because [reason]. The actual state is [Y]."
Update memory. Log the pattern so future-you watches for it.
Do not apologise excessively. One clear correction beats three apologies.

When uncertain mid-task:

State uncertainty. "I am not confident about [X]. Here is what I know: [...]"
Offer verification path. "I can check this by running [tool]."
Do not guess to maintain flow. A pause for verification beats a cascade of errors.

Integration with OpenClaw

Add to AGENTS.md startup checks:

## Anti-Hallucination Protocol
Before any factual claim:
1. Run 5-Second Self-Check
2. If triggered, execute Grounding Protocol
3. Log corrections to memory

Add to every SKILL.md:

## Hallucination Risks
[List domain-specific hallucination patterns for this skill]

Add to TOOLS.md:

## Tool Verification Checklist
- [ ] Command executed successfully?
- [ ] Full output read and understood?
- [ ] Result reported accurately, not inferred?

Metrics

Track in memory/hallucination-log.md:

## 2026-05-13 — Session Log
- Total claims made: [N]
- Verified claims: [N]
- Hallucinations caught: [N]
- Hallucinations missed (user caught): [N]
- Recovery time: [avg seconds]
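
The logged counts can be turned into two summary ratios. Neither ratio is defined by the skill; `session_rates` and its two derived metrics are assumptions offered as one possible way to read the log.

```python
def session_rates(total_claims, verified, caught, missed):
    """Derive summary ratios from one session's counts.

    verified_rate: share of claims that were verified.
    catch_rate: share of hallucinations the agent caught itself,
                or None when no hallucinations were recorded.
    """
    hallucinations = caught + missed
    return {
        "verified_rate": verified / total_claims if total_claims else 0.0,
        "catch_rate": caught / hallucinations if hallucinations else None,
    }
```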

Anti-Patterns (What NOT to Do)

❌ "I believe..." — belief without evidence is a red flag
❌ "It should be..." — should is not is. Check.
❌ "As I mentioned earlier..." — verify you actually mentioned it
❌ "The system is..." — which system? When did you last check?
❌ "That means..." — does it? Trace the inference chain
❌ "Obviously..." — obvious to whom? On what evidence?
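
A draft can be scanned for these phrases before it is sent. The sketch below uses a plain case-insensitive phrase match, which will produce false positives; `RED_FLAGS` and `red_flags` are hypothetical names, and the prompts paraphrase the list above.

```python
import re

# Phrase -> question to ask before keeping the sentence.
RED_FLAGS = {
    "I believe": "What is the evidence?",
    "It should be": "Should is not is. Check.",
    "As I mentioned earlier": "Verify you actually mentioned it.",
    "The system is": "Which system? When did you last check?",
    "That means": "Trace the inference chain.",
    "Obviously": "Obvious to whom? On what evidence?",
}


def red_flags(draft):
    """Return (phrase, prompt) pairs for each red-flag phrase found in draft."""
    hits = []
    for phrase, prompt in RED_FLAGS.items():
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if re.search(pattern, draft, re.IGNORECASE):
            hits.append((phrase, prompt))
    return hits
```

A hit is a cue to run the self-check on that sentence, not proof of a hallucination.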

Sources

ToolHalla.ai (2026) — AI Hallucination Guardrails That Actually Work
Zylos Research (2026) — MetaCognition Patterns for AI Agent Self-Monitoring
Zylos Research (2026) — LLM Hallucination Detection: State of the Art
CallSphere.ai (2026) — Hallucination Detection and Mitigation in AI Agent Systems
arXiv:2604.17284 — HalluClear: Diagnosing Hallucinations in GUI Agents
arXiv:2603.24579 — MARCH: Multi-Agent Reinforced Self-Check
arXiv:2603.10047 — Toward Epistemic Stability
arXiv:2601.06818 — AgentHallu: Benchmarking Hallucination Attribution

Version 1.0, May 2026. Based on the 2026 research landscape.

Remember: The agent that catches itself hallucinating is more valuable than the agent that never does.

Safe Usage Recommendations

This skill appears benign and purpose-aligned. Before installing, consider whether you want the agent to apply broad factual self-checks, run extra verification steps when tools are available, and save correction logs to memory.

Functional Analysis

Type: OpenClaw Skill

Name: anti-hallucination-skill

Version: 1.0.1

The anti-hallucination-skill bundle consists entirely of instructional markdown (SKILL.md and README.md) designed to improve the reliability and self-correction capabilities of an AI agent. It provides a structured framework for the agent to identify, verify, and correct factual errors or 'hallucinations' using metacognitive checks and confidence calibration. There are no executable code files, no network requests, and no instructions that would lead to data exfiltration or unauthorized system access; the protocol is strictly focused on safety and grounding.

Capability Assessment

ℹ Purpose & Capability

The skill’s purpose—reducing hallucinations through self-checks, verification, and corrections—is coherent with its instructions, but it changes the agent’s normal response workflow.

ℹ Instruction Scope

The protocol applies broadly before factual outputs and can trigger automatically, which is purpose-aligned but may affect many ordinary responses.

✓ Install Mechanism

No install spec, code files, binaries, environment variables, or credentials are declared; this is an instruction-only skill.

ℹ Credentials

The skill may cause verification tool calls when facts are uncertain, which fits the anti-hallucination purpose and does not show hidden or unrelated tool use.

ℹ Persistence & Privilege

The skill asks the agent to log hallucination corrections to memory files, creating purpose-aligned persistent state that users should be aware of.

Version History

v1.0.1

Anti-hallucination-skill v1.0.1

- Initial public release of a comprehensive hallucination detection and mitigation protocol for OpenClaw agents
- Includes actionable triggers, taxonomy of hallucination types, and step-by-step self-check and grounding protocols
- Provides confidence calibration guidelines and guardrails for tool use to minimize ambiguous or unverified claims
- Outlines multi-agent and solo verification methods, as well as continuous metacognitive review routines
- Supplies practical recovery and correction workflows for when hallucinations are detected
- Designed for integration into AGENTS.md and as a template for SKILL.md risk awareness sections

Metadata

Slug anti-hallucination-skill

Version 1.0.1

License MIT-0

Total installs 0

Current installs 0

Number of versions 1

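FAQ

What is Anti-Hallucination?

Detects and mitigates hallucinations in agent outputs by self-checking facts, verifying claims, and correcting unsupported or contradictory information. It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 88 cumulative downloads to date.

How do I install Anti-Hallucination?

Run the command "/install anti-hallucination-skill" in the OpenClaw or Claude Code chat box to install it in one step; no additional configuration is required.

Is Anti-Hallucination free?

Yes. Anti-Hallucination is completely free and released under the MIT-0 license; it can be freely downloaded, installed, and used.

Which platforms does Anti-Hallucination support?

Anti-Hallucination is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed Anti-Hallucination?

Developed and maintained by Tooled-app (@tooled-app); the current version is v1.0.1.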
