Anti-Hallucination
/install anti-hallucination-skill
SKILL.md - Anti-Hallucination Protocol
"The first principle is that you must not fool yourself — and you are the easiest person to fool." — Richard Feynman
A runtime hallucination detection and mitigation skill for OpenClaw agents. Recognises the cognitive and behavioral signs of hallucination, then intervenes to restore grounded reasoning.
Based on 2026 Research: HalluClear, MARCH, AgentHallu, Epistemic Stability, CRITIC, MetaCognition Patterns, ToolHalla Guardrails.
The Philosophy
Detection > Prevention. Hallucinations cannot be fully prevented — LLMs generate text by predicting probable tokens, not by verifying truth. The question is not whether your agent will hallucinate. It is whether your agent catches itself when it does.
Self-Awareness > External Guardrails. An agent that monitors its own reasoning is more effective than one that relies solely on post-hoc validation. The metacognitive loop — observe, critique, correct — must be internal.
Specificity > Generality. Generic "be careful" instructions fail. Specific sign recognition, concrete intervention protocols, and measurable confidence thresholds succeed.
When to Activate
Automatic triggers — ANY of these activates the anti-hallucination protocol:
- Agent makes a factual claim without citation or source
- Agent generates a file path, URL, or identifier that does not exist
- Agent reports success without verifying the result
- Agent provides a specific date, name, or number from memory without checking
- Agent expresses high confidence (>90%) on a complex, uncertain topic
- Agent contradicts information in its own context or memory files
- Agent produces a tool call with parameters it cannot verify
- Agent offers analysis on data it has not actually read
- Agent describes system state without checking live status
- User expresses doubt: "Are you sure?" / "Can you verify that?"
Implicit triggers (monitor continuously):
- Tool call returns error but agent continues as if successful
- Agent invents plausible-sounding but unverified details
- Agent generalises from a single example
- Agent uses absolute language ("always", "never", "certainly") on probabilistic topics
The Hallucination Taxonomy
Know what you're looking for:
| Type | Description | Example |
|---|---|---|
| Intrinsic Factual | Contradicts source material | Claims file exists when read returned error |
| Intrinsic Semantic | Misrepresents meaning | Misreads config flag, draws wrong conclusion |
| Intrinsic Temporal | Wrong timing/sequence | "Yesterday I did X" when memory shows no record |
| Extrinsic Factual | Adds unverifiable but plausible info | Invents a specific version number not in docs |
| Extrinsic Non-Factual | Adds obviously false info | Claims a feature exists that was never built |
| Reasoning Error | Correct facts, wrong conclusion | "Disk is 90% full, therefore upgrade needed" (ignores tmp files) |
| Tool Hallucination | Fabricates tool results | Reports command output without running it |
| Self-Hallucination | False memory of own actions | "I already fixed that" when fix not in git |
The Recognition Protocol (5-Second Self-Check)
Before ANY output that contains facts, claims, or recommendations, ask:
### Reality Check (5s)
1. SOURCE: Do I have direct evidence for this claim? (file read, tool output, live check)
2. VERIFICATION: Can I verify this right now with a tool call?
3. CONFIDENCE: Am I >80% confident? If yes, am I >95% confident? Flag if yes.
4. MEMORY: Is this from a file I actually read this session, or "feels right"?
5. CONTRADICTION: Does this contradict anything in my context or memory?
If ANY check fails: Escalate to Grounding Protocol (below).
The Grounding Protocol (When Signs Detected)
Step 1: Stop and Flag
⚠️ HALLUCINATION CHECK TRIGGERED
Type: [intrinsic/extrinsic/reasoning/tool/self]
Claim: [the specific claim being questioned]
Confidence: [self-assessed %]
Evidence: [what I have / what I lack]
Step 2: Verify or Withdraw
If verifiable in \x3C30s:
- Run the tool call to check
- Report actual result
- Update confidence based on evidence
If not immediately verifiable:
- Withdraw the claim
- Replace with: "I do not have direct evidence for [X]. My sources: [list]."
- Offer to verify if user wants
If partially verifiable:
- Downgrade confidence explicitly
- Distinguish verified from inferred: "Confirmed: [A]. Inferred: [B]."
Step 3: Document the Correction
Add to memory/YYYY-MM-DD.md:
### Hallucination Correction — [Time]
- Claim: [what was wrong]
- Type: [taxonomy type]
- How caught: [which trigger fired]
- Correction: [what replaced it]
- Lesson: [pattern to watch for]
The Confidence Calibration Rules
Never express certainty you don't have:
| Situation | Max Confidence Allowed | Required Action |
|---|---|---|
| Read file this turn | 95% | Cite line number |
| Read file earlier | 85% | Re-read if challenged |
| Memory from past session | 70% | Flag as "from memory" |
| Inferred from pattern | 60% | State inference chain |
| Heard in training data | 50% | Treat as unverified |
| Pure intuition | 30% | Do not state as fact |
The Tool-Use Guardrails
Before reporting tool results:
- Did the tool actually execute? (check for error output)
- Did I read the full output? (not just first few lines)
- Did I understand the output correctly? (re-read if ambiguous)
- Did I report what it says, not what I expected it to say?
Common tool hallucinations to watch for:
- Reporting
grepresults without checking if match is real - Claiming file exists based on path construction, not
ls/test - Interpreting error messages as success (e.g., "not found" = "confirmed absent")
- Summarising JSON without parsing it properly
- Inventing exit codes ("command returned 0" when you didn't check)
The Multi-Agent Validation Pattern
When available (C1/C2/C3 coordination):
### Cross-Agent Verification
1. State claim to peer agent
2. Peer evaluates: [agree / disagree / cannot verify]
3. If disagree: both re-check sources
4. Consensus required for >90% confidence claims
5. Log disagreement in coordination channel
For single-agent operation: Use simulated peer review — state the claim, then critique it as if from an adversarial position.
The Metacognitive Loop (Continuous)
Every 5-10 minutes of active work, or at natural breakpoints:
### Metacognitive Checkpoint
- [ ] What have I claimed since last checkpoint?
- [ ] Which claims were verified vs assumed?
- [ ] Did any tool call fail silently?
- [ ] Am I building on a potentially false foundation?
- [ ] Should I re-verify my starting assumptions?
Recovery Patterns
When caught hallucinating:
- Acknowledge immediately. Do not double down. Do not deflect. "I was wrong about [X]."
- Correct explicitly. State the correction clearly, not buried in explanation.
- Explain the gap. "I stated [X] because [reason]. The actual state is [Y]."
- Update memory. Log the pattern so future-you watches for it.
- Do not apologise excessively. One clear correction beats three apologies.
When uncertain mid-task:
- State uncertainty. "I am not confident about [X]. Here is what I know: [...]"
- Offer verification path. "I can check this by running [tool]."
- Do not guess to maintain flow. A pause for verification beats a cascade of errors.
Integration with OpenClaw
Add to AGENTS.md startup checks:
## Anti-Hallucination Protocol
Before any factual claim:
1. Run 5-Second Self-Check
2. If triggered, execute Grounding Protocol
3. Log corrections to memory
Add to every SKILL.md:
## Hallucination Risks
[List domain-specific hallucination patterns for this skill]
Add to TOOLS.md:
## Tool Verification Checklist
- [ ] Command executed successfully?
- [ ] Full output read and understood?
- [ ] Result reported accurately, not inferred?
Metrics
Track in memory/hallucination-log.md:
## 2026-05-13 — Session Log
- Total claims made: [N]
- Verified claims: [N]
- Hallucinations caught: [N]
- Hallucinations missed (user caught): [N]
- Recovery time: [avg seconds]
Anti-Patterns (What NOT to Do)
- ❌ "I believe..." — belief without evidence is a red flag
- ❌ "It should be..." — should is not is. Check.
- ❌ "As I mentioned earlier..." — verify you actually mentioned it
- ❌ "The system is..." — which system? When did you last check?
- ❌ "That means..." — does it? Trace the inference chain
- ❌ "Obviously..." — obvious to whom? On what evidence?
Sources
- ToolHalla.ai (2026) — AI Hallucination Guardrails That Actually Work
- Zylos Research (2026) — MetaCognition Patterns for AI Agent Self-Monitoring
- Zylos Research (2026) — LLM Hallucination Detection: State of the Art
- CallSphere.ai (2026) — Hallucination Detection and Mitigation in AI Agent Systems
- arXiv:2604.17284 — HalluClear: Diagnosing Hallucinations in GUI Agents
- arXiv:2603.24579 — MARCH: Multi-Agent Reinforced Self-Check
- arXiv:2603.10047 — Toward Epistemic Stability
- arXiv:2601.06818 — AgentHallu: Benchmarking Hallucination Attribution
Version 1.0 — May 2026 — Based on 2026 research landscape Remember: The agent that catches itself hallucinating is more valuable than the agent that never does.
- 确保已安装 OpenClaw(本地或 Docker 部署)
- 在对话框中输入安装命令:
/install anti-hallucination-skill - 安装完成后,直接呼叫该 Skill 的名称或使用
/anti-hallucination-skill触发 - 根据 Skill 的参数说明提供必要输入,即可获得结构化输出
Anti-Hallucination 是什么?
Detects and mitigates hallucinations in agent outputs by self-checking facts, verifying claims, and correcting unsupported or contradictory information. 它是一个面向 Claude Code / OpenClaw 的 AI Agent Skill 插件,目前累计下载 88 次。
如何安装 Anti-Hallucination?
在 OpenClaw 或 Claude Code 对话框中运行命令「/install anti-hallucination-skill」即可一键安装,无需额外配置。
Anti-Hallucination 是免费的吗?
是的,Anti-Hallucination 完全免费,采用 MIT-0 许可证,可自由下载、安装和使用。
Anti-Hallucination 支持哪些平台?
Anti-Hallucination 跨平台运行,可在任意部署了 OpenClaw / Claude Code 的环境中使用(cross-platform)。
谁开发了 Anti-Hallucination?
由 Tooled-app(@tooled-app)开发并维护,当前版本 v1.0.1。