Chapter 70: Hallucination Detection and Fact-Checking: Making Claude's Output More Trustworthy

70.1 Hallucination: An Intrinsic Challenge for LLMs

Of all the reliability problems in AI applications, hallucination is among the hardest for engineers to handle. Hallucination refers to an LLM generating content that sounds plausible but is factually wrong, and stating it with high confidence: fabricated citations, nonexistent API functions, incorrect historical dates, or made-up statistics.

What makes hallucination particularly problematic is its deceptiveness: unlike grammatical errors or obvious logical contradictions, hallucinated content is often indistinguishable from correct information in wording, format, and expression style. An uninformed reader — or even a reader with some domain knowledge — can be misled by high-quality hallucinated content.

Root Causes of Hallucination

Understanding the technical causes of hallucination helps design more targeted defenses:

Training data bias: LLMs are trained to generate "statistically plausible" text. If incorrect information appears with high frequency in training data, the model may learn it as the "correct" answer.

Cumulative error in autoregressive generation: Language models generate text token by token, with each step based on previously generated content. Once a slight deviation occurs, subsequent generation builds on that deviation, causing error accumulation.

Knowledge cutoff issues: Claude's training data has a knowledge cutoff date. For events after that date, the model may fill the gap with speculation presented as fact.

Overgeneralization: Models may apply learned patterns to inapplicable situations — for example, "extending" a well-known author's writing style to works they never wrote.

Confidence calibration problems: LLMs do not always recognize the limits of their knowledge, so they tend to generate plausible-sounding content rather than admit uncertainty.
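
The cumulative-error cause can be made concrete with a toy calculation. This is a sketch under a strong simplifying assumption, not a model of how Claude behaves: it supposes every generated token independently carries the same small chance of introducing an error. The function name `error_free_probability` and the rate of 0.001 are hypothetical, chosen only for illustration.

```python
def error_free_probability(per_token_error: float, n_tokens: int) -> float:
    """Chance that n_tokens are all error-free, assuming independent errors."""
    return (1.0 - per_token_error) ** n_tokens

# Even a 0.1% per-token error rate compounds over long outputs
for n in (10, 100, 1000):
    print(n, round(error_free_probability(0.001, n), 3))
```

Under these assumptions the chance of a fully error-free output falls to roughly 37% at 1,000 tokens. Real errors are not independent, since each token is conditioned on earlier ones, which is exactly why an early deviation tends to propagate.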

70.2 Hallucination Classification System

Intrinsic vs. Extrinsic Hallucination

Intrinsic Hallucination: Hallucinations that contradict the provided source documents. These are particularly common in RAG systems:

# Example: Intrinsic hallucination
# Provided document content:
source_doc = "In Q3 2024, company revenue reached $120M, up 15% year-over-year."

# Claude's hallucinated output (contradicts source):
hallucinated_output = ("According to the document, company Q3 2024 revenue was $180M, "
                       "growth rate 20%.")
# Error: Both the dollar amount and growth rate differ from the source document

Extrinsic Hallucination: Claims that go beyond the provided source documents. The information may be correct, but it cannot be verified from the given context:

# Example: Extrinsic hallucination
# The document only describes company revenue data
# Claude's extrinsic hallucination:
hallucinated_output = "The company was founded in 2008 and is headquartered in Shanghai..."
# These details don't appear in the source document — Claude is "filling in" information
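
For numeric claims, both kinds of hallucination can be surfaced with a simple comparison of the figures in the output against the figures in the source. The following is a minimal sketch; `find_unsupported_numbers` is a hypothetical helper, and it only looks at digits, so it ignores units and wording.

```python
import re

NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")

def find_unsupported_numbers(source: str, output: str) -> list:
    """Return numbers that appear in the output but nowhere in the source."""
    source_numbers = set(NUMBER_PATTERN.findall(source))
    output_numbers = set(NUMBER_PATTERN.findall(output))
    return sorted(output_numbers - source_numbers)

source_doc = "In Q3 2024, company revenue reached $120M, up 15% year-over-year."
model_output = ("According to the document, company Q3 2024 revenue was $180M, "
                "growth rate 20%.")
print(find_unsupported_numbers(source_doc, model_output))  # → ['180', '20']
```

The check cannot tell the two categories apart: a founding year of 2008 that the source never mentions would be flagged in the same way as a contradicted revenue figure. It is best treated as a cheap first filter before a more careful review.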

Classification by Risk Level

In practice, classifying by potential harm severity is more useful:

High-risk hallucinations:
├── Medical information: incorrect drug dosages, diagnoses, treatment recommendations
├── Legal information: nonexistent regulations, incorrect case citations
├── Financial data: fabricated statistics, prices, financial report data
└── Fabricated citations: nonexistent papers, books, research reports

Medium-risk hallucinations:
├── Historical facts: incorrect dates, people, event sequences
├── Technical specifications: nonexistent APIs, function signatures
└── Product information: inaccurate feature descriptions, prices

Low-risk hallucinations:
├── Creative extension: reasonable inference from known information
└── Stylistic filler: wording variations, but substantive meaning correct
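
One way to make this classification usable in code is a lookup table from content category to risk level. The sketch below is illustrative: the category keys and the `risk_level_for` helper are hypothetical names, and defaulting unknown categories to high risk is a design assumption rather than something the classification above prescribes.

```python
from enum import Enum

class RiskLevel(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"

# Mirrors the three tiers listed above
CATEGORY_RISK = {
    "medical": RiskLevel.HIGH,
    "legal": RiskLevel.HIGH,
    "financial": RiskLevel.HIGH,
    "citation": RiskLevel.HIGH,
    "historical": RiskLevel.MEDIUM,
    "technical_spec": RiskLevel.MEDIUM,
    "product": RiskLevel.MEDIUM,
    "creative_extension": RiskLevel.LOW,
    "stylistic": RiskLevel.LOW,
}

def risk_level_for(category: str) -> RiskLevel:
    # Unknown categories default to HIGH so they get reviewed rather than ignored
    return CATEGORY_RISK.get(category, RiskLevel.HIGH)
```

Defaulting to the strictest tier means a new, unclassified content type triggers more checking instead of silently skipping it.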

70.3 Technical Methods for Detecting Hallucination

Method 1: Self-Consistency Check

Sample the same question multiple times and use voting to detect inconsistencies:

import anthropic
from collections import Counter
import re

client = anthropic.Anthropic()

def consistency_check(
    question: str,
    n_samples: int = 5,
    temperature: float = 0.8
) -> dict:
    """
    Detect self-consistency in Claude's answers through multiple samples.
    High inconsistency typically signals hallucination risk.
    """
    responses = []
    
    for _ in range(n_samples):
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=512,
            temperature=temperature,
            messages=[{"role": "user", "content": question}]
        )
        responses.append(response.content[0].text)
    
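    # Extract the numeric values (optionally followed by a percent sign) from each
    # response. A set per response means each number counts once per sample.
    numbers_found = []
    for resp in responses:
        nums = re.findall(r'\d+(?:\.\d+)?%?', resp)
        numbers_found.append(set(nums))
    
    all_nums = [num for s in numbers_found for num in s]
    if all_nums:
        # Share of samples that contain the most frequently repeated number
        num_counter = Counter(all_nums)
        consistency_score = max(num_counter.values()) / n_samples
    else:
        # No numbers to compare: this heuristic cannot judge free-text answers
        consistency_score = 1.0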
    
    return {
        "consistency_score": consistency_score,
        "responses": responses,
        "is_reliable": consistency_score >= 0.8,
        "warning": "Low consistency detected" if consistency_score < 0.6 else None
    }
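
For questions with a short categorical answer rather than a number, the voting idea can be applied directly to the sampled answers. The following is a sketch that assumes each sample has already been reduced to a short answer string (for example by prompting for the answer only); `majority_answer` is a hypothetical helper.

```python
from collections import Counter

def majority_answer(answers: list) -> tuple:
    """Return (most common normalized answer, share of samples that agree)."""
    # Lowercase and collapse whitespace so trivial variations count as one answer
    normalized = [" ".join(a.lower().split()) for a in answers]
    if not normalized:
        return None, 0.0
    answer, count = Counter(normalized).most_common(1)[0]
    return answer, count / len(normalized)

print(majority_answer(["Paris", " paris ", "Lyon"]))
```

A low agreement share plays the same role as a low `consistency_score`: it does not prove the majority answer is wrong, but it marks the question as one where the model is unstable.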

Method 2: Citation Verification

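For RAG systems, verify that each claim in Claude's output is supported by the provided documents. The function below uses embedding similarity, which measures whether a claim resembles a source sentence, not whether its figures match, so a sentence with a wrong number can still score highly: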

from typing import Optional

def verify_citations(
    model_output: str,
    source_documents: list,
    similarity_threshold: float = 0.75
) -> dict:
    """
    Verify whether claims in model output are supported by source documents.
    """
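    from sentence_transformers import SentenceTransformer, util
    import nltk  # sent_tokenize requires NLTK's Punkt tokenizer data to be downloaded first
    
    # Loading the model is slow; in production, create the encoder once and reuse it
    encoder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
    sentences = nltk.sent_tokenize(model_output)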
    
    verification_results = []
    
    for sentence in sentences:
        if len(sentence) < 20 or sentence.endswith('?'):
            continue
        
        sentence_embedding = encoder.encode(sentence)
        best_match = None
        best_score = 0
        
        for doc in source_documents:
            doc_sentences = nltk.sent_tokenize(doc["content"])
            
            for doc_sentence in doc_sentences:
                doc_embedding = encoder.encode(doc_sentence)
                score = float(util.cos_sim(sentence_embedding, doc_embedding))
                
                if score > best_score:
                    best_score = score
                    best_match = {
                        "source": doc["source"],
                        "matching_text": doc_sentence,
                        "similarity": score
                    }
        
        verification_results.append({
            "claim": sentence,
            "supported": best_score >= similarity_threshold,
            "confidence": best_score,
            "best_match": best_match
        })
    
    unsupported_claims = [r for r in verification_results if not r["supported"]]
    
    return {
        "total_claims": len(verification_results),
        "unsupported_claims": len(unsupported_claims),
        "reliability_score": 1 - (len(unsupported_claims) / max(len(verification_results), 1)),
        "details": verification_results
    }
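
As a dependency-free complement, a token-coverage check can catch claims whose exact figures do not occur in any source. This is a sketch of a different, purely lexical technique, not a replacement for the embedding approach; `token_set` and `lexical_support` are hypothetical helpers, and the input uses the same `source` and `content` keys as `verify_citations`.

```python
import re

def token_set(text: str) -> set:
    """Lowercased word and number tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def lexical_support(claim: str, source_documents: list) -> dict:
    """Return the source whose content covers the largest share of the claim's tokens."""
    claim_tokens = token_set(claim)
    best = {"source": None, "coverage": 0.0}
    for doc in source_documents:
        doc_tokens = token_set(doc["content"])
        coverage = len(claim_tokens & doc_tokens) / max(len(claim_tokens), 1)
        if coverage > best["coverage"]:
            best = {"source": doc["source"], "coverage": coverage}
    return best

docs = [{
    "source": "q3_report",
    "content": "In Q3 2024, company revenue reached $120M, up 15% year-over-year.",
}]
print(lexical_support("Company revenue reached $180M in Q3 2024.", docs))
```

Because "180m" does not appear in the source, the claim's coverage stays below 1.0 even though its wording is otherwise identical. Paraphrased but correct claims will also score low, so the two checks are best read together.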

Method 3: Claude Self-Evaluation

Claude can be used to evaluate the reliability of its own or another model's output:

def hallucination_self_eval(
    original_question: str,
    model_response: str,
    source_context: Optional[str] = None
) -> dict:
    """Use Claude to evaluate potential hallucinations in an output."""
    
    context_section = ""
    if source_context:
        context_section = f"""
Reference sources (to be used as factual basis):
{source_context}

---
"""
    
    eval_prompt = f"""You are a professional fact-checking assistant. 
Please evaluate the reliability of the following AI response.

{context_section}
Question: {original_question}

AI Response: {model_response}

Complete the following tasks:
1. Identify all specific factual claims (numbers, dates, names, citations, etc.)
2. For each claim, assess:
   - Whether it can be verified from the reference sources (if provided)
   - Whether it is obvious common knowledge (high confidence)
   - Whether it is a specific, hard-to-verify claim (high risk)
3. Provide an overall credibility score (0-10)

Output as JSON:
{{
  "factual_claims": [
    {{
      "claim": "specific claim text",
      "verification_status": "verified|unverified|likely_correct|suspicious",
      "risk_level": "high|medium|low",
      "reason": "basis for judgment"
    }}
  ],
  "overall_credibility_score": 0-10,
  "main_concerns": ["primary risk points"],
  "recommendation": "safe_to_use|use_with_caution|needs_verification|do_not_use"
}}"""
    
    eval_response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=2048,
        temperature=0.1,
        messages=[{"role": "user", "content": eval_prompt}]
    )
    
    import json
    try:
        return json.loads(eval_response.content[0].text)
    except json.JSONDecodeError:
        return {"error": "Failed to parse evaluation response"}
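
Model responses sometimes wrap the requested JSON in explanatory text or a code fence, in which case a direct `json.loads` call fails. A more tolerant parser, sketched here with a hypothetical helper named `extract_json_object`, takes the span from the first opening brace to the last closing brace before parsing.

```python
import json

def extract_json_object(text: str) -> dict:
    """Parse the outermost {...} span in text; return an error dict on failure."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return {"error": "No JSON object found"}
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return {"error": "Failed to parse evaluation response"}

reply = 'Here is my evaluation: {"overall_credibility_score": 7, "main_concerns": []} Done.'
print(extract_json_object(reply))
```

This heuristic assumes the response contains a single top-level object. It returns the same error shape as `hallucination_self_eval`, so callers can handle both failure cases in one place.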

70.4 Grounding Techniques: Prompt Engineering to Reduce Hallucination

Explicit Uncertainty Expression

Use prompt instructions to require Claude to explicitly express uncertainty:

def create_grounded_prompt(question: str, sources: list) -> str:
    sources_text = "\n\n".join([
        f"[Source {i+1}: {s['title']}]\n{s['content']}"
        for i, s in enumerate(sources)
    ])
    
    return f"""Please answer the question based strictly on the provided source materials below.

Source materials:
{sources_text}

Question: {question}

Answer requirements:
1. Only state content that can be directly supported by the source materials
2. For each important claim, use attribution expressions like "According to Source X" 
   or "Source X states"
3. If part of the question cannot be answered from the source materials, explicitly state 
   "This information is not mentioned in the source materials"
4. Do not add information absent from the sources, even if you believe it to be correct
5. If uncertain about something, use phrases like "According to the source materials, 
   it appears..." or "The source materials suggest..."

Please begin your answer:"""
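
Since the prompt numbers its sources and asks for "Source X" attributions, the answer can be checked afterwards for references to sources that were never supplied. The following is a minimal sketch; `check_source_references` is a hypothetical helper, and it assumes the model follows the "Source N" wording requested in the prompt.

```python
import re

def check_source_references(answer: str, n_sources: int) -> dict:
    """Find 'Source N' mentions and flag any N outside 1..n_sources."""
    cited = sorted({int(n) for n in re.findall(r"\bSource (\d+)\b", answer)})
    invalid = [n for n in cited if not 1 <= n <= n_sources]
    return {"cited": cited, "invalid": invalid, "has_attribution": bool(cited)}

answer = "According to Source 1, revenue grew 15%. Source 3 states the opposite."
print(check_source_references(answer, n_sources=2))
```

An entry in `invalid` means the answer attributes a claim to a source number that does not exist, which is a strong hint that the attribution itself was fabricated.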

Structured Output Requirements

Structuring output makes hallucinations easier to detect:

def structured_factual_query(question: str, require_citations: bool = True) -> str:
    citation_instruction = """
Every factual claim must be followed by a citation tag:
[Needs Verification] or [Common Knowledge] or [Source: specific source name]
""" if require_citations else ""
    
    return f"""Please answer the following question in a structured format:

Question: {question}

Output format:
{{
  "direct_answer": "one-sentence direct answer",
  "supporting_facts": [
    {{
      "fact": "supporting factual statement",
      "confidence": "high|medium|low",
      "source_type": "verified_source|common_knowledge|inference|uncertain"
    }}
  ],
  "caveats": ["important caveats or uncertainty notes"],
  "information_gaps": ["aspects that cannot be confirmed or answered"]
}}

{citation_instruction}

Please be conservative: for any information you are not completely certain about, 
choose low confidence rather than medium or high."""
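
Structured output only helps if the structure is actually checked. The sketch below validates a response against the format requested above; `validate_structured_answer` is a hypothetical helper, and it assumes the response text is the JSON object alone.

```python
import json

REQUIRED_KEYS = {"direct_answer", "supporting_facts", "caveats", "information_gaps"}
ALLOWED_CONFIDENCE = {"high", "medium", "low"}

def validate_structured_answer(raw: str) -> list:
    """Return a list of problems found; an empty list means the shape is valid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["response is not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(data))]
    facts = data.get("supporting_facts")
    if isinstance(facts, list):
        for i, fact in enumerate(facts):
            if not isinstance(fact, dict) or fact.get("confidence") not in ALLOWED_CONFIDENCE:
                problems.append(f"supporting_facts[{i}] has an invalid confidence value")
    return problems
```

A response that fails validation can be retried or routed to review, and the share of facts marked low confidence becomes a simple per-response signal.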

Explicitly Allowing "I Don't Know"

A frequently overlooked technique is to explicitly allow and encourage Claude to express uncertainty:

CALIBRATION_SYSTEM_PROMPT = """You are an assistant that places high value on accuracy.

Regarding your uncertainty:
- When unsure of a fact, explicitly say "I'm not certain" or "my information may not be current"
- When your knowledge cutoff date might affect your answer, proactively remind the user
- Don't guess at specific numbers, dates, or names just to appear knowledgeable
- For professional domains (medical, legal, financial), even when you have relevant knowledge, 
  recommend that users seek professional advice

When you don't know:
- Directly say "I don't have reliable information about this"
- Or say "I cannot give a definitive answer to this question; I recommend consulting [specific resource]"
- Rather than fabricating plausible-sounding answers

You would rather have users perceive your knowledge as limited than have them make wrong 
decisions based on your incorrect information."""
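
To see whether such a prompt changes behavior, it helps to track how often responses actually decline to answer. A crude sketch, assuming the model uses wording close to the phrases suggested in the prompt; `ABSTENTION_PHRASES` and `is_abstention` are hypothetical names.

```python
ABSTENTION_PHRASES = (
    "i'm not certain",
    "my information may not be current",
    "i don't have reliable information",
    "i cannot give a definitive answer",
)

def is_abstention(response_text: str) -> bool:
    """True if the response contains one of the uncertainty phrases from the prompt."""
    # Normalize case and curly apostrophes before matching
    normalized = response_text.lower().replace("\u2019", "'")
    return any(phrase in normalized for phrase in ABSTENTION_PHRASES)

print(is_abstention("I don't have reliable information about this."))  # → True
```

Phrase matching will miss paraphrased refusals, so the resulting rate is a lower bound rather than an exact measurement.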

70.5 Citation Requirement System

Mandatory Citation Format

For scenarios requiring high-confidence output, establish mandatory citation mechanisms:

class CitationRequirementSystem:
    CITATION_SYSTEM_PROMPT = """You must follow these citation rules:

1. After every factual claim, add a citation tag at the end of the sentence:
   - [Verified: {source}] — information from provided documents
   - [Common Knowledge] — widely known facts
   - [Inference] — based on logical reasoning
   - [Uncertain] — cannot confirm accuracy

2. If information cannot be found in the provided context, state explicitly:
   "This information is not in the provided materials. The following is from my 
   knowledge base and should be independently verified: ..."

3. Never omit citation tags to make responses flow more smoothly."""
    
    def process_with_citations(self, query: str, context_docs: list) -> dict:
        docs_text = self._format_docs(context_docs)
        
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=2048,
            system=self.CITATION_SYSTEM_PROMPT,
            messages=[{
                "role": "user",
                "content": f"Reference materials:\n{docs_text}\n\nQuestion: {query}"
            }]
        )
        
        raw_output = response.content[0].text
        citation_analysis = self._analyze_citations(raw_output)
        
        return {
            "output": raw_output,
            "citation_coverage": citation_analysis["coverage"],
            "uncited_sentences": citation_analysis["uncited"],
            "reliability_assessment": self._assess_reliability(citation_analysis)
        }
    
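    def _analyze_citations(self, text: str) -> dict:
        import re
        citation_pattern = r'\[(Verified|Common Knowledge|Inference|Uncertain)[^\]]*\]'
        # Split on whitespace that follows sentence-ending punctuation or a closing
        # tag bracket, but not when a citation tag comes next. This keeps a tag
        # placed after the period attached to its sentence and leaves decimals
        # such as "1.5" intact, which a plain split('.') would break.
        parts = re.split(r'(?<=[.!?\]])\s+(?!\[)', text)
        sentences = [s.strip() for s in parts if len(s.strip()) > 20]
        
        cited = [s for s in sentences if re.search(citation_pattern, s)]
        uncited = [s for s in sentences if not re.search(citation_pattern, s)]
        
        return {
            "coverage": len(cited) / max(len(sentences), 1),
            "cited": cited,
            "uncited": uncited
        }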
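
The class calls `_format_docs` and `_assess_reliability` without defining them. One possible shape for these helpers is sketched below as standalone functions. Both are assumptions: the documents are taken to have `title` and `content` keys, and the coverage cut-offs of 0.9 and 0.6 are illustrative values, not recommendations.

```python
def format_docs(context_docs: list) -> str:
    """Render documents as numbered, titled blocks for the prompt."""
    return "\n\n".join(
        f"[Document {i}: {doc['title']}]\n{doc['content']}"
        for i, doc in enumerate(context_docs, start=1)
    )

def assess_reliability(citation_analysis: dict) -> str:
    """Map citation coverage to a coarse label; the cut-offs are illustrative."""
    coverage = citation_analysis["coverage"]
    if coverage >= 0.9:
        return "high"
    if coverage >= 0.6:
        return "medium"
    return "low"

docs = [{"title": "Q3 report", "content": "Revenue reached $120M."}]
print(format_docs(docs))
print(assess_reliability({"coverage": 0.75}))
```

Coverage only measures whether tags are present, not whether they are truthful, so a high label here still needs the citation verification from section 70.3 behind it.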

70.6 Hallucination Control in Production Systems

Dual Verification for High-Stakes Scenarios

class HighStakesResponsePipeline:
    """
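    Dual verification pipeline for high-stakes scenarios (medical, legal, financial).
    The coroutines generate_with_citations, self_critique, and revise_response are
    not shown; each is assumed to wrap a Claude call and return the fields used below.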
    """
    
    async def generate_verified_response(
        self,
        query: str,
        domain: str,
        sources: list
    ) -> dict:
        # Step 1: Generate initial response (with citations)
        initial_response = await self.generate_with_citations(query, sources)
        
        # Step 2: Self-critique (focus on finding problems)
        critique = await self.self_critique(
            question=query,
            response=initial_response["output"],
            sources=sources
        )
        
        # Step 3: Revise based on critique
        if critique["issues_found"]:
            revised_response = await self.revise_response(
                original_response=initial_response["output"],
                critique=critique["critique"],
                sources=sources
            )
        else:
            revised_response = initial_response["output"]
        
        # Step 4: Final risk assessment
        risk_score = self.calculate_risk_score(
            domain=domain,
            response=revised_response,
            citation_coverage=initial_response["citation_coverage"]
        )
        
        return {
            "response": revised_response,
            "risk_score": risk_score,
            "requires_human_review": risk_score > 0.7,
            "verification_steps_completed": 4
        }
    
    def calculate_risk_score(self, domain: str, response: str, citation_coverage: float) -> float:
        base_risk = {
            "medical": 0.8,
            "legal": 0.7,
            "financial": 0.6,
            "general": 0.3
        }.get(domain, 0.5)
        
        citation_factor = 1 - (citation_coverage * 0.4)
        
        import re
        num_count = len(re.findall(r'\b\d+(?:\.\d+)?%?\b', response))
        numeric_factor = 1 + (num_count * 0.02)
        
        return min(base_risk * citation_factor * numeric_factor, 1.0)
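
A worked example shows how the three factors combine. The block below restates the scoring formula as a standalone function so it can be run on its own; the sample sentence is made up for illustration.

```python
import re

BASE_RISK = {"medical": 0.8, "legal": 0.7, "financial": 0.6, "general": 0.3}

def calculate_risk_score(domain: str, response: str, citation_coverage: float) -> float:
    base_risk = BASE_RISK.get(domain, 0.5)
    # Full citation coverage reduces risk by at most 40%
    citation_factor = 1 - (citation_coverage * 0.4)
    # Each number in the response adds 2% to the risk
    num_count = len(re.findall(r'\b\d+(?:\.\d+)?%?\b', response))
    numeric_factor = 1 + (num_count * 0.02)
    return min(base_risk * citation_factor * numeric_factor, 1.0)

# Medical answer, half of the sentences cited, three numbers in the text:
# 0.8 * (1 - 0.5 * 0.4) * (1 + 3 * 0.02) = 0.8 * 0.8 * 1.06 = 0.6784
score = calculate_risk_score("medical", "Take 200 mg every 8 hours for 5 days.", 0.5)
print(round(score, 4))  # → 0.6784
```

Note what the formula implies for the 0.7 review threshold: a fully cited medical answer with no numbers scores 0.8 × 0.6 = 0.48 and is not sent to human review, while the same answer with no citations scores 0.8 and is. Whether that behavior is acceptable depends on the application.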

Monitoring and Alerting

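from datetime import datetime, timezone

# monitoring_client and alert_system stand in for your own telemetry and
# alerting integrations; they are not defined in this chapter.
def log_hallucination_metrics(
    request_id: str,
    domain: str,
    citation_coverage: float,
    consistency_score: float,
    risk_score: float
):
    metrics = {
        "request_id": request_id,
        "domain": domain,
        "citation_coverage": citation_coverage,
        "consistency_score": consistency_score,
        "risk_score": risk_score,
        # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
        "timestamp": datetime.now(timezone.utc).isoformat()
    }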
    
    monitoring_client.record(metrics)
    
    if risk_score > 0.8:
        alert_system.send_alert(
            severity="HIGH",
            message=f"High hallucination risk detected in {domain} response",
            details=metrics
        )
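
For local testing, the two external objects used by the logging function can be replaced with in-memory stand-ins. The classes below are hypothetical; they only mirror the method names and keyword arguments used above (`record`, and `send_alert` with `severity`, `message`, `details`) and do not correspond to any particular monitoring product.

```python
class InMemoryMonitor:
    """Collects metric dictionaries in a list instead of sending them anywhere."""
    def __init__(self):
        self.records = []

    def record(self, metrics: dict) -> None:
        self.records.append(metrics)

class InMemoryAlerts:
    """Stores alerts so tests can assert on them."""
    def __init__(self):
        self.alerts = []

    def send_alert(self, severity: str, message: str, details: dict) -> None:
        self.alerts.append({"severity": severity, "message": message, "details": details})

monitoring_client = InMemoryMonitor()
alert_system = InMemoryAlerts()
```

With these in place, a unit test can log a response with a risk score above 0.8 and assert that exactly one HIGH alert was stored.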

Summary

Hallucination is the reliability problem in LLM applications most in need of systematic treatment. Understanding the root causes — the statistical nature of generation, knowledge cutoffs, and inadequate confidence calibration — is prerequisite to designing effective defenses.

On the detection side, self-consistency checks, citation verification, and self-evaluation are three complementary methods. On the prevention side, grounding techniques, structured outputs, and mandatory citation requirements can significantly reduce hallucination rates. In high-stakes scenarios (medical, legal, financial), build multi-step generate-critique-revise verification pipelines backed by human review as a final safety net.

Ultimately, eliminating hallucinations entirely is not the goal — a completely hallucination-free LLM does not currently exist. The goal is to control the rate and risk of hallucinations within acceptable bounds, and to provide measurable trustworthiness indicators for every output.
