Chapter 23

Chain-of-Thought and Internal Monologue Mechanism

If tool calling is the Agent's "hands," then Chain-of-Thought (CoT) reasoning is its "brain." Through the Internal Monologue mechanism, Hermes Agent reasons and plans explicitly before taking action, which improves tool call accuracy and multi-step task success rates. This chapter analyzes how CoT works within an Agent, how Hermes implements internal monologue, and when to enable high-intensity reasoning versus simple lookup.

23.1 The Role of Chain-of-Thought in Agents

Chain-of-Thought reasoning was systematically studied by Wei et al. (2022) in "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models," demonstrating that having models "explicitly write out reasoning steps" brings significant performance improvements on complex tasks.

23.1.1 Three Forms of CoT

Form	Description	Hermes Implementation
Zero-shot CoT	Simply add "Let's think step by step"	System prompt instructs use of inner_monologue
Few-shot CoT	Provide examples with reasoning processes	Training data includes inner_monologue examples
Automatic CoT	Model autonomously decides whether to expand reasoning	Hermes adapts based on task complexity

23.1.2 The Value of CoT for Tool Calling

Without CoT, the model jumps directly from user input to tool call decisions, commonly causing:

Tool misselection: Choosing a functionally similar but not perfectly matching tool
Parameter errors: Missing required fields or type mismatches
Premature stopping: Only calling some of the needed tools and returning incomplete answers

With CoT, the model plans before calling:

User: Look up Apple's current stock price and compare it to three months ago

Internal Monologue:
1. I need two data points: today's stock price and the price 3 months ago
2. Tool stock_price_history can get historical data; parameters need symbol and days
3. Apple's ticker is AAPL; three months ≈ 90 days
4. I should get 90 days of history, then extract first and last data points for comparison
5. After comparison, calculate the percentage change: (new - old) / old * 100%

Action Plan:
- Step 1: stock_price_history(symbol="AAPL", days=90)
- Step 2: Extract data[0].close (earliest) and data[-1].close (latest)
- Step 3: Calculate (new - old) / old * 100%
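
Steps 2 and 3 of the action plan reduce to a few lines of arithmetic. The sketch below assumes that stock_price_history returns a chronologically ordered list of records with a close field, as the data[0].close notation suggests; the response shape and the sample prices are invented for illustration.

```python
def percent_change(history: list) -> float:
    """Return the percentage change from the earliest to the latest close."""
    if len(history) < 2:
        raise ValueError("need at least two data points to compare")
    old = history[0]["close"]    # earliest record (assumed to come first)
    new = history[-1]["close"]   # latest record (assumed to come last)
    if old == 0:
        raise ValueError("earliest close is zero; percentage change is undefined")
    return (new - old) / old * 100


# Invented sample data standing in for the tool's response.
sample_history = [{"close": 160.0}, {"close": 172.0}, {"close": 184.0}]
print(f"{percent_change(sample_history):+.1f}%")
```

Guarding against an empty or single-point history matters here because the plan silently assumes the tool call succeeded and returned a full window.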

23.2 Hermes Internal Monologue Implementation

23.2.1 Internal Monologue Trigger Conditions

Hermes does not always enable internal monologue. The system adaptively decides based on task complexity:

import re


class TaskComplexityEstimator:
    MULTI_STEP_KEYWORDS = ["then", "after", "next", "finally", "and also", "simultaneously"]
    CONDITIONAL_KEYWORDS = ["if", "when", "based on", "depending on"]

    @staticmethod
    def _contains_any(text: str, keywords: list) -> bool:
        # Match whole words so that "if" does not fire inside "different"
        return any(re.search(rf"\b{re.escape(kw)}\b", text) for kw in keywords)

    def estimate(self, user_message: str, context: AgentContext) -> ComplexityLevel:
        text = user_message.lower()
        signals = {
            # Multi-step signal words
            "multi_step": self._contains_any(text, self.MULTI_STEP_KEYWORDS),
            # Conditional logic
            "conditional": self._contains_any(text, self.CONDITIONAL_KEYWORDS),
            # Tool chain requirement
            "tool_chain": len(self._predict_required_tools(user_message)) >= 2,
            # Numerical computation: a number followed by "%", "compare", or "analyze"
            "computation": bool(re.search(r"\d+.*(?:%|compare|analyze)", text)),
            # Ambiguity
            "ambiguous": self._ambiguity_score(user_message) > 0.6,
        }

        # Fraction of signals that fired, from 0.0 to 1.0
        complexity_score = sum(signals.values()) / len(signals)

        if complexity_score >= 0.6:
            return ComplexityLevel.HIGH      # Enable full internal monologue
        elif complexity_score >= 0.3:
            return ComplexityLevel.MEDIUM    # Enable simplified internal monologue
        else:
            return ComplexityLevel.LOW       # Skip monologue, respond directly
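
Because the score is the fraction of five boolean signals, it can only take the values 0, 0.2, 0.4, 0.6, 0.8, and 1.0. The 0.6 and 0.3 thresholds therefore mean "three or more signals" and "exactly two signals" respectively. The sketch below makes that mapping explicit; the ComplexityLevel enum here is a stand-in, since the chapter does not show its real definition.

```python
from enum import Enum


class ComplexityLevel(Enum):  # stand-in definition, assumed for illustration
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


def level_from_signals(signals: dict) -> ComplexityLevel:
    """Apply the 0.6 / 0.3 thresholds to the fraction of signals that fired."""
    score = sum(signals.values()) / len(signals)
    if score >= 0.6:
        return ComplexityLevel.HIGH
    if score >= 0.3:
        return ComplexityLevel.MEDIUM
    return ComplexityLevel.LOW


names = ["multi_step", "conditional", "tool_chain", "computation", "ambiguous"]
for fired in range(len(names) + 1):
    # Turn on the first `fired` signals and leave the rest off.
    signals = {name: index < fired for index, name in enumerate(names)}
    print(fired, level_from_signals(signals).name)
```

One consequence worth noting: a task that trips only a single signal, however strongly, is always classified as LOW under this scheme.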

23.2.2 Internal Monologue Structure

Hermes's internal monologue follows an "Observe-Analyze-Plan" three-part structure:

[inner_monologue]
## Observation
Analyzing the user's actual requirement: The user is asking...
Key constraints: ...
Potential ambiguities: ...

## Analysis
Tool evaluation:
- tool_A: Suitable, because...
- tool_B: Not suitable, because...
Data dependency: Step 1 output will serve as Step 2 input
Potential risks: If search returns empty, I need to...

## Plan
Step 1: Call {tool_name} with parameters {...}
Step 2: Based on Step 1 result, call {tool_name2}
Step 3: Integrate results and generate final answer
[/inner_monologue]
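
A fixed three-part layout is easy to check mechanically. As a minimal sketch, assuming the section headings appear exactly as "## Observation", "## Analysis", and "## Plan", the helper below splits a monologue into its sections and reports any that are missing. The function names are hypothetical and not part of Hermes.

```python
import re

SECTION_NAMES = ("Observation", "Analysis", "Plan")


def parse_monologue(text: str) -> dict:
    """Split one [inner_monologue] block into its '## <Name>' sections."""
    match = re.search(r"\[inner_monologue\](.*?)\[/inner_monologue\]", text, re.DOTALL)
    if match is None:
        return {}
    # With a capture group, re.split yields
    # [preamble, name1, content1, name2, content2, ...]
    parts = re.split(r"^##[ \t]+(\w+)[ \t]*$", match.group(1), flags=re.MULTILINE)
    return {name: content.strip() for name, content in zip(parts[1::2], parts[2::2])}


def missing_sections(text: str) -> list:
    """List the expected sections that the monologue does not contain."""
    parsed = parse_monologue(text)
    return [name for name in SECTION_NAMES if name not in parsed]


example = """[inner_monologue]
## Observation
The user is asking for a comparison.

## Analysis
- tool_A: Suitable

## Plan
Step 1: Call tool_A
[/inner_monologue]"""

print(parse_monologue(example)["Plan"])
print(missing_sections(example))
```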

23.2.3 Real Internal Monologue Example

Task: "Find the most-starred Python machine learning libraries on GitHub and compare their recent commit activity"

[inner_monologue]
## Observation
User requirements:
1. Find the highest-star Python ML libraries
2. Compare "recent commit activity"

"Recent" is not defined — I'll assume the past 30 days.

## Analysis
Required data sources:
- GitHub search or web search for star rankings
- GitHub API for commit counts per library for past 30 days

Tool selection:
- web_search: Good for getting "top Python ML libraries by stars" list
- http_request: Call GitHub API for precise commit data
  - Endpoint: GET /repos/{owner}/{repo}/commits?since={date}

Known major ML libraries: TensorFlow, PyTorch, scikit-learn, Keras, JAX
Need to verify star ranking order, then fetch recent commits via API

## Plan
Step 1: web_search("top Python machine learning libraries GitHub stars 2024")
Step 2: Extract top 5 libraries' owner/repo pairs from results
Step 3: Call GitHub API for each: /repos/{owner}/{repo}/commits?since={ISO 8601 timestamp for 30 days ago}
Step 4: Aggregate data and generate comparison table
[/inner_monologue]
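
Step 3 of this plan depends on passing a valid since value. The GitHub REST API expects an ISO 8601 timestamp for that parameter, so a relative phrase such as "30 days ago" has to be converted first. The sketch below only builds the request URL; the helper name is hypothetical, and the optional now argument exists purely to make the result reproducible.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional
from urllib.parse import urlencode


def commits_url(owner: str, repo: str, days: int = 30,
                now: Optional[datetime] = None) -> str:
    """Build the 'list commits' URL for the window ending at `now`."""
    now = now or datetime.now(timezone.utc)
    since = (now - timedelta(days=days)).strftime("%Y-%m-%dT%H:%M:%SZ")
    # per_page=100 is the largest page size the endpoint accepts.
    query = urlencode({"since": since, "per_page": 100})
    return f"https://api.github.com/repos/{owner}/{repo}/commits?{query}"


fixed_now = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(commits_url("pytorch", "pytorch", now=fixed_now))
```

The endpoint is paginated, so counting commits for an active repository means following subsequent pages rather than counting the first response alone.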

23.3 The step Tag Reasoning Chain Format

For particularly complex tasks, Hermes supports more granular [step] tags that decompose the reasoning process into independent step units:

[inner_monologue]
[step id=1]
Understanding the task: User wants to compare three competitors' feature differences.
Need to collect: product names, core feature lists, pricing, user reviews.
[/step]

[step id=2]
Data collection strategy:
- Use web_search to find each competitor's official website
- Use browser_navigate to access product pages for detailed feature lists
- Use web_search for user reviews (G2/Capterra/Reddit)
[/step]

[step id=3]
Validation strategy:
- Official website feature lists may have marketing bias; cross-validate with user reviews
- Use official websites as pricing authority, but watch for tax/promotional pricing
[/step]

[step id=4]
Output format planning:
- Generate a Markdown comparison table
- Rate each dimension as: Good / Neutral / Poor
- End with a summary recommendation
[/step]
[/inner_monologue]
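
Because each step carries an explicit id, the reasoning chain can be extracted and sanity-checked by downstream code. The following is a minimal sketch, assuming the tag syntax is exactly [step id=N] ... [/step] as shown above; the helper names are hypothetical.

```python
import re

STEP_PATTERN = re.compile(r"\[step id=(\d+)\](.*?)\[/step\]", re.DOTALL)


def parse_steps(monologue: str) -> list:
    """Return (id, text) pairs in the order the steps appear."""
    return [(int(step_id), body.strip()) for step_id, body in STEP_PATTERN.findall(monologue)]


def ids_are_sequential(steps: list) -> bool:
    """Check that step ids run 1, 2, 3, ... with no gaps or repeats."""
    return [step_id for step_id, _ in steps] == list(range(1, len(steps) + 1))


sample = """[inner_monologue]
[step id=1]
Understanding the task.
[/step]

[step id=2]
Data collection strategy.
[/step]
[/inner_monologue]"""

steps = parse_steps(sample)
print(steps)
print(ids_are_sequential(steps))
```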

23.4 When to Enable High-Intensity Reasoning vs. Simple Lookup

Scenario Type	Recommended Strategy	Reason
Multi-step dependent tasks (> 3 steps)	Full CoT + step tags	Complex dependency chain needs explicit planning
Conditional branching (if-else logic)	Full CoT	Need to anticipate multiple possible outcomes
Numerical computation / comparative analysis	Full CoT	Prevents calculation errors; leaves traceable reasoning
Ambiguous tool selection (3+ candidates)	Simplified CoT	Helps model weigh tool trade-offs
Single-tool direct lookup	No CoT	Overhead not worth it; direct call is faster
Simple factual Q&A	No CoT	Model's internal knowledge is sufficient
Time-sensitive real-time data	No CoT	Speed is priority; minimize token consumption
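
The table can be read as a decision procedure. The sketch below is one possible encoding, not Hermes's actual logic: the TaskProfile fields, the returned labels, and above all the precedence between rows are assumptions, since the table lists scenarios without ranking them.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    steps: int = 1                 # number of dependent steps
    has_branching: bool = False    # if-else logic in the request
    has_computation: bool = False  # numerical or comparative analysis
    candidate_tools: int = 1       # plausible tools for the job
    time_sensitive: bool = False   # real-time data where speed wins


def recommend_strategy(task: TaskProfile) -> str:
    # Assumed precedence: speed requirements override everything else.
    if task.time_sensitive:
        return "no_cot"
    if task.steps > 3:
        return "full_cot_with_steps"
    if task.has_branching or task.has_computation:
        return "full_cot"
    if task.candidate_tools >= 3:
        return "simplified_cot"
    # Single-tool lookups and simple factual questions fall through to here.
    return "no_cot"


print(recommend_strategy(TaskProfile(steps=5)))
print(recommend_strategy(TaskProfile(candidate_tools=4)))
print(recommend_strategy(TaskProfile()))
```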

Controlling CoT via System Prompt

ENABLE_FULL_COT_PROMPT = """
Before executing any tool call, you MUST use the [inner_monologue] tag to:
1. Clearly analyze the user's actual need
2. List all tools that may need to be called
3. Plan the tool call order and data flow
4. Anticipate potential errors and mitigation strategies

Do not skip the internal monologue step.
"""

DISABLE_COT_PROMPT = """
Directly call the necessary tools and answer the user's question.
No internal monologue required; prioritize response speed.
"""

ADAPTIVE_COT_PROMPT = """
For tasks requiring multiple steps or complex logic, use [inner_monologue] for planning.
For simple direct queries, just call the tool and respond.
"""

Programmatic CoT Control

import asyncio

from hermes import HermesAgent, CotMode


async def main() -> None:
    agent = HermesAgent()

    # Enable full CoT for a complex task
    result = await agent.run(
        message="Analyze the key risks in this financial report",
        cot_mode=CotMode.FULL,
        cot_max_tokens=500,
        show_monologue=True,      # Show internal monologue for debugging
    )

    # Disable CoT for simple queries
    result = await agent.run(
        message="What day of the week is it?",
        cot_mode=CotMode.DISABLED,
    )


# await is only valid inside a coroutine, so run the calls from an event loop
asyncio.run(main())

23.5 Impact of Internal Monologue on Tool Call Accuracy

23.5.1 Empirical Data

NousResearch published internal test data when launching Hermes 4 showing the impact of internal monologue on tool call accuracy:

Test Scenario	Without CoT	With CoT (inner_monologue)	Improvement
Single tool call (simple)	94.2%	95.1%	+0.9pp
Single tool call (complex params)	79.3%	91.7%	+12.4pp
Dual tool serial calls	68.4%	84.2%	+15.8pp
3+ step tool chains	51.6%	76.8%	+25.2pp
Conditional branching tool calls	44.8%	71.3%	+26.5pp
Error recovery scenarios	38.2%	67.9%	+29.7pp

Key finding: CoT provides negligible gains for simple single-tool calls (+0.9pp) but large improvements for complex multi-step and error recovery scenarios (up to +29.7pp). This supports the "adaptive CoT" strategy: enable it on demand rather than always forcing it.
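
The improvement column is the plain difference between the two accuracy columns, in percentage points. A complementary view, added here as an illustration rather than taken from the source, is the share of previously failing calls that CoT recovers, computed as gain / (100 - baseline). The sketch below derives both from the table's figures.

```python
# (accuracy without CoT, accuracy with CoT), in percent, copied from the table
results = {
    "Single tool call (simple)": (94.2, 95.1),
    "Single tool call (complex params)": (79.3, 91.7),
    "Dual tool serial calls": (68.4, 84.2),
    "3+ step tool chains": (51.6, 76.8),
    "Conditional branching tool calls": (44.8, 71.3),
    "Error recovery scenarios": (38.2, 67.9),
}


def gain_pp(without: float, with_cot: float) -> float:
    """Absolute improvement in percentage points."""
    return round(with_cot - without, 1)


def error_reduction(without: float, with_cot: float) -> float:
    """Fraction of the baseline failures that no longer occur with CoT."""
    return (with_cot - without) / (100 - without)


for scenario, (without, with_cot) in results.items():
    print(f"{scenario}: +{gain_pp(without, with_cot)}pp, "
          f"{error_reduction(without, with_cot):.0%} of failures removed")
```

Seen this way, even the simple case removes a noticeable share of its few remaining failures, but the absolute number of calls affected is small, which is why the adaptive strategy still holds.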

23.5.2 Token Consumption vs. Accuracy Trade-off

cot_analysis = {
    "simple_task": {
        "extra_tokens": 80,             # ~80 tokens for inner_monologue
        "accuracy_gain": 0.009,         # +0.9pp, expressed as a fraction
        "cost_benefit": 0.009 / 80,     # ~0.00011 per token (0.011pp/token): not worthwhile
    },
    "complex_multi_step": {
        "extra_tokens": 250,
        "accuracy_gain": 0.252,         # +25.2pp, expressed as a fraction
        "cost_benefit": 0.252 / 250,    # ~0.00101 per token (0.101pp/token): very worthwhile
    },
    "error_recovery": {
        "extra_tokens": 200,
        "accuracy_gain": 0.297,         # +29.7pp, expressed as a fraction
        "cost_benefit": 0.297 / 200,    # ~0.00149 per token (0.149pp/token): highest return
    }
}
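
Per-token ratios this small are hard to compare at a glance. As a sketch, the same figures can be restated as percentage points gained per 100 extra tokens and then ranked; the helper name is hypothetical, and the inputs are the token and accuracy numbers given above.

```python
scenarios = {
    "simple_task": {"extra_tokens": 80, "gain_pp": 0.9},
    "complex_multi_step": {"extra_tokens": 250, "gain_pp": 25.2},
    "error_recovery": {"extra_tokens": 200, "gain_pp": 29.7},
}


def gain_per_100_tokens(extra_tokens: int, gain_pp: float) -> float:
    """Percentage points of accuracy gained for every 100 monologue tokens."""
    return gain_pp / extra_tokens * 100


# Highest return first.
ranked = sorted(
    scenarios,
    key=lambda name: gain_per_100_tokens(**scenarios[name]),
    reverse=True,
)

for name in ranked:
    print(f"{name}: {gain_per_100_tokens(**scenarios[name]):.2f}pp per 100 tokens")
```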

23.6 Production Best Practices for CoT

Filtering Internal Monologue from Output

import re

def filter_internal_monologue(response: str) -> str:
    """Remove internal monologue content from agent response"""
    response = re.sub(
        r'\[inner_monologue\].*?\[/inner_monologue\]',
        '',
        response,
        flags=re.DOTALL,
    )
    response = re.sub(
        r'\[step[^\]]*\].*?\[/step\]',
        '',
        response,
        flags=re.DOTALL,
    )
    response = re.sub(r'\n{3,}', '\n\n', response)
    return response.strip()
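
In production it is often useful to keep the monologue for logging while still hiding it from the user. The variant below is a sketch under the same tag assumptions as the filter above: it returns the captured monologue text and the visible answer separately. The function name is hypothetical.

```python
import re

MONOLOGUE_PATTERN = re.compile(
    r"\[inner_monologue\](.*?)\[/inner_monologue\]", re.DOTALL
)


def split_monologue(response: str) -> tuple:
    """Return (list of monologue bodies, user-visible text)."""
    monologues = [body.strip() for body in MONOLOGUE_PATTERN.findall(response)]
    visible = MONOLOGUE_PATTERN.sub("", response)
    # Collapse the blank lines left behind by the removed blocks.
    visible = re.sub(r"\n{3,}", "\n\n", visible).strip()
    return monologues, visible


raw = (
    "[inner_monologue]\nPlan: call web_search first.\n[/inner_monologue]\n\n"
    "Here is the comparison you asked for."
)
monologues, visible = split_monologue(raw)
print(monologues)
print(visible)
```

The captured text can then be passed to a quality monitor or written to a debug log without ever reaching the user-facing channel.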

CoT Quality Monitoring

import re


class CotQualityMonitor:
    def __init__(self, tokenizer):
        # Any object exposing count(text) -> int; analyze() uses it for length_tokens
        self.tokenizer = tokenizer

    def analyze(self, monologue: str) -> dict:
        return {
            "has_observation": "observation" in monologue.lower(),
            "has_plan": any(kw in monologue.lower() for kw in ["step", "plan", "first", "then"]),
            # Counts snake_case call sites such as web_search(
            "tool_count_mentioned": len(re.findall(r'\b\w+_\w+\(', monologue)),
            "length_tokens": self.tokenizer.count(monologue),
            "reasoning_quality": self._score_reasoning_quality(monologue),
        }

    def _score_reasoning_quality(self, monologue: str) -> float:
        score = 0.0
        # Mentions constraints or requirements
        if any(kw in monologue.lower() for kw in ["constraint", "limitation", "requirement"]):
            score += 0.3
        # Considers failure handling
        if any(kw in monologue.lower() for kw in ["if fail", "fallback", "error", "retry"]):
            score += 0.3
        # Spells out concrete "key": "value" parameters
        if re.search(r'"[^"]+"\s*:\s*"[^"]+"', monologue):
            score += 0.4
        return score

23.7 Summary

This chapter systematically covered the Hermes Chain-of-Thought and internal monologue mechanism:

CoT's value: Improves tool call accuracy by up to +29.7pp on complex tasks, with minimal benefit on simple tasks
Internal monologue structure: Observe-Analyze-Plan three-part structure using [inner_monologue] + [step] tags
Adaptive strategy: Automatically decides whether to enable CoT based on task complexity, avoiding overhead on simple tasks
Filtering and monitoring: Filter internal monologue output in production and establish quality monitoring

Review Questions

Internal monologue is invisible to users but consumes real, billable tokens. For token-cost-sensitive applications, how would you minimize CoT token overhead without sacrificing accuracy on complex tasks?
The "plan" in the internal monologue sometimes differs from the actual execution steps (the model says "Step 1 is A" but actually calls B first). What is the fundamental cause of this "plan-execution deviation"?
If internal monologue were exposed to users ("transparent mode"), in which application scenarios would this be an advantage? What new design challenges would it introduce?
