Chapter 23: Chain-of-Thought and Internal Monologue Mechanism
If tool calling is the Agent's "hands," then Chain-of-Thought (CoT) reasoning is its "brain." Through the Internal Monologue mechanism, Hermes Agent performs explicit reasoning and planning before taking action, dramatically improving tool call accuracy and multi-step task success rates. This chapter provides a deep analysis of how CoT works within an Agent, Hermes's internal monologue implementation, and when to enable high-intensity reasoning versus simple lookup.
23.1 The Role of Chain-of-Thought in Agents
Chain-of-Thought reasoning was systematically studied by Wei et al. (2022) in "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models," demonstrating that having models "explicitly write out reasoning steps" brings significant performance improvements on complex tasks.
23.1.1 Three Forms of CoT
| Form | Description | Hermes Implementation |
|---|---|---|
| Zero-shot CoT | Simply add "Let's think step by step" | System prompt instructs use of inner_monologue |
| Few-shot CoT | Provide examples with reasoning processes | Training data includes inner_monologue examples |
| Automatic CoT | Model autonomously decides whether to expand reasoning | Hermes adapts based on task complexity |
23.1.2 The Value of CoT for Tool Calling
Without CoT, the model jumps directly from user input to tool call decisions, commonly causing:
- Tool misselection: Choosing a functionally similar but not perfectly matching tool
- Parameter errors: Missing required fields or type mismatches
- Premature stopping: Only calling some of the needed tools and returning incomplete answers
With CoT, the model plans before calling:
User: Look up Apple's current stock price and compare it to three months ago
Internal Monologue:
1. I need two data points: today's stock price and the price 3 months ago
2. Tool stock_price_history can get historical data; parameters need symbol and days
3. Apple's ticker is AAPL; three months ≈ 90 days
4. I should get 90 days of history, then extract first and last data points for comparison
5. After comparison, calculate the percentage change: (new - old) / old * 100%
Action Plan:
- Step 1: stock_price_history(symbol="AAPL", days=90)
- Step 2: Extract data[0].close (earliest) and data[-1].close (latest)
- Step 3: Calculate (new - old) / old * 100%
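The arithmetic in Steps 2 and 3 can be sketched in a few lines. The record layout (a list of dicts with a `close` field, oldest first) and the sample prices are assumptions made for illustration; the actual return shape of `stock_price_history` is not specified in this chapter.

```python
def percent_change(old: float, new: float) -> float:
    """Return the change from old to new as a percentage of old."""
    if old == 0:
        raise ValueError("old price must be non-zero")
    return (new - old) / old * 100.0


# Hypothetical tool output: oldest record first, newest last (made-up prices).
history = [{"close": 150.0}, {"close": 158.0}, {"close": 165.0}]

earliest = history[0]["close"]   # data[0].close in the plan
latest = history[-1]["close"]    # data[-1].close in the plan
change = percent_change(earliest, latest)
print(f"{earliest:.2f} -> {latest:.2f}: {change:+.1f}%")
```

Guarding against a zero baseline is the kind of edge case a monologue can surface before the call is made.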
23.2 Hermes Internal Monologue Implementation
23.2.1 Internal Monologue Trigger Conditions
Hermes does not always enable internal monologue. The system adaptively decides based on task complexity:
import re

class TaskComplexityEstimator:
    def estimate(self, user_message: str, context: AgentContext) -> ComplexityLevel:
        text = user_message.lower()
        # Note: these are plain substring checks, so "if" also matches inside "different"
        signals = {
            # Multi-step signal words
            "multi_step": any(kw in text for kw in
                ["then", "after", "next", "finally", "and also", "simultaneously"]),
            # Conditional logic
            "conditional": any(kw in text for kw in
                ["if", "when", "based on", "depending on"]),
            # Tool chain requirement
            "tool_chain": len(self._predict_required_tools(user_message)) >= 2,
            # Numerical computation: a digit followed later by "%", "compare", or "analyze"
            "computation": bool(re.search(r'\d+.*(?:%|compare|analyze)', text)),
            # Ambiguity
            "ambiguous": self._ambiguity_score(user_message) > 0.6,
        }
        # Fraction of signals that fired (0.0, 0.2, ..., 1.0 with five signals)
        complexity_score = sum(signals.values()) / len(signals)
        if complexity_score >= 0.6:
            return ComplexityLevel.HIGH    # Enable full internal monologue
        elif complexity_score >= 0.3:
            return ComplexityLevel.MEDIUM  # Enable simplified internal monologue
        else:
            return ComplexityLevel.LOW     # Skip monologue, respond directly
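Because the score is the fraction of five boolean signals that fire, it can only take the values 0, 0.2, 0.4, 0.6, 0.8, and 1.0. The sketch below makes the resulting mapping explicit. `Level` and `level_for` are illustrative stand-ins, not the `ComplexityLevel` type or any Hermes function.

```python
from enum import Enum


class Level(Enum):  # illustrative stand-in, not Hermes's ComplexityLevel
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


def level_for(fired: int, total: int = 5) -> Level:
    """Map the number of fired signals to a level using the 0.6 / 0.3 thresholds."""
    score = fired / total
    if score >= 0.6:
        return Level.HIGH
    if score >= 0.3:
        return Level.MEDIUM
    return Level.LOW


# Show the level selected for every possible signal count.
for fired in range(6):
    print(fired, level_for(fired).name)
```

Under these thresholds a single signal never triggers a monologue, two signals select the simplified form, and three or more select the full form.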
23.2.2 Internal Monologue Structure
Hermes's internal monologue follows an "Observe-Analyze-Plan" three-part structure:
[inner_monologue]
## Observation
Analyzing the user's actual requirement: The user is asking...
Key constraints: ...
Potential ambiguities: ...
## Analysis
Tool evaluation:
- tool_A: Suitable, because...
- tool_B: Not suitable, because...
Data dependency: Step 1 output will serve as Step 2 input
Potential risks: If search returns empty, I need to...
## Plan
Step 1: Call {tool_name} with parameters {...}
Step 2: Based on Step 1 result, call {tool_name2}
Step 3: Integrate results and generate final answer
[/inner_monologue]
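A fixed three-part layout is easy to check mechanically. The following is a minimal sketch, assuming the monologue uses exactly the `## ` heading lines shown in the template; `split_sections` is a hypothetical helper, not part of Hermes.

```python
import re


def split_sections(monologue: str) -> dict[str, str]:
    """Split a monologue body into {heading: text} using '## ' heading lines."""
    body = re.sub(r"\[/?inner_monologue\]", "", monologue)
    sections: dict[str, str] = {}
    current = None
    for line in body.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = ""
        elif current is not None:
            sections[current] += line + "\n"
    return {name: text.strip() for name, text in sections.items()}


sample = """[inner_monologue]
## Observation
The user is asking for a price comparison.
## Analysis
One tool is enough.
## Plan
Step 1: call the tool.
[/inner_monologue]"""

parsed = split_sections(sample)
# Any of the three expected sections that did not appear in the monologue.
missing = {"Observation", "Analysis", "Plan"} - set(parsed)
print(sorted(parsed), "missing:", sorted(missing))
```

A non-empty `missing` set would indicate that the model skipped part of the structure.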
23.2.3 Real Internal Monologue Example
Task: "Find the most-starred Python machine learning libraries on GitHub and compare their recent commit activity"
[inner_monologue]
## Observation
User requirements:
1. Find the highest-star Python ML libraries
2. Compare "recent commit activity"
"Recent" is not defined, so I'll assume the past 30 days.
## Analysis
Required data sources:
- GitHub search or web search for star rankings
- GitHub API for commit counts per library for past 30 days
Tool selection:
- web_search: Good for getting "top Python ML libraries by stars" list
- http_request: Call GitHub API for precise commit data
- Endpoint: GET /repos/{owner}/{repo}/commits?since={date}
Known major ML libraries: TensorFlow, PyTorch, scikit-learn, Keras, JAX
Need to verify star ranking order, then fetch recent commits via API
## Plan
Step 1: web_search("top Python machine learning libraries GitHub stars 2024")
Step 2: Extract top 5 libraries' owner/repo pairs from results
Step 3: Call GitHub API for each: /repos/{owner}/{repo}/commits?since={ISO 8601 timestamp for 30 days ago}
Step 4: Aggregate data and generate comparison table
[/inner_monologue]
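The `since` parameter of the GitHub commits endpoint takes an ISO 8601 timestamp, so "30 days ago" has to be turned into a concrete date before Step 3 can run. A minimal sketch of building the request URL follows (no network call is made); `commits_url` is a hypothetical helper and the repository name is only an example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional
from urllib.parse import urlencode


def commits_url(owner: str, repo: str, days: int = 30,
                now: Optional[datetime] = None) -> str:
    """Build a GitHub REST URL listing commits from the last `days` days."""
    now = now or datetime.now(timezone.utc)
    since = (now - timedelta(days=days)).strftime("%Y-%m-%dT%H:%M:%SZ")
    query = urlencode({"since": since, "per_page": 100})
    return f"https://api.github.com/repos/{owner}/{repo}/commits?{query}"


# A fixed "now" keeps the example reproducible.
fixed_now = datetime(2024, 6, 30, 12, 0, 0, tzinfo=timezone.utc)
print(commits_url("pytorch", "pytorch", now=fixed_now))
```

Each page returns at most 100 commits, so counting activity for a busy repository also requires following pagination, which this sketch leaves out.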
23.3 The step Tag Reasoning Chain Format
For particularly complex tasks, Hermes supports more granular [step] tags that decompose the reasoning process into independent step units:
[inner_monologue]
[step id=1]
Understanding the task: User wants to compare three competitors' feature differences.
Need to collect: product names, core feature lists, pricing, user reviews.
[/step]
[step id=2]
Data collection strategy:
- Use web_search to find each competitor's official website
- Use browser_navigate to access product pages for detailed feature lists
- Use web_search for user reviews (G2/Capterra/Reddit)
[/step]
[step id=3]
Validation strategy:
- Official website feature lists may have marketing bias; cross-validate with user reviews
- Use official websites as pricing authority, but watch for tax/promotional pricing
[/step]
[step id=4]
Output format planning:
- Generate a Markdown comparison table
- Rate each dimension as: Good / Neutral / Poor
- End with a summary recommendation
[/step]
[/inner_monologue]
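Because each step carries an explicit `id`, the reasoning chain can be pulled apart for logging or inspection. The following is a minimal sketch, assuming tags appear exactly in the `[step id=N]` form shown above; `parse_steps` is a hypothetical helper, not a Hermes API.

```python
import re

# Matches one [step id=N] ... [/step] unit; DOTALL lets the body span lines.
STEP_RE = re.compile(r"\[step id=(\d+)\](.*?)\[/step\]", re.DOTALL)


def parse_steps(monologue: str) -> dict[int, str]:
    """Return {step id: step text} for every [step id=N]...[/step] block."""
    return {int(num): text.strip() for num, text in STEP_RE.findall(monologue)}


sample = """[inner_monologue]
[step id=1]
Understand the task.
[/step]
[step id=2]
Collect data with web_search.
[/step]
[/inner_monologue]"""

steps = parse_steps(sample)
for step_id in sorted(steps):
    print(step_id, steps[step_id])
```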
23.4 When to Enable High-Intensity Reasoning vs. Simple Lookup
| Scenario Type | Recommended Strategy | Reason |
|---|---|---|
| Multi-step dependent tasks (> 3 steps) | Full CoT + step tags | Complex dependency chain needs explicit planning |
| Conditional branching (if-else logic) | Full CoT | Need to anticipate multiple possible outcomes |
| Numerical computation / comparative analysis | Full CoT | Prevents calculation errors; leaves traceable reasoning |
| Ambiguous tool selection (3+ candidates) | Simplified CoT | Helps model weigh tool trade-offs |
| Single-tool direct lookup | No CoT | Overhead not worth it; direct call is faster |
| Simple factual Q&A | No CoT | Model's internal knowledge is sufficient |
| Time-sensitive real-time data | No CoT | Speed is priority; minimize token consumption |
Controlling CoT via System Prompt
ENABLE_FULL_COT_PROMPT = """
Before executing any tool call, you MUST use the [inner_monologue] tag to:
1. Clearly analyze the user's actual need
2. List all tools that may need to be called
3. Plan the tool call order and data flow
4. Anticipate potential errors and mitigation strategies
Do not skip the internal monologue step.
"""
DISABLE_COT_PROMPT = """
Directly call the necessary tools and answer the user's question.
No internal monologue required; prioritize response speed.
"""
ADAPTIVE_COT_PROMPT = """
For tasks requiring multiple steps or complex logic, use [inner_monologue] for planning.
For simple direct queries, just call the tool and respond.
"""
Programmatic CoT Control
import asyncio

from hermes import HermesAgent, CotMode

async def main():
    agent = HermesAgent()
    # Enable full CoT for a complex task
    result = await agent.run(
        message="Analyze the key risks in this financial report",
        cot_mode=CotMode.FULL,
        cot_max_tokens=500,
        show_monologue=True,  # Show internal monologue for debugging
    )
    # Disable CoT for simple queries
    result = await agent.run(
        message="What day of the week is it?",
        cot_mode=CotMode.DISABLED,
    )

# `await` is only valid inside a coroutine, so the calls run through asyncio
asyncio.run(main())
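The prompt-level and programmatic controls can be tied together by selecting the CoT instruction from a mode value. The following is a minimal sketch under stated assumptions: `Mode`, `PROMPTS`, and `build_system_prompt` are illustrative names, the prompt strings are shortened stand-ins for the three prompt variants shown earlier, and nothing here reflects how the `hermes` package wires this internally.

```python
from enum import Enum


class Mode(Enum):  # illustrative stand-in, not the hermes CotMode type
    FULL = "full"
    ADAPTIVE = "adaptive"
    DISABLED = "disabled"


# Shortened stand-ins for the three prompt variants.
PROMPTS = {
    Mode.FULL: "Before executing any tool call, you MUST use the [inner_monologue] tag.",
    Mode.ADAPTIVE: "Use [inner_monologue] only for multi-step or complex tasks.",
    Mode.DISABLED: "Directly call the necessary tools and answer the user's question.",
}


def build_system_prompt(base_prompt: str, mode: Mode) -> str:
    """Append the CoT instruction for `mode` to a base system prompt."""
    return f"{base_prompt.rstrip()}\n\n{PROMPTS[mode]}"


print(build_system_prompt("You are a helpful agent.", Mode.ADAPTIVE))
```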
23.5 Impact of Internal Monologue on Tool Call Accuracy
23.5.1 Empirical Data
NousResearch published internal test data when launching Hermes 4 showing the impact of internal monologue on tool call accuracy:
| Test Scenario | Without CoT | With CoT (inner_monologue) | Improvement |
|---|---|---|---|
| Single tool call (simple) | 94.2% | 95.1% | +0.9pp |
| Single tool call (complex params) | 79.3% | 91.7% | +12.4pp |
| Dual tool serial calls | 68.4% | 84.2% | +15.8pp |
| 3+ step tool chains | 51.6% | 76.8% | +25.2pp |
| Conditional branching tool calls | 44.8% | 71.3% | +26.5pp |
| Error recovery scenarios | 38.2% | 67.9% | +29.7pp |
Key finding: CoT provides negligible gains for simple single-tool calls (+0.9pp), but massive improvements for complex multi-step and error recovery scenarios (up to +29.7pp). This supports the "adaptive CoT" strategy: enable on demand rather than always forcing it.
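The improvements are stated in percentage points (pp), which is not the same as a relative percentage gain. The sketch below recomputes both from three rows of the table; the relative figures are derived here for illustration and are not reported in the source data.

```python
# (scenario, accuracy without CoT in %, accuracy with CoT in %), from the table.
RESULTS = [
    ("Single tool call (simple)", 94.2, 95.1),
    ("3+ step tool chains", 51.6, 76.8),
    ("Error recovery scenarios", 38.2, 67.9),
]


def gains(without: float, with_cot: float) -> tuple[float, float]:
    """Return (absolute gain in pp, gain relative to the baseline in %)."""
    abs_gain = with_cot - without
    rel_gain = abs_gain / without * 100.0
    return round(abs_gain, 1), round(rel_gain, 1)


for name, without, with_cot in RESULTS:
    pp, rel = gains(without, with_cot)
    print(f"{name:<28} +{pp}pp  (+{rel}% relative to baseline)")
```

A gain of 29.7pp from a 38.2% baseline is a much larger relative change than the same number would be from a 90% baseline, which is why the low-baseline scenarios benefit most.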
23.5.2 Token Consumption vs. Accuracy Trade-off
# accuracy_gain is a fraction (0.009 = 0.9 percentage points);
# cost_benefit is the accuracy gain per extra token.
cot_analysis = {
    "simple_task": {
        "extra_tokens": 80,           # ~80 tokens for inner_monologue
        "accuracy_gain": 0.009,       # +0.9pp
        "cost_benefit": 0.009 / 80,   # ~0.00011 per token (~0.011pp/token): not worthwhile
    },
    "complex_multi_step": {
        "extra_tokens": 250,
        "accuracy_gain": 0.252,       # +25.2pp
        "cost_benefit": 0.252 / 250,  # ~0.00101 per token (~0.10pp/token): very worthwhile
    },
    "error_recovery": {
        "extra_tokens": 200,
        "accuracy_gain": 0.297,       # +29.7pp
        "cost_benefit": 0.297 / 200,  # ~0.00149 per token (~0.15pp/token): highest return
    },
}
23.6 Production Best Practices for CoT
Filtering Internal Monologue from Output
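import re

def filter_internal_monologue(response: str) -> str:
    """Remove internal monologue content from agent response"""
    # Drop complete [inner_monologue] blocks (non-greedy, spanning newlines)
    response = re.sub(
        r'\[inner_monologue\].*?\[/inner_monologue\]',
        '',
        response,
        flags=re.DOTALL,
    )
    # Drop any [step] blocks that appear outside a monologue block
    response = re.sub(
        r'\[step[^\]]*\].*?\[/step\]',
        '',
        response,
        flags=re.DOTALL,
    )
    # Collapse the blank lines left behind by the removals
    response = re.sub(r'\n{3,}', '\n\n', response)
    return response.strip()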
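One case the filter above does not cover is a monologue cut off before its closing tag, for example when generation stops at a token limit. The sketch below is a defensive variant, offered as an assumption about a possible failure mode rather than documented Hermes behavior; `strip_monologue` is a hypothetical name.

```python
import re

CLOSED = re.compile(r"\[inner_monologue\].*?\[/inner_monologue\]", re.DOTALL)
# An opening tag with no closing tag: remove everything up to the end of the text.
UNCLOSED = re.compile(r"\[inner_monologue\].*\Z", re.DOTALL)


def strip_monologue(response: str) -> str:
    """Remove complete monologue blocks, then any trailing unterminated one."""
    response = CLOSED.sub("", response)
    response = UNCLOSED.sub("", response)
    return re.sub(r"\n{3,}", "\n\n", response).strip()


complete = "[inner_monologue]\nplan...\n[/inner_monologue]\nAAPL rose 10%."
truncated = "Partial answer.\n[inner_monologue]\nplan that was cut off"
print(strip_monologue(complete))
print(strip_monologue(truncated))
```

Removing complete blocks first matters: running the unclosed pattern first would delete everything after the first opening tag, including the final answer.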
CoT Quality Monitoring
import re

class CotQualityMonitor:
    def __init__(self, tokenizer):
        # Any object exposing count(text) -> int, as used in analyze()
        self.tokenizer = tokenizer

    def analyze(self, monologue: str) -> dict:
        text = monologue.lower()
        return {
            "has_observation": "observation" in text,
            "has_plan": any(kw in text for kw in ["step", "plan", "first", "then"]),
            # Counts snake_case names followed by "(", e.g. web_search(
            "tool_count_mentioned": len(re.findall(r'\b\w+_\w+\(', monologue)),
            "length_tokens": self.tokenizer.count(monologue),
            "reasoning_quality": self._score_reasoning_quality(monologue),
        }

    def _score_reasoning_quality(self, monologue: str) -> float:
        text = monologue.lower()
        score = 0.0
        # Mentions constraints or requirements
        if any(kw in text for kw in ["constraint", "limitation", "requirement"]):
            score += 0.3
        # Anticipates failure and recovery
        if any(kw in text for kw in ["if fail", "fallback", "error", "retry"]):
            score += 0.3
        # Contains at least one concrete "key": "value" parameter pair
        if re.search(r'"[^"]+"\s*:\s*"[^"]+"', monologue):
            score += 0.4
        return score
23.7 Summary
This chapter systematically covered the Hermes Chain-of-Thought and internal monologue mechanism:
- CoT's value: Improves tool call accuracy by up to +29.7pp on complex tasks, with minimal benefit on simple tasks
- Internal monologue structure: Observe-Analyze-Plan three-part structure using [inner_monologue] + [step] tags
- Adaptive strategy: Automatically decides whether to enable CoT based on task complexity, avoiding overhead on simple tasks
- Filtering and monitoring: Filter internal monologue output in production and establish quality monitoring
Review Questions
- Internal monologue is invisible to users but consumes real, billable tokens. For token-cost-sensitive applications, how would you minimize CoT token overhead without sacrificing accuracy on complex tasks?
- The "plan" in the internal monologue sometimes differs from the actual execution steps (the model says "Step 1 is A" but actually calls B first). What is the fundamental cause of this "plan-execution deviation"?
- If internal monologue were exposed to users ("transparent mode"), in which application scenarios would this be an advantage? What new design challenges would it introduce?