Chapter 71: Case Study: Multi-Step Deep Research Agent
Chapter Introduction
The great paradox of the information age is that data grows exponentially while genuine insight becomes rarer. A high-quality competitive analysis report that once took an analyst days can now be completed by a Hermes Agent in under an hour: it searches academic papers, news, and technical documentation, reads sources in full, cross-verifies facts, tracks every citation, and produces a structured report with traceable references. This chapter builds a complete multi-step deep research agent from scratch, focusing on how to automate the research workflow without sacrificing quality.
71.1 Requirements: Core Challenges of Automated Research
Where Time Goes in Manual Research
Time distribution in manual research:
Searching for relevant material ██████ 30%
Reading and filtering content ████████ 40%
Cross-verifying facts ████ 20%
Writing and organizing report ██ 10%
Core automation challenges:
| Challenge | Description | Difficulty |
|---|---|---|
| Source reliability | Distinguishing authoritative vs. low-quality content | High |
| Fact-checking | Same fact may conflict across sources | High |
| Depth vs. breadth | Too broad = shallow; too narrow = gaps | Medium |
| Citation tracking | Every claim must have a traceable source | Medium |
| Content deduplication | Same info reported by many sources | Low |
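Content deduplication is rated the easiest challenge in the table above. The following is a minimal sketch, assuming search results are dicts with a `url` key and that URLs differing only in scheme, `www.` prefix, trailing slash, fragment, or tracking parameters point to the same page. The names `normalize_url`, `dedupe_results`, and `TRACKING_PARAMS` are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameters treated as tracking noise (an assumption; extend as needed)
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign",
                   "utm_term", "utm_content", "ref"}


def normalize_url(url: str) -> str:
    """Reduce a URL to a canonical key for duplicate detection."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    query = urlencode(sorted(
        (k, v) for k, v in parse_qsl(parts.query)
        if k.lower() not in TRACKING_PARAMS
    ))
    # Scheme and fragment are dropped: http/https and #anchors count as one page
    return urlunsplit(("", host, path, query, ""))


def dedupe_results(results: list[dict]) -> list[dict]:
    """Keep the first result for each normalized URL, preserving order."""
    seen: set[str] = set()
    unique = []
    for r in results:
        key = normalize_url(r["url"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique
```

This catches URL-level duplicates only. The same story republished on a different site would need content hashing or text similarity, which this sketch does not attempt.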
Agent Target Capabilities
- Input: Research topic + optional depth/breadth parameters
- Output: Structured Markdown report with citations
- Process: Search → Filter → Read → Extract → Synthesize → Write
- Quality: Citation tracking + cross-verification + confidence labels
71.2 System Architecture
Research Pipeline State Machine
┌─────────────────────────────────────────────────────────┐
│                 Research Agent Pipeline                 │
│                                                         │
│  [INIT] Parse topic, plan search strategy               │
│     ↓                                                   │
│  [SEARCH] Multi-angle search (Tavily/SerpAPI)           │
│     ↓                                                   │
│  [FILTER] Evaluate source reliability                   │
│     ↓                                                   │
│  [READ] Deep-read high-value pages (full text)          │
│     ↓                                                   │
│  [EXTRACT] Distill key facts from each source           │
│     ↓                                                   │
│  [VERIFY] Cross-verify: find corroboration/conflicts    │
│     ↓                                                   │
│  [SYNTHESIZE] Combine findings, identify key themes     │
│     ↓                                                   │
│  [WRITE] Generate structured research report            │
│     ↓                                                   │
│  [QA] Validate citations, annotate confidence           │
└─────────────────────────────────────────────────────────┘
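The stages in the diagram form a linear state machine. One way to encode that ordering explicitly is sketched below. The `Stage` enum, `next_stage`, and `run_pipeline` are illustrative names, not part of the chapter's implementation, which drives the stages through the model's prompt.

```python
from enum import Enum
from typing import Callable, Optional


class Stage(Enum):
    INIT = "init"
    SEARCH = "search"
    FILTER = "filter"
    READ = "read"
    EXTRACT = "extract"
    VERIFY = "verify"
    SYNTHESIZE = "synthesize"
    WRITE = "write"
    QA = "qa"


# Enum members iterate in definition order, which matches the diagram
STAGE_ORDER = list(Stage)


def next_stage(current: Stage) -> Optional[Stage]:
    """Return the stage that follows `current`, or None after QA."""
    i = STAGE_ORDER.index(current)
    return STAGE_ORDER[i + 1] if i + 1 < len(STAGE_ORDER) else None


def run_pipeline(handlers: dict[Stage, Callable[[], None]]) -> list[str]:
    """Walk every stage in order, calling its handler if one is registered."""
    trace = []
    stage: Optional[Stage] = Stage.INIT
    while stage is not None:
        handler = handlers.get(stage)
        if handler:
            handler()
        trace.append(stage.name)
        stage = next_stage(stage)
    return trace
```

An explicit state machine like this makes it easy to log which stage a run reached, at the cost of some of the flexibility a prompt-driven loop has to revisit earlier stages.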
Tool Inventory
| Tool | Description | API/Library |
|---|---|---|
| `search_web` | General web search | Tavily API |
| `search_academic` | Academic paper search | Semantic Scholar |
| `fetch_page_content` | Retrieve full page text | requests + BeautifulSoup |
| `extract_pdf_text` | PDF text extraction | pdfplumber |
| `record_fact` | Store verified fact to memory | In-memory |
| `cross_verify_fact` | Check fact against multiple sources | Hermes LLM |
| `write_final_report` | Compile full report | Template |
| `format_citations` | Format reference list | Custom |
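The table lists requests + BeautifulSoup for `fetch_page_content`, but the chapter does not show that tool. As a dependency-free stand-in, the sketch below uses the standard library's `html.parser` rather than BeautifulSoup to turn HTML into plain text, skipping script and style bodies and truncating to `max_chars` (the parameter name and its default of 8000 come from the tool schema in 71.3). `html_to_text` and `TextExtractor` are hypothetical names, and downloading the page is deliberately left out.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of script/style tags."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


def html_to_text(html: str, max_chars: int = 8000) -> str:
    """Return whitespace-joined visible text, cut to at most max_chars."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)[:max_chars]
```

Truncating by characters keeps each tool result within a predictable share of the model's context window, though it can cut a sentence in half.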
71.3 Full Implementation
Core Agent
# research_agent/agent.py
import os
import json
from datetime import datetime
from openai import OpenAI
client = OpenAI(
    base_url=os.getenv("HERMES_BASE_URL", "http://localhost:11434/v1"),
    api_key=os.getenv("HERMES_API_KEY", "ollama"),
)
MODEL = os.getenv("HERMES_MODEL", "nous-hermes-2-mixtral-8x7b-dpo")
SYSTEM_PROMPT = """You are a professional research analyst skilled at synthesizing insights from large volumes of information.
Your research methodology:
1. **Broad search**: Query from multiple angles, not just the surface question
2. **Deep reading**: Read full text of key sources, not just excerpts
3. **Cross-verification**: Confirm important claims with at least 2 independent sources
4. **Citation tracking**: Every data point must have a source; no unsourced assertions
5. **Structured output**: Reports must have clear hierarchical structure
Confidence annotation rules:
- HIGH: 2+ authoritative sources confirm the claim
- MEDIUM: 1 reliable source, not independently verified
- LOW: Speculation or a single source of unknown reliability, must be flagged
Report structure:
- Executive Summary (300 words max)
- Key Findings (3-5 bullets)
- Detailed Analysis (by section)
- Conclusions & Recommendations
- References"""
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "Search the web using the Tavily search API",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "max_results": {"type": "integer", "default": 5},
                    "include_domains": {"type": "array", "items": {"type": "string"}},
                    "search_depth": {"type": "string", "enum": ["basic", "advanced"], "default": "advanced"}
                },
                "required": ["query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_academic",
            "description": "Search academic papers via Semantic Scholar",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "year_from": {"type": "integer"},
                    "max_results": {"type": "integer", "default": 5}
                },
                "required": ["query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "fetch_page_content",
            "description": "Retrieve the full text content of a URL",
            "parameters": {
                "type": "object",
                "properties": {
                    "url": {"type": "string"},
                    "max_chars": {"type": "integer", "default": 8000}
                },
                "required": ["url"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "record_fact",
            "description": "Record a verified fact into the research memory",
            "parameters": {
                "type": "object",
                "properties": {
                    "fact": {"type": "string"},
                    "source_url": {"type": "string"},
                    "source_title": {"type": "string"},
                    "confidence": {"type": "string", "enum": ["high", "medium", "low"]},
                    "category": {"type": "string"}
                },
                "required": ["fact", "source_url", "confidence", "category"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_recorded_facts",
            "description": "Retrieve all recorded research facts",
            "parameters": {
                "type": "object",
                "properties": {
                    "category": {"type": "string"}
                }
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "write_final_report",
            "description": "Compile all collected information into the final research report",
            "parameters": {
                "type": "object",
                "properties": {
                    "topic": {"type": "string"},
                    "facts": {"type": "array"},
                    "report_structure": {"type": "array", "items": {"type": "string"}}
                },
                "required": ["topic", "facts"]
            }
        }
    }
]
class ResearchMemory:
    def __init__(self):
        self.visited_urls: set = set()
        self.facts: list = []

    def add_fact(self, fact, source_url, source_title, confidence, category):
        self.facts.append({
            "id": len(self.facts) + 1, "fact": fact,
            "source_url": source_url, "source_title": source_title,
            "confidence": confidence, "category": category
        })
        return {"success": True, "fact_id": len(self.facts)}

    def get_facts(self, category=None):
        return [f for f in self.facts if not category or f["category"] == category]
def run_research_agent(topic: str, depth: str = "comprehensive") -> dict:
    memory = ResearchMemory()
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"""Conduct deep research on the following topic and produce a complete report:
**Topic:** {topic}
**Depth:** {depth}
**Date:** {datetime.now().strftime('%B %Y')}
Steps:
1. Plan search strategy (at least 3 different search angles)
2. Execute multiple search rounds from different sources
3. Deep-read the 5-10 most relevant sources in full
4. Record key facts using record_fact tool
5. Cross-verify important data points
6. Call write_final_report to generate the report
Begin the research."""}
    ]
    # Cap the loop at 30 model turns so a stuck run cannot spin forever
    for iteration in range(30):
        response = client.chat.completions.create(
            model=MODEL, messages=messages, tools=TOOLS,
            tool_choice="auto", temperature=0.3, max_tokens=4000
        )
        message = response.choices[0].message
        messages.append(message)
        # No tool calls means the model has produced its final answer
        if not message.tool_calls:
            return {
                "status": "completed", "report": message.content,
                "facts_collected": len(memory.facts),
                "sources_visited": len(memory.visited_urls),
                "iterations": iteration + 1
            }
        for tc in message.tool_calls:
            try:
                args = json.loads(tc.function.arguments or "{}")
                # _dispatch maps a tool name to its implementation; it is not shown in this listing
                result = _dispatch(tc.function.name, args, memory)
            except json.JSONDecodeError as exc:
                # Malformed arguments: report the error so the model can retry
                result = {"error": f"invalid tool arguments: {exc}"}
            messages.append({
                "role": "tool", "tool_call_id": tc.id,
                "content": json.dumps(result)
            })
    return {"status": "max_iterations", "facts_collected": len(memory.facts)}
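The loop calls `_dispatch(name, args, memory)`, which the listing does not define. Below is a minimal sketch of what such a dispatcher might look like, assuming each tool is a plain Python callable looked up by name. It is named `dispatch_tool` here; the loop would use it under the name `_dispatch`. `SimpleMemory` mirrors only the parts of `ResearchMemory` needed for the example, and `fetch_stub` is a placeholder for a real page fetcher. All of these names are assumptions for illustration.

```python
class SimpleMemory:
    """Stand-in for ResearchMemory so this sketch runs on its own."""

    def __init__(self):
        self.visited_urls = set()
        self.facts = []


def fetch_stub(url, max_chars=8000):
    # Placeholder: a real implementation would download and clean the page
    return {"url": url, "text": ""}


def record_fact_impl(memory, fact, source_url, confidence, category, source_title=""):
    # source_title defaults to "" because the tool schema does not require it
    memory.facts.append({
        "id": len(memory.facts) + 1, "fact": fact,
        "source_url": source_url, "source_title": source_title,
        "confidence": confidence, "category": category,
    })
    return {"success": True, "fact_id": len(memory.facts)}


def dispatch_tool(name: str, args: dict, memory) -> dict:
    """Route a tool call to its implementation; always return a dict."""
    if name == "fetch_page_content":
        url = args.get("url", "")
        if url in memory.visited_urls:
            # Avoid paying to re-read a page already fetched in this run
            return {"skipped": True, "reason": "already visited", "url": url}
        memory.visited_urls.add(url)

    handlers = {
        "fetch_page_content": lambda a: fetch_stub(**a),
        "record_fact": lambda a: record_fact_impl(memory, **a),
        "get_recorded_facts": lambda a: {"facts": [
            f for f in memory.facts
            if not a.get("category") or f["category"] == a["category"]
        ]},
    }
    handler = handlers.get(name)
    if handler is None:
        return {"error": f"unknown tool: {name}"}
    try:
        return handler(args)
    except Exception as exc:
        # Surface failures to the model instead of crashing the loop
        return {"error": f"{name} failed: {exc}"}
```

Returning an error dict rather than raising lets the model see what went wrong and choose a different tool or corrected arguments on its next turn.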
Search & Content Tools
# research_agent/tools/search_tools.py
import os

import requests

TAVILY_KEY = os.getenv("TAVILY_API_KEY")


def search_tavily(query, max_results=5, include_domains=None, search_depth="advanced"):
    """Run a Tavily web search and return trimmed results."""
    payload = {
        "api_key": TAVILY_KEY, "query": query,
        "max_results": max_results, "search_depth": search_depth,
        "include_answer": True
    }
    if include_domains:
        payload["include_domains"] = include_domains
    resp = requests.post("https://api.tavily.com/search", json=payload, timeout=30)
    resp.raise_for_status()
    data = resp.json()
    return {
        "query": query,
        "ai_answer": data.get("answer", ""),
        "results": [
            {"title": r["title"], "url": r["url"],
             "snippet": r.get("content", "")[:500], "score": r.get("score", 0)}
            for r in data.get("results", [])
        ]
    }


def search_academic(query, year_from=None, max_results=5):
    """Search Semantic Scholar and return papers sorted by citation count."""
    params = {
        "query": query, "limit": max_results,
        "fields": "title,abstract,year,authors,citationCount,url"
    }
    if year_from:
        params["year"] = f"{year_from}-"
    resp = requests.get(
        "https://api.semanticscholar.org/graph/v1/paper/search",
        params=params, timeout=30
    )
    resp.raise_for_status()
    papers = resp.json().get("data") or []
    # Fields such as abstract can be null in the response, so fall back explicitly
    papers.sort(key=lambda p: p.get("citationCount") or 0, reverse=True)
    return {"query": query, "papers": [
        {"title": p.get("title"), "abstract": (p.get("abstract") or "")[:400],
         "year": p.get("year"), "citations": p.get("citationCount") or 0,
         "url": p.get("url") or ""}
        for p in papers
    ]}
71.4 Quality Control
Citation Tracking
Every factual assertion in the final report must be traceable. The CitationTracker class assigns numeric IDs to sources and injects [1], [2] style markers into the report text at write time.
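The chapter names `CitationTracker` without listing it. The following is a minimal sketch of the behavior described, assuming sources are keyed by URL and that the caller appends the returned marker to the sentence it supports. The method names `cite` and `references` are hypothetical.

```python
class CitationTracker:
    """Assign stable numeric IDs to sources in first-seen order."""

    def __init__(self):
        self._ids: dict[str, int] = {}      # url -> citation number
        self._titles: dict[str, str] = {}   # url -> title, if known

    def cite(self, url: str, title: str = "") -> str:
        """Return a marker such as "[1]", reusing the ID of a known URL."""
        if url not in self._ids:
            self._ids[url] = len(self._ids) + 1
        if title and not self._titles.get(url):
            self._titles[url] = title
        return f"[{self._ids[url]}]"

    def references(self) -> str:
        """Render the numbered reference list for the end of the report."""
        lines = []
        for url, n in sorted(self._ids.items(), key=lambda kv: kv[1]):
            title = self._titles.get(url, "")
            lines.append(f"[{n}] {title}. {url}" if title else f"[{n}] {url}")
        return "\n".join(lines)


tracker = CitationTracker()
marker = tracker.cite("https://example.org/report", "Example Report")
sentence = f"The market grew last year {marker}."
```

Keying on the URL means a source cited in several sections keeps one number throughout the report.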
Confidence Scoring
def score_confidence(supporting_sources: list, contradicting_sources: list,
                     authoritative_domains: list) -> dict:
    score = 0
    for s in supporting_sources:
        is_auth = any(d in s.get("url", "") for d in authoritative_domains)
        score += 2 if is_auth else 1
    score -= len(contradicting_sources)
    if score >= 3:
        return {"level": "high", "marker": "HIGH", "note": "Ready to cite"}
    if score >= 1:
        return {"level": "medium", "marker": "MEDIUM", "note": "Flag limited sources"}
    return {"level": "low", "marker": "LOW", "note": "Verify or mark as speculation"}
71.5 Time & Cost Analysis
Resource Usage by Research Depth
| Research Type | Search Rounds | Pages Read | LLM Calls | Est. Time | Est. Cost* |
|---|---|---|---|---|---|
| Quick overview | 2-3 | 3-5 | 10-15 | 3-5 min | $0.05-0.15 |
| Standard research | 5-8 | 8-15 | 20-30 | 10-20 min | $0.20-0.60 |
| Deep synthesis | 10-15 | 15-30 | 35-50 | 30-60 min | $0.60-2.00 |
| Academic-grade | 20+ | 30-50 | 60-100 | 1-3 hrs | $2.00-8.00 |
*Estimates assume self-hosted Hermes inference. Commercial API costs are 5-10x higher.
Budget Control
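class CostOptimizer:
    def __init__(self, budget_usd: float = 1.0):
        self.budget = budget_usd
        # The caller must add to self.spent after each billed call
        self.spent = 0.0

    def should_continue(self, facts_count: int, min_facts: int = 10) -> bool:
        # Hard stop once the budget is exhausted
        if self.spent >= self.budget:
            return False
        # Stop early if enough facts are collected and half the budget is used
        if facts_count >= min_facts and self.spent >= self.budget * 0.5:
            return False
        return True

    def should_deep_read(self, relevance_score: float) -> bool:
        # Reserve the last 20% of the budget for synthesis and writing
        return relevance_score >= 0.7 and self.spent < self.budget * 0.8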
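`CostOptimizer` reads `self.spent`, but nothing in the listing updates it, so the calling loop has to do the accounting. The sketch below shows one way to wire that up, assuming a flat cost per step (`cost_per_call` is a placeholder, not a measured figure) and a hypothetical `research_step` callback that returns how many new facts it recorded. The class is repeated in condensed form only so the block runs on its own.

```python
from typing import Callable


class CostOptimizer:
    """Condensed copy of the budget controller (stop rules only)."""

    def __init__(self, budget_usd: float = 1.0):
        self.budget = budget_usd
        self.spent = 0.0

    def should_continue(self, facts_count: int, min_facts: int = 10) -> bool:
        if self.spent >= self.budget:
            return False
        if facts_count >= min_facts and self.spent >= self.budget * 0.5:
            return False
        return True


def run_with_budget(research_step: Callable[[], int], budget_usd: float = 1.0,
                    cost_per_call: float = 0.05, max_steps: int = 100) -> dict:
    """Repeat research_step until the optimizer says stop or max_steps is hit."""
    optimizer = CostOptimizer(budget_usd)
    facts = 0
    steps = 0
    while steps < max_steps and optimizer.should_continue(facts):
        facts += research_step()
        optimizer.spent += cost_per_call  # the caller does the accounting
        steps += 1
    return {"steps": steps, "facts": facts, "spent": optimizer.spent}
```

In a real run the per-step cost would come from token usage reported by the inference endpoint rather than a constant.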
Chapter Summary
This chapter built a production-quality deep research agent covering:
- Pipeline design: Full search → filter → read → extract → verify → synthesize → write chain
- Quality mechanisms: Citation tracking, confidence scoring, cross-verification
- Tool stack: Tavily search, Semantic Scholar, BeautifulSoup content extraction
- Cost control: Depth-tiered research with built-in budget controller
The agent's core value is encoding the research analyst's methodology (systematic search, rigorous verification, complete citations) into agent behavior, not just "asking AI a question."
Discussion Questions
- How should the agent handle contradicting facts from equally authoritative sources?
- How can we prevent confirmation bias, that is, only collecting evidence that supports a preset conclusion?
- For non-English research topics, how should the agent balance source language diversity?
- What automated quality metrics could evaluate a research report's reliability without human review?