XML Structured Output and Scratchpad
Chapter 24: XML Structured Output and Scratchpad
Hermes Agent can not only output natural language but also generate precise XML structured data — a critical capability for deeply integrating AI into systems engineering. From Scratchpad intermediate reasoning staging, to Mermaid architecture diagram generation, to Python parsing code, this chapter presents the design philosophy and engineering practices behind Hermes's XML output system.
24.1 Hermes XML Output Specification
Hermes's XML output system consists of four core tags, each with clear semantic boundaries:
<!-- Complete XML output structure example -->
<response>
<scratchpad>
<!-- Intermediate reasoning and computation staging (hidden from users) -->
</scratchpad>
<thinking>
<!-- Lightweight reasoning (optionally displayed) -->
</thinking>
<result>
<!-- Final structured output -->
</result>
<metadata>
<!-- Meta-information (execution time, confidence scores, etc.) -->
</metadata>
</response>
24.1.1 Tag Purpose Comparison
| Tag | Purpose | User-Visible | Typical Content |
|---|---|---|---|
scratchpad |
Intermediate computation staging | No | Draft calculations, temp variables, candidate approaches |
thinking |
High-level reasoning process | Optional | Decision logic, trade-off analysis |
result |
Final output | Yes | Structured answers, reports, data |
metadata |
Meta-information | Optional | Confidence, sources, execution stats |
24.1.2 XML Output Format Specification
<!-- Specification example: code review output -->
<code_review>
<summary severity="high">
Found 3 high-severity vulnerabilities, 7 medium risks, and 12 low-risk issues
</summary>
<findings>
<issue id="001" severity="high" line="42" file="auth.py">
<type>SQL Injection</type>
<description>User input is directly concatenated into SQL without parameterization</description>
<code_snippet>query = f"SELECT * FROM users WHERE id = {user_id}"</code_snippet>
<fix>
<description>Use parameterized queries</description>
<code>query = "SELECT * FROM users WHERE id = %s"; cursor.execute(query, (user_id,))</code>
</fix>
<references>
<ref>OWASP A03:2021 – Injection</ref>
<ref>CWE-89: SQL Injection</ref>
</references>
</issue>
<issue id="002" severity="medium" line="78" file="auth.py">
<type>Hardcoded Secret</type>
<description>API key is hardcoded in source code</description>
<code_snippet>API_KEY = "sk-abc123xyz..."</code_snippet>
<fix>
<description>Read from environment variable</description>
<code>API_KEY = os.environ.get("API_KEY")</code>
</fix>
</issue>
</findings>
<metrics>
<files_analyzed>15</files_analyzed>
<lines_analyzed>2847</lines_analyzed>
<scan_time_seconds>3.2</scan_time_seconds>
</metrics>
</code_review>
24.2 Scratchpad Tag: Purpose and Mechanism
scratchpad is one of Hermes's most distinctive output components. It provides the model with a "scratch paper" space for intermediate calculations, recording temporary variables, and exploring multiple candidate approaches — ultimately distilling high-quality final answers from the scratchpad.
24.2.1 Three Key Values of Scratchpad
1. Reducing working memory pressure
A language model's "working memory" is its context window. For complex computations, scratchpad allows the model to "write on paper" rather than trying to hold everything in memory:
<scratchpad>
Task: Find the sum of all prime numbers under 100
Primes: 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47,
53, 59, 61, 67, 71, 73, 79, 83, 89, 97
Cumulative sum:
2 + 3 = 5
5 + 5 = 10
10 + 7 = 17
17 + 11 = 28
28 + 13 = 41
...
Final total: 1060
</scratchpad>
<result>The sum of all prime numbers under 100 is **1060**.</result>
2. Supporting multi-candidate exploration (Draft-then-Select)
<scratchpad>
User needs a function name, candidates:
Option A: process_user_data() — Too broad, not descriptive enough
Option B: sanitize_and_validate_user_input() — Too verbose
Option C: validate_user_profile() — Accurate and appropriately concise
Option D: check_user_data() — Imprecise verb choice
Conclusion: Option C is best
</scratchpad>
<result>Recommended function name: `validate_user_profile()`</result>
3. Improving structured output accuracy
Planning the structure in scratchpad before generating complex XML:
<scratchpad>
Required XML structure:
- Root node: report
- summary (string)
- findings (list)
- finding (object): id, severity, description, fix
- statistics (object): count, scan_time
Validation checklist:
- severity enum: must be one of high/medium/low
- id format: "FIND-{three digits}"
- fix must contain both description and code_example child nodes
</scratchpad>
24.3 Structured Output vs JSON Mode Comparison
Hermes XML structured output differs fundamentally from the JSON Mode common in LLM APIs:
| Dimension | JSON Mode | Hermes XML Structured Output |
|---|---|---|
| Format flexibility | Strict JSON Schema constraint | XML + natural language mix, more flexible |
| Readability | Machine-friendly, human-readable | Readable by both humans and machines |
| Nesting complexity | Readability degrades with deep nesting | Indented tags remain naturally readable |
| Comment support | No comments | XML comments (<!-- -->) supported |
| Streaming | Must accumulate complete JSON before parsing | Supports tag-level streaming parse |
| Model training | Requires JSON schema compliance training | Hermes native support |
| Scratchpad | No standard support | Native scratchpad tag |
| Mixed content | Not supported (values must be JSON types) | Supports embedded code blocks, tables, etc. |
JSON Mode Example
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
model="gpt-4-turbo",
response_format={"type": "json_object"},
messages=[
{"role": "user", "content": "Analyze this code's quality and return as JSON"}
]
)
# Output must be valid JSON; intermediate reasoning process is not capturable
Hermes XML Output Example
from hermes import HermesAgent
agent = HermesAgent()
result = await agent.run("""
Analyze the code quality using this XML format:
<code_analysis>
<scratchpad>(your intermediate analysis)</scratchpad>
<overall_score>(0-100)</overall_score>
<issues>(list of problems found)</issues>
<recommendations>(improvement suggestions)</recommendations>
</code_analysis>
""")
24.4 Mermaid Diagram Generation Examples
Hermes can generate Mermaid-format diagrams within XML structured output, suitable for architecture diagrams, flowcharts, sequence diagrams, and more.
System Architecture Diagram
<architecture_analysis>
<summary>This is a standard three-tier web architecture</summary>
<diagram type="mermaid">
```mermaid
graph TB
subgraph "Frontend Layer"
A[React SPA] --> B[Nginx]
B --> C[CDN]
end
subgraph "Application Layer"
D[API Gateway]
E[Auth Service]
F[Business Service]
G[File Service]
D --> E
D --> F
D --> G
end
subgraph "Data Layer"
H[(PostgreSQL)]
I[(Redis Cache)]
J[(S3 Storage)]
F --> H
F --> I
G --> J
end
B --> D
Agent State Machine Diagram
<state_diagram>
<diagram type="mermaid">
```mermaid
stateDiagram-v2
[*] --> Idle: Initialization complete
Idle --> Thinking: User request received
Thinking --> Planning: Task analyzed
Planning --> Executing: Plan formulated
Executing --> ToolCalling: Tool needed
ToolCalling --> Executing: Tool complete
Executing --> Responding: All steps complete
Executing --> ErrorHandling: Tool failure
ErrorHandling --> Planning: Re-planning
ErrorHandling --> Responding: Unrecoverable
Responding --> Idle: Reply sent
```
24.5 Parsing Hermes XML Output in Python
Complete XML Parser
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Any
import re
@dataclass
class HermesXMLOutput:
"""Parsed Hermes XML output structure"""
raw: str
scratchpad: str | None = None
thinking: str | None = None
result: Any = None
metadata: dict = field(default_factory=dict)
custom_tags: dict = field(default_factory=dict)
class HermesXMLParser:
"""Hermes XML output parser"""
KNOWN_TAGS = {"scratchpad", "thinking", "result", "metadata"}
def parse(self, response: str) -> HermesXMLOutput:
"""
Parse a Hermes XML-format response.
Supports:
- XML with or without an outer root tag
- XML mixed with natural language (auto-extracts XML portion)
- XML containing code blocks (prevents code from breaking the parser)
"""
xml_content = self._extract_xml(response)
if not xml_content:
return HermesXMLOutput(raw=response, result=response)
if not xml_content.strip().startswith("<response>"):
xml_content = f"<response>{xml_content}</response>"
try:
root = ET.fromstring(xml_content)
except ET.ParseError as e:
return self._fallback_parse(response, str(e))
output = HermesXMLOutput(raw=response)
for child in root:
tag = child.tag
text = self._get_text_content(child)
if tag == "scratchpad":
output.scratchpad = text
elif tag == "thinking":
output.thinking = text
elif tag == "result":
output.result = self._parse_result(child)
elif tag == "metadata":
output.metadata = self._parse_metadata(child)
else:
output.custom_tags[tag] = self._parse_element(child)
return output
def _extract_xml(self, text: str) -> str | None:
xml_start = re.search(r'<(?!--)\w+[\s>]', text)
if not xml_start:
return None
return text[xml_start.start():]
def _get_text_content(self, element: ET.Element) -> str:
parts = []
if element.text:
parts.append(element.text.strip())
for child in element:
parts.append(ET.tostring(child, encoding="unicode"))
if child.tail:
parts.append(child.tail.strip())
return "\n".join(filter(None, parts))
def _parse_result(self, element: ET.Element) -> dict | str:
children = list(element)
if not children:
return element.text or ""
result = {}
for child in children:
result[child.tag] = self._parse_element(child)
return result
def _parse_element(self, element: ET.Element) -> Any:
children = list(element)
if not children:
return element.text or ""
result = {}
for child in children:
result[child.tag] = self._parse_element(child)
if element.attrib:
result["_attributes"] = element.attrib
return result
def _parse_metadata(self, element: ET.Element) -> dict:
metadata = {}
for child in element:
value = child.text or ""
try:
value = int(value)
except ValueError:
try:
value = float(value)
except ValueError:
pass
metadata[child.tag] = value
return metadata
def _fallback_parse(self, text: str, error: str) -> HermesXMLOutput:
scratchpad = self._regex_extract(text, "scratchpad")
result = self._regex_extract(text, "result")
return HermesXMLOutput(
raw=text,
scratchpad=scratchpad,
result=result or text,
metadata={"parse_error": error, "parse_method": "regex_fallback"},
)
def _regex_extract(self, text: str, tag: str) -> str | None:
pattern = rf'<{tag}[^>]*>(.*?)</{tag}>'
match = re.search(pattern, text, re.DOTALL)
return match.group(1).strip() if match else None
Mermaid Diagram Extractor
class MermaidExtractor:
"""Extract Mermaid diagrams from Hermes XML output"""
def extract_all(self, xml_output: HermesXMLOutput) -> list[dict]:
all_content = xml_output.raw
pattern = r'```mermaid\s*\n(.*?)```'
matches = re.findall(pattern, all_content, re.DOTALL)
diagrams = []
for i, code in enumerate(matches):
diagram_type = self._detect_type(code)
diagrams.append({
"index": i,
"type": diagram_type,
"code": code.strip(),
"rendered_url": self._render_url(code),
})
return diagrams
def _detect_type(self, code: str) -> str:
first_line = code.strip().split("\n")[0].lower()
type_map = {
"graph": "flowchart",
"sequencediagram": "sequence",
"statediagram": "state",
"classdiagram": "class",
"erdiagram": "er",
"gantt": "gantt",
}
for key, value in type_map.items():
if key in first_line:
return value
return "unknown"
def _render_url(self, code: str) -> str:
import base64, json
payload = json.dumps({"code": code, "mermaid": {"theme": "default"}})
encoded = base64.urlsafe_b64encode(payload.encode()).decode()
return f"https://mermaid.live/edit#{encoded}"
Complete Usage Example
async def main():
from hermes import HermesAgent
agent = HermesAgent()
response = await agent.run("""
Analyze the code quality of the following Python function:
```python
def get_user(id):
db = connect("localhost:5432/prod")
result = db.execute(f"SELECT * FROM users WHERE id = {id}")
return result
```
Output using this XML format:
- scratchpad: your analysis process
- result/issues: list of problems (each with severity and description)
- result/overall_score: composite score from 0-100
- result/refactored_code: the refactored version
""")
parser = HermesXMLParser()
parsed = parser.parse(response.content)
print("=== Internal analysis (not shown to users) ===")
print(parsed.scratchpad)
print("\n=== Final Result ===")
if isinstance(parsed.result, dict):
score = parsed.result.get("overall_score", "N/A")
issues = parsed.result.get("issues", {})
print(f"Overall score: {score}")
print(f"Issues found: {issues}")
extractor = MermaidExtractor()
diagrams = extractor.extract_all(parsed)
for diagram in diagrams:
print(f"\nDiagram type: {diagram['type']}")
print(f"Mermaid Live URL: {diagram['rendered_url']}")
import asyncio
asyncio.run(main())
24.6 Summary
This chapter systematically covered the Hermes XML structured output system:
- Four core tags: scratchpad (staging) / thinking (reasoning) / result (output) / metadata (meta-info)
- Scratchpad value: Reduces working memory pressure, supports multi-candidate exploration, improves structured output accuracy
- vs JSON Mode: XML supports mixed content, comments, and streaming parsing — better suited for complex Agent output scenarios
- Mermaid diagrams: Embed architecture diagrams, sequence diagrams, and state machines within XML — "documentation as code"
- Python parsing:
HermesXMLParserwith graceful degradation;MermaidExtractorfor renderable diagram extraction
Review Questions
-
Hermes's scratchpad is "written" during reasoning but not shown to users. This means scratchpad content still consumes tokens. How would you design a "lightweight scratchpad" scheme that preserves reasoning value while minimizing token consumption?
-
When generating Mermaid diagrams, the model may make syntax errors (e.g., invalid node ID formats). How would you design a "Mermaid syntax validation + auto-correction" post-processing pipeline?
-
XML structured output in streaming scenarios faces the "incomplete tags" problem (users see raw XML tokens arriving one by one). How would you design a streaming XML frontend rendering strategy that lets users see meaningful content before the output is complete?