Chapter 24

XML Structured Output and Scratchpad

Chapter 24: XML Structured Output and Scratchpad

Hermes Agent can not only output natural language but also generate precise XML structured data — a critical capability for deeply integrating AI into systems engineering. From Scratchpad intermediate reasoning staging, to Mermaid architecture diagram generation, to Python parsing code, this chapter presents the design philosophy and engineering practices behind Hermes's XML output system.

24.1 Hermes XML Output Specification

Hermes's XML output system consists of four core tags, each with clear semantic boundaries:

<!-- Complete XML output structure example -->
<response>
  <scratchpad>
    <!-- Intermediate reasoning and computation staging (hidden from users) -->
  </scratchpad>

  <thinking>
    <!-- Lightweight reasoning (optionally displayed) -->
  </thinking>

  <result>
    <!-- Final structured output -->
  </result>

  <metadata>
    <!-- Meta-information (execution time, confidence scores, etc.) -->
  </metadata>
</response>

24.1.1 Tag Purpose Comparison

Tag	Purpose	User-Visible	Typical Content
`scratchpad`	Intermediate computation staging	No	Draft calculations, temp variables, candidate approaches
`thinking`	High-level reasoning process	Optional	Decision logic, trade-off analysis
`result`	Final output	Yes	Structured answers, reports, data
`metadata`	Meta-information	Optional	Confidence, sources, execution stats

24.1.2 XML Output Format Specification

<!-- Specification example: code review output -->
<code_review>
  <summary severity="high">
    Found 3 high-severity vulnerabilities, 7 medium risks, and 12 low-risk issues
  </summary>

  <findings>
    <issue id="001" severity="high" line="42" file="auth.py">
      <type>SQL Injection</type>
      <description>User input is directly concatenated into SQL without parameterization</description>
      <code_snippet>query = f"SELECT * FROM users WHERE id = {user_id}"</code_snippet>
      <fix>
        <description>Use parameterized queries</description>
        <code>query = "SELECT * FROM users WHERE id = %s"; cursor.execute(query, (user_id,))</code>
      </fix>
      <references>
        <ref>OWASP A03:2021 – Injection</ref>
        <ref>CWE-89: SQL Injection</ref>
      </references>
    </issue>

    <issue id="002" severity="medium" line="78" file="auth.py">
      <type>Hardcoded Secret</type>
      <description>API key is hardcoded in source code</description>
      <code_snippet>API_KEY = "sk-abc123xyz..."</code_snippet>
      <fix>
        <description>Read from environment variable</description>
        <code>API_KEY = os.environ.get("API_KEY")</code>
      </fix>
    </issue>
  </findings>

  <metrics>
    <files_analyzed>15</files_analyzed>
    <lines_analyzed>2847</lines_analyzed>
    <scan_time_seconds>3.2</scan_time_seconds>
  </metrics>
</code_review>

24.2 Scratchpad Tag: Purpose and Mechanism

scratchpad is one of Hermes's most distinctive output components. It provides the model with a "scratch paper" space for intermediate calculations, recording temporary variables, and exploring multiple candidate approaches — ultimately distilling high-quality final answers from the scratchpad.

24.2.1 Three Key Values of Scratchpad

1. Reducing working memory pressure

A language model's "working memory" is its context window. For complex computations, scratchpad allows the model to "write on paper" rather than trying to hold everything in memory:

<scratchpad>
Task: Find the sum of all prime numbers under 100

Primes: 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47,
        53, 59, 61, 67, 71, 73, 79, 83, 89, 97

Cumulative sum:
  2 + 3 = 5
  5 + 5 = 10
  10 + 7 = 17
  17 + 11 = 28
  28 + 13 = 41
  ...
  Final total: 1060
</scratchpad>

<result>The sum of all prime numbers under 100 is **1060**.</result>

2. Supporting multi-candidate exploration (Draft-then-Select)

<scratchpad>
User needs a function name, candidates:
Option A: process_user_data()    — Too broad, not descriptive enough
Option B: sanitize_and_validate_user_input()  — Too verbose
Option C: validate_user_profile()  — Accurate and appropriately concise
Option D: check_user_data()  — Imprecise verb choice

Conclusion: Option C is best
</scratchpad>

<result>Recommended function name: `validate_user_profile()`</result>

3. Improving structured output accuracy

Planning the structure in scratchpad before generating complex XML:

<scratchpad>
Required XML structure:
- Root node: report
  - summary (string)
  - findings (list)
    - finding (object): id, severity, description, fix
  - statistics (object): count, scan_time

Validation checklist:
- severity enum: must be one of high/medium/low
- id format: "FIND-{three digits}"
- fix must contain both description and code_example child nodes
</scratchpad>

24.3 Structured Output vs JSON Mode Comparison

Hermes XML structured output differs fundamentally from the JSON Mode common in LLM APIs:

Dimension	JSON Mode	Hermes XML Structured Output
Format flexibility	Strict JSON Schema constraint	XML + natural language mix, more flexible
Readability	Machine-friendly, human-readable	Readable by both humans and machines
Nesting complexity	Readability degrades with deep nesting	Indented tags remain naturally readable
Comment support	No comments	XML comments (`<!-- -->`) supported
Streaming	Must accumulate complete JSON before parsing	Supports tag-level streaming parse
Model training	Requires JSON schema compliance training	Hermes native support
Scratchpad	No standard support	Native `scratchpad` tag
Mixed content	Not supported (values must be JSON types)	Supports embedded code blocks, tables, etc.

JSON Mode Example

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4-turbo",
    response_format={"type": "json_object"},
    messages=[
        {"role": "user", "content": "Analyze this code's quality and return as JSON"}
    ]
)
# Output must be valid JSON; intermediate reasoning process is not capturable

Hermes XML Output Example

from hermes import HermesAgent

agent = HermesAgent()
result = await agent.run("""
Analyze the code quality using this XML format:
<code_analysis>
  <scratchpad>(your intermediate analysis)</scratchpad>
  <overall_score>(0-100)</overall_score>
  <issues>(list of problems found)</issues>
  <recommendations>(improvement suggestions)</recommendations>
</code_analysis>
""")

24.4 Mermaid Diagram Generation Examples

Hermes can generate Mermaid-format diagrams within XML structured output, suitable for architecture diagrams, flowcharts, sequence diagrams, and more.

System Architecture Diagram

<architecture_analysis>
  <summary>This is a standard three-tier web architecture</summary>

  <diagram type="mermaid">
```mermaid
graph TB
    subgraph "Frontend Layer"
        A[React SPA] --> B[Nginx]
        B --> C[CDN]
    end

    subgraph "Application Layer"
        D[API Gateway]
        E[Auth Service]
        F[Business Service]
        G[File Service]
        D --> E
        D --> F
        D --> G
    end

    subgraph "Data Layer"
        H[(PostgreSQL)]
        I[(Redis Cache)]
        J[(S3 Storage)]
        F --> H
        F --> I
        G --> J
    end

    B --> D

API Gateway is a single point of failure; recommend multi-instance deployment PostgreSQL lacks read/write separation; potential bottleneck under high concurrency ```

Agent State Machine Diagram

<state_diagram>
  <diagram type="mermaid">
```mermaid
stateDiagram-v2
    [*] --> Idle: Initialization complete
    Idle --> Thinking: User request received
    Thinking --> Planning: Task analyzed
    Planning --> Executing: Plan formulated
    Executing --> ToolCalling: Tool needed
    ToolCalling --> Executing: Tool complete
    Executing --> Responding: All steps complete
    Executing --> ErrorHandling: Tool failure
    ErrorHandling --> Planning: Re-planning
    ErrorHandling --> Responding: Unrecoverable
    Responding --> Idle: Reply sent

```

24.5 Parsing Hermes XML Output in Python

Complete XML Parser

import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Any
import re

@dataclass
class HermesXMLOutput:
    """Parsed Hermes XML output structure"""
    raw: str
    scratchpad: str | None = None
    thinking: str | None = None
    result: Any = None
    metadata: dict = field(default_factory=dict)
    custom_tags: dict = field(default_factory=dict)

class HermesXMLParser:
    """Hermes XML output parser"""
    
    KNOWN_TAGS = {"scratchpad", "thinking", "result", "metadata"}
    
    def parse(self, response: str) -> HermesXMLOutput:
        """
        Parse a Hermes XML-format response.
        
        Supports:
        - XML with or without an outer root tag
        - XML mixed with natural language (auto-extracts XML portion)
        - XML containing code blocks (prevents code from breaking the parser)
        """
        xml_content = self._extract_xml(response)
        
        if not xml_content:
            return HermesXMLOutput(raw=response, result=response)
        
        if not xml_content.strip().startswith("<response>"):
            xml_content = f"<response>{xml_content}</response>"
        
        try:
            root = ET.fromstring(xml_content)
        except ET.ParseError as e:
            return self._fallback_parse(response, str(e))
        
        output = HermesXMLOutput(raw=response)
        
        for child in root:
            tag = child.tag
            text = self._get_text_content(child)
            
            if tag == "scratchpad":
                output.scratchpad = text
            elif tag == "thinking":
                output.thinking = text
            elif tag == "result":
                output.result = self._parse_result(child)
            elif tag == "metadata":
                output.metadata = self._parse_metadata(child)
            else:
                output.custom_tags[tag] = self._parse_element(child)
        
        return output
    
    def _extract_xml(self, text: str) -> str | None:
        xml_start = re.search(r'<(?!--)\w+[\s>]', text)
        if not xml_start:
            return None
        return text[xml_start.start():]
    
    def _get_text_content(self, element: ET.Element) -> str:
        parts = []
        if element.text:
            parts.append(element.text.strip())
        for child in element:
            parts.append(ET.tostring(child, encoding="unicode"))
            if child.tail:
                parts.append(child.tail.strip())
        return "\n".join(filter(None, parts))
    
    def _parse_result(self, element: ET.Element) -> dict | str:
        children = list(element)
        if not children:
            return element.text or ""
        result = {}
        for child in children:
            result[child.tag] = self._parse_element(child)
        return result
    
    def _parse_element(self, element: ET.Element) -> Any:
        children = list(element)
        if not children:
            return element.text or ""
        result = {}
        for child in children:
            result[child.tag] = self._parse_element(child)
        if element.attrib:
            result["_attributes"] = element.attrib
        return result
    
    def _parse_metadata(self, element: ET.Element) -> dict:
        metadata = {}
        for child in element:
            value = child.text or ""
            try:
                value = int(value)
            except ValueError:
                try:
                    value = float(value)
                except ValueError:
                    pass
            metadata[child.tag] = value
        return metadata
    
    def _fallback_parse(self, text: str, error: str) -> HermesXMLOutput:
        scratchpad = self._regex_extract(text, "scratchpad")
        result = self._regex_extract(text, "result")
        return HermesXMLOutput(
            raw=text,
            scratchpad=scratchpad,
            result=result or text,
            metadata={"parse_error": error, "parse_method": "regex_fallback"},
        )
    
    def _regex_extract(self, text: str, tag: str) -> str | None:
        pattern = rf'<{tag}[^>]*>(.*?)</{tag}>'
        match = re.search(pattern, text, re.DOTALL)
        return match.group(1).strip() if match else None

Mermaid Diagram Extractor

class MermaidExtractor:
    """Extract Mermaid diagrams from Hermes XML output"""
    
    def extract_all(self, xml_output: HermesXMLOutput) -> list[dict]:
        all_content = xml_output.raw
        pattern = r'```mermaid\s*\n(.*?)```'
        matches = re.findall(pattern, all_content, re.DOTALL)
        
        diagrams = []
        for i, code in enumerate(matches):
            diagram_type = self._detect_type(code)
            diagrams.append({
                "index": i,
                "type": diagram_type,
                "code": code.strip(),
                "rendered_url": self._render_url(code),
            })
        return diagrams
    
    def _detect_type(self, code: str) -> str:
        first_line = code.strip().split("\n")[0].lower()
        type_map = {
            "graph": "flowchart",
            "sequencediagram": "sequence",
            "statediagram": "state",
            "classdiagram": "class",
            "erdiagram": "er",
            "gantt": "gantt",
        }
        for key, value in type_map.items():
            if key in first_line:
                return value
        return "unknown"
    
    def _render_url(self, code: str) -> str:
        import base64, json
        payload = json.dumps({"code": code, "mermaid": {"theme": "default"}})
        encoded = base64.urlsafe_b64encode(payload.encode()).decode()
        return f"https://mermaid.live/edit#{encoded}"

Complete Usage Example

async def main():
    from hermes import HermesAgent
    
    agent = HermesAgent()
    
    response = await agent.run("""
    Analyze the code quality of the following Python function:

    ```python
    def get_user(id):
        db = connect("localhost:5432/prod")
        result = db.execute(f"SELECT * FROM users WHERE id = {id}")
        return result
    ```

    Output using this XML format:
    - scratchpad: your analysis process
    - result/issues: list of problems (each with severity and description)
    - result/overall_score: composite score from 0-100
    - result/refactored_code: the refactored version
    """)
    
    parser = HermesXMLParser()
    parsed = parser.parse(response.content)
    
    print("=== Internal analysis (not shown to users) ===")
    print(parsed.scratchpad)
    
    print("\n=== Final Result ===")
    if isinstance(parsed.result, dict):
        score = parsed.result.get("overall_score", "N/A")
        issues = parsed.result.get("issues", {})
        print(f"Overall score: {score}")
        print(f"Issues found: {issues}")
    
    extractor = MermaidExtractor()
    diagrams = extractor.extract_all(parsed)
    for diagram in diagrams:
        print(f"\nDiagram type: {diagram['type']}")
        print(f"Mermaid Live URL: {diagram['rendered_url']}")

import asyncio
asyncio.run(main())

24.6 Summary

This chapter systematically covered the Hermes XML structured output system:

Four core tags: scratchpad (staging) / thinking (reasoning) / result (output) / metadata (meta-info)
Scratchpad value: Reduces working memory pressure, supports multi-candidate exploration, improves structured output accuracy
vs JSON Mode: XML supports mixed content, comments, and streaming parsing — better suited for complex Agent output scenarios
Mermaid diagrams: Embed architecture diagrams, sequence diagrams, and state machines within XML — "documentation as code"
Python parsing: HermesXMLParser with graceful degradation; MermaidExtractor for renderable diagram extraction

Review Questions

Hermes's scratchpad is "written" during reasoning but not shown to users. This means scratchpad content still consumes tokens. How would you design a "lightweight scratchpad" scheme that preserves reasoning value while minimizing token consumption?
When generating Mermaid diagrams, the model may make syntax errors (e.g., invalid node ID formats). How would you design a "Mermaid syntax validation + auto-correction" post-processing pipeline?
XML structured output in streaming scenarios faces the "incomplete tags" problem (users see raw XML tokens arriving one by one). How would you design a streaming XML frontend rendering strategy that lets users see meaningful content before the output is complete?

Rate this chapter

4.6 / 5 (8 ratings)