Chapter 21

ChatML Format and Special Token Design

Chapter 21: ChatML Format and Special Token Design

Hermes Agent's language understanding capability stems in large part from its carefully designed token system. ChatML (Chat Markup Language) is not merely a message format — it is the core mechanism by which Hermes distinguishes "who is speaking" and "what is happening right now." Understanding ChatML format and the semantics of special tokens is essential knowledge for hand-crafting high-quality prompts and debugging Agent behavior.

21.1 Complete ChatML Format Specification

ChatML was originally proposed by OpenAI and has since been widely adopted as a conversation markup language. Hermes extends standard ChatML with Agent-specific roles and tags.

21.1.1 Standard ChatML Structure

<|im_start|>system
{system prompt content}
<|im_end|>
<|im_start|>user
{user message content}
<|im_end|>
<|im_start|>assistant
{assistant reply content}
<|im_end|>

21.1.2 Hermes Extended Format

<|im_start|>system
{system prompt}
<|im_end|>
<|im_start|>user
{user message}
<|im_end|>
<|im_start|>assistant
[inner_monologue]
{internal reasoning process}
[/inner_monologue]
[tool_call]
{"name": "tool_name", "arguments": {...}}
[/tool_call]
<|im_end|>
<|im_start|>tool
[tool_response]
{tool execution result}
[/tool_response]
<|im_end|>
<|im_start|>assistant
{final user-visible reply}
<|im_end|>

21.2 The Role of im_start / im_end Tokens

<|im_start|> and <|im_end|> are the most fundamental control tokens in ChatML. They are assigned dedicated token IDs in the vocabulary and cannot be confused with ordinary text.

21.2.1 Token ID Assignment

Token	Typical ID (Llama architecture)	Purpose
`<	im_start	>`
`<	im_end	>`
`<	endoftext	>`

21.2.2 Tokenizer Behavior

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Hermes-2-Pro-Mistral-7B")

# Verify special token IDs
print(tokenizer.convert_tokens_to_ids("<|im_start|>"))  # Output: 32001
print(tokenizer.convert_tokens_to_ids("<|im_end|>"))    # Output: 32002

# Tokenize a complete ChatML conversation
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hello! How can I help you?"},
]

formatted = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(formatted)
# Output:
# <|im_start|>system
# You are a helpful assistant<|im_end|>
# <|im_start|>user
# Hello<|im_end|>
# <|im_start|>assistant
# Hello! How can I help you?<|im_end|>
# <|im_start|>assistant
# (model continues generating here)

21.2.3 Attention Mask Implications

During training, tokens following <|im_end|> have their attention masked (set to zero). This ensures the model doesn't erroneously attend across message boundaries:

<|im_start|> user \n Hello <|im_end|> <|im_start|> assistant \n
    [1]      [1]  [1]  [1]    [0]         [1]          [1]    [1]
                                ↑
                        attention masked to 0 after im_end

21.3 system / user / assistant Role Markers

21.3.1 system Role

The system message is the "constitution" of the entire conversation, defining the Agent's identity, capability boundaries, and behavioral guidelines:

<|im_start|>system
You are Hermes, an autonomous AI Agent developed by NousResearch.
You have the following capabilities:
- Web search and information retrieval
- Code writing, execution, and debugging
- File read/write operations
- Completing complex tasks through tool calls

## Available Tools
{tool_list_json}

## Output Format
Use [tool_call] tags to invoke tools...
<|im_end|>

21.3.2 user Role

The user role represents human input. In multimodal scenarios, user messages can include text, images, or file references:

<|im_start|>user
Please analyze the data trends in this image
[IMAGE: base64_encoded_image_data]
<|im_end|>

21.3.3 assistant Role

The assistant role contains model-generated content including:

Internal monologue (within [inner_monologue] tags)
Tool calls (within [tool_call] tags)
Final user-visible reply (plain text outside tags)

21.4 Hermes Extended Tokens Reference

Hermes introduces 6 extended roles/tags beyond standard ChatML to support Agent workflows:

Tag / Role	Purpose	User-Visible?
`[inner_monologue]`	Internal reasoning; model thinking process	No (typically filtered)
`[tool_call]`	Declare the tool and parameters to invoke	No (intercepted by system)
`[tool_response]`	Tool execution result	No (injected into context)
`[step]`	Each step in multi-step reasoning	No
`[scratchpad]`	Intermediate computation staging area	No
`tool` role	Tool result message block	No

Complete Tool Call Sequence Example

<|im_start|>assistant
[inner_monologue]
The user wants to find Python asyncio tutorials. I should search the web for current resources.
[/inner_monologue]
[tool_call]
{"name": "web_search", "arguments": {"query": "Python asyncio best practices 2024", "max_results": 5}}
[/tool_call]
<|im_end|>
<|im_start|>tool
[tool_response]
{
  "status": "success",
  "results": [
    {"title": "AsyncIO in Python: A Complete Walkthrough", "url": "https://realpython.com/async-io-python/"},
    {"title": "Python asyncio official docs", "url": "https://docs.python.org/3/library/asyncio.html"}
  ]
}
[/tool_response]
<|im_end|>
<|im_start|>assistant
Based on the search results, here are the top Python asyncio learning resources:

1. **Real Python Complete Tutorial** — https://realpython.com/async-io-python/
2. **Python Official Docs** — https://docs.python.org/3/library/asyncio.html
<|im_end|>

21.5 Differences from OpenAI Format

Hermes ChatML differs from the OpenAI API format (as used with GPT-4) in several key ways:

Comparison Table

Dimension	OpenAI API Format	Hermes ChatML
Message carrier	JSON object array	Text sequence (token stream)
Role marking	`role` field value	`<
Message delimiter	JSON structure (implicit)	`<
Tool calls	`tool_calls` array field	`[tool_call]` tag
Tool response role	`role: "tool"` + `tool_call_id`	`<
Internal monologue	No standard support	`[inner_monologue]` tag
Streaming output	SSE JSON chunks	Token stream
Context format	API request body JSON	Raw text token sequence

OpenAI Format (JSON)

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Search for Python asyncio"},
    {
      "role": "assistant",
      "content": null,
      "tool_calls": [
        {
          "id": "call_abc123",
          "type": "function",
          "function": {
            "name": "web_search",
            "arguments": "{\"query\": \"Python asyncio\"}"
          }
        }
      ]
    },
    {
      "role": "tool",
      "tool_call_id": "call_abc123",
      "content": "{\"results\": [...]}"
    }
  ]
}

Hermes ChatML (Raw Token Sequence)

<|im_start|>system
You are a helpful assistant
<|im_end|>
<|im_start|>user
Search for Python asyncio
<|im_end|>
<|im_start|>assistant
[tool_call]
{"name": "web_search", "arguments": {"query": "Python asyncio"}}
[/tool_call]
<|im_end|>
<|im_start|>tool
[tool_response]
{"results": [...]}
[/tool_response]
<|im_end|>
<|im_start|>assistant

Key difference: Hermes ChatML tool calls don't need a tool_call_id for tracking (because the sequential token stream with context position uniquely identifies each call), whereas OpenAI's JSON format uses IDs to correlate requests and responses.

21.6 Hand-Crafting Effective Prompts

Manually constructing Hermes prompts is valuable for debugging, fine-tuning data preparation, and integration testing.

Base Prompt Construction Function

from transformers import AutoTokenizer
import json

def build_hermes_prompt(
    system: str,
    messages: list[dict],
    tools: list[dict] | None = None,
    add_generation_prompt: bool = True,
) -> str:
    """
    Manually construct a Hermes ChatML format prompt
    
    Args:
        system: System prompt content
        messages: Message list [{"role": "user/assistant/tool", "content": "..."}]
        tools: Tool definition list (JSON Schema format)
        add_generation_prompt: Whether to append the assistant start marker
    """
    parts = []
    
    sys_content = system
    if tools:
        sys_content += f"\n\n## Available Tools\n```json\n{json.dumps(tools, indent=2)}\n```"
    
    parts.append(f"<|im_start|>system\n{sys_content}\n<|im_end|>")
    
    for msg in messages:
        role = msg["role"]
        content = msg["content"]
        
        if role == "tool":
            parts.append(f"<|im_start|>tool\n[tool_response]\n{content}\n[/tool_response]\n<|im_end|>")
        else:
            parts.append(f"<|im_start|>{role}\n{content}\n<|im_end|>")
    
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    
    return "\n".join(parts)


# Usage example
prompt = build_hermes_prompt(
    system="You are a professional coding assistant",
    messages=[
        {"role": "user", "content": "Write a quicksort algorithm for me"},
    ],
    tools=[
        {
            "name": "code_execute",
            "description": "Execute a Python code snippet",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {"type": "string"}
                }
            }
        }
    ]
)

Tokenizer-Level Validation

def validate_prompt_tokens(prompt: str, model_name: str) -> dict:
    """Validate prompt token count and special token structure"""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokens = tokenizer.encode(prompt)
    
    im_start_id = tokenizer.convert_tokens_to_ids("<|im_start|>")
    im_end_id = tokenizer.convert_tokens_to_ids("<|im_end|>")
    
    im_starts = tokens.count(im_start_id)
    im_ends = tokens.count(im_end_id)
    
    return {
        "total_tokens": len(tokens),
        "im_start_count": im_starts,
        "im_end_count": im_ends,
        # balanced=False is expected when add_generation_prompt=True
        # (last assistant block not yet closed)
        "balanced": im_starts == im_ends,
        "estimated_cost_usd": len(tokens) / 1000 * 0.01,
    }

21.7 Summary

This chapter provided a deep analysis of Hermes's ChatML format and special token system:

ChatML fundamentals: <|im_start|> / <|im_end|> are dedicated tokens immune to text interference
Three base roles: system (constitution) / user (input) / assistant (output)
Six extension tags: inner_monologue, tool_call, tool_response, step, scratchpad + tool role
Differences from OpenAI: ChatML is a token stream rather than a JSON structure; tool calls need no ID tracking
Hand-crafting prompts: Understanding format details is foundational for debugging, fine-tuning, and integration

Review Questions

<|im_start|> and <|im_end|> are designed as dedicated tokens that cannot be produced by ordinary text. However, if a user inputs the string <|im_start|> in a message, how does the tokenizer handle it? Does this create a security vulnerability?
In production deployments, Hermes's [inner_monologue] is typically filtered and not shown to users. If users were allowed to see the inner monologue, what impact would this have on user trust and Agent security?
Compared to OpenAI's JSON format, what challenges does ChatML's token stream format face when handling parallel multi-tool calls? How would you implement parallel tool calling while preserving the sequential nature of the token stream?

Rate this chapter

4.7 / 5 (12 ratings)