Chapter 21: ChatML Format and Special Token Design
Hermes Agent's language understanding capability stems in large part from its carefully designed token system. ChatML (Chat Markup Language) is not merely a message format — it is the core mechanism by which Hermes distinguishes "who is speaking" and "what is happening right now." Understanding ChatML format and the semantics of special tokens is essential knowledge for hand-crafting high-quality prompts and debugging Agent behavior.
21.1 Complete ChatML Format Specification
ChatML was originally proposed by OpenAI and has since been widely adopted as a conversation markup language. Hermes extends standard ChatML with Agent-specific roles and tags.
21.1.1 Standard ChatML Structure
<|im_start|>system
{system prompt content}
<|im_end|>
<|im_start|>user
{user message content}
<|im_end|>
<|im_start|>assistant
{assistant reply content}
<|im_end|>
21.1.2 Hermes Extended Format
<|im_start|>system
{system prompt}
<|im_end|>
<|im_start|>user
{user message}
<|im_end|>
<|im_start|>assistant
[inner_monologue]
{internal reasoning process}
[/inner_monologue]
[tool_call]
{"name": "tool_name", "arguments": {...}}
[/tool_call]
<|im_end|>
<|im_start|>tool
[tool_response]
{tool execution result}
[/tool_response]
<|im_end|>
<|im_start|>assistant
{final user-visible reply}
<|im_end|>
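To make the block structure concrete, here is a minimal sketch that splits a ChatML transcript back into role/content pairs. The function name `parse_chatml` and the regular expression are illustrative assumptions, not part of any Hermes tooling; the sketch only recognizes blocks that are fully closed with <|im_end|>.

```python
import re

# One complete block: <|im_start|>{role}\n{content}<|im_end|>
BLOCK_RE = re.compile(r"<\|im_start\|>(\w+)\n(.*?)<\|im_end\|>", re.DOTALL)


def parse_chatml(text: str) -> list[dict]:
    """Split a ChatML transcript into {"role", "content"} dicts (hypothetical helper)."""
    return [
        {"role": role, "content": content.strip()}
        for role, content in BLOCK_RE.findall(text)
    ]


transcript = (
    "<|im_start|>system\nYou are a helpful assistant\n<|im_end|>\n"
    "<|im_start|>user\nHello\n<|im_end|>\n"
    "<|im_start|>assistant\n"  # open generation prompt, not yet closed
)
messages = parse_chatml(transcript)
for message in messages:
    print(message["role"], "->", message["content"])
```

The trailing assistant marker is skipped because it has no closing <|im_end|> yet, which matches the state of a prompt that is waiting for the model to generate.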
21.2 The Role of im_start / im_end Tokens
<|im_start|> and <|im_end|> are the most fundamental control tokens in ChatML. They are assigned dedicated token IDs in the vocabulary and cannot be confused with ordinary text.
21.2.1 Token ID Assignment
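| Token | Typical ID (Llama architecture) | Purpose |
|---|---|---|
| `<\|im_start\|>` | Checkpoint-specific; query the tokenizer | Marks the start of a message block |
| `<\|im_end\|>` | Checkpoint-specific; query the tokenizer | Marks the end of a message block |
| `<\|endoftext\|>` | Checkpoint-specific; query the tokenizer | Marks the end of a sequence |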
21.2.2 Tokenizer Behavior
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("NousResearch/Hermes-2-Pro-Mistral-7B")
# Look up special token IDs (exact values depend on the checkpoint's vocabulary)
print(tokenizer.convert_tokens_to_ids("<|im_start|>"))
print(tokenizer.convert_tokens_to_ids("<|im_end|>"))
# Tokenize a complete ChatML conversation
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hello! How can I help you?"},
]
formatted = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(formatted)
# Output:
# <|im_start|>system
# You are a helpful assistant<|im_end|>
# <|im_start|>user
# Hello<|im_end|>
# <|im_start|>assistant
# Hello! How can I help you?<|im_end|>
# <|im_start|>assistant
# (model continues generating here)
21.2.3 Attention Mask Implications
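Message boundaries do not block attention: a causal model has to attend across <|im_end|> to read earlier turns, so the attention mask stays 1 for every real token and is 0 only for padding. What common fine-tuning setups mask per message is the loss, so that only assistant tokens are trained on (exact choices vary by training framework):
tokens:    <|im_start|> user \n Hello <|im_end|> <|im_start|> assistant \n Hi <|im_end|> [PAD]
attention: 1            1    1  1     1          1            1         1  1  1          0
loss:      0            0    0  0     0          0            0         0  1  1          0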
21.3 system / user / assistant Role Markers
21.3.1 system Role
The system message is the "constitution" of the entire conversation, defining the Agent's identity, capability boundaries, and behavioral guidelines:
<|im_start|>system
You are Hermes, an autonomous AI Agent developed by NousResearch.
You have the following capabilities:
- Web search and information retrieval
- Code writing, execution, and debugging
- File read/write operations
- Completing complex tasks through tool calls
## Available Tools
{tool_list_json}
## Output Format
Use [tool_call] tags to invoke tools...
<|im_end|>
21.3.2 user Role
The user role represents human input. In multimodal scenarios, user messages can include text, images, or file references:
<|im_start|>user
Please analyze the data trends in this image
[IMAGE: base64_encoded_image_data]
<|im_end|>
21.3.3 assistant Role
The assistant role contains model-generated content including:
- Internal monologue (within [inner_monologue] tags)
- Tool calls (within [tool_call] tags)
- Final user-visible reply (plain text outside tags)
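The three kinds of assistant content listed above can be separated with a couple of regular expressions. This is a minimal sketch under the assumption that tags are well formed and never nested; `split_assistant_content` and the `get_weather` tool name are hypothetical.

```python
import json
import re

MONOLOGUE_RE = re.compile(r"\[inner_monologue\](.*?)\[/inner_monologue\]", re.DOTALL)
TOOL_CALL_RE = re.compile(r"\[tool_call\](.*?)\[/tool_call\]", re.DOTALL)


def split_assistant_content(content: str) -> dict:
    """Separate monologue, tool calls, and the user-visible reply (hypothetical helper)."""
    monologue = [m.strip() for m in MONOLOGUE_RE.findall(content)]
    # Each [tool_call] body is expected to be a single JSON object.
    tool_calls = [json.loads(raw) for raw in TOOL_CALL_RE.findall(content)]
    # Whatever remains outside the tags is the user-visible reply.
    visible = TOOL_CALL_RE.sub("", MONOLOGUE_RE.sub("", content)).strip()
    return {"monologue": monologue, "tool_calls": tool_calls, "visible": visible}


sample = (
    "[inner_monologue]\nThe user wants the weather.\n[/inner_monologue]\n"
    '[tool_call]\n{"name": "get_weather", "arguments": {"city": "Paris"}}\n[/tool_call]'
)
parts = split_assistant_content(sample)
print(parts["tool_calls"])
```

A production filter would also need to handle malformed JSON and unclosed tags, which this sketch deliberately leaves out.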
21.4 Hermes Extended Tokens Reference
Hermes introduces 6 extended roles/tags beyond standard ChatML to support Agent workflows:
| Tag / Role | Purpose | User-Visible? |
|---|---|---|
| [inner_monologue] | Internal reasoning; model thinking process | No (typically filtered) |
| [tool_call] | Declare the tool and parameters to invoke | No (intercepted by system) |
| [tool_response] | Tool execution result | No (injected into context) |
| [step] | Each step in multi-step reasoning | No |
| [scratchpad] | Intermediate computation staging area | No |
| tool role | Tool result message block | No |
Complete Tool Call Sequence Example
<|im_start|>assistant
[inner_monologue]
The user wants to find Python asyncio tutorials. I should search the web for current resources.
[/inner_monologue]
[tool_call]
{"name": "web_search", "arguments": {"query": "Python asyncio best practices 2024", "max_results": 5}}
[/tool_call]
<|im_end|>
<|im_start|>tool
[tool_response]
{
"status": "success",
"results": [
{"title": "AsyncIO in Python: A Complete Walkthrough", "url": "https://realpython.com/async-io-python/"},
{"title": "Python asyncio official docs", "url": "https://docs.python.org/3/library/asyncio.html"}
]
}
[/tool_response]
<|im_end|>
<|im_start|>assistant
Based on the search results, here are the top Python asyncio learning resources:
1. **Real Python Complete Tutorial** — https://realpython.com/async-io-python/
2. **Python Official Docs** — https://docs.python.org/3/library/asyncio.html
<|im_end|>
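The middle of the sequence above, where the runtime intercepts a tool call and injects the result as a tool block, might look roughly like the following. This is a sketch under stated assumptions: the `TOOLS` registry, `run_tool_calls`, and the canned `web_search` stand-in are hypothetical, and a real agent loop would add validation, timeouts, and error handling.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"\[tool_call\](.*?)\[/tool_call\]", re.DOTALL)


def web_search(query: str, max_results: int = 5) -> dict:
    """Hypothetical stand-in tool: returns canned data instead of searching."""
    results = [{"title": f"Result for {query}"}]
    return {"status": "success", "results": results[:max_results]}


# Hypothetical registry mapping tool names to Python callables.
TOOLS = {"web_search": web_search}


def run_tool_calls(assistant_content: str) -> list[str]:
    """Execute each [tool_call] and wrap its result as a tool message block."""
    blocks = []
    for raw in TOOL_CALL_RE.findall(assistant_content):
        call = json.loads(raw)
        func = TOOLS.get(call["name"])
        if func is None:
            result = {"status": "error", "message": f"unknown tool: {call['name']}"}
        else:
            result = func(**call.get("arguments", {}))
        blocks.append(
            "<|im_start|>tool\n[tool_response]\n"
            f"{json.dumps(result)}\n"
            "[/tool_response]\n<|im_end|>"
        )
    return blocks


assistant_turn = (
    '[tool_call]\n{"name": "web_search", "arguments": '
    '{"query": "asyncio", "max_results": 5}}\n[/tool_call]'
)
tool_blocks = run_tool_calls(assistant_turn)
print(tool_blocks[0])
```

The returned blocks are appended to the context before the next assistant turn is generated, which is how the tool result becomes visible to the model.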
21.5 Differences from OpenAI Format
Hermes ChatML differs from the OpenAI API format (as used with GPT-4) in several key ways:
Comparison Table
| Dimension | OpenAI API Format | Hermes ChatML |
|---|---|---|
| Message carrier | JSON object array | Text sequence (token stream) |
| Role marking | role field value | `<\|im_start\|>` followed by the role name |
| Message delimiter | JSON structure (implicit) | `<\|im_start\|>` / `<\|im_end\|>` tokens |
| Tool calls | tool_calls array field | [tool_call] tag |
| Tool response role | role: "tool" + tool_call_id | `<\|im_start\|>tool` block with [tool_response] tag |
| Internal monologue | No standard support | [inner_monologue] tag |
| Streaming output | SSE JSON chunks | Token stream |
| Context format | API request body JSON | Raw text token sequence |
OpenAI Format (JSON)
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Search for Python asyncio"},
    {
      "role": "assistant",
      "content": null,
      "tool_calls": [
        {
          "id": "call_abc123",
          "type": "function",
          "function": {
            "name": "web_search",
            "arguments": "{\"query\": \"Python asyncio\"}"
          }
        }
      ]
    },
    {
      "role": "tool",
      "tool_call_id": "call_abc123",
      "content": "{\"results\": [...]}"
    }
  ]
}
Hermes ChatML (Raw Token Sequence)
<|im_start|>system
You are a helpful assistant
<|im_end|>
<|im_start|>user
Search for Python asyncio
<|im_end|>
<|im_start|>assistant
[tool_call]
{"name": "web_search", "arguments": {"query": "Python asyncio"}}
[/tool_call]
<|im_end|>
<|im_start|>tool
[tool_response]
{"results": [...]}
[/tool_response]
<|im_end|>
<|im_start|>assistant
Key difference: Hermes ChatML tool calls carry no tool_call_id. Each [tool_response] is matched to the [tool_call] that precedes it by its position in the token stream, whereas OpenAI's JSON format uses explicit IDs to correlate requests and responses.
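The mapping between the two formats can be sketched as a small converter. This is an illustration under assumptions, not an official adapter: `openai_to_hermes` is a hypothetical name, the tag layout follows the examples in this chapter, and the tool_call_id is simply dropped, so pairing relies on message order.

```python
import json


def openai_to_hermes(messages: list[dict]) -> str:
    """Render OpenAI-style message dicts as Hermes ChatML text (hypothetical helper)."""
    parts = []
    for msg in messages:
        role = msg["role"]
        if role == "assistant" and msg.get("tool_calls"):
            body_lines = []
            if msg.get("content"):
                body_lines.append(msg["content"])
            for call in msg["tool_calls"]:
                fn = call["function"]
                # OpenAI encodes arguments as a JSON string; decode to an object here.
                payload = {"name": fn["name"], "arguments": json.loads(fn["arguments"])}
                body_lines.append(f"[tool_call]\n{json.dumps(payload)}\n[/tool_call]")
            body = "\n".join(body_lines)
        elif role == "tool":
            # The tool_call_id is dropped: order in the stream does the pairing.
            body = f"[tool_response]\n{msg['content']}\n[/tool_response]"
        else:
            body = msg["content"]
        parts.append(f"<|im_start|>{role}\n{body}\n<|im_end|>")
    return "\n".join(parts)


conversation = [
    {"role": "user", "content": "Search for Python asyncio"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "id": "call_abc123",
                "type": "function",
                "function": {
                    "name": "web_search",
                    "arguments": "{\"query\": \"Python asyncio\"}",
                },
            }
        ],
    },
    {"role": "tool", "tool_call_id": "call_abc123", "content": "{\"results\": []}"},
]
chatml_text = openai_to_hermes(conversation)
print(chatml_text)
```

Dropping the ID is safe only while each response immediately follows its call; with several calls in one assistant turn, responses must be emitted in the same order as the calls.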
21.6 Hand-Crafting Effective Prompts
Manually constructing Hermes prompts is valuable for debugging, fine-tuning data preparation, and integration testing.
Base Prompt Construction Function
from transformers import AutoTokenizer
import json

def build_hermes_prompt(
    system: str,
    messages: list[dict],
    tools: list[dict] | None = None,
    add_generation_prompt: bool = True,
) -> str:
    """
    Manually construct a Hermes ChatML format prompt

    Args:
        system: System prompt content
        messages: Message list [{"role": "user/assistant/tool", "content": "..."}]
        tools: Tool definition list (JSON Schema format)
        add_generation_prompt: Whether to append the assistant start marker
    """
    parts = []
    sys_content = system
    if tools:
        sys_content += f"\n\n## Available Tools\n```json\n{json.dumps(tools, indent=2)}\n```"
    parts.append(f"<|im_start|>system\n{sys_content}\n<|im_end|>")
    for msg in messages:
        role = msg["role"]
        content = msg["content"]
        if role == "tool":
            parts.append(f"<|im_start|>tool\n[tool_response]\n{content}\n[/tool_response]\n<|im_end|>")
        else:
            parts.append(f"<|im_start|>{role}\n{content}\n<|im_end|>")
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

# Usage example
prompt = build_hermes_prompt(
    system="You are a professional coding assistant",
    messages=[
        {"role": "user", "content": "Write a quicksort algorithm for me"},
    ],
    tools=[
        {
            "name": "code_execute",
            "description": "Execute a Python code snippet",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {"type": "string"}
                }
            }
        }
    ]
)
Tokenizer-Level Validation
def validate_prompt_tokens(prompt: str, model_name: str) -> dict:
    """Validate prompt token count and special token structure"""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokens = tokenizer.encode(prompt)
    im_start_id = tokenizer.convert_tokens_to_ids("<|im_start|>")
    im_end_id = tokenizer.convert_tokens_to_ids("<|im_end|>")
    im_starts = tokens.count(im_start_id)
    im_ends = tokens.count(im_end_id)
    return {
        "total_tokens": len(tokens),
        "im_start_count": im_starts,
        "im_end_count": im_ends,
        # balanced=False is expected when add_generation_prompt=True
        # (last assistant block not yet closed)
        "balanced": im_starts == im_ends,
        # Placeholder rate of $0.01 per 1K tokens; substitute your own pricing
        "estimated_cost_usd": len(tokens) / 1000 * 0.01,
    }
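Counting markers shows whether they balance, but not whether they appear in a sensible order. As a complement, here is a string-level sketch that walks the markers and reports ordering problems without loading a tokenizer. The helper name `check_structure` and the set of accepted roles are assumptions drawn from the roles used in this chapter.

```python
import re

# Matches either an opening marker (capturing the role) or a closing marker.
MARKER_RE = re.compile(r"<\|im_start\|>(\w*)|<\|im_end\|>")
KNOWN_ROLES = {"system", "user", "assistant", "tool"}


def check_structure(prompt: str) -> list[str]:
    """Return a list of structural problems; an empty list means none were found."""
    problems = []
    open_role = None  # role of the block currently open, if any
    for match in MARKER_RE.finditer(prompt):
        if match.group(0).startswith("<|im_start|>"):
            role = match.group(1)
            if open_role is not None:
                problems.append(f"'{role}' block opened before '{open_role}' was closed")
            if role not in KNOWN_ROLES:
                problems.append(f"unknown role: '{role}'")
            open_role = role
        else:
            if open_role is None:
                problems.append("<|im_end|> without a matching <|im_start|>")
            open_role = None
    # A trailing open assistant block is the generation prompt, so it is allowed.
    if open_role not in (None, "assistant"):
        problems.append(f"'{open_role}' block left open at end of prompt")
    return problems


good_prompt = "<|im_start|>user\nHello\n<|im_end|>\n<|im_start|>assistant\n"
bad_prompt = "<|im_start|>user\nHello\n<|im_start|>assistant\nHi\n<|im_end|>"
print(check_structure(good_prompt))
print(check_structure(bad_prompt))
```

Because this check works on the raw string, it also flags marker text that arrived inside user content, which is worth knowing before the prompt reaches the tokenizer.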
21.7 Summary
This chapter provided a deep analysis of Hermes's ChatML format and special token system:
- ChatML fundamentals: <|im_start|>/<|im_end|> are dedicated tokens immune to text interference
- Three base roles: system (constitution) / user (input) / assistant (output)
- Six extension tags: inner_monologue, tool_call, tool_response, step, scratchpad + tool role
- Differences from OpenAI: ChatML is a token stream rather than a JSON structure; tool calls need no ID tracking
- Hand-crafting prompts: Understanding format details is foundational for debugging, fine-tuning, and integration
Review Questions
- <|im_start|> and <|im_end|> are designed as dedicated tokens that cannot be produced by ordinary text. However, if a user inputs the string <|im_start|> in a message, how does the tokenizer handle it? Does this create a security vulnerability?
- In production deployments, Hermes's [inner_monologue] is typically filtered and not shown to users. If users were allowed to see the inner monologue, what impact would this have on user trust and Agent security?
- Compared to OpenAI's JSON format, what challenges does ChatML's token stream format face when handling parallel multi-tool calls? How would you implement parallel tool calling while preserving the sequential nature of the token stream?