Parameter Scale Selection: 3B/8B/70B/405B Use-Case Matrix
Chapter 27: Choosing the Right Scale: The 3B/8B/70B/405B Selection Matrix
Choosing the right parameter scale for Hermes Agent is like choosing the right tool for a job: use a screwdriver for screws, a drill for walls. The wrong choice isn't just inefficient — it can make a project fail at its foundation. This chapter provides a data-driven selection framework across three dimensions: Function Calling success rate, reasoning quality, and hardware requirements.
27.1 Capability Boundaries by Scale
Comprehensive Capability Scoring Matrix
Based on extensive community benchmarking, Hermes models at each scale perform as follows on core Agent tasks (out of 10):
| Capability | Hermes-3B | Hermes-8B | Hermes-70B | Hermes-405B |
|---|---|---|---|---|
| Function Calling success rate | 3.2/10 | 6.8/10 | 9.1/10 | 9.7/10 |
| Multi-step reasoning quality | 2.8/10 | 6.4/10 | 9.0/10 | 9.8/10 |
| Tool parameter accuracy | 3.5/10 | 7.2/10 | 9.3/10 | 9.9/10 |
| Instruction following | 5.1/10 | 7.6/10 | 9.2/10 | 9.8/10 |
| Error recovery | 2.1/10 | 5.8/10 | 8.7/10 | 9.5/10 |
| Usable context length | 4K tokens | 8K tokens | 32K tokens | 128K tokens |
| Inference speed (local Q4) | 45 tok/s | 28 tok/s | 8 tok/s | 1.2 tok/s |
| Overall Agent suitability | Low | Medium | High | Excellent |
Benchmark Results (500-task standardized Agent test suite)
| Scale | Tool Selection | Parameter Accuracy | Format Compliance | Multi-step Success |
|---|---|---|---|---|
| 3B | 52.3% | 41.7% | 78.2% | 18.6% |
| 8B | 79.4% | 73.8% | 96.1% | 61.3% |
| 70B | 94.7% | 91.2% | 99.3% | 88.9% |
| 405B | 97.8% | 96.4% | 99.8% | 94.2% |
Critical threshold: The data shows 8B is the absolute minimum for Agent tasks (multi-step success 61.3%), while 70B is the production-safe baseline (multi-step success 88.9%).
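The threshold reading above can be made mechanical. The following is a minimal sketch that copies the multi-step success column from the benchmark table and returns the smallest scale that meets a target; the function name and the idea of a single numeric target are assumptions made for illustration, not part of Hermes Agent.

```python
from typing import Optional

# Multi-step success rates (percent) copied from the benchmark table.
MULTI_STEP_SUCCESS = {"3B": 18.6, "8B": 61.3, "70B": 88.9, "405B": 94.2}

# Scales ordered from smallest to largest.
SCALE_ORDER = ["3B", "8B", "70B", "405B"]


def smallest_scale_meeting(target_percent: float) -> Optional[str]:
    """Return the smallest scale whose multi-step success meets the target."""
    for scale in SCALE_ORDER:
        if MULTI_STEP_SUCCESS[scale] >= target_percent:
            return scale
    return None  # no benchmarked scale reaches the target


if __name__ == "__main__":
    for target in (60, 85, 99):
        print(target, smallest_scale_meeting(target))
```

A target no benchmarked scale reaches returns `None`, which is a signal to revisit the task design rather than the model size.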
27.2 Hardware Requirements
Local Deployment Hardware Requirements (by Quantization)
| Model | Quantization | VRAM | System RAM | Recommended GPU | Speed |
|---|---|---|---|---|---|
| 3B | FP16 | 6 GB | 8 GB | RTX 3060 12GB | 60 tok/s |
| 3B | Q4_K_M | 2.3 GB | 4 GB | Apple M1/M2 | 45 tok/s |
| 8B | FP16 | 16 GB | 16 GB | RTX 3090 | 32 tok/s |
| 8B | Q4_K_M | 5.2 GB | 8 GB | RTX 3060 8GB | 28 tok/s |
| 8B | Q8_0 | 9.1 GB | 12 GB | RTX 3080 | 24 tok/s |
| 70B | FP16 | 140 GB | 160 GB | 2×A100 80GB | 12 tok/s |
| 70B | Q4_K_M | 42 GB | 64 GB | 2×RTX 3090 or A40 | 8 tok/s |
| 70B | Q8_0 | 75 GB | 96 GB | A100 80GB | 6 tok/s |
| 405B | Q4_K_M | 245 GB | 320 GB | 4×A100 80GB | 1.2 tok/s |
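The VRAM column follows roughly from parameter count times bits per weight. The sketch below shows that arithmetic. The FP16 figure is exact (16 bits), while the bits-per-weight values for Q8_0 and Q4_K_M are approximate averages assumed here and do not come from the chapter. It estimates weights only; the KV cache and runtime buffers come on top.

```python
# Approximate bits stored per weight. FP16 is exact; the two quantized values
# are rough averages assumed for this sketch.
BITS_PER_WEIGHT = {"FP16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.8}


def estimate_weights_gb(params_billions: float, quantization: str) -> float:
    """Estimate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    bits = BITS_PER_WEIGHT[quantization]
    # params_billions * 1e9 weights * (bits / 8) bytes, divided by 1e9 bytes per GB
    return params_billions * bits / 8


if __name__ == "__main__":
    for scale in (3, 8, 70, 405):
        for quant in BITS_PER_WEIGHT:
            print(f"{scale}B {quant}: ~{estimate_weights_gb(scale, quant):.1f} GB")
```

Under these assumptions the 70B rows land close to the table (140, about 74, and 42 GB). The small-model rows in the table are somewhat higher than this estimate, which suggests they include overhead beyond the raw weights.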
Hardware Scenario Quick Guide
MacBook Pro with M3 (16GB unified memory)
→ Hermes-8B Q4_K_M (5.2 GB, runs smoothly)
Gaming PC (RTX 4080 16GB VRAM)
→ Hermes-8B Q8_0 (9.1 GB, near lossless; FP16 weights alone need the full 16 GB and leave no room for the KV cache)
Workstation (2×RTX 3090 24GB)
→ Hermes-70B Q4_K_M (42 GB, dual-GPU required)
→ ~8 tok/s, production-viable
Single server (A100 80GB)
→ Hermes-70B Q4_K_M (42 GB, comfortable)
→ Hermes-70B Q8_0 (75 GB, near lossless)
Multi-GPU server (4×A100 80GB)
→ Hermes-405B Q4_K_M (245 GB)
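The scenario guide amounts to a lookup: take the largest configuration from the hardware table that fits in the available memory. A minimal sketch follows; the preference order (larger scale first, then higher precision) and the `headroom` parameter are assumptions made for illustration.

```python
from typing import Optional, Tuple

# (scale, quantization, VRAM in GB) rows copied from the hardware table,
# listed from most to least preferred: larger scale first, then higher precision.
CONFIGS = [
    ("405B", "Q4_K_M", 245.0),
    ("70B", "FP16", 140.0),
    ("70B", "Q8_0", 75.0),
    ("70B", "Q4_K_M", 42.0),
    ("8B", "FP16", 16.0),
    ("8B", "Q8_0", 9.1),
    ("8B", "Q4_K_M", 5.2),
    ("3B", "FP16", 6.0),
    ("3B", "Q4_K_M", 2.3),
]


def best_fit(memory_gb: float, headroom: float = 0.0) -> Optional[Tuple[str, str]]:
    """Return the most preferred (scale, quantization) that fits in memory.

    headroom is the fraction of memory reserved for the KV cache and runtime
    buffers (a hypothetical knob for this sketch).
    """
    usable = memory_gb * (1.0 - headroom)
    for scale, quant, vram in CONFIGS:
        if vram <= usable:
            return scale, quant
    return None  # nothing in the table fits


if __name__ == "__main__":
    print(best_fit(80))                 # single A100 80GB
    print(best_fit(16, headroom=0.2))   # 16 GB card with 20% reserved
```

Reserving some headroom matters most when a configuration's weights are about as large as the card itself.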
27.3 Why Sub-7B Models Fail at Tool Calling
Understanding why small models struggle with Function Calling is essential — it informs both selection and mitigation strategies.
Root Cause 1: Insufficient Instruction-Following Capacity
Function Calling requires the model to emit strictly valid JSON and also to judge correctly when to call a tool and when to answer directly.
Typical 3B failure modes:
- Describes the tool call in natural language instead of outputting JSON
- Outputs half-JSON, half-text
- Over-calls tools (treats every question as needing a tool)
- Ignores tools entirely (forgets it has them)
Example (3B model):
User: Please read the contents of /tmp/data.csv
3B output: I will use the read_file tool with path /tmp/data.csv,
and then return the file contents. [Content would appear here]
(Correct behavior: output tool call JSON, wait for result, then process it)
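These failure modes can be detected at the Agent layer before any tool runs. The sketch below labels a model reply, assuming the `<tool_call>{"name": ..., "arguments": ...}</tool_call>` envelope used in this chapter's system prompt; the function name and the four labels are hypothetical.

```python
import json
import re

# Assumes the <tool_call>{...}</tool_call> wrapper from this chapter's system
# prompt; other runtimes may use a different envelope.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def classify_output(text: str) -> str:
    """Label a model reply as 'tool_call', 'mixed', 'malformed', or 'text'."""
    match = TOOL_CALL_RE.search(text)
    if match is None:
        # An opening tag without a parsable body counts as a broken call.
        return "malformed" if "<tool_call>" in text else "text"
    try:
        call = json.loads(match.group(1))
    except json.JSONDecodeError:
        return "malformed"
    if not isinstance(call, dict) or "name" not in call or "arguments" not in call:
        return "malformed"
    # Anything outside the tags means the reply is half JSON, half prose.
    outside = (text[:match.start()] + text[match.end():]).strip()
    return "mixed" if outside else "tool_call"
```

A `text` label is not automatically a failure, since answering directly is sometimes correct. The caller still has to decide whether the request needed a tool, as the `/tmp/data.csv` example above clearly did.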
Root Cause 2: Weak Parameter Extraction
# Test: Parameter extraction accuracy
test_prompt = """
User: Read lines 50 to 100 from main.py using UTF-8 encoding
Expected tool call:
{
  "name": "read_file",
  "arguments": {
    "path": "main.py",
    "start_line": 50,
    "end_line": 100,
    "encoding": "utf-8"
  }
}
"""

# Common 3B errors (from actual testing):
fail_examples_3b = [
    # Type error: strings instead of integers (encoding is also dropped)
    {"path": "main.py", "start_line": "50", "end_line": "100"},
    # Missing parameters: line range and encoding are lost
    {"path": "main.py"},
    # Wrong parameter names
    {"filename": "main.py", "lines": "50-100"},
    # Case mismatch: "UTF-8" instead of the expected "utf-8"
    {"path": "main.py", "start_line": 50, "end_line": 100, "encoding": "UTF-8"},
]
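To score failures like these automatically, the produced arguments can be compared against the expected ones. This is a minimal sketch, assuming an exact-match policy (same names, same types, same values); the function name and the error labels are hypothetical.

```python
# Expected arguments for the read_file test case in this section.
EXPECTED = {"path": "main.py", "start_line": 50, "end_line": 100, "encoding": "utf-8"}


def diff_arguments(expected: dict, actual: dict) -> list[str]:
    """List the ways `actual` deviates from the `expected` tool arguments."""
    problems = []
    for name, want in expected.items():
        if name not in actual:
            problems.append(f"missing: {name}")
        elif type(actual[name]) is not type(want):
            problems.append(f"wrong type: {name}")
        elif actual[name] != want:
            problems.append(f"wrong value: {name}")
    # Names the model invented that the tool does not accept.
    for name in actual:
        if name not in expected:
            problems.append(f"unexpected: {name}")
    return problems


if __name__ == "__main__":
    print(diff_arguments(EXPECTED, {"filename": "main.py", "lines": "50-100"}))
```

An empty list means the call matches exactly. Whether a case mismatch such as "UTF-8" should count as an error or be normalized by the Agent is a design choice left open here.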
Root Cause 3: Multi-Step Chain Breakage
Agent tasks typically require multi-step tool calls where each step's output feeds the next. 3B models lose track of intermediate state, confabulate results, and fail to carry variables across steps.
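One way to see why chains break is to treat each step as an independent trial, so the probability of finishing the whole chain is the per-step success rate raised to the number of steps. This is a simplified model assumed here for illustration, not how the benchmark figures in this chapter were computed; real failures can be correlated or recovered.

```python
def chain_success(per_step_success: float, steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independence."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


if __name__ == "__main__":
    for p in (0.5, 0.8, 0.95):
        print(p, [round(chain_success(p, n), 3) for n in (1, 3, 5)])
```

Under this model even a step that succeeds 90% of the time completes a five-step chain only about 59% of the time, so small per-step weaknesses compound quickly.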
Mitigation Techniques for Small Models
# Technique 1: Simplify the tool set (reduce choice paralysis)
# Never give small models more than 3–5 tools
# Technique 2: Reinforce format constraints
SMALL_MODEL_SYSTEM = """
CRITICAL: When calling a tool, output ONLY this JSON block:
<tool_call>
{"name": "tool_name", "arguments": {...}}
</tool_call>
No text before or after. No descriptions. Just the JSON.
"""
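# Technique 3: Decompose tasks at the Agent layer, not the model layer
class SimpleStepAgent:
    def decompose_task(self, complex_task: str) -> list[str]:
        """Break complex tasks into single-step calls using rules, not the model."""
        # Left as a stub: the splitting rules depend on the application.
        raise NotImplementedError("decomposition rules are application-specific")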
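As one possible reading of Technique 3, the decomposition can be done with plain string rules so the small model only ever sees one step at a time. The sketch below splits on a few English connectives; the rule set is a hypothetical starting point, not part of Hermes Agent, and real tasks would need rules tuned to their domain.

```python
import re

# Connectives that usually separate sequential instructions in English.
# Hypothetical rule set: semicolons, "then", "and then", and sentence breaks.
_STEP_SPLIT = re.compile(
    r"\s*(?:;|,?\s+and then\s+|,?\s+then\s+|\.\s+)\s*",
    re.IGNORECASE,
)


def decompose_task(complex_task: str) -> list[str]:
    """Split a task into single-step instructions using fixed rules only."""
    parts = _STEP_SPLIT.split(complex_task.strip())
    # Drop empty fragments and trailing periods left over from the split.
    return [part.strip(" .") for part in parts if part.strip(" .")]


if __name__ == "__main__":
    task = "Read config.yaml, then validate the schema and then write a report."
    for step in decompose_task(task):
        print("-", step)
```

Because the split never calls the model, it is deterministic and cheap, at the cost of missing any sequencing the rules do not recognize.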
27.4 Selection Decision Tree
Q1: Primary task type?
├─ Simple Q&A / text processing (no tool calls)
│ → 3B or 8B, choose 3B to save resources
│
├─ Single-step tool calling (e.g., search + answer)
│ → 8B (the minimum viable scale for tool calling)
│
├─ Multi-step Agent (3+ tool calls)
│ → 70B minimum (8B reaches only 61.3% multi-step success)
│
└─ Complex reasoning + multi-step Agent (code gen, data analysis)
  → Recommend 70B, use 405B if available
Q2: Deployment environment?
├─ Mobile / edge (< 4GB memory)
│ → 3B only (limited use, not recommended for complex Agent)
│
├─ Personal computer / MacBook (8–32GB)
│ → 8B (Q4_K_M quantization)
│
├─ Workstation / single server (48–80GB VRAM; 70B Q4_K_M needs about 42 GB)
│ → 70B (Q4_K_M quantization)
│
└─ Multi-GPU server / cloud (> 80GB VRAM)
  → 70B FP16 or 405B Q4
Q3: Latency requirement?
├─ < 1s first token (real-time interaction)
│ → Smaller scale + quantization, or use API
│
└─ 2–10s acceptable (batch / background)
  → Can use larger scale
Q4: Budget?
├─ API only → Use Claude/GPT-4 class APIs
├─ < $3,000 hardware → 8B Q4 (consumer GPU)
├─ $3,000–$20,000 → 70B Q4 (A40 or 2×RTX 3090)
└─ > $20,000 → 70B FP16 (2×A100 80GB); 405B Q4 needs 4×A100 80GB (~$48,000)
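The first two questions of the tree can be combined in code. The sketch below takes the task minimum from Q1 and checks it against the Q4_K_M memory figures from the hardware table. Falling back to an API when the minimum scale does not fit is an assumption of this sketch, as are the task-type labels and the function name.

```python
# Q1: minimum scale per task type, taken from the decision tree.
TASK_MINIMUM = {
    "qa": "3B",
    "single_step_tool": "8B",
    "multi_step_agent": "70B",
    "complex_agent": "70B",
}

# Q4_K_M memory footprint in GB, copied from the hardware table.
Q4_VRAM_GB = {"3B": 2.3, "8B": 5.2, "70B": 42.0, "405B": 245.0}


def recommend(task_type: str, memory_gb: float) -> str:
    """Combine Q1 (task type) and Q2 (hardware) into one recommendation."""
    minimum = TASK_MINIMUM[task_type]
    if Q4_VRAM_GB[minimum] > memory_gb:
        return "api"  # hardware cannot host the minimum viable scale
    # The tree upgrades only the most demanding tasks when 405B is available.
    if task_type == "complex_agent" and Q4_VRAM_GB["405B"] <= memory_gb:
        return "405B"
    return minimum


if __name__ == "__main__":
    print(recommend("multi_step_agent", 16))
    print(recommend("complex_agent", 320))
```

Latency (Q3) and budget (Q4) are left out; they would act as further filters on the result.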
27.5 Cloud API vs. Local Deployment Break-Even Analysis
Cost Comparison Framework
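def calculate_break_even(
    local_hardware_cost: float,
    local_monthly_electricity: float,
    local_monthly_maintenance: float,
    api_price_per_1k_tokens: float,
    monthly_token_usage: int,
    hardware_lifespan_months: int = 36,
) -> dict:
    """Compare the monthly cost of local deployment against API usage."""
    # Spread the hardware purchase evenly over its expected lifespan.
    monthly_hardware_amortized = local_hardware_cost / hardware_lifespan_months
    local_monthly_running = local_monthly_electricity + local_monthly_maintenance
    local_monthly_total = monthly_hardware_amortized + local_monthly_running

    api_monthly_cost = monthly_token_usage / 1000 * api_price_per_1k_tokens
    monthly_savings = api_monthly_cost - local_monthly_total

    # Payback compares the purchase price with savings on running costs only.
    # Subtracting the amortized hardware share here would count it twice.
    monthly_cash_savings = api_monthly_cost - local_monthly_running
    if monthly_cash_savings > 0:
        break_even_months = local_hardware_cost / monthly_cash_savings
    else:
        break_even_months = float("inf")  # local hardware never pays for itself

    return {
        "local_monthly_cost": local_monthly_total,
        "api_monthly_cost": api_monthly_cost,
        "monthly_savings": monthly_savings,
        "break_even_months": break_even_months,
        "recommendation": "local" if monthly_savings > 0 else "api",
    }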
Break-Even Token Volume by Scale
| Model Scale | Recommended Local Setup | Hardware Cost | Monthly Fixed | API Cost Break-Even |
|---|---|---|---|---|
| 8B | RTX 4080 16GB | ~$1,200 | ~$63/month | > 21M tokens/month |
| 70B | A100 80GB | ~$12,000 | ~$380/month | > 127M tokens/month |
| 70B | 2×RTX 3090 | ~$3,000 | ~$163/month | > 54M tokens/month |
| 405B | 4×A100 80GB | ~$48,000 | ~$1,520/month | > 507M tokens/month |
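The break-even column can be reproduced by dividing each monthly fixed cost by an API price. The chapter does not state the price it used; roughly $3 per million tokens is the value that matches all four rows, so it is treated below as an inferred assumption rather than a quoted rate.

```python
# Monthly fixed costs (USD) copied from the table above.
MONTHLY_FIXED_USD = {
    "8B / RTX 4080 16GB": 63,
    "70B / A100 80GB": 380,
    "70B / 2x RTX 3090": 163,
    "405B / 4x A100 80GB": 1520,
}

# Assumed API price, inferred from the table rather than stated in the text.
API_USD_PER_MILLION_TOKENS = 3.0


def break_even_million_tokens(
    monthly_fixed_usd: float,
    usd_per_million: float = API_USD_PER_MILLION_TOKENS,
) -> float:
    """Monthly volume (millions of tokens) at which API spend equals local cost."""
    return monthly_fixed_usd / usd_per_million


if __name__ == "__main__":
    for setup, fixed in MONTHLY_FIXED_USD.items():
        print(f"{setup}: {break_even_million_tokens(fixed):.0f}M tokens/month")
```

Because the result scales inversely with the API price, halving the assumed price doubles every break-even volume, which makes the price the first number to check against a real quote.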
Conclusion: For most small-to-mid applications, API is more economical than self-hosting 70B+: even the cheapest 70B setup in the table (2×RTX 3090) breaks even only above 54M tokens/month. The primary driver of local deployment is usually data privacy, not cost savings.
Privacy-Driven Selection Logic
Local deployment reasons beyond cost:
1. Compliance requirements
→ Medical/financial/government data cannot leave corporate network
→ Minimum: 70B local
2. Ultra-low latency
→ Need < 100ms response, API network latency unacceptable
→ Prefer: 8B local (speed priority)
3. Offline operation
→ No network (factories, ships, defense)
→ Choose: largest scale that fits available hardware
4. Custom fine-tuning
→ Need LoRA training on private data
→ Need: enough VRAM for training
27.6 Summary
Parameter scale selection is the most consequential deployment decision for Hermes Agent:
- 3B: Mobile, edge inference, simple conversation — not suitable for tool-heavy Agents
- 8B: Best choice for personal development and prototyping; multi-step success ~61%, use carefully in production
- 70B: Production-safe baseline with 89% multi-step success; hardware requirements manageable with quantization
- 405B: Enterprise-critical tasks requiring highest accuracy; API deployment preferred over local
- Break-even principle: For most applications with < 50M monthly tokens, API is more economical than local 70B+ deployment
Discussion Questions
- Why does multi-step success jump so dramatically from 8B (61%) to 70B (89%)? Is it purely parameter count, or are there other factors (training data, architecture)?
- In your business scenario, what is the cost of an Agent task failure? If 8B's 39% failure rate means user churn, should you skip straight to 70B or 405B?
- The break-even analysis for "2×RTX 3090 to run 70B" requires > 54M monthly tokens. What variables does this calculation miss? Which do you consider most significant?
- If only 8B is available but you need 70B-level Agent performance, what engineering compensation strategies would you employ?