Chapter 27: Choosing the Right Scale: The 3B/8B/70B/405B Selection Matrix

Choosing the right parameter scale for Hermes Agent is like choosing the right tool for a job: use a screwdriver for screws, a drill for walls. The wrong choice isn't just inefficient — it can make a project fail at its foundation. This chapter provides a data-driven selection framework across three dimensions: Function Calling success rate, reasoning quality, and hardware requirements.

27.1 Capability Boundaries by Scale

Comprehensive Capability Scoring Matrix

Based on extensive community benchmarking, Hermes models at each scale perform as follows on core Agent tasks (capability scores are out of 10; context length and inference speed are reported in their own units):

Capability	Hermes-3B	Hermes-8B	Hermes-70B	Hermes-405B
Function Calling success rate	3.2/10	6.8/10	9.1/10	9.7/10
Multi-step reasoning quality	2.8/10	6.4/10	9.0/10	9.8/10
Tool parameter accuracy	3.5/10	7.2/10	9.3/10	9.9/10
Instruction following	5.1/10	7.6/10	9.2/10	9.8/10
Error recovery	2.1/10	5.8/10	8.7/10	9.5/10
Usable context length	4K tokens	8K tokens	32K tokens	128K tokens
Inference speed (local Q4)	45 tok/s	28 tok/s	8 tok/s	1.2 tok/s
Overall Agent suitability	Low	Medium	High	Excellent

Benchmark Results (500-task standardized Agent test suite)

Scale	Tool Selection	Parameter Accuracy	Format Compliance	Multi-step Success
3B	52.3%	41.7%	78.2%	18.6%
8B	79.4%	73.8%	96.1%	61.3%
70B	94.7%	91.2%	99.3%	88.9%
405B	97.8%	96.4%	99.8%	94.2%

Critical threshold: The data shows 8B is the absolute minimum for Agent tasks (multi-step success 61.3%), while 70B is the production-safe baseline (multi-step success 88.9%).
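The threshold reasoning above can be made mechanical. The following is a minimal sketch that copies the multi-step success figures from the benchmark table and returns the smallest scale that meets a required success rate. The function name and the target values are illustrative assumptions, not part of any Hermes tooling.

```python
from typing import Optional

# Multi-step success rates copied from the 500-task benchmark table above.
MULTI_STEP_SUCCESS = {"3B": 0.186, "8B": 0.613, "70B": 0.889, "405B": 0.942}

# Scales ordered from cheapest to most expensive to run.
SCALE_ORDER = ["3B", "8B", "70B", "405B"]


def smallest_scale_meeting(target: float) -> Optional[str]:
    """Return the smallest scale whose multi-step success meets the target."""
    for scale in SCALE_ORDER:
        if MULTI_STEP_SUCCESS[scale] >= target:
            return scale
    return None  # no scale in the table reaches the target


if __name__ == "__main__":
    for target in (0.60, 0.85, 0.95):
        print(f"target {target:.0%}: {smallest_scale_meeting(target)}")
```

A target of 60% lands on 8B and a target of 85% lands on 70B, which matches the two thresholds named above. A target above 94.2% returns None, because no scale in the table reaches it.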

27.2 Hardware Requirements

Local Deployment Hardware Requirements (by Quantization)

Model	Quantization	VRAM	System RAM	Recommended GPU	Speed
3B	FP16	6 GB	8 GB	RTX 3060 6GB	60 tok/s
3B	Q4_K_M	2.3 GB	4 GB	Apple M1/M2	45 tok/s
8B	FP16	16 GB	16 GB	RTX 3090	32 tok/s
8B	Q4_K_M	5.2 GB	8 GB	RTX 3060 8GB	28 tok/s
8B	Q8_0	9.1 GB	12 GB	RTX 3080	24 tok/s
70B	FP16	140 GB	160 GB	2×A100 80GB	12 tok/s
70B	Q4_K_M	42 GB	64 GB	2×RTX 3090 or A40	8 tok/s
70B	Q8_0	75 GB	96 GB	A100 80GB	6 tok/s
405B	Q4_K_M	245 GB	320 GB	4×A100 80GB	1.2 tok/s
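The VRAM column can be sanity-checked with a rule of thumb: weight memory is roughly parameter count times bits per weight, divided by 8. The sketch below applies that rule. The bits-per-weight value for Q4_K_M is an approximate average and an assumption of this sketch, and the estimate covers weights only, so real usage is higher once the KV cache and runtime buffers are added.

```python
# Approximate bits stored per weight for each format.
# FP16 is exact. Q8_0 stores 32 int8 values plus one fp16 scale per block,
# which works out to 8.5 bits. The Q4_K_M figure is an approximate average
# and is an assumption of this sketch.
BITS_PER_WEIGHT = {"FP16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.85}


def estimate_weight_gb(params_billion: float, quant: str) -> float:
    """Estimate memory for the weights only (no KV cache or activations)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


if __name__ == "__main__":
    for size, quant in [(8, "FP16"), (70, "Q4_K_M"), (70, "Q8_0"), (405, "Q4_K_M")]:
        print(f"{size}B {quant}: ~{estimate_weight_gb(size, quant):.1f} GB")
```

The estimates for 70B and 405B fall within about 1 GB of the table. The 8B rows in the table are somewhat higher than the estimate, which is consistent with the table including some runtime overhead.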

Hardware Scenario Quick Guide

MacBook with M3 (16GB unified memory)
  → Hermes-8B Q4_K_M (5.2 GB, runs smoothly)

Gaming PC (RTX 4080 16GB VRAM)
  → Hermes-8B Q8_0 (9.1 GB, near lossless; FP16 weights alone fill all 16 GB and leave no room for the KV cache)

Workstation (2×RTX 3090 24GB)
  → Hermes-70B Q4_K_M (42 GB, dual-GPU required)
  → ~8 tok/s, production-viable

Single server (A100 80GB)
  → Hermes-70B Q4_K_M (42 GB, comfortable)
  → Hermes-70B Q8_0 (75 GB, near lossless)

Multi-GPU server (4×A100 80GB)
  → Hermes-405B Q4_K_M (245 GB)

27.3 Why Sub-7B Models Fail at Tool Calling

Understanding why small models struggle with Function Calling is essential — it informs both selection and mitigation strategies.

Root Cause 1: Insufficient Instruction-Following Capacity

Function Calling requires the model to output strict JSON format AND make the correct judgment between "when to call a tool" vs. "when to answer directly."

Typical 3B failure modes:
- Describes the tool call in natural language instead of outputting JSON
- Outputs half-JSON, half-text
- Over-calls tools (treats every question as needing a tool)
- Ignores tools entirely (forgets it has them)

Example (3B model):
  User: Please read the contents of /tmp/data.csv
  3B output: I will use the read_file tool with path /tmp/data.csv, 
              and then return the file contents. [Content would appear here]

  (Correct behavior: output tool call JSON, wait for result, then process it)
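At the Agent layer, this failure can be detected before it propagates. The sketch below assumes the <tool_call> wrapper used in this chapter's mitigation prompt and treats anything that does not parse as a failed call. The function name is hypothetical.

```python
import json
import re
from typing import Optional

# Matches the <tool_call>...</tool_call> wrapper used in this chapter's prompts.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def extract_tool_call(model_output: str) -> Optional[dict]:
    """Return the parsed tool call, or None if the output is not a valid call."""
    match = TOOL_CALL_RE.search(model_output)
    if match is None:
        return None  # the model described the call in prose, or ignored the tool
    try:
        call = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None  # half-JSON, half-text
    if not isinstance(call, dict) or "name" not in call:
        return None
    return call


if __name__ == "__main__":
    prose = "I will use the read_file tool with path /tmp/data.csv"
    print(extract_tool_call(prose))
```

A None result is the signal to retry with a stricter prompt or to fall back to a larger model, rather than passing the prose on as if it were a tool result.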

Root Cause 2: Weak Parameter Extraction

# Test: Parameter extraction accuracy

test_prompt = """
User: Read lines 50 to 100 from main.py using UTF-8 encoding

Expected tool call:
{
  "name": "read_file",
  "arguments": {
    "path": "main.py",
    "start_line": 50,
    "end_line": 100,
    "encoding": "utf-8"
  }
}
"""

# Common 3B errors (from actual testing):
fail_examples_3b = [
    # Type error: strings instead of integers
    {"path": "main.py", "start_line": "50", "end_line": "100"},
    
    # Missing parameters
    {"path": "main.py"},
    
    # Wrong parameter names
    {"filename": "main.py", "lines": "50-100"},
    
    # Case sensitivity
    {"path": "main.py", "start_line": 50, "end_line": 100, "encoding": "UTF-8"}
]
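Several of these errors can be caught mechanically before the tool runs. The following is a minimal sketch of argument validation against a hand-written schema. The schema layout and the function name are assumptions made for illustration, and a production system would more likely use a JSON Schema validator.

```python
# Hypothetical schema for the read_file tool used in the example above:
# parameter name -> (expected type, required?)
READ_FILE_SCHEMA = {
    "path": (str, True),
    "start_line": (int, False),
    "end_line": (int, False),
    "encoding": (str, False),
}


def validate_arguments(args: dict, schema: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for name, (expected_type, required) in schema.items():
        if name not in args:
            if required:
                problems.append(f"missing required parameter: {name}")
        elif type(args[name]) is not expected_type:
            problems.append(
                f"wrong type for {name}: expected {expected_type.__name__}, "
                f"got {type(args[name]).__name__}"
            )
    for name in args:
        if name not in schema:
            problems.append(f"unknown parameter: {name}")
    return problems


if __name__ == "__main__":
    bad = {"path": "main.py", "start_line": "50", "end_line": "100"}
    for problem in validate_arguments(bad, READ_FILE_SCHEMA):
        print(problem)
```

Schema checks catch type errors and wrong parameter names. They do not catch an omitted optional parameter or a case difference such as "UTF-8", since both are valid against the schema; those need value normalization or a comparison against the user's request.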

Root Cause 3: Multi-Step Chain Breakage

Agent tasks typically require multi-step tool calls where each step's output feeds the next. 3B models lose track of intermediate state, confabulate results, and fail to carry variables across steps.
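One way to see why chains break is to look at how per-step errors compound. The sketch below assumes each step succeeds independently with the same probability, which is a simplification. The per-step rates are illustrative and are not taken from the benchmark tables.

```python
def chain_success(per_step_success: float, steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independence."""
    return per_step_success ** steps


if __name__ == "__main__":
    for rate in (0.80, 0.95, 0.99):
        print(f"per-step {rate:.0%}, 5 steps: {chain_success(rate, 5):.1%}")
```

Under this assumption, a model that gets 80% of individual steps right completes a five-step chain only about a third of the time, while a 95% per-step rate still loses more than a fifth of five-step tasks.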

Mitigation Techniques for Small Models

# Technique 1: Simplify the tool set (reduce choice paralysis)
# Never give small models more than 3–5 tools

# Technique 2: Reinforce format constraints
SMALL_MODEL_SYSTEM = """
CRITICAL: When calling a tool, output ONLY this JSON block:
<tool_call>
{"name": "tool_name", "arguments": {...}}
</tool_call>
No text before or after. No descriptions. Just the JSON.
"""

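# Technique 3: Decompose tasks at the Agent layer, not the model layer
import re

class SimpleStepAgent:
    def decompose_task(self, complex_task: str) -> list[str]:
        """Break complex tasks into single-step calls using rules, not the model."""
        # Illustrative rule only: split on semicolons and on "then" / "and then".
        parts = re.split(r"\s*(?:;|,?\s*\b(?:and\s+)?then\b)\s*", complex_task)
        return [part.strip() for part in parts if part.strip()]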
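Technique 1 can also be enforced in code rather than by convention. The sketch below ranks tools by naive keyword overlap between the task and each tool description, then exposes only the top few to the model. The scoring rule, the function name, and the example tools are all assumptions for illustration; an embedding-based ranker would be a reasonable substitute.

```python
def select_tools(task: str, tools: dict[str, str], max_tools: int = 5) -> list[str]:
    """Rank tools by keyword overlap between the task and each description."""
    task_words = set(task.lower().split())
    scored = []
    for name, description in tools.items():
        overlap = len(task_words & set(description.lower().split()))
        scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:max_tools]]


if __name__ == "__main__":
    # Hypothetical tool registry: name -> one-line description.
    registry = {
        "read_file": "read a file from disk",
        "web_search": "search the web",
        "send_email": "send an email message",
    }
    print(select_tools("read the file main.py", registry, max_tools=2))
```

Only the selected tool definitions would then be placed in the system prompt, which keeps the small model's choice set within the 3–5 tool limit.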

27.4 Selection Decision Tree

Q1: Primary task type?
  ├─ Simple Q&A / text processing (no tool calls)
  │   → 3B or 8B, choose 3B to save resources
  │
  ├─ Single-step tool calling (e.g., search + answer)
  │   → 8B (minimum and recommended)
  │
  ├─ Multi-step Agent (3+ tool calls)
  │   → 70B (minimum and strongly recommended)
  │
  └─ Complex reasoning + multi-step Agent (code gen, data analysis)
      → Recommend 70B, use 405B if available

Q2: Deployment environment?
  ├─ Mobile / edge (< 4GB memory)
  │   → 3B only (limited use, not recommended for complex Agent)
  │
  ├─ Personal computer / MacBook (8–32GB)
  │   → 8B (Q4_K_M quantization)
  │
  ├─ Workstation / single server (48–80GB VRAM)
  │   → 70B (Q4_K_M quantization, 42 GB)
  │
  └─ Multi-GPU server / cloud (> 80GB VRAM)
      → 70B FP16 (140 GB) or 405B Q4 (245 GB)

Q3: Latency requirement?
  ├─ < 1s first token (real-time interaction)
  │   → Smaller scale + quantization, or use API
  │
  └─ 2–10s acceptable (batch / background)
      → Can use larger scale

Q4: Budget?
  ├─ API only → Use Claude/GPT-4 class APIs
  ├─ < $1,000 hardware → 8B Q4 (consumer GPU)
  ├─ $3,000–$10,000 → 70B Q4 (A40 or 2×RTX 3090)
  └─ > $20,000 → 70B FP16 or 405B Q4
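The first two questions of the tree can be encoded as a small lookup, which makes the interaction between task type and hardware explicit. This is a minimal sketch: the task-type keys and the function name are hypothetical, and the VRAM figures are the Q4_K_M values from the hardware table in section 27.2.

```python
from typing import Optional

# Minimum scale per task type, following Q1 of the tree above.
MIN_SCALE_FOR_TASK = {
    "simple_qa": "3B",
    "single_step_tools": "8B",
    "multi_step_agent": "70B",
    "complex_reasoning_agent": "70B",
}

# Q4_K_M VRAM figures from the hardware table in section 27.2.
Q4_VRAM_GB = {"3B": 2.3, "8B": 5.2, "70B": 42.0, "405B": 245.0}


def recommend_scale(task_type: str, vram_gb: float) -> Optional[str]:
    """Return the minimum scale for the task if it fits in VRAM, else None."""
    scale = MIN_SCALE_FOR_TASK[task_type]
    if Q4_VRAM_GB[scale] <= vram_gb:
        return scale
    return None  # hardware cannot host the minimum scale; consider an API


if __name__ == "__main__":
    print(recommend_scale("multi_step_agent", vram_gb=48))
    print(recommend_scale("multi_step_agent", vram_gb=24))
```

A None result means the task's minimum scale does not fit on the available hardware, which corresponds to the API branch under Q4.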

27.5 Cloud API vs. Local Deployment Break-Even Analysis

Cost Comparison Framework

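def calculate_break_even(
    local_hardware_cost: float,
    local_monthly_electricity: float,
    local_monthly_maintenance: float,
    api_price_per_1k_tokens: float,
    monthly_token_usage: int,
    hardware_lifespan_months: int = 36
) -> dict:
    """Compare monthly API spend against amortized local hosting cost."""
    monthly_hardware_amortized = local_hardware_cost / hardware_lifespan_months
    monthly_operating = local_monthly_electricity + local_monthly_maintenance
    local_monthly_total = monthly_hardware_amortized + monthly_operating
    api_monthly_cost = monthly_token_usage / 1000 * api_price_per_1k_tokens
    monthly_savings = api_monthly_cost - local_monthly_total

    # Payback period: months until the avoided API spend, net of operating
    # costs, covers the hardware purchase. Amortization is left out of the
    # denominator because the purchase price is already the numerator;
    # counting it twice would overstate the payback period.
    net_monthly_benefit = api_monthly_cost - monthly_operating
    if net_monthly_benefit > 0:
        break_even_months = local_hardware_cost / net_monthly_benefit
    else:
        break_even_months = float("inf")  # local hosting never pays for itself

    return {
        "local_monthly_cost": local_monthly_total,
        "api_monthly_cost": api_monthly_cost,
        "monthly_savings": monthly_savings,
        "break_even_months": break_even_months,
        "recommendation": "local" if monthly_savings > 0 else "api"
    }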

Break-Even Token Volume by Scale

Model Scale	Recommended Local Setup	Hardware Cost	Monthly Fixed	API Cost Break-Even
8B	RTX 4080 16GB	~$1,200	~$63/month	> 21M tokens/month
70B	A100 80GB	~$12,000	~$380/month	> 127M tokens/month
70B	2×RTX 3090	~$3,000	~$163/month	> 54M tokens/month
405B	4×A100 80GB	~$48,000	~$1,520/month	> 507M tokens/month
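The table does not state the API price behind its break-even volumes. Dividing each monthly fixed cost by its break-even volume gives about $3 per million tokens in every row, so the sketch below takes $0.003 per 1K tokens as an assumed price and reproduces the last column from the monthly fixed costs. The function name is illustrative.

```python
# Assumed API price, inferred from the table's own ratios (not stated in it).
ASSUMED_API_PRICE_PER_1K = 0.003


def break_even_tokens_per_month(
    monthly_fixed_cost: float,
    api_price_per_1k_tokens: float = ASSUMED_API_PRICE_PER_1K,
) -> float:
    """Monthly token volume at which API spend equals the local fixed cost."""
    return monthly_fixed_cost / api_price_per_1k_tokens * 1000


if __name__ == "__main__":
    setups = {"8B / RTX 4080": 63, "70B / A100": 380,
              "70B / 2x RTX 3090": 163, "405B / 4x A100": 1520}
    for name, fixed in setups.items():
        millions = break_even_tokens_per_month(fixed) / 1e6
        print(f"{name}: ~{millions:.0f}M tokens/month")
```

Because the break-even volume scales inversely with the API price, halving the assumed price doubles every figure in the last column.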

Conclusion: For most small-to-mid applications (< 10M tokens/month), API is more economical than self-hosting 70B+. The primary driver of local deployment is usually data privacy, not cost savings.

Privacy-Driven Selection Logic

Local deployment reasons beyond cost:

1. Compliance requirements
   → Medical/financial/government data cannot leave corporate network
   → Minimum: 70B local

2. Ultra-low latency
   → Need < 100ms response, API network latency unacceptable
   → Prefer: 8B local (speed priority)

3. Offline operation
   → No network (factories, ships, defense)
   → Choose: largest scale that fits available hardware

4. Custom fine-tuning
   → Need LoRA training on private data
   → Need: enough VRAM for training

27.6 Summary

Parameter scale selection is the most consequential deployment decision for Hermes Agent:

3B: Mobile, edge inference, simple conversation — not suitable for tool-heavy Agents
8B: Best choice for personal development and prototyping; multi-step success ~61%, use carefully in production
70B: Production-safe baseline with 89% multi-step success; hardware requirements manageable with quantization
405B: Enterprise-critical tasks requiring highest accuracy; API deployment preferred over local
Break-even principle: Below roughly 50M tokens per month, API access is more economical than a local 70B+ deployment (the cheapest 70B setup in the table, 2×RTX 3090, breaks even at about 54M tokens per month)

Discussion Questions

Why does the multi-step success rate jump so dramatically from 8B (61%) to 70B (89%)? Is it purely parameter count, or are there other factors (training data, architecture)?
In your business scenario, what is the cost of an Agent task failure? If 8B's roughly 39% multi-step failure rate means user churn, should you skip straight to 70B or 405B?
The break-even analysis for "2×RTX 3090 to run 70B" requires > 54M monthly tokens. What variables does this calculation miss? Which do you consider most significant?
If only 8B is available but you need 70B-level Agent performance, what engineering compensation strategies would you employ?
