Chapter 27

Parameter Scale Selection: 3B/8B/70B/405B Use-Case Matrix

Chapter 27: Choosing the Right Scale: The 3B/8B/70B/405B Selection Matrix

Choosing the right parameter scale for Hermes Agent is like choosing the right tool for a job: use a screwdriver for screws, a drill for walls. The wrong choice isn't just inefficient; it can make a project fail at its foundation. This chapter provides a data-driven selection framework across three dimensions: Function Calling success rate, reasoning quality, and hardware requirements.


27.1 Capability Boundaries by Scale

Comprehensive Capability Scoring Matrix

Based on extensive community benchmarking, Hermes models at each scale perform as follows on core Agent tasks (out of 10):

| Capability | Hermes-3B | Hermes-8B | Hermes-70B | Hermes-405B |
|---|---|---|---|---|
| Function Calling success rate | 3.2/10 | 6.8/10 | 9.1/10 | 9.7/10 |
| Multi-step reasoning quality | 2.8/10 | 6.4/10 | 9.0/10 | 9.8/10 |
| Tool parameter accuracy | 3.5/10 | 7.2/10 | 9.3/10 | 9.9/10 |
| Instruction following | 5.1/10 | 7.6/10 | 9.2/10 | 9.8/10 |
| Error recovery | 2.1/10 | 5.8/10 | 8.7/10 | 9.5/10 |
| Usable context length | 4K tokens | 8K tokens | 32K tokens | 128K tokens |
| Inference speed (local Q4) | 45 tok/s | 28 tok/s | 8 tok/s | 1.2 tok/s |
| Overall Agent suitability | Low | Medium | High | Excellent |

Benchmark Results (500-task standardized Agent test suite)

| Scale | Tool Selection | Parameter Accuracy | Format Compliance | Multi-step Success |
|---|---|---|---|---|
| 3B | 52.3% | 41.7% | 78.2% | 18.6% |
| 8B | 79.4% | 73.8% | 96.1% | 61.3% |
| 70B | 94.7% | 91.2% | 99.3% | 88.9% |
| 405B | 97.8% | 96.4% | 99.8% | 94.2% |

Critical threshold: The data shows 8B is the absolute minimum for Agent tasks (multi-step success 61.3%), while 70B is the production-safe baseline (multi-step success 88.9%).
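The threshold reasoning above can be written as a small lookup. The following is a minimal sketch that copies the multi-step success rates from the benchmark table; the function name `smallest_scale_meeting` is an illustrative assumption, not part of Hermes Agent.

```python
from typing import Optional

# Multi-step success rates copied from the benchmark table in this section.
MULTI_STEP_SUCCESS = {"3B": 0.186, "8B": 0.613, "70B": 0.889, "405B": 0.942}

# Scales ordered from smallest to largest.
SCALE_ORDER = ["3B", "8B", "70B", "405B"]


def smallest_scale_meeting(target: float) -> Optional[str]:
    """Return the smallest scale whose multi-step success rate reaches `target`.

    Returns None when no benchmarked scale reaches the target.
    """
    for scale in SCALE_ORDER:
        if MULTI_STEP_SUCCESS[scale] >= target:
            return scale
    return None


if __name__ == "__main__":
    for target in (0.60, 0.85, 0.99):
        print(f"target {target:.0%}: {smallest_scale_meeting(target)}")
```

A 60% target is met by 8B and an 85% target needs 70B, which matches the two thresholds named above. A target that no scale reaches returns `None`, a signal that the task needs redesign rather than a bigger model.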


27.2 Hardware Requirements

Local Deployment Hardware Requirements (by Quantization)

| Model | Quantization | VRAM | System RAM | Recommended GPU | Speed |
|---|---|---|---|---|---|
| 3B | FP16 | 6 GB | 8 GB | RTX 3060 6GB | 60 tok/s |
| 3B | Q4_K_M | 2.3 GB | 4 GB | Apple M1/M2 | 45 tok/s |
| 8B | FP16 | 16 GB | 16 GB | RTX 3090 | 32 tok/s |
| 8B | Q4_K_M | 5.2 GB | 8 GB | RTX 3060 8GB | 28 tok/s |
| 8B | Q8_0 | 9.1 GB | 12 GB | RTX 3080 | 24 tok/s |
| 70B | FP16 | 140 GB | 160 GB | 2×A100 80GB | 12 tok/s |
| 70B | Q4_K_M | 42 GB | 64 GB | 2×RTX 3090 or A40 | 8 tok/s |
| 70B | Q8_0 | 75 GB | 96 GB | A100 80GB | 6 tok/s |
| 405B | Q4_K_M | 245 GB | 320 GB | 4×A100 80GB | 1.2 tok/s |
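The VRAM column follows roughly from parameter count times bits per weight. The sketch below estimates memory for the weights alone. The FP16 figure (16 bits per weight) is exact; the bits-per-weight values for `Q8_0` and `Q4_K_M` are approximations assumed here for illustration, and real files vary with the model architecture. KV cache and activations need additional memory on top of this estimate.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


# FP16 is exactly 16 bits per weight. The quantized values are approximate
# effective bits per weight (an assumption for this sketch).
BITS_PER_WEIGHT = {"FP16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.85}

if __name__ == "__main__":
    for params in (3, 8, 70, 405):
        for quant, bits in BITS_PER_WEIGHT.items():
            print(f"{params}B {quant}: ~{weight_memory_gb(params, bits):.1f} GB")
```

For 70B this gives 140 GB at FP16, about 74 GB at Q8_0, and about 42 GB at Q4_K_M, close to the table's figures. Treat the output as a lower bound when sizing hardware.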

Hardware Scenario Quick Guide

MacBook M3 Pro (16GB unified memory)
  → Hermes-8B Q4_K_M (5.2 GB, runs smoothly)

Gaming PC (RTX 4080 16GB VRAM)
  → Hermes-8B FP16 (16 GB, best 8B quality)

Workstation (2×RTX 3090 24GB)
  → Hermes-70B Q4_K_M (42 GB, dual-GPU required)
  → ~8 tok/s, production-viable

Single server (A100 80GB)
  → Hermes-70B Q4_K_M (42 GB, comfortable)
  → Hermes-70B Q8_0 (75 GB, near lossless)

Multi-GPU server (4×A100 80GB)
  → Hermes-405B Q4_K_M (245 GB)
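The quick guide can be reproduced with a lookup over the hardware table. This is a minimal sketch: the VRAM figures are copied from the table in this section, while `best_config` and its `reserve_gb` parameter (memory held back for the OS or KV cache) are assumptions introduced for illustration.

```python
from typing import Optional

# (model, quantization, VRAM in GB) rows copied from the hardware table.
CONFIGS = [
    ("3B", "FP16", 6.0), ("3B", "Q4_K_M", 2.3),
    ("8B", "FP16", 16.0), ("8B", "Q4_K_M", 5.2), ("8B", "Q8_0", 9.1),
    ("70B", "FP16", 140.0), ("70B", "Q4_K_M", 42.0), ("70B", "Q8_0", 75.0),
    ("405B", "Q4_K_M", 245.0),
]
SIZE_RANK = {"3B": 0, "8B": 1, "70B": 2, "405B": 3}
PRECISION_RANK = {"Q4_K_M": 0, "Q8_0": 1, "FP16": 2}


def best_config(memory_gb: float, reserve_gb: float = 0.0) -> Optional[tuple[str, str]]:
    """Pick the largest model that fits, then the highest precision for it."""
    usable = memory_gb - reserve_gb
    fitting = [c for c in CONFIGS if c[2] <= usable]
    if not fitting:
        return None
    model, quant, _ = max(
        fitting, key=lambda c: (SIZE_RANK[c[0]], PRECISION_RANK[c[1]])
    )
    return model, quant


if __name__ == "__main__":
    print(best_config(16))                # gaming PC with 16 GB VRAM
    print(best_config(16, reserve_gb=8))  # 16 GB unified memory shared with the OS
    print(best_config(48))                # 2 x 24 GB
    print(best_config(80))                # single 80 GB card
```

Preferring model size over precision reflects the chapter's benchmark data, where the gap between scales is far larger than anything attributed to quantization. The `reserve_gb` argument explains why a 16 GB laptop with unified memory lands on 8B Q4_K_M while a 16 GB discrete GPU can hold 8B FP16.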

27.3 Why Sub-7B Models Fail at Tool Calling

Understanding why small models struggle with Function Calling is essential because it informs both selection and mitigation strategies.

Root Cause 1: Insufficient Instruction-Following Capacity

Function Calling requires the model to output strict JSON format AND make the correct judgment between "when to call a tool" vs. "when to answer directly."

Typical 3B failure modes:
- Describes the tool call in natural language instead of outputting JSON
- Outputs half-JSON, half-text
- Over-calls tools (treats every question as needing a tool)
- Ignores tools entirely (forgets it has them)

Example (3B model):
  User: Please read the contents of /tmp/data.csv
  3B output: I will use the read_file tool with path /tmp/data.csv, 
              and then return the file contents. [Content would appear here]

  (Correct behavior: output tool call JSON, wait for result, then process it)
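An Agent layer can detect these failure modes mechanically before acting on the output. The sketch below accepts a response only if it consists of exactly one `<tool_call>` block containing a JSON object with `name` and `arguments`, the wrapper format this chapter uses in its small-model system prompt. The function name `parse_tool_call` is a hypothetical helper, not a Hermes Agent API.

```python
import json
import re
from typing import Optional

# The whole response must be one <tool_call> block, with nothing before or after.
TOOL_CALL_RE = re.compile(
    r"\A\s*<tool_call>\s*(\{.*\})\s*</tool_call>\s*\Z", re.DOTALL
)


def parse_tool_call(output: str) -> Optional[dict]:
    """Return the tool call as a dict, or None if the output is not a clean call."""
    match = TOOL_CALL_RE.match(output)
    if match is None:
        return None  # narrated call, mixed text and JSON, or no call at all
    try:
        call = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None  # malformed or truncated JSON
    if not isinstance(call, dict):
        return None
    if not isinstance(call.get("name"), str):
        return None
    if not isinstance(call.get("arguments"), dict):
        return None
    return call


if __name__ == "__main__":
    narrated = "I will use the read_file tool with path /tmp/data.csv."
    clean = (
        '<tool_call>\n'
        '{"name": "read_file", "arguments": {"path": "/tmp/data.csv"}}\n'
        '</tool_call>'
    )
    print(parse_tool_call(narrated))
    print(parse_tool_call(clean))
```

A `None` result does not say whether the model should have called a tool; it only separates well-formed calls from everything else, so the Agent can retry or fall back instead of executing a half-parsed request.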

Root Cause 2: Weak Parameter Extraction

# Test: Parameter extraction accuracy

test_prompt = """
User: Read lines 50 to 100 from main.py using UTF-8 encoding

Expected tool call:
{
  "name": "read_file",
  "arguments": {
    "path": "main.py",
    "start_line": 50,
    "end_line": 100,
    "encoding": "utf-8"
  }
}
"""

# Common 3B errors (from actual testing):
fail_examples_3b = [
    # Type error: strings instead of integers
    {"path": "main.py", "start_line": "50", "end_line": "100"},
    
    # Missing parameters
    {"path": "main.py"},
    
    # Wrong parameter names
    {"filename": "main.py", "lines": "50-100"},
    
    # Case mismatch in a value ("UTF-8" instead of the expected "utf-8")
    {"path": "main.py", "start_line": 50, "end_line": 100, "encoding": "UTF-8"}
]
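Most of these errors can be caught before the tool runs by checking the arguments against a schema. The following is a minimal sketch; `READ_FILE_SCHEMA` is a hypothetical schema for the `read_file` tool in the test prompt, and `check_arguments` is an illustrative helper rather than part of any library.

```python
# Hypothetical schema for the read_file tool: parameter name -> expected type.
READ_FILE_SCHEMA = {
    "path": str,
    "start_line": int,
    "end_line": int,
    "encoding": str,
}


def check_arguments(args: dict, schema: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for name, expected in schema.items():
        if name not in args:
            problems.append(f"missing parameter: {name}")
        elif type(args[name]) is not expected:
            actual = type(args[name]).__name__
            problems.append(
                f"wrong type for {name}: expected {expected.__name__}, got {actual}"
            )
    for name in args:
        if name not in schema:
            problems.append(f"unknown parameter: {name}")
    return problems


if __name__ == "__main__":
    bad = {"path": "main.py", "start_line": "50", "end_line": "100"}
    for problem in check_arguments(bad, READ_FILE_SCHEMA):
        print(problem)
```

The problem list can be fed back to the model as a correction prompt. Value-level issues such as `"UTF-8"` versus `"utf-8"` pass this check because the type is right; those are usually better normalized in the tool itself than rejected.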

Root Cause 3: Multi-Step Chain Breakage

Agent tasks typically require multi-step tool calls where each step's output feeds the next. 3B models lose track of intermediate state, confabulate results, and fail to carry variables across steps.

Mitigation Techniques for Small Models

# Technique 1: Simplify the tool set (reduce choice paralysis)
# Never give small models more than 3–5 tools

# Technique 2: Reinforce format constraints
SMALL_MODEL_SYSTEM = """
CRITICAL: When calling a tool, output ONLY this JSON block:
<tool_call>
{"name": "tool_name", "arguments": {...}}
</tool_call>
No text before or after. No descriptions. Just the JSON.
"""

# Technique 3: Decompose tasks at the Agent layer, not the model layer
class SimpleStepAgent:
    def decompose_task(self, complex_task: str) -> list[str]:
        """Break complex tasks into single-step calls using rules, not the model."""
        pass
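One way to decompose by rules is to split the request on explicit sequencing words, so the model only ever sees one step at a time. This is a deliberately simple sketch: the boundary patterns (semicolons, "then", "and then") are an illustrative assumption and will not cover every phrasing, and the standalone function `decompose_task` is hypothetical.

```python
import re

# Step boundaries: a semicolon, or "then" / "and then" with an optional comma.
# An illustrative rule set, not an exhaustive one.
STEP_BOUNDARY = re.compile(r"\s*;\s*|,?\s+(?:and\s+)?then\s+", re.IGNORECASE)


def decompose_task(complex_task: str) -> list[str]:
    """Split a task description into ordered single-step instructions."""
    parts = STEP_BOUNDARY.split(complex_task.strip())
    # Drop empty fragments and trailing sentence punctuation.
    return [part.strip(" .") for part in parts if part.strip(" .")]


if __name__ == "__main__":
    task = "Read main.py, then count the lines; report the total."
    for number, step in enumerate(decompose_task(task), start=1):
        print(number, step)
```

Because the split is deterministic, the Agent layer owns the step order and can pass each step's result into the next prompt explicitly, rather than relying on a small model to remember intermediate state.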

27.4 Selection Decision Tree

Q1: Primary task type?
  ├─ Simple Q&A / text processing (no tool calls)
  │   → 3B or 8B, choose 3B to save resources
  │
  ├─ Single-step tool calling (e.g., search + answer)
  │   → Minimum 8B, recommend 8B
  │
  ├─ Multi-step Agent (3+ tool calls)
  │   → Minimum 70B, strongly recommend 70B
  │
  └─ Complex reasoning + multi-step Agent (code gen, data analysis)
      → Recommend 70B, use 405B if available

Q2: Deployment environment?
  ├─ Mobile / edge (< 4GB memory)
  │   → 3B only (limited use, not recommended for complex Agent)
  │
  ├─ Personal computer / MacBook (8–32GB)
  │   → 8B (Q4_K_M quantization)
  │
  ├─ Workstation / single server (32–80GB VRAM)
  │   → 70B (Q4_K_M quantization)
  │
  └─ Multi-GPU server / cloud (> 80GB VRAM)
      → 70B FP16 or 405B Q4

Q3: Latency requirement?
  ├─ < 1s first token (real-time interaction)
  │   → Smaller scale + quantization, or use API
  │
  └─ 2–10s acceptable (batch / background)
      → Can use larger scale

Q4: Budget?
  ├─ API only → Use Claude/GPT-4 class APIs
  ├─ < $1,000 hardware → 8B Q4 (consumer GPU)
  ├─ $3,000–$10,000 → 70B Q4 (A40 or 2×RTX 3090)
  └─ > $20,000 → 70B FP16 or 405B Q4
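The first two questions interact: the task sets a minimum scale and the environment sets a maximum. The sketch below encodes Q1 and Q2 as lookup tables and reports a conflict when the hardware cannot host the minimum scale. The category keys and the `"api"` fallback are assumptions made for this example; the tree itself only mentions the API option under latency and budget.

```python
SCALES = ["3B", "8B", "70B", "405B"]

# Q1: minimum scale per task type, taken from the decision tree.
MIN_SCALE_FOR_TASK = {
    "simple_qa": "3B",
    "single_step_tools": "8B",
    "multi_step_agent": "70B",
    "complex_reasoning_agent": "70B",
}

# Q2: largest scale each environment can host, taken from the decision tree.
MAX_SCALE_FOR_ENV = {
    "edge": "3B",
    "personal_computer": "8B",
    "workstation": "70B",
    "multi_gpu_server": "405B",
}


def select_scale(task: str, env: str) -> str:
    """Return the minimum adequate scale, or "api" if the hardware cannot host it."""
    needed = SCALES.index(MIN_SCALE_FOR_TASK[task])
    available = SCALES.index(MAX_SCALE_FOR_ENV[env])
    if needed > available:
        # Assumed fallback: use a hosted API when local hardware is too small.
        return "api"
    return SCALES[needed]


if __name__ == "__main__":
    print(select_scale("single_step_tools", "workstation"))
    print(select_scale("multi_step_agent", "personal_computer"))
```

The function returns the minimum adequate scale rather than the largest one that fits; where the tree recommends going larger when hardware allows (for example 405B for complex reasoning), that remains a judgment call on top of this result.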

27.5 Cloud API vs. Local Deployment Break-Even Analysis

Cost Comparison Framework

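def calculate_break_even(
    local_hardware_cost: float,
    local_monthly_electricity: float,
    local_monthly_maintenance: float,
    api_price_per_1k_tokens: float,
    monthly_token_usage: int,
    hardware_lifespan_months: int = 36
) -> dict:
    """Compare monthly local and API costs and estimate the hardware payback period."""
    # Spread the hardware purchase evenly over its expected lifespan.
    monthly_hardware_amortized = local_hardware_cost / hardware_lifespan_months
    local_monthly_operating = local_monthly_electricity + local_monthly_maintenance
    local_monthly_total = monthly_hardware_amortized + local_monthly_operating
    api_monthly_cost = monthly_token_usage / 1000 * api_price_per_1k_tokens
    monthly_savings = api_monthly_cost - local_monthly_total

    # Payback is measured against operating savings only. The hardware cost is
    # already the numerator, so subtracting its amortized share would count it twice.
    operating_savings = api_monthly_cost - local_monthly_operating
    if operating_savings > 0:
        break_even_months = local_hardware_cost / operating_savings
    else:
        break_even_months = float("inf")  # local never pays for itself

    return {
        "local_monthly_cost": local_monthly_total,
        "api_monthly_cost": api_monthly_cost,
        "monthly_savings": monthly_savings,
        "break_even_months": break_even_months,
        "recommendation": "local" if monthly_savings > 0 else "api"
    }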

Break-Even Token Volume by Scale

| Model Scale | Recommended Local Setup | Hardware Cost | Monthly Fixed Cost | API Cost Break-Even |
|---|---|---|---|---|
| 8B | RTX 4080 16GB | ~$1,200 | ~$63/month | > 21M tokens/month |
| 70B | A100 80GB | ~$12,000 | ~$380/month | > 127M tokens/month |
| 70B | 2×RTX 3090 | ~$3,000 | ~$163/month | > 54M tokens/month |
| 405B | 4×A100 80GB | ~$48,000 | ~$1,520/month | > 507M tokens/month |
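The break-even volumes follow from dividing the monthly fixed cost by the API price per token. The chapter does not state the API price it used; the table's own numbers are consistent with roughly $3 per million tokens ($0.003 per 1K), so the sketch below treats that price as an inferred assumption.

```python
# Inferred from the table (e.g. $63 / 21M tokens); not stated in the chapter.
ASSUMED_API_PRICE_PER_MILLION = 3.0


def break_even_tokens_per_month(
    monthly_fixed_cost: float,
    api_price_per_million: float = ASSUMED_API_PRICE_PER_MILLION,
) -> float:
    """Monthly token volume at which API spend equals the local fixed cost."""
    return monthly_fixed_cost / api_price_per_million * 1_000_000


if __name__ == "__main__":
    # Monthly fixed costs copied from the table above.
    setups = {
        "8B on RTX 4080": 63,
        "70B on A100 80GB": 380,
        "70B on 2x RTX 3090": 163,
        "405B on 4x A100 80GB": 1520,
    }
    for name, monthly_cost in setups.items():
        volume = break_even_tokens_per_month(monthly_cost)
        print(f"{name}: {volume / 1e6:.0f}M tokens/month")
```

Because the result scales inversely with the API price, halving the price doubles every break-even volume. That sensitivity is worth checking against current pricing before committing to hardware.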

Conclusion: For most small-to-mid applications (< 10M tokens/month), API is more economical than self-hosting 70B+. The primary driver of local deployment is usually data privacy, not cost savings.

Privacy-Driven Selection Logic

Local deployment reasons beyond cost:

1. Compliance requirements
   → Medical/financial/government data cannot leave corporate network
   → Minimum: 70B local

2. Ultra-low latency
   → Need < 100ms response, API network latency unacceptable
   → Prefer: 8B local (speed priority)

3. Offline operation
   → No network (factories, ships, defense)
   → Choose: largest scale that fits available hardware

4. Custom fine-tuning
   → Need LoRA training on private data
   → Need: enough VRAM for training

27.6 Summary

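Parameter scale selection is the most consequential deployment decision for Hermes Agent:

- 3B suits simple Q&A and text processing without tool calls; its multi-step success rate is only 18.6%.
- 8B is the minimum for Agent tasks (61.3% multi-step success) and runs on consumer hardware at Q4_K_M.
- 70B is the production-safe baseline (88.9% multi-step success) and needs roughly 42 GB of VRAM at Q4_K_M.
- 405B gives the best results (94.2% multi-step success) but requires a multi-GPU server.
- Below roughly 10M tokens/month, an API is usually cheaper than self-hosting 70B+; privacy, latency, offline operation, and fine-tuning are the main reasons to deploy locally.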


Discussion Questions

  1. Why does the multi-step success rate jump so dramatically from 8B (61%) to 70B (89%)? Is it purely parameter count, or are there other factors (training data, architecture)?

  2. In your business scenario, what is the cost of an Agent task failure? If 8B's 39% multi-step failure rate means user churn, should you skip straight to 70B or 405B?

  3. The break-even analysis for "2×RTX 3090 to run 70B" requires > 54M monthly tokens. What variables does this calculation miss? Which do you consider most significant?

  4. If only 8B is available but you need 70B-level Agent performance, what engineering compensation strategies would you employ?
