Chapter 32

Skill Testing Framework and Quality Assurance

A Skill without tests is like an airplane without instruments — it might fly fine, but you won't know when it's about to fail or what went wrong. This chapter builds a complete Skill testing system across four dimensions: functional testing, performance testing, boundary testing, and integration testing, and shows how to automate all of them in a CI/CD pipeline.

32.1 The Four Testing Dimensions

Testing Pyramid

┌─────────────────────────────────────────────────────┐
│                  Skill Test Pyramid                  │
│                                                      │
│           ┌──────────────────────┐                  │
│           │ Integration (10%)    │ ← Real Hermes     │
│           └──────────────────────┘                  │
│        ┌────────────────────────────┐               │
│        │ Performance/Boundary (20%) │ ← Stress/Edge  │
│        └────────────────────────────┘               │
│     ┌──────────────────────────────────┐            │
│     │      Functional Tests (70%)      │ ← Core logic │
│     └──────────────────────────────────┘            │
│                                                      │
│  Functional: validates "did it do the right thing?"  │
│  Performance: validates "is it fast enough?"         │
│  Boundary: validates "what happens on bad input?"    │
│  Integration: validates "does it work end-to-end?"   │
└─────────────────────────────────────────────────────┘

Dimension Responsibilities

Dimension	Goal	Key Questions	Tools
Functional	Core logic correctness	Did the Skill do the right thing?	pytest + Mock
Performance	Response time, resource usage	Is it fast and stable?	pytest-benchmark
Boundary	Abnormal input handling	Does it crash on bad input?	pytest + Hypothesis
Integration	End-to-end workflow	Does it work in a real environment?	Hermes TestHarness

32.2 Unit Testing Framework Setup

Test Directory Structure

tests/
├── conftest.py
├── unit/
│   ├── test_input_validation.py
│   ├── test_news_fetcher.py
│   ├── test_digest_formatter.py
│   └── test_error_handling.py
├── performance/
│   ├── test_response_time.py
│   └── test_concurrent_load.py
├── boundary/
│   ├── test_edge_cases.py
│   └── test_malformed_input.py
├── integration/
│   ├── test_full_workflow.py
│   └── test_hermes_integration.py
└── fixtures/
    ├── sample_news_response.json
    ├── sample_article_content.html
    └── edge_case_inputs.json

conftest.py: Shared Fixtures

"""Shared test fixtures and configuration."""
import pytest
import json
from pathlib import Path
from unittest.mock import MagicMock
from tools.news_fetcher import NewsFetcher, NewsArticle
from tools.digest_formatter import DigestFormatter

FIXTURES_DIR = Path(__file__).parent / "fixtures"

@pytest.fixture
def sample_articles():
    return [
        NewsArticle(
            title="EU AI Act Implementation Accelerates",
            url="https://reuters.com/eu-ai-act",
            snippet="The European Commission announced new timelines...",
            source="reuters.com",
            published_date="2 hours ago",
            full_content="Full article content for EU AI Act..."
        ),
        NewsArticle(
            title="AI Regulation: Industry Response",
            url="https://techcrunch.com/ai-regulation",
            snippet="Tech companies are preparing for...",
            source="techcrunch.com",
            published_date="5 hours ago",
            full_content="Full article content for industry response..."
        ),
    ]

@pytest.fixture
def news_fetcher():
    return NewsFetcher(search_api_key="test-key-123", timeout=5)

@pytest.fixture
def formatter():
    return DigestFormatter()

@pytest.fixture
def sample_search_response():
    return {"results": [
        {"title": "AI Act Update", "url": "https://reuters.com/ai",
         "description": "EU...", "age": "2 hours ago"},
        {"title": "Industry Response", "url": "https://techcrunch.com/ai",
         "description": "Companies...", "age": "5 hours ago"},
    ]}
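
The conftest.py above defines FIXTURES_DIR and imports json, but the fixtures shown return inline data. As a minimal sketch of how the files under tests/fixtures/ could be used instead, the helper below reads one JSON fixture from disk. The name load_json_fixture is hypothetical and is not part of the chapter's code.

```python
import json
from pathlib import Path
from typing import Any


def load_json_fixture(fixtures_dir: Path, name: str) -> Any:
    """Read and parse one JSON fixture file, e.g. sample_news_response.json.

    Hypothetical helper: the chapter's conftest.py does not define it.
    """
    path = fixtures_dir / name
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)
```

With this in place, a fixture such as sample_search_response could return load_json_fixture(FIXTURES_DIR, "sample_news_response.json"), assuming that file holds the same structure as the inline dictionary.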

unit/test_input_validation.py

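import pytest
# Assumption: the schema is exported by tools.validation; adjust the import to wherever it is defined
from tools.validation import NEWS_DIGEST_INPUT_SCHEMA, SkillInputValidator, SkillInputError

@pytest.fixture
def validator():
    return SkillInputValidator(NEWS_DIGEST_INPUT_SCHEMA)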

class TestRequiredParameters:
    def test_valid_minimal_input(self, validator):
        is_valid, errors = validator.validate({"topics": ["AI regulation"]})
        assert is_valid is True
        assert errors == []
    
    def test_missing_topics_fails(self, validator):
        is_valid, errors = validator.validate({"time_range": "today"})
        assert is_valid is False
        assert any("topics" in e for e in errors)
    
    def test_empty_topics_array_fails(self, validator):
        is_valid, _ = validator.validate({"topics": []})
        assert is_valid is False
    
    def test_too_many_topics_fails(self, validator):
        is_valid, _ = validator.validate({"topics": ["t1","t2","t3","t4","t5","t6"]})
        assert is_valid is False

class TestOptionalParameters:
    @pytest.mark.parametrize("time_range", ["today", "24h", "this_week", "this_month"])
    def test_valid_time_ranges(self, validator, time_range):
        is_valid, _ = validator.validate({"topics": ["AI"], "time_range": time_range})
        assert is_valid is True
    
    def test_invalid_time_range_fails(self, validator):
        is_valid, errors = validator.validate({"topics": ["AI"], "time_range": "yesterday"})
        assert is_valid is False
    
    @pytest.mark.parametrize("count,expected_valid", [
        (1, True), (5, True), (10, True), (0, False), (11, False), (-1, False)
    ])
    def test_max_articles_range(self, validator, count, expected_valid):
        is_valid, _ = validator.validate({"topics": ["AI"], "max_articles_per_topic": count})
        assert is_valid is expected_valid

class TestTypeCoercion:
    def test_string_int_coercion(self, validator):
        coerced = validator.coerce_and_validate({"topics": ["AI"], "max_articles_per_topic": "5"})
        assert coerced["max_articles_per_topic"] == 5
        assert isinstance(coerced["max_articles_per_topic"], int)
    
    def test_string_bool_coercion(self, validator):
        for val in ["true", "True", "yes", "1"]:
            coerced = validator.coerce_and_validate({"topics": ["AI"], "save_to_file": val})
            assert coerced["save_to_file"] is True
    
    def test_single_string_to_array(self, validator):
        coerced = validator.coerce_and_validate({"topics": "AI regulation"})
        assert coerced["topics"] == ["AI regulation"]
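
The tests above pin down the validator's contract without showing its implementation. The sketch below is one minimal way to satisfy that contract. It is not the chapter's tools.validation module: the function names, the error messages, the use of ValueError in place of SkillInputError, the accepted false-like strings, and the 100-character upper bound on topics are all assumptions inferred from the test cases.

```python
from typing import Any

VALID_TIME_RANGES = ("today", "24h", "this_week", "this_month")
TRUE_STRINGS = ("true", "yes", "1")
FALSE_STRINGS = ("false", "no", "0")  # assumption: the tests only cover true-like values


def validate(data: dict[str, Any]) -> tuple[bool, list[str]]:
    """Return (is_valid, errors) for a news-digest input dictionary."""
    errors: list[str] = []

    topics = data.get("topics")
    if topics is None:
        errors.append("topics: required")
    elif not isinstance(topics, list) or not 1 <= len(topics) <= 5:
        errors.append("topics: must be a list of 1 to 5 items")
    elif not all(isinstance(t, str) and 2 <= len(t) <= 100 for t in topics):
        errors.append("topics: each topic must be a string of 2 to 100 characters")

    if "time_range" in data and data["time_range"] not in VALID_TIME_RANGES:
        errors.append(f"time_range: must be one of {list(VALID_TIME_RANGES)}")

    if "max_articles_per_topic" in data:
        count = data["max_articles_per_topic"]
        # bool is a subclass of int, so it is excluded explicitly
        if isinstance(count, bool) or not isinstance(count, int) or not 1 <= count <= 10:
            errors.append("max_articles_per_topic: must be an integer from 1 to 10")

    if "save_to_file" in data and not isinstance(data["save_to_file"], bool):
        errors.append("save_to_file: must be a boolean")

    return (not errors, errors)


def coerce_and_validate(data: dict[str, Any]) -> dict[str, Any]:
    """Coerce common string forms to their target types, then validate."""
    coerced = dict(data)  # never mutate the caller's dictionary

    if isinstance(coerced.get("topics"), str):
        coerced["topics"] = [coerced["topics"]]

    count = coerced.get("max_articles_per_topic")
    if isinstance(count, str):
        try:
            coerced["max_articles_per_topic"] = int(count)
        except ValueError:
            pass  # left as a string so validate() reports it

    flag = coerced.get("save_to_file")
    if isinstance(flag, str):
        lowered = flag.strip().lower()
        if lowered in TRUE_STRINGS:
            coerced["save_to_file"] = True
        elif lowered in FALSE_STRINGS:
            coerced["save_to_file"] = False

    is_valid, errors = validate(coerced)
    if not is_valid:
        raise ValueError("; ".join(errors))
    return coerced
```

Coercion runs before validation so that the validation rules only ever see the target types, which keeps each rule a single type-and-range check.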

unit/test_news_fetcher.py

import pytest
import httpx
from unittest.mock import patch, MagicMock

class TestSearchNews:
    def test_returns_articles(self, news_fetcher, sample_search_response):
        with patch.object(news_fetcher.client, 'get') as mock_get:
            mock_get.return_value = MagicMock(
                json=lambda: sample_search_response,
                raise_for_status=MagicMock()
            )
            articles = news_fetcher.search_news("AI regulation")
            assert len(articles) == 2
    
    def test_empty_results(self, news_fetcher):
        with patch.object(news_fetcher.client, 'get') as mock_get:
            mock_get.return_value = MagicMock(
                json=lambda: {"results": []},
                raise_for_status=MagicMock()
            )
            assert news_fetcher.search_news("xyz") == []
    
    def test_fetch_returns_none_on_timeout(self, news_fetcher):
        with patch.object(news_fetcher.client, 'get',
                         side_effect=httpx.TimeoutException("timeout")):
            content = news_fetcher.fetch_article_content("https://slow.example.com")
            assert content is None  # No exception, graceful degradation
    
    def test_fetch_truncates_long_content(self, news_fetcher):
        long_html = f"<html><body><p>{'A' * 5000}</p></body></html>"
        with patch.object(news_fetcher.client, 'get') as mock_get:
            mock_get.return_value = MagicMock(
                text=long_html,
                raise_for_status=MagicMock()
            )
            content = news_fetcher.fetch_article_content("https://example.com")
            assert content is not None  # otherwise a None would surface as a TypeError from len()
            assert len(content) <= 2000
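
The last two tests describe two behaviors of the fetcher: return None on timeout instead of raising, and cap the returned text at 2000 characters. The sketch below shows that pattern in isolation. It is not the chapter's NewsFetcher: fetch_article_text and the injected get_html callable are hypothetical, the tag stripping is deliberately crude, and the built-in TimeoutError stands in for httpx.TimeoutException so the block runs without third-party packages.

```python
import re
from typing import Callable, Optional

MAX_CONTENT_CHARS = 2000  # limit asserted by test_fetch_truncates_long_content


def fetch_article_text(get_html: Callable[[str], str], url: str) -> Optional[str]:
    """Return plain text capped at MAX_CONTENT_CHARS, or None if the fetch times out.

    `get_html` is a hypothetical injected callable that returns the page HTML.
    """
    try:
        html = get_html(url)
    except TimeoutError:
        return None  # graceful degradation: the caller can fall back to the snippet
    text = re.sub(r"<[^>]+>", " ", html)  # crude tag stripping, for illustration only
    text = " ".join(text.split())         # collapse runs of whitespace
    return text[:MAX_CONTENT_CHARS]
```

Returning None rather than raising pushes the decision to the caller, which is exactly why callers need the None check discussed later under Bug Pattern 1.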

32.3 Mocking Tool Calls

Hermes Tool Call Mocker

"""Mock tool calls — test complete Agent workflows without real API calls."""
from dataclasses import dataclass, field
from typing import Any, Callable, Optional
import time

@dataclass
class MockToolCall:
    tool_name: str
    input: dict
    timestamp: float

@dataclass
class MockToolConfig:
    response: Any = None
    error: Optional[Exception] = None
    side_effect: Optional[Callable] = None
    call_limit: Optional[int] = None
    call_count: int = field(default=0, init=False)

class HermesToolMocker:
    def __init__(self):
        self._mocks: dict[str, MockToolConfig] = {}
        self._call_history: list[MockToolCall] = []
    
    def register(self, tool_name: str, response=None, error=None,
                 side_effect=None, call_limit=None) -> "HermesToolMocker":
        self._mocks[tool_name] = MockToolConfig(
            response=response, error=error,
            side_effect=side_effect, call_limit=call_limit
        )
        return self
    
    def handle_tool_call(self, tool_name: str, tool_input: dict) -> Any:
        self._call_history.append(
            MockToolCall(tool_name=tool_name, input=tool_input, timestamp=time.time())
        )
        if tool_name not in self._mocks:
            raise ValueError(f"Tool '{tool_name}' has no mock. Register it first.")
        
        config = self._mocks[tool_name]
        config.call_count += 1
        
        if config.call_limit is not None and config.call_count > config.call_limit:
            raise RuntimeError(f"'{tool_name}' exceeded call limit of {config.call_limit}")
        if config.error:
            raise config.error
        if config.side_effect:
            return config.side_effect(tool_input)
        return config.response
    
    def assert_tool_called(self, tool_name: str, times: Optional[int] = None):
        calls = [c for c in self._call_history if c.tool_name == tool_name]
        if not calls:
            raise AssertionError(f"Expected '{tool_name}' to be called, but it wasn't")
        if times is not None and len(calls) != times:
            raise AssertionError(
                f"Expected '{tool_name}' called {times} times, got {len(calls)}"
            )
    
    def assert_tool_not_called(self, tool_name: str):
        calls = [c for c in self._call_history if c.tool_name == tool_name]
        if calls:
            raise AssertionError(f"Expected '{tool_name}' NOT to be called, got {len(calls)} calls")
    
    def assert_tool_called_with(self, tool_name: str, **expected_params):
        matching = [
            c for c in self._call_history
            if c.tool_name == tool_name and
            all(c.input.get(k) == v for k, v in expected_params.items())
        ]
        if not matching:
            raise AssertionError(
                f"'{tool_name}' not called with {expected_params}. "
                f"Actual calls: {[c.input for c in self._call_history if c.tool_name == tool_name]}"
            )
    
    def get_calls(self, tool_name: str) -> list[MockToolCall]:
        """Return the recorded calls for one tool, in call order."""
        return [c for c in self._call_history if c.tool_name == tool_name]

# Usage in tests
class TestSkillWithMocks:
    def test_calls_search_tool(self, sample_articles):
        mocker = HermesToolMocker()
        mocker.register("web_search", response={"results": [
            {"title": "AI News", "url": "https://example.com", "snippet": "..."}
        ]}).register("fetch_url", response="Full article content...")
        
        skill = NewsDigestSkill(tool_handler=mocker.handle_tool_call)
        result = skill.run({"topics": ["AI regulation"]})
        
        mocker.assert_tool_called("web_search")
        assert result["status"] == "success"
    
    def test_handles_search_failure(self):
        mocker = HermesToolMocker()
        mocker.register("web_search", error=Exception("API rate limit exceeded"))
        
        skill = NewsDigestSkill(tool_handler=mocker.handle_tool_call)
        result = skill.run({"topics": ["AI"]})
        
        assert result["status"] == "failed"
    
    def test_saves_file_when_requested(self):
        mocker = HermesToolMocker()
        mocker.register("web_search", response={"results": [...]})
        mocker.register("fetch_url", response="Content...")
        mocker.register("write_file", response={"success": True})
        
        skill = NewsDigestSkill(tool_handler=mocker.handle_tool_call)
        skill.run({"topics": ["AI"], "save_to_file": True})
        
        mocker.assert_tool_called("write_file", times=1)
        write_call = mocker.get_calls("write_file")[0]
        assert write_call.input["path"].endswith(".md")
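
Several tests in this chapter request fixtures named mocker, mocker_with_responses, and mocker_factory, which the listings do not define. The sketch below shows one way a factory behind those fixtures could look. ToolMockerStandIn is a stripped-down, hypothetical stand-in for HermesToolMocker so the block runs on its own, and make_mocker with its default responses is also an assumption.

```python
from typing import Any, Optional


class ToolMockerStandIn:
    """Hypothetical minimal stand-in for HermesToolMocker (register and dispatch only)."""

    def __init__(self) -> None:
        self._responses: dict[str, Any] = {}
        self.calls: list[tuple[str, dict]] = []

    def register(self, tool_name: str, response: Any = None) -> "ToolMockerStandIn":
        self._responses[tool_name] = response
        return self  # allow chained register() calls

    def handle_tool_call(self, tool_name: str, tool_input: dict) -> Any:
        self.calls.append((tool_name, tool_input))
        if tool_name not in self._responses:
            raise ValueError(f"Tool '{tool_name}' has no mock. Register it first.")
        return self._responses[tool_name]


def make_mocker(search_response: Optional[list] = None,
                fetch_response: Any = "Content...") -> ToolMockerStandIn:
    """Build a fresh mocker on every call so no state is shared between tests."""
    results = search_response if search_response is not None else [
        {"title": "News", "url": "https://example.com", "snippet": "..."}
    ]
    return (ToolMockerStandIn()
            .register("web_search", response={"results": results})
            .register("fetch_url", response=fetch_response))
```

In conftest.py, a mocker_factory fixture could simply return make_mocker, and mocker_with_responses could return make_mocker(). Note that a fixture named mocker shadows the one provided by the pytest-mock plugin if that plugin is installed, so a more specific name such as tool_mocker may be safer.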

32.4 Performance and Boundary Tests

class TestPerformance:
    def test_input_validation_fast(self, benchmark, validator):
        """Benchmark validation (target: under 1 ms). Only correctness is asserted here; pytest-benchmark records the timing."""
        result = benchmark(validator.validate, {"topics": ["AI", "Climate"], "max_articles_per_topic": 5})
        assert result[0] is True
    
    @pytest.mark.timeout(5)
    def test_full_skill_within_timeout(self, mocker_with_responses):
        """Full skill (with mocks) must complete within 5 seconds."""
        import time
        start = time.perf_counter()  # monotonic clock, unaffected by system clock adjustments
        skill = NewsDigestSkill(tool_handler=mocker_with_responses.handle_tool_call)
        result = skill.run({"topics": ["AI"]})
        assert time.perf_counter() - start < 5.0
        assert result["status"] == "success"

class TestBoundaryConditions:
    @pytest.mark.parametrize("topic_len", [2, 50, 100])
    def test_topic_at_valid_length_boundary(self, validator, topic_len):
        is_valid, _ = validator.validate({"topics": ["A" * topic_len]})
        assert is_valid is True
    
    def test_topic_too_short_fails(self, validator):
        is_valid, _ = validator.validate({"topics": ["A"]})
        assert is_valid is False
    
    def test_unicode_topics(self, validator):
        data = {"topics": ["人工智能监管", "Künstliche Intelligenz", "気候変動"]}
        is_valid, _ = validator.validate(data)
        assert is_valid is True
    
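    def test_sql_injection_does_not_crash(self, mocker):
        """Malicious input must be treated as a plain search query, not cause a crash."""
        # Register the search tool; an unregistered tool would make the mocker itself raise
        mocker.register("web_search", response={"results": []})
        skill = NewsDigestSkill(tool_handler=mocker.handle_tool_call)
        result = skill.run({"topics": ["'; DROP TABLE news; --"]})
        # Reaching this line means no exception escaped; the Skill must still report a status
        assert "status" in result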
    
    def test_none_fetch_result_degrades_gracefully(self, mocker):
        mocker.register("web_search", response={"results": [{"url": "https://x.com"}]})
        mocker.register("fetch_url", response=None)  # Simulate timeout
        
        skill = NewsDigestSkill(tool_handler=mocker.handle_tool_call)
        result = skill.run({"topics": ["AI"]})
        
        # Should not crash; should use snippet as fallback
        assert result["status"] in ["success", "partial_success"]
    
    def test_partial_success_when_one_topic_fails(self, mocker):
        call_count = {"n": 0}
        def flaky_search(input_data):
            call_count["n"] += 1
            if call_count["n"] == 2:
                raise Exception("Rate limit")
            return {"results": [{"title": "News", "url": "https://x.com"}]}
        
        mocker.register("web_search", side_effect=flaky_search)
        mocker.register("fetch_url", response="Content...")
        
        skill = NewsDigestSkill(tool_handler=mocker.handle_tool_call)
        result = skill.run({"topics": ["AI", "Climate", "Tech"]})
        
        assert result["status"] == "partial_success"
        assert len(result["topics_covered"]) == 2
        assert len(result["errors"]) == 1
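
The partial-success test above implies that each topic is processed in isolation and that the final status is derived from how many topics succeeded. The sketch below illustrates that idea. It is not the chapter's NewsDigestSkill: run_topics, the search callable, and the shape of the error entries are hypothetical.

```python
from typing import Any, Callable


def run_topics(topics: list[str], search: Callable[[str], Any]) -> dict[str, Any]:
    """Run `search` once per topic, isolating failures so one bad topic cannot sink the rest."""
    covered: list[str] = []
    errors: list[dict[str, str]] = []
    for topic in topics:
        try:
            search(topic)
        except Exception as exc:  # broad on purpose: any tool failure is recorded, not raised
            errors.append({"topic": topic, "message": str(exc)})
        else:
            covered.append(topic)

    if not errors:
        status = "success"
    elif covered:
        status = "partial_success"  # at least one topic worked, at least one failed
    else:
        status = "failed"           # nothing succeeded
    return {"status": status, "topics_covered": covered, "errors": errors}
```

Keeping the try/except inside the loop is the design choice that matters: a single try around the whole loop would turn the first failure into a failure of the entire run.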

32.5 CI/CD Configuration

# .github/workflows/skill-tests.yml
name: Skill Test Suite

on:
  push:
    branches: [main, develop]
    tags: ['v*']  # Tag pushes must trigger the workflow, or the publish job can never run
  pull_request:
    branches: [main]
  schedule:
    - cron: '0 8 * * *'  # Daily integration tests at 08:00 UTC

jobs:
  lint-and-validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: {python-version: '3.11'}
      - run: pip install ruff mypy hermes-cli && pip install -r requirements.txt
      - run: ruff check .
      - run: mypy tools/ --ignore-missing-imports
      - run: hermes skill validate

  unit-tests:
    runs-on: ubuntu-latest
    needs: lint-and-validate
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: {python-version: '3.11'}
      - run: pip install -r requirements.txt pytest pytest-cov pytest-timeout
      - name: Run unit tests with coverage
        run: |
          pytest tests/unit/ \
            --cov=tools \
            --cov-report=xml \
            --cov-fail-under=80 \
            -v
      - uses: codecov/codecov-action@v4
        with: {file: ./coverage.xml}

  boundary-tests:
    runs-on: ubuntu-latest
    needs: unit-tests
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: {python-version: '3.11'}
      - run: pip install -r requirements.txt pytest hypothesis
      - run: pytest tests/boundary/ -v --tb=short

  performance-tests:
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    needs: unit-tests
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: {python-version: '3.11'}
      - run: pip install -r requirements.txt pytest pytest-benchmark pytest-timeout
      - name: Run and check performance
        run: |
          pytest tests/performance/ --benchmark-json=results.json -v
          python scripts/check_benchmark_regression.py --current=results.json --threshold=20

  integration-tests:
    runs-on: ubuntu-latest
    if: github.event_name == 'schedule'
    needs: unit-tests
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: {python-version: '3.11'}
      - run: pip install -r requirements.txt pytest pytest-timeout hermes-sdk
      - name: Run integration tests
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          SEARCH_API_KEY: ${{ secrets.SEARCH_API_KEY }}
        run: pytest tests/integration/ -v --timeout=60 -m "not slow"

  publish:
    runs-on: ubuntu-latest
    if: startsWith(github.ref, 'refs/tags/v')
    needs: [unit-tests, boundary-tests, lint-and-validate]
    steps:
      - uses: actions/checkout@v4
      - run: pip install hermes-cli
      - env:
          CLAWHUB_TOKEN: ${{ secrets.CLAWHUB_TOKEN }}
        run: hermes skill publish --token $CLAWHUB_TOKEN
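
The performance job calls scripts/check_benchmark_regression.py with --current and --threshold, but the script itself is not shown, and neither is the source of the baseline it compares against. The sketch below covers only the comparison logic such a script might contain. The function names are hypothetical, the baseline is assumed to be a JSON report saved from an earlier run, and both reports are assumed to follow the layout pytest-benchmark writes with --benchmark-json (a top-level "benchmarks" list whose entries carry "name" and "stats"."mean").

```python
from typing import Any


def mean_by_name(report: dict[str, Any]) -> dict[str, float]:
    """Map each benchmark name to its mean runtime in seconds."""
    return {b["name"]: b["stats"]["mean"] for b in report.get("benchmarks", [])}


def find_regressions(baseline: dict[str, Any], current: dict[str, Any],
                     threshold_pct: float) -> list[str]:
    """Return one message per benchmark whose mean slowed by more than threshold_pct."""
    base = mean_by_name(baseline)
    messages: list[str] = []
    for name, cur_mean in mean_by_name(current).items():
        base_mean = base.get(name)
        if base_mean is None or base_mean <= 0:
            continue  # new benchmark or unusable baseline: nothing to compare against
        change_pct = (cur_mean - base_mean) / base_mean * 100
        if change_pct > threshold_pct:
            messages.append(f"{name}: mean changed {change_pct:+.1f}% vs baseline")
    return messages
```

A command-line wrapper would load both JSON files, call find_regressions, print the messages, and exit with a non-zero status when the list is non-empty so the CI step fails.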

32.6 Common Skill Bug Patterns and Prevention

Bug Pattern Library

# Bug Pattern 1: Unchecked tool call results
# Symptom: fetch_url returns None on timeout, code calls None.split()
# Prevention:
def test_handles_none_fetch_result(mocker):
    mocker.register("web_search", response={"results": [{"url": "https://x.com"}]})
    mocker.register("fetch_url", response=None)
    skill = NewsDigestSkill(tool_handler=mocker.handle_tool_call)
    result = skill.run({"topics": ["AI"]})
    assert result["status"] in ["success", "partial_success"]
    assert not any(e["code"] == "INTERNAL_ERROR" for e in result.get("errors", []))


# Bug Pattern 2: State leaking between calls
# Symptom: Results from call 1 contaminate call 2
# Prevention:
def test_no_state_leak_between_calls(mocker_factory):
    skill = NewsDigestSkill()
    
    mocker1 = mocker_factory(search_response=[{"title": "AI News Call 1"}])
    result1 = skill.run({"topics": ["AI"]}, tool_handler=mocker1.handle_tool_call)
    
    mocker2 = mocker_factory(search_response=[{"title": "Climate News Call 2"}])
    result2 = skill.run({"topics": ["Climate"]}, tool_handler=mocker2.handle_tool_call)
    
    # Results must be fully independent
    assert "Call 2" not in str(result1)
    assert "Call 1" not in str(result2)


# Bug Pattern 3: Forgotten partial success
# Symptom: 1 of 3 topics fails → whole Skill fails instead of partial_success
# Prevention: (see TestBoundaryConditions.test_partial_success_when_one_topic_fails above)


# Bug Pattern 4: Token budget overflow
# Symptom: Too much article content exceeds context window
# Prevention:
def test_large_content_stays_within_token_budget(mocker):
    huge_content = "Very long article. " * 1000  # 3,000 words, about 19,000 characters
    mocker.register("web_search", response={"results": [
        {"url": f"https://example{i}.com", "title": f"Art {i}", "snippet": "..."}
        for i in range(10)
    ]})
    mocker.register("fetch_url", response=huge_content)
    
    skill = NewsDigestSkill(tool_handler=mocker.handle_tool_call)
    result = skill.run({"topics": ["AI"], "max_articles_per_topic": 10})
    
    # Output should be within reasonable bounds (~40k chars ≈ 10k tokens)
    assert len(str(result)) < 40_000
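
The token budget test in Bug Pattern 4 only checks the size of the output; it does not show how a Skill might keep within the limit. One simple approach, sketched below under assumptions, is to give each article an equal share of a character budget before the content reaches the model. The function fit_to_budget and the 30,000-character budget in the usage note are hypothetical, chosen to leave headroom under the 40,000-character bound used in the test.

```python
def fit_to_budget(contents: list[str], max_total_chars: int) -> list[str]:
    """Trim each article to an equal share of the total character budget.

    Hypothetical helper: equal shares are the simplest policy, not the only one.
    """
    if not contents:
        return []
    per_article = max_total_chars // len(contents)
    return [text[:per_article] for text in contents]
```

For ten articles and a budget of 30,000 characters, each article is cut to 3,000 characters. Short articles are left untouched, so the unused part of their share is wasted; a more careful policy could redistribute it to longer articles.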

Coverage Targets

Module	Target Coverage	Notes
Core logic (tools/)	> 90%	Must be high
Input validation	> 95%	Critical security layer
Error handling paths	> 85%	Verify degradation works
Output formatting	> 80%	Test all formats
Overall	> 80%	CI enforced

32.7 Summary

A complete Skill testing system is the foundation of quality assurance:

Four-dimension testing: Functional (70%) → Performance/Boundary (20%) → Integration (10%) pyramid structure
Mock-first: Use HermesToolMocker for tool calls — tests run offline, no API costs
Boundary coverage: Unicode, injection attacks, None returns, extreme-length inputs — every boundary needs a test
CI/CD automation: Unit tests on every push, performance comparison on PRs, integration tests daily on schedule
Bug pattern library: Identify and prevent 4 common Skill bugs (None checks, state leakage, error aggregation, token overflow)

Tests are the moat protecting Skill quality — investing in tests is investing in long-term maintainability.

Discussion Questions

Mock tool testing validates Skill logic effectively, but cannot test that the LLM actually follows the steps described in SKILL.md. How would you design a test to validate SKILL.md's behavioral effectiveness?
Integration tests require real API calls — they're expensive and flaky. How would you design a Record-and-Replay mechanism so integration tests validate real behavior while running stably in CI?
The test_partial_success_when_one_topic_fails test validates partial success. But if all 3 topics fail, the result should be failed, not partial_success. How do you elegantly distinguish these cases in code? Write the corresponding test.
The performance test allows a 20% regression threshold. When would you tolerate a larger regression? When should you tighten it to 5%? What's your decision framework?
