Chapter 32: Skill Testing Framework and Quality Assurance
A Skill without tests is like an airplane without instruments — it might fly fine, but you won't know when it's about to fail or what went wrong. This chapter builds a complete Skill testing system across four dimensions: functional testing, performance testing, boundary testing, and integration testing, and shows how to automate all of them in a CI/CD pipeline.
32.1 The Four Testing Dimensions
Testing Pyramid
┌────────────────────────────────────────────────────────────┐
│                     Skill Test Pyramid                     │
│                                                            │
│        ┌──────────────────────┐                            │
│        │  Integration (10%)   │        ← Real Hermes       │
│        └──────────────────────┘                            │
│     ┌────────────────────────────┐                         │
│     │ Performance/Boundary (20%) │     ← Stress/Edge       │
│     └────────────────────────────┘                         │
│  ┌──────────────────────────────────┐                      │
│  │      Functional Tests (70%)      │  ← Core logic        │
│  └──────────────────────────────────┘                      │
│                                                            │
│  Functional:  validates "did it do the right thing?"       │
│  Performance: validates "is it fast enough?"               │
│  Boundary:    validates "what happens when it goes wrong?" │
│  Integration: validates "does it work end-to-end?"         │
└────────────────────────────────────────────────────────────┘
Dimension Responsibilities
| Dimension | Goal | Key Questions | Tools |
|---|---|---|---|
| Functional | Core logic correctness | Did the Skill do the right thing? | pytest + Mock |
| Performance | Response time, resource usage | Is it fast and stable? | pytest-benchmark |
| Boundary | Abnormal input handling | Does it crash on bad input? | pytest + Hypothesis |
| Integration | End-to-end workflow | Does it work in a real environment? | Hermes TestHarness |
32.2 Unit Testing Framework Setup
Test Directory Structure
tests/
├── conftest.py
├── unit/
│ ├── test_input_validation.py
│ ├── test_news_fetcher.py
│ ├── test_digest_formatter.py
│ └── test_error_handling.py
├── performance/
│ ├── test_response_time.py
│ └── test_concurrent_load.py
├── boundary/
│ ├── test_edge_cases.py
│ └── test_malformed_input.py
├── integration/
│ ├── test_full_workflow.py
│ └── test_hermes_integration.py
└── fixtures/
├── sample_news_response.json
├── sample_article_content.html
└── edge_case_inputs.json
conftest.py: Shared Fixtures
"""Shared test fixtures and configuration."""
import pytest
import json
from pathlib import Path
from unittest.mock import MagicMock

from tools.news_fetcher import NewsFetcher, NewsArticle
from tools.digest_formatter import DigestFormatter

FIXTURES_DIR = Path(__file__).parent / "fixtures"


@pytest.fixture
def sample_articles():
    return [
        NewsArticle(
            title="EU AI Act Implementation Accelerates",
            url="https://reuters.com/eu-ai-act",
            snippet="The European Commission announced new timelines...",
            source="reuters.com",
            published_date="2 hours ago",
            full_content="Full article content for EU AI Act..."
        ),
        NewsArticle(
            title="AI Regulation: Industry Response",
            url="https://techcrunch.com/ai-regulation",
            snippet="Tech companies are preparing for...",
            source="techcrunch.com",
            published_date="5 hours ago",
            full_content="Full article content for industry response..."
        ),
    ]


@pytest.fixture
def news_fetcher():
    return NewsFetcher(search_api_key="test-key-123", timeout=5)


@pytest.fixture
def formatter():
    return DigestFormatter()


@pytest.fixture
def sample_search_response():
    return {"results": [
        {"title": "AI Act Update", "url": "https://reuters.com/ai",
         "description": "EU...", "age": "2 hours ago"},
        {"title": "Industry Response", "url": "https://techcrunch.com/ai",
         "description": "Companies...", "age": "5 hours ago"},
    ]}
unit/test_input_validation.py
import pytest
from tools.validation import SkillInputValidator, SkillInputError
# Assumption: the schema is exported next to the validator.
# Import it from wherever your Skill actually defines it.
from tools.validation import NEWS_DIGEST_INPUT_SCHEMA


@pytest.fixture
def validator():
    return SkillInputValidator(NEWS_DIGEST_INPUT_SCHEMA)


class TestRequiredParameters:
    def test_valid_minimal_input(self, validator):
        is_valid, errors = validator.validate({"topics": ["AI regulation"]})
        assert is_valid is True
        assert errors == []

    def test_missing_topics_fails(self, validator):
        is_valid, errors = validator.validate({"time_range": "today"})
        assert is_valid is False
        assert any("topics" in e for e in errors)

    def test_empty_topics_array_fails(self, validator):
        is_valid, _ = validator.validate({"topics": []})
        assert is_valid is False

    def test_too_many_topics_fails(self, validator):
        is_valid, _ = validator.validate({"topics": ["t1", "t2", "t3", "t4", "t5", "t6"]})
        assert is_valid is False


class TestOptionalParameters:
    @pytest.mark.parametrize("time_range", ["today", "24h", "this_week", "this_month"])
    def test_valid_time_ranges(self, validator, time_range):
        is_valid, _ = validator.validate({"topics": ["AI"], "time_range": time_range})
        assert is_valid is True

    def test_invalid_time_range_fails(self, validator):
        is_valid, errors = validator.validate({"topics": ["AI"], "time_range": "yesterday"})
        assert is_valid is False

    @pytest.mark.parametrize("count,expected_valid", [
        (1, True), (5, True), (10, True), (0, False), (11, False), (-1, False)
    ])
    def test_max_articles_range(self, validator, count, expected_valid):
        is_valid, _ = validator.validate({"topics": ["AI"], "max_articles_per_topic": count})
        assert is_valid is expected_valid


class TestTypeCoercion:
    def test_string_int_coercion(self, validator):
        coerced = validator.coerce_and_validate({"topics": ["AI"], "max_articles_per_topic": "5"})
        assert coerced["max_articles_per_topic"] == 5
        assert isinstance(coerced["max_articles_per_topic"], int)

    def test_string_bool_coercion(self, validator):
        for val in ["true", "True", "yes", "1"]:
            coerced = validator.coerce_and_validate({"topics": ["AI"], "save_to_file": val})
            assert coerced["save_to_file"] is True

    def test_single_string_to_array(self, validator):
        coerced = validator.coerce_and_validate({"topics": "AI regulation"})
        assert coerced["topics"] == ["AI regulation"]
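The tests above exercise a validator whose implementation is not shown in this section. As a rough illustration only, here is a minimal sketch of a validator that would satisfy them. Everything in it is an assumption: the schema is a plain dict (not JSON Schema), and the limits (1 to 5 topics, 2 to 100 characters per topic, 1 to 10 articles, four time ranges) are inferred from the test cases in this chapter rather than taken from a real specification.

```python
"""Hypothetical sketch of a validator matching the tests above (not the real tools.validation)."""
from typing import Any

# Assumed schema shape; every limit here is inferred from the chapter's test cases.
NEWS_DIGEST_INPUT_SCHEMA = {
    "topics": {"min_items": 1, "max_items": 5, "min_len": 2, "max_len": 100},
    "time_range": {"enum": ["today", "24h", "this_week", "this_month"]},
    "max_articles_per_topic": {"min": 1, "max": 10},
}
TRUE_STRINGS = {"true", "yes", "1"}
FALSE_STRINGS = {"false", "no", "0"}


class SkillInputError(ValueError):
    """Raised when input is still invalid after coercion."""


class SkillInputValidator:
    def __init__(self, schema: dict):
        self.schema = schema

    def validate(self, data: dict) -> tuple[bool, list[str]]:
        """Return (is_valid, errors) without modifying the input."""
        errors: list[str] = []
        rule = self.schema["topics"]
        topics = data.get("topics")
        if not isinstance(topics, list):
            errors.append("topics: required, must be a list of strings")
        elif not rule["min_items"] <= len(topics) <= rule["max_items"]:
            errors.append(f"topics: expected {rule['min_items']}-{rule['max_items']} items")
        else:
            for topic in topics:
                if not isinstance(topic, str) or not rule["min_len"] <= len(topic) <= rule["max_len"]:
                    errors.append(f"topics: invalid topic {topic!r}")

        if "time_range" in data and data["time_range"] not in self.schema["time_range"]["enum"]:
            errors.append(f"time_range: unsupported value {data['time_range']!r}")

        if "max_articles_per_topic" in data:
            rule = self.schema["max_articles_per_topic"]
            count = data["max_articles_per_topic"]
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(count, bool) or not isinstance(count, int) \
                    or not rule["min"] <= count <= rule["max"]:
                errors.append(f"max_articles_per_topic: expected integer {rule['min']}-{rule['max']}")

        if "save_to_file" in data and not isinstance(data["save_to_file"], bool):
            errors.append("save_to_file: expected a boolean")
        return (not errors, errors)

    def coerce_and_validate(self, data: dict) -> dict:
        """Apply lenient type coercion, then validate; raise SkillInputError on failure."""
        coerced: dict[str, Any] = dict(data)
        if isinstance(coerced.get("topics"), str):
            coerced["topics"] = [coerced["topics"]]  # single string -> one-item list
        count = coerced.get("max_articles_per_topic")
        if isinstance(count, str):
            try:
                coerced["max_articles_per_topic"] = int(count)
            except ValueError:
                pass  # leave as-is; validate() reports the problem
        flag = coerced.get("save_to_file")
        if isinstance(flag, str):
            if flag.lower() in TRUE_STRINGS:
                coerced["save_to_file"] = True
            elif flag.lower() in FALSE_STRINGS:
                coerced["save_to_file"] = False
        is_valid, errors = self.validate(coerced)
        if not is_valid:
            raise SkillInputError("; ".join(errors))
        return coerced
```

Keeping validate side-effect free and putting all coercion in coerce_and_validate is one possible design; it lets strict and lenient callers share the same rules.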
unit/test_news_fetcher.py
import pytest
import httpx
from unittest.mock import patch, MagicMock


class TestSearchNews:
    def test_returns_articles(self, news_fetcher, sample_search_response):
        with patch.object(news_fetcher.client, 'get') as mock_get:
            mock_get.return_value = MagicMock(
                json=lambda: sample_search_response,
                raise_for_status=MagicMock()
            )
            articles = news_fetcher.search_news("AI regulation")
        assert len(articles) == 2

    def test_empty_results(self, news_fetcher):
        with patch.object(news_fetcher.client, 'get') as mock_get:
            mock_get.return_value = MagicMock(
                json=lambda: {"results": []},
                raise_for_status=MagicMock()
            )
            assert news_fetcher.search_news("xyz") == []

    def test_fetch_returns_none_on_timeout(self, news_fetcher):
        with patch.object(news_fetcher.client, 'get',
                          side_effect=httpx.TimeoutException("timeout")):
            content = news_fetcher.fetch_article_content("https://slow.example.com")
        assert content is None  # No exception, graceful degradation

    def test_fetch_truncates_long_content(self, news_fetcher):
        long_html = f"<html><body><p>{'A' * 5000}</p></body></html>"
        with patch.object(news_fetcher.client, 'get') as mock_get:
            mock_get.return_value = MagicMock(
                text=long_html,
                raise_for_status=MagicMock()
            )
            content = news_fetcher.fetch_article_content("https://example.com")
        assert len(content) <= 2000
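The last two tests pin down two behaviors of fetch_article_content: return None instead of raising on a network failure, and cap the extracted text at 2000 characters. The real NewsFetcher is not shown here, so the following is only a sketch of how those two behaviors could be implemented with the standard library. The http_get parameter is a hypothetical stand-in for the real client call, and extract_text is an illustrative helper name.

```python
from html.parser import HTMLParser
from typing import Callable, Optional

MAX_CONTENT_CHARS = 2000  # the limit asserted by test_fetch_truncates_long_content


class _TextExtractor(HTMLParser):
    """Collects visible text and skips the bodies of <script> and <style> tags."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str, limit: int = MAX_CONTENT_CHARS) -> str:
    """Strip markup and truncate the remaining text to `limit` characters."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)[:limit]


def fetch_article_content(url: str, http_get: Callable[[str], str]) -> Optional[str]:
    """Return truncated article text, or None if the fetch fails for any reason.

    `http_get` is a hypothetical injection point: it takes a URL and returns
    the page HTML, raising on timeouts or HTTP errors.
    """
    try:
        html = http_get(url)
    except Exception:
        return None  # graceful degradation: the caller falls back to the snippet
    return extract_text(html)
```

Injecting the HTTP call as a function is a design choice made for this sketch; it keeps the example testable without patching a client object.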
32.3 Mocking Tool Calls
Hermes Tool Call Mocker
"""Mock tool calls — test complete Agent workflows without real API calls."""
from dataclasses import dataclass, field
from typing import Any, Callable, Optional
import time


@dataclass
class MockToolCall:
    tool_name: str
    input: dict
    timestamp: float


@dataclass
class MockToolConfig:
    response: Any = None
    error: Optional[Exception] = None
    side_effect: Optional[Callable] = None
    call_limit: Optional[int] = None
    call_count: int = field(default=0, init=False)


class HermesToolMocker:
    def __init__(self):
        self._mocks: dict[str, MockToolConfig] = {}
        self._call_history: list[MockToolCall] = []

    def register(self, tool_name: str, response=None, error=None,
                 side_effect=None, call_limit=None) -> "HermesToolMocker":
        """Register a canned response, error, or side effect; returns self for chaining."""
        self._mocks[tool_name] = MockToolConfig(
            response=response, error=error,
            side_effect=side_effect, call_limit=call_limit
        )
        return self

    def handle_tool_call(self, tool_name: str, tool_input: dict) -> Any:
        # Record every call, including calls to unregistered tools.
        self._call_history.append(
            MockToolCall(tool_name=tool_name, input=tool_input, timestamp=time.time())
        )
        if tool_name not in self._mocks:
            raise ValueError(f"Tool '{tool_name}' has no mock. Register it first.")
        config = self._mocks[tool_name]
        config.call_count += 1
        # Compare against None so that call_limit=0 is honored.
        if config.call_limit is not None and config.call_count > config.call_limit:
            raise RuntimeError(f"'{tool_name}' exceeded call limit of {config.call_limit}")
        if config.error:
            raise config.error
        if config.side_effect:
            return config.side_effect(tool_input)
        return config.response

    def get_calls(self, tool_name: str) -> list[MockToolCall]:
        """Return the recorded calls for one tool, in call order."""
        return [c for c in self._call_history if c.tool_name == tool_name]

    def assert_tool_called(self, tool_name: str, times: Optional[int] = None):
        calls = self.get_calls(tool_name)
        if not calls:
            raise AssertionError(f"Expected '{tool_name}' to be called, but it wasn't")
        if times is not None and len(calls) != times:
            raise AssertionError(
                f"Expected '{tool_name}' called {times} times, got {len(calls)}"
            )

    def assert_tool_not_called(self, tool_name: str):
        calls = self.get_calls(tool_name)
        if calls:
            raise AssertionError(f"Expected '{tool_name}' NOT to be called, got {len(calls)} calls")

    def assert_tool_called_with(self, tool_name: str, **expected_params):
        calls = self.get_calls(tool_name)
        matching = [
            c for c in calls
            if all(c.input.get(k) == v for k, v in expected_params.items())
        ]
        if not matching:
            raise AssertionError(
                f"'{tool_name}' not called with {expected_params}. "
                f"Actual calls: {[c.input for c in calls]}"
            )


# Usage in tests
# NewsDigestSkill is the Skill under test; its import is omitted here.
class TestSkillWithMocks:
    def test_calls_search_tool(self, sample_articles):
        mocker = HermesToolMocker()
        mocker.register("web_search", response={"results": [
            {"title": "AI News", "url": "https://example.com", "snippet": "..."}
        ]}).register("fetch_url", response="Full article content...")
        skill = NewsDigestSkill(tool_handler=mocker.handle_tool_call)
        result = skill.run({"topics": ["AI regulation"]})
        mocker.assert_tool_called("web_search")
        assert result["status"] == "success"

    def test_handles_search_failure(self):
        mocker = HermesToolMocker()
        mocker.register("web_search", error=Exception("API rate limit exceeded"))
        skill = NewsDigestSkill(tool_handler=mocker.handle_tool_call)
        result = skill.run({"topics": ["AI"]})
        assert result["status"] == "failed"

    def test_saves_file_when_requested(self):
        mocker = HermesToolMocker()
        mocker.register("web_search", response={"results": [...]})
        mocker.register("fetch_url", response="Content...")
        mocker.register("write_file", response={"success": True})
        skill = NewsDigestSkill(tool_handler=mocker.handle_tool_call)
        skill.run({"topics": ["AI"], "save_to_file": True})
        mocker.assert_tool_called("write_file", times=1)
        write_call = mocker.get_calls("write_file")[0]
        assert write_call.input["path"].endswith(".md")
32.4 Performance and Boundary Tests
import time

import pytest

# The fixtures used below (validator, mocker, mocker_with_responses) must live in conftest.py:
# a fixture declared inside test_input_validation.py is not visible to other test modules.
# A local `mocker` fixture also shadows the pytest-mock fixture of the same name.


class TestPerformance:
    def test_input_validation_fast(self, benchmark, validator):
        """Benchmark validation; the target is under 1 ms per call."""
        result = benchmark(validator.validate, {"topics": ["AI", "Climate"], "max_articles_per_topic": 5})
        assert result[0] is True

    @pytest.mark.timeout(5)
    def test_full_skill_within_timeout(self, mocker_with_responses):
        """Full skill (with mocks) must complete within 5 seconds."""
        start = time.perf_counter()
        skill = NewsDigestSkill(tool_handler=mocker_with_responses.handle_tool_call)
        result = skill.run({"topics": ["AI"]})
        assert time.perf_counter() - start < 5.0
        assert result["status"] == "success"


class TestBoundaryConditions:
    @pytest.mark.parametrize("topic_len", [2, 50, 100])
    def test_topic_at_valid_length_boundary(self, validator, topic_len):
        is_valid, _ = validator.validate({"topics": ["A" * topic_len]})
        assert is_valid is True

    def test_topic_too_short_fails(self, validator):
        is_valid, _ = validator.validate({"topics": ["A"]})
        assert is_valid is False

    def test_unicode_topics(self, validator):
        data = {"topics": ["人工智能监管", "Künstliche Intelligenz", "気候変動"]}
        is_valid, _ = validator.validate(data)
        assert is_valid is True

    def test_sql_injection_does_not_crash(self, mocker):
        """Malicious input must be treated as a plain search query, not cause a crash."""
        # Register the tools first; an unregistered tool would raise inside the mocker itself.
        mocker.register("web_search", response={"results": [{"title": "News", "url": "https://x.com"}]})
        mocker.register("fetch_url", response="Content...")
        skill = NewsDigestSkill(tool_handler=mocker.handle_tool_call)
        result = skill.run({"topics": ["'; DROP TABLE news; --"]})
        # No exception was raised, and the input reached the search tool as an ordinary query.
        mocker.assert_tool_called("web_search")
        assert "status" in result

    def test_none_fetch_result_degrades_gracefully(self, mocker):
        mocker.register("web_search", response={"results": [{"url": "https://x.com"}]})
        mocker.register("fetch_url", response=None)  # Simulate timeout
        skill = NewsDigestSkill(tool_handler=mocker.handle_tool_call)
        result = skill.run({"topics": ["AI"]})
        # Should not crash; should use snippet as fallback
        assert result["status"] in ["success", "partial_success"]

    def test_partial_success_when_one_topic_fails(self, mocker):
        call_count = {"n": 0}

        def flaky_search(input_data):
            # Fail only on the second search call.
            call_count["n"] += 1
            if call_count["n"] == 2:
                raise Exception("Rate limit")
            return {"results": [{"title": "News", "url": "https://x.com"}]}

        mocker.register("web_search", side_effect=flaky_search)
        mocker.register("fetch_url", response="Content...")
        skill = NewsDigestSkill(tool_handler=mocker.handle_tool_call)
        result = skill.run({"topics": ["AI", "Climate", "Tech"]})
        assert result["status"] == "partial_success"
        assert len(result["topics_covered"]) == 2
        assert len(result["errors"]) == 1
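The partial-success test implies that the Skill isolates failures per topic and derives its overall status from how many topics succeeded. The actual NewsDigestSkill logic is not shown in this section, so the sketch below is only one plausible way to implement that aggregation. The function name run_topics, the search callable, and the SEARCH_FAILED error code are all hypothetical.

```python
from typing import Callable


def run_topics(topics: list[str], search: Callable[[str], list[dict]]) -> dict:
    """Run `search` once per topic, isolating failures so one bad topic cannot sink the rest."""
    covered: list[str] = []
    errors: list[dict] = []
    articles: dict[str, list[dict]] = {}

    for topic in topics:
        try:
            articles[topic] = search(topic)
            covered.append(topic)
        except Exception as exc:
            # Record the failure and keep going with the remaining topics.
            errors.append({"topic": topic, "code": "SEARCH_FAILED", "message": str(exc)})

    # Three-way status: no errors, some topics covered, or nothing covered at all.
    if not errors:
        status = "success"
    elif covered:
        status = "partial_success"
    else:
        status = "failed"

    return {"status": status, "topics_covered": covered, "errors": errors, "articles": articles}
```

Deriving the status from the two lists at the end, rather than tracking it inside the loop, keeps the rule in one place: "failed" only when every topic failed.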
32.5 CI/CD Configuration
# .github/workflows/skill-tests.yml
name: Skill Test Suite

on:
  push:
    branches: [main, develop]
    tags: ['v*']  # without a tag trigger, the publish job below would never run
  pull_request:
    branches: [main]
  schedule:
    - cron: '0 8 * * *'  # Daily integration tests at 08:00 UTC

jobs:
  lint-and-validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: {python-version: '3.11'}
      - run: pip install ruff mypy hermes-cli && pip install -r requirements.txt
      - run: ruff check .
      - run: mypy tools/ --ignore-missing-imports
      - run: hermes skill validate

  unit-tests:
    runs-on: ubuntu-latest
    needs: lint-and-validate
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: {python-version: '3.11'}
      - run: pip install -r requirements.txt pytest pytest-cov pytest-timeout
      - name: Run unit tests with coverage
        run: |
          pytest tests/unit/ \
            --cov=tools \
            --cov-report=xml \
            --cov-fail-under=80 \
            -v
      - uses: codecov/codecov-action@v4
        with: {file: ./coverage.xml}

  boundary-tests:
    runs-on: ubuntu-latest
    needs: unit-tests
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: {python-version: '3.11'}
      - run: pip install -r requirements.txt pytest hypothesis
      - run: pytest tests/boundary/ -v --tb=short

  performance-tests:
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    needs: unit-tests
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: {python-version: '3.11'}
      # pytest-timeout is needed for the @pytest.mark.timeout marker
      - run: pip install -r requirements.txt pytest pytest-benchmark pytest-timeout
      - name: Run and check performance
        run: |
          pytest tests/performance/ --benchmark-json=results.json -v
          python scripts/check_benchmark_regression.py --current=results.json --threshold=20

  integration-tests:
    runs-on: ubuntu-latest
    if: github.event_name == 'schedule'
    needs: unit-tests
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: {python-version: '3.11'}
      # pytest-timeout provides the --timeout option used below
      - run: pip install -r requirements.txt pytest pytest-timeout hermes-sdk
      - name: Run integration tests
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          SEARCH_API_KEY: ${{ secrets.SEARCH_API_KEY }}
        run: pytest tests/integration/ -v --timeout=60 -m "not slow"

  publish:
    runs-on: ubuntu-latest
    if: startsWith(github.ref, 'refs/tags/v')
    needs: [unit-tests, boundary-tests, lint-and-validate]
    steps:
      - uses: actions/checkout@v4
      - run: pip install hermes-cli
      - env:
          CLAWHUB_TOKEN: ${{ secrets.CLAWHUB_TOKEN }}
        run: hermes skill publish --token "$CLAWHUB_TOKEN"
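The performance job calls scripts/check_benchmark_regression.py, which is not listed in this chapter. The following is a hedged sketch of what such a script might look like. It assumes the JSON layout written by pytest-benchmark's --benchmark-json option (a top-level "benchmarks" list whose entries carry a "stats" object with a "mean" field). The --baseline flag and its default path are assumptions added for this sketch, since the workflow above only passes --current and --threshold.

```python
"""Hypothetical sketch of scripts/check_benchmark_regression.py."""
import argparse
import json
from pathlib import Path


def load_means(path: Path) -> dict[str, float]:
    """Map each benchmark's name to its mean runtime in seconds."""
    data = json.loads(path.read_text())
    return {b.get("fullname", b["name"]): b["stats"]["mean"] for b in data["benchmarks"]}


def find_regressions(baseline: dict[str, float], current: dict[str, float],
                     threshold_pct: float) -> list[str]:
    """Return one message per benchmark that slowed down by more than threshold_pct."""
    regressions = []
    for name, base_mean in baseline.items():
        cur_mean = current.get(name)
        if cur_mean is None or base_mean <= 0:
            continue  # benchmark removed, or baseline unusable
        change_pct = (cur_mean - base_mean) / base_mean * 100
        if change_pct > threshold_pct:
            regressions.append(
                f"{name}: {base_mean:.6f}s -> {cur_mean:.6f}s (+{change_pct:.1f}%)"
            )
    return regressions


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Fail if benchmarks regressed.")
    parser.add_argument("--current", type=Path, required=True)
    # Assumed location of a committed baseline; not part of the original workflow.
    parser.add_argument("--baseline", type=Path, default=Path("benchmarks/baseline.json"))
    parser.add_argument("--threshold", type=float, default=20.0,
                        help="allowed slowdown in percent")
    args = parser.parse_args(argv)

    if not args.baseline.exists():
        print(f"No baseline at {args.baseline}; skipping regression check.")
        return 0

    regressions = find_regressions(
        load_means(args.baseline), load_means(args.current), args.threshold
    )
    for line in regressions:
        print(f"REGRESSION {line}")
    return 1 if regressions else 0
```

To use it as a script, end the file with a standard main guard that calls sys.exit(main()), so a non-zero return code fails the CI step.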
32.6 Common Skill Bug Patterns and Prevention
Bug Pattern Library
# Bug Pattern 1: Unchecked tool call results
# Symptom: fetch_url returns None on timeout, code calls None.split()
# Prevention:
def test_handles_none_fetch_result(mocker):
    mocker.register("web_search", response={"results": [{"url": "https://x.com"}]})
    mocker.register("fetch_url", response=None)
    skill = NewsDigestSkill(tool_handler=mocker.handle_tool_call)
    result = skill.run({"topics": ["AI"]})
    assert result["status"] in ["success", "partial_success"]
    assert not any(e["code"] == "INTERNAL_ERROR" for e in result.get("errors", []))


# Bug Pattern 2: State leaking between calls
# Symptom: Results from call 1 contaminate call 2
# Prevention:
# (This test assumes run() also accepts a per-call tool_handler, so one Skill
# instance can be reused across calls.)
def test_no_state_leak_between_calls(mocker_factory):
    skill = NewsDigestSkill()
    mocker1 = mocker_factory(search_response=[{"title": "AI News Call 1"}])
    result1 = skill.run({"topics": ["AI"]}, tool_handler=mocker1.handle_tool_call)
    mocker2 = mocker_factory(search_response=[{"title": "Climate News Call 2"}])
    result2 = skill.run({"topics": ["Climate"]}, tool_handler=mocker2.handle_tool_call)
    # Results must be fully independent
    assert "Call 2" not in str(result1)
    assert "Call 1" not in str(result2)


# Bug Pattern 3: Forgotten partial success
# Symptom: 1 of 3 topics fails → whole Skill fails instead of partial_success
# Prevention: (see TestBoundaryConditions.test_partial_success_when_one_topic_fails above)


# Bug Pattern 4: Token budget overflow
# Symptom: Too much article content exceeds context window
# Prevention:
def test_large_content_stays_within_token_budget(mocker):
    huge_content = "Very long article. " * 1000  # ~3000 words, ~19k chars
    mocker.register("web_search", response={"results": [
        {"url": f"https://example{i}.com", "title": f"Art {i}", "snippet": "..."}
        for i in range(10)
    ]})
    mocker.register("fetch_url", response=huge_content)
    skill = NewsDigestSkill(tool_handler=mocker.handle_tool_call)
    result = skill.run({"topics": ["AI"], "max_articles_per_topic": 10})
    # Output should be within reasonable bounds (~40k chars ≈ 10k tokens)
    assert len(str(result)) < 40_000
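Bug Pattern 4 is usually prevented by trimming article content before it is assembled into the digest. As an illustration only, here is a minimal sketch that splits a character budget evenly across articles. The function name fit_to_budget, the 30,000-character default, and the " [truncated]" marker are assumptions for this sketch; the budget is measured in characters as a rough proxy for tokens, in line with the test's own estimate of about 4 characters per token.

```python
TRUNCATION_MARKER = " [truncated]"


def fit_to_budget(contents: list[str], total_budget: int = 30_000) -> list[str]:
    """Trim each article so the combined length never exceeds total_budget characters.

    The budget is split evenly: each article gets total_budget // len(contents)
    characters, including the truncation marker.
    """
    if not contents:
        return []
    per_item = total_budget // len(contents)
    trimmed: list[str] = []
    for text in contents:
        if len(text) <= per_item:
            trimmed.append(text)
            continue
        keep = max(per_item - len(TRUNCATION_MARKER), 0)
        # Prefer cutting at the last space before the limit to avoid splitting a word.
        cut = text.rfind(" ", 0, keep)
        if cut <= 0:
            cut = keep
        # The final slice guards the edge case where per_item is shorter than the marker.
        trimmed.append((text[:cut].rstrip() + TRUNCATION_MARKER)[:per_item])
    return trimmed
```

An even split is the simplest policy. A Skill that ranks articles could instead give more of the budget to the top results, at the cost of a more complex test.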
Coverage Targets
| Module | Target Coverage | Notes |
|---|---|---|
| Core logic (tools/) | > 90% | Must be high |
| Input validation | > 95% | Critical security layer |
| Error handling paths | > 85% | Verify degradation works |
| Output formatting | > 80% | Test all formats |
| Overall | > 80% | CI enforced |
32.7 Summary
A complete Skill testing system is the foundation of quality assurance:
- Four-dimension testing: Functional (70%) → Performance/Boundary (20%) → Integration (10%) pyramid structure
- Mock-first: Use HermesToolMocker for tool calls, so tests run offline with no API costs
- Boundary coverage: Unicode, injection attacks, None returns, extreme-length inputs; every boundary needs a test
- CI/CD automation: Unit tests on every push, performance comparison on PRs, integration tests daily on schedule
- Bug pattern library: Identify and prevent 4 common Skill bugs (None checks, state leakage, error aggregation, token overflow)
Tests are the moat protecting Skill quality — investing in tests is investing in long-term maintainability.
Discussion Questions
- Mock tool testing validates Skill logic effectively, but cannot test that the LLM actually follows the steps described in SKILL.md. How would you design a test to validate SKILL.md's behavioral effectiveness?
- Integration tests require real API calls, which makes them expensive and flaky. How would you design a Record-and-Replay mechanism so integration tests validate real behavior while running stably in CI?
- The test_partial_success_when_one_topic_fails test validates partial success. But if all 3 topics fail, the result should be failed, not partial_success. How do you elegantly distinguish these cases in code? Write the corresponding test.
- The performance test allows a 20% regression threshold. When would you tolerate a larger regression? When should you tighten it to 5%? What's your decision framework?