Claude.ai Connector Development: Complete Remote MCP + OAuth 2.1 Integration Guide
Chapter 57: Plugin Testing Strategies: Unit Testing, Integration Testing, and Sandbox Environments
57.1 Why Plugin Testing Matters
Plugins are Claude's extended reach into the real world. A plugin can search the web, query databases, call external APIs, and modify filesystems. This means plugin bugs are no longer confined to "producing wrong text"—they can cause real-world side effects: deleting records that shouldn't be deleted, sending incorrect emails, triggering erroneous payment flows.
Unlike ordinary functions, plugin testing faces several unique challenges:
- Complex external dependencies: Plugins frequently call third-party APIs, and you cannot rely on the stability of real services during testing.
- Non-deterministic Claude behavior: The same prompt does not always trigger the same tool calls.
- Parameter schema validation: JSON parameters generated by Claude must conform to the schema, or execution fails.
- Hard-to-rollback side effects: Write operations to databases and filesystems must be isolated in test environments.
A comprehensive plugin testing strategy should cover three levels: unit tests (validating tool function logic), integration tests (validating Claude-tool collaboration), and sandbox tests (validating end-to-end flows in isolated environments).
57.2 Plugin Architecture Overview
Before diving into testing strategies, let's review the typical plugin structure, which determines where we cut in for testing.
# Typical plugin structure
import anthropic
import json
from typing import Any
# Tool functions (pure business logic layer)
def search_products(query: str, max_results: int = 5) -> list[dict]:
"""Search the product database"""
...
def create_order(user_id: str, product_id: str, quantity: int) -> dict:
"""Create an order"""
...
# Tool schema definition layer
TOOLS = [
{
"name": "search_products",
"description": "Search the product catalog and return matching products",
"input_schema": {
"type": "object",
"properties": {
"query": {"type": "string", "description": "Search keywords"},
"max_results": {"type": "integer", "description": "Max results to return", "default": 5}
},
"required": ["query"]
}
}
]
# Tool dispatch layer
def execute_tool(tool_name: str, tool_input: dict) -> Any:
if tool_name == "search_products":
return search_products(**tool_input)
elif tool_name == "create_order":
return create_order(**tool_input)
else:
raise ValueError(f"Unknown tool: {tool_name}")
This structure separates concerns cleanly into three layers: business logic, schema definition, and dispatch. Our testing strategy mirrors this separation.
57.3 Unit Testing: Validating Tool Function Logic
Unit tests target the pure tool functions themselves, without any Claude API calls. The goal is to verify that each tool function behaves correctly across a range of inputs.
57.3.1 Building a Tool Function Test Suite with pytest
# tests/test_tools_unit.py
import pytest
from unittest.mock import patch, MagicMock
from plugins.ecommerce import search_products, create_order
class TestSearchProducts:
"""Unit tests for search_products tool"""
def test_basic_search_returns_results(self, mock_db):
"""Normal search returns a result list"""
mock_db.query.return_value = [
{"id": "p001", "name": "Noise-Cancelling Headphones", "price": 299},
{"id": "p002", "name": "Bluetooth Speaker", "price": 199}
]
results = search_products("bluetooth")
assert len(results) == 2
assert results[0]["id"] == "p001"
def test_max_results_limit(self, mock_db):
"""max_results parameter limits the number of returned items"""
mock_db.query.return_value = [{"id": f"p{i}"} for i in range(10)]
results = search_products("headphones", max_results=3)
assert len(results) == 3
def test_empty_query_raises_error(self):
"""Empty query string should raise ValueError"""
with pytest.raises(ValueError, match="Query string cannot be empty"):
search_products("")
def test_no_results_returns_empty_list(self, mock_db):
"""No matching results returns empty list, not an exception"""
mock_db.query.return_value = []
results = search_products("completely_nonexistent_product_xyz")
assert results == []
def test_sql_injection_sanitized(self, mock_db):
"""SQL injection strings should be sanitized"""
results = search_products("'; DROP TABLE products; --")
mock_db.query.assert_called_once()
call_args = mock_db.query.call_args
assert "DROP TABLE" not in str(call_args)
class TestCreateOrder:
"""Unit tests for create_order tool"""
def test_successful_order_creation(self, mock_db):
"""Successful order creation returns order ID"""
mock_db.insert.return_value = "order_12345"
result = create_order("user_001", "p001", 2)
assert result["order_id"] == "order_12345"
assert result["status"] == "created"
def test_zero_quantity_raises_error(self):
"""Quantity of 0 should raise a validation error"""
with pytest.raises(ValueError, match="Quantity must be greater than 0"):
create_order("user_001", "p001", 0)
def test_nonexistent_product_raises_error(self, mock_db):
"""Non-existent product ID should raise an error"""
mock_db.get_product.return_value = None
with pytest.raises(ValueError, match="Product not found"):
create_order("user_001", "nonexistent_product", 1)
@pytest.fixture
def mock_db():
"""Database mock fixture"""
with patch("plugins.ecommerce.db") as mock:
yield mock
57.3.2 Testing Schema Validity
Schema definition errors cause Claude to fail at tool invocation. We need dedicated schema tests.
# tests/test_schema_validation.py
import pytest
import jsonschema
from plugins.ecommerce import TOOLS
def get_tool_schema(tool_name: str) -> dict:
for tool in TOOLS:
if tool["name"] == tool_name:
return tool["input_schema"]
raise KeyError(f"Tool {tool_name} not found")
class TestSearchProductsSchema:
def test_valid_minimal_input(self):
"""Only required fields should pass validation"""
jsonschema.validate({"query": "headphones"}, get_tool_schema("search_products"))
def test_missing_required_field_fails(self):
"""Missing required 'query' field should fail validation"""
with pytest.raises(jsonschema.ValidationError):
jsonschema.validate({"max_results": 5}, get_tool_schema("search_products"))
def test_wrong_type_for_max_results(self):
"""Passing a string for max_results should fail"""
with pytest.raises(jsonschema.ValidationError):
jsonschema.validate({"query": "headphones", "max_results": "five"}, get_tool_schema("search_products"))
57.4 Integration Testing: Validating Claude-Tool Collaboration
Integration tests verify that Claude correctly identifies which tool to call, passes the right parameters, and produces a sensible final response after tool results are returned.
57.4.1 Integration Tests with the Real Claude API
# tests/test_integration_claude.py
import pytest
import anthropic
from plugins.ecommerce import TOOLS, execute_tool
@pytest.fixture
def claude_client():
return anthropic.Anthropic()
def run_tool_use_loop(client, messages: list, tools: list) -> dict:
"""Execute a complete tool use loop, return final outcome"""
tool_calls = []
final_text = ""
while True:
response = client.messages.create(
model="claude-opus-4-5",
max_tokens=1024,
tools=tools,
messages=messages
)
if response.stop_reason == "end_turn":
for block in response.content:
if hasattr(block, "text"):
final_text = block.text
break
if response.stop_reason == "tool_use":
tool_results = []
for block in response.content:
if block.type == "tool_use":
tool_calls.append({"name": block.name, "input": block.input})
result = execute_tool(block.name, block.input)
tool_results.append({
"type": "tool_result",
"tool_use_id": block.id,
"content": str(result)
})
messages = messages + [
{"role": "assistant", "content": response.content},
{"role": "user", "content": tool_results}
]
else:
break
return {"final_text": final_text, "tool_calls": tool_calls}
class TestClaudeToolIntegration:
@pytest.mark.integration
def test_search_triggered_for_product_query(self, claude_client, mock_db):
"""Asking about products should trigger search_products"""
mock_db.query.return_value = [
{"id": "p001", "name": "Sony WH-1000XM5", "price": 349}
]
result = run_tool_use_loop(
claude_client,
[{"role": "user", "content": "Can you search for noise-cancelling headphones?"}],
TOOLS
)
assert any(tc["name"] == "search_products" for tc in result["tool_calls"])
@pytest.mark.integration
def test_multi_turn_maintains_context(self, claude_client, mock_db):
"""Tool calls in multi-turn conversations maintain context"""
mock_db.query.return_value = [{"id": "p001", "name": "Headphone A", "price": 299}]
messages = [{"role": "user", "content": "Search for bluetooth headphones"}]
result = run_tool_use_loop(claude_client, messages, TOOLS)
assert any(tc["name"] == "search_products" for tc in result["tool_calls"])
57.4.2 Lightweight Integration Testing with Mock Claude
Calling the real Claude API for every integration test is expensive and slow. A Mock Claude client enables fast CI/CD validation.
# tests/mocks/mock_claude.py
from dataclasses import dataclass, field
@dataclass
class MockToolUseBlock:
type: str = "tool_use"
id: str = "mock_tool_001"
name: str = ""
input: dict = field(default_factory=dict)
@dataclass
class MockTextBlock:
type: str = "text"
text: str = ""
@dataclass
class MockResponse:
stop_reason: str
content: list
class MockClaudeClient:
"""Simulates Claude tool-calling behavior for fast testing"""
def __init__(self, tool_call_plan: list[dict]):
self.tool_call_plan = tool_call_plan
self.call_index = 0
def messages_create(self, **kwargs):
if self.call_index < len(self.tool_call_plan):
plan = self.tool_call_plan[self.call_index]
self.call_index += 1
return MockResponse(
stop_reason="tool_use",
content=[MockToolUseBlock(
id=f"mock_{self.call_index:03d}",
name=plan["name"],
input=plan["input"]
)]
)
return MockResponse(
stop_reason="end_turn",
content=[MockTextBlock(text="Based on search results, here are the products...")]
)
57.5 Sandbox Environments: Isolating Side Effects
For plugins with write operations—creating orders, sending emails, modifying databases—testing must occur in sandbox environments to prevent test data from contaminating production.
57.5.1 Docker Sandbox Configuration
# docker-compose.test.yml
version: "3.8"
services:
test-db:
image: postgres:15-alpine
environment:
POSTGRES_DB: plugin_test
POSTGRES_USER: testuser
POSTGRES_PASSWORD: testpass
ports:
- "5433:5432"
volumes:
- ./tests/fixtures/init.sql:/docker-entrypoint-initdb.d/init.sql
wiremock:
image: wiremock/wiremock:3.3.1
ports:
- "8080:8080"
volumes:
- ./tests/wiremock:/home/wiremock
plugin-test-runner:
build: .
environment:
DATABASE_URL: postgresql://testuser:testpass@test-db:5432/plugin_test
EXTERNAL_API_BASE: http://wiremock:8080
depends_on: [test-db, wiremock]
command: pytest tests/ -m "sandbox" -v
57.5.2 Transaction Rollback Strategy
# tests/conftest.py
import pytest
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
@pytest.fixture(scope="function")
def db_session():
"""Each test function uses an isolated transaction, auto-rolled back after the test"""
engine = create_engine(os.environ["DATABASE_URL"])
connection = engine.connect()
transaction = connection.begin()
Session = sessionmaker(bind=connection)
session = Session()
import plugins.ecommerce as plugin
original_session = plugin.db_session
plugin.db_session = session
yield session
session.close()
transaction.rollback()
connection.close()
plugin.db_session = original_session
@pytest.mark.sandbox
class TestCreateOrderSandbox:
def test_order_persisted_to_database(self, db_session):
"""Created order should be persisted to the database"""
result = create_order("user_001", "p001", 2)
order = db_session.query(Order).filter_by(id=result["order_id"]).first()
assert order is not None
assert order.quantity == 2
# Transaction rolls back automatically after test
def test_payment_failure_rolls_back_order(self, db_session):
"""Payment failure should roll back the entire order"""
with pytest.raises(PaymentError):
create_order("user_001", "p001", 99999) # exceeds limit
orders = db_session.query(Order).filter_by(user_id="user_001").all()
assert len(orders) == 0
57.6 Coverage and CI/CD Integration
57.6.1 pytest Configuration and Marker Strategy
# pytest.ini
[pytest]
markers =
unit: Unit tests, no external dependencies
integration: Integration tests, requires Claude API
sandbox: Sandbox tests, requires Docker environment
slow: Tests taking more than 5 seconds
addopts =
--cov=plugins
--cov-report=html:coverage_report
--cov-report=term-missing
--cov-fail-under=80
57.6.2 GitHub Actions CI Pipeline
# .github/workflows/plugin-tests.yml
name: Plugin Test Suite
on:
push:
paths: ['plugins/**', 'tests/**']
jobs:
unit-tests:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v4
with: {python-version: '3.11'}
- run: pip install -r requirements-test.txt
- run: pytest -m "unit" -v --tb=short
- uses: codecov/codecov-action@v3
integration-tests:
runs-on: ubuntu-latest
needs: unit-tests
if: github.ref == 'refs/heads/main'
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v4
with: {python-version: '3.11'}
- run: pip install -r requirements-test.txt
- run: pytest -m "integration" -v
timeout-minutes: 10
57.7 Common Testing Pitfalls and Solutions
Pitfall 1: Over-relying on Claude's Non-deterministic Behavior
# Wrong: testing the exact sequence of tool calls
def test_exact_tool_sequence(claude_client):
result = run_tool_use_loop(claude_client, [...], TOOLS)
assert result["tool_calls"][0]["name"] == "search_products" # Fragile!
assert result["tool_calls"][1]["name"] == "create_order" # Fragile!
# Right: test outcome properties, not exact sequences
def test_order_eventually_created(claude_client, mock_db):
result = run_tool_use_loop(claude_client, [...], TOOLS)
order_calls = [tc for tc in result["tool_calls"] if tc["name"] == "create_order"]
assert len(order_calls) >= 1
assert order_calls[0]["input"]["quantity"] == 2
Pitfall 2: Returning Non-serializable Objects from Tools
# Wrong: SQLAlchemy objects cannot be serialized to Claude
def search_products_bad(query: str) -> list:
return db.query(Product).filter(...).all() # SQLAlchemy objects!
# Right: return plain dicts
def search_products_good(query: str) -> list[dict]:
results = db.query(Product).filter(...).all()
return [{"id": p.id, "name": p.name, "price": p.price} for p in results]
Pitfall 3: Production Database Contamination
# tests/conftest.py
def pytest_configure(config):
"""Ensure tests never accidentally connect to production"""
db_url = os.environ.get("DATABASE_URL", "")
if "prod" in db_url or "production" in db_url:
pytest.exit("Production database URL detected — aborting!", returncode=1)
Summary
A sound plugin testing strategy follows the pyramid model: a large base of unit tests (fast, no dependencies), a middle layer of integration tests (verifying Claude collaboration), and a small top layer of sandbox tests (validating end-to-end side effects). Key practices include: schema validation tests to prevent parameter errors, transaction rollback to isolate write operations, a Mock Claude client to speed up CI pipelines, and environment guards to prevent accidental production access. Testing is not overhead—it is the pathway from prototype to production.