Chapter 57

Claude.ai Connector Development: Complete Remote MCP + OAuth 2.1 Integration Guide

Chapter 57: Plugin Testing Strategies: Unit Testing, Integration Testing, and Sandbox Environments

57.1 Why Plugin Testing Matters

Plugins are Claude's extended reach into the real world. A plugin can search the web, query databases, call external APIs, and modify filesystems. This means plugin bugs are no longer confined to "producing wrong text"—they can cause real-world side effects: deleting records that shouldn't be deleted, sending incorrect emails, triggering erroneous payment flows.

Unlike ordinary functions, plugin testing faces several unique challenges:

  1. Complex external dependencies: Plugins frequently call third-party APIs, and you cannot rely on the stability of real services during testing.
  2. Non-deterministic Claude behavior: The same prompt does not always trigger the same tool calls.
  3. Parameter schema validation: JSON parameters generated by Claude must conform to the schema, or execution fails.
  4. Hard-to-rollback side effects: Write operations to databases and filesystems must be isolated in test environments.

A comprehensive plugin testing strategy should cover three levels: unit tests (validating tool function logic), integration tests (validating Claude-tool collaboration), and sandbox tests (validating end-to-end flows in isolated environments).

57.2 Plugin Architecture Overview

Before diving into testing strategies, let's review the typical plugin structure, which determines where we cut in for testing.

# Typical plugin structure
import anthropic
import json
from typing import Any

# Tool functions (pure business logic layer)
def search_products(query: str, max_results: int = 5) -> list[dict]:
    """Search the product database"""
    ...

def create_order(user_id: str, product_id: str, quantity: int) -> dict:
    """Create an order"""
    ...

# Tool schema definition layer
TOOLS = [
    {
        "name": "search_products",
        "description": "Search the product catalog and return matching products",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search keywords"},
                "max_results": {"type": "integer", "description": "Max results to return", "default": 5}
            },
            "required": ["query"]
        }
    }
]

# Tool dispatch layer
def execute_tool(tool_name: str, tool_input: dict) -> Any:
    if tool_name == "search_products":
        return search_products(**tool_input)
    elif tool_name == "create_order":
        return create_order(**tool_input)
    else:
        raise ValueError(f"Unknown tool: {tool_name}")

This structure separates concerns cleanly into three layers: business logic, schema definition, and dispatch. Our testing strategy mirrors this separation.

57.3 Unit Testing: Validating Tool Function Logic

Unit tests target the pure tool functions themselves, without any Claude API calls. The goal is to verify that each tool function behaves correctly across a range of inputs.

57.3.1 Building a Tool Function Test Suite with pytest

# tests/test_tools_unit.py
import pytest
from unittest.mock import patch, MagicMock
from plugins.ecommerce import search_products, create_order

class TestSearchProducts:
    """Unit tests for search_products tool"""

    def test_basic_search_returns_results(self, mock_db):
        """Normal search returns a result list"""
        mock_db.query.return_value = [
            {"id": "p001", "name": "Noise-Cancelling Headphones", "price": 299},
            {"id": "p002", "name": "Bluetooth Speaker", "price": 199}
        ]
        results = search_products("bluetooth")
        assert len(results) == 2
        assert results[0]["id"] == "p001"

    def test_max_results_limit(self, mock_db):
        """max_results parameter limits the number of returned items"""
        mock_db.query.return_value = [{"id": f"p{i}"} for i in range(10)]
        results = search_products("headphones", max_results=3)
        assert len(results) == 3

    def test_empty_query_raises_error(self):
        """Empty query string should raise ValueError"""
        with pytest.raises(ValueError, match="Query string cannot be empty"):
            search_products("")

    def test_no_results_returns_empty_list(self, mock_db):
        """No matching results returns empty list, not an exception"""
        mock_db.query.return_value = []
        results = search_products("completely_nonexistent_product_xyz")
        assert results == []

    def test_sql_injection_sanitized(self, mock_db):
        """SQL injection strings should be sanitized"""
        results = search_products("'; DROP TABLE products; --")
        mock_db.query.assert_called_once()
        call_args = mock_db.query.call_args
        assert "DROP TABLE" not in str(call_args)

class TestCreateOrder:
    """Unit tests for create_order tool"""

    def test_successful_order_creation(self, mock_db):
        """Successful order creation returns order ID"""
        mock_db.insert.return_value = "order_12345"
        result = create_order("user_001", "p001", 2)
        assert result["order_id"] == "order_12345"
        assert result["status"] == "created"

    def test_zero_quantity_raises_error(self):
        """Quantity of 0 should raise a validation error"""
        with pytest.raises(ValueError, match="Quantity must be greater than 0"):
            create_order("user_001", "p001", 0)

    def test_nonexistent_product_raises_error(self, mock_db):
        """Non-existent product ID should raise an error"""
        mock_db.get_product.return_value = None
        with pytest.raises(ValueError, match="Product not found"):
            create_order("user_001", "nonexistent_product", 1)

@pytest.fixture
def mock_db():
    """Database mock fixture"""
    with patch("plugins.ecommerce.db") as mock:
        yield mock

57.3.2 Testing Schema Validity

Schema definition errors cause Claude to fail at tool invocation. We need dedicated schema tests.

# tests/test_schema_validation.py
import pytest
import jsonschema
from plugins.ecommerce import TOOLS

def get_tool_schema(tool_name: str) -> dict:
    for tool in TOOLS:
        if tool["name"] == tool_name:
            return tool["input_schema"]
    raise KeyError(f"Tool {tool_name} not found")

class TestSearchProductsSchema:
    def test_valid_minimal_input(self):
        """Only required fields should pass validation"""
        jsonschema.validate({"query": "headphones"}, get_tool_schema("search_products"))

    def test_missing_required_field_fails(self):
        """Missing required 'query' field should fail validation"""
        with pytest.raises(jsonschema.ValidationError):
            jsonschema.validate({"max_results": 5}, get_tool_schema("search_products"))

    def test_wrong_type_for_max_results(self):
        """Passing a string for max_results should fail"""
        with pytest.raises(jsonschema.ValidationError):
            jsonschema.validate({"query": "headphones", "max_results": "five"}, get_tool_schema("search_products"))

57.4 Integration Testing: Validating Claude-Tool Collaboration

Integration tests verify that Claude correctly identifies which tool to call, passes the right parameters, and produces a sensible final response after tool results are returned.

57.4.1 Integration Tests with the Real Claude API

# tests/test_integration_claude.py
import pytest
import anthropic
from plugins.ecommerce import TOOLS, execute_tool

@pytest.fixture
def claude_client():
    return anthropic.Anthropic()

def run_tool_use_loop(client, messages: list, tools: list) -> dict:
    """Execute a complete tool use loop, return final outcome"""
    tool_calls = []
    final_text = ""

    while True:
        response = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=1024,
            tools=tools,
            messages=messages
        )

        if response.stop_reason == "end_turn":
            for block in response.content:
                if hasattr(block, "text"):
                    final_text = block.text
            break

        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    tool_calls.append({"name": block.name, "input": block.input})
                    result = execute_tool(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": str(result)
                    })

            messages = messages + [
                {"role": "assistant", "content": response.content},
                {"role": "user", "content": tool_results}
            ]
        else:
            break

    return {"final_text": final_text, "tool_calls": tool_calls}

class TestClaudeToolIntegration:
    @pytest.mark.integration
    def test_search_triggered_for_product_query(self, claude_client, mock_db):
        """Asking about products should trigger search_products"""
        mock_db.query.return_value = [
            {"id": "p001", "name": "Sony WH-1000XM5", "price": 349}
        ]

        result = run_tool_use_loop(
            claude_client,
            [{"role": "user", "content": "Can you search for noise-cancelling headphones?"}],
            TOOLS
        )

        assert any(tc["name"] == "search_products" for tc in result["tool_calls"])

    @pytest.mark.integration
    def test_multi_turn_maintains_context(self, claude_client, mock_db):
        """Tool calls in multi-turn conversations maintain context"""
        mock_db.query.return_value = [{"id": "p001", "name": "Headphone A", "price": 299}]

        messages = [{"role": "user", "content": "Search for bluetooth headphones"}]
        result = run_tool_use_loop(claude_client, messages, TOOLS)
        assert any(tc["name"] == "search_products" for tc in result["tool_calls"])

57.4.2 Lightweight Integration Testing with Mock Claude

Calling the real Claude API for every integration test is expensive and slow. A Mock Claude client enables fast CI/CD validation.

# tests/mocks/mock_claude.py
from dataclasses import dataclass, field

@dataclass
class MockToolUseBlock:
    type: str = "tool_use"
    id: str = "mock_tool_001"
    name: str = ""
    input: dict = field(default_factory=dict)

@dataclass
class MockTextBlock:
    type: str = "text"
    text: str = ""

@dataclass
class MockResponse:
    stop_reason: str
    content: list

class MockClaudeClient:
    """Simulates Claude tool-calling behavior for fast testing"""

    def __init__(self, tool_call_plan: list[dict]):
        self.tool_call_plan = tool_call_plan
        self.call_index = 0

    def messages_create(self, **kwargs):
        if self.call_index < len(self.tool_call_plan):
            plan = self.tool_call_plan[self.call_index]
            self.call_index += 1
            return MockResponse(
                stop_reason="tool_use",
                content=[MockToolUseBlock(
                    id=f"mock_{self.call_index:03d}",
                    name=plan["name"],
                    input=plan["input"]
                )]
            )
        return MockResponse(
            stop_reason="end_turn",
            content=[MockTextBlock(text="Based on search results, here are the products...")]
        )

57.5 Sandbox Environments: Isolating Side Effects

For plugins with write operations—creating orders, sending emails, modifying databases—testing must occur in sandbox environments to prevent test data from contaminating production.

57.5.1 Docker Sandbox Configuration

# docker-compose.test.yml
version: "3.8"
services:
  test-db:
    image: postgres:15-alpine
    environment:
      POSTGRES_DB: plugin_test
      POSTGRES_USER: testuser
      POSTGRES_PASSWORD: testpass
    ports:
      - "5433:5432"
    volumes:
      - ./tests/fixtures/init.sql:/docker-entrypoint-initdb.d/init.sql

  wiremock:
    image: wiremock/wiremock:3.3.1
    ports:
      - "8080:8080"
    volumes:
      - ./tests/wiremock:/home/wiremock

  plugin-test-runner:
    build: .
    environment:
      DATABASE_URL: postgresql://testuser:testpass@test-db:5432/plugin_test
      EXTERNAL_API_BASE: http://wiremock:8080
    depends_on: [test-db, wiremock]
    command: pytest tests/ -m "sandbox" -v

57.5.2 Transaction Rollback Strategy

# tests/conftest.py
import pytest
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

@pytest.fixture(scope="function")
def db_session():
    """Each test function uses an isolated transaction, auto-rolled back after the test"""
    engine = create_engine(os.environ["DATABASE_URL"])
    connection = engine.connect()
    transaction = connection.begin()

    Session = sessionmaker(bind=connection)
    session = Session()

    import plugins.ecommerce as plugin
    original_session = plugin.db_session
    plugin.db_session = session

    yield session

    session.close()
    transaction.rollback()
    connection.close()
    plugin.db_session = original_session

@pytest.mark.sandbox
class TestCreateOrderSandbox:
    def test_order_persisted_to_database(self, db_session):
        """Created order should be persisted to the database"""
        result = create_order("user_001", "p001", 2)
        order = db_session.query(Order).filter_by(id=result["order_id"]).first()
        assert order is not None
        assert order.quantity == 2
        # Transaction rolls back automatically after test

    def test_payment_failure_rolls_back_order(self, db_session):
        """Payment failure should roll back the entire order"""
        with pytest.raises(PaymentError):
            create_order("user_001", "p001", 99999)  # exceeds limit
        
        orders = db_session.query(Order).filter_by(user_id="user_001").all()
        assert len(orders) == 0

57.6 Coverage and CI/CD Integration

57.6.1 pytest Configuration and Marker Strategy

# pytest.ini
[pytest]
markers =
    unit: Unit tests, no external dependencies
    integration: Integration tests, requires Claude API
    sandbox: Sandbox tests, requires Docker environment
    slow: Tests taking more than 5 seconds

addopts = 
    --cov=plugins
    --cov-report=html:coverage_report
    --cov-report=term-missing
    --cov-fail-under=80

57.6.2 GitHub Actions CI Pipeline

# .github/workflows/plugin-tests.yml
name: Plugin Test Suite

on:
  push:
    paths: ['plugins/**', 'tests/**']

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with: {python-version: '3.11'}
      - run: pip install -r requirements-test.txt
      - run: pytest -m "unit" -v --tb=short
      - uses: codecov/codecov-action@v3

  integration-tests:
    runs-on: ubuntu-latest
    needs: unit-tests
    if: github.ref == 'refs/heads/main'
    env:
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with: {python-version: '3.11'}
      - run: pip install -r requirements-test.txt
      - run: pytest -m "integration" -v
        timeout-minutes: 10

57.7 Common Testing Pitfalls and Solutions

Pitfall 1: Over-relying on Claude's Non-deterministic Behavior

# Wrong: testing the exact sequence of tool calls
def test_exact_tool_sequence(claude_client):
    result = run_tool_use_loop(claude_client, [...], TOOLS)
    assert result["tool_calls"][0]["name"] == "search_products"  # Fragile!
    assert result["tool_calls"][1]["name"] == "create_order"     # Fragile!

# Right: test outcome properties, not exact sequences
def test_order_eventually_created(claude_client, mock_db):
    result = run_tool_use_loop(claude_client, [...], TOOLS)
    order_calls = [tc for tc in result["tool_calls"] if tc["name"] == "create_order"]
    assert len(order_calls) >= 1
    assert order_calls[0]["input"]["quantity"] == 2

Pitfall 2: Returning Non-serializable Objects from Tools

# Wrong: SQLAlchemy objects cannot be serialized to Claude
def search_products_bad(query: str) -> list:
    return db.query(Product).filter(...).all()  # SQLAlchemy objects!

# Right: return plain dicts
def search_products_good(query: str) -> list[dict]:
    results = db.query(Product).filter(...).all()
    return [{"id": p.id, "name": p.name, "price": p.price} for p in results]

Pitfall 3: Production Database Contamination

# tests/conftest.py
def pytest_configure(config):
    """Ensure tests never accidentally connect to production"""
    db_url = os.environ.get("DATABASE_URL", "")
    if "prod" in db_url or "production" in db_url:
        pytest.exit("Production database URL detected — aborting!", returncode=1)

Summary

A sound plugin testing strategy follows the pyramid model: a large base of unit tests (fast, no dependencies), a middle layer of integration tests (verifying Claude collaboration), and a small top layer of sandbox tests (validating end-to-end side effects). Key practices include: schema validation tests to prevent parameter errors, transaction rollback to isolate write operations, a Mock Claude client to speed up CI pipelines, and environment guards to prevent accidental production access. Testing is not overhead—it is the pathway from prototype to production.

Rate this chapter
4.7  / 5  (3 ratings)

💬 Comments