Claude.ai Connector Development: Complete Remote MCP + OAuth 2.1 Integration Guide

Chapter 57: Plugin Testing Strategies: Unit Testing, Integration Testing, and Sandbox Environments

57.1 Why Plugin Testing Matters

Plugins are Claude's extended reach into the real world. A plugin can search the web, query databases, call external APIs, and modify filesystems. This means plugin bugs are no longer confined to "producing wrong text"; they can cause real-world side effects: deleting records that shouldn't be deleted, sending incorrect emails, triggering erroneous payment flows.

Unlike testing ordinary functions, testing plugins poses several unique challenges:

  1. Complex external dependencies: Plugins frequently call third-party APIs, and you cannot rely on the stability of real services during testing.
  2. Non-deterministic Claude behavior: The same prompt does not always trigger the same tool calls.
  3. Parameter schema validation: JSON parameters generated by Claude must conform to the schema, or execution fails.
  4. Hard-to-rollback side effects: Write operations to databases and filesystems must be isolated in test environments.

A comprehensive plugin testing strategy should cover three levels: unit tests (validating tool function logic), integration tests (validating Claude-tool collaboration), and sandbox tests (validating end-to-end flows in isolated environments).

57.2 Plugin Architecture Overview

Before diving into testing strategies, let's review the typical plugin structure, which determines where tests can hook in.

# Typical plugin structure
import anthropic
import json
from typing import Any

# Tool functions (pure business logic layer)
def search_products(query: str, max_results: int = 5) -> list[dict]:
    """Search the product database"""
    ...

def create_order(user_id: str, product_id: str, quantity: int) -> dict:
    """Create an order"""
    ...

# Tool schema definition layer
TOOLS = [
    {
        "name": "search_products",
        "description": "Search the product catalog and return matching products",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search keywords"},
                "max_results": {"type": "integer", "description": "Max results to return", "default": 5}
            },
            "required": ["query"]
        }
    }
]

# Tool dispatch layer
def execute_tool(tool_name: str, tool_input: dict) -> Any:
    if tool_name == "search_products":
        return search_products(**tool_input)
    elif tool_name == "create_order":
        return create_order(**tool_input)
    else:
        raise ValueError(f"Unknown tool: {tool_name}")

This structure separates concerns cleanly into three layers: business logic, schema definition, and dispatch. Our testing strategy mirrors this separation.
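
Because the dispatch layer is separate, it can be exercised without touching real business logic. The following is a minimal sketch, assuming a table-driven variant of execute_tool; make_dispatcher, fake_search, and dispatch_error are hypothetical names introduced for illustration and are not part of the plugin above.

```python
from typing import Any, Callable

def make_dispatcher(registry: dict[str, Callable[..., Any]]) -> Callable[[str, dict], Any]:
    """Build an execute_tool-style function from a name -> callable table."""
    def execute_tool(tool_name: str, tool_input: dict) -> Any:
        try:
            func = registry[tool_name]
        except KeyError:
            # Same error contract as the if/elif dispatcher shown earlier.
            raise ValueError(f"Unknown tool: {tool_name}") from None
        return func(**tool_input)
    return execute_tool

# Stub tool standing in for the real search_products (hypothetical).
def fake_search(query: str, max_results: int = 5) -> list[dict]:
    return [{"id": "p001", "query": query}][:max_results]

execute = make_dispatcher({"search_products": fake_search})

def dispatch_error(tool_name: str) -> str:
    """Return the ValueError message raised for a tool name, or '' if none."""
    try:
        execute(tool_name, {})
    except ValueError as exc:
        return str(exc)
    return ""
```

Injecting the registry keeps the dispatcher testable with stubs, while production code can pass the real tool functions.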

57.3 Unit Testing: Validating Tool Function Logic

Unit tests target the pure tool functions themselves, without any Claude API calls. The goal is to verify that each tool function behaves correctly across a range of inputs.

57.3.1 Building a Tool Function Test Suite with pytest

# tests/test_tools_unit.py
import pytest
from unittest.mock import patch, MagicMock
from plugins.ecommerce import search_products, create_order

class TestSearchProducts:
    """Unit tests for search_products tool"""

    def test_basic_search_returns_results(self, mock_db):
        """Normal search returns a result list"""
        mock_db.query.return_value = [
            {"id": "p001", "name": "Noise-Cancelling Headphones", "price": 299},
            {"id": "p002", "name": "Bluetooth Speaker", "price": 199}
        ]
        results = search_products("bluetooth")
        assert len(results) == 2
        assert results[0]["id"] == "p001"

    def test_max_results_limit(self, mock_db):
        """max_results parameter limits the number of returned items"""
        mock_db.query.return_value = [{"id": f"p{i}"} for i in range(10)]
        results = search_products("headphones", max_results=3)
        assert len(results) == 3

    def test_empty_query_raises_error(self):
        """Empty query string should raise ValueError"""
        with pytest.raises(ValueError, match="Query string cannot be empty"):
            search_products("")

    def test_no_results_returns_empty_list(self, mock_db):
        """No matching results returns empty list, not an exception"""
        mock_db.query.return_value = []
        results = search_products("completely_nonexistent_product_xyz")
        assert results == []

    def test_sql_injection_sanitized(self, mock_db):
        """SQL injection strings should be sanitized"""
        results = search_products("'; DROP TABLE products; --")
        mock_db.query.assert_called_once()
        call_args = mock_db.query.call_args
        assert "DROP TABLE" not in str(call_args)

class TestCreateOrder:
    """Unit tests for create_order tool"""

    def test_successful_order_creation(self, mock_db):
        """Successful order creation returns order ID"""
        mock_db.insert.return_value = "order_12345"
        result = create_order("user_001", "p001", 2)
        assert result["order_id"] == "order_12345"
        assert result["status"] == "created"

    def test_zero_quantity_raises_error(self):
        """Quantity of 0 should raise a validation error"""
        with pytest.raises(ValueError, match="Quantity must be greater than 0"):
            create_order("user_001", "p001", 0)

    def test_nonexistent_product_raises_error(self, mock_db):
        """Non-existent product ID should raise an error"""
        mock_db.get_product.return_value = None
        with pytest.raises(ValueError, match="Product not found"):
            create_order("user_001", "nonexistent_product", 1)

@pytest.fixture
def mock_db():
    """Database mock fixture (move it to tests/conftest.py to share it across test modules)"""
    with patch("plugins.ecommerce.db") as mock:
        yield mock
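
The mock_db fixture depends on patch temporarily replacing a module-level attribute and restoring it afterwards. The sketch below shows that mechanism with the standard library only; the plugin namespace and its db attribute are stand-ins for plugins.ecommerce, and the toy search_products is an assumption made so the example runs on its own.

```python
from types import SimpleNamespace
from unittest.mock import patch

# Stand-in for the plugins.ecommerce module (hypothetical).
plugin = SimpleNamespace(db=None)

def search_products(query: str, max_results: int = 5) -> list[dict]:
    """Toy tool function that looks up plugin.db at call time."""
    if not query:
        raise ValueError("Query string cannot be empty")
    return plugin.db.query(query)[:max_results]

def run_with_mock() -> list[dict]:
    # patch.object swaps plugin.db for a MagicMock inside the with block
    # and restores the original value on exit.
    with patch.object(plugin, "db") as mock_db:
        mock_db.query.return_value = [{"id": f"p{i}"} for i in range(10)]
        results = search_products("headphones", max_results=3)
        mock_db.query.assert_called_once_with("headphones")
    return results
```

The patch only takes effect if the tool function reads the attribute when it is called; a function that captured the database object at import time would bypass the mock.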

57.3.2 Testing Schema Validity

Errors in a schema definition can cause tool invocations to fail at runtime, so the schemas need dedicated tests of their own.

# tests/test_schema_validation.py
import pytest
import jsonschema
from plugins.ecommerce import TOOLS

def get_tool_schema(tool_name: str) -> dict:
    for tool in TOOLS:
        if tool["name"] == tool_name:
            return tool["input_schema"]
    raise KeyError(f"Tool {tool_name} not found")

class TestSearchProductsSchema:
    def test_valid_minimal_input(self):
        """Only required fields should pass validation"""
        jsonschema.validate({"query": "headphones"}, get_tool_schema("search_products"))

    def test_missing_required_field_fails(self):
        """Missing required 'query' field should fail validation"""
        with pytest.raises(jsonschema.ValidationError):
            jsonschema.validate({"max_results": 5}, get_tool_schema("search_products"))

    def test_wrong_type_for_max_results(self):
        """Passing a string for max_results should fail"""
        with pytest.raises(jsonschema.ValidationError):
            jsonschema.validate({"query": "headphones", "max_results": "five"}, get_tool_schema("search_products"))
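
Beyond validating sample inputs, it can help to check the tool definitions themselves for structural mistakes. This is a minimal sketch using only the standard library; lint_tool_definitions is a hypothetical helper, and the rules it enforces (unique names, a description, an object-typed schema, required fields declared in properties) are this sketch's own conventions rather than an official checklist.

```python
def lint_tool_definitions(tools: list[dict]) -> list[str]:
    """Return human-readable problems found in a list of tool definitions."""
    problems = []
    seen_names = set()
    for tool in tools:
        name = tool.get("name", "<unnamed>")
        if name in seen_names:
            problems.append(f"{name}: duplicate tool name")
        seen_names.add(name)
        if not tool.get("description"):
            problems.append(f"{name}: missing description")
        schema = tool.get("input_schema", {})
        if schema.get("type") != "object":
            problems.append(f"{name}: input_schema type must be 'object'")
        properties = schema.get("properties", {})
        for field in schema.get("required", []):
            if field not in properties:
                problems.append(f"{name}: required field '{field}' is not in properties")
    return problems

# A well-formed definition and a deliberately broken one (both illustrative).
GOOD_TOOLS = [{
    "name": "search_products",
    "description": "Search the product catalog",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]
BAD_TOOLS = [{
    "name": "broken",
    "description": "Required field is never declared",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query", "limit"],
    },
}]
```

A test can then assert that the linter returns an empty list for the real TOOLS definition.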

57.4 Integration Testing: Validating Claude-Tool Collaboration

Integration tests verify that Claude correctly identifies which tool to call, passes the right parameters, and produces a sensible final response after tool results are returned.

57.4.1 Integration Tests with the Real Claude API

# tests/test_integration_claude.py
import pytest
import anthropic
from plugins.ecommerce import TOOLS, execute_tool

@pytest.fixture
def claude_client():
    return anthropic.Anthropic()

def run_tool_use_loop(client, messages: list, tools: list) -> dict:
    """Execute a complete tool use loop, return final outcome"""
    tool_calls = []
    final_text = ""

    while True:
        response = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=1024,
            tools=tools,
            messages=messages
        )

        if response.stop_reason == "end_turn":
            for block in response.content:
                if hasattr(block, "text"):
                    final_text = block.text
            break

        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    tool_calls.append({"name": block.name, "input": block.input})
                    result = execute_tool(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": str(result)
                    })

            messages = messages + [
                {"role": "assistant", "content": response.content},
                {"role": "user", "content": tool_results}
            ]
        else:
            break

    return {"final_text": final_text, "tool_calls": tool_calls}

class TestClaudeToolIntegration:
    @pytest.mark.integration
    def test_search_triggered_for_product_query(self, claude_client, mock_db):
        """Asking about products should trigger search_products"""
        mock_db.query.return_value = [
            {"id": "p001", "name": "Sony WH-1000XM5", "price": 349}
        ]

        result = run_tool_use_loop(
            claude_client,
            [{"role": "user", "content": "Can you search for noise-cancelling headphones?"}],
            TOOLS
        )

        assert any(tc["name"] == "search_products" for tc in result["tool_calls"])

    @pytest.mark.integration
    def test_multi_turn_maintains_context(self, claude_client, mock_db):
        """Tool calls in multi-turn conversations maintain context"""
        mock_db.query.return_value = [{"id": "p001", "name": "Headphone A", "price": 299}]

        messages = [{"role": "user", "content": "Search for bluetooth headphones"}]
        result = run_tool_use_loop(claude_client, messages, TOOLS)
        assert any(tc["name"] == "search_products" for tc in result["tool_calls"])

57.4.2 Lightweight Integration Testing with Mock Claude

Calling the real Claude API for every integration test is expensive and slow. A Mock Claude client enables fast CI/CD validation.

# tests/mocks/mock_claude.py
from dataclasses import dataclass, field

@dataclass
class MockToolUseBlock:
    type: str = "tool_use"
    id: str = "mock_tool_001"
    name: str = ""
    input: dict = field(default_factory=dict)

@dataclass
class MockTextBlock:
    type: str = "text"
    text: str = ""

@dataclass
class MockResponse:
    stop_reason: str
    content: list

class MockClaudeClient:
    """Simulates Claude tool-calling behavior for fast testing"""

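    def __init__(self, tool_call_plan: list[dict]):
        self.tool_call_plan = tool_call_plan
        self.call_index = 0
        # Mirror the SDK call path client.messages.create(...) so the mock
        # can be passed to run_tool_use_loop in place of the real client
        self.messages = self

    def create(self, **kwargs):
        return self.messages_create(**kwargs)

    def messages_create(self, **kwargs):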
        if self.call_index < len(self.tool_call_plan):
            plan = self.tool_call_plan[self.call_index]
            self.call_index += 1
            return MockResponse(
                stop_reason="tool_use",
                content=[MockToolUseBlock(
                    id=f"mock_{self.call_index:03d}",
                    name=plan["name"],
                    input=plan["input"]
                )]
            )
        return MockResponse(
            stop_reason="end_turn",
            content=[MockTextBlock(text="Based on search results, here are the products...")]
        )
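
To show how a scripted client can drive a tool-use loop end to end without network access, here is a self-contained sketch. ScriptedClient, Block, and run_loop are simplified, hypothetical stand-ins for the mock classes and the loop shown earlier; the max_turns cap is an addition of this sketch, not part of the original loop.

```python
from dataclasses import dataclass, field
from types import SimpleNamespace

@dataclass
class Block:
    type: str
    text: str = ""
    id: str = ""
    name: str = ""
    input: dict = field(default_factory=dict)

class ScriptedClient:
    """Replays a fixed plan of tool calls, then ends the turn."""
    def __init__(self, plan: list[dict]):
        self._plan = list(plan)
        self.requests = []  # kwargs of every create() call, kept for assertions
        self.messages = SimpleNamespace(create=self._create)

    def _create(self, **kwargs):
        self.requests.append(kwargs)
        if self._plan:
            step = self._plan.pop(0)
            block = Block(type="tool_use", id=f"mock_{len(self.requests):03d}",
                          name=step["name"], input=step["input"])
            return SimpleNamespace(stop_reason="tool_use", content=[block])
        return SimpleNamespace(stop_reason="end_turn",
                               content=[Block(type="text", text="Done.")])

def run_loop(client, messages, execute_tool, max_turns=5):
    """Simplified tool-use loop with a turn cap so a bad script cannot spin forever."""
    tool_calls = []
    for _ in range(max_turns):
        response = client.messages.create(messages=messages)
        if response.stop_reason != "tool_use":
            text = "".join(b.text for b in response.content if b.type == "text")
            return {"final_text": text, "tool_calls": tool_calls}
        results = []
        for block in response.content:
            if block.type == "tool_use":
                tool_calls.append({"name": block.name, "input": block.input})
                results.append({"type": "tool_result", "tool_use_id": block.id,
                                "content": str(execute_tool(block.name, block.input))})
        messages = messages + [{"role": "assistant", "content": response.content},
                               {"role": "user", "content": results}]
    raise RuntimeError("tool loop exceeded max_turns")

client = ScriptedClient([{"name": "search_products", "input": {"query": "headphones"}}])
outcome = run_loop(client, [{"role": "user", "content": "Find headphones"}],
                   execute_tool=lambda name, args: [{"id": "p001"}])
```

Recording each request lets a test assert on what the loop sent back to the model, for example that every tool_result carries the matching tool_use_id.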

57.5 Sandbox Environments: Isolating Side Effects

For plugins with write operations (creating orders, sending emails, modifying databases), testing must occur in sandbox environments to prevent test data from contaminating production.

57.5.1 Docker Sandbox Configuration

# docker-compose.test.yml
version: "3.8"
services:
  test-db:
    image: postgres:15-alpine
    environment:
      POSTGRES_DB: plugin_test
      POSTGRES_USER: testuser
      POSTGRES_PASSWORD: testpass
    ports:
      - "5433:5432"
    volumes:
      - ./tests/fixtures/init.sql:/docker-entrypoint-initdb.d/init.sql

  wiremock:
    image: wiremock/wiremock:3.3.1
    ports:
      - "8080:8080"
    volumes:
      - ./tests/wiremock:/home/wiremock

  plugin-test-runner:
    build: .
    environment:
      DATABASE_URL: postgresql://testuser:testpass@test-db:5432/plugin_test
      EXTERNAL_API_BASE: http://wiremock:8080
    depends_on: [test-db, wiremock]
    command: pytest tests/ -m "sandbox" -v

57.5.2 Transaction Rollback Strategy

# tests/conftest.py
import os

import pytest
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

@pytest.fixture(scope="function")
def db_session():
    """Each test function uses an isolated transaction, auto-rolled back after the test"""
    engine = create_engine(os.environ["DATABASE_URL"])
    connection = engine.connect()
    transaction = connection.begin()

    Session = sessionmaker(bind=connection)
    session = Session()

    import plugins.ecommerce as plugin
    original_session = plugin.db_session
    plugin.db_session = session

    yield session

    session.close()
    transaction.rollback()
    connection.close()
    plugin.db_session = original_session

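# The tests below belong in a test module, e.g. tests/test_sandbox.py (hypothetical name);
# pytest does not collect tests from conftest.py itself.
# Assumes Order and PaymentError are exported by plugins.ecommerce.
import pytest
from plugins.ecommerce import create_order, Order, PaymentError

@pytest.mark.sandbox
class TestCreateOrderSandbox: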
    def test_order_persisted_to_database(self, db_session):
        """Created order should be persisted to the database"""
        result = create_order("user_001", "p001", 2)
        order = db_session.query(Order).filter_by(id=result["order_id"]).first()
        assert order is not None
        assert order.quantity == 2
        # Transaction rolls back automatically after test

    def test_payment_failure_rolls_back_order(self, db_session):
        """Payment failure should roll back the entire order"""
        with pytest.raises(PaymentError):
            create_order("user_001", "p001", 99999)  # exceeds limit
        
        orders = db_session.query(Order).filter_by(user_id="user_001").all()
        assert len(orders) == 0
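
The rollback idea can be tried in isolation with the standard library's sqlite3 module. This sketch is not the SQLAlchemy fixture above; it only demonstrates the underlying principle that writes made inside an uncommitted transaction disappear on rollback. The orders table and the rolled_back helper are illustrative assumptions.

```python
import sqlite3
from contextlib import contextmanager

@contextmanager
def rolled_back(conn: sqlite3.Connection):
    """Yield the connection, then discard every uncommitted write made inside the block."""
    try:
        yield conn
    finally:
        conn.rollback()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id TEXT, quantity INTEGER)")
conn.commit()  # schema setup is committed; only the test's writes are rolled back

def count_orders() -> int:
    return conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

with rolled_back(conn):
    # sqlite3 opens a transaction implicitly before the INSERT.
    conn.execute("INSERT INTO orders (user_id, quantity) VALUES (?, ?)", ("user_001", 2))
    visible_inside = count_orders()

visible_after = count_orders()
```

The row is visible while the transaction is open and gone afterwards, which is the behaviour each sandbox test relies on.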

57.6 Coverage and CI/CD Integration

57.6.1 pytest Configuration and Marker Strategy

# pytest.ini
[pytest]
markers =
    unit: Unit tests, no external dependencies
    integration: Integration tests, requires Claude API
    sandbox: Sandbox tests, requires Docker environment
    slow: Tests taking more than 5 seconds

addopts = 
    --cov=plugins
    --cov-report=html:coverage_report
    --cov-report=term-missing
    --cov-fail-under=80

57.6.2 GitHub Actions CI Pipeline

# .github/workflows/plugin-tests.yml
name: Plugin Test Suite

on:
  push:
    paths: ['plugins/**', 'tests/**']

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with: {python-version: '3.11'}
      - run: pip install -r requirements-test.txt
      - run: pytest -m "unit" -v --tb=short
      - uses: codecov/codecov-action@v3

  integration-tests:
    runs-on: ubuntu-latest
    needs: unit-tests
    if: github.ref == 'refs/heads/main'
    env:
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with: {python-version: '3.11'}
      - run: pip install -r requirements-test.txt
      - run: pytest -m "integration" -v
        timeout-minutes: 10

57.7 Common Testing Pitfalls and Solutions

Pitfall 1: Assuming Claude's Behavior Is Deterministic

# Wrong: testing the exact sequence of tool calls
def test_exact_tool_sequence(claude_client):
    result = run_tool_use_loop(claude_client, [...], TOOLS)
    assert result["tool_calls"][0]["name"] == "search_products"  # Fragile!
    assert result["tool_calls"][1]["name"] == "create_order"     # Fragile!

# Right: test outcome properties, not exact sequences
def test_order_eventually_created(claude_client, mock_db):
    result = run_tool_use_loop(claude_client, [...], TOOLS)
    order_calls = [tc for tc in result["tool_calls"] if tc["name"] == "create_order"]
    assert len(order_calls) >= 1
    assert order_calls[0]["input"]["quantity"] == 2
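
An order-insensitive assertion helper keeps such tests stable across runs. The sketch below assumes tool calls are recorded as dicts with name and input keys, as run_tool_use_loop does; find_tool_calls and the two sample runs are hypothetical.

```python
def find_tool_calls(tool_calls: list[dict], name: str, **expected_input) -> list[dict]:
    """Return calls to `name` whose input contains every expected key/value pair."""
    return [
        call for call in tool_calls
        if call["name"] == name
        and all(call["input"].get(key) == value for key, value in expected_input.items())
    ]

# Two recorded runs of the same prompt: the second adds an extra search
# before placing the order, which an exact-sequence assertion would reject.
run_a = [
    {"name": "search_products", "input": {"query": "headphones"}},
    {"name": "create_order", "input": {"user_id": "u1", "product_id": "p001", "quantity": 2}},
]
run_b = [
    {"name": "search_products", "input": {"query": "headphones"}},
    {"name": "search_products", "input": {"query": "noise cancelling headphones"}},
    {"name": "create_order", "input": {"user_id": "u1", "product_id": "p001", "quantity": 2}},
]
```

Both runs satisfy the same outcome check, even though their call sequences differ.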

Pitfall 2: Returning Non-serializable Objects from Tools

# Wrong: SQLAlchemy objects cannot be serialized to JSON for Claude
def search_products_bad(query: str) -> list:
    return db.query(Product).filter(...).all()  # SQLAlchemy objects!

# Right: return plain dicts
def search_products_good(query: str) -> list[dict]:
    results = db.query(Product).filter(...).all()
    return [{"id": p.id, "name": p.name, "price": p.price} for p in results]
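
Even plain dicts can contain values that JSON cannot represent, such as Decimal prices or dates. One possible approach, sketched here with the standard library, is to serialize tool results through json.dumps with an explicit fallback; to_tool_result_content and _json_default are hypothetical helpers, and rendering Decimal as a string is a choice of this sketch.

```python
import json
from datetime import date, datetime
from decimal import Decimal

def _json_default(value):
    """Fallback for types json.dumps cannot encode on its own."""
    if isinstance(value, Decimal):
        return str(value)  # keep exact decimal digits rather than a float
    if isinstance(value, (date, datetime)):
        return value.isoformat()
    # Anything else fails loudly so it is caught in tests, not in production.
    raise TypeError(f"Not JSON serializable: {type(value).__name__}")

def to_tool_result_content(result) -> str:
    """Serialize a tool result into a JSON string for a tool_result block."""
    return json.dumps(result, default=_json_default, ensure_ascii=False)

sample = [{"id": "p001", "price": Decimal("299.00"), "added": date(2024, 1, 2)}]
encoded = to_tool_result_content(sample)
```

Failing on unknown types turns an accidental ORM object in a result into a test failure instead of an opaque string sent to the model.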

Pitfall 3: Production Database Contamination

# tests/conftest.py
import os

import pytest

def pytest_configure(config):
    """Ensure tests never accidentally connect to production"""
    db_url = os.environ.get("DATABASE_URL", "")
    # "prod" also matches "production"
    if "prod" in db_url:
        pytest.exit("Production database URL detected; aborting!", returncode=1)
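
A substring check can misfire in both directions: it flags a harmless database named products_test and misses a production host whose URL happens not to contain "prod". A stricter alternative, sketched here under the assumption that test databases run on a known set of hosts, is an allowlist; ALLOWED_TEST_HOSTS and is_safe_test_database are hypothetical names, and the host list should be adjusted per project.

```python
from urllib.parse import urlsplit

# Assumption: test databases only ever run on these hosts.
ALLOWED_TEST_HOSTS = {"localhost", "127.0.0.1", "test-db"}

def is_safe_test_database(db_url: str) -> bool:
    """True only when the URL points at an allowlisted host."""
    if not db_url:
        return False
    host = urlsplit(db_url).hostname
    return host in ALLOWED_TEST_HOSTS
```

With this helper, the guard in pytest_configure could abort whenever is_safe_test_database returns False, so an unknown host is rejected by default.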

Summary

A sound plugin testing strategy follows the pyramid model: a large base of unit tests (fast, no dependencies), a middle layer of integration tests (verifying Claude collaboration), and a small top layer of sandbox tests (validating end-to-end side effects). Key practices include: schema validation tests to prevent parameter errors, transaction rollback to isolate write operations, a Mock Claude client to speed up CI pipelines, and environment guards to prevent accidental production access. Testing is not overhead; it is the pathway from prototype to production.
