Description

Design and implement adaptive testing systems using Item Response Theory (IRT). Use when working with computerized adaptive tests (CAT), psychometric assessm...

Usage Instructions (SKILL.md)

Adaptive Testing with IRT

Name: Adaptive Testing
Author: woodstocksoftware

Design computerized adaptive tests that measure ability efficiently and accurately using Item Response Theory.

Core Concept

Adaptive tests adjust difficulty in real time based on student responses. A correct answer → harder question. An incorrect answer → easier question. The result: accurate ability estimates in ~50% fewer questions than fixed-length tests.

Key advantage: Traditional tests waste time on too-easy or too-hard questions. Adaptive tests spend time where measurement matters most — near the student's ability level.

Quick Decision Tree

You need to...	See
Understand IRT models and parameters	IRT Fundamentals
Design a new adaptive test	Test Design Workflow
Choose item selection algorithm	Item Selection
Decide when to stop the test	Stopping Rules
Calibrate new questions	`references/calibration.md`
Implement CAT algorithm	`references/implementation.md`

IRT Fundamentals

The 3-Parameter Logistic (3PL) Model

Most adaptive tests use the 3PL model. Each question has three parameters:

a (discrimination): How well the question differentiates ability levels. Higher = steeper curve. Typical range: 0.5 to 2.5
b (difficulty): The ability level at the curve's inflection point, where P(correct) = (1 + c) / 2; this equals 0.5 only when c = 0. Range: -3 to +3 (standardized scale)
c (guessing): Lower asymptote of the curve, i.e. the probability that a very low-ability student answers correctly by guessing. Usually 0.2 to 0.25 for multiple choice

Probability of correct response:

P(correct | ability, a, b, c) = c + (1 - c) / (1 + e^(-a(ability - b)))
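The formula above translates directly into code. The following is a minimal sketch using only the standard library; the function name `p_correct` is an illustrative choice, not part of any library.

```python
import math


def p_correct(theta: float, a: float, b: float, c: float = 0.0) -> float:
    """Probability of a correct response under the 3PL model.

    theta is the ability; a, b, c are the item parameters.
    c = 0 gives the 2PL model; c = 0 and a = 1 gives the 1PL (Rasch) model.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# At theta == b the logistic term is 0.5, so P = c + (1 - c) / 2.
print(round(p_correct(theta=0.0, a=1.2, b=0.0, c=0.2), 3))  # 0.6
# Well above the item's difficulty the probability approaches 1.
print(p_correct(theta=3.0, a=1.2, b=0.0, c=0.2) > 0.95)     # True
```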

Simpler models:

2PL: Set c = 0 (no guessing parameter)
1PL (Rasch): Set c = 0 and a = 1 for all items (only difficulty varies)

Use 3PL for high-stakes tests. Use 2PL/1PL when sample size is small (<500 responses per item).

Information and Standard Error

Information measures how precisely an item estimates ability at a given level. Peak information occurs when ability ≈ difficulty (b parameter).

Standard Error (SE) is the inverse square root of the test information, i.e. the sum of item information over the items administered so far:

SE = 1 / sqrt(Information)

Goal of CAT: Maximize information (minimize SE) at the student's true ability level.
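As a sketch of how information and SE relate, the block below uses the standard 3PL item information formula, I(θ) = a² · (Q/P) · ((P − c)/(1 − c))² with Q = 1 − P, and sums it over administered items. The function names are illustrative, and items are assumed to be `(a, b, c)` tuples.

```python
import math


def p_correct(theta, a, b, c=0.0):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b, c=0.0):
    """Fisher information contributed by one 3PL item at ability theta."""
    p = p_correct(theta, a, b, c)
    return a ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def standard_error(theta, items):
    """SE of the ability estimate: 1 / sqrt(sum of item information)."""
    total = sum(item_information(theta, a, b, c) for a, b, c in items)
    return 1.0 / math.sqrt(total)


# A 2PL item with a = 1 contributes 0.25 at theta == b,
# so 16 such items give SE = 1 / sqrt(4) = 0.5.
print(standard_error(0.0, [(1.0, 0.0, 0.0)] * 16))  # 0.5
```

With c = 0 the formula reduces to a² · P · Q, which is why the guessing parameter always lowers the information an item can provide.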

Test Design Workflow

1. Define Test Specifications

Purpose: Placement, diagnostic, certification, progress monitoring?
Content domain: Single skill or multidimensional?
Target population: What ability range (-3 to +3)?
Constraints: Time limit, minimum/maximum length, content balance

2. Build Item Bank

Minimum bank size: 10× the average test length. For a 20-item CAT, you need ≥200 calibrated items.

Distribution targets:

Difficulty (b): Spread across expected ability range
Discrimination (a): Target 1.0 to 2.0 (high discrimination)
Exposure: No item used >20% of the time

Content balancing: If testing math, ensure geometry/algebra/etc. are proportionally represented.
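The size and distribution targets above can be checked mechanically before a bank goes into use. This is a rough sketch: `audit_bank` is a hypothetical helper, items are assumed to be `(a, b)` tuples, and the unit-wide difficulty bins are an arbitrary choice for illustration.

```python
import math
from collections import Counter


def audit_bank(items, test_length, b_min=-3, b_max=3):
    """Check a calibrated bank against the size and distribution targets.

    items: list of (a, b) tuples.
    """
    # Count items per unit-wide difficulty bin; values outside the range
    # are clamped into the first or last bin.
    bins = Counter(
        min(max(math.floor(b), b_min), b_max - 1) for _, b in items
    )
    in_target_a = sum(1 for a, _ in items if 1.0 <= a <= 2.0)
    return {
        "size_ok": len(items) >= 10 * test_length,
        "empty_bins": [k for k in range(b_min, b_max) if bins[k] == 0],
        "share_a_1_to_2": in_target_a / len(items) if items else 0.0,
    }


# 240 items spread evenly over six difficulty bins, for a 20-item CAT.
bank = [(1.2, b + 0.5) for b in range(-3, 3)] * 40
print(audit_bank(bank, test_length=20))
```

An entry in `empty_bins` means the test has no items targeted at that ability range, so estimates there will be imprecise regardless of the selection algorithm.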

3. Choose Algorithms

Pick one from each category:

Item selection: (see below)

Maximum Information
Randomesque (MFI + exposure control)
Content balancing

Ability estimation:

Maximum Likelihood Estimation (MLE)
Expected A Posteriori (EAP) — better for extreme scores
Weighted Likelihood (WLE)

Stopping rule: (see below)

Fixed length
Standard error threshold
Information threshold

4. Simulate Performance

Before going live, simulate 1000+ test sessions with known abilities. Check:

Average test length
SE at different ability levels
Item exposure rates
Content balance adherence

Adjust if needed.
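A simulation of this kind can be sketched in a few dozen lines. The block below is an illustration under stated assumptions, not a reference implementation: the bank is randomly generated, selection is pure maximum information, ability is estimated by EAP on a fixed grid with a standard normal prior, and the stopping rule is SE < 0.3 or 30 items. All function names and defaults are hypothetical.

```python
import math
import random

GRID = [g / 10.0 for g in range(-40, 41)]  # quadrature points for EAP


def p_correct(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def info(theta, a, b, c):
    p = p_correct(theta, a, b, c)
    return a ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def eap(responses):
    """Posterior mean and SD of ability under a standard normal prior."""
    weights = []
    for g in GRID:
        w = math.exp(-0.5 * g * g)
        for (a, b, c), correct in responses:
            p = p_correct(g, a, b, c)
            w *= p if correct else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(g * w for g, w in zip(GRID, weights)) / total
    var = sum((g - mean) ** 2 * w for g, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)


def simulate_session(true_theta, bank, rng, se_target=0.3, max_items=30):
    """Run one simulated test; returns (estimate, SE, indices used)."""
    theta, se, responses, used = 0.0, float("inf"), [], set()
    while len(responses) < max_items and se >= se_target:
        idx = max((i for i in range(len(bank)) if i not in used),
                  key=lambda i: info(theta, *bank[i]))
        used.add(idx)
        correct = rng.random() < p_correct(true_theta, *bank[idx])
        responses.append((bank[idx], correct))
        theta, se = eap(responses)
    return theta, se, sorted(used)


def simulate(n_sessions=200, bank_size=300, seed=0):
    """Simulate many sessions and summarize length, accuracy, exposure."""
    rng = random.Random(seed)
    bank = [(rng.uniform(0.8, 2.0), rng.uniform(-3.0, 3.0), 0.2)
            for _ in range(bank_size)]
    lengths, errors, exposure = [], [], [0] * bank_size
    for _ in range(n_sessions):
        true_theta = rng.gauss(0.0, 1.0)
        estimate, _, used = simulate_session(true_theta, bank, rng)
        lengths.append(len(used))
        errors.append(estimate - true_theta)
        for i in used:
            exposure[i] += 1
    return {
        "mean_length": sum(lengths) / n_sessions,
        "rmse": math.sqrt(sum(e * e for e in errors) / n_sessions),
        "max_exposure": max(exposure) / n_sessions,
    }
```

Calling `simulate(n_sessions=1000)` returns the average test length, the root-mean-square error of the estimates, and the highest item exposure rate. Because every session starts at θ = 0 and pure maximum information is deterministic, the first item is the same in every session, so `max_exposure` comes out at 1.0; that is the exposure problem the selection strategies in the next section address.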

Item Selection Strategies

Maximum Fisher Information (MFI)

Rule: Select the item with highest information at current ability estimate.

Pros: Optimal precision, shortest tests
Cons: Overuses "best" items, poor security

Use when: Pilot testing, low-stakes practice
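A minimal sketch of the MFI rule, assuming the bank is a list of `(a, b, c)` tuples and `administered` is a set of indices already used (both are illustrative conventions, not a fixed API):

```python
import math


def p_correct(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b, c):
    p = p_correct(theta, a, b, c)
    return a ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def select_mfi(theta, bank, administered):
    """Index of the unused item with the highest information at theta."""
    candidates = [i for i in range(len(bank)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta, *bank[i]))


bank = [(1.0, -2.0, 0.2), (1.0, 0.0, 0.2), (1.0, 2.0, 0.2)]
print(select_mfi(0.0, bank, administered=set()))  # 1
print(select_mfi(0.0, bank, administered={1}))    # 0
```

The second call shows the asymmetry introduced by guessing: once the best item is used, the easier item (b = -2) is preferred over the harder one (b = +2), because a hard item answered mostly by guessing carries little information.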

Randomesque (MFI + Exposure Control)

Rule: Select from top N items by information (e.g., top 5), choose randomly from that set.

Pros: Balances precision and security
Cons: Slightly longer tests than pure MFI

Use when: Operational tests, default choice
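The randomesque rule only changes the last step of MFI: rank by information, then draw at random from the best few. A sketch under the same assumptions as before (bank of `(a, b, c)` tuples, hypothetical function names):

```python
import math
import random


def p_correct(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b, c):
    p = p_correct(theta, a, b, c)
    return a ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def select_randomesque(theta, bank, administered, rng, top_n=5):
    """Pick uniformly at random among the top_n most informative unused items."""
    candidates = [i for i in range(len(bank)) if i not in administered]
    ranked = sorted(candidates,
                    key=lambda i: item_information(theta, *bank[i]),
                    reverse=True)
    return rng.choice(ranked[:top_n])


bank = [(1.0, b, 0.2) for b in (-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0)]
rng = random.Random(0)
picks = {select_randomesque(0.0, bank, set(), rng, top_n=3) for _ in range(200)}
print(sorted(picks))  # [2, 3, 4]
```

Passing the random generator in explicitly keeps simulations reproducible; with `top_n=1` the function reduces to pure MFI.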

a-Stratified

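Rule: Partition the bank into strata by discrimination. Start with low-discrimination items (low a) while the ability estimate is still rough, and save high-discrimination items (high a) for later stages. Within a stratum, pick the item whose b is closest to the current estimate.

Pros: Evens out item exposure, keeps the most informative items for when the estimate is accurate enough to use them well
Cons: Complex to implement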

Use when: Very large item banks, research settings
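A rough sketch of a-stratified selection follows. It uses the ordering from the original a-stratified design by Chang and Ying, in which strata are visited in ascending order of a (low-discrimination items early, high-discrimination items late) and items are matched on b within a stratum. The equal-size strata, the fallback to neighbouring strata, and the function names are assumptions made for illustration.

```python
import math


def build_strata(bank, n_strata):
    """Split item indices into strata of ascending discrimination (a)."""
    order = sorted(range(len(bank)), key=lambda i: bank[i][0])
    size = math.ceil(len(order) / n_strata)
    return [order[k * size:(k + 1) * size] for k in range(n_strata)]


def select_a_stratified(theta, bank, strata, administered, position, test_length):
    """Pick the unused item with b closest to theta from the current stratum.

    position is the 0-based index of the item about to be administered.
    """
    stage = min(position * len(strata) // test_length, len(strata) - 1)
    # Try the current stratum first, then later ones, then earlier ones.
    search = list(range(stage, len(strata))) + list(range(stage - 1, -1, -1))
    for k in search:
        pool = [i for i in strata[k] if i not in administered]
        if pool:
            return min(pool, key=lambda i: abs(bank[i][1] - theta))
    raise ValueError("item bank exhausted")


# Items are (a, b, c) tuples; two low-a and two high-a items.
bank = [(0.5, 0.0, 0.2), (0.6, 1.0, 0.2), (1.5, 0.0, 0.2), (1.6, 1.0, 0.2)]
strata = build_strata(bank, n_strata=2)
print(strata)                                                    # [[0, 1], [2, 3]]
print(select_a_stratified(0.9, bank, strata, set(), 0, 4))       # 1
print(select_a_stratified(0.9, bank, strata, set(), 3, 4))       # 3
```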

Content Balancing

Rule: Track content area usage, prioritize underrepresented areas when selecting next item.

Implementation: Weight information by content constraint satisfaction.

Use when: Blueprint requirements, multidimensional tests
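One simple way to enforce a blueprint is sketched below. Note that it does not weight information as described above; it swaps in a simpler variant that first picks the content area with the largest shortfall against its target share and then applies maximum information within that area. The item dictionaries, the `area` key, and the function names are assumptions for illustration.

```python
import math


def p_correct(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b, c):
    p = p_correct(theta, a, b, c)
    return a ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def select_content_balanced(theta, bank, administered, targets):
    """bank: list of dicts with keys a, b, c, area.
    administered: indices already used. targets: area -> target proportion.
    """
    n = len(administered)
    counts = {area: 0 for area in targets}
    for i in administered:
        counts[bank[i]["area"]] += 1

    def shortfall(area):
        # Target share minus the share observed so far.
        return targets[area] - (counts[area] / n if n else 0.0)

    # Visit areas from most to least underrepresented.
    for area in sorted(targets, key=shortfall, reverse=True):
        pool = [i for i in range(len(bank))
                if i not in administered and bank[i]["area"] == area]
        if pool:
            return max(pool, key=lambda i: item_information(
                theta, bank[i]["a"], bank[i]["b"], bank[i]["c"]))
    raise ValueError("item bank exhausted")


targets = {"algebra": 0.4, "geometry": 0.3, "statistics": 0.3}
bank = [
    {"a": 1.0, "b": 0.0, "c": 0.2, "area": "algebra"},
    {"a": 1.0, "b": 1.5, "c": 0.2, "area": "algebra"},
    {"a": 1.0, "b": 0.0, "c": 0.2, "area": "geometry"},
    {"a": 1.0, "b": -1.5, "c": 0.2, "area": "geometry"},
    {"a": 1.0, "b": 0.0, "c": 0.2, "area": "statistics"},
    {"a": 1.0, "b": 2.0, "c": 0.2, "area": "statistics"},
]
print(select_content_balanced(0.0, bank, [], targets))      # 0 (algebra)
print(select_content_balanced(0.0, bank, [0], targets))     # 2 (geometry)
print(select_content_balanced(0.0, bank, [0, 2], targets))  # 4 (statistics)
```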

Stopping Rules

Fixed Length

Stop after N items (e.g., 20 questions).

Pros: Predictable time, simple
Cons: May over/under-test some students

Use when: Time limits matter, simple implementation needed

Standard Error Threshold

Stop when SE < target (e.g., SE < 0.3).

Pros: Consistent precision across ability levels
Cons: Variable test length (harder to schedule)

Typical targets:

Low-stakes: SE < 0.4
Medium-stakes: SE < 0.3
High-stakes: SE < 0.25

Use when: Precision matters more than time

Combined Rule

Stop when (SE < target) OR (length ≥ max) OR (length ≥ min AND ability estimate stable).

Use when: Production systems (safest approach)
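The combined rule can be written as a single function that also reports why the test stopped. This is a sketch: "stable" is not defined in the text, so it is assumed here to mean that the last few ability estimates stay within a small band; the window, the band width, and all default values are illustrative.

```python
from typing import Optional, Sequence


def should_stop(se: float, n_items: int, theta_history: Sequence[float],
                se_target: float = 0.3, min_items: int = 10,
                max_items: int = 30, stable_window: int = 3,
                stable_delta: float = 0.05) -> Optional[str]:
    """Return the reason to stop, or None to keep testing.

    theta_history holds the ability estimate after each administered item.
    """
    if se < se_target:
        return "se_target"
    if n_items >= max_items:
        return "max_length"
    # Assumed stability test: the last stable_window updates moved the
    # estimate by less than stable_delta overall.
    recent = list(theta_history[-(stable_window + 1):])
    stable = (len(recent) == stable_window + 1
              and max(recent) - min(recent) < stable_delta)
    if n_items >= min_items and stable:
        return "stable_estimate"
    return None


print(should_stop(0.25, 5, [0.1, 0.4]))                   # se_target
print(should_stop(0.50, 30, []))                          # max_length
print(should_stop(0.50, 12, [0.50, 0.52, 0.51, 0.50]))    # stable_estimate
print(should_stop(0.50, 5, [0.50, 0.52, 0.51, 0.50]))     # None
```

Returning the reason rather than a bare boolean makes it easy to report, during simulation, how often each condition ends the test.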

Practical Considerations

Starting Ability Estimate

Options:

Population mean (θ = 0)
Prior information (e.g., grade level, previous test)
First question is medium difficulty, estimate from there

Never start at extremes (-3 or +3).

Handling Extreme Response Patterns

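All correct or all incorrect: The MLE has no finite solution, because the likelihood keeps increasing as ability goes to +∞ or -∞. Use EAP or another estimator with a Bayesian prior to regularize.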
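To see why EAP copes with these patterns, the sketch below computes the posterior mean on a fixed grid under a standard normal prior. The grid from -4 to +4 in steps of 0.1 and the prior are assumptions; responses are `((a, b, c), correct)` pairs.

```python
import math

GRID = [g / 10.0 for g in range(-40, 41)]  # quadrature points, -4.0 to 4.0


def p_correct(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def eap(responses):
    """Posterior mean and SD of ability under a standard normal prior."""
    weights = []
    for g in GRID:
        w = math.exp(-0.5 * g * g)          # unnormalized N(0, 1) prior
        for (a, b, c), correct in responses:
            p = p_correct(g, a, b, c)
            w *= p if correct else 1.0 - p  # likelihood of each response
        weights.append(w)
    total = sum(weights)
    mean = sum(g * w for g, w in zip(GRID, weights)) / total
    var = sum((g - mean) ** 2 * w for g, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)


item = (1.0, 0.0, 0.0)
all_correct = [(item, True)] * 5
all_incorrect = [(item, False)] * 5
print(eap(all_correct))    # finite, positive estimate
print(eap(all_incorrect))  # finite, negative estimate
```

The prior pulls the estimate back toward the population mean, so a perfect score yields a high but finite ability with a usable posterior SD instead of a diverging MLE.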

Rapid changes: If the ability estimate jumps by more than 1.0 after a single response, treat it as a possible response anomaly (cheating, guessing).

Exposure Control

Track how often each item is used. Flag items used >20% of the time. Consider:

Randomesque selection (above)
Sympson-Hetter method (advanced)
Periodic item bank refresh
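Tracking exposure needs little more than a counter per item. The class below is a hypothetical helper, shown as a sketch; it counts each item at most once per session and flags items whose exposure rate exceeds the 20% threshold mentioned above.

```python
from collections import Counter


class ExposureTracker:
    """Track the share of sessions in which each item was administered."""

    def __init__(self, max_rate=0.20):
        self.max_rate = max_rate
        self.sessions = 0
        self.counts = Counter()

    def record_session(self, item_ids):
        """Register one finished session and the items it used."""
        self.sessions += 1
        self.counts.update(set(item_ids))

    def rates(self):
        return {item: n / self.sessions for item, n in self.counts.items()}

    def flagged(self):
        """Items whose exposure rate exceeds max_rate."""
        return sorted(i for i, r in self.rates().items() if r > self.max_rate)


tracker = ExposureTracker()
for s in range(10):
    items = ["C%d" % s]
    if s < 3:
        items.append("A")   # used in 3 of 10 sessions
    if s < 2:
        items.append("B")   # used in 2 of 10 sessions
    tracker.record_session(items)
print(tracker.flagged())  # ['A']
```

Item B sits exactly at 20% and is not flagged, because the rule in the text is "used >20% of the time".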

Multidimensional IRT (MIRT)

If testing multiple skills (e.g., algebra + geometry), use separate ability estimates per dimension. Select items to balance information across dimensions.

Warning: MIRT requires larger item banks and more complex calibration.

Common Mistakes

❌ Too few items in bank → High exposure, security risk
✅ Aim for 10× average test length

❌ Poorly distributed difficulties → Accurate only in narrow ability range
✅ Spread items across -2 to +2 difficulty

❌ Ignoring content balance → May skip important topics
✅ Build content constraints into item selection

❌ Using MLE when all responses are incorrect → Estimate diverges to -∞
✅ Use EAP or cap estimates at -3/+3

❌ No exposure control → Same items every test
✅ Use randomesque or Sympson-Hetter

When to Load References

Need	File
Calibrate new items (collect data, estimate parameters)	`references/calibration.md`
Implement CAT algorithm (code patterns, libraries)	`references/implementation.md`

Real-World Example: K-12 Math Placement

Setup:

Item bank: 300 questions, b from -2 (basic) to +2 (advanced)
Target: SE < 0.35 or max 25 questions
Content: 40% algebra, 30% geometry, 30% statistics
Algorithm: Randomesque (top 5), EAP estimation

Flow:

Start at θ = 0 (grade-level average)
Select item: b ≈ 0, content area needed
Student answers → update ability estimate (EAP)
Select next: maximize information at new θ, respect content balance, randomesque from top 5
Stop when SE < 0.35 or 25 questions reached
Report: ability estimate + placement recommendation

Result: Average 18 questions, 95% of students placed within ±0.5 grade levels of true ability.

This is a documentation/instruction skill and appears internally consistent. Before using its example code in production, treat the samples as illustrative only: add authentication, session management, input validation, rate limiting, and database protections; encrypt and minimize storage of student identifiers to meet privacy laws (e.g., FERPA/GDPR); review item exposure controls and bias/fairness concerns; and run extensive simulation and security testing. The skill does not request secrets or install software, but any systems you build from it will need secure credentials and operational hardening — don't deploy the example API as-is to a public endpoint.

Functional Analysis

Type: OpenClaw Skill
Name: adaptivetest
Version: 1.0.3

The OpenClaw skill bundle provides comprehensive documentation and code examples for designing and implementing adaptive testing systems using Item Response Theory. All files, including `SKILL.md`, `references/calibration.md`, and `references/implementation.md`, are purely educational and technical. The Python code snippets demonstrate core algorithms and API patterns using standard libraries (`numpy`, `scipy`, `fastapi`, `sqlalchemy`) without any direct file system access, network calls to suspicious domains, or use of dangerous functions (`os.system`, `eval`). There is no evidence of prompt injection attempts against the AI agent, data exfiltration, malicious execution, persistence mechanisms, or obfuscation. The content is entirely aligned with its stated purpose.

Capability Assessment

✓ Purpose & Capability

Name, description, and included documents all focus on Item Response Theory and computerized adaptive testing. There are no required binaries, environment variables, or external services declared, which is appropriate for a documentation/instruction-only skill.

✓ Instruction Scope

SKILL.md and referenced files contain detailed algorithms, calibration guidance, and example implementation code. The instructions remain on-topic (item selection, estimation, calibration, simulation, API patterns). They do mention handling student identifiers and item banks (expected for this domain) but do not instruct reading unrelated system files or exfiltrating data.

✓ Install Mechanism

No install spec or code files to execute were provided; this is instruction-only, which minimizes installation risk.

ℹ Credentials

The skill declares no environment variables or credentials (appropriate). As a domain note: the guidance discusses handling student identifiers and item banks — deploying systems based on these instructions will require careful handling of PII, database credentials, and hosting credentials, but the skill itself does not request them.

✓ Persistence & Privilege

Skill is not always-enabled and does not request persistent agent privileges or modify other skills. It can be invoked by the user and (platform-default) by models, which is expected for a helper skill.

Version History

v1.0.3

Switch base URL to api.adaptivetest.io custom domain

v1.0.2

Add API key signup link; move internal specs to private repo

v1.0.1

Move internal specs to private repo to clarify credential boundaries

v1.0.0

- Initial release of adaptivetest, an adaptive testing engine with IRT/CAT and AI capabilities.
- Supports adaptive test creation, administration, and automatic item selection based on student ability.
- Enables AI-powered question generation and provides personalized learning recommendations.
- Includes endpoints for student/class management, item calibration, and detailed results analytics.
- OneRoster 1.2 compatibility for SIS integration.
- Comprehensive API documentation and error handling provided.

Metadata

Slug adaptivetest

Version 1.0.3

License (none listed)

Total installs 1

Current installs 1

Historical versions 4

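Frequently Asked Questions

What is Adaptive Testing?

Design and implement adaptive testing systems using Item Response Theory (IRT). Use when working with computerized adaptive tests (CAT), psychometric assessm... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 592 cumulative downloads to date.

How do I install Adaptive Testing?

Run the command "/install adaptivetest" in the OpenClaw or Claude Code chat box for a one-step install; no additional configuration is required.

Is Adaptive Testing free?

Yes. Adaptive Testing is completely free (free and open source) and can be freely downloaded, installed, and used.

Which platforms does Adaptive Testing support?

Adaptive Testing is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who develops Adaptive Testing?

It is developed and maintained by woodstocksoftware (@woodstocksoftware); the current version is v1.0.3.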
