/install perf-test-flagos
Accuracy + Performance Test
Start vllm serve with the target model, then run accuracy benchmarks (when FlagEval is available) and performance benchmarks (vllm bench serve) across multiple profiles.
Skill Components
perf-test/
├── SKILL.md                       # This file: execution flow
├── scripts/
│   ├── run_benchmark.py           # Run a single benchmark profile (JSON output)
│   └── run_all_benchmarks.py      # Run all 5 profiles, collect + summarize (JSON)
└── references/
    └── benchmark-profiles.md      # Profile definitions, metrics, vllm bench usage
Reused from env-verify:
- env-verify/scripts/test_serve_mode.py: can be used to verify that the server is healthy before benchmarking (optional pre-check)
Prerequisites
- Running container with software stack installed
- model-verify completed, so you know which stack to use (full vs base)
- Model path, TP size, and recommended stack from model-verify
If invoked standalone, ask for container name, model path, TP size, and stack config.
If invoked from /flagrelease, these are passed as context.
Execution Flow
Step 1: Start vLLM Server
Use the stack recommended by model-verify. Read references/benchmark-profiles.md
for the vllm serve command pattern.
docker exec -d <CONTAINER> bash -c '
  export USE_FLAGGEMS=<0|1>
  export FLAGCX_PATH=<path_or_unset>
  export VLLM_PLUGINS=<fl_or_unset>
  vllm serve <MODEL_PATH> \
    --tensor-parallel-size <TP_SIZE> \
    --max-num-batched-tokens 4096 \
    --max-num-seqs 256 \
    --trust-remote-code \
    --port 8000 \
    <EXTRA_ARGS>
'
Wait for server ready (poll /health, timeout 300s):
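docker exec <CONTAINER> bash -c '
  # Poll every 2s, up to 150 tries (300s total). Check the HTTP status code
  # rather than the body, since /health may answer 200 with an empty body.
  for i in $(seq 1 150); do
    code=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8000/health 2>/dev/null)
    if [ "$code" = "200" ]; then
      echo "SERVER_READY"; exit 0
    fi
    sleep 2
  done
  echo "SERVER_TIMEOUT"; exit 1
'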
If server doesn't start, report error and exit.
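The same readiness check can be sketched in Python, which may be convenient when the caller is a script rather than a shell. This is a minimal sketch, not part of the skill's scripts: `http_status` and `wait_for_server` are hypothetical helper names, and the probe is injectable so the loop can be exercised without a live server.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code of a GET, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (e.g. 503 while loading).
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, socket timeout, and similar.
        return 0


def wait_for_server(url, attempts=150, interval=2.0,
                    get_status=http_status, sleep=time.sleep):
    """Poll `url` until it answers 200. Return True if ready, False on timeout."""
    for _ in range(attempts):
        if get_status(url) == 200:
            return True
        sleep(interval)
    return False
```

With the defaults (150 attempts, 2 seconds apart) the sleep budget matches the 300s startup timeout, not counting the time spent in each request. A caller would typically use `wait_for_server("http://localhost:8000/health")` and report an error if it returns False.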
Step 2: Get Model Name from Server
docker exec <CONTAINER> bash -c '
curl -s http://localhost:8000/v1/models | python3 -c "
import json, sys; print(json.load(sys.stdin)[\"data\"][0][\"id\"])"
'
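The one-liner above assumes the response always contains at least one model. A slightly more defensive version of the same extraction might look like the sketch below; `first_model_id` is a hypothetical helper name, and the sample payload only mirrors the `data[0].id` shape that the command relies on.

```python
import json


def first_model_id(models_json):
    """Return the id of the first model listed in a /v1/models response body."""
    payload = json.loads(models_json)
    models = payload.get("data") or []
    if not models:
        # An empty list usually means the server is up but nothing is loaded.
        raise ValueError("server reported no models")
    return models[0]["id"]


# Illustrative payload, not captured from a real server.
sample = '{"object": "list", "data": [{"id": "my-model", "object": "model"}]}'
print(first_model_id(sample))
```

Raising on an empty list gives a clearer message than the IndexError the one-liner would produce.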
Part A: Accuracy Test (FlagEval) — PLACEHOLDER
STATUS: FlagEval test client not yet available.
When FlagEval becomes available, update this section with:
- Docker image URL or pip package name
- Supported benchmarks (MMLU, GSM8K, HumanEval, etc.)
- Required arguments and configuration
- Expected output format
- Pass/fail criteria (accuracy thresholds)
Current behavior: Report accuracy test as SKIPPED.
Part B: Performance Benchmarks
Step 3: Run All Benchmark Profiles
Copy scripts into the container and run:
docker cp <SKILL_DIR>/scripts/run_benchmark.py <CONTAINER>:/tmp/
docker cp <SKILL_DIR>/scripts/run_all_benchmarks.py <CONTAINER>:/tmp/
docker exec <CONTAINER> python3 /tmp/run_all_benchmarks.py \
  --model <MODEL_NAME> \
  --tokenizer <MODEL_PATH> \
  --port 8000 \
  --output-dir /data/results/perf
The script runs all 5 default profiles (see references/benchmark-profiles.md),
saves per-profile JSON to /data/results/perf/, and outputs a combined JSON report
with a summary table.
Important: One profile failure does NOT skip remaining profiles.
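The continue-on-failure rule can be illustrated with a small loop. This is a sketch under stated assumptions, not the actual contents of run_all_benchmarks.py: `run_profiles` and the `run_one` callback are hypothetical names, and the result keys are chosen only for illustration.

```python
def run_profiles(profiles, run_one):
    """Run every profile in order, recording failures instead of aborting."""
    results = []
    for name in profiles:
        try:
            metrics = run_one(name)
            results.append({"profile": name, "status": "PASS", "metrics": metrics})
        except Exception as exc:
            # Deliberately broad: any failure is recorded and the sweep goes on.
            results.append({"profile": name, "status": "FAIL", "error": str(exc)})
    return results
```

Because each profile is wrapped separately, a crash in the second profile still leaves results for the first and lets the remaining ones run.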
Step 4: Stop Server
docker exec <CONTAINER> bash -c 'pkill -f "vllm serve" || true'
Step 5: Produce Report
{
  "status": "PASS | PARTIAL | FAIL",
  "stage": "perf-test",
  "model": "<MODEL_PATH>",
  "tensor_parallel_size": 8,
  "flags": {"USE_FLAGGEMS": "1|0", "FLAGCX_PATH": "..."},
  "accuracy": {
    "status": "SKIPPED",
    "reason": "FlagEval test client not yet available"
  },
  "performance": {
    "status": "PASS | PARTIAL | FAIL",
    "profiles_passed": "5/5",
    "profiles": [ "...per-profile results..." ],
    "summary_table": "...markdown table..."
  }
}
Present the summary table to the user:
| Profile | Input | Output | Prompts | Req/s | Tok/s | TTFT(ms) | TPOT(ms) | P99(ms) | Status |
|---------|-------|--------|---------|-------|-------|----------|----------|---------|--------|
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
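One way to render this table from per-profile results is sketched below. It assumes each result is a dict keyed by the column headers shown above; that keying is a hypothetical convention for this example, not the schema of the skill's JSON output, and `summary_table` is likewise a hypothetical name.

```python
COLUMNS = ["Profile", "Input", "Output", "Prompts", "Req/s", "Tok/s",
           "TTFT(ms)", "TPOT(ms)", "P99(ms)", "Status"]


def summary_table(rows):
    """Render rows (dicts keyed by column header) as a markdown table."""
    lines = [
        "| " + " | ".join(COLUMNS) + " |",
        "|" + "|".join("---" for _ in COLUMNS) + "|",
    ]
    for row in rows:
        # Missing metrics (for example, from a failed profile) show as n/a.
        cells = [str(row.get(col, "n/a")) for col in COLUMNS]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)
```

Filling missing cells with n/a keeps failed profiles visible in the table instead of dropping their rows.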
Status logic:
- PASS: all profiles completed
- PARTIAL: some passed, some failed
- FAIL: server didn't start or all profiles failed
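The status rules translate directly into a small function. This is a minimal sketch with a hypothetical name (`overall_status`); it additionally assumes that an empty result list counts as FAIL, which the rules above do not state explicitly.

```python
def overall_status(server_started, profile_passed):
    """Map server state and per-profile pass flags to PASS, PARTIAL, or FAIL."""
    # FAIL: the server never came up, or no profile passed
    # (assumption: an empty list also lands here).
    if not server_started or not any(profile_passed):
        return "FAIL"
    # PASS: every profile completed.
    if all(profile_passed):
        return "PASS"
    # PARTIAL: at least one passed and at least one failed.
    return "PARTIAL"
```

For example, `overall_status(True, [True, False, True])` yields PARTIAL, while `overall_status(False, [])` yields FAIL.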
Error Handling
| Failure | Behavior |
|---|---|
| Server fails to start | Report error; exit |
| vllm bench serve not found | Report vllm version issue |
| Single profile fails | Report error, continue remaining profiles |
| Single profile times out | Kill after 600s, report partial, continue |
| Server crashes mid-benchmark | Capture logs, report which profile caused crash |
| OOM during high concurrency | Report, suggest reducing num_prompts |
Timeout Rules
| Operation | Timeout |
|---|---|
| Server startup | 300s |
| Per profile benchmark | 600s |
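The per-profile limit can be enforced with the standard library's subprocess timeout. The sketch below is an assumption about how a wrapper might do it, not the skill's actual implementation: `run_with_timeout` and the returned keys are hypothetical.

```python
import subprocess


def run_with_timeout(cmd, timeout_s=600):
    """Run `cmd`, killing it after `timeout_s` seconds and keeping partial output."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired as exc:
        # subprocess.run kills the child before raising; output may be bytes or None.
        partial = exc.stdout or ""
        if isinstance(partial, bytes):
            partial = partial.decode(errors="replace")
        return {"status": "TIMEOUT", "stdout": partial, "timeout_s": timeout_s}
    status = "PASS" if proc.returncode == 0 else "FAIL"
    return {"status": status, "stdout": proc.stdout, "returncode": proc.returncode}
```

Returning a TIMEOUT record instead of raising lets the caller report partial output and move on to the next profile.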
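- Make sure OpenClaw is installed (locally or via Docker)
- Enter the install command in the chat box: /install perf-test-flagos
- After installation, invoke the Skill by name or trigger it with /perf-test-flagos
- Provide the inputs described in the Skill's parameter documentation to get structured output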
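What is Perf Test Flagos?
Run accuracy benchmarks (FlagEval, when available) and performance benchmarks (vllm bench serve) against a served model. Covers 5 workload profiles: short/lo... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 81 downloads to date.
How do I install Perf Test Flagos?
Run the command "/install perf-test-flagos" in the OpenClaw or Claude Code chat box to install it in one step; no extra configuration is needed.
Is Perf Test Flagos free?
Yes. Perf Test Flagos is completely free and released under the MIT-0 license; it can be freely downloaded, installed, and used.
Which platforms does Perf Test Flagos support?
Perf Test Flagos is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.
Who developed Perf Test Flagos?
It is developed and maintained by Flagos (@wbavon); the current version is v1.0.0.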