wbavon

Perf Test Flagos

Author: Flagos · GitHub · v1.0.0 · MIT-0
cross-platform ✓ Security check passed
Total downloads: 81
Favorites: 0
Current installs: 0
Versions: 1
Install in OpenClaw
/install perf-test-flagos
Description
Run accuracy benchmarks (FlagEval, when available) and performance benchmarks (vllm bench serve) against a served model. Covers 5 workload profiles: short/lo...
Usage (SKILL.md)

Accuracy + Performance Test

Start vLLM serve with the target model, run accuracy benchmarks (when FlagEval is available) and performance benchmarks (vllm bench serve) across multiple profiles.

Skill Components

perf-test/
├── SKILL.md                            # This file — execution flow
├── scripts/
│   ├── run_benchmark.py                # Run single benchmark profile (JSON output)
│   └── run_all_benchmarks.py           # Run all 5 profiles, collect + summarize (JSON)
└── references/
    └── benchmark-profiles.md           # Profile definitions, metrics, vllm bench usage

Reused from env-verify:

  • env-verify/scripts/test_serve_mode.py — can be used to verify server is healthy before benchmarking (optional pre-check)

Prerequisites

  • Running container with software stack installed
  • model-verify completed — know which stack to use (full vs base)
  • Model path, TP size, and recommended stack from model-verify

If invoked standalone, ask for container name, model path, TP size, and stack config. If invoked from /flagrelease, these are passed as context.

Execution Flow

Step 1: Start vLLM Server

Use the stack recommended by model-verify. Read references/benchmark-profiles.md for the vllm serve command pattern.

docker exec -d <CONTAINER> bash -c '
export USE_FLAGGEMS=<0|1>
export FLAGCX_PATH=<path_or_unset>
export VLLM_PLUGINS=<fl_or_unset>
vllm serve <MODEL_PATH> \
    --tensor-parallel-size <TP_SIZE> \
    --max-num-batched-tokens 4096 \
    --max-num-seqs 256 \
    --trust-remote-code \
    --port 8000 \
    <EXTRA_ARGS>
'
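If the launch command is assembled programmatically, a minimal sketch might look like the following. The function name `build_serve_command` and its parameters are hypothetical and not part of the skill; the flags are copied from the command above, and environment variables that should stay unset are simply left out of `env`.

```python
import shlex


def build_serve_command(container, model_path, tp_size, env=None, extra_args=()):
    """Return the argv list for a detached `docker exec` that starts `vllm serve`.

    Hypothetical helper: only the flags shown in Step 1 are emitted.
    """
    # One `export` line per variable; omit a variable from `env` to leave it unset.
    exports = "".join(
        f"export {name}={shlex.quote(str(value))}\n"
        for name, value in (env or {}).items()
    )
    serve = " ".join(
        [
            "vllm", "serve", shlex.quote(model_path),
            "--tensor-parallel-size", str(tp_size),
            "--max-num-batched-tokens", "4096",
            "--max-num-seqs", "256",
            "--trust-remote-code",
            "--port", "8000",
            *(shlex.quote(arg) for arg in extra_args),
        ]
    )
    # Passing an argv list to subprocess avoids a second layer of host-side quoting.
    return ["docker", "exec", "-d", container, "bash", "-c", exports + serve]
```

The returned list can be handed to `subprocess.run(...)` as-is; quoting with `shlex.quote` keeps model paths containing spaces from splitting into separate arguments.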

Wait for server ready (poll /health, timeout 300s):

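docker exec <CONTAINER> bash -c '
# Poll /health every 2s, up to 150 attempts (about 300s).
for i in $(seq 1 150); do
    # -f makes curl fail on HTTP errors, so a 200 with an empty body still counts as ready.
    if curl -sf -o /dev/null http://localhost:8000/health; then
        echo "SERVER_READY"; exit 0
    fi
    sleep 2
done
echo "SERVER_NOT_READY"; exit 1
'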

If server doesn't start, report error and exit.
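The same readiness check can be sketched in Python. This is an illustration only: the names `probe_health` and `wait_for_server` are hypothetical, and the sketch assumes port 8000 is reachable from wherever the script runs (inside the container, or on the host if the port is published).

```python
import time
import urllib.error
import urllib.request


def probe_health(url="http://localhost:8000/health"):
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_for_server(probe=probe_health, timeout_s=300, interval_s=2,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll `probe` until it succeeds or `timeout_s` seconds have elapsed."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        if probe():
            return True
        sleep(interval_s)
    return False
```

Checking the status code rather than the response body keeps the check independent of what the endpoint happens to return in its body. A `False` result corresponds to the "report error and exit" branch.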

Step 2: Get Model Name from Server

docker exec <CONTAINER> bash -c '
curl -s http://localhost:8000/v1/models | python3 -c "
import json, sys; print(json.load(sys.stdin)[\"data\"][0][\"id\"])"
'
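The one-liner above raises an unhelpful `IndexError` or `KeyError` if the server returns no models. A slightly more defensive sketch, assuming the response has the same `data[0].id` shape that the one-liner reads (the function name `first_model_id` is hypothetical):

```python
import json


def first_model_id(models_response: str) -> str:
    """Extract the first model id from a /v1/models JSON response body."""
    payload = json.loads(models_response)
    models = payload.get("data") or []
    if not models:
        raise ValueError("server reported no models")
    return models[0]["id"]
```

The returned id is what Step 3 passes as `--model`.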

Part A: Accuracy Test (FlagEval) — PLACEHOLDER

STATUS: FlagEval test client not yet available.

When FlagEval becomes available, update this section with:

  • Docker image URL or pip package name
  • Supported benchmarks (MMLU, GSM8K, HumanEval, etc.)
  • Required arguments and configuration
  • Expected output format
  • Pass/fail criteria (accuracy thresholds)

Current behavior: Report accuracy test as SKIPPED.


Part B: Performance Benchmarks

Step 3: Run All Benchmark Profiles

Copy scripts into the container and run:

docker cp <SKILL_DIR>/scripts/run_benchmark.py <CONTAINER>:/tmp/
docker cp <SKILL_DIR>/scripts/run_all_benchmarks.py <CONTAINER>:/tmp/

docker exec <CONTAINER> python3 /tmp/run_all_benchmarks.py \
    --model <MODEL_NAME> \
    --tokenizer <MODEL_PATH> \
    --port 8000 \
    --output-dir /data/results/perf

The script runs all 5 default profiles (see references/benchmark-profiles.md), saves per-profile JSON to /data/results/perf/, and outputs a combined JSON report with a summary table.

Important: One profile failure does NOT skip remaining profiles.
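The continue-on-failure rule can be sketched as a loop that records each outcome instead of aborting. This is not the actual code of run_all_benchmarks.py, which is not shown here; `run_profiles` and the `run_one` callback are hypothetical names.

```python
def run_profiles(profiles, run_one):
    """Run every profile in order, recording failures without stopping.

    `run_one(name)` is a caller-supplied function that returns a metrics
    dict on success and raises an exception on failure.
    """
    results = []
    for name in profiles:
        try:
            metrics = run_one(name)
            results.append({"profile": name, "status": "PASS", "metrics": metrics})
        except Exception as exc:  # broad on purpose: any failure is recorded
            results.append({"profile": name, "status": "FAIL", "error": str(exc)})
    return results
```

Because each profile is wrapped individually, a crash in one run still leaves a complete list with one entry per profile for the report.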

Step 4: Stop Server

# The [v] bracket keeps pkill -f from matching the wrapper shell's own command line.
docker exec <CONTAINER> bash -c 'pkill -f "[v]llm serve" || true'

Step 5: Produce Report

{
  "status": "PASS | PARTIAL | FAIL",
  "stage": "perf-test",
  "model": "<MODEL_PATH>",
  "tensor_parallel_size": 8,
  "flags": {"USE_FLAGGEMS": "1|0", "FLAGCX_PATH": "..."},
  "accuracy": {
    "status": "SKIPPED",
    "reason": "FlagEval test client not yet available"
  },
  "performance": {
    "status": "PASS | PARTIAL | FAIL",
    "profiles_passed": "5/5",
    "profiles": [ "...per-profile results..." ],
    "summary_table": "...markdown table..."
  }
}

Present the summary table to the user:

| Profile | Input | Output | Prompts | Req/s | Tok/s | TTFT(ms) | TPOT(ms) | P99(ms) | Status |
|---------|-------|--------|---------|-------|-------|----------|----------|---------|--------|
| ...     | ...   | ...    | ...     | ...   | ...   | ...      | ...      | ...     | ...    |
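One way to render that table from per-profile results is sketched below. The column names come from the table header above; the assumption that each result row is a dict keyed by those exact column names is made for this example only and may not match the scripts' real JSON fields.

```python
COLUMNS = [
    "Profile", "Input", "Output", "Prompts", "Req/s",
    "Tok/s", "TTFT(ms)", "TPOT(ms)", "P99(ms)", "Status",
]


def render_summary_table(rows):
    """Render a markdown table; values missing from a row are shown as n/a."""
    header = "| " + " | ".join(COLUMNS) + " |"
    divider = "|" + "|".join("-" * (len(col) + 2) for col in COLUMNS) + "|"
    body = [
        "| " + " | ".join(str(row.get(col, "n/a")) for col in COLUMNS) + " |"
        for row in rows
    ]
    return "\n".join([header, divider, *body])
```

Filling gaps with `n/a` lets a failed profile still appear as a row, with only its name and status populated.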

Status logic:

  • PASS — all profiles completed
  • PARTIAL — some passed, some failed
  • FAIL — server didn't start or all profiles failed
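The status rules above reduce to a small function. A minimal sketch, with hypothetical function names, assuming each profile result has already been reduced to the string "PASS" or "FAIL":

```python
def overall_status(server_started, profile_statuses):
    """Map per-profile outcomes to PASS, PARTIAL, or FAIL."""
    passed = sum(1 for status in profile_statuses if status == "PASS")
    if not server_started or passed == 0:
        # Server never came up, or no profile completed.
        return "FAIL"
    return "PASS" if passed == len(profile_statuses) else "PARTIAL"


def profiles_passed(profile_statuses):
    """Format the passed/total counter used in the report, e.g. "5/5"."""
    passed = sum(1 for status in profile_statuses if status == "PASS")
    return f"{passed}/{len(profile_statuses)}"
```

An empty result list is treated as FAIL here, since no profile completed; that choice is an assumption of this sketch.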

Error Handling

| Failure | Behavior |
|---------|----------|
| Server fails to start | Report error; exit |
| vllm bench serve not found | Report vllm version issue |
| Single profile fails | Report error, continue remaining profiles |
| Single profile times out | Kill after 600s, report partial, continue |
| Server crashes mid-benchmark | Capture logs, report which profile caused crash |
| OOM during high concurrency | Report, suggest reducing num_prompts |

Timeout Rules

| Operation | Timeout |
|-----------|---------|
| Server startup | 300s |
| Per profile benchmark | 600s |
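The per-profile limit can be enforced with the standard library's `subprocess.run(..., timeout=...)`, which kills the child process when the limit expires. The sketch below is illustrative; `run_with_timeout` and the shape of its return value are hypothetical, not the skill's actual implementation.

```python
import subprocess


def run_with_timeout(argv, timeout_s=600):
    """Run a command and report PASS, FAIL, or TIMEOUT."""
    try:
        proc = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child at this point.
        return {"status": "TIMEOUT", "returncode": None, "stdout": ""}
    status = "PASS" if proc.returncode == 0 else "FAIL"
    return {"status": status, "returncode": proc.returncode, "stdout": proc.stdout}
```

If the command is a `docker exec` issued from the host, killing the local client may not stop the process inside the container, so running the timeout logic inside the container (as run_all_benchmarks.py does when invoked there) is the safer arrangement.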
Security guidance
This result should not be treated as a complete approval. The review could not inspect metadata.json or artifact contents, so rerun ClawScan with readable artifacts before installing or publishing.
Capability assessment
Purpose & Capability
No artifact-backed purpose or capability issue was identified; confidence is low because direct workspace inspection failed.
Instruction Scope
No artifact-backed instruction-scope issue was identified; confidence is low because SKILL.md and artifact files were not readable through the available command tool.
Install Mechanism
No artifact-backed install-mechanism issue was identified; confidence is low because metadata and install artifacts could not be inspected.
Credentials
No artifact-backed environment-proportionality issue was identified; confidence is low due to unreadable workspace artifacts.
Persistence & Privilege
No artifact-backed persistence or privilege issue was identified; confidence is low due to incomplete artifact access.
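How to use
  1. Make sure OpenClaw is installed (local or Docker deployment).
  2. Enter the install command in the chat box: /install perf-test-flagos
  3. Once installed, call the Skill by name or trigger it with /perf-test-flagos.
  4. Provide the inputs described in the Skill's parameter documentation to get structured output.
Version history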
v1.0.0
perf-test-flagos v1.0.0 changelog:

  • Initial release providing automated accuracy (placeholder/skip for now) and performance benchmarking for served models.
  • Runs vLLM server based on stack/model-verify outputs, tests 5 workload profiles via run_all_benchmarks.py.
  • Collects throughput, latency, TTFT, and TPOT metrics per profile; results summarized in JSON and markdown table.
  • Implements error handling for server, benchmark, and resource failures, with timeout rules and partial reporting.
  • Modular design; prompts for inputs if not fully configured/invoked standalone.
Metadata
Slug: perf-test-flagos
Version: 1.0.0
License: MIT-0
Total installs: 0
Current installs: 0
Versions: 1
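FAQ

What is Perf Test Flagos?

Run accuracy benchmarks (FlagEval, when available) and performance benchmarks (vllm bench serve) against a served model. Covers 5 workload profiles: short/lo... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 81 total downloads so far.

How do I install Perf Test Flagos?

Run the command "/install perf-test-flagos" in the OpenClaw or Claude Code chat box to install it in one step; no extra configuration is needed.

Is Perf Test Flagos free?

Yes. Perf Test Flagos is completely free and released under the MIT-0 license, so it can be downloaded, installed, and used freely.

Which platforms does Perf Test Flagos support?

Perf Test Flagos is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who develops Perf Test Flagos?

It is developed and maintained by Flagos (@wbavon). The current version is v1.0.0.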
