Perf Test Flagos

Name: Perf Test Flagos
Author: wbavon

By Flagos · GitHub · v1.0.0 · MIT-0

cross-platform ✓ Security check passed

Install in OpenClaw

/install perf-test-flagos

Description

Run accuracy benchmarks (FlagEval, when available) and performance benchmarks (vllm bench serve) against a served model. Covers 5 workload profiles: short/lo...

Usage (SKILL.md)

Accuracy + Performance Test

Start vLLM serve with the target model, run accuracy benchmarks (when FlagEval is available) and performance benchmarks (vllm bench serve) across multiple profiles.

Skill Components

perf-test/
├── SKILL.md                            # This file — execution flow
├── scripts/
│   ├── run_benchmark.py                # Run single benchmark profile (JSON output)
│   └── run_all_benchmarks.py           # Run all 5 profiles, collect + summarize (JSON)
└── references/
    └── benchmark-profiles.md           # Profile definitions, metrics, vllm bench usage

Reused from env-verify:

env-verify/scripts/test_serve_mode.py — can be used to verify server is healthy before benchmarking (optional pre-check)

Prerequisites

Running container with software stack installed
model-verify completed — know which stack to use (full vs base)
Model path, TP size, and recommended stack from model-verify

If invoked standalone, ask for container name, model path, TP size, and stack config. If invoked from /flagrelease, these are passed as context.

Execution Flow

Step 1: Start vLLM Server

Use the stack recommended by model-verify. Read references/benchmark-profiles.md for the vllm serve command pattern.

docker exec -d <CONTAINER> bash -c '
export USE_FLAGGEMS=<0|1>
export FLAGCX_PATH=<path_or_unset>
export VLLM_PLUGINS=<fl_or_unset>
vllm serve <MODEL_PATH> \
    --tensor-parallel-size <TP_SIZE> \
    --max-num-batched-tokens 4096 \
    --max-num-seqs 256 \
    --trust-remote-code \
    --port 8000 \
    <EXTRA_ARGS>
'

Wait for server ready (poll /health, timeout 300s):

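docker exec <CONTAINER> bash -c '
# Poll /health every 2s, up to 150 times (300s total).
for i in $(seq 1 150); do
    # Check the HTTP status, not the body: /health can answer 200 with an empty body.
    if curl -sf -o /dev/null http://localhost:8000/health; then
        echo "SERVER_READY"; exit 0
    fi
    sleep 2
done
echo "SERVER_NOT_READY"; exit 1
'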

If server doesn't start, report error and exit.
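The same readiness check can be sketched in Python, for cases where the caller wants a boolean result rather than a marker string. This is a minimal sketch, not part of the skill's scripts: `wait_for_server` and its `probe` parameter are hypothetical names, and it assumes the health endpoint signals readiness with HTTP 200.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url, timeout_s=300.0, interval_s=2.0, probe=None):
    """Poll `url` until it answers HTTP 200 or `timeout_s` elapses.

    `probe` is an optional callable (url -> bool), injectable for testing.
    Returns True if the server became ready, False on timeout.
    """

    def default_probe(target):
        # Any connection error or non-2xx answer counts as "not ready yet".
        try:
            with urllib.request.urlopen(target, timeout=5) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            return False

    probe = probe or default_probe
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A caller would typically run `wait_for_server("http://localhost:8000/health")` from a place that can reach port 8000 and treat a `False` result as the "server fails to start" case.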

Step 2: Get Model Name from Server

docker exec <CONTAINER> bash -c '
curl -s http://localhost:8000/v1/models | python3 -c "
import json, sys; print(json.load(sys.stdin)[\"data\"][0][\"id\"])"
'
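The one-liner above raises an unhelpful traceback when the response is empty or lists no models. A slightly more defensive version might look like the following sketch; `first_model_id` is a hypothetical helper, and the sample payload is illustrative rather than captured from a real server.

```python
import json


def first_model_id(models_response_text):
    """Return the id of the first model in a /v1/models response body."""
    payload = json.loads(models_response_text)
    models = payload.get("data") or []
    if not models:
        raise ValueError("server returned no models")
    return models[0]["id"]


# Illustrative payload; the id here is a made-up placeholder.
sample = '{"object": "list", "data": [{"id": "/models/demo", "object": "model"}]}'
print(first_model_id(sample))
```

Raising a clear error here lets the caller report the failure before any benchmark profile is started.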

Part A: Accuracy Test (FlagEval) — PLACEHOLDER

STATUS: FlagEval test client not yet available.

When FlagEval becomes available, update this section with:

Docker image URL or pip package name
Supported benchmarks (MMLU, GSM8K, HumanEval, etc.)
Required arguments and configuration
Expected output format
Pass/fail criteria (accuracy thresholds)

Current behavior: Report accuracy test as SKIPPED.

Part B: Performance Benchmarks

Step 3: Run All Benchmark Profiles

Copy scripts into the container and run:

docker cp <SKILL_DIR>/scripts/run_benchmark.py <CONTAINER>:/tmp/
docker cp <SKILL_DIR>/scripts/run_all_benchmarks.py <CONTAINER>:/tmp/

docker exec <CONTAINER> python3 /tmp/run_all_benchmarks.py \
    --model <MODEL_NAME> \
    --tokenizer <MODEL_PATH> \
    --port 8000 \
    --output-dir /data/results/perf

The script runs all 5 default profiles (see references/benchmark-profiles.md), saves per-profile JSON to /data/results/perf/, and outputs a combined JSON report with a summary table.

Important: One profile failure does NOT skip remaining profiles.
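The continue-on-failure rule can be illustrated with a short loop. This is only a sketch of the idea and not the contents of run_all_benchmarks.py: `run_profiles`, `run_one`, and the profile names in the example are hypothetical.

```python
def run_profiles(profiles, run_one):
    """Run every profile in order, recording failures instead of stopping.

    `run_one` is any callable (profile_name -> metrics dict) that raises on error.
    """
    results = []
    for name in profiles:
        try:
            metrics = run_one(name)
            results.append({"profile": name, "status": "PASS", "metrics": metrics})
        except Exception as exc:
            # Record the failure and move on; later profiles still run.
            results.append({"profile": name, "status": "FAIL", "error": str(exc)})
    return results


def demo_runner(name):
    # Stand-in for a real benchmark call; the second profile fails on purpose.
    if name == "profile_b":
        raise RuntimeError("benchmark crashed")
    return {"completed": True}


for entry in run_profiles(["profile_a", "profile_b", "profile_c"], demo_runner):
    print(entry["profile"], entry["status"])
```

Catching the exception per profile keeps a single bad workload from hiding the results of the others.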

Step 4: Stop Server

docker exec <CONTAINER> bash -c 'pkill -f "vllm serve" || true'

Step 5: Produce Report

{
  "status": "PASS | PARTIAL | FAIL",
  "stage": "perf-test",
  "model": "<MODEL_PATH>",
  "tensor_parallel_size": 8,
  "flags": {"USE_FLAGGEMS": "1|0", "FLAGCX_PATH": "..."},
  "accuracy": {
    "status": "SKIPPED",
    "reason": "FlagEval test client not yet available"
  },
  "performance": {
    "status": "PASS | PARTIAL | FAIL",
    "profiles_passed": "5/5",
    "profiles": [ "...per-profile results..." ],
    "summary_table": "...markdown table..."
  }
}

Present the summary table to the user:

| Profile | Input | Output | Prompts | Req/s | Tok/s | TTFT(ms) | TPOT(ms) | P99(ms) | Status |
|---------|-------|--------|---------|-------|-------|----------|----------|---------|--------|
| ...     | ...   | ...    | ...     | ...   | ...   | ...      | ...      | ...     | ...    |
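A table in this shape could be produced from per-profile results along the following lines. This is a sketch under assumptions: the source does not show how the scripts build the table, so `render_summary_table` and the convention of keying each row by its column header are hypothetical.

```python
COLUMNS = ["Profile", "Input", "Output", "Prompts", "Req/s", "Tok/s",
           "TTFT(ms)", "TPOT(ms)", "P99(ms)", "Status"]


def render_summary_table(rows):
    """Render rows (dicts keyed by column header) as a Markdown pipe table."""
    lines = [
        "| " + " | ".join(COLUMNS) + " |",
        "|" + "|".join("-" * (len(col) + 2) for col in COLUMNS) + "|",
    ]
    for row in rows:
        # Missing metrics (for example after a failed profile) show as n/a.
        cells = [str(row.get(col, "n/a")) for col in COLUMNS]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


print(render_summary_table([{"Profile": "demo", "Status": "FAIL"}]))
```

Filling absent values with `n/a` keeps failed profiles visible in the table instead of dropping their rows.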

Status logic:

PASS — all profiles completed
PARTIAL — some passed, some failed
FAIL — server didn't start or all profiles failed
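The three rules above translate directly into a small function. The sketch below assumes each profile result has already been reduced to a "PASS" or "FAIL" string; `overall_status` is a hypothetical name.

```python
def overall_status(server_started, profile_statuses):
    """Combine per-profile statuses into PASS, PARTIAL, or FAIL."""
    if not server_started:
        return "FAIL"
    passed = sum(1 for status in profile_statuses if status == "PASS")
    if passed == 0:
        # Covers both "all profiles failed" and "no profile ran".
        return "FAIL"
    if passed == len(profile_statuses):
        return "PASS"
    return "PARTIAL"


print(overall_status(True, ["PASS", "FAIL", "PASS", "PASS", "PASS"]))
```

With four of five profiles passing, the example prints `PARTIAL`.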

Error Handling

| Failure | Behavior |
|---------|----------|
| Server fails to start | Report error; exit |
| `vllm bench serve` not found | Report vllm version issue |
| Single profile fails | Report error, continue remaining profiles |
| Single profile times out | Kill after 600s, report partial, continue |
| Server crashes mid-benchmark | Capture logs, report which profile caused crash |
| OOM during high concurrency | Report, suggest reducing num_prompts |

Timeout Rules

| Operation | Timeout |
|-----------|---------|
| Server startup | 300s |
| Per profile benchmark | 600s |
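One way to enforce the per-profile limit is the `timeout` argument of `subprocess.run`, which kills the child process when the limit expires. The sketch below is an assumption about how such a wrapper could look, not the skill's actual implementation; `run_with_timeout` and its status strings are hypothetical.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s=600):
    """Run `cmd` (a list of arguments) and classify the outcome."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child at this point.
        return {"status": "TIMEOUT", "returncode": None, "stdout": ""}
    except FileNotFoundError:
        # The executable is missing, e.g. the benchmark CLI is not installed.
        return {"status": "NOT_FOUND", "returncode": None, "stdout": ""}
    status = "PASS" if proc.returncode == 0 else "FAIL"
    return {"status": status, "returncode": proc.returncode, "stdout": proc.stdout}


# Demonstration with the Python interpreter standing in for a benchmark command.
print(run_with_timeout([sys.executable, "-c", "print('ok')"], timeout_s=30)["status"])
```

Returning a status dictionary instead of raising lets the caller record a timeout and continue with the next profile.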

Safe usage advice

This result should not be treated as a complete approval. The review could not inspect metadata.json or artifact contents, so rerun ClawScan with readable artifacts before installing or publishing.

Capability assessment

✓ Purpose & Capability

No artifact-backed purpose or capability issue was identified; confidence is low because direct workspace inspection failed.

✓ Instruction Scope

No artifact-backed instruction-scope issue was identified; confidence is low because SKILL.md and artifact files were not readable through the available command tool.

✓ Install Mechanism

No artifact-backed install-mechanism issue was identified; confidence is low because metadata and install artifacts could not be inspected.

✓ Credentials

No artifact-backed environment-proportionality issue was identified; confidence is low due to unreadable workspace artifacts.

✓ Persistence & Privilege

No artifact-backed persistence or privilege issue was identified; confidence is low due to incomplete artifact access.

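How to use

Make sure OpenClaw is installed (locally or via Docker)
Enter the install command in the chat box: /install perf-test-flagos
Once installed, call the Skill by name or trigger it with /perf-test-flagos
Provide the inputs described in the Skill's parameter notes to get structured output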

Version history

v1.0.0

perf-test-flagos v1.0.0 changelog:

- Initial release providing automated accuracy (placeholder/skip for now) and performance benchmarking for served models.
- Runs vLLM server based on stack/model-verify outputs, tests 5 workload profiles via run_all_benchmarks.py.
- Collects throughput, latency, TTFT, and TPOT metrics per profile; results summarized in JSON and markdown table.
- Implements error handling for server, benchmark, and resource failures, with timeout rules and partial reporting.
- Modular design; prompts for inputs if not fully configured/invoked standalone.

Metadata

Slug perf-test-flagos

Version 1.0.0

License MIT-0

Total installs 0

Current installs 0

Versions 1

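FAQ

What is Perf Test Flagos?

Run accuracy benchmarks (FlagEval, when available) and performance benchmarks (vllm bench serve) against a served model. Covers 5 workload profiles: short/lo... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 81 total downloads to date.

How do I install Perf Test Flagos?

Run the command /install perf-test-flagos in the OpenClaw or Claude Code chat box for a one-step install; no additional configuration is required.

Is Perf Test Flagos free?

Yes. Perf Test Flagos is completely free and released under the MIT-0 license, so it can be freely downloaded, installed, and used.

Which platforms does Perf Test Flagos support?

Perf Test Flagos is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who develops Perf Test Flagos?

It is developed and maintained by Flagos (@wbavon); the current version is v1.0.0.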