wbavon

Perf Test Flagos

by Flagos · GitHub ↗ · v1.0.0 · MIT-0
cross-platform · ✓ Security Clean · 81 downloads · 0 stars · 0 active installs · 1 version
Install in OpenClaw
/install perf-test-flagos
Description
Run accuracy benchmarks (FlagEval, when available) and performance benchmarks (vllm bench serve) against a served model. Covers 5 workload profiles: short/lo...
README (SKILL.md)

Accuracy + Performance Test

Start a vLLM server (vllm serve) for the target model, then run accuracy benchmarks (when FlagEval is available) and performance benchmarks (vllm bench serve) across multiple workload profiles.

Skill Components

perf-test/
├── SKILL.md                            # This file — execution flow
├── scripts/
│   ├── run_benchmark.py                # Run single benchmark profile (JSON output)
│   └── run_all_benchmarks.py           # Run all 5 profiles, collect + summarize (JSON)
└── references/
    └── benchmark-profiles.md           # Profile definitions, metrics, vllm bench usage

Reused from env-verify:

  • env-verify/scripts/test_serve_mode.py — can be used to verify server is healthy before benchmarking (optional pre-check)

Prerequisites

  • Running container with software stack installed
  • model-verify completed — know which stack to use (full vs base)
  • Model path, TP size, and recommended stack from model-verify

If invoked standalone, ask for container name, model path, TP size, and stack config. If invoked from /flagrelease, these are passed as context.

Execution Flow

Step 1: Start vLLM Server

Use the stack recommended by model-verify. Read references/benchmark-profiles.md for the vllm serve command pattern.

docker exec -d <CONTAINER> bash -c '
export USE_FLAGGEMS=<0|1>
export FLAGCX_PATH=<path_or_unset>
export VLLM_PLUGINS=<fl_or_unset>
vllm serve <MODEL_PATH> \
    --tensor-parallel-size <TP_SIZE> \
    --max-num-batched-tokens 4096 \
    --max-num-seqs 256 \
    --trust-remote-code \
    --port 8000 \
    <EXTRA_ARGS>
'
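
If the launch is driven from a Python orchestrator rather than typed by hand, the command above could be assembled as in the sketch below. The function name `build_serve_command` and its parameters are hypothetical and not part of this skill; only the flags shown in the command above are used, and variables that should stay unset are simply not exported.

```python
import shlex
from typing import List, Mapping, Optional, Sequence


def build_serve_command(
    container: str,
    model_path: str,
    tp_size: int,
    env: Mapping[str, Optional[str]],
    extra_args: Sequence[str] = (),
    port: int = 8000,
) -> List[str]:
    """Build the argv for a detached `docker exec` that starts `vllm serve`.

    Hypothetical helper: names and signature are assumptions for illustration.
    """
    # A value of None means "leave this variable unset" instead of exporting "".
    exports = [
        f"export {name}={shlex.quote(value)}"
        for name, value in env.items()
        if value is not None
    ]
    serve = [
        "vllm", "serve", model_path,
        "--tensor-parallel-size", str(tp_size),
        "--max-num-batched-tokens", "4096",
        "--max-num-seqs", "256",
        "--trust-remote-code",
        "--port", str(port),
        *extra_args,
    ]
    # shlex.join quotes each argument so paths with spaces survive bash -c.
    script = "\n".join(exports + [shlex.join(serve)])
    return ["docker", "exec", "-d", container, "bash", "-c", script]
```

Passing the result to `subprocess.run(cmd, check=True)` would start the server; returning a list instead of a single string avoids a second layer of host-side shell quoting.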

Wait for server ready (poll /health, timeout 300s):

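docker exec <CONTAINER> bash -c '
# Poll every 2s for up to 150 attempts (about 300s).
for i in $(seq 1 150); do
    # vLLM typically answers /health with HTTP 200 and an empty body,
    # so check the status code (-f fails on HTTP errors) instead of grepping the body.
    if curl -sf -o /dev/null http://localhost:8000/health; then
        echo "SERVER_READY"; exit 0
    fi
    sleep 2
done
echo "SERVER_TIMEOUT"; exit 1
'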

If server doesn't start, report error and exit.
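
The same readiness check can be sketched in Python, for example inside an orchestration script. This is a minimal sketch, not the skill's implementation: `health_ok` and `wait_until_ready` are hypothetical names, and the URL assumes the port is reachable from wherever the code runs (inside the container, or on the host if the port is published).

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def health_ok(url: str = "http://localhost:8000/health") -> bool:
    """Return True when the endpoint answers with a 2xx status, whatever the body."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, HTTP error status, or timeout: not ready yet.
        return False


def wait_until_ready(
    check: Callable[[], bool] = health_ok,
    timeout_s: float = 300.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `check` until it succeeds or `timeout_s` elapses."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The `check`, `sleep`, and `clock` parameters are injectable so the loop can be exercised without a running server; a `False` return maps to the "report error and exit" branch.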

Step 2: Get Model Name from Server

docker exec <CONTAINER> bash -c '
curl -s http://localhost:8000/v1/models | python3 -c "
import json, sys; print(json.load(sys.stdin)[\"data\"][0][\"id\"])"
'
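
The one-liner above assumes the response always contains at least one model. A slightly more defensive version of the same lookup might look like the sketch below; `first_model_id` is a hypothetical helper, and the sample payload in the usage note is illustrative only.

```python
import json


def first_model_id(payload: str) -> str:
    """Extract the first model id from a /v1/models response body."""
    models = json.loads(payload).get("data") or []
    if not models:
        # An empty list usually means the server is up but no model is loaded.
        raise ValueError("server reported no models")
    return models[0]["id"]
```

For example, `first_model_id('{"object": "list", "data": [{"id": "/models/example"}]}')` returns `"/models/example"`, which is then passed as the model name to the benchmark step.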

Part A: Accuracy Test (FlagEval) — PLACEHOLDER

STATUS: FlagEval test client not yet available.

When FlagEval becomes available, update this section with:

  • Docker image URL or pip package name
  • Supported benchmarks (MMLU, GSM8K, HumanEval, etc.)
  • Required arguments and configuration
  • Expected output format
  • Pass/fail criteria (accuracy thresholds)

Current behavior: Report accuracy test as SKIPPED.


Part B: Performance Benchmarks

Step 3: Run All Benchmark Profiles

Copy scripts into the container and run:

docker cp <SKILL_DIR>/scripts/run_benchmark.py <CONTAINER>:/tmp/
docker cp <SKILL_DIR>/scripts/run_all_benchmarks.py <CONTAINER>:/tmp/

docker exec <CONTAINER> python3 /tmp/run_all_benchmarks.py \
    --model <MODEL_NAME> \
    --tokenizer <MODEL_PATH> \
    --port 8000 \
    --output-dir /data/results/perf

The script runs all 5 default profiles (see references/benchmark-profiles.md), saves per-profile JSON to /data/results/perf/, and outputs a combined JSON report with a summary table.

Important: One profile failure does NOT skip remaining profiles.
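
The continue-on-failure rule can be illustrated with a short loop. This is a sketch of the idea only and makes no claim about how run_all_benchmarks.py is written; `run_profiles`, the `run_one` callback, and the result keys are all hypothetical.

```python
from typing import Callable, Iterable, List


def run_profiles(names: Iterable[str], run_one: Callable[[str], dict]) -> List[dict]:
    """Run every profile in order, recording failures instead of stopping."""
    results = []
    for name in names:
        try:
            metrics = run_one(name)
            results.append({"profile": name, "status": "PASS", "metrics": metrics})
        except Exception as exc:
            # Record the error and move on so later profiles still run.
            results.append({"profile": name, "status": "FAIL", "error": str(exc)})
    return results
```

Catching the exception per profile, rather than around the whole loop, is what keeps one bad profile from hiding the results of the others.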

Step 4: Stop Server

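docker exec <CONTAINER> bash -c 'pkill -f "[v]llm serve" || true'

The [v] bracket keeps pkill -f from also matching the bash -c wrapper, whose own command line contains the pattern text.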

Step 5: Produce Report

{
  "status": "PASS | PARTIAL | FAIL",
  "stage": "perf-test",
  "model": "<MODEL_PATH>",
  "tensor_parallel_size": 8,
  "flags": {"USE_FLAGGEMS": "1|0", "FLAGCX_PATH": "..."},
  "accuracy": {
    "status": "SKIPPED",
    "reason": "FlagEval test client not yet available"
  },
  "performance": {
    "status": "PASS | PARTIAL | FAIL",
    "profiles_passed": "5/5",
    "profiles": [ "...per-profile results..." ],
    "summary_table": "...markdown table..."
  }
}

Present the summary table to the user:

| Profile | Input | Output | Prompts | Req/s | Tok/s | TTFT(ms) | TPOT(ms) | P99(ms) | Status |
|---------|-------|--------|---------|-------|-------|----------|----------|---------|--------|
| ...     | ...   | ...    | ...     | ...   | ...   | ...      | ...      | ...     | ...    |
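
A table in this layout could be rendered from the per-profile results along the following lines. This is a sketch under assumptions: the dictionary keys (`input_len`, `req_per_s`, and so on) are hypothetical, since the actual field names in the per-profile JSON are not specified here.

```python
from typing import Iterable, Mapping

# (column title, hypothetical key in the per-profile result dict)
COLUMNS = [
    ("Profile", "profile"), ("Input", "input_len"), ("Output", "output_len"),
    ("Prompts", "num_prompts"), ("Req/s", "req_per_s"), ("Tok/s", "tok_per_s"),
    ("TTFT(ms)", "ttft_ms"), ("TPOT(ms)", "tpot_ms"), ("P99(ms)", "p99_ms"),
    ("Status", "status"),
]


def render_table(rows: Iterable[Mapping[str, object]]) -> str:
    """Render per-profile results as a markdown pipe table."""
    header = "| " + " | ".join(title for title, _ in COLUMNS) + " |"
    rule = "|" + "|".join("-" * (len(title) + 2) for title, _ in COLUMNS) + "|"
    lines = [header, rule]
    for row in rows:
        # Missing metrics (for example on a failed profile) are shown as n/a.
        cells = [str(row.get(key, "n/a")) for _, key in COLUMNS]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)
```

Failed profiles still get a row, so the reader can see which workload broke rather than just a lower pass count.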

Status logic:

  • PASS — all profiles completed
  • PARTIAL — some passed, some failed
  • FAIL — server didn't start or all profiles failed
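
The status logic above reduces to a small function. The name `overall_status` and its parameters are hypothetical; the three outcomes follow the rules as listed.

```python
def overall_status(server_started: bool, passed: int, total: int) -> str:
    """Map server state and profile counts to PASS, PARTIAL, or FAIL."""
    if not server_started or passed == 0:
        # Covers both "server didn't start" and "all profiles failed".
        return "FAIL"
    return "PASS" if passed == total else "PARTIAL"
```

The `profiles_passed` field in the report can then be formatted as `f"{passed}/{total}"`.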

Error Handling

| Failure | Behavior |
|---------|----------|
| Server fails to start | Report error; exit |
| vllm bench serve not found | Report vllm version issue |
| Single profile fails | Report error, continue remaining profiles |
| Single profile times out | Kill after 600s, report partial, continue |
| Server crashes mid-benchmark | Capture logs, report which profile caused crash |
| OOM during high concurrency | Report, suggest reducing num_prompts |

Timeout Rules

| Operation | Timeout |
|-----------|---------|
| Server startup | 300s |
| Per profile benchmark | 600s |
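
The per-profile timeout can be enforced with the standard library alone, as in the sketch below. `run_with_timeout` and its result keys are hypothetical. One caveat worth stating as an assumption: when the command is a `docker exec` call, killing the local client on timeout may not stop the process inside the container, so a follow-up cleanup inside the container may still be needed.

```python
import subprocess
import sys
from typing import Sequence


def run_with_timeout(cmd: Sequence[str], timeout_s: float = 600.0) -> dict:
    """Run one command; kill it and report TIMEOUT if it exceeds `timeout_s`."""
    try:
        proc = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising TimeoutExpired.
        return {"status": "TIMEOUT", "returncode": None, "stdout": ""}
    status = "PASS" if proc.returncode == 0 else "FAIL"
    return {"status": status, "returncode": proc.returncode, "stdout": proc.stdout}


if __name__ == "__main__":
    # Stand-in for a benchmark that hangs: sleeps far longer than the limit.
    slow = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(slow, timeout_s=0.5)["status"])
```

Returning a status dictionary instead of raising lets the caller record the timeout and continue with the remaining profiles.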
Usage Guidance
This result should not be treated as a complete approval. The review could not inspect metadata.json or artifact contents, so rerun ClawScan with readable artifacts before installing or publishing.
Capability Assessment
Purpose & Capability
No artifact-backed purpose or capability issue was identified; confidence is low because direct workspace inspection failed.
Instruction Scope
No artifact-backed instruction-scope issue was identified; confidence is low because SKILL.md and artifact files were not readable through the available command tool.
Install Mechanism
No artifact-backed install-mechanism issue was identified; confidence is low because metadata and install artifacts could not be inspected.
Credentials
No artifact-backed environment-proportionality issue was identified; confidence is low due to unreadable workspace artifacts.
Persistence & Privilege
No artifact-backed persistence or privilege issue was identified; confidence is low due to incomplete artifact access.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install perf-test-flagos
  3. After installation, invoke the skill by name or use /perf-test-flagos
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
perf-test-flagos v1.0.0 changelog:

  • Initial release providing automated accuracy (placeholder/skip for now) and performance benchmarking for served models.
  • Runs vLLM server based on stack/model-verify outputs, tests 5 workload profiles via run_all_benchmarks.py.
  • Collects throughput, latency, TTFT, and TPOT metrics per profile; results summarized in JSON and markdown table.
  • Implements error handling for server, benchmark, and resource failures, with timeout rules and partial reporting.
  • Modular design; prompts for inputs if not fully configured or invoked standalone.
Metadata
Slug perf-test-flagos
Version 1.0.0
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is Perf Test Flagos?

Run accuracy benchmarks (FlagEval, when available) and performance benchmarks (vllm bench serve) against a served model. Covers 5 workload profiles: short/lo... It is an AI Agent Skill for Claude Code / OpenClaw, with 81 downloads so far.

How do I install Perf Test Flagos?

Run "/install perf-test-flagos" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Perf Test Flagos free?

Yes, Perf Test Flagos is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Perf Test Flagos support?

Perf Test Flagos is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Perf Test Flagos?

It is built and maintained by Flagos (@wbavon); the current version is v1.0.0.
