Description

Estimate LLM inference performance metrics including TTFT, decode speed, and VRAM requirements based on model architecture, GPU specs, and quantization format.

Usage (SKILL.md)

# LLM Inference Performance Estimator

Name: LLM Inference Performance Estimator
Author: zhangyu68

Estimate TTFT (Time To First Token), decode speed (tokens/s), and VRAM usage for a given LLM on a specific GPU.

## How to Use

The user may invoke this skill in several ways:

- Named model: `/llm-perf-estimator Qwen2.5-7B RTX4090 2048 512 fp16`
- With config file: `/llm-perf-estimator config.json RTX4090 2048 512 int4`
- Interactive: `/llm-perf-estimator` — ask the user step by step

Arguments (all optional, prompt for missing ones):

- `model` — model name from preset list, or path to a HuggingFace config.json
- `gpu` — GPU name from preset list, or custom specs
- `input_tokens` — prefill sequence length (default: 1024)
- `output_tokens` — number of tokens to generate (default: 256)
- `quant` — quantization format: fp16, bf16, fp8, int8, int4 (default: fp16)

---
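The skill ships no code, so the following is only a minimal sketch of how the invocation forms above could be mapped to named arguments. The function name `parse_invocation`, the `missing` key, and the strictly positional ordering are assumptions made for illustration; the defaults are the ones listed above.

```python
DEFAULTS = {"input_tokens": 1024, "output_tokens": 256, "quant": "fp16"}
VALID_QUANTS = {"fp16", "bf16", "fp8", "int8", "int4"}
ARG_NAMES = ["model", "gpu", "input_tokens", "output_tokens", "quant"]


def parse_invocation(command: str) -> dict:
    """Split a '/llm-perf-estimator ...' command into named arguments (hypothetical helper)."""
    parts = command.split()
    if parts and parts[0].startswith("/"):
        parts = parts[1:]  # drop the slash command itself

    args = {"model": None, "gpu": None, **DEFAULTS}
    for name, value in zip(ARG_NAMES, parts):
        args[name] = int(value) if name.endswith("_tokens") else value

    if args["quant"] not in VALID_QUANTS:
        raise ValueError(f"unsupported quant format: {args['quant']}")

    # Arguments without a default must be asked for interactively.
    args["missing"] = [k for k in ("model", "gpu") if args[k] is None]
    return args


print(parse_invocation("/llm-perf-estimator Qwen2.5-7B RTX4090 2048 512 fp16"))
print(parse_invocation("/llm-perf-estimator")["missing"])
```

An empty `missing` list means the estimate can proceed directly; otherwise the agent would prompt for the listed fields.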

## Step 1 — Resolve Model Architecture

### Preset Models

If the user provides a known model name, use the following presets:

| Model | Type | Total Params | Activated Params | Layers | Hidden | Heads (Q) | Heads (KV) | FFN Type | Intermediate | Vocab |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3.5-4B | Hybrid Dense | 4B | 4B | 32 (8 full+24 linear) | 2560 | 16 (full) / 16 (linear) | 4 (full) | SwiGLU | 9216 | 248320 |
| Qwen3.5-35B-A3B | Hybrid MoE | 35B | 3B | 40 (10 full+30 linear) | 2048 | 16 (full) / 16 (linear) | 2 (full) | SwiGLU+MoE | 8×512 per tok | 248320 |

If the model is not in the preset list and no config file is provided, ask the user to provide a config.json. They can get it without downloading the full model:

```
# ModelScope (browser)
https://modelscope.cn/models/{org}/{model}/file/view/master/config.json

# HuggingFace (browser)
https://huggingface.co/{org}/{model}/blob/main/config.json
```

Open the URL, copy the content, and paste it directly into the conversation. Alternatively, provide the local file path if the model is already downloaded.

If the user cannot provide a config, ask them to manually input:
- `num_hidden_layers`, `hidden_size`, `num_attention_heads`, `num_key_value_heads`
- `intermediate_size`, `vocab_size`
- For MoE: `num_experts`, `num_experts_per_tok`, `moe_intermediate_size`

### Parsing config.json

If the user provides a `config.json` path, read the file and extract:
```
num_hidden_layers, hidden_size, num_attention_heads, num_key_value_heads,
intermediate_size, vocab_size, model_type,
# MoE fields (if present):
num_experts / num_local_experts, num_experts_per_tok, moe_intermediate_size
# Hybrid attention (if present):
layer_types  ← list of strings, e.g. ["linear_attention", ..., "full_attention", ...]
head_dim     ← if explicitly provided, use it; otherwise head_dim = hidden_size / num_attention_heads
```

**Determine `num_full_attn_layers`**:
- If `layer_types` exists: `num_full_attn_layers = count of "full_attention" in layer_types`
- If `layer_types` is absent (standard transformer): `num_full_attn_layers = num_hidden_layers`

**Note on nested configs** (e.g. Qwen3.5-35B-A3B has a `text_config` wrapper):
- If the top-level JSON has a `text_config` key, read all text model fields from inside it.
- `head_dim` may be explicitly set (e.g. `256`); prefer that over computing from `hidden_size / num_attention_heads`.

**Note on `tie_word_embeddings`**: if `true`, the embedding table and lm_head share the same weights. Do not count them twice in VRAM — the embedding contributes `vocab_size × hidden_size × bytes_per_param` only once.

**Note on `attn_output_gate`**: recognized but ignored in calculations — its contribution to FLOPs and VRAM is <1% and within the MFU uncertainty margin.
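
The parsing rules above can be sketched in Python as follows. This is an illustrative reading of those rules, not code shipped with the skill: the function names and the returned dictionary layout are hypothetical. Defaulting `num_key_value_heads` to `num_attention_heads` when the field is absent, and looking for `tie_word_embeddings` at either nesting level, are assumptions.

```python
import json


def parse_model_config(raw: dict) -> dict:
    """Extract the architecture fields the estimator needs from a config dict."""
    # Nested configs keep the text model fields inside `text_config`.
    cfg = raw.get("text_config", raw)

    layers = cfg["num_hidden_layers"]
    hidden = cfg["hidden_size"]
    heads = cfg["num_attention_heads"]

    layer_types = cfg.get("layer_types")
    if layer_types is not None:
        full = sum(1 for t in layer_types if t == "full_attention")
    else:
        full = layers  # standard transformer: every layer is full attention

    return {
        "num_hidden_layers": layers,
        "hidden_size": hidden,
        "num_attention_heads": heads,
        # Assumption: a missing field means plain multi-head attention.
        "num_key_value_heads": cfg.get("num_key_value_heads", heads),
        # Prefer an explicit head_dim over hidden_size / num_attention_heads.
        "head_dim": cfg.get("head_dim") or hidden // heads,
        "num_full_attn_layers": full,
        "num_linear_attn_layers": layers - full,
        "num_experts": cfg.get("num_experts", cfg.get("num_local_experts")),
        "tie_word_embeddings": bool(
            cfg.get("tie_word_embeddings", raw.get("tie_word_embeddings", False))
        ),
    }


def load_model_config(path: str) -> dict:
    """Read a local config.json and parse it."""
    with open(path, encoding="utf-8") as f:
        return parse_model_config(json.load(f))


# Made-up values, only to exercise the nested and hybrid branches.
example = {
    "text_config": {
        "num_hidden_layers": 4,
        "hidden_size": 2048,
        "num_attention_heads": 16,
        "num_key_value_heads": 2,
        "head_dim": 256,
        "layer_types": ["linear_attention"] * 3 + ["full_attention"],
    }
}
print(parse_model_config(example))
```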

---

## Step 2 — Resolve GPU Specs

### Preset GPUs

| GPU | VRAM (GB) | BF16 TFLOPS | FP8 TFLOPS | INT8 TOPS | Memory BW (GB/s) |
|---|---|---|---|---|---|
| RTX 4060 | 8 | 15.1 | — | 30.2 | 272 |
| RTX 4060 Ti | 16 | 22.1 | — | 44.2 | 288 |
| RTX 4070 | 12 | 29.1 | — | 58.2 | 504 |
| RTX 4070 Ti | 12 | 40.1 | — | 80.2 | 504 |
| RTX 4070 Ti Super | 16 | 40.1 | — | 80.2 | 672 |
| RTX 4080 | 16 | 48.7 | — | 97.4 | 717 |
| RTX 4080 Super | 16 | 52.2 | — | 104.4 | 736 |
| RTX 4090 | 24 | 82.6 | — | 165.2 | 1008 |
| RTX 5070 Ti | 16 | 176.0 | 352.0 | 352.0 | 896 |
| RTX 5080 | 16 | 225.0 | 450.0 | 450.0 | 960 |
| RTX 5090 | 32 | 419.0 | 838.0 | 838.0 | 1792 |
| A10G | 24 | 31.2 | — | 62.5 | 600 |
| A100-40G | 40 | 77.97 | — | 311.9 | 1555 |
| A100-80G | 80 | 77.97 | — | 311.9 | 2000 |
| H100-SXM | 80 | 989.4 | 1978.9 | 3958.0 | 3350 |
| H100-PCIe | 80 | 756.0 | 1513.0 | 3026.0 | 2000 |
| H200-SXM | 141 | 989.4 | 1978.9 | 3958.0 | 4800 |
| L4 | 24 | 30.3 | 60.6 | 121.2 | 300 |
| L40S | 48 | 91.6 | 183.2 | 366.4 | 864 |
| MI300X | 192 | 1307.4 | 2614.9 | 5229.8 | 5300 |
| Apple M4 (16GB) | 16 | 4.6 | — | — | 120 |
| Apple M4 Pro (48GB) | 48 | 9.2 | — | — | 273 |
| Apple M4 Max (128GB) | 128 | 18.4 | — | — | 546 |

If the GPU is not listed, ask the user to provide:
- VRAM (GB)
- BF16/FP16 TFLOPS
- Memory bandwidth (GB/s)

---

## Step 3 — Quantization Bytes Per Parameter

| Format | Bytes/param | Compute dtype | Notes |
|---|---|---|---|
| fp32 | 4.0 | fp32 | Rarely used for inference |
| bf16 / fp16 | 2.0 | bf16/fp16 | Baseline |
| fp8 | 1.0 | fp8 | Requires FP8 hardware support (e.g. H100/H200/RTX 50 series; see the FP8 TFLOPS column in Step 2) |
| int8 | 1.0 | int8 | W8A8 or W8A16 |
| int4 | 0.5 | int4/fp16 | GPTQ/AWQ/bitsandbytes |

Select the GPU TFLOPS column matching the compute dtype:
- fp16/bf16 → BF16 TFLOPS
- fp8 → FP8 TFLOPS (fall back to BF16 if not supported, with a warning)
- int8 → INT8 TOPS
- int4 → BF16 TFLOPS (dequant to fp16 for matmul in most frameworks)
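
As a rough illustration, the column selection above might look like this in Python. The function name `select_compute_tflops` and the `(value, warning)` return shape are hypothetical. The fp8 fallback follows the rule above; the int8 fallback for GPUs without an INT8 entry is an assumption that mirrors it.

```python
def select_compute_tflops(quant, bf16_tflops, fp8_tflops=None, int8_tops=None):
    """Return (throughput, warning) for the compute dtype implied by `quant`.

    Pass None for a column shown as missing in the GPU table.
    """
    if quant in ("fp16", "bf16"):
        return bf16_tflops, None
    if quant == "int4":
        # int4 weights are dequantized to fp16 for the matmul in most frameworks
        return bf16_tflops, None
    if quant == "fp8":
        if fp8_tflops is None:
            return bf16_tflops, "FP8 not supported on this GPU; using BF16 TFLOPS"
        return fp8_tflops, None
    if quant == "int8":
        if int8_tops is None:
            # Assumption: same fallback as fp8; the rules above do not cover this case.
            return bf16_tflops, "INT8 throughput unknown; using BF16 TFLOPS"
        return int8_tops, None
    raise ValueError(f"unsupported quant format: {quant}")


# RTX 4090 row from the preset table: BF16 82.6, no FP8 entry, INT8 165.2
print(select_compute_tflops("int8", 82.6, None, 165.2))
print(select_compute_tflops("fp8", 82.6, None, 165.2))
```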

---

## Step 4 — Compute VRAM Requirements

### 4.1 Weight Memory

```
weight_bytes = total_params × bytes_per_param
weight_GB = weight_bytes / 1e9
```

For MoE models, `total_params` includes all expert weights (not just activated).

### 4.2 KV Cache Memory

Only **full attention layers** maintain a KV cache. Linear attention layers use a fixed-size recurrent state (negligible, ~tens of MB) that does not grow with sequence length.

```
kv_heads = num_key_value_heads          # from full attention config
kv_bytes_per_token = 2 × num_full_attn_layers × kv_heads × head_dim × bytes_per_param
kv_cache_GB = kv_bytes_per_token × (input_tokens + output_tokens) / 1e9
```

If `num_full_attn_layers = num_hidden_layers` (standard transformer), this reduces to the standard formula.
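
A worked example of the KV cache formula, written as a small Python sketch. The function name `kv_cache_gb` is hypothetical. The inputs use the Qwen3.5-35B-A3B preset (10 full attention layers, 2 KV heads) at fp16; `head_dim = 256` is an assumption borrowed from the example value in the nested-config note.

```python
def kv_cache_gb(num_full_attn_layers, kv_heads, head_dim,
                bytes_per_param, input_tokens, output_tokens):
    """KV cache size in GB, following the formula in Step 4.2."""
    # Factor 2 covers the separate K and V tensors.
    kv_bytes_per_token = 2 * num_full_attn_layers * kv_heads * head_dim * bytes_per_param
    return kv_bytes_per_token * (input_tokens + output_tokens) / 1e9


# 2 × 10 × 2 × 256 × 2 bytes = 20480 bytes per token, over 2048 + 512 tokens
hybrid = kv_cache_gb(10, 2, 256, 2.0, 2048, 512)
# The same model if all 40 layers kept a KV cache
standard = kv_cache_gb(40, 2, 256, 2.0, 2048, 512)
print(hybrid, standard)  # about 0.052 GB vs about 0.21 GB
```

Under these assumptions the hybrid layout needs a quarter of the cache a standard 40-layer transformer would.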

### 4.3 Activation Memory (prefill peak)

```
activation_GB ≈ num_layers × hidden_size × input_tokens × bytes_per_param × 2 / 1e9
```

This is an approximation; actual peak depends on framework and attention implementation.

### 4.4 Total VRAM

```
total_VRAM_GB = weight_GB + kv_cache_GB + activation_GB
```

Add a **15% overhead** for framework buffers, CUDA context, etc.:
```
total_VRAM_GB_with_overhead = total_VRAM_GB × 1.15
```
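
The full VRAM breakdown from Steps 4.1 to 4.4 can be sketched as below. The function name `vram_estimate_gb` and the result keys are hypothetical. The example uses the Qwen3.5-4B preset at fp16 with 2048 input tokens, treats "4B" as exactly 4e9 parameters, and plugs in an illustrative KV cache of 0.1 GB rather than a computed value.

```python
def vram_estimate_gb(total_params, num_layers, hidden_size, input_tokens,
                     bytes_per_param, kv_cache_gb, overhead=0.15):
    """VRAM breakdown in GB following Steps 4.1, 4.3 and 4.4."""
    weight_gb = total_params * bytes_per_param / 1e9
    activation_gb = num_layers * hidden_size * input_tokens * bytes_per_param * 2 / 1e9
    subtotal = weight_gb + kv_cache_gb + activation_gb
    return {
        "weights": weight_gb,
        "kv_cache": kv_cache_gb,
        "activations": activation_gb,
        "overhead": subtotal * overhead,
        "total": subtotal * (1 + overhead),
    }


breakdown = vram_estimate_gb(
    total_params=4e9, num_layers=32, hidden_size=2560,
    input_tokens=2048, bytes_per_param=2.0,
    kv_cache_gb=0.1,  # illustrative placeholder, not derived from the model
)
for name, gb in breakdown.items():
    print(f"{name}: {gb:.2f} GB")
```

With these inputs the weights dominate at 8 GB, activations add roughly 0.67 GB, and the total with overhead comes to about 10.1 GB.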

---

## Step 5 — Estimate TTFT (Prefill Latency)

Prefill is **compute-bound** for long sequences.

### 5.1 Attention FLOPs (prefill)

Only **full attention layers** have O(n²) attention compute. Linear attention layers are O(n) and their attention FLOPs are already captured in the projection FLOPs (Step 5.3).

```
attn_flops = 4 × num_full_attn_layers × input_tokens² × hidden_size
```
(factor of 4 = 2 FLOPs per multiply-add × 2 matmuls, QKᵀ and AV, forward pass; softmax cost is negligible by comparison)

If `num_full_attn_layers = num_hidden_layers`, this is the standard transformer formula.

### 5.2 FFN FLOPs (prefill)

For SwiGLU/GeGLU (3 projections: gate, up, down):
```
ffn_flops = 3 × 2 × num_layers × input_tokens × hidden_size × intermediate_size
```

For MoE, replace `intermediate_size` with `num_experts_per_tok × moe_intermediate_size`.

### 5.3 QKV + Output Projection FLOPs

For **full attention layers** (standard QKV projections):
```
full_proj_flops = 2 × num_full_attn_layers × input_tokens × hidden_size
                  × (num_attention_heads × head_dim + 2 × kv_heads × head_dim + hidden_size)
```

For **linear attention layers** (also have Q/K/V-equivalent projections, but different dims):
```
linear_proj_flops = 2 × num_linear_attn_layers × input_tokens × hidden_size
                    × (linear_num_key_heads × linear_key_head_dim
                       + linear_num_key_heads × linear_key_head_dim
                       + linear_num_value_heads × linear_value_head_dim
                       + hidden_size)
```

If `layer_types` is absent (standard transformer), only `full_proj_flops` applies and `num_linear_attn_layers = 0`.

### 5.4 Total Prefill FLOPs

```
total_prefill_flops = attn_flops + ffn_flops + full_proj_flops + linear_proj_flops
```
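
Steps 5.1 to 5.4 can be combined into one function, sketched here in Python. The function name `prefill_flops` and the dictionary-based `cfg` argument are hypothetical; the keys reuse the field names from the formulas above. The toy configuration uses deliberately tiny, made-up dimensions so the arithmetic can be checked by hand.

```python
def prefill_flops(cfg: dict, input_tokens: int) -> dict:
    """Prefill FLOPs per Steps 5.1-5.4."""
    n, h = input_tokens, cfg["hidden_size"]
    full = cfg["num_full_attn_layers"]
    linear = cfg.get("num_linear_attn_layers", 0)
    num_layers = full + linear

    # 5.1: quadratic attention cost, full attention layers only
    attn = 4 * full * n**2 * h

    # 5.2: SwiGLU FFN; MoE models use the activated expert width
    if "num_experts_per_tok" in cfg:
        ffn_width = cfg["num_experts_per_tok"] * cfg["moe_intermediate_size"]
    else:
        ffn_width = cfg["intermediate_size"]
    ffn = 3 * 2 * num_layers * n * h * ffn_width

    # 5.3: Q, K, V and output projections in full attention layers
    q_dim = cfg["num_attention_heads"] * cfg["head_dim"]
    kv_dim = cfg["num_key_value_heads"] * cfg["head_dim"]
    full_proj = 2 * full * n * h * (q_dim + 2 * kv_dim + h)

    # 5.3: equivalent projections in linear attention layers
    linear_proj = 0
    if linear:
        k_dim = cfg["linear_num_key_heads"] * cfg["linear_key_head_dim"]
        v_dim = cfg["linear_num_value_heads"] * cfg["linear_value_head_dim"]
        linear_proj = 2 * linear * n * h * (2 * k_dim + v_dim + h)

    return {"attn": attn, "ffn": ffn, "full_proj": full_proj,
            "linear_proj": linear_proj,
            "total": attn + ffn + full_proj + linear_proj}


# Toy standard transformer: 2 layers, hidden 8, 2 heads of dim 4, 1 KV head
toy = {"hidden_size": 8, "num_full_attn_layers": 2, "intermediate_size": 16,
       "num_attention_heads": 2, "num_key_value_heads": 1, "head_dim": 4}
flops = prefill_flops(toy, input_tokens=4)
print(flops)  # attn 1024 + ffn 6144 + full_proj 3072 = 10240
```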

### 5.5 TTFT

Apply **MFU (Model FLOP Utilization)** efficiency factor:

| Scenario | MFU |
|---|---|
| Long prompt (>512 tokens), data center GPU | 0.45 |
| Long prompt, consumer GPU | 0.35 |
| Short prompt (<128 tokens) | 0.25 |

```
effective_tflops = gpu_tflops × MFU
TTFT_seconds = total_prefill_flops / (effective_tflops × 1e12)
```
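
A minimal sketch of the TTFT calculation, assuming the MFU is picked by the caller from the table above (the table leaves prompts between 128 and 512 tokens unspecified, so no automatic choice is attempted). The names `MFU` and `ttft_seconds` are hypothetical, and the 1e13 FLOPs figure is an illustrative input rather than a value derived from a real model.

```python
# MFU values copied from the table above; the dictionary keys are made up.
MFU = {"long_datacenter": 0.45, "long_consumer": 0.35, "short": 0.25}


def ttft_seconds(total_prefill_flops, gpu_tflops, mfu):
    """Time to first token in seconds, per Step 5.5."""
    effective_tflops = gpu_tflops * mfu
    return total_prefill_flops / (effective_tflops * 1e12)


# RTX 4090 (82.6 BF16 TFLOPS) with a long prompt on a consumer GPU
ttft = ttft_seconds(1e13, gpu_tflops=82.6, mfu=MFU["long_consumer"])
print(f"TTFT ≈ {ttft * 1000:.0f} ms")
```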

---

## Step 6 — Estimate Decode Speed

Decode is **memory-bandwidth-bound** at batch=1.

### 6.1 Bytes Read Per Decode Step

Each decode step reads:
- All activated model weights once
- KV cache for all previous tokens (full attention layers only; linear attention state is fixed-size and already loaded with weights)

```
activated_weight_bytes = activated_params × bytes_per_param
kv_cache_bytes_at_step = kv_bytes_per_token × (input_tokens + current_output_tokens)
bytes_per_step = activated_weight_bytes + kv_cache_bytes_at_step
```

For the average decode step, use `current_output_tokens ≈ output_tokens / 2`.

### 6.2 Decode Speed

Apply **bandwidth utilization** efficiency factor:

| Scenario | BW Utilization |
|---|---|
| Data center GPU (HBM2e/HBM3) | 0.85 |
| Consumer GPU (GDDR6X) | 0.75 |
| Apple Silicon (unified memory) | 0.80 |

```
effective_bandwidth = gpu_bandwidth_GBs × bw_utilization
decode_speed_tps = effective_bandwidth × 1e9 / bytes_per_step
```
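
Steps 6.1 and 6.2 combine into a short calculation, sketched below. The function name `decode_speed_tps` is hypothetical. The example uses the Qwen3.5-4B preset at fp16 on an RTX 4090; "4B" is treated as exactly 4e9 parameters, and `head_dim = 2560 / 16 = 160` is an assumption derived from the fallback rule in Step 1, since the preset table does not list it.

```python
def decode_speed_tps(activated_params, bytes_per_param, kv_bytes_per_token,
                     input_tokens, output_tokens, gpu_bandwidth_gbs, bw_utilization):
    """Average decode speed in tokens/s, per Steps 6.1 and 6.2."""
    activated_weight_bytes = activated_params * bytes_per_param
    # Average decode step: half of the output tokens are already in the cache.
    avg_context = input_tokens + output_tokens / 2
    bytes_per_step = activated_weight_bytes + kv_bytes_per_token * avg_context
    effective_bandwidth = gpu_bandwidth_gbs * bw_utilization
    return effective_bandwidth * 1e9 / bytes_per_step


# KV bytes per token: 2 × 8 full layers × 4 KV heads × 160 × 2 bytes = 20480
tps = decode_speed_tps(
    activated_params=4e9, bytes_per_param=2.0, kv_bytes_per_token=20480,
    input_tokens=2048, output_tokens=512,
    gpu_bandwidth_gbs=1008, bw_utilization=0.75,
)
print(f"{tps:.1f} tokens/s")
```

Here the weights account for 8 GB of the roughly 8.05 GB read per step, which is why decode speed at batch size 1 tracks weight size and memory bandwidth so closely.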

---

## Step 7 — Output Report

Present results as a Markdown report with the following sections:

### Section 1: Configuration Summary

| Parameter | Value |
|---|---|
| Model | {model_name} |
| Type | Dense / MoE / Hybrid MoE |
| Total Params | {X}B |
| Activated Params | {X}B |
| Total Layers | {N} |
| Full Attention Layers | {N} ({N} linear attention) |
| GPU | {gpu_name} |
| VRAM Available | {X} GB |
| Quantization | {quant} |
| Input Tokens | {N} |
| Output Tokens | {N} |

### Section 2: VRAM Breakdown

| Component | Size (GB) |
|---|---|
| Model Weights | {X} |
| KV Cache | {X} |
| Activations (peak) | {X} |
| Framework Overhead (15%) | {X} |
| **Total Required** | **{X}** |
| GPU Available | {X} |
| **Fits in VRAM?** | ✅ Yes / ❌ No |

If it doesn't fit, suggest:
- A lower quantization format
- Offloading options (CPU offload, disk offload)
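
One way to pick the suggestion is to walk the formats from highest to lowest precision and return the first that fits. The sketch below assumes, as the Step 4 formulas do, that weights, KV cache, and activations all scale with `bytes_per_param`, so the total is an element count times bytes per element times 1.15. The names `suggest_quant`, `kv_elems`, and `act_elems` are hypothetical. fp8 is left out because it needs the same bytes as int8 and depends on GPU support.

```python
# Ordered from highest to lowest precision (dicts preserve insertion order).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def total_vram_gb(total_params, kv_elems, act_elems, bytes_per_param):
    """Total VRAM with 15% overhead; kv_elems and act_elems are element counts."""
    return (total_params + kv_elems + act_elems) * bytes_per_param / 1e9 * 1.15


def suggest_quant(total_params, kv_elems, act_elems, gpu_vram_gb):
    """Return the highest-precision format that fits, or None if none does."""
    for quant, nbytes in BYTES_PER_PARAM.items():
        if total_vram_gb(total_params, kv_elems, act_elems, nbytes) <= gpu_vram_gb:
            return quant
    return None  # nothing fits: fall back to CPU or disk offload suggestions


# Qwen3.5-35B-A3B preset, 2048 input + 512 output tokens, head_dim 256 assumed
kv_elems = 2 * 10 * 2 * 256 * (2048 + 512)   # K and V elements in 10 full layers
act_elems = 40 * 2048 * 2048 * 2             # layers × hidden × input_tokens × 2
print(suggest_quant(35e9, kv_elems, act_elems, gpu_vram_gb=24))  # int4
print(suggest_quant(35e9, kv_elems, act_elems, gpu_vram_gb=80))  # int8
```

Under these assumptions fp16 needs about 81 GB, which just misses an 80 GB card, so int8 is suggested there and int4 on a 24 GB card.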

### Section 3: Performance Estimates

| Metric | Estimate |
|---|---|
| TTFT (Time to First Token) | {X} ms |
| Decode Speed | {X} tokens/s |
| Time to Generate {N} tokens | {X} s |
| Total End-to-End Latency | {X} s |
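
The last two rows are not given explicit formulas in this document. A plausible reading, shown here purely as an assumption, is that generation time is `output_tokens / decode_speed` and end-to-end latency is TTFT plus generation time. The function name `latency_summary` and the input values are illustrative.

```python
def latency_summary(ttft_s, decode_tps, output_tokens):
    """Derive the Section 3 rows from TTFT and decode speed (assumed formulas)."""
    generation_s = output_tokens / decode_tps
    return {
        "ttft_ms": ttft_s * 1000,
        "decode_tps": decode_tps,
        "generation_s": generation_s,
        "end_to_end_s": ttft_s + generation_s,
    }


summary = latency_summary(ttft_s=0.35, decode_tps=94.0, output_tokens=512)
print(summary)
```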

### Section 4: Assumptions & Caveats

List the MFU and bandwidth utilization values used, and note:
- Estimates assume batch_size=1, single GPU
- Actual performance varies by framework (vLLM, llama.cpp, Ollama, etc.)
- FlashAttention / FlashAttention-2 is assumed for prefill
- KV cache quantization not considered
- Speculative decoding not considered

---

## Notes for the Agent

- Always show intermediate calculations in a collapsible section or footnote if the user asks "how did you calculate this"
- If VRAM is insufficient, proactively suggest the minimum quantization that would fit
- If the user provides a `config.json`, confirm the parsed values before computing
- Round all results to 2 significant figures for readability
- For MoE models, clearly distinguish total vs activated parameters in all calculations
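
Rounding to 2 significant figures differs from rounding to 2 decimal places. A small helper, sketched here with the hypothetical name `round_sig`, shows the distinction.

```python
from math import floor, log10


def round_sig(x: float, sig: int = 2) -> float:
    """Round x to `sig` significant figures."""
    if x == 0:
        return 0.0
    # Position of the leading digit decides how many decimals to keep.
    return round(x, sig - 1 - floor(log10(abs(x))))


print(round_sig(93.9459))   # 94.0
print(round_sig(0.345901))  # 0.35
print(round_sig(1234.5))    # 1200.0
```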

Security Recommendations

This skill appears to do exactly what it says (estimate TTFT, decode speed, and VRAM from model and GPU specs), and it asks for no credentials or installs. A few practical cautions before use:

- If you provide a local file path, the agent will read that file. Do not point it at unrelated sensitive files (e.g., ~/.ssh, credentials files, or system configs).
- Prefer copying and pasting only the model config.json contents (or sanitizing it) rather than giving a broad directory path. Configs typically do not contain secrets, but double-check before pasting.
- The skill suggests visiting HF/ModelScope URLs in your browser and pasting config text; it does not fetch those URLs itself. If you prefer, provide model parameters manually instead of providing a file.
- No environment variables or cloud credentials are requested, and there is no install step. If you see prompts later asking for secrets or for the skill to fetch remote resources, stop and verify why.

Overall this skill is internally consistent and low-risk for the stated task; follow the above precautions about local file paths and pasted content.

Functional Analysis

Type: OpenClaw Skill
Name: llm-perf-estimator
Version: 1.0.0

The skill is a legitimate tool designed to estimate LLM performance metrics (VRAM, TTFT, and throughput) based on model architecture and GPU specifications. It contains detailed mathematical formulas and hardware presets (e.g., RTX 4090, H100) consistent with its stated purpose. While it instructs the agent to read local configuration files (SKILL.md), this is a functional requirement for parsing model architectures and does not exhibit signs of intentional data exfiltration, malicious execution, or harmful prompt injection.

Capability Assessment

✓ Purpose & Capability

The skill's name/description (LLM inference performance estimator) matches the actions described in SKILL.md: parsing model configs, accepting GPU specs/quant formats, and computing TTFT/throughput/VRAM. It does not request unrelated binaries, credentials, or system config paths.

ℹ Instruction Scope

Runtime instructions stay within the stated purpose: they ask for a preset model name or a model config.json (user-pasted content or a local file path) and GPU specs. The only noteworthy behavior is that, if given a local file path, the agent is instructed to read that file to extract fields — which is necessary for the estimator but means the agent will access whatever file path the user supplies. The SKILL.md does not instruct the agent to fetch remote URLs itself (it suggests the user open HF/ModelScope links in a browser and paste the config).

✓ Install Mechanism

No install spec or code files — instruction-only skill. This minimizes risk because nothing is downloaded or written to disk by the skill itself.

✓ Credentials

The skill declares no required environment variables, credentials, or special config paths. The only inputs are user-supplied model config data and GPU specs, which are proportionate to estimation functionality.

✓ Persistence & Privilege

`always` is false and the skill is user-invocable. `disable-model-invocation` is left at its default (the agent may invoke the skill autonomously), which is the platform default and not excessive here. The skill does not request persistent system-wide changes or access to other skills' configs.

Version History

v1.0.0

Initial public release.

- Estimate LLM inference performance metrics: TTFT (Time To First Token), decode speed, and VRAM requirements.
- Supports model selection by name, config file, or interactive input.
- Includes detailed preset tables for major LLMs and GPUs, with support for custom entries.
- Handles quantization effects and key architectural details (MoE, hybrid attention, embeddings).
- Guides the user step-by-step if information is missing.
- Provides clear calculation methods and caveats for each metric.

Metadata

Slug llm-perf-estimator

Version 1.0.0

License MIT-0

Total installs 0

Current installs 0

Number of past versions 1

FAQ

What is LLM Inference Performance Estimator?

It estimates LLM inference performance metrics, including TTFT, decode speed, and VRAM requirements, based on model architecture, GPU specs, and quantization format. It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 139 cumulative downloads to date.

How do I install LLM Inference Performance Estimator?

Run the command "/install llm-perf-estimator" in an OpenClaw or Claude Code chat to install it in one step; no additional configuration is needed.

Is LLM Inference Performance Estimator free?

Yes. LLM Inference Performance Estimator is completely free and released under the MIT-0 license; you may download, install, and use it freely.

Which platforms does LLM Inference Performance Estimator support?

LLM Inference Performance Estimator is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed LLM Inference Performance Estimator?

It is developed and maintained by zhangyu68 (@zhangyu68); the current version is v1.0.0.
