← 返回 Skills 市场
Vllm
作者
zhangifonly
· GitHub ↗
· v1.0.0
· MIT-0
313
总下载
0
收藏
1
当前安装
1
版本数
在 OpenClaw 中安装
/install vllm
功能描述
vLLM 推理引擎助手,精通高性能 LLM 部署、PagedAttention、OpenAI 兼容 API
使用说明 (SKILL.md)
vLLM 高性能推理引擎助手
你是 vLLM 部署和优化领域的专家,帮助用户高效部署和运行大语言模型。
核心优势
| 特性 | 说明 |
|---|---|
| PagedAttention | 类似操作系统虚拟内存的 KV Cache 管理,显存利用率提升 2-4 倍 |
| 连续批处理 | Continuous Batching,动态合并请求,吞吐量远超静态批处理 |
| 高吞吐 | 相比 HuggingFace Transformers 推理速度提升 14-24 倍 |
| Prefix Caching | 自动缓存公共前缀,多轮对话和共享系统提示词场景加速明显 |
| 投机解码 | Speculative Decoding,用小模型加速大模型生成 |
安装部署
pip install vllm # 需要 CUDA 12.1+
# Docker 部署(推荐生产环境)
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct
OpenAI 兼容 API 服务器
# 基础启动
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
# 生产环境推荐配置
vllm serve Qwen/Qwen2.5-72B-Instruct \
--tensor-parallel-size 4 \
--max-model-len 32768 \
--gpu-memory-utilization 0.9 \
--enable-prefix-caching \
--max-num-seqs 256 --port 8000
支持的主流模型
| 模型系列 | 代表模型 | 参数量 |
|---|---|---|
| Llama 3.1 | meta-llama/Llama-3.1-8B-Instruct | 8B/70B/405B |
| Qwen 2.5 | Qwen/Qwen2.5-7B-Instruct | 0.5B-72B |
| DeepSeek V3 | deepseek-ai/DeepSeek-V3 | 671B (MoE) |
| Mistral | mistralai/Mistral-7B-Instruct-v0.3 | 7B |
| ChatGLM | THUDM/glm-4-9b-chat | 9B |
| Gemma 2 | google/gemma-2-27b-it | 2B/9B/27B |
关键参数详解
| 参数 | 默认值 | 说明 |
|---|---|---|
--tensor-parallel-size |
1 | 张量并行 GPU 数,多卡必设 |
--max-model-len |
模型默认 | 最大上下文长度,降低可省显存 |
--gpu-memory-utilization |
0.9 | GPU 显存使用比例,0.0-1.0 |
--max-num-seqs |
256 | 最大并发序列数 |
--dtype |
auto | 数据类型:auto/half/float16/bfloat16 |
--quantization |
None | 量化方式:awq/gptq/fp8/squeezellm |
--enable-prefix-caching |
False | 启用前缀缓存,多轮对话推荐开启 |
量化支持
| 量化方式 | 精度损失 | 显存节省 | 说明 |
|---|---|---|---|
| FP16/BF16 | 无 | 基准 | 默认精度 |
| AWQ | 极小 | ~50% | 推荐,4bit 量化,需预量化模型 |
| GPTQ | 小 | ~50% | 经典方案,社区模型多 |
| FP8 | 极小 | ~50% | H100/L40S 原生支持,推荐新硬件 |
vllm serve TheBloke/Llama-2-70B-Chat-AWQ --quantization awq
与同类工具对比
| 特性 | vLLM | Ollama | TGI | llama.cpp |
|---|---|---|---|---|
| 定位 | 生产级高吞吐推理 | 本地便捷运行 | HuggingFace 官方 | CPU/边缘推理 |
| 吞吐量 | 极高 | 中等 | 高 | 低-中 |
| 多卡支持 | 原生 TP/PP | 不支持 | 支持 | 有限 |
| 量化 | AWQ/GPTQ/FP8 | GGUF | AWQ/GPTQ/BnB | GGUF 专精 |
| 适用场景 | 服务端大规模部署 | 个人本地使用 | HF 生态集成 | 低资源设备 |
常见问题排查
- OOM 错误:降低
--max-model-len或--gpu-memory-utilization - 模型加载慢:使用
--load-format safetensors,确保本地有缓存 - 多卡不均衡:检查
CUDA_VISIBLE_DEVICES和 NVLink 拓扑 - 输出乱码:确认模型和 tokenizer 版本匹配,检查 chat template
安全使用建议
This skill is a coherent deployment guide for vLLM, but follow safe practices before running the recommended commands: 1) Verify the pip package and Docker image sources (vllm on PyPI, vllm/vllm-openai on Docker Hub) and prefer pinned versions in production rather than 'latest'. 2) Be aware that mounting ~/.cache/huggingface into a container gives that container access to your local cached models and artifacts — avoid mounting sensitive directories or tokens. 3) Exposing port 8000 will make the model server reachable from the host/network; restrict binding or use firewalls/auth if you don't want public access. 4) The commands download and execute third-party code (pip/docker) and require GPU/CUDA compatibility; run them in an isolated environment if you want to limit blast radius. 5) If you plan to load private models, confirm where your model tokens are stored and avoid unintentionally exposing them to containers. These precautions will reduce risk while using this (otherwise coherent) skill.
功能分析
Type: OpenClaw Skill
Name: vllm
Version: 1.0.0
The skill bundle is a standard documentation and instruction set for deploying the vLLM inference engine. It contains legitimate installation commands, Docker configurations, and parameter explanations consistent with the official vLLM project, with no evidence of malicious intent, data exfiltration, or prompt injection attacks in SKILL.md.
能力评估
Purpose & Capability
Name/description (vLLM inference engine, OpenAI-compatible API, PagedAttention, deployment tips) match the SKILL.md content: installation commands, docker usage, 'vllm serve' examples, model/parameter guidance and troubleshooting. Nothing requested or described appears unrelated to deploying or running vLLM.
Instruction Scope
The SKILL.md instructs the user to run pip install and docker run commands and to mount ~/.cache/huggingface into the container, expose port 8000, and enable GPU access. Those actions are expected for deploying vLLM, but they do grant the runtime container access to local model cache data and expose a server port. The instructions do not attempt to read arbitrary system files or request unrelated secrets, but they do assume access to GPU/CUDA and local caches.
Install Mechanism
This is instruction-only (no install spec or code files). However, the commands instruct users to pip install 'vllm' and to pull/run the 'vllm/vllm-openai:latest' Docker image. Those operations will download and execute third-party code (PyPI package and Docker image) — standard for this use case but worth verifying the sources before running in production.
Credentials
The skill declares no required environment variables or credentials (proportional). The only potential exposure is mounting ~/.cache/huggingface into the container (recommended in the doc) which may expose cached model files and any credentials stored there; also private models may require tokens kept elsewhere. No unexplained credential requests exist.
Persistence & Privilege
always is false and there is no install-time code or persistent modifications described. The skill does not request persistent agent privileges or modify other skills' configuration.
如何使用
- 确保已安装 OpenClaw(本地或 Docker 部署)
- 在对话框中输入安装命令:
/install vllm - 安装完成后,直接呼叫该 Skill 的名称或使用
/vllm触发 - 根据 Skill 的参数说明提供必要输入,即可获得结构化输出
版本历史
v1.0.0
vLLM skill 1.0.0 initial release:
- 提供高性能 LLM 推理引擎助手,专注于大模型部署与优化
- 支持 PagedAttention、Continuous Batching、Prefix Caching、Speculative Decoding 等核心特性
- 兼容 OpenAI API,主流模型与多种量化方式支持
- 包含详细安装、部署、关键参数与常见问题指引
- 对比 Ollama、TGI、llama.cpp,定位生产级高吞吐服务端场景
元数据
常见问题
Vllm 是什么?
vLLM 推理引擎助手,精通高性能 LLM 部署、PagedAttention、OpenAI 兼容 API. 它是一个面向 Claude Code / OpenClaw 的 AI Agent Skill 插件,目前累计下载 313 次。
如何安装 Vllm?
在 OpenClaw 或 Claude Code 对话框中运行命令「/install vllm」即可一键安装,无需额外配置。
Vllm 是免费的吗?
是的,Vllm 完全免费,采用 MIT-0 许可证,可自由下载、安装和使用。
Vllm 支持哪些平台?
Vllm 跨平台运行,可在任意部署了 OpenClaw / Claude Code 的环境中使用(cross-platform)。
谁开发了 Vllm?
由 zhangifonly(@zhangifonly)开发并维护,当前版本 v1.0.0。
推荐 Skills