nissan

Aa Benchmarking Framework

Author: Nissan Dookeran · GitHub · v0.1.0 · MIT-0
cross-platform ⚠ suspicious
Total downloads: 120
Favorites: 0
Current installs: 1
Versions: 1
Install in OpenClaw
/install aa-benchmarking-framework
Description
Composite scoring and efficiency frontier analysis for LLM evaluation — combines multiple quality dimensions (accuracy, latency, cost, consistency) into a si...
Usage instructions (SKILL.md)

Last used: 2026-03-24 Memory references: 1 Status: Active

AA Benchmarking Framework

STATUS: DRAFT — This skill is planned but not yet fully implemented.

What This Does

Provides a systematic framework for multi-dimensional LLM evaluation using composite scoring, efficiency frontier analysis, and Pareto optimality. Rather than ranking models on a single metric, it helps identify which models are non-dominated, i.e., no other model is at least as good on every dimension and strictly better on at least one. Designed for teams that need principled model selection beyond simple leaderboard rankings.
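Because the skill is still a draft and ships no code, the following is only a minimal sketch of how a non-dominated check could look, not the author's implementation. The model names, metric keys (accuracy, latency_s, cost_usd), numbers, and the functions dominates and pareto_front are all hypothetical and chosen purely for illustration.

```python
from typing import Dict, List

# Hypothetical benchmark results (illustrative numbers, not real measurements).
MODELS: Dict[str, Dict[str, float]] = {
    "model_a": {"accuracy": 0.90, "latency_s": 1.2, "cost_usd": 5.0},
    "model_b": {"accuracy": 0.85, "latency_s": 0.6, "cost_usd": 2.0},
    "model_c": {"accuracy": 0.84, "latency_s": 0.9, "cost_usd": 2.5},
}

# +1 means "higher is better", -1 means "lower is better" (an assumed convention).
DIRECTIONS: Dict[str, int] = {"accuracy": +1, "latency_s": -1, "cost_usd": -1}


def dominates(a: Dict[str, float], b: Dict[str, float],
              directions: Dict[str, int]) -> bool:
    """True if a is at least as good as b on every dimension
    and strictly better on at least one."""
    at_least_as_good = all(d * a[k] >= d * b[k] for k, d in directions.items())
    strictly_better = any(d * a[k] > d * b[k] for k, d in directions.items())
    return at_least_as_good and strictly_better


def pareto_front(models: Dict[str, Dict[str, float]],
                 directions: Dict[str, int]) -> List[str]:
    """Names of models that no other model dominates, sorted alphabetically."""
    return sorted(
        name
        for name, metrics in models.items()
        if not any(
            dominates(other, metrics, directions)
            for other_name, other in models.items()
            if other_name != name
        )
    )


if __name__ == "__main__":
    print(pareto_front(MODELS, DIRECTIONS))
```

In this made-up data set, model_c is dominated by model_b (lower accuracy, slower, and more expensive), while model_a and model_b trade accuracy against latency and cost, so neither dominates the other. The pairwise comparison is quadratic in the number of models, which is usually fine for the handful of candidates in a model-selection exercise.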

Planned Capabilities

  • Composite scoring with configurable dimension weights (accuracy, latency, cost, recall, F1)
  • Pareto frontier detection across any two or more evaluation dimensions
  • Radar/spider chart visualisation for multi-dimensional comparison
  • Statistical significance testing across benchmark runs (t-test, Mann-Whitney U)
  • Integration with LangFuse for trace-based evaluation data ingestion
  • Export to CSV/JSON for downstream analysis
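As an illustration of the first capability in this list, composite scoring with configurable weights might be sketched as below. This is an assumption-laden example rather than the skill's actual behaviour: min-max normalisation is one common choice among several, and the function names, metric keys, weights, and data are hypothetical.

```python
from typing import Dict


def min_max_normalise(values: Dict[str, float],
                      higher_is_better: bool) -> Dict[str, float]:
    """Rescale one metric to [0, 1] so that 1.0 is always the best value."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # All models tie on this dimension; give every model full credit.
        return {name: 1.0 for name in values}
    scaled = {name: (v - lo) / (hi - lo) for name, v in values.items()}
    if not higher_is_better:
        scaled = {name: 1.0 - s for name, s in scaled.items()}
    return scaled


def composite_scores(results: Dict[str, Dict[str, float]],
                     weights: Dict[str, float],
                     higher_is_better: Dict[str, bool]) -> Dict[str, float]:
    """Weighted sum of normalised metrics; weights are rescaled to sum to 1."""
    total_weight = sum(weights.values())
    scores = {name: 0.0 for name in results}
    for dim, weight in weights.items():
        column = {name: metrics[dim] for name, metrics in results.items()}
        normalised = min_max_normalise(column, higher_is_better[dim])
        for name, value in normalised.items():
            scores[name] += (weight / total_weight) * value
    return scores


# Hypothetical inputs (illustrative numbers only).
RESULTS = {
    "model_a": {"accuracy": 0.90, "latency_s": 2.0, "cost_usd": 10.0},
    "model_b": {"accuracy": 0.80, "latency_s": 1.0, "cost_usd": 5.0},
    "model_c": {"accuracy": 0.85, "latency_s": 1.5, "cost_usd": 5.0},
}
WEIGHTS = {"accuracy": 2.0, "latency_s": 1.0, "cost_usd": 1.0}
HIGHER_IS_BETTER = {"accuracy": True, "latency_s": False, "cost_usd": False}

if __name__ == "__main__":
    for name, score in sorted(
        composite_scores(RESULTS, WEIGHTS, HIGHER_IS_BETTER).items()
    ):
        print(f"{name}: {score:.3f}")
```

Min-max scores are relative to the set of models being compared, so adding or removing a candidate can shift every score. A composite score is therefore best read alongside the Pareto analysis rather than as a replacement for it.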

When To Use

  • Choosing between 3+ LLM providers on competing objectives (e.g. GPT-4o vs Claude 3.5 vs Gemini)
  • Building an evaluation dashboard for recurring model benchmarks
  • Presenting model selection rationale to stakeholders with visual evidence
  • Running efficiency frontier analysis to identify cost-optimal models for a quality threshold
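The last use case, finding a cost-optimal model for a given quality threshold, can be sketched for the two-dimensional cost/accuracy case as follows. This is a hedged illustration only: the functions efficiency_frontier and cheapest_meeting_threshold, the metric keys, and the data are hypothetical and are not taken from the skill.

```python
from typing import Dict, List, Optional

# Hypothetical cost/accuracy results (illustrative numbers only).
RESULTS: Dict[str, Dict[str, float]] = {
    "cheap": {"accuracy": 0.70, "cost_usd": 1.0},
    "mid": {"accuracy": 0.82, "cost_usd": 3.0},
    "wasteful": {"accuracy": 0.80, "cost_usd": 8.0},
    "pricey": {"accuracy": 0.90, "cost_usd": 10.0},
}


def efficiency_frontier(results: Dict[str, Dict[str, float]]) -> List[str]:
    """Models for which no cheaper-or-equal model reaches the same accuracy.

    Sort by ascending cost (ties broken by higher accuracy first) and keep a
    model only if it beats the best accuracy seen so far.
    """
    ordered = sorted(
        results.items(),
        key=lambda item: (item[1]["cost_usd"], -item[1]["accuracy"]),
    )
    frontier: List[str] = []
    best_accuracy = float("-inf")
    for name, metrics in ordered:
        if metrics["accuracy"] > best_accuracy:
            frontier.append(name)
            best_accuracy = metrics["accuracy"]
    return frontier


def cheapest_meeting_threshold(results: Dict[str, Dict[str, float]],
                               min_accuracy: float) -> Optional[str]:
    """Cheapest model whose accuracy meets the threshold, or None if none do."""
    eligible = {
        name: metrics
        for name, metrics in results.items()
        if metrics["accuracy"] >= min_accuracy
    }
    if not eligible:
        return None
    return min(eligible, key=lambda name: eligible[name]["cost_usd"])


if __name__ == "__main__":
    print(efficiency_frontier(RESULTS))
    print(cheapest_meeting_threshold(RESULTS, 0.80))
```

With this made-up data, "wasteful" falls off the frontier because "mid" is both cheaper and more accurate. Returning None when no model meets the threshold leaves it to the caller to decide whether to relax the requirement.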
Security recommendations
This skill is currently a draft with plausible goals, but several inconsistencies make it risky to enable for production use. Before installing or granting the agent access, ask the author to: (1) provide concrete runtime instructions and example commands/scripts; (2) declare any required environment variables (e.g., LangFuse API key) and justify why 'primaryEnv' is set to 'production'; (3) clarify whether outbound networking is required and update metadata accordingly; and (4) supply the implementation (code or install spec) so you can review exactly what will run. If you must test now, do so in an isolated environment where no sensitive credentials or production data are available.
Capability assessment
Purpose & Capability
The name, description, and planned capabilities match a benchmarking/analysis skill. Requesting python3 as a runtime makes sense for data processing/visuals. However, the metadata's primaryEnv set to 'production' is unexplained and disproportionate for a pure benchmarking helper; the SKILL.md also references integration with LangFuse (an external tracing service) but does not declare any required credentials or network access.
Instruction Scope
The SKILL.md is a draft and contains only high-level planned capabilities, not concrete runtime instructions. It mentions ingesting trace data from LangFuse and exporting results, which implies reading external data and making outbound network requests, yet the metadata claims outbound networking is false and no environment variables or endpoints are declared. Because runtime behavior is underspecified, it's unclear what data the skill will read, what endpoints it will contact, or what credentials it will require.
Install Mechanism
Instruction-only skill with no install spec and no code files. That minimizes immediate disk/write risk. Declaring python3 as a required binary is reasonable for a planned implementation; otherwise there is nothing being fetched or installed.
Credentials
No environment variables are declared, yet the metadata sets primaryEnv to 'production' and the text promises LangFuse integration (which normally requires an API key). This mismatch means either the skill will need secrets/network access that are not declared, or the manifest is incorrect; both are red flags for incomplete or inconsistent security posture.
Persistence & Privilege
always is false and there are no install hooks or instructions to modify agent/system configuration. The skill does not request persistent elevated presence in its current form.
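How to use
  1. Make sure OpenClaw is installed (locally or via Docker)
  2. Enter the install command in the chat box: /install aa-benchmarking-framework
  3. Once installation completes, invoke the skill by name or trigger it with /aa-benchmarking-framework
  4. Provide the inputs described in the skill's parameter documentation to receive structured output
Version history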
v0.1.0
New skill: hypothesis-driven model evaluation framework for local inference routing
Metadata
Slug aa-benchmarking-framework
Version 0.1.0
License MIT-0
Total installs 1
Current installs 1
Versions 1
FAQ

What is Aa Benchmarking Framework?

Composite scoring and efficiency frontier analysis for LLM evaluation: combines multiple quality dimensions (accuracy, latency, cost, consistency) into a si... It is an AI agent skill plugin for Claude Code / OpenClaw, with 120 total downloads to date.

How do I install Aa Benchmarking Framework?

Run the command "/install aa-benchmarking-framework" in an OpenClaw or Claude Code chat to install it in one step; no additional configuration is required.

Is Aa Benchmarking Framework free?

Yes. Aa Benchmarking Framework is completely free and released under the MIT-0 license; it can be freely downloaded, installed, and used.

Which platforms does Aa Benchmarking Framework support?

Aa Benchmarking Framework is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed Aa Benchmarking Framework?

It is developed and maintained by Nissan Dookeran (@nissan); the current version is v0.1.0.
