Description

$0 test-time scaling with online learning. Classify, generate, and verify using free model ensembles. Models self-select via ELO scoring + A/B testing from d...

Usage (SKILL.md)

Free Scaling

Name: Free Scaling
Author: isotrivial

$0 test-time scaling infrastructure using NVIDIA NIM free tier.

Three patterns, one API key:

from free_scaling import scale, generate, health

# Classify — vote on labels
result = scale("Is this safe?", context=code, k=3,
               answer_patterns=["SAFE", "VULNERABLE"])

# Generate — best-of-k with cross-evaluation
result = generate("Summarize this paper.", context=paper, k=3)

# Verify — just scale() with source+output as context
check = scale("Any hallucinated claims?",
              context=f"Source:\n{src}\n\nOutput:\n{draft}",
              k=3, answer_patterns=["YES", "NO"])

Setup

Get a free API key at build.nvidia.com
export NVIDIA_API_KEY="nvapi-..."
No pip install — stdlib only (Python 3.10+)

Core API

`scale(question, context, k, answer_patterns)` → CascadeResult

Classification via ensemble voting. Ask k models, majority wins.

result = scale(
    "Is this email urgent? Answer URGENT, NORMAL, or IGNORE.",
    context=email_body,
    k=3,
    answer_patterns=["URGENT", "NORMAL", "IGNORE"]
)
result.answer       # "NORMAL"
result.confidence   # 1.0
result.calls_made   # 3
result.elapsed_s    # 1.8

Parameters:

question — what to judge (should end with "Answer X or Y")
context — material to evaluate (placed in system message)
k — models to query: 1, 3, 5, or "auto" (smart cascade)
answer_patterns — expected answers (e.g. ["YES", "NO"])
models — override model selection (list of aliases)
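The voting step described above can be pictured as a plain majority count. The sketch below is only an illustration, not the package's implementation: `majority_vote` is a hypothetical helper, and treating confidence as the winning label's share of the votes is an assumption (it is at least consistent with a `confidence` of 1.0 when all three models agree).

```python
from collections import Counter

def majority_vote(votes):
    """Return (answer, confidence) for a list of label strings.

    Assumption: confidence is the winning label's share of all votes.
    """
    if not votes:
        raise ValueError("need at least one vote")
    # Normalize case and whitespace so "normal" and "NORMAL" count together.
    counts = Counter(v.strip().upper() for v in votes)
    answer, n = counts.most_common(1)[0]
    return answer, n / len(votes)

answer, confidence = majority_vote(["NORMAL", "normal", "URGENT"])
unanimous = majority_vote(["YES", "YES", "YES"])
```

With an odd k and two labels a tie cannot occur; with three or more labels a tie is possible, and this sketch simply returns the label seen first.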

`generate(question, context, k)` → GenerateResult

Best-of-k generation with cross-evaluation. Round 1: k models generate. Round 2: k different models judge which is best.

result = generate(
    "Summarize this email in 2 sentences.",
    context=email_text,
    k=3,
    max_tokens=200,
)
result.output          # winning summary
result.all_outputs     # all 3 summaries
result.winner_model    # "llama-3.3"
result.judge_votes     # ["2", "2", "2"]
result.total_calls     # 6 (3 gen + 3 judge)
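The two rounds can be sketched with plain callables standing in for model calls. Everything here is hypothetical: `best_of_k`, the stub generators, and the convention that each judge returns a 1-based index as its vote are assumptions chosen to mirror the fields shown above, not the package's actual code.

```python
from collections import Counter

def best_of_k(generators, judges, prompt):
    """Round 1: every generator produces a candidate.
    Round 2: every judge returns the 1-based index of its preferred candidate.
    """
    outputs = [g(prompt) for g in generators]
    votes = [str(j(prompt, outputs)) for j in judges]
    winner = int(Counter(votes).most_common(1)[0][0]) - 1
    return {
        "output": outputs[winner],
        "all_outputs": outputs,
        "judge_votes": votes,
        "total_calls": len(generators) + len(judges),
    }

# Stub generators standing in for three different models.
gens = [lambda p: "short", lambda p: "a longer summary", lambda p: "mid one"]

def longest_wins(prompt, outputs):
    # Hypothetical judge: prefers the longest candidate.
    return max(range(len(outputs)), key=lambda i: len(outputs[i])) + 1

result = best_of_k(gens, [longest_wins] * 3, "Summarize this email.")
```

Using separate judge models, as the description says, keeps a model from grading its own output; the sketch leaves that separation to the caller.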

`scale_batch(items, k)` / `generate_batch(items, k)`

Parallel batch versions. Each item is a dict with question, context, answer_patterns.

results = scale_batch([
    {"question": "Urgent?", "context": e, "answer_patterns": ["YES", "NO"]}
    for e in emails
], k=3)
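A parallel batch with per-item error capture could look roughly like the following. This is a sketch under assumptions: `run_batch` and `fake_scale` are hypothetical names, and returning an `{"error": ...}` dict for a failed item is one possible convention, not necessarily what `scale_batch` returns.

```python
from concurrent.futures import ThreadPoolExecutor

def run_batch(fn, items, max_workers=4):
    """Apply fn(**item) to every item in parallel, preserving input order.

    A failing item yields {"error": ...} instead of aborting the batch.
    """
    def safe(item):
        try:
            return fn(**item)
        except Exception as exc:
            return {"error": f"{type(exc).__name__}: {exc}"}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe, items))

def fake_scale(question, context, answer_patterns):
    # Stand-in for a real ensemble call; fails on empty context.
    if not context:
        raise ValueError("empty context")
    return {"answer": answer_patterns[0]}

results = run_batch(fake_scale, [
    {"question": "Urgent?", "context": "call me now", "answer_patterns": ["YES", "NO"]},
    {"question": "Urgent?", "context": "", "answer_patterns": ["YES", "NO"]},
])
```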

`health(models=None)` → dict

Probe models. Returns status per model (ok/dead/slow/error + latency).

status = health()  # all models
status = health(models=["llama-3.3", "gemma-27b"])  # specific

Dead models are auto-skipped in subsequent calls and retried after 5 minutes.
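The skip-then-retry behaviour amounts to a small TTL registry. The class below is a hypothetical sketch (`DeadModelTracker` and its methods are not names from the package); the 300-second default mirrors the 5-minute retry window mentioned above, and the injectable clock exists only to make the example deterministic.

```python
import time

class DeadModelTracker:
    """Remember which models failed and hide them until a TTL expires."""

    def __init__(self, ttl_s=300.0, clock=time.monotonic):
        self.ttl_s = ttl_s
        self._clock = clock
        self._dead_since = {}

    def mark_dead(self, model):
        self._dead_since[model] = self._clock()

    def is_available(self, model):
        since = self._dead_since.get(model)
        if since is None:
            return True
        if self._clock() - since >= self.ttl_s:
            del self._dead_since[model]  # TTL expired: eligible for retry
            return True
        return False

    def alive(self, models):
        return [m for m in models if self.is_available(m)]

# Fake clock so the example does not need to sleep.
now = [0.0]
tracker = DeadModelTracker(ttl_s=300, clock=lambda: now[0])
tracker.mark_dead("gemma-27b")
panel = tracker.alive(["llama-3.3", "gemma-27b"])
now[0] = 301.0
panel_later = tracker.alive(["llama-3.3", "gemma-27b"])
```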

Online Learning (v3.3)

Models self-select through deployment data. No manual benchmarking needed.

from free_scaling import elo, feedback
from free_scaling.evolve import evolve, report

# Every scale() call automatically:
# 1. Logs votes to ELO tracker
# 2. Runs 1 shadow challenger for A/B data
# 3. Logs result for user feedback resolution

# Check current rankings
print(elo.summary())

# User feedback (4× stronger than consensus signal)
feedback.resolve_by_reaction("discord-msg-id", "👍")   # confirm
feedback.resolve_by_reaction("discord-msg-id", "🅱️")   # Panel B wins
feedback.resolve_by_reaction("discord-msg-id", "🔴")   # override to URGENT

# Weekly panel evolution
result = evolve(dry_run=True)   # check if panel should change
result = evolve(dry_run=False)  # apply the change

How it works:

Consensus: models that agree with majority get +ELO (K=16)
Override: user feedback is 4× stronger (K=64)
Shadow challenger: 1 extra model per call for free A/B data
Evolution: top-3 by ELO become champion panel (requires 30+ calls/model)
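The effect of the two K-factors can be seen with the standard Elo update rule. The source gives only the K values; pairing an agreeing model against a dissenting one, as done here, is an assumption, and `expected_score` and `elo_update` are hypothetical helper names.

```python
def expected_score(r_a, r_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_winner, r_loser, k):
    """Return the new (winner, loser) ratings after one comparison."""
    delta = k * (1.0 - expected_score(r_winner, r_loser))
    return r_winner + delta, r_loser - delta

# Two equally rated models: the model that matched the majority "wins".
consensus = elo_update(1500.0, 1500.0, k=16)
# The same outcome decided by explicit user feedback moves ratings 4x as far.
override = elo_update(1500.0, 1500.0, k=64)
```

Because the rating change is linear in K, a single user override shifts a rating as much as four consensus votes between models of the same rating gap.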

Smart Features

Online learning: ELO-based model scoring from deployment data (see above)
A/B testing: shadow challengers run alongside panel for competitive signal
Auto-heal: 404/410 models get marked dead, substituted with same-tier alternatives, retried after 5min TTL
Context routing: context goes in system message, question stays in user message
Parallel short-circuit: submits all k models in parallel, cancels remaining when first 2 agree
Task classification: k="auto" classifies the question type and routes to the best expert
Copilot integration: cp-* aliases route automatically through GitHub Copilot API
User feedback loop: Discord reaction → ELO update (👍 confirm, 🅰️🅱️ A/B, 🔴🟡⚪ override)
Error isolation: batch functions catch per-item failures without killing the batch
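The parallel short-circuit can be approximated with `concurrent.futures`. This is a sketch under assumptions: `vote_until_agreement` is a hypothetical name, the callables stand in for model requests, and `cancel_futures=True` only drops calls that have not started yet, so a request already in flight still runs to completion in the background.

```python
import threading
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed

def vote_until_agreement(callables, needed=2):
    """Run all callables in parallel; return as soon as `needed` answers agree.

    Returns (answer, number_of_completed_calls).
    """
    counts = Counter()
    pool = ThreadPoolExecutor(max_workers=len(callables))
    futures = [pool.submit(c) for c in callables]
    try:
        for fut in as_completed(futures):
            answer = fut.result()
            counts[answer] += 1
            if counts[answer] >= needed:
                return answer, sum(counts.values())
        # No early agreement: fall back to the most common answer.
        return counts.most_common(1)[0][0], sum(counts.values())
    finally:
        # Do not wait for stragglers; drop anything still queued.
        pool.shutdown(wait=False, cancel_futures=True)

release = threading.Event()

def slow_model():
    # Simulated slow model: blocks until released (or 5 s at most).
    release.wait(timeout=5)
    return "VULNERABLE"

answer, completed = vote_until_agreement([lambda: "SAFE", lambda: "SAFE", slow_model])
release.set()  # let the background thread finish
```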

13 Models Included

Tier	Models	Latency
Fast	llama-3.3 70B, gemma-27b, nemotron-super-49b, dracarys-70b, jamba-mini	<1s
Medium	mistral-large 675B, kimi-k2, qwen-397b, llama-405b, mistral-medium	1-3s
Thinking	deepseek-v3.1, minimax-m2.5 🧠, kimi-k2.5 🧠	3s+

All free via NVIDIA NIM. One API key covers everything.

CLI

python3 -m nim_ensemble.cli scale "Is this safe?" -k 3 --answers "SAFE,VULNERABLE"
python3 -m nim_ensemble.cli models     # list available models
python3 -m nim_ensemble.cli panels     # list panels

Capability Profiling (optional)

Profile models on your tasks for data-driven routing:

python3 -m nim_ensemble.capability_map --models llama-3.3 gemma-27b mistral-large --trials 3

Generates capability_map.json — the cascade loads it automatically.

Architecture

nim_ensemble/
├── __init__.py       # Exports: scale, generate, health, scale_batch, generate_batch
├── cascade.py        # scale(), scale_batch(), smart cascade
├── generate.py       # generate(), generate_batch(), best-of-k
├── voter.py          # Core voting engine, NIM + Copilot backends
├── health.py         # Model probing, dead-model tracking, substitution
├── models.py         # Model registry, panels
├── parser.py         # Answer extraction (thinking models, negation, word boundaries)
├── elo.py            # Online ELO scoring, model ranking
├── feedback.py       # User feedback loop (reactions → ELO updates)
├── evolve.py         # Weekly panel evolution (promote/demote by ELO)
├── cli.py            # CLI interface
├── benchmark.py      # Single-trial profiling
└── capability_map.py # Multi-trial profiling with error correlation
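The word-boundary handling noted for parser.py matters because a label such as SAFE is a substring of UNSAFE. The sketch below illustrates the idea only: `extract_answer` is a hypothetical function, taking the last match as the final answer is an assumption aimed at models that reason before answering, and negation handling is left out.

```python
import re

def extract_answer(text, answer_patterns):
    """Return the last answer pattern that appears as a whole word, or None."""
    canonical = {p.lower(): p for p in answer_patterns}
    # Longest patterns first so a longer label is preferred over a shorter prefix.
    ordered = sorted(answer_patterns, key=len, reverse=True)
    alternation = "|".join(re.escape(p) for p in ordered)
    matches = re.findall(rf"\b(?:{alternation})\b", text, flags=re.IGNORECASE)
    return canonical[matches[-1].lower()] if matches else None

final = extract_answer(
    "This looks UNSAFE at first, but on review: safe.",
    ["SAFE", "VULNERABLE"],
)
no_match = extract_answer("UNSAFE", ["SAFE", "VULNERABLE"])
```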

Requirements

NVIDIA_API_KEY environment variable (free at build.nvidia.com)
Python 3.10+ (stdlib only, no pip dependencies)
Optional: GitHub Copilot token for cp-* model aliases

Security Recommendations

This package appears to implement what it claims (an ensemble/cascade with online ELO-based model selection), but there are several operational and security mismatches you should consider before installing:
- Credentials not declared: SKILL.md and README instruct you to export NVIDIA_API_KEY and mention optional Copilot and Discord integrations, but the registry metadata lists no required env vars or primary credential. Inspect the code (call_model, call_copilot, feedback) to find exactly which tokens and endpoints are used before providing keys.
- Data sent to remote services: Your 'context' is placed into the system prompt and sent to external model endpoints (NVIDIA NIM and optionally Copilot). Do not pass secrets, private PII, or sensitive code you don't want transmitted to third-party models.
- Persistent local state: The skill writes ELO/tracking to ~/.cache/free-scaling/elo.json (or FREE_SCALING_STATE_DIR). That file contains aggregated vote history and could include snippets of raw responses. If that concerns you, set FREE_SCALING_STATE_DIR to a controlled location or inspect/reset the file regularly.
- Undeclared optional integrations: Discord reaction feedback and GitHub Copilot routing are referenced but tokens for those services are not declared in metadata. If you intend to use those features, find the exact env var names and confirm where tokens are stored and how the code uses them.
- Review network calls: If you need higher assurance, search call_model / voter / feedback code for HTTP hosts and endpoints (build.nvidia.com, GitHub/Copilot, Discord) and verify they are legitimate. Run the code in an isolated environment or on non-sensitive data first.

Bottom line: the skill is functionally coherent with its stated purpose, but the missing declarations around required env vars/credentials and the persistence of deployment data make it suspicious until you verify the code paths that talk to external services and confirm which credentials will be used.

Functional Analysis

Type: OpenClaw Skill
Name: free-scaling
Version: 3.3.1

The bundle is classified as suspicious due to unauthorized credential-harvesting logic found in `nim_ensemble/voter.py`. The function `_refresh_copilot_token` uses `glob` to scan the `~/.openclaw/agents/` directory, specifically targeting `auth-profiles.json` files to extract GitHub OAuth tokens (`ghu_`) belonging to other OpenClaw agents. While this behavior is used to facilitate the advertised "Copilot" model backend, the silent extraction of sensitive credentials from other agents' private directories is a significant security risk and violates isolation principles. The tool also communicates with external endpoints at `integrate.api.nvidia.com` and `api.individual.githubcopilot.com`.

Capability Assessment

ℹ Purpose & Capability

The files implement an ensemble/cascade, ELO scoring, benchmarking, and feedback loops consistent with the 'Free Scaling' description. However the SKILL metadata declares no required environment variables or credentials while the README and SKILL.md explicitly instruct the user to set an NVIDIA_API_KEY and the code references other environment hooks (FREE_SCALING_STATE_DIR, OPENCLAW_WORKSPACE) and optional integrations (GitHub Copilot aliases, Discord reactions). This mismatch between declared requirements and actual needs is inconsistent.

⚠ Instruction Scope

SKILL.md instructs users to export NVIDIA_API_KEY and to use feedback.resolve_by_reaction with Discord message ids and mentions Copilot integration; the runtime instructions and code place user-provided 'context' into model system prompts. The skill also automatically logs every scale() call into persistent ELO state and runs shadow challengers. The instructions therefore direct read/write of local state (~/.cache/free-scaling/elo.json) and transmission of user-provided context to remote model endpoints (NIM and optional Copilot). The SKILL metadata did not declare these external endpoints or credential needs, and the practice of putting arbitrary context into the system prompt can let that content influence model behavior (and be sent to external services).

✓ Install Mechanism

No install spec (instruction-only skill) and package uses only stdlib per SKILL.md. There is no remote download/install step in the registry metadata. The code bundle is included in the skill (many Python files) but nothing indicates it will fetch arbitrary executables or archives on install. No high-risk download URLs were found in the provided files.

⚠ Credentials

The skill requires at least an NVIDIA_API_KEY at runtime (SKILL.md, README) and the code references FREE_SCALING_STATE_DIR and OPENCLAW_WORKSPACE, but the registry metadata lists no required env vars or primary credential. Additionally the code documents optional integrations (GitHub Copilot aliases 'cp-*' and Discord reaction-based feedback) that would require additional tokens/credentials (not declared). Requesting an API key for the model provider is reasonable, but the skill failing to declare these env vars / credentials and to document what additional tokens (Copilot, Discord) are needed is an incoherence and an operational risk (credential leakage or surprise network calls).

ℹ Persistence & Privilege

The skill persists online-learning state to disk (default: ~/.cache/free-scaling/elo.json or FREE_SCALING_STATE_DIR), and automatically logs votes from every scale() call. 'always' is false. The skill writes its own state (normal for this functionality) but this persistent logging means usage data and possibly user-provided contexts are stored locally and used to alter routing. That persistent behavior is coherent with the stated online-learning design but should be understood by users because it creates a long-lived record derived from inputs.
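One way to see what has been persisted is to resolve the state path and list what the file contains. This is a hedged sketch: `state_file` and `summarize_state` are hypothetical helpers, and it assumes the file is still named elo.json when FREE_SCALING_STATE_DIR is set and that its content is JSON, neither of which the review text confirms.

```python
import json
import os
from pathlib import Path

def state_file(env=None):
    """Resolve the ELO state path (assumes the name elo.json in both locations)."""
    env = os.environ if env is None else env
    override = env.get("FREE_SCALING_STATE_DIR")
    base = Path(override) if override else Path.home() / ".cache" / "free-scaling"
    return base / "elo.json"

def summarize_state(path):
    """Return the file size and top-level keys, or None if the file is absent."""
    if not path.exists():
        return None
    data = json.loads(path.read_text(encoding="utf-8"))
    keys = sorted(data) if isinstance(data, dict) else []
    return {"bytes": path.stat().st_size, "top_level_keys": keys}

# Example with an explicit environment mapping instead of the real one.
path = state_file({"FREE_SCALING_STATE_DIR": "/tmp/fs-state"})
```

Calling `summarize_state(state_file())` before and after a few `scale()` calls shows whether raw context ends up in the file.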

Version History

v3.3.1

Patch release: fix case-insensitive ELO/feedback scoring, resolve message-id feedback lookups correctly, distinguish A/B tags, and harden state writes with regression tests.

v3.3.0

v3.3: Online learning — ELO scoring, shadow challengers, A/B testing, user feedback loop. Models self-select from deployment data.

v3.2.0

v3.2.0: Three-pattern API (scale/generate/health), context parameter, auto-heal dead models, parallel short-circuit, Copilot integration. 23 review findings fixed (Claude Code + Gemini).

v2.1.2

Fix display name from 'Nim Ensemble' to 'Free Scaling'

v2.1.1

Add Copilot backend + audit preset to README

v2.1.0

v2.1.0: hybrid audit preset + parser/route fixes

Metadata

Slug free-scaling

Version 3.3.1

License MIT-0

Total installs 0

Current installs 0

Released versions 6

FAQ

What is Free Scaling?

$0 test-time scaling with online learning. Classify, generate, and verify using free model ensembles. Models self-select via ELO scoring + A/B testing from d... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 335 cumulative downloads to date.

How do I install Free Scaling?

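Run the command "/install free-scaling" in the OpenClaw or Claude Code chat box for a one-step install. Installation itself needs no further configuration, but the skill requires the NVIDIA_API_KEY environment variable at runtime (see Setup).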

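Is Free Scaling free?

Yes. Free Scaling is completely free and released under the MIT-0 license; you may download, install, and use it freely.

Which platforms does Free Scaling support?

Free Scaling is cross-platform and runs in any environment where OpenClaw or Claude Code is deployed.

Who develops Free Scaling?

It is developed and maintained by isotrivial (@isotrivial). The current version is v3.3.1.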
