Description

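🦞 GIGO · gigo-lobster-taster: Official tasting mode: runs the full benchmark and, by default, uploads to the cloud, generates a personal results page, and enters the leaderboard. Triggers: 试吃我的龙虾 / 品鉴我的龙虾 / lobster taste / lobster taster.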

Usage (SKILL.md)

gigo-lobster-taster

Name: Gigo Lobster Taster
Author: gigolab

Mission

Primary tasting mode: runs the full benchmark and, by default, uploads the verified result, creates a personal share page, and enters the leaderboard.

Trigger Phrases

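Chinese: 试吃我的龙虾 (taste-test my lobster) / 品鉴我的龙虾 (appraise my lobster) / 鉴定我的龙虾 (authenticate my lobster) / 评估我的龙虾 (evaluate my lobster)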
English: lobster taste / lobster taster / taste my lobster / lobster eval

Execution Rules

Run the wrapper file in this skill's directory with a direct Python command and an absolute path. Never use cd ... && python ...; OpenClaw preflight may reject it.
Prefer python3, then python, then py.
If the user asked in Chinese, append --lang zh. If the user asked in English, append --lang en.
Stream short progress updates while the benchmark is running.
Keep stdout/stderr visible and remind the user that the full log is written to gigo-run.log.
Do not run --help, inspect the whole repo, or switch to main.py once the wrapper command is clear. Start the wrapper directly.
If the wrapper starts a long-running process, do not kill it just because stdout is quiet for a while. A full tasting run often takes 15-25 minutes.
While a long run is in progress, monitor the process and tail the log file under ~/.openclaw/workspace/outputs/gigo-lobster-taster/gigo-run.log instead of improvising a second execution path.
Only declare failure if the process exits non-zero, the log shows a traceback, or the user explicitly asks to cancel.
Stay attached until the wrapper exits. Do not end the conversation with “I will keep monitoring”; keep polling and only report completion once you have the final score/result files/ref_code (if any).
Prefer process poll plus exec tail -n 50 .../gigo-run.log while monitoring. Do not use a generic full-file read on gigo-run.log, because the log can be large and may break the chat output.
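The monitoring rules above can be sketched in Python. This is a minimal illustration, not the skill's own code: the helper names `tail_lines` and `run_failed` are hypothetical, and the third failure condition (the user asking to cancel) is left to the caller.

```python
from collections import deque


def tail_lines(log_path, n=50):
    """Return the last n lines of a text file.

    deque(maxlen=n) keeps only the newest n lines while iterating,
    so a large gigo-run.log is never held in memory as a whole.
    """
    with open(log_path, "r", encoding="utf-8", errors="replace") as fh:
        return [line.rstrip("\n") for line in deque(fh, maxlen=n)]


def run_failed(exit_code, recent_lines):
    """Apply the failure rule: non-zero exit code, or a traceback in the log.

    exit_code is None while the process is still running.
    A quiet log is deliberately NOT treated as a failure.
    """
    if exit_code is not None and exit_code != 0:
        return True
    return any("Traceback (most recent call last)" in line for line in recent_lines)
```

A polling loop would call `tail_lines` on the log path every so often, pass the process's current exit code and those lines to `run_failed`, and keep waiting while the exit code is still `None` and no failure is detected.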

Default Behavior

By default, the run uploads the verified result, creates a personal share page, and enters the leaderboard.

Recommended Command Shape

python3 /absolute/path/to/run_upload.py --lang zh

If the user explicitly asks for overrides, append the matching CLI flags:

--lobster-name "..." and --lobster-tags "tag1,tag2" for a custom lobster persona
--output-dir /custom/path for a custom output directory
--require-png-cert when the user refuses the SVG fallback
--skip-upload or --register-only only when the user explicitly asks to change the default upload behavior
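As a rough sketch, the command shape and override flags above could be assembled as an argument list. The function names and parameters below are hypothetical; only the flags themselves come from this document, and restricting `--lang` to `zh` or `en` is an assumption based on the two values mentioned here.

```python
import shutil


def pick_interpreter(which=shutil.which):
    """Return the first interpreter found on PATH: python3, then python, then py."""
    for name in ("python3", "python", "py"):
        if which(name):
            return name
    raise RuntimeError("no Python interpreter found on PATH")


def build_command(wrapper_path, lang, interpreter="python3", lobster_name=None,
                  lobster_tags=None, output_dir=None, require_png_cert=False,
                  upload_override=None):
    """Build an argv list for the wrapper: no shell, no `cd ... &&` prefix."""
    if lang not in ("zh", "en"):
        raise ValueError("lang must be 'zh' or 'en'")
    if upload_override not in (None, "--skip-upload", "--register-only"):
        raise ValueError("unsupported upload override")
    cmd = [interpreter, wrapper_path, "--lang", lang]
    if lobster_name:
        cmd += ["--lobster-name", lobster_name]
    if lobster_tags:
        cmd += ["--lobster-tags", ",".join(lobster_tags)]
    if output_dir:
        cmd += ["--output-dir", output_dir]
    if require_png_cert:
        cmd.append("--require-png-cert")
    # Only set when the user explicitly asked to change the upload default.
    if upload_override:
        cmd.append(upload_override)
    return cmd
```

Building a list rather than a shell string keeps quoting of names and tags out of the picture, and the defaults leave every override off unless the user asked for it.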

Persona Defaults

Explicit CLI overrides win first: --lobster-name and --lobster-tags
Then read GIGO_LOBSTER_NAME and GIGO_LOBSTER_TAGS
Then read SOUL.md
Finally fall back to the default lobster persona
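The precedence order above amounts to a first-non-empty lookup. The sketch below shows it for the lobster name only. It assumes the name from SOUL.md has already been extracted by the caller (the parsing is not described here), and `DEFAULT_NAME` is a hypothetical placeholder rather than the skill's real default.

```python
import os

DEFAULT_NAME = "default-lobster"  # hypothetical placeholder, not the real default


def resolve_lobster_name(cli_name=None, env=None, soul_name=None, default=DEFAULT_NAME):
    """Pick the lobster name: CLI flag, then GIGO_LOBSTER_NAME, then SOUL.md, then default."""
    env = os.environ if env is None else env
    for candidate in (cli_name, env.get("GIGO_LOBSTER_NAME"), soul_name):
        if candidate:  # skip None and empty strings
            return candidate
    return default
```

The same pattern would apply to tags, with `--lobster-tags` and `GIGO_LOBSTER_TAGS` in place of the name sources.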

Do not stop for interactive questions unless the user explicitly asks for an interactive run.

Safe Usage Advice

By default, this skill runs a local benchmark and uploads the results to a gateway / leaderboard. Before installing or running it:

1) Inspect run_upload.py, scripts/score_uploader.py and scripts/gateway_client.py to find the upload endpoint(s) and how authentication is handled; look for hard-coded URLs or calls that will POST your results.
2) If you do not want results to leave your machine, run in an offline environment, use the companion local mode (gigo-lobster-local), or pass --skip-upload explicitly.
3) Be wary of the SKILL.md instruction that tells the agent not to inspect the repo or run --help: it restricts normal safety checks and is a red flag.
4) If you must run it, review the code for where gateway_base and auth come from (env vars, config files) and consider setting dummy/unprivileged values or running in an isolated VM/container.
5) If unsure, mark this skill as untrusted or ask the publisher for explicit documentation of the upload endpoint and auth model before use.
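The first inspection step, looking for hard-coded URLs and POST calls, can be partly automated. The following is a minimal sketch, assuming the files are plain Python source; `find_upload_hints` is a hypothetical helper, and a regex scan like this only surfaces candidates for manual review (it will miss URLs assembled at runtime).

```python
import re

# Literal http(s) URLs, stopping at whitespace, quotes, or closing brackets.
URL_RE = re.compile(r"https?://[^\s'\"<>)]+")
# Direct calls to requests.post(...).
POST_RE = re.compile(r"\brequests\.post\s*\(")


def find_upload_hints(source_text):
    """Return (line_number, kind, text) tuples for URLs and requests.post calls."""
    hints = []
    for lineno, line in enumerate(source_text.splitlines(), start=1):
        for url in URL_RE.findall(line):
            hints.append((lineno, "url", url))
        if POST_RE.search(line):
            hints.append((lineno, "post", line.strip()))
    return hints
```

To use it, read each file named above (for example with `pathlib.Path(...).read_text()`) and pass the text to `find_upload_hints`, then review every reported line by hand.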

Functional Analysis

Type: OpenClaw Skill
Name: gigo-lobster-taster
Version: 2.1.2

The skill is a comprehensive benchmarking tool designed to evaluate AI agents across multiple dimensions, including task completion, reasoning, and safety. While the bundle contains simulated attack vectors such as prompt injection traps (e.g., in `bundle/tasks/a25_readme_prompt_injection/setup/README.md`) and dangerous script execution tests (e.g., `bundle/tasks/a27_refuse_eval_user_input/setup/dangerous.py`), these are explicitly used as test cases to measure the agent's robustness and refusal behavior. The skill implements a shell shim (`scripts/v2_shell_shim.py`) to monitor and block potentially harmful commands like `rm -rf /` or unauthorized SSH key access during the evaluation process. It also includes a self-bootstrapping mechanism (`scripts/runtime_bootstrap.py`) to manage its own dependencies safely within a virtual environment.

Capability Tags

crypto
requires-sensitive-credentials

Capability Assessment

⚠ Purpose & Capability

The name and description say the skill runs a full tasting, uploads to the cloud, and creates a share page and leaderboard entry. The bundle contains uploader/judge clients and scripts that perform network calls (requests.post) and leaderboard logic, which matches the stated purpose. However, the skill declares no required environment variables or primary credential despite performing uploads and calling a gateway/judge endpoint. A legitimate authenticated upload flow would normally require a gateway URL and auth credentials; their absence is an inconsistency.

⚠ Instruction Scope

SKILL.md explicitly instructs the agent to run a wrapper in-place, stream logs, and — unusually — to not run `--help`, not inspect the repo, and not switch to other entry points. It also directs the agent to tail a log file in the user's workspace and to upload results by default. The prohibition on inspection is suspicious (it limits normal safety checks) and the default-to-upload behavior may cause data to be sent externally without declared auth details.

ℹ Install Mechanism

There is no install spec (instruction-only), which is lower risk for arbitrary downloads. However the skill bundle contains 400+ files including scripts that will be written to disk when the skill is installed. Several files (judge_client, gateway_client, score_uploader, run_upload.py) include network logic (requests). The lack of an explicit install URL lowers supply-chain risk, but a large code bundle that performs network I/O is still a meaningful runtime surface to review before executing.

⚠ Credentials

The SKILL.md and README reference many environment variables and configuration values (GIGO_LOBSTER_*, GIGO_UPLOAD_MODE, OPENCLAW_WORKSPACE_DIR, gateway base/auth), but the skill metadata declares no required env vars or primary credential. Upload and judging functionality implies a gateway URL and authentication, so declaring zero env vars is inconsistent and unexplained.

ℹ Persistence & Privilege

Flags: always=false (good). disable-model-invocation=false (normal). The skill does not request forced always-on privileges or to modify other skills. Still, because it is allowed to execute autonomously and performs network uploads by default, the combination with other concerns increases blast radius if misused.

Version History

v2.1.2

2.1.2: fix leaderboard wording on cert/report so total_entries consistently means ranked entries, not all evaluations.

v2.1.1

2.1.1: smooth full-run cost/speed scoring for real 50-task evaluations and add MiniMax judge retry/fallback.

v2.1.0

2.1.0: run all 50 tasks through cloud judge, tighten speed scoring, and publish richer public diagnostics.

v2.0.19

2.0.19: publish refreshed v2 scoring bundle and recover D1 uploads after slow report responses.

v2.0.18

2.0.18: move judge cache to D1, keep KV config-only, and harden full-run scoring storage.

v2.0.15

2.0.15: harden evaluation/ref APIs, remove default '大侠' fallback, and strengthen real file-edit prompts for v2 code tasks.

v2.0.14

2.0.14: polish user-facing share copy and recommended booster labels.

v2.0.13

2.0.13: harden judge/report security and mark recommended skills as gray testing.

v2.0.12

2.0.12: scale speed scoring for full 50-task runs and polish public task diagnosis cards.

v2.0.11

2.0.11: remove model-prefixed public summary text and clarify bundled official task copy wording.

v2.0.10

2.0.10: restore the original PNG certificate design after rejecting the 2.0.9 redesign.

v2.0.9

2.0.9: redesign PNG certificate toward the clean reference layout while preserving the existing QR/link flow.

v2.0.8

2.0.8: add real OpenClaw per-task runner support, isolate eval sessions, expose M2.7 reasoning in unlocked full diagnosis, and wait longer for slow M2.7 judge responses.

v2.0.7

2.0.7: keep M2.7 judge reasoning stored, show a concise overall personalized note, and avoid labeling deterministic report copy as AI-written.

v2.0.6

2.0.6: switch cloud judge to MiniMax-M2.7, store judge reasoning, and show one overall personalized report note instead of per-task AI comments.

v2.0.5

2.0.5: switch cloud judge to MiniMax-M2.7, preserve AI judge reasoning in task reports, and keep OpenClaw identity name fallback.

v2.0.4

2.0.4: fix OpenClaw lobster name detection by falling back to workspace IDENTITY.md when SOUL.md has no explicit name.

v2.0.3

2.0.3: harden leaderboard consistency, v2 report verification, wrapper bootstrap, Gateway env loading, and CJK certificate rendering.

v2.0.2

2.0.2: harden leaderboard consistency, v2 judge score normalization, OpenClaw run logging, and CJK certificate rendering.

v2.0.1

2.0.1: harden v2 judge score normalization and OpenClaw run logging.

Metadata

Slug gigo-lobster-taster

Version 2.1.2

License MIT-0

Total installs 1

Current installs 1

Historical versions 32

FAQ