openlark

NVIDIA LocateAnything-3B vision-language grounding model

Author: OpenLark · GitHub · v1.0.0 · MIT-0
cross-platform ✓ Security check passed
Total downloads: 16
Favorites: 0
Current installs: 0
Versions: 1
Install in OpenClaw
/install locateanything
Description
NVIDIA LocateAnything-3B vision-language grounding model. Covers inference API (detect/ground/point/detect_text/ground_gui), data preparation (JSONL+Recipe 8...
Usage (SKILL.md)

LocateAnything — Vision-Language Grounding

An NVIDIA Eagle-family VLM that uses Parallel Box Decoding (PBD) to predict complete box coordinates in a single parallel step. Throughput is 12.7 BPS on an H100, roughly 10× Qwen3-VL.

Architecture: MoonViT-SO-400M → MLP → Qwen2.5-3B → PBD

Installation

git clone https://github.com/NVlabs/Eagle eagle && cd eagle/Embodied
pip install -e .
# Optional MagiAttention (Hopper/Blackwell only, long sequences 32K+):
# git clone https://github.com/SandAI-org/MagiAttention.git && cd MagiAttention && git checkout v1.0.5
# git submodule update --init --recursive && pip install --no-build-isolation .

Inference API

from locateanything_worker import LocateAnythingWorker
from PIL import Image

worker = LocateAnythingWorker("nvidia/LocateAnything-3B")
img = Image.open("e.jpg").convert("RGB")

worker.detect(img, ["person", "car"])                    # Object detection
worker.ground_single(img, "the red car")                  # Single-instance grounding
worker.ground_multi(img, "people wearing hats")           # Multi-instance grounding
worker.detect_text(img)                                   # OCR
worker.ground_gui(img, "search button")                   # GUI box
worker.ground_gui(img, "search", output_type="point")     # GUI point
worker.point(img, "the traffic light")                    # Point grounding

# Low-level: worker.predict(image, question, generation_mode="hybrid")
# mode: fast(MTP) | slow(NTP) | hybrid(default)

Output Parsing

Box: <ref>label</ref><box><x1><y1><x2><y2></box>
Point: <box><x><y></box>
Empty: <box>none</box>

Coordinates are integers in [0, 1000]; divide by 1000 to get relative coordinates.

boxes = LocateAnythingWorker.parse_boxes(answer, w, h)  # Pixel coordinates
points = LocateAnythingWorker.parse_points(answer, w, h)
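
To make the output format concrete, here is a minimal standalone sketch of how such an answer string could be parsed with regular expressions. It is not the LocateAnythingWorker implementation: the name parse_answer is hypothetical, and it assumes a <ref> label applies to every <box> that follows it until the next <ref>.

```python
import re

# Hypothetical helper, not part of LocateAnythingWorker.
# Matches either a <ref>label</ref> tag or a <box>...</box> tag.
TOKEN_RE = re.compile(r"<ref>(?P<label>.*?)</ref>|<box>(?P<body>.*?)</box>", re.S)
COORD_RE = re.compile(r"<(\d+)>")


def parse_answer(answer, width, height):
    """Return a list of (label, pixel_coords) tuples.

    pixel_coords has 4 values for a box (x1, y1, x2, y2) and 2 for a point (x, y).
    """
    results = []
    label = None  # assumption: a label carries over to the boxes that follow it
    for match in TOKEN_RE.finditer(answer):
        if match.group("label") is not None:
            label = match.group("label")
            continue
        body = match.group("body")
        if body.strip() == "none":  # empty result marker
            continue
        values = [int(v) for v in COORD_RE.findall(body)]
        if len(values) not in (2, 4):
            continue  # skip malformed boxes
        sizes = [width, height] * (len(values) // 2)
        # Coordinates are integers in [0, 1000]; scale them to pixels.
        pixels = tuple(v * s / 1000 for v, s in zip(values, sizes))
        results.append((label, pixels))
    return results


print(parse_answer("<ref>car</ref><box><100><200><400><500></box>", 1000, 500))
```

For real use, prefer the parse_boxes and parse_points helpers shown above, since they reflect the model's actual conventions.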

Data Preparation

JSONL (ShareGPT Format)

{"conversations":[{"from":"human","value":"Detect all objects in <image-1>."},{"from":"gpt","value":"<ref>car</ref><box><100><200><400><500></box>"}],"image":"train/00001.jpg"}
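
As an illustration, a record like the one above could be generated from pixel-space annotations roughly as follows. The helper names to_token and make_record are hypothetical. Rounding to the nearest integer and emitting <box>none</box> for images without objects are assumptions, not documented behavior.

```python
import json


def to_token(value, size):
    """Normalize a pixel coordinate to an integer in [0, 1000] (rounding is an assumption)."""
    return max(0, min(1000, round(value * 1000 / size)))


def make_record(image_path, width, height, objects,
                prompt="Detect all objects in <image-1>."):
    """Build one ShareGPT-style record from (label, (x1, y1, x2, y2)) pixel boxes."""
    parts = []
    for label, (x1, y1, x2, y2) in objects:
        coords = [to_token(x1, width), to_token(y1, height),
                  to_token(x2, width), to_token(y2, height)]
        tokens = "".join(f"<{c}>" for c in coords)
        parts.append(f"<ref>{label}</ref><box>{tokens}</box>")
    # Assumption: an image with no objects uses the documented empty marker.
    answer = "".join(parts) if parts else "<box>none</box>"
    return {
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": answer},
        ],
        "image": image_path,
    }


record = make_record("train/00001.jpg", 1000, 1000, [("car", (100, 200, 400, 500))])
print(json.dumps(record))  # one line of the JSONL file
```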

Recipe JSON

{"my_data":{"annotation":["a.jsonl","b.jsonl"],"root":"/data/images/","repeat_time":1.0,"data_augment":true}}

repeat_time: ≥1 oversamples, <1 downsamples. Coordinates are normalized to [0, 1000].
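
The sketch below writes the same recipe from Python and shows one plausible reading of repeat_time. The resample function is hypothetical: it assumes repeat_time scales the number of samples drawn from a dataset, which may differ from how the training code actually applies it.

```python
import json
import random

# Same structure as the recipe above; file names and paths are the example values.
recipe = {
    "my_data": {
        "annotation": ["a.jsonl", "b.jsonl"],
        "root": "/data/images/",
        "repeat_time": 1.0,
        "data_augment": True,
    }
}


def resample(samples, repeat_time, seed=0):
    """Hypothetical illustration: repeat whole copies, then add a random fraction."""
    rng = random.Random(seed)
    samples = list(samples)
    whole, frac = divmod(repeat_time, 1.0)
    out = samples * int(whole)                              # full passes over the data
    out += rng.sample(samples, round(len(samples) * frac))  # fractional remainder
    return out


print(json.dumps(recipe, indent=2))
print(len(resample(range(10), 2.5)), len(resample(range(10), 0.3)))
```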

8 Task Prompts

| Task | Method | Prompt |
| --- | --- | --- |
| Detection | detect(cats) | Locate all the instances that matches: cat1</c>cat2. |
| Single instance | ground_single(p) | Locate a single instance that matches: phrase. |
| Multi instance | ground_multi(p) | Locate all instances that match: phrase. |
| OCR | detect_text() | Detect all the text in box format. |
| Text grounding | ground_text(p) | Please locate the text referred as phrase. |
| GUI box | ground_gui(p) | Locate the region that matches: element. |
| GUI point | ground_gui(p,pt) | Point to: element. |
| Point grounding | point(p) | Point to: target. |
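
When building training data or calling the low-level predict method, the prompts above can be kept in one place. This is a small sketch with hypothetical names (PROMPTS, build_prompt). It assumes the category separator for detection is the literal tag </c> and copies the prompt wording verbatim from the table.

```python
# Hypothetical lookup table; keys are illustrative, prompt strings come from the table.
PROMPTS = {
    "detect": "Locate all the instances that matches: {target}.",
    "ground_single": "Locate a single instance that matches: {target}.",
    "ground_multi": "Locate all instances that match: {target}.",
    "detect_text": "Detect all the text in box format.",
    "ground_text": "Please locate the text referred as {target}.",
    "ground_gui_box": "Locate the region that matches: {target}.",
    "ground_gui_point": "Point to: {target}.",
    "point": "Point to: {target}.",
}


def build_prompt(task, target=""):
    """Fill in the prompt template for a task; detection accepts a list of categories."""
    if task == "detect" and not isinstance(target, str):
        target = "</c>".join(target)  # assumption: categories are joined with </c>
    return PROMPTS[task].format(target=target)


print(build_prompt("detect", ["person", "car"]))
print(build_prompt("ground_gui_point", "search button"))
```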

Plain text: omit the image field. Multi-image: use image_list plus <image-1> <image-2>.

Training

torchrun --nproc_per_node=8 eaglevl/train/locany_finetune_magi_stream.py \
  --model_name_or_path nvidia/LocateAnything-3B \
  --meta_path "./recipe.json" --output_dir work_dirs/sft \
  --max_steps 25000 --lr 2e-5 --bf16 True --block_size 6 \
  --attn_implementation magi --max_seq_length 16384 --max_num_tokens 25600 \
  --deepspeed deepspeed_configs/zero_stage2_config.json

Key Parameters

| Parameter | Description |
| --- | --- |
| --block_size | MTP chunk size (default 4); use --causal_attn False during training |
| --attn_implementation | magi (Hopper/Blackwell, 32K+ sequences) or sdpa (any GPU, ~4K) |
| --freeze_llm/backbone/mlp | Freeze the corresponding modules |
| --max_num_tokens | Token budget per batch (recommended: 2-3× max_num_tokens_per_sample) |
| --packing_buffer_size | Online packing buffer (default 32; 64-128 for higher efficiency) |

Non-Hopper GPUs: use --attn_implementation sdpa --max_seq_length 4096. On OOM: set --grad_checkpoint True and reduce --max_num_tokens.

Streaming packing uses a Best-Fit + Big-Rocks-First algorithm, and checkpoint resume is bit-identical. The recommended DeepSpeed configuration is zero_stage2.
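
For intuition, the following sketch packs the sample lengths from one buffer into batches under a token budget, using textbook best-fit with the largest samples placed first. It is an illustration of the general technique, not the repository's streaming implementation. The function name pack_best_fit is hypothetical, and budget plays the role of --max_num_tokens.

```python
def pack_best_fit(lengths, budget):
    """Pack sample lengths into bins holding at most `budget` tokens each.

    Samples are placed largest first ("big rocks first"); each one goes into the
    open bin with the least remaining space that still fits it (best fit).
    """
    bins = []  # each entry: [remaining_capacity, [lengths in this bin]]
    for n in sorted(lengths, reverse=True):
        if n > budget:
            raise ValueError(f"sample of {n} tokens exceeds budget {budget}")
        best = None
        for b in bins:
            if b[0] >= n and (best is None or b[0] < best[0]):
                best = b  # tightest bin that still fits
        if best is None:  # nothing fits: open a new bin
            best = [budget, []]
            bins.append(best)
        best[0] -= n
        best[1].append(n)
    return [items for _, items in bins]


# Lengths of the samples currently in the packing buffer (made-up numbers).
print(pack_best_fit([4, 8, 1, 4, 2, 1], budget=10))
```

A larger buffer gives the packer more candidates to choose from, which is consistent with the note that a --packing_buffer_size of 64-128 yields higher efficiency.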

Evaluation

# COCO / LVIS
bash evaluation/scripts/eval_coco.sh --model_path ... --test_jsonl ... --coco_json ... --output_dir ...
bash evaluation/scripts/eval_lvis.sh --model_path ... --test_jsonl ... --lvis_json ... --output_dir ...

# General grounding (Dense200/DocLayNet/HumanRef/RefCOCOg/VisDrone etc.)
bash evaluation/scripts/eval_grounding.sh --dataset Dense200 --eval_type box_eval ...

# Point evaluation / ScreenSpot-Pro
bash evaluation/scripts/eval_grounding.sh --dataset COCO --eval_type point_eval ...
bash evaluation/scripts/eval_sspro.sh --model_path ... --test_jsonl ... --output_dir ...

Requires Rex-Omni's fastevaluate plus the evaluation datasets Mountchicken/Rex-Omni-EvalData and likaixin/ScreenSpot-Pro.

Key Results

| Benchmark | Score | Comparison |
| --- | --- | --- |
| LVIS F1@Mean | 50.7 | +3.8 vs Rex-Omni |
| COCO F1@Mean | 54.7 | +1.8 vs Rex-Omni |
| M6Doc F1@Mean | 70.1 | +14.5 vs Rex-Omni |
| ScreenSpot-Pro Avg | 60.3 | SOTA |
| RefCOCOg val F1@Mean | 76.7 | SOTA |
| Pointing (7 benchmarks) | Best on all | |
| PBD dense scenes | 2-6× faster | vs NTP |

Model Info

License

Code: Apache 2.0 | Model: NVIDIA License (non-commercial research)

Security Recommendations
Install only if you are comfortable cloning and running the referenced NVIDIA/SandAI ML repositories. Use a virtual environment or container, review the upstream licenses, and note that the model is described as non-commercial research use.
Capability Assessment
Purpose & Capability
The stated purpose is vision-language grounding, OCR, GUI element recognition, training, and evaluation guidance; the content stays aligned with that purpose.
Instruction Scope
Instructions are explicit examples for cloning the upstream project, installing Python packages, running inference, preparing datasets, training, and evaluation; no prompt overrides or hidden agent instructions were found.
Install Mechanism
Installation involves cloning external GitHub repositories and running pip install commands, which is expected for this ML workflow but should be done in an isolated environment.
Credentials
GPU-heavy training and evaluation commands are proportionate to the model workflow and are disclosed; local data paths are examples rather than broad indexing instructions.
Persistence & Privilege
No background services, startup hooks, privilege escalation, credential/session access, destructive commands, or persistent agent behavior were present.
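How to Use
  1. Make sure OpenClaw is installed (local or Docker deployment)
  2. Enter the install command in the chat box: /install locateanything
  3. After installation, invoke the Skill by name or trigger it with /locateanything
  4. Provide the required inputs according to the Skill's parameter description to get structured output
Version History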
v1.0.0
- Initial release of LocateAnything skill, built on NVIDIA's LocateAnything-3B vision-language grounding model.
- Supports core inference APIs for detection, grounding (single/multi-instance), point grounding, OCR, and GUI element recognition.
- Provides guidance on data preparation using JSONL/Recipe format and covers 8 vision-language grounding tasks.
- Includes instructions for installation, training/fine-tuning, and evaluation on standard benchmarks.
- Offers output parsing utilities and documents model performance and architecture.
Metadata
Slug: locateanything
Version: 1.0.0
License: MIT-0
Total installs: 0
Current installs: 0
Versions: 1
FAQ

What is NVIDIA LocateAnything-3B vision-language grounding model?

NVIDIA LocateAnything-3B vision-language grounding model. Covers inference API (detect/ground/point/detect_text/ground_gui), data preparation (JSONL+Recipe 8... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 16 total downloads so far.

How do I install NVIDIA LocateAnything-3B vision-language grounding model?

Run the command "/install locateanything" in the OpenClaw or Claude Code chat box to install it in one step; no additional configuration is required.

Is NVIDIA LocateAnything-3B vision-language grounding model free?

Yes. The skill is completely free under the MIT-0 license and can be freely downloaded, installed, and used. Note that the License section lists the upstream model under the NVIDIA License (non-commercial research).

Which platforms does NVIDIA LocateAnything-3B vision-language grounding model support?

NVIDIA LocateAnything-3B vision-language grounding model is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed NVIDIA LocateAnything-3B vision-language grounding model?

It is developed and maintained by OpenLark (@openlark); the current version is v1.0.0.
