/install locateanything
LocateAnything — Vision-Language Grounding
NVIDIA Eagle family VLM, based on Parallel Box Decoding (PBD) for single-step parallel prediction of complete coordinates. 12.7 BPS (H100) ≈ 10× Qwen3-VL.
Architecture: MoonViT-SO-400M → MLP → Qwen2.5-3B → PBD
Installation
git clone https://github.com/NVlabs/Eagle eagle && cd eagle/Embodied
pip install -e .
# Optional MagiAttention (Hopper/Blackwell only, long sequences 32K+):
# git clone https://github.com/SandAI-org/MagiAttention.git && cd MagiAttention && git checkout v1.0.5
# git submodule update --init --recursive && pip install --no-build-isolation .
Inference API
from locateanything_worker import LocateAnythingWorker
from PIL import Image
worker = LocateAnythingWorker("nvidia/LocateAnything-3B")
img = Image.open("e.jpg").convert("RGB")
worker.detect(img, ["person", "car"]) # Object detection
worker.ground_single(img, "the red car") # Single-instance grounding
worker.ground_multi(img, "people wearing hats") # Multi-instance grounding
worker.detect_text(img) # OCR
worker.ground_gui(img, "search button") # GUI box
worker.ground_gui(img, "search", output_type="point") # GUI point
worker.point(img, "the traffic light") # Point grounding
# Low-level: worker.predict(image, question, generation_mode="hybrid")
# mode: fast(MTP) | slow(NTP) | hybrid(default)
Output Parsing
Box: \x3Cref>label\x3C/ref>\x3Cbox>\x3Cx1>\x3Cy1>\x3Cx2>\x3Cy2>\x3C/box>
Point: \x3Cbox>\x3Cx>\x3Cy>\x3C/box>
Empty: \x3Cbox>none\x3C/box>
Coordinates [0,1000] integers, divide by 1000 for relative coordinates.
boxes = LocateAnythingWorker.parse_boxes(answer, w, h) # Pixel coordinates
points = LocateAnythingWorker.parse_points(answer, w, h)
Data Preparation
JSONL (ShareGPT Format)
{"conversations":[{"from":"human","value":"Detect all objects in \x3Cimage-1>."},{"from":"gpt","value":"\x3Cref>car\x3C/ref>\x3Cbox>\x3C100>\x3C200>\x3C400>\x3C500>\x3C/box>"}],"image":"train/00001.jpg"}
Recipe JSON
{"my_data":{"annotation":["a.jsonl","b.jsonl"],"root":"/data/images/","repeat_time":1.0,"data_augment":true}}
repeat_time: ≥1 oversample, \x3C1 downsample. Coordinates normalized to [0,1000].
8 Task Prompts
| Task | Method | Prompt |
|---|---|---|
| Detection | detect(cats) |
Locate all the instances that matches: cat1\x3C/c>cat2. |
| Single instance | ground_single(p) |
Locate a single instance that matches: phrase. |
| Multi instance | ground_multi(p) |
Locate all instances that match: phrase. |
| OCR | detect_text() |
Detect all the text in box format. |
| Text grounding | ground_text(p) |
Please locate the text referred as phrase. |
| GUI box | ground_gui(p) |
Locate the region that matches: element. |
| GUI point | ground_gui(p,pt) |
Point to: element. |
| Point grounding | point(p) |
Point to: target. |
Plain text: omit image field. Multi-image: image_list + \x3Cimage-1> \x3Cimage-2>.
Training
torchrun --nproc_per_node=8 eaglevl/train/locany_finetune_magi_stream.py \
--model_name_or_path nvidia/LocateAnything-3B \
--meta_path "./recipe.json" --output_dir work_dirs/sft \
--max_steps 25000 --lr 2e-5 --bf16 True --block_size 6 \
--attn_implementation magi --max_seq_length 16384 --max_num_tokens 25600 \
--deepspeed deepspeed_configs/zero_stage2_config.json
Key Parameters
| Parameter | Description |
|---|---|
--block_size |
MTP chunk size (default 4), use --causal_attn False during training |
--attn_implementation |
magi (Hopper/Blackwell 32K+) or sdpa (any GPU ~4K) |
--freeze_llm/backbone/mlp |
Freeze corresponding modules |
--max_num_tokens |
Token budget per batch (recommend 2-3× max_num_tokens_per_sample) |
--packing_buffer_size |
Online packing buffer (default 32, 64-128 for higher efficiency) |
Non-Hopper GPU: --attn_implementation sdpa --max_seq_length 4096. OOM: --grad_checkpoint True + reduce --max_num_tokens.
Streaming Packing: Best-Fit + Big-Rocks-First algorithm, checkpoint resume bit-identical. DeepSpeed recommended zero_stage2.
Evaluation
# COCO / LVIS
bash evaluation/scripts/eval_coco.sh --model_path ... --test_jsonl ... --coco_json ... --output_dir ...
bash evaluation/scripts/eval_lvis.sh --model_path ... --test_jsonl ... --lvis_json ... --output_dir ...
# General grounding (Dense200/DocLayNet/HumanRef/RefCOCOg/VisDrone etc.)
bash evaluation/scripts/eval_grounding.sh --dataset Dense200 --eval_type box_eval ...
# Point evaluation / ScreenSpot-Pro
bash evaluation/scripts/eval_grounding.sh --dataset COCO --eval_type point_eval ...
bash evaluation/scripts/eval_sspro.sh --model_path ... --test_jsonl ... --output_dir ...
Requires Rex-Omni fastevaluate + data Mountchicken/Rex-Omni-EvalData likaixin/ScreenSpot-Pro.
Key Results
| Benchmark | Score | Comparison |
|---|---|---|
| LVIS F1@Mean | 50.7 | +3.8 vs Rex-Omni |
| COCO F1@Mean | 54.7 | +1.8 vs Rex-Omni |
| M6Doc F1@Mean | 70.1 | +14.5 vs Rex-Omni |
| ScreenSpot-Pro Avg | 60.3 | SOTA |
| RefCOCOg val F1@Mean | 76.7 | SOTA |
| Pointing (7 benchmarks) | — | Best on all |
| PBD dense scenes | 2-6× faster vs NTP |
Model Info
- Name:
nvidia/LocateAnything-3B| LLM: Qwen2.5-3B | Vision: MoonViT-SO-400M | Length: 25K - HF: https://huggingface.co/nvidia/LocateAnything-3B
- Demo: https://huggingface.co/spaces/nvidia/LocateAnything
- Paper: arXiv:2605.27365
License
Code Apache 2.0 | Model NVIDIA License (non-commercial research)
References
- GitHub: https://github.com/NVlabs/Eagle
- Project page: https://nvlabs.github.io/Eagle/
- 确保已安装 OpenClaw(本地或 Docker 部署)
- 在对话框中输入安装命令:
/install locateanything - 安装完成后,直接呼叫该 Skill 的名称或使用
/locateanything触发 - 根据 Skill 的参数说明提供必要输入,即可获得结构化输出
NVIDIA LocateAnything-3B vision-language grounding model 是什么?
NVIDIA LocateAnything-3B vision-language grounding model. Covers inference API (detect/ground/point/detect_text/ground_gui), data preparation (JSONL+Recipe 8... 它是一个面向 Claude Code / OpenClaw 的 AI Agent Skill 插件,目前累计下载 16 次。
如何安装 NVIDIA LocateAnything-3B vision-language grounding model?
在 OpenClaw 或 Claude Code 对话框中运行命令「/install locateanything」即可一键安装,无需额外配置。
NVIDIA LocateAnything-3B vision-language grounding model 是免费的吗?
是的,NVIDIA LocateAnything-3B vision-language grounding model 完全免费,采用 MIT-0 许可证,可自由下载、安装和使用。
NVIDIA LocateAnything-3B vision-language grounding model 支持哪些平台?
NVIDIA LocateAnything-3B vision-language grounding model 跨平台运行,可在任意部署了 OpenClaw / Claude Code 的环境中使用(cross-platform)。
谁开发了 NVIDIA LocateAnything-3B vision-language grounding model?
由 OpenLark(@openlark)开发并维护,当前版本 v1.0.0。