openlark

NVIDIA LocateAnything-3B vision-language grounding model

by OpenLark · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ✓ Security Clean
Install in OpenClaw
/install locateanything
Description
NVIDIA LocateAnything-3B vision-language grounding model. Covers inference API (detect/ground/point/detect_text/ground_gui), data preparation (JSONL+Recipe 8...
README (SKILL.md)

LocateAnything — Vision-Language Grounding

LocateAnything is a VLM in the NVIDIA Eagle family. It uses Parallel Box Decoding (PBD) to predict a complete set of box coordinates in a single parallel step. Reported throughput is 12.7 BPS on an H100, roughly 10× that of Qwen3-VL.

Architecture: MoonViT-SO-400M → MLP → Qwen2.5-3B → PBD

Installation

git clone https://github.com/NVlabs/Eagle eagle && cd eagle/Embodied
pip install -e .
# Optional MagiAttention (Hopper/Blackwell only, long sequences 32K+):
# git clone https://github.com/SandAI-org/MagiAttention.git && cd MagiAttention && git checkout v1.0.5
# git submodule update --init --recursive && pip install --no-build-isolation .

Inference API

from locateanything_worker import LocateAnythingWorker
from PIL import Image

worker = LocateAnythingWorker("nvidia/LocateAnything-3B")
img = Image.open("e.jpg").convert("RGB")

worker.detect(img, ["person", "car"])                    # Object detection
worker.ground_single(img, "the red car")                  # Single-instance grounding
worker.ground_multi(img, "people wearing hats")           # Multi-instance grounding
worker.detect_text(img)                                   # OCR
worker.ground_gui(img, "search button")                   # GUI box
worker.ground_gui(img, "search", output_type="point")     # GUI point
worker.point(img, "the traffic light")                    # Point grounding

# Low-level: worker.predict(image, question, generation_mode="hybrid")
# mode: fast(MTP) | slow(NTP) | hybrid(default)

Output Parsing

Box: <ref>label</ref><box><x1><y1><x2><y2></box>
Point: <box><x><y></box>
Empty: <box>none</box>

Coordinates are integers in the range [0,1000]. Divide by 1000 to get relative coordinates.

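# answer: the raw output string in the format above
w, h = img.size  # PIL reports (width, height)
boxes = LocateAnythingWorker.parse_boxes(answer, w, h)    # Pixel coordinates
points = LocateAnythingWorker.parse_points(answer, w, h)  # Pixel coordinates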
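
To make the box format and the [0,1000] scaling concrete, here is a minimal standalone sketch of what such parsing could look like. It is not the worker's own parse_boxes. The function name parse_boxes_sketch, the regular expressions, and the assumption that each <ref> is followed by one or more <box> groups are illustrative only.

```python
import re

# Assumed layout: one <ref>label</ref> followed by one or more <box>...</box> groups.
REF_RE = re.compile(r"<ref>(.*?)</ref>((?:\s*<box>.*?</box>)+)", re.S)
BOX_RE = re.compile(r"<box>(.*?)</box>", re.S)
NUM_RE = re.compile(r"<(\d+)>")


def parse_boxes_sketch(answer, width, height):
    """Return [(label, (x1, y1, x2, y2))] in pixels from a raw answer string."""
    results = []
    for label, blob in REF_RE.findall(answer):
        for inner in BOX_RE.findall(blob):
            nums = [int(n) for n in NUM_RE.findall(inner)]
            if len(nums) != 4:
                continue  # "<box>none</box>" or a two-value point, not a box
            x1, y1, x2, y2 = nums
            # Coordinates are integers in [0, 1000]; scale them to pixels.
            results.append((label, (x1 * width / 1000, y1 * height / 1000,
                                    x2 * width / 1000, y2 * height / 1000)))
    return results
```

For example, parse_boxes_sketch("<ref>car</ref><box><100><200><400><500></box>", 1000, 500) yields one entry labeled "car" with pixel box (100.0, 100.0, 400.0, 250.0).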

Data Preparation

JSONL (ShareGPT Format)

{"conversations":[{"from":"human","value":"Detect all objects in <image-1>."},{"from":"gpt","value":"<ref>car</ref><box><100><200><400><500></box>"}],"image":"train/00001.jpg"}
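
As an illustration of how a record like the one above could be produced from pixel-space annotations, the sketch below normalizes boxes to [0,1000] and serializes one JSONL line. The helper names to_norm and make_record are hypothetical, as are the rounding and clamping choices, the one <ref>/<box> pair per object, and the use of <box>none</box> as the target when there are no objects. Only the field names and tag format come from the example record.

```python
import json


def to_norm(value, size):
    """Map a pixel coordinate to an integer in [0, 1000] (rounding is an assumption)."""
    return max(0, min(1000, round(1000 * value / size)))


def make_record(image_path, width, height, objects,
                prompt="Detect all objects in <image-1>."):
    """Build one ShareGPT-style record; objects is [(label, (x1, y1, x2, y2))] in pixels."""
    parts = []
    for label, (x1, y1, x2, y2) in objects:
        coords = (to_norm(x1, width), to_norm(y1, height),
                  to_norm(x2, width), to_norm(y2, height))
        boxes = "".join(f"<{c}>" for c in coords)
        parts.append(f"<ref>{label}</ref><box>{boxes}</box>")
    # Assumption: an image with no objects uses the documented empty format.
    answer = "".join(parts) or "<box>none</box>"
    return {
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": answer},
        ],
        "image": image_path,
    }


record = make_record("train/00001.jpg", 640, 480, [("car", (64, 96, 256, 240))])
line = json.dumps(record)  # one line of the .jsonl file
```

With a 640×480 image, the pixel box (64, 96, 256, 240) maps to the same <100><200><400><500> target shown in the example record.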

Recipe JSON

{"my_data":{"annotation":["a.jsonl","b.jsonl"],"root":"/data/images/","repeat_time":1.0,"data_augment":true}}

repeat_time: values above 1 oversample the dataset, values below 1 downsample it, and 1.0 uses it as-is. Box coordinates must be normalized to [0,1000].
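
One plausible reading of repeat_time is sketched below: whole multiples repeat the full dataset, and any fractional remainder draws a random subset. How the training code actually resamples is not stated here, so the function name apply_repeat_time, the seeding, and the rounding are all assumptions.

```python
import random


def apply_repeat_time(samples, repeat_time, seed=0):
    """Return a resampled list whose length is about len(samples) * repeat_time."""
    if repeat_time < 0:
        raise ValueError("repeat_time must be non-negative")
    samples = list(samples)
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    whole = int(repeat_time)                 # number of full copies
    fraction = repeat_time - whole           # leftover part in [0, 1)
    extra = round(len(samples) * fraction)   # size of the random partial copy
    return samples * whole + rng.sample(samples, extra)
```

Under this reading, 10 samples with repeat_time 2.5 become 25, and with repeat_time 0.3 become 3.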

8 Task Prompts

| Task | Method | Prompt |
| --- | --- | --- |
| Detection | detect(cats) | Locate all the instances that matches: cat1</c>cat2. |
| Single instance | ground_single(p) | Locate a single instance that matches: phrase. |
| Multi instance | ground_multi(p) | Locate all instances that match: phrase. |
| OCR | detect_text() | Detect all the text in box format. |
| Text grounding | ground_text(p) | Please locate the text referred as phrase. |
| GUI box | ground_gui(p) | Locate the region that matches: element. |
| GUI point | ground_gui(p,pt) | Point to: element. |
| Point grounding | point(p) | Point to: target. |

Plain-text samples: omit the image field. Multi-image samples: use image_list and reference the images as <image-1>, <image-2>.
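
When preparing training data by hand, it can help to generate the eight prompts from one place. The sketch below copies the prompt strings from the table above, including their original wording. The dictionary keys (for example ground_gui_box) and the build_prompt helper are hypothetical, and whether the worker builds its prompts exactly this way, including the trailing period, is an assumption.

```python
# Prompt templates copied from the task table; keys are illustrative names.
PROMPTS = {
    "detect": "Locate all the instances that matches: {target}.",
    "ground_single": "Locate a single instance that matches: {target}.",
    "ground_multi": "Locate all instances that match: {target}.",
    "detect_text": "Detect all the text in box format.",
    "ground_text": "Please locate the text referred as {target}.",
    "ground_gui_box": "Locate the region that matches: {target}.",
    "ground_gui_point": "Point to: {target}.",
    "point": "Point to: {target}.",
}


def build_prompt(task, target=None):
    """Fill in a task prompt; a list of categories is joined with '</c>'."""
    template = PROMPTS[task]
    if "{target}" not in template:
        return template  # e.g. OCR needs no target
    if isinstance(target, (list, tuple)):
        target = "</c>".join(target)
    if not target:
        raise ValueError(f"task {task!r} needs a target")
    return template.format(target=target)
```

For example, build_prompt("detect", ["cat", "dog"]) gives "Locate all the instances that matches: cat</c>dog.".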

Training

torchrun --nproc_per_node=8 eaglevl/train/locany_finetune_magi_stream.py \
  --model_name_or_path nvidia/LocateAnything-3B \
  --meta_path "./recipe.json" --output_dir work_dirs/sft \
  --max_steps 25000 --lr 2e-5 --bf16 True --block_size 6 \
  --attn_implementation magi --max_seq_length 16384 --max_num_tokens 25600 \
  --deepspeed deepspeed_configs/zero_stage2_config.json

Key Parameters

| Parameter | Description |
| --- | --- |
| --block_size | MTP chunk size (default 4); use --causal_attn False during training |
| --attn_implementation | magi (Hopper/Blackwell, 32K+) or sdpa (any GPU, ~4K) |
| --freeze_llm/backbone/mlp | Freeze the corresponding modules |
| --max_num_tokens | Token budget per batch (recommended: 2-3× max_num_tokens_per_sample) |
| --packing_buffer_size | Online packing buffer (default 32; 64-128 for higher efficiency) |

On non-Hopper GPUs, use --attn_implementation sdpa --max_seq_length 4096. If you run out of memory, set --grad_checkpoint True and reduce --max_num_tokens.

Streaming packing uses a Best-Fit + Big-Rocks-First algorithm, and checkpoint resume is bit-identical. The recommended DeepSpeed configuration is zero_stage2.
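
To give a feel for what best-fit packing under a token budget means, the sketch below uses a generic best-fit-decreasing heuristic: the longest samples are placed first, each into the pack with the least remaining room that still fits. This is not the repository's streaming implementation. Reading "Big-Rocks-First" as "longest first" is an assumption, and pack_best_fit is a hypothetical name.

```python
def pack_best_fit(lengths, budget):
    """Group sample lengths into packs whose totals do not exceed budget."""
    packs = []  # each entry is [remaining_tokens, [sample_lengths]]
    for n in sorted(lengths, reverse=True):  # longest samples first
        if n > budget:
            raise ValueError(f"sample of {n} tokens exceeds budget {budget}")
        best = None
        for pack in packs:
            # Best fit: the pack with the least free space that still holds n.
            if pack[0] >= n and (best is None or pack[0] < best[0]):
                best = pack
        if best is None:
            best = [budget, []]  # open a new pack
            packs.append(best)
        best[0] -= n
        best[1].append(n)
    return [members for _, members in packs]
```

With lengths [6, 5, 4, 3, 2] and a budget of 10, this produces two full packs, [6, 4] and [5, 3, 2]. In the training command, --max_num_tokens plays the role of the budget.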

Evaluation

# COCO / LVIS
bash evaluation/scripts/eval_coco.sh --model_path ... --test_jsonl ... --coco_json ... --output_dir ...
bash evaluation/scripts/eval_lvis.sh --model_path ... --test_jsonl ... --lvis_json ... --output_dir ...

# General grounding (Dense200/DocLayNet/HumanRef/RefCOCOg/VisDrone etc.)
bash evaluation/scripts/eval_grounding.sh --dataset Dense200 --eval_type box_eval ...

# Point evaluation / ScreenSpot-Pro
bash evaluation/scripts/eval_grounding.sh --dataset COCO --eval_type point_eval ...
bash evaluation/scripts/eval_sspro.sh --model_path ... --test_jsonl ... --output_dir ...

Requires Rex-Omni's fastevaluate and the evaluation datasets Mountchicken/Rex-Omni-EvalData and likaixin/ScreenSpot-Pro.

Key Results

| Benchmark | Score | Comparison |
| --- | --- | --- |
| LVIS F1@Mean | 50.7 | +3.8 vs Rex-Omni |
| COCO F1@Mean | 54.7 | +1.8 vs Rex-Omni |
| M6Doc F1@Mean | 70.1 | +14.5 vs Rex-Omni |
| ScreenSpot-Pro Avg | 60.3 | SOTA |
| RefCOCOg val F1@Mean | 76.7 | SOTA |
| Pointing (7 benchmarks) | Best on all | |
| PBD dense scenes | 2-6× faster | vs NTP |

Model Info

License

Code Apache 2.0 | Model NVIDIA License (non-commercial research)

Usage Guidance
Install only if you are comfortable cloning and running the referenced NVIDIA/SandAI ML repositories. Use a virtual environment or container, review the upstream licenses, and note that the model is described as non-commercial research use.
Capability Assessment
Purpose & Capability
The stated purpose is vision-language grounding, OCR, GUI element recognition, training, and evaluation guidance; the content stays aligned with that purpose.
Instruction Scope
Instructions are explicit examples for cloning the upstream project, installing Python packages, running inference, preparing datasets, training, and evaluation; no prompt overrides or hidden agent instructions were found.
Install Mechanism
Installation involves cloning external GitHub repositories and running pip install commands, which is expected for this ML workflow but should be done in an isolated environment.
Credentials
GPU-heavy training and evaluation commands are proportionate to the model workflow and are disclosed; local data paths are examples rather than broad indexing instructions.
Persistence & Privilege
No background services, startup hooks, privilege escalation, credential/session access, destructive commands, or persistent agent behavior were present.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install locateanything
  3. After installation, invoke the skill by name or use /locateanything
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
- Initial release of the LocateAnything skill, built on NVIDIA's LocateAnything-3B vision-language grounding model.
- Supports core inference APIs for detection, grounding (single/multi-instance), point grounding, OCR, and GUI element recognition.
- Provides guidance on data preparation using the JSONL/Recipe format and covers 8 vision-language grounding tasks.
- Includes instructions for installation, training/fine-tuning, and evaluation on standard benchmarks.
- Offers output parsing utilities and documents model performance and architecture.
Metadata
Slug locateanything
Version 1.0.0
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is NVIDIA LocateAnything-3B vision-language grounding model?

NVIDIA LocateAnything-3B vision-language grounding model. Covers inference API (detect/ground/point/detect_text/ground_gui), data preparation (JSONL+Recipe 8... It is an AI Agent Skill for Claude Code / OpenClaw, with 16 downloads so far.

How do I install NVIDIA LocateAnything-3B vision-language grounding model?

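Run "/install locateanything" in the OpenClaw or Claude Code chat to install the skill. Running the model itself also requires cloning the upstream Eagle repository and installing its Python package, as described in the README.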

Is NVIDIA LocateAnything-3B vision-language grounding model free?

Yes, the skill itself is free and licensed under MIT-0. The upstream code is licensed under Apache 2.0, and the model is released under the NVIDIA License for non-commercial research, so review those terms before using the model.

Which platforms does NVIDIA LocateAnything-3B vision-language grounding model support?

The skill is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created NVIDIA LocateAnything-3B vision-language grounding model?

It is built and maintained by OpenLark (@openlark); the current version is v1.0.0.
