
Dataset Search

by Carlos Delfino · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ✓ Security Clean
Downloads: 43 · Stars: 0 · Active Installs: 0 · Versions: 1
Install in OpenClaw
/install dataset-search
Description
Find, compare, and obtain datasets or data lakes across ML repositories, cloud public data registries, government portals, scientific archives, geospatial/cl...
README (SKILL.md)

Dataset Search

Use this skill when the user needs a dataset, benchmark, public data lake, open-data portal, or data source for analysis, ML, BI, RAG, geospatial work, climate, NLP, multimodal projects, or data engineering.

Workflow

  1. Convert the user request into a dataset brief:
    • research question or analysis goal
    • domain and task type: classification, forecasting, geospatial, NLP, BI, econometrics, image-text, logs, etc.
    • geography, language, period, granularity, expected scale, file format, license, access constraints
    • must-have fields, acceptable proxies, and sources to prefer or avoid
  2. Start broad with the bundled script:
python3 skills/dataset-search/scripts/dataset_search.py search "solar radiation hourly Brazil agriculture" --profile climate --region BR --limit 8 --format markdown
  3. Narrow by source when the best family is clear:
python3 skills/dataset-search/scripts/dataset_search.py search "credit card fraud transactions" --source kaggle,huggingface,openml --limit 10
python3 skills/dataset-search/scripts/dataset_search.py search "sentinel crop classification" --profile geospatial --source aws-open-data,copernicus,huggingface
  4. Compare candidates by relevance, provenance, license, update cadence, schema/metadata quality, access method, size, and whether the source supports direct download or cloud-native querying.
  5. For huge cloud data lakes, prefer native access paths such as S3, BigQuery, Delta Sharing, Spark, Athena, Databricks, or cloud storage instead of downloading everything locally.
  6. Before downloading gated, paid, huge, sensitive, or license-restricted data, summarize the source, expected size, license, and access requirements for the user.
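
The comparison step can be made explicit with a simple weighted score. The following is a minimal sketch, not part of the bundled script: the `Candidate` class, the `WEIGHTS` table, and the per-criterion scores in [0, 1] are all hypothetical and would need tuning for a real project.

```python
from dataclasses import dataclass

# Hypothetical weights for the comparison criteria; adjust per project.
WEIGHTS = {
    "relevance": 3.0,
    "provenance": 2.0,
    "license": 2.0,
    "update_cadence": 1.0,
    "metadata_quality": 1.0,
    "access": 1.0,
}


@dataclass
class Candidate:
    """One dataset candidate with per-criterion scores in [0, 1] (illustrative)."""
    title: str
    source: str
    scores: dict


def weighted_score(candidate):
    """Weighted average over WEIGHTS; a criterion without a score counts as 0."""
    total = sum(WEIGHTS.values())
    hits = sum(w * candidate.scores.get(name, 0.0) for name, w in WEIGHTS.items())
    return hits / total


def rank(candidates):
    """Return the candidates sorted best-first."""
    return sorted(candidates, key=weighted_score, reverse=True)


if __name__ == "__main__":
    # Made-up candidates, only to show the call pattern.
    demo = [
        Candidate("candidate-a", "openml", {"relevance": 1.0, "license": 1.0}),
        Candidate("candidate-b", "kaggle", {"relevance": 0.5}),
    ]
    for c in rank(demo):
        print(f"{weighted_score(c):.2f}  {c.title} ({c.source})")
```

A score like this only orders candidates. License and access checks still have to pass before anything is recommended.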

Bundled Script

scripts/dataset_search.py is a standard-library Python helper. It queries direct APIs/CLIs where practical and emits resilient fallback search links for sources without a stable public search API.

Common commands:

python3 skills/dataset-search/scripts/dataset_search.py sources
python3 skills/dataset-search/scripts/dataset_search.py search "income inequality Brazil time series" --profile economics --format json --output /tmp/dataset-results.json
python3 skills/dataset-search/scripts/dataset_search.py search "large multilingual instruction dataset" --profile nlp --offline --format markdown
python3 skills/dataset-search/scripts/dataset_search.py download --from-results /tmp/dataset-results.json --index 0 --output-dir /tmp/datasets
python3 skills/dataset-search/scripts/dataset_search.py download --from-results /tmp/dataset-results.json --index 0 --output-dir /tmp/datasets --yes
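
When the script is driven from other Python code, it may be convenient to assemble the argument list programmatically. The sketch below uses only flags that appear in this README; `build_search_command` is a hypothetical helper, not something the skill ships.

```python
import shlex

# Script path as written in this README.
SCRIPT = "skills/dataset-search/scripts/dataset_search.py"


def build_search_command(query, profile=None, region=None, sources=None,
                         limit=None, fmt=None, output=None, offline=False):
    """Build an argv list for the `search` subcommand from documented flags."""
    argv = ["python3", SCRIPT, "search", query]
    if profile:
        argv += ["--profile", profile]
    if region:
        argv += ["--region", region]
    if sources:
        argv += ["--source", ",".join(sources)]
    if limit is not None:
        argv += ["--limit", str(limit)]
    if fmt:
        argv += ["--format", fmt]
    if output:
        argv += ["--output", output]
    if offline:
        argv.append("--offline")
    return argv


if __name__ == "__main__":
    argv = build_search_command(
        "credit card fraud transactions",
        sources=["kaggle", "huggingface", "openml"],
        limit=10,
    )
    # Print a shell-safe rendering; pass argv to subprocess.run to execute it.
    print(shlex.join(argv))
```

Passing a list to `subprocess.run(argv, check=True)` instead of a shell string avoids quoting problems with multi-word queries.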

By default, the download command only prints an acquisition plan. It executes downloads or source CLIs only when --yes is passed.
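
The plan-first behavior described above is a common dry-run pattern. The following sketch illustrates that pattern with argparse. It is an assumption about how such gating can be written, not the bundled script's actual code, and `plan_steps` and `execute` are hypothetical names.

```python
import argparse


def plan_steps(url, output_dir):
    """Describe what a download would do, without doing it."""
    return [f"create {output_dir} if missing", f"download {url} into {output_dir}"]


def execute(steps):
    # Placeholder: a real tool would perform each step here.
    for step in steps:
        print("would now run:", step)


def main(argv=None):
    parser = argparse.ArgumentParser(description="Dry-run by default (illustrative).")
    parser.add_argument("url")
    parser.add_argument("--output-dir", default="/tmp/datasets")
    parser.add_argument("--yes", action="store_true", help="execute the plan")
    args = parser.parse_args(argv)

    steps = plan_steps(args.url, args.output_dir)
    if not args.yes:
        # Default path: show the plan and stop.
        print("Acquisition plan (nothing executed; re-run with --yes):")
        for step in steps:
            print("  -", step)
        return "planned"
    execute(steps)
    return "executed"


if __name__ == "__main__":
    main(["https://example.org/data.csv"])
```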

Useful options:

  • --profile: general, ml, nlp, geospatial, climate, economics, government, brazil, biomed, multimodal, cloud
  • --region: country, state, language, or geographic hint, such as BR, EU, US, Ceara, Portuguese
  • --source: comma-separated source ids, or all
  • --brief: JSON file with structured fields such as question, domain, task, geography, period, format, license, must_have, avoid, preferred_sources
  • --offline: do not call the network; return source-specific search URLs and acquisition guidance
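
A brief file for --brief could be produced as follows. This is a sketch under stated assumptions: the field names come from the option description above, but the value types (strings versus lists) and the sample values are guesses, and `write_brief` is a hypothetical helper. Check the script source for the schema it actually expects.

```python
import json
import tempfile
from pathlib import Path

# Field names listed for --brief in this README.
BRIEF_FIELDS = (
    "question", "domain", "task", "geography", "period",
    "format", "license", "must_have", "avoid", "preferred_sources",
)


def write_brief(brief, path):
    """Write the brief as JSON, rejecting keys the README does not list."""
    unknown = sorted(set(brief) - set(BRIEF_FIELDS))
    if unknown:
        raise ValueError(f"unexpected brief fields: {unknown}")
    path = Path(path)
    path.write_text(json.dumps(brief, indent=2, ensure_ascii=False), encoding="utf-8")
    return path


if __name__ == "__main__":
    # Illustrative values only; value types are assumptions.
    brief = {
        "question": "solar radiation hourly Brazil agriculture",
        "domain": "climate",
        "task": "forecasting",
        "geography": "BR",
        "format": "csv",
        "must_have": ["timestamp", "latitude", "longitude"],
        "preferred_sources": ["copernicus", "aws-open-data"],
    }
    out = write_brief(brief, Path(tempfile.mkdtemp()) / "brief.json")
    print(out.read_text(encoding="utf-8"))
```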

Source Coverage

The script has direct adapters for Hugging Face Datasets, Kaggle CLI, OpenML, UCI when its API is reachable, Zenodo, Figshare, data.gov CKAN, NASA/CDC Socrata catalogs, Harvard Dataverse, GBIF, and generic CKAN-style portals when configured in the script.

It also produces guided search/acquisition entries for AWS Registry of Open Data, Google Cloud Public Datasets, Azure Open Datasets, Databricks Marketplace, Snowflake Marketplace, World Bank Open Data, data.europa.eu, IBGE, dados.gov.br, Eurostat, UN Data, WHO GHO, FRED, IMF, Our World in Data, CERN Open Data, NOAA, Copernicus, NASA POWER, NASA Earthdata, USGS, OpenStreetMap/Geofabrik, OpenAQ, Google Dataset Search, DataHub, data.world, Dryad, Mendeley Data, OpenAIRE, Awesome Public Datasets, Common Crawl, The Pile/EleutherAI, LAION, Nasdaq Data Link, and other registry-style sources.

Source-Specific Notes

  • Kaggle downloads require the kaggle CLI and local credentials.
  • Hugging Face downloads work best with huggingface-cli; gated datasets require authentication and acceptance of the dataset terms.
  • Databricks Marketplace, Google Cloud, Azure, and many enterprise catalogs often require browser/account access; use the script output as a discovery plan, then use the available browser, web, SQL, Spark, or cloud tools.
  • Government and CKAN portals vary in metadata quality. Prefer resources with explicit formats, update dates, dictionaries, and licenses.
  • Scientific repositories often expose DOI metadata and file URLs, but the files may be large or numerous. Inspect before downloading all files.
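
Because some sources depend on external command-line tools, it can help to check for them before planning a download. The sketch below looks up the two executables named in the notes above with `shutil.which`. The `REQUIRED_CLIS` mapping and `available_clis` function are hypothetical, and the check says nothing about whether credentials are configured.

```python
import shutil

# Source id -> executable name, as named in the notes above.
REQUIRED_CLIS = {"kaggle": "kaggle", "huggingface": "huggingface-cli"}


def available_clis(which=shutil.which):
    """Map each source id to True if its CLI is found on PATH."""
    return {source: which(exe) is not None for source, exe in REQUIRED_CLIS.items()}


if __name__ == "__main__":
    for source, found in available_clis().items():
        print(f"{source}: {'found' if found else 'missing'}")
```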

Response Rules

  • Return the top candidates with source, title, URL, why it matches, access method, license, format, and caveats.
  • Distinguish confirmed API results from generated search links.
  • If the script reports network failures, say which sources failed and whether the result is a fallback.
  • Do not claim a dataset is usable until license/access and minimum schema suitability are checked.
  • For cloud data lakes, include query/access examples instead of promising a local file.
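
One way to keep these rules visible in the output is to label every candidate by how it was obtained. The following sketch assumes a hypothetical result dictionary with a boolean `confirmed` key; the bundled script's real result schema is not documented here and may differ.

```python
# Fields the response rules ask for, beyond source, title, and URL.
DETAIL_FIELDS = ("why", "access", "license", "format", "caveats")


def format_candidate(candidate):
    """Render one candidate, marking confirmed API results vs. generated links."""
    if candidate.get("confirmed"):
        label = "confirmed API result"
    else:
        label = "generated search link (fallback)"
    lines = [
        f"{candidate['title']} ({candidate['source']}) [{label}]",
        f"  URL: {candidate['url']}",
    ]
    for field in DETAIL_FIELDS:
        # Missing details are flagged rather than silently omitted.
        lines.append(f"  {field}: {candidate.get(field, 'unknown, verify before use')}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up entry, only to show the output shape.
    print(format_candidate({
        "title": "example-dataset",
        "source": "openml",
        "url": "https://example.org/dataset",
        "confirmed": False,
    }))
```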
Usage Guidance
Install if you want a dataset search/acquisition helper that contacts public data services. Use --offline for private queries, inspect the dry-run plan before adding --yes, and avoid raw --url downloads or result files from untrusted sources, especially when local Kaggle or Hugging Face credentials are configured.
Capability Tags
requires-sensitive-credentials
Capability Assessment
Purpose & Capability
The stated purpose is to find, compare, and obtain datasets, and the script's API searches, generated acquisition plans, optional downloads, and Kaggle/Hugging Face CLI use fit that purpose.
Instruction Scope
The activation text is broad for data-source tasks, but it remains centered on dataset discovery/acquisition; download execution is a separate command gated by --yes.
Install Mechanism
Install metadata only declares python3, with no package installation hooks or background setup; static scan and VirusTotal telemetry were clean.
Credentials
Network queries, optional filesystem writes, and optional use of local Kaggle or Hugging Face credentials are proportionate for obtaining datasets, but users should expect those side effects.
Persistence & Privilege
No persistence, privilege escalation, or background worker behavior was found; the script can write result files or downloaded data only when invoked with output paths/download commands.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install dataset-search
  3. After installation, invoke the skill by name or use /dataset-search
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
Initial release of dataset-search: a skill for discovering datasets across major public repositories and catalogs.
  • Search, compare, and obtain datasets for ML, analytics, and data science across cloud, open, and government sources.
  • Bundled Python script supports broad and source-targeted searches, download plans, and direct adapters for leading repositories.
  • Covers API-driven, registry, and catalog sources, falling back to guided search links as needed.
  • Provides workflow, guidance, and response rules for dataset suitability, licensing, and access constraints.
  • Designed for flexible discovery: ML, NLP, geospatial, climate, economics, government, multimodal, and cloud-native data projects.
Metadata
Slug dataset-search
Version 1.0.0
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is Dataset Search?

Dataset Search is an AI Agent Skill for Claude Code / OpenClaw, with 43 downloads so far. Its listing describes it as: Find, compare, and obtain datasets or data lakes across ML repositories, cloud public data registries, government portals, scientific archives, geospatial/cl...

How do I install Dataset Search?

Run "/install dataset-search" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Dataset Search free?

Yes, Dataset Search is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Dataset Search support?

Dataset Search is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Dataset Search?

It is built and maintained by Carlos Delfino (@carlosdelfino); the current version is v1.0.0.
