carlosdelfino

Dataset Search

Author: Carlos Delfino · GitHub · v1.0.0 · MIT-0
cross-platform ✓ Security check passed
Total downloads: 43
Favorites: 0
Current installs: 0
Versions: 1
Install in OpenClaw
/install dataset-search
Description
Find, compare, and obtain datasets or data lakes across ML repositories, cloud public data registries, government portals, scientific archives, geospatial/cl...
Usage (SKILL.md)

Dataset Search

Use this skill when the user needs a dataset, benchmark, public data lake, open-data portal, or data source for analysis, ML, BI, RAG, geospatial work, climate, NLP, multimodal projects, or data engineering.

Workflow

  1. Convert the user request into a dataset brief:
    • research question or analysis goal
    • domain and task type: classification, forecasting, geospatial, NLP, BI, econometrics, image-text, logs, etc.
    • geography, language, period, granularity, expected scale, file format, license, access constraints
    • must-have fields, acceptable proxies, and sources to prefer or avoid
  2. Start broad with the bundled script:
python3 skills/dataset-search/scripts/dataset_search.py search "solar radiation hourly Brazil agriculture" --profile climate --region BR --limit 8 --format markdown
  3. Narrow by source when the best family is clear:
python3 skills/dataset-search/scripts/dataset_search.py search "credit card fraud transactions" --source kaggle,huggingface,openml --limit 10
python3 skills/dataset-search/scripts/dataset_search.py search "sentinel crop classification" --profile geospatial --source aws-open-data,copernicus,huggingface
  4. Compare candidates by relevance, provenance, license, update cadence, schema/metadata quality, access method, size, and whether the source supports direct download or cloud-native querying.
  5. For huge cloud data lakes, prefer native access paths such as S3, BigQuery, Delta Sharing, Spark, Athena, Databricks, or cloud storage instead of downloading everything locally.
  6. Before downloading gated, paid, huge, sensitive, or license-restricted data, summarize the source, expected size, license, and access requirements for the user.
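
The comparison step above can be made explicit with a simple weighted score. The sketch below is illustrative only: the `Candidate` class, the 0-to-1 ratings, and the weights are assumptions chosen for the example, not something the bundled script is known to produce.

```python
from dataclasses import dataclass

# Hypothetical weights for some of the comparison criteria named in the workflow.
WEIGHTS = {
    "relevance": 0.35,
    "provenance": 0.20,
    "license": 0.20,
    "metadata": 0.15,
    "access": 0.10,
}


@dataclass
class Candidate:
    title: str
    scores: dict  # criterion name -> rating in [0, 1], assigned by the reviewer


def weighted_score(candidate: Candidate) -> float:
    """Sum weight * rating; criteria without a rating count as 0."""
    return sum(w * candidate.scores.get(name, 0.0) for name, w in WEIGHTS.items())


def rank(candidates):
    """Return candidates sorted best-first by weighted score."""
    return sorted(candidates, key=weighted_score, reverse=True)


if __name__ == "__main__":
    pool = [
        Candidate("Portal CSV dump", {"relevance": 0.5, "provenance": 1.0}),
        Candidate("Curated benchmark", {"relevance": 1.0, "license": 1.0}),
    ]
    for c in rank(pool):
        print(f"{weighted_score(c):.3f}  {c.title}")
```

A score like this only orders candidates; license and access checks still have to pass before any dataset is recommended.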

Bundled Script

scripts/dataset_search.py is a standard-library Python helper. It queries direct APIs/CLIs where practical and emits resilient fallback search links for sources without a stable public search API.

Common commands:

python3 skills/dataset-search/scripts/dataset_search.py sources
python3 skills/dataset-search/scripts/dataset_search.py search "income inequality Brazil time series" --profile economics --format json --output /tmp/dataset-results.json
python3 skills/dataset-search/scripts/dataset_search.py search "large multilingual instruction dataset" --profile nlp --offline --format markdown
python3 skills/dataset-search/scripts/dataset_search.py download --from-results /tmp/dataset-results.json --index 0 --output-dir /tmp/datasets
python3 skills/dataset-search/scripts/dataset_search.py download --from-results /tmp/dataset-results.json --index 0 --output-dir /tmp/datasets --yes

By default, download prints a safe acquisition plan. It executes downloads or source CLIs only when --yes is passed.
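
The plan-first behaviour can be pictured as follows. This is a minimal sketch of the general pattern, not the script's actual code; `build_plan`, `run_download`, the plan fields, and the stubbed fetch step are all hypothetical.

```python
import argparse


def build_plan(url: str, output_dir: str) -> dict:
    """Describe what would be fetched, without touching the network."""
    return {"url": url, "output_dir": output_dir, "action": "download"}


def run_download(plan: dict, confirmed: bool, fetch=None) -> str:
    """Execute the plan only when confirmed; otherwise just report it."""
    if not confirmed:
        return f"PLAN ONLY: would download {plan['url']} to {plan['output_dir']}"
    if fetch is None:
        raise ValueError("a fetch callable is required to execute the plan")
    fetch(plan["url"], plan["output_dir"])
    return f"DOWNLOADED: {plan['url']}"


def main(argv=None) -> str:
    parser = argparse.ArgumentParser(description="Dry-run-by-default download sketch")
    parser.add_argument("url")
    parser.add_argument("--output-dir", default="/tmp/datasets")
    parser.add_argument("--yes", action="store_true", help="actually execute the plan")
    args = parser.parse_args(argv)
    plan = build_plan(args.url, args.output_dir)
    # The fetch step is a no-op stub here; a real tool would call urllib or a source CLI.
    return run_download(plan, args.yes, fetch=lambda url, dest: None)
```

Keeping the plan as plain data makes it easy to show the user the source and destination before anything is written to disk.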

Useful options:

  • --profile: general, ml, nlp, geospatial, climate, economics, government, brazil, biomed, multimodal, cloud
  • --region: country, state, language, or geographic hint, such as BR, EU, US, Ceara, Portuguese
  • --source: comma-separated source ids, or all
  • --brief: JSON file with structured fields such as question, domain, task, geography, period, format, license, must_have, avoid, preferred_sources
  • --offline: do not call the network; return source-specific search URLs and acquisition guidance
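
A brief file for --brief might be produced as in the sketch below. The field names come from the option description above; the value types (strings versus lists), the example values, and the output file name are assumptions, since the exact schema is not documented here.

```python
import json
import tempfile
from pathlib import Path

# Field names as listed for --brief; every value is an illustrative assumption.
brief = {
    "question": "How does hourly solar radiation affect crop yield?",
    "domain": "climate",
    "task": "forecasting",
    "geography": "BR",
    "period": "2015-2023",
    "format": "csv",
    "license": "open",
    "must_have": ["timestamp", "latitude", "longitude"],
    "avoid": [],
    "preferred_sources": ["copernicus"],
}


def write_brief(data: dict, path: Path) -> Path:
    """Serialize the brief as UTF-8 JSON and return the path written."""
    path.write_text(json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8")
    return path


brief_path = write_brief(brief, Path(tempfile.gettempdir()) / "dataset-brief.json")
print(brief_path)
```

The resulting path would then be handed to the script through --brief, presumably alongside the search command.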

Source Coverage

The script has direct adapters for Hugging Face Datasets, Kaggle CLI, OpenML, UCI when its API is reachable, Zenodo, Figshare, data.gov CKAN, NASA/CDC Socrata catalogs, Harvard Dataverse, GBIF, and generic CKAN-style portals when configured in the script.
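
As an illustration of what a direct adapter for a CKAN portal could look like, the sketch below builds a `package_search` request against the CKAN Action API and reduces the response to a few fields. It is not taken from the bundled script; the base URL and the helper names are assumptions.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL for the data.gov CKAN catalog; other CKAN portals use their own host.
CKAN_BASE = "https://catalog.data.gov/api/3/action/package_search"


def build_ckan_url(query: str, rows: int = 5, base: str = CKAN_BASE) -> str:
    """Build a package_search URL with the free-text query and a result cap."""
    return f"{base}?{urlencode({'q': query, 'rows': rows})}"


def parse_ckan_response(payload: dict) -> list:
    """Extract title, license, and resource formats from a package_search payload."""
    if not payload.get("success"):
        return []
    results = []
    for pkg in payload.get("result", {}).get("results", []):
        formats = {r.get("format") for r in pkg.get("resources", []) if r.get("format")}
        results.append({
            "title": pkg.get("title"),
            "license": pkg.get("license_title"),
            "formats": sorted(formats),
        })
    return results


def search_ckan(query: str, rows: int = 5, timeout: float = 10.0) -> list:
    """Perform the network call; raises on connection or HTTP errors."""
    with urlopen(build_ckan_url(query, rows), timeout=timeout) as resp:
        return parse_ckan_response(json.load(resp))
```

Splitting URL construction and parsing from the network call keeps the adapter testable offline and makes it easy to report which source failed.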

It also produces guided search/acquisition entries for AWS Registry of Open Data, Google Cloud Public Datasets, Azure Open Datasets, Databricks Marketplace, Snowflake Marketplace, World Bank Open Data, data.europa.eu, IBGE, dados.gov.br, Eurostat, UN Data, WHO GHO, FRED, IMF, Our World in Data, CERN Open Data, NOAA, Copernicus, NASA POWER, NASA Earthdata, USGS, OpenStreetMap/Geofabrik, OpenAQ, Google Dataset Search, DataHub, data.world, Dryad, Mendeley Data, OpenAIRE, Awesome Public Datasets, Common Crawl, The Pile/EleutherAI, LAION, Nasdaq Data Link, and other registry-style sources.
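
Guided search entries of this kind can be generated without any network access. The following sketch shows one way to do it; the URL templates, the `google-dataset-search` id, and the `confirmed` flag are assumptions for illustration, and each site's real search URL format may differ or change.

```python
from urllib.parse import quote_plus

# Hypothetical templates keyed by source id; not taken from the bundled script.
SEARCH_TEMPLATES = {
    "kaggle": "https://www.kaggle.com/datasets?search={q}",
    "huggingface": "https://huggingface.co/datasets?search={q}",
    "google-dataset-search": "https://datasetsearch.research.google.com/search?query={q}",
}


def fallback_links(query: str, sources=None) -> list:
    """Return generated search links, marked as unconfirmed, for the chosen sources."""
    chosen = sources or sorted(SEARCH_TEMPLATES)
    links = []
    for source in chosen:
        template = SEARCH_TEMPLATES.get(source)
        if template is None:
            continue  # unknown source id: skip rather than guess a URL
        links.append({
            "source": source,
            "url": template.format(q=quote_plus(query)),
            "confirmed": False,  # a generated link, not a verified API result
        })
    return links


if __name__ == "__main__":
    for link in fallback_links("sentinel crop classification"):
        print(link["source"], link["url"])
```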

Source-Specific Notes

  • Kaggle downloads require the kaggle CLI and local credentials.
  • Hugging Face downloads work best with huggingface-cli; gated datasets require authentication and acceptance of the dataset terms.
  • Databricks Marketplace, Google Cloud, Azure, and many enterprise catalogs often require browser/account access; use the script output as a discovery plan, then use the available browser, web, SQL, Spark, or cloud tools.
  • Government and CKAN portals vary in metadata quality. Prefer resources with explicit formats, update dates, dictionaries, and licenses.
  • Scientific repositories often expose DOI metadata and file URLs, but the files may be large or numerous. Inspect before downloading all files.

Response Rules

  • Return the top candidates with source, title, URL, why it matches, access method, license, format, and caveats.
  • Distinguish confirmed API results from generated search links.
  • If the script reports network failures, say which sources failed and whether the result is a fallback.
  • Do not claim a dataset is usable until license/access and minimum schema suitability are checked.
  • For cloud data lakes, include query/access examples instead of promising a local file.
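
One way to keep these rules visible in a response is to render every candidate through a single formatter. The sketch below is a hypothetical helper, not part of the skill; the `Result` fields mirror the first rule, and the `confirmed` flag is an assumed way of separating API results from generated links.

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    source: str
    title: str
    url: str
    why: str
    access: str
    license: str = "unknown"
    fmt: str = "unknown"
    caveats: list = field(default_factory=list)
    confirmed: bool = False  # True only for results returned by a live API call


def render(result: Result) -> str:
    """Format one candidate, flagging unverified licenses and generated links."""
    origin = "confirmed API result" if result.confirmed else "generated search link"
    caveats = list(result.caveats)  # copy so the input object is not mutated
    if result.license == "unknown":
        caveats.append("license not verified; do not treat as usable yet")
    lines = [
        f"{result.title} [{result.source}; {origin}]",
        f"  URL: {result.url}",
        f"  Why it matches: {result.why}",
        f"  Access: {result.access}",
        f"  License: {result.license}",
        f"  Format: {result.fmt}",
        f"  Caveats: {'; '.join(caveats) if caveats else 'none noted'}",
    ]
    return "\n".join(lines)
```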
Safety Recommendations
Install if you want a dataset search/acquisition helper that contacts public data services. Use --offline for private queries, inspect the dry-run plan before adding --yes, and avoid raw --url downloads or result files from untrusted sources, especially when local Kaggle or Hugging Face credentials are configured.
Capability Tags
requires-sensitive-credentials
Capability Assessment
Purpose & Capability
The stated purpose is to find, compare, and obtain datasets, and the script's API searches, generated acquisition plans, optional downloads, and Kaggle/Hugging Face CLI use fit that purpose.
Instruction Scope
The activation text is broad for data-source tasks, but it remains centered on dataset discovery/acquisition; download execution is a separate command gated by --yes.
Install Mechanism
Install metadata only declares python3, with no package installation hooks or background setup; static scan and VirusTotal telemetry were clean.
Credentials
Network queries, optional filesystem writes, and optional use of local Kaggle or Hugging Face credentials are proportionate for obtaining datasets, but users should expect those side effects.
Persistence & Privilege
No persistence, privilege escalation, or background worker behavior was found; the script can write result files or downloaded data only when invoked with output paths/download commands.
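How to Use
  1. Make sure OpenClaw is installed (locally or via Docker).
  2. Enter the install command in the chat box: /install dataset-search
  3. Once installation finishes, invoke the Skill by name or trigger it with /dataset-search.
  4. Provide the inputs described in the Skill's parameter notes to get structured output.
Version History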
v1.0.0
Initial release of dataset-search: a skill for discovering datasets across major public repositories and catalogs.
  • Search, compare, and obtain datasets for ML, analytics, and data science across cloud, open, and government sources.
  • Bundled Python script supports broad and source-targeted searches, download plans, and direct adapters for leading repositories.
  • Covers API-driven, registry, and catalog sources, falling back to guided search links as needed.
  • Provides workflow, guidance, and response rules for dataset suitability, licensing, and access constraints.
  • Designed for flexible discovery: ML, NLP, geospatial, climate, economics, government, multimodal, and cloud-native data projects.
Metadata
Slug: dataset-search
Version: 1.0.0
License: MIT-0
Total installs: 0
Current installs: 0
Versions: 1
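FAQ

What is Dataset Search?

Find, compare, and obtain datasets or data lakes across ML repositories, cloud public data registries, government portals, scientific archives, geospatial/cl... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 43 total downloads to date.

How do I install Dataset Search?

Run the command "/install dataset-search" in an OpenClaw or Claude Code chat box to install it in one step; no extra configuration is needed.

Is Dataset Search free?

Yes. Dataset Search is completely free and released under the MIT-0 license, so it can be downloaded, installed, and used freely.

Which platforms does Dataset Search support?

Dataset Search is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who develops Dataset Search?

It is developed and maintained by Carlos Delfino (@carlosdelfino). The current version is v1.0.0.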
