Biomed Dataset Finder
/install biomed-dataset-finder
Biomedical Dataset Finder
Search public biomedical datasets from NCBI, NGDC, and CNGB by conversational query keywords.
Usage Trigger
User asks for datasets related to a disease/treatment/species/subtype/data type combination. Examples:
- "Find colon cancer dMMR immunotherapy single-cell data"
- "hepatocellular carcinoma PD-1 scRNA-seq baseline"
- "lung cancer immunotherapy single cell data"
Data Sources (Priority Order)
| Priority | Source | Database | Accession Prefix |
|---|---|---|---|
| 1st | NCBI | GEO Datasets (gds) | GSE |
| 1st | NCBI | SRA (single-cell queries) | SRP/SRR |
| 1st | NGDC | Genome Sequence Archive | CRA |
| 2nd | CNGB | CNGBdb | CNP (requires token for some data) |
Workflow
Step 1 — Parse Query
Extract from user message:
- Disease/Cancer: e.g. colon cancer, hepatocellular carcinoma, lung cancer
- Treatment: e.g. immunotherapy, PD-1, chemotherapy, baseline therapy
- Species: human, mouse (defaults to human if unspecified)
- Pathology Subtype: e.g. dMMR, MSI-H, KRAS mutant
- Data Type: e.g. scRNA-seq, single-cell, RNA-seq, ChIP-seq, ATAC-seq
If any critical field is missing, ask the user to clarify.
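The parsed fields can be held in a small record, which makes the human default and the missing-field check explicit. This is a minimal sketch: the names `ParsedQuery`, `CRITICAL_FIELDS`, and `missing_critical` are hypothetical, and the choice of which fields count as critical is an assumption, since the skill does not list them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ParsedQuery:
    """Fields extracted from the user's message in Step 1."""
    disease: Optional[str] = None
    treatment: Optional[str] = None
    species: str = "human"  # defaults to human if unspecified
    subtype: Optional[str] = None
    data_type: Optional[str] = None


# Assumption: the source does not say which fields are critical.
CRITICAL_FIELDS = ("disease", "data_type")


def missing_critical(query: ParsedQuery) -> List[str]:
    """Return the names of critical fields that are still empty."""
    return [name for name in CRITICAL_FIELDS if not getattr(query, name)]


q = ParsedQuery(disease="colon cancer", treatment="immunotherapy", subtype="dMMR")
print(missing_critical(q))  # a non-empty list means the user should be asked to clarify
```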
Step 2 — NCBI Search (Primary)
Use NCBI E-utilities (free, no auth).
- Search the gds database (GEO Datasets, NOT gse) with combined keywords
- For each result, pull accession (GSE prefix), title, summary, and pubmedids (list)
- Fetch article info (authors, title, journal, year, DOI) for each PMID
- For single-cell queries, also search the sra database
Query: ({disease}) AND ({treatment}) AND ({species}) AND ({data_type})
Rate limit: ~3 requests/second.
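As a rough illustration of the query pattern and rate limit above, the sketch below builds the combined term, forms an ESearch URL, and spaces out calls. The helper names (`build_term`, `esearch_url`, `Throttle`) are hypothetical, and skipping empty fields when joining is an assumption. The bundled script may do this differently.

```python
import time
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_term(disease, treatment, species, data_type):
    """Combine the parsed fields using the AND pattern from Step 2."""
    parts = [disease, treatment, species, data_type]
    # Assumption: empty fields are dropped instead of producing "()".
    return " AND ".join(f"({p})" for p in parts if p)


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL; db is "gds" for GEO Datasets or "sra" for SRA."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


class Throttle:
    """Keep successive calls at least min_interval seconds apart."""

    def __init__(self, min_interval=0.34):  # roughly 3 requests/second
        self.min_interval = min_interval
        self._last = None

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


term = build_term("colon cancer", "immunotherapy", "human", "scRNA-seq")
print(term)
print(esearch_url("gds", term))
```

Calling `Throttle.wait()` before each request keeps a loop over GEO, SRA, and PubMed lookups under the limit without changing the request code itself.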
Step 3 — NGDC Search (Primary)
API: https://ngdc.cncb.ac.cn/search/api/specific?q={keywords}&db=gsa&size=20
Requires User-Agent header. Filter response for type=="GSA" entries (CRA accessions).
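A minimal sketch of this step, assuming the response can be reduced to a list of dictionaries that each carry a `type` key. The User-Agent value, the `accession` key, and the sample entries are hypothetical placeholders; the real response layout is described in references/ngdc_api.md.

```python
from urllib.parse import quote
from urllib.request import Request

NGDC_SEARCH = "https://ngdc.cncb.ac.cn/search/api/specific"


def ngdc_request(keywords, size=20):
    """Build the Step 3 request, including the required User-Agent header."""
    url = f"{NGDC_SEARCH}?q={quote(keywords)}&db=gsa&size={size}"
    # Assumption: any descriptive User-Agent string is accepted.
    return Request(url, headers={"User-Agent": "biomed-dataset-finder/1.0"})


def keep_gsa(entries):
    """Keep only entries whose type is exactly "GSA" (CRA accessions)."""
    return [e for e in entries if e.get("type") == "GSA"]


# Placeholder entries for illustration only; these are not real accessions.
sample = [
    {"type": "GSA", "accession": "CRA-PLACEHOLDER"},
    {"type": "BioProject", "accession": "PRJ-PLACEHOLDER"},
]
req = ngdc_request("colon cancer scRNA-seq")
print(req.full_url)
print(keep_gsa(sample))
```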
Step 4 — CNGB Search (Secondary)
If CNGB token provided: search CNGBdb API. On auth error: ask user if they want to provide token or skip.
Step 5 — Output
Markdown table with bold dataset ID, article info (authors, title, journal, year, DOI), and direct links.
If no results: "No public datasets found matching your criteria. Try adjusting keywords or switching data sources."
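One way to produce this output is sketched below. The column order and the row keys (`id`, `authors`, `title`, and so on) are assumptions made for illustration; only the bold dataset ID, the article fields, the links, and the no-results message come from the skill description.

```python
NO_RESULTS = ("No public datasets found matching your criteria. "
              "Try adjusting keywords or switching data sources.")

# Assumption: the exact column layout is not specified by the skill.
COLUMNS = ["Dataset ID", "Authors", "Title", "Journal", "Year", "DOI", "Link"]


def render_table(rows):
    """Render result dicts as a Markdown table, or the no-results message."""
    if not rows:
        return NO_RESULTS
    lines = ["| " + " | ".join(COLUMNS) + " |",
             "|" + "---|" * len(COLUMNS)]
    for row in rows:
        cells = [f"**{row['id']}**"]
        # Fields missing from the API response stay blank.
        for key in ("authors", "title", "journal", "year", "doi", "link"):
            cells.append(str(row.get(key) or ""))
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


# Placeholder row for illustration only; not a real accession.
print(render_table([{"id": "GSE-PLACEHOLDER", "title": "Example title"}]))
print(render_table([]))
```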
Factuality Requirements (Critical — No Hallucinations)
This skill handles scientific research data. Fabricating a single dataset entry undermines the user's work.
Hard Rules
- Dataset IDs: Only use IDs returned by actual API responses. Never invent, guess, or infer IDs.
- Article info: Only populate from actual API/PubMed responses. Leave blank if no data returned.
- Links: Build from verified accession patterns (e.g. https://.../acc.cgi?acc={GSE}). Never guess URLs.
- "Not found" is valid: If a source returns 0 results, output the empty result; do not fabricate entries to fill the table.
Verification Checklist (before presenting results)
- Every Dataset ID is from an API response, not memory or guess
- Every Article Title + Authors + Journal is from a PubMed/API response, not reconstructed
- Every Link follows the confirmed URL pattern for that database
- If a field is empty in the API response, it must be left blank ("-") in the table; never fill it with plausible text
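The first checklist item lends itself to a mechanical check. The sketch below assumes the IDs seen in API responses were collected while searching; the function name `unverified_ids` and the row layout are hypothetical, and the accessions shown are placeholders.

```python
def unverified_ids(table_rows, api_ids):
    """Return IDs in the output rows that no API response contained."""
    allowed = set(api_ids)
    return [row["id"] for row in table_rows if row["id"] not in allowed]


# Placeholder accessions for illustration only.
seen_in_api = ["GSE-PLACEHOLDER-A", "CRA-PLACEHOLDER-B"]
rows = [{"id": "GSE-PLACEHOLDER-A"}, {"id": "GSE-PLACEHOLDER-C"}]

problems = unverified_ids(rows, seen_in_api)
print(problems)  # any ID listed here must be removed before presenting results
```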
Why This Matters
A researcher using wrong dataset IDs or fake article info could: waste weeks on non-existent data, cite non-existent papers, or compromise the validity of their research. The cost of hallucination here is far higher than in general conversation.
Security Notes
- User keywords are private — do NOT log the raw search query string to stderr/stdout. Log only counts (e.g. "Searching 5 keywords...").
- Token handling — CNGB token is passed via CLI arg only; never hardcode or log it.
- No external exfiltration — results table contains only public dataset metadata; no user-provided content is stored or transmitted elsewhere.
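The first note could look roughly like this in code: the message is built from the keyword count only, so the keywords themselves never reach the log. The logger name and the function `log_search_start` are hypothetical.

```python
import logging

logger = logging.getLogger("biomed_dataset_finder")


def log_search_start(keywords):
    """Log how many keywords are being searched, never the keywords themselves."""
    message = f"Searching {len(keywords)} keywords..."
    logger.info(message)
    return message


print(log_search_start(["colon cancer", "immunotherapy", "human", "dMMR", "scRNA-seq"]))
```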
CLI Tool
python3 skills/biomed-dataset-finder/scripts/search_datasets.py \
--disease "colon cancer" --treatment "immunotherapy" \
--species human --subtype dMMR --type scRNA-seq --max-results 10
API Reference
See references/ncbi_api.md for NCBI E-utilities details.
See references/ngdc_api.md for NGDC GSA API details.
See references/cngb_api.md for CNGBdb API details.
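- Make sure OpenClaw is installed (local or Docker deployment)
- Enter the install command in the chat box: /install biomed-dataset-finder
- Once installation finishes, call the Skill by name or use /biomed-dataset-finder to trigger it
- Provide the required inputs described in the Skill's parameter notes to get structured output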
What is Biomed Dataset Finder?
Search NCBI GEO/SRA, NGDC-GSA, and CNGB for biomedical datasets by disease, treatment, species, pathology subtype, and data type. Returns bold dataset ID, li... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 53 downloads to date.
How do I install Biomed Dataset Finder?
Run the command "/install biomed-dataset-finder" in the OpenClaw or Claude Code chat box to install it in one step; no extra configuration is needed.
Is Biomed Dataset Finder free?
Yes. Biomed Dataset Finder is completely free under the MIT-0 license and may be freely downloaded, installed, and used.
Which platforms does Biomed Dataset Finder support?
Biomed Dataset Finder is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.
Who developed Biomed Dataset Finder?
It is developed and maintained by Shuhuan Cao (@gateswell); the current version is v1.0.0.