/install graph-rag-builder
GraphRAG Builder Skill
Turns any documentation website into a runnable MCP knowledge server in 5 pipeline steps,
each run on the user's local machine using scripts in the scripts/ folder.
Quick Reference
| Step | Script | What it does |
|---|---|---|
| M1 | crawl.py |
BFS crawl → raw HTML + metadata per page |
| M2 | extract_concepts.py |
HTML → chunks → LLM concept extraction |
| M3 | build_graph.py |
Concepts + links → networkx knowledge graph |
| M4 | build_embeddings.py |
Chunks + concepts → numpy vector index |
| M5 | generate_mcp_server.py |
Graph + embeddings → standalone server.py |
All scripts require Python 3.10+ and auto-install their own dependencies on first run.
Step 0: Clarify Requirements
Before running anything, ask the user:
- URL: Which site to crawl (required — starting page)
- Depth: How many link-hops to follow (default 3; suggest 2 for large sites)
- Model: Which Claude model for concept extraction —
haiku(fast/cheap) orsonnet(higher quality). Default:haiku
Set the output slug from the URL: https://strudel.cc → strudel-cc-mcp.
Step 1: Crawl (M1)
Provide this command for the user to run locally:
python scripts/crawl.py \
--url \x3CURL> \
--max-depth \x3CDEPTH> \
--output ./output
What to expect:
- Creates
output/\x3Cslug>/raw_content/*.json(one per page) - Creates
output/\x3Cslug>/crawl.json(state tracking) - Prints a summary: pages crawled, JS fallbacks used, failures
Common issues:
- JS-heavy single-page apps → many Playwright fallbacks (normal, just slower)
- Rate limiting → add
--rate-limit 1.5to slow down - First run needs:
pip install playwright && playwright install chromium
Step 2: Extract Concepts (M2)
The user must set ANTHROPIC_API_KEY first. Provide this command:
ANTHROPIC_API_KEY=sk-ant-... python scripts/extract_concepts.py \
--input ./output/\x3Cslug>-mcp \
--model haiku
Dry-run first (no API cost):
python scripts/extract_concepts.py --input ./output/\x3Cslug>-mcp --dry-run
This validates chunking quality before spending API budget. Show them chunk counts and section names from dry-run output.
What to expect:
- Processes ~2–5 pages/minute on haiku
- Creates
output/\x3Cslug>-mcp/extracted/*.json(one per page) - Each file contains chunks with: concepts, tags, code examples, prerequisites, relationships
- Skips already-extracted pages (safe to re-run after interruption)
Common issues:
- Pages showing
no_chunks→ likely JS-rendered content not captured; acceptable for a minority of pages - API rate limiting → script retries automatically with exponential backoff
--max-pages 10flag to test on a small sample first
Re-running after a partial run:
python scripts/extract_concepts.py --input ./output/\x3Cslug>-mcp --model haiku
# (automatically skips already-extracted pages)
Force re-extraction of everything:
python scripts/extract_concepts.py --input ./output/\x3Cslug>-mcp --force
Step 3: Build Graph (M3)
python scripts/build_graph.py --input ./output/\x3Cslug>-mcp
What to expect:
- Reads all non-dry-run
extracted/*.jsonfiles - Deduplicates concept names (case-insensitive, strips trailing
()) - Creates
output/\x3Cslug>-mcp/graph.json - Prints node/edge counts by type
Healthy output looks like:
Pages: 46
Chunks: 357
Concepts: 200+
Total edges: 1000+
MENTIONS 600+
REQUIRES 100+
HAS_CHUNK 357
LINKS_TO 80+
RELATED 40+
If concepts = 0 and "Skipped N dry-run files" appears, M2 hasn't been run with a real API key yet.
Step 4: Build Embeddings (M4)
python scripts/build_embeddings.py --input ./output/\x3Cslug>-mcp
First run downloads all-MiniLM-L6-v2 (~80MB, cached after that).
Add --smoke-test to query both collections immediately after building:
python scripts/build_embeddings.py --input ./output/\x3Cslug>-mcp --smoke-test
What to expect:
- Creates
output/\x3Cslug>-mcp/embeddings/with 5 numpy files (no database needed) - Two indexes:
chunks(semantic search) andconcepts(concept lookup)
Step 5: Generate MCP Server (M5)
python scripts/generate_mcp_server.py --input ./output/\x3Cslug>-mcp
Outputs:
output/\x3Cslug>-mcp/server.py— the runnable MCP serveroutput/\x3Cslug>-mcp/mcp_config.json— Claude Desktop config snippet
Install into Claude Desktop:
- Open
~/Library/Application Support/Claude/claude_desktop_config.json - Merge the contents of
mcp_config.jsoninto the"mcpServers"key - Restart Claude Desktop
- The server name (e.g.,
strudel-cc) appears in Claude's available tools
Test the server standalone:
python output/\x3Cslug>-mcp/server.py
# Should print "Loading ... knowledge graph... Ready: N nodes, M edges"
The 8 MCP Tools
Once installed, Claude can use these tools against the knowledge base:
| Tool | Description |
|---|---|
search(query, n=5) |
Semantic search over all content chunks |
get_concept(name) |
Concept details + chunks where it appears |
get_related(concept, n=5) |
Related concepts via graph edges |
get_learning_path(start, goal) |
Shortest concept path between topics |
get_prerequisites(concept) |
What must be understood first |
get_examples(concept) |
Code examples for a concept |
list_concepts(tag?, limit=20) |
Browse all indexed concepts |
get_page(url) |
All chunks for a specific doc page |
Complete Pipeline Command Sequence
For a fresh install, provide the user with all commands in order:
# 0. Install system deps (once)
pip install requests beautifulsoup4 lxml playwright anthropic \
networkx numpy sentence-transformers mcp
playwright install chromium
# 1. Crawl
python scripts/crawl.py --url \x3CURL> --max-depth 3 --output ./output
# 2. Extract concepts (dry-run first)
python scripts/extract_concepts.py --input ./output/\x3Cslug>-mcp --dry-run
# Then real run:
ANTHROPIC_API_KEY=sk-ant-... python scripts/extract_concepts.py \
--input ./output/\x3Cslug>-mcp --model haiku
# 3. Build graph
python scripts/build_graph.py --input ./output/\x3Cslug>-mcp
# 4. Build embeddings
python scripts/build_embeddings.py --input ./output/\x3Cslug>-mcp --smoke-test
# 5. Generate server
python scripts/generate_mcp_server.py --input ./output/\x3Cslug>-mcp
# 6. Test server
python output/\x3Cslug>-mcp/server.py
Output Directory Layout
output/\x3Cslug>-mcp/
├── crawl.json State tracking (incremental re-runs)
├── raw_content/ One JSON per crawled page (HTML + links)
├── extracted/ One JSON per page (chunks + LLM concepts)
├── graph.json networkx knowledge graph
├── embeddings/ numpy indexes (chunks.npy, concepts.npy + JSON)
├── server.py The runnable MCP server ← share this
└── mcp_config.json Claude Desktop config snippet ← install this
The entire output/\x3Cslug>-mcp/ folder is the deliverable. The user can move it anywhere
as long as server.py, graph.json, and embeddings/ stay together.
Troubleshooting
"No module named X" → The script auto-installs deps, but if it fails:
pip install \x3Cpackage> --break-system-packages
Crawl gets 0 pages → Check robots.txt and try --force to bypass the crawl cache.
extract_concepts produces tiny concepts count → The page content may be JS-only.
Check fetched_with field in raw_content/*.json — pages fetched via requests with
very little text should have been picked up by Playwright. Re-crawl with --force.
Server fails to start → Run python output/\x3Cslug>-mcp/server.py directly and check
stderr for import errors. Most common cause: mcp package not installed.
Claude Desktop doesn't show the server → Verify the path in mcp_config.json is
absolute and the file exists. Restart Claude Desktop after any config change.
Deferred Features
See TODO.md for planned improvements including:
- YouTube transcript fetching
- Neo4j export for large graphs
- OpenAI/Voyage embedding API support
- Scheduled re-crawls
- Graph visualization
- Make sure OpenClaw is installed (local or Docker)
- Run the install command in chat:
/install graph-rag-builder - After installation, invoke the skill by name or use
/graph-rag-builder - Provide required inputs per the skill's parameter spec and get structured output
What is GraphRAGBuilder?
Builds a fully runnable MCP (Model Context Protocol) knowledge server from any website or documentation URL. Crawls the site, extracts concepts using Claude,... It is an AI Agent Skill for Claude Code / OpenClaw, with 80 downloads so far.
How do I install GraphRAGBuilder?
Run "/install graph-rag-builder" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.
Is GraphRAGBuilder free?
Yes, GraphRAGBuilder is completely free, licensed under MIT-0. You can download, install and use it at no cost.
Which platforms does GraphRAGBuilder support?
GraphRAGBuilder is cross-platform and runs anywhere OpenClaw / Claude Code is available (cross-platform).
Who created GraphRAGBuilder?
It is built and maintained by nacmonad (@nacmonad); the current version is v1.0.0.