Description

Multi-source memory ingestion with Discord support, automatic deduplication, and agent-ready patterns

Usage Instructions (SKILL.md)

ClawText Ingest — Production-Ready Memory Ingestion

Name: Clawtext Ingest
Author: ragesaq

Version: 1.3.0 | License: MIT | Status: Production ✅
Category: Memory & Knowledge Management
GitHub: https://github.com/ragesaq/clawtext-ingest

🎯 What It Does

ClawText Ingest transforms external data (Discord forums, files, URLs, JSON, text) into structured, deduplicated memories for AI agents.

The Problem It Solves

❌ Manual ingestion — Tedious, error-prone, no metadata
❌ Duplicate memories — Same data ingested multiple times
❌ Unstructured data — No hierarchy, no context preservation
❌ One-time imports — No recurring/scheduled ingestion
❌ Discord-specific gaps — Can't preserve forum post↔reply structure

The Solution

✅ One command imports from Discord, files, URLs, or JSON
✅ 100% idempotent — Run 1000x, zero duplicates
✅ Automatic metadata — YAML frontmatter with date, project, type, entities
✅ 6 agent patterns — Autonomous workflows documented and ready
✅ Discord-native — Forum hierarchy preserved, progress bars, auto-batch mode
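To make the "automatic metadata" point concrete, here is a minimal sketch of what a memory with YAML frontmatter could look like. The helper name `withFrontmatter`, the field order, and the sample values are assumptions for illustration; the layout ClawText Ingest actually emits may differ.

```javascript
// Hypothetical helper, not part of the clawtext-ingest API.
// Prepends a YAML frontmatter block (date, project, type, entities) to the content.
function withFrontmatter(content, { date, project, type, entities = [] }) {
  const lines = [
    '---',
    `date: ${date}`,
    `project: ${project}`,
    `type: ${type}`,
    `entities: [${entities.join(', ')}]`,
    '---',
    '',
    content,
  ];
  return lines.join('\n');
}

// Sample values are made up for the example.
const memory = withFrontmatter('Use streaming mode for large forums.', {
  date: '2024-01-15',
  project: 'docs',
  type: 'fact',
  entities: ['Discord', 'ClawText'],
});

console.log(memory);
```

This naive version does not quote or escape values, so a real implementation would likely use a YAML serializer.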

✨ Key Features

🎯 Discord Integration (New in v1.3.0)

Forum + Channel + Thread support
Hierarchy preservation — Post↔reply structure in metadata
Real-time progress — Live feedback for large ingestions
Auto-batch mode: <500 posts run in full mode, ≥500 posts in streaming mode
One-command setup — 5-minute bot creation
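The auto-batch rule in the list above amounts to a single threshold check. A minimal sketch, assuming the 500-post cutoff is the only input to the decision (the function and constant names are hypothetical):

```javascript
// Threshold taken from the feature list; the name is an assumption.
const STREAMING_THRESHOLD = 500;

// Returns 'full' for forums under the threshold, 'streaming' otherwise.
function chooseMode(postCount) {
  return postCount < STREAMING_THRESHOLD ? 'full' : 'streaming';
}

console.log(chooseMode(120));  // 'full'
console.log(chooseMode(500));  // 'streaming'
```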

📁 Multi-Source Ingestion

Files — Glob patterns (Markdown, text, etc.)
URLs — Single or bulk URL ingestion
JSON — Chat exports, API responses
Raw text — Quick knowledge capture
Batch operations — Unified ingestion from multiple sources

🔄 Deduplication & Safety

SHA1-based — Cryptographic hash matching
100% idempotent — Safe for repeated runs
Configurable — checkDedupe: true/false per operation
Zero data loss — Failed items tracked, fallback per-item ingestion
Hash persistence — .ingest_hashes.json for cross-session tracking
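The following sketch shows how SHA1-based deduplication with a persisted hash file can make repeated runs idempotent. It is an illustration only: the `HashStore` class is hypothetical, and the real schema of `.ingest_hashes.json` is not described in this document (a plain JSON array of hex digests is assumed here).

```javascript
const crypto = require('node:crypto');
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical store: keeps SHA1 digests of ingested content in a JSON file.
class HashStore {
  constructor(file) {
    this.file = file;
    const saved = fs.existsSync(file) ? JSON.parse(fs.readFileSync(file, 'utf8')) : [];
    this.hashes = new Set(saved);
  }

  // Returns true if the content is new (and records it), false if already seen.
  addIfNew(content) {
    const hash = crypto.createHash('sha1').update(content, 'utf8').digest('hex');
    if (this.hashes.has(hash)) return false;
    this.hashes.add(hash);
    return true;
  }

  // Persists the digests so later sessions can skip known content.
  save() {
    fs.writeFileSync(this.file, JSON.stringify([...this.hashes], null, 2));
  }
}

// Demo in a temp file so the example leaves no state behind.
const file = path.join(os.tmpdir(), `ingest_hashes_demo_${process.pid}.json`);
fs.rmSync(file, { force: true });

const first = new HashStore(file);
const results = ['note A', 'note B', 'note A'].map((c) => first.addIfNew(c));
first.save();

// A second "session" reloads the persisted hashes, so nothing is ingested twice.
const second = new HashStore(file);
const rerun = ['note A', 'note B'].map((c) => second.addIfNew(c));
fs.rmSync(file, { force: true });

// results: [true, true, false], rerun: [false, false]
console.log(results, rerun);
```

Hashing exact content means that any whitespace or formatting change produces a new digest, so near-duplicates are not caught by this approach.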

🤖 Agent-Ready

6 documented patterns — Direct API, Discord Agent, CLI, Cron, Batch, Thread
Working code examples — Copy-paste ready
Real-world patterns — GitHub sync, Discord monitoring, team decisions
Error handling — Comprehensive error recovery
Progress callbacks — Track ingestion in real-time

🛠️ Developer-Friendly

CLI tool — clawtext-ingest + clawtext-ingest-discord commands
Node.js API — Simple imports for programmatic use
TypeScript-ready — Clear method signatures
Extensible — Custom transforms, field mapping
Well-documented — 11 guides, 20+ examples

🔗 ClawText Integration

Automatic cluster indexing — New memories indexed after rebuild
RAG injection — Relevant context injected into agent prompts
Project routing — Organize memories by project/source
Entity linking — Auto-extract and link related entities

🚀 Quick Start

Installation

# Via npm
npm install clawtext-ingest

# Via OpenClaw
openclaw install clawtext-ingest

Discord Ingestion (5 minutes)

# 1. Set up Discord bot (see DISCORD_BOT_SETUP.md)
# 2. Get bot token, set DISCORD_TOKEN env var

# 3. Inspect forum
clawtext-ingest-discord describe-forum --forum-id FORUM_ID --verbose

# 4. Ingest with progress
DISCORD_TOKEN=xxx clawtext-ingest-discord fetch-discord --forum-id FORUM_ID

# 5. Rebuild ClawText clusters
clawtext-ingest rebuild

File Ingestion

clawtext-ingest ingest-files --input="docs/*.md" --project="docs"

Node.js API

import { ClawTextIngest } from 'clawtext-ingest';

const ingest = new ClawTextIngest();

// Ingest files
await ingest.fromFiles(['docs/**/*.md'], { project: 'docs', type: 'fact' });

// Ingest JSON
await ingest.fromJSON(chatArray, { project: 'team' }, {
  keyMap: { contentKey: 'message', dateKey: 'timestamp', authorKey: 'user' }
});

// Rebuild clusters for RAG injection
await ingest.rebuildClusters();
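The `keyMap` option above tells the ingester which fields of each JSON item hold the content, date, and author. As a rough sketch of that mapping (the `applyKeyMap` helper, the output field names, and the sample record are assumptions, not the library's internals):

```javascript
// Hypothetical normalizer mirroring the keyMap option shown above.
function applyKeyMap(item, { contentKey, dateKey, authorKey }) {
  return {
    content: item[contentKey],
    date: item[dateKey],
    author: item[authorKey],
  };
}

// Example shape for a chat export; real exports will vary.
const chatArray = [
  { message: 'Ship it Friday', timestamp: '2024-03-01T10:00:00Z', user: 'alice' },
];
const keyMap = { contentKey: 'message', dateKey: 'timestamp', authorKey: 'user' };

const normalized = chatArray.map((item) => applyKeyMap(item, keyMap));
console.log(normalized);
```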

🤖 Agent Integration (6 Patterns)

Pattern 1: Direct API

For: In-agent code
Use when: Agents need to ingest as part of workflow

const ingest = new ClawTextIngest();
await ingest.fromFiles(['docs/**/*.md'], { project: 'docs' });

Pattern 2: Discord Agent

For: Autonomous Discord ingestion
Use when: Agents need to fetch Discord forums

const runner = new DiscordIngestionRunner(ingest);
await runner.ingestForumAutonomous({
  forumId, mode: 'batch', token: process.env.DISCORD_TOKEN
});

Pattern 3: CLI Subprocess

For: Agents executing commands
Use when: Simpler CLI-based execution needed

await execAsync('clawtext-ingest-discord fetch-discord --forum-id ID');
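The one-liner above assumes an `execAsync` helper that is not defined in this document. One possible way to run the same command, sketched here with Node's built-in `execFile` and an argument array so that the forum ID never passes through a shell. The helper names are hypothetical, and the numeric check assumes Discord IDs are numeric snowflakes:

```javascript
const { execFile } = require('node:child_process');
const { promisify } = require('node:util');

const execFileAsync = promisify(execFile);

// Builds the argument list for the fetch-discord subcommand.
// Rejecting non-numeric IDs guards against injected flags or shell syntax.
function buildFetchArgs(forumId) {
  if (!/^\d+$/.test(forumId)) {
    throw new Error(`Invalid forum ID: ${forumId}`);
  }
  return ['fetch-discord', '--forum-id', forumId];
}

// Runs the CLI; DISCORD_TOKEN is inherited from the parent environment.
async function fetchForum(forumId) {
  const { stdout } = await execFileAsync(
    'clawtext-ingest-discord',
    buildFetchArgs(forumId),
    { env: { ...process.env } }
  );
  return stdout;
}
```

Calling `fetchForum('123456789012345678')` would require the CLI to be installed and on `PATH`; on Windows, npm-installed binaries may need additional handling.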

Pattern 4: Cron/Scheduled

For: Recurring tasks
Use when: Daily/hourly ingestion needed

cron.schedule('0 * * * *', () => agentIngest());

Pattern 5: Batch Multi-Source

For: Unified ingestion
Use when: Multiple sources in one operation

await ingest.ingestAll([
  { type: 'files', data: ['docs/**/*.md'], metadata: {...} },
  { type: 'json', data: chatExport, metadata: {...} }
]);

Pattern 6: Discord Thread

For: Thread-specific ingestion
Use when: Single thread fetch needed

await runner.ingestThread(threadId);

→ See AGENT_GUIDE.md for complete examples

📊 Real-World Examples

Example 1: Daily Documentation Sync

async function syncDocsDaily() {
  const ingest = new ClawTextIngest();
  const result = await ingest.ingestAll([
    { type: 'files', data: ['docs/**/*.md'], metadata: { project: 'docs' } },
    { type: 'urls', data: ['https://docs.example.com/api'], metadata: { project: 'api-docs' } }
  ]);
  await ingest.rebuildClusters();
  return result;
}

Example 2: Discord Forum Monitoring

async function monitorDiscordForum(forumId) {
  const ingest = new ClawTextIngest();
  const runner = new DiscordIngestionRunner(ingest);
  
  const result = await runner.ingestForumAutonomous({
    forumId,
    mode: 'batch',
    token: process.env.DISCORD_TOKEN,
    onProgress: (p) => console.log(`${p.percent}% complete...`)
  });
  
  return result;
}

Example 3: Team Decisions Ingestion

async function ingestTeamDecisions() {
  const ingest = new ClawTextIngest();
  
  const result = await ingest.ingestAll([
    { type: 'files', data: ['decisions/adr/**/*.md'], metadata: { type: 'adr' } },
    { type: 'json', data: slackThread, metadata: { type: 'decision', source: 'slack' } }
  ]);
  
  await ingest.rebuildClusters();
  return result;
}

🛒 CLI Commands

`clawtext-ingest` — File/URL/JSON/Text Ingestion

clawtext-ingest ingest-files --input="docs/*.md" --project="docs" --verbose
clawtext-ingest ingest-urls --input="https://example.com" --project="research"
clawtext-ingest ingest-json --input=messages.json --source="slack"
clawtext-ingest ingest-text --input="Finding: X is better than Y" --project="findings"
clawtext-ingest batch --config=sources.json
clawtext-ingest rebuild
clawtext-ingest status

`clawtext-ingest-discord` — Discord Integration

# Inspect forum
clawtext-ingest-discord describe-forum --forum-id FORUM_ID --verbose

# Fetch & ingest
DISCORD_TOKEN=xxx clawtext-ingest-discord fetch-discord \
  --forum-id FORUM_ID \
  --mode batch \
  --batch-size 100 \
  --verbose

📚 Documentation

Document	Purpose	Read Time
README.md	Overview + quick start	5 min
QUICKSTART.md	5-minute setup	5 min
AGENT_GUIDE.md	6 autonomous patterns	10 min
API_REFERENCE.md	Complete API docs	15 min
PHASE2_CLI_GUIDE.md	CLI commands	10 min
DISCORD_BOT_SETUP.md	Bot creation	5 min
CLAYHUB_GUIDE.md	Publication	5 min
INDEX.md	Documentation index	2 min

🎯 Who Should Use This

✅ AI/Agent developers — Building knowledge-aware agents
✅ RAG engineers — Populating memory for context injection
✅ Teams using Discord — Leveraging Discord as knowledge base
✅ DevOps/MLOps — Automated knowledge ingestion pipelines
✅ Researchers — Structuring unstructured data sources

⚡ Performance

Operation	Speed	Notes
Ingest 100 files	~5 sec	With SHA1 dedup check
Ingest 1000 JSON items	~15 sec	Batch processing
Small forum (<100 msgs)	~10 sec	Full mode
Large forum (1000+ msgs)	~2 min	Auto-batch, streaming
Rebuild clusters	~5-30 sec	Depends on total memories

✅ Quality Metrics

Metric	Value
Tests	22/22 passing ✅
Code	1,254 production lines
Documentation	92 KB across 11 guides
Examples	20+ working examples
Coverage	100% critical paths

🔗 Integration with ClawText

1. Ingest data → Creates memories with YAML metadata
2. Rebuild clusters → ClawText indexes new memories
3. RAG layer → Relevant context injected on next prompt
4. Agent response → Enhanced with contextual information

# Complete workflow
clawtext-ingest-discord fetch-discord --forum-id ID  # Step 1
clawtext-ingest rebuild                               # Step 2
# Step 3-4 automatic (ClawText + Agent)

🆘 Support

Documentation: See INDEX.md for navigation
Issues: https://github.com/ragesaq/clawtext-ingest/issues
Examples: 20+ examples in documentation
Troubleshooting: Built into each guide

📦 Installation & Requirements

Requirements:

Node.js ≥ 18.0.0
OpenClaw (for agent patterns)
ClawText ≥ 1.2.0 (for RAG integration)

Installation:

npm install clawtext-ingest
# or
openclaw install clawtext-ingest

Binaries:

clawtext-ingest — File/URL/JSON ingestion
clawtext-ingest-discord — Discord integration

🚀 Why This Over Alternatives

Feature	ClawText-Ingest	Manual	Generic Importer	API Tool
Discord native	✅	❌	❌	❌
Deduplication	✅	❌	Partial	❌
Agent patterns	✅	❌	❌	❌
Metadata auto	✅	❌	Partial	❌
ClawText integration	✅	❌	❌	❌
Idempotent	✅	❌	❌	Partial

📄 License

MIT — Use freely, open source, community supported

🙌 Contributing

Contributions welcome! See GitHub issues for current priorities.

Ready to ingest? Start with QUICKSTART.md (5 min) or AGENT_GUIDE.md if you're building agents.

Security Recommendations

This skill appears to implement what it claims (Discord + multi-source ingestion), but there are some red flags you should address before installing:

1) The docs/code require a Discord bot token (DISCORD_TOKEN) but the skill metadata lists no required env vars; ask the publisher to declare required env vars in the skill manifest.
2) SKILL.md contains hidden unicode control characters (possible prompt-injection attempt); inspect and sanitize the file before trusting automated reviews.
3) Review the source files that interact with Discord and the network (src/adapters/discord.js, bin/discord.js, src/agent-runner.js) to confirm: a) tokens are not logged or uploaded to unknown endpoints, b) attachments are handled safely (where they are saved, whether external URLs are fetched), and c) network endpoints are only Discord/GitHub/expected services.
4) Run the package in an isolated environment first (no production tokens); if you must provide a bot token, give the bot minimal read-only scopes and be ready to revoke it.
5) Prefer installing from the upstream GitHub repo referenced in the docs (verify the repo owner and commits) rather than trusting an unverified registry snapshot.

If the maintainer cannot clarify the missing metadata and the unicode-control characters, treat the package as untrusted.

Functional Analysis

Type: OpenClaw Skill
Name: clawtext-ingest
Version: 1.0.1

The clawtext-ingest bundle provides a comprehensive data ingestion suite for AI agents, supporting Discord, local files, and URLs. While the code appears to follow its stated purpose, it contains high-risk capabilities that lack sufficient input sanitization, creating a significant attack surface. Specifically, src/index.js implements file ingestion using broad glob patterns and network ingestion via fetch, which could be exploited for path traversal or Server-Side Request Forgery (SSRF) if the agent is manipulated. Additionally, the tool requires access to sensitive environment variables like DISCORD_TOKEN (bin/discord.js) and performs destructive file operations in src/index.js (rebuildClusters) to manage its workspace. These high-privilege operations are plausibly necessary for the tool's function but represent risky behaviors without clear boundaries.
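As an illustration of the kind of boundary checks this analysis finds missing, the sketch below validates URLs against a host allowlist and confirms that file paths stay inside an ingestion root. It is not code from clawtext-ingest; the function names, the allowlist contents, and the example paths are assumptions.

```javascript
const path = require('node:path');

// Hypothetical allowlist; replace with the hosts you actually expect to ingest from.
const ALLOWED_HOSTS = new Set(['docs.example.com']);

// Accepts only well-formed HTTPS URLs whose host is on the allowlist.
function isAllowedUrl(raw) {
  let url;
  try {
    url = new URL(raw);
  } catch {
    return false; // not a parseable URL
  }
  return url.protocol === 'https:' && ALLOWED_HOSTS.has(url.hostname);
}

// Returns true only if candidate resolves to a location inside root.
function isInsideRoot(root, candidate) {
  const resolvedRoot = path.resolve(root);
  const resolved = path.resolve(resolvedRoot, candidate);
  const rel = path.relative(resolvedRoot, resolved);
  const escapes = rel === '..' || rel.startsWith('..' + path.sep) || path.isAbsolute(rel);
  return !escapes;
}

console.log(isAllowedUrl('https://docs.example.com/api'));           // true
console.log(isAllowedUrl('http://169.254.169.254/latest/meta-data')); // false
console.log(isInsideRoot('/workspace/docs', 'guide/intro.md'));       // true
console.log(isInsideRoot('/workspace/docs', '../../etc/passwd'));     // false
```

These checks are lexical only. They do not follow symlinks, redirects, or DNS resolution, so they reduce but do not eliminate the path traversal and SSRF risks described above.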

Capability Assessment

⚠ Purpose & Capability

The skill is legitimately a multi-source ingestion tool (Discord, files, URLs, JSON) and the included code supports that. However, the skill metadata claims no required environment variables while the runtime docs and code clearly rely on a DISCORD_TOKEN (process.env.DISCORD_TOKEN) for Discord ingestion. Declaring zero required env vars is inconsistent with the documented/implemented functionality.

⚠ Instruction Scope

SKILL.md and AGENT_GUIDE direct agents to read local files, ingest JSON/URLs, call the Discord API, run CLI subprocesses (execSync / execFile), and persist hashes to disk (.ingest_hashes.json). Those actions are within the stated purpose but the SKILL.md contains detected 'unicode-control-chars' (prompt-injection signal) which could indicate hidden characters intended to manipulate automated reviewers or agent behavior. The instructions also give agents broad discretion to run subprocesses and scheduled jobs — expected for this tool but increases risk if the skill is untrusted.

✓ Install Mechanism

There is no non-standard external installer or download URL shown; the README suggests normal npm/openclaw installation and repository references a GitHub URL. The package includes source, bins, and package.json (no installer that fetches arbitrary archives from unknown hosts). This is proportionate to a Node.js CLI/library.

⚠ Credentials

The skill metadata declares no required environment variables, but the docs and code require DISCORD_TOKEN for Discord ingestion. Requiring a Discord bot token is proportionate to the feature, but the omission from declared requirements is incoherent and surprising. The skill asks agents to pass environment tokens into subprocesses (e.g., execSync with DISCORD_TOKEN), so confirm token handling, storage, and that only minimum bot scopes are used.

ℹ Persistence & Privilege

always:false (normal). The skill provides autonomous patterns for agents (ingestForumAutonomous, cron jobs, CLI subprocesses) and writes local state (e.g., .ingest_hashes.json, optional outputPath). Autonomous invocation plus the ability to spawn subprocesses is expected for this tooling, but it increases the blast radius if the skill or its maintainer were untrusted—review where files are written and what is sent over the network.

Version History

v1.0.1

Adds Discord ingestion, CLI, and agent-ready documentation patterns:

- Major rewrite: Adds Discord forum/channel/thread ingestion with hierarchy preservation and real-time progress
- New CLI tools: `clawtext-ingest` and `clawtext-ingest-discord` for one-command ingestion from files, URLs, JSON, and Discord
- Expanded documentation: 11 new guides and references covering agent patterns, CLI use, enhancement workflows, and Discord setup
- Improved agent integration: Six documented ingestion patterns for direct API, CLI, batch, Discord agent, cron/scheduled, and thread-specific use
- Updated deduplication and error handling for robust, production-ready multi-source ingestion

v1.0.0

- New major release: Multi-source memory ingestion for OpenClaw agents with deduplication and YAML headers.
- Supports ingestion from files, URLs, JSON exports, and raw text with flexible metadata and per-source deduplication.
- Automatically adds YAML frontmatter (date, project, type, etc.) and indexes entities for RAG.
- Integrates directly with the ClawText RAG layer for automatic cluster rebuilding after import.
- Includes batch processing, data transformation hooks, and idempotent daily/recurring ingestion patterns.
- Provides robust troubleshooting, performance guidance, and examples for agent integration.

Metadata

Slug clawtext-ingest

Version 1.0.1

License not specified

Total installs 1

Current installs 1

Number of versions 2

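FAQ

What is Clawtext Ingest?

Multi-source memory ingestion with Discord support, automatic deduplication, and agent-ready patterns. It is an AI agent skill plugin for Claude Code / OpenClaw, with 411 cumulative downloads to date.

How do I install Clawtext Ingest?

Run the command "/install clawtext-ingest" in the OpenClaw or Claude Code chat box for a one-click install; no additional configuration is required.

Is Clawtext Ingest free?

Yes, Clawtext Ingest is completely free (free and open source) and can be downloaded, installed, and used freely.

Which platforms does Clawtext Ingest support?

Clawtext Ingest is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed Clawtext Ingest?

Developed and maintained by ragesaq (@ragesaq); the current version is v1.0.1.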
