Clawtext Ingest

Author: ragesaq · GitHub ↗ · v1.0.1
cross-platform ⚠ suspicious
Total downloads: 411
Favorites: 0
Current installs: 1
Versions: 2
Install in OpenClaw
/install clawtext-ingest
Description
Multi-source memory ingestion with Discord support, automatic deduplication, and agent-ready patterns
Usage (SKILL.md)

ClawText Ingest — Production-Ready Memory Ingestion

Version: 1.3.0 | License: MIT | Status: Production ✅
Author: ragesaq | Category: Memory & Knowledge Management
GitHub: https://github.com/ragesaq/clawtext-ingest


🎯 What It Does

ClawText Ingest transforms external data (Discord forums, files, URLs, JSON, text) into structured, deduplicated memories for AI agents.

The Problem It Solves

  • Manual ingestion — Tedious, error-prone, no metadata
  • Duplicate memories — Same data ingested multiple times
  • Unstructured data — No hierarchy, no context preservation
  • One-time imports — No recurring/scheduled ingestion
  • Discord-specific gaps — Can't preserve forum post↔reply structure

The Solution

One command imports from Discord, files, URLs, or JSON
100% idempotent — Run 1000x, zero duplicates
Automatic metadata — YAML frontmatter with date, project, type, entities
6 agent patterns — Autonomous workflows documented and ready
Discord-native — Forum hierarchy preserved, progress bars, auto-batch mode
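
The automatic metadata step can be pictured as prepending a YAML frontmatter block to each memory. The following is a minimal sketch, assuming a hypothetical helper named toFrontmatter and only the four fields listed above (date, project, type, entities). The package's real field names and serializer are not shown in this document and may differ.

```javascript
// Hypothetical helper: not part of the clawtext-ingest API.
// Builds a YAML frontmatter header from the four fields listed above.
function toFrontmatter(content, { date, project, type, entities = [] }) {
  const lines = [
    '---',
    `date: ${date}`,
    `project: ${project}`,
    `type: ${type}`,
    // Flow-style YAML list; JSON-quoted strings are valid YAML scalars.
    `entities: [${entities.map((e) => JSON.stringify(e)).join(', ')}]`,
    '---',
    '',
  ];
  return lines.join('\n') + content;
}

const memory = toFrontmatter('Use SHA1 for dedup.', {
  date: '2024-01-01',
  project: 'docs',
  type: 'fact',
  entities: ['SHA1'],
});
console.log(memory);
```

Keeping the metadata in a header rather than in a side file means each memory stays self-describing when it is copied or re-indexed.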


✨ Key Features

🎯 Discord Integration (New in v1.3.0)

  • Forum + Channel + Thread support
  • Hierarchy preservation — Post↔reply structure in metadata
  • Real-time progress — Live feedback for large ingestions
  • Auto-batch mode — <500 posts: full, ≥500 posts: streaming
  • One-command setup — 5-minute bot creation
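
The auto-batch rule above reduces to a single threshold check. This sketch assumes a hypothetical function chooseMode and takes the mode names "full" and "streaming" directly from the bullet; how the package actually counts posts or names its modes internally is not documented here.

```javascript
// Hypothetical helper illustrating the documented threshold:
// fewer than 500 posts -> "full", 500 or more -> "streaming".
const STREAMING_THRESHOLD = 500;

function chooseMode(postCount) {
  if (!Number.isInteger(postCount) || postCount < 0) {
    throw new RangeError('postCount must be a non-negative integer');
  }
  return postCount < STREAMING_THRESHOLD ? 'full' : 'streaming';
}

console.log(chooseMode(120));
console.log(chooseMode(500));
```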

📁 Multi-Source Ingestion

  • Files — Glob patterns (Markdown, text, etc.)
  • URLs — Single or bulk URL ingestion
  • JSON — Chat exports, API responses
  • Raw text — Quick knowledge capture
  • Batch operations — Unified ingestion from multiple sources

🔄 Deduplication & Safety

  • SHA1-based — Cryptographic hash matching
  • 100% idempotent — Safe for repeated runs
  • Configurable — checkDedupe: true/false per operation
  • Zero data loss — Failed items tracked, fallback per-item ingestion
  • Hash persistence — .ingest_hashes.json for cross-session tracking
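
Hash-based deduplication with a persisted hash file might look like the sketch below. The HashStore class is hypothetical, and storing the hashes as a JSON array is an assumption; the actual layout of .ingest_hashes.json is not described in this document. The demo writes to a temporary directory rather than the working directory.

```javascript
const crypto = require('node:crypto');
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical class: a sketch of SHA1-based dedup, not the package's code.
class HashStore {
  constructor(filePath = '.ingest_hashes.json') {
    this.filePath = filePath;
    // Load hashes from a previous session if the file exists.
    // Assumption: the file holds a JSON array of hex digests.
    this.hashes = new Set(
      fs.existsSync(filePath) ? JSON.parse(fs.readFileSync(filePath, 'utf8')) : []
    );
  }

  static hash(content) {
    return crypto.createHash('sha1').update(content, 'utf8').digest('hex');
  }

  // Returns true if the content is new (and records it), false if seen before.
  add(content) {
    const digest = HashStore.hash(content);
    if (this.hashes.has(digest)) return false;
    this.hashes.add(digest);
    return true;
  }

  save() {
    fs.writeFileSync(this.filePath, JSON.stringify([...this.hashes], null, 2));
  }
}

// Demo in a fresh temp directory so no state leaks between runs.
const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'ingest-'));
const tmpFile = path.join(tmpDir, 'ingest_hashes.json');
const store = new HashStore(tmpFile);
console.log(store.add('hello')); // first sighting of this content
console.log(store.add('hello')); // same content again
store.save();
```

Because the check depends only on content, rerunning the same ingestion is a no-op, which is what makes repeated runs safe. Note that any change to the content, even whitespace, yields a different hash and therefore a new memory.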

🤖 Agent-Ready

  • 6 documented patterns — Direct API, Discord Agent, CLI, Cron, Batch, Thread
  • Working code examples — Copy-paste ready
  • Real-world patterns — GitHub sync, Discord monitoring, team decisions
  • Error handling — Comprehensive error recovery
  • Progress callbacks — Track ingestion in real-time
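
The error-recovery and progress-callback bullets, together with the per-item fallback mentioned under deduplication, can be combined into one loop. This is a sketch under stated assumptions: ingestWithFallback, ingestBatch, and ingestOne are hypothetical names, and the progress object shape (done, total, percent) is illustrative, with only percent appearing elsewhere in this document.

```javascript
// Hypothetical sketch: try the whole batch first, then fall back to
// per-item ingestion so one bad item does not sink the rest.
async function ingestWithFallback(items, ingestBatch, ingestOne, onProgress = () => {}) {
  const failed = [];
  try {
    await ingestBatch(items);
    onProgress({ done: items.length, total: items.length, percent: 100 });
    return { ingested: items.length, failed };
  } catch {
    // Batch failed: retry each item on its own and record failures below.
  }

  let ingested = 0;
  for (let i = 0; i < items.length; i++) {
    try {
      await ingestOne(items[i]);
      ingested++;
    } catch (err) {
      failed.push({ item: items[i], error: err.message });
    }
    onProgress({
      done: i + 1,
      total: items.length,
      percent: Math.round(((i + 1) / items.length) * 100),
    });
  }
  return { ingested, failed };
}

// Demo with stub ingesters: the batch call fails, and one item is invalid.
const demo = ingestWithFallback(
  ['a', 'bad', 'c'],
  async () => { throw new Error('batch rejected'); },
  async (item) => { if (item === 'bad') throw new Error('invalid item'); },
  (p) => console.log(`${p.percent}% complete`)
);
demo.then((result) => console.log(result));
```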

🛠️ Developer-Friendly

  • CLI tools — clawtext-ingest + clawtext-ingest-discord commands
  • Node.js API — Simple imports for programmatic use
  • TypeScript-ready — Clear method signatures
  • Extensible — Custom transforms, field mapping
  • Well-documented — 11 guides, 20+ examples

🔗 ClawText Integration

  • Automatic cluster indexing — New memories indexed after rebuild
  • RAG injection — Relevant context injected into agent prompts
  • Project routing — Organize memories by project/source
  • Entity linking — Auto-extract and link related entities

🚀 Quick Start

Installation

# Via npm
npm install clawtext-ingest

# Via OpenClaw
openclaw install clawtext-ingest

Discord Ingestion (5 minutes)

# 1. Set up Discord bot (see DISCORD_BOT_SETUP.md)
# 2. Get bot token, set DISCORD_TOKEN env var

# 3. Inspect forum
clawtext-ingest-discord describe-forum --forum-id FORUM_ID --verbose

# 4. Ingest with progress
DISCORD_TOKEN=xxx clawtext-ingest-discord fetch-discord --forum-id FORUM_ID

# 5. Rebuild ClawText clusters
clawtext-ingest rebuild

File Ingestion

clawtext-ingest ingest-files --input="docs/*.md" --project="docs"

Node.js API

import { ClawTextIngest } from 'clawtext-ingest';

const ingest = new ClawTextIngest();

// Ingest files
await ingest.fromFiles(['docs/**/*.md'], { project: 'docs', type: 'fact' });

// Ingest JSON
await ingest.fromJSON(chatArray, { project: 'team' }, {
  keyMap: { contentKey: 'message', dateKey: 'timestamp', authorKey: 'user' }
});

// Rebuild clusters for RAG injection
await ingest.rebuildClusters();
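
The keyMap option in the fromJSON call suggests a field-renaming step. One way such a mapping could work is sketched here; applyKeyMap is a hypothetical function, and the default key names and the normalized output shape are assumptions rather than the library's documented behavior.

```javascript
// Hypothetical normalizer: shows one way a keyMap could translate
// arbitrary export fields into a common { content, date, author } shape.
function applyKeyMap(record, keyMap = {}) {
  const { contentKey = 'content', dateKey = 'date', authorKey = 'author' } = keyMap;
  return {
    content: record[contentKey],
    date: record[dateKey],
    author: record[authorKey],
  };
}

// Sample data invented for illustration.
const chatArray = [
  { message: 'Ship on Friday', timestamp: '2024-03-01T10:00:00Z', user: 'ana' },
];

const normalized = chatArray.map((record) =>
  applyKeyMap(record, { contentKey: 'message', dateKey: 'timestamp', authorKey: 'user' })
);
console.log(normalized);
```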

🤖 Agent Integration (6 Patterns)

Pattern 1: Direct API

For: In-agent code
Use when: Agents need to ingest as part of workflow

import { ClawTextIngest } from 'clawtext-ingest';

const ingest = new ClawTextIngest();
await ingest.fromFiles(['docs/**/*.md'], { project: 'docs' });

Pattern 2: Discord Agent

For: Autonomous Discord ingestion
Use when: Agents need to fetch Discord forums

const runner = new DiscordIngestionRunner(ingest);
await runner.ingestForumAutonomous({
  forumId, mode: 'batch', token: process.env.DISCORD_TOKEN
});

Pattern 3: CLI Subprocess

For: Agents executing commands
Use when: Simpler CLI-based execution needed

await execAsync('clawtext-ingest-discord fetch-discord --forum-id ID');
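
The one-liner above presumably relies on a promisified exec, which is not defined in this document. A somewhat safer variant is sketched below using Node's execFile, which passes arguments as an array instead of a shell string. The helper names buildFetchArgs and fetchForum are hypothetical; the subcommand and flag are the ones shown in this document, and the sketch assumes clawtext-ingest-discord is on PATH with DISCORD_TOKEN already set in the environment.

```javascript
const { execFile } = require('node:child_process');
const { promisify } = require('node:util');

const execFileAsync = promisify(execFile);

// Builds the argument vector separately so the forum ID is never
// interpolated into a shell command string.
function buildFetchArgs(forumId) {
  // Discord IDs are numeric snowflakes; reject anything else.
  if (!/^\d+$/.test(forumId)) {
    throw new Error('forumId must be numeric');
  }
  return ['fetch-discord', '--forum-id', forumId];
}

// The child process inherits the parent's environment by default,
// including DISCORD_TOKEN, so the token never appears in the arguments.
async function fetchForum(forumId) {
  const { stdout } = await execFileAsync('clawtext-ingest-discord', buildFetchArgs(forumId));
  return stdout;
}

console.log(buildFetchArgs('123456789012345678'));
```

Validating the ID before spawning matters most when the value comes from an agent or from user input rather than from a constant.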

Pattern 4: Cron/Scheduled

For: Recurring tasks
Use when: Daily/hourly ingestion needed

cron.schedule('0 * * * *', () => agentIngest());

Pattern 5: Batch Multi-Source

For: Unified ingestion
Use when: Multiple sources in one operation

await ingest.ingestAll([
  { type: 'files', data: ['docs/**/*.md'], metadata: {...} },
  { type: 'json', data: chatExport, metadata: {...} }
]);

Pattern 6: Discord Thread

For: Thread-specific ingestion
Use when: Single thread fetch needed

await runner.ingestThread(threadId);

→ See AGENT_GUIDE.md for complete examples


📊 Real-World Examples

Example 1: Daily Documentation Sync

async function syncDocsDaily() {
  const ingest = new ClawTextIngest();
  const result = await ingest.ingestAll([
    { type: 'files', data: ['docs/**/*.md'], metadata: { project: 'docs' } },
    { type: 'urls', data: ['https://docs.example.com/api'], metadata: { project: 'api-docs' } }
  ]);
  await ingest.rebuildClusters();
  return result;
}

Example 2: Discord Forum Monitoring

async function monitorDiscordForum(forumId) {
  const ingest = new ClawTextIngest();
  const runner = new DiscordIngestionRunner(ingest);
  
  const result = await runner.ingestForumAutonomous({
    forumId,
    mode: 'batch',
    token: process.env.DISCORD_TOKEN,
    onProgress: (p) => console.log(`${p.percent}% complete...`)
  });
  
  return result;
}

Example 3: Team Decisions Ingestion

async function ingestTeamDecisions() {
  const ingest = new ClawTextIngest();
  
  const result = await ingest.ingestAll([
    { type: 'files', data: ['decisions/adr/**/*.md'], metadata: { type: 'adr' } },
    { type: 'json', data: slackThread, metadata: { type: 'decision', source: 'slack' } }
  ]);
  
  await ingest.rebuildClusters();
  return result;
}

🛒 CLI Commands

clawtext-ingest — File/URL/JSON/Text Ingestion

clawtext-ingest ingest-files --input="docs/*.md" --project="docs" --verbose
clawtext-ingest ingest-urls --input="https://example.com" --project="research"
clawtext-ingest ingest-json --input=messages.json --source="slack"
clawtext-ingest ingest-text --input="Finding: X is better than Y" --project="findings"
clawtext-ingest batch --config=sources.json
clawtext-ingest rebuild
clawtext-ingest status

clawtext-ingest-discord — Discord Integration

# Inspect forum
clawtext-ingest-discord describe-forum --forum-id FORUM_ID --verbose

# Fetch & ingest
DISCORD_TOKEN=xxx clawtext-ingest-discord fetch-discord \
  --forum-id FORUM_ID \
  --mode batch \
  --batch-size 100 \
  --verbose

📚 Documentation

Document              Purpose                  Read Time
README.md             Overview + quick start   5 min
QUICKSTART.md         5-minute setup           5 min
AGENT_GUIDE.md        6 autonomous patterns    10 min
API_REFERENCE.md      Complete API docs        15 min
PHASE2_CLI_GUIDE.md   CLI commands             10 min
DISCORD_BOT_SETUP.md  Bot creation             5 min
CLAYHUB_GUIDE.md      Publication              5 min
INDEX.md              Documentation index      2 min

🎯 Who Should Use This

  • AI/Agent developers — Building knowledge-aware agents
  • RAG engineers — Populating memory for context injection
  • Teams using Discord — Leveraging Discord as knowledge base
  • DevOps/MLOps — Automated knowledge ingestion pipelines
  • Researchers — Structuring unstructured data sources

⚡ Performance

Operation                 Speed      Notes
Ingest 100 files          ~5 sec     With SHA1 dedup check
Ingest 1000 JSON items    ~15 sec    Batch processing
Small forum (<100 msgs)   ~10 sec    Full mode
Large forum (1000+ msgs)  ~2 min     Auto-batch, streaming
Rebuild clusters          ~5-30 sec  Depends on total memories

✅ Quality Metrics

Metric         Value
Tests          22/22 passing ✅
Code           1,254 production lines
Documentation  92 KB across 11 guides
Examples       20+ working examples
Coverage       100% critical paths

🔗 Integration with ClawText

  1. Ingest data → Creates memories with YAML metadata
  2. Rebuild clusters → ClawText indexes new memories
  3. RAG layer → Relevant context injected on next prompt
  4. Agent response → Enhanced with contextual information

# Complete workflow
clawtext-ingest-discord fetch-discord --forum-id ID  # Step 1
clawtext-ingest rebuild                               # Step 2
# Step 3-4 automatic (ClawText + Agent)

🆘 Support


📦 Installation & Requirements

Requirements:

  • Node.js ≥ 18.0.0
  • OpenClaw (for agent patterns)
  • ClawText ≥ 1.2.0 (for RAG integration)

Installation:

npm install clawtext-ingest
# or
openclaw install clawtext-ingest

Binaries:

  • clawtext-ingest — File/URL/JSON ingestion
  • clawtext-ingest-discord — Discord integration

🚀 Why This Over Alternatives

Feature ClawText-Ingest Manual Generic Importer API Tool
Discord native
Deduplication Partial
Agent patterns
Metadata auto Partial
ClawText integration
Idempotent Partial

📄 License

MIT — Use freely, open source, community supported


🙌 Contributing

Contributions welcome! See GitHub issues for current priorities.


Ready to ingest? Start with QUICKSTART.md (5 min) or AGENT_GUIDE.md if you're building agents.

Security Recommendations
This skill appears to implement what it claims (Discord + multi-source ingestion), but there are some red flags you should address before installing:

  1. The docs/code require a Discord bot token (DISCORD_TOKEN) but the skill metadata lists no required env vars. Ask the publisher to declare required env vars in the skill manifest.
  2. SKILL.md contains hidden unicode control characters (possible prompt-injection attempt); inspect and sanitize the file before trusting automated reviews.
  3. Review the source files that interact with Discord and the network (src/adapters/discord.js, bin/discord.js, src/agent-runner.js) to confirm:
     a) tokens are not logged or uploaded to unknown endpoints,
     b) attachments are handled safely (where they are saved, whether external URLs are fetched), and
     c) network endpoints are only Discord/GitHub/expected services.
  4. Run the package in an isolated environment first (no production tokens); if you must provide a bot token, give the bot minimal read-only scopes and be ready to revoke it.
  5. Prefer installing from the upstream GitHub repo referenced in the docs (verify the repo owner and commits) rather than trusting an unverified registry snapshot.

If the maintainer cannot clarify the missing metadata and the unicode-control characters, treat the package as untrusted.
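
To act on the advice about hidden unicode control characters, a reviewer could scan the file contents before trusting them. This is a minimal sketch, with findHiddenChars as a hypothetical helper; it flags characters in the Unicode general categories Cc (control) and Cf (format), which is one reasonable reading of "control characters" but not necessarily the rule the marketplace scanner applies.

```javascript
// Flags invisible control and format characters (Unicode categories Cc and
// Cf), excluding ordinary tab, line feed, and carriage return.
function findHiddenChars(text) {
  const hits = [];
  const pattern = /[\p{Cc}\p{Cf}]/gu;
  for (const match of text.matchAll(pattern)) {
    const ch = match[0];
    if (ch === '\t' || ch === '\n' || ch === '\r') continue;
    hits.push({
      index: match.index,
      codePoint: 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'),
    });
  }
  return hits;
}

// Sample strings: the second hides a right-to-left override and a
// zero-width space.
console.log(findHiddenChars('safe text'));
console.log(findHiddenChars('run\u202Ethis\u200B'));
```

Expect some benign hits in emoji-heavy files: the zero-width joiner (U+200D) used in composite emoji is a format character, so each finding needs a manual look rather than automatic rejection.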
Functional Analysis
Type: OpenClaw Skill
Name: clawtext-ingest
Version: 1.0.1

The clawtext-ingest bundle provides a comprehensive data ingestion suite for AI agents, supporting Discord, local files, and URLs. While the code appears to follow its stated purpose, it contains high-risk capabilities that lack sufficient input sanitization, creating a significant attack surface. Specifically, src/index.js implements file ingestion using broad glob patterns and network ingestion via fetch, which could be exploited for path traversal or Server-Side Request Forgery (SSRF) if the agent is manipulated. Additionally, the tool requires access to sensitive environment variables like DISCORD_TOKEN (bin/discord.js) and performs destructive file operations in src/index.js (rebuildClusters) to manage its workspace. These high-privilege operations are plausibly necessary for the tool's function but represent risky behaviors without clear boundaries.
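
The path traversal and SSRF risks named above are commonly reduced by validating inputs before they reach the file system or the network. The sketch below shows two such guards; isInsideRoot and isAllowedUrl are hypothetical helpers written for illustration, not functions from the package, and a production guard would also need to consider symlinks, redirects, and DNS resolution to private addresses.

```javascript
const path = require('node:path');

// Hypothetical guard: true only if candidate resolves to a location
// inside root (or to root itself).
function isInsideRoot(root, candidate) {
  const resolvedRoot = path.resolve(root);
  const resolved = path.resolve(resolvedRoot, candidate);
  const rel = path.relative(resolvedRoot, resolved);
  const escapes = rel === '..' || rel.startsWith('..' + path.sep) || path.isAbsolute(rel);
  return !escapes;
}

// Hypothetical guard: accept only https URLs whose exact host is allowlisted.
function isAllowedUrl(rawUrl, allowedHosts) {
  let url;
  try {
    url = new URL(rawUrl);
  } catch {
    return false; // not a parseable absolute URL
  }
  return url.protocol === 'https:' && allowedHosts.includes(url.hostname);
}

console.log(isInsideRoot('/srv/docs', 'guide/intro.md'));
console.log(isInsideRoot('/srv/docs', '../../etc/passwd'));
console.log(isAllowedUrl('https://docs.example.com/api', ['docs.example.com']));
console.log(isAllowedUrl('http://169.254.169.254/latest/meta-data', ['docs.example.com']));
```

Matching the hostname exactly, rather than by suffix or substring, avoids lookalike hosts such as docs.example.com.evil.test.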
Capability Assessment
Purpose & Capability
The skill is legitimately a multi-source ingestion tool (Discord, files, URLs, JSON) and the included code supports that. However, the skill metadata claims no required environment variables while the runtime docs and code clearly rely on a DISCORD_TOKEN (process.env.DISCORD_TOKEN) for Discord ingestion. Declaring zero required env vars is inconsistent with the documented/implemented functionality.
Instruction Scope
SKILL.md and AGENT_GUIDE direct agents to read local files, ingest JSON/URLs, call the Discord API, run CLI subprocesses (execSync / execFile), and persist hashes to disk (.ingest_hashes.json). Those actions are within the stated purpose but the SKILL.md contains detected 'unicode-control-chars' (prompt-injection signal) which could indicate hidden characters intended to manipulate automated reviewers or agent behavior. The instructions also give agents broad discretion to run subprocesses and scheduled jobs — expected for this tool but increases risk if the skill is untrusted.
Install Mechanism
There is no non-standard external installer or download URL shown; the README suggests normal npm/openclaw installation and repository references a GitHub URL. The package includes source, bins, and package.json (no installer that fetches arbitrary archives from unknown hosts). This is proportionate to a Node.js CLI/library.
Credentials
The skill metadata declares no required environment variables, but the docs and code require DISCORD_TOKEN for Discord ingestion. Requiring a Discord bot token is proportionate to the feature, but the omission from declared requirements is incoherent and surprising. The skill asks agents to pass environment tokens into subprocesses (e.g., execSync with DISCORD_TOKEN), so confirm token handling, storage, and that only minimum bot scopes are used.
Persistence & Privilege
always:false (normal). The skill provides autonomous patterns for agents (ingestForumAutonomous, cron jobs, CLI subprocesses) and writes local state (e.g., .ingest_hashes.json, optional outputPath). Autonomous invocation plus the ability to spawn subprocesses is expected for this tooling, but it increases the blast radius if the skill or its maintainer were untrusted—review where files are written and what is sent over the network.
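How to Use
  1. Make sure OpenClaw is installed (locally or via Docker)
  2. Enter the install command in the chat box: /install clawtext-ingest
  3. Once installed, invoke the skill by name or trigger it with /clawtext-ingest
  4. Provide the inputs described in the skill's parameter documentation to get structured output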
Version History
v1.0.1
Adds Discord ingestion, CLI, and agent-ready documentation patterns

  • Major rewrite: Adds Discord forum/channel/thread ingestion with hierarchy preservation and real-time progress
  • New CLI tools: clawtext-ingest and clawtext-ingest-discord for one-command ingestion from files, URLs, JSON, and Discord
  • Expanded documentation: 11 new guides and references covering agent patterns, CLI use, enhancement workflows, and Discord setup
  • Improved agent integration: Six documented ingestion patterns for direct API, CLI, batch, Discord agent, cron/scheduled, and thread-specific use
  • Updated deduplication and error handling for robust, production-ready multi-source ingestion
v1.0.0
  • New major release: Multi-source memory ingestion for OpenClaw agents with deduplication and YAML headers.
  • Supports ingestion from files, URLs, JSON exports, and raw text with flexible metadata and per-source deduplication.
  • Automatically adds YAML frontmatter (date, project, type, etc.) and indexes entities for RAG.
  • Integrates directly with the ClawText RAG layer for automatic cluster rebuilding after import.
  • Includes batch processing, data transformation hooks, and idempotent daily/recurring ingestion patterns.
  • Provides robust troubleshooting, performance guidance, and examples for agent integration.
Metadata
Slug: clawtext-ingest
Version: 1.0.1
License:
Total installs: 1
Current installs: 1
Versions: 2
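FAQ

What is Clawtext Ingest?

Multi-source memory ingestion with Discord support, automatic deduplication, and agent-ready patterns. It is an AI agent skill plugin for Claude Code / OpenClaw, with 411 total downloads to date.

How do I install Clawtext Ingest?

Run the command "/install clawtext-ingest" in the OpenClaw or Claude Code chat box for one-step installation; no additional configuration is required.

Is Clawtext Ingest free?

Yes. Clawtext Ingest is completely free (open source) and can be freely downloaded, installed, and used.

Which platforms does Clawtext Ingest support?

Clawtext Ingest is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who develops Clawtext Ingest?

It is developed and maintained by ragesaq (@ragesaq); the current version is v1.0.1.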
