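# Long Context RAG Analyzer

`/install long-context-rag-analyzer`

## Latest AI Technology Updates [updated 2026-05-25]

| Update type | Summary | Impact |
|-------------|---------|--------|
| AI technology | The 2026 MCP protocol's three-layer architecture supports dynamic permission control and structured data validation | RAG architecture guidance needs to add MCP integration and compliance requirements |
| AI technology | Long-context RAG combined with MCP tool integration has become a mainstream architecture pattern | RAG architecture guidance needs to add MCP integration and compliance requirements |
| AI technology | Enterprise RAG systems need to address data security, access control, and compliance requirements | RAG architecture guidance needs to add MCP integration and compliance requirements |

Data as of: 2026-05-25 | Sources: National Financial Regulatory Administration, EY Q1 analysis, public industry information

Disclaimer: These updates are for reference only; defer to the latest official releases.
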
## Overview

With Gemini 3.1 Ultra's 2M-token context window and DeepSeek V4's 1M-token context, the era of "dump everything into the prompt" has arrived. But raw context isn't enough. The real challenge is building retrieval systems that extract the right information, rank it by relevance, and synthesize it into actionable insights. This skill provides a framework for building, evaluating, and optimizing long-context RAG pipelines for professional use.

## Title

Long Context RAG Analyzer: From Massive Documents to Actionable Insights

## Triggers

- "long context analysis" / "长文本分析" / "超长文档分析"
- "RAG optimization" / "RAG优化" / "检索增强生成"
- "document chunking" / "文档分块策略"
- "hybrid search" / "混合检索" / "向量搜索"
- "context window optimization" / "上下文窗口优化"
- "multi-document reasoning" / "多文档推理"
- "retrieval quality evaluation" / "检索质量评估"
- "financial report RAG" / "财报RAG分析"
- "legal document analysis" / "法律文书分析"
- "research paper synthesis" / "论文综合分析"
- "100K token" / "1M token" / "2M token context"
- "vector database" / "向量数据库"

---

## Workflow

### Phase 1 — Document Intake & Preprocessing

**Step 1.1: Document Classification**

Classify incoming documents by type, structure, and processing priority.

**Document Taxonomy:**

| Category | Examples | Structure | Processing Priority |
|----------|----------|-----------|---------------------|
| Financial Report | Annual report, 10-K, earnings transcript | Semi-structured, tables | CRITICAL |
| Legal Contract | Insurance policy, loan agreement | Highly structured, dense | HIGH |
| Research Paper | Academic paper, market study | Well-structured, citations | MEDIUM |
| Internal Memo | Meeting notes, internal email | Unstructured | LOW |
| Regulatory Filing | CBIRC submission, SEC filing | Structured, tabular | CRITICAL |

**Step 1.2: Metadata Extraction**

Extract key metadata to enable filtering and ranking.

**Required metadata:**
- Document ID, title, date, source
- Entity mentions (companies, people, products)
- Key dates (report period, deadlines, event dates)
- Sentiment/tone indicators
- Page count, token count (estimated)

**For financial reports specifically:**
- Company name, ticker, fiscal period
- Revenue, net income, key ratios (extracted if available)
- Auditor, filing date
- Related entities (subsidiaries, parent companies)

---

### Phase 2 — Chunking Strategy Selection

**Step 2.1: Choose Chunking Approach**

Different document types require different chunking strategies. Select one using the matrix below.

**Chunking Strategy Matrix:**

| Strategy | Best For | Chunk Size | Overlap | Preserves |
|----------|----------|------------|---------|-----------|
| Fixed-size | Homogeneous content (logs, tickets) | 512-1024 tokens | 50-100 tokens | Speed |
| Semantic | Paragraph-level meaning | 512-1500 tokens | 10-20% | Coherence |
| Document-structure | Reports, contracts, papers | By section/chapter | 100-200 tokens | Structure |
| Recursive | Nested content | Adaptive, 256-1024 tokens | 15% | Hierarchy |
| Agentic | Mixed content types | Dynamic | Context-aware | Intent |

Recommended strategies:
- Financial reports: Semantic + Document-structure hybrid
- Legal contracts: Recursive with section boundaries
- Research papers: Document-structure by section + citation graph
\r
**Step 2.2: Calculate Optimal Chunk Size**\r
\r
```python
# Chunk size calculator
def calculate_optimal_chunk_size(document_tokens, query_pattern):
    """Return (chunk_size, overlap) in tokens for a described query pattern.

    `document_tokens` is accepted for interface compatibility but is not
    used by this heuristic.
    """
    # Estimate based on query complexity
    if "detailed analysis" in query_pattern or "deep dive" in query_pattern:
        chunk_size = 1500  # Larger chunks for complex queries
    elif "comparison" in query_pattern or "summary" in query_pattern:
        chunk_size = 2048  # Section-level for comparative analysis
    elif "specific fact" in query_pattern or "look up" in query_pattern:
        chunk_size = 256  # Small chunks for precise retrieval
    else:
        chunk_size = 768  # Default

    overlap = int(chunk_size * 0.15)  # 15% overlap
    return chunk_size, overlap
```
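
Once a chunk size and overlap are chosen, applying them is a sliding-window split. The sketch below is a minimal illustration, assuming the document is already tokenized into a list; `chunk_tokens` is a hypothetical helper, not part of the skill.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split a token list into windows of `chunk_size` that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Tiny demonstration with placeholder tokens t0..t9
tokens = [f"t{i}" for i in range(10)]
chunks = chunk_tokens(tokens, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window starts `chunk_size - overlap` tokens after the previous one, so the last token of one chunk reappears at the start of the next.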
\r
---\r
\r
### Phase 3 — Indexing & Retrieval\r
\r
**Step 3.1: Hybrid Search Setup**\r
\r
Combine vector similarity search with keyword (BM25) search for optimal retrieval.\r
\r
**Hybrid Search Architecture:**\r
\r
```
Query → [Vector Search (cosine similarity)] ←→ [BM25 Keyword Search]
                         ↓                               ↓
              Top-K semantic results           Top-K keyword results
                         ↓                               ↓
              Reciprocal Rank Fusion (RRF) → Final ranked results
```
\r
**Configuration for different use cases:**\r
\r
```python
# China financial report RAG: hybrid config
HYBRID_CONFIG = {
    "vector": {
        "model": "text-embedding-3-large",  # 3072 dim for high quality
        "dimension": 3072,
        "召回率_top_k": 20,  # recall-stage top-k: candidates fetched before fusion
        "similarity_threshold": 0.75,
    },
    "keyword": {
        "algorithm": "BM25",
        "k1": 1.5,  # term-frequency saturation
        "b": 0.75,  # document-length normalization
        "召回率_top_k": 20,  # recall-stage top-k for keyword search
    },
    "fusion": {
        "method": "RRF",  # Reciprocal Rank Fusion
        "rrf_k": 60,  # Standard RRF parameter
    },
    "rerank": {
        "model": "cross-encoder/ms-marco-MiniLM-L-12-v2",
        "top_n": 5,  # Final reranked results
    },
}
```
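
The fusion step and the `rrf_k` setting can be illustrated with a short sketch of Reciprocal Rank Fusion. It assumes each retriever returns an ordered list of chunk IDs; the function name `reciprocal_rank_fusion` and the sample IDs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], rrf_k: int = 60) -> List[str]:
    """Fuse several ranked lists: each list adds 1 / (rrf_k + rank) to a chunk's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (rrf_k + rank)
    # Highest fused score first; ties broken by ID for deterministic output
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Placeholder results from the two retrievers
vector_hits = ["c3", "c1", "c7"]
keyword_hits = ["c1", "c9", "c3"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits], rrf_k=60)
print(fused)
```

Chunks that appear in both lists accumulate two contributions, so they tend to outrank chunks found by only one retriever. The fused list would then be passed to the reranker, which keeps the top `top_n`.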
\r
**Step 3.2: Retrieval Quality Evaluation**\r
\r
Evaluate the RAG pipeline before deploying.\r
\r
**Metrics to measure:**\r
\r
| Metric | What it measures | Target |\r
|--------|----------------|--------|\r
| Precision@K | % of retrieved docs relevant | > 0.85 |\r
| Recall@K | % of relevant docs retrieved | > 0.80 |\r
| MRR (Mean Reciprocal Rank) | Rank of first relevant doc | > 0.70 |\r
| NDCG@K | Ranking quality weighted by relevance | > 0.75 |\r
| Context Precision | % of context chunks actually used | > 0.60 |\r
| Hallucination Rate | Factual errors per 1000 tokens | < 0.05 |
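
The four retrieval metrics in the table have standard definitions that fit in a few lines. The sketch below assumes binary relevance labels and unique chunk IDs per result list; the function names and sample data are illustrative only.

```python
import math
from typing import List, Set


def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / k


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none is found."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, d in enumerate(retrieved[:k], start=1)
        if d in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Placeholder query results: relevant items are a, b, c, d
retrieved = ["a", "x", "b", "y", "c"]
relevant = {"a", "b", "c", "d"}
print(precision_at_k(retrieved, relevant, 5), recall_at_k(retrieved, relevant, 5))

# MRR is the mean reciprocal rank over a set of queries
queries = [(retrieved, relevant), (["x", "a"], {"a"})]
mrr = sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)
print(mrr, ndcg_at_k(retrieved, relevant, 5))
```

Context Precision and Hallucination Rate depend on the generated answer rather than the ranked list, so they need answer-level annotation and are not covered here.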
\r
**Example evaluation:**\r
\r
```\r
## RAG Pipeline Evaluation Report\r
\r
Test Set: 50 financial Q&A pairs from annual reports\r
Index: 120 documents (5 years × 24 companies)\r
Chunk size: 1024 tokens, 15% overlap\r
\r
### Retrieval Metrics\r
- Precision@5: 0.89 ✅\r
- Recall@10: 0.82 ✅\r
- MRR: 0.76 ✅\r
- NDCG@5: 0.81 ✅\r
\r
### Quality Issues Identified\r
❌ Table data losing structure when chunked — fix: preserve tables as JSON chunks\r
❌ Chinese financial terms inconsistently embedded — fix: add bilingual glossary\r
⚠️ Long queries (>500 tokens) retrieving irrelevant context — fix: query compression\r
\r
### Action Plan\r
1. [HIGH] Implement table-aware chunking for financial tables\r
2. [MEDIUM] Add financial terminology glossary to embedding model\r
3. [LOW] Add a query compression preprocessing layer
```\r
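
The first fix in the sample report, preserving tables as JSON chunks, could look roughly like the sketch below. It assumes the table has already been parsed into a header row and data rows; `table_to_json_chunk`, the payload field names, and the sample values are hypothetical placeholders.

```python
import json
from typing import Dict, List


def table_to_json_chunk(headers: List[str], rows: List[List[str]], caption: str = "") -> str:
    """Serialize a parsed table as one JSON chunk so each cell stays tied to its header."""
    records: List[Dict[str, str]] = [dict(zip(headers, row)) for row in rows]
    payload = {"type": "table", "caption": caption, "rows": records}
    # ensure_ascii=False keeps Chinese headers and values readable in the chunk text
    return json.dumps(payload, ensure_ascii=False)


# Placeholder table; the numbers are made up for illustration
headers = ["Metric", "FY2024", "FY2025"]
rows = [["Revenue", "100", "110"], ["Net income", "10", "12"]]
chunk = table_to_json_chunk(headers, rows, caption="Key financials (placeholder)")
print(chunk)
```

Keeping the whole table in one chunk avoids splitting a row from its column headers, at the cost of larger chunks for long tables.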
\r
---\r
\r
### Phase 4 — Multi-Document Reasoning\r
\r
**Step 4.1: Cross-Document Synthesis**\r
\r
When a query spans multiple documents (e.g., "compare 5-year revenue trends across 3 insurers"), synthesize findings across documents.\r
\r
**Synthesis Strategy:**\r
\r
```\r
1. Retrieve top-K chunks from each document\r
2. Group by document and dimension (revenue, cost, risk, etc.)\r
3. For each dimension, generate a summary finding\r
4. Cross-reference findings — flag contradictions\r
5. Generate comparative analysis with supporting citations\r
6. Format as structured report with confidence scores\r
```\r
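
Steps 2 and 4 of the strategy above (grouping and contradiction flagging) can be sketched as follows. This assumes an earlier extraction step has already turned each retrieved chunk into a structured finding; the `Finding` fields and the exact-match contradiction rule are simplifying assumptions, not the skill's actual logic.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Finding:
    doc_id: str     # source document
    dimension: str  # e.g. "revenue", "cost", "risk"
    key: str        # what the value describes, e.g. "InsurerA/2024"
    value: str      # extracted value as text


def group_by_dimension(findings: List[Finding]) -> Dict[str, List[Finding]]:
    """Step 2: bucket findings by dimension so each can be summarized separately."""
    grouped: Dict[str, List[Finding]] = defaultdict(list)
    for f in findings:
        grouped[f.dimension].append(f)
    return dict(grouped)


def flag_contradictions(findings: List[Finding]) -> List[Tuple[str, str]]:
    """Step 4: report (dimension, key) pairs for which sources give different values."""
    values: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for f in findings:
        values[(f.dimension, f.key)].add(f.value)
    return [pair for pair, seen in values.items() if len(seen) > 1]


# Placeholder findings from three hypothetical documents
findings = [
    Finding("doc1", "revenue", "InsurerA/2024", "120"),
    Finding("doc2", "revenue", "InsurerA/2024", "125"),
    Finding("doc3", "risk", "InsurerA/2024", "interest rate exposure"),
]
grouped = group_by_dimension(findings)
contradictions = flag_contradictions(findings)
print(sorted(grouped), contradictions)
```

Exact string comparison will also flag harmless differences such as rounding or units, so a real pipeline would likely normalize values before comparing them.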
\r
**Step 4.2: Financial Report Pipeline (Specialized)**\r
\r
Tailored workflow for analyzing financial reports (annual reports, 10-Ks, CBIRC filings).\r
\r
**Pipeline:**\r
\r
```\r
1. PDF Ingestion → Structured Text + Tables\r
2. Page-level chunking (preserve table structure)\r
3. Entity extraction: company names, financial metrics, dates\r
4. Section classification: 业务回顾 (business review), 财务报表 (financial statements), 风险因素 (risk factors), 治理结构 (governance structure)
5. Index with financial metadata filters\r
6. Query interface: natural language → structured answer + source citations\r
```\r
\r
**Example query:**\r
> "Compare 国寿, 平安, 太保's solvency margin ratios over the past 3 years, and identify which company showed the most improvement."\r
\r
**Output:**\r
```\r
## Solvency Margin Comparison: 国寿 vs 平安 vs 太保 (2023-2025)\r
\r
| Company | 2023 | 2024 | 2025 | Change | Rating |\r
|---------|------|------|------|--------|--------|\r
| 国寿 | 218% | 224% | 231% | +13pp ⬆️ | Strong |\r
| 平安 | 195% | 201% | 208% | +13pp ⬆️ | Adequate |\r
| 太保 | 189% | 197% | 205% | +16pp ⬆️ | Adequate |\r
\r
### Key Findings\r
1. **太保 showed the strongest improvement** (+16pp) driven by capital raise\r
2. **国寿 maintains the highest absolute level** (231%), well above CBIRC minimum\r
3. **平安 is most consistent** — steady improvement trajectory\r
4. **Risk:** All three face pressure from interest rate environment in Q2 2026\r
\r
Sources: 国寿 2025 Annual Report p.42, 平安 2025 Annual Report p.38, 太保 2025 Annual Report p.35\r
Confidence: 92%\r
```\r
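
The "Change" column and the "strongest improvement" finding in the sample output can be checked with a few lines of arithmetic. The ratios are copied from the sample table; the variable and function names are illustrative.

```python
from typing import Dict

# Solvency margin ratios (%) from the sample output table
ratios: Dict[str, Dict[int, int]] = {
    "国寿": {2023: 218, 2024: 224, 2025: 231},
    "平安": {2023: 195, 2024: 201, 2025: 208},
    "太保": {2023: 189, 2024: 197, 2025: 205},
}


def pp_change(series: Dict[int, int]) -> int:
    """Change in percentage points between the earliest and latest year."""
    years = sorted(series)
    return series[years[-1]] - series[years[0]]


changes = {company: pp_change(series) for company, series in ratios.items()}
most_improved = max(changes, key=changes.get)
highest_level = max(ratios, key=lambda c: ratios[c][2025])

for company, delta in changes.items():
    print(f"{company}: +{delta}pp")
print("Most improved:", most_improved, "| Highest 2025 level:", highest_level)
```

Recomputing derived figures like this from the cited values is a cheap guard against arithmetic errors in generated reports.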
\r
---\r
\r
## Input / Output Examples\r
\r
### Example 1: Insurance CBIRC Filing Analysis\r
\r
**Input:**\r
```\r
I need to analyze CBIRC's 2025 Insurance Industry Annual Report\r
(~300 pages, Chinese). Extract:\r
1. Key industry metrics (total premium, solvency, claims ratio)\r
2. Regulatory changes from 2023-2025\r
3. Top 5 market trends\r
4. Investment risk warnings\r
\r
Build a RAG pipeline so I can query this report interactively.\r
```\r
\r
**Output:**\r
```\r
## CBIRC Filing RAG Pipeline — Setup Complete\r
\r
### Document Processing\r
- Input: 300 pages (~420K tokens)\r
- Chunking: Semantic by section + recursive for tables\r
- Total chunks: 847\r
- Estimated indexing time: 4 minutes (local embedding)\r
\r
### Index Configuration\r
- Embedding: text-embedding-3-large (3072 dims)\r
- Storage: ChromaDB (local, no API needed)\r
- Language: Chinese optimized (use 'paraphrase-multilingual-MiniLM-L12-v2')\r
\r
### Interactive Query Interface\r
\r
Query 1: "What were the total insurance premiums in 2025?"\r
Answer: ¥3.2 trillion (+8.3% YoY). Source: p.15\r
Confidence: 95%\r
\r
Query 2: "What new regulations were introduced for group insurance?"\r
Answer: [List of 4 regulatory updates with dates and reference pages]\r
Confidence: 88%\r
\r
Query 3: "What are the top investment risks mentioned?"\r
Answer: [3 key risks with severity ratings and regulatory references]\r
Confidence: 91%\r
\r
### Query Templates for Your Analysis\r
- "Compare [metric] across [companies/time periods]"\r
- "List all [regulatory change / product type / market trend]"\r
- "What drove [increase/decrease] in [metric]?"\r
- "Identify risk factors related to [theme]"\r
\r
### Export Options\r
- Generate comparative table: CSV/Excel\r
- Generate narrative report: Markdown/DOCX\r
- Generate slide deck: PowerPoint\r
```\r
\r
### Example 2: Contract Review with Long Context\r
\r
**Input:**\r
```\r
I have a 200-page group insurance contract. I need to:\r
1. Extract all liability exclusions\r
2. Find clauses with financial thresholds > ¥500K\r
3. Identify auto-renewal and termination clauses\r
4. Compare against a standard template\r
\r
Can I do this with RAG?\r
```\r
\r
**Output:**\r
```\r
## Contract Analysis — RAG-Enhanced Review\r
\r
### Document: [Group Insurance Contract v3.2]\r
- Pages: 200 (~280K tokens)\r
- Processing: Full document upload (2M context can handle directly)\r
- Chunking: Section-preserving (chapter-level) + clause-level for key sections\r
\r
### Key Findings\r
\r
**1. Liability Exclusions (12 found)**\r
| Clause ID | Exclusion Type | Page | Severity |\r
|-----------|---------------|------|----------|\r
| EX-003 | Pre-existing conditions | p.24 | HIGH |\r
| EX-007 | Natural disaster cap | p.31 | MEDIUM |\r
| EX-011 | War/nuclear risk | p.45 | STANDARD |\r
\r
**2. Financial Thresholds > ¥500K (4 found)**\r
| Clause | Threshold | Type | Page |\r
|--------|-----------|------|------|\r
| CL-015 | ¥2M | Claim limit | p.52 |\r
| CL-022 | ¥800K | Deductible | p.63 |\r
| CL-031 | ¥5M | Annual aggregate | p.71 |\r
\r
**3. Auto-Renewal & Termination Clauses**\r
- Auto-renewal: p.89 — 30-day notice to cancel, otherwise auto-renew\r
- Termination for non-payment: p.91 — Policy lapses after 30 days past due\r
- CBIRC-mandated cooling period: p.93 — 15-day free look period ✅\r
\r
**4. Comparison vs. Standard Template**\r
Deviations from standard CBIRC group insurance template:\r
- ⚠️ Liability cap 15% lower than standard\r
- ⚠️ Deductible 20% higher than standard \r
- ⚠️ 5 additional exclusions not in standard (review for reasonableness)
- ✅ Cooling period compliant with CBIRC requirements\r
\r
### Recommended Actions\r
1. [URGENT] Renegotiate CL-015 claim limit upward\r
2. [HIGH] Add actuarial justification memo for non-standard exclusions\r
3. [MEDIUM] Standardize auto-renewal notice period to 45 days (recommended)\r
```\r
\r
---\r
\r
## Advanced: Context Window Optimization\r
\r
When the document exceeds the model's context window (even with 2M tokens):\r
\r
**Tiered retrieval strategy:**\r
\r
```\r
Level 1 — Global overview: Summarize entire corpus (50-100 chunks → 1 summary)\r
Level 2 — Topic-level: Identify relevant sections (~20 chunks → section summaries)\r
Level 3 — Granular: Retrieve specific chunks for final synthesis (~5 chunks → answer)\r
```\r
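
Levels 2 and 3 of the tiered strategy can be sketched as a two-stage filter: rank sections by their summaries, then rank chunks only inside the winning sections. The word-overlap scorer below is a stand-in for a real embedding or BM25 score (whitespace splitting would not work for Chinese text), and the section layout and names are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def overlap_score(query: str, text: str) -> int:
    """Stand-in relevance score: number of query words that also appear in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def tiered_retrieve(
    query: str,
    sections: Dict[str, Dict[str, object]],
    top_sections: int = 2,
    top_chunks: int = 2,
) -> List[Tuple[str, str]]:
    """Pick the best sections by summary, then the best chunks within them."""
    # Level 2: topic-level filtering on section summaries
    ranked = sorted(
        sections,
        key=lambda name: overlap_score(query, str(sections[name]["summary"])),
        reverse=True,
    )[:top_sections]
    # Level 3: granular retrieval restricted to the selected sections
    candidates = [(name, chunk) for name in ranked for chunk in sections[name]["chunks"]]
    candidates.sort(key=lambda pair: overlap_score(query, pair[1]), reverse=True)
    return candidates[:top_chunks]


# Placeholder corpus: each section has a summary and its chunks
sections = {
    "solvency": {
        "summary": "solvency ratio and capital adequacy",
        "chunks": ["core solvency ratio rose during the year", "capital adequacy remained stable"],
    },
    "premiums": {
        "summary": "premium income by product line",
        "chunks": ["premium income grew in life products"],
    },
    "governance": {
        "summary": "board structure and committees",
        "chunks": ["the board has three committees"],
    },
}
results = tiered_retrieve("solvency ratio trend", sections, top_sections=2, top_chunks=1)
print(results)
```

Only the chunks that survive both stages reach the model, which keeps the prompt small even when the corpus is far larger than the context window.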
\r
**For Chinese documents, special considerations:**\r
- Use bilingual or Chinese-specialized embedding models\r
- Handle mixed Chinese/English terminology consistently\r
- Preserve financial terminology precision (exact translation of regulatory terms)\r
- Check CBIRC-specific glossaries for regulatory documents\r
\r
---\r
\r
## Notes & Best Practices\r
\r
1. **Chunking is 80% of RAG quality.** Invest time in domain-specific chunking strategies rather than defaulting to fixed-size chunks.\r
2. **Context window ≠ useful context.** A model that can read 2M tokens still performs better when retrieval is precise. Don't skip the retrieval optimization layer.\r
3. **Chinese financial documents:** Annual reports in Chinese often contain dense tabular data and regulatory citations. Use table-aware chunking and add CBIRC/CIRC glossary terms to your embedding model.\r
4. **Hallucination guardrails:** Always require citations (page numbers, section refs) in RAG outputs for financial and legal use cases.\r
5. **Hybrid search > vector-only.** Pure vector search can miss exact-match queries such as tickers, clause IDs, and regulatory terms. Combine it with keyword search and fuse the two rankings with RRF.
6. **Reranking is essential** for long-context RAG — the first-pass retrieval is noisy.\r
7. **Cost management:** Long-context inference is expensive. Use hierarchical retrieval (summarize → retrieve → synthesize) instead of dumping everything into context.\r
\r
---\r
\r
*Author: @gechengling | Skill: long-context-rag-analyzer | clawhub.ai/gechengling/long-context-rag-analyzer*\r
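- Make sure OpenClaw is installed (local or Docker deployment)
- Enter the install command in the chat box: /install long-context-rag-analyzer
- Once installed, call the Skill by name or trigger it with /long-context-rag-analyzer
- Provide the inputs described in the Skill's parameters to get structured output
What is Long Context Rag Analyzer?
Analyze and optimize retrieval-augmented generation pipelines for 100K–2M token documents using hybrid search, chunking, reasoning, and structured reporting. It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 53 downloads to date.
How do I install Long Context Rag Analyzer?
Run the command "/install long-context-rag-analyzer" in the OpenClaw or Claude Code chat box for a one-step install; no extra configuration is needed.
Is Long Context Rag Analyzer free?
Yes. Long Context Rag Analyzer is completely free under the MIT-0 license and can be freely downloaded, installed, and used.
Which platforms does Long Context Rag Analyzer support?
Long Context Rag Analyzer is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.
Who developed Long Context Rag Analyzer?
It is developed and maintained by lingfeng-19 (@gechengling). The current version is v3.0.1.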