Description

Analyze and optimize retrieval-augmented generation pipelines for 100K–2M token documents using hybrid search, chunking, reasoning, and structured reporting.

README (SKILL.md)


# Long Context RAG Analyzer

Name: Long Context Rag Analyzer
Author: gechengling


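## Latest AI Technology Updates [updated 2026-05-25]

| Update type | Summary | Impact |
|-------------|---------|--------|
| AI technology | The 2026 three-layer MCP protocol architecture supports dynamic permission control and structured data validation | RAG architecture guides need to add MCP integration and compliance requirements |
| AI technology | Integrating long-context RAG with MCP tools has become a mainstream architecture pattern | RAG architecture guides need to add MCP integration and compliance requirements |
| AI technology | Enterprise RAG systems must address data security, access control, and compliance requirements | RAG architecture guides need to add MCP integration and compliance requirements |

Data as of: 2026-05-25 | Sources: National Financial Regulatory Administration, EY Q1 analysis, public industry information

Disclaimer: The updates above are for reference only; defer to the latest official releases.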

## Overview

With Gemini 3.1 Ultra's 2M-token context window and DeepSeek V4's 1M-token context, the era of "dump everything into the prompt" has arrived. Raw context is not enough, though: the real challenge is building retrieval systems that extract the right information, rank it by relevance, and synthesize it into actionable insights. This skill provides a framework for building, evaluating, and optimizing long-context RAG pipelines for professional use.

## Title

Long Context RAG Analyzer: From Massive Documents to Actionable Insights

## Triggers

- "long context analysis" / "长文本分析" / "超长文档分析"
- "RAG optimization" / "RAG优化" / "检索增强生成"
- "document chunking" / "文档分块策略"
- "hybrid search" / "混合检索" / "向量搜索"
- "context window optimization" / "上下文窗口优化"
- "multi-document reasoning" / "多文档推理"
- "retrieval quality evaluation" / "检索质量评估"
- "financial report RAG" / "财报RAG分析"
- "legal document analysis" / "法律文书分析"
- "research paper synthesis" / "论文综合分析"
- "100K token" / "1M token" / "2M token context"
- "vector database" / "向量数据库"

---

## Workflow

### Phase 1: Document Intake & Preprocessing

**Step 1.1: Document Classification**

Classify incoming documents by type, structure, and processing priority.

**Document Taxonomy:**

| Category | Examples | Structure | Processing Priority |
|----------|----------|-----------|---------------------|
| Financial Report | Annual report, 10-K, earnings transcript | Semi-structured, tables | CRITICAL |
| Legal Contract | Insurance policy, loan agreement | Highly structured, dense | HIGH |
| Research Paper | Academic paper, market study | Well-structured, citations | MEDIUM |
| Internal Memo | Meeting notes, internal email | Unstructured | LOW |
| Regulatory Filing | CBIRC submission, SEC filing | Structured, tabular | CRITICAL |

**Step 1.2: Metadata Extraction**

Extract key metadata to enable filtering and ranking.

**Required metadata:**

- Document ID, title, date, source
- Entity mentions (companies, people, products)
- Key dates (report period, deadlines, event dates)
- Sentiment/tone indicators
- Page count, token count (estimated)

**For financial reports specifically:**

- Company name, ticker, fiscal period
- Revenue, net income, key ratios (extracted if available)
- Auditor, filing date
- Related entities (subsidiaries, parent companies)

---

### Phase 2: Chunking Strategy Selection

**Step 2.1: Choose Chunking Approach**

Different document types require different chunking strategies. Select one using the matrix below.

**Chunking Strategy Matrix:**

| Strategy | Best For | Chunk Size | Overlap | Preserves |
|----------|----------|------------|---------|-----------|
| Fixed-size | Homogeneous content (logs, tickets) | 512-1024 tokens | 50-100 tokens | Speed |
| Semantic | Paragraph-level meaning | 512-1500 tokens | 10-20% | Coherence |
| Document-structure | Reports, contracts, papers | By section/chapter | 100-200 tokens | Structure |
| Recursive | Nested content | Adaptive 256-1024 | 15% | Hierarchy |
| Agentic | Mixed content types | Dynamic | Context-aware | Intent |

Recommended strategy by document type:

- Financial reports: semantic + document-structure hybrid
- Legal contracts: recursive with section boundaries
- Research papers: document-structure by section + citation graph

**Step 2.2: Calculate Optimal Chunk Size**

```python
# Chunk size calculator
def calculate_optimal_chunk_size(document_tokens, query_pattern):
    """Return (chunk_size, overlap) in tokens for a query-pattern description."""
    pattern = query_pattern.lower()  # match keywords case-insensitively

    # Estimate based on query complexity
    if "detailed analysis" in pattern or "deep dive" in pattern:
        chunk_size = 1500  # Larger chunks for complex queries
    elif "comparison" in pattern or "summary" in pattern:
        chunk_size = 2048  # Section-level for comparative analysis
    elif "specific fact" in pattern or "look up" in pattern:
        chunk_size = 256   # Small chunks for precise retrieval
    else:
        chunk_size = 768   # Default

    # A chunk cannot usefully be larger than the document itself
    chunk_size = max(1, min(chunk_size, document_tokens))

    overlap = int(chunk_size * 0.15)  # 15% overlap
    return chunk_size, overlap
```
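A `(chunk_size, overlap)` pair like the one the calculator returns can drive a sliding-window splitter. The sketch below is a minimal illustration, not part of the skill itself: `chunk_tokens` is a hypothetical helper, and the token list stands in for the output of whatever tokenizer the pipeline uses. Each window starts `chunk_size - overlap` tokens after the previous one, so consecutive chunks share exactly `overlap` tokens.

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split a token list into windows of chunk_size sharing `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Placeholder tokens; a real pipeline would pass tokenizer output here.
tokens = [f"t{i}" for i in range(10)]
chunks = chunk_tokens(tokens, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```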
---

### Phase 3: Indexing & Retrieval

**Step 3.1: Hybrid Search Setup**

Combine vector similarity search with keyword (BM25) search so that both semantic matches and exact-term matches are retrieved.

**Hybrid Search Architecture:**

```
Query → [Vector Search (cosine similarity)] ←→ [BM25 Keyword Search]
              ↓                                    ↓
        Top-K semantic results              Top-K keyword results
              ↓                                    ↓
        Reciprocal Rank Fusion (RRF) → Final ranked results
```
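The fusion step in the diagram can be sketched in a few lines. In Reciprocal Rank Fusion each ranked list contributes `1 / (k + rank)` to a document's score, and the summed scores give the final order. The function name and the chunk IDs below are illustrative assumptions; `k=60` is the constant commonly used for RRF. Because RRF only uses ranks, the cosine and BM25 scores never need to be put on a common scale.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs; each list adds 1 / (k + rank) per doc."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical chunk IDs returned by the two retrievers.
vector_hits = ["c3", "c1", "c7"]
keyword_hits = ["c1", "c9", "c3"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```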
**Configuration for different use cases:**

```python
# China financial report RAG: hybrid config
HYBRID_CONFIG = {
    "vector": {
        "model": "text-embedding-3-large",  # 3072 dim for high quality
        "dimension": 3072,
        "召回率_top_k": 20,  # "召回率" = recall; candidates kept from vector search
        "similarity_threshold": 0.75
    },
    "keyword": {
        "algorithm": "BM25",
        "k1": 1.5,
        "b": 0.75,
        "召回率_top_k": 20  # candidates kept from keyword search
    },
    "fusion": {
        "method": "RRF",  # Reciprocal Rank Fusion
        "rrf_k": 60  # Standard RRF parameter
    },
    "rerank": {
        "model": "cross-encoder/ms-marco-MiniLM-L-12-v2",
        "top_n": 5  # Final reranked results
    }
}
```
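To make the `k1` and `b` settings concrete, here is a minimal Okapi BM25 scorer in pure Python. It is a sketch under stated assumptions rather than the skill's implementation: `bm25_scores` is a hypothetical helper, documents are pre-tokenized lists of terms, and the IDF uses the common smoothed form `ln((N - n + 0.5) / (n + 0.5) + 1)`. `k1` controls how quickly repeated terms saturate, and `b` controls how strongly long documents are penalized. The whitespace tokenization in the sample suits English only; Chinese text would need a word segmenter first.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against the query terms with Okapi BM25."""
    if not docs:
        return []
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of docs that contain each term.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Made-up sample chunks for illustration only.
docs = [
    "solvency ratio improved after capital raise".split(),
    "premium income grew on group insurance".split(),
    "solvency ratio and solvency margin disclosures".split(),
]
scores = bm25_scores(["solvency", "margin"], docs)
print(scores)
```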
**Step 3.2: Retrieval Quality Evaluation**

Evaluate the RAG pipeline before deploying.

**Metrics to measure:**

| Metric | What it measures | Target |
|--------|------------------|--------|
| Precision@K | Share of the top-K retrieved docs that are relevant | > 0.85 |
| Recall@K | Share of all relevant docs that appear in the top K | > 0.80 |
| MRR (Mean Reciprocal Rank) | Mean over queries of 1/rank of the first relevant doc | > 0.70 |
| NDCG@K | Ranking quality weighted by graded relevance | > 0.75 |
| Context Precision | Share of retrieved context chunks actually used in the answer | > 0.60 |
| Hallucination Rate | Factual errors per 1000 tokens | < 0.05 |

**Example evaluation:**

```
## RAG Pipeline Evaluation Report

Test Set: 50 financial Q&A pairs from annual reports
Index: 120 documents (5 years × 24 companies)
Chunk size: 1024 tokens, 15% overlap

### Retrieval Metrics
- Precision@5: 0.89 ✅
- Recall@10: 0.82 ✅
- MRR: 0.76 ✅
- NDCG@5: 0.81 ✅

### Quality Issues Identified
❌ Table data loses structure when chunked. Fix: preserve tables as JSON chunks
❌ Chinese financial terms inconsistently embedded. Fix: add bilingual glossary
⚠️ Long queries (>500 tokens) retrieve irrelevant context. Fix: query compression

### Action Plan
1. [HIGH] Implement table-aware chunking for financial tables
2. [MEDIUM] Add financial terminology glossary to the embedding pipeline
3. [LOW] Add a query compression preprocessing layer
```
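The retrieval metrics reported above can be computed with a few small functions. This is a minimal sketch using the standard definitions; the function names, document IDs, and relevance labels are illustrative assumptions. MRR is the mean of `reciprocal_rank` over all test queries, and NDCG uses a `log2(rank + 1)` discount.

```python
import math


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved IDs that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant IDs that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant ID, or 0.0 if none is retrieved."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved, gains, k):
    """NDCG@k; gains maps doc ID -> graded relevance (missing IDs count as 0)."""
    dcg = sum(gains.get(d, 0) / math.log2(i + 1)
              for i, d in enumerate(retrieved[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0


# One hypothetical query: ranked results and ground-truth labels.
retrieved = ["d2", "d5", "d1", "d9"]
relevant = {"d1", "d2", "d3"}
gains = {"d1": 1, "d2": 1, "d3": 1}

p = precision_at_k(retrieved, relevant, 4)   # 2 of the 4 retrieved are relevant
r = recall_at_k(retrieved, relevant, 4)      # 2 of the 3 relevant were retrieved
rr = reciprocal_rank(retrieved, relevant)    # first relevant hit is at rank 1
n = ndcg_at_k(retrieved, gains, 4)
print(p, r, rr, n)
```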
---

### Phase 4: Multi-Document Reasoning

**Step 4.1: Cross-Document Synthesis**

When a query spans multiple documents (e.g., "compare 5-year revenue trends across 3 insurers"), synthesize findings across documents.

**Synthesis Strategy:**

```
1. Retrieve top-K chunks from each document
2. Group by document and dimension (revenue, cost, risk, etc.)
3. For each dimension, generate a summary finding
4. Cross-reference findings and flag contradictions
5. Generate comparative analysis with supporting citations
6. Format as structured report with confidence scores
```
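Step 4 (cross-referencing and flagging contradictions) can be sketched as a grouping problem: facts about the same entity, metric, and period should agree across documents. The record schema, the `flag_contradictions` helper, the tolerance, and the sample values below are all hypothetical; a real pipeline would populate such records from the retrieved chunks in steps 1 to 3.

```python
from collections import defaultdict


def flag_contradictions(facts, tolerance=0.5):
    """Flag (entity, metric, period) keys whose reported values disagree."""
    by_key = defaultdict(list)
    for f in facts:
        by_key[(f["entity"], f["metric"], f["period"])].append(f)

    flags = []
    for key, items in by_key.items():
        values = [f["value"] for f in items]
        if max(values) - min(values) > tolerance:
            # Keep citations so a reviewer can check both sources.
            sources = sorted({f"{f['doc_id']} p.{f['page']}" for f in items})
            flags.append({"key": key,
                          "values": sorted(set(values)),
                          "sources": sources})
    return flags


# Made-up extracted facts for illustration only.
facts = [
    {"doc_id": "report_a", "page": 12, "entity": "InsurerX",
     "metric": "revenue_growth_pct", "period": 2024, "value": 8.3},
    {"doc_id": "report_b", "page": 40, "entity": "InsurerX",
     "metric": "revenue_growth_pct", "period": 2024, "value": 6.1},
    {"doc_id": "report_a", "page": 15, "entity": "InsurerY",
     "metric": "revenue_growth_pct", "period": 2024, "value": 5.0},
]
flags = flag_contradictions(facts)
print(flags)
```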
**Step 4.2: Financial Report Pipeline (Specialized)**

Tailored workflow for analyzing financial reports (annual reports, 10-Ks, CBIRC filings).

**Pipeline:**

```
1. PDF Ingestion → Structured Text + Tables
2. Page-level chunking (preserve table structure)
3. Entity extraction: company names, financial metrics, dates
4. Section classification: Business Review (业务回顾), Financial Statements (财务报表), Risk Factors (风险因素), Governance Structure (治理结构)
5. Index with financial metadata filters
6. Query interface: natural language → structured answer + source citations
```
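Pipeline step 2 asks for table structure to be preserved. One way to do that, sketched here on the assumption that PDF ingestion yields pipe-delimited tables, is to emit each table as a single JSON chunk of row records instead of splitting it by token count. `table_to_json_chunk` and the sample figures are hypothetical.

```python
import json


def table_to_json_chunk(markdown_table, caption=""):
    """Convert a pipe-delimited table into one JSON chunk of row records."""
    rows = []
    for line in markdown_table.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip the |---|---| separator row under the header.
        if all(c and set(c) <= set("-:") for c in cells):
            continue
        rows.append(cells)

    header, body = rows[0], rows[1:]
    records = [dict(zip(header, row)) for row in body]
    # ensure_ascii=False keeps Chinese headers and values readable.
    return json.dumps({"type": "table", "caption": caption, "rows": records},
                      ensure_ascii=False)


# Made-up figures for illustration only.
table = """
| Metric | 2024 | 2025 |
|--------|------|------|
| Premium income | 100 | 108 |
| Claims ratio | 61% | 59% |
"""
chunk = table_to_json_chunk(table, caption="Illustrative figures only")
print(chunk)
```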
**Example query:**
> "Compare the solvency margin ratios of China Life (国寿), Ping An (平安), and CPIC (太保) over the past 3 years, and identify which company showed the most improvement."

**Output:**
```
## Solvency Margin Comparison: China Life vs Ping An vs CPIC (2023-2025)

| Company | 2023 | 2024 | 2025 | Change | Rating |
|---------|------|------|------|--------|--------|
| China Life | 218% | 224% | 231% | +13pp ⬆️ | Strong |
| Ping An | 195% | 201% | 208% | +13pp ⬆️ | Adequate |
| CPIC | 189% | 197% | 205% | +16pp ⬆️ | Adequate |

### Key Findings
1. **CPIC showed the strongest improvement** (+16pp), driven by a capital raise
2. **China Life maintains the highest absolute level** (231%), well above the CBIRC minimum
3. **Ping An is the most consistent**, with a steady improvement trajectory
4. **Risk:** All three face pressure from the interest rate environment in Q2 2026

Sources: China Life 2025 Annual Report p.42, Ping An 2025 Annual Report p.38, CPIC 2025 Annual Report p.35
Confidence: 92%
```
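The Change column and the "most improved" finding follow from simple arithmetic on the first and last years. A small check using the figures from the example table (the dictionary layout and the `change_pp` helper are illustrative assumptions):

```python
# Solvency margin ratios (%) copied from the example table above.
ratios = {
    "China Life (国寿)": {2023: 218, 2024: 224, 2025: 231},
    "Ping An (平安)": {2023: 195, 2024: 201, 2025: 208},
    "CPIC (太保)": {2023: 189, 2024: 197, 2025: 205},
}


def change_pp(series):
    """Change in percentage points between the earliest and latest year."""
    years = sorted(series)
    return series[years[-1]] - series[years[0]]


changes = {name: change_pp(series) for name, series in ratios.items()}
most_improved = max(changes, key=changes.get)

for name, delta in changes.items():
    print(f"{name}: {delta:+d}pp")
print("Most improved:", most_improved)
```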
---

## Input / Output Examples

### Example 1: Insurance CBIRC Filing Analysis

**Input:**
```
I need to analyze CBIRC's 2025 Insurance Industry Annual Report
(~300 pages, Chinese). Extract:
1. Key industry metrics (total premium, solvency, claims ratio)
2. Regulatory changes from 2023-2025
3. Top 5 market trends
4. Investment risk warnings

Build a RAG pipeline so I can query this report interactively.
```

**Output:**
```
## CBIRC Filing RAG Pipeline: Setup Complete

### Document Processing
- Input: 300 pages (~420K tokens)
- Chunking: Semantic by section + recursive for tables
- Total chunks: 847
- Estimated indexing time: 4 minutes (with local embedding)

### Index Configuration
- Embedding: text-embedding-3-large (3072 dims, hosted API)
- Storage: ChromaDB (local)
- Fully local, Chinese-capable alternative: 'paraphrase-multilingual-MiniLM-L12-v2' (384 dims)

### Interactive Query Interface

Query 1: "What were the total insurance premiums in 2025?"
Answer: ¥3.2 trillion (+8.3% YoY). Source: p.15
Confidence: 95%

Query 2: "What new regulations were introduced for group insurance?"
Answer: [List of 4 regulatory updates with dates and reference pages]
Confidence: 88%

Query 3: "What are the top investment risks mentioned?"
Answer: [3 key risks with severity ratings and regulatory references]
Confidence: 91%

### Query Templates for Your Analysis
- "Compare [metric] across [companies/time periods]"
- "List all [regulatory change / product type / market trend]"
- "What drove [increase/decrease] in [metric]?"
- "Identify risk factors related to [theme]"

### Export Options
- Generate comparative table: CSV/Excel
- Generate narrative report: Markdown/DOCX
- Generate slide deck: PowerPoint
```
### Example 2: Contract Review with Long Context

**Input:**
```
I have a 200-page group insurance contract. I need to:
1. Extract all liability exclusions
2. Find clauses with financial thresholds > ¥500K
3. Identify auto-renewal and termination clauses
4. Compare against a standard template

Can I do this with RAG?
```

**Output:**
```
## Contract Analysis: RAG-Enhanced Review

### Document: [Group Insurance Contract v3.2]
- Pages: 200 (~280K tokens)
- Processing: Full document upload (fits directly in a 2M-token context)
- Chunking: Section-preserving (chapter-level) + clause-level for key sections

### Key Findings

**1. Liability Exclusions (12 found; 3 shown)**
| Clause ID | Exclusion Type | Page | Severity |
|-----------|---------------|------|----------|
| EX-003 | Pre-existing conditions | p.24 | HIGH |
| EX-007 | Natural disaster cap | p.31 | MEDIUM |
| EX-011 | War/nuclear risk | p.45 | STANDARD |

**2. Financial Thresholds > ¥500K (4 found; 3 shown)**
| Clause | Threshold | Type | Page |
|--------|-----------|------|------|
| CL-015 | ¥2M | Claim limit | p.52 |
| CL-022 | ¥800K | Deductible | p.63 |
| CL-031 | ¥5M | Annual aggregate | p.71 |

**3. Auto-Renewal & Termination Clauses**
- Auto-renewal (p.89): 30-day notice to cancel, otherwise the policy auto-renews
- Termination for non-payment (p.91): policy lapses after 30 days past due
- CBIRC-mandated cooling period (p.93): 15-day free look period ✅

**4. Comparison vs. Standard Template**
Deviations from standard CBIRC group insurance template:
- ⚠️ Liability cap 15% lower than standard
- ⚠️ Deductible 20% higher than standard
- ⚠️ 5 additional exclusions not in standard (review for reasonableness)
- ✅ Cooling period compliant with CBIRC requirements

### Recommended Actions
1. [URGENT] Renegotiate CL-015 claim limit upward
2. [HIGH] Add actuarial justification memo for non-standard exclusions
3. [MEDIUM] Standardize auto-renewal notice period to 45 days (recommended)
```
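Threshold screening like finding 2 above is one place where a deterministic check complements retrieval: once candidate clauses are retrieved, the amounts can be parsed and compared numerically. The sketch below is an assumption-laden illustration. `parse_amounts` and `clauses_over` are hypothetical helpers, only `¥` amounts with optional `K`/`M` suffixes are handled, the clause wording is invented, and `CL-040` is a made-up counterexample.

```python
import re

_UNITS = {"K": 1_000, "M": 1_000_000}
# ¥, a number with optional thousands separators/decimals, optional K or M suffix.
_AMOUNT = re.compile(r"¥\s*(\d[\d,]*(?:\.\d+)?)\s*([KM])?\b")


def parse_amounts(text):
    """Return every ¥ amount in text as a float, expanding K/M suffixes."""
    return [float(num.replace(",", "")) * _UNITS.get(unit, 1)
            for num, unit in _AMOUNT.findall(text)]


def clauses_over(clauses, threshold):
    """IDs of clauses that mention at least one amount above threshold."""
    return [clause_id for clause_id, text in clauses.items()
            if any(amount > threshold for amount in parse_amounts(text))]


# Clause IDs and amounts echo the example output; the wording is invented.
clauses = {
    "CL-015": "Claim limit of ¥2M per insured event.",
    "CL-022": "Deductible of ¥800K applies.",
    "CL-040": "Administrative fee of ¥300 per amendment.",
}
hits = clauses_over(clauses, 500_000)
print(hits)
```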
---

## Advanced: Context Window Optimization

When the corpus exceeds the model's context window (even with 2M tokens):

**Tiered retrieval strategy:**

```
Level 1 (global overview): Summarize entire corpus (50-100 chunks → 1 summary)
Level 2 (topic-level): Identify relevant sections (~20 chunks → section summaries)
Level 3 (granular): Retrieve specific chunks for final synthesis (~5 chunks → answer)
```
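Levels 2 and 3 can be sketched as a two-stage filter: rank sections by their summaries first, then rank only the chunks inside the winning sections. Everything below is illustrative. The section layout, `tiered_retrieve`, and the word-overlap score are assumptions standing in for real summaries and for the hybrid or embedding scores a production pipeline would use; Level 1 (the corpus-wide summary) is omitted.

```python
def overlap_score(query, text):
    """Toy relevance score: number of distinct query words found in text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def tiered_retrieve(query, sections, top_sections=2, top_chunks=2):
    """Pick the best sections by summary, then the best chunks within them."""
    # Level 2: rank sections using only their summaries.
    ranked = sorted(sections,
                    key=lambda s: overlap_score(query, s["summary"]),
                    reverse=True)
    candidates = [c for s in ranked[:top_sections] for c in s["chunks"]]
    # Level 3: rank chunks drawn from the selected sections only.
    candidates.sort(key=lambda c: overlap_score(query, c), reverse=True)
    return candidates[:top_chunks]


# Made-up corpus for illustration only.
sections = [
    {"summary": "solvency and capital adequacy",
     "chunks": ["core solvency ratio rose", "capital raise completed"]},
    {"summary": "premium income by product line",
     "chunks": ["group premium income grew", "health premium income fell"]},
    {"summary": "governance and board changes",
     "chunks": ["two directors joined the board"]},
]
result = tiered_retrieve("solvency ratio trend", sections,
                         top_sections=1, top_chunks=1)
print(result)
```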
**For Chinese documents, special considerations:**
- Use bilingual or Chinese-specialized embedding models
- Handle mixed Chinese/English terminology consistently
- Preserve financial terminology precision (exact translation of regulatory terms)
- Check CBIRC-specific glossaries for regulatory documents

---

## Notes & Best Practices

1. **Chunking is roughly 80% of RAG quality** as a rule of thumb. Invest time in domain-specific chunking strategies rather than defaulting to fixed-size chunks.
2. **Context window ≠ useful context.** A model that can read 2M tokens still performs better when retrieval is precise. Don't skip the retrieval optimization layer.
3. **Chinese financial documents:** Annual reports in Chinese often contain dense tabular data and regulatory citations. Use table-aware chunking and incorporate CBIRC/CIRC glossary terms into your embedding pipeline.
4. **Hallucination guardrails:** Always require citations (page numbers, section refs) in RAG outputs for financial and legal use cases.
5. **Hybrid search > vector-only.** Pure vector search misses keyword-specific queries. Always implement hybrid search with RRF.
6. **Reranking is essential** for long-context RAG because first-pass retrieval is noisy.
7. **Cost management:** Long-context inference is expensive. Use hierarchical retrieval (summarize → retrieve → synthesize) instead of dumping everything into context.

---

*Author: @gechengling | Skill: long-context-rag-analyzer | clawhub.ai/gechengling/long-context-rag-analyzer*

Usage Guidance

Install only if you trust the publisher and intend to use these ClawHub maintainer workflows. Before running the autoreview helper, prefer `--no-yolo` or `AUTOREVIEW_YOLO=0` unless full sandbox-bypassing access is truly needed, and use the moderation commands only from an authenticated staff account after checking the exact target and reason.

Capability Assessment

⚠ Purpose & Capability

The artifacts support ClawHub maintainer workflows, Convex development, moderation, PR review, UI proof, and remote validation; those purposes are coherent, but they include high-impact capabilities such as user bans, role changes, PR publishing, remote execution, and full-access nested review.

⚠ Instruction Scope

Moderation commands are explicitly scoped with target, reason, confirmation, auth, and audit-log requirements, but the autoreview helper defaults to `--dangerously-bypass-approvals-and-sandbox --sandbox danger-full-access`, which is broader than normal review needs.

✓ Install Mechanism

I found no install hook, self-modifying installer, or hidden persistence mechanism in the skill files; the artifacts are repo-local skill instructions and a helper script.

⚠ Credentials

Repo-local commands, GitHub CLI access, remote Crabbox providers, staff API actions, and full-access nested Codex are disclosed, but the default full-access review mode is overbroad for an advisory review workflow.

⚠ Persistence & Privilege

There is no automatic background persistence, but the helper grants broad execution privilege by default; Crabbox leases and proof artifacts are user-invoked and include cleanup or dry-run guidance.

Version History

v3.0.1

- Enhanced support for analyzing and optimizing Retrieval Augmented Generation (RAG) pipelines for document sets ranging from 100K to 2M tokens.
- Added detailed chunking strategy selection, including optimized approaches for financial reports, legal contracts, and research papers.
- Introduced hybrid search workflow combining vector similarity and keyword search with reciprocal rank fusion and reranking for improved retrieval quality.
- Included a comprehensive framework for metadata extraction, chunk size calculation, retrieval evaluation metrics, and workflow best practices.
- Updated documentation with latest AI technical trends, compliance considerations, and actionable pipeline evaluation plans.

Metadata

Slug long-context-rag-analyzer

Version 3.0.1

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 1

Frequently Asked Questions

What is Long Context Rag Analyzer?

Analyze and optimize retrieval-augmented generation pipelines for 100K–2M token documents using hybrid search, chunking, reasoning, and structured reporting. It is an AI Agent Skill for Claude Code / OpenClaw, with 53 downloads so far.

How do I install Long Context Rag Analyzer?

Run "/install long-context-rag-analyzer" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Long Context Rag Analyzer free?

Yes, Long Context Rag Analyzer is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Long Context Rag Analyzer support?

Long Context Rag Analyzer is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Long Context Rag Analyzer?

It is built and maintained by lingfeng-19 (@gechengling); the current version is v3.0.1.
