/install hybrid-smart-fill

# Hybrid Smart Fill Skill

This skill fills templates using hybrid retrieval that combines BM25 ranking with TF-IDF vector similarity. It automatically matches template fields against knowledge base data and fills Word documents (.docx) and Excel spreadsheets (.xlsx).

## When to Use This Skill

Use this skill when:

- **Batch Template Filling**: Users need to fill multiple Word or Excel templates with data from a knowledge base
- **High Precision Required**: Simple keyword matching is insufficient and similarity-based field matching is needed
- **Knowledge Base Available**: A structured knowledge base (JSON format) containing fields and values is available
- **Complex Field Names**: Template field names differ in wording from knowledge base keys (e.g., "法人代表" should match "法定代表人"; both mean "legal representative")
- **Placeholder Replacement**: Templates contain placeholders like "XX基金" ("XX Fund") that need to be replaced with actual company names

Common trigger phrases:

- "填充模板" (fill template), "批量填充" (batch fill), "智能填充" (smart fill)
- "使用知识库" (use the knowledge base), "匹配字段" (match fields)
- "向量检索" (vector retrieval), "语义检索" (semantic retrieval), "BM25", "TF-IDF"
- "自动填写Word/Excel模板" (auto-fill Word/Excel templates)

## Core Concepts

### Hybrid Retrieval System

This skill uses a hybrid retrieval approach that combines two algorithms:

- **BM25 (Best Matching 25)**: Statistical ranking function based on term frequency and document frequency
  - Accounts for document length normalization
  - Penalizes overly common terms
  - Per-term score: `IDF × (TF × (k1 + 1)) / (TF + k1 × (1 - b + b × doc_length / avgdl))`
- **TF-IDF (Term Frequency-Inverse Document Frequency)**: Vector similarity search
  - Converts text to vector space
  - Calculates cosine similarity between query and documents
  - Matches on overall term overlap, so differently worded field names that share terms can still match
- **Hybrid Score**: Weighted fusion of both results
  - Formula: `final_score = 0.5 × BM25_score + 0.5 × TF-IDF_score`
  - Balances BM25's term-level precision with TF-IDF's softer vector similarity
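
The scoring described above can be sketched in plain Python. This is a minimal illustration, not the code in `scripts/vector_kb.py`: the function names, the `k1=1.5` and `b=0.75` defaults, the smoothed IDF variants, and the max-normalization of BM25 before fusion are all assumptions. Tokenization is character-based, matching the Limitations section.

```python
import math
from collections import Counter


def tokenize(text):
    """Character-level tokens with whitespace dropped (assumed tokenizer)."""
    return [ch for ch in text if not ch.isspace()]


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with the BM25 per-term formula."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    df = Counter(term for t in tokenized for term in set(t))
    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF (an assumption); it stays non-negative.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


def tfidf_scores(query, docs):
    """Cosine similarity between TF-IDF vectors of the query and each document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for t in tokenized for term in set(t))

    def vectorize(tokens):
        counts = Counter(tokens)
        return {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in counts.items()}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    q = vectorize(tokenize(query))
    return [cosine(q, vectorize(t)) for t in tokenized]


def hybrid_scores(query, docs, bm25_weight=0.5, tfidf_weight=0.5):
    """Weighted fusion; BM25 is max-normalized so both parts lie in [0, 1]."""
    bm25 = bm25_scores(query, docs)
    top = max(bm25) or 1.0  # avoid dividing by zero when nothing matches
    tfidf = tfidf_scores(query, docs)
    return [bm25_weight * s / top + tfidf_weight * t for s, t in zip(bm25, tfidf)]


docs = ["法定代表人", "联系电话", "注册资本"]
scores = hybrid_scores("法人代表", docs)
best = docs[scores.index(max(scores))]
print(best)  # prints 法定代表人
```

Normalizing BM25 before fusion matters because raw BM25 scores are unbounded while cosine similarity over non-negative weights lies in [0, 1]; without it, equal weights would not give the two scores equal influence.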

### Matching Strategy

The system uses a multi-level matching strategy:

- **Exact Match**: Field name exactly matches knowledge base key
- **Containment Match**: Field name contains or is contained in knowledge base key
- **Keyword Match**: Multi-keyword combination matching
- **Special Handling**: Auto-replacement of placeholders (e.g., "XX基金" → "国寿安保基金")
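
The levels above could be applied as a simple cascade, trying the strictest rule first. The sketch below is hypothetical: `match_field`, `replace_placeholder`, the `keywords` table, and the sample values are illustrative names and data, not the skill's actual implementation.

```python
def match_field(field, kb, keywords=None):
    """Return (value, level) for a template field, or (None, "none")."""
    keywords = keywords or {}
    # Level 1: the field name is exactly a knowledge base key.
    if field in kb:
        return kb[field], "exact"
    # Level 2: one name contains the other.
    for key, value in kb.items():
        if key in field or field in key:
            return value, "containment"
    # Level 3: every keyword registered for a key appears in the field name.
    for key, words in keywords.items():
        if key in kb and all(word in field for word in words):
            return kb[key], "keyword"
    return None, "none"


def replace_placeholder(text, company, placeholder="XX基金"):
    """Special handling: swap the placeholder for the real company name."""
    return text.replace(placeholder, company)


# Made-up sample data for illustration only.
kb = {"法定代表人": "张三", "联系电话": "010-12345678"}
keywords = {"法定代表人": ["法人", "代表"]}

exact = match_field("联系电话", kb)
contained = match_field("公司联系电话", kb)
by_keyword = match_field("法人代表", kb, keywords)
renamed = replace_placeholder("XX基金管理有限公司", "国寿安保基金")
```

Ordering the checks from strict to loose keeps a loose rule from overriding an exact hit; fields that fall through every level would be natural candidates for the hybrid retrieval score.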

## How to Use This Skill

### Step 1: Prepare Knowledge Base

Ensure the knowledge base is a JSON file with the following structure:

```json
{
  "filename.xlsx": {
    "filename": "filename.xlsx",
    "type": "xlsx",
    "content": "=== Sheet: SheetName\nA1[Header1] | A2[Value1] | ..."
  },
  "filename.docx": {
    "filename": "filename.docx",
    "type": "docx",
    "content": {
      "paragraphs": ["text content..."],
      "tables": [...]
    }
  }
}
```

**Supported formats in JSON:**
- **xlsx**: Text-based Excel format with `A1[Value] | B2[Value]` pattern
- **docx**: Dictionary or list format containing paragraphs and table data
- **doc**: Plain text format
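
A quick structural check before running the filler can catch format problems early. The sketch below assumes the layout shown in Step 1; `load_knowledge_base` and `validate_knowledge_base` are hypothetical helper names and are not part of the bundled scripts.

```python
import json

# Content shapes implied by the supported formats listed above.
EXPECTED_CONTENT_TYPES = {
    "xlsx": (str,),
    "docx": (dict, list),
    "doc": (str,),
}


def load_knowledge_base(path):
    """Read the knowledge base JSON as UTF-8."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def validate_knowledge_base(kb):
    """Return a list of problems; an empty list means the structure looks fine."""
    problems = []
    for name, entry in kb.items():
        kind = entry.get("type")
        if kind not in EXPECTED_CONTENT_TYPES:
            problems.append(f"{name}: unsupported type {kind!r}")
            continue
        if not isinstance(entry.get("content"), EXPECTED_CONTENT_TYPES[kind]):
            problems.append(f"{name}: content has unexpected shape for {kind}")
    return problems


sample = {
    "filename.xlsx": {
        "filename": "filename.xlsx",
        "type": "xlsx",
        "content": "=== Sheet: SheetName\nA1[Header1] | A2[Value1]",
    },
    "filename.docx": {
        "filename": "filename.docx",
        "type": "docx",
        "content": {"paragraphs": ["text content..."], "tables": []},
    },
}
problems = validate_knowledge_base(sample)
```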

### Step 2: Run the Smart Filler

Execute the main filling script:

```bash
python scripts/smart_filler.py
```

The script will:

1. Load and parse the knowledge base JSON
2. Extract structured data (89+ typical fields)
3. Build hybrid retrieval index
4. Process all template files in the template directory
5. Fill matched fields and replace placeholders
6. Save filled files to output directory

### Step 3: Review Results

The system generates:
- **Filled templates** in the output directory (marked with the "已填写" ("filled") suffix)
- **Fill log** showing all field matches and replacements
- **Statistics**: Total fields filled, success rate, XX基金 replacement count

## Bundled Scripts

### scripts/vector_kb.py

**Purpose**: Core hybrid retrieval engine implementation

**Key Classes:**
- `BM25Retriever`: BM25 ranking algorithm implementation
- `TFIDFRetriever`: TF-IDF vector search implementation
- `HybridRetriever`: Fusion of both retrieval methods
- `VectorKnowledgeBase`: Knowledge base management and indexing

**Usage Example**:
```python
from vector_kb import VectorKnowledgeBase

# Initialize and load knowledge base
kb = VectorKnowledgeBase()
kb.load_knowledge_base('knowledge_base.json').build_index()

# Search for values
results = kb.search('法人代表', top_k=5)
for result in results:
    print(f"Score: {result['score']}, Value: {result['document']}")
```

### scripts/smart_filler.py

**Purpose**: Main template filling orchestration

**Key Classes:**
- `TextExcelParser`: Parses text-based Excel content
- `SmartFillSystem`: Orchestrates the entire filling process

**Usage Example**:
```python
from smart_filler import SmartFillSystem

# Configure paths
system = SmartFillSystem(
    kb_path='knowledge_base.json',
    template_dir='templates/',
    output_dir='filled/'
)

# Initialize and process
system.load_kb()
system.process_all()
```

**Configuration:**
- `kb_path`: Path to knowledge base JSON file
- `template_dir`: Directory containing template files
- `output_dir`: Directory for filled output files

## Reference Documentation

### Knowledge Base Format Requirements

**Excel Content Format** (text-based):
```
=== Sheet: SheetName ===
A1[Header1] | A2[Value1] | B1[Header2] | B2[Value2]
```
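
To make the text-based format concrete, here is a small parser sketch. It is not the bundled `TextExcelParser`: the function names and the two regular expressions are assumptions, and it assumes cell values never contain `]`. Pairing a row-1 header with the row-2 value in the same column is also an assumed layout, read off the example above.

```python
import re

# "=== Sheet: Name ===" header; the trailing "===" is treated as optional.
SHEET_RE = re.compile(r"^=== Sheet: (.+?)(?: ===)?\s*$")
# A cell such as "A1[Header1]": reference, then the value in brackets.
CELL_RE = re.compile(r"([A-Z]+\d+)\[([^\]]*)\]")


def parse_text_excel(content):
    """Return {sheet_name: {cell_ref: value}} from the text-based format."""
    sheets = {}
    current = None
    for line in content.splitlines():
        header = SHEET_RE.match(line)
        if header:
            current = sheets.setdefault(header.group(1), {})
            continue
        if current is None:
            continue  # ignore cells that appear before any sheet header
        for ref, value in CELL_RE.findall(line):
            current[ref] = value
    return sheets


def header_value_pairs(cells):
    """Pair each row-1 header with the row-2 value in the same column."""
    pairs = {}
    for ref, header in cells.items():
        col, row = re.match(r"([A-Z]+)(\d+)", ref).groups()
        if row == "1" and f"{col}2" in cells:
            pairs[header] = cells[f"{col}2"]
    return pairs


text = "=== Sheet: SheetName ===\nA1[Header1] | A2[Value1] | B1[Header2] | B2[Value2]"
sheets = parse_text_excel(text)
pairs = header_value_pairs(sheets["SheetName"])
```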
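
**Document Content Format** (field extraction):
- Use regex patterns of the form `字段名[::\s]*值` to extract values, where `字段名` is the field name and `值` is its value
- Supported fields: 法人代表 (legal representative), 联系电话 (contact phone), 地址 (address), 注册资本 (registered capital), 统一社会信用代码 (unified social credit code), etc.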
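
The field-extraction pattern above might be applied as follows. This is a sketch under assumptions: `extract_fields` is a hypothetical helper, the fullwidth colon `:` is added to the character class because Chinese documents commonly use it, and the sample text is made up.

```python
import re

FIELDS = ["法人代表", "联系电话", "地址", "注册资本", "统一社会信用代码"]


def extract_fields(text, fields=FIELDS):
    """Find "field name, optional colon or spaces, value" pairs in the text."""
    found = {}
    for name in fields:
        # The value runs to the end of the line.
        match = re.search(re.escape(name) + r"[::\s]*([^\n\r]+)", text)
        if match:
            found[name] = match.group(1).strip()
    return found


sample_text = "法人代表:张三\n联系电话: 010-12345678\n地址 北京市"
extracted = extract_fields(sample_text)
```

A short name such as 地址 can also match inside a longer label (for example 注册地址), so a stricter version might anchor the field name to the start of a line.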

**Year-based Data**:
- Automatic organization by year (e.g., "2024年总资产", "2024 total assets")
- Cleaned headers (year removed) for better matching
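
One way to organize year-based data is to pull the year out of each header and group the cleaned header under it. The sketch below is an assumption about how this could work; `organize_by_year` and the sample values are hypothetical.

```python
import re
from collections import defaultdict

# A four-digit year followed by 年, e.g. "2024年".
YEAR_RE = re.compile(r"((?:19|20)\d{2})年")


def organize_by_year(fields):
    """Group values by year; headers without a year go under None."""
    by_year = defaultdict(dict)
    for header, value in fields.items():
        match = YEAR_RE.search(header)
        year = match.group(1) if match else None
        clean_header = YEAR_RE.sub("", header).strip()
        by_year[year][clean_header] = value
    return dict(by_year)


# Made-up sample values for illustration only.
grouped = organize_by_year({
    "2024年总资产": "100",
    "2023年总资产": "90",
    "法人代表": "张三",
})
```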

### Performance Characteristics

Based on real-world testing:

| Metric | Value |
|---------|--------|
| Knowledge Base Fields | 89+ |
| Files Processed | 5+ |
| Total Fields Filled | 388+ |
| Fields Per File (Average) | 77.6 |
| XX基金 Replacement Rate | 100% |
| Precision Improvement | 50%+ over keyword matching |
| Efficiency Gain | 90%+ over manual filling |

## Common Issues and Solutions

### Issue: Low Match Rate

**Cause**: Knowledge base content format incompatible

**Solution**: Ensure Excel content uses `A1[Value]` format; check JSON structure

### Issue: Wrong Value Filled

**Cause**: Field name ambiguity

**Solution**: Adjust hybrid retrieval weights; use more specific field names in templates

### Issue: Encoding Errors

**Cause**: The knowledge base file is not UTF-8 encoded, or the console cannot print UTF-8 output

**Solution**: Ensure knowledge base JSON is UTF-8 encoded; use `sys.stdout.reconfigure(encoding='utf-8')` in scripts
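
Both parts of that fix can be sketched as below. The helper names are hypothetical; `utf-8-sig` is used when reading (an assumption beyond the advice above) because it decodes plain UTF-8 and also skips a leading byte order mark if one is present.

```python
import json
import os
import sys
import tempfile


def load_kb_utf8(path):
    """Read the knowledge base as UTF-8, tolerating an optional BOM."""
    with open(path, encoding="utf-8-sig") as handle:
        return json.load(handle)


def enable_utf8_stdout():
    """Switch stdout to UTF-8 where the stream supports it (Python 3.7+)."""
    if hasattr(sys.stdout, "reconfigure"):
        sys.stdout.reconfigure(encoding="utf-8")


enable_utf8_stdout()

# Round trip through a temporary file with made-up sample data.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "knowledge_base.json")
    with open(path, "w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps Chinese text readable in the file.
        json.dump({"法人代表": "张三"}, handle, ensure_ascii=False)
    loaded = load_kb_utf8(path)
```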

## Advanced Usage

### Custom Retrieval Weights

Modify the hybrid retrieval weight balance in `HybridRetriever`:

```python
# Default: BM25 0.5, TF-IDF 0.5
# Change to emphasize TF-IDF vector similarity:
self.bm25_weight = 0.3
self.tfidf_weight = 0.7
```

### Custom Field Extraction

Extend `TextExcelParser._extract_from_text()` to support additional patterns:

```python
patterns = {
    'new_field': r'新字段[::\s]*([^\n\r]+)',
    # Add more patterns...
}
```
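
### Batch Processing

Process multiple knowledge bases:

```python
from pathlib import Path

from smart_filler import SmartFillSystem

kb_files = ['kb1.json', 'kb2.json', 'kb3.json']
for kb_file in kb_files:
    # One output directory per knowledge base, e.g. filled_kb1/
    output_dir = f'filled_{Path(kb_file).stem}/'
    system = SmartFillSystem(kb_path=kb_file, template_dir='templates/', output_dir=output_dir)
    system.load_kb()
    system.process_all()
```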

## Limitations

1. **No Machine Learning Embeddings**: Uses TF-IDF (not BERT/Transformer embeddings) for lightweight deployment
2. **Chinese Tokenization**: Simple character-based tokenization (not jieba)
3. **Excel Format**: Requires text-based format; binary Excel files need pre-processing
4. **Context Awareness**: Limited cell-to-cell context understanding

## Future Enhancements

Potential improvements for future versions:

1. **Deep Learning Embeddings**: Integrate sentence-transformers for true semantic vectors
2. **Cross-Modal Fusion**: Combine table structure information with text matching
3. **Adaptive Weighting**: Learn optimal BM25/TF-IDF weights from user feedback
4. **Domain Adaptation**: Build domain-specific vocabularies for finance, legal, etc.

## References

For deeper understanding:

- **BM25 Algorithm**: Robertson, S. E., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond
- **TF-IDF**: Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval
- **Hybrid Retrieval**: Combining multiple evidence sources in search systems
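
## Installation

1. Make sure OpenClaw is installed (locally or via Docker).
2. Enter the install command in the chat box: `/install hybrid-smart-fill`
3. Once installation finishes, invoke the Skill by name or trigger it with `/hybrid-smart-fill`.
4. Provide the inputs described in the Skill's parameter documentation to get structured output.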
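
## FAQ

**What is hybrid-smart-fill?**
This skill provides hybrid retrieval (BM25 semantic search + TF-IDF vector similarity) for intelligent template auto-filling. Use when users need to batch fi... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 152 downloads to date.

**How do I install hybrid-smart-fill?**
Run `/install hybrid-smart-fill` in the OpenClaw or Claude Code chat box for a one-step install; no additional configuration is needed.

**Is hybrid-smart-fill free?**
Yes. hybrid-smart-fill is completely free and released under the MIT-0 license, so it can be downloaded, installed, and used freely.

**Which platforms does hybrid-smart-fill support?**
hybrid-smart-fill is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.

**Who develops hybrid-smart-fill?**
It is developed and maintained by maodou13 (@deweienweide). The current version is v1.0.0.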