/install hybrid-smart-fill

# Hybrid Smart Fill Skill

This skill fills templates intelligently using a hybrid retrieval algorithm that combines BM25 ranking with TF-IDF vector similarity. It automatically matches template fields against knowledge base data and fills Word documents (.docx) and Excel spreadsheets (.xlsx) with high precision.

## When to Use This Skill

Use this skill when:

- **Batch Template Filling**: Users need to fill multiple Word or Excel templates with data from a knowledge base
- **High Precision Required**: Simple keyword matching is insufficient; semantic understanding is needed for accurate field matching
- **Knowledge Base Available**: A structured knowledge base (JSON format) containing fields and values is available
- **Complex Field Names**: Template fields require semantic matching (e.g., "法人代表" (legal representative) matches "法定代表人" (statutory representative))
- **Placeholder Replacement**: Templates contain placeholders like "XX基金" ("XX Fund") that need to be replaced with actual company names

Common trigger phrases:

- "填充模板" (fill template), "批量填充" (batch fill), "智能填充" (smart fill)
- "使用知识库" (use knowledge base), "匹配字段" (match fields)
- "向量检索" (vector retrieval), "语义检索" (semantic retrieval), "BM25", "TF-IDF"
- "自动填写Word/Excel模板" (auto-fill Word/Excel templates)

## Core Concepts

### Hybrid Retrieval System

This skill uses a hybrid retrieval approach that runs two algorithms and fuses their scores:

1. **BM25 (Best Matching 25)**: Statistical ranking function based on term frequency and document frequency
   - Accounts for document length normalization
   - Penalizes overly common terms
   - Per-term score: `IDF × (TF × (k1 + 1)) / (TF + k1 × (1 - b + b × doc_length / avgdl))`, where `avgdl` is the average document length and `k1` and `b` are tuning parameters

2. **TF-IDF (Term Frequency-Inverse Document Frequency)**: Vector similarity search
   - Converts text to vector space
   - Calculates cosine similarity between query and documents
   - Approximates semantic matching beyond exact keywords

3. **Hybrid Score**: Weighted fusion of both results
   - Formula: `final_score = 0.5 × BM25_score + 0.5 × TF-IDF_score`
   - Balances precision (BM25) and semantic understanding (TF-IDF)

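The two scores and their 0.5/0.5 fusion can be sketched in plain Python. This is a minimal illustration, not the code in `scripts/vector_kb.py`: the function names, the character-level `tokenize`, the `k1=1.5` and `b=0.75` defaults, and the smoothed IDF variants are all assumptions made for the example.

```python
import math
from collections import Counter

def tokenize(text):
    # Character-level tokens with whitespace dropped (an assumption of this sketch).
    return [ch for ch in text if not ch.isspace()]

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """BM25 score of every document for the query (k1 and b are assumed defaults)."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n or 1.0
    # Document frequency: how many documents contain each term.
    df = Counter(term for t in tokenized for term in set(t))
    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

def tfidf_cosine_scores(query, docs):
    """Cosine similarity between TF-IDF vectors of the query and each document."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for t in tokenized for term in set(t))

    def vectorize(tokens):
        # Smoothed IDF so terms unseen in the corpus still get a finite weight.
        return {term: count * (math.log((1 + n) / (1 + df[term])) + 1)
                for term, count in Counter(tokens).items()}

    def cosine(u, v):
        dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
        norm = (math.sqrt(sum(w * w for w in u.values()))
                * math.sqrt(sum(w * w for w in v.values())))
        return dot / norm if norm else 0.0

    q = vectorize(tokenize(query))
    return [cosine(q, vectorize(tokens)) for tokens in tokenized]

def hybrid_scores(query, docs, bm25_weight=0.5, tfidf_weight=0.5):
    """Weighted sum of the two score lists, as in the formula above."""
    return [bm25_weight * s1 + tfidf_weight * s2
            for s1, s2 in zip(bm25_scores(query, docs),
                              tfidf_cosine_scores(query, docs))]

keys = ["法定代表人", "联系电话", "注册资本"]
scores = hybrid_scores("法人代表", keys)
best = keys[scores.index(max(scores))]
print(best)  # 法定代表人
```

With character-level tokens, "法人代表" shares all four of its characters with "法定代表人" and none with the other two keys, so only the first key receives a non-zero score.
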
### Matching Strategy

The system uses a multi-level matching strategy:

1. **Exact Match**: Field name exactly matches a knowledge base key
2. **Containment Match**: Field name contains, or is contained in, a knowledge base key
3. **Keyword Match**: Multi-keyword combination matching
4. **Special Handling**: Auto-replacement of placeholders (e.g., "XX基金" → "国寿安保基金")

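One way to read these levels is as a cascade that stops at the first hit. The sketch below assumes that ordering; `match_field`, `replace_placeholder`, the `keyword_rules` format, and the sample values are hypothetical and only illustrate the idea.

```python
def match_field(field, kb, keyword_rules=()):
    """Return (level, value) for the first level that matches, or None."""
    if not field:
        return None
    # Level 1: exact match on the knowledge base key.
    if field in kb:
        return "exact", kb[field]
    # Level 2: one name contains the other.
    for key, value in kb.items():
        if key and (field in key or key in field):
            return "containment", value
    # Level 3: every keyword of a rule appears in the field name.
    # Each rule is (keywords, kb_key); this rule format is an assumption.
    for keywords, key in keyword_rules:
        if key in kb and all(word in field for word in keywords):
            return "keyword", kb[key]
    return None

def replace_placeholder(text, company_name, placeholder="XX基金"):
    """Special handling: swap the placeholder for the real company name."""
    return text.replace(placeholder, company_name)

# Made-up sample data for illustration only.
kb = {"法定代表人": "张三", "联系电话": "010-12345678"}
rules = [(("法", "代表"), "法定代表人")]

print(match_field("联系电话", kb))             # ('exact', '010-12345678')
print(match_field("公司联系电话", kb))         # ('containment', '010-12345678')
print(match_field("法人代表", kb, rules))      # ('keyword', '张三')
print(replace_placeholder("XX基金年度报告", "国寿安保基金"))  # 国寿安保基金年度报告
```
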
## How to Use This Skill

### Step 1: Prepare Knowledge Base

Ensure the knowledge base is a JSON file with the following structure:

```json
{
  "filename.xlsx": {
    "filename": "filename.xlsx",
    "type": "xlsx",
    "content": "=== Sheet: SheetName\nA1[Header1] | A2[Value1] | ..."
  },
  "filename.docx": {
    "filename": "filename.docx",
    "type": "docx",
    "content": {
      "paragraphs": ["text content..."],
      "tables": [...]
    }
  }
}
```

**Supported formats in JSON:**
- **xlsx**: Text-based Excel format with `A1[Value] | B2[Value]` pattern
- **docx**: Dictionary or list format containing paragraphs and table data
- **doc**: Plain text format
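
Before running the filler, it can help to check that each entry's `content` has the shape listed above for its `type`. The following is a minimal sketch under that reading of the format; `validate_kb` and `load_kb` are hypothetical helpers, not functions shipped with the skill.

```python
import json

# Expected Python type of "content" for each supported "type" value.
EXPECTED_CONTENT = {"xlsx": (str,), "docx": (dict, list), "doc": (str,)}

def validate_kb(kb):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for name, entry in kb.items():
        if not isinstance(entry, dict):
            problems.append(f"{name}: entry is not an object")
            continue
        kind = entry.get("type")
        if kind not in EXPECTED_CONTENT:
            problems.append(f"{name}: unsupported type {kind!r}")
        elif not isinstance(entry.get("content"), EXPECTED_CONTENT[kind]):
            problems.append(f"{name}: unexpected content for type {kind!r}")
    return problems

def load_kb(path):
    """Load a knowledge base JSON file (UTF-8) and fail early on bad entries."""
    with open(path, encoding="utf-8") as f:
        kb = json.load(f)
    problems = validate_kb(kb)
    if problems:
        raise ValueError("; ".join(problems))
    return kb

sample = {
    "a.xlsx": {"filename": "a.xlsx", "type": "xlsx",
               "content": "=== Sheet: S\nA1[Header1] | A2[Value1]"},
    "b.docx": {"filename": "b.docx", "type": "docx",
               "content": {"paragraphs": ["text"], "tables": []}},
    "c.pdf": {"filename": "c.pdf", "type": "pdf", "content": ""},
}
print(validate_kb(sample))  # ["c.pdf: unsupported type 'pdf'"]
```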

### Step 2: Run the Smart Filler

Execute the main filling script:

```bash
python scripts/smart_filler.py
```

The script will:

1. Load and parse the knowledge base JSON
2. Extract structured data (89+ typical fields)
3. Build hybrid retrieval index
4. Process all template files in the template directory
5. Fill matched fields and replace placeholders
6. Save filled files to output directory

### Step 3: Review Results

The system generates:
- **Filled templates** in the output directory (marked with the "已填写" ("filled") suffix)
- **Fill log** showing all field matches and replacements
- **Statistics**: Total fields filled, success rate, "XX基金" replacement count

## Bundled Scripts

### scripts/vector_kb.py

**Purpose**: Core hybrid retrieval engine implementation

**Key Classes:**
- `BM25Retriever`: BM25 ranking algorithm implementation
- `TFIDFRetriever`: TF-IDF vector search implementation
- `HybridRetriever`: Fusion of both retrieval methods
- `VectorKnowledgeBase`: Knowledge base management and indexing

**Usage Example**:
```python
from vector_kb import VectorKnowledgeBase

# Initialize and load knowledge base
kb = VectorKnowledgeBase()
kb.load_knowledge_base('knowledge_base.json').build_index()

# Search for values
results = kb.search('法人代表', top_k=5)
for result in results:
    print(f"Score: {result['score']}, Value: {result['document']}")
```

### scripts/smart_filler.py

**Purpose**: Main template filling orchestration

**Key Classes:**
- `TextExcelParser`: Parses text-based Excel content
- `SmartFillSystem`: Orchestrates the entire filling process

**Usage Example**:
```python
from smart_filler import SmartFillSystem

# Configure paths
system = SmartFillSystem(
    kb_path='knowledge_base.json',
    template_dir='templates/',
    output_dir='filled/'
)

# Initialize and process
system.load_kb()
system.process_all()
```

**Configuration:**
- `kb_path`: Path to knowledge base JSON file
- `template_dir`: Directory containing template files
- `output_dir`: Directory for filled output files

## Reference Documentation

### Knowledge Base Format Requirements

**Excel Content Format** (text-based):
```
=== Sheet: SheetName ===
A1[Header1] | A2[Value1] | B1[Header2] | B2[Value2]
```
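
A parser for this text format might look like the sketch below. It is not the `TextExcelParser` implementation; the regular expressions, the function names, and the assumption that row 1 holds headers and row 2 holds their values (as in the sample line above) are illustrative only.

```python
import re

# A cell looks like A1[Value]; a sheet marker looks like "=== Sheet: Name ===".
CELL = re.compile(r"([A-Z]+)(\d+)\[([^\]]*)\]")
SHEET = re.compile(r"^=== Sheet: (.+?)(?: ===)?\s*$")

def parse_text_excel(content):
    """Map sheet name -> {cell reference -> value}."""
    sheets = {}
    current = None
    for line in content.splitlines():
        marker = SHEET.match(line.strip())
        if marker:
            current = sheets.setdefault(marker.group(1), {})
            continue
        if current is None:
            continue  # ignore text before the first sheet marker
        for col, row, value in CELL.findall(line):
            current[f"{col}{row}"] = value
    return sheets

def header_value_pairs(cells):
    """Pair each row-1 header with the row-2 value in the same column."""
    pairs = {}
    for ref, header in cells.items():
        m = re.fullmatch(r"([A-Z]+)1", ref)
        if m and f"{m.group(1)}2" in cells:
            pairs[header] = cells[f"{m.group(1)}2"]
    return pairs

text = "=== Sheet: SheetName ===\nA1[Header1] | A2[Value1] | B1[Header2] | B2[Value2]"
sheets = parse_text_excel(text)
print(header_value_pairs(sheets["SheetName"]))
# {'Header1': 'Value1', 'Header2': 'Value2'}
```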
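
**Document Content Format** (field extraction):
- Use regex patterns to extract: `字段名[::\s]*值` (field name, optional full-width or ASCII colon or whitespace, then the value)
- Supported fields: 法人代表 (legal representative), 联系电话 (contact phone), 地址 (address), 注册资本 (registered capital), 统一社会信用代码 (unified social credit code), etc.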
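
The extraction pattern can be tried out with a few lines of Python. This is a sketch only: `extract_fields` is a hypothetical helper, the colon class is assumed to cover both the full-width and the ASCII colon, and the sample text is made up.

```python
import re

def extract_fields(text, field_names):
    """Find "name: value" pairs, accepting a full-width or ASCII colon."""
    found = {}
    for name in field_names:
        # The value runs to the end of the line.
        match = re.search(re.escape(name) + r"[::\s]*([^\n\r]+)", text)
        if match:
            found[name] = match.group(1).strip()
    return found

sample = "法人代表:张三\n联系电话: 010-12345678\n"
fields = extract_fields(sample, ["法人代表", "联系电话", "地址"])
print(fields)  # {'法人代表': '张三', '联系电话': '010-12345678'}
```

Because `\s` also matches line breaks, a field with an empty value would capture the following line instead; a stricter separator class avoids that if it matters for the data at hand.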

**Year-based Data**:
- Values are organized automatically by year (e.g., "2024年总资产", total assets for 2024)
- The year is stripped from headers so the cleaned header matches template fields more reliably
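
A minimal sketch of this year handling, assuming a four-digit year optionally followed by "年" anywhere in the header; `split_year` and `organize_by_year` are hypothetical names and the values are made up.

```python
import re

# A four-digit year (19xx or 20xx), optionally followed by the character 年.
YEAR = re.compile(r"((?:19|20)\d{2})年?")

def split_year(header):
    """Return (year or None, header with the year removed)."""
    match = YEAR.search(header)
    if not match:
        return None, header.strip()
    cleaned = (header[:match.start()] + header[match.end():]).strip()
    return int(match.group(1)), cleaned

def organize_by_year(pairs):
    """Group header/value pairs as {year: {cleaned header: value}}."""
    by_year = {}
    for header, value in pairs.items():
        year, cleaned = split_year(header)
        by_year.setdefault(year, {})[cleaned] = value
    return by_year

print(split_year("2024年总资产"))  # (2024, '总资产')
grouped = organize_by_year({"2024年总资产": "100", "2023年总资产": "90"})
print(grouped)  # {2024: {'总资产': '100'}, 2023: {'总资产': '90'}}
```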

### Performance Characteristics

Based on real-world testing:

| Metric | Value |
|---------|--------|
| Knowledge Base Fields | 89+ |
| Files Processed | 5+ |
| Total Fields Filled | 388+ |
| Fields Per File (Average) | 77.6 |
| XX基金 Replacement Rate | 100% |
| Precision Improvement | 50%+ over keyword matching |
| Efficiency Gain | 90%+ over manual filling |

## Common Issues and Solutions

### Issue: Low Match Rate

**Cause**: Knowledge base content format incompatible

**Solution**: Ensure Excel content uses `A1[Value]` format; check JSON structure

### Issue: Wrong Value Filled

**Cause**: Field name ambiguity

**Solution**: Adjust hybrid retrieval weights; use more specific field names in templates

### Issue: Encoding Errors

**Cause**: Non-UTF-8 characters in knowledge base

**Solution**: Ensure knowledge base JSON is UTF-8 encoded; use `sys.stdout.reconfigure(encoding='utf-8')` in scripts
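
A defensive version of this fix might look like the following sketch. The helper names are hypothetical; decoding with `utf-8-sig` is an added assumption that also tolerates a byte-order mark, which some editors on Windows prepend to UTF-8 files.

```python
import json
import sys

def configure_utf8_output():
    """Make print() emit UTF-8 where the stream supports reconfiguration."""
    if hasattr(sys.stdout, "reconfigure"):  # available on Python 3.7+ text streams
        sys.stdout.reconfigure(encoding="utf-8")

def parse_kb_bytes(raw):
    """Decode knowledge base bytes as UTF-8, tolerating a leading BOM."""
    return json.loads(raw.decode("utf-8-sig"))

def load_kb_utf8(path):
    """Read a knowledge base file as bytes and decode it explicitly."""
    with open(path, "rb") as f:
        return parse_kb_bytes(f.read())

configure_utf8_output()
raw = b"\xef\xbb\xbf" + '{"法人代表": "张三"}'.encode("utf-8")
print(parse_kb_bytes(raw))  # {'法人代表': '张三'}
```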

## Advanced Usage

### Custom Retrieval Weights

Modify the hybrid retrieval weight balance in `HybridRetriever`:

```python
# Default: BM25 0.5, TF-IDF 0.5
# Change to emphasize semantic matching:
self.bm25_weight = 0.3
self.tfidf_weight = 0.7
```
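
When changing the weights, keep in mind that raw BM25 scores are unbounded while cosine similarities fall between 0 and 1, so a weight only behaves as a true proportion if both score lists are on a comparable scale. The sketch below applies min-max normalization before the weighted sum. Whether `HybridRetriever` normalizes its scores is not stated here, so treat `min_max` and `fuse` as hypothetical helpers.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def fuse(bm25, tfidf, bm25_weight=0.5, tfidf_weight=0.5):
    """Weighted sum of two normalized score lists for the same documents."""
    if len(bm25) != len(tfidf):
        raise ValueError("score lists must have the same length")
    return [bm25_weight * x + tfidf_weight * y
            for x, y in zip(min_max(bm25), min_max(tfidf))]

# Emphasize the TF-IDF side, as in the 0.3 / 0.7 setting above.
fused = fuse([2.0, 0.0, 1.0], [0.1, 0.9, 0.5], bm25_weight=0.3, tfidf_weight=0.7)
ranking = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
print(ranking)  # [1, 2, 0]
```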

### Custom Field Extraction

Extend `TextExcelParser._extract_from_text()` to support additional patterns:

```python
patterns = {
    # Capture everything after the field name up to the end of the line
    'new_field': r'新字段[::\s]*([^\n\r]+)',
    # Add more patterns...
}
```

### Batch Processing

Process multiple knowledge bases:

```python
from pathlib import Path

from smart_filler import SmartFillSystem

kb_files = ['kb1.json', 'kb2.json', 'kb3.json']
for kb_file in kb_files:
    # One output directory per knowledge base, e.g. filled_kb1/
    system = SmartFillSystem(kb_file, 'templates/', f'filled_{Path(kb_file).stem}/')
    system.load_kb()
    system.process_all()
```

## Limitations

1. **No Machine Learning Embeddings**: Uses TF-IDF (not BERT/Transformer embeddings) for lightweight deployment
2. **Chinese Tokenization**: Simple character-based tokenization (not jieba)
3. **Excel Format**: Requires text-based format; binary Excel files need pre-processing
4. **Context Awareness**: Limited cell-to-cell context understanding
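
To make the tokenization limitation concrete, here is a sketch of what character-based tokenization could look like. The regular expression and function names are assumptions; the bundled scripts may tokenize differently.

```python
import re

# One token per CJK character; runs of ASCII letters or digits stay together.
TOKEN = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]")

def char_tokenize(text):
    """Split text into lowercase ASCII words and single CJK characters."""
    return [token.lower() for token in TOKEN.findall(text)]

def char_bigrams(text):
    """Adjacent character pairs, a cheap way to keep some word context."""
    chars = char_tokenize(text)
    return [a + b for a, b in zip(chars, chars[1:])]

print(char_tokenize("BM25 法人代表"))  # ['bm25', '法', '人', '代', '表']
print(char_bigrams("法人代表"))        # ['法人', '人代', '代表']
```

Single characters let "法人代表" overlap with "法定代表人" without a dictionary, at the cost of also matching unrelated fields that happen to share a character.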

## Future Enhancements

Potential improvements for future versions:

1. **Deep Learning Embeddings**: Integrate sentence-transformers for true semantic vectors
2. **Cross-Modal Fusion**: Combine table structure information with text matching
3. **Adaptive Weighting**: Learn optimal BM25/TF-IDF weights from user feedback
4. **Domain Adaptation**: Build domain-specific vocabularies for finance, legal, etc.

## References

For deeper understanding:

- **BM25 Algorithm**: Robertson, S. E., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond
- **TF-IDF**: Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval
- **Hybrid Retrieval**: Combining multiple evidence sources in search systems

**How do I install hybrid-smart-fill?**
Run `/install hybrid-smart-fill` in the OpenClaw or Claude Code chat to install it in one step; no extra setup is required.

**Is hybrid-smart-fill free?**
Yes. hybrid-smart-fill is free to download, install, and use, and is licensed under MIT-0.

**Which platforms does hybrid-smart-fill support?**
hybrid-smart-fill is cross-platform and runs anywhere OpenClaw or Claude Code is available.

**Who created hybrid-smart-fill?**
It is built and maintained by maodou13 (@deweienweide); the current version is v1.0.0.