Description

This skill provides hybrid retrieval (BM25 semantic search + TF-IDF vector similarity) for intelligent template auto-filling. Use when users need to batch fi...

README (SKILL.md)


# Hybrid Smart Fill Skill

Name: hybrid-smart-fill
Author: deweienweide

This skill enables intelligent template filling using hybrid retrieval that combines BM25 keyword ranking with TF-IDF vector similarity. It automatically matches template fields with knowledge base data and fills Word documents (.docx) and Excel spreadsheets (.xlsx) with high precision.

## When to Use This Skill

Use this skill when:

- **Batch Template Filling**: Users need to fill multiple Word or Excel templates with data from a knowledge base
- **High Precision Required**: Simple keyword matching is insufficient; semantic understanding is needed for accurate field matching
- **Knowledge Base Available**: A structured knowledge base (JSON format) containing fields and values is available
- **Complex Field Names**: Template fields require semantic matching (e.g., "法人代表" matches "法定代表人"; both mean "legal representative")
- **Placeholder Replacement**: Templates contain placeholders like "XX基金" ("XX Fund") that need to be replaced with actual company names

Common trigger phrases:

- "填充模板" (fill template), "批量填充" (batch fill), "智能填充" (smart fill)
- "使用知识库" (use the knowledge base), "匹配字段" (match fields)
- "向量检索" (vector retrieval), "语义检索" (semantic retrieval), "BM25", "TF-IDF"
- "自动填写Word/Excel模板" (auto-fill Word/Excel templates)

## Core Concepts

### Hybrid Retrieval System

This skill uses a hybrid retrieval approach combining two algorithms:

1. **BM25 (Best Matching 25)**: Statistical ranking function based on term frequency and document frequency
   - Accounts for document length normalization
   - Penalizes overly common terms
   - Score per query term, summed over all query terms: `IDF × (TF × (k1 + 1)) / (TF + k1 × (1 - b + b × doc_length / avgdl))`
2. **TF-IDF (Term Frequency-Inverse Document Frequency)**: Vector similarity search
   - Converts text to vector space
   - Calculates cosine similarity between query and documents
   - Approximates semantic matching through weighted term overlap, beyond exact keyword equality
3. **Hybrid Score**: Weighted fusion of both results
   - Formula: `final_score = 0.5 × BM25_score + 0.5 × TF-IDF_score`
   - Balances precision (BM25) and fuzzier similarity matching (TF-IDF)
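
The two scorers described above can be sketched in plain Python. This is a minimal illustration, not the code shipped in `scripts/vector_kb.py`: the function names, the character-level tokenizer, the defaults `k1=1.5` and `b=0.75`, and the smoothed IDF variants are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: one token per non-whitespace character.
    return [ch for ch in text if not ch.isspace()]


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with the BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    if not tokenized:
        return []
    n = len(tokenized)
    avgdl = (sum(len(t) for t in tokenized) / n) or 1.0
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for t in tokenized for term in set(t))
    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Smoothed IDF (the "+ 1" inside the log keeps it positive).
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def tfidf_cosine_scores(query, docs):
    """Cosine similarity between TF-IDF vectors of the query and each document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for t in tokenized for term in set(t))

    def vectorize(tokens):
        # Smoothed IDF so terms unseen in the corpus do not divide by zero.
        return {t: c * (math.log((1 + n) / (1 + df[t])) + 1)
                for t, c in Counter(tokens).items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm = (math.sqrt(sum(w * w for w in a.values()))
                * math.sqrt(sum(w * w for w in b.values())))
        return dot / norm if norm else 0.0

    q = vectorize(tokenize(query))
    return [cosine(q, vectorize(t)) for t in tokenized]


docs = ["法定代表人", "注册资本", "联系电话"]
print(bm25_scores("法人代表", docs))
print(tfidf_cosine_scores("法人代表", docs))
```

In this toy corpus only the first document shares characters with the query, so it is the only one with a non-zero score under either scorer. BM25 scores are unbounded while cosine similarity lies in [0, 1], so the two usually need to be brought onto a common scale before they are fused.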

### Matching Strategy

The system uses a multi-level matching strategy:

1. **Exact Match**: Field name exactly matches knowledge base key
2. **Containment Match**: Field name contains or is contained in knowledge base key
3. **Keyword Match**: Multi-keyword combination matching
4. **Special Handling**: Auto-replacement of placeholders (e.g., "XX基金" → "国寿安保基金")
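
The cascade above could look roughly like the following. This is a sketch under assumptions: `match_field`, `replace_placeholder`, and the `keyword_groups` table are hypothetical names, the sample knowledge base values are invented, and the skill's own keyword rule may differ from the "all keywords must appear in the key" rule used here.

```python
def match_field(field, kb, keyword_groups=None):
    """Return (level, key, value) from the first level that matches, else None."""
    if not field:
        return None
    # Level 1: exact match on the knowledge base key.
    if field in kb:
        return ("exact", field, kb[field])
    # Level 2: containment in either direction.
    for key, value in kb.items():
        if key and (field in key or key in field):
            return ("containment", key, value)
    # Level 3: every keyword registered for this field must occur in the key.
    keywords = (keyword_groups or {}).get(field)
    if keywords:
        for key, value in kb.items():
            if all(word in key for word in keywords):
                return ("keyword", key, value)
    return None


def replace_placeholder(text, placeholder="XX基金", company="国寿安保基金"):
    # Level 4 (special handling): swap the placeholder for the real name.
    return text.replace(placeholder, company)


# Hypothetical sample data, for illustration only.
kb = {"法定代表人": "张三", "注册资本": "1000万元", "联系电话": "010-12345678"}
groups = {"法人代表": ["法", "代表"]}

print(match_field("注册资本", kb))            # exact
print(match_field("公司联系电话", kb))        # containment
print(match_field("法人代表", kb, groups))    # keyword
print(replace_placeholder("XX基金管理有限公司"))
```

Trying the cheap, strict levels first keeps unambiguous fields from being overridden by a fuzzier match further down.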

## How to Use This Skill

### Step 1: Prepare Knowledge Base

Ensure the knowledge base is a JSON file with the following structure:

```json
{
  "filename.xlsx": {
    "filename": "filename.xlsx",
    "type": "xlsx",
    "content": "=== Sheet: SheetName\nA1[Header1] | A2[Value1] | ..."
  },
  "filename.docx": {
    "filename": "filename.docx",
    "type": "docx",
    "content": {
      "paragraphs": ["text content..."],
      "tables": [...]
    }
  }
}
```

**Supported formats in JSON:**
- **xlsx**: Text-based Excel format with `A1[Value] | B2[Value]` pattern
- **docx**: Dictionary or list format containing paragraphs and table data
- **doc**: Plain text format
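
Before running the filler, it can help to check a knowledge base against the structure above. The following is a minimal sketch: `validate_kb` and `load_knowledge_base` are hypothetical helpers, not part of the bundled scripts, and the checks cover only the `type` and `content` rules listed here.

```python
import json

SUPPORTED_TYPES = {"xlsx", "docx", "doc"}


def validate_kb(kb):
    """Return a list of human-readable problems; an empty list means the KB looks usable."""
    problems = []
    for name, entry in kb.items():
        if not isinstance(entry, dict):
            problems.append(f"{name}: entry should be an object")
            continue
        kind = entry.get("type")
        content = entry.get("content")
        if kind not in SUPPORTED_TYPES:
            problems.append(f"{name}: unsupported type {kind!r}")
        elif kind in ("xlsx", "doc") and not isinstance(content, str):
            problems.append(f"{name}: {kind} content should be a string")
        elif kind == "docx" and not isinstance(content, (dict, list)):
            problems.append(f"{name}: docx content should be a dict or list")
    return problems


def load_knowledge_base(path):
    """Read the JSON file as UTF-8 and fail early on structural problems."""
    with open(path, encoding="utf-8") as fh:
        kb = json.load(fh)
    problems = validate_kb(kb)
    if problems:
        raise ValueError("; ".join(problems))
    return kb


sample = {
    "filename.xlsx": {"filename": "filename.xlsx", "type": "xlsx",
                      "content": "=== Sheet: SheetName\nA1[Header1] | A2[Value1]"},
    "filename.docx": {"filename": "filename.docx", "type": "docx",
                      "content": {"paragraphs": ["text content..."], "tables": []}},
}
print(validate_kb(sample))
```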

### Step 2: Run the Smart Filler

Before running, set `kb_path`, `template_dir`, and `output_dir` in the script to directories you control; the shipped script uses hard-coded Windows paths.

Execute the main filling script:

```bash
python scripts/smart_filler.py
```

The script will:

1. Load and parse the knowledge base JSON
2. Extract structured data (89+ typical fields)
3. Build hybrid retrieval index
4. Process all template files in the template directory
5. Fill matched fields and replace placeholders
6. Save filled files to output directory

### Step 3: Review Results

The system generates:
- **Filled templates** in the output directory (marked with the "已填写" ("filled") suffix)
- **Fill log** showing all field matches and replacements
- **Statistics**: Total fields filled, success rate, XX基金 replacement count

## Bundled Scripts

### scripts/vector_kb.py

**Purpose**: Core hybrid retrieval engine implementation

**Key Classes:**
- `BM25Retriever`: BM25 ranking algorithm implementation
- `TFIDFRetriever`: TF-IDF vector search implementation
- `HybridRetriever`: Fusion of both retrieval methods
- `VectorKnowledgeBase`: Knowledge base management and indexing

**Usage Example**:
```python
from vector_kb import VectorKnowledgeBase

# Initialize and load knowledge base
kb = VectorKnowledgeBase()
kb.load_knowledge_base('knowledge_base.json').build_index()

# Search for values
results = kb.search('法人代表', top_k=5)
for result in results:
    print(f"Score: {result['score']}, Value: {result['document']}")
```

### scripts/smart_filler.py

**Purpose**: Main template filling orchestration

**Key Classes:**
- `TextExcelParser`: Parses text-based Excel content
- `SmartFillSystem`: Orchestrates the entire filling process

**Usage Example**:
```python
from smart_filler import SmartFillSystem

# Configure paths
system = SmartFillSystem(
    kb_path='knowledge_base.json',
    template_dir='templates/',
    output_dir='filled/'
)

# Initialize and process
system.load_kb()
system.process_all()
```

**Configuration:**
- `kb_path`: Path to knowledge base JSON file
- `template_dir`: Directory containing template files
- `output_dir`: Directory for filled output files
## Reference Documentation\r
\r
### Knowledge Base Format Requirements\r
\r
**Excel Content Format** (text-based):\r
```\r
=== Sheet: SheetName ===\r
A1[Header1] | A2[Value1] | B1[Header2] | B2[Value2]\r
```\r
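
A parser for this text format might look like the sketch below. It is not the shipped `TextExcelParser`: the function names and regular expressions are assumptions, and pairing a row-1 header with the row-2 value in the same column is inferred from the example line above.

```python
import re

# A cell reference such as A1 followed by its value in square brackets.
CELL_RE = re.compile(r"([A-Z]+)(\d+)\[(.*?)\]")
# Sheet header, with or without the trailing "===".
SHEET_RE = re.compile(r"^=== Sheet: (.+?)(?: ===)?\s*$")


def parse_text_excel(content):
    """Return {sheet_name: {cell_ref: value}} from the text-based format."""
    sheets = {}
    current = None
    for line in content.splitlines():
        header = SHEET_RE.match(line.strip())
        if header:
            current = sheets.setdefault(header.group(1), {})
            continue
        if current is None:
            # Cells that appear before any sheet header go to a placeholder sheet.
            current = sheets.setdefault("(unnamed)", {})
        for col, row, value in CELL_RE.findall(line):
            current[f"{col}{row}"] = value
    return sheets


def header_value_pairs(cells):
    """Pair each row-1 header with the row-2 value in the same column."""
    pairs = {}
    for ref, header in cells.items():
        m = re.fullmatch(r"([A-Z]+)1", ref)
        if m:
            value = cells.get(f"{m.group(1)}2")
            if value is not None:
                pairs[header] = value
    return pairs


text = "=== Sheet: SheetName ===\nA1[Header1] | A2[Value1] | B1[Header2] | B2[Value2]"
sheets = parse_text_excel(text)
print(header_value_pairs(sheets["SheetName"]))
```

Cell values that themselves contain `]` would be cut short by this pattern, so real data may need an escaping convention.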
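
**Document Content Format** (field extraction):
- Use regex patterns to extract: `字段名[：:\s]*值` (field name, optional colon or whitespace, then the value)
- Supported fields: 法人代表 (legal representative), 联系电话 (contact phone), 地址 (address), 注册资本 (registered capital), 统一社会信用代码 (unified social credit code), etc.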
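
The field-extraction pattern can be exercised with Python's `re` module. This is a minimal sketch: `extract_fields` and `FIELD_NAMES` are hypothetical names, the sample text is invented, and the value is assumed to run to the end of the line.

```python
import re

FIELD_NAMES = ["法人代表", "联系电话", "地址", "注册资本", "统一社会信用代码"]


def extract_fields(text, field_names=FIELD_NAMES):
    """Return {field_name: value} for every field found in the text."""
    found = {}
    for name in field_names:
        # Field name, optional colon (full- or half-width) or whitespace, then the value.
        match = re.search(re.escape(name) + r"[：:\s]*([^\n\r]+)", text)
        if match:
            found[name] = match.group(1).strip()
    return found


# Hypothetical sample text, for illustration only.
sample_text = "法人代表：张三\n联系电话: 010-12345678\n注册资本 1000万元"
print(extract_fields(sample_text))
```

Because `\s` also matches line breaks, a field name followed by an empty value would capture the next line instead, so callers may want to validate extracted values.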

**Year-based Data**:
- Automatic organization by year (e.g., "2024年总资产", total assets for 2024)
- Cleaned headers (year removed) for better matching
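
Separating the year from a header could be done as sketched below. The helper names and the year pattern (a four-digit year starting with 19 or 20, followed by "年") are assumptions for illustration, and the values are invented.

```python
import re
from collections import defaultdict

YEAR_RE = re.compile(r"((?:19|20)\d{2})年")


def split_year(header):
    """Return (year or None, header with the year marker removed)."""
    match = YEAR_RE.search(header)
    if not match:
        return None, header.strip()
    cleaned = (header[:match.start()] + header[match.end():]).strip()
    return int(match.group(1)), cleaned


def organize_by_year(pairs):
    """Group {header: value} pairs into {year: {cleaned_header: value}}."""
    by_year = defaultdict(dict)
    for header, value in pairs.items():
        year, cleaned = split_year(header)
        by_year[year][cleaned] = value
    return dict(by_year)


print(split_year("2024年总资产"))
print(organize_by_year({"2024年总资产": "100", "2023年总资产": "90"}))
```

Headers without a year end up under the `None` key, which keeps them available for matching without a year filter.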

### Performance Characteristics

Based on real-world testing:

| Metric | Value |
|---------|--------|
| Knowledge Base Fields | 89+ |
| Files Processed | 5+ |
| Total Fields Filled | 388+ |
| Fields Per File (Average) | 77.6 |
| XX基金 Replacement Rate | 100% |
| Precision Improvement | 50%+ over keyword matching |
| Efficiency Gain | 90%+ over manual filling |
## Common Issues and Solutions\r
\r
### Issue: Low Match Rate\r
\r
**Cause**: Knowledge base content format incompatible\r
\r
**Solution**: Ensure Excel content uses `A1[Value]` format; check JSON structure\r
\r
### Issue: Wrong Value Filled\r
\r
**Cause**: Field name ambiguity\r
\r
**Solution**: Adjust hybrid retrieval weights; use more specific field names in templates\r
\r
### Issue: Encoding Errors\r
\r
**Cause**: Non-UTF-8 characters in knowledge base\r
\r
**Solution**: Ensure knowledge base JSON is UTF-8 encoded; use `sys.stdout.reconfigure(encoding='utf-8')` in scripts\r
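
A defensive loader can make encoding problems easier to diagnose. The sketch below is an assumption-laden illustration: `decode_kb_bytes` and `ensure_utf8_stdout` are hypothetical helpers, and the `gb18030` fallback is a guess at the legacy encoding a Chinese Windows export might use, not something the skill documents.

```python
import json
import sys


def decode_kb_bytes(raw, fallback="gb18030"):
    """Decode knowledge base bytes as UTF-8, falling back to a legacy encoding."""
    try:
        text = raw.decode("utf-8-sig")  # also strips a UTF-8 byte order mark
    except UnicodeDecodeError:
        text = raw.decode(fallback)  # assumption: legacy Chinese Windows encoding
    return json.loads(text)


def ensure_utf8_stdout():
    # reconfigure() exists on text streams since Python 3.7; skip if stdout was replaced.
    if hasattr(sys.stdout, "reconfigure"):
        sys.stdout.reconfigure(encoding="utf-8")


sample = '{"法人代表": "张三"}'  # hypothetical content
print(decode_kb_bytes(sample.encode("utf-8")))
print(decode_kb_bytes(sample.encode("gb18030")))
```

Re-saving the knowledge base as UTF-8 remains the more reliable fix; a fallback like this only hides the underlying mismatch.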

## Advanced Usage

### Custom Retrieval Weights

Modify the hybrid retrieval weight balance in `HybridRetriever`:

```python
# Default: BM25 0.5, TF-IDF 0.5
# Change to emphasize semantic matching:
self.bm25_weight = 0.3
self.tfidf_weight = 0.7
```
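
To see how the weights change a ranking, the fusion step can be sketched in isolation. This is an illustration, not the `HybridRetriever` code: the `fuse` and `min_max` helpers are hypothetical, the input scores are invented, and the min-max normalization is an added assumption so that unbounded BM25 scores and [0, 1] cosine scores are comparable.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(bm25, tfidf, bm25_weight=0.5, tfidf_weight=0.5):
    """Weighted sum of normalized BM25 and TF-IDF scores, one value per document."""
    if len(bm25) != len(tfidf):
        raise ValueError("score lists must have the same length")
    return [bm25_weight * b + tfidf_weight * t
            for b, t in zip(min_max(bm25), min_max(tfidf))]


# Invented scores for three documents.
bm25 = [4.0, 2.0, 0.0]
tfidf = [0.2, 0.9, 0.0]

print(fuse(bm25, tfidf))            # equal weights
print(fuse(bm25, tfidf, 0.7, 0.3))  # BM25-heavy: the first document ranks first
print(fuse(bm25, tfidf, 0.3, 0.7))  # TF-IDF-heavy: the second document ranks first
```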

### Custom Field Extraction

Extend `TextExcelParser._extract_from_text()` to support additional patterns:

```python
patterns = {
    'new_field': r'新字段[：:\s]*([^\n\r]+)',
    # Add more patterns...
}
```
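
### Batch Processing

Process multiple knowledge bases:

```python
from pathlib import Path
from smart_filler import SmartFillSystem

kb_files = ['kb1.json', 'kb2.json', 'kb3.json']
for kb_file in kb_files:
    # Use the file stem so the output directory is "filled_kb1/", not "filled_kb1.json/"
    out_dir = f'filled_{Path(kb_file).stem}/'
    system = SmartFillSystem(kb_path=kb_file, template_dir='templates/', output_dir=out_dir)
    system.load_kb()
    system.process_all()
```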

## Limitations

1. **No Machine Learning Embeddings**: Uses TF-IDF (not BERT/Transformer embeddings) for lightweight deployment
2. **Chinese Tokenization**: Simple character-based tokenization (not jieba)
3. **Excel Format**: Requires text-based format; binary Excel files need pre-processing
4. **Context Awareness**: Limited cell-to-cell context understanding
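
For reference, character-based tokenization of mixed Chinese and ASCII text can be as small as the sketch below. The exact rules in the bundled scripts are not documented here, so the regular expression (one token per CJK character, ASCII letter and digit runs kept whole and lowercased) is an assumption.

```python
import re

# One token per CJK character; runs of ASCII letters or digits stay together.
TOKEN_RE = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9]+")


def tokenize(text):
    """Split text into character-level tokens, dropping punctuation and whitespace."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


print(tokenize("2024年总资产 ABC"))
print(tokenize("法人代表"))
```

Character tokens let "法人代表" and "法定代表人" share four tokens without a dictionary, at the cost of also matching unrelated fields that merely share common characters.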

## Future Enhancements

Potential improvements for future versions:

1. **Deep Learning Embeddings**: Integrate sentence-transformers for true semantic vectors
2. **Cross-Modal Fusion**: Combine table structure information with text matching
3. **Adaptive Weighting**: Learn optimal BM25/TF-IDF weights from user feedback
4. **Domain Adaptation**: Build domain-specific vocabularies for finance, legal, etc.

## References

For deeper understanding:

- **BM25 Algorithm**: Robertson, S. E., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond
- **TF-IDF**: Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval
- **Hybrid Retrieval**: Combining multiple evidence sources in search systems

Usage Guidance

This skill appears to do what it claims (local knowledge-base → Word/Excel template filling) and does not request credentials or contact outside endpoints. Before running: (1) inspect and edit scripts/smart_filler.py to set kb_path, template_dir, and output_dir to directories you control (the shipped script has hard-coded Windows paths), (2) review the regex patterns and the hard-coded placeholder replacement ('XX'→'国寿安保基金') to ensure they are appropriate for your data, (3) run in a sandbox or test directory first to verify behavior, (4) install required Python packages (python-docx, openpyxl) in a virtual environment, and (5) ensure your knowledge_base.json does not contain sensitive secrets you don't want processed or written into output files. If you want higher assurance, ask the author to remove hard-coded paths and make configuration explicit (command-line args or config file) or provide a small sanitized example KB and templates to test with.
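
One way to make the configuration explicit, as suggested above, is a small command-line wrapper. This is a sketch only: the flag names and the `build_parser` and `parse_args` helpers are hypothetical, and the commented-out lines assume that `SmartFillSystem` accepts the `kb_path`, `template_dir`, and `output_dir` arguments shown in the README's usage example.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        description="Fill Word/Excel templates from a JSON knowledge base.")
    parser.add_argument("--kb-path", required=True,
                        help="path to the knowledge base JSON file")
    parser.add_argument("--template-dir", required=True,
                        help="directory containing template files")
    parser.add_argument("--output-dir", required=True,
                        help="directory for filled output files")
    return parser


def parse_args(argv=None):
    """Parse argv (defaults to sys.argv[1:]) into a namespace with the three paths."""
    return build_parser().parse_args(argv)


args = parse_args(["--kb-path", "knowledge_base.json",
                   "--template-dir", "templates/",
                   "--output-dir", "filled/"])
print(args.kb_path, args.template_dir, args.output_dir)

# With the bundled module on the import path, the values could then be passed on:
# from smart_filler import SmartFillSystem
# system = SmartFillSystem(kb_path=args.kb_path,
#                          template_dir=args.template_dir,
#                          output_dir=args.output_dir)
# system.load_kb()
# system.process_all()
```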

Capability Analysis

Type: OpenClaw Skill
Name: hybrid-smart-fill
Version: 1.0.0

The hybrid-smart-fill skill is a specialized utility for automating the population of Word (.docx) and Excel (.xlsx) templates using data from a local JSON knowledge base. It implements custom BM25 and TF-IDF retrieval logic in 'scripts/vector_kb.py' and 'scripts/smart_filler.py' to perform semantic matching of template fields. The code is transparent, lacks network or system-level execution capabilities, and contains no evidence of data exfiltration or prompt injection. While 'smart_filler.py' contains hardcoded Windows file paths specific to a Chinese financial context (e.g., '国寿安保基金'), these are functional artifacts rather than malicious indicators.

Capability Assessment

✓ Purpose & Capability

Name/description (hybrid retrieval + template filling) match the included code and docs. The Python modules implement BM25/TF-IDF hybrid retrieval and template fill logic; required inputs (knowledge-base JSON, templates) align with the stated purpose.

⚠ Instruction Scope

SKILL.md instructs running the bundled scripts, which is expected, but smart_filler.py contains hard-coded absolute Windows paths (kb_path, template_dir, output_dir) that will be used if the script is executed without editing. Running the script as-is could attempt to read those local paths. The code also performs broad regex extraction and replacement (including a hard-coded 'XX' → '国寿安保基金' replacement), which is domain-specific. There are no instructions to read unrelated system files or contact external endpoints, but the hard-coded paths are a moderate risk if left unchanged.

✓ Install Mechanism

No install spec; instruction-only plus included Python scripts. No downloaded archives, no external installers, and no package pulls in the skill metadata. Scripts have minimal third-party dependency hints (python-docx, openpyxl) but those must be installed by the user.

✓ Credentials

The skill requests no environment variables or credentials. The code reads only files (knowledge base and template files); there are no hidden env var usages or secrets requests.

✓ Persistence & Privilege

Flags show no always:true and no special privileges. The skill does not modify other skills or global agent configuration and has no automatic installation hooks.

Version History

v1.0.0

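`hybrid-smart-fill` is a smart template auto-fill skill based on hybrid retrieval (BM25 semantic retrieval + TF-IDF vector retrieval). It matches fields from a knowledge base with high precision and automatically fills Word documents and Excel spreadsheets.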

Metadata

Slug hybrid-smart-fill

Version 1.0.0

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 1

Frequently Asked Questions

What is hybrid-smart-fill?

This skill provides hybrid retrieval (BM25 semantic search + TF-IDF vector similarity) for intelligent template auto-filling. Use when users need to batch fi... It is an AI Agent Skill for Claude Code / OpenClaw, with 152 downloads so far.

How do I install hybrid-smart-fill?

Run "/install hybrid-smart-fill" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is hybrid-smart-fill free?

Yes, hybrid-smart-fill is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does hybrid-smart-fill support?

hybrid-smart-fill is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created hybrid-smart-fill?

It is built and maintained by maodou13 (@deweienweide); the current version is v1.0.0.
