Description

Unified heterogeneous knowledge QA system. Automatically routes natural language queries to SQL databases, Knowledge Graphs, or table files using 4-layer det...

README (SKILL.md)

HeteroMind

Name: HeteroMind - Unified Knowledge QA
Author: bahuia

Unified heterogeneous knowledge QA system with automatic source detection and multi-stage reasoning.

Core Concept

Natural language queries are automatically routed to the appropriate knowledge source (SQL, Knowledge Graph, or Table files) without requiring users to specify the data source. A 4-layer detection architecture ensures accurate source identification, followed by multi-stage query generation with self-revision and voting.

User Query → Source Detection (4 layers) → Query Generation → Self-Revision → Voting → Execution → Answer

When to Use

Trigger	Action
"How many employees in X?"	NL2SQL engine
"Who is the founder of X?"	NL2SPARQL engine (KG)
"Which quarter had highest sales?"	TableQA engine
"Show average salary by department"	Auto-detect SQL
Queries with aggregations, filters, joins	Route to SQL
Entity relationship queries	Route to KG
Questions about CSV/Excel files	Route to TableQA
Multi-hop queries across sources	Decompose + fuse

Architecture

4-Layer Source Detection

Layer 1 (15%): Rule-Based
  - 20+ keywords per source type
  - 7 regex patterns (aggregation, comparison, relation)
  - Fast pre-filtering

Layer 2 (35%): LLM Semantic
  - Intent classification
  - Entity/predicate detection
  - Multi-hop identification

Layer 3a (25%, shared with 3b): SQL Schema Match
  - Inverted index on tables/columns
  - Automatic JOIN inference
  - Confidence scoring

Layer 3b (25%, shared with 3a): KG Entity Link
  - Entity mention extraction
  - SPARQL endpoint lookup
  - Predicate pattern matching

Layer 3c (25%, plus 30% boost): Entity Verification
  - Cross-source entity existence check
  - 30% score boost for verified entities

Layer 4: Multi-Source Fusion
  - Weighted aggregation
  - Execution plan generation
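
Layer 1 can be pictured as keyword and regex scoring. The sketch below is illustrative only: the keyword sets, the patterns, and the `rule_scores` function are hypothetical and far smaller than the 20+ keywords and 7 patterns described above. It is not the code in `rule_detector.py`.

```python
import re

# Hypothetical keyword lists; the real detector is described as using 20+ per source.
KEYWORDS = {
    "sql": {"how many", "average", "total", "count", "department"},
    "kg": {"who is", "founder", "founded", "born", "capital of"},
    "table": {"csv", "excel", "spreadsheet", "quarter", "sheet"},
}

# Hypothetical regex patterns for aggregation, relation, and file-name phrasing.
PATTERNS = {
    "sql": [re.compile(r"\b(sum|avg|average|count|max|min)\b", re.I)],
    "kg": [re.compile(r"\bwho (is|was|founded)\b", re.I)],
    "table": [re.compile(r"\.(csv|xlsx?)\b", re.I)],
}


def rule_scores(query: str) -> dict:
    """Return a normalized score per source from keyword and regex hits."""
    text = query.lower()
    hits = {}
    for source in KEYWORDS:
        keyword_hits = sum(1 for kw in KEYWORDS[source] if kw in text)
        pattern_hits = sum(1 for p in PATTERNS[source] if p.search(query))
        hits[source] = keyword_hits + pattern_hits
    total = sum(hits.values())
    if total == 0:
        # No rule fired: leave the decision to the later layers.
        return {source: 0.0 for source in hits}
    return {source: count / total for source, count in hits.items()}
```

Because these scores only pre-filter, a query that matches no rule simply scores zero everywhere and is decided by the LLM and schema layers.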

Query Generation Pipeline

1. Schema/Entity Linking     → Identify relevant tables/columns/entities
2. Parallel Generation       → Generate 3 candidates concurrently
3. Multi-Round Revision      → 2 rounds of self-review
4. Validation               → Syntax and semantic checks
5. Voting                   → Select best candidate
6. Execution                → Run query
7. Result Verification      → Validate reasonableness
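
Steps 2 and 5 of the pipeline can be sketched with `asyncio.gather` and a majority vote. This is a minimal illustration under stated assumptions: `generate_candidate` is a stand-in that returns canned SQL instead of calling an LLM, and majority voting over normalized query text is one plausible voting rule, not necessarily the one HeteroMind uses. The revision and validation steps are omitted.

```python
import asyncio
from collections import Counter


async def generate_candidate(question: str, seed: int) -> str:
    """Stand-in for one LLM call (hypothetical); returns a canned SQL string."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    variants = [
        "SELECT COUNT(*) FROM employees",
        "SELECT COUNT(*) FROM employees",
        "SELECT COUNT(id) FROM employees",
    ]
    return variants[seed % len(variants)]


def normalize(query: str) -> str:
    """Collapse whitespace and case so equivalent candidates vote together."""
    return " ".join(query.lower().split())


def vote(candidates: list) -> str:
    """Majority vote; ties go to the earliest candidate."""
    if not candidates:
        raise ValueError("no candidates to vote on")
    counts = Counter(normalize(c) for c in candidates)
    best = max(counts.values())
    for candidate in candidates:
        if counts[normalize(candidate)] == best:
            return candidate


async def generate_and_vote(question: str, num_candidates: int = 3) -> str:
    # Launch all candidate generations concurrently, then pick a winner.
    candidates = await asyncio.gather(
        *(generate_candidate(question, i) for i in range(num_candidates))
    )
    return vote(list(candidates))


winner = asyncio.run(generate_and_vote("How many employees?"))
```

With three candidates, two agreeing queries outvote the third, which is why an odd `num_candidates` is a convenient default.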

Engines

NL2SQL Engine

from src.engines.nl2sql.multi_stage_engine import MultiStageNL2SQLEngine

engine = MultiStageNL2SQLEngine({
    "name": "sql_engine",
    "schema": schema,
    "llm_config": {
        "model": "deepseek-chat",
        "api_key": "sk-...",
    },
    "generation_config": {
        "num_candidates": 3,
        "max_revisions": 2,
        "parallel_generation": True,
    },
})

result = await engine.execute("How many employees in Engineering?", {})

Features:

Schema linking (rule-based + LLM)
Parallel SQL candidate generation
Multi-round self-revision
Voting mechanism
Result verification

NL2SPARQL Engine

from src.engines.nl2sparql.multi_stage_engine import MultiStageNL2SPARQLEngine

engine = MultiStageNL2SPARQLEngine({
    "name": "sparql_engine",
    "endpoint_url": "https://dbpedia.org/sparql",
    "ontology": ontology,
    "llm_config": {"model": "gpt-4", "api_key": "sk-..."},
})

result = await engine.execute("Who founded Microsoft?", {})

Features:

Entity linking to KG
Ontology retrieval
SPARQL generation with revision
Multi-endpoint support

TableQA Engine

from src.engines.table_qa.multi_stage_engine import MultiStageTableQAEngine

engine = MultiStageTableQAEngine({
    "name": "table_engine",
    "table_path": "data/sales.csv",
    "llm_config": {"model": "deepseek-chat", "api_key": "sk-..."},
})

result = await engine.execute("Which quarter had highest sales?", {})

Features:

Table schema analysis
Query intent interpretation
Pandas code generation
Safe execution sandbox
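
Since the TableQA engine runs generated pandas code, a static check before execution is one possible extra guard. The sketch below is not HeteroMind's sandbox: `check_generated_code` and `ALLOWED_NAMES` are hypothetical, built only on Python's standard `ast` module. It rejects imports, underscore-prefixed attribute access, and names outside a small allowlist. A check of this kind narrows what generated code can do but is not a substitute for process isolation.

```python
import ast

# Hypothetical allowlist: the DataFrame, the output variable, a few safe builtins.
ALLOWED_NAMES = {"df", "result", "len", "sum", "min", "max", "round"}


def check_generated_code(code: str) -> list:
    """Return a list of problems found; an empty list means the check passed."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]
    problems = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            problems.append("import statements are not allowed")
        elif isinstance(node, ast.Attribute) and node.attr.startswith("_"):
            problems.append(f"private attribute access: {node.attr}")
        elif isinstance(node, ast.Name) and node.id not in ALLOWED_NAMES:
            problems.append(f"name not in allowlist: {node.id}")
    return problems
```

The allowlist is deliberately strict, so lambdas and comprehensions with new variable names are rejected too. Loosening it is a trade-off between expressiveness and attack surface.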

Multi-LLM Support

Override model and API key at runtime:

# Initialize with default
engine = MultiStageNL2SQLEngine({
    "llm_config": {"model": "deepseek-chat", "api_key": "sk-deepseek-key"},
})

# Override per-call
result = await engine.execute(
    query="Complex query",
    context={},
    model="gpt-4-turbo",      # Override model
    api_key="sk-openai-key",  # Override API key
)
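
One way to picture the per-call override is as a merge of call arguments over the defaults given at construction time. The helper below is a hypothetical sketch (`resolve_llm_config` is not a HeteroMind function); it only illustrates the precedence implied above, where explicit per-call values win and the stored defaults stay untouched.

```python
from typing import Optional


def resolve_llm_config(default: dict,
                       model: Optional[str] = None,
                       api_key: Optional[str] = None) -> dict:
    """Return a copy of the default config with per-call overrides applied."""
    resolved = dict(default)  # copy, so the engine's defaults are not mutated
    if model is not None:
        resolved["model"] = model
    if api_key is not None:
        resolved["api_key"] = api_key
    return resolved


defaults = {"model": "deepseek-chat", "api_key": "sk-deepseek-key"}
per_call = resolve_llm_config(defaults, model="gpt-4-turbo", api_key="sk-openai-key")
```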

Supported Providers

Provider	Models	Configuration
DeepSeek	deepseek-chat	`base_url: https://api.deepseek.com/v1`
OpenAI	gpt-4, gpt-3.5-turbo	Default endpoint
Azure OpenAI	gpt-4	`base_url: https://{resource}.openai.azure.com`
Local (Ollama)	llama2, mistral	`base_url: http://localhost:11434/v1`

Configuration

LLM Configuration

llm_config:
  model: deepseek-chat
  api_key: sk-...
  base_url: https://api.deepseek.com/v1  # Optional
  temperature: 0.1
  max_tokens: 500
  timeout: 30

Generation Configuration

generation_config:
  num_candidates: 3           # SQL/SPARQL candidates to generate
  max_revisions: 2            # Self-revision rounds
  parallel_generation: true   # Concurrent candidate generation
  voting_enabled: true        # Multi-candidate voting

Source Detection Weights

weights:
  rule_based: 0.15      # Layer 1
  llm_based: 0.35       # Layer 2
  schema_based: 0.25    # Layer 3a/3b
  verification: 0.25    # Layer 3c
verification_boost: 0.3  # 30% boost for verified entities
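
To make the weights concrete, here is a minimal sketch of how a weighted aggregation with a verification boost could be computed. The formula is an assumption: it takes a weighted sum of per-layer scores, multiplies by `1 + verification_boost` for sources whose entities were verified, and caps the result at 1.0. The `fuse` function and the sample scores are hypothetical; only the weight values come from the configuration above.

```python
WEIGHTS = {
    "rule_based": 0.15,    # Layer 1
    "llm_based": 0.35,     # Layer 2
    "schema_based": 0.25,  # Layer 3a/3b
    "verification": 0.25,  # Layer 3c
}
VERIFICATION_BOOST = 0.3


def fuse(layer_scores: dict, verified: set) -> dict:
    """Combine per-layer scores (each in [0, 1]) into one score per source."""
    fused = {}
    for source, scores in layer_scores.items():
        total = sum(WEIGHTS[layer] * scores.get(layer, 0.0) for layer in WEIGHTS)
        if source in verified:
            total *= 1 + VERIFICATION_BOOST  # assumed multiplicative boost
        fused[source] = min(total, 1.0)
    return fused


# Hypothetical layer outputs for "How many employees in Engineering?"
layer_scores = {
    "sql": {"rule_based": 0.5, "llm_based": 0.8, "schema_based": 0.6, "verification": 0.5},
    "kg": {"llm_based": 0.2},
}
fused = fuse(layer_scores, verified={"sql"})
primary = max(fused, key=fused.get)
```

Because the four weights sum to 1.0, the unboosted score stays in [0, 1]. The cap only matters once the boost is applied.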

Workflows

Complete Query Flow

from src.orchestrator import HeteroMindOrchestrator

orchestrator = HeteroMindOrchestrator({
    "source_detection": {
        "layer2": {"api_key": "sk-...", "model": "gpt-4"},
        "layer3": {"schemas": [schema], "kg_endpoints": [...]},
    },
    "engines": {
        "sql": [{"name": "default", "enabled": True}],
        "sparql": [{"name": "default", "enabled": True}],
        "table_qa": [{"name": "default", "enabled": True}],
    },
})

response = await orchestrator.query("How many employees in Engineering?")
print(f"Answer: {response.answer}")
print(f"Source: {response.sources}")
print(f"Confidence: {response.confidence:.2f}")

Source Detection Only

from src.classifier import SourceDetectorOrchestrator

detector = SourceDetectorOrchestrator({
    "layer2": {"api_key": "sk-...", "model": "gpt-4"},
    "layer3": {"schemas": [schema]},
})

decision = await detector.detect("How many employees?")
print(f"Primary Source: {decision.primary_source.value}")
print(f"Confidence: {decision.confidence:.2f}")
print(f"Execution Plan: {decision.execution_plan}")

Test Results

Engine	Tests	Passed	Accuracy	Avg Confidence	Avg Time
SQL (NL2SQL)	3	3	100.0%	0.60	22.5s
SPARQL (NL2SPARQL)	2	2	100.0%	0.20	36.3s
TableQA	3	3	100.0%	0.62	24.2s
Overall	8	8	100.0%	0.51	26.6s
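
The Overall row is consistent with a test-count-weighted mean of the three engine rows. The short check below reproduces it from the table values; the variable names are illustrative.

```python
# (tests, avg confidence, avg time in seconds) per engine, from the table above
rows = {
    "SQL": (3, 0.60, 22.5),
    "SPARQL": (2, 0.20, 36.3),
    "TableQA": (3, 0.62, 24.2),
}

total_tests = sum(n for n, _, _ in rows.values())
avg_conf = sum(n * conf for n, conf, _ in rows.values()) / total_tests
avg_time = sum(n * secs for n, _, secs in rows.values()) / total_tests

print(total_tests, round(avg_conf, 2), round(avg_time, 1))
```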

Environment Variables

Required (for LLM-based generation)

Variable	Description	Example
`DEEPSEEK_API_KEY`	DeepSeek API key	`sk-...`
`OPENAI_API_KEY`	OpenAI API key	`sk-...`

Optional (for specific features)

Variable	Description	Example
`MYSQL_CONNECTION_STRING`	MySQL database connection	`mysql://user:pass@host/db`
`CUSTOM_KG_ENDPOINT`	Custom KG SPARQL endpoint	`https://example.com/sparql`
`WORKSPACE`	Base path for table file scanning	`/path/to/workspace`
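
A sketch of how these variables might be turned into an `llm_config` at startup. The `llm_config_from_env` helper is hypothetical, and preferring DeepSeek when both keys are set is an assumption. The model names and base URL are taken from the Supported Providers table.

```python
import os


def llm_config_from_env(env=None) -> dict:
    """Build an llm_config dict from whichever API key is present."""
    env = os.environ if env is None else env
    if env.get("DEEPSEEK_API_KEY"):
        return {
            "model": "deepseek-chat",
            "api_key": env["DEEPSEEK_API_KEY"],
            "base_url": "https://api.deepseek.com/v1",
        }
    if env.get("OPENAI_API_KEY"):
        # OpenAI uses the default endpoint, so no base_url is set.
        return {"model": "gpt-4", "api_key": env["OPENAI_API_KEY"]}
    raise RuntimeError("Set DEEPSEEK_API_KEY or OPENAI_API_KEY")
```

Failing fast with a clear error keeps a missing key from surfacing later as an opaque authentication failure inside a generation stage.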

Setup

# Copy example env file
cp .env.example .env

# Edit with your credentials
nano .env

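# Load environment (auto-export every variable defined in .env;
# unlike `export $(cat .env | xargs)`, this tolerates comment lines)
set -a
. ./.env
set +a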

Installation

cd HeteroMind
pip install -r requirements.txt

Requirements

Python 3.10+
aiohttp, pandas, openpyxl
OpenAI-compatible API key (needed only for the LLM-based stages)

Project Structure

HeteroMind/
├── src/
│   ├── classifier/          # 4-layer source detection
│   │   ├── rule_detector.py      # Layer 1
│   │   ├── llm_detector.py       # Layer 2
│   │   ├── sql_schema_matcher.py # Layer 3a
│   │   ├── kg_entity_linker.py   # Layer 3b
│   │   ├── entity_verifier.py    # Layer 3c
│   │   └── source_fusion.py      # Layer 4
│   ├── engines/             # Query engines
│   │   ├── nl2sql/
│   │   ├── nl2sparql/
│   │   └── table_qa/
│   ├── decomposer/          # Task decomposition
│   ├── fusion/              # Result fusion
│   ├── generator/           # Answer generation
│   └── orchestrator.py      # Main orchestrator
├── config/
│   └── source_detection.yaml
├── tests/
│   └── test_data/
├── comprehensive_tests.py
└── SKILL.md

Examples

SQL: Aggregation with Filter

Query: "How many employees are in the Engineering department?"

Generated SQL:

SELECT COUNT(*) FROM employees e 
JOIN departments d ON e.department_id = d.id 
WHERE d.name = 'Engineering'

SPARQL: Entity Relationship

Query: "Who is the founder of Microsoft?"

Generated SPARQL:

SELECT ?founder WHERE {
    <http://dbpedia.org/resource/Microsoft> 
    <http://dbpedia.org/ontology/founder> ?founder
}

TableQA: Aggregation

Query: "Which quarter had the highest sales in 2024?"

Generated Code:

result = df.groupby('quarter')['sales'].sum().idxmax()

Skill Contract

Skills that use HeteroMind should declare:

heteromind:
  reads: [Database Schema, KG Ontology, Table Files]
  writes: [Generated SQL, SPARQL, Pandas Code]
  requires:
    - LLM API key (for generation stages)
    - Schema metadata (for source detection)
  postconditions:
    - Generated query passes validation
    - Result verified for reasonableness

Integration Patterns

With Agent Memory

Log query execution for audit:

from src.orchestrator import HeteroMindOrchestrator

orchestrator = HeteroMindOrchestrator(config)
response = await orchestrator.query(query)

# Log to agent memory
memory.record({
    "action": "knowledge_query",
    "query": query,
    "source": response.sources,
    "confidence": response.confidence,
    "answer": response.answer,
})

Multi-Source Fusion

For queries requiring multiple sources:

# Query automatically detects hybrid need
response = await orchestrator.query(
    "Show employees who published papers"
)
# Routes to: SQL (employees) + KG (papers) + Fusion
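
The fusion step for a hybrid query like this one can be thought of as a join between the two result sets. The sketch below is an assumption about the general shape, not the code in `src/fusion/`: `fuse_results`, the sample rows, and the shared `name` key are all hypothetical. Matching real SQL rows to KG entities would also need entity linking.

```python
def fuse_results(sql_rows: list, kg_rows: list, key: str = "name") -> list:
    """Inner-join two lists of dicts on a shared key, merging matching rows."""
    kg_by_key = {}
    for row in kg_rows:
        kg_by_key.setdefault(row[key], []).append(row)
    fused = []
    for row in sql_rows:
        for match in kg_by_key.get(row[key], []):
            fused.append({**row, **match})
    return fused


# Hypothetical partial results from the two engines
sql_rows = [
    {"name": "Ada", "department": "Engineering"},
    {"name": "Bob", "department": "Sales"},
]
kg_rows = [
    {"name": "Ada", "paper": "Paper A"},
    {"name": "Ada", "paper": "Paper B"},
]
employees_with_papers = fuse_results(sql_rows, kg_rows)
```

An inner join keeps only employees that appear in both sources, which matches the intent of "employees who published papers".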

References

README.md — Full documentation and API reference
USAGE.md — Detailed usage guide with multi-LLM examples
config/source_detection.yaml — Detection configuration
tests/test_data/ — Example schemas and test data

Version: 0.1.0
Last Updated: 2026-04-12
Test Results: 100.0% accuracy on 8 test cases

Usage Guidance

Do not install or supply credentials until the metadata mismatch is resolved. Specific steps to consider before enabling this skill:

1) Ask the publisher/registry why registry metadata lists no required env vars while SKILL.md requires DEEPSEEK_API_KEY and OPENAI_API_KEY.
2) If you proceed, provide only least-privilege credentials (read-only DB users, scoped API keys) and explicit table paths rather than wildcard/mounted workspaces.
3) Keep auto_execute disabled and require confirmation; review and test generated queries in a safe/non-production database.
4) Be aware that logging/debug options can capture intermediate queries and results; disable verbose logging if data sensitivity is a concern.
5) Inspect SECURITY.md and source files (especially src/utils/api_security.py and any logging code) for how secrets and outputs are handled.
6) Consider running the package in an isolated environment first (no production credentials) to verify behavior.

Capability Analysis

Type: OpenClaw Skill
Name: heteromind
Version: 0.3.0

The HeteroMind skill bundle implements a complex QA system with high-risk execution patterns, most notably the use of `exec()` in `src/engines/table_qa/multi_stage_engine.py` to run LLM-generated Python code. It also executes generated SQL and SPARQL queries, which are susceptible to injection attacks if the LLM is manipulated. While the bundle includes defensive utilities in `src/utils/api_security.py` and a detailed `SECURITY.md` acknowledging these risks, the 'sandbox' provided for Python execution is a simple dictionary scope that is easily bypassed. No clear evidence of intentional malice was found, but the architecture presents a significant RCE (Remote Code Execution) attack surface.

Capability Tags

requires-oauth-token

Capability Assessment

⚠ Purpose & Capability

The functionality (NL→SQL/NL→SPARQL/TableQA, multi-LLM support) justifies requesting LLM API keys and optional DB connection strings; however the registry metadata claims no required environment variables or credentials while SKILL.md lists required_env_vars (DEEPSEEK_API_KEY, OPENAI_API_KEY) and optional connection strings. This mismatch between metadata and the runtime instructions/code is an incoherence that should be resolved before trusting the package.

ℹ Instruction Scope

SKILL.md instructs the agent to use LLM API keys, to connect to SQL/PG/MySQL endpoints and optional custom KG endpoints, and to read explicitly-specified table files. That behavior is in-scope for a heterogeneous QA engine. Two things to flag: (1) default config enables detailed logging (log_layer_outputs, log_verification_details) which may record intermediate query text / schema / results (potentially sensitive), and (2) per-call API key/endpoint overrides permit the skill to be directed to use arbitrary keys/endpoints at runtime. SKILL.md does not instruct reading unrelated system files or exfiltrating data to hidden endpoints.

✓ Install Mechanism

No install spec (instruction-only) is present, and the package includes code and a plain requirements.txt referencing common PyPI packages (openai, pandas, sqlalchemy, rdflib, etc.). There are no downloads from arbitrary URLs or extract/install steps in the provided metadata. Nothing in the install footprint indicates hidden remote installers or unusual persistence.

⚠ Credentials

The environment variables referenced in SKILL.md (DEEPSEEK_API_KEY, OPENAI_API_KEY, MYSQL_CONNECTION_STRING, POSTGRES_CONNECTION_STRING, CUSTOM_KG_ENDPOINT, TABLE_PATHS) are plausible for the stated purpose. The concern is the mismatch: registry metadata declared no required env vars while SKILL.md marks two API keys as required and several sensitive optional values. That mismatch can lead to surprise credential prompts. Also, the skill can be configured to query databases and read files — supply only least-privilege credentials and explicit table paths.

✓ Persistence & Privilege

The skill does not request 'always: true' and does not declare modifications to other skills or system-wide configuration. SKILL.md and config default to auto_execute=false and require_confirmation=true for safety, which is appropriate for a tool that runs queries against user data.

Version History

v0.3.0

- Added required and optional environment variables to SKILL.md for easier setup and integration.
- Listed DEEPSEEK_API_KEY and OPENAI_API_KEY as required variables for LLM-based features.
- Introduced optional variables for database and table source configuration (e.g., MYSQL_CONNECTION_STRING, TABLE_PATHS).
- Added SECURITY.md and RESPONSE_TO_REVIEW.md files for improved documentation and security practices.

v0.2.0

Major update: Multi-stage engine core and comprehensive detection/classification modules added.

- Introduced core query engines (`nl2sql`, `nl2sparql`, `table_qa`) with multi-stage orchestration and self-revision workflows.
- Added new source detection modules using a detailed 4-layer approach (rule-based, LLM, schema/entity matching, verification).
- Implemented support for multi-LLM providers and per-query model override.
- Enhanced system documentation with architecture, configuration, and usage workflows.
- Comprehensive test results and detailed project structure included.

v0.1.0

Initial release of heteromind, a unified QA system for heterogeneous knowledge sources.

- Automatically routes natural language queries to SQL databases, knowledge graphs, or table files
- Uses a 4-layer detection architecture for smart source selection and verification
- Supports multi-hop queries and result fusion across data types
- Handles both Chinese and English queries
- Demonstrated 100% accuracy in initial benchmark tests
- Provides NL-to-SQL, NL-to-SPARQL, and table question answering capabilities

Metadata

Slug heteromind

Version 0.3.0

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 3

Frequently Asked Questions

What is HeteroMind - Unified Knowledge QA?

Unified heterogeneous knowledge QA system. Automatically routes natural language queries to SQL databases, Knowledge Graphs, or table files using 4-layer det... It is an AI Agent Skill for Claude Code / OpenClaw, with 81 downloads so far.

How do I install HeteroMind - Unified Knowledge QA?

Run "/install heteromind" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is HeteroMind - Unified Knowledge QA free?

Yes, HeteroMind - Unified Knowledge QA is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does HeteroMind - Unified Knowledge QA support?

HeteroMind - Unified Knowledge QA is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created HeteroMind - Unified Knowledge QA?

It is built and maintained by Yongrui Chen (@bahuia); the current version is v0.3.0.
