Chapter 28 Vector Retrieval Implementation: SQLite + BM25 Hybrid Search, 0.7/0.3 Weight Fusion, and the Embedding Fallback Chain
"The best vector database is the one you don't need to operate separately." — OpenClaw Architecture Decision Record
28.1 Why SQLite Instead of a Standalone Vector Database?
In the vector retrieval space, dedicated vector databases such as Pinecone, Weaviate, Qdrant, and Milvus have proliferated rapidly. When making architecture choices, OpenClaw explicitly chose SQLite + the sqlite-vec extension over any standalone vector database.
28.1.1 The Cost of Standalone Vector Databases
| Cost | Specific Manifestation |
|---|---|
| Operational complexity | Requires separate deployment, configuration, monitoring, and backup of the vector database service |
| Network dependency | Every retrieval requires a network request, introducing latency and availability risk |
| Data fragmentation | Text data in Markdown files, vector data in a remote service — consistency is hard to guarantee |
| Expense | Cloud-hosted vector databases charge by index size and request volume |
| Cognitive overhead | Developers must learn proprietary APIs and understand each database's consistency model |
28.1.2 SQLite's Advantages
OpenClaw's vector retrieval stack:
┌─────────────────────────────────┐
│ SQLite (single-file database) │
│ + sqlite-vec (vector extension) │
│ + FTS5 (full-text search ext.) │
└─────────────────────────────────┘
File path: ~/.openclaw/memory/<agentId>.sqlite
Core reasons for choosing SQLite:
- Zero operations: SQLite is an in-process database — no server required, starts with the application
- File portability: The entire vector index is a single file that can be copied, backed up, and migrated
- Derived and rebuildable: As noted earlier, SQLite is derived from Markdown files; database corruption means no information loss
- Native SQL: Vector results can be queried, joined, and filtered using standard SQL — no proprietary query language to learn
- Fast enough: For agent personal memory scale (typically tens of thousands to millions of records), SQLite performance fully meets requirements
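To make the zero-operations claim concrete, here is roughly what bootstrapping this stack looks like from Node.js. A minimal sketch, assuming the better-sqlite3 and sqlite-vec npm packages; OpenClaw's actual bootstrap code may differ:
import Database from "better-sqlite3";
import * as sqliteVec from "sqlite-vec";
import os from "node:os";
import path from "node:path";

const agentId = "main"; // hypothetical agent ID, for illustration
const dbPath = path.join(os.homedir(), ".openclaw", "memory", `${agentId}.sqlite`);

// Open (or create) the single-file database and load sqlite-vec into this
// connection. FTS5 is compiled into SQLite itself and needs no loading.
const db = new Database(dbPath);
sqliteVec.load(db);

console.log(db.prepare("SELECT vec_version() AS v").get()); // prints the loaded sqlite-vec version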
28.2 FTS5 Virtual Tables: Keyword Retrieval Principles
28.2.1 What Is FTS5?
FTS5 (Full-Text Search 5) is SQLite's built-in full-text search extension, implemented through the virtual table mechanism. A virtual table looks and behaves exactly like a regular table, but its data storage and retrieval logic are fully customized.
-- Create FTS5 virtual table
CREATE VIRTUAL TABLE memory_fts USING fts5(
content, -- text content to search
source_file, -- source file path
chunk_id, -- corresponding ID in memory_embeddings
tokenize = 'porter unicode61' -- tokenizer
);
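Populating the virtual table is ordinary SQL. For example, the FTS index can be rebuilt wholesale from the memory_embeddings chunk table defined in Section 28.3 — a sketch; the real indexing pipeline presumably inserts rows incrementally as files are chunked:
-- Rebuild the FTS index from the chunk table
DELETE FROM memory_fts;
INSERT INTO memory_fts (content, source_file, chunk_id)
SELECT content, source_file, id
FROM memory_embeddings;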
28.2.2 The BM25 Ranking Algorithm
FTS5 uses the BM25 (Best Matching 25) algorithm by default to rank retrieval results. BM25 is a probabilistic retrieval model based on term frequency (TF) and inverse document frequency (IDF):
$$ \text{BM25}(q, d) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, d) \cdot (k_1 + 1)}{f(q_i, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)} $$
Where:
- $\text{IDF}(q_i)$: inverse document frequency of query term $q_i$ — the rarer the term, the larger its weight
- $f(q_i, d)$: frequency of term $q_i$ in document $d$
- $|d|$: document length (word count)
- $\text{avgdl}$: average document length in the corpus
- $k_1 = 1.2$: term frequency saturation parameter (default)
- $b = 0.75$: length normalization parameter (default)
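A quick worked example with these defaults: suppose a query term appears 3 times in a 200-word document and the corpus average document length is 400 words. The term-frequency factor is then:
$$ \frac{3 \times (1.2 + 1)}{3 + 1.2 \times \left(1 - 0.75 + 0.75 \times \frac{200}{400}\right)} = \frac{6.6}{3.75} = 1.76 $$
Multiplying by the term's IDF gives its contribution to the document score. Doubling the raw count from 3 to 6 lifts this factor only to about 1.96 — the saturation behavior that $k_1$ controls.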
28.2.3 FTS5 Retrieval Examples
-- Basic keyword search
SELECT content, chunk_id, rank
FROM memory_fts
WHERE content MATCH 'JWT validation RS256'
ORDER BY rank
LIMIT 24;
-- Phrase search
SELECT content, chunk_id, rank
FROM memory_fts
WHERE content MATCH '"JWT validation"' -- exact phrase
ORDER BY rank;
-- Boolean search
SELECT content, chunk_id, rank
FROM memory_fts
WHERE content MATCH 'auth AND (token OR session)'
ORDER BY rank;
28.3 Vector Storage Format and Cosine Similarity Calculation
28.3.1 Vector Storage Table Structure
CREATE TABLE memory_embeddings (
    id INTEGER PRIMARY KEY,
    content TEXT NOT NULL,     -- original text chunk
    embedding BLOB NOT NULL,   -- vector as a float32 BLOB (dimension depends on embedding model, e.g., 768)
    source_file TEXT,          -- source file (e.g., memory/2026-04-26.md)
    chunk_index INTEGER,       -- chunk sequence number within source file
    created_at INTEGER,        -- Unix timestamp
    model_name TEXT            -- name of the model that generated the vector
);
sqlite-vec operates on vectors serialized as little-endian float32 BLOBs and exposes SQL functions over them. (Its vec0 virtual tables also accept a float[768] column-type declaration, but a plain BLOB column in an ordinary table suffices for brute-force scans at this scale.)
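Writing a chunk then amounts to binding the Float32Array's raw bytes as the BLOB parameter. A sketch continuing the better-sqlite3 setup from Section 28.1; embed() and chunkText are hypothetical stand-ins:
// Serialize a Float32Array into the little-endian float32 BLOB sqlite-vec reads
function toBlob(vec: Float32Array): Buffer {
  return Buffer.from(vec.buffer, vec.byteOffset, vec.byteLength);
}

const insertChunk = db.prepare(`
  INSERT INTO memory_embeddings
    (content, embedding, source_file, chunk_index, created_at, model_name)
  VALUES (?, ?, ?, ?, unixepoch(), ?)
`);

const vector = await embed(chunkText); // hypothetical: returns a 768-dim Float32Array
insertChunk.run(chunkText, toBlob(vector), "memory/2026-04-26.md", 0, "embeddinggemma-300M");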
28.3.2 Cosine Similarity Calculation
Vector similarity uses cosine similarity, computed entirely within SQL:
$$ \text{cosine\_similarity}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \, \|\mathbf{v}\|} $$
sqlite-vec exposes this calculation through its SQL distance functions; cosine distance is simply one minus cosine similarity:
-- Vector similarity search (query vector is :query_embedding)
SELECT
    id,
    content,
    source_file,
    1 - vec_distance_cosine(embedding, :query_embedding) AS vector_score
FROM memory_embeddings
ORDER BY vector_score DESC
LIMIT 24;
Why cosine similarity rather than Euclidean distance?
Cosine similarity measures directional similarity, not absolute distance. For semantic embedding vectors, the semantic similarity between two texts is primarily reflected in the vector direction, independent of the vector magnitude (which may vary with text length).
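For intuition, the same computation in plain TypeScript — illustrative only, since in OpenClaw the work happens inside SQL as shown above. Note that many embedding models emit unit-normalized vectors, in which case the denominator is 1 and cosine similarity reduces to a dot product:
function cosineSimilarity(u: Float32Array, v: Float32Array): number {
  let dot = 0, normU = 0, normV = 0;
  for (let i = 0; i < u.length; i++) {
    dot += u[i] * v[i];
    normU += u[i] * u[i];
    normV += v[i] * v[i];
  }
  return dot / (Math.sqrt(normU) * Math.sqrt(normV));
}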
28.4 Union Strategy vs. Intersection Strategy
28.4.1 Definitions of the Two Strategies
In hybrid retrieval, BM25 and vector search each return a batch of candidate results. Two strategies exist for merging these two result sets:
Intersection strategy:
BM25 Top-K ∩ Vector Top-K → Keep only documents appearing in both result sets
Union strategy:
BM25 Top-K ∪ Vector Top-K → Merge both result sets and rank uniformly
28.4.2 Why OpenClaw Chose the Union Strategy
OpenClaw explicitly uses the Union strategy for the following reasons:
| Consideration | Intersection | Union (OpenClaw's choice) |
|---|---|---|
| Recall rate | Low (only keeps what both found) | High (both backends can contribute results) |
| Complementarity | Cannot leverage each backend's complementary strengths | BM25 excels at exact terms; Vector excels at semantics — complementary |
| Degradation capability | Either backend failing yields no results | Either backend failing still leaves the other's results |
| Suitable for | High-precision needs (tolerable low recall) | General memory retrieval (high recall required) |
For agent memory retrieval, high recall is typically more important than high precision — it's better to retrieve a few loosely related chunks than to miss truly important information.
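In code, the union is a deduplicating merge keyed on chunk ID that keeps both backends' scores side by side for the fusion step in Section 28.6. One plausible shape for the unionByChunkId helper that appears later in this chapter — a sketch; the field names are assumptions:
interface Candidate {
  chunkId: number;
  content: string;
  vectorScore?: number;         // set if the vector backend returned this chunk
  normalizedBm25Score?: number; // set if the BM25 backend returned this chunk
}

function unionByChunkId(vectorResults: Candidate[], bm25Results: Candidate[]): Candidate[] {
  const byId = new Map<number, Candidate>();
  for (const r of vectorResults) byId.set(r.chunkId, { ...r });
  for (const r of bm25Results) {
    const existing = byId.get(r.chunkId);
    if (existing) {
      // Found by both backends: keep both scores for weighted fusion
      existing.normalizedBm25Score = r.normalizedBm25Score;
    } else {
      byId.set(r.chunkId, { ...r });
    }
  }
  return [...byId.values()];
}
A chunk found by only one backend simply contributes 0 on the other backend's term during weighted fusion.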
28.5 candidateMultiplier:4 — The Over-fetching Strategy
28.5.1 Why Over-fetch Candidates?
In hybrid retrieval, each backend (BM25 and Vector) fetches k × candidateMultiplier candidates rather than directly fetching the final k. The default is candidateMultiplier = 4.
If the goal is to return 6 final results, each backend first fetches 6 × 4 = 24 candidates.
28.5.2 Why Over-fetching Is Necessary
BM25 Top-24: [A, B, C, D, E, F, G, H, ...]
Vector Top-24: [A, C, E, X, Y, Z, ...]
After Union with deduplication: [A, B, C, D, E, F, G, H, X, Y, Z, ...]
After weighted re-ranking, Top-6: [A, E, C, X, B, G]
Without over-fetching (each backend fetches only 6):
BM25 Top-6: [A, B, C, D, E, F]
Vector Top-6: [A, C, E, X, Y, Z]
Union: [A, B, C, D, E, F, X, Y, Z] (at most 12)
After re-ranking, Top-6: [A, E, C, X, B, D]
Over-fetching ensures sufficient candidate diversity during re-ranking. Especially when BM25 and Vector result sets overlap heavily, not over-fetching may result in a final candidate pool that is too small.
28.5.3 Parameter Configuration
vectorSearch:
finalK: 6 # Final number of results to return
candidateMultiplier: 4 # Each backend fetches finalK × 4 = 24 candidates
# Each backend fetches 24; after Union, at most 48; after re-ranking, Top-6
28.6 Weight Fusion: 0.7 Vector + 0.3 BM25
28.6.1 The Fusion Formula
$$ \text{finalScore} = 0.7 \times \text{vectorScore} + 0.3 \times \text{bm25Score} $$
28.6.2 Design Rationale
The 0.7:0.3 weight ratio was chosen based on the following engineering judgments:
| Consideration | Explanation |
|---|---|
| Semantic understanding first | Agent queries are usually natural language; semantic matching matters more than exact keyword matching |
| BM25 as an anchor | BM25 excels at matching specific terms (function names, code variables); weight 0.3 ensures it's not completely ignored |
| Noise resistance | Vector similarity sometimes over-scores semantically similar but irrelevant content; BM25's lexical constraints help correct this |
| Experimental results | Based on ablation experiments on OpenClaw development team's internal test sets, 0.7/0.3 yields the best retrieval quality metrics |
28.6.3 Implementation in Code
function hybridScore(vectorScore: number, bm25Score: number): number {
const VECTOR_WEIGHT = 0.7;
const BM25_WEIGHT = 0.3;
return VECTOR_WEIGHT * vectorScore + BM25_WEIGHT * bm25Score;
}
// Re-rank all candidates; a chunk missed by one backend contributes 0 on that term
const reranked = candidates
  .map(chunk => ({
    ...chunk,
    finalScore: hybridScore(chunk.vectorScore ?? 0, chunk.normalizedBm25Score ?? 0)
  }))
  .sort((a, b) => b.finalScore - a.finalScore)
  .slice(0, finalK);
28.7 BM25 Score Normalization Formula
28.7.1 Why Normalization Is Needed
Raw BM25 scores and vector similarity scores (cosine values, in practice between 0 and 1 for embedding vectors) live on completely different scales and cannot be fused directly with weights. In SQLite FTS5, the rank column is in fact a negative number, with more negative values indicating higher relevance — which is why the queries in Section 28.2.3 order by rank ascending.
BM25 scores must be normalized to the [0, 1] range.
28.7.2 The Normalization Formula
OpenClaw uses rank-based normalization:
$$ \text{bm25\_score}(r) = \frac{1}{1 + r} $$
Where $r$ is the rank in BM25 retrieval results, starting from 0:
| Rank $r$ | Normalized BM25 Score |
|---|---|
| 0 (1st place) | $\frac{1}{1+0} = 1.0$ |
| 1 (2nd place) | $\frac{1}{1+1} = 0.5$ |
| 2 (3rd place) | $\frac{1}{1+2} = 0.333$ |
| 3 (4th place) | $\frac{1}{1+3} = 0.25$ |
| 7 (8th place) | $\frac{1}{1+7} = 0.125$ |
| 23 (24th place) | $\frac{1}{1+23} = 0.042$ |
This formula ensures:
- The top-ranked result receives the maximum score of 1.0
- Scores decay with rank, but on a smooth decay curve (harmonic series)
- Regardless of the raw BM25 score's numerical range, the normalized score always falls within [0, 1]
28.7.3 Implementation Code
function normalizeBm25Score(rank: number): number {
// rank starts at 0; rank=0 means BM25 ranked first
return 1 / (1 + rank);
}
// Assign normalized scores to BM25 results by rank
const bm25Results = await bm25Search(query, { limit: 24 });
const normalizedBm25 = bm25Results.map((result, index) => ({
  ...result,
  normalizedBm25Score: normalizeBm25Score(index) // index is the rank
}));
28.8 Graceful Degradation: Any Backend Failure Does Not Affect the Other
28.8.1 Design Principle
Each backend in hybrid retrieval (BM25 and Vector) is independently wrapped with .catch(() => []), ensuring that if either backend throws an exception, the other backend's results remain available:
const [vectorResults, bm25Results] = await Promise.all([
searchByVector(queryEmbedding, { limit: 24 })
.catch((err) => {
logger.warn("Vector search failed, falling back to BM25 only", err);
return []; // return empty array; BM25 results unaffected
}),
searchByBm25(query, { limit: 24 })
.catch((err) => {
logger.warn("BM25 search failed, falling back to vector only", err);
return []; // return empty array; vector results unaffected
})
]);
// Union merge
const combined = unionByChunkId(vectorResults, bm25Results);
28.8.2 Degradation Scenarios
| Scenario | Result |
|---|---|
| Both backends normal | Complete hybrid retrieval results |
| Vector backend fails (e.g., embedding unavailable) | BM25 results only (pure keyword search) |
| BM25 backend fails (FTS5 index corrupted) | Vector results only (pure semantic search) |
| Both backends fail | Empty result set (no exception thrown; task continues) |
28.8.3 Why Per-Backend .catch Isolation Rather Than Letting Promise.all Fail Directly
Without the .catch(() => []) isolation mechanism:
// Dangerous pattern: either backend failing causes the entire retrieval to fail
const [vectorResults, bm25Results] = await Promise.all([
searchByVector(queryEmbedding), // if this fails, the entire Promise.all fails
searchByBm25(query)
]);
With graceful degradation, even if the vector backend has issues (for example, the embedding service is unavailable), memory retrieval continues to function. The agent never completely loses memory access because of a partial retrieval-system failure.
28.9 The Four-Level Embedding Fallback Chain
28.9.1 The Complete Fallback Chain
Query text
│
▼
[1] Local GGUF model (embeddinggemma-300M-Q8_0.gguf)
│ failure / unavailable
▼
[2] OpenAI text-embedding-3-small (1536 dimensions)
│ failure / unavailable
▼
[3] Gemini gemini-embedding-001 (768 dimensions)
│ failure / unavailable
▼
[4] Pure BM25 (exit vector search; keywords only)
28.9.2 Level 1: Local GGUF Model
embedding:
primary:
type: local_gguf
model_path: ~/.openclaw/models/embeddinggemma-300M-Q8_0.gguf
dimensions: 768
file_size: ~600MB
Advantages:
- Fully offline, no network required
- No API call cost
- Low latency (local CPU/GPU inference)
- Data privacy (text never leaves the local machine)
Disadvantages:
- Requires ~600MB disk space for the model file
- Inference can be slow on low-performance devices
- Model quality is lower than large cloud models
28.9.3 Level 2: OpenAI text-embedding-3-small
embedding:
fallback_1:
type: openai
model: text-embedding-3-small
dimensions: 1536
api_key: ${OPENAI_API_KEY}
Its 1536-dimensional vectors offer strong semantic discrimination, and the cost is low (~$0.02 per 1M tokens), making it a good first cloud fallback option.
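As an illustration, the openaiEmbed step used in the fallback chain below might look like this with the official openai npm package — a sketch, not OpenClaw's actual code:
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function openaiEmbed(text: string): Promise<Float32Array> {
  const res = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return Float32Array.from(res.data[0].embedding); // number[] of length 1536
}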
28.9.4 Level 3: Gemini gemini-embedding-001
embedding:
fallback_2:
type: gemini
model: gemini-embedding-001
dimensions: 768
api_key: ${GEMINI_API_KEY}
As the second cloud option, its 768 dimensions match the local GGUF model; it takes over when the OpenAI service is unavailable.
28.9.5 Level 4: Pure BM25
When all embedding options have failed, the system exits vector search and falls back to pure BM25 keyword retrieval:
async function getEmbedding(text: string): Promise<Float32Array | null> {
try { return await localGgufEmbed(text); } catch {}
try { return await openaiEmbed(text); } catch {}
try { return await geminiEmbed(text); } catch {}
return null; // return null; triggers pure BM25 mode
}
// In the retrieval flow
const embedding = await getEmbedding(query);
if (embedding === null) {
// Degrade to pure BM25
return await searchByBm25(query, { limit: finalK });
}
// Normal hybrid retrieval flow
return await hybridSearch(query, embedding);
28.9.6 Dimension Incompatibility
Different models produce vectors of different dimensions (768 vs. 1536), so they cannot be mixed:
-- Record the model used at indexing time
UPDATE memory_embeddings
SET model_name = 'text-embedding-3-small'
WHERE id = :id;
When falling back to a model with different dimensions, the vector index must be rebuilt (re-embedding all text data). OpenClaw automatically triggers a background rebuild task when it detects a model change:
if (currentModel !== storedModel) {
logger.info("Embedding model changed, scheduling background re-embedding");
scheduleBackgroundReembedding(); // async rebuild, does not block current retrieval
}
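Because model_name is stored per row, both the change check and interim filtering are plain SQL. A sketch against the schema from Section 28.3:
-- Which model(s) produced the vectors currently in the index?
SELECT model_name, COUNT(*) AS chunks
FROM memory_embeddings
GROUP BY model_name;

-- While the background rebuild runs, retrieval can filter to the current model
-- so vectors of mismatched dimensions never enter the same comparison
SELECT id, content,
       1 - vec_distance_cosine(embedding, :query_embedding) AS vector_score
FROM memory_embeddings
WHERE model_name = :current_model
ORDER BY vector_score DESC
LIMIT 24;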
28.10 Chunking: Engineering Trade-offs in Text Segmentation
28.10.1 Why Is Chunking Necessary?
Embedding models typically have an input token limit (e.g., 512 or 8192 tokens). More importantly, overly long text chunks cause:
- Semantic dilution: A single embedding vector must represent too much content, reducing semantic precision
- Coarse retrieval granularity: Returns large text chunks where most content is irrelevant to the query
- Token waste: Occupies excessive space when injected into the Context Window
28.10.2 OpenClaw's Chunking Parameters
| Parameter | Default | Description |
|---|---|---|
| Target size | 400 tokens (~1600 chars) | Target token count per chunk |
| Overlap | 80 tokens | Token overlap between adjacent chunks |
| Batch limit | 8000 tokens/batch | Maximum tokens per single Embedding API request |
| Concurrency | 4 | Number of parallel Embedding requests |
28.10.3 The Trade-offs Behind 400 Token Target Size
Too small (< 100 tokens):
✓ High retrieval precision
✗ Insufficient context; isolated fragments are hard to understand
Too large (> 1000 tokens):
✓ Complete context
✗ Semantic dilution, low retrieval precision; occupies large Context when injected
400 tokens (~1600 characters):
✓ Roughly equivalent to a complete paragraph or function definition
✓ Semantics are sufficiently focused
✓ Reasonable when injected into Context
28.10.4 The Role of 80-token Overlap
Overlap ensures that information spanning chunk boundaries is not lost:
Original text: [...end of paragraph A...][...start of paragraph B...]
                                         ↑ chunk 1 ends 80 tokens into paragraph B
                              ↑ chunk 2 starts at the beginning of paragraph B
Chunk 1: [end of paragraph A + first 80 tokens of paragraph B]
Chunk 2: [first 80 tokens of paragraph B + rest of paragraph B]
          ↑ overlapping portion shared by both chunks
Without overlap, an important discussion that happens to straddle two chunks may be "cut" during retrieval — each chunk contains only half, and neither receives a high relevance score.
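A simplified token-window chunker illustrating the 400/80 scheme. This is a sketch: tokenize and detokenize stand in for the embedding model's tokenizer, and a production splitter would likely also respect paragraph and sentence boundaries:
const TARGET_TOKENS = 400;
const OVERLAP_TOKENS = 80;

// Hypothetical tokenizer pair matching the embedding model
declare function tokenize(text: string): number[];
declare function detokenize(tokens: number[]): string;

function chunkText(text: string): string[] {
  const tokens = tokenize(text);
  const step = TARGET_TOKENS - OVERLAP_TOKENS; // each window advances 320 tokens
  const chunks: string[] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(detokenize(tokens.slice(start, start + TARGET_TOKENS)));
    if (start + TARGET_TOKENS >= tokens.length) break; // final window reached the end
  }
  return chunks;
}
Adjacent windows share their last and first 80 tokens, so any passage near a boundary appears whole in at least one chunk.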
28.10.5 Engineering Details of Batching and Concurrency
import pLimit from "p-limit";

const BATCH_MAX_TOKENS = 8000;
const CONCURRENT_REQUESTS = 4;

// Group chunks into batches by token count
const batches = groupIntoBatches(chunks, BATCH_MAX_TOKENS);

// Process in parallel, but cap in-flight requests at 4 (to avoid API rate limiting)
const limit = pLimit(CONCURRENT_REQUESTS);
const results = await Promise.all(
  batches.map(batch => limit(() => embedBatch(batch)))
);
Limiting concurrency to 4 protects against API rate limits: most embedding APIs restrict concurrent requests, and exceeding the limit produces HTTP 429 (Too Many Requests) errors.
28.11 Complete Hybrid Retrieval Flow
User query → Generate query Embedding (four-level fallback chain)
│
┌───────┴───────┐
▼ ▼
BM25 search (FTS5) Vector search (sqlite-vec)
Fetch 24 candidates Fetch 24 candidates
.catch(()=>[]) .catch(()=>[])
│ │
└───────┬───────┘
▼
Union merge with deduplication (by chunk_id)
│
▼
BM25 score normalization: bm25Score = 1/(1+rank)
│
▼
Weighted fusion: finalScore = 0.7×vectorScore + 0.3×bm25Score
│
▼
Sort by finalScore descending
│
▼
Take Top-6 results
│
▼
Inject into Context Window
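Putting the pieces together, the entire flow fits in one function. A consolidated sketch reusing the helpers assumed earlier in this chapter (getEmbedding, searchByVector, searchByBm25, unionByChunkId, hybridScore, and the Candidate shape from Section 28.4):
async function retrieveMemory(query: string, finalK = 6): Promise<Candidate[]> {
  const candidateLimit = finalK * 4; // candidateMultiplier: 4

  // Four-level fallback chain; null means every embedding backend failed
  const embedding = await getEmbedding(query);
  if (embedding === null) {
    return searchByBm25(query, { limit: finalK }); // pure BM25 mode
  }

  // Run both backends in parallel, each isolated with .catch
  const [vectorResults, bm25Results] = await Promise.all([
    searchByVector(embedding, { limit: candidateLimit }).catch(() => []),
    searchByBm25(query, { limit: candidateLimit }).catch(() => []),
  ]);

  // Union merge, weighted fusion, Top-K selection
  return unionByChunkId(vectorResults, bm25Results)
    .map(chunk => ({
      ...chunk,
      finalScore: hybridScore(chunk.vectorScore ?? 0, chunk.normalizedBm25Score ?? 0),
    }))
    .sort((a, b) => b.finalScore - a.finalScore)
    .slice(0, finalK);
}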
28.12 Chapter Summary
OpenClaw's vector retrieval implementation is an engineering design with pragmatism at its core:
- SQLite replaces standalone vector databases, achieving zero operations, file portability, and native SQL
- FTS5 virtual tables provide BM25 keyword retrieval, complementing vector search
- Cosine similarity is computed inside SQL, without leaving the SQLite ecosystem
- Union strategy ensures high recall, with the two backends complementing rather than constraining each other
- candidateMultiplier:4 over-fetching ensures candidate diversity during re-ranking
- 0.7/0.3 weight fusion balances semantic understanding and lexical precision
$$ \text{finalScore} = 0.7 \times \text{vectorScore} + 0.3 \times \frac{1}{1 + \text{rank}_{\text{BM25}}} $$
- Graceful Degradation ensures any single backend failure does not affect overall service
- Four-level Embedding fallback chain spans from local to cloud, ensuring vector generation in any environment
- 400/80 token chunking parameters balance retrieval precision and context completeness
Next: Chapter 29 — The Workspace File System Explained: AGENTS.md / SOUL.md / USER.md / HEARTBEAT.md — Roles and Best Practices