Chapter 28

Vector Retrieval Implementation: SQLite + BM25 Hybrid Search, 0.7/0.3 Weight Fusion, and the Embedding Fallback Chain

"The best vector database is the one you don't need to operate separately." — OpenClaw Architecture Decision Record


28.1 Why SQLite Instead of a Standalone Vector Database?

In the vector retrieval space, dedicated vector databases such as Pinecone, Weaviate, Qdrant, and Milvus have proliferated rapidly. When making architecture choices, OpenClaw explicitly chose SQLite + the sqlite-vec extension over any standalone vector database.

28.1.1 The Cost of Standalone Vector Databases

Cost                   | Specific Manifestation
-----------------------|------------------------
Operational complexity | Requires separate deployment, configuration, monitoring, and backup of the vector database service
Network dependency     | Every retrieval requires a network request, introducing latency and availability risk
Data fragmentation     | Text data lives in Markdown files while vector data lives in a remote service; consistency is hard to guarantee
Expense                | Cloud-hosted vector databases charge by index size and request volume
Cognitive overhead     | Developers must learn proprietary APIs and understand each database's consistency model

28.1.2 SQLite's Advantages

OpenClaw's vector retrieval stack:
┌─────────────────────────────────┐
│  SQLite (single-file database)   │
│  + sqlite-vec (vector extension) │
│  + FTS5 (full-text search ext.)  │
└─────────────────────────────────┘
File path: ~/.openclaw/memory/<agentId>.sqlite

Core reasons for choosing SQLite:

  1. Zero operations: SQLite is an in-process database — no server required, starts with the application
  2. File portability: The entire vector index is a single file that can be copied, backed up, and migrated
  3. Derived and rebuildable: As noted earlier, SQLite is derived from Markdown files; database corruption means no information loss
  4. Native SQL: Vector results can be queried, joined, and filtered using standard SQL — no proprietary query language to learn
  5. Fast enough: For agent personal memory scale (typically tens of thousands to millions of records), SQLite performance fully meets requirements
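
For orientation, opening such a database from TypeScript takes only a handful of lines. This is a minimal sketch using better-sqlite3 and the sqlite-vec npm package; the file path follows the convention above, with "main" as a placeholder agent id:

import Database from "better-sqlite3";
import * as sqliteVec from "sqlite-vec";
import os from "node:os";
import path from "node:path";

// Open the per-agent memory database and load the vector extension
const dbPath = path.join(os.homedir(), ".openclaw", "memory", "main.sqlite");
const db = new Database(dbPath);
sqliteVec.load(db);  // registers the vec_* SQL functions on this connection

// Verify the extension is available
const row = db.prepare("SELECT vec_version() AS v").get() as { v: string };
console.log(`sqlite-vec ${row.v}`);

Everything else in this chapter (FTS5 tables, vector functions, hybrid queries) runs over this single connection.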

28.2 FTS5 Virtual Tables: Keyword Retrieval Principles

28.2.1 What Is FTS5?

FTS5 (Full-Text Search 5) is SQLite's built-in full-text search extension, implemented through the virtual table mechanism. A virtual table looks and behaves exactly like a regular table, but its data storage and retrieval logic are fully customized.

-- Create FTS5 virtual table
CREATE VIRTUAL TABLE memory_fts USING fts5(
    content,          -- text content to search
    source_file,      -- source file path
    chunk_id,         -- corresponding ID in memory_embeddings
    tokenize = 'porter unicode61'  -- tokenizer
);

28.2.2 The BM25 Ranking Algorithm

FTS5 uses the BM25 (Best Matching 25) algorithm by default to rank retrieval results. BM25 is a probabilistic retrieval model based on term frequency (TF) and inverse document frequency (IDF):

$$ \text{BM25}(q, d) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, d) \cdot (k_1 + 1)}{f(q_i, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)} $$

Where:

  - $f(q_i, d)$ — the frequency of query term $q_i$ in document $d$
  - $|d|$ — the length of document $d$ in tokens
  - $\text{avgdl}$ — the average document length across the corpus
  - $k_1$ — the term-frequency saturation parameter (FTS5 uses 1.2)
  - $b$ — the length-normalization parameter (FTS5 uses 0.75)
  - $\text{IDF}(q_i)$ — the inverse document frequency of $q_i$, which down-weights terms that appear in many documents
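
To make the moving parts concrete, here is a minimal TypeScript sketch of BM25 scoring over an in-memory corpus. It is illustrative only — FTS5 computes this internally in C — and the Doc shape is an assumption:

interface Doc { id: number; tokens: string[]; }

// BM25 parameters (FTS5's internal defaults)
const K1 = 1.2;
const B = 0.75;

function idf(term: string, docs: Doc[]): number {
    const n = docs.filter(d => d.tokens.includes(term)).length;
    // Robertson-Sparck Jones IDF; the +1 keeps the value positive for common terms
    return Math.log((docs.length - n + 0.5) / (n + 0.5) + 1);
}

function bm25(queryTerms: string[], doc: Doc, docs: Doc[]): number {
    const avgdl = docs.reduce((sum, d) => sum + d.tokens.length, 0) / docs.length;
    let score = 0;
    for (const term of queryTerms) {
        const tf = doc.tokens.filter(t => t === term).length;  // f(q_i, d)
        if (tf === 0) continue;
        score += idf(term, docs) *
            (tf * (K1 + 1)) /
            (tf + K1 * (1 - B + B * doc.tokens.length / avgdl));
    }
    return score;
}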

28.2.3 FTS5 Retrieval Examples

-- Basic keyword search
SELECT content, chunk_id, rank
FROM memory_fts
WHERE content MATCH 'JWT validation RS256'
ORDER BY rank
LIMIT 24;

-- Phrase search
SELECT content, chunk_id, rank
FROM memory_fts
WHERE content MATCH '"JWT validation"'  -- exact phrase
ORDER BY rank;

-- Boolean search
SELECT content, chunk_id, rank
FROM memory_fts
WHERE content MATCH 'auth AND (token OR session)'
ORDER BY rank;

28.3 Vector Storage Format and Cosine Similarity Calculation

28.3.1 Vector Storage Table Structure

CREATE TABLE memory_embeddings (
    id INTEGER PRIMARY KEY,
    content TEXT NOT NULL,        -- original text chunk
    embedding FLOAT[768],         -- vector (dimension depends on embedding model)
    source_file TEXT,             -- source file (e.g., memory/2026-04-26.md)
    chunk_index INTEGER,          -- chunk sequence number within source file
    created_at INTEGER,           -- Unix timestamp
    model_name TEXT               -- name of the model that generated the vector
);

FLOAT[768] here is a declared type that plain SQLite accepts as-is (the bracketed dimension is documentation only; SQLite's type affinity rules treat it as REAL). The vector value itself is stored as a BLOB of packed float32 bytes — the format that sqlite-vec's vector functions operate on. sqlite-vec's own vec0 virtual tables use the same float[768] declaration style, there with actual dimension enforcement.

28.3.2 Cosine Similarity Calculation

Vector similarity uses cosine similarity, computed entirely within SQL:

$$ \text{cosine\_similarity}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \, \|\mathbf{v}\|} $$

sqlite-vec exposes cosine distance as the SQL function vec_distance_cosine; similarity is simply one minus that distance:

-- Vector similarity search (query vector is :query_embedding)
SELECT 
    id,
    content,
    source_file,
    1 - vec_distance_cosine(embedding, :query_embedding) AS vector_score
FROM memory_embeddings
ORDER BY vector_score DESC
LIMIT 24;

Why cosine similarity rather than Euclidean distance?

Cosine similarity measures directional similarity, not absolute distance. For semantic embedding vectors, the semantic similarity between two texts is primarily reflected in the vector direction, independent of the vector magnitude (which may vary with text length).
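
For reference, cosine similarity over two Float32Array embeddings is only a few lines of TypeScript — a minimal sketch of the computation that sqlite-vec performs in C inside the query:

function cosineSimilarity(u: Float32Array, v: Float32Array): number {
    if (u.length !== v.length) throw new Error("dimension mismatch");
    let dot = 0, normU = 0, normV = 0;
    for (let i = 0; i < u.length; i++) {
        dot += u[i] * v[i];
        normU += u[i] * u[i];
        normV += v[i] * v[i];
    }
    // Guard against zero vectors to avoid NaN
    const denom = Math.sqrt(normU) * Math.sqrt(normV);
    return denom === 0 ? 0 : dot / denom;
}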


28.4 Union Strategy vs. Intersection Strategy

28.4.1 Definitions of the Two Strategies

In hybrid retrieval, BM25 and vector search each return a batch of candidate results. Two strategies exist for merging these two result sets:

Intersection strategy:

BM25 Top-K ∩ Vector Top-K → Keep only documents appearing in both result sets

Union strategy:

BM25 Top-K ∪ Vector Top-K → Merge both result sets and rank uniformly

28.4.2 Why OpenClaw Chose the Union Strategy

OpenClaw explicitly uses the Union strategy for the following reasons:

Consideration   | Intersection                                           | Union (OpenClaw's choice)
----------------|--------------------------------------------------------|---------------------------
Recall          | Low (keeps only what both backends found)              | High (either backend can contribute results)
Complementarity | Cannot leverage the backends' complementary strengths  | BM25 excels at exact terms; vectors excel at semantics — complementary
Degradation     | Either backend failing yields no results               | Either backend failing still leaves the other's results
Suited for      | High-precision needs (low recall tolerable)            | General memory retrieval (high recall required)

For agent memory retrieval, high recall is typically more important than high precision — it's better to retrieve a few loosely related chunks than to miss truly important information.
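
A union merge with chunk-level deduplication can be sketched as follows. The Chunk shape and the unionByChunkId name mirror the snippet in Section 28.8, but the body is an assumption, not OpenClaw's actual source:

interface Chunk {
    chunkId: number;
    content: string;
    vectorScore?: number;         // present if the vector backend returned it
    normalizedBm25Score?: number; // present if the BM25 backend returned it
}

function unionByChunkId(vectorResults: Chunk[], bm25Results: Chunk[]): Chunk[] {
    const byId = new Map<number, Chunk>();
    for (const c of vectorResults) byId.set(c.chunkId, { ...c });
    for (const c of bm25Results) {
        const existing = byId.get(c.chunkId);
        if (existing) {
            // Chunk found by both backends: keep both scores for fusion
            existing.normalizedBm25Score = c.normalizedBm25Score;
        } else {
            byId.set(c.chunkId, { ...c });
        }
    }
    return [...byId.values()];
}

A natural convention, used in the fusion step later, is to treat a missing side's score as 0, so chunks found by only one backend still participate in the weighted ranking.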


28.5 candidateMultiplier:4 — The Over-fetching Strategy

28.5.1 Why Over-fetch Candidates?

In hybrid retrieval, each backend (BM25 and Vector) fetches k × candidateMultiplier candidates rather than directly fetching the final k. The default is candidateMultiplier = 4.

If the goal is to return 6 final results, each backend first fetches 6 × 4 = 24 candidates.

28.5.2 Why Over-fetching Is Necessary

BM25 Top-24:   [A, B, C, D, E, F, G, H, ...]
Vector Top-24: [A, C, E, X, Y, Z, ...]

After Union with deduplication: [A, B, C, D, E, F, G, H, X, Y, Z, ...]

After weighted re-ranking, Top-6: [A, E, C, X, B, G]

Without over-fetching (each backend fetches only 6):

BM25 Top-6:   [A, B, C, D, E, F]
Vector Top-6: [A, C, E, X, Y, Z]

Union: [A, B, C, D, E, F, X, Y, Z] (at most 12)

After re-ranking, Top-6: [A, E, C, X, B, D]

Note that G, which the weighted fusion ranks sixth, never entered the smaller candidate pool, so D takes its place. Over-fetching ensures sufficient candidate diversity during re-ranking; especially when the BM25 and vector result sets overlap heavily, skipping it can leave the final candidate pool too small.

28.5.3 Parameter Configuration

vectorSearch:
  finalK: 6                    # Final number of results to return
  candidateMultiplier: 4       # Each backend fetches finalK × 4 = 24 candidates
  # Each backend fetches 24; after Union, at most 48; after re-ranking, Top-6
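
In code, the multiplier simply scales each backend's fetch limit before the union. A short sketch, using the same search helpers that appear in Section 28.8:

const finalK = 6;
const candidateMultiplier = 4;

// Each backend over-fetches so the fused ranking has enough distinct candidates
const fetchLimit = finalK * candidateMultiplier;  // 24

const [vectorResults, bm25Results] = await Promise.all([
    searchByVector(queryEmbedding, { limit: fetchLimit }).catch(() => []),
    searchByBm25(query, { limit: fetchLimit }).catch(() => []),
]);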

28.6 Weight Fusion: 0.7 Vector + 0.3 BM25

28.6.1 The Fusion Formula

$$ \text{finalScore} = 0.7 \times \text{vectorScore} + 0.3 \times \text{bm25Score} $$

28.6.2 Design Rationale

The 0.7:0.3 weight ratio was chosen based on the following engineering judgments:

Consideration                | Explanation
-----------------------------|-------------
Semantic understanding first | Agent queries are usually natural language; semantic matching matters more than exact keyword matching
BM25 as an anchor            | BM25 excels at matching specific terms (function names, code variables); a weight of 0.3 ensures it is not ignored
Noise resistance             | Vector similarity sometimes over-scores semantically similar but irrelevant content; BM25's lexical constraints help correct this
Experimental results         | In ablation experiments on the OpenClaw team's internal test sets, 0.7/0.3 yielded the best retrieval quality metrics

28.6.3 Implementation in Code

function hybridScore(vectorScore: number, bm25Score: number): number {
    const VECTOR_WEIGHT = 0.7;
    const BM25_WEIGHT = 0.3;
    return VECTOR_WEIGHT * vectorScore + BM25_WEIGHT * bm25Score;
}

// Re-rank all candidates
const reranked = candidates
    .map(chunk => ({
        ...chunk,
        finalScore: hybridScore(chunk.vectorScore, chunk.normalizedBm25Score)
    }))
    .sort((a, b) => b.finalScore - a.finalScore)
    .slice(0, finalK);

28.7 BM25 Score Normalization Formula

28.7.1 Why Normalization Is Needed

Raw BM25 scores (in SQLite FTS5 the rank column is a negative number — the more negative the value, the more relevant the row, which is why ORDER BY rank ascending lists the best matches first) and vector similarity scores (floating-point numbers between 0 and 1) are on completely different scales and cannot be fused directly with weights.

BM25 scores must be normalized to the [0, 1] range.

28.7.2 The Normalization Formula

OpenClaw uses rank-based normalization:

$$ \text{bm25\_score}(r) = \frac{1}{1 + r} $$

Where $r$ is the rank in BM25 retrieval results, starting from 0:

Rank $r$        | Normalized BM25 Score
----------------|----------------------
0 (1st place)   | $\frac{1}{1+0} = 1.0$
1 (2nd place)   | $\frac{1}{1+1} = 0.5$
2 (3rd place)   | $\frac{1}{1+2} \approx 0.333$
3 (4th place)   | $\frac{1}{1+3} = 0.25$
7 (8th place)   | $\frac{1}{1+7} = 0.125$
23 (24th place) | $\frac{1}{1+23} \approx 0.042$

This formula ensures:

  1. The top-ranked BM25 result receives the maximum score of 1.0
  2. Scores decay monotonically and quickly with rank, so lower-ranked matches contribute little
  3. All normalized scores fall in (0, 1] — the same range as cosine similarity — making weighted fusion meaningful
  4. The normalization depends only on rank, not on the raw BM25 value, so it is robust to corpus-level score drift

28.7.3 Implementation Code

function normalizeBm25Score(rank: number): number {
    // rank starts at 0; rank=0 means BM25 ranked first
    return 1 / (1 + rank);
}

// Assign normalized scores to BM25 results by rank
// (bm25Search is assumed to return results ordered best-first)
const bm25Results = await bm25Search(query, { limit: 24 });
const normalizedBm25 = bm25Results.map((result, index) => ({
    ...result,
    normalizedBm25Score: normalizeBm25Score(index)  // index is the 0-based rank
}));

28.8 Graceful Degradation: Any Backend Failure Does Not Affect the Other

28.8.1 Design Principle

Each backend in hybrid retrieval (BM25 and Vector) is independently wrapped with .catch(() => []), ensuring that if either backend throws an exception, the other backend's results remain available:

const [vectorResults, bm25Results] = await Promise.all([
    searchByVector(queryEmbedding, { limit: 24 })
        .catch((err) => {
            logger.warn("Vector search failed, falling back to BM25 only", err);
            return [];  // return empty array; BM25 results unaffected
        }),
    searchByBm25(query, { limit: 24 })
        .catch((err) => {
            logger.warn("BM25 search failed, falling back to vector only", err);
            return [];  // return empty array; vector results unaffected
        })
]);

// Union merge
const combined = unionByChunkId(vectorResults, bm25Results);

28.8.2 Degradation Scenarios

Scenario                                           | Result
---------------------------------------------------|-------
Both backends healthy                              | Complete hybrid retrieval results
Vector backend fails (e.g., embedding unavailable) | BM25 results only (pure keyword search)
BM25 backend fails (e.g., FTS5 index corrupted)    | Vector results only (pure semantic search)
Both backends fail                                 | Empty result set (no exception thrown; the task continues)

28.8.3 Why Per-Backend .catch Rather Than Letting Promise.all Fail Directly

Without the .catch(() => []) isolation mechanism:

// Dangerous pattern: either backend failing causes the entire retrieval to fail
const [vectorResults, bm25Results] = await Promise.all([
    searchByVector(queryEmbedding),  // if this fails, the entire Promise.all fails
    searchByBm25(query)
]);

With graceful degradation, even if the vector backend has issues (for example, the embedding service is unavailable), memory retrieval continues to function: the agent never completely loses memory access because one half of the retrieval system failed.
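
An equivalent way to express this isolation is Promise.allSettled, which never rejects; the per-backend .catch pattern above is a more explicit spelling of the same idea. A sketch:

const settled = await Promise.allSettled([
    searchByVector(queryEmbedding, { limit: 24 }),
    searchByBm25(query, { limit: 24 }),
]);

// Treat a rejected backend as an empty result set, exactly like .catch(() => [])
const [vectorResults, bm25Results] = settled.map(s =>
    s.status === "fulfilled" ? s.value : []
);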


28.9 The Four-Level Embedding Fallback Chain

28.9.1 The Complete Fallback Chain

Query text
    │
    ▼
[1] Local GGUF model (embeddinggemma-300M-Q8_0.gguf)
    │ failure / unavailable
    ▼
[2] OpenAI text-embedding-3-small (1536 dimensions)
    │ failure / unavailable
    ▼
[3] Gemini gemini-embedding-001 (768 dimensions)
    │ failure / unavailable
    ▼
[4] Pure BM25 (exit vector search; keywords only)

28.9.2 Level 1: Local GGUF Model

embedding:
  primary:
    type: local_gguf
    model_path: ~/.openclaw/models/embeddinggemma-300M-Q8_0.gguf
    dimensions: 768
    file_size: ~600MB

Advantages:

  1. Fully offline: no network dependency, no API key, no per-request cost
  2. Privacy: memory text never leaves the local machine
  3. Predictable latency: no network round trips or rate limits

Disadvantages:

  1. Requires a ~600 MB one-time model download and the corresponding disk space
  2. CPU inference can be slow on weak hardware
  3. A 300M-parameter model offers weaker semantic discrimination than large cloud embedding models

28.9.3 Level 2: OpenAI text-embedding-3-small

embedding:
  fallback_1:
    type: openai
    model: text-embedding-3-small
    dimensions: 1536
    api_key: ${OPENAI_API_KEY}

The 1536-dimensional vectors offer excellent semantic discrimination, and the cost is low (~$0.02 per 1M tokens), making this a good first cloud fallback.

28.9.4 Level 3: Gemini gemini-embedding-001

embedding:
  fallback_2:
    type: gemini
    model: gemini-embedding-001
    dimensions: 768
    api_key: ${GEMINI_API_KEY}

As the second cloud option, its 768 dimensions match the local GGUF model, so falling back from Level 1 to Level 3 does not require rebuilding the index (see 28.9.6). It is used when the OpenAI service is unavailable.

28.9.5 Level 4: Pure BM25

When all embedding options have failed, the system exits vector search and falls back to pure BM25 keyword retrieval:

async function getEmbedding(text: string): Promise<Float32Array | null> {
    // Try each level of the fallback chain in order; swallow errors and move on
    try { return await localGgufEmbed(text); } catch {}
    try { return await openaiEmbed(text); } catch {}
    try { return await geminiEmbed(text); } catch {}
    return null;  // all embedding backends failed; triggers pure BM25 mode
}

// In the retrieval flow
const embedding = await getEmbedding(query);
if (embedding === null) {
    // Degrade to pure BM25 keyword retrieval
    return await searchByBm25(query, { limit: finalK });
}
// Normal hybrid retrieval flow
return await hybridSearch(query, embedding);

28.9.6 Dimension Incompatibility

Different models produce vectors of different dimensions (768 vs. 1536), so they cannot be mixed:

-- Record the model used at indexing time
UPDATE memory_embeddings 
SET model_name = 'text-embedding-3-small'
WHERE id = :id;

When falling back to a model with different dimensions, the vector index must be rebuilt (re-embedding all text data). OpenClaw automatically triggers a background rebuild task when it detects a model change:

if (currentModel !== storedModel) {
    logger.info("Embedding model changed, scheduling background re-embedding");
    scheduleBackgroundReembedding();  // async rebuild, does not block current retrieval
}
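
A minimal sketch of what such a background rebuild could look like. scheduleBackgroundReembedding is named in the snippet above; this body is an assumption, not OpenClaw's actual implementation, reusing getEmbedding from 28.9.5 and the table from 28.3.1:

import Database from "better-sqlite3";

// Hypothetical background rebuild: re-embed every chunk stored under the old model
async function reembedAll(db: Database.Database, newModel: string): Promise<void> {
    const rows = db.prepare(
        "SELECT id, content FROM memory_embeddings WHERE model_name != ?"
    ).all(newModel) as { id: number; content: string }[];

    for (const row of rows) {
        const embedding = await getEmbedding(row.content);  // four-level fallback chain
        if (embedding === null) return;  // embedding chain is down; retry later
        db.prepare(
            "UPDATE memory_embeddings SET embedding = ?, model_name = ? WHERE id = ?"
        ).run(Buffer.from(embedding.buffer), newModel, row.id);  // packed float32 BLOB
    }
}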

28.10 Chunking: Engineering Trade-offs in Text Segmentation

28.10.1 Why Is Chunking Necessary?

Embedding models typically have an input token limit (e.g., 512 or 8192 tokens). More importantly, overly long text chunks cause:

  1. Semantic dilution: A single embedding vector must represent too much content, reducing semantic precision
  2. Coarse retrieval granularity: Returns large text chunks where most content is irrelevant to the query
  3. Token waste: Occupies excessive space when injected into the Context Window

28.10.2 OpenClaw's Chunking Parameters

Parameter    | Default                  | Description
-------------|--------------------------|------------
Target size  | 400 tokens (~1600 chars) | Target token count per chunk
Overlap      | 80 tokens                | Token overlap between adjacent chunks
Batch limit  | 8000 tokens/batch        | Maximum tokens per Embedding API request
Concurrency  | 4                        | Number of parallel Embedding requests

28.10.3 The Trade-offs Behind 400 Token Target Size

Too small (< 100 tokens):
✓ High retrieval precision
✗ Insufficient context; isolated fragments are hard to understand

Too large (> 1000 tokens):
✓ Complete context
✗ Semantic dilution, low retrieval precision; occupies large Context when injected

400 tokens (~1600 characters):
✓ Roughly equivalent to a complete paragraph or function definition
✓ Semantics are sufficiently focused
✓ Reasonable when injected into Context

28.10.4 The Role of 80-token Overlap

Overlap ensures that information spanning chunk boundaries is not lost:

Original text:  [........ paragraph A ........][........ paragraph B ........]
Chunk 1:        [───────────── 400 tokens ─────────────]
Chunk 2:                                 [───────────── 400 tokens ─────────────]
                                         └─ overlap ─┘  (80 tokens, in both chunks)

Chunk 1 = end of paragraph A + first 80 tokens of paragraph B
Chunk 2 = those same 80 tokens of paragraph B + the rest of paragraph B

Without overlap, an important discussion that happens to straddle two chunks may be "cut" during retrieval — each chunk contains only half, and neither receives a high relevance score.
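
A simplified overlap-aware chunker might look like this — a sketch over a pre-tokenized array, with real tokenization and sentence-boundary handling omitted:

const TARGET_SIZE = 400;  // tokens per chunk
const OVERLAP = 80;       // tokens shared between adjacent chunks

function chunkTokens(tokens: string[]): string[][] {
    const chunks: string[][] = [];
    const step = TARGET_SIZE - OVERLAP;  // advance 320 tokens per chunk
    for (let start = 0; start < tokens.length; start += step) {
        chunks.push(tokens.slice(start, start + TARGET_SIZE));
        if (start + TARGET_SIZE >= tokens.length) break;  // last chunk reached the end
    }
    return chunks;
}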

28.10.5 Engineering Details of Batching and Concurrency

import pLimit from "p-limit";

const BATCH_MAX_TOKENS = 8000;
const CONCURRENT_REQUESTS = 4;

// Group chunks into batches by token count
const batches = groupIntoBatches(chunks, BATCH_MAX_TOKENS);

// Process concurrently, but cap in-flight requests at 4 (to avoid API rate limiting)
const limit = pLimit(CONCURRENT_REQUESTS);
const results = await Promise.all(
    batches.map(batch => limit(() => embedBatch(batch)))
);

Limiting concurrency to 4 protects against API rate limits: most Embedding APIs restrict concurrent requests, and exceeding the limit causes 429 errors.
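
groupIntoBatches is not shown in the original snippet; a plausible greedy implementation, assuming each chunk carries its own token count, would be:

interface TextChunk { content: string; tokenCount: number; }

function groupIntoBatches(chunks: TextChunk[], maxTokens: number): TextChunk[][] {
    const batches: TextChunk[][] = [];
    let current: TextChunk[] = [];
    let currentTokens = 0;
    for (const chunk of chunks) {
        // Start a new batch when adding this chunk would exceed the limit
        if (current.length > 0 && currentTokens + chunk.tokenCount > maxTokens) {
            batches.push(current);
            current = [];
            currentTokens = 0;
        }
        current.push(chunk);
        currentTokens += chunk.tokenCount;
    }
    if (current.length > 0) batches.push(current);
    return batches;
}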


28.11 Complete Hybrid Retrieval Flow

User query → Generate query Embedding (four-level fallback chain)
                │
        ┌───────┴───────┐
        ▼               ▼
BM25 search (FTS5)   Vector search (sqlite-vec)
Fetch 24 candidates  Fetch 24 candidates
    .catch(()=>[])       .catch(()=>[])
        │               │
        └───────┬───────┘
                ▼
        Union merge with deduplication (by chunk_id)
                │
                ▼
        BM25 score normalization: bm25Score = 1/(1+rank)
                │
                ▼
        Weighted fusion: finalScore = 0.7×vectorScore + 0.3×bm25Score
                │
                ▼
        Sort by finalScore descending
                │
                ▼
        Take Top-6 results
                │
                ▼
        Inject into Context Window
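
Putting the pieces together, the whole flow fits in one function. This is a sketch assembled from the snippets in this chapter (the Chunk type and helper signatures are assumptions, not OpenClaw's actual source):

async function retrieveMemory(query: string, finalK = 6): Promise<Chunk[]> {
    const fetchLimit = finalK * 4;  // candidateMultiplier = 4

    const embedding = await getEmbedding(query);        // four-level fallback chain
    if (embedding === null) {
        return searchByBm25(query, { limit: finalK });  // Level 4: pure BM25
    }

    // Each backend over-fetches and degrades independently
    const [vectorResults, bm25Results] = await Promise.all([
        searchByVector(embedding, { limit: fetchLimit }).catch(() => []),
        searchByBm25(query, { limit: fetchLimit }).catch(() => []),
    ]);

    // Normalize BM25 by rank, union by chunk id, fuse 0.7/0.3, keep Top-K
    const normalized = bm25Results.map((r, i) => ({
        ...r,
        normalizedBm25Score: 1 / (1 + i),
    }));
    return unionByChunkId(vectorResults, normalized)
        .map(c => ({
            ...c,
            finalScore: 0.7 * (c.vectorScore ?? 0) + 0.3 * (c.normalizedBm25Score ?? 0),
        }))
        .sort((a, b) => b.finalScore - a.finalScore)
        .slice(0, finalK);
}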

28.12 Chapter Summary

OpenClaw's vector retrieval implementation is an engineering design with pragmatism at its core:

  1. SQLite + sqlite-vec + FTS5 instead of a standalone vector database: zero operations, a single portable file, rebuildable from the Markdown source of truth
  2. Union-merged hybrid retrieval with 4× over-fetching: BM25 contributes exact-term matches, vectors contribute semantic matches
  3. Graceful degradation everywhere: per-backend .catch isolation plus a four-level embedding fallback chain that bottoms out in pure BM25
  4. Pragmatic chunking: 400-token chunks with 80-token overlap, batched and embedded with bounded concurrency

The entire scoring pipeline reduces to a single formula:

$$ \text{finalScore} = 0.7 \times \text{vectorScore} + 0.3 \times \frac{1}{1 + \text{rank}_{\text{BM25}}} $$


Next: Chapter 29 — The Workspace File System Explained: AGENTS.md / SOUL.md / USER.md / HEARTBEAT.md — Roles and Best Practices
