Chapter 60: LlamaIndex Integration: Document Intelligence and Enterprise Knowledge Bases

60.1 LlamaIndex's Role: Infrastructure for Document Intelligence

LlamaIndex (formerly GPT Index) is an LLM application framework focused on data connectivity and retrieval. If LangChain is the "Swiss Army knife of LLM applications," LlamaIndex is the "professional toolbox for document intelligence." Its core value lies in:

Unified data connectors: Support for 100+ data sources (PDF, Word, Excel, Confluence, Notion, databases, etc.)
High-performance indexing engine: Vector indexes, keyword indexes, and knowledge graph indexes, composable
Query engine: Converts natural language queries into structured retrieval, then uses an LLM to generate answers
Agent framework: ReAct Agents based on retrieval

Using Claude as LlamaIndex's LLM backend suits enterprise knowledge bases well: answers are grounded in retrieved passages, which helps reduce hallucinations, and Claude's 200K-token context window leaves room for many long passages in a single prompt.

60.2 Environment Setup

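pip install llama-index llama-index-llms-anthropic llama-index-embeddings-voyageai
pip install llama-index-readers-file
pip install llama-index-vector-stores-chroma  # optional, used in 60.8.1
pip install llama-index-multi-modal-llms-anthropic  # optional, used for image descriptions in 60.5
pip install llama-index-storage-docstore-redis llama-index-storage-index-store-redis  # optional, used in 60.8.2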

60.2.1 Configuring Claude as LlamaIndex's LLM

from llama_index.llms.anthropic import Anthropic
from llama_index.core import Settings

llm = Anthropic(
    model="claude-opus-4-5",
    api_key="your-api-key",  # or ANTHROPIC_API_KEY env var
    max_tokens=4096,
    temperature=0,
)
Settings.llm = llm

# Configure the embedding model (Anthropic does not offer its own embedding model; its documentation points to Voyage AI)
from llama_index.embeddings.voyageai import VoyageEmbedding
embed_model = VoyageEmbedding(
    model_name="voyage-3",
    voyage_api_key="your-voyage-key"
)
Settings.embed_model = embed_model

# Claude's 200K context allows larger chunks
Settings.chunk_size = 1024
Settings.chunk_overlap = 128
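
To make the interaction between chunk_size and chunk_overlap concrete, here is a minimal sketch of a fixed-window splitter. It is an illustration only: the function split_with_overlap is hypothetical, it works on pre-split tokens, and it does not reproduce how LlamaIndex's own splitters respect sentence boundaries.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Slide a fixed-size window over tokens; neighbouring chunks share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window has reached the end of the input
    return chunks


# Hypothetical input: 2,500 placeholder tokens
tokens = [f"t{i}" for i in range(2500)]
chunks = split_with_overlap(tokens, chunk_size=1024, chunk_overlap=128)
print([len(c) for c in chunks])
```

With these settings each new chunk starts 896 tokens after the previous one, so the last 128 tokens of one chunk reappear at the start of the next. A larger overlap preserves more context across chunk boundaries at the cost of more chunks to embed.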

60.3 Building a Basic Document Index

60.3.1 Building an Index from Local Files

import os

from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

PERSIST_DIR = "./index_storage"

if os.path.exists(PERSIST_DIR):
    # Load the persisted index on subsequent runs without rebuilding
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)
else:
    # First run: read the files, build the index, and persist it
    documents = SimpleDirectoryReader(
        input_dir="./company_docs",
        recursive=True,
        required_exts=[".pdf", ".docx", ".md", ".txt"],
        filename_as_id=True
    ).load_data()

    print(f"Loaded {len(documents)} documents")

    index = VectorStoreIndex.from_documents(documents, show_progress=True)
    index.storage_context.persist(persist_dir=PERSIST_DIR)
    print("Index saved")

60.3.2 Loading from Multiple Data Sources

from pathlib import Path

from llama_index.core import Document, VectorStoreIndex
from llama_index.readers.file import PDFReader
from llama_index.core.node_parser import SentenceSplitter

pdf_reader = PDFReader()
pdf_docs = pdf_reader.load_data(Path("./reports/annual_report_2024.pdf"))

# Custom Documents from API data
# (wiki_api is a placeholder for your own wiki client; it is not defined here)
api_docs = [
    Document(
        text=article["content"],
        metadata={
            "source": "company_wiki",
            "author": article["author"],
            "created_at": article["created_at"],
            "department": article["department"],
            "doc_type": "policy"
        }
    )
    for article in wiki_api.get_articles()
]

all_docs = pdf_docs + api_docs

# Custom splitter: split at sentence boundaries to preserve semantic integrity
splitter = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=128,
    paragraph_separator="\n\n"
)

index = VectorStoreIndex.from_documents(
    all_docs,
    transformations=[splitter],
    show_progress=True
)

60.4 Query Engines: Extracting Knowledge from Documents

60.4.1 Basic Query Engine

# Reuses the index built in 60.3; no additional imports are needed

query_engine = index.as_query_engine(
    similarity_top_k=5,
    response_mode="tree_summarize",  # Hierarchical summarization for long documents
    verbose=True
)

response = query_engine.query("What are the business travel reimbursement limits?")
print(response.response)

# Inspect source documents
for node in response.source_nodes:
    print(f"\nSource: {node.metadata.get('file_name', 'Unknown')}")
    print(f"Relevance: {node.score:.3f}")
    print(f"Excerpt: {node.text[:200]}...")
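
The similarity_top_k setting controls how many of the closest chunks are passed to the LLM. As a rough illustration of what "closest" means, the sketch below ranks chunks by cosine similarity and keeps the best k. The three-dimensional vectors and chunk names are made up for the example; real embeddings have many more dimensions, and a vector store would normally perform this search.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k):
    """Return the k (chunk_id, score) pairs most similar to the query, best first."""
    scored = [(cosine(query_vec, vec), chunk_id) for chunk_id, vec in chunk_vecs.items()]
    scored.sort(reverse=True)
    return [(chunk_id, score) for score, chunk_id in scored[:k]]


# Hypothetical toy embeddings for three chunks
chunk_vecs = {
    "travel_policy": [0.9, 0.1, 0.0],
    "leave_policy": [0.1, 0.9, 0.0],
    "api_guide": [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.2, 0.0]  # stands in for the embedded user question

results = top_k(query_vec, chunk_vecs, k=2)
for chunk_id, score in results:
    print(f"{chunk_id}: {score:.3f}")
```

A higher k gives the model more evidence to draw on but also more tokens to read, so it is worth tuning against the evaluation results in 60.7.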

60.4.2 Advanced Query Modes

from llama_index.core import VectorStoreIndex
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool

# Separate indexes for different document types
# (policy_docs, technical_docs, and hr_docs are lists of Documents loaded as in 60.3)
policy_index = VectorStoreIndex.from_documents(policy_docs)
technical_index = VectorStoreIndex.from_documents(technical_docs)
hr_index = VectorStoreIndex.from_documents(hr_docs)

tools = [
    QueryEngineTool.from_defaults(
        query_engine=policy_index.as_query_engine(),
        name="policy_search",
        description="Search company policies and regulations including reimbursements, attendance, and procurement"
    ),
    QueryEngineTool.from_defaults(
        query_engine=technical_index.as_query_engine(),
        name="technical_docs",
        description="Search technical documentation, API docs, and architecture design documents"
    ),
    QueryEngineTool.from_defaults(
        query_engine=hr_index.as_query_engine(),
        name="hr_knowledge",
        description="Search HR documents including benefits, leave policies, and promotion processes"
    )
]

# SubQuestion engine: automatically decomposes complex questions
sub_question_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=tools,
    verbose=True
)

response = sub_question_engine.query(
    "What are the travel policies? How does accommodation reimbursement work, and how is overtime calculated during business trips?"
)
print(response.response)

60.4.3 Streaming Responses

query_engine = index.as_query_engine(streaming=True)
streaming_response = query_engine.query("Summarize the company's main product milestones in 2024")

for text in streaming_response.response_gen:
    print(text, end="", flush=True)
print()

for node in streaming_response.source_nodes:
    print(f"Source: {node.metadata.get('source')}")

60.5 Building an Enterprise Knowledge Base Agent

60.5.1 ReAct-based Knowledge Base Agent

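# Note: ReActAgent.from_tools is the classic agent API; newer llama-index releases
# use a workflow-based agent instead, so pin your version or adapt accordingly
from llama_index.core.agent import ReActAgent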
from llama_index.core.tools import QueryEngineTool, FunctionTool
from llama_index.llms.anthropic import Anthropic

llm = Anthropic(model="claude-opus-4-5", max_tokens=4096)

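kb_tool = QueryEngineTool.from_defaults(
    # Use a fresh, non-streaming engine; the one from 60.4.3 was created with streaming=True
    query_engine=index.as_query_engine(similarity_top_k=5),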
    name="knowledge_base",
    description="""Search the company's internal knowledge base. Suitable for:
    - Company policy queries (reimbursement, attendance, leave)
    - Product documentation and technical specifications
    - Historical project information
    - Organizational structure and contacts
    Not suitable for: questions requiring real-time data"""
)

def get_employee_info(employee_id: str) -> str:
    """Retrieve real-time employee information from the HR system."""
    return f"Employee {employee_id}: Name=Alice Johnson, Dept=Engineering, Manager=Bob Smith"

employee_tool = FunctionTool.from_defaults(
    fn=get_employee_info,
    name="get_employee_info",
    description="Look up employee information by employee ID, returns name, department, and manager"
)

def get_current_projects() -> str:
    """Get list of currently active projects."""
    return "Active projects:\n1. Intelligent Support System (Q2 2025)\n2. Data Platform Build-out (Q3 2025)"

projects_tool = FunctionTool.from_defaults(
    fn=get_current_projects,
    name="get_current_projects",
    description="Get the list of currently active company projects"
)

agent = ReActAgent.from_tools(
    tools=[kb_tool, employee_tool, projects_tool],
    llm=llm,
    verbose=True,
    max_iterations=8,
    context="""You are Aria, the company's intelligent knowledge assistant.
Your role is to help employees find information including company policies, technical docs, and personnel information.
Always cite your sources accurately. Clearly state when you are uncertain."""
)

response = agent.chat("What is the reimbursement limit for new employee E12345? Which project are they on?")
print(response.response)
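
Under the hood, a ReAct agent alternates between a reasoning step and a tool call, feeding each tool result back to the model as an observation. The sketch below shows only the dispatch half of that loop: parsing one ReAct-style step and calling the matching function. The step text, the TOOLS registry, and run_action are hypothetical and simplified; they are not LlamaIndex's internal parser.

```python
import json


def get_employee_info(employee_id: str) -> str:
    # Stub standing in for a real HR system lookup
    return f"Employee {employee_id}: Dept=Engineering"


# Registry mapping tool names (as the model would write them) to callables
TOOLS = {"get_employee_info": get_employee_info}


def run_action(step_text: str) -> str:
    """Parse one 'Action / Action Input' step and return the observation string."""
    fields = {}
    for line in step_text.splitlines():
        key, sep, value = line.partition(":")  # split at the first colon only
        if sep:
            fields[key.strip()] = value.strip()
    name = fields.get("Action")
    tool = TOOLS.get(name)
    if tool is None:
        return f"Error: unknown tool {name!r}"
    kwargs = json.loads(fields.get("Action Input", "{}"))
    return tool(**kwargs)


# A made-up model output for one step of the loop
step = (
    "Thought: I need the employee's department first.\n"
    "Action: get_employee_info\n"
    'Action Input: {"employee_id": "E12345"}'
)
observation = run_action(step)
print(observation)
```

In a full loop, the observation would be appended to the conversation and the model asked for its next step, until it produces a final answer or max_iterations is reached.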

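60.5.2 Indexing Images and Diagrams

# Requires: pip install llama-index-multi-modal-llms-anthropic
from llama_index.multi_modal_llms.anthropic import AnthropicMultiModal
from llama_index.core import SimpleDirectoryReader, Document, VectorStoreIndex

mm_llm = AnthropicMultiModal(model="claude-opus-4-5", max_tokens=1024)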

image_docs = SimpleDirectoryReader(
    input_dir="./diagrams",
    required_exts=[".png", ".jpg", ".jpeg"]
).load_data()

# Use Claude's vision to generate text descriptions of diagrams for indexing
text_descriptions = []
for img_doc in image_docs:
    description = mm_llm.complete(
        prompt="Please describe this image in detail, especially any process steps, component relationships, or data shown.",
        image_documents=[img_doc]
    )
    text_descriptions.append(Document(
        text=description.text,
        metadata={
            "source_image": img_doc.metadata.get("file_name"),
            "doc_type": "image_description"
        }
    ))

# text_docs: the text Documents loaded earlier (for example, all_docs from 60.3.2)
all_docs = text_docs + text_descriptions
index = VectorStoreIndex.from_documents(all_docs)

60.6 Incremental Index Updates

Enterprise knowledge bases change continuously; incremental updates are essential to avoid full rebuilds.

from llama_index.core import StorageContext, VectorStoreIndex, load_index_from_storage
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.embeddings.voyageai import VoyageEmbedding

embed_model = VoyageEmbedding(model_name="voyage-3")

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=1024, chunk_overlap=128),
        embed_model
    ],
    docstore=SimpleDocumentStore(),  # Tracks processed documents
)

# Full initialization
nodes = pipeline.run(documents=initial_docs, show_progress=True)
index = VectorStoreIndex(nodes)
index.storage_context.persist(persist_dir="./index_storage")
pipeline.persist("./pipeline_storage")

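def update_knowledge_base(new_or_modified_docs: list):
    """Incrementally update the knowledge base, skipping unchanged documents"""
    # Rebuild the pipeline with the same transformations, then restore its
    # persisted docstore and cache
    pipeline = IngestionPipeline(
        transformations=[
            SentenceSplitter(chunk_size=1024, chunk_overlap=128),
            embed_model,
        ],
        docstore=SimpleDocumentStore(),
    )
    pipeline.load("./pipeline_storage")

    # With a docstore attached, run() skips documents whose doc_id and hash are unchanged.
    # Note: without a vector store attached to the pipeline, nodes from an older
    # version of a modified document may remain in the index.
    new_nodes = pipeline.run(documents=new_or_modified_docs)

    if new_nodes:
        storage_context = StorageContext.from_defaults(persist_dir="./index_storage")
        index = load_index_from_storage(storage_context)

        # Nodes already carry embeddings from the pipeline, so insert them in one call
        index.insert_nodes(new_nodes)

        index.storage_context.persist(persist_dir="./index_storage")
        pipeline.persist("./pipeline_storage")
        print(f"Updated {len(new_nodes)} new nodes")
    else:
        print("No document changes detected")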
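
The duplicate detection that the pipeline's docstore performs can be pictured as a lookup from document ID to content hash. The following is a minimal standalone sketch of that idea, assuming documents arrive as a plain dict of ID to text; content_hash and filter_changed are hypothetical helpers, not LlamaIndex functions.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def filter_changed(docs: dict, seen_hashes: dict) -> list:
    """Return IDs of documents that are new or whose content changed.

    seen_hashes maps doc_id -> last ingested hash and is updated in place.
    """
    changed = []
    for doc_id, text in docs.items():
        digest = content_hash(text)
        if seen_hashes.get(doc_id) != digest:
            changed.append(doc_id)
            seen_hashes[doc_id] = digest
    return changed


seen = {}
first = filter_changed({"a.md": "v1", "b.md": "v1"}, seen)
second = filter_changed({"a.md": "v1", "b.md": "v2", "c.md": "v1"}, seen)
print(first)
print(second)
```

Only the documents returned by the second call would need to be split and embedded again, which is where the savings over a full rebuild come from.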

60.7 Retrieval Quality Evaluation

from llama_index.core.evaluation import (
    FaithfulnessEvaluator,
    RelevancyEvaluator,
)
from llama_index.llms.anthropic import Anthropic

eval_llm = Anthropic(model="claude-opus-4-5")

# Faithfulness: is the answer grounded in the retrieved context (anti-hallucination)?
faithfulness_evaluator = FaithfulnessEvaluator(llm=eval_llm)

# Relevancy: do the answer and its retrieved sources actually address the question?
relevancy_evaluator = RelevancyEvaluator(llm=eval_llm)

test_questions = [
    "How many annual leave days do employees get?",
    "How do I apply for travel reimbursement?",
    "What benefits do probationary employees receive?"
]

for question in test_questions:
    response = query_engine.query(question)
    
    faithfulness_result = faithfulness_evaluator.evaluate_response(response=response)
    relevancy_result = relevancy_evaluator.evaluate_response(query=question, response=response)
    
    print(f"\nQuestion: {question}")
    print(f"Faithfulness: {faithfulness_result.score:.2f} ({faithfulness_result.feedback})")
    print(f"Relevancy: {relevancy_result.score:.2f}")
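
Per-question scores are easier to act on once they are aggregated. As one possible approach, the sketch below averages the two metrics over a test set and lists the questions that fall below a threshold. The EvalRecord class, the summarize helper, and the scores are placeholders invented for illustration; they are not measured results.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class EvalRecord:
    question: str
    faithfulness: float  # 1.0 = grounded in the retrieved context, 0.0 = not
    relevancy: float     # 1.0 = answer and sources address the question


def summarize(records, threshold=1.0):
    """Average each metric and collect questions scoring below the threshold."""
    failing = [
        r.question
        for r in records
        if r.faithfulness < threshold or r.relevancy < threshold
    ]
    return {
        "mean_faithfulness": mean(r.faithfulness for r in records),
        "mean_relevancy": mean(r.relevancy for r in records),
        "failing_questions": failing,
    }


# Placeholder scores, not real evaluation output
records = [
    EvalRecord("How many annual leave days do employees get?", 1.0, 1.0),
    EvalRecord("How do I apply for travel reimbursement?", 0.0, 1.0),
    EvalRecord("What benefits do probationary employees receive?", 1.0, 1.0),
]
report = summarize(records)
print(report)
```

Tracking these aggregates across index rebuilds or prompt changes makes regressions visible before they reach users.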

60.8 Production Deployment Recommendations

60.8.1 Using Chroma or Qdrant as Production Vector Database

from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext, VectorStoreIndex
import chromadb

chroma_client = chromadb.HttpClient(host="localhost", port=8000)
chroma_collection = chroma_client.get_or_create_collection("company_knowledge")

vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    show_progress=True
)

60.8.2 Caching Strategy

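# Requires: pip install llama-index-storage-index-store-redis llama-index-storage-docstore-redis
from llama_index.core import StorageContext
from llama_index.storage.index_store.redis import RedisIndexStore
from llama_index.storage.docstore.redis import RedisDocumentStore

# Keep index metadata and documents in Redis so large indexes need not be reloaded from disk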
storage_context = StorageContext.from_defaults(
    index_store=RedisIndexStore.from_host_and_port(host="localhost", port=6379),
    docstore=RedisDocumentStore.from_host_and_port(host="localhost", port=6379)
)

Summary

The combination of LlamaIndex and Claude has clear advantages in enterprise knowledge base scenarios: Claude's 200K token context window handles extremely long documents in one pass, while LlamaIndex provides the complete toolchain for index building, incremental updates, and multi-source data fusion. The core architecture is: multi-source data loading → SentenceSplitter chunking → Voyage Embedding → vector store → query engine → Claude answer generation. In production, the incremental update Pipeline and a vector database (Chroma/Qdrant) are key to keeping the knowledge base current, while FaithfulnessEvaluator is an important tool for controlling hallucination risk.
