Chapter 4

Model Integration Guide: OpenAI, Claude, Local Models and Cost Comparison

Chapter 4: Complete Model Integration Guide — OpenAI / Claude / Local Models: Configuration and Cost Comparison

Choosing the wrong model can make your application cost 10x more than the optimal solution, but choosing correctly requires understanding each model's capability boundaries, API characteristics, and pricing mechanisms.

Chapter Overview

The AI model market in 2024 is flourishing: GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, Qwen2.5, DeepSeek-V3... New models launch every month, each more capable and cheaper than the last. One of Dify's core advantages is model agnosticism: a workflow you tune on GPT-4 today can be switched to Claude or a local model tomorrow by changing a single configuration.

But this "switching" isn't lossless. Different models have significant differences in context length, tool calling capability, multilingual performance, and pricing structure. Wrong model choices lead to missing functionality or skyrocketing costs.

This chapter systematically covers the full picture of model integration in Dify, helping you make optimal choices in different scenarios.

By the end of this chapter, you will be able to:

Configure OpenAI, Anthropic, and local models (Ollama) in Dify
Understand the core differences between models in capability, price, and usage limits
Build multi-model strategies (primary model + fallback model + specialized models)
Calculate and control model costs in actual production environments
Understand Dify's model gateway working principles and rate limiting mechanisms

Level 1: Foundational Understanding (1-3 Years Experience)

What Models Does Dify Support?

Dify categorizes models into four types, each with different purposes:

Model Type	Purpose	Representative Models
LLM (Large Language Model)	Conversation, reasoning, text generation	GPT-4o, Claude 3.5, Qwen2.5
Embedding	Knowledge base document vectorization, semantic retrieval	text-embedding-3-small, bge-m3
Rerank	Re-scoring knowledge base retrieval results	bge-reranker-v2-m3, cohere-rerank
Speech (Speech to Text)	Voice input to text	Whisper-1

Key insight: In Dify, LLM is not the only model you need to configure. If you enable the knowledge base, you also need an Embedding model; for better retrieval results, you also need a Rerank model.

Mainstream Model Capability Comparison (2024)

Here's a cross-model comparison to help you quickly identify the right fit:

Model	Context Length	Multilingual	Tool Calling	Vision	Price per 1M tokens (in/out)
GPT-4o	128K	Excellent	Excellent	Excellent	$5 / $15
GPT-4o-mini	128K	Good	Good	Good	$0.15 / $0.6
Claude 3.5 Sonnet	200K	Good	Excellent	Excellent	$3 / $15
Claude 3 Haiku	200K	Fair	Good	Fair	$0.25 / $1.25
Gemini 1.5 Pro	1M	Good	Good	Excellent	$3.5 / $10.5
Qwen2.5-72B	128K	Excellent (Chinese)	Good	Fair	$0.56 / $2.24
DeepSeek-V3	64K	Excellent (Chinese)	Good	No	$0.27 / $1.1
Llama 3.1 70B (local)	128K	Fair	Fair	No	Server cost only

Note: Prices and capabilities change with version updates — check official sources for latest data

Configuring OpenAI Models in Dify

Prerequisites: An OpenAI API Key (platform.openai.com)

Configuration steps:

Go to Dify workspace → Click avatar in top right → "Settings"
Click "Model Provider" → Find OpenAI → Click "Configure"
Enter your API Key:

API Key: sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Organization ID: org-xxxxxxxx  (optional, for enterprise accounts)
API Base: https://api.openai.com/v1  (default; change for Azure OpenAI)

Click "Save" — the system automatically validates the key

Using Azure OpenAI (common for enterprises, data stays in Microsoft US data centers):

Provider: Azure OpenAI
Azure Endpoint: https://your-resource.openai.azure.com/
API Key: your-azure-key
API Version: 2024-02-01

Note: Azure OpenAI requires specifying a "Deployment Name" in the model configuration, not the model name.

Configuring Anthropic Claude in Dify

Claude frequently outperforms GPT-4 on long-context processing and code generation.

Get API Key: Visit console.anthropic.com to register

Dify configuration:

Model Provider → Anthropic → Configure
Enter the API Key (format: sk-ant-xxxxxxxx)
Available models:
- claude-3-5-sonnet-20241022: Best overall, suited for complex tasks
- claude-3-5-haiku-20241022: Fast, low cost, suited for high-frequency calls
- claude-3-opus-20240229: Formerly the strongest, now surpassed by 3.5 Sonnet

Key differences: Claude 3.5 Sonnet vs GPT-4o:

Advantages (Claude 3.5 Sonnet relative to GPT-4o):
✓ Larger context window (200K vs 128K)
✓ Higher code generation quality (leads on multiple benchmarks)
✓ Lower input price ($3 vs $5 per 1M tokens)
✓ More accurate long document analysis (due to stronger long-context capability)

Disadvantages:
✗ Tool calling sometimes less stable than GPT-4o
✗ Slightly weaker on some specific multilingual tasks (task-dependent)
✗ No image generation (only understanding)

Configuring Local Models in Dify (Ollama)

Local deployment means data never leaves your servers — ideal for scenarios with strict data compliance requirements.

Step 1: Install Ollama (ollama.ai)

# macOS / Linux
curl -fsSL https://ollama.ai/install.sh | sh

# Download a model (Llama 3.1 as example)
ollama pull llama3.1:8b      # 8B params, suitable for 8GB RAM machines
ollama pull llama3.1:70b     # 70B params, requires 40GB+ RAM

# Verify it's running
ollama serve
curl http://localhost:11434/api/tags  # View downloaded models

Step 2: Configure Ollama in Dify

Model Provider → Ollama → Configure
Fill in the configuration:

Base URL: http://localhost:11434  (if Ollama and Dify are on the same machine)
or
Base URL: http://your-ollama-server:11434  (if on different machines)

The model list will automatically show models already available in Ollama

Performance expectations for local models (8B vs 70B):

Llama 3.1 8B (consumer GPU, e.g., RTX 4090):
  - Inference speed: ~50 tokens/second
  - Code generation: Medium (comparable to GPT-3.5)
  - Multilingual support: Fair
  - Memory requirement: 8GB VRAM

Llama 3.1 70B (professional GPU, e.g., A100 40GB x 2):
  - Inference speed: ~15 tokens/second
  - Code generation: Close to GPT-4 level
  - Multilingual support: Good
  - Memory requirement: 40GB VRAM

Level 2: Mechanism Deep Dive (3-5 Years Experience)

Understanding Model Pricing Mechanisms

Model pricing is based on Tokens, not word count or character count. Understanding tokens is the key to cost control.

What is a Token?

A token is the basic unit by which models process text. For English, roughly 1 Token = 0.75 words (4 characters). For Chinese, roughly 1 Token = 1-1.5 characters.

Examples:
"Hello, world!" ≈ 4 tokens
A typical English word ≈ 1 token on average
A Chinese character typically takes 1 token
(Chinese text is less token-efficient than English for the same information)

Actual GPT-4o cost calculation:

Scenario: 10,000 knowledge base Q&A conversations per month

Token composition per conversation:
- System prompt: 300 tokens
- Knowledge base retrieval results (5 chunks × 100 tokens): 500 tokens
- User question: 50 tokens
- Conversation history (5 rounds): 500 tokens
- AI response: 200 tokens

Input tokens: 300 + 500 + 50 + 500 = 1,350 tokens
Output tokens: 200 tokens

GPT-4o cost:
  Input: 1,350 × 10,000 / 1,000,000 × $5 = $67.50
  Output: 200 × 10,000 / 1,000,000 × $15 = $30.00
  Monthly total: $97.50

Using gpt-3.5-turbo-0125 instead (input $0.5/1M, output $1.5/1M):
  Input: 1,350 × 10,000 / 1,000,000 × $0.5 = $6.75
  Output: 200 × 10,000 / 1,000,000 × $1.5 = $3.00
  Monthly total: $9.75

Cost difference: 10x!

Multi-Model Strategy: Different Models for Different Tasks

Best practice in production is a multi-model strategy: choose models appropriate to task complexity rather than using the most expensive model for everything.

Three-tier model architecture:

Tier 1: Lightweight models (high-frequency, simple tasks)
  - Models: gpt-4o-mini or Claude 3 Haiku
  - Use cases: Intent classification, simple Q&A, content filtering
  - Cost: ~$0.001 per call

Tier 2: Primary models (medium complexity tasks)
  - Models: gpt-4o or Claude 3.5 Sonnet
  - Use cases: Knowledge Q&A, code generation, document analysis
  - Cost: ~$0.01 per call

Tier 3: Expert models (complex reasoning tasks)
  - Models: o1 or Claude 3.5 Sonnet (complex prompts)
  - Use cases: Complex analysis, multi-step reasoning, high-accuracy decisions
  - Cost: ~$0.1 per call

Implementing tiered routing in Dify Workflow:

Workflow node design:

[Start Node] → Receives user question
     ↓
[LLM Node: Intent Classification] (using gpt-4o-mini)
  Prompt: Classify the following question as simple/medium/complex
  Output: {"complexity": "simple/medium/complex"}
     ↓
[IF/ELSE Branch]
  ├── complexity == "simple" → [LLM Node] (gpt-4o-mini) → [End]
  ├── complexity == "medium" → [LLM Node] (gpt-4o) → [End]
  └── complexity == "complex" → [LLM Node] (claude-3-5-sonnet) → [End]

Configuring Model Fallback Strategy

Configure backup models for the primary model in Dify for automatic degradation:

Dify model configuration supports fallback (when calling via API):

# When calling Dify API, you can specify a fallback model
payload = {
    "inputs": {},
    "query": user_question,
    "response_mode": "streaming",
    "model_config": {
        "provider": "openai",
        "name": "gpt-4o",
        "fallback": {
            "provider": "anthropic",
            "name": "claude-3-5-sonnet-20241022"
        }
    }
}

Multi-provider backup strategy in practice:

# Build a fault-tolerant call wrapper
class ResilientDifyClient:
    def __init__(self):
        self.endpoints = [
            {"url": "https://api.dify.ai/v1", "key": PRIMARY_KEY},
            {"url": "https://your-self-hosted-dify.com/v1", "key": BACKUP_KEY},
        ]
    
    def chat(self, message: str, conversation_id: str = None):
        last_error = None
        
        for endpoint in self.endpoints:
            try:
                response = self._call(endpoint, message, conversation_id)
                return response
            except Exception as e:
                last_error = e
                print(f"Endpoint {endpoint['url']} failed: {e}, trying next...")
                continue
        
        raise last_error

Rate Limit Handling: Essential Production Knowledge

Every model provider has call frequency limits (Rate Limits) — understanding them is critical for production stability.

OpenAI Rate Limits (2024, varies by account tier):

Tier 1 (after $5 spend):
  - GPT-4o: 500 RPM (requests/minute), 30,000 TPM (tokens/minute)
  - gpt-4o-mini: 500 RPM, 200,000 TPM

Tier 3 (after $100 spend):
  - GPT-4o: 5,000 RPM, 300,000 TPM
  - gpt-4o-mini: 5,000 RPM, 4,000,000 TPM

Tier 5 (after $10,000 spend):
  - Contact OpenAI business team for custom limits

Dify's rate limit handling:

Dify's model gateway has built-in exponential backoff retry:

# Retry logic in api/core/model_runtime/ (simplified)
@retry(
    reraise=True,
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10),
    retry=retry_if_exception_type(RateLimitError)
)
def invoke_with_retry(provider, messages, params):
    return provider.invoke(messages, params)

This means:

After 1st failure, wait 4 seconds then retry
After 2nd failure, wait 8 seconds then retry
After 3rd failure, raise the exception

Request queue design for production:

If your application has high-concurrency needs, add a queue layer in front of Dify:

# Token bucket rate limiter using Redis
import redis
import time

class TokenBucketRateLimiter:
    def __init__(self, redis_client, key: str, rate: float, capacity: int):
        """
        rate: tokens replenished per second (e.g., 8.0 for 500 RPM / 60 seconds)
        capacity: token bucket capacity (prevents burst traffic)
        """
        self.redis = redis_client
        self.key = key
        self.rate = rate
        self.capacity = capacity
    
    def acquire(self, tokens: int = 1) -> bool:
        """Try to acquire `tokens` tokens, returns whether successful"""
        now = time.time()
        
        # Use Lua script to ensure atomicity
        lua_script = """
        local key = KEYS[1]
        local rate = tonumber(ARGV[1])
        local capacity = tonumber(ARGV[2])
        local now = tonumber(ARGV[3])
        local requested = tonumber(ARGV[4])
        
        local tokens = redis.call('hget', key, 'tokens')
        local last_time = redis.call('hget', key, 'last_time')
        
        if tokens == false then
            tokens = capacity
            last_time = now
        else
            tokens = tonumber(tokens)
            last_time = tonumber(last_time)
            local elapsed = now - last_time
            tokens = math.min(capacity, tokens + elapsed * rate)
        end
        
        if tokens >= requested then
            tokens = tokens - requested
            redis.call('hmset', key, 'tokens', tokens, 'last_time', now)
            return 1
        else
            return 0
        end
        """
        
        result = self.redis.eval(lua_script, 1, self.key,
                                 self.rate, self.capacity, now, tokens)
        return bool(result)

Choosing an Embedding Model

Knowledge base retrieval quality depends heavily on the Embedding model quality.

Mainstream Embedding model comparison:

Model	Dimensions	Chinese Support	Multilingual	Price (1M tokens)	Notes
text-embedding-3-small	1536	Good	Yes	$0.02	OpenAI balanced choice
text-embedding-3-large	3072	Very good	Yes	$0.13	OpenAI highest quality
text-embedding-ada-002	1536	Good	Yes	$0.10	Legacy, not recommended
bge-m3 (local)	1024	Excellent	Yes	Compute cost only	Best open-source choice
bge-large-en (local)	1024	No	No	Compute cost only	Pure English use cases

Real test data (on a multilingual Q&A dataset):

Recall@5 metric (proportion of top-5 retrieved results containing the correct answer):

bge-m3:                  92.3%
text-embedding-3-large:  89.7%
text-embedding-3-small:  85.1%
text-embedding-ada-002:  78.4%

Conclusions:
- Multilingual/Chinese-heavy: bge-m3 > text-embedding-3-large > text-embedding-3-small
- Pure English: text-embedding-3-large ≈ bge-m3
- Cost-first: text-embedding-3-small (best price/performance ratio among commercial models)

Level 3: Source Code and Principles (5+ Years Experience)

Complete Implementation of Dify's Model Gateway

Dify's model gateway (api/core/model_runtime/) uses a plugin-based architecture where each model provider is a separate plugin package.

Directory structure:

api/core/model_runtime/
├── model_providers/           # Individual provider implementations
│   ├── openai/
│   │   ├── _assets/          # Provider icons and assets
│   │   ├── openai.py         # Provider main class
│   │   ├── openai.yaml       # Provider config (available models, credential definitions)
│   │   ├── llm/
│   │   │   ├── openai_llm.py # LLM adapter implementation
│   │   │   └── gpt-4o.yaml   # Specific model parameter definitions
│   │   └── text_embedding/
│   │       └── openai_text_embedding.py
│   ├── anthropic/
│   │   ├── anthropic.yaml
│   │   └── llm/
│   │       └── anthropic_llm.py
│   └── ollama/
│       └── llm/
│           └── ollama_llm.py
├── entities/                  # Data entity definitions
│   ├── message_entities.py   # Message formats (PromptMessage etc.)
│   └── model_entities.py     # Model metadata
└── errors/                   # Error type definitions
    ├── invoke_error.py
    └── credentials_validate_error.py

Provider YAML configuration example (simplified openai.yaml):

provider: openai
label:
  en_US: OpenAI
icon_small:
  en_US: icon_s_en.svg
supported_model_types:
  - llm
  - text-embedding
  - speech2text
  - tts
configurate_methods:
  - predefined-model
  - customizable-model
provider_credential_schema:
  credential_form_schemas:
    - variable: openai_api_key
      label:
        en_US: API Key
      type: secret-input
      required: true
      placeholder:
        en_US: Enter your OpenAI API key
    - variable: openai_organization
      label:
        en_US: Organization
      type: text-input
      required: false

Core LLM adapter implementation (key parts of openai_llm.py):

class OpenAILargeLanguageModel(LargeLanguageModel):
    
    def _invoke(
        self,
        model: str,
        credentials: dict,
        prompt_messages: list[PromptMessage],
        model_parameters: dict,
        tools: list[PromptMessageTool] | None = None,
        stop: list[str] | None = None,
        stream: bool = True,
        user: str | None = None,
    ) -> LLMResult | Generator:
        
        # Initialize OpenAI client
        client = OpenAI(
            api_key=credentials["openai_api_key"],
            organization=credentials.get("openai_organization"),
            base_url=credentials.get("openai_api_base", "https://api.openai.com/v1")
        )
        
        # Convert Dify internal message format to OpenAI format
        openai_messages = self._convert_messages(prompt_messages)
        
        # Convert Dify tool definitions to OpenAI function format
        openai_tools = self._convert_tools(tools) if tools else None
        
        # Build request parameters
        params = {
            "model": model,
            "messages": openai_messages,
            "stream": stream,
            "temperature": model_parameters.get("temperature", 0.7),
            "max_tokens": model_parameters.get("max_tokens", 4096),
        }
        
        if openai_tools:
            params["tools"] = openai_tools
            params["tool_choice"] = "auto"
        
        if stop:
            params["stop"] = stop
        
        # Call the API
        if stream:
            return self._handle_stream_response(client.chat.completions.create(**params))
        else:
            response = client.chat.completions.create(**params)
            return self._handle_chat_response(response)
    
    def _handle_stream_response(self, stream) -> Generator:
        """Handle streaming response, convert OpenAI format to Dify internal format"""
        for chunk in stream:
            if not chunk.choices:
                continue
            
            delta = chunk.choices[0].delta
            
            if delta.content:
                yield LLMResultChunk(
                    model=chunk.model,
                    prompt_messages=[],
                    delta=LLMResultChunkDelta(
                        index=0,
                        message=AssistantPromptMessage(content=delta.content),
                        finish_reason=chunk.choices[0].finish_reason
                    )
                )
            
            # Handle tool calls
            if delta.tool_calls:
                for tool_call in delta.tool_calls:
                    yield LLMResultChunk(
                        model=chunk.model,
                        prompt_messages=[],
                        delta=LLMResultChunkDelta(
                            index=tool_call.index,
                            message=AssistantPromptMessage(
                                tool_calls=[ToolCall(
                                    id=tool_call.id,
                                    type="function",
                                    function=ToolCallFunction(
                                        name=tool_call.function.name,
                                        arguments=tool_call.function.arguments
                                    )
                                )]
                            )
                        )
                    )

Precise Token Counting Implementation

Dify records token usage after each call for statistics and billing:

# Token counting implementation (simplified)
class TokenCounter:
    def __init__(self, model: str):
        self.model = model
        # For OpenAI models, use tiktoken library for token calculation
        import tiktoken
        try:
            self.encoder = tiktoken.encoding_for_model(model)
        except KeyError:
            self.encoder = tiktoken.get_encoding("cl100k_base")  # Default encoding
    
    def count_message_tokens(self, messages: list[dict]) -> int:
        """Count total tokens for a message list"""
        num_tokens = 0
        for message in messages:
            num_tokens += 4  # Fixed overhead per message
            for key, value in message.items():
                if isinstance(value, str):
                    num_tokens += len(self.encoder.encode(value))
        num_tokens += 2  # Conversation end marker
        return num_tokens

Key detail: Token counting differs by provider:

OpenAI: Precise calculation using tiktoken (BPE algorithm)
Claude: Uses Anthropic's tokenizer (slightly different from OpenAI)
Local models: Depends on which tokenizer the specific model uses

Level 4: Production Pitfalls and Decision Making (Expert Perspective)

Pitfall 1: Hidden Context Length Constraints

Many people assume GPT-4o's 128K context means they can safely pass very long documents. But there are two hidden constraints:

Constraint 1: Price scales linearly with context length

Cost of 128K token input:
128,000 × $5 / 1,000,000 = $0.64 per call

If processing 1,000 such requests daily:
$640/day = $19,200/month

Constraint 2: Attention decay in very long contexts

Research shows (Lost in the Middle, Liu et al. 2023) that when key information is in the middle of a long context, LLM retrieval accuracy drops significantly. This is one of the core reasons why RAG (splitting documents into small searchable chunks) outperforms putting the entire document in context.

Practical recommendations:

Don't stuff the entire document into context just because the model supports 128K
Knowledge base + RAG is the correct approach for long documents
If you must process very long documents, use a "segment processing + result merging" workflow

Pitfall 2: API Key Security Management

In production, API Key management is critical for security. Common mistakes:

Wrong approaches:

# Dangerous! API Key hardcoded in code
API_KEY = "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Dangerous! API Key in .env file committed to Git
# Forgetting to add .env to .gitignore

Correct approach:

# 1. Use environment variables
import os
API_KEY = os.environ.get("DIFY_API_KEY")
if not API_KEY:
    raise ValueError("DIFY_API_KEY environment variable not set")

# 2. In Dify self-hosted: inject via .env file (not committed to Git)
# .gitignore MUST include .env

# 3. Production: use a Secret management service
# AWS Secrets Manager, HashiCorp Vault, Kubernetes Secrets, etc.

# 4. Rotate API Keys regularly
# Recommended: every 90 days; revoke immediately if compromise is suspected

Monitor API Key usage anomalies:

# OpenAI usage monitoring API
curl -H "Authorization: Bearer $OPENAI_KEY" \
  "https://api.openai.com/dashboard/billing/usage?start_date=2024-01-01&end_date=2024-01-31"

# Configure alerts: send email notification when daily spend exceeds threshold

Pitfall 3: Token Cost Spiral

Problem scenario: A company launched a Dify application and received an OpenAI bill 3x higher than expected in the first month.

Common causes and solutions:

Cause	Solution
System prompt too long	Streamline prompt, remove redundant content
Too many retrieval chunks (top_k too high)	Lower top_k from default 5 to 3
Too many conversation history rounds	Lower history rounds from 10 to 5
Model selection too expensive	Evaluate if gpt-3.5-turbo can replace gpt-4o
Users pasting large amounts of text	Limit input character count

Model Selection Decision Tree

Use this decision tree to quickly determine model configuration for new projects:

Question 1: Can data leave your premises?
├── No (compliance requirement) → Local model (Ollama + Llama/Qwen)
└── Yes → Question 2

Question 2: What is the primary task?
├── Multilingual/Chinese knowledge Q&A → Qwen2.5-72B (best Chinese) or GPT-4o
├── Code generation → Claude 3.5 Sonnet (leads on coding tasks)
├── Long document analysis → Claude 3.5 Sonnet (200K context)
├── Image understanding → GPT-4o or Gemini 1.5 Pro
└── General tasks → GPT-4o or Claude 3.5 Sonnet

Question 3: What is the daily call volume?
├── < 1,000 calls → Any model — cost difference is negligible
├── 1,000-10,000 calls → Consider gpt-4o-mini or Claude 3 Haiku
└── > 10,000 calls → Must optimize cost, consider tiered routing

Question 4: What is the budget?
├── < $100/month → gpt-4o-mini or local models
├── $100-500/month → GPT-4o or Claude 3.5 Sonnet, manage usage carefully
└── Unrestricted → Choose by quality, use the most appropriate model for each task

Chapter Summary

Model integration isn't as simple as "fill in an API Key." Production-grade model management requires considering capability matching, cost control, rate limit handling, and security management across multiple dimensions.

Key Takeaways:

Tier your models: Different complexity tasks warrant different models — a 10x cost difference means a 10x margin difference
bge-m3 is the go-to Embedding model for multilingual content: Better retrieval results than OpenAI's Embedding for multilingual and Chinese-heavy content
Local models have real value: Data compliance isn't the only reason — total cost of local models can be far lower than cloud models at large scale
Rate limits are essential production knowledge: Understand each provider's limits in advance and implement proper queue and backoff logic at the application layer
API Key security: Production environments must use a Secret management service — never hardcode keys

The next chapter dives deep into the core principles of RAG: comparing vector retrieval, full-text retrieval, and hybrid retrieval mechanisms, and how to configure optimal retrieval in Dify.

Rate this chapter

4.5 / 5 (70 ratings)