Model Integration Guide: OpenAI, Claude, Local Models and Cost Comparison
Chapter 4: Complete Model Integration Guide — OpenAI / Claude / Local Models: Configuration and Cost Comparison
Choosing the wrong model can make your application cost 10x more than the optimal solution, but choosing correctly requires understanding each model's capability boundaries, API characteristics, and pricing mechanisms.
Chapter Overview
The AI model market in 2024 is flourishing: GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, Qwen2.5, DeepSeek-V3... New models launch every month, each more capable and cheaper than the last. One of Dify's core advantages is model agnosticism: a workflow you tune on GPT-4 today can be switched to Claude or a local model tomorrow by changing a single configuration.
But this "switching" isn't lossless. Different models have significant differences in context length, tool calling capability, multilingual performance, and pricing structure. Wrong model choices lead to missing functionality or skyrocketing costs.
This chapter systematically covers the full picture of model integration in Dify, helping you make optimal choices in different scenarios.
By the end of this chapter, you will be able to:
- Configure OpenAI, Anthropic, and local models (Ollama) in Dify
- Understand the core differences between models in capability, price, and usage limits
- Build multi-model strategies (primary model + fallback model + specialized models)
- Calculate and control model costs in actual production environments
- Understand Dify's model gateway working principles and rate limiting mechanisms
Level 1: Foundational Understanding (1-3 Years Experience)
What Models Does Dify Support?
Dify categorizes models into four types, each with different purposes:
| Model Type | Purpose | Representative Models |
|---|---|---|
| LLM (Large Language Model) | Conversation, reasoning, text generation | GPT-4o, Claude 3.5, Qwen2.5 |
| Embedding | Knowledge base document vectorization, semantic retrieval | text-embedding-3-small, bge-m3 |
| Rerank | Re-scoring knowledge base retrieval results | bge-reranker-v2-m3, cohere-rerank |
| Speech (Speech to Text) | Voice input to text | Whisper-1 |
Key insight: In Dify, LLM is not the only model you need to configure. If you enable the knowledge base, you also need an Embedding model; for better retrieval results, you also need a Rerank model.
Mainstream Model Capability Comparison (2024)
Here's a cross-model comparison to help you quickly identify the right fit:
| Model | Context Length | Multilingual | Tool Calling | Vision | Price per 1M tokens (in/out) |
|---|---|---|---|---|---|
| GPT-4o | 128K | Excellent | Excellent | Excellent | $5 / $15 |
| GPT-4o-mini | 128K | Good | Good | Good | $0.15 / $0.6 |
| Claude 3.5 Sonnet | 200K | Good | Excellent | Excellent | $3 / $15 |
| Claude 3 Haiku | 200K | Fair | Good | Fair | $0.25 / $1.25 |
| Gemini 1.5 Pro | 1M | Good | Good | Excellent | $3.5 / $10.5 |
| Qwen2.5-72B | 128K | Excellent (Chinese) | Good | Fair | $0.56 / $2.24 |
| DeepSeek-V3 | 64K | Excellent (Chinese) | Good | No | $0.27 / $1.1 |
| Llama 3.1 70B (local) | 128K | Fair | Fair | No | Server cost only |
Note: Prices and capabilities change with version updates — check official sources for latest data
Configuring OpenAI Models in Dify
Prerequisites: An OpenAI API Key (platform.openai.com)
Configuration steps:
- Go to Dify workspace → Click avatar in top right → "Settings"
- Click "Model Provider" → Find OpenAI → Click "Configure"
- Enter your API Key:
API Key: sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Organization ID: org-xxxxxxxx (optional, for enterprise accounts)
API Base: https://api.openai.com/v1 (default; change for Azure OpenAI)
- Click "Save" — the system automatically validates the key
Using Azure OpenAI (common for enterprises, data stays in Microsoft US data centers):
Provider: Azure OpenAI
Azure Endpoint: https://your-resource.openai.azure.com/
API Key: your-azure-key
API Version: 2024-02-01
Note: Azure OpenAI requires specifying a "Deployment Name" in the model configuration, not the model name.
Configuring Anthropic Claude in Dify
Claude frequently outperforms GPT-4 on long-context processing and code generation.
Get API Key: Visit console.anthropic.com to register
Dify configuration:
- Model Provider → Anthropic → Configure
- Enter the API Key (format:
sk-ant-xxxxxxxx) - Available models:
claude-3-5-sonnet-20241022: Best overall, suited for complex tasksclaude-3-5-haiku-20241022: Fast, low cost, suited for high-frequency callsclaude-3-opus-20240229: Formerly the strongest, now surpassed by 3.5 Sonnet
Key differences: Claude 3.5 Sonnet vs GPT-4o:
Advantages (Claude 3.5 Sonnet relative to GPT-4o):
✓ Larger context window (200K vs 128K)
✓ Higher code generation quality (leads on multiple benchmarks)
✓ Lower input price ($3 vs $5 per 1M tokens)
✓ More accurate long document analysis (due to stronger long-context capability)
Disadvantages:
✗ Tool calling sometimes less stable than GPT-4o
✗ Slightly weaker on some specific multilingual tasks (task-dependent)
✗ No image generation (only understanding)
Configuring Local Models in Dify (Ollama)
Local deployment means data never leaves your servers — ideal for scenarios with strict data compliance requirements.
Step 1: Install Ollama (ollama.ai)
# macOS / Linux
curl -fsSL https://ollama.ai/install.sh | sh
# Download a model (Llama 3.1 as example)
ollama pull llama3.1:8b # 8B params, suitable for 8GB RAM machines
ollama pull llama3.1:70b # 70B params, requires 40GB+ RAM
# Verify it's running
ollama serve
curl http://localhost:11434/api/tags # View downloaded models
Step 2: Configure Ollama in Dify
- Model Provider → Ollama → Configure
- Fill in the configuration:
Base URL: http://localhost:11434 (if Ollama and Dify are on the same machine)
or
Base URL: http://your-ollama-server:11434 (if on different machines)
- The model list will automatically show models already available in Ollama
Performance expectations for local models (8B vs 70B):
Llama 3.1 8B (consumer GPU, e.g., RTX 4090):
- Inference speed: ~50 tokens/second
- Code generation: Medium (comparable to GPT-3.5)
- Multilingual support: Fair
- Memory requirement: 8GB VRAM
Llama 3.1 70B (professional GPU, e.g., A100 40GB x 2):
- Inference speed: ~15 tokens/second
- Code generation: Close to GPT-4 level
- Multilingual support: Good
- Memory requirement: 40GB VRAM
Level 2: Mechanism Deep Dive (3-5 Years Experience)
Understanding Model Pricing Mechanisms
Model pricing is based on Tokens, not word count or character count. Understanding tokens is the key to cost control.
What is a Token?
A token is the basic unit by which models process text. For English, roughly 1 Token = 0.75 words (4 characters). For Chinese, roughly 1 Token = 1-1.5 characters.
Examples:
"Hello, world!" ≈ 4 tokens
A typical English word ≈ 1 token on average
A Chinese character typically takes 1 token
(Chinese text is less token-efficient than English for the same information)
Actual GPT-4o cost calculation:
Scenario: 10,000 knowledge base Q&A conversations per month
Token composition per conversation:
- System prompt: 300 tokens
- Knowledge base retrieval results (5 chunks × 100 tokens): 500 tokens
- User question: 50 tokens
- Conversation history (5 rounds): 500 tokens
- AI response: 200 tokens
Input tokens: 300 + 500 + 50 + 500 = 1,350 tokens
Output tokens: 200 tokens
GPT-4o cost:
Input: 1,350 × 10,000 / 1,000,000 × $5 = $67.50
Output: 200 × 10,000 / 1,000,000 × $15 = $30.00
Monthly total: $97.50
Using gpt-3.5-turbo-0125 instead (input $0.5/1M, output $1.5/1M):
Input: 1,350 × 10,000 / 1,000,000 × $0.5 = $6.75
Output: 200 × 10,000 / 1,000,000 × $1.5 = $3.00
Monthly total: $9.75
Cost difference: 10x!
Multi-Model Strategy: Different Models for Different Tasks
Best practice in production is a multi-model strategy: choose models appropriate to task complexity rather than using the most expensive model for everything.
Three-tier model architecture:
Tier 1: Lightweight models (high-frequency, simple tasks)
- Models: gpt-4o-mini or Claude 3 Haiku
- Use cases: Intent classification, simple Q&A, content filtering
- Cost: ~$0.001 per call
Tier 2: Primary models (medium complexity tasks)
- Models: gpt-4o or Claude 3.5 Sonnet
- Use cases: Knowledge Q&A, code generation, document analysis
- Cost: ~$0.01 per call
Tier 3: Expert models (complex reasoning tasks)
- Models: o1 or Claude 3.5 Sonnet (complex prompts)
- Use cases: Complex analysis, multi-step reasoning, high-accuracy decisions
- Cost: ~$0.1 per call
Implementing tiered routing in Dify Workflow:
Workflow node design:
[Start Node] → Receives user question
↓
[LLM Node: Intent Classification] (using gpt-4o-mini)
Prompt: Classify the following question as simple/medium/complex
Output: {"complexity": "simple/medium/complex"}
↓
[IF/ELSE Branch]
├── complexity == "simple" → [LLM Node] (gpt-4o-mini) → [End]
├── complexity == "medium" → [LLM Node] (gpt-4o) → [End]
└── complexity == "complex" → [LLM Node] (claude-3-5-sonnet) → [End]
Configuring Model Fallback Strategy
Configure backup models for the primary model in Dify for automatic degradation:
Dify model configuration supports fallback (when calling via API):
# When calling Dify API, you can specify a fallback model
payload = {
"inputs": {},
"query": user_question,
"response_mode": "streaming",
"model_config": {
"provider": "openai",
"name": "gpt-4o",
"fallback": {
"provider": "anthropic",
"name": "claude-3-5-sonnet-20241022"
}
}
}
Multi-provider backup strategy in practice:
# Build a fault-tolerant call wrapper
class ResilientDifyClient:
def __init__(self):
self.endpoints = [
{"url": "https://api.dify.ai/v1", "key": PRIMARY_KEY},
{"url": "https://your-self-hosted-dify.com/v1", "key": BACKUP_KEY},
]
def chat(self, message: str, conversation_id: str = None):
last_error = None
for endpoint in self.endpoints:
try:
response = self._call(endpoint, message, conversation_id)
return response
except Exception as e:
last_error = e
print(f"Endpoint {endpoint['url']} failed: {e}, trying next...")
continue
raise last_error
Rate Limit Handling: Essential Production Knowledge
Every model provider has call frequency limits (Rate Limits) — understanding them is critical for production stability.
OpenAI Rate Limits (2024, varies by account tier):
Tier 1 (after $5 spend):
- GPT-4o: 500 RPM (requests/minute), 30,000 TPM (tokens/minute)
- gpt-4o-mini: 500 RPM, 200,000 TPM
Tier 3 (after $100 spend):
- GPT-4o: 5,000 RPM, 300,000 TPM
- gpt-4o-mini: 5,000 RPM, 4,000,000 TPM
Tier 5 (after $10,000 spend):
- Contact OpenAI business team for custom limits
Dify's rate limit handling:
Dify's model gateway has built-in exponential backoff retry:
# Retry logic in api/core/model_runtime/ (simplified)
@retry(
reraise=True,
stop=stop_after_attempt(3),
wait=wait_exponential(multiplier=1, min=4, max=10),
retry=retry_if_exception_type(RateLimitError)
)
def invoke_with_retry(provider, messages, params):
return provider.invoke(messages, params)
This means:
- After 1st failure, wait 4 seconds then retry
- After 2nd failure, wait 8 seconds then retry
- After 3rd failure, raise the exception
Request queue design for production:
If your application has high-concurrency needs, add a queue layer in front of Dify:
# Token bucket rate limiter using Redis
import redis
import time
class TokenBucketRateLimiter:
def __init__(self, redis_client, key: str, rate: float, capacity: int):
"""
rate: tokens replenished per second (e.g., 8.0 for 500 RPM / 60 seconds)
capacity: token bucket capacity (prevents burst traffic)
"""
self.redis = redis_client
self.key = key
self.rate = rate
self.capacity = capacity
def acquire(self, tokens: int = 1) -> bool:
"""Try to acquire `tokens` tokens, returns whether successful"""
now = time.time()
# Use Lua script to ensure atomicity
lua_script = """
local key = KEYS[1]
local rate = tonumber(ARGV[1])
local capacity = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])
local tokens = redis.call('hget', key, 'tokens')
local last_time = redis.call('hget', key, 'last_time')
if tokens == false then
tokens = capacity
last_time = now
else
tokens = tonumber(tokens)
last_time = tonumber(last_time)
local elapsed = now - last_time
tokens = math.min(capacity, tokens + elapsed * rate)
end
if tokens >= requested then
tokens = tokens - requested
redis.call('hmset', key, 'tokens', tokens, 'last_time', now)
return 1
else
return 0
end
"""
result = self.redis.eval(lua_script, 1, self.key,
self.rate, self.capacity, now, tokens)
return bool(result)
Choosing an Embedding Model
Knowledge base retrieval quality depends heavily on the Embedding model quality.
Mainstream Embedding model comparison:
| Model | Dimensions | Chinese Support | Multilingual | Price (1M tokens) | Notes |
|---|---|---|---|---|---|
| text-embedding-3-small | 1536 | Good | Yes | $0.02 | OpenAI balanced choice |
| text-embedding-3-large | 3072 | Very good | Yes | $0.13 | OpenAI highest quality |
| text-embedding-ada-002 | 1536 | Good | Yes | $0.10 | Legacy, not recommended |
| bge-m3 (local) | 1024 | Excellent | Yes | Compute cost only | Best open-source choice |
| bge-large-en (local) | 1024 | No | No | Compute cost only | Pure English use cases |
Real test data (on a multilingual Q&A dataset):
Recall@5 metric (proportion of top-5 retrieved results containing the correct answer):
bge-m3: 92.3%
text-embedding-3-large: 89.7%
text-embedding-3-small: 85.1%
text-embedding-ada-002: 78.4%
Conclusions:
- Multilingual/Chinese-heavy: bge-m3 > text-embedding-3-large > text-embedding-3-small
- Pure English: text-embedding-3-large ≈ bge-m3
- Cost-first: text-embedding-3-small (best price/performance ratio among commercial models)
Level 3: Source Code and Principles (5+ Years Experience)
Complete Implementation of Dify's Model Gateway
Dify's model gateway (api/core/model_runtime/) uses a plugin-based architecture where each model provider is a separate plugin package.
Directory structure:
api/core/model_runtime/
├── model_providers/ # Individual provider implementations
│ ├── openai/
│ │ ├── _assets/ # Provider icons and assets
│ │ ├── openai.py # Provider main class
│ │ ├── openai.yaml # Provider config (available models, credential definitions)
│ │ ├── llm/
│ │ │ ├── openai_llm.py # LLM adapter implementation
│ │ │ └── gpt-4o.yaml # Specific model parameter definitions
│ │ └── text_embedding/
│ │ └── openai_text_embedding.py
│ ├── anthropic/
│ │ ├── anthropic.yaml
│ │ └── llm/
│ │ └── anthropic_llm.py
│ └── ollama/
│ └── llm/
│ └── ollama_llm.py
├── entities/ # Data entity definitions
│ ├── message_entities.py # Message formats (PromptMessage etc.)
│ └── model_entities.py # Model metadata
└── errors/ # Error type definitions
├── invoke_error.py
└── credentials_validate_error.py
Provider YAML configuration example (simplified openai.yaml):
provider: openai
label:
en_US: OpenAI
icon_small:
en_US: icon_s_en.svg
supported_model_types:
- llm
- text-embedding
- speech2text
- tts
configurate_methods:
- predefined-model
- customizable-model
provider_credential_schema:
credential_form_schemas:
- variable: openai_api_key
label:
en_US: API Key
type: secret-input
required: true
placeholder:
en_US: Enter your OpenAI API key
- variable: openai_organization
label:
en_US: Organization
type: text-input
required: false
Core LLM adapter implementation (key parts of openai_llm.py):
class OpenAILargeLanguageModel(LargeLanguageModel):
def _invoke(
self,
model: str,
credentials: dict,
prompt_messages: list[PromptMessage],
model_parameters: dict,
tools: list[PromptMessageTool] | None = None,
stop: list[str] | None = None,
stream: bool = True,
user: str | None = None,
) -> LLMResult | Generator:
# Initialize OpenAI client
client = OpenAI(
api_key=credentials["openai_api_key"],
organization=credentials.get("openai_organization"),
base_url=credentials.get("openai_api_base", "https://api.openai.com/v1")
)
# Convert Dify internal message format to OpenAI format
openai_messages = self._convert_messages(prompt_messages)
# Convert Dify tool definitions to OpenAI function format
openai_tools = self._convert_tools(tools) if tools else None
# Build request parameters
params = {
"model": model,
"messages": openai_messages,
"stream": stream,
"temperature": model_parameters.get("temperature", 0.7),
"max_tokens": model_parameters.get("max_tokens", 4096),
}
if openai_tools:
params["tools"] = openai_tools
params["tool_choice"] = "auto"
if stop:
params["stop"] = stop
# Call the API
if stream:
return self._handle_stream_response(client.chat.completions.create(**params))
else:
response = client.chat.completions.create(**params)
return self._handle_chat_response(response)
def _handle_stream_response(self, stream) -> Generator:
"""Handle streaming response, convert OpenAI format to Dify internal format"""
for chunk in stream:
if not chunk.choices:
continue
delta = chunk.choices[0].delta
if delta.content:
yield LLMResultChunk(
model=chunk.model,
prompt_messages=[],
delta=LLMResultChunkDelta(
index=0,
message=AssistantPromptMessage(content=delta.content),
finish_reason=chunk.choices[0].finish_reason
)
)
# Handle tool calls
if delta.tool_calls:
for tool_call in delta.tool_calls:
yield LLMResultChunk(
model=chunk.model,
prompt_messages=[],
delta=LLMResultChunkDelta(
index=tool_call.index,
message=AssistantPromptMessage(
tool_calls=[ToolCall(
id=tool_call.id,
type="function",
function=ToolCallFunction(
name=tool_call.function.name,
arguments=tool_call.function.arguments
)
)]
)
)
)
Precise Token Counting Implementation
Dify records token usage after each call for statistics and billing:
# Token counting implementation (simplified)
class TokenCounter:
def __init__(self, model: str):
self.model = model
# For OpenAI models, use tiktoken library for token calculation
import tiktoken
try:
self.encoder = tiktoken.encoding_for_model(model)
except KeyError:
self.encoder = tiktoken.get_encoding("cl100k_base") # Default encoding
def count_message_tokens(self, messages: list[dict]) -> int:
"""Count total tokens for a message list"""
num_tokens = 0
for message in messages:
num_tokens += 4 # Fixed overhead per message
for key, value in message.items():
if isinstance(value, str):
num_tokens += len(self.encoder.encode(value))
num_tokens += 2 # Conversation end marker
return num_tokens
Key detail: Token counting differs by provider:
- OpenAI: Precise calculation using
tiktoken(BPE algorithm) - Claude: Uses Anthropic's tokenizer (slightly different from OpenAI)
- Local models: Depends on which tokenizer the specific model uses
Level 4: Production Pitfalls and Decision Making (Expert Perspective)
Pitfall 1: Hidden Context Length Constraints
Many people assume GPT-4o's 128K context means they can safely pass very long documents. But there are two hidden constraints:
Constraint 1: Price scales linearly with context length
Cost of 128K token input:
128,000 × $5 / 1,000,000 = $0.64 per call
If processing 1,000 such requests daily:
$640/day = $19,200/month
Constraint 2: Attention decay in very long contexts
Research shows (Lost in the Middle, Liu et al. 2023) that when key information is in the middle of a long context, LLM retrieval accuracy drops significantly. This is one of the core reasons why RAG (splitting documents into small searchable chunks) outperforms putting the entire document in context.
Practical recommendations:
- Don't stuff the entire document into context just because the model supports 128K
- Knowledge base + RAG is the correct approach for long documents
- If you must process very long documents, use a "segment processing + result merging" workflow
Pitfall 2: API Key Security Management
In production, API Key management is critical for security. Common mistakes:
Wrong approaches:
# Dangerous! API Key hardcoded in code
API_KEY = "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
# Dangerous! API Key in .env file committed to Git
# Forgetting to add .env to .gitignore
Correct approach:
# 1. Use environment variables
import os
API_KEY = os.environ.get("DIFY_API_KEY")
if not API_KEY:
raise ValueError("DIFY_API_KEY environment variable not set")
# 2. In Dify self-hosted: inject via .env file (not committed to Git)
# .gitignore MUST include .env
# 3. Production: use a Secret management service
# AWS Secrets Manager, HashiCorp Vault, Kubernetes Secrets, etc.
# 4. Rotate API Keys regularly
# Recommended: every 90 days; revoke immediately if compromise is suspected
Monitor API Key usage anomalies:
# OpenAI usage monitoring API
curl -H "Authorization: Bearer $OPENAI_KEY" \
"https://api.openai.com/dashboard/billing/usage?start_date=2024-01-01&end_date=2024-01-31"
# Configure alerts: send email notification when daily spend exceeds threshold
Pitfall 3: Token Cost Spiral
Problem scenario: A company launched a Dify application and received an OpenAI bill 3x higher than expected in the first month.
Common causes and solutions:
| Cause | Solution |
|---|---|
| System prompt too long | Streamline prompt, remove redundant content |
| Too many retrieval chunks (top_k too high) | Lower top_k from default 5 to 3 |
| Too many conversation history rounds | Lower history rounds from 10 to 5 |
| Model selection too expensive | Evaluate if gpt-3.5-turbo can replace gpt-4o |
| Users pasting large amounts of text | Limit input character count |
Model Selection Decision Tree
Use this decision tree to quickly determine model configuration for new projects:
Question 1: Can data leave your premises?
├── No (compliance requirement) → Local model (Ollama + Llama/Qwen)
└── Yes → Question 2
Question 2: What is the primary task?
├── Multilingual/Chinese knowledge Q&A → Qwen2.5-72B (best Chinese) or GPT-4o
├── Code generation → Claude 3.5 Sonnet (leads on coding tasks)
├── Long document analysis → Claude 3.5 Sonnet (200K context)
├── Image understanding → GPT-4o or Gemini 1.5 Pro
└── General tasks → GPT-4o or Claude 3.5 Sonnet
Question 3: What is the daily call volume?
├── < 1,000 calls → Any model — cost difference is negligible
├── 1,000-10,000 calls → Consider gpt-4o-mini or Claude 3 Haiku
└── > 10,000 calls → Must optimize cost, consider tiered routing
Question 4: What is the budget?
├── < $100/month → gpt-4o-mini or local models
├── $100-500/month → GPT-4o or Claude 3.5 Sonnet, manage usage carefully
└── Unrestricted → Choose by quality, use the most appropriate model for each task
Chapter Summary
Model integration isn't as simple as "fill in an API Key." Production-grade model management requires considering capability matching, cost control, rate limit handling, and security management across multiple dimensions.
Key Takeaways:
- Tier your models: Different complexity tasks warrant different models — a 10x cost difference means a 10x margin difference
- bge-m3 is the go-to Embedding model for multilingual content: Better retrieval results than OpenAI's Embedding for multilingual and Chinese-heavy content
- Local models have real value: Data compliance isn't the only reason — total cost of local models can be far lower than cloud models at large scale
- Rate limits are essential production knowledge: Understand each provider's limits in advance and implement proper queue and backoff logic at the application layer
- API Key security: Production environments must use a Secret management service — never hardcode keys
The next chapter dives deep into the core principles of RAG: comparing vector retrieval, full-text retrieval, and hybrid retrieval mechanisms, and how to configure optimal retrieval in Dify.