Chapter 20: Observability: Logging, Tracing, Cost Control and Alerting
You cannot manage what you cannot see. This chapter shows you how to build a complete observability stack for Dify so that every LLM call and every token consumed is accounted for.
Chapter Overview
A production Dify deployment is like a racing car at full speed. Without a dashboard, you won't know when a tire is about to blow. Observability rests on three pillars:
- Logs: What happened? Who did what, and when?
- Metrics: How healthy is the system? QPS, latency, error rate?
- Traces: Which services did a request touch? What was slow?
For an LLM application platform like Dify, there is a critical fourth dimension:
- Cost: Which models were called? How many tokens were consumed? What did it cost today?
This chapter walks you through building a complete observability stack from scratch: Prometheus + Grafana for metrics, Loki for log aggregation, OpenTelemetry for distributed tracing, and token-based cost control with budget alerts.
By the end, you will be able to:
- Deploy Prometheus + Grafana to monitor all core Dify metrics
- Configure Loki to centralize and search Dify logs
- Use OpenTelemetry to trace every LLM call end-to-end
- Set up cost budget alerts to prevent runaway spending
- Build production-grade alert rules that notify within 5 minutes of an anomaly
Level 1: Core Concepts (1-3 Years Experience)
Why LLM Apps Need Special Observability
Traditional web app observability focuses on latency and error rate. LLM applications add new challenges:
- Highly variable response times: The same question might return in 500ms or 30s depending on token count and model load
- Cost tightly coupled to usage: A poorly written prompt can inflate token consumption 10x
- Quality hard to measure with traditional metrics: HTTP 200 doesn't mean the answer was correct
- Complex call chains: One user request may trigger RAG retrieval → multiple LLM calls → tool execution → more LLM calls
Analogy: Traditional observability is like watching a car's fuel gauge and speedometer. LLM observability also asks "how much did this journey cost in fuel?" and "was the passenger satisfied with the trip?"
Dify Built-in Monitoring
The Dify console provides basic built-in monitoring without additional configuration:
App monitoring (Console → App → Monitor):
- Total conversations, active user count
- Average response time trends
- Token consumption (daily/weekly/monthly)
- Model cost estimates
- User feedback (thumbs up/down)
Log viewer (Console → Logs):
- Complete record of each conversation (input/output/prompt used)
- Token consumption details
- Execution timing
Limitations: Built-in monitoring covers only individual apps, cannot aggregate across apps, cannot set alerts, and retains data for only 30 days.
Quick Prometheus Integration
Dify API exposes a /metrics endpoint in Prometheus format:
# prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'dify-api'
    static_configs:
      - targets: ['dify-api:5001']
    metrics_path: '/metrics'
    scrape_interval: 30s
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
  - job_name: 'weaviate'
    static_configs:
      - targets: ['weaviate:2112']
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
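To make the scrape target concrete: a Prometheus-format /metrics response is plain text with one sample per line. The sketch below parses such a payload using only the standard library. The sample payload and its values are invented for illustration, and the parser is deliberately simplified (no escaped quotes inside label values, no timestamps), so treat it as a way to inspect an endpoint by hand rather than a replacement for a real client library.

```python
import re

# Matches: metric_name{label="value",...} 123.4   (the label block is optional)
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text: str) -> list[tuple[str, dict[str, str], float]]:
    """Parse a simplified Prometheus text payload into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical payload; the metric names follow the ones used in this chapter.
EXAMPLE = """\
# HELP dify_llm_tokens_total Tokens consumed by LLM calls
# TYPE dify_llm_tokens_total counter
dify_llm_tokens_total{model="gpt-4",app_id="demo"} 1200
dify_llm_tokens_total{model="gpt-3.5-turbo",app_id="demo"} 5400
http_requests_total 42
"""

if __name__ == "__main__":
    for name, labels, value in parse_metrics(EXAMPLE):
        print(name, labels, value)
```

Feeding the function the body of a real /metrics response is a quick way to confirm which metric names and labels your deployment actually exposes before writing queries against them.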
# Add monitoring stack to docker-compose
services:
  prometheus:
    image: prom/prometheus:v2.48.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana:10.2.0
    environment:
      # Placeholder; replace it and load the real value from a secret in production
      GF_SECURITY_ADMIN_PASSWORD: StrongGrafanaPassword
      GF_USERS_ALLOW_SIGN_UP: 'false'
    volumes:
      - grafana_data:/var/lib/grafana
    ports:
      - "3001:3000"
  postgres-exporter:
    image: prometheuscommunity/postgres-exporter:v0.15.0
    environment:
      DATA_SOURCE_NAME: postgresql://dify:password@db:5432/dify?sslmode=disable
  redis-exporter:
    image: oliver006/redis_exporter:v1.55.0
    environment:
      REDIS_ADDR: redis://redis:6379
      REDIS_PASSWORD: RedisPassword
  node-exporter:
    image: prom/node-exporter:v1.7.0
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
    command:
      # Point the exporter at the mounted host paths instead of the container's own
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
# Named volumes must be declared at the top level
volumes:
  prometheus_data:
  grafana_data:
Level 2: Mechanism Deep Dive (3-5 Years Experience)
Key Prometheus Queries
API request rate:
rate(http_requests_total{job="dify-api"}[5m])
LLM latency distribution:
# P50
histogram_quantile(0.50, rate(dify_llm_request_duration_seconds_bucket[5m]))
# P95
histogram_quantile(0.95, rate(dify_llm_request_duration_seconds_bucket[5m]))
# P99
histogram_quantile(0.99, rate(dify_llm_request_duration_seconds_bucket[5m]))
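The histogram_quantile() queries above estimate percentiles from cumulative bucket counters rather than from raw observations. The following is a simplified Python rendition of the linear-interpolation idea, not Prometheus's actual implementation: it assumes non-negative bucket bounds and classic (non-native) histograms, and the bucket values in the usage note are made up.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by upper bound and
    ending with the +Inf bucket, mirroring Prometheus `_bucket` series.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, nothing to estimate
    rank = q * total
    # Find the first bucket whose cumulative count reaches the target rank
    for i, (upper, cumulative) in enumerate(buckets):
        if cumulative >= rank:
            break
    if math.isinf(upper):
        # The quantile lies beyond the last finite bound: report that bound
        return buckets[-2][0] if len(buckets) > 1 else math.nan
    lower = buckets[i - 1][0] if i > 0 else 0.0
    below = buckets[i - 1][1] if i > 0 else 0.0
    in_bucket = cumulative - below
    # Assume observations are spread evenly inside the bucket
    return lower + (upper - lower) * (rank - below) / in_bucket


if __name__ == "__main__":
    latency_buckets = [(0.5, 10), (1.0, 30), (5.0, 90), (10.0, 100), (math.inf, 100)]
    print(histogram_quantile(0.95, latency_buckets))
```

Because the estimate is interpolated inside one bucket, its accuracy depends on bucket boundaries. If the P95 of LLM calls regularly lands in a wide bucket such as 5s to 10s, the reported value can be off by seconds, which matters when the alert threshold sits inside that same bucket.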
Token consumption by model:
rate(dify_llm_tokens_total{model=~"gpt-4.*"}[5m])
rate(dify_llm_tokens_total{model=~"gpt-3.5.*"}[5m])
Worker queue backlog:
celery_tasks_received_total - celery_tasks_succeeded_total - celery_tasks_failed_total
Log Aggregation with Loki + Promtail
services:
  loki:
    image: grafana/loki:2.9.0
    volumes:
      - ./loki-config.yaml:/etc/loki/local-config.yaml
      - loki_data:/loki
    command: -config.file=/etc/loki/local-config.yaml
    ports:
      - "3100:3100"
  promtail:
    image: grafana/promtail:2.9.0
    volumes:
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      # Needed for Docker service discovery (docker_sd_configs) in Promtail
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./promtail-config.yaml:/etc/promtail/config.yml
    command: -config.file=/etc/promtail/config.yml
# Named volumes must be declared at the top level
volumes:
  loki_data:
# promtail-config.yaml
clients:
  # Where Promtail pushes logs; assumes Loki is reachable under the service name "loki"
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: dify-containers
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      # Docker reports container names with a leading slash
      - source_labels: [__meta_docker_container_name]
        regex: /dify.*
        action: keep
      - source_labels: [__meta_docker_container_label_com_docker_compose_service]
        target_label: service
    pipeline_stages:
      - json:
          expressions:
            level: level
            message: message
            request_id: request_id
      - labels:
          level:
          service:
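The json stage above expects each log line to be a single JSON object with level, message and request_id fields. As a minimal sketch of what the emitting side could look like, the following uses only Python's standard logging module; the logger name and the "-" placeholder for a missing request ID are assumptions for illustration, not Dify's own logging setup.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # request_id is attached via `extra=`; fall back to "-" when absent
            "request_id": getattr(record, "request_id", "-"),
        }
        return json.dumps(payload)


def build_logger(name: str = "dify-demo") -> logging.Logger:
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.error("LLM call timed out", extra={"request_id": "req-abc-12345"})
```

Only level is promoted to a Loki label in the pipeline. Keeping request_id as an extracted field rather than a label is the safer choice, since a label with one value per request would create a very large number of streams.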
Useful LogQL queries:
# All errors from dify-api
{service="api"} |= "ERROR"
# Full call chain for a specific request
{service=~"api|worker"} |= "req-abc-12345"
# Per-second rate of error lines, computed over 1-minute windows (set the query time range to the last hour)
rate({service="api"} |= "ERROR" [1m])
# LLM timeouts
{service="api"} |~ "timeout|LLM.*error"
OpenTelemetry Distributed Tracing
# Enable in .env
ENABLE_OTEL=true
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
OTEL_SERVICE_NAME=dify-api
OTEL_TRACES_SAMPLER=parentbased_traceidratio
# 10% sampling to reduce overhead (comment kept on its own line because some .env loaders read inline comments as part of the value)
OTEL_TRACES_SAMPLER_ARG=0.1
services:
  jaeger:
    image: jaegertracing/all-in-one:1.52
    environment:
      COLLECTOR_OTLP_ENABLED: 'true'
    ports:
      - "16686:16686"  # Jaeger UI
      - "4317:4317"    # OTLP gRPC
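The parentbased_traceidratio sampler configured above combines two rules: a span with a parent inherits the parent's decision, and a root span is sampled based on its trace ID. The sketch below is a simplified model of that behaviour for building intuition. It is not the OpenTelemetry SDK code, and the function names are illustrative.

```python
import random
from typing import Optional

TRACE_ID_MASK = (1 << 64) - 1  # the decision uses the low 64 bits of the 128-bit trace ID


def ratio_sampled(trace_id: int, ratio: float) -> bool:
    """Deterministic per-trace decision: the same trace ID always gets the same answer."""
    bound = round(ratio * (1 << 64))
    return (trace_id & TRACE_ID_MASK) < bound


def parent_based_sampled(trace_id: int, ratio: float,
                         parent_sampled: Optional[bool]) -> bool:
    """Follow the parent's decision when there is one; otherwise sample by ratio."""
    if parent_sampled is not None:
        return parent_sampled
    return ratio_sampled(trace_id, ratio)


if __name__ == "__main__":
    rng = random.Random(0)
    trace_ids = [rng.getrandbits(128) for _ in range(10_000)]
    kept = sum(parent_based_sampled(t, 0.1, None) for t in trace_ids)
    print(f"sampled {kept} of {len(trace_ids)} root traces")
```

Two consequences follow from this design. Every service that sees the same trace ID reaches the same decision, so traces are kept or dropped whole rather than in fragments. And at 10% sampling, roughly nine out of ten slow or failed requests leave no trace, so logs with request IDs remain the fallback for diagnosing them.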
Prometheus Alert Rules
# alerts.yaml
groups:
  - name: dify_alerts
    rules:
      # Token consumption > 100K per hour
      - alert: HighTokenConsumption
        expr: increase(dify_llm_tokens_total[1h]) > 100000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High token consumption"
          description: "Token consumption {{ $value }} exceeded 100K in last hour"
      # Estimated daily cost > $100
      - alert: DailyBudgetExceeded
        expr: sum(increase(dify_llm_cost_usd_total[24h])) > 100
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Daily budget exceeded"
          description: "Estimated daily cost ${{ $value }} exceeded budget of $100"
      # Error rate > 5%
      # sum() on both sides drops the status label; without it, each 5xx series
      # is divided by itself and the ratio is always 1
      - alert: HighErrorRate
        expr: >
          sum(rate(http_requests_total{job="dify-api",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="dify-api"}[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High API error rate"
          description: "Error rate {{ $value | humanizePercentage }} exceeded 5%"
      # P95 latency > 10 seconds
      - alert: HighLatency
        expr: >
          histogram_quantile(0.95,
            rate(dify_llm_request_duration_seconds_bucket[5m])
          ) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High LLM latency"
          description: "P95 latency {{ $value }}s exceeded 10s threshold"
      # Redis memory > 85%
      # redis_memory_max_bytes is 0 when maxmemory is unset, so guard against dividing by zero
      - alert: RedisMemoryHigh
        expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.85 and redis_memory_max_bytes > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory usage high"
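Each rule above carries a for: duration, which means the condition must hold continuously across evaluations before the alert fires. The small state machine below sketches that pending-to-firing behaviour under simplifying assumptions (a single series and explicit timestamps); the class and field names are illustrative and not part of Prometheus.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    threshold: float
    hold_seconds: float                   # the `for:` duration
    active_since: Optional[float] = None  # when the condition first became true

    def evaluate(self, value: float, now: float) -> str:
        """Return 'inactive', 'pending' or 'firing' for one evaluation cycle."""
        if value <= self.threshold:
            self.active_since = None      # any dip below the threshold resets the clock
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.hold_seconds:
            return "firing"
        return "pending"


if __name__ == "__main__":
    # Mirrors HighErrorRate: threshold 5%, for: 2m
    rule = AlertRule(threshold=0.05, hold_seconds=120)
    for t, error_rate in [(0, 0.08), (60, 0.09), (120, 0.07), (180, 0.01), (240, 0.08)]:
        print(t, rule.evaluate(error_rate, now=t))
```

The reset on any dip is the trade-off to keep in mind when choosing for: values. A long duration suppresses noise, but a flapping condition that briefly recovers during each window may never fire at all.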
Level 3: Source Code and Architecture (5+ Years)
Dify's Internal Observability Implementation
Dify's Flask API uses prometheus_flask_exporter for metrics exposure:
from prometheus_flask_exporter import PrometheusMetrics
from opentelemetry import trace

tracer = trace.get_tracer('dify.model_runtime')

class BaseModelProvider:
    def invoke(self, model_parameters: dict, **kwargs):
        # One span per LLM call, tagged with model and provider
        with tracer.start_as_current_span(f"llm.invoke.{self.model}") as span:
            span.set_attribute("llm.model", self.model)
            span.set_attribute("llm.provider", self.provider)
            try:
                response = self._invoke_model(model_parameters, **kwargs)
                # Record token usage when the provider reports it
                if hasattr(response, 'usage'):
                    span.set_attribute("llm.input_tokens", response.usage.prompt_tokens)
                    span.set_attribute("llm.output_tokens", response.usage.completion_tokens)
                return response
            except Exception as e:
                span.record_exception(e)
                raise
Structured Logging Setup
import structlog
from opentelemetry import trace

def configure_structlog():
    structlog.configure(
        processors=[
            structlog.stdlib.add_log_level,
            structlog.processors.TimeStamper(fmt="iso"),
            add_open_telemetry_spans,  # Correlate with trace IDs
            structlog.processors.JSONRenderer()
        ],
        logger_factory=structlog.stdlib.LoggerFactory(),
    )

def add_open_telemetry_spans(logger, method, event_dict):
    # Attach the active trace and span IDs so logs can be joined with traces
    span = trace.get_current_span()
    if span.is_recording():
        ctx = span.get_span_context()
        event_dict['trace_id'] = format(ctx.trace_id, '032x')
        event_dict['span_id'] = format(ctx.span_id, '016x')
    return event_dict
Level 4: Production Traps and Decisions (Expert Perspective)
Real Incident Post-Mortems
Incident 1: OpenAI cost explosion at a 500-person manufacturing company
One morning the daily OpenAI bill jumped to $850, 10x normal.
Investigation:
- Grafana token consumption panel showed anomaly starting at 2:00 AM
- Loki logs revealed massive volume from app_id=prod-sales-assistant
- Deeper log analysis showed prompt size had grown from 1,000 to 12,000 tokens
- Audit log showed an employee had accidentally copied an entire example document into the System Prompt field
Root cause: No alert on prompt token size, no confirmation step for critical console changes.
Fix:
- alert: LargePromptDetected
  expr: avg(dify_llm_prompt_tokens) by (app_id) > 5000
  for: 5m
  annotations:
    summary: "App {{ $labels.app_id }} has abnormally large prompt tokens"
Incident 2: Celery tasks silently dropped
Document uploads showed no knowledge base updates for hours. Some tasks disappeared entirely.
Investigation: Redis memory was full. The allkeys-lru eviction policy was evicting Celery queue messages from Redis.
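Fix: Move the Celery broker to a dedicated Redis instance configured with maxmemory-policy noeviction. The eviction policy applies to a whole Redis instance, so a separate logical database (db 1) on the shared instance would still be subject to allkeys-lru. In the URL below, redis-broker is an example hostname for the dedicated instance:
CELERY_BROKER_URL: redis://:password@redis-broker:6379/1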
Cost Optimization Strategies
Strategy 1: Model routing by complexity
def smart_model_selector(query: str, complexity: float) -> str:
    # is_simple_faq is an application-specific helper defined elsewhere
    if complexity < 0.3 or is_simple_faq(query):
        return "gpt-3.5-turbo"  # $0.002/1K tokens
    elif complexity < 0.7:
        return "gpt-4-turbo"  # $0.01/1K tokens
    else:
        return "gpt-4"  # $0.03/1K tokens
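To see what routing is worth in money terms, the per-1K-token prices from the comments above can be turned into a small estimator. This is a back-of-the-envelope sketch: the prices are the illustrative figures from this chapter, real provider pricing changes over time and usually differs for input and output tokens, and the function names are hypothetical.

```python
# Illustrative prices in USD per 1K tokens, copied from the routing comments above.
PRICE_PER_1K_USD = {
    "gpt-3.5-turbo": 0.002,
    "gpt-4-turbo": 0.01,
    "gpt-4": 0.03,
}


def estimate_cost_usd(model: str, tokens: int) -> float:
    """Cost of one request, treating all tokens at a single flat rate."""
    return tokens / 1000 * PRICE_PER_1K_USD[model]


def daily_cost_usd(model: str, tokens_per_request: int, requests_per_day: int) -> float:
    """Projected daily spend for a steady request volume."""
    return estimate_cost_usd(model, tokens_per_request) * requests_per_day


if __name__ == "__main__":
    for model in PRICE_PER_1K_USD:
        print(model, round(daily_cost_usd(model, 12_000, 1_000), 2))
```

At these illustrative rates, a 12,000-token request costs about $0.36 on gpt-4 and about $0.024 on gpt-3.5-turbo, a 15x difference, which is why an oversized prompt on the most expensive model shows up so quickly on the bill.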
Strategy 2: Response caching for repetitive queries
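import hashlib

def cached_llm_call(prompt: str, model: str, ttl: int = 3600) -> str:
    # redis_client and llm are assumed to be configured elsewhere in the application
    # The key covers both model and prompt, so different models never share an entry
    cache_key = f"llm:{hashlib.md5(f'{model}:{prompt}'.encode()).hexdigest()}"
    cached = redis_client.get(cache_key)
    if cached is not None:
        return cached.decode()
    response = llm.invoke(prompt, model=model)
    redis_client.setex(cache_key, ttl, response.content)
    return response.content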
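The caching pattern above can be tried out without a Redis server. The sketch below swaps in a small in-memory store with the same get/setex shape and an injectable clock, so expiry can be tested deterministically. The whitespace and case normalisation in cache_key is an added design choice, not something the original function does, and all names here are hypothetical.

```python
import hashlib
import time
from typing import Callable, Optional


class TTLCache:
    """In-memory stand-in for Redis GET/SETEX, for local experiments only."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._store: dict[str, tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: behave like a miss
            return None
        return value

    def setex(self, key: str, ttl: float, value: str) -> None:
        self._store[key] = (self._clock() + ttl, value)


def cache_key(model: str, prompt: str) -> str:
    # Normalise whitespace and case so trivially different prompts share an entry
    normalised = " ".join(prompt.lower().split())
    return "llm:" + hashlib.sha256(f"{model}:{normalised}".encode()).hexdigest()


def cached_call(cache: TTLCache, call_model: Callable[[str, str], str],
                prompt: str, model: str, ttl: float = 3600) -> str:
    key = cache_key(model, prompt)
    hit = cache.get(key)
    if hit is not None:
        return hit
    answer = call_model(prompt, model)  # only reached on a cache miss
    cache.setex(key, ttl, answer)
    return answer
```

Normalising the prompt raises the hit rate for FAQ-style traffic but is unsafe where case or spacing carries meaning, such as code or exact-match extraction. Caching in general only suits prompts whose answers do not depend on conversation history or on fresh knowledge base content.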
Strategy 3: Per-user daily token budgets
from datetime import datetime

class TokenBudgetMiddleware:
    # redis_client is assumed to be configured elsewhere in the application
    def check_budget(self, user_id: str, daily_limit: int = 50000) -> bool:
        today = datetime.now().strftime('%Y-%m-%d')
        key = f"token_budget:{user_id}:{today}"
        used = int(redis_client.get(key) or 0)
        return used < daily_limit

    def consume(self, user_id: str, tokens: int):
        today = datetime.now().strftime('%Y-%m-%d')
        key = f"token_budget:{user_id}:{today}"
        redis_client.incrby(key, tokens)
        # Keys are per day, so let each one expire after 24 hours
        redis_client.expire(key, 86400)
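One caveat with a separate check_budget and consume: several concurrent requests can all pass the check before any of them records its usage, so a user can overshoot the limit. The sketch below shows the alternative of reserving tokens in a single atomic step. It uses an in-memory dictionary and a lock purely for illustration; the class name and the injectable today parameter are assumptions, and a Redis-backed version would need its own atomic primitive.

```python
import threading
from datetime import date
from typing import Callable


class InMemoryTokenBudget:
    """Per-user daily token budget with an atomic check-and-reserve step."""

    def __init__(self, daily_limit: int = 50000,
                 today: Callable[[], date] = date.today):
        self.daily_limit = daily_limit
        self._today = today
        # Old days are never purged here; a real store would expire them
        self._used: dict[tuple[str, date], int] = {}
        self._lock = threading.Lock()

    def try_consume(self, user_id: str, tokens: int) -> bool:
        """Reserve tokens; return False if that would exceed today's limit."""
        key = (user_id, self._today())
        with self._lock:  # check and update happen under one lock
            used = self._used.get(key, 0)
            if used + tokens > self.daily_limit:
                return False
            self._used[key] = used + tokens
            return True

    def remaining(self, user_id: str) -> int:
        with self._lock:
            return self.daily_limit - self._used.get((user_id, self._today()), 0)


if __name__ == "__main__":
    budget = InMemoryTokenBudget(daily_limit=100)
    print(budget.try_consume("alice", 60), budget.try_consume("alice", 50))
    print(budget.remaining("alice"))
```

Since the true token count is only known after the model responds, one workable approach is to reserve an estimate up front and adjust once the actual usage arrives.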
Observability Maturity Model
| Level | Description | Stack |
|---|---|---|
| 0 | No monitoring, learn of problems from user complaints | Dify built-in logs only |
| 1 | Basic metrics monitoring | Prometheus + Grafana |
| 2 | Log aggregation + alerting | + Loki + AlertManager |
| 3 | Distributed tracing | + OpenTelemetry + Jaeger |
| 4 | Business metrics + cost visibility | + Custom metrics + Cost Dashboard |
| 5 | AIOps: anomaly detection + auto root cause analysis | + ML anomaly detection |
Target Level 3 for most enterprise deployments; Level 4 is the aspirational state.
Chapter Summary
Key takeaways:
- LLM observability = traditional metrics + token cost + retrieval quality; all three are essential.
- Prometheus + Grafana + Loki is a lightweight but complete monitoring stack suitable for teams of 50-1,000 people.
- Structured logging (JSON format with Trace ID correlation) is the key to fast incident diagnosis; never rely on print statements.
- Cost alerts must be configured before going live, not after the bill arrives.
- Business metrics (retrieval similarity scores, user thumbs up/down) reveal true service quality better than technical metrics alone.
- Celery queue Redis must be isolated from the cache Redis: if memory fills up and eviction kicks in, tasks silently disappear.
Alert priority reference:
| Alert | Threshold | Severity | Notification |
|---|---|---|---|
| API error rate | > 5% | Critical | Phone/SMS immediately |
| Daily cost budget | > $100 | Critical | Instant message |
| P95 latency | > 10s | Warning | Group chat |
| Redis memory | > 85% | Warning | Group chat |
| Worker backlog | > 100 tasks | Warning | |
| Avg prompt tokens | > 5,000 | Info | Daily report |