Chapter 20

Observability: Logging, Tracing, Cost Control and Alerting

You cannot manage what you cannot see. This chapter shows how to build a complete observability stack for Dify so that every LLM call and every token consumed is accounted for.

Chapter Overview

A production Dify deployment is like a racing car at full speed. Without a dashboard, you won't know when a tire is about to blow. Observability rests on three pillars:

Logs: What happened? Who did what, and when?
Metrics: How healthy is the system? QPS, latency, error rate?
Traces: Which services did a request touch? What was slow?

For an LLM application platform like Dify, there is a critical fourth dimension:

Cost: Which models were called? How many tokens were consumed? What did it cost today?

This chapter walks you through building a complete observability stack from scratch: Prometheus + Grafana for metrics, Loki for log aggregation, OpenTelemetry for distributed tracing, and token-based cost control with budget alerts.

By the end, you will be able to:

Deploy Prometheus + Grafana to monitor all core Dify metrics
Configure Loki to centralize and search Dify logs
Use OpenTelemetry to trace every LLM call end-to-end
Set up cost budget alerts to prevent runaway spending
Build production-grade alert rules that notify within 5 minutes of an anomaly

Level 1: Core Concepts (1–3 Years Experience)

Why LLM Apps Need Special Observability

Traditional web app observability focuses on latency and error rate. LLM applications add new challenges:

Highly variable response times: The same question might return in 500ms or 30s depending on token count and model load
Cost tightly coupled to usage: A poorly written prompt can inflate token consumption 10x
Quality hard to measure with traditional metrics: HTTP 200 doesn't mean the answer was correct
Complex call chains: One user request may trigger RAG retrieval → multiple LLM calls → tool execution → more LLM calls

Analogy: Traditional observability is like watching a car's fuel gauge and speedometer. LLM observability also asks "how much did this journey cost in fuel?" and "was the passenger satisfied with the trip?"

Dify Built-in Monitoring

The Dify console provides basic built-in monitoring without additional configuration:

App monitoring (Console → App → Monitor):

Total conversations, active user count
Average response time trends
Token consumption (daily/weekly/monthly)
Model cost estimates
User feedback (thumbs up/down)

Log viewer (Console → Logs):

Complete record of each conversation (input/output/prompt used)
Token consumption details
Execution timing

Limitations: Built-in monitoring covers only individual apps, cannot aggregate across apps, cannot set alerts, and retains data for only 30 days.

Quick Prometheus Integration

The configuration below scrapes the Dify API's /metrics endpoint (Prometheus format) together with exporters for PostgreSQL, Redis, Weaviate, and the host:

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'dify-api'
    static_configs:
      - targets: ['dify-api:5001']
    metrics_path: '/metrics'
    scrape_interval: 30s

  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']

  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']

  - job_name: 'weaviate'
    static_configs:
      - targets: ['weaviate:2112']

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

# Add monitoring stack to docker-compose
services:
  prometheus:
    image: prom/prometheus:v2.48.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:10.2.0
    environment:
      GF_SECURITY_ADMIN_PASSWORD: StrongGrafanaPassword
      GF_USERS_ALLOW_SIGN_UP: 'false'
    volumes:
      - grafana_data:/var/lib/grafana
    ports:
      - "3001:3000"

  postgres-exporter:
    image: prometheuscommunity/postgres-exporter:v0.15.0
    environment:
      DATA_SOURCE_NAME: postgresql://dify:password@db:5432/dify?sslmode=disable

  redis-exporter:
    image: oliver006/redis_exporter:v1.55.0
    environment:
      REDIS_ADDR: redis://redis:6379
      REDIS_PASSWORD: RedisPassword

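  node-exporter:
    image: prom/node-exporter:v1.7.0
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
    command:
      # Read host statistics from the mounted paths instead of the container's own
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'

# Named volumes referenced by prometheus and grafana above
volumes:
  prometheus_data:
  grafana_data: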

Level 2: Mechanism Deep Dive (3–5 Years Experience)

Key Prometheus Queries

API request rate:

rate(http_requests_total{job="dify-api"}[5m])
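
To make the rate() query above less of a black box, here is a minimal Python sketch of the idea behind it: sum the increases between consecutive counter samples, treat any drop as a counter reset, and divide by the elapsed time. It is a simplification under stated assumptions. Prometheus also extrapolates to the edges of the range window, which this sketch skips, and the function name simple_rate and the sample data are made up for illustration.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def simple_rate(samples: List[Sample]) -> float:
    """Per-second increase across the samples, tolerating counter resets."""
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur >= prev:
            increase += cur - prev
        else:
            # A drop means the process restarted and the counter began again
            # at zero, so the whole current value counts as new increase.
            increase += cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Hypothetical scrapes every 30 s; the API restarts between t=30 and t=60.
samples = [(0, 100), (30, 130), (60, 10), (90, 40)]
print(simple_rate(samples))
```

The restart contributes 10 requests rather than a negative jump, which is why rate() is safe to use on counters that reset when a container is redeployed.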

LLM latency distribution:

# Aggregate buckets by le first so the quantile covers all instances and labels

# P50
histogram_quantile(0.50, sum by (le) (rate(dify_llm_request_duration_seconds_bucket[5m])))

# P95
histogram_quantile(0.95, sum by (le) (rate(dify_llm_request_duration_seconds_bucket[5m])))

# P99
histogram_quantile(0.99, sum by (le) (rate(dify_llm_request_duration_seconds_bucket[5m])))
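
The percentile queries above return estimates, not exact values. As a rough sketch of why, the following Python function mimics the core of the calculation: locate the cumulative bucket that contains the target rank, then interpolate linearly inside it. The bucket boundaries and counts are invented for the example, and the real Prometheus function handles more edge cases than this.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: the best available
                # answer is the highest finite upper bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly within the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical LLM latency histogram: (le in seconds, cumulative request count)
latency = [(0.5, 20), (1, 50), (5, 90), (10, 98), (math.inf, 100)]
for q in (0.50, 0.95, 0.99):
    print(q, histogram_quantile(q, latency))
```

With these buckets the P99 falls into the +Inf bucket, so the estimate is capped at 10 seconds. For LLM calls that can take 30 seconds or more, the histogram needs buckets above the latency you intend to alert on, or the high percentiles are meaningless.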

Token consumption by model:

rate(dify_llm_tokens_total{model=~"gpt-4.*"}[5m])
rate(dify_llm_tokens_total{model=~"gpt-3.5.*"}[5m])

Worker queue backlog:

celery_tasks_received_total - celery_tasks_succeeded_total - celery_tasks_failed_total

Log Aggregation with Loki + Promtail

services:
  loki:
    image: grafana/loki:2.9.0
    volumes:
      - ./loki-config.yaml:/etc/loki/local-config.yaml
      - loki_data:/loki
    command: -config.file=/etc/loki/local-config.yaml
    ports:
      - "3100:3100"

  promtail:
    image: grafana/promtail:2.9.0
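    volumes:
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      # docker_sd_configs discovers containers through the Docker socket
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./promtail-config.yaml:/etc/promtail/config.yml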
    command: -config.file=/etc/promtail/config.yml

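# promtail-config.yaml
clients:
  # Where Promtail pushes log lines; "loki" is the compose service defined above
  - url: http://loki:3100/loki/api/v1/push

scrape_configs: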
  - job_name: dify-containers
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      - source_labels: [__meta_docker_container_name]
        regex: /dify.*
        action: keep
      - source_labels: [__meta_docker_container_label_com_docker_compose_service]
        target_label: service
    pipeline_stages:
      - json:
          expressions:
            level: level
            message: message
            request_id: request_id
      - labels:
          level:
          service:
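
The json stage in the Promtail pipeline above only works if the application writes one JSON object per line with level, message, and request_id keys. As a minimal sketch of what such output could look like, the following uses only the Python standard library. The logger name dify.sketch and the JsonFormatter class are illustrative and are not taken from Dify's source.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON line with the keys Promtail extracts."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # request_id is supplied per call through the `extra` argument
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("dify.sketch")  # hypothetical logger name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()  # stands in for the container's stdout
log = build_logger(buffer)
log.error("LLM request timeout", extra={"request_id": "req-abc-12345"})
line = buffer.getvalue().strip()
print(line)
```

Only level is promoted to a Loki label in the pipeline. That is deliberate: request_id has very high cardinality, so it is better matched with a line filter than stored as a label.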

Useful LogQL queries:

# All errors from dify-api
{service="api"} |= "ERROR"

# Full call chain for a specific request
{service=~"api|worker"} |= "req-abc-12345"

# Error lines per minute (set the Grafana time range to the last hour)
sum(count_over_time({service="api"} |= "ERROR" [1m]))

# LLM timeouts
{service="api"} |~ "timeout|LLM.*error"

OpenTelemetry Distributed Tracing

# Enable in .env
ENABLE_OTEL=true
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
OTEL_SERVICE_NAME=dify-api
OTEL_TRACES_SAMPLER=parentbased_traceidratio
# Sample 10% of traces to reduce overhead (kept on its own line because some .env parsers do not strip inline comments)
OTEL_TRACES_SAMPLER_ARG=0.1

services:
  jaeger:
    image: jaegertracing/all-in-one:1.52
    environment:
      COLLECTOR_OTLP_ENABLED: 'true'
    ports:
      - "16686:16686"  # Jaeger UI
      - "4317:4317"    # OTLP gRPC
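
The 0.1 sampler argument above does not pick traces at random per span. A ratio-based sampler derives the decision from the trace ID, so every service that sees the same trace reaches the same verdict. The following Python sketch illustrates that idea under simplifying assumptions. It is modelled on the general approach rather than copied from any SDK, and the parent-based wrapper (which simply honours the decision already made by the calling service) is left out.

```python
import random

TRACE_ID_LIMIT = (1 << 64) - 1  # only the low 64 bits of the 128-bit ID are used


def should_sample(trace_id: int, ratio: float) -> bool:
    """Deterministic decision: keep the trace if its ID falls below the bound."""
    bound = round(ratio * (TRACE_ID_LIMIT + 1))
    return (trace_id & TRACE_ID_LIMIT) < bound


# Simulate 100,000 root traces with random 128-bit IDs.
rng = random.Random(42)
trace_ids = [rng.getrandbits(128) for _ in range(100_000)]
kept = sum(should_sample(t, 0.1) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

Because the decision is a pure function of the trace ID, a sampled request keeps all of its spans across the API and worker services. The trade-off is that rare failures are also dropped 90% of the time, so logs and metrics remain the first line of detection.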

Prometheus Alert Rules

# alerts.yaml (list this file under rule_files in prometheus.yml so Prometheus loads it)
groups:
  - name: dify_alerts
    rules:
      # Total token consumption across all models and apps > 100K per hour
      - alert: HighTokenConsumption
        expr: sum(increase(dify_llm_tokens_total[1h])) > 100000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High token consumption"
          description: "Token consumption {{ $value }} exceeded 100K in last hour"

      # Estimated daily cost > $100
      - alert: DailyBudgetExceeded
        expr: sum(increase(dify_llm_cost_usd_total[24h])) > 100
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Daily budget exceeded"
          description: "Estimated daily cost ${{ $value }} exceeded budget of $100"

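      # Error rate > 5% (both sides are summed; without sum() the status label
      # prevents the numerator and denominator series from matching)
      - alert: HighErrorRate
        expr: >
          sum(rate(http_requests_total{job="dify-api",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="dify-api"}[5m])) > 0.05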
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High API error rate"
          description: "Error rate {{ $value | humanizePercentage }} exceeded 5%"

      # P95 latency > 10 seconds
      - alert: HighLatency
        expr: >
          histogram_quantile(0.95,
            sum by (le) (rate(dify_llm_request_duration_seconds_bucket[5m]))
          ) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High LLM latency"
          description: "P95 latency {{ $value }}s exceeded 10s threshold"

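      # Redis memory > 85% (instances with no maxmemory set report 0 and are skipped)
      - alert: RedisMemoryHigh
        expr: redis_memory_used_bytes / (redis_memory_max_bytes > 0) > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory usage high"
          description: "Redis memory usage {{ $value | humanizePercentage }} exceeded 85% of maxmemory"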
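
Every rule above carries a for: clause, which is what keeps a single bad scrape from paging someone. As a rough sketch of the behaviour, the following Python function walks through successive rule evaluations and reports the alert state at each one. The 60-second evaluation interval and the sequence of results are assumptions chosen for the example.

```python
from typing import Iterable, List


def alert_states(condition_by_eval: Iterable[bool],
                 eval_interval_s: int,
                 for_s: int) -> List[str]:
    """Return the alert state after each rule evaluation."""
    states = []
    active_since = None  # time at which the expression first became true
    for i, is_true in enumerate(condition_by_eval):
        now = i * eval_interval_s
        if not is_true:
            # Any false evaluation resets the timer completely.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


# HighErrorRate uses for: 2m. Assume one evaluation per minute.
results = [True, True, False, True, True, True, True]
print(alert_states(results, eval_interval_s=60, for_s=120))
# ['pending', 'pending', 'inactive', 'pending', 'pending', 'firing', 'firing']
```

The brief recovery at the third evaluation restarts the clock, so the alert fires only after the condition has held continuously for two minutes. Keep this delay in mind when budgeting the time between an anomaly and its notification.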

Level 3: Source Code and Architecture (5+ Years)

Dify's Internal Observability Implementation

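A Flask API such as Dify's can expose metrics through prometheus_flask_exporter and wrap each model invocation in an OpenTelemetry span. The following is a simplified illustration of that pattern, not verbatim source: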

# PrometheusMetrics(app) is attached to the Flask app at startup (not shown here)
from prometheus_flask_exporter import PrometheusMetrics
from opentelemetry import trace

tracer = trace.get_tracer('dify.model_runtime')

class BaseModelProvider:
    def invoke(self, model_parameters: dict, **kwargs):
        # One span per model call; the span name includes the model for quick filtering
        with tracer.start_as_current_span(f"llm.invoke.{self.model}") as span:
            span.set_attribute("llm.model", self.model)
            span.set_attribute("llm.provider", self.provider)

            try:
                response = self._invoke_model(model_parameters, **kwargs)
                # Record token usage on the span when the provider reports it
                if hasattr(response, 'usage'):
                    span.set_attribute("llm.input_tokens", response.usage.prompt_tokens)
                    span.set_attribute("llm.output_tokens", response.usage.completion_tokens)
                return response
            except Exception as e:
                # Attach the exception to the span, then let the caller handle it
                span.record_exception(e)
                raise

Structured Logging Setup

import structlog

def configure_structlog():
    structlog.configure(
        processors=[
            structlog.stdlib.add_log_level,
            structlog.processors.TimeStamper(fmt="iso"),
            add_open_telemetry_spans,  # Correlate with trace IDs
            structlog.processors.JSONRenderer()
        ],
        logger_factory=structlog.stdlib.LoggerFactory(),
    )

def add_open_telemetry_spans(logger, method, event_dict):
    from opentelemetry import trace
    span = trace.get_current_span()
    if span.is_recording():
        ctx = span.get_span_context()
        event_dict['trace_id'] = format(ctx.trace_id, '032x')
        event_dict['span_id'] = format(ctx.span_id, '016x')
    return event_dict

Level 4: Production Traps and Decisions (Expert Perspective)

Real Incident Post-Mortems

Incident 1: OpenAI cost explosion at a 500-person manufacturing company

One morning the daily OpenAI bill jumped to $850 — 10x normal.

Investigation:

Grafana's token consumption panel showed the anomaly starting at 2:00 AM
Loki logs revealed massive volume from app_id=prod-sales-assistant
Deeper log analysis showed the prompt size had grown from 1,000 to 12,000 tokens
The audit log showed that an employee had accidentally pasted an entire example document into the System Prompt field

Root cause: No alert on prompt token size, no confirmation step for critical console changes.

Fix:

- alert: LargePromptDetected
  expr: avg(dify_llm_prompt_tokens) by (app_id) > 5000
  for: 5m
  annotations:
    summary: "App {{ $labels.app_id }} has abnormally large prompt tokens"

Incident 2: Celery tasks silently dropped

Document uploads showed no knowledge base updates for hours. Some tasks disappeared entirely.

Investigation: Redis memory was full. The allkeys-lru eviction policy was evicting Celery queue messages from Redis.

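Fix: Run the Celery broker on a dedicated Redis instance configured with maxmemory-policy noeviction. Redis applies its eviction policy per instance, not per database, so moving Celery to db 1 on the same instance would not protect the queue. The hostname redis-broker below is an example name for that second instance:

CELERY_BROKER_URL: redis://:password@redis-broker:6379/0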

Cost Optimization Strategies

Strategy 1: Model routing by complexity

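def is_simple_faq(query: str) -> bool:
    # Hypothetical placeholder: swap in a real FAQ matcher or intent classifier
    return len(query.split()) <= 8

def smart_model_selector(query: str, complexity: float) -> str:
    """Return the cheapest model expected to cope; complexity is a score in [0, 1]."""
    if complexity < 0.3 or is_simple_faq(query):
        return "gpt-3.5-turbo"    # $0.002/1K tokens
    elif complexity < 0.7:
        return "gpt-4-turbo"       # $0.01/1K tokens
    else:
        return "gpt-4"             # $0.03/1K tokens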
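
To put a number on what routing can save, the following sketch applies the per-1K-token prices quoted in the comments above to a hypothetical day of traffic. The volume of one million tokens and the 60/30/10 split between models are assumptions made up for the arithmetic, and the prices should be checked against the provider's current price list.

```python
# Prices per 1K tokens, as quoted in the routing example
PRICE_PER_1K = {
    "gpt-3.5-turbo": 0.002,
    "gpt-4-turbo": 0.01,
    "gpt-4": 0.03,
}


def daily_cost(tokens_by_model: dict) -> float:
    """Total cost in USD for a day's token usage, broken down by model."""
    return sum(tokens / 1000 * PRICE_PER_1K[model]
               for model, tokens in tokens_by_model.items())


# Baseline: everything goes to the most expensive model.
all_gpt4 = daily_cost({"gpt-4": 1_000_000})

# Routed: hypothetical split of the same one million tokens.
routed = daily_cost({
    "gpt-3.5-turbo": 600_000,
    "gpt-4-turbo": 300_000,
    "gpt-4": 100_000,
})

print(f"all gpt-4: ${all_gpt4:.2f}")   # all gpt-4: $30.00
print(f"routed:    ${routed:.2f}")     # routed:    $7.20
print(f"saving:    {1 - routed / all_gpt4:.0%}")
```

Under these assumptions routing cuts the bill by roughly three quarters. The real saving depends entirely on how much traffic the cheaper models can answer acceptably, which is why routing should be paired with the quality signals (such as user feedback) already collected by Dify.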

Strategy 2: Response caching for repetitive queries

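import hashlib

# Assumes `redis_client` (a redis.Redis instance) and `llm` (a client whose
# invoke() returns an object with a .content string) are defined elsewhere.
def cached_llm_call(prompt: str, model: str, ttl: int = 3600) -> str:
    # Hash model + prompt so the cache key has a fixed length
    digest = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    cache_key = f"llm:{digest}"
    cached = redis_client.get(cache_key)
    if cached is not None:
        return cached.decode()
    response = llm.invoke(prompt, model=model)
    redis_client.setex(cache_key, ttl, response.content)
    return response.content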

Strategy 3: Per-user daily token budgets

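from datetime import datetime, timezone

# Assumes `redis_client` is a redis.Redis instance defined elsewhere.
class TokenBudgetMiddleware:
    @staticmethod
    def _key(user_id: str) -> str:
        # Use UTC so every API replica agrees on when the day rolls over
        today = datetime.now(timezone.utc).strftime('%Y-%m-%d')
        return f"token_budget:{user_id}:{today}"

    def check_budget(self, user_id: str, daily_limit: int = 50000) -> bool:
        used = int(redis_client.get(self._key(user_id)) or 0)
        return used < daily_limit

    def consume(self, user_id: str, tokens: int) -> None:
        key = self._key(user_id)
        redis_client.incrby(key, tokens)
        # Keys are date-stamped, so the TTL only cleans up counters from past days
        redis_client.expire(key, 86400)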
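
One caveat with the middleware above is that check_budget and consume are separate steps, so two concurrent requests can both pass the check and together overshoot the limit. A common alternative is to reserve the tokens first and roll back if the new total is too high. The sketch below shows that pattern with an in-memory stand-in for Redis so it can run anywhere. InMemoryCounters and try_consume are hypothetical names, and in production the same two calls would map onto Redis INCRBY and DECRBY.

```python
import threading
from collections import defaultdict


class InMemoryCounters:
    """Hypothetical stand-in for Redis INCRBY/DECRBY, for illustration only."""

    def __init__(self) -> None:
        self._values = defaultdict(int)
        self._lock = threading.Lock()

    def incrby(self, key: str, amount: int) -> int:
        # Atomic increment that returns the new total, like Redis INCRBY
        with self._lock:
            self._values[key] += amount
            return self._values[key]

    def decrby(self, key: str, amount: int) -> int:
        return self.incrby(key, -amount)


def try_consume(store: InMemoryCounters, key: str, tokens: int, daily_limit: int) -> bool:
    """Reserve tokens first, then roll back if the reservation overshoots."""
    new_total = store.incrby(key, tokens)
    if new_total > daily_limit:
        store.decrby(key, tokens)
        return False
    return True


store = InMemoryCounters()
key = "token_budget:user-1:2024-01-01"
print(try_consume(store, key, 40_000, 50_000))  # True: 40,000 used
print(try_consume(store, key, 20_000, 50_000))  # False: would reach 60,000
print(try_consume(store, key, 10_000, 50_000))  # True: exactly 50,000 used
```

Note that this variant is stricter than the original check, which allows a request through whenever any budget remains and can therefore end the day above the limit. Which behaviour is preferable depends on whether a hard cap or an uninterrupted final request matters more to your users.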

Observability Maturity Model

Level	Description	Stack
0	No monitoring, learn of problems from user complaints	Dify built-in logs only
1	Basic metrics monitoring	Prometheus + Grafana
2	Log aggregation + alerting	+ Loki + AlertManager
3	Distributed tracing	+ OpenTelemetry + Jaeger
4	Business metrics + cost visibility	+ Custom metrics + Cost Dashboard
5	AIOps: anomaly detection + auto root cause analysis	+ ML anomaly detection

Target Level 3 for most enterprise deployments; Level 4 is the aspirational state.

Chapter Summary

Key takeaways:

LLM observability = traditional metrics + token cost + retrieval quality — all three are essential.
Prometheus + Grafana + Loki is a lightweight but complete monitoring stack suitable for teams of 50–1,000 people.
Structured logging (JSON format with Trace ID correlation) is the key to fast incident diagnosis — never rely on print statements.
Cost alerts must be configured before going live, not after the bill arrives.
Business metrics (retrieval similarity scores, user thumbs up/down) reveal true service quality better than technical metrics alone.
Celery queue Redis must be isolated from the cache Redis — if memory fills up and eviction kicks in, tasks silently disappear.

Alert priority reference:

Alert	Threshold	Severity	Notification
API error rate	> 5%	Critical	Phone/SMS immediately
Daily cost budget	> $100	Critical	Instant message
P95 latency	> 10s	Warning	Group chat
Redis memory	> 85%	Warning	Group chat
Worker backlog	> 100 tasks	Warning	Email
Avg prompt tokens	> 5,000	Info	Daily report
