Description

Patterns and procedures for building AI agent workflows that survive real-world failures. Use when asked to build a multi-step automation, pipeline, or agent...

README (SKILL.md)

Durable Workflow Patterns

Name: Durable Workflow
Author: old-greggyboy

Build automations that survive API failures, timeouts, and unexpected state — without rebuilding from scratch every time something breaks.

Core Principle

Every step in a multi-step workflow must answer three questions:

What did I finish? (checkpoint)
What do I do if this step fails? (recovery)
Who finds out if something goes wrong? (alerting)

Skip any of these and the workflow will eventually fail silently.

Scripts

Ready-to-use implementations in scripts/:

Script	Purpose
`workflow-template.js`	Complete workflow skeleton with checkpoints, retry, DLQ, exit handler
`lock.js`	File-based process lock — prevents concurrent runs

workflow-template.js

Copy and fill in the step TODOs:

cp scripts/workflow-template.js my-workflow.js
node my-workflow.js           # Run (or re-run — resumes from last checkpoint)
WORKFLOW_STATE_PATH=/tmp/state.json node my-workflow.js   # Custom state path

Features: atomic state saves, exponential backoff, timeout wrapper, DLQ, abnormal-exit logging.

lock.js

Prevent two instances of the same workflow from running at once:

const { withLock, LockError } = require('./lock');

withLock('/tmp/my-workflow.lock', async () => {
  // Only one process runs this block at a time
  await runWorkflow();
}).catch(e => {
  if (e.name === 'LockError') {
    console.error('Already running:', e.message);
  } else {
    throw e;
  }
});

Pattern 1: Checkpoint State

Save progress after every meaningful step. Never trust in-memory state across network calls.

// checkpoint.js pattern
const state = loadState('workflow-id') || { step: 0, results: [] };

if (state.step < 1) {
  state.results.push(await fetchData());
  state.step = 1;
  saveState('workflow-id', state);
}
if (state.step < 2) {
  state.results.push(await processData(state.results[0]));
  state.step = 2;
  saveState('workflow-id', state);
}
// Restart from any step — already-done steps are skipped
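
The snippet above calls loadState and saveState without defining them. A minimal sketch of what they might look like follows, assuming one JSON file per workflow id in a hypothetical STATE_DIR (not a path the shipped scripts are known to use). The write goes to a temporary file first and is then renamed over the target, so a crash mid-write should leave the previous checkpoint intact on filesystems where rename replaces the target atomically.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical location for checkpoints; adjust to a directory you control.
const STATE_DIR = path.join(os.tmpdir(), 'durable-workflow-state');

function statePath(workflowId) {
  return path.join(STATE_DIR, `${workflowId}.json`);
}

// Returns the saved state object, or null when no checkpoint exists yet.
function loadState(workflowId) {
  try {
    return JSON.parse(fs.readFileSync(statePath(workflowId), 'utf8'));
  } catch (e) {
    if (e.code === 'ENOENT') return null;
    throw e; // unreadable or corrupt state should fail loudly, not reset silently
  }
}

// Writes to a temp file in the same directory, then renames it over the target.
function saveState(workflowId, state) {
  fs.mkdirSync(STATE_DIR, { recursive: true });
  const target = statePath(workflowId);
  const tmp = `${target}.${process.pid}.tmp`;
  fs.writeFileSync(tmp, JSON.stringify(state, null, 2));
  fs.renameSync(tmp, target);
}
```

Because loadState returns null for a missing file, the `loadState('workflow-id') || { step: 0, results: [] }` idiom above works unchanged on a first run.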

Pattern 2: Circuit Breaker

Stop hammering a failing service. Open the circuit after N failures, half-open after a cooldown.

class CircuitBreaker {
  constructor(threshold = 3, cooldownMs = 30000) {
    this.failures = 0; this.threshold = threshold;
    this.cooldownMs = cooldownMs; // how long the circuit stays open before a probe is allowed
    this.state = 'closed'; this.nextRetry = 0;
  }
  async call(fn) {
    if (this.state === 'open') {
      if (Date.now() < this.nextRetry) throw new Error('Circuit open');
      this.state = 'half-open';
    }
    try {
      const result = await fn();
      this.failures = 0; this.state = 'closed';
      return result;
    } catch (e) {
      this.failures++;
      if (this.failures >= this.threshold) {
        this.state = 'open';
        this.nextRetry = Date.now() + this.cooldownMs;
      }
      throw e;
    }
  }
}
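
To make the state transitions concrete, here is a small self-contained demo. It repeats the breaker in condensed form (with cooldownMs stored on the instance) so it runs on its own. flakyService, the threshold of 2, and the 50 ms cooldown are all hypothetical values chosen to keep the demo fast.

```javascript
// Condensed copy of the breaker so this sketch runs standalone.
class CircuitBreaker {
  constructor(threshold = 3, cooldownMs = 30000) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.failures = 0;
    this.state = 'closed';
    this.nextRetry = 0;
  }
  async call(fn) {
    if (this.state === 'open') {
      if (Date.now() < this.nextRetry) throw new Error('Circuit open');
      this.state = 'half-open';
    }
    try {
      const result = await fn();
      this.failures = 0;
      this.state = 'closed';
      return result;
    } catch (e) {
      this.failures++;
      if (this.failures >= this.threshold) {
        this.state = 'open';
        this.nextRetry = Date.now() + this.cooldownMs;
      }
      throw e;
    }
  }
}

// Hypothetical dependency: fails until `healthy` is flipped to true.
let healthy = false;
let calls = 0;
async function flakyService() {
  calls++;
  if (!healthy) throw new Error('503 from upstream');
  return 'ok';
}

async function demo() {
  const breaker = new CircuitBreaker(2, 50); // open after 2 failures, 50 ms cooldown
  const log = [];
  for (let i = 0; i < 3; i++) {
    try { await breaker.call(flakyService); }
    catch (e) { log.push(e.message); }
  }
  // The third call was rejected by the breaker without reaching the service.
  healthy = true;
  await new Promise(r => setTimeout(r, 60)); // wait out the cooldown
  log.push(await breaker.call(flakyService)); // half-open probe succeeds, circuit closes
  return { log, calls, state: breaker.state };
}
```

In this run the service is called only three times for four attempts: two real failures, one call short-circuited while open, and one successful probe after the cooldown.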

Pattern 3: Exponential Backoff with Jitter

async function withRetry(fn, maxAttempts = 4, baseDelayMs = 1000) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try { return await fn(); }
    catch (e) {
      if (attempt === maxAttempts - 1) throw e;
      const delay = baseDelayMs * Math.pow(2, attempt) + Math.random() * 500;
      await new Promise(r => setTimeout(r, delay));
    }
  }
}
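
As a worked example of the arithmetic in withRetry, the helper below (a hypothetical name, not part of the skill) lists the base delays the function would sleep between attempts, before the random 0 to 500 ms of jitter is added. There is one fewer delay than attempts because the last failure is rethrown immediately.

```javascript
// Base delays (before jitter) between attempts, mirroring the formula in withRetry.
function backoffSchedule(maxAttempts = 4, baseDelayMs = 1000) {
  const delays = [];
  for (let attempt = 0; attempt < maxAttempts - 1; attempt++) {
    delays.push(baseDelayMs * Math.pow(2, attempt));
  }
  return delays;
}

console.log(backoffSchedule()); // [ 1000, 2000, 4000 ]
```

With the defaults, a call that fails every time therefore waits about 7 seconds in total, plus up to 1.5 seconds of jitter, before the final error surfaces.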

Pattern 4: Dead Letter Queue

When a step fails after all retries, don't silently drop it. Route it somewhere reviewable.

const fs = require('fs');

async function processWithDLQ(items, processFn, dlqPath) {
  const failed = [];
  for (const item of items) {
    try { await withRetry(() => processFn(item)); }
    catch (e) { failed.push({ item, error: e.message, failedAt: new Date() }); }
  }
  if (failed.length) {
    const existing = fs.existsSync(dlqPath) ? JSON.parse(fs.readFileSync(dlqPath)) : [];
    fs.writeFileSync(dlqPath, JSON.stringify([...existing, ...failed], null, 2));
  }
}
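
A dead letter file is only useful if something eventually drains it. One possible replay helper is sketched below; replayDLQ is a hypothetical name, and it assumes the `{ item, error, failedAt }` entry shape written by processWithDLQ above. Entries that succeed are dropped, and entries that fail again are kept with the latest error.

```javascript
const fs = require('fs');

// Re-run every DLQ entry through processFn and keep only those that still fail.
async function replayDLQ(dlqPath, processFn) {
  if (!fs.existsSync(dlqPath)) return { replayed: 0, remaining: 0 };
  const entries = JSON.parse(fs.readFileSync(dlqPath, 'utf8'));
  const stillFailing = [];
  for (const entry of entries) {
    try {
      await processFn(entry.item);
    } catch (e) {
      // Keep the entry, but record the most recent failure.
      stillFailing.push({ ...entry, error: e.message, failedAt: new Date() });
    }
  }
  fs.writeFileSync(dlqPath, JSON.stringify(stillFailing, null, 2));
  return { replayed: entries.length - stillFailing.length, remaining: stillFailing.length };
}
```

Replay is only safe when processFn is idempotent, since an item may have partially succeeded before it landed in the queue.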

Pattern 5: Idempotent Operations

Design every step so running it twice produces the same result as running it once.

// BAD: running twice creates two records
await db.insert({ id: uuid(), data });

// GOOD: upsert on natural key
await db.upsert({ id: deterministicId(data), data }, { onConflict: 'update' });
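
The deterministicId call above is not defined anywhere in this document. One way it might be written is to hash the fields that identify "the same record". The default key fields below (source, externalId) are placeholders for whatever natural key your data actually has.

```javascript
const crypto = require('crypto');

// Hypothetical helper: derive a stable id from the fields that form the natural key.
function deterministicId(data, keyFields = ['source', 'externalId']) {
  // JSON.stringify keeps 1 and "1" distinct and handles missing fields consistently.
  const key = keyFields.map(f => `${f}=${JSON.stringify(data[f])}`).join('&');
  return crypto.createHash('sha256').update(key).digest('hex').slice(0, 32);
}
```

Fields outside the key (timestamps, payload details) do not affect the id, so re-running a step with refreshed data updates the existing record instead of creating a second one.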

Pattern 6: Instance Lock

Prevent duplicate runs (e.g. cron overlap, manual re-trigger while running).

const { withLock, LockError } = require('./scripts/lock');

const LOCK_PATH = '/tmp/my-workflow.lock';

async function main() {
  await withLock(LOCK_PATH, async () => {
    // Safe: only one instance reaches here at a time
    await runWorkflow();
  });
}

main().catch(e => {
  if (e.name === 'LockError') {
    // Not an error — just another instance running
    console.log(`Skipping: ${e.message}`);
    process.exit(0);
  }
  console.error('Fatal:', e.message);
  process.exit(1);
});

The lock uses PID detection — stale locks from crashed processes are automatically reclaimed.
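
The source of lock.js is not reproduced here, so the following is only a guess at the general shape of PID-based stale-lock detection, not the script's actual code. It assumes the lock file contains the holder's PID as text. tryAcquire and isProcessAlive are hypothetical names.

```javascript
const fs = require('fs');

// True if a process with this PID currently exists.
function isProcessAlive(pid) {
  try {
    process.kill(pid, 0); // signal 0 only checks for existence; nothing is delivered
    return true;
  } catch (e) {
    return e.code === 'EPERM'; // process exists but belongs to another user
  }
}

// Returns true if the lock was taken, false if a live process already holds it.
function tryAcquire(lockPath) {
  try {
    fs.writeFileSync(lockPath, String(process.pid), { flag: 'wx' }); // 'wx' fails if the file exists
    return true;
  } catch (e) {
    if (e.code !== 'EEXIST') throw e;
  }
  const holder = parseInt(fs.readFileSync(lockPath, 'utf8'), 10);
  if (Number.isInteger(holder) && isProcessAlive(holder)) return false;
  // Holder is gone (or the file is garbage): reclaim the stale lock and try again.
  fs.rmSync(lockPath, { force: true });
  return tryAcquire(lockPath);
}
```

Two caveats apply to any scheme like this: PIDs can be reused by unrelated processes, and there is a small window between removing a stale lock and recreating it in which another instance can win the race.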

Workflow Design Checklist

Before shipping any multi-step automation:

Each step saves state before moving to the next
External API calls wrapped in retry + backoff
Circuit breaker on services called more than once per run
Failed items go to a dead letter file/queue, not /dev/null
The workflow can restart from any step without duplicating completed work
Alerting fires when the workflow exits abnormally (not just on exception)
Timeouts set on all external calls (never await fetch() without a deadline)
Instance lock in place if triggered by cron or multiple callers
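
For the timeout item in the checklist, a generic wrapper might look like the sketch below. withTimeout is a hypothetical name, and this is not necessarily how the timeout wrapper in workflow-template.js is written.

```javascript
// Rejects if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms, label = 'operation') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; the timer is cleared either way so it cannot keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Note that this only stops waiting; it does not cancel the underlying work. For calls that accept an AbortSignal (fetch does), passing a signal as well lets the request itself be torn down when the deadline passes.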

Alerting

Send a Telegram message on workflow failure so you know before you look. Uses only Node's built-in https module.

Set env vars: ALERT_TELEGRAM_TOKEN and ALERT_CHAT_ID.

const https = require('https');

function sendTelegramAlert(message) {
  const token  = process.env.ALERT_TELEGRAM_TOKEN;
  const chatId = process.env.ALERT_CHAT_ID;
  if (!token || !chatId) return Promise.resolve(); // alerting not configured, skip silently

  const body = JSON.stringify({ chat_id: chatId, text: message, parse_mode: 'Markdown' });
  return new Promise((resolve) => {
    const req = https.request(
      {
        hostname: 'api.telegram.org',
        path: `/bot${token}/sendMessage`,
        method: 'POST',
        headers: { 'Content-Type': 'application/json', 'Content-Length': Buffer.byteLength(body) },
      },
      res => { res.resume(); res.on('end', resolve); }
    );
    req.on('error', () => resolve()); // don't let alert failure crash the workflow
    req.setTimeout(5000, () => { req.destroy(); resolve(); });
    req.write(body);
    req.end();
  });
}

// Usage — in your main() catch block:
main().catch(async e => {
  console.error('Fatal:', e.message);
  await sendTelegramAlert(`❌ *Workflow failed*\n\`${e.message}\``);
  process.exit(1);
});
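
The catch block above only covers rejected promises. The checklist also asks for a signal when the process exits abnormally without an exception reaching main(), for example after a stray process.exit(1). One possible approach, sketched with hypothetical names (createExitGuard, markFinished), is to record a line from an 'exit' listener unless the workflow marked itself finished. Listeners for 'exit' must be synchronous, so this writes to a local log rather than calling sendTelegramAlert.

```javascript
const fs = require('fs');

// Registers an 'exit' listener that logs unless the workflow finished cleanly.
function createExitGuard(logPath) {
  let finished = false;
  const onExit = code => {
    if (finished && code === 0) return; // clean completion, nothing to report
    // Only synchronous work is reliable inside an 'exit' listener.
    fs.appendFileSync(logPath, `${new Date().toISOString()} abnormal exit, code=${code}\n`);
  };
  process.on('exit', onExit);
  return { markFinished: () => { finished = true; }, onExit };
}
```

Call markFinished() as the last statement of a successful run. This does not cover SIGKILL or a machine crash, and termination by other signals generally needs its own handlers, so an external heartbeat check is still worth having for critical workflows.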

Common Failure Modes

See references/failure-taxonomy.md for a full catalog of agent workflow failures with diagnosis and fix patterns.

Usage Guidance

This skill appears to do what it says: it provides reusable Node.js workflow patterns and helper scripts. Before installing or running:

(1) Ensure Node.js is available (the registry metadata does not declare it).
(2) Review the two scripts (lock.js and workflow-template.js). They run locally and will read/write files (defaults: workflow-state.json, workflow-dlq.json, /tmp locks).
(3) Avoid running them as root, and set WORKFLOW_STATE_PATH and WORKFLOW_DLQ_PATH to directories you control to prevent accidental writes to sensitive locations.
(4) Confirm you are comfortable with process.kill-based PID checks (used by the lock).
(5) Fill in the TODO steps and review notification hooks. Do not plug in credentials or external endpoints without auditing how they are used.

If you need higher assurance, ask the publisher to declare Node as a required binary and list the env vars the skill reads.

Capability Analysis

Type: OpenClaw Skill
Name: durable-workflow
Version: 1.0.1

The bundle provides educational patterns and utility scripts for building resilient, multi-step AI agent workflows. The included JavaScript files (scripts/lock.js and scripts/workflow-template.js) implement standard software engineering practices such as file-based process locking, atomic state persistence, and exponential backoff retries. While the documentation includes a snippet for Telegram alerting, it is a standard notification pattern that relies on user-configured environment variables and lacks any automated data exfiltration or malicious execution logic.

Capability Assessment

ℹ Purpose & Capability

The name/description match the included files and patterns (lock.js and workflow-template.js implement checkpointing, retries, DLQ, exit handling). Minor inconsistency: the package is instruction-only and declares no required binaries, but the shipped scripts are Node.js programs — the skill does not declare that Node is required.

✓ Instruction Scope

SKILL.md stays on topic: it instructs copying and running the provided scripts and documents expected behavior (checkpointing, locks, retries). It explicitly tells the agent to read/write local state and lock files and to run node scripts; it does not instruct reading unrelated system secrets or contacting external endpoints.

✓ Install Mechanism

There is no install spec (instruction-only), so nothing will be downloaded or written by an installer. The risk is low. Note: runtime execution will write files to the filesystem when you run the scripts.

⚠ Credentials

The code references environment variables (WORKFLOW_STATE_PATH, WORKFLOW_DLQ_PATH, STEP_TIMEOUT_MS) and defaults to local paths, but the skill metadata lists no required env vars. These env vars are non-sensitive, but SKILL.md and the code read environment configuration that is not declared in the registry metadata, a small coherence gap you should be aware of.

✓ Persistence & Privilege

always:false and default model invocation are set (normal). The skill does not request persistent platform-level privileges or try to modify other skills' config. It writes/reads local state and lock files only, which is appropriate for its purpose.

Version History

v1.0.1

v1.0.1 of durable-workflow
- No file changes detected in this version.
- No user-facing updates or documentation changes.

v1.0.0

Initial release of durable-workflow: a toolkit and guide for building resilient, restartable multi-step automations.
- Provides step-by-step workflow patterns including checkpointing, retry with backoff, circuit breaker, and dead letter queue.
- Includes ready-to-use scripts for workflow skeletons and safe file-based process locking.
- Features best-practice guides for alerting, idempotency, and failure diagnosis.
- Offers a practical checklist to ensure reliable background and agent workflows that withstand errors and restarts.

Metadata

Slug durable-workflow

Version 1.0.1

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 2

Frequently Asked Questions

What is Durable Workflow?

Patterns and procedures for building AI agent workflows that survive real-world failures. Use when asked to build a multi-step automation, pipeline, or agent... It is an AI Agent Skill for Claude Code / OpenClaw, with 174 downloads so far.

How do I install Durable Workflow?

Run "/install durable-workflow" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Durable Workflow free?

Yes, Durable Workflow is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Durable Workflow support?

Durable Workflow is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Durable Workflow?

It is built and maintained by old-greggyboy (@old-greggyboy); the current version is v1.0.1.
