Description

Designs agent-evaluated flow tests for browser tasks, LLM outputs, and tool workflows. Invoke when exact asserts are brittle and semantic success matters mor...

README (SKILL.md)

Flow Test

Name: flow test
Author: qipengguo

Use this skill to design tests for tasks that cannot be validated reliably with traditional unit-test assertions alone.

This skill is for flow testing: the agent performs a realistic task, records key evidence from the process, and then judges success with an explicit semantic rubric.

Invoke this skill when:

the task depends on live or changing web content
the output can vary but still be correct
the workflow spans multiple model or tool steps
intermediate evidence matters more than one exact final string
you need to verify user intent was satisfied, not exact wording

Do not use this skill when:

the result is deterministic and easy to assert directly
a schema check, exact match, snapshot, or pure function test is enough
the requirement can be covered fully by normal unit or integration tests

Objective

Turn a fuzzy requirement into a test design that combines:

deterministic checks for stable invariants
evidence collection for dynamic execution
semantic evaluation for variable outcomes
a bounded verdict of pass, fail, or needs_review

Design Principles

1. Keep asserts where they still work

Do not replace traditional tests blindly. Preserve exact checks for stable facts such as:

tool call success
required fields
minimum counts
status codes
domain restrictions
date or freshness constraints when machine-checkable
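
A minimal sketch of such exact checks, assuming the run produces a dictionary with the hypothetical keys `status_code`, `url`, and `items`. None of these names come from the skill itself; they only illustrate how stable facts can stay under ordinary assertions.

```python
from urllib.parse import urlparse

# Hypothetical shape of a run result; the key names are assumptions for illustration.
REQUIRED_FIELDS = ("status_code", "url", "items")


def deterministic_checks(result: dict, allowed_domain: str, min_items: int = 1) -> list[str]:
    """Return descriptions of failed checks; an empty list means every check passed."""
    # Required fields must be present before anything else is inspected.
    missing = [name for name in REQUIRED_FIELDS if name not in result]
    if missing:
        return [f"missing fields: {', '.join(missing)}"]

    failures = []

    # Status code: the tool call or page load must have succeeded.
    if result["status_code"] != 200:
        failures.append(f"unexpected status code: {result['status_code']}")

    # Domain restriction: the host must be the allowed domain or one of its subdomains.
    host = urlparse(result["url"]).hostname or ""
    if host != allowed_domain and not host.endswith("." + allowed_domain):
        failures.append(f"domain not allowed: {host}")

    # Minimum count: at least `min_items` items were collected.
    if len(result["items"]) < min_items:
        failures.append(f"expected at least {min_items} items, got {len(result['items'])}")

    return failures
```

Returning a list of failure descriptions rather than raising on the first problem keeps every broken invariant visible in the evidence.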

2. Judge task completion, not exact phrasing

Prefer questions like:

did the agent reach the right source
did it gather relevant information
does the final answer satisfy the user request

Avoid requiring one exact string unless the wording itself is the requirement.

3. Require inspectable evidence

Ask the execution flow to print or capture concise evidence such as:

visited URL
page title
visible headings
extracted entities
timestamps or date clues
key tool outputs
final answer

The evaluator should be able to inspect why a verdict was reached.

4. Use explicit semantic rubrics

Never rely on vague instructions such as "judge whether it looks good."

Always define:

what evidence is required
what counts as a pass
what clearly fails
when uncertainty should become needs_review
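
One way to keep a rubric explicit is to store it as data and refuse to use it while any of the four parts is empty. The sketch below assumes a hypothetical `Rubric` class; the field names mirror the list above but are not defined by the skill.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rubric:
    """Explicit semantic rubric; the field names are illustrative assumptions."""
    required_evidence: tuple[str, ...]
    pass_when: tuple[str, ...]
    fail_when: tuple[str, ...]
    needs_review_when: tuple[str, ...]

    def problems(self) -> list[str]:
        """Name the sections left empty, since an empty section makes the rubric vague."""
        sections = {
            "required_evidence": self.required_evidence,
            "pass_when": self.pass_when,
            "fail_when": self.fail_when,
            "needs_review_when": self.needs_review_when,
        }
        return [name for name, entries in sections.items() if not entries]

    def render(self) -> str:
        """Render the rubric as plain text that can be handed to an evaluator."""
        lines = [
            "Required evidence: " + "; ".join(self.required_evidence),
            "Pass when: " + "; ".join(self.pass_when),
            "Fail when: " + "; ".join(self.fail_when),
            "Needs review when: " + "; ".join(self.needs_review_when),
        ]
        return "\n".join(lines)
```

A test harness could call `problems()` before a run and stop if it returns anything, so an incomplete rubric never reaches the evaluator.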

5. Prefer bounded confidence

If evidence is incomplete, contradictory, or too weak, do not force a pass.

Return needs_review.

Workflow

When invoked, design the test in the following order.

1. Identify why exact assertions are brittle

Classify the task:

dynamic web browsing
search or retrieval
LLM generation
multi-tool orchestration
end-to-end user flow

Then explain why literal equality or fixed snapshots are not sufficient.

2. Split deterministic checks from semantic checks

Write two groups:

Deterministic Checks

Use exact validation for stable parts, such as:

tool returned successfully
required fields are present
minimum number of results exists
source domain matches expectation
response includes a valid date range

Semantic Checks

Use agent evaluation for variable parts, such as:

relevance to the requested topic
freshness of the retrieved content
whether the answer reflects the gathered evidence
whether the workflow actually satisfies the intended task
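
The two groups can be wired together in more than one way. The sketch below shows one possible ordering, assuming the deterministic checks are plain predicates and the semantic evaluator is a hypothetical callable standing in for the agent's judgment: exact checks run first, and the evaluator is consulted only when all of them hold.

```python
from typing import Callable


def run_flow_test(
    evidence: dict,
    deterministic: list[Callable[[dict], bool]],
    semantic_evaluator: Callable[[dict], str],
) -> str:
    """Run exact checks first; ask the semantic evaluator only if they all pass."""
    for check in deterministic:
        if not check(evidence):
            # A broken invariant is a hard failure; no semantic judgment is needed.
            return "fail"
    # `semantic_evaluator` is a placeholder for the agent's rubric-based judgment.
    return semantic_evaluator(evidence)
```

Running the cheap, stable checks first is a design choice made here for illustration; it avoids spending an evaluator call on a run that has already broken an invariant.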

3. Define the evidence schema

Specify exactly what the run should log or output.

Recommended evidence fields:

task
source_url
source_title
extracted_items
freshness_signals
intermediate_results
final_answer
evaluator_notes

Keep evidence minimal but sufficient for review.
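
As an illustration, the recommended fields could be captured in a small record type. The field names below follow the list above; the types, defaults, and helper methods are assumptions made for this sketch.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Evidence:
    """Evidence record using the recommended field names; the types are assumptions."""
    task: str
    source_url: str = ""
    source_title: str = ""
    extracted_items: list[str] = field(default_factory=list)
    freshness_signals: list[str] = field(default_factory=list)
    intermediate_results: list[str] = field(default_factory=list)
    final_answer: str = ""
    evaluator_notes: str = ""

    def missing(self) -> list[str]:
        """Names of fields that are still empty, so gaps are visible to the evaluator."""
        return [name for name, value in asdict(self).items() if not value]

    def to_json(self) -> str:
        """Serialize the record so it can be logged alongside the verdict."""
        return json.dumps(asdict(self), indent=2)
```

Listing the empty fields explicitly gives the evaluator a concrete basis for returning needs_review when evidence is incomplete.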

4. Define the verdict rubric

Use this baseline:

Pass

the agent reached a relevant source or completed the intended flow
collected evidence supports the conclusion
the final output is relevant and sufficiently current for the task
there is no major contradiction between evidence and answer

Fail

the agent failed to reach a relevant source or complete the flow
the result is clearly irrelevant, stale, or fabricated
the output contradicts the evidence
the workflow misses a required user objective

Needs Review

evidence is partial or ambiguous
freshness cannot be determined confidently
multiple interpretations remain plausible
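
The baseline rubric can be read as a decision procedure. The sketch below is one interpretation, not the skill's definition: the `Findings` attributes are hypothetical labels for the bullets above, and the ordering (clear failures first, then uncertainty, then pass) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Findings:
    """Evaluator findings; attribute names are hypothetical labels for the rubric bullets."""
    reached_relevant_source: bool
    evidence_supports_answer: bool
    answer_contradicts_evidence: bool
    freshness: str  # "current", "stale", or "unknown"
    ambiguous: bool = False


def decide(f: Findings) -> str:
    """Map findings to a bounded verdict: "pass", "fail", or "needs_review"."""
    # Clear failures take priority over everything else.
    if (not f.reached_relevant_source
            or f.answer_contradicts_evidence
            or f.freshness == "stale"):
        return "fail"
    # Uncertainty is never rounded up to a pass.
    if f.ambiguous or f.freshness == "unknown" or not f.evidence_supports_answer:
        return "needs_review"
    return "pass"
```

Under this reading, evidence that neither supports nor contradicts the answer counts as partial and yields needs_review rather than fail.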

5. Produce a structured test spec

Return the design in this format:

## Test Intent

## Why Exact Assert Fails

## Deterministic Checks

## Evidence To Collect

## Semantic Rubric

## Execution Notes

## Final Verdict Format

Output Template

## Test Intent
- Validate that:

## Why Exact Assert Fails
- Dynamic factors:
- Why literal equality is brittle:

## Deterministic Checks
- Check 1:
- Check 2:

## Evidence To Collect
- Evidence 1:
- Evidence 2:

## Semantic Rubric
- Pass when:
- Fail when:
- Needs review when:

## Execution Notes
- Constraints:
- Allowed variance:
- Safety concerns:

## Final Verdict Format
- verdict: pass | fail | needs_review
- reason:
- evidence:
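
A verdict in this format is easy to check mechanically before it is stored. The following is a minimal sketch, assuming the verdict arrives as a dictionary with the three keys named in the template; `validate_verdict` is a hypothetical helper, not part of the skill.

```python
ALLOWED_VERDICTS = {"pass", "fail", "needs_review"}


def validate_verdict(record: dict) -> dict:
    """Check a verdict record against the three-field format and return it unchanged."""
    for key in ("verdict", "reason", "evidence"):
        if key not in record:
            raise ValueError(f"missing key: {key}")
    # Only the three bounded verdicts are accepted.
    if record["verdict"] not in ALLOWED_VERDICTS:
        raise ValueError(f"unknown verdict: {record['verdict']!r}")
    # A verdict without a stated reason cannot be inspected later.
    if not str(record["reason"]).strip():
        raise ValueError("reason must not be empty")
    return record
```

Rejecting values outside the three allowed verdicts keeps an evaluator from inventing intermediate outcomes such as "mostly pass".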

Example

Task: verify that visiting a news site returns today's news rather than stale content.

Good test design:

deterministic checks confirm the page loads and at least one article item is collected
evidence includes the visited site, page title, visible headlines, date clues, and final summary
semantic rubric passes when the result clearly reflects same-day or current reporting from the visited source
semantic rubric fails when headlines are outdated, unrelated, or invented
semantic rubric returns needs_review when freshness cannot be established from the evidence

Bad test design:

assert returned_text == "Today's news is ..."
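
By contrast, the freshness part of the good design can be made partly machine-checkable. The sketch below assumes the date clues were collected as ISO strings (`YYYY-MM-DD`); `classify_freshness` is a hypothetical helper, and clues in any other form are left to the semantic evaluator.

```python
from datetime import date


def classify_freshness(date_clues: list[str], today: date) -> str:
    """Classify date clues as "current", "stale", or "unknown".

    Assumes clues are ISO dates (YYYY-MM-DD); anything unparseable is ignored.
    """
    parsed = []
    for clue in date_clues:
        try:
            parsed.append(date.fromisoformat(clue))
        except ValueError:
            continue  # Not a machine-readable date; leave it to the semantic evaluator.
    if not parsed:
        # No usable signal: freshness cannot be established from the evidence.
        return "unknown"
    # Same-day reporting counts as current; any other latest date is treated as stale.
    return "current" if max(parsed) == today else "stale"
```

An "unknown" result corresponds to the needs_review case in the example, and "stale" to the fail case; passing `today` in as an argument keeps the check reproducible.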

Guidance

When using this skill:

keep traditional asserts for stable invariants
use semantic evaluation only where exact matching becomes brittle
prefer narrow rubrics over subjective judgment
require visible evidence before passing the test
state uncertainty explicitly instead of masking it

Deliverables

When asked to design a flow test, provide:

a structured test spec
deterministic checks
an evidence schema
a semantic rubric
a final verdict format

Usage Guidance

This skill appears coherent and safe as an instruction-only test designer, but keep in mind: the tests it designs will capture and log evidence (URLs, page content, extracted items), which can include sensitive or private data if run against authenticated services or user content. Only run these tests against public or authorized targets, avoid supplying unrelated credentials, and review collected evidence before storing or sharing it. If you plan to have the agent run tests autonomously against production systems, add safeguards (rate limits, access controls, and human review for 'needs_review' cases).

Capability Analysis

Type: OpenClaw Skill
Name: flow-test
Version: 0.0.1

The skill bundle defines a framework for designing 'flow tests' that combine deterministic assertions with semantic evaluation by an AI agent. The instructions in SKILL.md are purely methodological: they provide templates and principles for testing complex LLM workflows and show no evidence of malicious intent, data exfiltration, or unauthorized execution.

Capability Assessment

✓ Purpose & Capability

The name/description match the SKILL.md: it designs semantic/flow tests and asks for evidence and rubrics. There are no unexpected env vars, binaries, or installs required that would be unrelated to test design.

ℹ Instruction Scope

The instructions are scoped to designing tests: splitting deterministic vs semantic checks, specifying evidence to collect, and defining rubrics. They do ask the executor to capture evidence (URLs, page titles, extracted items), which is appropriate for the stated purpose. Note: that evidence could include sensitive or private content if the agent runs against authenticated/private targets — the skill itself does not instruct reading local files or credentials.

✓ Install Mechanism

No install spec and no code files — instruction-only, so nothing is written to disk or downloaded by the skill itself.

✓ Credentials

The skill declares no required environment variables, credentials, or config paths. There is no disproportionate request for secrets or external access in the manifest or instructions.

✓ Persistence & Privilege

The `always` flag is false and the skill is user-invocable; its instructions do not request permanent presence or modifications to other skills or global agent settings.

Version History

v0.0.1

- Initial release of the "flow-test" skill for agent-evaluated testing of browser tasks, LLM outputs, and tool workflows.
- Provides a structured approach for designing tests where exact asserts are brittle and semantic success is more important than literal equality.
- Introduces clear guidelines for when to use flow tests versus traditional assertions.
- Defines an evidence-based evaluation process, including deterministic checks, semantic rubrics, and a bounded verdict system (`pass`, `fail`, `needs_review`).
- Includes output templates and deliverable requirements for consistent test design.

Metadata

Slug flow-test

Version 0.0.1

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 1

Frequently Asked Questions

What is flow test?

Designs agent-evaluated flow tests for browser tasks, LLM outputs, and tool workflows. Invoke when exact asserts are brittle and semantic success matters mor... It is an AI Agent Skill for Claude Code / OpenClaw, with 137 downloads so far.

How do I install flow test?

Run "/install flow-test" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is flow test free?

Yes, flow test is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does flow test support?

flow test is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created flow test?

It is built and maintained by QipengGuo (@qipengguo); the current version is v0.0.1.
