flow test

by QipengGuo · GitHub ↗ · v0.0.1 · MIT-0
cross-platform · ✓ Security Clean
Downloads: 137 · Stars: 1 · Active Installs: 0 · Versions: 1
Install in OpenClaw
/install flow-test
Description
Designs agent-evaluated flow tests for browser tasks, LLM outputs, and tool workflows. Invoke when exact asserts are brittle and semantic success matters mor...
README (SKILL.md)

Flow Test

Use this skill to design tests for tasks that cannot be validated reliably with traditional unit-test assertions alone.

This skill is for flow testing: the agent performs a realistic task, records key evidence from the process, and then judges success with an explicit semantic rubric.

Invoke this skill when:

  • the task depends on live or changing web content
  • the output can vary but still be correct
  • the workflow spans multiple model or tool steps
  • intermediate evidence matters more than one exact final string
  • you need to verify user intent was satisfied, not exact wording

Do not use this skill when:

  • the result is deterministic and easy to assert directly
  • a schema check, exact match, snapshot, or pure function test is enough
  • the requirement can be covered fully by normal unit or integration tests

Objective

Turn a fuzzy requirement into a test design that combines:

  • deterministic checks for stable invariants
  • evidence collection for dynamic execution
  • semantic evaluation for variable outcomes
  • a bounded verdict of pass, fail, or needs_review

Design Principles

1. Keep asserts where they still work

Do not replace traditional tests blindly. Preserve exact checks for stable facts such as:

  • tool call success
  • required fields
  • minimum counts
  • status codes
  • domain restrictions
  • date or freshness constraints when machine-checkable
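
As an illustration of the stable facts listed above, the following is a minimal sketch of exact checks run against a tool result. The `result` dictionary shape, the field names, and the `deterministic_checks` helper are assumptions made for this example; the skill does not prescribe them.

```python
from urllib.parse import urlparse

# Hypothetical result shape: {"status_code", "source_url", "items"}.
REQUIRED_FIELDS = ("status_code", "source_url", "items")


def deterministic_checks(result: dict, allowed_domains: set, min_items: int = 1) -> list:
    """Return the names of failed checks; an empty list means all passed."""
    failures = []
    # Status code: the tool call must have succeeded.
    if result.get("status_code") != 200:
        failures.append("status_code")
    # Required fields: every expected key must be present.
    for name in REQUIRED_FIELDS:
        if name not in result:
            failures.append(f"missing_field:{name}")
    # Minimum count: at least min_items items were collected.
    if len(result.get("items", [])) < min_items:
        failures.append("min_count")
    # Domain restriction: the visited host must be on the allow list.
    host = urlparse(result.get("source_url", "")).hostname or ""
    if host not in allowed_domains:
        failures.append("domain")
    return failures
```

Returning the list of failed check names, rather than a single boolean, keeps the failure inspectable in the same spirit as the evidence the skill asks for.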

2. Judge task completion, not exact phrasing

Prefer questions like:

  • did the agent reach the right source
  • did it gather relevant information
  • does the final answer satisfy the user request

Avoid requiring one exact string unless the wording itself is the requirement.

3. Require inspectable evidence

Ask the execution flow to print or capture concise evidence such as:

  • visited URL
  • page title
  • visible headings
  • extracted entities
  • timestamps or date clues
  • key tool outputs
  • final answer

The evaluator should be able to inspect why a verdict was reached.

4. Use explicit semantic rubrics

Never rely on vague instructions such as "judge whether it looks good."

Always define:

  • what evidence is required
  • what counts as a pass
  • what clearly fails
  • when uncertainty should become needs_review

5. Prefer bounded confidence

If evidence is incomplete, contradictory, or too weak, do not force a pass.

Return needs_review.
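
One possible reading of this principle, sketched in code: missing evidence, unclear signals, or signals that disagree all collapse to `needs_review` instead of a forced pass. The `bounded_verdict` helper and the per-criterion `signals` list are hypothetical names for this sketch, not part of the skill.

```python
VALID_SIGNALS = {"pass", "fail", "unclear"}


def bounded_verdict(evidence: dict, required_fields: list, signals: list) -> tuple:
    """Combine per-criterion signals into (verdict, reason) without forcing a pass."""
    unknown = set(signals) - VALID_SIGNALS
    if unknown:
        raise ValueError(f"unknown signals: {sorted(unknown)}")
    # Incomplete evidence: a required field is absent or empty.
    missing = [name for name in required_fields if not evidence.get(name)]
    if missing:
        return "needs_review", "missing evidence: " + ", ".join(missing)
    # Weak evidence: nothing was judged, or a criterion could not be decided.
    if not signals or "unclear" in signals:
        return "needs_review", "evidence too weak to decide"
    # Contradictory evidence: criteria point in opposite directions.
    if "pass" in signals and "fail" in signals:
        return "needs_review", "signals contradict each other"
    # All remaining signals agree, so the first one is the verdict.
    return signals[0], "all signals agree"
```

Treating mixed pass and fail signals as `needs_review` is a conservative choice made for this sketch; a stricter design could map any failing criterion directly to `fail`.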

Workflow

When invoked, design the test in the following order.

1. Identify why exact assertions are brittle

Classify the task:

  • dynamic web browsing
  • search or retrieval
  • LLM generation
  • multi-tool orchestration
  • end-to-end user flow

Then explain why literal equality or fixed snapshots are not sufficient.

2. Split deterministic checks from semantic checks

Write two groups:

Deterministic Checks

Use exact validation for stable parts, such as:

  • tool returned successfully
  • required fields are present
  • minimum number of results exists
  • source domain matches expectation
  • response includes a valid date range

Semantic Checks

Use agent evaluation for variable parts, such as:

  • relevance to the requested topic
  • freshness of the retrieved content
  • whether the answer reflects the gathered evidence
  • whether the workflow actually satisfies the intended task
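
The split between the two groups can be sketched as a two-phase runner: exact checks run first, and the semantic evaluator is only consulted when they all pass. The `run_flow_test` function and the `semantic_judge` callable are assumptions for illustration; in practice the judge would be the agent applying the rubric.

```python
from typing import Callable


def run_flow_test(
    result: dict,
    deterministic: list,  # list of (name, check) pairs; check(result) -> bool
    semantic_judge: Callable[[dict], str],
) -> dict:
    """Run exact checks first; call the semantic judge only if they all pass."""
    failed = [name for name, check in deterministic if not check(result)]
    if failed:
        # Stable invariants are broken, so no semantic judgment is needed.
        return {"verdict": "fail", "reason": "deterministic checks failed", "failed": failed}
    # Variable parts (relevance, freshness, faithfulness) go to the evaluator.
    verdict = semantic_judge(result)
    return {"verdict": verdict, "reason": "semantic evaluation", "failed": []}


checks = [
    ("tool_ok", lambda r: r.get("ok") is True),
    ("min_results", lambda r: len(r.get("results", [])) >= 1),
]
# A stub judge stands in for the agent evaluator in this example.
report = run_flow_test({"ok": True, "results": ["x"]}, checks, lambda r: "needs_review")
```

Running the cheap exact checks first means an evaluator call is never spent on a run that already failed a stable invariant.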

3. Define the evidence schema

Specify exactly what the run should log or output.

Recommended evidence fields:

  • task
  • source_url
  • source_title
  • extracted_items
  • freshness_signals
  • intermediate_results
  • final_answer
  • evaluator_notes

Keep evidence minimal but sufficient for review.
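
A minimal sketch of the recommended fields as a typed record follows. The field names come from the list above; the types, the defaults, and the `to_log_line` helper with its `max_items` cap are assumptions added for this example.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Evidence:
    task: str
    source_url: str
    source_title: str
    final_answer: str
    extracted_items: list = field(default_factory=list)
    freshness_signals: list = field(default_factory=list)
    intermediate_results: list = field(default_factory=list)
    evaluator_notes: str = ""

    def to_log_line(self, max_items: int = 5) -> str:
        """Serialize to one JSON line, capping list fields to keep logs small."""
        record = asdict(self)  # deep copy, so the instance is not modified
        for key in ("extracted_items", "freshness_signals", "intermediate_results"):
            record[key] = record[key][:max_items]
        return json.dumps(record, ensure_ascii=False)
```

Capping the list fields is one way to keep evidence "minimal but sufficient"; the right cap depends on how much the evaluator needs to see.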

4. Define the verdict rubric

Use this baseline:

Pass

  • the agent reached a relevant source or completed the intended flow
  • collected evidence supports the conclusion
  • the final output is relevant and sufficiently current for the task
  • there is no major contradiction between evidence and answer

Fail

  • the agent failed to reach a relevant source or complete the flow
  • the result is clearly irrelevant, stale, or fabricated
  • the output contradicts the evidence
  • the workflow misses a required user objective

Needs Review

  • evidence is partial or ambiguous
  • freshness cannot be determined confidently
  • multiple interpretations remain plausible
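
The baseline rubric can also be kept as data and rendered into evaluator instructions, so the criteria stay explicit and reviewable. This is a sketch: the `Rubric` class, the `render_rubric` function, and the prompt wording are hypothetical, and reading the Pass list as "all must hold" and the other lists as "any may hold" is an interpretation, not something the skill states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rubric:
    pass_when: tuple
    fail_when: tuple
    needs_review_when: tuple


BASELINE = Rubric(
    pass_when=(
        "the agent reached a relevant source or completed the intended flow",
        "collected evidence supports the conclusion",
        "the final output is relevant and sufficiently current for the task",
        "there is no major contradiction between evidence and answer",
    ),
    fail_when=(
        "the agent failed to reach a relevant source or complete the flow",
        "the result is clearly irrelevant, stale, or fabricated",
        "the output contradicts the evidence",
        "the workflow misses a required user objective",
    ),
    needs_review_when=(
        "evidence is partial or ambiguous",
        "freshness cannot be determined confidently",
        "multiple interpretations remain plausible",
    ),
)


def render_rubric(rubric: Rubric, evidence_json: str) -> str:
    """Build evaluator instructions from the rubric and the logged evidence."""
    def bullets(items):
        return "\n".join(f"- {item}" for item in items)

    return (
        "Judge the run using only the evidence below.\n\n"
        f"Evidence:\n{evidence_json}\n\n"
        f"Return pass only if ALL of these hold:\n{bullets(rubric.pass_when)}\n\n"
        f"Return fail if ANY of these hold:\n{bullets(rubric.fail_when)}\n\n"
        f"Return needs_review if ANY of these hold:\n{bullets(rubric.needs_review_when)}\n\n"
        "Answer with: verdict, reason, evidence."
    )
```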

5. Produce a structured test spec

Return the design in this format:

## Test Intent

## Why Exact Assert Fails

## Deterministic Checks

## Evidence To Collect

## Semantic Rubric

## Execution Notes

## Final Verdict Format

Output Template

## Test Intent
- Validate that:

## Why Exact Assert Fails
- Dynamic factors:
- Why literal equality is brittle:

## Deterministic Checks
- Check 1:
- Check 2:

## Evidence To Collect
- Evidence 1:
- Evidence 2:

## Semantic Rubric
- Pass when:
- Fail when:
- Needs review when:

## Execution Notes
- Constraints:
- Allowed variance:
- Safety concerns:

## Final Verdict Format
- verdict: pass | fail | needs_review
- reason:
- evidence:
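
If the evaluator's reply follows the Final Verdict Format above, it can be validated before it is trusted. The `parse_verdict` helper below is a hypothetical sketch; falling back to `needs_review` on malformed output is a design choice made here, in line with the bounded-confidence principle.

```python
ALLOWED_VERDICTS = {"pass", "fail", "needs_review"}
EXPECTED_KEYS = ("verdict", "reason", "evidence")


def parse_verdict(text: str) -> dict:
    """Parse '- key: value' lines; fall back to needs_review if malformed."""
    fields = {}
    for line in text.splitlines():
        # Drop the leading list marker, then split on the first colon only,
        # so values that contain colons (such as URLs) stay intact.
        key, sep, value = line.strip().lstrip("-").strip().partition(":")
        if sep and key.strip() in EXPECTED_KEYS:
            fields[key.strip()] = value.strip()
    complete = all(fields.get(key) for key in EXPECTED_KEYS)
    if not complete or fields["verdict"] not in ALLOWED_VERDICTS:
        return {
            "verdict": "needs_review",
            "reason": "malformed evaluator output",
            "evidence": text.strip(),
        }
    return fields
```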

Example

Task: verify that visiting a news site returns today's news rather than stale content.

Good test design:

  • deterministic checks confirm the page loads and at least one article item is collected
  • evidence includes the visited site, page title, visible headlines, date clues, and final summary
  • semantic rubric passes when the result clearly reflects same-day or current reporting from the visited source
  • semantic rubric fails when headlines are outdated, unrelated, or invented
  • semantic rubric returns needs_review when freshness cannot be established from the evidence

Bad test design:

  • assert returned_text == "Today's news is ..."
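
The good design can be sketched for the narrow case where date clues are machine-readable ISO dates. The `news_freshness_verdict` function and its parameters are assumptions for this example; clues such as "2 hours ago" cannot be parsed this way and would still need the agent evaluator, which is why they lead to `needs_review` here.

```python
from datetime import date


def news_freshness_verdict(page_loaded: bool, headlines: list, date_clues: list, today: date) -> str:
    """Three-way verdict for the news example, using ISO date clues only."""
    # Deterministic checks: the page loaded and at least one article was collected.
    if not page_loaded or not headlines:
        return "fail"
    parsed = []
    for clue in date_clues:
        try:
            parsed.append(date.fromisoformat(clue))
        except ValueError:
            continue  # not an ISO date; this sketch cannot interpret it
    # Freshness cannot be established from the evidence.
    if not parsed:
        return "needs_review"
    newest = max(parsed)
    if newest == today:
        return "pass"
    if newest < today:
        return "fail"  # only outdated dates were found
    return "needs_review"  # a future date suggests a clock or timezone ambiguity
```

Unlike the exact-string assertion, this check tolerates any headline wording and fails only on the property the task cares about.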

Guidance

When using this skill:

  • keep traditional asserts for stable invariants
  • use semantic evaluation only where exact matching becomes brittle
  • prefer narrow rubrics over subjective judgment
  • require visible evidence before passing the test
  • state uncertainty explicitly instead of masking it

Deliverables

When asked to design a flow test, provide:

  • a structured test spec
  • deterministic checks
  • an evidence schema
  • a semantic rubric
  • a final verdict format
Usage Guidance
This skill appears coherent and safe as an instruction-only test designer, but keep in mind: the tests it designs will capture and log evidence (URLs, page content, extracted items), which can include sensitive or private data if run against authenticated services or user content. Only run these tests against public or authorized targets, avoid supplying unrelated credentials, and review collected evidence before storing or sharing it. If you plan to have the agent run tests autonomously against production systems, add safeguards (rate limits, access controls, and human review for 'needs_review' cases).
Capability Analysis
Type: OpenClaw Skill · Name: flow-test · Version: 0.0.1
The skill bundle defines a framework for designing 'flow tests' that combine deterministic assertions with semantic evaluation by an AI agent. The instructions in SKILL.md are purely methodological: they provide templates and principles for testing complex LLM workflows. There is no evidence of malicious intent, data exfiltration, or unauthorized execution.
Capability Assessment
Purpose & Capability
The name/description match the SKILL.md: it designs semantic/flow tests and asks for evidence and rubrics. There are no unexpected env vars, binaries, or installs required that would be unrelated to test design.
Instruction Scope
The instructions are scoped to designing tests: splitting deterministic vs semantic checks, specifying evidence to collect, and defining rubrics. They do ask the executor to capture evidence (URLs, page titles, extracted items), which is appropriate for the stated purpose. Note: that evidence could include sensitive or private content if the agent runs against authenticated/private targets — the skill itself does not instruct reading local files or credentials.
Install Mechanism
No install spec and no code files — instruction-only, so nothing is written to disk or downloaded by the skill itself.
Credentials
The skill declares no required environment variables, credentials, or config paths. There is no disproportionate request for secrets or external access in the manifest or instructions.
Persistence & Privilege
always is false and the skill is user-invocable; it does not request permanent presence or modifications to other skills or global agent settings in its instructions.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install flow-test
  3. After installation, invoke the skill by name or use /flow-test
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v0.0.1
- Initial release of the "flow-test" skill for agent-evaluated testing of browser tasks, LLM outputs, and tool workflows.
- Provides a structured approach for designing tests where exact asserts are brittle and semantic success is more important than literal equality.
- Introduces clear guidelines for when to use flow tests versus traditional assertions.
- Defines an evidence-based evaluation process, including deterministic checks, semantic rubrics, and a bounded verdict system (`pass`, `fail`, `needs_review`).
- Includes output templates and deliverable requirements for consistent test design.
Metadata
Slug flow-test
Version 0.0.1
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is flow test?

Designs agent-evaluated flow tests for browser tasks, LLM outputs, and tool workflows. Invoke when exact asserts are brittle and semantic success matters mor... It is an AI Agent Skill for Claude Code / OpenClaw, with 137 downloads so far.

How do I install flow test?

Run "/install flow-test" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is flow test free?

Yes, flow test is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does flow test support?

flow test is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created flow test?

It is built and maintained by QipengGuo (@qipengguo); the current version is v0.0.1.
