/install seaportal
Web Navigation with SeaPortal
CLI-first read-only web fetcher. Use the seaportal command. No JavaScript execution — for SPAs/blocked pages, escalate to a real browser.
Core Commands
seaportal <url> # Markdown + frontmatter (also writes renders/seaportal/*.md, *.json)
seaportal --json <url> # Full Result struct as JSON
seaportal --xml <url> # TEI-Lite XML (teiHeader metadata + text/body content)
seaportal --snapshot <url> # Accessibility tree as JSON
seaportal --snapshot --format=compact <url> # Accessibility tree as text (most token-efficient)
seaportal --snapshot --filter=interactive --format=compact <url> # Only links/buttons/inputs
seaportal --max-tokens=2000 <url> # Cap Markdown body size (paragraph-boundary cut, sets truncated:true)
seaportal --snapshot --max-tokens=2000 <url> # Cap snapshot tree size
seaportal --fast <url> # Bail early if browser is needed
seaportal --head-only <url> # 16 KB range fetch: metadata + canonical only, no body
seaportal --respect-robots <url> # Consult robots.txt; refuse disallowed fetches
seaportal --rate-limit=500ms <url> # Min interval between same-host requests
seaportal --probe-search <url> # Override to needs-browser when search URL yields no results
seaportal --no-dedupe <url> # Disable repeated-block dedup
seaportal --no-prune-fallback <url> # Disable the heuristic fallback when readability is thin
seaportal --with-links <url> # Add structured list of discovered <a> links to output
seaportal --with-images <url> # Add structured list of discovered <img> entries to output
seaportal --with-tables <url> # Add structured tables (caption/headers/rows) to output, layout tables flattened
seaportal --with-comments <url> # Emit user-generated comments (Disqus/native) separately in result.comments (stripped from Content by default either way)
seaportal --links=text <url> # Markdown link retention: none|text|all|footer (default all)
seaportal --citations <url> # Synonym for --links=footer (back-compat)
seaportal --chunk=heading <url> # Chunk Markdown by heading / sentence / window
seaportal --select=".main-content" <url> # Scope extraction to a CSS subtree
seaportal --strip=".ads,.cookie-banner" <url> # Remove matching elements before extraction
seaportal --retries 5 <url> # Override default retry count (3)
seaportal --max-retry-wait 10s <url> # Override single backoff cap (30s)
seaportal --retry-timeout 60s <url> # Override total retry budget (90s)
seaportal --base-url=URL - # Read HTML from stdin; --base-url resolves relative links
seaportal --ua=googlebot <url> # User-Agent preset or literal string
seaportal --proxy=http://user:pass@proxy:8080 <url> # Route via HTTP(S) proxy
seaportal --cache=/tmp/sp-cache <url> # Reuse fresh 200 OK responses from disk (opt-in; default TTL 24h)
seaportal --cache=/tmp/sp-cache --cache-stale-tolerance=5m <url> # Stale-while-revalidate: serve stale within TTL+tolerance, refresh in background
seaportal --no-pdf <url> # Skip PDF extraction (default: extract PDF text)
seaportal --schema=path/to/schema.yaml <url> # Apply a CSS schema (JSON/YAML), populate result.schema
seaportal --query="compound interest" <url> # BM25-rank H2/H3 sections by relevance, annotate result.rankedSections
seaportal --query=... --top-n=3 <url> # Keep only the top-N most relevant sections in rankedSections
seaportal --query=... --top-n=3 --filter-by-query <url> # Replace Content with concatenated top-N sections
seaportal --split-out=dir/ --split-bytes=32768 <url> # Shard output across multiple files; print manifest (path index/of bytes) to stdout
seaportal --version
Subcommands
The default verb (no subcommand) is URL extraction: seaportal <url> behaves exactly as documented above.
- seaportal sitemap <url>: fetch a sitemap.xml, recurse into nested <sitemapindex> references, decompress .gz, and print one URL per line. Flags: --json (emit JSON array of {loc,lastmod,changefreq,priority} entries), --max-urls N (default 50000), --max-depth N (default 5), --allow-internal (permit trusted private/internal hosts). Example: seaportal sitemap https://example.com/sitemap.xml --json.
- seaportal feed <url>: fetch and parse RSS 2.0, Atom 1.0, or JSON Feed 1.x into a unified {title, link, published, summary, author, guid} shape (format sniffed from the root element / first byte). Default output is one TSV line per item (published title link). Flags: --json (emit JSON array), --max-items N (default 200), --allow-internal (permit trusted private/internal hosts). Example: seaportal feed https://example.com/feed.xml --json.
- seaportal mcp: run as an MCP (Model Context Protocol) server over JSON-RPC 2.0 line-delimited stdio. Exposes four tools (fetch_url, fetch_snapshot, parse_sitemap, parse_feed), each routing to the library entry point of the same shape. No flags; configuration flows through MCP tool arguments. See MCP integration below.
- seaportal help: usage summary including subcommands.
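The default TSV output of seaportal feed can be consumed with a few lines of Python. The following is a minimal sketch, assuming each line carries exactly three tab-separated columns in the order published, title, link, as the description above suggests. The function name parse_feed_tsv and the sample rows are hypothetical and were not captured from a real run.

```python
import csv
import io


def parse_feed_tsv(text):
    """Parse `published<TAB>title<TAB>link` lines into a list of dicts."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    items = []
    for row in reader:
        if len(row) != 3:
            continue  # skip blank or malformed lines
        published, title, link = row
        items.append({"published": published, "title": title, "link": link})
    return items


# Hypothetical sample output (assumed column order: published, title, link).
sample = (
    "2024-01-02T00:00:00Z\tHello\thttps://example.com/a\n"
    "2024-01-03T00:00:00Z\tWorld\thttps://example.com/b\n"
)
items = parse_feed_tsv(sample)
```

When titles may contain tabs or the richer fields (summary, author, guid) are needed, the --json flag is likely the safer choice.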
MCP integration
Register seaportal as an MCP server in your editor (Claude Desktop / Claude Code / Cursor) — example claude_desktop_config.json:
{
  "mcpServers": {
    "seaportal": {
      "command": "seaportal",
      "args": ["mcp"]
    }
  }
}
Tools exposed: fetch_url ({url, dedupe?, fast?, with_links?, with_images?, with_tables?, with_comments?, max_tokens?}), fetch_snapshot ({url, filter?, max_tokens?, allow_internal?}), parse_sitemap ({url, max_depth?, max_urls?, allow_internal?}), parse_feed ({url, max_items?, allow_internal?}). Each returns its library result as a single JSON text content block.
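For callers that speak to the server directly instead of through an editor, a tool invocation is one JSON-RPC 2.0 message per line. The sketch below builds such a line for fetch_url using the standard MCP tools/call method. The helper name mcp_tool_call is hypothetical, and a real session would first perform the MCP initialize handshake, which is omitted here.

```python
import json


def mcp_tool_call(request_id, tool, arguments):
    """Build one line-delimited JSON-RPC 2.0 request for an MCP tool call."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    # Compact separators keep the message on a single line.
    return json.dumps(message, separators=(",", ":")) + "\n"


# Argument names (url, max_tokens) are taken from the tool list above.
line = mcp_tool_call(1, "fetch_url", {"url": "https://example.com", "max_tokens": 2000})
```

The resulting line would be written to the stdin of a seaportal mcp process, with the reply read back from its stdout.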
User-Agent presets
--ua=<name> accepts curated presets; unknown values pass through as literal UA strings. Empty (default) sends a real Chrome UA. Per-host DomainUserAgent overrides still win.
- chrome (default), safari, firefox: real browser UAs
- googlebot, bingbot, search-bot: bot UAs (may trigger reverse-DNS challenges)
- seaportal: honest self-identify for cooperative sites
Cache
--cache=<dir> is opt-in: only fresh 200 OK responses are stored, keyed by SHA-256 of URL + Accept/Accept-Language/User-Agent. --cache-ttl=<dur> controls freshness (default 24h). --no-cache bypasses reads but still writes, i.e. "force refresh". Errors / 3xx / 4xx / 5xx are never cached. Result includes cacheHit: true on a replay.
Past-TTL entries that carry ETag or Last-Modified are automatically re-validated with a conditional GET (If-None-Match / If-Modified-Since). A 304 Not Modified replays the cached body, refreshes FetchedAt, and sets cacheRevalidated: true (distinct from cacheHit, which means "served from disk with no network call"). A 200 replaces the entry; other statuses leave the cache untouched.
--cache-stale-tolerance=<dur> enables stale-while-revalidate (SWR) semantics: entries whose age falls within TTL + tolerance are served immediately from disk (with cacheStale: true) while a background goroutine fires the conditional GET to refresh the entry for the next call. Default 0 keeps the existing synchronous-revalidate behaviour. The SWR band does not require validators on the cached entry: within tolerance the body is trusted unconditionally; beyond tolerance the existing validator-gated synchronous revalidation runs. --no-cache bypasses SWR entirely. Background refresh failures are silent and leave the stale entry intact for the next attempt.
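The three age bands described above (fresh, stale-within-tolerance, past tolerance) can be summarized as a small decision function. This is a sketch of the documented behaviour, not seaportal's actual code. The function name cache_decision, the inclusive boundaries, and the "refetch" outcome for past-TTL entries without validators are assumptions.

```python
from datetime import timedelta


def cache_decision(age, ttl, tolerance=timedelta(0), has_validator=False):
    """Classify a cached entry by age, following the bands in the text."""
    if age <= ttl:
        return "hit"         # cacheHit: served from disk, no network call
    if age <= ttl + tolerance:
        return "stale"       # cacheStale: serve now, refresh in the background
    if has_validator:
        return "revalidate"  # conditional GET; a 304 sets cacheRevalidated
    return "refetch"         # assumption: no ETag/Last-Modified means a plain fetch


ttl = timedelta(hours=24)
examples = [
    cache_decision(timedelta(hours=1), ttl),
    cache_decision(timedelta(hours=24, minutes=2), ttl, tolerance=timedelta(minutes=5)),
    cache_decision(timedelta(hours=25), ttl, tolerance=timedelta(minutes=5), has_validator=True),
]
```

With the default tolerance of 0 the "stale" band is empty, which matches the synchronous-revalidate default described above.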
PDF extraction
application/pdf responses are extracted by default: text is pulled page-by-page via ledongthuc/pdf and flows through the same Result.Content Markdown pipeline (link retention, truncation, chunking, cache) with ExtractionMethod="pdf" and --- page N --- separators. Pass --no-pdf to restore the legacy "skipped binary content" behaviour. Image-only / scanned PDFs yield an empty extraction error (no OCR).
HTTP transport
- HTTPS connections negotiate HTTP/2 via ALPN when the server offers h2; fall back to HTTP/1.1 otherwise.
- HTTP (no TLS) always uses HTTP/1.1.
- All HTTPS connections use a Chrome 122 TLS fingerprint via utls, which bypasses Cloudflare's Go-default-TLS bot detection.
- The negotiated protocol is surfaced on Result.protocol ("h2" or "http/1.1").
Proxy support
--proxy=URL routes the fetch through a proxy. http:// and https:// proxy URLs are supported with Basic auth taken from the URL userinfo (user:pass@host:port). HTTPS targets use a CONNECT tunnel; the Chrome TLS fingerprint is preserved end-to-end with the origin. socks5:// URLs work for HTTP target URLs only — HTTPS-over-SOCKS5 is a V1 limitation. Invalid proxy URLs fail fast with result.Error = "invalid proxy URL: ...".
Security defaults (local / internal URLs)
The CLI is safe by default (DefaultSecurityPolicy): it blocks targets that
resolve to private/internal IPs (SSRF guard), allows only http/https, caps
redirects at 10 with per-hop re-validation, and bounds the raw (50 MiB) and
decompressed (200 MiB) body.
- Reading localhost, 127.0.0.1, a 192.168.*/10.* host, or any intranet URL fails by default with Error: target resolves to a private/internal IP. Add --allow-internal to permit it (you are vouching the target is trusted).
- Other knobs: --max-redirects N, --allow-domains / --deny-domains, --trusted-resolve-cidrs, --max-response-bytes, --max-decompressed-bytes.
- The MCP server applies the same safe default to fetch_url, fetch_snapshot, parse_sitemap, and parse_feed.
- Caveats: with --proxy the dial-time rebinding check is skipped (the target is still vetted before fetch and on each redirect, just not at connect time).
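To make the SSRF guard concrete, the sketch below shows the kind of check that rejects private or loopback addresses unless the caller opts in. It is an illustration using Python's standard ipaddress module, not seaportal's implementation. The function names are hypothetical, and the real guard works on addresses obtained from DNS resolution, whereas this sketch takes a literal IP string.

```python
import ipaddress


def is_internal(ip_text):
    """True for private, loopback, link-local, or unspecified addresses."""
    ip = ipaddress.ip_address(ip_text)
    return ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_unspecified


def allowed(ip_text, allow_internal=False):
    """Mirror the safe default: internal targets need an explicit opt-in."""
    return allow_internal or not is_internal(ip_text)


blocked_by_default = [ip for ip in ("127.0.0.1", "192.168.1.10", "10.0.0.5", "8.8.8.8")
                      if not allowed(ip)]
```

Re-running the same check on every redirect hop is what the per-hop re-validation mentioned above amounts to.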
Per-host rate limiting
--rate-limit=DURATION enforces a minimum interval between requests to the same host. Useful primarily for library callers sharing a HostRateLimiter across calls via Options.RateLimiter; a single CLI invocation only fires one request, so the throttle has no cross-call effect by itself. Combines with --respect-robots crawl-delay (both apply; effective wait is their sum).
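The additive rule for --rate-limit and crawl-delay can be written as simple arithmetic. The helper below is a hypothetical illustration (it is not the HostRateLimiter API) that works in plain seconds and computes how long a caller would wait before the next request to the same host.

```python
def wait_before_next(now, last_request, rate_limit, crawl_delay=0.0):
    """Seconds to sleep before the next same-host request."""
    if last_request is None:
        return 0.0  # first request to this host: nothing to wait for
    min_interval = rate_limit + crawl_delay  # the text states the two waits add up
    return max(0.0, last_request + min_interval - now)


# 0.2 s after the last request, with a 500 ms rate limit and a 1 s crawl-delay.
remaining = wait_before_next(now=10.2, last_request=10.0, rate_limit=0.5, crawl_delay=1.0)
```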
Chunking
--chunk=NAME[:SIZE[:OVERLAP]] populates result.chunks (off by default). Runs after --max-tokens truncation; fenced code blocks are never split.
- --chunk=heading: split at H2/H3 boundaries; pre-heading prologue is its own chunk.
- --chunk=sentence:512: group sentences until ~512 tokens; heading inherited from nearest H2/H3 above.
- --chunk=window:2000:200: 2000-char windows, 200-char overlap, snapped to word boundaries.
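The window mode is the easiest of the three to illustrate. The sketch below cuts fixed-size character windows with overlap and snaps each window's end back to the last space, so a word is not split at the cut. It is a simplified stand-in for --chunk=window:2000:200 and not the tool's algorithm: the function name is hypothetical, only the window end is snapped, and fenced code blocks get no special handling.

```python
def window_chunks(text, size=2000, overlap=200):
    """Split text into overlapping character windows, cutting at spaces."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    chunks, start, n = [], 0, len(text)
    while start < n:
        end = min(start + size, n)
        if end < n:
            # Snap the cut back to the last space that still leaves the
            # window longer than the overlap, so the loop always advances.
            cut = text.rfind(" ", start + overlap + 1, end)
            if cut != -1:
                end = cut
        chunks.append(text[start:end])
        if end >= n:
            break
        start = end - overlap  # the next window re-reads the last `overlap` chars
    return chunks


text = "word " * 1000  # 5000 characters of sample input
chunks = window_chunks(text, size=2000, overlap=200)
```

Dropping the first 200 characters of every chunk after the first and concatenating the pieces reproduces the input, which is a convenient sanity check for any overlap scheme.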
Query relevance
--query="..." scores each H2/H3-bounded Markdown section against the query with standard BM25 (k1=1.5, b=0.75) and populates result.rankedSections (descending score). Pure additive by default — Content is untouched. Combine with --top-n=N to truncate, or --filter-by-query to replace Content with the concatenated top-N sections (default top-3 when --top-n is unset). No stopwords/stemming in V1; IDF handles common words naturally.
- seaportal --json --query="formula" <url>: annotate all sections with scores.
- seaportal --json --query="formula" --top-n=3 <url>: keep only the 3 highest-scoring sections in rankedSections.
- seaportal --query="formula" --top-n=3 --filter-by-query <url>: Markdown body becomes just those 3 sections.
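For readers unfamiliar with BM25, the following sketch ranks a list of sections against a query with the parameters stated above (k1=1.5, b=0.75). It is an illustration of the scoring idea, not seaportal's code. The whitespace tokenizer, the specific IDF variant (the common ln(1 + (N - n + 0.5)/(n + 0.5)) form), and the function name bm25_rank are assumptions.

```python
import math
from collections import Counter

K1, B = 1.5, 0.75  # parameters stated in the text


def bm25_rank(sections, query):
    """Return (score, section_index) pairs, highest score first."""
    docs = [s.lower().split() for s in sections]  # assumed tokenizer: whitespace
    terms = query.lower().split()
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n if n else 0.0
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scores = []
    for i, d in enumerate(docs):
        tf = Counter(d)
        score = 0.0
        for t in terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + K1 * (1 - B + B * len(d) / avgdl)
            score += idf * tf[t] * (K1 + 1) / norm
        scores.append((score, i))
    return sorted(scores, key=lambda pair: (-pair[0], pair[1]))


sections = [
    "## Intro\nThis page explains savings accounts.",
    "## Formula\nCompound interest formula: interest on interest.",
    "## Contact\nEmail us.",
]
ranked = bm25_rank(sections, "compound interest")
```

Taking the first N entries of ranked corresponds to what --top-n=N does to rankedSections.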
Schema extraction
--schema=<path> applies a declarative CSS schema (JSON or YAML, format sniffed from extension) to the raw HTML and surfaces the result as result.schema in JSON output. Three modes per field: single value (selector only), multiple values (multiple: true), nested array of objects (fields: populated). Optional attr: reads an attribute instead of text. Runs on the pre-preprocess DOM so chrome elements (nav/sidebar) are reachable. Bad selectors or load failures become warnings, never crash. Example schema:
fields:
  title: { selector: h1 }
  tags: { selector: .tag, multiple: true }
  products:
    selector: .product
    fields:
      name: { selector: .name }
      price: { selector: .price, attr: data-price }
Invoke: seaportal --json --schema=./schema.yaml <url>. XPath is a V2 limitation: CSS only for now.
Chaining with a browser fetcher
When a page needs JS execution, let pinchtab (or any other fetcher) render the HTML, then pipe it into seaportal for extraction:
pinchtab fetch https://example.com | seaportal --base-url https://example.com --json -
--base-url is required in stdin mode. Network-only flags (--head-only, --respect-robots, --retries) are no-ops in stdin mode and produce a stderr warning.
Workflow: navigating a site
- Fetch the entry point as Markdown: seaportal <url>. Read the frontmatter: pageClass, trustworthy, needsBrowser, confidence tell you if extraction is reliable.
- Decide next step from pageClass:
  - static / ssr / hydrated → trustworthy, keep using seaportal.
  - spa / dynamic → JS-only content, stop and escalate to a browser (pinchtab).
  - blocked → bot-protection or login wall, escalate.
- Discover links: extract URLs from the Markdown body, OR run seaportal --snapshot --filter=interactive --format=compact <url> to see only links/buttons with their hrefs. Each entry has a stable ref (e1, e2, …) and CSS selector.
- Follow a link: take the href, resolve against the page URL if relative, and re-run seaportal <new-url>.
- Repeat until you have what you need. Track visited URLs to avoid loops.
There is no session, no click, no form submit — every navigation is a fresh HTTP GET. To "click" a link you re-invoke seaportal on its href.
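The "resolve, then track visited" part of the workflow above can be sketched with the standard library alone. The helper name next_urls is hypothetical. Dropping fragments and skipping non-HTTP schemes are design choices of this sketch, not behaviour documented for seaportal.

```python
from urllib.parse import urldefrag, urljoin


def next_urls(page_url, hrefs, visited):
    """Resolve hrefs against page_url and return those not yet visited."""
    fresh = []
    for href in hrefs:
        # Resolve relative links, then drop the #fragment so the same page
        # is not fetched twice under different anchors.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if not absolute.startswith(("http://", "https://")):
            continue  # mailto:, javascript:, etc. cannot be fetched
        if absolute in visited:
            continue
        visited.add(absolute)
        fresh.append(absolute)
    return fresh


visited = {"https://example.com/"}
todo = next_urls(
    "https://example.com/",
    ["/docs", "docs#intro", "mailto:a@b.c", "https://example.com/"],
    visited,
)
```

Each URL in todo would then be passed to a fresh seaportal invocation, since every navigation is an independent HTTP GET.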
Choosing output format
| Goal | Command |
|---|---|
| Read article / docs content | seaportal <url> (Markdown) |
| Programmatic decision-making | seaportal --json <url> |
| Map of page structure (cheap on tokens) | seaportal --snapshot --format=compact <url> |
| Just the actionable links/buttons | seaportal --snapshot --filter=interactive --format=compact <url> |
| Large page, must cap tokens | add --max-tokens=2000 (caps snapshot tree in snapshot mode, or Markdown body otherwise; truncates at the latest paragraph boundary and sets truncated: true) |
Compact snapshot rows look like:
e2 link "Docs" <a> [interactive] href=/docs
e5 heading "Welcome" <h1> level=1
Classification cheat sheet (when to escalate)
pageClass field in the frontmatter / JSON:
| Class | Action |
|---|---|
| static | Use seaportal: pure HTML, high confidence. |
| ssr | Use seaportal: server-rendered. |
| hydrated | Use seaportal: SSR + JS, usually fine. |
| spa | Escalate: JS-only, seaportal will return little/empty. |
| dynamic | Escalate: heavy client rendering. |
| blocked | Escalate: bot protection, captcha, or login wall. |
Also escalate if needsBrowser: true or validationOk: false. Use --fast when you want seaportal to bail early on any of these instead of doing full extraction.
For programmatic routing, prefer profile.decision + profile.browserRecommended over mapping pageClass yourself: one explicit decision (static-high-confidence / static-ok / static-caution / browser-needed / blocked / unreachable / not-found / unsupported), where browserRecommended: true means a real browser is likely to help. See docs/reference/browser-discriminator.md.
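The cheat sheet and the two extra escalation signals can be folded into one predicate. This is a minimal sketch that assumes the frontmatter or --json output has already been parsed into a flat dict with the field names used above. The function name should_escalate is hypothetical, and where profile.decision is available it remains the preferred signal.

```python
ESCALATE_CLASSES = {"spa", "dynamic", "blocked"}


def should_escalate(meta):
    """Decide from parsed metadata whether a real browser is needed."""
    if meta.get("pageClass") in ESCALATE_CLASSES:
        return True
    if meta.get("needsBrowser") is True:
        return True
    if meta.get("validationOk") is False:
        return True
    return False


decisions = [
    should_escalate({"pageClass": "static"}),
    should_escalate({"pageClass": "spa"}),
    should_escalate({"pageClass": "ssr", "needsBrowser": True}),
    should_escalate({"pageClass": "hydrated", "validationOk": False}),
]
```

Missing fields are treated as "no signal" here, so an absent validationOk does not by itself trigger escalation.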
Thin Markdown? Try the snapshot before escalating
If seaportal classified the page as static/ssr/hydrated (i.e. it thinks extraction succeeded) but the Markdown body looks thin (length < ~1500, no real paragraphs, mostly headings or naked links), don't escalate yet. Readability sometimes prunes link-heavy or table-heavy sections that the accessibility tree still has in full. Retry with:
seaportal --snapshot --format=compact <url> # whole structure
seaportal --snapshot --filter=interactive --format=compact <url> # just links/buttons
If the snapshot returns substantial nodes, use those. If the snapshot is also thin, then escalate. Canonical situations where the fallback is worth a try: government index pages (usa.gov, gov.uk) where the link set is the content, and reference/listing pages where the prose is incidental to the structure. Don't loop — one snapshot retry is enough; if it's not there, it's not coming.
For search URLs specifically, --probe-search short-circuits CNN/DDG-style JS search shells with reason client-rendered-search — when set, seaportal flips outcome to needs-browser if a search-shaped URL returns a short body with no result-list structure.
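The "one snapshot retry, then escalate" rule can be expressed as a small state function. Everything in this sketch beyond the ~1500-character threshold is an assumption made for illustration: the definition of a "real paragraph" (a block of at least 20 words that does not start with list, heading, link, or table markup), the choice to treat either symptom as "thin", and the min_rows cutoff for a "substantial" snapshot.

```python
def looks_thin(markdown, min_chars=1500):
    """Heuristic: short body, or no block that reads like a prose paragraph."""
    paragraphs = [
        block for block in markdown.split("\n\n")
        if block.strip()
        and not block.lstrip().startswith(("#", "-", "*", "[", "|"))
        and len(block.split()) >= 20  # assumed minimum for a "real" paragraph
    ]
    return len(markdown) < min_chars or not paragraphs


def next_step(markdown, snapshot_rows=None, min_rows=10):
    """Pick the next action; snapshot_rows is None until the retry has run."""
    if not looks_thin(markdown):
        return "use-markdown"
    if snapshot_rows is None:
        return "try-snapshot"
    return "use-snapshot" if len(snapshot_rows) >= min_rows else "escalate"


thin_page = "# Title\n\n[a](/a)\n\n[b](/b)"
full_page = "\n\n".join(["word " * 100] * 5)
```

Because snapshot_rows is only ever supplied once, the function cannot loop: a thin snapshot leads straight to escalation.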
Output side-effects
The default Markdown mode (no flag) also writes two files under ./renders/seaportal/<domain>_<timestamp>.{md,json} from the working directory. If running from a directory where that's unwanted, use --json or --snapshot instead; those only print to stdout.
TEI-Lite XML output (--xml)
--xml emits a TEI-Lite-shaped document: <teiHeader> with title/author/published-date/language/source URL, plus <text><body> containing the Markdown converted into <head>, <list>/<item>, <code>, and <p> elements. Mutually exclusive with --json (exit 2). Scope is basic structural shaping; full TEI ODD validation, footnotes, and cross-references are out of scope.
Output splitting
--split-out=<dir> shards the rendered output into multiple files under <dir>, capped at --split-bytes (default --max-tokens × 4 or 32 KB). Prefers existing --chunk boundaries; otherwise splits on paragraph boundaries. Files are named <url-slug>-NNN.{md,json} and written atomically (.tmp + rename). Stdout receives a TSV manifest (path index/of bytes) instead of the content body. Not supported with --xml.
What this skill is NOT
- Not a browser. No JS execution, no clicks, no form fills, no cookies/sessions.
- Not stateful. Each call is independent.
- For interactive flows, multi-step forms, or auth, use the pinchtab skill instead. A common pattern is: seaportal first, pinchtab on escalation.
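- Make sure OpenClaw is installed (locally or via Docker).
- Enter the install command in the chat box: /install seaportal
- Once installed, invoke the skill by name or trigger it with /seaportal.
- Supply the required inputs per the skill's parameter description to get structured output.
What is SeaPortal?
Use this skill when an agent needs to read or navigate websites without a browser: fetch a page as clean Markdown, get a JSON accessibility snapshot of inter... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 54 downloads to date.
How do I install SeaPortal?
Run the command "/install seaportal" in the OpenClaw or Claude Code chat box for a one-step install; no extra configuration is needed.
Is SeaPortal free?
Yes. SeaPortal is completely free and released under the MIT-0 license; it can be downloaded, installed, and used freely.
Which platforms does SeaPortal support?
SeaPortal is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.
Who develops SeaPortal?
It is developed and maintained by PinchTab (@pinchtab); the current version is v0.1.1.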