Description

Run only when the user explicitly asks for a fleet-wide Cosmos DB for MongoDB (RU) health check — scans NormalizedRU consumption, service availability, serve...

README (SKILL.md)

<!-- Auto-generated for OpenClaw by pack-openclaw. Notes for OpenClaw users: - Claude Code dynamic expressions (!...) in this file are NOT evaluated by OpenClaw and appear as literal text. Run them manually at the start of the workflow. - Invoke this skill only via slash command (e.g. /amg-check-cosmosdb-mongo-ru). Auto-invocation is disabled on Claude Code but not on OpenClaw. -->

OpenClaw Setup (one-time)

This skill calls MCP tools prefixed with mcp__amg__*, so OpenClaw must have an MCP server registered under the exact name amg. Run this once per workspace before invoking the skill:

openclaw mcp set amg '{"url":"https://<your-grafana-instance>/api/azure-mcp","transport":"streamable-http","headers":{"Authorization":"Bearer <your-token>"}}'

Replace <your-grafana-instance> with your Azure Managed Grafana endpoint and <your-token> with a valid Grafana service-account token (starts with glsa_). The server name must be amg: the skill's allowed-tools reference mcp__amg__* and will not find tools under any other name.
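Hand-editing the JSON argument is an easy place to introduce a quoting mistake. The following is a minimal sketch, not part of the skill, that builds the same command line from the two placeholder values. The helper name `build_mcp_set_command` is hypothetical; the URL path, transport, and header layout are copied from the command above.

```python
import json
import shlex


def build_mcp_set_command(grafana_host: str, token: str) -> str:
    """Return the `openclaw mcp set amg ...` command line as one string."""
    # The setup notes say service-account tokens start with "glsa_".
    if not token.startswith("glsa_"):
        raise ValueError("expected a Grafana service-account token starting with 'glsa_'")
    payload = {
        "url": f"https://{grafana_host}/api/azure-mcp",
        "transport": "streamable-http",
        "headers": {"Authorization": f"Bearer {token}"},
    }
    # json.dumps guarantees valid JSON; shlex.quote makes it safe for a POSIX shell.
    compact = json.dumps(payload, separators=(",", ":"))
    return f"openclaw mcp set amg {shlex.quote(compact)}"
```

The returned string contains the token in clear text, so print it only in a terminal you trust and avoid writing it to logs.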

Verify the server is registered:

openclaw mcp list

Official skill source: https://github.com/Azure/amg-skills

Runtime Context

Current UTC time: !date -u +%Y-%m-%dT%H:%M:%SZ
Config: !cat memory/amg-check-cosmosdb-mongo-ru/config.md 2>/dev/null || echo "NOT_CONFIGURED"
Prior report: ![ -f memory/amg-check-cosmosdb-mongo-ru/report.md ] && echo "exists ($(grep -c '^### BUG-' memory/amg-check-cosmosdb-mongo-ru/report.md) bugs documented)" || echo "not found"
Arguments: time-range=$0, subscription-override=$1

Known Issues: Before presenting findings, cross-reference results against memory/amg-check-cosmosdb-mongo-ru/report.md.

Cosmos DB for MongoDB (RU) Health Check

Name: AMG Cosmos DB for MongoDB (RU) Health Check
Author: 1w2w3y

Critical Constraints

No subagents for MCP. The Agent tool cannot access MCP tools — all MCP calls must be made from the main context.
Scan every resource. No sampling or early stopping.
Time format: ISO 8601 UTC with explicit from/to — NEVER use timespan (it causes errors).
Safe interval: Always use PT1H — it works for all Cosmos DB metrics. PT6H is NOT supported. DataUsage, IndexUsage, and DocumentCount do NOT support P1D.
Parallelism cap: 30 concurrent MCP calls per batch. Reduce to 4-5 if rate-limited.
Result too large: Save to temp file and parse outside the context window. Prefer node -e "..." if installed; otherwise fall back to python -c "...", jq, or pwsh -Command "...". Bash permission for the chosen interpreter will be prompted on first use.
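The parallelism cap above amounts to splitting the account list into fixed-size batches. A minimal sketch of that chunking follows; the function name `batches` and the choice of 5 (the upper end of the 4-5 range) as the rate-limited size are assumptions, not something the skill prescribes.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

DEFAULT_BATCH_SIZE = 30      # parallelism cap from the constraints above
RATE_LIMITED_BATCH_SIZE = 5  # assumed pick from the stated 4-5 fallback range


def batches(items: Sequence[T], rate_limited: bool = False) -> Iterator[List[T]]:
    """Yield consecutive batches that respect the concurrency cap."""
    size = RATE_LIMITED_BATCH_SIZE if rate_limited else DEFAULT_BATCH_SIZE
    for start in range(0, len(items), size):
        yield list(items[start:start + size])
```

Each yielded batch would be issued as one round of concurrent MCP calls from the main context, with the next batch started only after the previous one returns.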

Progress Tracking

Update checkboxes as you complete each phase:

Phase 1a: Datasource validated
Phase 1b: Accounts discovered (N=?)
Phase 1c: Non-succeeded accounts investigated (if any)
Phase 2: Metric definitions validated
Phase 3: Pulse check completed (N scanned, N findings)
Phase 4: Deep metrics for abnormal accounts
Phase 5: Resource logs for abnormal accounts
Report presented
Known issues updated in memory/amg-check-cosmosdb-mongo-ru/report.md

Configuration

If Config shows NOT_CONFIGURED: Run First-Run Setup at the bottom of this file, then return here.

If Config is populated: Extract the datasource UID and subscription ID from the pre-loaded Runtime Context above and use them for all queries. Use $1 as the subscription override if provided.

Datasource UID: from ## Azure Monitor Datasource > UID
Subscription ID: from ## Subscription (or $1 if provided)
Resource Type: microsoft.documentdb/databaseaccounts (lowercase) with kind == 'MongoDB'
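For illustration, the extraction described above could look like the sketch below. It assumes the config file follows the layout written by First-Run Setup (a `## Azure Monitor Datasource` section with a `- **UID**:` bullet and a `## Subscription` section with one bullet per entry). The function name `parse_config` and the returned dictionary keys are hypothetical.

```python
import re
from typing import Dict, List, Optional


def parse_config(text: str, override: Optional[str] = None) -> Dict[str, object]:
    """Extract the datasource UID and subscription list from config.md text."""
    uid: Optional[str] = None
    subscriptions: List[str] = []
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            section = line[3:].strip()  # track which section we are inside
        elif section == "Azure Monitor Datasource":
            match = re.match(r"- \*\*UID\*\*:\s*(\S+)", line)
            if match:
                uid = match.group(1)
        elif section == "Subscription" and line.startswith("- "):
            subscriptions.append(line[2:].strip())
    if override:
        # A subscription override ($1) replaces whatever the file lists.
        subscriptions = [override]
    if uid is None:
        raise ValueError("datasource UID not found in config")
    return {"datasource_uid": uid, "subscriptions": subscriptions}
```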

Time Range

Default: 7 days for metrics, 24 hours for logs. Override with $0 (e.g., 3d). Keep log queries to 1-2 days to avoid timeouts.
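Because queries need explicit ISO 8601 UTC from/to values rather than a timespan, the override has to be converted into two timestamps. The sketch below shows one way to do that. The source only gives `3d` as an example, so support for an `h` suffix is an assumption, as is the function name `resolve_time_range`.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def resolve_time_range(arg: Optional[str], now: datetime, default_days: int = 7) -> Tuple[str, str]:
    """Turn an override such as '3d' into explicit ISO 8601 UTC (from, to) strings.

    `now` must be timezone-aware; it is converted to UTC before formatting.
    """
    delta = timedelta(days=default_days)
    if arg:
        match = re.fullmatch(r"(\d+)([dh])", arg.strip().lower())
        if not match:
            raise ValueError(f"unsupported time range: {arg!r}")
        amount = int(match.group(1))
        # 'd' is the documented form; 'h' is an assumed convenience.
        delta = timedelta(days=amount) if match.group(2) == "d" else timedelta(hours=amount)
    end = now.astimezone(timezone.utc)
    start = end - delta
    return start.strftime(ISO_FORMAT), end.strftime(ISO_FORMAT)
```

For log queries the same helper can be called with `default_days=1` to match the 24-hour default.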

Workflow

Phase 1a: Validate Datasource

Call amgmcp_datasource_list (no parameters). Find entry with type == "grafana-azure-monitor-datasource".

Matches configured UID → proceed.
Different UID → update memory/amg-check-cosmosdb-mongo-ru/config.md, warn user, use new UID.
Not found → abort with error.
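The three outcomes above can be expressed as a small decision function. This is a sketch under assumptions: the datasource list is taken to be a list of dictionaries with `type` and `uid` keys, and when several Azure Monitor datasources exist but none matches the configured UID, the first one is used. Neither detail is confirmed by the source, and `validate_datasource` is a hypothetical name.

```python
from typing import Dict, List, Tuple

AZURE_MONITOR_TYPE = "grafana-azure-monitor-datasource"


def validate_datasource(datasources: List[Dict[str, str]], configured_uid: str) -> Tuple[str, str]:
    """Return (action, uid) where action is 'proceed' or 'update'.

    Raises LookupError when no Azure Monitor datasource exists (the abort case).
    """
    uids = [d["uid"] for d in datasources if d.get("type") == AZURE_MONITOR_TYPE]
    if not uids:
        raise LookupError("no Azure Monitor datasource found; aborting")
    if configured_uid in uids:
        return "proceed", configured_uid
    # Assumed tie-break: take the first match; the caller updates config.md and warns the user.
    return "update", uids[0]
```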

Phase 1b: Discover All Cosmos DB for MongoDB (RU) Accounts

azureMonitorDatasourceUid: {DATASOURCE_UID}
query: |
  resources
  | where type == 'microsoft.documentdb/databaseaccounts'
  | where kind == 'MongoDB'
  | project name, resourceGroup, location, subscriptionId, id, properties.provisioningState
  | order by location asc, name asc

If the config specifies subscription IDs (not "all"), add | where subscriptionId in ('{ID1}', '{ID2}'). Derive region summary by counting accounts per location. Flag accounts not in "Succeeded" state. Stop if zero accounts found.
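The post-processing described above (region summary, non-Succeeded flags, stop on zero accounts) might be sketched as follows. The row shape is an assumption: each row is treated as a dictionary keyed by the projected column names, with the provisioning state under `properties_provisioningState`. Adjust `STATE_KEY` if the tool returns a different column name.

```python
from collections import Counter
from typing import Dict, List, Tuple

STATE_KEY = "properties_provisioningState"  # assumed column name for the projected property


def summarize_accounts(rows: List[Dict[str, str]]) -> Tuple[Dict[str, int], List[str]]:
    """Return (accounts per region, names of accounts not in 'Succeeded' state)."""
    if not rows:
        raise LookupError("no Cosmos DB for MongoDB (RU) accounts found; stopping")
    per_region = Counter(row["location"] for row in rows)
    not_succeeded = [row["name"] for row in rows if row.get(STATE_KEY) != "Succeeded"]
    # Sort regions alphabetically so the summary is stable between runs.
    return dict(sorted(per_region.items())), not_succeeded
```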

Why kind == 'MongoDB'? Filters for RU-based MongoDB API accounts. vCore-based MongoDB uses microsoft.documentdb/mongoclusters.

Phase 1c: Activity Log for Non-Succeeded Accounts

If any accounts are not in "Succeeded" state, query the activity log for up to 3 of them:

azureMonitorDatasourceUid: {DATASOURCE_UID}
scope: {account's full ARM resource ID}
startTime: now-3d
endTime: now
select: eventTimestamp,operationName,status,caller,subStatus

If the response exceeds 500 KB, retry with startTime: now-1d. Summarize: operations performed, caller type, success/in-progress status, likely cause.

Phase 2: Validate Available Metrics

Call amgmcp_query_resource_metric_definition on the first account from Phase 1b. Confirm that the expected metrics exist. Run this only once, since definitions are the same across all accounts.

Phase 3: Tier 1 — Fleet-Wide Pulse Check

azureMonitorDatasourceUid: {DATASOURCE_UID}
pastDays: 7
scenarios: cosmosdb_mongo

Scans all accounts across 3 scenarios: cosmosdb_mongo_ru, cosmosdb_mongo_throttling, cosmosdb_mongo_availability.

Before moving to Phase 4, verify:

scanSummary.totalResourcesScanned matches Phase 1 account count.
All 3 scenarios show status: "completed" in scenarioResults.
If errors is non-empty, retry the affected scenarios individually.
If more than 10% of accounts are missing, fall back to batched amgmcp_query_resource_metric calls for the unscanned accounts.

Accounts in the findings array are abnormal. Also flag any non-Succeeded accounts from Phase 1.
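The verification checks above can be collected into one helper. The field names `scanSummary.totalResourcesScanned`, `scenarioResults`, `status`, and `errors` come from the text, but the exact response shape is an assumption: here `scenarioResults` is treated as a mapping from scenario name to a result object, and each `errors` entry is assumed to carry a `scenario` key. The function name `verify_pulse_check` is hypothetical.

```python
from typing import Any, Dict


def verify_pulse_check(response: Dict[str, Any], expected_accounts: int) -> Dict[str, Any]:
    """Summarize which follow-up actions the pulse-check response calls for."""
    scanned = response.get("scanSummary", {}).get("totalResourcesScanned", 0)
    # Assumed shape: scenarioResults maps scenario name -> {"status": ...}.
    retry = {
        name
        for name, result in response.get("scenarioResults", {}).items()
        if result.get("status") != "completed"
    }
    # Assumed shape: each error entry names the scenario it belongs to.
    retry.update(e["scenario"] for e in response.get("errors", []) if "scenario" in e)
    missing = max(expected_accounts - scanned, 0)
    return {
        "count_matches": scanned == expected_accounts,
        "retry_scenarios": sorted(retry),
        "needs_fallback": expected_accounts > 0 and missing / expected_accounts > 0.10,
    }
```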

Note: Sustained-high detection (>50% for 6+ hours), RU spike pattern detection (>30pp jump in 1h), and latency analysis require hourly time-series data and are performed in Phase 4 on flagged accounts only.
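The two hourly patterns named in the note can be checked with a simple pass over a PT1H series. This is a minimal sketch that assumes the series is a gap-free list of NormalizedRU percentages in chronological order; handling of missing hours is not covered by the source, and both function names are hypothetical.

```python
from typing import List, Sequence


def has_sustained_high(hourly: Sequence[float], threshold: float = 50.0, min_hours: int = 6) -> bool:
    """True if the series stays above `threshold` for `min_hours` consecutive points."""
    run = 0
    for value in hourly:
        run = run + 1 if value > threshold else 0
        if run >= min_hours:
            return True
    return False


def spike_hours(hourly: Sequence[float], jump_pp: float = 30.0) -> List[int]:
    """Indices where the value rose by more than `jump_pp` percentage points in one hour."""
    return [i for i in range(1, len(hourly)) if hourly[i] - hourly[i - 1] > jump_pp]
```

For example, `spike_hours([10, 45, 50, 20, 55])` returns `[1, 4]`, since the series jumps by 35 points into hour 1 and again into hour 4.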

Phase 4: Tier 2 — Deep Metrics for Abnormal Accounts

Read reference/phase4-deep-metrics.md before starting Phase 4. It contains:

Response size management (critical — fleet-wide PT1H queries exceed 500 KB)
Fleet-wide triage strategy (when >50% accounts are flagged)
Core and secondary metrics tables
Batch strategy and correlation analysis patterns (use ultrathink)

Phase 5: Resource Logs for Abnormal Accounts

Read reference/phase5-resource-logs.md before starting Phase 5. It contains:

5 KQL query templates: throttling, high latency, request volume, top RU operations, error codes
Fallback table guidance (CDBDataPlaneRequests if CDBMongoRequests is empty)

Output

Present the report using the structure in reference/output-format.md.

Classification:

Severity	Criteria
CRITICAL	NormalizedRU = 100% sustained, OR ServiceAvailability < 99.9%, OR latency avg > 50ms
HIGH	NormalizedRU max 85-100% with frequent spikes, OR ReplicationLatency > 1000ms
WARNING	NormalizedRU max 70-85% sustained, OR sustained RU > 50% for 6h+, OR RU spike >30pp in 1h, OR ServiceAvailability < 99.99%, OR latency avg > 10ms, OR ReplicationLatency > 100ms
MODERATE	NormalizedRU max 50-70%
HEALTHY	All metrics within normal ranges (NormalizedRU < 50%)
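One way to apply the table is to evaluate the tiers from most to least severe and return the first match. The sketch below simplifies deliberately: qualitative qualifiers such as "frequent spikes" and "sustained" on the 70-85% and 85-100% bands are not modelled, and the boundary values 85, 70, and 50 are assigned to the higher tier. Both choices are assumptions, as are the names `AccountMetrics` and `classify`.

```python
from dataclasses import dataclass


@dataclass
class AccountMetrics:
    ru_max: float                  # peak NormalizedRU, percent
    ru_pegged_at_100: bool         # NormalizedRU = 100% sustained
    availability: float            # ServiceAvailability, percent
    latency_avg_ms: float          # average server-side latency
    replication_latency_ms: float  # ReplicationLatency
    sustained_over_50_for_6h: bool = False
    spike_over_30pp: bool = False


def classify(m: AccountMetrics) -> str:
    """Return the first (most severe) tier whose criteria match."""
    if m.ru_pegged_at_100 or m.availability < 99.9 or m.latency_avg_ms > 50:
        return "CRITICAL"
    # Simplification: the "frequent spikes" qualifier is not checked here.
    if m.ru_max >= 85 or m.replication_latency_ms > 1000:
        return "HIGH"
    if (
        m.ru_max >= 70
        or m.sustained_over_50_for_6h
        or m.spike_over_30pp
        or m.availability < 99.99
        or m.latency_avg_ms > 10
        or m.replication_latency_ms > 100
    ):
        return "WARNING"
    if m.ru_max >= 50:
        return "MODERATE"
    return "HEALTHY"
```

Checking tiers in descending order means an account that meets several criteria is reported once, at its worst severity.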

Update Known Issues

After presenting findings, update memory/amg-check-cosmosdb-mongo-ru/report.md:

Read the current file.
Rebuild the Resource Inventory table at the end: every account, full ARM ID, region, subscription, state. Group by region, sorted alphabetically.
Update existing bug status from today's telemetry (resolved / improving / worsening / still active).
Add new bugs with: severity, account name, region, metric evidence, log evidence, root cause, recommended action.
Update the "Updated" date header.

Only add genuine issues: sustained throttling, availability drops, high latency patterns, or replication problems. Skip transient single-hour spikes or expected maintenance windows.

Error Handling

See reference/error-handling.md for the full recovery table.

Analysis Guidance

Known patterns, signals, root causes: reference/analysis-patterns.md
Optional deep-dive KQL queries: reference/deep-dive-queries.md

Reference

Cosmos DB resource type: microsoft.documentdb/databaseaccounts (kind: MongoDB)
vCore resource type (different): microsoft.documentdb/mongoclusters
Latency metrics: ServerSideLatencyDirect and ServerSideLatencyGateway (the old ServerSideLatency is deprecated)
Resource log tables: CDBMongoRequests (primary), CDBDataPlaneRequests (fallback)
Key error codes: 429 / 16500 (throttling), 50 (server error), 13 (unauthorized)
Safe metric interval: PT1H for all metrics (PT6H NOT supported)
Known issues: memory/amg-check-cosmosdb-mongo-ru/report.md
User config: memory/amg-check-cosmosdb-mongo-ru/config.md

First-Run Setup

Run only when Config shows NOT_CONFIGURED. After completing, return to the Workflow above.

1. Discover Datasource UID: Call amgmcp_datasource_list. Filter type == "grafana-azure-monitor-datasource". Prefer uid == "azure-monitor-oob" if multiple match. Abort if zero match.

2. Discover Subscription ID: Run this Resource Graph query to list all subscriptions with Cosmos DB for MongoDB (RU) accounts, then present the results as a table and ask the user which subscription(s) to use:

resources
| where type == 'microsoft.documentdb/databaseaccounts'
| where kind == 'MongoDB'
| join kind=inner (
    resourcecontainers
    | where type == 'microsoft.resources/subscriptions'
    | project subscriptionId, subscriptionName=name
) on subscriptionId
| summarize AccountCount=count() by subscriptionId, subscriptionName
| order by AccountCount desc

Present the results as a table with columns: Subscription Name, Subscription ID, Account Count. Then ask the user: "Which subscription ID(s) should I configure for this health check? Or type 'all' to scan all subscriptions."

3. Write config: Write memory/amg-check-cosmosdb-mongo-ru/config.md:

# amg-check-cosmosdb-mongo-ru Configuration

User-specific values for the Cosmos DB for MongoDB (RU) health check skill.
This file is auto-generated on first run and can be edited manually.

## Azure Monitor Datasource
- **UID**: {discovered_uid}
- **Name**: {discovered_name}

## Subscription
- {subscription_id_or_"all"}

4. Confirm: Show the resolved config and ask for confirmation before proceeding.
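As an illustration of step 3, the config file can be rendered from the discovered values with a small helper. The layout mirrors the template above; writing one bullet per subscription when several are chosen is an assumption, since the template shows only a single bullet, and `render_config` is a hypothetical name.

```python
from typing import List


def render_config(uid: str, name: str, subscriptions: List[str]) -> str:
    """Render config.md content for the discovered datasource and subscriptions."""
    lines = [
        "# amg-check-cosmosdb-mongo-ru Configuration",
        "",
        "User-specific values for the Cosmos DB for MongoDB (RU) health check skill.",
        "This file is auto-generated on first run and can be edited manually.",
        "",
        "## Azure Monitor Datasource",
        f"- **UID**: {uid}",
        f"- **Name**: {name}",
        "",
        "## Subscription",
    ]
    # Assumed: one bullet per subscription ID, or a single "all" entry.
    lines += [f"- {sub}" for sub in subscriptions]
    return "\n".join(lines) + "\n"
```

The rendered text is what step 4 would show to the user for confirmation before it is written to memory/amg-check-cosmosdb-mongo-ru/config.md.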

Usage Guidance

This skill appears to do what it says (fleet-wide Cosmos DB RU health checks) and has no install artifacts, but there are two things to check before you install or use it:

(1) It requires you to register a Grafana/Azure-Monitor MCP endpoint under the exact name 'amg', and you will supply a Grafana service-account token inline when you run the provided openclaw command. The token is not declared in the skill metadata; treat it like any sensitive credential and only register a token with the minimum privileges needed, or create a dedicated service account for this purpose.

(2) The skill reads and writes persistent workspace files (config.md, report.md) and may spawn shell commands to parse large query results (node/python/jq/pwsh); ensure you are comfortable granting file and Bash access.

If you need higher assurance, ask the author for explicit declaration of required credentials in the registry metadata and for details about where the Grafana token will be stored and who or what can access it.

Capability Assessment

✓ Purpose & Capability

The name/description (fleet-wide Cosmos DB for MongoDB RU health check) aligns with the referenced tools and queries: all allowed MCP calls, Azure Monitor KQL queries, and the report/output formats match the stated purpose. The skill legitimately needs access to an Azure Monitor datasource (via AMG/MCP) and the set of metrics/log queries is appropriate.

ℹ Instruction Scope

SKILL.md instructs the agent to discover a Grafana/Azure-Monitor datasource UID, run Resource Graph / metric / log queries across all accounts, and read/write persistent memory files (config.md, report.md). It also suggests saving large MCP responses to temp files and parsing them with node/python/jq/pwsh. Those actions are within the diagnostic scope, but they involve executing shell commands and writing/reading workspace files and temporary results; confirm you are comfortable with that file access and with granting Bash execution when prompted.

✓ Install Mechanism

Instruction-only skill with no install spec or downloaded code. No archive downloads or package installs were observed, which minimizes supply-chain risk.

⚠ Credentials

The registry metadata declares no required environment variables or primary credential, but SKILL.md explicitly instructs the user to run `openclaw mcp set amg ...` with a Grafana service-account token (starts with "glsa_") so the skill can call MCP tools. That credential is necessary for the skill to function but is not declared in the metadata. The skill will also persist a config and reports under memory/amg-check-cosmosdb-mongo-ru/, so sensitive tokens/headers stored via the openclaw mcp registration could be saved in workspace configuration — the metadata should have declared this requirement.

ℹ Persistence & Privilege

The skill writes and reads persistent files (memory/amg-check-cosmosdb-mongo-ru/config.md and report.md) to track known issues across sessions — that matches its stated behavior. It does not request always:true, but it requires registering an MCP server under the exact workspace name 'amg' (workspace-level config). That registration (with Authorization header) is effectively a workspace-scoped credential and could be reused by other tools/skills that reference the same MCP name; be aware of that shared-scope implication.

Version History

v1.0.0

amg-check-cosmosdb-mongo-ru 1.0.0
- Initial release: Comprehensive health check for Cosmos DB for MongoDB (RU) accounts across the entire Azure fleet.
- Scans account NormalizedRU consumption, service availability, server-side latency, throttling (429 errors), and replication metrics.
- Automatically deep-dives into abnormal accounts with resource logs and correlation analysis.
- Tracks and cross-references known issues via a persistent report across sessions.
- Uses an AMG-MCP pulse check for Tier 1 triage, then batched Azure Monitor queries for detailed Tier 2 investigation.
- On first use, auto-discovers datasource UID and prompts for subscription ID setup.

Metadata

Slug amg-check-cosmosdb-mongo-ru

Version 1.0.0

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 1

Frequently Asked Questions

What is AMG Cosmos DB for MongoDB (RU) Health Check?

Run only when the user explicitly asks for a fleet-wide Cosmos DB for MongoDB (RU) health check — scans NormalizedRU consumption, service availability, serve... It is an AI Agent Skill for Claude Code / OpenClaw, with 56 downloads so far.

How do I install AMG Cosmos DB for MongoDB (RU) Health Check?

Run "/install amg-check-cosmosdb-mongo-ru" in the OpenClaw or Claude Code chat to install it in one step. On OpenClaw, also complete the one-time MCP server registration described under OpenClaw Setup before the first run.

Is AMG Cosmos DB for MongoDB (RU) Health Check free?

Yes, AMG Cosmos DB for MongoDB (RU) Health Check is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does AMG Cosmos DB for MongoDB (RU) Health Check support?

AMG Cosmos DB for MongoDB (RU) Health Check is cross-platform and runs anywhere OpenClaw or Claude Code is available.

Who created AMG Cosmos DB for MongoDB (RU) Health Check?

It is built and maintained by 1w2w3y (@1w2w3y); the current version is v1.0.0.
