pintudeyudi

Huawei Cloud CCE Root Cause Analyzer

by shijingcheng · GitHub ↗ · v0.1.0 · MIT-0
Install in OpenClaw
/install huawei-cloud-cce-root-cause-analyzer
Description
Huawei Cloud CCE cross-domain root cause analysis skill using Python SDK dispatcher. Use this skill when a CCE incident spans alarms, workload rollout, Pod e...
README (SKILL.md)

# CCE Root Cause Analysis

⚠️ Execution Method (Must Read): This skill executes diagnosis via local Python scripts using the `scripts/huawei-cloud.py` dispatcher. Using hcloud, kubectl, or other CLI tools or direct API calls is prohibited.

- All actions are dispatched through `scripts/huawei-cloud.py` with `--action <action_name>` and `--params <json_params>`
- All scripts and environment check scripts are inside the skill package. You must use `skill action=exec` to execute them; do not run them directly in a shell
- For action names and parameters, see the Core Tools section below
- Do not attempt hcloud, kubectl, curl IAM, or other CLI/API methods. This skill does not depend on these tools
- All paths are relative to the skill directory, which is the directory where this SKILL.md resides
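As a rough illustration of the dispatch convention above, the sketch below assembles the argument list for one dispatcher call. It only builds the arguments; execution is still expected to go through `skill action=exec`, as required above. The helper name `build_dispatch_args` and the interpreter name `python` are assumptions made for this example and are not part of the skill package.

```python
import json


def build_dispatch_args(action, params):
    """Return the argument list for one dispatcher call (hypothetical helper)."""
    if not action:
        raise ValueError("action name must not be empty")
    return [
        "python",                   # interpreter name is an assumption
        "scripts/huawei-cloud.py",  # dispatcher path, relative to the skill directory
        "--action", action,
        "--params", json.dumps(params, sort_keys=True),  # parameters travel as one JSON string
    ]


args = build_dispatch_args(
    "huawei_root_cause_analyze",
    {"region": "cn-north-4", "cluster_id": "cluster-id"},
)
print(args[2], args[3])
```

Serializing the parameters with `json.dumps` avoids hand-written quoting mistakes in the `--params` value.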

## Overview

This skill converges multi-domain evidence into root cause conclusions for CCE incidents. It orchestrates workload rollout diagnosis, dependency impact analysis, change impact analysis, AOM alarm analysis, and cross-domain drill-down (network, node) to produce a complete Markdown report with investigation steps, timeline, evidence chain, impact scope, Top3 root causes, confidence, counter-evidence, and remediation handoff.

This skill is applicable to the following scenarios:

1. Cross-resource incidents involving multiple failure domains (workload + dependency + change + alarm)
2. Root cause analysis when alarms span multiple CCE resources and the user needs comprehensive diagnosis
3. Correlating recent changes (deployments, config updates, network/security policy changes, node changes) with observed failures
4. Dependency impact propagation analysis (Service → Ingress → Pod → Node chain)
5. Workload rollout failures requiring evidence funnel (generation → ReplicaSet → Pod Ready → events → logs → command/args → probes → image)
6. Producing structured Top3 root cause reports with evidence, counter-evidence, and confidence scores

This skill does NOT handle the following:

1. Executing any remediation actions (scale, delete, drain, reboot, vulnerability state modification, cluster sleep/wake)
2. Making root cause conclusions from a single alarm without timeline or evidence chain
3. Creating, modifying, or deleting CCE resources
4. Guessing or fabricating diagnosis results without evidence

---

## Prerequisites

Before using, you must run the environment check script to complete environment validation and dependency installation in one step:

- Linux / macOS: `skill action=exec: bash skill://scripts/check_env.sh`
- Windows: `skill action=exec: powershell -ExecutionPolicy Bypass -File skill://scripts/check_env.ps1`

Windows Note: Do not use `&&` to chain commands (PowerShell 5.x does not support it). Use semicolons if you need to change directories first.

The script will check in sequence: Python >= 3.6 → install dependencies → validate SDK → validate credentials → validate service availability.
If the environment check fails, fix the issues before continuing with other actions.

Environment Variables:

| Variable | Required | Description |
|----------|----------|-------------|
| HW_ACCESS_KEY | Yes | Huawei Cloud AK |
| HW_SECRET_KEY | Yes | Huawei Cloud SK |
| HW_REGION_NAME | No | Default cn-north-4 |
| HW_PROJECT_ID | No | Project ID (automatically obtained via IAM API when not set) |
| HW_SECURITY_TOKEN | No | Required when using temporary AK/SK |
| HW_CLUSTER_ID | No | Default CCE cluster ID (can also be passed per action) |

Security Constraints:

1. Never persist credentials (AK/SK/Token/Certificate) to the filesystem
2. AK/SK exist only within the current request call stack; released after use
3. Only non-sensitive project IDs are cached in process memory (never written to disk)
4. All temporary certificate files must be deleted immediately after use
5. Never expose AK/SK in logs, responses, or error messages

Do not output the values of the above environment variables.

---
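The environment variable table and the security constraints above can be illustrated with a minimal sketch that reports which required variables are missing while printing only variable names, never values. This is not the bundled `check_env` script; the function name `check_credentials_env` and the shape of its return value are assumptions for illustration.

```python
import os

REQUIRED_VARS = ("HW_ACCESS_KEY", "HW_SECRET_KEY")
DEFAULT_REGION = "cn-north-4"  # default documented for HW_REGION_NAME


def check_credentials_env(env):
    """Summarize credential-related variables without exposing their values."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    return {
        "missing": missing,
        "region": env.get("HW_REGION_NAME") or DEFAULT_REGION,
        # A security token is only expected with temporary AK/SK.
        "uses_temporary_credentials": bool(env.get("HW_SECURITY_TOKEN")),
    }


status = check_credentials_env(dict(os.environ))
print("missing variables:", status["missing"])  # names only, never values
```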

## IAM Permission Requirements

| API Action | Permission | Purpose |
|-----------|------------|---------|
| cce:cluster:get | Get cluster | View cluster details |
| cce:cluster:list | List clusters | List CCE clusters |
| cce:node:list | List nodes | List cluster nodes |
| aom:*:get | Read AOM | Query AOM metrics and alarms |
| aom:event:list | List events | Query AOM alarm events |
| aom:alarmRule:list | List alarm rules | Query alarm rules |

Permission Failure Handling:

1. When any command fails due to permission errors, display required permission list
2. Guide the user to create a custom policy in the IAM console
3. Pause execution and wait for user confirmation

---
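To make step 2 of the failure handling more concrete, the sketch below turns the action list from the table above into a custom policy document that a user could paste into the IAM console. The `"Version": "1.1"` and `Statement` layout reflects my understanding of Huawei Cloud's custom policy JSON and should be treated as an assumption to verify in the console; the helper name `build_custom_policy` is hypothetical.

```python
import json

# Actions copied from the IAM Permission Requirements table.
REQUIRED_ACTIONS = [
    "cce:cluster:get",
    "cce:cluster:list",
    "cce:node:list",
    "aom:*:get",
    "aom:event:list",
    "aom:alarmRule:list",
]


def build_custom_policy(actions):
    """Build a read-only policy document (layout is an assumption, verify before use)."""
    unique_actions = list(dict.fromkeys(actions))  # drop duplicates, keep order
    return {
        "Version": "1.1",
        "Statement": [{"Effect": "Allow", "Action": unique_actions}],
    }


policy = build_custom_policy(REQUIRED_ACTIONS)
print(json.dumps(policy, indent=2))
```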

## Core Tools

All actions are dispatched through `scripts/huawei-cloud.py` using `skill action=exec`:

Primary Comprehensive Diagnosis:

| Action | Required Parameters | Description |
|--------|---------------------|-------------|
| huawei_root_cause_analyze | region, cluster_id | Primary comprehensive action: orchestrates workload rollout diagnosis, dependency impact, change impact, and AOM alarms into a unified root cause report with Top3 causes |

Workload Domain Actions:

| Action | Required Parameters | Description |
|--------|---------------------|-------------|
| huawei_workload_rollout_diagnose | region, cluster_id, namespace, kind, name | Diagnose Deployment/StatefulSet/DaemonSet rollout failures with funnel and Top causes |
| huawei_workload_diagnose | region, cluster_id | General workload status diagnosis |
| huawei_workload_diagnose_by_alarm | region, cluster_id | Workload diagnosis triggered by AOM alarm correlation |
| huawei_pod_failure_diagnose | region, cluster_id | Pod-level failure diagnosis (CrashLoop, ImagePull, OOM, Pending, etc.) |

Dependency and Impact Actions:

| Action | Required Parameters | Description |
|--------|---------------------|-------------|
| huawei_dependency_impact_analyze | region, cluster_id | Analyze Service/Ingress/Pod/Node propagation paths and impact scope for service unavailability |

Change Impact Actions:

| Action | Required Parameters | Description |
|--------|---------------------|-------------|
| huawei_change_impact_analyze | region, cluster_id | Correlate recent changes (deployment, config, network, security policy, node changes) with observed failures via audit log and AOM alarm timeline |

Network and Node Domain Actions:

| Action | Required Parameters | Description |
|--------|---------------------|-------------|
| huawei_network_diagnose | region, cluster_id | General network connectivity diagnosis |
| huawei_network_diagnose_by_alarm | region, cluster_id | Network diagnosis triggered by AOM alarm correlation |
| huawei_network_failure_diagnose | region, cluster_id | Network failure diagnosis (Service, Ingress connectivity) |
| huawei_node_diagnose | region, cluster_id | Node-level diagnosis (scheduling, pressure) |
| huawei_node_failure_diagnose | region, cluster_id | Node failure diagnosis |
| huawei_node_batch_diagnose | region, cluster_id | Batch node diagnosis for multi-node issues |

Alarm and Report Actions:

| Action | Required Parameters | Description |
|--------|---------------------|-------------|
| huawei_analyze_aom_alarms | region, cluster_id | Analyze AOM alarm patterns and correlation across resources |
| huawei_generate_diagnosis_report | region, cluster_id | Generate structured Markdown diagnosis report |
| huawei_generate_monitor_dashboard | region, cluster_id | Generate monitoring dashboard for ongoing observation |

Supporting Evidence Actions:

| Action | Required Parameters | Description |
|--------|---------------------|-------------|
| huawei_get_cce_events | region, cluster_id | List Kubernetes Events in the cluster |

---

## Parameter Reference

Common Parameters:

| Parameter | Required | Description |
|-----------|----------|-------------|
| region | Yes | Huawei Cloud region, e.g., cn-north-4 |
| cluster_id | Yes | CCE cluster ID |
| namespace | Yes* | Kubernetes namespace (required for workload-specific actions) |
| kind | Yes* | Workload type: Deployment, StatefulSet, or DaemonSet |
| name | Yes* | Workload name |

*Required only for `huawei_workload_rollout_diagnose`.

Optional Parameters (passed via `--params` JSON):

| Parameter | Description |
|-----------|-------------|
| ak | Override AK (uses HW_ACCESS_KEY by default) |
| sk | Override SK (uses HW_SECRET_KEY by default) |
| project_id | Override project ID (auto-obtained via IAM when not set) |
| target_name | Optional workload/app/service name for scope narrowing |
| hours | Metric/query time range in hours (default 1) |
| top_n | Number of top results for ranking (default 3) |

---
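The required-parameter rules above can be checked before a call is dispatched. The following is a minimal sketch of such a pre-flight check; the function name `validate_params` and the wording of the messages are assumptions for illustration, and the dispatcher may well perform its own validation.

```python
COMMON_REQUIRED = ("region", "cluster_id")
# Extra required parameters, as documented for the rollout action only.
EXTRA_REQUIRED = {
    "huawei_workload_rollout_diagnose": ("namespace", "kind", "name"),
}
ALLOWED_KINDS = ("Deployment", "StatefulSet", "DaemonSet")


def validate_params(action, params):
    """Return a list of problems; an empty list means the params look complete."""
    required = COMMON_REQUIRED + EXTRA_REQUIRED.get(action, ())
    problems = [f"missing parameter: {key}" for key in required if not params.get(key)]
    kind = params.get("kind")
    if action in EXTRA_REQUIRED and kind and kind not in ALLOWED_KINDS:
        problems.append(f"unsupported kind: {kind}")
    return problems


print(validate_params(
    "huawei_workload_rollout_diagnose",
    {"region": "cn-north-4", "cluster_id": "cluster-id", "kind": "Job"},
))
```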

## Output Format

### Primary Comprehensive: `huawei_root_cause_analyze`

```json
{
  "success": true,
  "analysis_trace_id": "RCA-...",
  "scope": {
    "region": "cn-north-4",
    "cluster_id": "cluster-id",
    "namespace": "optional",
    "target_name": "optional workload/app/service"
  },
  "summary": {
    "top_cause": {},
    "cause_count": 3,
    "data_sources": {
      "rollout": true,
      "dependency": true,
      "change": true,
      "alarms": true
    }
  },
  "top_causes": [
    {
      "rank": 1,
      "type": "ContainerCommandNotFound",
      "title": "New version container startup command or entry file does not exist",
      "domain": "workload",
      "confidence": 0.94,
      "evidence": [],
      "counter_evidence": [],
      "recommendation": [],
      "remediation_hint": {
        "skill": "huawei-cloud-cce-auto-remediation-runner",
        "action": "huawei_auto_remediation_run",
        "strategy": "rollback_previous_revision",
        "requires_confirmation": true
      }
    }
  ],
  "report_markdown": "# CCE Comprehensive Root Cause Analysis Report...",
  "report_file": "optional"
}
```
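A consumer of this output might reduce it to one line per cause. The sketch below does that using only fields shown in the example above; the function name `summarize_top_causes` and the one-line format are assumptions for illustration.

```python
def summarize_top_causes(result):
    """Return one summary line per cause, ordered by rank."""
    if not result.get("success"):
        return []
    causes = sorted(result.get("top_causes", []), key=lambda c: c.get("rank", 0))
    lines = []
    for cause in causes:
        hint = cause.get("remediation_hint") or {}
        # Remediation is never executed here; only flag that a handoff needs confirmation.
        handoff = " (handoff requires confirmation)" if hint.get("requires_confirmation") else ""
        lines.append(
            f"#{cause.get('rank')} [{cause.get('domain')}] {cause.get('type')} "
            f"confidence={cause.get('confidence', 0.0):.2f}{handoff}"
        )
    return lines


sample = {
    "success": True,
    "top_causes": [
        {
            "rank": 1,
            "type": "ContainerCommandNotFound",
            "domain": "workload",
            "confidence": 0.94,
            "remediation_hint": {"requires_confirmation": True},
        }
    ],
}
print("\n".join(summarize_top_causes(sample)))
# prints: #1 [workload] ContainerCommandNotFound confidence=0.94 (handoff requires confirmation)
```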

### Supporting Domain Outputs

Each domain action (`huawei_workload_rollout_diagnose`, `huawei_dependency_impact_analyze`, `huawei_change_impact_analyze`, `huawei_analyze_aom_alarms`) produces its own structured JSON output. See individual skill documentation for domain-specific schemas.

---

## Verification

1. Run the environment check script to confirm dependencies and credentials are available
2. Use `huawei_root_cause_analyze` on a known healthy cluster to verify it returns `success: true` with zero or low-confidence causes
3. Use `huawei_root_cause_analyze` on a cluster with known multi-domain failures to verify Top3 causes are accurately identified
4. Compare `huawei_root_cause_analyze` summary with individual domain action outputs for consistency
5. Verify that evidence chains reference specific objects, events, and API fields (not generic statements)
6. Verify that counter-evidence is present for each top cause candidate
7. Confirm that low-confidence conclusions are clearly labeled with required supplementary data
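Checks 5 to 7 above lend themselves to a small automated review of the `top_causes` array. The sketch below flags causes with empty `evidence` or `counter_evidence` and causes below a confidence cutoff. The cutoff of 0.5 and the function name `review_causes` are assumptions; the skill does not document a numeric low-confidence threshold.

```python
LOW_CONFIDENCE = 0.5  # assumed cutoff, not defined by the skill


def review_causes(result, low_confidence=LOW_CONFIDENCE):
    """Return human-readable findings for causes that need a closer look."""
    findings = []
    for cause in result.get("top_causes", []):
        label = f"rank {cause.get('rank')}"
        if not cause.get("evidence"):
            findings.append(f"{label}: no evidence listed")
        if not cause.get("counter_evidence"):
            findings.append(f"{label}: no counter-evidence listed")
        if cause.get("confidence", 0.0) < low_confidence:
            findings.append(f"{label}: low confidence, check that supplementary data is requested")
    return findings


report = {
    "top_causes": [
        {"rank": 1, "confidence": 0.94, "evidence": ["event A"], "counter_evidence": ["metric B"]},
        {"rank": 2, "confidence": 0.30, "evidence": ["event C"], "counter_evidence": []},
    ]
}
for finding in review_causes(report):
    print(finding)
```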

---

## Best Practices

1. Always start with `huawei_root_cause_analyze` for comprehensive diagnosis; drill down into individual domain actions only when specific evidence requires deeper analysis
2. For workload rollout failures, prioritize the rollout funnel: generation/observedGeneration → ReplicaSet → Pod Ready → Events → Logs → command/args → probes → image
3. For service unavailability, use `huawei_dependency_impact_analyze` to trace Service selector → Ingress backend → Pod Ready → Node distribution propagation paths
4. For suspected change-induced failures, use `huawei_change_impact_analyze` to build "change occurred before failure" causal chain with audit logs, K8s historical events, and AOM alarms
5. Never conclude root cause from a single alarm alone; always provide timeline or evidence chain
6. Record supporting evidence, counter-evidence, data gaps, and remediation handoff for each root cause candidate
7. Sort root causes by impact scope, timeline alignment, evidence strength, and recoverability
8. Clearly label low-confidence conclusions with required supplementary data
9. All remediation actions must be output as recommendations only and handed off to `huawei-cloud-cce-auto-remediation-runner`
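Practice 7 names four sorting criteria without saying how they are combined. One possible reading is a lexicographic sort in the listed order, sketched below. The score fields (`impact_scope`, `timeline_alignment`, `evidence_strength`, `recoverability`), their 0 to 1 scale, the "higher ranks first" direction, and the cause type `ExampleOtherCause` are all assumptions for illustration; the skill's actual ranking workflow lives in `references/workflow.md`.

```python
# Criteria in the order listed in practice 7; field names are hypothetical.
CRITERIA = ("impact_scope", "timeline_alignment", "evidence_strength", "recoverability")


def rank_candidates(candidates):
    """Sort candidates lexicographically by the criteria, highest scores first."""
    ordered = sorted(
        candidates,
        key=lambda c: tuple(c.get(name, 0.0) for name in CRITERIA),
        reverse=True,
    )
    # Return copies with a 1-based rank so the input list stays untouched.
    return [dict(c, rank=i) for i, c in enumerate(ordered, start=1)]


candidates = [
    {"type": "ExampleOtherCause", "impact_scope": 0.4, "timeline_alignment": 0.9,
     "evidence_strength": 0.6, "recoverability": 0.5},
    {"type": "ContainerCommandNotFound", "impact_scope": 0.8, "timeline_alignment": 0.7,
     "evidence_strength": 0.9, "recoverability": 0.9},
]
for item in rank_candidates(candidates):
    print(item["rank"], item["type"])
```

A weighted sum would be an equally plausible combination; the lexicographic form is used here only because it needs no invented weights.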

---

## Reference Documents

- Evidence chain and root cause ranking workflow: `references/workflow.md`
- Output structure specification: `references/output-schema.md`
- Risk boundaries and handoff rules: `references/risk-rules.md`
- [Huawei Cloud CCE Documentation](https://support.huaweicloud.com/cce/index.html)
- [Huawei Cloud Python SDK Documentation](https://support.huaweicloud.com/api-cce/cce_02_0113.html)

---

## Notes

1. This skill performs read-only diagnosis and report generation; it never performs write, scale, delete, cordon, drain, reboot, vulnerability state modification, or cluster sleep/wake operations
2. Do not output the values of HW_ACCESS_KEY, HW_SECRET_KEY, HW_SECURITY_TOKEN, or other environment variables
3. All scripts must be executed via `skill action=exec`; do not run them directly in a shell
4. Any action requiring `confirm=true` must be handed off to `huawei-cloud-cce-auto-remediation-runner`; this skill never executes remediation
5. The environment check script must be run before any diagnosis action
6. When using temporary AK/SK, HW_SECURITY_TOKEN must be set

---

## Common Pitfalls

1. **Concluding root cause from a single alarm**: Always require timeline or evidence chain; a single alarm without temporal correlation is insufficient evidence
2. **Skipping `huawei_root_cause_analyze` and drilling into individual domains first**: Always start with comprehensive analysis; individual domain drill-down is for supplementary evidence only
3. **Ignoring counter-evidence**: Each root cause candidate must include counter-evidence and data gaps; omitting these leads to false confidence
4. **Not building a fault timeline**: Establish user-perceived time, alarm trigger time, Kubernetes event time, and change time before ranking causes
5. **Attempting remediation actions from this skill**: All changes must be handed off to `huawei-cloud-cce-auto-remediation-runner`; this skill only outputs recommendations
6. **Failing to label low-confidence conclusions**: When evidence is insufficient, write "insufficient evidence" explicitly; never present guesses as conclusions
7. **Not correlating changes with failures**: When a recent deployment, config, network, or security policy change exists, use `huawei_change_impact_analyze` to verify the "change before failure" causal chain
8. **Treating dependency propagation as single-direction**: Dependency impact can propagate bidirectionally (upstream failure affects downstream, and downstream back-pressure affects upstream); analyze both directions
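Pitfalls 4 and 7 both come down to ordering events in time. The sketch below merges entries from several sources into one timeline and lists the changes that precede the first alarm. The tuple layout, the source labels (`alarm`, `change`, `k8s_event`), the timestamp format, and the sample data are assumptions for illustration. Preceding the alarm is only a necessary hint, so each match still needs supporting evidence and counter-evidence.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"  # assumed timestamp format


def build_timeline(entries):
    """Parse (timestamp, source, description) tuples and sort them by time."""
    parsed = [(datetime.strptime(ts, TIME_FORMAT), source, desc) for ts, source, desc in entries]
    return sorted(parsed, key=lambda entry: entry[0])


def changes_before_first_alarm(timeline):
    """Return change entries that happened strictly before the earliest alarm."""
    alarm_times = [when for when, source, _ in timeline if source == "alarm"]
    if not alarm_times:
        return []
    first_alarm = min(alarm_times)
    return [entry for entry in timeline if entry[1] == "change" and entry[0] < first_alarm]


# Illustrative data only.
timeline = build_timeline([
    ("2024-05-01T10:05:00", "alarm", "Pod restart alarm triggered"),
    ("2024-05-01T09:58:00", "change", "Deployment image updated"),
    ("2024-05-01T10:07:00", "k8s_event", "Back-off restarting failed container"),
    ("2024-05-01T10:20:00", "change", "ConfigMap edited"),
])
for when, source, desc in changes_before_first_alarm(timeline):
    print(when.isoformat(), source, desc)
```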
Capability Tags
requires-wallet
requires-sensitive-credentials
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install huawei-cloud-cce-root-cause-analyzer
  3. After installation, invoke the skill by name or use /huawei-cloud-cce-root-cause-analyzer
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v0.1.0
Initial release
Metadata
Slug huawei-cloud-cce-root-cause-analyzer
Version 0.1.0
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is Huawei Cloud CCE Root Cause Analyzer?

Huawei Cloud CCE cross-domain root cause analysis skill using Python SDK dispatcher. Use this skill when a CCE incident spans alarms, workload rollout, Pod e... It is an AI Agent Skill for Claude Code / OpenClaw, with 23 downloads so far.

How do I install Huawei Cloud CCE Root Cause Analyzer?

Run "/install huawei-cloud-cce-root-cause-analyzer" in the OpenClaw or Claude Code chat to install it. Before the first diagnosis, run the environment check script described under Prerequisites.

Is Huawei Cloud CCE Root Cause Analyzer free?

Yes, Huawei Cloud CCE Root Cause Analyzer is free and licensed under MIT-0. You can download, install, and use it at no cost.

Which platforms does Huawei Cloud CCE Root Cause Analyzer support?

Huawei Cloud CCE Root Cause Analyzer is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Huawei Cloud CCE Root Cause Analyzer?

It is built and maintained by shijingcheng (@pintudeyudi); the current version is v0.1.0.
