
Huawei Cloud Cce Root Cause Analyzer

Name: Huawei Cloud Cce Root Cause Analyzer
Author: pintudeyudi

by shijingcheng · GitHub ↗ · v0.1.0 · MIT-0

cross-platform ⚠ pending

Install in OpenClaw

/install huawei-cloud-cce-root-cause-analyzer

Description

Huawei Cloud CCE cross-domain root cause analysis skill using Python SDK dispatcher. Use this skill when a CCE incident spans alarms, workload rollout, Pod e...

README (SKILL.md)

# CCE Root Cause Analysis

⚠️ Execution Method (Must Read): This skill executes diagnosis via local Python scripts using the `scripts/huawei-cloud.py` dispatcher. Using hcloud, kubectl, or other CLI tools or direct API calls is prohibited.

- All actions are dispatched through `scripts/huawei-cloud.py` with `--action <action_name>` and `--params <json_params>`
- All scripts and environment check scripts are inside the skill package. You must use `skill action=exec` to execute them; do not run them directly in a shell
- For action names and parameters, see the Core Tools section below
- Do not attempt hcloud, kubectl, curl IAM, or other CLI/API methods. This skill does not depend on these tools
- All paths are relative to the skill directory, which is the directory where this SKILL.md resides

## Overview

This skill converges multi-domain evidence into root cause conclusions for CCE incidents. It orchestrates workload rollout diagnosis, dependency impact analysis, change impact analysis, AOM alarm analysis, and cross-domain drill-down (network, node) to produce a complete Markdown report with investigation steps, timeline, evidence chain, impact scope, Top3 root causes, confidence, counter-evidence, and remediation handoff.

This skill is applicable to the following scenarios:

- Cross-resource incidents involving multiple failure domains (workload + dependency + change + alarm)
- Root cause analysis when alarms span multiple CCE resources and the user needs comprehensive diagnosis
- Correlating recent changes (deployments, config updates, network/security policy changes, node changes) with observed failures
- Dependency impact propagation analysis (Service → Ingress → Pod → Node chain)
- Workload rollout failures requiring evidence funnel (generation → ReplicaSet → Pod Ready → events → logs → command/args → probes → image)
- Producing structured Top3 root cause reports with evidence, counter-evidence, and confidence scores

This skill does NOT handle the following:

- Executing any remediation actions (scale, delete, drain, reboot, vulnerability state modification, cluster sleep/wake)
- Making root cause conclusions from a single alarm without timeline or evidence chain
- Creating, modifying, or deleting CCE resources
- Guessing or fabricating diagnosis results without evidence

---

## Prerequisites

Before using, you must run the environment check script to complete environment validation and dependency installation in one step:

- Linux / macOS: `skill action=exec: bash skill://scripts/check_env.sh`
- Windows: `skill action=exec: powershell -ExecutionPolicy Bypass -File skill://scripts/check_env.ps1`

Windows Note: Do not use && to chain commands (PowerShell 5.x does not support it). Use semicolons if you need to change directories first.

The script will check in sequence: Python >= 3.6 → install dependencies → validate SDK → validate credentials → validate service availability.
If the environment check fails, fix the issues before continuing with other actions.

Environment Variables:

| Variable | Required | Description |
|----------|----------|-------------|
| HW_ACCESS_KEY | Yes | Huawei Cloud AK |
| HW_SECRET_KEY | Yes | Huawei Cloud SK |
| HW_REGION_NAME | No | Default cn-north-4 |
| HW_PROJECT_ID | No | Project ID (automatically obtained via IAM API when not set) |
| HW_SECURITY_TOKEN | No | Required when using temporary AK/SK |
| HW_CLUSTER_ID | No | Default CCE cluster ID (can also be passed per action) |

Security Constraints:

- Never persist credentials (AK/SK/Token/Certificate) to the filesystem
- AK/SK exist only within the current request call stack; released after use
- Only non-sensitive project IDs are cached in process memory (never written to disk)
- All temporary certificate files must be deleted immediately after use
- Never expose AK/SK in logs, responses, or error messages

Do not output the values of the above environment variables.

---
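The environment variable table and security constraints under Prerequisites can be illustrated with a small presence check. This is a minimal sketch, not the logic of the skill's own `check_env` scripts: the function name `summarize_env` and the shape of its return value are hypothetical, and it reports only whether each variable is set, never its value.

```python
import os

# Variable names come from the Environment Variables table above.
REQUIRED_VARS = ("HW_ACCESS_KEY", "HW_SECRET_KEY")
OPTIONAL_VARS = ("HW_REGION_NAME", "HW_PROJECT_ID", "HW_SECURITY_TOKEN", "HW_CLUSTER_ID")


def summarize_env(env=None):
    """Report which variables are set without ever returning their values.

    Hypothetical helper for illustration; not part of the skill package.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    return {
        "missing_required": missing,
        "optional_set": [name for name in OPTIONAL_VARS if env.get(name)],
        # True means the documented default region (cn-north-4) would apply.
        "region_is_default": not env.get("HW_REGION_NAME"),
        "ready": not missing,
    }
```

Returning names and booleans only keeps the summary safe to show in a chat response or a log line.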

## IAM Permission Requirements

| API Action | Permission | Purpose |
|-----------|------------|---------|
| cce:cluster:get | Get cluster | View cluster details |
| cce:cluster:list | List clusters | List CCE clusters |
| cce:node:list | List nodes | List cluster nodes |
| aom:*:get | Read AOM | Query AOM metrics and alarms |
| aom:event:list | List events | Query AOM alarm events |
| aom:alarmRule:list | List alarm rules | Query alarm rules |

Permission Failure Handling:

- When any command fails due to a permission error, display the required permission list
- Guide the user to create a custom policy in the IAM console
- Pause execution and wait for user confirmation

---

## Core Tools

All actions are dispatched through `scripts/huawei-cloud.py` using `skill action=exec`:

Primary Comprehensive Diagnosis:

| Action | Required Parameters | Description |
|--------|---------------------|-------------|
| `huawei_root_cause_analyze` | region, cluster_id | Primary comprehensive action: orchestrates workload rollout diagnosis, dependency impact, change impact, and AOM alarms into a unified root cause report with Top3 causes |

Workload Domain Actions:

| Action | Required Parameters | Description |
|--------|---------------------|-------------|
| `huawei_workload_rollout_diagnose` | region, cluster_id, namespace, kind, name | Diagnose Deployment/StatefulSet/DaemonSet rollout failures with funnel and Top causes |
| `huawei_workload_diagnose` | region, cluster_id | General workload status diagnosis |
| `huawei_workload_diagnose_by_alarm` | region, cluster_id | Workload diagnosis triggered by AOM alarm correlation |
| `huawei_pod_failure_diagnose` | region, cluster_id | Pod-level failure diagnosis (CrashLoop, ImagePull, OOM, Pending, etc.) |

Dependency and Impact Actions:

| Action | Required Parameters | Description |
|--------|---------------------|-------------|
| `huawei_dependency_impact_analyze` | region, cluster_id | Analyze Service/Ingress/Pod/Node propagation paths and impact scope for service unavailability |

Change Impact Actions:

| Action | Required Parameters | Description |
|--------|---------------------|-------------|
| `huawei_change_impact_analyze` | region, cluster_id | Correlate recent changes (deployment, config, network, security policy, node changes) with observed failures via audit log and AOM alarm timeline |

Network and Node Domain Actions:

| Action | Required Parameters | Description |
|--------|---------------------|-------------|
| `huawei_network_diagnose` | region, cluster_id | General network connectivity diagnosis |
| `huawei_network_diagnose_by_alarm` | region, cluster_id | Network diagnosis triggered by AOM alarm correlation |
| `huawei_network_failure_diagnose` | region, cluster_id | Network failure diagnosis (Service, Ingress connectivity) |
| `huawei_node_diagnose` | region, cluster_id | Node-level diagnosis (scheduling, pressure) |
| `huawei_node_failure_diagnose` | region, cluster_id | Node failure diagnosis |
| `huawei_node_batch_diagnose` | region, cluster_id | Batch node diagnosis for multi-node issues |

Alarm and Report Actions:

| Action | Required Parameters | Description |
|--------|---------------------|-------------|
| `huawei_analyze_aom_alarms` | region, cluster_id | Analyze AOM alarm patterns and correlation across resources |
| `huawei_generate_diagnosis_report` | region, cluster_id | Generate structured Markdown diagnosis report |
| `huawei_generate_monitor_dashboard` | region, cluster_id | Generate monitoring dashboard for ongoing observation |

Supporting Evidence Actions:

| Action | Required Parameters | Description |
|--------|---------------------|-------------|
| `huawei_get_cce_events` | region, cluster_id | List Kubernetes Events in the cluster |

---
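The `--action` / `--params` convention from the Core Tools tables can be sketched as an argument builder. This is an illustration, not the dispatcher's own code: `build_dispatch_args` is a hypothetical helper, and only a few actions are listed in its lookup table. It builds the argument list and does not run anything, because the README requires execution through `skill action=exec` rather than a direct shell call.

```python
import json

BASE_REQUIRED = ("region", "cluster_id")

# Required parameters per action, copied from the Core Tools tables (subset).
REQUIRED_PARAMS = {
    "huawei_root_cause_analyze": BASE_REQUIRED,
    "huawei_workload_rollout_diagnose": BASE_REQUIRED + ("namespace", "kind", "name"),
    "huawei_dependency_impact_analyze": BASE_REQUIRED,
    "huawei_change_impact_analyze": BASE_REQUIRED,
    "huawei_analyze_aom_alarms": BASE_REQUIRED,
}


def build_dispatch_args(action, params):
    """Validate required parameters and return the dispatcher argument list.

    Hypothetical helper; the result is meant to be handed to
    `skill action=exec`, not run directly in a shell.
    """
    if action not in REQUIRED_PARAMS:
        raise ValueError(f"unknown action in this sketch: {action}")
    missing = [key for key in REQUIRED_PARAMS[action] if not params.get(key)]
    if missing:
        raise ValueError(f"{action} is missing required parameters: {missing}")
    return [
        "scripts/huawei-cloud.py",
        "--action", action,
        "--params", json.dumps(params, sort_keys=True),
    ]
```

Validating before dispatch surfaces a missing `namespace` or `name` as a clear local error instead of a failed remote call.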

## Parameter Reference

Common Parameters:

| Parameter | Required | Description |
|-----------|----------|-------------|
| region | Yes | Huawei Cloud region, e.g., cn-north-4 |
| cluster_id | Yes | CCE cluster ID |
| namespace | Yes* | Kubernetes namespace (required for workload-specific actions) |
| kind | Yes* | Workload type: Deployment, StatefulSet, or DaemonSet |
| name | Yes* | Workload name |

*Required only for `huawei_workload_rollout_diagnose`.

Optional Parameters (passed via `--params` JSON):

| Parameter | Description |
|-----------|-------------|
| ak | Override AK (uses HW_ACCESS_KEY by default) |
| sk | Override SK (uses HW_SECRET_KEY by default) |
| project_id | Override project ID (auto-obtained via IAM when not set) |
| target_name | Optional workload/app/service name for scope narrowing |
| hours | Metric/query time range in hours (default 1) |
| top_n | Number of top results for ranking (default 3) |

---
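The defaults and allowed values in the Parameter Reference can be applied before a call, roughly as follows. This is a sketch under assumptions: `normalize_params` is a hypothetical helper, and treating `hours` and `top_n` as positive numbers is my reading of the table rather than a documented rule.

```python
VALID_KINDS = ("Deployment", "StatefulSet", "DaemonSet")
DEFAULTS = {"hours": 1, "top_n": 3}  # defaults stated in the table above


def normalize_params(params):
    """Fill in documented defaults and reject obviously invalid values.

    Hypothetical helper for illustration; the real dispatcher may
    validate differently.
    """
    merged = {**DEFAULTS, **params}
    if "kind" in merged and merged["kind"] not in VALID_KINDS:
        raise ValueError(f"kind must be one of {VALID_KINDS}")
    for key in ("hours", "top_n"):
        value = merged[key]
        # ASSUMPTION: both values must be positive numbers.
        if isinstance(value, bool) or not isinstance(value, (int, float)) or value <= 0:
            raise ValueError(f"{key} must be a positive number")
    return merged
```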

## Output Format

### Primary Comprehensive: `huawei_root_cause_analyze`

```json
{
  "success": true,
  "analysis_trace_id": "RCA-...",
  "scope": {
    "region": "cn-north-4",
    "cluster_id": "cluster-id",
    "namespace": "optional",
    "target_name": "optional workload/app/service"
  },
  "summary": {
    "top_cause": {},
    "cause_count": 3,
    "data_sources": {
      "rollout": true,
      "dependency": true,
      "change": true,
      "alarms": true
    }
  },
  "top_causes": [
    {
      "rank": 1,
      "type": "ContainerCommandNotFound",
      "title": "New version container startup command or entry file does not exist",
      "domain": "workload",
      "confidence": 0.94,
      "evidence": [],
      "counter_evidence": [],
      "recommendation": [],
      "remediation_hint": {
        "skill": "huawei-cloud-cce-auto-remediation-runner",
        "action": "huawei_auto_remediation_run",
        "strategy": "rollback_previous_revision",
        "requires_confirmation": true
      }
    }
  ],
  "report_markdown": "# CCE Comprehensive Root Cause Analysis Report...",
  "report_file": "optional"
}
```
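A caller might condense the `huawei_root_cause_analyze` output shown above into one line per cause. This is a minimal sketch: `summarize_causes` is a hypothetical helper, and the 0.6 cut-off for labeling a cause as low confidence is an assumption, since the README does not define a threshold.

```python
import json

LOW_CONFIDENCE = 0.6  # ASSUMPTION: the README defines no numeric threshold


def summarize_causes(result_json, low_confidence=LOW_CONFIDENCE):
    """Return one summary line per entry in top_causes, ordered by rank."""
    result = json.loads(result_json)
    if not result.get("success"):
        return ["analysis did not succeed"]
    lines = []
    causes = sorted(result.get("top_causes", []), key=lambda c: c.get("rank", 0))
    for cause in causes:
        confidence = cause.get("confidence", 0.0)
        label = " (low confidence)" if confidence < low_confidence else ""
        lines.append(
            f"#{cause.get('rank')} [{cause.get('domain')}] "
            f"{cause.get('type')}: confidence {confidence:.2f}{label}"
        )
    return lines
```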

### Supporting Domain Outputs

Each domain action (`huawei_workload_rollout_diagnose`, `huawei_dependency_impact_analyze`, `huawei_change_impact_analyze`, `huawei_analyze_aom_alarms`) produces its own structured JSON output. See individual skill documentation for domain-specific schemas.

---

## Verification

1. Run the environment check script to confirm dependencies and credentials are available
2. Use `huawei_root_cause_analyze` on a known healthy cluster to verify it returns `success: true` with zero or low-confidence causes
3. Use `huawei_root_cause_analyze` on a cluster with known multi-domain failures to verify Top3 causes are accurately identified
4. Compare `huawei_root_cause_analyze` summary with individual domain action outputs for consistency
5. Verify that evidence chains reference specific objects, events, and API fields (not generic statements)
6. Verify that counter-evidence is present for each top cause candidate
7. Confirm that low-confidence conclusions are clearly labeled with required supplementary data
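Some of the verification steps above, such as the presence of evidence and counter-evidence, lend themselves to a mechanical check on the parsed result. This is a sketch only: `check_report` is a hypothetical helper that inspects the fields shown under Output Format, and it cannot judge whether the evidence is specific or accurate.

```python
def check_report(result):
    """Return a list of problems found in a parsed root cause result.

    An empty list means the structural checks passed. A healthy cluster
    with no causes passes as long as success is true.
    """
    problems = []
    if result.get("success") is not True:
        problems.append("success is not true")
    for cause in result.get("top_causes", []):
        rank = cause.get("rank")
        if not cause.get("evidence"):
            problems.append(f"cause {rank}: no evidence listed")
        if not cause.get("counter_evidence"):
            problems.append(f"cause {rank}: no counter-evidence listed")
    return problems
```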

---

## Best Practices

1. Always start with `huawei_root_cause_analyze` for comprehensive diagnosis; drill down into individual domain actions only when specific evidence requires deeper analysis
2. For workload rollout failures, prioritize the rollout funnel: generation/observedGeneration → ReplicaSet → Pod Ready → Events → Logs → command/args → probes → image
3. For service unavailability, use `huawei_dependency_impact_analyze` to trace Service selector → Ingress backend → Pod Ready → Node distribution propagation paths
4. For suspected change-induced failures, use `huawei_change_impact_analyze` to build a "change occurred before failure" causal chain with audit logs, K8s historical events, and AOM alarms
5. Never conclude root cause from a single alarm alone; always provide a timeline or evidence chain
6. Record supporting evidence, counter-evidence, data gaps, and remediation handoff for each root cause candidate
7. Sort root causes by impact scope, timeline alignment, evidence strength, and recoverability
8. Clearly label low-confidence conclusions with required supplementary data
9. All remediation actions must be output as recommendations only and handed off to `huawei-cloud-cce-auto-remediation-runner`
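The "change occurred before failure" check in the practices above comes down to a timestamp comparison. This is a minimal sketch, assuming each change is a dict with hypothetical `name` and `time` keys and that a one-hour look-back window is reasonable; none of this reflects how `huawei_change_impact_analyze` is implemented.

```python
from datetime import datetime, timedelta


def changes_before_failure(changes, failure_time, window=timedelta(hours=1)):
    """Return changes that happened within `window` before the failure.

    The most recent change comes first. Changes after the failure are
    excluded because they cannot have caused it.
    """
    start = failure_time - window
    candidates = [c for c in changes if start <= c["time"] <= failure_time]
    return sorted(candidates, key=lambda c: c["time"], reverse=True)


if __name__ == "__main__":
    failure = datetime(2024, 1, 1, 12, 0)
    sample = [
        {"name": "deploy web v2", "time": failure - timedelta(minutes=10)},
        {"name": "old config update", "time": failure - timedelta(hours=5)},
    ]
    for change in changes_before_failure(sample, failure):
        print(change["name"], change["time"].isoformat())
```

Temporal precedence only nominates candidates; each one still needs supporting evidence and counter-evidence before it is ranked.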

---

## Reference Documents

- Evidence chain and root cause ranking workflow: `references/workflow.md`
- Output structure specification: `references/output-schema.md`
- Risk boundaries and handoff rules: `references/risk-rules.md`
- [Huawei Cloud CCE Documentation](https://support.huaweicloud.com/cce/index.html)
- [Huawei Cloud Python SDK Documentation](https://support.huaweicloud.com/api-cce/cce_02_0113.html)

---

## Notes

1. This skill performs read-only diagnosis and report generation only; it performs no write, scale, delete, cordon, drain, reboot, vulnerability state modification, or cluster sleep/wake operations
2. Do not output the values of HW_ACCESS_KEY, HW_SECRET_KEY, HW_SECURITY_TOKEN, or other environment variables
3. All scripts must be executed via `skill action=exec`; do not run them directly in a shell
4. Any action requiring `confirm=true` must be handed off to `huawei-cloud-cce-auto-remediation-runner`; this skill never executes remediation
5. The environment check script must be run before any diagnosis action
6. When using temporary AK/SK, HW_SECURITY_TOKEN must be set
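The rule in the notes above against outputting credential values can be backed by a redaction pass over any text before it is shown. This is a sketch, not part of the skill: `redact` is a hypothetical helper, and limiting it to the three credential variables is a design choice, since masking a region such as cn-north-4 would damage the report text.

```python
import os

SENSITIVE_VARS = ("HW_ACCESS_KEY", "HW_SECRET_KEY", "HW_SECURITY_TOKEN")


def redact(text, env=None):
    """Replace any occurrence of a credential value with a placeholder."""
    env = os.environ if env is None else env
    for name in SENSITIVE_VARS:
        value = env.get(name)
        if value:  # skip unset or empty variables
            text = text.replace(value, f"<{name} redacted>")
    return text
```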

---

## Common Pitfalls

1. **Concluding root cause from a single alarm**: Always require timeline or evidence chain; a single alarm without temporal correlation is insufficient evidence
2. **Skipping `huawei_root_cause_analyze` and drilling into individual domains first**: Always start with comprehensive analysis; individual domain drill-down is for supplementary evidence only
3. **Ignoring counter-evidence**: Each root cause candidate must include counter-evidence and data gaps; omitting these leads to false confidence
4. **Not building a fault timeline**: Establish user-perceived time, alarm trigger time, Kubernetes event time, and change time before ranking causes
5. **Attempting remediation actions from this skill**: All changes must be handed off to `huawei-cloud-cce-auto-remediation-runner`; this skill only outputs recommendations
6. **Failing to label low-confidence conclusions**: When evidence is insufficient, write "insufficient evidence" explicitly; never present guesses as conclusions
7. **Not correlating changes with failures**: When a recent deployment, config, network, or security policy change exists, use `huawei_change_impact_analyze` to verify the "change before failure" causal chain
8. **Treating dependency propagation as single-direction**: Dependency impact can propagate bidirectionally (upstream failure affects downstream, and downstream back-pressure affects upstream); analyze both directions
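The bidirectional analysis in the last pitfall can be pictured as two traversals over a dependency graph, one along the edges and one against them. This is an illustrative sketch with a made-up graph and hypothetical function names; it does not describe how `huawei_dependency_impact_analyze` models resources.

```python
from collections import deque


def reachable(graph, start):
    """Breadth-first search: every node reachable from `start`, excluding it."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt != start and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def invert(graph):
    """Reverse every edge so dependents can be found with the same search."""
    inverted = {}
    for src, targets in graph.items():
        for dst in targets:
            inverted.setdefault(dst, []).append(src)
    return inverted


def impact_both_directions(depends_on, failed):
    """List what the failed object relies on and what relies on it."""
    return {
        "dependencies": sorted(reachable(depends_on, failed)),
        "dependents": sorted(reachable(invert(depends_on), failed)),
    }


if __name__ == "__main__":
    # Hypothetical resource names, for illustration only.
    graph = {
        "ingress/web": ["service/web"],
        "service/web": ["pod/web-1", "pod/web-2"],
        "pod/web-1": ["node/a"],
        "pod/web-2": ["node/b"],
    }
    print(impact_both_directions(graph, "pod/web-1"))
```

Checking both lists avoids stopping at the first affected layer when the originating fault sits on the other side of the chain.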

Capability Tags

requires-wallet · requires-sensitive-credentials

How to Use

Make sure OpenClaw is installed (local or Docker)
Run the install command in chat: /install huawei-cloud-cce-root-cause-analyzer
After installation, invoke the skill by name or use /huawei-cloud-cce-root-cause-analyzer
Provide required inputs per the skill's parameter spec and get structured output

Version History

v0.1.0

Initial release

Metadata

Slug huawei-cloud-cce-root-cause-analyzer

Version 0.1.0

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 1

Frequently Asked Questions

What is Huawei Cloud Cce Root Cause Analyzer?

Huawei Cloud CCE cross-domain root cause analysis skill using Python SDK dispatcher. Use this skill when a CCE incident spans alarms, workload rollout, Pod e... It is an AI Agent Skill for Claude Code / OpenClaw, with 23 downloads so far.

How do I install Huawei Cloud Cce Root Cause Analyzer?

Run "/install huawei-cloud-cce-root-cause-analyzer" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Huawei Cloud Cce Root Cause Analyzer free?

Yes, Huawei Cloud Cce Root Cause Analyzer is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Huawei Cloud Cce Root Cause Analyzer support?

Huawei Cloud Cce Root Cause Analyzer is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Huawei Cloud Cce Root Cause Analyzer?

It is built and maintained by shijingcheng (@pintudeyudi); the current version is v0.1.0.
