Huawei Cloud Cce Root Cause Analyzer
/install huawei-cloud-cce-root-cause-analyzer

# CCE Root Cause Analysis

⚠️ Execution Method (Must Read): This skill executes diagnosis via local Python scripts using the `scripts/huawei-cloud.py` dispatcher. Using hcloud, kubectl, other CLI tools, or direct API calls is prohibited.

- All actions are dispatched through `scripts/huawei-cloud.py` with `--action <action_name>` and `--params <json_params>`
- All scripts, including the environment check scripts, are inside the skill package. You must use `skill action=exec` to execute them; do not run them directly in a shell
- For action names and parameters, see the Core Tools section below
- Do not attempt hcloud, kubectl, curl IAM, or other CLI/API methods. This skill does not depend on these tools
- All paths are relative to the skill directory, which is the directory where this SKILL.md resides

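The dispatch convention above can be sketched in Python. This is a minimal illustration only: `build_dispatch_args` is a hypothetical helper, and how `skill action=exec` wraps the resulting arguments is not specified in this document. The sketch builds the argument list and does not execute anything.

```python
import json


def build_dispatch_args(action, params):
    """Build the argument list for the dispatcher (hypothetical helper)."""
    if not action:
        raise ValueError("action name is required")
    return [
        "scripts/huawei-cloud.py",
        "--action", action,
        # --params carries a single JSON document
        "--params", json.dumps(params, sort_keys=True),
    ]


args = build_dispatch_args(
    "huawei_root_cause_analyze",
    {"region": "cn-north-4", "cluster_id": "cluster-id"},
)
print(args)
```

Serializing the parameters with `json.dumps` avoids hand-written quoting mistakes in the `--params` value.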
## Overview

This skill converges multi-domain evidence into root cause conclusions for CCE incidents. It orchestrates workload rollout diagnosis, dependency impact analysis, change impact analysis, AOM alarm analysis, and cross-domain drill-down (network, node) to produce a complete Markdown report with investigation steps, timeline, evidence chain, impact scope, Top3 root causes, confidence, counter-evidence, and remediation handoff.

This skill is applicable to the following scenarios:

- Cross-resource incidents involving multiple failure domains (workload + dependency + change + alarm)
- Root cause analysis when alarms span multiple CCE resources and the user needs comprehensive diagnosis
- Correlating recent changes (deployments, config updates, network/security policy changes, node changes) with observed failures
- Dependency impact propagation analysis (Service → Ingress → Pod → Node chain)
- Workload rollout failures requiring an evidence funnel (generation → ReplicaSet → Pod Ready → events → logs → command/args → probes → image)
- Producing structured Top3 root cause reports with evidence, counter-evidence, and confidence scores

This skill does NOT handle the following:

- Executing any remediation actions (scale, delete, drain, reboot, vulnerability state modification, cluster sleep/wake)
- Making root cause conclusions from a single alarm without timeline or evidence chain
- Creating, modifying, or deleting CCE resources
- Guessing or fabricating diagnosis results without evidence

---

## Prerequisites

Before using this skill, you must run the environment check script to complete environment validation and dependency installation in one step:

- Linux / macOS: `skill action=exec: bash skill://scripts/check_env.sh`
- Windows: `skill action=exec: powershell -ExecutionPolicy Bypass -File skill://scripts/check_env.ps1`

Windows Note: Do not use `&&` to chain commands (PowerShell 5.x does not support it). Use semicolons if you need to change directories first.

The script checks in sequence: Python >= 3.6 → install dependencies → validate SDK → validate credentials → validate service availability.
If the environment check fails, fix the issues before continuing with other actions.

Environment Variables:

| Variable | Required | Description |
|----------|----------|-------------|
| HW_ACCESS_KEY | Yes | Huawei Cloud AK |
| HW_SECRET_KEY | Yes | Huawei Cloud SK |
| HW_REGION_NAME | No | Defaults to cn-north-4 |
| HW_PROJECT_ID | No | Project ID (automatically obtained via IAM API when not set) |
| HW_SECURITY_TOKEN | No | Required when using temporary AK/SK |
| HW_CLUSTER_ID | No | Default CCE cluster ID (can also be passed per action) |

Security Constraints:

- Never persist credentials (AK/SK/Token/Certificate) to the filesystem
- AK/SK exist only within the current request call stack and are released after use
- Only non-sensitive project IDs are cached in process memory (never written to disk)
- All temporary certificate files must be deleted immediately after use
- Never expose AK/SK in logs, responses, or error messages

Do not output the values of the above environment variables.

---

## IAM Permission Requirements

| API Action | Permission | Purpose |
|-----------|------------|---------|
| cce:cluster:get | Get cluster | View cluster details |
| cce:cluster:list | List clusters | List CCE clusters |
| cce:node:list | List nodes | List cluster nodes |
| aom:*:get | Read AOM | Query AOM metrics and alarms |
| aom:event:list | List events | Query AOM alarm events |
| aom:alarmRule:list | List alarm rules | Query alarm rules |

Permission Failure Handling:

- When any command fails due to a permission error, display the list of required permissions
- Guide the user to create a custom policy in the IAM console
- Pause execution and wait for user confirmation

---

## Core Tools

All actions are dispatched through `scripts/huawei-cloud.py` using `skill action=exec`:

Primary Comprehensive Diagnosis:

| Action | Required Parameters | Description |
|--------|---------------------|-------------|
| huawei_root_cause_analyze | region, cluster_id | Primary comprehensive action: orchestrates workload rollout diagnosis, dependency impact, change impact, and AOM alarms into a unified root cause report with Top3 causes |

Workload Domain Actions:

| Action | Required Parameters | Description |
|--------|---------------------|-------------|
| huawei_workload_rollout_diagnose | region, cluster_id, namespace, kind, name | Diagnose Deployment/StatefulSet/DaemonSet rollout failures with funnel and Top causes |
| huawei_workload_diagnose | region, cluster_id | General workload status diagnosis |
| huawei_workload_diagnose_by_alarm | region, cluster_id | Workload diagnosis triggered by AOM alarm correlation |
| huawei_pod_failure_diagnose | region, cluster_id | Pod-level failure diagnosis (CrashLoop, ImagePull, OOM, Pending, etc.) |

Dependency and Impact Actions:

| Action | Required Parameters | Description |
|--------|---------------------|-------------|
| huawei_dependency_impact_analyze | region, cluster_id | Analyze Service/Ingress/Pod/Node propagation paths and impact scope for service unavailability |

Change Impact Actions:

| Action | Required Parameters | Description |
|--------|---------------------|-------------|
| huawei_change_impact_analyze | region, cluster_id | Correlate recent changes (deployment, config, network, security policy, node changes) with observed failures via audit log and AOM alarm timeline |

Network and Node Domain Actions:

| Action | Required Parameters | Description |
|--------|---------------------|-------------|
| huawei_network_diagnose | region, cluster_id | General network connectivity diagnosis |
| huawei_network_diagnose_by_alarm | region, cluster_id | Network diagnosis triggered by AOM alarm correlation |
| huawei_network_failure_diagnose | region, cluster_id | Network failure diagnosis (Service, Ingress connectivity) |
| huawei_node_diagnose | region, cluster_id | Node-level diagnosis (scheduling, pressure) |
| huawei_node_failure_diagnose | region, cluster_id | Node failure diagnosis |
| huawei_node_batch_diagnose | region, cluster_id | Batch node diagnosis for multi-node issues |

Alarm and Report Actions:

| Action | Required Parameters | Description |
|--------|---------------------|-------------|
| huawei_analyze_aom_alarms | region, cluster_id | Analyze AOM alarm patterns and correlation across resources |
| huawei_generate_diagnosis_report | region, cluster_id | Generate structured Markdown diagnosis report |
| huawei_generate_monitor_dashboard | region, cluster_id | Generate monitoring dashboard for ongoing observation |

Supporting Evidence Actions:

| Action | Required Parameters | Description |
|--------|---------------------|-------------|
| huawei_get_cce_events | region, cluster_id | List Kubernetes Events in the cluster |

---

## Parameter Reference

Common Parameters:

| Parameter | Required | Description |
|-----------|----------|-------------|
| region | Yes | Huawei Cloud region, e.g., cn-north-4 |
| cluster_id | Yes | CCE cluster ID |
| namespace | Yes* | Kubernetes namespace (required for workload-specific actions) |
| kind | Yes* | Workload type: Deployment, StatefulSet, or DaemonSet |
| name | Yes* | Workload name |

*Required only for `huawei_workload_rollout_diagnose`.
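
A pre-flight check for the required parameters listed above might look like the following sketch. The helper name `missing_params` is hypothetical, and the sketch only inspects the parameter dictionary; it does not model any fallback to environment variables such as HW_CLUSTER_ID.

```python
# Required for every action, per the Common Parameters table
REQUIRED_BASE = ("region", "cluster_id")

# Extra requirements, per the footnote under the table
REQUIRED_BY_ACTION = {
    "huawei_workload_rollout_diagnose": ("namespace", "kind", "name"),
}

ALLOWED_KINDS = ("Deployment", "StatefulSet", "DaemonSet")


def missing_params(action, params):
    """Return the required parameter names that are absent or empty."""
    required = REQUIRED_BASE + REQUIRED_BY_ACTION.get(action, ())
    return [key for key in required if not params.get(key)]


params = {"region": "cn-north-4", "cluster_id": "cluster-id", "kind": "Deployment"}
print(missing_params("huawei_workload_rollout_diagnose", params))
print(params["kind"] in ALLOWED_KINDS)
```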

Optional Parameters (passed via `--params` JSON):

| Parameter | Description |
|-----------|-------------|
| ak | Override AK (uses HW_ACCESS_KEY by default) |
| sk | Override SK (uses HW_SECRET_KEY by default) |
| project_id | Override project ID (auto-obtained via IAM when not set) |
| target_name | Optional workload/app/service name for scope narrowing |
| hours | Metric/query time range in hours (default 1) |
| top_n | Number of top results for ranking (default 3) |

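To make the documented defaults concrete, the sketch below shows the effective values of `hours` and `top_n` when a caller omits them. It assumes the dispatcher applies these defaults itself; `with_defaults` is a hypothetical helper for illustration, not part of the skill.

```python
# Defaults taken from the Optional Parameters table
DEFAULTS = {"hours": 1, "top_n": 3}


def with_defaults(params):
    """Return a copy of params with documented defaults filled in."""
    merged = dict(DEFAULTS)
    # Explicit values win; None is treated as "not provided"
    merged.update({k: v for k, v in params.items() if v is not None})
    return merged


effective = with_defaults({"region": "cn-north-4", "cluster_id": "cluster-id", "hours": 6})
print(effective)
```
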
---

## Output Format

### Primary Comprehensive: `huawei_root_cause_analyze`

```json
{
  "success": true,
  "analysis_trace_id": "RCA-...",
  "scope": {
    "region": "cn-north-4",
    "cluster_id": "cluster-id",
    "namespace": "optional",
    "target_name": "optional workload/app/service"
  },
  "summary": {
    "top_cause": {},
    "cause_count": 3,
    "data_sources": {
      "rollout": true,
      "dependency": true,
      "change": true,
      "alarms": true
    }
  },
  "top_causes": [
    {
      "rank": 1,
      "type": "ContainerCommandNotFound",
      "title": "New version container startup command or entry file does not exist",
      "domain": "workload",
      "confidence": 0.94,
      "evidence": [],
      "counter_evidence": [],
      "recommendation": [],
      "remediation_hint": {
        "skill": "huawei-cloud-cce-auto-remediation-runner",
        "action": "huawei_auto_remediation_run",
        "strategy": "rollback_previous_revision",
        "requires_confirmation": true
      }
    }
  ],
  "report_markdown": "# CCE Comprehensive Root Cause Analysis Report...",
  "report_file": "optional"
}
```
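
Consuming this output could look like the sketch below, which lists the ranked causes and flags low-confidence entries. The `summarize` helper and the 0.6 threshold are assumptions for illustration; the document does not define a numeric cut-off for "low confidence". The sample is trimmed from the example above.

```python
import json

# Trimmed copy of the example output shown above
raw = """
{
  "success": true,
  "top_causes": [
    {"rank": 1, "type": "ContainerCommandNotFound",
     "domain": "workload", "confidence": 0.94}
  ]
}
"""


def summarize(result, min_confidence=0.6):
    """Return one line per cause, ordered by rank (threshold is an assumption)."""
    if not result.get("success"):
        return ["analysis failed"]
    lines = []
    for cause in sorted(result.get("top_causes", []), key=lambda c: c["rank"]):
        label = "" if cause["confidence"] >= min_confidence else " (low confidence)"
        lines.append(
            f'{cause["rank"]}. [{cause["domain"]}] {cause["type"]}: '
            f'{cause["confidence"]:.2f}{label}'
        )
    return lines


summary_lines = summarize(json.loads(raw))
print("\n".join(summary_lines))
```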

### Supporting Domain Outputs

Each domain action (`huawei_workload_rollout_diagnose`, `huawei_dependency_impact_analyze`, `huawei_change_impact_analyze`, `huawei_analyze_aom_alarms`) produces its own structured JSON output. See individual skill documentation for domain-specific schemas.

---

## Verification

1. Run the environment check script to confirm dependencies and credentials are available
2. Use `huawei_root_cause_analyze` on a known healthy cluster to verify it returns `success: true` with zero or low-confidence causes
3. Use `huawei_root_cause_analyze` on a cluster with known multi-domain failures to verify Top3 causes are accurately identified
4. Compare `huawei_root_cause_analyze` summary with individual domain action outputs for consistency
5. Verify that evidence chains reference specific objects, events, and API fields (not generic statements)
6. Verify that counter-evidence is present for each top cause candidate
7. Confirm that low-confidence conclusions are clearly labeled with required supplementary data
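
Some of these checks can be automated. The sketch below covers only the `success` flag and the presence of evidence and counter-evidence for each cause. It is a hypothetical helper, and the sample evidence strings are invented placeholders because this document does not specify the shape of evidence entries.

```python
def report_problems(result):
    """Return a list of structural problems found in an analysis result."""
    problems = []
    if result.get("success") is not True:
        problems.append("success is not true")
    for cause in result.get("top_causes", []):
        rank = cause.get("rank")
        if not cause.get("evidence"):
            problems.append(f"cause {rank}: evidence is empty")
        if not cause.get("counter_evidence"):
            problems.append(f"cause {rank}: counter_evidence is empty")
    return problems


# Invented sample: evidence present, counter-evidence missing
sample = {
    "success": True,
    "top_causes": [
        {"rank": 1, "evidence": ["placeholder evidence item"], "counter_evidence": []},
    ],
}
problems = report_problems(sample)
print(problems)
```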

---

## Best Practices

1. Always start with `huawei_root_cause_analyze` for comprehensive diagnosis; drill down into individual domain actions only when specific evidence requires deeper analysis
2. For workload rollout failures, prioritize the rollout funnel: generation/observedGeneration → ReplicaSet → Pod Ready → Events → Logs → command/args → probes → image
3. For service unavailability, use `huawei_dependency_impact_analyze` to trace the propagation path: Service selector → Ingress backend → Pod Ready → Node distribution
4. For suspected change-induced failures, use `huawei_change_impact_analyze` to build a "change occurred before failure" causal chain from audit logs, K8s historical events, and AOM alarms
5. Never conclude a root cause from a single alarm alone; always provide a timeline or evidence chain
6. Record supporting evidence, counter-evidence, data gaps, and remediation handoff for each root cause candidate
7. Sort root causes by impact scope, timeline alignment, evidence strength, and recoverability
8. Clearly label low-confidence conclusions with required supplementary data
9. All remediation actions must be output as recommendations only and handed off to `huawei-cloud-cce-auto-remediation-runner`
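
The handoff rule in the last item can be illustrated with a short sketch that turns each `remediation_hint` into a recommendation record without executing anything. The `handoff_recommendations` helper and the `executed` field are hypothetical; the hint fields come from the output example in this document.

```python
def handoff_recommendations(result):
    """Collect remediation hints as recommendation records (nothing is executed)."""
    handoffs = []
    for cause in result.get("top_causes", []):
        hint = cause.get("remediation_hint")
        if not hint:
            continue
        handoffs.append({
            "cause_rank": cause.get("rank"),
            "hand_off_to": hint.get("skill"),
            "action": hint.get("action"),
            "strategy": hint.get("strategy"),
            # Default to requiring confirmation when the field is absent
            "requires_confirmation": hint.get("requires_confirmation", True),
            "executed": False,  # hypothetical marker: this skill only recommends
        })
    return handoffs


sample = {
    "top_causes": [
        {
            "rank": 1,
            "remediation_hint": {
                "skill": "huawei-cloud-cce-auto-remediation-runner",
                "action": "huawei_auto_remediation_run",
                "strategy": "rollback_previous_revision",
                "requires_confirmation": True,
            },
        },
        {"rank": 2},
    ]
}
handoffs = handoff_recommendations(sample)
print(handoffs)
```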

---

## Reference Documents

- Evidence chain and root cause ranking workflow: `references/workflow.md`
- Output structure specification: `references/output-schema.md`
- Risk boundaries and handoff rules: `references/risk-rules.md`
- [Huawei Cloud CCE Documentation](https://support.huaweicloud.com/cce/index.html)
- [Huawei Cloud Python SDK Documentation](https://support.huaweicloud.com/api-cce/cce_02_0113.html)

---

## Notes

1. This skill performs read-only diagnosis and report generation only; it performs no write, scale, delete, cordon, drain, reboot, vulnerability state modification, or cluster sleep/wake operations
2. Do not output the values of HW_ACCESS_KEY, HW_SECRET_KEY, HW_SECURITY_TOKEN, or other environment variables
3. All scripts must be executed via `skill action=exec`; do not run them directly in a shell
4. Any action requiring `confirm=true` must be handed off to `huawei-cloud-cce-auto-remediation-runner`; this skill never executes remediation
5. The environment check script must be run before any diagnosis action
6. When using temporary AK/SK, HW_SECURITY_TOKEN must be set
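
The credential rules in these notes can be respected in code by reporting only whether a variable is set, never its value. The sketch below is an illustration under that assumption; `missing_env` and `describe_env` are hypothetical helpers and do not replace the environment check script.

```python
import os

REQUIRED_VARS = ("HW_ACCESS_KEY", "HW_SECRET_KEY")
OPTIONAL_VARS = ("HW_REGION_NAME", "HW_PROJECT_ID", "HW_SECURITY_TOKEN", "HW_CLUSTER_ID")


def missing_env(env):
    """Return the names of required variables that are absent or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def describe_env(env):
    """Map each known variable to 'set' or 'unset' without exposing values."""
    return {
        name: ("set" if env.get(name) else "unset")
        for name in REQUIRED_VARS + OPTIONAL_VARS
    }


# Real usage would pass os.environ; a fake mapping keeps the example safe
fake_env = {"HW_ACCESS_KEY": "placeholder"}
print(missing_env(fake_env))
print(describe_env(fake_env))
```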

---

## Common Pitfalls

1. **Concluding root cause from a single alarm**: Always require a timeline or evidence chain; a single alarm without temporal correlation is insufficient evidence
2. **Skipping `huawei_root_cause_analyze` and drilling into individual domains first**: Always start with comprehensive analysis; individual domain drill-down is for supplementary evidence only
3. **Ignoring counter-evidence**: Each root cause candidate must include counter-evidence and data gaps; omitting these leads to false confidence
4. **Not building a fault timeline**: Establish user-perceived time, alarm trigger time, Kubernetes event time, and change time before ranking causes
5. **Attempting remediation actions from this skill**: All changes must be handed off to `huawei-cloud-cce-auto-remediation-runner`; this skill only outputs recommendations
6. **Failing to label low-confidence conclusions**: When evidence is insufficient, write "insufficient evidence" explicitly; never present guesses as conclusions
7. **Not correlating changes with failures**: When a recent deployment, config, network, or security policy change exists, use `huawei_change_impact_analyze` to verify the "change before failure" causal chain
8. **Treating dependency propagation as single-direction**: Dependency impact can propagate bidirectionally (upstream failure affects downstream, and downstream back-pressure affects upstream); analyze both directions
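
The fault timeline described in pitfall 4 can be sketched as a simple sort over the four reference times. The labels, timestamps, and helper names below are invented for illustration. Temporal precedence of a change is a necessary check, not proof of causation.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"


def build_timeline(points):
    """Sort (time, label) pairs chronologically from a label -> timestamp map."""
    parsed = [(datetime.strptime(ts, TIME_FORMAT), label) for label, ts in points.items()]
    return sorted(parsed)


def change_precedes_failure(timeline, change_label="change"):
    """True when the change is the earliest entry on the timeline."""
    labels = [label for _, label in timeline]
    return bool(labels) and labels[0] == change_label


# Invented sample times for the four reference points
points = {
    "user_report": "2024-01-01T10:12:00",
    "alarm": "2024-01-01T10:07:00",
    "k8s_event": "2024-01-01T10:05:00",
    "change": "2024-01-01T10:02:00",
}
timeline = build_timeline(points)
for when, label in timeline:
    print(when.strftime(TIME_FORMAT), label)
print("change precedes failure signals:", change_precedes_failure(timeline))
```
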

- Make sure OpenClaw is installed (local or Docker)
- Run the install command in chat: `/install huawei-cloud-cce-root-cause-analyzer`
- After installation, invoke the skill by name or use `/huawei-cloud-cce-root-cause-analyzer`
- Provide required inputs per the skill's parameter spec and get structured output

What is Huawei Cloud Cce Root Cause Analyzer?
Huawei Cloud CCE cross-domain root cause analysis skill using Python SDK dispatcher. Use this skill when a CCE incident spans alarms, workload rollout, Pod e... It is an AI Agent Skill for Claude Code / OpenClaw, with 23 downloads so far.
How do I install Huawei Cloud Cce Root Cause Analyzer?
Run "/install huawei-cloud-cce-root-cause-analyzer" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.
Is Huawei Cloud Cce Root Cause Analyzer free?
Yes, Huawei Cloud Cce Root Cause Analyzer is completely free, licensed under MIT-0. You can download, install and use it at no cost.
Which platforms does Huawei Cloud Cce Root Cause Analyzer support?
Huawei Cloud Cce Root Cause Analyzer is cross-platform and runs anywhere OpenClaw / Claude Code is available.
Who created Huawei Cloud Cce Root Cause Analyzer?
It is built and maintained by shijingcheng (@pintudeyudi); the current version is v0.1.0.