# Huawei Cloud CCE Auto Remediation Runner

⚠️ **Execution Method (Must Read)**: This skill executes remediation actions via local Python scripts through the `scripts/huawei-cloud.py` dispatcher. Using hcloud, kubectl, or other CLI tools, or calling APIs directly, is prohibited.

- All actions are dispatched through `scripts/huawei-cloud.py` with `--action <action_name>` and `--params <json_params>`
- All scripts, including the environment check scripts, are inside the skill package. You must use `skill action=exec` to execute them; do not run them directly in a shell
- For action names and parameters, see the Core Tools section below
- Do not attempt hcloud, kubectl, curl IAM, or other CLI/API methods. This skill does not depend on these tools
- All paths are relative to the skill directory, which is the directory where this SKILL.md resides

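As a rough illustration of the dispatch convention above, the following sketch assembles the argument list for a single dispatcher call. It only builds the arguments and executes nothing; the helper name `build_dispatch_args` and the placeholder cluster ID are assumptions made for this example, and the actual invocation still has to go through `skill action=exec`.

```python
import json

# Dispatcher path relative to the skill directory, as stated in the text above.
DISPATCHER = "scripts/huawei-cloud.py"


def build_dispatch_args(action, params):
    """Return the argument list for one dispatcher call (nothing is executed here)."""
    if not action:
        raise ValueError("action name is required")
    # --params carries a single JSON object; keys are sorted so the output is stable.
    return [DISPATCHER, "--action", action, "--params", json.dumps(params, sort_keys=True)]


# Placeholder values for illustration only.
args = build_dispatch_args(
    "huawei_get_cce_pods",
    {"region": "cn-north-4", "cluster_id": "example-cluster-id"},
)
print(args)
```
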
## Overview

This skill converts remediation intent into reviewable, confirmable, verifiable execution plans. It operates in preview-first mode by default: every mutation action is first previewed without `confirm=true`, the user then explicitly confirms the action, object, and risks, the action is executed with `confirm=true`, and read-only verification follows.

This skill is applicable to the following scenarios:

- Remediation actions triggered by root-cause analysis conclusions (e.g., Deployment rollback for CrashLoop/ImagePull/CommandNotFound)
- Node operations: cordon, uncordon, drain, reboot ECS
- Workload operations: scale, resize, rollback, delete
- Node pool operations: resize node pool
- Cluster operations: hibernate, awake
- Security operations: HSS vulnerability status change
- Auto-remediation orchestration via `huawei_auto_remediation_run` for multi-step remediation plans
- Traffic cutover: bind/unbind cluster EIP
- ECS instance operations: start, stop

This skill does NOT handle the following:

- Read-only diagnosis (use `huawei-cloud-cce-root-cause-analyzer` or domain-specific diagnoser skills)
- Auto-executing remediation without preview and user confirmation
- Guessing or fabricating remediation results without evidence
- Batch or fuzzy-target remediation without explicit user confirmation per object

---

## Prerequisites

Before using this skill, run the environment check script, which completes environment validation and dependency installation in one step:

- Linux / macOS: `skill action=exec: bash skill://scripts/check_env.sh`
- Windows: `skill action=exec: powershell -ExecutionPolicy Bypass -File skill://scripts/check_env.ps1`

Windows Note: Do not use `&&` to chain commands (PowerShell 5.x does not support it). Use semicolons if you need to change directories first.

The script checks, in sequence: Python >= 3.6 → install dependencies → validate SDK → validate credentials → validate service availability.
If the environment check fails, fix the issues before continuing with other actions.

Environment Variables:

| Variable | Required | Description |
|----------|----------|-------------|
| HW_ACCESS_KEY | Yes | Huawei Cloud AK |
| HW_SECRET_KEY | Yes | Huawei Cloud SK |
| HW_REGION_NAME | No | Default cn-north-4 |
| HW_PROJECT_ID | No | Project ID (automatically obtained via IAM API when not set) |
| HW_SECURITY_TOKEN | No | Required when using temporary AK/SK |
| HW_CLUSTER_ID | No | Default CCE cluster ID (can also be passed per action) |

Security Constraints:

- Never persist credentials (AK/SK/Token/Certificate) to the filesystem
- AK/SK exist only within the current request call stack and are released after use
- Only non-sensitive project IDs are cached in process memory (never written to disk)
- All temporary certificate files must be deleted immediately after use
- Never expose AK/SK in logs, responses, or error messages

Do not output the values of the above environment variables.

---

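The environment variable requirements from the Prerequisites table can be pre-checked with a few lines of Python. This is only a sketch, not the logic of `check_env.sh` or `check_env.ps1`: the function names are hypothetical, and it deliberately reports variable names only, never values, in line with the security constraints.

```python
import os

# Required variables and the one documented default, taken from the table above.
REQUIRED_VARS = ("HW_ACCESS_KEY", "HW_SECRET_KEY")
DEFAULT_REGION = "cn-north-4"


def missing_required(env):
    """Return the names (never the values) of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def effective_region(env):
    """Return HW_REGION_NAME if set, otherwise the documented default."""
    return env.get("HW_REGION_NAME") or DEFAULT_REGION


if __name__ == "__main__":
    # Only variable names are printed; credential values are never echoed.
    print("missing required variables:", missing_required(os.environ))
```
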
## IAM Permission Requirements

| API Action | Permission | Purpose |
|-----------|------------|---------|
| cce:cluster:get | Get cluster | View cluster details |
| cce:cluster:list | List clusters | List CCE clusters |
| cce:node:get | Get node | View node details |
| cce:node:list | List nodes | List cluster nodes |
| cce:node:update | Update node | Cordon/uncordon/drain nodes |
| cce:nodepool:update | Update node pool | Resize node pools |
| cce:nodepool:get | Get node pool | View node pool details |
| cce:nodepool:list | List node pools | List node pools |
| aom:*:get | Read AOM | Query AOM metrics and alarms |
| aom:alarmRule:list | List alarm rules | Query alarm rules for validation |
| aom:event:list | List events | Query AOM alarm events |

Permission Failure Handling:

- When any command fails due to a permission error, display the required permission list and policy JSON
- Guide the user to create a custom policy in the IAM console and grant authorization
- Pause execution and wait for the user to confirm that permissions have been granted

---

## Core Tools

All actions are dispatched through `scripts/huawei-cloud.py` using `skill action=exec`.

### Auto-Remediation Orchestration

| Action | Required Parameters | Description |
|--------|---------------------|-------------|
| `huawei_auto_remediation_run` | region, cluster_id, strategy | Orchestrate multi-step remediation plan; strategy determines actions (rollback_previous_revision, scale_out, drain_and_replace, etc.) |

### Workload Actions

| Action | Required Parameters | Description |
|--------|---------------------|-------------|
| `huawei_rollback_cce_workload` | region, cluster_id, namespace, kind, name | Roll back Deployment/StatefulSet/DaemonSet to previous revision |
| `huawei_scale_cce_workload` | region, cluster_id, namespace, kind, name, replicas | Scale workload replicas |
| `huawei_resize_cce_workload` | region, cluster_id, namespace, kind, name | Resize workload resource limits |
| `huawei_delete_cce_workload` | region, cluster_id, namespace, kind, name | Delete a workload |

### Node Actions

| Action | Required Parameters | Description |
|--------|---------------------|-------------|
| `huawei_cce_node_cordon` | region, cluster_id, node_name | Mark node as unschedulable |
| `huawei_cce_node_uncordon` | region, cluster_id, node_name | Mark node as schedulable again |
| `huawei_cce_node_drain` | region, cluster_id, node_name | Evict all pods from node |
| `huawei_reboot_ecs` | region, ecs_id | Reboot the underlying ECS instance |

### Node Pool and Cluster Actions

| Action | Required Parameters | Description |
|--------|---------------------|-------------|
| `huawei_resize_cce_nodepool` | region, cluster_id, nodepool_id, target_count | Resize node pool to target count |
| `huawei_hibernate_cce_cluster` | region, cluster_id | Hibernate (sleep) the CCE cluster |
| `huawei_awake_cce_cluster` | region, cluster_id | Awake (wake) the CCE cluster |
| `huawei_delete_cce_cluster` | region, cluster_id | Delete the CCE cluster |
| `huawei_delete_cce_node` | region, cluster_id, node_name | Delete a node from the cluster |

### ECS Instance Actions

| Action | Required Parameters | Description |
|--------|---------------------|-------------|
| `huawei_start_ecs_instance` | region, ecs_id | Start ECS instance |
| `huawei_stop_ecs_instance` | region, ecs_id | Stop ECS instance |

### Elastic Scaling Policy

| Action | Required Parameters | Description |
|--------|---------------------|-------------|
| `huawei_configure_cce_hpa` | region, cluster_id, namespace, kind, name, min_replicas, max_replicas | Configure HPA policy for workload |

### Network / Traffic Actions

| Action | Required Parameters | Description |
|--------|---------------------|-------------|
| `huawei_bind_cce_cluster_eip` | region, cluster_id, eip_id | Bind EIP to cluster for external access |
| `huawei_unbind_cce_cluster_eip` | region, cluster_id | Unbind EIP from cluster |
| `huawei_network_verify_pod_scheduling` | region, cluster_id, namespace | Verify pod scheduling network connectivity |

### Security Actions

| Action | Required Parameters | Description |
|--------|---------------------|-------------|
| `huawei_hss_change_vul_status` | region, vul_id, status | Change HSS vulnerability handling status |

### Verification (Read-Only) Actions

| Action | Required Parameters | Description |
|--------|---------------------|-------------|
| `huawei_get_cce_pods` | region, cluster_id | List pods in cluster |
| `huawei_get_kubernetes_nodes` | region, cluster_id | List Kubernetes nodes in cluster |
| `huawei_get_cce_events` | region, cluster_id | List Kubernetes Events in cluster |
| `huawei_workload_rollout_diagnose` | region, cluster_id, namespace, kind, name | Diagnose workload rollout status |
| `huawei_root_cause_analyze` | region, cluster_id | Comprehensive root cause analysis (cross-skill: huawei-cloud-cce-root-cause-analyzer) |
| `huawei_dependency_impact_analyze` | region, cluster_id | Dependency impact analysis (cross-skill: huawei-cloud-cce-root-cause-analyzer) |
| `huawei_node_diagnose` | region, cluster_id | Node-level diagnosis |
| `huawei_workload_diagnose` | region, cluster_id | Workload status diagnosis |

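The required-parameter columns in the tables above lend themselves to a simple pre-flight check before a preview is requested. The sketch below copies a handful of rows into a lookup table; the `REQUIRED_PARAMS` mapping and the `missing_params` helper are illustrative names rather than part of the skill, and the dispatcher may well perform its own validation.

```python
# A subset of the Core Tools tables, copied by hand for illustration.
REQUIRED_PARAMS = {
    "huawei_rollback_cce_workload": ("region", "cluster_id", "namespace", "kind", "name"),
    "huawei_scale_cce_workload": ("region", "cluster_id", "namespace", "kind", "name", "replicas"),
    "huawei_cce_node_drain": ("region", "cluster_id", "node_name"),
    "huawei_resize_cce_nodepool": ("region", "cluster_id", "nodepool_id", "target_count"),
    "huawei_reboot_ecs": ("region", "ecs_id"),
}


def missing_params(action, params):
    """Return required parameter names that are absent (or None) for the given action."""
    if action not in REQUIRED_PARAMS:
        raise KeyError(f"action not in the local lookup table: {action}")
    # Compare against None so that legitimate values such as replicas=0 pass.
    return [key for key in REQUIRED_PARAMS[action] if params.get(key) is None]


request = {
    "region": "cn-north-4",
    "cluster_id": "example-cluster-id",  # placeholder value
    "namespace": "default",
    "kind": "Deployment",
    "name": "app-server",
}
print(missing_params("huawei_scale_cce_workload", request))
```
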
---

## Parameter Reference

Common Parameters:

| Parameter | Required | Description |
|-----------|----------|-------------|
| region | Yes | Huawei Cloud region, e.g., cn-north-4 |
| cluster_id | Yes* | CCE cluster ID |
| namespace | Yes* | Kubernetes namespace (required for workload actions) |
| kind | Yes* | Workload type: Deployment, StatefulSet, or DaemonSet |
| name | Yes* | Workload name or node name |
| node_name | Yes* | Node name (required for node actions) |
| nodepool_id | Yes* | Node pool ID (required for node pool resize) |
| ecs_id | Yes* | ECS instance ID (required for ECS actions) |
| replicas | Yes* | Target replica count (required for scale) |
| target_count | Yes* | Target node count (required for node pool resize) |
| strategy | Yes* | Remediation strategy (required for auto-remediation) |
| confirm | No | Set to true ONLY after explicit user confirmation |

*Required for specific actions as noted.

Optional Parameters (passed via `--params` JSON):

| Parameter | Description |
|-----------|-------------|
| ak | Override AK (uses HW_ACCESS_KEY by default) |
| sk | Override SK (uses HW_SECRET_KEY by default) |
| project_id | Override project ID (auto-obtained via IAM when not set) |
| min_replicas | HPA minimum replicas |
| max_replicas | HPA maximum replicas |
| vul_id | HSS vulnerability ID |
| status | HSS vulnerability handling status |
| eip_id | EIP ID for bind action |

---

## Output Format

### Remediation Preview (confirm=false)

```json
{
  "success": false,
  "requires_confirmation": true,
  "remediation_trace_id": "ARR-...",
  "strategy": "rollback_previous_revision",
  "diagnosis": {},
  "action_result": {},
  "preview": {
    "action": "huawei_rollback_cce_workload",
    "target": {
      "region": "cn-north-4",
      "cluster_id": "cluster-id",
      "namespace": "default",
      "kind": "Deployment",
      "name": "app-server"
    },
    "current_state": {},
    "expected_state": {},
    "impact_scope": {},
    "rollback_method": "Re-apply current revision"
  },
  "risk_level": "R2",
  "rollback_notes": [],
  "summary": "Remediation plan preview — requires user confirmation before execution"
}
```
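
A preview response shaped like the one above can be turned into the short restatement that is shown to the user before confirmation is requested. The following is a minimal sketch that assumes only the field names visible in the sample; the `summarize_preview` helper and the fallback strings are hypothetical.

```python
def summarize_preview(resp):
    """Build a plain-text restatement of a preview response for the user to confirm."""
    if not resp.get("requires_confirmation"):
        raise ValueError("response is not a preview awaiting confirmation")
    preview = resp.get("preview", {})
    target = preview.get("target", {})
    target_text = ", ".join(f"{key}={value}" for key, value in target.items())
    lines = [
        f"Action: {preview.get('action', 'unknown')}",
        f"Target: {target_text or 'unknown'}",
        f"Risk level: {resp.get('risk_level', 'unknown')}",
        f"Rollback: {preview.get('rollback_method', 'not stated')}",
    ]
    return "\n".join(lines)


# Trimmed-down version of the sample preview above.
example = {
    "requires_confirmation": True,
    "risk_level": "R2",
    "preview": {
        "action": "huawei_rollback_cce_workload",
        "target": {"namespace": "default", "kind": "Deployment", "name": "app-server"},
        "rollback_method": "Re-apply current revision",
    },
}
print(summarize_preview(example))
```
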

### Remediation Execution (confirm=true)

```json
{
  "success": true,
  "requires_confirmation": false,
  "confirmation_received": true,
  "remediation_trace_id": "ARR-...",
  "strategy": "rollback_previous_revision",
  "action_result": {},
  "execution": {
    "action": "huawei_rollback_cce_workload",
    "timestamp": "...",
    "result": {}
  },
  "verification": [
    {
      "method": "huawei_get_cce_pods",
      "status": "healthy",
      "details": {}
    }
  ],
  "report_markdown": "# CCE Auto Remediation Execution Report...",
  "report_file": "optional"
}
```
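
One way to consume an execution response like the one above is to treat the remediation as verified only when the call succeeded and every verification entry reports a passing status. This sketch assumes that `"healthy"` is the only passing value, which the sample suggests but does not guarantee; the `execution_verified` helper is hypothetical.

```python
def execution_verified(resp):
    """Return True only if the call succeeded and every verification entry is 'healthy'."""
    checks = resp.get("verification") or []
    # An execution with no verification entries is treated as unverified.
    if not resp.get("success") or not checks:
        return False
    return all(check.get("status") == "healthy" for check in checks)


response = {
    "success": True,
    "verification": [
        {"method": "huawei_get_cce_pods", "status": "healthy", "details": {}},
    ],
}
print(execution_verified(response))
```
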

### Full Auto-Remediation Orchestration Output

```json
{
  "success": false,
  "requires_confirmation": true,
  "confirmation_received": false,
  "remediation_trace_id": "ARR-...",
  "strategy": "rollback_previous_revision",
  "diagnosis": {},
  "action_result": {},
  "summary": "remediation plan or execution result",
  "action": "huawei_auto_remediation_run",
  "risk_level": "R2",
  "target": {
    "region": "cn-north-4",
    "cluster_id": "optional",
    "resource": "optional"
  },
  "preview": {},
  "execution": {},
  "verification": [],
  "rollback_notes": [],
  "report_markdown": "# CCE Auto Remediation Execution Report...",
  "report_file": "optional"
}
```

---

## Verification

1. Run the environment check script to confirm dependencies and credentials are available
2. Use `huawei_cce_node_cordon` (without `confirm=true`) on a test node to verify preview mode returns `requires_confirmation: true`
3. After user confirmation, execute with `confirm=true` and verify node status with `huawei_get_kubernetes_nodes`
4. Use `huawei_rollback_cce_workload` preview mode to verify it shows current vs expected state
5. After rollback execution, verify workload health with `huawei_workload_rollout_diagnose`
6. Use `huawei_auto_remediation_run` preview mode to verify the multi-step orchestration plan is shown before execution
7. Confirm that all R3 actions (drain, reboot, delete, hibernate) require explicit user confirmation
8. Verify that post-execution verification actions return healthy/expected status

---

## Best Practices

1. **Always preview first**: Never call any mutation action with `confirm=true` on the first invocation. Always preview without `confirm=true` first
2. **State the four essentials**: Before confirmation, restate to the user the action, the target object and its parameters, the impact scope, and the rollback plan
3. **Prefer rollback for deployment failures**: If the root cause comes from `huawei-cloud-cce-root-cause-analyzer` and a startup command, CrashLoop, probe, or image problem has made the new version unavailable, prefer `huawei_auto_remediation_run` with the `rollback_previous_revision` strategy
4. **Verify after execution**: Every execution must be followed by read-only verification (Pod status, Node status, Events, workload rollout diagnosis)
5. **Classify risk correctly**: Refer to `references/risk-rules.md` for R1/R2/R3 classification; apply the appropriate confirmation requirements
6. **Never auto-add confirm**: Deployment rollback, scale, resize, resource modification, delete cluster/node/workload, drain, reboot, and HSS vulnerability status change must all follow preview → user confirm → execute → verify
7. **Use auto-remediation orchestration for multi-step plans**: When remediation involves multiple actions, use `huawei_auto_remediation_run` to produce a complete execution report with diagnosis basis, action results, and verification results
8. **Cross-skill handoff for diagnosis**: When root cause analysis is needed before remediation, hand off to `huawei-cloud-cce-root-cause-analyzer`; this skill only executes confirmed remediation actions
9. **Document rollback notes**: Every execution plan must include the rollback method, i.e. how to revert if the remediation causes unintended effects
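
The preview → user confirm → execute → verify sequence described in these practices can be sketched as a small control flow. The `dispatch` and `ask_user` callables below are stand-ins supplied by the caller (in practice dispatch would go through `skill action=exec`); the function name and the shape of the returned dictionary are assumptions for illustration only.

```python
def run_with_confirmation(dispatch, ask_user, action, params, verify_action):
    """Preview first, execute only after explicit confirmation, then verify read-only."""
    # Step 1: preview. Any confirm flag passed in is stripped, never forwarded.
    preview_params = {k: v for k, v in params.items() if k != "confirm"}
    preview = dispatch(action, preview_params)

    # Step 2: explicit user confirmation of the previewed plan.
    if not ask_user(preview):
        return {"executed": False, "preview": preview}

    # Step 3: execute with confirm=true only after the user agreed.
    result = dispatch(action, {**preview_params, "confirm": True})

    # Step 4: read-only verification scoped to the same region and cluster.
    scope = {k: preview_params[k] for k in ("region", "cluster_id") if k in preview_params}
    verification = dispatch(verify_action, scope)
    return {"executed": True, "preview": preview, "result": result, "verification": verification}


def _stub_dispatch(action, params):
    # Stand-in for the real dispatcher; echoes what it was asked to do.
    return {"action": action, "requires_confirmation": "confirm" not in params}


outcome = run_with_confirmation(
    _stub_dispatch,
    lambda preview: False,  # the user declines, so nothing is executed
    "huawei_cce_node_cordon",
    {"region": "cn-north-4", "cluster_id": "example-cluster-id", "node_name": "example-node"},
    "huawei_get_kubernetes_nodes",
)
print(outcome["executed"])
```

Passing the dispatcher and the confirmation prompt in as arguments keeps the ordering logic testable without touching a real cluster.
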

---

## Reference Documents

- Workflow and action orchestration steps: `references/workflow.md`
- Risk classification and confirm=true rules: `references/risk-rules.md`
- Output execution record schema: `references/output-schema.md`
- [Huawei Cloud CCE Documentation](https://support.huaweicloud.com/cce/index.html)
- [Huawei Cloud Python SDK Documentation](https://support.huaweicloud.com/api-cce/cce_02_0113.html)

---

## Notes

1. This skill is a **MUTATION skill**: it performs write actions (drain, cordon, scale, restart, delete, reboot, hibernate, vulnerability status change). The preview+confirm workflow is mandatory
2. Do not output the values of HW_ACCESS_KEY, HW_SECRET_KEY, HW_SECURITY_TOKEN, or other environment variables
3. All scripts must be executed via `skill action=exec`; do not run them directly in a shell
4. NEVER auto-add `confirm=true`. The user must explicitly confirm the specific action, object, and risks
5. The environment check script must be run before any remediation action
6. When using temporary AK/SK, HW_SECURITY_TOKEN must be set
7. After execution, read-only verification actions must be called to confirm status
8. Cross-skill references: diagnosis → `huawei-cloud-cce-root-cause-analyzer`; domain-specific diagnosis → `huawei-cloud-cce-pod-failure-diagnoser`, `huawei-cloud-cce-node-failure-diagnoser`, `huawei-cloud-cce-network-failure-diagnoser`

---

## Common Pitfalls

1. **Auto-adding confirm=true**: The most critical pitfall. NEVER assume user intent implies confirmation. Always preview first, show results, and wait for explicit user confirmation
2. **Skipping preview for R2 actions**: Even medium-risk actions (scale, resize, cordon, rollback) require preview. No mutation action may skip the preview step
3. **Not verifying after execution**: Every R2/R3 execution must be followed by read-only verification (Pod/Node/Workload/Events status). Skipping verification leaves remediation unconfirmed
4. **Batch or fuzzy-target remediation**: R3 actions (drain, reboot, delete, hibernate) must have explicit, specific target objects. Never execute with vague or batch targets without per-object confirmation
5. **Not documenting rollback method**: Every remediation plan must state how to revert if the action causes unintended effects. Omitting rollback notes is a safety hazard
6. **Executing remediation without diagnosis**: Always confirm root cause via `huawei-cloud-cce-root-cause-analyzer` or a domain diagnoser before remediation. Blind remediation without evidence is prohibited
7. **Confusing R2 and R3 risk levels**: R2 (runtime impact) requires preview+confirm; R3 (destructive) requires explicit per-object confirmation with additional verification. See `references/risk-rules.md`
8. **Not restating the plan to the user**: Before requesting confirmation, restate the action, the target object (including region and cluster_id), the expected impact, and the rollback plan. The user must confirm all four essentials
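
The per-object confirmation rule from pitfall 4 can be expressed as a small gate that runs before any execution call. This is a sketch under stated assumptions: treating an empty name or a `*` wildcard as a "fuzzy" target is a convention invented for this example, and `pending_confirmations` is a hypothetical helper; the authoritative rules are in `references/risk-rules.md`.

```python
def pending_confirmations(targets, confirmed):
    """Return the targets that still lack an explicit, per-object confirmation."""
    # Assumption for this sketch: empty names and wildcards count as fuzzy targets.
    fuzzy = [t for t in targets if not t or "*" in t]
    if fuzzy:
        raise ValueError(f"fuzzy targets are not allowed: {fuzzy}")
    return [t for t in targets if t not in confirmed]


# Two nodes were requested, but the user has confirmed only one of them.
remaining = pending_confirmations(["node-a", "node-b"], {"node-a"})
print(remaining)
```

Execution would proceed only for targets that are no longer in the returned list.
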
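
---

- Make sure OpenClaw is installed (local or Docker deployment)
- Enter the install command in the chat box: `/install huawei-cloud-cce-auto-remediation-runner`
- After installation, invoke the skill by name or trigger it with `/huawei-cloud-cce-auto-remediation-runner`
- Provide the required inputs according to the skill's parameter descriptions to get structured output

**What is Huawei Cloud CCE Auto Remediation Runner?**
Huawei Cloud CCE auto-remediation runner skill that converts remediation intent into preview-first, confirm-required, post-verify execution plans. Use this s... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 38 downloads to date.

**How do I install Huawei Cloud CCE Auto Remediation Runner?**
Run the command `/install huawei-cloud-cce-auto-remediation-runner` in the OpenClaw or Claude Code chat box to install it in one step; no additional configuration is needed.

**Is Huawei Cloud CCE Auto Remediation Runner free?**
Yes. Huawei Cloud CCE Auto Remediation Runner is completely free, is released under the MIT-0 license, and can be freely downloaded, installed, and used.

**Which platforms does Huawei Cloud CCE Auto Remediation Runner support?**
Huawei Cloud CCE Auto Remediation Runner is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

**Who developed Huawei Cloud CCE Auto Remediation Runner?**
It is developed and maintained by shijingcheng (@pintudeyudi); the current version is v0.1.2.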