
Huawei Cloud Cce Autoscaling Diagnoser

Name: Huawei Cloud Cce Autoscaling Diagnoser
Author: pintudeyudi

Author: shijingcheng · GitHub · v0.1.2 · MIT-0

cross-platform ⚠ suspicious

Install in OpenClaw

/install huawei-cloud-cce-autoscaling-diagnoser

Description

Huawei Cloud CCE autoscaling failure diagnosis skill using Python SDK dispatcher. Use this skill when the user wants to: (1) diagnose CCE autoscaling failure...

Usage (SKILL.md)

# Huawei Cloud CCE Autoscaling Diagnoser

⚠️ **Execution Method (Must Read)**: This skill executes queries via the local Python dispatcher script. Using hcloud, openstack, or other CLI tools or direct API calls is prohibited.

- The dispatcher script is located at `scripts/huawei-cloud.py` within the skill directory.
- All scripts and environment check scripts are inside the skill package. You must use skill `action=exec` to execute them. Do not run them directly in a shell.
- Do not attempt hcloud, openstack, curl IAM, or any other CLI/API methods. This skill does not depend on those tools.
- All paths are relative to the skill directory, which is the directory where this SKILL.md is located.

## Overview

This skill diagnoses CCE autoscaling link failures across two closed-loop layers: (1) whether HPA increases workload replica count from N to N+1, and (2) whether CCE elastic engine / Cluster Autoscaler increases node count from M to M+1 after resource-insufficient Pending Pods appear. It outputs a complete Markdown diagnosis report with process, evidence, root cause conclusion, confidence, data gaps, and recommendations.

**Architecture**: Python dispatcher (`scripts/huawei-cloud.py`) → Huawei Cloud Python SDK + Kubernetes client → HPA/CA/Addon/NodePool/Pod/Events/Metrics → Gateway intent routing → Path A/B/C diagnosis → Structured evidence + Markdown report

**Related Skills**:

| Skill | Purpose |
|-------|---------|
| huawei-cloud-cce-pod-failure-diagnoser | Pod runtime failure diagnosis (CrashLoopBackOff, OOMKilled, Pending) |
| huawei-cloud-cce-node-failure-diagnoser | Node-level failure diagnosis (NotReady, disk/memory pressure) |
| huawei-cloud-cce-workload-failure-diagnoser | Workload rollout failure diagnosis |
| huawei-cloud-cce-auto-remediation-runner | Execute remediation actions (HPA config, nodepool resize) |
| huawei-cloud-cce-root-cause-analyzer | Cross-resource root cause correlation |
| huawei-cloud-cce-alarm-correlation-engine | Alarm correlation and diagnosis triggering |
| huawei-cloud-cce-capacity-trend-forecaster | Capacity trend and HPA coverage analysis |
| huawei-cloud-cce-cost-optimization-advisor | Resource governance and cost optimization |

**Capabilities**:

- One-shot autoscaling diagnosis with Gateway intent routing, capability discovery, and Path A/B/C evidence collection (`huawei_autoscaling_diagnose`)
- HPA object inspection: spec, currentReplicas, desiredReplicas, minReplicas, maxReplicas, conditions, metrics (`huawei_list_cce_hpas`)
- CCE addon and nodepool autoscaling discovery (`huawei_list_cce_addons`, `huawei_list_cce_nodepools`)
- CA Pod log analysis: automatic discovery of kube-system autoscaler Pods, log retrieval, and 16 diagnostic signal pattern matching
- Pending Pod and scheduling constraint analysis (`huawei_get_cce_pods`, `huawei_get_cce_events`)
- AOM/Prometheus custom metric evidence (`huawei_get_aom_metrics`)
- Complete Markdown report generation with evidence, conclusion, confidence, and recommendations

**Typical Use Cases**:

- "HPA is not scaling my Deployment, what's wrong?"
- "Why isn't the Cluster Autoscaler adding nodes when Pods are Pending?"
- "My workload replicas aren't increasing despite high CPU usage"
- "Diagnose why autoscaling is not working in my CCE cluster"
- "HPA shows desiredReplicas equals currentReplicas, why no scaling?"
- "Pods are Pending with Insufficient cpu/memory but no new nodes appear"
- "Check if autoscaling is properly configured for my workload"
- "Analyze CA logs for node scaling failure signals"

## Prerequisites

### 1. Python Requirements (MANDATORY)

- Python >= 3.6 installed
- Required packages: `huaweicloudsdkcore`, `huaweicloudsdkcce`, `huaweicloudsdkaom`, `kubernetes`
- Verify: `python3 --version`
- Install packages: `pip3 install huaweicloudsdkcore huaweicloudsdkcce huaweicloudsdkaom kubernetes`

### 2. Credential Configuration

- Valid Huawei Cloud credentials (AK/SK mode)

**Security Rules**:

- 🚫 Never expose AK/SK values in code, conversation, or commands
- 🚫 Never use `echo $HUAWEI_AK` or `echo $HUAWEI_SK` to check credentials
- 🚫 Never write credentials to files, logs, or responses
- ✅ Use environment variables: `HUAWEI_AK`, `HUAWEI_SK`, `HUAWEI_REGION`
- ✅ Credentials exist only in the current request call stack and are released after each invocation
- ✅ Prefer IAM users over root account for cloud operations

**Configuration Method (Environment Variables Only)**:

```bash
export HUAWEI_AK=<your-ak>
export HUAWEI_SK=<your-sk>
export HUAWEI_REGION=cn-north-4
```

**Additional Variables**:

| Variable | Required | Description |
|----------|----------|-------------|
| `HUAWEI_AK` | Yes | Huawei Cloud Access Key |
| `HUAWEI_SK` | Yes | Huawei Cloud Secret Key |
| `HUAWEI_REGION` | No | Default region (overrides `region` param if set) |
| `HUAWEI_PROJECT_ID` | No | Project ID (auto-obtained via IAM API when not set) |
| `HUAWEI_SECURITY_TOKEN` | No | Required when using temporary AK/SK |
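
Since the security rules forbid echoing AK/SK values, a presence-only check is one way to confirm the environment is configured. The following is a minimal sketch, not part of the skill package: the function names `credential_status` and `missing_required` are hypothetical, and only the variable names come from the table above.

```python
import os

# Variable names are taken from the table above; everything else is illustrative.
REQUIRED_VARS = ("HUAWEI_AK", "HUAWEI_SK")
OPTIONAL_VARS = ("HUAWEI_REGION", "HUAWEI_PROJECT_ID", "HUAWEI_SECURITY_TOKEN")


def credential_status(env=None):
    """Report which variables are set, never their values."""
    env = os.environ if env is None else env
    # bool() records presence only; the secret itself is never copied or printed.
    return {name: bool(env.get(name)) for name in REQUIRED_VARS + OPTIONAL_VARS}


def missing_required(env=None):
    """Return the names of required variables that are unset or empty."""
    status = credential_status(env)
    return [name for name in REQUIRED_VARS if not status[name]]
```

Calling `missing_required()` with no argument inspects the current process environment and returns only variable names, so its result is safe to show in a conversation or log.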

### 3. IAM Permission Requirements

| API Action | Service | Purpose |
|------------|---------|---------|
| CCE cluster read | CCE | `huawei_list_cce_clusters`, `huawei_list_cce_nodepools` |
| CCE addon read | CCE | `huawei_list_cce_addons`, `huawei_get_cce_addon_detail` |
| CCE HPA read | CCE (kubeconfig) | `huawei_list_cce_hpas` |
| CCE workload read | CCE (kubeconfig) | `huawei_get_cce_deployments`, `huawei_list_cce_statefulsets` |
| CCE Pod read | CCE (kubeconfig) | `huawei_get_cce_pods` |
| CCE Pod logs | CCE (kubeconfig) | `huawei_get_pod_logs` |
| CCE Events read | CCE (kubeconfig) | `huawei_get_cce_events` |
| AOM metrics read | AOM | `huawei_get_aom_metrics`, `huawei_get_cce_pod_metrics_topN`, `huawei_get_cce_node_metrics_topN` |

**Permission Failure Handling**:

1. When any action fails due to IAM permission errors, display the required permission list
2. Guide the user to create custom policies in the IAM console for Huawei Cloud permissions
3. Pause execution and wait for user confirmation that permissions have been granted
4. Retry the failed action

## Core Commands

All actions are invoked via the dispatcher script:

```bash
python3 scripts/huawei-cloud.py <action> region=<region> cluster_id=<cluster_id> [key=value ...]
```

### 1. Primary Diagnosis Action

```bash
python3 scripts/huawei-cloud.py huawei_autoscaling_diagnose \
  region=cn-north-4 cluster_id=<cluster_id> \
  namespace=default workload_name=my-app workload_type=Deployment \
  question="Why isn't HPA scaling my workload?"
```

Returns structured evidence + `report_markdown` (complete Markdown diagnosis report). When `report_markdown` is present, use it as the final report body. You may add clarifications the user requests, but do not discard evidence tables.
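
The rule of preferring `report_markdown` when it is present can be sketched as below. This is an illustration under stated assumptions: it assumes the dispatcher response is a single JSON object, the helper name `final_report` is hypothetical, and the fallback text is made up for the example. The field names `conclusion`, `confidence`, and `data_gaps` are the ones this document lists under Output Format.

```python
import json


def final_report(raw_output: str) -> str:
    """Choose the report body from a dispatcher JSON response (assumed shape)."""
    response = json.loads(raw_output)
    report = response.get("report_markdown")
    if report:
        # The complete Markdown report wins whenever it is present.
        return report
    # Hypothetical fallback: summarize the structured fields instead.
    lines = [
        f"Conclusion: {response.get('conclusion', 'unavailable')}",
        f"Confidence: {response.get('confidence', 'unavailable')}",
    ]
    for gap in response.get("data_gaps", []):
        lines.append(f"Data gap: {gap}")
    return "\n".join(lines)
```

The fallback branch only matters when the primary action returns no report; it keeps the conclusion, confidence, and data gaps visible rather than returning nothing.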

### 2. Evidence Collection Actions (Read-Only)

| Action | Required Params | Description |
|--------|----------------|-------------|
| `huawei_list_cce_hpas` | `region`, `cluster_id` | List HPA specs, current/desired replicas, conditions, metrics |
| `huawei_list_cce_addons` | `region`, `cluster_id` | Identify CCE elastic engine, metrics/AOM/Prometheus addons |
| `huawei_get_cce_addon_detail` | `region`, `cluster_id`, `addon_id` | Get addon detail (version, status) |
| `huawei_list_cce_nodepools` | `region`, `cluster_id` | List nodepools: autoscaling enable, min/max, current node count |
| `huawei_get_cce_pods` | `region`, `cluster_id` | List Pod phase, owner, container state, resources.requests/limits, annotations |
| `huawei_get_cce_deployments` | `region`, `cluster_id` | Read Deployment desired/current/ready replicas |
| `huawei_list_cce_statefulsets` | `region`, `cluster_id` | Read StatefulSet desired/current/ready replicas |
| `huawei_get_cce_events` | `region`, `cluster_id` | Read HPA, Pod, Scheduler events (FailedScheduling, FailedGetResourceMetric) |
| `huawei_get_cce_pod_metrics_topN` | `region`, `cluster_id` | Pod resource metric ranking |
| `huawei_get_cce_node_metrics_topN` | `region`, `cluster_id` | Node resource metric ranking |
| `huawei_get_aom_metrics` | `region`, `cluster_id` | AOM/Prometheus custom metric queries |

### 3. CA Pod Log Analysis (Manual Fallback)

```bash
# Step 1: Locate CA component Pods in kube-system
python3 scripts/huawei-cloud.py huawei_get_cce_pods \
  region=cn-north-4 cluster_id=<cluster_id> namespace=kube-system

# Step 2: Retrieve CA Pod logs (find pods with names containing autoscaler/cce-elastic/elastic-engine)
python3 scripts/huawei-cloud.py huawei_get_pod_logs \
  region=cn-north-4 cluster_id=<cluster_id> namespace=kube-system \
  pod_name=cce-cluster-autoscaler-abc123 container=autoscaler tail_lines=200
```
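
The name filter described in Step 2 can be sketched as a small helper. This is only an illustration: the helper name `find_ca_pods` is hypothetical, and it assumes the Pod names have already been extracted from the `huawei_get_cce_pods` output into a plain list, since the exact response shape is not described here.

```python
# Substrings named in Step 2 above.
CA_NAME_HINTS = ("autoscaler", "cce-elastic", "elastic-engine")


def find_ca_pods(pod_names):
    """Return the Pod names that contain any CA name hint (case-insensitive)."""
    return [
        name
        for name in pod_names
        if any(hint in name.lower() for hint in CA_NAME_HINTS)
    ]
```

Each returned name would then be passed as `pod_name` to `huawei_get_pod_logs`.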

**CA Log Signal Quick Reference**:

| Signal | Meaning | Severity |
|--------|---------|----------|
| `No expansion options` | No available expansion options for node pool specs/AZ/subnet | critical |
| `max node group size reached` | Node group reached max_nodes limit | critical |
| `Scale-up: final scale-up plan is empty` | All node groups skipped in expansion plan | critical |
| `Quota exceeded` / `quota limit` | Cloud resource (ECS/EVS/EIP) quota insufficient | critical |
| `subnet ip exhausted` / `no available ip` | VPC subnet available IP exhausted | critical |
| `iam` / `permission denied` / `agency` / `forbidden` | IAM agency or permission abnormality | critical |
| `Failed to refresh` / `cannot connect` | CA cannot connect to cloud API or control plane | high |
| `skipping node group` | CA skipped a node group (reason in log) | high |
| `pod ... is unschedulable` | CA identified an unschedulable Pod | info |
| `ScaleDown: no candidates` | No candidate nodes for scale-down | info |
| `node ... is not suitable for removal` | Node does not meet scale-down conditions | high |
| `not safe to evict` / `safe-to-evict=false` | PDB or annotation protection blocking eviction | high |
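
As a rough illustration of how the quick-reference table could be applied to a retrieved log tail, the sketch below matches each line against the twelve signals listed above. The regular expressions are approximations written for this example, not the patterns the dispatcher actually uses, and `scan_ca_log` is a hypothetical name. It reports line numbers rather than raw log text, in keeping with the log sanitization rule.

```python
import re

# (pattern, meaning, severity), transcribed from the table above.
# The regexes are illustrative approximations, not the skill's own patterns.
SIGNALS = [
    (r"no expansion options", "No available expansion options", "critical"),
    (r"max node group size reached", "Node group reached max_nodes limit", "critical"),
    (r"final scale-up plan is empty", "All node groups skipped", "critical"),
    (r"quota exceeded|quota limit", "Cloud resource quota insufficient", "critical"),
    (r"subnet ip exhausted|no available ip", "Subnet IPs exhausted", "critical"),
    (r"\biam\b|permission denied|agency|forbidden", "IAM agency or permission problem", "critical"),
    (r"failed to refresh|cannot connect", "CA cannot reach cloud API", "high"),
    (r"skipping node group", "CA skipped a node group", "high"),
    (r"pod .* is unschedulable", "Unschedulable Pod identified", "info"),
    (r"scaledown: no candidates", "No scale-down candidates", "info"),
    (r"node .* is not suitable for removal", "Node not removable", "high"),
    (r"not safe to evict|safe-to-evict=false", "Eviction blocked", "high"),
]
COMPILED = [(re.compile(p, re.IGNORECASE), m, s) for p, m, s in SIGNALS]
SEVERITY_ORDER = {"critical": 0, "high": 1, "info": 2}


def scan_ca_log(text):
    """Return matched signals, most severe first, with 1-based line numbers."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for pattern, meaning, severity in COMPILED:
            if pattern.search(line):
                hits.append({"line_no": line_no, "meaning": meaning, "severity": severity})
    # sorted() is stable, so hits of equal severity stay in log order.
    return sorted(hits, key=lambda h: SEVERITY_ORDER[h["severity"]])
```

Broad patterns such as `agency` or `forbidden` can match unrelated lines, so any hit is a pointer to read the surrounding log context, not a conclusion on its own.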

## Parameter Reference

### `huawei_autoscaling_diagnose` (Primary Action)

| Parameter | Required | Default | Description |
|-----------|----------|---------|-------------|
| `region` | Yes | - | Huawei Cloud region (e.g., `cn-north-4`) |
| `cluster_id` | Yes | - | CCE cluster ID |
| `namespace` | No | - | Target namespace (narrows scope) |
| `workload_name` | No | - | Target workload name (Deployment/StatefulSet) |
| `workload_type` | No | - | Workload type (`Deployment` or `StatefulSet`) |
| `question` | No | - | User's original question (improves intent routing) |

### Common Parameters

| Parameter | Required | Default | Description |
|-----------|----------|---------|-------------|
| `region` | Yes | - | Huawei Cloud region |
| `cluster_id` | Yes (most actions) | - | CCE cluster ID |
| `namespace` | Action-dependent | - | Kubernetes namespace |
| `workload_name` | Action-dependent | - | Deployment/StatefulSet name |
| `pod_name` | Required for logs | - | Pod name |
| `container` | Required for logs | - | Container name |
| `tail_lines` | No | 200 | Log tail lines count |
| `top_n` | No | 10 | Number of top results for metrics |

### Common Region IDs

| Region Name | Region ID |
|-------------|-----------|
| North China - Beijing 4 | `cn-north-4` |
| North China - Beijing 1 | `cn-north-1` |
| East China - Shanghai 1 | `cn-east-3` |
| East China - Shanghai 2 | `cn-east-2` |
| South China - Guangzhou | `cn-south-1` |
| South China - Shenzhen | `cn-south-4` |
| Southwest China - Guiyang 1 | `cn-southwest-2` |
| Asia Pacific - Bangkok | `ap-southeast-2` |
| Asia Pacific - Singapore | `ap-southeast-3` |
| Asia Pacific - Hong Kong | `ap-southeast-1` |
| Europe - Paris | `eu-west-0` |

## Output Format

The primary action `huawei_autoscaling_diagnose` returns structured evidence and a Markdown report. See [Output Schema](references/output-schema.md) for the full JSON response schema.

Key output fields:

| Field | Description |
|-------|-------------|
| `success` | Whether the diagnosis completed successfully |
| `intent.target` | Routing target: `WORKLOAD`, `NODE`, or `UNKNOWN` |
| `intent.scale_direction` | Scale direction: `scale_up`, `scale_down`, or `unknown` |
| `route` | Diagnosis path: `A`, `B`, `C`, or `BLOCKED` |
| `discovery` | Has_HPA, Has_CA, metric addon detected, nodepool autoscaling enabled |
| `issues` | List of diagnosed issues with code, severity, layer, evidence, recommendation |
| `evidence` | List of evidence items with layer, source, summary |
| `data_gaps` | Data collection failures or unconfirmed items |
| `conclusion` | Root cause conclusion summary |
| `confidence` | Confidence level (`High`, `Medium`, `Low`) |
| `report_markdown` | Complete Markdown diagnosis report (use as final output) |
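
For cases where the `issues` list is inspected directly, a severity-ordered summary might look like the sketch below. It is an assumption-heavy illustration: the per-issue keys follow the description in the table above (`code`, `severity`, `layer`, `recommendation`), but the exact severity vocabulary is not specified here, so the ranking table and the helper name `summarize_issues` are hypothetical.

```python
# Hypothetical severity ranking; unknown values sort last.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3, "info": 4}


def summarize_issues(response):
    """Return one summary line per issue, most severe first."""
    issues = response.get("issues", [])

    def rank(issue):
        severity = str(issue.get("severity", "")).lower()
        return SEVERITY_RANK.get(severity, len(SEVERITY_RANK))

    return [
        f"[{i.get('severity', '?')}] {i.get('code', '?')} "
        f"({i.get('layer', '?')}): {i.get('recommendation', '')}"
        for i in sorted(issues, key=rank)
    ]
```

This is a convenience view only; when `report_markdown` is present it remains the final report body.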

Required Markdown report sections:

1. `# CCE Autoscaling Automated Diagnosis Report`
2. `## 1. Diagnosis Overview`: region, cluster, intent, scale direction, route, conclusion, confidence
3. `## 2. Capability Discovery & Routing`: Has_HPA, Has_CA, metric link, routing basis
4. `## 3. Investigation Process`: Gateway, Path A/B/C actual execution steps
5. `## 4. Key Evidence`: HPA status, nodepool/addon, Pending Pod, FailedScheduling evidence
6. `## 5. Issues & Root Cause Convergence`: issues ranked by severity with evidence and recommendations
7. `## 6. Next-Step Recommendations`: read-only verification and remediation suggestions only
8. `## 7. Data Gaps`: collection failures and items that could not be confirmed
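
A report can be checked against the required sections above with a few lines of code. The sketch below is illustrative: `missing_sections` is a hypothetical helper, and it assumes each heading appears on its own line exactly as written in the list, which a real report may not guarantee.

```python
# Headings copied from the required-sections list above.
REQUIRED_HEADINGS = [
    "# CCE Autoscaling Automated Diagnosis Report",
    "## 1. Diagnosis Overview",
    "## 2. Capability Discovery & Routing",
    "## 3. Investigation Process",
    "## 4. Key Evidence",
    "## 5. Issues & Root Cause Convergence",
    "## 6. Next-Step Recommendations",
    "## 7. Data Gaps",
]


def missing_sections(report_markdown):
    """Return required headings that do not appear as a whole line."""
    present = {line.strip() for line in report_markdown.splitlines()}
    return [heading for heading in REQUIRED_HEADINGS if heading not in present]
```

An empty result means every required heading was found; a non-empty result lists what to look for before presenting the report.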

## Verification

See [Verification Method](references/verification-method.md) for step-by-step verification.

## Best Practices

1. **Primary action first**: Always call `huawei_autoscaling_diagnose` first; use manual fallback only if the primary action fails
2. **Gateway routing**: Do not skip the Gateway phase; intent routing and capability discovery determine the correct Path A/B/C
3. **CA logs are critical**: CA Pod logs are the highest-confidence evidence source for node scaling failures; the primary tool automatically collects them, but manual fallback must prioritize this step
4. **Cascade diagnosis**: When HPA has scaled but new Pods are Pending, trace from HPA → CA as a cascade (Path C), not as separate isolated issues
5. **Metric prerequisites**: CPU/memory utilization-based HPA requires corresponding `resources.requests` on Pod containers; missing requests are a common critical root cause
6. **Read-only boundary**: This skill is read-only diagnosis; never create/modify HPA, scale workloads, modify nodepool min/max, install/upgrade addons, expand subnets, or apply for quota
7. **Hand off remediation**: When remediation is needed, hand off to `huawei-cloud-cce-auto-remediation-runner` and require user confirmation
8. **Log sanitization**: Never copy raw passwords, tokens, AK/SK, or Authorization headers from CA logs into output
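
The metric prerequisite in item 5 can be checked mechanically. The sketch below assumes a Pod object in the standard Kubernetes API JSON shape (`spec.containers[].resources.requests`); whether `huawei_get_cce_pods` returns exactly that shape is not stated here, and `containers_missing_requests` is a hypothetical helper name.

```python
def containers_missing_requests(pod, resources=("cpu", "memory")):
    """Map container name -> resource names that have no request set."""
    missing = {}
    for container in pod.get("spec", {}).get("containers", []):
        # "resources" or "requests" may be absent or null on a container.
        requests = (container.get("resources") or {}).get("requests") or {}
        absent = [name for name in resources if name not in requests]
        if absent:
            missing[container.get("name", "<unnamed>")] = absent
    return missing


# Example Pod with one partially configured container and one bare sidecar.
example_pod = {
    "spec": {
        "containers": [
            {"name": "app", "resources": {"requests": {"cpu": "100m"}}},
            {"name": "sidecar"},
        ]
    }
}
```

For `example_pod`, the helper reports that `app` lacks a memory request and `sidecar` lacks both, which is the situation in which a utilization-based HPA cannot compute a percentage for that resource.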

## Reference Documents

| Document | Description |
|----------|-------------|
| [Workflow](references/workflow.md) | Gateway routing, Path A/B/C diagnosis trees, manual fallback tool sequence |
| [Output Schema](references/output-schema.md) | JSON response schema and required Markdown report sections |
| [Capability Map](references/capability-map.md) | Reusable tool capabilities, current gaps, and recommended atomic tool additions |
| [Risk Rules](references/risk-rules.md) | Allowed read actions, prohibited write actions, mutation boundary rules |

## Notes

- **Read-only by design**: this skill does NOT create/modify HPA, scale workloads, modify nodepool min/max, install/upgrade addons, expand subnets, or apply for quota
- **One-call preferred**: `huawei_autoscaling_diagnose` is the primary tool; raw queries are for targeted evidence when the user requests specific information or when the primary tool fails
- **Log sanitization**: only sanitized tail excerpts are included; raw secrets, tokens, and credentials must never appear in output
- **Gateway routing mandatory**: do not skip intent identification and capability discovery before entering Path A/B/C
- **CA Pod log analysis**: the primary tool automatically discovers and analyzes CA component Pod logs; manual fallback must prioritize this step
- **Cross-skill handoff**: when diagnosis reveals issues beyond autoscaling scope (Pod runtime failure, workload rollout failure, node NotReady), escalate to the appropriate skill

## Common Pitfalls

| Pitfall | Symptom | Quick Fix |
|---------|---------|-----------|
| Skipping Gateway routing | Diagnosis enters wrong path or misses capability | Always run intent + capability discovery before Path A/B/C |
| Missing CPU/Memory request | HPA cannot calculate utilization; `FailedGetResourceMetric` | Check `resources.requests` on all target Pod containers |
| Ignoring CA Pod logs | CA root cause remains unknown | Prioritize CA Pod log retrieval (kube-system autoscaler Pods) |
| Treating tolerance as failure | HPA not scaling when metrics within ~10% tolerance | Verify current metric ratio vs target threshold and tolerance window |
| Isolated HPA/CA analysis | Missing HPA→CA cascade linkage | Use Path C when both HPA and CA are present and intent is UNKNOWN |
| Wrong cluster_id | API returns 404 or empty results | Verify cluster ID via `huawei_list_cce_clusters` |
| Credential permission denied | API returns 403 | Check IAM permissions for CCE HPA/Pod/Event/Addon access |
| Not checking maxReplicas | HPA stuck at max replicas with `ScalingLimited` condition | Compare `currentReplicas` vs `maxReplicas` in HPA status |
| Not checking nodepool max_nodes | CA not expanding despite Pending Pods | Check `max_nodes` vs current node count in nodepool |
| Metrics API unavailable | HPA shows `FailedGetResourceMetric` | Ensure metrics-server or AOM addon is installed in cluster |
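
The tolerance and maxReplicas pitfalls both follow from the standard Kubernetes HPA calculation, `desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric)`, which is skipped when the ratio is within the tolerance (0.1 by default). The sketch below is a simplified model of that rule for reasoning about a single metric. It ignores not-ready Pods, missing metrics, and stabilization windows, and the function name `hpa_desired_replicas` is hypothetical.

```python
import math


def hpa_desired_replicas(current_replicas, current_metric, target_metric,
                         min_replicas=1, max_replicas=10, tolerance=0.1):
    """Simplified single-metric HPA calculation with tolerance and clamping."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Within tolerance: the HPA leaves the replica count unchanged.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # The result is always clamped to the configured min/max bounds.
    return max(min_replicas, min(max_replicas, desired))
```

With 4 replicas, a 50% target, and 54% observed utilization, the ratio is 1.08 and no scaling occurs, which is expected behavior and not a fault. At 200% observed utilization the raw result is 16, but with `max_replicas=10` the HPA stops at 10 and reports `ScalingLimited`.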

Security Recommendations

Review carefully before installing. Use only least-privilege, read-only Huawei Cloud credentials in an isolated environment, and avoid production/admin AK/SK unless the publisher narrows the dispatcher to the advertised diagnostic actions, redacts kubeconfig and Secret values, removes credential echoing in prompts/commands, and adds confirmation gates for all mutations.

Capability Tags

requires-wallet · requires-sensitive-credentials

Capability Assessment

⚠ Purpose & Capability

The stated purpose and risk rules are read-only CCE autoscaling diagnosis, but the dispatcher registers high-impact actions such as kubeconfig retrieval, secret listing with data, cluster/node/nodepool creation or deletion, addon install/update, EIP binding, ECS start, and remediation actions.

⚠ Instruction Scope

The user-facing instructions repeatedly say not to modify HPA, workloads, nodepools, addons, subnets, quotas, or IAM, yet executable actions for many of those changes are present and invokable through the same dispatcher.

✓ Install Mechanism

No hidden installer, post-install hook, background service, or package-time persistence was found; the skill is a Python script package that depends on Huawei Cloud and Kubernetes SDKs.

⚠ Credentials

Huawei AK/SK and Kubernetes access are expected for this domain, but returning raw kubeconfig material, optionally returning Kubernetes Secret data, and exposing broad write actions are disproportionate for a read-only autoscaling diagnoser.

⚠ Persistence & Privilege

Several code paths write kubeconfig or certificate/key material to /tmp or temporary files, reports/raw inventory/history can be written to caller-controlled paths, and multiple live mutation actions lack explicit confirmation gates.

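How to Use

1. Make sure OpenClaw is installed (locally or via Docker).
2. Enter the install command in the chat box: /install huawei-cloud-cce-autoscaling-diagnoser
3. After installation, call the Skill by name or trigger it with /huawei-cloud-cce-autoscaling-diagnoser
4. Provide the inputs described in the Skill's parameter reference to get structured output.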

Version History

v0.1.2

Initial release

v0.1.1

Initial release

v0.1.0

Initial release

Metadata

Slug: huawei-cloud-cce-autoscaling-diagnoser

Version: 0.1.2

License: MIT-0

Total installs: 0

Current installs: 0

Historical versions: 3

FAQ