Huawei Cloud CCE Storage Failure Diagnoser
/install huawei-cloud-cce-storage-failure-diagnoser
Execution Method (Must Read): This skill executes queries via the local Python dispatcher script. Using hcloud, kubectl, or other CLI tools or direct API calls is prohibited.
- The dispatcher script is located at `scripts/huawei-cloud.py` within the skill directory
- All scripts and environment check scripts are inside the skill package. You must use `skill action=exec` to execute them. Do not run them directly in a shell.
- Do not attempt hcloud, kubectl, curl IAM, or any other CLI/API methods. This skill does not depend on those tools.
- All paths are relative to the skill directory, which is the directory where this SKILL.md is located.
Overview
This skill diagnoses CCE/Kubernetes storage failures across PVC provisioning, scheduling/binding, attach/mount, runtime I/O, capacity, permission, and teardown stages. It uses the local Python dispatcher (scripts/huawei-cloud.py) to call the Huawei Cloud Python SDK and Kubernetes client APIs, collecting PVC/PV/StorageClass/Pod/Node/Event/VolumeAttachment evidence, Everest CSI logs, Kubelet /stats/summary, and cloud-side storage information. It produces a complete Markdown diagnosis report with process, evidence, conclusion, confidence, and remediation guidance.
Related Skills
| Skill | Purpose |
|---|---|
| `huawei-cloud-cce-node-failure-diagnoser` | Node-level failure diagnosis (scheduling, node resource issues) |
| `huawei-cloud-cce-network-failure-diagnoser` | Network failure diagnosis (Service/security group/ACL chain) |
| `huawei-cloud-cce-pod-failure-diagnoser` | Pod-level failure diagnosis |
| `huawei-cloud-cce-auto-remediation-runner` | Execute remediation actions (delete residual Pods, migrate workloads, expand storage, fix cloud resources) |
| `huawei-cloud-cce-metric-analyzer` | Metric trend analysis |
| `huawei-cloud-cce-observability-context-builder` | Observability context enrichment |
Capabilities
- One-shot storage failure diagnosis with structured evidence and Markdown report (`huawei_storage_failure_diagnose`)
- PVC/PV/StorageClass/VolumeAttachment collection (`huawei_get_cce_pvcs`, `huawei_get_cce_pvs`, `huawei_get_cce_storageclasses`, `huawei_get_cce_volumeattachments`)
- Node Kubelet `/stats/summary` proxy-read for capacity and inode analysis (`huawei_get_cce_node_stats_summary`)
- Everest CSI driver/controller log retrieval with auto-sanitization (`huawei_get_cce_everest_csi_logs`)
- Cloud-side EVS/SFS/SFS Turbo supplementary evidence (`huawei_list_evs`, `huawei_get_evs_metrics`, `huawei_list_sfs`, `huawei_list_sfs_turbo`)
- Network supplementary evidence for SFS/NFS (`huawei_list_security_groups`, `huawei_list_vpc_acls`)
- Pod, Node, and Event Kubernetes evidence (`huawei_get_cce_pods`, `huawei_get_kubernetes_nodes`, `huawei_get_cce_events`)
Typical Use Cases
- Diagnose a PVC stuck in `Pending` state
- Investigate Pod stuck in `ContainerCreating` with `FailedMount` or `FailedAttachVolume` events
- Analyze EVS disk attach failures, residual attachment locks, or per-node disk count limits
- Troubleshoot SFS/SFS Turbo NFS mount timeouts or network data-plane blocking
- Resolve OBS bucket access 403 errors, IAM delegation or AK/SK credential failures
- Diagnose runtime read-only filesystem, capacity or inode exhaustion
- Investigate ConfigMap/Secret subPath mount deadlocks
- Resolve PVC stuck in `Terminating` due to protection finalizers
- Check StorageClass provisioning or CSI driver errors
What This Skill Does NOT Handle
- Creating, modifying, or deleting PVC/PV/Pod resources
- Removing finalizers or force-detaching EVS disks
- Modifying StorageClass, IAM delegations, AK/SK Secrets, security groups, or ACLs
- Executing `kubectl exec`, node SSH, packet capture, stress tests, or `fsck`
- Any write operations on the data plane or control plane
Prerequisites
Python Dependencies
The dispatcher script requires Python >= 3.6 and the following packages:
- `huaweicloudsdkcore`
- `huaweicloudsdkcce`
- `huaweicloudsdkevs`
- `huaweicloudsdksfs`
- `huaweicloudsdkvpc`
- `huaweicloudsdkiam`
- `huaweicloudsdkces`
- `kubernetes`
Run environment check before first use (see Verification section). The venv is auto-created by check_env; on Linux/macOS use .venv/bin/python3, on Windows use .venv/Scripts/python3.exe.
Credential Configuration
| Variable | Required | Description |
|---|---|---|
| HW_ACCESS_KEY | Yes | Huawei Cloud Access Key |
| HW_SECRET_KEY | Yes | Huawei Cloud Secret Key |
| HW_REGION_NAME | No | Default region (overrides region param if set); default cn-north-4 |
| HW_PROJECT_ID | No | Project ID (auto-obtained via IAM API when not set) |
| HW_SECURITY_TOKEN | No | Required when using temporary AK/SK |
Security constraints:
- Never persist AK/SK/Token/Certificate to disk or long-term memory
- AK/SK exists only in the current request call stack and is released on completion
- Only non-sensitive project IDs may be cached in process memory (never written to disk)
- All temporary certificate files must be deleted immediately after use
- Never leak AK/SK or other sensitive information in logs, responses, or errors
- Never send authentication information to any third-party server
Do not output the values of the above environment variables.
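The presence-only rule above can be illustrated with a small sketch. It is not part of the skill package: `missing_credentials` and `credential_status` are hypothetical helper names, and the only facts taken from this document are the variable names and which of them are required. The sketch reports whether each variable is set and never echoes a value.

```python
import os
from typing import Dict, List, Mapping

# Variable names come from the credential table; everything else is illustrative.
REQUIRED_VARS = ("HW_ACCESS_KEY", "HW_SECRET_KEY")
OPTIONAL_VARS = ("HW_REGION_NAME", "HW_PROJECT_ID", "HW_SECURITY_TOKEN")


def missing_credentials(env: Mapping[str, str]) -> List[str]:
    """Return the names (never the values) of required variables that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


def credential_status(env: Mapping[str, str]) -> Dict[str, str]:
    """Report presence only, so that no secret value can reach logs or output."""
    return {
        name: ("set" if env.get(name, "").strip() else "unset")
        for name in REQUIRED_VARS + OPTIONAL_VARS
    }


if __name__ == "__main__":
    missing = missing_credentials(os.environ)
    if missing:
        print("Missing required variables: " + ", ".join(missing))
    else:
        print("All required credential variables are set.")
```

Returning names and a set/unset flag keeps the check useful for troubleshooting while staying within the constraint that credential values are never printed.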
IAM Permissions
This skill requires read-only IAM permissions for CCE, EVS, SFS, OBS, VPC, and CES services. Minimum required permissions:
| Service | Permission | Purpose |
|---|---|---|
| CCE | cce:cluster:get, cce:node:get | Read cluster and node info |
| CCE | cce:pod:get, cce:pvc:get | Read Pod and PVC status |
| EVS | evs:disk:list, evs:disk:get | Read EVS disk details |
| EVS | evs:cloudvolume:list | List cloud volumes |
| VPC | vpc:securityGroup:get, vpc:firewall:get | Read security groups and ACLs |
| SFS | sfs:share:get, sfs:share:list | Read SFS/SFS Turbo shares |
If a permission check fails, verify AK/SK configuration, confirm the user has the required read-only permissions, and check that the IAM policy is active (policies typically take effect within 5-10 minutes).
Core Tools
All actions are invoked via the dispatcher script:
python3 scripts/huawei-cloud.py <action> region=<region> cluster_id=<cluster_id> [key=value ...]
Primary Diagnosis Action
python3 scripts/huawei-cloud.py huawei_storage_failure_diagnose \
region=cn-north-4 cluster_id=<cluster_id> \
namespace=default pvc_name=<pvc_name> \
include_stats=true include_logs=true include_cloud=false
Returns structured evidence + report_markdown (complete Markdown diagnosis report).
Recommended defaults: include_stats=true, include_logs=true, include_cloud=false. Set include_cloud=true when you need EVS/SFS/SFS Turbo and security group/ACL supplementary evidence.
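As a rough illustration of the `action key=value ...` convention, the sketch below assembles an argument list for the dispatcher. `build_dispatcher_args` is a hypothetical helper, not something shipped with the skill, and it only builds the arguments; actually running them should still go through `skill action=exec` as required above. The lowercase `true`/`false` rendering is inferred from the examples in this document.

```python
from typing import List

PYTHON = "python3"                      # interpreter name as written in this document
DISPATCHER = "scripts/huawei-cloud.py"  # path relative to the skill directory


def build_dispatcher_args(action: str, **params) -> List[str]:
    """Build `[python3, dispatcher, action, key=value, ...]` (hypothetical helper)."""
    args = [PYTHON, DISPATCHER, action]
    for key, value in params.items():
        if value is None:
            continue  # leave out optional parameters that were not provided
        if isinstance(value, bool):
            value = "true" if value else "false"  # booleans appear lowercase in the examples
        args.append("{}={}".format(key, value))
    return args


if __name__ == "__main__":
    # "my-cluster-id" and "data-pvc" are placeholders, not real identifiers.
    print(" ".join(build_dispatcher_args(
        "huawei_storage_failure_diagnose",
        region="cn-north-4",
        cluster_id="my-cluster-id",
        namespace="default",
        pvc_name="data-pvc",
        include_stats=True,
        include_logs=True,
        include_cloud=False,
    )))
```

Building a list rather than a single string avoids shell quoting problems if a value such as `failure_symptom` contains spaces.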
Kubernetes Evidence Actions
| Action | Required Params | Optional Params | Description |
|---|---|---|---|
| `huawei_get_cce_pvcs` | region, cluster_id | namespace, pvc_name | List PVCs |
| `huawei_get_cce_pvs` | region, cluster_id | pv_name | List PVs |
| `huawei_get_cce_storageclasses` | region, cluster_id | - | List StorageClasses with provisioner, parameters, volumeBindingMode |
| `huawei_get_cce_volumeattachments` | region, cluster_id | - | List VolumeAttachments with attached status, attachError, detachError |
| `huawei_get_cce_node_stats_summary` | region, cluster_id | - | Proxy-read node /stats/summary; parse PVC usedBytes/capacityBytes and inode |
| `huawei_get_cce_everest_csi_logs` | region, cluster_id | - | Read Everest CSI driver/controller logs (auto-sanitized) |
| `huawei_get_cce_events` | region, cluster_id | - | List cluster events |
| `huawei_get_cce_pods` | region, cluster_id | namespace, pod_name | List Pods |
| `huawei_get_kubernetes_nodes` | region, cluster_id | - | List Kubernetes nodes with labels, taints, conditions |
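
To show how the `attached`, `attachError`, and `detachError` fields of a VolumeAttachment might be read, here is a minimal sketch. It assumes each item is a dict shaped like the Kubernetes `VolumeAttachment` object (`status.attached`, `status.attachError.message`, `status.detachError.message`); the actual output shape of `huawei_get_cce_volumeattachments` is not documented here, and `classify_volume_attachment` and its return labels are hypothetical.

```python
from typing import Any, Dict


def classify_volume_attachment(va: Dict[str, Any]) -> str:
    """Map a VolumeAttachment-like dict to a coarse attach-stage label."""
    status = va.get("status") or {}
    attach_error = (status.get("attachError") or {}).get("message")
    detach_error = (status.get("detachError") or {}).get("message")

    if detach_error:
        # A detach error can leave the volume occupied on its previous node.
        return "detach_error: " + detach_error
    if status.get("attached"):
        # Attach succeeded; a later FailedMount would point at the host side.
        return "attached"
    if attach_error:
        return "attach_error: " + attach_error
    # No error recorded yet and not attached: the attach is still in progress.
    return "attach_pending"


if __name__ == "__main__":
    # Invented example object; the error text is not a real Everest message.
    example = {
        "spec": {"nodeName": "node-1"},
        "status": {"attached": False, "attachError": {"message": "example attach error"}},
    }
    print(classify_volume_attachment(example))
```

Separating "attached" from the error states keeps cloud-side attach problems apart from host-side mount problems, which are diagnosed from different evidence.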
Cloud Supplementary Evidence Actions
| Action | Required Params | Optional Params | Description |
|---|---|---|---|
| `huawei_list_evs` | region | disk_id, availability_zone | List EVS disks |
| `huawei_get_evs_metrics` | region, disk_id | - | Get EVS disk I/O metrics |
| `huawei_list_sfs` | region | - | List SFS file systems |
| `huawei_list_sfs_turbo` | region | - | List SFS Turbo file systems |
| `huawei_list_security_groups` | region | - | List VPC security groups (for SFS/NFS network analysis) |
| `huawei_list_vpc_acls` | region | - | List VPC network ACLs (for SFS/NFS network analysis) |
Parameter Reference
huawei_storage_failure_diagnose
| Parameter | Required | Default | Description |
|---|---|---|---|
| `region` | Yes | - | Huawei Cloud region (e.g., cn-north-4) |
| `cluster_id` | Yes | - | CCE cluster ID |
| `namespace` | No | - | Kubernetes namespace (recommended for PVC Pending/Terminating/capacity issues) |
| `pvc_name` | No | - | Specific PVC name |
| `pod_name` | No | - | Specific Pod name (recommended for Pod Pending/ContainerCreating/IO anomalies) |
| `failure_symptom` | No | - | Symptom description, e.g., "PVC Pending", "FailedMount mount.nfs timeout", "OBS 403", "Read-only file system", "PVC Terminating" |
| `include_stats` | No | true | Include node /stats/summary for capacity/inode analysis |
| `include_logs` | No | true | Include Everest CSI driver/controller logs |
| `include_cloud` | No | false | Include EVS/SFS/SFS Turbo and security group/ACL cloud-side evidence |
Common Parameters (All Actions)
| Parameter | Required | Default | Description |
|---|---|---|---|
| `region` | Yes | - | Huawei Cloud region |
| `cluster_id` | Yes* | - | CCE cluster ID (required for CCE/K8s actions; not required for pure cloud actions) |
| `ak` | No | env HW_ACCESS_KEY | Huawei Cloud AK |
| `sk` | No | env HW_SECRET_KEY | Huawei Cloud SK |
| `project_id` | No | auto-obtained | Project ID (auto-obtained via IAM API when not set) |
*Required for CCE/Kubernetes actions. Not required for pure cloud-side actions like huawei_list_evs, huawei_list_security_groups.
Output Format
The primary action huawei_storage_failure_diagnose returns structured JSON with an embedded report_markdown. See references/output-schema.md for the full JSON response schema.
{
"success": true,
"action": "huawei_storage_failure_diagnose",
"region": "cn-north-4",
"cluster_id": "cluster-id",
"namespace": "default",
"conclusion": "high signal conclusion",
"confidence": "High",
"findings": [
{
"stage": "Mount stage failure",
"type": "EVSNodeAttachLimitExceeded",
"title": "VolumeAttachment attached=false; error indicates ECS per-node disk count limit reached",
"confidence": 0.94,
"severity": "critical",
"evidence": [],
"recommendation": []
}
],
"top_causes": [],
"snapshot": {},
"report_markdown": "# CCE Storage Failure Automated Diagnosis Report\n..."
}
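A response of this shape can be summarized by ordering the findings on their numeric `confidence`. The sketch below is illustrative only: `rank_findings` and `summarize` are hypothetical helpers, the sample data is invented (including the `warning` severity value, which this document does not list), and the full schema lives in `references/output-schema.md`.

```python
import json
from typing import Any, Dict, List


def rank_findings(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return findings sorted by numeric confidence, highest first."""
    findings = response.get("findings") or []
    return sorted(findings, key=lambda f: float(f.get("confidence", 0.0)), reverse=True)


def summarize(response: Dict[str, Any]) -> List[str]:
    """Render one line per finding: severity, type, confidence, title."""
    return [
        "[{}] {} ({:.2f}): {}".format(
            f.get("severity", "unknown"),
            f.get("type", "Unknown"),
            float(f.get("confidence", 0.0)),
            f.get("title", ""),
        )
        for f in rank_findings(response)
    ]


# Invented sample shaped like the response above; not real diagnosis output.
SAMPLE = json.loads("""
{
  "success": true,
  "findings": [
    {"type": "StorageIOError", "confidence": 0.41, "severity": "warning", "title": "sample A"},
    {"type": "EVSNodeAttachLimitExceeded", "confidence": 0.94, "severity": "critical", "title": "sample B"}
  ]
}
""")

if __name__ == "__main__":
    for line in summarize(SAMPLE):
        print(line)
```

Sorting on the per-finding confidence rather than on the failure stage matches the idea that conclusions are ranked by evidence strength.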
Required Markdown Report Sections
When report_markdown is present, use it as the final report body. You may add clarifications the user requests, but do not discard evidence tables.
The report_markdown must contain these headings:
- `# CCE Storage Failure Automated Diagnosis Report`
- `## 1. Diagnosis Overview`
- `## 2. Investigation Process`
- `## 3. Key Object Relationships`
- `## 4. Evidence Matrix`
- `## 5. Diagnosis Conclusion`
- `## 6. Recommended Actions and Verification Standards`
- `## 7. Data Gaps and Manual Confirmation`
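
One way to check a report against the required headings is a simple line-by-line comparison. This is a sketch under the assumption that each heading appears on its own line exactly as written; `missing_headings` is a hypothetical helper, not part of the dispatcher.

```python
from typing import List

# Heading text copied from the required-sections list in this document.
REQUIRED_HEADINGS = [
    "# CCE Storage Failure Automated Diagnosis Report",
    "## 1. Diagnosis Overview",
    "## 2. Investigation Process",
    "## 3. Key Object Relationships",
    "## 4. Evidence Matrix",
    "## 5. Diagnosis Conclusion",
    "## 6. Recommended Actions and Verification Standards",
    "## 7. Data Gaps and Manual Confirmation",
]


def missing_headings(report_markdown: str) -> List[str]:
    """Return the required headings that do not appear as a full line of the report."""
    present = {line.strip() for line in report_markdown.splitlines()}
    return [heading for heading in REQUIRED_HEADINGS if heading not in present]


if __name__ == "__main__":
    partial_report = "# CCE Storage Failure Automated Diagnosis Report\n\n## 1. Diagnosis Overview\n"
    for heading in missing_headings(partial_report):
        print("missing:", heading)
```

Matching whole lines instead of substrings avoids false positives when a heading's text is merely quoted inside a paragraph or table cell.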
Finding Types
Common type values in findings:
| Type | Description |
|---|---|
| `NormalWaitForFirstConsumer` | PVC Pending with WaitForFirstConsumer; normal behavior awaiting Pod scheduling |
| `EVSQuotaExceeded` | EVS cloud disk quota exceeded |
| `SFSSubnetIPInsufficient` | SFS/SFS Turbo subnet available IP or mount target allocation failure |
| `OBSBucketNameInvalid` | OBS bucket name conflict or invalid naming |
| `EVSAvailabilityZoneSchedulingConflict` | EVS single-AZ affinity prevents Pod scheduling to storage AZ |
| `LocalPVNodeOffline` | Local PV host node down/offline |
| `VolumeAttachmentNotCreated` | K8s control plane has not issued attach instruction |
| `EVSNodeAttachLimitExceeded` | ECS per-node attached disk count limit reached |
| `EVSResidualAttachmentLock` | EVS residual node occupancy or underlying lock not released |
| `EVSAttachFailed` | EVS attach failure (general) |
| `HostKernelMountFailed` | Cloud-side attached but host kernel/filesystem mount failed |
| `SFSNfsNetworkBlocked` | SFS/SFS Turbo NFS mount timeout due to network data-plane blocking |
| `OBSCredentialInvalid` | OBS IAM delegation changed, AK/SK Secret invalid, or bucket permission error |
| `StoragePermissionDenied` | Permission denied / forbidden / access denied (general) |
| `PVCCapacityExhausted` | PVC capacity usage > 95% |
| `PVCInodeExhausted` | PVC inode usage > 95% |
| `ReadOnlyFilesystemProtection` | Linux read-only filesystem protection triggered |
| `ConfigMapSecretSubPathDeadlock` | ConfigMap/Secret subPath mount point deadlock |
| `PVCProtectionBlocked` | PVC Terminating with kubernetes.io/pvc-protection finalizer |
| `StorageIOError` | Runtime storage I/O errors |
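
The two 95% thresholds can be illustrated against a single volume entry from Kubelet `/stats/summary`. The sketch assumes the entry carries the Kubelet Summary API volume fields `usedBytes`, `capacityBytes`, `inodesUsed`, and `inodes`; how the skill itself evaluates these thresholds is not shown in this document, and `exceeds_threshold` and `check_volume` are hypothetical names.

```python
from typing import Any, Dict, List

THRESHOLD_PERCENT = 95  # "> 95%" as stated for PVCCapacityExhausted and PVCInodeExhausted


def exceeds_threshold(used: int, total: int, percent: int = THRESHOLD_PERCENT) -> bool:
    """True when used/total is strictly above percent; integer math avoids float rounding."""
    if total <= 0:
        return False  # no capacity reported, so nothing can be concluded
    return used * 100 > total * percent


def check_volume(volume: Dict[str, Any]) -> List[str]:
    """Return the finding types that one volume-stats entry would match."""
    matched = []
    if exceeds_threshold(volume.get("usedBytes", 0), volume.get("capacityBytes", 0)):
        matched.append("PVCCapacityExhausted")
    if exceeds_threshold(volume.get("inodesUsed", 0), volume.get("inodes", 0)):
        matched.append("PVCInodeExhausted")
    return matched


if __name__ == "__main__":
    # Invented numbers: 9.7 GiB used of 10 GiB, inode usage well below the threshold.
    sample = {
        "usedBytes": 97 * 1024 ** 3 // 10,
        "capacityBytes": 10 * 1024 ** 3,
        "inodesUsed": 1200,
        "inodes": 655360,
    }
    print(check_volume(sample))
```

Checking bytes and inodes separately matters because a volume holding many small files can run out of inodes while still reporting free capacity.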
Verification
Environment Check
Before first use, run the environment check script to install dependencies and validate credentials:
- Linux/macOS: `skill action=exec: bash skill://scripts/check_env.sh`
- Windows: `skill action=exec: powershell -ExecutionPolicy Bypass -File skill://scripts/check_env.ps1`
The script checks the Python version (>= 3.6), installs dependencies, and validates the SDK, the credentials, and service availability.
Diagnosis Verification
- Run environment check and confirm all checks pass
- Execute `huawei_storage_failure_diagnose` with a known region and cluster_id: `python3 scripts/huawei-cloud.py huawei_storage_failure_diagnose region=cn-north-4 cluster_id=<cluster_id> include_stats=true include_logs=true`
- Verify the returned JSON contains `success=true`, `findings` array, and `report_markdown`
- Check that the Markdown report contains all required sections (see `references/output-schema.md`)
- Compare diagnosis conclusions against known failure patterns
Best Practices
- Always call `huawei_storage_failure_diagnose` first; use individual tools only as fallback or for raw evidence
- Provide `namespace` and `pvc_name`/`pod_name` when possible to narrow diagnosis scope
- Set `include_cloud=true` only when you need cloud-side supplementary evidence (EVS/SFS/SFS Turbo and security group/ACL)
- For NFS/SFS mount timeouts, always supplement with security group and VPC ACL checks
- For OBS 403 errors, focus on Everest CSI logs and event messages rather than cloud-side queries
- Conclusion confidence is ranked by evidence strength, not by stage priority
- Never write guesses as conclusions; output evidence gaps explicitly
- For any remediation actions, only output proposed plan and verification standards, then hand off to `huawei-cloud-cce-auto-remediation-runner` for user confirmation
Reference Documents
| Document | Description |
|---|---|
| `references/workflow.md` | Diagnosis triage flow, reusable capabilities, and stage-by-stage pipeline |
| `references/output-schema.md` | Output JSON schema and required Markdown report sections |
| `references/risk-rules.md` | Risk boundary rules: allowed read actions, prohibited write actions, and high-risk handoff |
| Huawei Cloud Python SDK Documentation | SDK reference |
| Huawei Cloud API Explorer | API interactive explorer |
Notes
- This skill is read-only diagnosis only — it never deletes PVC/PV/Pod, patches finalizers, force-detaches/attaches EVS, or modifies any StorageClass/IAM/Secret/SecurityGroup/ACL
- Never expose or log AK/SK or environment variable values
- All actions are executed via `python3 scripts/huawei-cloud.py <action>`; do not use hcloud CLI, kubectl, or direct API calls
- PVC Terminating: never directly suggest removing `kubernetes.io/pvc-protection` finalizer; must first prove no Pod references and no business data risk
- EVS residual mount or read-only filesystem scenarios: never suggest force-unmount, force-attach, or direct restart of database-class workloads before confirming filesystem consistency
- ConfigMap/Secret `resourceVersion` has no natural update timestamp; use `managedFields.time`, Pod timestamps, and FailedMount events as circumstantial evidence only
- Cross-diagnosis handoff: scheduling/node resource issues -> `huawei-cloud-cce-node-failure-diagnoser`; Service/security group/ACL chain -> `huawei-cloud-cce-network-failure-diagnoser`; remediation actions -> `huawei-cloud-cce-auto-remediation-runner`
Common Pitfalls
| Pitfall | Correct Approach |
|---|---|
| Treating WaitForFirstConsumer PVC Pending as a failure | A PVC in Pending state with WaitForFirstConsumer volumeBindingMode and no associated Pod is normal behavior, not a failure |
| Diagnosing scheduling failures without AZ context | EVS disks are single-AZ; always check PV nodeAffinity and node AZ labels before concluding scheduling issues |
| Confusing mount vs. attach | attached=true in VolumeAttachment means cloud-side attach succeeded; FailedMount events indicate host-side kernel/filesystem mount failure, not cloud attach failure |
| Overlooking CSI logs for OBS issues | OBS 403 and credential errors are best identified in Everest CSI logs, not in Kubernetes events alone |
| Premature finalizer removal | Removing kubernetes.io/pvc-protection without verifying no Pod references can cause data loss |
| Guessing without evidence | When no clear finding matches, output the evidence gap rather than fabricating a conclusion |
| Skipping environment check | Always run the environment check script before first diagnosis execution |
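
The first pitfall in the table can be expressed as a small predicate. The sketch assumes the PVC, StorageClass, and Pods are dicts shaped like the corresponding Kubernetes objects (`status.phase`, `volumeBindingMode`, `spec.volumes[].persistentVolumeClaim.claimName`); the output shape of the skill's own actions may differ, and `pod_uses_pvc` and `is_normal_wffc_pending` are hypothetical names.

```python
from typing import Any, Dict, Iterable, Optional


def pod_uses_pvc(pod: Dict[str, Any], namespace: Optional[str], pvc_name: Optional[str]) -> bool:
    """True when the Pod is in the PVC's namespace and mounts it by claim name."""
    if (pod.get("metadata") or {}).get("namespace") != namespace:
        return False
    for volume in (pod.get("spec") or {}).get("volumes") or []:
        claim = volume.get("persistentVolumeClaim") or {}
        if claim.get("claimName") == pvc_name:
            return True
    return False


def is_normal_wffc_pending(
    pvc: Dict[str, Any],
    storage_class: Dict[str, Any],
    pods: Iterable[Dict[str, Any]],
) -> bool:
    """Pending + WaitForFirstConsumer + no consuming Pod is expected, not a failure."""
    meta = pvc.get("metadata") or {}
    pending = (pvc.get("status") or {}).get("phase") == "Pending"
    wffc = storage_class.get("volumeBindingMode") == "WaitForFirstConsumer"
    referenced = any(pod_uses_pvc(p, meta.get("namespace"), meta.get("name")) for p in pods)
    return pending and wffc and not referenced


if __name__ == "__main__":
    # Invented objects for illustration only.
    pvc = {"metadata": {"name": "data", "namespace": "default"}, "status": {"phase": "Pending"}}
    sc = {"volumeBindingMode": "WaitForFirstConsumer"}
    print(is_normal_wffc_pending(pvc, sc, pods=[]))
```

Once a Pod does reference the claim, the predicate returns False, and a PVC that remains Pending is worth investigating as a scheduling or provisioning problem.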
- Make sure OpenClaw is installed (local or Docker)
- Run the install command in chat: `/install huawei-cloud-cce-storage-failure-diagnoser`
- After installation, invoke the skill by name or use `/huawei-cloud-cce-storage-failure-diagnoser`
- Provide required inputs per the skill's parameter spec and get structured output
Is Huawei Cloud CCE Storage Failure Diagnoser free?
Yes. Huawei Cloud CCE Storage Failure Diagnoser is free to download, install, and use, and is licensed under MIT-0.
Which platforms does Huawei Cloud CCE Storage Failure Diagnoser support?
Huawei Cloud CCE Storage Failure Diagnoser is cross-platform and runs anywhere OpenClaw / Claude Code is available.
Who created Huawei Cloud CCE Storage Failure Diagnoser?
It is built and maintained by shijingcheng (@pintudeyudi); the current version is v0.1.0.