
Huawei Cloud CCE Storage Failure Diagnoser

by shijingcheng · v0.1.0 · MIT-0
Install in OpenClaw
/install huawei-cloud-cce-storage-failure-diagnoser
Description
Huawei Cloud CCE Storage failure diagnosis skill using Python SDK dispatcher. Use this skill when the user wants to: (1) diagnose PVC Pending, volume mount f...
README (SKILL.md)

Huawei Cloud CCE Storage Failure Diagnoser

Execution Method (Must Read): This skill executes queries via the local Python dispatcher script. Using hcloud, kubectl, or other CLI tools or direct API calls is prohibited.

  • The dispatcher script is located at scripts/huawei-cloud.py within the skill directory
  • All scripts and environment check scripts are inside the skill package. You must use skill action=exec to execute them. Do not run them directly in a shell.
  • Do not attempt hcloud, kubectl, curl IAM, or any other CLI/API methods. This skill does not depend on those tools.
  • All paths are relative to the skill directory, which is the directory where this SKILL.md is located.

Overview

This skill diagnoses CCE/Kubernetes storage failures across PVC provisioning, scheduling/binding, attach/mount, runtime I/O, capacity, permission, and teardown stages. It uses the local Python dispatcher (scripts/huawei-cloud.py) to call the Huawei Cloud Python SDK and Kubernetes client APIs, collecting PVC/PV/StorageClass/Pod/Node/Event/VolumeAttachment evidence, Everest CSI logs, Kubelet /stats/summary, and cloud-side storage information. It produces a complete Markdown diagnosis report with process, evidence, conclusion, confidence, and remediation guidance.

Related Skills

| Skill | Purpose |
| --- | --- |
| huawei-cloud-cce-node-failure-diagnoser | Node-level failure diagnosis (scheduling, node resource issues) |
| huawei-cloud-cce-network-failure-diagnoser | Network failure diagnosis (Service/security group/ACL chain) |
| huawei-cloud-cce-pod-failure-diagnoser | Pod-level failure diagnosis |
| huawei-cloud-cce-auto-remediation-runner | Execute remediation actions (delete residual Pods, migrate workloads, expand storage, fix cloud resources) |
| huawei-cloud-cce-metric-analyzer | Metric trend analysis |
| huawei-cloud-cce-observability-context-builder | Observability context enrichment |

Capabilities

  1. One-shot storage failure diagnosis with structured evidence and Markdown report (huawei_storage_failure_diagnose)
  2. PVC/PV/StorageClass/VolumeAttachment collection (huawei_get_cce_pvcs, huawei_get_cce_pvs, huawei_get_cce_storageclasses, huawei_get_cce_volumeattachments)
  3. Node Kubelet /stats/summary proxy-read for capacity and inode analysis (huawei_get_cce_node_stats_summary)
  4. Everest CSI driver/controller log retrieval with auto-sanitization (huawei_get_cce_everest_csi_logs)
  5. Cloud-side EVS/SFS/SFS Turbo supplementary evidence (huawei_list_evs, huawei_get_evs_metrics, huawei_list_sfs, huawei_list_sfs_turbo)
  6. Network supplementary evidence for SFS/NFS (huawei_list_security_groups, huawei_list_vpc_acls)
  7. Pod, Node, and Event Kubernetes evidence (huawei_get_cce_pods, huawei_get_kubernetes_nodes, huawei_get_cce_events)

Typical Use Cases

  • Diagnose a PVC stuck in Pending state
  • Investigate Pod stuck in ContainerCreating with FailedMount or FailedAttachVolume events
  • Analyze EVS disk attach failures, residual attachment locks, or per-node disk count limits
  • Troubleshoot SFS/SFS Turbo NFS mount timeouts or network data-plane blocking
  • Resolve OBS bucket access 403 errors, IAM delegation or AK/SK credential failures
  • Diagnose runtime read-only filesystem, capacity or inode exhaustion
  • Investigate ConfigMap/Secret subPath mount deadlocks
  • Resolve PVC stuck in Terminating due to protection finalizers
  • Check StorageClass provisioning or CSI driver errors

What This Skill Does NOT Handle

  1. Creating, modifying, or deleting PVC/PV/Pod resources
  2. Removing finalizers or force-detaching EVS disks
  3. Modifying StorageClass, IAM delegations, AK/SK Secrets, security groups, or ACLs
  4. Executing kubectl exec, node SSH, packet capture, stress tests, or fsck
  5. Any write operations on the data plane or control plane

Prerequisites

Python Dependencies

The dispatcher script requires Python >= 3.6 and the following packages:

  • huaweicloudsdkcore
  • huaweicloudsdkcce
  • huaweicloudsdkevs
  • huaweicloudsdksfs
  • huaweicloudsdkvpc
  • huaweicloudsdkiam
  • huaweicloudsdkces
  • kubernetes

Run the environment check before first use (see the Verification section). check_env creates the venv automatically; on Linux/macOS use .venv/bin/python3, on Windows use .venv/Scripts/python3.exe.

Credential Configuration

| Variable | Required | Description |
| --- | --- | --- |
| HW_ACCESS_KEY | Yes | Huawei Cloud Access Key |
| HW_SECRET_KEY | Yes | Huawei Cloud Secret Key |
| HW_REGION_NAME | No | Default region (overrides region param if set); default cn-north-4 |
| HW_PROJECT_ID | No | Project ID (auto-obtained via IAM API when not set) |
| HW_SECURITY_TOKEN | No | Required when using temporary AK/SK |

Security constraints:

  1. Never persist AK/SK/Token/Certificate to disk or long-term memory
  2. AK/SK exists only in the current request call stack and is released on completion
  3. Only non-sensitive project IDs may be cached in process memory (never written to disk)
  4. All temporary certificate files must be deleted immediately after use
  5. Never leak AK/SK or other sensitive information in logs, responses, or errors
  6. Never send authentication information to any third-party server

Do not output the values of the above environment variables.
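
As an illustration of the constraints above, here is a minimal sketch of a pre-flight check that reports which credential variables are set without ever reading their values into the output. The function name credential_status is hypothetical and not part of the skill; only the variable names come from the table above.

```python
import os
from typing import Dict, Mapping, Optional

# Variable names taken from the Credential Configuration table.
REQUIRED_VARS = ("HW_ACCESS_KEY", "HW_SECRET_KEY")
OPTIONAL_VARS = ("HW_REGION_NAME", "HW_PROJECT_ID", "HW_SECURITY_TOKEN")


def credential_status(environ: Optional[Mapping[str, str]] = None) -> Dict[str, object]:
    """Report which credential variables are set; never return their values."""
    env = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    present_optional = [name for name in OPTIONAL_VARS if env.get(name)]
    return {
        "ok": not missing,
        "missing_required": missing,
        "present_optional": present_optional,
    }


if __name__ == "__main__":
    # Only variable names are printed; values stay in the environment.
    print(credential_status())
```

Returning names only keeps the check safe to include in logs or in a diagnosis report.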

IAM Permissions

This skill requires read-only IAM permissions for CCE, EVS, SFS, OBS, VPC, and CES services. Minimum required permissions:

| Service | Permission | Purpose |
| --- | --- | --- |
| CCE | cce:cluster:get, cce:node:get | Read cluster and node info |
| CCE | cce:pod:get, cce:pvc:get | Read Pod and PVC status |
| EVS | evs:disk:list, evs:disk:get | Read EVS disk details |
| EVS | evs:cloudvolume:list | List cloud volumes |
| VPC | vpc:securityGroup:get, vpc:firewall:get | Read security groups and ACLs |
| SFS | sfs:share:get, sfs:share:list | Read SFS/SFS Turbo shares |

If a permission check fails, verify AK/SK configuration, confirm the user has the required read-only permissions, and check that the IAM policy is active (policies typically take effect within 5-10 minutes).


Core Tools

All actions are invoked via the dispatcher script:

python3 scripts/huawei-cloud.py <action> region=<region> cluster_id=<cluster_id> [key=value ...]

Primary Diagnosis Action

python3 scripts/huawei-cloud.py huawei_storage_failure_diagnose \
  region=cn-north-4 cluster_id=<cluster_id> \
  namespace=default pvc_name=<pvc_name> \
  include_stats=true include_logs=true include_cloud=false

Returns structured evidence + report_markdown (complete Markdown diagnosis report).

Recommended defaults: include_stats=true, include_logs=true, include_cloud=false. Set include_cloud=true when you need EVS/SFS/SFS Turbo and security group/ACL supplementary evidence.
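
The key=value calling convention above can be assembled programmatically. The following is a minimal sketch, assuming the dispatcher accepts exactly the argument shape shown in the examples; build_dispatcher_argv is a hypothetical helper, not part of the skill. The resulting argument list is meant to be handed to skill action=exec, in line with the execution rule at the top of this document.

```python
from typing import List


def build_dispatcher_argv(action: str, **params: object) -> List[str]:
    """Build the argument list for scripts/huawei-cloud.py from keyword params."""
    argv = ["python3", "scripts/huawei-cloud.py", action]
    for key, value in params.items():
        if value is None:
            continue  # omit optional parameters that were not provided
        if isinstance(value, bool):
            # The documented examples use lowercase true/false.
            value = "true" if value else "false"
        argv.append(f"{key}={value}")
    return argv


if __name__ == "__main__":
    print(build_dispatcher_argv(
        "huawei_storage_failure_diagnose",
        region="cn-north-4",
        cluster_id="<cluster_id>",
        namespace="default",
        pvc_name="<pvc_name>",
        include_stats=True,
        include_logs=True,
        include_cloud=False,
    ))
```

Building a list rather than a single shell string avoids quoting problems when a value such as failure_symptom contains spaces.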

Kubernetes Evidence Actions

| Action | Required Params | Optional Params | Description |
| --- | --- | --- | --- |
| huawei_get_cce_pvcs | region, cluster_id | namespace, pvc_name | List PVCs |
| huawei_get_cce_pvs | region, cluster_id | pv_name | List PVs |
| huawei_get_cce_storageclasses | region, cluster_id | - | List StorageClasses with provisioner, parameters, volumeBindingMode |
| huawei_get_cce_volumeattachments | region, cluster_id | - | List VolumeAttachments with attached status, attachError, detachError |
| huawei_get_cce_node_stats_summary | region, cluster_id | - | Proxy-read node /stats/summary; parse PVC usedBytes/capacityBytes and inode |
| huawei_get_cce_everest_csi_logs | region, cluster_id | - | Read Everest CSI driver/controller logs (auto-sanitized) |
| huawei_get_cce_events | region, cluster_id | - | List cluster events |
| huawei_get_cce_pods | region, cluster_id | namespace, pod_name | List Pods |
| huawei_get_kubernetes_nodes | region, cluster_id | - | List Kubernetes nodes with labels, taints, conditions |
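
To show how the attached, attachError, and detachError fields can be read together, here is a minimal sketch. It assumes each item is shaped like a Kubernetes VolumeAttachment object (status.attached, status.attachError.message, status.detachError.message); the actual shape returned by huawei_get_cce_volumeattachments is not documented here, and classify_volume_attachment is a hypothetical helper.

```python
from typing import Any, Dict


def classify_volume_attachment(va: Dict[str, Any]) -> str:
    """Return a coarse state label for one VolumeAttachment-shaped dict."""
    status = va.get("status") or {}
    attach_error = (status.get("attachError") or {}).get("message")
    detach_error = (status.get("detachError") or {}).get("message")

    if status.get("attached"):
        # Cloud-side attach succeeded; a detach error means teardown is stuck.
        return "detach_error" if detach_error else "attached"
    if attach_error:
        return "attach_error"
    # No error reported yet and not attached: still in progress.
    return "pending"
```

An "attached" result combined with FailedMount events points at the host-side mount stage rather than the cloud-side attach stage.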

Cloud Supplementary Evidence Actions

| Action | Required Params | Optional Params | Description |
| --- | --- | --- | --- |
| huawei_list_evs | region | disk_id, availability_zone | List EVS disks |
| huawei_get_evs_metrics | region, disk_id | - | Get EVS disk I/O metrics |
| huawei_list_sfs | region | - | List SFS file systems |
| huawei_list_sfs_turbo | region | - | List SFS Turbo file systems |
| huawei_list_security_groups | region | - | List VPC security groups (for SFS/NFS network analysis) |
| huawei_list_vpc_acls | region | - | List VPC network ACLs (for SFS/NFS network analysis) |

Parameter Reference

huawei_storage_failure_diagnose

| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| region | Yes | - | Huawei Cloud region (e.g., cn-north-4) |
| cluster_id | Yes | - | CCE cluster ID |
| namespace | No | - | Kubernetes namespace (recommended for PVC Pending/Terminating/capacity issues) |
| pvc_name | No | - | Specific PVC name |
| pod_name | No | - | Specific Pod name (recommended for Pod Pending/ContainerCreating/IO anomalies) |
| failure_symptom | No | - | Symptom description, e.g., "PVC Pending", "FailedMount mount.nfs timeout", "OBS 403", "Read-only file system", "PVC Terminating" |
| include_stats | No | true | Include node /stats/summary for capacity/inode analysis |
| include_logs | No | true | Include Everest CSI driver/controller logs |
| include_cloud | No | false | Include EVS/SFS/SFS Turbo and security group/ACL cloud-side evidence |

Common Parameters (All Actions)

| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| region | Yes | - | Huawei Cloud region |
| cluster_id | Yes* | - | CCE cluster ID (required for CCE/K8s actions; not required for pure cloud actions) |
| ak | No | env HW_ACCESS_KEY | Huawei Cloud AK |
| sk | No | env HW_SECRET_KEY | Huawei Cloud SK |
| project_id | No | auto-obtained | Project ID (auto-obtained via IAM API when not set) |

*Required for CCE/Kubernetes actions. Not required for pure cloud-side actions like huawei_list_evs, huawei_list_security_groups.
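
The required-parameter rules above can be checked before invoking the dispatcher. This is a minimal sketch, assuming the action lists in the tables of this document are complete; missing_required_params and the two lookup tables are hypothetical and not part of the skill.

```python
from typing import Dict, List, Mapping

# Pure cloud-side actions that do not need cluster_id (from the tables above).
CLOUD_ONLY_ACTIONS = {
    "huawei_list_evs",
    "huawei_get_evs_metrics",
    "huawei_list_sfs",
    "huawei_list_sfs_turbo",
    "huawei_list_security_groups",
    "huawei_list_vpc_acls",
}

# Extra required parameters beyond region/cluster_id.
EXTRA_REQUIRED: Dict[str, List[str]] = {"huawei_get_evs_metrics": ["disk_id"]}


def missing_required_params(action: str, params: Mapping[str, str]) -> List[str]:
    """Return the names of required parameters that are absent or empty."""
    required = ["region"]
    if action not in CLOUD_ONLY_ACTIONS:
        required.append("cluster_id")
    required.extend(EXTRA_REQUIRED.get(action, []))
    return [name for name in required if not params.get(name)]
```

An empty list means the call has everything the tables mark as required.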


Output Format

The primary action huawei_storage_failure_diagnose returns structured JSON with an embedded report_markdown. See references/output-schema.md for the full JSON response schema.

{
  "success": true,
  "action": "huawei_storage_failure_diagnose",
  "region": "cn-north-4",
  "cluster_id": "cluster-id",
  "namespace": "default",
  "conclusion": "high signal conclusion",
  "confidence": "High",
  "findings": [
    {
      "stage": "Mount stage failure",
      "type": "EVSNodeAttachLimitExceeded",
      "title": "VolumeAttachment attached=false; error indicates ECS per-node disk count limit reached",
      "confidence": 0.94,
      "severity": "critical",
      "evidence": [],
      "recommendation": []
    }
  ],
  "top_causes": [],
  "snapshot": {},
  "report_markdown": "# CCE Storage Failure Automated Diagnosis Report\n..."
}
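
A response of this shape can be consumed as follows. This is a minimal sketch that relies only on the fields shown in the example above (success, findings, and a numeric per-finding confidence); top_findings is a hypothetical helper, and the full schema in references/output-schema.md may contain more fields.

```python
import json
from typing import Any, Dict, List


def top_findings(response_text: str, limit: int = 3) -> List[Dict[str, Any]]:
    """Parse a diagnosis response and return findings ordered by confidence."""
    response = json.loads(response_text)
    if not response.get("success"):
        raise ValueError("diagnosis response reports success=false")
    findings = response.get("findings") or []
    # Highest numeric confidence first; findings without a score sort last.
    ranked = sorted(findings, key=lambda f: f.get("confidence", 0.0), reverse=True)
    return ranked[:limit]
```

Sorting by the per-finding confidence mirrors the rule that conclusions are ranked by evidence strength rather than by stage.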

Required Markdown Report Sections

When report_markdown is present, use it as the final report body. You may add clarifications the user requests, but do not discard evidence tables.

The report_markdown must contain these headings:

  • # CCE Storage Failure Automated Diagnosis Report
  • ## 1. Diagnosis Overview
  • ## 2. Investigation Process
  • ## 3. Key Object Relationships
  • ## 4. Evidence Matrix
  • ## 5. Diagnosis Conclusion
  • ## 6. Recommended Actions and Verification Standards
  • ## 7. Data Gaps and Manual Confirmation
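
The heading requirement above lends itself to a mechanical check. Here is a minimal sketch, assuming each heading appears on its own line in report_markdown; missing_report_headings is a hypothetical helper, not part of the skill.

```python
from typing import List

REQUIRED_HEADINGS = [
    "# CCE Storage Failure Automated Diagnosis Report",
    "## 1. Diagnosis Overview",
    "## 2. Investigation Process",
    "## 3. Key Object Relationships",
    "## 4. Evidence Matrix",
    "## 5. Diagnosis Conclusion",
    "## 6. Recommended Actions and Verification Standards",
    "## 7. Data Gaps and Manual Confirmation",
]


def missing_report_headings(report_markdown: str) -> List[str]:
    """Return required headings that do not appear as a line in the report."""
    lines = {line.strip() for line in report_markdown.splitlines()}
    return [heading for heading in REQUIRED_HEADINGS if heading not in lines]
```

Matching whole lines rather than substrings avoids false positives when a heading is only mentioned in body text.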

Finding Types

Common type values in findings:

| Type | Description |
| --- | --- |
| NormalWaitForFirstConsumer | PVC Pending with WaitForFirstConsumer; normal behavior awaiting Pod scheduling |
| EVSQuotaExceeded | EVS cloud disk quota exceeded |
| SFSSubnetIPInsufficient | SFS/SFS Turbo subnet available IP or mount target allocation failure |
| OBSBucketNameInvalid | OBS bucket name conflict or invalid naming |
| EVSAvailabilityZoneSchedulingConflict | EVS single-AZ affinity prevents Pod scheduling to storage AZ |
| LocalPVNodeOffline | Local PV host node down/offline |
| VolumeAttachmentNotCreated | K8s control plane has not issued attach instruction |
| EVSNodeAttachLimitExceeded | ECS per-node attached disk count limit reached |
| EVSResidualAttachmentLock | EVS residual node occupancy or underlying lock not released |
| EVSAttachFailed | EVS attach failure (general) |
| HostKernelMountFailed | Cloud-side attached but host kernel/filesystem mount failed |
| SFSNfsNetworkBlocked | SFS/SFS Turbo NFS mount timeout due to network data-plane blocking |
| OBSCredentialInvalid | OBS IAM delegation changed, AK/SK Secret invalid, or bucket permission error |
| StoragePermissionDenied | Permission denied / forbidden / access denied (general) |
| PVCCapacityExhausted | PVC capacity usage > 95% |
| PVCInodeExhausted | PVC inode usage > 95% |
| ReadOnlyFilesystemProtection | Linux read-only filesystem protection triggered |
| ConfigMapSecretSubPathDeadlock | ConfigMap/Secret subPath mount point deadlock |
| PVCProtectionBlocked | PVC Terminating with kubernetes.io/pvc-protection finalizer |
| StorageIOError | Runtime storage I/O errors |
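
The two 95% thresholds above can be illustrated against a Kubelet /stats/summary payload. This is a minimal sketch, assuming the raw Summary API layout (pods[].volume[] entries carrying pvcRef, usedBytes, capacityBytes, inodesUsed, and inodes); whether the dispatcher returns this raw layout or a pre-parsed form is not stated in this document, and pvc_usage_findings is a hypothetical helper.

```python
from typing import Any, Dict, List


def pvc_usage_findings(summary: Dict[str, Any], threshold: float = 0.95) -> List[Dict[str, Any]]:
    """Flag PVC-backed volumes whose byte or inode usage exceeds the threshold."""
    findings: List[Dict[str, Any]] = []
    for pod in summary.get("pods") or []:
        for volume in pod.get("volume") or []:
            pvc_ref = volume.get("pvcRef")
            if not pvc_ref:
                continue  # skip emptyDir, ConfigMap, and other non-PVC volumes
            pvc = f"{pvc_ref.get('namespace')}/{pvc_ref.get('name')}"

            capacity = volume.get("capacityBytes") or 0
            if capacity and (volume.get("usedBytes") or 0) / capacity > threshold:
                findings.append({"type": "PVCCapacityExhausted", "pvc": pvc})

            inodes = volume.get("inodes") or 0
            if inodes and (volume.get("inodesUsed") or 0) / inodes > threshold:
                findings.append({"type": "PVCInodeExhausted", "pvc": pvc})
    return findings
```

Volumes that report zero capacity or zero inodes are skipped rather than treated as exhausted, since some volume types do not report these values.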

Verification

Environment Check

Before first use, run the environment check script to install dependencies and validate credentials:

  • Linux/macOS: skill action=exec: bash skill://scripts/check_env.sh
  • Windows: skill action=exec: powershell -ExecutionPolicy Bypass -File skill://scripts/check_env.ps1

The script checks for Python >= 3.6, installs dependencies, and validates the SDK, credentials, and service availability.

Diagnosis Verification

  1. Run environment check and confirm all checks pass
  2. Execute huawei_storage_failure_diagnose with a known region and cluster_id:
    python3 scripts/huawei-cloud.py huawei_storage_failure_diagnose \
      region=cn-north-4 cluster_id=<cluster_id> include_stats=true include_logs=true
    
  3. Verify the returned JSON contains success=true, findings array, and report_markdown
  4. Check that the Markdown report contains all required sections (see references/output-schema.md)
  5. Compare diagnosis conclusions against known failure patterns

Best Practices

  1. Always call huawei_storage_failure_diagnose first; use individual tools only as fallback or for raw evidence
  2. Provide namespace and pvc_name/pod_name when possible to narrow diagnosis scope
  3. Set include_cloud=true only when you need cloud-side (EVS/SFS/OBS) supplementary evidence
  4. For NFS/SFS mount timeouts, always supplement with security group and VPC ACL checks
  5. For OBS 403 errors, focus on Everest CSI logs and event messages rather than cloud-side queries
  6. Conclusion confidence is ranked by evidence strength, not by stage priority
  7. Never write guesses as conclusions; output evidence gaps explicitly
  8. For any remediation action, output only the proposed plan and verification standards, then hand off to huawei-cloud-cce-auto-remediation-runner for user confirmation

Reference Documents

| Document | Description |
| --- | --- |
| references/workflow.md | Diagnosis triage flow, reusable capabilities, and stage-by-stage pipeline |
| references/output-schema.md | Output JSON schema and required Markdown report sections |
| references/risk-rules.md | Risk boundary rules: allowed read actions, prohibited write actions, and high-risk handoff |
| Huawei Cloud Python SDK Documentation | SDK reference |
| Huawei Cloud API Explorer | API interactive explorer |

Notes

  1. This skill performs read-only diagnosis. It never deletes PVC/PV/Pod resources, patches finalizers, force-detaches or force-attaches EVS disks, or modifies any StorageClass/IAM/Secret/SecurityGroup/ACL
  2. Never expose or log AK/SK or environment variable values
  3. All actions are executed via python3 scripts/huawei-cloud.py <action>; do not use hcloud CLI, kubectl, or direct API calls
  4. PVC Terminating: never directly suggest removing the kubernetes.io/pvc-protection finalizer; first prove that no Pod references the PVC and that there is no business data risk
  5. EVS residual mount or read-only filesystem scenarios: never suggest force-unmount, force-attach, or direct restart of database-class workloads before confirming filesystem consistency
  6. ConfigMap/Secret resourceVersion has no natural update timestamp; use managedFields.time, Pod timestamps, and FailedMount events as circumstantial evidence only
  7. Cross-diagnosis handoff: scheduling/node resource issues -> huawei-cloud-cce-node-failure-diagnoser; Service/security group/ACL chain -> huawei-cloud-cce-network-failure-diagnoser; remediation actions -> huawei-cloud-cce-auto-remediation-runner

Common Pitfalls

| Pitfall | Correct Approach |
| --- | --- |
| Treating WaitForFirstConsumer PVC Pending as a failure | A PVC in Pending state with WaitForFirstConsumer volumeBindingMode and no associated Pod is normal behavior, not a failure |
| Diagnosing scheduling failures without AZ context | EVS disks are single-AZ; always check PV nodeAffinity and node AZ labels before concluding scheduling issues |
| Confusing mount vs. attach | attached=true in VolumeAttachment means cloud-side attach succeeded; FailedMount events indicate host-side kernel/filesystem mount failure, not cloud attach failure |
| Overlooking CSI logs for OBS issues | OBS 403 and credential errors are best identified in Everest CSI logs, not in Kubernetes events alone |
| Premature finalizer removal | Removing kubernetes.io/pvc-protection without verifying no Pod references can cause data loss |
| Guessing without evidence | When no clear finding matches, output the evidence gap rather than fabricating a conclusion |
| Skipping environment check | Always run the environment check script before first diagnosis execution |
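
The first pitfall can be expressed as a small predicate. This is a minimal sketch, assuming PVC, StorageClass, and Pod items are shaped like the corresponding Kubernetes API objects; the shapes actually returned by the dispatcher actions are not documented here, and is_normal_wffc_pending is a hypothetical helper.

```python
from typing import Any, Dict, List


def is_normal_wffc_pending(
    pvc: Dict[str, Any],
    storage_classes: List[Dict[str, Any]],
    pods: List[Dict[str, Any]],
) -> bool:
    """True when a Pending PVC is simply waiting for its first consumer Pod."""
    if (pvc.get("status") or {}).get("phase") != "Pending":
        return False

    sc_name = (pvc.get("spec") or {}).get("storageClassName")
    binding_modes = {
        sc["metadata"]["name"]: sc.get("volumeBindingMode") for sc in storage_classes
    }
    if binding_modes.get(sc_name) != "WaitForFirstConsumer":
        return False

    meta = pvc["metadata"]
    for pod in pods:
        if pod["metadata"].get("namespace") != meta.get("namespace"):
            continue
        for volume in (pod.get("spec") or {}).get("volumes") or []:
            claim = (volume.get("persistentVolumeClaim") or {}).get("claimName")
            if claim == meta["name"]:
                # A Pod already consumes the PVC, so Pending needs real diagnosis.
                return False
    return True
```

When the predicate returns False for a Pending PVC, the remaining causes (quota, AZ conflict, provisioning errors) still need evidence from events and CSI logs.
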
Capability Tags
requires-wallet, requires-sensitive-credentials
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install huawei-cloud-cce-storage-failure-diagnoser
  3. After installation, invoke the skill by name or use /huawei-cloud-cce-storage-failure-diagnoser
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v0.1.0
Initial release