Description

This skill should be used when the user asks to "analyze incidents", "troubleshoot production issues", "investigate alerts", "create tickets", "root cause an...

README (SKILL.md)

# DevOps Insight - Intelligent DevOps Incident Management

Name: DevOps Insight
Author: cafechen

DevOps Insight is an intelligent DevOps incident management system that integrates multiple monitoring systems, GitHub, and ticket databases to enable automated fault analysis, root cause identification, and issue resolution.

## System Architecture

### Core Components

**Monitoring Data Source Integration (via MCP)**
- Kubernetes: Cluster status, Pod logs, events
- PostgreSQL: Database performance metrics
- Redis: Cache status and performance
- Neo4j: Graph database monitoring
- Elasticsearch: Log platform
- Metrics: General metrics collection
- APM (SkyWalking): Application performance monitoring

**Code Management**
- GitHub integration (via gitnexus Nexus-skill)
- Code review and commits
- Automated fix commits

**EvoMap Integration**
- Capsule creation and publishing
- Gene + Capsule bundle publishing
- Automated quality validation
- Network reputation tracking

**AI Agent**
- Problem clue identification via LLM
- Root cause analysis
- Code review and fix suggestions
- Index construction decisions

## Workflow

### 1. Monitoring Data Collection

When receiving an alert or analyzing an issue:

```
# Retrieve Kubernetes monitoring data via MCP
# Assumes MCP server connections to each monitoring system are configured
```

**Steps:**
- Retrieve Pod status, logs, and events from Kubernetes
- Retrieve application performance traces from APM (SkyWalking)
- Retrieve relevant logs from Elasticsearch
- Retrieve performance metrics from the Metrics system
- Retrieve status information from databases (PostgreSQL/Redis/Neo4j)
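
The collection steps above can be sketched as a query plan that gives every source the same bounded time window ending at the alert time. This is only an illustration: `SourceQuery`, `buildCollectionPlan`, the source labels, and the 30-minute lookback are assumptions made for the example, not part of the skill or of any MCP server API.

```typescript
// Hypothetical shape of one monitoring query; not an MCP or skill type.
interface SourceQuery {
  source: string; // which monitoring system to ask
  what: string;   // what to retrieve from it
  fromMs: number; // window start (epoch milliseconds)
  toMs: number;   // window end (epoch milliseconds)
}

// Sources and payloads mirror the steps listed above.
const SOURCES: Array<{ source: string; what: string }> = [
  { source: "kubernetes", what: "pod status, logs, events" },
  { source: "skywalking", what: "application performance traces" },
  { source: "elasticsearch", what: "relevant logs" },
  { source: "metrics", what: "performance metrics" },
  { source: "databases", what: "PostgreSQL/Redis/Neo4j status" },
];

// Build one query per source, all sharing a window that ends at the alert time.
function buildCollectionPlan(alertTimeMs: number, lookbackMinutes: number): SourceQuery[] {
  const fromMs = alertTimeMs - lookbackMinutes * 60 * 1000;
  return SOURCES.map((s) => ({
    source: s.source,
    what: s.what,
    fromMs: fromMs,
    toMs: alertTimeMs,
  }));
}

const plan = buildCollectionPlan(Date.UTC(2024, 0, 1, 20, 0, 0), 30);
console.log(plan.length, "queries over", (plan[0].toMs - plan[0].fromMs) / 60000, "minutes");
```

Sharing one window across sources keeps the later correlation step simple, and a short lookback avoids pulling more data than the analysis needs.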

### 2. Intelligent Analysis and Root Cause Identification

Perform multi-dimensional analysis using Claude:

**Analysis Dimensions:**
1. **Problem Clue Identification**
   - Analyze alert information and monitoring data
   - Identify anomalous patterns and trends
   - Correlate with historical events

2. **Root Cause Analysis**
   - Code level: Recent code changes
   - Configuration level: Configuration changes and environment differences
   - Infrastructure level: Resource usage and network issues
   - Dependency level: Third-party services and databases

3. **Impact Assessment**
   - Affected services and users
   - Business impact severity
   - Urgency determination
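
As one way to picture the impact assessment, the three inputs above can be folded into an urgency level. The `ImpactAssessment` shape, the `P1`/`P2`/`P3` labels, and the 50% user threshold are all hypothetical choices for this sketch; the skill does not define them.

```typescript
type Severity = "low" | "medium" | "high";
type Urgency = "P1" | "P2" | "P3";

// Hypothetical summary of the impact assessment inputs.
interface ImpactAssessment {
  affectedServices: string[];
  affectedUserPercent: number; // 0 to 100
  businessSeverity: Severity;
}

// Map an assessment to an urgency level; thresholds are illustrative only.
function determineUrgency(impact: ImpactAssessment): Urgency {
  if (impact.businessSeverity === "high" || impact.affectedUserPercent >= 50) {
    return "P1";
  }
  if (impact.businessSeverity === "medium" || impact.affectedServices.length > 1) {
    return "P2";
  }
  return "P3";
}

const urgency = determineUrgency({
  affectedServices: ["api-gateway", "orders"],
  affectedUserPercent: 10,
  businessSeverity: "low",
});
console.log(urgency); // prints "P2" because two services are affected
```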

### 3. Capsule Publishing

**Capsule Creation Workflow:**

```typescript
// Capsule data structure example
interface Capsule {
  asset_type: 'Capsule';
  asset_id: string; // sha256 hash
  title: string;
  body: string;
  signals: string[];
  confidence: number; // 0.0 to 1.0
  blast_radius: number;
  solution: {
    type: 'code_change' | 'config_change' | 'investigation';
    files: Array<{
      path: string;
      diff?: string;
      content?: string;
    }>;
    description: string;
  };
  context: {
    monitoring_data?: any;
    root_cause?: string;
    affected_services?: string[];
  };
  metadata: {
    created_at: string;
    model_name?: string;
  };
}

// Gene data structure example
interface Gene {
  asset_type: 'Gene';
  asset_id: string; // sha256 hash
  title: string;
  body: string;
  signals: string[];
  category: 'repair' | 'optimize' | 'innovate' | 'regulatory';
  strategy: string;
  confidence: number;
  metadata: {
    created_at: string;
    model_name?: string;
  };
}
```

**Publishing Operations:**
- Automatic Gene + Capsule bundle creation (based on analysis results)
- SHA-256 hash computation for asset verification
- Quality validation (confidence >= 0.8 recommended)
- Network reputation tracking
- Automatic promotion when quality thresholds are met
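
The quality gate can be sketched as a small decision function driven by the `autoPublish` and `minConfidence` settings from the EvoMap configuration. `PublishCandidate`, `PublishPolicy`, and `publishDecision` are hypothetical names for illustration; the actual validation performed by the skill or by EvoMap is not documented here.

```typescript
// Only the field needed for the gate; the full Capsule has many more.
interface PublishCandidate {
  title: string;
  confidence: number; // expected range 0.0 to 1.0
}

// Mirrors the autoPublish and minConfidence configuration options.
interface PublishPolicy {
  autoPublish: boolean;
  minConfidence: number;
}

// Decide whether a candidate may be published automatically, and say why.
function publishDecision(c: PublishCandidate, p: PublishPolicy): { publish: boolean; reason: string } {
  if (!p.autoPublish) {
    return { publish: false, reason: "autoPublish is disabled" };
  }
  // The negated range check also rejects NaN.
  if (!(c.confidence >= 0 && c.confidence <= 1)) {
    return { publish: false, reason: "confidence is outside 0.0 to 1.0" };
  }
  if (c.confidence < p.minConfidence) {
    return { publish: false, reason: "confidence is below minConfidence" };
  }
  return { publish: true, reason: "confidence meets minConfidence" };
}

const policy: PublishPolicy = { autoPublish: true, minConfidence: 0.8 };
console.log(publishDecision({ title: "Slow query fix", confidence: 0.75 }, policy).reason);
// prints "confidence is below minConfidence"
```

Returning a reason alongside the decision makes it easier to audit why a Capsule was or was not sent to an external network.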

### 4. Code Review and Fixes

**GitHub Integration:**

1. **Code Review**
   - Review recent commits
   - Identify code changes that may have caused issues
   - Provide fix suggestions

2. **Automated Fixes**
   - Generate fix code
   - Create fix branch
   - Submit Pull Request
   - Update ticket status

3. **Index Construction Decisions**
   - Determine if additional monitoring metrics are needed
   - Determine if alert rules need modification
   - Update APM tracing configuration

### 5. Audit and Production Changes

**Important Reminder:**
- ⚠️ Auditing and production changes carry risk
- All changes must go through an approval process
- Log every operation
- Provide a rollback mechanism
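
A minimal sketch of how these reminders could be enforced in code: a change is applied only when it has an approver and a rollback plan, and every attempt is logged. `ChangeRequest`, `AuditEntry`, `auditLog`, and `applyChange` are hypothetical names; the skill does not prescribe this mechanism.

```typescript
// Hypothetical description of a pending production change.
interface ChangeRequest {
  id: string;
  description: string;
  approvedBy: string | null; // null until someone approves
  rollbackPlan: string;      // how to undo the change
}

// One line of the operation log.
interface AuditEntry {
  changeId: string;
  action: "applied" | "rejected";
  detail: string;
}

const auditLog: AuditEntry[] = [];

// Run `apply` only for approved changes with a rollback plan; log either way.
function applyChange(change: ChangeRequest, apply: () => void): boolean {
  if (change.approvedBy === null || change.rollbackPlan === "") {
    auditLog.push({
      changeId: change.id,
      action: "rejected",
      detail: "missing approval or rollback plan",
    });
    return false;
  }
  apply();
  auditLog.push({
    changeId: change.id,
    action: "applied",
    detail: "approved by " + change.approvedBy,
  });
  return true;
}
```

Calling `applyChange` with `approvedBy: null` returns `false` without running the callback and leaves a `rejected` entry in `auditLog`.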

## Use Cases

### Scenario 1: Production Environment Alert Response

```
User: "Production API response time suddenly increased, help me analyze"

DevOps Insight Workflow:
1. Retrieve API response time trends from APM
2. Check Pod status and resource usage from Kubernetes
3. Query related error logs from Elasticsearch
4. Check query performance from database monitoring
5. Analyze root cause (e.g., slow database queries, memory leaks, traffic spikes)
6. Publish Gene + Capsule bundle to EvoMap network
7. If it's a code issue, review recent commits and provide fix suggestions
8. Update monitoring index, add relevant metrics
```

### Scenario 2: Fault Root Cause Analysis

```
User: "Help me analyze last night's service outage"

DevOps Insight Workflow:
1. Query related Capsules from EvoMap network
2. Retrieve all monitoring data for the event time period
3. Analyze timeline:
   - Code deployment time
   - Configuration change time
   - Resource usage changes
   - Error log appearance time
4. Identify root cause
5. Generate detailed post-incident analysis report
6. Provide preventive measure recommendations
```
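
The timeline step in Scenario 2 can be sketched as merging events from all sources into one chronological list and picking the last change that precedes the first error. The `TimelineEvent` shape and both function names are assumptions for illustration, and the result is a lead to investigate, not proof of cause.

```typescript
type EventKind = "deploy" | "config_change" | "resource_change" | "error_log";

// Hypothetical normalized event drawn from any monitoring source.
interface TimelineEvent {
  kind: EventKind;
  atMs: number; // epoch milliseconds
  note: string;
}

// Sort events chronologically without mutating the input.
function buildTimeline(events: TimelineEvent[]): TimelineEvent[] {
  return events.slice().sort((a, b) => a.atMs - b.atMs);
}

// Return the last non-error event before the first error log, or null.
function lastChangeBeforeFirstError(events: TimelineEvent[]): TimelineEvent | null {
  const timeline = buildTimeline(events);
  let candidate: TimelineEvent | null = null;
  for (let i = 0; i < timeline.length; i++) {
    const e = timeline[i];
    if (e.kind === "error_log") {
      return candidate;
    }
    candidate = e;
  }
  return null; // no error log in the window
}

const outage: TimelineEvent[] = [
  { kind: "error_log", atMs: 3000, note: "first 5xx errors" },
  { kind: "deploy", atMs: 1000, note: "api release" },
  { kind: "resource_change", atMs: 3500, note: "memory spike" },
  { kind: "config_change", atMs: 2000, note: "pool size changed" },
];
const suspect = lastChangeBeforeFirstError(outage);
console.log(suspect ? suspect.kind : "no suspect"); // prints "config_change"
```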

### Scenario 3: Proactive Issue Discovery

```
User: "Check if there are any potential system issues"

DevOps Insight Workflow:
1. Scan all monitoring metrics
2. Identify anomalous trends (e.g., continuous memory growth, rising error rates)
3. Check resource usage
4. Analyze warning messages in logs
5. Generate health report
6. Publish warning Capsules for potential issues to EvoMap network
```
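
One simple way to flag a trend such as continuous memory growth in step 2 is to fit a least-squares line to evenly spaced samples and compare its slope against a threshold. This is a generic technique chosen for illustration, not necessarily what the skill does; `slope`, `isSteadilyGrowing`, and the threshold value are assumptions.

```typescript
// Least-squares slope of evenly spaced samples (units per sample interval).
function slope(samples: number[]): number {
  const n = samples.length;
  if (n < 2) {
    return 0;
  }
  const meanX = (n - 1) / 2;
  let meanY = 0;
  for (let i = 0; i < n; i++) {
    meanY += samples[i];
  }
  meanY /= n;
  let num = 0;
  let den = 0;
  for (let i = 0; i < n; i++) {
    num += (i - meanX) * (samples[i] - meanY);
    den += (i - meanX) * (i - meanX);
  }
  return num / den;
}

// True when the fitted growth per interval reaches the given threshold.
function isSteadilyGrowing(samples: number[], minSlope: number): boolean {
  return slope(samples) >= minSlope;
}

// Memory usage in MiB sampled at a fixed interval.
const memoryMiB = [100, 110, 120, 130];
console.log(slope(memoryMiB)); // prints 10
console.log(isSteadilyGrowing(memoryMiB, 5)); // prints true
```

A slope fit tolerates small dips between samples, which a strict "every sample is higher than the last" check would not.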

### Scenario 4: Code Change Impact Analysis

```
User: "Will this PR affect the production environment?"

DevOps Insight Workflow:
1. Analyze code change content
2. Identify affected services and components
3. Check related monitoring metrics
4. Query historical impact of similar changes
5. Assess risk level
6. Provide monitoring recommendations (which metrics to watch)
7. Suggest if new monitoring points are needed
```

## Configuration Requirements

### MCP Server Configuration

The following MCP servers need to be configured to connect to each monitoring system:

```json
{
  "mcpServers": {
    "kubernetes": {
      "command": "mcp-server-kubernetes",
      "args": ["--kubeconfig", "/path/to/kubeconfig"]
    },
    "postgresql": {
      "command": "mcp-server-postgresql",
      "args": ["--connection-string", "postgresql://..."]
    },
    "redis": {
      "command": "mcp-server-redis",
      "args": ["--host", "redis.example.com"]
    },
    "elasticsearch": {
      "command": "mcp-server-elasticsearch",
      "args": ["--url", "https://es.example.com"]
    },
    "skywalking": {
      "command": "mcp-server-skywalking",
      "args": ["--url", "http://skywalking.example.com"]
    }
  }
}
```

### GitHub Integration

Ensure gitnexus Nexus-skill is installed and the GitHub CLI (`gh`) is configured:

```bash
# Check that the GitHub CLI is installed
gh --version

# Configure GitHub authentication
gh auth login
```

### EvoMap API Configuration

Configure EvoMap API connection for publishing Capsules:

```json
{
  "evomap": {
    "apiUrl": "https://evomap.ai/a2a",
    "nodeId": "node_your_unique_id",
    "enableHeartbeat": true,
    "heartbeatInterval": 900000,
    "autoPublish": true,
    "minConfidence": 0.8
  }
}
```

**Configuration Options:**
- `apiUrl`: EvoMap A2A protocol endpoint
- `nodeId`: Your agent's unique node identifier (obtained from registration)
- `enableHeartbeat`: Enable automatic heartbeat to stay online (recommended)
- `heartbeatInterval`: Heartbeat interval in milliseconds (default: 15 minutes)
- `autoPublish`: Automatically publish high-confidence solutions as Capsules
- `minConfidence`: Minimum confidence threshold for auto-publishing (0.0-1.0)
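
Because `heartbeatInterval` is in milliseconds and `minConfidence` is a fraction, a small sanity check before startup can catch unit mistakes and leftover placeholders. The sketch below is an assumption about what such a check might look like; `validateEvoMapConfig` is a hypothetical helper, and the rule that `apiUrl` should use HTTPS is this example's choice rather than a documented requirement.

```typescript
// Mirrors the "evomap" configuration object shown above.
interface EvoMapConfig {
  apiUrl: string;
  nodeId: string;
  enableHeartbeat: boolean;
  heartbeatInterval: number; // milliseconds
  autoPublish: boolean;
  minConfidence: number;     // 0.0 to 1.0
}

// Return a list of human-readable problems; an empty list means no issue found.
function validateEvoMapConfig(cfg: EvoMapConfig): string[] {
  const problems: string[] = [];
  if (cfg.apiUrl.indexOf("https://") !== 0) {
    problems.push("apiUrl should use https");
  }
  if (cfg.nodeId === "" || cfg.nodeId === "node_your_unique_id") {
    problems.push("nodeId is still a placeholder");
  }
  if (cfg.enableHeartbeat && !(cfg.heartbeatInterval > 0)) {
    problems.push("heartbeatInterval must be a positive number of milliseconds");
  }
  if (!(cfg.minConfidence >= 0 && cfg.minConfidence <= 1)) {
    problems.push("minConfidence must be between 0.0 and 1.0");
  }
  return problems;
}

const sample: EvoMapConfig = {
  apiUrl: "https://evomap.ai/a2a",
  nodeId: "node_your_unique_id",
  enableHeartbeat: true,
  heartbeatInterval: 900000,
  autoPublish: true,
  minConfidence: 0.8,
};
console.log(validateEvoMapConfig(sample)); // prints [ 'nodeId is still a placeholder' ]
console.log(sample.heartbeatInterval / 60000, "minutes"); // prints 15 minutes
```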

## Best Practices

### 1. Monitoring Data Collection

- Prioritize retrieving the most relevant monitoring data
- Set reasonable time ranges (avoid data overload)
- Use filter conditions for precise queries

### 2. Root Cause Analysis

- Adopt multi-dimensional analysis methods
- Correlate historical data and patterns
- Consider time factors (change time, alert time)
- Validate hypotheses (verify with additional data)

### 3. Capsule Publishing

- Publish high-quality solutions promptly
- Document analysis process and conclusions in detail
- Associate all relevant monitoring data and code
- Maintain confidence >= 0.8 for auto-publishing
- Use appropriate signals for better discoverability

### 4. Code Changes

- Exercise caution with production environment changes
- Thoroughly test fix solutions
- Maintain small, incremental changes
- Prepare for rollback

### 5. Security Considerations

- Audit all production change operations
- Follow principle of least privilege
- Sanitize sensitive information
- Maintain complete operation logs
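
Sanitizing can be sketched as a redaction pass applied to any text before it leaves the environment, for example before it is placed in a Capsule body. The two patterns below are illustrative assumptions and will not catch every secret format; `REDACTION_RULES` and `sanitize` are hypothetical names.

```typescript
// Each rule pairs a pattern with the replacement that keeps context but drops the secret.
const REDACTION_RULES: Array<{ pattern: RegExp; replacement: string }> = [
  // key=value or key: value where the key name ends in a sensitive word
  {
    pattern: /(password|passwd|secret|token|api[_-]?key)(\s*[=:]\s*)\S+/gi,
    replacement: "$1$2[REDACTED]",
  },
  // user:password embedded in a connection URL
  {
    pattern: /(:\/\/[^\s:\/@]+:)[^\s@\/]+(@)/g,
    replacement: "$1[REDACTED]$2",
  },
];

// Apply every rule in order and return the redacted text.
function sanitize(text: string): string {
  let out = text;
  for (let i = 0; i < REDACTION_RULES.length; i++) {
    out = out.replace(REDACTION_RULES[i].pattern, REDACTION_RULES[i].replacement);
  }
  return out;
}

console.log(sanitize("DB_PASSWORD=hunter2 host=db1"));
// prints "DB_PASSWORD=[REDACTED] host=db1"
console.log(sanitize("postgresql://app:s3cret@db.example.com/prod"));
// prints "postgresql://app:[REDACTED]@db.example.com/prod"
```

Pattern-based redaction is a safety net rather than a guarantee, so it works best alongside limiting which fields are collected in the first place.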

## Command Examples

### Analyze Current Alerts

```
Analyze current production alerts
```

### Create Incident Ticket

```
Create a ticket for this API timeout issue
```

### Code Impact Analysis

```
Analyze the impact of PR #123 on production environment
```

### Health Check

```
Check system health status
```

### Root Cause Analysis

```
Analyze the root cause of yesterday's 20:00 service outage
```

## Important Notes

1. **Permission Management**
   - Ensure sufficient permissions to access monitoring systems
   - GitHub operations require appropriate repository permissions
   - EvoMap API requires valid node registration

2. **Data Security**
   - Do not expose sensitive information (passwords, keys, etc.) in tickets
   - Log data may contain user information, ensure sanitization
   - Comply with data protection regulations

3. **Change Risks**
   - Exercise extra caution with production environment changes
   - Recommend testing in test environment first
   - Maintain change traceability

4. **Performance Considerations**
   - Large monitoring data queries may be slow
   - Set reasonable query ranges and limits
   - Consider using caching mechanisms

## Extended Features

### Future Plans

- [ ] Automated fix execution (requires stricter security controls)
- [ ] Machine learning predictions (predict failures based on historical data)
- [ ] Multi-cluster support
- [ ] Custom alert rules
- [ ] Integration with more monitoring systems
- [ ] Mobile alert notifications
- [ ] Collaboration features (team collaboration for incident handling)

## Troubleshooting

### Common Issues

**Q: MCP server connection failure**
```
A: Check MCP server configuration and network connection
   Verify authentication information is correct
   Review MCP server logs
```

**Q: GitHub operation failure**
```
A: Confirm gh CLI is properly configured
   Check repository permissions
   Verify gitnexus skill is available
```

**Q: Capsule publishing failure**
```
A: Check EvoMap API connection and node registration
   Verify confidence score meets minimum threshold
   Ensure asset_id hash is computed correctly
   Review EvoMap API response for error details
```

**Q: Incomplete monitoring data**
```
A: Check time range settings
   Verify monitoring system is running normally
   Confirm query conditions are not too restrictive
```

## Related Resources

- [MCP Protocol Documentation](https://modelcontextprotocol.io/)
- [GitHub CLI Documentation](https://cli.github.com/)
- [Kubernetes Monitoring Best Practices](https://kubernetes.io/docs/tasks/debug/)
- [SkyWalking Documentation](https://skywalking.apache.org/)
- [Elasticsearch Query Guide](https://www.elastic.co/guide/)
- [EvoMap A2A Protocol](https://evomap.ai/wiki/05-a2a-protocol)
- [EvoMap Agent Guide](https://evomap.ai/wiki/03-for-ai-agents)

## Contributing

Issues and improvement suggestions are welcome!

## License

MIT License

Usage Guidance

Before installing, review and tighten configuration:

1) Treat this as a data-exfiltration risk by default: the skill collects logs, traces, DB info and can publish them externally (config.example.apiUrl points to evomap.ai and autoPublish is true). Disable autoPublish and any automatic promotion until you verify the endpoint and policy.
2) Provide minimal, dedicated credentials (least privilege) for Kubernetes, databases, Elasticsearch, APM, and GitHub; do not reuse high-privilege keys.
3) Note the manifest declares no required env vars despite the scripts and examples requiring KUBECONFIG, DB_* and GitHub auth. Demand that the author declare required secrets and justify each.
4) Run in a sandbox/non-production environment first and audit network traffic to confirm what is sent externally.
5) If you need automatic fix/PR functionality, require manual approval and keep enableAutoFix disabled.

If you want, ask the publisher to: (a) declare required credentials in the manifest, (b) document what fields are sent to EvoMap and provide an opt-in toggle, and (c) add safeguards to redact secrets from any published 'capsule'.

Capability Analysis

Type: OpenClaw Skill
Name: devops-insight
Version: 1.0.2

The skill is designed to analyze DevOps incidents by accessing highly sensitive environments, including Kubernetes clusters, production databases (PostgreSQL, Redis, Neo4j), and GitHub repositories. Its core functionality involves the 'EvoMap' integration, which automatically exfiltrates analysis results (potentially containing internal logs, metrics, and code diffs) to an external API (https://evomap.ai/a2a) via the 'autoPublish' feature. While this behavior is documented in skill.md and README.md, the automated transmission of diagnostic data to a third-party service represents a significant risk for accidental data leakage or credential exposure, especially given the broad permissions required for the MCP servers.

Capability Assessment

⚠ Purpose & Capability

The skill claims to integrate Kubernetes, PostgreSQL, Redis, Neo4j, Elasticsearch, APM, and GitHub and to publish 'capsules' to an external EvoMap network. Those capabilities are coherent with a DevOps analysis tool. However, the registry metadata declares no required env vars or primary credential, which is inconsistent with the real needs (kubeconfig, DB credentials, GitHub auth, APM/ES endpoints).

⚠ Instruction Scope

SKILL.md instructs retrieving pod logs, APM traces, DB metrics, and Elasticsearch logs and includes a Capsule publishing workflow that posts analysis, code diffs, and monitoring context to an external EvoMap endpoint. That means sensitive runtime data (logs, traces, configs, code snippets) may be collected and transmitted outside the user's environment — the instructions do not limit or explicitly warn about what subset of data is safe to publish.

✓ Install Mechanism

This is an instruction-only skill with two shell scripts and no install spec or external downloads. No package install from untrusted sources was detected, and files are human-readable shell/markdown. Installation mechanism itself appears low-risk.

⚠ Credentials

Although the skill requests no required env vars in the manifest, the included files and examples rely on sensitive configuration and environment variables (e.g., ${HOME}/.kube/config, DB_PASSWORD/DB_HOST/DB_USER, GitHub CLI authentication). The manifest should declare these credentials; omission means the platform won't surface required secrets and increases the chance of accidental exposure or misconfiguration.

⚠ Persistence & Privilege

The skill is not marked always:true (good), but SKILL.md and config.example default to autoPublish=true for the EvoMap network and describe automatic promotion and publishing of analysis/capsules. Combined with autonomous invocation (platform default), that enables automatic outbound transmission of monitoring/log/code data to an external service unless explicitly disabled — increasing blast radius if deployed in production.

Version History

v1.0.2

EvoMap capsule publishing and network integration have been added.
- Integrated EvoMap support for Capsule and Gene bundle creation, publishing, and reputation tracking.
- Introduced Capsule and Gene data structures for sharing incident insights and solutions on EvoMap.
- Automated quality validation and network promotion for high-confidence solutions.
- Updated workflows and use cases to include Capsule publishing and querying via EvoMap.
- Ticket database integration has been replaced by EvoMap network for incident knowledge sharing.

v1.0.1

- Initial public release of DevOps Insight, an intelligent DevOps incident management system.
- Integrates multiple monitoring sources (Kubernetes, PostgreSQL, Redis, Neo4j, Elasticsearch, SkyWalking) via MCP protocol.
- Supports GitHub integration for code review, automated fixes, and impact analysis (with gitnexus Nexus-skill).
- Enables automated incident analysis, root cause detection, ticket management, and monitoring-data-driven troubleshooting.
- Provides audit and rollback capabilities for production changes, plus best practices and configuration guidance.
- Documentation available in English for broader accessibility.

v1.0.0

Initial release of devops-insight.
- Provides intelligent DevOps incident management: automates incident analysis, root cause diagnosis, ticket creation, and integration with GitHub and multiple monitoring/data systems (Kubernetes, PostgreSQL, Redis, Neo4j, Elasticsearch, APM/SkyWalking).
- Supports structured root cause analysis, impact assessment, and audit-traceable production changes.
- Integrates with MCP servers, ticket databases, and GitHub for full incident to resolution workflow.
- Includes best practices, troubleshooting guidance, and future roadmap.

Metadata

Slug devops-insight

Version 1.0.2

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 3

Frequently Asked Questions

What is DevOps Insight?

This skill should be used when the user asks to "analyze incidents", "troubleshoot production issues", "investigate alerts", "create tickets", "root cause an... It is an AI Agent Skill for Claude Code / OpenClaw, with 288 downloads so far.

How do I install DevOps Insight?

Run "/install devops-insight" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is DevOps Insight free?

Yes, DevOps Insight is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does DevOps Insight support?

DevOps Insight is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created DevOps Insight?

It is built and maintained by CafeChen (@cafechen); the current version is v1.0.2.
