Building a Dedicated Operations AI Agent (K8s Practice Tutorial) - STAROps

Use the "K8s Cluster Operations Assistant" as an end-to-end case study to combine default rules, skills, and MCP tools, transforming a general-purpose digital employee into a dedicated operations AI agent tailored to your team.

Scenario Description

Your team manages day-to-day operations of K8s clusters in a production environment. You need a digital employee that can:

Run daily inspections and produce structured reports.
Follow safety protocols when performing K8s change operations, so that a mistaken command does not reach production unchecked.
Diagnose application performance issues in depth, covering dimensions such as JVM behavior and connection pool metrics.

This tutorial walks you through creating a digital employee, writing default rules, configuring skills, and integrating MCP tools. Each section includes before-and-after comparisons so you can see the effect of each configuration step.

Prerequisites

You have registered and logged in to the STAROps console.
At least one workspace has been created with K8s cluster observability data connected (Prometheus metrics, SLS logs, etc.).
Your account has permissions to create and manage digital employees.

Step 1: Create a Digital Employee

Parameter	Example Value
ID	`k8s-ops-assistant`
Display Name	K8s Operations Assistant
Description	Responsible for daily inspections, alert analysis, fault diagnosis, and change operations for production K8s clusters.

Select the RAM role type and configure the ARN. Make sure the role includes permissions for CMS data read, SLS log read, and K8s cluster operations.

Note

The description field influences the AI's behavioral tendencies. A digital employee described as "responsible for K8s cluster operations" will naturally prioritize a K8s-centric perspective when analyzing issues.

Step 2: Write Default Rules

Default rules (Rule Context) are the behavioral guidance for your digital employee, determining the depth and accuracy of the AI's analysis. Structure your rules around four components: role definition, data source focus, analysis logic constraints, and output requirements.

The following is a complete rule template for K8s cluster inspection scenarios. You can copy and adjust it to fit your environment.

Default Rules Configuration Example

You are a senior Kubernetes cluster administrator, responsible for ensuring cluster security and stability.

When performing inspection or diagnostic tasks, strictly adhere to the following steps:
1. Query the K8s APIServer Audit Log first
   - Key Filters: Focus on operations with verb as delete, patch, update, especially modifications to ConfigMap, Secret, and Deployment
   - Ignore read-only requests with verb as get, list, watch
2. Check the k8s.event data stream
   - Key Focus: Abnormal events with Reason as OOMKilled, Evicted, CrashLoopBackOff, FailedScheduling
3. Combine with Node and Pod CPU/Memory utilization metrics
   - Confirm whether the above abnormal events are caused by resource saturation

Requirement	Description
Report Structure	Must include: Overview, Exception List, Root Cause Analysis, Recommended Actions
Accountability	Must list specific "High-risk Change Operator" and "Change Time"
OOM Handling	If OOMKilled is found, provide recommended Request/Limit adjustment values directly
Escalation	High-risk issues: immediate notification; Low-risk issues: daily report summary

- Prohibited from executing change operations without confirmation
- Prohibited from accessing data in non-associated workspaces
- Prohibited from including sensitive information in reports (e.g., Secret contents)

Key design principles for your rules:

Role definition: Gives the AI a clear area of expertise to avoid generic, unfocused responses.
Analysis logic and priorities: Specifies the data query order so the AI never skips critical data sources.
Output requirements: Constrains the report format to ensure consistent, readable output.
Constraints: Establishes safety boundaries to prevent the AI from executing dangerous operations.
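
The audit-log filtering step in the rule template can be illustrated with a minimal sketch. It assumes audit events arrive as dictionaries shaped like the Kubernetes `audit.k8s.io/v1` Event (`verb`, `objectRef`, `user`, `requestReceivedTimestamp`); the function name and the sample events are hypothetical, and the sketch narrows the filter to the three resource types the template singles out, which is stricter than "especially".

```python
from typing import Iterable

# Verbs and resource types come from the rule template above.
WRITE_VERBS = {"delete", "patch", "update"}
SENSITIVE_RESOURCES = {"configmaps", "secrets", "deployments"}


def high_risk_changes(events: Iterable[dict]) -> list[dict]:
    """Return operator/time/target records for high-risk write operations."""
    findings = []
    for event in events:
        if event.get("verb") not in WRITE_VERBS:
            continue  # skips get/list/watch and any other verb outside the focus set
        ref = event.get("objectRef") or {}
        if ref.get("resource") not in SENSITIVE_RESOURCES:
            continue  # assumption: only the three highlighted resource types are reported
        findings.append({
            "operator": (event.get("user") or {}).get("username", "unknown"),
            "time": event.get("requestReceivedTimestamp", "unknown"),
            "verb": event["verb"],
            "target": f'{ref.get("resource")}/{ref.get("namespace", "")}/{ref.get("name", "")}',
        })
    return findings


# Hypothetical sample data for illustration only.
SAMPLE_EVENTS = [
    {
        "verb": "get",
        "objectRef": {"resource": "pods", "namespace": "cms-demo", "name": "payment-abc"},
        "user": {"username": "bob"},
    },
    {
        "verb": "patch",
        "objectRef": {"resource": "deployments", "namespace": "cms-demo", "name": "payment"},
        "user": {"username": "alice"},
        "requestReceivedTimestamp": "2024-01-01T08:00:00.000000Z",
    },
]

if __name__ == "__main__":
    for finding in high_risk_changes(SAMPLE_EVENTS):
        print(finding)
```

The returned records carry the operator and timestamp that the "Accountability" output requirement asks the report to list.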

Step 3: Integrate MCP Tools

On the digital employee details page, click Add MCP Service in the MCP Services section.
Select either VPC mode or Direct connection mode and complete the configuration.
Click Get Tool List to view the available tools.
Click Save to add the MCP service to the digital employee.
After the integration is complete, verify that the service status shows as Normal in the MCP service list. Click the service name to view the registered tool list and confirm it includes tools such as get_pods and scale_deployment.

MCP Tools Used in This Scenario

This tutorial uses the kubectl MCP Server (SSE protocol), which provides complete query and operation capabilities for K8s clusters.

Tool Type	Capabilities	Usage in This Scenario
Cluster query	`get_pods`, `get_deployments`, `get_nodes`, `get_services`, `get_events`	Retrieve cluster state, Pod lists, and event information during inspections.
Diagnostic tools	`check_pod_health`, `diagnose_pod_crash`, `get_logs`, `get_previous_logs`	Investigate Pod anomalies and retrieve crash logs.
Change operations	`scale_deployment`, `restart_deployment`, `kubectl_rollout`, `kubectl_apply`	Scale replicas, restart services, and apply configuration changes.
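
If you export the registered tool names from the console, a quick check like the following can confirm that nothing this scenario depends on is missing. This is a sketch only: the tool names come from the table above, while the `missing_tools` helper and the way the registered list is obtained are assumptions, not part of STAROps.

```python
# Tool names grouped as in the table above.
REQUIRED_TOOLS = {
    "cluster_query": {"get_pods", "get_deployments", "get_nodes", "get_services", "get_events"},
    "diagnostic_tools": {"check_pod_health", "diagnose_pod_crash", "get_logs", "get_previous_logs"},
    "change_operations": {"scale_deployment", "restart_deployment", "kubectl_rollout", "kubectl_apply"},
}


def missing_tools(registered: set[str]) -> dict[str, set[str]]:
    """Return, per category, the required tools absent from the registered set."""
    gaps = {}
    for category, tools in REQUIRED_TOOLS.items():
        absent = tools - registered
        if absent:
            gaps[category] = absent
    return gaps


if __name__ == "__main__":
    # Hypothetical registered list: everything except kubectl_apply.
    registered = set().union(*REQUIRED_TOOLS.values()) - {"kubectl_apply"}
    print(missing_tools(registered))
```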

Step 4: Configure Skills

Skills define standardized workflows for your digital employee in specific scenarios. Unlike default rules, which always apply, skills are loaded on demand and activate only when a matching scenario is detected. This makes them ideal for complex, specialized capabilities.

The following is a complete, ready-to-use skill configuration.

K8s Operations Guardian Skill

This skill enforces strict safety protocols when the digital employee performs K8s change operations, preventing accidental mistakes that could cause production incidents.

Parameter	Value
Skill Name	`kubernetes-ops-guardian`
Display Name	K8s Operations Guardian
Description	For safely and reliably executing Kubernetes cluster operations and changes.

SKILL.md content:

# K8s Operations Guardian Protocol

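## 1. Core Role and Principles
You are a senior SRE who treats production with respect whenever you perform K8s operations.
- **Blast Radius First**: Assess the blast radius before any operation.
- **Dry-Run by Default**: For write operations, show the dry-run result or the change diff first, and execute only after the user confirms.
- **No Assumption**: Do not assume user intent. "Delete the Pod" may mean restart, scale down, or troubleshoot; always clarify.
- **Rollback Awareness**: Every change operation must come with a rollback plan.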

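## 2. Available MCP Tools

The following tools are provided by the kubectl MCP Server and are graded by operational risk:

### L0 Read-Only Tools (call directly)
| Tool | Purpose |
|:---|:---|
| `get_pods` | List Pods and their status in the specified Namespace |
| `get_deployments` | List Deployments in the specified Namespace |
| `get_services` | List Services in the specified Namespace |
| `get_nodes` | List cluster nodes |
| `get_namespaces` | List all Namespaces |
| `get_configmaps` | List ConfigMaps in the specified Namespace |
| `get_events` | Retrieve K8s events |
| `get_logs` | Retrieve Pod logs |
| `get_previous_logs` | Retrieve logs from the previous container instance (for crash debugging) |
| `get_pod_events` | Retrieve events for the specified Pod |
| `check_pod_health` | Check Pod health status |
| `get_cluster_info` | Retrieve cluster information |
| `health_check` | Run a cluster health check |
| `kubectl_describe` | Describe a resource in detail |
| `diagnose_pod_crash` | Automatically diagnose the cause of a Pod crash |
| `diagnose_network_connectivity` | Diagnose network connectivity |
| `check_dns_resolution` | Check in-cluster DNS resolution |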

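### L1-L2 Write Tools (pre-flight check and user confirmation required before calling)
| Tool | Purpose | Operation Level |
|:---|:---|:---|
| `scale_deployment` | Scale the replica count of a Deployment | L2 |
| `restart_deployment` | Trigger a rolling restart of a Deployment | L2 |
| `kubectl_rollout` | Manage a rollout (restart/undo/pause/resume) | L2 |
| `kubectl_apply` | Apply a YAML configuration to the cluster | L2 |

### L3 Destructive Tools (refuse direct execution and offer an alternative)
| Tool | Purpose | Operation Level |
|:---|:---|:---|
| `delete_resource` | Delete a K8s resource | L3 |
| `kubectl_generic` | Run an arbitrary kubectl command | L3 |
| `exec_in_pod` | Run a command inside a Pod | L1 (read-only commands) / L3 (write commands) |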

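## 3. Operation Workflow

### Phase 1: Intent Parsing & Context Collection
1. Confirm the target resource (Namespace / Deployment / Pod).
2. Confirm the operation intent (troubleshooting? release? scaling? cleanup?).
3. Use L0 tools to query the current state automatically (do not ask the user):
   - Call `get_deployments` to get the replica count and status of the target Deployment.
   - Call `get_pods` to confirm that the Pods are running.
   - Call `get_events` to check for recent abnormal events.

### Phase 2: Risk Assessment (Pre-flight Check)
Mandatory checklist:
- Replica count: Call `get_deployments` to confirm the current replica count. If there is only 1 replica, any Pod-level operation is escalated to L2.
- Pod health: Call `check_pod_health` to confirm that the target Pod is running normally.
- Associated resources: Call `get_services` to confirm whether a Service is associated, and assess the scope of impact.
- Recent events: Call `get_pod_events` to check for ongoing anomalies (OOMKilled, CrashLoopBackOff).

### Phase 3: Safe Execution
Operations at L1 or above must first output a pre-flight change report containing: operation intent, target resource, operation level, current state, the MCP tool to be called and its parameters, blast radius, and rollback plan.

After the user confirms, call the corresponding write tool:
- Scale down/up: Call `scale_deployment`, specifying the deployment name, namespace, and target replica count.
- Restart a service: Call `restart_deployment` or `kubectl_rollout` (action=restart).
- Apply a configuration change: Call `kubectl_apply`, passing in the YAML content.
- Roll back: Call `kubectl_rollout` (action=undo).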

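## 4. Red-Line Rules
1. Never call a write tool without first confirming the Namespace.
2. Never call `scale_deployment`, `delete_resource`, or `kubectl_apply` against the kube-system namespace unless the user confirms a second time.
3. Never call `kubectl_generic` to run --force --grace-period=0 unless the user explicitly understands the consequences.
4. Never call `restart_deployment` directly on a single-replica service without warning about brief unavailability.
5. Never call `delete_resource` directly to delete Namespace-level resources; always offer an alternative.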

Verifying the Skill

To verify the skill is working correctly, open an AI session and enter: //K8s Operations Guardian Scale down the payment service by 1 replica.

The digital employee should respond with a pre-flight change report before taking any action:

K8s Change Pre-flight Report

Operation intent: Scale the payment service from 3 replicas down to 2
Target resource: Deployment/payment
Operation level: L2 (high-risk write)

Current state:
  - Replicas: Ready 3/3
  - PDB: none
  - HPA: none

Command to be executed:
  $ kubectl scale deployment payment -n cms-demo --replicas=2

Blast radius:
  - Pods affected: 1 (1 Pod will be terminated)
  - Estimated downtime: 0s (rolling scale-down; service is not affected)
  - Risk of traffic interruption: no

Rollback plan:
  $ kubectl scale deployment payment -n cms-demo --replicas=3

Please confirm whether to execute this operation.

After you confirm, the digital employee executes the operation and returns a completion summary:

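The scale-down of the payment service is complete:
- Service location: cms-demo namespace
- Previous replica count: 3
- Current replica count: 2
- Operation status: success

The payment service is now configured with 2 replicas and is running stably.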
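
The pre-flight report shown earlier in this section pairs every change with its rollback. As a rough illustration of that pairing, the sketch below derives the forward command, the rollback command, and the affected Pod count from one record. The `ScalePreflight` class is hypothetical and is not how the digital employee builds its report; only the `kubectl scale deployment ... -n ... --replicas=N` form is taken from the report itself.

```python
from dataclasses import dataclass


@dataclass
class ScalePreflight:
    """Hypothetical record holding what a scale pre-flight report needs."""
    deployment: str
    namespace: str
    current_replicas: int
    target_replicas: int

    def _scale_cmd(self, replicas: int) -> str:
        return (
            f"kubectl scale deployment {self.deployment} "
            f"-n {self.namespace} --replicas={replicas}"
        )

    def command(self) -> str:
        # Forward change: move to the target replica count.
        return self._scale_cmd(self.target_replicas)

    def rollback(self) -> str:
        # Rollback: restore the replica count observed before the change.
        return self._scale_cmd(self.current_replicas)

    def affected_pods(self) -> int:
        return abs(self.current_replicas - self.target_replicas)

    def render(self) -> str:
        return "\n".join([
            "K8s Change Pre-flight Report",
            f"Operation intent: scale {self.deployment} from "
            f"{self.current_replicas} to {self.target_replicas} replicas",
            f"Target resource: Deployment/{self.deployment}",
            f"Command to be executed: $ {self.command()}",
            f"Pods affected: {self.affected_pods()}",
            f"Rollback plan: $ {self.rollback()}",
        ])


if __name__ == "__main__":
    print(ScalePreflight("payment", "cms-demo", 3, 2).render())
```

Deriving both commands from the same record keeps the rollback target in step with the state that was actually observed before the change.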

How MCP Tools and Skills Work Together

MCP tools provide capabilities (what can be done), while skills provide standards (how it should be done). They work best in combination.

Using "scale down the payment service" as an example, the complete processing workflow is as follows:

Load skill: The digital employee recognizes this as a K8s change operation and automatically loads the kubernetes-ops-guardian skill.

Query current state via MCP: The digital employee calls get_deployments to retrieve the current state:

// Tool call: get_deployments
{
  "namespace": "cms-demo"
}
// Response:
{
  "success": true,
  "context": "minikube",
  "deployments": [
    {
      "name": "payment",
      "namespace": "cms-demo",
      "replicas": 3,
      "available": 3
    }
  ]
}

Risk assessment per skill protocol: The digital employee calls get_services to check associated services, calls get_pod_events to inspect recent anomalies, evaluates the blast radius, and generates a pre-flight change report.

Execute operation after user confirmation: Once confirmed, the digital employee calls the MCP write tool:

// Tool call: scale_deployment
{
  "name": "payment",
  "namespace": "cms-demo",
  "replicas": 2
}
// Response:
{
  "success": true,
  "context": "minikube",
  "message": "Deployment payment scaled to 2 replicas"
}

Important

Without the kubernetes-ops-guardian skill, the digital employee can still complete scale-down operations through MCP tools, but it skips risk assessment and pre-flight checks, executing immediately. This introduces safety risks in production environments.
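
The difference the skill makes can be pictured as a gate in front of each tool call. The following is a minimal sketch of that idea, not the platform's actual mechanism: the risk levels mirror the SKILL.md tables, while the `decide` function, its return values, and the choice to treat unknown tools as L3 are assumptions made for illustration.

```python
# Risk levels per tool, mirroring the SKILL.md tables (subset shown).
TOOL_LEVELS = {
    "get_pods": 0,
    "get_deployments": 0,
    "get_services": 0,
    "get_pod_events": 0,
    "scale_deployment": 2,
    "restart_deployment": 2,
    "kubectl_rollout": 2,
    "kubectl_apply": 2,
    "delete_resource": 3,
    "kubectl_generic": 3,
}


def decide(tool: str, confirmed: bool = False) -> str:
    """Return the action the guardian protocol would take for a tool call."""
    # Assumption: a tool missing from the table is treated as most restrictive.
    level = TOOL_LEVELS.get(tool, 3)
    if level == 0:
        return "execute"  # read-only tools are called directly
    if level == 3:
        return "refuse_and_suggest_alternative"  # destructive tools are never run directly
    # L1-L2: a pre-flight report comes first; execution needs user confirmation.
    return "execute" if confirmed else "preflight_report"


if __name__ == "__main__":
    for tool, confirmed in [
        ("get_deployments", False),
        ("scale_deployment", False),
        ("scale_deployment", True),
        ("delete_resource", True),
    ]:
        print(tool, confirmed, "->", decide(tool, confirmed))
```

Without such a gate, every branch collapses into "execute", which is the behavior the note above warns about.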

Cold Start and Iterative Optimization

Cold Start Tips

When configuring a digital employee for the first time, do not try to write perfect rules all at once. Follow this gradual approach instead:

Start with default rules: Use the K8s inspection rule template from this tutorial as your starting point and copy it directly.
Run a test round: Open an AI session and ask a few typical questions. Observe the quality of the AI's responses.
Record gaps: Note any cases where the AI ignores certain data sources (for example, if it consistently overlooks GC logs).
Refine the rules: Add the missing data sources and any additional analysis logic to the default rules.
Add skills as needed: Once a specific scenario requires a complex, specialized workflow, configure a dedicated skill for it.

Iterative Optimization Guide

Observed Problem	Adjustment Direction	Action
AI responses are generic and lack depth	Enrich the default rules	Add more specific analysis logic constraints and output format requirements
AI performed an operation it should not have	Strengthen constraints	Add explicit negative constraints to the default rules
AI cannot access a certain type of data	Integrate MCP tools	Add the corresponding MCP service
AI workflow becomes disorganized in a specific scenario	Configure a dedicated skill	Write a purpose-built skill for that scenario
AI queries fail due to insufficient permissions	Adjust the RAM role	Add the required permission policies to the RAM role