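Add custom instrumentation to traces with loongsuite-otel-util-genai and the OpenTelemetry SDK - Cloud Monitor

After you connect an application to ARMS Application Monitoring, the ARMS probe automatically instruments common AI frameworks, so trace data is collected without any code changes. If you also want traces to show how your business methods execute, you can add loongsuite-otel-util-genai and the OpenTelemetry SDK to your project and instrument your business code yourself. This topic describes how to use loongsuite-otel-util-genai and the OpenTelemetry Python SDK to create custom spans and set custom attributes.

For the AI components and frameworks supported by the ARMS probe, see:

Prerequisites

The application is connected to ARMS Application Monitoring.

Install dependencies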

pip install -U "loongsuite-otel-util-genai>=0.6.1"

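After installation, the opentelemetry.util.genai package and extension interfaces such as ExtendedTelemetryHandler are available. Version 0.6.1 and later automatically propagate the agent name to downstream GenAI child spans for the lifetime of an Agent span. For more information, see the loongsuite-otel-util-genai documentation.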
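
Because the agent-name propagation depends on the installed version, it can help to check the version at startup. The following is a minimal sketch that uses only the Python standard library. The helper names (parse_version, installed_version, supports_agent_name_propagation) are hypothetical and not part of loongsuite-otel-util-genai, and the parser assumes plain dotted numeric versions such as 0.6.1.

```python
from importlib import metadata
from typing import Optional

# Minimum version named in this topic for agent-name propagation.
MINIMUM = (0, 6, 1)


def parse_version(text: str) -> tuple:
    """Turn '0.6.1' into (0, 6, 1); stop at the first non-numeric segment."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def installed_version(dist_name: str = "loongsuite-otel-util-genai") -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def supports_agent_name_propagation(version_text: Optional[str]) -> bool:
    """True when a version is present and is at least MINIMUM."""
    return version_text is not None and parse_version(version_text) >= MINIMUM


if __name__ == "__main__":
    found = installed_version()
    print(found, supports_agent_name_propagation(found))
```

A pre-release string such as 0.6.1rc1 is treated conservatively as older than 0.6.1 by this simple parser.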

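Use loongsuite-otel-util-genai and the OpenTelemetry SDK

With loongsuite-otel-util-genai and the OpenTelemetry SDK, you can mainly do the following:

Create spans with GenAI semantics (Entry, Agent, Tool, ReAct Step, and so on).
Create custom spans through OpenTelemetry SDK instrumentation.
Add custom attributes to spans.
Get the current trace context and print the traceId.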

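Terms

Span: a single operation within a request, such as one LLM call or one tool execution.
SpanContext: the tracing context of a request, which contains the traceId, spanId, and related information.
Attribute: an additional field on a span that records key information such as the model name or token usage.
Handler: the ExtendedTelemetryHandler provided by loongsuite-otel-util-genai, used to create spans that follow the GenAI semantic conventions.

The following table lists all span types supported by loongsuite-otel-util-genai. This topic focuses on Entry, Agent, Tool, and ReAct Step. For the other types (Embedding, Retrieval, Rerank, Memory, and so on), see the full loongsuite-otel-util-genai documentation.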

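Span type	Operation name	Description
Entry	`enter`	Application entry point; carries session_id, user_id, and the full interaction content of the application
Agent	`invoke_agent {name}`	Agent invocation; aggregates token usage
Tool	`execute_tool {name}`	Tool or function execution
Step	`react`	Marks a single ReAct iteration
LLM	`chat {model}`	LLM chat (usually collected automatically by the probe)
Embedding	`embeddings {model}`	Vector embedding
Retriever	`retrieval {data_source}`	Retrieval (RAG)
Reranker	`rerank {model}`	Reranking
Memory	`memory {operation}`	Memory read and write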

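The following steps show how to instrument each span type, each with a standalone code snippet. For complete, runnable sample code, see the appendix at the end of this topic.

Important

Always obtain the Handler instance through get_extended_telemetry_handler() rather than instantiating TelemetryHandler directly. The ARMS probe is adapted only for get_extended_telemetry_handler(); instantiating TelemetryHandler directly may cause environment variable compatibility issues.

Important

When you add custom instrumentation, follow the semantic conventions in the LLM Trace field definitions. AI application observability features (token statistics, session analysis, and so on) are adapted and rendered based on the definitions in that specification. If span attributes do not comply, the related data may not be displayed correctly in the console.

1. Get the Handler and Tracer

Call get_extended_telemetry_handler() to get the singleton Handler of loongsuite-otel-util-genai, and call get_tracer(__name__) to get the OpenTelemetry SDK Tracer. The Handler creates GenAI semantic spans, and the Tracer creates custom business spans.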

from opentelemetry.util.genai.extended_handler import get_extended_telemetry_handler
from opentelemetry.util.genai.extended_types import (
    ExecuteToolInvocation,
    InvokeAgentInvocation,
)
from opentelemetry.util.genai._extended_common import EntryInvocation, ReactStepInvocation
from opentelemetry.util.genai.types import Error, InputMessage, OutputMessage, Text
from opentelemetry.trace import get_tracer
handler = get_extended_telemetry_handler()
tracer = get_tracer(__name__)

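The Handler can be used in two ways:

Context manager (with handler.entry(inv) and similar): the recommended way; the span lifecycle is managed automatically.
start/stop/fail API (handler.start_entry(inv) / handler.stop_entry(inv) / handler.fail_entry(inv, error)): for asynchronous, callback, or streaming scenarios where a with statement cannot be used.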
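
To make the relationship between the two styles concrete, the sketch below uses a hypothetical stand-in class, RecordingHandler, that only records lifecycle calls. It is not the library's ExtendedTelemetryHandler and makes no claim about its internals; it only illustrates that a context manager amounts to start on entry followed by either stop or fail on exit.

```python
from contextlib import contextmanager


class RecordingHandler:
    """Hypothetical stand-in that records which lifecycle calls happen."""

    def __init__(self) -> None:
        self.calls = []

    def start(self, name: str) -> None:
        self.calls.append(f"start:{name}")

    def stop(self, name: str) -> None:
        self.calls.append(f"stop:{name}")

    def fail(self, name: str, error: Exception) -> None:
        self.calls.append(f"fail:{name}:{type(error).__name__}")

    @contextmanager
    def span(self, name: str):
        """Context-manager form: start on entry, then stop or fail on exit."""
        self.start(name)
        try:
            yield name
        except Exception as exc:
            self.fail(name, exc)
            raise
        else:
            self.stop(name)


def run_ok(handler: RecordingHandler) -> str:
    with handler.span("entry"):
        return "done"


def run_failing(handler: RecordingHandler) -> None:
    try:
        with handler.span("entry"):
            raise ValueError("boom")
    except ValueError:
        pass  # the failure was already recorded by the context manager
```

With the explicit start/stop/fail style, the caller is responsible for making sure exactly one of stop or fail runs for every start, which is why it is reserved for cases where a with block cannot wrap the work.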

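2. Create an Entry span

Create an Entry span at the request entry point. It carries session_id and user_id and records the user input in input_messages. After the streaming response completes, join the output chunks, assign the result to output_messages, and then call stop_entry to end the span. The console then shows the full input and the final output of the request.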

entry_inv = EntryInvocation(
    session_id=req.session_id or str(uuid.uuid4()),
    user_id=req.user_id or "anonymous",
    input_messages=[
        InputMessage(role="user", parts=[Text(content=req.topic)]),
    ],
)
def event_generator():
    handler.start_entry(entry_inv)
    output_chunks: list[str] = []
    try:
        for chunk in run_agent_stream(topic=req.topic):
            output_chunks.append(chunk)
            yield f"data: {json.dumps({'content': chunk}, ensure_ascii=False)}\n\n"
        yield "data: [DONE]\n\n"
    except Exception as exc:
        handler.fail_entry(entry_inv, Error(message=str(exc), type=type(exc)))
        yield f"data: {json.dumps({'error': str(exc)}, ensure_ascii=False)}\n\n"
        return
    entry_inv.output_messages = [
        OutputMessage(
            role="assistant",
            parts=[Text(content="".join(output_chunks))],
            finish_reason="stop",
        ),
    ]
    handler.stop_entry(entry_inv)
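
For reference, a client consuming this stream sees lines of the form data: {...} and a final data: [DONE] sentinel. The following standard-library sketch shows one way to reassemble the text on the client side. The function name collect_sse_content is hypothetical, and the sketch assumes each event carries either a content or an error field as in the generator above.

```python
import json
from typing import Iterable


def collect_sse_content(lines: Iterable[str]) -> str:
    """Join the 'content' fields of SSE data lines until the [DONE] sentinel."""
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data: "):
            continue  # skip blank separators and anything that is not a data line
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        event = json.loads(payload)
        if "error" in event:
            # The server reports failures as a data event with an 'error' field.
            raise RuntimeError(event["error"])
        chunks.append(event.get("content", ""))
    return "".join(chunks)


sample_stream = [
    'data: {"content": "Hello"}',
    "",
    'data: {"content": ", world"}',
    "",
    "data: [DONE]",
    "",
]
assembled = collect_sse_content(sample_stream)
```

The joined string corresponds to what the server writes to output_messages before it ends the Entry span.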

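3. Create an Agent span

Call start_invoke_agent to create an Agent span that records the agent name, model, and description. The Agent span is the root GenAI span of the call chain; all subsequent ReAct Step, LLM, and Tool spans are its children.

Starting with loongsuite-otel-util-genai 0.6.1, when you call start_invoke_agent with InvokeAgentInvocation.agent_name set, the utility automatically writes gen_ai.agent.name to OpenTelemetry Baggage. For the lifetime of that Agent span, GenAI child spans created by the LoongSuite GenAI probe or by this utility (LLM, Tool, Embedding, Retrieval, Rerank, Memory, ReAct Step, and so on) inherit the attribute automatically, so you do not need to set the agent name on each child span. Ordinary custom business spans created through the OpenTelemetry SDK do not receive this GenAI attribute automatically; to record business attributes on them, keep using span.set_attribute().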

invocation = InvokeAgentInvocation(
    provider="dashscope",
    agent_name="TechContentAgent",
    agent_description="技術內容產生助手",
    request_model="qwen-plus",
)
total_input_tokens = 0
total_output_tokens = 0
handler.start_invoke_agent(invocation)
try:
    # ... core agent logic (ReAct loop) ...
    invocation.input_tokens = total_input_tokens
    invocation.output_tokens = total_output_tokens
    handler.stop_invoke_agent(invocation)
except Exception:
    handler.fail_invoke_agent(invocation, Error(message="agent failed", type=RuntimeError))
    raise

After the agent finishes, write the accumulated total_input_tokens and total_output_tokens to the Agent span so that token usage is aggregated at the agent level.
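
The accumulation itself is plain arithmetic over the usage reported by each LLM response. In the sketch below, Usage is a hypothetical stand-in for a response's usage object, and the per-round numbers are invented so that they add up to the totals shown later in this topic (3982 input, 884 output, 4866 total).

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Usage:
    """Stand-in for the usage object of one chat completion response."""
    prompt_tokens: int
    completion_tokens: int


def aggregate_usage(per_round: Iterable[Optional[Usage]]) -> dict:
    """Sum input and output tokens across rounds, skipping rounds without usage."""
    total_in = 0
    total_out = 0
    for usage in per_round:
        if usage is None:  # some responses may not report usage
            continue
        total_in += usage.prompt_tokens
        total_out += usage.completion_tokens
    return {
        "input_tokens": total_in,
        "output_tokens": total_out,
        "total_tokens": total_in + total_out,
    }


# Invented per-round split; only the sums match the figures in this topic.
rounds = [Usage(1200, 310), None, Usage(2782, 574)]
totals = aggregate_usage(rounds)
```

The total is computed here as the sum of the two counters purely for illustration; the snippet above only sets input_tokens and output_tokens on the invocation.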

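4. Create a ReAct Step span

Create a Step span for each ReAct reasoning iteration and pass in the current round number as round. When the iteration ends, set finish_reason: continue if another iteration is needed, or stop for the final answer. In the example, the LLM call in each iteration is instrumented automatically by the ARMS probe, so no manual span is needed for it.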

step_inv = ReactStepInvocation(round=iteration + 1)
handler.start_react_step(step_inv)
try:
    response = client.chat.completions.create(
        model="qwen-plus",
        messages=messages,
        tools=TOOL_DEFINITIONS,
    )
    # ... handle the response ...
    step_inv.finish_reason = "stop"  # or "continue"
    handler.stop_react_step(step_inv)
except Exception:
    handler.fail_react_step(step_inv, Error(message="step failed", type=RuntimeError))
    raise
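
The choice of finish_reason can be factored into a small helper. The sketch below is one possible way to express it; the function name react_finish_reason is hypothetical, and the loop_detected value is the one the complete sample in this topic sets when response loop detection fires.

```python
def react_finish_reason(has_tool_calls: bool, loop_detected: bool = False) -> str:
    """Pick the finish_reason for one ReAct round.

    loop_detected wins over everything else, because the agent stops early.
    Otherwise a round that requested tools continues, and a round without
    tool calls is the final answer.
    """
    if loop_detected:
        return "loop_detected"
    return "continue" if has_tool_calls else "stop"
```

Keeping the decision in one place makes it easier to guarantee that every started Step span is given a finish_reason before it is stopped.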

5. Create a Tool span

When the model returns tool calls, create a Tool span for each tool_call to record the tool name, call ID, arguments, and result.

tool_inv = ExecuteToolInvocation(
    tool_name=tool_call.function.name,
    tool_call_id=tool_call.id,
    tool_call_arguments=tool_call.function.arguments,
    tool_type="function",
)
handler.start_execute_tool(tool_inv)
try:
    result = dispatch_tool(tool_call.function.name, tool_call.function.arguments)
    tool_inv.tool_call_result = result
except Exception as exc:
    handler.fail_execute_tool(tool_inv, error=Error(message=str(exc), type=type(exc)))
    raise
else:
    handler.stop_execute_tool(tool_inv)

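6. Create custom spans with the OpenTelemetry SDK

In addition to the GenAI semantic spans provided by loongsuite-otel-util-genai, you can create custom business spans with the OpenTelemetry SDK's tracer.start_as_current_span() and mix them with GenAI spans.

The following examples show two typical uses of custom spans:

`duplicate_tool_detection`: duplicate tool call detection

Runs before each ReAct iteration. It uses a Counter to count calls per tool and writes the result to gen_ai.loop_detection.* attributes. If duplicates are found, it appends a system hint to the message list to steer the model away from repeating the calls.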

def _check_duplicate_tools(
    tool_usage_counter: Counter,
    messages: list[dict[str, Any]],
) -> None:
    duplicates = [name for name, count in tool_usage_counter.items() if count > 1]
    has_duplicates = len(duplicates) > 0
    with tracer.start_as_current_span("duplicate_tool_detection") as span:
        span.set_attributes({
            "gen_ai.loop_detection.detected": has_duplicates,
            "gen_ai.loop_detection.duplicate_tools": str(duplicates) if has_duplicates else "[]",
            "gen_ai.loop_detection.total_calls": sum(tool_usage_counter.values()),
            "gen_ai.loop_detection.unique_tools": len(tool_usage_counter),
        })
    if has_duplicates:
        details = ", ".join(f"{n}({tool_usage_counter[n]}次)" for n in duplicates)
        messages.append({
            "role": "system",
            "content": f"[系統提示] 檢測到工具被重複調用：{details}。請避免重複調用。",
        })
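
As a worked example of the counting step on its own, the sketch below derives the same four values from a Counter without any tracing. The function name summarize_tool_usage and the shortened dictionary keys are illustrative; the tool names come from the complete sample in this topic.

```python
from collections import Counter


def summarize_tool_usage(counter: Counter) -> dict:
    """Derive the duplicate-detection values from per-tool call counts."""
    duplicates = [name for name, count in counter.items() if count > 1]
    return {
        "detected": bool(duplicates),
        "duplicate_tools": duplicates,
        "total_calls": sum(counter.values()),
        "unique_tools": len(counter),
    }


usage = Counter()
for name in [
    "search_product_knowledge",
    "get_audience_profile",
    "search_product_knowledge",  # second call to the same tool
]:
    usage[name] += 1

summary = summarize_tool_usage(usage)
```

Here three calls hit two distinct tools, so one tool is reported as duplicated.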

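`response_loop_detection`: LLM response loop detection

Runs after each LLM response. It compares the current response with the previous one for text similarity and writes metrics such as is_loop and overlap_ratio to span attributes. If a loop is detected (the texts are identical or the overlap ratio exceeds 80%), finish_reason is set to loop_detected and the agent terminates early.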

def _check_response_loop(
    current_content: str | None,
    previous_content: str | None,
) -> bool:
    cur = (current_content or "").strip()
    prev = (previous_content or "").strip()
    with tracer.start_as_current_span("response_loop_detection") as span:
        if not prev or not cur:
            span.set_attributes({
                "gen_ai.loop_detection.is_loop": False,
                "gen_ai.loop_detection.reason": "no_text_content",
            })
            return False
        is_identical = cur == prev
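        longer = max(len(cur), len(prev))
        # Count only the leading characters that match; stop at the first difference.
        common_prefix_len = 0
        for a, b in zip(cur, prev):
            if a != b:
                break
            common_prefix_len += 1
        overlap_ratio = common_prefix_len / longer if longer > 0 else 0.0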
        is_loop = is_identical or overlap_ratio > 0.8
        span.set_attributes({
            "gen_ai.loop_detection.is_loop": is_loop,
            "gen_ai.loop_detection.is_identical": is_identical,
            "gen_ai.loop_detection.overlap_ratio": round(overlap_ratio, 2),
            "gen_ai.loop_detection.current_length": len(cur),
            "gen_ai.loop_detection.previous_length": len(prev),
        })
        return is_loop
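
To see how a common-prefix overlap ratio behaves, the following standalone sketch computes it for a few short strings. The function name common_prefix_ratio is hypothetical, and the sample strings are made up for illustration.

```python
def common_prefix_ratio(current: str, previous: str) -> float:
    """Length of the shared prefix divided by the length of the longer text."""
    cur, prev = current.strip(), previous.strip()
    longer = max(len(cur), len(prev))
    if longer == 0:
        return 0.0
    prefix_len = 0
    for a, b in zip(cur, prev):
        if a != b:
            break  # a prefix ends at the first differing character
        prefix_len += 1
    return prefix_len / longer


# Same first nine characters, different last one: ratio 0.9, above the 0.8 threshold.
late_difference = common_prefix_ratio("abcdefghij", "abcdefghiX")

# Different first character: the shared prefix is empty, so the ratio is 0.0.
early_difference = common_prefix_ratio("abcdefghij", "Xbcdefghij")
```

A prefix-based measure is cheap but strict: two responses that differ only in their opening words score low even if the rest is identical, so the 80% threshold mainly catches responses that repeat from the start.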

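Note

Because custom spans are not part of the LLM semantic conventions, switch to the All view in the console's trace view to see them.

View monitoring details

Log on to the CloudMonitor 2.0 console, select the target workspace, and choose All Features > AI Application Observability in the left-side navigation pane.
On the AI application list page, find the connected application and click its name to view detailed monitoring data.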

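Instrumentation results

1. Entry span details

The Entry span shows key attributes such as gen_ai.session.id (the unique session identifier) and gen_ai.user.id (the user identifier). Set at the function entry point, they are automatically propagated to downstream child spans such as LLM and Tool spans, and can be used to correlate sessions and users during analysis. The Entry span also carries gen_ai.input.messages (the full user input) and gen_ai.output.messages (the final model output), so the details panel shows the whole interaction for the request.

2. Agent span details

The Agent span shows the agent's defined name and description, along with the agent-level token usage totals computed in the sample code above.

In the Agent span details panel, the key attributes include gen_ai.agent.name (TechContentAgent), gen_ai.provider.name (dashscope), and gen_ai.request.model (qwen-plus), plus the agent-level token totals: gen_ai.usage.input_tokens (3982), gen_ai.usage.output_tokens (884), and gen_ai.usage.total_tokens (4866).

3. Tool span details

The Tool span shows the tool name, its input arguments, and the call result. After you select execute_tool generate_seo_keywords, the details panel on the right shows the tool name (Tool: $generate_seo_keywords), the tool type (Tool Type: function), the call arguments (Tool Call Arguments, for example {"topic": "CMS 2.0 AI 警示降噪"}), and the result (Tool Result), which helps you check whether the tool's input and output are as expected.

4. LLM span details

The sample code above does not instrument the LLM span manually. Because the call goes through the openai client, it is collected entirely by the probe, and you can see the full context and token consumption of the LLM call.

The LLM span details panel shows basic information such as the application name, interface name, IP address, start and end times, spanId, parentSpanId, and status code. The Additional Information tab shows the response output of the call, and the Output Messages tab shows the assistant's tool_call parameters (such as the tool name and request body).

5. Custom span details

The sample code creates two custom business spans with the OpenTelemetry SDK to show how custom instrumentation mixes with GenAI semantic spans. Because these spans are outside the LLM semantic conventions, enable the All view to see them.

duplicate_tool_detection: runs before each ReAct iteration to detect whether the agent is stuck calling the same tools repeatedly. Its attributes record whether duplicates were detected, the list of duplicated tools, the total number of calls, and the number of distinct tools, which helps you quickly locate tool-call loops in ARMS.
On the ARMS trace details page, click the duplicate_tool_detection span to view the loop detection attributes in the details panel on the right. The attribute keys are gen_ai.loop_detection.detected, gen_ai.loop_detection.duplicate_tools, gen_ai.loop_detection.total_calls, gen_ai.loop_detection.unique_tools, and gen_ai.loop_detection.details. When all 4 tool calls are distinct, detected is false and duplicate_tools is the empty array [].
response_loop_detection: runs after each LLM response to detect whether the model keeps returning highly similar content. Its attributes record whether a loop was detected, whether the texts are identical, the overlap ratio, and the text lengths of the current and previous responses, which helps you troubleshoot cases where the model is stuck repeating its output.
In the waterfall view of the trace details page, select the response_loop_detection span to view its attributes in the Attributes panel on the right. For example, gen_ai.loop_detection.is_loop=false means that no loop was detected, and gen_ai.loop_detection.reason=no_text_content means that there was no text content to compare.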

Appendix

Complete sample code

app.py

import json
import uuid
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from opentelemetry.util.genai.extended_handler import get_extended_telemetry_handler
from opentelemetry.util.genai._extended_common import EntryInvocation
from opentelemetry.util.genai.types import Error, InputMessage, OutputMessage, Text
from agent import run_marketing_agent_stream
app = FastAPI(title="雲產品技術內容產生助手")
class GenerateRequest(BaseModel):
    content_type: str = "blog"
    product: str = "CMS"
    target_audience: str = "營運工程師"
    topic: str = ""
    session_id: str = ""
    user_id: str = ""
@app.post("/api/v1/generate/stream")
async def generate_stream(req: GenerateRequest) -> StreamingResponse:
    handler = get_extended_telemetry_handler()
    user_prompt = (
        f"內容類型: {req.content_type}, 產品: {req.product}, "
        f"目標受眾: {req.target_audience}, 主題: {req.topic}"
    )
    entry_inv = EntryInvocation(
        session_id=req.session_id or str(uuid.uuid4()),
        user_id=req.user_id or "anonymous",
        input_messages=[
            InputMessage(role="user", parts=[Text(content=user_prompt)]),
        ],
    )
    def event_generator():
        handler.start_entry(entry_inv)
        output_chunks: list[str] = []
        try:
            for chunk in run_marketing_agent_stream(
                content_type=req.content_type,
                product=req.product,
                target_audience=req.target_audience,
                topic=req.topic,
            ):
                output_chunks.append(chunk)
                yield f"data: {json.dumps({'content': chunk}, ensure_ascii=False)}\n\n"
            yield "data: [DONE]\n\n"
        except Exception as exc:
            handler.fail_entry(
                entry_inv,
                Error(message=str(exc), type=type(exc)),
            )
            yield f"data: {json.dumps({'error': str(exc)}, ensure_ascii=False)}\n\n"
            return
        entry_inv.output_messages = [
            OutputMessage(
                role="assistant",
                parts=[Text(content="".join(output_chunks))],
                finish_reason="stop",
            ),
        ]
        handler.stop_entry(entry_inv)
    return StreamingResponse(event_generator(), media_type="text/event-stream")
@app.get("/health")
async def health():
    return {"status": "ok"}
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

agent.py

import os
from collections import Counter
from collections.abc import Generator
from typing import Any
from openai import OpenAI
from opentelemetry.trace import get_tracer
from opentelemetry.util.genai.extended_handler import get_extended_telemetry_handler
from opentelemetry.util.genai.extended_types import (
    ExecuteToolInvocation,
    InvokeAgentInvocation,
)
from opentelemetry.util.genai._extended_common import ReactStepInvocation
from opentelemetry.util.genai.types import Error
from tools import TOOL_DEFINITIONS, dispatch_tool
tracer = get_tracer(__name__)
MODEL_NAME = os.environ.get("MODEL_NAME", "qwen-plus")
BASE_URL = os.environ.get(
    "OPENAI_BASE_URL",
    "https://dashscope.aliyuncs.com/compatible-mode/v1",
)
API_KEY = os.environ.get("DASHSCOPE_API_KEY", "")
MAX_ITERATIONS = 10
SYSTEM_PROMPT = """\
你是阿里雲CloudMonitor 2.0（CMS 2.0）的技術內容產生助手。\
面向營運工程師和架構師，用其熟悉的專業語言產生高價值技術內容。
關鍵原則：根據目標受眾調整內容的視角和語言風格——
- 營運工程師：聚焦實操步驟、排障效率、工具整合，用一線營運的日常術語
- 架構師：聚焦架構設計、標準化、可擴充性，用技術深度的專業表達
你必須嚴格按以下步驟執行，每一步都要調用對應的工具：
第一步：使用 search_product_knowledge 工具搜尋 CMS 產品資訊（features 或 comparison）
第二步：使用 get_audience_profile 工具擷取目標受眾的畫像和痛點
第三步：使用 get_industry_cases 工具尋找相關行業案例
第四步：如果是部落格文章，使用 generate_seo_keywords 工具擷取 SEO 關鍵詞
第五步：根據收集到的資訊產生內容
第六步：使用 check_content_compliance 工具檢查合規性
內容要求：圍繞產品優勢和受眾痛點，引用案例資料，中文撰寫，800 字以內。"""
def _build_client() -> OpenAI:
    return OpenAI(base_url=BASE_URL, api_key=API_KEY)
def _build_user_message(
    content_type: str,
    product: str,
    target_audience: str,
    topic: str,
) -> str:
    type_labels = {
        "blog": "面向一線技術人員的實戰技術部落格",
        "email": "精準觸達目標角色的技術推薦郵件",
        "case_study": "可落地參考的客戶實踐案例",
        "comparison": "輔助技術選型的產品對比分析",
    }
    label = type_labels.get(content_type, content_type)
    return (
        f"請為 {product} 產品產生一篇{label}。\n"
        f"目標受眾：{target_audience}\n"
        f"主題/方向：{topic}\n\n"
        f"請用目標受眾日常工作中熟悉的語言和視角來撰寫，"
        f"嚴格按照步驟調用工具收集資訊後再產生內容。"
    )
def _check_duplicate_tools(
    tool_usage_counter: Counter,
    messages: list[dict[str, Any]],
) -> list[str]:
    duplicates = [name for name, count in tool_usage_counter.items() if count > 1]
    total_calls = sum(tool_usage_counter.values())
    has_duplicates = len(duplicates) > 0
    duplicate_details = ", ".join(
        f"{name}({tool_usage_counter[name]}次)" for name in duplicates
    ) if has_duplicates else "none"
    with tracer.start_as_current_span("duplicate_tool_detection") as detect_span:
        detect_span.set_attributes({
            "gen_ai.loop_detection.detected": has_duplicates,
            "gen_ai.loop_detection.duplicate_tools": str(duplicates) if has_duplicates else "[]",
            "gen_ai.loop_detection.details": duplicate_details,
            "gen_ai.loop_detection.total_calls": total_calls,
            "gen_ai.loop_detection.unique_tools": len(tool_usage_counter),
        })
    if not has_duplicates:
        return []
    hint_message = (
        f"[系統提示] 檢測到以下工具被重複調用：{duplicate_details}。"
        f"請避免重複調用相同的工具，直接使用已擷取的資訊繼續執行後續步驟。"
    )
    messages.append({"role": "system", "content": hint_message})
    return duplicates
def _check_response_loop(
    current_content: str | None,
    previous_content: str | None,
) -> bool:
    """Compare consecutive LLM text responses to detect stuck loops."""
    cur = (current_content or "").strip()
    prev = (previous_content or "").strip()
    with tracer.start_as_current_span("response_loop_detection") as span:
        if not prev or not cur:
            span.set_attributes({
                "gen_ai.loop_detection.is_loop": False,
                "gen_ai.loop_detection.reason": "no_text_content",
            })
            return False
        is_identical = cur == prev
        common_prefix_len = 0
        for a, b in zip(cur, prev):
            if a == b:
                common_prefix_len += 1
            else:
                break
        longer = max(len(cur), len(prev))
        overlap_ratio = common_prefix_len / longer if longer > 0 else 0.0
        is_loop = is_identical or overlap_ratio > 0.8
        span.set_attributes({
            "gen_ai.loop_detection.is_loop": is_loop,
            "gen_ai.loop_detection.is_identical": is_identical,
            "gen_ai.loop_detection.overlap_ratio": round(overlap_ratio, 2),
            "gen_ai.loop_detection.current_length": len(cur),
            "gen_ai.loop_detection.previous_length": len(prev),
        })
        return is_loop
def run_marketing_agent_stream(
    content_type: str,
    product: str,
    target_audience: str,
    topic: str,
) -> Generator[str, None, None]:
    client = _build_client()
    handler = get_extended_telemetry_handler()
    user_message = _build_user_message(content_type, product, target_audience, topic)
    invocation = InvokeAgentInvocation(
        provider="dashscope",
        agent_name="TechContentAgent",
        agent_description="面向不同技術角色的雲產品內容產生助手",
        request_model=MODEL_NAME,
    )
    total_input_tokens = 0
    total_output_tokens = 0
    tool_usage_counter: Counter = Counter()
    previous_content: str | None = None
    handler.start_invoke_agent(invocation)
    try:
        messages: list[dict[str, Any]] = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ]
        for iteration in range(MAX_ITERATIONS):
            _check_duplicate_tools(tool_usage_counter, messages)
            step_inv = ReactStepInvocation(round=iteration + 1)
            handler.start_react_step(step_inv)
            try:
                response = client.chat.completions.create(
                    model=MODEL_NAME,
                    messages=messages,
                    tools=TOOL_DEFINITIONS,
                    temperature=0.7,
                )
                choice = response.choices[0]
                message = choice.message
                if response.usage:
                    total_input_tokens += response.usage.prompt_tokens
                    total_output_tokens += response.usage.completion_tokens
                current_content = message.content
                if _check_response_loop(current_content, previous_content):
                    step_inv.finish_reason = "loop_detected"
                    handler.stop_react_step(step_inv)
                    if current_content:
                        yield current_content
                    break
                if (current_content or "").strip():
                    previous_content = current_content
                if message.tool_calls:
                    messages.append(message.model_dump())
                    for tool_call in message.tool_calls:
                        tool_name = tool_call.function.name
                        tool_args = tool_call.function.arguments
                        tool_usage_counter[tool_name] += 1
                        tool_inv = ExecuteToolInvocation(
                            tool_name=tool_name,
                            tool_call_id=tool_call.id,
                            tool_call_arguments=tool_args,
                            tool_type="function",
                        )
                        handler.start_execute_tool(tool_inv)
                        try:
                            result = dispatch_tool(tool_name, tool_args)
                            tool_inv.tool_call_result = result
                        except Exception as exc:
                            handler.fail_execute_tool(
                                tool_inv,
                                error=Error(message=str(exc), type=type(exc)),
                            )
                            raise
                        else:
                            handler.stop_execute_tool(tool_inv)
                        messages.append({
                            "role": "tool",
                            "tool_call_id": tool_call.id,
                            "content": result,
                        })
                    step_inv.finish_reason = "continue"
                    handler.stop_react_step(step_inv)
                    continue
                # No tool calls in this round: treat it as the final answer so the
                # step span is always closed before leaving the loop.
                if message.content:
                    yield message.content
                step_inv.finish_reason = "stop"
                handler.stop_react_step(step_inv)
                break
            except Exception:
                handler.fail_react_step(
                    step_inv, Error(message="step failed", type=RuntimeError)
                )
                raise
        invocation.input_tokens = total_input_tokens
        invocation.output_tokens = total_output_tokens
        handler.stop_invoke_agent(invocation)
    except Exception:
        handler.fail_invoke_agent(
            invocation, Error(message="agent failed", type=RuntimeError)
        )
        raise

tools.py

import json
from typing import Any
PRODUCT_KNOWLEDGE: dict[str, dict[str, str]] = {
    "CMS": {
        "features": (
            "CloudMonitor 2.0（CMS 2.0）是阿里雲一站式可觀測平台，"
            "融合 SLS + CMS + ARMS 三大產品能力：\n"
            "1. 全棧統一監控：指標、鏈路、日誌、事件統一視圖\n"
            "2. UModel 統一建模：資源自動關聯與觀測圖譜構建\n"
            "3. AI 智能分析：異常檢測、警示降噪、對話式營運 Copilot\n"
            "4. 開放相容：支援 Prometheus、Grafana、OpenTelemetry 生態\n"
            "5. AI 應用可觀測：LLM 調用鏈追蹤、Token 統計、模型效能分析"
        ),
        "comparison": (
            "CloudMonitor 2.0 vs 傳統監控方案：\n"
            "1. 資料融合：傳統方案需在 3-5 個控制台間切換；CMS 2.0 一站式融合\n"
            "2. AI 能力：傳統靜態閾值警示誤判率 30%+；CMS 2.0 AI 降噪 80%\n"
            "3. 觀測圖譜：CMS 2.0 通過 UModel 自動構建依賴圖譜\n"
            "4. AI 應用可觀測：傳統方案不支援；CMS 2.0 原生支援 LLM/Agent 全鏈路"
        ),
    },
}
AUDIENCE_PROFILES: dict[str, dict[str, str]] = {
    "營運工程師": {
        "role": "營運工程師 / SRE",
        "pain_points": (
            "1. 故障排查耗時間長度：微服務架構下定位問題平均 30-60 分鐘\n"
            "2. 警示風暴：大促期間警示激增，難以區分優先順序\n"
            "3. 工具片段化：需在 5-6 個監控工具間切換\n"
            "4. AI 營運盲區：大模型調用鏈路不透明"
        ),
        "interests": "全鏈路追蹤、根因分析、警示降噪、Prometheus/Grafana 整合",
        "decision_factors": "技術成熟度等級、社區活躍度、學習成本、整合難度",
    },
    "架構師": {
        "role": "架構師 / 技術專家",
        "pain_points": (
            "1. 微服務 + AI Agent 混合架構的可觀測性挑戰\n"
            "2. 開源自建 vs 商業方案選型缺乏客觀對比\n"
            "3. 各團隊監控方案不統一，資料格式片段化\n"
            "4. 現有方案能否支撐業務 10 倍增長"
        ),
        "interests": "架構設計、OpenTelemetry 標準化、資料模型統一、可擴充性",
        "decision_factors": "架構先進性、標準化程度、可擴充性、開放性、社區生態",
    },
}
INDUSTRY_CASES: dict[str, list[dict[str, str]]] = {
    "金融": [
        {
            "company": "某頭部股份制銀行",
            "scenario": (
                "核心交易系統可觀測升級：覆蓋 200+ 微服務，"
                "日均處理 5000 萬筆交易的全鏈路追蹤"
            ),
            "results": (
                "故障 MTTR 從 45 分鐘降至 8 分鐘，降幅 82%；"
                "警示準確率從 60% 提升至 95%；"
                "營運人效提升 3 倍，等保三級合規檢查一次通過"
            ),
        },
    ],
    "互連網": [
        {
            "company": "某社交平台",
            "scenario": (
                "千萬 DAU 應用的全棧可觀測：覆蓋 App 端體驗監控 → "
                "CDN → API Gateway → 2000+ 微服務 → 資料庫/緩衝"
            ),
            "results": (
                "使用者側 Crash 率從 0.5% 降至 0.08%；"
                "API P99 延遲最佳化 40%；"
                "每月節省 10 萬元+ 監控成本（相比自建方案）"
            ),
        },
    ],
}
COMPLIANCE_RULES: dict[str, dict[str, Any]] = {
    "product_names": {
        "incorrect": {
            "Aliyun": "阿里雲",
            "CMS2.0": "CMS 2.0",
            "CloudMonitor2.0": "CloudMonitor 2.0",
        },
    },
    "claim_rules": [
        "資料引用必須標註來源",
        "避免絕對化用語（如'最好的''唯一的''第一'）",
        "對比競品時使用客觀資料",
    ],
}
SEO_KEYWORDS_DB: dict[str, dict[str, Any]] = {
    "可觀測": {
        "primary": "可觀測性",
        "long_tail": ["雲原生可觀測性方案", "微服務可觀測平台選型"],
        "search_volume": "高",
    },
    "AI可觀測": {
        "primary": "AI 應用可觀測",
        "long_tail": ["LLM 調用鏈追蹤", "AI Agent 可觀測性"],
        "search_volume": "中（快速增長）",
    },
}
def search_product_knowledge(product: str, aspect: str) -> str:
    product_key = "CMS"
    product_data = PRODUCT_KNOWLEDGE.get(product_key)
    if not product_data:
        available = ", ".join(PRODUCT_KNOWLEDGE.keys())
        return f"未找到產品 '{product}' 的知識庫。可用產品：{available}"
    aspect_lower = aspect.lower()
    aspect_data = product_data.get(aspect_lower)
    if not aspect_data:
        available = ", ".join(product_data.keys())
        return f"未找到 '{product}' 的 '{aspect}' 方面資訊。可查詢方面：{available}"
    return f"【{product} - {aspect}】\n{aspect_data}"
def get_audience_profile(audience_type: str) -> str:
    profile = AUDIENCE_PROFILES.get(audience_type)
    if not profile:
        available = ", ".join(AUDIENCE_PROFILES.keys())
        return f"未找到受眾類型 '{audience_type}'。可用類型：{available}"
    return (
        f"受眾畫像 — {profile['role']}\n\n"
        f"核心痛點:\n{profile['pain_points']}\n\n"
        f"關注領域: {profile['interests']}\n\n"
        f"決策因素: {profile['decision_factors']}"
    )
def get_industry_cases(industry: str) -> str:
    cases = INDUSTRY_CASES.get(industry)
    if not cases:
        available = ", ".join(INDUSTRY_CASES.keys())
        return f"未找到 '{industry}' 行業的案例。可用行業：{available}"
    parts: list[str] = [f"【{industry}行業案例】\n"]
    for i, case in enumerate(cases, 1):
        parts.append(
            f"案例 {i}: {case['company']}\n"
            f"  情境: {case['scenario']}\n"
            f"  成效: {case['results']}"
        )
    return "\n\n".join(parts)
def check_content_compliance(content_type: str, key_claims: str) -> str:
    issues: list[str] = []
    for wrong, correct in COMPLIANCE_RULES["product_names"]["incorrect"].items():
        if wrong in key_claims and correct not in key_claims:
            issues.append(f"產品名稱 '{wrong}' 應更正為 '{correct}'")
    for word in ("最好", "唯一", "第一", "最強"):
        if word in key_claims:
            issues.append(f"包含絕對化用語 '{word}'，建議替換為客觀表述")
    rules_text = "\n".join(
        f"  {i+1}. {rule}"
        for i, rule in enumerate(COMPLIANCE_RULES["claim_rules"])
    )
    result = "合規檢查結果:\n\n"
    if issues:
        result += "發現問題:\n" + "\n".join(f"  - {i}" for i in issues) + "\n\n"
    else:
        result += "未發現明顯合規問題。\n\n"
    result += f"合規規則:\n{rules_text}"
    return result
def generate_seo_keywords(topic: str) -> str:
    topic_lower = topic.lower()
    matched: list[dict[str, Any]] = []
    for key, data in SEO_KEYWORDS_DB.items():
        if key.lower() in topic_lower or topic_lower in key.lower() or any(
            w in topic_lower for w in key.lower().split() if len(w) > 1
        ):
            matched.append({"keyword": key, **data})
    if not matched:
        all_keywords = list(SEO_KEYWORDS_DB.keys())
        return (
            f"未找到與 '{topic}' 直接匹配的關鍵詞資料。\n"
            f"建議關鍵詞方向：{', '.join(all_keywords)}\n"
            f"通用 SEO 建議：標題包含核心關鍵詞，"
            f"H2/H3 使用長尾關鍵詞，內容長度 2000+ 字"
        )
    parts: list[str] = [f"SEO 關鍵詞分析 — '{topic}':\n"]
    for item in matched:
        long_tail = "\n".join(f"    - {kw}" for kw in item["long_tail"])
        parts.append(
            f"主關鍵詞: {item['primary']}\n"
            f"  搜尋熱度: {item['search_volume']}\n"
            f"  長尾關鍵詞:\n{long_tail}"
        )
    return "\n\n".join(parts)
TOOL_REGISTRY: dict[str, Any] = {
    "search_product_knowledge": search_product_knowledge,
    "get_audience_profile": get_audience_profile,
    "get_industry_cases": get_industry_cases,
    "check_content_compliance": check_content_compliance,
    "generate_seo_keywords": generate_seo_keywords,
}
def dispatch_tool(name: str, arguments: str) -> str:
    func = TOOL_REGISTRY.get(name)
    if not func:
        return f"未知工具: {name}"
    try:
        kwargs = json.loads(arguments)
    except json.JSONDecodeError:
        return f"工具參數解析失敗: {arguments}"
    return func(**kwargs)
TOOL_DEFINITIONS: list[dict[str, Any]] = [
    {
        "type": "function",
        "function": {
            "name": "search_product_knowledge",
            "description": "搜尋 CMS 產品知識庫，擷取特性或競品對比資訊。",
            "parameters": {
                "type": "object",
                "properties": {
                    "product": {
                        "type": "string",
                        "description": "產品名稱",
                        "enum": ["CMS"],
                    },
                    "aspect": {
                        "type": "string",
                        "description": "查詢方面",
                        "enum": ["features", "comparison"],
                    },
                },
                "required": ["product", "aspect"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_audience_profile",
            "description": "擷取目標受眾畫像，包括痛點、關注領域和決策因素。",
            "parameters": {
                "type": "object",
                "properties": {
                    "audience_type": {
                        "type": "string",
                        "description": "目標受眾類型",
                        "enum": ["營運工程師", "架構師"],
                    },
                },
                "required": ["audience_type"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_industry_cases",
            "description": "擷取行業客戶成功案例，包括情境和成效資料。",
            "parameters": {
                "type": "object",
                "properties": {
                    "industry": {
                        "type": "string",
                        "description": "目標行業",
                        "enum": ["金融", "互連網"],
                    },
                },
                "required": ["industry"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "check_content_compliance",
            "description": "檢查內容合規性，包括產品名稱規範和宣傳用語。",
            "parameters": {
                "type": "object",
                "properties": {
                    "content_type": {
                        "type": "string",
                        "description": "內容類型",
                        "enum": ["blog", "case_study", "comparison"],
                    },
                    "key_claims": {
                        "type": "string",
                        "description": "關鍵宣傳點和資料引用",
                    },
                },
                "required": ["content_type", "key_claims"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "generate_seo_keywords",
            "description": "基於主題產生 SEO 關鍵詞，產生部落格文章時調用。",
            "parameters": {
                "type": "object",
                "properties": {
                    "topic": {
                        "type": "string",
                        "description": "文章主題或核心關鍵詞",
                    },
                },
                "required": ["topic"],
            },
        },
    },
]

requirements.txt

openai
fastapi
uvicorn[standard]
loongsuite-otel-util-genai>=0.6.1
