Connect LLM applications and inference services to ARMS (Application Real-Time Monitoring Service)

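The Python agent is an observability data collection agent for Python, developed in-house by the Alibaba Cloud observability team. It implements automatic instrumentation based on the OpenTelemetry standard and supports tracing LLM applications.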

Background information

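An LLM (large language model) application is any application built on a large language model. Trained on large volumes of data with a large number of parameters, LLMs can answer questions in human-like natural language, so they are widely used in natural language processing, text generation, and intelligent dialogue. As LLM applications become more common, serving inference efficiently has become a major challenge. Traditional service architectures often run into performance bottlenecks and inefficient memory management when handling concurrent requests. LLM inference serving frameworks such as vLLM were built to address these challenges.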

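LLM output is hard to predict accurately. Several other factors are also hard to control: model behavior can differ between training and production, data distribution drift can degrade performance, data can become stale, and external data sources can be unreliable. These factors affect the overall behavior of LLM applications and inference services, so it is important to detect promptly when model output quality degrades.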

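ARMS supports automatic instrumentation of LLM applications through the Python agent. After an LLM application is connected to ARMS, you can view its trace view and analyze the inputs and outputs, token consumption, and other information for each operation type more intuitively. For more information, see LLM trace analysis.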

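For the LLM inference serving frameworks and application frameworks that ARMS supports, see Python components and frameworks supported by ARMS Application Monitoring.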

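Install the Python agent

Choose an installation method that fits the deployment environment of your LLM application:

Start the application through the Python agent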

aliyun-instrument python llm_app.py

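Note

Replace llm_app.py with your actual application. If you do not yet have an LLM application to connect, you can use one of the demo applications in the appendix.
The ARMS Python agent automatically identifies the application type from the dependencies you have installed:
- If any of the following dependencies is installed, the application is identified as an LLM application: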
  - openai
  - dashscope
  - llama_index
  - langchain
- If any of the following dependencies is installed, the application is identified as an LLM service:
  - vllm
  - sglang
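- To force a specific type for a Python application, set the APSARA_APM_APP_TYPE environment variable to one of the following values:
  - microservice: regular microservice application
  - app: LLM application
  - model: LLM service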
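To force the application type instead of relying on automatic detection, the environment variable can be set when the process is launched. The following is a minimal sketch, assuming the aliyun-instrument command is on the PATH. The helper name build_launch is hypothetical and is not part of the agent.

```python
import os

# Values documented for the APSARA_APM_APP_TYPE environment variable.
VALID_APP_TYPES = {"microservice", "app", "model"}


def build_launch(app_type, script="llm_app.py", base_env=None):
    """Return (command, environment) for starting a script under the agent.

    build_launch is a hypothetical helper used for illustration only.
    """
    if app_type not in VALID_APP_TYPES:
        raise ValueError(f"unsupported APSARA_APM_APP_TYPE: {app_type!r}")
    # Copy the environment so the caller's mapping is not modified.
    env = dict(os.environ if base_env is None else base_env)
    env["APSARA_APM_APP_TYPE"] = app_type
    command = ["aliyun-instrument", "python", script]
    return command, env


# Usage (requires the agent to be installed):
#   import subprocess
#   command, env = build_launch("model")
#   subprocess.run(command, env=env, check=True)
```

Validating the value before launch catches typos early, since an unrecognized type would otherwise only show up later in the console.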

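Result

After about one minute, if the Python application appears on the LLM Application Monitoring > Application List page of the ARMS console and data is being reported, the application is connected.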

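Configuration

Input/output content capture

Default: True. Content capture is enabled by default.

Effect when disabled: for a user query, detail fields such as the input and output of models, tools, and knowledge bases record only the field size, not the field content.

Supported plug-ins: only Dify and LangChain support this setting.

How to configure: set the environment variable OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=False.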
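The effect of this switch can be pictured with a small sketch. This is an illustration only, not the agent's implementation: the function name describe_field and the keys "content" and "size" are hypothetical, and measuring size in characters is an assumption.

```python
def describe_field(value, capture_content=True):
    """Return what would be recorded for one input/output field.

    Hypothetical illustration: "content" and "size" are made-up keys,
    not the agent's real attribute names.
    """
    if capture_content:
        # Capture enabled (the default): the field content is recorded.
        return {"content": value}
    # Capture disabled: only the size of the field is recorded.
    return {"size": len(value)}
```

With capture disabled, a prompt such as "Write a haiku." would contribute only its length to the trace, which keeps sensitive text out of the reported data.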

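LLM application splitting

Default: False. Applications are not split by default.

Effect when enabled: reported data is split into LLM sub-applications, and each LLM application (such as a Dify Workflow, Agent, or Chat App) maps to one ARMS application.

Supported plug-ins: only Dify supports this setting.

How to configure: set the environment variable PROFILER_GENAI_SPLITAPP_ENABLE=True.

Supported regions: Heyuan and Singapore.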

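Message content length limit

Default: 4K characters.

Effect: limits the length of each LLM message content (such as the message content field of an input or output). Content longer than the specified number of characters is truncated.

Supported plug-ins: only Dify and LangChain support this setting.

How to configure: if your agent version is 1.8.3 or later, set the environment variable OTEL_INSTRUMENTATION_GENAI_MESSAGE_CONTENT_MAX_LENGTH=<integer_value>, replacing <integer_value> with the maximum length in characters as an integer.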
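As a rough picture of what the limit does, the sketch below cuts each message's content to a maximum number of characters. It is a hypothetical illustration: truncate_messages is not an agent function, and the agent's real truncation logic (for example, whether it appends a marker) is not described here.

```python
def truncate_messages(messages, max_length):
    """Return copies of the messages with content cut to max_length characters.

    Hypothetical illustration only; not the agent's implementation.
    """
    if max_length < 0:
        raise ValueError("max_length must not be negative")
    # Build new dicts so the original messages stay unchanged.
    return [{**m, "content": m["content"][:max_length]} for m in messages]
```

For example, with max_length set to 5, a message whose content is "Write a haiku." would be reported with the content "Write".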

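Span attribute value length limit

Default: not set. There is no limit by default.

Effect when enabled: limits the length of reported span attribute values (such as gen_ai.agent.description). Values longer than the specified number of characters are truncated.

Supported plug-ins: this setting applies to all plug-ins that support OpenTelemetry (such as LangChain, DashScope, and Dify).

How to configure: set the environment variable OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT=<integer_value>, replacing <integer_value> with the maximum length in characters as an integer.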
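Because every setting in this section is a plain environment variable, it can help to assemble them in one place before launching the application. The following is a minimal sketch. The function genai_env and its parameter names are hypothetical. Only the environment variable names and the True/False values come from this document.

```python
def genai_env(capture_content=True, split_app=False,
              message_max_length=None, span_attribute_limit=None):
    """Build the environment variables for the settings in this section.

    genai_env is a hypothetical helper; it only formats values and does
    not talk to the agent.
    """
    env = {
        "OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT": str(bool(capture_content)),
        "PROFILER_GENAI_SPLITAPP_ENABLE": str(bool(split_app)),
    }
    limits = {
        "OTEL_INSTRUMENTATION_GENAI_MESSAGE_CONTENT_MAX_LENGTH": message_max_length,
        "OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT": span_attribute_limit,
    }
    for name, value in limits.items():
        if value is None:
            continue  # Leave unset so the agent's default applies.
        if not isinstance(value, int) or value <= 0:
            raise ValueError(f"{name} must be a positive integer")
        env[name] = str(value)
    return env
```

The returned dictionary can be merged into os.environ, or passed as the env argument of subprocess.run, when starting the application with aliyun-instrument.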

Appendix

OpenAI Demo

llm_app.py

import openai
from fastapi import FastAPI
import uvicorn

app = FastAPI()

@app.get("/")
def call_openai():
    client = openai.OpenAI(api_key="sk-xxx")
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Write a haiku."}],
        max_tokens=20,
    )
    return {"data": f"{response}"}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

requirements.txt

fastapi
uvicorn
openai >= 1.0.0

DashScope Demo

llm_app.py

from http import HTTPStatus
import dashscope
from dashscope import Generation
from fastapi import FastAPI
import uvicorn

app = FastAPI()

@app.get("/")
def call_dashscope():
    dashscope.api_key = 'YOUR-DASHSCOPE-API-KEY'
    responses = Generation.call(model=Generation.Models.qwen_turbo,
                                prompt='How is the weather today?')
    resp = ""
    if responses.status_code == HTTPStatus.OK:
        resp = f"Result is: {responses.output}"
    else:
        resp = f"Failed request_id: {responses.request_id}, status_code: {responses.status_code}, code: {responses.code}, message: {responses.message}"
    return {"data": f"{resp}"}


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

requirements.txt

fastapi
uvicorn
dashscope >= 1.0.0

LlamaIndex Demo

Place the knowledge base documents (text formats such as PDF, TXT, and DOC) in the data directory.

llm_app.py

import os

import aiohttp
import chromadb
import dashscope
import uvicorn
from dotenv import load_dotenv
from fastapi import FastAPI
from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    get_response_synthesizer,
)
from llama_index.core.llms import ChatMessage
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.embeddings.dashscope import DashScopeEmbedding
from llama_index.llms.dashscope import DashScope, DashScopeGenerationModels
from llama_index.vector_stores.chroma import ChromaVectorStore

load_dotenv()

# Replace the placeholder with your own DashScope API key.
api_key = 'sk-xxxxxx'
os.environ["DASHSCOPE_API_KEY"] = api_key
dashscope.api_key = api_key

llm = DashScope(model_name=DashScopeGenerationModels.QWEN_MAX,api_key=api_key)

# create client and a new collection
chroma_client = chromadb.EphemeralClient()
chroma_collection = chroma_client.create_collection("chapters")

# define embedding function
embed_model = DashScopeEmbedding(model_name="text-embedding-v1", api_key=api_key)

# load documents
filename_fn = lambda filename: {"file_name": filename}

# automatically sets the metadata of each document according to filename_fn
documents = SimpleDirectoryReader(
    "./data/", file_metadata=filename_fn
).load_data()

# set up ChromaVectorStore and load in data
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, embed_model=embed_model
)

retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=4,
    verbose=True
)

# configure response synthesizer
response_synthesizer = get_response_synthesizer(llm=llm, response_mode="refine")

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

SYSTEM_PROMPT = """
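You are a general-knowledge chatbot for children. Your task is to take the user's question, find the most relevant content in the knowledge base, and generate an answer based on that content. Do not answer subjective questions.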
"""

# Initialize the conversation with a system message
messages = [ChatMessage(role="system", content=SYSTEM_PROMPT)]

app = FastAPI()


async def fetch(question):
    # Call LLM_INFRA_URL with the question if it is set; otherwise fetch a default page.
    call_url = os.environ.get("LLM_INFRA_URL")
    if call_url:
        params = {"question": question or ""}
    else:
        call_url = "https://www.aliyun.com"
        params = None
    print(call_url)
    async with aiohttp.ClientSession() as session:
        # Passing params lets aiohttp URL-encode the question.
        async with session.get(call_url, params=params) as response:
            print(f"GET Status: {response.status}")
            data = await response.text()
            print(f"GET Response: {data}")
            return data


@app.get("/heartbeat")
def heartbeat():
    # Return a dict (the original braces with a comma built a set).
    return {"msg": "ok"}


cnt = 0


@app.get("/query")
async def call(question: str = None):
    global cnt
    cnt += 1
    if cnt == 20:
        cnt = 0
        # Fail every 20th request with a "limit exceeded" error.
        raise RuntimeError("query is over limit, 20", 401)
    # Add user message to the conversation history
    message = ChatMessage(role="user", content=question)
    # Convert messages into a string
    message_string = f"{message.role}:{message.content}"

    search = await fetch(question)
    print(f"search:{search}")
    resp = query_engine.query(message_string)
    print(resp)
    return {"data": str(resp)}


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

requirements.txt

fastapi
uvicorn
numpy==1.23.5
llama-index==0.10.62
llama-index-core==0.10.28
llama-index-embeddings-dashscope==0.1.3
llama-index-llms-dashscope==0.1.2
llama-index-vector-stores-chroma==0.1.6
aiohttp
python-dotenv
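Once the LlamaIndex demo is running, traces only appear after requests reach it. The following is a minimal client sketch, assuming the demo listens on localhost port 8000 as in the script above. The helper names build_query_url and ask are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(question, base_url="http://localhost:8000"):
    """Return the URL of the demo's /query endpoint with the question URL-encoded."""
    return f"{base_url}/query?{urlencode({'question': question})}"


def ask(question, base_url="http://localhost:8000", timeout=30):
    """Send one question to the running demo and return the raw response body."""
    with urlopen(build_query_url(question, base_url), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Usage (requires the demo to be running):
#   print(ask("Why is the sky blue?"))
```

Encoding the question with urlencode keeps spaces and punctuation from breaking the query string.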

LangChain Demo

llm_app.py

from fastapi import FastAPI
from langchain_community.llms.fake import FakeListLLM
import uvicorn
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

app = FastAPI()
llm = FakeListLLM(responses=["I'll callback later.", "You 'console' them!"])

template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate(template=template, input_variables=["question"])

llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"

@app.get("/")
def call_langchain():
    res = llm_chain.run(question)
    return {"data": res}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

requirements.txt

fastapi
uvicorn
langchain
langchain_community

Dify Demo

To set up a Dify application quickly, see the document Build a customized AI Q&A assistant for web pages with Dify.