Application Real-Time Monitoring Service: Connect LLM applications or inference services to ARMS

Last Updated: Sep 29, 2025

The Python agent is an observability data collector for Python developed by Alibaba Cloud. It provides automatic instrumentation based on the OpenTelemetry standard and supports tracing for LLM applications.

Background information

An LLM application is an application developed based on a large language model (LLM). LLMs are trained on vast amounts of data and contain an enormous number of parameters, which allows them to answer questions in natural, human-like language. As a result, LLMs are widely used in areas such as natural language processing, text generation, and intelligent dialogue. As LLM applications grow in popularity, providing efficient inference services has become a significant challenge: traditional service frameworks often face performance bottlenecks and inefficient memory management when processing concurrent requests. LLM inference service frameworks, such as vLLM, are designed to address these challenges.

The output of an LLM is often hard to predict. Several uncontrollable factors can affect the overall performance of LLM applications and inference services. These factors include performance drift between training and production, performance degradation from data distribution drift, poor data quality, and unreliable external data dependencies. Therefore, it is crucial to promptly detect any decrease in the model's output quality.

ARMS supports automatic instrumentation for LLM applications through the Python agent. After you connect an LLM application to ARMS, you can view its call chain. This helps you analyze information such as the input/output of different operation types and token consumption. For more information, see LLM call chain analysis.

For a list of LLM inference service frameworks and application frameworks supported by ARMS, see Python components and frameworks supported by Application Monitoring.

Install the Python agent

Choose an installation method based on the deployment environment of your LLM application:

Start the application with the Python agent

aliyun-instrument python llm_app.py
Note
  • Replace llm_app.py with your actual application. If you do not have an LLM application, you can use the application demo provided in the Appendix.

  • The ARMS Python agent automatically detects your application type based on the dependencies you have installed:

    • If you install one of the following dependencies, the application is identified as an LLM application:

      • openai

      • dashscope

      • llama_index

      • langchain

    • If you install one of the following dependencies, the application is identified as an LLM service:

      • vllm

      • sglang

    • To force a specific type for your Python application, set the APSARA_APM_APP_TYPE environment variable, as shown in the example after this note. The valid values are:

      • microservice: A regular microservice application

      • app: An LLM application

      • model: An LLM service
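
For example, the following commands force the application to be reported as an LLM application before starting it with the agent. This is a minimal sketch: llm_app.py stands in for your own entry point, and the variable value should match your scenario.

export APSARA_APM_APP_TYPE=app
aliyun-instrument python llm_app.py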

Result

The connection is successful if, after about one minute, the Python application appears on the LLM Application Monitoring > Application List page in the ARMS console and reports data.

Configurations

Input/output content collection

Default value: True. Collection is enabled by default.

Effect when disabled: If this feature is disabled, only the size of detail fields, such as the input/output of models, tools, and knowledge bases, is collected. The content of these fields is not collected.

Applicable plug-ins: This configuration is supported only by Dify and LangChain.

Configuration: Set the environment variable OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=False.
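
For example, you might disable content collection by setting the variable in the shell that starts the application (a minimal sketch; llm_app.py is a placeholder for your own entry point):

export OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=False
aliyun-instrument python llm_app.py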

LLM application splitting

Default value: False. Application splitting is disabled by default.

Effect when enabled: The reported data is split into LLM sub-applications. Each LLM application, such as a Dify Workflow, Agent, or Chat App, corresponds to a separate ARMS application.

Applicable plug-in: This configuration is supported only by Dify.

Configuration: Set the environment variable PROFILER_GENAI_SPLITAPP_ENABLE=True.

Supported regions: Heyuan and Singapore.
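
For example, you might enable application splitting for a Dify deployment as follows (a sketch only; verify first that your region supports this feature):

export PROFILER_GENAI_SPLITAPP_ENABLE=True
aliyun-instrument python llm_app.py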

Message content field length limit

Default value: 4,096 characters.

Effect when enabled: This setting limits the length of LLM message content, such as the content of input/output message fields. Content that exceeds the specified character limit is truncated.

Applicable plug-ins: This configuration is supported only by Dify and LangChain.

Configuration: If your agent version is 1.8.3 or later, set the environment variable OTEL_INSTRUMENTATION_GENAI_MESSAGE_CONTENT_MAX_LENGTH=<integer_value>. Replace <integer_value> with an integer that specifies the desired character length limit.
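
For example, the following setting truncates message content to 1,024 characters; the value is illustrative and assumes agent version 1.8.3 or later:

export OTEL_INSTRUMENTATION_GENAI_MESSAGE_CONTENT_MAX_LENGTH=1024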

Span attribute value length limit

Default value: Not set. There is no limit by default.

Effect when enabled: This setting limits the length of reported span attribute values, such as `gen_ai.agent.description`. Attribute values that exceed the specified character limit are truncated.

Applicable plug-ins: This configuration applies to all plug-ins that support OpenTelemetry, such as LangChain, DashScope, and Dify.

Configuration: Set the environment variable OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT=<integer_value>. Replace <integer_value> with an integer that specifies the desired character length limit.
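
For example, the following setting caps span attribute values at 2,048 characters (an illustrative value):

export OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT=2048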

Appendix

OpenAI demo

llm_app.py

import openai
from fastapi import FastAPI
import uvicorn

app = FastAPI()

@app.get("/")
def call_openai():
    client = openai.OpenAI(api_key="sk-xxx")
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Write a haiku."}],
        max_tokens=20,
    )
    return {"data": f"{response}"}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)    

requirements.txt

fastapi
uvicorn
openai >= 1.0.0
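
To try this demo with the agent attached, you can install the dependencies and reuse the startup command from the installation steps; replace sk-xxx in llm_app.py with a valid API key first.

pip install -r requirements.txt
aliyun-instrument python llm_app.py

The other demos in this Appendix can be started the same way with their respective requirements.txt files.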

DashScope demo

llm_app.py

from http import HTTPStatus
import dashscope
from dashscope import Generation
from fastapi import FastAPI
import uvicorn

app = FastAPI()

@app.get("/")
def call_dashscope():
    dashscope.api_key = 'YOUR-DASHSCOPE-API-KEY'
    responses = Generation.call(model=Generation.Models.qwen_turbo,
                                prompt='How is the weather today?')
    resp = ""
    if responses.status_code == HTTPStatus.OK:
        resp = f"Result is: {responses.output}"
    else:
        resp = f"Failed request_id: {responses.request_id}, status_code: {responses.status_code}, code: {responses.code}, message: {responses.message}"
    return {"data": f"{resp}"}


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

requirements.txt

fastapi
uvicorn
dashscope >= 1.0.0

LlamaIndex demo

Store knowledge base documents in text formats, such as PDF, TXT, and DOC, in the `data` directory.

llm_app.py

import os

from fastapi import FastAPI, HTTPException
import uvicorn
import aiohttp
from dotenv import load_dotenv

import chromadb
import dashscope
from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    get_response_synthesizer,
)
from llama_index.core.llms import ChatMessage
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.embeddings.dashscope import DashScopeEmbedding
from llama_index.llms.dashscope import DashScope, DashScopeGenerationModels
from llama_index.vector_stores.chroma import ChromaVectorStore

load_dotenv()

# Replace the placeholder with your DashScope API key.
api_key = 'sk-xxxxxx'
os.environ["DASHSCOPE_API_KEY"] = api_key
dashscope.api_key = api_key

llm = DashScope(model_name=DashScopeGenerationModels.QWEN_MAX, api_key=api_key)

# create client and a new collection
chroma_client = chromadb.EphemeralClient()
chroma_collection = chroma_client.create_collection("chapters")

# define embedding function
embed_model = DashScopeEmbedding(model_name="text-embedding-v1", api_key=api_key)

# load documents
filename_fn = lambda filename: {"file_name": filename}

# automatically sets the metadata of each document according to filename_fn
documents = SimpleDirectoryReader(
    "./data/", file_metadata=filename_fn
).load_data()

# set up ChromaVectorStore and load in data
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, embed_model=embed_model
)

retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=4,
    verbose=True
)

# configure response synthesizer
response_synthesizer = get_response_synthesizer(llm=llm, response_mode="refine")

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

SYSTEM_PROMPT = """
You are a general knowledge chatbot for children. Your task is to generate answers based on user questions by combining the most relevant content found in the knowledge base. Do not answer subjective questions.
"""

# Initialize the conversation with a system message
messages = [ChatMessage(role="system", content=SYSTEM_PROMPT)]

app = FastAPI()


async def fetch(question):
    url = "https://www.aliyun.com"
    call_url = os.environ.get("LLM_INFRA_URL")
    if call_url is None or call_url == "":
        call_url = url
    else:
        call_url = f"{call_url}?question={question}"
    print(call_url)
    async with aiohttp.ClientSession() as session:
        async with session.get(call_url) as response:
            print(f"GET Status: {response.status}")
            data = await response.text()
            print(f"GET Response JSON: {data}")
            return data


@app.get("/heartbeat")
def heartbeat():
    return {"msg": "ok"}


cnt = 0


@app.get("/query")
async def call(question: str = None):
    global cnt
    cnt += 1
    if cnt == 20:
        cnt = 0
        raise HTTPException(status_code=401, detail="query count exceeds the limit of 20")
    # Add user message to the conversation history
    message = ChatMessage(role="user", content=question)
    # Convert messages into a string
    message_string = f"{message.role}:{message.content}"

    search = await fetch(question)
    print(f"search:{search}")
    resp = query_engine.query(message_string)
    print(resp)
    return {"data": f"{resp}"}


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

requirements.txt

fastapi
uvicorn
numpy==1.23.5
llama-index==0.10.62
llama-index-core==0.10.28
llama-index-embeddings-dashscope==0.1.3
llama-index-llms-dashscope==0.1.2
llama-index-vector-stores-chroma==0.1.6
aiohttp
python-dotenv

LangChain demo

llm_app.py

from fastapi import FastAPI
from langchain.llms.fake import FakeListLLM
import uvicorn
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

app = FastAPI()
llm = FakeListLLM(responses=["I'll callback later.", "You 'console' them!"])

template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate(template=template, input_variables=["question"])

llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"

@app.get("/")
def call_langchain():
    res = llm_chain.run(question)
    return {"data": res}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

requirements.txt

fastapi
uvicorn
langchain
langchain_community

Dify demo

To quickly build a Dify application, see Build a customized AI Q&A assistant for web pages using Dify.