Optimize Python Application Observability: Overcome the Challenges in LLM Application Deployment

1. Background

As LLMs mature and prove useful across a growing range of scenarios, more enterprises are integrating them into their products and services. LLMs demonstrate impressive capabilities in processing natural language, but their internal mechanisms remain opaque. This lack of transparency introduces risks for downstream applications and makes LLM applications difficult to deploy. Understanding and interpreting LLMs is therefore essential to clarifying their behavior, limits, and social impact. LLM application observability provides the data needed for model interpretability, and helps researchers and developers identify unexpected biases and risks and improve performance.

As a programming language in the AI era, Python has been widely used in recent years. Popular LLM projects, such as LangChain, LlamaIndex, Dify, Prompt flow, OpenAI, and Dashscope, are developed in Python. To enhance the observability of Python applications, especially LLM applications in Python, Alibaba Cloud provides the Application Real-Time Monitoring Service (ARMS) agent for Python. The ARMS agent aims to facilitate LLM application deployment for enterprises.

This article describes how to install the ARMS agent for a Python application and introduces the features and compatibility of the agent. A sample LLM application is used throughout for testing.

2. Sample application

In this example, a sample LLM application is created in the following scenario to help you gain a better understanding of the ARMS agent for Python:

An enterprise integrates the intelligent Q&A feature into its search service during a service upgrade. The following figure shows the service architecture.


After a user initiates a Q&A query to the server, the server calls the chatbot to obtain the result. The chatbot receives the query and returns the result by using retrieval-augmented generation (RAG).
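
The RAG flow mentioned above can be sketched in a few lines. This is a toy illustration, not the code of the sample chatbot: the documents, the word-overlap scoring rule, and the fake_llm function are all hypothetical stand-ins for a real retriever and a real model call.

```python
# Toy retrieval-augmented generation (RAG) flow. Everything here is illustrative:
# the documents, the scoring rule, and fake_llm are stand-ins for real components.
import re

DOCUMENTS = [
    "ARMS provides application monitoring for Python applications.",
    "ACK is a managed Kubernetes service.",
    "RAG combines document retrieval with text generation.",
]


def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, top_k=1):
    """Return the top_k documents that share the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query, context_docs):
    """Place the retrieved context ahead of the question."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


def fake_llm(prompt):
    """Stand-in for a real model call; it only reports the prompt size."""
    return f"(answer generated from a {len(prompt)}-character prompt)"


def answer(query):
    """Retrieve context for the query, build the prompt, and call the model."""
    docs = retrieve(query, DOCUMENTS)
    return fake_llm(build_prompt(query, docs))


if __name__ == "__main__":
    print(answer("What does RAG combine?"))
```

In a real deployment, the retrieval step and the model call are separate operations, and each typically shows up as its own step in a trace.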

To observe the LLM application, the enterprise installs the ARMS agent for the LLM application in Python. The following section describes how to install the ARMS agent for Python.

Installation methods
In this example, the Python application is deployed in Container Service for Kubernetes (ACK). For more information about how to manually install an ARMS agent for Python, visit https://www.alibabacloud.com/help/en/arms/application-monitoring/user-guide/start-monitoring-python-applications/

2.1 Prerequisites

• An ACK cluster is created. You can create an ACK dedicated cluster, ACK managed cluster, or ACK Serverless cluster based on your business requirements.

• A namespace is created in the cluster. For more information, see Manage namespaces and resource quotas. In this example, a namespace named arms-demo is used.

• The Python version is compatible with the framework version. For more information, see Compatibility requirements of the ARMS agent for Python.

2.2 Step 1: Install the ack-onepilot Component

  1. Log on to the ACK console.
  2. In the left-side navigation pane, click Clusters. On the Clusters page, find the cluster in which the LLM application is deployed and click its name.
  3. In the left-side navigation pane of the cluster details page, choose Operations > Add-ons. On the Add-ons page, search for the ack-onepilot component by keyword. Note: Make sure that the version of the ack-onepilot component is V3.2.0 or later.
  4. Click Install on the ack-onepilot card. Note: By default, the ack-onepilot component supports 1,000 pods. For every additional 1,000 pods in the cluster, you need to add 0.5 CPU cores and 512 MB of memory for the component.
  5. In the Install ack-onepilot dialog box, configure the parameters and click OK. We recommend that you use the default values. Note: After you install ack-onepilot, you can upgrade, configure, or uninstall it on the Add-ons page.
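
The sizing rule in the note for step 4 can be expressed as a small calculation. This is only a sketch: the function name is hypothetical, and rounding a partial block of 1,000 pods up to a full block is an assumption that the note does not state.

```python
import math

# Figures from the note: the default allocation covers 1,000 pods, and each
# additional 1,000 pods needs 0.5 CPU cores and 512 MB of memory.
PODS_PER_STEP = 1000
CPU_CORES_PER_STEP = 0.5
MEMORY_MB_PER_STEP = 512


def extra_onepilot_resources(pod_count):
    """Return (extra_cpu_cores, extra_memory_mb) beyond the default allocation.

    Assumption: a partial block of 1,000 pods is rounded up to a full block.
    """
    if pod_count < 0:
        raise ValueError("pod_count must be non-negative")
    extra_pods = max(0, pod_count - PODS_PER_STEP)
    steps = math.ceil(extra_pods / PODS_PER_STEP)
    return steps * CPU_CORES_PER_STEP, steps * MEMORY_MB_PER_STEP


if __name__ == "__main__":
    for pods in (800, 1000, 2500, 4000):
        cpu, mem = extra_onepilot_resources(pods)
        print(f"{pods} pods -> +{cpu} CPU cores, +{mem} MB memory")
```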

2.3 Step 2: Modify the Dockerfile

1.  Download the agent installer from the Python Package Index (PyPI) repository.

pip3 install aliyun-bootstrap

2.  Install the ARMS agent for Python by using aliyun-bootstrap.

aliyun-bootstrap -a install

3.  Start the application by using the ARMS agent for Python.

aliyun-instrument python app.py

4.  Build an image. You can refer to the following sample Dockerfiles to modify your Dockerfile.

Sample Dockerfiles:

Dockerfile before modification:

# Use the base image for Python 3.10.
FROM docker.m.daocloud.io/python:3.10

# Specify the working directory.
WORKDIR /app

# Copy the requirements.txt file to the working directory.
COPY requirements.txt .

# Install dependencies by using pip.
RUN pip install --no-cache-dir -r requirements.txt

COPY ./app.py /app/app.py
# Expose port 8000 of the container.
EXPOSE 8000
CMD ["python","app.py"]

Modified Dockerfile:

# Use the base image for Python 3.10.
FROM docker.m.daocloud.io/python:3.10

# Specify the working directory.
WORKDIR /app

# Copy the requirements.txt file to the working directory.
COPY requirements.txt .

# Install dependencies by using pip.
RUN pip install --no-cache-dir -r requirements.txt
# Install the ARMS agent for Python.
RUN pip3 install aliyun-bootstrap && aliyun-bootstrap -a install

COPY ./app.py /app/app.py

# Expose port 8000 of the container.
EXPOSE 8000
CMD ["aliyun-instrument","python","app.py"]

Usage Notes

1.  If you use uvicorn to start the application, we recommend that you start it with gunicorn and the uvicorn worker class instead. Examples (bound to port 8000, the port that the sample Dockerfile exposes):

Original command:

uvicorn app:app --workers 4 --host 0.0.0.0 --port 8000

Modified command:

gunicorn -w 4 -k uvicorn.workers.UvicornWorker -b 0.0.0.0:8000 app:app

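2.  If the application uses gevent coroutines, you must set an environment variable. For example, the application contains the following code:

from gevent import monkey
monkey.patch_all()

In this case, set the GEVENT_ENABLE environment variable to true:

GEVENT_ENABLE=true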
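
A startup check can catch a missing GEVENT_ENABLE setting early. The sketch below is illustrative only: gevent_is_patched and gevent_env_ok are hypothetical helper names, and the exact-match comparison with the string "true" is an assumption based on the value named above, not a documented rule of the agent.

```python
import os
import sys


def gevent_is_patched():
    """Best-effort check: True if gevent.monkey is loaded and has patched socket."""
    monkey = sys.modules.get("gevent.monkey")
    if monkey is None:
        return False  # gevent.monkey was never imported in this process
    return bool(monkey.is_module_patched("socket"))


def gevent_env_ok(environ=None, patched=None):
    """Return False only when gevent patching is active but GEVENT_ENABLE is not "true".

    Both arguments are optional and default to the live process state; passing
    them explicitly makes the check easy to exercise in isolation.
    """
    environ = os.environ if environ is None else environ
    patched = gevent_is_patched() if patched is None else patched
    if not patched:
        return True  # nothing to verify when gevent is not in use
    return environ.get("GEVENT_ENABLE", "") == "true"


if __name__ == "__main__":
    if not gevent_env_ok():
        print("gevent monkey patching is active, but GEVENT_ENABLE is not set to true")
```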


2.4 Step 3: Grant Access Permissions on ARMS Resources

• To monitor applications in a serverless ACK cluster or applications in an ACK cluster connected to Elastic Container Instance, you must first authorize the cluster to access ARMS on the Cloud Resource Access Authorization page. Then, restart all pods managed by the ack-onepilot component.

• To monitor an application deployed in an ACK cluster in which ARMS Addon Token does not exist, perform the following operations to manually authorize the ACK cluster to access ARMS. If ARMS Addon Token exists, go to Step 4.

Perform the following steps to check whether ARMS Addon Token exists in a cluster:

a. Log on to the ACK console. In the left-side navigation pane, click Clusters. On the Clusters page, find the cluster in which the application is deployed and click its name to go to the cluster details page. 
b. In the left-side navigation pane of the cluster details page, choose Configurations > Secrets. In the upper part of the Secrets page, select kube-system from the Namespace drop-down list and check whether addon.arms.token is displayed on the Secrets page. 

Note: If ARMS Addon Token exists in a cluster, ARMS performs password-free authorization on the cluster. ARMS Addon Token may not exist in ACK managed clusters that run on specific Kubernetes versions. We recommend that you check whether an ACK managed cluster has ARMS Addon Token before you use ARMS to monitor applications in the cluster. If a cluster does not have ARMS Addon Token, you must perform the following steps to manually authorize the cluster to access ARMS:

  1. Log on to the ACK console.
  2. In the left-side navigation pane, click Clusters. On the Clusters page, find the cluster that you want to manage and click its name.
  3. On the cluster details page, click the Basic Information tab. On the Basic Information tab, click the link next to the Worker RAM Role parameter in the Cluster Resources section.
  4. On the Permissions tab, click Grant Permission.
  5. In the Grant Permission panel, select the AliyunARMSFullAccess policy and click Grant permissions. To monitor an application deployed in an ACK dedicated cluster or registered cluster as a Resource Access Management (RAM) user, make sure that the AliyunARMSFullAccess and AliyunSTSAssumeRoleAccess policies are attached to the RAM user. For more information about how to grant permissions to a RAM user, see Grant permissions to a RAM user.

After you install the ack-onepilot component, you must enter the AccessKey ID and AccessKey secret of your Alibaba Cloud account in the configuration file of the ack-onepilot component.

  1. In the left-side navigation pane of the cluster details page, choose Applications > Helm. On the Helm page, find ack-onepilot and click Update in the Actions column.
  2. In the Update Release panel, set the accessKey and accessKeySecret parameters to the AccessKey ID and AccessKey secret of your Alibaba Cloud account and click OK. For more information about how to obtain an AccessKey pair, see Create an AccessKey pair.
  3. Restart the Deployment.

2.5 Step 4: Enable Application Monitoring for the Application

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters. On the Clusters page, find the cluster in which the application is deployed and click Applications in the Actions column.
  2. On the Deployments page, find the application that you want to manage and choose More > View in YAML in the Actions column. To enable application monitoring for a new application, click Create from YAML on the Deployments page.
  3. Add the following labels under spec.template.metadata in the Template code editor:

  labels:
    aliyun.com/app-language: python # Required. Specify that the application is developed in Python.
    armsPilotAutoEnable: 'on'
    armsPilotCreateAppName: "<your-deployment-name>"    # Specify the display name of the application in ARMS.
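
If manifests are generated or patched by a script, the same labels can be added programmatically. The following is a minimal sketch, assuming the Deployment manifest has already been loaded into a Python dict; the helper name with_arms_labels is hypothetical and not part of any ARMS tooling.

```python
import copy

# Label keys and values taken from the step above.
ARMS_LABELS = {
    "aliyun.com/app-language": "python",
    "armsPilotAutoEnable": "on",
}


def with_arms_labels(deployment, app_name):
    """Return a copy of a Deployment manifest (a dict) with the ARMS labels added
    under spec.template.metadata.labels. Existing labels are kept."""
    patched = copy.deepcopy(deployment)
    template = patched.setdefault("spec", {}).setdefault("template", {})
    labels = template.setdefault("metadata", {}).setdefault("labels", {})
    labels.update(ARMS_LABELS)
    labels["armsPilotCreateAppName"] = app_name
    return patched


if __name__ == "__main__":
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "arms-python-client"},
        "spec": {"template": {"metadata": {"labels": {"app": "arms-python-client"}}}},
    }
    patched = with_arms_labels(deployment, "arms-python-client")
    for key, value in patched["spec"]["template"]["metadata"]["labels"].items():
        print(f"{key}: {value}")
```

The function works on a deep copy, so the input manifest is left unchanged and the result can be compared against it before it is applied.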


The following YAML template creates the sample client and server Deployments with application monitoring enabled, together with their Services:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: arms-python-client
  name: arms-python-client
  namespace: arms-demo
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: arms-python-client
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: arms-python-client
        aliyun.com/app-language: python # Required. Specify that the application is developed in Python.
        armsPilotAutoEnable: 'on'
        armsPilotCreateAppName: "arms-python-client"    # Specify the display name of the application in ARMS.
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/arms-default/python-agent:arms-python-client
          imagePullPolicy: Always
          name: client
          resources:
            requests:
              cpu: 250m
              memory: 300Mi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30


---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: arms-python-server
  name: arms-python-server
  namespace: arms-demo
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: arms-python-server
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: arms-python-server
        aliyun.com/app-language: python # Required. Specify that the application is developed in Python.
        armsPilotAutoEnable: 'on'
        armsPilotCreateAppName: "arms-python-server"    # Specify the display name of the application in ARMS.
    spec:
      containers:
        - env:
            - name: CLIENT_URL
              value: 'http://arms-python-client-svc:8000'
          image: registry.cn-hangzhou.aliyuncs.com/arms-default/python-agent:arms-python-server
          imagePullPolicy: Always
          name: server
          resources:
            requests:
              cpu: 250m
              memory: 300Mi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30


---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: arms-python-server
  name: arms-python-server-svc
  namespace: arms-demo
spec:
  internalTrafficPolicy: Cluster
  ipFamilies:
    - IPv4
  ipFamilyPolicy: SingleStack
  ports:
    - name: http
      port: 8000
      protocol: TCP
      targetPort: 8000
  selector:
    app: arms-python-server
  sessionAffinity: None
  type: ClusterIP

---
apiVersion: v1
kind: Service
metadata:
  name: arms-python-client-svc
  namespace: arms-demo
spec:
  internalTrafficPolicy: Cluster
  ipFamilies:
    - IPv4
  ipFamilyPolicy: SingleStack
  ports:
    - name: http
      port: 8000
      protocol: TCP
      targetPort: 8000
  selector:
    app: arms-python-client
  sessionAffinity: None
  type: ClusterIP

3. Result Verification

After the application is automatically redeployed, wait for 1 to 2 minutes. In the left-side navigation pane of the ARMS console, choose Application Monitoring > Application List. On the Application List page, find the application that you created and click its name to view the metrics of the application. For more information, see View monitoring details (new).
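
Traces are recorded only when the application receives requests, so it can help to send a few test requests before you look for data. The following is a rough sketch that uses only the Python standard library; the function names are hypothetical, and the URL assumes the script runs in a pod inside the arms-demo namespace where the arms-python-server-svc Service from the sample manifests is resolvable.

```python
import urllib.request


def http_get_status(url, timeout=5.0):
    """Issue one GET request and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def send_test_traffic(url, count=10, fetch=http_get_status):
    """Call url `count` times and return a {status_or_error_name: occurrences} summary."""
    summary = {}
    for _ in range(count):
        try:
            outcome = str(fetch(url))
        except Exception as exc:  # keep going so that failed calls are counted too
            outcome = type(exc).__name__
        summary[outcome] = summary.get(outcome, 0) + 1
    return summary


if __name__ == "__main__":
    # Assumed URL: the server Service from the sample manifests. Adjust it for
    # your environment.
    print(send_test_traffic("http://arms-python-server-svc:8000/", count=20))
```

The fetch parameter exists so that the loop can be exercised without a network, for example by passing a function that returns a fixed status code.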

4. Features

After the application is monitored by ARMS, you can view the information about the application on the application details page in the ARMS console. This section describes the features that you can use.

4.1 Analyze Traces

Microservices Scenarios

On the Trace Explorer tab, you can combine filter conditions and aggregation dimensions for real-time analysis. You can troubleshoot failed or slow calls of your application based on the failed or slow trace data.


The following figure shows an example of the trace details.


LLM Scenarios

In LLM scenarios, the new TraceView visualizes information such as the input and output of each operation type and the token consumption.

On the Trace Explorer tab, click the LLM view icon in the upper-right corner.


The following figure shows an example of the trace details of an LLM application.


4.2 Monitor Metrics

Application overview


Application topology


4.3 Configure Alerts

You can configure alerting for your application. If an alert is triggered, alert notifications are sent to the contacts or DingTalk group chat based on the specified notification methods. This way, the contacts can resolve issues at the earliest opportunity. For more information, see Application Monitoring Alert Rules.


5. Compatibility

The ARMS agent for Python is compatible with Python 3.8 and later.
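
The interpreter version can be checked before the agent is installed, for example in an image build or a startup script. This is a minimal sketch; the helper name is_supported_python is hypothetical, and it checks only the Python version, not the framework versions covered by the compatibility requirements.

```python
import sys

MIN_PYTHON = (3, 8)  # minimum version stated in the compatibility section


def is_supported_python(version_info=None):
    """Return True if the given (or current) interpreter version is 3.8 or later."""
    version_info = sys.version_info if version_info is None else version_info
    return tuple(version_info[:2]) >= MIN_PYTHON


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "meets" if is_supported_python() else "does not meet"
    print(f"Python {current} {status} the 3.8+ requirement")
```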

6. Appendices

arms-python-server file:

import os
from concurrent import futures
from logging import getLogger

import requests
import uvicorn
from fastapi import FastAPI
from opentelemetry import trace

tracer = trace.get_tracer(__name__)
_logger = getLogger(__name__)

def call_requests():
    url = 'https://www.aliyun.com'  # Replace the value with the URL of the server.
    call_url = os.environ.get("CALL_URL")
    if call_url is None or call_url == "":
        call_url = url
    response = requests.get(call_url)
    response.raise_for_status()  # Raise an exception for 4xx and 5xx responses.
    print(f"response code: {response.status_code} - {response.text}")

app = FastAPI()

def call_client():
    _logger.warning("calling client")
    url = 'https://www.aliyun.com'  # Replace the value with the URL of the client-side application.
    call_url = os.environ.get("CLIENT_URL")
    if call_url is None or call_url == "":
        call_url = url
    response = requests.get(call_url)
    # print(f"response code: {response.status_code} - {response.text}")
    return response.text

@app.get("/")  # Register the handler; the root path is assumed here.
async def call():
    # Create a parent span, then submit the downstream call from a worker thread.
    with tracer.start_as_current_span("parent") as rootSpan:
        rootSpan.set_attribute("parent.value", "parent")
        with futures.ThreadPoolExecutor(max_workers=2) as executor:
            with tracer.start_as_current_span("ThreadPoolExecutorTest") as span:
                span.set_attribute("future.value", "ThreadPoolExecutorTest")
                # The executor's context manager waits for the task on exit.
                executor.submit(call_client)
    return {"data": "call"}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

arms-python-client file:

from fastapi import FastAPI
from langchain.llms.fake import FakeListLLM
import uvicorn
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

app = FastAPI()
llm = FakeListLLM(responses=["I'll callback later.", "You 'console' them!"])

template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate(template=template, input_variables=["question"])

llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"

@app.get("/")  # Serve the root path, which the server calls through CLIENT_URL.
def call_langchain():
    res = llm_chain.run(question)
    return {"data": res}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)


