Connect an LLM application or inference service to ARMS - Application Real-Time Monitoring Service

The Python agent is an observability data collector for Python developed by Alibaba Cloud. It provides automatic instrumentation based on the OpenTelemetry standard and supports tracing for LLM applications.

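Background information

An LLM application is an application built on a large language model (LLM). LLMs are trained on massive amounts of data with a large number of parameters, which enables them to answer questions in a way that resembles natural human language. As a result, LLMs are widely used in fields such as natural language processing, text generation, and intelligent dialogue. As LLM applications grow in popularity, serving inference efficiently has become a major challenge. Traditional serving frameworks often run into performance bottlenecks and inefficient memory management when they handle concurrent requests. LLM inference service frameworks such as vLLM are designed to address these challenges.

LLM output is often hard to predict. Several uncontrollable factors can affect the overall performance of LLM applications and inference services, including performance drift between training and production environments, performance degradation caused by drift in the data distribution, declining data quality, and dependence on unreliable external data. It is therefore important to detect degradation in model output quality quickly.

ARMS supports automatic instrumentation of LLM applications through the Python agent. After you connect an LLM application to ARMS, you can view its call chains and analyze information such as the inputs and outputs of each operation type and token consumption. For more information, see "LLM call chain analysis".

For the list of LLM inference service frameworks and application frameworks that ARMS supports, see "Python components and frameworks supported by Application Monitoring".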

Install the Python agent

Choose an installation method based on the deployment environment of your LLM application.

Start the application with the Python agent

aliyun-instrument python llm_app.py

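Note

Replace llm_app.py with your actual application. If you do not have an LLM application, you can use the application demos provided in the "Appendix".
The ARMS Python agent automatically detects the application type based on the dependencies that are installed.
- If any of the following dependencies is installed, the application is identified as an LLM application:
  - openai
  - dashscope
  - llama_index
  - langchain
- If any of the following dependencies is installed, the application is identified as an LLM service:
  - vllm
  - sglang
- To force a specific type for a Python application, set the APSARA_APM_APP_TYPE environment variable. Valid values:
  - microservice: a regular microservice application
  - app: an LLM application
  - model: an LLM service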
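
The detection rules above can be summarized in a short sketch. This is not the agent's actual implementation. The function names `is_installed` and `guess_app_type`, the precedence of an LLM service over an LLM application when both kinds of packages are present, and the fallback to `microservice` are assumptions made for illustration. Only the package names, the `APSARA_APM_APP_TYPE` variable, and its three values come from the note above.

```python
import importlib.util
import os

LLM_APP_PACKAGES = ("openai", "dashscope", "llama_index", "langchain")
LLM_SERVICE_PACKAGES = ("vllm", "sglang")
VALID_TYPES = ("microservice", "app", "model")


def is_installed(package: str) -> bool:
    """Return True if the top-level package can be located without importing it."""
    return importlib.util.find_spec(package) is not None


def guess_app_type(env=None, installed=is_installed) -> str:
    """Illustrative only: mirrors the documented rules, not the agent's real logic."""
    env = os.environ if env is None else env
    forced = env.get("APSARA_APM_APP_TYPE", "")
    if forced in VALID_TYPES:
        # An explicit, valid override wins over dependency detection.
        return forced
    # Assumption: an inference framework takes precedence over client libraries.
    if any(installed(p) for p in LLM_SERVICE_PACKAGES):
        return "model"
    if any(installed(p) for p in LLM_APP_PACKAGES):
        return "app"
    # Assumption: anything else is treated as a regular microservice.
    return "microservice"


if __name__ == "__main__":
    print(guess_app_type())
```

Running the sketch in the same virtual environment as the application gives a rough idea of which type to expect in the console. If the result differs from what you intend, setting `APSARA_APM_APP_TYPE` explicitly is the documented way to force it.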

Result

After about one minute, if the Python application appears on the LLM Application Monitoring > Application List page of the ARMS console and data is being reported, the connection is successful.

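Configuration

Collection of input and output content

Default value: True. Collection is enabled by default.

Effect when disabled: Only the size of detail fields, such as the inputs and outputs of models, tools, and knowledge bases, is collected. The content of these fields is not collected.

Applicable plug-ins: This configuration is supported only for Dify and LangChain.

Configuration: Set the environment variable OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=False.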

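LLM application splitting

Default value: False. Application splitting is disabled by default.

Effect when enabled: Reported data is split into LLM sub-applications. Each LLM application, such as a Dify Workflow, Agent, or Chat App, corresponds to a separate ARMS application.

Applicable plug-ins: This configuration is supported only for Dify.

Configuration: Set the environment variable PROFILER_GENAI_SPLITAPP_ENABLE=True.

Supported regions: Heyuan and Singapore.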

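Length limit for message content fields

Default value: 4,096 characters.

Effect: This setting limits the length of LLM message content, such as the content of input and output message fields. Content that exceeds the specified character limit is truncated.

Applicable plug-ins: This configuration is supported only for Dify and LangChain.

Configuration: If the agent version is 1.8.3 or later, set the environment variable OTEL_INSTRUMENTATION_GENAI_MESSAGE_CONTENT_MAX_LENGTH=<integer_value>. Replace <integer_value> with an integer that specifies the desired character limit.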
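
As a rough illustration of what a character limit means for a reported message, the following sketch cuts a string at a fixed number of characters. The helper `truncate_content` is hypothetical and is not part of the agent. Whether the agent appends a marker to truncated content, or measures length in some other unit, is not stated here, so the sketch assumes a plain cut at the limit.

```python
DEFAULT_MAX_LENGTH = 4096  # documented default, in characters


def truncate_content(content: str, max_length: int = DEFAULT_MAX_LENGTH) -> str:
    """Keep at most max_length characters (hypothetical helper, not agent code)."""
    if max_length < 0:
        raise ValueError("max_length must be non-negative")
    # Slicing past the end of a shorter string simply returns the whole string.
    return content[:max_length]


if __name__ == "__main__":
    prompt = "x" * 5000
    print(len(truncate_content(prompt)))        # length under the default limit
    print(len(truncate_content(prompt, 2048)))  # length under a custom limit
```

A sketch like this can help estimate how much of a long prompt would remain visible in a trace before you pick a value for the limit.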

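Length limit for span attribute values

Default value: Not set. No limit applies by default.

Effect when enabled: This setting limits the length of reported span attribute values, such as `gen_ai.agent.description`. Attribute values that exceed the specified character limit are truncated.

Applicable plug-ins: This configuration applies to all plug-ins that support OpenTelemetry, such as LangChain, DashScope, and Dify.

Configuration: Set the environment variable OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT=<integer_value>. Replace <integer_value> with an integer that specifies the desired character limit.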
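
The four settings above are plain environment variables, so they must be in place before the agent starts the application. One way to do that from Python is to build the environment explicitly and pass it to the `aliyun-instrument` command shown earlier. The helper names `build_agent_env` and `launch` are hypothetical, and the sketch assumes `aliyun-instrument` is on the PATH. The variable names and the `True`/`False` spellings are taken from this section.

```python
import os
import subprocess


def build_agent_env(base=None, capture_content=True, split_app=False,
                    message_max_length=None, span_attr_max_length=None):
    """Return a copy of the environment with the documented variables set."""
    env = dict(os.environ if base is None else base)
    # str(True) / str(False) yields "True" / "False", matching the documented values.
    env["OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT"] = str(capture_content)
    env["PROFILER_GENAI_SPLITAPP_ENABLE"] = str(split_app)
    if message_max_length is not None:
        # Documented as requiring agent version 1.8.3 or later.
        env["OTEL_INSTRUMENTATION_GENAI_MESSAGE_CONTENT_MAX_LENGTH"] = str(int(message_max_length))
    if span_attr_max_length is not None:
        env["OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT"] = str(int(span_attr_max_length))
    return env


def launch(app_path="llm_app.py", **settings):
    """Start the application through the agent with the chosen settings."""
    cmd = ["aliyun-instrument", "python", app_path]
    return subprocess.run(cmd, env=build_agent_env(**settings), check=True)
```

For example, `launch("llm_app.py", capture_content=False, message_max_length=2048)` would start the demo with content collection turned off and a shorter message limit. Keep in mind that each setting only takes effect for the plug-ins and regions listed for it above.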

Appendix

OpenAI demo

llm_app.py

import openai
from fastapi import FastAPI
import uvicorn

app = FastAPI()

@app.get("/")
def call_openai():
    client = openai.OpenAI(api_key="sk-xxx")
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Write a haiku."}],
        max_tokens=20,
    )
    return {"data": f"{response}"}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

requirements.txt

fastapi
uvicorn
openai >= 1.0.0

DashScope demo

llm_app.py

from http import HTTPStatus
import dashscope
from dashscope import Generation
from fastapi import FastAPI
import uvicorn

app = FastAPI()

@app.get("/")
def call_dashscope():
    dashscope.api_key = 'YOUR-DASHSCOPE-API-KEY'
    responses = Generation.call(model=Generation.Models.qwen_turbo,
                                prompt='How is the weather today?')
    resp = ""
    if responses.status_code == HTTPStatus.OK:
        resp = f"Result is: {responses.output}"
    else:
        resp = f"Failed request_id: {responses.request_id}, status_code: {responses.status_code}, code: {responses.code}, message: {responses.message}"
    return {"data": f"{resp}"}


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

requirements.txt

fastapi
uvicorn
dashscope >= 1.0.0

LlamaIndex demo

Store your knowledge base documents in the `data` directory in a text format such as PDF, TXT, or DOC.

llm_app.py

import time

from fastapi import FastAPI
import uvicorn
import aiohttp

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext
from llama_index.embeddings.dashscope import DashScopeEmbedding
import chromadb
import dashscope
import os
from dotenv import load_dotenv
from llama_index.core.llms import ChatMessage
from llama_index.core import VectorStoreIndex, get_response_synthesizer
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.llms.dashscope import DashScope, DashScopeGenerationModels
import random

load_dotenv()

os.environ["DASHSCOPE_API_KEY"] = 'sk-xxxxxx'
dashscope.api_key = 'sk-xxxxxxx'
api_key = 'sk-xxxxxxxx'

llm = DashScope(model_name=DashScopeGenerationModels.QWEN_MAX,api_key=api_key)

# Create a client and a new collection
chroma_client = chromadb.EphemeralClient()
chroma_collection = chroma_client.create_collection("chapters")

# Define the embedding function
embed_model = DashScopeEmbedding(model_name="text-embedding-v1", api_key=api_key)

# Load the documents
filename_fn = lambda filename: {"file_name": filename}

# Automatically set the metadata of each document according to filename_fn
documents = SimpleDirectoryReader(
    "./data/", file_metadata=filename_fn
).load_data()

# Set up ChromaVectorStore and load the data
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, embed_model=embed_model
)

retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=4,
    verbose=True
)

# Configure the response synthesizer
response_synthesizer = get_response_synthesizer(llm=llm, response_mode="refine")

# Assemble the query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

SYSTEM_PROMPT = """
You are a general knowledge chatbot for children. Your task is to generate answers based on user questions by combining the most relevant content found in the knowledge base. Do not answer subjective questions.
"""

# Initialize the conversation with a system message
messages = [ChatMessage(role="system", content=SYSTEM_PROMPT)]

app = FastAPI()


async def fetch(question):
    url = "https://www.aliyun.com"
    call_url = os.environ.get("LLM_INFRA_URL")
    if call_url is None or call_url == "":
        call_url = url
    else:
        call_url = f"{call_url}?question={question}"
    print(call_url)
    async with aiohttp.ClientSession() as session:
        async with session.get(call_url) as response:
            print(f"GET Status: {response.status}")
            data = await response.text()
            print(f"GET Response JSON: {data}")
            return data


@app.get("/heatbeat")
def heatbeat():
    # Return a dict so that the response is a JSON object, not a serialized set
    return {"msg": "ok"}


cnt = 0


@app.get("/query")
async def call(question: str = None):
    global cnt
    cnt += 1
    if cnt == 20:
        cnt = 0
        raise BaseException("query is over limit,20 ", 401)
    # Build a user message from the question
    message = ChatMessage(role="user", content=question)
    # Convert the message to a string
    message_string = f"{message.role}:{message.content}"

    search = await fetch(question)
    print(f"search:{search}")
    resp = query_engine.query(message_string)
    print(resp)
    return {"data": str(resp)}


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

requirements.txt

fastapi
uvicorn
numpy==1.23.5
llama-index==0.10.62
llama-index-core==0.10.28
llama-index-embeddings-dashscope==0.1.3
llama-index-llms-dashscope==0.1.2
llama-index-vector-stores-chroma==0.1.6
aiohttp

LangChain demo

llm_app.py

from fastapi import FastAPI
from langchain.llms.fake import FakeListLLM
import uvicorn
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

app = FastAPI()
llm = FakeListLLM(responses=["I'll callback later.", "You 'console' them!"])

template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate(template=template, input_variables=["question"])

llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"

@app.get("/")
def call_langchain():
    res = llm_chain.run(question)
    return {"data": res}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

requirements.txt

fastapi
uvicorn
langchain
langchain_community

Dify demo

To quickly build a Dify application, see "Build a customized AI Q&A assistant for web pages by using Dify".