Deploy an MLLM inference service with EAS

Multimodal Large Language Models (MLLMs) process text, images, and audio for cross-modal understanding. With Elastic Algorithm Service (EAS), you can deploy an MLLM inference service in under five minutes and start running inference immediately.

Background

Large Language Models (LLMs) excel at text tasks such as generation, sentiment analysis, and translation, but text-only models cannot process images, audio, or video.

Multimodal Large Language Models (MLLMs) close this gap by processing text, images, and audio together. Models such as GPT-4o have driven widespread industry adoption.

EAS provides a one-click solution to deploy popular MLLM inference services in under five minutes.

Prerequisites

PAI is activated and a default workspace is created. For more information, see Activate PAI and create a default workspace.
If you use a RAM user to deploy the model, grant the RAM user the required EAS management permissions. For more information, see Cloud product dependencies and authorization: EAS.

Deploy an EAS service

Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).
Click Deploy Service. In the Custom Model Deployment section, click Custom Deployment.

On the Custom Deployment page, configure the following parameters. The full parameter list is in Custom Deployment.

Section	Parameter	Description
Environment Information	Deployment Method	Select Image-based Deployment and select the Enable Web App checkbox.
	Image Configuration	From the Alibaba Cloud Image list, select chat-mllm-webui > chat-mllm-webui:1.0. Note: Image versions are updated frequently. Select the latest version.
	Command to Run	The command auto-populates after image selection. Change the model_type parameter to deploy a different model. Supported models are listed in the following table.
Resource Information	Deployment	Select a GPU-accelerated instance type. Recommended: ml.gu7i.c16m60.1-gu30 for cost-effectiveness.

Supported models

model_type	Model link
qwen_vl_chat	qwen/Qwen-VL-Chat
qwen_vl_chat_int4	qwen/Qwen-VL-Chat-Int4
qwen_vl	qwen/Qwen-VL
glm4v_9b_chat	ZhipuAI/glm-4v-9b
llava1_5-7b-instruct	swift/llava-1___5-7b-hf
llava1_5-13b-instruct	swift/llava-1___5-13b-hf
internvl_chat_v1_5_int8	AI-ModelScope/InternVL-Chat-V1-5-int8
internvl-chat-v1_5	AI-ModelScope/InternVL-Chat-V1-5
mini-internvl-chat-2b-v1_5	OpenGVLab/Mini-InternVL-Chat-2B-V1-5
mini-internvl-chat-4b-v1_5	OpenGVLab/Mini-InternVL-Chat-4B-V1-5
internvl2-2b	OpenGVLab/InternVL2-2B
internvl2-4b	OpenGVLab/InternVL2-4B
internvl2-8b	OpenGVLab/InternVL2-8B
internvl2-26b	OpenGVLab/InternVL2-26B
internvl2-40b	OpenGVLab/InternVL2-40B
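
To reduce the chance of a typo when you change model_type in the command, you can check the value against the table above first. The following is a minimal sketch that copies the table into a dictionary. The helper name resolve_model_link is hypothetical and is not part of EAS or the image.

```python
# model_type values and model links copied from the "Supported models" table.
SUPPORTED_MODELS = {
    "qwen_vl_chat": "qwen/Qwen-VL-Chat",
    "qwen_vl_chat_int4": "qwen/Qwen-VL-Chat-Int4",
    "qwen_vl": "qwen/Qwen-VL",
    "glm4v_9b_chat": "ZhipuAI/glm-4v-9b",
    "llava1_5-7b-instruct": "swift/llava-1___5-7b-hf",
    "llava1_5-13b-instruct": "swift/llava-1___5-13b-hf",
    "internvl_chat_v1_5_int8": "AI-ModelScope/InternVL-Chat-V1-5-int8",
    "internvl-chat-v1_5": "AI-ModelScope/InternVL-Chat-V1-5",
    "mini-internvl-chat-2b-v1_5": "OpenGVLab/Mini-InternVL-Chat-2B-V1-5",
    "mini-internvl-chat-4b-v1_5": "OpenGVLab/Mini-InternVL-Chat-4B-V1-5",
    "internvl2-2b": "OpenGVLab/InternVL2-2B",
    "internvl2-4b": "OpenGVLab/InternVL2-4B",
    "internvl2-8b": "OpenGVLab/InternVL2-8B",
    "internvl2-26b": "OpenGVLab/InternVL2-26B",
    "internvl2-40b": "OpenGVLab/InternVL2-40B",
}


def resolve_model_link(model_type: str) -> str:
    """Return the model link for a model_type, or raise ValueError if it is not listed."""
    try:
        return SUPPORTED_MODELS[model_type]
    except KeyError:
        raise ValueError(f"Unsupported model_type: {model_type!r}") from None


print(resolve_model_link("internvl2-8b"))  # prints OpenGVLab/InternVL2-8B
```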

After you configure the parameters, click Deploy.

Call the service

Use the WebUI for inference

On the Elastic Algorithm Service (EAS) page, click the target service name, then click View Web App in the upper-right corner to open the WebUI.
Run an inference test on the WebUI to validate the model.

Use an API for inference

Get the endpoint and token.
1. On the Elastic Algorithm Service (EAS) page, click the target service name. In the Basic Information section, click View Endpoint Information.
2. In the Invocation Method panel, copy the token and endpoint.
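
Rather than pasting the token into source files, you may prefer to read both values from the environment. The following is a minimal sketch under that assumption. The variable names EAS_ENDPOINT and EAS_TOKEN and the helper load_eas_credentials are hypothetical; only the Authorization header comes from the sample code in this topic.

```python
import os


def load_eas_credentials(env=os.environ):
    """Read the endpoint and token from environment variables (hypothetical names)."""
    endpoint = env.get("EAS_ENDPOINT", "").rstrip("/")
    token = env.get("EAS_TOKEN", "")
    if not endpoint or not token:
        raise RuntimeError("Set EAS_ENDPOINT and EAS_TOKEN before calling the service.")
    # The sample code in this topic sends the token as the Authorization header value.
    headers = {"Authorization": token}
    return endpoint, headers
```

The returned pair can then be used in place of the hosts and head variables in the sample code.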

Use an API to run model inference.

EAS provides three API operations: infer forward, get chat history, and clear chat history.

Infer forward

Retrieves the inference result.

Note

WebUI and API calls cannot run at the same time. If you have used the WebUI, call the clear chat history operation first, and then call infer forward.

Replace the following parameters in the sample code.

Parameter	Description
hosts	The endpoint obtained in Step 1.
authorization	The token obtained in Step 1.
prompt	The question to ask the model. English is recommended.
image_path	The local path of the image.

Request and response parameters

The following table describes the request parameters.

Parameter	Type	Description	Default
prompt	String	The content of the question.	None. This parameter is required.
image	String (Base64-encoded)	The input image.	None
chat_history	List[List]	The chat history.	[]
temperature	Float	Controls output randomness. Higher values increase randomness; 0 makes output deterministic. Range: 0 to 1.	0.2
top_p	Float	Samples from tokens with cumulative probability up to top_p.	0.7
max_output_tokens	Int	Maximum number of output tokens.	512
use_stream	Bool	Specifies whether to stream the output. Valid values: True and False.	True

The output is a string containing the answer.
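
The request parameters above can be assembled and checked before they are sent. The following is a minimal sketch that applies the documented defaults and the documented temperature range. The helper name build_infer_payload is hypothetical, and the checks are client-side conveniences, not behavior confirmed for the service.

```python
def build_infer_payload(prompt, image=None, chat_history=None, temperature=0.2,
                        top_p=0.7, max_output_tokens=512, use_stream=True):
    """Build the JSON body for infer forward using the documented defaults."""
    if not prompt:
        raise ValueError("prompt is required")
    if not 0 <= temperature <= 1:
        raise ValueError("temperature must be between 0 and 1")
    return {
        "prompt": prompt,
        "image": image,  # Base64-encoded string, or None for a text-only question
        "chat_history": [] if chat_history is None else chat_history,
        "temperature": temperature,
        "top_p": top_p,
        "max_output_tokens": max_output_tokens,
        "use_stream": use_stream,
    }


payload = build_infer_payload("Please describe the image", use_stream=False)
```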

The following Python code provides an example:

import requests
import json
import base64


def post_get_history(url='http://127.0.0.1:7860', headers=None):
    r = requests.post(f'{url}/get_history', headers=headers, timeout=1500)
    data = r.content.decode('utf-8')
    return data


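def post_infer(prompt, image=None, chat_history=None, temperature=0.2, top_p=0.7, max_output_tokens=512, use_stream=True, url='http://127.0.0.1:7860', headers=None):
    # Use None defaults so that mutable default arguments are not shared between calls.
    chat_history = [] if chat_history is None else chat_history
    # Copy the headers so that the caller's dictionary is not modified.
    headers = dict(headers or {})
    datas = {
        "prompt": prompt,
        "image": image,
        "chat_history": chat_history,
        "temperature": temperature,
        "top_p": top_p,
        "max_output_tokens": max_output_tokens,
        "use_stream": use_stream,
    }

    if use_stream:
        headers['Accept'] = 'text/event-stream'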

        response = requests.post(f'{url}/infer_forward', json=datas, headers=headers, stream=True, timeout=1500)

        if response.status_code != 200:
            print(f"Request failed with status code {response.status_code}")
            return
        process_stream(response)

    else:
        r = requests.post(f'{url}/infer_forward', json=datas, headers=headers, timeout=1500)
        data = r.content.decode('utf-8')

        print(data)


def image_to_base64(image_path):
    """
    Convert an image file to a Base64 encoded string.

    :param image_path: The file path to the image.
    :return: A Base64 encoded string representation of the image.
    """
    with open(image_path, "rb") as image_file:
        # Read the binary data of the image
        image_data = image_file.read()
        # Encode the binary data to Base64
        base64_encoded_data = base64.b64encode(image_data)
        # Convert the Base64 bytes to a string (b64encode output contains no newlines)
        base64_string = base64_encoded_data.decode('utf-8')
    return base64_string


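def process_stream(response, previous_text=""):
    MARK_RESPONSE_END = '##END'  # Do not change
    current_response = ""

    # Let requests decode incrementally so that a multi-byte UTF-8 character
    # split across two chunks does not raise UnicodeDecodeError.
    response.encoding = 'utf-8'
    for text in response.iter_content(chunk_size=100, decode_unicode=True):
        if text:
            current_response += text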

            parts = current_response.split(MARK_RESPONSE_END)
            for part in parts[:-1]:
                new_part = part[len(previous_text):]
                if new_part:
                    print(new_part, end='', flush=True)

                previous_text = part

            current_response = parts[-1]

    remaining_new_text = current_response[len(previous_text):]
    if remaining_new_text:
        print(remaining_new_text, end='', flush=True)


if __name__ == '__main__':
    # Replace <service_url> with the endpoint.
    hosts = '<service_url>'
    # Replace <token> with the token.
    head = {
        'Authorization': '<token>'
    }

    # Get chat history
    chat_history = json.loads(post_get_history(url=hosts, headers=head))['chat_history']

    # The content of the question. We recommend that you use English.
    prompt = 'Please describe the image'
    # Replace path_to_your_image with the local path of your image.
    image_path = 'path_to_your_image'
    image_base_64 = image_to_base64(image_path)

    post_infer(prompt=prompt, image=image_base_64, chat_history=chat_history, use_stream=False, url=hosts, headers=head)
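
The streaming branch of the sample can be hard to follow, so the following sketch reproduces its parsing logic on in-memory byte chunks instead of a live response. It assumes, based only on how process_stream slices the text, that each segment terminated by '##END' is a cumulative snapshot of the answer so far. The function name iter_new_text and the sample chunks are hypothetical.

```python
import codecs

MARK_RESPONSE_END = '##END'  # Same marker as in the sample code


def iter_new_text(chunks, marker=MARK_RESPONSE_END):
    """Yield only the text that is new in each marker-terminated snapshot."""
    decoder = codecs.getincrementaldecoder('utf-8')()
    pending = ''   # Text received but not yet terminated by the marker
    previous = ''  # Last complete snapshot that was emitted
    for chunk in chunks:
        # Incremental decoding tolerates multi-byte characters split across chunks.
        pending += decoder.decode(chunk)
        parts = pending.split(marker)
        for part in parts[:-1]:
            new_text = part[len(previous):]
            if new_text:
                yield new_text
            previous = part
        pending = parts[-1]
    pending += decoder.decode(b'', final=True)
    tail = pending[len(previous):]
    if tail:
        yield tail


# Made-up chunks: two snapshots, "Hello" and then "Hello world".
chunks = [b'Hel', b'lo##ENDHello wor', b'ld##END']
print(''.join(iter_new_text(chunks)))  # prints Hello world
```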

Get chat history

Retrieves the chat history.

Replace the following parameters in the sample code.

Parameter	Description
hosts	The endpoint obtained in Step 1.
authorization	The token obtained in Step 1.

This operation requires no input parameters.

Output parameters:

Parameter	Type	Description
chat_history	List[List]	The chat history.

The following Python code provides an example:

import requests
import json

def post_get_history(url='http://127.0.0.1:7860', headers=None):
    r = requests.post(f'{url}/get_history', headers=headers, timeout=1500)
    data = r.content.decode('utf-8')
    return data


if __name__ == '__main__':
    # Replace <service_url> with the endpoint.
    hosts = '<service_url>'
    # Replace <token> with the token.
    head = {
        'Authorization': '<token>'
    }

    chat_history = json.loads(post_get_history(url=hosts, headers=head))['chat_history']
    print(chat_history)
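
If the service returns an error page or an unexpected body, json.loads(...)['chat_history'] fails with a generic exception. The following is a minimal sketch of a stricter parser, assuming only what the sample code shows: the body is JSON with a chat_history key that holds a list. The helper name parse_chat_history and the sample body are hypothetical.

```python
import json


def parse_chat_history(body):
    """Parse the get_history response body and return the chat_history list."""
    data = json.loads(body)
    if not isinstance(data, dict) or not isinstance(data.get("chat_history"), list):
        raise ValueError("Response does not contain a chat_history list")
    return data["chat_history"]


# Made-up body for illustration; an empty history is the simplest valid case.
print(parse_chat_history('{"chat_history": []}'))  # prints []
```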

Clear chat history

Clears the chat history.

Replace the following parameters in the sample code.

Parameter	Description
hosts	The endpoint obtained in Step 1.
authorization	The token obtained in Step 1.

This operation requires no input parameters.
The operation returns a "success" string.

The following Python code provides an example:

import requests


def post_clear_history(url='http://127.0.0.1:7860', headers=None):
    r = requests.post(f'{url}/clear_history', headers=headers, timeout=1500)
    data = r.content.decode('utf-8')
    return data


if __name__ == '__main__':
    # Replace <service_url> with the endpoint.
    hosts = '<service_url>'
    # Replace <token> with the token.
    head = {
        'Authorization': '<token>'
    }
    clear_info = post_clear_history(url=hosts, headers=head)
    print(clear_info)
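
Because WebUI and API calls cannot run simultaneously, a typical API session clears the chat history and then calls infer forward. The following is a minimal sketch of that sequence. The helper name clear_then_infer is hypothetical, and the post argument is assumed to be requests.post or any callable with the same signature, which keeps the sketch testable without a live service.

```python
def clear_then_infer(post, url, headers, payload, timeout=1500):
    """Clear the chat history, then send one infer forward request.

    post: requests.post, or a compatible callable (an assumption of this sketch).
    payload: request body; set "use_stream" to False to read the answer at once.
    """
    cleared = post(f"{url}/clear_history", headers=headers, timeout=timeout)
    cleared.raise_for_status()
    answer = post(f"{url}/infer_forward", json=payload, headers=headers, timeout=timeout)
    answer.raise_for_status()
    return answer.text
```

For example, clear_then_infer(requests.post, hosts, head, {"prompt": "Please describe the image", "image": image_base_64, "use_stream": False}) would return the answer string, assuming the variables from the infer forward sample.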