Platform For AI: Deploy MLLM applications with EAS

Last Updated: Jun 04, 2026

Multimodal Large Language Models (MLLMs) process text, images, and audio for cross-modal understanding. With Elastic Algorithm Service (EAS), you can deploy an MLLM inference service in under five minutes and start running inference immediately.

Background

Large Language Models (LLMs) excel at text tasks such as generation, sentiment analysis, and translation, but they cannot process images, audio, or video.

Multimodal Large Language Models (MLLMs) close this gap by processing text, images, and audio together. Models such as GPT-4o have driven widespread industry adoption.

EAS provides a one-click solution to deploy popular MLLM inference services in under five minutes.

Prerequisites

Deploy an EAS service

  1. Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).

  2. Click Deploy Service. In the Custom Model Deployment section, click Custom Deployment.

  3. On the Custom Deployment page, configure the following parameters. For the full parameter list, see Custom Deployment.

    | Section | Parameter | Description |
    | --- | --- | --- |
    | Environment Information | Deployment Method | Select Image-based Deployment and select the Enable Web App checkbox. |
    | Environment Information | Image Configuration | From the Alibaba Cloud Image list, select chat-mllm-webui > chat-mllm-webui:1.0. Note: Image versions are updated frequently. Select the latest version. |
    | Environment Information | Command to Run | The command auto-populates after image selection. Change the model_type parameter to deploy a different model. Supported models are listed in the following table. |
    | Resource Information | Deployment | Select a GPU-accelerated instance type. Recommended: ml.gu7i.c16m60.1-gu30 for cost-effectiveness. |

    Supported models

    | model_type | Model link |
    | --- | --- |
    | qwen_vl_chat | qwen/Qwen-VL-Chat |
    | qwen_vl_chat_int4 | qwen/Qwen-VL-Chat-Int4 |
    | qwen_vl | qwen/Qwen-VL |
    | glm4v_9b_chat | ZhipuAI/glm-4v-9b |
    | llava1_5-7b-instruct | swift/llava-1___5-7b-hf |
    | llava1_5-13b-instruct | swift/llava-1___5-13b-hf |
    | internvl_chat_v1_5_int8 | AI-ModelScope/InternVL-Chat-V1-5-int8 |
    | internvl-chat-v1_5 | AI-ModelScope/InternVL-Chat-V1-5 |
    | mini-internvl-chat-2b-v1_5 | OpenGVLab/Mini-InternVL-Chat-2B-V1-5 |
    | mini-internvl-chat-4b-v1_5 | OpenGVLab/Mini-InternVL-Chat-4B-V1-5 |
    | internvl2-2b | OpenGVLab/InternVL2-2B |
    | internvl2-4b | OpenGVLab/InternVL2-4B |
    | internvl2-8b | OpenGVLab/InternVL2-8B |
    | internvl2-26b | OpenGVLab/InternVL2-26B |
    | internvl2-40b | OpenGVLab/InternVL2-40B |

  4. After you configure the parameters, click Deploy.
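
Because a mistyped model_type only surfaces once the service starts, it can help to check the value before editing the Command to Run field. The following is a minimal sketch, not part of EAS: the dictionary is copied from the Supported models list in step 3, and the helper name `resolve_model_link` is hypothetical.

```python
# Supported model_type values and their model links, copied from the
# "Supported models" list in step 3 of this guide.
SUPPORTED_MODELS = {
    "qwen_vl_chat": "qwen/Qwen-VL-Chat",
    "qwen_vl_chat_int4": "qwen/Qwen-VL-Chat-Int4",
    "qwen_vl": "qwen/Qwen-VL",
    "glm4v_9b_chat": "ZhipuAI/glm-4v-9b",
    "llava1_5-7b-instruct": "swift/llava-1___5-7b-hf",
    "llava1_5-13b-instruct": "swift/llava-1___5-13b-hf",
    "internvl_chat_v1_5_int8": "AI-ModelScope/InternVL-Chat-V1-5-int8",
    "internvl-chat-v1_5": "AI-ModelScope/InternVL-Chat-V1-5",
    "mini-internvl-chat-2b-v1_5": "OpenGVLab/Mini-InternVL-Chat-2B-V1-5",
    "mini-internvl-chat-4b-v1_5": "OpenGVLab/Mini-InternVL-Chat-4B-V1-5",
    "internvl2-2b": "OpenGVLab/InternVL2-2B",
    "internvl2-4b": "OpenGVLab/InternVL2-4B",
    "internvl2-8b": "OpenGVLab/InternVL2-8B",
    "internvl2-26b": "OpenGVLab/InternVL2-26B",
    "internvl2-40b": "OpenGVLab/InternVL2-40B",
}


def resolve_model_link(model_type):
    """Return the model link for model_type, or raise ValueError if it is not listed."""
    try:
        return SUPPORTED_MODELS[model_type]
    except KeyError:
        supported = ", ".join(sorted(SUPPORTED_MODELS))
        raise ValueError(
            f"Unsupported model_type {model_type!r}. Supported values: {supported}"
        ) from None
```

Note that the names mix underscores and hyphens (for example, internvl_chat_v1_5_int8 but internvl-chat-v1_5), so an exact-match lookup like this catches a common kind of typo.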

Call the service

Use the WebUI for inference

  1. On the Elastic Algorithm Service (EAS) page, click the target service name, then click View Web App in the upper-right corner to open the WebUI.

  2. Run an inference test on the WebUI to validate the model.

Use an API for inference

  1. Get the endpoint and token.

    1. On the Elastic Algorithm Service (EAS) page, click the target service name. In the Basic Information section, click View Endpoint Information.

    2. In the Invocation Method panel, copy the token and endpoint.

  2. Use an API to run model inference.

    EAS provides three API operations: Infer forward, Get chat history, and Clear chat history.

    Infer forward

    Retrieves the inference result.

    Note

    The WebUI and the API cannot be used at the same time. If you have already used the WebUI, call Clear chat history first, and then call Infer forward.

    Replace the following parameters in the sample code.

    | Parameter | Description |
    | --- | --- |
    | hosts | The endpoint obtained in Step 1. |
    | authorization | The token obtained in Step 1. |
    | prompt | Your question content. English is recommended. |
    | image_path | The local path of the image. |

    Request and response parameters

    • The following table describes the request parameters.

      | Parameter | Type | Description | Default |
      | --- | --- | --- | --- |
      | prompt | String | The content of the question. | None. This parameter is required. |
      | image | Base64-encoded string | The input image. | None |
      | chat_history | List[List] | The chat history. | [] |
      | temperature | Float | Controls output randomness. Higher values increase randomness; 0 makes output deterministic. Range: 0 to 1. | 0.2 |
      | top_p | Float | Samples from tokens with cumulative probability up to top_p. | 0.7 |
      | max_output_tokens | Int | Maximum number of output tokens. | 512 |
      | use_stream | Bool | Enables streaming output. Valid values: True and False. | True |

    • The output is a string containing the answer.
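
As a rough illustration of the request parameters listed above, the request body could be assembled and checked locally before it is sent. This is a minimal sketch: the helper name `build_infer_payload` is hypothetical, and it checks only what the table states (prompt is required, temperature ranges from 0 to 1).

```python
def build_infer_payload(prompt, image=None, chat_history=None, temperature=0.2,
                        top_p=0.7, max_output_tokens=512, use_stream=True):
    """Build the JSON body for infer_forward using the documented defaults."""
    # prompt is the only required parameter.
    if not isinstance(prompt, str) or not prompt:
        raise ValueError("prompt is required and must be a non-empty string")
    # The documented range for temperature is 0 to 1.
    if not 0 <= temperature <= 1:
        raise ValueError("temperature must be between 0 and 1")
    return {
        "prompt": prompt,
        "image": image,  # Base64-encoded string, or None for a text-only question
        "chat_history": [] if chat_history is None else chat_history,
        "temperature": temperature,
        "top_p": top_p,
        "max_output_tokens": max_output_tokens,
        "use_stream": use_stream,
    }
```

The returned dictionary has the same keys as the `datas` dictionary in the sample code, so it could be passed as the `json` argument of `requests.post`.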

    The following Python code provides an example:

    import requests
    import json
    import base64
    
    
    def post_get_history(url='http://127.0.0.1:7860', headers=None):
        r = requests.post(f'{url}/get_history', headers=headers, timeout=1500)
        data = r.content.decode('utf-8')
        return data
    
    
    def post_infer(prompt, image=None, chat_history=None, temperature=0.2, top_p=0.7, max_output_tokens=512, use_stream=True, url='http://127.0.0.1:7860', headers=None):
        # Work on a copy so that neither a shared default nor the caller's dict is mutated.
        headers = dict(headers or {})
        datas = {
            "prompt": prompt,
            "image": image,
            "chat_history": chat_history if chat_history is not None else [],
            "temperature": temperature,
            "top_p": top_p,
            "max_output_tokens": max_output_tokens,
            "use_stream": use_stream,
        }
    
        if use_stream:
            # Request a streamed response and print it as it arrives.
            headers['Accept'] = 'text/event-stream'
    
            response = requests.post(f'{url}/infer_forward', json=datas, headers=headers, stream=True, timeout=1500)
    
            if response.status_code != 200:
                print(f"Request failed with status code {response.status_code}")
                return
            process_stream(response)
    
        else:
            # Wait for the complete answer and print it in one piece.
            r = requests.post(f'{url}/infer_forward', json=datas, headers=headers, timeout=1500)
            data = r.content.decode('utf-8')
    
            print(data)
    
    
    def image_to_base64(image_path):
        """
        Convert an image file to a Base64 encoded string.
    
        :param image_path: The file path to the image.
        :return: A Base64 encoded string representation of the image.
        """
        with open(image_path, "rb") as image_file:
            # Read the binary data of the image
            image_data = image_file.read()
            # Encode the binary data to Base64
            base64_encoded_data = base64.b64encode(image_data)
            # Convert the Base64 bytes to a str (b64encode output contains no newlines)
            base64_string = base64_encoded_data.decode('utf-8')
        return base64_string
    
    
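    def process_stream(response, previous_text=""):
        MARK_RESPONSE_END = '##END'  # DO NOT CHANGE
        current_response = ""
    
        # Let requests decode incrementally so that a multi-byte UTF-8 character
        # split across two chunks does not raise UnicodeDecodeError.
        response.encoding = 'utf-8'
        for text in response.iter_content(chunk_size=100, decode_unicode=True):
            if text:
                current_response += text
    
                # Each segment that ends with the marker is treated as the answer so far;
                # print only the portion that has not been printed yet.
                parts = current_response.split(MARK_RESPONSE_END)
                for part in parts[:-1]:
                    new_part = part[len(previous_text):]
                    if new_part:
                        print(new_part, end='', flush=True)
    
                    previous_text = part
    
                # Keep the trailing, not yet terminated segment for the next chunk.
                current_response = parts[-1]
    
        remaining_new_text = current_response[len(previous_text):]
        if remaining_new_text:
            print(remaining_new_text, end='', flush=True)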
    
    
    if __name__ == '__main__':
        # Replace <service_url> with the endpoint.
        hosts = '<service_url>'
        # Replace <token> with the token.
        head = {
            'Authorization': '<token>'
        }
    
        # Get chat history
        chat_history = json.loads(post_get_history(url=hosts, headers=head))['chat_history']
    
        # The content of the question. We recommend that you use English.
        prompt = 'Please describe the image'
        # Replace path_to_your_image with the local path of your image.
        image_path = 'path_to_your_image'
        image_base_64 = image_to_base64(image_path)
    
        post_infer(prompt=prompt, image=image_base_64, chat_history=chat_history, use_stream=False, url=hosts, headers=head)
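
The streaming branch of the sample code relies on the `##END` marker. Judging from `process_stream`, each marker-terminated segment appears to contain the answer generated so far, so only the unseen suffix should be shown. The following sketch isolates that logic in a generator that can be tried without a running service; the name `iter_new_text` is hypothetical, and the cumulative-segment behavior is inferred from the sample code rather than from a protocol specification.

```python
MARK_RESPONSE_END = "##END"  # Same marker as in the sample code


def iter_new_text(chunks, marker=MARK_RESPONSE_END):
    """Yield only the not-yet-seen text from a stream of cumulative segments.

    chunks is any iterable of already decoded strings, for example the
    result of response.iter_content(..., decode_unicode=True).
    """
    pending = ""   # Text received after the last complete marker
    previous = ""  # Last complete segment that was already emitted
    for chunk in chunks:
        pending += chunk
        # Everything before the final split element is a complete segment.
        *complete, pending = pending.split(marker)
        for segment in complete:
            new_text = segment[len(previous):]
            if new_text:
                yield new_text
            previous = segment
    # Emit whatever is left if the stream ended without a final marker.
    tail = pending[len(previous):]
    if tail:
        yield tail
```

Buffering the unterminated tail in `pending` also covers the case where the marker itself is split across two chunks.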
    

    Get chat history

    Retrieves the chat history.

    • Replace the following parameters in the sample code.

      | Parameter | Description |
      | --- | --- |
      | hosts | The endpoint obtained in Step 1. |
      | authorization | The token obtained in Step 1. |

    • This operation requires no input parameters.

    • Output parameters:

      | Parameter | Type | Description |
      | --- | --- | --- |
      | chat_history | List[List] | The chat history. |

    The following Python code provides an example:

    import requests
    import json
    
    def post_get_history(url='http://127.0.0.1:7860', headers=None):
        r = requests.post(f'{url}/get_history', headers=headers, timeout=1500)
        data = r.content.decode('utf-8')
        return data
    
    
    if __name__ == '__main__':
        # Replace <service_url> with the endpoint.
        hosts = '<service_url>'
        # Replace <token> with the token.
        head = {
            'Authorization': '<token>'
        }
    
        chat_history = json.loads(post_get_history(url=hosts, headers=head))['chat_history']
        print(chat_history)
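
To make the returned chat_history easier to read, the raw response could be turned into numbered turns. This is a minimal sketch under a loudly stated assumption: the document only says that chat_history is a List[List], so treating each inner list as a `[question, answer]` pair is a guess, and the helper name `format_history` is hypothetical.

```python
import json


def format_history(raw_response):
    """Return one readable line per turn from a get_history response body."""
    chat_history = json.loads(raw_response).get("chat_history", [])
    lines = []
    for number, turn in enumerate(chat_history, start=1):
        # ASSUMPTION: each inner list is [question, answer]. The documented
        # type is only List[List], so adjust this if the service differs.
        question, answer = turn[0], turn[1]
        lines.append(f"{number}. Q: {question} | A: {answer}")
    return lines
```

With the sample code above, this could be called as `format_history(post_get_history(url=hosts, headers=head))`.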
    

    Clear chat history

    Clears the chat history.

    • Replace the following parameters in the sample code.

      | Parameter | Description |
      | --- | --- |
      | hosts | The endpoint obtained in Step 1. |
      | authorization | The token obtained in Step 1. |

    • This operation requires no input parameters.

    • The operation returns a "success" string.

    The following Python code provides an example:

    import requests
    
    
    def post_clear_history(url='http://127.0.0.1:7860', headers=None):
        r = requests.post(f'{url}/clear_history', headers=headers, timeout=1500)
        data = r.content.decode('utf-8')
        return data
    
    
    if __name__ == '__main__':
        # Replace <service_url> with the endpoint.
        hosts = '<service_url>'
        # Replace <token> with the token.
        head = {
            'Authorization': '<token>'
        }
        clear_info = post_clear_history(url=hosts, headers=head)
        print(clear_info)
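
Since the chat history has to be cleared before Infer forward when the WebUI was used earlier, the two calls can be chained. The following is a minimal sketch, not an official client: the helper name `reset_and_infer` is hypothetical, and `session` is assumed to be a `requests.Session` or any object with a compatible `post` method.

```python
def reset_and_infer(session, url, token, prompt, image_b64=None):
    """Clear the chat history, then send one non-streaming inference request."""
    headers = {"Authorization": token}

    # Step 1: clear any history left over from the WebUI.
    cleared = session.post(f"{url}/clear_history", headers=headers, timeout=1500)
    cleared.raise_for_status()

    # Step 2: ask the question with an empty history and streaming disabled.
    payload = {
        "prompt": prompt,
        "image": image_b64,
        "chat_history": [],
        "use_stream": False,
    }
    answer = session.post(f"{url}/infer_forward", json=payload,
                          headers=headers, timeout=1500)
    answer.raise_for_status()
    return answer.content.decode("utf-8")
```

For example, `reset_and_infer(requests.Session(), '<service_url>', '<token>', 'Please describe the image', image_b64)` would return the answer as a string, where `image_b64` comes from `image_to_base64` in the Infer forward sample. Passing the session in as an argument also makes the function easy to exercise with a stub object in place of the network.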