Multimodal Large Language Models (MLLMs) process text, images, and audio for cross-modal understanding. With Elastic Algorithm Service (EAS), you can deploy an MLLM inference service in under five minutes and start running inference immediately.
Background
Large Language Models (LLMs) excel at text tasks such as generation, sentiment analysis, and translation, but cannot process images, audio, or video.
Multimodal Large Language Models (MLLMs) bridge this gap by processing text, images, and audio together. Models such as GPT-4o have driven widespread industry adoption.
EAS provides a one-click solution to deploy popular MLLM inference services in under five minutes.
Prerequisites
- PAI is activated and a default workspace is created. For more information, see Activate PAI and create a default workspace.
- If you use a RAM user to deploy the model, grant the RAM user the required EAS management permissions. For more information, see Cloud product dependencies and authorization: EAS.
Deploy an EAS service
- Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).
- Click Deploy Service. In the Custom Model Deployment section, click Custom Deployment.
- On the Custom Deployment page, configure the following parameters. For the full parameter list, see Custom Deployment.
| Section | Parameter | Description |
| --- | --- | --- |
| Environment Information | Deployment Method | Select Image-based Deployment and select the Enable Web App checkbox. |
| Environment Information | Image Configuration | From the Alibaba Cloud Image list, select chat-mllm-webui > chat-mllm-webui:1.0. Note: Image versions are updated frequently. Select the latest version. |
| Environment Information | Command to Run | The command auto-populates after image selection. Change the model_type parameter to deploy a different model. Supported models are listed in the following table. |
| Resource Information | Deployment | Select a GPU-accelerated instance type. Recommended: ml.gu7i.c16m60.1-gu30 for cost-effectiveness. |
- After you configure the parameters, click Deploy.
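The Command to Run setting above is changed by editing the model_type parameter. As a minimal sketch of that edit, the helper below rewrites the model_type value in a command string. The command text, the `--model_type` flag spelling, and the model names `model_a` and `model_b` are placeholders assumed for illustration. Copy the real command from the console and take the model name from the supported list.

```python
import shlex


def set_model_type(command: str, model_type: str) -> str:
    """Return `command` with its --model_type value replaced, or appended if absent."""
    tokens = shlex.split(command)
    result = []
    replaced = False
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.startswith("--model_type="):
            # Form: --model_type=value
            result.append(f"--model_type={model_type}")
            replaced = True
        elif token == "--model_type":
            # Form: --model_type value (skip the old value that follows)
            result.extend(["--model_type", model_type])
            replaced = True
            i += 1
        else:
            result.append(token)
        i += 1
    if not replaced:
        result.append(f"--model_type={model_type}")
    return shlex.join(result)


# Hypothetical command and model names, for illustration only.
example_command = "python app.py --port=8000 --model_type=model_a"
print(set_model_type(example_command, "model_b"))
```

Both flag forms are handled because the exact form of the auto-populated command is not shown in this topic.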
Call the service
Use the WebUI for inference
- On the Elastic Algorithm Service (EAS) page, click the target service name, then click View Web App in the upper-right corner to open the WebUI.
- Run an inference test on the WebUI to validate the model.

Use an API for inference
- Get the endpoint and token.
  - On the Elastic Algorithm Service (EAS) page, click the target service name. In the Basic Information section, click View Endpoint Information.
  - In the Invocation Method panel, copy the token and endpoint.
- Use an API to run model inference.
EAS provides three API operations:
Infer forward
Retrieves the inference result.
Note: WebUI and API calls cannot run simultaneously. If you used the WebUI, run clear chat history first, then run infer forward.
Replace the following parameters in the sample code.
| Parameter | Description |
| --- | --- |
| hosts | The endpoint obtained in Step 1. |
| authorization | The token obtained in Step 1. |
| prompt | Your question content. English is recommended. |
| image_path | The local path of the image. |
The following Python code provides an example:

    import base64
    import codecs
    import json

    import requests


    def post_get_history(url='http://127.0.0.1:7860', headers=None):
        """Fetch the chat history and return the raw response body as text."""
        r = requests.post(f'{url}/get_history', headers=headers, timeout=1500)
        return r.content.decode('utf-8')


    def post_infer(prompt, image=None, chat_history=None, temperature=0.2, top_p=0.7,
                   max_output_tokens=512, use_stream=True,
                   url='http://127.0.0.1:7860', headers=None):
        """Send one inference request and print the streamed or complete response."""
        # Copy the headers and avoid mutable default arguments so that
        # no state leaks between calls or back to the caller.
        headers = dict(headers or {})
        datas = {
            "prompt": prompt,
            "image": image,
            "chat_history": chat_history if chat_history is not None else [],
            "temperature": temperature,
            "top_p": top_p,
            "max_output_tokens": max_output_tokens,
            "use_stream": use_stream,
        }
        if use_stream:
            headers['Accept'] = 'text/event-stream'
            response = requests.post(f'{url}/infer_forward', json=datas,
                                     headers=headers, stream=True, timeout=1500)
            if response.status_code != 200:
                print(f"Request failed with status code {response.status_code}")
                return
            process_stream(response)
        else:
            r = requests.post(f'{url}/infer_forward', json=datas,
                              headers=headers, timeout=1500)
            print(r.content.decode('utf-8'))


    def image_to_base64(image_path):
        """
        Convert an image file to a Base64 encoded string.

        :param image_path: The file path to the image.
        :return: A Base64 encoded string representation of the image.
        """
        with open(image_path, "rb") as image_file:
            # Read the binary data and encode it; b64encode emits no newlines.
            return base64.b64encode(image_file.read()).decode('utf-8')


    def process_stream(response, previous_text=""):
        """Print only the text that is new relative to the previously printed part."""
        MARK_RESPONSE_END = '##END'  # DO NOT CHANGE
        # Decode incrementally so that a multi-byte character split across
        # two chunks does not raise a UnicodeDecodeError.
        decoder = codecs.getincrementaldecoder('utf-8')()
        current_response = ""
        for chunk in response.iter_content(chunk_size=100):
            if chunk:
                current_response += decoder.decode(chunk)
                parts = current_response.split(MARK_RESPONSE_END)
                for part in parts[:-1]:
                    new_part = part[len(previous_text):]
                    if new_part:
                        print(new_part, end='', flush=True)
                    previous_text = part
                current_response = parts[-1]
        remaining_new_text = current_response[len(previous_text):]
        if remaining_new_text:
            print(remaining_new_text, end='', flush=True)


    if __name__ == '__main__':
        # Replace <service_url> with the endpoint.
        hosts = '<service_url>'
        # Replace <token> with the token.
        head = {'Authorization': '<token>'}
        # Get chat history
        chat_history = json.loads(post_get_history(url=hosts, headers=head))['chat_history']
        # The content of the question. We recommend that you use English.
        prompt = 'Please describe the image'
        # Replace path_to_your_image with the local path of your image.
        image_path = 'path_to_your_image'
        image_base_64 = image_to_base64(image_path)
        post_infer(prompt=prompt, image=image_base_64, chat_history=chat_history,
                   use_stream=False, url=hosts, headers=head)

Get chat history
Retrieves the chat history.
- Replace the following parameters in the sample code.

| Parameter | Description |
| --- | --- |
| hosts | The endpoint obtained in Step 1. |
| authorization | The token obtained in Step 1. |

- This operation requires no input parameters.
- Output parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| chat_history | List[List] | The chat history. |
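As a rough illustration of consuming the chat_history output, the sketch below parses a get_history response body and builds one printable line per entry. The sample body is made up, and the assumption that each inner list holds the strings of one conversation turn is not confirmed by this topic. Inspect a real response before relying on it. The helper names `parse_chat_history` and `format_turns` are hypothetical.

```python
import json


def parse_chat_history(body: str) -> list:
    """Extract the chat_history list from a get_history response body."""
    payload = json.loads(body)
    history = payload.get("chat_history", [])
    if not isinstance(history, list):
        raise ValueError("chat_history is expected to be a list")
    return history


def format_turns(history: list) -> list:
    """Render each inner list as one numbered line, joining its items with ' | '."""
    lines = []
    for index, turn in enumerate(history, start=1):
        # Assumption: each inner list is one turn; None items become empty strings.
        parts = ["" if item is None else str(item) for item in turn]
        lines.append(f"[{index}] " + " | ".join(parts))
    return lines


# Made-up response body, for illustration only.
sample_body = '{"chat_history": [["Please describe the image", "A cat on a sofa."]]}'
for line in format_turns(parse_chat_history(sample_body)):
    print(line)
```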
The following Python code provides an example:

    import json

    import requests


    def post_get_history(url='http://127.0.0.1:7860', headers=None):
        """Fetch the chat history and return the raw response body as text."""
        r = requests.post(f'{url}/get_history', headers=headers, timeout=1500)
        return r.content.decode('utf-8')


    if __name__ == '__main__':
        # Replace <service_url> with the endpoint.
        hosts = '<service_url>'
        # Replace <token> with the token.
        head = {'Authorization': '<token>'}
        chat_history = json.loads(post_get_history(url=hosts, headers=head))['chat_history']
        print(chat_history)

Clear chat history
Clears the chat history.
- Replace the following parameters in the sample code.

| Parameter | Description |
| --- | --- |
| hosts | The endpoint obtained in Step 1. |
| authorization | The token obtained in Step 1. |

- This operation requires no input parameters.
- The operation returns a "success" string.
The following Python code provides an example:

    import requests


    def post_clear_history(url='http://127.0.0.1:7860', headers=None):
        """Clear the chat history and return the raw response body as text."""
        r = requests.post(f'{url}/clear_history', headers=headers, timeout=1500)
        return r.content.decode('utf-8')


    if __name__ == '__main__':
        # Replace <service_url> with the endpoint.
        hosts = '<service_url>'
        # Replace <token> with the token.
        head = {'Authorization': '<token>'}
        clear_info = post_clear_history(url=hosts, headers=head)
        print(clear_info)
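The call order from the earlier note (clear chat history first, then infer forward) can be sketched as a small wrapper. The helper names `build_infer_payload`, `fresh_infer`, and `make_post` are hypothetical. The request fields and paths mirror the sample code in this topic, and the check for "success" relies on the return value described above. The transport is passed in as a function so that the ordering can be exercised without a live service.

```python
from typing import Callable, Optional


def build_infer_payload(prompt: str, image_b64: Optional[str] = None,
                        chat_history: Optional[list] = None,
                        temperature: float = 0.2, top_p: float = 0.7,
                        max_output_tokens: int = 512,
                        use_stream: bool = False) -> dict:
    """Build the JSON body for infer_forward with the fields used in the sample code."""
    return {
        "prompt": prompt,
        "image": image_b64,
        "chat_history": chat_history if chat_history is not None else [],
        "temperature": temperature,
        "top_p": top_p,
        "max_output_tokens": max_output_tokens,
        "use_stream": use_stream,
    }


def fresh_infer(post: Callable[[str, Optional[dict]], str],
                prompt: str, image_b64: Optional[str] = None) -> str:
    """Clear any leftover history, then run one non-streaming inference."""
    cleared = post("/clear_history", None)
    if "success" not in cleared:
        raise RuntimeError(f"clear_history did not succeed: {cleared!r}")
    return post("/infer_forward", build_infer_payload(prompt, image_b64))


def make_post(base_url: str, token: str) -> Callable[[str, Optional[dict]], str]:
    """Create a transport function bound to one endpoint and token."""
    import requests  # imported here so the helpers above work without it

    def post(path: str, payload: Optional[dict]) -> str:
        r = requests.post(f"{base_url}{path}", json=payload,
                          headers={"Authorization": token}, timeout=1500)
        r.raise_for_status()
        return r.text

    return post
```

A call would then look like `fresh_infer(make_post('<service_url>', '<token>'), 'Please describe the image', image_base_64)`, with the placeholders replaced as in the samples above.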