use PAI to quickly deploy Tongyi Qianwen - Platform For AI

This topic describes how to deploy a web application based on the open source model Tongyi Qianwen and perform model inference on the web page or by using API operations in Elastic Algorithm Service (EAS) of Platform for AI (PAI).

Background information

Tongyi Qianwen-7b (Qwen-7B) is a 7 billion-parameter model of the Tongyi Qianwen foundation model series that is developed by Alibaba Cloud. Qwen-7B is a large language model (LLM) that is based on Transformer and trained on ultra-large-scale pre-training data. The pre-training data covers a wide range of data types, including a large number of texts, professional books, and code. In addition, the LLM AI assistant Qwen-7B-Chat is developed by using the alignment mechanism based on Qwen-7B.

Prerequisites

EAS is activated. The default workspace and pay-as-you-go resources are created. For more information, see Activate PAI and create a default workspace.

Deploy Qwen-7B

Perform the following steps to deploy Qwen-7B as an AI-powered web application.

Go to the EAS-Online Model Services page.
1. Log on to the PAI console.
2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace to which the model service that you want to manage belongs.
3. In the left-side navigation pane, choose Model Deployment>Elastic Algorithm Service (EAS) to go to the EAS-Online Model Services page.
On the PAI-EAS Model Online Service page, click Deploy Service. In the dialog box that appears, select Custom Deployment and click OK.

On the Deploy Service page, configure the required parameters. The following table describes key parameters.

Parameter	Description
Service Name	The name of the service. In this example, the service name qwen_demo is specified.
Deployment Mode	Select Deploy Web App by Using Image.
Select Image	Click PAI Image, select modelscope-inference from the image drop-down list, and then select 1.8.1 from the image version drop-down list.
Environment Variable	MODEL_ID: qwen/Qwen-7B-Chat TASK: chat REVISION: v1.0.5 For information about the related configurations, see the description of Qwen-7B-Chat on the ModelScope website.
Command to Run	Command: `python app.py` Port number: `8000`
Resource Group Type	Select Public Resource Group.
Resource Configuration Mode	Select General.
Resource Configuration	Click GPU and select the ml.gu7i.c16m60.1-gu30 instance type. Note In this example, the training requires an instance of the GPU type that has at least 20 GB of memory. We recommend that you use ml.gu7i.c16m60.1-gu30 to reduce costs.
Additional System Disk	Additional System Disk: 100. Unit: GB.

Click Deploy. Go to the Elastic Algorithm Service (EAS) page. When the Service Status changes to Running, the model is deployed.
Note
In most cases, deployment requires approximately 5 minutes to complete. The amount of time that is required to complete a deployment varies based on the resource availability, service load, and configuration.

Perform model inference

After you deploy the model, you can perform model inference by using different methods.

Perform model inference on the web UI

Find the service that you want to view and click View Web App in the Service Type column.
Perform model inference on the web UI.

Perform model inference by using online debugging

Click Online Debugging in the Actions column of the service that you want to view. The Online Debugging tab appears.

In the Body section, specify the request in the JSON format and click Send Request. The response is returned in the Debugging Information section on the right side.

Note

In this example, the debugging information is in the list format. The input field is the input content, and the history field is the history dialogue. The body is a list that contains two sections. The first section is the question, and the second section is the answer to the question.

You can start the inference by entering a request without the history field. Example:

{"input": "Where is the provincial capital of Zhejiang?"}

The service returns the result that contains the history field. Example:

Status Code: 200
Content-Type: application/json
Date: Mon, 14 Aug 2023 12:01:45 GMT
Server: envoy
Vary: Accept-Encoding
X-Envoy-Upstream-Service-Time: 511
Body: {"response":"The provincial capital of Zhejiang is Hangzhou. ","history":[["Where is the provincial capital of Zhejiang?","The provincial capital of Zhejiang is Hangzhou."]]}

You can include the history field in the following request to perform a continuous conversation. Example:

{"input": "What about Jiangsu?", "history": [["Where is the provincial capital of Zhejiang?","The provincial capital of Zhejiang is Hangzhou."]]}

The service returns the result. Example:

Status Code: 200
Content-Type: application/json
Date: Mon, 14 Aug 2023 12:01:23 GMT
Server: envoy
Vary: Accept-Encoding
X-Envoy-Upstream-Service-Time: 522
Body: {"response":"The provincial capital of Jiangsu is Nanjing.","history":[["Where is the provincial capital of Zhejiang?","The provincial capital of Zhejiang is Hangzhou."],[ "What about Jiangsu?","The provincial capital of Jiangsu is Nanjing."]]}

Perform model inference by using APIs

You can call the service by calling API operations.

In the Basic Information section of the Service Details tab, click View Endpoint Information. In the Invocation Method dialog box, obtain the values of the Public Endpoint and Token parameters.

Call the service based on the information that you obtained in the terminal. Example:

curl -d '{"input": "What about Jiangsu?", "history": [["Where is the provincial capital of Zhejiang?", "The provincial capital of Zhejiang is Hangzhou."]]}' -H "Authorization: xxx" http://xxxx.com

The service returns the result. Example:

{"response":"The provincial capital of Jiangsu is Nanjing.","history":[["Where is the provincial capital of Zhejiang?","The provincial capital of Zhejiang is Hangzhou."],["What about Jiangsu?","The provincial capital of Jiangsu is Nanjing."]]}

Send an HTTP request to the service based on your business requirements. For more information about debugging, refer to the SDK that is provided by PAI in the Deploy inference services topic. Sample Python code:

import requests
import json

data = {"input": "Who are you?"}
response = requests.post(url='http://qwen-demo.16623xxxxx.cn-hangzhou.pai-eas.aliyuncs.com/',
              headers={"Authorization": "yourtoken"},
              data=json.dumps(data))

print(response.text)

data = {"input": "What can you do?", "history": json.load (response.text)["history"]}


response = requests.post(url='http://qwen-demo.16623xxxxx.cn-hangzhou.pai-eas.aliyuncs.com/',
              headers={"Authorization": "yourtoken"},
              data=json.dumps(data))

print(response.text)

Perform model inference in streaming mode

In the Basic Information section of the Service Details tab, click View Endpoint Information. In the Invocation Method dialog box, obtain the values of the Public Endpoint and Token parameters.

In the terminal, run the following Python code to send a streaming request based on the information that you obtained.

#encoding=utf-8
from websockets.sync.client import connect
import os
import platform

def clear_screen():
    if platform.system() == "Windows":
        os.system("cls")
    else:
        os.system("clear")


def print_history(history):
    print("Welcome to the Qwen-7B model. Start the conversation by entering a content. Press clear to clear the conversation history and stop to terminate the program.")
    for pair in history:
        print(f"\nUser: {pair[0]}\nQwen-7B: {pair[1]}")


def main():
    history, response = [], ''
    clear_screen()
    print_history(history)
    with connect("<service_url>", additional_headers={"Authorization": "<token>"}) as websocket:

        while True:
            query = input("\nUser: ")
            if query.strip() == "stop":
                break
            websocket.send(query)
            while True:
                msg = websocket.recv()
                
                if msg == '<EOS>':
                    break
                clear_screen()
                print_history(history)
                print(f"\nUser: {query}")
                print("\nQwen-7B: ", end="")
                print(msg)
                response = msg
                
            history.append((query, response))


if __name__ == "__main__":
    main()

Replace <service_url> with the endpoint that you obtained in Step 1 and replace http in the endpoint with ws.
Replace <token> with the service token that you obtained in Step 1.

References

For more information about EAS, see Overview of online model services EAS.