Use Alibaba Cloud Serverless Kubernetes + AIGC to Build a Personal Code Assistant

This article uses open-source FastChat as an example to explain how to build a personal code assistant in ASK.

By Zibai and Dongdao

AIGC technology is setting off a worldwide wave of AI. Many similar models have emerged in the open-source community, such as FastGPT, Moss, and Stable Diffusion. These models have shown amazing results, attracting enterprises and developers to participate, but their complex and cumbersome deployment methods have become an obstacle.

Alibaba Cloud Serverless Kubernetes (ASK) provides serverless container services that help developers quickly deploy AI models without worrying about resources or environment configuration. This article uses the open-source FastChat as an example to explain how to quickly build a personal code assistant on ASK.

Effect Preview

If you find the code generation of Cursor + GPT-4 impressive, you can achieve a similar effect with FastChat + a VSCode plug-in!

  • Generate a Golang Hello World

Address: https://intranetproxy.alipay.com/skylark/lark/0/2023/gif/11431/1682574183392-11e16131-3dae-4969-a0d1-79a0a9eefb01.gif

  • Generate a Kubernetes Deployment

Address: https://intranetproxy.alipay.com/skylark/lark/0/2023/gif/11431/1682574192825-7a1d3c76-025d-45db-bea1-4ca5dd885520.gif


ASK Introduction

Alibaba Cloud Serverless Kubernetes (ASK) is a container service provided by the Alibaba Cloud Container Service team for serverless scenarios. Users can create workloads directly through the Kubernetes API, free from the operation and maintenance of nodes. As a serverless container platform, ASK has four major features: O&M-free operation, elastic scale-out, compatibility with the Kubernetes community, and strong isolation.

The main challenges in training and deploying large-scale AI applications are listed below:

  • Limited GPU Resources and High Training Costs

Large-scale AI applications require GPUs for training and inference. However, many developers lack GPU resources, and buying a GPU card or a GPU-equipped ECS instance outright is costly.

  • Resource Heterogeneity

A large number of GPU resources are required for parallel training, and these GPUs are often of different series. Different GPUs support different CUDA versions and are bound to the kernel version and nvidia-container-cli version. Developers need to pay attention to the underlying resources, making AI application development more difficult.

  • Slow Image Loading

AI application images are often tens of GB in size, and it takes tens of minutes (or hours) to download them.

ASK provides a solution to the preceding problems. In ASK, you can use Kubernetes workloads to consume GPU resources easily, without preparing them in advance, and release them immediately after use, which keeps costs low. ASK also shields the underlying resources: users do not need to care about dependencies (such as GPU and CUDA versions) and only need to care about the logic of their AI applications. In addition, ASK provides image caching by default, so when a pod is created for the second time, it can start within seconds.


1. Prerequisites

  • An ASK cluster is created. Please see Create an ASK cluster [1] for more information.
  • Download the llama-7b model and upload it to OSS. Please see the Appendix for more information.

2. Use Kubectl to Create

Replace the following variables in the YAML file:

  • ${your-ak}: your AccessKey ID (AK)
  • ${your-sk}: your AccessKey Secret (SK)
  • ${oss-endpoint-url}: the OSS endpoint
  • ${llama-oss-path}: the address where the llama-7b model is stored, with no trailing /. For example: oss://xxxx/llama-7b-hf

Then save the YAML and create the resources with kubectl apply.
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
type: Opaque
stringData:
  .ossutilconfig: |
    [Credentials]
    endpoint=${oss-endpoint-url}
    accessKeyID=${your-ak}
    accessKeySecret=${your-sk}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: fastchat
  name: fastchat
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: fastchat
  strategy:
    rollingUpdate:
      maxSurge: 100%
      maxUnavailable: 100%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: fastchat
        alibabacloud.com/eci: "true"
      annotations:
        k8s.aliyun.com/eci-use-specs: ecs.gn6e-c12g1.3xlarge
    spec:
      volumes:
      - name: data
        emptyDir: {}
      - name: oss-volume
        secret:
          secretName: oss-secret
      dnsPolicy: Default
      initContainers:
      - name: llama-7b
        image: yunqi-registry.cn-shanghai.cr.aliyuncs.com/lab/ossutil:v1
        volumeMounts:
          - name: data
            mountPath: /data
          - name: oss-volume
            mountPath: /root/
            readOnly: true
        command:
        - sh
        - -c
        - ossutil cp -r ${llama-oss-path} /data/
        resources:
          requests:
            ephemeral-storage: 50Gi
      containers:
      - command:
        - sh
        - -c
        - "/root/webui.sh"
        image: yunqi-registry.cn-shanghai.cr.aliyuncs.com/lab/fastchat:v1.0.0
        imagePullPolicy: IfNotPresent
        name: fastchat
        ports:
        - containerPort: 7860
          protocol: TCP
        - containerPort: 8000
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: 7860
          timeoutSeconds: 1
        resources:
          requests:
            cpu: "4"
            memory: 8Gi
            nvidia.com/gpu: 1
            ephemeral-storage: 100Gi
        volumeMounts:
        - mountPath: /data
          name: data
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/alibaba-cloud-loadbalancer-address-type: internet
    service.beta.kubernetes.io/alibaba-cloud-loadbalancer-instance-charge-type: PayByCLCU
  name: fastchat
  namespace: default
spec:
  externalTrafficPolicy: Local
  ports:
  - port: 7860
    protocol: TCP
    targetPort: 7860
    name: web
  - port: 8000
    protocol: TCP
    targetPort: 8000
    name: api
  selector:
    app: fastchat
  type: LoadBalancer
```
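Filling in the placeholders can also be scripted with sed before running kubectl apply. A minimal sketch, demonstrated here on a three-line sample file standing in for the saved manifest (all substituted values are invented examples, not real credentials):

```shell
# Three-line sample standing in for the saved manifest; the substituted
# values below are invented examples, not real credentials.
printf '%s\n' 'endpoint=${oss-endpoint-url}' 'accessKeyID=${your-ak}' 'accessKeySecret=${your-sk}' > sample.cfg

# Replace each ${...} placeholder in place.
sed -i \
  -e 's|${your-ak}|LTAI-EXAMPLE-AK|g' \
  -e 's|${your-sk}|EXAMPLE-SK|g' \
  -e 's|${oss-endpoint-url}|oss-cn-shanghai.aliyuncs.com|g' \
  -e 's|${llama-oss-path}|oss://xxxx/llama-7b-hf|g' \
  sample.cfg

cat sample.cfg
```

Run the same sed expressions against the manifest file you saved, then create the resources with kubectl apply -f.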

3. Wait for FastChat to Be Ready

After the pod is ready, visit http://${external-ip}:7860 in the browser.

📍 After startup, FastChat needs to download the vicuna-7b model, which is about 13 GB in size and takes about 20 minutes to download. If you create a disk snapshot containing the model in advance, you can create a disk from the snapshot and mount it to the pod, so the model is available within seconds.
```shell
kubectl get po | grep fastchat
# NAME                        READY   STATUS    RESTARTS   AGE
# fastchat-69ff78cf46-tpbvp   1/1     Running   0          20m

kubectl get svc fastchat
# NAME       TYPE           CLUSTER-IP     EXTERNAL-IP    PORT(S)          AGE
# fastchat   LoadBalancer   xxx.xx.x.xxx   xxx.xx.x.xxx   7860:31444/TCP   22m
```
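Rather than copying the address by hand, the EXTERNAL-IP column can be extracted from the service output. A sketch parsing sample text (the IP 203.0.113.7 is a documentation placeholder); on a live cluster, the jsonpath query in the comment reads the address directly:

```shell
# On a live cluster you could query the address directly with:
#   kubectl get svc fastchat -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
# Here we parse sample output so the snippet runs anywhere.
cat > svc.txt <<'EOF'
NAME       TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)                         AGE
fastchat   LoadBalancer   172.16.0.10   203.0.113.7   7860:31444/TCP,8000:32345/TCP   22m
EOF

# EXTERNAL-IP is the fourth column of the second line.
EXTERNAL_IP=$(awk 'NR==2 {print $4}' svc.txt)
echo "http://${EXTERNAL_IP}:7860"
```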


Case 1: Use FastChat through the Console

Visit http://${external-ip}:7860 in the browser to test the chat function directly. For example, use natural language to ask FastChat to write a piece of code.

Input: Write a Kubernetes Deployment Yaml file based on the NGINX image

FastChat output is shown in the following figure:


Case 2: Use FastChat through API

The FastChat API listens on port 8000. As shown below, initiate an API call with curl, and the result is returned.

  • curl command

```shell
curl http://xxx.xxx.xxx.xxx:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vicuna-7b-v1.1",
    "messages": [{"role": "user", "content": "golang generates a hello world"}]
  }'
```
  • Output result
{"id":"3xqtJcXSLnBomSWocuLW2b","object":"chat.completion","created":1682574393,"choices":[{"index":0,"message":{"role":"assistant","content":"Here is the code that generates \"Hello, World!\" in golang:\n```go\npackage main\n\nimport \"fmt\"\n\nfunc main() {\n    fmt.Println(\"Hello, World!\")\n}\n```\nAfter running this code, it prints \"Hello, World!\"."},"finish_reason":"stop"}],"usage":null}
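The generated text is nested inside the JSON response. A small sketch extracting it with python3, using a trimmed stand-in for the response above (capture the real response to a file with curl's -o flag):

```shell
# Trimmed stand-in for the API response shown above.
cat > resp.json <<'EOF'
{"choices":[{"index":0,"message":{"role":"assistant","content":"Here is the code"},"finish_reason":"stop"}]}
EOF

# Print just the assistant's message content.
python3 -c 'import json; print(json.load(open("resp.json"))["choices"][0]["message"]["content"])'
```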

Case 3: VSCode Plug-In

Now that you have an API interface, how can you integrate this capability in an integrated development environment (IDE)? You might think of Copilot, Cursor, and Tabnine. Let's integrate FastChat using the VSCode plug-in. src/extension.ts, package.json, and tsconfig.json are core files of the VSCode plug-in.


The contents of the three files are listed below:

  • src/extension.ts
```typescript
import * as vscode from 'vscode';
import axios from 'axios';

const editor = vscode.window.activeTextEditor;

export function activate(context: vscode.ExtensionContext) {
    let fastchat = async () => {
        vscode.window.showInputBox({ prompt: 'Please enter code prompt' }).then((inputValue) => {
            if (!inputValue) {
                return;
            }
            vscode.window.withProgress({
                location: vscode.ProgressLocation.Notification,
                title: 'Requesting...',
                cancellable: false
            }, (progress, token) => {
                return axios.post('http://example.com:8000/v1/chat/completions', {
                    model: 'vicuna-7b-v1.1',
                    messages: [{ role: 'user', content: inputValue }]
                }, {
                    headers: {
                        'Content-Type': 'application/json'
                    }
                }).then((response) => {
                    // Extract the code between ``` fences from the reply and
                    // insert it at the current cursor position.
                    const content = response.data.choices[0].message.content;
                    const regex = /```.*\n([\s\S]*?)```/;
                    const matches = content.match(regex);
                    if (matches && matches.length > 1) {
                        editor?.edit(editBuilder => {
                            let position = editor.selection.active;
                            position && editBuilder.insert(position, matches[1].trim());
                        });
                    }
                }).catch((error) => {
                    vscode.window.showErrorMessage(String(error));
                });
            });
        });
    };

    let command = vscode.commands.registerCommand('fastchat', fastchat);
    context.subscriptions.push(command);
}
```
  • package.json
```json
{
    "name": "fastchat",
    "version": "1.0.0",
    "publisher": "yourname",
    "engines": {
        "vscode": "^1.0.0"
    },
    "categories": [
        "Other"
    ],
    "activationEvents": [
        "onCommand:fastchat"
    ],
    "main": "./dist/extension.js",
    "contributes": {
        "commands": [
            {
                "command": "fastchat",
                "title": "fastchat code generator"
            }
        ]
    },
    "devDependencies": {
        "@types/node": "^18.16.1",
        "@types/vscode": "^1.77.0",
        "axios": "^1.3.6",
        "typescript": "^5.0.4"
    }
}
```
  • tsconfig.json
```json
{
    "compilerOptions": {
        "target": "ES2018",
        "module": "commonjs",
        "outDir": "./dist",
        "strict": true,
        "esModuleInterop": true,
        "resolveJsonModule": true,
        "declaration": true
    },
    "include": ["src/**/*"],
    "exclude": ["node_modules", "**/*.test.ts"]
}
```

Let's look at the effect after the plug-in is developed.

  • Generate a Golang Hello World

Address: https://intranetproxy.alipay.com/skylark/lark/0/2023/gif/11431/1682574183392-11e16131-3dae-4969-a0d1-79a0a9eefb01.gif

  • Generate a Kubernetes Deployment

Address: https://intranetproxy.alipay.com/skylark/lark/0/2023/gif/11431/1682574192825-7a1d3c76-025d-45db-bea1-4ca5dd885520.gif


Summary

As a serverless container platform, ASK provides O&M-free operation, auto scaling, shielding of heterogeneous resources, and image acceleration, which makes it well suited to deploying large AI models. You are welcome to try it out.


Appendix

1.  Download the llama-7b model

Model address: https://huggingface.co/decapoda-research/llama-7b-hf/tree/main

```shell
# If you are using Alibaba Cloud Elastic Compute Service (ECS), run the
# following command first to install git-lfs:
# yum install git-lfs

git lfs install
git clone https://huggingface.co/decapoda-research/llama-7b-hf
cd llama-7b-hf
git lfs pull
```

2.  Upload to OSS

You can refer to: https://www.alibabacloud.com/help/en/object-storage-service/latest/ossutil-overview


[1] Create an ASK Cluster

[2] ASK Overview
