
Container Service for Kubernetes: Accelerate data pulling for a model by using Fluid in KServe

Last Updated: Nov 01, 2024

As AI applications evolve, the model files they use grow larger and larger. As a result, high latency and cold start issues may occur when these large files are pulled from storage services such as Object Storage Service (OSS) and File Storage NAS (NAS). You can use Fluid to accelerate model file pulling and optimize the performance of inference services, especially inference services deployed by using KServe. This topic describes how to use Fluid in KServe to accelerate data pulling for a model. In this example, a Qwen-7B-Chat-Int8 model that runs on NVIDIA V100 GPUs is used.

Prerequisites

  • A Container Service for Kubernetes (ACK) Pro cluster that does not run the ContainerOS operating system is created. The ACK Pro cluster runs Kubernetes 1.22 or later and contains at least three nodes. Each node must have at least 3 GB of free memory. For more information, see Create an ACK Pro cluster.

  • The cloud-native AI suite is installed and the ack-fluid component is deployed. For more information, see Deploy the cloud-native AI suite.

  • The Arena client of version 0.9.15 or later is installed. For more information, see Configure the Arena client.

  • The ack-kserve component is installed. For more information, see Install ack-kserve.

  • OSS is activated. For more information, see Activate OSS.
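
You can quickly verify these prerequisites from a host on which kubectl and the Arena client are configured. The following commands are a minimal sketch; the fluid-system and kserve namespaces are assumptions based on the default installation of the components and may differ in your cluster.

  # Check the Kubernetes server version. It must be 1.22 or later.
  kubectl version

  # Check the Arena client version. It must be 0.9.15 or later.
  arena version

  # Confirm that the ack-fluid components are running. The fluid-system
  # namespace is an assumption based on the default installation.
  kubectl get pods -n fluid-system

  # Confirm that the ack-kserve components are running. The kserve
  # namespace is an assumption based on the default installation.
  kubectl get pods -n kserve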

Step 1: Prepare model data and upload the model data to an OSS bucket

  1. Download a model. In this example, a Qwen-7B-Chat-Int8 model is used.

    1. Run the following command to install Git:

      sudo yum install git
    2. Run the following command to install the Git Large File Storage (LFS) plug-in:

      sudo yum install git-lfs
    3. Run the following command to clone the Qwen-7B-Chat-Int8 repository from the ModelScope community to your local host:

      GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen-7B-Chat-Int8.git
    4. Run the following command to go to the directory in which the Qwen-7B-Chat-Int8 repository is stored:

      cd Qwen-7B-Chat-Int8
    5. Run the following command to download large files managed by LFS from the directory in which the Qwen-7B-Chat-Int8 repository is stored:

      git lfs pull
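
    To confirm that the large files were actually downloaded instead of being left as LFS pointer files, you can run a quick check similar to the following sketch. The expected total size of roughly 17 GiB is based on the UFS TOTAL SIZE shown in Step 4.

      # List the files in this repository that are managed by Git LFS.
      git lfs ls-files

      # Check the total size of the local repository. The model files
      # amount to roughly 17 GiB in total.
      du -sh .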
  2. Upload the downloaded Qwen-7B-Chat-Int8 model files to the OSS bucket.

    1. Log on to the OSS console and record the name of the OSS bucket that you created.

      For more information about how to create an OSS bucket, see Create a bucket.

    2. Install and configure ossutil. For more information, see Install ossutil.

    3. Run the following command to create a directory named Qwen-7B-Chat-Int8 in the OSS bucket:

      ossutil mkdir oss://<your-bucket-name>/Qwen-7B-Chat-Int8
    4. Run the following command to upload the model files to the OSS bucket:

      ossutil cp -r ./Qwen-7B-Chat-Int8 oss://<your-bucket-name>/Qwen-7B-Chat-Int8
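
    You can verify that the upload succeeded by listing the objects in the directory, as shown in the following sketch:

      # List the uploaded model files in the OSS bucket.
      ossutil ls oss://<your-bucket-name>/Qwen-7B-Chat-Int8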

Step 2: Create a dataset and a JindoRuntime

A dataset can be used to efficiently organize and access data. A JindoRuntime further accelerates data access based on a data cache policy. Used together, the dataset and JindoRuntime can greatly improve the performance of data processing and model loading.

  1. Run the following command to create a Secret to store the AccessKey pair that is used to access the OSS bucket:

    kubectl apply -f - <<EOF
    apiVersion: v1
    kind: Secret
    metadata:
      name: oss-secret
    stringData:
      fs.oss.accessKeyId: <your-accesskey-id>
      fs.oss.accessKeySecret: <your-accesskey-secret>
    EOF

    In the preceding code, the fs.oss.accessKeyId parameter specifies the AccessKey ID and the fs.oss.accessKeySecret parameter specifies the AccessKey secret. For more information about how to obtain an AccessKey pair, see Obtain an AccessKey pair.

    Expected output:

    secret/oss-secret created
  2. Create a file named resource.yaml and copy the following content to the file to create a dataset and a JindoRuntime. For more information about how to configure a dataset and a JindoRuntime, see Use JindoFS to accelerate access to OSS.

    • The dataset is used to specify information about datasets in remote storage and the underlying file system (UFS).

    • The JindoRuntime is used to start a JindoFS cluster for data caching.

    View content in the resource.yaml file

    apiVersion: data.fluid.io/v1alpha1
    kind: Dataset
    metadata:
      name: qwen-7b-chat-int8
    spec:
      mounts:
        - mountPoint: oss://<oss_bucket>/Qwen-7B-Chat-Int8 # Replace the value with the actual storage address of the model. 
          options:
            fs.oss.endpoint: <oss_endpoint> # Replace the value with the actual endpoint of the OSS bucket. 
          name: models
          path: "/"
          encryptOptions:
            - name: fs.oss.accessKeyId
              valueFrom:
                secretKeyRef:
                  name: oss-secret
                  key: fs.oss.accessKeyId
            - name: fs.oss.accessKeySecret
              valueFrom:
                secretKeyRef:
                  name: oss-secret
                  key: fs.oss.accessKeySecret
    ---
    apiVersion: data.fluid.io/v1alpha1
    kind: JindoRuntime
    metadata:
      name: qwen-7b-chat-int8 # The name must be the same as that of the dataset. 
    spec:
      replicas: 3
      tieredstore:
        levels:
          - mediumtype: MEM # Use memory to cache data. 
            volumeType: emptyDir
            path: /dev/shm
            quota: 3Gi # The cache capacity that a worker replica can provide. 
            high: "0.95"
            low: "0.7"
      fuse:
        resources:
          requests:
            memory: 2Gi
        properties:
          fs.oss.download.thread.concurrency: "200"
          fs.oss.read.buffer.size: "8388608"
          fs.oss.read.readahead.max.buffer.count: "200"
          fs.oss.read.sequence.ambiguity.range: "2147483647"
  3. Run the following command to create the dataset and JindoRuntime:

    kubectl apply -f resource.yaml

    Expected output:

    dataset.data.fluid.io/qwen-7b-chat-int8 created
    jindoruntime.data.fluid.io/qwen-7b-chat-int8 created
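
    Before you deploy the inference service, you can check whether the JindoRuntime components are ready. Optionally, you can also prewarm the cache with a Fluid DataLoad so that the first pull of the model does not have to go all the way to OSS. The following is a minimal sketch; it assumes that the dataset is in the default namespace and that the DataLoad CRD provided by Fluid is available in your cluster.

    # Check the status of the JindoRuntime. The master, worker, and FUSE
    # components should all reach the Ready phase.
    kubectl get jindoruntime qwen-7b-chat-int8

    # Optional: prewarm the cache so that the model files are cached
    # before the first inference pod starts. This sketch assumes the
    # dataset is in the default namespace.
    kubectl apply -f - <<EOF
    apiVersion: data.fluid.io/v1alpha1
    kind: DataLoad
    metadata:
      name: qwen-7b-chat-int8-warmup
    spec:
      dataset:
        name: qwen-7b-chat-int8
        namespace: default
    EOF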

Step 3: Deploy a vLLM model as an inference service

  1. Run the following command to deploy a KServe-based inference service.

    The following sample code deploys the Qwen-7B-Chat-Int8 model as an inference service by using the Vectorized Large Language Model (vLLM) framework and KServe.

    arena serve kserve \
        --name=qwen-fluid \
        --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
        --gpus=1 \
        --cpu=4 \
        --memory=12Gi \
        --data="qwen-7b-chat-int8:/mnt/models/Qwen-7B-Chat-Int8" \
        "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen-7B-Chat-Int8 --gpu-memory-utilization 0.95 --quantization gptq --max-model-len=6144"

    Expected output:

    inferenceservice.serving.kserve.io/qwen-fluid created
    INFO[0002] The Job qwen-fluid has been submitted successfully 
    INFO[0002] You can run `arena serve get qwen-fluid --type kserve -n default` to check the job status 

    The expected output indicates that the inference service is deployed.
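
    After the pod is ready, you can send a test request to the service. The following sketch forwards the pod's port to your local host and calls the OpenAI-compatible API that vLLM exposes. The pod name filter is an assumption based on the default naming of the pods created for the qwen-fluid service, and the model name qwen matches the --served-model-name parameter in the preceding command.

    # Forward local port 8080 to the vLLM server in the inference pod.
    kubectl port-forward $(kubectl get po | grep qwen-fluid | awk '{print $1}') 8080:8080

    # In another terminal, send a chat completion request to the service.
    curl http://localhost:8080/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"model": "qwen", "messages": [{"role": "user", "content": "Hello"}]}'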

Step 4: Check the results of data pulling acceleration

  1. Run the following command to view information about the dataset:

    kubectl get dataset qwen-7b-chat-int8

    Expected output:

    NAME                UFS TOTAL SIZE   CACHED     CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    qwen-7b-chat-int8   17.01GiB         10.46MiB   18.00GiB         0.1%                Bound   23h

  2. Run the following command to check the amount of time that is consumed until the application server is ready:

    # Filter pods whose names contain qwen-fluid, extract the pod name, and assign it to the POD_NAME variable. 
    POD_NAME=$(kubectl get po | grep qwen-fluid | awk '{print $1}')
    # Check the amount of time that is consumed until the application server is ready. 
    kubectl logs $POD_NAME | grep -i "server ready takes"

    Expected output:

    server ready takes 25.875763 s

    The output shows that it takes 25.875763 seconds for the application server to become ready when Fluid is used to accelerate data pulling. The acceleration effect varies based on your application, dataset size, and environment configurations. The data provided in this topic is for reference only.

    For more information about the acceleration effect based on JindoRuntime, see the Step 3: Create applications to test data acceleration section of the "Use JindoFS to accelerate access to OSS" topic.
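
    After the inference service has read the model once, the files stay in the JindoRuntime cache, so the CACHED and CACHED PERCENTAGE values of the dataset grow and subsequent pods can read the model directly from the cache. You can confirm this by querying the dataset again:

    # Re-check the dataset after the service has pulled the model. The
    # CACHED and CACHED PERCENTAGE values should be significantly higher.
    kubectl get dataset qwen-7b-chat-int8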

References

For more information about Fluid, see Overview of Fluid.