Platform For AI: Model cache acceleration

Last Updated: Apr 07, 2026

Cache model files in memory to accelerate reads from mounted OSS or NAS paths and reduce service startup latency.

How it works

Model cache acceleration supports two caching methods:

  • Local cache: Uses idle memory of the inference service to cache model files and exposes them as a file system directory. During scale-out, multiple instances of the same service form a P2P network. New instances pull data from instances with an existing cache instead of fetching from OSS or NAS.

  • Local cache + cache warm-up: Enhances local cache with a dedicated cache warm-up service that preloads model files into memory. This solves the cold start problem that local cache alone cannot address.

After configuration, each inference service instance mounts an accelerated path. Your application reads model files from this path without code changes. Models are loaded in the following priority order:

  • Cold start: Fetches data from the cache warm-up service if configured. Otherwise, pulls data from OSS or NAS and caches it locally.

  • Scale-out: Prioritizes the local cache, which uses a Least Recently Used (LRU) eviction policy. On a cache miss, falls back to the cache warm-up service, then to OSS or NAS.
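The lookup order above can be sketched as a small Python model. This is an illustrative toy, not the EAS implementation: the names LruByteCache and read_model_file are hypothetical, the warm-up service and OSS or NAS are stood in for by plain callables, and the P2P transfer between instances is folded into the local cache tier for brevity.

```python
from collections import OrderedDict
from typing import Callable, Optional, Tuple


class LruByteCache:
    """Toy in-memory cache that evicts least recently used entries by total size."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        data = self._entries.get(key)
        if data is not None:
            self._entries.move_to_end(key)  # mark as most recently used
        return data

    def put(self, key: str, data: bytes) -> None:
        if len(data) > self.capacity_bytes:
            return  # larger than the whole cache, so it is never cached
        if key in self._entries:
            self.used_bytes -= len(self._entries.pop(key))
        self._entries[key] = data
        self.used_bytes += len(data)
        while self.used_bytes > self.capacity_bytes:
            _, evicted = self._entries.popitem(last=False)  # drop the oldest entry
            self.used_bytes -= len(evicted)


def read_model_file(
    name: str,
    local: LruByteCache,
    warmup: Optional[Callable[[str], Optional[bytes]]],
    origin: Callable[[str], bytes],
) -> Tuple[bytes, str]:
    """Return (data, source) using the order: local cache, warm-up service, OSS or NAS."""
    data = local.get(name)
    if data is not None:
        return data, "local cache"
    if warmup is not None:  # a warm-up service is optional
        data = warmup(name)
        if data is not None:
            local.put(name, data)
            return data, "cache warm-up service"
    data = origin(name)  # last resort: read from the source storage
    local.put(name, data)
    return data, "OSS or NAS"


# Hypothetical file names and sizes, chosen so that the second file evicts the first.
origin_files = {"a.safetensors": b"x" * 6, "b.safetensors": b"y" * 6}
cache = LruByteCache(capacity_bytes=10)
for wanted in ["a.safetensors", "a.safetensors", "b.safetensors", "a.safetensors"]:
    _, source = read_model_file(wanted, cache, None, origin_files.__getitem__)
    print(wanted, "<-", source)
```

In this run the second read of a.safetensors is served from the local cache, while the final read goes back to the source because caching b.safetensors exceeded the 10-byte capacity and evicted it.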

Limits

  • The accelerated path is read-only to ensure data consistency.

  • To add new model files, add them to the source path. They are automatically cached and available through the accelerated path.

  • Do not update or delete files in the source path directly. This can cause the cache to serve stale data.

Configure a local model cache

Custom deployment

  1. Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).

  2. Click Deploy Service. In the Custom Model Deployment section, click Custom Deployment.

  3. Configure the following key parameters. For other parameters, see Custom Deployment.

    1. In the Environment Information section, configure Mount storage to mount your model files to a container directory. For example, when mounting from OSS:

      • Uri: OSS path of the model files, such as oss://path/to/models/Qwen3-8B/.

      • Mount Path: The path within the container where the files will be mounted, such as /mnt/models/Qwen3-8B/.

    2. In the Features section, enable the Distributed cache acceleration switch and configure the following parameters:

      • Maximum Memory Usage: Maximum memory for the cache, in GB. When exceeded, LRU eviction applies. Example: 20 GB.

      • Source Path: Source directory of files to accelerate. Enter the mount path where OSS or NAS storage is mounted to the container.

      • Accelerated Path: Local cache path for your application to read models from. Must be different from the source path. Example: /mnt/models/Qwen3-8B-fast/.

      • Model Cache Prefetch Service: (Optional) Select a deployed cache warm-up service to reduce cold start time. To use this option, first deploy a cache warm-up service.

    3. In the Environment Information section, modify the Command to Run to change the model file path from the source path to the accelerated path. For example, when deploying an LLM service:

      vllm serve /mnt/models/Qwen3-8B-fast/

  4. After configuration, click Deploy.

JSON deployment

  1. Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).

  2. On the Elastic Algorithm Service (EAS) page, click Deploy Service. In the Custom Model Deployment section, click JSON Deployment.

  3. Enter the JSON configuration. Sample:

    {
        "cloud": {
            "computing": {
                "instances": [
                    {
                        "type": "ecs.gn6e-c12g1.3xlarge"
                    }
                ]
            },
            "networking": {
                "security_group_id": "your-security-group-id",
                "vpc_id": "your-vpc-id",
                "vswitch_id": "your-vswitch-id"
            }
        },
        "containers": [
            {
                "image": "eas-registry-vpc.cn-hangzhou.cr.aliyuncs.com/pai-eas/vllm:0.11.2-py312-mows0.5.1",
                "port": 8000,
                "script": "vllm serve /mnt/models/Qwen3-8B-fast/"
            }
        ],
        "metadata": {
            "cpu": 12,
            "disk": "30Gi",
            "gpu": 1,
            "instance": 1,
            "memory": 92000,
            "name": "vllm_test",
            "workspace_id": "your-workspace-id"
        },
        "storage": [
            {
                "mount_path": "/mnt/models/Qwen3-8B/",
                "oss": {
                    "path": "oss://path/to/models/Qwen3-8B/",
                    "readOnly": false
                }
            },
            {
                "cache": {
                    "capacity": "10G",
                    "path": "/mnt/models/Qwen3-8B/",
                    "cacheroot_service": "your-cacheroot-service"
                },
                "mount_path": "/mnt/models/Qwen3-8B-fast/"
            }
        ]
    }

    The following parameters relate to model cache acceleration. For other parameters, see JSON Deployment.

    • containers.script: Change the model file path from the source path (OSS or NAS mount path) to the accelerated path.

    • storage[].cache.capacity: Maximum cache memory, in GB. LRU eviction applies when exceeded.

    • storage[].cache.path: Source directory of files to accelerate. Enter the mount path where OSS or NAS storage is mounted to the container.

    • storage[].cache.preload: Set to "/" to cache all files from the source path when the service starts.

    • storage[].cache.cacheroot_service: The name of the cache warm-up service.

    • storage[].mount_path: Mount path for the storage object. In a cache block, this is the accelerated path. In an oss or nas block, this is the source path.

  4. Click Deploy.
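Before you submit the JSON, the relationships between the cache-related fields can be checked locally. The following Python sketch is a hypothetical helper, not part of any PAI SDK. It assumes the field layout shown in the sample above and checks three things: the cache source path lies under an oss or nas mount path, the accelerated path differs from the source path, and the startup script references the accelerated path.

```python
from typing import Any, Dict, List


def check_cache_config(config: Dict[str, Any]) -> List[str]:
    """Return the problems found in the cache-related fields; an empty list means none found."""
    problems: List[str] = []
    storage = config.get("storage", [])
    # Mount paths of the real storage (oss or nas blocks), without trailing slashes.
    source_mounts = [
        entry["mount_path"].rstrip("/")
        for entry in storage
        if "oss" in entry or "nas" in entry
    ]
    cache_entries = [entry for entry in storage if "cache" in entry]
    if not cache_entries:
        problems.append("no storage entry contains a cache block")
    scripts = " ".join(c.get("script", "") for c in config.get("containers", []))

    for entry in cache_entries:
        source = entry["cache"]["path"].rstrip("/")
        accelerated = entry["mount_path"].rstrip("/")
        under_mount = any(
            source == mount or source.startswith(mount + "/") for mount in source_mounts
        )
        if not under_mount:
            problems.append(f"cache.path {source!r} is not under an oss or nas mount_path")
        if accelerated == source:
            problems.append("accelerated path must differ from the source path")
        if accelerated not in scripts:
            problems.append(f"containers.script does not reference {accelerated!r}")
    return problems


# Trimmed-down configuration that keeps only the fields this check reads.
sample = {
    "containers": [{"script": "vllm serve /mnt/models/Qwen3-8B-fast/"}],
    "storage": [
        {
            "mount_path": "/mnt/models/Qwen3-8B/",
            "oss": {"path": "oss://path/to/models/Qwen3-8B/"},
        },
        {
            "cache": {"capacity": "10G", "path": "/mnt/models/Qwen3-8B/"},
            "mount_path": "/mnt/models/Qwen3-8B-fast/",
        },
    ],
}
print(check_cache_config(sample))  # prints []
```

A script that still points at the source path, such as vllm serve /mnt/models/Qwen3-8B/, is reported as a problem because the service would then bypass the cache.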

Deploy a cache warm-up service

A cache warm-up service preloads model files into memory and serves as a high-speed data source for inference services with model cache acceleration enabled.

Important

The OSS path mounted by the cache warm-up service must match the OSS path used as the source path in the inference service's cache acceleration configuration. Otherwise, cache warm-up does not take effect.

For example, if the source path for cache acceleration in the inference service is /mnt/models/Qwen3-8B/, which corresponds to the OSS path oss://path/to/models/Qwen3-8B/, then the cache warm-up service must also mount oss://path/to/models/Qwen3-8B/.
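The matching rule can be checked before deployment with a few lines of Python. This is a minimal sketch with hypothetical helper names. It assumes that a trailing slash is not significant when the two OSS paths are compared. This page does not state how the service compares paths, so using identical strings in both services is the safest choice.

```python
from typing import Iterable


def normalize_oss_uri(uri: str) -> str:
    """Validate the oss:// scheme and drop trailing slashes (an assumption, see above)."""
    if not uri.startswith("oss://"):
        raise ValueError(f"not an OSS URI: {uri!r}")
    return uri.rstrip("/")


def warmup_covers_source(source_uri: str, warmup_uris: Iterable[str]) -> bool:
    """True if one of the warm-up service's OSS paths equals the inference source path."""
    target = normalize_oss_uri(source_uri)
    return any(normalize_oss_uri(uri) == target for uri in warmup_uris)


# OSS path behind the inference service's source path /mnt/models/Qwen3-8B/.
inference_source = "oss://path/to/models/Qwen3-8B/"
# OSS paths mounted by the warm-up service; the second entry is a made-up example.
warmup_paths = ["oss://path/to/models/Qwen3-8B", "oss://path/to/models/other-model/"]

print(warmup_covers_source(inference_source, warmup_paths))
```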

  1. On the Inference Service tab, click Deploy Service. In the Scenario-based Model Deployment section, click Model Warm-up Cache Service Deployment.

  2. Configure the following key parameters, and then click Deploy.

    • Basic Information

      • Deployment: Select an instance type with enough memory to hold the model files.

    • Cache Configuration

      • Cache Path: The model directories to cache. Multiple paths are supported.

      • Maximum Memory Usage: Maximum memory for the cache warm-up service.

    • Network Information

      • VPC: Required. Must be the same VPC as the inference service. Otherwise, the inference service cannot access the cache warm-up service.

      • Associate NLB: Must be enabled. An NLB is created automatically by default.
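To choose a value for Maximum Memory Usage, it helps to know how large the model directory is. The following Python sketch sums file sizes under a directory and compares the total with a memory limit. The helper names are hypothetical, and the sketch assumes that the limit is expressed in decimal gigabytes (1 GB = 10^9 bytes). It does not account for memory that the service needs beyond the cached files.

```python
import os
import tempfile


def directory_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files under root, skipping symbolic links."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def fits_in_cache(model_bytes: int, max_memory_gb: float) -> bool:
    """True if the files fit within the limit, assuming 1 GB = 10**9 bytes."""
    return model_bytes <= max_memory_gb * 10**9


# Demonstration on a throwaway directory with two tiny stand-in files.
with tempfile.TemporaryDirectory() as model_dir:
    with open(os.path.join(model_dir, "weights-1.bin"), "wb") as handle:
        handle.write(b"abc")
    os.makedirs(os.path.join(model_dir, "shards"))
    with open(os.path.join(model_dir, "shards", "weights-2.bin"), "wb") as handle:
        handle.write(b"defgh")
    demo_size = directory_size_bytes(model_dir)

print(demo_size)
# A 62 GB model checked against two example limits.
print(fits_in_cache(62 * 10**9, 80), fits_in_cache(62 * 10**9, 20))
```

In practice, point directory_size_bytes at the mounted model directory, for example /mnt/models/Qwen3-8B/.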

Performance benchmarks

The following benchmark results show the effect of model cache acceleration on model loading and service readiness. Actual results may vary.

Qwen3-32B

Model: Qwen3-32B (62 GB)

Machine: ml.gu8is.c64m512.4-gu60 | 64-core 512 GB + 4× GU60(48G) | L20

Deployment mode                               Model loading time (mm:ss)  Model loading speed  Service readiness time (mm:ss)
Standard (no cache acceleration)              01:05                       7.63 Gbit/s          01:43
Cold start acceleration (with cache warm-up)  00:21                       23.62 Gbit/s         01:01
Scale-out acceleration (with local cache)     00:18                       27.55 Gbit/s         00:58

MiniMax-M2

Model: MiniMax-M2 (215 GB)

Machine: ml.gu8tf.8.40xlarge | 160 vCPUs + 1800 GB + 8× GU8T | H20(96G)

Deployment mode                               Model loading time (mm:ss)  Model loading speed  Service readiness time (mm:ss)
Standard (no cache acceleration)              06:42                       4.28 Gbit/s          09:16
Cold start acceleration (with cache warm-up)  01:49                       15.78 Gbit/s         04:49
Scale-out acceleration (with local cache)     01:42                       16.86 Gbit/s         04:34

DeepSeek-V3.2

Model: DeepSeek-V3.2 (643 GB)

Machine: ml.gu8tef.8.46xlarge | 184 vCPUs + 1800 GB + 8× GU8TE | H20-3e(141G)

Deployment mode                               Model loading time (mm:ss)  Model loading speed  Service readiness time (mm:ss)
Standard (no cache acceleration)              12:33                       6.83 Gbit/s          27:41
Cold start acceleration (with cache warm-up)  02:43                       31.56 Gbit/s         13:01
Scale-out acceleration (with local cache)     01:58                       43.60 Gbit/s         12:49
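The loading speeds in these benchmarks follow from the model size and the loading time. The following Python sketch shows the arithmetic, assuming that times are in mm:ss and that sizes are decimal gigabytes (1 GB = 8 Gbit). Because the published times are rounded to whole seconds, recomputed speeds agree with the published values only to within about 0.01 Gbit/s. The function names are illustrative.

```python
def parse_mmss(text: str) -> int:
    """Convert an mm:ss string such as "12:33" to seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


def loading_speed_gbit_s(model_gb: float, loading_time: str) -> float:
    """Average read throughput in Gbit/s for a model of model_gb gigabytes."""
    return model_gb * 8 / parse_mmss(loading_time)


def speedup(baseline: str, accelerated: str) -> float:
    """How many times faster the accelerated load is than the baseline load."""
    return parse_mmss(baseline) / parse_mmss(accelerated)


# Standard deployment rows from the benchmarks above.
print(round(loading_speed_gbit_s(62, "01:05"), 2))   # Qwen3-32B: 7.63
print(round(loading_speed_gbit_s(215, "06:42"), 2))  # MiniMax-M2: 4.28
print(round(loading_speed_gbit_s(643, "12:33"), 2))  # DeepSeek-V3.2: 6.83

# Model loading speedup for DeepSeek-V3.2 with the local cache during scale-out.
print(round(speedup("12:33", "01:58"), 2))           # 6.38
```

The gain in service readiness time is smaller than the gain in model loading time. For DeepSeek-V3.2 with the local cache, readiness drops from 27:41 to 12:49, roughly 2.2 times faster, while model loading is about 6.4 times faster. This indicates that steps other than reading model files account for a large share of startup time.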