
Platform For AI: Enable memory caching for a local directory

Last Updated: Mar 11, 2026

Cache model files from OSS or NAS in memory to reduce inference latency and accelerate model switching.

How it works

EAS mounts model files to a local directory from OSS, NAS, or a Docker image. For more information, see Mount storage. Reading models, switching models, and scaling containers all consume network bandwidth. In Stable Diffusion scenarios, inference requests frequently switch between base models and LoRA models; each switch reads files from OSS or NAS, which adds network latency.

Memory caching solves this problem. The following diagram shows the architecture.

[Architecture diagram]
  • Model files in a local directory are cached in memory.

  • The cache uses an LRU eviction policy and supports file sharing among instances. Cached files appear as a standard file system directory.

  • The service reads cached files directly from memory. No changes to business code are required.

  • Instances in a service share a P2P network. When the service scales out, new instances read cached files from nearby instances over P2P, accelerating cluster scale-out.
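The capacity-bounded LRU behavior described above can be sketched in Python. This is an illustration of the eviction policy only, not EAS code; the file names and sizes are made up:

```python
from collections import OrderedDict

class LRUFileCache:
    """Illustrative capacity-bounded LRU cache: the least recently
    read files are evicted once total size exceeds the capacity."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.files = OrderedDict()  # name -> size, least recently used first

    def read(self, name, size):
        if name in self.files:
            self.files.move_to_end(name)  # cache hit: mark as most recently used
            return "hit"
        while self.used + size > self.capacity and self.files:
            _, evicted_size = self.files.popitem(last=False)  # evict LRU file
            self.used -= evicted_size
        self.files[name] = size
        self.used += size
        return "miss"  # read from the source directory

cache = LRUFileCache(capacity_bytes=8 * 2**30)              # hypothetical 8 GB budget
print(cache.read("anything-v4.5.safetensors", 7 * 2**30))   # miss: first read
print(cache.read("anything-v4.5.safetensors", 7 * 2**30))   # hit: served from memory
print(cache.read("deliberate_v2.safetensors", 2 * 2**30))   # miss: evicts the 7 GB file
```

A real deployment sets this budget with Maximum Memory Usage (or `cache.capacity` in the JSON configuration).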

Limitations

  • The cache directory is read-only to ensure data consistency.

  • To add a model file, place it in the source directory. The file becomes available in the cache directory automatically.

  • Do not modify or delete model files in the source directory. Doing so may cause dirty data in the cache.

Prerequisites

Ensure the following requirements are met:

  • PAI workspace with EAS access

  • Model files in OSS or NAS

  • Instance type with sufficient memory for caching (for example, ml.gu7i.c16m60.1-gu30)

Procedure

This section uses Stable Diffusion as an example with the following configuration:

  • Image startup command: ./webui.sh --listen --port 8000 --skip-version-check --no-hashing --no-download-sd-model --skip-prepare-environment --api --filebrowser

  • OSS directory for model files: oss://path/to/models/

  • Container directory for model files (source directory): /code/stable-diffusion-webui/data-slow

Source directory /code/stable-diffusion-webui/data-slow stores model files, which are then cached to /code/stable-diffusion-webui/data-fast. The service reads model files from the cache directory instead of the source directory.
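Because the cache directory mirrors the source directory, a model's path translates mechanically between the two. A minimal sketch of that mapping (the model file name is hypothetical):

```python
from pathlib import PurePosixPath

# Directories from the example configuration above
SOURCE = PurePosixPath("/code/stable-diffusion-webui/data-slow")  # source directory (OSS mount)
CACHE = PurePosixPath("/code/stable-diffusion-webui/data-fast")   # cache directory the service reads

def cached_path(source_file: str) -> str:
    """Map a file under the source directory to its cached counterpart."""
    return str(CACHE / PurePosixPath(source_file).relative_to(SOURCE))

print(cached_path("/code/stable-diffusion-webui/data-slow/v1-5-pruned.safetensors"))
# → /code/stable-diffusion-webui/data-fast/v1-5-pruned.safetensors
```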

PAI console

  1. Log on to the PAI console. Select a region in the top navigation bar. Then, select the desired workspace and click Enter Elastic Algorithm Service (EAS).

  2. Click Deploy Service and select Custom Deployment in the Custom Model Deployment section.

  3. On the Custom Deployment page, configure the following parameters. For other parameters, see Custom deployment.

    • Environment Information > Model Settings: Select the OSS mount mode.
      Example: Uri: oss://path/to/models/; Mount Path: /code/stable-diffusion-webui/data-slow

    • Command: Startup parameter based on your image or code. For Stable Diffusion, add --ckpt-dir and set it to the cache directory.
      Example: ./webui.sh --listen --port 8000 --skip-version-check --no-hashing --no-download-sd-model --skip-prepare-environment --api --filebrowser --ckpt-dir /code/stable-diffusion-webui/data-fast

    • Features > Distributed cache acceleration: Enable Memory Caching and configure the following parameters:
      - Maximum Memory Usage: Maximum memory for cached files, in GB. When exceeded, the LRU policy evicts cached files. Example: 20 GB
      - Source Path: Source directory of cached files. Can be an OSS or NAS mount directory, subdirectory, or regular file directory. Example: /code/stable-diffusion-webui/data-slow
      - Mount Path: Cache directory from which the service reads files. Example: /code/stable-diffusion-webui/data-fast
  4. Click Deploy.

JSON configuration file

Step 1: Create configuration file

Create a JSON configuration file with the following sample configuration:

{
    "cloud": {
        "computing": {
            "instances": [
                {
                    "type": "ml.gu7i.c16m60.1-gu30"
                }
            ]
        }
    },
    "containers": [
        {
            "image": "eas-registry-vpc.cn-hangzhou.cr.aliyuncs.com/pai-eas/stable-diffusion-webui:4.2",
            "port": 8000,
            "script": "./webui.sh --listen --port 8000 --skip-version-check --no-hashing --no-download-sd-model --skip-prepare-environment --api --filebrowser --ckpt-dir /code/stable-diffusion-webui/data-fast"
        }
    ],
    "metadata": {
        "cpu": 16,
        "enable_webservice": true,
        "gpu": 1,
        "instance": 1,
        "memory": 60000,
        "name": "sdwebui_test"
    },
    "options": {
        "enable_cache": true
    },
    "storage": [
        {
            "cache": {
                "capacity": "20G",
                "path": "/code/stable-diffusion-webui/data-slow"
            },
            "mount_path": "/code/stable-diffusion-webui/data-fast"
        },
        {
            "mount_path": "/code/stable-diffusion-webui/data-slow",
            "oss": {
                "path": "oss://path/to/models/",
                "readOnly": false
            },
            "properties": {
                "resource_type": "model"
            }
        }
    ]
}

Cache-related parameters are described below. For other parameters, see Parameters of model services.

  • script: Startup command. Configure based on your image or code. For Stable Diffusion, add --ckpt-dir and set it to the cache directory.

  • cache.capacity: Maximum memory for cached files. Unit: GB. When exceeded, the LRU policy evicts cached files.

  • cache.path: Source directory of cached files. Can be an OSS or NAS mount directory, subdirectory, or regular file directory.

  • mount_path (in the cache entry): Cache directory to which cached files are mounted. Files in this directory mirror the source directory. The service reads files from this directory.
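The `cache.capacity` value is a size string such as "20G". For capacity planning it can help to convert such a value to bytes; the helper below is purely illustrative (not part of EAS) and assumes binary units (G = 2^30 bytes):

```python
def capacity_to_bytes(capacity: str) -> int:
    """Convert a size string such as '20G' or '512M' to bytes (illustrative helper)."""
    units = {"K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}
    suffix = capacity[-1].upper()
    if suffix in units:
        return int(capacity[:-1]) * units[suffix]
    return int(capacity)  # no suffix: already in bytes

print(capacity_to_bytes("20G"))  # 21474836480
```

With the example instance type (60 GB of memory), a 20G cache leaves the remaining memory for the model runtime itself.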

Step 2: Deploy service

Deploy the service using one of the following methods.

PAI console

  1. Log on to the PAI console. Select a region in the top navigation bar. Then, select the desired workspace and click Enter Elastic Algorithm Service (EAS).

  2. On the Elastic Algorithm Service (EAS) page, click Deploy Service. On the Deploy Service page, click JSON Deployment in the Custom Model Deployment section. In the editor, paste the JSON configuration from Step 1.

  3. Click Deploy.

EASCMD client

  1. Download the EASCMD client and complete identity authentication. For more information, see Download the EASCMD client and complete identity authentication.

  2. Save the JSON configuration from Step 1 as test.json in the directory where the EASCMD client is installed.

  3. Run the following command. This example uses Windows 64-bit:

    eascmdwin64.exe create test.json

Performance

Model switching performance in a Stable Diffusion scenario (in seconds). Actual results vary depending on your environment.

Model                                    Size    OSS mount (s)  Local memory hit (s)  Remote memory hit (s)
anything-v4.5.safetensors                7.2 GB  89.88          3.845                 15.18
Anything-v5.0-PRT-RE.safetensors         2.0 GB  16.73          2.967                 5.46
cetusMix_Coda2.safetensors               3.6 GB  24.76          3.249                 7.13
chilloutmix_NiPrunedFp32Fix.safetensors  4.0 GB  48.79          3.556                 8.47
CounterfeitV30_v30.safetensors           4.0 GB  64.99          3.014                 7.94
deliberate_v2.safetensors                2.0 GB  16.33          2.985                 5.55
DreamShaper_6_NoVae.safetensors          5.6 GB  71.78          3.416                 10.17
pastelmix-fp32.ckpt                      4.0 GB  43.88          4.959                 9.23
revAnimated_v122.safetensors             4.0 GB  69.38          3.165                 3.20

  • If no model files exist in memory cache, CacheFS reads model files from the source directory. For example, if files are mounted from an OSS bucket, CacheFS reads from that OSS bucket.

  • When a service has multiple instances, the instances share memory across the cluster. An instance can read model files directly from the memory of other instances. Read time varies based on file size.

  • When scaling out a service cluster, new instances read model files from the memory of existing instances during initialization, speeding up scale-out operations.
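As a rough reading of the table above, the speedup of a local memory hit over a direct OSS mount can be computed per model. The snippet below uses two rows from the table:

```python
# (model, OSS mount seconds, local memory hit seconds) taken from the table above
rows = [
    ("anything-v4.5.safetensors", 89.88, 3.845),
    ("deliberate_v2.safetensors", 16.33, 2.985),
]
for name, oss_s, local_s in rows:
    print(f"{name}: {oss_s / local_s:.1f}x faster on a local memory hit")
```

The larger the model, the bigger the gap tends to be, since the OSS read time grows with file size while a memory hit stays within a few seconds.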