Cache model files in memory to accelerate reads from mounted OSS or NAS paths and reduce service startup latency.
How it works
Model cache acceleration supports two caching methods:
- Local cache: Uses idle memory of the inference service to cache model files and exposes them as a file system directory. During scale-out, multiple instances of the same service form a P2P network. New instances pull data from instances with an existing cache instead of fetching from OSS or NAS.
- Local cache + cache warm-up: Enhances local cache with a dedicated cache warm-up service that preloads model files into memory. This solves the cold start problem that local cache alone cannot address.
After configuration, each inference service instance mounts an accelerated path. Your application reads model files from this path without code changes. Model loading priority:
- Cold start: Fetches data from the cache warm-up service if configured. Otherwise, pulls data from OSS or NAS and caches it locally.
- Scale-out: Prioritizes the local cache, which uses a Least Recently Used (LRU) eviction policy. On a cache miss, falls back to the cache warm-up service, then to OSS or NAS.
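The loading priority above can be sketched as a small lookup. This is only an illustration of the documented order, not the service's implementation: the function name and the source labels are hypothetical, and the behavior of scale-out without a warm-up service (falling straight through to OSS or NAS) is an assumption.

```python
from typing import List


def read_order(scaling_out: bool, has_warmup_service: bool) -> List[str]:
    """Return the data sources tried, in order, for one model load."""
    if not scaling_out:
        # Cold start: warm-up service if configured, otherwise OSS or NAS.
        return ["cache warm-up service"] if has_warmup_service else ["OSS or NAS"]
    # Scale-out: the local cache held by existing instances comes first.
    order = ["local cache"]
    if has_warmup_service:
        order.append("cache warm-up service")
    # Assumption: without a warm-up service, a miss goes directly to OSS or NAS.
    order.append("OSS or NAS")
    return order
```

For example, `read_order(scaling_out=True, has_warmup_service=True)` lists the local cache, then the cache warm-up service, then OSS or NAS.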
Limits
- The accelerated path is read-only to ensure data consistency.
- To add new model files, add them to the source path. They are automatically cached and available through the accelerated path.
- Do not update or delete files in the source path directly. This can cause the cache to serve stale data.
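One way to respect the last two limits is to check, before uploading, that a batch of files only adds new paths. The helper below is a hypothetical pre-flight check, not part of any PAI tooling. It assumes you can already list the relative paths present in the source path.

```python
from typing import Iterable, List


def overwritten_files(existing: Iterable[str], upload: Iterable[str]) -> List[str]:
    """Return the relative paths in `upload` that already exist in the source path.

    An empty result means the upload only adds files, which the cache handles.
    """
    return sorted(set(existing) & set(upload))
```

If the result is not empty, consider uploading the new files under a different directory name instead of replacing the existing ones in place.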
Configure a local model cache
Custom deployment
- Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).
- Click Deploy Service. In the Custom Model Deployment section, click Custom Deployment.
- Configure the following key parameters. For other parameters, see Custom Deployment.
  - In the Environment Information section, configure Mount storage to mount your model files to a container directory. For example, when mounting from OSS:
    - Uri: OSS path of the model files, such as oss://path/to/models/Qwen3-8B/.
    - Mount Path: The path within the container where the files will be mounted, such as /mnt/models/Qwen3-8B/.
  - In the Features section, enable the Distributed cache acceleration switch and configure the following parameters:

    | Parameter | Description |
    | --- | --- |
    | Maximum Memory Usage | Maximum memory for the cache, in GB. When exceeded, LRU eviction applies. Example: 20GB. |
    | Source Path | Source directory of files to accelerate. Enter the mount path where OSS or NAS storage is mounted to the container. |
    | Accelerated Path | Local cache path for your application to read models from. Must be different from the source path. Example: /mnt/models/Qwen3-8B-fast/. |
    | Model Cache Prefetch Service | (Optional) Select a deployed cache warm-up service to reduce cold start time. To use this option, first deploy a cache warm-up service. |
  - In the Environment Information section, modify the Command to Run to change the model file path from the source path to the accelerated path. For example, when deploying an LLM service:

    vllm serve /mnt/models/Qwen3-8B-fast/

- After configuration, click Deploy.
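To illustrate what the Maximum Memory Usage limit means in practice, here is a minimal LRU sketch. It assumes whole files are the unit of caching and eviction, which this document does not state; the class and method names are hypothetical and the real cache may work at a different granularity.

```python
from collections import OrderedDict
from typing import List


class LRUFileCache:
    """Toy LRU cache that tracks file sizes against a capacity."""

    def __init__(self, capacity_gb: float) -> None:
        self.capacity_gb = capacity_gb
        self.used_gb = 0.0
        self._files: "OrderedDict[str, float]" = OrderedDict()

    def access(self, path: str, size_gb: float) -> List[str]:
        """Record a read of `path` and return the paths evicted to make room."""
        if path in self._files:
            # Cache hit: mark the file as most recently used.
            self._files.move_to_end(path)
            return []
        self._files[path] = size_gb
        self.used_gb += size_gb
        evicted: List[str] = []
        # Evict least recently used files until usage fits the limit again.
        # The file just added is never evicted, even if it alone exceeds the limit.
        while self.used_gb > self.capacity_gb and len(self._files) > 1:
            old_path, old_size = self._files.popitem(last=False)
            self.used_gb -= old_size
            evicted.append(old_path)
        return evicted
```

With a 20 GB limit, reading two 10 GB files fills the cache, and reading a third evicts whichever of the first two was read least recently.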
JSON deployment
- Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).
- On the Elastic Algorithm Service (EAS) page, click Deploy Service. In the Custom Model Deployment section, click JSON Deployment.
- Enter the JSON configuration. Sample:
  {
    "cloud": {
      "computing": {
        "instances": [
          {
            "type": "ecs.gn6e-c12g1.3xlarge"
          }
        ]
      },
      "networking": {
        "security_group_id": "your-security-group-id",
        "vpc_id": "your-vpc-id",
        "vswitch_id": "your-vswitch-id"
      }
    },
    "containers": [
      {
        "image": "eas-registry-vpc.cn-hangzhou.cr.aliyuncs.com/pai-eas/vllm:0.11.2-py312-mows0.5.1",
        "port": 8000,
        "script": "vllm serve /mnt/models/Qwen3-8B-fast/"
      }
    ],
    "metadata": {
      "cpu": 12,
      "disk": "30Gi",
      "gpu": 1,
      "instance": 1,
      "memory": 92000,
      "name": "vllm_test",
      "workspace_id": "your-workspace-id"
    },
    "storage": [
      {
        "mount_path": "/mnt/models/Qwen3-8B/",
        "oss": {
          "path": "oss://path/to/models/Qwen3-8B/",
          "readOnly": false
        }
      },
      {
        "cache": {
          "capacity": "10G",
          "path": "/mnt/models/Qwen3-8B/",
          "cacheroot_service": "your-cacheroot-service"
        },
        "mount_path": "/mnt/models/Qwen3-8B-fast/"
      }
    ]
  }

  The following table describes parameters related to model cache acceleration. For other parameters, see JSON Deployment.
  | Parameter | Description |
  | --- | --- |
  | containers.script | Change the model file path from the source path (OSS or NAS mount path) to the accelerated path. |
  | storage[].cache.capacity | Maximum cache memory, in GB. LRU eviction applies when exceeded. |
  | storage[].cache.path | Source directory of files to accelerate. Enter the mount path where OSS or NAS storage is mounted to the container. |
  | storage[].cache.preload | Set to "/" to cache all files from the source path when the service starts. |
  | storage[].cache.cacheroot_service | The name of the cache warm-up service. |
  | storage[].mount_path | Mount path for the storage object. In a cache block, this is the accelerated path. In an oss or nas block, this is the source path. |

- Click Deploy.
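Before you deploy, a short script can check that the cache-related fields agree with each other. The sketch below is a hypothetical helper, not an official validator. It only encodes the rules stated in the table above and in the custom deployment parameters: the cache source must be an OSS or NAS mount path, the accelerated path must differ from the source path, and the startup script should read from the accelerated path.

```python
from typing import Any, Dict, List


def check_cache_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in the cache acceleration settings."""
    problems: List[str] = []
    storage = config.get("storage", [])
    # Mount paths backed by OSS or NAS are the valid source paths.
    source_mounts = {s["mount_path"] for s in storage if "oss" in s or "nas" in s}
    scripts = [c.get("script", "") for c in config.get("containers", [])]
    for entry in storage:
        cache = entry.get("cache")
        if cache is None:
            continue
        accelerated = entry["mount_path"]
        if cache["path"] not in source_mounts:
            problems.append(f"cache.path {cache['path']} is not an OSS or NAS mount path")
        if accelerated == cache["path"]:
            problems.append(f"accelerated path {accelerated} equals the source path")
        if not any(accelerated in script for script in scripts):
            problems.append(f"no container script reads from {accelerated}")
    return problems
```

Load the deployment JSON with `json.load` and pass the resulting dictionary to `check_cache_config`. An empty list means none of these three checks failed.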
Deploy a cache warm-up service
A cache warm-up service preloads model files into memory and serves as a high-speed data source for inference services with model cache acceleration enabled.
The OSS path mounted by the cache warm-up service must match the OSS path used as the source path in the inference service's cache acceleration configuration. Otherwise, cache warm-up does not take effect.
For example, if the source path for cache acceleration in the inference service is /mnt/models/Qwen3-8B/, which corresponds to the OSS path oss://path/to/models/Qwen3-8B/, then the cache warm-up service must also mount oss://path/to/models/Qwen3-8B/.
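The matching rule can be checked mechanically. The sketch below is a hypothetical helper that takes the `storage` list of the inference service and the OSS paths mounted by the cache warm-up service, and reports cache source paths that the warm-up service does not cover. Ignoring a trailing slash when comparing paths is an assumption, as is requiring an exact match rather than accepting a parent directory.

```python
from typing import Any, Dict, Iterable, List


def _norm(path: str) -> str:
    # Assumption: a trailing slash does not change which directory is meant.
    return path.rstrip("/")


def uncovered_sources(
    inference_storage: List[Dict[str, Any]], warmup_oss_paths: Iterable[str]
) -> List[str]:
    """Return cache source paths whose OSS path the warm-up service does not mount."""
    warm = {_norm(p) for p in warmup_oss_paths}
    # Map each OSS mount path in the container to its OSS URI.
    oss_by_mount = {
        _norm(s["mount_path"]): _norm(s["oss"]["path"])
        for s in inference_storage
        if "oss" in s
    }
    missing: List[str] = []
    for entry in inference_storage:
        cache = entry.get("cache")
        if cache is None:
            continue
        oss_uri = oss_by_mount.get(_norm(cache["path"]))
        if oss_uri is None or oss_uri not in warm:
            missing.append(cache["path"])
    return missing
```

An empty result means every accelerated source path maps to an OSS path that the warm-up service also mounts.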
- On the Inference Service tab, click Deploy Service. In the Scenario-based Model Deployment section, click Model Warm-up Cache Service Deployment.
- Configure the following key parameters, and then click Deploy.
  | Section | Parameter | Description |
  | --- | --- | --- |
  | Basic Information | Deployment | Select an instance type with enough memory to hold the model files. |
  | Cache Configuration | Cache Path | The model directories to cache. Multiple paths are supported. |
  | Cache Configuration | Maximum Memory Usage | Maximum memory for the cache warm-up service. |
  | Network Information | VPC | Required. Must be the same VPC as the inference service. Otherwise, the inference service cannot access the cache warm-up service. |
  | Network Information | Associate NLB | Must be enabled. An NLB is created automatically by default. |
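To pick an instance type and a Maximum Memory Usage value, it helps to know how large the model directory is. The following sketch totals the file sizes under a local copy or mount of the model directory and compares the result with a memory limit. The function names are hypothetical, and treating 1 GB as 2**30 bytes is an assumption about how the console interprets the unit.

```python
import os

GIB = 1024 ** 3  # assumption: "GB" in the console means 2**30 bytes


def dir_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files under `root`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total


def fits_in_cache(model_bytes: int, max_memory_gb: float) -> bool:
    """Check whether the model files fit within the configured cache memory."""
    return model_bytes <= max_memory_gb * GIB
```

For example, `fits_in_cache(dir_size_bytes("/mnt/models/Qwen3-8B/"), 20)` reports whether that directory fits in a 20 GB cache. Leave headroom for the memory the service itself needs.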
Performance benchmarks
Benchmark results for model cache acceleration. Actual results may vary.
Qwen3-32B
Model: Qwen3-32B (62 GB)
Machine: ml.gu8is.c64m512.4-gu60 | 64 cores + 512 GB + 4 × GU60 (48G) | L20
| Deployment mode | Model loading time (mm:ss) | Model loading speed | Service readiness time (mm:ss) |
| --- | --- | --- | --- |
| Standard (no cache acceleration) | 01:05 | 7.63 Gbit/s | 01:43 |
| Cold start acceleration (with cache warm-up) | 00:21 | 23.62 Gbit/s | 01:01 |
| Scale-out acceleration (with local cache) | 00:18 | 27.55 Gbit/s | 00:58 |
MiniMax-M2
Model: MiniMax-M2 (215 GB)
Machine: ml.gu8tf.8.40xlarge | 160 vCPUs + 1800 GB + 8 × GU8T | H20 (96G)
| Deployment mode | Model loading time (mm:ss) | Model loading speed | Service readiness time (mm:ss) |
| --- | --- | --- | --- |
| Standard (no cache acceleration) | 06:42 | 4.28 Gbit/s | 09:16 |
| Cold start acceleration (with cache warm-up) | 01:49 | 15.78 Gbit/s | 04:49 |
| Scale-out acceleration (with local cache) | 01:42 | 16.86 Gbit/s | 04:34 |
DeepSeek-V3.2
Model: DeepSeek-V3.2 (643 GB)
Machine: ml.gu8tef.8.46xlarge | 184 vCPUs + 1800 GB + 8 × GU8TE | H20-3e (141G)
| Deployment mode | Model loading time (mm:ss) | Model loading speed | Service readiness time (mm:ss) |
| --- | --- | --- | --- |
| Standard (no cache acceleration) | 12:33 | 6.83 Gbit/s | 27:41 |
| Cold start acceleration (with cache warm-up) | 02:43 | 31.56 Gbit/s | 13:01 |
| Scale-out acceleration (with local cache) | 01:58 | 43.60 Gbit/s | 12:49 |
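The loading speed figures relate to model size and loading time in a simple way. The sketch below reproduces the Qwen3-32B numbers, assuming that sizes are decimal gigabytes, that times are in mm:ss, and that speed is size in gigabits divided by loading time in seconds. These are readings of the tables, not a documented formula, and the helper names are hypothetical.

```python
def mmss_to_seconds(value: str) -> int:
    """Convert a 'mm:ss' string to seconds."""
    minutes, seconds = value.split(":")
    return int(minutes) * 60 + int(seconds)


def loading_speed_gbit_s(size_gb: float, loading_time: str) -> float:
    """Model size in gigabits divided by loading time in seconds."""
    return size_gb * 8 / mmss_to_seconds(loading_time)


def speedup(baseline_time: str, accelerated_time: str) -> float:
    """How many times faster the accelerated run is than the baseline."""
    return mmss_to_seconds(baseline_time) / mmss_to_seconds(accelerated_time)


# Qwen3-32B (62 GB), standard deployment: 01:05 loading time.
print(round(loading_speed_gbit_s(62, "01:05"), 2))  # 7.63
# Cold start with cache warm-up compared with standard loading.
print(round(speedup("01:05", "00:21"), 2))  # 3.1
```

The same helpers can compare your own measurements against these benchmarks when you evaluate whether cache acceleration is worth enabling for a given model.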