EAS: Configure model cache acceleration - Platform For AI

How it works

Model cache acceleration supports the following two methods:

Local cache: Caches model files in the idle memory of the inference service, exposing them as a file system directory. This method is ideal for scale-out scenarios. Multiple instances of the same service form a peer-to-peer (P2P) network, allowing new instances to pull data directly from existing instances with a populated cache instead of fetching from the source OSS or NAS.
Local cache + cache warm-up: Builds on local cache by deploying a separate, dedicated cache warm-up service that preloads model files into memory. This is ideal for new deployments and solves the cold-start problem inherent to local caching.

After configuration, the system mounts an accelerated path to each inference service instance. Your application reads model files from this directory with no code changes required. The model loading priority is as follows:

Cold start: If a cache warm-up service is configured, the system first attempts to fetch data from it. Otherwise, it pulls data from the source OSS or NAS and caches it locally.
Scale-out: The system prioritizes a local cache hit (LRU eviction is supported). On a cache miss, it attempts to fetch from the cache warm-up service. If the data is still not found, it falls back to the source OSS or NAS.
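
The lookup order above can be sketched as a small fallback chain. This is an illustrative model only, not the EAS implementation: `load_model_file`, the `Fetcher` type, and the dict-based cache are hypothetical names, and LRU eviction and P2P transfer between instances are left out. It also assumes that data fetched from the warm-up service is kept in the local cache, which the text above states only for data pulled from OSS or NAS.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical fetcher: returns the file content, or None on a miss.
Fetcher = Callable[[str], Optional[bytes]]


def load_model_file(
    name: str,
    local_cache: Dict[str, bytes],
    warmup: Optional[Fetcher],
    source: Fetcher,
) -> Tuple[bytes, str]:
    """Return (data, tier) using the order: local cache, warm-up service, source."""
    # 1. Local cache hit (always a miss on a cold start, when the cache is empty).
    if name in local_cache:
        return local_cache[name], "local cache"

    # 2. Cache warm-up service, only if one is configured.
    if warmup is not None:
        data = warmup(name)
        if data is not None:
            local_cache[name] = data  # assumption: warm-up data is cached locally too
            return data, "warm-up service"

    # 3. Fall back to the source (OSS or NAS) and cache the result locally.
    data = source(name)
    if data is None:
        raise FileNotFoundError(name)
    local_cache[name] = data
    return data, "source"


# Usage with in-memory stand-ins for the three tiers.
local: Dict[str, bytes] = {}
warm = {"model.safetensors": b"weights"}
src = {"model.safetensors": b"weights", "config.json": b"{}"}
print(load_model_file("model.safetensors", local, warm.get, src.get)[1])
print(load_model_file("model.safetensors", local, warm.get, src.get)[1])
```

The first call is served by the warm-up tier and the second by the local cache, which mirrors the difference between a cold start and a later read on the same instance.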

Notes

To maintain data consistency, the mounted accelerated path is read-only.
To add model files, add them to the source path. The accelerated path automatically reads the new files from the source.
Do not update or delete existing model files in the source path. Doing so can cause the cache to serve stale or inconsistent data.

Procedure

Custom deployment

Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).
Click Deploy Service. In the Custom Model Deployment section, click Custom Deployment.

The key parameters are as follows. For details about other parameters, see custom deployment.

In the Environment Information section, configure Mount storage to mount your model files to a container directory. For example, to mount files from OSS:
- Uri: The OSS path where the model files are located, such as oss://path/to/models/Qwen3-8B/.
- Mount Path: The path inside the container where the files are mounted, such as /mnt/models/Qwen3-8B/.

In the Features section, turn on the Distributed cache acceleration switch and configure the following parameters:

Parameter	Description
Maximum Memory Usage	The maximum memory that the cache can use, in GB. When this limit is exceeded, the system evicts data based on an LRU policy. Example: `20` GB.
Source Path	The source directory to accelerate. Enter the container path where OSS or NAS storage is mounted.
Accelerated Path	The local cache path. Your application reads models from this directory. This path must be different from the source path. Example: `/mnt/models/Qwen3-8B-fast/`.
Model Cache Prefetch Service	(Optional) Select a deployed cache warm-up service. This option is recommended for new deployments or scenarios that require fast cold starts, such as those involving large model files or frequent scale-outs. To use this option, you must first deploy a cache warm-up service.
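
When choosing a value for Maximum Memory Usage, it can help to know how large the source directory is. The following sketch sums the file sizes under a directory. It assumes you want the whole model directory to fit in the cache and that GB means decimal gigabytes (10^9 bytes); neither is stated by the platform, and `dir_size_gb` is a hypothetical helper name.

```python
import os


def dir_size_gb(path: str) -> float:
    """Total size of all regular files under `path`, in decimal GB (1e9 bytes)."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for file_name in files:
            total_bytes += os.path.getsize(os.path.join(root, file_name))
    return total_bytes / 1e9


if __name__ == "__main__":
    # Example source path from this page; replace it with your own mount path.
    model_dir = "/mnt/models/Qwen3-8B/"
    if os.path.isdir(model_dir):
        print(f"{model_dir}: {dir_size_gb(model_dir):.2f} GB")
```

Run it inside a container or instance where the source path is mounted, then set Maximum Memory Usage with whatever headroom fits your instance memory.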

In the Environment Information section, modify the Command to Run. Change the model file path in the command from the source path to the accelerated path. For example, when deploying an LLM service:
```
vllm serve /mnt/models/Qwen3-8B-fast/
```

After configuring the parameters, click Deploy.

JSON deployment

Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).
On the Elastic Algorithm Service (EAS) page, click Deploy Service. In the Custom Model Deployment section, click JSON Deployment.

Enter the JSON configuration. Example:

```
{
    "cloud": {
        "computing": {
            "instances": [
                {
                    "type": "ecs.gn6e-c12g1.3xlarge"
                }
            ]
        },
        "networking": {
            "security_group_id": "your-security-group-id",
            "vpc_id": "your-vpc-id",
            "vswitch_id": "your-vswitch-id"
        }
    },
    "containers": [
        {
            "image": "eas-registry-vpc.cn-hangzhou.cr.aliyuncs.com/pai-eas/vllm:0.11.2-py312-mows0.5.1",
            "port": 8000,
            "script": "vllm serve /mnt/models/Qwen3-8B-fast/"
        }
    ],
    "metadata": {
        "cpu": 12,
        "disk": "30Gi",
        "gpu": 1,
        "instance": 1,
        "memory": 92000,
        "name": "vllm_test",
        "workspace_id": "your-workspace-id"
    },
    "storage": [
        {
            "mount_path": "/mnt/models/Qwen3-8B/",
            "oss": {
                "path": "oss://path/to/models/Qwen3-8B/",
                "readOnly": false
            }
        },
        {
            "cache": {
                "capacity": "10G",
                "path": "/mnt/models/Qwen3-8B/",
                "cacheroot_service": "your-cacheroot-service"
            },
            "mount_path": "/mnt/models/Qwen3-8B-fast/"
        }
    ]
}
```

The following table describes parameters related to model cache acceleration. For information about other parameters, see JSON deployment.

Parameter	Description
containers.script	Change the model file path in the run command from the source path (OSS/NAS mount path) to the accelerated path.
storage[].cache.capacity	The maximum memory that the cache can use, in GB, for example `"10G"`. When this limit is exceeded, the system evicts data based on an LRU policy.
storage[].cache.path	The source directory to accelerate. Enter the container path where OSS or NAS storage is mounted.
storage[].cache.preload	Caches files in memory when the service starts. Set this to `"/"`.
storage[].cache.cacheroot_service	The name of the cache warm-up service.
storage[].mount_path	The path in the container where the storage is mounted. For the cache entry, this is the accelerated path.
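
As a rough illustration of how the two `storage` entries relate, the sketch below builds them from a few inputs and checks that the accelerated path differs from the source path. The JSON field names are taken from the example configuration above; the function name `cache_storage_entries` and its parameters are hypothetical and are not part of any EAS SDK.

```python
import json
from typing import Any, Dict, List, Optional


def cache_storage_entries(
    oss_uri: str,
    source_path: str,
    accelerated_path: str,
    capacity_gb: int,
    warmup_service: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Build the OSS mount entry and the cache entry for the `storage` list."""
    # The accelerated path must be different from the source path.
    if source_path.rstrip("/") == accelerated_path.rstrip("/"):
        raise ValueError("accelerated path must differ from the source path")

    # cache.path points at the container path where OSS (or NAS) is mounted.
    cache: Dict[str, Any] = {"capacity": f"{capacity_gb}G", "path": source_path}
    if warmup_service:
        cache["cacheroot_service"] = warmup_service  # optional warm-up service

    return [
        {"mount_path": source_path, "oss": {"path": oss_uri, "readOnly": False}},
        {"cache": cache, "mount_path": accelerated_path},
    ]


entries = cache_storage_entries(
    oss_uri="oss://path/to/models/Qwen3-8B/",
    source_path="/mnt/models/Qwen3-8B/",
    accelerated_path="/mnt/models/Qwen3-8B-fast/",
    capacity_gb=10,
    warmup_service="your-cacheroot-service",
)
print(json.dumps({"storage": entries}, indent=4))
```

The printed fragment has the same shape as the `storage` section of the example; the run command in `containers.script` must still point at the accelerated path.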

Click Deploy.

Deploy a cache warm-up service

A cache warm-up service preloads model files into memory and serves as a high-speed data source for inference services with cache acceleration enabled. This is ideal for large model files on OSS or NAS, such as those for LLMs, AI image generation, and AI video generation.

Important

The OSS path corresponding to the source path in the inference service's cache acceleration configuration must exactly match the OSS path mounted by the cache warm-up service. If the paths do not match, cache warm-up will fail.

For example, if the source path for cache acceleration in the inference service is /mnt/models/Qwen3-8B/, which corresponds to the OSS path oss://path/to/models/Qwen3-8B/, the cache warm-up service must also mount oss://path/to/models/Qwen3-8B/.
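
A pre-deployment check for this requirement might look like the following sketch. It compares the OSS URI behind the inference service's source path with the URIs mounted by the warm-up service using strict string equality. Whether the platform tolerates a difference such as a trailing slash is not stated here, so the sketch reports that case separately instead of treating it as a match. The function name and return values are hypothetical.

```python
from typing import Iterable


def check_warmup_path(source_oss_uri: str, warmup_oss_uris: Iterable[str]) -> str:
    """Classify how the source OSS URI relates to the warm-up service's mounts.

    Returns "match", "trailing-slash mismatch", or "no match".
    """
    mounted = list(warmup_oss_uris)

    # Exact string match is the documented requirement.
    if source_oss_uri in mounted:
        return "match"

    # Flag near misses that differ only by a trailing slash, a likely typo.
    stripped = source_oss_uri.rstrip("/")
    if any(uri.rstrip("/") == stripped for uri in mounted):
        return "trailing-slash mismatch"

    return "no match"


print(check_warmup_path(
    "oss://path/to/models/Qwen3-8B/",
    ["oss://path/to/models/Qwen3-8B/"],
))  # match
```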

On the Inference Service tab, click Deploy Service. In the Scenario-based Model Deployment section, click Model Warm-up Cache Service Deployment.

Configure the following key parameters, and then click Deploy.

Section	Parameter	Description
Basic Information	Deployment	Select resources based on the required memory size.
Cache Configuration	Cache Path	The model directories to be cached. You can mount multiple paths.
Cache Configuration	Maximum Memory Usage	Required. The maximum memory that the cache warm-up service can use.
Network Information	VPC	Required. This must be the same VPC as the inference service. Otherwise, the inference service cannot access the cache warm-up service.
Network Information	Associate NLB	Must be enabled. By default, the system automatically creates an NLB.

Performance benchmarks

The following benchmarks show the performance of model cache acceleration. Actual results may vary.

Qwen3-32B

Model: Qwen3-32B (62 GB)

Machine: ml.gu8is.c64m512.4-gu60 | 64 cores + 512 GB + 4×GU60 | L20 (48 GB)

Deployment mode	Model loading time	Model loading speed	Service readiness time
Standard (no cache acceleration)	01:05	7.63 Gbit/s	01:43
Cold start acceleration (with cache warm-up)	00:21	23.62 Gbit/s	01:01
Scale-out acceleration (with local cache)	00:18	27.55 Gbit/s	00:58

MiniMax-M2

Model: MiniMax-M2 (215 GB)

Machine: ml.gu8tf.8.40xlarge | 160 vCPU + 1800 GB + 8×GU8T | H20 (96 GB)

Deployment mode	Model loading time	Model loading speed	Service readiness time
Standard (no cache acceleration)	06:42	4.28 Gbit/s	09:16
Cold start acceleration (with cache warm-up)	01:49	15.78 Gbit/s	04:49
Scale-out acceleration (with local cache)	01:42	16.86 Gbit/s	04:34

DeepSeek-V3.2

Model: DeepSeek-V3.2 (643 GB)

Machine: ml.gu8tef.8.46xlarge | 184 vCPU + 1800 GB + 8×GU8TE | H20-3e (141 GB)

Deployment mode	Model loading time	Model loading speed	Service readiness time
Standard (no cache acceleration)	12:33	6.83 Gbit/s	27:41
Cold start acceleration (with cache warm-up)	02:43	31.56 Gbit/s	13:01
Scale-out acceleration (with local cache)	01:58	43.60 Gbit/s	12:49
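
The loading speeds in the tables above appear to follow from model size and loading time: size in GB × 8 bits per byte, divided by the loading time in seconds. The sketch below reproduces that arithmetic and computes the speedup over the standard deployment. Decimal GB is assumed, the helper names are hypothetical, and recomputed values agree with the tables only to within rounding.

```python
def mmss_to_seconds(mmss: str) -> int:
    """Convert a 'MM:SS' string such as '01:05' to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def loading_speed_gbit_s(size_gb: float, load_time: str) -> float:
    """Average loading speed in Gbit/s for a model of `size_gb` decimal GB."""
    return size_gb * 8 / mmss_to_seconds(load_time)


def speedup(baseline_time: str, accelerated_time: str) -> float:
    """How many times faster the accelerated load is than the baseline."""
    return mmss_to_seconds(baseline_time) / mmss_to_seconds(accelerated_time)


# Qwen3-32B (62 GB), standard deployment, loading time 01:05.
print(round(loading_speed_gbit_s(62, "01:05"), 2))   # 7.63
# DeepSeek-V3.2, standard 12:33 versus scale-out 01:58.
print(round(speedup("12:33", "01:58"), 2))           # 6.38
```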