Use OSS Connector for AI/ML to accelerate AI model deployment

OSS Connector for AI/ML provides a non-intrusive way to load models without requiring code changes. It uses LD_PRELOAD for high-performance, direct reads from OSS and supports prefetch and caching to significantly improve model loading speed. The connector is compatible with containers and mainstream inference frameworks.

High performance

The connector can achieve over 10 GB/s of throughput when sufficient network bandwidth is available. For more information, see Performance testing.

The model broadcast feature lets you batch-start inference services for the same model. A single node loads the model from OSS, and the remaining nodes use local storage and network resources to distribute the model through a chain topology structure. This significantly reduces origin-pull pressure and improves startup efficiency for large-scale node deployments.

How it works

OSS Connector for AI/ML addresses the performance bottlenecks of loading large models from OSS in the cloud.

Traditional FUSE-based mount solutions often fail to fully utilize OSS's high bandwidth, resulting in slow model loading. OSS Connector for AI/ML improves data access efficiency by intercepting I/O requests from the inference framework and converting them directly into HTTP(S) requests to OSS. Because the interception is done through the LD_PRELOAD mechanism, no changes to your application code are required. Model data is prefetched and cached in memory, which significantly speeds up model loading.

Deployment environment

Operating system: Linux x86-64
glibc: >=2.17

Installation

Download the complete installation package.
- oss-connector-lib-1.2.0.x86_64.rpm: For Red Hat-based Linux distributions
```
https://gosspublic.alicdn.com/oss-connector/oss-connector-lib-1.2.0.x86_64.rpm
```
- oss-connector-lib-1.2.0.x86_64.deb: For Debian-based Linux distributions
```
https://gosspublic.alicdn.com/oss-connector/oss-connector-lib-1.2.0.x86_64.deb
```
Install the OSS connector.
When you use an .rpm or .deb package to install OSS Connector, the dynamic library file libossc_preload.so is automatically installed to the /usr/local/lib/ directory.
- Install oss-connector-lib-1.2.0.x86_64.rpm
```
yum install -y oss-connector-lib-1.2.0.x86_64.rpm
```
- Install oss-connector-lib-1.2.0.x86_64.deb
```
dpkg -i oss-connector-lib-1.2.0.x86_64.deb
```
After installation, verify that /usr/local/lib/libossc_preload.so exists and that the version is correct.
```
nm -D /usr/local/lib/libossc_preload.so | grep version
```
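The deployment requirements and the installation check above can also be scripted before a rollout. The following is a minimal sketch, assuming Python 3 is available on the node. The helper names (`parse_version`, `glibc_ok`, `check_host`) are hypothetical and not part of the connector; only the glibc minimum and the library path come from this page.

```python
import os
import platform

MIN_GLIBC = (2, 17)  # minimum glibc version listed under "Deployment environment"
PRELOAD_LIB = "/usr/local/lib/libossc_preload.so"  # install path stated above


def parse_version(text):
    """Turn a version string such as '2.31' into a tuple of ints, e.g. (2, 31)."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def glibc_ok(version_text, minimum=MIN_GLIBC):
    """Return True when the given glibc version meets the minimum."""
    return parse_version(version_text) >= minimum


def check_host():
    """Collect human-readable problems; an empty list means the host looks fine."""
    problems = []
    if platform.system() != "Linux" or platform.machine() != "x86_64":
        problems.append("connector requires Linux x86-64")
    libc_name, libc_version = platform.libc_ver()
    if libc_name != "glibc" or not glibc_ok(libc_version):
        problems.append(
            f"glibc >= 2.17 required, found {libc_name or 'unknown'} {libc_version}"
        )
    if not os.path.isfile(PRELOAD_LIB):
        problems.append(f"{PRELOAD_LIB} not found")
    return problems


if __name__ == "__main__":
    for problem in check_host():
        print("WARNING:", problem)
```

Comparing versions as integer tuples avoids the string-comparison trap where "2.5" would sort after "2.17".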

Configuration

Configuration file

The configuration file controls log output, cache policies, and prefetch concurrency. Properly configuring these parameters can improve system performance and maintainability.

The path to the configuration file is /etc/oss-connector/config.json. The installation package includes a default configuration file. The configuration is as follows:

{
    "logLevel": 1,
    "logPath": "/var/log/oss-connector/connector.log",
    "auditPath": "/var/log/oss-connector/audit.log",
    "expireTimeSec": 120,
    "prefetch": {
        "vcpus": 16,
        "workers": 16,
        "maxCacheAdviseGB": -1
    }
}

Parameter	Description
logLevel	The log level. Controls the verbosity of the log output.
logPath	The log file path. Specifies the output location for runtime logs.
auditPath	The audit log file path. Records audit information for security and compliance tracking.
expireTimeSec	The delay in seconds before releasing cached files that are no longer referenced. Default: 120.
prefetch.vcpus	The number of vCPUs (virtual CPUs) for prefetching. Default: 16.
prefetch.workers	The number of coroutine workers per vCPU to increase concurrency. Default: 16.
prefetch.maxCacheAdviseGB	The size of the memory cache in GB that can be used for prefetching. Default: -1 (unlimited).
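To make the relationship between the prefetch parameters concrete: since `prefetch.workers` is described as a per-vCPU count, the defaults imply 16 × 16 = 256 prefetch workers in total. The sketch below generates a configuration file with selected prefetch values overridden. The helper names are hypothetical, the output path is whatever you choose, and the multiplication reflects the parameter descriptions above rather than any documented internals.

```python
import json

# Defaults copied from the packaged /etc/oss-connector/config.json shown above.
DEFAULT_CONFIG = {
    "logLevel": 1,
    "logPath": "/var/log/oss-connector/connector.log",
    "auditPath": "/var/log/oss-connector/audit.log",
    "expireTimeSec": 120,
    "prefetch": {"vcpus": 16, "workers": 16, "maxCacheAdviseGB": -1},
}


def build_config(**prefetch_overrides):
    """Return a copy of the defaults with selected prefetch values replaced."""
    unknown = set(prefetch_overrides) - set(DEFAULT_CONFIG["prefetch"])
    if unknown:
        raise KeyError(f"unknown prefetch keys: {sorted(unknown)}")
    config = json.loads(json.dumps(DEFAULT_CONFIG))  # cheap deep copy
    config["prefetch"].update(prefetch_overrides)
    return config


def total_prefetch_workers(config):
    """workers is a per-vCPU count, so the total is vcpus * workers."""
    prefetch = config["prefetch"]
    return prefetch["vcpus"] * prefetch["workers"]


def write_config(config, path):
    """Write the configuration as indented JSON to the given path."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=4)
        handle.write("\n")
```

For example, `build_config(vcpus=8, maxCacheAdviseGB=64)` halves the vCPU count (128 workers in total) and caps the advised prefetch cache at 64 GB.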

Configure environment variables

Environment variable	Description
OSS_ACCESS_KEY_ID, OSS_ACCESS_KEY_SECRET	The AccessKey ID and AccessKey Secret of an Alibaba Cloud account or a RAM user. When you configure permissions with a temporary access credential, set these variables to the AccessKey ID and AccessKey Secret of that credential. OSS Connector for AI/ML requires the `oss:ListObjects` permission for the target bucket and directory. If the bucket and files you are accessing support anonymous access, you can leave the `OSS_ACCESS_KEY_ID` and `OSS_ACCESS_KEY_SECRET` environment variables unset or set them to empty strings.
OSS_SESSION_TOKEN	The temporary access token. This is required when you use a temporary access credential from STS to access OSS. If you use the AccessKey ID and AccessKey Secret of an Alibaba Cloud account or RAM user for permission configuration, leave this field empty.
OSS_ENDPOINT	Specify the OSS service endpoint. An example value is `http://oss-cn-beijing-internal.aliyuncs.com`. If you do not specify a protocol, HTTPS is used by default. We recommend that you use the HTTP protocol in secure environments, such as an internal network, for better performance.
OSS_REGION	The OSS region ID, such as `cn-beijing`. If not specified, authentication might fail.
OSS_PATH	The OSS model directory, in the format `oss://bucketname/path/`. For example, `oss://examplebucket/qwen/Qwen3-8B/`.
MODEL_DIR	The local model directory for vllm or other inference frameworks. We recommend starting with an empty directory. Temporary data downloaded during use can be safely deleted afterward. Note The `MODEL_DIR` path must match the model path used by the inference framework, such as the `--model` parameter for vllm or the `--model-path` parameter for sglang. `MODEL_DIR` requires read and write permissions. The directory structure of `MODEL_DIR` must correspond to that of `OSS_PATH`. During model loading, model files are prefetched and cached in memory. After loading, the cache is released with a delay, which defaults to 120 seconds. You can adjust this with the `expireTimeSec` parameter in the configuration file. The local model directory should only be used for loading models with the connector; it is not valid for other purposes. Do not create the local model directory on another OSS mount point, such as an ossfs mount point.
LD_PRELOAD	The path of the dynamic library to preload is typically `/usr/local/lib/libossc_preload.so`. We recommend that you use a temporary environment variable for configuration. For example, `LD_PRELOAD=/usr/local/lib/libossc_preload.so ENABLE_CONNECTOR=1 ./myapp`
ENABLE_CONNECTOR	Sets the OSS Connector process role. Use a temporary environment variable to apply this setting. `ENABLE_CONNECTOR=1`: Primary connector role. `ENABLE_CONNECTOR=2`: Secondary connector role. Within a single running instance, only one process can have the primary connector role. We recommend assigning this role to the main process (for example, the entrypoint). Other processes that use the connector must be assigned the secondary connector role. For a usage example, see the ray+vllm example for multi-node startup.
OSS_AUTHORIZATION_FILE_PATH	The path to a JSON-formatted credential file. AccessKey ID and AccessKey Secret of an Alibaba Cloud account or RAM user: `{ "AccessKeyId": "LTAI**********************", "AccessKeySecret": "At32********************" }` Temporary access credential: `{ "AccessKeyId": "STS.L4aB**************", "AccessKeySecret": "wyLTSm*********************", "SecurityToken": "********", "Expiration": "2024-08-15T15:04:05Z" }` Note This setting has a higher priority than the `OSS_ACCESS_KEY_ID`, `OSS_ACCESS_KEY_SECRET`, and `OSS_SESSION_TOKEN` environment variables.
CONNECTOR_CONFIG_PATH	You can modify the configuration file path by using an environment variable. Default value: `/etc/oss-connector/config.json`
CONNECTOR_UDS_PATH	You can set the Unix Domain Socket (UDS) file path by using an environment variable. Default value: `/run/modelconnector.sock` Note The primary and secondary connector processes communicate through UDS.
CONNECTOR_MAX_CACHE_ADVISE_GB	Sets the size of the memory cache in GB that can be used for prefetching. Note This has the same function as `prefetch.maxCacheAdviseGB` in the configuration file but has a higher priority.
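As an illustration of the credential file that `OSS_AUTHORIZATION_FILE_PATH` points to, the sketch below writes the JSON layout shown in the table. The function name and the choice of owner-only file permissions are assumptions made for this example. The field names (`AccessKeyId`, `AccessKeySecret`, `SecurityToken`, `Expiration`) are taken from the table above.

```python
import json
import os
import stat


def write_credential_file(path, access_key_id, access_key_secret,
                          security_token=None, expiration=None):
    """Write a credential file in the JSON layout shown for OSS_AUTHORIZATION_FILE_PATH."""
    payload = {"AccessKeyId": access_key_id, "AccessKeySecret": access_key_secret}
    if security_token is not None:
        # Temporary (STS) credentials also carry a security token.
        payload["SecurityToken"] = security_token
        if expiration is not None:
            payload["Expiration"] = expiration
    # A newly created file gets mode 0600 (owner read/write only).
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    fd = os.open(path, flags, stat.S_IRUSR | stat.S_IWUSR)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(payload, handle)
    return payload
```

After writing the file, export its location (for example, `OSS_AUTHORIZATION_FILE_PATH=/path/to/credentials.json`) in the environment of the process that loads the connector.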

Start model service

Single-node startup

vLLM API server

LD_PRELOAD=/usr/local/lib/libossc_preload.so \
ENABLE_CONNECTOR=1 \
OSS_ACCESS_KEY_ID=${OSS_ACCESS_KEY_ID} \
OSS_ACCESS_KEY_SECRET=${OSS_ACCESS_KEY_SECRET} \
OSS_ENDPOINT=${OSS_ENDPOINT} \
OSS_REGION=${OSS_REGION} \
OSS_PATH=${OSS_PATH} \
MODEL_DIR=/tmp/model \
python3 -m vllm.entrypoints.openai.api_server --model /tmp/model --trust-remote-code --tensor-parallel-size 1 --disable-custom-all-reduce

SGLang API server

LD_PRELOAD=/usr/local/lib/libossc_preload.so \
ENABLE_CONNECTOR=1 \
OSS_ACCESS_KEY_ID=${OSS_ACCESS_KEY_ID} \
OSS_ACCESS_KEY_SECRET=${OSS_ACCESS_KEY_SECRET} \
OSS_ENDPOINT=${OSS_ENDPOINT} \
OSS_REGION=${OSS_REGION} \
OSS_PATH=${OSS_PATH} \
MODEL_DIR=/tmp/model \
python3 -m sglang.launch_server --model-path /tmp/model --port 8000
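When the inference server is started from an orchestration script rather than a shell, the same "temporary environment variable" approach can be reproduced by passing a modified environment to the child process only. The following is a minimal sketch. `connector_env` and `launch` are hypothetical helper names, and the credentials, endpoint, and region are assumed to be present already in the parent environment.

```python
import os
import subprocess

PRELOAD_LIB = "/usr/local/lib/libossc_preload.so"  # install path from this page


def connector_env(oss_path, model_dir, role=1, base_env=None):
    """Return a child-process environment with the connector variables added."""
    if role not in (1, 2):
        raise ValueError("ENABLE_CONNECTOR must be 1 (primary) or 2 (secondary)")
    if not oss_path.startswith("oss://"):
        raise ValueError("OSS_PATH must look like oss://bucketname/path/")
    # Copy the parent environment so credentials and the endpoint are inherited.
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "LD_PRELOAD": PRELOAD_LIB,
        "ENABLE_CONNECTOR": str(role),
        "OSS_PATH": oss_path,
        "MODEL_DIR": model_dir,
    })
    return env


def launch(argv, env):
    """Start the inference server; only this child process sees LD_PRELOAD."""
    return subprocess.Popen(argv, env=env)
```

A call such as `launch(["python3", "-m", "sglang.launch_server", "--model-path", "/tmp/model", "--port", "8000"], connector_env("oss://examplebucket/qwen/Qwen3-8B/", "/tmp/model"))` mirrors the SGLang command above while leaving the parent script's own environment free of `LD_PRELOAD`.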

Multi-model loading

When an inference task involves multiple models, such as Speculative Decoding, the OSS Connector supports loading multiple models from OSS simultaneously. Simply set OSS_PATH to the common parent path of all models and MODEL_DIR to the corresponding local parent directory.

The following example shows how to use Speculative Decoding with vllm to load the target model Qwen3-32B and the draft model Qwen3-0.6B simultaneously:

export OSS_ACCESS_KEY_ID=${OSS_ACCESS_KEY_ID}
export OSS_ACCESS_KEY_SECRET=${OSS_ACCESS_KEY_SECRET}
export OSS_ENDPOINT=${OSS_ENDPOINT}
export OSS_REGION=${OSS_REGION}
export OSS_PATH=oss://examplebucket/
export MODEL_DIR=/tmp/models

LD_PRELOAD=/usr/local/lib/libossc_preload.so ENABLE_CONNECTOR=1 \
python3 -m vllm.entrypoints.openai.api_server \
    --model ${MODEL_DIR}/qwen/Qwen3-32B/ --trust-remote-code \
    --tensor-parallel-size 1 --disable-custom-all-reduce \
    --speculative_config '{"model": "'"${MODEL_DIR}/qwen/Qwen3-0___6B/"'", "num_speculative_tokens": 5}'

Note

There is a correspondence between OSS_PATH and MODEL_DIR. For example, if the target model path on OSS is oss://examplebucket/qwen/Qwen3-32B/ and the draft model path is oss://examplebucket/qwen/Qwen3-0___6B/, set OSS_PATH to their common parent path, oss://examplebucket/, and set MODEL_DIR to /tmp/models. The corresponding local paths for the target and draft models are /tmp/models/qwen/Qwen3-32B/ and /tmp/models/qwen/Qwen3-0___6B/, respectively.
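The correspondence described in this note can be expressed as a small path-mapping function. This is a sketch for reasoning about which local path to pass to the inference framework; `local_model_path` is a hypothetical helper, not a connector API.

```python
import posixpath


def local_model_path(oss_path, model_dir, object_url):
    """Map an oss:// URL under OSS_PATH to its counterpart under MODEL_DIR."""
    prefix = oss_path if oss_path.endswith("/") else oss_path + "/"
    if not object_url.startswith(prefix):
        raise ValueError(f"{object_url} is not under {prefix}")
    relative = object_url[len(prefix):]  # part of the URL below OSS_PATH
    return posixpath.join(model_dir, relative)


# The two models from the speculative decoding example above.
target = local_model_path(
    "oss://examplebucket/", "/tmp/models", "oss://examplebucket/qwen/Qwen3-32B/"
)
draft = local_model_path(
    "oss://examplebucket/", "/tmp/models", "oss://examplebucket/qwen/Qwen3-0___6B/"
)
print(target)  # /tmp/models/qwen/Qwen3-32B/
print(draft)   # /tmp/models/qwen/Qwen3-0___6B/
```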

Multi-node startup

In multi-node deployment scenarios, OSS Connector for AI/ML supports model broadcast. When model broadcast is enabled, only a single node loads model data from OSS. The remaining nodes distribute the model data through a chain topology structure, which avoids the high bandwidth consumption that occurs when multiple nodes pull from the origin simultaneously. For more information about model broadcast, see Model broadcast.
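The effect of a chain topology on origin traffic can be pictured with a small sketch. This illustrates chain distribution in general. This page does not describe the connector's actual node ordering or transfer protocol, so `chain_plan`, `origin_pulls`, and the node names are hypothetical.

```python
def chain_plan(nodes):
    """Return (source, destination) transfer pairs for a simple chain."""
    if not nodes:
        return []
    plan = [("OSS", nodes[0])]  # only the first node pulls from the origin
    for upstream, downstream in zip(nodes, nodes[1:]):
        plan.append((upstream, downstream))  # each node forwards to the next
    return plan


def origin_pulls(plan):
    """Count how many transfers read directly from OSS."""
    return sum(1 for source, _ in plan if source == "OSS")


plan = chain_plan(["node-a", "node-b", "node-c", "node-d"])
for source, destination in plan:
    print(f"{source} -> {destination}")
print("origin pulls:", origin_pulls(plan))
```

In this picture the number of reads from OSS stays at one regardless of how many nodes join the chain, whereas without broadcast every node would pull the full model from the origin.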

Ray and vllm

Common environment variables:

export OSS_ACCESS_KEY_ID=${OSS_ACCESS_KEY_ID}
export OSS_ACCESS_KEY_SECRET=${OSS_ACCESS_KEY_SECRET}
export OSS_ENDPOINT=${OSS_ENDPOINT}
export OSS_REGION=${OSS_REGION}
export OSS_PATH=oss://examplebucket/
export MODEL_DIR=/tmp/models

Important

The OSS_PATH and MODEL_DIR variables must correspond. For example, if the model path on OSS is oss://examplebucket/qwen/Qwen2___5-72B/, the local model directory is /tmp/models/qwen/Qwen2___5-72B/.

Start the ray head on Pod A:

LD_PRELOAD=/usr/local/lib/libossc_preload.so ENABLE_CONNECTOR=1 ray start --head --dashboard-host 0.0.0.0 --block

Start ray on Pod B and join the cluster:

LD_PRELOAD=/usr/local/lib/libossc_preload.so ENABLE_CONNECTOR=1 ray start --address='172.24.176.137:6379' --block     # Use the head pod's IP address, which is provided in the 'ray start' output from Pod A.

Start the vllm API Server:

LD_PRELOAD=/usr/local/lib/libossc_preload.so ENABLE_CONNECTOR=2 python3 -m vllm.entrypoints.openai.api_server --model ${MODEL_DIR}/qwen/Qwen2___5-72B/ --trust-remote-code --served-model-name ds --max-model-len 2048 --gpu-memory-utilization 0.98 --tensor-parallel-size 32

SGLang

Configure environment variables for the sglang process on each node.

Primary node startup:

LD_PRELOAD=/usr/local/lib/libossc_preload.so \
ENABLE_CONNECTOR=1 \
OSS_ACCESS_KEY_ID=${OSS_ACCESS_KEY_ID} \
OSS_ACCESS_KEY_SECRET=${OSS_ACCESS_KEY_SECRET} \
OSS_ENDPOINT=${OSS_ENDPOINT} \
OSS_REGION=${OSS_REGION} \
OSS_PATH=${OSS_PATH} \
MODEL_DIR=/tmp/model \
python3 -m sglang.launch_server --model-path /tmp/model --port 8000 --dist-init-addr 192.168.1.1:20000 --nnodes 2 --node-rank 0

Secondary node startup:

LD_PRELOAD=/usr/local/lib/libossc_preload.so \
ENABLE_CONNECTOR=1 \
OSS_ACCESS_KEY_ID=${OSS_ACCESS_KEY_ID} \
OSS_ACCESS_KEY_SECRET=${OSS_ACCESS_KEY_SECRET} \
OSS_ENDPOINT=${OSS_ENDPOINT} \
OSS_REGION=${OSS_REGION} \
OSS_PATH=${OSS_PATH} \
MODEL_DIR=/tmp/model \
python3 -m sglang.launch_server --model-path /tmp/model --port 8000 --dist-init-addr 192.168.1.1:20000 --nnodes 2 --node-rank 1

Kubernetes deployment

When deploying OSS Connector for AI/ML in a Kubernetes environment, you can install it by using an Init Container, performing a dynamic installation at startup, or creating a custom image. For more information and a complete YAML example of a Kubernetes deployment, see Enable Connector in Kubernetes.

Performance testing

Single-node model loading test

Test environment

Item	Specification
OSS	Beijing, internal network download bandwidth 250 Gbps
Test node	ecs.g7nex.32xlarge, network bandwidth 160 Gbps (80 Gbps × 2)

Metrics

Metric	Description
Model download	The time taken to download the model files by using the connector.
End-to-end	The total time from starting the CPU version of the vllm API server until the service is ready.

Test results

Model name	Model size (GB)	Download time (s)	End-to-end time (s)
Qwen2.5-14B	27.522	1.7721	20.48
Qwen2.5-72B	135.437	10.57	30.09
Qwen3-8B	15.271	0.97	18.88
Qwen3-32B	61.039	3.99	22.97
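The download figures can be converted into effective throughput by dividing model size by download time. The short calculation below uses only the values from the results table; the variable and function names are chosen for this example.

```python
# (model name, model size in GB, download time in seconds) from the results table
RESULTS = [
    ("Qwen2.5-14B", 27.522, 1.7721),
    ("Qwen2.5-72B", 135.437, 10.57),
    ("Qwen3-8B", 15.271, 0.97),
    ("Qwen3-32B", 61.039, 3.99),
]


def throughput_gb_per_s(size_gb, seconds):
    """Effective download throughput in GB/s."""
    return size_gb / seconds


for name, size_gb, seconds in RESULTS:
    print(f"{name}: {throughput_gb_per_s(size_gb, seconds):.1f} GB/s")
```

Every model in the table works out to more than 10 GB/s (roughly 12.8 to 15.7 GB/s), which is consistent with the throughput stated under High performance.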