OSS Connector for AI/ML offers a non-intrusive model loading solution that requires no code changes. It uses `LD_PRELOAD` for high-performance direct reads from OSS. The connector supports prefetching and caching to significantly improve model loading speed. It works with containers and mainstream inference frameworks.
High performance
OSS Connector for AI/ML significantly improves performance when loading large models from OSS. With sufficient bandwidth, throughput can exceed 10 GB/s. For more information, see Performance testing.
How it works
OSS Connector for AI/ML addresses performance bottlenecks that occur when you load large models from OSS in a cloud environment.
Traditional mount solutions based on Filesystem in Userspace (FUSE) often cannot fully utilize the high bandwidth of OSS. This results in slow model loading. OSS Connector improves data access efficiency by intercepting I/O requests from the inference framework and converting them directly into HTTP(S) requests to OSS.
It uses the `LD_PRELOAD` mechanism to prefetch and cache model data in memory. This requires no code changes to your inference application and significantly speeds up model loading.
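For example, running a model loader with and without the connector differs only in the environment variables. The following is a minimal sketch in which `load_model.py`, the bucket path, and the local directory are placeholders, and the credential and endpoint variables are omitted for brevity:
# Plain run: model files are read through the regular filesystem path.
python3 load_model.py

# Run with the connector preloaded: file I/O under MODEL_DIR is intercepted by
# libossc_preload.so and served directly from OSS over HTTP(S).
LD_PRELOAD=/usr/local/lib/libossc_preload.so \
ENABLE_CONNECTOR=1 \
OSS_PATH=oss://examplebucket/qwen/Qwen3-8B/ \
MODEL_DIR=/tmp/model \
python3 load_model.py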
Deployment environment
Operating system: Linux x86-64
glibc: >=2.17
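You can check both requirements with standard commands:
uname -m         # should print x86_64
ldd --version    # the first line shows the glibc version, which must be 2.17 or later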
Install OSS Connector
Download the installation package that matches your Linux distribution.
oss-connector-lib-1.1.0rc7.x86_64.rpm: For Red Hat-based Linux distributions
https://gosspublic.alicdn.com/oss-connector/oss-connector-lib-1.1.0rc7.x86_64.rpm
oss-connector-lib-1.1.0rc7.x86_64.deb: For Debian-based Linux distributions
https://gosspublic.alicdn.com/oss-connector/oss-connector-lib-1.1.0rc7.x86_64.deb
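For example, to download a package from the command line, pick the one that matches your distribution:
# Red Hat-based distributions (RPM)
wget https://gosspublic.alicdn.com/oss-connector/oss-connector-lib-1.1.0rc7.x86_64.rpm

# Debian-based distributions (DEB)
wget https://gosspublic.alicdn.com/oss-connector/oss-connector-lib-1.1.0rc7.x86_64.deb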
Install OSS Connector.
Use the downloaded .rpm or .deb package for the installation. The dynamic library file `libossc_preload.so` is automatically installed to the `/usr/local/lib/` directory.
Install oss-connector-lib-1.1.0rc7.x86_64.rpm:
yum install -y oss-connector-lib-1.1.0rc7.x86_64.rpm
Install oss-connector-lib-1.1.0rc7.x86_64.deb:
dpkg -i oss-connector-lib-1.1.0rc7.x86_64.deb
After installation, verify that `/usr/local/lib/libossc_preload.so` exists and that the version is correct.
nm -D /usr/local/lib/libossc_preload.so | grep version
Configure OSS Connector
Configuration file
You can use the configuration file to control log output, the cache policy, and prefetch concurrency. Setting these parameters correctly can improve performance and simplify maintenance.
The configuration file is located at `/etc/oss-connector/config.json`. The installation package includes a default configuration file, as shown below:
{
    "logLevel": 1,
    "logPath": "/var/log/oss-connector/connector.log",
    "auditPath": "/var/log/oss-connector/audit.log",
    "expireTimeSec": 120,
    "prefetch": {
        "vcpus": 16,
        "workers": 16
    }
}
Parameter | Description |
logLevel | The log level. Controls the detail level of log output. |
logPath | The log file path. Specifies the output location for runtime logs. |
auditPath | The audit log file path. Records audit information for security and compliance tracking. |
expireTimeSec | The delayed release time for cached files, in seconds. Cached files are released after this delay once they are no longer referenced. Default: 120. |
prefetch.vcpus | The number of virtual CPUs (concurrent CPU cores) used for prefetching. Default: 16. |
prefetch.workers | The number of coroutines (workers) per vCPU, used to increase concurrency. Default: 16. |
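For example, after a model load you can inspect the runtime and audit logs at the paths configured above:
tail -n 20 /var/log/oss-connector/connector.log   # runtime log
tail -n 20 /var/log/oss-connector/audit.log       # audit log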
Configure environment variables
Set the following environment variables for the process that uses the connector.
OSS_ACCESS_KEY_ID
OSS_ACCESS_KEY_SECRET
The AccessKey ID and AccessKey secret of an Alibaba Cloud account or a Resource Access Management (RAM) user.
If you use a temporary access credential, set these variables to the AccessKey ID and AccessKey secret of that credential.
OSS Connector requires the `oss:ListObjects` permission on the target bucket directory. If the bucket and files you access allow anonymous access, you can leave `OSS_ACCESS_KEY_ID` and `OSS_ACCESS_KEY_SECRET` unset or set them to empty strings.
OSS_SESSION_TOKEN
The temporary access token. You must set this parameter when you use a temporary access credential from Security Token Service (STS) to access OSS.
If you use the long-term AccessKey ID and AccessKey secret of an Alibaba Cloud account or RAM user instead, set this variable to an empty string.
OSS_ENDPOINT
Specifies the OSS endpoint. Example: http://oss-cn-beijing-internal.aliyuncs.com. If you do not specify a protocol, HTTPS is used by default. In secure environments, such as an internal network, we recommend HTTP for better performance.
OSS_REGION
Specifies the OSS region ID. Example: cn-beijing. If this variable is not set, authentication may fail.
OSS_PATH
The OSS model directory. The format is `oss://bucketname/path/`. Example: oss://examplebucket/qwen/Qwen3-8B/.
MODEL_DIR
The local model directory that is passed to vllm or another inference framework. We recommend emptying the directory before use. Temporary data written to it during loading can be deleted afterward.
Note
The `MODEL_DIR` path must match the model path passed to the inference framework, such as the `--model` parameter for vllm or the `--model-path` parameter for sglang.
`MODEL_DIR` requires read and write permissions. The directory structure of `MODEL_DIR` corresponds to `OSS_PATH`.
During model loading, model files are prefetched and cached in memory. The cache is released after a delay when the model is loaded. The default delay is 120 seconds. You can adjust this with the `expireTimeSec` parameter in the configuration file.
Use the local model directory only for loading models with the connector. It cannot be used for other purposes.
Do not create the local model directory on another OSS mount target, such as an ossfs mount target.
LD_PRELOAD
The path to the dynamic library to preload, usually `/usr/local/lib/libossc_preload.so`. We recommend setting it as a temporary (per-command) environment variable. For example:
LD_PRELOAD=/usr/local/lib/libossc_preload.so ENABLE_CONNECTOR=1 ./myapp
ENABLE_CONNECTOR
Sets the OSS Connector process role. Set it as a temporary (per-command) environment variable so that it takes effect for the target process.
`ENABLE_CONNECTOR=1`: Primary connector role.
`ENABLE_CONNECTOR=2`: Secondary connector role.
A single running instance can have only one primary connector process. We recommend assigning the primary role to the main process, such as an entrypoint. All other processes that use the connector must be assigned the secondary connector role. For more information, see the ray+vllm example for multi-node startup.
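For example, a process that uses an STS temporary credential might export the variables as follows. All values are placeholders; if you use a long-term AccessKey pair instead, omit `OSS_SESSION_TOKEN` or set it to an empty string:
export OSS_ACCESS_KEY_ID=STS.****            # AccessKey ID of the temporary credential (placeholder)
export OSS_ACCESS_KEY_SECRET=****            # AccessKey secret of the temporary credential (placeholder)
export OSS_SESSION_TOKEN=****                # security token issued by STS (placeholder)
export OSS_ENDPOINT=http://oss-cn-beijing-internal.aliyuncs.com
export OSS_REGION=cn-beijing
export OSS_PATH=oss://examplebucket/qwen/Qwen3-8B/
export MODEL_DIR=/tmp/model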
Start the model service
Single-node startup
vllm API Server
LD_PRELOAD=/usr/local/lib/libossc_preload.so \
ENABLE_CONNECTOR=1 OSS_ACCESS_KEY_ID=${OSS_ACCESS_KEY_ID} \
OSS_ACCESS_KEY_SECRET=${OSS_ACCESS_KEY_SECRET} \
OSS_ENDPOINT=${OSS_ENDPOINT} \
OSS_REGION=${OSS_REGION} \
OSS_PATH=${OSS_PATH} \
MODEL_DIR=/tmp/model \
python3 -m vllm.entrypoints.openai.api_server --model /tmp/model --trust-remote-code --tensor-parallel-size 1 --disable-custom-all-reduce
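After the server reports that it is ready, you can send a test request to its OpenAI-compatible API. This sketch assumes the default port 8000 and that no `--served-model-name` is set, so the model name equals the model path:
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/tmp/model", "prompt": "Hello", "max_tokens": 16}'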
sglang API Server
LD_PRELOAD=/usr/local/lib/libossc_preload.so \
ENABLE_CONNECTOR=1 OSS_ACCESS_KEY_ID=${OSS_ACCESS_KEY_ID} \
OSS_ACCESS_KEY_SECRET=${OSS_ACCESS_KEY_SECRET} \
OSS_ENDPOINT=${OSS_ENDPOINT} \
OSS_REGION=${OSS_REGION} \
OSS_PATH=${OSS_PATH} \
MODEL_DIR=/tmp/model \
python3 -m sglang.launch_server --model-path /tmp/model --port 8000
Multi-node startup
ray+vllm
Common environment variables:
export OSS_ACCESS_KEY_ID=${OSS_ACCESS_KEY_ID}
export OSS_ACCESS_KEY_SECRET=${OSS_ACCESS_KEY_SECRET}
export OSS_ENDPOINT=${OSS_ENDPOINT}
export OSS_REGION=${OSS_REGION}
export OSS_PATH=oss://examplebucket/
export MODEL_DIR=/tmp/models
The `OSS_PATH` and `MODEL_DIR` variables must correspond. For example, if the model path on OSS is `oss://examplebucket/qwen/Qwen2___5-72B/`, the local model directory is `/tmp/models/qwen/Qwen2___5-72B/`.
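A sketch of preparing the local directory so that it mirrors the OSS layout, assuming the example paths above; as recommended for `MODEL_DIR`, empty any existing contents first:
mkdir -p /tmp/models/qwen/Qwen2___5-72B/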
Pod A starts the ray head:
LD_PRELOAD=/usr/local/lib/libossc_preload.so ENABLE_CONNECTOR=1 ray start --head --dashboard-host 0.0.0.0 --block
Pod B starts ray and joins the cluster:
LD_PRELOAD=/usr/local/lib/libossc_preload.so ENABLE_CONNECTOR=1 ray start --address='172.24.176.137:6379' --block
# 172.24.176.137 is the IP address of the head pod. Replace it with your own head pod IP.
# The exact command to join the cluster is printed in the output after you run `ray start` on Pod A.
Start the vllm API Server:
LD_PRELOAD=/usr/local/lib/libossc_preload.so ENABLE_CONNECTOR=2 python3 -m vllm.entrypoints.openai.api_server --model ${MODEL_DIR}/qwen/Qwen2___5-72B/ --trust-remote-code --served-model-name ds --max-model-len 2048 --gpu-memory-utilization 0.98 --tensor-parallel-size 32
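After the API server starts, you can verify that the model is being served. This sketch assumes the default port 8000 on the node where the server runs:
curl http://localhost:8000/v1/models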
sglang
Configure the environment variables for the sglang process on each node.
Primary node startup:
LD_PRELOAD=/usr/local/lib/libossc_preload.so \
ENABLE_CONNECTOR=1 OSS_ACCESS_KEY_ID=${OSS_ACCESS_KEY_ID} \
OSS_ACCESS_KEY_SECRET=${OSS_ACCESS_KEY_SECRET} \
OSS_ENDPOINT=${OSS_ENDPOINT} \
OSS_REGION=${OSS_REGION} \
OSS_PATH=${OSS_PATH} \
MODEL_DIR=/tmp/model \
python3 -m sglang.launch_server --model-path /tmp/model --port 8000 --dist-init-addr 192.168.1.1:20000 --nnodes 2 --node-rank 0
Secondary node startup:
LD_PRELOAD=/usr/local/lib/libossc_preload.so \
ENABLE_CONNECTOR=1 OSS_ACCESS_KEY_ID=${OSS_ACCESS_KEY_ID} \
OSS_ACCESS_KEY_SECRET=${OSS_ACCESS_KEY_SECRET} \
OSS_ENDPOINT=${OSS_ENDPOINT} \
OSS_REGION=${OSS_REGION} \
OSS_PATH=${OSS_PATH} \
MODEL_DIR=/tmp/model \
python3 -m sglang.launch_server --model-path /tmp/model --port 8000 --dist-init-addr 192.168.1.1:20000 --nnodes 2 --node-rank 1
Kubernetes deployment
To deploy a pod in a Kubernetes environment, first build an image with the connector installed and push the image to a repository. The following YAML file is an example of a Kubernetes pod deployment:
apiVersion: v1
kind: ConfigMap
metadata:
  name: connector-config
data:
  config.json: |
    {
        "logLevel": 1,
        "logPath": "/var/log/oss-connector/connector.log",
        "auditPath": "/var/log/oss-connector/audit.log",
        "expireTimeSec": 120,
        "prefetch": {
            "vcpus": 16,
            "workers": 16
        }
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-connector-deployment
spec:
  selector:
    matchLabels:
      app: model-connector
  template:
    metadata:
      labels:
        app: model-connector
    spec:
      imagePullSecrets:
        - name: acr-credential-beijing
      hostNetwork: true
      containers:
        - name: container-name
          image: {IMAGE_ADDRESS}
          imagePullPolicy: Always
          resources:
            requests:
              cpu: "24"
              memory: "700Gi"
            limits:
              cpu: "128"
              memory: "900Gi"
          command:
            - bash
            - -c
            - ENABLE_CONNECTOR=1 python3 -m vllm.entrypoints.openai.api_server --model /var/model --trust-remote-code --tensor-parallel-size 1 --disable-custom-all-reduce
          env:
            - name: LD_PRELOAD
              value: "/usr/local/lib/libossc_preload.so"
            - name: OSS_ENDPOINT
              value: "oss-cn-beijing-internal.aliyuncs.com"
            - name: OSS_REGION
              value: "cn-beijing"
            - name: OSS_PATH
              value: "oss://examplebucket/qwen/Qwen1.5-7B-Chat/"
            - name: MODEL_DIR
              value: "/var/model/"
            - name: OSS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: oss-access-key-connector
                  key: key
            - name: OSS_ACCESS_KEY_SECRET
              valueFrom:
                secretKeyRef:
                  name: oss-access-key-connector
                  key: secret
          volumeMounts:
            - name: connector-config
              mountPath: /etc/oss-connector/
      terminationGracePeriodSeconds: 10
      volumes:
        - name: connector-config
          configMap:
            name: connector-config
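A sketch of applying and checking the workload, assuming the manifest above is saved as model-connector.yaml (the file name is arbitrary):
kubectl apply -f model-connector.yaml
kubectl get pods -l app=model-connector
kubectl logs -l app=model-connector --tail=50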
Performance testing
Single-node model loading test
Test environment
Item | Description |
OSS | Beijing, internal network download bandwidth 250 Gbps |
Test node | ecs.g7nex.32xlarge, network bandwidth 160 Gbps (80 Gbps × 2) |
Metrics
Metric | Description |
Model download | The time from the start of the model file download through the connector to its completion. |
End-to-end | The time it takes for the CPU version of the vllm API server to start and become ready. |
Test results
Model name | Model size (GB) | Model download time (seconds) | End-to-end time (seconds) |
Qwen2.5-14B | 27.522 | 1.7721 | 20.48 |
Qwen2.5-72B | 135.437 | 10.57 | 30.09 |
Qwen3-8B | 15.271 | 0.97 | 18.88 |
Qwen3-32B | 61.039 | 3.99 | 22.97 |