When launching AI inference services on multiple nodes, the OSS Connector model broadcasting feature loads model data from OSS on only a single node. The remaining nodes then receive the data through a chain-based topology. This method significantly reduces back-to-source traffic and improves model distribution efficiency.
How it works
When multiple nodes running AI inference services pull model files from OSS simultaneously, the download activity can saturate the source's egress bandwidth. This creates a performance bottleneck that can cause startup delays or failures. This issue is particularly pronounced in regions with lower OSS egress bandwidth, where concurrent back-to-source traffic can severely impact deployment efficiency.
OSS Connector model broadcasting optimizes large-scale AI inference deployments. When multiple inference instances for the same model are launched, only one or a few nodes load the model data directly from OSS. This data is then distributed to other nodes through a chain-based topology. Model broadcasting leverages node storage and network resources to reduce back-to-source traffic, lessen the source's load, and improve distribution efficiency.
OSS Connector model broadcasting uses a chain-based transport method where model files are passed serially from one node to the next. Each node receives and forwards the data only once. For model file transfers, a single data stream is often sufficient to saturate the network bandwidth of most mainstream instance types. The chain-based method avoids the bandwidth bottlenecks that can occur in tree-based transport, where a node must send data to multiple downstream nodes simultaneously.
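The fan-out difference between the two topologies can be illustrated with a short sketch. The function names and the binary fan-out are assumptions made for illustration only; this is not the connector's actual scheduling logic.

```python
from collections import Counter


def chain_edges(nodes):
    """Chain: each node forwards to exactly one successor (n0 -> n1 -> n2 ...)."""
    return list(zip(nodes, nodes[1:]))


def tree_edges(nodes, fanout=2):
    """Complete k-ary tree in array order: the parent of node c is (c - 1) // k."""
    return [(nodes[(c - 1) // fanout], nodes[c]) for c in range(1, len(nodes))]


def max_outgoing_streams(edges):
    """Largest number of simultaneous downstream transfers any single node serves."""
    counts = Counter(src for src, _ in edges)
    return max(counts.values(), default=0)


nodes = [f"node-{i}" for i in range(8)]
print(max_outgoing_streams(chain_edges(nodes)))  # 1
print(max_outgoing_streams(tree_edges(nodes)))   # 2
```

With a single outgoing stream, a node can devote its full egress bandwidth to its one successor. In the tree, the same bandwidth is divided among the fan-out streams.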
OSS Connector preloads model files from OSS into a memory buffer using a high-concurrency strategy. This lets the inference engine load the model into GPU memory as needed. The buffered memory is then released after a delay once inference is complete. The model broadcasting feature builds on this by enabling the buffer to be shared across nodes. It integrates DADI P2P functionalities and requires only a Redis or Tair service for node discovery and metadata management. This setup allows buffered data to be distributed to other nodes. Compared to a single-node deployment, this solution adds only lightweight buffer-sharing logic while fully using idle node egress bandwidth during model loading. This provides a cost-effective and efficient method for distributed model loading.
With model broadcasting, only a single data stream is pulled from OSS at any given time for a specific model. This significantly reduces the load on the OSS source during batch startups. However, if source performance remains a bottleneck, you should use this feature with OSS Accelerator or the distributed cache version of DADI P2P.
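To get a feel for the reduction in back-to-source traffic, the following back-of-the-envelope sketch compares the two cases. It uses the model size and the Ulanqab OSS bandwidth quoted in the performance report later in this document, and it assumes that the model is fetched from OSS exactly once with broadcasting, that OSS egress is the only limit, and that GB means decimal gigabytes. The function names are illustrative.

```python
MODEL_SIZE_GB = 135.437  # Qwen2.5-72B size quoted in the performance report
OSS_EGRESS_GBPS = 10     # China (Ulanqab) intranet download bandwidth


def back_to_source_gb(nodes, broadcast):
    """Data pulled from OSS: one copy with broadcasting, one copy per node without."""
    return MODEL_SIZE_GB * (1 if broadcast else nodes)


def min_pull_seconds(total_gb, egress_gbps=OSS_EGRESS_GBPS):
    """Lower bound on transfer time if OSS egress bandwidth is the only limit."""
    return total_gb * 8 / egress_gbps


for broadcast in (False, True):
    total = back_to_source_gb(100, broadcast)
    print(f"broadcast={broadcast}: {total:.1f} GB, at least {min_pull_seconds(total):.0f} s")
```

Under these assumptions, 100 nodes pulling independently need about 13.5 TB from OSS and at least about 3 hours at 10 Gbps, while a single back-to-source stream needs about 135 GB and at least about 108 seconds.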
Prerequisites
OSS Connector for AI/ML v1.2.0 or later is installed. For installation instructions, see Improve model deployment efficiency with OSS Connector for AI/ML.
You have a Redis or Tair database available for node discovery and metadata management.
Configure the database
The model broadcasting feature requires a Redis or Tair service for node discovery and metadata management. You must configure this database to use the feature.
Option 1: Purchase and configure Tair (Recommended)
Tair is Alibaba Cloud's fully managed cloud database service, compatible with the Redis protocol.
Create a Tair instance. For instructions, see Quick start overview. The instance version must be 6.0 or later, and you can use the minimum specifications.
Configure a whitelist to ensure that the inference nodes can access the Tair instance.
You will need the Connection Address, Port Number, Username, and Password to configure model broadcasting.
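Before configuring model broadcasting, it can help to confirm from an inference node that the whitelist actually allows access. The following is a minimal sketch using only the Python standard library; `can_connect` is a hypothetical helper, and it checks TCP reachability only, not authentication.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

For example, call `can_connect("<connection-address>", 6379)` with your own Connection Address and Port Number. A result of `False` usually points to a whitelist or network configuration problem.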
Option 2: Deploy a standalone Redis service
Alternatively, you can deploy your own Redis service in a Kubernetes cluster.
The following YAML configuration deploys a Redis service with Access Control List (ACL) authentication.
Create an ACL configuration file and generate a Kubernetes secret.
# Create ACL content
cat > users.acl << EOF
user default off -@all
user Username on >Password ~* &* +@all
EOF
# Create a secret
kubectl create secret generic redis-acl-secret \
  --from-file=users.acl \
  --dry-run=client -o yaml | kubectl apply -f -
Note: Replace Username and Password with your actual username and password.
Use the following configuration to deploy the Redis service and its deployment.
# redis-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: redis
spec:
  selector:
    app: model-redis
  ports:
    - protocol: TCP
      port: 6379
      targetPort: 6379
---
# redis-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-redis-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: model-redis
  template:
    metadata:
      labels:
        app: model-redis
    spec:
      containers:
        - name: redis
          image: mirrors-ssl.aliyuncs.com/redis:8.4.0
          ports:
            - containerPort: 6379
          command: ["redis-server"]
          args:
            - "--aclfile"
            - "/etc/redis/users.acl"
            - "--maxmemory"
            - "900mb"
            - "--maxmemory-policy"
            - "volatile-lru"
            - "--save"
            - ""
            - "--appendonly"
            - "no"
            - "--loglevel"
            - "notice"
          resources:
            requests:
              memory: "1Gi"
              cpu: "100m"
            limits:
              memory: "1Gi"
              cpu: "200m"
          volumeMounts:
            - name: acl-config
              mountPath: /etc/redis/users.acl
              subPath: users.acl
      volumes:
        - name: acl-config
          secret:
            secretName: redis-acl-secret
To deploy the Redis service, run the following command.
kubectl apply -f redis-service.yaml
Enable model broadcasting
Add the model broadcasting configuration to the OSS Connector configuration file at /etc/oss-connector/config.json.
{
  ...
  "broadcast": {
    "enableBroadcast": true,
    "tenant": "${P2P_KEY_PREFIX}",
    "db": {
      "host": "${P2P_REDIS_HOST}",
      "port": 6379,
      "username": "${P2P_REDIS_USERNAME}",
      "password": "${P2P_REDIS_PASSWD}"
    }
  },
  "bindPort": 19898
  ...
}
The following table describes the configuration parameters.
Parameter | Description |
broadcast.enableBroadcast | Specifies whether to enable model broadcasting. Set this to true to enable the feature. |
broadcast.tenant | Specifies the tenant name. Nodes with the same tenant name can use model broadcasting. We recommend configuring a unique tenant for each service. |
broadcast.db.host | Specifies the connection address of the Redis or Tair service. |
broadcast.db.port | Specifies the port number of the Redis or Tair service. The default is 6379. |
broadcast.db.username | Specifies the username for the Redis or Tair service. |
broadcast.db.password | Specifies the password for the Redis or Tair service. |
bindPort | Specifies the port used to provide data to other nodes. The default value is 19898. |
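The `${...}` placeholders in the configuration example suggest that the values come from environment variables. The source does not say whether the connector expands them itself, so one option is to render the settings before startup. The sketch below assumes this approach; the helper names and the `P2P_REDIS_PORT` variable are hypothetical.

```python
import json


def build_broadcast_config(env):
    """Build the broadcast settings from a mapping of environment variables.

    Variable names mirror the placeholders in the configuration example.
    P2P_REDIS_PORT is a hypothetical addition and defaults to 6379.
    """
    return {
        "broadcast": {
            "enableBroadcast": True,
            "tenant": env["P2P_KEY_PREFIX"],
            "db": {
                "host": env["P2P_REDIS_HOST"],
                "port": int(env.get("P2P_REDIS_PORT", "6379")),
                "username": env["P2P_REDIS_USERNAME"],
                "password": env["P2P_REDIS_PASSWD"],
            },
        },
        "bindPort": 19898,
    }


def merge_into(existing, updates):
    """Return a copy of the existing config with top-level keys from updates replaced."""
    merged = dict(existing)
    merged.update(updates)
    return merged


# Example with placeholder values; in practice pass os.environ.
sample_env = {
    "P2P_KEY_PREFIX": "svc-a",
    "P2P_REDIS_HOST": "redis",
    "P2P_REDIS_USERNAME": "Username",
    "P2P_REDIS_PASSWD": "Password",
}
print(json.dumps(merge_into({}, build_broadcast_config(sample_env)), indent=2))
```

Missing required variables raise `KeyError`, which makes a misconfigured node fail before the service starts rather than during model loading.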
For a complete example of how to deploy a model broadcasting service with multiple instances in a Kubernetes cluster, see Deploy a model broadcasting service with multiple instances.
Limit the cache size
During model broadcasting, nodes cache model data in memory for retrieval by other nodes. You can limit this cache memory in the following ways.
Method 1: Set an environment variable
export CONNECTOR_MAX_CACHE_ADVISE_GB=100
Method 2: Set in the configuration file
Set prefetch.maxCacheAdviseGB in /etc/oss-connector/config.json:
{
  ...
  "prefetch": {
    "vcpus": 16,
    "workers": 24,
    "maxCacheAdviseGB": 100
  },
  ...
}
The memory limit is a soft limit.
Environment variables take precedence over the configuration file.
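The precedence rule can be illustrated with a small sketch. This is not the connector's own code; `effective_cache_limit_gb` is a hypothetical helper, and treating an unset value as "no limit" is an assumption.

```python
def effective_cache_limit_gb(env, config):
    """Resolve the cache limit: environment variable first, then the config file.

    Returns None when neither source sets a limit (assumed here to mean unlimited).
    """
    raw = env.get("CONNECTOR_MAX_CACHE_ADVISE_GB")
    if raw not in (None, ""):
        return float(raw)
    value = config.get("prefetch", {}).get("maxCacheAdviseGB")
    return None if value is None else float(value)


config = {"prefetch": {"vcpus": 16, "workers": 24, "maxCacheAdviseGB": 100}}
print(effective_cache_limit_gb({"CONNECTOR_MAX_CACHE_ADVISE_GB": "60"}, config))  # 60.0
print(effective_cache_limit_gb({}, config))                                       # 100.0
```

When both sources are set, the value from the environment variable (60) is used and the value in the configuration file (100) is ignored.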
Performance report
The following are the performance test results for the model broadcasting feature with the Qwen2.5-72B model (135.437 GB) in different regions.
Test in Beijing region
Test environment
Item | Configuration |
OSS | China (Beijing), intranet download bandwidth 250 Gbps |
Node configuration | ecs.g9i.24xlarge, network 32/48 Gbps (peak), 96 vCPUs, 384 GiB |
Model | Qwen2.5-72B, 135.437 GB |
Metrics | Time from vLLM API server startup to service readiness, along with OSS and P2P traffic. |
Unlimited cache size

Only one back-to-source data stream is used, with all other data transferred via P2P. This minimizes bandwidth pressure on OSS.
The average model-ready time remains close to O(1) and does not increase linearly with the number of nodes. This demonstrates excellent horizontal scaling.
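One way to see why the ready time can stay nearly flat is an idealized model of a pipelined chain. The sketch below is not derived from the measured results. It assumes that data is forwarded in fixed-size chunks (the 64 MB chunk size is hypothetical), that every hop runs at the same link speed, and that OSS throughput, GPU loading, and vLLM startup are ignored.

```python
def chain_ready_seconds(size_gb, link_gbps, nodes, chunk_mb=64):
    """Idealized time until the last node in a pipelined chain holds the full model.

    Streaming the model once takes size / bandwidth. Because hops overlap,
    each additional node adds only the time needed to forward one chunk.
    """
    stream = size_gb * 8 / link_gbps
    per_hop = (chunk_mb / 1000) * 8 / link_gbps
    return stream + (nodes - 1) * per_hop


for n in (1, 10, 50, 100):
    print(n, round(chain_ready_seconds(135.437, 32, n), 2))
```

Under these assumptions, going from 1 to 100 nodes adds less than 2 seconds to a transfer of about 34 seconds, because the per-hop cost depends on the chunk size rather than on the model size.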
Limited cache size
The model-ready time was tested for 1, 10, 50, and 100 nodes starting simultaneously with the cache size unlimited and limited to 100, 60, 40, 20, and 0 GB.


Model broadcasting functions as expected under different cache size limits.
The impact of cache limits on performance is consistent across all concurrency levels. A cache size of 40 GB or larger has no significant effect on model-ready time. Performance begins to decline noticeably with a cache size of 20 GB or smaller.
Test in Ulanqab region
Test environment
Item | Configuration |
OSS | China (Ulanqab), intranet download bandwidth 10 Gbps |
Node configuration | ecs.g9i.24xlarge, network 32/48 Gbps (peak), 96 vCPUs, 384 GiB |
Model | Qwen2.5-72B, 135.437 GB |
The model-ready time was tested for 1, 10, 50, and 100 nodes starting simultaneously with the cache size unlimited and limited to 60 and 0 GB.

Even with limited OSS download bandwidth, the test results show that model broadcasting maintains excellent horizontal scaling and minimizes bandwidth pressure on OSS.