Model broadcasting

When launching AI inference services on multiple nodes, the OSS Connector model broadcasting feature loads model data from OSS on only a single node. The remaining nodes then receive the data through a chain-based topology. This method significantly reduces back-to-source traffic and improves model distribution efficiency.

How it works

When multiple nodes running AI inference services pull model files from OSS simultaneously, the download activity can saturate the source's egress bandwidth. This creates a performance bottleneck that can cause startup delays or failures. This issue is particularly pronounced in regions with lower OSS egress bandwidth, where concurrent back-to-source traffic can severely impact deployment efficiency.

OSS Connector model broadcasting optimizes large-scale AI inference deployments. When multiple inference instances for the same model are launched, only one or a few nodes load the model data directly from OSS. This data is then distributed to other nodes through a chain-based topology. Model broadcasting leverages node storage and network resources to reduce back-to-source traffic, lessen the source's load, and improve distribution efficiency.
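The traffic reduction can be estimated with simple arithmetic. The following is a minimal sketch, not part of OSS Connector: the function names are hypothetical, and it assumes that exactly one copy of the model is pulled from OSS when broadcasting is enabled and that the OSS egress link is the only limit. The sample values (135.437 GB model, 100 nodes, 10 Gbps) are taken from the performance report later in this topic.

```python
def back_to_source_gb(model_gb: float, nodes: int, broadcasting: bool) -> float:
    """Total data pulled from OSS when `nodes` instances start together."""
    # Assumption: with broadcasting, the model is pulled from OSS exactly once;
    # without it, every node pulls its own full copy.
    source_pulls = 1 if broadcasting else nodes
    return model_gb * source_pulls


def min_source_seconds(total_gb: float, source_gbps: float) -> float:
    """Lower bound on transfer time if the OSS egress link is the only limit."""
    return total_gb * 8 / source_gbps  # GB -> Gb, then divide by Gbps


if __name__ == "__main__":
    model_gb, nodes, source_gbps = 135.437, 100, 10.0
    for broadcasting in (False, True):
        total = back_to_source_gb(model_gb, nodes, broadcasting)
        seconds = min_source_seconds(total, source_gbps)
        print(f"broadcasting={broadcasting}: {total:.1f} GB from OSS, "
              f"at least {seconds:.0f} s at {source_gbps} Gbps")
```

The estimate is a lower bound only. Real startup time also depends on node bandwidth, the inference engine, and GPU loading.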

OSS Connector model broadcasting uses a chain-based transport method where model files are passed serially from one node to the next. Each node receives and forwards the data only once. For model file transfers, a single data stream is often sufficient to saturate the network bandwidth of most mainstream instance types. The chain-based method avoids the bandwidth bottlenecks that can occur in tree-based transport, where a node must send data to multiple downstream nodes simultaneously.
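To illustrate why a chain can outperform a tree, the following sketch compares idealized completion times. It is a simplified model under stated assumptions, not a description of the actual implementation: data is forwarded in fixed-size chunks (the 64 MB chunk size is hypothetical), every node has the same link speed, and a tree parent splits its egress bandwidth evenly among its children.

```python
def chain_seconds(model_gb, nodes, link_gbps, chunk_mb=64.0):
    """Time until the last node in a pipelined chain holds the whole model."""
    stream = model_gb * 8 / link_gbps            # one full-rate stream
    per_hop = chunk_mb * 8 / 1000 / link_gbps    # delay added by each extra hop
    return stream + max(nodes - 1, 0) * per_hop


def tree_seconds(model_gb, nodes, link_gbps, fanout, chunk_mb=64.0):
    """Same estimate for a tree in which every parent feeds `fanout` children."""
    if nodes <= 1:
        return model_gb * 8 / link_gbps
    # A parent's egress is shared by its children, so each stream gets 1/fanout.
    child_gbps = link_gbps / fanout
    depth, covered, level = 0, 1, 1
    while covered < nodes:        # count the levels needed to reach all nodes
        level *= fanout
        covered += level
        depth += 1
    stream = model_gb * 8 / child_gbps
    per_hop = chunk_mb * 8 / 1000 / child_gbps
    return stream + depth * per_hop


if __name__ == "__main__":
    for label, seconds in (
        ("chain", chain_seconds(135.437, 100, 32.0)),
        ("tree, fan-out 2", tree_seconds(135.437, 100, 32.0, 2)),
        ("tree, fan-out 4", tree_seconds(135.437, 100, 32.0, 4)),
    ):
        print(f"{label}: {seconds:.1f} s")
```

In this model the chain pays a small per-hop delay for each additional node, while the tree pays a bandwidth penalty on every stream, which dominates for large files.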

OSS Connector preloads model files from OSS into a memory buffer using a high-concurrency strategy, so the inference engine can load the model into GPU memory as needed. The buffered memory is released after a delay once inference is complete. Model broadcasting builds on this mechanism by sharing the buffer across nodes: it integrates DADI P2P functionality and requires only a Redis or Tair service for node discovery and metadata management, which allows buffered data to be distributed to other nodes. Compared with a single-node deployment, this adds only lightweight buffer-sharing logic while making full use of otherwise idle node egress bandwidth during model loading, which keeps distributed model loading cost-effective and efficient.

Note

With model broadcasting, only a single data stream is pulled from OSS at any given time for a specific model, which significantly reduces the load on the OSS source during batch startups. If source performance is still a bottleneck, combine this feature with OSS Accelerator or the distributed cache version of DADI P2P.

Prerequisites

OSS Connector for AI/ML v1.2.0 or later is installed. For installation instructions, see Improve model deployment efficiency with OSS Connector for AI/ML.
You have a Redis or Tair database available for node discovery and metadata management.

Configure the database

The model broadcasting feature requires a Redis or Tair service for node discovery and metadata management. You must configure this database to use the feature.

Option 1: Purchase and configure Tair (Recommended)

Tair is Alibaba Cloud's fully managed cloud database service, compatible with the Redis protocol.

Create a Tair instance. For instructions, see Quick start overview. The instance version must be 6.0 or later, and you can use the minimum specifications.
Configure a whitelist to ensure that the inference nodes can access the Tair instance.
Record the instance's connection address, port number, username, and password. You need these values to configure model broadcasting.

Option 2: Deploy a standalone Redis service

Alternatively, you can deploy your own Redis service in a Kubernetes cluster.

The following steps deploy a Redis service with Access Control List (ACL) authentication.

Create an ACL configuration file and generate a Kubernetes secret.

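```
# Create the ACL file. The quoted 'EOF' keeps the shell from expanding
# characters such as $ in the password.
cat > users.acl << 'EOF'
user default off -@all
user Username on >Password ~* &* +@all
EOF

# Create (or update) the secret from the ACL file
kubectl create secret generic redis-acl-secret \
  --from-file=users.acl \
  --dry-run=client -o yaml | kubectl apply -f -
```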

Note

Replace Username and Password with your actual username and password.

The following manifest defines the Redis Service and its Deployment.

```
# redis-service.yaml
# Service that exposes Redis inside the cluster
apiVersion: v1
kind: Service
metadata:
  name: redis
spec:
  selector:
    app: model-redis
  ports:
    - protocol: TCP
      port: 6379
      targetPort: 6379

---
# Deployment (same file, separated by ---)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-redis-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: model-redis
  template:
    metadata:
      labels:
        app: model-redis
    spec:
      containers:
      - name: redis
        image: mirrors-ssl.aliyuncs.com/redis:8.4.0
        ports:
        - containerPort: 6379
        command: ["redis-server"]
        args:
        # Load users from the mounted ACL file
        - "--aclfile"
        - "/etc/redis/users.acl"
        # Keep Redis memory below the container limit
        - "--maxmemory"
        - "900mb"
        - "--maxmemory-policy"
        - "volatile-lru"
        # Disable RDB snapshots and AOF persistence
        - "--save"
        - ""
        - "--appendonly"
        - "no"
        - "--loglevel"
        - "notice"
        resources:
          requests:
            memory: "1Gi"
            cpu: "100m"
          limits:
            memory: "1Gi"
            cpu: "200m"
        volumeMounts:
        - name: acl-config
          mountPath: /etc/redis/users.acl
          subPath: users.acl
      volumes:
      - name: acl-config
        secret:
          secretName: redis-acl-secret
```

To deploy Redis, save the preceding manifest (the Service and the Deployment) to redis-service.yaml and run the following command.
```
kubectl apply -f redis-service.yaml
```
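Whichever option you choose, it can be useful to confirm that an inference node can reach the database and authenticate before you enable model broadcasting. The following is a minimal sketch that uses only the Python standard library and the Redis serialization protocol (RESP). The helper names are hypothetical, the check assumes a plaintext (non-TLS) connection, and it reads only the first reply to each command.

```python
import socket


def encode_command(*parts: str) -> bytes:
    """Encode a command as a RESP array of bulk strings."""
    out = [f"*{len(parts)}\r\n".encode()]
    for part in parts:
        data = part.encode()
        out.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(out)


def check_redis(host: str, port: int, username: str, password: str,
                timeout: float = 3.0) -> bool:
    """Return True if AUTH and PING both succeed."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # AUTH <username> <password> is the ACL-style form (Redis 6 and later).
        sock.sendall(encode_command("AUTH", username, password))
        if not sock.recv(1024).startswith(b"+OK"):
            return False
        sock.sendall(encode_command("PING"))
        return sock.recv(1024).startswith(b"+PONG")
```

For the self-hosted deployment above, a call such as `check_redis("redis", 6379, "Username", "Password")` from a pod in the same namespace would exercise the Service named redis. For Tair, pass the connection address and port of your instance instead.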

Enable model broadcasting

Add the model broadcasting configuration to the OSS Connector configuration file at /etc/oss-connector/config.json.

```
{
  ...
  "broadcast": {
    "enableBroadcast": true,
    "tenant": "${P2P_KEY_PREFIX}",
    "db": {
      "host": "${P2P_REDIS_HOST}",
      "port": 6379,
      "username": "${P2P_REDIS_USERNAME}",
      "password": "${P2P_REDIS_PASSWD}"
    }
  },
  "bindPort": 19898,
  ...
}
```

The following table describes the configuration parameters.

Parameter	Description
broadcast.enableBroadcast	Specifies whether to enable model broadcasting. Set this to `true` to enable the feature.
broadcast.tenant	Specifies the tenant name. Only nodes that share the same tenant name broadcast model data to each other. We recommend configuring a unique tenant for each service.
broadcast.db.host	Specifies the connection address of the Redis or Tair service.
broadcast.db.port	Specifies the port number of the Redis or Tair service. The default is 6379.
broadcast.db.username	Specifies the username for the Redis or Tair service.
broadcast.db.password	Specifies the password for the Redis or Tair service.
bindPort	Specifies the port used to provide data to other nodes. The default value is 19898.
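If you generate the configuration file from a script, the parameters in the table can be merged into an existing configuration as shown in the following sketch. The function name is hypothetical. The key names and default values come from the table above, and all other sections of the configuration are left untouched.

```python
import json


def with_broadcast(config: dict, tenant: str, host: str, username: str,
                   password: str, port: int = 6379, bind_port: int = 19898) -> dict:
    """Return a copy of `config` with model broadcasting enabled."""
    if not tenant:
        raise ValueError("tenant must be a non-empty string")
    merged = dict(config)  # shallow copy; other top-level sections are kept as-is
    merged["broadcast"] = {
        "enableBroadcast": True,
        "tenant": tenant,
        "db": {"host": host, "port": port,
               "username": username, "password": password},
    }
    merged["bindPort"] = bind_port
    return merged


if __name__ == "__main__":
    base = {"prefetch": {"vcpus": 16, "workers": 24}}
    config = with_broadcast(base, "my-service", "redis", "Username", "Password")
    print(json.dumps(config, indent=2))
```

The printed JSON could then be written to /etc/oss-connector/config.json. The values my-service, redis, Username, and Password are placeholders.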

For a complete example of how to deploy a model broadcasting service with multiple instances in a Kubernetes cluster, see Deploy a model broadcasting service with multiple instances.

Limit the cache size

During model broadcasting, nodes cache model data in memory for retrieval by other nodes. You can limit this cache memory in the following ways.

Method 1: Set an environment variable

```
export CONNECTOR_MAX_CACHE_ADVISE_GB=100
```

Method 2: Set in the configuration file

Set prefetch.maxCacheAdviseGB in /etc/oss-connector/config.json:

```
{
  ...
  "prefetch": {
    "vcpus": 16,
    "workers": 24,
    "maxCacheAdviseGB": 100
  },
  ...
}
```

Note

The memory limit is a soft limit.
Environment variables take precedence over the configuration file.
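The precedence rule can be expressed as a small resolver. This sketch only illustrates the rule stated in the note and is not the connector's own code: the function name is hypothetical, and treating an empty environment variable as unset is an assumption.

```python
import os
from typing import Mapping, Optional


def effective_cache_limit_gb(config: Mapping,
                             env: Mapping[str, str] = os.environ) -> Optional[float]:
    """Resolve the advisory cache limit; None means no limit is configured."""
    raw = env.get("CONNECTOR_MAX_CACHE_ADVISE_GB")
    if raw not in (None, ""):
        return float(raw)  # the environment variable wins
    value = config.get("prefetch", {}).get("maxCacheAdviseGB")
    return None if value is None else float(value)
```

For example, with `maxCacheAdviseGB` set to 100 in the configuration file and `CONNECTOR_MAX_CACHE_ADVISE_GB=60` exported, the resolver returns 60.0.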

Performance report

The following are the performance test results for the model broadcasting feature with the Qwen2.5-72B model (135.437 GB) in different regions.

Test in Beijing region

Test environment

Item	Configuration
OSS	China (Beijing), intranet download bandwidth 250 Gbps
Node configuration	ecs.g9i.24xlarge, network 32/48 Gbps (peak), 96 vCPUs, 384 GiB
Model	Qwen2.5-72B, 135.437 GB
Metrics	Time from vLLM API server startup to service readiness, along with OSS and P2P traffic.

Unlimited cache size

Figure: Beijing region test results with unlimited cache size

Only one back-to-source data stream is used, with all other data transferred via P2P. This minimizes bandwidth pressure on OSS.
The average model-ready time stays nearly constant (close to O(1)) as the number of nodes grows instead of increasing linearly, which indicates that the feature scales well horizontally.

Limited cache size

The model-ready time was tested for 1, 10, 50, and 100 nodes starting simultaneously with the cache size unlimited and limited to 100, 60, 40, 20, and 0 GB.

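Figure: Beijing region test results with limited cache size (model-ready time)

Figure: Beijing region test results with limited cache size (performance impact)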

Model broadcasting functions as expected under different cache size limits.
The impact of cache limits on performance is consistent across all concurrency levels. A cache size of 40 GB or larger has no significant effect on model-ready time. Performance begins to decline noticeably with a cache size of 20 GB or smaller.

Test in Ulanqab region

Test environment

Item	Configuration
OSS	China (Ulanqab), intranet download bandwidth 10 Gbps
Node configuration	ecs.g9i.24xlarge, network 32/48 Gbps (peak), 96 vCPUs, 384 GiB
Model	Qwen2.5-72B, 135.437 GB

The model-ready time was tested for 1, 10, 50, and 100 nodes starting simultaneously with the cache size unlimited and limited to 60 and 0 GB.

Figure: Ulanqab region test results

Even with limited OSS download bandwidth, the test results show that model broadcasting maintains excellent horizontal scaling and minimizes bandwidth pressure on OSS.