
Object Storage Service: Model broadcasting

Last Updated: Apr 02, 2026

When launching AI inference services on multiple nodes, the OSS Connector model broadcasting feature loads model data from OSS on only a single node. The remaining nodes then receive the data through a chain-based topology. This method significantly reduces back-to-source traffic and improves model distribution efficiency.

How it works

When multiple nodes running AI inference services pull model files from OSS simultaneously, the download activity can saturate the source's egress bandwidth. This creates a performance bottleneck that can cause startup delays or failures. This issue is particularly pronounced in regions with lower OSS egress bandwidth, where concurrent back-to-source traffic can severely impact deployment efficiency.

OSS Connector model broadcasting optimizes large-scale AI inference deployments. When multiple inference instances for the same model are launched, only one or a few nodes load the model data directly from OSS. This data is then distributed to other nodes through a chain-based topology. Model broadcasting leverages node storage and network resources to reduce back-to-source traffic, lessen the source's load, and improve distribution efficiency.

OSS Connector model broadcasting uses a chain-based transport method where model files are passed serially from one node to the next. Each node receives and forwards the data only once. For model file transfers, a single data stream is often sufficient to saturate the network bandwidth of most mainstream instance types. The chain-based method avoids the bandwidth bottlenecks that can occur in tree-based transport, where a node must send data to multiple downstream nodes simultaneously.
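
The bandwidth argument can be made concrete with a little arithmetic. The sketch below is an idealized model, not a description of the connector's internals: it assumes a sender's egress bandwidth is split evenly across its downstream streams and ignores protocol overhead. The 32 Gbps and 135.437 GB figures are borrowed from the performance report in this document, and the function names are illustrative.

```python
def per_stream_gbps(egress_gbps: float, fanout: int) -> float:
    """Bandwidth each receiver gets when one sender feeds `fanout` receivers."""
    if fanout < 1:
        raise ValueError("fanout must be at least 1")
    return egress_gbps / fanout


def hop_seconds(size_gb: float, egress_gbps: float, fanout: int) -> float:
    """Idealized time for one hop to move `size_gb` gigabytes (1 GB = 8 Gb)."""
    return size_gb * 8 / per_stream_gbps(egress_gbps, fanout)


if __name__ == "__main__":
    size_gb, egress = 135.437, 32.0
    print(f"chain (fan-out 1): {hop_seconds(size_gb, egress, 1):.1f} s per hop")
    print(f"tree  (fan-out 2): {hop_seconds(size_gb, egress, 2):.1f} s per hop")
```

In this model, doubling the fan-out halves the rate each receiver sees, whereas a chain keeps every hop at the full egress rate. The hops of a chain can also overlap in time, provided nodes forward data while they are still receiving it.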

OSS Connector preloads model files from OSS into a memory buffer using a high-concurrency strategy. This lets the inference engine load the model into GPU memory as needed. The buffered memory is then released after a delay once inference is complete. The model broadcasting feature builds on this by enabling the buffer to be shared across nodes. It integrates DADI P2P functionalities and requires only a Redis or Tair service for node discovery and metadata management. This setup allows buffered data to be distributed to other nodes. Compared to a single-node deployment, this solution adds only lightweight buffer-sharing logic while fully using idle node egress bandwidth during model loading. This provides a cost-effective and efficient method for distributed model loading.
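
To illustrate the role of the discovery service, the following sketch assigns each arriving node an upstream peer so that the nodes form a chain. It is a toy model under stated assumptions: a plain Python dictionary stands in for Redis or Tair, and the names `ChainRegistry`, `join`, and `OSS_SOURCE` are hypothetical. The actual metadata layout and peer-selection logic of OSS Connector are not described in this document.

```python
from typing import Dict, List

OSS_SOURCE = "oss"  # hypothetical marker meaning "load directly from OSS"


class ChainRegistry:
    """In-memory stand-in for the Redis/Tair metadata store (hypothetical)."""

    def __init__(self) -> None:
        self._chains: Dict[str, List[str]] = {}

    def join(self, tenant: str, model: str, node: str) -> str:
        """Register `node` and return where it should read the model from."""
        chain = self._chains.setdefault(f"{tenant}:{model}", [])
        # The first node of a chain goes back to OSS; later nodes read
        # from the node that joined immediately before them.
        upstream = chain[-1] if chain else OSS_SOURCE
        chain.append(node)
        return upstream


registry = ChainRegistry()
plan = {n: registry.join("svc-a", "qwen", n) for n in ["node-1", "node-2", "node-3"]}
print(plan)
```

Only the first node of a chain reads from OSS in this sketch, and a node that joins under a different tenant starts its own chain.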

Note

With model broadcasting, only a single data stream is pulled from OSS at any given time for a specific model. This significantly reduces the load on the OSS source during batch startups. However, if source performance remains a bottleneck, you should use this feature with OSS Accelerator or the distributed cache version of DADI P2P.

Prerequisites

Configure the database

The model broadcasting feature requires a Redis or Tair service for node discovery and metadata management. You must configure this database to use the feature.

Option 1: Purchase and configure Tair (Recommended)

Tair is Alibaba Cloud's fully managed cloud database service, compatible with the Redis protocol.

  1. Create a Tair instance. For instructions, see Quick start overview. The instance version must be 6.0 or later, and you can use the minimum specifications.

  2. Configure a whitelist to ensure that the inference nodes can access the Tair instance.

  3. Record the connection address, port number, username, and password of the instance. You need them to configure model broadcasting.

Option 2: Deploy a standalone Redis service

Alternatively, you can deploy your own Redis service in a Kubernetes cluster.

The following YAML configuration deploys a Redis service with Access Control List (ACL) authentication.

  1. Create an ACL configuration file and generate a Kubernetes secret.

    # Create ACL content
    cat > users.acl << EOF
    user default off -@all
    user Username on >Password ~* &* +@all
    EOF
    
    # Create a secret
    kubectl create secret generic redis-acl-secret \
      --from-file=users.acl \
      --dry-run=client -o yaml | kubectl apply -f -
    Note

    Replace Username and Password with your actual username and password.

  2. Save the following Service and Deployment manifests in a single file named redis-service.yaml.

    # redis-service.yaml
    apiVersion: v1
    kind: Service
    metadata:
      name: redis
    spec:
      selector:
        app: model-redis
      ports:
        - protocol: TCP
          port: 6379
          targetPort: 6379
    
    ---
    # redis-deployment.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: model-redis-deployment
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: model-redis
      template:
        metadata:
          labels:
            app: model-redis
        spec:
          containers:
          - name: redis
            image: mirrors-ssl.aliyuncs.com/redis:8.4.0
            ports:
            - containerPort: 6379
            command: ["redis-server"]
            args:
            - "--aclfile"
            - "/etc/redis/users.acl"
            - "--maxmemory"
            - "900mb"
            - "--maxmemory-policy"
            - "volatile-lru"
            - "--save"
            - ""
            - "--appendonly"
            - "no"
            - "--loglevel"
            - "notice"
            resources:
              requests:
                memory: "1Gi"
                cpu: "100m"
              limits:
                memory: "1Gi"
                cpu: "200m"
            volumeMounts:
            - name: acl-config
              mountPath: /etc/redis/users.acl
              subPath: users.acl
          volumes:
          - name: acl-config
            secret:
              secretName: redis-acl-secret
  3. To deploy the Redis service, run the following command.

    kubectl apply -f redis-service.yaml
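
If you generate the ACL file from step 1 with a script rather than by hand, a helper along these lines can catch a common mistake before the secret is created. It is an illustrative sketch: the function name `render_users_acl` is hypothetical, and the whitespace check reflects the fact that the rules on an ACL line are separated by spaces.

```python
def render_users_acl(username: str, password: str) -> str:
    """Return users.acl content: default user disabled, one full-access user."""
    for label, value in (("username", username), ("password", password)):
        if not value or any(ch.isspace() for ch in value):
            raise ValueError(f"{label} must be non-empty and contain no whitespace")
    return (
        "user default off -@all\n"
        f"user {username} on >{password} ~* &* +@all\n"
    )


if __name__ == "__main__":
    # Placeholder credentials for illustration only.
    print(render_users_acl("Username", "Password"), end="")
```

The output matches the users.acl content shown in step 1 and can be written to a file before you run `kubectl create secret`.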

Enable model broadcasting

Add the model broadcasting configuration to the OSS Connector configuration file at /etc/oss-connector/config.json.

{
  ...
  "broadcast": {
    "enableBroadcast": true,
    "tenant": "${P2P_KEY_PREFIX}",
    "db": {
      "host": "${P2P_REDIS_HOST}",
      "port": 6379,
      "username": "${P2P_REDIS_USERNAME}",
      "password": "${P2P_REDIS_PASSWD}"
    }
  },
  "bindPort": 19898
  ...
}

The following table describes the configuration parameters.

| Parameter | Description |
| --- | --- |
| broadcast.enableBroadcast | Specifies whether to enable model broadcasting. Set this to true to enable the feature. |
| broadcast.tenant | Specifies the tenant name. Nodes with the same tenant name can use model broadcasting. We recommend configuring a unique tenant for each service. |
| broadcast.db.host | Specifies the connection address of the Redis or Tair service. |
| broadcast.db.port | Specifies the port number of the Redis or Tair service. The default is 6379. |
| broadcast.db.username | Specifies the username for the Redis or Tair service. |
| broadcast.db.password | Specifies the password for the Redis or Tair service. |
| bindPort | Specifies the port used to provide data to other nodes. The default value is 19898. |
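
When the configuration file is generated at container startup, the placeholders in the example configuration have to be filled from the environment. The sketch below shows one way to do that in Python. It assumes the values are supplied through the environment variables named in the example (`P2P_KEY_PREFIX`, `P2P_REDIS_HOST`, `P2P_REDIS_USERNAME`, `P2P_REDIS_PASSWD`). The function name `build_broadcast_config` is hypothetical, and whether OSS Connector expands `${...}` placeholders on its own is not stated here.

```python
import json
from typing import Any, Dict, Mapping, Optional

REQUIRED_VARS = ("P2P_KEY_PREFIX", "P2P_REDIS_HOST", "P2P_REDIS_USERNAME", "P2P_REDIS_PASSWD")


def build_broadcast_config(
    env: Mapping[str, str],
    base: Optional[Dict[str, Any]] = None,
    port: int = 6379,
    bind_port: int = 19898,
) -> Dict[str, Any]:
    """Merge model broadcasting settings into a copy of an existing config."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    config = dict(base or {})
    config["broadcast"] = {
        "enableBroadcast": True,
        "tenant": env["P2P_KEY_PREFIX"],
        "db": {
            "host": env["P2P_REDIS_HOST"],
            "port": port,
            "username": env["P2P_REDIS_USERNAME"],
            "password": env["P2P_REDIS_PASSWD"],
        },
    }
    config["bindPort"] = bind_port
    return config


if __name__ == "__main__":
    # Sample values for illustration only.
    sample_env = {
        "P2P_KEY_PREFIX": "my-service",
        "P2P_REDIS_HOST": "redis",
        "P2P_REDIS_USERNAME": "Username",
        "P2P_REDIS_PASSWD": "Password",
    }
    print(json.dumps(build_broadcast_config(sample_env), indent=2))
```

In a real entrypoint you would pass `os.environ` as `env`, pass the parsed contents of the existing configuration file as `base`, and write the result back to /etc/oss-connector/config.json.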

For a complete example of how to deploy a model broadcasting service with multiple instances in a Kubernetes cluster, see Deploy a model broadcasting service with multiple instances.

Limit the cache size

During model broadcasting, nodes cache model data in memory for retrieval by other nodes. You can limit this cache memory in the following ways.

  • Method 1: Set an environment variable

    export CONNECTOR_MAX_CACHE_ADVISE_GB=100
  • Method 2: Set in the configuration file

    Set prefetch.maxCacheAdviseGB in /etc/oss-connector/config.json:

    {
      ...
      "prefetch": {
        "vcpus": 16,
        "workers": 24,
        "maxCacheAdviseGB": 100
      },
      ...
    }
Note
  • The memory limit is a soft limit.

  • Environment variables take precedence over the configuration file.
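
The precedence rule can be expressed as a small resolver. This is a sketch of the documented behavior rather than the connector's own code: the function name is hypothetical, treating an empty environment variable as unset is an assumption, and `None` here simply means that no limit is configured in either place.

```python
from typing import Any, Mapping, Optional

ENV_NAME = "CONNECTOR_MAX_CACHE_ADVISE_GB"


def effective_cache_limit_gb(
    env: Mapping[str, str], config: Mapping[str, Any]
) -> Optional[float]:
    """Return the cache limit in GB, or None when no limit is configured."""
    raw = env.get(ENV_NAME)
    # Assumption: an empty variable counts as "not set".
    if raw is not None and raw.strip() != "":
        return float(raw)  # the environment variable takes precedence
    value = config.get("prefetch", {}).get("maxCacheAdviseGB")
    return None if value is None else float(value)


if __name__ == "__main__":
    file_config = {"prefetch": {"vcpus": 16, "workers": 24, "maxCacheAdviseGB": 100}}
    print(effective_cache_limit_gb({ENV_NAME: "60"}, file_config))
    print(effective_cache_limit_gb({}, file_config))
```

The comparison against `None` is deliberate: a limit of 0 GB is a valid setting, as the performance report below includes runs with the cache limited to 0 GB.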

Performance report

The following are the performance test results for the model broadcasting feature with the Qwen2.5-72B model (135.437 GB) in different regions.

Test in Beijing region

Test environment

| Item | Configuration |
| --- | --- |
| OSS | China (Beijing), intranet download bandwidth 250 Gbps |
| Node configuration | ecs.g9i.24xlarge, network 32/48 Gbps (peak), 96 vCPUs, 384 GiB |
| Model | Qwen2.5-72B, 135.437 GB |
| Metrics | Time from vLLM API server startup to service readiness, along with OSS and P2P traffic. |

Unlimited cache size

[Figure: Test results in the Beijing region with unlimited cache size]

  • Only one back-to-source data stream is used, with all other data transferred via P2P. This minimizes bandwidth pressure on OSS.

  • The average model-ready time stays roughly constant (close to O(1)) as nodes are added instead of growing linearly with the node count. This demonstrates excellent horizontal scaling.

Limited cache size

The model-ready time was tested for 1, 10, 50, and 100 nodes starting simultaneously with the cache size unlimited and limited to 100, 60, 40, 20, and 0 GB.

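[Figure: Test results in the Beijing region with limited cache size: model-ready time]

[Figure: Test results in the Beijing region with limited cache size: impact of the cache limit]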

  • Model broadcasting functions as expected under different cache size limits.

  • The impact of cache limits on performance is consistent across all concurrency levels. A cache size of 40 GB or larger has no significant effect on model-ready time. Performance begins to decline noticeably with a cache size of 20 GB or smaller.

Test in Ulanqab region

Test environment

| Item | Configuration |
| --- | --- |
| OSS | China (Ulanqab), intranet download bandwidth 10 Gbps |
| Node configuration | ecs.g9i.24xlarge, network 32/48 Gbps (peak), 96 vCPUs, 384 GiB |
| Model | Qwen2.5-72B, 135.437 GB |

The model-ready time was tested for 1, 10, 50, and 100 nodes starting simultaneously with the cache size unlimited and limited to 60 and 0 GB.

[Figure: Test results in the Ulanqab region]

Even with limited OSS download bandwidth, the test results show that model broadcasting maintains excellent horizontal scaling and minimizes bandwidth pressure on OSS.
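
For context, the back-of-the-envelope calculation below estimates how long the OSS side alone would need to serve the model if every node pulled directly from OSS, compared with a single back-to-source stream. These are theoretical lower bounds derived from the bandwidth figures in the test environment tables, not measured results. The sketch assumes 1 GB = 8 Gb, ignores protocol overhead, and caps each stream at the lower of the two node bandwidth figures (32 Gbps).

```python
MODEL_GB = 135.437   # model size from the test environment tables
NODE_GBPS = 32.0     # lower node bandwidth figure from the same tables


def oss_bound_seconds(streams: int, oss_gbps: float) -> float:
    """Lower bound on the time for OSS to serve `streams` full copies of the model."""
    total_gbit = MODEL_GB * 8 * streams
    # The aggregate rate is limited by OSS egress and by what the nodes can ingest.
    rate = min(oss_gbps, NODE_GBPS * streams)
    return total_gbit / rate


if __name__ == "__main__":
    for region, oss_gbps in (("Beijing", 250.0), ("Ulanqab", 10.0)):
        direct = oss_bound_seconds(100, oss_gbps)
        single = oss_bound_seconds(1, oss_gbps)
        print(f"{region}: 100 direct pulls >= {direct:.0f} s, one stream >= {single:.0f} s")
```

Under these assumptions, the gap between 100 direct pulls and a single stream is widest where OSS egress bandwidth is lowest, which is consistent with the motivation given in the How it works section.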