All Products
Search
Document Center

Object Storage Service:Model broadcasting

Last Updated:Jun 02, 2026

OSS Connector model broadcasting loads model data from OSS on a single node and distributes it to other nodes through a chain topology, reducing back-to-source traffic and speeding up multi-node model deployment.

How it works

When multiple nodes pull model files from OSS simultaneously, the concurrent downloads can saturate the source's egress bandwidth, causing startup delays or failures. This bottleneck is especially severe in regions with lower OSS egress bandwidth.

With model broadcasting, only one or a few nodes load data directly from OSS when multiple inference instances start. The data is then distributed through a chain topology, leveraging node storage and network resources to reduce back-to-source traffic and improve distribution efficiency.

Model files are passed serially from one node to the next; each node receives and forwards data once. A single data stream typically saturates the network bandwidth of most instance types, and the chain topology avoids the bottlenecks of tree-based transport where a node must serve multiple downstream nodes.

OSS Connector preloads model files into a memory buffer using high-concurrency downloads. The inference engine loads the model into GPU memory as needed, and the buffer is released after a delay once inference completes. Model broadcasting extends this by sharing the buffer across nodes through DADI P2P, requiring only a Redis or Tair service for node discovery and metadata management. This adds lightweight buffer-sharing logic while fully utilizing idle node bandwidth during model loading.

image
Note

Model broadcasting pulls only a single data stream from OSS for each model, reducing source load during batch startups. If source performance remains a bottleneck, combine this feature with OSS Accelerator or the distributed cache version of DADI P2P.

Prerequisites

Configure the database

Model broadcasting requires a Redis or Tair service for node discovery and metadata management.

Option 1: Purchase and configure Tair (Recommended)

Tair is Alibaba Cloud's fully managed cloud database service, compatible with the Redis protocol.

  1. Create a Tair instance (version 6.0 or later, minimum specifications). Quick start overview.

  2. Configure a whitelist so that inference nodes can access the Tair instance.

  3. Note the Connection Address, Port Number, Username, and Password for the model broadcasting configuration.

Option 2: Deploy a standalone Redis service

You can also deploy Redis in a Kubernetes cluster.

The following YAML deploys Redis with ACL authentication.

  1. Create an ACL configuration file and generate a Kubernetes secret.

    # Create ACL content
    cat > users.acl << EOF
    user default off -@all
    user Username on >Password ~* &* +@all
    EOF
    
    # Create a secret
    kubectl create secret generic redis-acl-secret \
      --from-file=users.acl \
      --dry-run=client -o yaml | kubectl apply -f -
    Note

    Replace Username and Password with your actual username and password.

  2. Apply the following YAML to deploy Redis.

    # redis-service.yaml
    apiVersion: v1
    kind: Service
    metadata:
      name: redis
    spec:
      selector:
        app: model-redis
      ports:
        - protocol: TCP
          port: 6379
          targetPort: 6379
    
    ---
    # redis-deployment.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: model-redis-deployment
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: model-redis
      template:
        metadata:
          labels:
            app: model-redis
        spec:
          containers:
          - name: redis
            image: mirrors-ssl.aliyuncs.com/redis:8.4.0
            ports:
            - containerPort: 6379
            command: ["redis-server"]
            args:
            - "--aclfile"
            - "/etc/redis/users.acl"
            - "--maxmemory"
            - "900mb"
            - "--maxmemory-policy"
            - "volatile-lru"
            - "--save"
            - ""
            - "--appendonly"
            - "no"
            - "--loglevel"
            - "notice"
            resources:
              requests:
                memory: "1Gi"
                cpu: "100m"
              limits:
                memory: "1Gi"
                cpu: "200m"
            volumeMounts:
            - name: acl-config
              mountPath: /etc/redis/users.acl
              subPath: users.acl
          volumes:
          - name: acl-config
            secret:
              secretName: redis-acl-secret
  3. Deploy the Redis service.

    kubectl apply -f redis-service.yaml

Enable model broadcasting

Add the model broadcasting configuration to the OSS Connector configuration file at /etc/oss-connector/config.json.

{
  ...
  "broadcast": {
    "enableBroadcast": true,
    "tenant": "${P2P_KEY_PREFIX}",
    "db": {
      "host": "${P2P_REDIS_HOST}",
      "port": 6379,
      "username": "${P2P_REDIS_USERNAME}",
      "password": "${P2P_REDIS_PASSWD}"
    }
  },
  "bindPort": 19898
  ...
}

Configuration parameters:

Parameter

Description

broadcast.enableBroadcast

Enables model broadcasting. Set to true.

broadcast.tenant

Tenant name. Nodes with the same tenant share model broadcasting. Use a unique tenant per service.

broadcast.db.host

Connection address of the Redis or Tair service.

broadcast.db.port

Port of the Redis or Tair service. Default: 6379.

broadcast.db.username

Username for the Redis or Tair service.

broadcast.db.password

Password for the Redis or Tair service.

bindPort

Port for serving data to other nodes. Default: 19898.

Model broadcasting supports multi-instance Kubernetes deployments. Deploy a model broadcasting service with multiple instances.

Limit the cache size

During broadcasting, nodes cache model data in memory for other nodes. You can limit this cache size:

  • Method 1: Set an environment variable

    export CONNECTOR_MAX_CACHE_ADVISE_GB=100
  • Method 2: Set in the configuration file

    Set prefetch.maxCacheAdviseGB in /etc/oss-connector/config.json:

    {
      ...
      "prefetch": {
        "vcpus": 16,
        "workers": 24,
        "maxCacheAdviseGB": 100
      },
      ...
    }
Note
  • The memory limit is a soft limit.

  • Environment variables take precedence over the configuration file.

Performance report

Performance test results for model broadcasting with the Qwen2.5-72B model (135.437 GB) across regions.

Test in Beijing region

Test environment

Item

Configuration

OSS

China (Beijing), intranet download bandwidth 250 Gbps

Node configuration

ecs.g9i.24xlarge, network 32/48 Gbps (peak), 96 vCPUs, 384 GiB

Model

Qwen2.5-72B, 135.437 GB

Metrics

Time from vLLM API server startup to service readiness, along with OSS and P2P traffic.

Unlimited cache size

Beijing region test results with unlimited cache

  • Only one back-to-source stream is used; all other data transfers via P2P, minimizing OSS bandwidth pressure.

  • Average model-ready time stays near O(1) regardless of node count, demonstrating excellent horizontal scaling.

Limited cache size

The model-ready time was tested for 1, 10, 50, and 100 nodes starting simultaneously with the cache size unlimited and limited to 100, 60, 40, 20, and 0 GB.

Beijing region test results with limited cache - time

Beijing region test results with limited cache - impact

  • Model broadcasting works as expected across all cache size limits.

  • Cache limit impact is consistent across concurrency levels. A cache of 40 GB or more has no significant effect on model-ready time. Performance declines noticeably at 20 GB or less.

Test in Ulanqab region

Test environment

Item

Configuration

OSS

China (Ulanqab), intranet download bandwidth 10 Gbps

Node configuration

ecs.g9i.24xlarge, network 32/48 Gbps (peak), 96 vCPUs, 384 GiB

Model

Qwen2.5-72B, 135.437 GB

The model-ready time was tested for 1, 10, 50, and 100 nodes starting simultaneously with the cache size unlimited and limited to 60 and 0 GB.

Ulanqab region test results

Even with limited OSS download bandwidth, model broadcasting maintains excellent horizontal scaling and minimizes source bandwidth pressure.