By Hang Yin (Luying)
ACK Gateway with Inference Extension is designed for LLM inference scenarios, supporting both Layer 4/7 traffic routing and intelligent load balancing capabilities based on model server load. In addition, InferencePool and InferenceModel custom resources (CRDs) help flexibly define traffic distribution policies for inference services, including model canary release and LLM traffic mirroring.
This article walks through how to optimize inference service performance with ACK Gateway with Inference Extension when a large model inference service is deployed across multiple nodes.
vLLM is a high-performance inference engine designed for large language models (LLMs). Its goal is to efficiently utilize GPU and multi-node cluster resources while ensuring high throughput and low latency. Because LLM parameter counts are massive, a single GPU or even a single node often cannot hold all the parameters at once, which places heavy demands on the underlying architecture for parallel computing and memory optimization. To this end, vLLM uses a variety of parallelization techniques, including tensor parallelism and pipeline parallelism, combined with multi-node deployment to achieve efficient large model inference.
Tensor parallelism partitions the matrix operations within a single layer and distributes the computation across multiple GPUs. Its core principles and technical background are as follows:
The fully connected layers and other weight matrices in large neural networks can far exceed the memory capacity of a single GPU. Tensor parallelism divides these matrices into several sub-tensors (for example, by row, column, or block), and each GPU stores and computes only its own part, which effectively reduces the memory pressure on any single GPU.
During forward and backward propagation, each GPU performs operations such as matrix multiplication and activation on its local submatrices. The partial results are then combined through collective communication operations (such as All-Reduce or All-Gather) to ensure that the final output is consistent. This requires an efficient cross-device communication and synchronization strategy to prevent communication bottlenecks from degrading overall inference latency.
To further improve efficiency, vLLM typically combines tensor parallelism with task scheduling, pipelining (overlapping communication and computation), and other techniques to balance load across GPUs, reduce communication latency, and achieve efficient parallel computing.
Pipeline parallelism decomposes the entire model into multiple consecutive stages, each running on a separate GPU (or even a different machine), so that different stages can execute in parallel. Its background and key points include:
A deep network can be divided into multiple consecutive "stages", each containing several successive layers deployed on a separate device. As soon as the first stage finishes processing an input sample or micro-batch and passes it to the next stage, it can start on the following one, realizing pipelined processing that maximizes device utilization.
The input is split into micro-batches so that different micro-batches can execute concurrently on different stages. This avoids long idle periods on individual stages and overlaps computation with communication. A careful scheduling mechanism is required to handle potential pipeline bubbles and load imbalances.
Because of the sequential dependency between stages, intermediate data such as activation values must be passed from one stage to the next, which requires a low-latency communication mechanism. In addition, the scheduling system must ensure that every device receives its data in time to minimize latency caused by waiting for data transfers.
Because of their huge parameter counts, large models often exceed the storage and computing capacity of a single node and therefore need to be deployed across a multi-node cluster. For such multi-node deployments, vLLM typically combines tensor parallelism with pipeline parallelism, as sketched below.
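For intuition, the two parallelism dimensions map directly onto vLLM's launch flags. The following sketch mirrors the values used in the deployment later in this article (2 nodes with 2 GPUs each, so 2 x 2 = 4 GPUs in total); in a multi-node setup, the command runs on the Ray head node once the Ray cluster has formed.
# Tensor parallelism: split each layer across the 2 GPUs inside one node.
# Pipeline parallelism: split the layer stack across the 2 nodes.
# Total GPUs used = tensor-parallel-size x pipeline-parallel-size = 2 x 2 = 4.
vllm serve /mnt/models/QwQ-32B \
  --served-model-name qwq \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 2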
In this article, we will deploy the QwQ-32B model across multiple nodes. QwQ-32B is the latest high-efficiency large language model released by Alibaba Cloud. With 32 billion parameters, its performance rivals that of DeepSeek-R1, which has 671 billion parameters. The model performs well on core metrics such as mathematics and code generation, delivering strong inference capabilities with lower resource consumption.
1. Download the model.
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/QwQ-32B.git
cd QwQ-32B
git lfs pull
2. Upload the model to OSS.
cd ..
ossutil mkdir oss://<Your-Bucket-Name>/QwQ-32B
ossutil cp -r ./QwQ-32B oss://<Your-Bucket-Name>/QwQ-32B
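Optionally, list the uploaded objects to confirm that the copy completed:
ossutil ls oss://<Your-Bucket-Name>/QwQ-32B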
3. Configure the PV and PVC for the target cluster.
For more information, see Mount a statically provisioned OSS volume. The following table describes the parameters of a PV. (A YAML sketch of an equivalent PV and PVC configuration is provided after the two tables.)
| Parameter or setting | Description |
|---|---|
| PV Type | OSS |
| Volume Name | llm-model |
| Access Certificate | Specify the AccessKey ID and the AccessKey secret used to access the OSS bucket. |
| Bucket ID | Select the bucket that you created. |
| OSS Path | Select the path of the model, such as /models/QwQ-32B. |
The following table describes the parameters of a PVC.
| Parameter or setting | Description |
|---|---|
| PVC Type | OSS |
| Volume Name | llm-model |
| Allocation Mode | Select Existing Volumes. |
| Existing Volumes | Click Existing Volumes and select the PV that you created. |
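If you prefer to configure the volumes with YAML instead of the console, the following is a minimal sketch of an equivalent statically provisioned OSS PV and PVC. The Secret name oss-secret, the endpoint URL, the mount options, and the OSS path are assumptions for illustration; refer to Mount a statically provisioned OSS volume for the authoritative parameters.
kubectl apply -f- <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
  namespace: default
stringData:
  akId: <Your-AccessKey-ID>
  akSecret: <Your-AccessKey-Secret>
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <Your-Bucket-Name>
      url: oss-<region>-internal.aliyuncs.com
      # Assumption: mount the bucket root so that the QwQ-32B directory is visible
      # under the mount path (for example, /mnt/models/QwQ-32B in the deployment below).
      path: "/"
      otherOpts: "-o umask=022 -o allow_other"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
  namespace: default
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
EOF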
Run the following command to deploy the QwQ-32B model inference service across multiple nodes based on the vLLM framework:
kubectl apply -f- <<EOF
apiVersion: v1
data:
hostfile-0: |-
qwq-dist-v1.qwq-dist-v1-0.default
qwq-dist-v1.qwq-dist-v1-0-0.default
master.rayInit: |-
#!/bin/bash
ray_port=6379
ray_init_timeout=300
ray_cluster_size=$WORLD_SIZE
master_command=$1
ray start --head --port=$ray_port
for (( i=0; i < $ray_init_timeout; i+=5 )); do
active_nodes=`python3 -c 'import ray; ray.init(); print(sum(node["Alive"] for node in ray.nodes()))'`
if [ $active_nodes -eq $ray_cluster_size ]; then
echo "All ray workers are active and the ray cluster is initialized successfully."
$master_command
exit 0
fi
echo "Wait for all ray workers to be active. $active_nodes/$ray_cluster_size is active"
sleep 5s;
done
echo "Waiting for all ray workers to be active timed out."
exit 1
worker.rayInit: |-
#!/bin/bash
ray_port=6379
ray_init_timeout=300
ray_address=$MASTER_ADDR
worker_command=$1
for (( i=0; i < $ray_init_timeout; i+=5 )); do
ray start --address=$ray_address:$ray_port
if [ $? -eq 0 ]; then
echo "Worker: Ray runtime started with head address $ray_address:$ray_port"
$worker_command
exit 0
fi
echo "Waiting until the ray worker is active..."
sleep 5s;
done
echo "Ray worker starts timeout, head address: $ray_address:$ray_port"
exit 1
kind: ConfigMap
metadata:
labels:
app: distributed-serving
chart: distributed-serving-0.1.0
createdBy: DistributedServing
heritage: Helm
release: qwq-dist-v1
name: qwq-dist-v1-cm
---
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
labels:
app: distributed-serving
release: qwq-dist-v1
serviceName: qwq-dist
servingName: qwq-dist
servingType: distributed-serving
servingVersion: v1
name: qwq-dist-v1-distributed-serving
spec:
leaderWorkerTemplate:
leaderTemplate:
metadata:
annotations:
prometheus.io/path: /metrics
prometheus.io/port: "8000"
prometheus.io/scrape: "true"
labels:
app: distributed-serving
release: qwq-dist-v1
role: leader
serviceName: qwq-dist
servingName: qwq-dist
servingType: distributed-serving
servingVersion: v1
spec:
containers:
- command:
- sh
- -c
- /vllm-workspace/ray_init.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE);
vllm serve /mnt/models/QwQ-32B --port 8000 --trust-remote-code --served-model-name
qwq --max-model-len 16384 --gpu-memory-utilization 0.95 --tensor-parallel-size
2 --pipeline-parallel-size 2 --enforce-eager
env:
- name: MASTER_ADDR
value: $(LWS_LEADER_ADDRESS)
- name: WORLD_SIZE
valueFrom:
fieldRef:
fieldPath: metadata.annotations['leaderworkerset.sigs.k8s.io/size']
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: POD_INDEX
valueFrom:
fieldRef:
fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
- name: GROUP_INDEX
valueFrom:
fieldRef:
fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/group-index']
- name: GPU_COUNT
value: "2"
- name: HOSTFILE
value: /etc/hostfile
- name: ROLE
value: master
image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.7.2
imagePullPolicy: IfNotPresent
name: distributed-serving-leader
ports:
- containerPort: 8000
name: restful
protocol: TCP
readinessProbe:
initialDelaySeconds: 30
periodSeconds: 30
tcpSocket:
port: 8000
resources:
limits:
nvidia.com/gpu: "2"
volumeMounts:
- mountPath: /dev/shm
name: dshm
- mountPath: /mnt/models
            name: llm-model
- mountPath: /etc/hostfile
name: qwq-dist-v1-cm
subPathExpr: hostfile-$(GROUP_INDEX)
volumes:
- configMap:
items:
- key: hostfile-0
mode: 438
path: hostfile-0
name: qwq-dist-v1-cm
name: qwq-dist-v1-cm
- emptyDir:
medium: Memory
sizeLimit: 30Gi
name: dshm
- name: llm-model
persistentVolumeClaim:
claimName: llm-model
restartPolicy: RecreateGroupOnPodRestart
size: 2
workerTemplate:
metadata:
labels:
app: distributed-serving
release: qwq-dist-v1
role: worker
serviceName: qwq-dist
servingName: qwq-dist
servingType: distributed-serving
servingVersion: v1
spec:
containers:
- command:
- sh
- -c
- /vllm-workspace/ray_init.sh worker --ray_address=$(LWS_LEADER_ADDRESS)
env:
- name: MASTER_ADDR
value: $(LWS_LEADER_ADDRESS)
- name: WORLD_SIZE
valueFrom:
fieldRef:
fieldPath: metadata.annotations['leaderworkerset.sigs.k8s.io/size']
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: POD_INDEX
valueFrom:
fieldRef:
fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
- name: GROUP_INDEX
valueFrom:
fieldRef:
fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/group-index']
- name: GPU_COUNT
value: "2"
- name: HOSTFILE
value: /etc/hostfile
- name: ROLE
value: worker
image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.7.2
imagePullPolicy: IfNotPresent
name: distributed-serving-worker
resources:
limits:
nvidia.com/gpu: "2"
volumeMounts:
- mountPath: /dev/shm
name: dshm
- mountPath: /mnt/models
            name: llm-model
- mountPath: /etc/hostfile
name: qwq-dist-v1-cm
subPathExpr: hostfile-$(GROUP_INDEX)
volumes:
- configMap:
items:
- key: hostfile-0
mode: 438
path: hostfile-0
name: qwq-dist-v1-cm
name: qwq-dist-v1-cm
- emptyDir:
medium: Memory
sizeLimit: 30Gi
name: dshm
- name: llm-model
persistentVolumeClaim:
claimName: llm-model
networkConfig:
subdomainPolicy: Shared
replicas: 3
startupPolicy: LeaderCreated
---
apiVersion: v1
kind: Service
metadata:
  name: qwq-dist-v1
labels:
app: distributed-serving
release: qwq-dist-v1
servingName: qwq-dist
servingType: distributed-serving
servingVersion: v1
spec:
ports:
- name: http-serving
port: 8000
protocol: TCP
targetPort: 8000
selector:
app: distributed-serving
release: qwq-dist-v1
role: leader
type: ClusterIP
EOF
The sample YAML uses LeaderWorkerSet to deploy the QwQ-32B model across multiple nodes. Each group contains a leader pod and a worker pod, each equipped with 2 GPUs, and together they form a Ray cluster.
When vLLM starts, pipeline parallelism is enabled across the two pods by setting pipeline-parallel-size to 2, and tensor parallelism is enabled across the two GPUs within each pod by setting tensor-parallel-size to 2.
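Before configuring the gateway, you can confirm that the distributed inference service has started. A minimal check using the labels from the manifest above; the leader log should eventually show that the Ray cluster formed and the vLLM API server is listening on port 8000:
# List all leader and worker pods of the LeaderWorkerSet.
kubectl get pods -l app=distributed-serving,release=qwq-dist-v1 -o wide
# Inspect the leader logs.
kubectl logs -l app=distributed-serving,release=qwq-dist-v1,role=leader --tail=100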
Intelligent Routing to Optimize Model Services Deployed across Multiple Nodes
ACK Gateway with AI Extension is a component designed for LLM inference scenarios. It supports traffic routing at Layer 4/7 and provides intelligent load balancing capabilities based on model server load. In addition, InferencePool and InferenceModel custom resources (CRDs) help flexibly define traffic distribution policies for inference services, including model canary release.
ACK Gateway with Inference Extension provides intelligent load balancing based on model server load. In a multi-node deployment that combines tensor parallelism and pipeline parallelism, it monitors system load through the leader model server's metrics and routes requests accordingly at a fine granularity, ensuring a balanced workload distribution across the distributed workload groups.
1. Enable the ACK Gateway with AI Extension component in the Components section of the ACK cluster, and select the Enable Gateway API inference extension option when enabling the component.

2. Create a gateway instance: Use kubectl to connect to the ACK cluster and run the following command to create a gateway with ports 8080 and 8081. Port 8080 provides standard HTTP routing to the backend inference service, while port 8081 routes traffic to the backend inference service through the inference extension.
kubectl apply -f- <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
name: inference-gateway
spec:
controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
name: inference-gateway
spec:
gatewayClassName: inference-gateway
listeners:
- name: http
protocol: HTTP
port: 8080
- name: llm-gw
protocol: HTTP
port: 8081
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: ClientTrafficPolicy
metadata:
name: client-buffer-limit
spec:
connection:
bufferLimit: 20Mi
targetRefs:
- group: gateway.networking.k8s.io
kind: Gateway
name: inference-gateway
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
name: backend-timeout
spec:
timeout:
http:
requestTimeout: 24h
targetRef:
group: gateway.networking.k8s.io
kind: Gateway
name: inference-gateway
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
name: reasoning-backend
spec:
parentRefs:
- name: inference-gateway
sectionName: http
rules:
- backendRefs:
- group: ""
kind: Service
name: qwq-dist-v1
port: 8000
weight: 1
matches:
- path:
type: PathPrefix
value: /
EOF
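After the manifest is applied, you can check that the gateway has been assigned an address and that the route was accepted:
kubectl get gateway inference-gateway
kubectl get httproute reasoning-backend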
Use the InferencePool and InferenceModel resources to enable the AI inference extension capabilities. Among the CRDs provided by the inference extension, InferencePool uses a label selector to declare a set of LLM inference service workloads running in the cluster, while InferenceModel specifies traffic distribution policies for specific models in the InferencePool.
1. Create an InferencePool.
InferencePool uses a label selector to declare a set of LLM inference service workloads:
kubectl apply -f- <<EOF
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferencePool
metadata:
annotations:
inference.networking.x-k8s.io/attach-to: |
name: inference-gateway
port: 8081
name: reasoning-pool
spec:
extensionRef:
failureMode: FailClose
group: ""
kind: Service
name: inference-gateway-ext-proc
selector:
app: distributed-serving
release: qwq-dist-v1
role: leader
targetPortNumber: 8000
EOF
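You can verify that the pool was created and that the endpoint-picker extension service it references exists (the service name inference-gateway-ext-proc comes from the spec above and is expected to be created by the inference extension component):
kubectl get inferencepool reasoning-pool
kubectl get service inference-gateway-ext-proc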
2. Create an InferenceModel.
InferenceModel defines traffic distribution policies for models and supports canary releases. The following example demonstrates the basic case: all requests for the model named qwq are forwarded to the QwQ-32B model. A weighted canary sketch follows the field descriptions below.
kubectl apply -f- <<EOF
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferenceModel
metadata:
name: reasoning-model
spec:
criticality: Critical
modelName: qwq
poolRef:
group: inference.networking.x-k8s.io
kind: InferencePool
name: reasoning-pool
targetModels:
- name: qwq
weight: 100
EOF
Field descriptions:
• The targetModels field defines the target models and their weight ratios. For example, the above configuration indicates that 100% of requests will be routed to the QwQ-32B model.
• The criticality field marks the importance of the model, which affects the priority of traffic scheduling.
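For a canary release, the same InferenceModel can split traffic across several target models by weight. The following is a minimal sketch only; the second target model name, qwq-canary, is hypothetical and must match a --served-model-name actually exposed by workloads in the pool.
kubectl apply -f- <<EOF
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferenceModel
metadata:
  name: reasoning-model
spec:
  criticality: Critical
  modelName: qwq
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: reasoning-pool
  targetModels:
  - name: qwq         # stable version: 90% of requests
    weight: 90
  - name: qwq-canary  # hypothetical canary version: 10% of requests
    weight: 10
EOF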
Run the following command to send a request to port 8081 of the gateway and check whether intelligent routing takes effect:
GATEWAY_IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
curl -X POST ${GATEWAY_IP}:8081/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "qwq",
"messages": [
{
"role": "user",
"content": "Who are you?"
}
]
}' -v
1. Collect metrics related to the inference service by using the default service discovery mechanism of the Prometheus instance. For more information, see Default service discovery.
2. Monitor the vLLM-based LLM inference service through the Grafana dashboard: Import the vLLM Grafana JSON model into Grafana to create an observability dashboard for the LLM inference service. For more information about the JSON model, visit the vLLM official website. For specific import operations, see Grafana dashboard export and import.
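Before relying on Prometheus service discovery, you can spot-check that a leader pod actually exposes vLLM metrics on the annotated port. The label selector comes from the deployment manifest above:
# Pick one leader pod and port-forward its metrics port.
LEADER_POD=$(kubectl get pods -l app=distributed-serving,role=leader -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward pod/$LEADER_POD 8000:8000 &
sleep 2
# vLLM metrics are prefixed with "vllm:".
curl -s localhost:8000/metrics | grep "^vllm" | head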
3. Run the following command to create a stress test tool.
kubectl apply -f- <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-benchmark
labels:
app: vllm-benchmark
spec:
replicas: 1
selector:
matchLabels:
app: vllm-benchmark
template:
metadata:
labels:
app: vllm-benchmark
spec:
volumes:
- name: llm-model
persistentVolumeClaim:
claimName: llm-model
containers:
- name: vllm-benchmark
image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm-benchmark:v1
command:
- "sh"
- "-c"
- "sleep inf"
volumeMounts:
- mountPath: /models/QwQ-32B
name: llm-model
EOF
4. Download the stress test dataset.
# Run the following command to log on to the benchmark pod.
PODNAME=$(kubectl get po -o custom-columns=":metadata.name"|grep "vllm-benchmark")
kubectl exec -it $PODNAME -- bash
# Download the stress test dataset.
pip3 install modelscope
modelscope download --dataset gliang1001/ShareGPT_V3_unfiltered_cleaned_split ShareGPT_V3_unfiltered_cleaned_split.json --local_dir /root/
5. Use the following commands to run two stress tests: one against port 8080 of the gateway (the gateway's default service routing) and one against port 8081 (the inference extension routing).
• Run the stress test against the default service routing of the gateway. Replace 192.168.14.126 with the internal IP address of your ACK gateway; port 8080 uses the gateway's default route.
python3 /root/vllm/benchmarks/benchmark_serving.py \
--backend vllm \
--model /models/QwQ-32B \
--served-model-name qwq \
--trust-remote-code \
--dataset-name random \
--dataset-path /root/ShareGPT_V3_unfiltered_cleaned_split.json \
--random-prefix-len 1000 \
--random-input-len 4000 \
--random-output-len 3000 \
--random-range-ratio 0.2 \
--num-prompts 300 \
--max-concurrency 30 \
    --host 192.168.14.126 \
    --port 8080 \
--endpoint /v1/completions \
--save-result \
2>&1 | tee benchmark_serving.txt
============ Serving Benchmark Result ============
Successful requests: 300
Benchmark duration (s): 2362.70
Total input tokens: 1014085
Total generated tokens: 537817
Request throughput (req/s): 0.13
Output token throughput (tok/s): 227.63
Total Token throughput (tok/s): 656.83
---------------Time to First Token----------------
Mean TTFT (ms): 10909.18
Median TTFT (ms): 2446.06
P99 TTFT (ms): 119308.83
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 114.18
Median TPOT (ms): 113.00
P99 TPOT (ms): 158.03
---------------Inter-token Latency----------------
Mean ITL (ms): 113.69
Median ITL (ms): 103.10
P99 ITL (ms): 109.65
==================================================
• Run the stress test against the service routing of the inference extension. Use the same gateway IP address, but request port 8081 so that requests are routed through the gateway inference extension.
python3 /root/vllm/benchmarks/benchmark_serving.py \
--backend vllm \
--model /models/QwQ-32B \
--served-model-name qwq \
--trust-remote-code \
--dataset-name random \
--dataset-path /root/ShareGPT_V3_unfiltered_cleaned_split.json \
--random-prefix-len 1000 \
--random-input-len 4000 \
--random-output-len 3000 \
--random-range-ratio 0.2 \
--num-prompts 300 \
--max-concurrency 30 \
--host 192.168.14.126 \
    --port 8081 \
--endpoint /v1/completions \
--save-result \
2>&1 | tee gie_benchmark_serving.txt
============ Serving Benchmark Result ============
Successful requests: 300
Benchmark duration (s): 2325.10
Total input tokens: 1014085
Total generated tokens: 538158
Request throughput (req/s): 0.13
Output token throughput (tok/s): 231.46
Total Token throughput (tok/s): 667.60
---------------Time to First Token----------------
Mean TTFT (ms): 7335.87
Median TTFT (ms): 2413.13
P99 TTFT (ms): 106624.40
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 117.26
Median TPOT (ms): 114.39
P99 TPOT (ms): 198.15
---------------Inter-token Latency----------------
Mean ITL (ms): 115.99
Median ITL (ms): 103.84
P99 ITL (ms): 109.88
==================================================
Stress tests are run on the standard gateway (port 8080) and ACK Gateway with AI Extension (port 8081). The key metrics are as follows:
| Metric | Gateway Default Routing (Port 8080) | ACK Gateway with AI Extension Intelligent Routing (Port 8081) |
|---|---|---|
| Mean TTFT (ms) | 10909.18 | 7335.87 |
| P99 TTFT (ms) | 119308.83 | 106624.40 |
| Total token throughput (tok/s) | 656.83 | 667.60 |
You can also visually compare the vLLM server's observability dashboard over the time periods of the two stress tests.

The stress test data and the dashboards show that when routing and load balancing for the inference service are handled by the inference extension of ACK Gateway, the QwQ-32B inference service achieves lower latency, higher throughput, and better cache utilization: mean TTFT is reduced by about 32.7% ((10909.18 - 7335.87) / 10909.18), and KV cache utilization is more evenly distributed across the workload groups.
Note: Test results depend on many factors; the results above are for reference only.