Container Compute Service (ACS) does not require deep knowledge of the underlying hardware or the management of GPU-accelerated nodes. Configurations work out of the box, deployment is simple, and billing is pay-as-you-go, which makes ACS a good fit for LLM inference services and helps reduce inference costs. DeepSeek-R1 has hundreds of billions of parameters, so a single GPU-accelerated instance cannot load the model or run it at full performance. We recommend a distributed deployment across two or more container instances to guarantee the performance of large models and increase their throughput. This topic describes how to use ACS compute power to deploy a distributed inference service based on the full version of DeepSeek.
Background information
DeepSeek-R1
vLLM
ACS
LeaderWorkerSet (LWS)
Fluid
Solution overview
Model splitting
DeepSeek-R1 has 671 billion parameters. A single GPU provides at most 96 GB of memory, which is insufficient to load the entire model. To resolve this issue, you must split the model. In this example, the model is deployed across two GPU-accelerated container instances and split by using pipeline parallelism (PP=2) and tensor parallelism (TP=8). The following figure shows how the model is split.

Pipeline parallelism (PP=2) splits the model into two stages. Each stage runs on one GPU-accelerated container instance. For example, model M is split into M1 and M2. M1 runs on the first GPU-accelerated container instance and passes its results to M2, which runs on the second GPU-accelerated container instance.
Tensor parallelism (TP=8) distributes the computation of each stage (M1 or M2) across eight GPUs. For example, in the M1 stage, the computation of each layer is split into eight shards that run on eight GPUs in parallel, and the partial results from the eight GPUs are then merged. These two settings map directly to the vLLM launch flags shown after this list.
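The split above corresponds to the vLLM launch flags used in the LWS template in Step 2. The following abridged snippet only restates those two flags with explanatory comments; it is not the complete serve command:

# 2 pods x 8 GPUs = 16 GPUs in total, which provides roughly 16 x 96 GB = 1,536 GB of aggregate GPU memory.
# --tensor-parallel-size 8   (TP=8): shard each pipeline stage across the 8 GPUs of one pod.
# --pipeline-parallel-size 2 (PP=2): one pipeline stage per pod (the LWS leader and the worker).
vllm serve /models/DeepSeek-R1/ --tensor-parallel-size 8 --pipeline-parallel-size 2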
Distributed architecture
This solution uses ACS to quickly deploy a distributed inference service based on the full version of DeepSeek. It uses vLLM and Ray to run the DeepSeek-R1 model in a distributed architecture, uses LWS to manage the deployment of the leader and worker pods, and uses the distributed cache provided by Fluid to accelerate model loading in the ACS cluster. vLLM is deployed in two GPU-accelerated ACS pods, each with eight GPUs. The two pods form a Ray group that consists of a Ray head (the leader) and a Ray worker, and the model is split across the group to improve overall throughput and concurrency. Take note that the values of the tensor-parallel-size and LWS_GROUP_SIZE variables in the YAML template vary depending on the distributed architecture, as shown in the example after this paragraph.
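For reference, the following mapping sketches how these values change if you scale the group out while keeping eight GPUs per pod. Only the first row is the configuration used in this topic; the second row is an assumed extension and has not been validated here:

# 2 pods x 8 GPUs: leaderWorkerTemplate.size: 2, LWS_GROUP_SIZE=2, --tensor-parallel-size 8, --pipeline-parallel-size 2   (this topic)
# 4 pods x 8 GPUs: leaderWorkerTemplate.size: 4, LWS_GROUP_SIZE=4, --tensor-parallel-size 8, --pipeline-parallel-size 4   (assumed scaling)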
Prerequisites
The first time you use Container Compute Service (ACS), you must assign the default role to your account. ACS can call other services such as ECS, OSS, NAS, CPFS, and SLB, create clusters, and save logs only after you complete the authorization. For more information, see Quick start for first-time ACS users.
A kubectl client is connected to the cluster. For more information, see Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.
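To quickly confirm that the kubectl client points to the intended cluster, you can run a read-only command such as the following (standard kubectl commands; no ACS-specific options are required):

kubectl cluster-info
kubectl get ns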
GPU-accelerated specification and estimated cost
The suggested ACS GPU-accelerated instance specification for a deployment of two or more instances is 8 GPUs (96 GiB of memory per GPU), 64 vCPUs, and 512 GiB of memory per instance. You can also refer to the Table of suggested specifications and the GPU models and specifications topics. For more information about the billing of ACS GPU-accelerated instances, see Billing overview.
Make sure that the specification of the ACS GPU-accelerated instance complies with ACS pod specification adjustment logic.
By default, an ACS pod provides 30 GiB of free EphemeralStorage. The inference image registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/vllm:v0.7.2 used in this example is 9.5 GiB in size. If you need more storage space, customize the size of the EphemeralStorage. For more information, see Add the EphemeralStorage.
Procedure
Step 1: Prepare the DeepSeek-R1 model files
The LLM requires large amounts of disk space to store model files. We recommend that you create a NAS or OSS volume to persist the model files. In this example, OSS is used.
To accelerate file downloading and uploading, you can submit a ticket to copy the files to your OSS bucket.
Run the following command to download the DeepSeek-R1 model from ModelScope.
Note: Check whether the git-lfs plug-in is installed. If it is not installed, run yum install git-lfs or apt-get install git-lfs to install it. For more information, see Install git-lfs.

git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/deepseek-ai/DeepSeek-R1.git
cd DeepSeek-R1/
git lfs pull

Create an OSS directory and upload the model files to the directory.
Note: To install and use ossutil, see Install ossutil.
ossutil mkdir oss://<your-bucket-name>/models/DeepSeek-R1
ossutil cp -r ./DeepSeek-R1 oss://<your-bucket-name>/models/DeepSeek-R1

You can use the following methods to load the model from OSS.
Use a pair of PVC and PV to mount the model: This method is suitable for small models. Use this method if your application does not need to quickly load models or launch pods.
Use the console
The following table describes the basic parameters that are used to create the PV.
Parameter | Description
PV Type | OSS
Volume Name | llm-model
Access Certificate | Specify the AccessKey ID and the AccessKey secret used to access the OSS bucket.
Bucket ID | Select the OSS bucket that you created in the previous step.
OSS Path | Select the path of the model, such as /models/DeepSeek-R1.
The following table describes the basic parameters that are used to create the PVC.
Parameter | Description
PVC Type | OSS
Name | llm-model
Allocation Mode | In this example, Existing Volumes is selected.
Existing Volumes | Click Existing Volumes and select the PV that you created.
Use kubectl
The following code block shows the YAML template:
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: <your-oss-ak> # The AccessKey ID used to access the OSS bucket.
  akSecret: <your-oss-sk> # The AccessKey secret used to access the OSS bucket.
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <your-bucket-name> # The name of the OSS bucket.
      url: <your-bucket-endpoint> # The endpoint, such as oss-cn-hangzhou-internal.aliyuncs.com.
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: <your-model-path> # The model path, such as /models/DeepSeek-R1/ in this example.
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model

Use Fluid to accelerate model loading: This method is suitable for large models. Use this method if your application needs to quickly load models or launch pods. For more information, see Use Fluid to accelerate data access.
Use Helm to install the ack-fluid component from the marketplace of ACS. The component version must be 1.0.11-* or later. For more information, see Use Helm to create an application.
Enable the privileged mode for ACS pods. To enable this mode, submit a ticket.
Create a Secret to access OSS.
apiVersion: v1
kind: Secret
metadata:
  name: mysecret
stringData:
  fs.oss.accessKeyId: xxx
  fs.oss.accessKeySecret: xxx

fs.oss.accessKeyId and fs.oss.accessKeySecret specify the preceding AccessKey ID and AccessKey secret used to access OSS.

Create a dataset and a JindoRuntime.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: deepseek
spec:
  mounts:
    - mountPoint: oss://<your-bucket-name> # Replace <your-bucket-name> with the actual value.
      options:
        fs.oss.endpoint: <your-bucket-endpoint> # Replace <your-bucket-endpoint> with the actual value.
      name: deepseek
      path: "/"
      encryptOptions:
        - name: fs.oss.accessKeyId
          valueFrom:
            secretKeyRef:
              name: mysecret
              key: fs.oss.accessKeyId
        - name: fs.oss.accessKeySecret
          valueFrom:
            secretKeyRef:
              name: mysecret
              key: fs.oss.accessKeySecret
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: deepseek
spec:
  replicas: 16 # Modify the parameter on demand.
  master:
    podMetadata:
      labels:
        alibabacloud.com/compute-class: performance
        alibabacloud.com/compute-qos: default
  worker:
    podMetadata:
      labels:
        alibabacloud.com/compute-class: performance
        alibabacloud.com/compute-qos: default
      annotations:
        kubernetes.io/resource-type: serverless
    resources:
      requests:
        cpu: 16
        memory: 128Gi
      limits:
        cpu: 16
        memory: 128Gi
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        volumeType: emptyDir # Modify the setting on demand.
        quota: 128Gi
        high: "0.99"
        low: "0.95"

Run the following command to check whether the pods are in the Running state:
kubectl get pod | grep jindo

Expected results:

deepseek-jindofs-master-0   1/1   Running   0   3m29s
deepseek-jindofs-worker-0   1/1   Running   0   2m52s
deepseek-jindofs-worker-1   1/1   Running   0   2m52s
...

Create a DataLoad job to cache the model.
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: deepseek
spec:
  dataset:
    name: deepseek
    namespace: default
  loadMetadata: true

Run the following command to query the status of the cache.
kubectl get dataload

Expected results:
NAME       DATASET    PHASE       AGE     DURATION
deepseek   deepseek   Executing   4m30s   Unfinished

If PHASE displays Executing, the caching is in progress. Wait about 20 minutes and run the command again. If the field displays Complete, the model is cached. You can run the kubectl logs $(kubectl get pods --selector=job-name=deepseek-loader-job -o jsonpath='{.items[0].metadata.name}') | grep progress command to query the name of the DataLoad job pod and print its log to view the progress.

Run the following command to check the dataset.
kubectl get datasets

Expected results:
NAME       UFS TOTAL SIZE   CACHED    CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
deepseek   1.25TiB          1.25TiB   2.00TiB          100.0%              Bound   21h
Step 2: Deploy the model based on ACS GPU compute power
Use Helm to install the lws component from the marketplace of ACS. For more information, see Use Helm to create an application.
Create a LeaderWorkerSet to deploy the model.
Note:
Replace the variable in alibabacloud.com/gpu-model-series: <example-model> with the actual GPU model supported by ACS. For more information about the GPU models supported by ACS, consult the PDSA or submit a ticket.
Compared with TCP/IP, high-performance RDMA features zero copy and kernel bypass, which avoid extra data copies and frequent context switches. RDMA can reduce latency and CPU usage and increase throughput. ACS allows you to add the alibabacloud.com/hpn-type: "rdma" label to use RDMA. For more information about the GPU models that support RDMA, consult the PDSA or submit a ticket.
To use Fluid to load the model, change the two claimName parameters in the persistentVolumeClaim volume definitions to the name of the Fluid dataset, as shown in the snippet after these notes.
The values of the tensor-parallel-size and LWS_GROUP_SIZE variables vary depending on the distributed architecture.
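For reference, the following snippet shows what the llm-model volume definition in the leader and worker templates would look like when Fluid is used, assuming the Fluid dataset is named deepseek as in Step 1 (the shm volume stays unchanged):

volumes:
  - name: llm-model
    persistentVolumeClaim:
      claimName: deepseek # The name of the Fluid dataset instead of the llm-model PVC.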
Standard deployment example
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: deepseek-r1-671b-fp8-distrubution
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2 # The total number of leaders and workers.
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
          alibabacloud.com/compute-class: gpu # Specify GPU compute power.
          alibabacloud.com/compute-qos: default # Specify the ACS QoS class.
          alibabacloud.com/gpu-model-series: <example-model> # Specify the GPU model.
      spec:
        volumes:
          - name: llm-model
            persistentVolumeClaim: # If Fluid is used, specify the name of the Fluid dataset, such as deepseek.
              claimName: llm-model
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 32Gi
        containers:
          - name: deepseek-r1-671b-leader
            image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/vllm:v0.7.2
            env:
              - name: NCCL_SOCKET_IFNAME # Specify the NIC.
                value: eth0
            command:
              - sh
              - -c
              - "/vllm-workspace/ray_init.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE);vllm serve /models/DeepSeek-R1/ --port 8000 --trust-remote-code --served-model-name ds --max-model-len 2048 --gpu-memory-utilization 0.95 --tensor-parallel-size 8 --pipeline-parallel-size 2 --enforce-eager" # Set tensor-parallel-size to the number of GPUs provided by each leader or worker pod.
            resources:
              limits:
                nvidia.com/gpu: "8"
                cpu: "64"
                memory: 512G
              requests:
                nvidia.com/gpu: "8"
                cpu: "64"
                memory: 512G
            ports:
              - containerPort: 8000
            volumeMounts:
              - mountPath: /models/DeepSeek-R1
                name: llm-model
              - mountPath: /dev/shm
                name: shm
    workerTemplate:
      metadata:
        labels:
          alibabacloud.com/compute-class: gpu # Specify GPU compute power.
          alibabacloud.com/compute-qos: default # Specify the ACS QoS class.
          alibabacloud.com/gpu-model-series: <example-model> # Specify the GPU model.
      spec:
        volumes:
          - name: llm-model
            persistentVolumeClaim: # If Fluid is used, specify the name of the Fluid dataset, such as deepseek.
              claimName: llm-model
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 32Gi
        containers:
          - name: deepseek-r1-671b-worker
            image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/vllm:v0.7.2
            env:
              - name: NCCL_SOCKET_IFNAME # Specify the NIC.
                value: eth0
            command:
              - sh
              - -c
              - "/vllm-workspace/ray_init.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
            resources:
              limits:
                nvidia.com/gpu: "8"
                cpu: "64"
                memory: 512G
              requests:
                nvidia.com/gpu: "8"
                cpu: "64"
                memory: 512G
            ports:
              - containerPort: 8000
            volumeMounts:
              - mountPath: /models/DeepSeek-R1
                name: llm-model
              - mountPath: /dev/shm
                name: shm

RDMA acceleration example
If an open source base image, such as the vLLM image, is used, add the following environment variables to the YAML file.

Name | Value
NCCL_SOCKET_IFNAME | eth0
NCCL_IB_TC | 136
NCCL_IB_SL | 5
NCCL_IB_GID_INDEX | 3
NCCL_DEBUG | INFO
NCCL_IB_HCA | mlx5
NCCL_NET_PLUGIN | none
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: deepseek-r1-671b-fp8-distrubution
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2 # The total number of leaders and workers.
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
          alibabacloud.com/compute-class: gpu # Specify GPU compute power.
          alibabacloud.com/compute-qos: default # Specify the ACS QoS class.
          alibabacloud.com/gpu-model-series: <example-model> # Specify the GPU model.
          # Run the application in a high-performance RDMA network. Submit a ticket to obtain the list of GPU models that support RDMA.
          alibabacloud.com/hpn-type: "rdma"
      spec:
        volumes:
          - name: llm-model
            persistentVolumeClaim: # If Fluid is used, specify the name of the Fluid dataset, such as deepseek.
              claimName: llm-model
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 32Gi
        containers:
          - name: deepseek-r1-671b-leader
            image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/vllm:v0.7.2
            env:
              - name: NCCL_SOCKET_IFNAME # Specify the NIC.
                value: eth0
              - name: NCCL_IB_TC
                value: "136"
              - name: NCCL_IB_SL
                value: "5"
              - name: NCCL_IB_GID_INDEX
                value: "3"
              - name: NCCL_DEBUG
                value: "INFO"
              - name: NCCL_IB_HCA
                value: "mlx5"
              - name: NCCL_NET_PLUGIN
                value: "none"
            command:
              - sh
              - -c
              - "/vllm-workspace/ray_init.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE);vllm serve /models/DeepSeek-R1/ --port 8000 --trust-remote-code --served-model-name ds --max-model-len 2048 --gpu-memory-utilization 0.95 --tensor-parallel-size 8 --pipeline-parallel-size 2 --enforce-eager" # Set tensor-parallel-size to the number of GPUs provided by each leader or worker pod.
            resources:
              limits:
                nvidia.com/gpu: "8"
                cpu: "64"
                memory: 512G
              requests:
                nvidia.com/gpu: "8"
                cpu: "64"
                memory: 512G
            ports:
              - containerPort: 8000
            volumeMounts:
              - mountPath: /models/DeepSeek-R1
                name: llm-model
              - mountPath: /dev/shm
                name: shm
    workerTemplate:
      metadata:
        labels:
          alibabacloud.com/compute-class: gpu # Specify GPU compute power.
          alibabacloud.com/compute-qos: default # Specify the ACS QoS class.
          alibabacloud.com/gpu-model-series: <example-model> # Specify the GPU model.
          # Run the application in a high-performance RDMA network. Submit a ticket to obtain the list of GPU models that support RDMA.
          alibabacloud.com/hpn-type: "rdma"
      spec:
        volumes:
          - name: llm-model
            persistentVolumeClaim: # If Fluid is used, specify the name of the Fluid dataset, such as deepseek.
              claimName: llm-model
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 32Gi
        containers:
          - name: deepseek-r1-671b-worker
            image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/vllm:v0.7.2
            env:
              - name: NCCL_SOCKET_IFNAME # Specify the NIC.
                value: eth0
              - name: NCCL_IB_TC
                value: "136"
              - name: NCCL_IB_SL
                value: "5"
              - name: NCCL_IB_GID_INDEX
                value: "3"
              - name: NCCL_DEBUG
                value: "INFO"
              - name: NCCL_IB_HCA
                value: "mlx5"
              - name: NCCL_NET_PLUGIN
                value: "none"
            command:
              - sh
              - -c
              - "/vllm-workspace/ray_init.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
            resources:
              limits:
                nvidia.com/gpu: "8"
                cpu: "64"
                memory: 512G
              requests:
                nvidia.com/gpu: "8"
                cpu: "64"
                memory: 512G
            ports:
              - containerPort: 8000
            volumeMounts:
              - mountPath: /models/DeepSeek-R1
                name: llm-model
              - mountPath: /dev/shm
                name: shm

Create a Service to expose the inference service.
apiVersion: v1
kind: Service
metadata:
  name: ds-leader
spec:
  ports:
    - name: http
      port: 8000
      protocol: TCP
      targetPort: 8000
  selector:
    leaderworkerset.sigs.k8s.io/name: deepseek-r1-671b-fp8-distrubution
    role: leader
  type: ClusterIP
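After you create the Service, you can confirm that it exists and has been assigned a cluster IP by running a standard kubectl query, for example:

kubectl get svc ds-leader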
Step 3: Verify the inference service
Run kubectl port-forward to configure port forwarding between the local environment and the inference service.

Note: Port forwarding set up by using kubectl port-forward is not reliable, secure, or scalable in production environments. It is intended only for development and debugging. Do not use this command to set up port forwarding in production environments. For more information about networking solutions used for production in ACK clusters, see Ingress management.

kubectl port-forward svc/ds-leader 8000:8000

Expected results:
Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000

Send requests to the inference service.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ds",
    "messages": [
      { "role": "system", "content": "You are a friendly AI assistant." },
      { "role": "user", "content": "Please introduce deep learning." }
    ],
    "max_tokens": 1024,
    "temperature": 0.7,
    "top_p": 0.9,
    "seed": 10
  }'

Expected results:
{"id":"chatcmpl-4bc78b66e2a4439f8362bd434a60be57","object":"chat.completion","created":1739501401,"model":"ds","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"OK. I have to think carefully about how to answer this question. First, I have to explain the definition of deep learning. It is a branch of machine learning, right? Then, I need to compare it with traditional machine learning methods to explain the advantages of deep learning, such as automatic feature extraction. It may be necessary to mention neural networks, especially the structure of deep neural networks, such as multiple hidden layers. \n\nNext, we should talk about the core components of deep learning, such as activation functions, loss functions, and optimizers. You may be unfamiliar with these terms, so I need to briefly explain how each component works. For example, ReLU functions as an activation function, Adam functions as an optimizer, and how the cross entropy loss function works. \n\nThen, in the application sector, computer vision and natural language processing are common. A few more examples may be needed, such as image recognition and machine translation, which are much easier to understand. In addition, users may be interested in use scenarios, such as the medical care and finance sectors. \n\nAs for popular frameworks, such as TensorFlow and PyTorch, they are used to make deep learning easier to implement. There is also the importance of hardware acceleration such as GPU, which explains why deep learning is developing so fast now. \n\nIn addition, it is possible to discuss the challenges and limitations of deep learning, such as data dependence, high computing resource requirements, and poor interpretability, so that users can fully understand the advantages and disadvantages. It may also be necessary to mention some future development directions, such as efficient training algorithms and interpretability research. \n\nThe user's need may be to understand the basic concepts of deep learning, and may have some technical background, but not too in-depth. They may want to quickly grasp the key points and applications, so the answer needs to be clearly structured, focused, not too in-depth technical details, but not too brief. Need to balance professionalism and understandability. \n\nAvoid using too many terms, or explain a little when using terms, such as "neural network", "convolutional neural network", etc. Perhaps the user is a student or a practitioner who has just entered the profession and needs to be understood by them in plain language. At the same time, give some examples to help them contact the practical application, so that it is easier to remember. \n\nWe should also pay attention to the relationship between deep learning and machine learning, and we need to clearly distinguish the difference between the two, which shows that deep learning belongs to machine learning, but at a deeper level and deal with more complex problems. The backpropagation algorithm may also be mentioned as one of the key techniques for training. \n\nFinally, make a summary, emphasizing the impact and potential of deep learning and the future development direction, so that users can have a comprehensive understanding. Check for missing important points, such as common model structures such as CNN and RNN, which may also need to be briefly mentioned to show the diversity. 
\n\nSometimes the user may be interested in the principle, but it is important to keep it simple here, focusing on the overview rather than the in-depth technical details. Ensure that the logic of the answer is smooth, starting from the definition, to the core components, applications, frameworks, challenges, and future directions. This structure is more reasonable. \n</think>\n\nDeep learning is a branch of machine learning that aims to simulate the learning mechanism of the human brain by building multi-layered neural networks (called "deep" networks). It is widely used in Image Recognition, speech processing, Natural Language Processing and other fields by automatically learning complex features and patterns from large amounts of data. \n\n### Core Concepts \n1. **Artificial Neural Network (ANN)**:\n -Consists of an input layer, multiple hidden layers, and an output layer, each containing multiple neurons. \n - Information processing is achieved by simulating the activation and transmission of neurons. \n\n2. **Automatic feature extraction**:\n - Traditional machine learning relies on manual design features, while deep learning automatically extracts abstract features (such as from pixels to edges and shapes of objects) of data through multi-layer networks. \n\n3. **Key components**:\n - **Activation functions** (such as ReLU and Sigmoid): introduce nonlinearity to enhance model expression. \n - **Loss function** (e. g. cross entropy, mean square error): measures the difference between the predicted value and the true value. \n - **Optimizer** (such as SGD and Adam): optimizes network parameters through back propagation to minimize losses. \n\n---\n\n### Typical model\n- **Convolutional neural network (CNN)**: \n Designed for images, spatial features are extracted through convolution kernels. Classic models such as ResNet and VGG. \n- **Recurrent Neural Network (RNN)**: \n Processing sequence data (text, speech), introducing the memory mechanism, improved versions such as LSTM, GRU. \n- **Transformer**: \n Based on the self-attention mechanism, the performance of Natural Language Processing (such as BERT and GPT series) is greatly improved. \n\n---\n\n### Application scenarios\n- **Computer vision**: Face Recognition, medical image analysis (such as lung CT lesion detection). \n- **Natural Language Processing**: Intelligent customer service, document summary generation, and translation (such as DeepL). \n- **Voice technology**: voice assistant (such as Siri), real-time subtitle generation. \n- **reinforcement learning**: game AI (AlphaGo), robot control. \n\n---\n\n ### Advantages and Challenges \n- **Advantages**:\n -Automatically learn complex features and reduce manual intervention. \n -It performs far better than traditional methods in big data and high computing power. \n- **Challenge**:\n -Relies on a large amount of labeled data (for example, tens of thousands of labeled medical images). \n - The model training cost is high (for example, the GPT-3 training cost exceeds tens of millions of dollars). \n - The "black box" characteristic leads to poor interpretability and limited application in high-risk areas such as medical care. \n\n---\n\n ### Tools and trends \n- **Mainstream frameworks**:TensorFlow (industrial deployment-friendly) and PyTorch (preferred for research). \n- **Research Direction**:\n -Lightweight model (such as MobileNet for mobile devices). \n -Self-supervised learning (reduces dependence on labeled data). 
\n -interpretability enhancement (such as visualizing the model decision basis). \n\nDeep learning is pushing the boundaries of artificial intelligence, from generative AI (such as Stable Diffusion to generate images) to autonomous driving, continuously changing the technology ecology. Its future development may achieve breakthroughs in reducing computing costs, improving efficiency and interpretability.","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":17,"total_tokens":1131,"completion_tokens":1114,"prompt_tokens_details":null},"prompt_logprobs":null}
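Because vLLM exposes an OpenAI-compatible API, you can also request streaming output by setting stream to true in the request body. The following is a minimal example against the same endpoint and served model name as above; adjust the prompt and parameters as needed:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ds",
    "messages": [
      { "role": "user", "content": "Please introduce deep learning." }
    ],
    "max_tokens": 256,
    "stream": true
  }'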
References
Container Compute Service (ACS) is integrated with Container Service for Kubernetes (ACK), which allows you to use ACS compute power in ACK Pro clusters. For more information about using ACS GPU compute power in ACK, see Use the computing power of ACS in ACK Pro clusters.
For more information about deploying DeepSeek in ACK, see the following topics:
For more information about DeepSeek R1 and V3, see the following topics:
The AI container image of ACS is dedicated to GPU-accelerated containers in ACS clusters. For more information about the release notes of this image, see Release notes for ACS AI container images.