Container Compute Service: Use ACS GPU compute power to deploy a distributed model inference service based on the full version of DeepSeek

Last Updated: Apr 09, 2025

Container Compute Service (ACS) does not require deep knowledge of the underlying hardware or the management of GPU-accelerated nodes. Configurations are available out of the box, deployment is simple, and billing is pay-as-you-go, which makes ACS well suited to LLM inference services and helps reduce inference costs. DeepSeek-R1 has hundreds of billions of parameters, and a single GPU cannot load or run DeepSeek-R1 at full performance. We recommend a distributed deployment across two or more container instances to guarantee the performance of large models and increase their throughput. This topic describes how to use ACS compute power to deploy a distributed inference service based on the full version of DeepSeek.

Background information

DeepSeek-R1

DeepSeek-R1 is the first-generation reasoning model provided by DeepSeek. It improves the reasoning performance of LLMs through large-scale reinforcement learning. Benchmark results show that DeepSeek-R1 outperforms other closed-source models in mathematical reasoning and programming competitions, and its performance reaches or surpasses the OpenAI o1 series in certain areas. DeepSeek-R1 also performs well in knowledge-related tasks such as creative writing and Q&A. For more information about DeepSeek, see the DeepSeek AI GitHub repository.

vLLM

vLLM is a high-performance, easy-to-use LLM inference serving framework. vLLM supports most commonly used LLMs, including the Qwen models. Powered by technologies such as PagedAttention, continuous batching, and model quantization, vLLM greatly improves the inference efficiency of LLMs. For more information about the vLLM framework, see the vLLM GitHub repository.

ACS

ACS was released in 2023. It focuses on delivering inclusive, easy-to-use, elastic, and flexible next-generation container compute power. ACS provides general-purpose and heterogeneous compute power that complies with Kubernetes specifications, and offers serverless container compute resources so that you do not need to manage nodes or clusters. You can integrate scheduling, container runtime, storage, and networking capabilities with ACS to reduce the O&M complexity of Kubernetes and improve the elasticity and flexibility of container compute power. With pay-as-you-go billing, elastic instances, and flexible capabilities, ACS can greatly reduce resource costs. In LLM inference scenarios, ACS can also accelerate data and image loading to further reduce the model launch time and resource cost.

LeaderWorkerSet (LWS)

LWS is a new type of workload proposed by a Kubernetes Special Interest Group (SIG). Unlike Kubernetes-native workloads such as Deployments and StatefulSets, LWS treats a group of pods instead of a single pod as a replica: each replica consists of one leader pod and one or more worker pods, and when a replica is scaled, all pods in the group are created or removed together. This makes LWS suitable for AI or machine learning inference jobs that are distributed across multiple instances. For more information about LWS, see the LWS GitHub repository.
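
A minimal skeleton (for illustration only; the field values are placeholders) shows the structure of an LWS object. The complete manifest used in this topic appears in Step 2.

    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    metadata:
      name: example-lws          # Placeholder name.
    spec:
      replicas: 1                # The number of pod groups. Each group is one replica.
      leaderWorkerTemplate:
        size: 2                  # Pods per group: one leader plus (size - 1) workers.
        leaderTemplate: {}       # The pod template for the leader pod.
        workerTemplate: {}       # The pod template for the worker pods.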

Fluid

Fluid enables the observability, auto scaling, and portability of datasets by managing and scheduling JindoRuntime, and it uses distributed caches to accelerate access to large models. For example, if 10 inference service instances are launched simultaneously, the bandwidth available for each instance to pull data from OSS is limited and data loading is noticeably delayed. By extending the underlying storage system into the ACS cluster, Fluid lets you use the aggregate bandwidth of distributed cache nodes, which greatly reduces the model loading time and makes your business more elastic.

Solution overview

Model splitting

DeepSeek-R1 has 671 billion parameters, and a single GPU provides at most 96 GB of memory, which is insufficient to load the entire model. To resolve this issue, you need to split the model. In this example, the model is deployed across two GPU-accelerated container instances and split by using pipeline parallelism (PP=2) and tensor parallelism (TP=8). The following figure shows how the model is split.

(Figure: the model is split across two container instances by using PP=2 and TP=8.)

Pipeline parallelism (PP=2) splits the model into two stages. Each stage runs on a GPU-accelerated container instance. For example, model M is split into M1 and M2. M1 runs on the first GPU-accelerated container instance and passes its results to M2, which runs on the second GPU-accelerated container instance.

Tensor parallelism (TP=8) distributes the computation of each stage (M1 or M2) across eight GPUs. For example, in the M1 stage, the computation is split into eight shards that are processed in parallel, and the system then merges the results from the eight GPUs.
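
A rough, back-of-the-envelope estimate (not part of the original topic) shows why the model is spread across two 8-GPU instances:

    FP8 weights (about 1 byte per parameter):  671 B parameters ≈ 671 GB
    GPU memory per 8-GPU instance:             8 × 96 GB = 768 GB (little headroom for the KV cache and activations)
    GPU memory across 2 instances (PP=2):      2 × 768 GB = 1,536 GB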

Distributed architecture

This solution uses ACS to quickly deploy a distributed inference service based on the full version of DeepSeek. vLLM and Ray deploy the DeepSeek-R1 model in a distributed architecture, LWS manages the deployment of the leader and workers, and the distributed caches provided by Fluid accelerate model loading in the ACS cluster. vLLM is deployed in two GPU-accelerated ACS pods, each with eight GPUs. The two pods form a Ray cluster that consists of a Ray head (the leader) and Ray workers, which improves the overall throughput and concurrency. You can split the model accordingly. Take note that the values of the tensor-parallel-size and LWS_GROUP_SIZE variables in the YAML template vary depending on the distributed architecture, as shown in the sizing sketch after the following figure.

(Figure: distributed architecture with LWS, vLLM, Ray, and Fluid on ACS.)
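
A sizing sketch (the four-pod variant is a hypothetical example, not part of this topic) shows how the values relate to each other:

    # Total GPUs = tensor-parallel-size × pipeline-parallel-size.
    # Two-pod layout used in this topic (8 GPUs per pod):
    #   LeaderWorkerSet: leaderWorkerTemplate.size: 2   # LWS injects LWS_GROUP_SIZE=2.
    #   vllm serve ... --tensor-parallel-size 8 --pipeline-parallel-size 2
    # Hypothetical four-pod layout (8 GPUs per pod):
    #   LeaderWorkerSet: leaderWorkerTemplate.size: 4   # LWS injects LWS_GROUP_SIZE=4.
    #   vllm serve ... --tensor-parallel-size 8 --pipeline-parallel-size 4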

Prerequisites

GPU-accelerated specification and estimated cost

For a deployment of two or more instances, the suggested ACS GPU-accelerated instance specification is 8 GPUs (96 GiB of memory per GPU), 64 vCPUs, and 512 GiB of memory. You can also refer to the table of suggested specifications and GPU models and specifications. For more information about the billing of ACS GPU-accelerated instances, see Billing overview.

Note
  • Make sure that the specification of the ACS GPU-accelerated instance complies with ACS pod specification adjustment logic.

  • By default, an ACS pod provides 30 GiB of free EphemeralStorage. The inference image registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/vllm:v0.7.2 used in this example is 9.5 GiB in size. If you need more storage space, customize the size of the EphemeralStorage. For more information, see Add the EphemeralStorage.

Procedure

Step 1: Prepare the DeepSeek-R1 model files

The LLM requires large amounts of disk space to store model files. We recommend that you create a NAS or OSS volume to persist the model files. In this example, OSS is used.

Note

To accelerate file downloading and uploading, you can submit a ticket to copy the files to your OSS bucket.

  1. Run the following command to download the DeepSeek-R1 model from ModelScope.

    Note

    Check whether the git-lfs plug-in is installed. If not, run yum install git-lfs or apt-get install git-lfs to install it. For more information, see Install git-lfs.

    git lfs install
    GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/deepseek-ai/DeepSeek-R1.git
    cd DeepSeek-R1/
    git lfs pull
  2. Create an OSS directory and upload the model files to the directory.

    Note

    To install and use ossutil, see Install ossutil.

    ossutil mkdir oss://<your-bucket-name>/models/DeepSeek-R1
    ossutil cp -r ./DeepSeek-R1 oss://<your-bucket-name>/models/DeepSeek-R1
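
    (Optional) To confirm that the upload is complete, list the directory. This verification step is not part of the original procedure.

    ossutil ls oss://<your-bucket-name>/models/DeepSeek-R1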
  3. You can use the following methods to load the model from OSS.

    1. Use a PV and PVC pair to mount the model: This method is suitable for small models, or when your application does not need to load models or launch pods quickly.

      Use the console

      The following parameters are used to create the PV.

      • PV Type: OSS
      • Volume Name: llm-model
      • Access Certificate: Specify the AccessKey ID and the AccessKey secret used to access the OSS bucket.
      • Bucket ID: Select the OSS bucket that you created in the previous step.
      • OSS Path: Select the path of the model, such as /models/DeepSeek-R1.

      The following parameters are used to create the PVC.

      • PVC Type: OSS
      • Name: llm-model
      • Allocation Mode: In this example, Existing Volumes is selected.
      • Existing Volumes: Click Existing Volumes and select the PV that you created.

      Use kubectl

      The following code block shows the YAML template:

      apiVersion: v1
      kind: Secret
      metadata:
        name: oss-secret
      stringData:
        akId: <your-oss-ak> # The AccessKey ID used to access the OSS bucket.
        akSecret: <your-oss-sk> # The AccessKey secret used to access the OSS bucket.
      ---
      apiVersion: v1
      kind: PersistentVolume
      metadata:
        name: llm-model
        labels:
          alicloud-pvname: llm-model
      spec:
        capacity:
          storage: 30Gi 
        accessModes:
          - ReadOnlyMany
        persistentVolumeReclaimPolicy: Retain
        csi:
          driver: ossplugin.csi.alibabacloud.com
          volumeHandle: llm-model
          nodePublishSecretRef:
            name: oss-secret
            namespace: default
          volumeAttributes:
            bucket: <your-bucket-name> # The name of the OSS bucket.
            url: <your-bucket-endpoint> # The endpoint, such as oss-cn-hangzhou-internal.aliyuncs.com.
            otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
            path: <your-model-path> # The model path, such as /models/DeepSeek-R1/ in this example.
      ---
      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: llm-model
      spec:
        accessModes:
          - ReadOnlyMany
        resources:
          requests:
            storage: 30Gi
        selector:
          matchLabels:
            alicloud-pvname: llm-model
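
      After you save the template to a file (the file name llm-model.yaml is only an example), create the resources and confirm that the PVC is bound:

      kubectl apply -f llm-model.yaml
      kubectl get pvc llm-model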
    2. Use Fluid to accelerate model loading: This method is suitable for large models, or when your application needs to load models or launch pods quickly. For more information, see Use Fluid to accelerate data access.

      1. Use Helm to install the ack-fluid component from the marketplace of ACS. The component version must be 1.0.11-* or later. For more information, see Use Helm to create an application.

      2. Enable the privileged mode for ACS pods. To enable this mode, submit a ticket.

      3. Create a Secret to access OSS.

        apiVersion: v1
        kind: Secret
        metadata:
          name: mysecret
        stringData:
          fs.oss.accessKeyId: xxx
          fs.oss.accessKeySecret: xxx

        fs.oss.accessKeyId and fs.oss.accessKeySecret specify the preceding AccessKey ID and AccessKey secret used to access OSS.

      4. Create a dataset and a JindoRuntime.

        apiVersion: data.fluid.io/v1alpha1
        kind: Dataset
        metadata:
          name: deepseek
        spec:
          mounts:
            - mountPoint:  oss://<your-bucket-name>       # Replace <your-bucket-name> with the actual value.
              options:
                fs.oss.endpoint: <your-bucket-endpoint>    # Replace <your-bucket-endpoint> with the actual value.
              name: deepseek
              path: "/"
              encryptOptions:
                - name: fs.oss.accessKeyId
                  valueFrom:
                    secretKeyRef:
                      name: mysecret
                      key: fs.oss.accessKeyId
                - name: fs.oss.accessKeySecret
                  valueFrom:
                    secretKeyRef:
                      name: mysecret
                      key: fs.oss.accessKeySecret
        ---
        apiVersion: data.fluid.io/v1alpha1
        kind: JindoRuntime
        metadata:
          name: deepseek
        spec:
          replicas: 16    # Modify the parameter on demand.
          master:
            podMetadata:
              labels:
                alibabacloud.com/compute-class: performance
                alibabacloud.com/compute-qos: default
          worker:
            podMetadata:
              labels:
                alibabacloud.com/compute-class: performance
                alibabacloud.com/compute-qos: default
              annotations:
                kubernetes.io/resource-type: serverless
            resources:
              requests:
                cpu: 16
                memory: 128Gi
              limits:
                cpu: 16
                memory: 128Gi
          tieredstore:
            levels:
              - mediumtype: MEM
                path: /dev/shm
                volumeType: emptyDir
                ## Modify the setting on demand.
                quota: 128Gi
                high: "0.99"
                low: "0.95"

        Run the kubectl get pod | grep jindo command to check whether the pods are in the Running state. Expected results:

        deepseek-jindofs-master-0    1/1     Running   0          3m29s
        deepseek-jindofs-worker-0    1/1     Running   0          2m52s
        deepseek-jindofs-worker-1    1/1     Running   0          2m52s
        ...
      5. Create a DataLoad job to cache the model.

        apiVersion: data.fluid.io/v1alpha1
        kind: DataLoad
        metadata:
          name: deepseek
        spec:
          dataset:
            name: deepseek
            namespace: default
          loadMetadata: true
      6. Run the following command to query the status of the cache.

        kubectl get dataload

        Expected results:

        NAME       DATASET    PHASE       AGE     DURATION
        deepseek   deepseek   Executing   4m30s   Unfinished

        If PHASE displays Executing, the caching is in progress. Wait about 20 minutes and then run the command again. If the field displays Complete, the model is cached. You can run the kubectl logs $(kubectl get pods --selector=job-name=deepseek-loader-job -o jsonpath='{.items[0].metadata.name}') | grep progress command to view the caching progress in the log of the DataLoad job.

        Fluid DataLoad parameters

        • Name: The name of the DataLoad job. Example: deepseek.
        • Dataset: The name of the dataset. Example: deepseek.
        • Phase: The status of the DataLoad job. Complete indicates that the job is completed. Example: Executing or Complete.
        • Age: The creation time of the DataLoad job. Example: 4m30s.
        • Duration: The duration of the DataLoad job. Example: Unfinished or 16m29s.

      7. Run the following command to check the dataset.

        kubectl get datasets

        Expected results:

        NAME       UFS TOTAL SIZE   CACHED    CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
        deepseek   1.25TiB          1.25TiB   2.00TiB          100.0%              Bound   21h

        Fluid dataset parameters

        • Name: The name of the dataset. Example: deepseek.
        • UFS Total Size: The size of the dataset. Example: 1.25TiB.
        • Cached: The size of the cached data. Example: 1.25TiB.
        • Cache Capacity: The total size of the cache. Example: 2.00TiB.
        • Cached Percentage: The percentage of the cached data. Example: 100.0%.
        • Phase: The status of the dataset. Bound indicates that the dataset is associated. Example: Bound.
        • Age: The creation time of the dataset. Example: 21h.

Step 2: Deploy the model based on ACS GPU compute power

  1. Use Helm to install the lws component from the marketplace of ACS. For more information, see Use Helm to create an application.

  2. Create a LeaderWorkerSet to deploy the model.

    Note
    • Replace <example-model> in alibabacloud.com/gpu-model-series: <example-model> with a GPU model supported by ACS. For more information about the GPU models supported by ACS, consult the PDSA or submit a ticket.

    • Compared with TCP/IP, high-performance RDMA features zero copy and kernel bypass, which avoid memory copies and frequent context switches. RDMA reduces latency and CPU usage and increases throughput. ACS allows you to add the alibabacloud.com/hpn-type: "rdma" label to use RDMA. For more information about the GPU models that support RDMA, consult the PDSA or submit a ticket.

    • To use Fluid to load the model, change the two claimName fields in the persistentVolumeClaim volumes to the name of the Fluid dataset.

    • The values of the tensor-parallel-size and LWS_GROUP_SIZE variables vary depending on the distributed architecture.

    Standard deployment example

    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    metadata:
      name: deepseek-r1-671b-fp8-distribution
    spec:
      replicas: 1
      leaderWorkerTemplate:
        size: 2 # The total number of leader and worker pods.
        restartPolicy: RecreateGroupOnPodRestart
        leaderTemplate:
          metadata:
            labels: 
              role: leader
              alibabacloud.com/compute-class: gpu  # Specify GPU compute power.
              alibabacloud.com/compute-qos: default # Specify the ACS QoS class.
              alibabacloud.com/gpu-model-series: <example-model> # Specify the GPU model.
          spec:
            volumes:
              - name: llm-model
                persistentVolumeClaim:
                  ## If Fluid is used, specify the name of the Fluid dataset, such as deepseek.
                  claimName: llm-model
              - name: shm
                emptyDir:
                  medium: Memory
                  sizeLimit: 32Gi
            containers:
              - name: deepseek-r1-671b-leader
                image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/vllm:v0.7.2
                env:
                  - name: NCCL_SOCKET_IFNAME #Specify the NIC.
                    value: eth0
                command:
                  - sh
                  - -c
                  - "/vllm-workspace/ray_init.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE);vllm serve /models/DeepSeek-R1/ --port 8000 --trust-remote-code --served-model-name ds --max-model-len 2048 --gpu-memory-utilization 0.95 --tensor-parallel-size 8 --pipeline-parallel-size 2 --enforce-eager"
    #Set tensor-parallel-size to the number of GPUs provided by each leader or worker pod.
                resources:
                  limits:
                    nvidia.com/gpu: "8"
                    cpu: "64"
                    memory: 512G
                  requests:
                    nvidia.com/gpu: "8"
                    cpu: "64"
                    memory: 512G           
                ports:
                  - containerPort: 8000
                volumeMounts:
                  - mountPath: /models/DeepSeek-R1
                    name: llm-model
                  - mountPath: /dev/shm
                    name: shm
        workerTemplate:
          metadata:
            labels: 
              alibabacloud.com/compute-class: gpu  # Specify GPU compute power.
              alibabacloud.com/compute-qos: default # Specify the ACS QoS class.
              alibabacloud.com/gpu-model-series: <example-model> # Specify the GPU model.
          spec:
            volumes:
              - name: llm-model
                persistentVolumeClaim:
                  ## If Fluid is used, specify the name of the Fluid dataset, such as deepseek.
                  claimName: llm-model
              - name: shm
                emptyDir:
                  medium: Memory
                  sizeLimit: 32Gi
            containers:
              - name: deepseek-r1-671b-worker
                image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/vllm:v0.7.2
                env:
                  - name: NCCL_SOCKET_IFNAME #Specify the NIC.
                    value: eth0
                command:
                  - sh
                  - -c
                  - "/vllm-workspace/ray_init.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
                resources:
                  limits:
                    nvidia.com/gpu: "8"
                    cpu: "64"
                    memory: 512G
                  requests:
                    nvidia.com/gpu: "8"
                    cpu: "64"
                    memory: 512G
                ports:
                  - containerPort: 8000
                volumeMounts:
                  - mountPath: /models/DeepSeek-R1
                    name: llm-model
                  - mountPath: /dev/shm
                    name: shm

    RDMA acceleration example

    If an open source base image, such as the vLLM image, is used, add the following environment variables to the YAML file.

    • NCCL_SOCKET_IFNAME: eth0
    • NCCL_IB_TC: 136
    • NCCL_IB_SL: 5
    • NCCL_IB_GID_INDEX: 3
    • NCCL_DEBUG: INFO
    • NCCL_IB_HCA: mlx5
    • NCCL_NET_PLUGIN: none

    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    metadata:
      name: deepseek-r1-671b-fp8-distribution
    spec:
      replicas: 1
      leaderWorkerTemplate:
        size: 2 # The total number of leader and worker pods.
        restartPolicy: RecreateGroupOnPodRestart
        leaderTemplate:
          metadata:
            labels: 
              role: leader
              alibabacloud.com/compute-class: gpu  # Specify GPU compute power.
              alibabacloud.com/compute-qos: default # Specify the ACS QoS class.
              alibabacloud.com/gpu-model-series: <example-model> # Specify the GPU model.
              # Run the application in a high-performance RDMA network. Submit a ticket to obtain the list of GPU models that support RDMA.
              alibabacloud.com/hpn-type: "rdma"
          spec:
            volumes:
              - name: llm-model
                persistentVolumeClaim:
                  ## If Fluid is used, specify the name of the Fluid dataset, such as deepseek.
                  claimName: llm-model
              - name: shm
                emptyDir:
                  medium: Memory
                  sizeLimit: 32Gi
            containers:
              - name: deepseek-r1-671b-leader
                image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/vllm:v0.7.2
                env:
                  - name: NCCL_SOCKET_IFNAME #Specify the NIC.
                    value: eth0
                  - name: NCCL_IB_TC
                    value: "136"
                  - name: NCCL_IB_SL
                    value: "5"
                  - name: NCCL_IB_GID_INDEX
                    value: "3"
                  - name: NCCL_DEBUG
                    value: "INFO"
                  - name: NCCL_IB_HCA
                    value: "mlx5"
                  - name: NCCL_NET_PLUGIN
                    value: "none"                
                command:
                  - sh
                  - -c
                  - "/vllm-workspace/ray_init.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE);vllm serve /models/DeepSeek-R1/ --port 8000 --trust-remote-code --served-model-name ds --max-model-len 2048 --gpu-memory-utilization 0.95 --tensor-parallel-size 8 --pipeline-parallel-size 2 --enforce-eager"
    #Set tensor-parallel-size to the number of GPUs provided by each leader or worker pod.
                resources:
                  limits:
                    nvidia.com/gpu: "8"
                    cpu: "64"
                    memory: 512G
                  requests:
                    nvidia.com/gpu: "8"
                    cpu: "64"
                    memory: 512G           
                ports:
                  - containerPort: 8000
                volumeMounts:
                  - mountPath: /models/DeepSeek-R1
                    name: llm-model
                  - mountPath: /dev/shm
                    name: shm
        workerTemplate:
          metadata:
            labels: 
              alibabacloud.com/compute-class: gpu  # Specify GPU compute power.
              alibabacloud.com/compute-qos: default # Specify the ACS QoS class.
              alibabacloud.com/gpu-model-series: <example-model> # Specify the GPU model.
              # Run the application in a high-performance RDMA network. Submit a ticket to obtain the list of GPU models that support RDMA.
              alibabacloud.com/hpn-type: "rdma"
          spec:
            volumes:
              - name: llm-model
                persistentVolumeClaim:
                  ## If Fluid is used, specify the name of the Fluid dataset, such as deepseek.
                  claimName: llm-model
              - name: shm
                emptyDir:
                  medium: Memory
                  sizeLimit: 32Gi
            containers:
              - name: deepseek-r1-671b-worker
                image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/vllm:v0.7.2
                env:
                  - name: NCCL_SOCKET_IFNAME #Specify the NIC.
                    value: eth0
                  - name: NCCL_IB_TC
                    value: "136"
                  - name: NCCL_IB_SL
                    value: "5"
                  - name: NCCL_IB_GID_INDEX
                    value: "3"
                  - name: NCCL_DEBUG
                    value: "INFO"
                  - name: NCCL_IB_HCA
                    value: "mlx5"
                  - name: NCCL_NET_PLUGIN
                    value: "none"      
                command:
                  - sh
                  - -c
                  - "/vllm-workspace/ray_init.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
                resources:
                  limits:
                    nvidia.com/gpu: "8"
                    cpu: "64"
                    memory: 512G
                  requests:
                    nvidia.com/gpu: "8"
                    cpu: "64"
                    memory: 512G
                ports:
                  - containerPort: 8000
                volumeMounts:
                  - mountPath: /models/DeepSeek-R1
                    name: llm-model
                  - mountPath: /dev/shm
                    name: shm
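
    After you save the LeaderWorkerSet to a file (the file name lws.yaml is only an example), create it and wait for the pods to start:

    kubectl apply -f lws.yaml
    kubectl get pods   # Both the leader pod and the worker pod must be in the Running state before you continue.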
  3. Create a Service to expose the inference service.

    apiVersion: v1
    kind: Service
    metadata:
      name: ds-leader
    spec:
      ports:
        - name: http
          port: 8000
          protocol: TCP
          targetPort: 8000
      selector:
        leaderworkerset.sigs.k8s.io/name: deepseek-r1-671b-fp8-distribution
        role: leader
      type: ClusterIP

Step 3: Verify the inference service

  1. Run kubectl port-forward to configure port forwarding between the local environment and the inference service.

    Note

    Port forwarding set up by using kubectl port-forward is not reliable, secure, or scalable, and it is intended only for development and debugging. Do not use this command in production environments. For more information about networking solutions used for production in ACK clusters, see Ingress management.

    kubectl port-forward svc/ds-leader 8000:8000

    Expected results:

    Forwarding from 127.0.0.1:8000 -> 8000
    Forwarding from [::1]:8000 -> 8000
  2. Send requests to the inference service.

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "ds",
        "messages": [
          {
            "role": "system", 
            "content": "You are a friendly AI assistant."
          },
          {
            "role": "user",
            "content": "Please introduce deep learning."
          }
        ],
        "max_tokens": 1024,
        "temperature": 0.7,
        "top_p": 0.9,
        "seed": 10
      }'

    Expected results:

    {"id":"chatcmpl-4bc78b66e2a4439f8362bd434a60be57","object":"chat.completion","created":1739501401,"model":"ds","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"OK. I have to think carefully about how to answer this question. First, I have to explain the definition of deep learning. It is a branch of machine learning, right? Then, I need to compare it with traditional machine learning methods to explain the advantages of deep learning, such as automatic feature extraction. It may be necessary to mention neural networks, especially the structure of deep neural networks, such as multiple hidden layers. \n\nNext, we should talk about the core components of deep learning, such as activation functions, loss functions, and optimizers. You may be unfamiliar with these terms, so I need to briefly explain how each component works. For example, ReLU functions as an activation function, Adam functions as an optimizer, and how the cross entropy loss function works. \n\nThen, in the application sector, computer vision and natural language processing are common. A few more examples may be needed, such as image recognition and machine translation, which are much easier to understand. In addition, users may be interested in use scenarios, such as the medical care and finance sectors. \n\nAs for popular frameworks, such as TensorFlow and PyTorch, they are used to make deep learning easier to implement. There is also the importance of hardware acceleration such as GPU, which explains why deep learning is developing so fast now. \n\nIn addition, it is possible to discuss the challenges and limitations of deep learning, such as data dependence, high computing resource requirements, and poor interpretability, so that users can fully understand the advantages and disadvantages. It may also be necessary to mention some future development directions, such as efficient training algorithms and interpretability research. \n\nThe user's need may be to understand the basic concepts of deep learning, and may have some technical background, but not too in-depth. They may want to quickly grasp the key points and applications, so the answer needs to be clearly structured, focused, not too in-depth technical details, but not too brief. Need to balance professionalism and understandability. \n\nAvoid using too many terms, or explain a little when using terms, such as "neural network", "convolutional neural network", etc. Perhaps the user is a student or a practitioner who has just entered the profession and needs to be understood by them in plain language. At the same time, give some examples to help them contact the practical application, so that it is easier to remember. \n\nWe should also pay attention to the relationship between deep learning and machine learning, and we need to clearly distinguish the difference between the two, which shows that deep learning belongs to machine learning, but at a deeper level and deal with more complex problems. The backpropagation algorithm may also be mentioned as one of the key techniques for training. \n\nFinally, make a summary, emphasizing the impact and potential of deep learning and the future development direction, so that users can have a comprehensive understanding. Check for missing important points, such as common model structures such as CNN and RNN, which may also need to be briefly mentioned to show the diversity. 
\n\nSometimes the user may be interested in the principle, but it is important to keep it simple here, focusing on the overview rather than the in-depth technical details. Ensure that the logic of the answer is smooth, starting from the definition, to the core components, applications, frameworks, challenges, and future directions. This structure is more reasonable. \n</think>\n\nDeep learning is a branch of machine learning that aims to simulate the learning mechanism of the human brain by building multi-layered neural networks (called "deep" networks). It is widely used in Image Recognition, speech processing, Natural Language Processing and other fields by automatically learning complex features and patterns from large amounts of data. \n\n### Core Concepts \n1. **Artificial Neural Network (ANN)**:\n   -Consists of an input layer, multiple hidden layers, and an output layer, each containing multiple neurons. \n   - Information processing is achieved by simulating the activation and transmission of neurons. \n\n2. **Automatic feature extraction**:\n   - Traditional machine learning relies on manual design features, while deep learning automatically extracts abstract features (such as from pixels to edges and shapes of objects) of data through multi-layer networks. \n\n3. **Key components**:\n   - **Activation functions** (such as ReLU and Sigmoid): introduce nonlinearity to enhance model expression. \n   - **Loss function** (e. g. cross entropy, mean square error): measures the difference between the predicted value and the true value. \n   - **Optimizer** (such as SGD and Adam): optimizes network parameters through back propagation to minimize losses. \n\n---\n\n### Typical model\n- **Convolutional neural network (CNN)**:  \n  Designed for images, spatial features are extracted through convolution kernels. Classic models such as ResNet and VGG. \n- **Recurrent Neural Network (RNN)**:  \n  Processing sequence data (text, speech), introducing the memory mechanism, improved versions such as LSTM, GRU. \n- **Transformer**:  \n  Based on the self-attention mechanism, the performance of Natural Language Processing (such as BERT and GPT series) is greatly improved. \n\n---\n\n### Application scenarios\n- **Computer vision**: Face Recognition, medical image analysis (such as lung CT lesion detection). \n- **Natural Language Processing**: Intelligent customer service, document summary generation, and translation (such as DeepL). \n- **Voice technology**: voice assistant (such as Siri), real-time subtitle generation. \n- **reinforcement learning**: game AI (AlphaGo), robot control. \n\n---\n\n ### Advantages and Challenges \n- **Advantages**:\n -Automatically learn complex features and reduce manual intervention. \n -It performs far better than traditional methods in big data and high computing power. \n- **Challenge**:\n  -Relies on a large amount of labeled data (for example, tens of thousands of labeled medical images). \n - The model training cost is high (for example, the GPT-3 training cost exceeds tens of millions of dollars). \n  - The "black box" characteristic leads to poor interpretability and limited application in high-risk areas such as medical care. \n\n---\n\n ### Tools and trends \n- **Mainstream frameworks**:TensorFlow (industrial deployment-friendly) and PyTorch (preferred for research). \n- **Research Direction**:\n -Lightweight model (such as MobileNet for mobile devices). \n -Self-supervised learning (reduces dependence on labeled data). 
\n -interpretability enhancement (such as visualizing the model decision basis). \n\nDeep learning is pushing the boundaries of artificial intelligence, from generative AI (such as Stable Diffusion to generate images) to autonomous driving, continuously changing the technology ecology. Its future development may achieve breakthroughs in reducing computing costs, improving efficiency and interpretability.","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":17,"total_tokens":1131,"completion_tokens":1114,"prompt_tokens_details":null},"prompt_logprobs":null}

    Model parameters

    This section only describes the parameters used to verify the inference service. For other parameters, see DeepSeek API reference.

    • model: The model name used for the request. In this example, ds.
    • messages: The message list. role specifies the role that initiated the message, and content specifies the content of the message.
    • max_tokens: The maximum number of completion tokens that the model generates for a request. Valid values: an integer from 1 to 8192. If this parameter is not specified, the default value 4096 is used. Example: 1024.
    • temperature: The sampling temperature. A larger value, such as 1, produces more random output. A smaller value, such as 0.2, produces more focused and deterministic output. Valid values: 0 to 2. Example: 0.7.
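
    The inference service exposes an OpenAI-compatible API provided by vLLM, so you can also request streaming output. The following request is a sketch that assumes the stream parameter of the OpenAI-compatible chat completions API; it is not part of the original procedure.

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "ds",
        "messages": [
          {"role": "user", "content": "Please introduce deep learning."}
        ],
        "max_tokens": 256,
        "stream": true
      }'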

References