
Platform for AI: FAQ about Model Gallery

Last Updated: Mar 05, 2025

This topic provides answers to some frequently asked questions about fine-tuning and deploying models in Model Gallery.

How do I identify the root cause of a training job failure?

A training job may fail due to various causes, such as invalid dataset formats. You can use the following methods to troubleshoot the failure:

  • To view the task diagnosis, perform the following steps: On the Job Management page, click Training jobs and then click the name of the job that you want to view. On the Task details tab, move the pointer over Failure to view the error message.


  • To view task logs, perform the following steps: On the Job Management page, click Training jobs and then click the name of the job that you want to view. On the Task log tab, view the error messages.


    The following table describes the error messages and the corresponding solutions. A sample script that checks the dataset format in advance is provided after the table.

    | Error type | Error message | Solution |
    | --- | --- | --- |
    | Input and output-related errors | ValueError: output channel ${your OSS uri} must be directory | Check whether the training output path is a folder. The output path must be a folder. |
    | | ValueError: train must be a file | Check whether the selected input path is a file. The input path must be a file. |
    | | FileNotFoundError | Check whether the selected input path contains a file that meets the requirements. |
    | | JSONDecodeError | Check whether the format of the input JSON file is correct. |
    | | ValueError: Input data must be a json file or a jsonl file! | Check whether the input file is valid. Only JSON and JSONL files are supported. |
    | | KeyError: ${some key name} | In most cases, this error occurs when the training dataset file is in the JSON format. Check whether the key-value pairs in the training dataset file meet the model requirements. For information about the requirements, see the model description page. |
    | | ValueError: Unrecognized model in /ml/input/data/model/. | PyTorch cannot recognize the provided model files. |
    | | UnicodeDecodeError | Check whether the encoding format of the input file is valid. |
    | | Input/output error | Check the read permission on the input path and the read and write permissions on the output path. |
    | | NotADirectoryError: [Errno 20] Not a directory: | Check whether the input or output path is a folder. |
    | Hyperparameter-related errors | ERROR: torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 51) of binary: /usr/bin/python (no subprocess-related logs) | The memory of the current instance is insufficient, and an out-of-memory (OOM) error occurs when the model is loaded. Select an instance type that has more memory. |
    | | torch.cuda.OutOfMemoryError: CUDA out of memory | The GPU memory of the current instance is insufficient. Select a GPU-accelerated instance type that has more GPU memory, or reduce the required GPU memory by adjusting hyperparameters such as lora_dim or the batch size. |
    | | ValueError: No closing quotation | The algorithm fails to generate a training command because an unpaired quotation mark (") appears in the system prompt or another parameter. Delete the quotation mark or add the matching closing quotation mark. |
    | Resource configuration-related errors | Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run | The model parameters are in the BF16 format. We recommend that you train the model on GPUs of the Ampere architecture or later, such as A10 or A100. On GPUs of earlier architectures, the parameters are converted to the FP16 format, which can cause this error. |
    | | RuntimeError: CUDA error: uncorrectable ECC error encountered | The selected instance type encountered a hardware error. Try another instance type or another region. |
    | | MemoryError: WARNING Insufficient free disk space | The selected instance type does not have enough memory. Select an instance type that has more memory. |
    | Limit-related errors | failed to compose dlc job specs, resource limiting triggered, you are trying to use more GPU resources than the threshold | Training jobs can use at most two GPUs at the same time. If this limit is exceeded, resource limiting is triggered. Wait for a running training job to complete before you start a new one, or submit a ticket to increase the quota. |
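Many of the input-related errors in the preceding table can be caught before you submit a training job by checking the dataset locally. The following sketch assumes a JSONL dataset and uses example key names; the keys that your model actually requires are listed on its description page in Model Gallery.

```python
import json
import os

# Hypothetical local copy of the training dataset; replace with your own file.
DATASET_PATH = "train.jsonl"
# Example key names only; check the model description page for the real requirements.
REQUIRED_KEYS = {"instruction", "output"}

def validate_jsonl_dataset(path: str) -> None:
    """Run basic format checks that mirror the errors listed in the table above."""
    if not os.path.isfile(path):
        # The training input must point to a file, not a folder.
        raise FileNotFoundError(f"{path} is not a file")
    if not path.endswith((".json", ".jsonl")):
        # Only JSON and JSONL files are supported.
        raise ValueError("Input data must be a json file or a jsonl file!")
    with open(path, encoding="utf-8") as f:  # a wrong encoding raises UnicodeDecodeError
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue
            record = json.loads(line)  # a malformed line raises JSONDecodeError
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                # Missing keys surface as KeyError during training.
                raise KeyError(f"line {line_no} is missing keys: {sorted(missing)}")

validate_jsonl_dataset(DATASET_PATH)
print("Dataset passed the basic format checks.")
```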

How do I debug a deployed model service online?

  1. After you deploy a model, click Model Gallery in the Platform for AI (PAI) console. On the Model Gallery page, click Job Management. On the Job Management page, click the Deployment Jobs tab and find the deployed model service.


  2. On the Elastic Algorithm Service (EAS) page, find the model service that you found in Step 1 and click Online Debugging in the Actions column.


  3. Configure request parameters for online debugging.

    1. On the Model Gallery page, search for the model that you deployed and click its name. On the Overview tab of the model details page, view the calling method for the deployment method that you used. For example, if you deploy the DeepSeek-R1-Distill-Qwen-7B model service by using the BladeLLM deployment method, send HTTP POST requests to /v1/chat/completions. The Overview tab also provides a request example.


    2. Configure and send the request.

      Append /v1/chat/completions to the request URL and configure the request body based on the request example from the previous step. A sample programmatic request is also provided after these steps.

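The same request can also be sent programmatically instead of through the Online Debugging page. The following sketch assumes an OpenAI-style chat completions request body, which services deployed with the BladeLLM deployment method typically accept; the endpoint, token, and field values are placeholders that you must replace with the values from your own EAS service and from the request example on the model's Overview tab.

```python
import requests

# Placeholders: copy the actual endpoint and token from your EAS service details page.
EAS_ENDPOINT = "http://<your-service-endpoint>"
EAS_TOKEN = "<your-service-token>"

# Append /v1/chat/completions to the service endpoint, as described in the steps above.
url = f"{EAS_ENDPOINT}/v1/chat/completions"
headers = {
    "Authorization": EAS_TOKEN,
    "Content-Type": "application/json",
}
# Assumed OpenAI-style request body; adjust the fields to match the request
# example shown on the model's Overview tab.
payload = {
    "model": "DeepSeek-R1-Distill-Qwen-7B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 128,
}

response = requests.post(url, headers=headers, json=payload, timeout=60)
response.raise_for_status()
print(response.json())
```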