This topic provides answers to some frequently asked questions about deploying or fine-tuning a trained model in Model Gallery.
How do I identify the root cause of a training job failure?
A training job may fail due to various causes, such as invalid dataset formats. You can use the following methods to troubleshoot the failure:
View the task diagnosis: On the Job Management page, click the Training jobs tab, and then click the name of the job that you want to view. On the Task details tab, move the pointer over Failure to view the error message.
View the task logs: On the Job Management page, click the Training jobs tab, and then click the name of the job that you want to view. On the Task log tab, view the error messages.
The following table describes the error messages and the corresponding solutions.
| Error type | Error message | Solution |
| --- | --- | --- |
| Input and output-related errors | ValueError: output channel ${your OSS uri} must be directory | Check whether the output path of the training job is a folder. The output path must be a folder. |
| Input and output-related errors | ValueError: train must be a file | Check whether the selected input path is a file. The input path must be a file. |
| Input and output-related errors | FileNotFoundError | Check whether the selected input path contains a file that meets the requirements. |
| Input and output-related errors | JSONDecodeError | Check whether the format of the input JSON file is correct. For a local validation sketch, see the example after this table. |
| Input and output-related errors | ValueError: Input data must be a json file or a jsonl file! | Check whether the input file is valid. Only JSON and JSONL files are supported. |
| Input and output-related errors | KeyError: ${some key name} | In most cases, this error occurs when the training dataset file is in the JSON format. Check whether the key-value pairs in the training dataset file meet the model requirements. For information about the requirements, see the model description page. For a local validation sketch, see the example after this table. |
| Input and output-related errors | ValueError: Unrecognized model in /ml/input/data/model/. | PyTorch cannot recognize the provided model files. |
| Input and output-related errors | UnicodeDecodeError | Check whether the encoding format of the input file is valid. |
| Input and output-related errors | Input/output error | Check the read permission of the input path and the read and write permissions of the output path. |
| Input and output-related errors | NotADirectoryError: [Errno 20] Not a directory: | Check whether the input and output paths are folders. |
| Hyperparameter-related errors | ERROR: torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 51) of binary: /usr/bin/python (no subprocess-related logs) | The memory of the current instance is insufficient, and an out-of-memory (OOM) error occurs when the model is loaded. Select an instance type that has a larger memory size. |
| Hyperparameter-related errors | torch.cuda.OutOfMemoryError: CUDA out of memory | The GPU memory of the current instance is insufficient. Select a GPU-accelerated instance type that has a larger GPU memory size, or adjust hyperparameters such as the LoRA dimension or the batch size to reduce the required GPU memory. |
| Hyperparameter-related errors | ValueError: No closing quotation | The algorithm fails to generate a training command because an unpaired quotation mark (") appears in the system prompt or other parameters. Delete the unpaired quotation mark or complete the pair of quotation marks. |
| Resource configuration-related errors | Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run | The model parameters are in the BF16 format. We recommend that you use GPUs of the Ampere architecture or later, such as A10 or A100, for training. On GPUs of architectures earlier than Ampere, the parameters are converted to the FP16 format, which can cause this error. |
| Resource configuration-related errors | RuntimeError: CUDA error: uncorrectable ECC error encountered | The selected instance type has a hardware error. Try another instance type or another region. |
| Resource configuration-related errors | MemoryError: WARNING Insufficient free disk space | The memory of the selected instance type is insufficient. Select an instance type that has a larger memory size. |
| Limit-related errors | failed to compose dlc job specs, resource limiting triggered, you are trying to use more GPU resources than the threshold | A training job can use at most 2 GPUs at the same time. If this limit is exceeded, resource limiting is triggered. In this case, wait for the running training job to complete before you start a new one, or submit a ticket to increase the quota. |
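If a job fails with JSONDecodeError, KeyError, or UnicodeDecodeError, you can check the training dataset locally before you upload it again. The following is a minimal sketch, not part of the product: the file name train.jsonl and the required keys instruction and output are assumptions for illustration, so replace them with the path and keys that the model description page of your model actually requires.

```python
import json

# Minimal sketch: validate a JSONL training file locally before uploading it.
# DATASET_PATH and REQUIRED_KEYS are assumptions for illustration; use the
# keys that the model description page of your model requires.
DATASET_PATH = "train.jsonl"
REQUIRED_KEYS = {"instruction", "output"}

def validate_jsonl(path, required_keys):
    with open(path, encoding="utf-8") as f:  # a wrong encoding raises UnicodeDecodeError here
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue
            try:
                record = json.loads(line)  # a malformed line raises json.JSONDecodeError
            except json.JSONDecodeError as err:
                print(f"Line {line_number}: invalid JSON ({err})")
                continue
            if not isinstance(record, dict):
                print(f"Line {line_number}: expected a JSON object, got {type(record).__name__}")
                continue
            missing = required_keys - record.keys()  # missing keys cause KeyError during training
            if missing:
                print(f"Line {line_number}: missing keys {sorted(missing)}")

if __name__ == "__main__":
    validate_jsonl(DATASET_PATH, REQUIRED_KEYS)
```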
How do I debug a deployed model service online?
After you deploy a model, click Model Gallery in the Platform for AI (PAI) console. On the Model Gallery page, click Job Management. On the Job Management page, click the Deployment Jobs tab and find the deployed model service.
On the Elastic Algorithm Service (EAS) page, find the model service that you located in the previous step and click Online Debugging in the Actions column.
Configure request parameters for online debugging.
On the Model Gallery page, search for the desired model and click it. On the Overview tab of the model details page, view the calling method for the specific deployment method. For example, if you deploy the DeepSeek-R1-Distill-Qwen-7B model service by using the BladeLLM deployment method, send HTTP POST requests to /v1/chat/completions. The Overview tab also provides a request example.
Configure and send the request.
Append /v1/chat/completions to the request URL and configure the request body based on the preceding request example.
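If you prefer to send the same request from your own code instead of the console's online debugging page, the following sketch shows an equivalent HTTP call. The endpoint, token, and model name are placeholders, and the request body assumes the chat completions format shown in the request example on the Overview tab; adjust it to match the example for your deployment method.

```python
import requests

# Placeholders: replace with the access address and token of your EAS service.
# The request body assumes the chat completions format shown in the request
# example on the model's Overview tab; adjust it to match your deployment.
EAS_ENDPOINT = "http://<your-service-endpoint>"  # without a trailing slash
EAS_TOKEN = "<your-service-token>"

response = requests.post(
    f"{EAS_ENDPOINT}/v1/chat/completions",
    headers={
        "Authorization": EAS_TOKEN,  # EAS services typically expect the service token here
        "Content-Type": "application/json",
    },
    json={
        "model": "DeepSeek-R1-Distill-Qwen-7B",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=60,
)
print(response.status_code)
print(response.text)
```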