This topic answers common questions about small file optimization and job diagnostics in MaxCompute.
In which scenarios are small files generated? How do I resolve the issue?
Scenarios
MaxCompute uses Apsara Distributed File System (Pangu) for block storage. Files smaller than the default block size of 64 MB are considered small files.
Small files are typically generated in the following scenarios:
Reduce stage: A large number of small files are produced during the Reduce stage of a job.
Tunnel-based data upload: Small files accumulate when data is uploaded in small batches through Tunnel.
Temporary and expired files: Job execution generates temporary files, and expired files may remain in the recycle bin. These include the following types:
TABLE_BACKUP: Tables retained in the recycle bin past their retention period.
FUXI_JOB_TMP: Temporary directories generated during job execution.
TMP_TABLE: Temporary tables generated during job execution.
INSTANCE: Logs retained in metadata tables during job execution.
LIFECYCLE: Tables or partitions that have reached their lifecycle limit.
INSTANCEPROFILE: Profile data generated after job submission and execution.
VOLUME_TMP: Data without metadata that has a mapped directory on Pangu.
TEMPRESOURCE: One-time temporary resource files used by user-defined functions (UDFs).
FAILOVER: Temporary files retained during system failovers.
To check the number of small files in a table, run the following command:
desc extended <table_name>;
Impact
A large number of small files can cause the following issues:
Resource waste: By default, each small file maps to one map instance. A large number of small files leads to excessive instance creation and degrades overall execution performance.
Storage system pressure: A large number of small files increases the load on Pangu and reduces storage efficiency. In extreme cases, Pangu may become unavailable.
Solutions
Choose a solution based on how the small files were generated:
Small files from the Reduce stage
Run an INSERT OVERWRITE statement on the source table or partition, or write data to a new table and then delete the source table.
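As a minimal sketch, rewriting a table with its own data forces a full rewrite that consolidates the small Reduce-stage output files into fewer, larger files; the table name below is hypothetical:

```sql
-- Hypothetical non-partitioned table; rewriting it with its own data
-- consolidates many small files into fewer large ones.
INSERT OVERWRITE TABLE sale_detail
SELECT * FROM sale_detail;
```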
Small files from Tunnel-based data upload
When using the Tunnel SDK, upload data when the file cache reaches 64 MB.
When using the MaxCompute client (odpscmd), avoid uploading small files frequently. Upload data in bulk when a sufficient number of files have accumulated.
For partitioned tables, configure lifecycle rules for partitions to automatically clear expired data.
Run an INSERT OVERWRITE statement on the source table or partition.
Run the following command to merge small files:
ALTER TABLE <table_name> [PARTITION (<pt_spec>)] MERGE SMALLFILES;
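For example, for a hypothetical partitioned table sale_detail, the checks and fixes above can be combined as follows (the table name, partition values, and lifecycle value are illustrative):

```sql
-- Check the extended table information, which includes the file count.
desc extended sale_detail;

-- Merge the small files in one partition.
ALTER TABLE sale_detail PARTITION (sale_date='201312', region='hangzhou') MERGE SMALLFILES;

-- Optionally set a lifecycle (in days) so expired partitions are cleared automatically.
ALTER TABLE sale_detail SET LIFECYCLE 90;
```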
What do I do if a concurrent insert operation returns the ODPS-0110999 error?
Problem
When you run concurrent insert operations on a table, the following error is returned:
ODPS-0110999: Critical! Internal error happened in commit operation and rollback failed, possible breach of atomicity - Rename directory failed during DDLTask.
Cause
MaxCompute does not support concurrency control. When multiple jobs modify the same table simultaneously, a low-probability concurrency conflict can occur in the metadata (META) module. This error can also occur when ALTER and INSERT operations run on the same table at the same time.
Solution
Convert the table to a partitioned table so that each SQL statement inserts data into a separate partition. This allows concurrent operations without conflicts.
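A minimal sketch, assuming hypothetical source and target tables; because each concurrent job writes to its own partition, the jobs do not modify the same metadata:

```sql
-- Hypothetical partitioned target table.
CREATE TABLE IF NOT EXISTS ods_log (content STRING) PARTITIONED BY (dt STRING);

-- Job A writes only to the dt='20240101' partition.
INSERT INTO TABLE ods_log PARTITION (dt='20240101')
SELECT content FROM src_log_a;

-- Job B runs concurrently but writes to a different partition, so no conflict occurs.
INSERT INTO TABLE ods_log PARTITION (dt='20240102')
SELECT content FROM src_log_b;
```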
What do I do if a job returns the ODPS-0130121 error?
Problem
When a job runs, the following error is returned:
FAILED:ODPS-0130121:Invalid argument type - line 1:7 'testfunc':in function class
Cause
The data types of the input parameters passed to a built-in function are invalid.
Solution
Check the data types of the input parameters and make sure they match the requirements of the built-in function. For more information about built-in functions and their parameter types, see the MaxCompute SQL reference documentation.
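As a hypothetical example (the table and column names are illustrative), an explicit CAST aligns the argument with the type the built-in function expects:

```sql
-- Suppose the id column is STRING, but a numeric built-in function is applied to it.
-- Casting the argument explicitly to BIGINT makes the argument type valid.
SELECT ABS(CAST(id AS BIGINT)) FROM sale_detail;
```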
Why does a task show as suspended in DataWorks Operation Center?
Check whether the task is started based on the project configuration:
Task is started: Check whether an ancestor task has failed. A failed ancestor task blocks downstream tasks.
Task is not started: Right-click the worker node and verify that the node is running properly. If necessary, rename the task and reconfigure its scheduling properties.
Why does DataWorks Data Integration keep prompting me to check the Order field?
This message indicates that the Order field in the database may have been deleted.
To resolve this issue:
Check whether the Order field exists in the source database.
Clear the cache.
Reconfigure or re-create the synchronization task.
Verify the task status.
What do I do if odpscmd -f fails without returning an error message?
Problem
When you run the odpscmd -f command to execute an SQL file in crontab, the execution fails but no error message is returned. However, running the same command directly in a shell session works correctly.
Cause
In crontab, the output and error streams are not captured by default, so error messages are lost.
Solution
Redirect the output and error streams to a log file when running the command in crontab:
odpscmd -f xxx.sql >> /path/to/odpscmd.log 2>&1
When an issue occurs, check the log file for error details.
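As an illustrative crontab entry (the installation path, SQL file path, and log path below are all assumptions), the job runs nightly and appends both output streams to a log file:

```shell
# Run the SQL file at 02:00 every day; append stdout and stderr to a log.
# The odpscmd path, SQL file, and log path are hypothetical.
0 2 * * * /opt/odpscmd/bin/odpscmd -f /home/admin/job.sql >> /home/admin/logs/odpscmd.log 2>&1
```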
Why are DataWorks synchronization tasks stuck in the waiting state?
If synchronization tasks are in the waiting state while using shared scheduling resources, optimize batch synchronization tasks to maximize the synchronization speed.
You can also add dedicated scheduling resources. For more information, see Create and use a custom resource group for Data Integration.
Why does a scheduling resource server always show as stopped?
If a server added through the scheduling resource management feature shows as stopped (even after re-initialization), check the following:
Cloud product interconnection network: Verify that the machine name in the registration information matches the actual hostname of the server. Run the
hostnamecommand on the ECS instance to obtain the real hostname. Custom names are not supported.Virtual private cloud (VPC): Check whether the ECS hostname has been modified. Note that the ECS hostname is different from the instance name. If the hostname was modified, run
cat /etc/hostson the ECS instance to verify that the hostname binding is valid.
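Both checks above can be run directly on the ECS instance with standard Linux commands:

```shell
# Print the real hostname, which must match the registered machine name.
hostname

# Inspect the hostname-to-IP bindings to verify that the entry is valid.
cat /etc/hosts
```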