This topic answers common questions about small file optimization and job diagnostics in MaxCompute.
In which scenarios are small files generated? How do I resolve the issue?
Scenarios
MaxCompute uses Apsara Distributed File System (Pangu) for block storage. Files smaller than the default block size of 64 MB are considered small files.
Small files are typically generated in the following scenarios:
Reduce stage: A large number of small files are produced during the Reduce stage of a job.
Tunnel-based data upload: Small files accumulate when data is uploaded in small batches through Tunnel.
Temporary and expired files: Job execution generates temporary files, and expired files may remain in the recycle bin. These include the following types:
TABLE_BACKUP: Tables retained in the recycle bin past their retention period.
FUXI_JOB_TMP: Temporary directories generated during job execution.
TMP_TABLE: Temporary tables generated during job execution.
INSTANCE: Logs retained in metadata tables during job execution.
LIFECYCLE: Tables or partitions that have reached their lifecycle limit.
INSTANCEPROFILE: Profile data generated after job submission and execution.
VOLUME_TMP: Data without metadata that has a mapped directory on Pangu.
TEMPRESOURCE: One-time temporary resource files used by user-defined functions (UDFs).
FAILOVER: Temporary files retained during system failovers.
To check the number of small files in a table, run the following command:
desc extended <table_name>;
Impact
A large number of small files can cause the following issues:
Resource waste: By default, each small file maps to one map instance. A large number of small files leads to excessive instance creation and degrades overall execution performance.
Storage system pressure: A large number of small files increases the load on Pangu and reduces storage efficiency. In extreme cases, Pangu may become unavailable.
Solutions
Choose a solution based on how the small files were generated:
Small files from the Reduce stage
Run an INSERT OVERWRITE statement on the source table or partition, or write data to a new table and then delete the source table.
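As a minimal sketch, rewriting a table with its own data forces a full rewrite that consolidates the small Reduce-stage output files into fewer, larger files; the table name below is hypothetical:

```sql
-- Hypothetical non-partitioned table; rewriting it with its own data
-- consolidates many small files into fewer large ones.
INSERT OVERWRITE TABLE sale_detail
SELECT * FROM sale_detail;
```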
Small files from Tunnel-based data upload
When using the Tunnel SDK, upload data when the file cache reaches 64 MB.
When using the MaxCompute client (odpscmd), avoid uploading small files frequently. Upload data in bulk when a sufficient number of files have accumulated.
For partitioned tables, configure lifecycle rules for partitions to automatically clear expired data.
Run an INSERT OVERWRITE statement on the source table or partition.
Run the following command to merge small files:
ALTER TABLE <table_name> [PARTITION (<pt_spec>)] MERGE SMALLFILES;
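For example, for a hypothetical partitioned table sale_detail, the checks and fixes above can be combined as follows (the table name, partition values, and lifecycle value are illustrative):

```sql
-- Check the extended table information, which includes the file count.
desc extended sale_detail;

-- Merge the small files in one partition.
ALTER TABLE sale_detail PARTITION (sale_date='201312', region='hangzhou') MERGE SMALLFILES;

-- Optionally set a lifecycle (in days) so expired partitions are cleared automatically.
ALTER TABLE sale_detail SET LIFECYCLE 90;
```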
What do I do if a concurrent insert operation returns the ODPS-0110999 error?
Problem
When you run concurrent insert operations on a table, the following error is returned:
ODPS-0110999: Critical! Internal error happened in commit operation and rollback failed, possible breach of atomicity - Rename directory failed during DDLTask.
Cause
MaxCompute does not support concurrency control. When multiple jobs modify the same table simultaneously, a low-probability concurrency conflict can occur in the metadata (META) module. This error can also occur when ALTER and INSERT operations run on the same table at the same time.
Solution
Convert the table to a partitioned table so that each SQL statement inserts data into a separate partition. This allows concurrent operations without conflicts.
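A minimal sketch, assuming hypothetical source and target tables; because each concurrent job writes to its own partition, the jobs do not modify the same metadata:

```sql
-- Hypothetical partitioned target table.
CREATE TABLE IF NOT EXISTS ods_log (content STRING) PARTITIONED BY (dt STRING);

-- Job A writes only to the dt='20240101' partition.
INSERT INTO TABLE ods_log PARTITION (dt='20240101')
SELECT content FROM src_log_a;

-- Job B runs concurrently but writes to a different partition, so no conflict occurs.
INSERT INTO TABLE ods_log PARTITION (dt='20240102')
SELECT content FROM src_log_b;
```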
What do I do if a job returns the ODPS-0130121 error?
Problem
When a job runs, the following error is returned:
FAILED:ODPS-0130121:Invalid argument type - line 1:7 'testfunc':in function class
Cause
The data types of the input parameters passed to a built-in function are invalid.
Solution
Check the data types of the input parameters and make sure they match the requirements of the built-in function. For more information about built-in functions and their parameter types, see the MaxCompute SQL reference documentation.
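As a hypothetical example (the table and column names are illustrative), an explicit CAST aligns the argument with the type the built-in function expects:

```sql
-- Suppose the id column is STRING, but a numeric built-in function is applied to it.
-- Casting the argument explicitly to BIGINT makes the argument type valid.
SELECT ABS(CAST(id AS BIGINT)) FROM sale_detail;
```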
Why does a task show as suspended in DataWorks Operation Center?
Check whether the task is started based on the project configuration:
Task is started: Check whether an ancestor task has failed. A failed ancestor task blocks downstream tasks.
Task is not started: Right-click the worker node and verify that the node is running properly. If necessary, rename the task and reconfigure its scheduling properties.
Why does DataWorks Data Integration keep prompting me to check the Order field?
This message indicates that the Order field in the database may have been deleted.
To resolve this issue:
Check whether the Order field exists in the source database.
Clear the cache.
Reconfigure or re-create the synchronization task.
Verify the task status.
What do I do if odpscmd -f fails without returning an error message?
Problem
When you run the odpscmd -f command to execute an SQL file in crontab, the execution fails but no error message is returned. However, running the same command directly in a shell session works correctly.
Cause
In crontab, the output and error streams are not captured by default, so error messages are lost.
Solution
Redirect the output and error streams to a log file when running the command in crontab:
odpscmd -f xxx.sql >> /path/to/odpscmd.log 2>&1
When an issue occurs, check the log file for error details.
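As an illustrative crontab entry (the installation path, SQL file path, and log path below are all assumptions), the job runs nightly and appends both output streams to a log file:

```shell
# Run the SQL file at 02:00 every day; append stdout and stderr to a log.
# The odpscmd path, SQL file, and log path are hypothetical.
0 2 * * * /opt/odpscmd/bin/odpscmd -f /home/admin/job.sql >> /home/admin/logs/odpscmd.log 2>&1
```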
Why are DataWorks synchronization tasks stuck in the waiting state?
If synchronization tasks are in the waiting state while using shared scheduling resources, optimize batch synchronization tasks to maximize the synchronization speed.
You can also add dedicated scheduling resources. For more information, see Create and use a custom resource group for Data Integration.
Why does a scheduling resource server always show as stopped?
If a server added through the scheduling resource management feature shows as stopped (even after re-initialization), check the following:
Cloud product interconnection network: Verify that the machine name in the registration information matches the actual hostname of the server. Run the
hostnamecommand on the ECS instance to obtain the real hostname. Custom names are not supported.Virtual private cloud (VPC): Check whether the ECS hostname has been modified. Note that the ECS hostname is different from the instance name. If the hostname was modified, run
cat /etc/hostson the ECS instance to verify that the hostname binding is valid.
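Both checks above can be run directly on the ECS instance with standard Linux commands:

```shell
# Print the real hostname, which must match the registered machine name.
hostname

# Inspect the hostname-to-IP bindings to verify that the entry is valid.
cat /etc/hosts
```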