This article introduces a simple, lightweight serverless solution based on Serverless Workflow and Function Compute for three common scenarios.
Background
- Replication of massive OSS files (within a bucket or across buckets) with the storage type changed from Standard to Archive to reduce costs.
- Concurrent restoration of archived OSS files so that applications can use the backed-up archive files.
- Event-driven decompression of oversized packages. In this scenario, GB-level packages and packages containing more than 100,000 files are automatically decompressed to a new OSS path after they are uploaded.
These scenarios share the following challenges:
- Long total processing time: Even with highly concurrent access to OSS, processing hundreds of millions of OSS files takes days or longer.
- Handling exceptions in a large number of remote calls: OSS APIs generally operate on a single file, so processing millions to tens of millions of files requires an equal number of remote calls, and in a distributed system some of these calls will inevitably fail and must be handled.
- State persistence: A checkpoint-like mechanism is required so that a partial failure does not force the job to reprocess everything, which saves overall processing time. For example, if the first 10 million keys have already been processed, they can be skipped when the batch job resumes.
This article will introduce a serverless best practice based on Serverless Workflow and Function Compute (FC) to address the preceding three scenarios.
Replicate and Archive Massive OSS Files
At first glance, a simple list-and-copy program seems sufficient to back up OSS files, but many considerations are involved. How can the operation recover automatically (for high availability) if the machine running the program stops or the process exits unexpectedly? After recovery, how can the program quickly know which files have already been processed, without manually maintaining that state in a database? How are active and standby processes coordinated, without manually implementing service discovery? How can the replication time be reduced, and how do we balance infrastructure maintenance cost, economic cost, and reliability, without manually implementing parallel calls and their management? With hundreds of millions of OSS objects, a simple single-threaded list-and-copy program cannot meet these requirements.
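To make this concrete, a single-threaded list-and-copy program might look like the minimal sketch below, written with the oss2 Python SDK; the endpoint, bucket names, and credentials are placeholder assumptions. Nothing in it handles recovery, checkpointing, coordination, or concurrency.

```python
# Minimal single-threaded list-and-copy sketch (oss2 Python SDK).
# Endpoint, bucket names, and credentials are placeholders.
import oss2

auth = oss2.Auth('<access_key_id>', '<access_key_secret>')
endpoint = 'https://oss-cn-hangzhou.aliyuncs.com'
src = oss2.Bucket(auth, endpoint, 'source-bucket')
dst = oss2.Bucket(auth, endpoint, 'backup-bucket')

# List every object under the prefix and copy it one by one,
# switching the storage class to Archive to reduce costs.
for obj in oss2.ObjectIterator(src, prefix='data/'):
    dst.copy_object('source-bucket', obj.key, obj.key,
                    headers={'x-oss-storage-class': 'Archive'})
```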

The list of objects to be replicated is recorded in an index file, and the index file is processed page by page, which requires a loop similar to while hasMore {} to ensure the index file is fully processed. Serverless Workflow adopts the following implementation logic:
- copy_files task step: Read a batch of size entries starting from the offset position of the input index file, extract the files to be processed, and call the OSS CopyObject operation for them through FC (a sketch of such a function appears after the lists below).
- has_more_files choice step: After a batch of files is processed, a conditional comparison checks whether the current index file has been fully processed. If it has, proceed to the succeed step; if not, pass the (offset, size) of the next page back to copy_files to continue the loop.
- start_sub_flow_execution task step: Because a single workflow execution is limited by the number of history events, this step checks the event ID of the current execution. If the number of events exceeds a threshold, an identical sub-flow is started, and the current flow continues after the sub-flow ends. A sub-flow can in turn start its own sub-flow, which ensures that the entire process can be completed regardless of the number of OSS files.

Using the workflow for batch processing guarantees the following:
- Almost arbitrarily long processing time for a single request over any number of files: A workflow execution can run for up to one year.
- Free of maintenance and operations, with no need to implement high availability yourself: Serverless Workflow and FC are highly available serverless cloud services.
- No need to implement checkpoints or state maintenance: If the process fails for any reason, it can be resumed from the last successful offset, without using any database or queue.
- Retry-upon-failure configuration: Most transient remote-call errors can be handled by configuring exponential-backoff retries.
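For illustration, a copy_files-style function could be sketched in Python with the oss2 SDK as follows. The event fields (bucket, index_key, offset, size), the index-file format (one object key per line), and the endpoint are assumptions rather than the exact implementation of this practice.

```python
# Hypothetical copy_files-style FC handler: copies one page of keys from an
# index file and reports progress so the has_more_files choice step can decide
# whether to loop. Event fields, index format, and endpoint are assumptions.
import json
import oss2

ENDPOINT = 'https://oss-cn-hangzhou-internal.aliyuncs.com'  # assumed region


def handler(event, context):
    # Workflow input, e.g. {"bucket": "...", "index_key": "...", "offset": 0, "size": 1000}
    evt = json.loads(event)
    offset, size = evt['offset'], evt['size']

    # Use the temporary credentials injected by Function Compute.
    creds = context.credentials
    auth = oss2.StsAuth(creds.access_key_id, creds.access_key_secret,
                        creds.security_token)
    bucket = oss2.Bucket(auth, ENDPOINT, evt['bucket'])

    # Read the index file (one key per line) and take the current page.
    # A real implementation could use a ranged read for very large index files.
    keys = bucket.get_object(evt['index_key']).read().decode().splitlines()
    page = keys[offset:offset + size]

    # Copy each object in place, changing its storage class to Archive.
    for key in page:
        bucket.copy_object(evt['bucket'], key, key,
                           headers={'x-oss-storage-class': 'Archive'})

    # The has_more_files choice step evaluates these output fields.
    return {'offset': offset + len(page), 'size': size,
            'has_more': offset + len(page) < len(keys)}
```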
Restore OSS Files at High Concurrency and in Batches
Compared with replication, restoring archived OSS files in batches has two additional characteristics:
- Unlike CopyObject, the RestoreObject operation is asynchronous: after it is triggered, you must poll the object's restore status until restoration completes before the file can be used.
- Restoring a single object takes minutes, and the duration varies with the object size. Higher concurrency across the whole process is therefore needed to restore all files within the required time.

Implementing this process with Serverless Workflow and FC provides the following benefits (a restore-and-poll sketch follows this list):
- Objects are restored at high concurrency, reducing the overall restoration duration.
- Status-based polling ensures that all objects are restored at the end of the process.
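For reference, the restore-and-poll pattern might be sketched with the oss2 SDK as follows. The keys, endpoint, credentials, and polling interval are assumptions; in the workflow the two phases would typically be separate task steps, with the polling loop driven by the flow rather than by sleeping inside one function.

```python
# Hypothetical restore-and-poll sketch (oss2 Python SDK). Keys, endpoint, and
# credentials are placeholders; error handling and retries are omitted.
import time
import oss2

auth = oss2.Auth('<access_key_id>', '<access_key_secret>')
bucket = oss2.Bucket(auth, 'https://oss-cn-hangzhou.aliyuncs.com', 'backup-bucket')

# One page of archived object keys, e.g. taken from an index file.
keys = ['archive/part-0001', 'archive/part-0002']

# Phase 1: trigger asynchronous restoration for the whole batch at once.
for key in keys:
    bucket.restore_object(key)

# Phase 2: poll the x-oss-restore header until every object is readable.
pending = set(keys)
while pending:
    for key in list(pending):
        status = bucket.head_object(key).headers.get('x-oss-restore', '')
        if 'ongoing-request="false"' in status:
            pending.discard(key)
    if pending:
        time.sleep(30)
```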
Decompress Large OSS Files upon Event Triggering
A common approach is to let the OSS event directly trigger a single FC function that decompresses the package, but this has several limitations (a batched, range-based alternative is sketched after this list):
- 10-minute execution time limit for a single function: Decompressing GB-level packages, or packages that contain a large number of small files, is prone to failure due to execution timeout.
- Low fault tolerance: OSS invokes FC asynchronously, and OSS access inside the function may occasionally fail. A failed asynchronous invocation is retried at most three times; after that, the message is discarded and the decompression fails.
- Insufficient flexibility: Many users want follow-up actions after decompression, such as sending notifications to a message service, sending SMS messages, and deleting the original package. It is difficult for a single function to meet all of these demands.
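To illustrate the general idea, below is a heavily simplified, hypothetical sketch of how a function could decompress one slice of a large zip package per invocation by reading the archive through OSS ranged GETs, so that a workflow can loop over entry ranges (and retry failed batches) much as it loops over index pages in the replication scenario. The wrapper class, helper names, and the zip-only assumption are not taken from this practice.

```python
# Hypothetical sketch: decompress a slice of entries from a large zip package
# in OSS without downloading the whole archive. The seekable wrapper issues
# ranged GET requests; names and parameters are assumptions, and a production
# version would add buffering, error handling, and streaming of large entries.
import io
import zipfile
import oss2


class OSSRangeFile(io.RawIOBase):
    """Minimal read-only, seekable file backed by OSS ranged GET requests."""

    def __init__(self, bucket, key):
        self._bucket = bucket
        self._key = key
        self._size = bucket.head_object(key).content_length
        self._pos = 0

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self._pos = offset
        elif whence == io.SEEK_CUR:
            self._pos += offset
        else:  # io.SEEK_END
            self._pos = self._size + offset
        return self._pos

    def tell(self):
        return self._pos

    def readable(self):
        return True

    def seekable(self):
        return True

    def read(self, n=-1):
        if n < 0 or self._pos + n > self._size:
            n = self._size - self._pos
        if n <= 0:
            return b''
        data = self._bucket.get_object(
            self._key, byte_range=(self._pos, self._pos + n - 1)).read()
        self._pos += len(data)
        return data


def decompress_range(bucket, package_key, target_prefix, start, count):
    """Extract zip entries [start, start + count) to target_prefix in OSS.

    Returns True if more entries remain, so a choice step can keep looping.
    """
    archive = zipfile.ZipFile(OSSRangeFile(bucket, package_key))
    names = archive.namelist()
    for name in names[start:start + count]:
        if name.endswith('/'):
            continue  # skip directory entries
        # archive.read loads the entry into memory, which suits the
        # many-small-files case; large entries would need streaming instead.
        bucket.put_object(target_prefix + name, archive.read(name))
    return start + count < len(names)
```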



Takeaways
- Long-running processes for up to one year without interruption
- State maintenance unaffected by system failover
- Improved tolerance of transient errors
- Highly flexible customization
Batch processing of massive numbers of OSS files involves more than the three scenarios covered in this article. We look forward to discussing more scenarios and requirements with you in the future.