
Best Practices for Batch Processing Massive OSS Files through Serverless Workflow and Function Compute

Chang Shuai introduces a simple and lightweight serverless solution based on Serverless Workflow and Function Compute for three common scenarios.

By Chang Shuai

Background

Thanks to simple APIs and excellent scalability, Object Storage Service (OSS) allows applications in different scenarios to easily store several billion object files every day. Its simple key-value data access model greatly simplifies uploading and reading data. Beyond uploading and reading, a series of new application scenarios built around OSS keeps emerging. Here are some examples:

  • Replication of massive OSS files (within a bucket or across buckets) with the storage type changed from Standard to Archive to reduce costs.
  • Concurrent restoration of archived OSS files so that applications can use the backed-up archive files.
  • Event-driven decompression of oversized files: GB-level packages and packages containing more than 100,000 files are automatically decompressed to a new OSS path after they are uploaded.

The preceding three scenarios share some common challenges:

  1. Long total processing time: Even highly concurrent access to OSS takes days or more to process hundreds of millions of OSS files.
  2. Handling exceptions in a large number of remote calls: OSS APIs are generally designed to process a single file, so processing millions to tens of millions of files requires an equal number of remote calls. In a distributed system, some of these calls will inevitably fail and must be handled.
  3. State persistence: A checkpoint-like mechanism is required to avoid reprocessing work that has already been done when a run partially fails, which saves overall processing time. For example, if the first 10 million keys have already been processed, a restarted batch job should skip them.

This article will introduce a serverless best practice based on Serverless Workflow and Function Compute (FC) to address the preceding three scenarios.

Replicate and Archive Massive OSS Files

At first glance, a simple list-and-copy main program can back up OSS files, but many questions arise in practice. How do we automatically resume the operation (for high availability) when the machine running the main program stops or the process exits unexpectedly? After recovery, how do we quickly know which files have already been processed, for example, by manually maintaining the state in a database? How do we coordinate active and standby processes, for example, by manually implementing service discovery? How do we reduce the replication time, and how do we balance infrastructure maintenance cost, economic cost, and reliability, for example, by manually implementing parallel calls and their management? With hundreds of millions of OSS objects, a simple single-threaded list-and-copy main program cannot meet these requirements.
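For reference, the naive approach looks roughly like the following Python sketch based on the oss2 SDK (bucket names, endpoint, and credentials are placeholders). It works for small buckets but offers none of the resumability, high availability, or concurrency discussed above.

    import oss2  # Alibaba Cloud OSS SDK for Python

    # Placeholder credentials and names for illustration only.
    auth = oss2.Auth('<access-key-id>', '<access-key-secret>')
    src = oss2.Bucket(auth, 'https://oss-cn-hangzhou.aliyuncs.com', 'source-bucket')
    dst = oss2.Bucket(auth, 'https://oss-cn-hangzhou.aliyuncs.com', 'backup-bucket')

    # Naive single-threaded list-and-copy: list every object, then copy it
    # one by one. If this process dies halfway, nothing records how far it got.
    marker = ''
    while True:
        result = src.list_objects(marker=marker, max_keys=1000)
        for obj in result.object_list:
            dst.copy_object(src.bucket_name, obj.key, obj.key)
        if not result.is_truncated:
            break
        marker = result.next_marker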

For example, you need to copy hundreds of millions of OSS files from one bucket to another bucket in the same region and convert the storage type from Standard to Archive. In this oss-batch-copy example, we provide a workflow application template that backs up all the files listed in your index file by calling the OSS CopyObject operation in sequence. The index file contains the metadata of the OSS objects to be processed. For example:

(Figure: a sample index file listing the metadata of the OSS objects to be processed.)

The index for hundreds of millions of OSS files can be hundreds of GB in size. Therefore, we need to read the index file by byte range and process a portion of the OSS files at a time. This requires control logic similar to while hasMore {} to ensure the index file is fully processed. Serverless Workflow implements the following logic:

  1. copy_files task step: Reads size bytes starting at the offset position of the input index file, extracts the files to be processed, and calls the OSS CopyObject operation through FC (see the sketch after the figure below).
  2. has_more_files choice step: After a batch of files is processed, a conditional comparison checks whether the index file has been fully processed. If yes, the flow proceeds to the success step. If not, the (offset, size) of the next page is passed back to copy_files, forming a loop.
  3. start_sub_flow_execution task step: A single flow execution is limited in the number of history events it can record, so this step checks the event ID of the current execution. If the number of events exceeds a threshold, an identical new flow execution is started as a sub-flow, and the current execution continues after the sub-flow ends. A sub-flow can trigger its own sub-flow in turn, which ensures that the whole job completes regardless of the number of OSS files.

(Figure: the oss-batch-copy workflow looping through the copy_files, has_more_files, and start_sub_flow_execution steps.)
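To make the loop concrete, here is a minimal Python sketch of what the copy_files task function might do, assuming the oss2 SDK and an index file with one object key per line (the actual format and parameters are defined by the oss-batch-copy template). It copies the keys found in one byte range of the index and reports whether more data remains, which is exactly what the has_more_files step checks.

    import oss2  # Alibaba Cloud OSS SDK for Python

    def copy_files(src_bucket, dst_bucket, index_key, offset, size):
        # Read one page of the index file by byte range.
        index_length = src_bucket.head_object(index_key).content_length
        end = min(offset + size, index_length) - 1
        chunk = src_bucket.get_object(index_key, byte_range=(offset, end)).read()

        # Assume one object key per line; for simplicity this sketch ignores
        # keys that are split across range boundaries.
        for line in chunk.decode('utf-8').splitlines():
            key = line.strip()
            if key:
                # Server-side copy into the destination bucket.
                dst_bucket.copy_object(src_bucket.bucket_name, key, key)

        next_offset = end + 1
        # (next_offset, has_more) feeds the has_more_files choice step.
        return next_offset, next_offset < index_length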

Using the workflow for batch processing provides the following guarantees:

  1. A single execution can process any number of files over an almost arbitrarily long period: a flow execution can run for up to one year.
  2. No maintenance or operations, and no need to build high availability yourself: Serverless Workflow and FC are highly available serverless cloud services.
  3. No need to implement checkpoints and state maintenance yourself: If the process fails for any reason, you can resume it from the last successful offset, without using any database or queue.
  4. Retry-upon-failure configuration: Most transient remote call errors can be handled by configuring exponential backoff retries.

Restore OSS Files at High Concurrency and in Batches

This section introduces an efficient and reliable solution for restoring large numbers of OSS archive files by using Serverless Workflow. Although this scenario shares the challenges of copying files, it has its own characteristics:

  1. Unlike CopyObject, the Restore operation is asynchronous. After the operation is triggered, you must poll the object status until the restore completes before the files can be used.
  2. Restoring a single object takes minutes, and the duration varies with the object size. Therefore, the overall process needs high concurrency to restore all files within the required time.

With logic similar to oss-batch-copy, this example lists OSS files in batches through ListObjects and restores them batch by batch, where restoring one batch is handled by a sub-flow. In each sub-flow, a foreach parallel loop step restores OSS objects at high concurrency, with a maximum of 100 objects restored concurrently. Because Restore is an asynchronous operation, after triggering Restore for an object, the same parallel branch polls the object status until the object is restored.

(Figure: the batch restore workflow, with a foreach parallel step that restores and polls each object in its own branch.)
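The logic inside one parallel branch can be sketched in Python with the oss2 SDK as follows (the function name and polling interval are illustrative). In the actual flow, the waiting is expressed with workflow steps and retries rather than sleeping inside a single long function invocation.

    import time

    import oss2  # Alibaba Cloud OSS SDK for Python

    def restore_and_wait(bucket, key, poll_interval_s=30):
        # Trigger the asynchronous restore; OSS returns 409 if a restore
        # for this object is already in progress.
        bucket.restore_object(key)

        while True:
            # The x-oss-restore header reports ongoing-request="true" while the
            # restore is running and ongoing-request="false" once it completes.
            status = bucket.head_object(key).headers.get('x-oss-restore', '')
            if 'ongoing-request="false"' in status:
                return
            time.sleep(poll_interval_s)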

Restoring files in batches by using Serverless Workflow and FC has the following features:

  1. Objects can be restored at high concurrency, reducing the overall recovery duration.
  2. Status-based polling ensures that all objects are restored at the end of the process.

Decompress Large OSS Files upon Event Triggering

Shared file storage is a major highlight of OSS: for example, content uploaded by one party can be consumed by downstream applications. Uploading many individual files requires calling the PutObject operation many times, which increases the probability of errors and failed uploads. Therefore, many upstream services upload a single compressed package instead. Although this simplifies the uploader's work, downstream users still want to see the uploaded files in their original structure. The demand here is to automatically decompress a package and store its contents under another OSS path in response to the OSS upload event. Today, the console already provides an event-triggered FC function that decompresses packages. However, a solution based solely on FC has some problems:

  1. 10-minute execution time limit for a single function: Decompression is prone to failure due to execution timeout for GB-level packages or packages that contain a large number of small files.
  2. Low fault tolerance: OSS invokes FC asynchronously, and OSS access inside the function may occasionally fail. A failed asynchronous invocation is retried at most three times; after that, the event is discarded and the decompression fails.
  3. Insufficient flexibility: Many users want to send a notification to a message service (for example, an MNS queue or SMS) and delete the original package after decompression. It is difficult for a single function to meet such customized demands.

To address the long execution time and the need for custom retries, this example introduces Serverless Workflow to orchestrate FC tasks. When an OSS event triggers FC, the function starts a Serverless Workflow execution. The flow uses the metadata of the zip package to stream-read, unzip, and upload the entries to the target OSS path. Whenever a function invocation's execution time exceeds a threshold, it returns the current marker. Serverless Workflow then checks whether the marker indicates that all files have been processed. If yes, the flow ends; if not, streaming decompression continues from the marker until the package is fully processed.

(Figure: the event-triggered workflow for streaming decompression of a large OSS package.)
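The marker mechanism can be sketched in Python as follows (process_batch, the time budget, and the return fields are hypothetical names). For simplicity, this sketch downloads the package to local disk, whereas the actual solution streams it by byte range so GB-level packages never have to fit locally. Each invocation unzips entries starting from the marker and hands the marker back to the flow when its time budget is used up.

    import time
    import zipfile

    import oss2  # Alibaba Cloud OSS SDK for Python

    def process_batch(bucket, zip_key, target_prefix, marker=0, time_budget_s=540):
        # Simplification: download the package to local disk; the real template
        # streams the archive by byte range instead.
        local_path = '/tmp/package.zip'
        bucket.get_object_to_file(zip_key, local_path)

        start = time.time()
        with zipfile.ZipFile(local_path) as zf:
            names = zf.namelist()
            for i in range(marker, len(names)):
                # Stop before the invocation times out and report the marker so
                # that the flow can continue from here in the next invocation.
                if time.time() - start > time_budget_s:
                    return {'marker': i, 'has_more': True}
                name = names[i]
                if name.endswith('/'):
                    continue  # skip directory entries
                with zf.open(name) as entry:
                    bucket.put_object(target_prefix + name, entry)
        return {'marker': len(names), 'has_more': False}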

With Serverless Workflow in place, the 10-minute limit on a single function invocation no longer caps the total decompression time. Moreover, built-in state management and custom retries ensure that GB-level packages and packages with more than 100,000 files can be decompressed reliably. Since Serverless Workflow supports a maximum execution time of one year, zip packages of almost any size can be decompressed in a streaming manner.


Thanks to Serverless Workflow, the decompression process can be customized flexibly. The following figure shows a flow in which a notification is sent to an MNS queue after decompression, and the original package is deleted in the next step.

(Figure: a customized flow that notifies an MNS queue after decompression and then deletes the original package.)

Takeaways

As you can see, large-scale use of OSS brings a series of batch-processing problems, and solving them with self-built tooling is tedious and error-prone. In this article, we introduced a simple and lightweight serverless solution based on Serverless Workflow and Function Compute for three common scenarios: batch backup of files, high-concurrency restoration, and event-triggered decompression of ultra-large zip files. This solution efficiently and reliably provides the following capabilities:

  1. Long-running processes for up to one year without interruption
  2. State maintenance that is unaffected by system failover
  3. Tolerance of transient errors through configurable retries
  4. Highly flexible customization

The batch processing of large amounts of OSS files involves more than the three scenarios mentioned in this article. We look forward to discussing more scenarios and requirements with you at a later date.
