Use CloudFlow and Function Compute to build a batch ETL processing system

Last Updated: Dec 17, 2025

You can use CloudFlow and Function Compute to build a batch extract, transform, load (ETL) processing system that provides more flexible and cost-effective data processing solutions. This allows you to focus more on business logic without the need to manage the underlying server resources.

Background information

New technologies such as cloud computing, AI, and IoT are being widely adopted. The amount of generated data is growing explosively, and data has become an important asset. This places increasingly high demands on data collection and processing capabilities, such as operational monitoring of application services, analysis of operational data, and filtering and preprocessing of data for deep learning. These capabilities have become a core competitive advantage and directly affect the operational efficiency of services. You can use existing ETL systems for the preceding purposes. However, in the following scenarios, you might want to build a self-managed service instead:

  • Your data processing tasks run irregularly and you want to consume no resources when no tasks run.

  • Your data processing requirements involve only a few simple steps and can be quickly met by using a self-managed service.

  • Your data processing workflow involves a large number of custom steps and cannot be flexibly processed by existing systems. You need a self-managed service to meet your business requirements.

  • You do not want to spend excessive effort on building and maintaining various open source data processing modules, but you want to achieve good performance in processing a large number of concurrent data processing requests.

If you have the preceding requirements or you want to implement a batch data processing system that features high flexibility, high reliability, cost-effectiveness, and high performance, the serverless solution in this topic provides an optimal option.

Use case

Assume that you want to process a dataset in which each record has the value data_1 or data_2. You want to count the number of occurrences of each value and store the statistical results in a data warehouse. If a large amount of data is involved or the data sources are heterogeneous, it is difficult to complete the processing within a short period of time by using a single sequential process. In this case, the combination of CloudFlow and Function Compute provides an efficient solution.
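For reference, the following minimal sketch shows the single-process baseline of this counting task in Python. The record values and the input list are illustrative assumptions; in a real scenario, the records would come from your data sources.

```python
from collections import Counter

# Illustrative records; in practice these would be read from your data sources.
records = ["data_1", "data_2", "data_1", "data_1", "data_2"]

# Count the occurrences of each value in a single pass.
counts = Counter(records)
print(dict(counts))  # {'data_1': 3, 'data_2': 2}
```

At larger data volumes, this single pass becomes the bottleneck, which is what the workflow in this topic parallelizes.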

To demonstrate the core data processing capabilities, Alibaba Cloud Object Storage Service (OSS) is used as the storage infrastructure for the data warehouse. In production, you can replace it with other types of storage or database services.

The following solution shows how to use CloudFlow and Function Compute to implement a cost-effective and highly elastic data processing system. In this system, Function Compute dynamically provides the underlying computing resources for data processing and statistics based on the amount of data, and CloudFlow orchestrates the upstream and downstream logic of the complex business workflow.

Implementation

A general data processing system involves the following components:

  • Data sources: the sources of the data that you want to process. In most cases, data comes from various sources, including databases and text files such as log files. In this example, a function is used to generate a small amount of data for illustration (see the sketch after this list). In actual scenarios, you can use various custom data sources.

  • Processing framework or mode: the framework or mode that is used to process data, such as MapReduce. In this example, CloudFlow is used to orchestrate a MapReduce-style workflow.

  • Destination: the data warehouse. In this example, OSS is used as the data warehouse, which serves as the destination of the final data.
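As a concrete idea of the illustrative data source mentioned above, the following minimal sketch generates a small batch of sample records in Python. The function name and record format are assumptions; the generator in the sample application may differ.

```python
import random

def generate_records(n=100):
    """Generate n sample records, each with the value data_1 or data_2."""
    return [random.choice(["data_1", "data_2"]) for _ in range(n)]

# Example: print ten randomly generated records.
print(generate_records(10))
```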

Data processing workflow:

The raw data is randomly split into three to five shards, each of which contains records with the values data_1 and data_2. The mapper function counts the number of occurrences of each value on each shard and stores the intermediate results. The reducer function then merges the statistical results of all shards, sums them, and stores the final result. The workflow consists of the following steps, illustrated by the sketch after this list:

  1. Obtain data from the data sources.

  2. Split data into shards randomly or based on specific rules.

  3. Process the shards in parallel by using the MapReduce model.

  4. Store the results to the destination.
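To make the map and reduce roles concrete, the following minimal sketch reproduces this workflow locally in plain Python: the records are randomly split into three to five shards, each shard is counted independently (the mapper role), and the per-shard counts are summed (the reducer role). Function names and data shapes are assumptions for illustration; the deployed functions described later additionally read from and write to OSS.

```python
import random
from collections import Counter

def split_into_shards(records, min_shards=3, max_shards=5):
    """Randomly split the records into three to five shards."""
    shard_count = random.randint(min_shards, max_shards)
    shards = [[] for _ in range(shard_count)]
    for record in records:
        random.choice(shards).append(record)
    return shards

def map_shard(shard):
    """Mapper logic: count the occurrences of each value within one shard."""
    return dict(Counter(shard))

def reduce_results(partial_counts):
    """Reducer logic: sum the per-shard counts into the final result."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return dict(total)

records = ["data_1", "data_2", "data_1", "data_2", "data_2"]
shards = split_into_shards(records)
partials = [map_shard(shard) for shard in shards]  # run in parallel by the workflow
print(reduce_results(partials))                    # e.g. {'data_1': 2, 'data_2': 3}
```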

The following figure shows the Alibaba Cloud services and modules used in the system and their interactions.

(Figure: interactions between the Alibaba Cloud services and modules used in the system)

Prerequisites

Procedure

  1. Deploy a workflow and functions in the Function Compute application center.

    1. Access ETLDataProcessing.

    2. Click Deploy in the Development & Experience section to go to the Function Compute application center and deploy an application.

    3. On the Create Application page, configure parameters and then click Create and Deploy Default Environment.

      The following list describes the key parameters that you need to configure. Use the default values for other parameters.

      Basic Configurations

      • Deployment Type: the deployment type. Select Directly Deploy.

      • Role Name:

        • If you use an Alibaba Cloud account to create an application in the application center for the first time, click Authorize Now to go to the Role Templates page, create a service-linked role named AliyunFCServerlessDevsRole, and then click Confirm Authorization Policy.

        • If you use a RAM user, follow the on-screen instructions to copy the authorization link and send it to the owner of the Alibaba Cloud account for authorization. After the authorization is complete, click Authorized.

          Note: If the Failed to obtain the role message appears, ask the owner of the Alibaba Cloud account to attach the AliRAMReadOnlyAccess and AliyunFCFullAccess policies to the current RAM user. For more information, see Grant permissions to a RAM user by using an Alibaba Cloud account.

      Advanced Settings

      • Region: the region in which you want to deploy the application.

      • Workflow Execution Role: the service-linked role that is used to execute the workflow. Create the role in advance and attach the AliyunFCInvocationAccess policy to it.

      • Function Service Role: the service-linked role that Function Compute uses to access other cloud services. If you do not have special requirements, we recommend that you use the default role AliyunFCDefaultRole provided by Function Compute.

      • Object Storage Bucket Name: the name of an OSS bucket that resides in the same region as the workflow and functions that you want to deploy.

      Wait 1 to 2 minutes for the application deployment to complete. The system automatically creates three functions and a workflow named etl-data-processing-2q1i. You can log on to the Function Compute console and the CloudFlow console to view the creation results.

      • shards-spliter: reads data from the data sources, splits the source data into shards based on specific rules, and returns the shards to the workflow.

      • mapper: the map function in the MapReduce framework. This function filters, cleans, and computes the shard data. In most cases, multiple function instances are generated in parallel based on the number of shards in a data processing workflow to improve the processing speed. The results returned by the mapper function on each shard are stored in a specific OSS directory.

      • reducer: the reduce function in the MapReduce framework. This function integrates and merges the results returned by the mapper function and pushes the final result to OSS.

      You can obtain all the sample code from the Function Compute application center and the CloudFlow console.
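      As a rough idea of how such a function can persist its intermediate results, the following hedged sketch shows a Function Compute-style Python handler for the mapper that writes its per-shard counts to an OSS object by using the oss2 SDK. The event format, bucket name, endpoint, object key layout, and credential handling are assumptions for illustration and do not necessarily match the sample code that the application deploys.

```python
import json
import os
from collections import Counter

import oss2  # Alibaba Cloud OSS SDK for Python

# Assumed configuration; replace with values that match your environment.
OSS_ENDPOINT = os.environ.get("OSS_ENDPOINT", "https://oss-cn-hangzhou.aliyuncs.com")
OSS_BUCKET = os.environ.get("OSS_BUCKET", "example-etl-bucket")

def handler(event, context):
    """Assumed mapper handler: the event carries a shard ID and its records as JSON."""
    payload = json.loads(event)
    shard_id = payload["shard_id"]
    records = payload["records"]

    # Count the occurrences of each value in this shard.
    counts = dict(Counter(records))

    # Assumed credential handling: use the temporary credentials that the
    # Function Compute runtime provides through the execution role.
    creds = context.credentials
    auth = oss2.StsAuth(creds.access_key_id, creds.access_key_secret, creds.security_token)
    bucket = oss2.Bucket(auth, OSS_ENDPOINT, OSS_BUCKET)

    # Store the intermediate result under a per-shard key for the reducer to merge.
    bucket.put_object(f"intermediate/shard-{shard_id}.json", json.dumps(counts))
    return json.dumps(counts)
```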

  2. Verify the results.

    1. Log on to the CloudFlow console. In the top navigation bar, select a region.

    2. On the Workflows page, click the etl-data-processing-2q1i workflow. On the Workflow Details page, click the Execution Records tab and then click Start Execution.

      After the execution is complete, view the input and output of the workflow.

    3. Log on to the OSS console to view the content of the shard data and the merge result.
