Application of Serverless Asynchronous Task Processing System in the Field of Data Analysis

Data Analysis in Asynchronous Task Processing Systems

Data processing, machine learning training, and statistical analysis are among the most common types of offline tasks. Such tasks typically go through a series of preprocessing steps before being sent from upstream systems to a task platform for batch training and analysis. In terms of processing languages, Python has become one of the most widely used languages in the data field thanks to its rich ecosystem of data processing libraries. Function Compute natively supports the Python runtime and makes it easy to bring in third-party libraries, so running asynchronous tasks on Function Compute is extremely convenient.
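As an illustration (not from the original article), a minimal Function Compute Python handler for such an offline analysis task might look like the following sketch. The event fields, the csv_path, and the use of pandas are assumptions made for demonstration purposes.

# -*- coding: utf-8 -*-
# Minimal sketch of an offline data-analysis task on Function Compute (illustrative only).
# Assumptions: the async invocation payload is a JSON document describing the input,
# and pandas is packaged with the function code as a third-party library.
import json

import pandas as pd


def handler(event, context):
    # The asynchronous invocation payload arrives as bytes containing JSON.
    task = json.loads(event)
    csv_path = task.get("csv_path", "/tmp/input.csv")  # hypothetical field

    # Long-running analysis work goes here.
    df = pd.read_csv(csv_path)
    summary = json.loads(df.describe().to_json())

    # The return value becomes the task result (the responsePayload when destinations are used).
    return json.dumps({"rows": len(df), "summary": summary})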

Common requirements in data analysis scenarios

Data analysis scenarios are typically characterized by long execution times and high concurrency. In offline scenarios, large volumes of data are often triggered periodically for centralized processing. Because of this trigger pattern, business teams usually have high requirements for resource utilization (that is, cost) and want to minimize spending while still meeting efficiency goals. The requirements can be summarized as follows:

1. Convenient development, with good support for third-party packages and custom dependencies;

2. Support for long-running execution: the task status can be viewed during execution, it is possible to log in to the instance to perform operations, and tasks can be stopped manually if data errors occur;

3. High resource utilization and optimal cost.

Function Compute asynchronous tasks are a natural fit for these requirements.

Typical Case - Database Autonomy Service

Basic business information

The database inspection platform within Alibaba Group is mainly used to optimize and analyze slow SQL queries and logs. Tasks on the platform fall into two main categories: offline training and online analysis. The computing scale of the online analysis business has reached tens of thousands of cores, and the daily execution of the offline business also amounts to tens of thousands of core-hours. Because the timing of online analysis and offline training is uncertain, it is difficult to improve the overall resource utilization of the cluster, and strong elastic computing power is required when business peaks arrive. After adopting Function Compute, the overall business architecture is as shown below:

Business pain points and architecture evolution

The database inspection platform is responsible for SQL optimization and analysis for databases across all regions of Alibaba's network. MySQL data comes from the various clusters in each region and is pre-aggregated and stored uniformly at the region level. Because the analysis requires cross-region aggregation and statistics, the inspection platform first attempted to build a large Flink cluster on the intranet for statistical analysis. In practice, however, the following problems were encountered:

1. Iterating on the data processing algorithms is tedious, mainly in algorithm deployment, testing, and release; Flink's runtime capabilities severely limit the release cycle;

2. Flink's support for common and customized third-party libraries is limited. Some of the machine learning and statistical libraries that the algorithms rely on are either unavailable in Flink's official Python runtime or too old to be usable, so the requirements cannot be met;

3. The Flink forwarding link is long, and troubleshooting problems in Flink is difficult;

4. At peak times, both scaling speed and available resources struggle to meet demand, and the overall cost is very high.

After learning about Function Compute, the team migrated the Flink computation part, moving the core training and statistical algorithms to Function Compute asynchronous tasks. By using the capabilities they provide, overall development, operations and maintenance, and cost have all improved significantly.

Results after migrating to the Function Compute architecture

1. After migrating to Function Compute, the system can fully handle peak traffic and quickly complete the daily analysis and training tasks;

2. The rich runtime capabilities of Function Compute support rapid business iteration;

3. For the same number of cores, the cost is about one third of the original Flink solution.

Function Compute asynchronous tasks are very well suited to this kind of data processing workload. Function Compute reduces the cost of computing resources while freeing you from complex platform operations and maintenance, so you can focus on algorithm development and optimization.

Best Practices for Asynchronous Tasks in Function Compute - Kafka ETL

ETL is a common type of data processing task. The raw data resides in Kafka or a database, and the business needs to process it and then write it to another storage medium (or back to the original message queue). This is also a typical task scenario.

If you use middleware services on the cloud (such as Kafka on the cloud), you can use Function Compute's powerful trigger integration ecosystem to connect to Kafka conveniently, without having to handle non-business operations such as Kafka Connector deployment and error handling.

Requirements for ETL task scenarios

An ETL task usually consists of three parts: a source, a sink, and a processing unit. Therefore, in addition to computing power, ETL tasks also require the task system to have strong upstream and downstream connectivity. Furthermore, because of the accuracy requirements of data processing, the task processing system must provide task deduplication and exactly-once operational semantics, as well as compensation mechanisms for failed messages (such as retries and dead-letter queues). The requirements can be summarized as follows:

1. Accurate execution of tasks:

A. Deduplication when a task is triggered repeatedly;

B. Compensation for failed tasks, such as dead-letter queues;

2. Upstream and downstream connectivity:

A. The ability to easily pull data and deliver it to other systems after processing;

3. Operator capability requirements:

A. Support for custom operators, so that various data processing tasks can be performed flexibly.

Serverless Task support for ETL tasks

The destination feature supported by Function Compute satisfies the ETL task requirements for convenient upstream and downstream connectivity and accurate task execution very well. The rich runtimes of Function Compute also make data processing tasks extremely flexible. In the Kafka ETL task processing scenario, the main Serverless Task capabilities we use are as follows:

1. Asynchronous destination configuration:

A. By configuring a success destination, tasks can be automatically delivered to downstream systems (such as queues) after they succeed;

B. By configuring a failure destination, the dead-letter queue capability is supported: failed tasks are delivered to a message queue to await subsequent compensation;

2. Flexible operator and third-party library support:

Python is one of the most widely used languages in data processing thanks to its rich third-party libraries for statistics and computation. The Python runtime of Function Compute supports packaging third-party libraries with your code, so you can quickly build prototypes, validate them, and launch tests.
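As a rough illustration of such an operator, the sketch below shows a hypothetical Function Compute Python handler that transforms Kafka records. The event schema (a JSON array of records with a "value" field) is an assumption made for demonstration and depends on how the Kafka connector delivers events; it is not taken from the project referenced below.

# -*- coding: utf-8 -*-
# Sketch of a Kafka ETL operator as a Function Compute Python handler (illustrative only).
# Assumption: the connector delivers a JSON array of records, each carrying the original
# message in a "value" field. Adjust the parsing to the actual event schema you receive.
import json


def transform(record_value):
    # Hypothetical business logic: normalize the fields of a single record.
    data = json.loads(record_value)
    data["processed"] = True
    return data


def handler(event, context):
    records = json.loads(event)
    results = [transform(r.get("value", "{}")) for r in records]

    # The return value is what downstream consumers later see as "responsePayload"
    # when a success destination is configured.
    return json.dumps({"count": len(results), "results": results})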

Kafka ETL Task Processing Example

Let's take a simple ETL task as an example, with the data coming from Kafka.

After the data is processed by Function Compute, the task execution results together with the upstream and downstream information are pushed to Message Service (MNS). The source code of the Function Compute part of the project is available at:

https://github.com/awesome-fc/Stateful-Async-Invocation

Resource Preparation

Kafka Resource Preparation

Enter the Kafka console, click Purchase Instance, and then deploy. Wait for the instance deployment to complete;

Enter the created instance and create a topic for testing.

Target Resource Preparation (MNS)

Enter the MNS console and create two queues:

1. dead-letter-queue: used as the dead-letter queue. When message processing fails, the execution context information is delivered here;

2. fc-etl-processed-message: used as the destination to which tasks are pushed after successful execution.

After creation, as shown in the following figure:

Deployment

1. Download and install Serverless Devs:

npm install @serverless-devs/s

For detailed documentation, please refer to the Serverless Devs installation documentation

2. Configure key information:

s config add

For detailed documentation, please refer to the Alibaba Cloud key configuration documentation

3. Enter the project, change the destination ARNs in the s.yaml file to the ARNs of the MNS queues created above, and change the service role to an existing role (see the configuration sketch after this list);

4. Deploy: s deploy -t s.yaml
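For orientation, the asynchronous destination section of s.yaml typically looks roughly like the following sketch. The field layout follows the Serverless Devs FC component's asynchronous configuration as an assumption, and the function name, region, account ID, and queue names are placeholders rather than values from the project; treat it as a hedged example, not the project's actual file.

# Hedged sketch of the async destination section in s.yaml (placeholders only).
function:
  name: kafka-etl-handler                # hypothetical function name
  runtime: python3
  asyncConfiguration:
    maxAsyncRetryAttempts: 3
    destination:
      # Delivered here on success; the queue receives the responsePayload envelope.
      onSuccess: acs:mns:<region>:<account-id>:/queues/fc-etl-processed-message/messages
      # Delivered here on failure, for later compensation.
      onFailure: acs:mns:<region>:<account-id>:/queues/dead-letter-queue/messages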

Configure ETL tasks

Go to the Kafka console - Connector Task List tab and click Create Connector;

After configuring the basic information and the source topic, configure the destination service. In this case, we choose Function Compute as the destination.

You can configure the sending batch size and the number of retries based on business requirements. At this point, the basic configuration of the task is complete. Note: please select "Asynchronous" as the sending mode.

On the asynchronous configuration page of Function Compute, we can see that the current configuration is as follows:

Test ETL tasks

Go to the Kafka console - Connector Task List tab and click Test. After filling in the message content, click Send:

After sending several messages, go to the Function Compute console, where we can see multiple messages being executed. At this point, we stop one of the tasks to simulate a task execution failure:

Go to the Message Service (MNS) console; we can see that messages have been delivered to both queues:

Opening the queue details, we can see the contents of the two messages. Take the content of the successful message as an example:

Here, we can see that the original content returned by the function is in the "responsePayload" key. In general, the function returns the results of the data processing as its response, so in subsequent processing you can obtain the processed results by reading "responsePayload".

The "requestPayload" key contains the original content with which the Kafka trigger invoked the function. By reading this data, you can obtain the original input.

Best Practices for Asynchronous Tasks in Function Compute - Audio and Video Processing

With the development of computer technology and networks, video-on-demand technology has been favored by the education, entertainment, and other industries thanks to its good human-computer interaction and streaming media transmission. The product lines of cloud platform vendors are now mature and complete; if you want to build a video-on-demand application, going directly to the cloud removes obstacles such as hardware procurement and technology selection. Taking Alibaba Cloud as an example, a typical solution is as follows:

In this solution, Object Storage Service (OSS) stores massive amounts of video; the collected and uploaded videos are transcoded to suit various terminals, and a CDN accelerates playback on terminal devices. In addition, there are content moderation requirements, such as the detection of pornographic and terrorist content.

Audio and video processing is a typical long-running scenario that is very well suited to Function Compute tasks.

Requirements for audio and video processing

In video-on-demand solutions, video transcoding is the most compute-intensive subsystem. Although you can use a dedicated transcoding service on the cloud, in some scenarios you may still choose to build your own transcoding service, for example when you:

• Need more flexible video processing. For example, a set of FFmpeg-based video processing services has already been deployed on virtual machines or a container platform, and you want to improve resource utilization on top of it, achieving rapid elasticity and stability under pronounced peaks and valleys and sudden traffic spikes;

• Need to quickly process many very large videos in batches. For example, hundreds of 1080p videos larger than 4 GB are generated on a schedule every Friday, and each task may take several hours to execute;

• Want to track the progress of video processing tasks in real time; when errors occur, you need to log in to the instance to troubleshoot, or even stop the task to avoid further resource consumption.

Serverless Task support for audio and video scenarios

These requirements describe typical task scenarios. Because such tasks often have pronounced peaks and valleys, operating the underlying computing resources and keeping their cost down can involve even more work than the actual video processing business itself.

The Serverless Task product form was created precisely for such scenarios. With Serverless Tasks, you can quickly build a highly elastic, highly available, low-cost, and maintenance-free video processing platform.

In this scenario, the main capabilities of the Serverless Task that we will use are as follows:

1. Maintenance-free and low cost: computing resources are fully elastic and used on demand, and nothing is paid when they are idle;

2. Friendly to long-running task workloads: a single instance supports an execution time of up to 24 hours;

3. Task deduplication: error compensation on the trigger side is supported, and for a single task, Serverless Tasks can automatically deduplicate invocations, making execution more reliable;

4. Task observability: tasks that are running, have succeeded, or have failed can all be traced and queried; task execution history and task logs can be queried;

5. Task operability: tasks can be stopped and retried;

6. Agile development and testing: automated one-click deployment with the Serverless Devs (s) tool is officially supported; you can log in to a running function instance and debug third-party programs such as FFmpeg directly in the instance, with a what-you-see-is-what-you-get experience.

Serverless - FFmpeg video transcoding

Initialize the project: s init video-transcode -d video-transcode

Enter the project directory and deploy: cd video-transcode && s deploy
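To give a sense of what the transcoding function does, here is a heavily simplified sketch of an FFmpeg transcoding handler. It is not the template's actual code: the event format, bucket, and object keys are placeholders, and it assumes an ffmpeg binary and the oss2 library are packaged with the function.

# -*- coding: utf-8 -*-
# Simplified sketch of an FFmpeg transcoding task on Function Compute (illustrative only).
# Assumptions: the event carries the bucket and object key, ffmpeg is bundled with the
# code, and the oss2 SDK is available as a third-party library.
import json
import os
import subprocess

import oss2


def handler(event, context):
    task = json.loads(event)
    bucket_name = task["bucket"]           # hypothetical field
    object_key = task["object_key"]        # hypothetical field
    target_format = task.get("target_format", "mp4")

    # Temporary credentials provided through the function's execution role.
    creds = context.credentials
    auth = oss2.StsAuth(creds.access_key_id, creds.access_key_secret, creds.security_token)
    endpoint = "https://oss-{}-internal.aliyuncs.com".format(context.region)
    bucket = oss2.Bucket(auth, endpoint, bucket_name)

    # Download the source video, transcode it with FFmpeg, then upload the result.
    local_in = os.path.join("/tmp", os.path.basename(object_key))
    local_out = "{}_out.{}".format(os.path.splitext(local_in)[0], target_format)
    bucket.get_object_to_file(object_key, local_in)
    subprocess.run(["ffmpeg", "-y", "-i", local_in, local_out], check=True)

    output_key = "output/" + os.path.basename(local_out)
    bucket.put_object_from_file(output_key, local_out)
    return json.dumps({"output": output_key})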

Invoke the function

Initiate 5 asynchronous task invocations of the function.
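One way to issue these asynchronous invocations programmatically is via the FC Python SDK (fc2), roughly as sketched below. The endpoint, credentials, service and function names, and payloads are placeholders; the console or the s tool can be used instead.

# -*- coding: utf-8 -*-
# Sketch: firing several asynchronous task invocations with the FC Python SDK (fc2).
# Endpoint, credentials, service/function names, and payloads are placeholders.
import json

import fc2

client = fc2.Client(
    endpoint="https://<account-id>.<region>.fc.aliyuncs.com",
    accessKeyID="<access-key-id>",
    accessKeySecret="<access-key-secret>",
)

videos = ["a.mov", "b.mov", "c.mov", "d.mov", "e.mov"]  # hypothetical inputs
for key in videos:
    payload = json.dumps({"bucket": "my-video-bucket", "object_key": key})
    # The x-fc-invocation-type header switches the invocation to asynchronous mode.
    client.invoke_function(
        "video-transcode-service",
        "transcode",
        payload=payload,
        headers={"x-fc-invocation-type": "Async"},
    )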

Log in to the FC console

The execution status of each transcoding task can be clearly seen:

• When the transcoding of video A started and ended

• If the transcoding task for video B does not meet expectations, it can be stopped partway by clicking Stop

• By filtering on invocation status and time window, you can see how many tasks are currently executing and what the historical completion status is

• The execution logs and trigger payload of each transcoding task can be traced

• When an exception occurs in your transcoding function, the dest-fail function configured as the failure destination is triggered; you can add custom logic to this function, such as sending an alarm (see the sketch below)
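A minimal sketch of such a failure-destination function is shown below. The assumption that the failure context arrives as JSON with a "requestPayload" field, and the use of plain logging instead of a real alarm channel, are illustrative choices, not part of the template.

# -*- coding: utf-8 -*-
# Sketch of a failure-destination ("dest-fail") function (illustrative only).
# Replace the logging below with your own alerting logic (SMS, webhook, etc.).
import json
import logging

logger = logging.getLogger()


def handler(event, context):
    failure = json.loads(event)

    # The original input of the failed transcoding task, if present.
    original_request = failure.get("requestPayload")

    logger.error("Transcoding task failed, original request: %s", original_request)
    # Custom alarm logic could be added here.
    return "alarm handled"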

After transcoding, you can also log in to the OSS console and view the transcoded video in the specified output directory.
