Application of an Asynchronous Task Processing System in the Field of Data Analysis

Data analysis with an asynchronous task processing system

Data processing, machine learning training, and statistical analysis are the most common offline tasks. Upstream systems typically send such tasks to a task platform, after a series of preprocessing steps, for batch training and analysis. In terms of processing language, Python has become one of the most commonly used languages in the data field thanks to its rich data processing libraries. Function Compute natively supports the Python runtime and makes it easy to bring in third-party libraries, so it is extremely convenient to use Function Compute for asynchronous tasks.
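As a minimal sketch, a Python task handler on Function Compute can look like the following. The `handler(event, context)` signature follows the FC Python runtime; the event schema (a JSON payload with a "values" list) is an assumed example for illustration, not a fixed format.

```python
import json
import statistics

# Hedged sketch of a Function Compute Python handler for an offline
# statistics task. The event format (JSON with a "values" list) is an
# assumption for illustration; real upstream payloads will differ.
def handler(event, context):
    payload = json.loads(event)          # the event arrives as a JSON payload
    values = payload["values"]
    result = {
        "count": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.pstdev(values),
    }
    # The returned value becomes the task result (and, once an async
    # destination is configured, the content pushed downstream).
    return json.dumps(result)
```

A third-party library such as pandas or scikit-learn could replace the stdlib `statistics` module here once packaged with the function.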

Common requirements of data analysis scenarios

Data analysis workloads are typically long-running and highly concurrent. In offline scenarios, large batches of data are often triggered on a schedule for centralized processing. Because of this trigger pattern, business teams usually have strict requirements on resource utilization (cost) and expect to meet efficiency goals while keeping costs as low as possible. These requirements can be summarized as follows:

1. Convenient program development, with friendly support for third-party packages and custom dependencies;

2. Support for long-running executions. You can view the task status during execution or log in to the machine to operate on it. If a data error occurs, the task can be stopped manually;

3. High resource utilization and optimal cost.

These requirements make Function Compute asynchronous tasks a very good fit.

Typical case - database autonomy service

Basic business information

The database patrol platform within Alibaba Group is mainly used to optimize and analyze slow SQL queries and logs. The platform's tasks are divided into offline training and online analysis. The computing scale of the online analysis business has reached tens of thousands of cores, and the daily execution of the offline business also amounts to tens of thousands of core-hours. Because the timing of online analysis and offline training is uncertain, it is difficult to improve the overall resource utilization of the cluster, and a great deal of elastic computing power is needed at business peaks. After adopting Function Compute, the architecture of the whole business is as follows:

Business pain points and architecture evolution

The database patrol platform is responsible for the optimization and analysis of database SQL across all regions of Alibaba's network. MySQL data comes from each cluster in each region and is pre-aggregated and stored at the region level. Because the analysis requires cross-region aggregation and statistics, the platform first tried to build a large Flink cluster on the intranet for statistical analysis. In practice, however, the following problems were encountered:

1. Iterating on data processing algorithms is tedious, mainly in algorithm deployment, testing, and release. Flink's runtime capabilities greatly lengthen the release cycle;

2. Flink's support for common and customized third-party libraries is limited. Some machine learning and statistical libraries that the algorithms rely on are either unavailable in Flink's official Python runtime or too old to use conveniently, and therefore cannot meet the requirements;

3. Flink's forwarding link is long, and troubleshooting in Flink is difficult;

4. Elastic scaling speed and resources struggle to meet demand at peak times, and the overall cost is very high.

After learning about Function Compute, the team migrated the Flink computing portion, moving the core training and statistical algorithms to Function Compute. By using the capabilities that Function Compute provides for asynchronous tasks, overall development, operations and maintenance, and cost have all improved greatly.

Results after migrating to the Function Compute architecture

1. After migrating to Function Compute, the system can fully absorb peak traffic and quickly complete daily analysis and training tasks;

2. Function Compute's rich runtime capabilities support rapid business iteration;

3. At the same number of cores, the cost has dropped to one third of the original Flink cost.

Function Compute asynchronous tasks are very well suited to such data processing workloads. While reducing computing resource costs, Function Compute frees you from complicated platform operations and maintenance so you can focus on algorithm development and optimization.

Best Practice for Function Compute Asynchronous Tasks: Kafka ETL

ETL is a common task in data processing. The raw data may reside in Kafka or a database, and the business needs to process it and then dump it to another storage medium (or save it back to the original task queue). This kind of business is also a clear task scenario. If you use managed middleware services on the cloud (such as cloud-hosted Kafka), you can use Function Compute's powerful trigger ecosystem to integrate with Kafka conveniently, without worrying about deploying a Kafka Connector, handling errors, or other operations unrelated to the business itself.

Requirements of the ETL task scenario

An ETL task usually includes three parts: a source, a sink, and a processing unit. Therefore, besides computing power, an ETL task also requires the task system to have strong upstream and downstream connectivity. In addition, given the accuracy requirements of data processing, the task processing system must provide task de-duplication and exactly-once semantics, as well as compensation mechanisms (such as retries and dead-letter queues) for messages that fail processing. The requirements are summarized as follows:

1. Accurate task execution:

A. De-duplication when a task is triggered repeatedly;

B. Support for task compensation and dead-letter queues;

2. Task upstream and downstream connectivity:

A. Data can be pulled in conveniently and passed on to other systems after processing;

3. Operator capability requirements:

A. Support for user-defined operators, so that various data processing tasks can be performed flexibly.
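The de-duplication requirement in point 1 can be sketched in application code as well. The following is an illustrative Python sketch, not FC's built-in mechanism: repeated triggers are filtered by an idempotency key before the operator runs. In production the "seen" set would live in durable shared storage (e.g. a database or Redis); an in-memory set is used here for brevity.

```python
# Illustrative de-duplication wrapper around a task operator.
class DedupProcessor:
    def __init__(self, process):
        self.process = process        # the actual operator function
        self.seen = set()             # IDs of tasks already processed

    def handle(self, task_id, payload):
        if task_id in self.seen:
            return "duplicate-skipped"
        result = self.process(payload)
        self.seen.add(task_id)        # mark done only after success,
        return result                 # so failed tasks can be retried
```

Marking the task done only after the operator succeeds is what allows a failed task to be retried without being mistaken for a duplicate.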

Serverless Task support for ETL tasks

The destination feature supported by Function Compute satisfies the ETL task's demands for convenient upstream and downstream connectivity and accurate task execution. Function Compute's rich runtime support also makes data processing tasks extremely flexible. In the Kafka ETL task processing scenario, we mainly use the following Serverless Task capabilities:

1. Asynchronous destination configuration:

A. By configuring a success destination, tasks can be automatically delivered to a downstream system (such as a queue);

B. By configuring a failure destination, a dead-letter queue capability is supported: failed tasks are delivered to a message queue to await subsequent compensation processing;

2. Flexible operators and third-party library support:

A. Python is one of the most widely used languages in data processing thanks to its rich third-party libraries for statistics and computation. Function Compute's Python runtime supports packaging third-party libraries, enabling you to quickly build a prototype and test it online.
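A user-defined ETL operator on this runtime can be sketched as below. The record shape (a JSON list of objects with "topic" and "value" keys) is an assumed example for illustration, not the exact payload the Kafka connector delivers; check the connector's documentation for the real event format.

```python
import json

# Hedged sketch of a Python ETL operator for Function Compute.
def handler(event, context):
    records = json.loads(event)
    cleaned = []
    for rec in records:
        value = rec["value"].strip().lower()   # the "transform" step
        if value:                              # drop empty records
            cleaned.append({"topic": rec["topic"], "value": value})
    # With a success destination configured (e.g. an MNS queue), this
    # return value is what arrives downstream as "responsePayload".
    return json.dumps(cleaned)
```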

Kafka ETL task processing example

We take a simple ETL task as an example. The data source is Kafka. After processing in Function Compute, the task execution results and upstream and downstream information are pushed to Message Service (MNS). For the source code of the Function Compute part, see:

Resource preparation

Kafka resource preparation

1. Go to the Kafka console, click Purchase Instance, and deploy it. Wait for the instance deployment to complete;

2. Open the created instance and create a topic for testing.

Target resource preparation (MNS)

Enter the MNS console and create two queues:

1. dead-letter-queue: used as the dead-letter queue. When message processing fails, the execution context information will be delivered here;

2. fc-etl-processed-message: used as the destination for tasks that execute successfully.

After creation, the queues appear as shown in the following figure:


1. Download and install Serverless Devs:

npm install @serverless-devs/s -g

For details, refer to the Serverless Devs installation documentation.

2. Configure key information:

s config add

For details, refer to the Alibaba Cloud key configuration documentation.

3. Go to the project directory, change the destination ARN in the s.yaml file to the ARN of the MNS queue created above, and change the service role to an existing role;

4. Deploy: s deploy -t s.yaml
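The destination section edited in step 3 might look roughly like the following. The field names follow the Serverless Devs fc component's async configuration as we understand it; treat the exact keys and the placeholder region/account IDs in the ARNs as assumptions to verify against the project's own s.yaml.

```yaml
# Hedged sketch of the async destination part of s.yaml; verify the
# exact keys against the project's own file before deploying.
asyncConfiguration:
  destination:
    # ARN of the MNS queue for successful executions
    onSuccess: acs:mns:<region>:<account-id>:/queues/fc-etl-processed-message/messages
    # ARN of the MNS dead-letter queue for failed executions
    onFailure: acs:mns:<region>:<account-id>:/queues/dead-letter-queue/messages
```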

Configure ETL tasks

1. In the Kafka console, go to the Connector Task List tab and click Create Connector;

2. After configuring the basic information and the source topic, configure the target service. Here we choose Function Compute as the target.

You can configure the sending batch size and retry count according to business requirements. At this point, the basic configuration of the task is complete.

Note: select "Asynchronous" as the sending mode.

Go to the Function Compute asynchronous configuration page, where the current configuration appears as follows:

Test ETL task

1. In the Kafka console, go to the Connector Task List tab and click Test. After filling in the message content, click Send.

2. After sending multiple messages, go to the Function Compute console. We can see multiple messages being executed. At this point, we stop one task to simulate a task execution failure:

3. Go to the MNS console. Each of the two previously created queues now has an available message, representing the content of one successful and one failed execution respectively:

4. Open the queue details to see the two message contents. Take the successful message as an example:

Here, the "responsePayload" key contains the raw content returned by the function. Typically, we return the result of data processing as the response, so in subsequent processing the processed result can be obtained by reading "responsePayload".

The "requestPayload" key contains the raw content with which the Kafka trigger invoked the function. The original data can be obtained by reading this field.
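A downstream consumer of the MNS queue can extract both fields as sketched below. The sample message mirrors the two keys described above; all other fields of the real destination message are omitted, and the sample values are invented for illustration.

```python
import json

# Sketch: extract the processed result and the original Kafka data
# from an MNS message produced by the async success destination.
def split_destination_message(body):
    msg = json.loads(body)
    processed = msg["responsePayload"]   # what the function returned
    original = msg["requestPayload"]     # what the Kafka trigger sent
    return processed, original

# Minimal sample message containing only the two keys of interest.
sample = json.dumps({
    "responsePayload": "{\"rows\": 10}",
    "requestPayload": "[{\"topic\": \"test\"}]",
})
processed, original = split_destination_message(sample)
```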

Best Practice for Function Compute Asynchronous Tasks: Audio and Video Processing

With the development of computer technology and networks, video-on-demand technology is favored by education, entertainment, and other industries because of its good human-computer interaction and streaming media transmission. The product lines of cloud platform vendors are now mature and complete. If you want to build a video-on-demand application, going directly to the cloud clears away obstacles such as hardware procurement and technology. Taking Alibaba Cloud as an example, a typical solution is as follows:

In this solution, Object Storage Service (OSS) supports massive video storage. Collected and uploaded videos are transcoded to adapt to various terminals, and CDN accelerates video playback on terminal devices. In addition, there are content moderation requirements, such as detecting pornographic or terrorism-related content.

Audio and video processing is a typical long-running scenario, which is very well suited to Function Compute tasks.

Requirements for audio and video processing

In the VOD solution, video transcoding is the most compute-intensive subsystem. Although you can use a dedicated transcoding service on the cloud, in some scenarios you may still choose to build your own transcoding service, for example:

• You need a more flexible video processing service. For example, a set of FFmpeg-based video processing services has already been deployed on virtual machines or a container platform, but you want to improve resource utilization on that basis and achieve fast, stable elasticity under pronounced peaks and valleys or sudden traffic surges;

• You need to batch-process multiple large videos quickly. For example, hundreds of large 1080p videos of more than 4 GB each are generated regularly every Friday, and each task may take several hours to execute;

• You want to track the progress of video processing tasks in real time. In addition, when errors occur, you need to log in to the instance to troubleshoot, or even stop a running task to avoid wasting resources.

Serverless Task support for audio and video scenarios

The above requirements are typical task scenarios. Because such tasks often show pronounced peaks and valleys, operating and maintaining the computing resources and minimizing their cost can be even more work than the actual video processing business. The Serverless Task product form was created to solve exactly this kind of scenario. With Serverless Task, you can quickly build a video processing platform that is highly elastic, highly available, low-cost, and free of operations and maintenance.

In this scenario, the main Serverless Task capabilities we will use are as follows:

1. No operations and maintenance & low cost: computing resources are paid for only when used; nothing is charged when they are idle;

2. Friendly to long-running task workloads: a single instance supports up to 24 hours of execution time;

3. Task de-duplication: error compensation on the trigger side is supported. For a single task, Serverless Task can automatically de-duplicate, making execution more reliable;

4. Task observability: all running, succeeded, and failed tasks can be traced and queried; task execution history and task logs can be queried;

5. Task operability: you can stop and retry tasks;

6. Agile development & testing: official support for one-click automated deployment with the Serverless Devs (s) tool; support for logging in to a running function instance, so you can debug third-party programs such as FFmpeg directly in the instance. What you see is what you get.

Serverless - FFmpeg video transcoding

Project source code:


1. Download and install Serverless Devs:

npm install @serverless-devs/s -g

For details, refer to the Serverless Devs installation documentation.

2. Configure key information:

s config add

For details, refer to the Alibaba Cloud key configuration documentation.

3. Initialize the project: s init video-transcode -d video-transcode

4. Go to the project directory and deploy: cd video-transcode && s deploy
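Inside a transcode function like the one deployed above, an FFmpeg command line is typically assembled and then executed with `subprocess`. The sketch below only builds the command list; actually running it requires FFmpeg in the runtime image or a layer, and the codec/CRF choices are illustrative defaults, not this project's settings.

```python
# Hedged sketch: build an FFmpeg transcoding command inside a task function.
def build_transcode_cmd(src, dst, height=720, crf=23):
    return [
        "ffmpeg",
        "-y",                         # overwrite the output if it exists
        "-i", src,                    # input video (e.g. fetched from OSS to /tmp)
        "-vf", f"scale=-2:{height}",  # resize, keeping the aspect ratio
        "-c:v", "libx264",
        "-crf", str(crf),             # constant-quality factor
        "-c:a", "aac",
        dst,
    ]

cmd = build_transcode_cmd("/tmp/in.mp4", "/tmp/out_720p.mp4")
# In the function body: subprocess.run(cmd, check=True)
```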

Invoke the function

1. Initiate five asynchronous task invocations

2. Log in to the FC console

The execution of each transcoding task can be clearly seen:

• When each video transcoding task started and ended

• If video transcoding task B does not meet expectations, you can click to stop the invocation midway

• Through invocation status filtering and time window filtering, you can see how many tasks are currently executing and their historical completion status

• The execution log and trigger payload of each transcoding task can be traced

• When your transcoding function throws an exception, the dest-fail function is triggered. You can add custom logic in this function, such as sending an alarm

After transcoding, you can also log in to the OSS console and view the transcoded videos in the specified output directory.
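The five asynchronous invocations from step 1 can be sketched in Python as follows. The payload fields (bucket/object/output_dir) are assumptions for illustration and must match what the deployed transcode function actually expects; the commented SDK call shape is likewise hedged and should be checked against the fc2 documentation.

```python
import json

# Sketch: prepare the five asynchronous invocation payloads.
def make_tasks(bucket, objects, output_dir):
    return [
        json.dumps({"bucket": bucket, "object": obj, "output_dir": output_dir})
        for obj in objects
    ]

tasks = make_tasks("my-video-bucket",
                   [f"video-{i}.mp4" for i in range(5)],
                   "output/")

# Each payload would then be sent as an async invocation, e.g. with the
# fc2 Python SDK (call shape hedged; verify against the SDK docs):
#   client.invoke_function(service, function, payload=task,
#                          headers={"x-fc-invocation-type": "Async"})
```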
