How to diagnose scheduled tasks through link tracking

Background

What is a scheduled task?

A scheduled task is a piece of business logic that runs periodically in a business application system. Because it runs in a back-end process, its execution state and execution link (call chain) are often invisible.

https://developer.aliyun.com/article/882393

What is link tracking?

As distributed microservice architectures are adopted at scale in enterprises, the platform that runs the business becomes a complex system made up of business applications owned by different R&D teams, with many forms of calls and interactions between them.

In such a complex system, the state of every downstream service is an unknowable black box from the point of view of the business portal (entry) application. Operation and maintenance problems follow:

• When the portal service is unavailable, how can we quickly locate which service node is unavailable, and why?

• How can we quickly locate and analyze the performance bottleneck of a service link?

• How can we keep track of the complete execution process of a service link?

To address these problems, Google's Dapper paper on its distributed tracing system inspired many distributed link tracking implementations, and related systems such as Zipkin, SkyWalking, and Pinpoint emerged. Their core logic is the same: build the link context for a business request when the request starts, then transparently pass that context along and fill in the information for each link node as services call one another. Finally, a complete call chain data structure is built from the request's TraceId (the request's link ID) and the parent-child dependency of each node.
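As a minimal sketch of that last step, the following Python rebuilds a call chain from reported nodes using only the TraceId and the parent-child relation. The `Span` fields, the function names, and the sample service names are assumptions made for illustration, not the data model of any particular tracing system.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    trace_id: str             # shared by every node of one request
    span_id: str              # identifies this node
    parent_id: Optional[str]  # None for the entry node of the request
    name: str
    children: List["Span"] = field(default_factory=list)


def build_call_chain(spans: List[Span], trace_id: str) -> Optional[Span]:
    """Link the spans of one trace into a tree and return its root."""
    in_trace = [s for s in spans if s.trace_id == trace_id]
    by_id = {s.span_id: s for s in in_trace}
    for s in in_trace:
        s.children = []  # keeps the function safe to call repeatedly
    root = None
    for s in in_trace:
        if s.parent_id is None:
            root = s
        elif s.parent_id in by_id:
            by_id[s.parent_id].children.append(s)
    return root


def render(span: Span, depth: int = 0) -> List[str]:
    """Return the call chain as indented lines, one per node."""
    lines = ["  " * depth + span.name]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


# Spans arrive in arbitrary order; "t2" belongs to another request.
reported = [
    Span("t1", "b", "a", "order-service"),
    Span("t1", "a", None, "scheduled-job"),
    Span("t1", "c", "b", "redis"),
    Span("t2", "x", None, "unrelated-request"),
]
print("\n".join(render(build_call_chain(reported, "t1"))))
```

Because each node carries only its own ID and its parent's ID, the nodes can be reported independently by different services and still be assembled afterwards.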

The main parts of a distributed full link tracking platform are:

• Instrumentation on the application side. The application instruments its service calls (the "burying points"). Common methods are manual instrumentation by calling an SDK and automatic instrumentation through a Java agent.

• Communication between services. Trace information must be added to the corresponding communication protocol so that it is shared along the whole call chain.

• Reporting. Trace information is reported to the full link tracking platform for storage and presentation.
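The second part, carrying trace information inside the communication protocol, can be sketched as a pair of inject and extract helpers over HTTP headers. The header layout below follows the W3C Trace Context `traceparent` format as one possible carrier; the article does not say which protocol any given platform uses, and the function names are illustrative assumptions.

```python
import re
import secrets
from typing import Dict, NamedTuple, Optional


class TraceContext(NamedTuple):
    trace_id: str  # 32 hex characters, constant along the whole call chain
    span_id: str   # 16 hex characters, identifies the calling node
    sampled: bool


# version "00", then trace id, parent span id, and flags, separated by "-"
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_root_context(sampled: bool = True) -> TraceContext:
    """Create the context at the start of a request (or a task execution)."""
    return TraceContext(secrets.token_hex(16), secrets.token_hex(8), sampled)


def inject(ctx: TraceContext, headers: Dict[str, str]) -> None:
    """Caller side: write the context into the outgoing request headers."""
    flags = "01" if ctx.sampled else "00"
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def extract(headers: Dict[str, str]) -> Optional[TraceContext]:
    """Callee side: read the context back, or None if absent or malformed."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_of(parent: TraceContext) -> TraceContext:
    """Keep the trace id, allocate a new span id for the next node."""
    return parent._replace(span_id=secrets.token_hex(8))
```

The callee calls `extract` on the incoming headers, creates its own node with `child_of`, and calls `inject` again before calling further downstream, so the same trace ID reaches every service in the chain.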

Based on these parts, each open source solution has implemented its own data structures for collection, transmission, and storage. To unify the data structures in the link tracking field, OpenTracing and OpenTelemetry emerged to define the corresponding specifications and protocols.

Why link tracking is required for scheduled tasks

Analyzing why a task failed to execute

As the business grows, scheduled tasks become more and more complex. During execution, a scheduled task may:

• Call various downstream application services owned by other business parties

• Call middleware services (such as Redis and MQ)

• Be split into N subtasks that are distributed to different machines for parallel batch processing, where the processing of each subtask is itself a complex combination of calls

In such a complex scheduled task scenario, locating the problem when a task execution is abnormal becomes very difficult. With full link tracking, the problem can be located and handled quickly.

Analyzing why a task is slow to execute

Offline tasks often carry out mass data processing, so many scheduled offline tasks are time-consuming. These tasks often have a lot of room for performance optimization, and improving their performance directly improves the utilization of basic resources and saves business costs.

On the task scheduling platform, the task execution timeout alarm combined with task execution link tracking lets us pin down the time-consuming bottleneck in business processing, which serves as a reference for further performance optimization.
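One way to picture the combination of a timeout alarm and a call chain is the sketch below: if an execution exceeds its timeout, it returns the node that spent the most time in its own work. The `TimedSpan` fields, the service names, and the rule "own time = duration minus the durations of direct children" (which assumes children run one after another) are assumptions for illustration, not how any particular platform computes it.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TimedSpan:
    span_id: str
    parent_id: Optional[str]  # None for the task execution itself
    service: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def own_times(spans: List[TimedSpan]) -> Dict[str, int]:
    """Time each node spent itself, excluding its direct children."""
    child_total: Dict[str, int] = {}
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] = child_total.get(s.parent_id, 0) + s.duration_ms
    return {s.span_id: s.duration_ms - child_total.get(s.span_id, 0) for s in spans}


def find_bottleneck(spans: List[TimedSpan], timeout_ms: int) -> Optional[TimedSpan]:
    """Return the slowest node if the execution timed out, else None."""
    root = next(s for s in spans if s.parent_id is None)
    if root.duration_ms <= timeout_ms:
        return None  # no timeout alarm, nothing to analyze
    own = own_times(spans)
    return max(spans, key=lambda s: own[s.span_id])


# A made-up execution that took 16 seconds against a 5 second timeout.
execution = [
    TimedSpan("a", None, "scheduled-job", 0, 16000),
    TimedSpan("b", "a", "order-service", 100, 1100),
    TimedSpan("c", "a", "user-service", 1200, 15800),
]
slowest = find_bottleneck(execution, timeout_ms=5000)
print(slowest.service if slowest else "within timeout")
```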

Full link traffic control

On top of the full link tracking system, further capabilities can be built:

• Canary (gray) release: canary capability across the whole link of a task while the scheduled task application is being released

• Full link stress testing: scheduled tasks take part in full link stress tests through a stress-test tag

• Traffic isolation: when scheduled tasks call downstream services, the downstream services are isolated according to the traffic source
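All three capabilities rest on the same idea: a tag travels with the link context, and each hop decides where to send the call based on it. The sketch below is only an illustration of that idea. The header name `x-traffic-tag`, the pool names, and the IP addresses are hypothetical and do not come from any product.

```python
from typing import Dict, List

TAG_HEADER = "x-traffic-tag"  # hypothetical header name


def tag_outgoing(incoming: Dict[str, str], outgoing: Dict[str, str]) -> None:
    """Pass the tag of the incoming request on to the outgoing call."""
    tag = incoming.get(TAG_HEADER)
    if tag is not None:
        outgoing[TAG_HEADER] = tag


def pick_pool(headers: Dict[str, str],
              pools: Dict[str, List[str]],
              default: str = "base") -> List[str]:
    """Choose the instance pool for a call; unknown tags use the default."""
    tag = headers.get(TAG_HEADER, default)
    return pools.get(tag) or pools[default]


# Hypothetical instance pools of one downstream service.
pools = {
    "base": ["10.0.0.1", "10.0.0.2"],
    "gray": ["10.0.1.1"],            # canary instances
    "scheduled-task": ["10.0.2.1"],  # isolated from online traffic
}
task_call = {TAG_HEADER: "scheduled-task"}
print(pick_pool(task_call, pools))
```

Falling back to the default pool when a tag has no matching instances keeps untagged or unknown traffic working as before.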

Link tracking solutions for scheduled tasks

Open source solutions

Common open source scheduled task platforms do not currently support visual query of the task execution link, which makes it difficult to analyze problems when complex or sharded tasks execute abnormally.

On the open source link tracking side, the collection agents of some solutions integrate with scheduled task frameworks and instrument the task entry point, but this mode is disconnected from the task scheduling platform. For someone responsible for operating scheduled tasks, locking onto the link of a specific task execution means searching logs or matching execution records by execution time. When the link tracking platform holds a lot of data, it is inconvenient to quickly and uniquely lock onto the target link.

Alibaba solution

Alibaba's distributed task scheduling platform SchedulerX provides a one-stop link tracking solution. It binds task execution information to link tracking trace information, so users can easily view the complete call chain of a task, of one execution, or of one shard from the task scheduling side.

Advantages of Alibaba SchedulerX:

• Accurate location of a task execution's trace information: common link tracking platforms only generate TraceIds during task execution and do not bind them to specific tasks, so picking out the call chain of one task from thousands of TraceIds is very difficult. SchedulerX can quickly locate the call chain for each scheduling, whether it is a single task or a shard of a distributed task.

• Sampling rate control on the scheduling side: a manual "run once" supports forced sampling, and the sampling rate can be configured dynamically.

• Operation and maintenance free and low cost: Java business applications deployed through EDAS support the scheduled task trace capability out of the box. There is no need to build a link tracking server platform or deploy collection agents, which reduces business costs, and users can jump from the task scheduling side to the call chain with one click.
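The first two advantages can be pictured with a small sketch: an index that binds each execution (or shard) to its TraceId, plus a sampling decision that always samples manual runs. This is an illustration under assumptions and not SchedulerX's implementation or API. The class `TraceIndex`, its methods, and the job and trace identifiers are all hypothetical.

```python
import random
from typing import Dict, Optional, Tuple

# (job id, execution id, shard id); shard id is None for a non-sharded task
ExecutionKey = Tuple[str, int, Optional[int]]


class TraceIndex:
    """Hypothetical scheduler-side record of which trace belongs to which run."""

    def __init__(self, sample_rate: float, rng: Optional[random.Random] = None):
        self.sample_rate = sample_rate  # may be changed at runtime
        self._rng = rng or random.Random()
        self._by_execution: Dict[ExecutionKey, str] = {}

    def should_sample(self, manual_run: bool) -> bool:
        """Manual runs are always sampled; others follow the sampling rate."""
        if manual_run:
            return True
        return self._rng.random() < self.sample_rate

    def bind(self, job_id: str, execution_id: int,
             shard_id: Optional[int], trace_id: str) -> None:
        """Record the TraceId generated for one execution or one shard."""
        self._by_execution[(job_id, execution_id, shard_id)] = trace_id

    def lookup(self, job_id: str, execution_id: int,
               shard_id: Optional[int] = None) -> Optional[str]:
        """Find the TraceId directly instead of searching logs by time."""
        return self._by_execution.get((job_id, execution_id, shard_id))


index = TraceIndex(sample_rate=0.1)
if index.should_sample(manual_run=True):
    index.bind("account-migration", 42, 7, "trace-for-shard-7")
print(index.lookup("account-migration", 42, 7))
```

With such a binding, the question "which of thousands of TraceIds belongs to this run?" becomes a direct lookup by job, execution, and shard.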

Scheduled task link tracking customer cases

Locating a slow task in an e-commerce business

User case: The e-commerce business scenarios are all built on a microservice architecture. A scheduled task run involves many applications and deep links. When a task runs slowly, users want to quickly locate which business application and which business function is the bottleneck of the execution link.

The following shows how to analyze task execution time. After the task is triggered, it calls downstream business application services many times to complete the entire business logic, and the whole task takes a long time to execute.

As shown in the figure above, one execution normally takes less than 5 seconds, but the last two executions took more than 15 seconds. A timeout alarm configured on the task detects execution records that exceed the expected execution time, and the call link of such an execution record can then be analyzed in the next step.

As shown in the figure above, the link tracking entry jumps automatically to the complete call chain (with a self-built platform, you can copy the TraceId and query it there). The call chain shows the business applications and IPs with high time consumption, and here it pins down the user information saving service of the downstream business application ServiceApplication as the obvious time consumer.

Locating abnormal execution in batch processing of financial accounts

User case: When a financial institution upgrades an old business system, it needs to migrate all customer account information to the new system in periodic batches. Every day it loads a batch of account information from the old system and distributes it across the business cluster to upgrade and migrate each account. When an exception occurs for an account, the location and cause of the exception must be found quickly.

The MapReduce model of SchedulerX is used for distributed batch running. Each subtask handles the business processing of one customer account. SchedulerX displays the execution list of the subtasks and provides functions such as link tracking, rerunning, and log viewing.

As shown in the figure above, when the entire task fails, open the subtask list to find the failed subtasks (for example, processing of account 1000002 failed).

As shown in the figure above, link tracking jumps automatically to the complete execution call chain of this subtask (with a self-built platform, you can copy the TraceId and query it there), which quickly locates the business application and IP where the business processing exception occurred.

As shown in the figure above, expanding the details of the failed node reveals more about the failure (in this case, the name field of account 1000002 was too long when the name information was updated). In this way, a business execution exception in a distributed batch task that spans multiple service calls can be located quickly.
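The drill-down in this case, from a failed subtask to the failed node and its details, can be sketched as a search for failed nodes that have no failed children. The `SpanResult` fields, the service names, the IP addresses, and the error text are made up for illustration and are not the customer's data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SpanResult:
    span_id: str
    parent_id: Optional[str]
    service: str
    ip: str
    ok: bool
    error: Optional[str] = None


def failure_origins(spans: List[SpanResult]) -> List[SpanResult]:
    """Failed nodes with no failed child: where the exception started."""
    failed = [s for s in spans if not s.ok]
    parents_of_failed = {s.parent_id for s in failed}
    return [s for s in failed if s.span_id not in parents_of_failed]


# Call chain of the failed subtask; a failure propagates up to its callers.
subtask_chain = [
    SpanResult("a", None, "migration-subtask", "10.0.0.5", ok=False),
    SpanResult("b", "a", "old-account-service", "10.0.0.8", ok=True),
    SpanResult("c", "a", "new-account-service", "10.0.0.9", ok=False),
    SpanResult("d", "c", "database-update", "10.0.0.9", ok=False,
               error="name field too long for account 1000002"),
]
for origin in failure_origins(subtask_chain):
    print(origin.service, origin.ip, origin.error)
```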

Analyzing the HTTP execution link of a game service

User case: A game business system uses C++, Go, and other technology stacks internally, for which SchedulerX does not provide a directly usable SDK. The user connects to SchedulerX by exposing an HTTP service that is triggered on a schedule, and SchedulerX supports viewing the complete call chain of the HTTP task execution.

The following shows an HTTP service that, after being triggered on a schedule, calls multiple downstream business application services.

The execution link above gives a complete view of one scheduled HTTP task across the entire service cluster. If you simply query the call links of the HTTP service on the link tracking platform, you often get a long list of request records and cannot quickly tell which ones were triggered by a scheduled task. By comparison, SchedulerX provides a clearer entry point on the task scheduling platform side for tracking and analyzing task execution links when operating scheduled tasks.

Summary

SchedulerX, the distributed task scheduling platform, brings the visual full link tracking capability used in microservice scenarios to scheduled task processing. This greatly improves the runtime observability of scheduled tasks and helps locate and analyze exceptions, time consumption, stuck executions, and other problems during their execution.
