How to add link tracking to scheduled tasks

Introduction: The distributed task scheduling platform SchedulerX brings the visual full-link tracking capability used in microservice scenarios to scheduled task processing. This greatly improves the observability of scheduled tasks at runtime and helps locate and analyze problems such as exceptions, slow execution, and stuck executions.
Background
What is a scheduled task
Scheduled tasks are pieces of business logic that run periodically in a business application system. Because they run in back-end processes, their execution status and execution links are often invisible.
(Figure: common technical solutions for scheduled tasks.)



What is link tracking
With the large-scale adoption of distributed microservice architectures in enterprises, the platform that runs the business has become a complex system made up of business applications owned by different R&D teams, with many forms of access and interaction among them.



Faced with such a complex system structure, the state of every downstream service is a black box to the application at the business entry point. Operation and maintenance problems follow:

When the entry service is unavailable, how can we quickly locate which service node is unavailable, and why?
How can we quickly locate and analyze performance bottlenecks in a business link?
How can we keep track of the complete execution process of a business link?


To address these problems, Google's Dapper paper on its distributed link tracking system paved the way for many distributed link tracking implementations, and many related systems have emerged, such as Zipkin, SkyWalking, and Pinpoint. Their core logic is the same: construct the link context for a request when the business request begins, pass that context along transparently and add link node information during each service call, and finally use the request's TraceId (the link identifier of the request) and the parent-child dependencies of the nodes to build a complete call chain data structure.
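As a rough illustration of that last step, the sketch below rebuilds a call chain from a flat list of reported nodes using only the TraceId and the parent-child relationship. The Span fields and the sample span names are assumptions made for this example and do not reflect the data model of any particular tracing system.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """One node in a call chain (field names are illustrative)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the entry node of the request
    name: str


def build_call_chain(spans, trace_id):
    """Return (roots, children) for one trace.

    roots are spans without a known parent; children maps a span_id
    to the list of its direct child spans.
    """
    in_trace = [s for s in spans if s.trace_id == trace_id]
    known_ids = {s.span_id for s in in_trace}
    children = defaultdict(list)
    roots = []
    for span in in_trace:
        if span.parent_id is None or span.parent_id not in known_ids:
            roots.append(span)
        else:
            children[span.parent_id].append(span)
    return roots, children


def render(roots, children, depth=0):
    """Flatten the tree into indented lines, parents before children."""
    lines = []
    for span in roots:
        lines.append("  " * depth + span.name)
        lines.extend(render(children.get(span.span_id, []), children, depth + 1))
    return lines


# Hypothetical spans from two different requests, reported in one flat list.
spans = [
    Span("t1", "a", None, "job: syncOrders"),
    Span("t1", "b", "a", "http: order-service"),
    Span("t1", "c", "b", "redis: GET order"),
    Span("t1", "d", "a", "mq: send result"),
    Span("t2", "x", None, "job: otherJob"),
]
roots, children = build_call_chain(spans, "t1")
chain_lines = render(roots, children)
print("\n".join(chain_lines))
```

Filtering by TraceId first keeps nodes from unrelated requests out of the tree, which is why every node has to carry the TraceId of the request it belongs to.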



A distributed full-link tracking platform has three main parts:

The application side instruments service calls. Common methods are manual instrumentation by calling an SDK and automatic instrumentation through a Java agent.
For communication between services, Trace information must be added to the communication protocol so that it is shared across the entire call chain.
Trace information is reported to the full-link tracking platform for storage and display.
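The second and third parts can be sketched in a few lines. The example below carries Trace information in request headers between two services and appends each node to a list that stands in for the tracking platform. The header names x-trace-id and x-parent-span-id and the helper functions are hypothetical; real systems define their own header formats.

```python
import uuid

# Hypothetical header names used only for this sketch.
TRACE_HEADER = "x-trace-id"
PARENT_HEADER = "x-parent-span-id"


def new_id():
    """Generate a random 16-character hex identifier."""
    return uuid.uuid4().hex[:16]


def start_span(incoming_headers):
    """Continue the caller's trace if the headers carry one, else start a new trace."""
    trace_id = incoming_headers.get(TRACE_HEADER) or new_id()
    return {
        "trace_id": trace_id,
        "span_id": new_id(),
        "parent_id": incoming_headers.get(PARENT_HEADER),
    }


def inject(span, outgoing_headers):
    """Return a copy of the headers with the current span's context added."""
    headers = dict(outgoing_headers)
    headers[TRACE_HEADER] = span["trace_id"]
    headers[PARENT_HEADER] = span["span_id"]
    return headers


reported = []  # stands in for the storage of the tracking platform

# Entry service: no incoming context, so a new trace starts here.
entry = start_span({})
reported.append(entry)

# Downstream service: receives the injected headers and joins the same trace.
downstream_headers = inject(entry, {"content-type": "application/json"})
downstream = start_span(downstream_headers)
reported.append(downstream)
```

The downstream node ends up with the same trace_id as the entry node and records the entry node's span_id as its parent, which is the information the platform needs to link the two.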


Around these three parts, each open source solution implemented its own data structures for collection, transmission, and storage. To unify data structures in the link tracking field, OpenTracing and OpenTelemetry emerged to define the corresponding specifications and protocols.



Why scheduled tasks need link tracking


Analyzing why a task execution failed
As a business grows, the scheduled tasks it develops become more and more complex. During execution, a scheduled task may take the following forms:

It calls downstream application services owned by other business teams.
It calls middleware services (such as Redis and MQ).
It splits into N subtasks that are distributed to different machines for parallel batch processing, and each subtask is itself a complex combination of calls.


In such complex scenarios, locating the problem when a task execution fails becomes very complicated. With complete full-link tracking, problems can be located and handled quickly.







Analyzing why a task executes slowly
Offline tasks often handle large-scale data processing, so many scheduled offline tasks run for a long time. These long-running tasks often have considerable room for performance optimization, and performance improvements translate directly into more efficient use of infrastructure resources and lower business costs.







On the task scheduling platform, we can raise alarms when a task execution times out. Combined with task execution link tracking, this effectively pinpoints the time-consuming bottleneck in business processing and serves as a reference for further performance optimization.



Full-link traffic control
Once full-link tracking is in place, further capabilities can be built on top of it:

Grayscale release: full-link grayscale capability for tasks while a scheduled task application is being released.
Full-link stress testing: scheduled tasks participate in full-link stress tests through a test label.
Traffic isolation: when scheduled tasks call downstream services, the downstream services are isolated by traffic source.





Link tracking solutions for scheduled tasks


Open source solutions
Among open source scheduled task platforms, the common solutions do not currently support visual queries of task execution links, which makes it difficult to analyze complex tasks or to troubleshoot abnormal executions of sharded tasks.



In addition, some collection agents of open source link tracking platforms integrate with scheduled task frameworks to collect traces at the task entry point. However, this approach is disconnected from the task scheduling platform. Consider the perspective of the person responsible for operating and maintaining scheduled tasks: to find the link of a particular task execution, they must search the logs or match execution records by execution time. When the link tracking platform holds a large amount of data, it is very inconvenient to quickly and uniquely identify the target link.






Alibaba solution
Alibaba's distributed task scheduling platform SchedulerX provides a one-stop link tracking solution that binds task execution information to link tracking Trace information. From the task scheduling side, users can easily view the complete call chain of a task, of a particular execution, or of a particular shard.






Advantages of the Alibaba SchedulerX solution:

Accurate location of task execution Trace information: common link tracking platforms only generate a traceId when a task executes and do not bind it to a specific task. Finding the call chain of one task among thousands of traceIds becomes very complex. With SchedulerX, whether the task is a stand-alone task or a shard of a distributed task, the call chain of each scheduling can be located quickly.
Sampling rate control on the scheduling side: manual runs support forced sampling, and the sampling rate can be configured dynamically.
O&M-free and low-cost: Java business applications deployed through EDAS support the scheduled task Trace capability out of the box, with no need to build a link tracking server platform or deploy collection agents, which reduces business costs. Users can jump to the call chain from the task scheduling side with one click.
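The first two advantages can be illustrated with a small sketch. It is not SchedulerX's implementation or API: the class, function, and identifier names are hypothetical, and the hash-based sampling is just one common way to apply a sampling rate. The sketch shows the idea of binding an execution record (and optionally a shard) to its traceId, and of forcing sampling for manually triggered runs.

```python
import zlib


class ExecutionTraceIndex:
    """Maps (job_id, execution_id, shard_id) to the trace ID of that run."""

    def __init__(self):
        self._by_execution = {}

    def bind(self, job_id, execution_id, trace_id, shard_id=None):
        """Record which trace belongs to a given execution or shard."""
        self._by_execution[(job_id, execution_id, shard_id)] = trace_id

    def lookup(self, job_id, execution_id, shard_id=None):
        """Return the bound trace ID, or None if nothing was bound."""
        return self._by_execution.get((job_id, execution_id, shard_id))


def should_sample(trace_id, sample_rate, manual_trigger=False):
    """Always sample manual runs; otherwise sample a stable fraction of trace IDs."""
    if manual_trigger:
        return True
    # CRC32 gives the same bucket for the same trace ID on every machine.
    bucket = zlib.crc32(trace_id.encode("utf-8")) % 10_000
    return bucket < int(sample_rate * 10_000)


index = ExecutionTraceIndex()
index.bind("job-42", "exec-1001", "trace-aaa")              # stand-alone run
index.bind("job-42", "exec-1002", "trace-bbb", shard_id=3)  # one shard of a distributed run

print(index.lookup("job-42", "exec-1002", shard_id=3))
```

With such a binding, a query starts from the execution record that the operator already has in hand rather than from a search across all traceIds.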


Customer cases of scheduled task link tracking


An e-commerce business locates a slow task execution
User case: E-commerce business scenarios today are built on a microservice architecture, so a scheduled task run involves many applications and deep links. When a task runs slowly, the user wants to quickly locate which business application and which business function is the bottleneck in the execution link.





The following shows how to analyze a task's execution time. After the task is triggered, it calls multiple downstream business application services to complete the business logic, and the entire task execution takes a long time.






As shown in the figure above, one execution normally takes less than 5 seconds, but the last two executions took more than 15 seconds. With a timeout alarm configured on the task, we can detect that an execution record exceeded the expected execution time and then open the call link of that execution record for further analysis.
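The timeout check described above amounts to comparing each execution record with an expected duration. The sketch below assumes a 5-second threshold taken from the normal execution time mentioned in the text; the record fields, durations, and trace IDs are illustrative values, not data from the case.

```python
from dataclasses import dataclass


@dataclass
class ExecutionRecord:
    execution_id: str
    duration_seconds: float
    trace_id: str


def find_timed_out(records, timeout_seconds):
    """Return the execution records that ran longer than the timeout."""
    return [r for r in records if r.duration_seconds > timeout_seconds]


# Illustrative records: two normal runs followed by two slow runs.
records = [
    ExecutionRecord("exec-1", 3.8, "trace-01"),
    ExecutionRecord("exec-2", 4.2, "trace-02"),
    ExecutionRecord("exec-3", 15.6, "trace-03"),
    ExecutionRecord("exec-4", 16.1, "trace-04"),
]

slow = find_timed_out(records, timeout_seconds=5.0)
slow_trace_ids = [r.trace_id for r in slow]
print(slow_trace_ids)
```

Because each record carries its trace ID, the slow executions lead directly to the call chains that need to be examined.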







As shown in the figure above, link tracking jumps automatically to the complete call chain (users who have built their own platform can copy the TraceId and query with it). The user information saving service of the downstream business application ServiceApplication is clearly the time-consuming step.
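One way to find such a bottleneck programmatically is to compute each node's self time, meaning its own duration minus the time spent in its direct children, and pick the largest. The sketch below assumes that child calls run one after another; the span names and durations are made up for illustration, with ServiceApplication.saveUserInfo standing in for the user information saving service.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TimedSpan:
    span_id: str
    parent_id: Optional[str]
    name: str
    duration_ms: int


def self_times(spans):
    """Self time = own duration minus the durations of direct children.

    Assumes children run sequentially; overlapping children would
    require merging their time intervals instead of summing.
    """
    child_total = {}
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] = child_total.get(s.parent_id, 0) + s.duration_ms
    return {s.span_id: s.duration_ms - child_total.get(s.span_id, 0) for s in spans}


def bottleneck(spans):
    """Return the span with the largest self time."""
    times = self_times(spans)
    return max(spans, key=lambda s: times[s.span_id])


# Hypothetical call chain of one slow execution.
chain = [
    TimedSpan("a", None, "job: scheduled task", 15600),
    TimedSpan("b", "a", "ServiceApplication.saveUserInfo", 12000),
    TimedSpan("c", "b", "db: insert user", 500),
    TimedSpan("d", "a", "OrderService.query", 800),
]

slowest = bottleneck(chain)
print(slowest.name, self_times(chain)[slowest.span_id])
```

Ranking by self time rather than total duration keeps the root node, which always has the longest total duration, from being reported as the bottleneck.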



A financial business locates an exception in an account batch run
User case: A financial institution is upgrading an old business system and needs to migrate all customer account information to the new system in scheduled batches. Every day, a batch of account information is loaded from the old system and distributed across the business cluster, where each account's information is upgraded and migrated. When an exception occurs for an account, its location and cause must be found quickly.



The distributed batch run uses the MapReduce model of SchedulerX, where each subtask corresponds to the business processing of one customer's account information. SchedulerX can display the execution list of subtasks and provides functions such as link tracking, rerunning, and log viewing.






As shown in the figure above, when the entire task execution fails, open the subtask list to find the failed subtask (for example, account 1000002 failed to process).







As shown in the figure above, link tracking jumps automatically to the complete execution call chain of the subtask (users of a self-built platform can copy the TraceId and query with it), which quickly reveals the business application and IP where the business processing exception occurred.







As shown in the figure above, expand the details of the failed node to see more information about the failure (for example, a field was too long when updating the name information of account 1000002). At this point, an execution exception in a distributed batch task that involves multiple service calls has been located quickly.
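The three steps in this case (find the failed subtask, open its call chain, expand the failing node) can be approximated as follows. All data structures, service names, IP addresses, and error messages below are hypothetical stand-ins for what the subtask list and the call chain would show; only the account number comes from the case.

```python
def failed_subtasks(subtasks):
    """Step 1: pick the subtasks whose status is FAILED."""
    return [t for t in subtasks if t["status"] == "FAILED"]


def root_cause_spans(spans):
    """Steps 2 and 3: return error spans that have no failing child.

    These are the deepest points in the call chain where the error surfaced.
    """
    parents_of_errors = {s["parent_id"] for s in spans if s["error"]}
    return [s for s in spans if s["error"] and s["span_id"] not in parents_of_errors]


subtasks = [
    {"account": "1000001", "status": "SUCCESS", "trace_id": "t-1"},
    {"account": "1000002", "status": "FAILED", "trace_id": "t-2"},
]

# Hypothetical call chain stored for the failed subtask's trace.
spans_by_trace = {
    "t-2": [
        {"span_id": "a", "parent_id": None, "name": "subtask: migrate account",
         "ip": "10.0.0.11", "error": "downstream call failed"},
        {"span_id": "b", "parent_id": "a", "name": "AccountService.load",
         "ip": "10.0.0.12", "error": None},
        {"span_id": "c", "parent_id": "a", "name": "AccountService.updateName",
         "ip": "10.0.0.13", "error": "name field too long"},
    ],
}

failed = failed_subtasks(subtasks)
causes = root_cause_spans(spans_by_trace[failed[0]["trace_id"]])
for span in causes:
    print(failed[0]["account"], span["name"], span["ip"], span["error"])
```

The parent span is also marked as failed, but it is skipped because one of its children failed, so the result points at the service and IP where the exception originated.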



A game business analyzes an HTTP execution link
User case: A game business system internally uses technology stacks such as C++ and Go, for which SchedulerX does not provide SDKs. Users connect to SchedulerX by exposing HTTP services that SchedulerX triggers on a schedule, and SchedulerX supports viewing the complete call chain of an HTTP task execution.



The following shows an HTTP service that, after being triggered on schedule, makes multiple calls to downstream business application services.







The execution link above shows the complete execution link of an HTTP scheduled task across the entire business cluster. If you simply query the invocation links of the HTTP service on the link tracking platform, you often get a long list of request records and cannot quickly tell which ones were triggered by a given scheduled task. By comparison, SchedulerX gives those who operate and maintain scheduled tasks a clearer entry point, on the task scheduling platform side, for tracking and analyzing task execution links.



Summary
The distributed task scheduling platform SchedulerX brings the visual full-link tracking capability used in microservice scenarios to scheduled task processing. This greatly improves the observability of scheduled tasks at runtime and helps locate and analyze problems during execution, such as exceptions, slow execution, and stuck executions.
