
Configure dependencies for tasks with different cycles

Last Updated: Apr 23, 2018

During big data development, tasks often depend on other tasks that run on different cycles. For example, daily tasks may depend on hourly tasks, and hourly tasks may depend on minute-level tasks. How do you configure DataWorks for these two scenarios?

This document walks through both scenarios, covering scheduling dependencies, parameters, execution, and other considerations, to show best practices for configuring dependencies between tasks that run on different cycles.

First, let’s go through a few concepts:

  • Business date: This is the date on which business data was produced. Here, it refers to the business data from one full day. In DataWorks, tasks run on a daily basis to process the business data from the previous day (24 hours). Therefore, the business date = daily scheduling date - 1 day.

  • Dependency: The dependency relationship describes the semantic connections between two or more nodes or workflows. The running status of an upstream node or workflow may affect that of the downstream node or workflow, but the running status of a downstream node or workflow cannot affect that of the upstream node or workflow.

  • Scheduled instance: When the DataWorks scheduling system schedules the execution of periodic tasks, it first instantiates each task based on its configuration. Each instance carries a group of attributes, including its specific scheduled time, its status, and its upstream and downstream dependencies.

    NOTE:

    Currently the instances automatically scheduled by DTplus DataWorks each day are generated at 23:30 the night before.

  • Scheduling rules: To run a scheduled task, the following conditions must be satisfied, in order:

    • Confirm that all upstream task instances have been run successfully. If all upstream task instances have run successfully, the triggered task enters the “Waiting for Scheduled Time” status.

    • Confirm that the task instance’s scheduled time has arrived. After the task instance enters the “Waiting for Scheduled Time” status, it checks whether its scheduled time has arrived. Once this time arrives, the task instance enters the “Waiting for Resources” status.

    • Confirm that the currently scheduled resources are sufficient. After the task instance enters the “Waiting for Resources” status, it checks that the resources currently scheduled to this project are sufficient. If the resources are sufficient, the task instance starts running.
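The business date rule and the three scheduling checks above can be sketched in a few lines of Python. This is only an illustration, not DataWorks code: the function names are hypothetical, and the status labels "Waiting for upstream instances" and "Running" are assumptions, because the text names only the two intermediate statuses.

```python
from datetime import date, timedelta


def business_date(scheduled_date: date) -> date:
    """Business date = daily scheduling date - 1 day."""
    return scheduled_date - timedelta(days=1)


def instance_status(upstreams_succeeded: bool,
                    scheduled_time_reached: bool,
                    resources_available: bool) -> str:
    """Apply the three scheduling checks in order and report where the instance stands."""
    if not upstreams_succeeded:
        return "Waiting for upstream instances"  # assumed label, not from the source
    if not scheduled_time_reached:
        return "Waiting for Scheduled Time"
    if not resources_available:
        return "Waiting for Resources"
    return "Running"  # assumed label, not from the source


print(business_date(date(2017, 1, 11)))     # 2017-01-10
print(instance_status(True, False, False))  # Waiting for Scheduled Time
print(instance_status(True, True, False))   # Waiting for Resources
```

The order of the checks matters: an instance whose scheduled time has already passed still waits if any upstream instance has not succeeded.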

Daily tasks dependent on hourly tasks

Business scenario

The system must record statistics on the incremental business data produced each hour. After summarizing the data from the final hour of a day, it must run a task to summarize the data for the whole day.

Demand analysis

  • The system runs a task every hour to calculate the volume of data produced during the previous hour. You must configure a task to run every hour each day. Statistics on the data from the last hour of the day are calculated by the first task instance on the next day.

  • The final summary task is run once a day after data statistics are calculated for the last hour of the day. Therefore, you must configure one daily task that is dependent on the first hourly task instance each day.

The resulting schedule format is shown in the following figure:

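[Figure: intended schedule, with the daily task depending on the first hourly task instance of the day]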

However, if you configure the scheduling dependencies exactly as defined in the preceding figure, the scheduled task instances do not behave as shown there. Instead, they behave as shown in the following figure:

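[Figure: actual schedule, with the daily task waiting for all hourly task instances of the current day]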

In this figure, the daily task can only be run after all the other instances of the hourly task for the current day have been completed. This means that, if the daily task only depends on the first instance of the hourly task, its results cannot meet the requirements of the scenario.

To meet the requirements of this scenario, you must configure Cross-cycle Dependencies for the tasks. You can set the Cross-cycle Dependency attribute of the hourly task to Self-dependent and then set the daily task’s scheduled time to 00:00 and configure its dependency attribute to reflect a dependency on the hourly task.

The scheduling format of this final plan is shown in the following figure:

[Figure: final plan, with a self-dependent hourly task and the daily task scheduled at 00:00]

Now, the hourly task instances run in sequence, each one waiting for the one before it. If the first instance of the current day runs successfully, all the instances of the previous day must also have run successfully. Therefore, the daily task only has to depend on the first hourly instance of the day.
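The reasoning behind the self-dependent chain can be checked with a small simulation. This is a sketch under simplifying assumptions, not scheduler code: the function name and the status strings are hypothetical, and each instance is assumed to depend only on the instance immediately before it.

```python
def run_self_dependent_chain(would_succeed):
    """Simulate a self-dependent task: instance i runs only if instance i-1 succeeded.

    would_succeed[i] says whether instance i would succeed if it were allowed to run.
    Returns the final status of every instance in order.
    """
    statuses = []
    previous_ok = True  # assume the instance before this window succeeded
    for ok in would_succeed:
        if not previous_ok:
            statuses.append("not run")
        elif ok:
            statuses.append("success")
        else:
            statuses.append("failed")
        previous_ok = previous_ok and ok
    return statuses


# 24 hourly instances of the previous day, followed by the first instance of the current day.
outcomes = [True] * 25
print(run_self_dependent_chain(outcomes)[-1])  # success

outcomes[5] = False  # the 05:00 instance of the previous day fails
print(run_self_dependent_chain(outcomes)[-1])  # not run
```

Because a single failure blocks every later instance, a successful first instance of the day is enough evidence that the whole previous day completed.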

Configuration practices

The scheduling configuration of the hourly task is shown in the following figure:

[Figure: scheduling configuration of the hourly task]

Parameter configuration:

  • Hourly task: The hourly task starts an instance each hour to process the data from the previous hour. For example, you can use the configuration $[yyyy-mm-dd-hh24-1/24].

  • Daily task: If the date format is ‘yyyymmdd’, use ${bdp.system.bizdate}. If the date format is ‘yyyy-mm-dd’, use the custom parameter $[yyyy-mm-dd-1].

The specific configuration depends on the actual design details. The parameter configurations are shown in the following figure:

[Figure: parameter configurations of the hourly and daily tasks]

Testing, data population, and automatic scheduling

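Testing and data population: Both are manually generated scheduling instances for the selected business date. For example, if you select the business date 2017-01-10:

  • The scheduled time for the daily task instance is 2017-01-11 00:00:00.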

  • The scheduled times for the hourly task instances are hourly from 2017-01-11 00:00:00 to 2017-01-11 23:00:00.

  • ${bdp.system.bizdate} is assigned the value 20170110 (the instance’s scheduled date minus one day, in the format yyyymmdd).

  • $[yyyy-mm-dd-hh24-1/24] is assigned the values 2017-01-10-23 to 2017-01-11-22 (the instances’ scheduled times minus one hour, in the format yyyy-mm-dd-hh).
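The values listed above can be reproduced with ordinary date arithmetic. The following Python sketch only mimics how the parameters resolve. It is not how DataWorks evaluates them internally, and the helper names are hypothetical.

```python
from datetime import datetime, timedelta


def bizdate(scheduled: datetime) -> str:
    """Mimics ${bdp.system.bizdate}: scheduled date minus one day, as yyyymmdd."""
    return (scheduled - timedelta(days=1)).strftime("%Y%m%d")


def day_minus_one(scheduled: datetime) -> str:
    """Mimics $[yyyy-mm-dd-1]: scheduled date minus one day, as yyyy-mm-dd."""
    return (scheduled - timedelta(days=1)).strftime("%Y-%m-%d")


def hour_minus_one(scheduled: datetime) -> str:
    """Mimics $[yyyy-mm-dd-hh24-1/24]: scheduled time minus one hour."""
    return (scheduled - timedelta(hours=1)).strftime("%Y-%m-%d-%H")


daily_instance = datetime(2017, 1, 11, 0, 0, 0)
hourly_instances = [daily_instance + timedelta(hours=h) for h in range(24)]
hourly_values = [hour_minus_one(t) for t in hourly_instances]

print(bizdate(daily_instance))        # 20170110
print(day_minus_one(daily_instance))  # 2017-01-10
print(hourly_values[0])               # 2017-01-10-23
print(hourly_values[-1])              # 2017-01-11-22
```

The offset 1/24 is expressed in days, so it corresponds to one hour, which is why the first hourly instance of 2017-01-11 resolves to hour 23 of the previous day.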

Automatic scheduling: The scheduling system automatically generates instances, with the scheduled time for each instance set to the current date. This is shown in the final plan figure in the Demand analysis section.

Hourly tasks dependent on minute-level tasks

Business scenario

The system has a task that synchronizes data every 30 minutes, incrementally importing the system data from the previous 30 minutes into MaxCompute. This task runs every half hour, every day. Now, you must configure an hourly task that records statistics every six hours, covering the data produced from 00:00 to 06:00, 06:00 to 12:00, 12:00 to 18:00, and 18:00 to 00:00 each day.

Demand analysis

Minute-level task

  • The 00:00 instance synchronizes data from the last 30 minute period of the previous day, producing a table partition with the format yyyy-mm-dd-23:30 and the date of the previous day.

  • The 00:30 instance synchronizes the data produced from 00:00 to 00:30 on the current day, producing a table partition with the format yyyy-mm-dd-00:00 and the date of the current day.

  • The 01:00 instance synchronizes the data produced from 00:30 to 01:00 on the current day, producing a table partition with the format yyyy-mm-dd-00:30 and the date of the current day.

Instances continue to run every 30 minutes in the same way, until the 23:30 instance synchronizes the data produced from 23:00 to 23:30 on the current day, producing a table partition with the format yyyy-mm-dd-23:00 and the date of the current day.

Hourly task

  • The hourly task records statistics every six hours and runs four times a day.

  • The hourly task for 00:00 to 06:00 depends on the 12 instances of the minute-level task that run from 00:30 to 06:00 on the current day.

  • The hourly task for 06:00 to 12:00 depends on the 12 instances of the minute-level task that run from 06:30 to 12:00 on the current day.

  • The hourly task for 12:00 to 18:00 depends on the 12 instances of the minute-level task that run from 12:30 to 18:00 on the current day.

  • The hourly task for 18:00 to 00:00 the next day depends on the 11 instances of the minute-level task that run from 18:30 to 23:30 on the current day, plus the one instance that runs at 00:00 the next day.
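The instance counts above follow from the fact that a minute-level instance scheduled at time t synchronizes the data produced in the 30 minutes before t. A short Python sketch, with a hypothetical helper name and 2017-01-10 chosen as an arbitrary example day, makes the mapping from each six-hour window to its required instances explicit:

```python
from datetime import datetime, timedelta

HALF_HOUR = timedelta(minutes=30)


def instances_covering(window_start: datetime, window_end: datetime):
    """Scheduled times of the minute-level instances whose data lies in the window.

    An instance scheduled at t synchronizes data from [t - 30 min, t), so the
    window (start, end) needs the instances scheduled at start + 30 min .. end.
    """
    times = []
    t = window_start + HALF_HOUR
    while t <= window_end:
        times.append(t)
        t += HALF_HOUR
    return times


day = datetime(2017, 1, 10)  # example day, chosen for illustration
windows = [(day + timedelta(hours=h), day + timedelta(hours=h + 6)) for h in (0, 6, 12, 18)]

for start, end in windows:
    deps = instances_covering(start, end)
    print(f"{start:%H:%M}-{end:%H:%M}: {len(deps)} instances, "
          f"{deps[0]:%m-%d %H:%M} .. {deps[-1]:%m-%d %H:%M}")
```

The last window is the awkward one: its final instance is scheduled at 00:00 on the next day, so it belongs to a different day's set of instances than the other eleven.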

The resulting schedule format is shown in the following figure:

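[Figure: intended schedule, with each hourly task instance depending on the minute-level instances of its six-hour window]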

However, if you configure the scheduling dependencies exactly as defined in the preceding figure, the scheduled task instances do not behave as shown there. Instead, they behave as shown in the following figure:

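[Figure: actual schedule, with the 00:00 hourly task instance depending only on the 00:00 minute-level instance]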

As shown in the preceding figure, data produced between 18:00 on the 10th and 00:00 on the 11th was processed by an hourly task instance at 00:00 on the 11th. This instance only depended on the minute-level task instance that ran at 00:00 on the 11th. There was no guarantee that the other minute-level tasks that ran from 18:30 to 23:30 on the 10th were successful.

To meet the requirements of this scenario, you must configure Cross-cycle Dependencies for the tasks. You can set the Cross-cycle Dependency attribute of the minute-level task to Self-dependent and then set the hourly task’s dependency attribute to reflect a dependency on the minute-level task.

The scheduling format of this final plan is shown in the following figure:

[Figure: final plan, with a self-dependent minute-level task]

Configuration practices

The scheduling configuration of the minute-level task is shown in the following figure:

[Figure: scheduling configuration of the minute-level task]

The scheduling configuration of the hourly task is shown in the following figure:

[Figure: scheduling configuration of the hourly task]

Parameter configuration: Each instance of the minute-level task processes the partition that holds the data produced in the previous 30 minutes, by using the parameter $[yyyy-mm-dd-hh24:mi-30/24/60]. The specific configuration depends on the actual design details.

Testing, data population, and automatic scheduling

Testing and data population: Both are manually generated scheduling instances for the selected business date. For example, you can select the business date 2017-01-10.

  • In this case, the minute-level task instances are scheduled every 30 minutes from 2017-01-11 00:00:00 to 2017-01-11 23:30:00, for a total of 48 instances.

  • The hourly task instances are scheduled every six hours, at 00:00:00, 06:00:00, 12:00:00, and 18:00:00 on 2017-01-11, for a total of four instances.

  • $[yyyy-mm-dd-hh24:mi-30/24/60] is assigned the values 2017-01-10-23:30 to 2017-01-11-23:00 (the instances’ scheduled times minus 30 minutes, in the format yyyy-mm-dd-hh:mm).
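As a quick check of these numbers, the sketch below generates the instances for business date 2017-01-10 and resolves the partition value for each minute-level instance. It only imitates the parameter with plain date arithmetic, and the helper name is hypothetical.

```python
from datetime import datetime, timedelta


def minus_30_minutes(scheduled: datetime) -> str:
    """Mimics $[yyyy-mm-dd-hh24:mi-30/24/60]: scheduled time minus 30 minutes."""
    return (scheduled - timedelta(minutes=30)).strftime("%Y-%m-%d-%H:%M")


start = datetime(2017, 1, 11, 0, 0, 0)
minute_instances = [start + timedelta(minutes=30 * i) for i in range(48)]
hourly_instances = [start + timedelta(hours=6 * i) for i in range(4)]
partitions = [minus_30_minutes(t) for t in minute_instances]

print(len(minute_instances), len(hourly_instances))  # 48 4
print(minute_instances[-1])                          # 2017-01-11 23:30:00
print(partitions[0])                                 # 2017-01-10-23:30
print(partitions[-1])                                # 2017-01-11-23:00
```

The offset 30/24/60 is expressed in days: 30 divided by 24 hours and then by 60 minutes, which is 30 minutes.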

Automatic scheduling: The scheduling system automatically generates instances, with the scheduled time for each instance set to the current date. This is shown in the final plan figure in the Demand analysis section.

Conclusion

  • When tasks with a longer cycle depend on tasks with a shorter cycle and the task with the shorter cycle is self-dependent: For the instances scheduled on the current day, each instance of the longer-cycle task only depends on the instance of the shorter-cycle task with the scheduled time nearest to (and no later than) its own scheduled time. In both scenarios above, this is the shorter-cycle instance scheduled at the same time as the longer-cycle instance.

  • When tasks with a longer cycle (hourly) depend on tasks with a shorter cycle (minute-level) and the task with the shorter cycle is not self-dependent: For the instances scheduled on the current day, each instance of the longer-cycle task depends on each instance of the shorter-cycle task with a scheduled time less than or equal to its own scheduled time, provided this instance is not a dependency of another instance of this task. This is not the case when daily, weekly, or monthly tasks depend on hourly or minute-level tasks because daily task instances are dependent on all hourly or minute-level task instances.

  • When you use both scheduling cycle and scheduling time parameters, the value of the final scheduling parameter is determined by the scheduled time of each scheduled instance. In the scheduling system, the business date = the instance’s scheduled date minus one day.
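The first two rules can be modeled with a small function for the hourly-on-minute-level case. This is an interpretation of the rules as stated here, not the scheduler's actual algorithm: the function name is hypothetical, only same-day instances are considered, and "nearest" is read as "nearest and no later than", which is what the two scenarios in this document show.

```python
from datetime import datetime, timedelta


def resolve_dependencies(downstream_times, upstream_times, upstream_self_dependent):
    """Map each downstream scheduled time to the same-day upstream instances it waits for."""
    deps = {}
    claimed = set()  # upstream instances already assigned to an earlier downstream instance
    for d in sorted(downstream_times):
        eligible = [u for u in sorted(upstream_times) if u <= d]
        if upstream_self_dependent:
            chosen = eligible[-1:]  # only the nearest upstream instance
        else:
            chosen = [u for u in eligible if u not in claimed]
            claimed.update(chosen)
        deps[d] = chosen
    return deps


start = datetime(2017, 1, 11)
minute_times = [start + timedelta(minutes=30 * i) for i in range(48)]
hourly_times = [start + timedelta(hours=6 * i) for i in range(4)]

plain = resolve_dependencies(hourly_times, minute_times, upstream_self_dependent=False)
chained = resolve_dependencies(hourly_times, minute_times, upstream_self_dependent=True)

print([len(plain[t]) for t in hourly_times])    # [1, 12, 12, 12]
print([len(chained[t]) for t in hourly_times])  # [1, 1, 1, 1]
```

Without self-dependency, the 00:00 hourly instance waits only for the 00:00 minute-level instance, which is the gap described in the second scenario. With self-dependency, one upstream instance per hourly instance is enough, because each minute-level instance already implies that its predecessors succeeded.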
