AnalyticDB for MySQL Data Lakehouse Edition (V3.0) provides job scheduling for batch SQL and Spark applications to implement flexible extract-transform-load (ETL) data processing and development.
Prerequisites
- If you use an Alibaba Cloud account to perform job scheduling, make sure that the following conditions are met:
- A privileged account is created for an AnalyticDB for MySQL cluster. For more information, see Create a database account.
- A job resource group with at least 8 AnalyticDB Compute Units (ACUs) of reserved computing resources is created. For more information, see Create a resource group.
- If you perform job scheduling as a Resource Access Management (RAM) user, make sure that the following conditions are met:
- The AliyunServiceRoleForAnalyticDBForMySQL role of AnalyticDB for MySQL is configured. For more information, see Role description.
- A standard account is created for an AnalyticDB for MySQL cluster and associated with the RAM user. For more information, see Associate or disassociate a database account with or from a RAM user.
- A job resource group with at least 8 AnalyticDB Compute Units (ACUs) of reserved computing resources is created. For more information, see Create a resource group.
Project management
Role | Policy | Permission description |
Administrator | AliyunADBFullAccess | Has all permissions on a project. |
Developer | AliyunADBDeveloperAccess | Has edit permissions on a project. For example, a developer can edit workflows, scheduled tasks, and workflow nodes. |
Visitor | AliyunADBReadOnlyAccess | Has read-only permissions on a project. For example, a visitor can view process design and code content such as workflows and scheduled tasks. |
Project management allows you to create projects, edit and delete existing projects, and manage members.
Create a project
- Log on to the AnalyticDB for MySQL console.
- In the upper-left corner of the page, select a region.
- In the left-side navigation pane, click Clusters.
- On the Data Lakehouse Edition (V3.0) tab, find the cluster that you want to manage and click the Cluster ID.
- In the left-side navigation pane, click Job Scheduling.
- On the Projects tab, click Create Project.
- In the panel that appears, enter a project name and click OK.
Manage members
- Find a project and click the icon in the Actions column.
- Click Add Member.
- In the Members section, select unauthorized accounts and click the icon. Note: Unauthorized accounts are all RAM users within the Alibaba Cloud account that have not yet been added as project members.
- Click OK.
Workflow orchestration
Create a workflow
- In the upper-right corner, choose .
- Click Create Workflow. Note: You can create SQL and Spark task nodes.
- On the Create Workflow page, drag a node to the canvas and configure its parameters. The following table describes the parameters configured for SQL task nodes.
Parameter | Description |
Node Name | The name of the SQL task node. |
Schema | The schema used by the SQL task node. |
SQL Statements | The SQL statements to execute. You can enter SQL statements or select existing SQL statements. Existing SQL statements refer to the SQL statements saved on the page after you choose . |
Custom Parameters | The custom parameters that can be used as variables in SQL statements. For example, if you set the prop of a custom parameter to table and its value to customer, the SELECT * FROM ${table}; statement is changed to SELECT * FROM customer;. |
Pre-SQL | The SQL statements that are executed before the SQL statements of the current workflow. Only DML statements are allowed, such as INSERT, UPDATE, and DELETE. |
Post-SQL | The SQL statements that are executed after the SQL statements of the current workflow. Only DDL statements are allowed, such as CREATE and DROP. |
Retry Attempts | The maximum number of retry attempts for the current SQL statements and the interval between attempts. Note: If you set the maximum number of retries to 3 and the first retry succeeds, no further retries are attempted. |
Description | The description of the SQL task node. |

The following table describes the parameters configured for Spark task nodes.

Parameter | Description |
Node Name | The name of the Spark task node. |
Resource Group | The resource group used by the Spark task node. You can select only a job resource group because Spark tasks are resource-intensive. If no options are available, create a job resource group first. For more information, see Create a resource group. |
Type | The application type: Batch (batch application), Streaming (streaming application), or SQL (Spark SQL application). |
Spark Statements | The Spark statements. You can enter the default task parameter templates or select existing Spark templates. Existing Spark templates refer to the Spark templates saved on the page after you choose . |
Retry Attempts | The maximum number of retry attempts for the current Spark statements and the interval between attempts. Note: If you set the maximum number of retries to 3 and the first retry succeeds, no further retries are attempted. |
Description | The description of the Spark task node. |

- Click OK to add the node.
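For reference, a Batch-type Spark task is typically described by a JSON configuration similar to the following sketch. The file path, class name, arguments, and resource specifications shown here are placeholder assumptions for illustration, not values from this document; replace them with your own application details.

```json
{
  "name": "SparkPi",
  "file": "oss://your-bucket/spark-examples.jar",
  "className": "org.apache.spark.examples.SparkPi",
  "args": ["1000"],
  "conf": {
    "spark.driver.resourceSpec": "medium",
    "spark.executor.instances": "2",
    "spark.executor.resourceSpec": "medium"
  }
}
```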
- In the upper-right corner of the canvas, click the icon. In the panel that appears, enter a workflow name, and configure global parameters.
Parameter | Description |
Set Global Parameters | The global parameters that can be used as variables in Spark statements for workflow nodes. For example, if you set the value of the global variable ${table} to test, the SELECT * FROM ${table}; statement is changed to SELECT * FROM test;. |
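Conceptually, both custom and global parameters are resolved by replacing each ${name} placeholder in a statement with the parameter's value before execution. Python's string.Template uses the same placeholder syntax, so substitution can be illustrated with this minimal sketch (an illustration only, not the scheduler's actual implementation):

```python
from string import Template

# ${table} in the statement is replaced by the parameter value
# before the statement is executed.
sql = "SELECT * FROM ${table};"
resolved = Template(sql).substitute({"table": "test"})
print(resolved)  # SELECT * FROM test;
```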
Publish a workflow
After you create a workflow, click the icon to publish the workflow.
Execute a workflow
- Run Now: immediately executes the current workflow. The following table describes the startup parameters.

Parameter | Description |
Failure Policy | Continue: if a task fails, the other parallel tasks continue to be executed until the entire process ends. End: if a task fails, all running tasks are terminated and the entire process is terminated. Note: Failure policies apply only to parallel tasks. If a serial task fails, none of the subsequent tasks can continue. |
Supplement Data | Specifies whether to enable the Supplement Data feature. If you enable this feature, data is supplemented every day at 00:00 within the specified time range for the workflow. You can set the execution method to Serial or Parallel. Serial: data is supplemented sequentially for each day within the time range, and only a single process instance is generated. Parallel: data is supplemented simultaneously for all days within the time range, and multiple process instances are generated. |
Set Startup Parameters | The global parameters that can be used as variables in SQL statements for workflow nodes. If you configured global parameters when you created the workflow, they are automatically populated as startup parameters. Global parameters are a subset of startup parameters. |
Rerun | Specifies whether to re-execute a process that was terminated due to a task failure. Note: You can also manually re-execute such a process on the Workflow Instances page. |

- Schedule: configures schedule parameters to repeatedly execute the workflow within a time range. The following table describes the schedule parameters.

Parameter | Description |
Schedule Status | Specifies whether the workflow schedule is published or unpublished. |
Start and End Time | The start and end time within which the task is scheduled to be executed. |
Schedule | The time at which the task is scheduled, specified in Minutes, Hours, Days, Months, and Weeks. After you select the scheduled time, you can click Execution Time to view the Time for Next Five Executions. |
Failure Policy | Continue: if a task fails, the other parallel tasks continue to be executed until the entire process ends. End: if a task fails, all running tasks are terminated and the entire process is terminated. Note: Failure policies apply only to parallel tasks. If a serial task fails, none of the subsequent tasks can continue. |
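To make the Execution Time preview concrete, the following sketch computes the kind of "next five executions" list the console shows, for the simple case of a daily schedule at a fixed time. This is an illustration under that assumption, not the console's actual scheduling logic:

```python
from datetime import datetime, timedelta

def next_executions(now, hour, minute, count=5):
    """Return the next `count` daily run times at hour:minute after `now`."""
    run = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if run <= now:
        run += timedelta(days=1)  # today's slot has already passed
    return [run + timedelta(days=i) for i in range(count)]

# Next five 02:00 daily runs as seen from 2024-01-01 08:30
for r in next_executions(datetime(2024, 1, 1, 8, 30), hour=2, minute=0):
    print(r.isoformat())
```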
Workflow state management
After you execute a workflow, you can view the state of the workflow and pause or terminate the workflow on the Workflow Instances page. You can also view the states of tasks on the Task Instances page. If the workflow fails, you can re-execute or recover the workflow.
View the state of a workflow instance
Operation | Description |
Rerun | Re-executes the workflow. |
Recover | Resumes the workflow from the failed node without re-executing the nodes that succeeded. |
Terminate | Terminates the running workflow. You can click Rerun to re-execute the workflow. |
Pause | Pauses the running workflow. You can click Rerun to re-execute the workflow. |
Delete | Deletes the workflow instance. |
View the state of a task instance
- For an SQL task, you can view the system running and scheduling logs of the task.
- For a Spark task, you can view the logs printed by the application as well as the system running and scheduling logs of the task.
Monitor workflows
- Submitted: displays the number of workflows that are submitted. After a workflow is submitted, the workflow waits to be executed by the scheduling system.
- To Be Paused: displays the number of workflows to be paused. After you initiate a pause operation on a workflow, the workflow waits to be paused by the scheduling system.
- To Be Stopped: displays the number of workflows to be stopped. After you initiate a terminate operation on a workflow, the workflow waits to be stopped by the scheduling system.