AnalyticDB for MySQL Data Lakehouse Edition (V3.0) provides job scheduling for batch SQL and Spark applications to implement flexible extract-transform-load (ETL) data processing and development.
Prerequisites
- If you use an Alibaba Cloud account to perform job scheduling, make sure that the following conditions are met:
- A privileged account is created for an AnalyticDB for MySQL cluster. For more information, see Create a database account.
- A job resource group with at least 8 AnalyticDB Compute Units (ACUs) of reserved computing resources is created. For more information, see Create a resource group.
- If you perform job scheduling as a Resource Access Management (RAM) user, make sure that the following conditions are met:
- The AliyunServiceRoleForAnalyticDBForMySQL role of AnalyticDB for MySQL is configured. For more information, see Role description.
- A standard account is created for an AnalyticDB for MySQL cluster and associated with the RAM user. For more information, see Associate or disassociate a database account with or from a RAM user.
- A job resource group with at least 8 AnalyticDB Compute Units (ACUs) of reserved computing resources is created. For more information, see Create a resource group.
Project management
Role | Policy | Permission description |
Administrator | AliyunADBFullAccess | Has all permissions on a project. |
Developer | AliyunADBDeveloperAccess | Has edit permissions on a project. For example, a developer can edit workflows, scheduled tasks, and workflow nodes. |
Visitor | AliyunADBReadOnlyAccess | Has read-only permissions on a project. For example, a visitor can view process design and code content such as workflows and scheduled tasks. |
Project management allows you to create projects, edit and delete existing projects, and manage members.
Create a project
- Log on to the AnalyticDB for MySQL console.
- In the upper-left corner of the page, select a region.
- In the left-side navigation pane, click Clusters.
- On the Data Lakehouse Edition (V3.0) tab, find the cluster that you want to manage and click the Cluster ID.
- In the left-side navigation pane, click Job Scheduling.
- On the Projects tab, click Create Project.
- In the panel that appears, enter a project name and click OK.
Manage members
- Find a project and click the icon in the Actions column.
- Click Add Member.
- In the Members section, select unauthorized accounts and click the icon. Note: Unauthorized accounts are all RAM users within the Alibaba Cloud account that have not yet been added as project members.
- Click OK.
Workflow orchestration
Create a workflow
- In the upper-right corner, choose .
- Click Create Workflow. Note: You can create SQL and Spark task nodes.
- On the Create Workflow page, drag a node to the canvas and configure its parameters. The following table describes the parameters configured for SQL task nodes.
Parameter | Description |
Node Name | The name of the SQL task node. |
Schema | The schema used by the SQL task node. |
SQL Statements | The SQL statements to execute. You can enter SQL statements or select existing SQL statements. Existing SQL statements refer to the SQL statements saved on the page after you choose . |
Custom Parameters | The custom parameters that can be used as variables in SQL statements. For example, if you set the prop of a custom parameter to table and its value to customer, the SELECT * FROM ${table}; statement is changed to SELECT * FROM customer;. |
Pre-SQL | The SQL statements that are executed before the SQL statements of the current workflow. Only DML statements are allowed, such as INSERT, UPDATE, and DELETE. |
Post-SQL | The SQL statements that are executed after the SQL statements of the current workflow. Only DDL statements are allowed, such as CREATE and DROP. |
Retry Attempts | The maximum number of retry attempts for the current SQL statements and the interval between attempts. Note: If you set the maximum number of retries to 3 and the first retry succeeds, no further retries are attempted. |
Description | The description of the SQL task node. |

The following table describes the parameters configured for Spark task nodes.

Parameter | Description |
Node Name | The name of the Spark task node. |
Resource Group | The resource group used by the Spark task node. You can select only a job resource group because Spark tasks are resource-intensive. If no options are available, create a job resource group first. For more information, see Create a resource group. |
Type | The application type: Batch (batch application), Streaming (streaming application), or SQL (Spark SQL application). |
Spark Statements | The Spark statements. You can enter the default task parameter templates or select existing Spark templates. Existing Spark templates refer to the Spark templates saved on the page after you choose . |
Retry Attempts | The maximum number of retry attempts for the current Spark statements and the interval between attempts. Note: If you set the maximum number of retries to 3 and the first retry succeeds, no further retries are attempted. |
Description | The description of the Spark task node. |

- Click OK to add the node.
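For reference, a Batch-type Spark task is typically described by a JSON configuration similar to the following sketch. The file path, class name, arguments, and resource specifications shown here are placeholder assumptions for illustration, not values from this document; replace them with your own application details.

```json
{
  "name": "SparkPi",
  "file": "oss://your-bucket/spark-examples.jar",
  "className": "org.apache.spark.examples.SparkPi",
  "args": ["1000"],
  "conf": {
    "spark.driver.resourceSpec": "medium",
    "spark.executor.instances": "2",
    "spark.executor.resourceSpec": "medium"
  }
}
```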
- In the upper-right corner of the canvas, click the icon. In the panel that appears, enter a workflow name, and configure global parameters.
Parameter | Description |
Set Global Parameters | The global parameters that can be used as variables in Spark statements for workflow nodes. For example, if you set the value of the global variable ${table} to test, the SELECT * FROM ${table}; statement is changed to SELECT * FROM test;. |
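Conceptually, both custom and global parameters are resolved by replacing each ${name} placeholder in a statement with the parameter's value before execution. Python's string.Template uses the same placeholder syntax, so substitution can be illustrated with this minimal sketch (an illustration only, not the scheduler's actual implementation):

```python
from string import Template

# ${table} in the statement is replaced by the parameter value
# before the statement is executed.
sql = "SELECT * FROM ${table};"
resolved = Template(sql).substitute({"table": "test"})
print(resolved)  # SELECT * FROM test;
```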
Publish a workflow
After you create a workflow, click the icon to publish the workflow.
Execute a workflow
- Run Now: immediately executes the current workflow. The following table describes the startup parameters.

Parameter | Description |
Failure Policy | Continue: if a task fails, the other parallel tasks continue to be executed until the entire process ends. End: if a task fails, all running tasks are terminated and the entire process is terminated. Note: Failure policies apply only to parallel tasks. If a serial task fails, none of the subsequent tasks can continue. |
Supplement Data | Specifies whether to enable the Supplement Data feature. If you enable this feature, data is supplemented every day at 00:00 within the specified time range for the workflow. You can set the execution method to Serial or Parallel. Serial: data is supplemented sequentially for each day within the time range, and only a single process instance is generated. Parallel: data is supplemented simultaneously for all days within the time range, and multiple process instances are generated. |
Set Startup Parameters | The global parameters that can be used as variables in SQL statements for workflow nodes. If you configured global parameters when you created the workflow, they are automatically populated as startup parameters. Global parameters are a subset of startup parameters. |
Rerun | Specifies whether to re-execute a process that was terminated due to a task failure. Note: You can also manually re-execute such a process on the Workflow Instances page. |

- Schedule: configures schedule parameters to repeatedly execute the workflow within a time range. The following table describes the schedule parameters.

Parameter | Description |
Schedule Status | Specifies whether the workflow schedule is published or unpublished. |
Start and End Time | The start and end time within which the task is scheduled to be executed. |
Schedule | The time at which the task is scheduled, specified in Minutes, Hours, Days, Months, and Weeks. After you select the scheduled time, you can click Execution Time to view the Time for Next Five Executions. |
Failure Policy | Continue: if a task fails, the other parallel tasks continue to be executed until the entire process ends. End: if a task fails, all running tasks are terminated and the entire process is terminated. Note: Failure policies apply only to parallel tasks. If a serial task fails, none of the subsequent tasks can continue. |
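To make the Execution Time preview concrete, the following sketch computes the kind of "next five executions" list the console shows, for the simple case of a daily schedule at a fixed time. This is an illustration under that assumption, not the console's actual scheduling logic:

```python
from datetime import datetime, timedelta

def next_executions(now, hour, minute, count=5):
    """Return the next `count` daily run times at hour:minute after `now`."""
    run = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if run <= now:
        run += timedelta(days=1)  # today's slot has already passed
    return [run + timedelta(days=i) for i in range(count)]

# Next five 02:00 daily runs as seen from 2024-01-01 08:30
for r in next_executions(datetime(2024, 1, 1, 8, 30), hour=2, minute=0):
    print(r.isoformat())
```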
Workflow state management
After you execute a workflow, you can view the state of the workflow and pause or terminate the workflow on the Workflow Instances page. You can also view the states of tasks on the Task Instances page. If the workflow fails, you can re-execute or recover the workflow.
View the state of a workflow instance
Operation | Description |
Rerun | Re-executes the workflow. |
Recover | Resumes the workflow from the failed node without re-executing the nodes that succeeded. |
Terminate | Terminates the running workflow. You can click Rerun to re-execute the workflow. |
Pause | Pauses the running workflow. You can click Rerun to re-execute the workflow. |
Delete | Deletes the workflow instance. |
View the state of a task instance
- For an SQL task, you can view the system running and scheduling logs of the task.
- For a Spark task, you can view the logs printed by the application as well as the system running and scheduling logs of the task.
Monitor workflows
- Submitted: displays the number of workflows that are submitted. After a workflow is submitted, the workflow waits to be executed by the scheduling system.
- To Be Paused: displays the number of workflows to be paused. After you initiate a pause operation on a workflow, the workflow waits to be paused by the scheduling system.
- To Be Stopped: displays the number of workflows to be stopped. After you initiate a terminate operation on a workflow, the workflow waits to be stopped by the scheduling system.