
AnalyticDB for MySQL: Perform job scheduling in Data Lakehouse Edition

Last Updated: Jan 31, 2024

AnalyticDB for MySQL Data Lakehouse Edition (V3.0) provides job scheduling for batch SQL applications and Spark applications, which allows you to implement flexible extract, transform, and load (ETL) data processing and development.

Flowchart

(Figure: job scheduling workflow)

Prerequisites

  • If you use an Alibaba Cloud account to perform job scheduling, make sure that the following conditions are met:
    • A privileged account is created for an AnalyticDB for MySQL cluster. For more information, see Create a database account.
    • A job resource group with at least 8 AnalyticDB Compute Units (ACUs) of reserved computing resources is created. For more information, see Create a resource group.
  • If you perform job scheduling as a Resource Access Management (RAM) user, make sure that the following conditions are met:

Project management

A project is a collection of workflows. Projects help you manage workflows in a centralized manner. Projects can be used only by Alibaba Cloud accounts and authorized RAM users. The following list describes the permissions granted to each role.
  • Administrator (policy: AliyunADBFullAccess): Has all permissions on a project.
  • Developer (policy: AliyunADBDeveloperAccess): Has edit permissions on a project. For example, a developer can edit workflows, scheduled tasks, and workflow nodes.
  • Visitor (policy: AliyunADBReadOnlyAccess): Has read-only permissions on a project. For example, a visitor can view process designs and code content such as workflows and scheduled tasks.

Project management allows you to create projects, edit and delete existing projects, and manage members.

Create a project

  1. Log on to the AnalyticDB for MySQL console.
  2. In the upper-left corner of the page, select a region.
  3. In the left-side navigation pane, click Clusters.
  4. On the Data Lakehouse Edition (V3.0) tab, find the cluster that you want to manage and click the Cluster ID.
  5. In the left-side navigation pane, click Job Scheduling.
  6. On the Projects tab, click Create Project.
  7. In the panel that appears, enter a project name and click OK.

Manage members

  1. Find a project and click the 1 icon in the Actions column.
  2. Click Add Member.
  3. In the Members section, select unauthorized accounts and click the 2 icon.
    Note Unauthorized accounts are all RAM users within the Alibaba Cloud account that have not been added to the project as members.
  4. Click OK.

Workflow orchestration

Create a workflow

  1. In the upper-right corner, choose Workflows > Workflow Templates.
  2. Click Create Workflow.
    Note SQL and Spark task nodes can be created.
  3. On the Create Workflow page, drag a node to the canvas and configure its parameters.
    The following list describes the parameters configured for SQL task nodes.
    • Node Name: The name of the SQL task node.
    • Schema: The schema configured for the SQL task node.
    • SQL Statements: The SQL statements. You can enter SQL statements or select existing SQL statements. Existing SQL statements refer to the SQL statements saved on the page after you choose Job Editor > SQL Development.
    • Custom Parameters: The custom parameters that can be used as variables in SQL statements. For example, if you set the prop of a custom parameter to table and its value to customer, the SELECT * FROM ${table}; statement is changed to SELECT * FROM customer;. For a sketch of how this fits together, see the example after this procedure.
    • Pre-SQL: The SQL statements that are executed before the SQL statements of the current workflow node. Only DML statements are allowed, such as INSERT, UPDATE, and DELETE.
    • Post-SQL: The SQL statements that are executed after the SQL statements of the current workflow node. Only DDL statements are allowed, such as CREATE and DROP.
    • Retry Attempts: The maximum number of retry attempts for the current SQL statements and the interval between attempts.
      Note If you set the maximum number of retries to 3 and the first retry succeeds, no further retries are attempted.
    • Description: The description of the SQL task node.
    The following list describes the parameters configured for Spark task nodes.
    • Node Name: The name of the Spark task node.
    • Resource Group: The resource group configured for the Spark task node. You can select only a job resource group because Spark tasks are resource-intensive. If no options are available, you must create a job resource group first. For more information, see Create a resource group.
    • Type:
      • Batch: batch application
      • Streaming: streaming application
      • SQL: Spark SQL application
    • Spark Statements: The Spark statements. You can enter statements based on the default task parameter templates or select existing Spark templates. Existing Spark templates refer to the Spark templates saved on the page after you choose Job Editor > Spark JAR Development. For a Spark SQL sketch, see the example after this procedure.
    • Retry Attempts: The maximum number of retry attempts for the current Spark statements and the interval between attempts.
      Note If you set the maximum number of retries to 3 and the first retry succeeds, no further retries are attempted.
    • Description: The description of the Spark task node.
  4. Click OK to add the node.
  5. In the upper-right corner of the canvas, click the 3 icon. In the panel that appears, enter a workflow name, and configure global parameters.
    • Set Global Parameters: The global parameters that can be used as variables in the SQL and Spark statements of workflow nodes. For example, if you set the value of the global variable ${table} to test, the SELECT * FROM ${table}; statement is changed to SELECT * FROM test;.
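
The following sketch shows how the Custom Parameters, Pre-SQL, and Post-SQL settings of an SQL task node might fit together. It assumes a custom parameter whose prop is table and whose value is customer, as well as hypothetical tables named sales_staging, customer, and sales_tmp; adjust the statements to your own schema.

    -- Pre-SQL (DML only): clear the hypothetical staging table before the main statements run.
    DELETE FROM sales_staging;

    -- SQL Statements: ${table} is replaced with the value of the custom parameter
    -- whose prop is table and whose value is customer, so the statement runs as
    -- INSERT INTO sales_staging SELECT * FROM customer;
    INSERT INTO sales_staging
    SELECT * FROM ${table};

    -- Post-SQL (DDL only): drop a hypothetical temporary table after the main statements finish.
    DROP TABLE IF EXISTS sales_tmp;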
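
Similarly, for a Spark task node of the SQL type, the Spark statements are Spark SQL. The following minimal sketch assumes a global parameter ${table} whose value is test and a hypothetical column layout; it is illustrative only and is not a template provided by the console.

    -- Spark SQL sketch: ${table} is resolved from the global parameter (value: test)
    -- before the statement is executed, so this runs as SELECT id, name FROM test ...
    SELECT id, name
    FROM ${table}
    WHERE name IS NOT NULL;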

Publish a workflow

After you create a workflow, click the 4 icon to publish the workflow.

Execute a workflow

You can use one of the following methods to execute a workflow:
  • Run Now: immediately executes the current workflow. The following list describes the startup parameters.
    • Failure Policy:
      • Continue: If a task fails, the other parallel tasks continue to be executed until the entire process ends.
      • End: If a task fails, all running tasks are terminated and the entire process is terminated.
      Note Failure policies apply only to parallel task scenarios. In serial task scenarios, if a task fails, none of the other serial tasks can continue.
    • Supplement Data: Specifies whether to enable the Supplement Data feature. If you enable this feature, data is supplemented at 00:00 every day within the specified scheduling time range of the workflow. You can set the execution method to Serial or Parallel.
      • Serial: Within the specified time range, data is supplemented sequentially, one day at a time. A single process instance is generated.
      • Parallel: Within the specified time range, data is supplemented for all days at the same time. Multiple process instances are generated.
    • Set Startup Parameters: The global parameters that can be used as variables in the SQL statements of workflow nodes. If you configured global parameters when you created the workflow, they are automatically populated as startup parameters. Global parameters are a subset of startup parameters.
    • Rerun: Specifies whether to re-execute a process that is terminated due to a task failure.
      Note You can also manually re-execute a process that is terminated due to a task failure on the Workflow Instances page.
  • Schedule: configures schedule parameters so that the workflow is repeatedly executed within a time range. The following list describes the schedule parameters.
    • Schedule Status: Specifies whether the schedule is published or unpublished.
    • Start and End Time: The time range within which the task is scheduled to be executed.
    • Schedule: The time at which the task is scheduled to be executed, specified by Minutes, Hours, Days, Months, and Weeks. After you select the scheduled time, you can click Execution Time to view the Time for Next Five Executions value.
    • Failure Policy:
      • Continue: If a task fails, the other parallel tasks continue to be executed until the entire process ends.
      • End: If a task fails, all running tasks are terminated and the entire process is terminated.
      Note Failure policies apply only to parallel task scenarios. In serial task scenarios, if a task fails, none of the other serial tasks can continue.

Workflow state management

After you execute a workflow, you can view the state of the workflow and pause or terminate the workflow on the Workflow Instances page. You can also view the states of tasks on the Task Instances page. If the workflow fails, you can re-execute or recover the workflow.

View the state of a workflow instance

After a workflow is published and executed, a workflow instance is generated. The following list describes the operations that you can perform on the workflow instance.
  • Rerun: Re-executes the workflow.
  • Recover: Resumes the workflow from the failed node without re-executing the nodes that succeeded.
  • Terminate: Terminates the running workflow. You can click Rerun to re-execute the workflow.
  • Pause: Pauses the running workflow. You can click Rerun to re-execute the workflow.
  • Delete: Deletes the workflow instance.

View the state of a task instance

You can view the execution states and logs of task nodes in a workflow instance.
  • For an SQL task, you can view the system running and scheduling logs of the task.
  • For a Spark task, you can view the logs that are manually printed by the application as well as the system running and scheduling logs of the task.

Monitor workflows

You can monitor the states of all workflows and task nodes in a project.
  • Submitted: displays the number of workflows that are submitted. After a workflow is submitted, the workflow waits to be executed by the scheduling system.
  • To Be Paused: displays the number of workflows to be paused. After you initiate a pause operation on a workflow, the workflow waits to be paused by the scheduling system.
  • To Be Stopped: displays the number of workflows to be stopped. After you initiate a terminate operation on a workflow, the workflow waits to be stopped by the scheduling system.