Manage Lindorm Spark jobs by using Data Management Service (DMS)

Prerequisites

Before you begin, ensure that you have:

An activated DMS account
LDPS activated for your Lindorm instance. See Activate LDPS and modify the configurations
A Spark job file (JAR, Python, or SQL). See Create a job in Java or Create a job in Python
The job file uploaded to Hadoop Distributed File System (HDFS) or Object Storage Service (OSS). See Upload files in the Lindorm console

Create a Lindorm Spark task flow

To create a task flow, you need: a task flow name, the path to your Spark job file in HDFS or OSS, and your Lindorm instance ID and region.

Log on to the DMS console V5.0.
Go to the Task Orchestration page.
- Simple mode: In the Scene Guide section, click Data Transmission and Processing (DTS). Then click Task Orchestration in the Data processing section.
- Normal mode: In the top navigation bar, choose DTS > Data Development > Task Orchestration.
Click Create Task Flow.
In the Create Task Flow dialog box, enter a Task Flow Name and optional Description, then click OK.
In the Task Type section on the left, drag Lindorm Spark nodes onto the canvas. Connect nodes to define dependencies between them.

Configure each Lindorm Spark node:

Double-click the node, or click the node and then click the icon.

In the Basic configuration section, set the following parameters:

Parameter	Description
Region	The region where your Lindorm instance is deployed.
Lindorm Instance	The ID of your Lindorm instance.
Task Type	The Spark job type: JAR, Python, or SQL.

In the Job configuration section, paste and edit the configuration template for your job type:

JAR job

{
  "mainResource": "oss://path/to/your/file.jar",
  "mainClass": "path.to.main.class",
  "args": ["arg1", "arg2"],
  "configs": {
    "spark.hadoop.fs.oss.endpoint": "",
    "spark.hadoop.fs.oss.accessKeyId": "",
    "spark.hadoop.fs.oss.accessKeySecret": "",
    "spark.hadoop.fs.oss.impl": "org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem",
    "spark.sql.shuffle.partitions": "20"
  }
}

Parameter	Type	Required	Description	Example
`mainResource`	String	Yes	Path to the JAR file in HDFS or OSS.	HDFS: `hdfs:///path/spark-examples_2.12-3.1.1.jar`; OSS: `oss://testBucketName/path/spark-examples_2.12-3.1.1.jar`
`mainClass`	String	Yes	Entry point class for the JAR job.	`com.aliyun.ldspark.SparkPi`
`args`	Array	No	Arguments passed to `mainClass`.	`["arg1", "arg2"]`
`configs`	JSON	No	Spark system parameters. If the job is stored in OSS, configure the OSS keys below.	`{"spark.sql.shuffle.partitions": "200"}`

If the JAR file is stored in OSS, set the following keys inside configs:

Key	Description
`spark.hadoop.fs.oss.endpoint`	OSS endpoint where the job file is stored.
`spark.hadoop.fs.oss.accessKeyId`	AccessKey ID for OSS access. See Obtain an AccessKey pair.
`spark.hadoop.fs.oss.accessKeySecret`	AccessKey secret for OSS access. See Obtain an AccessKey pair.
`spark.hadoop.fs.oss.impl`	OSS file system class. Set to `org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem`.
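
As a rough illustration, the JAR template above can be assembled programmatically before you paste it into the Job configuration section. The following Python sketch is not part of DMS or Lindorm: `build_jar_job_config` and its parameters are hypothetical names, and the endpoint and credential values are placeholders. It adds the four OSS keys only when `mainResource` uses the `oss://` scheme.

```python
import json

OSS_IMPL = "org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem"


def build_jar_job_config(main_resource, main_class, args=None,
                         oss_endpoint="", access_key_id="",
                         access_key_secret="", extra_configs=None):
    """Return a JAR job configuration as a dict (hypothetical helper)."""
    if not main_resource.startswith(("hdfs://", "oss://")):
        raise ValueError("mainResource must be an hdfs:// or oss:// path")
    if not main_class:
        raise ValueError("mainClass is required for JAR jobs")

    configs = {}
    if main_resource.startswith("oss://"):
        # The four OSS keys listed in the table above.
        configs["spark.hadoop.fs.oss.endpoint"] = oss_endpoint
        configs["spark.hadoop.fs.oss.accessKeyId"] = access_key_id
        configs["spark.hadoop.fs.oss.accessKeySecret"] = access_key_secret
        configs["spark.hadoop.fs.oss.impl"] = OSS_IMPL
    # Any other Spark system parameters, such as shuffle partitions.
    configs.update(extra_configs or {})

    job = {"mainResource": main_resource, "mainClass": main_class}
    if args:
        job["args"] = list(args)
    if configs:
        job["configs"] = configs
    return job


if __name__ == "__main__":
    job = build_jar_job_config(
        "hdfs:///path/spark-examples_2.12-3.1.1.jar",
        "com.aliyun.ldspark.SparkPi",
        args=["arg1", "arg2"],
        extra_configs={"spark.sql.shuffle.partitions": "20"},
    )
    print(json.dumps(job, indent=2))
```

Keeping the OSS keys conditional means an HDFS-based job does not carry empty credential fields.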

Python job

{
  "mainResource": "oss://path/to/your/file.py",
  "args": ["arg1", "arg2"],
  "configs": {
    "spark.hadoop.fs.oss.endpoint": "",
    "spark.hadoop.fs.oss.accessKeyId": "",
    "spark.hadoop.fs.oss.accessKeySecret": "",
    "spark.hadoop.fs.oss.impl": "org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem",
    "spark.submit.pyFiles": "oss://path/to/your/project_file.py,oss://path/to/your/project_module.zip",
    "spark.archives": "oss://path/to/your/environment.tar.gz#environment",
    "spark.sql.shuffle.partitions": "20"
  }
}

Parameter	Type	Required	Description	Example
`mainResource`	String	Yes	Path to the Python file in OSS or HDFS.	OSS: `oss://testBucketName/path/spark-examples.py`; HDFS: `hdfs:///path/spark-examples.py`
`args`	Array	No	Arguments passed to the Python script.	`["arg1", "arg2"]`
`configs`	JSON	No	Spark system parameters. If the job is stored in OSS, configure the four OSS keys (same as JAR job) plus the Python-specific keys below.	`{"spark.sql.shuffle.partitions": "200"}`

Python-specific keys in configs:

Key	Description
`spark.submit.pyFiles`	Comma-separated OSS paths to additional Python files or ZIP modules.
`spark.archives`	OSS path to a Python environment archive (`.tar.gz`).
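
The two Python-specific keys can be derived from a list of dependency files and an environment archive. The sketch below is illustrative only: `build_python_job_config` and its parameter names are hypothetical, and the four OSS access keys are left out for brevity and still need to be added for OSS-hosted files. The `#environment` suffix follows the template above. In Spark, the text after `#` in `spark.archives` names the directory into which the archive is unpacked.

```python
import json


def build_python_job_config(main_resource, py_files=(), archive=None,
                            archive_alias="environment", args=None):
    """Return a Python job configuration as a dict (hypothetical helper)."""
    if not main_resource.endswith(".py"):
        raise ValueError("mainResource should point to a .py file")

    configs = {}
    if py_files:
        # spark.submit.pyFiles takes a single comma-separated string.
        configs["spark.submit.pyFiles"] = ",".join(py_files)
    if archive:
        # "<path>#<alias>": the alias is the unpack directory name.
        configs["spark.archives"] = f"{archive}#{archive_alias}"

    job = {"mainResource": main_resource}
    if args:
        job["args"] = list(args)
    if configs:
        job["configs"] = configs
    return job


if __name__ == "__main__":
    job = build_python_job_config(
        "oss://path/to/your/file.py",
        py_files=[
            "oss://path/to/your/project_file.py",
            "oss://path/to/your/project_module.zip",
        ],
        archive="oss://path/to/your/environment.tar.gz",
    )
    print(json.dumps(job, indent=2))
```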

Click Try Run in the upper-left corner to verify the job runs as expected.

After all nodes are configured, click Publish in the upper-left corner to publish the task flow.

View publishing history and logs

On the Task Orchestration page, click the name of the task flow.
Click Go to O&M in the upper-right corner.

View the history or logs:
- Publishing history: On the Task Flow Information page, click the Published Tasks tab to see all published versions of the task flow.
- Run logs: On the Running History tab, select Scheduling Trigger or Triggered Manually from the drop-down list to filter runs. The list shows all nodes in the task flow and their execution status. Click View for a node to see the submission logs for the Lindorm Spark job and to obtain the job ID and Spark UI URL of the node.

If job submission fails, record the job ID and the Spark UI URL before you submit a ticket.

Advanced settings

After changing any advanced setting in the DMS console, republish the task flow for the changes to take effect.

Configure scheduling

Set a scheduling policy to run the task flow automatically on a fixed schedule.

On the Task Orchestration page, click the name of the task flow.
In the lower-left corner, click Task Flow Information.

In the Scheduling Settings section on the right, turn on Enable Scheduling and configure the policy. The following table describes the parameters.

Parameter	Description
Scheduling Type	Cyclic scheduling: runs on a repeating schedule (for example, every week). Schedule once: runs once at a specific time.
Effective Time	The period during which the scheduling policy is active. Defaults to January 1, 1970–January 1, 9999 (permanent).
Scheduling Cycle	Frequency of execution: Hour, Day, Week, or Month.
Timed Scheduling	How to define the trigger time within the cycle. Run at an interval: set Starting Time, Intervals (hours), and End Time. Run at the specified point in time: pick specific hours using the Specified Time field.
Specified Time	If Scheduling Cycle is Week, select days of the week. If Month, select days of the month.
Specific Point in Time	The exact time on specified days when the task flow runs (for example, `02:55`).
Cron expression	Auto-generated cron expression based on the scheduling parameters above.

Example: To run a task flow at 00:00 and 12:00 every day:

Set Scheduling Type to Cyclic scheduling.
Set Scheduling Cycle to Hour.
In Timed Scheduling, select Run at the specified point in time. Then, in Specified Time, select 0Hour and 12Hour.
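
To give a sense of what the auto-generated cron expression for this example might look like, the sketch below builds a Quartz-style six-field expression (seconds, minutes, hours, day of month, month, day of week). The exact format that DMS generates is not specified here, so the field layout is an assumption, and `hours_to_cron` is a hypothetical helper rather than a DMS API.

```python
def hours_to_cron(hours, minute=0):
    """Build a Quartz-style cron string that fires daily at the given hours.

    Assumed layout: "second minute hour day-of-month month day-of-week".
    """
    if not hours:
        raise ValueError("at least one hour is required")
    for hour in hours:
        if not 0 <= hour <= 23:
            raise ValueError(f"hour out of range: {hour}")
    if not 0 <= minute <= 59:
        raise ValueError(f"minute out of range: {minute}")

    # Deduplicate and sort so the expression is stable.
    hour_field = ",".join(str(h) for h in sorted(set(hours)))
    return f"0 {minute} {hour_field} * * ?"


if __name__ == "__main__":
    # 00:00 and 12:00 every day, matching the example above.
    print(hours_to_cron([0, 12]))
```

Compare the result with the value shown in the Cron expression field rather than relying on this sketch.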

Configure variables

For task flows with cyclic scheduling, configure time variables to pass dynamic date values to your jobs. For example, the built-in bizdate variable resolves to the day before the scheduled execution time.

On the task flow page, double-click the Lindorm Spark node, or click it and then click the icon.
In the right-side navigation pane, click Variable Setting.
On the Node Variable or Task Flow Variable tab, add the variable.
Reference the variable in the Job configuration section. For all available variables, see Variables.
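
The following sketch mimics locally how a time variable such as bizdate could be resolved and substituted into a job configuration. It rests on two assumptions that this page does not confirm: that variables are referenced as `${name}`, and that bizdate is rendered as `yyyyMMdd`. The functions `resolve_bizdate` and `render` are hypothetical and only simulate the behavior. DMS performs the actual substitution.

```python
from datetime import datetime, timedelta


def resolve_bizdate(scheduled_at, fmt="%Y%m%d"):
    """Return the day before the scheduled execution time as a string."""
    return (scheduled_at - timedelta(days=1)).strftime(fmt)


def render(template, variables):
    """Replace each ${name} placeholder with its value (assumed syntax)."""
    for name, value in variables.items():
        template = template.replace("${" + name + "}", value)
    return template


if __name__ == "__main__":
    scheduled_at = datetime(2024, 3, 1, 2, 55)
    variables = {"bizdate": resolve_bizdate(scheduled_at)}
    job_config = '{"mainResource": "oss://path/to/your/file.py", "args": ["${bizdate}"]}'
    print(render(job_config, variables))
```

A run scheduled for March 1, 2024 would pass the previous day's date, February 29, 2024, as the first argument.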

Manage notifications

Enable notifications to receive alerts based on task flow execution results.

In the lower-left corner of the task flow page, click Notification Configurations.
Turn on the notification types you need:
- Basic Notifications
  - Success Notification: triggers when the task flow completes successfully.
  - Failure Notification: triggers when the task flow fails.
- Timeout Notification: triggers when the task flow execution time exceeds the configured timeout threshold.
- Alert Notification: triggers when the task flow is about to start.
(Optional) Configure notification recipients. See Manage notification rules.

Execute SQL statements

Log on to the DMS console V5.0.
Click the Home tab.
In the left-side navigation pane, click the icon to add an instance.
In the Add Instance dialog box, select Lindorm_Compute in the NoSQL Database section.
Enter the Instance Region, Instance ID, Database Account, and Database password, then click Submit.
In the confirmation dialog box, click Submit to open the SQL Console.
On the SQLConsole tab, enter your SQL statement and click Execute.

What's next

For more about DMS task orchestration, see Overview.