DataWorks lets you create Lindorm Spark nodes to develop and schedule Spark tasks written in Java, Scala, or Python. This topic walks you through uploading your code to LindormDFS, configuring the node, and running a test execution.
Background information
Lindorm provides a distributed computing service based on a cloud-native architecture. It supports community-compatible computing models, including Apache Spark, and is deeply integrated with the features provided by the Lindorm storage engine. Lindorm meets computing requirements in various scenarios, such as massive data processing, interactive analytics, machine learning, and graph computing.
Prerequisites
Before you begin, make sure you have:
- A Lindorm instance created and associated with your workspace as a computing resource.
- (RAM users only) Your RAM user added to the workspace and assigned the Development or Workspace Administrator role. See Add members to a workspace.
The Workspace Administrator role grants extensive permissions. Assign it only when necessary.
Create a Lindorm Spark node
For steps on creating the node, see Create a Lindorm Spark node.
Upload your code to LindormDFS
Before configuring the node, upload your JAR package or Python file to LindormDFS so the node can reference it. The upload steps are the same for all languages.
- Log on to the Lindorm console. In the top navigation bar, select the target region and find your Lindorm instance on the Instances page.
- Click the instance name to open the instance details page.
- In the left-side navigation pane, click Compute Engine.
- On the Job Management tab, click Upload Resource.
- In the upload dialog box, click the dotted area and select the file to upload:
  - Java or Scala: Upload a JAR package. For a quick test, download spark-examples_2.12-3.3.0.jar.
  - Python: Upload a .py file. For a quick test, save the following script as pi.py:

    ```python
    import sys
    from random import random
    from operator import add

    from pyspark.sql import SparkSession

    if __name__ == "__main__":
        """
        Usage: pi [partitions]
        """
        spark = SparkSession \
            .builder \
            .appName("PythonPi") \
            .getOrCreate()

        partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
        n = 100000 * partitions

        def f(_: int) -> float:
            x = random() * 2 - 1
            y = random() * 2 - 1
            return 1 if x ** 2 + y ** 2 <= 1 else 0

        count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
        print("Pi is roughly %f" % (4.0 * count / n))

        spark.stop()
    ```

- Click Upload.
- After the upload completes, find the file under Upload Resource on the Job Management tab. Click the copy icon to the left of the file to copy its storage path in LindormDFS. You need this path when configuring the node.
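Before you upload, you can optionally sanity-check pi.py on any machine with a local Spark installation. This assumes spark-submit is on your PATH; the trailing argument is the partition count read from sys.argv[1] in the script:

```
spark-submit pi.py 10
```

A successful run prints a line such as "Pi is roughly 3.14...".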
Configure the Lindorm Spark node
On the configuration tab of the node, set the parameters for your language.
Java or Scala
| Parameter | Description |
|---|---|
| Main JAR Resource | The LindormDFS storage path you copied in the previous step. |
| Main Class | The fully qualified main class name in the JAR. For the sample JAR, use org.apache.spark.examples.SparkPi. |
| Parameters | Runtime parameters passed to the program. Use the ${var} format for dynamic parameters. |
| Configuration Items | Spark runtime properties. For available properties, see Job configuration instructions. To set global Spark properties shared across jobs, configure them when you associate the Lindorm computing resource. |
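As a concrete illustration, a node set up to run the sample JAR might be configured as follows. The values are illustrative: the Parameters value is the optional partition count accepted by SparkPi, and the Configuration Items shown are standard Spark properties; the exact set supported by Lindorm is listed in Job configuration instructions.

```
Main JAR Resource:   <LindormDFS path copied in the previous step>
Main Class:          org.apache.spark.examples.SparkPi
Parameters:          100
Configuration Items: spark.executor.instances=2
                     spark.executor.memory=4g
```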
Python
| Parameter | Description |
|---|---|
| Main Package | The LindormDFS storage path of the .py file you copied in the previous step. |
| Parameters | Runtime parameters passed to the script. Use the ${var} format for dynamic parameters. |
| Configuration Items | Spark runtime properties. For available properties, see Job configuration instructions. |
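For the pi.py sample above, a minimal configuration might look like this. The single parameter maps to sys.argv[1] in the script and sets the partition count; if omitted, the script defaults to 2:

```
Main Package: <LindormDFS path of pi.py copied in the previous step>
Parameters:   10
```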
Debug the Lindorm Spark node
- In the right-side navigation pane, click the Run Configuration tab and set the following parameters:

  | Parameter | Description |
  |---|---|
  | Computing Resource | Select the Lindorm computing resource associated with your workspace. |
  | Lindorm Resource Group | Select the Lindorm resource group you specified when associating the computing resource. |
  | Resource Group | Select the resource group that passed the connectivity test during computing resource association. |
  | Script Parameters | If you defined variables in the ${Parameter name} format, enter the Parameter Name and Parameter Value here. These values are substituted at runtime (see the example after this list). For more information, see Sources and expressions of scheduling parameters. |

- Save and run the node.
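For example, if the node's Parameters field is set to ${bizdate}, enter bizdate as the Parameter Name and a test value as the Parameter Value. The name bizdate and the value below are illustrative; see Sources and expressions of scheduling parameters for the supported expressions.

```
Parameters (node configuration): ${bizdate}
Parameter Name:                  bizdate
Parameter Value:                 20250101
```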
What's next
- Node scheduling configuration: Click Properties in the right-side navigation pane and configure scheduling properties under Scheduling Policies to have DataWorks run the node on a schedule.
- Node deployment: Click the deployment icon in the top toolbar to deploy the node to the production environment. Nodes can be scheduled periodically only after they are deployed to production.