DataWorks lets you create Lindorm Spark nodes to develop and schedule Spark tasks written in Java, Scala, or Python. This topic walks you through uploading your code to LindormDFS, configuring the node, and running a test execution.
Background information
Lindorm provides a distributed computing service based on a cloud-native architecture. It supports community-compatible computing models, including Apache Spark, and is deeply integrated with the features provided by the Lindorm storage engine. Lindorm meets computing requirements in various scenarios, such as massive data processing, interactive analytics, machine learning, and graph computing.
Prerequisites
Before you begin, make sure you have:
- A Lindorm instance created and associated with your workspace as a computing resource.
- (RAM users only) Your RAM user added to the workspace and assigned the Development or Workspace Administrator role. See Add members to a workspace.
The Workspace Administrator role grants extensive permissions. Assign it only when necessary.
Create a Lindorm Spark node
For steps on creating the node, see Create a Lindorm Spark node.
Upload your code to LindormDFS
Before configuring the node, upload your JAR package or Python file to LindormDFS so the node can reference it. The upload steps are the same for all languages.
- Log on to the Lindorm console. In the top navigation bar, select the target region and find your Lindorm instance on the Instances page.
- Click the instance name to open the instance details page.
- In the left-side navigation pane, click Compute Engine.
- On the Job Management tab, click Upload Resource.
- In the upload dialog box, click the dotted area and select the file to upload:
  - Java or Scala: Upload a JAR package. For a quick test, download spark-examples_2.12-3.3.0.jar.
  - Python: Upload a .py file. For a quick test, save the following script as pi.py:

    ```python
    import sys
    from random import random
    from operator import add

    from pyspark.sql import SparkSession

    if __name__ == "__main__":
        """
        Usage: pi [partitions]
        """
        spark = SparkSession \
            .builder \
            .appName("PythonPi") \
            .getOrCreate()

        partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
        n = 100000 * partitions

        def f(_: int) -> float:
            x = random() * 2 - 1
            y = random() * 2 - 1
            return 1 if x ** 2 + y ** 2 <= 1 else 0

        count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
        print("Pi is roughly %f" % (4.0 * count / n))

        spark.stop()
    ```

- Click Upload.
- After the upload completes, find the file under Upload Resource on the Job Management tab. Click the copy icon to the left of the file to copy its storage path in LindormDFS. You need this path when configuring the node.
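Before you upload, you can optionally sanity-check pi.py on any machine with a local Spark installation. This assumes spark-submit is on your PATH; the trailing argument is the partition count read from sys.argv[1] in the script:

```
spark-submit pi.py 10
```

A successful run prints a line such as "Pi is roughly 3.14...".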
Configure the Lindorm Spark node
On the configuration tab of the node, set the parameters for your language.
Java or Scala
| Parameter | Description |
|---|---|
| Main JAR Resource | The LindormDFS storage path you copied in the previous step. |
| Main Class | The fully qualified main class name in the JAR. For the sample JAR, use org.apache.spark.examples.SparkPi. |
| Parameters | Runtime parameters passed to the program. Use the ${var} format for dynamic parameters. |
| Configuration Items | Spark runtime properties. For available properties, see Job configuration instructions. To set global Spark properties shared across jobs, configure them when you associate the Lindorm computing resource. |
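As a concrete illustration, a node set up to run the sample JAR might be configured as follows. The values are illustrative: the Parameters value is the optional partition count accepted by SparkPi, and the Configuration Items shown are standard Spark properties; the exact set supported by Lindorm is listed in Job configuration instructions.

```
Main JAR Resource:   <LindormDFS path copied in the previous step>
Main Class:          org.apache.spark.examples.SparkPi
Parameters:          100
Configuration Items: spark.executor.instances=2
                     spark.executor.memory=4g
```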
Python
| Parameter | Description |
|---|---|
| Main Package | The LindormDFS storage path of the .py file you copied in the previous step. |
| Parameters | Runtime parameters passed to the script. Use the ${var} format for dynamic parameters. |
| Configuration Items | Spark runtime properties. For available properties, see Job configuration instructions. |
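For the pi.py sample above, a minimal configuration might look like this. The single parameter maps to sys.argv[1] in the script and sets the partition count; if omitted, the script defaults to 2:

```
Main Package: <LindormDFS path of pi.py copied in the previous step>
Parameters:   10
```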
Debug the Lindorm Spark node
- In the right-side navigation pane, click the Run Configuration tab and set the following parameters:

  | Parameter | Description |
  |---|---|
  | Computing Resource | Select the Lindorm computing resource associated with your workspace. |
  | Lindorm Resource Group | Select the Lindorm resource group you specified when associating the computing resource. |
  | Resource Group | Select the resource group that passed the connectivity test during computing resource association. |
  | Script Parameters | If you defined variables in the ${Parameter name} format, enter the Parameter Name and Parameter Value here. These values are substituted at runtime (see the example after this list). For more information, see Sources and expressions of scheduling parameters. |

- Save and run the node.
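For example, if the node's Parameters field is set to ${bizdate}, enter bizdate as the Parameter Name and a test value as the Parameter Value. The name bizdate and the value below are illustrative; see Sources and expressions of scheduling parameters for the supported expressions.

```
Parameters (node configuration): ${bizdate}
Parameter Name:                  bizdate
Parameter Value:                 20250101
```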
What's next
- Node scheduling configuration: Click Properties in the right-side navigation pane and configure scheduling properties under Scheduling Policies to have DataWorks run the node on a schedule.
- Node deployment: Click the deployment icon in the top toolbar to deploy the node to the production environment. Nodes can be scheduled periodically only after they are deployed to production.