This topic describes how to use the serverless Spark engine of Data Lake Analytics (DLA) to access the Lindorm file engine.

Prerequisites

  • A Spark virtual cluster (VC) is created. For more information, see Create a virtual cluster.
  • Object Storage Service (OSS) is activated. For more information, see Activate OSS.
  • The Classless Inter-Domain Routing (CIDR) block of the virtual private cloud (VPC) from which you access the Lindorm instance is added to a whitelist of the Lindorm instance in the Lindorm console.
  • The security group ID and vSwitch ID that are used by the serverless Spark engine of DLA to access the file engine of the Lindorm instance are prepared. For more information, see Access your VPC.
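Before you submit a job, it can help to confirm that the IP addresses your Spark job will use fall inside a CIDR block that is already in the Lindorm whitelist. The following is a minimal sketch that uses only the Python standard library. The function name is_whitelisted and the sample addresses are hypothetical; replace them with the CIDR block of your own vSwitch or VPC.

```python
import ipaddress
from typing import Iterable


def is_whitelisted(ip: str, cidr_blocks: Iterable[str]) -> bool:
    """Return True if the IP address falls inside any of the CIDR blocks."""
    address = ipaddress.ip_address(ip)
    # strict=False tolerates blocks written with host bits set, such as 192.168.1.5/16.
    return any(
        address in ipaddress.ip_network(block, strict=False)
        for block in cidr_blocks
    )


# Hypothetical whitelist entry and addresses, for illustration only.
whitelist = ["192.168.0.0/16"]
print(is_whitelisted("192.168.12.34", whitelist))
print(is_whitelisted("10.0.0.5", whitelist))
```

This check only compares addresses locally. It does not query the Lindorm console, so the whitelist entries must be copied from the console by hand.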

Procedure

  1. Prepare the following test code, which writes data to and reads data from the HDFS of the Lindorm file engine. Save the test code as AccessLindormHDFS.py and upload the file to OSS.
    from pyspark.sql import SparkSession
    import sys
    if __name__ == '__main__':
        spark = SparkSession.builder.getOrCreate()
        welcome_str = "hello, dla-spark"
        # Specify the HDFS directory to store required data.
        hdfsPath = sys.argv[1]
        # Write the welcome string to the specified HDFS directory as a single record.
        spark.sparkContext.parallelize([welcome_str]).saveAsTextFile(hdfsPath)
        # Read data from the specified HDFS directory and display the data.
        print("----------------------------------------------------------")
        for line in spark.sparkContext.textFile(hdfsPath).collect():
            print(line)
        print("-----------------------------------------------------------")
  2. Log on to the Lindorm console, navigate to the file engine of the Lindorm instance, and generate the core-site and hdfs-site configuration items with one click. For more information, see Activate the file engine of the Lindorm instance.
  3. Log on to the DLA console.
  4. In the top navigation bar, select the region where the file engine of the Lindorm instance is deployed.
  5. In the left-side navigation pane, choose Serverless Spark > Submit job.
  6. On the Parameter Configuration page, click Create Job.
  7. In the Create Job dialog box, configure the parameters and click OK to create a Spark job.
  8. In the Job List navigation tree, click the Spark job that you created and enter the following job configuration in the code editor. Replace each placeholder in angle brackets with your own value. Then, save and submit the Spark job.
    {
        "name": "Lindorm",
        "args": [
            "<fs.defaultFS>/tmp/test-lindorm.txt"
        ],
        "conf": {
            "spark.driver.resourceSpec": "medium",
            "spark.executor.resourceSpec": "medium",
            "spark.executor.instances": 1,
            "spark.kubernetes.pyspark.pythonVersion": "3",
            "spark.dla.job.log.oss.uri": "oss://<OSS directory where your Spark UI logs are saved>",
            "spark.dla.eni.enable": "true",
            "spark.dla.eni.security.group.id": "<ID of your security group>",
            "spark.dla.eni.vswitch.id": "<ID of your vSwitch>",
            "spark.hadoop.dfs.nameservices": "<dfs.nameservices>",
            "spark.hadoop.dfs.client.failover.proxy.provider.<dfs.nameservices>": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
            "spark.hadoop.dfs.ha.namenodes.<dfs.nameservices>": "nn1,nn2",
            "spark.hadoop.dfs.namenode.rpc-address.<dfs.nameservices>.nn1": "<dfs.namenode.rpc-address.<dfs.nameservices>.nn1>",
            "spark.hadoop.dfs.namenode.rpc-address.<dfs.nameservices>.nn2": "<dfs.namenode.rpc-address.<dfs.nameservices>.nn2>"
        },
        "file": "oss://path/to/AccessLindormHDFS.py"
    }
    Parameters

    | Parameter | Example | Description |
    | --- | --- | --- |
    | args | <fs.defaultFS>/tmp/test-lindorm.txt. <fs.defaultFS> is derived from the value of fs.defaultFS in the core-site configuration item that is generated in Step 2. For more information, see Activate the file engine of the Lindorm instance. | The HDFS path that is passed to AccessLindormHDFS.py as sys.argv[1]. |
    | spark.driver.resourceSpec | medium | The specifications of the Spark driver. Valid values: small (1 CPU core and 4 GB of memory), medium (2 CPU cores and 8 GB of memory), large (4 CPU cores and 16 GB of memory), xlarge (8 CPU cores and 32 GB of memory). |
    | spark.executor.resourceSpec | medium | The specifications of each Spark executor. Valid values: small (1 CPU core and 4 GB of memory), medium (2 CPU cores and 8 GB of memory), large (4 CPU cores and 16 GB of memory), xlarge (8 CPU cores and 32 GB of memory). |
    | spark.executor.instances | 1 | The number of executors. |
    | spark.kubernetes.pyspark.pythonVersion | 3 | The major version of Python that runs the job. Valid values: 2 (Python 2), 3 (Python 3). |
    | spark.dla.job.log.oss.uri | oss://<OSS directory where your Spark UI logs are saved> | The OSS directory where Spark UI logs are saved. |
    | spark.dla.eni.enable | true | Specifies whether to grant the permissions to access the VPC. To access data in the VPC, you must set this parameter to true. |
    | spark.dla.eni.security.group.id | <ID of your security group> | The ID of the security group that is used to access the VPC. |
    | spark.dla.eni.vswitch.id | <ID of your vSwitch> | The ID of the vSwitch that is used to access the VPC. |
    | spark.hadoop.dfs.nameservices | The value of dfs.nameservices in the hdfs-site configuration item that is generated in Step 2. | The parameter required to connect to Hadoop. |
    | spark.hadoop.dfs.client.failover.proxy.provider.<dfs.nameservices> | The value of dfs.client.failover.proxy.provider.<dfs.nameservices> in the hdfs-site configuration item that is generated in Step 2. | The parameter required to connect to Hadoop. |
    | spark.hadoop.dfs.ha.namenodes.<dfs.nameservices> | The value of dfs.ha.namenodes.<dfs.nameservices> in the hdfs-site configuration item that is generated in Step 2. | The parameter required to connect to Hadoop. |
    | spark.hadoop.dfs.namenode.rpc-address.<dfs.nameservices>.nn1 | The value of dfs.namenode.rpc-address.<dfs.nameservices>.nn1 in the hdfs-site configuration item that is generated in Step 2. | The parameter required to connect to Hadoop. |
    | spark.hadoop.dfs.namenode.rpc-address.<dfs.nameservices>.nn2 | The value of dfs.namenode.rpc-address.<dfs.nameservices>.nn2 in the hdfs-site configuration item that is generated in Step 2. | The parameter required to connect to Hadoop. |
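Each spark.hadoop.* parameter in the job configuration is an hdfs-site property with the spark.hadoop. prefix added, which is how Spark passes a property through to the Hadoop client. The following is a minimal sketch of that mapping. The function name build_hdfs_conf, the nameservice example-ns, and the host names are hypothetical placeholders; use the values generated in Step 2 instead.

```python
import json


def build_hdfs_conf(hdfs_site):
    """Prefix each hdfs-site property with 'spark.hadoop.' for use in the job conf."""
    return {"spark.hadoop." + key: value for key, value in hdfs_site.items()}


# Hypothetical hdfs-site values, for illustration only.
hdfs_site = {
    "dfs.nameservices": "example-ns",
    "dfs.client.failover.proxy.provider.example-ns":
        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
    "dfs.ha.namenodes.example-ns": "nn1,nn2",
    "dfs.namenode.rpc-address.example-ns.nn1": "master1.example.com:8020",
    "dfs.namenode.rpc-address.example-ns.nn2": "master2.example.com:8020",
}

conf = build_hdfs_conf(hdfs_site)
# Print the entries in a form that can be pasted into the "conf" object of the job.
print(json.dumps(conf, indent=4))
```

Generating the entries from one dictionary keeps the nameservice name consistent across all five keys, which is easy to get wrong when the placeholders are replaced by hand.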