AnalyticDB for MySQL supports the Spark computing engine. This topic describes how to purchase and manage Spark compute nodes.

Description

Spark is a fast and widely used computing engine designed for large-scale data processing. You can use Spark together with AnalyticDB for MySQL to analyze data in AnalyticDB for MySQL databases.
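
The following sample code is a minimal sketch of such an analysis: it reads an AnalyticDB for MySQL table into a Spark DataFrame over the generic JDBC data source, assuming that a MySQL JDBC driver is available on the Spark classpath. The endpoint, database, table, account, and password are placeholders, not values from this topic.

    # Minimal PySpark sketch: read an AnalyticDB for MySQL table over JDBC.
    # All connection values below are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("adb_mysql_read_sketch").getOrCreate()

    df = (
        spark.read.format("jdbc")
        # Placeholder endpoint and database of the AnalyticDB for MySQL cluster.
        .option("url", "jdbc:mysql://am-****.ads.aliyuncs.com:3306/your_database")
        .option("dbtable", "your_table")        # placeholder table name
        .option("user", "your_user")            # placeholder database account
        .option("password", "your_password")    # placeholder password
        .load()
    )

    # Example analysis: count rows per value of a placeholder column.
    df.groupBy("your_column").count().show()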

Create Spark compute nodes

You can purchase Spark compute nodes after you create AnalyticDB for MySQL clusters.

  1. Log on to the AnalyticDB for MySQL console with an Alibaba Cloud account.
  2. Click the name of the cluster for which you want to create Spark compute nodes.
  3. In the left-side navigation pane, choose Spark > Clusters.
  4. In the Note message, click Create to go to the buy page of Spark clusters.
  5. Set the number of compute nodes and click Buy Now to purchase Spark compute nodes.
    Note
    • A maximum of 200 compute nodes can be purchased. You are charged CNY 5.6 per hour for each compute node.
    • When a Spark cluster is being created, features related to Spark and Airflow are disabled.
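    • For example, at CNY 5.6 per hour for each compute node, a Spark cluster that contains 10 compute nodes and runs for 24 hours costs 10 × 5.6 × 24 = CNY 1,344.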

Manage clusters

  1. In the AnalyticDB for MySQL console, click the name of the cluster for which a Spark cluster is created.
  2. In the left-side navigation pane, choose Spark > Clusters. On the page that appears, the basic information and status of the Spark cluster are displayed.
  3. Click Set UI Access Password to set the UI access username and password. The username and password take effect 3 minutes after they are created.

Manage resources

  1. On the Cluster Information page, choose Spark > Resources.
  2. On the resource management page, upload the JAR packages that you develop for a Spark job to an Object Storage Service (OSS) directory and run the job by using the packages.
    • Upload File: Upload a JAR package that you develop for a Spark job to an OSS directory.
    • Upload Directory: Batch upload JAR packages in the directory.
    • Create Folder: Create a folder to manage JAR packages that you upload.
    • Delete Resource: Delete JAR packages that are no longer needed.
    Note Uploaded files can be previewed, deleted, or downloaded. You can click Copy Path in the Actions column corresponding to a file to copy its OSS path. The path can be used to set parameters in Airflow scripts.
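
The following Airflow sketch shows one way in which a copied OSS path might be used in an Airflow script. It is only an assumption for illustration: it passes the path to a spark-submit command by using a BashOperator, and the DAG ID, schedule, entry class, and submission command are placeholders that you must replace with the values used in your environment.

    # Airflow sketch (placeholders only): pass an OSS path that is copied from
    # the resource management page to a Spark submission command.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # OSS path copied by using Copy Path on the resource management page.
    JAR_PATH = "oss://your-bucket/spark/your-job.jar"  # placeholder

    with DAG(
        dag_id="spark_job_example",        # placeholder DAG ID
        start_date=datetime(2022, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        submit_spark_job = BashOperator(
            task_id="submit_spark_job",
            # Placeholder command: replace it with the submission method that
            # your Spark cluster uses.
            bash_command=f"spark-submit --class com.example.Main {JAR_PATH}",
        )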

Test jobs

  1. On the Cluster Information page, choose Spark > Job Test.
  2. Configure the following parameters. A sketch that shows how these parameters fit together is provided after these steps.
    • className (String, required): The entry class of the Java or Scala program, such as com.aliyun.spark.oss.SparkReadOss. This parameter is not required for Python.
    • conf (Map, required): The configuration parameters of the Spark job. The parameters are the same as those configured in Apache Spark and are in the key: value format. The spark.adb.userName parameter is required. If you do not specify the conf parameter, the default settings that you configure when you create a virtual cluster are used.
    • file (String, required): The directory where the main files of the Spark job are stored. The main files can be the JAR packages that contain the entry classes or the entry execution files of Python.
    • jars (List<String>, optional): The JAR packages on which the Spark job depends. When the Spark job runs, these JAR packages are added to the classpaths of the Java virtual machines (JVMs) of the driver and executors.
    • proxyUser (String, optional): The proxy user that runs the Spark job.
    • args (List<String>, optional): The arguments that you want to pass to the Spark job.
    • pyFiles (List, optional): The Python files on which PySpark depends. These files must be in the ZIP, PY, or EGG format. If PySpark depends on multiple Python files, we recommend that you use files in the ZIP or EGG format. These files can be referenced in the Python code as modules.
    • files (List, optional): The files on which the Spark job depends. These files are downloaded to the working directory of the executed process. You can specify an alias for a file, such as oss://bucket/xx/yy.txt#yy. In this case, you need only to enter ./yy in the code to access the file. If you do not specify an alias, you must use ./yy.txt to access the file.
    • driverMemory (String, optional): The amount of memory used by each driver process.
    • driverCores (Integer, optional): The number of CPU cores used by each driver process.
    • executorMemory (String, optional): The amount of memory used by each executor process.
    • executorCores (Integer, optional): The number of CPU cores used by each executor process.
    • numExecutors (Integer, optional): The number of executors used in each session. We recommend that you specify this parameter.
    • archives (List, optional): The packages on which the Spark job depends. The packages must be in the ZIP, TGZ, TAR, or TAR.GZ format. The packages are decompressed to the working directory of the current Spark process. You can specify an alias for a package, such as oss://bucket/xx/yy.zip#yy. In this case, you need only to enter ./yy/zz.txt in the code to access the zz.txt file. If you do not specify an alias, you must use ./yy.zip/zz.txt to access the file. In this example, zz.txt is a file in the yy.zip package.
    • queue (String, optional): The name of the Yarn queue to which the job is submitted.
    • name (String, optional): The name of the Spark job. We recommend that you specify this parameter.
  3. Click Test Running to check whether the JAR packages uploaded in the preceding step function normally.
    Note In the left-side navigation pane, choose Spark > Clusters. In the UI Access section, click the value of the Yarn Cluster UI parameter. On the page that appears, view the test result of the JAR packages.
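
The following Python dictionary is only a sketch of how the parameters that are described in the preceding steps fit together. The exact format in which the console accepts these parameters is not shown in this topic, and all values in the sketch are placeholders.

    # Sketch of a Spark job configuration that uses the parameters described above.
    # All values are placeholders.
    job_config = {
        "className": "com.aliyun.spark.oss.SparkReadOss",  # entry class for Java or Scala
        "conf": {
            "spark.adb.userName": "your_user",             # required conf parameter
            "spark.executor.instances": "2",               # standard Apache Spark setting
        },
        "file": "oss://your-bucket/spark/your-job.jar",    # main JAR package or Python entry file
        "jars": ["oss://your-bucket/spark/your-dependency.jar"],
        "files": ["oss://your-bucket/xx/yy.txt#yy"],       # the #yy alias lets the code access ./yy
        "archives": ["oss://your-bucket/xx/yy.zip#yy"],    # decompressed; the code accesses ./yy/zz.txt
        "args": ["--input", "oss://your-bucket/data/"],    # arguments passed to the job
        "driverMemory": "4G",
        "driverCores": 2,
        "executorMemory": "8G",
        "executorCores": 4,
        "numExecutors": 2,
        "name": "spark-oss-read-test",                     # we recommend that you name the job
    }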